# Dataset Source

The dataset used in this Capstone project is [Cybersecurity Threat Detection Logs](https://www.kaggle.com/datasets/aryan208/cybersecurity-threat-detection-logs?select=cybersecurity_threat_detection_logs.csv), obtained from Kaggle.

## Import Library

In [2]:
!pip install langchain_community
!pip install replicate


In [3]:
from langchain_community.llms import Replicate
from google.colab import userdata
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## LLM Setup

In [4]:
# Read the API token from the Colab secret store
api_token = userdata.get("api-tokens")

# Export the token as an environment variable so the Replicate client can find it
os.environ["REPLICATE_API_TOKEN"] = api_token
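The cell above only runs inside Colab, because `google.colab.userdata` is not importable elsewhere. A minimal sketch of a more portable variant, assuming the same secret name `api-tokens` and a fallback to an already exported `REPLICATE_API_TOKEN`; the helper name `load_replicate_token` is hypothetical:

```python
import os


def load_replicate_token(secret_name="api-tokens", env_var="REPLICATE_API_TOKEN"):
    """Return the Replicate token from Colab secrets, else from the environment."""
    token = None
    try:
        # This import only succeeds inside Google Colab.
        from google.colab import userdata
        token = userdata.get(secret_name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook.
        token = None

    # Fall back to a token that was exported before the notebook started.
    token = token or os.environ.get(env_var)
    if not token:
        raise RuntimeError(
            f"No token found in Colab secret '{secret_name}' or in ${env_var}"
        )

    # Export it so that libraries reading the environment can pick it up.
    os.environ[env_var] = token
    return token
```

Calling `load_replicate_token()` once before constructing the LLM would replace the two assignments above.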

In [5]:
# Define the default LLM
llm = Replicate(
    model="ibm-granite/granite-3.3-8b-instruct"
)

## Data Loading

In [6]:
df = pd.read_csv("/content/cybersecurity_threat_detection_logs.csv")
df.head()

| | timestamp | source_ip | dest_ip | protocol | action | threat_label | log_type | bytes_transferred | user_agent | request_path |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024-05-01T00:00:00 | 192.168.1.125 | 192.168.1.124 | TCP | blocked | benign | firewall | 10889 | Nmap Scripting Engine | / |
| 1 | 2024-07-18T00:00:00 | 192.168.1.201 | 192.168.1.201 | ICMP | blocked | benign | application | 36522 | Nmap Scripting Engine | / |
| 2 | 2024-04-07T00:00:00 | 192.168.1.248 | 192.168.1.15 | HTTP | allowed | benign | application | 20652 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | /login |
| 3 | 2024-10-26T00:00:00 | 192.168.1.236 | 192.168.1.219 | HTTP | allowed | benign | application | 5350 | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7... | /login |
| 4 | 2024-10-31T00:00:00 | 192.168.1.221 | 192.168.1.61 | ICMP | allowed | benign | application | 40691 | Mozilla/5.0 (Windows NT 10.0; Win64; x64) Appl... | / |


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6000000 entries, 0 to 5999999
Data columns (total 10 columns):
 #   Column             Dtype 
---  ------             ----- 
 0   timestamp          object
 1   source_ip          object
 2   dest_ip            object
 3   protocol           object
 4   action             object
 5   threat_label       object
 6   log_type           object
 7   bytes_transferred  int64 
 8   user_agent         object
 9   request_path       object
dtypes: int64(1), object(9)
memory usage: 457.8+ MB


In [8]:
sample_df = df.sample(n=3000, random_state=42)

print("\nSample dataset created. Sample size:", len(sample_df), "rows")
print(sample_df.head())


Sample dataset created. Sample size: 3000 rows
                   timestamp      source_ip        dest_ip protocol   action  \
1324896  2024-01-08T00:00:00  192.168.1.110  192.168.1.189      UDP  allowed   
3566176  2024-07-02T00:00:00  192.168.1.136   192.168.1.31     HTTP  allowed   
1109043  2024-10-30T00:00:00  225.184.81.55  192.168.1.126      TCP  allowed   
4286042  2024-09-05T00:00:00   192.168.1.54  192.168.1.192     ICMP  blocked   
5395174  2024-11-09T00:00:00   192.168.1.90  192.168.1.171     HTTP  blocked   

        threat_label     log_type  bytes_transferred  \
1324896       benign  application              18452   
3566176       benign          ids              48583   
1109043   suspicious  application               5648   
4286042       benign          ids               8024   
5395174       benign     firewall              20369   

                                                user_agent   request_path  
1324896  Mozilla/5.0 (Windows NT 10.0; Win64; x
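`pd.read_csv` above loads all 6,000,000 rows (457.8+ MB according to `df.info()`) before 3,000 of them are sampled. If memory is tight, one alternative is to sample while streaming the file. The sketch below uses reservoir sampling (Algorithm R) with only the standard library. This is a swapped-in technique, not what the notebook does, and it will not reproduce the rows chosen by `df.sample(n=3000, random_state=42)`. The function name and the `seed` parameter are illustrative.

```python
import csv
import random


def reservoir_sample_csv(lines, k, seed=42):
    """Return up to k rows sampled uniformly from an iterable of CSV lines."""
    rng = random.Random(seed)
    reader = csv.DictReader(lines)
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            reservoir.append(row)
        else:
            # Keep row i with probability k / (i + 1) by replacing a random slot.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir
```

Usage would look like `with open("/content/cybersecurity_threat_detection_logs.csv", newline="") as f: rows = reservoir_sample_csv(f, k=3000)`, followed by `pd.DataFrame(rows)`. All values arrive as strings, so `bytes_transferred` would still need converting to an integer.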

## Data Preprocessing

In [9]:
# ==============================================================================
# DATA PREPROCESSING & TEXT INPUT PREPARATION FOR THE LLM
# ==============================================================================

columns_for_llm_input = [
    'source_ip',
    'dest_ip',
    'protocol',
    'action',
    'log_type',
    'user_agent',
    'request_path'
]

for col in columns_for_llm_input:
    if col not in sample_df.columns:
        # A missing column would raise a KeyError below, so warn and add it as empty text
        print(f"Warning: column '{col}' not found in the dataset; using empty strings")
        sample_df[col] = ''
    else:
        sample_df[col] = sample_df[col].fillna('')

sample_df['combined_log_text'] = sample_df.apply(
    lambda row: " | ".join([f"{col.replace('_', ' ').title()}: {row[col]}" for col in columns_for_llm_input]),
    axis=1
)

print("Sample 'combined_log_text' and 'threat_label' after preprocessing:")
print(sample_df[['combined_log_text', 'threat_label']].head())

Sample 'combined_log_text' and 'threat_label' after preprocessing:
                                         combined_log_text threat_label
1324896  Source Ip: 192.168.1.110 | Dest Ip: 192.168.1....       benign
3566176  Source Ip: 192.168.1.136 | Dest Ip: 192.168.1....       benign
1109043  Source Ip: 225.184.81.55 | Dest Ip: 192.168.1....   suspicious
4286042  Source Ip: 192.168.1.54 | Dest Ip: 192.168.1.1...       benign
5395174  Source Ip: 192.168.1.90 | Dest Ip: 192.168.1.1...       benign
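To make the `combined_log_text` format concrete, here is a dependency-free sketch of the same formatting applied to a single row, using the values of row 0 from the `df.head()` output. The helper name `format_log_row` is hypothetical; the notebook does this inline with a lambda.

```python
COLUMNS_FOR_LLM_INPUT = [
    "source_ip",
    "dest_ip",
    "protocol",
    "action",
    "log_type",
    "user_agent",
    "request_path",
]


def format_log_row(row, columns=COLUMNS_FOR_LLM_INPUT):
    """Join 'Column Name: value' pairs with ' | ', mirroring the lambda above."""
    return " | ".join(
        f"{col.replace('_', ' ').title()}: {row[col]}" for col in columns
    )


# Row 0 as shown in the df.head() output.
row0 = {
    "source_ip": "192.168.1.125",
    "dest_ip": "192.168.1.124",
    "protocol": "TCP",
    "action": "blocked",
    "log_type": "firewall",
    "user_agent": "Nmap Scripting Engine",
    "request_path": "/",
}

print(format_log_row(row0))
```

Note that `str.title()` turns `source_ip` into `Source Ip` rather than `Source IP`, which matches the output printed above.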


In [10]:
# Count the distribution of threat labels
threat_summary = sample_df['threat_label'].value_counts()

# Most frequently requested endpoints (request_path), across all labels
top_request_paths = sample_df['request_path'].value_counts().head(10)

# Most frequent source IPs
top_source_ips = sample_df['source_ip'].value_counts().head(10)

# Most common user agents (all labels, not only suspicious ones)
top_user_agents = sample_df['user_agent'].value_counts().head(10)

# Threat label distribution by action
action_vs_threat = pd.crosstab(sample_df['action'], sample_df['threat_label'])

# Convert timestamp to datetime (for time-based analysis)
sample_df['timestamp'] = pd.to_datetime(sample_df['timestamp'], errors='coerce')
sample_df['hour'] = sample_df['timestamp'].dt.hour

# Non-benign events per hour
hourly_threats = sample_df[sample_df['threat_label'] != 'benign']['hour'].value_counts().sort_index()

# Print the summaries
print("Threat Summary:\n", threat_summary)
print("\nTop Targeted Request Paths:\n", top_request_paths)
print("\nTop Source IPs:\n", top_source_ips)
print("\nTop User Agents:\n", top_user_agents)
print("\nAction vs Threat:\n", action_vs_threat)
print("\nThreats by Hour:\n", hourly_threats)


Threat Summary:
 threat_label
benign        2767
suspicious     165
malicious       68
Name: count, dtype: int64

Top Targeted Request Paths:
 request_path
/                1371
/login            219
/admin/config     125
/secure           113
/auth             111
/wp-login.php     103
/api/login         99
/index.php         95
/admin             91
/api/v1/data       90
Name: count, dtype: int64

Top Source IPs:
 source_ip
192.168.1.242      16
98.153.120.136     16
245.200.237.136    15
109.9.8.24         15
181.18.12.170      15
216.197.199.15     15
42.119.98.70       14
221.28.64.50       14
192.168.1.130      14
93.225.253.213     14
Name: count, dtype: int64

Top User Agents:
 user_agent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36    620
SQLMap/1.6-dev                                                                                                           610
Mozilla/5.0 (Windows NT 10.0; Win64; x64) Ap

## Data Analysis (LLM Summarization)

In [11]:
print("Dominant Threat Type")
print(threat_summary)

prompt = f"""
This is the threat type distribution from 3,000 security logs:

{threat_summary}

Based on this, what is the most dominant threat? Why is it dangerous to the organization?

Give your answer in 3–4 bullet points.
"""

output = llm.invoke(prompt)
print("Output:")
print(output)


Dominant Threat Type
threat_label
benign        2767
suspicious     165
malicious       68
Name: count, dtype: int64
Output:
- **Most Dominant Threat:** The most dominant threat in the provided distribution is "benign," accounting for 2767 out of 3000 security logs. This indicates that the majority (approximately 92%) of the monitored activities are considered non-threatening or normal.

- **Low Risk from Benign Category:** While it might seem concerning that a significant portion (92%) of logs are labeled as benign, this is actually reassuring. It suggests that the organization's security measures are effectively filtering out most potential threats, leaving only a small fraction (around 8%) that require further investigation.

- **Danger from Suspicious and Malicious Threats:** Although fewer in number, the "suspicious" (165, ~5.5%) and "malicious" (68, ~2.3%) threats still pose a risk. These could represent attempts at unauthorized access, data breaches, or other malicious activitie
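The percentages quoted in the answer were computed by the model from raw counts. One way to reduce the room for arithmetic slips is to put the shares into the prompt text directly. A minimal sketch, assuming the counts printed above; `format_distribution` is a hypothetical helper, and in the notebook the same shares could come from `value_counts(normalize=True)`.

```python
def format_distribution(counts):
    """Render each label as 'label: count (share%)', one label per line."""
    total = sum(counts.values())
    return "\n".join(
        f"{label}: {n} ({n / total:.1%})" for label, n in counts.items()
    )


# Counts copied from the threat_summary output above.
threat_counts = {"benign": 2767, "suspicious": 165, "malicious": 68}

print(format_distribution(threat_counts))
```

The resulting text can be interpolated into the prompt in place of `{threat_summary}`.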

In [12]:
print("Most Targeted Request Paths")
print(top_request_paths)

prompt = f"""
Below is a summary of the most frequently accessed request paths from 3,000 security logs:

{top_request_paths}

Please analyze and answer:
- Which endpoints appear to be targeted most?
- Are any of these endpoints sensitive or commonly exploited?
- What risks could these patterns suggest for the organization?
- Recommend 1–2 high-level actions to protect these assets.

Give your answer in clear bullet points.
"""

output = llm.invoke(prompt)
print("Output:")
print(output)


Most Targeted Request Paths
request_path
/                1371
/login            219
/admin/config     125
/secure           113
/auth             111
/wp-login.php     103
/api/login         99
/index.php         95
/admin             91
/api/v1/data       90
Name: count, dtype: int64
Output:
- **Most Targeted Endpoints:**
  - `/`: Accessed 1371 times
  - `/login`: Accessed 219 times
  - `/wp-login.php`: Accessed 103 times
  - Other endpoints like `/admin/config`, `/secure`, `/auth`, `/api/login`, `/index.php`, `/admin`, and `/api/v1/data` also show considerable access.

- **Sensitive or Commonly Exploited Endpoints:**
  - `/login`: This endpoint is generally sensitive as it deals with user authentication. It's a common target for brute force attacks and credential stuffing.
  - `/wp-login.php`: Specific to WordPress installations, this is a known vulnerability point if not properly secured, often targeted by automated bots for exploiting WordPress-specific weaknesses.
  - `/api/login
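The analysis cells in this section share one shape: a one-line description, a printed summary table, a list of questions, and a formatting instruction. A sketch of a small helper that assembles such prompts so the wording stays consistent across cells; `build_prompt` and its parameters are hypothetical and are not part of the notebook or of LangChain.

```python
def build_prompt(description, table_text, questions, style):
    """Assemble a prompt from a description, a table, questions, and a style hint."""
    question_lines = "\n".join(f"- {q}" for q in questions)
    return (
        f"{description}\n\n"
        f"{table_text}\n\n"
        "Please analyze and answer:\n"
        f"{question_lines}\n\n"
        f"{style}\n"
    )


prompt = build_prompt(
    description=(
        "Below is a summary of the most frequently accessed request paths "
        "from 3,000 security logs:"
    ),
    # Two rows from the output above, standing in for str(top_request_paths).
    table_text="/login            219\n/admin/config     125",
    questions=[
        "Which endpoints appear to be targeted most?",
        "Are any of these endpoints sensitive or commonly exploited?",
    ],
    style="Give your answer in clear bullet points.",
)
print(prompt)
```

In the notebook, `table_text` would be `str(top_request_paths)` and the result would be passed to `llm.invoke(prompt)` as before.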

In [13]:
print("Source IPs & User Agents")
print(top_source_ips)
print()
print(top_user_agents)
print()

prompt = f"""
Here are the top source IP addresses and user agents from 3,000 security log entries:

Top Source IPs:
{top_source_ips}

Top User Agents:
{top_user_agents}

Please analyze and answer in bullet points:
- Are any of the IPs or user agents suspicious or indicate automated scanning tools?
- Could this be part of a botnet or external probing?
- What security risks do these patterns suggest?
- Recommend 1–2 proactive actions to reduce exposure.

Keep the response concise, focused, and business-oriented.
"""

output = llm.invoke(prompt)
print("Output:")
print(output)


Source IPs & User Agents
source_ip
192.168.1.242      16
98.153.120.136     16
245.200.237.136    15
109.9.8.24         15
181.18.12.170      15
216.197.199.15     15
42.119.98.70       14
221.28.64.50       14
192.168.1.130      14
93.225.253.213     14
Name: count, dtype: int64

user_agent
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36    620
SQLMap/1.6-dev                                                                                                           610
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36          596
curl/7.64.1                                                                                                              596
Nmap Scripting Engine                                                                                                    578
Name: count, dtype: int64

Output:
- **Suspicious IPs and User Agents:**
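Two of the questions in this prompt can be pre-screened deterministically before asking the model. The sketch below uses the standard-library `ipaddress` module to label a source IP as multicast, reserved, private, or public, and flags user agents that name a known tool. The marker list (`sqlmap`, `nmap`, `curl`) is an assumption drawn from the user agents printed above, and both function names are hypothetical.

```python
import ipaddress

# Substrings that identify scanning or command-line tools (assumed list).
TOOL_MARKERS = ("sqlmap", "nmap", "curl")


def classify_ip(address):
    """Return a coarse label for an IPv4/IPv6 address string."""
    ip = ipaddress.ip_address(address)
    if ip.is_multicast:
        return "multicast"
    if ip.is_reserved:
        return "reserved"
    if ip.is_private:
        return "private"
    # Everything else is treated as publicly routable in this sketch.
    return "public"


def is_tool_user_agent(user_agent):
    """True if the user agent string contains one of the tool markers."""
    ua = user_agent.lower()
    return any(marker in ua for marker in TOOL_MARKERS)


for address in ["192.168.1.242", "98.153.120.136", "245.200.237.136"]:
    print(address, classify_ip(address))

for ua in ["SQLMap/1.6-dev", "curl/7.64.1", "Nmap Scripting Engine"]:
    print(ua, is_tool_user_agent(ua))
```

Of the addresses listed above, `192.168.1.242` is labeled private and `98.153.120.136` public, while `245.200.237.136` falls in the reserved block 240.0.0.0/4 and so would not normally appear as real Internet traffic. Three of the five printed user agents match a tool marker.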


In [14]:
print("Time-Based Threat Distribution")
print(hourly_threats)
print()

prompt = f"""
Here is the number of non-benign (suspicious or malicious) security events per hour, based on 3,000 log entries:

{hourly_threats}

Please answer in bullet points:
- Are there certain hours where threats are more concentrated?
- What could be the reason behind these time-based patterns?
- What operational adjustments should the company consider (e.g., monitoring schedule, alerts, response team availability)?

Respond with 3–5 clear, business-oriented bullet points.
"""

output = llm.invoke(prompt)
print("Output:")
print(output)


Time-Based Threat Distribution
hour
0    233
Name: count, dtype: int64

Output:
- **Concentrated Threat Hours:** The data shows a high number of non-benign security events, with 233 suspicious or malicious events per hour during specific periods. Notably, this figure is significantly higher than the average hourly count across other hours, indicating concentrated threat activity.

- **Time-Based Patterns:** These patterns could be attributed to several factors. One possibility is that attackers may target systems during off-peak hours when fewer personnel are monitoring, exploiting potential gaps in coverage. Another reason might be automated attacks that follow predictable schedules or are triggered by external events (e.g., coordinated global attacks).

- **Operational Adjustments:**
  - **Enhanced Monitoring Schedule:** The company should consider adjusting its monitoring schedule to increase vigilance during the identified high-risk hours, ensuring adequate staffing and readiness t
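The `hour` series printed above has a single bucket: hour 0 holds all 233 non-benign events (165 suspicious + 68 malicious). Every timestamp visible in this notebook's output ends in `T00:00:00`, so the data appears to carry dates without a time of day, and an hourly breakdown cannot show any concentration. A minimal sketch of checking this before bucketing, using the ten timestamps shown in the `df.head()` and `sample_df.head()` outputs and counting by month instead; in the notebook the equivalent column would be `sample_df['timestamp'].dt.month`.

```python
from collections import Counter
from datetime import datetime

# Timestamps copied from the df.head() and sample_df.head() outputs.
timestamps = [
    "2024-05-01T00:00:00",
    "2024-07-18T00:00:00",
    "2024-04-07T00:00:00",
    "2024-10-26T00:00:00",
    "2024-10-31T00:00:00",
    "2024-01-08T00:00:00",
    "2024-07-02T00:00:00",
    "2024-10-30T00:00:00",
    "2024-09-05T00:00:00",
    "2024-11-09T00:00:00",
]

parsed = [datetime.fromisoformat(ts) for ts in timestamps]

# If only one hour value exists, an hourly histogram carries no information.
distinct_hours = {dt.hour for dt in parsed}
print("Distinct hours:", distinct_hours)

# A coarser bucket that the data can actually support.
month_counts = Counter(dt.month for dt in parsed)
print("Rows per month:", sorted(month_counts.items()))
```

These ten rows are only an illustration of the check; the monthly counts that matter would come from the non-benign rows of `sample_df`.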

In [15]:
print("Action vs Threat Evaluation")
print(action_vs_threat)
print()

prompt = f"""
This table shows how the security system responded to different types of threats:

{action_vs_threat}

Please answer:
- Does this indicate that the system is effectively blocking malicious activities?
- Are there any signs of missed threats or false positives?
- What does this mean for system efficiency and accuracy?
- Recommend 1–2 improvements to the company's threat detection policy.

Respond in 3–5 bullet points using clear business language.
"""

output = llm.invoke(prompt)
print("Output:")
print(output)


Action vs Threat Evaluation
threat_label  benign  malicious  suspicious
action                                     
allowed         1395         30          79
blocked         1372         38          86

Output:
- The security system effectively blocked 96.5% (1372/1421) of malicious activities, indicating a robust response against potential threats.

- There are signs of missed threats, as 30 benign activities were incorrectly flagged as malicious (false positives), representing 2.1% of total malicious classifications. Additionally, 86 suspicious activities were allowed, suggesting a possible oversight in detecting and addressing these uncertain threats.

- System efficiency appears high with successful blockages of the majority of malicious attempts, but accuracy needs improvement to minimize false positives and properly identify suspicious activities.

- Recommendations:
  a. Implement machine learning algorithms to refine threat classification, reducing false positives and improvi
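As a cross-check on the percentages in the answer above, the share of blocked events per label can be computed directly from the crosstab. A minimal sketch with the counts copied from the printed table; `blocked_share` is a hypothetical helper, and in the notebook a similar result could be obtained by passing `normalize='columns'` to `pd.crosstab`.

```python
# Counts copied from the action_vs_threat output above.
crosstab = {
    "allowed": {"benign": 1395, "malicious": 30, "suspicious": 79},
    "blocked": {"benign": 1372, "malicious": 38, "suspicious": 86},
}


def blocked_share(table):
    """Return, for each threat label, the fraction of its events that were blocked."""
    shares = {}
    for label in table["blocked"]:
        total = table["allowed"][label] + table["blocked"][label]
        shares[label] = table["blocked"][label] / total
    return shares


for label, share in blocked_share(crosstab).items():
    print(f"{label}: {share:.1%} blocked")
```

On these counts, 38 of 68 malicious events (55.9%) and 86 of 165 suspicious events (52.1%) were blocked, close to the 49.6% block rate for benign traffic (1372 of 2767), so the table does not show blocking that tracks the threat label.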