<a href="https://colab.research.google.com/github/gopithecheetah/MyServices/blob/main/Detecting_Malicious_URLs_with_RandomForest.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🔐 Hands-on Lab: Detecting Malicious URLs with Machine Learning
Welcome to this practical lab on **Machine Learning for Threat Detection**.  

In this exercise, we’ll train a **RandomForest model** to classify URLs as either *malicious* or *benign*.  

**Learning Goals:**
- Extract useful features from URLs  
- Train a RandomForest classifier  
- Evaluate performance with metrics and confusion matrix  
---


In [None]:
# 📌 Step 1: Install required libraries (if not already installed)
!pip install pandas scikit-learn matplotlib seaborn


In [None]:
# 📌 Step 2: Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt


## 📂 Step 3: Load Dataset
We will use an open dataset of **malicious and benign URLs**.  

For this lab, you can download the dataset from Kaggle:  
👉 [Malicious URLs Dataset](https://www.kaggle.com/datasets/sid321axn/malicious-urls-dataset)  

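Upload the file `malicious_urls.csv` to Colab. The cells below assume it has a `url` column and a `label` column in which malicious rows carry the value `malicious`.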


In [None]:
# 📌 Step 3: Load dataset
from google.colab import files
uploaded = files.upload()

df = pd.read_csv("malicious_urls.csv")
df.head()
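
Before engineering features, it can help to confirm that the columns the later cells rely on (`url` and `label`) are present and to look at the class balance. The following is a minimal sketch, not part of the original lab: `check_dataset` is a hypothetical helper, and the tiny in-memory frame is made-up data standing in for the uploaded CSV.

```python
import pandas as pd

def check_dataset(frame, required=("url", "label")):
    """Hypothetical helper: verify required columns and report class balance."""
    missing = [col for col in required if col not in frame.columns]
    if missing:
        raise KeyError(f"Missing expected column(s): {missing}")
    # Drop rows without a URL or label, then drop repeated URLs.
    cleaned = frame.dropna(subset=list(required)).drop_duplicates(subset="url")
    # Share of each label value, useful for spotting class imbalance.
    balance = cleaned["label"].value_counts(normalize=True)
    return cleaned, balance

# Toy stand-in for the uploaded CSV (values are made up for illustration).
sample = pd.DataFrame({
    "url": ["http://a.example/login", "http://a.example/login",
            "https://b.example/", None],
    "label": ["malicious", "malicious", "benign", "benign"],
})
cleaned, balance = check_dataset(sample)
print(cleaned.shape)
print(balance)
```

In the lab itself, the same check would be called as `check_dataset(df)` right after `pd.read_csv`.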


## ⚙️ Step 4: Feature Engineering
We’ll extract some simple features from each URL:
- **Length of the URL**  
- **Number of digits**  
- **Number of special characters (@, ?, -, =, etc.)**  
- **Whether it uses the HTTPS scheme**  
- **Whether the host is an IP address**  


In [None]:
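import ipaddress
from urllib.parse import urlparse

def host_is_ip(url):
    """Return 1 if the URL's host is a literal IP address, else 0."""
    try:
        # urlparse only finds the host after "//", so add it to scheme-less URLs.
        host = urlparse(url if "://" in url else "//" + url).hostname
        ipaddress.ip_address(host or "")
        return 1
    except ValueError:
        return 0

def extract_features(url):
    return {
        "url_length": len(url),
        "num_digits": sum(c.isdigit() for c in url),
        "num_special": sum(c in "@?-=.#%" for c in url),
        "has_https": int(url.lower().startswith("https://")),  # scheme only, not any "https" substring
        "has_ip": host_is_ip(url),
    }

features = pd.DataFrame([extract_features(u) for u in df['url']], index=df.index)
X = features
y = df['label'].map(lambda x: 1 if x == "malicious" else 0)  # 1 = malicious, 0 = benign

X.head()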


## 🏋️ Step 5: Train/Test Split
We’ll split the dataset into **70% training** and **30% testing**.


In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
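
The split above is purely random. When one class is much rarer than the other, passing `stratify=y` to `train_test_split` keeps the class proportions the same in the training and test sets. The sketch below illustrates this on synthetic labels; `X_demo` and `y_demo` are made-up placeholders, not the lab's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels for illustration: 90 benign (0), 10 malicious (1).
y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratification, both splits keep the same malicious share as the full set.
print("train malicious share:", y_tr.mean())
print("test malicious share:", y_te.mean())
```

Whether stratification matters for this lab depends on how imbalanced the uploaded dataset turns out to be.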


## 🌲 Step 6: Train RandomForest Model
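RandomForest is a strong baseline for tabular classification tasks like this one: it needs no feature scaling, copes with features on very different ranges, and usually performs reasonably with default settings.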


In [None]:
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
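
Once fitted, a `RandomForestClassifier` exposes `feature_importances_`, which gives a rough view of which features the trees relied on. The following sketch uses synthetic data with column names borrowed from this lab; the values and the labelling rule are made up, so the numbers it prints say nothing about real URLs.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-in for the feature table (not real URL data).
X_demo = pd.DataFrame({
    "url_length": rng.integers(10, 200, n),
    "num_digits": rng.integers(0, 30, n),
    "num_special": rng.integers(0, 20, n),
})
# Made-up labelling rule so that one feature is clearly informative.
y_demo = (X_demo["url_length"] > 100).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(importances.sort_values(ascending=False))
```

In the lab, the equivalent call would be `pd.Series(model.feature_importances_, index=X_train.columns)`.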


## 📊 Step 7: Evaluate Model
We will check **accuracy**, **precision**, **recall**, and **F1-score**.


In [None]:
y_pred = model.predict(X_test)

print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n📋 Classification Report:\n", classification_report(y_test, y_pred))
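
Accuracy on its own can look good even when a model never flags anything. As a rough illustration, the sketch below scores scikit-learn's `DummyClassifier`, which always predicts the majority class, on synthetic imbalanced labels. The data is made up; the point is only that recall on the malicious class exposes what accuracy hides.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced labels: 95 benign (0), 5 malicious (1).
y_demo = np.array([0] * 95 + [1] * 5)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by the dummy model

# Always predicts the most frequent class seen during fit (benign here).
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_base = baseline.predict(X_demo)

base_accuracy = accuracy_score(y_demo, y_base)
base_recall = recall_score(y_demo, y_base)  # recall for the malicious class (label 1)
print("accuracy:", base_accuracy)
print("malicious recall:", base_recall)
```

Comparing the RandomForest's scores against such a baseline is one way to check that the model has learned something beyond the class balance.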


## 🔎 Step 8: Confusion Matrix
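The confusion matrix shows how many URLs of each class were classified correctly or incorrectly. Rows are the actual labels and columns are the predicted labels, so the bottom-left cell counts malicious URLs that the model missed (false negatives).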


In [None]:
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["Benign", "Malicious"],
            yticklabels=["Benign", "Malicious"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
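
The four cells of a binary confusion matrix are enough to recompute precision, recall, and F1 by hand, which can make the classification report easier to read. The sketch below does this on a handful of made-up labels (`y_true` and `y_hat` are illustrative, not the lab's predictions).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up labels for illustration (1 = malicious, 0 = benign).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]

# For binary labels ordered [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat, labels=[0, 1]).ravel()

precision = tp / (tp + fp)  # share of flagged URLs that were really malicious
recall = tp / (tp + fn)     # share of malicious URLs that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In a detection setting, false negatives (missed malicious URLs) and false positives (blocked benign URLs) usually carry different costs, so it is worth reading both cells rather than a single summary score.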


# 🎯 Conclusion
- We successfully trained a **RandomForest model** to classify URLs.  
- Even simple lexical features give a useful baseline; judge the model by precision and recall on the malicious class, not by accuracy alone.  
- In real-world scenarios, we would add more advanced features (domain age, WHOIS data, entropy, etc.).  
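
As one example of the more advanced features mentioned above, character-level Shannon entropy measures how evenly a string's characters are distributed; random-looking strings score higher. This is a minimal sketch under that interpretation of "entropy": `shannon_entropy` is a hypothetical helper, not part of the lab's feature set, and the sample strings are made up.

```python
import math
from collections import Counter

def shannon_entropy(text):
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0
    total = len(text)
    probabilities = [count / total for count in Counter(text).values()]
    return sum(-p * math.log2(p) for p in probabilities)

# Repetitive strings score low; strings with many distinct characters score higher.
for candidate in ["aaaa", "abab", "x7f3kq9z"]:
    print(candidate, shannon_entropy(candidate))
```

It could be added as one more key in the dictionary returned by `extract_features`, for example computed over the whole URL or over the host only.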

🚀 Great job completing your first **AI for Cybersecurity** hands-on lab!
