
# Phishing Website Detection Using Machine Learning

This project aims to detect **phishing websites** using machine learning models.  
The dataset contains various features extracted from URLs and website behaviors, which help distinguish **legitimate** sites from **phishing** ones.

---

## Project Overview
**Steps covered in this notebook:**
1. Data loading and preprocessing  
2. Feature exploration  
3. Model training (Random Forest, K-Nearest Neighbors)  
4. Model evaluation and performance metrics  
5. Summary of findings



In [None]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
import seaborn as sns
import matplotlib.pyplot as plt



## Load and Inspect Dataset
We'll load the phishing dataset and perform a quick inspection to understand its structure.


In [None]:
# Load dataset
df = pd.read_csv("Phishing_dataset.csv")

# Display basic info
print("Dataset shape:", df.shape)
df.head()



## Data Preprocessing
We'll check for missing values, encode labels if necessary, and split the dataset into training and test sets.


In [None]:
# Check for missing values
print("Missing values per column:\n", df.isnull().sum())

# Assuming the target column is named 'Result' (1 = phishing, 0 = legitimate)
X = df.drop('Result', axis=1)
y = df['Result']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training samples:", X_train.shape[0])
print("Testing samples:", X_test.shape[0])
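
Phishing datasets are often imbalanced, so it can help to check the class mix and keep it the same in both splits. The sketch below is only an illustration: it uses a synthetic stand-in from `make_classification` (not the real `Phishing_dataset.csv`), and the names `X_demo`, `y_demo`, `Xs_train`, and so on are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the phishing data (hypothetical, roughly 80/20 class mix)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=[f"f{i}" for i in range(10)])
y_demo = pd.Series(y_demo, name="Result")

# Share of each class before splitting
print(y_demo.value_counts(normalize=True))

# stratify=y_demo keeps the class proportions (nearly) identical in both splits
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Train positive share:", ys_train.mean())
print("Test positive share:", ys_test.mean())
```

On the real data, the same idea would amount to passing `stratify=y` to `train_test_split`.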



## Random Forest Classifier
We'll start by training a Random Forest model and evaluating its accuracy.


In [None]:
# Train Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predictions
y_pred_rf = rf_model.predict(X_test)

# Evaluation
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf))
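
A fitted Random Forest also exposes `feature_importances_`, which is one way to approach the feature exploration step listed in the overview. This is a minimal sketch on synthetic data. The column names `feature_0` ... `feature_7` and the `rf_demo` model are placeholders, not the notebook's actual features or model.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
feature_names = [f"feature_{i}" for i in range(8)]
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=3, random_state=42
)
X_demo = pd.DataFrame(X_demo, columns=feature_names)

rf_demo = RandomForestClassifier(n_estimators=100, random_state=42)
rf_demo.fit(X_demo, y_demo)

# Impurity-based importances, sorted from most to least important
importances = pd.Series(
    rf_demo.feature_importances_, index=feature_names
).sort_values(ascending=False)
print(importances.head(5))
```

With the notebook's own objects, the equivalent would be `pd.Series(rf_model.feature_importances_, index=X.columns)`.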



## K-Nearest Neighbors (KNN)
Next we train a KNN model on the same train/test split to compare it with the Random Forest.


In [None]:
# Train KNN model
knn_model = KNeighborsClassifier(n_neighbors=5)
knn_model.fit(X_train, y_train)

# Predictions
y_pred_knn = knn_model.predict(X_test)

# Evaluation
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))
print("\nClassification Report:\n", classification_report(y_test, y_pred_knn))
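
KNN classifies by distance, so features on very different scales can dominate the result. If the dataset's features are not already on a common scale, wrapping the model in a scaling pipeline may help. The sketch below assumes synthetic data with one deliberately inflated feature. `knn_scaled` and the `Xk_*`/`yk_*` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; the first feature is put on a much larger scale on purpose
X_demo, y_demo = make_classification(n_samples=600, n_features=10, random_state=42)
X_demo[:, 0] *= 1000

Xk_train, Xk_test, yk_train, yk_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied before KNN
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(Xk_train, yk_train)

yk_pred = knn_scaled.predict(Xk_test)
acc_scaled = accuracy_score(yk_test, yk_pred)
print("Scaled KNN accuracy:", acc_scaled)
```

If the features are already encoded on a small common range, scaling may change little.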



## Confusion Matrix Visualization
Let's visualize model performance to better understand where misclassifications occur.


In [None]:
# Confusion matrix for Random Forest
cm_rf = confusion_matrix(y_test, y_pred_rf)
plt.figure(figsize=(6,4))
sns.heatmap(cm_rf, annot=True, fmt='d', cmap='Blues')
plt.title('Random Forest Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
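
Precision and recall for the phishing class can be read directly off the confusion matrix. The sketch below assumes the layout that `confusion_matrix` returns for labels `[0, 1]`, namely `[[TN, FP], [FN, TP]]`. The counts in `cm_demo` are made up for illustration and are not results from this notebook.

```python
def precision_recall(cm):
    """Return (precision, recall) for the positive class.

    cm follows scikit-learn's layout for labels [0, 1]: [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts: 50 TN, 10 FP, 5 FN, 35 TP
cm_demo = [[50, 10], [5, 35]]
precision, recall = precision_recall(cm_demo)
print(f"precision={precision:.3f}, recall={recall:.3f}")
# prints: precision=0.778, recall=0.875
```

The same function would accept `cm_rf.tolist()` from the cell above.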


In [None]:
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print("KNN Accuracy:", accuracy_score(y_test, y_pred_knn))


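## Summary of Results

The table compares the baseline Random Forest with KNN trained on SMOTE-resampled data and with class-weighted Random Forest variants, each evaluated at the listed decision threshold.
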
| Model | Data Handling | Threshold | Recall | Precision | False Positives | False Negatives | Notes |
|--------|----------------|------------|----------|------------|------------------|------------------|--------|
| Random Forest (Unbalanced) | None | 0.5 | ~85% | ~70% | — | 7,032 | High accuracy, poor recall on phishing URLs |
| KNN + SMOTE (k=5) | SMOTE | 0.5 | ~89% | ~65% | 20,644 | 3,463 | Many false positives, tunable performance |
| Balanced Random Forest | class_weight='balanced' | 0.5 | ~94% | ~63% | 8,673 | 1,066 | Stronger phishing detection after balancing |
| **Balanced Random Forest (Tuned)** | class_weight='balanced' + threshold=0.4 | 0.4 | **96.6%** | **62.7%** | **11,556** | **677** | Best trade-off between recall and precision |

---

**Key Findings**
- The **Balanced Random Forest (threshold = 0.4)** model achieved the highest recall, **96.6%**, with 677 false negatives. Recall is the priority here because a missed phishing site (false negative) is riskier than a few false alarms.
- **KNN with SMOTE** offered a tunable alternative through `k`, but produced many false positives at lower `k` values (20,644 at k=5).
- **Threshold tuning** (0.5 to 0.4) raised phishing recall from ~94% to 96.6% and cut false negatives from 1,066 to 677, while precision stayed near 63% and false positives rose from 8,673 to 11,556.
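
The class-weighted model and the lowered threshold described above could be set up roughly as follows. This is a sketch on synthetic, imbalanced data, not the code that produced the table. `rf_balanced`, `proba_phishing`, and the `Xb_*`/`yb_*` names are hypothetical, and the numbers it prints will not match the reported results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (about 15% positives) standing in for the real dataset
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.85, 0.15], random_state=42
)
Xb_train, Xb_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight="balanced" reweights classes inversely to their frequency
rf_balanced = RandomForestClassifier(class_weight="balanced", random_state=42)
rf_balanced.fit(Xb_train, yb_train)

# Probability of the positive (phishing) class for each test sample
proba_phishing = rf_balanced.predict_proba(Xb_test)[:, 1]

results = {}
for threshold in (0.5, 0.4):
    # Flag a sample as phishing when its probability reaches the threshold
    yb_pred = (proba_phishing >= threshold).astype(int)
    results[threshold] = (
        recall_score(yb_test, yb_pred),
        precision_score(yb_test, yb_pred, zero_division=0),
    )
    print(f"threshold={threshold}: recall={results[threshold][0]:.3f}, "
          f"precision={results[threshold][1]:.3f}")
```

Lowering the threshold can only add predicted positives, so recall never decreases, while precision typically falls. This is the trade-off the table reports.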

---

