## Model Training & Evaluation

The extracted feature matrix `X` and target labels `y` were split into training and test sets (80–20, with a fixed `random_state=42`) so that generalization performance could be measured on held-out data.

A **Random Forest Classifier** was trained with hyperparameters chosen to control model complexity (`max_depth=10`, `min_samples_split=5`, `min_samples_leaf=2`), handle class imbalance (`class_weight='balanced'`), and ensure reproducibility (`random_state=42`).
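
The `model_train` helper imported from `src.train_model` is not shown in this notebook. As a minimal sketch, assuming it simply wraps scikit-learn's `RandomForestClassifier`, it could look like the following. The function name `model_train_sketch` and the synthetic data are hypothetical and only there so the example runs without the `../datasets/*.npy` files.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def model_train_sketch(hyperparameters, x_train, y_train):
    """Hypothetical stand-in for src.train_model.model_train (assumption)."""
    # Unpack the dict into the estimator's keyword arguments, then fit.
    model = RandomForestClassifier(**hyperparameters)
    model.fit(x_train, y_train)
    return model


# Synthetic, mildly imbalanced data (not the phishing dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.4, 0.6], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Same hyperparameter values as used in this notebook.
demo_params = {
    "n_estimators": 100,
    "max_depth": 10,
    "min_samples_split": 5,
    "min_samples_leaf": 2,
    "max_features": "sqrt",
    "class_weight": "balanced",
    "random_state": 42,
}
demo_model = model_train_sketch(demo_params, X_tr, y_tr)
print("demo accuracy:", demo_model.score(X_te, y_te))
```

Passing the hyperparameters as a dict keeps the configuration in one place, which makes it easy to log or to swap in a different setting later.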


In [27]:
import os
import sys

# Add the parent directory to the import path so the `src` package resolves from this notebook.
sys.path.append(os.path.abspath(".."))

In [28]:
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

from src.train_model import model_train  # project-local training helper

In [29]:
x_features = np.load("../datasets/x.npy")
y_label = np.load("../datasets/y.npy")

In [30]:
# Hold out 20% of the samples for testing; the fixed seed makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x_features, y_label, test_size=0.2, random_state=42
)
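
The split above fixes the random seed but does not stratify by label. The sketch below, on synthetic labels with roughly the class mix of the test set reported in this notebook (about 43% class `0`), shows one way to check the resulting sizes and class proportions. The `stratify=` argument is an optional variation, not something the original split uses, and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): about 57% of labels are class 1.
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.57).astype(int)
X_demo = rng.normal(size=(10_000, 5))

# stratify=y_demo keeps the class proportions nearly identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("train size:", len(y_tr), "test size:", len(y_te))
print("class-1 share, train:", y_tr.mean(), "test:", y_te.mean())
```

With a dataset as large as the one used here, a plain random split is likely to give similar proportions anyway; stratification mainly matters for smaller or more skewed datasets.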

In [31]:
hyperparameters = {
    'n_estimators': 100,
    'max_depth': 10,
    'min_samples_split': 5,
    'min_samples_leaf': 2,
    'max_features': 'sqrt',
    'class_weight': 'balanced',
    'random_state': 42,
}

In [32]:
model = model_train(hyperparameters, x_train, y_train)
y_pred = model.predict(x_test)
# For classifiers, score() returns the mean accuracy on the given data.
print("model score : ", model.score(x_test, y_test))


model score :  0.9409444644712568


In [35]:
y_pred

array([1, 0, 0, ..., 1, 1, 0], shape=(47159,))

In [33]:
print(confusion_matrix(y_test,y_pred))


[[18113  2011]
 [  774 26261]]
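
The headline metrics can be re-derived by hand from the four counts in the confusion matrix above. This is a minimal sketch, assuming label `0` is legitimate and label `1` is phishing (consistent with the summary in this notebook) and relying on scikit-learn's convention that rows are true labels and columns are predicted labels.

```python
# Counts copied from the confusion matrix above.
# Assumption: label 0 = legitimate, label 1 = phishing.
tn, fp = 18113, 2011   # true legitimate: predicted legitimate / predicted phishing
fn, tp = 774, 26261    # true phishing:   predicted legitimate / predicted phishing

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
phishing_recall = tp / (tp + fn)       # share of phishing URLs that were caught
phishing_precision = tp / (tp + fp)    # share of phishing predictions that were right
legit_precision = tn / (tn + fn)       # share of legitimate predictions that were right
legit_recall = tn / (tn + fp)          # share of legitimate URLs that were passed
false_positive_rate = fp / (fp + tn)   # share of legitimate URLs flagged as phishing

for name, value in [
    ("accuracy", accuracy),
    ("phishing recall", phishing_recall),
    ("phishing precision", phishing_precision),
    ("legitimate precision", legit_precision),
    ("legitimate recall", legit_recall),
    ("false positive rate", false_positive_rate),
]:
    print(f"{name:22s} {value:.4f}")
```

Working from raw counts makes it explicit which error type each metric penalizes: recall for phishing depends on the 774 missed phishing URLs, while the false positive rate depends on the 2,011 legitimate URLs that were flagged.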


In [34]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.96      0.90      0.93     20124
           1       0.93      0.97      0.95     27035

    accuracy                           0.94     47159
   macro avg       0.94      0.94      0.94     47159
weighted avg       0.94      0.94      0.94     47159



## Evaluation Results
The trained model achieved strong performance on the 47,159-sample test set:
- **Accuracy:** 94%
- **Phishing Recall (class `1`):** 97%
- **Legitimate Precision (class `0`):** 96%

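These results indicate that the model detects phishing URLs effectively: only 774 of the 27,035 phishing URLs in the test set were missed. The trade-off is a noticeable false-positive rate, since 2,011 of the 20,124 legitimate URLs (about 10%) were flagged as phishing, which shows up as the lower legitimate-class recall of 90%. High phishing recall is especially important in security-focused applications, where a missed phishing URL is typically more costly than a false alarm.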

Overall, the model performs consistently across both classes (F1-scores of 0.93 and 0.95) on this held-out split, making it a reasonable candidate for URL-based phishing detection.