<a href="https://colab.research.google.com/github/mdkamrulhasan/machine_learning_concepts/blob/master/notebooks/supervised/classification_knn_DTree_HP.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## What Will We Cover Today?

In this notebook, we will evaluate how the **k-Nearest Neighbor (kNN)** and **Decision Tree** classifiers perform on a real-world dataset for **Phishing Detection**.

We will compare each model under two different configurations:

### 1. k-Nearest Neighbor (kNN) and Decision Tree: Default Hyperparameters
- Train each model using its default settings.
- Evaluate its performance on the phishing detection dataset.

### 2. k-Nearest Neighbor (kNN) and Decision Tree: Hyperparameter Optimization (Grid Search)
- Apply Grid Search with cross-validation to tune each model's hyperparameters.
- Train the optimized model.
- Compare its performance with the default configuration.


## [Phishing Websites Dataset](https://archive.ics.uci.edu/dataset/327/phishing+websites)


---







## Loading necessary Python packages

In [2]:
# data processing packages
import pandas as pd

# Classification modeling packages (sklearn)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

# model evaluation related packages
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

## Loading data and some preprocessing

In [3]:
# Load the dataset

data_path = "https://raw.githubusercontent.com/mdkamrulhasan/data-public/refs/heads/main/miscellaneous/fishing_binary.csv"
df = pd.read_csv(data_path)

In [4]:
df.head()

Unnamed: 0,having_ip_address,url_length,shortining_service,having_at_symbol,double_slash_redirecting,prefix_suffix,having_sub_domain,sslfinal_state,domain_registration_length,favicon,...,popupwindow,iframe,age_of_domain,dnsrecord,web_traffic,page_rank,google_index,links_pointing_to_page,statistical_report,result
0,-1,1,1,1,-1,-1,-1,-1,-1,1,...,1,1,-1,-1,-1,-1,1,1,-1,-1
1,1,1,1,1,1,-1,0,1,-1,1,...,1,1,-1,-1,0,-1,1,1,1,-1
2,1,0,1,1,1,-1,-1,-1,-1,1,...,1,1,1,-1,1,-1,1,0,-1,-1
3,1,0,1,1,1,-1,-1,-1,1,1,...,1,1,-1,-1,1,-1,1,-1,1,-1
4,1,0,-1,1,1,-1,1,1,-1,1,...,-1,1,-1,-1,0,-1,1,1,1,1


In [5]:
# Separating features and labels dataframes
features_df, labels_df = df[df.columns[:-1]], df[df.columns[-1]]
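The positional split above can be sketched on a small frame. This is a minimal illustration, assuming a made-up `toy_df` whose column names are placeholders rather than the real dataset columns. It uses `iloc`, which selects by position in the same way as indexing with `df.columns[:-1]`, and then counts the labels as a quick check of class balance.

```python
import pandas as pd

# Hypothetical toy frame; column names are placeholders, not the real dataset columns.
toy_df = pd.DataFrame({
    "feature_a": [-1, 1, 1, -1],
    "feature_b": [1, 0, -1, 1],
    "result": [-1, 1, 1, -1],
})

# Every column except the last is a feature; the last column is the label.
toy_features = toy_df.iloc[:, :-1]
toy_labels = toy_df.iloc[:, -1]

# Label counts give a quick view of class balance before modeling.
label_counts = toy_labels.value_counts().sort_index()
print(toy_features.shape, label_counts.to_dict())
```

Because accuracy is the only metric used in this notebook, a strongly imbalanced label column would be worth knowing about before interpreting the scores.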

## Modeling

Extracting features and labels as NumPy arrays

In [6]:
X, y = features_df.values, labels_df.values

Splitting the data into training and test sets

In [7]:
# test data amount (in terms of proportion)
TEST_PROP = 0.5
# Random number seed; important for experiment reproducibility
RANDOM_SEED = 0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_PROP, random_state=RANDOM_SEED)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((5527, 30), (5527,), (5528, 30), (5528,))
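The split above is purely random. One optional variation, not used in this notebook, is a stratified split, which keeps the class proportions the same in both halves. The sketch below assumes synthetic data (`X_toy`, `y_toy` are made up for illustration) and passes `stratify` to `train_test_split`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, features in {-1, 0, 1}, 70% positive labels.
rng = np.random.default_rng(0)
X_toy = rng.integers(-1, 2, size=(200, 4))
y_toy = np.array([1] * 140 + [-1] * 60)

# stratify=y_toy asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=0, stratify=y_toy)

share_train = np.mean(y_tr == 1)
share_test = np.mean(y_te == 1)
print(share_train, share_test)
```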

## Feature Scaling

In [8]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
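The scaler above is fitted on the training split only, which keeps test data out of the scaling statistics. When cross-validation is applied to that training split later, one possible refinement is to place the scaler and the classifier in a single pipeline so that the scaler is re-fitted inside each fold. The sketch below is an assumption-laden illustration on synthetic data from `make_classification`, not the approach taken in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the phishing data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

# The scaler is fitted on the training part of each fold only,
# then applied to that fold's validation part.
pipe = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print("mean CV accuracy: %.2f" % scores.mean())
```

A pipeline like this can also be passed to `GridSearchCV` as the estimator; the grid keys then need the step prefix, for example `kneighborsclassifier__n_neighbors`.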

## kNN: Model Training and Evaluation with Default Hyperparameters

In [None]:
clf = KNeighborsClassifier()
# Train the model using the training set
clf.fit(X_train, y_train)

# Prediction and accuracy estimation (training data)
y_pred = clf.predict(X_train)
acc_train = accuracy_score(y_train, y_pred)
print("accuracy score (training data): %.2f" % acc_train)

# Prediction and error estimation (test data)
y_pred = clf.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print("accuracy_score (test data): %.2f" % acc_test)

# Storing results in a dataframe
knn_results = pd.DataFrame({
  'model': ['knn-default'],
  'train_acc': [round(acc_train, 2)],
  'test_acc': [round(acc_test, 2)]
})


accuracy score (training data): 0.96
accuracy_score (test data): 0.93
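The same predict, score, and store steps are repeated for every model in this notebook. A small helper can capture them once. The sketch below is one possible way to do this; `evaluate` is a hypothetical function name and the data is synthetic.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def evaluate(model, name, X_train, y_train, X_test, y_test):
    """Return a one-row dataframe with train and test accuracy of a fitted model."""
    acc_train = accuracy_score(y_train, model.predict(X_train))
    acc_test = accuracy_score(y_test, model.predict(X_test))
    return pd.DataFrame({
        "model": [name],
        "train_acc": [round(acc_train, 2)],
        "test_acc": [round(acc_test, 2)],
    })


# Synthetic stand-in data, split half and half as in the notebook.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)

knn = KNeighborsClassifier().fit(X_tr, y_tr)
row = evaluate(knn, "knn-default", X_tr, y_tr, X_te, y_te)
print(row)
```

Rows returned by such a helper can be stacked with `pd.concat` to build a comparison table across models.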




---






# Hyperparameter Optimization

## Hyperparameter Grid Definition
## n_neighbors (k)

In [None]:
param_grid = {
    "n_neighbors": [1, 5, 10, 20, 30]
}
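To see what a grid expands to before running the search, the candidates can be enumerated with `ParameterGrid`. The sketch below uses the same grid as above and assumes five-fold cross-validation; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "n_neighbors": [1, 5, 10, 20, 30]
}

# Each candidate is one dictionary of hyperparameter values.
candidates = list(ParameterGrid(param_grid))

# With 5 folds, every candidate is fitted 5 times during the search.
n_folds = 5
n_cv_fits = len(candidates) * n_folds
print(len(candidates), "candidates,", n_cv_fits, "cross-validation fits")
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best candidate on the full training set after the search.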

## Applying Five-fold Cross Validation on the Training Set

In [None]:
grid_search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)
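Beyond the single best value, the fitted search object keeps the score of every candidate in its `cv_results_` attribute. The sketch below shows one way to view these as a table. It runs on synthetic data, so the scores will not match the phishing dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    estimator=KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 5, 10, 20, 30]},
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

# One row per candidate, best-ranked first.
columns = ["param_n_neighbors", "mean_test_score", "std_test_score", "rank_test_score"]
results = pd.DataFrame(search.cv_results_)[columns].sort_values("rank_test_score")
print(results)
```

Looking at the mean and standard deviation together helps judge whether the best candidate is clearly ahead or only marginally better than its neighbors.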

## Displaying Grid Search Results

In [None]:
print("Best k:", grid_search.best_params_["n_neighbors"])
print("Best CV Accuracy:", grid_search.best_score_)

# Best trained model
cv_model = grid_search.best_estimator_

Best k: 1
Best CV Accuracy: 0.936853526220615


## Evaluating Model Performance with the Hyperparameter Chosen through the HP Search

In [None]:
# Prediction and accuracy estimation (training data)
y_pred = cv_model.predict(X_train)
accuracy_train = accuracy_score(y_train, y_pred)
print("Accuracy (train data): %.2f" % accuracy_train)

# Prediction and error estimation (test data)
y_pred = cv_model.predict(X_test)
accuracy_test = accuracy_score(y_test, y_pred)
print("Accuracy (test data): %.2f" % accuracy_test)

# Storing results in a dataframe
knn_cv_results = pd.DataFrame({
  'model': ['knn-cv'],
  'train_acc': [round(accuracy_train, 2)],
  'test_acc': [round(accuracy_test, 2)]
})

Accuracy (train data): 0.99
Accuracy (test data): 0.95




---






## Testing Decision Tree with and without HP Optimization

*Note: Decision Trees do not necessarily require feature encoding or feature scaling.*
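The note above can be checked with a small experiment. Tree splits compare one feature against a threshold, so a monotonic rescaling such as min-max scaling should leave the learned partitions unchanged. The sketch below assumes synthetic features in {-1, 0, 1}, similar in spirit to the encoded phishing features, and compares a tree trained on raw values with one trained on scaled values.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: features in {-1, 0, 1}, label driven by two features plus noise.
rng = np.random.default_rng(0)
X_toy = rng.integers(-1, 2, size=(400, 6)).astype(float)
noise = rng.normal(0.0, 0.5, size=400)
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] + noise > 0, 1, -1)

X_tr, X_te = X_toy[:300], X_toy[300:]
y_tr = y_toy[:300]

# Same seed for both trees so that only the feature scale differs.
scaler = MinMaxScaler().fit(X_tr)
tree_raw = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
tree_scaled = DecisionTreeClassifier(random_state=0).fit(scaler.transform(X_tr), y_tr)

# Fraction of test rows on which the two trees give the same prediction.
agreement = np.mean(
    tree_raw.predict(X_te) == tree_scaled.predict(scaler.transform(X_te)))
print("prediction agreement: %.2f" % agreement)
```

kNN behaves differently because it relies on distances, which change when features are rescaled.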

## With Default Hyperparameters

In [9]:
clf = DecisionTreeClassifier()
# Train the model using the training set
clf.fit(X_train, y_train)

# Prediction and accuracy estimation (training data)
y_pred = clf.predict(X_train)
acc_train = accuracy_score(y_train, y_pred)
print("accuracy score (training data): %.2f" % acc_train)

# Prediction and error estimation (test data)
y_pred = clf.predict(X_test)
acc_test = accuracy_score(y_test, y_pred)
print("accuracy_score (test data): %.2f" % acc_test)

# Storing results in a dataframe
dtree_results = pd.DataFrame({
  'model': ['lr'],
  'train_acc': [round(acc_train, 2)],
  'test_acc': [round(acc_test, 2)]
})


accuracy score (training data): 0.99
accuracy_score (test data): 0.95


## With Hyperparameter Optimization

In [15]:
param_grid = {
    "max_depth": [None, 5, 10, 20, 30, 50],
    "min_samples_leaf": [1, 2, 5, 10, 20]
}

In [16]:
grid_search = GridSearchCV(
    estimator=DecisionTreeClassifier(),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
    n_jobs=-1
)

grid_search.fit(X_train, y_train)

In [17]:
print("Best max_depth:", grid_search.best_params_["max_depth"])
print("Best min_samples_leaf:", grid_search.best_params_["min_samples_leaf"])
print("Best CV Accuracy:", grid_search.best_score_)

# Best trained model
cv_model = grid_search.best_estimator_

Best max_depth: 20
Best min_samples_leaf: 1
Best CV Accuracy: 0.9500619410373693
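A selected `max_depth` is only an upper bound. If the unconstrained tree never grows that deep, the cap has no effect and the tuned tree may be the same as the default one. The fitted tree's actual size can be read with `get_depth()` and `get_n_leaves()`. The sketch below illustrates this on synthetic data, so the numbers it prints say nothing about the phishing dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# One tree grown without a depth limit, one capped at depth 5.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
capped_tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_toy, y_toy)

for name, tree in [("full", full_tree), ("capped", capped_tree)]:
    print(name, "depth:", tree.get_depth(), "leaves:", tree.get_n_leaves())
```

Applied to `cv_model` in this notebook, the same two calls would show whether the depth limit found by the search is actually reached.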


In [18]:
# Prediction and accuracy estimation (training data)
y_pred = cv_model.predict(X_train)
accuracy_train = accuracy_score(y_train, y_pred)
print("Accuracy (train data): %.2f" % accuracy_train)

# Prediction and error estimation (test data)
y_pred = cv_model.predict(X_test)
accuracy_test = accuracy_score(y_test, y_pred)
print("Accuracy (test data): %.2f" % accuracy_test)

# Storing results in a dataframe
dtree_cv_results = pd.DataFrame({
  'model': ['dtree-cv'],
  'train_acc': [round(accuracy_train, 2)],
  'test_acc': [round(accuracy_test, 2)]
})

Accuracy (train data): 0.99
Accuracy (test data): 0.96
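To close the comparison, the four sets of accuracies printed in this notebook can be collected in one table. The sketch below types in the rounded values from the outputs above; the model labels and the `gap` column are illustrative additions.

```python
import pandas as pd

# Rounded accuracies copied from the printed outputs in this notebook.
summary = pd.DataFrame({
    "model": ["knn-default", "knn-cv", "dtree-default", "dtree-cv"],
    "train_acc": [0.96, 0.99, 0.99, 0.99],
    "test_acc": [0.93, 0.95, 0.95, 0.96],
})

# Difference between train and test accuracy as a rough overfitting indicator.
summary["gap"] = (summary["train_acc"] - summary["test_acc"]).round(2)

# Model with the highest test accuracy.
best_model = summary.sort_values("test_acc", ascending=False).iloc[0]["model"]
print(summary)
print("highest test accuracy:", best_model)
```

The differences are small (one to three percentage points on a single split), so they are best read as indicative rather than conclusive.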
