In [110]:
import pandas as pd
import numpy as np
dta_security=pd.read_csv("https://raw.githubusercontent.com/ucr-courses/knn-lab/main/data/jaggia_ba_1e_ch09_Security.csv")
dta_security.head()

Unnamed: 0,Threat,WarTerms,Keywords,Links
0,0,6,5,5
1,0,3,5,8
2,0,5,8,7
3,0,4,7,4
4,0,5,5,2


Let's keep 10% of the data for testing and use the remaining 90% for training/validation.

In [113]:
from sklearn.model_selection import train_test_split
# Features are every column except the target; the target is the Threat column
X_train_val, X_test, y_train_val, y_test = train_test_split(
    dta_security.drop('Threat', axis=1),
    dta_security["Threat"],
    test_size=0.1,
    random_state=42,
)

In [115]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

X_scaler=scaler.fit(X_train_val)
X_train_val_scaled=X_scaler.transform(X_train_val)
X_test_scaled=X_scaler.transform(X_test)
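
The scaler is fitted on the train-validation rows only, and the same fitted statistics are then applied to the test rows. Here is a minimal sketch of what that amounts to, using a small made-up array rather than the Security data (the names `X_train_demo`, `X_test_demo`, and `demo_scaler` are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the Security dataset
X_train_demo = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]])
X_test_demo = np.array([[2.5, 25.0], [5.0, 50.0]])

# fit() learns the column means and standard deviations from the training rows only
demo_scaler = StandardScaler().fit(X_train_demo)

# transform() applies z = (x - mean) / std using those training statistics
manual_test = (X_test_demo - demo_scaler.mean_) / demo_scaler.scale_
scaled_test = demo_scaler.transform(X_test_demo)

print(np.allclose(manual_test, scaled_test))
print(scaled_test)
```

Because the test rows never influence the mean or standard deviation, the test set stays untouched until the final evaluation.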

Set up the KNN model in scikit-learn

In [118]:
from sklearn.neighbors import KNeighborsClassifier

knn_model= KNeighborsClassifier()

Because we are interested in evaluating a variety of different possible values of $k$ in our model, we create a dictionary of values to explore using the label *n_neighbors* and a **range** of values between 1 and 10 (with the latter value being set to 10 + 1 since the end of **range** is exclusive). Enter:

In [121]:
paramGrid={'n_neighbors' : range( 1 , 10 + 1 )}
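
As a quick check of the exclusive end point, the short sketch below expands the same **range** into a list (the name `k_values` is only illustrative):

```python
# range(1, 10 + 1) stops before 11, so it yields the integers 1 through 10
k_values = list(range(1, 10 + 1))
print(k_values)
print(len(k_values))
```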

Now we are ready to fit our models and evaluate the different $k$ values. To do this, we will use $k$-fold cross-validation; note that the $k$ in $k$-fold (the number of folds) is unrelated to the $k$ in KNN (the number of neighbors). Before we get started, we set up the procedure with the scikit-learn **GridSearchCV** class and store it in a variable named search. The first two arguments specify our model and our paramGrid from above; in addition, we use *cv* to set the number of folds to 10 and choose `'accuracy'` as our *scoring* metric so that the results are comparable to those shown in the text (other scoring metrics can also be used, which is helpful in cases such as high class imbalance). Enter:

In [124]:
from sklearn.model_selection import GridSearchCV

search=GridSearchCV(knn_model, paramGrid, cv=10, scoring='accuracy')

Now run the cross-validation

In [127]:
searchFit=search.fit(X_train_val_scaled,y_train_val)
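
To see roughly what the grid search does during fitting, the sketch below runs the same kind of loop by hand with `cross_val_score` and compares the mean fold accuracies with those reported by **GridSearchCV**. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Security dataset), and the names `X_demo`, `y_demo`, `mean_scores`, and `grid_demo` are hypothetical:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: three numeric features, binary target)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=3, n_redundant=0, random_state=0
)

k_values = range(1, 10 + 1)

# Manual version: one 10-fold cross-validation per candidate k
mean_scores = {}
for k in k_values:
    fold_scores = cross_val_score(
        KNeighborsClassifier(n_neighbors=k), X_demo, y_demo, cv=10, scoring='accuracy'
    )
    mean_scores[k] = fold_scores.mean()

# Grid search version: the same candidates, folds, and metric in one call
grid_demo = GridSearchCV(
    KNeighborsClassifier(), {'n_neighbors': k_values}, cv=10, scoring='accuracy'
).fit(X_demo, y_demo)

manual_means = np.array([mean_scores[k] for k in k_values])
print(np.allclose(manual_means, grid_demo.cv_results_['mean_test_score']))
```

With an integer *cv* and a classifier, both routes split the data into the same stratified folds, so the mean accuracies should agree.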

Check results

In [130]:
pd.DataFrame(searchFit.cv_results_)[['mean_test_score','params','rank_test_score']].sort_values(by='rank_test_score')

Unnamed: 0,mean_test_score,params,rank_test_score
7,0.685185,{'n_neighbors': 8},1
9,0.681481,{'n_neighbors': 10},2
8,0.674074,{'n_neighbors': 9},3
5,0.666667,{'n_neighbors': 6},4
6,0.666667,{'n_neighbors': 7},4
4,0.659259,{'n_neighbors': 5},6
2,0.62963,{'n_neighbors': 3},7
0,0.625926,{'n_neighbors': 1},8
3,0.622222,{'n_neighbors': 4},9
1,0.603704,{'n_neighbors': 2},10


The best-performing $k$ is 8, with a mean cross-validated accuracy of about 0.685.

In [133]:
best_estimator=searchFit.best_estimator_
best_estimator

We can make predictions with either searchFit or best_estimator. By default, the grid search refits the best model on the full train-validation data and uses it for prediction, so the results will be the same.
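
The equivalence can be checked on a small synthetic example. This is only a sketch under assumed data (`make_classification`), with hypothetical names `search_demo`, `proba_from_search`, and `proba_from_best`:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption, not the Security dataset)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=3, n_informative=3, n_redundant=0, random_state=1
)

search_demo = GridSearchCV(
    KNeighborsClassifier(), {'n_neighbors': range(1, 6)}, cv=5, scoring='accuracy'
)
search_demo.fit(X_demo, y_demo)

# The fitted search object delegates predict_proba to its refitted best estimator
proba_from_search = search_demo.predict_proba(X_demo[:10])
proba_from_best = search_demo.best_estimator_.predict_proba(X_demo[:10])

print(np.array_equal(proba_from_search, proba_from_best))
```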

In [136]:
predicted_prob=searchFit.predict_proba(X_test_scaled)

In [138]:
predicted_prob=best_estimator.predict_proba(X_test_scaled)
predicted_prob

array([[0.5  , 0.5  ],
       [0.25 , 0.75 ],
       [0.5  , 0.5  ],
       [0.375, 0.625],
       [0.25 , 0.75 ],
       [0.   , 1.   ],
       [0.25 , 0.75 ],
       [0.625, 0.375],
       [0.125, 0.875],
       [0.5  , 0.5  ],
       [0.375, 0.625],
       [0.   , 1.   ],
       [0.5  , 0.5  ],
       [0.   , 1.   ],
       [0.25 , 0.75 ],
       [0.75 , 0.25 ],
       [0.5  , 0.5  ],
       [0.75 , 0.25 ],
       [0.125, 0.875],
       [0.75 , 0.25 ],
       [0.   , 1.   ],
       [0.   , 1.   ],
       [0.75 , 0.25 ],
       [0.5  , 0.5  ],
       [0.125, 0.875],
       [0.375, 0.625],
       [0.375, 0.625],
       [0.25 , 0.75 ],
       [0.   , 1.   ],
       [0.625, 0.375]])

The first column in the output above shows the probabilities of the cases belonging to Class 0 (Not threat), while the second column lists the probabilities of the cases belonging to Class 1 (threat).
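
With the default uniform weights, each KNN probability is simply the share of the $k$ nearest neighbors that belong to the class, which is why every value above is a multiple of 1/8 = 0.125. Here is a minimal sketch on made-up one-dimensional data (`X_toy`, `y_toy`, `knn_demo`, and `query` are hypothetical):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: 16 points on a line with hand-picked labels
X_toy = np.arange(16, dtype=float).reshape(-1, 1)
y_toy = np.array([0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1])

knn_demo = KNeighborsClassifier(n_neighbors=8).fit(X_toy, y_toy)
query = np.array([[7.4]])

# Look up the 8 nearest training points and count how many are labeled 1
_, neighbor_idx = knn_demo.kneighbors(query)
share_of_ones = y_toy[neighbor_idx[0]].mean()

proba = knn_demo.predict_proba(query)[0]
print(share_of_ones, proba)
```

Here 5 of the 8 nearest neighbors are labeled 1, so the Class 1 probability is 0.625.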

To convert these predicted probabilities into class labels, we apply a cutoff probability of 0.5, so a case is classified as a threat when its Class 1 probability exceeds 0.5. A cutoff that reflects the proportion of 1s in the train-validation dataset is an alternative. Note that Python indexing starts at 0, so predicted_prob[:,1] selects the **second** column.

In [142]:
cutoff_prob=0.5

In [144]:

predicted_test=np.where(predicted_prob[:,1]>cutoff_prob,1,0)
predicted_test

array([0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 0])
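
If we wanted a cutoff based on the proportion of 1s in the training labels instead of 0.5, it could be computed as the mean of the 0/1 target. The sketch below uses made-up labels and probabilities (`y_train_demo`, `proba_demo`, and `cutoff_demo` are hypothetical, not values from the Security data):

```python
import numpy as np

# Hypothetical training labels: 7 of the 10 cases are 1s
y_train_demo = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
# Hypothetical Class 1 probabilities for five test cases
proba_demo = np.array([0.5, 0.75, 0.625, 0.875, 0.25])

# The mean of a 0/1 vector is the proportion of 1s
cutoff_demo = y_train_demo.mean()

labels_half = np.where(proba_demo > 0.5, 1, 0)          # fixed 0.5 cutoff
labels_prop = np.where(proba_demo > cutoff_demo, 1, 0)  # proportion-based cutoff

print(cutoff_demo)
print(labels_half)
print(labels_prop)
```

Note that the strict `>` sends a probability of exactly 0.5 to Class 0, which matters for KNN with an even $k$ because ties are possible.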

In [146]:
from sklearn.metrics import confusion_matrix
# first argument is true values, second argument is predicted values
confusion_matrix(y_test,predicted_test)

array([[ 8,  4],
       [ 4, 14]])

The confusion matrix is arranged as follows, with actual classes in rows and predicted classes in columns:

|                     | Predicted Negative | Predicted Positive |
|---------------------|--------------------|--------------------|
| **Actual Negative** | True Negatives (TN) | False Positives (FP) |
| **Actual Positive** | False Negatives (FN) | True Positives (TP) |
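
Given this layout, the four cells can be unpacked with `ravel()` and combined into the usual metrics. The sketch below re-enters the matrix printed above by hand; the variable names are illustrative:

```python
import numpy as np

# Confusion matrix from the output above: rows are actual, columns are predicted
cm = np.array([[8, 4],
               [4, 14]])

# Row-major flattening gives TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all cases classified correctly
sensitivity = tp / (tp + fn)      # recall for the threat class
specificity = tn / (tn + fp)      # recall for the not-threat class

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
```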


In [150]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,predicted_test)

0.7333333333333333

Alternatively, we can use the classification report, which also shows precision, recall, and F1-score for each class.

In [153]:
from sklearn.metrics import classification_report
print(classification_report(y_test,predicted_test,target_names = ['Not threat', 'threat']))

              precision    recall  f1-score   support

  Not threat       0.67      0.67      0.67        12
      threat       0.78      0.78      0.78        18

    accuracy                           0.73        30
   macro avg       0.72      0.72      0.72        30
weighted avg       0.73      0.73      0.73        30
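
The two average rows differ only in how the per-class values are combined: the macro average weights both classes equally, while the weighted average weights each class by its support. Here is a small sketch for the precision column, using the counts from the confusion matrix above (variable names are illustrative):

```python
import numpy as np

# Per-class precision from the confusion matrix [[8, 4], [4, 14]]
precision = np.array([8 / (8 + 4),     # Not threat: TN / (TN + FN)
                      14 / (14 + 4)])  # threat: TP / (TP + FP)
support = np.array([12, 18])           # actual cases per class in the test set

macro_avg = precision.mean()                            # simple average over classes
weighted_avg = np.average(precision, weights=support)   # support-weighted average

print(round(macro_avg, 2), round(weighted_avg, 2))
```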

