# Classification with KNN

In this notebook we implement document classification with KNN, using TF-IDF to vectorize the text.
We also show how to run a grid search to find the best KNN hyperparameters.

Note that most of the code is implemented in the module find_best_hyperparameters.
Thus, we import the KNN_classifier class from it, along with the pandas library.

We will perform the classification on two datasets: Dmoz-Sports.csv and Dmoz-Computers.csv.

In [1]:
from find_best_hyperparameters import KNN_classifier
import pandas as pd

We then instantiate the class KNN_classifier, which contains the code to find the best hyperparameters and the code to fit the best KNN.

In [2]:
KNNC = KNN_classifier()

## Dmoz-Sports.csv

First we need to load the dataset and split it into text and labels.

In [3]:
df = pd.read_csv('Dmoz-Sports.csv')
X = df['text'].values
labels = df['class'].values

Next we perform the grid search.
You can check the code for the grid_search method; the important part is that it vectorizes the documents with a TfidfVectorizer() and then splits X and labels into train and test sets (80% train and 20% test).

It then performs a grid search on the k hyperparameter (n_neighbors) of the KNN. You can pass other parameters in the following format:

params = { 'n_neighbors': [4, 8, 12, 16, 20, 24, 28, 32]}

In [4]:
grid_Sport = KNNC.grid_search(X, labels)
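
The steps described above can be sketched with scikit-learn as follows. This is only an illustration of the described pipeline, not the code in find_best_hyperparameters: the function name `grid_search_sketch`, its `seed` argument, and the default `cv=2` (chosen to match the two split columns in the results further down) are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier


def grid_search_sketch(texts, labels, params=None, cv=2, seed=0):
    """Hypothetical stand-in for KNN_classifier.grid_search."""
    if params is None:
        params = {'n_neighbors': [4, 8, 12, 16, 20, 24, 28, 32]}
    # Turn the raw documents into sparse TF-IDF vectors.
    features = TfidfVectorizer().fit_transform(texts)
    # Keep 80% for the search; the remaining 20% is held out for testing.
    X_train, _, y_train, _ = train_test_split(
        features, labels, test_size=0.2, random_state=seed)
    # Cross-validated search over the candidate values of k.
    grid = GridSearchCV(KNeighborsClassifier(), params, cv=cv)
    grid.fit(X_train, y_train)
    return grid
```

The sketch vectorizes before splitting because that is the order described above; fitting the vectorizer on the train split only would be a common alternative.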

The results of the grid search are then stored in a pandas DataFrame.

In [5]:
df_sports = pd.DataFrame(grid_Sport.cv_results_)
df_sports

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009161 | 0.000141 | 4.633988 | 0.254061 | 4 | {'n_neighbors': 4} | 0.70963 | 0.708148 | 0.708889 | 0.000741 | 8 |
| 1 | 0.00809 | 1.7e-05 | 4.630293 | 0.004409 | 8 | {'n_neighbors': 8} | 0.769259 | 0.769815 | 0.769537 | 0.000278 | 7 |
| 2 | 0.008014 | 0.000441 | 4.667385 | 0.027659 | 12 | {'n_neighbors': 12} | 0.77963 | 0.789444 | 0.784537 | 0.004907 | 6 |
| 3 | 0.008058 | 0.000205 | 4.775419 | 0.09578 | 16 | {'n_neighbors': 16} | 0.79037 | 0.798519 | 0.794444 | 0.004074 | 5 |
| 4 | 0.007586 | 0.000221 | 4.678444 | 0.096939 | 20 | {'n_neighbors': 20} | 0.796481 | 0.799259 | 0.79787 | 0.001389 | 4 |
| 5 | 0.007378 | 0.000314 | 4.894635 | 0.112063 | 24 | {'n_neighbors': 24} | 0.798704 | 0.80037 | 0.799537 | 0.000833 | 3 |
| 6 | 0.014971 | 0.000679 | 1.540799 | 0.002685 | 28 | {'n_neighbors': 28} | 0.802407 | 0.802037 | 0.802222 | 0.000185 | 1 |
| 7 | 0.008949 | 9.6e-05 | 1.50781 | 0.007633 | 32 | {'n_neighbors': 32} | 0.803148 | 0.797778 | 0.800463 | 0.002685 | 2 |
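
To read such a table quickly, it can help to sort it by rank_test_score and keep only the score columns. The snippet below is a small sketch that uses three rows copied from the results above (k = 24, 28 and 32); the variable names are illustrative only.

```python
import pandas as pd

# Three rows copied from the Dmoz-Sports grid search results.
results = pd.DataFrame({
    'param_n_neighbors': [24, 28, 32],
    'mean_test_score': [0.799537, 0.802222, 0.800463],
    'std_test_score': [0.000833, 0.000185, 0.002685],
    'rank_test_score': [3, 1, 2],
})

# Rank 1 is the configuration with the highest mean cross-validation score.
ranked = results.sort_values('rank_test_score')
best_k = int(ranked['param_n_neighbors'].iloc[0])

print(ranked)
print(f'best k = {best_k}')
```

On the full notebook data, the same two lines would apply directly to df_sports.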


We then check the best estimator found by the grid search.

In [6]:
grid_Sport.best_estimator_

KNeighborsClassifier(n_neighbors=28)

Afterwards, we train the best model using the whole train set and return the results on the test set.

Here we still split the data into train and test sets in an 80/20 ratio.

In [7]:
report, micro_f1 = KNNC.train_best(X, labels, grid_Sport.best_estimator_)
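
A training step of this kind could look roughly like the sketch below. It is an assumption about what such a method might do, not the code in find_best_hyperparameters: the name `train_best_sketch` and the `seed` argument are hypothetical, while `classification_report` and `f1_score` are the standard scikit-learn metrics functions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report, f1_score
from sklearn.model_selection import train_test_split


def train_best_sketch(texts, labels, estimator, seed=0):
    """Hypothetical stand-in for KNN_classifier.train_best."""
    # Vectorize the documents, then split 80/20 as in the grid search.
    features = TfidfVectorizer().fit_transform(texts)
    X_train, X_test, y_train, y_test = train_test_split(
        features, labels, test_size=0.2, random_state=seed)
    # Fit on the whole train split and evaluate on the held-out test split.
    estimator.fit(X_train, y_train)
    predicted = estimator.predict(X_test)
    report = classification_report(y_test, predicted)
    micro_f1 = f1_score(y_test, predicted, average='micro')
    return report, micro_f1
```

For single-label multiclass problems like this one, the micro-averaged F1 score is equal to the accuracy.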

The train_best method returns a report containing the precision, recall, accuracy and f1-score, together with the micro-averaged F1 score (micro_f1).

We then print those values to evaluate the results.

In [8]:
print(report)
print(f'f1_micro = {micro_f1}')

              precision    recall  f1-score   support

    Baseball       0.74      0.71      0.73       114
  Basketball       0.81      0.87      0.84       102
     Bowling       0.94      0.97      0.96       101
     Cricket       0.66      0.89      0.75        90
     Cycling       0.94      0.87      0.90       107
  Equestrian       0.94      0.89      0.92        95
     Fencing       0.83      0.95      0.89       111
      Flying       0.77      0.89      0.83        97
    Football       0.70      0.57      0.63       107
        Golf       1.00      0.86      0.93        94
  Gymnastics       0.91      0.91      0.91        95
      Hockey       0.82      0.53      0.65       101
    Lacrosse       0.71      0.87      0.78       100
     Martial       0.98      0.93      0.95        88
 Motorsports       0.87      0.71      0.78       101
   Paintball       0.91      0.92      0.91        84
     Running       0.85      0.92      0.88       108
     Skating       0.88    

## Dmoz-Computers.csv

For this dataset we are going to repeat the entire process that we used for the previous one.

In [9]:

df = pd.read_csv('Dmoz-Computers.csv')
X = df['text'].values
labels = df['class'].values

In [10]:
grid_Computers = KNNC.grid_search(X, labels)

In [11]:
df_computer = pd.DataFrame(grid_Computers.cv_results_)
df_computer

| | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_n_neighbors | params | split0_test_score | split1_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.004764 | 0.000948 | 1.916852 | 0.020274 | 4 | {'n_neighbors': 4} | 0.607105 | 0.599211 | 0.603158 | 0.003947 | 8 |
| 1 | 0.005805 | 0.000125 | 1.897177 | 0.047546 | 8 | {'n_neighbors': 8} | 0.635526 | 0.623421 | 0.629474 | 0.006053 | 7 |
| 2 | 0.004739 | 0.000491 | 1.891349 | 0.103187 | 12 | {'n_neighbors': 12} | 0.649211 | 0.632895 | 0.641053 | 0.008158 | 3 |
| 3 | 0.005283 | 0.000156 | 1.853592 | 0.008166 | 16 | {'n_neighbors': 16} | 0.657105 | 0.633158 | 0.645132 | 0.011974 | 1 |
| 4 | 0.005487 | 0.000791 | 1.899796 | 0.024385 | 20 | {'n_neighbors': 20} | 0.653158 | 0.630526 | 0.641842 | 0.011316 | 2 |
| 5 | 0.006056 | 0.000866 | 1.91384 | 0.023272 | 24 | {'n_neighbors': 24} | 0.652632 | 0.628684 | 0.640658 | 0.011974 | 4 |
| 6 | 0.006155 | 0.001283 | 0.693498 | 0.005935 | 28 | {'n_neighbors': 28} | 0.65 | 0.630526 | 0.640263 | 0.009737 | 5 |
| 7 | 0.004502 | 1.1e-05 | 0.686108 | 0.01162 | 32 | {'n_neighbors': 32} | 0.647632 | 0.631316 | 0.639474 | 0.008158 | 6 |


In [12]:
report, micro_f1 = KNNC.train_best(X, labels, grid_Computers.best_estimator_)

In [13]:
print(report)
print(f'f1_micro = {micro_f1}')

              precision    recall  f1-score   support

  Artificial       0.74      0.80      0.77       104
         CAD       0.72      0.82      0.77       100
   Companies       0.45      0.67      0.54       106
    Computer       0.80      0.81      0.81        95
 Consultants       0.52      0.73      0.61       100
        Data       0.65      0.82      0.72       202
   Education       0.81      0.85      0.83        96
    Graphics       0.83      0.92      0.87       106
    Hardware       0.72      0.57      0.63       104
    Internet       0.73      0.56      0.64       108
      Mobile       0.80      0.76      0.78       103
  Multimedia       0.76      0.77      0.77       101
        Open       0.57      0.63      0.60        95
 Programming       0.51      0.37      0.43       104
    Robotics       0.79      0.83      0.81       101
    Security       0.69      0.64      0.66        85
    Software       0.56      0.10      0.17        97
     Systems       0.74    

### Save the grid search results

In [14]:
df_computer.to_csv('grid_search_KNN_Dmoz-Computer.csv')
df_sports.to_csv('grid_search_KNN_Dmoz-Sports.csv')
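
Because to_csv writes the DataFrame index as an unnamed first column by default, reading the file back with a plain read_csv yields an extra 'Unnamed: 0' column. The sketch below shows how index_col=0 restores the index instead; the file name and the two-row DataFrame are placeholders for illustration, not actual results.

```python
import os
import tempfile

import pandas as pd

# Placeholder data, only to demonstrate the round trip.
results = pd.DataFrame({'param_n_neighbors': [4, 8], 'rank_test_score': [2, 1]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'grid_search_example.csv')  # hypothetical file name
    results.to_csv(path)
    plain = pd.read_csv(path)                  # index comes back as a column
    restored = pd.read_csv(path, index_col=0)  # index is restored as the index

print(list(plain.columns))
print(list(restored.columns))
```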