# k-Nearest Neighbors Classifier (k-NN)

k-NN is one of the simplest classifiers in machine learning. Unlike most other supervised techniques, it does not perform an explicit **learning** step; instead, the algorithm computes the distance between the instance to be classified and the feature vectors in the dataset, then assigns the majority class among the k closest ones. Because of its simplicity, it is often used as a baseline when benchmarking more complex classifiers, such as Artificial Neural Networks (**ANN**) and Support Vector Machines (**SVM**).
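
To make the distance-based decision rule concrete, here is a minimal sketch of k-NN with Euclidean distance and majority voting in NumPy. The helper name `knn_predict` and the toy data are assumptions made for illustration; the notebook itself relies on `KNeighborsClassifier` from `sklearn`.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Hypothetical helper: classify one vector by majority vote of its k nearest neighbors."""
    # Euclidean distance from x to every training vector
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training vectors
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy data (made up for illustration): two well-separated clusters
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([0.05, 0.05]), k=3)
pred_b = knn_predict(X_train, y_train, np.array([1.0, 0.95]), k=3)
```

Nothing is fitted here: all the work happens at prediction time, which is why k-NN becomes slow on large datasets such as Covertype.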

This notebook shows how the dimensionality reduction algorithms required for this work affect the
accuracy of the k-NN classifier on the classification task posed by the Covertype dataset. The sections
are organized as follows: first, we import the libraries needed to run k-NN from
`sklearn` and load the dataset, separating data from targets; then we load the results
datasets and define a function that runs k-NN over them, also running cross-validation
to validate the results; finally, we compare the performance of the classifier on
each reduced or extracted set of attributes.
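
As a rough illustration of the cross-validation step mentioned above, the sketch below scores a k-NN classifier with 5-fold cross-validation using `cross_val_score` from `sklearn`. The synthetic data from `make_classification`, the choice of `n_neighbors=7`, and the number of folds are all assumptions for the example, not the settings used in this work.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the Covertype features (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=5, random_state=0)

knn = KNeighborsClassifier(n_neighbors=7)
# 5-fold cross-validation: fit on four folds, measure accuracy on the held-out fold
scores = cross_val_score(knn, X, y, cv=5)
mean_accuracy = scores.mean()
```

Averaging the per-fold accuracies gives a less optimistic estimate than a single train/test split, at the cost of fitting the classifier once per fold.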

## Implementation

### Import libraries 
Let's first import the libraries, mainly `pandas`, `numpy` and the k-NN implementation from `sklearn`:

In [11]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import random

### Load results datasets

Now, load the train and test datasets, separating the targets from the features:

In [12]:
# Train dataset
dataset_train = pd.read_csv('datasets/covertype_norm_train.csv')
# Targets
target_train = dataset_train.iloc[:,-1]
# Dataset without classes
data_train = dataset_train.iloc[:,:-1]
data_train.head()

| | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | hill_shade_noon | hill_shade_15 | horiz_dist_fire | ... | soil_type_30 | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.573753 | -0.518424 | -0.428658 | 0.436024 | -0.475092 | -0.979056 | 0.927864 | 0.14452 | -0.534162 | -0.220768 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 1 | 1.656009 | -0.010549 | 0.868502 | -0.516497 | -0.280544 | 1.81761 | 0.862413 | 0.665801 | -0.534162 | 2.273548 | ... | -0.14199 | -0.214265 | 4.938531 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 2 | 0.169501 | -0.799569 | 0.632655 | 0.45517 | 1.89191 | -0.388051 | 0.796962 | -1.245563 | -1.335438 | -0.687429 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 3 | -1.205043 | 1.268208 | 1.576043 | 0.23499 | 1.648725 | -0.649457 | -2.933743 | -0.15956 | 1.956291 | -0.501856 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 4 | -1.057345 | 0.152697 | 0.986425 | 0.134472 | 0.530073 | -1.041945 | 0.404256 | 1.056762 | -0.014415 | -0.79477 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |


In [13]:
# Test dataset
dataset_test = pd.read_csv('datasets/covertype_norm_test.csv')
# Targets
target_test = dataset_test.iloc[:,-1]
# Dataset without classes
data_test = dataset_test.iloc[:,:-1]
data_test.head()

| | elevation | aspect | slope | horiz_dist_hydro | vert_dist_hydro | horiz_dist_road | hillshade_9 | hill_shade_noon | hill_shade_15 | horiz_dist_fire | ... | soil_type_30 | soil_type_31 | soil_type_32 | soil_type_33 | soil_type_34 | soil_type_35 | soil_type_36 | soil_type_37 | soil_type_38 | soil_type_39 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1.350358 | 1.730737 | -0.782429 | -0.889847 | -0.783127 | -0.407751 | -0.381155 | 0.2314 | 0.5703 | -0.35176 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 1 | -0.857238 | 0.78754 | 1.104349 | -0.253237 | 1.081293 | -0.297127 | -1.7229 | 1.360843 | 1.848011 | -0.840253 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 2 | 0.422017 | 1.794222 | 0.632655 | 0.459957 | 0.3031 | 1.02733 | -1.101116 | -1.028362 | 0.332083 | 0.262267 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 3 | -1.698163 | 1.277277 | 3.10905 | -0.27717 | 1.373115 | -1.150296 | -4.733644 | -1.549644 | 2.151197 | -1.108606 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |
| 4 | -0.630926 | 1.81236 | -0.782429 | 1.063061 | 0.416587 | -0.907074 | -0.282979 | 0.10108 | 0.440364 | -0.429082 | ... | -0.14199 | -0.214265 | -0.202489 | -0.039088 | -0.081433 | -0.016657 | -0.044107 | -0.220216 | -0.219696 | -0.172986 |


Now, load the dataset with the results from the **genetic** algorithm executions:

In [14]:
# Read selected attributes of GA
ga  = pd.read_csv('results/ga_selected_attributes.csv')

# GA dataset info
print('---------------- Genetic Algorithm ----------------')
print("Shape: ", ga.shape)
print("First Data:\n", ga.iloc[0])

---------------- Genetic Algorithm ----------------
Shape:  (2430, 59)
First Data:
 population             25
max_generations        50
combination_prob      0.7
mutation_prob        0.05
iteration               1
elevation           False
aspect              False
slope               False
horiz_dist_hydro    False
vert_dist_hydro     False
horiz_dist_road     False
hillshade_9          True
hill_shade_noon      True
hill_shade_15       False
horiz_dist_fire      True
wild_area_0          True
wild_area_1         False
wild_area_2          True
wild_area_3         False
soil_type_0         False
soil_type_1         False
soil_type_2         False
soil_type_3          True
soil_type_4         False
soil_type_5         False
soil_type_6         False
soil_type_7         False
soil_type_8         False
soil_type_9          True
soil_type_10        False
soil_type_11         True
soil_type_12        False
soil_type_13        False
soil_type_14        False
soil_type_15        False
soil_t

Now, load the dataset with the results from the **PSO** algorithm executions:

In [15]:
# Read selected Attributes of PSO
pso = pd.read_csv('results/pso_selected_attributes.csv')

# PSO dataset info
print('---------------- Particle Swarm Optimization ----------------')
print("Shape: ", pso.shape)
print("First Data:\n", pso.iloc[0])

---------------- Particle Swarm Optimization ----------------
Shape:  (270, 57)
First Data:
 swarm_size                25
max_iterations            50
fitness             0.873008
elevation              False
aspect                 False
slope                   True
horiz_dist_hydro       False
vert_dist_hydro        False
horiz_dist_road         True
hillshade_9            False
hill_shade_noon        False
hill_shade_15          False
horiz_dist_fire        False
wild_area_0             True
wild_area_1            False
wild_area_2             True
wild_area_3            False
soil_type_0            False
soil_type_1            False
soil_type_2             True
soil_type_3            False
soil_type_4            False
soil_type_5            False
soil_type_6            False
soil_type_7            False
soil_type_8            False
soil_type_9             True
soil_type_10           False
soil_type_11            True
soil_type_12           False
soil_type_13           False
soil_typ

### k-NN classifier definition

In [21]:
# Defining K's
ks = [1,3,5,7,9]
acc_list = ['k1','k3','k5','k7','k9']

def perform_knn(attr):
    '''
    Performs knn for a given dataset.
    '''
    columns = attr.columns.tolist()[:-54] + acc_list
    new_df = pd.DataFrame(columns=columns)
    
    for index in range(len(attr)):
        vector = attr.iloc[index,-54:].tolist()
        sliced_train = data_train.iloc[:, vector]       
        sliced_test = data_test.iloc[:, vector]
        
        # Perform knn
        r = []
        for k in ks:
            knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
            knn.fit(sliced_train, target_train)
            r.append(knn.score(sliced_test, target_test))
        
        # Store this row's own parameters followed by its accuracy for each k
        new_df.loc[index] = attr.iloc[index,:-54].tolist() + r
        print(index)
    
    return new_df
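
The core of `perform_knn` is the boolean mask: each result row carries one `True`/`False` flag per feature, and `iloc` keeps only the flagged columns before fitting. The sketch below illustrates that step on a toy frame; the column names `f0` to `f2` and all values are made up for the example.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Toy data (made up): f0 and f1 separate the classes, f2 is noise
train = pd.DataFrame({'f0': [0.0, 0.1, 1.0, 0.9],
                      'f1': [0.0, 0.2, 1.0, 1.1],
                      'f2': [5.0, -3.0, 4.0, -2.0]})
y_train = [0, 0, 1, 1]
test = pd.DataFrame({'f0': [0.05, 0.95],
                     'f1': [0.1, 1.0],
                     'f2': [-4.0, 6.0]})
y_test = [0, 1]

# One flag per column, in the same order as the columns
mask = [True, True, False]
sliced_train = train.iloc[:, mask]   # keeps only f0 and f1
sliced_test = test.iloc[:, mask]

knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(sliced_train, y_train)
accuracy = knn.score(sliced_test, y_test)
```

The mask must have exactly one entry per column of the frame it is applied to, which is why the function slices the last 54 entries of each result row.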

### k-NN execution

In [17]:
# Perform knn for all attributes of dataset
full_dataset = pd.DataFrame(columns=(['n_attr'] + data_train.columns.tolist()))
full_dataset.loc[0] = [54] + [True for i in range(54)]

full_dataset_result = perform_knn(full_dataset)
full_dataset_result

In [None]:
# Perform knn for PSO
knn_pso_result = perform_knn(pso)
# Store results of PSO
knn_pso_result.to_csv('results/knn_pso.csv')

In [None]:
# Perform knn for GA
knn_ga_result = perform_knn(ga)
# Store results of GA
knn_ga_result.to_csv('results/knn_ga.csv')

## Comparative results

Before comparing, we need to load the results for each algorithm:

In [7]:
# PSO results
knn_pso = pd.read_csv('results/knn_pso.csv')
# GA results
knn_ga = pd.read_csv('results/knn_ga.csv')

### Original Dataset

For the original dataset, the best accuracy is **81.70%** (k = 7):

In [8]:
knn_original.style

| | n_attr | accuracy | ks |
|---|---|---|---|
| 0 | 54.0 | 0.816998 | 7.0 |
| 1 | 54.0 | 0.803062 | 10.0 |
| 2 | 54.0 | 0.783195 | 15.0 |
| 3 | 54.0 | 0.768581 | 20.0 |
| 4 | 54.0 | 0.75813 | 25.0 |


### Particle Swarm Optimization

For the dataset with the attributes selected by the PSO algorithm, the best accuracy is **81.22%** (k = 7, 33 attributes):

In [9]:
def get_n_selected(data, index):
    row = data.loc[index]
    return row.iloc[-54:].tolist().count(True)

pso_columns = ['id','n_attr'] + ['swarm_size', 'max_iterations', 'accuracy', 'ks']
pso_best_results = pd.DataFrame(columns=pso_columns)

index = 0
for i in knn_pso.columns.tolist()[-5:]:
    row = knn_pso.loc[knn_pso[i].idxmax()]
    pso_best_results.loc[index] = [int(row['Unnamed: 0']), 
                                   get_n_selected(pso, int(row['Unnamed: 0'])), 
                                   row['swarm_size'], row['max_iterations'], row[i], i]
    index += 1
    
pso_best_results.style

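| | id | n_attr | swarm_size | max_iterations | accuracy | ks |
|---|---|---|---|---|---|---|
| 0 | 47 | 33 | 25 | 50 | 0.812162 | k7 |
| 1 | 47 | 33 | 25 | 50 | 0.796924 | k10 |
| 2 | 47 | 33 | 25 | 50 | 0.778096 | k15 |
| 3 | 155 | 26 | 25 | 50 | 0.763641 | k20 |
| 4 | 155 | 26 | 25 | 50 | 0.754022 | k25 |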
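
The loop above picks, for each k column, the row with the highest accuracy by calling `idxmax`. A minimal sketch of that selection on a toy results frame follows; the values and the `best` dictionary are made up for illustration.

```python
import pandas as pd

# Toy results (made up): one row per feature subset, one accuracy column per k
results = pd.DataFrame({'swarm_size': [25, 50, 100],
                        'k7':  [0.80, 0.81, 0.79],
                        'k10': [0.78, 0.77, 0.79]})

best = {}
for col in ['k7', 'k10']:
    # Index label of the row with the highest accuracy for this k
    row_id = results[col].idxmax()
    best[col] = (row_id, results.loc[row_id, col])
```

Because `idxmax` returns an index label rather than a position, the result can be passed straight to `loc`, and the best row may differ from one k to the next.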


### Genetic Algorithm

For the dataset with the attributes selected by the genetic algorithm, the best accuracy is **73.87%** (k = 7, 13 attributes):

In [10]:
ga_columns = ['id', 'n_attr'] + ga.columns.tolist()[:4] + ['accuracy', 'ks']
ga_best_results = pd.DataFrame(columns=ga_columns)

index = 0
for i in knn_ga.columns.tolist()[-5:]:
    row = knn_ga.loc[knn_ga[i].idxmax()]
    ga_best_results.loc[index] = [int(row['Unnamed: 0']), 
                                  get_n_selected(ga, 
                                                 int(row['Unnamed: 0']))] +\
    row.loc[ga.columns.tolist()[:4]].tolist() + [ row[i], i ]
    index += 1
    
ga_best_results.style

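| | id | n_attr | population | max_generations | combination_prob | mutation_prob | accuracy | ks |
|---|---|---|---|---|---|---|---|---|
| 0 | 144 | 13 | 25 | 50 | 0.7 | 0.05 | 0.738675 | k7 |
| 1 | 144 | 13 | 25 | 50 | 0.7 | 0.05 | 0.727598 | k10 |
| 2 | 144 | 13 | 25 | 50 | 0.7 | 0.05 | 0.71579 | k15 |
| 3 | 144 | 13 | 25 | 50 | 0.7 | 0.05 | 0.704767 | k20 |
| 4 | 144 | 13 | 25 | 50 | 0.7 | 0.05 | 0.697332 | k25 |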


## Selected attributes of best results

With the best results in hand, we now show the attributes selected by each algorithm, using the following function:

In [11]:
def selected_attr(data, best_results):
    '''
    Given the best results for an algorithm, return the
    names of the attributes selected in its top row.
    '''
    # Row id of the best result, then its 54 attribute flags
    s = data.iloc[best_results.iloc[0,0],-54:]
    return [name for name, flag in s.items() if flag == True]

### PSO

In [12]:
print('Size:', pso_best_results.iloc[0,1], \
      '& Attrs:', selected_attr(pso, pso_best_results))

Size: 33 & Attrs: ['elevation', 'aspect', 'slope', 'horiz_dist_hydro', 'horiz_dist_road', 'hillshade_9', 'hill_shade_noon', 'horiz_dist_fire', 'wild_area_0', 'wild_area_1', 'wild_area_2', 'soil_type_0', 'soil_type_2', 'soil_type_4', 'soil_type_8', 'soil_type_9', 'soil_type_10', 'soil_type_11', 'soil_type_12', 'soil_type_15', 'soil_type_16', 'soil_type_19', 'soil_type_21', 'soil_type_23', 'soil_type_24', 'soil_type_30', 'soil_type_31', 'soil_type_32', 'soil_type_34', 'soil_type_35', 'soil_type_36', 'soil_type_37', 'soil_type_38']


### GA

In [13]:
print('Size:', ga_best_results.iloc[0,1], \
      '& Attrs:', selected_attr(ga, ga_best_results))

Size: 13 & Attrs: ['elevation', 'horiz_dist_road', 'horiz_dist_fire', 'wild_area_0', 'wild_area_2', 'soil_type_19', 'soil_type_21', 'soil_type_22', 'soil_type_28', 'soil_type_31', 'soil_type_37', 'soil_type_38', 'soil_type_39']
