# k-Nearest Neighbors Classifier (k-NN)

k-NN is one of the simplest classifiers in Machine Learning. Unlike other common supervised techniques, it has no explicit **learning** phase; instead, the algorithm computes the distance between the instance to be classified and the feature vectors of the dataset, and assigns the majority class among the $k$ closest ones. Because of its simplicity, it is often used as a baseline in benchmarks of more complex classifiers, such as Artificial Neural Networks (**ANN**) and Support Vector Machines (**SVM**).
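
The distance-based procedure described above can be sketched in a few lines of plain Python. This is only an illustration, not the `sklearn` implementation used later in this notebook: the helper `knn_predict`, the choice of Euclidean distance, and the toy points are all assumptions made for the example.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train_points, train_labels, query, k=3):
    """Hypothetical helper: label `query` by majority vote among its k nearest training points."""
    # Sort the training indices by distance to the query
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    # Vote among the labels of the k closest points
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (assumed): two well-separated groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']

prediction = knn_predict(points, labels, (0.3, 0.3), k=3)
print(prediction)
```

Note that nothing is fitted here: all the work happens at prediction time, which is why k-NN becomes slow on large datasets unless an index structure is used.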

This notebook shows how the dimensionality reduction algorithms used in this work affect the
accuracy of the k-NN classifier on the classification task posed by the Covertype dataset. Sections
are organized as follows: first, we import the libraries needed to run k-NN from
`sklearn` and load the dataset, separating data from targets; then we load the result
datasets and define functions to run k-NN over them, also running cross-validation
to validate the results; finally, we compare the performance of the classifier for
each reduced or extracted set of attributes.
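
Cross-validation with `sklearn` is typically run through `cross_val_score`. The following is a minimal sketch on made-up, well-separated points rather than the Covertype data; the arrays `X` and `y` and the choice of 5 folds are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Toy data (assumed): 10 points near the origin, 10 points near (10, 10)
offsets = np.array([[i * 0.1, i * 0.05] for i in range(10)])
X = np.vstack([offsets, offsets + 10.0])
y = np.array([0] * 10 + [1] * 10)

# 5-fold cross-validation of a 1-NN classifier; for classifiers an integer
# `cv` uses stratified folds, so each fold holds 2 points per class
knn = KNeighborsClassifier(n_neighbors=1)
scores = cross_val_score(knn, X, y, cv=5)

print(scores.mean())
```

Each entry of `scores` is the accuracy on one held-out fold, so the mean gives a less split-dependent estimate than a single train/test evaluation.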

## Implementation

### Import libraries

Let's first import the libraries: mainly `pandas`, `numpy`, and the k-NN implementation from `sklearn`:

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
import random

### Loading the datasets

Now, load the train and test datasets, separating the targets from the features:

In [2]:
# Train dataset
dataset_train = pd.read_csv('datasets/covertype_norm_train.csv')
# Targets
target_train = dataset_train.iloc[:,-1]
# Dataset without classes
data_train = dataset_train.iloc[:,:-1]
data_train.head()

Unnamed: 0,elevation,aspect,slope,horiz_dist_hydro,vert_dist_hydro,horiz_dist_road,hillshade_9,hill_shade_noon,hill_shade_15,horiz_dist_fire,...,soil_type_30,soil_type_31,soil_type_32,soil_type_33,soil_type_34,soil_type_35,soil_type_36,soil_type_37,soil_type_38,soil_type_39
0,-0.573753,-0.518424,-0.428658,0.436024,-0.475092,-0.979056,0.927864,0.14452,-0.534162,-0.220768,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
1,1.656009,-0.010549,0.868502,-0.516497,-0.280544,1.81761,0.862413,0.665801,-0.534162,2.273548,...,-0.14199,-0.214265,4.938531,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
2,0.169501,-0.799569,0.632655,0.45517,1.89191,-0.388051,0.796962,-1.245563,-1.335438,-0.687429,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
3,-1.205043,1.268208,1.576043,0.23499,1.648725,-0.649457,-2.933743,-0.15956,1.956291,-0.501856,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
4,-1.057345,0.152697,0.986425,0.134472,0.530073,-1.041945,0.404256,1.056762,-0.014415,-0.79477,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986


In [3]:
# Test dataset
dataset_test = pd.read_csv('datasets/covertype_norm_test.csv')
# Targets
target_test = dataset_test.iloc[:,-1]
# Dataset without classes
data_test = dataset_test.iloc[:,:-1]
data_test.head()

Unnamed: 0,elevation,aspect,slope,horiz_dist_hydro,vert_dist_hydro,horiz_dist_road,hillshade_9,hill_shade_noon,hill_shade_15,horiz_dist_fire,...,soil_type_30,soil_type_31,soil_type_32,soil_type_33,soil_type_34,soil_type_35,soil_type_36,soil_type_37,soil_type_38,soil_type_39
0,-1.350358,1.730737,-0.782429,-0.889847,-0.783127,-0.407751,-0.381155,0.2314,0.5703,-0.35176,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
1,-0.857238,0.78754,1.104349,-0.253237,1.081293,-0.297127,-1.7229,1.360843,1.848011,-0.840253,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
2,0.422017,1.794222,0.632655,0.459957,0.3031,1.02733,-1.101116,-1.028362,0.332083,0.262267,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
3,-1.698163,1.277277,3.10905,-0.27717,1.373115,-1.150296,-4.733644,-1.549644,2.151197,-1.108606,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986
4,-0.630926,1.81236,-0.782429,1.063061,0.416587,-0.907074,-0.282979,0.10108,0.440364,-0.429082,...,-0.14199,-0.214265,-0.202489,-0.039088,-0.081433,-0.016657,-0.044107,-0.220216,-0.219696,-0.172986


Now, load the dataset with the results from the **genetic** algorithm executions:

In [4]:
# Read selected attributes of GA
ga  = pd.read_csv('results/ga_selected_attributes.csv')

# GA dataset info
print('---------------- Genetic Algorithm ----------------')
print("Shape: ", ga.shape)
print("First Data:\n", ga.iloc[0])

---------------- Genetic Algorithm ----------------
Shape:  (2430, 59)
First Data:
 population             25
max_generations        50
combination_prob      0.7
mutation_prob        0.05
iteration               1
elevation           False
aspect              False
slope               False
horiz_dist_hydro    False
vert_dist_hydro     False
horiz_dist_road     False
hillshade_9          True
hill_shade_noon      True
hill_shade_15       False
horiz_dist_fire      True
wild_area_0          True
wild_area_1         False
wild_area_2          True
wild_area_3         False
soil_type_0         False
soil_type_1         False
soil_type_2         False
soil_type_3          True
soil_type_4         False
soil_type_5         False
soil_type_6         False
soil_type_7         False
soil_type_8         False
soil_type_9          True
soil_type_10        False
soil_type_11         True
soil_type_12        False
soil_type_13        False
soil_type_14        False
soil_type_15        False
soil_t

Now, load the dataset with the results from the **PSO** algorithm executions:

In [5]:
# Read selected Attributes of PSO
pso = pd.read_csv('results/pso_selected_attributes.csv')

# PSO dataset info
print('---------------- Particle Swarm Optimization ----------------')
print("Shape: ", pso.shape)
print("First Data:\n", pso.iloc[0])

---------------- Particle Swarm Optimization ----------------
Shape:  (270, 57)
First Data:
 swarm_size                25
max_iterations            50
fitness             0.873008
elevation              False
aspect                 False
slope                   True
horiz_dist_hydro       False
vert_dist_hydro        False
horiz_dist_road         True
hillshade_9            False
hill_shade_noon        False
hill_shade_15          False
horiz_dist_fire        False
wild_area_0             True
wild_area_1            False
wild_area_2             True
wild_area_3            False
soil_type_0            False
soil_type_1            False
soil_type_2             True
soil_type_3            False
soil_type_4            False
soil_type_5            False
soil_type_6            False
soil_type_7            False
soil_type_8            False
soil_type_9             True
soil_type_10           False
soil_type_11            True
soil_type_12           False
soil_type_13           False
soil_typ

Now, load the dataset with the results from the **Correlation filter** executions:

In [6]:
# Read selected attributes of the correlation filter
filtered = pd.read_csv('results/filter_selected_attrs.csv')

# Filter dataset info
print("---------------- Filtered by Spearman's Correlation ----------------")
print("Shape: ", filtered.shape)
print("First Data:\n", filtered.iloc[0])

---------------- Filtered by Spearman's Correlation ----------------
Shape:  (3, 54)
First Data:
 elevation           False
aspect              False
slope                True
horiz_dist_hydro    False
vert_dist_hydro     False
horiz_dist_road      True
hillshade_9         False
hill_shade_noon      True
hill_shade_15        True
horiz_dist_fire      True
wild_area_0          True
wild_area_1         False
wild_area_2          True
wild_area_3          True
soil_type_0         False
soil_type_1         False
soil_type_2         False
soil_type_3         False
soil_type_4         False
soil_type_5         False
soil_type_6         False
soil_type_7         False
soil_type_8         False
soil_type_9          True
soil_type_10        False
soil_type_11         True
soil_type_12        False
soil_type_13        False
soil_type_14        False
soil_type_15        False
soil_type_16        False
soil_type_17        False
soil_type_18        False
soil_type_19        False
soil_type_20      

### k-NN classifier definitions

In [7]:
# Defining K's
ks = [1,3,5,7,9]
acc_list = ['k1','k3','k5','k7','k9']

# knn for selector algorithms (PSO, GA and FILTERED)
def perform_knn_selector(attr):
    '''
    Runs k-NN once per row of `attr`. The last 54 columns of each row are
    boolean flags marking the selected attributes; the leading columns
    (algorithm parameters) are copied to the output, followed by the test
    accuracy obtained for each k in `ks`.
    '''
    columns = attr.columns.tolist()[:-54] + acc_list
    new_df = pd.DataFrame(columns=columns)
    
    for index in range(len(attr)):
        # Boolean mask with one flag per attribute
        vector = attr.iloc[index,-54:].tolist()
        sliced_train = data_train.iloc[:, vector]       
        sliced_test = data_test.iloc[:, vector]
        
        # Perform knn
        r = []
        for k in ks:
            knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
            knn.fit(sliced_train, target_train)
            r.append(knn.score(sliced_test, target_test))
        
        new_df.loc[index] = attr.iloc[index,:-54].tolist() + r
        print(index)  # progress indicator
    
    return new_df

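# knn for extractor algorithms (PCA, LDA)
def perform_knn_extractor(train, test):
    '''
    Runs k-NN on an already transformed train/test pair whose target is
    stored in the `cover_type` column. Returns the test accuracy for each
    k in `ks`.
    '''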
    train_target = train['cover_type']
    train_data   = train.loc[:, train.columns != 'cover_type']
    test_target  = test['cover_type']
    test_data    = test.loc[:, test.columns != 'cover_type']
    
    r = []
    for k in ks:
        knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
        knn.fit(train_data, train_target)
        r.append(knn.score(test_data, test_target))
        
    return r
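
`perform_knn_selector` relies on the fact that `DataFrame.iloc` accepts a list of booleans as a column selector: only the columns whose flag is `True` are kept. A small sketch with a made-up three-column frame (the frame `toy` and the list `mask` are assumptions, not part of the datasets above):

```python
import pandas as pd

# Toy frame (assumed) standing in for data_train
toy = pd.DataFrame({
    'elevation': [0.1, 0.2],
    'aspect':    [1.0, 2.0],
    'slope':     [5.0, 6.0],
})

# One flag per column, as stored in each row of the selector result files
mask = [True, False, True]

# Keep only the columns flagged True
sliced = toy.iloc[:, mask]
print(sliced.columns.tolist())
```

The mask must have exactly one entry per column, which is why the functions above always take the last 54 values of each result row.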

### k-NN execution

k-NN is executed for **all** solutions obtained with every parameter variation
of the algorithms used for dimensionality reduction. In practice, only one solution was selected
for each of them, but all results are useful for further discussion.

The cells below show how these executions were performed:

In [None]:
# Perform knn for all attributes of the dataset
full_dataset = pd.DataFrame(columns=(['n_attr'] + data_train.columns.tolist()))
full_dataset.loc[0] = [54] + [True for i in range(54)]

knn_original = perform_knn_selector(full_dataset)

In [None]:
# Perform knn for PSO
knn_pso_result = perform_knn_selector(pso)
# Store results of PSO
knn_pso_result.to_csv('results/knn_pso.csv')

In [None]:
# Perform knn for GA
knn_ga_result = perform_knn_selector(ga)
# Store results of GA
knn_ga_result.to_csv('results/knn_ga.csv')

In [None]:
# Perform knn for filtered attributes
knn_filtered_result = perform_knn_selector(filtered)
# Store results of Filtered
knn_filtered_result.to_csv('results/knn_filtered.csv')

In [8]:
# Perform knn for PCA
representativity = [75,85,95]
knn_pca_result = pd.DataFrame(columns=(['representativity (%)'] + ks))
index = 0
for r in representativity:
    train = pd.read_csv('results/pca_' + str(r) + '.csv')
    test  = pd.read_csv('results/pca_test_' + str(r) + '.csv')
    knn_pca_result.loc[index] = [r] + perform_knn_extractor(train, test)
    index += 1
    
# Store results of PCA
knn_pca_result.to_csv('results/knn_pca.csv')

In [9]:
# Perform knn for LDA
n_components = range(1,7)
knn_lda_result = pd.DataFrame(columns=(['n_components'] + ks))
index = 0
for n in n_components:
    train = pd.read_csv('results/lda_' + str(n) + '.csv')
    test  = pd.read_csv('results/lda_test_' + str(n) + '.csv')
    knn_lda_result.loc[index] = [n] + perform_knn_extractor(train, test)
    index += 1
    
# Store results of LDA
knn_lda_result.to_csv('results/knn_lda.csv')   

## Comparative results

Before comparing, we need to load the results for each algorithm:

In [16]:
# PSO results
knn_pso = pd.read_csv('results/knn_pso.csv')
# GA results
knn_ga = pd.read_csv('results/knn_ga.csv')
# filtered results
knn_filtered = pd.read_csv('results/knn_filtered.csv')
# PCA results
knn_pca = pd.read_csv('results/knn_pca.csv')
# LDA results
knn_lda = pd.read_csv('results/knn_lda.csv')

### Original Dataset

For the original dataset, the best accuracy was **83.32%**, obtained with $k=1$:

In [17]:
knn_original.style

Unnamed: 0,n_attr,k1,k3,k5,k7,k9
0,54,0.833195,0.803869,0.790557,0.778702,0.770175
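
The best value of $k$ can also be read off programmatically. The sketch below copies the accuracies from the table above into a plain dictionary (the dictionary `original_scores` is an assumption for illustration; the notebook keeps these values in `knn_original`) and picks the largest one:

```python
# Accuracies copied from the table above
original_scores = {
    'k1': 0.833195,
    'k3': 0.803869,
    'k5': 0.790557,
    'k7': 0.778702,
    'k9': 0.770175,
}

# Key with the highest accuracy
best_k = max(original_scores, key=original_scores.get)
print(best_k, f"{original_scores[best_k]:.2%}")
```

In this table the accuracy decreases monotonically as $k$ grows.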


### Particle Swarm Optimization


In [18]:
def get_n_selected(data, index):
    row = data.loc[index]
    return row.iloc[-54:].tolist().count(True)

pso_columns = ['id','n_attr','swarm_size', 'max_iterations', 'accuracy', 'ks']
pso_best_results = pd.DataFrame(columns=pso_columns)

index = 0
for i in knn_pso.columns.tolist()[-5:]:
    row = knn_pso.loc[knn_pso[i].idxmax()]
    pso_best_results.loc[index] = [int(row['Unnamed: 0']), 
                                   get_n_selected(pso, int(row['Unnamed: 0'])), 
                                   row['swarm_size'], row['max_iterations'], row[i], i]
    index += 1
    
pso_best_results.style

Unnamed: 0,id,n_attr,swarm_size,max_iterations,accuracy,ks
0,32,22,25,50,0.854825,k1
1,32,22,25,50,0.824459,k3
2,32,22,25,50,0.806988,k5
3,32,22,25,50,0.797005,k7
4,37,23,25,50,0.782654,k9


### Genetic Algorithm


In [19]:
ga_columns = ['id', 'n_attr'] + ga.columns.tolist()[:4] + ['accuracy', 'ks']
ga_best_results = pd.DataFrame(columns=ga_columns)

index = 0
for i in knn_ga.columns.tolist()[-5:]:
    row = knn_ga.loc[knn_ga[i].idxmax()]
    ga_best_results.loc[index] = [int(row['Unnamed: 0']), 
                                  get_n_selected(ga, 
                                                 int(row['Unnamed: 0']))] +\
    row.loc[ga.columns.tolist()[:4]].tolist() + [ row[i], i ]
    index += 1
    
ga_best_results.style

Unnamed: 0,id,n_attr,population,max_generations,combination_prob,mutation_prob,accuracy,ks
0,184,17,25,50,0.7,0.05,0.772879,k1
1,184,17,25,50,0.7,0.05,0.768511,k3
2,184,17,25,50,0.7,0.05,0.769967,k5
3,184,17,25,50,0.7,0.05,0.765183,k7
4,184,17,25,50,0.7,0.05,0.757903,k9


### Correlation filter method


In [20]:
filtered_columns = ['id', 'n_attr', 'accuracy', 'ks']
filtered_best_results = pd.DataFrame(columns=filtered_columns)

index = 0
for i in knn_filtered.columns.tolist()[-5:]:
    row = knn_filtered.loc[knn_filtered[i].idxmax()]
    filtered_best_results.loc[index] = [int(row['Unnamed: 0']), 
                                  get_n_selected(filtered, 
                                                 int(row['Unnamed: 0']))] + [ row[i], i ]
    index += 1
    
filtered_best_results.style

Unnamed: 0,id,n_attr,accuracy,ks
0,2,32,0.815724,k1
1,2,32,0.790141,k3
2,2,32,0.777662,k5
3,2,32,0.767679,k7
4,2,32,0.759567,k9


### PCA


In [21]:
pca_columns = ['id', 'representativity (%)', 'accuracy', 'ks']
pca_best_results = pd.DataFrame(columns=pca_columns)

index = 0
for i in knn_pca.columns.tolist()[-5:]:
    row = knn_pca.loc[knn_pca[i].idxmax()]
    pca_best_results.loc[index] = [int(row['Unnamed: 0']), row['representativity (%)'], row[i], i ]
    index += 1
    
pca_best_results.style

Unnamed: 0,id,representativity (%),accuracy,ks
0,0,75,0.314268,1
1,2,95,0.246048,3
2,0,75,0.228577,5
3,0,75,0.214018,7
4,0,75,0.197587,9


### LDA


In [22]:
lda_columns = ['id', 'n_attr', 'accuracy', 'ks']
lda_best_results = pd.DataFrame(columns=lda_columns)

index = 0
for i in knn_lda.columns.tolist()[-5:]:
    row = knn_lda.loc[knn_lda[i].idxmax()]
    lda_best_results.loc[index] = [int(row['Unnamed: 0']), row['n_components'], row[i], i ]
    index += 1
    
lda_best_results.style

Unnamed: 0,id,n_attr,accuracy,ks
0,5,6,0.43906,1
1,5,6,0.375624,3
2,4,5,0.335899,5
3,5,6,0.297005,7
4,5,6,0.272463,9
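
To make the comparison easier to read, the best accuracy of each approach (in every table above, the value obtained with $k=1$) can be gathered and ranked. The dictionary below is typed in by hand from those tables, so it is an illustrative summary rather than output produced by the notebook:

```python
# Best accuracy per approach, copied from the tables above (all at k = 1)
best_accuracy = {
    'Original (54 attributes)': 0.833195,
    'PSO': 0.854825,
    'GA': 0.772879,
    'Correlation filter': 0.815724,
    'PCA': 0.314268,
    'LDA': 0.43906,
}

# Rank from most to least accurate
ranking = sorted(best_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, acc in ranking:
    print(f"{name:<26}{acc:.4f}")
```

In these tables only the PSO selection (22 attributes) scores above the full 54-attribute baseline, while both extraction methods (PCA and LDA) fall far below it.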


## Selected attributes of best results

With the best results in hand, we now list the attributes selected by each algorithm, using the following function:

In [81]:
def selected_attr(data, best_results):
    '''
    Returns the names of the attributes selected in the solution
    referenced by the first row of `best_results`.
    '''
    s = data.iloc[best_results.iloc[0,0],-54:]
    # Iterate over (name, flag) pairs instead of indexing the Series by position
    return [name for name, flag in s.items() if flag == True]
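
The name lookup performed by `selected_attr` amounts to pairing each attribute name with its flag and keeping the flagged ones. A sketch in plain Python with `itertools.compress`, using a made-up four-attribute row (the names are borrowed from the dataset, the flags are assumed):

```python
from itertools import compress

# Toy row (assumed): attribute names and their selection flags
names = ['elevation', 'aspect', 'slope', 'wild_area_0']
flags = [True, False, False, True]

# compress keeps the names whose flag is truthy
selected = list(compress(names, flags))
print('Size:', len(selected), '& Attrs:', selected)
```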

### PSO

In [82]:
print('Size:', pso_best_results.iloc[0,1], \
      '& Attrs:', selected_attr(pso, pso_best_results))

Size: 22 & Attrs: ['elevation', 'aspect', 'horiz_dist_hydro', 'horiz_dist_road', 'hillshade_9', 'horiz_dist_fire', 'wild_area_0', 'wild_area_2', 'soil_type_0', 'soil_type_2', 'soil_type_3', 'soil_type_21', 'soil_type_22', 'soil_type_23', 'soil_type_27', 'soil_type_28', 'soil_type_32', 'soil_type_33', 'soil_type_34', 'soil_type_37', 'soil_type_38', 'soil_type_39']


### GA

In [83]:
print('Size:', ga_best_results.iloc[0,1], \
      '& Attrs:', selected_attr(ga, ga_best_results))

Size: 17 & Attrs: ['elevation', 'vert_dist_hydro', 'horiz_dist_road', 'wild_area_0', 'soil_type_9', 'soil_type_11', 'soil_type_18', 'soil_type_21', 'soil_type_22', 'soil_type_28', 'soil_type_30', 'soil_type_32', 'soil_type_34', 'soil_type_35', 'soil_type_37', 'soil_type_38', 'soil_type_39']


### Correlation filter

In [84]:
print('Size:', filtered_best_results.iloc[0,1], \
      '& Attrs:', selected_attr(filtered, filtered_best_results))

Size: 32 & Attrs: ['slope', 'horiz_dist_hydro', 'vert_dist_hydro', 'horiz_dist_road', 'hill_shade_noon', 'hill_shade_15', 'horiz_dist_fire', 'wild_area_0', 'wild_area_2', 'wild_area_3', 'soil_type_3', 'soil_type_4', 'soil_type_8', 'soil_type_9', 'soil_type_11', 'soil_type_12', 'soil_type_16', 'soil_type_18', 'soil_type_19', 'soil_type_20', 'soil_type_21', 'soil_type_22', 'soil_type_23', 'soil_type_28', 'soil_type_30', 'soil_type_31', 'soil_type_32', 'soil_type_34', 'soil_type_36', 'soil_type_37', 'soil_type_38', 'soil_type_39']
