# Data Mining Coursework - Spring 2023

Katerina Marie (Katya) Reichert - 33781583

I worked and submitted alone :)

# Part 1: Nearest Neighbor Algorithm

### Section A

You are required to write Python code implementing the simplest Nearest Neighbour algorithm (that is, using just 1 neighbour) with the Minkowski distance, both discussed in the week 1 lecture.

Your code will read the power q appearing in the Minkowski distance and will classify each record from the test dataset based on the training dataset. Remember, to classify a record from the test set you need to find its nearest neighbour in the training set (the one that minimizes the distance to the test set record); take the class of the nearest neighbour as the predicted class for the test set record.

After classifying all the records in the test set, your code needs to calculate and display the **accuracy, recall, precision, and F1 measure with respect to the class "M"** (which is assumed to be the positive class) of the predictions on the test dataset. Run your code to produce results first for **Manhattan distance** and then for **Euclidean distance**, which are particular cases of the Minkowski distance (q=1 and q=2, see the week 1 lecture).

In [22]:
import pandas as pd
import numpy as np

from sklearn.metrics import f1_score
from sklearn.metrics import precision_recall_fscore_support
from sklearn.metrics import accuracy_score

import sys
import os

## Training Data

### Read in training csv

In [2]:
df = pd.read_csv(os.path.join('csv/', 'sonar_train.csv'))
print(df.shape)
df.head()

(139, 61)


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A52,A53,A54,A55,A56,A57,A58,A59,A60,Class
0,0.0079,0.0086,0.0055,0.025,0.0344,0.0546,0.0528,0.0958,0.1009,0.124,...,0.0176,0.0127,0.0088,0.0098,0.0019,0.0059,0.0058,0.0059,0.0032,R
1,0.0599,0.0474,0.0498,0.0387,0.1026,0.0773,0.0853,0.0447,0.1094,0.0351,...,0.0013,0.0005,0.0227,0.0209,0.0081,0.0117,0.0114,0.0112,0.01,M
2,0.0093,0.0269,0.0217,0.0339,0.0305,0.1172,0.145,0.0638,0.074,0.136,...,0.0212,0.0091,0.0056,0.0086,0.0092,0.007,0.0116,0.006,0.011,R
3,0.0151,0.032,0.0599,0.105,0.1163,0.1734,0.1679,0.1119,0.0889,0.1205,...,0.0061,0.0015,0.0084,0.0128,0.0054,0.0011,0.0019,0.0023,0.0062,R
4,0.0317,0.0956,0.1321,0.1408,0.1674,0.171,0.0731,0.1401,0.2083,0.3513,...,0.0201,0.0248,0.0131,0.007,0.0138,0.0092,0.0143,0.0036,0.0103,R


Sort training data in ascending order for easier searching

In [3]:
df.sort_values(by=['A' + str(x) for x in range(1,61)], inplace=True)
df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A52,A53,A54,A55,A56,A57,A58,A59,A60,Class
136,0.0015,0.0186,0.0289,0.0195,0.0515,0.0817,0.1005,0.0124,0.1168,0.1476,...,0.0108,0.0075,0.0089,0.0036,0.0029,0.0013,0.001,0.0032,0.0047,M
66,0.0025,0.0309,0.0171,0.0228,0.0434,0.1224,0.1947,0.1661,0.1368,0.143,...,0.0149,0.0077,0.0036,0.0114,0.0085,0.0101,0.0016,0.0028,0.0014,R
40,0.0036,0.0078,0.0092,0.0387,0.053,0.1197,0.1243,0.1026,0.1239,0.0888,...,0.0119,0.0055,0.0035,0.0036,0.0004,0.0018,0.0049,0.0024,0.0016,R
12,0.005,0.0017,0.027,0.045,0.0958,0.083,0.0879,0.122,0.1977,0.2282,...,0.0165,0.0056,0.001,0.0027,0.0062,0.0024,0.0063,0.0017,0.0028,M
7,0.0065,0.0122,0.0068,0.0108,0.0217,0.0284,0.0527,0.0575,0.1054,0.1109,...,0.0069,0.0025,0.0027,0.0052,0.0036,0.0026,0.0036,0.0006,0.0035,R


No additional preprocessing is required: all 60 attributes are numeric and appropriately scaled. Since we are not fine-tuning any hyperparameters for the model, we don't need cross-validation or model testing on the training data. We can go directly to final predictions on the test data.

## Model Creation

### Create function for calculating the Minkowski distance

I added an optional parameter `q` so the power can be changed easily. It defaults to q=1, which gives the Manhattan distance; q=2 gives the Euclidean distance.

In [4]:
def Minkowski(vec1, vec2, q=1):
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.sum(np.abs(vec1 - vec2)**q)**(1/q)
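
As a quick sanity check on the distance function, the sketch below evaluates it on a pair of made-up 2-D points (they are an assumption for illustration, not taken from the sonar data). With q=1 the result is the Manhattan distance, with q=2 the Euclidean distance, and as q grows the value approaches the largest single-coordinate difference.

```python
import numpy as np

def Minkowski(vec1, vec2, q=1):
    # Same definition as in the cell above, repeated so this block runs on its own
    vec1, vec2 = np.array(vec1), np.array(vec2)
    return np.sum(np.abs(vec1 - vec2)**q)**(1/q)

# Hypothetical example points; the coordinate differences are 3 and 4
a, b = [0.0, 0.0], [3.0, 4.0]

manhattan = Minkowski(a, b, q=1)   # |3| + |4|
euclidean = Minkowski(a, b, q=2)   # sqrt(3^2 + 4^2)
high_q = Minkowski(a, b, q=20)     # slightly above max(3, 4)

print(manhattan, euclidean, round(high_q, 3))
```

Because the ranking of neighbours can change with q, the same test record may end up with a different nearest neighbour under different powers.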

### Create nearest-neighbor prediction function

__Input__: 

- examples: Pandas DataFrame -- the examples to create predictions on, where each example has a shape of (1, 60)
- q (optional): power for Minkowski Distance (defaults to Manhattan distance)

__Output__: List[str] -- the corresponding predictions in order 

In [5]:
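def predict_class(examples, q=1):
    # Brute-force 1-NN: compare each example against every row of the
    # global training dataframe `df` and keep the class of the closest one.
    predictions = []
    
    for i, row in examples.iterrows():
        
        # Start from an infinite distance so the first comparison always wins
        nn_class, distance = None, np.inf
        
        for j, comp in df.iterrows():
            
            # The last column of `df` is 'Class'; everything before it is a feature
            d = Minkowski(row, comp.iloc[:-1], q)
            
            if d < distance:
                nn_class, distance = comp.iloc[-1], d
                
        predictions.append(nn_class)
    
    return predictions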
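
The nested `iterrows` loops above are easy to follow but slow on larger datasets. As an alternative, here is a minimal vectorized sketch of the same 1-NN rule using NumPy broadcasting. The function name `predict_1nn` and the toy arrays are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def predict_1nn(train_X, train_y, test_X, q=1):
    """Return the label of the closest training row for every test row."""
    train_X = np.asarray(train_X, dtype=float)
    test_X = np.asarray(test_X, dtype=float)
    train_y = np.asarray(train_y)
    # diffs[i, j, k] = |test_X[i, k] - train_X[j, k]| via broadcasting
    diffs = np.abs(test_X[:, None, :] - train_X[None, :, :])
    # Minkowski distance between every test row i and training row j
    dists = np.sum(diffs ** q, axis=2) ** (1 / q)
    # argmin keeps the first minimum, like the strict `<` in the loop version
    nearest = np.argmin(dists, axis=1)
    return train_y[nearest].tolist()

# Toy data, made up for illustration (not the sonar set)
train_X = [[0.0, 0.0], [1.0, 1.0], [0.9, 0.8]]
train_y = ['R', 'M', 'M']
test_X = [[0.1, 0.2], [0.8, 0.9]]

pred_q1 = predict_1nn(train_X, train_y, test_X, q=1)
pred_q2 = predict_1nn(train_X, train_y, test_X, q=2)
print(pred_q1, pred_q2)
```

With the notebook's dataframes, the inputs would presumably be `df.drop(columns='Class').to_numpy()`, `df['Class'].to_numpy()`, and `test_df.drop(columns='Class').to_numpy()`. The intermediate array has one entry per (test row, training row, feature) triple, which is small for this dataset.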

## Predictions

### Read in test data

In [6]:
test_df = pd.read_csv(os.path.join('csv/', 'sonar_test.csv'))
print(test_df.shape)
test_df.head()

(69, 61)


Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A52,A53,A54,A55,A56,A57,A58,A59,A60,Class
0,0.0125,0.0152,0.0218,0.0175,0.0362,0.0696,0.0873,0.0616,0.1252,0.1302,...,0.0041,0.0074,0.003,0.005,0.0048,0.0017,0.0041,0.0086,0.0058,R
1,0.053,0.0885,0.1997,0.2604,0.3225,0.2247,0.0617,0.2287,0.095,0.074,...,0.0244,0.0199,0.0257,0.0082,0.0151,0.0171,0.0146,0.0134,0.0056,M
2,0.0368,0.0279,0.0103,0.0566,0.0759,0.0679,0.097,0.1473,0.2164,0.2544,...,0.0105,0.0024,0.0018,0.0057,0.0092,0.0009,0.0086,0.011,0.0052,M
3,0.0164,0.0173,0.0347,0.007,0.0187,0.0671,0.1056,0.0697,0.0962,0.0251,...,0.009,0.0223,0.0179,0.0084,0.0068,0.0032,0.0035,0.0056,0.004,R
4,0.0216,0.0124,0.0174,0.0152,0.0608,0.1026,0.1139,0.0877,0.116,0.0866,...,0.0052,0.0049,0.0096,0.0134,0.0122,0.0047,0.0018,0.0006,0.0023,R


### Create Predictions

Run prediction code on the test dataset for **Manhattan Distance**

In [7]:
m_predictions = predict_class(test_df.drop(columns='Class'))

Run prediction code on the test dataset for **Euclidean Distance**

In [8]:
e_predictions = predict_class(test_df.drop(columns='Class'), q=2)

Append both sets of predicted classes to the test dataframe

In [9]:
test_df['Manhattan Distance'] = m_predictions
test_df['Euclidean Distance'] = e_predictions

test_df.head()

Unnamed: 0,A1,A2,A3,A4,A5,A6,A7,A8,A9,A10,...,A54,A55,A56,A57,A58,A59,A60,Class,Manhattan Distance,Euclidean Distance
0,0.0125,0.0152,0.0218,0.0175,0.0362,0.0696,0.0873,0.0616,0.1252,0.1302,...,0.003,0.005,0.0048,0.0017,0.0041,0.0086,0.0058,R,R,R
1,0.053,0.0885,0.1997,0.2604,0.3225,0.2247,0.0617,0.2287,0.095,0.074,...,0.0257,0.0082,0.0151,0.0171,0.0146,0.0134,0.0056,M,M,M
2,0.0368,0.0279,0.0103,0.0566,0.0759,0.0679,0.097,0.1473,0.2164,0.2544,...,0.0018,0.0057,0.0092,0.0009,0.0086,0.011,0.0052,M,M,M
3,0.0164,0.0173,0.0347,0.007,0.0187,0.0671,0.1056,0.0697,0.0962,0.0251,...,0.0179,0.0084,0.0068,0.0032,0.0035,0.0056,0.004,R,R,R
4,0.0216,0.0124,0.0174,0.0152,0.0608,0.1026,0.1139,0.0877,0.116,0.0866,...,0.0096,0.0134,0.0122,0.0047,0.0018,0.0006,0.0023,R,R,R


## Result Analysis

Make a cleaner dataframe for visualization

In [10]:
results_df = test_df.drop(columns=['A' + str(x) for x in range(1,61)])
results_df.head()

| | Class | Manhattan Distance | Euclidean Distance |
|---|---|---|---|
| 0 | R | R | R |
| 1 | M | M | M |
| 2 | M | M | M |
| 3 | R | R | R |
| 4 | R | R | R |


For each distance, display the accuracy, recall, precision, and F1 measure with respect to the class "M"

#### Accuracy function:

In [11]:
def accuracy(actual, pred):
    return sum(map(lambda x, y: int(x == y), actual, pred))/len(actual)

#### Precision function:

Precision = TruePositives / (TruePositives + FalsePositives) with "M" as positive class

In [12]:
def precision(col_name):
    
    actual = results_df.loc[results_df[col_name] == 'M']['Class']
    pred = results_df.loc[results_df[col_name] == 'M'][col_name]
    
    true_pos = sum(map(lambda x, y: int(x == y), actual, pred))/len(actual)
    false_pos = sum(map(lambda x, y: int(x != y), actual, pred))/len(actual)
    
    return true_pos/(true_pos + false_pos)

#### Recall function:

Recall = TruePositives / (TruePositives + FalseNegatives)

In [13]:
def recall(col_name):
    
    actual = results_df.loc[results_df['Class'] == 'M']['Class']
    pred = results_df.loc[results_df['Class'] == 'M'][col_name]
    
    true_pos = sum(map(lambda x, y: int(x == y), actual, pred))/len(actual)
    false_neg = sum(map(lambda x, y: int(x != y), actual, pred))/len(actual)
    
    return true_pos/(true_pos + false_neg)

#### F1 Measure function:

F1-Measure = (2 * Precision * Recall) / (Precision + Recall)

In [14]:
def f1(col_name):
    
    p = precision(col_name)
    r = recall(col_name)
    
    return (2 * p * r)/(p + r)
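
The custom functions above obtain true positives, false positives, and false negatives by filtering the dataframe. A minimal sketch of the same bookkeeping, tallying all four confusion-matrix cells in a single pass over two label lists, is shown below. The helper name `confusion_counts` and the six toy labels are hypothetical and only serve to make the definitions concrete.

```python
from collections import Counter

def confusion_counts(actual, predicted, positive='M'):
    """Count TP, FP, FN, TN with `positive` as the positive class."""
    counts = Counter()
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted positive: correct if the true label is also positive
            counts['tp' if a == positive else 'fp'] += 1
        else:
            # Predicted negative: a miss if the true label was positive
            counts['fn' if a == positive else 'tn'] += 1
    return counts['tp'], counts['fp'], counts['fn'], counts['tn']

# Toy labels, made up for illustration
actual    = ['M', 'M', 'R', 'R', 'M', 'R']
predicted = ['M', 'R', 'M', 'R', 'M', 'R']

tp, fp, fn, tn = confusion_counts(actual, predicted)
print('TP, FP, FN, TN:', tp, fp, fn, tn)
print('precision:', tp / (tp + fp))
print('recall:', tp / (tp + fn))
```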

### Manhattan Distance

Using functions from scikit-learn

In [15]:
a = accuracy_score(results_df['Class'], results_df['Manhattan Distance'])
p,r,f,s = precision_recall_fscore_support(results_df['Class'], results_df['Manhattan Distance'], labels=['M'])

print('accuracy:', a)
print('precision:', p[0])
print('recall:', r[0])
print('F1 measure:', f[0])

accuracy: 0.8840579710144928
precision: 0.8536585365853658
recall: 0.9459459459459459
F1 measure: 0.8974358974358975


Using my custom functions

In [16]:
m_acc = accuracy(results_df['Class'], results_df['Manhattan Distance'])
m_prec = precision('Manhattan Distance')
m_rec = recall('Manhattan Distance')
m_f1 = f1('Manhattan Distance')

print('accuracy:', m_acc)
print('precision:', m_prec)
print('recall:', m_rec)
print('f1 measure:', m_f1)

accuracy: 0.8840579710144928
precision: 0.8536585365853658
recall: 0.9459459459459459
f1 measure: 0.8974358974358975


### Euclidean Distance

Using functions from scikit-learn

In [17]:
a = accuracy_score(results_df['Class'], results_df['Euclidean Distance'])
p,r,f,s = precision_recall_fscore_support(results_df['Class'], results_df['Euclidean Distance'], labels=['M'])

print('accuracy:', a)
print('precision:', p[0])
print('recall:', r[0])
print('F1 measure:', f[0])

accuracy: 0.8985507246376812
precision: 0.8571428571428571
recall: 0.972972972972973
F1 measure: 0.9113924050632912


Using my custom functions

In [18]:
e_acc = accuracy(results_df['Class'], results_df['Euclidean Distance'])
e_prec = precision('Euclidean Distance')
e_rec = recall('Euclidean Distance')
e_f1 = f1('Euclidean Distance')

print('accuracy:', e_acc)
print('precision:', e_prec)
print('recall:', e_rec)
print('f1 measure:', e_f1)

accuracy: 0.8985507246376812
precision: 0.8571428571428571
recall: 0.972972972972973
f1 measure: 0.9113924050632912


## Results Summary

In [26]:
d = {'Manhattan Distance': np.around([m_acc, m_prec, m_rec, m_f1], decimals=3), 
     'Euclidean Distance': np.around([e_acc, e_prec, e_rec, e_f1], decimals=3)}

pd.DataFrame(data=d, index=['Accuracy', 'Precision', 'Recall', 'F1 Measure'])

| | Manhattan Distance | Euclidean Distance |
|---|---|---|
| Accuracy | 0.884 | 0.899 |
| Precision | 0.854 | 0.857 |
| Recall | 0.946 | 0.973 |
| F1 Measure | 0.897 | 0.911 |


On every measure, Euclidean distance outperforms Manhattan distance on this test set, although the precision gap is small (0.857 vs 0.854).
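
To see where the difference comes from, the sketch below recomputes the four measures from confusion-matrix counts. The notebook does not print these counts; they are inferred here from the reported scores and the 69-row test set (for example, a Manhattan recall of 0.9459 corresponds to 35/37 and a precision of 0.8537 to 35/41), so treat them as an assumption. The helper name `metrics_from_counts` is also hypothetical.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Counts inferred from the reported scores (assumption, see text above)
manhattan = metrics_from_counts(tp=35, fp=6, fn=2, tn=26)
euclidean = metrics_from_counts(tp=36, fp=6, fn=1, tn=26)

print([round(v, 3) for v in manhattan])  # [0.884, 0.854, 0.946, 0.897]
print([round(v, 3) for v in euclidean])  # [0.899, 0.857, 0.973, 0.911]
```

Under these inferred counts, the two distances differ by a single test record: Euclidean distance recovers one more "M" example that Manhattan distance misclassifies, while the number of false positives stays the same.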

### Section B

Run your code for the power q as a positive integer number from **1 to 20** and **display the accuracy, recall, precision, and F1 measure** on the test set in a chart. Which value of q leads to the best accuracy on the test set?