## Imports

In [101]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import unittest as ut
import math


## Data Loading

In [50]:
wine_train = pd.read_csv("data/part1/wine-training", delimiter=" ")
wine_test = pd.read_csv("data/part1/wine-test", delimiter=" ")

# check it's loaded in correctly
display(wine_train.head())
display(wine_test.head())

Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,OD280/OD315_of_diluted_wines,Proline,Class
0,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0,1
1,12.25,1.73,2.12,19.0,80.0,1.65,2.03,0.37,1.63,3.4,1.0,3.17,510.0,2
2,11.82,1.47,1.99,20.8,86.0,1.98,1.6,0.3,1.53,1.95,0.95,3.33,495.0,2
3,13.05,2.05,3.22,25.0,124.0,2.63,2.68,0.47,1.92,3.58,1.13,3.2,830.0,1
4,13.29,1.97,2.68,16.8,102.0,3.0,3.23,0.31,1.66,6.0,1.07,2.84,1270.0,1


Unnamed: 0,Alcohol,Malic_acid,Ash,Alcalinity_of_ash,Magnesium,Total_phenols,Flavanoids,Nonflavanoid_phenols,Proanthocyanins,Color_intensity,Hue,OD280/OD315_of_diluted_wines,Proline,Class
0,12.7,3.55,2.36,21.5,106.0,1.7,1.2,0.17,0.84,5.0,0.78,1.29,600.0,3
1,12.2,3.03,2.32,19.0,96.0,1.25,0.49,0.4,0.73,5.5,0.66,1.83,510.0,3
2,14.13,4.1,2.74,24.5,96.0,2.05,0.76,0.56,1.35,9.2,0.61,1.6,560.0,3
3,13.05,1.65,2.55,18.0,98.0,2.45,2.43,0.29,1.44,4.25,1.12,2.51,1105.0,1
4,14.19,1.59,2.48,16.5,108.0,3.3,3.93,0.32,1.86,8.7,1.23,2.82,1680.0,1


## Data Exploration 

Let's try to understand our data first.

Neither the training set nor the test set contains missing values: every check below returns empty index arrays.

In [51]:
# check for missing values (pd.isna is an alias of pd.isnull, so the last two calls repeat the first two)
display(np.where(pd.isnull(wine_train)))
display(np.where(pd.isnull(wine_test)))
display(np.where(pd.isna(wine_train)))
display(np.where(pd.isna(wine_test)))

(array([], dtype=int64), array([], dtype=int64))

(array([], dtype=int64), array([], dtype=int64))

(array([], dtype=int64), array([], dtype=int64))

(array([], dtype=int64), array([], dtype=int64))
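
A more compact way to run the same check is to count missing values per column. This is only a sketch: the small `toy` frame below is made up for illustration and is not part of the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the wine data), with one deliberate gap.
toy = pd.DataFrame({"Alcohol": [14.37, np.nan, 11.82], "Class": [1, 2, 2]})

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isna().any().any())
print(has_missing)
```

Applied to `wine_train` and `wine_test`, a result of all zeros would agree with the empty arrays shown above.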

Data Questions:
* I'm unsure what 'Class' represents here.
* We have 14 columns, but I expected only 13 features.

In [52]:
# check data types
print(wine_train.dtypes)
print(wine_train.shape)

Alcohol                         float64
Malic_acid                      float64
Ash                             float64
Alcalinity_of_ash               float64
Magnesium                       float64
Total_phenols                   float64
Flavanoids                      float64
Nonflavanoid_phenols            float64
Proanthocyanins                 float64
Color_intensity                 float64
Hue                             float64
OD280/OD315_of_diluted_wines    float64
Proline                         float64
Class                             int64
dtype: object
(89, 14)


After checking the data further, I have found:
* 'Class' is the type of wine, i.e. the label to predict. It is nominal, so I will convert it to a categorical type.
* 'Class' is the extra (14th) column; the other 13 columns are the features.

In [59]:
wine_train['Class'] = wine_train['Class'].astype('category')
wine_train.dtypes

Alcohol                          float64
Malic_acid                       float64
Ash                              float64
Alcalinity_of_ash                float64
Magnesium                        float64
Total_phenols                    float64
Flavanoids                       float64
Nonflavanoid_phenols             float64
Proanthocyanins                  float64
Color_intensity                  float64
Hue                              float64
OD280/OD315_of_diluted_wines     float64
Proline                          float64
Class                           category
dtype: object

There are only 3 classes in our dataset.

In [61]:
wine_train['Class'].unique()

[1, 2, 3]
Categories (3, int64): [1, 2, 3]

Split the training data into predictor (X) and response (y) variables.

In [77]:
train_X = wine_train.drop('Class', axis=1)
train_y = wine_train['Class']

## Implement KNN

KNN classifies an observation by finding the k training observations closest to it, measured across all features, and taking the majority class among them.
* Use Euclidean distance to measure how far apart two observations are across all n features, using the following formula taken from "https://en.wikipedia.org/wiki/Euclidean_distance":  
$d(p,q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}$
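
The classification rule described above can be sketched as follows. This is a minimal illustration rather than the final implementation: `knn_predict`, the toy points, and the use of `math.dist` (Python 3.8+) as a stand-in distance are all assumptions, and ties are broken by whichever class was counted first.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest training rows."""
    if not 1 <= k <= len(train_X):
        raise ValueError("k must be between 1 and the number of training rows.")
    # Pair each training row's distance to the query with that row's label.
    distances = [(math.dist(row, query), label) for row, label in zip(train_X, train_y)]
    # Keep the k closest rows (sort by distance only).
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote over the labels of those k rows.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-feature toy data, not taken from the wine set.
toy_X = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [0.9, 1.1]]
toy_y = [1, 1, 2, 2, 1]
prediction = knn_predict(toy_X, toy_y, [1.1, 1.0], k=3)
print(prediction)
```

The sketch expects plain lists of rows, so a DataFrame would need converting first (for example with `.values.tolist()`).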

In [123]:
def distance(ob1: list[float], ob2: list[float]) -> float:
    """Calculates the Euclidean distance between two observations across all of their features."""
    if len(ob1) != len(ob2):
        raise ValueError("The number of features is not the same for both observations.")
    # accumulate the squared difference for each feature (avoids shadowing the built-in sum)
    total: float = 0.0
    for a, b in zip(ob1, ob2):
        total += (a - b) ** 2
    return math.sqrt(total)

2.0
2.8284271247461903
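
The two values above match simple cases that can be checked by hand. The calls that produced them are not shown, so the inputs below are an assumption; `distance` is repeated in compact form so the block runs on its own.

```python
import math


def distance(ob1, ob2):
    """Euclidean distance between two equal-length observations."""
    if len(ob1) != len(ob2):
        raise ValueError("The number of features is not the same for both observations.")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(ob1, ob2)))


# Assumed inputs: one feature differs by 2, then both features differ by 2.
d_one_axis = distance([0.0, 0.0], [2.0, 0.0])   # sqrt(4)
d_two_axes = distance([0.0, 0.0], [2.0, 2.0])   # sqrt(8)
print(d_one_axis)
print(d_two_axes)

# Mismatched lengths should be rejected.
try:
    distance([1.0], [1.0, 2.0])
    raised = False
except ValueError:
    raised = True
print(raised)
```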


In [148]:
# test: distance between the first two training observations
p1 = train_X.loc[0, :].tolist()
p2 = train_X.loc[1, :].tolist()
distance(p1, p2)

970.5798659564291
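
As a cross-check, the same distance can be computed with NumPy from the first two training rows shown in the `head()` output earlier. This is a sketch that assumes only the 13 feature columns are used (no index, no `Class`); the variable names are illustrative.

```python
import numpy as np

# First two training rows from the head() output (13 features, no Class).
row0 = np.array([14.37, 1.95, 2.5, 16.8, 113.0, 3.85, 3.49, 0.24, 2.18, 7.8, 0.86, 3.45, 1480.0])
row1 = np.array([12.25, 1.73, 2.12, 19.0, 80.0, 1.65, 2.03, 0.37, 1.63, 3.4, 1.0, 3.17, 510.0])

# Vectorised Euclidean distance.
d = float(np.linalg.norm(row0 - row1))
print(d)

# Share of the squared distance contributed by the last feature (Proline).
squared_diffs = (row0 - row1) ** 2
proline_share = float(squared_diffs[-1] / squared_diffs.sum())
print(proline_share)
```

Proline alone differs by 970 between these two rows, so it accounts for almost all of the distance. Rescaling the features before computing distances is a common way to handle this in KNN, though whether to do so here is a separate decision.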