# K Nearest Neighbors (Penguin Predictor Version 2)

In this version of our penguin predictor, we will make three improvements over version 1:
1. Split our data in half for a more rigorous test of accuracy
2. Add a hyperparameter K to our nearest neighbor algorithm
3. Choose the best value for K to get the highest accuracy

## 1. Load our data

Load our penguins data. This time, we will split it into a set of training data and a set of testing data. Testing on penguins the algorithm has not seen during training gives a more honest estimate of how well our KNN algorithm works.

Follow the code in the slides and [see here](https://realpython.com/train-test-split-python-data/) for more information on train_test_split.
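
As a small illustration of what a 50/50 split does, the sketch below runs train_test_split on a made-up six-row table. The values and the `toy_` names are invented for this example and do not come from penguins.csv. Fixing random_state makes the shuffle repeatable from run to run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data invented for illustration only (not from penguins.csv)
toy = pd.DataFrame({
    "flipper_length_mm": [181, 186, 195, 210, 215, 220],
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"],
})

toy_X = toy[["flipper_length_mm"]]
toy_y = toy["species"]

# test_size=0.5 sends half of the rows to the test set
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.5, random_state=42
)

# Both halves have the same number of rows
print(len(toy_X_train), len(toy_X_test))

# Features and labels stay paired: they share the same row labels
print(list(toy_X_train.index) == list(toy_y_train.index))
```

Because the row labels stay aligned, a label can later be looked up by the index of its feature row, which is what y_train.loc[index] relies on.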

In [19]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# TODO 1: Load our penguins data as a dataframe using pd.read_csv
data = pd.read_csv("penguins.csv")


# TODO 2 This time, we will split our data in half, with 50% being testing data and 50% being training data
X = data[["culmen_length_mm", "culmen_depth_mm", "flipper_length_mm", "body_mass_g"]]
y = data["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.5, random_state = 42)



In [20]:
# run this to see if you got it correct
assert np.isclose(len(X_train), .5*len(data)), f"\033[91mExpected {.5*len(data)} but got {len(X_train)}\033[0m"
assert np.isclose(len(X_test), .5*len(data)), f"\033[91mExpected {.5*len(data)} but got {len(X_test)}\033[0m"

## 2. Choose our algorithm

Create a function called k_nearest_neighbor which should
   1. Accept 4 parameters: unknown, K, X_train and y_train
   2. Calculate the distance from an unknown penguin to every point in X_train (our 50% training data)
   3. Find the K closest neighbors and print out their distances
   4. Return the species that wins the majority vote among those K neighbors

For a more detailed outline of KNN, see [here](https://www.geeksforgeeks.org/machine-learning/k-nearest-neighbours/)
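
The steps above can be sketched on a tiny made-up dataset before tackling the real one. This is a minimal illustration, not the assignment solution: `toy_train` and `toy_knn` are hypothetical names, the points are invented, and it uses plain tuples with two measurements instead of a dataframe.

```python
import math
from collections import Counter

# Made-up training points: (flipper_length_mm, body_mass_g) paired with a species
toy_train = [
    ((181.0, 3750.0), "Adelie"),
    ((186.0, 3800.0), "Adelie"),
    ((195.0, 3650.0), "Adelie"),
    ((217.0, 5200.0), "Gentoo"),
    ((221.0, 5400.0), "Gentoo"),
]

def toy_knn(unknown, k, train):
    """Return the majority species among the k training points closest to unknown."""
    # Step 2: distance from the unknown point to every training point
    distances = [(math.dist(unknown, point), species) for point, species in train]
    # Step 3: keep the k smallest distances and print them
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    for distance, species in nearest:
        print(f"{species}: {distance:.1f}")
    # Step 4: majority vote over the labels of the k nearest points
    votes = Counter(species for _, species in nearest)
    return votes.most_common(1)[0][0]

print(toy_knn((190.0, 3700.0), 3, toy_train))
```

Here the three closest points are all Adelie, so the vote is unanimous. With an even K or more than two species, votes can tie, and the winner then depends on how the tie is broken.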

In [21]:
def euclidean_distance(penguin1, penguin2):
    # Euclidean distance over three measurements.
    # Note: culmen_depth_mm is a column of X but is not part of this distance.
    distance = np.sqrt(
        np.square(penguin1["culmen_length_mm"] - penguin2["culmen_length_mm"])
        + np.square(penguin1["flipper_length_mm"] - penguin2["flipper_length_mm"])
        + np.square(penguin1["body_mass_g"] - penguin2["body_mass_g"])
    )
    return distance

def k_nearest_neighbor(unknown, K, X_train, y_train):
    # Collect a (distance, species) pair for every training penguin
    distances = []

    for index, row in X_train.iterrows():
        distance = euclidean_distance(unknown, row)
        species = y_train.loc[index]
        distances.append((distance, species))
    
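    # Sort by distance so the K nearest neighbors come first
    distances.sort(key=lambda penguin: penguin[0])

    # Tally one vote per neighbor, then return the species with the most votes
    votes = {"Adelie": 0, "Gentoo": 0, "Chinstrap": 0}
    for i in range(K):
        votes[distances[i][1]] += 1

    return max(votes, key=votes.get)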


In [22]:
# run this as a sanity check to see if you implemented the above block somewhat correctly
# passing does NOT guarantee that it is correct
penguin1 = data.iloc[0]
prediction = k_nearest_neighbor(penguin1, 3, X_train, y_train)
assert prediction == "Adelie"

## 3. Testing/Tuning

Now let's test our new accuracy. Use X_test and y_test to check your accuracy using your compute_accuracy function.

In [23]:
# TODO 4 Paste in your compute_accuracy function
def compute_accuracy(predictions, answers):
    correct = 0
    incorrect = 0

    for i in range(len(predictions)):
        if predictions[i] == answers[i]:
            correct += 1
        else:
            incorrect += 1
    
    accuracy = (correct / (correct + incorrect)) * 100
    return round(accuracy, 2)
    

In [24]:
# TODO 5 use your k_nearest_neighbor algorithm to predict the species of X_test
# Calculate the accuracy against y_test
# Find the value of k that leads to the highest accuracy.

answers = y_test.tolist()
predictions = []
for i in range(len(X_test)):
    predictions.append(k_nearest_neighbor(X_test.iloc[i], 7, X_train, y_train))

print(compute_accuracy(predictions, answers))

# Best K found: 7

76.74
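
One way to carry out the search for the best K is to loop over several candidates and record the accuracy of each. The sketch below does this on a tiny made-up one-feature dataset. `toy_knn` and `toy_accuracy` are simplified stand-ins for the functions above, and none of the names or numbers come from the penguins data.

```python
import math
from collections import Counter

def toy_knn(unknown, k, train):
    """Majority label among the k training points nearest to unknown."""
    nearest = sorted(train, key=lambda item: math.dist(unknown, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def toy_accuracy(predictions, answers):
    """Percentage of predictions that match the answers."""
    correct = sum(p == a for p, a in zip(predictions, answers))
    return round(100 * correct / len(answers), 2)

# Invented data: one measurement per point, with a stray "B" inside the "A" region
toy_train = [
    ((1.0,), "A"), ((2.0,), "A"), ((3.0,), "A"), ((3.5,), "B"),
    ((6.0,), "B"), ((7.0,), "B"), ((8.0,), "B"),
]
toy_test = [((3.4,), "A"), ((1.5,), "A"), ((7.5,), "B"), ((6.5,), "B")]

answers = [label for _, label in toy_test]
results = {}
for k in (1, 3, 5, 7):
    predictions = [toy_knn(point, k, toy_train) for point, _ in toy_test]
    results[k] = toy_accuracy(predictions, answers)

# max returns the first K with the top accuracy, so ties go to the smaller K
best_k = max(results, key=results.get)
print(results)
print("best K:", best_k)
```

In this toy case K = 1 is misled by the stray point and K = 7 uses every training point, so it always predicts the larger class. The middle values do best. On the penguins data the same loop would call k_nearest_neighbor on each row of X_test and pass the predictions to compute_accuracy.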
