### This is a simple notebook to build and visualize the kNN algorithm.

It accompanies Chapter 2 of the book.

Copyright: Viviana Acquaviva (2023)

Additions and Modifications by Julieta Gruszko (2025)

License: [BSD-3-clause](https://opensource.org/license/bsd-3-clause/)

#### List the names of your group members below:

In [None]:
import numpy as np

import matplotlib

import matplotlib.pyplot as plt

import matplotlib.patches as mpatches

import pandas as pd 

import sklearn

from sklearn.tree import DecisionTreeClassifier # how methods are imported 

from sklearn import metrics # this will give us access to evaluation metrics

from sklearn import neighbors # here comes the method of the day

In [None]:
font = {'size'   : 20}
matplotlib.rc('font', **font)
matplotlib.rc('xtick', labelsize=20) 
matplotlib.rc('ytick', labelsize=20) 
matplotlib.rcParams['figure.dpi'] = 300

### Read in data from file

In [None]:
LearningSet = pd.read_csv('../Data/HPLearningSet.csv')

LearningSet = LearningSet.drop(LearningSet.columns[0], axis=1) #We want to drop the first column of the file

In [None]:
#By now we know data frames

LearningSet 

### Let's pick the same initial train/test set we had in the previous exercise, and split into the same features and labels.

In [None]:
TrainSet =  LearningSet.iloc[:13,:] #.iloc is used to slice data frames using positional indices

TestSet = LearningSet.iloc[13:,:]

Xtrain = TrainSet.drop(['P_NAME','P_HABITABLE'],axis=1) #This contains stellar mass, period, and distance

Xtest = TestSet.drop(['P_NAME','P_HABITABLE'],axis=1)  #This contains stellar mass, period, and distance

ytrain = TrainSet.P_HABITABLE #This contains the ground truth label, or output

ytest = TestSet.P_HABITABLE #This contains the ground truth  label, or output
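
As a quick illustration of the slicing and dropping pattern used above, here is a minimal sketch on a made-up table. The column names mirror the ones in this notebook, but the values and the variable names (`toy`, `train`, `test`, `X_train`, `y_train`) are hypothetical and are not taken from `HPLearningSet.csv`.

```python
import pandas as pd

# Hypothetical toy table; the column names mirror the notebook, the values are made up
toy = pd.DataFrame({
    "P_NAME": ["a", "b", "c", "d", "e"],
    "S_MASS": [0.1, 0.5, 1.0, 1.2, 0.8],
    "P_PERIOD": [5.0, 20.0, 365.0, 400.0, 90.0],
    "P_HABITABLE": [0, 0, 1, 1, 0],
})

train = toy.iloc[:3, :]   # rows at positions 0, 1, 2 (the end index is excluded)
test = toy.iloc[3:, :]    # rows from position 3 to the end

X_train = train.drop(["P_NAME", "P_HABITABLE"], axis=1)  # keep only the feature columns
y_train = train["P_HABITABLE"]                           # labels as a pandas Series

print(X_train.columns.tolist())
print(len(train), len(test))
```

Note that `.iloc` uses positions rather than index labels, so the split does not depend on how the rows happen to be labeled.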

### We are now ready to deploy the kNN (k Nearest Neighbor) algorithm.

k Nearest Neighbors is a simple algorithm based on the idea of distance: we look for the k (an integer) objects that are closest to the one we would like to classify, and take the majority vote among the class labels of those k neighbors.
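
To make the idea concrete, here is a minimal from-scratch sketch of the majority vote, assuming Euclidean distance and made-up 2D points. The function `knn_predict` and the arrays `X_toy` and `y_toy` are hypothetical names used only for illustration; this is not how scikit-learn implements it internally.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))  # Euclidean distances
    nearest = np.argsort(distances)[:k]                        # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())                 # count the labels among the neighbors
    return votes.most_common(1)[0][0]

# Made-up 2D points: class 0 clusters near the origin, class 1 near (5, 5)
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))
```

Choosing an odd k for a two-class problem avoids tied votes, which is one reason k = 3 is a convenient choice for a small binary data set.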

If you are wondering: what is even there to fit?

As we discussed in class, finding nearest neighbors can be computationally slow, especially for large data sets. 
When it comes to kNN, scikit-learn uses the "fit" method to organize the training data into a structure that makes it easier to find the nearest neighbors of future data points you want to predict labels for. That means that while $\texttt{fit}$ may take some time to run (especially for large data sets), subsequent calls to $\texttt{predict}$ will be faster. This matches our usual expectations for machine learning methods: training can be slow, but you don't need to train your model many times; using a trained model to make predictions is fast, and we often call that method repeatedly. 

You can find more information in [this post](https://stats.stackexchange.com/questions/349842/why-do-we-need-to-fit-a-k-nearest-neighbors-classifier) and in the scikit-learn documentation. 
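
In scikit-learn, the structure built by `fit` is controlled by the `algorithm` parameter of `KNeighborsClassifier` (for example `"brute"`, `"kd_tree"`, `"ball_tree"`, or the default `"auto"`). The sketch below, on synthetic data generated only for this illustration, compares a brute-force search with a k-d tree. The variable names are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))                       # synthetic features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)   # synthetic labels
X_query = rng.normal(size=(20, 2))                       # synthetic points to classify

# Same k, different search strategies set up during fit
brute = KNeighborsClassifier(n_neighbors=3, algorithm="brute").fit(X_demo, y_demo)
tree = KNeighborsClassifier(n_neighbors=3, algorithm="kd_tree").fit(X_demo, y_demo)

# Both strategies look for the same neighbors, so the predictions should agree
same = np.array_equal(brute.predict(X_query), tree.predict(X_query))
print(same)
```

The search strategy affects speed and memory use, not the definition of "nearest", so it is a performance setting rather than a modeling choice.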

First, we'll make our model object. We'll set the "k" to 3 this time, since we're working with such a small data set.

In [None]:
model = neighbors.KNeighborsClassifier(n_neighbors = 3)

In [None]:
model #Look at the model object to see hyperparameter settings

In [None]:
#If you're not using VS Code (or you want to print the information to an output file, for example), you can get the parameters and settings in a dictionary by using:
model.get_params()

### For visualization purposes, let's use only the first two features to build the model.

In [None]:
Xtrain2D = Xtrain.iloc[# your code here. Use the same training set we started with, but select only the first 2 columns

Xtest2D = Xtest.iloc[# Same for the test data set

#### Build model by applying the ".fit" method to the training set. Then predict labels for the test set.

In [None]:
# Write code to fit your kNN model to the training data set
ytestpred = # Write code to predict labels for your test data set

In [None]:
ytestpred, ytest.values #compare

#### Calculate accuracy on the train set and on the test set (train score and test score):

In [None]:
print(metrics.accuracy_score(ytrain, model.predict(Xtrain2D))) #This compares the true labels for the train set with the predicted labels for the train set

print() # do the same for the test set
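
Accuracy is simply the fraction of labels that were predicted correctly. The sketch below checks this by hand against `metrics.accuracy_score`, using made-up label arrays (`y_true` and `y_pred` are hypothetical and unrelated to the planet data).

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 1, 1, 0])  # made-up ground-truth labels
y_pred = np.array([0, 1, 0, 0])  # made-up predictions: one mistake out of four

by_hand = np.mean(y_true == y_pred)  # fraction of matching entries
print(by_hand, metrics.accuracy_score(y_true, y_pred))
```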

#### After fitting and predicting, we can access the k neighbors for each element in the test set like this:

In [None]:
model.kneighbors(Xtest2D)

Take a look at the documentation for the kneighbors method called above. What are these returned values?

### Let's now visualize our results, similarly to what we did for the DT.

First, let's just look at the data in the 2 axes we're using for the kNN algorithm.

In [None]:
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])
a = plt.scatter(TrainSet['S_MASS'], TrainSet['P_PERIOD'], marker = '*',\
            c = TrainSet['P_HABITABLE'], s = 100, label = 'Train', cmap=cmap, alpha=0.5) 


a = plt.scatter(TestSet['S_MASS'], TestSet['P_PERIOD'], marker = 'o',\
            c = TestSet['P_HABITABLE'], s = 100, label = 'Test', cmap=cmap, alpha = 0.5)

# Manually create a custom legend entry
trainmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='*', linestyle='')
testmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='o', linestyle='')
# Add the custom legend
plt.legend([trainmarker, testmarker], ['Train', 'Test'])


plt.xlabel('Mass of Parent Star (Solar Mass Units)')
plt.ylabel('Period of Orbit (days)');

We can use the distance of the third neighbor as the radius of the circle that encompasses neighbors.


In [None]:
for i in range(len(TestSet)): # cycle through elements of the test set
    
    print(model.kneighbors(Xtest2D)[0][i,2]) # this prints out the third element of the distances vector
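
The loop above calls `kneighbors` once per test point. A sketch of an equivalent pattern that calls it a single time and slices the result is shown below, on made-up points chosen so the distances are easy to verify. The names `X_demo`, `y_demo`, and `knn_demo` are hypothetical.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up training points on a line, so the distances are easy to check by eye
X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
y_demo = np.array([0, 0, 1, 1])
knn_demo = KNeighborsClassifier(n_neighbors=3).fit(X_demo, y_demo)

# Call kneighbors once and reuse the result instead of recomputing it inside a loop
distances, indices = knn_demo.kneighbors(np.array([[0.5, 0.0]]))
radius = distances[:, -1]  # last column = distance to the k-th (here third) neighbor
print(distances, indices, radius)
```

Because the neighbors are returned in order of increasing distance, the last column is always the largest of the k distances, which is what makes it usable as a circle radius.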

The following code draws a circle encompassing the 3 nearest neighbors for each data point. It also sets the plot range to make the circles appear circular (1 to 1 aspect ratio) and make them easier to see. Notice that the plot isn't showing us the outlier with period > 1000 days.

In [None]:
plt.figure(figsize=(10,6))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])
a = plt.scatter(TrainSet['S_MASS'], TrainSet['P_PERIOD'], marker = '*',\
            c = TrainSet['P_HABITABLE'], s = 100, label = 'Train', cmap=cmap, alpha=0.5) 


a = plt.scatter(TestSet['S_MASS'], TestSet['P_PERIOD'], marker = 'o',\
            c = TestSet['P_HABITABLE'], s = 100, label = 'Test', cmap=cmap, alpha = 0.5) 


for i in range(len(TestSet)): #plot neighbors

    circle1=plt.Circle((TestSet['S_MASS'].iloc[i],TestSet['P_PERIOD'].iloc[i]),model.kneighbors(Xtest.iloc[:,:2])[0][i,2],\
                      lw = 0.7, edgecolor='k',facecolor='none')
    
     
    plt.gca().add_artist(circle1)
    

plt.gca().set_aspect(1)

bluepatch = mpatches.Patch(color='#20B2AA', label='Not Habitable')
magentapatch = mpatches.Patch(color='#FF00FF', label='Habitable')

# Custom legend entries for the train and test markers
trainmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='*', linestyle='', label='Train')
testmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='o', linestyle='', label='Test')


plt.legend(handles=[trainmarker,testmarker, magentapatch, bluepatch] ,\
           loc = 'upper left', fontsize = 14)

plt.xlim(-130,70)
plt.ylim(0,140)
plt.xlabel('Mass of Parent Star (Solar Mass Units)')
plt.ylabel('Period of Orbit (days)');

#plt.savefig('HabPlanetsKNN2features.png', dpi = 300)

### Do you notice any issue here?

### If one dimension has a much bigger range than others, it will dominate the decision process. This issue can be solved by <b>scaling</b>. Scaling is a very important pre-processing step for most ML algorithms.
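
A small numerical sketch of the problem, using made-up (stellar mass, orbital period) pairs: the values below are illustrative only and are not taken from the data set.

```python
import numpy as np

# Made-up (stellar mass, orbital period) pairs; the values are illustrative only
a = np.array([0.2, 100.0])
b = np.array([1.2, 101.0])   # very different mass, nearly the same period
c = np.array([0.2, 110.0])   # identical mass, somewhat different period

d_ab = np.linalg.norm(a - b)  # Euclidean distance between a and b
d_ac = np.linalg.norm(a - c)  # Euclidean distance between a and c
print(d_ab, d_ac)
```

Here `b` comes out closer to `a` than `c` does, even though `b` orbits a star six times more massive. The period differences swamp the mass differences simply because periods are numerically larger.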

See some examples of different scaling algorithms [here](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html).

We will go with RobustScaler, which is more resistant to outliers than StandardScaler.


In [None]:
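import sklearn.preprocessing # a bare "import sklearn" does not guarantee that the preprocessing submodule is loaded
scaler = sklearn.preprocessing.RobustScaler()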

In [None]:
scaler.fit(Xtrain) # important: we don't look at the test set when we're determining the settings for the scaler, just the train set. Then we'll apply this same scaling to the test set.

In [None]:
scaledXtrain = scaler.transform(Xtrain)

In [None]:
scaledXtrain

In [None]:
scaledXtest = scaler.transform(Xtest) # Now we apply the same scaling to the test data. note that these are now numpy arrays, not data frames
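
With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (75th minus 25th percentile). The sketch below checks that by hand on a made-up single-column data set with one deliberate outlier. The names `X_tr`, `X_te`, and `demo_scaler` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up training and test columns; the last training value is a deliberate outlier
X_tr = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
X_te = np.array([[2.5], [50.0]])

demo_scaler = RobustScaler().fit(X_tr)   # statistics come from the training data only
scaled_tr = demo_scaler.transform(X_tr)
scaled_te = demo_scaler.transform(X_te)  # the test data reuse the training statistics

# The same transformation by hand: subtract the median, divide by the interquartile range
median = np.median(X_tr, axis=0)
iqr = np.percentile(X_tr, 75, axis=0) - np.percentile(X_tr, 25, axis=0)
by_hand = (X_tr - median) / iqr

print(np.allclose(scaled_tr, by_hand))
print(np.allclose(demo_scaler.inverse_transform(scaled_tr), X_tr))
```

The fitted statistics are stored in `demo_scaler.center_` and `demo_scaler.scale_`, which is what allows the same transformation to be applied to new data later.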

Just a quick review of working with numpy arrays: we can select the value in one of the columns for all the rows like this:

In [None]:
scaledXtrain[:, 0] # selects all rows, just column 0 (stellar mass)
scaledXtrain[:, :2] # selects all rows, just columns 0 and 1 (stellar mass and orbital period)

We'll plot the orbital period and stellar mass of the scaled data. No need to color-code by habitability, we just want to see what the distributions look like after scaling.

The plotting code here looks a bit different, since we're working with numpy arrays and not data frames.

In [None]:

cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])
a = plt.scatter(scaledXtrain[:, 0], scaledXtrain[:, 1], marker = '*', s = 100, label = 'Train', alpha=0.5) 

a = plt.scatter(scaledXtest[:, 0], scaledXtest[:, 1], marker = 'o', s = 100, label = 'Test', alpha = 0.5)

plt.legend()
plt.xlabel('Scaled Mass of Parent Star')
plt.ylabel('Scaled Period of Orbit');

What is the effect of the scaling? Make sure to address the mean/median, width of the distributions, and any outliers.

In [None]:
scaler.inverse_transform(scaledXtrain) #This undoes the scaling if needed (here it recovers the original Xtrain values)

Your turn! Train your kNN model with the scaled stellar mass and scaled orbital period. Then make a prediction for the scaled test set.

In [None]:
# your code to train and predict here. Make sure you're using the same 2 features as before, not all 3

As before, we can look at the resulting neighbors for the test set:

In [None]:
model.kneighbors(scaledXtest[:,:2])

### The distances of neighbors look more balanced and give equal weight to all features:

In [None]:
plt.figure(figsize=(10,6))
cmap = matplotlib.colors.LinearSegmentedColormap.from_list("", ['#20B2AA','#FF00FF'])
# Note the capitalization: the scaled training array is scaledXtrain, not scaledXTrain
plt.scatter(scaledXtrain[:,0], scaledXtrain[:,1], marker = '*',\
            c = ytrain, s = 100, label = 'Train', cmap=cmap, alpha=0.5)

plt.scatter(scaledXtest[:,0], scaledXtest[:,1], marker = 'o',\
            c = ytest, s = 100, label = 'Test', cmap=cmap, alpha=0.5)

for i in range(len(TestSet)):

    circle1=plt.Circle((scaledXtest[i,0],scaledXtest[i,1]),model.kneighbors(scaledXtest[:,:2])[0][i,2],\
                       edgecolor='k',facecolor='none', lw = 0.7)
    plt.gca().add_artist(circle1)

plt.gca().set_aspect(1)

# Custom legend entries for the train and test markers
trainmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='*', linestyle='', label='Train')
testmarker = plt.Line2D([0], [0], markeredgecolor='black', markerfacecolor = 'none', markersize = 10, marker='o', linestyle='', label='Test')


plt.legend(handles=[trainmarker,testmarker, magentapatch, bluepatch] ,\
           loc = 'upper left', fontsize = 14)



plt.xlabel('Scaled Mass of Parent Star')
plt.ylabel('Scaled Period of Orbit');


plt.xlim(-2.5,2.5)
plt.ylim(-1.,2.5);

#plt.savefig('HabPlanetsKNNscaled.png', dpi = 300)

### Note: for the purpose of application (not visualization), we should use all three features. You'll try this on the homework. 

### Discussion questions:
    
1. We discovered that kNN needs scaling! Does DT have the same issue?

2. Any thoughts on strengths/weaknesses?

### Acknowledgement statement: every assignment you submit will include an acknowledgement statement crediting the resources (human or otherwise) that you relied on for your work. In this case, your group mates are all already credited, but if you used any other resources, credit them here.

### That's it for kNN, for now. Upload your completed notebook to Gradescope to submit it, and then you're done for today.