# Week 13 Group Homework

## 1. In markdown, describe kNN in your own words.

k-Nearest Neighbors, or k-NN, is a nonparametric algorithm that can be used for classification and regression problems. The k-NN method assumes that the distance between points is a measure of similarity: points that are close together are likely to share the same label. When we use it in a classification context, the k-NN algorithm calculates the distances between a test observation, x, and all of the training data to find the k training observations closest to x. Those k neighbors are then used to create a prediction for x, typically by majority vote of their labels (or by averaging their values in regression).
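
The steps described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (Euclidean distance, majority vote, made-up toy points); the function name `knn_predict` is hypothetical and is not the scikit-learn implementation.

```python
from collections import Counter
import math

def knn_predict(train_points, train_labels, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Euclidean distance from x to every training point
    distances = [math.dist(p, x) for p in train_points]
    # Indices of the k training points with the smallest distances
    nearest = sorted(range(len(train_points)), key=lambda i: distances[i])[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data, made up for illustration: two small clusters
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5), (6.0, 6.0), (6.5, 7.0), (7.0, 6.5)]
train_labels = [0, 0, 0, 1, 1, 1]

pred_a = knn_predict(train_points, train_labels, (1.2, 1.4), k=3)
pred_b = knn_predict(train_points, train_labels, (6.4, 6.2), k=3)
print(pred_a, pred_b)
```

Each prediction requires a distance to every training point, which is why the training data has to be kept around at prediction time.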

The standard distance metric is Euclidean, but other metrics can be used depending on the application (e.g. cosine similarity, or chi-squared distance, which is especially good for classifying based on textures or shapes).
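
To illustrate why the choice of metric matters, here is a small sketch comparing Euclidean distance with cosine distance (1 minus cosine similarity) on made-up vectors. The helper names are hypothetical; the point is only that the two metrics can disagree about which candidate is "nearest".

```python
import numpy as np

def euclidean_distance(a, b):
    # Straight-line distance between the two points
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    # 1 - cosine similarity; close to 0 when the vectors point the same way
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1.0, 1.0])
candidates = np.array([
    [1.0, 2.0],  # index 0: close to x in space, but a different direction
    [5.0, 5.0],  # index 1: far from x in space, but the same direction
])

nearest_euclid = int(np.argmin([euclidean_distance(x, c) for c in candidates]))
nearest_cosine = int(np.argmin([cosine_distance(x, c) for c in candidates]))
print(nearest_euclid, nearest_cosine)
```

Euclidean distance picks the candidate at index 0, while cosine distance picks the one at index 1, so the same k-NN model can make different predictions under different metrics.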

The most commonly used statistic for k-NN model performance is the average accuracy (the fraction of correctly classified observations). Model performance can be heavily influenced by k, the number of neighbors. When k is too low, the model is overfit: it has low bias but high variance, because each prediction depends on only a few nearby points. When k is too high, the model is underfit: it has high bias and low variance, because each prediction is averaged over a large part of the data set. The goal is to find the value of k that balances the performance, bias, and variance.
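
One way to see the effect of k is to compare training and test accuracy for a very small, a moderate, and a very large k. The sketch below uses synthetic data from `make_classification` as a stand-in (an assumption for illustration, not the class data set), so the exact numbers will not match the diabetes example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption): 400 points, with some label noise
X_demo, y_demo = make_classification(n_samples=400, n_features=5, n_informative=3,
                                     n_redundant=0, flip_y=0.1, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, test_size=0.5, random_state=0)

results = {}
for k in (1, 15, 199):
    model = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    # Store (training accuracy, test accuracy) for this k
    results[k] = (model.score(Xtr, ytr), model.score(Xte, yte))
    print(k, results[k])
```

With k = 1 every training point is its own nearest neighbor, so the training accuracy is perfect even though the test accuracy is not; that gap is the signature of high variance. With k close to the size of the training set, nearly every prediction becomes a vote over almost all of the training data.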

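Because k-NN has to compare each new observation against the stored training data at prediction time, computation time increases as the size of the data set increases. In some contexts, this can be a drawback.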

## 2. Using the kNN example from class, write a function that finds the optimal value for k. You should iterate over a range of values and return the k and the score when the accuracy score is maximized. Be sure to only use odd values.

In [1]:
import pandas as pd 
import numpy as np

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

In [2]:
diabetes_df = pd.read_csv('diabetes.csv')
diabetes_df.head(5)

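| Unnamed: 0 | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |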


In [3]:
# Creating the feature matrix, X, and splitting out the response vector, y.  

X = diabetes_df.drop('Outcome', axis=1)
y = diabetes_df['Outcome']

#Split data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.5, random_state=42)

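#Standardization - transforming each feature so that its mean is zero and its SD is 1
#Note: fit_transform on X_test refits the scaler on the test set; sc.transform(X_test)
#would instead reuse the mean and SD learned from the training set

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)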

#Specifying the model, and k, the number of neighbors

knn = KNeighborsClassifier(n_neighbors=13)

knn.fit(X_train, y_train)
y_predicted = knn.predict(X_test)
print(y_predicted)

[0 0 0 1 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 1
 0 0 0 0 0 0 1 0 0 1 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 0 0 0
 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 0 0 0 1 0 0 1 0 1 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0
 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 1 0 1 0 1 0 0 1 0 0 0 0 1 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 0 1 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0 1 1
 0 0 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 1 0 0 0 0 0 0
 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0
 1 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 1 0 1 0 1 0 0]
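
One way to make sure the test set is scaled with the mean and SD learned from the training set is to put the scaler and the classifier into a single scikit-learn pipeline. The sketch below is a minimal example on synthetic stand-in data (an assumption, since `diabetes.csv` is not bundled here); only the pattern carries over, not the numbers.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and response vector (assumption)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.5, random_state=42)

# fit() fits the scaler on the training data only;
# predict() and score() reuse that fitted scaler on the test data
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=13))
pipe.fit(X_tr, y_tr)

demo_predictions = pipe.predict(X_te)
demo_accuracy = pipe.score(X_te, y_te)
print(demo_accuracy)
```

The pipeline can be passed anywhere a plain classifier is expected, so a loop over values of k only needs to change the `n_neighbors` argument.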


#### Some scratchwork:

In [4]:
#We could compute the scores of the classifier for a range of values of k,
#and then see if we can find a maximum. This is a kind of "naive" approach

klist=[]
for i in range(1,100):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    klist.append(knn.score(X_test, y_test))

In [5]:
#Let's check it out:
print(klist)

[0.6770833333333334, 0.6848958333333334, 0.71875, 0.6822916666666666, 0.703125, 0.703125, 0.7109375, 0.7083333333333334, 0.7109375, 0.7161458333333334, 0.734375, 0.7239583333333334, 0.7317708333333334, 0.7291666666666666, 0.734375, 0.7317708333333334, 0.7473958333333334, 0.7395833333333334, 0.7421875, 0.7526041666666666, 0.75, 0.7552083333333334, 0.7369791666666666, 0.7291666666666666, 0.734375, 0.7291666666666666, 0.7317708333333334, 0.734375, 0.7213541666666666, 0.7265625, 0.71875, 0.7265625, 0.7265625, 0.7317708333333334, 0.7369791666666666, 0.7317708333333334, 0.7421875, 0.7369791666666666, 0.7447916666666666, 0.7526041666666666, 0.7447916666666666, 0.7526041666666666, 0.7473958333333334, 0.7447916666666666, 0.7526041666666666, 0.7578125, 0.7552083333333334, 0.7578125, 0.7552083333333334, 0.75, 0.7552083333333334, 0.75, 0.7604166666666666, 0.7526041666666666, 0.7604166666666666, 0.7473958333333334, 0.7552083333333334, 0.7473958333333334, 0.7552083333333334, 0.75, 0.7578125, 0.75260

In [6]:
#This is a list, so we can find the maximum value in the list and get its index position.

#max value
max(klist)

0.765625

In [7]:
#Index position
np.argmax(klist)

66

In [8]:
# Since indexing starts with 0, the actual k = index + 1 = 67.
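
The index-to-k conversion can be checked on a small scale. The sketch below reuses the first five scores from the printed list above (k = 1 to 5) and keeps an explicit list of k values next to the scores, which avoids the off-by-one step and also works when only odd values are searched. The variable names are illustrative.

```python
import numpy as np

# First five accuracy scores from the printed list above, for k = 1..5
scores_k1_to_5 = [0.6770833333333334, 0.6848958333333334, 0.71875,
                  0.6822916666666666, 0.703125]
k_values = list(range(1, 6))

best_index = int(np.argmax(scores_k1_to_5))  # position of the highest score
best_k = k_values[best_index]                # look up k instead of adding 1
print(best_index, best_k)  # 2 3

# Same idea restricted to odd k: keep the k values and scores aligned
odd_k_values = k_values[::2]          # [1, 3, 5]
odd_scores = scores_k1_to_5[::2]      # scores for k = 1, 3, 5
best_odd_k = odd_k_values[int(np.argmax(odd_scores))]
print(best_odd_k)  # 3
```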

#### Function for tuning k

In [9]:
def knn_tune(k_values=range(1, 100, 2)):
    """Return (best_k, best_score) over the odd values of k in k_values."""
    best_k, best_score = None, -1.0
    for k in k_values:
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        score = knn.score(X_test, y_test)
        #Strict > keeps the smallest k when two values tie
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score

In [16]:
#Call the function once and unpack the optimal k and its accuracy score

opt_k, opt_score = knn_tune()

print('The optimal k is: ' + str(opt_k) + ' with an accuracy score of: ' + str(opt_score))

The optimal k is: 67 with an accuracy score of: 0.765625
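
The search above scores every k on the test set, so the reported accuracy is likely to be somewhat optimistic, because the same data is used both to choose k and to evaluate it. A common alternative is to choose k by cross-validation. The sketch below shows one way to do that with `cross_val_score` over odd values of k; it uses synthetic stand-in data (an assumption), and the range of k and the number of folds are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption, not the diabetes data set)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

odd_ks = list(range(1, 50, 2))
cv_means = []
for k in odd_ks:
    # Scaling sits inside the pipeline so it is refit on each training fold
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_means.append(cross_val_score(model, X_demo, y_demo, cv=5).mean())

cv_best_k = odd_ks[int(np.argmax(cv_means))]
cv_best_score = max(cv_means)
print(cv_best_k, cv_best_score)
```

In a full workflow the cross-validation would run on the training split only, leaving the test split for a single final accuracy estimate.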


## 3. How did the panel influence your thoughts about working in tech, specifically work in the data realm? Discuss with your group and summarize your thoughts in under 250 words. 

I enjoyed the discussion with the panel and appreciated their insights and expertise. The three main takeaways for me were: 1) you have to have solid coding foundations, 2) for long-term success, you must be a lifelong learner, and 3) it pays to be a good communicator.

There was one question that I kept wanting to ask but didn't, because it's a bit of a downer and I didn't want to introduce any fears or negative thoughts into what was an enjoyable, uplifting, and inspiring conversation. I wanted to know their thoughts on market saturation. Data science bootcamps, certificates, and even degree programs are everywhere now, and while data science may be a growing industry, it seems like the number of people who go through training programs will rapidly outpace the number of available jobs (if it hasn't already). I read a 2019 blog entry about this by a data scientist who has been in the field for over 10 years, and I keep refraining from posting it to Slack because, again, I don't want to generate anxiety. I do think it is something worth considering. I spoke about it with my group for a little bit, and our general conclusion was to just ignore that possibility (market saturation).