<img src="https://imgur.com/WafvawM.png" style="float: left; margin: 30px; height: 55px">

# Classification and KNN with NHL data

_Author: J Ghorbani_

---

Below you will practice KNN classification on a dataset of NHL statistics.

You will be predicting the `Rank` of a team from predictor variables of your choice.

In [2]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
# Local path to the data:
local_csv = '../data/NHL_Data_GA.csv'

### 1. Load the NHL data

In [4]:
# A:

df = pd.read_csv(local_csv)
df.head()

Unnamed: 0,Team,PTS,Rank,TOI,GF,GA,GF60,GA60,GF%,SF,...,FF%,CF,CA,CF60,CA60,CF%,Sh%,Sv%,PDO,PIM
0,Washington10,121,1,2001:52:00,115,73,3.45,2.19,61.2,1112,...,51.3,2138,1935,64.1,58.0,52.5,10.34,93.03,1034,1269
1,Vancouver11,117,1,2056:14:00,94,72,2.74,2.1,56.6,1143,...,53.1,2144,1870,62.6,54.6,53.4,8.22,93.16,1014,985
2,San Jose10,113,1,1929:54:00,90,68,2.8,2.11,57.0,1065,...,50.9,1985,1876,61.7,58.3,51.4,8.45,93.46,1019,1195
3,Chicago10,112,1,2020:23:00,104,83,3.09,2.46,55.6,1186,...,58.1,2093,1572,62.2,46.7,57.1,8.77,90.44,992,966
4,Vancouver12,111,1,2052:02:00,86,74,2.51,2.16,53.8,1078,...,51.0,2085,1880,61.0,55.0,52.6,7.98,93.36,1013,1049


### 2. Perform any required data cleaning. Do some EDA.

In [5]:
# A:
df.describe()

Unnamed: 0,PTS,Rank,GF,GA,GF60,GA60,GF%,SF,SA,SF60,...,FF%,CF,CA,CF60,CA60,CF%,Sh%,Sv%,PDO,PIM
count,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,...,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0,90.0
mean,91.977778,2.022222,83.288889,83.288889,2.442222,2.444,49.981111,1068.333333,1068.333333,31.252222,...,49.966667,1973.466667,1973.466667,57.735556,57.798889,49.972222,7.814556,92.182556,999.988889,990.966667
std,12.524114,0.820767,10.376339,9.694484,0.325331,0.313522,4.644554,95.929047,75.514118,2.237637,...,2.797913,176.468299,154.148928,4.124476,4.291106,2.844313,0.866942,0.928621,12.292772,178.049321
min,62.0,1.0,57.0,64.0,1.7,1.73,38.0,815.0,868.0,25.8,...,43.1,1565.0,1572.0,49.5,46.7,43.7,5.9,89.83,978.0,689.0
25%,82.25,1.0,76.0,75.5,2.2325,2.2025,46.825,1011.5,1022.25,29.55,...,47.775,1855.25,1877.0,54.275,54.6,47.925,7.235,91.555,992.0,881.25
50%,92.5,2.0,84.0,84.0,2.4,2.495,49.7,1072.0,1072.0,31.4,...,50.05,1981.5,1961.0,58.05,58.35,50.4,7.73,92.25,1000.5,960.0
75%,102.0,3.0,90.0,89.0,2.6,2.67,53.625,1143.0,1125.75,32.775,...,51.775,2112.75,2077.25,60.85,60.4,52.0,8.27,92.87,1007.75,1101.5
max,121.0,3.0,115.0,107.0,3.45,3.24,61.2,1311.0,1245.0,35.6,...,58.1,2341.0,2332.0,64.9,67.5,57.1,10.34,93.94,1034.0,1515.0


### 3. Set up the `Rank` variable as your target. How many classes are there?

In [6]:
# A:
df.columns

Index(['Team', 'PTS', 'Rank', 'TOI', 'GF', 'GA', 'GF60', 'GA60', 'GF%', 'SF',
       'SA', 'SF60', 'SA60', 'SF%', 'FF', 'FA', 'FF60', 'FA60', 'FF%', 'CF',
       'CA', 'CF60', 'CA60', 'CF%', 'Sh%', 'Sv%', 'PDO', 'PIM'],
      dtype='object')

In [8]:
df.Rank.value_counts()

3    31
2    30
1    29
Name: Rank, dtype: int64

### 4. What is the baseline accuracy?

In [9]:
# A:
# Baseline = share of the majority class (Rank 3, 31 of 90 rows), as a percentage.
31 * 100/(31+30+29)

34.44444444444444
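The same baseline can be derived from the class counts rather than typed in by hand. The sketch below copies the counts from the `value_counts()` output above; the names `class_counts` and `baseline_accuracy` are illustrative, not part of the original notebook.

```python
from collections import Counter

# Class counts copied from the value_counts() output above (Rank -> rows).
class_counts = Counter({3: 31, 2: 30, 1: 29})

# The baseline classifier always predicts the most frequent class.
majority_class, majority_count = class_counts.most_common(1)[0]
baseline_accuracy = majority_count / sum(class_counts.values())

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.4f}")
```

Any model worth keeping should beat this figure on held-out data.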

### 5. Choose 4 features to be your predictor variables and set up your design matrix.

In [14]:
# A:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

feature_cols = ['GF', 'GA', 'CA', 'SF']
X = df[feature_cols]
y = df.Rank

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

In [16]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Fit on the training set only, so the scaler learns the training mean and standard deviation.
scaler.fit(X_train)
# Transform both splits with those training statistics.
# One-step alternative for the training set: X_train = scaler.fit_transform(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
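To make the fit/transform split concrete, here is a small sketch using the `GF`, `GA`, `CA` and `SF` values of the five teams shown in `df.head()`. Treating the first four rows as "train" and the last as "test" is purely an assumption for illustration, and the names `X_demo`, `X_tr`, `X_te` and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# GF, GA, CA, SF for the five teams shown in df.head() above.
X_demo = np.array([
    [115, 73, 1935, 1112],
    [94, 72, 1870, 1143],
    [90, 68, 1876, 1065],
    [104, 83, 1572, 1186],
    [86, 74, 1880, 1078],
], dtype=float)

# Illustration only: first four rows play "train", the last row plays "test".
X_tr, X_te = X_demo[:4], X_demo[4:]

scaler_demo = StandardScaler().fit(X_tr)

# The scaler subtracts the training mean and divides by the training
# (population, ddof=0) standard deviation, column by column...
manual_tr = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
# ...and reuses those same training statistics on the test rows.
manual_te = (X_te - X_tr.mean(axis=0)) / X_tr.std(axis=0)

train_matches = np.allclose(scaler_demo.transform(X_tr), manual_tr)
test_matches = np.allclose(scaler_demo.transform(X_te), manual_te)
print(train_matches, test_matches)
```

Fitting on the training rows alone keeps information about the test set out of the preprocessing step.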

### 6. Fit a `KNeighborsClassifier` with 1 neighbor using the target and predictors.

In [17]:
# A:
# Fit a 1-nearest-neighbor classifier and predict on the test set.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

y_pred_class = knn.predict(X_test)


### 7. Evaluate the accuracy of your model.
- Is it better than baseline?
- Is it legitimate?

In [18]:
# A:
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
print('the accuracy is: ',testing_accuracy)
testing_error = 1 - testing_accuracy

print('the error is: ',testing_error)

the accuracy is:  0.5217391304347826
the error is:  0.4782608695652174
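An accuracy of about 0.52 is above the 0.344 baseline, and it is legitimate in the sense that it was measured on rows the model never saw. With the default 25% split of 90 rows the test set has only 23 rows (0.5217 is 12 of 23), so the estimate is noisy. The sketch below shows why training accuracy would not be a fair measure for K=1. It runs on a synthetic stand-in for the NHL table: `X_demo`, `y_demo` and the labelling rule are assumptions, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real NHL data): 90 rows, four features on
# roughly the scales reported by df.describe().
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=99)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

knn_1 = KNeighborsClassifier(n_neighbors=1).fit(X_tr_s, y_tr)
train_acc = knn_1.score(X_tr_s, y_tr)  # each training row is its own nearest neighbor
test_acc = knn_1.score(X_te_s, y_te)   # only this number says anything about new data

print('training accuracy:', train_acc)
print('testing accuracy: ', test_acc)
print('test rows:', len(y_te))
```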


### 8. Create a 50-50 train-test-split of your target and predictors. Refit the KNN and assess the accuracy.

In [10]:
# A:
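A minimal sketch of the 50-50 refit, run on a synthetic stand-in for the NHL table so that it is self-contained. `X_demo`, `y_demo` and the labelling rule are assumptions; in the notebook you would pass your own `X` and `y` instead.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real NHL data).
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

# test_size=0.5 gives the 50-50 split; random_state makes it repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=99
)

# Scale with statistics learned from the training half only.
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

knn_half = KNeighborsClassifier(n_neighbors=1).fit(X_tr_s, y_tr)
acc_half = accuracy_score(y_te, knn_half.predict(X_te_s))
print('rows (train, test):', len(X_tr), len(X_te))
print('test accuracy with a 50-50 split:', acc_half)
```

A larger test half gives a steadier accuracy estimate, at the cost of fewer rows for the neighbors to come from.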

### 9. Evaluate the test accuracy of a KNN where K == number of rows in the training data.

In [11]:
# A:
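When K equals the number of training rows, every test point has the whole training set as its neighbors, so with uniform weights the model predicts the same class for every row and behaves like a baseline classifier. The sketch below checks this on a synthetic stand-in for the NHL table. `X_demo`, `y_demo` and the labelling rule are assumptions, not the real data.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real NHL data).
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=99
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# K == number of training rows: every query votes with the full training set.
k_all = len(X_tr)
knn_all = KNeighborsClassifier(n_neighbors=k_all).fit(X_tr_s, y_tr)
y_pred_all = knn_all.predict(X_te_s)
acc_all = accuracy_score(y_te, y_pred_all)

print('distinct predicted classes:', np.unique(y_pred_all))
print('test accuracy:', acc_all)
print('share of that class in the test set:', np.mean(y_te == y_pred_all[0]))
```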

### 10. Fit the KNN at values of K from 1 to the number of rows in the training data.
- Store the test accuracy in a list.
- Plot the test accuracy vs. the number of neighbors.

In [12]:
# A:
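One possible shape for this loop, sketched on a synthetic stand-in for the NHL table. `X_demo`, `y_demo`, the labelling rule and the reuse of the 50-50 split are assumptions. In the notebook the figure renders inline because of `%matplotlib inline`.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real NHL data).
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=99
)
scaler_demo = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

# K runs from 1 up to the number of training rows (inclusive).
k_values = list(range(1, len(X_tr) + 1))
test_acc_by_k = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k).fit(X_tr_s, y_tr)
    test_acc_by_k.append(knn_k.score(X_te_s, y_te))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(k_values, test_acc_by_k)
ax.set_xlabel('number of neighbors (K)')
ax.set_ylabel('test accuracy')
ax.set_title('KNN test accuracy vs. K (synthetic stand-in data)')
```

Because the curve comes from a single split, its peaks partly reflect which rows happened to land in the test set.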

### 11. Fit KNN across different values of K and plot the mean cross-validated accuracy with 5 folds.

In [13]:
# A:
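A hedged sketch of the cross-validated version, using `cross_val_score` with `cv=5` on a synthetic stand-in for the NHL table. `X_demo`, `y_demo` and the labelling rule are assumptions. K stops at 60 here because each training fold holds only about 72 of the 90 rows, and K cannot exceed the fold size.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in (NOT the real NHL data).
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

k_values = list(range(1, 61))
cv_mean_acc = []
for k in k_values:
    knn_k = KNeighborsClassifier(n_neighbors=k)
    # For a classifier, an integer cv uses stratified folds by default.
    fold_scores = cross_val_score(knn_k, X_demo, y_demo, cv=5)
    cv_mean_acc.append(fold_scores.mean())

best_k = k_values[int(np.argmax(cv_mean_acc))]
print('best K by mean CV accuracy:', best_k)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(k_values, cv_mean_acc)
ax.set_xlabel('number of neighbors (K)')
ax.set_ylabel('mean 5-fold CV accuracy')
```

Averaging over five folds smooths out much of the split-to-split noise seen with a single train-test split.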

### 12. Standardize the predictor matrix and cross-validate across the different K.
- Plot the standardized mean cross-validated accuracy against the unstandardized. Which is better?
- Why?

In [14]:
# A:
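A sketch of the comparison, assuming the scaler is placed inside a pipeline so that it is refit on the training folds of every split. The `describe()` output shows `CA` and `SF` with standard deviations near 154 and 96, against roughly 10 for `GF` and `GA`, so unscaled Euclidean distances are dominated by the shot counts; standardizing usually helps KNN for that reason. The data below is a synthetic stand-in (`X_demo`, `y_demo` and the labelling rule are assumptions), so results on the real table may differ.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (NOT the real NHL data).
rng = np.random.default_rng(0)
n_rows = 90
X_demo = np.column_stack([
    rng.normal(83, 10, n_rows),     # GF
    rng.normal(83, 10, n_rows),     # GA
    rng.normal(1973, 154, n_rows),  # CA
    rng.normal(1068, 96, n_rows),   # SF
])
# Hypothetical labelling rule: order teams by goal differential, 29/30/31 per class.
position = np.argsort(np.argsort(-(X_demo[:, 0] - X_demo[:, 1])))
y_demo = np.where(position < 29, 1, np.where(position < 59, 2, 3))

k_values = list(range(1, 61))
raw_mean_acc, scaled_mean_acc = [], []
for k in k_values:
    raw_knn = KNeighborsClassifier(n_neighbors=k)
    # The pipeline standardizes inside each fold, avoiding leakage from held-out rows.
    scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    raw_mean_acc.append(cross_val_score(raw_knn, X_demo, y_demo, cv=5).mean())
    scaled_mean_acc.append(cross_val_score(scaled_knn, X_demo, y_demo, cv=5).mean())

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(k_values, raw_mean_acc, label='unstandardized')
ax.plot(k_values, scaled_mean_acc, label='standardized')
ax.set_xlabel('number of neighbors (K)')
ax.set_ylabel('mean 5-fold CV accuracy')
ax.legend()
```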