# Exercise 10

## KNN exercise with NBA player data

## Introduction

- NBA player statistics from 2014-2015 (partial season): [data](https://github.com/justmarkham/DAT4-students/blob/master/kerry/Final/NBA_players_2015.csv), [data dictionary](https://github.com/justmarkham/DAT-project-examples/blob/master/pdf/nba_paper.pdf)
- **Goal:** Predict player position using assists, steals, blocks, turnovers, and personal fouls

## Read the data into Pandas

In [1]:
# read the data into a DataFrame
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT4-students/master/kerry/Final/NBA_players_2015.csv'
nba = pd.read_csv(url, index_col=0)

In [2]:
# examine the columns
nba.columns

Index(['season_end', 'player', 'pos', 'age', 'bref_team_id', 'g', 'gs', 'mp',
       'fg', 'fga', 'fg_', 'x3p', 'x3pa', 'x3p_', 'x2p', 'x2pa', 'x2p_', 'ft',
       'fta', 'ft_', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf',
       'pts', 'G', 'MP', 'PER', 'TS%', '3PAr', 'FTr', 'TRB%', 'AST%', 'STL%',
       'BLK%', 'TOV%', 'USG%', 'OWS', 'DWS', 'WS', 'WS/48', 'OBPM', 'DBPM',
       'BPM', 'VORP'],
      dtype='object')

In [3]:
# examine the positions
nba.pos.value_counts()

G    200
F    199
C     79
Name: pos, dtype: int64

## Create X and y

Use the following features: assists, steals, blocks, turnovers, personal fouls

In [4]:
# map positions to numbers
nba['pos_num'] = nba.pos.map({'C':0, 'F':1, 'G':2})

In [5]:
# create feature matrix (X)
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [6]:
# alternative way to create X
X = nba.loc[:, 'ast':'pf']
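
The two ways of building `X` agree only because the five feature columns sit next to each other in the file and because label-based slices in `.loc` include both endpoints. A minimal sketch of that check follows; the `toy` frame and its values are made up for illustration and are not part of the NBA data.

```python
import pandas as pd

# toy frame with made-up values; the columns mimic the order around 'ast'...'pf' in nba
toy = pd.DataFrame({
    'trb': [5, 2],
    'ast': [1, 7],
    'stl': [0, 2],
    'blk': [2, 0],
    'tov': [1, 3],
    'pf':  [3, 2],
    'pts': [10, 15],
})

feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X_by_name = toy[feature_cols]          # select columns by explicit list
X_by_slice = toy.loc[:, 'ast':'pf']    # label slice, both endpoints included

# True only if columns, order, and values all match
same = X_by_name.equals(X_by_slice)
print(same)
```

Selecting by an explicit list is the safer choice if the column order of the file might change.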

In [7]:
# create response vector (y)
y = nba.pos_num

# Exercise 10.1

* Split the data into training and test sets
* Train a KNN model (K=5)
* Evaluate the accuracy

In [30]:
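# train_test_split lives in sklearn.model_selection
# (the old sklearn.cross_validation module was removed in scikit-learn 0.20)
from sklearn.model_selection import train_test_split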
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, random_state=123)

In [31]:
from sklearn.neighbors import KNeighborsClassifier
# instantiate with K=5
knn = KNeighborsClassifier(n_neighbors=5)

In [32]:
# fit with data
knn.fit(X_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [33]:
# predict on the test set and compute the confusion matrix
from sklearn.metrics import confusion_matrix
y_pred = knn.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[12, 10,  3],
       [ 7, 36,  8],
       [ 0,  8, 36]])
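
The exercise asks for accuracy, which can be read off the confusion matrix: correct predictions lie on the diagonal. A minimal sketch using the K=5 matrix printed above (the name `cm_k5` is introduced here only for illustration):

```python
import numpy as np

# K=5 confusion matrix from the output above
# rows: true class, columns: predicted class (0=C, 1=F, 2=G)
cm_k5 = np.array([[12, 10,  3],
                  [ 7, 36,  8],
                  [ 0,  8, 36]])

# accuracy = correct predictions (diagonal) / all test samples
accuracy_k5 = np.trace(cm_k5) / cm_k5.sum()
print(round(accuracy_k5, 3))  # 0.7
```

Calling `accuracy_score(y_test, y_pred)` from `sklearn.metrics` on the predictions should give the same value.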

# Exercise 10.2

Predict a player's position and calculate the predicted probability of each position

Predict for a player with these statistics: 1 assist, 1 steal, 0 blocks, 1 turnover, 2 personal fouls

In [34]:
# create a 2D array (1 row, 5 features) to represent a player
import numpy as np
player = np.array([1, 1, 0, 1, 2]).reshape(1, -1)

In [35]:
# make a prediction
knn.predict(player)

array([2])

In [36]:
# calculate predicted probabilities
knn.predict_proba(player)

array([[ 0. ,  0.4,  0.6]])
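
With the default `weights='uniform'`, each predicted probability is the share of the K nearest neighbors that belong to that class, so 0.4 and 0.6 correspond to 2 and 3 of the 5 neighbors. The sketch below reproduces that arithmetic; the `neighbor_labels` array is hypothetical, chosen to match the output above rather than read from the fitted model.

```python
import numpy as np

k = 5
# hypothetical labels of the 5 nearest neighbors: two F (1) and three G (2)
neighbor_labels = np.array([1, 1, 2, 2, 2])

# fraction of neighbors in each class (0=C, 1=F, 2=G)
proba = np.bincount(neighbor_labels, minlength=3) / k

# the predicted class is the one with the largest share of votes
predicted = int(np.argmax(proba))
print(proba, predicted)
```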

# Exercise 10.3

Repeat Exercises 10.1 and 10.2 using K=50

In [37]:
# repeat for K=50
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[ 4, 20,  1],
       [ 0, 43,  8],
       [ 0, 16, 28]])
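
To see where the two models differ, it can help to look at per-class recall (the diagonal divided by the row totals) alongside overall accuracy. A rough sketch using the two confusion matrices printed in this notebook; the helper `summarize` is a name made up for this example.

```python
import numpy as np

# confusion matrices from the notebook outputs (rows: true, columns: predicted)
cm_k5 = np.array([[12, 10, 3], [7, 36, 8], [0, 8, 36]])
cm_k50 = np.array([[4, 20, 1], [0, 43, 8], [0, 16, 28]])

def summarize(cm):
    """Return overall accuracy and per-class recall for a confusion matrix."""
    accuracy = np.trace(cm) / cm.sum()
    recall = np.diag(cm) / cm.sum(axis=1)  # correct / actual count per class
    return accuracy, recall

acc5, rec5 = summarize(cm_k5)
acc50, rec50 = summarize(cm_k50)
print('K=5 ', round(acc5, 3), rec5.round(2))
print('K=50', round(acc50, 3), rec50.round(2))
```

On this split, K=50 has lower overall accuracy (0.625 vs 0.7) and recovers far fewer centers (4 of 25 instead of 12 of 25), the smallest class.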

In [38]:
# make a prediction
knn.predict(player)

array([1])

In [39]:
# calculate predicted probabilities
knn.predict_proba(player)

array([[ 0.06,  0.52,  0.42]])

# Exercise 10.4 (3 points)

Explore the features to decide which ones are predictive
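
One possible starting point is to compare the mean of each feature across positions: a feature whose group means differ widely is a candidate predictor. The sketch below uses a toy frame with made-up values (not the NBA file) so that it runs on its own; with the real data, `nba.groupby('pos')[feature_cols].mean()` would be the analogous call.

```python
import pandas as pd

# toy data with made-up values, for illustration only
toy = pd.DataFrame({
    'pos': ['C', 'C', 'F', 'F', 'G', 'G'],
    'ast': [1, 2, 2, 3, 6, 8],
    'blk': [3, 2, 1, 1, 0, 0],
    'pf':  [3, 3, 2, 3, 2, 2],
})

# mean of each feature within each position
means = toy.groupby('pos').mean()

# gap between the largest and smallest group mean, per feature
spread = means.max() - means.min()
print(means)
print(spread.sort_values(ascending=False))
```

The raw spread depends on each feature's scale, so this is only a first look; box plots per position or cross-validated accuracy on feature subsets would give a firmer answer.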