## 2. Problem Statement
In this assignment, students will be using the K-nearest neighbors algorithm to predict how many points NBA players scored in the 2013-2014 season.

A look at the data

Before we dive into the algorithm, let’s take a look at our data. Each row in the data contains information on how a player performed in the 2013-2014 NBA season.

Download 'nba_2013.csv' file from this link:
https://www.dropbox.com/s/b3nv38jjo5dxcl6/nba_2013.csv?dl=0

Here are some selected columns from the data:
- player - name of the player
- pos - the position of the player
- g - number of games the player was in
- gs - number of games the player started
- pts - total points the player scored

There are many more columns in the data, mostly containing information about average player game performance over the course of the season. See this site for an explanation of the rest of them.
We can read our dataset in and figure out which columns are present:


# Importing and reading the dataset

In [1]:
import pandas as pd

#data = pd.read_csv('nba_2013.csv')

#data
nba =''

with open("nba_2013.csv", 'r') as csvfile:
    nba = pd.read_csv(csvfile)



In [2]:
nba

Unnamed: 0,player,pos,age,bref_team_id,g,gs,mp,fg,fga,fg.,...,drb,trb,ast,stl,blk,tov,pf,pts,season,season_end
0,Quincy Acy,SF,23,TOT,63,0,847,66,141,0.468,...,144,216,28,23,26,30,122,171,2013-2014,2013
1,Steven Adams,C,20,OKC,81,20,1197,93,185,0.503,...,190,332,43,40,57,71,203,265,2013-2014,2013
2,Jeff Adrien,PF,27,TOT,53,12,961,143,275,0.520,...,204,306,38,24,36,39,108,362,2013-2014,2013
3,Arron Afflalo,SG,28,ORL,73,73,2552,464,1011,0.459,...,230,262,248,35,3,146,136,1330,2013-2014,2013
4,Alexis Ajinca,C,25,NOP,56,30,951,136,249,0.546,...,183,277,40,23,46,63,187,328,2013-2014,2013
5,Cole Aldrich,C,25,NYK,46,2,330,33,61,0.541,...,92,129,14,8,30,18,40,92,2013-2014,2013
6,LaMarcus Aldridge,PF,28,POR,69,69,2498,652,1423,0.458,...,599,765,178,63,68,123,147,1603,2013-2014,2013
7,Lavoy Allen,PF,24,TOT,65,2,1072,134,300,0.447,...,192,311,71,24,33,44,126,303,2013-2014,2013
8,Ray Allen,SG,38,MIA,73,9,1936,240,543,0.442,...,182,205,143,54,8,84,115,701,2013-2014,2013
9,Tony Allen,SG,32,MEM,55,28,1278,204,413,0.494,...,129,208,94,90,19,90,121,495,2013-2014,2013


In [3]:
# The names of all the columns in the data.
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


### Euclidean distance
##### We can use the principle of euclidean distance to find the most similar NBA players to Lebron James.

In [4]:
    
# Select Lebron James from our dataset
selected_player = nba[nba["player"] == "LeBron James"].iloc[0]

# Choose only the numeric columns (we'll use these to compute euclidean distance)
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

selected_player

player          LeBron James
pos                       PF
age                       29
bref_team_id             MIA
g                         77
gs                        77
mp                      2902
fg                       767
fga                     1353
fg.                    0.567
x3p                      116
x3pa                     306
x3p.                0.379085
x2p                      651
x2pa                    1047
x2p.                0.621777
efg.                    0.61
ft                       439
fta                      585
ft.                     0.75
orb                       81
drb                      452
trb                      533
ast                      488
stl                      121
blk                       26
tov                      270
pf                       126
pts                     2089
season             2013-2014
season_end              2013
Name: 225, dtype: object

In [5]:
import math

def euclidean_distance(row):
    """
    A simple euclidean distance function
    """
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    return math.sqrt(inner_value)

# Find the distance from each player in the dataset to lebron.
lebron_distance = nba.apply(euclidean_distance, axis=1)
lebron_distance

0      3475.792868
1              NaN
2              NaN
3      1189.554979
4      3216.773098
5              NaN
6       960.443178
7      3131.071083
8      2326.129199
9      2806.955657
10     2277.933945
11             NaN
12     2819.058890
13     2534.074598
14     1970.085795
15     3262.065464
16     2451.378405
17      485.856006
18             NaN
19     3246.515831
20     1539.172839
21             NaN
22     2969.043638
23             NaN
24     2023.603985
25             NaN
26             NaN
27             NaN
28     3754.041967
29     3835.882699
          ...     
451     716.243023
452    2996.450583
453    4135.156714
454    3023.456473
455    4138.570811
456            NaN
457    2206.524879
458    1347.758158
459    2136.309449
460            NaN
461            NaN
462    1922.713718
463    2364.771676
464    3033.755934
465    2625.998112
466    2495.296784
467    2232.354830
468            NaN
469    3525.434026
470    3574.911070
471    2873.509019
472    3831.

### Normalizing columns

Once you have multiple columns, one column may have larger impact than that of others columns becoz of its value, and thus dwarf the impact of racing_stripes values in the euclidean distance calculations.

This can be bad, because a variable having larger values doesn't necessarily make it better at predicting what rows are similar.

A simple way to deal with this is to normalize all the columns to have a mean of 0, and a standard deviation of 1. This will ensure that no single column has a dominant impact on the euclidean distance calculations.

To set the mean to 0, we have to find the mean of a column, then subtract the mean from every value in the column. To set the standard deviation to 1, we divide every value in the column by the standard deviation. The formula is //(x=\frac{x-\mu}{\sigma}\).

In [6]:
    
# Select only the numeric columns from the NBA dataset
nba_numeric = nba[distance_columns]

# Normalize all of the numeric columns
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()


### Finding the nearest neighbor
##### We now know enough to find the nearest neighbor of a given row in the NBA dataset. We can use the distance.euclidean function from scipy.spatial, a much faster way to calculate euclidean distance.

In [7]:
from scipy.spatial import distance

# Fill in NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for lebron james.
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"]

# Find the distance between lebron james and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)

# Create a new dataframe with distances.
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
print(distance_frame)

# Find the most similar player to lebron (the lowest distance to lebron is lebron, the second smallest is the most similar non-lebron player)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]
print(most_similar_to_lebron)

          dist  idx
225   0.000000  225
17    4.171854   17
136   4.206786  136
128   4.382582  128
185   4.489928  185
133   4.619280  133
123   4.673849  123
162   4.844802  162
332   4.893563  332
451   4.937466  451
160   4.938801  160
179   5.084443  179
423   5.305866  423
218   5.476262  218
197   5.542109  197
307   5.546064  307
416   5.604720  416
277   5.628071  277
110   5.724909  110
272   5.927671  272
85    6.012417   85
450   6.094992  450
278   6.104387  278
99    6.171672   99
253   6.221577  253
478   6.254191  478
347   6.309149  347
177   6.313540  177
345   6.362956  345
3     6.473960    3
..         ...  ...
263  15.560209  263
455  15.610856  455
308  15.658503  308
108  15.667851  108
356  15.715293  356
134  15.735660  134
53   15.736456   53
425  15.750488  425
404  15.850572  404
431  15.889840  431
222  15.959102  222
327  16.065221  327
321  16.201575  321
324  16.223200  324
424  16.397847  424
224  16.410734  224
339  16.562235  339
271  16.594910  271


### Generating training and testing sets

Now that we know how to find the nearest neighbors, we can make predictions on a test set. We'll try to predict how many points a player scored using the 5 closest neighbors. We'll find neighbors by using all the numeric columns in the dataset to generate similarity scores.

First, we have to generate test and train sets. In order to do this, we'll use random sampling. We'll randomly shuffle the index of the nba dataframe, and then pick rows using the randomly shuffled values.

If we didn't do this, we'd end up predicting and training on the same data set, which would overfit. We could do cross validation also, which would be slightly better, but slightly more complex.

In [8]:
import random
from numpy.random import permutation

# Randomly shuffle the index of nba.
random_indices = permutation(nba.index)
# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)
# Generate the test set by taking the first 1/3 of the randomly shuffled indices.
test = nba.loc[random_indices[1:test_cutoff]]
# Generate the train set with the rest of the data.
train = nba.loc[random_indices[test_cutoff:]]


### Using sklearn for k nearest neighbors


Sklearn performs the normalization and distance finding automatically, and lets us specify how many neighbors we want to look at.



In [9]:
# Taking only  the relevant columns 
final_columnns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf','pts']


# The columns that we will be making predictions with.
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column that we want to predict.
y_column = ["pts"]

In [10]:
final_train = train[final_columnns]
#final_train
final_test = test[final_columnns]

In [11]:
# Finding the persent of rows having atleast onw NaN value - Data Preprocessing
print('Training Data === ', final_train[final_columnns].isnull().T.any().T.sum()*100/final_train[final_columnns].shape[0], '%') # Train DatasetInput Data
print('Test Data === ', final_test[final_columnns].isnull().T.any().T.sum()*100/final_test[final_columnns].shape[0], '%') # Train DatasetInput Data


Training Data ===  14.330218068535826 %
Test Data ===  20.12578616352201 %


In [12]:
# Deleting NaN Rows from the dataset - Data Preprocessing
final_train.dropna( axis=0, inplace = True)
final_test.dropna( axis=0, inplace = True)


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  This is separate from the ipykernel package so we can avoid doing imports until


In [13]:
# Reconfirming on NaN  
print('Training Data === ', final_train[final_columnns].isnull().T.any().T.sum()*100/final_train[final_columnns].shape[0], '%') # Train DatasetInput Data
print('Test Data === ', final_test[final_columnns].isnull().T.any().T.sum()*100/final_test[final_columnns].shape[0], '%') # Train DatasetInput Data


Training Data ===  0.0 %
Test Data ===  0.0 %


In [14]:
print(final_train[x_columns].shape)
print(final_train[y_column].shape)

(275, 25)
(275, 1)


In [15]:


from sklearn.neighbors import KNeighborsRegressor
# Create the knn model.
# Look at the five closest neighbors. We are choosing here 5 neighbours
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data.
knn.fit(final_train[x_columns], final_train[y_column])
# Make point predictions on the test set using the fit model.
predictions = knn.predict(final_test[x_columns])

len(predictions)

127

### Computing error
Now that we know our point predictions, we can compute the error involved with our predictions. We can compute mean squared error.

In [16]:
# Get the actual values for the test set.
actual = final_test[y_column]

# Compute the mean squared error of our predictions.
mse = (((predictions - actual) ** 2).sum()) / len(predictions)


In [17]:
mse

pts    11371.621417
dtype: float64