# K-Nearest Neighbours

Let's see which player is most similar to LeBron James using K-Nearest Neighbours.

In [1]:
import pandas as pd
nba = pd.read_csv("nba_2013.csv")

# The names of the columns in the data
print(nba.columns.values)

['player' 'pos' 'age' 'bref_team_id' 'g' 'gs' 'mp' 'fg' 'fga' 'fg.' 'x3p'
 'x3pa' 'x3p.' 'x2p' 'x2pa' 'x2p.' 'efg.' 'ft' 'fta' 'ft.' 'orb' 'drb'
 'trb' 'ast' 'stl' 'blk' 'tov' 'pf' 'pts' 'season' 'season_end']


## Finding Similar Rows With Euclidean Distance

In [2]:
import math

selected_player = nba[nba["player"] == "LeBron James"].iloc[0]
distance_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf', 'pts']

def euclidean_distance(row):
    inner_value = 0
    for k in distance_columns:
        inner_value += (row[k] - selected_player[k]) ** 2
    return math.sqrt(inner_value)

lebron_distance = nba.apply(euclidean_distance, axis=1)
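
The row-by-row loop above can also be written as a vectorized calculation. The following is a minimal sketch on a hypothetical toy dataframe (the `toy` frame, its values, and the `toy_columns` list are made up for illustration and are not part of the NBA data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the nba dataframe
toy = pd.DataFrame(
    {"player": ["A", "B", "C"], "pts": [10.0, 13.0, 16.0], "ast": [2.0, 6.0, 10.0]}
)
toy_columns = ["pts", "ast"]

# Numeric row for the player we compare everyone against
selected = toy.loc[toy["player"] == "A", toy_columns].iloc[0]

# Subtract the selected row from every row, square, sum across columns, take the root
diffs = toy[toy_columns] - selected
toy_distance = np.sqrt((diffs ** 2).sum(axis=1))
print(toy_distance.tolist())  # [0.0, 5.0, 10.0]
```

The selected player is at distance 0 from itself, which is why the nearest-neighbour lookup later skips the first row of the sorted distances.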

## Normalize Columns

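Columns with large values, such as minutes played or points, contribute far more to the raw Euclidean distance than columns with small values, such as shooting percentages. A simple way to deal with this problem is to normalize all of the columns to have a mean of 0 and a standard deviation of 1. This ensures that no single column has a dominant impact on the Euclidean distance calculations.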

To set a column's mean to 0, we have to find its current mean, then subtract it from every value in that column. To set the standard deviation to 1, we divide every value in the column by the current standard deviation.

In [3]:
nba_numeric = nba[distance_columns]
nba_normalized = (nba_numeric - nba_numeric.mean()) / nba_numeric.std()
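
As a quick sanity check of the same formula, here is a small sketch on hypothetical toy columns (the `toy` frame and its values are assumptions for illustration). Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default:

```python
import pandas as pd

# Hypothetical toy columns on very different scales
toy = pd.DataFrame({"mp": [1000.0, 2000.0, 3000.0], "fg_pct": [0.40, 0.45, 0.50]})

# Subtract each column's mean, then divide by its standard deviation
toy_normalized = (toy - toy.mean()) / toy.std()

# After normalization each column should have mean ~0 and standard deviation ~1
print(toy_normalized.mean())
print(toy_normalized.std())
```

Both columns end up on the same scale, so neither dominates a distance computed across them.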

## Finding the Nearest Neighbour

In [4]:
from scipy.spatial import distance

# Fill in the NA values in nba_normalized
nba_normalized.fillna(0, inplace=True)

# Find the normalized vector for LeBron James
# (.iloc[0] gives a 1-D row, which distance.euclidean expects)
lebron_normalized = nba_normalized[nba["player"] == "LeBron James"].iloc[0]

# Find the distance between LeBron James and everyone else.
euclidean_distances = nba_normalized.apply(lambda row: distance.euclidean(row, lebron_normalized), axis=1)

distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})

distance_frame.sort_values("dist", inplace=True)
second_smallest = distance_frame.iloc[1]["idx"]
most_similar_to_lebron = nba.loc[int(second_smallest)]["player"]

In [5]:
print(most_similar_to_lebron)

Carmelo Anthony


## Generating Training and Testing Sets

Now that we know how to find the nearest neighbors, we can make predictions on a test set.

First, we have to generate testing and training sets. We'll use random sampling to do this. We'll randomly shuffle the index of the nba dataframe, and then pick rows using the randomly shuffled values.

If we didn't do this, we'd end up evaluating the model on the same rows it was trained on. The measured error would then be overly optimistic and would hide any overfitting.

In [6]:
from numpy.random import permutation

# Randomly shuffle the index of nba
random_indices = permutation(nba.index)
# Set a cutoff for how many items we want in the test set (in this case 1/3 of the items)
test_cutoff = math.floor(len(nba)/3)
# Generate the test set by taking the first 1/3 of the randomly shuffled indices
# (slice from the start so that the first shuffled row is not dropped)
test = nba.loc[random_indices[:test_cutoff]]
# Generate the train set with the rest of the data
train = nba.loc[random_indices[test_cutoff:]]
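
An alternative sketch of the same split uses `DataFrame.sample` with a fixed `random_state`, which makes the split repeatable between runs. The `toy` frame and the seed value here are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy frame standing in for nba
toy = pd.DataFrame({"player": list("abcdefghi"), "pts": range(9)})

# Draw one third of the rows at random; random_state makes the split repeatable
toy_test = toy.sample(frac=1/3, random_state=1)
# Everything not drawn for the test set becomes the training set
toy_train = toy.drop(toy_test.index)

print(len(toy_test), len(toy_train))
```

Because the training set is built by dropping the test rows, no row can appear in both sets and none is lost.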

## Using SKLearn

Instead of having to do it all ourselves, we can use the kNN implementation in scikit-learn. While scikit-learn (Sklearn for short) makes a regressor and a classifier available, we'll be using the regressor, as we have continuous values to predict on.

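Sklearn performs the distance finding automatically and lets us specify how many neighbors we want to look at. Note that `KNeighborsRegressor` does not normalize the columns for us, so any scaling has to be done as a separate step.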
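
To make the scaling step explicit, one option is to chain a `StandardScaler` in front of the regressor with `make_pipeline`. This is a minimal sketch on hypothetical toy arrays (`X_toy`, `y_toy`, and the query point are made up for illustration), not the approach taken in the cells below:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy features: column 0 is on a much larger scale than column 1
X_toy = np.array([[1000.0, 0.1], [1100.0, 0.9], [3000.0, 0.2], [3100.0, 0.8]])
y_toy = np.array([5.0, 9.0, 6.0, 10.0])

# The scaler is fit on the training data, then the regressor sees scaled features
scaled_knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=2))
scaled_knn.fit(X_toy, y_toy)

# The same scaling is applied to new rows before their neighbours are looked up
toy_prediction = scaled_knn.predict(np.array([[2000.0, 0.85]]))
print(toy_prediction)
```

Putting the scaler inside the pipeline means its mean and standard deviation are learned from the training rows only and then reused for the test rows.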

In [7]:
# The columns that we'll be using to make predictions
x_columns = ['age', 'g', 'gs', 'mp', 'fg', 'fga', 'fg.', 'x3p', 'x3pa', 'x3p.', 'x2p', 'x2pa', 'x2p.', 'efg.', 'ft', 'fta', 'ft.', 'orb', 'drb', 'trb', 'ast', 'stl', 'blk', 'tov', 'pf']
# The column we want to predict
y_column = ["pts"]

from sklearn.neighbors import KNeighborsRegressor
# Some columns contain missing values, and scikit-learn raises
# "ValueError: Input contains NaN" on them, so fill them with 0 first
train_x = train[x_columns].fillna(0)
test_x = test[x_columns].fillna(0)
# Create the kNN model
knn = KNeighborsRegressor(n_neighbors=5)
# Fit the model on the training data
knn.fit(train_x, train[y_column])
# Make predictions on the test set using the fit model
predictions = knn.predict(test_x)

## Compute the Error

In [8]:
# Mean squared error: the average squared difference between predicted and actual points
actual = test[y_column]
mse = (((predictions - actual) ** 2).sum()) / len(predictions)
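
The mean squared error formula used here can be checked against scikit-learn's `mean_squared_error`. The sketch below uses hypothetical toy arrays (`toy_actual` and `toy_predicted` are made up for illustration):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
toy_actual = np.array([10.0, 20.0, 30.0])
toy_predicted = np.array([12.0, 18.0, 33.0])

# Average of the squared differences, computed by hand
manual_mse = ((toy_predicted - toy_actual) ** 2).sum() / len(toy_predicted)
# The same quantity from scikit-learn
library_mse = mean_squared_error(toy_actual, toy_predicted)

print(manual_mse, library_mse)
```

The squared errors are 4, 4 and 9, so both values come out to 17/3, or roughly 5.67.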