# Demo 07 - Distance Functions and Comparing Entities

In this notebook we use the [NBA Salary Dataset](https://github.com/joshrosson/NBASalaryPredictions) to illustrate working with relationships between variables. Specifically, we'll compute distances between *observations* in this dataset and see what we can learn!


In [None]:
# Startup magic to:
# (1) Mount Google Drive
# (2) Change to the course folder
# (3) Pull the latest changes
# (4) Move to the demo directory so that the data files are available

from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/My Drive/cmps3160
!git pull
%cd _demos

In [None]:
# Includes, standard magic, and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd
# Load Stats
from scipy import stats
import seaborn as sns

# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('fivethirtyeight')

# These two lines are for Pandas: they widen the notebook so we can display data easily.
# (display and HTML live in IPython.display; IPython.core.display is deprecated for this import.)
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# Show a ludicrous number of rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Loading the Data and Down Selecting

First we'll load the data and focus on just a few attributes to make this more clear.

In [None]:
# Load the data
# Data from here: https://github.com/joshrosson/NBASalaryPredictions
df_nba = pd.read_csv("./data/nba_stats.csv")
display(df_nba.head(10))

# Always double check your Dtypes
df_nba.dtypes

Is the NBA salary data skewed? Why?

In [None]:
# Let's use all the years this time but only a subset of the stats.

# Why did I copy this time?
df_smallNBA = df_nba[["Name", "Salary", "Season", "Pos", "Age", "MP", "PTS","TRB", "AST"]].copy()
df_smallNBA.head(10)

# Find the closest players!


In [None]:
# Drop rows with NAs, then reset the index so row labels run 0..n-1 and match row positions.
df_smallNBA.dropna(inplace=True)
df_smallNBA.reset_index(drop=True, inplace=True)
df_smallNBA.head(10)

In [None]:
sns.pairplot(df_smallNBA)

In [None]:
# Get dummies -- why did I remove Name? How do we get it back?
df_ml = pd.get_dummies(df_smallNBA[["Season", "Pos", "Age", "MP", "PTS","TRB", "AST"]])
display(df_ml.head(10))
len(df_ml)
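
One way to think about the question above, as a minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not from the NBA file): `pd.get_dummies` would turn every distinct name into its own column, so the name is left out of the feature table. Because the feature table keeps the same index as the original frame, a row number found in one can be looked up in the other.

```python
import pandas as pd

# Hypothetical toy data standing in for df_smallNBA.
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C"],
    "Pos": ["C", "PG", "C"],
    "PTS": [20.0, 15.0, 8.0],
})

# Encode only the feature columns; Name is deliberately excluded.
features = pd.get_dummies(toy[["Pos", "PTS"]])

# The index is untouched, so row 1 of `features` is still row 1 of `toy`.
row = 1
name = toy.loc[row, "Name"]
print(sorted(features.columns), name)
```

This only works because the index was reset before encoding; if rows had been dropped afterwards without a reset, labels and positions would no longer line up.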

In [None]:
df_ml.Season.value_counts()

We're going to start using [SKLearn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) here, and we'll get more into it as we go!
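
Before calling the library, it may help to see what a pairwise Euclidean distance matrix contains. The sketch below builds one by hand with NumPy on three made-up points (the array `X` and the name `D_toy` are illustrative assumptions); it is meant to show the idea, not how SKLearn implements it internally.

```python
import numpy as np

# Three toy observations described by two made-up features each.
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

# Broadcasting builds every row-minus-row difference: shape (3, 3, 2).
diffs = X[:, None, :] - X[None, :, :]

# Euclidean distance: square root of the summed squared differences.
D_toy = np.sqrt((diffs ** 2).sum(axis=-1))
print(D_toy)
```

Entry `[i, j]` is the distance between observation `i` and observation `j`, so the matrix is square, symmetric, and zero on the diagonal.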

In [None]:
# Use SKLearn to compute the distance between every pair of rows.
from sklearn.metrics import pairwise_distances
D = pairwise_distances(df_ml, metric="euclidean")
D.shape


In [None]:
# Find someone interesting...
df_smallNBA[(df_smallNBA['Name'] == 'Anthony Davis')]

In [None]:
# So what does this D matrix have inside of it?
D

In [None]:
# So let's see who was the closest to Davis's 2016 season (row 8416)
D[8416, :].argmin()

In [None]:
# Wait... that's me... what went wrong here?
np.fill_diagonal(D, np.inf)

# Every row is at distance 0 from itself, so to fix this we fill the diagonal with infs; then a row can never be its own closest match.
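
A small sketch of the same problem and fix on a hand-made 3x3 distance matrix (the values are invented for illustration). It works on a copy so the original distances stay intact, which is a design choice of this example rather than what the cell above does.

```python
import numpy as np

# Hand-made symmetric distance matrix with zeros on the diagonal.
D_toy = np.array([[0.0, 5.0, 10.0],
                  [5.0, 0.0, 4.0],
                  [10.0, 4.0, 0.0]])

# Without masking, the closest row to row 1 is row 1 itself.
self_match = int(D_toy[1].argmin())

# Mask the diagonal on a copy, then search again.
masked = D_toy.copy()
np.fill_diagonal(masked, np.inf)
nearest = int(masked[1].argmin())

# argsort gives a ranked list if more than one neighbor is wanted.
ranked = np.argsort(masked[1])[:2]
print(self_match, nearest, ranked)
```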

In [None]:
D[8416, :].argmin()

In [None]:
df_smallNBA.loc[[6805]]

# 2012 Kevin Love, who dis? Explains why LeBron wanted that trade, huh?

In [None]:
df_ml.loc[[8416,6805]]


There are lots of different distances we could use; see the [SKLearn Distance Functions](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.DistanceMetric.html).
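
To make the differences concrete, here is a minimal sketch computing three common distances by hand on two made-up feature vectors (the numbers are invented and only loosely resemble a row of the encoded table). Hamming is taken here as the fraction of positions where the two vectors differ, which is how the normalized version is usually defined.

```python
import numpy as np

# Two hypothetical encoded rows: season, age, a rate stat, and two dummy columns.
a = np.array([2016.0, 25.0, 30.5, 1.0, 0.0])
b = np.array([2016.0, 27.0, 28.0, 0.0, 1.0])

# Euclidean: straight-line distance, dominated by large numeric gaps.
euclidean = np.sqrt(((a - b) ** 2).sum())

# Manhattan: sum of absolute per-feature gaps.
manhattan = np.abs(a - b).sum()

# Hamming: share of features that are not exactly equal, ignoring by how much.
hamming = (a != b).mean()

print(euclidean, manhattan, hamming)
```

Note that Hamming treats a gap of 0.1 and a gap of 100 identically, so on continuous stats it mostly rewards exact ties; that is worth keeping in mind when reading the result of the next cell.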

In [None]:
# If we change the distance metric, what happens?


D = pairwise_distances(df_ml, metric="hamming")
np.fill_diagonal(D, np.inf)
D[8416, :].argmin()

In [None]:
df_smallNBA.loc[[8416,8399]]