# 10-Transforming Data

In this notebook we apply power and log transformations, z-standardization, and correlation to the [NBA Salary Dataset](https://github.com/joshrosson/NBASalaryPredictions) to illustrate working with relationships between variables.


Why is this a good dataset to study transformations?


In [None]:
# Imports, standard magic, and startup initializers.

# Load Numpy
import numpy as np
# Load MatPlotLib
import matplotlib
import matplotlib.pyplot as plt
# Load Pandas
import pandas as pd
# Load Stats
from scipy import stats
import seaborn as sns

# This lets us show plots inline and also save PDF plots if we want them
%matplotlib inline
from matplotlib.backends.backend_pdf import PdfPages
matplotlib.style.use('fivethirtyeight')

# These two lines are for Pandas: they widen the notebook so data displays more easily.
from IPython.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

# Show a ludicrous number of rows and columns
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)

# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)

In [None]:
# Load the data
# Data from here: https://github.com/joshrosson/NBASalaryPredictions
df_nba = pd.read_csv("./data/nba_stats.csv")
df_nba.head()

Is the data for NBA Salary skewed?  Why?

In [None]:
# Is the salary skewed?
df_nba['Salary'].plot.hist()

In [None]:
df_nba['Salary'].describe()

In [None]:
# Transformation functions: plot each against x so the horizontal axis shows x itself.
x = np.linspace(0.0001, 200, 500)
plt.plot(x, x**1.5, label="$x^{1.5}$")
plt.plot(x, x**1, label="$x$")
plt.plot(x, x**0.2, label="$x^{0.2}$")
#plt.plot(x, np.log(x), label="log(x)")
plt.legend(loc='best')

In [None]:
# Apply a few functions...
(df_nba['Salary']**2).plot.hist()
plt.show()
(df_nba['Salary']**0.2).plot.hist()
plt.show()
(np.log(df_nba['Salary'])).plot.hist()
plt.show()
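
The histograms above can also be summarized with a single number. As a minimal sketch, the cell below computes the sample skewness of each transform with `scipy.stats.skew`. It uses a synthetic, right-skewed `salary` array (a lognormal draw, which is an assumption for illustration, not the NBA data) so that it runs on its own.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "salaries" (an assumption for illustration only).
rng = np.random.default_rng(0)
salary = rng.lognormal(mean=15.0, sigma=1.0, size=5000)

# Sample skewness: near 0 is roughly symmetric, large positive means a long right tail.
skews = {
    "raw": stats.skew(salary),
    "x^2": stats.skew(salary**2),
    "x^0.2": stats.skew(salary**0.2),
    "log": stats.skew(np.log(salary)),
}

for name, value in skews.items():
    print(f"{name:>6}: skew = {value:.2f}")
```

On data like this, squaring stretches the right tail further, while the 0.2 power and the log pull it in. Swapping in `df_nba['Salary']` for `salary` should give the same comparison on the real column, assuming it has no missing or non-positive values.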

## So what's going on?

Having very skewed data can make it hard to see what relationships may exist in our data. We'll see this a lot in Project 2. But for now let's investigate the relationship between Salary and how many points players score.


In [None]:
# Unskewing the data can help us see relationships.
# Try exponents 2, 1, and 0.2

# We'll use subplots to make this easier just for fun...
fig, ax = plt.subplots(1, 4, figsize=(15,5))

#df_nba.plot.scatter(x=(df_nba['Salary']**2.0), y=df_nba['PTS'], ax=ax[0])

ax[0].scatter((df_nba['Salary']**2.0), df_nba['PTS'])
ax[0].title.set_text("Squared Salary")
ax[1].scatter((df_nba['Salary']**1.0), df_nba['PTS'])
ax[1].title.set_text("No Transform")
ax[2].scatter((np.log(df_nba['Salary'])), df_nba['PTS'])
ax[2].title.set_text("Log Transform")
ax[3].scatter((df_nba['Salary']**0.2), df_nba['PTS'])
ax[3].title.set_text("x^0.2 Salary")

In [None]:
# Z-standardize and replot.
df_nba['std_salary'] = (df_nba['Salary'] - df_nba['Salary'].mean()) / df_nba['Salary'].std()
df_nba['std_pts'] = (df_nba['PTS'] - df_nba['PTS'].mean()) / df_nba['PTS'].std()
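
As a quick sanity check on z-standardization, the sketch below wraps the same formula in a hypothetical helper, `zscore`, and applies it to synthetic columns (the data and the helper name are assumptions for illustration). It confirms that each standardized column has mean 0 and standard deviation 1, and that the Pearson correlation between the two columns does not change.

```python
import numpy as np
import pandas as pd

def zscore(s: pd.Series) -> pd.Series:
    """Subtract the mean and divide by the sample standard deviation."""
    return (s - s.mean()) / s.std()

# Synthetic stand-ins for Salary and PTS (assumption: not the real NBA data).
rng = np.random.default_rng(1)
salary = rng.lognormal(mean=15.0, sigma=1.0, size=200)
pts = 5.0 * np.log(salary) + rng.normal(0.0, 3.0, size=200)
df = pd.DataFrame({"Salary": salary, "PTS": pts})

df["std_salary"] = zscore(df["Salary"])
df["std_pts"] = zscore(df["PTS"])

# Mean ~0 and standard deviation ~1 after standardizing.
print(df[["std_salary", "std_pts"]].agg(["mean", "std"]))

# Standardizing is a linear rescaling, so the correlation is unchanged.
r_raw = df["Salary"].corr(df["PTS"])
r_std = df["std_salary"].corr(df["std_pts"])
print(r_raw, r_std)
```

Because standardizing only shifts and rescales each column, it changes the axis units but not the shape of the scatter plot.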


In [None]:
# Plot standardized Salary vs. standardized Points.
# We can see a bit more of the distribution, and the units (standard deviations from the mean) are interpretable!
df_nba.plot.scatter(x='std_salary', y='std_pts')

In [None]:
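# Compute the whole correlation matrix.
# Restrict to numeric columns: recent pandas versions raise an error on text columns such as Name and Pos.
df_nba.select_dtypes(include='number').corr()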
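
Transformations interact with the choice of correlation measure. The sketch below, on synthetic data where salary grows exponentially with points (an assumption for illustration), compares Pearson and Spearman correlation before and after a log transform. Pearson measures linear association, so it changes under the transform. Spearman works on ranks, so a monotone transform such as the log leaves it unchanged.

```python
import numpy as np
import pandas as pd

# Synthetic data: log(salary) is linear in points plus noise (assumption).
rng = np.random.default_rng(2)
pts = rng.uniform(0.0, 30.0, size=500)
salary = np.exp(13.0 + 0.1 * pts + rng.normal(0.0, 0.5, size=500))

df = pd.DataFrame({"Salary": salary, "PTS": pts})
df["log_salary"] = np.log(df["Salary"])

pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")

print("Pearson  raw:", pearson.loc["Salary", "PTS"])
print("Pearson  log:", pearson.loc["log_salary", "PTS"])
print("Spearman raw:", spearman.loc["Salary", "PTS"])
print("Spearman log:", spearman.loc["log_salary", "PTS"])
```

If the Pearson value rises noticeably after a transform while Spearman stays put, that suggests the relationship was monotone all along but not linear on the original scale.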

In [None]:
# Plot the correlation matrix (numeric columns only) as a heatmap.
plt.figure(figsize = (16,5))
sns.heatmap(df_nba.select_dtypes(include='number').corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True))

In [None]:
sns.pairplot(df_nba)

# Find the closest players!


In [None]:
# Get a smaller set, drop NAs, and reset the index (dummies come next).
df_comp = df_nba[['Name', 'Pos', 'Salary', 'PTS', 'TRB']].copy()
df_comp.dropna(inplace=True)
df_comp.reset_index(drop=True, inplace=True)
df_comp

In [None]:
sns.pairplot(df_comp)

In [None]:
# Get dummies: one-hot encode the categorical Pos column.
df_ml = pd.get_dummies(df_comp[['Pos', 'Salary', 'PTS', 'TRB']])
df_ml
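
To see what `pd.get_dummies` does to a frame like this, here is a minimal sketch on a three-row toy table (the values are made up for illustration). Numeric columns pass through unchanged, and the text column `Pos` is replaced by one indicator column per category.

```python
import pandas as pd

# Toy frame with one categorical column (values are illustrative only).
toy = pd.DataFrame({
    "Pos": ["C", "PG", "C"],
    "Salary": [1.0, 2.0, 3.0],
    "PTS": [10.0, 20.0, 15.0],
})

encoded = pd.get_dummies(toy)

# Numeric columns come first, followed by one indicator column per Pos value.
print(encoded.columns.tolist())  # ['Salary', 'PTS', 'Pos_C', 'Pos_PG']
print(encoded)
```

The indicator columns are what let a distance function work on position, since distances need numbers rather than strings.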

We're going to start using [SKLearn](https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics). We'll get more into it as we go!

In [None]:
# Use SKLearn to compute some pairwise distances.
from sklearn.metrics import pairwise_distances
D = pairwise_distances(df_ml, metric="cosine")
D.shape

In [None]:
# Find someone interesting...
df_comp[(df_comp['Name'] == 'Anthony Davis')]

In [None]:
D

In [None]:
# Who's the closest?
D[8416, :].argmin()

In [None]:
# Wait... that's me! Every row is at distance 0 from itself,
# so set the diagonal to infinity to rule out self-matches.
np.fill_diagonal(D, np.inf)

In [None]:
D[8416, :].argmin()

In [None]:
df_comp.loc[5561]
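
The lookup steps above can be collected into one small function. The sketch below is one possible arrangement, not the only one: `closest_row` is a hypothetical helper, the toy table is made up, and the row is found by name instead of by a hard-coded index. It assumes the frame has a fresh 0..n-1 index (as after `reset_index(drop=True)`), so index labels match row positions in the distance matrix.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

def closest_row(features: pd.DataFrame, row: int, metric: str = "cosine") -> int:
    """Return the position of the row nearest to `row`, excluding itself."""
    D = pairwise_distances(features.to_numpy(dtype=float), metric=metric)
    np.fill_diagonal(D, np.inf)  # a row should never match itself
    return int(D[row].argmin())

# Toy players (values are illustrative only).
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Salary": [1.0, 1.1, 5.0, 0.2],
    "PTS": [10.0, 11.0, 2.0, 9.0],
})

# Look up the row by name; take the first match if a name appears more than once.
idx = toy.index[toy["Name"] == "A"][0]
match = closest_row(toy[["Salary", "PTS"]], idx)
print(toy.loc[match, "Name"])
```

Note that cosine distance compares the direction of each feature vector, not its magnitude, so a column with very large values (such as raw salary) can dominate the result. Passing a different `metric`, or standardized columns, is worth trying for comparison.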