# Demo 09 - Transforming and Comparing Data

In this notebook we use the [NBA Salary Dataset](https://github.com/joshrosson/NBASalaryPredictions) to explore relationships between variables and to transform a few of those variables using standard Pandas methods.


In [None]:
## COLAB cell Only!
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps6790.git
%cd /content/cmps6790/_demos

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
# Make the fonts a little bigger in our graphs.
font = {'size'   : 20}
plt.rc('font', **font)
plt.rcParams['mathtext.fontset'] = 'cm'
plt.rcParams['pdf.fonttype'] = 42
# Suppress scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Loading the Data and Checking Skew

First up we need to open up this data and get it loaded. You'll see there are lots of different stats in different columns.

In [None]:
# Load the data
# Data from here: https://github.com/joshrosson/NBASalaryPredictions
df_nba = pd.read_csv("./data/nba_stats.csv")
display(df_nba.head(10))

# Always double check your Dtypes
df_nba.dtypes

In [None]:
# There's a lot here, let's work only with the 2017 data since that is the most recent
df_2017nba = df_nba[(df_nba["Season"] == 2017)][["Name", "Salary", "Pos", "Age", "MP", "PTS","TRB", "AST"]]
df_2017nba.head(10)

The first thing we might want to do (and have seen before) is apply a function to a column, or make a new column from the output of such a function. Let's make a column that's the average of the counting stats (Points, Rebounds, and Assists) for each player.

In [None]:
(df_2017nba["PTS"] + df_2017nba["TRB"] + df_2017nba["AST"]) / 3.0

In [None]:
# Tricky: is this an error? Why not?
df_2017nba["AvgCount"] = (df_2017nba["PTS"] + df_2017nba["TRB"] + df_2017nba["AST"]) / 3.0
df_2017nba
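
Because `df_2017nba` was built by slicing `df_nba`, the assignment above may trigger a `SettingWithCopyWarning`, depending on the pandas version. One common way to avoid it is to take an explicit `.copy()` of the slice before adding columns. The sketch below uses a small made-up table (the player names and numbers are hypothetical, not taken from the dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the NBA table.
df_all = pd.DataFrame({
    "Season": [2016, 2017, 2017],
    "Name": ["Player A", "Player B", "Player C"],
    "PTS": [10.0, 20.0, 30.0],
    "TRB": [4.0, 5.0, 6.0],
    "AST": [1.0, 2.0, 3.0],
})

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves df_all untouched.
df_2017 = df_all.loc[df_all["Season"] == 2017, ["Name", "PTS", "TRB", "AST"]].copy()

# Row-wise mean of the three counting stats.
df_2017["AvgCount"] = df_2017[["PTS", "TRB", "AST"]].mean(axis=1)
print(df_2017)
```

Using `mean(axis=1)` gives the same result as adding the three columns and dividing by 3.0, as long as none of the values are missing.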

Let's turn back to the question of Skew... is the salary data skewed for these players?

I'm going to use Seaborn functions, just for fun. First up is the [histplot](https://seaborn.pydata.org/generated/seaborn.histplot.html)

In [None]:
# First, let's visualize the salary data.
sns.histplot(df_2017nba["Salary"])

Is this data skewed? If so, which direction is it skewed? What does this tell us about the Mean and the Median?

In [None]:
df_2017nba["Salary"].describe()
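
To make the mean-versus-median question concrete, here is a minimal sketch on made-up numbers (the salaries below are hypothetical, not from the dataset). With a long right tail, the mean gets pulled above the median, and `Series.skew()` comes out positive.

```python
import pandas as pd

# Hypothetical salaries (in dollars): most are modest, one is very large.
salaries = pd.Series([1_000_000, 1_500_000, 2_000_000, 2_500_000, 30_000_000])

mean_salary = salaries.mean()
median_salary = salaries.median()
skewness = salaries.skew()  # sample skewness; positive means right-skewed

print(f"mean={mean_salary:,.0f} median={median_salary:,.0f} skew={skewness:.2f}")
```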

We learned about the ladder of powers to transform the data.

$$ x(\lambda) = \begin{cases} x^\lambda & \lambda > 0 \\  \log(x) & \lambda = 0 \\ -x^\lambda & \lambda < 0 \end{cases} $$

$\lambda = 1$ corresponds to no transformation at all. As we decrease $\lambda$, the distribution becomes more left-skewed (which is useful if the original distribution was right-skewed).
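
As a rough sketch, the piecewise definition above can be written as a small helper. The function name `ladder` is made up for this illustration, and it assumes all inputs are strictly positive (as salaries are).

```python
import numpy as np

def ladder(x, lam):
    """Ladder-of-powers transform for strictly positive x (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    if lam > 0:
        return x ** lam
    if lam == 0:
        return np.log(x)
    # For negative powers, the minus sign keeps the original ordering of the values.
    return -(x ** lam)

x = np.array([1.0, 4.0, 100.0])
print(ladder(x, 0.5))  # square root
print(ladder(x, 0))    # natural log
print(ladder(x, -1))   # negative reciprocal
```

The minus sign in the $\lambda < 0$ case is there so that larger inputs still map to larger outputs.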

But what do these functions look like?

In [None]:
# Transformation functions: plot each one against x so the x-axis shows the input value.
x = np.linspace(0.0001, 200, 50)
plt.plot(x, x**1.5, label="$x^{1.5}$")
plt.plot(x, x**1, label="$x$")
plt.plot(x, np.log(x), label=r"$\log(x)$")
plt.plot(x, x**0.2, label="$x^{0.2}$")
plt.legend(loc='best')


In [None]:
# Apply a few functions...
sns.histplot(df_2017nba['Salary']**2)
plt.show()

sns.histplot(df_2017nba['Salary']**0.2)
plt.show()

sns.histplot(np.log(df_2017nba['Salary']))
plt.show()
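
Besides eyeballing the histograms, we can put a number on the skew of each transformed version. The sketch below uses synthetic log-normal data as a stand-in for salaries (an assumption made purely for illustration; it is not the real dataset) and reports `Series.skew()` for several rungs of the ladder.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "salaries" (log-normal); not the real data.
fake_salary = pd.Series(rng.lognormal(mean=15, sigma=1, size=1000))

skew_by_transform = {
    "x^2": (fake_salary ** 2).skew(),
    "x": fake_salary.skew(),
    "x^0.2": (fake_salary ** 0.2).skew(),
    "log(x)": np.log(fake_salary).skew(),
}
for name, value in skew_by_transform.items():
    print(f"{name:>7}: skew = {value:.2f}")
```

Values near zero suggest a roughly symmetric distribution; large positive values indicate a long right tail.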


## Why Do We Unskew our data?

Having very skewed data can make it hard to see what relationships may exist in our data. For now, let's investigate the relationship between Salary and how many points a player scores.


In [None]:
# Unskewing the data can help us to see relationships.
# Try 2, 1, 0.2

# We'll use subplots to make this easier just for fun...
fig, ax = plt.subplots(1, 4, figsize=(15,5))

#df_nba.plot.scatter(x=(df_nba['Salary']**2.0), y=df_nba['PTS'], ax=ax[0])

ax[0].scatter((df_2017nba['Salary']**2.0), df_2017nba['PTS'])
ax[0].title.set_text("Squared Salary")
ax[1].scatter((df_2017nba['Salary']**1.0), df_2017nba['PTS'])
ax[1].title.set_text("No Transform")
ax[2].scatter((np.log(df_2017nba['Salary'])), df_2017nba['PTS'])
ax[2].title.set_text("Log Transform")
ax[3].scatter((df_2017nba['Salary']**0.2), df_2017nba['PTS'])
ax[3].title.set_text("x^0.2 Salary")

## Relationships and Transformations

We've seen how it's possible to unskew our data. Now let's do a little more EDA to see what relationships might exist in our data.

To do this we can first look at the correlations between the various columns.

In [None]:
df_2017nba.head(10)

In [None]:
# Let's look at the correlations...
df_2017nba[["Salary", "Age", "MP", "PTS", "TRB", "AST"]].corr()

In [None]:
# Compute Some Cross Correlations...
plt.figure(figsize = (16,5))
sns.heatmap(df_2017nba[["Salary", "Age", "MP", "PTS", "TRB", "AST"]].corr(), vmin=-1, vmax=1, cmap=sns.diverging_palette(20, 220, as_cmap=True))

In [None]:
# Or do some really crazy Seaborn stuff.
sns.pairplot(df_2017nba)
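
For reference, each entry in the correlation table is a Pearson correlation coefficient (the default for `DataFrame.corr()`). A minimal sketch on made-up numbers (hypothetical values, not from the dataset) computes it by hand and checks it against pandas:

```python
import numpy as np
import pandas as pd

# Hypothetical minutes played and points for four players.
toy = pd.DataFrame({
    "MP":  [10.0, 20.0, 30.0, 40.0],
    "PTS": [2.0, 9.0, 11.0, 20.0],
})

# Center each column, then divide the sum of products by the
# square root of the product of the sums of squares.
x = toy["MP"] - toy["MP"].mean()
y = toy["PTS"] - toy["PTS"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

r_pandas = toy["MP"].corr(toy["PTS"])
print(r_manual, r_pandas)
```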

### Let's apply three transforms: P-Score, Z-Score (normalizing), and one-hot encoding a categorical variable.

We'll do these in order with just a few examples. Let's start with percentile scoring with the Pandas [rank method](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rank.html).

In [None]:
# Turn Minutes played into percentile ranks.
df_2017nba["MP Pct"] = df_2017nba["MP"].rank(pct=True)
df_2017nba.sort_values(by="MP Pct")
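
To see what `rank(pct=True)` is doing, here is a small sketch with made-up minutes (hypothetical values): the percentile rank is the ordinary rank divided by the number of non-missing values.

```python
import pandas as pd

# Hypothetical minutes played for four players.
minutes = pd.Series([500, 1500, 1000, 2000])

pct = minutes.rank(pct=True)           # percentile rank in (0, 1]
manual = minutes.rank() / len(minutes)  # same thing, computed by hand

print(pct.tolist())
```

Ties receive the average of the ranks they span by default (`method="average"`).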

We could standardize the hard way (How?), but we can also do it the easy way using [SciPy Stats](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.zscore.html).

In [None]:
import scipy.stats as stats
df_2017nba["Std MP"] = stats.zscore(df_2017nba["MP"])
df_2017nba
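
As a sketch of "the hard way": subtract the mean and divide by the standard deviation. One detail to watch is that `scipy.stats.zscore` defaults to the population standard deviation (`ddof=0`), while pandas `Series.std()` defaults to the sample standard deviation (`ddof=1`), so the two approaches differ slightly unless you match `ddof`. The numbers below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical minutes played.
mp = pd.Series([10.0, 20.0, 30.0, 40.0])

# Population version (ddof=0), matching the scipy.stats.zscore default.
z_pop = (mp - mp.mean()) / mp.std(ddof=0)

# Sample version (ddof=1), the pandas default for .std().
z_sample = (mp - mp.mean()) / mp.std()

print(z_pop.tolist())
print(z_sample.tolist())
```

Either way the result has mean zero; only the scale differs, and the difference shrinks as the number of rows grows.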

In [None]:
# Z-standardize and replot.
df_2017nba['std_salary'] = (df_2017nba['Salary'] - df_2017nba['Salary'].mean()) / df_2017nba['Salary'].std()
df_2017nba['std_pts'] = (df_2017nba['PTS'] - df_2017nba['PTS'].mean()) / df_2017nba['PTS'].std()

In [None]:
# Plot Salary v. Points, raw and standardized.
# The shape of the scatter is unchanged, but both axes are now in standard deviations from the mean, so the units are interpretable.
df_2017nba.plot.scatter(x='Salary', y='PTS')
df_2017nba.plot.scatter(x='std_salary', y='std_pts')
plt.show()

In the next demo we'll learn a bit more about these, why they're important, and what they can be used for. For now let's finish up with turning Position into a one-hot encoded variable.

In [None]:
# get dummies
df_ml = pd.get_dummies(df_2017nba[['Pos', 'Salary', 'PTS', 'TRB']])
df_ml
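
To see what `pd.get_dummies` produces, here is a minimal sketch on a made-up table (the positions and points are hypothetical): the categorical column is replaced by one indicator column per category, and numeric columns pass through unchanged.

```python
import pandas as pd

# Hypothetical toy data with one categorical and one numeric column.
toy = pd.DataFrame({
    "Pos": ["PG", "C", "PG", "SF"],
    "PTS": [20.1, 12.3, 8.4, 15.0],
})

# Encode only the "Pos" column; new columns are named Pos_<category>.
encoded = pd.get_dummies(toy, columns=["Pos"])
print(encoded)
```

Each row has exactly one indicator set, which is where the name "one-hot" comes from.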