## Pre-processing and Training Development

This notebook does the pre-processing work needed to prepare the data for model fitting.  Dummy features will replace categorical features, such as 'Signed Using'.  Additionally, train and test splits will be created and the data will be standardized.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('NBA data for preprocessing', index_col=0)
df.head()

Unnamed: 0,Player,Pos,Age,G,GS,MP,FG,FGA,FG%,3P,...,AST,STL,BLK,TOV,PF,PTS,Signed Using,Team_Value,Market Size,Salary (millions)
118,Aaron Gordon,PF,25,19,19,552,5.9,13.9,0.427,2.0,...,5.2,0.9,1.0,3.5,2.5,17.1,Early Bird Rights,1.46,small,19.863636
277,Aaron Holiday,PG,24,29,6,576,5.1,13.7,0.37,1.9,...,3.3,0.9,0.2,1.4,3.1,13.4,1st Round Pick,1.55,small,2.2392
364,Al Horford,C,34,19,19,536,7.5,16.6,0.453,2.8,...,4.6,1.2,1.1,1.6,2.2,18.7,Cap Space,2.075,medium,28.0
361,Alec Burks,SG,29,18,3,459,5.0,12.7,0.395,2.8,...,3.0,1.2,0.4,1.6,2.4,15.6,Minimum Salary,2.075,medium,2.320044
305,Alex Caruso,PG,26,22,0,412,3.9,8.5,0.464,1.9,...,4.4,1.9,0.3,2.2,3.1,10.7,Room Exception,4.6,large,2.75


In [3]:
# Create dummy features to replace the categorical features.
# 'Player' is excluded because the names are kept as identifiers.
df_no_name = df.drop(columns=['Player'])
dfo = df_no_name.select_dtypes(include=['object'])
# Drop the original categorical columns and append their dummy-encoded versions
df = pd.concat([df.drop(columns=dfo.columns), pd.get_dummies(dfo)], axis=1)
df.head()

Unnamed: 0,Player,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,...,Signed Using_MLE,Signed Using_Maximum Salary,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Market Size_large,Market Size_medium,Market Size_small
118,Aaron Gordon,25,19,19,552,5.9,13.9,0.427,2.0,5.5,...,0,0,0,0,0,0,0,0,0,1
277,Aaron Holiday,24,29,6,576,5.1,13.7,0.37,1.9,5.8,...,0,0,0,0,0,0,0,0,0,1
364,Al Horford,34,19,19,536,7.5,16.6,0.453,2.8,7.1,...,0,0,0,0,0,0,0,0,1,0
361,Alec Burks,29,18,3,459,5.0,12.7,0.395,2.8,6.9,...,0,0,1,0,0,0,0,0,1,0
305,Alex Caruso,26,22,0,412,3.9,8.5,0.464,1.9,4.4,...,0,0,0,0,1,0,0,1,0,0
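
As a minimal sketch of what the encoding step above does, the example below applies `pd.get_dummies` to a small made-up frame. The rows are hypothetical and only the column names mirror this notebook. Passing `dtype=int` is an assumption added here: recent pandas versions return boolean dummy columns by default, while the output above shows 0/1 integers.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the notebook
toy = pd.DataFrame({
    "Player": ["A", "B", "C"],
    "Age": [25, 30, 22],
    "Market Size": ["small", "large", "small"],
})

# Categorical columns to encode (the identifier column 'Player' is left alone)
cat_cols = ["Market Size"]

# Replace each categorical column with one indicator column per category
encoded = pd.concat(
    [toy.drop(columns=cat_cols), pd.get_dummies(toy[cat_cols], dtype=int)],
    axis=1,
)
print(encoded.columns.tolist())
```

Each category becomes its own column named `<original column>_<category>`, which is where names such as `Market Size_small` come from.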


The numerical features vary significantly in magnitude, as shown in the EDA notebook, so they need to be scaled.  The data will first be split into train/test sets.  The scaler will then be fit on the training data only, so that no information from the test set leaks into the scaling parameters, and applied to both the training and test sets.

In [4]:
# Create train/test split.  X is all data except salary (the prediction target) and name.  y is salary.
X = df.drop(columns=['Salary (millions)', 'Player'])
y = df['Salary (millions)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(216, 46) (72, 46) (216,) (72,)
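
The split above does not set a seed, so each run of the notebook may place different rows in the train and test sets. The sketch below, on synthetic data, shows how the `random_state` parameter of `train_test_split` makes the split repeatable. The seed value 42 and the `*_demo` names are arbitrary choices for illustration, not something this notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the feature matrix and the target
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two splits with the same (arbitrary) seed
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# Every returned array matches between the two calls
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)
```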


In [5]:
# Instantiate the scaler and fit it to X_train
scaler = StandardScaler()
scaler = scaler.fit(X_train)
print(scaler)

StandardScaler()
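
Printing the fitted scaler only shows its parameters. As a small sketch on made-up numbers, the example below inspects what `fit` actually learned: the per-column mean (`mean_`) and standard deviation (`scale_`) of the training data. It then checks that `transform` applies `(x - mean_) / scale_` to new rows using those training statistics. The arrays and the `demo_scaler` name are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows with columns on very different scales
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[4.0, 400.0]])

demo_scaler = StandardScaler().fit(train)
print(demo_scaler.mean_)   # per-column means of the training data
print(demo_scaler.scale_)  # per-column standard deviations of the training data

# transform() standardizes new rows with the training statistics
manual = (test - demo_scaler.mean_) / demo_scaler.scale_
matches = np.allclose(manual, demo_scaler.transform(test))
print(matches)

# The scaled training columns have mean 0; the test rows generally do not
train_scaled = demo_scaler.transform(train)
print(train_scaled.mean(axis=0))
```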


In [6]:
# Transform the train data
X_train_scaled = scaler.transform(X_train)
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
print(X_train_scaled)

          Age         G        GS        MP        FG       FGA       FG%  \
0    1.174107 -0.225447 -1.284037 -0.057681  0.633261 -0.114741  1.466255   
1    0.161703  0.706545  1.299097  1.635966 -0.211394  0.488521 -1.153667   
2   -0.091398 -0.225447  0.916410  0.735608  1.875400  1.418549  0.792561   
3   -0.597600 -1.623435  0.342380 -0.359422  2.074143  1.368277  1.104457   
4    1.680309 -2.322430  0.055366 -1.142977  0.037034 -0.240420  0.480666   
..        ...       ...       ...       ...       ...       ...       ...   
211 -0.850701  0.007551 -0.518664 -0.189084 -0.459822 -0.114741 -0.791868   
212 -0.850701  0.706545  1.299097  1.056816  0.682947  0.940967 -0.280359   
213 -1.356902 -1.856434 -0.805679 -0.923971  0.434519 -0.441507  1.927861   
214  1.680309  0.473547 -0.805679  0.540936 -0.857306 -0.692866 -0.542352   
215 -0.597600  0.939543  1.394768  1.957174  2.272885  1.468821  1.254166   

           3P       3PA       3P%  ...  Signed Using_MLE  \
0   -1.323787 -

In [7]:
# Transform the test data
X_test_scaled = scaler.transform(X_test)
X_test_scaled = pd.DataFrame(X_test_scaled, columns=X_test.columns)
print(X_test_scaled)

         Age         G        GS        MP        FG       FGA       FG%  \
0  -0.344499 -0.458445 -1.284037 -0.933704 -1.006363 -1.044769 -0.068270   
1   0.667905  0.473547  1.203425  1.548363  1.130117  1.066647  0.206198   
2  -0.850701  1.172541 -0.614336  0.171059 -0.509507 -0.516915 -0.080746   
3   1.427208 -0.225447  0.916410  0.341397  0.533890  0.714744 -0.242932   
4   0.667905  0.473547  1.203425  0.594471 -0.261079 -0.893954  1.578538   
..       ...       ...       ...       ...       ...       ...       ...   
67 -1.610003 -0.924441  0.055366  0.404665  0.434519  0.714744 -0.405118   
68 -0.597600  0.706545 -0.997022  0.472801 -0.062337 -0.215284  0.243625   
69 -1.610003  0.706545  1.299097  0.954614 -0.310765  0.061211 -0.754441   
70 -0.597600 -0.225447 -1.188366 -1.172178 -0.459822 -0.215284 -0.617207   
71 -0.850701 -1.623435 -1.284037 -1.332782 -0.559193 -0.592323 -0.018367   

          3P       3PA       3P%  ...  Signed Using_MLE  \
0   0.054643 -0.205998  0.95
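
One detail worth noting: rebuilding the scaled arrays with only `columns=` gives the frames a fresh 0..n-1 index, as the printed output shows, while `y_train` and `y_test` keep the original row labels. Model fitting is unaffected because scikit-learn aligns rows by position, but any later join by label would need matching indexes. The sketch below, on made-up data, shows one way to carry the index over. Passing `index=` is a suggested variation rather than what this notebook does, and all names and values are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a non-sequential index, as in the original frame
idx = [118, 277, 364, 361, 305, 12, 45, 90]
X_demo = pd.DataFrame({"Age": [25, 24, 34, 29, 26, 31, 22, 28]}, index=idx)
y_demo = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], index=idx)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0  # arbitrary seed
)

sc = StandardScaler().fit(X_tr)

# Keep both the column names and the original row labels
X_tr_scaled = pd.DataFrame(sc.transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_scaled = pd.DataFrame(sc.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_scaled.index.equals(y_tr.index))
```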