## Pre-processing and Training Development

This notebook pre-processes the data so that it is ready for model fitting. Categorical features, such as 'Signed Using', will be replaced with dummy features. The data will also be split into train and test sets, and the numerical features will be standardized.

In [51]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [52]:
df = pd.read_csv('NBA data for preprocessing', index_col=0)
df.head()

Unnamed: 0,Player,Pos,Age,G,GS,MP,FG,FGA,FG%,3P,...,AST,STL,BLK,TOV,PF,PTS,Signed Using,Team_Value,Market Size,Salary (millions)
118,Aaron Gordon,PF,25,19,19,552,5.9,13.9,0.427,2.0,...,5.2,0.9,1.0,3.5,2.5,17.1,Early Bird Rights,1.46,small,19.863636
364,Al Horford,C,34,19,19,536,7.5,16.6,0.453,2.8,...,4.6,1.2,1.1,1.6,2.2,18.7,Cap Space,2.075,medium,28.0
361,Alec Burks,SG,29,18,3,459,5.0,12.7,0.395,2.8,...,3.0,1.2,0.4,1.6,2.4,15.6,Minimum Salary,2.075,medium,2.320044
305,Alex Caruso,PG,26,22,0,412,3.9,8.5,0.464,1.9,...,4.4,1.9,0.3,2.2,3.1,10.7,Room Exception,4.6,large,2.75
196,Alex Len,C,27,20,6,292,5.8,9.2,0.627,0.7,...,2.6,1.0,2.7,2.8,4.7,14.8,Room Exception,1.825,medium,4.16


The numerical features vary significantly in magnitude, as shown in the EDA notebook, so they need to be scaled. First, the data will be split into train/test sets. Then the categorical data will be separated out, the scaler will be fit on the numerical training data only, and the fitted scaler will be applied to both the training and testing sets. Fitting on the training data alone keeps information from the test set out of the scaling parameters. At the end, the categorical data will be added back into the train/test data.
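As a point of comparison, the same split, scale, and encode sequence can be sketched with scikit-learn's `ColumnTransformer`. This is not what the notebook does in the cells that follow; it is a minimal illustration on a made-up toy frame, and the names `toy`, `preprocess`, `X_tr_prepared`, and `X_te_prepared` are assumptions introduced only for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the NBA data (values are illustrative, not the real data set).
toy = pd.DataFrame({
    "Age": [25, 34, 29, 26, 27, 30, 22, 31],
    "PTS": [17.1, 18.7, 15.6, 10.7, 14.8, 8.7, 12.3, 9.9],
    "Pos": ["PF", "C", "SG", "PG", "C", "PF", "SG", "PG"],
    "Salary (millions)": [19.9, 28.0, 2.3, 2.75, 4.16, 7.0, 3.1, 5.5],
})

X = toy.drop(columns=["Salary (millions)"])
y = toy["Salary (millions)"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale the numerical columns and one-hot encode the categorical column.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "PTS"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Pos"]),
])

# Fit on the training split only, then apply the fitted transformer to both splits.
X_tr_prepared = preprocess.fit_transform(X_tr)
X_te_prepared = preprocess.transform(X_te)
print(X_tr_prepared.shape, X_te_prepared.shape)
```

Because the encoder is fit once on the training split, both outputs have the same number of columns by construction.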

In [53]:
# Create train/test split. X will be all data except salary (the target to predict) and player name. y will be salary.
X = df.drop(columns=['Salary (millions)', 'Player'])
y = df['Salary (millions)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(155, 29) (52, 29) (155,) (52,)
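Note that `train_test_split` shuffles the rows, so the rows that land in each split can change from run to run unless a seed is fixed. The sketch below shows, on a small made-up frame, that passing the same `random_state` gives the same split twice. The seed value 42 and the names `X_demo`, `y_demo`, `split_a`, and `split_b` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic data set, used only to demonstrate reproducibility.
X_demo = pd.DataFrame({"feature": np.arange(20)})
y_demo = pd.Series(np.arange(20) * 1.5, name="target")

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

# The training rows (element 0 of each result) are identical across the two calls.
same_rows = split_a[0].index.equals(split_b[0].index)
print(same_rows)
```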


In [54]:
# Separate the categorical data
X_train_obj = X_train.select_dtypes(include=['object'])
X_test_obj = X_test.select_dtypes(include=['object'])

X_train_obj.head()

Unnamed: 0,Pos,Signed Using,Market Size
375,PF,Cap Space,large
172,SG,MLE,small
350,C,MLE,medium
311,SG,Minimum Salary,large
307,C,Unknown,large


In [55]:
# Separate the numerical data
X_train_num = X_train.select_dtypes(include=['number'])
X_test_num = X_test.select_dtypes(include=['number'])

X_train_num.head()

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Team_Value
375,30,26,26,816,3.0,8.2,0.371,1.9,5.8,0.333,...,0.9,6.2,7.1,2.1,1.9,1.1,1.2,3.0,8.7,2.5
172,25,29,20,721,5.9,13.9,0.423,1.5,4.1,0.373,...,0.6,3.9,4.5,1.5,1.1,0.1,1.0,2.4,15.5,1.5
350,27,26,10,454,4.9,7.9,0.626,0.1,0.4,0.2,...,2.6,6.9,9.5,1.6,0.7,1.5,1.1,5.2,12.1,2.45
311,20,25,0,413,5.8,13.1,0.447,1.0,3.7,0.279,...,0.8,4.4,5.2,3.8,1.8,1.0,2.7,4.2,14.7,4.6
307,30,25,11,506,5.6,14.9,0.376,2.8,8.3,0.336,...,2.1,11.5,13.6,4.3,1.5,1.3,2.8,5.3,17.1,4.6


In [56]:
# Initiate the scaler and fit to X_train_num
scaler = StandardScaler()
scaler = scaler.fit(X_train_num)
print(scaler)

StandardScaler()
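For reference, `StandardScaler` subtracts each column's mean and divides by its standard deviation, using the population standard deviation (`ddof=0`). The sketch below checks this by hand on the `Age` and `MP` values of the first four rows of `df.head()`; the names `demo`, `demo_scaler`, and `manual` are introduced only for this check.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Age and MP for the first four players shown in df.head().
demo = pd.DataFrame({
    "Age": [25.0, 34.0, 29.0, 26.0],
    "MP": [552.0, 536.0, 459.0, 412.0],
})

demo_scaler = StandardScaler().fit(demo)
scaled = demo_scaler.transform(demo)

# Manual standardization: (x - mean) / population standard deviation.
manual = (demo - demo.mean()) / demo.std(ddof=0)

matches = np.allclose(scaled, manual.to_numpy())
print(matches)
```

The means and standard deviations are learned in `fit`, which is why the notebook calls `fit` on the training data only and then reuses the same scaler for the test data.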


In [57]:
# Transform the train data
X_train_num_scaled = scaler.transform(X_train_num)
X_train_num_scaled = pd.DataFrame(X_train_num_scaled, columns=X_train_num.columns)
print(X_train_num_scaled)

          Age         G        GS        MP        FG       FGA       FG%  \
0    0.398247  0.437332  1.298486  1.081792 -1.446446 -1.088630 -1.315116   
1   -0.955444  1.126687  0.723168  0.617892  0.053733  0.398390 -0.621455   
2   -0.413968  0.437332 -0.235695 -0.685910 -0.463570 -1.166894  2.086491   
3   -2.309136  0.207548 -1.194558 -0.886119  0.002002  0.189685 -0.301304   
4    0.398247  0.207548 -0.139808 -0.431986 -0.101458  0.659271 -1.248418   
..        ...       ...       ...       ...       ...       ...       ...   
150  0.127509  0.896902 -0.427467  0.676490  0.674496  0.372302  0.552433   
151  0.668986  1.126687  1.586145  0.520229  0.674496  0.372302  0.605792   
152 -0.143229  0.437332  0.339623 -0.046217 -0.825682 -1.323422  1.472868   
153 -0.413968 -0.022237  1.106714  0.393267  0.312384  0.346214 -0.034511   
154  1.481201  1.356471  0.051964  0.905998 -0.515300 -0.749485  0.472395   

           3P       3PA       3P%  ...       ORB       DRB       TRB  \
0  

In [58]:
# Transform the test data
X_test_num_scaled = scaler.transform(X_test_num)
X_test_num_scaled = pd.DataFrame(X_test_num_scaled, columns=X_test_num.columns)
print(X_test_num_scaled)

         Age         G        GS        MP        FG       FGA       FG%  \
0   0.127509  0.207548  1.202600  0.363968 -0.825682 -0.566868 -0.781530   
1   2.022677  0.667117 -0.906899  0.432332  0.415845  1.128856 -0.981625   
2   0.398247  0.667117  0.147851 -0.158529 -0.204919  0.085333 -0.568096   
3   0.668986  1.126687 -1.194558 -0.632195 -1.446446 -0.879925 -1.782003   
4  -0.955444 -0.022237  1.106714  0.515346  0.933148  1.728881 -0.968285   
5   0.668986 -0.941376  0.723168  0.153993 -0.567031  0.111421 -1.395154   
6  -0.143229  0.667117 -1.194558 -0.627312  0.157193  0.372302 -0.341323   
7  -0.955444 -0.252022 -1.194558 -1.652774 -1.187794 -0.540780 -1.701965   
8  -1.767659  0.896902 -0.715126  0.383501  0.674496  0.607094  0.138904   
9   1.210462 -0.941376 -1.098672 -1.374434  0.674496  0.815799 -0.141228   
10  0.939724 -1.171161 -1.098672 -0.993548 -0.877413 -0.436428 -1.208399   
11  1.481201  0.437332 -1.194558 -0.109698  0.519305  0.998415 -0.674813   
12 -0.413968

In [59]:
# create dummy features to replace categorical features, making sure to reset the index so it matches the numerical data
X_train_obj = pd.get_dummies(X_train_obj).reset_index(drop=True)
X_test_obj = pd.get_dummies(X_test_obj).reset_index(drop=True)
X_train_obj.head()

Unnamed: 0,Pos_C,Pos_PF,Pos_PG,Pos_SF,Pos_SG,Signed Using_Bi-annual Exception,Signed Using_Cap Space,Signed Using_Early Bird Rights,Signed Using_MLE,Signed Using_Maximum Salary,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Market Size_large,Market Size_medium,Market Size_small
0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0
1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,1
2,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0
3,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0


In [60]:
print(X_train_obj.shape, X_test_obj.shape)

(155, 18) (52, 16)


It looks like the training and testing data do not contain the same set of categories: the dummy-encoded training data has 18 columns, while the testing data has only 16. Before the data sets are used with models, they need to have the same columns in the same order.

In [61]:
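# Find any columns that are only included in one data set and add them to the other

for col in X_test_obj.columns:
    if col not in X_train_obj.columns:
        X_train_obj[col] = 0
for col in X_train_obj.columns:
    if col not in X_test_obj.columns:
        X_test_obj[col] = 0

# Put the test columns in the same order as the training columns
X_test_obj = X_test_obj[X_train_obj.columns]

# Check the dummy-feature frames (X_train and X_test are still unchanged at this point)
print(X_train_obj.shape, X_test_obj.shape)

(155, 19) (52, 19)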
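An alternative way to line up the dummy columns is `DataFrame.reindex`, sketched below on made-up categories. It behaves differently from the loop above: it treats the training columns as the reference, so a category that appears only in the test data (here `DPE`) is dropped rather than added to the training frame as a column of zeros. The names `train_cat`, `test_cat`, `train_dummies`, `test_dummies`, and `test_aligned` are hypothetical.

```python
import pandas as pd

# Toy categorical data: 'DPE' appears only in the test rows,
# while 'Cap Space' and 'Unknown' appear only in the training rows.
train_cat = pd.DataFrame({"Signed Using": ["Cap Space", "MLE", "Unknown"]})
test_cat = pd.DataFrame({"Signed Using": ["MLE", "DPE"]})

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)

# Reindex the test dummies on the training columns:
# - columns missing from the test data are added and filled with 0
# - columns that exist only in the test data are dropped
# - the column order matches the training data
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(test_aligned.columns) == list(train_dummies.columns))
```

Which behavior is preferable depends on the model. A column that is all zeros in training carries no information for fitting, so dropping test-only categories is a common choice, but it is a design decision rather than a requirement.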


In [62]:
# Concatenate the dummy features with the scaled numerical data
X_train = pd.concat([X_train_num_scaled, X_train_obj], axis=1)
X_test = pd.concat([X_test_num_scaled, X_test_obj], axis=1)
X_train.head()

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,...,Signed Using_Maximum Salary,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Market Size_large,Market Size_medium,Market Size_small,Signed Using_DPE
0,0.398247,0.437332,1.298486,1.081792,-1.446446,-1.08863,-1.315116,0.0,0.268125,-0.032078,...,0,0,0,0,0,0,1,0,0,0
1,-0.955444,1.126687,0.723168,0.617892,0.053733,0.39839,-0.621455,-0.363452,-0.363251,0.304443,...,0,0,0,0,0,0,0,0,1,0
2,-0.413968,0.437332,-0.235695,-0.68591,-0.46357,-1.166894,2.086491,-1.635535,-1.737424,-1.15101,...,0,0,0,0,0,0,0,1,0,0
3,-2.309136,0.207548,-1.194558,-0.886119,0.002002,0.189685,-0.301304,-0.817768,-0.51181,-0.486381,...,0,1,0,0,0,0,1,0,0,0
4,0.398247,0.207548,-0.139808,-0.431986,-0.101458,0.659271,-1.248418,0.817768,1.196621,-0.006839,...,0,0,0,0,0,1,1,0,0,0


In [63]:
# Save the train/test data for future use
X_train.to_csv('NBA X_train')
X_test.to_csv('NBA X_test')
y_train.to_csv('NBA y_train')
y_test.to_csv('NBA y_test')