## Pre-processing and Training Development

This notebook handles the pre-processing needed to prepare the data for model fitting. Categorical features, such as 'Signed Using', will be replaced with dummy features. The numerical features will be standardized, and train and test splits will be created.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('NBA data for preprocessing', index_col=0)
df.head()

Unnamed: 0,Player,Pos,Age,G,GS,MP,FG,FGA,FG%,3P,...,AST,STL,BLK,TOV,PF,PTS,Signed Using,Team_Value,Market Size,Salary (millions)
118,Aaron Gordon,PF,25,19,19,552,5.9,13.9,0.427,2.0,...,5.2,0.9,1.0,3.5,2.5,17.1,Early Bird Rights,1.46,small,19.863636
277,Aaron Holiday,PG,24,29,6,576,5.1,13.7,0.37,1.9,...,3.3,0.9,0.2,1.4,3.1,13.4,1st Round Pick,1.55,small,2.2392
364,Al Horford,C,34,19,19,536,7.5,16.6,0.453,2.8,...,4.6,1.2,1.1,1.6,2.2,18.7,Cap Space,2.075,medium,28.0
361,Alec Burks,SG,29,18,3,459,5.0,12.7,0.395,2.8,...,3.0,1.2,0.4,1.6,2.4,15.6,Minimum Salary,2.075,medium,2.320044
305,Alex Caruso,PG,26,22,0,412,3.9,8.5,0.464,1.9,...,4.4,1.9,0.3,2.2,3.1,10.7,Room Exception,4.6,large,2.75


The numerical features vary significantly in magnitude, as shown in the EDA notebook, so they need to be scaled. First, the data will be split into train/test sets. Next, the categorical data will be separated from the numerical data. The scaler will be fit on the numerical training data only, so that no information from the test set leaks into the scaling parameters, and then applied to both the training and testing sets. Finally, the categorical data, encoded as dummy features, will be added back to the train/test data.
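
For comparison, the same sequence of steps (split, scale the numerical columns using training statistics, encode the categorical columns) can be sketched with scikit-learn's `ColumnTransformer`. This is only an illustrative alternative on made-up toy data, not the approach used in this notebook. The names `toy`, `num_cols`, `cat_cols`, and `preprocess` are hypothetical, and `handle_unknown="ignore"`, `sparse_threshold=0`, and `random_state=0` are choices made for the sketch.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values, not the NBA dataset)
toy = pd.DataFrame({
    "Age": [25, 24, 34, 29, 26, 31, 28, 23],
    "PTS": [17.1, 13.4, 18.7, 15.6, 10.7, 17.5, 12.9, 14.8],
    "Pos": ["PF", "PG", "C", "SG", "PG", "C", "SG", "SF"],
})
toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=0)

num_cols = ["Age", "PTS"]  # numerical columns to standardize
cat_cols = ["Pos"]         # categorical columns to one-hot encode

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Fit on the training rows only, then apply the same fitted transform to the test rows
train_arr = preprocess.fit_transform(toy_train)
test_arr = preprocess.transform(toy_test)
print(train_arr.shape, test_arr.shape)
```

Because the transformer is fit once on the training rows, the test array always has the same columns as the training array, even if a category is absent from one of the splits.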

In [3]:
# Create train/test split. X will be all data except salary (the prediction target) and player name. y will be salary.
X = df.drop(columns=['Salary (millions)', 'Player'])
y = df['Salary (millions)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(216, 29) (72, 29) (216,) (72,)
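
The split above is not seeded, so rerunning the cell selects different rows each time. If reproducibility matters, one option is to pass `random_state` to `train_test_split`. The sketch below uses toy data, and the seed value 42 is an arbitrary assumption.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and target (hypothetical values)
X_demo = pd.DataFrame({"a": range(12), "b": range(12, 24)})
y_demo = pd.Series(range(12), name="target")

# Two calls with the same seed return the same rows
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

same_rows = split_a[0].index.equals(split_b[0].index)
print(same_rows)
print(split_a[0].shape, split_a[1].shape)
```

Note that the split keeps the original row labels, and the feature and target pieces share them.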


In [4]:
# Separate the categorical data
X_train_obj = X_train.select_dtypes(include=['object'])
X_test_obj = X_test.select_dtypes(include=['object'])

X_train_obj.head()

Unnamed: 0,Pos,Signed Using,Market Size
139,C,Cap Space,medium
359,SG,Sign and Trade,medium
361,SG,Minimum Salary,medium
327,SF,Cap Space,small
395,PG,Minimum Salary,large


In [5]:
# Separate the numerical data
X_train_num = X_train.select_dtypes(include=['number'])
X_test_num = X_test.select_dtypes(include=['number'])

X_train_num.head()

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,...,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Team_Value
139,31,29,29,701,7.1,13.8,0.515,1.5,4.0,0.377,...,2.8,7.1,9.9,2.6,0.4,1.8,2.0,2.8,17.5,2.15
359,28,28,23,823,4.5,10.0,0.456,1.0,3.0,0.353,...,1.4,4.3,5.7,6.0,1.7,0.7,1.5,1.3,12.9,2.45
361,29,18,3,459,5.0,12.7,0.395,2.8,6.9,0.409,...,0.7,5.8,6.5,3.0,1.2,0.4,1.6,2.4,15.6,2.075
327,23,27,0,417,4.9,12.9,0.38,3.2,9.9,0.322,...,0.6,3.7,4.3,2.8,1.5,0.4,1.6,2.0,14.8,1.45
395,26,28,1,303,4.8,10.0,0.476,3.8,8.6,0.444,...,0.1,3.0,3.1,1.4,0.7,0.7,0.6,3.2,14.1,4.7


In [6]:
# Instantiate the scaler and fit it to X_train_num
scaler = StandardScaler()
scaler = scaler.fit(X_train_num)
print(scaler)

StandardScaler()
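
To make concrete what fitting stores, the toy sketch below (made-up numbers, hypothetical variable names) shows that a fitted `StandardScaler` keeps the per-column mean and standard deviation of the training data in `mean_` and `scale_`, and that `transform` applies those training statistics to any data it is given, including test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training and test arrays with columns on very different scales
train_demo = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_demo = np.array([[4.0, 400.0]])

demo_scaler = StandardScaler().fit(train_demo)
print(demo_scaler.mean_)   # per-column training means
print(demo_scaler.scale_)  # per-column training standard deviations

# transform() is (x - training mean) / training standard deviation
manual = (test_demo - demo_scaler.mean_) / demo_scaler.scale_
matches = np.allclose(manual, demo_scaler.transform(test_demo))
print(matches)
```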


In [7]:
# Transform the train data
X_train_num_scaled = scaler.transform(X_train_num)
X_train_num_scaled = pd.DataFrame(X_train_num_scaled, columns=X_train_num.columns)
print(X_train_num_scaled)

          Age         G        GS        MP        FG       FGA       FG%  \
0    0.979147  1.183168  1.487375  0.465414  0.600371  0.275560  0.595997   
1    0.234444  0.951888  0.919421  1.045053 -0.736537 -0.694753 -0.163618   
2    0.482678 -1.360911 -0.973762 -0.684362 -0.479440 -0.005320 -0.948982   
3   -1.006729  0.720608 -1.257739 -0.883910 -0.530859  0.045749 -1.142104   
4   -0.262025  0.951888 -1.163080 -1.425540 -0.582279 -0.694753  0.093879   
..        ...       ...       ...       ...       ...       ...       ...   
211 -0.013791 -1.823471 -0.311148 -1.206987 -0.839377 -1.282047  1.175364   
212 -0.758494  0.720608  1.298057  1.035550  0.703210  0.939459 -0.279491   
213  1.475616  0.258048 -0.500466  0.294373 -0.222342 -0.516011  0.634621   
214  0.730913  0.720608  1.298057  1.738719  1.731601  1.986375 -0.227992   
215 -0.013791 -0.898351 -0.689785 -1.363775 -0.016664  0.096819 -0.215117   

           3P       3PA       3P%  ...       ORB       DRB       TRB  \
0  

In [8]:
# Transform the test data
X_test_num_scaled = scaler.transform(X_test_num)
X_test_num_scaled = pd.DataFrame(X_test_num_scaled, columns=X_test_num.columns)
print(X_test_num_scaled)

         Age         G        GS        MP        FG       FGA       FG%  \
0  -1.503198  0.720608  1.298057  0.992790  0.240434  0.250026 -0.009120   
1   0.234444  0.489328  0.730102  0.094825 -0.222342 -0.771356  1.316987   
2  -0.510260  0.720608 -0.973762  0.465414 -0.068083 -0.235131  0.261252   
3  -0.013791  0.720608  1.298057  0.745731 -0.736537 -1.358651  1.728981   
4   1.723851  0.489328 -1.257739 -0.147483  0.446112  0.888390 -0.639986   
..       ...       ...       ...       ...       ...       ...       ...   
67 -1.254964  0.720608  1.298057  1.249352  0.446112 -0.286200  1.561609   
68  0.979147 -0.667071 -1.068421 -0.375538  0.291854  0.454302 -0.240866   
69  1.475616 -0.204511  0.919421  0.337133  0.548951  0.709648 -0.240866   
70 -0.262025  0.489328 -0.216489 -0.275764 -2.022026 -1.537393 -2.249339   
71  0.730913  0.489328  1.203398  1.011795 -1.507831 -1.154374 -1.257977   

          3P       3PA       3P%  ...       ORB       DRB       TRB       AST  \
0  -0.

In [9]:
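# Create dummy features to replace the categorical features, resetting the index so it matches the scaled numerical data.
# The test dummies are reindexed to the training columns so a category missing from either split cannot cause a column mismatch.
X_train_obj = pd.get_dummies(X_train_obj).reset_index(drop=True)
X_test_obj = pd.get_dummies(X_test_obj).reindex(columns=X_train_obj.columns, fill_value=0).reset_index(drop=True)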
X_train_obj.head()

Unnamed: 0,Pos_C,Pos_PF,Pos_PG,Pos_SF,Pos_SG,Signed Using_1st Round Pick,Signed Using_Bi-annual Exception,Signed Using_Cap Space,Signed Using_DPE,Signed Using_Early Bird Rights,Signed Using_MLE,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Market Size_large,Market Size_medium,Market Size_small
0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0
2,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0
3,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1
4,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0
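
When `pd.get_dummies` is called separately on two data sets, the resulting columns match only if both contain the same categories. The toy sketch below (made-up data, hypothetical names) shows how the columns can differ and how reindexing the test dummies to the training columns lines them up.

```python
import pandas as pd

train_cat = pd.DataFrame({"Pos": ["C", "SG", "SF", "PG"]})
test_cat = pd.DataFrame({"Pos": ["SG", "PF"]})  # "PF" never appears in train_cat

train_dummies = pd.get_dummies(train_cat)
test_dummies = pd.get_dummies(test_cat)
print(list(train_dummies.columns))
print(list(test_dummies.columns))

# Keep exactly the training columns: missing ones are filled with 0,
# and columns seen only in the test data are dropped
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_aligned.columns))
```

A row whose category was never seen in training ends up with zeros in every dummy column, which is usually the safest default for a model trained on the training columns.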


In [10]:
# Concatenate the dummy features with the scaled numerical data
X_train = pd.concat([X_train_num_scaled, X_train_obj], axis=1)
X_test = pd.concat([X_test_num_scaled, X_test_obj], axis=1)
X_train.head()

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,...,Signed Using_Early Bird Rights,Signed Using_MLE,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Market Size_large,Market Size_medium,Market Size_small
0,0.979147,1.183168,1.487375,0.465414,0.600371,0.27556,0.595997,-0.343124,-0.408898,0.398759,...,0,0,0,0,0,0,0,0,1,0
1,0.234444,0.951888,0.919421,1.045053,-0.736537,-0.694753,-0.163618,-0.793943,-0.773113,0.185517,...,0,0,0,0,0,1,0,0,1,0
2,0.482678,-1.360911,-0.973762,-0.684362,-0.47944,-0.00532,-0.948982,0.829007,0.647324,0.683082,...,0,0,1,0,0,0,0,0,1,0
3,-1.006729,0.720608,-1.257739,-0.88391,-0.530859,0.045749,-1.142104,1.189662,1.739967,-0.08992,...,0,0,0,0,0,0,0,0,0,1
4,-0.262025,0.951888,-1.16308,-1.42554,-0.582279,-0.694753,0.093879,1.730645,1.266488,0.99406,...,0,0,1,0,0,0,0,1,0,0
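
The index reset matters because `pd.concat` with `axis=1` aligns rows by index label rather than by position. The toy sketch below (made-up values, hypothetical names) shows what happens when one frame has a fresh 0..n-1 index and the other still carries the original row labels.

```python
import pandas as pd

scaled = pd.DataFrame({"Age": [0.5, -0.5]})                  # fresh 0..n-1 index
dummies = pd.DataFrame({"Pos_C": [1, 0]}, index=[139, 359])  # original row labels

# Labels do not overlap, so the result has four rows padded with NaN
misaligned = pd.concat([scaled, dummies], axis=1)

# After resetting the index, rows pair up by position as intended
aligned = pd.concat([scaled, dummies.reset_index(drop=True)], axis=1)
print(misaligned.shape, aligned.shape)
```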


In [11]:
# Save the train/test data for future use
X_train.to_csv('NBA X_train')
X_test.to_csv('NBA X_test')
y_train.to_csv('NBA y_train')
y_test.to_csv('NBA y_test')
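
When these files are loaded in a later notebook, the saved index comes back as an ordinary column unless `index_col=0` is passed, and a saved Series is read back as a one-column DataFrame. The sketch below illustrates a round trip on toy data written to a temporary directory. The file names and values are hypothetical, and the features and target are matched by row order, since only the target keeps the original row labels.

```python
import os
import tempfile

import pandas as pd

X_demo = pd.DataFrame({"Age": [0.5, -0.5], "Pos_C": [1, 0]})
y_demo = pd.Series([19.86, 2.24], index=[139, 359], name="Salary (millions)")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_demo.csv")
    y_path = os.path.join(tmp, "y_demo.csv")
    X_demo.to_csv(x_path)
    y_demo.to_csv(y_path)

    # index_col=0 restores the saved index instead of adding an "Unnamed: 0" column
    X_back = pd.read_csv(x_path, index_col=0)
    # squeeze("columns") turns the one-column DataFrame back into a Series
    y_back = pd.read_csv(y_path, index_col=0).squeeze("columns")

print(X_back.shape, list(y_back.index))
```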