# Pre-processing and Training Development

This notebook documents the pre-processing work that prepares the data for model development. Categorical features, such as 'Signed Using', will be replaced with dummy features. The data will also be split into train and test sets, and the numerical features will be standardized.

## Imports and Data

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
df = pd.read_csv('NBA Data post EDA', index_col=0)
df.head()

Unnamed: 0,Player,Pos,Age,G,GS,MP,FG,FGA,FG%,3P,...,STL,BLK,TOV,PF,PTS,Signed Using,Team,Salary (millions),Team Value (Billions),Market Size
163,Stephen Curry,PG,32,29,29,987,10.6,21.5,0.492,5.3,...,1.3,0.1,3.4,2.0,31.7,Early Bird Rights,GSW,40.231758,4.7,large
117,Chris Paul,PG,35,26,26,843,7.3,14.9,0.489,1.6,...,1.4,0.3,2.6,3.0,19.1,Early Bird Rights,PHO,38.506482,1.7,medium
193,Russell Westbrook,PG,32,19,19,631,7.9,19.4,0.406,1.3,...,0.9,0.4,5.4,3.1,20.5,Early Bird Rights,WAS,38.178,1.8,medium
308,John Wall,PG,30,19,19,591,8.6,19.5,0.441,2.4,...,1.2,0.8,4.0,1.6,23.6,Early Bird Rights,HOU,37.8,2.5,large
329,LeBron James,PG,36,29,29,1006,9.8,19.5,0.504,2.6,...,1.1,0.5,3.9,1.6,26.6,Cap Space,LAL,37.436858,4.6,large


## Train/Test Split

The numerical features vary significantly in magnitude, as shown in the EDA notebook, so they need to be scaled. The data is split into train/test sets first, so that the scaler is fit on training data only and nothing about the test set leaks into it. After the split, the categorical columns are separated out, the scaler is fit on the numerical training data, and the fitted scaler is applied to both the training and test sets. At the end, the categorical data is added back into the train/test data.
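
The fit-on-train, transform-both pattern can be sketched on toy data. This is a minimal illustration rather than the notebook's own cell: the `toy` frame, its column names, and its values are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two numeric columns on very different scales.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "minutes": rng.normal(600, 250, size=40),
    "fg_pct": rng.normal(0.45, 0.05, size=40),
})

toy_train, toy_test = train_test_split(toy, test_size=0.25, random_state=24)

# Fit on the training rows only, so test rows never influence the statistics.
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

# Training columns end up with mean 0 and unit variance by construction;
# test columns are only approximately standardized.
print(toy_train_scaled.mean(axis=0).round(6), toy_train_scaled.std(axis=0).round(6))
print(toy_test_scaled.mean(axis=0).round(3), toy_test_scaled.std(axis=0).round(3))
```

The test set is scaled with the training mean and standard deviation, which mirrors how unseen data would be handled once a model is in use.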

In [3]:
# Create the train/test split. X is every column except salary (the target) and player name; y is salary.
X = df.drop(columns=['Salary (millions)', 'Player'])
y = df['Salary (millions)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=24)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(238, 30) (80, 30) (238,) (80,)
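
The 238/80 split can be reproduced by hand. The sketch below assumes scikit-learn rounds the test count up when `test_size` is a fraction, and the final lines cross-check that assumption on a placeholder list of row ids (`row_ids` is not part of the notebook).

```python
import math
from sklearn.model_selection import train_test_split

n_samples = 238 + 80   # total rows, taken from the shapes printed above
test_size = 0.25

n_test = math.ceil(test_size * n_samples)   # 79.5 is rounded up
n_train = n_samples - n_test                # the remaining rows
print(n_train, n_test)

# Cross-check against train_test_split on a placeholder list of row ids.
row_ids = list(range(n_samples))
train_ids, test_ids = train_test_split(row_ids, test_size=test_size, random_state=24)
print(len(train_ids), len(test_ids))
```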


## Separate Categorical and Numerical Data for Processing

In [4]:
# Separate the categorical data
X_train_obj = X_train.select_dtypes(include=['object'])
X_test_obj = X_test.select_dtypes(include=['object'])

X_train_obj.head()

Unnamed: 0,Pos,Signed Using,Team,Market Size
312,PG,1st Round Pick,DAL,medium
268,SG,MLE,PHI,medium
170,PG,Minimum Salary,GSW,large
324,PF,1st Round Pick,LAL,large
306,SG,Minimum Salary,HOU,large


In [5]:
# Separate the numerical data
X_train_num = X_train.select_dtypes(include=['number'])
X_test_num = X_test.select_dtypes(include=['number'])

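# Show every column; the full option name avoids an ambiguous-key error in newer pandas versions
pd.set_option('display.max_columns', None)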
X_train_num.head()

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Team Value (Billions)
312,21,27,27,954,10.2,21.5,0.475,2.5,7.3,0.335,7.8,14.2,0.546,6.7,8.5,0.795,0.9,7.9,8.8,9.6,1.0,0.7,4.3,2.2,29.6,2.45
268,30,22,22,637,5.8,11.5,0.5,2.6,5.4,0.479,3.2,6.1,0.519,2.2,2.3,0.975,0.3,2.1,2.4,3.3,1.0,0.3,1.8,2.1,16.3,2.075
170,31,29,0,465,3.0,9.0,0.336,0.8,3.5,0.222,2.2,5.5,0.408,2.9,3.3,0.884,0.5,3.6,4.0,6.0,1.8,0.3,1.9,3.3,9.8,4.7
324,27,23,23,755,9.7,18.3,0.533,0.8,2.8,0.293,8.9,15.5,0.575,4.4,6.2,0.715,2.2,7.0,9.2,3.3,1.5,2.0,2.1,2.0,24.7,4.6
306,27,20,2,303,5.1,13.7,0.374,3.3,9.9,0.337,1.8,3.8,0.469,1.5,2.1,0.722,0.8,4.3,5.1,2.3,1.4,0.1,2.5,3.1,15.1,2.5


## Scale Numerical Data

In [6]:
# Instantiate the scaler and fit it to X_train_num
scaler = StandardScaler()
scaler = scaler.fit(X_train_num)
print(scaler)

StandardScaler()
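
Printing the scaler only shows its constructor. What it learned is stored in its `mean_` and `scale_` attributes, which can be inspected as in the sketch below. The `toy_num` frame is a stand-in built from the five `Age` and `G` values displayed above, not the full training set.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for X_train_num, using the five Age and G values shown above.
toy_num = pd.DataFrame({"Age": [21, 30, 31, 27, 27], "G": [27, 22, 29, 23, 20]})
toy_scaler = StandardScaler().fit(toy_num)

# mean_ and scale_ hold the per-column mean and population standard deviation.
print(dict(zip(toy_num.columns, toy_scaler.mean_)))
print(dict(zip(toy_num.columns, toy_scaler.scale_)))

# transform() computes (x - mean_) / scale_ column by column.
manual = (toy_num.to_numpy() - toy_scaler.mean_) / toy_scaler.scale_
print(np.allclose(manual, toy_scaler.transform(toy_num)))
```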


In [7]:
# Transform the train data
X_train_num_scaled = scaler.transform(X_train_num)
X_train_num_scaled = pd.DataFrame(X_train_num_scaled, columns=X_train_num.columns)
print(X_train_num_scaled)

          Age         G        GS        MP        FG       FGA       FG%  \
0   -1.470509  0.829981  1.345483  1.506218  2.123839  2.239854  0.113309   
1    0.808674  0.002087  0.897741  0.336692 -0.058347 -0.308233  0.445095   
2    1.061916  1.161138 -1.072323 -0.297877 -1.447012 -0.945255 -1.731420   
3    0.048946  0.167666  0.987289  0.772036  1.875864  1.424466  0.883052   
4    0.048946 -0.329070 -0.893226 -0.895552 -0.405513  0.252346 -1.227106   
..        ...       ...       ...       ...       ...       ...       ...   
233 -1.470509 -1.488121 -1.072323 -1.555947  0.239224 -0.435637  1.493538   
234 -1.217267  0.167666  0.987289  0.974951  1.826269  2.188893 -0.231748   
235  0.555431  0.995560 -0.355936  0.690870  0.586390  0.277827  0.591081   
236 -1.470509 -0.991385 -0.445484 -0.762735 -0.256728 -0.231790 -0.099034   
237  0.302189  0.664402 -0.982774  0.285041 -0.951060 -1.200063  0.498181   


In [8]:
# Transform the test data
X_test_num_scaled = scaler.transform(X_test_num)
X_test_num_scaled = pd.DataFrame(X_test_num_scaled, columns=X_test_num.columns)
print(X_test_num_scaled)

         Age         G        GS        MP        FG       FGA       FG%  \
0  -0.710781  1.161138  1.435031  1.495150  1.280722  1.500909 -0.152120   
1  -0.457539  0.995560 -0.982774  0.237080  1.181532  0.558117  1.188295   
2   0.808674  0.333245  0.360451  0.458441 -0.752679 -0.308233 -1.160749   
3   2.328129  0.498823 -1.072323 -0.331081 -0.455108  0.124942 -1.187292   
4   1.568401 -0.163492  0.539548 -0.238847 -0.058347 -0.027943 -0.138848   
..       ...       ...       ...       ...       ...       ...       ...   
75 -1.217267  0.002087 -1.072323 -1.054194 -1.149441 -1.047178 -0.629891   
76 -0.457539 -0.991385 -0.714129 -1.460023 -0.058347 -0.308233  0.445095   
77 -0.710781 -2.316015 -1.072323 -1.828959 -0.802275 -0.486599 -0.882049   
78 -0.457539  0.333245  1.076838  0.569122  0.834366  1.602833 -0.921863   
79  0.302189  0.995560  0.360451  0.576500  0.536795 -0.180829  1.520081   


## Create Dummy Features for Categorical Data

In [9]:
# create dummy features to replace categorical features, making sure to reset the index so it matches the numerical data
X_train_obj = pd.get_dummies(X_train_obj).reset_index(drop=True)
X_test_obj = pd.get_dummies(X_test_obj).reset_index(drop=True)
X_train_obj.head(1)

Unnamed: 0,Pos_C,Pos_PF,Pos_PG,Pos_SF,Pos_SG,Signed Using_1st Round Pick,Signed Using_Bi-Annual Exception,Signed Using_Cap Space,Signed Using_Early Bird Rights,Signed Using_MLE,Signed Using_Maximum Salary,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Team_ATL,Team_BOS,Team_BRK,Team_CHI,Team_CHO,Team_CLE,Team_DAL,Team_DEN,Team_DET,Team_GSW,Team_HOU,Team_IND,Team_LAC,Team_LAL,Team_MEM,Team_MIA,Team_MIL,Team_MIN,Team_NOP,Team_NYK,Team_OKC,Team_ORL,Team_PHI,Team_PHO,Team_POR,Team_SAC,Team_SAS,Team_TOR,Team_UTA,Team_WAS,Market Size_large,Market Size_medium,Market Size_small
0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


In [10]:
# Check the shape
print(X_train_obj.shape, X_test_obj.shape)

(238, 49) (80, 47)


The training data and test data do not have the same number of dummy features (49 versus 47). Some category values ended up only in the training set when the data was split, so `get_dummies` created no column for them in the test set. Models need both data sets to have the same columns, so the missing columns must be added and filled with 0.
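
To see which dummy columns are missing from each side, the column indexes can be compared directly. The sketch below uses two invented frames (`toy_train_obj`, `toy_test_obj`) rather than the real split.

```python
import pandas as pd

# Hypothetical toy split: 'SAS' and 'LAL' appear only in train, 'UTA' only in test.
toy_train_obj = pd.DataFrame({"Team": ["GSW", "SAS", "LAL"]})
toy_test_obj = pd.DataFrame({"Team": ["GSW", "UTA"]})

toy_train_dum = pd.get_dummies(toy_train_obj)
toy_test_dum = pd.get_dummies(toy_test_obj)

# Set differences on the column indexes list the dummies missing from each side.
only_in_train = toy_train_dum.columns.difference(toy_test_dum.columns)
only_in_test = toy_test_dum.columns.difference(toy_train_dum.columns)
print(list(only_in_train))
print(list(only_in_test))
```

Running the same two `difference` calls on `X_train_obj` and `X_test_obj` should name the columns behind the 49 versus 47 gap.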

In [11]:
# Find any dummy features that are only included in one data set and add them to the other

for col in X_test_obj.columns:
    if col not in X_train_obj.columns:
        X_train_obj[col] = 0
for col in X_train_obj.columns:
    if col not in X_test_obj.columns:
        X_test_obj[col] = 0
        
print(X_train_obj.shape, X_test_obj.shape)

(238, 49) (80, 49)
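
An alternative to the loops above is to reindex both frames onto the union of their columns. This is a sketch of a different technique on invented toy frames, not a change to the cell above. Passing `dtype=int` is an assumption made here to keep 0/1 values, since newer pandas versions return booleans from `get_dummies` by default.

```python
import pandas as pd

# Hypothetical toy frames with partly different categories.
toy_train_dum = pd.get_dummies(pd.DataFrame({"Team": ["GSW", "SAS", "LAL"]}), dtype=int)
toy_test_dum = pd.get_dummies(pd.DataFrame({"Team": ["GSW", "UTA"]}), dtype=int)

# Union of both column sets; Index.union returns them in sorted order.
all_cols = toy_train_dum.columns.union(toy_test_dum.columns)

# reindex adds any missing column, filled with 0, and applies one shared order.
toy_train_aligned = toy_train_dum.reindex(columns=all_cols, fill_value=0)
toy_test_aligned = toy_test_dum.reindex(columns=all_cols, fill_value=0)
print(toy_train_aligned.shape, toy_test_aligned.shape)
print(list(toy_test_aligned.columns))
```

Because the union is already sorted, this approach also leaves both frames with the same column order.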


In [12]:
# Sort columns so they appear in the same order in both the train and test set
X_train_obj = X_train_obj.sort_index(axis=1)
X_test_obj = X_test_obj.sort_index(axis=1)

## Combine Data

In [13]:
# Concatenate the dummy features with the scaled numerical data
X_train = pd.concat([X_train_num_scaled, X_train_obj], axis=1)
X_test = pd.concat([X_test_num_scaled, X_test_obj], axis=1)
X_train.head(1)

Unnamed: 0,Age,G,GS,MP,FG,FGA,FG%,3P,3PA,3P%,2P,2PA,2P%,FT,FTA,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Team Value (Billions),Market Size_large,Market Size_medium,Market Size_small,Pos_C,Pos_PF,Pos_PG,Pos_SF,Pos_SG,Signed Using_1st Round Pick,Signed Using_Bi-Annual Exception,Signed Using_Cap Space,Signed Using_Early Bird Rights,Signed Using_MLE,Signed Using_Maximum Salary,Signed Using_Minimum Salary,Signed Using_Non-Bird Exception,Signed Using_Room Exception,Signed Using_Sign and Trade,Signed Using_Unknown,Team_ATL,Team_BOS,Team_BRK,Team_CHI,Team_CHO,Team_CLE,Team_DAL,Team_DEN,Team_DET,Team_GSW,Team_HOU,Team_IND,Team_LAC,Team_LAL,Team_MEM,Team_MIA,Team_MIL,Team_MIN,Team_NOP,Team_NYK,Team_OKC,Team_ORL,Team_PHI,Team_PHO,Team_POR,Team_SAC,Team_SAS,Team_TOR,Team_UTA,Team_WAS
0,-1.470509,0.829981,1.345483,1.506218,2.123839,2.239854,0.113309,0.621737,0.881972,0.071322,1.772389,1.762487,0.273097,2.485888,2.572961,0.223583,-0.504889,1.34045,0.688509,2.772487,-0.277269,-0.039405,2.64977,-0.711839,2.396014,0.200401,0,1,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
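
The earlier `reset_index(drop=True)` on the dummy features matters because `pd.concat` with `axis=1` aligns rows by index label, not by position. The sketch below shows what happens without it, using invented values and the first three index labels from the training output.

```python
import pandas as pd

# The scaled numbers carry a fresh 0..n-1 index, while the dummies
# still carry the shuffled index labels from the split.
toy_num_scaled = pd.DataFrame({"Age": [-1.2, 0.4, 0.8]})
toy_dummies = pd.DataFrame({"Pos_PG": [1, 0, 1]}, index=[312, 268, 170])

misaligned = pd.concat([toy_num_scaled, toy_dummies], axis=1)
aligned = pd.concat([toy_num_scaled, toy_dummies.reset_index(drop=True)], axis=1)

# Mismatched labels produce extra rows padded with NaN.
print(misaligned.shape, int(misaligned.isna().sum().sum()))
print(aligned.shape, int(aligned.isna().sum().sum()))
```

An alternative that keeps the original labels is to pass `index=X_train_num.index` when rebuilding the scaled DataFrame, in which case neither side needs a reset.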


In [14]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(238, 75) (80, 75) (238,) (80,)


## Save Processed Data

In [15]:
# Save the train/test data for future use
X_train.to_csv('NBA X_train')
X_test.to_csv('NBA X_test')
y_train.to_csv('NBA y_train')
y_test.to_csv('NBA y_test')
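
When these files are loaded in a later notebook, the saved index becomes the first CSV column. The round trip below is a sketch on invented stand-in data, written to a temporary directory with hypothetical file names, showing one way to read the features and the target back.

```python
import os
import tempfile

import pandas as pd

# Hypothetical small stand-ins for X_train and y_train.
toy_X = pd.DataFrame({"Age": [-1.47, 0.81], "Pos_PG": [1, 0]})
toy_y = pd.Series([1.5, 2.5], index=[312, 268], name="Salary (millions)")

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "toy X_train")
    y_path = os.path.join(tmp, "toy y_train")
    toy_X.to_csv(x_path)
    toy_y.to_csv(y_path)

    # index_col=0 restores the saved index instead of adding an 'Unnamed: 0' column.
    X_back = pd.read_csv(x_path, index_col=0)
    # squeeze('columns') turns the one-column frame back into a Series.
    y_back = pd.read_csv(y_path, index_col=0).squeeze("columns")

print(X_back.shape, type(y_back).__name__)
```

Note that `X_train` was rebuilt with a fresh 0..n-1 index while `y_train` keeps the index labels from the split, so after reloading the two should be paired by row position rather than by label.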