# Preprocessing Module:

Goal: Create a standard preprocessing pipeline 'block' that all modeling techniques will use, so that model comparisons are 'apples to apples'. The input is the raw data; the output is preprocessed data that is ready for cross-validation.

Repeat steps in other notebook.

## Outline of Notebook

Here is the outline of the steps performed in this notebook. There will be two separate pipelines: one for the numeric data and one for the categorical data. First, we fix data format issues, as explained below. Then we apply each pipeline to its own data type and combine the two into a single pipeline. The combined pipeline outputs the transformed data, which is then used with every modeling technique so that the data is identical across all model comparisons.


- Format the target classes, format the 'pitch_name' classes, and set aside the 'player_name' feature.
    - There are three classes in the target: "blocked_ball", "ball", and "called_strike". Since a blocked ball is the same thing as a ball, we group those two classes into one class, "ball".
    - There are only a handful of rows with the class "Eephus" in 'pitch_name' (almost no pitchers throw this pitch regularly), so we drop those rows. The class "Knuckle Curve" is essentially the same thing as "Curveball", so we group these two into one class, "Curveball".
    - 'player_name' is not needed for modeling, as it is simply an identifying feature. We disregard it for modeling purposes but keep it for future tasks.
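
The formatting steps above can be sketched on a few made-up rows. This is only an illustration: the `toy` frame and its values are hypothetical stand-ins for the raw Statcast data, and the 0/1 encoding of the target follows the function defined later in this notebook.

```python
import pandas as pd

# Hypothetical toy rows standing in for the raw Statcast data.
toy = pd.DataFrame({
    "description": ["ball", "blocked_ball", "called_strike", "ball"],
    "pitch_name": ["Knuckle Curve", "Eephus", "Curveball", "Slider"],
    "player_name": ["A", "B", "C", "D"],
})

# Merge "blocked_ball" into "ball" and encode the target as 0 (ball) / 1 (called strike).
toy["description"] = toy["description"].map(
    {"blocked_ball": 0, "ball": 0, "called_strike": 1}
)

# Fold "Knuckle Curve" into "Curveball", then drop the rare "Eephus" rows.
toy["pitch_name"] = toy["pitch_name"].replace("Knuckle Curve", "Curveball")
toy = toy[toy["pitch_name"] != "Eephus"]

# Set the identifier aside instead of passing it to a model.
player_names = toy.pop("player_name")
```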

Now for the two separate pipelines:

1. #### Numeric Data Pipeline: 
    - Impute any missing values using median imputation
    - Scale data using scikit-learn's StandardScaler() class
    

2. #### Categorical Data Pipeline (only the 'pitch_name' feature):
    - Handle any missing values by flagging them with an indicator column (`dummy_na=True` in `pd.get_dummies`)
    - One-hot encode to create dummy variables for each class of the feature


3. #### Combine both pipelines using scikit-learn's ColumnTransformer class.
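
The three steps in this outline can be sketched with scikit-learn as follows. This is a minimal sketch on made-up values, not the notebook's own function: the `toy` frame, the constant "missing" fill for the categorical column, and `sparse_threshold=0` (which forces a dense array) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns and one categorical column, with gaps.
toy = pd.DataFrame({
    "release_speed": [95.0, np.nan, 88.0, 91.0],
    "release_spin_rate": [2300.0, 2100.0, np.nan, 2500.0],
    "pitch_name": ["Slider", "Curveball", np.nan, "Slider"],
})
numeric_features = ["release_speed", "release_spin_rate"]
cat_features = ["pitch_name"]

# 1. Numeric pipeline: median imputation, then standardization.
num_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# 2. Categorical pipeline. Assumption: missing categories get their own
#    "missing" label before one-hot encoding.
cat_pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# 3. Route each column group to its pipeline and stack the results.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", num_pipe, numeric_features),
        ("cat", cat_pipe, cat_features),
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

One reason to prefer this form is that the whole `preprocessor` can be placed in front of a model inside a single `Pipeline`, so the imputation and scaling statistics are re-fit on the training portion of each cross-validation fold.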

 
        


In [69]:
def preprocess_data(raw_data_file):
    '''Read the raw baseball file (e.g. 'Statcast_data.csv') into a pandas
       dataframe, go through the preprocessing steps, and return the data in a
       form suitable for training machine learning models.'''
    
    import pandas as pd
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    
    bsb = pd.read_csv(raw_data_file, index_col = 0)
    
    #format the classes of the target, then set the target aside
    #so that it is not transformed. It is stored as a variable and
    #added back after all transformations have taken place.
    bsb['description'] = bsb['description'].replace({'blocked_ball': 0, 'ball': 0, "called_strike": 1})
    target = bsb['description']
    bsb = bsb.drop(columns = 'description')
    
    #replace Knuckle Curve with Curveball
    bsb['pitch_name'] = bsb['pitch_name'].replace('Knuckle Curve', 'Curveball')
    
    
    #filter the dataframe to exclude any rows with Eephus
    bsb = bsb[bsb.pitch_name != 'Eephus']
            
    #Begin separate preprocessing steps.
    #Separate out the categorical features;
    #these will be handled after the numeric transformations.
    #Numeric features (listed for reference; every column that is
    #not in cat_features is treated as numeric below):
    numeric_features = ['release_speed', 'release_spin_rate', 'release_pos_x',
       'release_pos_y', 'release_pos_z', 'pfx_x', 'pfx_z', 'vx0', 'vy0', 'vz0',
       'ax', 'ay', 'az', 'sz_top', 'sz_bot', 'release_extension']
    
    cat_features = ['pitch_name', 'player_name']
    
    bsb_num = bsb.drop(columns = cat_features)
    
    #establish numeric pipeline steps: median imputation and StandardScaler().
    #SimpleImputer replaces sklearn.preprocessing.Imputer,
    #which was removed in scikit-learn 0.22.
    num_pipe = Pipeline(steps = [
        ('impute', SimpleImputer(missing_values = np.nan, strategy = 'median')),
        ('scale', StandardScaler())
                                ])
    #transform numeric data
    baseball = num_pipe.fit_transform(bsb_num)
    
    #make a dataframe to concat with the categorical data that was filtered out;
    #reuse the index of bsb_num so that the columns added below line up
    #row by row (the Eephus filter leaves gaps in the index labels)
    baseball = pd.DataFrame(baseball, columns = bsb_num.columns, index = bsb_num.index)

    #add back the pitch_name that was filtered out
    baseball['pitch_name'] = bsb['pitch_name']
    
    #get dummies, add a feature which indicates if the value was missing or not;
    #dtype = int keeps the dummy columns as 0/1 on newer pandas versions
    baseball = pd.get_dummies(baseball, dummy_na=True, dtype = int)
    
    #add back the pitcher name
    baseball['player_name'] = bsb['player_name']
    
    #add back the target; note that there were no missing
    #values in the target, so no imputation was necessary
    baseball['description'] = target
    
    #however, there are 12 missing 'player_name' values in the original data,
    #and since 'player_name' was not imputed,
    #we'll simply drop these 12 rows.
    baseball = baseball.dropna(how = 'any')
    
    return baseball

df = preprocess_data('Statcast_data.csv')

df.head()



|  | release_speed | release_spin_rate | release_pos_x | release_pos_y | release_pos_z | pfx_x | pfx_z | vx0 | vy0 | vz0 | ... | pitch_name_4-Seam Fastball | pitch_name_Changeup | pitch_name_Curveball | pitch_name_Cutter | pitch_name_Sinker | pitch_name_Slider | pitch_name_Split Finger | pitch_name_nan | player_name | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.073523 | 0.225683 | 2.080234 | -0.016458 | -1.073886 | 2.119752 | -0.27965 | -1.984984 | -1.064687 | 1.276245 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 0 |
| 1 | 1.340953 | 0.25821 | 2.033107 | -0.393337 | -0.823324 | 1.22874 | 0.483185 | -1.856961 | -1.349626 | 0.535466 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 1 |
| 2 | -1.316632 | 0.898992 | 2.124057 | 1.138366 | -1.320666 | -0.982587 | -1.009283 | -1.006819 | 1.330334 | 1.589316 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Chris Sale | 1 |
| 3 | 1.257381 | 0.274474 | 2.013077 | -0.965694 | -1.152964 | 1.662236 | 0.401776 | -2.347235 | -1.209132 | -0.252616 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 1 |
| 4 | 1.307524 | 0.625765 | 2.099451 | -0.293616 | -1.431626 | 1.619067 | 0.139219 | -2.665304 | -1.265463 | 0.268337 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 0 |
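
A detail worth illustrating in a function like this one: pandas aligns column assignments on index labels, while a scikit-learn transformer returns a plain NumPy array with no index. The sketch below uses a hypothetical three-row frame (all names and values are made up) to show what happens when rows are filtered before the array is wrapped back into a dataframe.

```python
import pandas as pd

# Hypothetical raw rows; the middle one will be filtered out.
raw = pd.DataFrame({
    "release_speed": [90.0, 60.0, 95.0],
    "pitch_name": ["Slider", "Eephus", "Sinker"],
})
kept = raw[raw["pitch_name"] != "Eephus"]   # index labels are now [0, 2]

# Stand-in for a transformer's output: a bare NumPy array without an index.
values = kept[["release_speed"]].to_numpy()

# Default RangeIndex [0, 1]: label 1 has no match in `kept`,
# so the second row receives NaN instead of "Sinker".
misaligned = pd.DataFrame(values, columns=["release_speed"])
misaligned["pitch_name"] = kept["pitch_name"]

# Reusing the filtered index keeps every row paired with its own label.
aligned = pd.DataFrame(values, columns=["release_speed"], index=kept.index)
aligned["pitch_name"] = kept["pitch_name"]

print(misaligned)
print(aligned)
```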


In [3]:
from data_import import preprocess_data

data = preprocess_data('Statcast_data.csv')
data.head()



|  | release_speed | release_spin_rate | release_pos_x | release_pos_y | release_pos_z | pfx_x | pfx_z | vx0 | vy0 | vz0 | ... | pitch_name_4-Seam Fastball | pitch_name_Changeup | pitch_name_Curveball | pitch_name_Cutter | pitch_name_Sinker | pitch_name_Slider | pitch_name_Split Finger | pitch_name_nan | player_name | description |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.073523 | 0.225683 | 2.080234 | -0.016458 | -1.073886 | 2.119752 | -0.27965 | -1.984984 | -1.064687 | 1.276245 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 0 |
| 1 | 1.340953 | 0.25821 | 2.033107 | -0.393337 | -0.823324 | 1.22874 | 0.483185 | -1.856961 | -1.349626 | 0.535466 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 1 |
| 2 | -1.316632 | 0.898992 | 2.124057 | 1.138366 | -1.320666 | -0.982587 | -1.009283 | -1.006819 | 1.330334 | 1.589316 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Chris Sale | 1 |
| 3 | 1.257381 | 0.274474 | 2.013077 | -0.965694 | -1.152964 | 1.662236 | 0.401776 | -2.347235 | -1.209132 | -0.252616 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 1 |
| 4 | 1.307524 | 0.625765 | 2.099451 | -0.293616 | -1.431626 | 1.619067 | 0.139219 | -2.665304 | -1.265463 | 0.268337 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Chris Sale | 0 |
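
Since the stated goal is data that is ready for cross-validation, here is one way a frame shaped like this output might be handed to a model. It is a rough sketch under several assumptions: the frame is synthetic (random values, one made-up pitcher name), only two feature columns are included, and `LogisticRegression` is an arbitrary stand-in for whichever modeling technique is being compared.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed frame (values are random, not Statcast data).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "release_speed": rng.normal(size=n),
    "pfx_z": rng.normal(size=n),
    "player_name": ["Pitcher A"] * n,
    "description": rng.integers(0, 2, size=n),
})

# 'player_name' is only an identifier, and 'description' is the target.
X = data.drop(columns=["player_name", "description"])
y = data["description"]

# 5-fold cross-validated accuracy for the stand-in model.
scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean())
```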
