# Pre-processing and Training Data

## Introduction

We continue the process of trying to predict MLB pitch types by preparing our features for modeling. Previous notebooks let us review data quality, explore data content and relationships, and select features to drop from consideration for prediction. Now the focus shifts to creating a cleaned development dataset that we can use for modeling. We'll focus on 3 main steps:

1. Finalize feature selection for various modeling efforts 
2. Create dummy or indicator features for categorical variables
3. Define datasets to be used in modeling

We'll reserve standardizing the magnitude of numeric features for our pipeline in the modeling stage. Train and test data will also be assigned in the modeling step. Let's begin by importing our post-EDA dataset and making final adjustments to the features we've selected.

## Imports and Data Load

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
pitch_data = pd.read_csv('baseball_features.csv')

In [3]:
pitch_data.head().T

Unnamed: 0,0,1,2,3,4
pitch_type,SI,SL,SL,FF,FF
release_speed,101.0,88.1,85.8,96.5,96.8
batter,606466.0,606466.0,606466.0,572233.0,572233.0
pitcher,547973.0,547973.0,547973.0,547973.0,547973.0
stand,R,R,R,R,R
p_throws,L,L,L,L,L
balls,0.0,0.0,0.0,3.0,2.0
strikes,2.0,1.0,0.0,0.0,0.0
on_3b,,,,,
on_2b,444482.0,444482.0,444482.0,444482.0,444482.0


## Feature Engineering 

During EDA, we addressed how using 'release_speed' to predict 'pitch_type' does not align with our intent to predict pitches using only pre-pitch data. The speed of a pitch is not pre-pitch data and needs to be dropped. The primary key reference for a specific game, 'game_pk', can also be dropped at this point.

In [4]:
pitch_data.drop(columns=['release_speed', 'game_pk'], inplace=True)

The features 'batter', 'pitcher', and 'fielder_2' are all label encoded as numeric player IDs. Each number represents a different player, but a model would treat the values as quantities and infer an order or distance between players that does not exist. While this works for Statcast as an index to the player, we cannot use these features directly in machine learning models.

One of the model scenarios we intend to train includes the top 5 pitchers by count of pitches, so we'll keep the 'pitcher' column for now. With the limited number of 'pitcher' values, we can use one-hot encoding so that algorithms do not misinterpret the numeric values as having an order.
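
To make the concern concrete, here is a minimal sketch on a toy frame. The `toy` frame and its three rows are made up for illustration, and `dtype=int` is an assumption added so the output stays 0/1 on newer pandas versions, where `get_dummies` returns booleans by default.

```python
import pandas as pd

# Toy frame for illustration only; the real data comes from baseball_features.csv.
toy = pd.DataFrame({
    'pitch_type': ['SI', 'SL', 'FF'],
    'pitcher': [547973, 545333, 547973],
})

# As a raw number, the ID invites arithmetic that has no baseball meaning.
id_gap = toy['pitcher'].max() - toy['pitcher'].min()

# One-hot encoding replaces the ID with one 0/1 column per pitcher.
encoded = pd.get_dummies(toy, columns=['pitcher'], dtype=int)
print(encoded.columns.tolist())
```

Each pitcher gets a separate indicator column, so no pitcher is "larger" than another as far as a model is concerned.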

In [5]:
pitch_data.drop(columns=['batter', 'fielder_2'], inplace=True)

Because we are predicting pitch type, we are dealing with a classification problem, and the target 'pitch_type' can stay as labels without encoding. Our labeled data allows us to consider a supervised learning model for classification such as Random Forest. In the meantime, we need to create dummy features for our remaining categorical variables.

In [6]:
pitch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110301 entries, 0 to 110300
Data columns (total 20 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   pitch_type             110301 non-null  object 
 1   pitcher                110301 non-null  float64
 2   stand                  110301 non-null  object 
 3   p_throws               110301 non-null  object 
 4   balls                  110301 non-null  float64
 5   strikes                110301 non-null  float64
 6   on_3b                  9954 non-null    float64
 7   on_2b                  20207 non-null   float64
 8   on_1b                  33042 non-null   float64
 9   outs_when_up           110301 non-null  float64
 10  inning                 110301 non-null  float64
 11  inning_topbot          110301 non-null  object 
 12  at_bat_number          110301 non-null  float64
 13  pitch_number           110301 non-null  float64
 14  home_score             110301 non-nu

In [7]:
categorical_cols = ['stand', 'p_throws', 'inning_topbot', 'if_fielding_alignment', 'of_fielding_alignment'] 

df = pd.get_dummies(pitch_data, columns = categorical_cols)
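
A quick sanity check on dummy features is that, within each group of columns created from one original column, every row has exactly one value set. The sketch below shows the idea on a made-up `toy` frame; it assumes the categorical columns have no missing values (a missing category would leave the whole group at 0), and it passes `dtype=int` so the output is 0/1 regardless of pandas version.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    'stand': ['R', 'L', 'R'],
    'p_throws': ['L', 'L', 'R'],
    'balls': [0, 1, 2],
})
toy_dummies = pd.get_dummies(toy, columns=['stand', 'p_throws'], dtype=int)

# Each original column becomes a group of prefixed columns (e.g. stand_L, stand_R).
for prefix in ['stand', 'p_throws']:
    group = [c for c in toy_dummies.columns if c.startswith(prefix + '_')]
    # With no missing values, exactly one column per group is set in every row.
    assert (toy_dummies[group].sum(axis=1) == 1).all()

print(toy_dummies.columns.tolist())
```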

In [8]:
df.head().T

Unnamed: 0,0,1,2,3,4
pitch_type,SI,SL,SL,FF,FF
pitcher,547973.0,547973.0,547973.0,547973.0,547973.0
balls,0.0,0.0,0.0,3.0,2.0
strikes,2.0,1.0,0.0,0.0,0.0
on_3b,,,,,
on_2b,444482.0,444482.0,444482.0,444482.0,444482.0
on_1b,572233.0,572233.0,572233.0,,
outs_when_up,2.0,2.0,2.0,2.0,2.0
inning,9.0,9.0,9.0,9.0,9.0
at_bat_number,77.0,77.0,77.0,76.0,76.0


All features are now numeric except our labeled target feature. We still need to adjust the columns that record who is on each base (on_1b, on_2b, on_3b). They hold the runner's player ID, or NaN when the base is empty. Rather than keeping the reference to a specific player, we'll transform each column into a binary indicator. This is needed for the same reasons discussed for the 'batter' and 'fielder_2' columns.

In [9]:
# NaN > 0 evaluates to False, so an empty base becomes 0 and any player ID becomes 1.
df['on_1b'] = np.where(df['on_1b'] > 0, 1, 0)
df['on_2b'] = np.where(df['on_2b'] > 0, 1, 0)
df['on_3b'] = np.where(df['on_3b'] > 0, 1, 0)

In [10]:
df[['on_1b', 'on_2b', 'on_3b']].describe()

Unnamed: 0,on_1b,on_2b,on_3b
count,110301.0,110301.0,110301.0
mean,0.299562,0.183199,0.090244
std,0.458068,0.386831,0.286532
min,0.0,0.0,0.0
25%,0.0,0.0,0.0
50%,0.0,0.0,0.0
75%,1.0,0.0,0.0
max,1.0,1.0,1.0


Now we have a value of '1' to indicate someone is on a specific base and a value of '0' if not. The summary above shows that the count for each on-base column matches the number of rows in df (110301), so no missing values remain, and the min/max values are 0 and 1 as expected.
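
The `> 0` comparison works because any comparison with NaN is False. An alternative that states the intent more directly is to test for missing values. The sketch below, on a made-up `toy` frame, checks that the two approaches agree; it is an illustration, not a change to the notebook's method.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration: a player ID when the base is occupied, NaN when empty.
toy = pd.DataFrame({
    'on_1b': [572233.0, np.nan],
    'on_2b': [np.nan, 444482.0],
})
base_cols = ['on_1b', 'on_2b']

# Approach used in this notebook: NaN > 0 is False, so empty bases map to 0.
via_where = pd.DataFrame({c: np.where(toy[c] > 0, 1, 0) for c in base_cols})

# Alternative: mark a base as occupied whenever the value is not missing.
via_notna = toy[base_cols].notna().astype(int)

print((via_where == via_notna).all().all())
```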

In [11]:
df.head().T

Unnamed: 0,0,1,2,3,4
pitch_type,SI,SL,SL,FF,FF
pitcher,547973.0,547973.0,547973.0,547973.0,547973.0
balls,0.0,0.0,0.0,3.0,2.0
strikes,2.0,1.0,0.0,0.0,0.0
on_3b,0,0,0,0,0
on_2b,1,1,1,1,1
on_1b,1,1,1,0,0
outs_when_up,2.0,2.0,2.0,2.0,2.0
inning,9.0,9.0,9.0,9.0,9.0
at_bat_number,77.0,77.0,77.0,76.0,76.0


This view shows something we probably should have addressed earlier. The original numeric columns are all stored as 'float64' even though they contain only integer values. Let's go ahead and convert them to 'int32'.

In [12]:
# https://stackoverflow.com/questions/44602139/pandas-convert-all-column-from-string-to-number-except-two

# Cast every column except the target to int32.
cols = [col for col in df.columns if col != 'pitch_type']
for col in cols:
    df[col] = df[col].astype('int32')
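
Before casting floats to integers, it can be worth confirming that nothing will be lost. The sketch below, on a made-up `toy` frame, checks for missing values and fractional parts first, since a cast to 'int32' fails on NaN and drops any fractional part. The checks are an addition for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy frame for illustration only.
toy = pd.DataFrame({
    'pitch_type': ['SI', 'SL'],
    'balls': [0.0, 3.0],
    'inning': [9.0, 9.0],
})
numeric_cols = toy.columns.drop('pitch_type')

# Confirm there are no missing values and no fractional parts before casting.
assert toy[numeric_cols].notna().all().all()
assert (toy[numeric_cols] % 1 == 0).all().all()

# Cast only the numeric columns; the target column is left untouched.
toy = toy.astype({col: 'int32' for col in numeric_cols})
print(toy.dtypes)
```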

In [13]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110301 entries, 0 to 110300
Data columns (total 27 columns):
 #   Column                                Non-Null Count   Dtype 
---  ------                                --------------   ----- 
 0   pitch_type                            110301 non-null  object
 1   pitcher                               110301 non-null  int32 
 2   balls                                 110301 non-null  int32 
 3   strikes                               110301 non-null  int32 
 4   on_3b                                 110301 non-null  int32 
 5   on_2b                                 110301 non-null  int32 
 6   on_1b                                 110301 non-null  int32 
 7   outs_when_up                          110301 non-null  int32 
 8   inning                                110301 non-null  int32 
 9   at_bat_number                         110301 non-null  int32 
 10  pitch_number                          110301 non-null  int32 
 11  home_score   

Now that we have finalized feature selection and created dummy features, we are ready to define our datasets for modeling.

## Dataset Definitions for Modeling

Let's begin by creating the dataset for predicting pitches with 'pitcher' included; then we can drop that column and move on. We want to keep the 5 pitchers who appear most frequently in the data.

In [14]:
df['pitcher'].value_counts()

545333    666
605400    632
572971    611
571578    607
596001    598
         ... 
543148     10
519326      9
607457      8
607732      7
606299      4
Name: pitcher, Length: 539, dtype: int64

In [15]:
# Keep only the rows thrown by the 5 most frequent pitchers from the counts above.
top_pitchers = [545333, 605400, 572971, 571578, 596001]
pitchers = df.loc[df['pitcher'].isin(top_pitchers)]
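
The IDs can also be derived from the counts instead of being typed by hand, which avoids transcription errors if the data changes. Here is a minimal sketch on a made-up `toy` frame; `n_top` and `top_ids` are hypothetical names, and ties at the cutoff are resolved by whatever order `value_counts` returns.

```python
import pandas as pd

# Toy pitcher column for illustration; the IDs 1-4 are made up.
toy = pd.DataFrame({'pitcher': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4]})

n_top = 2  # this notebook keeps 5

# value_counts sorts by frequency (descending), so head() yields the most frequent IDs.
top_ids = toy['pitcher'].value_counts().head(n_top).index

# Keep only the rows thrown by those pitchers.
top_rows = toy.loc[toy['pitcher'].isin(top_ids)]
print(sorted(top_ids))
```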

In [16]:
pitchers.head()

Unnamed: 0,pitch_type,pitcher,balls,strikes,on_3b,on_2b,on_1b,outs_when_up,inning,at_bat_number,...,p_throws_L,p_throws_R,inning_topbot_Bot,inning_topbot_Top,if_fielding_alignment_Infield shift,if_fielding_alignment_Standard,if_fielding_alignment_Strategic,of_fielding_alignment_4th outfielder,of_fielding_alignment_Standard,of_fielding_alignment_Strategic
2463,SI,596001,1,2,0,0,0,0,8,55,...,0,1,0,1,0,1,0,0,0,1
2464,CH,596001,1,1,0,0,0,0,8,55,...,0,1,0,1,0,1,0,0,0,1
2465,CH,596001,1,0,0,0,0,0,8,55,...,0,1,0,1,0,1,0,0,0,1
2466,SI,596001,0,0,0,0,0,0,8,55,...,0,1,0,1,0,1,0,0,0,1
2467,CU,596001,0,0,0,0,0,0,8,54,...,0,1,0,1,0,1,0,0,1,0


Now we need to one-hot encode the 'pitcher' feature so that models do not treat the numeric IDs as ordered values.

In [17]:
pitchers = pd.get_dummies(pitchers, columns = ['pitcher'])

In [18]:
pitchers.head().T

Unnamed: 0,2463,2464,2465,2466,2467
pitch_type,SI,CH,CH,SI,CU
balls,1,1,1,0,0
strikes,2,1,0,0,0
on_3b,0,0,0,0,0
on_2b,0,0,0,0,0
on_1b,0,0,0,0,0
outs_when_up,0,0,0,0,0
inning,8,8,8,8,8
at_bat_number,55,55,55,55,54
pitch_number,4,3,2,1,1


We'll continue defining additional datasets with the 'pitcher' feature dropped. Another scenario we will model in an attempt to predict pitch type is the first pitch of each at-bat.

In [19]:
no_pitchers = df.drop(columns='pitcher')
# Build the mask from no_pitchers itself rather than from df.
first_pitch = no_pitchers.loc[no_pitchers['pitch_number'] == 1]

In [20]:
first_pitch['pitch_number'].describe()

count    28166.0
mean         1.0
std          0.0
min          1.0
25%          1.0
50%          1.0
75%          1.0
max          1.0
Name: pitch_number, dtype: float64

Great, we now have a dataset, first_pitch, that contains only pitches with 'pitch_number' = 1. We'll have access to the full dataset, no_pitchers, in the modeling stage, so we can select additional pitch counts as desired.

In [21]:
no_pitchers.head().T

Unnamed: 0,0,1,2,3,4
pitch_type,SI,SL,SL,FF,FF
balls,0,0,0,3,2
strikes,2,1,0,0,0
on_3b,0,0,0,0,0
on_2b,1,1,1,1,1
on_1b,1,1,1,0,0
outs_when_up,2,2,2,2,2
inning,9,9,9,9,9
at_bat_number,77,77,77,76,76
pitch_number,3,2,1,4,3


## Summary

Pre-processing started with our post-EDA data and finished with datasets ready for train/test splits in modeling. First, we dropped a couple of remaining features that were not relevant to our prediction efforts. Other features had to be dropped because they were numeric player IDs with no meaningful order. We kept one such feature, 'pitcher', and later one-hot encoded it with dummy features. Second, we created dummy features for our remaining categorical features. For the features identifying the specific player on each base, we replaced the player ID with a binary indicator. We lost the reference to the specific player but kept the important information of whether someone is on base. Finally, we converted all feature data types to int32 (except our target feature) and created a few datasets for modeling.

We are ready to move on and begin modeling. We have the following datasets to work with:

- no_pitchers - All pitches without pitcher reference
- pitchers - Pitches thrown by the top 5 pitchers by pitch count, with 'pitcher' one-hot encoded
- first_pitch - Only the first pitch in each at-bat without pitcher reference

From here, we can work with additional scenarios from the datasets above. For example, if we want to look at first pitch only with pitchers or third pitch only without pitchers, we have the data.
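
Selecting another pitch count is a one-line filter. The sketch below uses a hypothetical helper, `select_pitch_number`, on a made-up `toy` frame; the same call could be applied to no_pitchers or pitchers in the modeling stage.

```python
import pandas as pd

def select_pitch_number(data, n):
    """Return only the rows where 'pitch_number' equals n."""
    return data.loc[data['pitch_number'] == n]

# Toy frame for illustration only.
toy = pd.DataFrame({
    'pitch_type': ['SL', 'SL', 'SI', 'FF'],
    'pitch_number': [1, 2, 3, 1],
})

third_pitch = select_pitch_number(toy, 3)
print(third_pitch)
```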

In [22]:
pwd

'C:\\Users\\Louie\\GitHub\\Capstone2_Project'

In [24]:
# Paths are relative to the working directory shown above.
no_pitchers.to_csv('no_pitchers.csv', index=False)
pitchers.to_csv('pitchers.csv', index=False)
first_pitch.to_csv('first_pitch.csv', index=False)
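
If the notebook needs to run on another machine, building paths with `pathlib` avoids hard-coded, platform-specific strings. The sketch below is one possible approach, not the notebook's method: `save_datasets` is a hypothetical helper, and the demonstration writes a made-up `toy` frame to a temporary directory and reads it back.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_datasets(datasets, output_dir):
    """Write each DataFrame in `datasets` to <output_dir>/<name>.csv."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in datasets.items():
        path = output_dir / f'{name}.csv'
        frame.to_csv(path, index=False)
        written.append(path)
    return written

# Demonstration with a throwaway directory and a toy frame.
toy = pd.DataFrame({'pitch_type': ['SI', 'SL'], 'balls': [0, 1]})
with tempfile.TemporaryDirectory() as tmp:
    paths = save_datasets({'first_pitch': toy}, tmp)
    reloaded = pd.read_csv(paths[0])

print(reloaded)
```

In the notebook itself, the call might look like `save_datasets({'no_pitchers': no_pitchers, 'pitchers': pitchers, 'first_pitch': first_pitch}, '.')`.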