# Pre-processing and Training Data

## Introduction

We continue our effort to predict MLB pitch types by preparing our features for modeling. Previous notebooks reviewed data quality, explored data content and relationships, and selected features to drop from consideration for prediction. The focus now shifts to creating a cleaned development dataset that we can use for modeling. There are 3 main steps we'll focus on:

1. Create dummy or indicator features for categorical variables 
2. Standardize the magnitude of numeric features using a scaler
3. Split into testing and training datasets

Let's begin by importing our post-EDA dataset and making final adjustments to the features we've selected.

## Imports and Data Load

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

In [2]:
pitch_data = pd.read_csv('baseball_features.csv')

In [3]:
pitch_data.head().T

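| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| pitch_type | SI | SL | SL | FF | FF |
| release_speed | 101.0 | 88.1 | 85.8 | 96.5 | 96.8 |
| batter | 606466.0 | 606466.0 | 606466.0 | 572233.0 | 572233.0 |
| pitcher | 547973.0 | 547973.0 | 547973.0 | 547973.0 | 547973.0 |
| stand | R | R | R | R | R |
| p_throws | L | L | L | L | L |
| balls | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 |
| strikes | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| on_3b | NaN | NaN | NaN | NaN | NaN |
| on_2b | 444482.0 | 444482.0 | 444482.0 | 444482.0 | 444482.0 |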


## Feature Engineering 

During EDA, we addressed how using 'release_speed' to predict 'pitch_type' does not align with our intent to predict pitches using only pre-pitch data. The speed of a pitch is not pre-pitch data and needs to be dropped. The primary key reference for a specific game, 'game_pk', can also be dropped at this point.

In [4]:
pitch_data.drop(columns=['release_speed', 'game_pk'], inplace=True)

*'batter', 'pitcher', and 'fielder_2' hold numeric player IDs, so they behave like label-encoded categories rather than true numeric measurements.*

Because we are predicting pitch type, this is a classification problem, and we don't need to encode 'pitch_type': scikit-learn classifiers accept string class labels as the target. Our labeled data allows us to consider a supervised learning model for classification such as Random Forest. Before modeling, we need to create dummy features for our remaining categorical variables.
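
To make the dummy-encoding step concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration; only the column names come from this notebook. It also shows `drop_first=True`, an optional variation that is not used in this notebook but can help when a model is sensitive to perfectly collinear columns.

```python
import pandas as pd

# Toy frame with hypothetical rows; column names mirror the notebook's.
toy = pd.DataFrame({
    "pitch_type": ["FF", "SL", "SI", "FF"],
    "stand": ["R", "L", "R", "R"],
    "p_throws": ["L", "L", "R", "R"],
    "balls": [0, 1, 2, 3],
})

# One indicator column per category level; other columns pass through unchanged.
toy_dummies = pd.get_dummies(toy, columns=["stand", "p_throws"], dtype=int)

# Optional variation: drop the first level of each feature so the remaining
# indicator columns are not perfectly collinear.
toy_reduced = pd.get_dummies(
    toy, columns=["stand", "p_throws"], drop_first=True, dtype=int
)

print(list(toy_dummies.columns))
print(list(toy_reduced.columns))
```

For tree-based models such as Random Forest, keeping every level is usually harmless, which is consistent with the plain `get_dummies` call used in this notebook.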

In [5]:
pitch_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110301 entries, 0 to 110300
Data columns (total 22 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   pitch_type             110301 non-null  object 
 1   batter                 110301 non-null  float64
 2   pitcher                110301 non-null  float64
 3   stand                  110301 non-null  object 
 4   p_throws               110301 non-null  object 
 5   balls                  110301 non-null  float64
 6   strikes                110301 non-null  float64
 7   on_3b                  9954 non-null    float64
 8   on_2b                  20207 non-null   float64
 9   on_1b                  33042 non-null   float64
 10  outs_when_up           110301 non-null  float64
 11  inning                 110301 non-null  float64
 12  inning_topbot          110301 non-null  object 
 13  fielder_2              110301 non-null  float64
 14  at_bat_number          110301 non-nu

In [6]:
categorical_cols = ['stand', 'p_throws', 'inning_topbot', 'if_fielding_alignment', 'of_fielding_alignment']

df = pd.get_dummies(pitch_data, columns=categorical_cols)

In [7]:
df.head().T

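| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| pitch_type | SI | SL | SL | FF | FF |
| batter | 606466.0 | 606466.0 | 606466.0 | 572233.0 | 572233.0 |
| pitcher | 547973.0 | 547973.0 | 547973.0 | 547973.0 | 547973.0 |
| balls | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 |
| strikes | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| on_3b | NaN | NaN | NaN | NaN | NaN |
| on_2b | 444482.0 | 444482.0 | 444482.0 | 444482.0 | 444482.0 |
| on_1b | 572233.0 | 572233.0 | 572233.0 | NaN | NaN |
| outs_when_up | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| inning | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |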


All features are now numeric except our labeled target feature. Next, I want to adjust the columns that record who is on each base (on_1b, on_2b, on_3b). Rather than keeping a reference to the specific player, we'll transform each column into a binary indicator of whether the base is occupied.

In [8]:
df['on_1b'] = np.where(df['on_1b'] > 0, 1, 0)
df['on_2b'] = np.where(df['on_2b'] > 0, 1, 0)
df['on_3b'] = np.where(df['on_3b'] > 0, 1, 0)
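
The `np.where` calls above rely on the fact that a comparison against NaN evaluates to False, so empty bases become 0. A minimal sketch of an equivalent and more explicit formulation, using `notna()`, is shown here on made-up values (the player IDs are reused from the sample rows only as placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical values: a player ID when the base is occupied, NaN otherwise.
bases = pd.DataFrame({
    "on_1b": [572233.0, np.nan, 572233.0],
    "on_3b": [np.nan, np.nan, 444482.0],
})

# Approach used in the notebook: NaN > 0 is False, so NaN maps to 0.
via_where = np.where(bases["on_1b"] > 0, 1, 0)

# Alternative: test for presence directly, then cast the booleans to integers.
via_notna = bases["on_1b"].notna().astype(int)

# Both approaches agree on this toy data.
assert (via_where == via_notna.to_numpy()).all()

# The same idea applied to several columns at once.
indicators = bases.notna().astype(int)
print(indicators)
```

The `notna()` form does not depend on the IDs being positive numbers, which may make the intent easier to read.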

In [9]:
df[['on_1b', 'on_2b', 'on_3b']].describe()

| | on_1b | on_2b | on_3b |
|---|---|---|---|
| count | 110301.0 | 110301.0 | 110301.0 |
| mean | 0.299562 | 0.183199 | 0.090244 |
| std | 0.458068 | 0.386831 | 0.286532 |
| min | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 0.0 | 0.0 |
| 50% | 0.0 | 0.0 | 0.0 |
| 75% | 1.0 | 0.0 | 0.0 |
| max | 1.0 | 1.0 | 1.0 |


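Now we have a value of '1' to indicate someone is on a specific base and a value of '0' if not. The summary above shows that the count for each on-base column matches the row count of df, and the min/max values are 0 and 1 as expected. Each mean also equals the share of non-null rows reported by `info()` earlier (for example, 33042 / 110301 ≈ 0.2996 for on_1b), so no occupied bases were lost in the conversion.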

In [10]:
df.head().T

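| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| pitch_type | SI | SL | SL | FF | FF |
| batter | 606466.0 | 606466.0 | 606466.0 | 572233.0 | 572233.0 |
| pitcher | 547973.0 | 547973.0 | 547973.0 | 547973.0 | 547973.0 |
| balls | 0.0 | 0.0 | 0.0 | 3.0 | 2.0 |
| strikes | 2.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| on_3b | 0 | 0 | 0 | 0 | 0 |
| on_2b | 1 | 1 | 1 | 1 | 1 |
| on_1b | 1 | 1 | 1 | 0 | 0 |
| outs_when_up | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| inning | 9.0 | 9.0 | 9.0 | 9.0 | 9.0 |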


*I may still need to scale features and/or address the ID values in 'batter', 'pitcher', and 'fielder_2', but I'm moving on for now.*

## Train/Test Split

In [11]:
X = df.drop(columns='pitch_type')
y = df['pitch_type']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=47)
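
One refinement that may be worth considering, since pitch types are unlikely to be evenly distributed: `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both subsets. The cell above does not use it; the sketch below only illustrates the option on a made-up, imbalanced target.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with an imbalanced target (hypothetical counts, 100 rows total).
toy_X = pd.DataFrame({"balls": list(range(4)) * 25})
toy_y = pd.Series(["FF"] * 60 + ["SL"] * 30 + ["CH"] * 10, name="pitch_type")

# stratify=toy_y preserves the 60/30/10 class mix in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=47, stratify=toy_y
)

print(y_tr.value_counts().to_dict())
print(y_te.value_counts().to_dict())
```

Without stratification, rare pitch types could end up under-represented in the test set purely by chance.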

In [13]:
X.shape

(110301, 28)

In [14]:
y.shape

(110301,)
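
The introduction lists standardizing numeric features as a step, but no scaler has been applied yet. A minimal sketch of how that could look is below, under stated assumptions: the toy values are made up, and the choice of which columns to scale (`numeric_cols`) is hypothetical. The key point is that the scaler is fit on the training rows only and then reused on the test rows, so no information from the test set leaks into the transformation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy data with made-up values; column names mirror the notebook's.
toy = pd.DataFrame({
    "inning": [1.0, 3.0, 5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 8.0, 9.0],
    "at_bat_number": [2.0, 15.0, 30.0, 48.0, 70.0, 9.0, 22.0, 41.0, 60.0, 75.0],
    "on_1b": [0, 1, 0, 0, 1, 1, 0, 0, 1, 0],
})
toy_train, toy_test = train_test_split(toy, test_size=0.3, random_state=47)

# Assumption: only count-like columns are scaled; binary indicators are left alone.
numeric_cols = ["inning", "at_bat_number"]

scaler = StandardScaler()
toy_train_scaled = toy_train.copy()
toy_test_scaled = toy_test.copy()

# Fit on the training rows only, then apply the same transform to the test rows.
toy_train_scaled[numeric_cols] = scaler.fit_transform(toy_train[numeric_cols])
toy_test_scaled[numeric_cols] = scaler.transform(toy_test[numeric_cols])

print(toy_train_scaled[numeric_cols].mean().round(6))
```

Tree-based models such as Random Forest are generally insensitive to feature scale, so this step matters mainly if a distance-based or linear model is tried later.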

*Do I want a baseline that predicts FB for every pitch, so that model results have something to be compared against?*
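
One way such a baseline could be set up is with scikit-learn's `DummyClassifier`, which can always predict the most frequent class seen in training. The sketch below uses made-up data, and it assumes the most common pitch type is the intended baseline prediction; the toy labels are placeholders, not results from this dataset.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical toy data; "FF" is the most frequent label in the training set.
toy_X_train = pd.DataFrame({"balls": [0, 1, 2, 3, 0, 1]})
toy_y_train = pd.Series(["FF", "FF", "FF", "SL", "SI", "FF"])
toy_X_test = pd.DataFrame({"balls": [0, 2, 3, 1]})
toy_y_test = pd.Series(["FF", "SL", "FF", "CH"])

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_X_train, toy_y_train)

baseline_accuracy = accuracy_score(toy_y_test, baseline.predict(toy_X_test))
print(baseline_accuracy)  # 0.5 on this toy data: 2 of 4 test labels are "FF"
```

Any real model would then need to beat this accuracy on the actual test set to show it has learned something beyond the class frequencies.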