# Statcast Data Preparation

**Authors:** Jacob Sauberman, Joel Klein, Ben Perkins

**Date:** November 21, 2021

**Description:** This notebook loads MLB Statcast data for every pitch recorded in MLB since 2015 and prepares the data for use in modeling.

**Data:** The data is scraped from *baseballsavant.com* for each year and combined into one large data file. The code for scraping the data and making it available is in `statcast_data_download.ipynb` in the adhoc folder of this repository.

**Scope:** We will use MLB pitch data starting in 2017 for training, tuning, and evaluating the next pitch prediction models. To avoid data leakage, the data cannot be split randomly across all years into the training, validation, and test sets: future pitches must not be used as training inputs when generating predictions for earlier pitches in the validation or test data. The 2017 and 2018 seasons will be used for training, the 2019 season for validation, and the 2020 and 2021 seasons for testing the next pitch prediction model.
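The season-based split described in the scope can be sketched as follows. This is a minimal illustration on a hypothetical toy frame, not the notebook's actual splitting code; only the `game_year` column and the season boundaries come from the text above, and the helper name `split_by_season` is an assumption.

```python
import pandas as pd

# Hypothetical toy frame; only the game_year column mirrors the real data.
toy = pd.DataFrame({
    "game_year": [2016, 2017, 2018, 2019, 2020, 2021],
    "pitch_class": ["Fourseam", "Slider", "Cutter", "Fourseam", "Changeup", "Slider"],
})

# Season boundaries taken from the scope described above.
TRAIN_YEARS = [2017, 2018]
VALID_YEARS = [2019]
TEST_YEARS = [2020, 2021]

def split_by_season(df):
    """Return (train, valid, test) frames partitioned by season, never at random."""
    train = df[df["game_year"].isin(TRAIN_YEARS)]
    valid = df[df["game_year"].isin(VALID_YEARS)]
    test = df[df["game_year"].isin(TEST_YEARS)]
    return train, valid, test

train_df, valid_df, test_df = split_by_season(toy)
```

Because every training season precedes the validation season, which in turn precedes the test seasons, no future pitch can leak into an earlier split.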

**Summary:**

There are a series of data preparation steps in this notebook: 

* **Recategorize similar pitch types:** 
    
  The pitch types in scope for this analysis are:
    - 4-Seam Fastball
    - Slider
    - 2-Seam Fastball
    - Changeup
    - Sinker
    - Curveball
    - Cutter
    - Knuckle Curve
    - Split-Finger
  
  Some pitch types are very similar to each other in vertical and horizontal movement. We will combine these pitch types and reformat the target variable.
    
  Any at bat with a pitch thrown not in the common pitch list will be removed from the data set.

* **Input Feature Engineering:**
  * Create current pitch count feature - the pitch count impacts the types of pitches thrown
  * Create pitcher historical pitch type probabilities - previous pitch probabilities for a pitcher will likely boost the performance of the models for predicting future pitches thrown
  * Create pitcher and hitter historical pitch type statistics such as wOBA and whiff %
  * Create game pitch number - the types of pitches thrown in a game may be impacted by the number of pitches a pitcher has thrown in the game

* **Combine previous pitch results to existing pitch:**
  * The results from the previous pitches within the at-bat may influence the next pitches thrown
* **Remove any at bats with missing pitch type or uncommon pitch types thrown:**
  * The previous pitch thrown will likely impact the next pitches thrown in the at-bat. Any at-bats with missing pitch information will be out of scope for this analysis.
* **Filter to only the first 9 innings of pitches:**
  * In 2020, extra innings started with a runner on second base. There may be a temporal shift in the data during this time so we will exclude extras from scope.
* **Filter at bats to the batters and pitchers who account for 90% of total at bats:**
  * There are many instances in Major League Baseball where rookie batters and pitchers receive at-bats with little or no prior pitch sequence data. To make a prediction on next pitch types in the 2019, 2020, and 2021 seasons, the batter and the pitcher need to have a significant number of recorded at-bats in the 2017 and 2018 seasons. Only batters and pitchers accounting for 90% of at-bats in the 2017 and 2018 seasons were included in scope (443 batters and 521 pitchers).
* **Split original data into train, validation, test sets:**
  * To avoid data leakage, the data cannot be split randomly across all years into the training, validation, and test sets: future pitches must not be used as training inputs when generating predictions for earlier pitches in the validation or test data. The 2017 and 2018 seasons will be used for training, the 2019 season for validation, and the 2020 and 2021 seasons for testing the next pitch prediction model.
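One possible reading of the 90% at-bat filter in the list above is to rank players by at-bat volume and keep the smallest group whose cumulative share reaches 90%. The sketch below follows that assumption on toy data; the helper `players_covering_share` and the toy frame are hypothetical and may differ from how the notebook actually selects the 443 batters and 521 pitchers.

```python
import pandas as pd

def players_covering_share(at_bats, id_col, share=0.90):
    """Return the smallest set of players, ranked by at-bat volume,
    whose at bats add up to at least `share` of all at bats."""
    counts = at_bats[id_col].value_counts()      # sorted from most to fewest at bats
    cum_share = counts.cumsum() / counts.sum()   # cumulative share of all at bats
    # players strictly below the threshold, plus the one that reaches it
    n_keep = min(int((cum_share < share).sum()) + 1, len(counts))
    return set(counts.index[:n_keep])

# Hypothetical toy data: one row per at bat.
toy = pd.DataFrame({"batter": [1] * 6 + [2] * 3 + [3] * 1})
kept = players_covering_share(toy, "batter", share=0.90)
```

In this toy case batters 1 and 2 together account for 9 of 10 at bats, so batter 3 falls outside the 90% coverage and would be dropped.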

**Notes:** 

**Warnings:** 

**Outline:** 
  - Import Libraries
  - Global Options
  - Set Directories
  - Define Functions
  - Load Data
  - Data Preparation
    - Feature Engineering
  - Data Filtering
  - Data Splitting

## Import Libraries

In [None]:
# data manipulation
import numpy as np
import pandas as pd 
import os
import zipfile

# plotting
import seaborn as sns
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
import plotly.graph_objects as go
import plotly.express as px

## Global Options

In [None]:
# do not show warnings
import warnings
warnings.filterwarnings('ignore')

# set pandas display options
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 2000)
pd.set_option('display.max_colwidth', 2000)
pd.options.display.float_format = '{:.5f}'.format

## Set Directories

In [None]:
# Connect to the colab data source
from google.colab import drive
drive.mount('/content/drive')

DATA_DIR = "/content/drive/MyDrive/final-project-dl/data/statcast"

Mounted at /content/drive


## Define Functions

In [None]:
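# define a function for loading in dataset
def load_data(in_path, name):
    df = pd.read_csv(in_path)
    print(f"{name}: shape is {df.shape}")

    # show the schema and a preview for the first dataset only
    # (use the `name` argument rather than the global loop variable `ds_name`)
    if name == "all_15_dataframe":
        df.info()
        display(df.head(5))

    return df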

## Load Data

In [None]:
datasets = {}

# set the input data set names we will load in
ds_names = ("all_15_dataframe", "all_16_dataframe", "all_17_dataframe",
            "all_18_dataframe", "all_19_dataframe", "all_20_dataframe",
            "all_21_dataframe")

# load in each dataset
for ds_name in ds_names:
    datasets[ds_name] = load_data(os.path.join(DATA_DIR, f'{ds_name}.csv'), ds_name)

# combine data into one file
statcast_df = pd.concat([datasets[ds_name] for ds_name in ds_names])

# delete objects no longer needed for memory
import gc
gc.enable()
del datasets
gc.collect()

all_15_dataframe: shape is (702301, 93)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 702301 entries, 0 to 702300
Data columns (total 93 columns):
 #   Column                           Non-Null Count   Dtype  
---  ------                           --------------   -----  
 0   index                            702301 non-null  int64  
 1   pitch_type                       701661 non-null  object 
 2   game_date                        702301 non-null  object 
 3   release_speed                    687964 non-null  float64
 4   release_pos_x                    687381 non-null  float64
 5   release_pos_z                    687381 non-null  float64
 6   player_name                      702301 non-null  object 
 7   batter                           702301 non-null  float64
 8   pitcher                          702301 non-null  float64
 9   events                           183953 non-null  object 
 10  description                      702301 non-null  object 
 11  spin_dir                 

Unnamed: 0,index,pitch_type,game_date,release_speed,release_pos_x,release_pos_z,player_name,batter,pitcher,events,description,spin_dir,spin_rate_deprecated,break_angle_deprecated,break_length_deprecated,zone,des,game_type,stand,p_throws,home_team,away_team,type,hit_location,bb_type,balls,strikes,game_year,pfx_x,pfx_z,plate_x,plate_z,on_3b,on_2b,on_1b,outs_when_up,inning,inning_topbot,hc_x,hc_y,tfs_deprecated,tfs_zulu_deprecated,fielder_2,umpire,sv_id,vx0,vy0,vz0,ax,ay,az,sz_top,sz_bot,hit_distance_sc,launch_speed,launch_angle,effective_speed,release_spin_rate,release_extension,game_pk,pitcher.1,fielder_2.1,fielder_3,fielder_4,fielder_5,fielder_6,fielder_7,fielder_8,fielder_9,release_pos_y,estimated_ba_using_speedangle,estimated_woba_using_speedangle,woba_value,woba_denom,babip_value,iso_value,launch_speed_angle,at_bat_number,pitch_number,pitch_name,home_score,away_score,bat_score,fld_score,post_away_score,post_home_score,post_bat_score,post_fld_score,if_fielding_alignment,of_fielding_alignment,spin_axis,delta_home_win_exp,delta_run_exp
0,2692,SI,2015-10-04,97.2,-1.08,6.2,"Familia, Jeurys",150029.0,544727.0,field_out,hit_into_play,,,,,7.0,Jayson Werth flies out to center fielder Juan Lagares.,R,R,R,NYM,WSH,X,8.0,fly_ball,0.0,1.0,2015.0,-1.12,0.46,-0.71,2.02,,547180.0,,2.0,9.0,Top,110.32,71.66,,,518595.0,,151004_174434,3.6,-141.27,-6.54,-14.35,32.15,-28.51,3.64,1.67,345.0,95.2,35.0,96.5,2018.0,6.1,416079.0,544727.0,518595.0,446263.0,502517.0,431151.0,514913.0,624424.0,501571.0,434158.0,50.0,0.153,0.293,0.0,1.0,0.0,0.0,3.0,61.0,2.0,Sinker,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Standard,Standard,,0.112,-0.256
1,2740,SI,2015-10-04,97.5,-0.93,6.42,"Familia, Jeurys",150029.0,544727.0,,called_strike,,,,,3.0,Jayson Werth flies out to center fielder Juan Lagares.,R,R,R,NYM,WSH,S,,,0.0,0.0,2015.0,-1.11,0.71,0.44,3.5,,547180.0,,2.0,9.0,Top,,,,,518595.0,,151004_174405,6.39,-141.72,-3.75,-14.31,32.1,-24.84,3.64,1.67,,,,96.5,2093.0,6.0,416079.0,544727.0,518595.0,446263.0,502517.0,431151.0,514913.0,624424.0,501571.0,434158.0,50.0,,,,,,,,61.0,1.0,Sinker,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Standard,Standard,,0.0,-0.048
2,2875,SI,2015-10-04,98.4,-1.01,6.21,"Familia, Jeurys",547180.0,544727.0,double,hit_into_play,,,,,4.0,"Mets challenged (tag play), call on the field was upheld: Bryce Harper doubles (38) on a ground ball to left fielder Michael Conforto.",R,L,R,NYM,WSH,X,7.0,ground_ball,0.0,0.0,2015.0,-1.52,0.52,-0.35,2.45,,,,2.0,9.0,Top,73.19,139.24,,,518595.0,,151004_174011,5.48,-143.01,-5.67,-20.49,36.6,-27.46,3.19,1.46,20.0,72.2,-6.0,97.6,1960.0,6.4,416079.0,544727.0,518595.0,446263.0,502517.0,431151.0,514913.0,624424.0,501571.0,434158.0,50.0,0.093,0.084,1.25,1.0,1.0,1.0,2.0,60.0,1.0,Sinker,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Standard,Strategic,,-0.077,0.208
3,2944,SI,2015-10-04,97.7,-0.96,6.08,"Familia, Jeurys",607208.0,544727.0,strikeout,swinging_strike,,,,,13.0,Trea Turner strikes out swinging.,R,R,R,NYM,WSH,S,2.0,,3.0,2.0,2015.0,-1.29,0.62,-0.05,1.27,,,,1.0,9.0,Top,,,,,518595.0,,151004_173928,5.58,-141.82,-8.8,-16.94,31.9,-26.15,3.5,1.61,,,,97.1,2099.0,6.3,416079.0,544727.0,518595.0,446263.0,502517.0,431151.0,514913.0,624424.0,501571.0,434158.0,50.0,,,0.0,1.0,0.0,0.0,,59.0,7.0,Sinker,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Standard,Standard,,0.052,-0.2
4,3111,SI,2015-10-04,98.2,-1.02,6.09,"Familia, Jeurys",607208.0,544727.0,,foul,,,,,7.0,Trea Turner strikes out swinging.,R,R,R,NYM,WSH,S,,,3.0,2.0,2015.0,-1.0,0.8,-0.41,1.56,,,,1.0,9.0,Top,,,,,518595.0,,151004_173902,3.99,-142.75,-8.51,-12.71,36.27,-23.52,3.5,1.61,,,,97.7,2155.0,6.5,416079.0,544727.0,518595.0,446263.0,502517.0,431151.0,514913.0,624424.0,501571.0,434158.0,50.0,,,,,,,,59.0,6.0,Sinker,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,Standard,Standard,,0.0,0.0


all_16_dataframe: shape is (715821, 93)
all_17_dataframe: shape is (721244, 93)
all_18_dataframe: shape is (721190, 93)
all_19_dataframe: shape is (732473, 93)
all_20_dataframe: shape is (263584, 93)
all_21_dataframe: shape is (654091, 93)


11

## Data Preparation

### Feature Engineering

#### Re-classifying pitch types

There are several pitch types the model will attempt to predict. Some of the pitch types in the raw data are very rare. We will only keep the commonly thrown pitch types in scope for modeling.

The pitch types in scope for this analysis are:

  * 4-Seam Fastball
  * Slider
  * 2-Seam Fastball
  * Changeup
  * Sinker
  * Curveball
  * Cutter
  * Knuckle Curve
  * Split-Finger
  
Some pitch types are very similar to each other in vertical and horizontal movement. We will combine these pitch types and reformat the target variable.

In [None]:
# Unique pitch types -- these will end up being our target class in the NN
statcast_df.pitch_type.unique()

array(['SI', 'SL', 'CH', 'FS', 'FF', 'KC', 'CU', 'FT', 'FC', 'IN', nan,
       'PO', 'KN', 'EP', 'FO', 'SC', 'FA', 'CS'], dtype=object)

In [None]:
# Plot each pitch type's average horizontal and vertical movement, with velocity
pitch_type_avg_move = statcast_df.query("p_throws == 'R'").groupby(by=["pitch_type"], as_index=False).agg(
    **{
       "Count": pd.NamedAgg(column="pitch_type", aggfunc="count"),
       "Horz_Break": pd.NamedAgg(column="pfx_x", aggfunc="mean"),
       "Vert_Break": pd.NamedAgg(column="pfx_z", aggfunc="mean"),
       "Velocity": pd.NamedAgg(column="release_speed", aggfunc="mean")
    }
)

print(pitch_type_avg_move)

fig = px.scatter_3d(
    pitch_type_avg_move, 
    x = "Horz_Break",
    y = "Vert_Break",
    z = "Velocity",
    color = "pitch_type")

fig.show()

   pitch_type    Count  Horz_Break  Vert_Break  Velocity
0          CH   315552    -1.11201     0.65596  84.88367
1          CS      359     0.95844    -0.89326  71.27939
2          CU   264934     0.74454    -0.66593  78.77364
3          EP      912     0.54898    -0.38519  64.79238
4          FA     1357    -0.58444     1.25709  74.20332
5          FC   185850     0.18204     0.78431  89.02783
6          FF  1169740    -0.65373     1.39607  93.64323
7          FO      845    -0.97961     0.75085  86.50726
8          FS    63109    -0.92152     0.46916  85.26307
9          FT   286707    -1.23676     0.96999  92.89225
10         IN     4578    -0.72685     1.27744  73.43726
11         KC    84355     0.67933    -0.83188  81.08125
12         KN    11455    -0.13279     0.34409  76.09009
13         PO      589    -0.81748     1.33183  86.99224
14         SC        1    -0.83000    -0.17000  82.30000
15         SI   309409    -1.25861     0.77645  92.52217
16         SL   565842     0.40

Some pitch types have very similar characteristics and, for the purposes of this project, can be grouped together to reduce the number of target classes to predict.

Filter out rarely used pitch types -- CS, EP, FA, IN, KN, PO, SC

Group similar pitch types together to get 6 target pitch type groups:

1.   FF --> Fourseam
2.   FT, SI --> Twoseam
3.   FC --> Cutter
4.   SL --> Slider
5.   CU, KC --> Curveball
6.   CH, FS, FO --> Changeup

In [None]:
# Re-classify pitch types using new system; pitch types not listed here become missing
pitch_class_map = {
    "FF": "Fourseam",
    "FT": "Twoseam", "SI": "Twoseam",
    "FC": "Cutter",
    "SL": "Slider",
    "CU": "Curveball", "KC": "Curveball",
    "CH": "Changeup", "FS": "Changeup", "FO": "Changeup",
}

statcast_df = statcast_df.assign(pitch_class=statcast_df["pitch_type"].map(pitch_class_map))

#### Creating Count Feature

The pitch count impacts the types of pitches thrown. It is important to combine the balls and strikes into one feature, as together they influence which pitch type will be thrown. There are also a handful of instances in the data which have 4 balls for the current count. We will replace 4 balls with 3 balls.

In [None]:
# there are about 10 rows which have 4 balls for the count which is impossible
# we will fix those rows by replacing four balls with three
# (.loc avoids chained assignment, which may silently fail to update the frame)
statcast_df.loc[statcast_df["balls"] == 4, "balls"] = 3

# Add count column to df
statcast_df["count"] = statcast_df["balls"].astype(int).astype(str) + "-" + statcast_df["strikes"].astype(int).astype(str)

In [None]:
# Calculate average pitch type usage by count
pitches_by_count = statcast_df.groupby(by=["count", "pitch_class"], as_index=False).agg(
    **{
       "count_pitches": pd.NamedAgg(column="pitch_class", aggfunc='count')
    }
)

pitches_overall = statcast_df.groupby(by=["count"], as_index=False).agg(
    **{
       "tot_pitches": pd.NamedAgg(column="count", aggfunc='count')
    }
)

usage_by_count = pitches_by_count.merge(pitches_overall, how = "inner", on = "count")
usage_by_count["usage"] = usage_by_count["count_pitches"] / usage_by_count["tot_pitches"]

print(usage_by_count.head())

  count pitch_class  count_pitches  tot_pitches   usage
0   0-0    Changeup          94884      1158966 0.08187
1   0-0   Curveball         138613      1158966 0.11960
2   0-0      Cutter          63808      1158966 0.05506
3   0-0    Fourseam         431073      1158966 0.37195
4   0-0      Slider         171462      1158966 0.14794


In [None]:
# Plot pitch type usage by count
fig = px.bar(usage_by_count, x="count", y="usage", color="pitch_class", title="Pitch Type Usage by Count")
fig.show()

Count appears to influence pitch type usage. In counts more favorable to hitters (1-0, 2-0, 3-0, 3-1), we see higher fastball usage. In counts more favorable to pitchers (0-2, 1-2, 2-2), we see higher slider and curveball usage.

In [None]:
# Calculate % of pitchers that throw each pitch type
num_pitchers = len(statcast_df.query('game_year == 2021').pitcher.unique())

pitchers_by_pitch_type = statcast_df.query('game_year == 2021').groupby(by=["pitch_class"], as_index=False).agg(
    **{
       "pitchers": pd.NamedAgg(column="pitcher", aggfunc=pd.Series.nunique)
    }
)

pitchers_by_pitch_type["frequency"] = pitchers_by_pitch_type["pitchers"] / num_pitchers

fig = px.bar(pitchers_by_pitch_type, x="pitch_class", y="frequency", color="pitch_class", title="% of Pitchers Who Threw Each Pitch Type in 2021")
fig.show()


#### Modify the on base features

In the original data, the on base features contain the player id of the runner when there is a runner on the base and NaN when there is not. We will transform each feature into an indicator of whether there is a runner on the base, as the identity of the runner likely doesn't affect the pitch type.

In [None]:
# change the on base fields to indicators
statcast_df['on_1b'] = np.where(statcast_df['on_1b'].isna(), 0, 1)
statcast_df['on_2b'] = np.where(statcast_df['on_2b'].isna(), 0, 1)
statcast_df['on_3b'] = np.where(statcast_df['on_3b'].isna(), 0, 1)

#### Remove at-bats with missing pitch type information

There are over 12,000 pitches in the dataset which have a missing pitch type. Several other features are also missing for certain pitch tracking measurements. We will remove any at bat that is missing pitch type information for one or more of its pitches, as our hypothesis is that pitch sequence information will improve model accuracy.

In [None]:
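# get the (game, at bat) pairs which contain at least one pitch with a missing pitch class
missing_pitch_type_removal = statcast_df[statcast_df['pitch_class'].isna()][['game_pk','at_bat_number']].drop_duplicates()

# remove those at bats from the data set, matching on the (game_pk, at_bat_number) pair;
# checking each column independently would also drop unrelated at bats that happen to
# share an at_bat_number with a flagged at bat in a different flagged game
at_bat_keys = pd.MultiIndex.from_frame(statcast_df[['game_pk', 'at_bat_number']])
missing_keys = pd.MultiIndex.from_frame(missing_pitch_type_removal)
statcast_df = statcast_df[~at_bat_keys.isin(missing_keys)]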
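Because `at_bat_number` restarts in every game, an at bat is only uniquely identified by the pair (`game_pk`, `at_bat_number`). The toy example below, built on hypothetical data, contrasts pair-wise matching (here via a left merge with `indicator=True`, one of several ways to do it) with testing the two columns independently.

```python
import pandas as pd

# Hypothetical toy data: two games, two at bats each.
toy = pd.DataFrame({
    "game_pk":       [1, 1, 1, 2, 2, 2],
    "at_bat_number": [1, 2, 2, 1, 2, 2],
    "pitch_class":   ["Slider", None, "Cutter", None, "Slider", "Cutter"],
})

# (game, at bat) pairs with at least one missing pitch class: (1, 2) and (2, 1)
bad = toy.loc[toy["pitch_class"].isna(), ["game_pk", "at_bat_number"]].drop_duplicates()

# Pair-wise match: a left merge with an indicator flags rows whose pair is in `bad`.
flagged = toy.merge(bad, on=["game_pk", "at_bat_number"], how="left", indicator=True)
clean = flagged.loc[flagged["_merge"] == "left_only"].drop(columns="_merge")

# Independent column checks: every row here has a game_pk in {1, 2} and an
# at_bat_number in {1, 2}, so all rows are removed, including complete at bats.
independent = toy[~(toy["at_bat_number"].isin(bad["at_bat_number"])
                    & toy["game_pk"].isin(bad["game_pk"]))]
```

With pair-wise matching the complete at bats (1, 1) and (2, 2) survive, while the independent checks discard everything. Note that `merge` returns a frame with a fresh integer index.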

#### Create starting pitcher indicator feature

In [None]:
# the home team pitches in the top of the 1st inning
home_starter = (statcast_df[(statcast_df['inning']==1) &
                       (statcast_df['inning_topbot'] == 'Top')].groupby('game_pk')
                       .head(n=1)
                       .set_index('game_pk')['pitcher']
           .rename('home_starter'))

# the away team pitches in the bottom of the 1st inning
away_starter = (statcast_df[(statcast_df['inning']==1) &
                       (statcast_df['inning_topbot'] == 'Bot')].groupby('game_pk')
                       .head(n=1)
                       .set_index('game_pk')['pitcher']
           .rename('away_starter'))

starting_pitcher = home_starter.to_frame().join(away_starter, how='left', on='game_pk')

statcast_df = statcast_df.join(starting_pitcher, on='game_pk', how='left')

statcast_df['starter'] = ((statcast_df['pitcher'] == statcast_df['home_starter']) | (statcast_df['pitcher'] == statcast_df['away_starter']))

statcast_df.drop(columns=['home_starter', 'away_starter'], inplace=True)

#### Previous pitch class pitcher statistics

We will create features that capture the mean of each pitch statistic going into the next pitch. These are based on the last 50 pitches of that pitch type thrown by the pitcher within the season. The pitcher needs at least 1 pitch thrown of that pitch type for the mean to be calculated.

In [None]:
# create strike percentage feature
statcast_df['strike_per'] = statcast_df['description'].isin(['called_strike', 'foul', 'swinging_strike', 'hit_into_play'])

# create whiff percentage feature
statcast_df['whiff_per'] = np.where(statcast_df['description'].isin(['swinging_strike']), 1, np.where(statcast_df['description'].isin(['foul', 'hit_into_play']), 0, np.nan))

In [None]:
pitch_stat_cols = [# pitch speed
                  'release_speed',

                  # pitch release spin
                  'release_spin_rate',
                   
                   # strike percentage
                   'strike_per',

                   # whiff percentage
                   'whiff_per',

                   # woba value
                   'woba_value'
                   ]

# Sort dataframe by date, game, at bat number, and pitch number
statcast_df = statcast_df.sort_values(["game_date", "game_pk", "at_bat_number", "pitch_number"], ascending = (True, True, True, True))

# set the pitch unique id
statcast_df['pitch_id'] = statcast_df["game_date"].astype(str) + "_" + statcast_df["game_pk"].astype(int).astype(str) + "_" + statcast_df["at_bat_number"].astype(int).astype(str) + "_" + statcast_df["pitch_number"].astype(int).astype(str)

# set the index
statcast_df.set_index('pitch_id', inplace=True)

# group by pitcher, season, and pitch class, then take a rolling 50-pitch window
rolling_pitch_affects = (statcast_df.groupby(['pitcher', 'game_year', 'pitch_class'])[pitch_stat_cols].rolling(window=50, min_periods=1))

In [None]:
# calculate the rolling 50 pitch statistics by pitcher and pitch class
rolling_pitch_affects_means = rolling_pitch_affects.mean().fillna(0)

# pivot the data so that it is in wide format for these statistics
rolling_pitch_affects_means = rolling_pitch_affects_means.unstack(level=2)

# rename the columns
rolling_pitch_affects_means = (rolling_pitch_affects_means.set_axis([f"{y}_{x}_m" for x, y in rolling_pitch_affects_means.columns],
                                                                    axis=1)
                        .reset_index()
                        .set_index('pitch_id'))

rolling_pitch_affects_means = rolling_pitch_affects_means.drop(['pitcher', 'game_year'], axis=1)

# merge the previous 50 pitch stats with data
statcast_df = statcast_df.merge(rolling_pitch_affects_means, left_index=True, right_index=True)

In [None]:
pitch_stat_cols = [
'Changeup_release_speed_m',
'Curveball_release_speed_m',
'Cutter_release_speed_m',
'Fourseam_release_speed_m',
'Slider_release_speed_m',
'Twoseam_release_speed_m',
'Changeup_release_spin_rate_m',
'Curveball_release_spin_rate_m',
'Cutter_release_spin_rate_m',
'Fourseam_release_spin_rate_m',
'Slider_release_spin_rate_m',
'Twoseam_release_spin_rate_m',
'Changeup_strike_per_m',
'Curveball_strike_per_m',
'Cutter_strike_per_m',
'Fourseam_strike_per_m',
'Slider_strike_per_m',
'Twoseam_strike_per_m',
'Changeup_whiff_per_m',
'Curveball_whiff_per_m',
'Cutter_whiff_per_m',
'Fourseam_whiff_per_m',
'Slider_whiff_per_m',
'Twoseam_whiff_per_m',
'Changeup_woba_value_m',
'Curveball_woba_value_m',
'Cutter_woba_value_m',
'Fourseam_woba_value_m',
'Slider_woba_value_m',
'Twoseam_woba_value_m']

# iterate over the columns and forward fill each pitcher's most recent rolling statistics for each pitch type
for i in pitch_stat_cols:
    statcast_df[i] = statcast_df.groupby('pitcher')[i].transform(lambda x: x.ffill())
    statcast_df[i] = statcast_df[i].fillna(0) # replace a missing value with 0
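The rolling-mean, pivot, and forward-fill steps above can be illustrated on a small scale. The sketch below uses hypothetical toy data with a single statistic and a window of 2 instead of 50; it is meant to show the shape of the intermediate results, not to reproduce the notebook's exact pipeline.

```python
import pandas as pd

# Hypothetical toy data: one pitcher, two pitch classes, indexed by pitch_id.
toy = pd.DataFrame(
    {
        "pitcher": [10, 10, 10, 10],
        "pitch_class": ["Fourseam", "Slider", "Fourseam", "Slider"],
        "release_speed": [95.0, 85.0, 97.0, 87.0],
    },
    index=pd.Index(["p1", "p2", "p3", "p4"], name="pitch_id"),
)

# Rolling mean per (pitcher, pitch class); a window of 2 keeps the toy small.
rolling_means = (
    toy.groupby(["pitcher", "pitch_class"])["release_speed"]
    .rolling(window=2, min_periods=1)
    .mean()
)

# Pivot pitch_class (index level 1) into columns: one column per pitch class.
# Each row is only populated for the class of the pitch thrown on that row.
wide = rolling_means.unstack(level=1).sort_index()

# Carry each pitcher's latest value forward onto rows of the other pitch classes.
filled = wide.groupby(level=0).ffill()
```

After the forward fill, every row carries the pitcher's most recent rolling mean for each pitch class; values remain missing only until the pitcher has thrown that class at least once.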

In [None]:
# Repeat process for batter stats columns
bat_stat_cols = [# whiff percentage
                   'whiff_per',

                   # woba value
                   'woba_value'
                   ]

# group
rolling_bat_affects = (statcast_df.groupby(['batter', 'game_year', 'pitch_class'])[bat_stat_cols].rolling(window=50, min_periods=1))

# calculate the rolling 50 pitch statistics by batter and pitch class
rolling_bat_affects_means = rolling_bat_affects.mean().fillna(0)

# pivot the data so that it is in wide format for these statistics
rolling_bat_affects_means = rolling_bat_affects_means.unstack(level=2)

# rename the columns
rolling_bat_affects_means = (rolling_bat_affects_means.set_axis([f"batter_{y}_{x}_m" for x, y in rolling_bat_affects_means.columns],
                                                                    axis=1)
                        .reset_index()
                        .set_index('pitch_id'))

rolling_bat_affects_means = rolling_bat_affects_means.drop(['batter', 'game_year'], axis=1)

# merge the previous 50 pitch stats with data
statcast_df = statcast_df.merge(rolling_bat_affects_means, left_index=True, right_index=True)

In [None]:
bat_stat_cols = [
'batter_Changeup_whiff_per_m',
'batter_Curveball_whiff_per_m',
'batter_Cutter_whiff_per_m',
'batter_Fourseam_whiff_per_m',
'batter_Slider_whiff_per_m',
'batter_Twoseam_whiff_per_m',
'batter_Changeup_woba_value_m',
'batter_Curveball_woba_value_m',
'batter_Cutter_woba_value_m',
'batter_Fourseam_woba_value_m',
'batter_Slider_woba_value_m',
'batter_Twoseam_woba_value_m']

# iterate over the columns and forward fill each batter's most recent rolling statistics for each pitch type
for i in bat_stat_cols:
    statcast_df[i] = statcast_df.groupby('batter')[i].transform(lambda x: x.ffill())
    statcast_df[i] = statcast_df[i].fillna(0) # replace a missing value with 0

Calculate the number of times the pitcher has faced that batter in the current game (`tto`). Often, the pitcher will avoid throwing too many pitches of the same class to the same batter, as repetition is negatively correlated with performance.

In [None]:
# calculate number of times the pitcher has faced that batter in the current game
statcast_df["tto"] = statcast_df.groupby(["game_pk", "pitcher", "batter"])["at_bat_number"].rank("dense", ascending=True)
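As a small illustration of how the dense rank turns at bat numbers into a "times faced" counter, consider the hypothetical toy game below. Every pitch within the same at bat receives the same value, and the value increases by one each time the batter comes up against the same pitcher again.

```python
import pandas as pd

# Hypothetical toy data: one game, one pitcher, two batters, one row per pitch.
toy = pd.DataFrame({
    "game_pk": [1] * 6,
    "pitcher": [10] * 6,
    "batter": [7, 7, 8, 7, 7, 8],
    "at_bat_number": [1, 1, 2, 10, 10, 11],
})

# Dense rank of the at bat number within each (game, pitcher, batter) group.
toy["tto"] = (
    toy.groupby(["game_pk", "pitcher", "batter"])["at_bat_number"]
    .rank(method="dense", ascending=True)
)
```

Batter 7's two pitches in at bat 1 get `tto` 1.0, and the two pitches in at bat 10 get 2.0; batter 8 likewise goes from 1.0 to 2.0.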

#### Create previous pitch selection by pitcher features

The identity of the pitcher will be another factor in predicting pitch type, as not every pitcher throws every pitch type. Each pitcher has their own unique arsenal, which consists of a subset of pitch types. 75% of pitchers do not throw a cutter, so that target option can all but be eliminated for those pitchers when predicting pitch type. We will calculate features capturing each pitcher's pitch type selection over the previous 30 pitches (recent usage) and the previous 100 pitches (long-term usage). Where the pitcher has not yet thrown the minimum number of pitches required for a window (5 and 10 pitches respectively), the usage rates will be imputed with the league-average usage rates from the previous MLB season.

In [None]:
# Create OHE for pitch class
temp = statcast_df['pitch_class']
statcast_df['pitch_class'] = statcast_df['pitch_class'].str.lower()
statcast_df = pd.get_dummies(statcast_df, columns=['pitch_class'], dtype=float)
statcast_df['pitch_class'] = temp

In [None]:
# Calculate rolling 30-pitch pitch type usage for each pitcher (at least 5 pitches required)
pitch_classes = ['fourseam', 'twoseam', 'cutter', 'slider', 'curveball', 'changeup']

for pc in pitch_classes:
    statcast_df[f'recent_{pc}_usage'] = statcast_df.groupby('pitcher')[f'pitch_class_{pc}'].transform(lambda x: x.rolling(30, 5).mean())

    # need to shift the rolling means by 1 pitch so that the current pitch is not included in the rolling mean
    statcast_df[f'recent_{pc}_usage'] = statcast_df.groupby(['pitcher'])[f'recent_{pc}_usage'].shift(1)
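The effect of shifting the rolling mean by one pitch can be checked on a toy example. The sketch below uses hypothetical data, a single indicator column, and a window of 3 with a minimum of 2 pitches, so the numbers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical toy data: 1.0 when the pitch was a slider, 0.0 otherwise.
toy = pd.DataFrame({
    "pitcher": [10] * 5 + [20] * 3,
    "is_slider": [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0],
})

# Rolling usage per pitcher, including the current pitch.
rolled = toy.groupby("pitcher")["is_slider"].transform(
    lambda x: x.rolling(window=3, min_periods=2).mean()
)

# Shift by one pitch within each pitcher so each row only sees earlier pitches.
toy["slider_usage_prior"] = rolled.groupby(toy["pitcher"]).shift(1)
```

For pitcher 10, the third pitch receives 0.5, the slider rate of the first two pitches, and the pitch itself does not contribute. The first two rows of each pitcher stay missing because fewer than two earlier pitches exist.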

In [None]:
# Calculate rolling 100-pitch pitch type usage for each pitcher (at least 10 pitches required)
pitch_classes = ['fourseam', 'twoseam', 'cutter', 'slider', 'curveball', 'changeup']

for pc in pitch_classes:
    statcast_df[f'long_term_{pc}_usage'] = statcast_df.groupby('pitcher')[f'pitch_class_{pc}'].transform(lambda x: x.rolling(100, 10).mean())

    # need to shift the rolling means by 1 pitch so that the current pitch is not included in the rolling mean
    statcast_df[f'long_term_{pc}_usage'] = statcast_df.groupby(['pitcher'])[f'long_term_{pc}_usage'].shift(1)

In [None]:
# Calculate league average usage rates from the previous season
pitch_classes = ['fourseam', 'twoseam', 'cutter', 'slider', 'curveball', 'changeup']

season_average_usage = pd.DataFrame()
for pc in pitch_classes:
    season_average_usage[f'lg_avg_{pc}'] = statcast_df.groupby('game_year')[f'pitch_class_{pc}'].mean()
season_average_usage = season_average_usage.reset_index()

# move each season's averages forward one year so they join onto the following season
season_average_usage['game_year'] = season_average_usage['game_year'] + 1

statcast_df = statcast_df.merge(season_average_usage, on = 'game_year', how = 'left')
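The "add one to the season, then merge" step is what attaches the previous season's league average to each pitch. A toy example with hypothetical data and a single pitch class shows the result.

```python
import pandas as pd

# Hypothetical toy data: slider indicator for pitches in two seasons.
toy = pd.DataFrame({
    "game_year": [2017, 2017, 2018, 2018, 2018, 2018],
    "pitch_class_slider": [1.0, 0.0, 1.0, 1.0, 1.0, 0.0],
})

# League average slider usage per season: 2017 -> 0.50, 2018 -> 0.75
lg = (
    toy.groupby("game_year")["pitch_class_slider"]
    .mean()
    .rename("lg_avg_slider")
    .reset_index()
)

# Relabel each average with the following season, then join on game_year.
lg["game_year"] = lg["game_year"] + 1
merged = toy.merge(lg, on="game_year", how="left")
```

The 2018 pitches pick up the 2017 average (0.5), while the 2017 pitches have no earlier season in the toy data and are left missing.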

In [None]:
# Impute missing values in rolling averages with league average usage rates from the previous season
pitch_classes = ['fourseam', 'twoseam', 'cutter', 'slider', 'curveball', 'changeup']
for window in ['recent', 'long_term']:
    for pitch in pitch_classes:
        usage_col = f'{window}_{pitch}_usage'
        # fillna leaves observed values untouched and only fills the gaps
        statcast_df[usage_col] = statcast_df[usage_col].fillna(statcast_df[f'lg_avg_{pitch}'])
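
The league average fill in the two cells above can be illustrated on a toy frame. This is a minimal sketch, not the notebook's data: the `toy` frame and its values are made up, and only the slider class is shown. Adding 1 to `game_year` before the merge is what makes each row see the league average of the season before it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one row per pitch, a single pitch class
toy = pd.DataFrame({
    'game_year': [2017, 2017, 2018, 2018],
    'pitch_class_slider': [1, 0, 1, 1],
    'recent_slider_usage': [np.nan, 0.4, np.nan, 0.6],
})

# League average per season, re-keyed so season N's average is stored under N + 1
lg_avg = (toy.groupby('game_year')['pitch_class_slider']
             .mean()
             .rename('lg_avg_slider')
             .reset_index())
lg_avg['game_year'] = lg_avg['game_year'] + 1

# Each row now carries the previous season's league average (if one exists)
toy = toy.merge(lg_avg, on='game_year', how='left')

# Fill missing rolling usage with that previous-season average
toy['recent_slider_usage'] = toy['recent_slider_usage'].fillna(toy['lg_avg_slider'])
print(toy)
```

In this sketch the 2018 gap is filled with the 2017 league average (0.5), while the 2017 gap stays missing because the toy data has no earlier season to draw from.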

#### Incorporating previous pitch results with current game state

The results of earlier pitches within the at-bat may influence which pitch is thrown next. To capture this, we append the result of the previous pitch in the game to the game state of the pitch we are trying to predict.

##### Split the current game state features and the previous pitch result features

Let's start by specifying the columns we need from the original Statcast data.

In [None]:
##### Select columns of interest ----

# features contain information about the pitch event in the game
prev_columns = ['game_pk',
                'pitch_number',
                'at_bat_number',
                
                # target
                'pitch_class', 
           
                # pitch speed
                'release_speed',
                'effective_speed',
                
                # pitch release position
                'release_pos_x',
                'release_pos_y',
                'release_pos_z',
                
                # pitch velocity
                'vx0',
                'vy0',
                'vz0',
                
                # pitch acceleration
                'ax',
                'ay',
                'az',
                
                # pitch movement
                'pfx_x',
                'pfx_z',
                
                # pitch release spin
                'release_spin_rate',
                'release_extension',
                'spin_axis',
                
                # where crosses plate
                'zone',
                'plate_x',
                'plate_z',
                
                # pitch events
                'type',
                'events',
                'description',
                'bb_type',
                
                # hit location
                'hc_x',
                'hc_y',
                'hit_distance_sc',
                'launch_speed_angle'
      ]

# features contain information about the current game situation
game_state_columns = [# target
                      'pitch_class',
                      'batter',
                      'pitcher',
                      'starter',
                      'recent_fourseam_usage',
                      'recent_twoseam_usage',
                      'recent_cutter_usage',
                      'recent_slider_usage',
                      'recent_curveball_usage',
                      'recent_changeup_usage',
                      'long_term_fourseam_usage',
                      'long_term_twoseam_usage',
                      'long_term_cutter_usage',
                      'long_term_slider_usage',
                      'long_term_curveball_usage',
                      'long_term_changeup_usage',
                      'Changeup_release_speed_m',
                      'Curveball_release_speed_m',
                      'Cutter_release_speed_m',
                      'Fourseam_release_speed_m',
                      'Slider_release_speed_m',
                      'Twoseam_release_speed_m',
                      'Changeup_release_spin_rate_m',
                      'Curveball_release_spin_rate_m',
                      'Cutter_release_spin_rate_m',
                      'Fourseam_release_spin_rate_m',
                      'Slider_release_spin_rate_m',
                      'Twoseam_release_spin_rate_m',
                      'Changeup_strike_per_m',
                      'Curveball_strike_per_m',
                      'Cutter_strike_per_m',
                      'Fourseam_strike_per_m',
                      'Slider_strike_per_m',
                      'Twoseam_strike_per_m',
                      'Changeup_whiff_per_m',
                      'Curveball_whiff_per_m',
                      'Cutter_whiff_per_m',
                      'Fourseam_whiff_per_m',
                      'Slider_whiff_per_m',
                      'Twoseam_whiff_per_m',
                      'Changeup_woba_value_m',
                      'Curveball_woba_value_m',
                      'Cutter_woba_value_m',
                      'Fourseam_woba_value_m',
                      'Slider_woba_value_m',
                      'Twoseam_woba_value_m',
                      'batter_Changeup_whiff_per_m',
                      'batter_Curveball_whiff_per_m',
                      'batter_Cutter_whiff_per_m',
                      'batter_Fourseam_whiff_per_m',
                      'batter_Slider_whiff_per_m',
                      'batter_Twoseam_whiff_per_m',
                      'batter_Changeup_woba_value_m',
                      'batter_Curveball_woba_value_m',
                      'batter_Cutter_woba_value_m',
                      'batter_Fourseam_woba_value_m',
                      'batter_Slider_woba_value_m',
                      'batter_Twoseam_woba_value_m',
                      'inning',
                      'inning_topbot',
                      'outs_when_up',
                      'bat_score',
                      'fld_score',
                      'count',
                      'stand',
                      'p_throws',
                      'on_1b',
                      'on_2b',
                      'on_3b',
                      'game_pk',
                      'pitch_number',
                      'at_bat_number',
                      'tto',
                      'game_year']

In [None]:
# split the data set into the game state features and the results of the previous pitch
game_state_statcast_df = statcast_df[game_state_columns].sort_values(['game_pk','at_bat_number','pitch_number'])
prev_pitch_statcast_df = statcast_df[prev_columns].sort_values(['game_pk','at_bat_number','pitch_number'])

In [None]:
##### Missing Data Check ----

# game state data
percent = (game_state_statcast_df.isnull().sum()/game_state_statcast_df.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = game_state_statcast_df.isna().sum().sort_values(ascending = False)
missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
display(missing_data[missing_data['Percent'] != 0])

# previous pitch data
percent = (prev_pitch_statcast_df.isnull().sum()/prev_pitch_statcast_df.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = prev_pitch_statcast_df.isna().sum().sort_values(ascending = False)
missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
display(missing_data[missing_data['Percent'] != 0])

| Feature | Percent | Train Missing Count |
|---|---|---|
| long_term_twoseam_usage | 0.22 | 7066 |
| long_term_fourseam_usage | 0.22 | 7066 |
| long_term_curveball_usage | 0.22 | 7066 |
| long_term_slider_usage | 0.22 | 7066 |
| long_term_cutter_usage | 0.22 | 7066 |
| long_term_changeup_usage | 0.22 | 7066 |
| recent_twoseam_usage | 0.11 | 3543 |
| recent_cutter_usage | 0.11 | 3543 |
| recent_curveball_usage | 0.11 | 3543 |
| recent_changeup_usage | 0.11 | 3543 |


| Feature | Percent | Train Missing Count |
|---|---|---|
| launch_speed_angle | 82.72 | 2678175 |
| hc_y | 82.59 | 2674089 |
| hc_x | 82.59 | 2674089 |
| bb_type | 82.47 | 2670304 |
| events | 74.39 | 2408420 |
| hit_distance_sc | 73.05 | 2365257 |
| spin_axis | 12.09 | 391462 |
| release_spin_rate | 2.63 | 85177 |
| release_extension | 0.59 | 19070 |
| effective_speed | 0.48 | 15695 |


The network cannot know the outcome of a pitch before generating a prediction for it. We therefore pass in the result of the previous pitch so the model can learn from earlier pitch results. The resulting data frame contains one row per pitch, with the previous pitch's results and the current game state.
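The lookup described above can be sketched on a toy game. This is an illustration only: the three pitches and their values are made up, and `cumcount() + 1` is used here as a shorthand for a running pitch number within the game.

```python
import pandas as pd

# Hypothetical toy game: three pitches, already sorted in game order
pitches = pd.DataFrame({
    'game_pk': [1, 1, 1],
    'at_bat_number': [1, 1, 2],
    'pitch_number': [1, 2, 1],
    'count': ['0-0', '0-1', '0-0'],                             # known before the pitch
    'description': ['called_strike', 'hit_into_play', 'ball'],  # known only after the pitch
})

# Running pitch number within each game
pitches['game_pitch_number'] = pitches.groupby('game_pk').cumcount() + 1

state = pitches[['game_pk', 'game_pitch_number', 'count']].copy()
result = pitches[['game_pk', 'game_pitch_number', 'description']]

# Each game state row looks up the result of the pitch thrown just before it
state['prev_game_pitch_number'] = state['game_pitch_number'] - 1
joined = state.merge(
    result.add_prefix('prev_'),
    left_on=['game_pk', 'prev_game_pitch_number'],
    right_on=['prev_game_pk', 'prev_game_pitch_number'],
    how='left',
)
print(joined[['game_pitch_number', 'count', 'prev_description']])
```

The first pitch of the game has no match, so its `prev_` columns are missing. The third pitch, which opens a new at-bat, still receives the result of the last pitch of the previous at-bat, because the lookup runs over the whole game rather than within an at-bat.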

In [None]:
# number each pitch within the game, and keep a running pitch count for each pitcher in the game
game_state_statcast_df['game_pitch_number'] = 1
game_state_statcast_df['pitcher_pitch_number'] = game_state_statcast_df.groupby(['game_pk','pitcher'])['game_pitch_number'].cumsum()
game_state_statcast_df['game_pitch_number'] = game_state_statcast_df.groupby(['game_pk'])['game_pitch_number'].cumsum()

# number each pitch within the game in the previous pitch data (same sort order, so the numbers line up)
prev_pitch_statcast_df['game_pitch_number'] = 1
prev_pitch_statcast_df['game_pitch_number'] = prev_pitch_statcast_df.groupby(['game_pk'])['game_pitch_number'].cumsum()

# the number of the previous pitch in the game is the join key for the previous pitch results
game_state_statcast_df['prev_game_pitch_number'] = game_state_statcast_df['game_pitch_number'] - 1

In [None]:
# join the previous pitch result data to the current game state
statcast_new_df = game_state_statcast_df.merge(
    prev_pitch_statcast_df.add_prefix('prev_'),
    left_on=['game_pk', 'prev_game_pitch_number'],
    right_on=['prev_game_pk', 'prev_game_pitch_number'],
    how='left'
)

## Data Filtering

Some games go into extra innings. We remove any at-bats that occur in extra innings from the data, as these events are uncommon in MLB. The batter and pitcher embeddings are important inputs to the model, so we only keep at-bats featuring the batters and pitchers who make up 90% of all at-bats in the training seasons.

In [None]:
##### Data Filtering to First 9 innings and Starting in 2017 ----

# we are only going to predict the first 9 innings
statcast_new_df = statcast_new_df[statcast_new_df['inning'] <= 9.0]

# we are only using data starting in the 2017 season for training data
statcast_new_df = statcast_new_df[statcast_new_df['game_year'] >= 2017]

There are many instances in Major League Baseball where rookie batters and pitchers receive at-bats with little or no prior pitch sequence data. To predict next pitch types in the 2019, 2020, and 2021 seasons, the batter and the pitcher need a significant number of recorded at-bats in the 2017 and 2018 seasons. Only batters and pitchers accounting for 90% of at-bats in the 2017 and 2018 seasons are included in scope (443 batters and 521 pitchers).
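The 90% cutoff can be sketched with a cumulative share over players sorted by at-bat count. The player ids and counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Hypothetical at-bat counts per batter id
at_bats = pd.Series({'A': 50, 'B': 30, 'C': 15, 'D': 5}, name='at_bats')

# Sort from most to fewest at-bats and compute the running share of all at-bats
coverage = at_bats.sort_values(ascending=False).to_frame()
coverage['cum_perc'] = 100 * coverage['at_bats'].cumsum() / coverage['at_bats'].sum()

# Keep players while the running share stays at or below the threshold
kept = list(coverage[coverage['cum_perc'] <= 90].index)
print(kept)
```

Because the cutoff uses `<=`, the player whose at-bats push the running share past 90% is excluded, so the kept set covers at most 90% of at-bats (80% in this toy example).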

In [None]:
##### Filter to only batter and pitcher matchups accounting for 90% of total at-bats -----

# get the at bats from before 2019
train_years = statcast_new_df[statcast_new_df['game_year'] < 2019]
unique_abs_train = train_years[['game_pk', 'at_bat_number', 'batter', 'pitcher']].drop_duplicates()

# get the batter ids that made up 90% of at bats from 2017 to 2018
unique_abs_batter = unique_abs_train.groupby(['batter']).count().sort_values(by='game_pk', ascending=False)
unique_abs_batter['cum_sum'] = unique_abs_batter['game_pk'].cumsum()
unique_abs_batter['cum_perc'] = 100*unique_abs_batter['cum_sum']/unique_abs_batter['game_pk'].sum()
unique_batters = list(unique_abs_batter[unique_abs_batter['cum_perc'] <= 90].index)

# get the pitcher ids that made up 90% of at bats from 2017 to 2018
unique_abs_pitcher = unique_abs_train.groupby(['pitcher']).count().sort_values(by='game_pk', ascending=False)
unique_abs_pitcher['cum_sum'] = unique_abs_pitcher['game_pk'].cumsum()
unique_abs_pitcher['cum_perc'] = 100*unique_abs_pitcher['cum_sum']/unique_abs_pitcher['game_pk'].sum()
unique_pitchers = list(unique_abs_pitcher[unique_abs_pitcher['cum_perc'] <= 90].index)

# keep only matchups where both the batter and the pitcher are among those making up 90% of at bats
statcast_new_df = statcast_new_df[(statcast_new_df["batter"].isin(unique_batters)) & (statcast_new_df["pitcher"].isin(unique_pitchers))]

# delete objects no longer needed for memory
import gc
gc.enable()
del unique_abs_batter, unique_abs_pitcher, train_years, unique_abs_train
gc.collect()

## Data Splitting

To avoid data leakage, the data cannot be randomly split across all years into the training, validation, and test sets. Future pitches cannot be included as inputs within the training data to generate predictions for past pitches in the validation and/or test data. The 2017 and 2018 seasons are used for training, the 2019 season for validation, and the 2020 and 2021 seasons for testing the next pitch prediction model.
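A season-based split of this kind can be checked with a few assertions. The sketch below uses a made-up five-row frame; the column names mirror the notebook, but the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows, one per pitch, tagged with the season they were thrown in
df = pd.DataFrame({
    'game_year': [2017, 2018, 2019, 2020, 2021],
    'pitch_class': ['Fourseam', 'Slider', 'Cutter', 'Fourseam', 'Changeup'],
})

train = df[df['game_year'] < 2019]
validation = df[df['game_year'] == 2019]
test = df[df['game_year'] > 2019]

# Every training season precedes validation, which precedes every test season
assert train['game_year'].max() < validation['game_year'].min()
assert validation['game_year'].max() < test['game_year'].min()

# No row is lost or duplicated by the split
assert len(train) + len(validation) + len(test) == len(df)
```

Splitting on the season boundary, rather than sampling rows at random, guarantees that no pitch in the validation or test sets was thrown before a pitch in the training set.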

In [None]:
##### Data Splitting ----

# split the data into train, validation, and test
train_data = statcast_new_df[statcast_new_df['game_year'] < 2019].drop(columns=['game_year'])
validation_data = statcast_new_df[statcast_new_df['game_year'] == 2019].drop(columns=['game_year'])
test_data = statcast_new_df[statcast_new_df['game_year'] > 2019].drop(columns=['game_year'])

Checking for missing data after these transformations shows that some of the previous pitch features are still missing. These rows represent the first pitches of a game, for which there is no previous pitch information. These features will be imputed with a 0 when passed into the model.
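One way the zero imputation could look is sketched below. This is an assumption about the downstream step, which is not shown in this notebook: the toy frame is made up, and only numeric `prev_` columns are filled here, since how categorical previous pitch features are handled is not specified.

```python
import numpy as np
import pandas as pd

# Hypothetical slice of prepared data; the first row is the first pitch of a game
toy = pd.DataFrame({
    'prev_release_speed': [np.nan, 94.1, 86.3],
    'prev_plate_x': [np.nan, -0.2, 0.7],
    'inning': [1, 1, 1],
})

# Assumption: only numeric previous-pitch columns are filled with 0
prev_numeric = [c for c in toy.columns
                if c.startswith('prev_') and pd.api.types.is_numeric_dtype(toy[c])]
toy[prev_numeric] = toy[prev_numeric].fillna(0)
print(toy)
```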

In [None]:
##### Missing Data Check ----

# training data
percent = (train_data.isnull().sum()/train_data.isnull().count()*100).sort_values(ascending = False).round(2)
sum_missing = train_data.isna().sum().sort_values(ascending = False)
missing_data  = pd.concat([percent, sum_missing], axis=1, keys=['Percent', "Train Missing Count"])
display(missing_data[missing_data['Percent'] != 0])

## Save output

The training, validation, and test data sets are used throughout the model building process. Several models will be attempted, so the data sets need to be consistent across model runs. These files are referenced throughout the model notebooks for training, tuning, and evaluating models.
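A small round trip shows what a model notebook gets back when it reads one of these files. This is a sketch on a made-up frame, and the temporary directory is only a stand-in for `DATA_DIR`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical prepared data
toy = pd.DataFrame({'pitch_class': ['Fourseam', 'Slider'], 'inning': [1, 2]})

with tempfile.TemporaryDirectory() as tmp_dir:  # stand-in for DATA_DIR
    path = os.path.join(tmp_dir, 'train_data.csv')
    # index=False keeps the row index out of the file, so no extra column appears on reload
    toy.to_csv(path, index=False, header=True)
    reloaded = pd.read_csv(path)

# The reloaded frame matches what was saved
pd.testing.assert_frame_equal(toy, reloaded)
```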

In [None]:
##### Save the training, validation, and test prepped data ----

# Save data
train_data.to_csv(os.path.join(DATA_DIR, 'train_data.csv'), index=False, header=True)
validation_data.to_csv(os.path.join(DATA_DIR, 'validation_data.csv'), index=False, header=True)
test_data.to_csv(os.path.join(DATA_DIR, 'test_data.csv'), index=False, header=True)