# Pipeline Architecture:
This notebook builds a modeling pipeline that chains classification into regression, with each step feeding its output to the next.  The general form is:

1. Random Forest Classification of Pitch Type
2. Linear Regression of X-Coordinate of Pitch Location (Px) - pitch type used as feature
3. Linear Regression of Z-Coordinate of Pitch Location (Pz) - pitch type and px used as feature
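
A minimal sketch of the three chained stages, assuming scikit-learn and purely synthetic data (the features, targets and seeds below are made up for illustration and are not the notebook's real columns). The models are trained on true upstream values, while at validation time each stage receives the previous stage's predictions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-ins for game-state features, pitch type, px and pz.
n = 400
X = rng.normal(size=(n, 3))
pitch_type = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
px = 0.5 * pitch_type + 0.2 * X[:, 1] + rng.normal(scale=0.3, size=n)
pz = 2.5 - 0.4 * pitch_type + 0.3 * px + rng.normal(scale=0.3, size=n)

train, val = slice(0, 300), slice(300, n)

# Stage 1: classify pitch type from the base features.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[train], pitch_type[train])
type_pred = clf.predict(X[val])

# Stage 2: regress px. Train on the true pitch type, validate on the predicted one.
px_model = LinearRegression().fit(
    np.column_stack([X[train], pitch_type[train]]), px[train]
)
px_pred = px_model.predict(np.column_stack([X[val], type_pred]))

# Stage 3: regress pz. Train on true type and px, validate on the predicted ones.
pz_model = LinearRegression().fit(
    np.column_stack([X[train], pitch_type[train], px[train]]), pz[train]
)
pz_pred = pz_model.predict(np.column_stack([X[val], type_pred, px_pred]))

print(type_pred.shape, px_pred.shape, pz_pred.shape)
```

Because errors made in stage 1 flow into stages 2 and 3, the validation scores of the later stages reflect the whole chain rather than each regression in isolation.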

Importing packages:

In [1]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [2]:
# pandas, numpy and seaborn are already imported in the previous cell.
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

Loading the pickled initial data to work with:

In [3]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project/Pitch_Classification'

In [4]:
with open('../Data/train_df_clusters.pickle','rb') as read_file:
    pitch_df = pickle.load(read_file)

To decide which pitchers to include in the pipeline, I'll look at the top 50 pitchers by number of pitches thrown:

In [5]:
pitch_df.pitcher_full_name.value_counts().head(50)

Max Scherzer         13626
Chris Sale           13284
Justin Verlander     12999
Jose Quintana        12944
Chris Archer         12760
Rick Porcello        12745
Jon Lester           12566
Corey Kluber         12480
Gio Gonzalez         12439
Julio Teheran        12125
Zack Greinke         12092
Jake Arrieta         12028
Cole Hamels          12016
Trevor Bauer         11860
Kyle Gibson          11829
Jacob deGrom         11775
Gerrit Cole          11772
James Shields        11768
Marco Estrada        11763
Jake Odorizzi        11719
Dallas Keuchel       11708
J.A. Happ            11597
Kevin Gausman        11540
Tanner Roark         11296
David Price          11205
Mike Leake           11141
Mike Fiers           11093
Ian Kennedy          11026
Kyle Hendricks       10952
Carlos Martinez      10920
Carlos Carrasco      10913
Andrew Cashner       10907
CC Sabathia          10654
Masahiro Tanaka      10589
Jeff Samardzija      10573
Madison Bumgarner    10551
Jason Hammel         10544
...

In [6]:
pitch_df.pitcher_full_name.value_counts().head(50).sum()

564235

So the top 50 pitchers would cover 564,235 total pitches from 2015-2018.

In [68]:
pitch_df.tail()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,stand,p_throws,event,home_team,...,cumulative_si_rate,cumulative_fc_rate,cumulative_kc_rate,cumulative_fs_rate,cumulative_kn_rate,cumulative_ep_rate,cumulative_fo_rate,cumulative_sc_rate,Name,Cluster
2870367,9.0,595879,623352,0.0,2018186000.0,3.0,R,L,Single,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Javier Baez,1.0
2870368,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Anthony Rizzo,3.0
2870369,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Anthony Rizzo,3.0
2870370,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Anthony Rizzo,3.0
2870371,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Anthony Rizzo,3.0


In [7]:
pitch_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'cumulative_pitches', 'cumulative_ff_rate',
       'cumulative_sl_rate', 'cumulative_ft_rate', 'cumulative_ch_rate',
       'cumulative_cu_rate', 'cumulative_si_rate', 'cumulative_fc_rate',
       'cumulative_kc_rate', 'cumulative_fs_rate', 'cumulative_kn_rate',
       'cumulative_ep_rate', 'cumulative_fo_rate', 'cumulative_sc_rate',
       'Name', 'Cluster'],
      dtype='object')

Checking out pitch types:

In [8]:
pitch_df.pitch_type.value_counts()

FF    1023717
SL     454880
FT     340899
CH     295221
SI     244501
CU     236045
FC     150892
KC      66942
FS      44233
KN      11295
EP        816
FO        810
SC        121
Name: pitch_type, dtype: int64

In [9]:
pitch_df = pitch_df[pitch_df.pitch_type != 'EP']

Importing the regression functions:

In [10]:
from location_regression_functions import *
from pitch_cat_functions import *

Importing the functions from the pipeline script:

In [11]:
from classification_location_combo import *

### Initial dataframe setup:

Before building a workflow that models multiple pitchers in succession, I will build out the pipeline here step by step, starting with the dataframe structure needed for modeling.  The first step is to filter for the desired pitcher's name, keeping only rows that have a previous pitch location:

In [12]:
scherzer_df = pitch_df[(pitch_df.pitcher_full_name == 'Max Scherzer') &(pitch_df.last_pitch_px.notnull())]

One hot encoding of needed columns:

In [13]:
ohe_cols = ['stand', 'p_throws']

In [14]:
ohe_df = column_ohe_maker(scherzer_df, ohe_cols)
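
The body of `column_ohe_maker` is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it wraps `pd.get_dummies` with `drop_first=True` (the function name `one_hot_columns` and the toy data are hypothetical):

```python
import pandas as pd

def one_hot_columns(df, cols):
    """Hypothetical stand-in for column_ohe_maker: dummy-encode `cols`, dropping the first level."""
    return pd.get_dummies(df, columns=cols, drop_first=True, dtype=float)

toy = pd.DataFrame({
    "stand": ["R", "L", "R"],
    "p_throws": ["R", "R", "R"],   # a single pitcher only throws with one hand
    "px": [0.1, -0.4, 0.3],
})
encoded = one_hot_columns(toy, ["stand", "p_throws"])
print(list(encoded.columns))
```

With `drop_first=True`, a column that has only one level (such as `p_throws` for a single pitcher) produces no dummy columns at all, which would be consistent with the later dataframes containing `stand_R` but no `p_throws` column.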

Numerically coding last pitch type and pitch type:

In [15]:
temp_df = last_pitch_type_to_num(ohe_df, 'last_pitch_type')

Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [16]:
output_df = pitch_type_to_num(temp_df, 'pitch_type')

Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
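
The codings printed above look like they are ordered by how often each pitch appears, although that is an inference from the output and not something the notebook states. A minimal sketch of one way to build such a coding, assuming frequency order (the helper `code_by_frequency` is hypothetical and is not the notebook's `pitch_type_to_num`):

```python
import pandas as pd

def code_by_frequency(series):
    """Hypothetical helper: map each label to an integer, most frequent label first."""
    order = series.value_counts().index
    return {label: code for code, label in enumerate(order)}

toy = pd.Series(["FF", "FF", "FF", "SL", "SL", "CH"])
coding = code_by_frequency(toy)
codes = toy.map(coding)   # integer-coded version of the column
print(coding)
```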


Declaring the columns for each step of the modeling process:

In [17]:
rf_cols = ['Cluster','inning', 'top', 'on_1b', 'on_2b', 'on_3b', 'b_count', 's_count', 'outs', 'stand_R',
       'pitcher_run_diff','last_pitch_speed', 'last_pitch_px', 'last_pitch_pz','pitch_num','cumulative_pitches',
       'cumulative_ff_rate', 'cumulative_sl_rate', 'cumulative_ft_rate',
       'cumulative_ch_rate', 'cumulative_cu_rate', 'cumulative_si_rate',
       'cumulative_fc_rate', 'cumulative_kc_rate', 'cumulative_fs_rate',
       'cumulative_kn_rate', 'cumulative_ep_rate', 'cumulative_fo_rate',
       'cumulative_sc_rate', 'Last_Pitch_Type_Num']

In [18]:
px_cols = rf_cols.copy()

In [19]:
px_cols.append('Pitch_Type_Num')

In [20]:
px_cols_val = px_cols.copy()
px_cols_val.append('pitch_pred')
px_cols_val.remove('Pitch_Type_Num')

In [21]:
pz_cols = px_cols.copy()
pz_cols.append('px')

In [22]:
pz_cols_val = pz_cols.copy()
pz_cols_val.append('px_pred')
pz_cols_val.remove('px')
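
Each validation list has to line up position by position with its training list, because the regression is fit on the training column order. Note that, as written above, `pz_cols_val` is derived from `px_cols`, so it still contains `Pitch_Type_Num` (the true pitch type) rather than `pitch_pred`. If the intent is for the pz validation step to rely only on upstream predictions, a variant along these lines would do it. This is a sketch with a shortened feature list, and `swap` is a hypothetical helper:

```python
# Shortened, illustrative base feature list (the notebook's rf_cols is longer).
rf_cols = ["inning", "b_count", "s_count", "Last_Pitch_Type_Num"]

def swap(cols, old, new):
    """Return a copy of cols with `old` replaced by `new` at the same position."""
    return [new if c == old else c for c in cols]

px_cols = rf_cols + ["Pitch_Type_Num"]
px_cols_val = swap(px_cols, "Pitch_Type_Num", "pitch_pred")

pz_cols = px_cols + ["px"]
# Variant that uses only upstream predictions at validation time.
pz_cols_val = swap(swap(pz_cols, "Pitch_Type_Num", "pitch_pred"), "px", "px_pred")

# Training and validation lists must have the same length and ordering.
assert len(px_cols) == len(px_cols_val) and len(pz_cols) == len(pz_cols_val)
print(pz_cols_val)
```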

Instead of a traditional random train/test split over individual pitches, I need to keep related pitches together.  To help with generalization and to simulate unseen data, I'll split by at bat (using split_pitch_data in classification_location_combo.py), so that all pitches from one at bat land on the same side of the split.

In [23]:
training_pitches, val_pitches = split_pitch_data(output_df, test_size=0.20)

Actual Test Size: 0.24353704211199415
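
The body of `split_pitch_data` is not shown here. A minimal sketch of a group-wise split, assuming whole at bats are held out by `ab_id` (the function `split_by_group` and the toy data are hypothetical). Since at bats contain different numbers of pitches, the fraction of rows held out can differ from the requested `test_size`, which is one plausible reason the reported size is not exactly 0.20:

```python
import numpy as np
import pandas as pd

def split_by_group(df, group_col, test_size=0.2, seed=0):
    """Hypothetical stand-in for split_pitch_data: hold out whole groups (at bats)."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].unique()
    n_val = max(1, int(round(test_size * len(groups))))
    val_groups = rng.choice(groups, size=n_val, replace=False)
    is_val = df[group_col].isin(val_groups)
    return df[~is_val].copy(), df[is_val].copy()

# Toy data: 10 at bats with between 1 and 6 pitches each.
toy = pd.DataFrame({"ab_id": np.repeat(np.arange(10), [1, 2, 3, 4, 5, 6, 1, 2, 3, 4])})
train, val = split_by_group(toy, "ab_id", test_size=0.2)
print("Actual test size:", len(val) / len(toy))
```

Returning copies also keeps later column assignments on the validation frame from raising pandas' `SettingWithCopyWarning`.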


Now that the training and validation dataframes are separate, I can model each step using the column lists declared above.

### Classification:

Setting up X and y for the training and validation sets (the split itself was already done above):

In [24]:
X_rf_train = training_pitches[rf_cols]

In [25]:
y_rf_train = training_pitches['Pitch_Type_Num']

In [26]:
X_rf_val = val_pitches[rf_cols]

In [27]:
y_rf_val = val_pitches['Pitch_Type_Num']

Fitting the model on the training set, then predicting on the validation set:

In [28]:
rf_model = RandomForestClassifier()
rf_model.fit(X_rf_train,y_rf_train)

RandomForestClassifier()

In [29]:
y_rf_pred = rf_model.predict(X_rf_val)

In [30]:
unique, counts = np.unique(y_rf_pred, return_counts=True)

In [31]:
np.asarray((unique, counts)).T

array([[   0, 2195],
       [   1,  315],
       [   2,   96],
       [   3,   24],
       [   4,   34],
       [   5,    2]])

Running some metrics:

In [32]:
cm = confusion_matrix(y_rf_val, y_rf_pred)
cm

array([[1196,  141,   37,    9,   14,    0],
       [ 371,  133,    8,    1,    0,    0],
       [ 314,   23,   28,    7,    4,    1],
       [ 196,    8,   10,    5,    3,    0],
       [ 108,    3,   13,    2,   13,    0],
       [  10,    7,    0,    0,    0,    1]])
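
To put the matrix above into numbers, here is a short sketch that derives accuracy, per-class recall and precision, and a majority-class baseline directly from it, using only NumPy. Rows are taken to be the true class and columns the predicted class, which is scikit-learn's convention for `confusion_matrix`:

```python
import numpy as np

# Confusion matrix from the cell above (rows: true class, columns: predicted class).
cm = np.array([
    [1196, 141, 37, 9, 14, 0],
    [371, 133, 8, 1, 0, 0],
    [314, 23, 28, 7, 4, 1],
    [196, 8, 10, 5, 3, 0],
    [108, 3, 13, 2, 13, 0],
    [10, 7, 0, 0, 0, 1],
])

accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Guard against classes that were never predicted (column sum of zero).
pred_totals = cm.sum(axis=0)
precision_per_class = np.divide(
    np.diag(cm), pred_totals, out=np.zeros(len(cm)), where=pred_totals > 0
)

# Baseline: always predict the most common true class.
majority_baseline = cm.sum(axis=1).max() / cm.sum()

print(f"accuracy={accuracy:.3f}  baseline={majority_baseline:.3f}")
print("recall per class:", np.round(recall_per_class, 3))
print("precision per class:", np.round(precision_per_class, 3))
```

For this matrix the accuracy works out to 1376/2666, slightly below the 1397/2666 share of true class-0 (FF) pitches, so the classifier does not yet beat always guessing the most common pitch.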

### Px:

Adding the predicted pitch types to the validation dataframe, for use as a feature in Px regression:

In [33]:
val_pitches['pitch_pred'] = y_rf_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pitch_pred'] = y_rf_pred
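
The warning above is pandas' `SettingWithCopyWarning`, raised because `val_pitches` was derived from another dataframe by filtering. A minimal sketch of two common ways to avoid it, on toy data (this is not a change to the notebook's own functions):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"ab_id": [1, 1, 2, 2], "px": [0.1, -0.3, 0.5, 0.0]})

# Taking an explicit copy makes the validation frame independent of its parent,
# so adding a column no longer triggers SettingWithCopyWarning.
val = toy[toy["ab_id"] == 2].copy()
val["pitch_pred"] = np.array([0, 1])

# Alternatively, assign returns a new frame and leaves the original untouched.
val2 = toy[toy["ab_id"] == 2].assign(pitch_pred=[0, 1])

print(val)
```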


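Setting up X and y for training and validation.  The training features use the true pitch type (Pitch_Type_Num), while the validation features use the predicted one (pitch_pred):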

In [34]:
X_px_tr = training_pitches[px_cols]
y_px_tr = training_pitches['px']

In [35]:
X_px_val = val_pitches[px_cols_val]
y_px_val = val_pitches['px']

Instantiating linear regression:

In [36]:
lm = LinearRegression()

In [37]:
lm.fit(X_px_tr, y_px_tr)

LinearRegression()

In [38]:
px_pred = lm.predict(X_px_val)

Getting some metrics:

In [39]:
px_r2 = lm.score(X_px_val, y_px_val)
px_r2

0.06411949700287711

In [40]:
px_mae = mae(y_px_val, px_pred)

In [41]:
px_mae

0.6794465366889837
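
For reference, `LinearRegression.score` returns R^2, and `mae` (brought in by one of the star imports) is presumably mean absolute error; that second point is an assumption, since its definition is not shown here. A small NumPy sketch of both metrics on toy values:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mean_abs_error(y_true, y_pred):
    """Average absolute distance between prediction and truth, in the target's units."""
    diff = np.asarray(y_true, float) - np.asarray(y_pred, float)
    return float(np.mean(np.abs(diff)))

y = np.array([-1.0, 0.0, 1.0, 2.0])
mean_only = np.full_like(y, y.mean())   # a model that always predicts the mean
print(r2(y, mean_only), mean_abs_error(y, mean_only))
```

A model that always predicts the mean scores an R^2 of 0, so the value of about 0.064 above means the regression explains roughly 6% of the variance in px on the validation set. On held-out data R^2 can also go negative, which indicates a fit worse than predicting the mean.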

Adding a px_pred column to the dataframe to utilize in the pz regression:

In [42]:
val_pitches['px_pred'] = px_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['px_pred'] = px_pred


### Pz:

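Setting up X and y for training and validation.  The training features use the true px, while the validation features use the predicted one (px_pred):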

In [43]:
X_pz_tr = training_pitches[pz_cols]
y_pz_tr = training_pitches['pz']

In [44]:
X_pz_val = val_pitches[pz_cols_val]
y_pz_val = val_pitches['pz']

Instantiating linear regression:

In [45]:
lm = LinearRegression()

In [46]:
lm.fit(X_pz_tr, y_pz_tr)

LinearRegression()

In [47]:
pz_pred = lm.predict(X_pz_val)

Getting some metrics:

In [48]:
pz_r2 = lm.score(X_pz_val, y_pz_val)
pz_r2

0.1489761794161102

In [49]:
pz_mae = mae(y_pz_val, pz_pred)
pz_mae

0.6234399445809027

Adding a pz_pred column to the dataframe, so the validation set carries all three predictions:

In [50]:
val_pitches['pz_pred'] = pz_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pz_pred'] = pz_pred


### Evaluating the pipeline:

Looking at the final validation df:

In [51]:
output_df = pd.DataFrame(columns=val_pitches.columns)

In [52]:
output_df = pd.concat([output_df, val_pitches])

In [53]:
output_df

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_fo_rate,cumulative_sc_rate,Name,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.044230,2.523223
2454,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.007730,2.420953
2455,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,1,0.113485,2.260595
2496,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,1,0,0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,0,0,0,-0.328774,2.583225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846534,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,0,1,1,0.219846,1.985454
2846535,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,1,1,1,0.250910,1.893609
2846536,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,1,0,0,0.108060,2.269643
2846537,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,0,2,0,0.029326,1.923316


In [54]:
val_pitches.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_fo_rate,cumulative_sc_rate,Name,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.04423,2.523223
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.00773,2.420953
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,1,0.113485,2.260595
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,1,0,0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,0,0,0,-0.328774,2.583225


In [55]:
val_pitches.shape

(2666, 54)

In [56]:
val_pitches.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'event',
       'home_team', 'away_team', 'b_score', 'on_1b', 'on_2b', 'on_3b', 'px',
       'pz', 'zone', 'pitch_type', 'start_speed', 'type', 'b_count', 's_count',
       'outs', 'pitch_num', 'last_pitch_type', 'last_pitch_px',
       'last_pitch_pz', 'last_pitch_speed', 'pitcher_full_name',
       'pitcher_run_diff', 'hitter_full_name', 'Date_Time_Date', 'Season',
       'cumulative_pitches', 'cumulative_ff_rate', 'cumulative_sl_rate',
       'cumulative_ft_rate', 'cumulative_ch_rate', 'cumulative_cu_rate',
       'cumulative_si_rate', 'cumulative_fc_rate', 'cumulative_kc_rate',
       'cumulative_fs_rate', 'cumulative_kn_rate', 'cumulative_ep_rate',
       'cumulative_fo_rate', 'cumulative_sc_rate', 'Name', 'Cluster',
       'stand_R', 'Last_Pitch_Type_Num', 'Pitch_Type_Num', 'pitch_pred',
       'px_pred', 'pz_pred'],
      dtype='object')

# Utilizing Functions

Based on the above, I put together functions to run this same process in classification_location_combo.py.  Running those here:

### Random Forest:

In [57]:
val_df_full = pitch_prediction_modeling_pipeline('Max Scherzer', pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5108777194298575
Precision: (0.36546652884847725,)
Recall: 0.22209160369601336
Random Forest Pitch Classification confusion matrix results:
[[1186  130   50   14   16    1]
 [ 371  133    6    3    0    0]
 [ 305   33   27    6    6    0]
 [ 201    7    8    4    2    0]
 [ 115    3    9    1   11    0]
 [  10    6    1    0    0    1]]
Val Px R^2: 0.062325783098987175
Val Px MAE: 0.680765619510812 ft.
Val Pz R^2: 0.14848516786356858
Val Pz MAE: 0.6236726718293819 ft.


In [58]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.57,0.22,0.01,0.01,0.19,0.0,0.04423,2.523223
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.51,0.19,0.05,0.06,0.19,0.0,0.00773,2.420953
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,1,0.23,0.54,0.18,0.02,0.03,0.0,0.113485,2.260595
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0,0.46,0.05,0.19,0.29,0.01,0.0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,3,0.35,0.13,0.14,0.38,0.0,0.0,-0.170658,2.532033
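
The `*_prob` columns above are presumably built from the classifier's `predict_proba` output; the pipeline function's body is not shown, so this is an assumption. A minimal sketch on synthetic data, with a hypothetical `code_to_label` mapping:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = rng.integers(0, 3, size=200)              # integer-coded pitch types
code_to_label = {0: "FF", 1: "SL", 2: "CH"}   # hypothetical coding

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

val = X.iloc[:5]
proba = clf.predict_proba(val)                # shape (n_rows, n_classes)

# One probability column per class, named after the pitch label.
prob_cols = [f"{code_to_label[int(c)]}_prob" for c in clf.classes_]
probs = pd.DataFrame(proba, columns=prob_cols, index=val.index)

result = pd.concat([val, probs], axis=1)
result["pitch_pred"] = clf.classes_[proba.argmax(axis=1)]
print(result[prob_cols + ["pitch_pred"]])
```

Keeping the full probability vector, not just the top class, makes it possible to report how confident each pitch type prediction is.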


Trying with another pitcher:

In [59]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2)

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157
Accuracy: 0.4162293096487687
Precision: (0.3111562640689481,)
Recall: 0.2734400647117818
Random Forest Pitch Classification confusion matrix results:
[[541 160  50  23  23   0]
 [206 335  51  36  18   0]
 [191 110  73  22  13   0]
 [141 124  27  43   9   0]
 [113  77  32  18  39   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.014557609052274323
Val Px MAE: 0.6486744755969264 ft.
Val Pz R^2: 0.06221922446446859
Val Pz MAE: 0.7294498061232275 ft.


  _warn_prf(average, modifier, msg_start, len(result))


In [60]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FT_prob,FF_prob,SL_prob,CU_prob,CH_prob,SI_prob,px_pred,pz_pred
6895,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,1,0,0.59,0.14,0.16,0.06,0.05,0.0,-0.353588,2.23679
6896,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,0,0,0.52,0.1,0.2,0.14,0.04,0.0,-0.205796,2.447072
6897,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.71,0.1,0.02,0.09,0.08,0.0,-0.536513,2.564916
6898,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.39,0.06,0.18,0.23,0.14,0.0,-0.456235,2.618067
6899,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.33,0.12,0.2,0.16,0.19,0.0,-0.516097,2.522265


Testing on the top 10 pitchers by pitch count to get a sense of the computational cost:

In [61]:
pitcher_list = pitch_df.pitcher_full_name.value_counts().head(10).index

In [62]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5183795948987246
Precision: (0.342726904843907,)
Recall: 0.2262029733328285
Random Forest Pitch Classification confusion matrix results:
[[1203  120   43   14   17    0]
 [ 367  133    8    4    0    1]
 [ 316   19   29    5    7    1]
 [ 198    6    9    4    5    0]
 [ 110    2   13    2   12    0]
 [  10    6    1    0    0    1]]
Val Px R^2: 0.06153724557577367
Val Px MAE: 0.6802779498197136 ft.
Val Pz R^2: 0.1476924443892782
Val Pz MAE: 0.6239005089910424 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24533083059596433
Accuracy: 0.486610558530987
Precision: (0.46151852003132376,)
Recall: 0.4670491413718008
Random Forest Pitch Classification confusion matrix results:
[[632  93 107  29]
 [229 237  64 211]
 [235  72 148  85]
 [ 20 166  31 255]]
Val Px R^2: 0.023095089199271168
Val Px MAE: 0.6843916309697382 ft.
Val Pz R^2: 0.014973378131042825
Val Pz MAE: 0.6707055464911662 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2518304431599229


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.6009946442234124
Precision: (0.2864531353958749,)
Recall: 0.19936569335504417
Random Forest Pitch Classification confusion matrix results:
[[1449   57   36    0    0    0]
 [ 411   91   12    0    0    0]
 [ 347   38   30    3    0    0]
 [ 121    6    2    1    0    0]
 [   9    0    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.08661369514665618
Val Px MAE: 0.6161545213994422 ft.
Val Pz R^2: 0.17335400989189187
Val Pz MAE: 0.6829630040791395 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'UN': 6, 'FA': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2551168881559802
Accuracy: 0.48136882129277564
Precision: (0.3498715095335006,)
Recall: 0.3087461631797343
Random Forest Pitch Classification confusion matrix results:
[[954 204  60   6  36]
 [420 182  57   9  24]
 [127  42 100   2   0]
 [145  44  18   3   5]
 [121  42   0   2  27]]
Val Px R^2: -0.03062325274769706
Val Px MAE: 0.6625562392018111 ft.
Val Pz R^2: 0.028746998411795177
Val Pz MAE: 0.7309372245348111 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5883487654320988
Precision: (0.5163901537645826,)
Recall: 0.36970086516365164
Random Forest Pitch Classification confusion matrix results:
[[855 352  14   7   1]
 [433 630  10  11   0]
 [145  55  24   3   0]
 [  9  19   1  15   0]
 [  4   2   0   1   1]]
Val Px R^2: 0.17566407421103236
Val Px MAE: 0.5668685464726025 ft.
Val Pz R^2: 0.10881452114566903
Val Pz MAE: 0.7859658427984146 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.41017359709325796
Precision: (0.2996594773435209,)
Recall: 0.26892607394418105
Random Forest Pitch Classification confusion matrix results:
[[545 141  63  27  21   0]
 [224 319  43  37  23   0]
 [188 113  70  17  21   0]
 [153 109  26  43  13   0]
 [108  79  36  17  39   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.018716112611301572
Val Px MAE: 0.6471149573265069 ft.
Val Pz R^2: 0.06373489859765347
Val Pz MAE: 0.7289494810786988 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.4640600315955766
Precision: (0.2695997172636402,)
Recall: 0.24171678618169187
Random Forest Pitch Classification confusion matrix results:
[[990 134  41  11   3]
 [407 150  30  10   1]
 [257  59  24   5   2]
 [174  35   4  11   1]
 [130  41   6   6   0]]
Val Px R^2: 0.07352767878561417
Val Px MAE: 0.7153903679259673 ft.
Val Pz R^2: 0.16208339916356196
Val Pz MAE: 0.6037630627963091 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.37530168946098147
Precision: (0.338991759191259,)
Recall: 0.3089662405928412
Random Forest Pitch Classification confusion matrix results:
[[440 122  60  82  71  10]
 [201 183  37  52  56   4]
 [153  68  72  66  32   3]
 [152  23  36 103   0   1]
 [135  45   9   0 130   3]
 [ 59  22  10  22  19   5]]
Val Px R^2: 0.11318535310347666
Val Px MAE: 0.6469497392179319 ft.
Val Pz R^2: 0.037226938438072565
Val Pz MAE: 0.6934128696593213 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.24769076305220883
Accuracy: 0.40413457640859346
Precision: (0.3783593371958611,)
Recall: 0.36264521059654153
Random Forest Pitch Classification confusion matrix results:
[[513 130 101  49]
 [268 302  59  72]
 [254 106 102  45]
 [192 145  49  80]]
Val Px R^2: 0.0772145426681915
Val Px MAE: 0.6798956565932708 ft.
Val Pz R^2: 0.21765718544311008
Val Pz MAE: 0.673657193883049 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'PO': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24703943981052415
Accuracy: 0.4385160483534806
Precision: (0.34920271722256707,)
Recall: 0.2835705570381501
Random Forest Pitch Classification confusion matrix results:
[[780 122  74  20  16]
 [363 122  52   4   4]
 [225  54 120  11  19]
 [139  20  33  20   3]
 [145  15  19   9  10]]
Val Px R^2: 0.12132717416734273
Val Px MAE: 0.6897067036832362 ft.
Val Pz R^2: 0.10378567401374617
Val Pz MAE: 0.6779873784021134 ft.






Based on my stopwatch, a run over 25 pitchers took about 5 minutes, so modeling on a per-pitcher basis is not terribly expensive.
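
A stopwatch works, but timing each pitcher in code shows where the time goes. A minimal sketch using only the standard library; the workload below is a placeholder and `time_per_item` is a hypothetical helper, not part of the pipeline script:

```python
import time

def time_per_item(items, work):
    """Run work(item) for each item and return {item: seconds elapsed}."""
    timings = {}
    for item in items:
        start = time.perf_counter()
        work(item)
        timings[item] = time.perf_counter() - start
    return timings

# Placeholder workload; in the notebook this would be the per-pitcher pipeline call.
timings = time_per_item(["Pitcher A", "Pitcher B"], lambda name: sum(range(100_000)))
total = sum(timings.values())
print(f"total: {total:.3f}s, mean per pitcher: {total / len(timings):.3f}s")
```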

Overall, at this point I'm less concerned with pitch location R^2 than with pitch type accuracy.  The rationale is that pitch type has a large effect on location, since different pitches are thrown to different locations (or are meant to be), so the location predicted for a wrong pitch type doesn't carry much meaning anyway.

The pitch location also does not need to be exact for my use case.  Telling a hitter "expect a fastball at coordinate 1.23, 2.54" won't give them good info - in real life, saying "expect a fastball high and inside" is typically what would be needed.
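
A sketch of how predicted coordinates could be turned into that kind of coarse description. The zone bounds and thresholds are rough illustrative values, and the sign convention (px measured from the catcher's point of view, so negative px is toward a right-handed hitter) is an assumption to check against the data source. `describe_location` is a hypothetical helper:

```python
def describe_location(px, pz, stand, half_width=0.83, zone_bottom=1.5, zone_top=3.5):
    """Turn a predicted (px, pz) in feet into a coarse verbal description.

    Assumptions (not from the notebook): px is measured from the catcher's
    point of view, so negative px is toward a right-handed hitter; the zone
    bounds are rough illustrative values, not a specific hitter's zone.
    """
    third = (zone_top - zone_bottom) / 3
    if pz > zone_top - third:
        height = "high"
    elif pz < zone_bottom + third:
        height = "low"
    else:
        height = "middle"

    # Flip the sign for left-handed hitters so negative always means "inside".
    side_value = px if stand == "R" else -px
    if side_value < -half_width / 3:
        side = "inside"
    elif side_value > half_width / 3:
        side = "away"
    else:
        side = "middle"
    return f"{height} and {side}"

print(describe_location(-0.5, 3.2, "R"))   # high and inside
print(describe_location(-0.5, 3.2, "L"))   # high and away
print(describe_location(0.0, 2.5, "R"))    # middle and middle
```

Binning like this also tolerates the fairly large MAE of the location regressions, since a prediction only needs to land in the right third of the zone.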

In [63]:
output_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred,SI_prob
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.59,0.21,0.03,0.01,0.16,0.0,0.04423,2.523223,
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.47,0.21,0.04,0.06,0.22,0.0,0.00773,2.420953,
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,1,0.23,0.55,0.14,0.03,0.05,0.0,0.113485,2.260595,
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.57,0.1,0.1,0.23,0.0,0.0,-0.234551,2.645617,
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.37,0.18,0.2,0.23,0.02,0.0,-0.328774,2.583225,
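
The empty `SI_prob` values above are what one would expect if per-pitcher results with different probability columns are concatenated: Scherzer has no SI column of his own, so his rows get NaN there. A minimal sketch of that behavior with toy frames, plus one optional way to handle it (whether zero is the right fill value is a judgment call):

```python
import pandas as pd

# Two toy per-pitcher results with different probability columns.
pitcher_a = pd.DataFrame({"pitch_pred": [0, 1], "FF_prob": [0.6, 0.3], "SL_prob": [0.4, 0.7]})
pitcher_b = pd.DataFrame({"pitch_pred": [0], "FF_prob": [0.5], "SI_prob": [0.5]})

combined = pd.concat([pitcher_a, pitcher_b], ignore_index=True, sort=False)
n_missing_before = int(combined.isna().sum().sum())
print(combined)

# Optional: treat "pitch not in this pitcher's repertoire" as probability zero.
prob_cols = [c for c in combined.columns if c.endswith("_prob")]
combined[prob_cols] = combined[prob_cols].fillna(0.0)
```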


### XGBoost Classifier:

To compare, I will now run the workflow with XGBoost at the classification step instead of Random Forest:

In [64]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157




Accuracy: 0.42389987888574887
Precision: (0.31570832992243386,)
Recall: 0.29346191244922587
XGBoost Pitch Classification confusion matrix results:
[[503 138  73  56  27   0]
 [178 337  58  43  30   0]
 [170  97  86  34  22   0]
 [117 102  43  69  13   0]
 [ 89  75  35  25  55   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.014246877138460401
Val Px MAE: 0.6481563481851975 ft.
Val Pz R^2: 0.06401292208670706
Val Pz MAE: 0.7291888588890443 ft.


  _warn_prf(average, modifier, msg_start, len(result))


In [65]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5247561890472618
Precision: (0.3510732599785648,)
Recall: 0.2291925231955305
Random Forest Pitch Classification confusion matrix results:
[[1209  124   34   13   15    2]
 [ 361  146    5    1    0    0]
 [ 310   24   29    6    8    0]
 [ 199    5   13    3    2    0]
 [ 115    2   10    1   11    0]
 [  11    6    0    0    0    1]]
Val Px R^2: 0.06199571348188049
Val Px MAE: 0.6802437034361896 ft.
Val Pz R^2: 0.14889063640305722
Val Pz MAE: 0.6234543746266302 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24533083059596433
Accuracy: 0.4659525631216526
Precision: (0.43778029556770937,)
Recall: 0.44884472053144947
Random Forest Pitch Classification confusion matrix results:
[[610  95 119  37]
 [235 211  86 209]
 [243  71 146  80]
 [ 14 172  35 251]]
Val Px R^2: 0.021604553478507205
Val Px MAE: 0.684842166026736 ft.
Val Pz R^2: 0.014585255419264587
Val Pz MAE: 0.6708066600212033 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2518304431599229


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.5921958684009181
Precision: (0.31362392658071886,)
Recall: 0.1965963451525606
Random Forest Pitch Classification confusion matrix results:
[[1428   67   46    1    0    0]
 [ 411   87   16    0    0    0]
 [ 350   36   32    0    0    0]
 [ 117    9    3    1    0    0]
 [   8    1    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.08643801056555622
Val Px MAE: 0.6162772939515128 ft.
Val Pz R^2: 0.17326143088463752
Val Pz MAE: 0.6830010109761702 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'UN': 6, 'FA': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2551168881559802
Accuracy: 0.4954372623574145
Precision: (0.37510204357095617,)
Recall: 0.32457090136129507
Random Forest Pitch Classification confusion matrix results:
[[969 195  63   4  29]
 [410 190  50   8  34]
 [120  39 111   1   0]
 [139  45  22   4   5]
 [122  38   0   3  29]]
Val Px R^2: -0.03999812480989795
Val Px MAE: 0.6673943655330864 ft.
Val Pz R^2: 0.028530814255639303
Val Pz MAE: 0.7308324752564238 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5752314814814815
Precision: (0.39406014366036907,)
Recall: 0.3333371348600675
Random Forest Pitch Classification confusion matrix results:
[[833 371  15   8   2]
 [437 622  12  13   0]
 [141  61  22   3   0]
 [ 11  19   0  14   0]
 [  5   2   0   1   0]]
Val Px R^2: 0.1755720879921988
Val Px MAE: 0.566879170680485 ft.
Val Pz R^2: 0.10869913477580195
Val Pz MAE: 0.7859729721550708 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.4125958821154623
Precision: (0.3010463387446746,)
Recall: 0.2690155312080396
Random Forest Pitch Classification confusion matrix results:
[[553 147  54  22  21   0]
 [224 317  48  37  20   0]
 [183 112  78  21  15   0]
 [148 111  32  39  14   0]
 [108  91  28  17  35   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.017246582990617765
Val Px MAE: 0.648322844894267 ft.
Val Pz R^2: 0.06260227894896209
Val Pz MAE: 0.7290012873054939 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.45813586097946285
Precision: (0.30675644989930706,)
Recall: 0.24278971825562906
Random Forest Pitch Classification confusion matrix results:
[[966 144  50  16   3]
 [402 154  28  13   1]
 [254  54  28   7   4]
 [177  34   4   9   1]
 [126  43   7   4   3]]
Val Px R^2: 0.07187965251462347
Val Px MAE: 0.7158502706574434 ft.
Val Pz R^2: 0.1614825119565667
Val Pz MAE: 0.6041117240629665 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.3845534995977474
Precision: (0.35330197937810515,)
Recall: 0.3167079942809002
Random Forest Pitch Classification confusion matrix results:
[[457 119  49  84  67   9]
 [202 184  34  49  60   4]
 [147  67  74  70  33   3]
 [141  26  41 105   0   2]
 [125  54  11   0 129   3]
 [ 59  20   8  19  24   7]]
Val Px R^2: 0.11702535012149162
Val Px MAE: 0.645238253947373 ft.
Val Pz R^2: 0.03729399976364933
Val Pz MAE: 0.6936360615053123 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.24769076305220883
Accuracy: 0.4187271990271585
Precision: (0.39544692533689174,)
Recall: 0.37755971387455983
Random Forest Pitch Classification confusion matrix results:
[[512 117 105  59]
 [259 326  52  64]
 [239 113 109  46]
 [200 133  47  86]]
Val Px R^2: 0.07488129266371324
Val Px MAE: 0.6788851474017268 ft.
Val Pz R^2: 0.21726928447684013
Val Pz MAE: 0.6731209095812561 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'PO': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24703943981052415
Accuracy: 0.42934556065027096
Precision: (0.34160390566220356,)
Recall: 0.2785744668366351
Random Forest Pitch Classification confusion matrix results:
[[763 133  77  20  19]
 [373 115  51   5   1]
 [225  51 122  16  15]
 [136  15  43  20   1]
 [147  17  14  10  10]]
Val Px R^2: 0.12427293399426254
Val Px MAE: 0.6883925244486934 ft.
Val Pz R^2: 0.10315021413824865
Val Pz MAE: 0.6782081389681447 ft.






Running the workflow for all 10 pitchers takes about 70 seconds.

Overall, the metrics are relatively similar between Random Forest and XGBoost.  Either way, both the pitch type classification and the location prediction will need to be improved.

In [66]:
output_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred,SI_prob
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.67,0.15,0.04,0.0,0.14,0.0,0.04423,2.523223,
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.6,0.21,0.03,0.0,0.16,0.0,0.00773,2.420953,
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,1,0.25,0.56,0.17,0.01,0.01,0.0,0.113485,2.260595,
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.45,0.05,0.14,0.36,0.0,0.0,-0.234551,2.645617,
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.45,0.07,0.16,0.3,0.02,0.0,-0.328774,2.583225,


At this point, I think the best next step is to engineer new features, to improve both facets of the model (pitch type classification and location regression).

For location, based on visualizations in other notebooks, the issue is that the predictions are concentrated in one area and show much less variability than the real pitch locations (i.e. underfitting).  I would expect some underfitting in general, since even the best pitchers won't throw the ball with pinpoint accuracy every time, but there should be room to improve.  Some possible features to engineer:
- Rolling average of pitch location data, pitch-type specific (i.e. avg. x and z coordinates, px and pz, over the last three times they threw that pitch)
- More recent pitch proportion rates: right now, I have running averages over the whole dataset.  These could also be computed over the current game, against the current batter (which would produce null values), or over the last 80-100 pitches (roughly one game for a starting pitcher)
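
The two feature ideas above could be sketched with pandas roughly as follows. This is only an illustration under stated assumptions: the column names `pitch_type`, `px`, and `pz` are assumed (only `pitcher_full_name` appears in the code shown in this notebook), the helper names are hypothetical, and rows are assumed to already be in chronological order. Each feature uses `shift(1)` so that a pitch only sees pitches thrown before it.

```python
import pandas as pd


def add_rolling_location_features(df, window=3):
    """Average px/pz over the previous `window` pitches of the same type by the same pitcher."""
    out = df.copy()
    keys = ['pitcher_full_name', 'pitch_type']  # 'pitch_type' is an assumed column name
    for col in ['px', 'pz']:                    # assumed location column names
        out[f'{col}_prev{window}_avg'] = (
            df.groupby(keys, sort=False)[col]
              .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
        )
    return out


def add_recent_pitch_mix(df, window=100):
    """Share of each pitch type among the pitcher's previous `window` pitches."""
    out = df.copy()
    dummies = pd.get_dummies(df['pitch_type'], dtype=float)
    for pitch in dummies.columns:
        out[f'{pitch}_recent_rate'] = (
            dummies[pitch]
            .groupby(df['pitcher_full_name'], sort=False)
            .transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
        )
    return out


# Toy data purely for illustration; not taken from pitch_df.
toy = pd.DataFrame({
    'pitcher_full_name': ['A'] * 5,
    'pitch_type': ['FF', 'SL', 'FF', 'FF', 'SL'],
    'px': [0.0, 1.0, 2.0, 4.0, 3.0],
    'pz': [2.0, 2.5, 3.0, 1.0, 1.5],
})

features = add_recent_pitch_mix(add_rolling_location_features(toy, window=3), window=100)
print(features)
```

The first time a pitcher throws a given pitch type, the rolling location features are null, so those rows would need to be imputed or dropped before fitting the linear regressions. The window of 100 is a placeholder for the "80-100 pitches" range mentioned above.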