# Pipeline Architecture:
The purpose of this notebook is to build a modeling pipeline that chains classification into regression.  The general form is:

1. Random Forest Classification of Pitch Type
2. Linear Regression of X-Coordinate of Pitch Location (Px) - pitch type used as feature
3. Linear Regression of Z-Coordinate of Pitch Location (Pz) - pitch type and px used as features
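To make the chaining concrete, here is a minimal sketch on synthetic data (not the pitch dataset). Every variable name and the data-generating process below are made up for illustration. The point is only the flow: each downstream model trains on true upstream values, but at validation time it receives the upstream predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in data: 3 context features, a binary pitch type, and px/pz targets
n = 400
X = rng.normal(size=(n, 3))
pitch_type = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)
px = 0.5 * pitch_type + 0.1 * X[:, 1] + rng.normal(scale=0.3, size=n)
pz = 2.5 - 0.4 * pitch_type + 0.2 * px + rng.normal(scale=0.3, size=n)

train, val = slice(0, 300), slice(300, n)

# Step 1: classify pitch type from the context features
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[train], pitch_type[train])
type_pred = clf.predict(X[val])

# Step 2: px regression; true pitch type in training, predicted pitch type in validation
px_model = LinearRegression()
px_model.fit(np.column_stack([X[train], pitch_type[train]]), px[train])
px_pred = px_model.predict(np.column_stack([X[val], type_pred]))

# Step 3: pz regression; true type and px in training, predicted type and px in validation
pz_model = LinearRegression()
pz_model.fit(np.column_stack([X[train], pitch_type[train], px[train]]), pz[train])
pz_pred = pz_model.predict(np.column_stack([X[val], type_pred, px_pred]))
```

Because errors made in step 1 flow into steps 2 and 3, validation scores for the regressions reflect the quality of the whole chain rather than of each regression in isolation.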

Importing packages:

In [1]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [2]:
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

Loading the pickled initial data to work with:

In [3]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project/Pitch_Classification'

In [4]:
with open('.../Data/train_df_clusters.pickle','rb') as read_file:
    pitch_df = pickle.load(read_file)

In [5]:
pitch_df.shape

(2848371, 49)

In order to determine who to include in the pipeline, I'll see who the top 50 pitchers are, in terms of pitches thrown:

In [6]:
pitch_df.pitcher_full_name.value_counts().head(50)

Max Scherzer         13479
Justin Verlander     12810
Chris Archer         12760
Jose Quintana        12692
Chris Sale           12689
Rick Porcello        12591
Jon Lester           12566
Corey Kluber         12480
Gio Gonzalez         12418
Zack Greinke         12092
Julio Teheran        12076
Jake Arrieta         12028
Cole Hamels          11946
Trevor Bauer         11860
James Shields        11768
Gerrit Cole          11716
Jacob deGrom         11691
Dallas Keuchel       11659
Jake Odorizzi        11649
Kyle Gibson          11514
Marco Estrada        11490
J.A. Happ            11401
Kevin Gausman        11337
Tanner Roark         11296
Mike Fiers           11093
Ian Kennedy          10963
Kyle Hendricks       10952
Mike Leake           10952
David Price          10932
Carlos Martinez      10920
Carlos Carrasco      10913
Andrew Cashner       10732
Jeff Samardzija      10573
Madison Bumgarner    10551
Jason Hammel         10446
Masahiro Tanaka      10442
CC Sabathia          10437
...

In [7]:
pitch_df.pitcher_full_name.value_counts().head(50).sum()

558964

So the top 50 pitchers would cover 558,964 total pitches from 2015-2018.

In [8]:
pitch_df.tail()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,stand,p_throws,event,home_team,...,cumulative_cu_rate,cumulative_si_rate,cumulative_fc_rate,cumulative_kc_rate,cumulative_fs_rate,cumulative_kn_rate,cumulative_ep_rate,cumulative_fo_rate,cumulative_sc_rate,Cluster
2848366,9.0,595879,623352,0.0,2018186000.0,3.0,R,L,Single,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2848367,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
2848368,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
2848369,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0
2848370,9.0,519203,623352,0.0,2018186000.0,3.0,L,L,Flyout,chn,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0


In [9]:
pitch_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'cumulative_pitches', 'cumulative_ff_rate',
       'cumulative_sl_rate', 'cumulative_ft_rate', 'cumulative_ch_rate',
       'cumulative_cu_rate', 'cumulative_si_rate', 'cumulative_fc_rate',
       'cumulative_kc_rate', 'cumulative_fs_rate', 'cumulative_kn_rate',
       'cumulative_ep_rate', 'cumulative_fo_rate', 'cumulative_sc_rate',
       'Cluster'],
      dtype='object')

Checking out pitch types:

In [10]:
pitch_df.pitch_type.value_counts()

FF    1015709
SL     451009
FT     338134
CH     293009
SI     242653
CU     234554
FC     149870
KC      66522
FS      43911
KN      11260
EP        816
FO        810
SC        114
Name: pitch_type, dtype: int64

Removing the eephus, since it is more of a trick pitch (i.e. a slow lob) than a pitch anyone throws with regularity. The other low-frequency pitches (forkball and screwball) are rare overall, but the 1-2 pitchers who use them throw them more regularly.

In [11]:
pitch_df = pitch_df[pitch_df.pitch_type != 'EP']

Importing the regression functions:

In [12]:
from location_regression_functions import *
from pitch_cat_functions import *

Importing the functions from the pipeline script:

In [13]:
from classification_location_combo import *

### Dataframe Setup:

Before building a workflow to model multiple pitchers in succession, I will build out the pipeline here step by step.

The first step in setting up the dataframe structure needed for modeling is to filter for the desired player's name, keeping only pitches that have a previous-pitch location:

In [14]:
scherzer_df = pitch_df[(pitch_df.pitcher_full_name == 'Max Scherzer') & (pitch_df.last_pitch_px.notnull())]

One hot encoding of needed columns:

In [15]:
ohe_cols = ['stand', 'p_throws']

In [16]:
ohe_df = column_ohe_maker(scherzer_df, ohe_cols)
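`column_ohe_maker` lives in the imported helper scripts and its body is not shown here. As a rough sketch of what a helper like this might do, assuming it behaves like `pd.get_dummies` with the first level dropped (the toy dataframe below is hypothetical):

```python
import pandas as pd

# Toy data for a single right-handed pitcher facing both left- and right-handed batters
demo = pd.DataFrame({
    "stand": ["R", "L", "R", "L"],
    "p_throws": ["R", "R", "R", "R"],
    "b_count": [0, 1, 2, 3],
})

# drop_first=True removes one level per column to avoid redundant dummy columns
encoded = pd.get_dummies(demo, columns=["stand", "p_throws"], drop_first=True, dtype=float)
print(encoded.columns.tolist())
```

Under that assumption, `stand` collapses to a single `stand_R` column, and `p_throws` disappears entirely because a single pitcher only has one throwing hand.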

Numerically coding last pitch type and pitch type:

In [17]:
temp_df = last_pitch_type_to_num(ohe_df, 'last_pitch_type')

Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [18]:
output_df = pitch_type_to_num(temp_df, 'pitch_type')

Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
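The codings printed above appear to be ordered by how often each pitch is thrown (0 for the most common). The helper functions themselves are not shown, so the following is only a sketch of one way to produce such a mapping; `frequency_codes` is a hypothetical name:

```python
import pandas as pd

def frequency_codes(labels: pd.Series) -> dict:
    """Map each label to an integer, assigning 0 to the most common label."""
    ordered = labels.value_counts().index  # sorted from most to least frequent
    return {label: code for code, label in enumerate(ordered)}

pitches = pd.Series(["FF", "SL", "FF", "CH", "FF", "SL"])
coding = frequency_codes(pitches)
codes = pitches.map(coding)
```

One consequence of a per-column, per-pitcher mapping is that the same integer can stand for different pitches in different columns or for different pitchers, so the printed dictionary is needed to interpret any coded value.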


Declaring the columns for each step of the modeling process:

In [19]:
rf_cols = ['Cluster','inning', 'top', 'on_1b', 'on_2b', 'on_3b', 'b_count', 's_count', 'outs', 'stand_R',
       'pitcher_run_diff','last_pitch_speed', 'last_pitch_px', 'last_pitch_pz','pitch_num','cumulative_pitches',
       'cumulative_ff_rate', 'cumulative_sl_rate', 'cumulative_ft_rate',
       'cumulative_ch_rate', 'cumulative_cu_rate', 'cumulative_si_rate',
       'cumulative_fc_rate', 'cumulative_kc_rate', 'cumulative_fs_rate',
       'cumulative_kn_rate', 'cumulative_ep_rate', 'cumulative_fo_rate',
       'cumulative_sc_rate', 'Last_Pitch_Type_Num']

In [20]:
px_cols = rf_cols.copy()

In [21]:
px_cols.append('Pitch_Type_Num')

In [22]:
px_cols_val = px_cols.copy()
px_cols_val.append('pitch_pred')
px_cols_val.remove('Pitch_Type_Num')

In [23]:
pz_cols = px_cols.copy()
pz_cols.append('px')

In [24]:
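pz_cols_val = pz_cols.copy()
pz_cols_val.append('px_pred')
pz_cols_val.remove('px')
# Note: pz_cols was copied from px_cols, so this list keeps the true 'Pitch_Type_Num' rather than 'pitch_pred'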

Instead of a traditional row-level train/test split, I need to keep pitches from the same at bat together.  To help with generalization and to simulate unseen data, I'll split by at bat (using split_pitch_data in classification_location_combo.py).

In [25]:
training_pitches, val_pitches = split_pitch_data(output_df, test_size=0.20)

Actual Test Size: 0.2460442305912834
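The body of `split_pitch_data` is not shown in this notebook. As a sketch of the general idea of splitting by at bat, scikit-learn's `GroupShuffleSplit` can hold out whole groups; the toy dataframe and the use of `ab_id` as the grouping key are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy data: 6 at bats with different numbers of pitches each
demo = pd.DataFrame({"ab_id": np.repeat([1, 2, 3, 4, 5, 6], [3, 5, 2, 6, 4, 1])})
demo["pitch_num"] = demo.groupby("ab_id").cumcount() + 1

# test_size applies to the number of groups (at bats), not the number of rows
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, val_idx = next(splitter.split(demo, groups=demo["ab_id"]))
train_df = demo.iloc[train_idx].copy()
val_df = demo.iloc[val_idx].copy()

actual_test_size = len(val_df) / len(demo)
print(f"Actual Test Size: {actual_test_size}")
```

With a group-level split, the realized fraction of rows generally differs from the requested fraction, since at bats vary in length. Whether that fully accounts for the value printed above depends on how `split_pitch_data` computes it.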


With the training and validation dataframes separated, I can model each step using the column lists declared above.

### Classification:

Setting up X and y for modeling, using the training/validation split from above:

In [26]:
X_rf_train = training_pitches[rf_cols]

In [27]:
y_rf_train = training_pitches['Pitch_Type_Num']

In [28]:
X_rf_val = val_pitches[rf_cols]

In [29]:
y_rf_val = val_pitches['Pitch_Type_Num']

Fitting the model on the training set, then predicting on the validation set:

In [30]:
rf_model = RandomForestClassifier()
rf_model.fit(X_rf_train,y_rf_train)

RandomForestClassifier()

In [31]:
y_rf_pred = rf_model.predict(X_rf_val)

In [32]:
unique, counts = np.unique(y_rf_pred, return_counts=True)

In [33]:
np.asarray((unique, counts)).T

array([[   0, 2193],
       [   1,  309],
       [   2,   88],
       [   3,   28],
       [   4,   40],
       [   5,    1]])

Running some metrics:

In [34]:
cm = confusion_matrix(y_rf_val, y_rf_pred)
cm

array([[1200,  127,   29,   16,   18,    0],
       [ 361,  142,    7,    3,    0,    0],
       [ 314,   22,   31,    5,    5,    0],
       [ 199,    8,   11,    2,    2,    0],
       [ 109,    3,   10,    2,   15,    0],
       [  10,    7,    0,    0,    0,    1]])
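To turn the confusion matrix into summary numbers, the sketch below recomputes accuracy and per-class precision and recall directly from the array above (rows are true classes, columns are predicted classes, following scikit-learn's convention). The variable names are illustrative only.

```python
import numpy as np

# Confusion matrix copied from the cell output above
cm = np.array([
    [1200, 127, 29, 16, 18, 0],
    [361, 142, 7, 3, 0, 0],
    [314, 22, 31, 5, 5, 0],
    [199, 8, 11, 2, 2, 0],
    [109, 3, 10, 2, 15, 0],
    [10, 7, 0, 0, 0, 1],
])

correct = np.diag(cm)
accuracy = correct.sum() / cm.sum()

# Recall: share of each true class that was predicted correctly
recall_per_class = correct / cm.sum(axis=1)

# Precision: share of each predicted class that was correct (0 if never predicted)
predicted_totals = cm.sum(axis=0)
precision_per_class = np.divide(
    correct, predicted_totals, out=np.zeros(len(cm)), where=predicted_totals > 0
)

print(round(accuracy, 4))
print(recall_per_class.round(3))
print(precision_per_class.round(3))
```

For this matrix the accuracy is about 0.52, driven mostly by class 0 (the four-seam fastball), which is predicted far more often than any other class.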

### Px:

Adding the predicted pitch types to the validation dataframe, for use as a feature in Px regression:

In [35]:
val_pitches['pitch_pred'] = y_rf_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pitch_pred'] = y_rf_pred
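The warning above is raised because pandas cannot tell whether `val_pitches` is an independent dataframe or a slice of another one. A common way to avoid it, sketched here on a toy dataframe (names are illustrative), is to take an explicit copy of the subset before adding columns, or to use `assign`:

```python
import pandas as pd

df = pd.DataFrame({"ab_id": [1, 1, 2, 2], "px": [0.1, -0.3, 0.5, 0.0]})

# Option 1: an explicit copy makes the subset independent of the parent dataframe
val = df[df["ab_id"] == 2].copy()
val["pitch_pred"] = [0, 1]

# Option 2: assign returns a new dataframe and leaves the original untouched
val2 = df[df["ab_id"] == 2].assign(pitch_pred=[0, 1])
```

In both cases the parent dataframe is left unchanged, and the new column lands only on the validation subset.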


Setting up X and y for the training and validation sets:

In [36]:
X_px_tr = training_pitches[px_cols]
y_px_tr = training_pitches['px']

In [37]:
X_px_val = val_pitches[px_cols_val]
y_px_val = val_pitches['px']

Instantiating linear regression:

In [38]:
lm = LinearRegression()

In [39]:
lm.fit(X_px_tr, y_px_tr)

LinearRegression()

In [40]:
px_pred = lm.predict(X_px_val)

Getting some metrics:

In [41]:
px_r2 = lm.score(X_px_val, y_px_val)
px_r2

0.06449344857078854

In [42]:
px_mae = mae(y_px_val, px_pred)

In [43]:
px_mae

0.6803683899976727
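For reference, both metrics can be written out by hand. The `mae` helper used above comes from the imported scripts; the sketch below assumes it is the usual mean absolute error, and the function names here are hypothetical:

```python
import numpy as np

def mean_abs_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual variation to variation around the mean."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

y_true = [0.0, 1.0, 2.0, 3.0]
y_pred = [0.5, 1.0, 1.5, 3.0]
example_mae = mean_abs_error(y_true, y_pred)
example_r2 = r_squared(y_true, y_pred)

# Always predicting the mean of y_true gives an R^2 of exactly 0
baseline_r2 = r_squared(y_true, [1.5, 1.5, 1.5, 1.5])
```

An R^2 near 0 therefore means the regression does little better than always predicting the average location, and a negative R^2 means it does worse. MAE is in the same units as the target (feet here).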

Adding a px_pred column to the dataframe to utilize in the pz regression:

In [44]:
val_pitches['px_pred'] = px_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['px_pred'] = px_pred


### Pz:

Setting up X and y for the training and validation sets:

In [45]:
X_pz_tr = training_pitches[pz_cols]
y_pz_tr = training_pitches['pz']

In [46]:
X_pz_val = val_pitches[pz_cols_val]
y_pz_val = val_pitches['pz']

Instantiating linear regression:

In [47]:
lm = LinearRegression()

In [48]:
lm.fit(X_pz_tr, y_pz_tr)

LinearRegression()

In [49]:
pz_pred = lm.predict(X_pz_val)

Getting some metrics:

In [50]:
pz_r2 = lm.score(X_pz_val, y_pz_val)
pz_r2

0.14804216434246775

In [51]:
pz_mae = mae(y_pz_val, pz_pred)
pz_mae

0.6252280801703138

Adding a pz_pred column to the dataframe:

In [52]:
val_pitches['pz_pred'] = pz_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pz_pred'] = pz_pred


### Evaluating the pipeline:

Looking at the final validation df:

In [53]:
output_df = pd.DataFrame(columns=val_pitches.columns)

In [54]:
output_df = pd.concat([output_df, val_pitches])

In [55]:
output_df

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_ep_rate,cumulative_fo_rate,cumulative_sc_rate,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,0,0.038671,2.530527
2454,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,0,0.000305,2.423075
2455,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,1,0.109142,2.255876
2496,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,0.0,1,0,0,-0.236734,2.648451
2497,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,0.0,0,0,0,-0.335756,2.579163
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2824533,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,0.0,1.0,1.0,0,1,1,0.207293,1.981297
2824534,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,0.0,1.0,1.0,1,1,1,0.237352,1.898896
2824535,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,0.0,1.0,1.0,1,0,0,0.101127,2.274266
2824536,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,0.0,1.0,1.0,0,2,0,0.023004,1.927386


In [56]:
val_pitches.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_ep_rate,cumulative_fo_rate,cumulative_sc_rate,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,0,0.038671,2.530527
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,0,0.000305,2.423075
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,1.0,0,0,1,0.109142,2.255876
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,0.0,1,0,0,-0.236734,2.648451
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,0.0,3.0,0.0,0,0,0,-0.335756,2.579163


In [57]:
val_pitches.shape

(2659, 53)

In [58]:
val_pitches.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'event',
       'home_team', 'away_team', 'b_score', 'on_1b', 'on_2b', 'on_3b', 'px',
       'pz', 'zone', 'pitch_type', 'start_speed', 'type', 'b_count', 's_count',
       'outs', 'pitch_num', 'last_pitch_type', 'last_pitch_px',
       'last_pitch_pz', 'last_pitch_speed', 'pitcher_full_name',
       'pitcher_run_diff', 'hitter_full_name', 'Date_Time_Date', 'Season',
       'cumulative_pitches', 'cumulative_ff_rate', 'cumulative_sl_rate',
       'cumulative_ft_rate', 'cumulative_ch_rate', 'cumulative_cu_rate',
       'cumulative_si_rate', 'cumulative_fc_rate', 'cumulative_kc_rate',
       'cumulative_fs_rate', 'cumulative_kn_rate', 'cumulative_ep_rate',
       'cumulative_fo_rate', 'cumulative_sc_rate', 'Cluster', 'stand_R',
       'Last_Pitch_Type_Num', 'Pitch_Type_Num', 'pitch_pred', 'px_pred',
       'pz_pred'],
      dtype='object')

# Utilizing Functions

Based on the above, I put together functions to run this same process in classification_location_combo.py.  Running those here:

### Random Forest:

In [59]:
val_df_full = pitch_prediction_modeling_pipeline('Max Scherzer', pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2460442305912834
Accuracy: 0.5133508837908989
Precision: (0.4598125523075958,)
Recall: 0.22482440111277155
Random Forest Pitch Classification confusion matrix results:
[[1190  132   42   10   16    0]
 [ 366  129   15    3    0    0]
 [ 310   27   28    8    4    0]
 [ 202    5    8    4    3    0]
 [ 115    2    7    2   13    0]
 [  11    5    1    0    0    1]]
Val Px R^2: 0.0654484504505789
Val Px MAE: 0.6798226602240268 ft.
Val Pz R^2: 0.14842993405141258
Val Pz MAE: 0.6251297703264211 ft.


In [60]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.56,0.25,0.01,0.04,0.14,0.0,0.038671,2.530527
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.59,0.18,0.02,0.06,0.15,0.0,0.000305,2.423075
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,1,0.34,0.44,0.16,0.02,0.04,0.0,0.109142,2.255876
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0,0.41,0.12,0.16,0.3,0.01,0.0,-0.236734,2.648451
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0,0.39,0.23,0.14,0.19,0.05,0.0,-0.335756,2.579163
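The `*_prob` columns in this output look like per-class probabilities from the classifier. The pipeline function's internals are not shown, so this is only a sketch of how such columns could be built with `predict_proba`; the synthetic data, feature names, and `code_to_label` mapping are assumptions:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
y = rng.integers(0, 3, size=200)
code_to_label = {0: "FF", 1: "SL", 2: "CH"}  # hypothetical numeric coding

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba returns one column per class, ordered as in clf.classes_
proba = clf.predict_proba(X)
prob_cols = [f"{code_to_label[c]}_prob" for c in clf.classes_]
prob_df = pd.DataFrame(proba, columns=prob_cols, index=X.index)

out = pd.concat([X, prob_df], axis=1)
out["pitch_pred"] = clf.predict(X)
```

Keeping the probabilities alongside the hard prediction makes it possible to report how confident the classifier is, rather than only the most likely pitch.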


Trying with another pitcher:

In [61]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2)

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'PO': 6, 'SI': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24505342303126237
Accuracy: 0.4263221639079532
Precision: (0.32810241127390066,)
Recall: 0.28362065622309207
Random Forest Pitch Classification confusion matrix results:
[[534 153  62  29  19   0]
 [201 350  46  33  16   0]
 [188 116  79  14  12   0]
 [144 106  28  54  12   0]
 [108  82  31  19  39   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.015915844491007913
Val Px MAE: 0.6478150365262397 ft.
Val Pz R^2: 0.06332683571436248
Val Pz MAE: 0.7288968081900468 ft.


  _warn_prf(average, modifier, msg_start, len(result))


In [62]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FT_prob,FF_prob,SL_prob,CU_prob,CH_prob,SI_prob,px_pred,pz_pred
6895,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,1,0,0.59,0.16,0.21,0.01,0.03,0.0,-0.350683,2.237202
6896,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,0,0,0.55,0.09,0.16,0.12,0.08,0.0,-0.189153,2.450072
6897,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.74,0.1,0.05,0.06,0.05,0.0,-0.531416,2.565697
6898,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.45,0.05,0.17,0.16,0.17,0.0,-0.456327,2.618574
6899,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.33,0.13,0.21,0.12,0.21,0.0,-0.524665,2.521748


Testing out for the first 10 pitchers to get a sense for computational cost:

In [63]:
pitcher_list = pitch_df.pitcher_full_name.value_counts().head(10).index

In [64]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2460442305912834
Accuracy: 0.5125987213238059
Precision: (0.37924760899072624,)
Recall: 0.2227373714773994
Random Forest Pitch Classification confusion matrix results:
[[1194  133   42    7   13    1]
 [ 379  123   10    1    0    0]
 [ 311   25   30    7    4    0]
 [ 201    8   10    2    1    0]
 [ 110    4   11    1   13    0]
 [  12    5    0    0    0    1]]
Val Px R^2: 0.0644705758438372
Val Px MAE: 0.6800822790766223 ft.
Val Pz R^2: 0.14841358934368132
Val Pz MAE: 0.625079238006986 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2522002738118521


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.6013958898797984
Precision: (0.3357764950022184,)
Recall: 0.20033153051120486
Random Forest Pitch Classification confusion matrix results:
[[1429   58   34    0    0    0]
 [ 408   86   13    0    0    0]
 [ 347   28   35    1    0    0]
 [ 120    7    2    1    0    0]
 [   9    0    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.0896577683512686
Val Px MAE: 0.6132845848156774 ft.
Val Pz R^2: 0.17819509394793864
Val Pz MAE: 0.6805326374883964 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5841049382716049
Precision: (0.3814431929362423,)
Recall: 0.33344488058325583
Random Forest Pitch Classification confusion matrix results:
[[831 372  16   8   2]
 [408 653  11  12   0]
 [145  63  16   3   0]
 [  9  19   2  14   0]
 [  5   2   0   1   0]]
Val Px R^2: 0.17543743546695778
Val Px MAE: 0.5668988230007762 ft.
Val Pz R^2: 0.10862493790322036
Val Pz MAE: 0.7860668280783217 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'UN': 6, 'FA': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2450441609421001
Accuracy: 0.49018822587104527
Precision: (0.3610066795384953,)
Recall: 0.31863810123189606
Random Forest Pitch Classification confusion matrix results:
[[913 190  61   9  24]
 [399 178  50   9  14]
 [105  42 106   4   0]
 [133  41  25   3   6]
 [118  39   0   4  24]]
Val Px R^2: -0.01583596738409776
Val Px MAE: 0.6542913433808422 ft.
Val Pz R^2: 0.032720470164693594
Val Pz MAE: 0.7282229785275337 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24323529411764705
Accuracy: 0.4719871019750101
Precision: (0.4431110357217011,)
Recall: 0.4476541909907382
Random Forest Pitch Classification confusion matrix results:
[[607  98 117  32]
 [229 212  58 179]
 [246  75 129  69]
 [ 24 153  30 223]]
Val Px R^2: 0.023947983682335083
Val Px MAE: 0.7011793516701536 ft.
Val Pz R^2: 0.009443689978341552
Val Pz MAE: 0.6700133520192111 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'PO': 6, 'SI': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24505342303126237


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.4182478805006056
Precision: (0.3169066644714637,)
Recall: 0.2764964227174732
Random Forest Pitch Classification confusion matrix results:
[[541 147  59  31  19   0]
 [218 331  43  35  19   0]
 [188 119  78  14  10   0]
 [140 116  27  46  15   0]
 [105  83  33  18  40   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.014955047438161673
Val Px MAE: 0.6482810271019848 ft.
Val Pz R^2: 0.06401226112752101
Val Pz MAE: 0.7286748863571434 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.45813586097946285
Precision: (0.294716790893499,)
Recall: 0.2385036891959318
Random Forest Pitch Classification confusion matrix results:
[[983 138  42  12   4]
 [419 141  27   9   2]
 [258  58  25   4   2]
 [176  33   5   9   2]
 [127  40   9   5   2]]
Val Px R^2: 0.0729026728104415
Val Px MAE: 0.7154414263280175 ft.
Val Pz R^2: 0.16116943901370806
Val Pz MAE: 0.6041251856868358 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.3901850362027353
Precision: (0.3600371757884588,)
Recall: 0.3241120234966125
Random Forest Pitch Classification confusion matrix results:
[[446 125  59  82  64   9]
 [199 194  40  48  50   2]
 [160  63  80  62  25   4]
 [129  31  38 115   0   2]
 [132  48  10   0 129   3]
 [ 56  21  11  16  27   6]]
Val Px R^2: 0.11739863350533586
Val Px MAE: 0.6460771757815361 ft.
Val Pz R^2: 0.03714850546433157
Val Pz MAE: 0.6935211738426176 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.248214106046886
Accuracy: 0.41183623834616945
Precision: (0.3851922914842664,)
Recall: 0.36859638080953433
Random Forest Pitch Classification confusion matrix results:
[[519 120 102  52]
 [264 317  62  58]
 [245 115 107  40]
 [203 141  49  73]]
Val Px R^2: 0.07932808863716345
Val Px MAE: 0.6770006072125093 ft.
Val Pz R^2: 0.2186969349298037
Val Pz MAE: 0.672870047026909 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'UN': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24829227903125647
Accuracy: 0.4260108378491038
Precision: (0.35284000031573987,)
Recall: 0.2795108259674451
Random Forest Pitch Classification confusion matrix results:
[[765 131  79  20  17]
 [373 104  62   3   3]
 [235  47 115  15  17]
 [140  18  31  23   3]
 [137  17  19  10  15]]
Val Px R^2: 0.12246102763520972
Val Px MAE: 0.6886627292453926 ft.
Val Pz R^2: 0.10326806230291541
Val Pz MAE: 0.6776427616282779 ft.






Based on my stopwatch, it took ~5 mins to run for 25 pitchers, so it is not terribly expensive to model on a per pitcher basis.
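As an alternative to a stopwatch, the run can be timed in code. This is a generic sketch using the standard library; `timed` is a hypothetical helper, and `sum` stands in for the modeling call:

```python
import time

def timed(func, *args, **kwargs):
    """Run func and return its result together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Stand-in workload; in the notebook this would wrap the multi-pitcher modeling call
total, seconds = timed(sum, range(1_000_000))
print(f"Finished in {seconds:.3f} s")
```

Dividing the elapsed time by the number of pitchers gives a per-pitcher cost that can be extrapolated to a longer list.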

Overall at this time, I'm not as concerned with pitch location R^2 as I am with pitch type.  The rationale is that pitch type strongly influences location: if the predicted pitch type is wrong, the predicted location matters little, since different pitches are thrown (or are meant to be thrown) to different locations.

The pitch location also does not need to be exact for my use case.  Telling a hitter "expect a fastball at coordinate 1.23, 2.54" won't give them good info - in real life, saying "expect a fastball high and inside" is typically what would be needed.
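One way to produce that kind of coarse description is to bucket the predicted coordinates. The sketch below is illustrative only: the thresholds are arbitrary, and it assumes px is measured in feet from the center of the plate from the catcher's view, with negative values toward a right-handed batter.

```python
def describe_location(px, pz, stand, edge=0.4, low=2.0, high=3.0):
    """Turn plate coordinates (feet) into a coarse phrase such as 'high and inside'."""
    if pz > high:
        vertical = "high"
    elif pz < low:
        vertical = "low"
    else:
        vertical = "middle"

    # Assumption: negative px is the right-handed batter's side of the plate
    toward_rhb = px < -edge
    toward_lhb = px > edge
    if not (toward_rhb or toward_lhb):
        horizontal = "middle"
    elif (stand == "R" and toward_rhb) or (stand == "L" and toward_lhb):
        horizontal = "inside"
    else:
        horizontal = "away"

    parts = [p for p in (vertical, horizontal) if p != "middle"]
    return " and ".join(parts) or "middle"

print(describe_location(-0.8, 3.4, "R"))
print(describe_location(-0.8, 3.4, "L"))
print(describe_location(0.038671, 2.530527, "R"))
```

With buckets like these, a location prediction only needs to land in the right region to be useful, which is a much looser requirement than a high R^2.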

In [65]:
output_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred,SI_prob
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.69,0.15,0.0,0.02,0.14,0.0,0.038671,2.530527,
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.57,0.18,0.0,0.06,0.19,0.0,0.000305,2.423075,
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,1,0.27,0.52,0.16,0.01,0.04,0.0,0.109142,2.255876,
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.43,0.08,0.12,0.35,0.02,0.0,-0.236734,2.648451,
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.42,0.12,0.13,0.29,0.04,0.0,-0.335756,2.579163,
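The empty `SI_prob` values for these rows are what `pd.concat` produces when the per-pitcher dataframes have different probability columns: a pitcher who never throws a pitch has no column for it, so the combined frame holds NaN there. A small sketch with made-up values, where filling with 0 is one possible design choice rather than what the pipeline necessarily does:

```python
import pandas as pd

# Two pitchers with different repertoires, so different *_prob columns
pitcher_a = pd.DataFrame({"pitch_pred": [0, 1], "FF_prob": [0.7, 0.3], "FC_prob": [0.3, 0.7]})
pitcher_b = pd.DataFrame({"pitch_pred": [0], "SI_prob": [0.6], "FF_prob": [0.4]})

# concat takes the union of columns and leaves NaN where a frame lacks a column
combined = pd.concat([pitcher_a, pitcher_b], ignore_index=True)

# Treat a missing probability as 0: the pitcher does not throw that pitch
prob_cols = [c for c in combined.columns if c.endswith("_prob")]
combined[prob_cols] = combined[prob_cols].fillna(0.0)
```

Filling with 0 keeps each row's probabilities summing to 1 and makes the combined frame easier to aggregate across pitchers.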


### XGBoost Classifier:

For comparison, I will now run the workflow with XGBoost at the classification step instead of Random Forest:

In [66]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'PO': 6, 'SI': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24505342303126237




Accuracy: 0.4360113039967703
Precision: (0.32028349706706166,)
Recall: 0.3009872614828474
XGBoost Pitch Classification confusion matrix results:
[[521 122  67  54  33   0]
 [180 344  51  42  29   0]
 [165  95  95  31  23   0]
 [113  98  57  63  13   0]
 [ 84  68  37  33  57   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.017058267278571848
Val Px MAE: 0.6473860190128807 ft.
Val Pz R^2: 0.0658685778375333
Val Pz MAE: 0.7280400887539987 ft.


  _warn_prf(average, modifier, msg_start, len(result))


In [67]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2460442305912834
Accuracy: 0.5182399398270027
Precision: (0.37473402791012855,)
Recall: 0.22615082523620308
Random Forest Pitch Classification confusion matrix results:
[[1189  136   39   10   15    1]
 [ 362  141    7    3    0    0]
 [ 301   32   35    5    4    0]
 [ 196    7   14    3    2    0]
 [ 116    2   11    1    9    0]
 [  11    6    0    0    0    1]]
Val Px R^2: 0.06226044569020561
Val Px MAE: 0.6810332885874905 ft.
Val Pz R^2: 0.14865238769648392
Val Pz MAE: 0.6249756179201421 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2522002738118521


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.5979061651803025
Precision: (0.3216475178197666,)
Recall: 0.19867667736842357
Random Forest Pitch Classification confusion matrix results:
[[1422   58   40    1    0    0]
 [ 399   87   21    0    0    0]
 [ 346   33   32    0    0    0]
 [ 122    4    3    1    0    0]
 [   9    0    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.08974092782992571
Val Px MAE: 0.613288115875817 ft.
Val Pz R^2: 0.17819110519912873
Val Pz MAE: 0.6805312747976837 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5798611111111112
Precision: (0.40305924346834415,)
Recall: 0.3365381721571751
Random Forest Pitch Classification confusion matrix results:
[[828 380  15   5   1]
 [422 643   9  10   0]
 [150  57  17   3   0]
 [  7  21   1  15   0]
 [  5   2   0   1   0]]
Val Px R^2: 0.17567480641882793
Val Px MAE: 0.5668410709687982 ft.
Val Pz R^2: 0.1087659781911886
Val Pz MAE: 0.7860541844708447 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'UN': 6, 'FA': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2450441609421001
Accuracy: 0.48217861433720466
Precision: (0.3626605057867406,)
Recall: 0.3136464551230064
Random Forest Pitch Classification confusion matrix results:
[[907 196  64   4  26]
 [405 165  46  10  24]
 [117  36 102   2   0]
 [133  42  24   4   5]
 [119  37   0   3  26]]
Val Px R^2: -0.021576674437576493
Val Px MAE: 0.6558854772078329 ft.
Val Pz R^2: 0.0317435853351421
Val Pz MAE: 0.7286681158437902 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24323529411764705
Accuracy: 0.4808544941555824
Precision: (0.45154139583453856,)
Recall: 0.45488853949018093
Random Forest Pitch Classification confusion matrix results:
[[621 100 104  29]
 [230 221  63 164]
 [249  80 125  65]
 [ 21 149  34 226]]
Val Px R^2: 0.023075653099997706
Val Px MAE: 0.7016656017281951 ft.
Val Pz R^2: 0.009413236959385674
Val Pz MAE: 0.6700861940347584 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'PO': 6, 'SI': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24505342303126237


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.4210738796931772
Precision: (0.31142079660416727,)
Recall: 0.2760736465319838
Random Forest Pitch Classification confusion matrix results:
[[545 152  54  23  23   0]
 [210 343  43  34  16   0]
 [184 111  75  24  15   0]
 [152 106  32  42  12   0]
 [109  84  31  17  38   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.017267196951480823
Val Px MAE: 0.6482323287482625 ft.
Val Pz R^2: 0.06401561883500506
Val Pz MAE: 0.7283612572142649 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.46484992101105843
Precision: (0.28728335724132215,)
Recall: 0.2448329793029051
Random Forest Pitch Classification confusion matrix results:
[[977 139  43  15   5]
 [402 162  21  11   2]
 [256  56  29   4   2]
 [182  30   4   8   1]
 [129  40   9   4   1]]
Val Px R^2: 0.07215325649946047
Val Px MAE: 0.7157403643245349 ft.
Val Pz R^2: 0.1611316586764222
Val Pz MAE: 0.6039364052085006 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.3797264682220434
Precision: (0.35204358673887737,)
Recall: 0.3134372660217487
Random Forest Pitch Classification confusion matrix results:
[[441 125  56  92  63   8]
 [205 191  36  51  49   1]
 [148  75  72  69  25   5]
 [151  22  34 105   0   3]
 [124  58   9   0 129   2]
 [ 61  19   9  19  23   6]]
Val Px R^2: 0.11696425775726726
Val Px MAE: 0.6459847667505951 ft.
Val Pz R^2: 0.03675213256407073
Val Pz MAE: 0.6936022005834324 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.248214106046886
Accuracy: 0.402918524523713
Precision: (0.3835686892048009,)
Recall: 0.3647930944334375
Random Forest Pitch Classification confusion matrix results:
[[488 133 113  59]
 [275 308  61  57]
 [236 113 118  40]
 [192 142  52  80]]
Val Px R^2: 0.07453513380320376
Val Px MAE: 0.6797699092388659 ft.
Val Pz R^2: 0.21717838551568192
Val Pz MAE: 0.6733512328445445 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'UN': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24829227903125647
Accuracy: 0.4406002501042101
Precision: (0.36357376256955004,)
Recall: 0.28697066720257597
Random Forest Pitch Classification confusion matrix results:
[[792 115  73  15  17]
 [382 106  50   5   2]
 [231  46 123  13  16]
 [137  15  34  27   2]
 [148  12  18  11   9]]
Val Px R^2: 0.12100726135466755
Val Px MAE: 0.6898774704986192 ft.
Val Pz R^2: 0.10286126144094143
Val Pz MAE: 0.6780796007840403 ft.






Running all 10 pitchers takes about 70 seconds.

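Overall, the metrics look fairly similar between Random Forest and XGBoost.  Either way, both the pitch type classification and the location predictions need to improve.  One caveat: the batch output above is still labeled "Random Forest" even though class_method = 'XGBoost' was passed, and Rick Porcello's batch accuracy (0.421) differs from his single-pitcher XGBoost run (0.436) at the same test size, so it's worth confirming that multiple_pitcher_predictions actually forwards class_method before reading much into the comparison.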
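
To put the accuracy figures in context, it can help to compare them against a baseline that always predicts the pitcher's most common pitch. Below is a minimal sketch using the Justin Verlander confusion matrix printed above. It assumes the rows are actual pitch types and the columns are predicted pitch types (scikit-learn's convention); `accuracy_vs_baseline` is a hypothetical helper, not part of the pipeline functions in this notebook.

```python
import numpy as np

# Confusion matrix copied from the Justin Verlander output above.
# Assumption: rows = actual pitch type, columns = predicted pitch type.
cm = np.array([
    [1422, 58, 40, 1, 0, 0],
    [399, 87, 21, 0, 0, 0],
    [346, 33, 32, 0, 0, 0],
    [122, 4, 3, 1, 0, 0],
    [9, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
])


def accuracy_vs_baseline(cm):
    """Return (model accuracy, accuracy of always guessing the most common class)."""
    total = cm.sum()
    accuracy = np.trace(cm) / total           # correct predictions sit on the diagonal
    baseline = cm.sum(axis=1).max() / total   # largest row sum = most common actual class
    return accuracy, baseline


accuracy, baseline = accuracy_vs_baseline(cm)
print(f"Model accuracy:    {accuracy:.4f}")
print(f"Majority baseline: {baseline:.4f}")
```

Under that assumption, the model's accuracy of about 0.598 is only slightly above the roughly 0.590 you would get by always guessing the first class (FF), which is consistent with the need for better features.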

In [68]:
output_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred,SI_prob
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.74,0.16,0.0,0.01,0.09,0.0,0.038671,2.530527,
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.52,0.25,0.03,0.01,0.19,0.0,0.000305,2.423075,
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,1,0.34,0.43,0.2,0.0,0.03,0.0,0.109142,2.255876,
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.53,0.09,0.08,0.3,0.0,0.0,-0.236734,2.648451,
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.37,0.07,0.22,0.31,0.03,0.0,-0.335756,2.579163,


At this point, I think the best next step is to engineer new features that could improve both the classification and the location regression.

- More recent pitch proportion rates: right now, I have each pitcher's running averages over the whole dataset.  Shorter windows may work better, such as over the current game, against the current batter (which would have null values), or over the last 80-100 pitches (roughly a game for a starting pitcher).
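
As a rough illustration of the "last 80-100 pitches" idea, here is a minimal sketch of a rolling pitch-mix feature in pandas. The column name `pitch_type`, the helper `add_recent_pitch_rates`, and the default window of 90 are all assumptions for illustration; the sketch also assumes the rows belong to a single pitcher and are already in chronological order.

```python
import pandas as pd


def add_recent_pitch_rates(df, window=90, pitch_col="pitch_type"):
    """Append <type>_recent_rate columns: the share of each pitch type over
    the previous `window` pitches, excluding the current pitch.

    Assumes df holds one pitcher's pitches in chronological order and that
    pitch_col (a hypothetical column name) holds the pitch type labels.
    """
    dummies = pd.get_dummies(df[pitch_col]).astype(float)
    # shift(1) so the current pitch (the target) never leaks into its own feature
    rates = dummies.shift(1).rolling(window, min_periods=1).mean()
    rates.columns = [f"{c}_recent_rate" for c in rates.columns]
    return pd.concat([df, rates], axis=1)


# Toy example with a 3-pitch window
toy = pd.DataFrame({"pitch_type": ["FF", "FF", "SL", "CH", "FF", "SL"]})
out = add_recent_pitch_rates(toy, window=3)
print(out)
```

The first row has no prior pitches, so its rates are null, which matches the null-value concern noted for the batter-level version. For the full dataset, the same function could be applied per pitcher (or per pitcher and game) with a groupby.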

# Next: Feature_Engineering.ipynb, in the Feature_Engineering_Additional folder
The new features will then go into the Pipeline_Part_2.ipynb file in this folder.