# Pipeline Architecture:
The purpose of this notebook is to build a modeling pipeline that chains classification into regression, with each stage feeding its output to the next.  The general form is:

1. Random Forest Classification of Pitch Type
2. Linear Regression of X-Coordinate of Pitch Location (Px) - pitch type used as feature
3. Linear Regression of Z-Coordinate of Pitch Location (Pz) - pitch type and px used as feature
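
The three stages chain together: each later model takes the earlier stage's output as a feature. Below is a minimal sketch of that chaining on synthetic data, assuming scikit-learn is installed. The features, coefficients, and 300/100 split are invented for illustration and are not the notebook's data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
pitch_type = (X[:, 0] > 0).astype(int)
px = 0.5 * pitch_type + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=n)
pz = 2.5 - 0.3 * pitch_type + 0.2 * px + rng.normal(scale=0.1, size=n)

train, val = slice(0, 300), slice(300, n)

# Stage 1: classify pitch type from the base features
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X[train], pitch_type[train])
type_pred = clf.predict(X[val])

# Stage 2: regress px; train on the true pitch type, validate on the predicted one
px_model = LinearRegression()
px_model.fit(np.column_stack([X[train], pitch_type[train]]), px[train])
px_pred = px_model.predict(np.column_stack([X[val], type_pred]))

# Stage 3: regress pz; train on true type and px, validate on the predicted ones
pz_model = LinearRegression()
pz_model.fit(np.column_stack([X[train], pitch_type[train], px[train]]), pz[train])
pz_pred = pz_model.predict(np.column_stack([X[val], type_pred, px_pred]))
```

Because validation only ever sees predicted upstream values, errors made by the classifier carry through to both location estimates.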

Importing packages:

In [1]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [2]:
# pandas, numpy, and seaborn are already imported in the cell above
from scipy import stats
from sklearn.ensemble import RandomForestClassifier  # used in the classification step
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNetCV
from sklearn.metrics import confusion_matrix, r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

Loading the pickled initial data to work with:

In [3]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project/Pitch_Classification'

In [4]:
with open('../Data/train_df_clusters.pickle','rb') as read_file:
    pitch_df = pickle.load(read_file)

To determine who to include in the pipeline, I'll look at the top 50 pitchers in terms of pitches thrown:

In [5]:
pitch_df.pitcher_full_name.value_counts().head(50)

Max Scherzer         13626
Chris Sale           13284
Justin Verlander     12999
Jose Quintana        12944
Chris Archer         12760
Rick Porcello        12745
Jon Lester           12566
Corey Kluber         12480
Gio Gonzalez         12439
Julio Teheran        12125
Zack Greinke         12092
Jake Arrieta         12028
Cole Hamels          12016
Trevor Bauer         11860
Kyle Gibson          11829
Jacob deGrom         11775
Gerrit Cole          11772
James Shields        11768
Marco Estrada        11763
Jake Odorizzi        11719
Dallas Keuchel       11708
J.A. Happ            11597
Kevin Gausman        11540
Tanner Roark         11296
David Price          11205
Mike Leake           11141
Mike Fiers           11093
Ian Kennedy          11026
Kyle Hendricks       10952
Carlos Martinez      10920
Carlos Carrasco      10913
Andrew Cashner       10907
CC Sabathia          10654
Masahiro Tanaka      10589
Jeff Samardzija      10573
Madison Bumgarner    10551
Jason Hammel         10544
...

In [6]:
pitch_df.pitcher_full_name.value_counts().head(50).sum()

564235

So that would cover 564,235 total pitches from 2015-2018.

In [7]:
pitch_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'cumulative_pitches', 'cumulative_ff_rate',
       'cumulative_sl_rate', 'cumulative_ft_rate', 'cumulative_ch_rate',
       'cumulative_cu_rate', 'cumulative_si_rate', 'cumulative_fc_rate',
       'cumulative_kc_rate', 'cumulative_fs_rate', 'cumulative_kn_rate',
       'cumulative_ep_rate', 'cumulative_fo_rate', 'cumulative_sc_rate',
       'Name', 'Cluster'],
      dtype='object')

Checking out pitch types:

In [8]:
pitch_df.pitch_type.value_counts()

FF    1023717
SL     454880
FT     340899
CH     295221
SI     244501
CU     236045
FC     150892
KC      66942
FS      44233
KN      11295
EP        816
FO        810
SC        121
Name: pitch_type, dtype: int64

In [9]:
pitch_df = pitch_df[pitch_df.pitch_type != 'EP']

Importing the regression functions:

In [10]:
from location_regression_functions import *
from pitch_cat_functions import *

Importing the functions from the pipeline script:

In [11]:
from classification_location_combo import *

### Initial dataframe setup:

Before building a workflow that models multiple pitchers in succession, I will build the pipeline here step by step for a single pitcher.

The first step in setting up the dataframe structure needed for modeling is to filter for the desired player's name, keeping only pitches that have a previous pitch recorded:

In [12]:
scherzer_df = pitch_df[(pitch_df.pitcher_full_name == 'Max Scherzer') & (pitch_df.last_pitch_px.notnull())]

One hot encoding of needed columns:

In [13]:
ohe_cols = ['stand', 'p_throws']

In [14]:
ohe_df = column_ohe_maker(scherzer_df, ohe_cols)

Numerically coding last pitch type and pitch type:

In [15]:
temp_df = last_pitch_type_to_num(ohe_df, 'last_pitch_type')

Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


In [16]:
output_df = pitch_type_to_num(temp_df, 'pitch_type')

Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
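
The helpers `last_pitch_type_to_num` and `pitch_type_to_num` are not shown here, so their logic is an assumption: the printed mappings look like they are ordered by how often each pitch appears. Since each column is coded independently, the same pitch can end up with different integers in the two columns (in the Rick Porcello output further down, CU and SL receive swapped codes in the two mappings). A minimal sketch of one shared mapping, using a hypothetical `frequency_codes` helper and toy labels:

```python
from collections import Counter

def frequency_codes(labels):
    """Map each label to an integer, most frequent label first."""
    counts = Counter(labels)
    return {label: code for code, (label, _) in enumerate(counts.most_common())}

# Toy columns (hypothetical values, for illustration only)
pitch_types = ["FF", "FF", "FF", "SL", "SL", "CH"]
last_pitch_types = ["FF", "FF", "SL", "CH", "CH", "CH", "UN"]

# Build the mapping once from the target column...
codes = frequency_codes(pitch_types)
# ...then give any label seen only in the other column the next free code
for label in last_pitch_types:
    codes.setdefault(label, len(codes))

pitch_type_num = [codes[p] for p in pitch_types]
last_pitch_type_num = [codes[p] for p in last_pitch_types]
print(codes)
```

With a shared mapping, a code of 1 means the same pitch in both `Pitch_Type_Num` and `Last_Pitch_Type_Num`.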


Declaring the columns for each step of the modeling process:

In [17]:
rf_cols = ['Cluster','inning', 'top', 'on_1b', 'on_2b', 'on_3b', 'b_count', 's_count', 'outs', 'stand_R',
       'pitcher_run_diff','last_pitch_speed', 'last_pitch_px', 'last_pitch_pz','pitch_num','cumulative_pitches',
       'cumulative_ff_rate', 'cumulative_sl_rate', 'cumulative_ft_rate',
       'cumulative_ch_rate', 'cumulative_cu_rate', 'cumulative_si_rate',
       'cumulative_fc_rate', 'cumulative_kc_rate', 'cumulative_fs_rate',
       'cumulative_kn_rate', 'cumulative_ep_rate', 'cumulative_fo_rate',
       'cumulative_sc_rate', 'Last_Pitch_Type_Num']

In [18]:
px_cols = rf_cols.copy()

In [19]:
px_cols.append('Pitch_Type_Num')

In [20]:
px_cols_val = px_cols.copy()
px_cols_val.append('pitch_pred')
px_cols_val.remove('Pitch_Type_Num')

In [21]:
pz_cols = px_cols.copy()
pz_cols.append('px')

In [22]:
pz_cols_val = pz_cols.copy()
pz_cols_val.append('px_pred')
pz_cols_val.remove('px')
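
The validation lists swap a true upstream value for its prediction, so that validation mimics unseen data where the true pitch type and px are not known. Below is a minimal sketch of that pattern with a hypothetical helper, `swap_feature`, and a shortened stand-in for `rf_cols`. One thing to check against the lists above: `pz_cols_val` is built from `pz_cols`, so as written it appears to keep the true `Pitch_Type_Num` rather than `pitch_pred`. The sketch swaps both, on the assumption that this is the intent.

```python
def swap_feature(columns, true_col, pred_col):
    """Return a new list with true_col replaced, in position, by pred_col."""
    return [pred_col if col == true_col else col for col in columns]

base_cols = ["inning", "b_count", "s_count"]  # shortened stand-in for rf_cols

# Training lists use the true upstream values
px_cols = base_cols + ["Pitch_Type_Num"]
pz_cols = px_cols + ["px"]

# Validation lists use the predicted upstream values instead
px_cols_val = swap_feature(px_cols, "Pitch_Type_Num", "pitch_pred")
pz_cols_val = swap_feature(px_cols_val + ["px"], "px", "px_pred")

print(px_cols_val)
print(pz_cols_val)
```

Keeping each predicted column in the same position as the true column it replaces also keeps the feature order identical between fitting and predicting.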

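Instead of a traditional row-level train/test split, I need to keep the pitches from the same at bat together.  To help with generalization and to simulate unseen data, I'll split by at bat (using split_pitch_data in classification_location_combo.py), which is why the actual validation fraction printed below can differ from the requested test_size.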

In [23]:
training_pitches, val_pitches = split_pitch_data(output_df, test_size=0.20)

Actual Test Size: 0.24353704211199415
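
`split_pitch_data` lives in classification_location_combo.py and is not shown here, so the following is a hypothetical stand-in rather than the actual function. It holds out whole at bats, and it illustrates one plausible reason the actual test size (0.2435) differs from the requested 0.20: at bats vary in length, so a fixed share of at bats does not translate to the same share of pitches.

```python
import numpy as np
import pandas as pd

def split_by_group(df, group_col, test_size=0.2, seed=0):
    """Hold out whole groups (e.g. at bats) so no group spans both sets."""
    rng = np.random.default_rng(seed)
    groups = df[group_col].unique()
    n_val = max(1, int(round(len(groups) * test_size)))
    val_groups = rng.choice(groups, size=n_val, replace=False)
    is_val = df[group_col].isin(val_groups)
    # .copy() returns independent frames, so adding columns later is safe
    return df[~is_val].copy(), df[is_val].copy()

# Toy data: five at bats of different lengths (hypothetical values)
toy = pd.DataFrame({
    "ab_id": [1, 1, 1, 2, 2, 3, 4, 4, 4, 4, 5, 5],
    "px": np.linspace(-1, 1, 12),
})

train_df, val_df = split_by_group(toy, "ab_id", test_size=0.2)
actual_test_size = len(val_df) / len(toy)
print(f"Actual Test Size: {actual_test_size}")
```

scikit-learn's `GroupShuffleSplit` offers the same group-aware behavior if a library implementation is preferred.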


Now, with the training and validation dataframes separated, I can run the model for each step, using the column lists declared above.

### Classification:

Setting up X and y for training and validation, using the split from above:

In [24]:
X_rf_train = training_pitches[rf_cols]

In [25]:
y_rf_train = training_pitches['Pitch_Type_Num']

In [26]:
X_rf_val = val_pitches[rf_cols]

In [27]:
y_rf_val = val_pitches['Pitch_Type_Num']

Fitting the model on the training set, then predicting on the validation set:

In [28]:
rf_model = RandomForestClassifier()
rf_model.fit(X_rf_train,y_rf_train)

RandomForestClassifier()

In [29]:
y_rf_pred = rf_model.predict(X_rf_val)

In [30]:
unique, counts = np.unique(y_rf_pred, return_counts=True)

In [31]:
np.asarray((unique, counts)).T

array([[   0, 2193],
       [   1,  317],
       [   2,   95],
       [   3,   23],
       [   4,   36],
       [   5,    2]])

Running some metrics:

In [32]:
cm = confusion_matrix(y_rf_val, y_rf_pred)
cm

array([[1186,  140,   44,   11,   15,    1],
       [ 362,  141,    8,    2,    0,    0],
       [ 317,   20,   27,    5,    8,    0],
       [ 200,    8,    9,    3,    2,    0],
       [ 117,    2,    7,    2,   11,    0],
       [  11,    6,    0,    0,    0,    1]])
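
The confusion matrix above already contains everything needed for the headline classification metrics. Below is a small sketch in plain Python that reads accuracy, a majority-class baseline, and macro-averaged precision and recall off that matrix (rows are true classes, columns are predicted classes). Macro averaging is an assumption about how the pipeline script summarizes precision and recall. For this run, the accuracy (1369 of 2666 pitches, about 0.51) comes out slightly below the baseline of always guessing class 0, the four-seam fastball (1397 of 2666, about 0.52).

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted)
cm = [
    [1186, 140, 44, 11, 15, 1],
    [362, 141, 8, 2, 0, 0],
    [317, 20, 27, 5, 8, 0],
    [200, 8, 9, 3, 2, 0],
    [117, 2, 7, 2, 11, 0],
    [11, 6, 0, 0, 0, 1],
]

def safe_div(num, den):
    """Divide, returning 0.0 when a class has no true or predicted samples."""
    return num / den if den else 0.0

n_classes = len(cm)
total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(n_classes))
accuracy = correct / total

# Always predicting the most common true class gives the baseline accuracy
baseline = max(sum(row) for row in cm) / total

row_sums = [sum(row) for row in cm]
col_sums = [sum(cm[r][c] for r in range(n_classes)) for c in range(n_classes)]
recalls = [safe_div(cm[i][i], row_sums[i]) for i in range(n_classes)]
precisions = [safe_div(cm[i][i], col_sums[i]) for i in range(n_classes)]

macro_recall = sum(recalls) / n_classes
macro_precision = sum(precisions) / n_classes

print(f"accuracy={accuracy:.4f} baseline={baseline:.4f}")
print(f"macro precision={macro_precision:.4f} macro recall={macro_recall:.4f}")
```

Comparing against the baseline is a quick check on whether the classifier adds information beyond the pitcher's overall pitch mix.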

### Px:

Adding the predicted pitch types to the validation dataframe, for use as a feature in Px regression:

In [33]:
val_pitches['pitch_pred'] = y_rf_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pitch_pred'] = y_rf_pred
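
The warning above is raised because `val_pitches` was produced by slicing another DataFrame, so pandas cannot tell whether the assignment is meant for the slice or the original. The assignment still takes effect here (the new columns show up in later output). One common way to avoid the warning is to take an explicit copy when the subset is created. Where that slice happens inside `split_pitch_data` is not shown, so the sketch below uses a toy frame instead:

```python
import pandas as pd

# Toy frame (hypothetical values, for illustration only)
df = pd.DataFrame({"ab_id": [1, 1, 2, 2], "px": [0.1, -0.2, 0.4, 0.0]})

# An explicit copy makes the subset own its data
val = df[df["ab_id"] == 2].copy()

# Adding a column to the copy raises no SettingWithCopyWarning
val["pitch_pred"] = [0, 1]
print(val)
```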


Setting up X and y for training and validation (the validation features use the predicted pitch type):

In [34]:
X_px_tr = training_pitches[px_cols]
y_px_tr = training_pitches['px']

In [35]:
X_px_val = val_pitches[px_cols_val]
y_px_val = val_pitches['px']

Instantiating linear regression:

In [36]:
lm = LinearRegression()

In [37]:
lm.fit(X_px_tr, y_px_tr)

LinearRegression()

In [38]:
px_pred = lm.predict(X_px_val)

Getting some metrics:

In [39]:
px_r2 = lm.score(X_px_val, y_px_val)
px_r2

0.06443692792880518

In [40]:
px_mae = mae(y_px_val, px_pred)

In [41]:
px_mae

0.6795902216537721
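
For reference, both metrics are simple to compute by hand. The sketch below assumes `mae` (presumably brought in by the star imports above) is the mean absolute error, and it uses toy numbers rather than the notebook's data. It also shows why a validation R^2 can be negative, as happens for one pitcher's px later on: R^2 is 0 when the model does no better than predicting the mean of the observed values, and drops below 0 when it does worse.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def mean_abs_error(y_true, y_pred):
    """Average absolute difference between truth and prediction."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy px values in feet (hypothetical); their mean is 0.1
y_true = [-0.8, -0.2, 0.1, 0.5, 0.9]
mean_pred = [0.1] * 5      # always predict the mean of y_true
shifted_pred = [0.6] * 5   # a constant guess that is further off

r2_mean = r2(y_true, mean_pred)
r2_shifted = r2(y_true, shifted_pred)
mae_mean = mean_abs_error(y_true, mean_pred)
print(r2_mean, r2_shifted, mae_mean)
```

Read this way, an R^2 of about 0.06 alongside an MAE of about 0.68 ft means the regression is only slightly better than always guessing the average horizontal location.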

Adding a px_pred column to the dataframe to utilize in the pz regression:

In [42]:
val_pitches['px_pred'] = px_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['px_pred'] = px_pred


### Pz:

Setting up X and y for training and validation (the validation features use the predicted px):

In [43]:
X_pz_tr = training_pitches[pz_cols]
y_pz_tr = training_pitches['pz']

In [44]:
X_pz_val = val_pitches[pz_cols_val]
y_pz_val = val_pitches['pz']

Instantiating linear regression:

In [45]:
lm = LinearRegression()

In [46]:
lm.fit(X_pz_tr, y_pz_tr)

LinearRegression()

In [47]:
pz_pred = lm.predict(X_pz_val)

Getting some metrics:

In [48]:
pz_r2 = lm.score(X_pz_val, y_pz_val)
pz_r2

0.1487819100411506

In [49]:
pz_mae = mae(y_pz_val, pz_pred)
pz_mae

0.6235886665887916

Adding a pz_pred column to the dataframe, alongside pitch_pred and px_pred:

In [50]:
val_pitches['pz_pred'] = pz_pred

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  val_pitches['pz_pred'] = pz_pred


### Evaluating the pipeline:

Looking at the final validation df:

In [51]:
output_df = pd.DataFrame(columns=val_pitches.columns)

In [52]:
output_df = pd.concat([output_df, val_pitches])

In [53]:
output_df

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_fo_rate,cumulative_sc_rate,Name,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.044230,2.523223
2454,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.007730,2.420953
2455,3.0,527038,453286,1.0,2.015001e+09,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,1,0.113485,2.260595
2496,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,1,0,0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2.015001e+09,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,0,0,0,-0.328774,2.583225
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2846534,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,0,1,1,0.219846,1.985454
2846535,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,1,1,1,0.250910,1.893609
2846536,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,1,0,0,0.108060,2.269643
2846537,6.0,598284,453286,1.0,2.018179e+09,3.0,Single,was,mia,1.0,...,0.0,0.0,Peter O'Brien,1.0,1.0,0,2,0,0.029326,1.923316


In [54]:
val_pitches.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,cumulative_fo_rate,cumulative_sc_rate,Name,Cluster,stand_R,Last_Pitch_Type_Num,Pitch_Type_Num,pitch_pred,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.04423,2.523223
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,0,0.00773,2.420953
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0.0,0.0,Wilmer Flores,3.0,1.0,0,0,1,0.113485,2.260595
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,1,0,0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0.0,0.0,Daniel Murphy,3.0,0.0,0,0,0,-0.328774,2.583225


In [55]:
val_pitches.shape

(2666, 54)

In [56]:
val_pitches.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'event',
       'home_team', 'away_team', 'b_score', 'on_1b', 'on_2b', 'on_3b', 'px',
       'pz', 'zone', 'pitch_type', 'start_speed', 'type', 'b_count', 's_count',
       'outs', 'pitch_num', 'last_pitch_type', 'last_pitch_px',
       'last_pitch_pz', 'last_pitch_speed', 'pitcher_full_name',
       'pitcher_run_diff', 'hitter_full_name', 'Date_Time_Date', 'Season',
       'cumulative_pitches', 'cumulative_ff_rate', 'cumulative_sl_rate',
       'cumulative_ft_rate', 'cumulative_ch_rate', 'cumulative_cu_rate',
       'cumulative_si_rate', 'cumulative_fc_rate', 'cumulative_kc_rate',
       'cumulative_fs_rate', 'cumulative_kn_rate', 'cumulative_ep_rate',
       'cumulative_fo_rate', 'cumulative_sc_rate', 'Name', 'Cluster',
       'stand_R', 'Last_Pitch_Type_Num', 'Pitch_Type_Num', 'pitch_pred',
       'px_pred', 'pz_pred'],
      dtype='object')

# Utilizing Functions

Based on the above, I put together functions to run this same process in classification_location_combo.py.  Running those here:

### Random Forest:

In [57]:
val_df_full = pitch_prediction_modeling_pipeline('Max Scherzer', pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5153788447111778
Precision: (0.3562664410132765,)
Recall: 0.2232799413526301
Random Forest Pitch Classification confusion matrix results:
[[1198  132   35   15   16    1]
 [ 375  131    6    1    0    0]
 [ 311   21   32    6    7    0]
 [ 198    8   12    0    4    0]
 [ 119    2    6    0   12    0]
 [  11    6    0    0    0    1]]
Val Px R^2: 0.06217340518712988
Val Px MAE: 0.6799217544894862 ft.
Val Pz R^2: 0.14797326755723217
Val Pz MAE: 0.6239260720953342 ft.


In [58]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.69,0.15,0.01,0.01,0.14,0.0,0.04423,2.523223
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0,0.55,0.2,0.01,0.07,0.17,0.0,0.00773,2.420953
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,1,0.25,0.52,0.17,0.03,0.03,0.0,0.113485,2.260595
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0,0.55,0.03,0.13,0.28,0.01,0.0,-0.234551,2.645617
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0,0.34,0.11,0.17,0.33,0.05,0.0,-0.328774,2.583225


Trying with another pitcher:

In [59]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2)

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157
Accuracy: 0.413403310456197
Precision: (0.3082354601213106,)
Recall: 0.27011107979952925
Random Forest Pitch Classification confusion matrix results:
[[542 159  55  19  22   0]
 [219 329  49  35  14   0]
 [182 122  76  17  12   0]
 [145 112  30  45  12   0]
 [112  81  33  21  32   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.01625097051207347
Val Px MAE: 0.6479778321352762 ft.
Val Pz R^2: 0.0630893631249938
Val Pz MAE: 0.7294874864012376 ft.


  _warn_prf(average, modifier, msg_start, len(result))


In [60]:
val_df_full.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,Pitch_Type_Num,pitch_pred,FT_prob,FF_prob,SL_prob,CU_prob,CH_prob,SI_prob,px_pred,pz_pred
6895,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,1,0,0.62,0.16,0.16,0.03,0.03,0.0,-0.353588,2.23679
6896,2.0,425796,519144,0.0,2015002000.0,0.0,Single,phi,bos,0.0,...,0,0,0.53,0.13,0.12,0.14,0.08,0.0,-0.205796,2.447072
6897,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.68,0.11,0.03,0.06,0.12,0.0,-0.536513,2.564916
6898,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.49,0.12,0.11,0.16,0.12,0.0,-0.456235,2.618067
6899,2.0,520471,519144,0.0,2015002000.0,0.0,Groundout,phi,bos,0.0,...,0,0,0.3,0.15,0.21,0.18,0.16,0.0,-0.516097,2.522265


Testing on the top 10 pitchers by pitch count to get a sense of the computational cost:

In [61]:
pitcher_list = pitch_df.pitcher_full_name.value_counts().head(10).index

In [62]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2)

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5116279069767442
Precision: (0.35134453614306466,)
Recall: 0.22939324545432085
Random Forest Pitch Classification confusion matrix results:
[[1194  139   35   13   15    1]
 [ 370  130    9    3    0    1]
 [ 306   30   25    6   10    0]
 [ 202    8   10    1    1    0]
 [ 117    3    6    1   12    0]
 [  12    4    0    0    0    2]]
Val Px R^2: 0.06307061255595747
Val Px MAE: 0.6801648634298397 ft.
Val Pz R^2: 0.14828013035815868
Val Pz MAE: 0.6236701133574182 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24533083059596433
Accuracy: 0.46901300688599845
Precision: (0.4403917529128037,)
Recall: 0.44842434850158785
Random Forest Pitch Classification confusion matrix results:
[[616  97 117  31]
 [233 234  60 214]
 [241  88 128  83]
 [ 19 174  31 248]]
Val Px R^2: 0.022252093421557317
Val Px MAE: 0.684886496455859 ft.
Val Pz R^2: 0.015032681481285781
Val Pz MAE: 0.6706689331434134 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2518304431599229


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.600229533282326
Precision: (0.30881157256297503,)
Recall: 0.20148604779978938
Random Forest Pitch Classification confusion matrix results:
[[1442   64   36    0    0    0]
 [ 403   91   20    0    0    0]
 [ 352   29   34    3    0    0]
 [ 116    7    5    2    0    0]
 [   8    1    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.08674479355763709
Val Px MAE: 0.6161427923452784 ft.
Val Pz R^2: 0.173386134513563
Val Pz MAE: 0.6829542920471685 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'FA': 6, 'UN': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2551168881559802
Accuracy: 0.49125475285171105
Precision: (0.38242278491951914,)
Recall: 0.3184550166684189
Random Forest Pitch Classification confusion matrix results:
[[977 195  55   7  26]
 [433 177  47   9  26]
 [130  37 102   2   0]
 [143  39  22   5   6]
 [119  41   0   1  31]]
Val Px R^2: -0.03200539804175584
Val Px MAE: 0.6637639257459879 ft.
Val Pz R^2: 0.027910153525192105
Val Pz MAE: 0.7306234277446174 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5760030864197531
Precision: (0.3920757482208521,)
Recall: 0.3246869770336806
Random Forest Pitch Classification confusion matrix results:
[[823 384  13   8   1]
 [425 642   9   8   0]
 [143  66  15   3   0]
 [  9  22   0  13   0]
 [  5   2   0   1   0]]
Val Px R^2: 0.17556908300949325
Val Px MAE: 0.5668768332490354 ft.
Val Pz R^2: 0.10863986209296961
Val Pz MAE: 0.7860822280100803 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.42430359305611626
Precision: (0.3179165474621097,)
Recall: 0.277485848762754
Random Forest Pitch Classification confusion matrix results:
[[550 159  49  23  16   0]
 [196 349  42  44  15   0]
 [194 109  70  22  14   0]
 [148 110  31  45  10   0]
 [115  83  28  16  37   0]
 [  1   1   0   0   0   0]]
Val Px R^2: 0.016638203711253463
Val Px MAE: 0.6488525881367437 ft.
Val Pz R^2: 0.06276122906570014
Val Pz MAE: 0.7294404699457013 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.4755134281200632
Precision: (0.28733649615452583,)
Recall: 0.25120487921882323
Random Forest Pitch Classification confusion matrix results:
[[999 123  38  16   3]
 [389 163  29  13   4]
 [240  66  33   6   2]
 [177  34   4   8   2]
 [127  39  10   6   1]]
Val Px R^2: 0.07157463518789675
Val Px MAE: 0.7156655476249751 ft.
Val Pz R^2: 0.16076557456873153
Val Pz MAE: 0.604059959760813 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.37650844730490746
Precision: (0.35966526285131684,)
Recall: 0.31196955997333703
Random Forest Pitch Classification confusion matrix results:
[[444 128  60  80  67   6]
 [202 178  42  47  62   2]
 [162  67  69  65  29   2]
 [151  22  41 100   0   1]
 [124  47  10   0 138   3]
 [ 57  18  11  21  23   7]]
Val Px R^2: 0.117682211776304
Val Px MAE: 0.6454248571311181 ft.
Val Pz R^2: 0.0375367807831235
Val Pz MAE: 0.693438523992874 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.24769076305220883
Accuracy: 0.4154843940008107
Precision: (0.3944652831397155,)
Recall: 0.37591449779419167
Random Forest Pitch Classification confusion matrix results:
[[496 131 108  58]
 [255 330  59  57]
 [237 113 113  44]
 [194 136  50  86]]
Val Px R^2: 0.07806261108540524
Val Px MAE: 0.6796511215306378 ft.
Val Pz R^2: 0.21730982301837365
Val Pz MAE: 0.6726560494794446 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'UN': 5, 'PO': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24703943981052415
Accuracy: 0.436848686952897
Precision: (0.3517501643016741,)
Recall: 0.28689011271863013
Random Forest Pitch Classification confusion matrix results:
[[768 134  72  19  19]
 [373 114  52   4   2]
 [209  52 134  18  16]
 [139  18  36  19   3]
 [142  13  20  10  13]]
Val Px R^2: 0.12576206196258877
Val Px MAE: 0.6879329688347735 ft.
Val Pz R^2: 0.10264284504860444
Val Pz MAE: 0.6780325056637525 ft.






Based on my stopwatch, it took ~5 minutes to run for 25 pitchers (the cell above uses the top 10), so modeling on a per-pitcher basis is not terribly expensive.
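
A stopwatch works, but the timing can also be captured in code. Below is a minimal sketch using the standard library's `time.perf_counter`. The `fake_model_run` function is a hypothetical placeholder for a per-pitcher call such as `pitch_prediction_modeling_pipeline`.

```python
import time

def time_per_item(items, fn):
    """Run fn on each item; return the results and the seconds spent per item."""
    results, timings = {}, {}
    for item in items:
        start = time.perf_counter()
        results[item] = fn(item)
        timings[item] = time.perf_counter() - start
    return results, timings

def fake_model_run(name):
    """Hypothetical stand-in for fitting the pipeline for one pitcher."""
    return len(name)

results, timings = time_per_item(["Max Scherzer", "Chris Sale"], fake_model_run)
total_seconds = sum(timings.values())
print(f"Total: {total_seconds:.6f} s across {len(timings)} pitchers")
```

Per-pitcher timings also show whether a few pitchers dominate the total run time.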

Overall at this time, I'm less concerned with the pitch location R^2 than with pitch type accuracy.  The rationale is that the predicted pitch type has a large effect on the predicted location, and different pitches are thrown (or are meant to be thrown) to different locations.  When the pitch type is wrong, the location estimate for that pitch matters little, so getting the pitch type right comes first.

The pitch location also does not need to be exact for my use case.  Telling a hitter "expect a fastball at coordinate 1.23, 2.54" won't give them good info - in real life, saying "expect a fastball high and inside" is typically what would be needed.
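
One way to turn predicted coordinates into that kind of advice is to bin them into coarse zones. The sketch below is hypothetical throughout. The thresholds (`px_edge`, `pz_low`, `pz_high`, in feet) are illustrative guesses rather than values from this project. It also assumes px is measured from the center of the plate from the catcher's view, with negative values toward a right-handed batter.

```python
def describe_location(px, pz, stand, px_edge=0.25, pz_low=2.0, pz_high=3.0):
    """Convert (px, pz) in feet into a coarse description for the hitter."""
    if pz > pz_high:
        vertical = "high"
    elif pz < pz_low:
        vertical = "low"
    else:
        vertical = "middle"

    if abs(px) <= px_edge:
        horizontal = "middle"
    else:
        # Assumption: negative px is on the right-handed batter's side
        toward_rhb = px < 0
        inside = toward_rhb if stand == "R" else not toward_rhb
        horizontal = "inside" if inside else "outside"

    if vertical == "middle" and horizontal == "middle":
        return "middle"
    return f"{vertical} and {horizontal}"

print(describe_location(-0.5, 3.2, "R"))
print(describe_location(0.04423, 2.523223, "R"))
```

With bins this coarse, a location MAE of roughly 0.6 to 0.7 ft is still large compared with the zone widths, so the bin edges would need tuning before the labels could be trusted.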

In [63]:
output_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,pitch_pred,FF_prob,SL_prob,CH_prob,CU_prob,FC_prob,FT_prob,px_pred,pz_pred,SI_prob
2453,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.69,0.17,0.01,0.0,0.13,0.0,0.04423,2.523223,
2454,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,0,0.49,0.2,0.06,0.05,0.2,0.0,0.00773,2.420953,
2455,3.0,527038,453286,1.0,2015001000.0,0.0,Strikeout,was,nyn,0.0,...,1,0.31,0.51,0.15,0.02,0.01,0.0,0.113485,2.260595,
2496,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.59,0.07,0.12,0.21,0.0,0.01,-0.234551,2.645617,
2497,5.0,502517,453286,1.0,2015001000.0,1.0,Flyout,was,nyn,0.0,...,0,0.36,0.1,0.2,0.34,0.0,0.0,-0.328774,2.583225,


### XGBoost Classifier:

For comparison, I will now run the workflow with XGBoost at the classification step instead of Random Forest:

In [64]:
val_df_full = pitch_prediction_modeling_pipeline('Rick Porcello', pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157




Accuracy: 0.42389987888574887
Precision: (0.31570832992243386,)
Recall: 0.29346191244922587
XGBoost Pitch Classification confusion matrix results:
[[503 138  73  56  27   0]
 [178 337  58  43  30   0]
 [170  97  86  34  22   0]
 [117 102  43  69  13   0]
 [ 89  75  35  25  55   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.014246877138460401
Val Px MAE: 0.6481563481851975 ft.
Val Pz R^2: 0.06401292208670706
Val Pz MAE: 0.7291888588890443 ft.


  _warn_prf(average, modifier, msg_start, len(result))
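
Overall accuracy for Rick Porcello is nearly identical between the two classifiers (about 0.413 for Random Forest in the single-pitcher run above versus about 0.424 for XGBoost), but the confusion matrices differ in where the correct predictions fall. Below is a small sketch that compares per-class recall from the two printed matrices. This is a single run of each model, and the Random Forest results vary slightly between runs, so the differences should be read loosely.

```python
# Class order follows the printed pitch type coding for Rick Porcello
labels = ["FT", "FF", "SL", "CU", "CH", "SI"]

# Random Forest matrix from the single-pitcher run (rows = true, columns = predicted)
rf_cm = [
    [542, 159, 55, 19, 22, 0],
    [219, 329, 49, 35, 14, 0],
    [182, 122, 76, 17, 12, 0],
    [145, 112, 30, 45, 12, 0],
    [112, 81, 33, 21, 32, 0],
    [0, 2, 0, 0, 0, 0],
]

# XGBoost matrix from the run above
xgb_cm = [
    [503, 138, 73, 56, 27, 0],
    [178, 337, 58, 43, 30, 0],
    [170, 97, 86, 34, 22, 0],
    [117, 102, 43, 69, 13, 0],
    [89, 75, 35, 25, 55, 0],
    [0, 2, 0, 0, 0, 0],
]

def accuracy(cm):
    """Share of all pitches on the diagonal (correctly classified)."""
    return sum(cm[i][i] for i in range(len(cm))) / sum(sum(row) for row in cm)

def per_class_recall(cm):
    """Correct predictions for each true class divided by that class's row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]

rf_recall = per_class_recall(rf_cm)
xgb_recall = per_class_recall(xgb_cm)

for label, rf_r, xgb_r in zip(labels, rf_recall, xgb_recall):
    print(f"{label}: RF recall={rf_r:.3f}  XGBoost recall={xgb_r:.3f}")
```

In these two runs XGBoost gives up some recall on the most common pitch (FT) and gains recall on the less common ones (for example CH), which is consistent with its higher reported macro recall.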


In [65]:
output_df = multiple_pitcher_predictions(pitcher_list, pitch_df, split_size = 0.2, class_method = 'XGBoost')

Pitch Modeling for Max Scherzer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5, 'UN': 6}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.24353704211199415
Accuracy: 0.5123780945236309
Precision: (0.33167357629037814,)
Recall: 0.21886717954759705
Random Forest Pitch Classification confusion matrix results:
[[1198  130   36   17   14    2]
 [ 376  127    7    2    1    0]
 [ 312   24   28    7    6    0]
 [ 200    8    9    3    2    0]
 [ 115    3   11    1    9    0]
 [  11    6    0    0    0    1]]
Val Px R^2: 0.06292832123650283
Val Px MAE: 0.6807360474068881 ft.
Val Pz R^2: 0.14825977257811085
Val Pz MAE: 0.6237695998133398 ft.




Pitch Modeling for Chris Sale
Here is the coding for last pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FA': 4}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Here is the coding for pitch type:
{'FT': 0, 'SL': 1, 'CH': 2, 'FF': 3, 'FS': 4}
Actual Test Size: 0.24533083059596433
Accuracy: 0.49387911247130833
Precision: (0.4716932061087076,)
Recall: 0.4763742622979258
Random Forest Pitch Classification confusion matrix results:
[[620  88 118  35]
 [231 253  67 190]
 [236  75 156  73]
 [ 18 161  31 262]]
Val Px R^2: 0.0233236221461528
Val Px MAE: 0.6845313195428048 ft.
Val Pz R^2: 0.014685803610571257
Val Pz MAE: 0.6707958594156237 ft.




Pitch Modeling for Justin Verlander
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5, 'PO': 6}


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CU': 2, 'CH': 3, 'FC': 4, 'FT': 5}
Actual Test Size: 0.2518304431599229


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.5925784238714613
Precision: (0.28211079805007694,)
Recall: 0.1951840034268869
Random Forest Pitch Classification confusion matrix results:
[[1435   62   44    1    0    0]
 [ 408   84   22    0    0    0]
 [ 353   35   29    1    0    0]
 [ 119    6    4    1    0    0]
 [   9    0    0    0    0    0]
 [   1    0    0    0    0    0]]
Val Px R^2: 0.0866611794646418
Val Px MAE: 0.616163313870525 ft.
Val Pz R^2: 0.17335695850285582
Val Pz MAE: 0.6829768586422804 ft.




Pitch Modeling for Jose Quintana
Here is the coding for last pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4, 'PO': 5, 'FA': 6, 'UN': 7}


Here is the coding for pitch type:
{'FF': 0, 'CU': 1, 'SI': 2, 'CH': 3, 'FT': 4}
Actual Test Size: 0.2551168881559802
Accuracy: 0.4870722433460076
Precision: (0.35832041385062785,)
Recall: 0.3136757250424462
Random Forest Pitch Classification confusion matrix results:
[[969 192  64   5  30]
 [427 177  47   8  33]
 [126  39 104   2   0]
 [147  42  18   3   5]
 [125  37   0   2  28]]
Val Px R^2: -0.043905909735479254
Val Px MAE: 0.6678932858879805 ft.
Val Pz R^2: 0.030339873854484
Val Pz MAE: 0.7296862263036331 ft.




Pitch Modeling for Chris Archer
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4, 'PO': 5}


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'FT': 3, 'CU': 4}
Actual Test Size: 0.255042802322149
Accuracy: 0.5837191358024691
Precision: (0.4396075223123077,)
Recall: 0.34460119632201797
Random Forest Pitch Classification confusion matrix results:
[[849 355  14   9   2]
 [423 636  15  10   0]
 [151  58  16   2   0]
 [ 11  21   1  11   0]
 [  3   3   0   1   1]]
Val Px R^2: 0.1754263006858301
Val Px MAE: 0.5669516203286671 ft.
Val Pz R^2: 0.10862124903677184
Val Pz MAE: 0.7860724403579163 ft.




Pitch Modeling for Rick Porcello
Here is the coding for last pitch type:
{'FT': 0, 'FF': 1, 'CU': 2, 'SL': 3, 'CH': 4, 'EP': 5, 'SI': 6, 'PO': 7}


Here is the coding for pitch type:
{'FT': 0, 'FF': 1, 'SL': 2, 'CU': 3, 'CH': 4, 'SI': 5}
Actual Test Size: 0.24137595010719157


  _warn_prf(average, modifier, msg_start, len(result))


Accuracy: 0.41421073879693177
Precision: (0.3130149385471274,)
Recall: 0.2742480574563962
Random Forest Pitch Classification confusion matrix results:
[[542 157  52  25  21   0]
 [226 319  46  37  18   0]
 [186 106  76  24  17   0]
 [134 118  32  49  11   0]
 [101  93  28  17  40   0]
 [  0   2   0   0   0   0]]
Val Px R^2: 0.012106816405400567
Val Px MAE: 0.6488782347646885 ft.
Val Pz R^2: 0.06273054970224434
Val Pz MAE: 0.7292177771210143 ft.




Pitch Modeling for Jon Lester
Here is the coding for last pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4, 'PO': 5}


Here is the coding for pitch type:
{'FF': 0, 'FC': 1, 'CU': 2, 'SI': 3, 'CH': 4}
Actual Test Size: 0.25269461077844313
Accuracy: 0.45813586097946285
Precision: (0.2801739477802663,)
Recall: 0.24305116568075444
Random Forest Pitch Classification confusion matrix results:
[[967 144  42  18   8]
 [408 150  26  11   3]
 [254  57  30   3   3]
 [173  35   3  13   1]
 [131  41   9   2   0]]
Val Px R^2: 0.06990997010013356
Val Px MAE: 0.716402492811807 ft.
Val Pz R^2: 0.16124324477114083
Val Pz MAE: 0.6039821917558733 ft.




Pitch Modeling for Corey Kluber
Here is the coding for last pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5, 'PO': 6}


Here is the coding for pitch type:
{'SI': 0, 'CU': 1, 'FF': 2, 'SL': 3, 'FC': 4, 'CH': 5}
Actual Test Size: 0.2488986784140969
Accuracy: 0.3668543845534996
Precision: (0.33393577430355886,)
Recall: 0.3047007535232488
Random Forest Pitch Classification confusion matrix results:
[[425 131  63  81  78   7]
 [203 182  38  48  55   7]
 [161  63  61  77  28   4]
 [144  22  39 110   0   0]
 [127  53  11   0 128   3]
 [ 59  24   6  19  23   6]]
Val Px R^2: 0.11583685501532659
Val Px MAE: 0.6458478056499573 ft.
Val Pz R^2: 0.03642402167295489
Val Pz MAE: 0.6940147833609269 ft.




Pitch Modeling for Gio Gonzalez
Here is the coding for last pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3, 'UN': 4}


Here is the coding for pitch type:
{'FF': 0, 'FT': 1, 'CU': 2, 'CH': 3}
Actual Test Size: 0.24769076305220883
Accuracy: 0.41143088771787595
Precision: (0.3922583781989796,)
Recall: 0.37389465312258935
Random Forest Pitch Classification confusion matrix results:
[[497 126 106  64]
 [264 308  62  67]
 [241 108 122  36]
 [183 138  57  88]]
Val Px R^2: 0.0758541759026703
Val Px MAE: 0.6782002543114094 ft.
Val Pz R^2: 0.2170310194605135
Val Pz MAE: 0.6731217908687507 ft.




Pitch Modeling for Julio Teheran
Here is the coding for last pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CU': 3, 'CH': 4, 'UN': 5, 'PO': 6}


Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'FT': 2, 'CH': 3, 'CU': 4}
Actual Test Size: 0.24703943981052415
Accuracy: 0.42350979574822845
Precision: (0.33331423765830537,)
Recall: 0.27068914752683926
Random Forest Pitch Classification confusion matrix results:
[[768 130  77  22  15]
 [378 105  54   7   1]
 [242  47 115  11  14]
 [136  18  40  19   2]
 [146  16  17  10   9]]
Val Px R^2: 0.12324580903101334
Val Px MAE: 0.688575858132727 ft.
Val Pz R^2: 0.10339876763060019
Val Pz MAE: 0.6777827775190203 ft.
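The `SettingWithCopyWarning` in the output above usually means a value is being assigned into a filtered slice of a DataFrame. The function body is not shown here, so the following is only a sketch of one common fix: take an explicit `.copy()` of the per-pitcher rows before adding encoded columns. The miniature `pitch_df`, the `coding` dictionary, and the `last_pitch_code` column name are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of pitch_df; only the column names come from the notebook.
pitch_df = pd.DataFrame({
    "pitcher_full_name": ["Max Scherzer", "Chris Sale", "Max Scherzer"],
    "last_pitch_type": ["FF", "SL", "CH"],
})

# An explicit copy makes the per-pitcher frame independent of pitch_df,
# so later assignments do not write into a view of the original.
pitcher_df = pitch_df[pitch_df["pitcher_full_name"] == "Max Scherzer"].copy()

# Assumed encoding step: map each pitch label to an integer code.
coding = {"FF": 0, "SL": 1, "CH": 2}
pitcher_df["last_pitch_code"] = pitcher_df["last_pitch_type"].map(coding)

print(pitcher_df)
```

The original `pitch_df` is left untouched, which also makes it safe to call the function repeatedly for different pitchers.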






The run takes about 70 seconds for 10 pitchers.

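Overall, the metrics are similar with Random Forest and XGBoost. Either way, both the pitch-type classification and the location regressions need to improve: across the 10 pitchers in the run above, accuracy ranges from about 0.37 to 0.59, and the validation R^2 never exceeds about 0.18 for Px or 0.22 for Pz. One thing to check: the run above was called with `class_method = 'XGBoost'`, but its output is labeled "Random Forest", so either the printed label or the argument handling in `multiple_pitcher_predictions` may need a fix.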
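To put the accuracy figures in context, the printed metrics can be recomputed from a confusion matrix and compared with a majority-class baseline. The sketch below uses the matrix printed above for Max Scherzer and assumes scikit-learn's convention of rows as actual classes and columns as predicted classes; the row-wise macro recall it computes agrees with the printed recall, which supports that assumption.

```python
import numpy as np

# Confusion matrix printed above for Max Scherzer
# (assumed layout: rows = actual pitch type, columns = predicted pitch type).
cm = np.array([
    [1198, 130, 36, 17, 14, 2],
    [376, 127, 7, 2, 1, 0],
    [312, 24, 28, 7, 6, 0],
    [200, 8, 9, 3, 2, 0],
    [115, 3, 11, 1, 9, 0],
    [11, 6, 0, 0, 0, 1],
])

total = cm.sum()

# Accuracy: correctly classified pitches (the diagonal) over all pitches.
accuracy = np.trace(cm) / total

# Macro recall: per-class recall (diagonal over row sums), averaged over classes.
macro_recall = np.mean(np.diag(cm) / cm.sum(axis=1))

# Baseline: accuracy of always predicting the most frequent actual class.
baseline = cm.sum(axis=1).max() / total

print(f"Accuracy: {accuracy:.4f}")
print(f"Macro recall: {macro_recall:.4f}")
print(f"Majority-class baseline: {baseline:.4f}")
```

For this matrix the baseline (about 0.524) is slightly higher than the model's accuracy (about 0.512), so the classifier is not yet beating a rule that always guesses the most common pitch.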

In [66]:
pitch_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'cumulative_pitches', 'cumulative_ff_rate',
       'cumulative_sl_rate', 'cumulative_ft_rate', 'cumulative_ch_rate',
       'cumulative_cu_rate', 'cumulative_si_rate', 'cumulative_fc_rate',
       'cumulative_kc_rate', 'cumulative_fs_rate', 'cumulative_kn_rate',
       'cumulative_ep_rate', 'cumulative_fo_rate', 'cumulative_sc_rate',
       'Name', 'Cluster'],
      dtype='object')
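Both `pitch_type` and `last_pitch_type` appear in the columns above, and the codings printed earlier for the two are built separately and can disagree. For Rick Porcello, 'CU' is 2 in one coding and 3 in the other, and for Chris Sale code 4 means 'FA' in one and 'FS' in the other. If both encoded columns feed the models, a single shared mapping may be safer. The following is a minimal sketch with hypothetical rows and hypothetical `*_code` column names, not the notebook's own encoding step.

```python
import pandas as pd

# Hypothetical rows; the two column names come from pitch_df.columns above.
df = pd.DataFrame({
    "pitch_type": ["FT", "FF", "SL", "CU", "CH"],
    "last_pitch_type": ["FF", "FT", "CU", "PO", "SL"],
})

# Build one mapping from the union of labels in both columns,
# numbered in order of first appearance.
labels = pd.unique(pd.concat([df["pitch_type"], df["last_pitch_type"]]))
coding = {label: code for code, label in enumerate(labels)}

# Apply the same mapping to both columns so a code always means the same pitch.
df["pitch_type_code"] = df["pitch_type"].map(coding)
df["last_pitch_type_code"] = df["last_pitch_type"].map(coding)

print(coding)
print(df)
```

Labels that occur only in `last_pitch_type` (such as 'PO' here) simply get the next free code instead of shifting the codes of the other pitches.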

In [67]:
output_df.KN_prob.value_counts()

AttributeError: 'DataFrame' object has no attribute 'KN_prob'
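The `AttributeError` indicates that `output_df` has no `KN_prob` column. That is plausible here, since 'KN' does not appear in any of the pitch-type codings printed for these 10 pitchers. A minimal sketch of a safer check follows; the stand-in `output_df` and the `<pitch>_prob` naming pattern are assumptions inferred from the failing call.

```python
import pandas as pd

# Hypothetical stand-in for output_df; the "<pitch>_prob" naming is assumed.
output_df = pd.DataFrame({"FF_prob": [0.6, 0.5], "SL_prob": [0.4, 0.5]})

# List which probability columns actually exist before inspecting one.
prob_columns = [col for col in output_df.columns if col.endswith("_prob")]
print(prob_columns)

# Bracket access plus a membership check avoids the AttributeError.
if "KN_prob" in output_df.columns:
    print(output_df["KN_prob"].value_counts())
else:
    print("No KN_prob column in output_df")
```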