# Modeling

## Introduction

In this notebook, we finally explore some machine learning models in an attempt to predict baseball pitch types. We have wrangled, explored, and processed Statcast data, resulting in three datasets available for modeling:

- no_pitchers - All pitches without pitcher reference
- pitchers - All pitches with reference to top 5 pitchers based on pitch count
- first_pitch - Only the first pitch in each at-bat without pitcher reference

From here, we can model various scenarios with the data above. We'll begin by setting a baseline that predicts every pitch as the most common pitch type, FF (four-seam fastball). Our modeling efforts will include scaling the data, creating a train/test split, and evaluating performance metrics. We will also tune some hyperparameters relevant to each model.

## Imports and Data Load

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import scale, StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV, learning_curve
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import SelectKBest, f_regression

In [2]:
no_pitchers = pd.read_csv('no_pitchers.csv')
pitchers = pd.read_csv('pitchers.csv')
first_pitch = pd.read_csv('first_pitch.csv')

In [3]:
print(len(no_pitchers))
print(len(pitchers))
print(len(first_pitch))

110301
3114
28166


## Train/Test Split

We begin with our first_pitch dataset, which contains only the first pitch of each at-bat.

In [4]:
first_pitch.head().T

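| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| pitch_type | SL | FF | FF | SL | FF |
| balls | 0 | 0 | 0 | 0 | 0 |
| strikes | 0 | 0 | 0 | 0 | 0 |
| on_3b | 0 | 0 | 0 | 0 | 0 |
| on_2b | 1 | 1 | 1 | 1 | 0 |
| on_1b | 1 | 0 | 0 | 0 | 1 |
| outs_when_up | 2 | 2 | 2 | 1 | 0 |
| inning | 9 | 9 | 9 | 9 | 9 |
| at_bat_number | 77 | 76 | 75 | 74 | 73 |
| pitch_number | 1 | 1 | 1 | 1 | 1 |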


Let's define X and y, then check the partition sizes if we use a 70/30 train/test split.

In [5]:
X = first_pitch.drop(columns='pitch_type')
y = first_pitch['pitch_type']

In [6]:
len(first_pitch) * .7, len(first_pitch) * .3

(19716.199999999997, 8449.8)

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [8]:
X_train.shape, X_test.shape

((19716, 25), (8450, 25))

In [9]:
y_train.shape, y_test.shape

((19716,), (8450,))

In [10]:
X_train.dtypes

balls                                   int64
strikes                                 int64
on_3b                                   int64
on_2b                                   int64
on_1b                                   int64
outs_when_up                            int64
inning                                  int64
at_bat_number                           int64
pitch_number                            int64
home_score                              int64
away_score                              int64
bat_score                               int64
fld_score                               int64
stand_L                                 int64
stand_R                                 int64
p_throws_L                              int64
p_throws_R                              int64
inning_topbot_Bot                       int64
inning_topbot_Top                       int64
if_fielding_alignment_Infield shift     int64
if_fielding_alignment_Standard          int64
if_fielding_alignment_Strategic   

In [11]:
X_test.dtypes

balls                                   int64
strikes                                 int64
on_3b                                   int64
on_2b                                   int64
on_1b                                   int64
outs_when_up                            int64
inning                                  int64
at_bat_number                           int64
pitch_number                            int64
home_score                              int64
away_score                              int64
bat_score                               int64
fld_score                               int64
stand_L                                 int64
stand_R                                 int64
p_throws_L                              int64
p_throws_R                              int64
inning_topbot_Bot                       int64
inning_topbot_Top                       int64
if_fielding_alignment_Infield shift     int64
if_fielding_alignment_Standard          int64
if_fielding_alignment_Strategic   

In [12]:
dumb_class = DummyClassifier(strategy='constant', constant='FF')
dumb_class.fit(X_train, y_train)
dumb_class.constant

'FF'

In [13]:
y_train_pred = dumb_class.predict(X_train)
y_train_pred[:5]

array(['FF', 'FF', 'FF', 'FF', 'FF'], dtype='<U2')

In [23]:
print(classification_report(y_train, y_train_pred, zero_division=0))

              precision    recall  f1-score   support

          CH       0.96      0.95      0.96      1506
          CU       0.95      0.93      0.94      2009
          EP       1.00      1.00      1.00         1
          FC       0.97      0.93      0.95      1215
          FF       0.91      0.98      0.94      7383
          FS       0.96      0.97      0.96       166
          FT       0.95      0.85      0.90      1977
          KC       0.96      0.91      0.94       490
          KN       1.00      1.00      1.00        11
          SI       0.96      0.88      0.92      1750
          SL       0.96      0.94      0.95      3208

    accuracy                           0.94     19716
   macro avg       0.96      0.94      0.95     19716
weighted avg       0.94      0.94      0.94     19716

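This classification report does not describe the dummy model. A classifier that always predicts FF would score a recall of 1.00 for FF and 0.00 for every other pitch type, with accuracy equal to the share of FF in the training set (7383 / 19716 ≈ 0.37). The execution count (In [23]) shows that this cell was run after `y_train_pred` was overwritten with the random forest's training predictions in In [18], so the figures above most likely reflect the random forest on the training data rather than the baseline.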
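
One way to keep a baseline report from being mixed up with later models is to store its predictions under a dedicated name. The sketch below is only an illustration: it uses a small synthetic label array rather than the Statcast data, and `X_demo`, `y_demo`, `baseline`, and `y_baseline_pred` are hypothetical names that do not appear elsewhere in this notebook. With a constant FF prediction, accuracy should equal the share of FF labels, and every other class should get zero recall.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, classification_report

# Synthetic stand-in for the pitch labels (assumption: not the real data).
y_demo = np.array(['FF'] * 6 + ['SL'] * 3 + ['CH'] * 1)
X_demo = np.zeros((len(y_demo), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy='constant', constant='FF')
baseline.fit(X_demo, y_demo)

# Dedicated variable name so later cells cannot overwrite the baseline predictions.
y_baseline_pred = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_baseline_pred)
report = classification_report(
    y_demo, y_baseline_pred, zero_division=0, output_dict=True
)

print("Baseline accuracy:", baseline_accuracy)
print(classification_report(y_demo, y_baseline_pred, zero_division=0))
```

Printing the report, rather than leaving it as the last expression in the cell, renders it as an aligned table instead of a raw string with `\n` escapes.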

In [16]:
# Scale the features, then fit a 100-tree random forest
pipe = make_pipeline(StandardScaler(),
                     RandomForestClassifier(n_estimators=100,
                                            random_state=42,
                                            max_features='sqrt',
                                            n_jobs=-1, verbose=1)
)

In [17]:
pipe.fit(X_train, y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    0.8s
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed:    1.9s finished


Pipeline(memory=None,
         steps=[('standardscaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features='sqrt',
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=-1,
                                        oob_score=False, random_state=42,
                                        verbose=1, warm_start=False))],
         verbose=False)

In [18]:
y_train_pred = pipe.predict(X_train)
y_test_pred = pipe.predict(X_test)

[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.1s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.3s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  42 tasks      | elapsed:    0.0s
[Parallel(n_jobs=4)]: Done 100 out of 100 | elapsed:    0.1s finished


In [20]:
from sklearn import metrics

In [22]:
print("Random Forest model")
print("Accuracy:", metrics.accuracy_score(y_test,y_test_pred))
print("Balanced accuracy:", metrics.balanced_accuracy_score(y_test,y_test_pred))
print('Precision score' , metrics.precision_score(y_test,y_test_pred, average='micro'))
print('Recall score' , metrics.recall_score(y_test,y_test_pred, average='micro'))

Random Forest model
Accuracy: 0.3190532544378698
Balanced accuracy: 0.19096925581511703
Precision score 0.3190532544378698
Recall score 0.3190532544378698
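
The micro-averaged precision and recall above are identical to the accuracy. That is expected for single-label multiclass problems, where micro-averaging pools every prediction before computing the ratio. Macro-averaging weights each pitch type equally, and macro recall is the same quantity as balanced accuracy. The sketch below illustrates this on a tiny made-up label set (an assumption for illustration, not the Statcast data).

```python
import numpy as np
from sklearn import metrics

# Made-up labels for illustration only (assumption: not the real data).
y_true = np.array(['FF', 'FF', 'FF', 'SL', 'SL', 'CH'])
y_pred = np.array(['FF', 'FF', 'SL', 'SL', 'FF', 'FF'])

accuracy = metrics.accuracy_score(y_true, y_pred)
balanced = metrics.balanced_accuracy_score(y_true, y_pred)

# Micro-averaging pools all predictions, so it matches accuracy here.
micro_precision = metrics.precision_score(y_true, y_pred, average='micro')
micro_recall = metrics.recall_score(y_true, y_pred, average='micro')

# Macro-averaging takes the unweighted mean of the per-class scores.
# CH is never predicted, so its precision is set to 0 via zero_division.
macro_precision = metrics.precision_score(
    y_true, y_pred, average='macro', zero_division=0
)
macro_recall = metrics.recall_score(y_true, y_pred, average='macro')

print("Accuracy:", accuracy)
print("Micro precision / recall:", micro_precision, micro_recall)
print("Macro precision / recall:", macro_precision, macro_recall)
print("Balanced accuracy:", balanced)
```

For an imbalanced target like pitch type, macro-averaged scores may be more informative alongside accuracy, since they are not dominated by the most common class.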
