<h1>Baseball Pitch Prediction</h1>

<h3>Author:  Le Phan</h3>
 
<h4>Date: October 11, 2019</h4>

1. [Introduction](#intro)
2. [Data Processing](#process)
3. [Feature Engineering](#features)
4. [Classification Method](#model)
5. [Results](#res)
6. [Future Considerations](#futures)
7. [Appendix](#appendix)

<a id='intro'></a>
<h2>1. Introduction</h2>

Baseball is one of the most popular sports in America. Top-performing players can earn astronomical contracts in addition to sponsorship deals with leading consumer brands. Most recently, Manny Machado signed a 10-year deal with the Padres for a whopping $300 million. To help understand player performance, Major League Baseball (MLB) provides a detailed data set that measures every aspect of the game through its PITCHf/x system. Pitch-related data is extremely important because it helps baseball teams measure their players' performance. It can also be used to predict the pitching pattern of an opposing team during a game, which is a significant strategic advantage.

This data analysis project examines the 2011 pitch data set provided by Swish Analytics, a sports analytics company, and builds machine learning models that can be used for pitch prediction during a game. [Section 2](#process) discusses the data exploration and transformation process. [Section 3](#features) goes over the feature engineering steps derived from a subset of features taken from the original data. Machine learning models and results are discussed in [Section 4](#model) and [Section 5](#res).

<a id='process'></a>
<h2>2. Data Processing</h2>

The data set is a record of roughly 720,000 pitches thrown in 2011. It contains pre-pitch information such as the current ball/strike count, whether the pitcher is left-handed or right-handed, and the presence of runners on first, second, and third base. These are important features because they influence a pitcher's strategy. For instance, if the bases are loaded and the count is 3 balls and 2 strikes with 2 outs, the pitcher has a greater incentive not to walk the batter and might attempt to strike him out by throwing a more centered pitch. Post-pitch information, such as the horizontal/vertical position of the pitch and its velocity, is also useful. These features will be used to engineer additional features in the next section. But let's first load the data and examine some descriptive statistics. See the [Appendix](#appendix) for a full list of features used and their descriptions.

In [132]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.dpi'] = 144

import numpy as np
import pandas as pd
from datetime import datetime
import pickle

from bokeh.io import output_notebook, show
from bokeh.plotting import figure
output_notebook()

In [133]:
names = ["uid", "game_pk", "inning", "top", "at_bat_num", "pcount_at_bat", "pcount_pitcher",
         "balls", "strikes", "fouls", "outs", "batter_id", "pitcher_id", "p_throws",
         "x", "y", "start_speed", "pitch_type", "on_1b", "on_2b", "on_3b"]

df = pd.read_csv("pitches.zip", usecols=names, index_col="uid")
len(df.pitcher_id.unique()) # 662 pitchers in 2011

662

In [None]:
df.shape # (718961, 20)

In [134]:
# convert to categorical data
df[["batter_id", "p_throws", "pitch_type"]] = df[["batter_id", "p_throws", "pitch_type"]].astype("category")
# binarize 
df[["on_1b", "on_2b", "on_3b"]] = np.where(df[["on_1b", "on_2b", "on_3b"]] > 0, 1, 0)
# add pitcher-count pair
df["pc_pair"]  = df["pitcher_id"].astype(str) + "_" + df["balls"].astype(str) + df["strikes"].astype(str)
# add balls-strike count
df["bs_count"] = df["balls"].astype(str) + df["strikes"].astype(str)

In [135]:
df.pitcher_id.value_counts().describe()

count     662.000000
mean     1086.043807
std      1044.893803
min         5.000000
25%       251.250000
50%       814.500000
75%      1332.000000
max      4301.000000
Name: pitcher_id, dtype: float64

Given that there are 662 pitchers in the data set, we want to filter out pitchers who do not pitch regularly. Since each team plays 162 games per MLB season, it is assumed that pitchers who threw fewer than 1000 pitches (slightly below the mean) in 2011 are **relief** pitchers who entered the game after the starting pitcher was removed, often due to injury or fatigue. This filter reduces the pitcher count to 255. From here, we can examine the distribution of each pitch type over the entire data set.

In [136]:
# keep only pitchers with more than 1000 pitches
df = df.groupby("pitcher_id").filter(lambda x: len(x) > 1000)  # 255 pitchers

pitches = df.pitch_type.value_counts()
types = list(pitches.index)
counts = list(pitches.values)

# set the x_range to the list of categories
p = figure(x_range=types, plot_height=250, title="Pitch Type Counts")
p.vbar(x=types, top=counts, width=.9)

# set some properties to make plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

From the bar plot, it is clear that fastballs are the most popular type of pitch. Although there are 18 pitch categories, the distribution is skewed toward a handful of pitch types. Thus, it makes sense to keep the categories that occur more frequently in their own general buckets while relabeling the low-frequency pitch types as "others" (`OB`).

In [137]:
# replace pitch_types with general labels
fastballs = dict.fromkeys(["FF", "FT", "FC", "FS"], "FB")
knuckleballs = dict.fromkeys(["KC", "KN"], "KB")
otherballs = dict.fromkeys(["PO", "FO", "EP", "FA", "UN", "AB", "SC", np.nan], "OB")
df["pitch_type"] = df.pitch_type.replace(fastballs)
df["pitch_type"] = df.pitch_type.replace(knuckleballs)
df["pitch_type"] = df.pitch_type.replace(otherballs)

<a id='features'></a>
<h2>3. Feature Engineering</h2>

This section uses a subset of the original features to derive additional insights into factors that might be useful in predicting the next pitch. See the [Appendix](#appendix) for a full list of added features. Since each pitcher is different and has his own preferences for the type of ball he throws, we will add the pitcher's historical pitch percentages of each type (i.e., fast, sinker, slider, curve, changeup, knuckle). This is accomplished with the helper function `_get_pitch_pct`, which is called by the `add_pitch_pct` function.

In [138]:
def _get_pitch_pct(dff):  
    """Compute pitcher's career percentages for each pitch type.
    params:
    -------
    dff: pd.DataFrame
    
    return:
    -------
    d: dictionary with (pitcher_id, pitch_type) as key and percentage as value
    """
    x = dff.groupby("pitcher_id").agg({"pitch_type": "count"})
    y = dff.groupby(["pitcher_id", "pitch_type"]).agg({"pitch_type": "count"})
    z = y.div(x, axis=1)
    d = {}
    for i in range(len(z)):
        d[z.index[i]] = z.values[i][0]
    return d

def add_pitch_pct(dff, pct_list):
    """Add career percentage features to dataframe.
    params:
    -------
    dff: pd.DataFrame
    pct_list: list of 2-tuple of strings representing the pitch type percentages
    
    return:
    -------
    dff: pd.DataFrame with added career features
    """
    career_pct = _get_pitch_pct(dff)
    for el in pct_list:
        dff[el[0]] = dff.pitcher_id.map(lambda x: career_pct.get((x, el[1]), 0))
    return dff

In [139]:
%%time
# percentage list for each type of pitch
career_features = [("fast_pct", "FB"), ("sinker_pct", "SI"), 
           ("slider_pct", "SL"), ("curve_pct", "CU"), 
           ("changeup_pct", "CH"), ("knuckle_pct", "KB"), 
           ("other_pct", "OB")]
# add career percentages features for each pitch
df = add_pitch_pct(df, career_features)

CPU times: user 4.99 s, sys: 223 ms, total: 5.21 s
Wall time: 3.93 s
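
The dictionary lookup above can also be expressed as a normalized cross-tabulation. The following is a minimal sketch on hypothetical toy data, not the notebook's implementation; `toy` and its values are made up for illustration, and only the column names `pitcher_id`, `pitch_type`, `fast_pct`, and `curve_pct` mirror the ones used here.

```python
import pandas as pd

# Hypothetical toy data: two pitchers and a handful of pitches.
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1, 2, 2],
    "pitch_type": ["FB", "FB", "CU", "FB", "SL", "FB"],
})

# One row per pitcher, one column per pitch type; each row sums to 1.
pct = pd.crosstab(toy["pitcher_id"], toy["pitch_type"], normalize="index")

# Map each pitcher's share back onto the pitch-level rows (index is preserved).
toy["fast_pct"] = toy["pitcher_id"].map(pct["FB"])
toy["curve_pct"] = toy["pitcher_id"].map(pct["CU"])
print(toy)
```

Because `crosstab` fills missing pitcher/pitch-type combinations with zero, no explicit default is needed for pitchers who never throw a given pitch type.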


In [140]:
df.loc[:, "fast_pct" : "other_pct"].describe()

       fast_pct  sinker_pct  slider_pct  curve_pct  changeup_pct  knuckle_pct  other_pct
count  544741.0    544741.0    544741.0   544741.0      544741.0     544741.0   544741.0
mean   0.521239     0.11537     0.14086   0.082819      0.108646     0.021711   0.004775
std    0.217422    0.201848    0.115913   0.081531      0.081837     0.092092   0.011917
min         0.0         0.0         0.0        0.0           0.0          0.0        0.0
25%    0.403816         0.0    0.009494        0.0      0.040486          0.0   0.001358
50%    0.583994         0.0    0.145727   0.075859      0.101254          0.0   0.002689
75%    0.668494    0.177795    0.225708   0.133868       0.16429          0.0   0.004878
max    0.982009    0.753546    0.613902   0.396465      0.345737     0.878311   0.232751


We can see from the pitch statistics that the fastball is the most common type of pitch among all pitchers, with some pitchers throwing fastballs almost exclusively (up to 98% of their pitches). Although historical pitch percentages are useful in indicating a pitcher's tendency, we also need to consider the current batter that he faces. To do that, we'll add the historical pitch percentages of each type thrown by the pitcher to the current batter.

In [141]:
def _get_batter_pct(dff):
    """Compute pitcher's career percentages for each pitch type thrown at a specific batter.
    params:
    -------
    dff: pd.DataFrame
    
    return:
    -------
    d: dictionary with (pitcher_id, batter_id, pitch_type) as key and percentage as value
    """
    x = dff.groupby(["pitcher_id", "batter_id"]).agg({"pitch_type": "count"})
    y = dff.groupby(["pitcher_id", "batter_id", "pitch_type"]).agg({"pitch_type": "count"})
    z = y.div(x, axis=1)
    d = {}
    for i in range(len(z)):
        d[z.index[i]] = z.values[i][0]
    return d

def add_batter_pct(dff, btr_list):
    """Add batter-specific pitch percentage features to dataframe.
    params:
    -------
    dff: pd.DataFrame
    btr_list: list of 2-tuple strings representing pitch percentage specific to a batter.
    
    return:
    -------
    dff: pd.DataFrame with added pitch percentages specific to a batter
    """
    btr_pct = _get_batter_pct(dff)
    temp = pd.Series(zip(dff.pitcher_id, dff.batter_id), index=dff.index)
    for el in btr_list:
        # write to the dataframe passed in (dff), not the global df
        dff[el[0]] = temp.map(lambda x: btr_pct.get((x[0], x[1], el[1]), 0))
    return dff

In [142]:
%time
btr_features = [("fast_btr", "FB"), ("sinker_btr", "SI"), 
           ("slider_btr", "SL"), ("curve_btr", "CU"), 
           ("changeup_btr", "CH"), ("knuckle_btr", "KB"), 
           ("other_btr", "OB")]
# add batter-specific percentages
df = add_batter_pct(df, btr_features)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs


The previous features capture the pitcher's overall tendency to favor a particular type of pitch. However, pitchers may modify their tendencies through practice between games. Thus, it is important that we capture the pitcher's recent trend as well. This is done by adding rolling pitch percentages over the previous 5, 10, 15, and 20 pitches.

In [143]:
def get_recent_pct(dff, n, feats):
    """Compute n-recent pitch percentages for each pitch type.
    params:
    -------
    dff: pd.DataFrame
    n: int representing number of recent pitches
    feats: list of 2-tuple of feature names
    
    returns:
    --------
    dff: pd.DataFrame
    """
    for feat in feats:
        dff[feat[0] + "_prev" + str(n)] = (dff.groupby("pitcher_id")["pitch_type"].shift(1) == feat[1]).rolling(n).sum()/n
    return dff

In [144]:
%time
recent_features = [("fast", "FB"), ("sinker", "SI"), ("slider", "SL"), 
                   ("curve", "CU"), ("changeup", "CH"), ("knuckle", "KB"), 
                   ("other", "OB")]
# add 5, 10, 15, 20 recent percentages for each type
for i in [5, 10, 15, 20]:
    df = get_recent_pct(df, i, recent_features)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 9.06 µs
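
One thing to watch in `get_recent_pct`: the shift is applied per pitcher, but `rolling(n)` runs over the whole column, so a window near the start of one pitcher's rows can include another pitcher's pitches. Below is a minimal sketch of a per-pitcher variant, assuming rows are already in chronological order within each pitcher. The helper name `recent_pct_per_pitcher` and the toy data are hypothetical and not part of the notebook.

```python
import pandas as pd

def recent_pct_per_pitcher(dff, n, pitch_label):
    """Share of `pitch_label` among each pitcher's previous n pitches."""
    is_label = (dff["pitch_type"] == pitch_label).astype(float)
    # shift and roll inside each pitcher's group so windows never cross pitchers
    return is_label.groupby(dff["pitcher_id"]).transform(
        lambda s: s.shift(1).rolling(n).mean()
    )

# Hypothetical toy data: two pitchers with four pitches each.
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "pitch_type": ["FB", "FB", "CU", "FB", "CU", "CU", "FB", "CU"],
})
toy["fast_prev2"] = recent_pct_per_pitcher(toy, 2, "FB")
print(toy)
```

With a window of 2, the first two rows of each pitcher are `NaN` because there are not yet two previous pitches by that pitcher to average over.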


Next, we'll get even more specific with pitch percentages that capture a specific game scenario. For instance, when facing a really good batter (e.g., one with a high home-run rate) with a runner on first base and a count of 3 balls and 0 strikes, the pitcher might be more inclined to walk the batter than to attempt to strike him out. We therefore need the pitch type percentages for each ball/strike count specific to a pitcher.

In [145]:
def _get_combo_pct(dff):
    """Compute pitch type percentages for each ball-strike count specific to a pitcher.
    params:
    -------
    dff: pd.DataFrame
    
    return:
    -------
    d: dictionary of (pitcher_id, ball-strike, pitch_type) tuple as key and percentages as value
    """
    # use the dataframe passed in (dff), not the global df
    x = dff.groupby(["pitcher_id", "bs_count"]).agg({"pitch_type": "count"})
    y = dff.groupby(["pitcher_id", "bs_count", "pitch_type"]).agg({"pitch_type": "count"})
    z = y.div(x, axis=1)
    d = {}
    for i in range(len(z)):
        d[z.index[i]] = z.values[i][0]
    return d

def add_combo_pct(dff, combo_list):
    """Add pitch type percentages for each ball-strike combo specific to a pitcher.
    params:
    -------
    dff: pd.DataFrame
    
    return:
    -------
    dff: pd.DataFrame with added features
    """
    combo_pct = _get_combo_pct(dff)
    # all 12 ball-strike counts: balls 0-3, strikes 0-2
    pairs = [str(i)+str(j) for i in range(4) for j in range(3)]
    for co in combo_list:
        for pair in pairs:
            # look up the pitcher's share of pitch type co[1] at this specific count;
            # the lookup must use `pair`, otherwise all 12 columns are identical
            dff[co[0] + pair + "_combo"] = dff.pitcher_id.map(
                lambda x: combo_pct.get((x, pair, co[1]), 0))
    return dff

In [146]:
%time
# add ball-strike pitch percentages
df = add_combo_pct(df, recent_features)

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 8.82 µs


All engineered features added to the data set up to this point are based on pre-pitch information. Next, we will use post-pitch features (i.e., horizontal/vertical positions and velocity) to derive metrics that capture the recent tendency of a pitcher to throw in a particular pitch zone. We'll accomplish this by taking a moving average of the coordinates and velocities of the previous three pitches.

In [147]:
# add previous coordinates and moving averages
df["x_prev"] = df.groupby("pitcher_id")["x"].shift(1)
df["y_prev"] = df.groupby("pitcher_id")["y"].shift(1)
df["x_avg3"] = df.groupby("pitcher_id")["x"].shift(1).rolling(3).mean()
df["y_avg3"] = df.groupby("pitcher_id")["y"].shift(1).rolling(3).mean()
df["speed_prev"] = df.groupby("pitcher_id")["start_speed"].shift(1)
df["speed_avg3"] = df.groupby("pitcher_id")["start_speed"].shift(1).rolling(3).mean()

Before we start the modeling process, let's first remove the `x` and `y` features, which are known only after the pitch is thrown. These two features were used to engineer the previous pitch coordinates as well as the average coordinates of the previous three pitches. Similarly, we will remove `start_speed`, which records the velocity of the pitch.

In addition, we need to shift the label column `pitch_type` backward since our goal is to forecast the next pitch. In its original form, the `pitch_type` label is only known after the pitch is made. Therefore, shifting backward one period avoids information leakage.

After the above tasks are completed, we also need to drop all rows that contain NaN values.

In [148]:
# drop after-pitch information
df.drop(["x", "y", "start_speed"], axis=1, inplace=True)
# drop rows with balls = 4
df.drop(df[df.balls == 4].index, inplace=True)  # 2 rows dropped
# shift label backward for forecasting
df["pitch_type"] = df.pitch_type.shift(-1)
# drop NaN 
df.dropna(axis=0, inplace=True)
len(df.batter_id.unique()) # 903 batters

903
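
To make the effect of the label shift concrete, here is a small sketch on hypothetical toy data. It compares the plain `shift(-1)` used above with a per-pitcher shift; the grouped variant is shown only as an alternative under the assumption that rows are in chronological order, and the column names `next_global` and `next_same_pitcher` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: two pitchers with two pitches each.
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 2, 2],
    "pitch_type": ["FB", "CU", "SL", "FB"],
})

# Plain shift: each row receives the label of the following row,
# even when the following row belongs to a different pitcher.
toy["next_global"] = toy["pitch_type"].shift(-1)

# Grouped shift: each row receives the label of the same pitcher's next pitch;
# a pitcher's last row has no next pitch and becomes NaN.
toy["next_same_pitcher"] = toy.groupby("pitcher_id")["pitch_type"].shift(-1)
print(toy)
```

In the plain version, the last pitch of pitcher 1 is labeled with the first pitch of pitcher 2 (`SL`), whereas the grouped version leaves it as `NaN` so that it is removed by the subsequent `dropna`.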

<a id='model'></a>
<h2>4. Classification Method</h2>

The goal of this project is not simply to predict the next pitch but to *predict the next pitch given a specific ball/strike count*. We will use a **Random Forest** multiclass classifier to predict the next pitch. But first, let's discuss the issues with the "cleaned" data set. There are 903 unique batters in the record. Since `batter_id` is a categorical feature, **one-hot encoding** it for our classifier would result in very high dimensionality, i.e., 903 features on top of all the other features. So we need to filter the batters such that we only keep the records of pitches thrown to frequent batters. Since there are 162 baseball games per season, it is assumed that frequent batters are those who get to bat at least twice per game. Thus, we will set the threshold for frequent batters at more than 400 at-bats and filter out pitches thrown to infrequent batters. Another issue to consider is that with 255 pitchers and 12 possible ball/strike counts, there are 3060 subsets of data and models. However, not all subsets have enough observations to train on. Some scenarios are more likely than others (e.g., a 0-0 count occurs more frequently than a 3-0 count for some pitchers). So we will filter out subsets with 100 or fewer observations.

The two filters below result in 128 frequent batters and 778 pitcher and ball/strike combinations.

In [149]:
mask = ["game_pk", "at_bat_num", "batter_id"]
bat_freq = df.drop_duplicates(mask).batter_id.value_counts() 
frequent_batters = bat_freq[bat_freq > 400].index  # 128 frequent batters
df = df.loc[df.batter_id.isin(frequent_batters)]   # only keep pitch made to frequent batters
df = df.groupby("pc_pair").filter(lambda x: len(x) > 100)
df.pc_pair.value_counts().shape[0]   # 778 subsets
df.drop(["game_pk", "bs_count"], axis=1, inplace=True) # not used in forecasting

Since we will be building 778 models, one for each pitcher and ball/strike combination, let's first subset the data. We'll put our subsets into a dictionary where the keys are the pitcher and ball/strike count (i.e., `pc_pair`) and the values are the subset dataframes specific to each key.

In [150]:
def get_subsets(dff):
    """Split dataframe into subsets of data based on pc_pair.
    param:
    ------
    dff: pd.DataFrame
    
    return:
    -------
    d: dictionary with pc_pair as key and pd.DataFrame as value
    """
    d = {}
    for i in dff.pc_pair.unique():
        d[i] = dff[dff.pc_pair == i]
    return d

In [151]:
%%time
# create subsets of data
subsets = get_subsets(df)

CPU times: user 10.9 s, sys: 210 ms, total: 11.1 s
Wall time: 11.5 s
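
`get_subsets` scans the full dataframe once per key. A single `groupby` pass produces the same kind of dictionary; the sketch below uses a hypothetical helper name, `get_subsets_groupby`, and toy data, and is offered as one possible alternative rather than the notebook's method.

```python
import pandas as pd

def get_subsets_groupby(dff):
    """Split dataframe into one sub-dataframe per pc_pair in a single pass."""
    return {key: group for key, group in dff.groupby("pc_pair")}

# Hypothetical toy data with two pitcher-count pairs.
toy = pd.DataFrame({
    "pc_pair": ["1_00", "1_01", "1_00"],
    "pitch_type": ["FB", "CU", "FB"],
})
subsets_toy = get_subsets_groupby(toy)
print({k: len(v) for k, v in subsets_toy.items()})
```

One difference to keep in mind is that `groupby` returns the keys in sorted order, whereas `unique()` preserves order of first appearance; the contents of each subset are the same.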


Excluding the identifier columns, the resulting data set contains 147 features: 15 original and 132 engineered features that capture the historical pitch percentages and recent trend for each pitcher. This poses a problem of high dimensionality, which could make training computationally expensive. So we will need to reduce the dimension to a more manageable size.

When it comes to dimensionality reduction, one well-known method that comes to mind is **Principal Component Analysis (PCA)**. The loadings (coefficients) of each principal component are often used to select important features. However, *we will not use PCA here*. This is because each pitcher is different and tends to favor a certain type of pitch. The problem gets more complicated as we consider the ball/strike count scenario, which impacts the pitcher's strategy. This means that a group of similar features capturing the historical pitch percentages and recent trend for fastballs may not be useful in predicting pitches from pitchers who favor curveballs or sliders in a given ball/strike scenario. So we will need to select features by similar group for each pitcher and ball/strike combination.

First, we'll create several lists of "similar features" to be used by the custom **ColumnSelector** to subset the dataframe. Each group is passed to scikit-learn's **SelectFromModel**, which uses an L1-penalized **LinearSVC** to reduce a given set of 18 features down to a smaller number. To deal with categorical features, we'll use the custom **MyFeatureHasher**, a thin wrapper around scikit-learn's **FeatureHasher**, to encode the categories more efficiently. These steps are wrapped inside a **FeatureUnion**, which combines the selected features for classification.
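
To illustrate why hashing keeps the dimension fixed regardless of the number of batters, here is a simplified sketch of the hashing trick using only the standard library. It is not scikit-learn's implementation: `FeatureHasher` uses MurmurHash3 with signed values, while this sketch uses MD5 and plain counts. The function `hash_features` and the example record (including the batter id) are hypothetical.

```python
import hashlib

def hash_features(record, n_features=8):
    """Map {'name': value} pairs into a fixed-length count vector."""
    vec = [0] * n_features
    for name, value in record.items():
        token = f"{name}={value}".encode("utf-8")
        # a stable hash of "name=value" picks the bucket for this category
        bucket = int(hashlib.md5(token).hexdigest(), 16) % n_features
        vec[bucket] += 1
    return vec

# Hypothetical record: the output length depends only on n_features,
# not on how many distinct batter ids exist in the data.
row = {"batter_id": 123456, "p_throws": "R", "on_1b": 1}
vec = hash_features(row)
print(len(vec), sum(vec))
```

The trade-off is that distinct categories can collide in the same bucket, which is why the number of hashed features is treated as a tunable parameter.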

In [152]:
postfix = ["_pct", "_btr", "_prev5", "_prev10", "_prev15", "_prev20", 
           "00_combo", "01_combo", "02_combo", 
           "10_combo", "11_combo", "12_combo", 
           "20_combo", "21_combo", "22_combo",
           "30_combo", "31_combo", "32_combo"]

fast_features = ["fast" + pf for pf in postfix]
sinker_features = ["sinker" + pf for pf in postfix]
slider_features = ["slider" + pf for pf in postfix]
curve_features = ["curve" + pf for pf in postfix]
changeup_features = ["changeup" + pf for pf in postfix]
knuckle_features = ["knuckle" + pf for pf in postfix]

numeric_features = ['inning', 'top', 'at_bat_num', 'pcount_at_bat',
                    'pcount_pitcher', 'balls', 'strikes', 'fouls', 'outs',
                    'x_prev', 'y_prev', 'x_avg3', 'y_avg3', 'speed_prev', 'speed_avg3']

categorical_features = ["batter_id", "p_throws", "on_1b", "on_2b", "on_3b"]

In [153]:
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_extraction import FeatureHasher
from sklearn.model_selection import train_test_split, GridSearchCV

random_state = 42

class ColumnSelector(BaseEstimator, TransformerMixin):
    """Select a subset of columns from the data set."""
    def __init__(self, col_names):
        self.col_names = col_names   # col_names is a list of column names
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return X[self.col_names]

class MyFeatureHasher(BaseEstimator, TransformerMixin):
    """Vectorize a set of categorical variables."""
    def __init__(self, col_names, n_features=10):
        self.col_names = col_names
        self.n_features = n_features
        
    def fit(self, X, y=None):
        data = X[self.col_names]
        self.myvec = FeatureHasher(n_features=self.n_features)
        self.myvec.fit(X[self.col_names].to_dict(orient='records'))
        return self
    
    def transform(self, X):
        # vectorize input
        return self.myvec.transform(X[self.col_names].to_dict(orient='records')).toarray()
    

pipeline = Pipeline([
    
    # Use feature union to combine features selectors
    ('union', FeatureUnion(
        transformer_list=[
            
            # pipeline for fastball features selector
            ('fast', Pipeline([
                ('selector', ColumnSelector(col_names=fast_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.9, penalty="l1", dual=False)))
            ])),
            
            # pipeline for sinker features selector
            ('sinker', Pipeline([
                ('selector', ColumnSelector(col_names=sinker_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.9, penalty="l1", dual=False)))
            ])), 
            
            # pipeline for slider features selector
            ('slider', Pipeline([
                ('selector', ColumnSelector(col_names=slider_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.75, penalty="l1", dual=False)))
            ])),
            
            # pipeline for curveball features selector
            ('curve', Pipeline([
                ('selector', ColumnSelector(col_names=curve_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.75, penalty="l1", dual=False)))
            ])), 
            
            # pipeline for changeup features selector
            ('changeup', Pipeline([
                ('selector', ColumnSelector(col_names=changeup_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.75, penalty="l1", dual=False)))
            ])),
            
            # pipeline for knuckleball features selector
            ('knuckle', Pipeline([
                ('selector', ColumnSelector(col_names=knuckle_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.75, penalty="l1", dual=False)))
            ])),
            
            # pipeline for numeric features selector
            ('numeric', Pipeline([
                ('selector', ColumnSelector(col_names=numeric_features)),
                ('feature_selection', SelectFromModel(LinearSVC(C=0.75, penalty="l1", dual=False)))
            ])),
                      
            # pipeline for categorical features hasher
            ('hash', Pipeline([
                ('feature_hasher', MyFeatureHasher(col_names=categorical_features, n_features=100)) # 128 + 8 categories
            ]))
        ]
    )),
    
#   # use RandomForest classifier on combined features
    ('rfc', RandomForestClassifier(n_estimators=20, max_depth=20, 
                                   class_weight='balanced', random_state=random_state))
    
])

param_grid = {
    'union__hash__feature_hasher__n_features': [75, 100, 125],
    # explicit lists: range(20, 40, 60) would yield only [20]
    'rfc__n_estimators': [20, 40, 60],
    'rfc__max_depth': [10, 20, 30]
}

%time
grid_search = GridSearchCV(pipeline, param_grid, cv=10)  # the `iid` argument was removed in scikit-learn 0.24
# grid_search.fit(X_train, y_train)
# print(("best random forest from grid search: %.3f" % grid_search.score(X_test, y_test)))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.96 µs
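
To get a feel for the cost of the search, the sketch below counts the pipeline fits implied by a grid with three candidate values per parameter, 10-fold cross-validation, and 778 subsets. The grid values are an assumption for illustration, and the count ignores the final refit that `GridSearchCV` performs on the best parameters.

```python
from itertools import product

# Assumed grid: three candidate values for each of three parameters.
grid = {
    "n_features": [75, 100, 125],
    "n_estimators": [20, 40, 60],
    "max_depth": [10, 20, 30],
}
combos = list(product(*grid.values()))

n_subsets, cv_folds = 778, 10
fits_per_subset = len(combos) * cv_folds
total_fits = fits_per_subset * n_subsets
print(len(combos), fits_per_subset, total_fits)
```

Under these assumptions, that is 27 parameter combinations, 270 fits per subset, and 210,060 fits in total, which is why the full run is left commented out.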


Now that we have the pipeline built, we will train all 778 models and store the results in a dictionary for forecasting purposes.

<a id='res'></a>
<h2>5. Results</h2>

To build 778 different models, one for each pitcher-count combination, we store the fitted models in a dictionary. Of course, we do not want to re-train all 778 models in real time, since we only need one of them for any given prediction. Therefore, we save the models in a dictionary where the keys are pitcher-count combinations and the values are tuples that hold the fitted model as well as the train and test data. Then we pickle the models for later use.

The commented-out block of code at the end of this section does just that, but it takes a while to train and test all 778 models. For demonstration purposes, we'll use the pipeline above to test the model's predictions on one of the 778 subsets.

In [154]:
X = subsets["434378_12"].drop("pitch_type", axis=1)
y = subsets["434378_12"].pitch_type
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [None]:
model = pipeline.fit(X_train, y_train)

In [156]:
model.predict(X_test)

array(['FB', 'FB', 'FB', 'FB', 'FB', 'CH', 'FB', 'FB', 'FB', 'CU', 'FB',
       'FB', 'CU', 'FB', 'FB', 'FB', 'CU', 'FB', 'FB', 'FB', 'CU', 'FB',
       'FB', 'FB', 'CU', 'FB', 'CH', 'CU', 'CU', 'FB', 'CU', 'SL', 'FB',
       'CU', 'FB', 'FB', 'FB', 'FB', 'CU', 'FB', 'FB', 'CH', 'FB'],
      dtype=object)

In [157]:
model.predict_proba(X_test)[:3]

array([[0.3 , 0.25, 0.4 , 0.  , 0.05],
       [0.  , 0.35, 0.45, 0.1 , 0.1 ],
       [0.05, 0.  , 0.8 , 0.  , 0.15]])
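
The columns of `predict_proba` follow the order of the fitted classifier's `classes_` attribute. The sketch below pairs the three probability rows above with labels; the class order used here is hypothetical (the notebook does not print `model.classes_`), so the label attached to each column is an assumption.

```python
import numpy as np

# Hypothetical class order; in practice this would come from model.classes_.
classes = np.array(["CH", "CU", "FB", "SI", "SL"])

# The three probability rows shown in the output above.
proba = np.array([
    [0.30, 0.25, 0.40, 0.00, 0.05],
    [0.00, 0.35, 0.45, 0.10, 0.10],
    [0.05, 0.00, 0.80, 0.00, 0.15],
])

# The predicted label is the class with the highest probability in each row.
top = classes[proba.argmax(axis=1)]
for row, label in zip(proba, top):
    print(label, dict(zip(classes.tolist(), row.tolist())))
```

Under this assumed ordering, the most probable class in each of the three rows is `FB`, which agrees with the first three entries returned by `model.predict(X_test)`.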

In [None]:
# %%time
# from sklearn.base import clone
# models = {}
# for k, v in subsets.items():
#     X = v.drop("pitch_type", axis=1)
#     y = v.pitch_type
#     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
#     # clone so each subset gets its own fitted search instead of sharing one object
#     models[k] = (clone(grid_search).fit(X_train, y_train), X_train, X_test, y_train, y_test)

In [None]:
# # pickle the results
# filename = "final_models.sav"
# pickle.dump(models, open(filename, 'wb'))

In [None]:
# # unpickle the saved data
# filename = "final_models.sav"
# models = pickle.load(open(filename, 'rb'))

<a id='futures'></a>
<h2>6. Future Considerations</h2>

For future work, I would like to consider additional factors such as the time of day the pitch was thrown and whether morning or evening games affect the type of pitch a pitcher throws. However, without talking to an expert in the field, I cannot ascertain whether it is an important factor. Additionally, I would like to consider the `type_confidence` feature from the data set as a predictor. However, I am not yet sure how this number is measured, as some of the confidence values exceed 1. I would also like to consider the current score, as I think it affects pitch strategy. However, from the given metadata CSV file, it is not clear whether that information is contained in the data set.

With respect to the code, the commented-out block in the Results section builds 778 models and saves them in a dictionary to be accessed later via pickling. This process takes a long time and could be sped up with **multiprocessing**.
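
A minimal sketch of how the per-subset training loop could be distributed over worker processes with the standard library is shown below. The function `fit_one` is a placeholder that only counts rows; in practice it would fit a pipeline on one subset. The names `fit_one` and `fit_all` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def fit_one(item):
    """Placeholder for fitting one pitcher-count model; returns (key, result)."""
    key, rows = item
    return key, len(rows)  # stand-in for a fitted model

def fit_all(subsets, executor_cls=ProcessPoolExecutor, max_workers=2):
    """Apply fit_one to every (key, subset) pair using a pool of workers."""
    with executor_cls(max_workers=max_workers) as pool:
        return dict(pool.map(fit_one, subsets.items()))
```

Calling `fit_all(subsets)` uses separate processes by default, while passing `executor_cls=ThreadPoolExecutor` runs the same loop with threads. Note that on platforms that start workers by spawning a new interpreter, a worker function defined inside a notebook may need to be moved into an importable module before a process pool can use it.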

<a id='appendix'></a>
<h2>7. Appendix</h2>

* Original features
    - Inning
    - Top
    - At-bat-number
    - Pitches thrown at bat
    - Pitches thrown by pitcher
    - Balls and strikes count
    - Current number of fouls and outs
    - Batter and pitcher identifications
    - Pitcher's throwing hand
    - (x, y) horizontal and vertical position of a pitch
    - Start-speed or velocity of the pitch
    - Pitch type
    - On-first base
    - On-second base
    - On-third base
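* Engineered features (132 added)
    - Historical pitch percentage for each type of pitch (fast, sinker, slider, etc.) for each pitcher
    - Historical pitch percentage for each type thrown by the pitcher to the current batter
    - Rolling pitch percentage for each type over the previous 5, 10, 15, and 20 pitches
    - Pitch percentage for each type thrown by the pitcher in a given ball-strike count scenario
    - Coordinates and speed of the previous pitch, plus three-pitch moving averages of each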