# Pitch prediction

If a batter knows what pitch is coming, he is more likely to hit it or lay off it. A pitcher whose next pitch can be predicted therefore hands the batter a large advantage. In this notebook, I look into the possibility of pitch prediction.

I will start by importing some standard libraries.

In [1]:
# imports
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sqlite3
import subprocess
%matplotlib inline

## Approach

I will look to predict pitches by using as few features as possible. Thus, I will begin by selecting raw features in the database based on human intuition. I intend to only turn to feature engineering if results are poor.

I intend to predict pitches using random forest. This approach has the advantage that it can display both low bias and low variance when the results from a large number of decision trees are averaged. Additionally, random forest has a track record of good performance, can handle both regression and classification problems, and can rank features by their influence on the response. Some drawbacks are that this approach can take up a lot of memory if the dataset and the number of trees are large, and that the forests themselves are a bit of a black box in terms of interpretability. Separately from the choice of model, each pitcher is treated individually, since each pitcher will have different pitch tendencies.
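As a rough illustration of the feature-ranking property mentioned above, here is a minimal sketch. It assumes scikit-learn's `RandomForestClassifier` (the text does not name a library) and uses purely synthetic data. The feature names `balls`, `strikes`, and `previous_pitch_bin` are hypothetical placeholders, not columns from the database.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: three hypothetical features per pitch.
feature_names = ["balls", "strikes", "previous_pitch_bin"]
n = 500
X = np.column_stack([
    rng.integers(0, 4, n),  # balls: 0-3
    rng.integers(0, 3, n),  # strikes: 0-2
    rng.integers(0, 4, n),  # bin of the previous pitch: 0-3
])

# Synthetic response that depends only on the strike count,
# with roughly 10% of the labels flipped to mimic noise.
y = (X[:, 1] == 2).astype(int)
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# Fit on the first 400 pitches, evaluate on the remaining 100.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X[:400], y[:400])

accuracy = forest.score(X[400:], y[400:])
importances = dict(zip(feature_names, forest.feature_importances_))
print(accuracy)
print(importances)
```

Because the synthetic response is driven by `strikes` alone, that feature should receive most of the importance. On real data the ranking would be only a rough guide, since impurity-based importances can be distorted by noisy or correlated features.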

For the particular problem of pitch prediction, there are two possible responses that could be used: a categorical response that describes the pitch type (e.g., four-seam fastball, curveball, slider, etc.), or a numerical response that describes the trajectory of the ball (e.g., velocity and vertical/horizontal acceleration).

Ideally, there would be reliable pitch-type labels so that noise in the response being predicted would be limited. For example, noise from slightly different trajectories of the same pitch would be limited by classifying them as the same pitch. Additionally, this approach would have a simple baseline metric to test my model against: does the model perform better than guessing the most commonly thrown pitch type on every pitch? However, from previous work with pitch classification, I concluded that all pitch-type labels should be treated with caution. For instance, the difference between a two-seam and a four-seam fastball can be subjective, and the number of pitches in a pitcher's repertoire is also subjective. One option is to group all pitches into two labels: fastball and off-speed. While this approach would limit the number of misclassified pitches, it also fails to fully exploit the wealth of pitch trajectory information in the database.
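The pitch-type baseline described above can be sketched in a few lines. This is only an illustration: the helper name `majority_baseline_accuracy` is hypothetical and the label lists are made up, not taken from the database.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always guessing the most common training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in test_labels if label == most_common_label)
    return most_common_label, hits / len(test_labels)


# Made-up pitch-type labels, for illustration only.
train = ["FF", "FF", "CU", "FF", "CH", "CU", "FF", "SL"]
test = ["FF", "CU", "FF", "CH", "FF"]

guess, accuracy = majority_baseline_accuracy(train, test)
print(guess, accuracy)
```

The most common label is chosen from the training pitches and scored on separate test pitches, so the baseline is evaluated the same way a model would be.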

Alternatively, I could fully utilize the information in the database and try to predict the trajectory of the next pitch. From my pitch classification work, it appeared that velocity and acceleration were the most important features for separating pitch type, making them candidates for prediction. This approach, however, would be sensitive to noise in the data. Again, a given pitch type is not executed perfectly every time, resulting in a range of trajectories. I am not interested so much in fitting the exact pitch trajectory of the next pitch as I am in estimating an approximate pitch trajectory that allows the hitter to distinguish between pitch types. This approach has the additional challenge that it is difficult to specify a baseline metric to test my model against. Would I just guess the average trajectory of all pitch types every pitch?

Therefore, I propose a hybrid of the two potential responses: predict which bins of velocity and acceleration the next pitch falls into. Instead of being labeled by pitch type, each pitch will carry three labels, one each for velocity, vertical acceleration, and horizontal acceleration, and each label will be split into bins (or categories). For instance, velocity could be split into bins of below 70, 70-80, 80-90, and 90+ mph (these values could differ depending on the pitcher). By predicting which bin the next pitch falls into for each of the three labels (independently), I am able to use the pitch trajectory information directly while limiting the effect of noise in the data. I would no longer be fitting subjective pitch-type labels. For this approach, the baseline metric to test my model against could be guessing that every pitch falls into the bin with the highest pitch density.
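A minimal sketch of the binning idea for the velocity label, using the example bin edges from the paragraph above. The velocities are made-up numbers and the variable names are hypothetical. The same steps would apply to the vertical and horizontal acceleration labels with their own edges.

```python
import numpy as np

# Example bin edges in mph; these could differ depending on the pitcher.
velocity_edges = np.array([70.0, 80.0, 90.0])
bin_names = ["<70", "70-80", "80-90", "90+"]

# Made-up pitch velocities (mph), for illustration only.
velocities = np.array([88.5, 72.1, 86.0, 74.3, 85.2, 68.9, 87.7, 73.0])

# np.digitize returns 0 for values below the first edge and
# len(velocity_edges) for values at or above the last edge.
bin_index = np.digitize(velocities, velocity_edges)

# Baseline: always guess the bin that holds the most pitches.
counts = np.bincount(bin_index, minlength=len(bin_names))
densest_bin = int(np.argmax(counts))
baseline_accuracy = counts[densest_bin] / counts.sum()

print(bin_names[densest_bin], baseline_accuracy)
```

Here the baseline is scored on the same pitches used to find the densest bin, which keeps the sketch short. For a fair comparison with a model, the densest bin would be chosen from the training pitches and scored on held-out pitches.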

## Loading data

Pitch prediction will depend not only on the previous pitch thrown but also on variables such as game situation. Therefore, I will load each table in the database as a pandas dataframe so that I maintain maximum flexibility. I will focus on Barry Zito again for this study.