# Pitch prediction

If a batter knows what pitch is coming, he is more likely to be able to hit it or lay off it. Therefore, if a pitcher's next pitch can be predicted, the batter gains a massive advantage. In this notebook, I will look into the possibility of pitch prediction.

I will start with importing some standard libraries.

In [1]:
# imports
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sqlite3
import subprocess
import sys
%matplotlib inline

## Approach

I will look to predict pitches by using as few features as possible. Thus, I will begin by selecting raw features in the database based on human intuition. I intend to only turn to feature engineering if results are poor.

I intend to predict pitches using a random forest. This approach can display both low bias and low variance when the results from a large number of decision trees are averaged. Additionally, random forests have a track record of good performance, can handle both regression and classification problems, and can rank features by their influence on the response. Some drawbacks are that this approach can take up a lot of memory if the dataset and the number of trees are large, and that the forests themselves are a bit of a black box in terms of interpretability. Finally, I will treat each pitcher individually, since each pitcher has different pitch tendencies.
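
As a rough illustration of the properties described above, the following sketch fits a random forest classifier on synthetic data and prints the feature ranking. The data, the feature names, and the rule that generates the response are all made up for illustration and do not come from the database; `RandomForestClassifier` from *sklearn* is assumed to be the implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# synthetic stand-in data (NOT from the pitchfx database)
rng = np.random.default_rng(0)
n = 600
balls = rng.integers(0, 4, size=n)      # 0-3 balls
strikes = rng.integers(0, 3, size=n)    # 0-2 strikes
noise = rng.normal(size=n)              # feature unrelated to the response
X = np.column_stack([balls, strikes, noise])

# toy response: "off-speed" (1) on two-strike counts, "fastball" (0) otherwise
y = (strikes == 2).astype(int)

# hold out the last third of the rows for testing
split = 400
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]

# fit a forest of 50 trees; the trees' predictions are combined
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
importances = forest.feature_importances_

print("test accuracy:", accuracy)
for name, value in zip(["balls", "strikes", "noise"], importances):
    print(name, round(value, 3))
```

Because the toy response depends only on the number of strikes, that feature should receive the largest importance. Real pitch data will be far noisier than this.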

For the particular problem of pitch prediction, there are two possible responses that could be used: a categorical response that describes the pitch type (e.g., four-seam fastball, curveball, slider, etc.), or a numerical response that describes the trajectory of the ball (e.g., velocity and vertical/horizontal acceleration).

Ideally, there would be reliable pitch-type labels so that noise in the response being predicted would be limited. For example, noise from slightly different trajectories of the same pitch would be limited by classifying them as the same pitch. Additionally, this approach would have a simple baseline metric against which to test my model: does the model perform better than guessing the most commonly thrown pitch type on every pitch? However, from previous work with pitch classification, I concluded that all pitch-type labels should be treated with caution. For instance, the difference between a two-seam and a four-seam fastball can be subjective, and the number of pitches in a pitcher's repertoire is also subjective. One option is to group all pitches into two labels: fastball and off-speed. While this approach would limit the number of misclassified pitches, it also fails to fully exploit the wealth of pitch trajectory information in the database.

Alternatively, I could fully utilize the information in the database and try to predict the trajectory of the next pitch. From my pitch classification work, it appeared that velocity and acceleration were the most important features for separating pitch type, making them candidates for prediction. This approach, however, would be sensitive to noise in the data. Again, a given pitch type is not executed perfectly every time, resulting in a range of trajectories. I am not interested so much in fitting the exact pitch trajectory of the next pitch as I am in estimating an approximate pitch trajectory that allows the hitter to distinguish between pitch types. This approach has the additional challenge that it is difficult to specify a baseline metric to test my model against. Would I just guess the average trajectory of all pitch types every pitch?

Therefore, I propose a hybrid of the two potential responses: predict which bins of velocity and acceleration the next pitch falls into. Instead of being labeled by pitch type, each pitch will have three labels, one each for velocity, vertical acceleration, and horizontal acceleration, and each label will be split into bins (or categories). For instance, velocity could be split into bins of below 70, 70-80, 80-90, and 90+ mph (the values could differ depending on the pitcher). By predicting which bin the next pitch falls into for each of the three labels (independently), I am able to utilize the pitch trajectory information directly while limiting the effect of noise in the data. I would no longer be fitting subjective pitch-type labels. For this approach, the baseline metric to test my model against could be guessing that every pitch falls into the bin containing the most pitches.
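
The binning and the baseline can be sketched with `pd.cut`. The velocities below are made-up values rather than Zito's pitches, and the bin edges simply follow the example in the text; the actual edges would likely be chosen per pitcher.

```python
import numpy as np
import pandas as pd

# hypothetical pitch velocities in mph (made-up values, not from the database)
velocity = pd.Series([68.5, 72.1, 74.8, 85.3, 86.0, 84.7, 88.9, 73.3, 91.2, 85.5])

# bin edges follow the example in the text: below 70, 70-80, 80-90, 90+
edges = [-np.inf, 70, 80, 90, np.inf]
labels = ["<70", "70-80", "80-90", "90+"]

# right=False makes each bin closed on the left, e.g. [70, 80)
velocity_bin = pd.cut(velocity, bins=edges, labels=labels, right=False)

# baseline: always guess the bin containing the most pitches
counts = velocity_bin.value_counts()
baseline_bin = counts.idxmax()
baseline_accuracy = counts.max() / len(velocity_bin)

print(counts.sort_index())
print("baseline guess:", baseline_bin)
print("baseline accuracy:", baseline_accuracy)
```

The same steps would be repeated independently for vertical and horizontal acceleration, giving one baseline accuracy per label.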

## Loading data

Pitch prediction will not only depend on the previous pitch thrown, but also variables such as game situation. Therefore, I will load the *pitchfx*, *events*, and *games* tables in the database as pandas dataframes so that I maintain maximum flexibility. I will focus on Barry Zito again for this study.

In [2]:
# specify database name
dbname = "../dat/pitchfx2008.db"

# connect to the sqlite3 database
db = sqlite3.connect(dbname)
hdb = db.cursor()

# import classes
sys.path.append('../src')
from Player import Player

# create player object
bz = Player("Barry Zito")

# grab pitches
bz_pitches = bz.pitches(db, clean=1)
# grab events
bz_events = bz.pitch_events(db)
# grab games
bz_games = bz.pitch_games(db)

## Data preprocessing

Feature selection can sometimes be more art than science. Here, I will prepare a subset of features that intuition suggests will help predict the type of pitch thrown. They are, loosely, the current ball-strike count, the handedness of the batter, the baserunner situation, and the previous pitch thrown. I will prepare these features one by one.

### Ball-strike count

Pitchers will likely pitch differently depending on the ball-strike count. For instance, if the count is 0-2 (0 balls, 2 strikes), a pitcher can throw almost any pitch, including one outside the strike zone, because he can throw three balls before a walk is even possible on the following pitch. On the other hand, if the count is 3-0 (3 balls, 0 strikes), a pitcher will likely throw the pitch he trusts most in terms of command (often a fastball) because he usually wants to avoid walking the batter.

In terms of how to represent the count as a feature, there are a couple of options: treat the number of balls and the number of strikes independently, or treat each count as its own category. For now, I will treat the number of balls and strikes as independent features because it requires less feature manipulation. Furthermore, I will keep their original values (0-3 for balls, 0-2 for strikes) rather than encode them, as the values of these numbers have meaning.

In [90]:
# create first feature
features = bz_pitches[["pre_balls", "pre_strike"]]

# print length of vector
print("number of data points: ", features.shape[0])

number of data points:  3079
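
For comparison, the two ways of representing the count discussed above can be sketched on a toy dataframe. The rows are made up; only the column names mirror the ones used in this notebook.

```python
import pandas as pd

# made-up counts for five pitches; column names mirror those used above
toy = pd.DataFrame({"pre_balls": [0, 1, 3, 0, 3],
                    "pre_strike": [0, 2, 0, 2, 2]})

# option 1: keep balls and strikes as two ordered numeric features
numeric_features = toy[["pre_balls", "pre_strike"]]

# option 2: treat each ball-strike combination as its own category
count_label = toy["pre_balls"].astype(str) + "-" + toy["pre_strike"].astype(str)
count_features = pd.get_dummies(count_label, prefix="count")

print(numeric_features.shape)
print(count_features.columns.tolist())
```

With real data, the second option could produce up to 12 columns (4 ball values times 3 strike values), which is part of why the two-feature representation is the simpler starting point.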


### Handedness of batter

The next pitch is also likely influenced by the handedness of the batter. For example, certain pitches (such as a breaking ball) are more effective when the pitcher and batter have the same handedness, and are thus more likely to be thrown in that matchup.

To get the handedness of the batter for each pitch, I'll need to join the *pitchfx* table with both the *events* table (to tie the pitch to a batter id) and the *players* table (to tie the batter id to the batter handedness). I will do this all in pandas, which requires me to import the *players* table as a pandas dataframe.

In [7]:
# save players table as pandas data frame
query = """SELECT *
        FROM players
        """
players = pd.read_sql_query(query, db)

I'll perform a sequence of left joins. I'll join the *events* table first to the pitches in order to get the batter information. I will then join that resulting table to the *players* table in order to get the batter handedness. Note that I will only add the columns I need (*player_id* and *bats*) and will need to drop the duplicate player information before joining the tables since the same player will have multiple entries if he plays multiple positions.

In [91]:
# merge tables to get the batter handedness for each pitch
handedness = bz_pitches.merge(bz_events[['game_id', 'event_id', 'batter_id']],
                              left_on=['game_id', 'cur_event'],
                              right_on=['game_id', 'event_id'],
                              how='left') \
                       .merge(players[['player_id', 'bats']].drop_duplicates(),
                              left_on='batter_id',
                              right_on='player_id',
                              how='left') \
                       .sort_values(by=['game_id', 'cur_event'])['bats']
        
# print length of vector
print("number of data points: ", handedness.shape[0])

number of data points:  3079
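
To see why the duplicate player rows need to be dropped before the second join, here is a small sketch on made-up tables. The ids, the `position` column, and all values are hypothetical; only the join keys follow the names used above.

```python
import pandas as pd

# made-up tables; only the join-key column names follow the ones used above
pitches = pd.DataFrame({"game_id": [1, 1, 1],
                        "cur_event": [10, 10, 11]})
events = pd.DataFrame({"game_id": [1, 1],
                       "event_id": [10, 11],
                       "batter_id": [500, 501]})
# batter 500 is listed twice because he plays two positions
players = pd.DataFrame({"player_id": [500, 500, 501],
                        "position": ["1B", "LF", "C"],
                        "bats": ["L", "L", "R"]})

with_batter = pitches.merge(events, left_on=["game_id", "cur_event"],
                            right_on=["game_id", "event_id"], how="left")

# joining on the raw players table repeats the pitches thrown to batter 500
inflated = with_batter.merge(players[["player_id", "bats"]],
                             left_on="batter_id", right_on="player_id", how="left")

# dropping duplicate (player_id, bats) rows keeps one row per pitch
clean = with_batter.merge(players[["player_id", "bats"]].drop_duplicates(),
                          left_on="batter_id", right_on="player_id", how="left")

print(len(pitches), len(inflated), len(clean))
```

Checking that the joined result has the same number of rows as the pitches, as done above, is a cheap safeguard. Passing `validate="many_to_one"` to `merge` is another option that should raise an error if the right-hand keys are not unique.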


There are three different batter handednesses: left, right, and switch. Typically, switch hitters bat from the side opposite the pitcher's throwing hand. Therefore, because Zito throws with his left hand, I will treat all switch hitters as right-handed batters here.

Additionally, I will transform these values into categorical variables using one-hot encoding. *sklearn* requires all inputs to be numeric, and I do not want batter handedness to be interpreted as being ordered.

In [95]:
# print unique entries of batter handedness
print("unique batter handednesses: ", players['bats'].unique())

# update switch hitters in batter handedness based on pitcher's throwing hand


unique batter handednesses:  ['L' 'R' 'S']
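
A minimal sketch of the two steps described above, resolving switch hitters and one-hot encoding the result, might look like the following. It runs on a made-up series rather than the `handedness` vector, and the variable names are hypothetical, so it is not necessarily how the notebook's own cell does it.

```python
import pandas as pd

# made-up handedness values standing in for the real batter handedness
bats = pd.Series(["L", "R", "S", "R", "S"], name="bats")

# assumption: the pitcher throws left-handed, so switch hitters bat right
pitcher_throws = "L"
switch_side = "R" if pitcher_throws == "L" else "L"
bats_resolved = bats.replace("S", switch_side)

# one-hot encode so that no ordering between "L" and "R" is implied
bats_encoded = pd.get_dummies(bats_resolved, prefix="bats")

print(bats_resolved.tolist())
print(bats_encoded.columns.tolist())
```

With only two categories left after resolving switch hitters, one of the two resulting columns is redundant, though keeping both is harmless for a tree-based model.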
