# Pitch prediction

If a batter knows which pitch is coming, he is more likely to hit it or lay off it. A pitcher whose next pitch can be predicted therefore hands the batter a large advantage. In this notebook, I will look into whether pitch prediction is feasible.

I will start by importing some standard libraries.

In [1]:
# imports
from IPython.display import display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sqlite3
import subprocess
%matplotlib inline

## Approach

I will try to predict pitches using as few features as possible. Thus, I will begin by selecting raw features from the database based on human intuition, and I intend to turn to feature engineering only if results are poor.

I intend to predict pitches using a random forest. Averaging the results of a large number of decision trees lets this approach achieve both low bias and low variance. Random forests also have a track record of good performance, can handle both regression and classification problems, and can rank features by their influence on the response. Some drawbacks are that the approach can take up a lot of memory when the dataset and the number of trees are large, and the forests themselves are a bit of a black box in terms of interpretability. I will treat each pitcher individually, since each pitcher has different pitch tendencies.
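As a minimal sketch of the idea, assuming scikit-learn is available, the snippet below fits a small random forest to synthetic data and reads off the feature ranking. The two features (`prev_bin`, `noise`) and the response are made up for illustration; they are not drawn from the pitchfx database, and this is not the model fit later in the notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 600

# hypothetical features: an informative category and pure noise
prev_bin = rng.integers(0, 3, size=n)
noise = rng.normal(size=n)
X = np.column_stack([prev_bin, noise])

# synthetic response: replaced by a uniformly random class about 20% of the time
flip = rng.random(n) < 0.2
y = np.where(flip, rng.integers(0, 3, size=n), prev_bin)

# simple train/test split
X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# average many shallow trees
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)     # fraction of correct predictions
importances = forest.feature_importances_   # one value per feature, sums to 1
print(accuracy, importances)
```

In this toy setup the informative feature should dominate the ranking, which is the kind of diagnostic I would want when choosing raw features.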

For the particular problem of pitch prediction, there are two possible responses that could be used: a categorical response that describes the pitch type (e.g., four-seam fastball, curveball, slider, etc.), or a numerical response that describes the trajectory of the ball (e.g., velocity and vertical/horizontal acceleration).

Ideally, there would be reliable pitch-type labels, which would limit noise in the response being predicted. For example, slightly different trajectories of the same pitch would all be classified as the same pitch. This approach would also have a simple baseline to test my model against: does the model perform better than guessing the most commonly thrown pitch type on every pitch? However, from previous work with pitch classification, I concluded that all pitch-type labels should be treated with caution. For instance, the difference between a two-seam and a four-seam fastball can be subjective, and so can the number of pitches in a pitcher's repertoire. One option is to group all pitches into two labels: fastball and off-speed. While this approach would limit the number of misclassified pitches, it also fails to fully exploit the wealth of pitch trajectory information in the database.
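The most-common-pitch baseline can be sketched in a few lines of pandas. The labels below are made up for illustration, and `baseline_accuracy` is a hypothetical helper name, not something defined elsewhere in this notebook.

```python
import pandas as pd

# made-up pitch-type labels for one pitcher
pitch_types = pd.Series(["FF", "FF", "CU", "FF", "CH", "CU", "FF", "SL", "FF", "CU"])

def baseline_accuracy(labels):
    """Accuracy obtained by always guessing the most common label."""
    return labels.value_counts(normalize=True).max()

most_common = pitch_types.mode().iloc[0]
baseline = baseline_accuracy(pitch_types)
print(most_common, baseline)
```

Any classifier would need to beat `baseline` on held-out pitches to be worth using.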

Alternatively, I could fully utilize the information in the database and try to predict the trajectory of the next pitch. From my pitch classification work, it appeared that velocity and acceleration were the most important features for separating pitch type, making them candidates for prediction. This approach, however, would be sensitive to noise in the data. Again, a given pitch type is not executed perfectly every time, resulting in a range of trajectories. I am not interested so much in fitting the exact pitch trajectory of the next pitch as I am in estimating an approximate pitch trajectory that allows the hitter to distinguish between pitch types. This approach has the additional challenge that it is difficult to specify a baseline metric to test my model against. Would I just guess the average trajectory of all pitch types every pitch?

Therefore, I propose a hybrid of the two potential responses: predict which bins of velocity and acceleration the next pitch falls into. Instead of being labeled by pitch type, each pitch will have three labels: one each for velocity, vertical acceleration, and horizontal acceleration. Each quantity will be split into bins (or categories). For instance, velocity could be split into bins of below 70, 70-80, 80-90, and 90+ mph (the edges could differ from pitcher to pitcher). By predicting which bin the next pitch falls into for each of the three quantities (independently), I can use the pitch trajectory information directly while limiting the effect of noise in the data. I would no longer be fitting subjective pitch-type labels. For this approach, the baseline to test my model against could be guessing that every pitch falls into the most populated bin.
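A minimal sketch of the binning step, assuming the example edges of 70, 80, and 90 mph and using made-up velocities, might look like the following. The same pattern would apply to the two acceleration components.

```python
import numpy as np
import pandas as pd

# made-up pitch velocities in mph
velocity = pd.Series([68.5, 72.1, 74.8, 83.0, 84.2, 85.9, 86.3, 88.7, 91.2, 79.9])

# example bin edges; in practice these could be chosen per pitcher
edges = [-np.inf, 70, 80, 90, np.inf]
labels = ["<70", "70-80", "80-90", "90+"]

# right=False makes each bin include its lower edge: [70, 80), [80, 90), ...
velocity_bin = pd.cut(velocity, bins=edges, labels=labels, right=False)

# baseline: always guess the most populated bin
counts = velocity_bin.value_counts()
baseline_bin = counts.idxmax()
baseline_accuracy = counts.max() / len(velocity_bin)
print(baseline_bin, baseline_accuracy)
```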

## Loading data

Pitch prediction will depend not only on the previous pitch thrown but also on variables such as the game situation. Therefore, I will load each table in the database as a pandas dataframe so that I maintain maximum flexibility. I will focus on Barry Zito again for this study.
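One generic way to load every table of a SQLite database into pandas is sketched below. The helper `load_tables` is hypothetical, and the in-memory database with two small tables is only a stand-in for the real pitchfx file.

```python
import sqlite3
import pandas as pd

def load_tables(connection):
    """Return {table_name: DataFrame} for every table in a SQLite database."""
    names = pd.read_sql_query(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name",
        connection,
    )["name"]
    return {
        name: pd.read_sql_query(f'SELECT * FROM "{name}"', connection)
        for name in names
    }

# tiny in-memory stand-in with made-up contents
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE games (game_id INTEGER, date INTEGER)")
demo.execute("CREATE TABLE players (player_id INTEGER, player_first TEXT, player_last TEXT)")
demo.execute("INSERT INTO games VALUES (233769, 20080331)")
demo.commit()

tables = load_tables(demo)
print(sorted(tables))
```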

In [5]:
# imports
import pandas as pd
import sqlite3

class Player():
    """Player class for extracting information from pitchfx database"""
    def __init__(self, name):
        """Initialize player object
        
        Inputs:
            name: name of player in "first last" format
        """
        # parse name
        self.first, self.last = name.split(" ")
        
    def pitch_games(self, database):
        """Grab all games from database player pitched in
        
        Inputs:
            database: sqlite object of database to read from
        
        Outputs:
            games: pandas dataframe containing games player played in
        """
        # grab all games
        query = """SELECT DISTINCT games.*
                FROM games
                JOIN events ON (games.game_id=events.game_id)
                WHERE events.pitcher_id=(SELECT player_id
                FROM players
                WHERE players.player_first='%s'
                    AND players.player_last='%s')
                ORDER BY games.game_id""" %(self.first, self.last)
        self.player_pgames = pd.read_sql_query(query, database)
        
        # clean up
        return self.player_pgames

    def pitches(self, database, **params):
        """Grab all pitches from database thrown by player
        
        Inputs:
            database: sqlite object of database to read from
            clean [False]: if True, drop rows with NaNs, pitch-outs, and intentional balls
        
        Outputs:
            pitches: pandas dataframe containing pitchfx data
        """
        # grab all pitches
        query = """SELECT DISTINCT pitchfx.* 
                FROM pitchfx
                JOIN events ON (pitchfx.game_id=events.game_id
                    AND pitchfx.prev_event=events.event_id)
                WHERE events.pitcher_id=(SELECT player_id
                FROM players
                WHERE players.player_first='%s'
                    AND players.player_last='%s')
                ORDER BY game_id, pitch_num""" %(self.first, self.last)
        self.player_pfx = pd.read_sql_query(query, database)
        
        # optionally clean: drop rows with NaNs, intentional balls (IN), and pitch-outs (PO)
        if params.get("clean", False):
            self.player_pfx = self.player_pfx.dropna(axis=0, how="any")
            self.player_pfx = self.player_pfx[~self.player_pfx.pitch_type.isin(["IN", "PO"])]
        
        # clean up
        return self.player_pfx
    
    def pitch_events(self, database):
        """Grab events where player is the pitcher
        
        Inputs:
            database: sqlite object of database to read from
            
        Output:
            events: events where player is the pitcher
        """
        # grab all events
        query = """SELECT DISTINCT * 
                FROM events
                WHERE pitcher_id=(SELECT player_id
                FROM players
                WHERE players.player_first='%s'
                    AND players.player_last='%s')
                ORDER BY game_id, event_id""" %(self.first, self.last)
        self.player_pevents = pd.read_sql_query(query, database)
        
        # clean up
        return self.player_pevents
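
Names containing an apostrophe would break a query built by string formatting. As a hedged alternative, the sketch below passes the name through SQLite `?` placeholders via the `params` argument of `pd.read_sql_query`. The helper `find_player`, the in-memory table, and its rows (including the ids) are made up for illustration.

```python
import sqlite3
import pandas as pd

# in-memory stand-in for the players table (made-up rows and ids)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE players (player_id INTEGER, player_first TEXT, player_last TEXT)")
demo.executemany(
    "INSERT INTO players VALUES (?, ?, ?)",
    [(1, "Barry", "Zito"), (2, "Darren", "O'Day")],
)
demo.commit()

def find_player(connection, name):
    """Look up a player by "first last" name using bound parameters."""
    first, last = name.split(" ", 1)
    query = """SELECT player_id, player_first, player_last
            FROM players
            WHERE player_first = ? AND player_last = ?"""
    # the driver substitutes the values, so quotes in names are handled safely
    return pd.read_sql_query(query, connection, params=(first, last))

zito = find_player(demo, "Barry Zito")
oday = find_player(demo, "Darren O'Day")
print(zito, oday, sep="\n")
```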

In [6]:
# specify database name
dbname = "../dat/pitchfx2008.db"

# connect to the sqlite3 database
db = sqlite3.connect(dbname)
hdb = db.cursor()

# create player object
bz = Player("Barry Zito")

# grab pitches
bz_pitches = bz.pitches(db)
# grab events
bz_events = bz.pitch_events(db)
# grab games
bz_games = bz.pitch_games(db)

In [7]:
bz_games

Unnamed: 0,game_id,game_type,date,game_time,home_id,visit_id,home_wins,home_losses,visit_wins,visit_losses,stadium_id,umpire_home,umpire_first,umpire_second,umpire_third
0,233769,R,20080331,1310,119,1,0,137,0,1,22,427346,427299,427093,427554
1,233847,R,20080406,1305,158,5,1,137,1,5,32,427341,482641,427534,427545
2,233914,R,20080411,1915,137,4,7,138,8,3,2395,427457,427292,427248,429805
3,233970,R,20080416,1245,137,6,10,109,11,4,2395,427533,431232,427417,427269
4,234063,R,20080422,1840,109,15,5,137,8,13,15,427164,427419,427099,427315
5,234134,R,20080427,1305,137,11,15,113,11,15,2395,427414,427224,427044,427538
6,234269,R,20080507,1905,134,14,19,137,14,20,31,427128,482631,427292,427457
7,234335,R,20080512,1915,137,16,23,117,21,18,2395,482608,427235,427103,427129
8,234400,R,20080517,1805,137,17,27,145,22,21,2395,427058,482608,427509,427090
9,234491,R,20080523,1910,146,27,20,137,20,29,20,427545,427341,427534,427197


In [3]:
# import classes
import sys
sys.path.append('../src')
from Player import Player

# specify database name
dbname = "../dat/pitchfx2008.db"

# connect to the sqlite3 database
db = sqlite3.connect(dbname)
hdb = db.cursor()

# create player object
bz = Player("Barry Zito")

# grab pitches
bz_pitches = bz.pitches(db)
# grab events
bz_events = bz.pitch_events(db)
# grab 