# Classification of pitches

The goal here is to classify the types of pitches thrown by a given pitcher. Though the PitchFX data comes with pitch classifications, they are known to be not very reliable ("It is accurate enough for most work that involves differentiating between fastballs and off-speed pitches"--Fast, "What the heck is PitchFX?"). Additionally, clustering is a relatively simple task to perform with PitchFX data and has a good chance of separating the pitch types.

To start, we will import the relevant libraries.

In [1]:
# imports
from IPython.display import display
import numpy as np
import pandas as pd
import sqlite3
import subprocess

Now, let's specify the database name and path, then connect to the database.

In [3]:
# specify database name
dbname = "../dat/pitchfx2008.db"

# connect to the sqlite3 database
db = sqlite3.connect(dbname)
hdb = db.cursor()
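
One thing to keep in mind: `sqlite3.connect` creates a new, empty database when the path does not exist, so a typo in `dbname` does not raise an error. A minimal sketch of a sanity check follows. It uses an in-memory database with a stand-in `pitchfx` table so that it runs anywhere, and `list_tables` is a hypothetical helper, not something defined in this notebook.

```python
import sqlite3

def list_tables(conn):
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]

# stand-in for sqlite3.connect(dbname); an in-memory database starts empty
demo = sqlite3.connect(":memory:")
empty_tables = list_tables(demo)  # no tables yet

# create a toy table so there is something to list
demo.execute("CREATE TABLE pitchfx (game_id INTEGER, pitch_num INTEGER)")
tables = list_tables(demo)
```

Calling `list_tables(db)` on the real connection and getting an empty list would suggest that the path in `dbname` is wrong.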

For our classification, we will choose one pitcher. Let's choose 'Barry Zito', who is known for the break of his curveball and a below-average fastball. He clearly got paid for his curveball and not his fastball...

Let's first figure out the games Barry Zito has pitched in (printing out the first 5 games).

In [17]:
# find all games that Barry Zito has thrown in
query = """SELECT DISTINCT game_id
   FROM events
   WHERE pitcher_id=(SELECT player_id
       FROM players
       WHERE player_first='Barry'
            AND player_last='Zito')"""
df = pd.read_sql_query(query, db)
df.head()

Unnamed: 0,game_id
0,233769
1,233847
2,233914
3,233970
4,234063


Let's choose a random game to analyze and grab that game id. We'll use NumPy's *random* module to pick a random game.

In [54]:
# grab relevant parameters
ngames = len(df) # number of games

# select random game
np.random.seed(0) # set seed
igame = np.random.randint(low=0, high=ngames) # choose random index
id_game = df.iloc[igame]['game_id'] # grab game id given index
print(id_game) # show the selected game id

234715
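
As an alternative sketch, the same selection can be written with NumPy's `Generator` interface, which keeps the seed local instead of setting global state. The DataFrame below is a toy stand-in holding the five game ids printed earlier, and the names `games` and `rng` are illustrative. This generator will not necessarily pick the same game as `np.random.seed(0)` with `randint`.

```python
import numpy as np
import pandas as pd

# toy stand-in for the DataFrame of game ids returned by the query above
games = pd.DataFrame({"game_id": [233769, 233847, 233914, 233970, 234063]})

rng = np.random.default_rng(0)                 # seeded generator, no global state
igame = rng.integers(low=0, high=len(games))   # high is exclusive
id_game = int(games["game_id"].iloc[igame])    # convert to a plain Python int
```

Converting to a plain `int` is a small precaution: `sqlite3` generally does not accept NumPy integer types as query parameters, so this makes the id safe to pass to a parameterized query.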


Next, let's grab the pitches thrown by Barry Zito. Note that the query below does not filter on the chosen game id, so it returns his pitches from every game in the database, ordered by game and pitch number (the ordering does not matter for clustering). Additionally, the `pitchfx` table does not store the pitcher id. Instead, we have to join the `pitchfx` table to the `events` table to get player id information. We'll print out the first five pitches.

In [56]:
query = """SELECT DISTINCT pitchfx.* 
    FROM pitchfx
    JOIN events ON (pitchfx.game_id=events.game_id
        AND pitchfx.prev_event=events.event_id)
    WHERE events.pitcher_id=(SELECT player_id
        FROM players
        WHERE players.player_first='Barry'
            AND players.player_last='Zito')
    ORDER BY game_id, pitch_num"""
df = pd.read_sql_query(query, db)
df.head()

Unnamed: 0,game_id,pitch_num,at_bat,time,prev_event,description,outcome,pre_balls,post_balls,pre_strike,...,vz0,ax,ay,az,break_y,break_angle,break_length,spin_dir,spin_rate,pitch_type
0,233769,30,4,134206.0,7,Ball,B,0,1,0,...,-4.167,-4.395,28.973,-13.125,23.7,16.5,3.9,192.99,2278.601,FC
1,233769,31,4,134217.0,7,Called Strike,S,1,1,0,...,-7.852,-1.776,29.654,-9.818,23.7,8.9,3.2,184.54,2620.543,FF
2,233769,32,4,134228.0,7,Foul,S,1,1,1,...,-2.479,3.593,22.606,-23.886,23.7,-8.3,9.2,156.564,1243.876,CH
3,233769,33,4,134247.0,7,"In play, no out",X,1,1,2,...,-5.803,-1.431,30.55,-8.798,23.7,12.1,2.8,183.501,2732.74,FF
4,233769,37,5,134336.0,8,Ball,B,0,1,0,...,-2.421,-3.625,31.828,-12.756,23.6,18.5,3.8,190.573,2300.7,FF
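
To restrict the result to a single game, the same join can take the game id as a query parameter. The sketch below is one way to do that, not the notebook's own query. It runs against an in-memory database with a toy schema that contains only the key columns used in the join, and every row in it is made up for illustration.

```python
import sqlite3
import pandas as pd

# toy database with only the columns the join needs (all rows are made up)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (player_id INTEGER, player_first TEXT, player_last TEXT);
    CREATE TABLE events  (game_id INTEGER, event_id INTEGER, pitcher_id INTEGER);
    CREATE TABLE pitchfx (game_id INTEGER, pitch_num INTEGER, prev_event INTEGER);
    INSERT INTO players VALUES (1, 'Barry', 'Zito'), (2, 'Other', 'Pitcher');
    INSERT INTO events  VALUES (100, 7, 1), (100, 8, 2), (200, 3, 1);
    INSERT INTO pitchfx VALUES (100, 30, 7), (100, 31, 7), (100, 40, 8), (200, 12, 3);
""")

# same join as above, with placeholders for the name and the game id
query = """SELECT DISTINCT pitchfx.*
    FROM pitchfx
    JOIN events ON (pitchfx.game_id=events.game_id
        AND pitchfx.prev_event=events.event_id)
    WHERE events.pitcher_id=(SELECT player_id
        FROM players
        WHERE players.player_first=?
            AND players.player_last=?)
        AND pitchfx.game_id=?
    ORDER BY pitchfx.pitch_num"""
one_game = pd.read_sql_query(query, conn, params=("Barry", "Zito", 100))
```

Using `?` placeholders with `params` avoids building the SQL string by hand and lets the chosen game id be swapped in without editing the query.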


Now, let's look at some of the features we would be interested in, using our intuition. We'll first list all the PitchFX parameters to get a sense of what we have to work with.

In [63]:
exe = """PRAGMA table_info(pitchfx)"""
hdb.execute(exe)
print(*[ii[1] for ii in hdb.fetchall()], sep="\n")

game_id
pitch_num
at_bat
time
prev_event
description
outcome
pre_balls
post_balls
pre_strike
post_strike
start_speed
end_speed
sz_top
sz_bot
pfx_x
pfx_z
px
pz
x
y
x0
y0
z0
vx0
vy0
vz0
ax
ay
az
break_y
break_angle
break_length
spin_dir
spin_rate
pitch_type


A description of the pitch variables can be found here: https://fastballs.wordpress.com/category/pitchfx-glossary/. The most obvious features to use are those that have to do with the trajectory of the ball. This includes 

We will also focus on only one game for now. This is because the PitchFX system can vary from stadium to stadium and from day to day (Fast, "What the heck is PitchFX?"). By focusing on one game, the same errors apply to every pitch thrown in that game. (Note that calibrating the PitchFX data across all games is a step that should be taken in order to compare pitches from game to game.)
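
A minimal sketch of restricting the data to one game and preparing features for clustering is shown here. The column names come from the `pitchfx` table, but the values are made up, and the three-column feature set and the z-score scaling are assumptions for illustration rather than choices made in this notebook.

```python
import pandas as pd

# toy pitches with made-up values; column names are taken from the pitchfx table
pitches = pd.DataFrame({
    "game_id":     [100, 100, 100, 200],
    "start_speed": [85.0, 73.0, 86.0, 84.0],
    "pfx_x":       [4.0, -3.0, 5.0, 4.5],
    "pfx_z":       [9.0, -6.0, 10.0, 9.5],
})

features = ["start_speed", "pfx_x", "pfx_z"]  # assumed feature set, illustration only
id_game = 100                                 # stand-in for the game chosen earlier

# keep only the pitches from the chosen game and the feature columns
one_game = pitches.loc[pitches["game_id"] == id_game, features]

# z-score each column so speed and movement are on comparable scales
scaled = (one_game - one_game.mean()) / one_game.std(ddof=0)
```

Scaling matters for distance-based clustering because unscaled columns with large numeric ranges would otherwise dominate the distances between pitches.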