# Feature Engineering
The purpose of this notebook is to engineer additional features to improve on the model performance from pipeline_architecture.ipynb. There, the pitch location model was underfitting, and the pitch type model was less accurate than I would like.

Importing various packages:

In [180]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

Opening up a SQLAlchemy engine, to work on this in SQL.

In [181]:
#Creating an engine to connect to the mlb_pitches database:
engine = create_engine('postgresql://patrickbovard:localhost@localhost:5432/mlb_pitches')

Re-acquainting myself with all the data I have:

### At-bats:

In [182]:
query = '''
--first, selecting all the standard columns:
SELECT *
FROM atbats
LIMIT 5
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0.1,Unnamed: 0,inning,top,ab_id,g_id,p_score,batter_id,pitcher_id,stand,p_throws,event,o
0,0,1.0,1.0,2019000000.0,201900001.0,0.0,594777,571666,L,R,Flyout,1
1,1,1.0,1.0,2019000000.0,201900001.0,0.0,545361,571666,R,R,Flyout,2
2,2,1.0,1.0,2019000000.0,201900001.0,0.0,571506,571666,L,R,Groundout,3
3,3,1.0,0.0,2019000000.0,201900001.0,0.0,543257,502239,L,R,Single,0
4,4,1.0,0.0,2019000000.0,201900001.0,0.0,656305,502239,R,R,Flyout,1


All of these (outside of the IDs) are currently in use as features, with the exception of event. Perhaps the previous at-bat's event could help predict pitch type?
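
As a rough illustration of the previous-at-bat idea, the sketch below builds a lagged event column with pandas. The toy rows and the `prev_event` column name are assumptions for illustration only; nothing like this exists in the tables yet.

```python
import pandas as pd

# Toy at-bat rows mirroring a few atbats columns (values are illustrative).
atbats = pd.DataFrame({
    "ab_id": [1, 2, 3, 4, 5],
    "pitcher_id": [571666, 571666, 571666, 502239, 502239],
    "event": ["Flyout", "Flyout", "Groundout", "Single", "Flyout"],
})

# Sort so that shift(1) really means "the at-bat before this one" for that pitcher.
atbats = atbats.sort_values(["pitcher_id", "ab_id"])

# prev_event (hypothetical name): outcome of the pitcher's previous at-bat.
atbats["prev_event"] = atbats.groupby("pitcher_id")["event"].shift(1)

# A pitcher's first at-bat has no history, so label it explicitly.
atbats["prev_event"] = atbats["prev_event"].fillna("None")
```

In the real data the lag would probably need to reset at game boundaries as well, which this sketch does not do.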

### Games:

In [4]:
query = '''
--first, selecting all the standard columns:
SELECT *
FROM games
LIMIT 5
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0.1,Unnamed: 0,attendance,away_final_score,away_team,date,elapsed_time,g_id,home_final_score,home_team,start_time,umpire_1B,umpire_2B,umpire_3B,umpire_HP,venue_name,weather,wind,delay
0,0,35055.0,3.0,sln,2015-04-05,184.0,201500001.0,0.0,chn,7:17 PM,Mark Wegner,Marty Foster,Mike Muchlinski,Mike Winters,Wrigley Field,"44 degrees, clear","7 mph, In from CF",0.0
1,1,45909.0,1.0,ana,2015-04-06,153.0,201500002.0,4.0,sea,1:12 PM,Ron Kulpa,Brian Knight,Vic Carapazza,Larry Vanover,Safeco Field,"54 degrees, cloudy","1 mph, Varies",0.0
2,2,36969.0,2.0,atl,2015-04-06,156.0,201500003.0,1.0,mia,4:22 PM,Laz Diaz,Chris Guccione,Cory Blaser,Jeff Nelson,Marlins Park,"80 degrees, partly cloudy","16 mph, In from CF",16.0
3,3,31042.0,6.0,bal,2015-04-06,181.0,201500004.0,2.0,tba,3:12 PM,Ed Hickox,Paul Nauert,Mike Estabrook,Dana DeMuth,Tropicana Field,"72 degrees, dome","0 mph, None",0.0
4,4,45549.0,8.0,bos,2015-04-06,181.0,201500005.0,0.0,phi,3:08 PM,Phil Cuzzi,Tony Randazzo,Will Little,Gerry Davis,Citizens Bank Park,"71 degrees, partly cloudy","11 mph, Out to RF",0.0


Perhaps wind and temperature/weather could help here.
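
The weather and wind columns are free-text strings, so they would need to be split before modeling. A minimal sketch, assuming every row follows the "N degrees, condition" and "N mph, direction" patterns seen in the sample above (the new column names are hypothetical):

```python
import pandas as pd

# Sample strings copied from the games output above.
games = pd.DataFrame({
    "weather": ["44 degrees, clear", "72 degrees, dome", "80 degrees, partly cloudy"],
    "wind": ["7 mph, In from CF", "0 mph, None", "16 mph, In from CF"],
})

# Split "44 degrees, clear" into a numeric temperature and a condition label.
weather_parts = games["weather"].str.extract(r"^(?P<temp_f>\d+) degrees, (?P<sky>.+)$")
games["temp_f"] = weather_parts["temp_f"].astype(float)
games["sky"] = weather_parts["sky"]

# Split "7 mph, In from CF" into a numeric speed and a direction label.
wind_parts = games["wind"].str.extract(r"^(?P<wind_mph>\d+) mph, (?P<wind_dir>.+)$")
games["wind_mph"] = wind_parts["wind_mph"].astype(float)
games["wind_dir"] = wind_parts["wind_dir"]
```

Rows that do not match the pattern come back as NaN, so they are easy to spot and handle separately.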

### Pitches:

In [5]:
query = '''
--first, selecting all the standard columns:
SELECT *
FROM pitches
LIMIT 100
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0,px,pz,start_speed,end_speed,spin_rate,spin_dir,break_angle,break_length,break_y,ax,...,event_num,b_score,ab_id,b_count,s_count,outs,pitch_num,on_1b,on_2b,on_3b
0,0.416,2.963,92.9,84.1,2305.052,159.235,-25.0,3.2,23.7,7.665,...,3,0.0,2015000000.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
1,-0.191,2.347,92.8,84.1,2689.935,151.40200000000004,-40.7,3.4,23.7,12.043,...,4,0.0,2015000000.0,0.0,1.0,0.0,2.0,0.0,0.0,0.0
2,-0.518,3.284,94.1,85.2,2647.972,145.125,-43.7,3.7,23.7,14.368,...,5,0.0,2015000000.0,0.0,2.0,0.0,3.0,0.0,0.0,0.0
3,-0.641,1.221,91.0,84.0,1289.59,169.75099999999995,-1.3,5.0,23.8,2.104,...,6,0.0,2015000000.0,0.0,2.0,0.0,4.0,0.0,0.0,0.0
4,-1.821,2.083,75.4,69.6,1374.569,280.671,18.4,12.0,23.8,-10.28,...,7,0.0,2015000000.0,1.0,2.0,0.0,5.0,0.0,0.0,0.0


In [6]:
df.type_confidence.value_counts()

2.0      93
0.648     1
0.778     1
0.693     1
0.898     1
0.821     1
0.763     1
Name: type_confidence, dtype: int64

In [7]:
df.nasty.describe()

count     99.000000
mean      43.979798
std       16.744161
min       12.000000
25%       31.500000
50%       43.000000
75%       53.500000
max      100.000000
Name: nasty, dtype: float64

In [8]:
df.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b'],
      dtype='object')

There are a few I haven't used here that could be helpful: break_angle, break_length, code (i.e. the code of the preceding pitch), type_confidence (i.e. setting a minimum threshold, to drop pitches that can't be confidently classified), and sz_top and sz_bot (to account for where the strike zone is on a per-pitch basis).

Ones that exist in 2019: break_length, break_angle, break_y, ax, ay, az, vx0/vy0/vz0, pfx_x/pfx_z
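
Two of these ideas can be sketched quickly in pandas: filtering on type_confidence, and rescaling pz by the per-pitch strike zone. The cut-off value, the `pz_norm` name, and the sz_bot/sz_top numbers below are all assumptions for illustration, not values taken from the data.

```python
import pandas as pd

# Toy rows; pz and type_confidence resemble the sample output, sz values are made up.
pitches = pd.DataFrame({
    "pz": [2.963, 1.221, 3.5],
    "sz_bot": [1.5, 1.5, 1.6],
    "sz_top": [3.5, 3.5, 3.4],
    "type_confidence": [2.0, 0.648, 0.898],
})

MIN_CONFIDENCE = 0.7  # hypothetical threshold, would need tuning

# Keep only pitches whose classification meets the threshold.
confident = pitches[pitches["type_confidence"] >= MIN_CONFIDENCE].copy()

# pz_norm (hypothetical): 0 = bottom of the zone, 1 = top of the zone,
# values outside [0, 1] are below or above the zone for that batter.
zone_height = confident["sz_top"] - confident["sz_bot"]
confident["pz_norm"] = (confident["pz"] - confident["sz_bot"]) / zone_height
```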

## Other Pitch Rates:

In modeling_prep.ipynb, I used a query to create running pitch counts for each pitcher.  Utilizing a similar format for some new ones:

Repeating, but over last 100 pitches:

In [9]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, pitcher_full_name, pitch_type,
--selecting counts of each pitch type, over the last 100 pitches the pitcher has thrown:
(count(CASE WHEN pitch_type = 'FF' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ff,
(count(CASE WHEN pitch_type = 'SL' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_sl,
(count(CASE WHEN pitch_type = 'FT' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ft,
(count(CASE WHEN pitch_type = 'CH' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ch,
(count(CASE WHEN pitch_type = 'CU' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_cu,
(count(CASE WHEN pitch_type = 'SI' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_si,
(count(CASE WHEN pitch_type = 'FC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fc,
(count(CASE WHEN pitch_type = 'KC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_kc,
(count(CASE WHEN pitch_type = 'FS' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fs,
(count(CASE WHEN pitch_type = 'KN' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_kn,
(count(CASE WHEN pitch_type = 'EP' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ep,
(count(CASE WHEN pitch_type = 'FO' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fo,
(count(CASE WHEN pitch_type = 'SC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_sc

FROM full_pitch_data
LIMIT 1000
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0,pitcher_id,pitcher_full_name,pitch_type,last_100_ff,last_100_sl,last_100_ft,last_100_ch,last_100_cu,last_100_si,last_100_fc,last_100_kc,last_100_fs,last_100_kn,last_100_ep,last_100_fo,last_100_sc
0,112526,Bartolo Colon,FF,0,0,0,0,0,0,0,0,0,0,0,0,0
1,112526,Bartolo Colon,FT,1,0,0,0,0,0,0,0,0,0,0,0,0
2,112526,Bartolo Colon,SL,1,0,1,0,0,0,0,0,0,0,0,0,0
3,112526,Bartolo Colon,FF,1,1,1,0,0,0,0,0,0,0,0,0,0
4,112526,Bartolo Colon,FT,2,1,1,0,0,0,0,0,0,0,0,0,0
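
Since every window expression above differs only in the pitch type, the column list could also be generated rather than typed out. A sketch of one way to do that; the `rolling_count_columns` helper is hypothetical, and it emits the same expressions as the hand-written query:

```python
PITCH_TYPES = ("FF", "SL", "FT", "CH", "CU", "SI", "FC", "KC", "FS", "KN", "EP", "FO", "SC")


def rolling_count_columns(window, pitch_types=PITCH_TYPES):
    """Build one windowed count expression per pitch type (hypothetical helper)."""
    # Frame: the `window` rows before the current pitch, for the same pitcher.
    frame = (
        "OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC "
        f"ROWS {window} PRECEDING EXCLUDE CURRENT ROW)"
    )
    return ",\n".join(
        f"(count(CASE WHEN pitch_type = '{pt}' THEN pitch_type END) {frame}) "
        f"AS last_{window}_{pt.lower()}"
        for pt in pitch_types
    )


# Assemble the full query text; it can be passed to pd.read_sql as before.
query = f"""
SELECT pitcher_id, pitcher_full_name, pitch_type,
{rolling_count_columns(100)}
FROM full_pitch_data
LIMIT 1000
;
"""
```

Changing the window size then only means changing the argument, e.g. `rolling_count_columns(10)`.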


What about the average px/pz over the last x pitches a pitcher has thrown?

In [10]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, pitcher_full_name, pitch_type,
--selecting avg px, over the last 10 pitches the pitcher has thrown:
(avg(CASE WHEN pitch_type = 'FF' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ff,
(avg(CASE WHEN pitch_type = 'SL' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_sl,
(avg(CASE WHEN pitch_type = 'FT' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ft,
(avg(CASE WHEN pitch_type = 'CH' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ch,
(avg(CASE WHEN pitch_type = 'CU' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_cu,
(avg(CASE WHEN pitch_type = 'SI' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_si,
(avg(CASE WHEN pitch_type = 'FC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fc,
(avg(CASE WHEN pitch_type = 'KC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_kc,
(avg(CASE WHEN pitch_type = 'FS' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fs,
(avg(CASE WHEN pitch_type = 'KN' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_kn,
(avg(CASE WHEN pitch_type = 'EP' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ep,
(avg(CASE WHEN pitch_type = 'FO' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fo,
(avg(CASE WHEN pitch_type = 'SC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_sc

FROM full_pitch_data
LIMIT 1000
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0,pitcher_id,pitcher_full_name,pitch_type,avg_px_ff,avg_px_sl,avg_px_ft,avg_px_ch,avg_px_cu,avg_px_si,avg_px_fc,avg_px_kc,avg_px_fs,avg_px_kn,avg_px_ep,avg_px_fo,avg_px_sc
0,112526,Bartolo Colon,FF,,,,,,,,,,,,,
1,112526,Bartolo Colon,FT,0.445,,,,,,,,,,,,
2,112526,Bartolo Colon,SL,0.445,,-0.296,,,,,,,,,,
3,112526,Bartolo Colon,FF,0.445,0.748,-0.296,,,,,,,,,,
4,112526,Bartolo Colon,FT,0.751,0.748,-0.296,,,,,,,,,,


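Trying an alternative for px that also partitions by pitch type (so each average covers the previous 3 pitches of that same type), followed by the same 10-pitch query for pz: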

In [11]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, pitcher_full_name, pitch_type, px,
--selecting avg px over the previous 3 pitches of the same type (partitioned by pitcher and pitch type):
(avg(px) FILTER (WHERE pitch_type = 'FF') OVER (PARTITION BY pitcher_id, pitch_type ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ff,
(avg(px) FILTER (WHERE pitch_type = 'FT') OVER (PARTITION BY pitcher_id, pitch_type ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ft,
(avg(px) FILTER (WHERE pitch_type = 'CU') OVER (PARTITION BY pitcher_id, pitch_type ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_cu,
(avg(px) FILTER (WHERE pitch_type = 'CH') OVER (PARTITION BY pitcher_id, pitch_type ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ch,
(avg(px) FILTER (WHERE pitch_type = 'SI') OVER (PARTITION BY pitcher_id, pitch_type ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_si


FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
LIMIT 1000
;
'''
df = pd.read_sql(query, engine)

df.head(15)

Unnamed: 0,pitcher_id,pitcher_full_name,pitch_type,px,avg_px_ff,avg_px_ft,avg_px_cu,avg_px_ch,avg_px_si
0,452657,Jon Lester,FF,0.416,,,,,
1,452657,Jon Lester,FF,-0.191,0.416,,,,
2,452657,Jon Lester,FF,-0.518,0.1125,,,,
3,452657,Jon Lester,FF,-0.641,-0.097667,,,,
4,452657,Jon Lester,CU,-1.821,,,,,
5,452657,Jon Lester,FF,0.627,-0.45,,,,
6,452657,Jon Lester,FF,-1.088,-0.177333,,,,
7,452657,Jon Lester,FC,-0.257,,,,,
8,452657,Jon Lester,FF,1.47,-0.367333,,,,
9,452657,Jon Lester,FF,-1.337,0.336333,,,,
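
To sanity-check the type-partitioned window, the same calculation can be reproduced in pandas on the ten Jon Lester pitches shown above. This is only a sketch for a single pitcher; on the full data the grouping would presumably be by pitcher_id and pitch_type together. The `prev3_px_same_type` name is hypothetical.

```python
import pandas as pd

# Pitch types and px values copied from the output above (one pitcher, in order).
df = pd.DataFrame({
    "pitch_type": ["FF", "FF", "FF", "FF", "CU", "FF", "FF", "FC", "FF", "FF"],
    "px": [0.416, -0.191, -0.518, -0.641, -1.821, 0.627, -1.088, -0.257, 1.47, -1.337],
})

# Mean px over the previous 3 pitches of the same type.
# shift(1) drops the current pitch, matching EXCLUDE CURRENT ROW.
df["prev3_px_same_type"] = (
    df.groupby("pitch_type")["px"]
      .transform(lambda s: s.shift(1).rolling(3, min_periods=1).mean())
)
```

The FF rows should line up with the avg_px_ff column from the query. Note that in this version each avg_px_* column is only populated on rows where the current pitch is of that type (see the CU and FC rows above), so it would need care before being used to predict pitch type.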


In [12]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, pitcher_full_name, pitch_type,
--selecting avg pz, over the last 10 pitches the pitcher has thrown:
(avg(CASE WHEN pitch_type = 'FF' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ff,
(avg(CASE WHEN pitch_type = 'SL' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_sl,
(avg(CASE WHEN pitch_type = 'FT' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ft,
(avg(CASE WHEN pitch_type = 'CH' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ch,
(avg(CASE WHEN pitch_type = 'CU' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_cu,
(avg(CASE WHEN pitch_type = 'SI' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_si,
(avg(CASE WHEN pitch_type = 'FC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fc,
(avg(CASE WHEN pitch_type = 'KC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_kc,
(avg(CASE WHEN pitch_type = 'FS' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fs,
(avg(CASE WHEN pitch_type = 'KN' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_kn,
(avg(CASE WHEN pitch_type = 'EP' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ep,
(avg(CASE WHEN pitch_type = 'FO' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fo,
(avg(CASE WHEN pitch_type = 'SC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_sc

FROM full_pitch_data
LIMIT 1000
;
'''
df = pd.read_sql(query, engine)

df.head()

Unnamed: 0,pitcher_id,pitcher_full_name,pitch_type,avg_pz_ff,avg_pz_sl,avg_pz_ft,avg_pz_ch,avg_pz_cu,avg_pz_si,avg_pz_fc,avg_pz_kc,avg_pz_fs,avg_pz_kn,avg_pz_ep,avg_pz_fo,avg_pz_sc
0,112526,Bartolo Colon,FF,,,,,,,,,,,,,
1,112526,Bartolo Colon,FT,2.705,,,,,,,,,,,,
2,112526,Bartolo Colon,SL,2.705,,1.189,,,,,,,,,,
3,112526,Bartolo Colon,FF,2.705,1.26,1.189,,,,,,,,,,
4,112526,Bartolo Colon,FT,3.1155,1.26,1.189,,,,,,,,,,


These work and can likely paint a good picture of where the pitcher is locating the ball, but the NaN values will have to be handled. A fair fill value could be the middle of the strike zone (0 for px, ~1.85 for pz), since I don't want to lose those rows.

Columns that are entirely None can be removed from that pitcher's modeling or set to 0. Ultimately it won't matter, since the pitcher doesn't throw that pitch.
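
A minimal sketch of that NaN handling for one pitcher's rows, using the fill values proposed above (0 for px, 1.85 for pz). The toy frame and the constant names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Toy feature rows for a single pitcher who throws FF but never KN.
features = pd.DataFrame({
    "avg_px_ff": [np.nan, 0.445, 0.751],
    "avg_pz_ff": [np.nan, 2.705, 3.1155],
    "avg_px_kn": [np.nan, np.nan, np.nan],
    "avg_pz_kn": [np.nan, np.nan, np.nan],
})

PX_CENTER, PZ_CENTER = 0.0, 1.85  # strike-zone midpoint values from the text above

# Drop columns for pitch types this pitcher never throws.
never_thrown = [c for c in features.columns if features[c].isna().all()]
features = features.drop(columns=never_thrown)

# Fill the remaining gaps (no history yet) with the middle of the zone.
px_cols = [c for c in features.columns if c.startswith("avg_px_")]
pz_cols = [c for c in features.columns if c.startswith("avg_pz_")]
features[px_cols] = features[px_cols].fillna(PX_CENTER)
features[pz_cols] = features[pz_cols].fillna(PZ_CENTER)
```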

## Merging the Above in one query:

In [13]:
query = '''
--first, selecting all the standard columns:
SELECT *
FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
LIMIT 10
;
'''
df = pd.read_sql(query, engine)

df.head(10)

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,stand,p_throws,event,home_team,...,pitch_num,last_pitch_type,last_pitch_px,last_pitch_pz,last_pitch_speed,pitcher_full_name,pitcher_run_diff,hitter_full_name,Date_Time_Date,Season
0,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,1.0,,,,,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
1,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,2.0,FF,0.416,2.963,92.9,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
2,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,3.0,FF,-0.191,2.347,92.8,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
3,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,4.0,FF,-0.518,3.284,94.1,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
4,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,5.0,FF,-0.641,1.221,91.0,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
5,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,6.0,CU,-1.821,2.083,75.4,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
6,1.0,518792,452657,1.0,2015000000.0,0.0,L,L,Double,chn,...,1.0,,,,,Jon Lester,0.0,Jason Heyward,2015-04-05,2015
7,1.0,518792,452657,1.0,2015000000.0,0.0,L,L,Double,chn,...,2.0,FF,-1.088,1.61,93.3,Jon Lester,0.0,Jason Heyward,2015-04-05,2015
8,1.0,407812,452657,1.0,2015000000.0,0.0,R,L,Single,chn,...,1.0,,,,,Jon Lester,0.0,Matt Holliday,2015-04-05,2015
9,1.0,407812,452657,1.0,2015000000.0,0.0,R,L,Single,chn,...,2.0,FF,1.47,2.35,92.1,Jon Lester,0.0,Matt Holliday,2015-04-05,2015


In [14]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, batter_id, event, pitcher_full_name, pitch_type, "Season",
--selecting counts of each pitch type, over the last 100 pitches the pitcher has thrown:
(count(CASE WHEN pitch_type = 'FF' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ff,
(count(CASE WHEN pitch_type = 'SL' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_sl,
(count(CASE WHEN pitch_type = 'FT' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ft,
(count(CASE WHEN pitch_type = 'CH' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ch,
(count(CASE WHEN pitch_type = 'CU' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_cu,
(count(CASE WHEN pitch_type = 'SI' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_si,
(count(CASE WHEN pitch_type = 'FC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fc,
(count(CASE WHEN pitch_type = 'KC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_kc,
(count(CASE WHEN pitch_type = 'FS' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fs,
(count(CASE WHEN pitch_type = 'KN' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_kn,
(count(CASE WHEN pitch_type = 'EP' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_ep,
(count(CASE WHEN pitch_type = 'FO' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_fo,
(count(CASE WHEN pitch_type = 'SC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 100 PRECEDING EXCLUDE CURRENT ROW)) AS last_100_sc,

--selecting avg px, over the last 3 pitches the pitcher has thrown:
(avg(CASE WHEN pitch_type = 'FF' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ff,
(avg(CASE WHEN pitch_type = 'SL' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_sl,
(avg(CASE WHEN pitch_type = 'FT' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ft,
(avg(CASE WHEN pitch_type = 'CH' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ch,
(avg(CASE WHEN pitch_type = 'CU' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_cu,
(avg(CASE WHEN pitch_type = 'SI' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_si,
(avg(CASE WHEN pitch_type = 'FC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fc,
(avg(CASE WHEN pitch_type = 'KC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_kc,
(avg(CASE WHEN pitch_type = 'FS' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fs,
(avg(CASE WHEN pitch_type = 'KN' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_kn,
(avg(CASE WHEN pitch_type = 'EP' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_ep,
(avg(CASE WHEN pitch_type = 'FO' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_fo,
(avg(CASE WHEN pitch_type = 'SC' THEN px END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_px_sc,

--selecting avg pz, over the last 3 pitches the pitcher has thrown:
(avg(CASE WHEN pitch_type = 'FF' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ff,
(avg(CASE WHEN pitch_type = 'SL' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_sl,
(avg(CASE WHEN pitch_type = 'FT' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ft,
(avg(CASE WHEN pitch_type = 'CH' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ch,
(avg(CASE WHEN pitch_type = 'CU' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_cu,
(avg(CASE WHEN pitch_type = 'SI' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_si,
(avg(CASE WHEN pitch_type = 'FC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fc,
(avg(CASE WHEN pitch_type = 'KC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_kc,
(avg(CASE WHEN pitch_type = 'FS' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fs,
(avg(CASE WHEN pitch_type = 'KN' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_kn,
(avg(CASE WHEN pitch_type = 'EP' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_ep,
(avg(CASE WHEN pitch_type = 'FO' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_fo,
(avg(CASE WHEN pitch_type = 'SC' THEN pz END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 3 PRECEDING EXCLUDE CURRENT ROW)) AS avg_pz_sc

FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
;
'''
df = pd.read_sql(query, engine)

df.head(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_100_ff,last_100_sl,last_100_ft,last_100_ch,...,avg_pz_ch,avg_pz_cu,avg_pz_si,avg_pz_fc,avg_pz_kc,avg_pz_fs,avg_pz_kn,avg_pz_ep,avg_pz_fo,avg_pz_sc
0,452657,572761,Groundout,Jon Lester,FF,2015,0,0,0,0,...,,,,,,,,,,
1,452657,572761,Groundout,Jon Lester,FF,2015,1,0,0,0,...,,,,,,,,,,
2,452657,572761,Groundout,Jon Lester,FF,2015,2,0,0,0,...,,,,,,,,,,
3,452657,572761,Groundout,Jon Lester,FF,2015,3,0,0,0,...,,,,,,,,,,
4,452657,572761,Groundout,Jon Lester,CU,2015,4,0,0,0,...,,,,,,,,,,
5,452657,572761,Groundout,Jon Lester,FF,2015,4,0,0,0,...,,2.083,,,,,,,,
6,452657,518792,Double,Jon Lester,FF,2015,5,0,0,0,...,,2.083,,,,,,,,
7,452657,518792,Double,Jon Lester,FC,2015,6,0,0,0,...,,2.083,,,,,,,,
8,452657,407812,Single,Jon Lester,FF,2015,6,0,0,0,...,,,,2.047,,,,,,
9,452657,407812,Single,Jon Lester,FF,2015,7,0,0,0,...,,,,2.047,,,,,,


In [15]:
df.tail(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_100_ff,last_100_sl,last_100_ft,last_100_ch,...,avg_pz_ch,avg_pz_cu,avg_pz_si,avg_pz_fc,avg_pz_kc,avg_pz_fs,avg_pz_kn,avg_pz_ep,avg_pz_fo,avg_pz_sc
3555824,571704,663993,Groundout,Ken Giles,SL,2019,43,54,3,0,...,,,,,,,,,,
3555825,571704,663993,Groundout,Ken Giles,SL,2019,43,54,3,0,...,,,,,,,,,,
3555826,571704,622110,Groundout,Ken Giles,FF,2019,43,54,3,0,...,,,,,,,,,,
3555827,571704,622110,Groundout,Ken Giles,SL,2019,43,54,3,0,...,,,,,,,,,,
3555828,571704,622110,Groundout,Ken Giles,SL,2019,43,54,3,0,...,,,,,,,,,,
3555829,571704,622110,Groundout,Ken Giles,FF,2019,42,55,3,0,...,,,,,,,,,,
3555830,571704,605421,Strikeout,Ken Giles,SL,2019,42,55,3,0,...,,,,,,,,,,
3555831,571704,605421,Strikeout,Ken Giles,FF,2019,41,56,3,0,...,,,,,,,,,,
3555832,571704,605421,Strikeout,Ken Giles,SL,2019,41,56,3,0,...,,,,,,,,,,
3555833,571704,605421,Strikeout,Ken Giles,SL,2019,40,57,3,0,...,,,,,,,,,,


In [16]:
df[df.Season != 2019].tail(15)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_100_ff,last_100_sl,last_100_ft,last_100_ch,...,avg_pz_ch,avg_pz_cu,avg_pz_si,avg_pz_fc,avg_pz_kc,avg_pz_fs,avg_pz_kn,avg_pz_ep,avg_pz_fo,avg_pz_sc
2848356,623352,450314,Flyout,Josh Hader,SL,2018,82,17,1,0,...,,,,,,,,,,
2848357,623352,450314,Flyout,Josh Hader,FF,2018,81,18,1,0,...,,,,,,,,,,
2848358,623352,595879,Single,Josh Hader,SL,2018,81,18,1,0,...,,,,,,,,,,
2848359,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848360,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848361,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848362,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848363,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848364,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,
2848365,623352,595879,Single,Josh Hader,FF,2018,80,19,1,0,...,,,,,,,,,,


Saving this data as a new pickled file: (commenting out after initial run)

In [17]:
#with open('../Data/new_pitch_rates.pickle', 'wb') as to_write:
#    pickle.dump(df, to_write)
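
For reference, a small round-trip sketch of the same save pattern, plus loading the file back. It writes to a temporary directory (an assumption, standing in for ../Data/) so it has no side effects, and the one-row frame is a placeholder for the real df:

```python
import os
import pickle
import tempfile

import pandas as pd

# Placeholder frame standing in for the full query result.
df = pd.DataFrame({"pitcher_id": [452657], "last_100_ff": [0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new_pitch_rates.pickle")

    # Save, exactly as in the commented-out cell above.
    with open(path, "wb") as to_write:
        pickle.dump(df, to_write)

    # Load it back, as a later notebook would.
    with open(path, "rb") as to_read:
        restored = pickle.load(to_read)
```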

## Next Round:
Based on model performance in Pipeline_Part_2.ipynb, the features above didn't improve performance much. Working on some additional feature engineering below:

In [9]:
query = '''

SELECT *
FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
LIMIT 10
;
'''
df = pd.read_sql(query, engine)

df.head(10)

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,stand,p_throws,event,home_team,...,pitch_num,last_pitch_type,last_pitch_px,last_pitch_pz,last_pitch_speed,pitcher_full_name,pitcher_run_diff,hitter_full_name,Date_Time_Date,Season
0,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,1.0,,,,,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
1,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,2.0,FF,0.416,2.963,92.9,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
2,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,3.0,FF,-0.191,2.347,92.8,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
3,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,4.0,FF,-0.518,3.284,94.1,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
4,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,5.0,FF,-0.641,1.221,91.0,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
5,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,6.0,CU,-1.821,2.083,75.4,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
6,1.0,518792,452657,1.0,2015000000.0,0.0,L,L,Double,chn,...,1.0,,,,,Jon Lester,0.0,Jason Heyward,2015-04-05,2015
7,1.0,518792,452657,1.0,2015000000.0,0.0,L,L,Double,chn,...,2.0,FF,-1.088,1.61,93.3,Jon Lester,0.0,Jason Heyward,2015-04-05,2015
8,1.0,407812,452657,1.0,2015000000.0,0.0,R,L,Single,chn,...,1.0,,,,,Jon Lester,0.0,Matt Holliday,2015-04-05,2015
9,1.0,407812,452657,1.0,2015000000.0,0.0,R,L,Single,chn,...,2.0,FF,1.47,2.35,92.1,Jon Lester,0.0,Matt Holliday,2015-04-05,2015


In [10]:
df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season'],
      dtype='object')

In [190]:
query = '''
--first, selecting all the standard columns:
SELECT pitcher_id, batter_id, event, pitcher_full_name, pitch_type, "Season",
--selecting counts of each pitch type, over the last 10 pitches the pitcher has thrown:
(count(CASE WHEN pitch_type = 'FF' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_ff,
(count(CASE WHEN pitch_type = 'SL' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_sl,
(count(CASE WHEN pitch_type = 'FT' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_ft,
(count(CASE WHEN pitch_type = 'CH' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_ch,
(count(CASE WHEN pitch_type = 'CU' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_cu,
(count(CASE WHEN pitch_type = 'SI' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_si,
(count(CASE WHEN pitch_type = 'FC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_fc,
(count(CASE WHEN pitch_type = 'KC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_kc,
(count(CASE WHEN pitch_type = 'FS' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_fs,
(count(CASE WHEN pitch_type = 'KN' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_kn,
(count(CASE WHEN pitch_type = 'EP' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_ep,
(count(CASE WHEN pitch_type = 'FO' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_fo,
(count(CASE WHEN pitch_type = 'SC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 10 PRECEDING EXCLUDE CURRENT ROW)) AS last_10_sc,

--Last 5:
(count(CASE WHEN pitch_type = 'FF' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_ff,
(count(CASE WHEN pitch_type = 'SL' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_sl,
(count(CASE WHEN pitch_type = 'FT' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_ft,
(count(CASE WHEN pitch_type = 'CH' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_ch,
(count(CASE WHEN pitch_type = 'CU' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_cu,
(count(CASE WHEN pitch_type = 'SI' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_si,
(count(CASE WHEN pitch_type = 'FC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_fc,
(count(CASE WHEN pitch_type = 'KC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_kc,
(count(CASE WHEN pitch_type = 'FS' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_fs,
(count(CASE WHEN pitch_type = 'KN' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_kn,
(count(CASE WHEN pitch_type = 'EP' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_ep,
(count(CASE WHEN pitch_type = 'FO' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_fo,
(count(CASE WHEN pitch_type = 'SC' THEN pitch_type END) OVER (PARTITION BY pitcher_id ORDER BY ab_id, pitch_num ASC ROWS 5 PRECEDING EXCLUDE CURRENT ROW)) AS last_5_sc


FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
;
'''
last10_df = pd.read_sql(query, engine)

last10_df.head(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_10_ff,last_10_sl,last_10_ft,last_10_ch,...,last_5_ch,last_5_cu,last_5_si,last_5_fc,last_5_kc,last_5_fs,last_5_kn,last_5_ep,last_5_fo,last_5_sc
0,452657,572761,Groundout,Jon Lester,FF,2015,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,452657,572761,Groundout,Jon Lester,FF,2015,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,452657,572761,Groundout,Jon Lester,FF,2015,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,452657,572761,Groundout,Jon Lester,FF,2015,3,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,452657,572761,Groundout,Jon Lester,CU,2015,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,452657,572761,Groundout,Jon Lester,FF,2015,4,0,0,0,...,0,1,0,0,0,0,0,0,0,0
6,452657,518792,Double,Jon Lester,FF,2015,5,0,0,0,...,0,1,0,0,0,0,0,0,0,0
7,452657,518792,Double,Jon Lester,FC,2015,6,0,0,0,...,0,1,0,0,0,0,0,0,0,0
8,452657,407812,Single,Jon Lester,FF,2015,6,0,0,0,...,0,1,0,1,0,0,0,0,0,0
9,452657,407812,Single,Jon Lester,FF,2015,7,0,0,0,...,0,1,0,1,0,0,0,0,0,0
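
For reference, the same "previous N pitches" counts can be approximated in pandas. This is only a minimal sketch on made-up toy data, not the query used above: the helper name `last_n_counts` is hypothetical, and it assumes the rows are already sorted by ab_id and pitch_num within each pitcher.

```python
import pandas as pd

def last_n_counts(df, n, pitch_types):
    """Count each pitch type among a pitcher's previous n pitches (current pitch excluded)."""
    out = pd.DataFrame(index=df.index)
    for pt in pitch_types:
        is_type = (df["pitch_type"] == pt).astype(int)
        # shift(1) removes the current pitch from the window,
        # then rolling(n) sums over the n pitches before it.
        counts = is_type.groupby(df["pitcher_id"]).transform(
            lambda s: s.shift(1).rolling(n, min_periods=1).sum()
        )
        out[f"last_{n}_{pt.lower()}"] = counts.fillna(0).astype(int)
    return out

# Toy data (made up), already in pitch order:
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1, 2, 2],
    "pitch_type": ["FF", "FF", "CU", "FF", "SL", "SL"],
})
features = last_n_counts(toy, n=2, pitch_types=["FF", "CU", "SL"])
```

The `shift(1)` plays the role of `EXCLUDE CURRENT ROW` in the SQL, which keeps the current pitch (the prediction target) out of its own feature.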


In [191]:
last10_df.columns

Index(['pitcher_id', 'batter_id', 'event', 'pitcher_full_name', 'pitch_type',
       'Season', 'last_10_ff', 'last_10_sl', 'last_10_ft', 'last_10_ch',
       'last_10_cu', 'last_10_si', 'last_10_fc', 'last_10_kc', 'last_10_fs',
       'last_10_kn', 'last_10_ep', 'last_10_fo', 'last_10_sc', 'last_5_ff',
       'last_5_sl', 'last_5_ft', 'last_5_ch', 'last_5_cu', 'last_5_si',
       'last_5_fc', 'last_5_kc', 'last_5_fs', 'last_5_kn', 'last_5_ep',
       'last_5_fo', 'last_5_sc'],
      dtype='object')

In [192]:
last10_df[last10_df.Season != 2019].tail(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_10_ff,last_10_sl,last_10_ft,last_10_ch,...,last_5_ch,last_5_cu,last_5_si,last_5_fc,last_5_kc,last_5_fs,last_5_kn,last_5_ep,last_5_fo,last_5_sc
2848361,623352,595879,Single,Josh Hader,FF,2018,7,3,0,0,...,0,0,0,0,0,0,0,0,0,0
2848362,623352,595879,Single,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848363,623352,595879,Single,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848364,623352,595879,Single,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848365,623352,595879,Single,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848366,623352,595879,Single,Josh Hader,SL,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848367,623352,519203,Flyout,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848368,623352,519203,Flyout,Josh Hader,FF,2018,8,2,0,0,...,0,0,0,0,0,0,0,0,0,0
2848369,623352,519203,Flyout,Josh Hader,FF,2018,9,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2848370,623352,519203,Flyout,Josh Hader,FF,2018,9,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [193]:
last10_df.shape

(3555834, 32)

Pickling out the last 10 data to use in my modeling:

In [194]:
with open('../Data/last_10_data.pickle', 'wb') as to_write:
    pickle.dump(last10_df, to_write)

Adding pitch value from two pitches ago (i.e. two pitches preceding), to potentially help with pitch prediction:

In [188]:
query = '''

SELECT pitcher_id, batter_id, event, pitcher_full_name, pitch_type, "Season",
--Adding info from two pitches ago (within that at bat), by taking the previous row's last_pitch_* values.  This will lead to NaN values for the first two pitches of an at-bat, which is expected:
(max(last_pitch_type) OVER (PARTITION BY ab_id ORDER BY pitch_num ASC ROWS 1 PRECEDING EXCLUDE CURRENT ROW)) as last_pitch_type_2,
(max(last_pitch_px) OVER (PARTITION BY ab_id ORDER BY pitch_num ASC ROWS 1 PRECEDING EXCLUDE CURRENT ROW)) as last_pitch_px_2,
(max(last_pitch_pz) OVER (PARTITION BY ab_id ORDER BY pitch_num ASC ROWS 1 PRECEDING EXCLUDE CURRENT ROW)) as last_pitch_pz_2,
(max(last_pitch_speed) OVER (PARTITION BY ab_id ORDER BY pitch_num ASC ROWS 1 PRECEDING EXCLUDE CURRENT ROW)) as last_pitch_speed_2


FROM full_pitch_data
ORDER BY ab_id, pitch_num ASC
;
'''
two_pitches_ago_df = pd.read_sql(query, engine)

two_pitches_ago_df.head(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_pitch_type_2,last_pitch_px_2,last_pitch_pz_2,last_pitch_speed_2
0,452657,572761,Groundout,Jon Lester,FF,2015,,,,
1,452657,572761,Groundout,Jon Lester,FF,2015,,,,
2,452657,572761,Groundout,Jon Lester,FF,2015,FF,0.416,2.963,92.9
3,452657,572761,Groundout,Jon Lester,FF,2015,FF,-0.191,2.347,92.8
4,452657,572761,Groundout,Jon Lester,CU,2015,FF,-0.518,3.284,94.1
5,452657,572761,Groundout,Jon Lester,FF,2015,FF,-0.641,1.221,91.0
6,452657,518792,Double,Jon Lester,FF,2015,,,,
7,452657,518792,Double,Jon Lester,FC,2015,,,,
8,452657,407812,Single,Jon Lester,FF,2015,,,,
9,452657,407812,Single,Jon Lester,FF,2015,,,,
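
A pandas version of the same idea, as a rough sketch on made-up toy data: shifting by two rows within each at-bat gives the pitch from two pitches ago directly. The column names here mirror the pitches table (pitch_type, start_speed) rather than the last_pitch_* columns used in the query, so treat them as assumptions.

```python
import pandas as pd

# Toy data (made up): two at-bats, in pitch order.
toy = pd.DataFrame({
    "ab_id": [100, 100, 100, 100, 101, 101],
    "pitch_num": [1, 2, 3, 4, 1, 2],
    "pitch_type": ["FF", "FF", "CU", "FF", "SL", "FC"],
    "start_speed": [92.9, 92.8, 78.0, 94.1, 85.0, 88.0],
})
toy = toy.sort_values(["ab_id", "pitch_num"])

# shift(2) within each at-bat: the first two pitches of an at-bat get NaN.
lagged = toy.groupby("ab_id")[["pitch_type", "start_speed"]].shift(2)
toy["last_pitch_type_2"] = lagged["pitch_type"]
toy["last_pitch_speed_2"] = lagged["start_speed"]
```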


# Other potential features from pitch data:
The pitches table offers many more candidate features, such as break and speed at various points.  Some of these may help predict pitch location (e.g. the average break over the last x pitches of that type).
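
As a rough sketch of what such a "last x pitches of that type" feature could look like in pandas (toy data, hypothetical helper name `prior_type_average`, rows assumed to be in pitch order):

```python
import pandas as pd

def prior_type_average(df, col, n):
    """Mean of `col` over the pitcher's previous n pitches of the same pitch type."""
    return df.groupby(["pitcher_id", "pitch_type"])[col].transform(
        # shift(1) keeps the current pitch out of its own average.
        lambda s: s.shift(1).rolling(n, min_periods=1).mean()
    )

# Toy data (made up): one pitcher mixing fastballs and curveballs.
toy = pd.DataFrame({
    "pitcher_id": [1, 1, 1, 1, 1],
    "pitch_type": ["FF", "CU", "FF", "FF", "CU"],
    "break_length": [4.0, 12.0, 6.0, 5.0, 10.0],
})
toy["avg_break_length_last_2"] = prior_type_average(toy, "break_length", n=2)
```

The first time a pitcher throws a given pitch type there is no history, so the feature is NaN and would need imputing before modeling.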

In [12]:
query = '''
--first, selecting all the standard columns:
SELECT *
FROM pitches
ORDER BY ab_id, pitch_num ASC
LIMIT 10
;
'''
pitches_df = pd.read_sql(query, engine)

pitches_df.columns

Index(['px', 'pz', 'start_speed', 'end_speed', 'spin_rate', 'spin_dir',
       'break_angle', 'break_length', 'break_y', 'ax', 'ay', 'az', 'sz_bot',
       'sz_top', 'type_confidence', 'vx0', 'vy0', 'vz0', 'x', 'x0', 'y', 'y0',
       'z0', 'pfx_x', 'pfx_z', 'nasty', 'zone', 'code', 'type', 'pitch_type',
       'event_num', 'b_score', 'ab_id', 'b_count', 's_count', 'outs',
       'pitch_num', 'on_1b', 'on_2b', 'on_3b'],
      dtype='object')

Of these, sz_bot and sz_top would be useful to have for every pitch.  These could help normalize where the strike zone is, and inform pz prediction (and possibly px as well).  Some others, such as break_angle/length/y, ax/y/z, vx/y/z0, pfx_x/z, etc., could be helpful as precursor stats (i.e. the average values over the last x times the pitcher threw that pitch type).
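
One possible way to normalize pz by the strike zone, sketched on made-up values: rescale so 0 is the bottom of the zone and 1 is the top. The helper name `normalize_pz` and the pz_norm column are hypothetical, and this assumes sz_top is always greater than sz_bot.

```python
import pandas as pd

def normalize_pz(df):
    """Rescale pz so 0 = bottom of the strike zone and 1 = top of the zone."""
    zone_height = df["sz_top"] - df["sz_bot"]
    return (df["pz"] - df["sz_bot"]) / zone_height

# Toy data (made up): four pitches against the same strike zone.
toy = pd.DataFrame({
    "pz": [1.5, 2.5, 3.5, 4.0],
    "sz_bot": [1.5, 1.5, 1.5, 1.5],
    "sz_top": [3.5, 3.5, 3.5, 3.5],
})
toy["pz_norm"] = normalize_pz(toy)
```

Values below 0 or above 1 are pitches under or over the zone, which puts hitters of different heights on a common scale.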

In order to get pitch data in a similar fashion as before, I'll be utilizing a query from cell 14 of initial_sql_queries.ipynb to bring together the pitches, games, and at bats tables:

In [13]:
query = '''
--Queuing up the game/at bat info from the above query:
WITH game_player_ab AS (

WITH game_ab AS (
SELECT a.inning, a.batter_id, a.pitcher_id, a.top, a.ab_id, a.p_score, a.stand, a.p_throws, a.event, g.date, g.home_team, g.away_team
FROM atbats as a
RIGHT JOIN games as g
    ON a.g_id = g.g_id
)

SELECT g.*, 
p.first_name as Pitcher_First_Name, p.last_name as Pitcher_Last_Name, 
h.first_name as Hitter_First_Name, h.last_name as Hitter_Last_Name

FROM game_ab as g

--first, joining up the pitcher's name:
LEFT JOIN players as p
    ON g.pitcher_id = p.id

--now, joining the hitter's name:
LEFT JOIN players as h
    ON g.batter_id = h.id

--Ordering:
ORDER BY g.ab_id ASC
)

SELECT 
--First, taking some of the data above to help identify pitch indexes:
gpa.pitcher_id, gpa.batter_id, gpa.Pitcher_First_Name, gpa.Pitcher_Last_Name, gpa.ab_id, gpa.event, pi.pitch_num, pi.pitch_type,

--Adding new features from the pitches table (note: sz_bot and sz_top are listed twice below, so they come back as duplicate columns):
pi.px, pi.sz_bot, pi.sz_top, pi.break_angle, pi.break_length, pi.break_y, pi.ax, pi.ay, pi.az, pi.sz_bot,
       pi.sz_top, pi.vx0, pi.vy0, pi.vz0, pi.x, pi.x0, pi.y, pi.y0,
       pi.z0, pi.pfx_x, pi.pfx_z

FROM pitches as pi
RIGHT JOIN game_player_ab as gpa
    ON pi.ab_id = gpa.ab_id
    
--Ordering:
ORDER BY gpa.ab_id ASC, pi.pitch_num ASC
;
'''
combined_data_df = pd.read_sql(query, engine)

combined_data_df.head(10)

Unnamed: 0,pitcher_id,batter_id,pitcher_first_name,pitcher_last_name,ab_id,event,pitch_num,pitch_type,px,sz_bot,...,vx0,vy0,vz0,x,x0,y,y0,z0,pfx_x,pfx_z
0,452657,572761,Jon,Lester,2015000000.0,Groundout,1.0,FF,0.416,1.72,...,-6.409,-136.065,-3.995,101.14,2.28,158.78,50.0,5.302,4.16,10.93
1,452657,572761,Jon,Lester,2015000000.0,Groundout,2.0,FF,-0.191,1.72,...,-8.411,-135.69,-5.98,124.28,2.119,175.41,50.0,5.307,6.57,12.0
2,452657,572761,Jon,Lester,2015000000.0,Groundout,3.0,FF,-0.518,1.72,...,-9.802,-137.668,-3.337,136.74,2.127,150.11,50.0,5.313,7.61,10.88
3,452657,572761,Jon,Lester,2015000000.0,Groundout,4.0,FF,-0.641,1.74,...,-8.071,-133.005,-6.567,109.68563599417064,2.279,187.46348190644315,50.0,5.21,1.17,6.45
4,452657,572761,Jon,Lester,2015000000.0,Groundout,5.0,CU,-1.821,1.72,...,-6.309,-110.409,0.325,146.5275251955089,2.179,177.2428287731686,50.0,5.557,-8.43,-1.65
5,452657,572761,Jon,Lester,2015000000.0,Groundout,6.0,FF,0.627,1.72,...,-6.943,-136.012,-5.738,118.00477226544056,2.273,164.46701235657548,50.0,5.264,7.32,11.72
6,452657,518792,Jon,Lester,2015000000.0,Double,1.0,FF,-1.088,1.59,...,-11.032,-136.208,-7.762,141.43,2.013,205.81,50.0,5.179,7.79,11.97
7,452657,518792,Jon,Lester,2015000000.0,Double,2.0,FC,-0.257,1.59,...,-6.335,-130.711,-4.611,186.41,2.298,182.54,50.0,5.284,-0.77,7.38
8,452657,407812,Jon,Lester,2015000000.0,Single,1.0,FF,1.47,1.89,...,-5.075,-134.873,-5.723,93.1,2.402,174.06,50.0,5.31,7.46,11.09
9,452657,407812,Jon,Lester,2015000000.0,Single,2.0,FF,-1.337,1.81,...,-9.239,-130.512,-4.904,135.83149284673328,2.165,182.99194616063545,50.0,5.302,0.71,7.18


### Cleaning the Results

To line up with the other data on index, this table needs to be cleaned in the same manner.

Removing the same pitch types as in data_cleaning.ipynb, for consistency:

In [52]:
bad_pitches = ['IN', 'PO', 'UN', 'FA', 'AB']

In [53]:
cleaned_df_1 = combined_data_df[~combined_data_df.pitch_type.isin(bad_pitches)]

In [54]:
cleaned_df_1.isnull().sum()

pitcher_id                0
batter_id                 0
pitcher_first_name    81607
pitcher_last_name     81607
ab_id                     0
event                     0
pitch_num               173
pitch_type            20991
px                    20991
sz_bot                 2256
sz_top                 2256
break_angle           20991
break_length          20991
break_y               20991
ax                    20991
ay                    20991
az                    20991
sz_bot                 2256
sz_top                 2256
vx0                   20991
vy0                   20991
vz0                   20991
x                       173
x0                    20991
y                       173
y0                    20991
z0                    20991
pfx_x                 20944
pfx_z                 20944
dtype: int64

Removing null pitch types, since those will have no value in this model:

In [69]:
cleaned_df = cleaned_df_1[(cleaned_df_1.pitch_type.notnull())]

In [70]:
cleaned_df.shape

(3568226, 29)

In [71]:
last10_df.shape

(3555834, 19)

In [72]:
diff = cleaned_df.shape[0] - last10_df.shape[0]
print(diff)

12392


Finding remaining nulls:

In [74]:
cleaned_df.isnull().sum()

pitcher_id                0
batter_id                 0
pitcher_first_name    81024
pitcher_last_name     81024
ab_id                     0
event                     0
pitch_num                 0
pitch_type                0
px                        0
sz_bot                    0
sz_top                    0
break_angle               0
break_length              0
break_y                   0
ax                        0
ay                        0
az                        0
sz_bot                    0
sz_top                    0
vx0                       0
vy0                       0
vz0                       0
x                         0
x0                        0
y                         0
y0                        0
z0                        0
pfx_x                     0
pfx_z                     0
dtype: int64

That took care of all the nulls for the other stats.  The next step is to fill in the missing names, using the other name/ID file (PLAYERIDMAP.csv).

In [75]:
#Taking an explicit copy first, so the new column is set on cleaned_df itself rather than on a slice of cleaned_df_1:
cleaned_df = cleaned_df.copy()
cleaned_df['pitcher_full_name'] = cleaned_df['pitcher_first_name'] + ' ' + cleaned_df['pitcher_last_name']


In [76]:
cleaned_df.head()

Unnamed: 0,pitcher_id,batter_id,pitcher_first_name,pitcher_last_name,ab_id,event,pitch_num,pitch_type,px,sz_bot,...,vy0,vz0,x,x0,y,y0,z0,pfx_x,pfx_z,pitcher_full_name
0,452657,572761,Jon,Lester,2015000000.0,Groundout,1.0,FF,0.416,1.72,...,-136.065,-3.995,101.14,2.28,158.78,50.0,5.302,4.16,10.93,Jon Lester
1,452657,572761,Jon,Lester,2015000000.0,Groundout,2.0,FF,-0.191,1.72,...,-135.69,-5.98,124.28,2.119,175.41,50.0,5.307,6.57,12.0,Jon Lester
2,452657,572761,Jon,Lester,2015000000.0,Groundout,3.0,FF,-0.518,1.72,...,-137.668,-3.337,136.74,2.127,150.11,50.0,5.313,7.61,10.88,Jon Lester
3,452657,572761,Jon,Lester,2015000000.0,Groundout,4.0,FF,-0.641,1.74,...,-133.005,-6.567,109.68563599417064,2.279,187.46348190644315,50.0,5.21,1.17,6.45,Jon Lester
4,452657,572761,Jon,Lester,2015000000.0,Groundout,5.0,CU,-1.821,1.72,...,-110.409,0.325,146.5275251955089,2.179,177.2428287731686,50.0,5.557,-8.43,-1.65,Jon Lester


In [80]:
cleaned_df.loc[cleaned_df.index == 2848370, ['pitcher_full_name']]

Unnamed: 0,pitcher_full_name
2848370,Kyle McGowin


Pulling in the other name dataframe:

In [83]:
name_data = pd.read_csv('../Data/PLAYERIDMAP.csv')

In [84]:
new_comb_df = cleaned_df.merge(name_data[['MLBID', 'MLBNAME', 'POS']], how='left',left_on='pitcher_id', right_on='MLBID')

In [85]:
new_comb_df.head()

Unnamed: 0,pitcher_id,batter_id,pitcher_first_name,pitcher_last_name,ab_id,event,pitch_num,pitch_type,px,sz_bot,...,x0,y,y0,z0,pfx_x,pfx_z,pitcher_full_name,MLBID,MLBNAME,POS
0,452657,572761,Jon,Lester,2015000000.0,Groundout,1.0,FF,0.416,1.72,...,2.28,158.78,50.0,5.302,4.16,10.93,Jon Lester,452657.0,Jon Lester,P
1,452657,572761,Jon,Lester,2015000000.0,Groundout,2.0,FF,-0.191,1.72,...,2.119,175.41,50.0,5.307,6.57,12.0,Jon Lester,452657.0,Jon Lester,P
2,452657,572761,Jon,Lester,2015000000.0,Groundout,3.0,FF,-0.518,1.72,...,2.127,150.11,50.0,5.313,7.61,10.88,Jon Lester,452657.0,Jon Lester,P
3,452657,572761,Jon,Lester,2015000000.0,Groundout,4.0,FF,-0.641,1.74,...,2.279,187.46348190644315,50.0,5.21,1.17,6.45,Jon Lester,452657.0,Jon Lester,P
4,452657,572761,Jon,Lester,2015000000.0,Groundout,5.0,CU,-1.821,1.72,...,2.179,177.2428287731686,50.0,5.557,-8.43,-1.65,Jon Lester,452657.0,Jon Lester,P


In [86]:
new_comb_df[(new_comb_df.pitcher_full_name.isnull()) & (new_comb_df.MLBNAME.notnull())].pitcher_full_name.index

Int64Index([2848296, 2848297, 2848298, 2848299, 2848300, 2848301, 2848302,
            2848303, 2848304, 2848324,
            ...
            3568972, 3568973, 3568974, 3568975, 3568976, 3568977, 3568978,
            3568979, 3568980, 3568981],
           dtype='int64', length=65597)

In [88]:
new_comb_df.pitcher_full_name.iloc[2848296]

nan

In [89]:
new_comb_df.MLBNAME.iloc[2848296]

'Brandon Brennan'

In [96]:
new_comb_df[new_comb_df.MLBNAME != new_comb_df.pitcher_full_name].isnull().sum()

pitcher_id                 0
batter_id                  0
pitcher_first_name     81024
pitcher_last_name      81024
ab_id                      0
event                      0
pitch_num                  0
pitch_type                 0
px                         0
sz_bot                     0
sz_top                     0
break_angle                0
break_length               0
break_y                    0
ax                         0
ay                         0
az                         0
sz_bot                     0
sz_top                     0
vx0                        0
vy0                        0
vz0                        0
x                          0
x0                         0
y                          0
y0                         0
z0                         0
pfx_x                      0
pfx_z                      0
pitcher_full_name      81024
MLBID                 172732
MLBNAME               172732
POS                   172732
dtype: int64

In [97]:
new_comb_df[new_comb_df.MLBNAME != new_comb_df.pitcher_full_name].shape

(319324, 33)

For rows where pitcher_full_name is null and MLBNAME is not, filling in the name from MLBNAME:

In [133]:
#Filling all matching rows in one vectorized assignment instead of looping row by row:
missing_name = new_comb_df.pitcher_full_name.isnull() & new_comb_df.MLBNAME.notnull()
new_comb_df.loc[missing_name, 'pitcher_full_name'] = new_comb_df.loc[missing_name, 'MLBNAME']

In [134]:
new_comb_df.loc[2848296, 'MLBNAME']

'Brandon Brennan'

In [135]:
new_comb_df.loc[2848296, 'pitcher_full_name']

'Brandon Brennan'

In [136]:
new_comb_df.isnull().sum()

pitcher_id                 0
batter_id                  0
pitcher_first_name     81024
pitcher_last_name      81024
ab_id                      0
event                      0
pitch_num                  0
pitch_type                 0
px                         0
sz_bot                     0
sz_top                     0
break_angle                0
break_length               0
break_y                    0
ax                         0
ay                         0
az                         0
sz_bot                     0
sz_top                     0
vx0                        0
vy0                        0
vz0                        0
x                          0
x0                         0
y                          0
y0                         0
z0                         0
pfx_x                      0
pfx_z                      0
pitcher_full_name      15427
MLBID                 172732
MLBNAME               172732
POS                   172732
dtype: int64

In [137]:
new_comb_df.tail(10)

Unnamed: 0,pitcher_id,batter_id,pitcher_first_name,pitcher_last_name,ab_id,event,pitch_num,pitch_type,px,sz_bot,...,x0,y,y0,z0,pfx_x,pfx_z,pitcher_full_name,MLBID,MLBNAME,POS
3569068,571704,663993,Ken,Giles,2019185000.0,Groundout,1.0,SL,-0.37,1.57,...,-2.21,178.71,50.0,5.95,-1.85,0.71,Ken Giles,571704.0,Ken Giles,P
3569069,571704,663993,Ken,Giles,2019185000.0,Groundout,2.0,SL,0.47,1.58,...,-2.1,182.39,50.0,5.9,-2.2,1.45,Ken Giles,571704.0,Ken Giles,P
3569070,571704,622110,Ken,Giles,2019185000.0,Groundout,1.0,FF,1.38,1.61,...,-1.83,192.5,50.0,5.65,-6.07,11.21,Ken Giles,571704.0,Ken Giles,P
3569071,571704,622110,Ken,Giles,2019185000.0,Groundout,2.0,SL,-0.54,1.55,...,-2.27,179.5,50.0,5.88,-1.16,1.37,Ken Giles,571704.0,Ken Giles,P
3569072,571704,622110,Ken,Giles,2019185000.0,Groundout,3.0,SL,1.7,1.61,...,-1.8,243.88,50.0,5.56,-1.98,4.19,Ken Giles,571704.0,Ken Giles,P
3569073,571704,622110,Ken,Giles,2019185000.0,Groundout,4.0,FF,0.3,1.65,...,-2.04,184.97,50.0,5.69,-6.9,10.64,Ken Giles,571704.0,Ken Giles,P
3569074,571704,605421,Ken,Giles,2019185000.0,Strikeout,1.0,SL,1.0,1.6,...,-2.16,249.03,50.0,5.53,-3.2,3.0,Ken Giles,571704.0,Ken Giles,P
3569075,571704,605421,Ken,Giles,2019185000.0,Strikeout,2.0,FF,0.36,1.56,...,-2.05,184.22,50.0,5.68,-5.68,10.65,Ken Giles,571704.0,Ken Giles,P
3569076,571704,605421,Ken,Giles,2019185000.0,Strikeout,3.0,SL,-0.26,1.6,...,-2.14,168.51,50.0,5.87,-1.84,1.31,Ken Giles,571704.0,Ken Giles,P
3569077,571704,605421,Ken,Giles,2019185000.0,Strikeout,4.0,SL,0.22,1.53,...,-2.17,210.07,50.0,5.67,-2.56,0.89,Ken Giles,571704.0,Ken Giles,P


Dropping the helper columns from the merge, then the rows that still have no pitcher name:

In [153]:
new_comb_df.drop(columns=['MLBID', 'MLBNAME', 'POS'], inplace=True)

In [154]:
final_cleaned = new_comb_df[(new_comb_df.pitcher_full_name.notnull())]

In [155]:
final_cleaned.shape

(3553651, 30)

In [156]:
last10_df.shape

(3555834, 19)

With those gone, comparing new_comb_df to last10_df:

In [144]:
last10_df.tail(10)

Unnamed: 0,pitcher_id,batter_id,event,pitcher_full_name,pitch_type,Season,last_10_ff,last_10_sl,last_10_ft,last_10_ch,last_10_cu,last_10_si,last_10_fc,last_10_kc,last_10_fs,last_10_kn,last_10_ep,last_10_fo,last_10_sc
3555824,571704,663993,Groundout,Ken Giles,SL,2019,1,7,2,0,0,0,0,0,0,0,0,0,0
3555825,571704,663993,Groundout,Ken Giles,SL,2019,1,7,2,0,0,0,0,0,0,0,0,0,0
3555826,571704,622110,Groundout,Ken Giles,FF,2019,1,7,2,0,0,0,0,0,0,0,0,0,0
3555827,571704,622110,Groundout,Ken Giles,SL,2019,2,6,2,0,0,0,0,0,0,0,0,0,0
3555828,571704,622110,Groundout,Ken Giles,SL,2019,2,7,1,0,0,0,0,0,0,0,0,0,0
3555829,571704,622110,Groundout,Ken Giles,FF,2019,2,8,0,0,0,0,0,0,0,0,0,0,0
3555830,571704,605421,Strikeout,Ken Giles,SL,2019,3,7,0,0,0,0,0,0,0,0,0,0,0
3555831,571704,605421,Strikeout,Ken Giles,FF,2019,3,7,0,0,0,0,0,0,0,0,0,0,0
3555832,571704,605421,Strikeout,Ken Giles,SL,2019,3,7,0,0,0,0,0,0,0,0,0,0,0
3555833,571704,605421,Strikeout,Ken Giles,SL,2019,3,7,0,0,0,0,0,0,0,0,0,0,0


The row counts still differ (3,553,651 rows in final_cleaned vs. 3,555,834 in last10_df), so the indexes do not line up and I won't be able to merge on index.  The second option is to add a pitch_id column, which is simply ab_id with the pitch number appended.

In [151]:
pn = str(new_comb_df.loc[2848296, 'pitch_num']).split('.')[0]

In [150]:
an = str(new_comb_df.loc[2848296, 'ab_id']).split('.')[0]

In [152]:
an + '_' + pn

'2019000359_1'

That is the format I'll want, so I'll add a new column to have this information:

In [173]:
#Taking an explicit copy first, so the new column is set on final_cleaned itself rather than on a slice of new_comb_df:
final_cleaned = final_cleaned.copy()
final_cleaned['pitch_id'] = ''


In [174]:
#Building pitch_id in one vectorized pass; the row-by-row .loc loop was too slow to finish on ~3.5 million rows:
final_cleaned['pitch_id'] = final_cleaned['ab_id'].astype('int64').astype(str) + '_' + final_cleaned['pitch_num'].astype('int64').astype(str)

In [None]:
final_cleaned['pitch_id']

In [None]:
final_cleaned[final_cleaned.duplicated(subset=['pitch_id'])]
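
Once pitch_id is unique, the two feature sets could be joined on it instead of on index. The sketch below uses made-up toy frames and a hypothetical helper `add_pitch_id`. It assumes both frames carry ab_id and pitch_num, which would mean adding those columns to the last-10 query, since last10_df as built above does not include them.

```python
import pandas as pd

def add_pitch_id(df):
    """Return a copy of df with pitch_id = '<ab_id>_<pitch_num>'."""
    out = df.copy()
    out["pitch_id"] = (
        out["ab_id"].astype("int64").astype(str)
        + "_"
        + out["pitch_num"].astype("int64").astype(str)
    )
    return out

# Toy frames (made up), standing in for the two feature sets:
pitch_features = add_pitch_id(pd.DataFrame({
    "ab_id": [2019000359.0, 2019000359.0, 2019000360.0],
    "pitch_num": [1.0, 2.0, 1.0],
    "break_length": [4.1, 8.3, 5.0],
}))
history_features = add_pitch_id(pd.DataFrame({
    "ab_id": [2019000359.0, 2019000359.0, 2019000361.0],
    "pitch_num": [1.0, 2.0, 1.0],
    "last_10_ff": [3, 4, 0],
}))

# validate="one_to_one" makes pandas raise if pitch_id repeats on either side.
merged = pitch_features.merge(
    history_features[["pitch_id", "last_10_ff"]],
    on="pitch_id",
    how="inner",
    validate="one_to_one",
)
```

An inner join keeps only pitches present in both frames, so the differing row counts stop being a problem, at the cost of dropping the unmatched rows.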