# Pitch Location Regression
The purpose of this notebook is to begin working on regression to predict pitch location via its x and z coordinates (px, pz).  In the final product, this will come after classification of the pitch type.  Here, I'll first predict px, then pz, as the horizontal location of the pitch is greatly influenced by the pitch type and its motion.

Importing packages:

In [1]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [2]:
# pandas, numpy and seaborn are already imported in the cell above
from scipy import stats
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Loading the pickled initial data to work with:

In [3]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project/Pitch_Classification'

In [4]:
with open('../Data/cleaned_pitch_df.pickle','rb') as read_file:
    pitch_df = pickle.load(read_file)

In [5]:
pitch_df.drop(columns=['Pitch_Family'], inplace=True)

In [6]:
pitch_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,stand,p_throws,event,home_team,...,pitch_num,last_pitch_type,last_pitch_px,last_pitch_pz,last_pitch_speed,pitcher_full_name,pitcher_run_diff,hitter_full_name,Date_Time_Date,Season
0,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,1.0,,,,,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
1,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,2.0,FF,0.416,2.963,92.9,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
2,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,3.0,FF,-0.191,2.347,92.8,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
3,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,4.0,FF,-0.518,3.284,94.1,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015
4,1.0,572761,452657,1.0,2015000000.0,0.0,L,L,Groundout,chn,...,5.0,FF,-0.641,1.221,91.0,Jon Lester,0.0,Matt Carpenter,2015-04-05,2015


For regression, the following columns will be used as features:
- 'inning', 'top', 'stand', 'p_throws', 'on_1b', 'on_2b', 'on_3b', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type', 'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed', 'pitcher_run_diff'
- pitch_type will also be used. In my final model, this will be the predicted pitch type, but for now, as a proof of concept, I'll be using the actual pitch_type values.

# Dataframe Prep

## Split: Train/Val and Test:

For purposes of training/validating and testing my model(s), I'll be splitting as follows:
- Train/Val: 2015-2018 seasons
- Test: 2019 season
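
A season-based holdout like the one listed above can be sketched on a toy frame. The names `toy_df`, `train_val_toy`, and `test_toy` are illustrative assumptions and are not objects from this notebook; only the `Season` column mirrors the real data.

```python
import pandas as pd

# Toy stand-in for pitch_df; only the Season column matters for the split.
toy_df = pd.DataFrame({
    "Season": [2015, 2016, 2017, 2018, 2019, 2019],
    "px": [0.42, -0.19, -0.52, -0.64, 0.10, 0.31],
})

# Hold out the most recent season so no later pitches leak into training.
test_mask = toy_df["Season"] == 2019
train_val_toy = toy_df[~test_mask].copy()  # 2015-2018: train/validation
test_toy = toy_df[test_mask].copy()        # 2019: final test set

print(train_val_toy.shape, test_toy.shape)
```

Splitting by season rather than at random keeps the test set strictly later in time than the training data, which is closer to how the model would be used on a new season.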

In [7]:
pitch_df[pitch_df.Season == 2019].shape

(707463, 34)

In [8]:
pitch_df[pitch_df.Season != 2019].shape

(2848371, 34)

Based on the numbers, this is an ~80/20 split in data.
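
As a quick check of that ratio, the two row counts shown above can be turned into shares directly. The variable names here are only illustrative.

```python
n_test = 707_463         # rows with Season == 2019
n_train_val = 2_848_371  # rows from the 2015-2018 seasons

total = n_train_val + n_test
train_val_share = n_train_val / total
test_share = n_test / total

print(f"train/val: {train_val_share:.1%}, test: {test_share:.1%}")
# train/val: 80.1%, test: 19.9%
```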

In [9]:
train_df = pitch_df[pitch_df.Season != 2019]

In [10]:
train_df.pitcher_full_name.value_counts()

Max Scherzer        13479
Justin Verlander    12810
Chris Archer        12760
Jose Quintana       12692
Chris Sale          12689
                    ...  
Phillip Ervin           3
Mark Reynolds           3
Alexi Amarista          3
Anthony Rizzo           2
Chris Denorfia          2
Name: pitcher_full_name, Length: 1329, dtype: int64

## Feature Prep

### One Hot Encoding:

Here, I'll need to one-hot encode a few columns, specifically:

- stand (hitter hand), p_throws (pitcher throwing hand), last_pitch_type, pitch_type
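
The notebook does this with `column_ohe_maker`, a helper whose internals are not shown here. As a rough sketch of the same idea, `pandas.get_dummies` can encode categorical columns on a toy frame. The use of `drop_first=True` is an assumption, made because only `stand_R` (and no `stand_L`) appears in the encoded columns further down; the helper may work differently.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "stand": ["L", "R", "R"],
    "pitch_type": ["FF", "SL", "FF"],
    "b_count": [0, 1, 2],
})

# Assumption: drop the first level of each category to avoid redundant columns.
encoded = pd.get_dummies(
    toy,
    columns=["stand", "pitch_type"],
    drop_first=True,
    dtype=float,
)

print(encoded.columns.tolist())
```

Dropping one level per category avoids perfectly collinear dummy columns, which matters for an unregularized linear regression with an intercept.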

In [11]:
ohe_cols = ['stand', 'p_throws', 'last_pitch_type', 'pitch_type']

In [12]:
scherzer_df = train_df[train_df.pitcher_full_name == 'Max Scherzer']

In [13]:
from pitch_cat_functions import *

In [14]:
new_df = column_ohe_maker(scherzer_df, ohe_cols)

In [15]:
new_df.head()

Unnamed: 0,inning,batter_id,pitcher_id,top,ab_id,p_score,event,home_team,away_team,b_score,...,last_pitch_type_FF,last_pitch_type_FT,last_pitch_type_SL,last_pitch_type_UN,last_pitch_type_None,pitch_type_CU,pitch_type_FC,pitch_type_FF,pitch_type_FT,pitch_type_SL
2396,1.0,434158,453286,1.0,2015001000.0,0.0,Walk,was,nyn,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0
2397,1.0,434158,453286,1.0,2015001000.0,0.0,Walk,was,nyn,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2398,1.0,434158,453286,1.0,2015001000.0,0.0,Walk,was,nyn,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2399,1.0,434158,453286,1.0,2015001000.0,0.0,Walk,was,nyn,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2400,1.0,434158,453286,1.0,2015001000.0,0.0,Walk,was,nyn,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [16]:
new_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'event',
       'home_team', 'away_team', 'b_score', 'on_1b', 'on_2b', 'on_3b', 'px',
       'pz', 'zone', 'start_speed', 'type', 'b_count', 's_count', 'outs',
       'pitch_num', 'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'stand_R', 'last_pitch_type_CU',
       'last_pitch_type_FC', 'last_pitch_type_FF', 'last_pitch_type_FT',
       'last_pitch_type_SL', 'last_pitch_type_UN', 'last_pitch_type_None',
       'pitch_type_CU', 'pitch_type_FC', 'pitch_type_FF', 'pitch_type_FT',
       'pitch_type_SL'],
      dtype='object')

### Linear Regression Test:

Some features are numerical and will need to be standard scaled.  The feature set for this first test is defined in `columns` below:

In [17]:
from location_regression_functions import *

In [18]:
columns = ['inning', 'b_count', 's_count', 'outs', 'pitcher_run_diff', 'stand_R',
          'last_pitch_type_CU',
       'last_pitch_type_FC', 'last_pitch_type_FF', 'last_pitch_type_FT',
       'last_pitch_type_SL', 'last_pitch_type_UN', 'last_pitch_type_None',
       'pitch_type_CU', 'pitch_type_FC', 'pitch_type_FF', 'pitch_type_FT',
       'pitch_type_SL']
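
One way to apply the standard scaling mentioned above only to the numerical columns is a `ColumnTransformer` inside a `Pipeline`. This is a minimal sketch on synthetic data: `num_cols`, `toy_X`, and `toy_y` are hypothetical, and the helpers in `location_regression_functions` may handle scaling differently or not at all.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "inning": rng.integers(1, 10, size=50).astype(float),
    "b_count": rng.integers(0, 4, size=50).astype(float),
    "stand_R": rng.integers(0, 2, size=50).astype(float),  # already 0/1
})
toy_y = rng.normal(size=50)

num_cols = ["inning", "b_count"]  # assumed numeric columns to scale

# Scale the numeric columns; pass the one-hot columns through untouched.
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), num_cols)],
    remainder="passthrough",
)
model = Pipeline([("preprocess", preprocess), ("lr", LinearRegression())])
model.fit(toy_X, toy_y)

# Scaled columns come first in the output, passthrough columns after.
scaled = model.named_steps["preprocess"].transform(toy_X)
```

Keeping the scaler inside the pipeline means it is refit on each training fold during cross-validation, so validation rows never influence the scaling statistics.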

In [19]:
X = new_df[columns]

In [20]:
y = new_df['px']

In [21]:
split_and_train_val_simple_lr_w_cv(X, y)

Simple Linear Regression w/ KFOLD CV Results:
Simple regression scores:  [0.1289906699265848, 0.15206934831818453, 0.1822668226018257, 0.1517150641576287, 0.12175620634510198] 

Simple mean cv r^2: 0.147 +- 0.021
MAE Scores: [0.6234395347741104, 0.6377879752975786, 0.6324862514074213, 0.6453973941126377, 0.6507615022407948]
Avg. CV MAE: 0.6379745315665086
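
The source of `split_and_train_val_simple_lr_w_cv` is not shown in this notebook. A hedged sketch of how per-fold R^2 and MAE figures like the ones above could be produced is given below; `cv_linear_regression`, the fold count, and the synthetic data are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold


def cv_linear_regression(X, y, n_splits=5, seed=42):
    """Return per-fold R^2 and MAE for a plain linear regression."""
    X, y = np.asarray(X), np.asarray(y)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    r2_scores, mae_scores = [], []
    for train_idx, val_idx in kf.split(X):
        # Fit on the training fold, score on the held-out fold.
        lr = LinearRegression().fit(X[train_idx], y[train_idx])
        pred = lr.predict(X[val_idx])
        r2_scores.append(r2_score(y[val_idx], pred))
        mae_scores.append(mean_absolute_error(y[val_idx], pred))
    return r2_scores, mae_scores


# Synthetic data standing in for the feature matrix and px target.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.5, size=200)

r2_scores, mae_scores = cv_linear_regression(X_toy, y_toy)
print(f"mean cv r^2: {np.mean(r2_scores):.3f} +- {np.std(r2_scores):.3f}")
print(f"avg cv MAE: {np.mean(mae_scores):.3f}")
```

Since px is measured in feet, an MAE near 0.64 on the real data means the horizontal prediction is off by roughly two thirds of a foot on average.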


Checking on pz, using px as a feature:

In [22]:
cols_pz = ['inning', 'b_count', 's_count', 'outs', 'pitcher_run_diff', 'stand_R',
          'last_pitch_type_CU',
       'last_pitch_type_FC', 'last_pitch_type_FF', 'last_pitch_type_FT',
       'last_pitch_type_SL', 'last_pitch_type_UN', 'last_pitch_type_None',
       'pitch_type_CU', 'pitch_type_FC', 'pitch_type_FF', 'pitch_type_FT',
       'pitch_type_SL', 'px']

In [23]:
X = new_df[cols_pz]

In [24]:
y = new_df['pz']

In [25]:
split_and_train_val_simple_lr_w_cv(X, y)

Simple Linear Regression w/ KFOLD CV Results:
Simple regression scores:  [0.3258937444050931, 0.3473746158483699, 0.321904875257912, 0.33571035802980265, 0.36807422827988323] 

Simple mean cv r^2: 0.340 +- 0.017
MAE Scores: [0.5590737226418868, 0.5561437669651792, 0.5564611497729697, 0.5649071252121297, 0.5624505363500322]
Avg. CV MAE: 0.5598072601884395
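
The `validation_comparer` call in the next cell also reports a Lasso fit with named coefficients. Its implementation is not shown, so the following is only a sketch of one way to get comparable output with `LassoCV` on standardized features; the feature names and synthetic data are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative features and one pure-noise feature.
rng = np.random.default_rng(1)
feature_names = ["s_count", "b_count", "noise"]
X_toy = rng.normal(size=(300, 3))
y_toy = -0.7 * X_toy[:, 0] + 0.3 * X_toy[:, 1] + rng.normal(scale=0.5, size=300)

# Standardize first so the L1 penalty treats every feature on the same scale.
lasso = Pipeline([
    ("scale", StandardScaler()),
    ("lasso", LassoCV(cv=5, random_state=0)),
])
lasso.fit(X_toy, y_toy)

# Pair each feature name with its fitted coefficient.
coefs = list(zip(feature_names, lasso.named_steps["lasso"].coef_))
print("Lasso R^2 (training data):", lasso.score(X_toy, y_toy))
print("Lasso coefficients:", coefs)
```

Coefficients that Lasso shrinks to (or near) zero are candidates for removal, which is a simple way to read feature importance from the fitted model.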


In [26]:
validation_comparer(X,y)

Simple Linear Regression w/ KFOLD CV Results:
Simple regression scores:  [0.3258937444050931, 0.3473746158483699, 0.321904875257912, 0.33571035802980265, 0.36807422827988323] 

Simple mean cv r^2: 0.340 +- 0.017
MAE Scores: [0.5590737226418868, 0.5561437669651792, 0.5564611497729697, 0.5649071252121297, 0.5624505363500322]
Avg. CV MAE: 0.5598072601884395


Lasso Linear Regression w/ CV Results:
Lasso R^2: 0.3424695107123906
Lasso mae: 0.5588967679431889
Lasso Coefficients: [('inning', -0.016094532087446142), ('b_count', 0.026134736340032123), ('s_count', -0.06900330067144024), ('outs', 0.010980627743810464), ('pitcher_run_diff', 0.012555321378606082), ('stand_R', -0.025381930847214643), ('last_pitch_type_CU', -0.028608828839731558), ('last_pitch_type_FC', 0.016615303226127538), ('last_pitch_type_FF', -0.02979824280062424), ('last_pitch_type_FT', 0.0022509001162964158), ('last_pitch_type_SL', -0.039651004280263746), ('last_pitch_type_UN', 0.0015242022962364207), ('last_pitch_type_None',