# Pipeline Testing
The purpose of this notebook is to begin testing scikit-learn's pipeline functionality, in order to build out a modeling pipeline.  The general form is:

1. Random Forest Classification of Pitch Type
2. Linear Regression of X-Coordinate of Pitch Location (Px) - pitch type used as a feature
3. Linear Regression of Z-Coordinate of Pitch Location (Pz) - pitch type and px used as features
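
The three stages above can be sketched as a chain in which each model's prediction becomes a feature for the next. This is only a minimal illustration on synthetic data, not the notebook's actual functions: the frame `df`, the reduced `base_cols` list, and the train/test split are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data: column names mirror the notebook, values are random.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame({
    "b_count": rng.integers(0, 4, n),
    "s_count": rng.integers(0, 3, n),
    "pitch_type": rng.integers(0, 3, n),  # assumed to be label-encoded already
    "px": rng.normal(0.0, 1.0, n),
    "pz": rng.normal(2.5, 1.0, n),
})
base_cols = ["b_count", "s_count"]  # hypothetical, much shorter than feature_cols
train, test = df.iloc[:240], df.iloc[240:].copy()

# Stage 1: classify pitch type from the situational features.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(train[base_cols], train["pitch_type"])
test["pitch_type_pred"] = clf.predict(test[base_cols])

# Stage 2: regress px on the base features plus pitch type.
px_model = LinearRegression().fit(train[base_cols + ["pitch_type"]], train["px"])
X_px = test[base_cols].assign(pitch_type=test["pitch_type_pred"])
test["px_pred"] = px_model.predict(X_px)

# Stage 3: regress pz on the base features, pitch type, and px.
pz_model = LinearRegression().fit(train[base_cols + ["pitch_type", "px"]], train["pz"])
X_pz = X_px.assign(px=test["px_pred"])
test["pz_pred"] = pz_model.predict(X_pz)
```

Each regression is trained on the true upstream values and scored on the predicted ones, so errors from stage 1 carry into stages 2 and 3. The sketch also feeds the integer pitch code straight into the linear models; one-hot encoding would likely be a better choice in practice.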

Importing packages:

In [1]:
import pickle
from sqlalchemy import create_engine
import pandas as pd
from importlib import reload
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
%config InlineBackend.figure_formats = ['retina']
%matplotlib inline

plt.rcParams['figure.figsize'] = (9, 6)
sns.set(context='notebook', style='whitegrid', font_scale=1.2)

In [2]:
from scipy import stats
from sklearn.model_selection import train_test_split, KFold
from sklearn.linear_model import LinearRegression, Lasso, LassoCV, Ridge, RidgeCV, ElasticNetCV
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

Loading the pickled initial data to work with:

In [3]:
pwd

'/Users/patrickbovard/Documents/GitHub/metis_final_project/Pitch_Classification'

In [4]:
with open('../Data/train_df_clusters.pickle','rb') as read_file:
    pitch_df = pickle.load(read_file)

To decide which pitchers to include in the pipeline, I'll look at the 50 pitchers with the most pitches thrown:

In [5]:
pitch_df.pitcher_full_name.value_counts().head(50)

Max Scherzer         13626
Chris Sale           13284
Justin Verlander     12999
Jose Quintana        12944
Chris Archer         12760
Rick Porcello        12745
Jon Lester           12566
Corey Kluber         12480
Gio Gonzalez         12439
Julio Teheran        12125
Zack Greinke         12092
Jake Arrieta         12028
Cole Hamels          12016
Trevor Bauer         11860
Kyle Gibson          11829
Jacob deGrom         11775
Gerrit Cole          11772
James Shields        11768
Marco Estrada        11763
Jake Odorizzi        11719
Dallas Keuchel       11708
J.A. Happ            11597
Kevin Gausman        11540
Tanner Roark         11296
David Price          11205
Mike Leake           11141
Mike Fiers           11093
Ian Kennedy          11026
Kyle Hendricks       10952
Carlos Martinez      10920
Carlos Carrasco      10913
Andrew Cashner       10907
CC Sabathia          10654
Masahiro Tanaka      10589
Jeff Samardzija      10573
Madison Bumgarner    10551
Jason Hammel         10544
W

In [6]:
pitch_df.pitcher_full_name.value_counts().head(50).sum()

564235

So the top 50 pitchers would cover 564,235 total pitches from 2015-2018.
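
Restricting the data to those pitchers could look something like the following. It is a sketch on a toy frame: `toy_df`, `top_n`, and `subset` are names made up for the example, and only the `pitcher_full_name` column comes from the real data.

```python
import pandas as pd

# Toy stand-in for pitch_df: one row per pitch.
toy_df = pd.DataFrame({
    "pitcher_full_name": ["A"] * 5 + ["B"] * 3 + ["C"] * 2 + ["D"],
})

top_n = 2  # the notebook looks at 50
# value_counts() sorts by count, descending, so head() keeps the busiest pitchers.
top_pitchers = toy_df["pitcher_full_name"].value_counts().head(top_n).index
subset = toy_df[toy_df["pitcher_full_name"].isin(top_pitchers)]

print(list(top_pitchers))  # ['A', 'B']
print(len(subset))         # 8
```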

In [7]:
pitch_df.columns

Index(['inning', 'batter_id', 'pitcher_id', 'top', 'ab_id', 'p_score', 'stand',
       'p_throws', 'event', 'home_team', 'away_team', 'b_score', 'on_1b',
       'on_2b', 'on_3b', 'px', 'pz', 'zone', 'pitch_type', 'start_speed',
       'type', 'b_count', 's_count', 'outs', 'pitch_num', 'last_pitch_type',
       'last_pitch_px', 'last_pitch_pz', 'last_pitch_speed',
       'pitcher_full_name', 'pitcher_run_diff', 'hitter_full_name',
       'Date_Time_Date', 'Season', 'cumulative_pitches', 'cumulative_ff_rate',
       'cumulative_sl_rate', 'cumulative_ft_rate', 'cumulative_ch_rate',
       'cumulative_cu_rate', 'cumulative_si_rate', 'cumulative_fc_rate',
       'cumulative_kc_rate', 'cumulative_fs_rate', 'cumulative_kn_rate',
       'cumulative_ep_rate', 'cumulative_fo_rate', 'cumulative_sc_rate',
       'Name', 'Cluster'],
      dtype='object')

Importing the pipeline package:

In [8]:
from sklearn.pipeline import Pipeline, make_pipeline
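
As a quick reminder of how the two imports differ, here is a small sketch on synthetic data. `Pipeline` takes explicitly named steps, while `make_pipeline` derives the step names from the lowercased class names. The scaler-plus-regression combination is just an illustrative choice, not necessarily what the project functions use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 3.0  # noiseless linear target

# Explicitly named steps.
pipe = Pipeline([("scale", StandardScaler()), ("lr", LinearRegression())])
pipe.fit(X, y)

# Step names generated automatically from the class names.
auto = make_pipeline(StandardScaler(), LinearRegression())
auto.fit(X, y)

print(list(auto.named_steps))  # ['standardscaler', 'linearregression']
```

Calling `fit` on the pipeline fits the scaler and the regression in order, and `predict` or `score` applies the same scaling before the final estimator.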

Importing the regression and pitch classification functions:

In [9]:
from location_regression_functions import *
from pitch_cat_functions import *

In [10]:
from classification_location_combo import *

In [11]:
feature_cols = ['Cluster','inning', 'top', 'on_1b', 'on_2b', 'on_3b', 'b_count', 's_count', 'outs', 'stand_R',
       'pitcher_run_diff','last_pitch_speed', 'last_pitch_px', 'last_pitch_pz','pitch_num','cumulative_pitches',
       'cumulative_ff_rate', 'cumulative_sl_rate', 'cumulative_ft_rate',
       'cumulative_ch_rate', 'cumulative_cu_rate', 'cumulative_si_rate',
       'cumulative_fc_rate', 'cumulative_kc_rate', 'cumulative_fs_rate',
       'cumulative_kn_rate', 'cumulative_ep_rate', 'cumulative_fo_rate',
       'cumulative_sc_rate']

In [12]:
pitch_type_regression_rf('Max Scherzer', pitch_df, feature_cols)

Here is the coding for pitch type:
{'FF': 0, 'SL': 1, 'CH': 2, 'CU': 3, 'FC': 4, 'FT': 5}


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)


Random Forest Results for Max Scherzer
Confusion Matrix for Fold 1
[[1219  159   40   13   14    1]
 [ 373  152    9    0    2    0]
 [ 313   25   21    5   10    1]
 [ 174    9    4    5    2    0]
 [ 134    1    6    1    9    0]
 [   9    9    1    0    1    1]]


Index(['Cluster', 'inning', 'top', 'on_1b', 'on_2b', 'on_3b', 'b_count',
       's_count', 'outs', 'stand_R', 'pitcher_run_diff', 'last_pitch_speed',
       'last_pitch_px', 'last_pitch_pz', 'pitch_num', 'cumulative_pitches',
       'cumulative_ff_rate', 'cumulative_sl_rate', 'cumulative_ft_rate',
       'cumulative_ch_rate', 'cumulative_cu_rate', 'cumulative_si_rate',
       'cumulative_fc_rate', 'cumulative_kc_rate', 'cumulative_fs_rate',
       'cumulative_kn_rate', 'cumulative_ep_rate', 'cumulative_fo_rate',
       'cumulative_sc_rate'],
      dtype='object')


KeyError: "['px'] not found in axis"
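
This `KeyError` is the message pandas raises when `DataFrame.drop` is asked to remove a label that does not exist. The `Index` printed just before the error has no `px` column, so one plausible reading (an assumption, since the function source is not shown here) is that the frame had already been cut down to `feature_cols` before `px` was dropped. The sketch below reproduces the error on a toy frame and shows one way to make the drop tolerant of a missing column.

```python
import pandas as pd

# Toy frame without a 'px' column (hypothetical stand-in for the feature matrix).
X = pd.DataFrame({"inning": [1, 2], "outs": [0, 1]})

message = None
try:
    X.drop(columns=["px"])  # 'px' is not a column, so this raises KeyError
except KeyError as err:
    message = str(err)
print(message)

# errors="ignore" skips labels that are missing instead of raising.
safe = X.drop(columns=["px"], errors="ignore")
print(list(safe.columns))  # ['inning', 'outs']
```

Whether ignoring the missing column is right depends on the function's intent; if `px` is needed as the regression target, the better fix is probably to keep it in the frame until the target has been split off.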