# Logistic Regression Modeling and Analysis

## This notebook builds a Logistic Regression pipeline and then interprets the fitted model.

The workflow is as follows:

1. Read the data in with the custom function built at the end of the `Baseball_EDA` notebook.
2. Build a Pipeline object with two components: a random undersampler and the classification model. A random undersampler was chosen because the target classes are imbalanced.
3. Fit the pipeline to the training data and verify the results using 5-fold cross-validation. This step uses a custom function from the `Evaluation.py` module.
4. View the coefficient values of the Logistic Regression and analyze them.
5. Apply the pipeline to the testing data to get test results that will be compared with other models.
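Step 2 relies on random undersampling. As a rough illustration of the idea only (this is not the `imblearn` implementation), the sketch below uses the standard library to drop rows at random until every class is as small as the rarest one. The helper name `random_undersample` and the toy labels are hypothetical.

```python
import random
from collections import Counter


def random_undersample(rows, labels, seed=777):
    """Randomly drop rows so every class has as many rows as the rarest class.

    Illustrative only: `random_undersample` is a hypothetical helper,
    not part of imblearn.
    """
    rng = random.Random(seed)

    # Group row indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Every class is cut down to the size of the smallest class.
    target = min(len(idxs) for idxs in by_class.values())
    keep = []
    for idxs in by_class.values():
        keep.extend(rng.sample(idxs, target))
    keep.sort()  # preserve the original row order

    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data: 8 rows of the majority class (0) and 3 of the minority class (1).
toy_labels = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0]
toy_rows = [[float(i)] for i in range(len(toy_labels))]

X_bal, y_bal = random_undersample(toy_rows, toy_labels)
print(Counter(y_bal))  # both classes now have the same count
```

Placing the sampler inside an `imblearn` pipeline means the resampling is applied only when the pipeline is fitted, so the validation folds and the test set keep their original class balance.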

In [1]:
#start with all dependencies

import numpy as np
import pandas as pd
from Evaluation import *
from data_import import prepare_data
import sklearn
import sklearn.metrics as metrics
from sklearn.model_selection import cross_val_score, cross_validate, train_test_split
from sklearn.impute import SimpleImputer  #replaces Imputer, which was removed from sklearn.preprocessing in scikit-learn 0.22
from imblearn.under_sampling import RandomUnderSampler
from imblearn.pipeline import Pipeline, make_pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeRegressor
import matplotlib.pyplot as plt
%matplotlib inline
import imblearn
import seaborn as sns
sns.set()
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)



In [2]:
#Read in the data as pandas dataframe
file = '../Statcast_data.csv'

#define bsb as the full dataframe
bsb = prepare_data(file)

#filter out the predictors and the target
X = bsb.drop(columns = ['player_name', 'description'])
y = bsb['description']

#quick look to see above worked. 
X.head()



| | release_spin_rate | release_pos_x | release_pos_y | release_pos_z | vx0 | vz0 | vy0 | sz_top | sz_bot | pitch_L | pitch_R | pitch_2-Seam Fastball | pitch_4-Seam Fastball | pitch_Changeup | pitch_Curveball | pitch_Cutter | pitch_Sinker | pitch_Slider | pitch_Split Finger |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2314.0 | 3.2655 | 54.4995 | 5.2575 | -9.8035 | 0.1339 | -138.113 | 3.2971 | 1.5059 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2324.0 | 3.1728 | 54.3094 | 5.3966 | -9.0084 | -2.4218 | -140.5865 | 3.3136 | 1.573 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2521.0 | 3.3517 | 55.082 | 5.1205 | -3.7285 | 1.214 | -117.3223 | 3.9119 | 1.708 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3 | 2329.0 | 3.1334 | 54.0207 | 5.2136 | -12.0533 | -5.1407 | -139.3669 | 3.5553 | 1.5639 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2437.0 | 3.3033 | 54.3597 | 5.0589 | -14.0287 | -3.3434 | -139.8559 | 3.345 | 1.6241 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |


# Logistic Regression pipeline


In [3]:
#split the data into training and testing splits

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 777)

#build the pipeline object
logit_reg = LogisticRegression()
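#sampling_strategy replaces the ratio argument, which was deprecated and later removed in imbalanced-learn
sampler = RandomUnderSampler(sampling_strategy=1.0, random_state=777)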

logit_pipe = make_pipeline(sampler, logit_reg)

logit_pipe_results = cross_validate(logit_pipe, X_train, y_train, 
                            scoring = ['accuracy', 'f1', 'roc_auc'], 
                            cv =5, return_estimator=True, return_train_score = True)

for result in ['train_accuracy', 'test_accuracy', 'train_f1', 'test_f1', 'train_roc_auc', 'test_roc_auc']:
    print(f"Mean {result} Value: {np.mean(logit_pipe_results[result])}")
    print(f"{result} scores: {logit_pipe_results[result]}")
    print() 

Mean train_accuracy Value: 0.5635857129171152
train_accuracy scores: [0.5636941  0.56598677 0.56071626 0.56277176 0.56475967]

Mean test_accuracy Value: 0.56311928652947
test_accuracy scores: [0.56008222 0.56925266 0.56577422 0.55713081 0.56335653]

Mean train_f1 Value: 0.48739862492073505
train_f1 scores: [0.4881992  0.48803991 0.48621492 0.48592542 0.48861367]

Mean test_f1 Value: 0.48642774951834034
test_f1 scores: [0.48344576 0.49030246 0.49019244 0.48158431 0.48661378]

Mean train_roc_auc Value: 0.5816877972121671
train_roc_auc scores: [0.58336976 0.58200869 0.57939504 0.58181378 0.58185172]

Mean test_roc_auc Value: 0.5808802551341297
test_roc_auc scores: [0.5750287  0.58461803 0.58663607 0.57919172 0.57892677]
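The per-fold scores printed above can be condensed into a mean and a spread per metric. The sketch below does this for ROC AUC only, with the values copied from that output; the helper `summarize` is a hypothetical name, and the sample standard deviation is just one reasonable choice of spread.

```python
import statistics

# Fold scores copied from the cross-validation output above.
train_roc_auc = [0.58336976, 0.58200869, 0.57939504, 0.58181378, 0.58185172]
test_roc_auc = [0.5750287, 0.58461803, 0.58663607, 0.57919172, 0.57892677]


def summarize(scores):
    """Return (mean, sample standard deviation) of a list of fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


train_mean, train_sd = summarize(train_roc_auc)
test_mean, test_sd = summarize(test_roc_auc)

print(f"train ROC AUC: {train_mean:.4f} +/- {train_sd:.4f}")
print(f"test  ROC AUC: {test_mean:.4f} +/- {test_sd:.4f}")
print(f"train - test gap: {train_mean - test_mean:.4f}")
```

A small gap between the train and test means, relative to the fold-to-fold spread, would suggest that the model is not overfitting, even though its overall scores are modest.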



In [4]:
logit_pipe_results['estimator'][0][1].coef_

array([[ 1.55520182e-04, -3.96842167e-02,  3.12652119e-03,
         2.47318769e-01, -2.03895176e-02,  6.89648949e-02,
         1.65369807e-02,  6.32119584e-01, -1.18837122e+00,
        -7.73171382e-02,  6.30728391e-02,  5.10233557e-01,
         5.43990349e-01, -4.65057941e-01, -3.90530154e-01,
         1.11082743e-01,  5.70509247e-01, -1.87454290e-01,
        -6.72711982e-01]])

In [5]:
coefs = [coef for coef in logit_pipe_results['estimator'][0][1].coef_ ]
print("Feature Coefficient Values: \n")
for col, coef in zip(X_train.columns, coefs[0]):
    print( col, coef)

Feature Coefficient Values: 

release_spin_rate 0.00015552018184562818
release_pos_x -0.03968421666846841
release_pos_y 0.0031265211877609697
release_pos_z 0.24731876914812
vx0 -0.02038951761297689
vz0 0.06896489494151811
vy0 0.016536980736834222
sz_top 0.6321195837309241
sz_bot -1.1883712184815418
pitch_L -0.07731713823243547
pitch_R 0.0630728390573216
pitch_2-Seam Fastball 0.510233557044374
pitch_4-Seam Fastball 0.5439903488182751
pitch_Changeup -0.46505794130959344
pitch_Curveball -0.3905301542067602
pitch_Cutter 0.11108274348566656
pitch_Sinker 0.5705092474286227
pitch_Slider -0.18745429006050177
pitch_Split Finger -0.6727119823819887


## Interpretation

Looking at the coefficient values, it appears that a fastball and its derivatives, a 2-seam fastball and a sinker, have a positive effect on the predicted chance of a strike. Off-speed pitches, like a changeup, seem to have the opposite effect. Thus, one 'rule' that can be inferred from the results is that throwing a fastball increases the chances of throwing a strike, while throwing an off-speed pitch decreases the likelihood of a strike.
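One way to make the pitch-type coefficients easier to read is to exponentiate them, which turns each log-odds coefficient into an odds ratio. The sketch below is a minimal illustration using the coefficients copied from the output above (they come from the estimator fitted on the first cross-validation fold); the dictionary name `pitch_coefs` is hypothetical.

```python
import math

# Pitch-type coefficients copied from the output above (first-fold estimator).
pitch_coefs = {
    "pitch_2-Seam Fastball": 0.510233557044374,
    "pitch_4-Seam Fastball": 0.5439903488182751,
    "pitch_Changeup": -0.46505794130959344,
    "pitch_Curveball": -0.3905301542067602,
    "pitch_Cutter": 0.11108274348566656,
    "pitch_Sinker": 0.5705092474286227,
    "pitch_Slider": -0.18745429006050177,
    "pitch_Split Finger": -0.6727119823819887,
}

# exp(coefficient) is the factor by which the odds of the positive class are
# multiplied when the dummy goes from 0 to 1 and every other feature is fixed.
odds_ratios = {name: math.exp(coef) for name, coef in pitch_coefs.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<24} odds ratio = {ratio:.2f}")
```

A ratio above 1 pushes the prediction toward the positive class and a ratio below 1 pushes it away. Because every pitch type appears to have its own dummy column, these ratios are probably best read relative to one another rather than against a single baseline pitch.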

Of course, the type of pitch alone is not very useful in determining a strike or a ball. Among the other features, the release position of the pitch in the z dimension (`release_pos_z`) has a positive coefficient, so increasing its value raises the predicted chance of a strike.

Additionally, the thresholds of the strike zone are another important set of features. The basic intuition is that, for an individual pitch, an increase in the top-of-zone threshold (`sz_top`) increases the chances of that pitch being a strike, as shown by the positive coefficient. The opposite effect is seen at the bottom-of-zone threshold (`sz_bot`): if that value increases, the likelihood of the pitch being called a strike decreases. This makes sense. A taller batter has a larger zone, which gives the pitcher more area to work with. A higher threshold for the bottom of the zone shrinks the strike zone, so pitches that would sit near the bottom of the zone for other hitters are less likely to be called strikes for hitters with a higher threshold.
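To see what these two coefficients mean in probability terms, the sketch below shifts a starting probability by `coefficient * delta` on the log-odds scale. The starting probability of 0.5 and the change of 0.1 units (feet, if the columns follow the usual Statcast convention) are hypothetical values chosen for illustration, and the helper names are not from the notebook.

```python
import math

# Coefficients copied from the output above (first-fold estimator).
SZ_TOP_COEF = 0.6321195837309241
SZ_BOT_COEF = -1.1883712184815418


def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Map a probability to log-odds."""
    return math.log(p / (1.0 - p))


def shifted_probability(base_prob, coef, delta):
    """Probability after one feature changes by `delta`, all else held fixed."""
    return sigmoid(logit(base_prob) + coef * delta)


base_prob = 0.5  # hypothetical starting probability of the positive class
delta = 0.1      # hypothetical increase in the zone threshold

print(f"sz_top +{delta}: {shifted_probability(base_prob, SZ_TOP_COEF, delta):.3f}")
print(f"sz_bot +{delta}: {shifted_probability(base_prob, SZ_BOT_COEF, delta):.3f}")
```

The same change in log-odds moves the probability by different amounts depending on the starting probability, which is one reason the coefficients are easier to compare on the log-odds scale.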

This interpretation, however, assumes that the values of the other features are held constant. In delivering a baseball pitch, that assumption simply does not hold; the process is too complex for a pitcher to reproduce the same velocity and movement on every pitch.

Thus, one subjective, human-readable rule that we can infer is that fastballs and related pitches thrown with a higher, more level trajectory may have a higher chance of being called a strike. However, the usefulness of such rules is questionable.

In [6]:
#predict on the test set with the pipeline that was fitted on the first cross-validation fold
predictions = logit_pipe_results['estimator'][0].predict(X_test)

eval_test_set(predictions, y_test)



Accuracy Score: 0.5628182019416248

AUC Score: 0.5846427599332363

F1 Score: 0.48767833981841757

Classification Report: 
               precision    recall  f1-score   support

           0       0.76      0.52      0.62     21437
           1       0.39      0.65      0.49     10186

    accuracy                           0.56     31623
   macro avg       0.57      0.58      0.55     31623
weighted avg       0.64      0.56      0.58     31623


 Confusion Matrix: 
 [[11218 10219]
 [ 3606  6580]]
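As a sanity check, the headline test metrics can be recomputed by hand from the confusion matrix above. The sketch below assumes that rows are the true class and columns the predicted class, which is consistent with the per-class support in the classification report. The body of `eval_test_set` is not shown here, so the comparison with the reported AUC is an inference from the arithmetic only.

```python
# Confusion matrix copied from the test output above.
# Assumed layout: rows = true class, columns = predicted class.
tn, fp = 11218, 10219  # true class 0
fn, tp = 3606, 6580    # true class 1

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)       # true positive rate
specificity = tn / (tn + fp)  # true negative rate
f1 = 2 * precision * recall / (precision + recall)

# Mean of the two per-class recalls; for hard 0/1 predictions this is the
# same number as the area under the single-threshold ROC curve.
balanced_accuracy = (recall + specificity) / 2

print(f"accuracy          = {accuracy:.4f}")
print(f"precision (1)     = {precision:.4f}")
print(f"recall (1)        = {recall:.4f}")
print(f"f1 (1)            = {f1:.4f}")
print(f"balanced accuracy = {balanced_accuracy:.4f}")
```

The accuracy and F1 values reproduce the reported scores. The balanced accuracy matches the reported AUC score, which suggests that the AUC was computed from hard class predictions rather than from predicted probabilities. If so, it is not directly comparable with the `roc_auc` scores from cross-validation, which scikit-learn computes from continuous scores.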
