# Logistic Regression Baseline

This notebook builds a baseline logistic regression model and evaluates it with cross-validation.

1. Read in the data.
2. Create a hold-out test set.
3. Create a pipeline for logistic regression:
    - Randomly oversample the data to balance the target classes.
    - Cross-validate the results; the evaluation metrics are accuracy, F1 score, and AUC.
  

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from data_import import preprocess_data
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from Evaluation import eval_model

In [4]:
# Load the preprocessed data
data = preprocess_data('Statcast_data.csv')
target = data['description']
# Drop the last two columns, including 'player_name', which will not be used as a feature
data = data.iloc[:, :-2]
data.head()



| Unnamed: 0 | release_speed | release_spin_rate | release_pos_x | release_pos_y | release_pos_z | pfx_x | pfx_z | vx0 | vy0 | vz0 | ... | pitch_name_Changeup | pitch_name_Curveball | pitch_name_Cutter | pitch_name_Sinker | pitch_name_Slider | pitch_name_Split Finger | pitch_name_nan | p_throws_L | p_throws_R | p_throws_nan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.073523 | 0.225683 | 2.080234 | -0.016458 | -1.073886 | 2.119752 | -0.27965 | -1.984984 | -1.064687 | 1.276245 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1.340953 | 0.25821 | 2.033107 | -0.393337 | -0.823324 | 1.22874 | 0.483185 | -1.856961 | -1.349626 | 0.535466 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | -1.316632 | 0.898992 | 2.124057 | 1.138366 | -1.320666 | -0.982587 | -1.009283 | -1.006819 | 1.330334 | 1.589316 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1.257381 | 0.274474 | 2.013077 | -0.965694 | -1.152964 | 1.662236 | 0.401776 | -2.347235 | -1.209132 | -0.252616 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4 | 1.307524 | 0.625765 | 2.099451 | -0.293616 | -1.431626 | 1.619067 | 0.139219 | -2.665304 | -1.265463 | 0.268337 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |


In [5]:
# First, create a hold-out test set (20% of the data)
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=777, test_size=0.2)

# Then instantiate the model we will use: logistic regression
log_reg = LogisticRegression()


## Build Model and Interpret

Build a model and interpret the results by analyzing the values of its coefficients.

Fit the model on the training data and evaluate it on the hold-out test set. We randomly oversample the training data to balance the ratio of classes in the target.
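To make the oversampling step concrete, here is a minimal, dependency-free sketch of the idea: minority-class rows are duplicated at random (with replacement) until every class is as large as the biggest one. The helper `random_oversample` and the toy data are hypothetical and only illustrate the concept; this is not imbalanced-learn's implementation.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=777):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target_size = max(counts.values())

    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        # Indices of the original rows that belong to this class.
        members = [i for i, label in enumerate(labels) if label == cls]
        # Draw with replacement to close the gap to the majority count.
        for _ in range(target_size - count):
            pick = rng.choice(members)
            out_rows.append(rows[pick])
            out_labels.append(cls)
    return out_rows, out_labels


# Toy data (made up): four examples of class 0 and one of class 1.
rows = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

rows_os, labels_os = random_oversample(rows, labels)
print(Counter(labels_os))
```

Only the training data should be resampled this way; the hold-out test set keeps its original class ratio so that the reported metrics reflect the real distribution.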

In [6]:
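# Oversample the minority class until both classes are the same size.
# `ratio` was renamed to `sampling_strategy` in imbalanced-learn 0.4 and later removed.
ros = RandomOverSampler(sampling_strategy=1.0, random_state=777)

# Resample the training data only; the test set is left untouched
X_os, y_os = ros.fit_resample(X_train, y_train)

log_reg.fit(X_os, y_os)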

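# Predict hard class labels for the hold-out test set
y_pred_logreg = log_reg.predict(X_test)

# Note: roc_auc_score is given hard class labels here, not predicted probabilities
print(f" Test Accuracy score: {accuracy_score(y_test, y_pred_logreg)}")
print(f" Test AUC score: {roc_auc_score(y_test, y_pred_logreg)}")
print(f" Test F1 score: {f1_score(y_test, y_pred_logreg)}")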

 Test Accuracy score: 0.49510793601490916
 Test AUC score: 0.5439700027685492
 Test F1 score: 0.4633985309895189
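The AUC above is computed from the hard labels returned by `predict`. AUC measures how well continuous scores rank positives above negatives, so thresholded labels can give a different (usually coarser) value than the scores behind them; with scikit-learn, such scores would typically come from `log_reg.predict_proba(X_test)[:, 1]`. The sketch below illustrates the difference with a hypothetical pure-Python helper, `pairwise_auc`, and made-up scores; it is not the notebook's evaluation code.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted probabilities for six examples.
y_true = [0, 0, 0, 1, 1, 1]
scores = [0.2, 0.3, 0.6, 0.4, 0.7, 0.9]

# Thresholding at 0.5 mimics what `predict` returns.
hard_labels = [1 if s >= 0.5 else 0 for s in scores]

auc_from_scores = pairwise_auc(y_true, scores)
auc_from_labels = pairwise_auc(y_true, hard_labels)
print(f"AUC from scores: {auc_from_scores:.3f}")
print(f"AUC from hard labels: {auc_from_labels:.3f}")
```

On this toy data the two values differ because thresholding turns many strictly ordered pairs into ties.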


In [7]:
for idx, feature in enumerate(X_train.columns):
    print(f"{feature} coefficient: {log_reg.coef_[0][idx]}")


release_speed coefficient: 0.10589773981278268
release_spin_rate coefficient: 0.002541147877040908
release_pos_x coefficient: 0.04777890477456728
release_pos_y coefficient: 0.4781476931964008
release_pos_z coefficient: -0.002402852753119701
pfx_x coefficient: 0.05281619259816755
pfx_z coefficient: 0.12725529123674095
vx0 coefficient: -0.011993494995945238
vy0 coefficient: 0.16028588823310866
vz0 coefficient: 0.002779442686315408
ax coefficient: -0.07565840536535282
ay coefficient: -0.014132510990229526
az coefficient: -0.08598745513873572
sz_top coefficient: 0.001954189215605247
sz_bot coefficient: -0.00022489653938705806
release_extension coefficient: 0.47630933643225365
pitch_name_2-Seam Fastball coefficient: 0.2564717932159001
pitch_name_4-Seam Fastball coefficient: 0.2796227781615239
pitch_name_Changeup coefficient: -0.4913861478726658
pitch_name_Curveball coefficient: 0.14671206802463627
pitch_name_Cutter coefficient: 0.05739428648947451
pitch_name_Sinker coefficient: 0.3562790709

## Analysis

The main property of the coefficients that can be interpreted directly is their sign, which determines whether the odds ratio $\exp(\beta_n)$ for feature $n$ falls above or below 1. A negative coefficient gives an odds ratio below 1, meaning that a one-unit increase in that feature lowers the odds of a positive target value, in this case a strike. A positive coefficient gives an odds ratio above 1, meaning that a one-unit increase in that feature raises the odds of a strike. This interpretation assumes that all other features are held constant, which is highly impractical.
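As a worked example of this interpretation, the snippet below converts two coefficients copied from the output above into odds ratios. It is a small illustration, not part of the original notebook, and the variable names are chosen here for clarity.

```python
import math

# Coefficients copied from the fitted model's output above.
coefficients = {
    "release_pos_y": 0.4781476931964008,
    "pitch_name_Changeup": -0.4913861478726658,
}

# The odds ratio for a one-unit increase in a feature is exp(coefficient).
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of a strike)")
```

The positive coefficient yields an odds ratio of roughly 1.61 and the negative one roughly 0.61, each read with all other features held constant.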

cite: http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm#interp



## Cross-validation

#### Use cross-validation to check the reliability of the results above.

Use the custom `eval_model` function from the `Evaluation.py` module.

In [8]:
eval_model(log_reg, X_train, y_train, 5)

Mean test_accuracy Value: 0.48947778636114003
test_accuracy scores: [0.48546469 0.49138559 0.49213745 0.4866767  0.49172451]

Mean train_accuracy Value: 0.4906159134277135
train_accuracy scores: [0.49347813 0.49220409 0.48732042 0.48503343 0.4950435 ]

Mean test_f1 Value: 0.4637513014142328
test_f1 scores: [0.45636345 0.46190183 0.47098079 0.47014028 0.45937016]

Mean train_f1 Value: 0.4650071963313751
train_f1 scores: [0.46659256 0.46373655 0.46468434 0.46662142 0.46340111]

Mean test_roc_auc Value: 0.5576895325091001
test_roc_auc scores: [0.55216137 0.55886146 0.56395563 0.55789984 0.55556936]

Mean train_roc_auc Value: 0.5596052891684246
train_roc_auc scores: [0.56101929 0.55926907 0.55824129 0.559332   0.56016479]
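The internals of `eval_model` are not shown here. The output keys (`test_accuracy`, `train_f1`, `test_roc_auc`, ...) match what scikit-learn's `cross_validate` returns when given several scorers and `return_train_score=True`, so the helper presumably wraps it; that is an assumption. The dependency-free sketch below only illustrates the mechanics of 5-fold cross-validation: how the data is partitioned and how per-fold scores are averaged. `k_fold_indices` is a hypothetical helper that uses plain contiguous folds, whereas scikit-learn stratifies folds by default for classifiers.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Each sample lands in the validation fold exactly once.
folds = list(k_fold_indices(n_samples=10, k=5))
for train_idx, val_idx in folds:
    print(f"validate on {val_idx}, train on {train_idx}")

# The reported mean is the plain average of the per-fold scores,
# shown here with the test_accuracy values printed above.
test_accuracy = [0.48546469, 0.49138559, 0.49213745, 0.4866767, 0.49172451]
mean_test_accuracy = sum(test_accuracy) / len(test_accuracy)
print(f"Mean test_accuracy: {mean_test_accuracy:.6f}")
```

When oversampling is combined with cross-validation, the resampling belongs inside each training fold so that duplicated rows never leak into the validation fold.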

