# Logistic Regression Baseline

This notebook builds a baseline logistic regression model and evaluates it with cross-validation.

1. Read in the data
2. Create hold-out test set
3. Create pipeline for logistic regression.
    - Randomly oversample the data to balance the target classes.
    - Cross-validate the results; evaluation metrics are accuracy, F1 score, and AUC.
  

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from data_import import preprocess_data
from sklearn.metrics import accuracy_score, roc_auc_score, f1_score, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split, cross_validate
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)
from Evaluation import eval_model

In [2]:
# load the preprocessed data
data = preprocess_data('Statcast_data.csv')
target = data['description']
# drop the last two columns (this removes 'player_name', which is not used as a feature)
data = data.iloc[:, :-2]



In [3]:
#first, create a test set
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state = 777, test_size = .2)

#then instantiate the model we will use: Logistic Regression
log_reg = LogisticRegression()


## Build Model and Interpret

Build a model and interpret the results by analyzing the values of its coefficients.

Train on the training data and evaluate on the hold-out test set. We randomly oversample the training data to balance the ratio of classes in the target.
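To make the idea of random oversampling concrete, here is a minimal NumPy sketch. The helper `random_oversample` and the toy arrays are hypothetical and written only for illustration; this is not imbalanced-learn's implementation, just the same idea: duplicate minority-class rows, sampled with replacement, until every class matches the majority count.

```python
import numpy as np


def random_oversample(X, y, seed=777):
    """Hypothetical helper: duplicate minority-class rows (sampled with
    replacement) until every class has as many rows as the largest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target_count = counts.max()

    # Start with every original row, then append extra minority rows.
    selected = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target_count:
            cls_idx = np.flatnonzero(y == cls)
            extra = rng.choice(cls_idx, size=target_count - count, replace=True)
            selected.append(extra)

    idx = np.concatenate(selected)
    return X[idx], y[idx]


# Toy data: 8 rows of class 0 and 2 rows of class 1.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0] * 8 + [1] * 2)

X_bal, y_bal = random_oversample(X_demo, y_demo)
print(np.bincount(y_bal))  # class counts after resampling: [8 8]
```

Because the extra rows are exact copies, oversampling should only ever be applied to training data; copies that leak into evaluation data would inflate the scores.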

In [4]:
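# newer imbalanced-learn releases replace the deprecated `ratio` argument with `sampling_strategy`
ros = RandomOverSampler(sampling_strategy = 1.0, random_state = 777)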

X_os, y_os = ros.fit_resample(X_train, y_train)

log_reg.fit(X_os, y_os)

y_pred_logreg = log_reg.predict(X_test)

print(f" Test Accuracy score: {accuracy_score(y_test, y_pred_logreg)}")
print(f" Test AUC score: {roc_auc_score(y_test, y_pred_logreg)}")
print(f" Test F1 score: {f1_score(y_test, y_pred_logreg)}")

 Test Accuracy score: 0.49471967696847335
 Test AUC score: 0.5437802602436324
 Test F1 score: 0.463340206185567
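Note that `roc_auc_score` in the cell above receives hard class predictions from `predict`. AUC is normally computed from continuous scores such as `predict_proba`, and scikit-learn's `'roc_auc'` scorer uses those scores during cross-validation, which may partly explain why the two AUC figures in this notebook differ. The sketch below shows both calls side by side on synthetic data; the dataset and variable names are assumptions for illustration only, not the Statcast data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative only).
X_syn, y_syn = make_classification(n_samples=1000, n_features=10, random_state=777)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=777
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# AUC from hard 0/1 labels: the ROC curve collapses to a single threshold.
auc_from_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from predicted probabilities of the positive class: uses the full ranking.
auc_from_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(f"AUC from predicted labels:        {auc_from_labels:.4f}")
print(f"AUC from predicted probabilities: {auc_from_scores:.4f}")
```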


In [5]:
for idx, feature in enumerate(X_train.columns):
    print(f"{feature} coefficient: {log_reg.coef_[0][idx]}")


release_speed coefficient: 0.06890089680375737
release_spin_rate coefficient: 0.0027611790843398713
release_pos_x coefficient: 0.01750560818275775
release_pos_y coefficient: 0.4687428441216609
release_pos_z coefficient: -0.00116981437195111
pfx_x coefficient: 0.08051462458566055
pfx_z coefficient: 0.13043272768889044
vx0 coefficient: -0.011627070761171346
vy0 coefficient: 0.123854178730036
vz0 coefficient: 0.0020180901944460154
ax coefficient: -0.10879907780206624
ay coefficient: -0.013326730147863746
az coefficient: -0.08993558880168531
sz_top coefficient: 0.001868631260708409
sz_bot coefficient: -0.0003122138964271333
release_extension coefficient: 0.46648261719027906
pitch_name_2-Seam Fastball coefficient: 0.2516180182626066
pitch_name_4-Seam Fastball coefficient: 0.27631143684064047
pitch_name_Changeup coefficient: -0.49608550578313315
pitch_name_Curveball coefficient: 0.1427249197603272
pitch_name_Cutter coefficient: 0.0553614974140503
pitch_name_Sinker coefficient: 0.349713439054

## Analysis

Looking at the coefficients, the main property to interpret is the sign, which determines whether the odds ratio for each feature (the exponential of its coefficient) is above or below 1. A negative coefficient gives an odds ratio below 1, meaning that a one-unit increase in that feature lowers the odds of a positive target value, in this case a strike. A positive coefficient gives an odds ratio above 1, meaning that a one-unit increase in that feature raises the odds of a strike. This interpretation, however, assumes that all other features are held constant, which is highly impractical.

cite: http://www.appstate.edu/~whiteheadjc/service/logit/intro.htm#interp
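As a short worked example of the coefficient-to-odds-ratio relationship, the sketch below exponentiates a few of the coefficients printed above. The choice of these four features is arbitrary, and the variable names are illustrative.

```python
import math

# A subset of the coefficients printed above, copied for illustration.
coefficients = {
    "release_pos_y": 0.4687428441216609,
    "release_extension": 0.46648261719027906,
    "pitch_name_Changeup": -0.49608550578313315,
    "ax": -0.10879907780206624,
}

# The odds ratio for a one-unit increase in a feature is exp(coefficient).
odds_ratios = {name: math.exp(coef) for name, coef in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1]):
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name}: odds ratio {ratio:.3f} ({direction} the odds of a strike)")
```

For instance, the `pitch_name_Changeup` coefficient of about -0.496 corresponds to an odds ratio of about 0.61, while the `release_pos_y` coefficient of about 0.469 corresponds to an odds ratio of about 1.60, in both cases with all other features held constant.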



## Cross-validation

#### Use cross-validation to check the reliability of the results above.

Use the custom function in the `Evaluation.py` module.

In [6]:
eval_model(log_reg, X_train, y_train, 5)

Mean test_accuracy Value: 0.48894387316710847
test_accuracy scores: [0.48599854 0.49157971 0.49174917 0.48454109 0.49085085]

Mean train_accuracy Value: 0.490094170812981
train_accuracy scores: [0.49284718 0.49224049 0.48620414 0.48432969 0.49484936]

Mean test_f1 Value: 0.4634008707564191
test_f1 scores: [0.45695534 0.46183088 0.47095079 0.46910618 0.45816116]

Mean train_f1 Value: 0.46469219194913086
train_f1 scores: [0.4661464  0.46334175 0.46459052 0.46660308 0.4627792 ]

Mean test_roc_auc Value: 0.5576898246223492
test_roc_auc scores: [0.552736   0.55826736 0.56410466 0.55847613 0.55486497]

Mean train_roc_auc Value: 0.5595457875208769
train_roc_auc scores: [0.56085482 0.55936937 0.55815543 0.55907761 0.5602717 ]
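The source of `eval_model` is not shown in this notebook. The key names in its output (`test_accuracy`, `train_f1`, `test_roc_auc`, and so on) match what scikit-learn's `cross_validate` returns when it is given a list of scorers and `return_train_score=True`, so a function along the following lines would produce a similar report. This is a sketch under that assumption: `eval_model_sketch` is a hypothetical name, the data is synthetic, and the real `Evaluation.py` may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate


def eval_model_sketch(model, X, y, cv):
    """Hypothetical stand-in for eval_model: k-fold cross-validation that
    reports accuracy, F1, and ROC AUC on both training and test folds."""
    scores = cross_validate(
        model,
        X,
        y,
        cv=cv,
        scoring=["accuracy", "f1", "roc_auc"],
        return_train_score=True,
    )
    for name, values in scores.items():
        # Skip the timing entries (fit_time, score_time).
        if name.startswith(("test_", "train_")):
            print(f"Mean {name} Value: {np.mean(values)}")
            print(f"{name} scores: {values}\n")
    return scores


# Synthetic binary classification data (illustrative only).
X_syn, y_syn = make_classification(n_samples=500, n_features=8, random_state=777)
cv_scores = eval_model_sketch(LogisticRegression(max_iter=1000), X_syn, y_syn, 5)
```

If oversampling is part of the evaluation, passing an imbalanced-learn pipeline (for example one built with `imblearn.pipeline.make_pipeline` from a sampler and the classifier) as `model` keeps the resampling inside each training fold, since imbalanced-learn pipelines apply samplers only during fitting.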

