## Logistic Regression Model

First, we build a logistic regression model to serve as our baseline classifier.

In [1]:
# standard library
import os

# data manipulation
import pandas as pd

# modeling
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# resampling for class imbalance
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# custom helper functions
from src.models import cross_validate as cv

In [2]:
DATA_PATH = '../data/processed/'
OBS_PATH = os.path.join(DATA_PATH, 'observations_features.csv')

### Load data

In [3]:
obs = pd.read_csv(OBS_PATH)
obs.head()

| | session_id | seq | buy_event | visitor_id | view_count | session_length | item_views | add_to_cart_count | transaction_count | avg_avail |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1000001_251341 | 2.0 | 0 | 1000001 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1000007_251343 | 2.0 | 0 | 1000007 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1000042_251344 | 2.0 | 0 | 1000042 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 3 | 1000057_251346 | 2.0 | 0 | 1000057 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 4 | 1000067_251351 | 2.0 | 0 | 1000067 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |


### Perform Train/Test split

In [4]:
X_train, X_test, y_train, y_test = cv.create_Xy(obs)

print(f'Class balance: {y_train.mean():.2%}')

Class balance: 1.57%
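
With only 1.57% of training sessions ending in a purchase, accuracy on its own is a weak yardstick. As a rough illustration (the function name is hypothetical and the rate is copied from the output above), a classifier that always predicts "no purchase" already reaches about 98.4% accuracy:

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of a classifier that always predicts the more common class."""
    return max(positive_rate, 1.0 - positive_rate)


positive_rate = 0.0157  # class balance printed above
baseline = majority_baseline_accuracy(positive_rate)
print(f"Always-negative accuracy: {baseline:.2%}")
```

Any model evaluated below should be judged against this figure, which is why AUC is the more useful metric for comparison here.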


### Modeling

First, we build a pipeline that standardizes the features with StandardScaler and then fits LogisticRegression. The pipeline is then evaluated with 3-fold cross-validation.

In [5]:
log_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lm', LogisticRegression())
])

cv_results = cv.cv_model(X_train, y_train, log_pipe)
cv.log_scores(cv_results, 'log_regression')

| | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc |
|---|---|---|---|---|---|---|---|---|---|---|
| log_regression | 0.984231 | 0.000185 | 0.01021 | 0.004247 | 0.434278 | 0.198174 | 0.019929 | 0.008317 | 0.740446 | 0.012549 |
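
The internals of `cv.cv_model` are not shown in this notebook. As a sketch of what a comparable 3-fold evaluation could look like with scikit-learn alone, the example below runs `cross_validate` on synthetic imbalanced data. The data, the stratified folds, `max_iter=1000`, and the names `X_demo`, `y_demo`, `demo_pipe` and `summary` are all assumptions for illustration, not the project's code.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (NOT the project's observations).
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=6, weights=[0.95, 0.05], random_state=0
)

demo_pipe = Pipeline([
    ("ss", StandardScaler()),
    ("lm", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the positive rate similar in every split (an assumption
# about how the folds should be built, not a fact about cv.cv_model).
folds = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
metrics = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(demo_pipe, X_demo, y_demo, cv=folds, scoring=metrics)

# Summarise each metric as (mean, standard deviation) across the three folds.
summary = {
    m: (scores[f"test_{m}"].mean(), scores[f"test_{m}"].std()) for m in metrics
}
for name, (avg, std) in summary.items():
    print(f"{name}: {avg:.3f} +/- {std:.3f}")
```

Because the scaler lives inside the pipeline, it is refit on the training portion of each fold, so no statistics from the validation portion leak into the model.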


### Analysis

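A validation AUC of 0.74 is a reasonable baseline. The 98.4% accuracy says little, because only 1.57% of training sessions contain a purchase, so always predicting "no purchase" would score about the same. Given this class imbalance, let's apply SMOTE oversampling to see whether it changes the results. Because SMOTE sits inside the imblearn pipeline, resampling happens only when the pipeline is fit, never at prediction time.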
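
For intuition, SMOTE creates synthetic minority samples by interpolating between an existing minority sample and one of its nearest minority neighbours. The sketch below is a deliberately simplified, pure-Python version of that idea: it uses only the single nearest neighbour, whereas imblearn's `SMOTE` samples from several neighbours. The function name and the toy points are hypothetical.

```python
import random


def smote_like_samples(minority, n_new, seed=0):
    """Interpolate between a minority point and its nearest minority neighbour.

    Simplified illustration of SMOTE's core idea (k = 1), not imblearn's code.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest *other* minority point by squared Euclidean distance.
        neighbour = min(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


minority_points = [(1.0, 1.0), (1.5, 2.0), (3.0, 3.5), (3.2, 3.0)]
new_points = smote_like_samples(minority_points, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two real minority points, so the minority class is made denser without simply duplicating rows.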

In [6]:
log_pipe = imbPipeline([
    ('ss', StandardScaler()),
    ('smote', SMOTE()),
    ('lm', LogisticRegression())
])

cv_results = cv.cv_model(X_train, y_train, log_pipe)
cv.log_scores(cv_results, 'log_regression')

| | avg_accuracy | std_accuracy | avg_precision | std_precision | avg_recall | std_recall | avg_f1 | std_f1 | avg_auc | std_auc |
|---|---|---|---|---|---|---|---|---|---|---|
| log_regression | 0.478189 | 0.003034 | 0.84024 | 0.015868 | 0.024764 | 0.000464 | 0.048111 | 0.0009 | 0.752486 | 0.009845 |
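
As a sanity check on these scores, accuracy can be reconstructed from precision, recall and the share of positive samples. The sketch below suggests that the precision and recall columns may be swapped: the reported accuracy of 0.478 is reproduced only when 0.840 is treated as recall and 0.025 as precision. The helper `implied_accuracy` is hypothetical, and the check assumes each validation fold has roughly the 1.57% positive rate of the training set and that fold averages behave like a single confusion matrix.

```python
def implied_accuracy(precision: float, recall: float, positive_rate: float) -> float:
    """Accuracy implied by precision, recall and the share of positive samples."""
    tp = recall * positive_rate   # true positives, as a share of all samples
    fn = positive_rate - tp       # positives the model missed
    fp = tp / precision - tp      # predicted positives minus true positives
    return 1.0 - fp - fn


positive_rate = 0.0157  # training class balance; fold rates are assumed similar

# SMOTE run: columns read as labelled vs. read with the labels swapped.
as_labelled = implied_accuracy(
    precision=0.84024, recall=0.024764, positive_rate=positive_rate
)
swapped = implied_accuracy(
    precision=0.024764, recall=0.84024, positive_rate=positive_rate
)

print(f"as labelled: {as_labelled:.3f}, swapped: {swapped:.3f}, reported: 0.478")
```

The same check on the run without SMOTE points the same way, so the column labels produced by `cv.log_scores` are worth verifying before precision and recall are compared across models. AUC is unaffected by this.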


### Save the results

In [7]:
RESULTS_PATH = os.path.join(DATA_PATH, 'results.csv')
results = cv.log_scores(cv_results, 'log_regression')
results.to_csv(RESULTS_PATH)

### Next Steps

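SMOTE oversampling improved the mean validation AUC slightly, from 0.740 to 0.752. The gain is about the size of the fold-to-fold standard deviation (roughly 0.01), and accuracy fell from 0.984 to 0.478. Let's now try the same methodology with RandomForest. I want to see whether a tree-based method results in a better AUC.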