## XGBoost Model

We will now fit an XGBoost model to see whether it can outperform the random forest model.

In [1]:
# data manipulation
import pandas as pd
import os
import numpy as np

# modeling
from xgboost import XGBClassifier

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imbPipeline

# custom helper functions
from src.models import cross_validate as cv

In [2]:
DATA_PATH = '../data/processed/'
OBS_PATH = os.path.join(DATA_PATH, 'observations_features.csv')
RESULTS_PATH = os.path.join(DATA_PATH, 'results.csv')
model = 'xgb'

### Load data

In [3]:
obs = pd.read_csv(OBS_PATH)
obs.head()

Unnamed: 0,session_id,seq,buy_event,visitor_id,view_count,session_length,item_views,add_to_cart_count,transaction_count,avg_avail
0,1000001_251341,2.0,0,1000001,1.0,0.0,1.0,0.0,0.0,0.0
1,1000007_251343,2.0,0,1000007,1.0,0.0,1.0,0.0,0.0,0.0
2,1000042_251344,2.0,0,1000042,1.0,0.0,1.0,0.0,0.0,1.0
3,1000057_251346,2.0,0,1000057,1.0,0.0,1.0,0.0,0.0,1.0
4,1000067_251351,2.0,0,1000067,1.0,0.0,1.0,0.0,0.0,0.0


### Perform Train/Test split

In [4]:
X_train, X_test, y_train, y_test = cv.create_Xy(obs)

print(f'Class balance: {y_train.mean():.2%}')

Class balance: 1.57%
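
With only 1.57% positives, accuracy on its own is a weak yardstick: a classifier that always predicts "no purchase" would already be right about 98.43% of the time. The short sketch below works through that arithmetic. `majority_baseline_accuracy` is a hypothetical helper introduced for illustration and is not part of `src.models`.

```python
def majority_baseline_accuracy(positive_rate: float) -> float:
    """Accuracy of a classifier that always predicts the more common class."""
    if not 0.0 <= positive_rate <= 1.0:
        raise ValueError("positive_rate must be between 0 and 1")
    return max(positive_rate, 1.0 - positive_rate)


# Positive-class share taken from the "Class balance" printout above.
positive_rate = 0.0157
baseline = majority_baseline_accuracy(positive_rate)
print(f"All-negative baseline accuracy: {baseline:.2%}")
```

This is the reason precision, recall, F1 and AUC are tracked alongside accuracy in the results tables.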


### Modeling

In [5]:
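# Oversample the minority class with SMOTE, then fit XGBoost.
# SMOTE is seeded so that repeated runs resample the data identically.
pipe = imbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('xgb', XGBClassifier(n_estimators=500, random_state=42))
])

# Cross-validate on the training set and tabulate the scores.
cv_results = cv.cv_model(X_train, y_train, pipe)
cv.log_scores(cv_results, model)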

Unnamed: 0,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc
xgb,0.934896,0.002566,0.213213,0.021437,0.059732,0.006123,0.093283,0.009311,0.621742,0.01817
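
For intuition about the `smote` step in the pipeline: SMOTE creates synthetic minority samples by interpolating between a minority-class point and one of its nearest minority-class neighbours. The sketch below is a simplified, pure-Python illustration of that interpolation idea and is not imblearn's implementation. The helper name `smote_like_sample`, the toy points, and `k=2` are assumptions made for the example.

```python
import math
import random


def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point lying between a random minority sample
    and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = rng or random.Random()
    base = rng.choice(minority)
    # Rank the other minority points by Euclidean distance to the base point.
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
    neighbour = rng.choice(neighbours)
    # Interpolate: base + u * (neighbour - base), with u drawn from [0, 1).
    u = rng.random()
    return [b + u * (n - b) for b, n in zip(base, neighbour)]


# Toy minority-class points (view_count, add_to_cart_count); values are made up.
minority_points = [[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [8.0, 4.0]]
synthetic = smote_like_sample(minority_points, k=2, rng=random.Random(42))
print(synthetic)
```

An imblearn pipeline applies samplers only when it is fit, not when it predicts. If `cv.cv_model` fits the pipeline separately on each training fold (an assumption, since the helper is not shown here), the validation folds are scored on their original, un-resampled class balance.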


### Save the results

In [6]:
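# Load the shared results table, indexed by model name.
results = pd.read_csv(RESULTS_PATH, index_col=0)

# Drop any earlier row for this model so that re-running the cell does not duplicate it.
results = results.drop(index=model, errors='ignore')
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement.
results = pd.concat([results, cv.log_scores(cv_results, model)], sort=False)
results.to_csv(RESULTS_PATH)
results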

Unnamed: 0,avg_accuracy,std_accuracy,avg_precision,std_precision,avg_recall,std_recall,avg_f1,std_f1,avg_auc,std_auc
log_regression,0.478189,0.003034,0.84024,0.015868,0.024764,0.000464,0.048111,0.0009,0.752486,0.009845
random_forest,0.930277,0.001674,0.148949,0.016731,0.039709,0.003086,0.062687,0.005337,0.531354,0.013361
xgb,0.936546,0.003746,0.211411,0.023783,0.061001,0.005206,0.094584,0.008094,0.619189,0.016401


### Next Steps

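The XGBoost classifier outperformed the Random Forest classifier on every metric, but its AUC (0.62) is still below that of the Logistic Regression classifier (0.75). The interesting piece is that the Logistic Regression precision is very high (0.84) while its recall is very low (0.02). Relative to it, the XGBoost model leans the other way, with lower precision (0.21) and higher recall (0.06), although recall is poor for both models. The Logistic Regression model seems to rank the positive class better, but its low accuracy (0.48) suggests it does poorly on the negative class.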
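
The F1 column makes the precision/recall trade-off concrete: F1 is the harmonic mean of the two, so it is pulled toward the smaller value. As a rough check, the sketch below recomputes F1 from the averaged precision and recall in the results table. The match is only approximate, because the logged `avg_f1` is presumably an average of per-fold F1 scores rather than the F1 of the averages. `f1_from` is a hypothetical helper name.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (avg_precision, avg_recall) copied from the results table above.
scores = {
    "log_regression": (0.84024, 0.024764),
    "random_forest": (0.148949, 0.039709),
    "xgb": (0.211411, 0.061001),
}

for name, (precision, recall) in scores.items():
    print(f"{name}: F1 ~ {f1_from(precision, recall):.3f}")
```

Despite its 0.84 precision, Logistic Regression ends up with the lowest F1 of the three, because its recall is the smallest.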

Next, we will try a hybrid sampling approach, which combines over- and under-sampling, and analyze its impact. We will focus on the Logistic Regression and XGBoost models.