##### Goal of the model
* Understand whether we can predict if a user is likely to activate (place their first order) in the next session
* If we have a good model, we can target our CRM communications at users with a high score
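
To make the targeting idea concrete, here is a minimal sketch of how model scores could be turned into a CRM target list. The `score` column, the `select_crm_targets` helper, the `top_fraction` cutoff, and the example values are all hypothetical and are not part of this notebook.

```python
import pandas as pd


def select_crm_targets(scores: pd.DataFrame, top_fraction: float = 0.2) -> pd.DataFrame:
    """Return the highest-scoring rows, keeping the top `top_fraction` of users."""
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    n_targets = max(1, int(round(len(scores) * top_fraction)))
    return scores.nlargest(n_targets, "score")


# Hypothetical scores, for illustration only (not taken from the dataset).
example_scores = pd.DataFrame(
    {
        "user_id": ["u1", "u2", "u3", "u4", "u5"],
        "score": [0.10, 0.85, 0.40, 0.72, 0.05],
    }
)

targets = select_crm_targets(example_scores, top_fraction=0.4)
print(targets)
```

In practice the score would probably be the positive-class probability from `predict_proba`, and the cutoff would depend on CRM capacity.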

Importing Packages

In [7]:
import pandas as pd
import sqlalchemy
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn import ensemble
from sklearn import metrics
import seaborn
import matplotlib.pyplot as plt

Fetching the analytical base table (ABT)

In [8]:
DB_PATH = r"C:\Users\jpgsa\Documents\GitHub\autodoc-task\autodoc-task-jan-2024\database_files\task_dataset.db"
engine = sqlalchemy.create_engine("sqlite:///" + DB_PATH)
table = "abt"

df_abt = pd.read_sql_table(table, engine)
df_abt.head()

Unnamed: 0,user_id,session_id,session_qty,days_in_base,session_qty_acc,add_to_cart_qty_acc,page_pdp_qty_acc,page_plp_qty_acc,page_search_plp_qty_acc,bounce_sessions_qty_acc,navigation_time_acc,is_next_session_activation
0,10020612649726787735u,10101053617586389067s,1,0.00022,1,0,2,0,0,0,19.0,1
1,10041641363804945509u,11031977414104065912s,1,0.001447,1,0,2,0,0,0,125.0,1
2,10061146611389501345u,13841203550317519321s,1,0.000104,1,0,0,2,0,0,9.0,1
3,10106794151185573609u,4352180653544790743s,1,0.000706,1,0,4,0,0,0,61.0,1
4,10106972389836326507u,15130917214999659073s,1,0.0,1,0,1,0,0,1,0.0,0


Separating features and target, splitting groups

In [9]:
target = "is_next_session_activation"
to_remove = ["session_id", "user_id", target]

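# Keep every column except the identifiers and the target
features = [col for col in df_abt.columns if col not in to_remove]

X = df_abt[features]
Y = df_abt[target]

# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=94)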
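
Accuracy is easier to interpret next to a majority-class baseline. The sketch below uses a synthetic target (60% ones, an assumed ratio, not the real class balance) to show a stratified split and the accuracy obtained by always predicting the most frequent class. The `stratify` argument is an optional addition and is not used in the split above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the target column (not the real data): 60% ones.
y_demo = pd.Series([1] * 60 + [0] * 40, name="is_next_session_activation")
X_demo = pd.DataFrame({"feature": range(100)})

# stratify keeps the class ratio the same in the train and test groups.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=94, stratify=y_demo
)

# Accuracy of always predicting the most frequent class seen in training.
majority_class = y_tr.mode()[0]
baseline_acc = (y_te == majority_class).mean()

print("Majority class:", majority_class)
print("Baseline accuracy (test):", baseline_acc)
```

A model whose test accuracy is close to this baseline is adding little beyond the class prior.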

Training a Decision Tree

In [12]:
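# Default hyperparameters: no depth limit, so the tree keeps splitting until
# leaves are pure or too small to split, which makes overfitting likely.
# No random_state is set, so results can vary slightly between runs.
clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)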

Train Group Analysis

In [36]:
y_train_pred = clf.predict(X_train)
y_train_proba = clf.predict_proba(X_train)

acc_train = metrics.accuracy_score(y_train, y_train_pred)
auc_train = metrics.roc_auc_score(y_train, y_train_proba[:,1])

print("Accuracy (Train): {}".format(acc_train))
print("AUC (Train): {}".format(auc_train))

Accuracy (Train): 0.9243243243243243
AUC (Train): 0.9786329932912105


Test Group Analysis

In [37]:
y_test_pred = clf.predict(X_test)
y_test_proba = clf.predict_proba(X_test)

acc_test = metrics.accuracy_score(y_test, y_test_pred)
auc_test = metrics.roc_auc_score(y_test, y_test_proba[:,1])

print("Accuracy (Test): {}".format(acc_test))
print("AUC (Test): {}".format(auc_test))


Accuracy (Test): 0.5992141453831041
AUC (Test): 0.5477712824260139
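
The gap between the train AUC (0.98) and the test AUC (0.55) suggests that the unconstrained tree is overfitting. The following sketch, run on synthetic data from `make_classification` rather than on the ABT, compares an unconstrained tree with one limited by `max_depth` and `min_samples_leaf`. The specific values (5 and 50) are illustrative assumptions, not tuned settings.

```python
from sklearn import metrics, tree
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, noisy data (flip_y randomly reassigns 30% of the labels).
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=9, n_informative=3, flip_y=0.3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=94
)

models = {
    "unconstrained": tree.DecisionTreeClassifier(random_state=94),
    "constrained": tree.DecisionTreeClassifier(
        max_depth=5, min_samples_leaf=50, random_state=94
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    auc_tr = metrics.roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    auc_te = metrics.roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    results[name] = (auc_tr, auc_te)
    print(f"{name}: train AUC = {auc_tr:.3f}, test AUC = {auc_te:.3f}")
```

The unconstrained tree memorizes the noisy training labels, so its train AUC says little about how it will perform on unseen sessions.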


Feature importance for activating in the next session

In [15]:
pd.Series(clf.feature_importances_, index = X_train.columns).sort_values(ascending=False)

days_in_base               0.323401
navigation_time_acc        0.320306
page_pdp_qty_acc           0.092974
page_search_plp_qty_acc    0.086680
page_plp_qty_acc           0.081351
bounce_sessions_qty_acc    0.029274
add_to_cart_qty_acc        0.025282
session_qty                0.022221
session_qty_acc            0.018511
dtype: float64

Training a Random Forest

In [19]:
rf = ensemble.RandomForestClassifier(n_estimators = 350, min_samples_leaf = 150)
rf.fit(X_train, y_train)

Train Group Analysis

In [20]:
y_train_pred = rf.predict(X_train)
y_train_proba = rf.predict_proba(X_train)

acc_train = metrics.accuracy_score(y_train, y_train_pred)
auc_train = metrics.roc_auc_score(y_train, y_train_proba[:,1])

print("Accuracy (Train): {}".format(acc_train))
print("AUC (Train): {}".format(auc_train))

Accuracy (Train): 0.6442260442260442
AUC (Train): 0.6204701135988472


Test Group Analysis

In [21]:
y_test_pred = rf.predict(X_test)
y_test_proba = rf.predict_proba(X_test)

acc_test = metrics.accuracy_score(y_test, y_test_pred)
auc_test = metrics.roc_auc_score(y_test, y_test_proba[:,1])

print("Accuracy (Test): {}".format(acc_test))
print("AUC (Test): {}".format(auc_test))


Accuracy (Test): 0.6444007858546169
AUC (Test): 0.5969209818314678


Feature importance for activating in the next session

In [43]:
pd.Series(rf.feature_importances_, index = X_train.columns).sort_values(ascending=False)

days_in_base               0.319529
session_qty                0.170752
session_qty_acc            0.169202
navigation_time_acc        0.143184
page_pdp_qty_acc           0.064456
add_to_cart_qty_acc        0.058935
page_plp_qty_acc           0.054018
bounce_sessions_qty_acc    0.014526
page_search_plp_qty_acc    0.005397
dtype: float64
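
The values in `feature_importances_` are impurity-based: they are computed from the training data and can be inflated for features with many unique values. As a possible cross-check, the sketch below computes permutation importance on a held-out set using `sklearn.inspection.permutation_importance`. It runs on synthetic data with placeholder feature names (`feature_0` ... `feature_8`), not on the ABT.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data with placeholder column names (not the real features).
X_arr, y_syn = make_classification(
    n_samples=1000, n_features=9, n_informative=3, random_state=0
)
X_syn = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(9)])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=94
)

rf_demo = RandomForestClassifier(
    n_estimators=50, min_samples_leaf=20, random_state=94
)
rf_demo.fit(X_tr, y_tr)

# Shuffle each feature in the test set and measure the drop in ROC AUC.
result = permutation_importance(
    rf_demo, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=94
)
perm_importance = pd.Series(
    result.importances_mean, index=X_te.columns
).sort_values(ascending=False)
print(perm_importance)
```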

##### Conclusions
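* Both models have an underwhelming test AUC (decision tree: 0.55, random forest: 0.60)
* The decision tree overfits heavily (train AUC 0.98 vs. test AUC 0.55), while the random forest generalizes more consistently (train AUC 0.62 vs. test AUC 0.60)
* With an improved dataset we should be able to make better predictions
* We should also use grid search or random search to optimize the models' hyperparameters
* Finally, once the model is optimized, we can use it to guide our CRM team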
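
As a starting point for the hyperparameter search mentioned in the conclusions, here is a minimal sketch using `GridSearchCV` with ROC AUC as the scoring metric. It runs on synthetic data, and the grid values are illustrative assumptions rather than recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training group (not the real ABT).
X_syn, y_syn = make_classification(
    n_samples=600, n_features=9, n_informative=3, random_state=0
)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_leaf": [5, 25, 100],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=94),
    param_grid=param_grid,
    scoring="roc_auc",  # same metric used to evaluate the models above
    cv=3,
)
search.fit(X_syn, y_syn)

print("Best parameters:", search.best_params_)
print("Best cross-validated AUC:", search.best_score_)
```

`RandomizedSearchCV` from the same module offers a similar interface and samples a fixed number of candidates, which may be preferable when the grid is large.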