<a href="https://colab.research.google.com/github/jaguara01/20231215_CaseStudy_ML_Neural_Network/blob/main/20231215_CaseStudy_ML_Neural_Network.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Case study Machine learning**

AB Gaming is a digital gaming platform that offers a wide variety of games available
under monthly subscriptions. There are 3 different plans, labeled SMALL, MEDIUM and
LARGE, which can be paid in USD or EUR.
You are provided with 2 datasets:
- sales.csv: contains the sales of clients acquired since 2019-01-01. Data was
extracted on 2020-12-31. This dataset includes a unique identifier for each user
(account_id).
- user_activity.csv: this dataset contains the following user characteristics as well
as their unique identifier (same as in sales.csv):
	- gender: gender of the user as reported in their profile.
	- age: age of the user in completed years at the beginning of their subscription.
	- type: type of device on which the user has installed the gaming platform.
	- genre1: most played game genre by the user.
	- genre2: second most played game genre by the user.
	- hours: mean number of hours played by the user weekly.
	- games: median number of different games played by the user weekly.

Create a ML model to predict subscribers’ churn.

We define churners as users who stop their subscription before their 6th renewal. Hence, a user with fewer than 7 orders of payment is considered a churner. For this model we will use activity from the first 3 months, so users who have made only 1 or 2 payments should not be included in the model.

To create the model, extract relevant data (such as churner) from the sales.csv dataset and join it with the user_activity.csv dataset.

In the results of your report remember to answer these questions:

- What is the accuracy of your model? (consider also sensitivity, specificity, PPV and
NPV)
- Do you consider it a good predictive model?

In the conclusion/discussion of your report make sure to mention any limitations as well as
ways to enhance future iterations of the creation of the model.

# Libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

In [2]:
from src import data_loader
from src import models

In [3]:
pd.set_option('display.max_columns', None)

# Data and preprocessing

The overall goal is to predict customer churn for a digital gaming platform. The process involves several key stages, from initial data loading and cleaning to feature creation and preparation for machine learning models.

1. Data Loading and Initial Merging

Two data sources are used:

- sales.csv: contains the transaction history of each user, including account_id, plan type, and currency.
- user_activity.csv: contains user attributes like gender, age, hours played, and preferred game genre.

These two datasets are merged on account_id to create a unified view of each customer.

2. Defining and Identifying Churn

A churner is defined as a user who has made fewer than 7 payments.
A new binary feature, is_churn, is created based on this definition (1 for churner, 0 for non-churner).

Certain users are excluded from model training to avoid ambiguity:

- Users with only 1 or 2 payments are removed.
- Users who look like churners only because their payments stop near the very end of the data collection period are also removed, as they might have renewed their subscription later.
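The labeling and filtering rules above can be sketched with pandas. This is only an illustration, not the code in src/data_loader.py: it assumes that sales.csv holds one row per order of payment, and the `label_churn` helper and the toy `sales` frame are hypothetical. The end-of-period exclusion is not covered here, because it depends on date columns that are not described in this notebook.

```python
import pandas as pd

def label_churn(sales: pd.DataFrame) -> pd.DataFrame:
    """Count payments per account and derive is_churn (hypothetical helper)."""
    # Assumption: one row in `sales` corresponds to one order of payment.
    payments = (
        sales.groupby("account_id").size().rename("n_payments").reset_index()
    )
    # Users with only 1 or 2 payments lack 3 months of activity: drop them.
    payments = payments[payments["n_payments"] >= 3].copy()
    # Fewer than 7 payments means the user stopped before the 6th renewal.
    payments["is_churn"] = (payments["n_payments"] < 7).astype(int)
    return payments.reset_index(drop=True)

# Toy data: account 1 has 2 payments, account 2 has 4, account 3 has 8.
sales = pd.DataFrame({"account_id": [1] * 2 + [2] * 4 + [3] * 8})
labels = label_churn(sales)
print(labels)
```

In this toy example account 1 is dropped, account 2 is labeled as a churner, and account 3 as a non-churner.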

3. Feature Engineering

To improve the model's predictive power, new features are created from the existing data in src/data_loader.py:

- hours_per_game: the average time a user spends on a single game (hours / games).
- age_group: the numerical age is converted into a categorical age_group (e.g., "18-21", "22-25"), which can help the model capture non-linear relationships between age and churn.
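A minimal sketch of these two features is shown below. The `add_features` helper, the toy data, and every bin edge other than the "18-21" and "22-25" examples are assumptions made for illustration; the actual choices in src/data_loader.py may differ. The sketch also guards against users with games equal to 0 by setting their ratio to 0, which is one possible convention.

```python
import pandas as pd

def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add hours_per_game and age_group columns (hypothetical helper)."""
    out = df.copy()
    # Mask games == 0 as NaN to avoid dividing by zero, then set the ratio to 0.
    games = out["games"].where(out["games"] > 0)
    out["hours_per_game"] = (out["hours"] / games).fillna(0.0)
    # Assumed bin edges; each bin includes its upper edge (pd.cut default).
    bins = [17, 21, 25, 35, 50, 120]
    labels = ["18-21", "22-25", "26-35", "36-50", "51+"]
    out["age_group"] = pd.cut(out["age"], bins=bins, labels=labels)
    return out

users = pd.DataFrame(
    {"age": [21, 22, 43], "hours": [7.5, 0.0, 15.0], "games": [5, 0, 10]}
)
features = add_features(users)
print(features)
```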

4. Data Preparation for Modeling

Before feeding the data to the machine learning models, several preprocessing steps are performed:

One-Hot Encoding: Categorical columns like gender, plan, genre1, etc., are converted into numerical format using one-hot encoding. This creates new columns for each category.
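As a rough illustration of this step, `pd.get_dummies` expands one categorical column into one indicator column per category. The toy frames below are made up. The `reindex` call shows one common way to align new data to the training columns when a category is missing from it; whether src/data_loader.py aligns columns this way is an assumption.

```python
import pandas as pd

train = pd.DataFrame(
    {"plan": ["SMALL", "LARGE", "MEDIUM"], "hours": [1.0, 2.0, 3.0]}
)
new = pd.DataFrame({"plan": ["SMALL", "SMALL"], "hours": [4.0, 5.0]})

# One indicator column per plan; untouched columns stay in front.
X_train = pd.get_dummies(train, columns=["plan"], dtype=int)
X_new = pd.get_dummies(new, columns=["plan"], dtype=int)

# `new` contains only SMALL, so plan_LARGE and plan_MEDIUM are missing.
# Reindexing adds them (filled with 0) in the training column order.
X_new = X_new.reindex(columns=X_train.columns, fill_value=0)
print(X_new)
```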

Data Splitting: The dataset is split into a training set (for teaching the model) and a testing set (for evaluating its performance).

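Handling Class Imbalance: The dataset has significantly more non-churners than churners. To prevent the model from being biased towards the majority class, a technique called Random Oversampling is used. This involves creating copies of the minority class (churners) in the training data to balance it out. In this notebook, oversampling is applied only to the scaled training data used by the Neural Network, and later to the full dataset used for the final production model. The test set is never resampled.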
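The idea behind random oversampling can be sketched with NumPy alone. This is a simplified stand-in for imblearn's RandomOverSampler, written for a binary target; the `random_oversample` function and the toy arrays are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=42):
    """Duplicate random minority-class rows until both classes are equal in size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    # Draw minority rows with replacement to fill the gap.
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    # Keep every original row and append the sampled duplicates.
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)  # 8 non-churners, 2 churners
X_res, y_res = random_oversample(X, y)
print(np.bincount(y_res))
```

Because the added rows are exact copies, resampling should happen after the train/test split and only on the training portion, as the notebook does.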

Feature Scaling: For models such as Logistic Regression, SVM and Neural Networks, it is important that all features are on a similar scale. StandardScaler is fitted on the training set and then applied to both the training and testing sets. This is not necessary for tree-based models like Random Forest and XGBoost.

This entire process is encapsulated in the get_model_data function within the src/data_loader.py file, which prepares the data for both scaled and unscaled models.

In [4]:
df_train_full = data_loader.get_training_data(
    sales_path='sales.csv', 
    activity_path='user_activity.csv'
)

Data loading complete.
Loaded 1068 known users for training/testing.


## Prepare Data for Model Comparison

The purpose of this stage is to create an unscaled and a scaled version of the dataset, plus an oversampled copy of the scaled training data for the Neural Network. Different types of machine learning models have different data requirements, and preparing the data accordingly is crucial for accurate performance and fair comparison.

In [None]:
# --- 1. Prepare data for model comparison ---
X = df_train_full.drop('is_churn', axis=1)
y = df_train_full['is_churn']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# --- 2. Create Scaled and Unscaled Datasets ---

# --- Unscaled Data (for RF, XGBoost) --- 

X_train_unsc = X_train
y_train_unsc = y_train
X_test_unsc = X_test

# --- Scaled Data (for NN, SVM, LogReg) ---
scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)
y_train_sc = y_train

# --- Oversampled Scaled Data (FOR NN ONLY) ---
ros_sc = RandomOverSampler(random_state=42)
X_train_nn_res, y_train_nn_res = ros_sc.fit_resample(X_train_sc, y_train)

# We must use the *original* (un-oversampled) y_test for evaluation
y_test_eval = y_test


# Models training
This section compares the performance of five classification models: Logistic Regression, SVM, Neural Network, Random Forest, and XGBoost. Each model is trained on the appropriate training data, and its performance results are collected in a single list called all_results.

In [6]:
all_results = []

# --- Run Scaled Models ---
print("Training Logistic Regression...")
results = models.train_logistic_regression(X_train_sc, y_train_sc, X_test_sc, y_test_eval)
all_results.append(results)
print("Done.\n")

print("Training SVM...")
results = models.train_svm(X_train_sc, y_train_sc, X_test_sc, y_test_eval)
all_results.append(results)
print("Done.\n")

print("Training Neural Network...")
results = models.train_neural_network(X_train_nn_res, y_train_nn_res, X_test_sc, y_test_eval)
all_results.append(results)
print("Done.\n")

# --- Run Unscaled Models ---
print("Training Random Forest...")
results = models.train_random_forest(X_train_unsc, y_train_unsc, X_test_unsc, y_test_eval)
all_results.append(results)
print("Done.\n")

print("Training XGBoost...")
results = models.train_xgboost(X_train_unsc, y_train_unsc, X_test_unsc, y_test_eval)
all_results.append(results)
print("Done.\n")

Training Logistic Regression...
  Tuning Logistic Regression...


Done.

Training SVM...
  Training SVM (default params)...
Done.

Training Neural Network...
Done.

Training Random Forest...
  Tuning Random Forest...
Done.

Training XGBoost...
  Tuning XGBoost...
Done.
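
The code in src/models.py is not shown in this notebook. As a hedged sketch, each `models.train_*` call could end with a helper like the one below, which builds one result row per model. The `summarize` function is hypothetical; only the dictionary keys are taken from the comparison table in the next section.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

def summarize(model_name, y_true, y_pred):
    """Build one result row; churn (class 1) is the positive class."""
    return {
        "Model": model_name,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Churn_Precision": precision_score(y_true, y_pred, pos_label=1),
        "Churn_Recall": recall_score(y_true, y_pred, pos_label=1),
        "Churn_F1-Score": f1_score(y_true, y_pred, pos_label=1),
    }

# Toy labels: 2 true positives, 1 false negative, 1 false positive, 2 true negatives.
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
row = summarize("Toy model", y_true, y_pred)
print(row)
```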



# Models comparison

This project's goal was to build a predictive model to identify subscribers at risk of churning. To find the most effective model, we trained and compared five different algorithms: Logistic Regression, SVM, Random Forest, XGBoost, and a Neural Network. All models were trained on the same feature-engineered training split and evaluated on the same held-out test set. Logistic Regression, SVM and the Neural Network used scaled features, and only the Neural Network was trained on randomly oversampled data.

The primary business requirement for this project is to identify as many churners as possible. This means our most important evaluation metric is Recall (Sensitivity), as we want to maximize the number of at-risk users we can target with retention campaigns.

The final performance of each tuned model on the test set is summarized below. The table is sorted by Churn Recall, our key metric.

In [7]:
df_results = pd.DataFrame(all_results)
df_results = df_results.set_index("Model")

# Sort by our business goal: Churn_Recall
df_results.sort_values(by="Churn_Recall", ascending=False, inplace=True)

print("--- Final Model Comparison ---")
display(df_results)

--- Final Model Comparison ---


| Model | Accuracy | Churn_Precision | Churn_Recall | Churn_F1-Score |
|---|---|---|---|---|
| Logistic Regression | 0.742991 | 0.463158 | 0.916667 | 0.615385 |
| XGBoost | 0.82243 | 0.573529 | 0.8125 | 0.672414 |
| Neural Network | 0.78972 | 0.522388 | 0.729167 | 0.608696 |
| SVM | 0.794393 | 0.53125 | 0.708333 | 0.607143 |
| Random Forest | 0.845794 | 0.682927 | 0.583333 | 0.629213 |


## Results analysis
Based on the business goal, the Logistic Regression model is the best choice, even though it has the lowest overall accuracy (74.3%) of the five models.

* It meets the goal: With a recall of 91.7%, this model identifies the vast majority of users who are at risk of churning. It finds over 9 out of 10 churners in the test set, about 10 percentage points more than the next-best model (XGBoost, 81.3%).

* The business trade-off: This high recall is achieved by casting a "wide net," which results in a precision of 46.3%. This means that for every 10 users we flag, roughly 5 will be false positives (users who would not have churned). This is an expected trade-off for a "find-all-churners" strategy, and it is acceptable as long as targeting a non-churner with a retention campaign costs less than missing a churner.
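
The comparison table reports precision (PPV) and recall (sensitivity) but not specificity or NPV. These can be estimated from the rounded table values, as in the sketch below. It assumes the test set has 214 rows (20% of the 1068 known users, rounded up by train_test_split), and the `confusion_from_metrics` helper is hypothetical. The results are derived estimates, not metrics reported by the models themselves.

```python
def confusion_from_metrics(accuracy, precision, recall, n_test):
    """Recover integer TP, FP, TN, FN from three metrics and the test size."""
    # With P actual positives: TP = recall * P, FP = TP * (1 / precision - 1),
    # TN = accuracy * n - TP, and n = P + FP + TN. Solving for P gives:
    positives = n_test * (1 - accuracy) / (1 - 2 * recall + recall / precision)
    tp = round(recall * positives)
    fn = round(positives) - tp
    fp = round(tp * (1 / precision - 1))
    tn = n_test - tp - fn - fp
    return tp, fp, tn, fn

# Logistic Regression row of the table; n_test=214 is an assumption (see above).
tp, fp, tn, fn = confusion_from_metrics(0.742991, 0.463158, 0.916667, n_test=214)
specificity = tn / (tn + fp)  # share of non-churners correctly left alone
npv = tn / (tn + fn)          # share of "no churn" predictions that are right
print(tp, fp, tn, fn)                        # 44 51 115 4
print(round(specificity, 3), round(npv, 3))  # 0.693 0.966
```

Under these assumptions, Logistic Regression has a specificity of about 69% and an NPV of about 97%: it rarely misses a churner, at the cost of flagging roughly 3 in 10 non-churners.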


# Prediction on uncertain users

Finally, as in the original notebook, we will predict the churn status for users whose outcome was uncertain at the time of data extraction.

We will use our best model for this goal: the **Tuned Logistic Regression** model. We will train it on the *entire* dataset to create a final "production" model, and then use it to predict the churn status of these users.

In [11]:
print("--- Preparing Final Production Model ---")

# --- 1. Prepare 100% of our KNOWN data for training ---
X_train_final = df_train_full.drop('is_churn', axis=1)
y_train_final = df_train_full['is_churn']

# Fit a new scaler on the full known dataset; this fitted scaler is reused for the prediction data below
final_scaler = StandardScaler()
X_train_final_scaled = final_scaler.fit_transform(X_train_final)

# Oversample the full training set
final_ros = RandomOverSampler(random_state=42)
X_train_final_res, y_train_final_res = final_ros.fit_resample(X_train_final_scaled, y_train_final)

print("Training final Logistic Regression model...")
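# Tune C on the full (oversampled) training data.
# Note: the classes are already balanced by oversampling, so class_weight='balanced' has little to no additional effect here.
param_grid = {'C': loguniform(1e-3, 1e2), 'solver': ['liblinear']}
log_reg_final = LogisticRegression(class_weight='balanced', random_state=42, max_iter=1000)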
rs_model = RandomizedSearchCV(estimator=log_reg_final,
                              param_distributions=param_grid,
                              n_iter=50, cv=5,
                              scoring='recall', # Optimizing for recall
                              random_state=42, n_jobs=-1, verbose=0)
rs_model.fit(X_train_final_res, y_train_final_res)
final_model = rs_model.best_estimator_
print("Final model is trained.")

# --- 2. Load the UNCERTAIN data for prediction ---
print("Loading data for final prediction...")
df_to_predict_aligned, df_predict_info = data_loader.get_prediction_data(
    sales_path='sales.csv', 
    activity_path='user_activity.csv'
)

# --- 3. Scale and Predict ---
if df_to_predict_aligned is not None:
    # Use the SAME scaler from the training data
    X_predict_scaled = final_scaler.transform(df_to_predict_aligned)

    predictions = final_model.predict(X_predict_scaled)
    prediction_probs = final_model.predict_proba(X_predict_scaled)[:, 1] # Prob of class 1

    # --- 4. Show Results ---
    df_final_predictions = df_predict_info.copy()
    df_final_predictions['Predicted_Churn'] = predictions
    df_final_predictions['Churn_Probability'] = prediction_probs

    print("\n--- Final Predictions on Uncertain Users ---")
    display(df_final_predictions.sort_values(by='Churn_Probability', ascending=False))
else:
    print("Error: Could not load prediction data.")

--- Preparing Final Production Model ---
Training final Logistic Regression model...
Final model is trained.
Loading data for final prediction...
Loaded 234 uncertain users for prediction.
Aligning prediction columns to training data...
Data loading complete.
Loaded 1068 known users for training/testing.

--- Final Predictions on Uncertain Users ---


| | account_id | age | gender | hours | games | plan | currency | Predicted_Churn | Churn_Probability |
|---|---|---|---|---|---|---|---|---|---|
| 43 | 209144533 | 21 | female | 7.547873 | 5 | LARGE | EUR | 1 | 0.784556 |
| 19 | 101326362 | 20 | female | 6.817933 | 4 | LARGE | USD | 1 | 0.780361 |
| 18 | 88235811 | 21 | male | 5.336582 | 9 | MEDIUM | EUR | 1 | 0.759070 |
| 83 | 367994395 | 22 | male | 0.000000 | 0 | SMALL | EUR | 1 | 0.728024 |
| 227 | 980497768 | 21 | male | 6.407393 | 2 | SMALL | EUR | 1 | 0.716424 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 79 | 352382141 | 43 | male | 15.534520 | 11 | SMALL | USD | 0 | 0.236333 |
| 156 | 702139267 | 33 | male | 16.054059 | 16 | SMALL | EUR | 0 | 0.229459 |
| 89 | 417803607 | 49 | male | 13.537794 | 18 | SMALL | USD | 0 | 0.198987 |
| 194 | 844739633 | 35 | male | 12.952085 | 24 | SMALL | EUR | 0 | 0.181504 |
