
# **Case study Machine learning**

AB Gaming is a digital gaming platform that offers a wide variety of games available
under monthly subscriptions. There are 3 different plans, labeled SMALL, MEDIUM and
LARGE, which can be paid in USD or EUR.
You are provided with 2 datasets:
- sales.csv: contains the sales of clients acquired since 2019-01-01. Data was
extracted on 2020-12-31. This dataset includes a unique identifier for each user
(account_id).
- user_activity.csv: this dataset contains the following user characteristics as well
as their unique identifier (same as in sales.csv):
	- gender: gender of the user as reported in their profile.
	- age: age of the user in completed years at the beginning of their subscription.
	- type: type of device on which the user has installed the gaming platform.
	- genre1: most played game genre by the user.
	- genre2: second most played game genre by the user.
	- hours: mean number of hours played by the user weekly.
	- games: median number of different games played by the user weekly.

Create a ML model to predict subscribers’ churn.

We define churners as the users who stop their subscription before their 6th renewal. Hence, a user with fewer than 7 payment orders is considered a churner. For this model we will be using activity for the first 3 months, so users who have made only 1 or 2 payments should not be included in the model.

To create the model, extract relevant data (such as churner) from the sales.csv dataset and join it with the user_activity.csv dataset.

In the results of your report remember to answer these questions:

- What is the accuracy of your model? (consider also sensitivity, specificity, PPV and
NPV)
- Do you consider it a good predictive model?

In the conclusion/discussion of your report make sure to mention any limitations as well as
ways to enhance future iterations of the creation of the model.

# Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt


In [28]:
import tensorflow as tf
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import RandomOverSampler
from tensorflow import keras

# Import data

In [3]:
df_sales = pd.read_csv('/Users/alexis/Projects/20231215_CaseStudy_ML_NN_Churn_identification/sales.csv')
df_activity = pd.read_csv('/Users/alexis/Projects/20231215_CaseStudy_ML_NN_Churn_identification/user_activity.csv')

In [4]:
df_activity.head()

Unnamed: 0,account_id,gender,genre1,genre2,type,games,age,hours
0,101530,male,role-playing,action,mobile,8,21,7.573853
1,731892,female,adventure,action,computer,9,25,4.620231
2,856432,male,action,role-playing,mobile,19,35,13.608988
3,1425820,male,adventure,adventure,mobile,8,20,8.648719
4,1881252,male,action,strategy,computer,6,18,8.929738


In [5]:
df_sales.head()

Unnamed: 0,order_id,account_id,start_date,plan,amount,currency
0,C5G1ckzVUC1V,36369294,2019-03-17,MEDIUM,12.95,EUR
1,LyPKxILXvkiu,36369294,2019-04-17,MEDIUM,12.95,EUR
2,729R0C9dVx49,36369294,2019-05-17,MEDIUM,12.95,EUR
3,RrxBXQYG9Qn8,13708705,2020-08-28,SMALL,8.95,EUR
4,iYemtey2MjLT,940537915,2020-07-17,SMALL,8.95,EUR


# Churn identification

In [6]:
# A user that has less than 7 orders of payment is considered a churner
df_sales_seq = df_sales.groupby(["account_id", "plan", "currency"])["order_id"].count().reset_index(name="seq")
df_sales_seq["is_churn"]=np.where(df_sales_seq["seq"] < 7,1,0)

# For each account_id, we want to know the date of the last transaction
#df_sales_init = df_sales.groupby(["account_id"])["start_date"].min().reset_index(name="start_date")
df_sales_last = df_sales.groupby(["account_id"])["start_date"].max().reset_index(name="last_date")



In [7]:
df_users = pd.merge(df_sales_last, df_sales_seq, on='account_id')

In [8]:
df_users.head()

Unnamed: 0,account_id,last_date,plan,currency,seq,is_churn
0,101530,2019-12-03,SMALL,EUR,4,1
1,731892,2020-10-12,SMALL,EUR,9,0
2,856432,2020-10-25,SMALL,EUR,10,0
3,1425820,2019-07-25,SMALL,EUR,2,1
4,1881252,2019-12-22,SMALL,USD,1,1


In [9]:
df_users.is_churn.value_counts()

is_churn
1    1157
0     843
Name: count, dtype: int64

Number of users with 2 or fewer payments, who will be excluded from the model

In [10]:
df_users[(df_users["seq"] <= 2)].is_churn.value_counts()

is_churn
1    698
Name: count, dtype: int64

Some accounts should be excluded:
  - Users that have made only 1 or 2 payments
  - Users labeled as churners whose last payment falls at the end of the data collection window (2020/12 for USD, 2020/10 onwards for EUR). These users may or may not continue their plan after their last observed month, so their true label is unknown.
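
The second rule is a right-censoring filter: an account whose last payment falls in the final observed billing period may have fewer than 7 orders only because the extract ends there. A minimal sketch of both exclusion rules on made-up rows, assuming the per-currency cutoffs used in this notebook (last payment after 2020-11-30 for USD, after 2020-09-30 for EUR) and parsing `last_date` as a datetime instead of comparing strings. The names `toy` and `CUTOFFS` are illustrative and not part of the original code:

```python
import pandas as pd

# Made-up accounts for illustration only (not rows from sales.csv).
toy = pd.DataFrame({
    "account_id": [1, 2, 3, 4, 5],
    "currency": ["USD", "USD", "EUR", "EUR", "EUR"],
    "last_date": ["2020-12-15", "2020-03-02", "2020-10-20", "2020-06-11", "2019-08-01"],
    "seq": [4, 5, 3, 9, 2],
})
toy["last_date"] = pd.to_datetime(toy["last_date"])
toy["is_churn"] = (toy["seq"] < 7).astype(int)

# Assumed cutoffs, mirroring the filter used in this notebook.
CUTOFFS = {"USD": pd.Timestamp("2020-11-30"), "EUR": pd.Timestamp("2020-09-30")}

# Rule 1: churn label is unreliable when the last payment is after the cutoff.
cutoff = toy["currency"].map(CUTOFFS)
censored = (toy["last_date"] > cutoff) & (toy["is_churn"] == 1)

# Rule 2: fewer than 3 payments means no 3-month activity window.
too_few_payments = toy["seq"] <= 2

kept = toy[~(censored | too_few_payments)]
print(kept["account_id"].tolist())
```

Here accounts 1 and 3 are dropped as censored, account 5 is dropped for having only 2 payments, and accounts 2 and 4 are kept.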

In [11]:
user_reject = df_users[((df_users["last_date"]>"2020-11-30") & (df_users["is_churn"] == 1) & (df_users["currency"] == "USD"))
                    | ((df_users["last_date"]>"2020-09-30") & (df_users["is_churn"] == 1) & (df_users["currency"] == "EUR"))
                    | (df_users["seq"] <= 2)]

In [12]:
# List of all the users that should be excluded from the model
user_reject_list = user_reject.account_id.values.tolist()
df_users_filtered = df_users[~df_users['account_id'].isin(user_reject_list)]

In [13]:
df_input = pd.merge(df_activity,df_users_filtered.drop(['last_date','seq'], axis=1),how="inner", on="account_id")

df_input.head()



Unnamed: 0,account_id,gender,genre1,genre2,type,games,age,hours,plan,currency,is_churn
0,101530,male,role-playing,action,mobile,8,21,7.573853,SMALL,EUR,1
1,731892,female,adventure,action,computer,9,25,4.620231,SMALL,EUR,0
2,856432,male,action,role-playing,mobile,19,35,13.608988,SMALL,EUR,0
3,2397506,male,adventure,role-playing,computer,13,32,7.151723,SMALL,USD,0
4,2436396,male,adventure,sports,mobile,10,25,14.455147,MEDIUM,USD,0


In [14]:
df_input.to_csv('df_input.csv', index=False)

# Load data for modeling

In [6]:
import pandas as pd
from src import data_loader
from src import models

# Set pandas to show all columns
pd.set_option('display.max_columns', None)

In [3]:
# 1. Load data for models that need scaling (NN, SVM, LogReg)
X_train_sc, X_test_sc, y_train_sc, y_test_sc = data_loader.get_model_data('df_input.csv', scaled=True)

# 2. Load data for models that do NOT need scaling (RF, XGB)
# We use the 'scaled=False' flag
X_train_unsc, X_test_unsc, y_train_unsc, y_test_unsc = data_loader.get_model_data('df_input.csv', scaled=False)

print("Data loaded and preprocessed.")

Data loaded and preprocessed.
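
The `src.data_loader` module is not shown in this notebook. As a rough idea of what a loader with a `scaled` flag might do, here is a minimal sketch, assuming one-hot encoding of the categorical columns, a stratified train/test split, and a `StandardScaler` fitted on the training split only. The function name `prepare_model_data`, the 20% test size, and the seed are hypothetical choices, and the function takes a DataFrame rather than a file path to stay self-contained:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def prepare_model_data(df, target="is_churn", scaled=True, test_size=0.2, seed=42):
    """Hypothetical stand-in for data_loader.get_model_data (assumed behavior)."""
    # Drop the identifier and separate the label from the features.
    X = df.drop(columns=["account_id", target])
    y = df[target]

    # One-hot encode categorical columns; numeric columns pass through unchanged.
    X = pd.get_dummies(X, dtype=float)

    # Stratify so both splits keep a similar churn rate.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )

    if scaled:
        # Fit the scaler on the training split only to avoid leaking test statistics.
        scaler = StandardScaler().fit(X_train)
        X_train = pd.DataFrame(
            scaler.transform(X_train), columns=X.columns, index=X_train.index
        )
        X_test = pd.DataFrame(
            scaler.transform(X_test), columns=X.columns, index=X_test.index
        )

    return X_train, X_test, y_train, y_test
```

Tree-based models such as Random Forest and XGBoost split on thresholds and are insensitive to feature scale, which is why a `scaled=False` variant is enough for them.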


# Run models

In [7]:
all_results = []

# --- Run Scaled Models ---
print("Training Logistic Regression...")
results = models.train_logistic_regression(X_train_sc, y_train_sc, X_test_sc, y_test_sc)
all_results.append(results)
print("Done.\n")

print("Training SVM...")
results = models.train_svm(X_train_sc, y_train_sc, X_test_sc, y_test_sc)
all_results.append(results)
print("Done.\n")

print("Training Neural Network...")
results = models.train_neural_network(X_train_sc, y_train_sc, X_test_sc, y_test_sc)
all_results.append(results)
print("Done.\n")

# --- Run Unscaled Models ---
print("Training Random Forest...")
results = models.train_random_forest(X_train_unsc, y_train_unsc, X_test_unsc, y_test_unsc)
all_results.append(results)
print("Done.\n")

print("Training XGBoost...")
results = models.train_xgboost(X_train_unsc, y_train_unsc, X_test_unsc, y_test_unsc)
all_results.append(results)
print("Done.\n")

Training Logistic Regression...
Done.

Training SVM...
Done.

Training Neural Network...
7/7 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
Done.

Training Random Forest...
Done.

Training XGBoost...
Done.
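
The `src.models` module is not shown either. Each `train_*` function apparently returns a dictionary that becomes one row of the comparison table. A minimal sketch of what one such function could look like, assuming scikit-learn's `LogisticRegression` and the column names that appear in the results table; the function name and the `class_weight="balanced"` setting are assumptions, not the author's confirmed implementation:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def train_logistic_regression_sketch(X_train, y_train, X_test, y_test):
    """Hypothetical stand-in for models.train_logistic_regression."""
    # class_weight="balanced" is an assumption to offset class imbalance.
    model = LogisticRegression(max_iter=1000, class_weight="balanced")
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Precision, recall and F1 are reported for the positive class (is_churn == 1).
    return {
        "Model": "Logistic Regression",
        "Accuracy": accuracy_score(y_test, y_pred),
        "Churn_Precision": precision_score(y_test, y_pred, pos_label=1, zero_division=0),
        "Churn_Recall": recall_score(y_test, y_pred, pos_label=1, zero_division=0),
        "Churn_F1-Score": f1_score(y_test, y_pred, pos_label=1, zero_division=0),
    }
```

Returning a flat dictionary per model keeps the comparison step simple: a list of such dictionaries converts directly into a DataFrame with one row per model.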



In [8]:
# Convert the list of dictionaries into a clean DataFrame
df_results = pd.DataFrame(all_results)
df_results = df_results.set_index("Model")
df_results.sort_values(by="Churn_F1-Score", ascending=False, inplace=True)

print("--- Final Model Comparison ---")
display(df_results)

--- Final Model Comparison ---


Model,Accuracy,Churn_Precision,Churn_Recall,Churn_F1-Score
Random Forest,0.850467,0.722222,0.541667,0.619048
XGBoost,0.827103,0.612245,0.625,0.618557
Neural Network,0.785047,0.514706,0.729167,0.603448
Logistic Regression,0.761682,0.48,0.75,0.585366
SVM,0.799065,0.545455,0.625,0.582524
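
The table reports precision (PPV) and recall (sensitivity) for the churn class, but not the specificity and NPV that the brief also asks for. All of them follow from the four confusion-matrix counts. The sketch below uses counts reconstructed from the Random Forest row, which is an inference and not an output of this notebook: the reported figures are consistent with a 214-row test set containing 48 churners (recall 26/48, precision 26/36, accuracy 182/214). In practice the counts would come from the actual predictions, for example via `sklearn.metrics.confusion_matrix(y_test, y_pred).ravel()`, which returns `tn, fp, fn, tp` for binary labels.

```python
def churn_metrics(tp, fp, fn, tn):
    """Confusion-matrix metrics with churn (1) as the positive class."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),  # recall: share of churners correctly flagged
        "specificity": tn / (tn + fp),  # share of non-churners correctly cleared
        "ppv": tp / (tp + fp),          # precision: flagged users who really churn
        "npv": tn / (tn + fn),          # cleared users who really stay
    }


# Counts reconstructed from the Random Forest row (an assumption, see the text above).
rf = churn_metrics(tp=26, fp=10, fn=22, tn=156)
for name, value in rf.items():
    print(f"{name}: {value:.6f}")
```

Under these reconstructed counts, the Random Forest would have a specificity of about 0.94 and an NPV of about 0.88, while missing close to half of the actual churners (sensitivity about 0.54).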
