# Keep Your Customer Model_ML | Group 6
## GROUP MEMBERS
- Adeke Mary
- Mugumya Timothy Kitulazi
- Atukwase Godson
- Muhawenimana Calorine

## PROBLEM STATEMENT
### Predicting Customer Churn in Online Retail Using Machine Learning

Customer churn is a major challenge in the online retail industry, affecting revenue and growth. This project aims to develop a binary classification machine learning model to predict whether a customer will churn (Target_Churn) based on their demographics, purchasing behavior, and engagement history.

The goal is to help the business identify at-risk customers early and take proactive steps to retain them, such as targeted promotions and improved support.

## OBJECTIVES

- Explore and analyze customer behavior data to identify churn patterns.

- Build and evaluate a binary classification model to predict churn.

- Determine the most influential factors leading to customer churn.

- Provide actionable insights and recommendations to reduce churn.

In [5]:
pip install ydata_profiling

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [12]:
import pandas as pd
from ydata_profiling import ProfileReport

ModuleNotFoundError: No module named 'ydata_profiling'

In [None]:

df = pd.read_csv("online_retail_customer_churn.csv")
df

Unnamed: 0,Customer_ID,Age,Gender,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn
0,1,62,Other,45.15,5892.58,5,22,453.80,2,0,3,129,True,Responded,True
1,2,65,Male,79.51,9025.47,13,77,22.90,2,2,3,227,False,Responded,False
2,3,18,Male,29.19,618.83,13,71,50.53,5,2,2,283,False,Responded,True
3,4,21,Other,79.63,9110.30,3,33,411.83,5,3,5,226,True,Ignored,True
4,5,21,Other,77.66,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,996,54,Male,143.72,1089.09,2,29,77.75,0,3,2,88,True,Ignored,False
996,997,19,Male,164.19,3700.24,9,90,34.45,6,4,4,352,False,Responded,True
997,998,47,Female,113.31,705.85,17,69,187.37,7,3,1,172,True,Unsubscribed,False
998,999,23,Male,72.98,3891.60,7,31,483.80,1,2,5,55,False,Responded,True


In [None]:
# prof = ProfileReport(df)
# prof

In [None]:
dropped = df.drop('Customer_ID', axis="columns")
dropped

Unnamed: 0,Age,Gender,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn
0,62,Other,45.15,5892.58,5,22,453.80,2,0,3,129,True,Responded,True
1,65,Male,79.51,9025.47,13,77,22.90,2,2,3,227,False,Responded,False
2,18,Male,29.19,618.83,13,71,50.53,5,2,2,283,False,Responded,True
3,21,Other,79.63,9110.30,3,33,411.83,5,3,5,226,True,Ignored,True
4,21,Other,77.66,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,54,Male,143.72,1089.09,2,29,77.75,0,3,2,88,True,Ignored,False
996,19,Male,164.19,3700.24,9,90,34.45,6,4,4,352,False,Responded,True
997,47,Female,113.31,705.85,17,69,187.37,7,3,1,172,True,Unsubscribed,False
998,23,Male,72.98,3891.60,7,31,483.80,1,2,5,55,False,Responded,True


In [None]:
encoded = pd.get_dummies(dropped, columns=["Gender"], dtype=int)
encoded

Unnamed: 0,Age,Annual_Income,Total_Spend,Years_as_Customer,Num_of_Purchases,Average_Transaction_Amount,Num_of_Returns,Num_of_Support_Contacts,Satisfaction_Score,Last_Purchase_Days_Ago,Email_Opt_In,Promotion_Response,Target_Churn,Gender_Female,Gender_Male,Gender_Other
0,62,45.15,5892.58,5,22,453.80,2,0,3,129,True,Responded,True,0,0,1
1,65,79.51,9025.47,13,77,22.90,2,2,3,227,False,Responded,False,0,1,0
2,18,29.19,618.83,13,71,50.53,5,2,2,283,False,Responded,True,0,1,0
3,21,79.63,9110.30,3,33,411.83,5,3,5,226,True,Ignored,True,0,0,1
4,21,77.66,5390.88,15,43,101.19,3,0,5,242,False,Unsubscribed,False,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,54,143.72,1089.09,2,29,77.75,0,3,2,88,True,Ignored,False,0,1,0
996,19,164.19,3700.24,9,90,34.45,6,4,4,352,False,Responded,True,0,1,0
997,47,113.31,705.85,17,69,187.37,7,3,1,172,True,Unsubscribed,False,1,0,0
998,23,72.98,3891.60,7,31,483.80,1,2,5,55,False,Responded,True,0,1,0


In [None]:
ohe = pd.get_dummies(encoded, columns=['Email_Opt_In','Promotion_Response','Target_Churn'], drop_first=True, dtype=int)

In [None]:
ohe['Spend_per_Year'] = df['Total_Spend'] / (df['Years_as_Customer'] + 1)
ohe['Return_Rate'] = df['Num_of_Returns'] / (df['Num_of_Purchases'] + 1)
ohe['Engagement_Score'] = df['Satisfaction_Score'] - df['Num_of_Support_Contacts']

In [None]:
ohe

Unnamed: 0,Age,Annual_Income,Average_Transaction_Amount,Last_Purchase_Days_Ago,Gender_Female,Gender_Male,Email_Opt_In_True,Promotion_Response_Responded,Promotion_Response_Unsubscribed,Target_Churn_True,Spend_per_Year,Return_Rate,Engagement_Score
0,62,45.15,453.80,129,0,0,1,1,0,1,982.096667,0.086957,3
1,65,79.51,22.90,227,0,1,0,1,0,0,644.676429,0.025641,1
2,18,29.19,50.53,283,0,1,0,1,0,1,44.202143,0.069444,0
3,21,79.63,411.83,226,0,0,1,0,0,1,2277.575000,0.147059,2
4,21,77.66,101.19,242,0,0,0,0,1,0,336.930000,0.068182,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,54,143.72,77.75,88,0,1,1,0,0,0,363.030000,0.000000,-1
996,19,164.19,34.45,352,0,1,0,1,0,1,370.024000,0.065934,0
997,47,113.31,187.37,172,1,0,1,0,1,0,39.213889,0.100000,-2
998,23,72.98,483.80,55,0,1,0,1,0,1,486.450000,0.031250,3
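As a sanity check, the three engineered features can be recomputed by hand for the first two customers shown earlier. The sketch below divides total spend by tenure plus one, which is the definition that reproduces the displayed `Spend_per_Year` values. The `sample` frame is a hypothetical helper holding only the columns needed, with values copied from the first two rows of the dataset.

```python
import pandas as pd

# First two rows copied from the DataFrame displayed earlier in this notebook.
sample = pd.DataFrame({
    "Total_Spend": [5892.58, 9025.47],
    "Years_as_Customer": [5, 13],
    "Num_of_Returns": [2, 2],
    "Num_of_Purchases": [22, 77],
    "Satisfaction_Score": [3, 3],
    "Num_of_Support_Contacts": [0, 2],
})

# The +1 in each denominator presumably guards against division by zero
# (a customer with zero years of tenure or zero purchases).
sample["Spend_per_Year"] = sample["Total_Spend"] / (sample["Years_as_Customer"] + 1)
sample["Return_Rate"] = sample["Num_of_Returns"] / (sample["Num_of_Purchases"] + 1)
sample["Engagement_Score"] = sample["Satisfaction_Score"] - sample["Num_of_Support_Contacts"]

print(sample[["Spend_per_Year", "Return_Rate", "Engagement_Score"]].round(6))
```

The rounded results (982.096667 and 644.676429 for `Spend_per_Year`) agree with the first two rows of the `ohe` output.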


In [None]:
y = ohe['Target_Churn_True']
y

Unnamed: 0,Target_Churn_True
0,1
1,0
2,1
3,1
4,0
...,...
995,0
996,1
997,0
998,1


In [None]:
x = ohe.drop(['Target_Churn_True'], axis="columns")
x

Unnamed: 0,Age,Annual_Income,Average_Transaction_Amount,Last_Purchase_Days_Ago,Gender_Female,Gender_Male,Email_Opt_In_True,Promotion_Response_Responded,Promotion_Response_Unsubscribed,Spend_per_Year,Return_Rate,Engagement_Score
0,62,45.15,453.80,129,0,0,1,1,0,982.096667,0.086957,3
1,65,79.51,22.90,227,0,1,0,1,0,644.676429,0.025641,1
2,18,29.19,50.53,283,0,1,0,1,0,44.202143,0.069444,0
3,21,79.63,411.83,226,0,0,1,0,0,2277.575000,0.147059,2
4,21,77.66,101.19,242,0,0,0,0,1,336.930000,0.068182,5
...,...,...,...,...,...,...,...,...,...,...,...,...
995,54,143.72,77.75,88,0,1,1,0,0,363.030000,0.000000,-1
996,19,164.19,34.45,352,0,1,0,1,0,370.024000,0.065934,0
997,47,113.31,187.37,172,1,0,1,0,1,39.213889,0.100000,-2
998,23,72.98,483.80,55,0,1,0,1,0,486.450000,0.031250,3


In [None]:
from sklearn.utils import class_weight
print(df['Target_Churn'].value_counts())
from xgboost import XGBClassifier


Target_Churn
True     526
False    474
Name: count, dtype: int64


In [None]:
xgboost = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1, enable_categorical=True)


In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

In [None]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=100)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(800, 12)
(800,)
(200, 12)
(200,)
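The split above is not stratified, so the churn rate in the 200-row test set can drift away from the overall rate of 526 in 1000. A minimal sketch of the difference, assuming labels rebuilt only from the class counts printed earlier (the `labels` and `rows` arrays are hypothetical stand-ins for `y` and `x`):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels rebuilt from the printed value counts: 526 churners, 474 non-churners.
labels = np.array([1] * 526 + [0] * 474)
rows = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# Plain random split, as used in this notebook.
_, _, _, y_te_plain = train_test_split(
    rows, labels, test_size=0.2, random_state=100
)

# Stratified split: class proportions are preserved in both partitions.
_, _, _, y_te_strat = train_test_split(
    rows, labels, test_size=0.2, random_state=100, stratify=labels
)

print("overall churn rate:        ", labels.mean())
print("plain test churn rate:     ", y_te_plain.mean())
print("stratified test churn rate:", y_te_strat.mean())
```

Passing `stratify=y` to the notebook's own `train_test_split` call would be the equivalent change; whether it alters the reported metrics here has not been checked.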


In [None]:
xgboost.fit(X_train, y_train)

In [None]:
y_pred = xgboost.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.43      0.48      0.45        89
           1       0.54      0.48      0.50       111

    accuracy                           0.48       200
   macro avg       0.48      0.48      0.48       200
weighted avg       0.49      0.48      0.48       200



In [None]:
# model = Pipeline(steps=[('classifier', LogisticRegression(random_state=42, solver='liblinear'))])
# model = DecisionTreeClassifier(min_samples_split=40, random_state=42)
model = LogisticRegression(random_state=42, solver='liblinear')
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.44      0.44      0.44        89
           1       0.55      0.56      0.56       111

    accuracy                           0.51       200
   macro avg       0.50      0.50      0.50       200
weighted avg       0.50      0.51      0.50       200



In [None]:
import pandas as pd

df = pd.read_csv("online_retail_customer_churn.csv")

# Feature engineering
df['Spend_per_Year'] = df['Total_Spend'] / (df['Years_as_Customer'] + 1)
df['Return_Rate'] = df['Num_of_Returns'] / (df['Num_of_Purchases'] + 1)
df['Engagement_Score'] = df['Satisfaction_Score'] - df['Num_of_Support_Contacts']

# Drop ID and separate features/target
X = df.drop(columns=["Customer_ID", "Target_Churn"])
y = df["Target_Churn"]


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE

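categorical_features = ["Gender", "Promotion_Response"]
# Only int64/float64 columns are treated as numeric. If pandas parses
# Email_Opt_In as bool, it lands in neither list, and ColumnTransformer's
# default remainder="drop" discards it.
numeric_features = X.select_dtypes(include=["int64", "float64"]).columns.difference(categorical_features)
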

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing pipeline
preprocessor = ColumnTransformer([
    ("num", Pipeline([
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler())
    ]), numeric_features),
    ("cat", OneHotEncoder(drop="first"), categorical_features)
])


X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)


smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_processed, y_train)
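A single 200-row test split gives a noisy accuracy estimate. One possible alternative is to wrap the preprocessing and the classifier in one `Pipeline` and score it with stratified cross-validation, so the scaler and encoder are refit inside every fold. The sketch below is an illustration under assumptions: the `demo` frame is random synthetic data that only borrows a few column names from the dataset, and `LogisticRegression` stands in for the XGBoost model to keep the example dependency-light. SMOTE is left out; adding it inside the folds would need imbalanced-learn's own pipeline class.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (random values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "Age": rng.integers(18, 70, size=n),
    "Total_Spend": rng.uniform(100, 10000, size=n),
    "Gender": rng.choice(["Female", "Male", "Other"], size=n),
    "Promotion_Response": rng.choice(["Responded", "Ignored", "Unsubscribed"], size=n),
})
demo_y = rng.integers(0, 2, size=n)  # random binary target

demo_preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Total_Spend"]),
    ("cat", OneHotEncoder(drop="first"), ["Gender", "Promotion_Response"]),
])

# Preprocessing and model in one estimator, refit on each training fold.
clf = Pipeline([
    ("prep", demo_preprocessor),
    ("model", LogisticRegression(max_iter=1000)),
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, demo, demo_y, cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Because the target here is random, the printed accuracy should hover around chance. The point of the sketch is the structure, not the score.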


In [None]:
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, classification_report

# use_label_encoder is dropped: recent XGBoost versions ignore it and warn
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_train_balanced, y_train_balanced)


y_pred = model.predict(X_test_processed)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.555
Report:
               precision    recall  f1-score   support

       False       0.53      0.44      0.48        94
        True       0.57      0.66      0.61       106

    accuracy                           0.56       200
   macro avg       0.55      0.55      0.55       200
weighted avg       0.55      0.56      0.55       200
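One of the stated objectives is to determine the most influential churn factors, which the cells above do not yet address. A minimal sketch of one option, permutation importance, is shown below on synthetic data. The feature names, the data, and the `LogisticRegression` stand-in model are all assumptions made for illustration. The same call should accept any fitted scikit-learn-compatible estimator together with a held-out feature matrix.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only the first feature actually drives the target.
rng = np.random.default_rng(0)
n = 1000
feature_names = ["informative", "noise_a", "noise_b"]
X_demo = rng.normal(size=(n, 3))
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=n) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
est = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Shuffle one column at a time on held-out data and measure the score drop.
result = permutation_importance(est, X_te, y_te, n_repeats=10, random_state=42)

ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

On a model that performs near the majority-class baseline, as the ones above do, importances would be expected to sit close to zero, so this step is most informative once predictive performance improves.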

