# End-to-End Data Science Workflow Demo
This notebook generates a synthetic dataset (6500+ rows, 20 columns) with mixed types and intentional data quality issues, then walks through:
1. Data Wrangling & Cleaning
2. Exploratory Data Analysis (EDA)
3. Predictive Modeling (Regression & Classification)
4. Deployment with Gradio (two mini apps)

*Targets*
- Regression target: monthly_spend
- Classification target: churn (0/1)

*How to use*: Run cells from top to bottom in Google Colab.

# **Classification**

In [30]:
import pandas as pd
import numpy as np

df = pd.read_csv('cleaned_data.csv')
df.head()

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,...,clicks_per_session,device_per_session,referral_per_month,loyalty_score,engagement_index,referral_value_score,discount_sensitivity,mobile_loyalty_flag,signup_month,signup_dayofweek
0,100000,2025-02-24,malaysia,kuala lumpur,other,39,standard,8,21.51,21,...,0.014999,0.095234,0.0,55.2,3116.799,0.0,0.0,1,2,0
1,100001,2021-04-24,philippines,cebu,male,51,basic,55,19.6,16,...,0.013999,0.124992,0.0,313.5,1787.52,0.0,0.0,0,4,5
2,100002,2019-05-29,thailand,phuket,female,29,standard,78,31.98,18,...,0.002944,0.277762,0.0,491.4,3626.532,0.0,0.0,1,5,2
3,100003,2018-01-01,indonesia,jakarta,female,43,standard,95,12.71,19,...,0.012157,0.157886,0.010526,693.5,1762.877,7.3,0.0,0,1,0
4,100004,2021-06-16,malaysia,shah alam,male,51,standard,53,3.0,15,...,0.0024,0.199987,0.0,307.4,261.0,0.0,0.0,1,6,2


Check data types

In [31]:
df.dtypes

customer_id               int64
signup_date              object
country                  object
city                     object
gender                   object
age                       int64
membership_tier          object
tenure_months             int64
avg_session_length      float64
sessions_per_month        int64
support_tickets           int64
last_payment_method      object
is_mobile_user             bool
num_devices               int64
email_click_rate        float64
referral_count            int64
discount_rate           float64
satisfaction_score      float64
monthly_spend           float64
churn                     int64
total_session_time      float64
complaint_rate          float64
clicks_per_session      float64
device_per_session      float64
referral_per_month      float64
loyalty_score           float64
engagement_index        float64
referral_value_score    float64
discount_sensitivity    float64
mobile_loyalty_flag       int64
signup_month              int64
signup_d

## **Data Pre-processing**

Check for the object data type

In [32]:
categorical_cols = df.select_dtypes(include=["object"])
categorical_cols.dtypes

signup_date            object
country                object
city                   object
gender                 object
membership_tier        object
last_payment_method    object
dtype: object

Convert the signup_date column from object to datetime

In [33]:
# Convert signup_date to datetime; values that cannot be parsed become NaT (errors='coerce')
df['signup_date'] = pd.to_datetime(df['signup_date'], errors='coerce')
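
Because `errors='coerce'` silently turns unparseable values into `NaT`, it can be worth counting how many values were lost. The sketch below uses a small toy series (illustrative values, not rows from `cleaned_data.csv`) to show the behaviour:

```python
import pandas as pd

# Toy values (assumption: illustrative only): one valid date, one bad string, one missing
raw = pd.Series(["2021-04-24", "not a date", None])

# Invalid entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, errors="coerce")

# NaT values introduced by coercion, excluding values that were already missing
n_unparsed = int(parsed.isna().sum() - raw.isna().sum())

print(parsed.isna().tolist())
print("values lost to coercion:", n_unparsed)
```

Running the same count on `df['signup_date']` before and after the conversion would show whether any dates were dropped.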

Check data type

In [34]:
df.dtypes

customer_id                      int64
signup_date             datetime64[ns]
country                         object
city                            object
gender                          object
age                              int64
membership_tier                 object
tenure_months                    int64
avg_session_length             float64
sessions_per_month               int64
support_tickets                  int64
last_payment_method             object
is_mobile_user                    bool
num_devices                      int64
email_click_rate               float64
referral_count                   int64
discount_rate                  float64
satisfaction_score             float64
monthly_spend                  float64
churn                            int64
total_session_time             float64
complaint_rate                 float64
clicks_per_session             float64
device_per_session             float64
referral_per_month             float64
loyalty_score            

Check data types again: confirm which columns are still object rather than float or int

In [35]:
df.dtypes

customer_id                      int64
signup_date             datetime64[ns]
country                         object
city                            object
gender                          object
age                              int64
membership_tier                 object
tenure_months                    int64
avg_session_length             float64
sessions_per_month               int64
support_tickets                  int64
last_payment_method             object
is_mobile_user                    bool
num_devices                      int64
email_click_rate               float64
referral_count                   int64
discount_rate                  float64
satisfaction_score             float64
monthly_spend                  float64
churn                            int64
total_session_time             float64
complaint_rate                 float64
clicks_per_session             float64
device_per_session             float64
referral_per_month             float64
loyalty_score            

In [36]:
df.head()

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,...,clicks_per_session,device_per_session,referral_per_month,loyalty_score,engagement_index,referral_value_score,discount_sensitivity,mobile_loyalty_flag,signup_month,signup_dayofweek
0,100000,2025-02-24,malaysia,kuala lumpur,other,39,standard,8,21.51,21,...,0.014999,0.095234,0.0,55.2,3116.799,0.0,0.0,1,2,0
1,100001,2021-04-24,philippines,cebu,male,51,basic,55,19.6,16,...,0.013999,0.124992,0.0,313.5,1787.52,0.0,0.0,0,4,5
2,100002,2019-05-29,thailand,phuket,female,29,standard,78,31.98,18,...,0.002944,0.277762,0.0,491.4,3626.532,0.0,0.0,1,5,2
3,100003,2018-01-01,indonesia,jakarta,female,43,standard,95,12.71,19,...,0.012157,0.157886,0.010526,693.5,1762.877,7.3,0.0,0,1,0
4,100004,2021-06-16,malaysia,shah alam,male,51,standard,53,3.0,15,...,0.0024,0.199987,0.0,307.4,261.0,0.0,0.0,1,6,2


In [37]:
feature_cols = [col for col in df.columns if col not in ['churn', 'signup_date', 'customer_id']]
categorical_cols = ['country', 'city', 'gender', 'membership_tier', 'last_payment_method']

In [38]:
X = df[feature_cols].copy()
y = df['churn']

X.head()

Unnamed: 0,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,support_tickets,last_payment_method,...,clicks_per_session,device_per_session,referral_per_month,loyalty_score,engagement_index,referral_value_score,discount_sensitivity,mobile_loyalty_flag,signup_month,signup_dayofweek
0,malaysia,kuala lumpur,other,39,standard,8,21.51,21,0,credit card,...,0.014999,0.095234,0.0,55.2,3116.799,0.0,0.0,1,2,0
1,philippines,cebu,male,51,basic,55,19.6,16,1,paypal,...,0.013999,0.124992,0.0,313.5,1787.52,0.0,0.0,0,4,5
2,thailand,phuket,female,29,standard,78,31.98,18,0,credit card,...,0.002944,0.277762,0.0,491.4,3626.532,0.0,0.0,1,5,2
3,indonesia,jakarta,female,43,standard,95,12.71,19,2,credit card,...,0.012157,0.157886,0.010526,693.5,1762.877,7.3,0.0,0,1,0
4,malaysia,shah alam,male,51,standard,53,3.0,15,1,unknown,...,0.0024,0.199987,0.0,307.4,261.0,0.0,0.0,1,6,2


In [39]:
from sklearn.preprocessing import LabelEncoder

# Encode categorical features
encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    encoders[col] = le  # Save encoder

df.head()  # df itself is unchanged; the encoded columns live in X

Unnamed: 0,customer_id,signup_date,country,city,gender,age,membership_tier,tenure_months,avg_session_length,sessions_per_month,...,clicks_per_session,device_per_session,referral_per_month,loyalty_score,engagement_index,referral_value_score,discount_sensitivity,mobile_loyalty_flag,signup_month,signup_dayofweek
0,100000,2025-02-24,malaysia,kuala lumpur,other,39,standard,8,21.51,21,...,0.014999,0.095234,0.0,55.2,3116.799,0.0,0.0,1,2,0
1,100001,2021-04-24,philippines,cebu,male,51,basic,55,19.6,16,...,0.013999,0.124992,0.0,313.5,1787.52,0.0,0.0,0,4,5
2,100002,2019-05-29,thailand,phuket,female,29,standard,78,31.98,18,...,0.002944,0.277762,0.0,491.4,3626.532,0.0,0.0,1,5,2
3,100003,2018-01-01,indonesia,jakarta,female,43,standard,95,12.71,19,...,0.012157,0.157886,0.010526,693.5,1762.877,7.3,0.0,0,1,0
4,100004,2021-06-16,malaysia,shah alam,male,51,standard,53,3.0,15,...,0.0024,0.199987,0.0,307.4,261.0,0.0,0.0,1,6,2


In [40]:
for col, le in encoders.items():
    print('______')
    print(f"Mapping for {col}:")
    # LabelEncoder assigns codes in sorted order, so the index equals the code
    for i, label in enumerate(le.classes_):
        print(f"{i} → {label}")

______
Mapping for country:
0 → brunei
1 → indonesia
2 → malaysia
3 → philippines
4 → singapore
5 → thailand
6 → vietnam
______
Mapping for city:
0 → baguio
1 → bandar seri begawan
2 → bandung
3 → bangar
4 → bangkok
5 → can tho
6 → cebu
7 → central
8 → chiang mai
9 → da nang
10 → davao
11 → denpasar
12 → hai phong
13 → hanoi
14 → hat yai
15 → ho chi minh
16 → ipoh
17 → jakarta
18 → johor bahru
19 → jurong
20 → kuala belait
21 → kuala lumpur
22 → manila
23 → medan
24 → pattaya
25 → penang
26 → phuket
27 → quezon city
28 → seria
29 → shah alam
30 → surabaya
31 → tampines
32 → tutong
33 → unknown
34 → woodlands
35 → yishun
______
Mapping for gender:
0 → female
1 → male
2 → other
______
Mapping for membership_tier:
0 → basic
1 → premium
2 → standard
3 → vip
______
Mapping for last_payment_method:
0 → bank transfer
1 → credit card
2 → debit card
3 → e-wallet
4 → paypal
5 → unknown
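
A fitted `LabelEncoder` raises an error on labels it has not seen, which matters once the saved encoders are applied to new input. The sketch below shows one possible way to guard against that. `safe_encode` is a hypothetical helper, not part of scikit-learn, and the fallback only works for columns whose classes include `unknown` (here `city` and `last_payment_method`).

```python
from sklearn.preprocessing import LabelEncoder

# Fit on the payment-method labels listed in the mapping above
le = LabelEncoder().fit(
    ["bank transfer", "credit card", "debit card", "e-wallet", "paypal", "unknown"]
)

def safe_encode(encoder, value, fallback="unknown"):
    """Encode one label, mapping unseen values to `fallback` (hypothetical helper)."""
    value = str(value).strip().lower()  # match the lowercase labels used in training
    if value not in encoder.classes_:
        value = fallback
    return int(encoder.transform([value])[0])

code_paypal = safe_encode(le, "PayPal")  # known label, different casing
code_unseen = safe_encode(le, "crypto")  # unseen label, falls back to "unknown"
print(code_paypal, code_unseen)
```

Note also that integer codes imply an ordering (for example `basic` < `premium` < `standard` < `vip`) that distance-based and linear models will treat as meaningful.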


## **Choosing model**

In [41]:
df['churn'].value_counts()

churn
0    6475
1      32
Name: count, dtype: int64

- Severe class imbalance for churn: 6,475 non-churn rows (99.5%) vs. 32 churn rows (0.5%)
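
With this imbalance, a model that never predicts churn already scores about 99.5% accuracy. The sketch below makes that baseline explicit with scikit-learn's `DummyClassifier`, using labels with the same class counts as above and placeholder features (an assumption, since features do not affect this baseline):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the value_counts() output above
y_demo = np.array([0] * 6475 + [1] * 32)
X_demo = np.zeros((len(y_demo), 1))  # placeholder features, ignored by the baseline

# Always predict the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_demo, y_demo)
y_hat = baseline.predict(X_demo)

baseline_accuracy = accuracy_score(y_demo, y_hat)
baseline_recall = recall_score(y_demo, y_hat)  # share of churners that were found

print(f"accuracy={baseline_accuracy:.4f}, churn recall={baseline_recall:.1f}")
```

Any accuracy figure in the model comparison is therefore best read against this baseline, together with a metric that looks at the churn class directly.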

In [42]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Check churn distribution
print("Train churn rate:", y_train.mean())
print("Test churn rate:", y_test.mean())

Train churn rate: 0.004995196926032661
Test churn rate: 0.004608294930875576


In [43]:
print("Original distribution:", np.bincount(y) / len(y))
print("Train distribution:   ", np.bincount(y_train) / len(y_train))
print("Test distribution:    ", np.bincount(y_test) / len(y_test))

Original distribution: [0.99508222 0.00491778]
Train distribution:    [0.9950048 0.0049952]
Test distribution:     [0.99539171 0.00460829]
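
The printed rates can be turned back into row counts, which shows how few churn rows the test set contains. This is plain arithmetic on the numbers above; the only outside fact used is that `train_test_split` rounds a fractional test size up.

```python
import math

n_total = 6475 + 32                # rows, from value_counts() above
n_test = math.ceil(0.2 * n_total)  # train_test_split rounds the test size up
n_train = n_total - n_test

# Recover the churn counts from the printed churn rates
train_pos = round(0.004995196926032661 * n_train)
test_pos = round(0.004608294930875576 * n_test)

# Accuracy of a model that predicts "no churn" for every test row
all_negative_accuracy = (n_test - test_pos) / n_test
# Accuracy change caused by one misclassified test row, in percentage points
step = 100 / n_test

print(n_train, n_test, train_pos, test_pos)
print(f"{all_negative_accuracy * 100:.4f}%, per-row step: {step:.4f} pp")
```

So the test set holds only 6 churn rows, and every test accuracy reported for this split is a multiple of roughly 0.077 percentage points away from 100%.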


The following algorithms are applied:

1. KNN
2. Decision Tree
3. SVM
4. Random Forest
5. Logistic Regression
6. XGBoost



Import algorithms

In [44]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, jaccard_score, log_loss

import warnings
warnings.filterwarnings('ignore')

### **1. KNN**

In [45]:
model_KNN = KNeighborsClassifier()

#training
model_KNN.fit(X_train, y_train)

#prediction
y_pred_KNN = model_KNN.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_KNN)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_KNN))
#print('Classification report: ', classification_report(y_test, y_pred_KNN))

Accuracy score:  99.53917050691244
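
KNN relies on distances, and the features here span very different scales (for example `engagement_index` in the thousands next to `clicks_per_session` around 0.01). A possible variant is to standardise features inside a pipeline. The sketch below uses synthetic data from `make_classification` as a stand-in for the notebook's `X` and `y`, so its scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's X and y (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95], random_state=42
)
X_demo[:, 0] *= 1000  # put one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler is fitted on the training rows only, then reused on the test rows
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier())
knn_scaled.fit(X_tr, y_tr)
y_pred_scaled = knn_scaled.predict(X_te)

print(knn_scaled.steps[0][0], y_pred_scaled.shape)
```

Keeping the scaler inside the pipeline also means a single object can be saved and reused at prediction time.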


### **2. Decision Tree**

In [46]:
model_DT = DecisionTreeClassifier()

#training
model_DT.fit(X_train, y_train)

#prediction
y_pred_DT = model_DT.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_DT)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_DT))
#print('Classification report: ', classification_report(y_test, y_pred_DT))

Accuracy score:  99.84639016897081
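
The commented-out `confusion_matrix` and `classification_report` calls show what an accuracy figure hides. The sketch below uses hypothetical labels and predictions, sized like the test split (1,296 non-churn and 6 churn rows), with two errors assumed to be false positives. Two errors of any kind on 1,302 rows give the accuracy printed above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical test labels: 1296 non-churn and 6 churn rows
y_true = np.array([0] * 1296 + [1] * 6)

# Hypothetical predictions: every churner found, plus two false alarms
y_hat = y_true.copy()
y_hat[:2] = 1

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
accuracy = accuracy_score(y_true, y_hat)

print(f"TN={tn} FP={fp} FN={fn} TP={tp} accuracy={accuracy * 100:.3f}%")
print(classification_report(y_true, y_hat, digits=3))
```

The per-class precision and recall in the report are usually more informative than accuracy when one class is this rare.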


### **3. SVM**

In [47]:
model_SVM = SVC()

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.53917050691244


In [48]:
model_SVM = SVC(kernel = 'linear')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.84639016897081


In [49]:
model_SVM = SVC(kernel = 'poly')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.53917050691244


In [50]:
model_SVM = SVC(kernel = 'sigmoid')

#training
model_SVM.fit(X_train, y_train)

#prediction
y_pred_SVM = model_SVM.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_SVM)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_SVM))
#print('Classification report: ', classification_report(y_test, y_pred_SVM))

Accuracy score:  99.53917050691244
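
The four SVM cells reuse the name `model_SVM`, so only the last fit (sigmoid kernel) remains available afterwards. One way to keep every variant is to loop over the kernels and store each model. This is a sketch on synthetic data, not the notebook's dataset; the names `svm_models` and `svm_scores` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the notebook's X and y (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=600, n_features=8, weights=[0.9], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

svm_models = {}  # fitted model per kernel
svm_scores = {}  # F1 on the minority class per kernel
for kernel in ["rbf", "linear", "poly", "sigmoid"]:
    model = SVC(kernel=kernel).fit(X_tr, y_tr)
    svm_models[kernel] = model
    svm_scores[kernel] = f1_score(y_te, model.predict(X_te), zero_division=0)

for kernel, score in svm_scores.items():
    print(f"{kernel}: F1={score:.3f}")
```

The kernel to carry into the model comparison can then be picked from `svm_models` explicitly.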


### **4. Random Forest**

In [51]:
model_RF = RandomForestClassifier()

#training
model_RF.fit(X_train, y_train)

#prediction
y_pred_RF = model_RF.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_RF)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_RF))
#print('Classification report: ', classification_report(y_test, y_pred_RF))

Accuracy score:  100.0
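
A perfect test score deserves a second look, particularly with only six churn rows in the test set and several engineered features in the data. One inexpensive check is to inspect which features the forest relies on. The sketch below shows the pattern on synthetic data with made-up column names; on the real data it would use `model_RF` and `X_train.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with hypothetical column names (assumption: not the real data)
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_demo, y_demo)

# Impurity-based importances, one per column, summing to 1
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)

print(importances)
```

If one or two features carry almost all of the importance, it may be worth checking whether they were derived from the target.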


### **5. Logistic Regression**

In [52]:
model_LR = LogisticRegression()

#training
model_LR.fit(X_train, y_train)

#prediction
y_pred_LR = model_LR.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_LR)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_LR))
#print('Classification report: ', classification_report(y_test, y_pred_LR))

Accuracy score:  99.53917050691244
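
Logistic regression with default settings tends to favour the majority class on data this imbalanced, and on unscaled features the solver may stop at its iteration limit (a warning that `warnings.filterwarnings('ignore')` would hide). A possible variant, sketched here on synthetic data rather than the notebook's dataset, combines scaling with `class_weight="balanced"`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced stand-in (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=3000, n_features=10, weights=[0.97], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# class_weight="balanced" reweights classes inversely to their frequency
lr_balanced = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
lr_balanced.fit(X_tr, y_tr)

recall_balanced = recall_score(y_te, lr_balanced.predict(X_te))
print(f"minority-class recall: {recall_balanced:.3f}")
```

Reweighting usually trades some precision for recall, so both should be checked before choosing a model.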


### **6. XGBoost**

In [53]:
model_XGBoost = XGBClassifier()

#training
model_XGBoost.fit(X_train, y_train)

#prediction
y_pred_XGBoost = model_XGBoost.predict(X_test)

#evaluation
print('Accuracy score: ', accuracy_score(y_test, y_pred_XGBoost)*100)
#print('Confusion matrix: ', confusion_matrix(y_test, y_pred_XGBoost))
#print('Classification report: ', classification_report(y_test, y_pred_XGBoost))

Accuracy score:  99.76958525345621


**Summary of all models**

In [54]:
from sklearn.metrics import accuracy_score, jaccard_score, log_loss, confusion_matrix, classification_report

classification_models = {
    'KNN': model_KNN,
    'Decision Tree': model_DT,
    'SVM': model_SVM,  # the last SVC fitted above (sigmoid kernel)
    'Random Forest': model_RF,
    'Logistic Regression': model_LR,
    'XGBoost': model_XGBoost
}

classification_results = {}

for name, model in classification_models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    jaccard = jaccard_score(y_test, y_pred)
    # For models that can provide probabilities, calculate log loss
    try:
        y_pred_proba = model.predict_proba(X_test)[:, 1]
        logloss = log_loss(y_test, y_pred_proba)
    except AttributeError:
        logloss = "N/A" # Not all models have predict_proba

    classification_results[name] = {'Accuracy': accuracy, 'Jaccard Score': jaccard, 'Log Loss': logloss}
    print(f'{name}:')
    print(f'Accuracy: {accuracy * 100:.2f}%')
    print(f'Jaccard Score: {jaccard:.2f}')
    print(f'Log Loss: {logloss}')
    print('\n')

# Find the best model based on Accuracy
best_model_name = max(classification_results, key=lambda k: classification_results[k]['Accuracy'])
print(f'The best classification model based on Accuracy is: {best_model_name} with Accuracy of {classification_results[best_model_name]["Accuracy"]*100:.2f}%')

KNN:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.1456510976605357


Decision Tree:
Accuracy: 99.85%
Jaccard Score: 0.75
Log Loss: 0.05536659506776849


SVM:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: N/A


Random Forest:
Accuracy: 100.00%
Jaccard Score: 1.00
Log Loss: 0.0035780191777050386


Logistic Regression:
Accuracy: 99.54%
Jaccard Score: 0.00
Log Loss: 0.02709979104116608


XGBoost:
Accuracy: 99.77%
Jaccard Score: 0.50
Log Loss: 0.002813725636957758


The best classification model based on Accuracy is: Random Forest with Accuracy of 100.00%
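
The accuracies above differ by fractions of a percentage point, while the Jaccard scores separate the models clearly. For a binary target, Jaccard and F1 are linked by the identity F1 = 2J / (1 + J), so the printed (rounded) Jaccard values can be converted into approximate F1 scores on the churn class:

```python
# Jaccard scores as printed above (rounded to two decimals)
jaccard_reported = {
    "KNN": 0.00,
    "Decision Tree": 0.75,
    "SVM": 0.00,
    "Random Forest": 1.00,
    "Logistic Regression": 0.00,
    "XGBoost": 0.50,
}

# F1 = 2J / (1 + J) for a single positive class
f1_from_jaccard = {name: 2 * j / (1 + j) for name, j in jaccard_reported.items()}

# Rank models by churn-class F1 instead of overall accuracy
for name, f1 in sorted(f1_from_jaccard.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: F1 ~ {f1:.3f}")
```

A Jaccard score of 0.00 means the model identified no churn rows correctly, even though its accuracy is 99.54%.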


## **Cross-Validation**

In [55]:
from sklearn.model_selection import cross_val_score

# Apply cross-validation to the Random Forest model
cv_scores = cross_val_score(model_RF, X, y, cv=5) # Using 5 folds; default scoring is accuracy

print("Cross-validation scores:", cv_scores)
print("Mean cross-validation accuracy:", cv_scores.mean() * 100)

Cross-validation scores: [1.         0.99923195 1.         0.99769408 1.        ]
Mean cross-validation accuracy: 99.93852064641284
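
With 32 churn rows split over 5 folds, each fold holds only six or seven positives, so accuracy alone will stay near 100%. A possible extension is to report recall and F1 per fold with `cross_validate`. The sketch below runs on synthetic data as a stand-in for `X` and `y`; the shuffled `StratifiedKFold` is a choice made for this example, whereas `cv=5` uses unshuffled stratified folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced stand-in (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.95], random_state=0
)

# Stratified folds keep the class ratio roughly constant in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
metrics = ["accuracy", "recall", "f1"]

scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring=metrics,
)

# Results are stored under "test_<metric>", one value per fold
mean_scores = {m: scores[f"test_{m}"].mean() for m in metrics}
for m, value in mean_scores.items():
    print(f"mean {m}: {value:.3f}")
```

The spread of recall across folds gives a rough sense of how stable the churn predictions are.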


## **Save Model and Scaler**

In [56]:
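# Save the fitted model and preprocessing artifacts for later reuse
import os
import joblib

os.makedirs("results", exist_ok=True)  # joblib.dump fails if the folder is missing

joblib.dump(model_RF, "results/model_RF_class.joblib")
# norm_x is not defined in this section; it must be the fitted scaler object, not a scaled array
joblib.dump(norm_x, "results/scaler.joblib")
joblib.dump(encoders, "results/input_encoders.joblib")
joblib.dump(feature_cols, "results/feature_cols.joblib")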

['results/feature_cols.joblib']
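
Before wiring the saved files into an app, it can help to confirm that a reloaded model predicts exactly like the original. The sketch below does a round trip through a temporary directory with a small model trained on random data; the file name `model_demo.joblib` is illustrative and not one of the notebook's artifacts.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small synthetic dataset (assumption: stands in for the notebook's data)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

# Save and reload inside a temporary directory that is removed afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model_demo.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("reloaded model matches original:", same_predictions)
```

When loading the real artifacts, the input columns should be arranged in the order stored in `feature_cols.joblib` before calling `predict`.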