# **📊 Classification Tasks: Customer Churn Prediction**

This notebook demonstrates example usage of the classification utility scripts.

# **STAGE 0 : BUSINESS UNDERSTANDING**

## 📌 Problem Statement

A telecommunications company is experiencing significant customer attrition (churn), where existing customers discontinue their services and switch to competitors. The current reactive approach, addressing churn only after customers have already left, has proven costly and ineffective. Customer acquisition costs are typically 5-7x higher than retention costs, making each churned customer a substantial financial loss.

The company's customer service team relies on intuition and basic rules to identify at-risk customers, resulting in missed opportunities to intervene before customers leave. Without a data-driven approach, the company cannot effectively prioritize retention efforts or allocate resources efficiently.

Building on this challenge, the company aims to develop a predictive system that can identify customers at high risk of churning before they leave, enabling proactive retention strategies based on historical customer behavior and service usage patterns.

## 📌 Role

As the data science team, our role involves:

- Conducting exploratory data analysis to understand customer behavior patterns
- Identifying key factors that contribute to customer churn
- Building predictive models to flag at-risk customers
- Providing actionable insights for business decision-making

## 📌 Goals

- **Proactive Churn Prevention**: Identify at-risk customers before they churn to enable timely intervention. (_MAIN_)
- **Customer Lifetime Value Optimization**: Focus retention efforts on high-value customers most likely to churn. (_SECONDARY_)
- **Resource Allocation Efficiency**: Prioritize marketing and retention budgets toward customers who need it most. (_SECONDARY_)

## 📌 Business Metrics

| Metric                            | Description                                               | Type        |
| --------------------------------- | --------------------------------------------------------- | ----------- |
| **Churn Rate (%)**                | Percentage of customers who churned in a given period     | _MAIN_      |
| **Customer Retention Rate (%)**   | Percentage of customers retained after intervention       | _MAIN_      |
| **Customer Lifetime Value (CLV)** | Projected revenue from a customer over their relationship | _SECONDARY_ |
| **Cost per Acquisition (CPA)**    | Cost to acquire a new customer vs. retain existing        | _SECONDARY_ |
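
The two main metrics reduce to simple ratios. As a minimal sketch, assuming churn rate is measured against the customers active at the start of a period and retention rate against the customers who received an intervention (the function names and the example counts are illustrative, not taken from the dataset):

```python
def churn_rate(churned: int, customers_at_start: int) -> float:
    """Percentage of customers at the start of the period who churned."""
    if customers_at_start <= 0:
        raise ValueError("customers_at_start must be positive")
    return 100.0 * churned / customers_at_start


def retention_rate(retained: int, customers_targeted: int) -> float:
    """Percentage of targeted customers who stayed after an intervention."""
    if customers_targeted <= 0:
        raise ValueError("customers_targeted must be positive")
    return 100.0 * retained / customers_targeted


# Hypothetical counts, for illustration only
print(f"Churn rate: {churn_rate(50, 1000):.1f}%")
print(f"Retention rate: {retention_rate(120, 200):.1f}%")
```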

## 📌 Objectives

The ultimate goal of this project is to create a machine learning model that can:

- Predict customer churn with high recall to minimize missed at-risk customers (false negatives are costly)
- Provide probability scores for churn risk to enable tiered intervention strategies
- Identify the top contributing factors to churn for targeted retention campaigns
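
One way the probability scores could feed a tiered intervention strategy is sketched below. The tier names and the 0.7 / 0.4 cut-offs are placeholder assumptions; in practice they would be tuned against retention budgets and the model's calibration.

```python
def risk_tier(churn_probability: float, high: float = 0.7, medium: float = 0.4) -> str:
    """Map a predicted churn probability to an intervention tier.

    The thresholds are hypothetical defaults, not values from this project.
    """
    if not 0.0 <= churn_probability <= 1.0:
        raise ValueError("churn_probability must be between 0 and 1")
    if churn_probability >= high:
        return "high"
    if churn_probability >= medium:
        return "medium"
    return "low"


# Example: scores such as those returned by a classifier's predict_proba
for score in (0.85, 0.50, 0.10):
    print(score, risk_tier(score))
```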

## 📌 Success Criteria

- Model achieves **Recall ≥ 80%** for churn class (minimize false negatives)
- Model achieves **Precision ≥ 60%** to avoid excessive false alarms
- Reduction in monthly churn rate by **15-20%** through proactive interventions
- Clear identification of **top 5 churn indicators** for actionable insights
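
The two model-level criteria can be checked mechanically once predictions are available. The sketch below computes precision and recall for the churn class from plain lists; the helper names and the toy labels are illustrative, and in the notebook itself scikit-learn's `precision_recall_fscore_support` would serve the same purpose.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the positive (churn) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def meets_success_criteria(y_true, y_pred, min_recall=0.80, min_precision=0.60):
    """Check the recall and precision thresholds from the success criteria."""
    precision, recall = precision_recall(y_true, y_pred)
    return recall >= min_recall and precision >= min_precision


# Toy labels, for illustration only (1 = churn, 0 = no churn)
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
print(precision_recall(y_true, y_pred))
print(meets_success_criteria(y_true, y_pred))
```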

# **STAGE 1 : EXPLORATORY DATA ANALYSIS (EDA)**

## Project Setup

In [None]:
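import os
import sys
from pathlib import Path

# Resolve the project root once; this assumes the notebook lives one level below it.
# Run this cell only once per session: re-running it moves up another directory level.
project_root = Path.cwd().parent

# Add root folder to path (for module imports)
sys.path.insert(0, str(project_root))

# Change working directory (for I/O operations)
os.chdir(project_root)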

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_recall_fscore_support

# Import utils scripts
from utils.preprocessing import *
from utils.visualization import *
from utils.statistics import *
from utils.feature_selection import *

In [3]:
# # Read data from local
# df = pd.read_csv('data/classification_data.csv')

# Read data from github
df = pd.read_csv('https://raw.githubusercontent.com/mcikalmerdeka/ds-dataprep-utils/refs/heads/main/data/classification_data.csv')

# Preview data
display(df.head())

Unnamed: 0,customer_id,age,gender,tenure_months,monthly_charges,total_charges,contract_type,payment_method,support_tickets,account_balance,internet_service,online_security,satisfaction_score,days_since_last_login,num_products,referral_source,churn,churn_label
0,CUST_00352,52.0,Female,13.0,64.393115,873.447527,Two year,Bank Transfer,1,58.950305,Fiber optic,True,1.6,,1,Friend,1,Yes
1,CUST_00689,46.0,Male,81.0,115.102595,9634.748151,Two year,Electronic Check,1,-25.965241,Fiber optic,Yes,4.5,4.0,4,Billboard,0,No
2,CUST_00485,18.0,Other,44.0,74.210796,3291.816076,One year,Credit Card,4,133.015936,DSL,True,3.0,11.0,4,Email,1,Yes
3,CUST_00388,54.0,Male,9.0,,174.421647,One year,Bank Transfer,4,-13.197764,No,Yes,2.6,,1,Google,1,Yes
4,CUST_00031,67.0,Female,0.0,90.937098,0.0,One year,Electronic Check,4,-95.459565,DSL,yes,3.3,63.0,1,Facebook,1,Yes


In [4]:
# Check data information
info_df = check_data_information(df, df.columns.tolist())
display(info_df)

Unnamed: 0,Feature,Data Type,Null Values,Null Percentage,Duplicated Values,Unique Values,Unique Sample
0,customer_id,object,0,0.0,50,1000,"CUST_00352, CUST_00689, CUST_00485, CUST_00388..."
1,age,float64,91,8.67,50,81,"52.0, 46.0, 18.0, 54.0, 67.0"
2,gender,object,0,0.0,50,5,"Female, Male, Other, FEMALE, male"
3,tenure_months,float64,92,8.76,50,96,"13.0, 81.0, 44.0, 9.0, 0.0"
4,monthly_charges,float64,88,8.38,50,914,"64.39311525514046, 115.10259485286996, 74.2107..."
5,total_charges,float64,94,8.95,50,866,"873.447527076971, 9634.748150896536, 3291.8160..."
6,contract_type,object,7,0.67,50,4,"Two year, One year, Month-to-month, monthly, nan"
7,payment_method,object,60,5.71,50,4,"Bank Transfer, Electronic Check, Credit Card, ..."
8,support_tickets,int64,0,0.0,50,8,"1, 4, 2, 0, 3"
9,account_balance,float64,0,0.0,50,1000,"58.9503049877248, -25.965240829469444, 133.015..."
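
The `Unique Sample` column above shows that `gender` has five unique values, some of which differ only in casing (`Female` vs. `FEMALE`, `Male` vs. `male`). A minimal sketch of one way to collapse such variants, run on a small hand-built sample rather than the real dataset:

```python
import pandas as pd

# Small hand-built sample mirroring the gender labels seen in the output above
sample = pd.DataFrame({"gender": ["Female", "Male", "Other", "FEMALE", "male"]})

# Strip whitespace and apply title case so casing variants collapse into one label
sample["gender"] = sample["gender"].str.strip().str.title()

print(sample["gender"].unique())
```

Whether `monthly` should be merged into `Month-to-month` in `contract_type` is a similar question, but that mapping is a domain assumption and is left out of the sketch.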


Feature Information:

| Feature                 | Description                           | Type        |
| ----------------------- | ------------------------------------- | ----------- |
| `customer_id`           | Unique customer identifier            | ID          |
| `age`                   | Customer age                          | Numeric     |
| `gender`                | Customer gender                       | Categorical |
| `tenure_months`         | Months as a customer                  | Numeric     |
| `monthly_charges`       | Monthly billing amount                | Numeric     |
| `total_charges`         | Cumulative charges                    | Numeric     |
| `contract_type`         | Service contract type                 | Categorical |
| `payment_method`        | Payment method used                   | Categorical |
| `support_tickets`       | Number of support requests            | Numeric     |
| `account_balance`       | Current account balance               | Numeric     |
| `internet_service`      | Type of internet service              | Categorical |
| `online_security`       | Has online security add-on            | Categorical |
| `satisfaction_score`    | Customer satisfaction (1-5)           | Numeric     |
| `days_since_last_login` | Engagement metric                     | Numeric     |
| `num_products`          | Number of products subscribed         | Numeric     |
| `referral_source`       | How customer was acquired             | Categorical |
| **`churn`**             | **Target: Did customer churn? (0/1)** | **Binary**  |

In [5]:
# Group columns by data type
nums_cols = df.select_dtypes(include=['int64', 'float64']).columns
cats_cols = df.select_dtypes(include=['object', 'category']).columns
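
Selecting by dtype alone also places the integer target `churn` in `nums_cols`, and `customer_id` and `churn_label` in `cats_cols`. A possible follow-up is sketched below on a toy frame with a subset of the columns; the `exclude` list and the `num_features` / `cat_features` names are assumptions about how model features might be separated from identifiers and targets.

```python
import pandas as pd

# Toy frame with a subset of the dataset's columns
toy = pd.DataFrame({
    "customer_id": ["CUST_00352", "CUST_00689"],
    "age": [52.0, 46.0],
    "gender": ["Female", "Male"],
    "support_tickets": [1, 1],
    "churn": [1, 0],
    "churn_label": ["Yes", "No"],
})

# Columns assumed not to be model features: the identifier and both target encodings
exclude = ["customer_id", "churn", "churn_label"]
features = toy.drop(columns=exclude)

# Numeric features by dtype; everything else is treated as categorical here
num_features = features.select_dtypes(include="number").columns.tolist()
cat_features = [col for col in features.columns if col not in num_features]

print(num_features)
print(cat_features)
```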

# **STAGE 2 : DATA PRE-PROCESSING**

# **STAGE 3 : MODELLING AND EVALUATION**

# **STAGE 4 : BUSINESS IMPACT SIMULATION**