<a href="https://colab.research.google.com/github/nihemelandu/churn_clv_prediction/blob/main/01_business_case_and_exploration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Business Problem & ROI Framework

## Business Problem
Who is likely to churn, when are they likely to leave, and which factors drive that decision?

## ROI Framework for Churn Analysis

**Cost Side:**
* Average Customer Acquisition Cost (CAC)
* Customer Lifetime Value (CLV) lost per churned customer
* Cost of retention campaigns/interventions
* Analyst time and resources

**Benefit Side:**
* Prevented churn revenue (if we can reduce churn by X%, what's the revenue impact?)
* Reduced acquisition costs (acquiring a new customer is often cited as costing 5-25x more than retaining an existing one)
* Increased CLV through proactive engagement
* Operational efficiency from targeted rather than blanket retention efforts

**Key Questions to Answer First:**
1. What's our current churn rate, and what does each churned customer cost us?
2. What's a realistic improvement target? (Even a 10-20% relative reduction in churn can have a large revenue impact)
3. What retention tactics are we capable of executing?
4. How much lead time do we need to intervene effectively?

| Question               | Inferred Answer                                                   |
| ---------------------- | ----------------------------------------------------------------- |
| **Churn Rate**         | \~3–7% monthly churn (depending on segment)                       |
| **Cost of Churn**      | \~\$70 lost gross margin per churned customer                     |
| **Realistic Target**   | +3–5 percentage-point improvement in 90-day retention             |
| **Executable Tactics** | Cart/email retargeting, category alerts, churn-risk scoring, etc. |
| **Lead Time**          | 1–7 days depending on user frequency and behavior type            |
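
The cost and benefit items above can be combined into a back-of-the-envelope ROI estimate. The sketch below is illustrative only: `estimate_retention_roi` is a hypothetical helper, and every input except the ~$70 lost margin per churned customer (taken from the table) is a placeholder assumption, not a figure from this project.

```python
def estimate_retention_roi(
    n_customers: int,
    monthly_churn_rate: float,
    relative_churn_reduction: float,
    margin_per_churner: float,
    campaign_cost: float,
) -> dict:
    """Rough one-month ROI of a retention campaign (all inputs are assumptions)."""
    if campaign_cost <= 0:
        raise ValueError("campaign_cost must be positive")

    expected_churners = n_customers * monthly_churn_rate
    prevented_churners = expected_churners * relative_churn_reduction
    revenue_saved = prevented_churners * margin_per_churner
    net_benefit = revenue_saved - campaign_cost

    return {
        "expected_churners": expected_churners,
        "prevented_churners": prevented_churners,
        "revenue_saved": revenue_saved,
        "net_benefit": net_benefit,
        "roi": net_benefit / campaign_cost,
    }


result = estimate_retention_roi(
    n_customers=100_000,            # placeholder customer base
    monthly_churn_rate=0.05,        # midpoint of the ~3-7% range in the table
    relative_churn_reduction=0.10,  # placeholder: 10% fewer churners
    margin_per_churner=70.0,        # ~$70 lost gross margin, from the table
    campaign_cost=20_000.0,         # placeholder campaign budget
)
for name, value in result.items():
    print(f"{name}: {value:,.2f}")
```

With these placeholder inputs the campaign prevents about 500 churners, saves about $35,000 in margin, and returns roughly 0.75 per dollar spent after costs. The point is the structure of the calculation, not the specific numbers.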




## Standard E-commerce Churn Definitions

**Most Common Approaches:**

* 90-day rule: No purchase in the last 90 days (very common for frequent-purchase categories)
* 180-day rule: No purchase in the last 6 months (common for general retail)
* 365-day rule: No purchase in the last year (for infrequently purchased items)

**Industry-Specific Variations:**

*   Fashion/Apparel: 120-180 days
*   Beauty/Personal Care: 90-120 days
*   Electronics: 180-365 days (longer purchase cycles)
*   Grocery/Consumables: 30-60 days
*   Luxury goods: 365+ days

**Additional Behavioral Indicators:**

1.   Email engagement: Stopped opening marketing emails
2.   Website activity: No site visits or logins
3.   App usage: No mobile app opens

**Most Practical Definition:**

Many e-commerce companies use "no purchase in 6 months (180 days)" as it balances:

* Capturing true churn without being too aggressive
* Allowing for seasonal shopping patterns
* Giving reasonable time for win-back campaigns

**Business Context Questions:**

* What's your typical customer purchase frequency?
* What's your product category?
* Do you have subscription elements or one-time purchases?

The 180-day threshold tends to be the sweet spot for general e-commerce unless you have specific business reasons to adjust it.
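
An inactivity-based churn rule like the ones above can be sketched in pandas as follows. This is a minimal illustration, not the project's final labeling logic: `label_churn` is a hypothetical helper, the toy purchase log is invented, and only the column names `user_id` and `event_time` mirror the dataset. Note that a 180-day window needs at least that much purchase history per customer, which a single month of data would not provide.

```python
import pandas as pd


def label_churn(purchases: pd.DataFrame, as_of: pd.Timestamp,
                inactivity_days: int = 180) -> pd.DataFrame:
    """Flag users whose most recent purchase is more than `inactivity_days` before `as_of`."""
    last_purchase = purchases.groupby("user_id")["event_time"].max()
    days_inactive = (as_of - last_purchase).dt.days
    return pd.DataFrame({
        "last_purchase": last_purchase,
        "days_inactive": days_inactive,
        "churned": days_inactive > inactivity_days,
    })


# Invented toy purchase log (purchase events only)
purchases = pd.DataFrame({
    "user_id": [1, 1, 2, 3],
    "event_time": pd.to_datetime(
        ["2019-01-05", "2019-09-20", "2019-02-10", "2019-06-01"], utc=True
    ),
})

as_of = pd.Timestamp("2019-10-31", tz="UTC")
labels = label_churn(purchases, as_of, inactivity_days=180)
print(labels)
```

Changing `inactivity_days` to 90 or 365 applies the other rules, which makes it easy to check how sensitive the churn rate is to the chosen threshold.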

## Success Metrics Definition

1. **Predictive Performance Metrics (Model Quality):** Did we build the model right?
   * ROC/AUC, confusion matrices, precision/recall, and F1 scores
   * Focus: How well does the model distinguish churners from non-churners?

2. **Calibration Metrics (Probability Reliability)**
   * Brier Score: Measures how close predicted probabilities are to actual outcomes
   * Calibration Curve: Assesses if a 70% predicted churn probability means ~70% actually churn
   * Focus: Can we trust the probabilities for business decisions?

3. **Business-Level Success Metrics (Real-World Impact):** Did we build the right model?
   * Churn Rate Reduction: Target 10-20% reduction vs. baseline
   * Revenue Impact: (Prevented churners) × (Average CLV)
   * Campaign ROI: Cost of retention interventions vs. revenue saved
   * Operational Efficiency: Lead time for interventions, campaign response rates
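
To make the model-quality and calibration metrics concrete, here is a dependency-free sketch of precision/recall/F1, the Brier score, and a binned calibration table. The helper names, the 0.5 decision threshold, and the toy labels and probabilities are all assumptions for illustration; in practice a library such as scikit-learn would normally be used for these.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive (churn = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and outcome (lower is better)."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def calibration_table(y_true, y_prob, n_bins=5):
    """Per-bin mean predicted probability vs. observed churn rate."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((t, p))
    rows = []
    for idx, members in enumerate(bins):
        if not members:
            continue  # skip empty bins
        rows.append({
            "bin": idx,
            "n": len(members),
            "mean_predicted": sum(p for _, p in members) / len(members),
            "observed_rate": sum(t for t, _ in members) / len(members),
        })
    return rows


# Invented toy data: 1 = churned, 0 = retained
y_true = [1, 0, 1, 0, 0, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.7, 0.4, 0.1, 0.6, 0.3, 0.8, 0.4, 0.2]
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]  # assumed 0.5 threshold

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
brier = brier_score(y_true, y_prob)
table = calibration_table(y_true, y_prob)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} brier={brier:.3f}")
for row in table:
    print(row)
```

A well-calibrated model has `mean_predicted` close to `observed_rate` in every bin, which is what the calibration curve visualizes.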

## Initial Data Exploration

1. Sample-Based Exploration First

In [None]:
import pandas as pd

# Load only the first 1M rows so the structure can be inspected without reading the full file
sample = pd.read_csv("/content/drive/MyDrive/2019-Oct.csv", nrows=1_000_000)
sample.info()
sample.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 9 columns):
 #   Column         Non-Null Count    Dtype  
---  ------         --------------    -----  
 0   event_time     1000000 non-null  object 
 1   event_type     1000000 non-null  object 
 2   product_id     1000000 non-null  int64  
 3   category_id    1000000 non-null  int64  
 4   category_code  681869 non-null   object 
 5   brand          852440 non-null   object 
 6   price          1000000 non-null  float64
 7   user_id        1000000 non-null  int64  
 8   user_session   1000000 non-null  object 
dtypes: float64(1), int64(3), object(5)
memory usage: 68.7+ MB


| Statistic | product_id | category_id  | price      | user_id     |
| --------- | ---------- | ------------ | ---------- | ----------- |
| count     | 1000000.0  | 1000000.0    | 1000000.0  | 1000000.0   |
| mean      | 10347990.0 | 2.056348e+18 | 295.982471 | 531276300.0 |
| std       | 11238270.0 | 1.579788e+16 | 368.216516 | 16673330.0  |
| min       | 1001588.0  | 2.053014e+18 | 0.0        | 244951100.0 |
| 25%       | 1005115.0  | 2.053014e+18 | 64.32      | 515650700.0 |
| 50%       | 5100397.0  | 2.053014e+18 | 161.93     | 527129000.0 |
| 75%       | 16400260.0 | 2.053014e+18 | 360.37     | 548038000.0 |
| max       | 53900020.0 | 2.17542e+18  | 2574.07    | 555717500.0 |


In [None]:
# Actual number of unique users in the sample
unique_users_in_sample = sample['user_id'].nunique()
print(f"Unique users in 1M sample: {unique_users_in_sample}")

Unique users in 1M sample: 163024


In [None]:
# Understand data structure without loading full dataset
sample.dtypes

| Column        | Dtype   |
| ------------- | ------- |
| event_time    | object  |
| event_type    | object  |
| product_id    | int64   |
| category_id   | int64   |
| category_code | object  |
| brand         | object  |
| price         | float64 |
| user_id       | int64   |
| user_session  | object  |


In [None]:
sample.columns.tolist()

['event_time',
 'event_type',
 'product_id',
 'category_id',
 'category_code',
 'brand',
 'price',
 'user_id',
 'user_session']

In [None]:
# 3. Check row counts by event type
# Shows early how rare purchase events are relative to views (class imbalance).
event_counts = sample['event_type'].value_counts()
print(event_counts)

event_type
view        968513
purchase     16848
cart         14639
Name: count, dtype: int64
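
To put the imbalance in relative terms, the counts above can be converted to shares. The sketch below hard-codes the three counts from the output so it runs on its own; on the real data, `sample['event_type'].value_counts(normalize=True)` gives the same shares directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above
event_counts = pd.Series(
    {"view": 968_513, "purchase": 16_848, "cart": 14_639}, name="count"
)

# Share of each event type in the 1M-row sample
event_share = (event_counts / event_counts.sum()).round(4)
print(event_share)

# How many view events occur per purchase event
views_per_purchase = event_counts["view"] / event_counts["purchase"]
print(f"Views per purchase: {views_per_purchase:.1f}")
```

Purchases make up under 2% of events in this sample, with roughly 57 views per purchase, so any purchase-based target will be heavily imbalanced at the event level.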


In [None]:
#4. Understand Time Range
sample['event_time'] = pd.to_datetime(sample['event_time'])
sample['event_time'].min(), sample['event_time'].max()

(Timestamp('2019-10-01 00:00:00+0000', tz='UTC'),
 Timestamp('2019-10-01 16:56:07+0000', tz='UTC'))
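
The first 1M rows cover only about 17 hours of 1 October 2019, so the sample appears to be chronological rather than representative of the month. One way to get the full time range without loading everything into memory is to scan the file in chunks. The sketch below assumes a hypothetical helper `scan_time_range` and demonstrates it on a tiny in-memory CSV with invented rows; pointing it at the real file path would be the actual use.

```python
import io

import pandas as pd


def scan_time_range(csv_source, chunksize=1_000_000):
    """Return (min, max) of event_time by streaming the CSV in chunks."""
    overall_min, overall_max = None, None
    reader = pd.read_csv(csv_source, usecols=["event_time"], chunksize=chunksize)
    for chunk in reader:
        times = pd.to_datetime(chunk["event_time"], utc=True)
        chunk_min, chunk_max = times.min(), times.max()
        overall_min = chunk_min if overall_min is None else min(overall_min, chunk_min)
        overall_max = chunk_max if overall_max is None else max(overall_max, chunk_max)
    return overall_min, overall_max


# Invented demo data in the same timestamp format as the dataset
demo_csv = io.StringIO(
    "event_time,event_type\n"
    "2019-10-01 00:00:00 UTC,view\n"
    "2019-10-15 12:30:00 UTC,cart\n"
    "2019-10-31 23:59:59 UTC,purchase\n"
)
start, end = scan_time_range(demo_csv, chunksize=2)
print(start, end)
```

Reading only the `event_time` column via `usecols` keeps each chunk small, which matters more than the chunk size itself for a file of this width.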

## Data Quality Assessment

In [None]:
# Missing values per column (printed explicitly, since only a cell's last expression is displayed)
print(sample.isnull().sum())

# Number of fully duplicated rows
print(sample.duplicated().sum())

# Confirm event_time is now parsed as a timezone-aware datetime
sample.dtypes

| Column        | Dtype               |
| ------------- | ------------------- |
| event_time    | datetime64[ns, UTC] |
| event_type    | object              |
| product_id    | int64               |
| category_id   | int64               |
| category_code | object              |
| brand         | object              |
| price         | float64             |
| user_id       | int64               |
| user_session  | object              |
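
The earlier `info()` output already shows where values are missing: `category_code` has 681,869 non-null rows out of 1,000,000 (about 31.8% missing) and `brand` has 852,440 (about 14.8% missing). A small summary helper makes this explicit for any frame. `missing_summary` is a hypothetical name and the toy frame below is invented so the block runs on its own; on the notebook data it would be called as `missing_summary(sample)`.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    n_missing = df.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(df) * 100).round(2),
    })
    has_missing = summary["n_missing"] > 0
    return summary[has_missing].sort_values("pct_missing", ascending=False)


# Invented toy frame using column names from the dataset
toy = pd.DataFrame({
    "brand": ["a", None, "b", None],
    "category_code": [None, "x", "y", "z"],
    "price": [1.0, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```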
