# 📊 Final Project - Customer Churn Prediction for Sendo Farm

## Problem Definition

Sendo Farm is an online grocery e-commerce platform that delivers essential goods, similar to a supermarket, directly to consumers for their daily meals.  
Because the products are groceries—especially fresh items such as meat, fish, fruits, and vegetables—customer satisfaction is highly sensitive to quality.  

Some customers who are dissatisfied with product quality or after-sales service choose to file complaints, request refunds, or report missing items. However, many others leave silently without expressing dissatisfaction and never purchase from Sendo Farm again.  

Customer churn is costly, especially in the grocery business where customers make frequent and recurring purchases.  
Thus, the goal of this project is to build a **supervised machine learning model** that predicts which customers are at risk of churning. This will allow Sendo Farm to proactively take preventive actions, such as personalized campaigns, compensation, or loyalty offers, to improve customer experience and retention.  

### Dataset Description
The available data includes:
- **Transaction history** (order frequency, order value, recency of last purchase).
- **Complaint & refund history** (number of complaints, refund requests).
- **Missing items history** (number of orders with missing products).
- **Purchase ratio of dry vs fresh goods** (stability and sensitivity to product type).
- **Order ratings** (satisfaction score per order).

These features will be engineered into customer-level data suitable for supervised ML classification (churn vs non-churn).
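
The description above does not say how the churn label itself is derived. One common convention, assumed here for illustration and not taken from the project, is to pick a cutoff date, build features only from orders before it, and mark a customer as churned if they place no order in the following window. The column names `customer_id` and `order_date`, the 60-day horizon, and the helper `label_churn` are all hypothetical.

```python
import pandas as pd

def label_churn(orders: pd.DataFrame, cutoff: str, horizon_days: int = 60) -> pd.DataFrame:
    """Return one row per customer active before `cutoff`, with a 0/1 churn label.

    Assumption: a customer has churned if they place no order in the
    `horizon_days` days that follow `cutoff`.
    """
    orders = orders.copy()
    orders["order_date"] = pd.to_datetime(orders["order_date"])
    cutoff_ts = pd.Timestamp(cutoff)
    horizon_end = cutoff_ts + pd.Timedelta(days=horizon_days)

    # Feature window: everything strictly before the cutoff
    before = orders[orders["order_date"] < cutoff_ts]
    # Label window: orders placed in [cutoff, cutoff + horizon)
    after = orders[(orders["order_date"] >= cutoff_ts) & (orders["order_date"] < horizon_end)]

    customers = (
        before.groupby("customer_id")
        .agg(order_count=("order_date", "size"), last_order_date=("order_date", "max"))
        .reset_index()
    )
    returned = set(after["customer_id"])
    customers["churn"] = (~customers["customer_id"].isin(returned)).astype(int)
    return customers

# Toy data: customer 1 returns after the cutoff, customer 2 does not,
# customer 3 only appears after the cutoff and is therefore excluded.
toy_orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "order_date": ["2024-01-05", "2024-03-10", "2024-01-20", "2024-02-01", "2024-03-15"],
})
labels = label_churn(toy_orders, cutoff="2024-03-01", horizon_days=60)
print(labels)
```

Keeping the feature window and the label window separate in this way helps avoid leaking post-cutoff behavior into the features.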


In [None]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

## 1. Load Dataset

In [None]:
# Load dataset (replace with actual path)
# Orders (purchase transaction history)
# Customers (customer information)
# Products (SKU, category, price)
# Incidents (complaints, refund requests, missing items)
# Order_rating (order ratings)
df = pd.read_csv("data/customer_churn.csv")

# Preview dataset
df.head()

## 2. Exploratory Data Analysis (EDA)
- Check missing values  
- Target distribution (churn vs not churn)  
- Descriptive statistics  
- Correlation matrix  

In [None]:
# Basic checks
print(df.shape)
df.info()  # prints its summary directly and returns None
print(df.describe())

# Missing values
df.isnull().sum()

# Distribution plots
num_cols = ["order_count", "avg_order_value", "complaint_count", "refund_count", "missing_items_ratio"]
df[num_cols].hist(bins=20, figsize=(12,8))
plt.show()

# Churn rate
df["churn"].value_counts(normalize=True).plot(kind="bar", title="Churn Distribution")
plt.show()

# Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df[num_cols + ["churn"]].corr(), annot=True, cmap="coolwarm")
plt.show()

# Compare churn vs non-churn customers
sns.boxplot(x="churn", y="order_count", data=df)
plt.show()


## 3. Feature Engineering & Preprocessing
- Encode categorical variables  
- Scale numerical features  
- Train/test split  
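
The three steps listed above can be combined in a scikit-learn `Pipeline`, so that scaling and encoding are fitted on the training split only. The following is a minimal sketch on synthetic data, not the project's actual preprocessing; the `city` column and its values are hypothetical stand-ins for whatever categorical fields the real dataset contains.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the customer-level table
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "order_count": rng.integers(1, 50, size=n),
    "avg_order_value": rng.normal(200, 50, size=n),
    "city": rng.choice(["HCM", "HN", "DN"], size=n),  # hypothetical categorical column
    "churn": rng.integers(0, 2, size=n),
})

numeric_cols = ["order_count", "avg_order_value"]
categorical_cols = ["city"]

# Scale numeric columns, one-hot encode categorical columns
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

clf = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

X_toy = toy[numeric_cols + categorical_cols]
y_toy = toy["churn"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# fit() learns the scaler and encoder statistics from the training split only
clf.fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print(proba[:5])
```

Scaling matters most for the logistic regression baseline; tree-based models such as Random Forest and XGBoost are largely insensitive to feature scale.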

In [None]:
## === Feature Engineering for Frequency Change ===
# Important: a sudden drop in order frequency often signals churn risk

# Average frequency in month -2 and -3
# (Assumes df has an 'order_date' column for transaction history)
df["order_month"] = pd.to_datetime(df["order_date"]).dt.to_period("M")

# Count orders per customer per month
monthly_orders = df.groupby(["customer_id", "order_month"]).size().reset_index(name="order_count_month")

# Pivot to one row per customer and one column per month
# (This assumes the dataset has been filtered up to the current month)
monthly_pivot = monthly_orders.pivot(index="customer_id", columns="order_month", values="order_count_month").fillna(0)

# Capture the month columns first, so that positional slicing is not shifted
# by newly added feature columns. The last 3 months are labelled -1, -2, -3.
month_cols = list(monthly_pivot.columns)
freq_features = pd.DataFrame({
    "avg_freq_month_minus2_3": monthly_pivot[month_cols[-3:-1]].mean(axis=1),  # baseline frequency
    "avg_freq_last30": monthly_pivot[month_cols[-1]],  # most recent month
}).reset_index()

# Merge back into main dataframe
df = df.merge(freq_features, on="customer_id", how="left")

# Optional: delta feature (negative values mean order frequency is dropping)
df["freq_change"] = df["avg_freq_last30"] - df["avg_freq_month_minus2_3"]


# Example engineered features
# Parse last_order_date first so the subtraction works even if it was loaded as text
df["recency_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_order_date"])).dt.days
# Adding 1 to each denominator avoids division by zero
df["complaint_ratio"] = df["complaint_count"] / (df["order_count"] + 1)
df["refund_ratio"] = df["refund_count"] / (df["order_count"] + 1)
df["missing_ratio"] = df["missing_items"] / (df["order_count"] + 1)
df["fresh_ratio"] = df["fresh_items"] / (df["total_items"] + 1)

# Select features
feature_cols = ["recency_days", "order_count", "avg_order_value",
                "complaint_ratio", "refund_ratio", "missing_ratio", "fresh_ratio"]

X = df[feature_cols]
y = df["churn"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)



## 4. Model Building & Training
- Logistic Regression (baseline)  
- Random Forest, XGBoost  
- Hyperparameter tuning  
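
For the hyperparameter tuning step, one option is `GridSearchCV` with stratified folds and ROC-AUC as the selection metric. The sketch below runs on synthetic data from `make_classification`; the parameter grid, the 80/20 class weights, and the three folds are illustrative assumptions rather than tuned choices for this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, imbalanced stand-in for the churn training set
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, weights=[0.8, 0.2], random_state=42
)

# Small illustrative grid: 2 x 2 = 4 candidate settings
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, None],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="roc_auc",  # less sensitive to class imbalance than accuracy
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 4))
```

After fitting, `search.best_estimator_` is refitted on all the data passed to `fit` and can be evaluated on the held-out test set like any other model.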

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))

In [None]:
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest")
print(classification_report(y_test, y_pred_rf))

In [None]:
# XGBoost
# (use_label_encoder is deprecated in recent XGBoost releases, so it is omitted)
xgb = XGBClassifier(eval_metric='logloss')
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("XGBoost")
print(classification_report(y_test, y_pred_xgb))

## 5. Results & Evaluation
- Compare Accuracy, F1-score, ROC-AUC  
- Confusion matrix  
- ROC curve  
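
The ROC curve can be drawn from predicted probabilities with `roc_curve`. This is a minimal sketch on synthetic data with a logistic regression model; in the notebook, the fitted models and `X_test`/`y_test` would take the place of the hypothetical `X_demo`/`y_demo` used here.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(n_samples=400, n_features=7, n_informative=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]  # probability of the positive (churn) class

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_te, scores)
auc = roc_auc_score(y_te, scores)

fig, ax = plt.subplots(figsize=(6, 5))
ax.plot(fpr, tpr, label=f"LR (AUC = {auc:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC Curve")
ax.legend()
plt.show()
```

To compare several models, call `ax.plot` once per model on the same axes before showing the figure.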

In [None]:
models = {"LR": lr, "RF": rf, "XGB": xgb}

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"\n{name} - Classification Report")
    print(classification_report(y_test, y_pred))
    
    # ROC
    y_proba = model.predict_proba(X_test)[:,1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name} - ROC AUC: {auc:.4f}")

In [None]:
# Confusion matrix for best model (e.g. XGB)
cm = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - XGBoost")
plt.show()

## 6. Discussion & Conclusion
- Best model: [fill in here]  
- Key findings and interpretation  
- Limitations of the current approach  
- Future improvements (e.g., more features, deep learning, deployment)  

## 🛠️ Feature Engineering (Preliminary)

- Merge data from **Orders**, **Customers**, **Incidents**, and **Ratings**.
- Create the basic features:
  - `total_orders`
  - `avg_rating`
  - `num_incidents`
  - `days_since_last_order`

In [None]:
import pandas as pd

# Load CSVs
df_orders = pd.read_csv('Orders.csv')
df_customers = pd.read_csv('Customers.csv')
df_products = pd.read_csv('Products.csv')
df_incidents = pd.read_csv('Incidents.csv')
df_ratings = pd.read_csv('Order_rating.csv')

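# Basic merge
df = df_orders.merge(df_customers, on='customer_id', how='left')
df = df.merge(df_ratings, on='order_id', how='left')
df = df.merge(df_incidents.groupby('customer_id').agg(num_incidents=('incident_id','count')).reset_index(),
              on='customer_id', how='left')
# Customers with no incident rows get NaN from the left merge; treat them as 0
df['num_incidents'] = df['num_incidents'].fillna(0)
# nunique counts each order once even if Orders has one row per order line
df['total_orders'] = df.groupby('customer_id')['order_id'].transform('nunique')
df['avg_rating'] = df.groupby('customer_id')['rating'].transform('mean')
# Measure recency from each customer's most recent order, not from every order row
df['order_date'] = pd.to_datetime(df['order_date'])
last_order = df.groupby('customer_id')['order_date'].transform('max')
df['days_since_last_order'] = (pd.Timestamp.today() - last_order).dt.days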

## 🔍 Target-Oriented EDA

- Compare churn rates across the preliminary features.
- Identify the key factors that influence churn.
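
One way to compare churn rates across a numeric feature is to bin it into quantiles and compute the share of churned customers per bin. The helper `churn_rate_by_bin` and the synthetic labels below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np
import pandas as pd

def churn_rate_by_bin(frame: pd.DataFrame, feature: str, q: int = 4) -> pd.DataFrame:
    """Churn rate and customer count per quantile bin of `feature`."""
    bins = pd.qcut(frame[feature], q=q, duplicates="drop")
    return (
        frame.groupby(bins, observed=True)["churn"]
        .agg(churn_rate="mean", customers="size")
        .reset_index()
    )

rng = np.random.default_rng(0)
toy = pd.DataFrame({"total_orders": rng.integers(1, 40, size=500)})
# Synthetic labels: fewer orders -> higher churn probability (illustration only)
toy["churn"] = (rng.random(500) < 1 / np.sqrt(toy["total_orders"])).astype(int)

summary = churn_rate_by_bin(toy, "total_orders")
print(summary)
```

A clear monotonic trend across bins suggests that the feature carries signal; a flat profile suggests that it adds little on its own.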

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Assumes a churn column already exists (1: churn, 0: active)
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.boxplot(x='churn', y='total_orders', data=df, ax=axes[0])
axes[0].set_title('Total Orders vs Churn')

sns.boxplot(x='churn', y='avg_rating', data=df, ax=axes[1])
axes[1].set_title('Avg Rating vs Churn')

sns.boxplot(x='churn', y='num_incidents', data=df, ax=axes[2])
axes[2].set_title('Num Incidents vs Churn')

plt.show()

## 🚀 Feature Engineering (Advanced)

Based on insights from the EDA:
- Create a 30-day rolling feature (`orders_last_30d`).
- Compute the ratio of unresolved incidents (`incident_unresolved_ratio`).
- Compute a **diversity index**: the number of distinct products each customer buys.
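
The notebook defines the diversity index as a count of distinct products. As a possible alternative, and not the method used here, Shannon entropy over purchase shares also reflects how evenly purchases are spread across products. The helper `shannon_diversity` and the toy order lines are hypothetical.

```python
import numpy as np
import pandas as pd

def shannon_diversity(product_ids: pd.Series) -> float:
    """Shannon entropy (natural log) of a customer's product purchase shares."""
    shares = product_ids.value_counts(normalize=True).to_numpy()
    # sum of p * log(1/p); equals 0.0 when a single product makes up all purchases
    return float((shares * np.log(1 / shares)).sum())

# Customer 1 buys only product A; customer 2 spreads purchases evenly over 4 products
toy_lines = pd.DataFrame({
    "customer_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "product_id": ["A", "A", "A", "A", "A", "B", "C", "D"],
})

diversity_entropy = (
    toy_lines.groupby("customer_id")["product_id"]
    .apply(shannon_diversity)
    .rename("diversity_entropy")
    .reset_index()
)
print(diversity_entropy)
```

Customer 1 gets an entropy of 0 and customer 2 gets ln(4), so unlike a distinct-product count, this measure distinguishes a customer who samples widely from one who buys a single product almost every time.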

In [None]:
# 30-day rolling order count
df_orders['order_date'] = pd.to_datetime(df_orders['order_date'])
df_orders = df_orders.sort_values(['customer_id','order_date']).reset_index(drop=True)

# Use a time-based window ('30D'); an integer window of 30 would count the
# previous 30 rows rather than the previous 30 days.
# Assumes customer_id is never null, so the grouped result keeps the sorted row order.
orders_30d = (
    df_orders.set_index('order_date')
    .groupby('customer_id')['order_id']
    .rolling('30D')
    .count()
)
df_orders['orders_last_30d'] = orders_30d.to_numpy()

# Ratio of unresolved incidents
df_incidents['unresolved'] = (df_incidents['status'] != 'resolved').astype(int)
incident_ratio = df_incidents.groupby('customer_id').agg(
    incident_unresolved_ratio=('unresolved','mean')
).reset_index()
df = df.merge(incident_ratio, on='customer_id', how='left')

# Product diversity index (number of distinct products per customer)
diversity = df_orders.groupby('customer_id')['product_id'].nunique().reset_index()
diversity.rename(columns={'product_id':'diversity_index'}, inplace=True)
df = df.merge(diversity, on='customer_id', how='left')