# 📊 Final Project - Customer Churn Prediction for Sendo Farm

## Problem Definition

Sendo Farm is an online grocery e-commerce platform that delivers essential goods, similar to a supermarket, directly to consumers for their daily meals.  
Because the products are groceries—especially fresh items such as meat, fish, fruits, and vegetables—customer satisfaction is highly sensitive to quality.  

Some customers who are dissatisfied with product quality or after-sales service choose to file complaints, request refunds, or report missing items. However, many others leave silently without expressing dissatisfaction and never purchase from Sendo Farm again.  

Customer churn is costly, especially in the grocery business where customers make frequent and recurring purchases.  
Thus, the goal of this project is to build a **supervised machine learning model** that predicts which customers are at risk of churning. This will allow Sendo Farm to proactively take preventive actions, such as personalized campaigns, compensation, or loyalty offers, to improve customer experience and retention.  

### Dataset Description
The available data includes:
- **Transaction history** (order frequency, order value, recency of last purchase).
- **Complaint & refund history** (number of complaints, refund requests).
- **Missing items history** (number of orders with missing products).
- **Purchase ratio of dry vs fresh goods** (stability and sensitivity to product type).
- **Order rating** (order satisfaction).

These features will be engineered into customer-level data suitable for supervised ML classification (churn vs non-churn).
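
As a rough illustration of how an order log could be turned into a labeled, customer-level table, the sketch below splits the history at a cutoff date, builds features only from orders before the cutoff, and marks a customer as churned when no order falls in the following 30 days. The column names, the cutoff date, and the toy rows are assumptions made for this example, not the real Sendo Farm schema.

```python
import pandas as pd

# Hypothetical order log: one row per order (column names are assumptions).
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3, 3],
    "order_date": pd.to_datetime([
        "2024-01-05", "2024-03-10",   # customer 1 buys again after the cutoff
        "2024-01-20", "2024-02-15",   # customer 2 stops before the cutoff
        "2024-02-01", "2024-02-28",   # customer 3 stops before the cutoff
    ]),
    "order_value": [120, 80, 200, 150, 90, 60],
})

cutoff = pd.Timestamp("2024-03-01")    # features only use data before this date
churn_window = pd.Timedelta(days=30)   # assumed churn definition: 30 days without an order

history = orders[orders["order_date"] < cutoff]
future = orders[(orders["order_date"] >= cutoff)
                & (orders["order_date"] < cutoff + churn_window)]

# One row per customer, built only from pre-cutoff history.
customer_df = history.groupby("customer_id").agg(
    order_count=("order_date", "size"),
    avg_order_value=("order_value", "mean"),
    last_order_date=("order_date", "max"),
).reset_index()
customer_df["recency_days"] = (cutoff - customer_df["last_order_date"]).dt.days

# churn = 1 when the customer places no order inside the churn window.
active_ids = set(future["customer_id"])
customer_df["churn"] = (~customer_df["customer_id"].isin(active_ids)).astype(int)

print(customer_df)
```

Keeping the feature window and the label window separate means the model never sees information from the period it is asked to predict.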


In [None]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

## 1. Load Dataset

In [None]:
# Load dataset (replace with actual path)
# Source tables:
#   Orders       (purchase transaction history)
#   Customers    (customer information)
#   Products     (SKU, category, price)
#   Incidents    (complaints, refund requests, missing items)
#   Order_rating (order ratings)
df = pd.read_csv("data/customer_churn.csv")

# Preview dataset
df.head()

## 2. Exploratory Data Analysis (EDA)
- Check missing values  
- Target distribution (churn vs not churn)  
- Descriptive statistics  
- Correlation matrix  

In [None]:
# Basic checks
print(df.shape)
df.info()  # info() prints its summary directly and returns None
print(df.describe())

# Missing values
df.isnull().sum()

# Distribution plots
num_cols = ["order_count", "avg_order_value", "complaint_count", "refund_count", "missing_items_ratio"]
df[num_cols].hist(bins=20, figsize=(12,8))
plt.show()

# Churn rate
df["churn"].value_counts(normalize=True).plot(kind="bar", title="Churn Distribution")
plt.show()

# Correlation heatmap
plt.figure(figsize=(10,6))
sns.heatmap(df[num_cols + ["churn"]].corr(), annot=True, cmap="coolwarm")
plt.show()

# Compare churn vs non-churn customers
sns.boxplot(x="churn", y="order_count", data=df)
plt.show()


## 3. Feature Engineering & Preprocessing
- Encode categorical variables  
- Scale numerical features  
- Train/test split  
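
The three steps listed above can be combined in a single scikit-learn pipeline so that the encoder and scaler are fitted on the training split only. The following is a minimal sketch on synthetic data; the column names (including the categorical `region`) and the toy label rule are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic customer-level data; column names are illustrative assumptions.
rng = np.random.default_rng(42)
n = 200
demo = pd.DataFrame({
    "recency_days": rng.integers(1, 90, size=n),
    "order_count": rng.integers(1, 30, size=n),
    "avg_order_value": rng.uniform(50, 500, size=n),
    "region": rng.choice(["North", "South", "East", "West"], size=n),
})
demo["churn"] = (demo["recency_days"] > 45).astype(int)  # toy label, not a real rule

numeric_features = ["recency_days", "order_count", "avg_order_value"]
categorical_features = ["region"]

# Stratified split keeps the churn rate similar in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    demo[numeric_features + categorical_features], demo["churn"],
    test_size=0.2, stratify=demo["churn"], random_state=42)

# Scale numeric columns and one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_features),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features),
])

clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_tr, y_tr)  # preprocessing statistics come from the training split only
print(f"Test accuracy: {clf.score(X_te, y_te):.3f}")
```

Scaling matters mostly for the logistic regression baseline; tree-based models such as Random Forest and XGBoost are largely insensitive to feature scale.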

In [None]:
## === Feature Engineering for Frequency Change ===
# Important: sudden drop in order frequency often signals churn risk

# Average frequency in month -2 and -3
# (Assume df has a 'order_date' column for transaction history)
df["order_month"] = pd.to_datetime(df["order_date"]).dt.to_period("M")

# Count orders per customer per month
monthly_orders = df.groupby(["customer_id", "order_month"]).size().reset_index(name="order_count_month")

# Pivot table for last 3 months
# (This assumes you have filtered dataset up to current month)
monthly_pivot = monthly_orders.pivot(index="customer_id", columns="order_month", values="order_count_month").fillna(0)

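# Example: label the last 3 months as -3, -2, -1 (columns are sorted oldest -> newest).
# Select the month columns BEFORE adding new columns; otherwise positional
# indexing such as iloc[:, -1] would pick up the engineered column instead.
last3 = monthly_pivot.columns[-3:]
baseline_freq = monthly_pivot[last3[:2]].mean(axis=1)  # baseline frequency (months -3 and -2)
recent_freq = monthly_pivot[last3[-1]]                 # most recent month (-1)
monthly_pivot["avg_freq_month_minus2_3"] = baseline_freq
monthly_pivot["avg_freq_last30"] = recent_freq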

# Merge back into main dataframe
df = df.merge(monthly_pivot[["avg_freq_month_minus2_3", "avg_freq_last30"]], on="customer_id", how="left")

# Optional: create delta feature
df["freq_change"] = df["avg_freq_last30"] - df["avg_freq_month_minus2_3"]


# Example engineered features
df["recency_days"] = (pd.Timestamp.today() - pd.to_datetime(df["last_order_date"])).dt.days  # parse dates first; CSV columns load as strings
df["complaint_ratio"] = df["complaint_count"] / (df["order_count"] + 1)
df["refund_ratio"] = df["refund_count"] / (df["order_count"] + 1)
df["missing_ratio"] = df["missing_items"] / (df["order_count"] + 1)
df["fresh_ratio"] = df["fresh_items"] / (df["total_items"] + 1)

# Select features
feature_cols = ["recency_days", "order_count", "avg_order_value",
                "complaint_ratio", "refund_ratio", "missing_ratio", "fresh_ratio"]

X = df[feature_cols]
y = df["churn"]

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)



## 4. Model Building & Training
- Logistic Regression (baseline)  
- Random Forest, XGBoost  
- Hyperparameter tuning  
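
For the hyperparameter tuning step, one possible approach is a cross-validated grid search. The sketch below tunes a Random Forest on a synthetic, imbalanced dataset; the grid values and the choice of ROC AUC as the selection metric are assumptions for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced stand-in for the churn table (not real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4,
    weights=[0.8, 0.2], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Small illustrative grid; the values are assumptions.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6, None],
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",  # threshold-independent ranking metric
    cv=3,               # 3-fold cross-validation on the training split
)
search.fit(X_tr, y_tr)

print("Best params:", search.best_params_)
print(f"Best CV ROC AUC: {search.best_score_:.3f}")
print(f"Held-out ROC AUC: {search.score(X_te, y_te):.3f}")
```

The held-out split is touched only once, after the search, so the reported test score is not biased by the tuning itself.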

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

print("Logistic Regression")
print(classification_report(y_test, y_pred_lr))

In [None]:
# Random Forest
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)

print("Random Forest")
print(classification_report(y_test, y_pred_rf))

In [None]:
# XGBoost
# use_label_encoder is deprecated in recent XGBoost releases and is no longer needed
xgb = XGBClassifier(eval_metric="logloss", random_state=42)
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)

print("XGBoost")
print(classification_report(y_test, y_pred_xgb))

## 5. Results & Evaluation
- Compare Accuracy, F1-score, ROC-AUC  
- Confusion matrix  
- ROC curve  
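
A ROC curve can be drawn with `roc_curve` from scikit-learn and matplotlib. The sketch below uses a synthetic dataset and a logistic regression as a stand-in for the fitted churn models; the data and variable names are assumptions for illustration.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn data (not real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

# False-positive and true-positive rates at every score threshold.
fpr, tpr, thresholds = roc_curve(y_te, proba)
auc = roc_auc_score(y_te, proba)

plt.figure(figsize=(6, 5))
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC Curve")
plt.legend()
plt.show()
```

To compare several models, the same `plt.plot(fpr, tpr, ...)` call can be repeated inside a loop over the fitted models before calling `plt.legend()`.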

In [None]:
models = {"LR": lr, "RF": rf, "XGB": xgb}

for name, model in models.items():
    y_pred = model.predict(X_test)
    print(f"\n{name} - Classification Report")
    print(classification_report(y_test, y_pred))
    
    # ROC
    y_proba = model.predict_proba(X_test)[:,1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name} - ROC AUC: {auc:.4f}")

In [None]:
# Confusion matrix for best model (e.g. XGB)
cm = confusion_matrix(y_test, y_pred_xgb)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
plt.title("Confusion Matrix - XGBoost")
plt.show()

## 6. Discussion & Conclusion
- Best model: [fill in here]  
- Key findings and interpretation  
- Limitations of the current approach  
- Future improvements (e.g., more features, deep learning, deployment)  


## 🔹 Data Sources

In this project, we work with several datasets:

- **Orders**: purchase transaction history (`order_id, customer_id, product_id, order_date, quantity, price`)
- **Customers**: customer information (`customer_id, gender, age, region, join_date`)
- **Products**: product information (`product_id, category, subcategory, price, brand`)
- **Incidents**: complaints and refunds (`incident_id, order_id, type, date, resolved_flag`)
- **Order_rating**: order ratings (`order_id, rating, feedback_text, review_date`)

These tables are merged to build the feature set for the churn prediction model.


In [None]:

# 🔹 Simulated datasets (replace with pd.read_csv once the real files are available)
orders = pd.DataFrame({
    "order_id": range(1, 11),
    "customer_id": [1,1,2,3,3,3,4,5,5,6],
    "product_id": [101,102,103,104,105,106,107,108,109,110],
    "order_date": pd.date_range("2024-01-01", periods=10, freq="15D"),
    "quantity": np.random.randint(1,5, size=10),
    "price": np.random.randint(50,200, size=10)
})

customers = pd.DataFrame({
    "customer_id": [1,2,3,4,5,6],
    "gender": ["M","F","F","M","F","M"],
    "age": [25,30,40,35,28,45],
    "region": ["North","South","South","East","West","East"],
    "join_date": pd.date_range("2023-01-01", periods=6, freq="90D")
})

products = pd.DataFrame({
    "product_id": range(101,111),
    "category": ["Food","Food","Drink","Drink","Snack","Snack","Household","Household","Personal","Personal"],
    "price": np.random.randint(10,100, size=10)
})

incidents = pd.DataFrame({
    "incident_id": range(1,6),
    "order_id": [2,4,6,8,9],
    "type": ["Refund","Late Delivery","Missing Item","Refund","Damaged"],
    "date": pd.date_range("2024-02-01", periods=5, freq="30D"),
    "resolved_flag": [1,0,1,1,0]
})

order_rating = pd.DataFrame({
    "order_id": [1,2,3,5,7,9],
    "rating": [5,3,4,2,5,1],
    "feedback_text": ["Good","Late delivery","Nice","Bad packaging","Excellent","Broken item"],
    "review_date": pd.date_range("2024-02-01", periods=6, freq="20D")
})

# Preview the data
print("Orders sample:")
print(orders.head())
print("\nCustomers sample:")
print(customers.head())



## 🔹 Feature Engineering (Aggregation)

Some suggested features from each table:

- From **Orders**: total number of orders, total spend, average purchase frequency over the most recent 30 days.  
- From **Incidents**: number of complaints, share of complaints resolved.  
- From **Order_rating**: average rating, share of ratings below 3.  
- From **Products**: diversity of purchased categories (number of distinct categories).  

These features are used to train the churn prediction model (for example, defining **churn = 30 days without a purchase**).
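
A minimal sketch of this aggregation is shown below. It joins small toy versions of the four tables and collapses them to one row per customer. The toy rows, the `snapshot` date, and the derived column names (`total_spend`, `low_rating_share`, and so on) are assumptions for illustration.

```python
import pandas as pd

# Toy versions of the source tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 2, 3, 3],
    "product_id": [101, 102, 101, 103, 103],
    "order_date": pd.to_datetime(
        ["2024-01-01", "2024-02-20", "2024-01-10", "2024-02-25", "2024-03-01"]),
    "quantity": [2, 1, 3, 1, 2],
    "price": [50, 100, 50, 80, 80],
})
products = pd.DataFrame({"product_id": [101, 102, 103],
                         "category": ["Food", "Drink", "Snack"]})
incidents = pd.DataFrame({"incident_id": [1, 2], "order_id": [2, 3],
                          "type": ["Refund", "Missing Item"],
                          "resolved_flag": [1, 0]})
order_rating = pd.DataFrame({"order_id": [1, 2, 3, 5], "rating": [5, 2, 4, 3]})

snapshot = pd.Timestamp("2024-03-05")  # assumed "today" for the toy data

# Count incidents per order, then attach category, rating and incidents to each order.
inc_per_order = incidents.groupby("order_id").agg(
    n_incidents=("incident_id", "size"),
    n_resolved=("resolved_flag", "sum"),
).reset_index()
enriched = (orders
            .merge(products, on="product_id", how="left")
            .merge(order_rating, on="order_id", how="left")
            .merge(inc_per_order, on="order_id", how="left"))
enriched[["n_incidents", "n_resolved"]] = enriched[["n_incidents", "n_resolved"]].fillna(0)
enriched["amount"] = enriched["quantity"] * enriched["price"]
# 1.0 if rating < 3, 0.0 otherwise, NaN for unrated orders (ignored by mean()).
enriched["low_rating"] = (enriched["rating"] < 3).astype(float).where(enriched["rating"].notna())

# Collapse to one row per customer.
features = enriched.groupby("customer_id").agg(
    total_orders=("order_id", "nunique"),
    total_spend=("amount", "sum"),
    num_incidents=("n_incidents", "sum"),
    num_resolved=("n_resolved", "sum"),
    avg_rating=("rating", "mean"),
    low_rating_share=("low_rating", "mean"),
    n_categories=("category", "nunique"),
    last_order_date=("order_date", "max"),
).reset_index()

# Churn label: no purchase in the 30 days before the snapshot date.
features["days_since_last_order"] = (snapshot - features["last_order_date"]).dt.days
features["churn"] = (features["days_since_last_order"] >= 30).astype(int)
print(features)
```

In this simplified form the label is derived from `days_since_last_order`, so that column must not be used as a model input. In practice, features would more likely be computed up to a cutoff date and the label from the period after it.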


In [None]:

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Simulated training dataset (X, y); labels are random, so AUC values near 0.5 are expected
np.random.seed(42)
X = pd.DataFrame({
    "total_orders": np.random.randint(1,20, size=50),
    "avg_rating": np.random.uniform(1,5, size=50),
    "num_incidents": np.random.randint(0,5, size=50),
    "spend": np.random.randint(100,2000, size=50)
})
y = np.random.randint(0,2, size=50)  # churn label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),  # higher max_iter helps convergence on unscaled features
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    print(f"{name}: AUC = {auc:.3f}")
