# 📊 Final Project - Customer Churn Prediction for Sendo Farm

## Problem Definition

Sendo Farm is an online grocery e-commerce platform that delivers essential goods, similar to a supermarket, directly to consumers for their daily meals.  
Because the products are groceries—especially fresh items such as meat, fish, fruits, and vegetables—customer satisfaction is highly sensitive to quality.  

Some customers who are dissatisfied with product quality or after-sales service choose to file complaints, request refunds, or report missing items. However, many others leave silently without expressing dissatisfaction and never purchase from Sendo Farm again.  

Customer churn is costly, especially in the grocery business where customers make frequent and recurring purchases.  
Thus, the goal of this project is to build a **supervised machine learning model** that predicts which customers are at risk of churning. This will allow Sendo Farm to act before those customers leave, for example through personalized campaigns, compensation, or loyalty offers, and so improve customer experience and retention.  

### Dataset Description
The available data includes:
- **Transaction history** (order frequency, order value, recency of last purchase).
- **Complaint & refund history** (number of complaints, refund requests).
- **Missing items history** (number of orders with missing products).
- **Purchase ratio of dry vs fresh goods** (stability and sensitivity to product type).
- **Order rating** (order satisfaction).

These features will be engineered into customer-level data suitable for supervised ML classification (churn vs non-churn).


In [None]:
# 📦 Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve

# Set display options for pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

## 1. Data Loading & Initial Merge

We will load all raw datasets (Orders, Customers, Incidents, Order_rating) and perform a preliminary merge to create a single order-level dataset, in which each order row is enriched with customer, incident, and rating information. This initial step simplifies the customer-level feature engineering and analysis that follow.

In [None]:
# 🔹 Data Sources - Let's use the provided mock dataframes for demonstration
np.random.seed(42)  # fix the seed so the random quantities and prices are reproducible
orders = pd.DataFrame({
    "order_id": range(1, 11),
    "customer_id": [1,1,2,3,3,3,4,5,5,6],
    "product_id": [101,102,103,104,105,106,107,108,109,110],
    "order_date": pd.to_datetime(pd.date_range("2024-01-01", periods=10, freq="15D")),
    "quantity": np.random.randint(1,5, size=10),
    "price": np.random.randint(50,200, size=10)
})

customers = pd.DataFrame({
    "customer_id": [1,2,3,4,5,6],
    "gender": ["M","F","F","M","F","M"],
    "age": [25,30,40,35,28,45],
    "region": ["North","South","South","East","West","East"],
    "join_date": pd.to_datetime(pd.date_range("2023-01-01", periods=6, freq="90D"))
})

incidents = pd.DataFrame({
    "incident_id": range(1,6),
    "order_id": [2,4,6,8,9],
    "type": ["Refund","Late Delivery","Missing Item","Refund","Damaged"],
    "date": pd.to_datetime(pd.date_range("2024-02-01", periods=5, freq="30D")),
    "resolved_flag": [1,0,1,1,0]
})

order_rating = pd.DataFrame({
    "order_id": [1,2,3,5,7,9],
    "rating": [5,3,4,2,5,1],
    "feedback_text": ["Good","Late delivery","Nice","Bad packaging","Excellent","Broken item"],
    "review_date": pd.to_datetime(pd.date_range("2024-02-01", periods=6, freq="20D"))
})

# Merge orders with customer info
df_merged = orders.merge(customers, on='customer_id', how='left')

# Merge incidents and ratings via order_id
df_merged = df_merged.merge(incidents, on='order_id', how='left')
df_merged = df_merged.merge(order_rating, on='order_id', how='left')

print("Initial Merged Dataset Head:")
print(df_merged.head())
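
One thing to watch in the merge above: if an order ever has more than one incident or rating, the left merge repeats that order's row, and any later `sum` over `price` is inflated. The mock data has at most one incident per order, so the sketch below uses a small hypothetical `orders_demo` / `incidents_demo` pair (names and values invented for illustration) to show the effect. One possible guard is to aggregate incidents to one row per order first, then merge with `validate='one_to_one'` so that pandas raises an error if that assumption breaks.

```python
import pandas as pd

# Hypothetical data for illustration only: order 2 has TWO incidents
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3],
    "customer_id": [1, 1, 2],
    "price": [100, 50, 80],
})
incidents_demo = pd.DataFrame({
    "incident_id": [10, 11, 12],
    "order_id": [2, 2, 3],
    "type": ["Refund", "Missing Item", "Damaged"],
})

# Naive merge: order 2 appears twice, so its price is counted twice
naive = orders_demo.merge(incidents_demo, on="order_id", how="left")
naive_spend = naive.groupby("customer_id")["price"].sum()

# Guarded merge: collapse incidents to one row per order before joining
incidents_per_order = (
    incidents_demo.groupby("order_id")
    .agg(num_incidents=("incident_id", "nunique"))
    .reset_index()
)
safe = orders_demo.merge(
    incidents_per_order, on="order_id", how="left", validate="one_to_one"
)
# Orders without any incident get an explicit zero
safe["num_incidents"] = safe["num_incidents"].fillna(0).astype(int)
safe_spend = safe.groupby("customer_id")["price"].sum()

print("Naive spend per customer:", naive_spend.to_dict())
print("Safe spend per customer: ", safe_spend.to_dict())
```

The same pre-aggregation would apply to ratings if an order could be reviewed more than once.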

## 2. Preliminary Feature Engineering

Before diving deep into analysis, we'll create some essential customer-level features by aggregating the merged data. This gives us a basic dataset for initial exploration.

### Feature Aggregation
- **Total Orders**: Total number of orders per customer.
- **Total Spend**: Sum of money spent by each customer.
- **Avg Rating**: Average rating given by the customer.
- **Num Incidents**: Total number of incidents (complaints, refunds) reported by a customer.
- **Recency**: Number of days since the customer's last order.

In [None]:
# Aggregate features at the customer level
customer_features = df_merged.groupby('customer_id').agg(
    total_orders=('order_id', 'nunique'),
    total_spend=('price', 'sum'),
    avg_rating=('rating', 'mean'),
    num_incidents=('incident_id', 'nunique'),
    last_order_date=('order_date', 'max')
).reset_index()

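# Calculate Recency: days since the last order.
# The latest order date in the data serves as the reference date instead of today's date,
# so results are reproducible and historical data does not make every customer look inactive.
last_date = orders['order_date'].max()
customer_features['days_since_last_order'] = (last_date - customer_features['last_order_date']).dt.days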

# Merge with customer demographics
customer_features = customer_features.merge(customers, on='customer_id', how='left')

print("Customer-level Feature Dataset Head:")
print(customer_features.head())

## 3. Targeted Exploratory Data Analysis (EDA)

With our preliminary features, we can now conduct a targeted EDA to understand the relationship between these features and customer churn. The goal is to identify potential churn drivers and generate hypotheses for more advanced feature engineering.

In [None]:
# 🔹 Define Churn based on Recency
# Assuming a simple rule: churn if a customer has not ordered in the last 30 days.
customer_features['churn'] = (customer_features['days_since_last_order'] > 30).astype(int)

# Check Churn Distribution
churn_dist = customer_features['churn'].value_counts(normalize=True)
print("\nChurn Distribution:")
print(churn_dist)
churn_dist.plot(kind='bar', title='Churn vs Non-Churn Distribution')
plt.show()

# Analyze Churn Rate vs Key Features
print("\nChurn rate by number of incidents:")
print(customer_features.groupby('num_incidents')['churn'].mean())

print("\nChurn rate by average rating:")
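bins = [1, 2, 3, 4, 5]
# include_lowest=True keeps an average rating of exactly 1 in the first bin instead of turning it into NaN
customer_features['avg_rating_bin'] = pd.cut(customer_features['avg_rating'], bins=bins, labels=False, include_lowest=True)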
print(customer_features.groupby('avg_rating_bin')['churn'].mean())

# Visualize relationships
sns.boxplot(x='churn', y='days_since_last_order', data=customer_features)
plt.title('Days Since Last Order vs Churn')
plt.show()

sns.barplot(x='num_incidents', y='churn', data=customer_features, estimator=np.mean)
plt.title('Churn Rate vs Number of Incidents')
plt.show()
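
The recency rule above labels churn from the same period the features are computed on. A common alternative, sketched here under assumptions, splits the timeline at a cutoff date: features come only from orders on or before the cutoff, and a customer is labelled as churned if they place no order in the following `horizon_days`. The function name `label_churn_by_cutoff`, the cutoff date, and the 30-day horizon are illustrative choices, not part of the original notebook.

```python
import pandas as pd

def label_churn_by_cutoff(orders, cutoff, horizon_days=30):
    """Return one row per customer active on or before `cutoff`, with a churn label.

    churn = 1 if the customer placed no order in (cutoff, cutoff + horizon_days].
    """
    cutoff = pd.Timestamp(cutoff)
    horizon_end = cutoff + pd.Timedelta(days=horizon_days)

    # Features may only look at the past; the label only looks at the future
    history = orders[orders["order_date"] <= cutoff]
    future = orders[(orders["order_date"] > cutoff) & (orders["order_date"] <= horizon_end)]

    features = history.groupby("customer_id").agg(
        total_orders=("order_id", "nunique"),
        last_order_date=("order_date", "max"),
    )
    features["days_since_last_order"] = (cutoff - features["last_order_date"]).dt.days

    # Customers seen again inside the horizon are not churned
    returned = set(future["customer_id"])
    features["churn"] = (~features.index.isin(returned)).astype(int)
    return features.reset_index()

# Hypothetical orders for illustration only
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "customer_id": [1, 1, 2, 3, 3],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-01-25", "2024-03-15"]
    ),
})

labels = label_churn_by_cutoff(orders_demo, cutoff="2024-01-31", horizon_days=30)
print(labels)
```

With this split, recency can remain a feature because it is measured before the period that determines the label.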

## 4. Advanced Feature Engineering

Based on the EDA findings, we will now engineer more sophisticated features that capture behavioral patterns over time, such as changes in order frequency. These features are often more predictive of churn.

In [None]:
# 🔹 Feature: Order Frequency Change
# This feature captures a sudden drop in a customer's purchasing frequency.

# Sort data by customer and date for time-series analysis
orders_sorted = orders.sort_values(by=['customer_id', 'order_date'])

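# Calculate rolling order frequency (orders in the trailing 30 days, per customer).
# A time-based window such as '30D' needs a datetime index, so index by order_date first.
# The result is ordered by customer and date, which matches the row order of orders_sorted.
rolling_counts = orders_sorted.set_index('order_date').groupby('customer_id')['order_id'].rolling('30D').count()
orders_sorted['rolling_30d_freq'] = rolling_counts.to_numpy()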

# Calculate order frequency for specific periods (e.g., month -1, month -2)
current_month = orders_sorted['order_date'].max().to_period('M')

# Filter orders from the last two months
orders_last_2_months = orders_sorted[orders_sorted['order_date'] >= (current_month - 1).to_timestamp()]

monthly_orders = orders_last_2_months.groupby(['customer_id', orders_last_2_months['order_date'].dt.to_period('M')]).size().unstack(fill_value=0)

# Create frequency change feature
if monthly_orders.shape[1] >= 2:
    # Select both month columns before adding anything, so positional indexing stays correct
    freq_month_minus1 = monthly_orders.iloc[:, -1]  # most recent month
    freq_month_minus2 = monthly_orders.iloc[:, -2]  # the month before it
    freq_change = (freq_month_minus1 - freq_month_minus2).rename('freq_change').reset_index()
    customer_features = customer_features.merge(freq_change, on='customer_id', how='left')
else:
    print("Not enough data to create frequency change feature.")
    customer_features['freq_change'] = 0
    
print("\nUpdated Customer Features with Advanced Features:")
print(customer_features.head())
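
Calendar-month columns only exist for months in which at least one order was placed, so a quiet month can silently drop out of the comparison. An alternative sketch, assuming a fixed `snapshot_date` and two back-to-back 30-day windows (both illustrative choices, as is the function name `window_freq_change`), counts orders per customer in each window and reindexes to all customers so that inactive ones get an explicit zero.

```python
import pandas as pd

def window_freq_change(orders, snapshot_date, window_days=30):
    """Orders in the most recent window minus orders in the window before it."""
    snapshot_date = pd.Timestamp(snapshot_date)
    recent_start = snapshot_date - pd.Timedelta(days=window_days)
    previous_start = recent_start - pd.Timedelta(days=window_days)

    # Every customer should appear in the output, even with zero orders
    all_customers = pd.Index(sorted(orders["customer_id"].unique()), name="customer_id")

    in_recent = (orders["order_date"] > recent_start) & (orders["order_date"] <= snapshot_date)
    in_previous = (orders["order_date"] > previous_start) & (orders["order_date"] <= recent_start)

    recent = orders[in_recent].groupby("customer_id").size().reindex(all_customers, fill_value=0)
    previous = orders[in_previous].groupby("customer_id").size().reindex(all_customers, fill_value=0)

    return pd.DataFrame({
        "orders_recent": recent,
        "orders_previous": previous,
        "freq_change": recent - previous,
    }).reset_index()

# Hypothetical orders for illustration only
orders_demo = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer_id": [1, 1, 1, 2, 2, 3],
    "order_date": pd.to_datetime([
        "2024-04-20", "2024-04-28", "2024-05-20",  # customer 1 slows down
        "2024-05-05", "2024-05-25",                # customer 2 only orders recently
        "2024-01-10",                              # customer 3 is outside both windows
    ]),
})

result = window_freq_change(orders_demo, snapshot_date="2024-05-31")
print(result)
```

A negative `freq_change` marks a customer whose ordering has slowed, which is the pattern this section is trying to capture.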

## 5. Model Building & Training

Now that we have a customer-level feature set, we can train and evaluate several supervised machine learning models to predict churn.

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier  # XGBoost ships in its own package, not in sklearn.ensemble

# Handle potential missing values from merge
customer_features.fillna(0, inplace=True)

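# Select features and target.
# days_since_last_order is excluded: the churn label is derived directly from it,
# so using it as a feature would leak the target into the model.
feature_cols = ['total_orders', 'total_spend', 'avg_rating', 'num_incidents', 'freq_change']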

X = customer_features[feature_cols]
y = customer_features['churn']

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

# Train models
models = {
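    # max_iter is raised because the features are unscaled and the default may not converge
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    # use_label_encoder is deprecated in recent XGBoost releases and is omitted here
    "XGBoost": XGBClassifier(eval_metric='logloss', random_state=42)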
}

for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n{name} - Classification Report:")
    print(classification_report(y_test, y_pred))
    y_proba = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_proba)
    print(f"{name} - ROC AUC: {auc:.4f}")
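
The imports at the top include `confusion_matrix` and `roc_curve`, which the loop above does not use. The following sketch shows how they might complement the classification report. It runs on synthetic data from `make_classification` (an assumption made here because the six mock customers are too few to evaluate anything meaningfully), and choosing a threshold by maximising TPR minus FPR (Youden's J statistic) is only one illustrative option.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a customer feature table (roughly 30% positives)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=4, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# Confusion matrix at a 0.5 threshold: rows = actual class, columns = predicted class
cm = confusion_matrix(y_te, (proba >= 0.5).astype(int))
print("Confusion matrix:\n", cm)

# ROC curve: one (FPR, TPR) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_te, proba)
print(f"ROC AUC: {roc_auc_score(y_te, proba):.4f}")

# Youden's J = TPR - FPR; its maximum is one simple way to pick a threshold
best_idx = int(np.argmax(tpr - fpr))
print(f"Threshold maximising TPR - FPR: {thresholds[best_idx]:.3f}")
```

For churn, the cost of missing an at-risk customer usually differs from the cost of an unnecessary offer, so the operating threshold is a business decision as much as a statistical one.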

## 6. Conclusion & Future Work

Comparing the classification reports and ROC AUC scores above identifies the best-performing model. This notebook is a starting point, and future improvements could include:

- **Feature Scaling**: Apply `StandardScaler` to numerical features for models like Logistic Regression.
- **Hyperparameter Tuning**: Use `GridSearchCV` or `RandomizedSearchCV` to find optimal parameters for models like Random Forest and XGBoost.
- **Advanced Features**: Incorporate more features from the `Products` table (e.g., proportion of fresh vs. dry goods purchased).
- **Deployment**: Integrate the best model into a real-time system to identify at-risk customers dynamically.
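
The first two bullets could be combined along these lines. This is a sketch on synthetic data from `make_classification` rather than the Sendo Farm features, and the parameter grid, `cv=5`, and `roc_auc` scoring are illustrative assumptions. Wrapping `StandardScaler` in a `Pipeline` means the scaler is refit on each training fold, so no statistics from the validation folds leak into the scaling.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customer feature table
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=4, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Scaling and the classifier are tuned and evaluated as one unit
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Illustrative grid over the inverse regularisation strength C
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)

print("Best C:", search.best_params_["clf__C"])
print(f"Cross-validated ROC AUC: {search.best_score_:.4f}")
print(f"Held-out ROC AUC: {search.score(X_te, y_te):.4f}")
```

The same pattern works for Random Forest or XGBoost by swapping the final pipeline step and the grid keys.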