### **Step 3: Data Preprocessing & Feature Engineering**

---

## **Objective:**
- Prepare the dataset for machine learning modeling.
- Encode categorical variables, engineer meaningful features, and calculate metrics critical for customer utility modeling.

---

## **Practical Python Implementation:**

In [3]:
import pandas as pd
import numpy as np

data = pd.read_csv('online_retail_customer_churn.csv')

# 1. Encode categorical variables with dummy/one-hot encoding
categorical_features = ['Gender', 'Promotion_Response', 'Email_Opt_In']
data_encoded = pd.get_dummies(data, columns=categorical_features, drop_first=True)

# 2. Feature Engineering
# Ratio of Returns (how often customers return their purchases)
data_encoded['Return_Ratio'] = data_encoded['Num_of_Returns'] / data_encoded['Num_of_Purchases']
# x/0 yields inf and 0/0 yields NaN; map both to 0
data_encoded['Return_Ratio'] = data_encoded['Return_Ratio'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Frequency of Purchase (how frequently the customer buys)
data_encoded['Purchase_Frequency'] = data_encoded['Num_of_Purchases'] / data_encoded['Last_Purchase_Days_Ago']
# A purchase made today (0 days ago) yields inf; map inf and NaN to 0
data_encoded['Purchase_Frequency'] = data_encoded['Purchase_Frequency'].replace([np.inf, -np.inf], np.nan).fillna(0)

# Engagement Score (combining Promotion Response and Email Opt-In)
# Assume engagement weights: Responded=3, Ignored=1, Unsubscribed=0, Email_Opt_In=True: +1
engagement_map = {'Responded': 3, 'Ignored': 1, 'Unsubscribed': 0}
data_encoded['Promotion_Response_Score'] = data['Promotion_Response'].map(engagement_map)
data_encoded['Email_Opt_In_Score'] = data['Email_Opt_In'].apply(lambda x: 1 if x else 0)

data_encoded['Engagement_Score'] = data_encoded['Promotion_Response_Score'] + data_encoded['Email_Opt_In_Score']

# 3. Customer Lifetime Value (CLV) Estimation
# Simple CLV calculation using transaction data and tenure
data_encoded['CLV'] = (data_encoded['Average_Transaction_Amount'] *
                       data_encoded['Purchase_Frequency'] *
                       (data_encoded['Years_as_Customer']))

# Replace NaN values (e.g. from missing inputs) with zero
data_encoded['CLV'] = data_encoded['CLV'].fillna(0)

# 4. Estimated Cost per Customer Intervention (for promotions)
# Assume an average fixed marketing cost per customer intervention (here, $50)
average_intervention_cost = 50
data_encoded['Intervention_Cost'] = average_intervention_cost

# Display the first few rows of the engineered data
print("Preprocessed and Engineered Data (first 10 rows):")
display_cols = [
    'Customer_ID',
    'Return_Ratio',
    'Purchase_Frequency',
    'Engagement_Score',
    'CLV',
    'Intervention_Cost',
    'Target_Churn'
]

display(data_encoded[display_cols].head(10))

data_encoded.to_csv('processed_customer_churn_data.csv', index=False)

Preprocessed and Engineered Data (first 10 rows):


| | Customer_ID | Return_Ratio | Purchase_Frequency | Engagement_Score | CLV | Intervention_Cost | Target_Churn |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.090909 | 0.170543 | 4 | 386.96124 | 50 | True |
| 1 | 2 | 0.025974 | 0.339207 | 3 | 100.981938 | 50 | False |
| 2 | 3 | 0.070423 | 0.250883 | 3 | 164.802792 | 50 | True |
| 3 | 4 | 0.151515 | 0.146018 | 2 | 180.403407 | 50 | True |
| 4 | 5 | 0.069767 | 0.177686 | 0 | 269.70062 | 50 | False |
| 5 | 6 | 0.058824 | 0.653846 | 0 | 5190.113077 | 50 | False |
| 6 | 7 | 0.0 | 1.262295 | 1 | 1197.337377 | 50 | False |
| 7 | 8 | 0.034483 | 0.388393 | 1 | 98.822679 | 50 | False |
| 8 | 9 | 0.428571 | 0.166667 | 4 | 28.966667 | 50 | True |
| 9 | 10 | 0.176471 | 0.19883 | 1 | 1626.444678 | 50 | False |


## **Description of the Steps:**

### **Step 1: Encode categorical variables**
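- **One-hot encoding** converts the categorical variables (`Gender`, `Promotion_Response`, and `Email_Opt_In`) into numeric indicator columns suitable for machine learning models. With `drop_first=True`, the first level of each variable is dropped, so a variable with k categories produces k-1 columns.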
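
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame. The three rows and their category values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy rows (hypothetical values, not from online_retail_customer_churn.csv)
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Other"],
    "Email_Opt_In": [True, False, True],
})

# drop_first=True removes the first level of each variable (in sorted order),
# so a variable with k categories becomes k-1 indicator columns.
encoded = pd.get_dummies(toy, columns=["Gender", "Email_Opt_In"], drop_first=True)
print(encoded.columns.tolist())
```

The dropped level (`Female` in this toy frame) becomes the implicit baseline: a row with zeros in every remaining `Gender_*` column belongs to it.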

### **Step 2: Feature Engineering**
- **Return Ratio**:
  - `Num_of_Returns / Num_of_Purchases`. A high ratio may indicate customer dissatisfaction or issues with product quality.
- **Purchase Frequency**:
  - `Num_of_Purchases / Last_Purchase_Days_Ago`, used as a proxy for the customer's activity and engagement level.
- **Engagement Score**:
  - Sum of a promotion-response score (Responded = 3, Ignored = 1, Unsubscribed = 0) and an email opt-in score (opted in = 1, otherwise 0), reflecting overall customer engagement with marketing.
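
The ratio features need care around zero denominators: in pandas, `0 / 0` produces `NaN`, but a nonzero value divided by zero produces `inf`, which `fillna(0)` does not touch. The sketch below shows one way to handle both cases and then builds the engagement score. The helper name `safe_ratio` and the toy rows are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def safe_ratio(numerator: pd.Series, denominator: pd.Series) -> pd.Series:
    """Divide element-wise, mapping both x/0 (inf) and 0/0 (NaN) to 0."""
    ratio = numerator / denominator
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)

# Toy rows (hypothetical values) covering the normal, 0/0, and x/0 cases
toy = pd.DataFrame({
    "Num_of_Returns": [1, 0, 2],
    "Num_of_Purchases": [10, 0, 0],
    "Promotion_Response": ["Responded", "Ignored", "Unsubscribed"],
    "Email_Opt_In": [True, False, True],
})
toy["Return_Ratio"] = safe_ratio(toy["Num_of_Returns"], toy["Num_of_Purchases"])

# Same weights as the notebook: Responded=3, Ignored=1, Unsubscribed=0, opt-in adds 1
engagement_map = {"Responded": 3, "Ignored": 1, "Unsubscribed": 0}
toy["Engagement_Score"] = (
    toy["Promotion_Response"].map(engagement_map) + toy["Email_Opt_In"].astype(int)
)
print(toy[["Return_Ratio", "Engagement_Score"]])
```

Mapping undefined ratios to 0 is a modeling choice, not a neutral default. It treats a customer with no purchases the same as one with no returns, so a separate indicator column may be worth adding if that distinction matters.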

### **Step 3: Customer Lifetime Value (CLV) Estimation**
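- Uses historical purchase behavior to estimate a simple, practical CLV:
  - $$ \text{CLV} = \text{Average Transaction Amount} \times \text{Purchase Frequency} \times \text{Years as Customer} $$
- Any missing (NaN) CLV values are filled with zero.
- Because Purchase Frequency here is purchases divided by days since the last purchase rather than an annual rate, this CLV is best read as a relative score for ranking customers, not as a dollar forecast.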
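
As a worked instance of the formula, the sketch below evaluates it for a single made-up customer. The function name `simple_clv` and the input numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def simple_clv(avg_transaction_amount: float,
               purchase_frequency: float,
               years_as_customer: float) -> float:
    """CLV = Average Transaction Amount x Purchase Frequency x Years as Customer."""
    return avg_transaction_amount * purchase_frequency * years_as_customer

# Hypothetical customer: 100 per transaction, frequency 0.25, 4 years of tenure
print(simple_clv(100.0, 0.25, 4))   # 100.0

# A customer with zero purchase frequency gets a CLV of zero
print(simple_clv(100.0, 0.0, 4))    # 0.0
```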

### **Step 4: Intervention Cost Estimation**
- Sets a simplified, constant intervention cost of $50 per customer. This placeholder can be refined using actual promotional spending.
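
One simple way to put the cost column to use is to compare it with CLV. The sketch below does this for the ten rows shown in the output above. Treating `CLV - Intervention_Cost` as a screening margin is an assumption of this example, not a rule stated in the notebook, and the column name `CLV_minus_Cost` is hypothetical.

```python
import pandas as pd

# Customer_ID and CLV copied from the first 10 rows of the output shown earlier
rows = pd.DataFrame({
    "Customer_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "CLV": [386.96124, 100.981938, 164.802792, 180.403407, 269.70062,
            5190.113077, 1197.337377, 98.822679, 28.966667, 1626.444678],
})
average_intervention_cost = 50

# Hypothetical screening margin: how much the CLV score exceeds the flat cost
rows["CLV_minus_Cost"] = rows["CLV"] - average_intervention_cost

# Customers whose CLV score does not cover the intervention cost
below_cost = rows.loc[rows["CLV_minus_Cost"] < 0, "Customer_ID"].tolist()
print(below_cost)  # [9]
```

Among these ten rows, only customer 9 (CLV of about 28.97) falls below the $50 cost.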

---

### 🚩 **Practical Application for Next Steps**:
- The engineered features and calculated metrics can now be fed directly into predictive models to identify at-risk customers. CLV and intervention cost then support the utility-driven approach: optimizing retention campaigns and prioritizing where resources are spent.
