## **Step 4: Machine Learning with Utility Functions**

### **Approach A: Cost-sensitive Classification using Decision Trees**

### **Objective:**
Implement a cost-sensitive decision tree classifier that explicitly integrates **customer lifetime value (CLV)** and intervention costs into the model's predictions, optimizing decision-making to maximize retention and minimize unnecessary promotional spending.

### **How the approach works:**

- **Cost-Sensitive Classification** assigns a different utility (positive or negative) to each prediction outcome:

| Predicted \ Actual | Actually churns | Actually stays |
|--------------------|-----------------|----------------|
| **Predict Churn** (intervene) | **CLV - Cost** (customer retained) | **-Cost** (wasted marketing spend) |
| **Predict Stay** (no intervention) | **-CLV** (customer lost) | **0** (no intervention needed) |

- The model learns to prioritize decisions where potential losses (losing high-value customers) are minimized and utility (profitability from retention) is maximized.
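
The utility matrix implies a simple decision rule once a churn probability is available. The following is a minimal sketch rather than part of the original implementation: the function names `expected_utilities` and `should_intervene` are hypothetical, and it assumes, as the matrix does, that an intervention always retains a customer who would otherwise churn.

```python
def expected_utilities(p_churn, clv, cost):
    """Return (expected utility if we intervene, expected utility if we do nothing).

    Uses the utility matrix: TP = clv - cost, FP = -cost, FN = -clv, TN = 0.
    Assumption: an intervention always retains a would-be churner.
    """
    intervene = p_churn * (clv - cost) + (1 - p_churn) * (-cost)
    do_nothing = p_churn * (-clv) + (1 - p_churn) * 0.0
    return intervene, do_nothing


def should_intervene(p_churn, clv, cost):
    """Intervene only when doing so has the higher expected utility."""
    intervene, do_nothing = expected_utilities(p_churn, clv, cost)
    return intervene > do_nothing


# Illustrative numbers only (not taken from the dataset)
print(should_intervene(p_churn=0.10, clv=1500, cost=10))  # high-value customer
print(should_intervene(p_churn=0.10, clv=40, cost=10))    # low-value customer
```

Setting the two expressions equal shows that, under these assumptions, intervening pays off when `p_churn > cost / (2 * clv)`, so high-CLV customers justify an offer even at low churn probabilities.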

### **Explanation of the provided implementation:**

- **Step-by-step Data Preparation**:
  - One-hot encodes the categorical variables and derives `Return_Ratio`, `Purchase_Frequency`, and `CLV`.
  - Computes a per-customer utility for each of the four prediction outcomes from CLV and a fixed intervention cost.

- **Cost-sensitive Decision Tree Classifier**:
  - Passes per-customer `sample_weight` values to `fit`, so that splits are driven more by customers with larger potential gains or losses. Churners are weighted by the absolute value of CLV - Cost and non-churners by CLV.

- **Model Evaluation**:
  - Reports standard classification metrics via `classification_report` and `confusion_matrix`.
  - Calculates explicit **total utility** to determine the financial effectiveness of predictions.
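
Weighting by absolute utility is one option. A common alternative in example-dependent cost-sensitive learning is to weight each row by the utility lost if that row is misclassified. The sketch below illustrates that variant on made-up numbers; `misclassification_cost_weights` is a hypothetical helper, not part of this notebook or of scikit-learn.

```python
import numpy as np


def misclassification_cost_weights(y, clv, cost):
    """Weight each sample by the utility lost if it is misclassified.

    Churner (y == 1):     TP - FN = (clv - cost) - (-clv) = 2 * clv - cost
    Non-churner (y == 0): TN - FP = 0 - (-cost)           = cost
    Weights are clipped at zero so that no sample gets a negative weight.
    """
    y = np.asarray(y)
    clv = np.asarray(clv, dtype=float)
    churner_weight = 2.0 * clv - cost
    stayer_weight = np.full_like(clv, float(cost))
    return np.clip(np.where(y == 1, churner_weight, stayer_weight), 0.0, None)


# Illustrative values only (not from the dataset)
y_demo = [1, 0, 1, 0]
clv_demo = [500.0, 500.0, 3.0, 20.0]
weights_demo = misclassification_cost_weights(y_demo, clv_demo, cost=10)
print(weights_demo)  # could be passed as sample_weight=... to DecisionTreeClassifier.fit
```

With this scheme a non-churner's weight depends only on the intervention cost, since that is all a false positive wastes, while a churner's weight grows with CLV.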

---

### **Expected Practical Outcome**:

- Identifies high-utility customers for targeted intervention.
- Reduces unnecessary promotional spending.
- Maximizes retained customer lifetime value.

---

### **Complete Practical Implementation in Python:**

The full implementation follows. It expects `online_retail_customer_churn.csv` in the working directory:

In [3]:
# Import required libraries
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Step 1: Load and preprocess the dataset
data = pd.read_csv('online_retail_customer_churn.csv')

# Step 2: Encode categorical features safely
data_encoded = pd.get_dummies(
    data, 
    columns=['Gender', 'Promotion_Response', 'Email_Opt_In'], 
    drop_first=True
)

# Step 3: Feature Engineering
data_encoded['Return_Ratio'] = data_encoded['Num_of_Returns'] / data_encoded['Num_of_Purchases'].replace(0, 1)
data_encoded['Purchase_Frequency'] = data_encoded['Num_of_Purchases'] / data_encoded['Last_Purchase_Days_Ago'].replace(0, 1)
data_encoded['CLV'] = (data_encoded['Average_Transaction_Amount'] *
                       data_encoded['Purchase_Frequency'] *
                       data_encoded['Years_as_Customer']).fillna(0)

# Estimated intervention cost (e.g., marketing cost per offer)
data_encoded['Intervention_Cost'] = 10  # Assumed fixed cost per intervention

# Step 4: Define utility values explicitly
data_encoded['Utility_True_Positive'] = data_encoded['CLV'] - data_encoded['Intervention_Cost']
data_encoded['Utility_False_Positive'] = - data_encoded['Intervention_Cost']
data_encoded['Utility_False_Negative'] = - data_encoded['CLV']
data_encoded['Utility_True_Negative'] = 0

# Step 5: Dynamically identify features for modeling
encoded_columns = [col for col in data_encoded.columns if 
                   col.startswith('Gender_') or
                   col.startswith('Promotion_Response_') or
                   col.startswith('Email_Opt_In_')]

features = ['Return_Ratio', 'Purchase_Frequency', 'CLV'] + encoded_columns

# Display the selected features clearly
print("Selected features for modeling:", features)

# Step 6: Split dataset into training and testing sets
X = data_encoded[features]
y = data_encoded['Target_Churn'].astype(int)

X_train, X_test, y_train, y_test, util_train, util_test = train_test_split(
    X, y,
    data_encoded[['Utility_True_Positive', 'Utility_False_Positive', 
                  'Utility_False_Negative', 'Utility_True_Negative']],
    test_size=0.3, random_state=42
)

# Step 7: Train cost-sensitive Decision Tree model
# Weights: |CLV - cost| for churners and |CLV| for non-churners,
# so high-CLV customers dominate the split criteria
sample_weights = np.where(
    y_train == 1,
    util_train['Utility_True_Positive'].abs(),
    util_train['Utility_False_Negative'].abs()
)

clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train, sample_weight=sample_weights)

# Step 8: Model prediction
y_pred = clf.predict(X_test)

# Step 9: Evaluate model performance clearly
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Step 10: Calculate total realized utility of the model predictions
TP = (y_pred == 1) & (y_test == 1)
FP = (y_pred == 1) & (y_test == 0)
FN = (y_pred == 0) & (y_test == 1)
TN = (y_pred == 0) & (y_test == 0)

total_utility = (
    util_test['Utility_True_Positive'][TP].sum() +
    util_test['Utility_False_Positive'][FP].sum() +
    util_test['Utility_False_Negative'][FN].sum() +
    util_test['Utility_True_Negative'][TN].sum()
)

print(f"\nTotal Utility (financial impact) of model decisions: ${total_utility:.2f}")

Selected features for modeling: ['Return_Ratio', 'Purchase_Frequency', 'CLV', 'Gender_Male', 'Gender_Other', 'Promotion_Response_Responded', 'Promotion_Response_Unsubscribed', 'Email_Opt_In_True']

Classification Report:
              precision    recall  f1-score   support

           0       0.48      0.50      0.49       135
           1       0.57      0.55      0.56       165

    accuracy                           0.53       300
   macro avg       0.52      0.52      0.52       300
weighted avg       0.53      0.53      0.53       300


Confusion Matrix:
[[67 68]
 [74 91]]

Total Utility (financial impact) of model decisions: $49879.37


### **Interpretation of the Model Performance:**

The **classification report** covers a test set of 300 customers, 165 of whom churned:

- **Precision (Positive class: "1" - Churn):** **0.57**  
  Of all customers predicted to churn, **57%** actually churned. This is only marginally above the 55% churn rate in the test set (165 of 300), so a churn prediction is barely more informative than the base rate.

- **Recall (Positive class: "1" - Churn):** **0.55**  
  The model detected **55%** of the customers who actually churned. The remaining 45% (74 of 165) were missed and would receive no intervention.

- **F1-score (Positive class: "1" - Churn):** **0.56**  
  The harmonic mean of precision and recall. At 0.56 it is below the 0.71 that labeling every customer as churn would achieve (precision 0.55, recall 1.00), so the tree needs further tuning.

- **Accuracy:** **0.53** (53%)  
  Above a coin flip (50%) but below the 55% obtained by always predicting the majority class (churn), so the classifier has little discriminative power as trained.
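
The reported figures can be rechecked directly from the confusion matrix, alongside a trivial "always predict churn" baseline. This sketch uses only the four counts printed in the output; the variable names are illustrative.

```python
# Counts from the confusion matrix (rows = actual class, columns = predicted class)
tn, fp = 67, 68
fn, tp = 74, 91
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Baseline: label every customer as churn (class 1 is the majority class)
actual_churn = tp + fn
actual_stay = tn + fp
baseline_accuracy = actual_churn / total
baseline_f1 = 2 * actual_churn / (2 * actual_churn + actual_stay)

print(f"model:    accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
print(f"baseline: accuracy={baseline_accuracy:.2f} f1={baseline_f1:.2f}")
```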

---

### **Practical Insight:**
The model classifies 53% of test customers correctly, which is weak as a classifier. Its usefulness is better judged by the **utility-based financial evaluation**:  
- Even moderate classification performance can yield business value if interventions are prioritized by Customer Lifetime Value (CLV). The reported total of $49,879.37 assumes that every intervention on a true churner retains the full CLV, and it is meaningful only relative to a baseline such as contacting nobody or contacting everyone.
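
A total utility figure is easiest to judge against simple reference policies. The sketch below, on made-up numbers, scores any set of intervention decisions with the same four-cell utility definition used in Step 10, so a model can be compared with "contact nobody" and "contact everybody". `policy_utility` is a hypothetical helper.

```python
import numpy as np


def policy_utility(y_true, y_pred, clv, cost):
    """Total utility of intervention decisions under the utility matrix.

    TP -> clv - cost, FP -> -cost, FN -> -clv, TN -> 0.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    clv = np.asarray(clv, dtype=float)
    per_customer = np.select(
        [
            (y_pred == 1) & (y_true == 1),  # true positive
            (y_pred == 1) & (y_true == 0),  # false positive
            (y_pred == 0) & (y_true == 1),  # false negative
        ],
        [clv - cost, np.full_like(clv, -float(cost)), -clv],
        default=0.0,                        # true negative
    )
    return float(per_customer.sum())


# Illustrative values only (not from the dataset)
y_true_demo = np.array([1, 0, 1, 0])
clv_demo = np.array([800.0, 300.0, 50.0, 120.0])
model_pred_demo = np.array([1, 0, 0, 1])

u_model = policy_utility(y_true_demo, model_pred_demo, clv_demo, cost=10)
u_nobody = policy_utility(y_true_demo, np.zeros(4, dtype=int), clv_demo, cost=10)
u_everybody = policy_utility(y_true_demo, np.ones(4, dtype=int), clv_demo, cost=10)
print(u_model, u_nobody, u_everybody)
```

Running the same three comparisons on the real test set would show whether the model's total utility improves on blanket policies or mainly reflects that most interventions have positive utility under these assumptions.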

---

### **Additional Test Examples (Predicting Churn for New Customer Profiles):**

The following code scores three hypothetical customer profiles with the trained model:

In [5]:
# Construct test examples using original categorical format
test_examples_original = pd.DataFrame([
    # Example 1: High CLV, Frequent buyer, positive engagement
    {
        'Gender': 'Male',
        'Promotion_Response': 'Responded', # high engagement
        'Email_Opt_In': True,
        'Return_Ratio': 0.1,
        'Purchase_Frequency': 0.8,
        'CLV': 1500
    },
    # Example 2: Low CLV, high returns, poor engagement
    {
        'Gender': 'Female',
        'Promotion_Response': 'Ignored', # low engagement
        'Email_Opt_In': False,
        'Return_Ratio': 0.7,
        'Purchase_Frequency': 0.1,
        'CLV': 200
    },
    # Example 3: Medium CLV, unsubscribed from promotions
    {
        'Gender': 'Other',
        'Promotion_Response': 'Unsubscribed', 
        'Email_Opt_In': True,
        'Return_Ratio': 0.3,
        'Purchase_Frequency': 0.5,
        'CLV': 700
    }
])

# Encode categorical features to exactly match training data
test_encoded = pd.get_dummies(test_examples_original, columns=['Gender', 'Promotion_Response', 'Email_Opt_In'])

# Ensure all original features used during training are present, filling missing ones with zeros
for col in X_train.columns:
    if col not in test_encoded.columns:
        test_encoded[col] = 0  # missing feature added with 0

# Reorder columns to match exactly
test_encoded = test_encoded[X_train.columns]

# Now predict with the trained classifier
example_predictions = clf.predict(test_encoded)

# Display predictions clearly
for i, pred in enumerate(example_predictions, 1):
    result = 'Likely to Churn' if pred == 1 else 'Likely to Stay'
    print(f"Customer Example {i}: {result}")

Customer Example 1: Likely to Churn
Customer Example 2: Likely to Stay
Customer Example 3: Likely to Churn
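
The encode-and-align step used for the test examples can be wrapped in a small reusable function. This is a sketch under stated assumptions: `encode_like_training` is a hypothetical helper, and the column list is copied from the "Selected features for modeling" output rather than read from a fitted model.

```python
import pandas as pd


def encode_like_training(raw_df, categorical_cols, training_columns):
    """One-hot encode raw_df and align its columns to the training layout.

    Dummy columns the model never saw are dropped; training columns
    absent from raw_df are added and filled with 0.
    """
    encoded = pd.get_dummies(raw_df, columns=categorical_cols)
    return encoded.reindex(columns=training_columns, fill_value=0)


# Column layout copied from the "Selected features" output
training_columns = [
    'Return_Ratio', 'Purchase_Frequency', 'CLV',
    'Gender_Male', 'Gender_Other',
    'Promotion_Response_Responded', 'Promotion_Response_Unsubscribed',
    'Email_Opt_In_True',
]

new_customer = pd.DataFrame([{
    'Gender': 'Female', 'Promotion_Response': 'Ignored', 'Email_Opt_In': False,
    'Return_Ratio': 0.7, 'Purchase_Frequency': 0.1, 'CLV': 200,
}])

aligned = encode_like_training(
    new_customer, ['Gender', 'Promotion_Response', 'Email_Opt_In'], training_columns
)
print(aligned.columns.tolist() == training_columns)
print(aligned.iloc[0].tolist())
```

`DataFrame.reindex` handles both missing and extra dummy columns in one call, which avoids the explicit loop when many profiles are scored.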


### **Next Steps (Recommendations for Improvement):**

- **Tune model hyperparameters** (Decision Tree `max_depth`, `min_samples_split`, `class_weight`).
- Consider advanced methods (**Random Forest**, **Gradient Boosting**) for improved accuracy.
- Incorporate additional features or refine existing ones.
- Continuously refine **utility weights** based on actual campaign performance data.
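
As a starting point for the first recommendation, the following sketch runs a small grid search over the tree hyperparameters named above. It uses synthetic data from `make_classification` as a stand-in for the churn features (an assumption, not the real dataset), and the grid values are arbitrary illustrations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (assumption: not the real dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Illustrative grid; the values are not tuned for the churn data
param_grid = {
    'max_depth': [2, 4, 6],
    'min_samples_split': [2, 10, 30],
    'class_weight': [None, 'balanced'],
}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    scoring='f1',  # a utility-based scorer could be substituted here
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.3f}")
```

Limiting `max_depth` matters here because a `DecisionTreeClassifier` with default settings grows until its leaves are pure, which tends to overfit.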