# Ensemble Models: Random Forests and Boosting Lab

## Retail Customer Churn Model

As a junior data scientist at RetailTech Solutions, a retail analytics firm, you've been assigned to help a major e-commerce client predict customer churn. The client has noticed an increasing trend of customers abandoning their platform, despite competitive pricing and a wide product range.

Your team lead has provided you with historical customer data containing various metrics including usage patterns, customer support interactions, and account details. Your task is to build several ensemble models to predict which customers are at risk of churning, and identify the key factors driving customer departures.

The VP of Customer Experience will use your insights to develop targeted retention strategies, so both prediction accuracy and model interpretability are important. You'll apply the ensemble learning techniques you've learned, specifically Random Forest and Boosting, to address this business challenge.

## Modelling Process for Lab:
- Data Exploration
- Data Setup
- Baseline Model (Random Forest)
- Boosting Models
- Hyperparameter Tuning
- Feature Importance

## Data Overview
Data File: ecommerce_customer_data.csv

This dataset contains 15,000 customer records with 14 features and the churn target variable.

The dataset contains the following columns:
- account_age_months: Number of months since customer account creation (numeric)
- avg_orders_per_month: Average number of orders placed monthly (numeric)
- avg_order_value: Average dollar amount spent per order (numeric)
- returns_rate: Proportion of items returned from total orders (numeric, 0-1)
- support_tickets_6m: Number of customer support tickets in last 6 months (integer)
- reviews_submitted: Total number of product reviews submitted (integer)
- website_visits_per_month: Average website visits per month (integer)
- cart_abandonment_rate: Proportion of shopping carts abandoned (numeric, 0-1)
- loyalty_member: Whether customer joined loyalty program (binary: 0=No, 1=Yes)
- payment_failures_12m: Number of payment failures in last 12 months (integer)
- device_type: Primary device used for shopping (ordinal: 1=Mobile, 2=Mixed, 3=Desktop)
- discount_usage_rate: Proportion of orders using discount codes (numeric, 0-1)
- days_since_last_active: Number of days since last website activity (integer)
- satisfaction_score: Customer satisfaction rating (ordinal: 1-10)
- churn: Target variable indicating customer has left (binary: 0=Retained, 1=Churned)


## Part 0: Setup - Import Libraries and Load Data

First, let's import all the necessary libraries and load the dataset.

In [39]:
# CodeGrade step0
# Run this cell without changes

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings('ignore')

# Load the e-commerce customer data
df_ecom = pd.read_csv('ecommerce_customer_data.csv')

## Part 1: Data Exploration

In this first part, you will perform some basic EDA to investigate your data (features and target).
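
As a rough illustration of these two EDA steps, the sketch below uses a small made-up DataFrame (`toy`, with hypothetical columns `visits` and `tickets`) rather than the lab data. It shows how class balance and correlation with a binary target can be inspected in pandas.

```python
import pandas as pd

# Toy data for illustration only; column names and values are made up.
toy = pd.DataFrame({
    "visits":  [30, 25, 4, 2, 28, 3, 27, 1],
    "tickets": [0, 1, 4, 5, 0, 3, 1, 6],
    "churn":   [0, 0, 1, 1, 0, 1, 0, 1],
})

# Class distribution: the index holds the class labels, the values hold counts.
toy_counts = toy["churn"].value_counts()
print(toy_counts)

# Pearson correlation of every column with the target,
# dropping the target's trivial self-correlation of 1.0.
toy_corr = toy.corr()["churn"].drop("churn").sort_values()
print(toy_corr)
```

A strongly unbalanced class distribution is worth noting early, because it affects how accuracy should be read later on.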

In [None]:
# Run this cell without changes
df_ecom.info()

In [None]:
# CodeGrade step1
# Investigate the class distribution of churn via value_counts and visualization
churn_counts = None

# Visualize (use the counts object)
X = None
y = None

# Code for plot provided
sns.barplot(x=X, y=y)
plt.xlabel('Churned')
plt.ylabel('Count')
plt.title("Distribution of Churn");

In [None]:
# CodeGrade step2
# Produce a correlation matrix using pandas to visualize potential important features
# Show correlations with churn (subset the matrix to just the churn column)
correlations = None
correlations

## Part 2: Data Setup

You need to prepare the data for modeling. The data provided has already been processed and cleaned for this lab (categorical variables are encoded). Separate your data into X features and y target, then perform a train-test split.
- Set random_state = 42
- Ensure an 80-20 split (train-test)
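
For reference, here is a minimal sketch of an 80-20 split on synthetic data generated with `make_classification` (a stand-in, not the lab dataset). The `_demo` and `Xd_`/`yd_` names are hypothetical, and `stratify` is an optional extra that the lab does not ask for.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lab data (1,000 rows, 5 features).
X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=0)

# test_size=0.2 gives the 80-20 split; random_state makes it reproducible.
# stratify=y_demo is optional (not required by the lab) and keeps the
# class ratio the same in both subsets.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)
print(Xd_train.shape, Xd_test.shape)  # (800, 5) (200, 5)
```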

In [7]:
# CodeGrade step3
# Separate data into X and y - use all features
X = None
y = None

# Split data using sklearn, follow the standard naming conventions (X_train, X_test etc...)
None

## Part 3: Baseline Random Forest Model
You need to instantiate and train (fit) an untuned random forest classifier and evaluate it using cross-validation. Use the default scoring metric, accuracy.
- Set random_state = 42 inside the model
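
The sketch below shows the general fit, score, and cross-validate pattern on synthetic data (not the lab dataset). The names `forest`, `train_acc`, and `cv_acc` are hypothetical, and `cv` is left at its default, which is assumed here to be 5 folds as in current scikit-learn releases.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

forest = RandomForestClassifier(random_state=42)
forest.fit(X_demo, y_demo)

# .score() on a classifier returns mean accuracy.
train_acc = forest.score(X_demo, y_demo)

# cross_val_score clones the estimator and refits it on each fold,
# returning one accuracy value per fold.
cv_acc = cross_val_score(forest, X_demo, y_demo)
print(f"train={train_acc:.3f}, cv mean={cv_acc.mean():.3f}")
```

A training score well above the cross-validated mean is the usual sign of overfitting, which is why the lab asks for both numbers.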

In [8]:
# CodeGrade step4
# Instantiate and fit the model
rf_model = None

# Get training score
rf_train_score = None

# Cross validation scores (don't average)
rf_cv_scores = None

In [None]:
# Run this cell without changes to display results
print(f"Random Forest Training Score: {rf_train_score:.3f}")
print(f"Random Forest CV Score: {rf_cv_scores.mean():.3f}")

## Part 4: Boosting Models
In this section you will iterate on your modelling approach to investigate the performance of various untuned boosting models.
- Use random_state = 42 for all models
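
One way to compare several untuned models side by side is to loop over them and record the train score and mean CV score for each. The sketch below does this on synthetic data with the two scikit-learn boosters only; `XGBClassifier` exposes the same `fit`/`score` interface and could be added to the dictionary in the same way. The `demo_` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

demo_models = {
    "AdaBoost": AdaBoostClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
}

demo_results = {}
for name, model in demo_models.items():
    model.fit(X_demo, y_demo)
    train_acc = model.score(X_demo, y_demo)
    cv_acc = cross_val_score(model, X_demo, y_demo, scoring="accuracy")
    # Store (train accuracy, mean CV accuracy); the gap hints at overfitting.
    demo_results[name] = (train_acc, cv_acc.mean())
    print(f"{name}: train={train_acc:.3f}, cv={cv_acc.mean():.3f}")
```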

In [12]:
# CodeGrade step5
# Instantiate and fit all models
# Adaboost model
ada_model = None

# Gradient Boosting model
grad_model = None

# XGBoost model
xgb_model = None

# Get training scores
ada_train_score = None
grad_train_score = None
xgb_train_score = None

# Cross validate all models using accuracy (don't average the scores)
ada_cv_scores = None
grad_cv_scores = None
xgb_cv_scores = None

In [None]:
# Run this cell without changes to display results
print(f"Training and Cross Validation Performance Comparison of Boosted Models")
print(f"Adaptive Boosting: Train - {ada_train_score:.3f}, CV - {ada_cv_scores.mean():.3f}")
print(f"Gradient Boosting: Train - {grad_train_score:.3f}, CV - {grad_cv_scores.mean():.3f}")
print(f"Extreme Gradient Boosting: Train - {xgb_train_score:.3f}, CV - {xgb_cv_scores.mean():.3f}")

## Part 5: Hyperparameter Tuning

Based on the results above, select the model with the most room for improvement (the one that is overfitting, with the highest train score) and attempt to optimize it via a targeted grid search. Use the provided hyperparameters and values for your grid.
- 'learning_rate': [0.05, 0.1]
- 'n_estimators': [200, 300]
- 'max_depth': [3, 5]
- 'min_child_weight': [1, 5]
- 'scale_pos_weight': [1, 3]

NOTE: You should expect this grid search to take a minute or two to run
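
To see where the run time comes from, the grid above can be expanded with `ParameterGrid` to count the candidate combinations. The sketch assumes 5-fold cross-validation (the `GridSearchCV` default in current scikit-learn); `lab_grid`, `n_candidates`, and `n_folds` are hypothetical names.

```python
from sklearn.model_selection import ParameterGrid

# The grid listed above: five hyperparameters with two values each.
lab_grid = {
    "learning_rate": [0.05, 0.1],
    "n_estimators": [200, 300],
    "max_depth": [3, 5],
    "min_child_weight": [1, 5],
    "scale_pos_weight": [1, 3],
}

n_candidates = len(ParameterGrid(lab_grid))  # 2**5 combinations
n_folds = 5  # assumption: default cv in GridSearchCV

print(n_candidates, n_candidates * n_folds)  # 32 160
```

With the default `refit=True`, `GridSearchCV` also fits the best combination once more on the full training set, so each value added to the grid multiplies the total cost.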

In [None]:
# CodeGrade step6
# Assign the model object
gs_model = None

# Create Param Grid
param_grid = None

# Instantiate GridSearchCV object
grid_search = None

# Perform the grid search (fit)
None

In [None]:
# Run this cell without changes to display results
print("Best Model Results from Grid Search")
print(f"CV Score: {grid_search.best_score_:.3f}")
print(f"Best Hyperparameters: {grid_search.best_params_}")

## Part 6: Final Model Analysis

For the sake of time, we will stop at one grid search. In practice (especially with advanced boosting models), multiple searches are usually warranted, since this grid search only touches a few of the most important hyperparameters. Treat the best estimator from the grid search as your final model.

In [30]:
# CodeGrade step7
# Extract final model
final_model = None

# Final model training accuracy
final_score_train = None

# Final model testing accuracy
final_score_test = None

# Produce classification report
y_pred_test = final_model.predict(X_test)
cr = classification_report(y_test, y_pred_test)

# Produce confusion matrix
cm = confusion_matrix(y_test, y_pred_test)

In [None]:
# Run this cell without changes to display results
print(f"Final Model Evaluation")
print(f"Accuracy on the Training Data: {final_score_train:.3f}")
print(f"Accuracy on the Testing Data: {final_score_test:.3f}")
print(f"Classification Report")
print(cr)
print(f"Confusion Matrix")
# Named cm_display to avoid shadowing IPython's built-in display()
cm_display = ConfusionMatrixDisplay(cm)
cm_display.plot();

Your boss specifically wanted a model with high accuracy and interpretability, which you have achieved! However, based on the results above and what you know about churn and the business context, what might be a good alternative metric to optimize for?

Select from one of the options below:
- recall
- f1_score
- precision
- roc_auc
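
To make the difference between these metrics concrete, the sketch below scores a set of hypothetical predictions (not results from the lab model) on an imbalanced toy problem with 90 retained and 10 churned customers.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels: 90 retained (0) followed by 10 churned (1).
y_true = [0] * 90 + [1] * 10
# Hypothetical predictions: 1 false alarm among the retained customers,
# and only 2 of the 10 churners caught.
y_hat = [0] * 89 + [1] + [1] * 2 + [0] * 8

acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
# accuracy=0.91 precision=0.67 recall=0.20 f1=0.31
```

Accuracy looks strong here even though most churners are missed. Which metric matters most depends on the relative cost of a missed churner versus an unnecessary retention offer. Note that `roc_auc` is computed from predicted scores or probabilities rather than hard class labels.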

In [None]:
# CodeGrade step8
# Assign name of metric as string
alternative_metric = None

In [None]:
# CodeGrade step9
# Extract feature importance from final model
feature_importance = None

importances = pd.Series(feature_importance, index=X_train.columns)
importances = importances.sort_values(ascending=False)
    
plt.figure(figsize=(10, 6))
importances.plot(kind='bar')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()
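
For context on what the plot above displays: fitted tree ensembles in scikit-learn expose a `feature_importances_` attribute with one value per input column. The sketch below illustrates this on synthetic data with a random forest. The names `forest` and `demo_importances` and the `feat_i` column labels are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with named columns.
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feat_{i}" for i in range(6)])

forest = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# One non-negative importance per column; for scikit-learn tree
# ensembles the values are normalized to sum to 1.
demo_importances = pd.Series(forest.feature_importances_, index=X_demo.columns)
print(demo_importances.sort_values(ascending=False))
```

Impurity-based importances rank features by how much they contribute to the splits, not by the direction of their effect on churn, so they are best read alongside the correlations from Part 1.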