# Lab 10 - Ensemble
---
**Summer 2025 - Instructor: Joyce Yang**

**Adapted from teaching materials by Prof. Chris Volinsky, Fall 2024**

In this notebook we will be learning about Ensembles.

This notebook may contain optional tasks. If you have time, have fun working on them. You won't be penalized if you don't finish an optional task.

**Before we begin, remember to save this notebook IN YOUR OWN GOOGLE DRIVE**.  That way you have your own copy to work on, edit and share.

In [30]:
import pandas as pd
import numpy as np
import random
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
import xgboost as xgb

## Load Data

Let's use the Marketing dataset again to compare the performance of the models we learned today. Download the dataset from [here](https://drive.google.com/file/d/1pjwlHE0_1PxQd5PQTcbLM3Ti2nejVlf-/view?usp=drive_link).

Each record represents an individual who was targeted with a direct marketing offer.  The offer was a solicitation to make a charitable donation.

The columns (features) are:


| Feature | Description |
|------------|-----------------------|
| Income | household income |
| Firstdate | date associated with the first gift by this individual |
| Lastdate | date associated with the most recent gift |
| Amount | average amount given by this individual over all periods (incl. zeros) |
| rfaf2 | frequency code |
| rfaa2 | donation amount code |
| pepstrfl | flag indicating a star donor |
| glast | amount of last gift |
| gavr | amount of average gift |


The target variable is `class`, which equals one if the individual gave in this campaign and zero otherwise.

In [2]:
from google.colab import files
uploaded = files.upload()

Saving DirectMarketing (1).csv to DirectMarketing (1).csv


In [4]:
df = pd.read_csv('DirectMarketing (1).csv')

In [13]:
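# Drop records without a first-gift date
df = df.loc[df.Firstdate != 0]
# Work on a copy so the raw frame is not modified when this cell is re-run
df_clean = df.copy()
# Log-transform the skewed gift amounts
df_clean.loc[:, 'gavr'] = np.log(df.gavr+1)
df_clean.loc[:, 'glast'] = np.log(df.glast+1)
# Treat the income and frequency codes as categorical
income_cat = pd.Categorical(df['Income'], categories=[0,1,2,3,4,5,6,7])
df_clean.loc[:,'Income'] = income_cat
rfaf2_cat = pd.Categorical(df['rfaf2'], categories=[1,2,3,4])
df_clean.loc[:,'rfaf2'] = rfaf2_cat
# One-hot encode the categorical columns, dropping the first level of each
df_clean = pd.get_dummies(df_clean, columns=['rfaa2', 'pepstrfl','Income','rfaf2'],drop_first=True)
# tenure: time between first and last gift; recency: time since the last gift
df_clean.loc[:,'tenure'] = df_clean.loc[:,'Lastdate'] - df_clean.loc[:,'Firstdate']
today = df_clean['Lastdate'].max()
df_clean.loc[:,'recency'] = today - df_clean.loc[:,'Lastdate']
df_clean = df_clean.drop(['Firstdate', 'Lastdate'], axis=1)
df_clean.head()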

Unnamed: 0,Amount,glast,gavr,class,rfaa2_E,rfaa2_F,rfaa2_G,pepstrfl_X,Income_1,Income_2,Income_3,Income_4,Income_5,Income_6,Income_7,rfaf2_2,rfaf2_3,rfaf2_4,tenure,recency
0,0.06,1.595709,1.489299,0,False,False,True,False,False,False,True,False,False,False,False,False,False,False,100,193
1,0.16,1.397363,1.403735,1,False,False,True,True,False,True,False,False,False,False,False,False,False,True,401,100
2,0.2,1.026672,1.18701,0,True,False,False,False,False,False,False,False,False,False,False,False,False,True,93,99
3,0.13,1.448822,1.424794,0,False,False,True,False,False,False,False,False,False,True,False,True,False,False,194,99
4,0.1,1.448822,1.281681,0,False,False,True,False,False,False,False,False,False,False,False,False,False,False,201,191


In [22]:
X = df_clean.drop(['class'], axis=1)
Y = df_clean['class']
print(df.columns)

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.2, random_state=42)

Index(['Income', 'Amount', 'rfaf2', 'rfaa2', 'pepstrfl', 'glast', 'gavr',
       'class', 'first_year', 'last_year', 'donation_recency'],
      dtype='object')
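
Because the positive class is rare in this dataset, one option worth knowing is a stratified split: passing `stratify=` to `train_test_split` keeps the class ratio nearly the same in the train and test sets. The split above does not use it. The sketch below runs on synthetic data with roughly 5% positives; `X_demo` and `y_demo` are made-up names for illustration, not part of the lab.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an imbalanced target (about 5% positives); not the lab data.
rng = np.random.default_rng(42)
y_demo = (rng.random(10_000) < 0.05).astype(int)
X_demo = rng.normal(size=(10_000, 3))

# stratify=y_demo asks for the same class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"overall positive rate: {y_demo.mean():.4f}")
print(f"train positive rate:   {y_tr.mean():.4f}")
print(f"test positive rate:    {y_te.mean():.4f}")
```

With a large dataset, a plain random split usually lands close to the same ratio anyway, so this matters most when positives are very scarce.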


## RandomForest

### Task 1
Train a Random Forest model and name it `rf_model`. Remember to set `class_weight='balanced'` because the data is imbalanced. Experiment with `max_depth` until your AUC is higher than 0.55.

In [27]:
# --- Scale Features ---
# (Optional for tree-based models, which are insensitive to monotonic feature scaling.)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# --- Random Forest Model with Class Weight Balanced ---
rf_model = RandomForestClassifier(class_weight='balanced', random_state=42)

# --- Grid Search for Best max_depth ---
param_grid_rf = {'max_depth': [5, 10, 15, 20, 25, None],
                 'n_estimators': [100]}  # You can tune n_estimators more if needed

gs_rf = GridSearchCV(rf_model, param_grid_rf, cv=3, scoring='roc_auc', n_jobs=-1)
gs_rf.fit(X_train_scaled, Y_train)

# --- Evaluation ---
best_depth = gs_rf.best_params_['max_depth']
print(f"Best Depth: {best_depth}")

pred_rf = gs_rf.predict(X_test_scaled)
proba_rf = gs_rf.predict_proba(X_test_scaled)[:, 1]

rf_acc = accuracy_score(Y_test, pred_rf)
rf_auc = roc_auc_score(Y_test, proba_rf)
print(f"Random Forest | Accuracy: {rf_acc:.3f}, ROC AUC: {rf_auc:.3f}")
print(classification_report(Y_test, pred_rf))

Best Depth: 5
Random Forest | Accuracy: 0.596, ROC AUC: 0.616
              precision    recall  f1-score   support

           0       0.96      0.60      0.74     36409
           1       0.07      0.57      0.12      1946

    accuracy                           0.60     38355
   macro avg       0.52      0.58      0.43     38355
weighted avg       0.92      0.60      0.71     38355
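
To make the `class_weight='balanced'` setting from Task 1 more concrete, here is a minimal sketch of the weights it implies. scikit-learn documents the heuristic as `n_samples / (n_classes * np.bincount(y))`. The toy labels in `y_demo` are invented for illustration and are not the lab data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 95 negatives and 5 positives (illustrative only).
y_demo = np.array([0] * 95 + [1] * 5)

# The "balanced" heuristic computed by hand: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (2 * np.bincount(y_demo))

# The same weights from scikit-learn's helper.
auto = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)

print("manual:", manual)
print("helper:", auto)
```

The rare class receives the larger weight, so misclassifying a positive costs the model more during training. That is consistent with the report above, where recall on class 1 is fairly high while precision is low.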



## XGBoost

### Task 2

Train an XGBoost model and name it `xgb_model`. Set `scale_pos_weight = sum(negative instances) / sum(positive instances)` to balance the classes. Make sure your AUC is larger than 0.55.

In [28]:
import xgboost as xgb

In [34]:
# --- Compute scale_pos_weight ---
neg_count = (Y_train == 0).sum()
pos_count = (Y_train == 1).sum()
scale_pos_weight = neg_count / pos_count
print(f"scale_pos_weight: {scale_pos_weight:.2f}")

# --- XGBoost Model ---
# Note: recent XGBoost releases ignore `use_label_encoder` and warn that it is unused.
xgb_model = xgb.XGBClassifier(scale_pos_weight=scale_pos_weight,
                              use_label_encoder=False,
                              eval_metric='logloss',
                              random_state=42)

# --- Grid Search for Best max_depth ---
param_grid_xgb = {'max_depth': [3, 5, 7, 10],
                  'n_estimators': [100, 200]}

gs_xgb = GridSearchCV(xgb_model, param_grid_xgb, cv=3, scoring='roc_auc', n_jobs=-1)
gs_xgb.fit(X_train_scaled, Y_train)

# --- Evaluation ---
best_params = gs_xgb.best_params_
print(f"Best Params: {best_params}")

pred_xgb = gs_xgb.predict(X_test_scaled)
proba_xgb = gs_xgb.predict_proba(X_test_scaled)[:, 1]

xgb_acc = accuracy_score(Y_test, pred_xgb)
xgb_auc = roc_auc_score(Y_test, proba_xgb)
print(f"XGBoost | Accuracy: {xgb_acc:.3f}, ROC AUC: {xgb_auc:.3f}")
print(classification_report(Y_test, pred_xgb))

scale_pos_weight: 18.75


Parameters: { "use_label_encoder" } are not used.

  bst.update(dtrain, iteration=i, fobj=obj)


Best Params: {'max_depth': 3, 'n_estimators': 100}
XGBoost | Accuracy: 0.621, ROC AUC: 0.617
              precision    recall  f1-score   support

           0       0.96      0.62      0.76     36409
           1       0.07      0.56      0.13      1946

    accuracy                           0.62     38355
   macro avg       0.52      0.59      0.44     38355
weighted avg       0.92      0.62      0.73     38355
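
The `scale_pos_weight` ratio from Task 2 can be related to the balanced class weights used for the Random Forest: it equals the positive-class weight divided by the negative-class weight. The sketch below checks this on a toy target. `y_demo` is a made-up series whose counts were chosen so the ratio comes out to 18.75; it is not the lab data.

```python
import pandas as pd

# Toy target: 75 negatives and 4 positives (illustrative only).
y_demo = pd.Series([0] * 75 + [1] * 4)

neg_count = int((y_demo == 0).sum())
pos_count = int((y_demo == 1).sum())

# The ratio suggested in the task description.
ratio = neg_count / pos_count
print(f"scale_pos_weight: {ratio:.2f}")

# Balanced class weights: n_samples / (n_classes * class_count)
n = len(y_demo)
w_neg = n / (2 * neg_count)
w_pos = n / (2 * pos_count)
print(f"w_pos / w_neg:    {w_pos / w_neg:.2f}")
```

Both approaches therefore express the same relative emphasis on the rare class, just through different parameters.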



## Task 3

Which model is better?
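
One way to approach this question is to lay the test metrics side by side. The sketch below copies the numbers from the printed outputs of Task 1 and Task 2; the column names (`accuracy`, `roc_auc`, and so on) are chosen for this illustration only.

```python
import pandas as pd

# Metrics copied from the printed outputs of Task 1 and Task 2.
results = pd.DataFrame(
    {
        "accuracy": [0.596, 0.621],
        "roc_auc": [0.616, 0.617],
        "recall_class_1": [0.57, 0.56],
        "precision_class_1": [0.07, 0.07],
    },
    index=["Random Forest", "XGBoost"],
)
print(results)

# Difference in ROC AUC between the two models on this test split.
auc_gap = results.loc["XGBoost", "roc_auc"] - results.loc["Random Forest", "roc_auc"]
print(f"AUC gap (XGBoost - Random Forest): {auc_gap:.3f}")
```

On this single split the two models are nearly tied on AUC, and a gap this small may not hold up under a different random seed or split. Which model counts as "better" then depends on what matters more for the campaign, for example recall on donors versus overall accuracy.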