# **Ensemble Learning in Action**

**Objective**

Build, evaluate, and compare ensemble models while demonstrating an understanding of model mechanics, trade-offs, and business implications.

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.inspection import permutation_importance
from sklearn.metrics import (accuracy_score, roc_auc_score, f1_score,
                             precision_score, recall_score, confusion_matrix,
                             log_loss, RocCurveDisplay, PrecisionRecallDisplay,
                             DetCurveDisplay, ConfusionMatrixDisplay, brier_score_loss)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, RandomForestClassifier, VotingClassifier

### Libraries

**Model Selection & Validation**
- cross_val_score: Performs cross-validation to evaluate model performance
- StratifiedKFold: Splits the data into folds that preserve the class proportions of the target

**Data Preprocessing**
- ColumnTransformer: Applies different preprocessing to different columns
- OneHotEncoder: Encodes each categorical feature as a set of binary indicator columns
- StandardScaler: Standardizes numeric features to zero mean and unit variance
- SimpleImputer: Imputes missing values
- Pipeline: Chains preprocessing and modelling steps together

**Model Analysis**
- permutation_importance: Determines feature importance by shuffling features

**Evaluation Metrics**
- accuracy_score
- f1_score
- precision_score
- recall_score
- roc_auc_score
- confusion_matrix
- log_loss
- brier_score_loss

**Classification Algorithms**
- LogisticRegression
- KNeighborsClassifier
- BaggingClassifier
- AdaBoostClassifier
- RandomForestClassifier
- VotingClassifier
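
To make the roles of the preprocessing and validation utilities listed above concrete, here is a minimal sketch of how they can fit together. The toy data, the column choices, and the use of `LogisticRegression` with ROC AUC scoring are assumptions for illustration only, not the configuration used later in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real training set
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "age": rng.integers(18, 90, size=n),
    "balance": rng.normal(1000, 500, size=n),
    "job": rng.choice(["admin.", "technician", "unknown"], size=n),
})
toy_y = (rng.random(n) < 0.2).astype(int)  # imbalanced binary target

# Scale the numeric columns and one-hot encode the categorical column
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "balance"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job"]),
])

# Chain preprocessing and the classifier so both are refit inside every fold
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Stratified folds keep the class proportions roughly equal across folds
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, toy, toy_y, cv=cv, scoring="roc_auc")
print(f"Mean ROC AUC over {len(scores)} folds: {scores.mean():.3f}")
```

Wrapping the preprocessing in the pipeline means the scaler and encoder are fitted on the training folds only, which avoids leaking information from the validation fold.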


In [2]:
# loading data sets
df_train = pd.read_csv('train.csv', delimiter = ';')
df_test = pd.read_csv('test.csv', delimiter = ';')

In [3]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [4]:
print(f"Training Set NaNs\n{df_train.isnull().sum()}")
print(f"Test Set NaNs\n{df_test.isnull().sum()}") # checking for NaN and nulls

Training Set NaNs
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64
Test Set NaNs
age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64


In [5]:
# checking for 'unknown' values in categorical columns in training set
for col in df_train.columns:
    if df_train[col].dtype == 'object':
        unknown_count = (df_train[col] == 'unknown').sum()
        print(f"{col}: {unknown_count}")

    

job: 288
marital: 0
education: 1857
default: 0
housing: 0
loan: 0
contact: 13020
month: 0
poutcome: 36959
y: 0


In [6]:
# checking for 'unknown' values in categorical columns in test set
for col in df_test.columns:
    if df_test[col].dtype == 'object':
        unknown_count = (df_test[col] == 'unknown').sum()
        print(f"{col}: {unknown_count}")

job: 38
marital: 0
education: 187
default: 0
housing: 0
loan: 0
contact: 1324
month: 0
poutcome: 3705
y: 0
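
The two loops above repeat the same logic for each data set. As a possible alternative, the check can be written as one vectorized helper. This is a sketch on a hypothetical toy frame; `count_unknowns` is an illustrative name, and it selects all non-numeric columns rather than testing for the `object` dtype explicitly.

```python
import pandas as pd

def count_unknowns(df: pd.DataFrame, token: str = "unknown") -> pd.Series:
    """Count cells equal to `token` in each non-numeric column."""
    return df.select_dtypes(exclude="number").eq(token).sum()

# Hypothetical toy frame standing in for df_train / df_test
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "services"],
    "contact": ["unknown", "unknown", "cellular"],
    "age": [30, 41, 52],
})

unknowns = count_unknowns(toy)
print(unknowns)
```

The same helper could then be called once per data set, for example `count_unknowns(df_train)` and `count_unknowns(df_test)`.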


When I first explored the class distribution of the output variable, I found that it is stored as an object column with two values, 'yes' and 'no', which indicate whether a client subscribed to a term deposit. To simplify the analysis, I defined a dictionary that maps 'no' to 0 and 'yes' to 1 and applied it to both the training and test sets. I then cast the output variable from object to integer.

In [7]:
# exploring class distribution of dependent variable
dic = {'no': 0, 'yes': 1} # creating binary dictionary for response variable

df_train['y'] = df_train['y'].map(dic)
df_test['y'] = df_test['y'].map(dic)

df_train['y'] = df_train['y'].astype(int) 
df_test['y'] = df_test['y'].astype(int) # changing the output variable from an object to an integer

In [8]:
print(f"Mean of training set response variable classes {df_train['y'].mean()}")

Mean of training set response variable classes 0.11698480458295547


In [9]:
print(f"Mean of test set response variable classes {df_test['y'].mean()}")

Mean of test set response variable classes 0.11523999115239991


The means of the response variable in the training and test sets (about 0.117 and 0.115) show that the majority class is 'no': most clients do not subscribe to a term deposit.

In [10]:
# defining majority and minority classes for both data sets for exploration
train_majority = df_train[df_train['y'] == 0]
test_majority = df_test[df_test['y'] == 0]

train_minority = df_train[df_train['y'] == 1]
test_minority = df_test[df_test['y'] == 1]

In [11]:
print(f"Training set majority class dimensions {train_majority.shape}")
print(f"Training set minority class dimensions {train_minority.shape}")

Training set majority class dimensions (39922, 17)
Training set minority class dimensions (5289, 17)


In [12]:
# share of all rows, not the minority-to-majority ratio
print(f"The minority class of the training set makes up {len(train_minority) / len(df_train) * 100:.2f}% of the data")

The minority class of the training set makes up 11.70% of the data


In [13]:
print(f"Test set majority class dimensions {test_majority.shape}")
print(f"Test set minority class dimensions {test_minority.shape}")

Test set majority class dimensions (4000, 17)
Test set minority class dimensions (521, 17)


In [14]:
# share of all rows, not the minority-to-majority ratio
print(f"The minority class of the test set makes up {len(test_minority) / len(df_test) * 100:.2f}% of the data")

The minority class of the test set makes up 11.52% of the data


There is a large class imbalance between clients who subscribe to a term deposit and those who do not: the minority class accounts for only about 12% of each data set.
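
One practical consequence of this imbalance can be worked out from the class counts printed above. The sketch below considers a hypothetical classifier that always predicts 'no'; it is an illustration of why accuracy alone can mislead here, not a model used in this notebook.

```python
# Class counts taken from the training-set shapes printed above
n_majority = 39922  # clients who did not subscribe (y == 0)
n_minority = 5289   # clients who subscribed (y == 1)
n_total = n_majority + n_minority

# A hypothetical "always predict no" classifier is correct on every majority row
baseline_accuracy = n_majority / n_total

# It never predicts the positive class, so it finds none of the subscribers
baseline_recall = 0 / n_minority

print(f"Baseline accuracy: {baseline_accuracy:.4f}")  # prints 0.8830
print(f"Baseline recall:   {baseline_recall:.4f}")    # prints 0.0000
```

A model therefore has to beat roughly 88% accuracy to add any value, which is one reason to track recall, precision, F1, and ROC AUC alongside accuracy.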

Class imbalance means that one output category is underrepresented in a data set. Class-imbalanced data sets are far more common than class-balanced ones. The goal of training is to create a model that successfully distinguishes the positive class from the negative class, but a severely class-imbalanced data set might not contain enough minority class examples for proper training. During training, a model should learn two things: what each class looks like (which feature values correspond to which class) and how common each class is (the relative distribution of the classes). Both can be addressed with a two-step technique: downsampling and upweighting the majority class.

**1.** Downsampling the majority class means training on a disproportionately low percentage of majority class observations. I artificially force a class-imbalanced data set to become somewhat more balanced by omitting majority class examples from training. This increases the likelihood that each batch contains enough observations of the minority class to train the model properly. However, downsampling introduces a prediction bias, because the model is shown an artificial world in which the classes are more balanced than they really are.

**2.** Upweighting the majority class means weighting each majority class observation by the factor by which the class was downsampled. The loss on a majority class observation is multiplied by that factor, so it is treated more harshly than the loss on a minority class observation. This corrects the prediction bias introduced by downsampling.

The rebalancing factor is itself a hyperparameter, so experiment with different values to find one that works for the data set. A bonus of this method is faster convergence, because the model sees the minority class more often during training.

https://developers.google.com/machine-learning/crash-course/overfitting/imbalanced-datasets 
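
The two steps described above can be sketched as follows. This is a minimal illustration on hypothetical toy data: the factor of 4, the single feature `x`, and the use of `LogisticRegression` with `sample_weight` are all assumptions, not the settings chosen for this project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with roughly 10% positives
rng = np.random.default_rng(42)
n = 2000
y = (rng.random(n) < 0.1).astype(int)
toy = pd.DataFrame({"x": rng.normal(loc=y, scale=1.0), "y": y})

factor = 4  # assumed downsampling factor; treat it as a hyperparameter

majority = toy[toy["y"] == 0]
minority = toy[toy["y"] == 1]

# Step 1: downsample, keeping about 1/factor of the majority class rows
majority_down = majority.sample(frac=1 / factor, random_state=42)
balanced = pd.concat([majority_down, minority]).sample(frac=1, random_state=42)

# Step 2: upweight, so each kept majority row counts `factor` times in the loss
weights = np.where(balanced["y"] == 0, float(factor), 1.0)

clf = LogisticRegression()
clf.fit(balanced[["x"]], balanced["y"], sample_weight=weights)

print(f"Positive share before downsampling: {toy['y'].mean():.3f}")
print(f"Positive share after downsampling:  {balanced['y'].mean():.3f}")
```

The model trains on a smaller, more balanced sample, while the weights restore the original class proportions in the loss so that predicted probabilities are not systematically inflated.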

**Exploratory Data Analysis** 
- Analyze the data set to summarize its main characteristics (use visuals, statistical models)
- I want to see what the data can tell me beyond the formal modelling or hypothesis testing task

**Ask myself:** What patterns, anomalies, or relationships exist here that I might not anticipate?
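
As a starting point for this kind of exploration, the sketch below summarizes the numeric columns and compares the subscription rate across the levels of one categorical feature. The toy rows are hypothetical stand-ins for `df_train`, and grouping by `contact` is just one assumed example of a feature worth inspecting.

```python
import pandas as pd

# Hypothetical toy rows with the same column names as the real data
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 33, 61],
    "contact": ["cellular", "unknown", "cellular",
                "telephone", "unknown", "cellular"],
    "y": [1, 0, 1, 0, 0, 1],
})

# Summary statistics for the numeric columns
numeric_summary = toy.describe()

# Subscription rate and group size for each level of a categorical feature
rate_by_contact = (
    toy.groupby("contact")["y"]
       .agg(subscription_rate="mean", n_clients="size")
       .sort_values("subscription_rate", ascending=False)
)

print(numeric_summary)
print(rate_by_contact)
```

Reporting the group size next to each rate helps separate real patterns from rates that are driven by only a handful of clients.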