# **Introduction**

This notebook explores various techniques to handle class imbalance in the Titanic dataset, specifically for predicting passenger survival. Survival rates in the Titanic dataset are not equally distributed, leading to potential biases in classification models.
Because the labels are imbalanced, accuracy alone can be misleading, so performance is evaluated with **precision**, **recall**, and **F1-score**.
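
To make the choice of metrics concrete, the following minimal sketch (not part of the original notebook) shows how accuracy can look good while the minority class is ignored entirely. The helper `binary_metrics` and the toy counts are illustrative assumptions.

```python
def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / total, "precision": precision,
            "recall": recall, "f1": f1}

# Toy case: 90 negatives, 10 positives, and a model that always predicts 0.
always_negative = binary_metrics(tp=0, fp=0, fn=10, tn=90)
print(always_negative)
```

In this toy case accuracy is 90% although recall and F1-score for the positive class are both zero.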

# **Import libraries**

In [40]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, accuracy_score, classification_report
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

In [41]:
np.random.seed(42)

# **Import dataset**

In [42]:
# Connect to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [43]:
train_df = pd.read_csv('/content/drive/MyDrive/Basic ML/datasets/cleaned-data/data_v2/train.csv', index_col='PassengerId')
val_df = pd.read_csv('/content/drive/MyDrive/Basic ML/datasets/cleaned-data/data_v2/val.csv', index_col='PassengerId')
test_df = pd.read_csv('/content/drive/MyDrive/Basic ML/datasets/cleaned-data/data_v2/test.csv', index_col='PassengerId')

train_df.head()

PassengerId,Age,Fare,FamilySize,Pclass,Sex,Embarked,Title_Name,Survived
693,25.0,56.4958,1,3,male,S,Mr,1
482,30.0,0.0,1,2,male,S,Mr,0
528,40.0,221.7792,1,1,male,S,Mr,0
856,18.0,9.35,2,3,female,S,Mrs,1
802,31.0,26.25,3,2,female,S,Mrs,1


In [44]:
numerical_features = ['Age', 'Fare', 'FamilySize']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title_Name']

features = numerical_features + categorical_features
label = "Survived"

# Extract features and Label
X_train = train_df[features]
y_train = train_df[label]

X_val = val_df[features]
y_val = val_df[label]

X_test = test_df[features]

In [45]:
# Create a Preprocessing Pipeline
numerical_transformer = StandardScaler()
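# Ignore categories that appear only at prediction time instead of raising an error
categorical_transformer = OneHotEncoder(handle_unknown='ignore')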

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# **Base Logistic Regression model**

Build a Logistic Regression model using the hyperparameters found in file `LogisticRegression.ipynb`.

**Logistic Regression** is used as the base model because it performed well relative to the other models considered (see the comparisons in `compare_models.ipynb` and `baseline_models_comparison.ipynb`).

In [46]:
# Define and Train the Logistic Regression Model

base_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear'))
])

base_model.fit(X_train, y_train)

In [47]:
base_model_preds = base_model.predict(X_val)
print(classification_report(y_val, base_model_preds))

              precision    recall  f1-score   support

           0       0.86      0.90      0.88       110
           1       0.83      0.77      0.80        69

    accuracy                           0.85       179
   macro avg       0.84      0.83      0.84       179
weighted avg       0.85      0.85      0.85       179



In [48]:
# Create Dataframe to save the results for comparison.
result_df = pd.DataFrame(columns=['Model', 'Classification report'])
result_df.loc[len(result_df)] = ['Base Model', classification_report(y_val, base_model_preds)]

# **Class Weighting**

Assign higher misclassification costs to the minority class, making the model pay more attention to it.

$$
W_j = \frac{n_{\text{samples}}}{n_{\text{classes}} \times n_{\text{samples},j}}
$$

- $W_j$: weight of class $j$
- $n_{\text{samples}}$: total number of samples (rows) in the dataset
- $n_{\text{classes}}$: total number of unique classes in the target
- $n_{\text{samples},j}$: total number of rows belonging to class $j$
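
As a worked instance of the formula, using the training-set class counts reported in this notebook (439 samples of class 0 and 273 of class 1, so 712 in total and 2 classes):

$$
W_0 = \frac{712}{2 \times 439} \approx 0.811, \qquad W_1 = \frac{712}{2 \times 273} \approx 1.304
$$

The minority class (1) therefore receives roughly 1.6 times the weight of the majority class (0), which is the inverse of the class ratio 439:273.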


In [49]:
y_train.value_counts()

Survived,count
0,439
1,273


In [50]:
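def compute_weight(target_variable: pd.Series) -> dict:
  """Return balanced class weights: n_samples / (n_classes * n_samples_j)."""
  # Count the classes in the argument itself rather than in the global y_train
  unique_classes, counts = np.unique(target_variable, return_counts=True)
  n_samples = len(target_variable)
  n_classes = len(unique_classes)
  weights = dict()
  for i in range(n_classes):
    weights[unique_classes[i]] = n_samples / (n_classes * counts[i])
  return weights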

In [51]:
# Train logistic regression with class weights

class_weight_dict = compute_weight(y_train)
print(f"Weight for class 0: {class_weight_dict[0]}")
print(f"Weight for class 1: {class_weight_dict[1]}")

cw_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(C=4.893900918477489, class_weight=class_weight_dict, max_iter=100, solver='liblinear'))
])

cw_model.fit(X_train, y_train)
cw_model_preds = cw_model.predict(X_val)

print(classification_report(y_val, cw_model_preds))

result_df.loc[len(result_df)] = ['Class Weight Model', classification_report(y_val, cw_model_preds)]

Weight for class 0: 0.8109339407744874
Weight for class 1: 1.304029304029304
              precision    recall  f1-score   support

           0       0.89      0.85      0.87       110
           1       0.77      0.83      0.80        69

    accuracy                           0.84       179
   macro avg       0.83      0.84      0.83       179
weighted avg       0.84      0.84      0.84       179



Additionally, we can use the scikit-learn library to calculate weights for classes.

In [52]:
from sklearn.utils.class_weight import compute_class_weight

# Compute balanced class weights with scikit-learn
class_weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)

print(class_weights)
print(f"Weight for class 0: {class_weights[0]}")
print(f"Weight for class 1: {class_weights[1]}")

[0.81093394 1.3040293 ]
Weight for class 0: 0.8109339407744874
Weight for class 1: 1.304029304029304


# **Undersampling**
Reduce the number of observations in the majority class, typically until it matches the number of observations in the minority class.

## **Random Undersampling**

Randomly removes samples from the majority class.

In [53]:
y_train.value_counts(normalize=True)

Survived,proportion
0,0.616573
1,0.383427


In [54]:
y_train.value_counts()

Survived,count
0,439
1,273


In [55]:
print(439 - 273)

166


The ratio between ***class 0*** and ***class 1*** is about **62:38**. We resample the train set so that this ratio becomes **50:50**, which means randomly dropping **166** (439 - 273) samples belonging to class 0.
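
The resampling described above can also be sketched without hard-coding the class sizes. The helper `random_undersample_indices` below is a hypothetical illustration, not part of the notebook; it uses toy labels with the same class counts as the train set.

```python
import numpy as np

def random_undersample_indices(y, seed: int = 42) -> np.ndarray:
    """Return shuffled row positions with every class cut to the minority size."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = int(counts.min())  # size of the smallest class
    kept = [rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
            for c in classes]
    idx = np.concatenate(kept)
    rng.shuffle(idx)  # mix the classes so rows are not grouped by label
    return idx

# Toy labels with the same class counts as the train set (439 vs 273).
y_toy = np.array([0] * 439 + [1] * 273)
idx = random_undersample_indices(y_toy)
print(np.bincount(y_toy[idx]))
```

The returned positions could then be passed to `DataFrame.iloc`, in the same way the cells below use `idx_train`.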

In [56]:
idx_pos = np.where(train_df['Survived'].values.reshape(-1) == 1)[0]
idx_neg = np.where(train_df['Survived'].values.reshape(-1) == 0)[0]

In [57]:
np.random.shuffle(idx_pos)
np.random.shuffle(idx_neg)

In [58]:
# train set
idx_train_pos = idx_pos[:273]
idx_train_neg = idx_neg[:273]
idx_train = np.concatenate([idx_train_pos, idx_train_neg])
np.random.shuffle(idx_train)
us_train_df = train_df.iloc[idx_train]

print(f"train set shape: {us_train_df.shape}")

train set shape: (546, 8)


In [59]:
us_X_train = us_train_df[features]
us_y_train = us_train_df[label]

us_model = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear'))
])

us_model.fit(us_X_train, us_y_train)
us_model_preds = us_model.predict(X_val)

print(classification_report(y_val, us_model_preds))

              precision    recall  f1-score   support

           0       0.90      0.86      0.88       110
           1       0.79      0.84      0.82        69

    accuracy                           0.85       179
   macro avg       0.85      0.85      0.85       179
weighted avg       0.86      0.85      0.86       179



In [60]:
result_df.loc[len(result_df)] = ['UnderSampling (Random) Model', classification_report(y_val, us_model_preds)]

## **NearMiss**

Keeps only the majority samples that are closest to the minority class and discards the rest.
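
A minimal NumPy sketch of the selection rule may help. It assumes the "version 1" heuristic (keep the majority rows with the smallest mean distance to their `k` nearest minority rows), which is believed to be the `imblearn` default; the function `nearmiss1_indices` and the toy data are hypothetical and this is not the library implementation.

```python
import numpy as np

def nearmiss1_indices(X, y, majority=0, minority=1, k=3):
    """Keep majority rows with the smallest mean distance to their k nearest minority rows."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y == minority)
    # Pairwise Euclidean distances: majority rows x minority rows.
    d = np.linalg.norm(X[maj_idx][:, None, :] - X[min_idx][None, :, :], axis=2)
    # Mean distance to the k closest minority points for each majority row.
    mean_k = np.sort(d, axis=1)[:, :k].mean(axis=1)
    keep_maj = maj_idx[np.argsort(mean_k)[:len(min_idx)]]
    return np.sort(np.concatenate([keep_maj, min_idx]))

# Toy 1-D data: five majority points and two minority points (at 3 and 4).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [20.0], [3.0], [4.0]])
y_toy = np.array([0, 0, 0, 0, 0, 1, 1])
kept = nearmiss1_indices(X_toy, y_toy, k=2)
print(kept)
```

Because the rule is distance-based, the scale of each feature affects which rows are kept.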

In [61]:
y_train.value_counts()

Survived,count
0,439
1,273


In [62]:
!pip install imbalanced-learn



In [63]:
from imblearn.under_sampling import NearMiss

train_dummy = pd.get_dummies(X_train, drop_first=True)
val_dummy = pd.get_dummies(X_val, drop_first=True)

nm = NearMiss()

X_train_nm, y_train_nm = nm.fit_resample(train_dummy, y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_train_nm shape: {X_train_nm.shape}")

X_train shape: (712, 7)
X_train_nm shape: (546, 11)


In [64]:
X_train_nm.head()

Unnamed: 0,Age,Fare,FamilySize,Pclass,Sex_male,Embarked_Q,Embarked_S,Title_Name_Miss,Title_Name_Mr,Title_Name_Mrs,Title_Name_Others
0,21.5,7.75,1,3,False,True,False,True,False,False,False
1,32.0,7.925,1,3,True,False,True,False,True,False,False
2,32.0,7.925,1,3,True,False,True,False,True,False,False
3,32.0,7.8958,1,3,True,False,True,False,True,False,False
4,21.5,8.1375,1,3,False,True,False,True,False,False,False


In [65]:
nm_scaler = StandardScaler()
X_train_nm[numerical_features] = nm_scaler.fit_transform(X_train_nm[numerical_features])
val_dummy[numerical_features] = nm_scaler.transform(val_dummy[numerical_features])

nm_model = LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear')

nm_model.fit(X_train_nm, y_train_nm)
nm_model_preds = nm_model.predict(val_dummy)

print(classification_report(y_val, nm_model_preds))

              precision    recall  f1-score   support

           0       0.89      0.68      0.77       110
           1       0.63      0.87      0.73        69

    accuracy                           0.75       179
   macro avg       0.76      0.78      0.75       179
weighted avg       0.79      0.75      0.76       179



In [66]:
result_df.loc[len(result_df)] = ['UnderSampling (NearMiss) Model', classification_report(y_val, nm_model_preds)]

## **Tomek Links**

Removes majority class instances that form Tomek links, i.e. pairs of samples from opposite classes that are each other's nearest neighbors.
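
The definition can be illustrated with a small NumPy sketch. The function `tomek_link_majority_indices` and the toy data are hypothetical; this is a brute-force illustration of the idea, not the `imblearn` implementation.

```python
import numpy as np

def tomek_link_majority_indices(X, y, majority=0):
    """Return positions of majority rows that belong to a Tomek link."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)      # a point is not its own neighbor
    nn = d.argmin(axis=1)            # nearest neighbor of every row
    to_remove = []
    for i, j in enumerate(nn):
        mutual = nn[j] == i          # i and j are each other's nearest neighbor
        if mutual and y[i] != y[j] and y[i] == majority:
            to_remove.append(i)
    return np.array(to_remove, dtype=int)

# Toy 1-D data: rows 1 (class 0) and 2 (class 1) are mutual nearest neighbors.
X_toy = np.array([[0.0], [1.0], [1.2], [5.0], [5.5]])
y_toy = np.array([0, 0, 1, 1, 1])
removed = tomek_link_majority_indices(X_toy, y_toy)
print(removed)
```

Only the majority member of each link is dropped, so the classes usually remain imbalanced afterwards (658 rows are kept out of 712 in the cell below).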

In [67]:
from imblearn.under_sampling import TomekLinks

train_dummy_tl = pd.get_dummies(X_train, drop_first=True)
val_dummy_tl = pd.get_dummies(X_val, drop_first=True)

tl = TomekLinks()

X_train_tl, y_train_tl = tl.fit_resample(train_dummy_tl, y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_train_tl shape: {X_train_tl.shape}")

X_train shape: (712, 7)
X_train_tl shape: (658, 11)


In [68]:
tl_scaler = StandardScaler()
X_train_tl[numerical_features] = tl_scaler.fit_transform(X_train_tl[numerical_features])
val_dummy_tl[numerical_features] = tl_scaler.transform(val_dummy_tl[numerical_features])

tl_model = LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear')

tl_model.fit(X_train_tl, y_train_tl)
tl_model_preds = tl_model.predict(val_dummy_tl)

print(classification_report(y_val, tl_model_preds))

              precision    recall  f1-score   support

           0       0.85      0.86      0.86       110
           1       0.78      0.75      0.76        69

    accuracy                           0.82       179
   macro avg       0.81      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179



In [69]:
result_df.loc[len(result_df)] = ['UnderSampling (TomekLinks) Model', classification_report(y_val, tl_model_preds)]

# **Oversampling**
Increase the number of minority class instances until the classes are balanced.

## **Random Oversampling**
Randomly duplicates existing minority class instances (sampling with replacement).
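
As a rough sketch of the idea (the helper `random_oversample_indices` is hypothetical and not part of the notebook), minority rows are drawn with replacement until every class reaches the majority size:

```python
import numpy as np

def random_oversample_indices(y, seed: int = 42) -> np.ndarray:
    """Return row positions where every class is resampled up to the majority size."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_target = int(counts.max())
    parts = []
    for c, n in zip(classes, counts):
        members = np.flatnonzero(y == c)
        if n < n_target:
            # Duplicate randomly chosen rows of the smaller class.
            extra = rng.choice(members, size=n_target - int(n), replace=True)
            members = np.concatenate([members, extra])
        parts.append(members)
    return np.concatenate(parts)

# Toy labels with the same class counts as the train set (439 vs 273).
y_toy = np.array([0] * 439 + [1] * 273)
idx = random_oversample_indices(y_toy)
print(np.bincount(y_toy[idx]), len(idx))
```

With the train-set class counts this yields 878 rows, the same size as the resampled set produced by `RandomOverSampler` below.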

In [70]:
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler()

train_dummy_ros = pd.get_dummies(X_train, drop_first=True)
val_dummy_ros = pd.get_dummies(X_val, drop_first=True)

X_train_ros, y_train_ros = ros.fit_resample(train_dummy_ros, y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_train_ros shape: {X_train_ros.shape}")

X_train shape: (712, 7)
X_train_ros shape: (878, 11)


In [71]:
ros_scaler = StandardScaler()
X_train_ros[numerical_features] = ros_scaler.fit_transform(X_train_ros[numerical_features])
val_dummy_ros[numerical_features] = ros_scaler.transform(val_dummy_ros[numerical_features])

ros_model = LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear')

ros_model.fit(X_train_ros, y_train_ros)
ros_model_preds = ros_model.predict(val_dummy_ros)

print(classification_report(y_val, ros_model_preds))

# Store the report for the random oversampling predictions (not the Tomek Links ones)
result_df.loc[len(result_df)] = ['OverSampling (Random) Model', classification_report(y_val, ros_model_preds)]

              precision    recall  f1-score   support

           0       0.88      0.85      0.87       110
           1       0.78      0.81      0.79        69

    accuracy                           0.84       179
   macro avg       0.83      0.83      0.83       179
weighted avg       0.84      0.84      0.84       179



## **SMOTE (Synthetic Minority Over-sampling Technique)**

Generates synthetic examples of the minority class by interpolating between a minority sample and one of its k nearest minority neighbors.
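
The interpolation step can be sketched in NumPy as follows. The function `smote_like_samples` is a hypothetical, simplified illustration (brute-force neighbor search, minority points only) and not the `imblearn` implementation.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                      # exclude the point itself
    k = min(k, len(X_min) - 1)
    neighbors = np.argsort(d, axis=1)[:, :k]         # k nearest minority neighbors
    base = rng.integers(0, len(X_min), size=n_new)   # randomly chosen starting points
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    lam = rng.random((n_new, 1))                     # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[picked] - X_min[base])

# Toy minority points on the corners of the unit square.
X_min_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(X_min_toy, n_new=6, k=2)
print(synthetic.shape)
```

Each synthetic point lies on the segment between two existing minority points. Interpolation is defined for continuous features; for mixed numerical and categorical data, `imblearn` also provides `SMOTENC`.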

In [72]:
from imblearn.over_sampling import SMOTE

smote = SMOTE()

train_dummy_smote = pd.get_dummies(X_train, drop_first=True)
val_dummy_smote = pd.get_dummies(X_val, drop_first=True)

X_train_smote, y_train_smote = smote.fit_resample(train_dummy_smote, y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_train_smote shape: {X_train_smote.shape}")

X_train shape: (712, 7)
X_train_smote shape: (878, 11)


In [73]:
smote_scaler = StandardScaler()
X_train_smote[numerical_features] = smote_scaler.fit_transform(X_train_smote[numerical_features])
val_dummy_smote[numerical_features] = smote_scaler.transform(val_dummy_smote[numerical_features])

smote_model = LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear')

smote_model.fit(X_train_smote, y_train_smote)
smote_model_preds = smote_model.predict(val_dummy_smote)

print(classification_report(y_val, smote_model_preds))

result_df.loc[len(result_df)] = ['OverSampling (SMOTE) Model', classification_report(y_val, smote_model_preds)]

              precision    recall  f1-score   support

           0       0.85      0.85      0.85       110
           1       0.76      0.75      0.76        69

    accuracy                           0.82       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.82      0.82      0.82       179



## **ADASYN (Adaptive Synthetic Sampling)**

Similar to SMOTE, but generates more synthetic samples around minority points that are harder to learn, i.e. those whose neighborhoods contain more majority class samples.
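
What distinguishes ADASYN from SMOTE is how many synthetic points are assigned to each minority sample. The sketch below follows the allocation rule as commonly described (share of majority neighbors, normalized, times the number of samples needed); the function `adasyn_allocation` and the toy data are hypothetical and this is not the `imblearn` implementation.

```python
import numpy as np

def adasyn_allocation(X, y, minority=1, k=5, beta=1.0):
    """Number of synthetic points to generate around each minority sample."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    min_idx = np.flatnonzero(y == minority)
    n_min = len(min_idx)
    n_maj = len(y) - n_min
    total_new = int(round((n_maj - n_min) * beta))   # samples needed for balance
    d = np.linalg.norm(X[min_idx][:, None, :] - X[None, :, :], axis=2)
    d[np.arange(n_min), min_idx] = np.inf            # exclude the point itself
    neighbors = np.argsort(d, axis=1)[:, :k]
    # r_i: share of majority class points among the k nearest neighbors.
    r = (y[neighbors] != minority).mean(axis=1)
    if r.sum() == 0:
        return np.zeros(n_min, dtype=int)
    return np.rint(r / r.sum() * total_new).astype(int)

# Toy 1-D data: one minority point sits inside the majority region (x = 2.4).
X_toy = np.array([[-1.0], [0.0], [1.0], [2.0], [3.0], [4.0], [5.0],
                  [2.4], [10.0], [11.0]])
y_toy = np.array([0] * 7 + [1] * 3)
allocation = adasyn_allocation(X_toy, y_toy, k=2)
print(allocation)
```

The minority point surrounded by majority neighbors receives the largest share of new samples; each share would then be generated by SMOTE-style interpolation.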

In [74]:
from imblearn.over_sampling import ADASYN

adasyn = ADASYN()

train_dummy_adasyn = pd.get_dummies(X_train, drop_first=True)
val_dummy_adasyn = pd.get_dummies(X_val, drop_first=True)

X_train_adasyn, y_train_adasyn = adasyn.fit_resample(train_dummy_adasyn, y_train)

print(f"X_train shape: {X_train.shape}")
print(f"X_train_adasyn shape: {X_train_adasyn.shape}")

X_train shape: (712, 7)
X_train_adasyn shape: (885, 11)


In [75]:
adasyn_scaler = StandardScaler()
X_train_adasyn[numerical_features] = adasyn_scaler.fit_transform(X_train_adasyn[numerical_features])
val_dummy_adasyn[numerical_features] = adasyn_scaler.transform(val_dummy_adasyn[numerical_features])

adasyn_model = LogisticRegression(C=4.893900918477489, class_weight=None, max_iter=100, solver='liblinear')

adasyn_model.fit(X_train_adasyn, y_train_adasyn)
adasyn_model_preds = adasyn_model.predict(val_dummy_adasyn)

print(classification_report(y_val, adasyn_model_preds))

result_df.loc[len(result_df)] = ['OverSampling (ADASYN) Model', classification_report(y_val, adasyn_model_preds)]

              precision    recall  f1-score   support

           0       0.87      0.84      0.85       110
           1       0.75      0.80      0.77        69

    accuracy                           0.82       179
   macro avg       0.81      0.82      0.81       179
weighted avg       0.82      0.82      0.82       179



# **Bagging & Boosting Learning**

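**1. Bagging (Random Forest) Helps Reduce Variance**
- Trains many models on bootstrap samples and aggregates their votes, which reduces variance.
- **Balanced Random Forest** additionally applies random undersampling to each bootstrap sample, so every tree sees balanced classes.

**2. Boosting Focuses on Hard-to-Classify Samples**
- **AdaBoost:** Increases the weights of misclassified samples at each round; minority class samples are often among them.
- **Gradient Boosting**: Fits each new learner to the remaining errors of the current ensemble; it does not prioritize the minority class by itself, but it can be combined with sample or class weights.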
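
The balanced bagging idea mentioned above can be sketched as follows. The generator `balanced_bootstrap_indices` is a hypothetical illustration of how each bag could be drawn; `imblearn` offers a ready-made `BalancedRandomForestClassifier` whose internals may differ from this sketch.

```python
import numpy as np

def balanced_bootstrap_indices(y, n_bags=3, seed=42):
    """Yield one class-balanced bootstrap sample of row positions per bag."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_per_class = int(counts.min())
    for _ in range(n_bags):
        # Draw with replacement, the same number of rows from every class.
        yield np.concatenate([
            rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=True)
            for c in classes
        ])

# Toy labels with the same class counts as the train set (439 vs 273).
y_toy = np.array([0] * 439 + [1] * 273)
bags = list(balanced_bootstrap_indices(y_toy))
print([np.bincount(y_toy[b]).tolist() for b in bags])
```

Each bag is balanced, yet different bags contain different majority rows, so the ensemble as a whole discards less information than a single undersampled train set.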

# **Conclusion**

Print and compare model performance using the methods above.
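
Since the reports are stored as formatted strings, comparing models numerically is awkward. One possible alternative, sketched here with a hypothetical helper `minority_summary` and toy labels, is to request the report as a dictionary through the `output_dict=True` option of `classification_report`:

```python
from sklearn.metrics import classification_report

def minority_summary(y_true, y_pred, positive="1"):
    """Extract accuracy and positive-class metrics from a classification report."""
    report = classification_report(y_true, y_pred, output_dict=True)
    return {
        "accuracy": report["accuracy"],
        "precision": report[positive]["precision"],
        "recall": report[positive]["recall"],
        "f1": report[positive]["f1-score"],
    }

# Toy labels and predictions, for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]
summary = minority_summary(y_true, y_pred)
print(summary)
```

One such dictionary per model could be collected into a DataFrame and sorted by recall or F1-score for class 1.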

In [77]:
result_df

Unnamed: 0,Model,Classification report
0,Base Model,precision recall f1-score ...
1,Class Weight Model,precision recall f1-score ...
2,UnderSampling (Random) Model,precision recall f1-score ...
3,UnderSampling (NearMiss) Model,precision recall f1-score ...
4,UnderSampling (TomekLinks) Model,precision recall f1-score ...
5,OverSampling (Random) Model,precision recall f1-score ...
6,OverSampling (SMOTE) Model,precision recall f1-score ...
7,OverSampling (ADASYN) Model,precision recall f1-score ...


In [78]:
for idx_row in range(len(result_df)):
  print(result_df['Model'][idx_row])
  print(result_df['Classification report'][idx_row])
  print('\n========================================================\n')

Base Model
              precision    recall  f1-score   support

           0       0.86      0.90      0.88       110
           1       0.83      0.77      0.80        69

    accuracy                           0.85       179
   macro avg       0.84      0.83      0.84       179
weighted avg       0.85      0.85      0.85       179



Class Weight Model
              precision    recall  f1-score   support

           0       0.89      0.85      0.87       110
           1       0.77      0.83      0.80        69

    accuracy                           0.84       179
   macro avg       0.83      0.84      0.83       179
weighted avg       0.84      0.84      0.84       179



UnderSampling (Random) Model
              precision    recall  f1-score   support

           0       0.90      0.86      0.88       110
           1       0.79      0.84      0.82        69

    accuracy                           0.85       179
   macro avg       0.85      0.85      0.85       179
weighted avg       0.86      0.85      0.86       179

## Comparison of Models Handling Imbalanced Data

### 1. Base Model (No Handling of Imbalanced Data)
- **Accuracy:** 85%
- **Class 0:** Precision 86%, Recall 90%, F1-score 88%
- **Class 1:** Precision 83%, Recall 77%, F1-score 80%
- The base model has fairly high accuracy, but it is less sensitive to the minority class: recall for label **1** is lower than for label **0** (**77% vs 90%**).

---

### 2. Class Weight Model
- **Accuracy:** 84%
- **Class 0:** Precision 89%, Recall 85%, F1-score 87%
- **Class 1:** Precision 77%, Recall 83%, F1-score 80%
- Recall for class **1** improved (77% to 83%) while its precision dropped (83% to 77%), so the F1-score stays at 80%: the model detects more of the minority class at the cost of more false positives.

---

### 3. UnderSampling Models

#### **Random UnderSampling**
- **Accuracy:** 85%
- **Class 0:** Precision 90%, Recall 86%, F1-score 88%
- **Class 1:** Precision 79%, Recall 84%, F1-score 82%
- The model is well balanced (highest class **1** F1-score, 82%), but discarding 166 majority samples risks losing information.

#### **NearMiss UnderSampling**
- **Accuracy:** 75%
- **Class 0:** Precision 89%, Recall 68%, F1-score 77%
- **Class 1:** Precision 63%, Recall 87%, F1-score 73%
- Recall for class **1** increased significantly (87%), but its precision fell to 63% and recall for class **0** fell to 68%, so overall accuracy dropped to 75%.

#### **TomekLinks UnderSampling**
- **Accuracy:** 82%
- **Class 0:** Precision 85%, Recall 86%, F1-score 86%
- **Class 1:** Precision 78%, Recall 75%, F1-score 76%
- No improvement over the base model: accuracy (82% vs 85%) and recall for class **1** (75% vs 77%) both decreased slightly.

---

### 4. OverSampling Models

#### **Random OverSampling**
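- **Accuracy:** 84%
- **Class 0:** Precision 88%, Recall 85%, F1-score 87%
- **Class 1:** Precision 78%, Recall 81%, F1-score 79%
- Recall for class **1** improved over the base model (81% vs 77%), but precision dropped (78% vs 83%), leaving the F1-score slightly lower (79% vs 80%). These figures come from the report printed in the Random Oversampling section; the entry stored in `result_df` for this model was built from the Tomek Links predictions by mistake.

#### **SMOTE OverSampling**
- **Accuracy:** 82%
- **Class 0:** Precision 85%, Recall 85%, F1-score 85%
- **Class 1:** Precision 76%, Recall 75%, F1-score 76%
- No improvement over the base model: recall (75%) and F1-score (76%) for class **1** are both slightly lower.

#### **ADASYN OverSampling**
- **Accuracy:** 82%
- **Class 0:** Precision 87%, Recall 84%, F1-score 85%
- **Class 1:** Precision 75%, Recall 80%, F1-score 77%
- Recall for class **1** improved over the base model (80% vs 77%), but precision dropped to 75%, so the F1-score (77%) is lower than the base model's 80%.

---

## Summary & Recommendations
- The **Base Model** performs well overall (85% accuracy) but struggles with class **1** detection (77% recall).
- **Class Weight Model** improves recall for class **1** (83%) without changing the dataset, making it a simple yet effective approach.
- **Random UnderSampling** achieves the best balance between precision and recall (class **1** F1-score 82%) at the same accuracy as the base model (85%).
- **NearMiss UnderSampling** prioritizes class **1** detection (87% recall) but at the cost of lower overall accuracy (75%).
- Among the oversampling methods, **Random OverSampling** gives the highest recall and F1-score for class **1** (81% and 79%), with **ADASYN** close behind (80% and 77%); **SMOTE** does not improve on the base model.

### 👉 Recommendation:
- If overall accuracy is the priority, **Random UnderSampling** is the best option: it matches the base model's accuracy while improving recall for class **1**.
- If detecting the minority class is the priority and no data should be discarded, the **Class Weight Model** or an oversampling method (**Random OverSampling** or **ADASYN**) is preferable.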