<div class="alert alert-block alert-info">

## <center> GROUP PROJECT - TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS </center> <br>
#  <center> <b> STACKING </b> </center> <br>
## <center> Fall Semester 2024-2025 </center>
<br>
<center> Group 46: </center> <br>
<center>Afonso Ascensão, 20240684 <br></center>
<center>Duarte Marques, 20240522 <br></center>
<center>Joana Esteves, 20240746 <br></center>
<center>Rita Serra, 20240515 <br></center>
<center>Rodrigo Luís, 20240742 <br></center>

</div>

**Description of contents:**
- Apply pipeline to preprocess the data.
- Attempt to improve performance by combining our two best-performing models so far, XGBoost and MLPClassifier, taking into account not only the F1 macro score but also the number of classes each model actually predicts. To do this, we apply stacking, an ensemble method that feeds the predicted probabilities of both models into a Logistic Regression meta-model.
- Assessment of the model using cross-validation.

**Table of Contents**
- [1. Import Libraries](#section_1)
- [2. Import Dataset and Pipeline](#section_2)
- [3. Preprocessing](#section_3)
- [4. Stacking](#section_4)


<a class="anchor" id="section_1">

# 1. Import Libraries

</a>

In [1]:

import pandas as pd
import numpy as np

# Preprocessing
## Pipeline
from joblib import load
from transformers import *
## Target Encoding
from sklearn.preprocessing import LabelEncoder

# Model algorithms
from xgboost import XGBClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
## Weights 
from sklearn.utils.class_weight import compute_sample_weight

# Evaluation metrics
from sklearn.metrics import classification_report, f1_score

# Cross validation, parameter tuning
from sklearn.model_selection import StratifiedKFold

np.random.seed(42)

# Define a seed
random_state = 42

# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option("display.max_columns", None)

<a class="anchor" id="section_2">

# 2. Import Dataset and Pipeline

</a>

In [2]:
# Train and validation w/ split, separate X and y to apply preprocessing
transformed_train_split = pd.read_parquet("transformed_train_split.parquet")
transformed_val_split = pd.read_parquet("transformed_val_split.parquet")

# Test set with predicted agreement column, apply preprocessing 
test_transformed_agreement = pd.read_parquet("test_transformed_agreement.parquet")

# Dataset with no split for cross validation, apply pipeline inside cross validation
transformed_train_data = pd.read_parquet("transformed_train_data.parquet")

In [3]:
# Load pipeline
pipeline = load('pipeline.joblib') 

<a class="anchor" id="section_3">

# 3. Preprocessing

</a>

In [4]:
# Separate X and y for train after split
X_train = transformed_train_split.drop(['Claim Injury Type'], axis = 1)
y_train = transformed_train_split['Claim Injury Type']
y_train = y_train.values.ravel()

# Separate X and y for validation after split
X_val = transformed_val_split.drop(['Claim Injury Type'], axis = 1)
y_val = transformed_val_split['Claim Injury Type']

# Separate X and y for dataset before split
X = transformed_train_data.drop(['Claim Injury Type'], axis = 1)
y = transformed_train_data['Claim Injury Type']

In [5]:
# Apply encoding of y for train and validation sets

# Initialize target encoder
label_encoder = LabelEncoder()

# Encode target
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
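
Since the models below work with integer class codes, it can help to see how `LabelEncoder` maps labels to codes and back. The following is a minimal sketch on made-up labels (`class_a`, `class_b`, `class_c` are hypothetical and are not the real `Claim Injury Type` values):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
demo_labels = np.array(["class_b", "class_a", "class_c", "class_a"])

demo_encoder = LabelEncoder()
# Classes are sorted alphabetically before integer codes are assigned
demo_encoded = demo_encoder.fit_transform(demo_labels)

# Map integer predictions back to the original label strings
demo_decoded = demo_encoder.inverse_transform(np.array([2, 0]))

print(demo_encoder.classes_)
print(demo_encoded)
print(demo_decoded)
```

The same `inverse_transform` call can be applied to model predictions so that reports and submissions show the original class names instead of the integer codes.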

In [6]:
# Apply preprocessing pipeline to the train, validation and test sets
X_train_preprocessed = pipeline.fit_transform(X_train, y_train_encoded)
X_val_preprocessed = pipeline.transform(X_val)
test_data_preprocessed = pipeline.transform(test_transformed_agreement)

In [7]:
print("Selected Features:", X_train_preprocessed.columns.values)

Selected Features: ['Attorney/Representative' 'Average Weekly Wage Log' 'C-2 Delivered'
 'Industry Code' 'Time Assembly to Hearing' 'Hearing Held'
 'Agreement Reached' 'C-3 Delivered on Time' 'Part of Body Group_Trunk'
 'Part of Body Group_Lower Extremities' 'IME-4 Count Log'
 'District Name_NYC' 'Part of Body Group_Upper Extremities' 'Gender'
 'Carrier Type_2A. SIF' 'Cause of Injury Group_X' 'Assembly Year'
 'Cause of Injury Group_VI']


<a class="anchor" id="section_4">

# 4. Stacking

</a>

**Variables for model:**
- X_train_preprocessed;
- y_train_encoded;
- X_val_preprocessed;
- y_val_encoded;
- test_data_preprocessed.

**Variables for CV:**
- X: no preprocessing and no split;
- y: no preprocessing and no split;
- Apply pipeline and weights inside cv.
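
Fitting the preprocessing and computing the weights inside each fold keeps the validation rows out of every fitted step. The sketch below illustrates the idea on synthetic data; the `StandardScaler` pipeline and the `LogisticRegression` model are stand-ins (assumptions), not the pipeline loaded from `pipeline.joblib` or the models used in this notebook. Cloning the templates in each fold is one way to guarantee that no fitted state carries over between folds.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_sample_weight

# Synthetic, imbalanced 3-class data (illustration only)
X_toy, y_toy = make_classification(
    n_samples=200, n_features=8, n_informative=5, n_classes=3,
    weights=[0.7, 0.2, 0.1], random_state=42,
)

# Unfitted templates; each fold works on its own fresh copy
preprocess_template = Pipeline([("scale", StandardScaler())])
model_template = LogisticRegression(max_iter=1000)

fold_scores = []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in skf.split(X_toy, y_toy):
    prep = clone(preprocess_template)
    model = clone(model_template)

    # Fit preprocessing on the training fold only, then apply it to the validation fold
    X_tr = prep.fit_transform(X_toy[train_idx])
    X_va = prep.transform(X_toy[val_idx])

    # Weights are derived from the training-fold labels only
    fold_weights = compute_sample_weight("balanced", y_toy[train_idx])
    model.fit(X_tr, y_toy[train_idx], sample_weight=fold_weights)

    fold_scores.append(f1_score(y_toy[val_idx], model.predict(X_va), average="macro"))

print(fold_scores)
```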


**XGBOOST MODEL:**

In [8]:
# Compute weights for each sample
weights = compute_sample_weight('balanced', y_train_encoded)

xgb = XGBClassifier(
                           objective='multi:softprob',  # per-class probabilities, consumed by the stacking step via predict_proba
                           n_estimators = 150,
                           max_depth = 10,
                           learning_rate = 0.01,
                           random_state = random_state
                           )

# Fit to train data
xgb.fit(X_train_preprocessed, y_train_encoded, sample_weight=weights)
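
For reference, the `'balanced'` option weights each sample by `n_samples / (n_classes * count_of_its_class)`, so rare classes receive larger weights. A small sketch on a toy label vector (the values are made up for illustration) reproduces the computation by hand:

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels: three samples of class 0, one sample of class 1
y_toy = np.array([0, 0, 0, 1])

weights_sklearn = compute_sample_weight("balanced", y_toy)

# Manual computation: n_samples / (n_classes * class_count), looked up per sample
class_counts = np.bincount(y_toy)
class_weights = len(y_toy) / (len(class_counts) * class_counts)
weights_manual = class_weights[y_toy]

print(weights_sklearn)
print(weights_manual)
```

Here the single class-1 sample gets weight 2.0 while each class-0 sample gets 2/3, so both classes contribute the same total weight to the loss.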

**MLP CLASSIFIER MODEL:**

In [None]:
mlp = MLPClassifier(
    hidden_layer_sizes=(64,64),    
    activation='relu',             
    solver='adam',                
    learning_rate_init=0.001,     
    max_iter=1000,                 
    alpha=0.0001,                 
    random_state=random_state,     
)

mlp.fit(X_train_preprocessed, y_train_encoded)

**STACKING MODEL:**

In [None]:
# Get predictions - probabilities 
mlp_train_proba = mlp.predict_proba(X_train_preprocessed)
xgb_train_proba = xgb.predict_proba(X_train_preprocessed)

# Combine predictions 
stacked_features_train = np.hstack((mlp_train_proba, xgb_train_proba))

# Fit meta model on train using the mlp and xgb predictions as features
meta_model = LogisticRegression(random_state=random_state, max_iter=1000)
meta_model.fit(stacked_features_train, y_train_encoded)
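
Note that the meta-model above is trained on probabilities that the base models produced for rows they were themselves fitted on, which can make those features look more reliable than they are on unseen data. A common alternative is to build the meta-features from out-of-fold predictions. The sketch below shows that variant on synthetic data; the `RandomForestClassifier` is a stand-in for XGBoost, and the small MLP, data sizes and fold count are all assumptions chosen to keep the example fast, not the configuration used in this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neural_network import MLPClassifier

# Synthetic 3-class data (illustration only)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=42
)

# Stand-in base models (assumptions, not the notebook's tuned models)
base_models = [
    RandomForestClassifier(n_estimators=50, random_state=42),
    MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=42),
]

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

# Each row's probabilities come from a model that did not see that row during fitting
oof_blocks = [
    cross_val_predict(model, X_demo, y_demo, cv=inner_cv, method="predict_proba")
    for model in base_models
]
stacked_oof = np.hstack(oof_blocks)

# Meta-model trained on out-of-fold probabilities
meta_demo = LogisticRegression(max_iter=1000)
meta_demo.fit(stacked_oof, y_demo)

# Refit the base models on all training rows so they can score new data
for model in base_models:
    model.fit(X_demo, y_demo)

new_features = np.hstack([model.predict_proba(X_demo[:5]) for model in base_models])
print(meta_demo.predict(new_features))
```

scikit-learn's `StackingClassifier` packages the same out-of-fold procedure behind a single estimator, which may be more convenient if sample weights do not need to be passed to only one of the base models.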


In [None]:
# Get validation scores 

mlp_val_proba = mlp.predict_proba(X_val_preprocessed)
xgb_val_proba = xgb.predict_proba(X_val_preprocessed)
stacked_features_val = np.hstack((mlp_val_proba, xgb_val_proba))
y_pred_val = meta_model.predict(stacked_features_val)

print(classification_report(y_val_encoded, y_pred_val))

              precision    recall  f1-score   support

           0       0.67      0.43      0.52      1248
           1       0.84      0.97      0.90     29108
           2       0.38      0.11      0.17      6890
           3       0.75      0.82      0.78     14851
           4       0.64      0.64      0.64      4828
           5       0.29      0.01      0.02       421
           6       0.00      0.00      0.00        10
           7       0.48      0.21      0.29        47

    accuracy                           0.78     57403
   macro avg       0.50      0.40      0.42     57403
weighted avg       0.74      0.78      0.74     57403
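
To make the gap between the two averages concrete, the short sketch below recomputes them from the rounded per-class F1 scores and supports printed in the report above. Because the inputs are rounded to two decimals, the results only approximate the reported averages.

```python
# Per-class F1 and support copied from the validation report above (rounded values)
per_class_f1 = [0.52, 0.90, 0.17, 0.78, 0.64, 0.02, 0.00, 0.29]
support = [1248, 29108, 6890, 14851, 4828, 421, 10, 47]

# Macro average: every class counts equally, regardless of its size
macro_f1 = sum(per_class_f1) / len(per_class_f1)

# Weighted average: each class counts in proportion to its support
weighted_f1 = sum(f * s for f, s in zip(per_class_f1, support)) / sum(support)

print(f"macro F1 ~ {macro_f1:.3f}")
print(f"weighted F1 ~ {weighted_f1:.3f}")
```

The macro average is pulled down by the small classes 5, 6 and 7, while the weighted average is dominated by classes 1 and 3, which together account for most of the validation rows.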





In [None]:
# Get train scores

mlp_train_proba = mlp.predict_proba(X_train_preprocessed)
xgb_train_proba = xgb.predict_proba(X_train_preprocessed)
stacked_features_train = np.hstack((mlp_train_proba, xgb_train_proba))
y_pred_train = meta_model.predict(stacked_features_train)

print(classification_report(y_train_encoded, y_pred_train))

              precision    recall  f1-score   support

           0       0.66      0.42      0.51     11229
           1       0.84      0.97      0.90    261970
           2       0.39      0.12      0.18     62016
           3       0.75      0.83      0.79    133656
           4       0.65      0.64      0.65     43452
           5       0.53      0.02      0.04      3790
           6       1.00      0.14      0.24        87
           7       0.53      0.29      0.38       423

    accuracy                           0.78    516623
   macro avg       0.67      0.43      0.46    516623
weighted avg       0.74      0.78      0.75    516623



**Cross-validation w/ 5 splits for final assessment of the model:**

In [13]:
# Cross-validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# Encode the target 
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Store scores for each fold
cv_scores = []

for train_idx, val_idx in kf.split(X, y_encoded):
    
    # Data for the current split
    X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
    y_train_cv, y_val_cv = y_encoded[train_idx], y_encoded[val_idx]

    # Preprocess training and validation sets
    X_train_cv_preprocessed = pipeline.fit_transform(X_train_cv, y_train_cv)
    X_val_cv_preprocessed = pipeline.transform(X_val_cv)

    # Compute sample weights for the training data
    train_sample_weights = compute_sample_weight('balanced', y_train_cv)

    # Refit both base models on this fold's training data
    xgb.fit(X_train_cv_preprocessed, y_train_cv, sample_weight=train_sample_weights)
    mlp.fit(X_train_cv_preprocessed, y_train_cv)

    mlp_train_proba = mlp.predict_proba(X_train_cv_preprocessed)
    xgb_train_proba = xgb.predict_proba(X_train_cv_preprocessed)

    # Combine predictions
    stacked_features_train = np.hstack((mlp_train_proba, xgb_train_proba))

    # Fit meta model on train using the mlp and xgb outputs
    meta_model.fit(stacked_features_train, y_train_cv)

    mlp_val_proba = mlp.predict_proba(X_val_cv_preprocessed)
    xgb_val_proba = xgb.predict_proba(X_val_cv_preprocessed)
    stacked_features_val = np.hstack((mlp_val_proba, xgb_val_proba))
    y_pred_val = meta_model.predict(stacked_features_val)

    f1 = f1_score(y_val_cv, y_pred_val, average='macro')

    cv_scores.append(f1)

# Convert scores to a NumPy array for easier calculations
cv_scores = np.array(cv_scores)

# Print the results
print("Cross-validation scores (F1-macro):", cv_scores)
print("Mean CV score:", cv_scores.mean())

Cross-validation scores (F1-macro): [0.40213601 0.4163565  0.40946918 0.41464634 0.40441891]
Mean CV score: 0.40940538709692353
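
The fold-to-fold spread is as informative as the mean when comparing models. As a small sketch, the scores printed above can be summarised with their sample standard deviation (the variable name `cv_scores_reported` is introduced here for illustration):

```python
import numpy as np

# F1-macro scores copied from the cross-validation output above
cv_scores_reported = np.array(
    [0.40213601, 0.4163565, 0.40946918, 0.41464634, 0.40441891]
)

cv_mean = cv_scores_reported.mean()
# ddof=1 gives the sample standard deviation across the 5 folds
cv_std = cv_scores_reported.std(ddof=1)

print(f"F1-macro: {cv_mean:.4f} +/- {cv_std:.4f}")
```

A difference between two models that is smaller than this spread is hard to distinguish from fold-to-fold noise.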


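- This model improves the F1 macro score only very slightly, and it still fails to predict all classes effectively: on the validation set, class 6 receives no correct predictions, and classes 2 and 5 reach a recall of only 0.11 and 0.01.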
- The computational cost outweighs the marginal performance gain.
- Therefore, we will prioritize another model and focus on optimizing it instead of proceeding with the stacking approach.