<div class="alert alert-block alert-info">

## <center> GROUP PROJECT - TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS </center> <br>
#  <center> <b> Random Forest </b> </center> <br>
## <center> Fall Semester 2024-2025 </center>
<br>
<center> Group 46: </center> <br>
<center>Afonso Ascensão, 20240684 <br></center>
<center>Duarte Marques, 20240522 <br></center>
<center>Joana Esteves, 20240746 <br></center>
<center>Rita Serra, 20240515 <br></center>
<center>Rodrigo Luís, 20240742 <br></center>

</div>

**Description of contents:**
- Apply the preprocessing pipeline to the data.
- Implement the Random Forest algorithm and tune its hyperparameters with a grid search.
- Assess the model using cross-validation.
- Generate predictions for the test set.

**Table of Contents**
- [1. Import Libraries](#section_1)
- [2. Import Dataset and Pipeline](#section_2)
- [3. Preprocessing](#section_3)
- [4. Random Forest](#section_4)


<a class="anchor" id="section_1">

# 1. Import Libraries

</a>

In [2]:

import pandas as pd
import numpy as np

# Preprocessing
## Pipeline
from sklearn.pipeline import Pipeline
from joblib import load
from transformers import *

## Target Encoding
from sklearn.preprocessing import LabelEncoder

# Model algorithm - Random Forest
from sklearn.ensemble import RandomForestClassifier

# Evaluation metrics
from sklearn.metrics import classification_report, f1_score, accuracy_score

from sklearn.utils.class_weight import compute_sample_weight

# Cross validation, parameter tuning
from sklearn.model_selection import StratifiedKFold
from itertools import product

# Define a seed
random_state = 42


In [3]:
pd.set_option("display.max_columns", None)
pd.set_option('display.max_rows', None)


<a class="anchor" id="section_2">

# 2. Import Dataset and Pipeline

</a>

In [4]:
# Train and validation w/ split, separate X and y to apply preprocessing
transformed_train_split = pd.read_parquet("transformed_train_split.parquet")
transformed_val_split = pd.read_parquet("transformed_val_split.parquet")

# Test set with predicted agreement column, apply preprocessing 
test_transformed_agreement = pd.read_parquet("test_transformed_agreement.parquet")

# Dataset with no split for cross validation, apply pipeline inside cross validation
transformed_train_data = pd.read_parquet("transformed_train_data.parquet")

In [5]:
# Load pipeline
pipeline = load('pipeline.joblib') 

<a class="anchor" id="section_3">

# 3. Preprocessing

</a>

<a class="anchor" id="section_3_1">

## 3.1. Pipeline

</a>

In [6]:
# Separate X and y for train after split
X_train = transformed_train_split.drop(['Claim Injury Type'], axis = 1)
y_train = transformed_train_split['Claim Injury Type']

# Separate X and y for validation after split
X_val = transformed_val_split.drop(['Claim Injury Type'], axis = 1)
y_val = transformed_val_split['Claim Injury Type']

# Separate X and y for dataset before split
X = transformed_train_data.drop(['Claim Injury Type'], axis = 1)
y = transformed_train_data['Claim Injury Type']

In [7]:
X_train.head(1)

| Claim Identifier | Accident Date | Age at Injury | Alternative Dispute Resolution | Assembly Date | Attorney/Representative | Average Weekly Wage | Birth Year | C-2 Date | C-3 Date | Carrier Name | Carrier Type | County of Injury | COVID-19 Indicator | District Name | First Hearing Date | Gender | IME-4 Count | Industry Code | Medical Fee Region | WCIO Cause of Injury Code | WCIO Nature of Injury Code | WCIO Part Of Body Code | Agreement Reached | Number of Dependents | Accident Year | Assembly Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5930957 | 2022-02-15 | 49 | False | 2022-02-25 | True | 1727.58 | 1972.0 | 2022-02-25 | 2022-09-20 | STATE INSURANCE FUND | 2A. SIF | ORANGE | False | ALBANY | 2023-01-17 | False | 3 | 92.0 | I | 32.0 | 10.0 | -9.0 | 0.0 | 1 | 2022.0 | 2022 |


In [8]:
# Apply encoding of y for train and validation sets

# Initialize target encoder
label_encoder = LabelEncoder()

# Encode target
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
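
To illustrate what the encoder does, here is a minimal sketch on made-up labels. The strings `"low"`, `"high"` and `"medium"` are hypothetical and are not the project's `Claim Injury Type` values. `LabelEncoder` sorts the classes it sees during `fit` and maps each one to an integer, and `inverse_transform` maps the integers back.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels, for illustration only (not the project's target values)
labels = ["low", "high", "medium", "low"]

encoder = LabelEncoder()
codes = encoder.fit_transform(labels)  # classes are stored in sorted order

print(list(encoder.classes_))                      # learned class order
print(codes.tolist())                              # integer code for each label
print(encoder.inverse_transform([0, 2]).tolist())  # map codes back to labels
```

Because the mapping is learned on the training target, the validation target is only transformed, never refitted, so both sets share the same codes.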

In [9]:
# Apply preprocessing pipeline to the train, validation and test sets

X_train_preprocessed = pipeline.fit_transform(X_train, y_train_encoded)
X_val_preprocessed = pipeline.transform(X_val)
test_data_preprocessed = pipeline.transform(test_transformed_agreement)

In [10]:
# Selected Features
X_train_preprocessed.columns

Index(['Attorney/Representative', 'Average Weekly Wage Log', 'C-2 Delivered',
       'Industry Code', 'Time Assembly to Hearing', 'Hearing Held',
       'Agreement Reached', 'C-3 Delivered on Time',
       'Part of Body Group_Trunk', 'Part of Body Group_Lower Extremities',
       'IME-4 Count Log', 'District Name_NYC',
       'Part of Body Group_Upper Extremities', 'Gender',
       'Carrier Type_2A. SIF', 'Cause of Injury Group_X', 'Assembly Year',
       'Cause of Injury Group_VI'],
      dtype='object')

<a class="anchor" id="section_4">

# 4. Random Forest

</a>

**Variables for model:**
- X_train_preprocessed;
- y_train_encoded;
- X_val_preprocessed;
- y_val_encoded;
- test_data_preprocessed.

**Variables for CV:**
- X: no preprocessing;
- y: no preprocessing;
- Apply preprocessing inside cv.
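
Fitting the preprocessing inside each fold keeps the validation rows from influencing the fitted transformations. The sketch below shows the pattern on synthetic data. It assumes a `StandardScaler` as a stand-in for the project's saved pipeline, and the names `X_demo`, `y_demo` and `fold_scores` are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real X and y come from the parquet files)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

kf_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = []

for train_idx, val_idx in kf_demo.split(X_demo, y_demo):
    # Stand-in for the project's pipeline: fit on the training fold only
    scaler = StandardScaler()
    X_tr = scaler.fit_transform(X_demo[train_idx])
    X_va = scaler.transform(X_demo[val_idx])  # validation fold is only transformed

    model = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=42)
    model.fit(X_tr, y_demo[train_idx])

    fold_scores.append(f1_score(y_demo[val_idx], model.predict(X_va), average="macro"))

print("Mean F1-macro over folds:", np.mean(fold_scores))
```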


In [11]:
X_train_preprocessed.head()

| Claim Identifier | Attorney/Representative | Average Weekly Wage Log | C-2 Delivered | Industry Code | Time Assembly to Hearing | Hearing Held | Agreement Reached | C-3 Delivered on Time | Part of Body Group_Trunk | Part of Body Group_Lower Extremities | IME-4 Count Log | District Name_NYC | Part of Body Group_Upper Extremities | Gender | Carrier Type_2A. SIF | Cause of Injury Group_X | Assembly Year | Cause of Injury Group_VI |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5930957 | 1.0 | 0.394401 | 1.0 | 0.727 | 0.205678 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.191959 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 6091925 | 1.0 | 0.332599 | 1.0 | 0.057506 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.253756 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 5736622 | 0.0 | 0.384698 | 1.0 | 0.727 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.112289 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 | 0.0 |
| 5549121 | 1.0 | 0.287965 | 1.0 | 0.021138 | 0.116719 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.304247 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5755487 | 0.0 | 0.313751 | 1.0 | 0.169532 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.112289 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.5 | 0.0 |


In [12]:
# # Initial settings
# random_state = 42  # Define random state
# model_rf = RandomForestClassifier(random_state=random_state)

# # Define parameter grid for RandomForestClassifier
# param_grid = {
#     'n_estimators': [150,170,180],       
#     'max_samples': [0.5,0.6],  
#     'max_depth': [8,9]       
#     }
# # Generate all parameter combinations
# param_grid_list = [
#     dict(zip(param_grid.keys(), values))
#     for values in product(*param_grid.values())
# ]

# # Initialize StratifiedKFold
# kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# # Initialize target encoder
# label_encoder = LabelEncoder()


# # Lists to store results
# scores = []
# best_params = []

# #Loop over parameter combinations
# for params in param_grid_list:
#     fold_scores = []
    
#     # Cross-validation loop
#     for train_idx, val_idx in kf.split(X, y):
#         # Split data into training and validation
#         X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
#         y_train_cv, y_val_cv = y.iloc[train_idx], y.iloc[val_idx]

    
#         # Encode labels
#         y_train_cv_encoded = label_encoder.fit_transform(y_train_cv)
#         y_val_cv_encoded = label_encoder.transform(y_val_cv)

#         # Preprocess data
#         X_train_cv_preprocessed = pipeline.fit_transform(X_train_cv,y_train_cv_encoded)
#         X_val_cv_preprocessed = pipeline.transform(X_val_cv)

#         weights = compute_sample_weight(class_weight='balanced', y=y_train_cv_encoded)
#         # Set model parameters
#         model_rf.set_params(**params)

#         # Train the model
#         model_rf.fit(X_train_cv_preprocessed, y_train_cv_encoded, sample_weight=weights)
        
#         # Make predictions and evaluate
#         y_pred_cv = model_rf.predict(X_val_cv_preprocessed)
#         f1 = f1_score(y_val_cv_encoded, y_pred_cv, average='macro')

#         fold_scores.append(f1)

#     # Calculate average score for this parameter set
#     avg_score = np.mean(fold_scores)
#     scores.append(avg_score)
#     best_params.append(params)

# # Get the best parameters and final score
# best_index = np.argmax(scores)
# print("Best parameters:", best_params[best_index])
# print("Best F1-score:", scores[best_index])

- Best parameters: {'n_estimators': 180, 'max_samples': 0.4, 'max_depth': 9}
- Best F1-macro score: 0.3279974519510943
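
As a small sketch of how the manual grid search enumerates candidates, the snippet below expands the grid from the commented-out cell with `itertools.product`. It only counts and lists the combinations and does not train any model.

```python
from itertools import product

# Grid values copied from the commented-out search cell
param_grid = {
    "n_estimators": [150, 170, 180],
    "max_samples": [0.5, 0.6],
    "max_depth": [8, 9],
}

# One dict per combination, in the order product() yields them
param_grid_list = [
    dict(zip(param_grid.keys(), values))
    for values in product(*param_grid.values())
]

print(len(param_grid_list))  # number of candidate settings
print(param_grid_list[0])    # first candidate
```

With 5-fold cross-validation, each candidate is fitted five times, so a grid of this size means 12 x 5 = 60 model fits, each including a refit of the preprocessing pipeline.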

**Model with best parameters:**


In [13]:
# # Compute weights for each sample
# weights = compute_sample_weight(class_weight='balanced', y=y_train_encoded)

# model_rfc = RandomForestClassifier(n_estimators=180, max_samples= 0.4, max_depth= 9, random_state=42)

# # Train the model
# model_rfc.fit(X_train_preprocessed, y_train_encoded, sample_weight=weights)

# y_pred_train = model_rfc.predict(X_train_preprocessed)
# # Make predictions on validation data
# y_pred_val = model_rfc.predict(X_val_preprocessed)

# print("\nClassification Report Train Data:\n", classification_report(y_train_encoded, y_pred_train, digits=6))
# print("\nClassification Report Validation Data:\n", classification_report(y_val_encoded, y_pred_val, digits=6))

- The gap between the training and validation F1-macro scores indicates mild overfitting.
- The model never predicts class 6.
- We therefore decrease `max_depth` and increase `max_samples`, aiming to improve class prediction and reduce overfitting.
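
The models in this section are fitted with weights from `compute_sample_weight(class_weight='balanced', ...)`. As a minimal sketch on toy labels (not the project data), each sample receives the weight `n_samples / (n_classes * count_of_its_class)`, so rare classes count for more during training.

```python
import numpy as np
from sklearn.utils.class_weight import compute_sample_weight

# Toy labels for illustration: three samples of class 0, one of class 1
y_toy = np.array([0, 0, 0, 1])

w = compute_sample_weight(class_weight="balanced", y=y_toy)
print(w)

# Total weight per class is equal after balancing
print(w[y_toy == 0].sum(), w[y_toy == 1].sum())
```

Here the majority class gets 4 / (2 * 3) per sample and the minority class gets 4 / (2 * 1), so both classes carry the same total weight.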

In [14]:
weights = compute_sample_weight(class_weight='balanced', y=y_train_encoded)

model_rfc = RandomForestClassifier(n_estimators=180, max_samples= 0.5, max_depth= 7, random_state=42)

# Train the model
model_rfc.fit(X_train_preprocessed, y_train_encoded, sample_weight=weights)

y_pred_train = model_rfc.predict(X_train_preprocessed)
# Make predictions on validation data
y_pred_val = model_rfc.predict(X_val_preprocessed)

print("\nClassification Report Train Data:\n", classification_report(y_train_encoded, y_pred_train, digits=6))
print("\nClassification Report Validation Data:\n", classification_report(y_val_encoded, y_pred_val, digits=6))


Classification Report Train Data:
               precision    recall  f1-score   support

           0   0.409535  0.549292  0.469228     11229
           1   0.774861  0.881479  0.824738    261970
           2   0.307849  0.058630  0.098501     62016
           3   0.863495  0.218090  0.348229    133656
           4   0.405959  0.716837  0.518360     43452
           5   0.049523  0.746966  0.092888      3790
           6   0.017193  0.781609  0.033647        87
           7   0.019771  0.940898  0.038727       423

    accuracy                       0.589054    516623
   macro avg   0.356023  0.611725  0.303040    516623
weighted avg   0.696696  0.589054  0.574640    516623


Classification Report Validation Data:
               precision    recall  f1-score   support

           0   0.414398  0.548878  0.472251      1248
           1   0.773909  0.883503  0.825083     29108
           2   0.317629  0.060668  0.101877      6890
           3   0.857561  0.216888  0.346214     14851
 

- With this parameter adjustment, the F1-macro score decreased slightly, but the model now predicts class 6 and overfitting is reduced.
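
For reference, F1-macro is the unweighted mean of the per-class F1-scores, so the rare classes count as much as the frequent ones. The sketch below recomputes it from the per-class values printed in the training classification report above; the variable names are illustrative.

```python
import numpy as np

# Per-class F1-scores (classes 0 to 7) copied from the training classification report
per_class_f1_train = np.array([
    0.469228, 0.824738, 0.098501, 0.348229,
    0.518360, 0.092888, 0.033647, 0.038727,
])

# Macro average: plain mean, ignoring class support
macro_f1 = per_class_f1_train.mean()
print(round(macro_f1, 6))
```

The result agrees with the `macro avg` row of the report (0.303040), while the much higher `weighted avg` is dominated by the large classes.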

**Cross-validation w/ 5 splits for final assessment of the model with selected hyperparameters:**

In [15]:
# General configurations
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Encode the target variable
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)  # 'y' is the original target variable

# List to store F1-macro scores for each fold
cv_scores = []

# Cross-validation loop
for train_idx, val_idx in kf.split(X, y_encoded):
    
    # Split the data for the current fold
    X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
    y_train_cv, y_val_cv = y_encoded[train_idx], y_encoded[val_idx]

    # Preprocess training and validation sets
    X_train_cv_preprocessed = pipeline.fit_transform(X_train_cv, y_train_cv)  # Fit pipeline on training data
    X_val_cv_preprocessed = pipeline.transform(X_val_cv)  # Transform validation data

    # Compute sample weights for the training data
    train_sample_weights = compute_sample_weight('balanced', y_train_cv)

    # Train the model
    model_rfc.fit(X_train_cv_preprocessed, y_train_cv, sample_weight=train_sample_weights)

    # Make predictions on the validation set
    y_val_pred = model_rfc.predict(X_val_cv_preprocessed)

    # Calculate F1-macro for the current fold
    f1 = f1_score(y_val_cv, y_val_pred, average='macro')

    # Append the F1-macro score to the list
    cv_scores.append(f1)

# Convert scores to a NumPy array for easier calculations
cv_scores = np.array(cv_scores)

# Print cross-validation results
print("Cross-validation scores (F1-macro):", cv_scores)
print("Mean CV score (F1-macro):", cv_scores.mean())
print("Standard deviation (F1-macro):", cv_scores.std())


Cross-validation scores (F1-macro): [0.3025859  0.30392932 0.30001235 0.30329114 0.29982656]
Mean CV score (F1-macro): 0.3019290547625668
Standard deviation (F1-macro): 0.0016959951185972547
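
The summary statistics can be re-derived from the printed fold scores. This sketch assumes the rounded values shown above, so it matches the reported mean and standard deviation only up to rounding. The reported standard deviation corresponds to NumPy's default population formula (`ddof=0`).

```python
import numpy as np

# Fold scores copied from the printed output (rounded to 8 decimals)
cv_scores_reported = np.array(
    [0.3025859, 0.30392932, 0.30001235, 0.30329114, 0.29982656]
)

mean_score = cv_scores_reported.mean()
std_population = cv_scores_reported.std()    # ddof=0, NumPy default
std_sample = cv_scores_reported.std(ddof=1)  # sample estimate, slightly larger

print(mean_score, std_population, std_sample)
```

The small spread across folds suggests the F1-macro estimate is stable with respect to how the data is split.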


**Predictions for test set:**

In [16]:
# Encode target
y_encoded = label_encoder.fit_transform(y)

# Compute weights for each sample
weights = compute_sample_weight('balanced', y_encoded)

# Fit pipeline on X and apply transformations
X_preprocessed = pipeline.fit_transform(X, y_encoded)

# Apply pipeline fitted on X
test_data_preprocessed_X = pipeline.transform(test_transformed_agreement)

# Fit the model on the full training set
model_rfc.fit(X_preprocessed, y_encoded, sample_weight=weights)

# Make predictions on test data
y_pred = model_rfc.predict(test_data_preprocessed_X)

# Get original y for submission
y_pred_categorical = label_encoder.inverse_transform(y_pred) 


In [17]:


submission = pd.DataFrame({
    "Claim Identifier": test_data_preprocessed_X.index,
    "Claim Injury Type": y_pred_categorical
})


# Save to CSV in the required format
submission.to_csv("Group46_versionX.csv", index=False)