<div class="alert alert-block alert-info">

## <center> GROUP PROJECT - TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS </center> <br>
#  <center> <b> Agreement Reached Model </b> </center> <br>
## <center> Fall Semester 2024-2025 </center>
<br>
<center> Group 46: </center> <br>
<center>Afonso Ascensão, 20240684 <br></center>
<center>Duarte Marques, 20240522 <br></center>
<center>Joana Esteves, 20240746 <br></center>
<center>Rita Serra, 20240515 <br></center>
<center>Rodrigo Luís, 20240742 <br></center>

</div>

**Description of contents:**
This notebook develops a model to predict the binary variable "Agreement Reached". The variable is present in the training data for our main model (target "Claim Injury Type") but absent from the test dataset. The final model selected here predicts "Agreement Reached" for the test dataset so that it can be included as a feature in our main model.
- Apply the pipeline to preprocess the data.
- Implement the XGBoost algorithm and tune its hyperparameters with grid search.
- Test the KNN algorithm and oversampling with SMOTE.
- Generate predictions of the target variable "Agreement Reached" for the test set.

**Table of Contents**
- [1. Import the needed Libraries](#importlibraries)
- [2. Import Dataset](#importdataset)
- [3. Split and Pipeline](#section_3)
- [4. Models](#section_4)


<a class="anchor" id="importlibraries">

# 1. Import the needed Libraries

</a>

In [15]:

import pandas as pd
import numpy as np


import warnings
warnings.filterwarnings('ignore')

# Preprocessing
## Pipeline
from sklearn.pipeline import Pipeline
from joblib import load
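from transformers import *  # project-local module; presumably defines the custom transformers needed to unpickle pipeline.joblib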
## Target Encoding
from sklearn.preprocessing import LabelEncoder

# Model Algorithm
from xgboost import XGBClassifier
from sklearn.neighbors import KNeighborsClassifier

# Data Split
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold, cross_val_score

# Evaluation Metrics
from sklearn.metrics import precision_score, recall_score, f1_score, classification_report, confusion_matrix, make_scorer

# Define a seed
random_state = 42
np.random.seed(42)

# Data Oversampling
from imblearn.over_sampling import SMOTE


<a class="anchor" id="importdataset">

# 2. Import Dataset

</a>

In [2]:
#target_data = pd.read_csv('train_data.csv', sep = ',')
train_data = pd.read_parquet('transformed_train_data.parquet')
test_data = pd.read_parquet('transformed_test_data.parquet')
pd.set_option("display.max_columns", None)

In [3]:
pipeline = load('pipeline.joblib') 

In [4]:
train_data_original = train_data.copy()
test_data_original = test_data.copy()

In [5]:
train_data = train_data.drop(columns=["Claim Injury Type"])

<a class="anchor" id="section_3">

# 3. Split and Pipeline

</a>

In [6]:
X = train_data.drop(['Agreement Reached'], axis = 1)
y = train_data['Agreement Reached']

In [7]:
# Stratified split to deal with class imbalance in the target
X_train, X_val, y_train, y_val = train_test_split(X,y, test_size = 0.1,
                                                  random_state = 0,
                                                  stratify = y,
                                                  shuffle = True)
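As a quick sanity check on the stratified split, the class counts printed later in this notebook can be used to confirm that the positive rate is nearly identical in both partitions. The sketch below only re-uses those reported counts (training supports and the validation confusion matrix from the KNN cell); the variable names are illustrative and not part of the original code.

```python
# Class counts copied from the reports printed later in this notebook.
train_neg, train_pos = 492515, 24108          # training support per class
val_neg, val_pos = 54490 + 234, 2382 + 297    # rows of the validation confusion matrix

train_total = train_neg + train_pos
val_total = val_neg + val_pos

train_rate = train_pos / train_total
val_rate = val_pos / val_total
val_fraction = val_total / (train_total + val_total)

print(f"Positive rate (train):      {train_rate:.4%}")
print(f"Positive rate (validation): {val_rate:.4%}")
print(f"Validation share of data:   {val_fraction:.4f}")
```

Both partitions hold roughly 4.67% positives, and the validation set is about 10% of the rows, which is what `stratify = y` with `test_size = 0.1` is expected to produce.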

In [8]:
# To use for feature selection 

# Target encoding
label_encoder = LabelEncoder()

y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)

# Preprocessing pipeline

# Apply preprocessing to the training and validation sets
X_train_preprocessed = pipeline.fit_transform(X_train,y_train_encoded)
X_val_preprocessed = pipeline.transform(X_val)
test_data_preprocessed = pipeline.transform(test_data)


In [9]:
print("Selected features:", X_train_preprocessed.columns.values)

Selected features: ['Attorney/Representative' 'Hearing Held' 'Carrier Type_3A. SELF PUBLIC'
 'Time Accident to Assembly' 'Average Weekly Wage Log' 'Assembly Year'
 'C-3 Delivered']


<a class="anchor" id="section_4">

# 4. Models

</a>

In [10]:
def metrics(y_train, pred_train , y_val, pred_val):
    print('___________________________________________________________________________________________________________')
    print('                                                     TRAIN                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_train, pred_train))
    print(confusion_matrix(y_train, pred_train))


    print('___________________________________________________________________________________________________________')
    print('                                                VALIDATION                                                 ')
    print('-----------------------------------------------------------------------------------------------------------')
    print(classification_report(y_val, pred_val))
    print(confusion_matrix(y_val, pred_val))

In [22]:
model_xgbc = XGBClassifier(scale_pos_weight = 3.7, subsample = 0.95)

# Fit to train data
model_xgbc.fit(X_train_preprocessed, y_train_encoded)

# Make predictions on validation data

labels_train = model_xgbc.predict(X_train_preprocessed)
y_pred = model_xgbc.predict(X_val_preprocessed)

# Get scores
print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

metrics(y_train, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , model_xgbc.score(X_train_preprocessed, y_train_encoded))
print("Score on validation:", model_xgbc.score(X_val_preprocessed, y_val_encoded))

Precision: 0.9457639152181322
Recall: 0.9251781265787502
F1 Score: 0.9339566246563661
F1 Score Macro: 0.6799950168467361
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       0.98      0.95      0.96    492515
         1.0       0.34      0.57      0.43     24108

    accuracy                           0.93    516623
   macro avg       0.66      0.76      0.69    516623
weighted avg       0.95      0.93      0.94    516623

[[465905  26610]
 [ 10370  13738]]
___________________________________________________________________________________________________________
                                                VALIDATION                           
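The gap between the weighted and macro F1 scores in the output above comes from how each average treats the minority class. As a minimal sketch, the per-class, macro, and weighted F1 values can be recomputed by hand from the training confusion matrix printed above; the helper `f1_per_class` is a hypothetical name introduced only for this illustration.

```python
def f1_per_class(cm):
    """Return the F1 score of each class from a square confusion matrix
    laid out like sklearn's (rows = true labels, columns = predictions)."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                  # true class k, predicted otherwise
        fp = sum(row[k] for row in cm) - tp   # predicted k, true class differs
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


# Training confusion matrix printed above.
cm_train = [[465905, 26610],
            [10370, 13738]]

f1_scores = f1_per_class(cm_train)
support = [sum(row) for row in cm_train]

# Macro: every class counts equally. Weighted: classes count by support.
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / sum(support)

print("Per-class F1:", [round(f, 2) for f in f1_scores])
print("Macro F1:    ", round(macro_f1, 2))
print("Weighted F1: ", round(weighted_f1, 2))
```

Because about 95% of the rows belong to class 0, the weighted average is dominated by that class, while the macro average exposes the much weaker performance on class 1. This is why F1 macro is used as the selection metric in the rest of the notebook.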

**Hyperparameter tuning - Grid Search:**

In [26]:
'''full_pipeline = Pipeline(
    pipeline.steps + [('model', XGBClassifier(random_state = random_state))]
)

# Parameter grid
param_grid = {
    'model__subsample': [0.95, 0.85],  # subsample must lie in (0, 1]
    'model__scale_pos_weight': [3.7, 3.9]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=full_pipeline,
    param_grid=param_grid,
    scoring='f1_macro',
    cv=3,
    n_jobs=1
)

# Encode target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Fit GridSearchCV
grid_search.fit(X, y_encoded)

# Display best parameters and best cv score 
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)'''

Best parameters: {'model__scale_pos_weight': 3.9, 'model__subsample': 0.95}
Best score: 0.42102312896340033
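For context on the `scale_pos_weight` values searched above: the XGBoost documentation mentions the ratio of negative to positive instances as a typical value to consider. The sketch below computes that ratio from the training supports reported in this notebook. The comparison with the searched values is an observation only, not a recommendation to change them.

```python
# Training supports from the classification reports in this notebook.
n_negative = 492515
n_positive = 24108

# Starting-point heuristic: weight positives by the class-count ratio.
ratio = n_negative / n_positive
print(f"negative / positive ratio: {ratio:.2f}")

# Values actually searched in the grid above.
for searched in (3.7, 3.9):
    print(f"scale_pos_weight={searched} is {searched / ratio:.0%} of that ratio")
```

The searched values are far below the raw class ratio, so the positive class is up-weighted only moderately.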


In [11]:
'''full_pipeline = Pipeline(
    pipeline.steps + [('model', XGBClassifier(random_state = random_state, subsample = 0.95, scale_pos_weight = 3.9))]
)

# Parameter grid
param_grid = {
    'model__n_estimators': [2800, 2900],  
    'model__max_depth': [8,9],
    'model__learning_rate': [0.001, 0.01],
    'model__gamma':[1,1.35]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=full_pipeline,
    param_grid=param_grid,
    scoring='f1_macro',
    cv=3,
    n_jobs=1
)

# Encode target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Fit GridSearchCV
grid_search.fit(X, y_encoded)

# Display best parameters and best cv score 
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_) '''

Best parameters: {'model__gamma': 1.35, 'model__learning_rate': 0.01, 'model__max_depth': 9, 'model__n_estimators': 2900}
Best score: 0.603664268318609


**XGBoost Algorithm:**

In [23]:
model_xgbc = XGBClassifier(n_estimators=2900, random_state=random_state, learning_rate = 0.01, max_depth = 9, 
                           gamma = 1.35, scale_pos_weight = 3.9, subsample = 0.95)

# Fit to train data
model_xgbc.fit(X_train_preprocessed, y_train_encoded)

# Make predictions on validation data

labels_train = model_xgbc.predict(X_train_preprocessed)
y_pred = model_xgbc.predict(X_val_preprocessed)

# Get scores
print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

metrics(y_train, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , model_xgbc.score(X_train_preprocessed, y_train_encoded))
print("Score on validation:", model_xgbc.score(X_val_preprocessed, y_val_encoded))

Precision: 0.9460405449372886
Recall: 0.9256833266554013
F1 Score: 0.9343564928738122
F1 Score Macro: 0.6816157673210732
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       0.98      0.95      0.96    492515
         1.0       0.35      0.58      0.43     24108

    accuracy                           0.93    516623
   macro avg       0.66      0.76      0.70    516623
weighted avg       0.95      0.93      0.94    516623

[[466533  25982]
 [ 10242  13866]]
___________________________________________________________________________________________________________
                                                VALIDATION                           

In [24]:
model_xgbc = XGBClassifier(n_estimators=2900, random_state=random_state, learning_rate = 0.0009, max_depth = 9, 
                           gamma = 1.35, scale_pos_weight = 3.9, subsample = 0.95)

# Fit to train data
model_xgbc.fit(X_train_preprocessed, y_train_encoded)

# Make predictions on validation data

labels_train = model_xgbc.predict(X_train_preprocessed)
y_pred = model_xgbc.predict(X_val_preprocessed)

# Get scores
print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

metrics(y_train, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , model_xgbc.score(X_train_preprocessed, y_train_encoded))
print("Score on validation:", model_xgbc.score(X_val_preprocessed, y_val_encoded))

Precision: 0.9447088402085785
Recall: 0.9329651760360956
F1 Score: 0.9382135465267801
F1 Score Macro: 0.6829318248560607
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       0.97      0.96      0.96    492515
         1.0       0.35      0.49      0.40     24108

    accuracy                           0.93    516623
   macro avg       0.66      0.72      0.68    516623
weighted avg       0.95      0.93      0.94    516623

[[470404  22111]
 [ 12393  11715]]
___________________________________________________________________________________________________________
                                                VALIDATION                           

The final hyperparameters for XGBoost were chosen through a combination of GridSearchCV and trial and error.

**K-Nearest Neighbors Algorithm:**

In [114]:
"""modelKNN = KNeighborsClassifier(n_neighbors=13, algorithm="kd_tree")
modelKNN.fit(X = X_train_preprocessed, y = y_train_encoded)
labels_train = modelKNN.predict(X_train_preprocessed)
y_pred = modelKNN.predict(X_val_preprocessed)


print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

print(confusion_matrix(y_val_encoded, y_pred))
metrics(y_train_encoded, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , modelKNN.score(X_train_preprocessed, y_train_encoded))
print("Score on validation:", modelKNN.score(X_val_preprocessed, y_val_encoded))"""

Precision: 0.9395047218227405
Recall: 0.954427468947616
F1 Score: 0.9396184271408513
F1 Score Macro: 0.5808025142762964
[[54490   234]
 [ 2382   297]]
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.96      1.00      0.98    492515
           1       0.68      0.14      0.23     24108

    accuracy                           0.96    516623
   macro avg       0.82      0.57      0.60    516623
weighted avg       0.95      0.96      0.94    516623

[[490921   1594]
 [ 20740   3368]]
___________________________________________________________________________________________________________
                                                VALIDAT

**SMOTE oversampling:**

In [37]:
"""oversample = SMOTE(sampling_strategy=0.2, random_state=42)
X_train_smote, y_train_smote = oversample.fit_resample(X_train_preprocessed, y_train_encoded)"""

In [None]:
"""model_xgbc = XGBClassifier(n_estimators=2900, random_state=random_state, learning_rate = 0.0009, max_depth = 9, 
                           gamma = 1.35, subsample = 0.95)

# Fit to train data
model_xgbc.fit(X_train_smote, y_train_smote)

# Make predictions on validation data

labels_train = model_xgbc.predict(X_train_smote)
y_pred = model_xgbc.predict(X_val_preprocessed)

# Get scores
print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

metrics(y_train_smote, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , model_xgbc.score(X_train_smote, y_train_smote))
print("Score on validation:", model_xgbc.score(X_val_preprocessed, y_val_encoded))"""

Precision: 0.9446807420403542
Recall: 0.9305611204989286
F1 Score: 0.9367905062334977
F1 Score Macro: 0.680701738022086
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

           0       0.91      0.95      0.93    492515
           1       0.70      0.55      0.61     98503

    accuracy                           0.89    591018
   macro avg       0.80      0.75      0.77    591018
weighted avg       0.88      0.89      0.88    591018

[[468807  23708]
 [ 44257  54246]]
___________________________________________________________________________________________________________
                                                VALIDATION                            

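The model trained on the SMOTE-oversampled data reaches a validation F1 macro score (0.681) very close to that of the model trained on the imbalanced data with `scale_pos_weight` (0.683). However, its training F1 macro (0.77, measured on the oversampled training set) is much further from its validation score, which suggests overfitting, so we opted to keep the previous model.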
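The class supports in the SMOTE training report follow directly from `sampling_strategy=0.2`, which requests a minority-to-majority ratio of 0.2 after resampling. The sketch below reproduces those counts and illustrates the interpolation step at the heart of SMOTE. It is a simplified illustration, not the `imblearn` implementation: the `interpolate` helper and the two example points are hypothetical.

```python
import random

# Training supports before oversampling (from the reports in this notebook).
n_majority, n_minority = 492515, 24108

# sampling_strategy=0.2 asks for minority / majority = 0.2 after resampling.
# Integer arithmetic (2/10) avoids floating-point rounding in this sketch.
target_minority = n_majority * 2 // 10
n_synthetic = target_minority - n_minority
print("Minority rows after SMOTE:", target_minority)
print("Synthetic rows created:   ", n_synthetic)
print("Total rows after SMOTE:   ", n_majority + target_minority)


def interpolate(point, neighbor, rng):
    """Create one synthetic sample on the segment between a minority
    point and one of its minority-class neighbors."""
    u = rng.random()  # uniform in [0, 1)
    return [p + u * (q - p) for p, q in zip(point, neighbor)]


rng = random.Random(42)
synthetic = interpolate([1.0, 10.0], [3.0, 14.0], rng)
print("Example synthetic point:", synthetic)
```

The resulting 98503 minority rows and 591018 total rows match the supports in the training report above. About three quarters of the minority rows are therefore synthetic, so training metrics on the oversampled set are not directly comparable to validation metrics on real data.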

In [39]:
model_xgbc = XGBClassifier(n_estimators=2900, random_state=random_state, learning_rate = 0.0009, max_depth = 9, 
                           gamma = 1.35, scale_pos_weight = 3.9, subsample = 0.95)

# Fit to train data
model_xgbc.fit(X_train_preprocessed, y_train_encoded)

# Make predictions on validation data

labels_train = model_xgbc.predict(X_train_preprocessed)
y_pred = model_xgbc.predict(X_val_preprocessed)

# Get scores
print("Precision:", precision_score(y_val_encoded, y_pred, average='weighted'))
print("Recall:", recall_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score:", f1_score(y_val_encoded, y_pred, average='weighted'))
print("F1 Score Macro:", f1_score(y_val_encoded, y_pred, average='macro'))

metrics(y_train, labels_train, y_val_encoded, y_pred)  # prints the reports itself and returns None

print("Score on training:" , model_xgbc.score(X_train_preprocessed, y_train_encoded))
print("Score on validation:", model_xgbc.score(X_val_preprocessed, y_val_encoded))

Precision: 0.9456232400254032
Recall: 0.9300036583453827
F1 Score: 0.9368080200423585
F1 Score Macro: 0.684216718673517
___________________________________________________________________________________________________________
                                                     TRAIN                                                 
-----------------------------------------------------------------------------------------------------------
              precision    recall  f1-score   support

         0.0       0.98      0.95      0.96    492515
         1.0       0.34      0.51      0.41     24108

    accuracy                           0.93    516623
   macro avg       0.66      0.73      0.68    516623
weighted avg       0.95      0.93      0.94    516623

[[468208  24307]
 [ 11758  12350]]
___________________________________________________________________________________________________________
                                                VALIDATION                            

**Assessment of final model:**

In [16]:
full_pipeline = Pipeline(
    pipeline.steps + [('model', model_xgbc)]
)

# Cross-validation
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)

# Encode target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Use F1-macro because of class imbalance
scorer = make_scorer(f1_score, average='macro')

# Get scores for cross-validation
# Apply preprocessing inside cv for X and y_encoded
cv_scores = cross_val_score(full_pipeline, X, y_encoded, cv=cv, scoring=scorer)

# Print results
print("Cross-validation scores (F1-macro):", cv_scores)
print("Mean CV score:", cv_scores.mean())

Cross-validation scores (F1-macro): [0.66375158 0.68438313 0.66129182 0.65716003 0.67542338]
Mean CV score: 0.6684019871628759


Final F1 Macro Score: 0.668
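To complement the mean score, the spread across folds gives a rough idea of how stable the estimate is. A minimal sketch using the fold scores printed above and the standard library only:

```python
from statistics import mean, stdev

# Fold scores printed by the cross-validation cell above.
cv_scores = [0.66375158, 0.68438313, 0.66129182, 0.65716003, 0.67542338]

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)  # sample standard deviation across the 5 folds

print(f"F1-macro: {cv_mean:.3f} +/- {cv_std:.3f}")
print(f"Range:    {min(cv_scores):.3f} to {max(cv_scores):.3f}")
```

The folds vary by roughly one hundredth around the mean, so the single hold-out validation scores reported earlier (about 0.68) sit within the range observed across folds.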

**Final predictions:**

In [96]:
y_pred = model_xgbc.predict(test_data_preprocessed)

# Get original y for submission
y_pred_categorical = label_encoder.inverse_transform(y_pred) 

test_data_original.insert(22, "Agreement Reached", y_pred_categorical.tolist())


In [97]:
test_data_original["Agreement Reached"] = test_data_original["Agreement Reached"].astype('bool')

# Save to Parquet so the main model can load the test data with the predicted feature
test_data_original.to_parquet("test_transformed_agreement.parquet", index=True)