### **Group 31** <br>
* Ana Margarida Valente, nr 20240936
* Eduardo Mendes, nr 20240850
* Julia Karpienia, nr 20240514
* Marta Boavida, nr 20240519
* Victoria Goon, nr 20240550

## 0. Import Packages

In [None]:
## Import standard data processing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Setting seaborn style
sns.set()

from sklearn.preprocessing import LabelEncoder

## Import Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.svm import SVC

# Import evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score
import xgboost as xgb

# from imblearn.over_sampling import SMOTE

pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_colwidth', None)  # Show full cell contents without truncation

## Suppress warnings
import warnings
warnings.filterwarnings('ignore')


In [None]:
# import sys
# !{sys.executable} -m pip install imbalanced-learn

<a class="anchor" id="importdatasets">

## 1. Import Datasets

</a>

In [3]:
train_data = pd.read_csv("train_encoded.csv", low_memory=False)
validation_data = pd.read_csv("validation_encoded.csv", low_memory=False)
test_data = pd.read_csv("test_encoded.csv")

In [4]:
train_data = train_data.set_index("Claim Identifier")
validation_data = validation_data.set_index("Claim Identifier")
test_data = test_data.set_index("Claim Identifier")

In [5]:
X_train = train_data.drop('Claim Injury Type', axis = 1)
y_train = train_data['Claim Injury Type']

X_val = validation_data.drop('Claim Injury Type', axis = 1)
y_val = validation_data['Claim Injury Type']

In [None]:
X_train.head()

In [None]:
X_val.head()

In [None]:
y_val.head()

### 1.1 Encode Target Variable
Apply a label encoder to the target variable (training and validation):
<br/> <br/>
(This needs to be done both in the preprocessing notebook and here, so that the results can be interpreted properly when a model is tested.)

In [6]:
#Initiate Label encoder
label_encoder = LabelEncoder()

#Fit the encoder on the training target variable
Y_train_encoded = label_encoder.fit_transform(y_train)

#Transform the training and validation target variable
Y_val_encoded = label_encoder.transform(y_val)

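# Keep the original (unencoded) class names, taken from the training labels,
# so that print_scores can display them next to each per-class score
y_val_unencoded = y_train.copy()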

#Convert the results back to DataFrames while overriding the previous variable names
y_train = pd.DataFrame(Y_train_encoded, columns=['encoded_target'], index=pd.Series(y_train.index))
y_val = pd.DataFrame(Y_val_encoded, columns=['encoded_target'], index=pd.Series(y_val.index))
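
To illustrate the encoding step above, here is a minimal sketch on made-up labels (the label strings are placeholders, not values from the dataset). It shows that the encoder learns its mapping from the training labels only, that `classes_` lists the original names in encoded order, and that `inverse_transform` recovers the names from the integers.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
toy_train = np.array(["b_medium", "a_low", "c_high", "a_low"])
toy_val = np.array(["c_high", "a_low"])

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(toy_train)  # learn the mapping on the training labels
val_encoded = encoder.transform(toy_val)          # reuse the same mapping for validation

# Map the integers back to the original names
decoded = encoder.inverse_transform(val_encoded)

print(list(encoder.classes_))   # names in encoded order (sorted alphabetically)
print(train_encoded.tolist(), val_encoded.tolist(), list(decoded))
```

Because `classes_` is sorted, position `i` in any per-class metric array corresponds to `encoder.classes_[i]`.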

### Undersampling

In [None]:
# # add the encoded variables back to the x set
# training_data_undersampled = pd.concat([X_train, y_train], axis=1)

# # Separate majority and minority classes
# majority_classes = {}
# for x in range(0,8):
#     if x != 6:
#         majority_classes[x] = training_data_undersampled[training_data_undersampled["encoded_target"] == x]

# minority_class = training_data_undersampled[training_data_undersampled["encoded_target"] == 6]

# size = int(len(minority_class) + (len(minority_class) * 2))

# print(size)

# # Perform undersampling
# undersampled_majority_0 = majority_classes[0].sample(n=size, random_state=42)
# undersampled_majority_1 = majority_classes[1].sample(n=size, random_state=42)
# undersampled_majority_2 = majority_classes[2].sample(n=size, random_state=42)
# undersampled_majority_3 = majority_classes[3].sample(n=size, random_state=42)
# undersampled_majority_4 = majority_classes[4].sample(n=size, random_state=42)
# undersampled_majority_5 = majority_classes[5].sample(n=size, random_state=42)
# undersampled_majority_7 = majority_classes[7].sample(n=size, random_state=42)
# # undersampled_majority.head()
# balanced_data = pd.concat([undersampled_majority_0, undersampled_majority_1, undersampled_majority_2, 
#                            undersampled_majority_3, undersampled_majority_4, undersampled_majority_5, 
#                            minority_class, undersampled_majority_7])

# # Separate features and target
# X_train = balanced_data.drop(columns='encoded_target')
# y_train = balanced_data['encoded_target']

# # Check class distribution after undersampling
# print("Class distribution after undersampling:", y_train.value_counts())
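
The per-class sampling in the commented cell above could be written more compactly. The sketch below is one possible approach, not the code that was run: the helper name `undersample` and the toy data are assumptions, and it takes the smallest class as the minority rather than hard-coding class 6. It caps every class at three times the minority size, mirroring the `size` computed above, and keeps classes that are already smaller than the cap whole.

```python
import pandas as pd

def undersample(df, target_col, cap, random_state=42):
    """Randomly keep at most `cap` rows per class; smaller classes are kept whole."""
    parts = [
        group.sample(n=min(len(group), cap), random_state=random_state)
        for _, group in df.groupby(target_col)
    ]
    return pd.concat(parts)

# Toy imbalanced data: 12 rows of class 0, 6 of class 1, 2 of class 2
toy = pd.DataFrame({
    "feature": range(20),
    "encoded_target": [0] * 12 + [1] * 6 + [2] * 2,
})

minority_size = toy["encoded_target"].value_counts().min()
balanced = undersample(toy, "encoded_target", cap=3 * minority_size)

counts = balanced["encoded_target"].value_counts().to_dict()
print(counts)
```

Using `min(len(group), cap)` avoids the error that `sample(n=size)` raises when a class has fewer than `size` rows.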

<a class="anchor" id="model">

## 2. Model
</a>

Type of Problem <br/>
This is a multiclass classification problem in which the output is one of 8 different classes. We start with a simple Logistic Regression model configured to handle multiple classes.<br/>
<br/>
Metrics used:<br/>
Because this is a classification problem, we observed the following metrics to determine the effectiveness of our model:
 - accuracy
 - precision
 - recall
 - F1 score

Each metric measures performance in a different way, and observing them all gives us a more complete view of our model's results.
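
To see why looking at several metrics matters, consider the toy sketch below (the labels are invented for illustration). A model that always predicts the majority class reaches high accuracy but a low macro-averaged F1, because macro averaging gives every class the same weight regardless of how rare it is.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy labels: class 0 dominates, classes 1 and 2 are rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 2]
# A model that always predicts the majority class
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores classes that are never predicted as 0 without a warning
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
weighted_f1 = f1_score(y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy:    {acc:.3f}")          # 0.800
print(f"macro F1:    {macro_f1:.3f}")     # 0.296
print(f"weighted F1: {weighted_f1:.3f}")  # 0.711
```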

In [7]:
# Functions to help display metrics for all models

# Helper for score_model - not meant to be used separately
def print_scores(per_class):
    # Pair each per-class score with its original (unencoded) class name
    for score, label in zip(per_class, np.unique(y_val_unencoded)):
        if str(label) == "7. PTD": # add an extra tab for better alignment
            print("["+str(label)+"]:     \t\t" + str(round(score,2)))
        else:
            print("["+str(label)+"]:     \t" + str(round(score,2)))

# Displays accuracy plus per-class and macro-averaged F1, precision, and recall
def score_model(y_actual, y_predicted, score_train, score_test):

    print("------------ F1 ------------")
    f1_per_class = f1_score(y_actual, y_predicted, average=None)
    print_scores(f1_per_class)
    f1_macro = f1_score(y_actual, y_predicted, average='macro')
    print("\nMacro f1: " + str(round(f1_macro, 3)) + "\n")

    # score_train and score_test are accuracies on the training and validation sets
    print("------ Individual Score Comparisons ------ ")
    print("Train Score: " + str(score_train))
    print("Validation Score: " + str(score_test))
    diff = np.abs(score_train - score_test)
    print("Difference: " + str(diff))

    print("--------- Accuracy ---------")
    acc_score = accuracy_score(y_actual, y_predicted)
    print("Accuracy Score: " + str(acc_score) + "\n")

    print("--------- Precision ---------")
    precision_per_class = precision_score(y_actual, y_predicted, average=None)
    print_scores(precision_per_class)
    precision_macro = precision_score(y_actual, y_predicted, average='macro')
    print("\nMacro precision: " + str(round(precision_macro, 3)) + "\n")

    print("---------- Recall ----------")
    recall_per_class = recall_score(y_actual, y_predicted, average=None)
    print_scores(recall_per_class)
    recall_macro = recall_score(y_actual, y_predicted, average='macro')
    print("\nMacro recall: " + str(round(recall_macro, 3)) + "\n")


#### Logistic Regression

In [None]:
# 0.379

# Create the model
lr_model = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)

# Fit the model to the training set
lr_model.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = lr_model.score(X_train, y_train)
score_test = lr_model.score(X_val, y_val)

# Use the model to predict on the validation set
lr_y_pred = lr_model.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, lr_y_pred, score_train, score_test)

#### Decision Tree

Grid Search - Decision Tree:

In [None]:
# # Create a DecisionTreeClassifier
# dt_classifier = DecisionTreeClassifier(random_state=42)

# # Define the parameter grid to search
# param_grid = {
#     'criterion': ['gini', 'entropy'],                          # Split criterion
#     'splitter': ['best', 'random'],                             # Splitting strategy
#     'max_depth': [None, 10, 20, 30],                            # Max depth of the tree
#     'min_samples_split': [2, 5, 10],                            # Minimum samples to split an internal node
#     'min_samples_leaf': [1, 2, 4],                              # Minimum samples at a leaf node
#     'max_features': [None, 'sqrt', 'log2'],                     # Max features to consider for splits
#     'max_leaf_nodes': [None, 10, 20, 30],                       # Max number of leaf nodes
#     'min_impurity_decrease': [0.0, 0.1, 0.2]                   # Minimum impurity decrease to split
# }

# # Set up GridSearchCV with 5-fold cross-validation and scoring based on accuracy
# grid_search = GridSearchCV(estimator=dt_classifier, param_grid=param_grid, cv=5, scoring='accuracy')

# # Fit GridSearchCV on the training data
# grid_search.fit(X_train, y_train)

# # Print the best parameters and the best score
# print("Best Parameters:", grid_search.best_params_)
# print("Best Score:", grid_search.best_score_)

# # You can also access the best model found
# best_model = grid_search.best_estimator_

# #Best Parameters: {'criterion': 'gini', 'max_depth': 10, 'max_features': None, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'splitter': 'best'}
# #Best Score: 0.7769977245887005


Model - Decision Tree:

In [None]:
# 0.366
# Initialize the Decision Tree Classifier
decision_tree = DecisionTreeClassifier(
    criterion='gini',            # 'gini' for Gini impurity or 'entropy' for information gain
    max_depth=10,                # Maximum depth of the tree (None means no limit)
    max_features=None,           # Consider all features at each split
    max_leaf_nodes=None,         # No limit on the number of leaf nodes
    min_impurity_decrease=0.0,
    min_samples_leaf=1,
    min_samples_split=2,
    splitter='best',             # Choose the best split at each node
    random_state=42              # Random seed for reproducibility
)

# Train the model
decision_tree.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = decision_tree.score(X_train, y_train)
score_test = decision_tree.score(X_val, y_val)

# Make predictions
dt_y_pred = decision_tree.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, dt_y_pred, score_train, score_test)

#### K Nearest Neighbors

Grid Search - KNN 

In [None]:
# # Define the parameter grid for KNN
# param_grid = {
#     'n_neighbors': [3, 5, 10, 15],                 
#     'algorithm': ['brute', 'kd_tree'],             
#     'metric': ['euclidean', 'manhattan', 'minkowski'], 
#     'weights': ['uniform', 'distance']             
# }

# # Set up the GridSearchCV with KNN classifier
# grid_search = GridSearchCV(
#     KNeighborsClassifier(),
#     param_grid,
#     cv=5,                                         
#     scoring='f1_macro'                                                      
# )

# # Fit the grid search to the training data
# grid_search.fit(X_train, y_train)

# # Print the best parameters and the best score
# print("Best Parameters:", grid_search.best_params_)
# print("Best Score:", grid_search.best_score_)


Model - KNN<br/>
KNN takes too long to run on our large dataset, so this cell is commented out.

In [None]:
# # 0.334

# # Create the KNN model
# # n_neighbors specifies the number of neighbors to use for classification
# knn_model = KNeighborsClassifier(n_neighbors=5)  

# # Fit the model to the training set
# knn_model.fit(X_train, y_train)

# # Determine the scores for the model for both train and validation sets
# score_train = knn_model.score(X_train, y_train)
# score_test = knn_model.score(X_val, y_val)

# # Use the model to predict on the validation set
# knn_y_pred = knn_model.predict(X_val)

# # Display the model metrics using the score_model function
# score_model(y_val, knn_y_pred, score_train, score_test)

#### Neural Network

Grid Search - MLPClassifier:

In [None]:
# # Define the parameter grid
# param_grid = {
#     'hidden_layer_sizes': [(64,), (64, 32), (128, 64)],  # Different hidden layer architectures
#     'activation': ['relu', 'tanh'],                     # Activation functions
#     'solver': ['adam', 'sgd'],                          # Optimizers
#     'alpha': [0.0001, 0.001, 0.01],                     # L2 regularization (alpha)
#     'learning_rate_init': [0.001, 0.01, 0.1],           # Learning rates
# }

# # Create the MLPClassifier model
# mlp = MLPClassifier(max_iter=200, random_state=42)  # Keeping max_iter constant at 200

# # Create the GridSearchCV object
# grid_search = GridSearchCV(estimator=mlp,
#                            param_grid=param_grid,
#                            cv=5,  # 5-fold cross-validation
#                            scoring='f1_macro',  # Evaluation metric
#                            verbose=2,           # Display progress logs
#                            n_jobs=-1)           # Use all available processors

# # Fit the grid search to the training data
# grid_search.fit(X_train, y_train)

# # Print the best parameters and best score
# print("Best Parameters:", grid_search.best_params_)
# print("Best Score:", grid_search.best_score_)


Model - MLPClassifier:

In [None]:
#f1 score of 0.415

# Create the model
mlpc_model = MLPClassifier(hidden_layer_sizes=(64, 32),  # Two hidden layers: 64 and 32 neurons
                      activation='relu',           # ReLU activation function
                      solver='adam',               # Adam optimizer
                      alpha=0.0001,                # Regularization term (L2 penalty)
                      learning_rate_init=0.001,    # Initial learning rate
                      max_iter=200,                # Maximum number of iterations
                      random_state=42)             # For reproducibility

# Fit the model to the training set
mlpc_model.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = mlpc_model.score(X_train, y_train)  # Accuracy on training data
score_test = mlpc_model.score(X_val, y_val)      # Accuracy on validation data

# Use the model to predict on the validation set
mplc_y_pred = mlpc_model.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, mplc_y_pred, score_train, score_test)


<a href="https://scikit-learn.org/1.5/modules/generated/sklearn.ensemble.RandomForestClassifier.html">Random Forest</a> -> (overfits) <br/>
Fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. <br/>

In [None]:
# 0.379

# Initialize the Random Forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model
rf_model.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = rf_model.score(X_train, y_train)  # Accuracy on training data
score_test = rf_model.score(X_val, y_val)      # Accuracy on validation data

# Use the model to predict on the validation set
rf_y_pred = rf_model.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, rf_y_pred, score_train, score_test)


In [None]:
# pos_weight = np.sum(y_train == 6) / np.sum(y_train != 6)

# score = 0
# score_settings = ""

# for x in range(1,20):
#     for y in range(50, 151, 10):
#         for z in np.arange(0, 1.1, 0.1):
#             xgb_model = xgb.XGBClassifier(
#                 n_estimators=y,  # Number of trees
#                 learning_rate=z,  # Step size shrinkage
#                 max_depth=x,       # Maximum depth of a tree
#                 random_state=42,   # For reproducibility
#                 use_label_encoder=False,  # Avoid warning for encoding
#                 eval_metric='mlogloss',    # Evaluation metric for multi-class classification
#                 scale_pos_weight = pos_weight
#             )
#             xgb_model.fit(X_train, y_train)
#             xgb_y_pred = xgb_model.predict(X_val)
#             f1 = f1_score(y_val, xgb_y_pred, average="macro")

#             if f1 > score:
#                 score = f1
#                 score_settings = "max_depth: " + str(x) + " | n_estimators: " + str(y) + " | lr: " + str(z)

# print(score)
# print(score_settings)

<a href="https://xgboost.readthedocs.io/en/stable/tutorials/index.html">XGBoost</a> -> (tends to overfit): <br/>
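Like Random Forest, XGBoost is built from decision trees, but it adds them sequentially through gradient boosting instead of averaging independently grown trees.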


In [None]:
# 0.442
# max_depth = 19, n_estimators = 150, lr = 0.6 -> overfitting
xgb_model = xgb.XGBClassifier(
    n_estimators=110,  # Number of trees
    learning_rate=0.2,  # Step size shrinkage
    max_depth=7,       # Maximum depth of a tree
    random_state=42,   # For reproducibility
    use_label_encoder=False,  # Avoid warning for encoding
    eval_metric='mlogloss'    # Evaluation metric for multi-class classification
)

# Train the model
xgb_model.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = xgb_model.score(X_train, y_train)  # Accuracy on training data
score_test = xgb_model.score(X_val, y_val)      # Accuracy on validation data

# Use the model to predict on the validation set
xgb_y_pred = xgb_model.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, xgb_y_pred, score_train, score_test)

Gradient Boosted Decision Trees

In [None]:
# max_depth = 6 -> F1 of 0.402, about 16 min to run
gbdt_model = GradientBoostingClassifier(
    n_estimators=100,       # Number of boosting stages
    learning_rate=0.1,      # Shrinks contribution of each tree
    max_depth=6,            # Limits depth of each tree to prevent overfitting
    random_state=42         # For reproducibility
)

# Train the model
gbdt_model.fit(X_train, y_train)

# Determine the scores for the model for both train and validation sets
score_train = gbdt_model.score(X_train, y_train)  # Accuracy on training data
score_test = gbdt_model.score(X_val, y_val)      # Accuracy on validation data

# Use the model to predict on the validation set
gbdt_y_pred = gbdt_model.predict(X_val)

# Display the model metrics using the score_model function
score_model(y_val, gbdt_y_pred, score_train, score_test)

Bagging Methods

In [None]:
# 0.4 f1 macro score
bagging_model = BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=10, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_y_pred = bagging_model.predict(X_val)

score_train = bagging_model.score(X_train, y_train)
score_test = bagging_model.score(X_val, y_val)

score_model(y_val, bagging_y_pred, score_train, score_test)

In [None]:
# 0.425 f1 macro score
bagging_model = BaggingClassifier(estimator=xgb.XGBClassifier(), n_estimators=10, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_y_pred = bagging_model.predict(X_val)

score_train = bagging_model.score(X_train, y_train)
score_test = bagging_model.score(X_val, y_val)

score_model(y_val, bagging_y_pred, score_train, score_test)

In [None]:
# (hidden_layer_sizes=(13,), max_iter=500, random_state=42) - 0.395 (no overfit) 12m 58s
# (hidden_layer_sizes=(15,), max_iter=500, random_state=42) - 0.407 (no overfit) 11m 34s
# (hidden_layer_sizes=(20,), max_iter=500, random_state=42) - 0.407 (no overfit) 15m 20s
# (hidden_layer_sizes=(10,), max_iter=500, random_state=42) - 0.389 (no overfit) 4m 6s
# (hidden_layer_sizes=(10,), max_iter=1000, random_state=42) - 0.389 (no overfit) 4m 8s
base_model = MLPClassifier(hidden_layer_sizes=(18,), max_iter=500, random_state=42)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=10, random_state=42)
bagging_model.fit(X_train, y_train)
bagging_y_pred = bagging_model.predict(X_val)

score_train = bagging_model.score(X_train, y_train)
score_test = bagging_model.score(X_val, y_val)

score_model(y_val, bagging_y_pred, score_train, score_test)

In [None]:
# - LR -> XGB -> MLPC
# 0.425 - LR -> MLP -> XGB w/ a 0.0099 difference in scores (37m)
# 0.410 - MLP -> XGB -> GBC w/ a 0.011 difference in scores (93m 35s)

base_models = [
    ('mlpc', MLPClassifier()),
    ('xgb', xgb.XGBClassifier() )
]

stacked_model = StackingClassifier(estimators=base_models, final_estimator=GradientBoostingClassifier())
stacked_model.fit(X_train, y_train)
y_pred = stacked_model.predict(X_val)

score_train = stacked_model.score(X_train, y_train)
score_test = stacked_model.score(X_val, y_val)

score_model(y_val, y_pred, score_train, score_test)

Weighted Averaging

In [None]:
# lr_y_pred_f1   = f1_score(y_val, lr_y_pred, average='macro')
# dt_y_pred_f1   = f1_score(y_val, dt_y_pred, average='macro')
# knn_y_pred_f1  = f1_score(y_val, knn_y_pred, average='macro')
mplc_y_pred_f1 = f1_score(y_val, mplc_y_pred, average='macro')
# rf_y_pred_f1   = f1_score(y_val, rf_y_pred, average='macro')
xgb_y_pred_f1  = f1_score(y_val, xgb_y_pred, average='macro')
gbdt_y_pred_f1 = f1_score(y_val, gbdt_y_pred, average='macro')

# f1_score(y_actual, y_predicted, average='macro')

# Assign weights based on F1 scores
#weights = [lr_y_pred_f1, dt_y_pred_f1, knn_y_pred_f1, mplc_y_pred_f1, rf_y_pred_f1, xgb_y_pred_f1, gbdt_y_pred_f1]
weights = [mplc_y_pred_f1, xgb_y_pred_f1, gbdt_y_pred_f1]
weights = np.array(weights) / np.sum(weights)  # Normalize weights

# Get each model's class-probability matrix (shape: n_samples x n_classes)
# lr_probs    = lr_model.predict_proba(X_val)
# dt_probs    = decision_tree.predict_proba(X_val)
# knn_probs   = knn_model.predict_proba(X_val)
mplc_probs  = mlpc_model.predict_proba(X_val)
# rf_probs    = rf_model.predict_proba(X_val)
xgb_probs   = xgb_model.predict_proba(X_val)
gbdt_probs  = gbdt_model.predict_proba(X_val)

# Aggregate the probability matrices using the normalized weights
# (add the lr/dt/knn/rf terms here if their weights are enabled above)
weighted_probs = (
                  weights[0] * mplc_probs + 
                  weights[1] * xgb_probs + 
                  weights[2] * gbdt_probs)

# Final predictions: pick the class with the highest weighted probability.
# All models were fit on the same encoded labels, so their columns share the order in classes_.
final_predictions = mlpc_model.classes_[np.argmax(weighted_probs, axis=1)]

# Evaluate the ensemble
final_f1 = f1_score(y_val, final_predictions, average='macro')
print(f"Weighted Ensemble F1 Score: {final_f1:.2f}")
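
The idea behind weighted averaging for a multiclass problem can be sketched on a small example. The probability matrices and F1 scores below are invented for illustration; in the notebook they would come from each model's `predict_proba` and its macro F1 on the validation set. Each matrix has one row per sample and one column per class, and the prediction is the class with the highest weighted probability.

```python
import numpy as np

# Hypothetical class-probability matrices from three models
# (rows: samples, columns: classes)
probs_a = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
probs_b = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
probs_c = np.array([[0.2, 0.7, 0.1], [0.2, 0.2, 0.6]])

# Hypothetical per-model F1 scores, normalized so the weights sum to 1
f1_scores = np.array([0.4, 0.4, 0.2])
weights = f1_scores / f1_scores.sum()

# Weighted average of the full probability matrices
weighted = weights[0] * probs_a + weights[1] * probs_b + weights[2] * probs_c

# Predicted class = column with the highest weighted probability
predictions = weighted.argmax(axis=1)

print(weighted.round(2))
print(predictions)  # [0 2]
```

Since the weights sum to 1 and every input row sums to 1, each row of `weighted` is still a valid probability distribution.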


<a class="anchor" id="kaggle">

## 3. Kaggle Submission
</a>

In [None]:
# get the model prediction
# y_pred_test = model.predict(test_data)

In [None]:
# y_pred_test

In [None]:
# # decode the prediction labels back to their original values
# decoded_labels = label_encoder.inverse_transform(y_pred_test)
# decoded_labels

In [None]:
# test_data.shape

In [None]:
# # combine the prediction values with their claim identifiers into a dataframe
# kaggle_submission = pd.DataFrame({"Claim Identifier": test_data.index, "Claim Injury Type":decoded_labels})
# kaggle_submission.head()

In [None]:
# Write the resulting dataframe to a csv file named "Kaggle_Submission.csv".
# The file is saved in the directory the notebook is running from;
# if a file with the same name already exists, it is overwritten with the new output.
# kaggle_submission.to_csv("Kaggle_Submission.csv", index=False)