### **Group 31** <br>
* Ana Margarida Valente, nr 20240936
* Eduardo Mendes, nr 20240850
* Julia Karpienia, nr 20240514
* Marta Boavida, nr 20240519
* Victoria Goon, nr 20240550

## 0. Import Packages

In [1]:
## Import standard data processing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Import datetime object for date columns in dataset
from datetime import datetime

## Setting seaborn style
sns.set()

from math import ceil
from sklearn.impute import KNNImputer

from sklearn.preprocessing import LabelEncoder

## Import Models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LassoCV

## Import cross-validation methods
from sklearn.model_selection import KFold, RepeatedKFold, StratifiedKFold

## Import evaluation metrics
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score


pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_colwidth', None)  # Show the full contents of each cell (no truncation)

## Suppress warnings
import warnings
warnings.filterwarnings('ignore')


<a class="anchor" id="importdatasets">

## 1. Import Datasets

</a>

In [2]:
train_data = pd.read_csv("train_encoded.csv", low_memory=False)
validation_data = pd.read_csv("validation_encoded.csv", low_memory=False)
test_data = pd.read_csv("test_encoded.csv")

In [3]:
train_data = train_data.set_index("Claim Identifier")
validation_data = validation_data.set_index("Claim Identifier")
test_data = test_data.set_index("Claim Identifier")

In [4]:
X_train = train_data.drop('Claim Injury Type', axis = 1)
y_train = train_data['Claim Injury Type']

X_val = validation_data.drop('Claim Injury Type', axis = 1)
y_val = validation_data['Claim Injury Type']

In [5]:
X_train.head()

Claim Identifier,Season_of_Accident,Age_Group,Industry_Avg_Weekly_Wage,COVID_Age,Industry Code_0.0012337365382006041,Industry Code_0.0041962118509063315,Industry Code_0.005244008462779349,Industry Code_0.014149023312846438,Industry Code_0.015433013885189634,Industry Code_0.015887812894180082,...,zip_code_cat_8,zip_code_cat_9,zip_code_cat_Other,First Hearing Date Binary_1,C-2 Date Bin_1,C-3 Date Bin_1,Age at Injury,Average Weekly Wage,IME-4 Count,Days Between Accident_Assembly
5935707,Winter,56-65,407.191537,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.282837,0.597492,-0.420035,-0.118281
5868764,Fall,18-25,815.925522,,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,-1.431733,-0.208755,-0.420035,-0.123253
5986945,Spring,36-45,407.191537,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,-0.257865,1.264815,-0.420035,-0.123253
5665055,Winter,56-65,407.191537,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,1.20947,-0.535991,-0.420035,-0.085134
5595404,Fall,46-55,810.49005,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,0.402436,1.72855,0.243009,-0.116624


In [6]:
X_val.head()

Claim Identifier,Season_of_Accident,Age_Group,Age_at_Assembly,COVID_Age,Industry Code_0.0012337365382006041,Industry Code_0.0041962118509063315,Industry Code_0.005244008462779349,Industry Code_0.014149023312846438,Industry Code_0.015433013885189634,Industry Code_0.015887812894180082,...,zip_code_cat_8,zip_code_cat_9,zip_code_cat_Other,First Hearing Date Binary_1,C-2 Date Bin_1,C-3 Date Bin_1,Age at Injury,Average Weekly Wage,IME-4 Count,Days Between Accident_Assembly
5517094,Summer,36-45,1984.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,-0.477965,-0.535991,-0.420035,-0.13154
6133770,Fall,36-45,1984.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,-0.331231,-0.535991,-0.420035,-0.134855
5741413,Summer,56-65,1964.0,,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.062737,-0.535991,-0.420035,-0.129883
6082466,Summer,36-45,1985.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,-0.404598,-0.535991,-0.420035,0.027567
6086244,Fall,18-25,2001.0,,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,1.0,-1.578466,-0.535991,-0.420035,-0.13154


In [7]:
y_val.head()

Claim Identifier
5517094    2. NON-COMP
6133770    2. NON-COMP
5741413    2. NON-COMP
6082466    2. NON-COMP
6086244    3. MED ONLY
Name: Claim Injury Type, dtype: object

### 1.1 Encode Target Variable
Label Encoder for target variable (training and validation):
<br/> <br/>
(This needs to be done both in the preprocessing notebook and here, so that the results can be interpreted properly when a model is tested.)

In [8]:
#Initiate Label encoder
label_encoder = LabelEncoder()

#Fit the encoder on the training target variable
Y_train_encoded = label_encoder.fit_transform(y_train)

#Transform the training and validation target variable
Y_val_encoded = label_encoder.transform(y_val)

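# Keep a copy of the original (string) training labels, which cover all 8 classes;
# print_scores uses their sorted unique values as the class names
y_val_unencoded = y_train.copy()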

#Convert the results back to DataFrames while overriding the previous variable names
y_train = pd.DataFrame(Y_train_encoded, columns=['encoded_target'], index=pd.Series(y_train.index))
y_val = pd.DataFrame(Y_val_encoded, columns=['encoded_target'], index=pd.Series(y_val.index))
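
The encoding step above can be sanity-checked on a small example. The sketch below uses made-up labels in the same "N. NAME" style as the target column; the `toy_*` names are illustrative and not part of the notebook. It shows that the encoder is fitted on training labels only, that `classes_` gives the code-to-label mapping, and that `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Toy labels in the same style as 'Claim Injury Type' (illustrative only)
toy_train = np.array(["2. NON-COMP", "4. TEMPORARY", "2. NON-COMP", "3. MED ONLY"])
toy_val = np.array(["3. MED ONLY", "2. NON-COMP"])

toy_encoder = LabelEncoder()
toy_train_encoded = toy_encoder.fit_transform(toy_train)  # learn the classes from training labels
toy_val_encoded = toy_encoder.transform(toy_val)          # reuse the same mapping for validation

# classes_ is sorted, so position i holds the original label for integer code i
code_to_label = dict(enumerate(toy_encoder.classes_))

# inverse_transform maps integer codes back to the original strings
toy_val_decoded = toy_encoder.inverse_transform(toy_val_encoded)

print(code_to_label)
print(toy_val_decoded)
```

Because the mapping follows the sorted order of the class names, the numeric prefix in each label keeps the integer codes in the same order as the injury types.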

<a class="anchor" id="model">

## 2. Model
</a>

Type of Problem <br/>
This is a multiclass classification problem in which each claim is assigned to one of 8 classes. We use a simple Logistic Regression model configured for multinomial output.<br/>
<br/>
Metrics used:<br/>
Because this is a classification problem, we use the following metrics to judge the effectiveness of our model:
 - accuracy
 - precision
 - recall
 - f1 score

Each metric measures a different aspect of performance, so looking at all of them together gives a more complete view of the model's results.
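
To illustrate why several metrics are needed, here is a minimal sketch on made-up, imbalanced 3-class data (not taken from the claims dataset). Accuracy looks acceptable because the majority class dominates, while the macro averages, which weight every class equally, expose the classes the predictions miss. `zero_division=0` is set so that a class that is never predicted gets a precision of 0 instead of raising a warning.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Toy labels: class 0 is the majority, class 2 is never predicted (illustrative only)
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 2])
y_pred_toy = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 0])

toy_accuracy = accuracy_score(y_true_toy, y_pred_toy)

# average=None returns one score per class, average='macro' their unweighted mean
toy_recall_per_class = recall_score(y_true_toy, y_pred_toy, average=None)
toy_recall_macro = recall_score(y_true_toy, y_pred_toy, average='macro')
toy_precision_macro = precision_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)
toy_f1_macro = f1_score(y_true_toy, y_pred_toy, average='macro', zero_division=0)

print("Accuracy:       ", round(toy_accuracy, 3))
print("Recall per class:", toy_recall_per_class)
print("Macro recall:   ", round(toy_recall_macro, 3))
print("Macro precision:", round(toy_precision_macro, 3))
print("Macro f1:       ", round(toy_f1_macro, 3))
```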

In [9]:
# Functions to help display metrics for all models

# helper method for score_model - not to be used separately
def print_scores(per_class):
    for x,y in zip(per_class, np.unique(y_val_unencoded)):
        if str(y) == "7. PTD": # add an extra tab for better alignment
            print("["+str(y)+"]:     \t\t" + str(round(x,2))) 
        else:
            print("["+str(y)+"]:     \t" + str(round(x,2)))

# displays Accuracy plus the per-class and macro-averaged scores for Precision, Recall, and F1
def score_model(y_actual, y_predicted, score_train, score_test):

    print("--------- Accuracy ---------\n")
    acc_score = accuracy_score(y_actual, y_predicted)
    print("Accuracy Score: " + str(acc_score) + "\n")

    print("--------- Precision ---------")
    precision_per_class = precision_score(y_actual, y_predicted, average=None)
    print_scores(precision_per_class)
    precision_macro = precision_score(y_actual, y_predicted, average='macro')
    print("\nMacro precision: " + str(round(precision_macro, 3)) + "\n")

    print("---------- Recall ----------")
    recall_per_class = recall_score(y_actual, y_predicted, average=None)
    print_scores(recall_per_class)
    recall_macro = recall_score(y_actual, y_predicted, average='macro')
    print("\nMacro recall: " + str(round(recall_macro, 3)) + "\n")

    print("------------ F1 ------------")
    f1_per_class = f1_score(y_actual, y_predicted, average=None)
    print_scores(f1_per_class)
    f1_macro = f1_score(y_actual, y_predicted, average='macro')
    print("\nMacro f1: " + str(round(f1_macro, 3)) + "\n")

    # score_train and score_test are the accuracies returned by model.score
    print("------ Individual Score Comparisons ------ ")
    print("Train Score: " + str(score_train))
    print("Test Score: " + str(score_test))
    diff = np.abs(score_train - score_test)
    print("Difference: " + str(diff))

In [10]:
X_train.drop(['Season_of_Accident', 'Age_Group', 'Industry_Avg_Weekly_Wage','COVID_Age'], inplace=True, axis=1)

In [11]:
X_val.drop(["Age_at_Assembly", 'Season_of_Accident', 'Age_Group','COVID_Age'], inplace=True, axis=1)
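
Since the two cells above drop slightly different columns from each set, it may be worth confirming that the training and validation features end up with identical columns in the same order before fitting. The sketch below is one possible check on toy frames; the helper names `columns_match` and `column_differences` are hypothetical and not part of the notebook.

```python
import pandas as pd

def columns_match(train_df, val_df):
    """Return True when both frames have the same columns in the same order."""
    return list(train_df.columns) == list(val_df.columns)

def column_differences(train_df, val_df):
    """Return the columns found only in the training frame and only in the validation frame."""
    only_train = sorted(set(train_df.columns) - set(val_df.columns))
    only_val = sorted(set(val_df.columns) - set(train_df.columns))
    return only_train, only_val

# Toy frames that mimic the mismatch visible in the head() outputs (values are illustrative)
toy_train = pd.DataFrame({"Age at Injury": [0.1], "IME-4 Count": [0.2],
                          "Industry_Avg_Weekly_Wage": [407.19]})
toy_val = pd.DataFrame({"Age at Injury": [0.3], "IME-4 Count": [0.4],
                        "Age_at_Assembly": [1984.0]})

only_train, only_val = column_differences(toy_train, toy_val)
print("Only in train:", only_train)
print("Only in validation:", only_val)

# After dropping the unshared columns, the two frames line up
aligned = columns_match(toy_train.drop(columns=only_train),
                        toy_val.drop(columns=only_val))
print("Aligned after dropping:", aligned)
```

In the notebook itself, the same helpers could be called on `X_train` and `X_val`, assuming both are still pandas DataFrames at that point.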

### Models from class
- Decision Tree (J48)
- K-Nearest Neighbor
- Logistic Regression - already done
- Naive Bayes
- Neural Network

In [12]:
# create the model
model = LogisticRegression(multi_class='multinomial', solver='lbfgs', C=10)

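# fit the model to the training set
# (ravel the single-column target DataFrame into the 1-D array scikit-learn expects)
model.fit(X_train, y_train.values.ravel())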

# determine the scores for the model for both train and validation
score_train = model.score(X_train, y_train)
score_test = model.score(X_val, y_val)

# use model to predict on validation set
y_pred = model.predict(X_val)

# display the model metrics
score_model(y_val, y_pred, score_train, score_test)

--------- Accuracy ---------

Accuracy Score: 0.7524621388088822

--------- Precision ---------
[1. CANCELLED]:     	0.66
[2. NON-COMP]:     	0.83
[3. MED ONLY]:     	0.31
[4. TEMPORARY]:     	0.68
[5. PPD SCH LOSS]:     	0.64
[6. PPD NSL]:     	0.1
[7. PTD]:     		0.0
[8. DEATH]:     	0.63

Macro precision: 0.481

---------- Recall ----------
[1. CANCELLED]:     	0.44
[2. NON-COMP]:     	0.96
[3. MED ONLY]:     	0.07
[4. TEMPORARY]:     	0.8
[5. PPD SCH LOSS]:     	0.48
[6. PPD NSL]:     	0.01
[7. PTD]:     		0.0
[8. DEATH]:     	0.21

Macro recall: 0.37

------------ F1 ------------
[1. CANCELLED]:     	0.53
[2. NON-COMP]:     	0.89
[3. MED ONLY]:     	0.12
[4. TEMPORARY]:     	0.74
[5. PPD SCH LOSS]:     	0.55
[6. PPD NSL]:     	0.01
[7. PTD]:     		0.0
[8. DEATH]:     	0.31

Macro f1: 0.393

------ Individual Score Comparisons ------ 
Train Score: 0.7523028911145843
Test Score: 0.7524621388088822
Difference: 0.00015924769429798147


In [13]:
# print the confusion matrix
confusion_matrix(y_val, y_pred)

array([[ 1637,  1868,    90,   134,    12,     2,     0,     0],
       [  746, 83679,   896,  1917,    82,     1,     0,     3],
       [   32, 11828,  1532,  6379,   894,     5,     0,     2],
       [   43,  3750,  2137, 35749,  2803,    55,     3,    12],
       [    5,    98,   285,  7142,  6946,     7,     1,     0],
       [    0,     5,    17,  1125,   108,     8,     0,     0],
       [    0,     0,     0,    25,     2,     2,     0,     0],
       [    3,    24,    11,    74,     0,     0,     0,    29]])
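
The per-class scores reported earlier can be recovered directly from this matrix, which helps when reading it. In scikit-learn's convention, rows are actual classes and columns are predicted classes, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums. The sketch below copies the matrix from the output above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = actual, columns = predicted)
cm = np.array([
    [ 1637,  1868,    90,   134,    12,     2,     0,     0],
    [  746, 83679,   896,  1917,    82,     1,     0,     3],
    [   32, 11828,  1532,  6379,   894,     5,     0,     2],
    [   43,  3750,  2137, 35749,  2803,    55,     3,    12],
    [    5,    98,   285,  7142,  6946,     7,     1,     0],
    [    0,     5,    17,  1125,   108,     8,     0,     0],
    [    0,     0,     0,    25,     2,     2,     0,     0],
    [    3,    24,    11,    74,     0,     0,     0,    29],
])

correct = np.diag(cm)                               # correctly classified claims per class
recall_from_cm = correct / cm.sum(axis=1)           # share of each actual class that was found
precision_from_cm = correct / cm.sum(axis=0)        # share of each predicted class that was right
accuracy_from_cm = correct.sum() / cm.sum()

# Locate the largest single source of confusion (biggest off-diagonal cell)
off_diagonal = cm.copy()
np.fill_diagonal(off_diagonal, 0)
actual_idx, predicted_idx = np.unravel_index(off_diagonal.argmax(), off_diagonal.shape)

print("Recall:   ", np.round(recall_from_cm, 2))
print("Precision:", np.round(precision_from_cm, 2))
print("Accuracy: ", accuracy_from_cm)
print("Most confused pair (actual, predicted):", (actual_idx, predicted_idx))
```

The largest off-diagonal cell is row index 2, column index 1: 11,828 claims whose actual class is "3. MED ONLY" were predicted as "2. NON-COMP", which is consistent with the low recall reported for "3. MED ONLY".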

### Models found from papers:

- Support Vector Machine
- Random Forest
- Gradient Boosted Decision Trees
- XGBoost
- CatBoost
- Neural Network Ensembles

<a class="anchor" id="kaggle">

## 3. Kaggle Submission
</a>

In [14]:
# get the model prediction
# y_pred_test = model.predict(test_data)

In [15]:
# y_pred_test

In [16]:
# # decode the prediction labels back to their original values
# decoded_labels = label_encoder.inverse_transform(y_pred_test)
# decoded_labels

In [17]:
# test_data.shape

In [18]:
# # combine the prediction values with their claim identifiers into a dataframe
# kaggle_submission = pd.DataFrame({"Claim Identifier": test_data.index, "Claim Injury Type":decoded_labels})
# kaggle_submission.head()

In [19]:
# Write the resulting dataframe to a csv file named "Kaggle_Submission.csv"
# The file is saved in the directory the notebook is currently running from
# If a file with the same name already exists, it is overwritten with the new output.
# kaggle_submission.to_csv("Kaggle_Submission.csv", index=False)
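
The commented-out steps above can be tried end to end on toy data before running them on the real test set. The sketch below is a minimal illustration under stated assumptions: the claim identifiers and predictions are made up, and `toy_encoder` stands in for the fitted `label_encoder`. It decodes integer predictions back to class names, pairs them with their identifiers, and renders the CSV text that `to_csv` would write.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Stand-in for the fitted label_encoder (illustrative classes only)
toy_encoder = LabelEncoder().fit(["2. NON-COMP", "3. MED ONLY", "4. TEMPORARY"])

# Made-up identifiers and integer predictions, in place of test_data.index and model.predict(...)
toy_test_index = pd.Index([1001, 1002, 1003], name="Claim Identifier")
toy_predictions = np.array([0, 2, 1])

# Decode the predictions and combine them with their claim identifiers
toy_submission = pd.DataFrame({
    "Claim Identifier": toy_test_index,
    "Claim Injury Type": toy_encoder.inverse_transform(toy_predictions),
})

# With no path argument, to_csv returns the CSV content as a string
csv_text = toy_submission.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead, as in the commented cell above, writes the same content to disk.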