# **<h3 align="center">Machine Learning 2024-25</h3>**
## **<h3 align="center">To Grant or Not to Grant: Deciding on Compensation Benefits</h3>**


**Group 38 members:**<br>
Ana Marta Azinheira  - 20240496@novaims.unl.pt - 20240496<br>
Braulio Damba - 20240007@novaims.unl.pt - 20240007<br>
Henry Tirla  - 20221016@novaims.unl.pt - 20221016<br>
Marco Galão  - r20201545@novaims.unl.pt - r20201545<br>
Rodrigo Sardinha - 20211627@novaims.unl.pt - 20211627<br>

<a id = "toc"></a>

# Table of Contents

* [1. Import the Libraries](#import_libraries)
* [2. Import the Dataset](#import_dataset)
* [3. Description of the Dataset’s Structure](#dataset_structure)
* [4. Exploring the Dataset](#exploration)
    * [4.1. Constant Features](#constant_features)
    * [4.2. Duplicates](#duplicates)
    * [4.3. Missing Values](#missing_values)
    * [4.4. Data Types](#data_types)
    * [4.5. Inconsistencies](#inconsistencies)
    * [4.6. Outliers](#outliers)
* [5. Creating New Features](#chapter4)
    * [5.1.](#sub_section_4_2_1)
    * [5.2.](#sub_section_4_2_2)
    * [5.3.](#sub_section_4_2_3)
* [6. Correlation Matrix](#section_5_2)

# 1. Import the Libraries

In [7]:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# 2. Import the Dataset

In [37]:
train_data = pd.read_csv("train_data.csv")


# 3. Description of the Dataset’s Structure

- *Accident Date* - Injury date of the claim.

- *Age at Injury* - Age of injured worker when the injury occurred.
    
- *Alternative Dispute Resolution* - Adjudication processes external to the Board.

- *Assembly Date* - The date the claim was first assembled.
    
- *Attorney/Representative* - Is the claim being represented by an Attorney?

- *Average Weekly Wage* - The wage used to calculate workers’ compensation, disability, or Paid Leave wage replacement benefits.

- *Birth Year* - The reported year of birth of the injured worker.
    
- *C-2 Date* - Date of receipt of the Employer's Report of Work-Related Injury/Illness or equivalent (formerly Form C-2).
    
- *C-3 Date* - Date Form C-3 (Employee Claim Form) was received.
    
- *Carrier Name* - Name of primary insurance provider responsible for providing workers’ compensation coverage to the injured worker’s employer.
    
- *Carrier Type* - Type of primary insurance provider responsible for providing workers’ compensation coverage.

- *Claim Identifier* - Unique identifier for each claim, assigned by WCB.

- *County of Injury* - Name of the New York County where the injury occurred.

- *COVID-19 Indicator* - Indication that the claim may be associated with COVID-19.

- *District Name* - Name of the WCB district office that oversees claims for that region or area of the state.

- *First Hearing Date* - Date the first hearing was held on a claim at a WCB hearing location. A blank date means the claim has not yet had a hearing held.

- *Gender* - The reported gender of the injured worker.

- *IME-4 Count* - Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form.

- *Industry Code* - NAICS code and descriptions are available at: https://www.naics.com/search-naics-codes-by-industry/.

- *Industry Code Description* - 2-digit NAICS industry code description used to classify businesses according to their economic activity.

- *Medical Fee Region* - Approximate region where the injured worker would receive medical service.

- *OIICS Nature of Injury Description* - The OIICS nature of injury codes & descriptions are available at https://www.bls.gov/iif/oiics_manual_2007.pdf.

- *WCIO Cause of Injury Code* - The WCIO cause of injury codes & descriptions are at https://www.wcio.org/Active%20PNC/WCIO_Cause_Table.pdf

- *WCIO Cause of Injury Description* - See description of field above.

- *WCIO Nature of Injury Code* - The WCIO nature of injury codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Nature_Table.pdf

- *WCIO Nature of Injury Description* - See description of field above.

- *WCIO Part Of Body Code* - The WCIO part of body codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Part_Table.pdf

- *WCIO Part Of Body Description* - See description of field above.

- *Zip Code* - The reported ZIP code of the injured worker’s home address.

- *Agreement Reached* - Binary variable: Yes if an agreement was reached without the involvement of the WCB. It is unknown at the start of a claim.

- *WCB Decision* - Multiclass variable: Decision of the WCB relative to the claim. “Accident” means that the claim refers to a workplace accident; “Occupational Disease” means an illness contracted in the workplace. It requires WCB deliberation, so it is unknown at the start of a claim.

- *Claim Injury Type* - Main target variable: Deliberation of the WCB relative to benefits awarded to the claim. Numbering indicates severity.

# 4. Exploring the Dataset

In [39]:
train_data.head()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,...,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
0,2019-12-30,31.0,N,2020-01-01,N,0.0,1988.0,2019-12-31,,NEW HAMPSHIRE INSURANCE CO,...,27.0,FROM LIQUID OR GREASE SPILLS,10.0,CONTUSION,62.0,BUTTOCKS,13662.0,0.0,Not Work Related,1.0
1,2019-08-30,46.0,N,2020-01-01,Y,1745.93,1973.0,2020-01-01,2020-01-14,ZURICH AMERICAN INSURANCE CO,...,97.0,REPETITIVE MOTION,49.0,SPRAIN OR TEAR,38.0,SHOULDER(S),14569.0,1.0,Not Work Related,4.0
2,2019-12-06,40.0,N,2020-01-01,N,1434.8,1979.0,2020-01-01,,INDEMNITY INSURANCE CO OF,...,79.0,OBJECT BEING LIFTED OR HANDLED,7.0,CONCUSSION,10.0,MULTIPLE HEAD INJURY,12589.0,0.0,Not Work Related,6.0
3,,,,2020-01-01,,,,,,,...,,,,,,,,,,
4,2019-12-30,61.0,N,2020-01-01,N,,1958.0,2019-12-31,,STATE INSURANCE FUND,...,16.0,"HAND TOOL, UTENSIL; NOT POWERED",43.0,PUNCTURE,36.0,FINGER(S),12603.0,0.0,Not Work Related,1.0


In [41]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593471 entries, 0 to 593470
Data columns (total 33 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Accident Date                       570337 non-null  object 
 1   Age at Injury                       574026 non-null  float64
 2   Alternative Dispute Resolution      574026 non-null  object 
 3   Assembly Date                       593471 non-null  object 
 4   Attorney/Representative             574026 non-null  object 
 5   Average Weekly Wage                 545375 non-null  float64
 6   Birth Year                          544948 non-null  float64
 7   C-2 Date                            559466 non-null  object 
 8   C-3 Date                            187245 non-null  object 
 9   Carrier Name                        574026 non-null  object 
 10  Carrier Type                        574026 non-null  object 
 11  Claim Identifier          

In [43]:
train_data.shape

(593471, 33)

## Insights 
- Are 'Assembly Date' and 'Claim Identifier' the only columns without NaN values?
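
One way to answer this question is to count the missing values per column and keep only the columns where that count is zero. The sketch below uses a small hypothetical DataFrame (`toy`, with made-up values) in place of `train_data`; only the column names are taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data (all values are made up for illustration)
toy = pd.DataFrame({
    "Assembly Date": ["2020-01-01", "2020-01-01", "2020-01-01"],
    "Claim Identifier": [1000001, 1000002, 1000003],
    "Accident Date": ["2019-12-30", None, "2019-12-06"],
    "Age at Injury": [31.0, np.nan, 40.0],
})

# Number of missing values in each column
missing_counts = toy.isna().sum()

# Columns that contain no missing values at all
complete_columns = missing_counts[missing_counts == 0].index.tolist()
print(complete_columns)
```

Applied to `train_data`, the same two statements would list every fully populated column.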

In [45]:
train_data.describe().round(2)

Unnamed: 0,Age at Injury,Average Weekly Wage,Birth Year,Claim Identifier,IME-4 Count,Industry Code,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Nature of Injury Code,WCIO Part Of Body Code,Agreement Reached,Number of Dependents
count,574026.0,545375.0,544948.0,593471.0,132803.0,564068.0,0.0,558386.0,558369.0,556944.0,574026.0,574026.0
mean,42.11,491.09,1886.77,23667600.0,3.21,58.65,,54.38,41.01,39.74,0.05,3.01
std,14.26,6092.92,414.64,107927100.0,2.83,19.64,,25.87,22.21,22.37,0.21,2.0
min,0.0,0.0,0.0,5393066.0,1.0,11.0,,1.0,1.0,-9.0,0.0,0.0
25%,31.0,0.0,1965.0,5593414.0,1.0,45.0,,31.0,16.0,33.0,0.0,1.0
50%,42.0,0.0,1977.0,5791212.0,2.0,61.0,,56.0,49.0,38.0,0.0,3.0
75%,54.0,841.0,1989.0,5991000.0,4.0,71.0,,75.0,52.0,53.0,0.0,5.0
max,117.0,2828079.0,2018.0,999891700.0,73.0,92.0,,99.0,91.0,99.0,1.0,6.0


# 4.1. Constant Features
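
The `describe()` output above shows a count of 0 for 'OIICS Nature of Injury Description', which suggests the column holds no values. A possible way to flag such columns, together with columns that hold a single repeated value, is sketched below. The DataFrame `toy` and the column `Example Flag` are hypothetical and exist only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data (values are made up for illustration)
toy = pd.DataFrame({
    "Age at Injury": [31.0, 46.0, 40.0],
    "OIICS Nature of Injury Description": [np.nan, np.nan, np.nan],
    "Example Flag": ["N", "N", "N"],  # hypothetical constant column
})

# With dropna=False, NaN counts as a value, so an all-NaN column has one distinct value
distinct_values = toy.nunique(dropna=False)

# A column with at most one distinct value carries no information for a model
constant_columns = distinct_values[distinct_values <= 1].index.tolist()
print(constant_columns)
```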

# 4.2. Duplicates

In [47]:
# Count duplicated rows (True = duplicate of an earlier row)
train_data.duplicated().value_counts()

False    593471
Name: count, dtype: int64

# 4.3. Missing Values

In [49]:
train_data.isna().sum()

Accident Date                          23134
Age at Injury                          19445
Alternative Dispute Resolution         19445
Assembly Date                              0
Attorney/Representative                19445
Average Weekly Wage                    48096
Birth Year                             48523
C-2 Date                               34005
C-3 Date                              406226
Carrier Name                           19445
Carrier Type                           19445
Claim Identifier                           0
Claim Injury Type                      19445
County of Injury                       19445
COVID-19 Indicator                     19445
District Name                          19445
First Hearing Date                    442673
Gender                                 19445
IME-4 Count                           460668
Industry Code                          29403
Industry Code Description              29403
Medical Fee Region                     19445
OIICS Natu

__NOTE:__ This dataset has no categorical variables. However, to check the descriptive statistics of categorical data, we just need to call the method `.describe(include=['O'])`.
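
As a minimal sketch of that call, the hypothetical DataFrame below (made-up values) mixes two object columns with one numeric column. The object dtype is set explicitly so that the selection does not depend on how a given pandas version infers string columns.

```python
import pandas as pd

# Hypothetical data: two object (categorical-like) columns and one numeric column
toy = pd.DataFrame({
    "Gender": pd.Series(["M", "F", "M", "M"], dtype=object),
    "Region": pd.Series(["I", "II", "I", "IV"], dtype=object),
    "Age at Injury": [31, 46, 40, 61],
})

# include=['O'] restricts the summary to object columns:
# count, number of unique values, most frequent value (top) and its frequency (freq)
categorical_summary = toy.describe(include=["O"])
print(categorical_summary)
```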

__`Step 7`__ What is the mean value of `BD4` when the target `DrugPlant` is equal to 0? And when it is equal to 1?

In [11]:
drugs_truth.groupby('DrugPlant')['BD4'].mean()

DrugPlant
0     67573.953370
1    103339.467412
Name: BD4, dtype: float64

__`Step 8`__ How many observations do we have where DrugPlant is equal to 0? And to 1?

In [12]:
drugs_truth['DrugPlant'].value_counts()

DrugPlant
0    7463
1     537
Name: count, dtype: int64

__`Step 9`__ How many observations do we have where `BD3` is equal to 15?

In [13]:
drugs_truth[drugs_truth['BD3'] == 15]

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9,DrugPlant
1013,657,60,15,95586,0,0,29,54,1407,1
1020,589,26,15,40944,1,0,4,67,52,0
1026,824,56,15,90390,0,0,22,52,959,0
1034,657,32,15,54327,1,0,4,21,47,0
1049,927,35,15,45434,1,1,4,14,90,0
...,...,...,...,...,...,...,...,...,...,...
10980,835,65,15,89275,0,0,23,5,1022,1
10981,771,74,15,120600,0,0,38,16,1978,0
10984,616,47,15,74348,0,1,6,97,153,0
10995,1025,39,15,58121,1,1,4,6,61,0
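
The filter above displays the matching rows, but the question asks for their number. A minimal sketch of how the count could be obtained is shown below, using a small hypothetical DataFrame (`toy`, made-up values) instead of `drugs_truth`.

```python
import pandas as pd

# Hypothetical stand-in for drugs_truth (values are made up for illustration)
toy = pd.DataFrame(
    {"BD3": [15, 16, 15, 20, 15], "DrugPlant": [1, 0, 0, 0, 0]},
    index=pd.Index([1, 2, 3, 4, 5], name="ID"),
)

# Boolean mask: True where BD3 equals 15
mask = toy["BD3"] == 15

# Summing a boolean mask counts the True values
n_matches = int(mask.sum())

# Equivalent: number of rows in the filtered DataFrame
n_rows = len(toy[mask])
print(n_matches, n_rows)
```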


__`Step 10`__ Look for correlations between the different features with the method `.corr(method = 'spearman')`

In [None]:
# Compute the correlation matrix of the features.
# It is useful to assess multicollinearity (data redundancy) and to identify the most relevant features (those most correlated with the target).
drugs_truth.drop(columns = 'DrugPlant').corr(method = 'spearman')
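
The comment above mentions correlation with the target, while the call drops `DrugPlant` and therefore only compares features with each other. The sketch below keeps the target to rank features by their absolute Spearman correlation with it. The DataFrame `toy` is hypothetical and its values are made up.

```python
import pandas as pd

# Hypothetical stand-in for drugs_truth (values are made up for illustration)
toy = pd.DataFrame({
    "BD2": [24, 32, 59, 68, 78, 35],
    "BD4": [28344, 22386, 93571, 90782, 113023, 45434],
    "BD8": [69, 65, 10, 66, 6, 14],
    "DrugPlant": [0, 0, 1, 0, 1, 0],
})

# Spearman correlation between every pair of columns, target included
corr_matrix = toy.corr(method="spearman")

# Correlation of each feature with the target (the target's own entry is dropped)
target_corr = corr_matrix["DrugPlant"].drop("DrugPlant")

# Features ordered by the strength of their monotonic association with the target
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```

High values between two features (off the target column) point to redundancy, while high values in `ranked` point to potentially useful predictors.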

# 4. Modify the data

<img src="01_images/phase04.png" alt="Drawing" style="width: 500px;"/>

After exploring and understanding the data, we need to fix possible problems such as missing values or outliers, and we can create new variables with higher predictive power. <br>
For now, we are going to skip this. <br>However, to create a predictive model we need to identify our independent variables and the dependent one (the target), and we also need to split our data into at least two datasets: train and validation.

__`Step 11`__ Create a new dataset named `X` that includes all the independent variables.

In [12]:
X = drugs_truth.iloc[:,:-1]

In [18]:
# or
# X = drugs_truth.drop(columns = 'DrugPlant')

In [19]:
X

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9
1001,789,68,16,90782,0,0,29,66,1402
1002,623,78,20,113023,0,0,31,6,1537
1003,583,24,18,28344,1,0,4,69,44
1004,893,59,19,93571,0,1,21,10,888
1006,792,32,20,22386,1,1,5,65,56
...,...,...,...,...,...,...,...,...,...
10995,1025,39,15,58121,1,1,4,6,61
10996,967,28,17,54292,0,0,23,72,1011
10997,637,76,15,125962,0,1,33,75,1668
10998,586,69,19,99628,0,0,30,98,1469


__`Step 12`__ Create a new dataset named `y` that includes the dependent variable (the last column, `DrugPlant`).

In [20]:
y = drugs_truth.iloc[:,-1]

In [22]:
# or
# y = drugs_truth['DrugPlant']

In [23]:
y

ID
1001     0
1002     0
1003     0
1004     0
1006     0
        ..
10995    0
10996    1
10997    0
10998    0
10999    0
Name: DrugPlant, Length: 8000, dtype: int64

__`Step 13`__ Using the `train_test_split()`, split the data into train and validation, where the training dataset should contain 70% of the observations. (We are going to talk more about this in a future class). 

In [21]:
X_train, X_validation, y_train, y_validation = train_test_split(X,y,
                                                               train_size = 0.7, 
                                                               shuffle = True, 
                                                               stratify = y) # to have the same percentage of 1s in train and validation data
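
To illustrate what `stratify` does, the sketch below splits a synthetic, unbalanced target and compares the share of 1s before and after the split. The data, the names `X_toy` and `y_toy`, and the `random_state` value are assumptions made for reproducibility; they are not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 observations, 7% of them in the positive class
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({"feature": rng.normal(size=1000)})
y_toy = pd.Series([1] * 70 + [0] * 930, name="target")

# Same arguments as in the step above, plus a fixed random_state (an assumption)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, train_size=0.7, shuffle=True, stratify=y_toy, random_state=42
)

# With stratification, the share of 1s should be (almost) identical in every subset
print(y_toy.mean(), y_tr.mean(), y_val.mean())
```

The confusion matrices later in this notebook are consistent with this behaviour: 376 of 5600 training observations and 161 of 2400 validation observations belong to class 1, about 6.7% in both cases.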

# 5. Modelling - Create a predictive model

It is time to create a model. At this step, we are going to apply a simple algorithm called a decision tree.

<img src="01_images/phase05.png" alt="Drawing" style="width: 500px;"/>

__`Step 14`__ Create an instance of a DecisionTreeClassifier named as `dt` with the default parameters and fit the instance to the training data (again, we are going to talk more about this later).

In [24]:
dt = DecisionTreeClassifier().fit(X_train, y_train)

__`Step 15`__ Using the model just created in the previous step, predict the values of the target in the train dataset using the method `.predict()`. Assign those values to the object `predictions_train`

In [25]:
predictions_train = dt.predict(X_train)

__`Step 16`__ Similarly to what you have done in the previous step, predict the target values for the validation dataset and assign those values to the object `predictions_val`

In [26]:
predictions_val = dt.predict(X_validation)

# 6. Assess

We already have the ground truth and the predicted values. In this way we can start evaluating the performance of our model in the train and the validation dataset.

<img src="01_images/phase06.png" alt="Drawing" style="width: 500px;"/>

__`Step 17`__ Using the method `.score()`, check the mean accuracy of the model `dt` in the train dataset.

In [27]:
dt.score(X_train, y_train)

1.0

__`Step 18`__ Similarly to what you have done in step 17, check the mean accuracy now for the validation dataset.

In [28]:
dt.score(X_validation, y_validation)

0.9033333333333333

Are we dealing with a case of __overfitting__? <br>
Yes: the accuracy is 1.0 on the training data but only about 0.90 on the validation data, and decision trees are known to be prone to overfitting. <br>
Luckily, there are strategies to avoid this problem. <br>
We are going to better understand what overfitting is and how to avoid it in the different algorithms in the next classes.
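
As a preview of one such strategy, the sketch below compares a fully grown tree with a tree limited by `max_depth`. It runs on synthetic data from `make_classification`, and the depth of 3 and the `random_state` values are arbitrary assumptions. Whether the shallower tree generalizes better depends on the data, so the scores are printed rather than asserted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, unbalanced binary classification problem (not the exercise data)
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=9, n_informative=4,
    weights=[0.93, 0.07], flip_y=0.05, random_state=0,
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_syn, y_syn, train_size=0.7, stratify=y_syn, random_state=0
)

# Default tree: grows until every training observation is classified correctly
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Constrained tree: at most 3 levels of splits, so it cannot memorize the training data
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

for name, model in [("full", full_tree), ("max_depth=3", shallow_tree)]:
    print(name, model.score(X_tr, y_tr), model.score(X_val, y_val))
```

A large gap between the train and validation scores of a model is the usual symptom of overfitting.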

It is time to check the confusion matrix of the model for the training and the validation dataset. <br> <br>
__`Step 19`__ Check the confusion matrix for the training dataset, passing as parameters the ground truth (y_train) and the predicted values (predictions_train)<br>
[[TN, FP],<br>
[FN, TP]]

In [31]:
confusion_matrix(y_train, predictions_train)

array([[5224,    0],
       [   0,  376]])

__`Step 20`__ Do the same for the validation dataset.

In [32]:
confusion_matrix(y_validation, predictions_val)

array([[2113,  126],
       [ 106,   55]])

__Can we conclude something from the results above?__ <br>It seems that our model is not very good at predicting the 1s in the target: only 55 of the 161 positive observations in the validation dataset are correctly identified. <br>__Why?__ <br>Because we are dealing with an unbalanced dataset (more about this in the future). 

We are also going to learn different metrics that allow us to better understand the performance of our model on unbalanced datasets. The mean accuracy is not a good metric for evaluating those cases.
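
As a small worked example of such metrics, the sketch below derives accuracy, precision, recall and F1 directly from the validation confusion matrix shown above. The variable names are illustrative.

```python
import numpy as np

# Validation confusion matrix from Step 20, laid out as [[TN, FP], [FN, TP]]
cm = np.array([[2113, 126],
               [106, 55]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # share of predicted 1s that are truly 1
recall = tp / (tp + fn)           # share of true 1s that the model finds
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

The accuracy is about 0.903, while precision (about 0.304) and recall (about 0.342) for class 1 are much lower, which is why accuracy alone gives an overly optimistic picture here.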

# 7. Deploy

In the end, we want to classify the unclassified data. If we are already satisfied with our model, we can now predict the target for the new dataset.

__`Step 21`__ Check the dataset that we want to classify, imported as `drugs_2classify`

In [33]:
# test data
drugs_2classify

ID,BD1,BD2,BD3,BD4,BD5,BD6,BD7,BD8,BD9
1005,1062,59,18,91852,0,1,25,26,1138
1014,619,35,18,57997,1,1,5,28,81
1015,1133,32,20,50289,1,0,1,231,20
1017,624,22,16,20043,0,0,2,37,16
1018,940,76,15,112765,0,0,39,72,2039
...,...,...,...,...,...,...,...,...,...
10975,794,72,19,93012,0,0,23,38,1028
10985,839,67,19,100928,0,0,25,6,1152
10986,1155,23,17,50329,0,0,7,80,188
10990,1036,73,18,104990,0,1,32,75,1607


__`Step 22`__ Using the `.predict()` method and the model created named as `dt`, predict the target on the new dataset and assign those values to a column named as `DrugPlant`

In [23]:
drugs_2classify['DrugPlant'] = dt.predict(drugs_2classify)
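
Note that running the cell above a second time would pass the new `DrugPlant` column to `.predict()` as an extra feature, which scikit-learn rejects because the number of features no longer matches the training data. A possible safeguard, sketched below on hypothetical toy data (`train`, `new_data` and `feature_cols` are illustrative names), is to select the feature columns explicitly.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data (values are made up for illustration)
train = pd.DataFrame({
    "BD2": [24, 32, 59, 68, 78, 35],
    "BD4": [28344, 22386, 93571, 90782, 113023, 45434],
    "DrugPlant": [0, 0, 1, 0, 1, 0],
})
feature_cols = ["BD2", "BD4"]
model = DecisionTreeClassifier(random_state=0).fit(train[feature_cols], train["DrugPlant"])

# Hypothetical unclassified data
new_data = pd.DataFrame({"BD2": [60, 30], "BD4": [95000, 25000]})

# Selecting feature_cols keeps the call valid even after the prediction column exists
new_data["DrugPlant"] = model.predict(new_data[feature_cols])
new_data["DrugPlant"] = model.predict(new_data[feature_cols])  # re-running does not fail
print(new_data)
```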

__`Step 23`__ Check the new dataset.

In [None]:
drugs_2classify

Now we have already predicted the target for our new dataset! Next, if we wish to save a set of predictions, we can export a solution to a csv file.

In [25]:
#export test data predictions
drugs_2classify['DrugPlant'].to_csv('Exercise1_predictions.csv')