# Table of Contents

[0. Context](#0-context)<br>

[1. Import the Dataset and Explore the Data](#import-the-dataset-and-explore-the-data)<br>
Check data contents, provide descriptive statistics, and check for incoherencies in the data.<br>
Explore data visually and extract relevant insights.<br>
Explain your rationale and findings.<br>
Do not forget to analyze multivariate relationships.<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.1 Importing Libraries](#11-importing-libraries)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.2 Loading and Reading the Dataset](#12-loading-and-reading-the-dataset)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.3 Descriptive Statistics](#13-descriptive-statistics)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.4 Incoherencies](#14-incoherencies)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.5 Exploring Data Visually](#15-exploring-data-visually)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[1.6 Pair-wise Relationships](#16-pair-wise-relationships)<br>

[2. Clean and Pre-process the Data](#clean-and-pre-process-the-data)<br>
Are there any missing values? Take action to handle them.<br>
Check the dataset for outliers and pre-process them. Justify your decisions.<br>
Deal with categorical variables.<br>
Review current features and create extra features if needed. Explain your steps.<br>
Perform data scaling. Explain the reasoning behind your choices.<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.1 Missing Values](#21-missing-values)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.2 Duplicates](#22-duplicates)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.3 Outliers](#23-outliers)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.4 Categorical Data](#24-categorical-data)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.5 Aggregations](#25-aggregations)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.6 Feature Engineering](#26-feature-engineering)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.7 Multivariate Relationships](#27-multivariate-relationships)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[2.8 Data Scaling](#28-data-scaling)<br>

[3. Feature Selection](#feature-selection)<br>
Define and implement an unambiguous strategy for feature selection.<br>
Use methods discussed in the course.<br>
Present and justify your final selection.<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.1 Filter Methods](#31-filter-methods)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[3.1.1 Univariate Variables](#311-univariate-variables)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[3.1.2 Correlation Indices](#312-correlation-indices)<br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;[3.1.3 Chi-Square](#313-chi-square)<br>
&nbsp;&nbsp;&nbsp;&nbsp;[3.2 Wrapper Methods](#32-wrapper-methods)<br>

# 0. Context <a name="0-context"></a>

The New York Workers’ Compensation Board (WCB) administers and regulates workers’ compensation, disability, and other workers’ benefits. <br>
**WCB is responsible for assembling and deciding on claims whenever it becomes aware of a workplace injury**. Since 2000, the WCB has assembled and reviewed more than 5 million claims. However, manually reviewing all claims is an arduous and time-consuming process. For that reason, the WCB has reached out to Nova IMS to assist them in the creation of a model that can automate the decision-making whenever a new claim is received. <br>

Our task is to **create a classification model that can accurately predict the WCB’s final decision on what type of injury (Claim Injury Type) should be assigned to a claim**. To do that, the WCB has provided labelled data with all claims assembled between 2020 and 2022. <br>

# 1. Import the Dataset and Explore the Data <a name="import-the-dataset-and-explore-the-data"></a>

## 1.1 Importing Libraries <a name="11-importing-libraries"></a>

In [1]:
# Remember: library imports are ALWAYS at the top of the script, no exceptions!
import sqlite3
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import statistics
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from kmodes.kmodes import KModes
from math import ceil
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split


# for better resolution plots
%config InlineBackend.figure_format = 'retina' # optionally, you can change 'retina' to 'svg'

# Setting seaborn style
sns.set()

In [2]:
import warnings
warnings.filterwarnings('ignore')
from imblearn.over_sampling import SMOTE, SVMSMOTE
from scipy.stats import randint as sp_randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, RFE, mutual_info_classif
from sklearn.linear_model import LassoCV, SGDClassifier, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score, roc_auc_score, precision_score, recall_score, make_scorer
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, StratifiedKFold, GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB, ComplementNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, HistGradientBoostingClassifier, StackingClassifier, VotingClassifier
from sklearn.utils import class_weight
from openpyxl import load_workbook

## 1.2 Loading and Reading the Dataset <a name="12-loading-and-reading-the-dataset"></a>

In [4]:
wcb = pd.read_csv('train_data.csv', sep = ',', low_memory=False)   # sep sets the column delimiter
pd.set_option('display.max_columns', None) # to be able to see all columns
wcb.head(5)

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Identifier,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
0,2019-12-30,31.0,N,2020-01-01,N,0.0,1988.0,2019-12-31,,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,5393875,2. NON-COMP,ST. LAWRENCE,N,SYRACUSE,,M,,44.0,RETAIL TRADE,I,,27.0,FROM LIQUID OR GREASE SPILLS,10.0,CONTUSION,62.0,BUTTOCKS,13662.0,0.0,Not Work Related,1.0
1,2019-08-30,46.0,N,2020-01-01,Y,1745.93,1973.0,2020-01-01,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,5393091,4. TEMPORARY,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23.0,CONSTRUCTION,I,,97.0,REPETITIVE MOTION,49.0,SPRAIN OR TEAR,38.0,SHOULDER(S),14569.0,1.0,Not Work Related,4.0
2,2019-12-06,40.0,N,2020-01-01,N,1434.8,1979.0,2020-01-01,,INDEMNITY INSURANCE CO OF,1A. PRIVATE,5393889,4. TEMPORARY,ORANGE,N,ALBANY,,M,,56.0,ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMEN...,II,,79.0,OBJECT BEING LIFTED OR HANDLED,7.0,CONCUSSION,10.0,MULTIPLE HEAD INJURY,12589.0,0.0,Not Work Related,6.0
3,,,,2020-01-01,,,,,,,,957648180,,,,,,,,,,,,,,,,,,,,,
4,2019-12-30,61.0,N,2020-01-01,N,,1958.0,2019-12-31,,STATE INSURANCE FUND,2A. SIF,5393887,2. NON-COMP,DUTCHESS,N,ALBANY,,M,,62.0,HEALTH CARE AND SOCIAL ASSISTANCE,II,,16.0,"HAND TOOL, UTENSIL; NOT POWERED",43.0,PUNCTURE,36.0,FINGER(S),12603.0,0.0,Not Work Related,1.0


In [5]:
test_wcb = pd.read_csv('test_data.csv', sep = ',', low_memory=False)   # sep sets the column delimiter
pd.set_option('display.max_columns', None) # to be able to see all columns

In [9]:
wcb = wcb.dropna(subset=['Claim Injury Type']) # before doing this, show that there are null values here
# it is important to explore the data before this step and draw conclusions, including that some rows have all values null


In [12]:
X = wcb.drop("Claim Injury Type", axis = 1)
y = wcb["Claim Injury Type"]

In [14]:
from sklearn.model_selection import train_test_split


X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state = 0, stratify = y,
                                                  shuffle = True)



In [15]:
#train_copy = X_train.copy()

### Metadata

**Claim Dates**  
`Accident Date` Injury date of the claim.  
`Assembly Date` The date the claim was first assembled.  
`C-2 Date` Date of receipt of the Employer's Report of Work-Related Injury/Illness or equivalent (formerly Form C-2).  
`C-3 Date` Date Form C-3 (Employee Claim Form) was received.  
`First Hearing Date` Date the first hearing was held on a claim at a WCB hearing location. A blank date means the claim has not yet had a hearing held.  

**Worker Demographics**  
`Age at Injury` Age of injured worker when the injury occurred.  
`Birth Year` The reported year of birth of the injured worker.  
`Gender` The reported gender of the injured worker.  
`Zip Code` The reported ZIP code of the injured worker’s home address.  

**Claim and Case Information**  
`Alternative Dispute Resolution` Adjudication processes external to the Board.  
`Attorney/Representative` Is the claim being represented by an Attorney?  
`Claim Identifier` Unique identifier for each claim, assigned by WCB.  
`Carrier Name` Name of primary insurance provider responsible for providing workers’ compensation coverage to the injured worker’s employer.  
`Carrier Type` Type of primary insurance provider responsible for providing workers’ compensation coverage.  
`Average Weekly Wage` The wage used to calculate workers’ compensation, disability, or paid leave wage replacement benefits.  

**Location and Region**  
`County of Injury` Name of the New York County where the injury occurred.  
`District Name` Name of the WCB district office that oversees claims for that region or area of the state.  
`Medical Fee Region` Approximate region where the injured worker would receive medical service.  

**Incident and Injury Details**  
`COVID-19 Indicator` Indication that the claim may be associated with COVID-19.  
`IME-4 Count` Number of IME-4 forms received per claim. The IME-4 form is the “Independent Examiner's Report of Independent Medical Examination” form.  

**Industry Classification**  
`Industry Code` NAICS code and descriptions are available at https://www.naics.com/search-naics-codes-by-industry/.  
`Industry Code Description` 2-digit NAICS industry code description used to classify businesses according to their economic activity.  

**Injury Descriptions and Codes**  
`OIICS Nature of Injury Description` The OIICS nature of injury codes & descriptions are available at https://www.bls.gov/iif/oiics_manual_2007.pdf.  
`WCIO Cause of Injury Code` The WCIO cause of injury codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Cause_Table.pdf.  
`WCIO Cause of Injury Description` See description of field above.  
`WCIO Nature of Injury Code` The WCIO nature of injury codes are available at https://www.wcio.org/Active%20PNC/WCIO_Nature_Table.pdf.  
`WCIO Nature of Injury Description` See description of field above.  
`WCIO Part Of Body Code` The WCIO part of body codes & descriptions are available at https://www.wcio.org/Active%20PNC/WCIO_Part_Table.pdf.  
`WCIO Part Of Body Description` See description of field above.  

**Claim Outcomes**  
`Agreement Reached` Binary variable: Yes if there is an agreement without the involvement of the WCB; otherwise unknown at the start of a claim.  
`WCB Decision` Multiclass variable: Decision of the WCB relative to the claim; "Accident" indicates a workplace accident, and "Occupational Disease" indicates illness from the workplace, both of which require WCB deliberation and may be unknown at the claim's start.  
`Claim Injury Type` Main target variable: Deliberation of the WCB relative to benefits awarded to the claim, with numbering indicating severity.  

## 1.3 Descriptive Statistics <a name="13-descriptive-statistics"></a>

#### Shape

In [22]:
wcb.shape

(574026, 33)

In [24]:
wcb.info()

<class 'pandas.core.frame.DataFrame'>
Index: 574026 entries, 0 to 593467
Data columns (total 33 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Accident Date                       570337 non-null  object 
 1   Age at Injury                       574026 non-null  float64
 2   Alternative Dispute Resolution      574026 non-null  object 
 3   Assembly Date                       574026 non-null  object 
 4   Attorney/Representative             574026 non-null  object 
 5   Average Weekly Wage                 545375 non-null  float64
 6   Birth Year                          544948 non-null  float64
 7   C-2 Date                            559466 non-null  object 
 8   C-3 Date                            187245 non-null  object 
 9   Carrier Name                        574026 non-null  object 
 10  Carrier Type                        574026 non-null  object 
 11  Claim Identifier               

### Any footnotes?

In [27]:
wcb.tail(5)
# No, there aren't any

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Identifier,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
593451,2022-12-14,35.0,N,2022-12-30,N,0.0,1987.0,2022-12-30,,STATE INSURANCE FUND,2A. SIF,6165265,2. NON-COMP,BRONX,N,NYC,,M,,56.0,ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMEN...,IV,,45.0,COLLISION OR SIDESWIPE WITH ANOTHER VEHICLE,10.0,CONTUSION,42.0,LOWER BACK AREA,10467,0.0,Not Work Related,4.0
593455,2022-12-15,33.0,N,2022-12-31,N,0.0,1989.0,2022-12-31,,WESCO INSURANCE CO,1A. PRIVATE,6165285,2. NON-COMP,NASSAU,N,NYC,,M,,62.0,HEALTH CARE AND SOCIAL ASSISTANCE,IV,,74.0,"FELLOW WORKER, PATIENT OR OTHER PERSON",37.0,INFLAMMATION,35.0,HAND,11590,0.0,Not Work Related,6.0
593456,2022-12-13,61.0,N,2022-12-31,N,991.08,1961.0,2022-12-31,,SECURITY NATIONAL INSURANCE CO,1A. PRIVATE,6165506,4. TEMPORARY,ERIE,N,BUFFALO,,F,1.0,62.0,HEALTH CARE AND SOCIAL ASSISTANCE,II,,98.0,"CUMULATIVE, NOC",80.0,"ALL OTHER CUMULATIVE INJURY, NOC",34.0,WRIST,14227,0.0,Not Work Related,3.0
593457,2022-12-14,24.0,N,2022-12-31,N,0.0,1998.0,2022-12-31,,TECHNOLOGY INSURANCE CO. INC.,1A. PRIVATE,6165339,2. NON-COMP,NEW YORK,N,NYC,,F,,62.0,HEALTH CARE AND SOCIAL ASSISTANCE,IV,,59.0,USING TOOL OR MACHINERY,59.0,"ALL OTHER SPECIFIC INJURIES, NOC",55.0,ANKLE,10029,0.0,Not Work Related,5.0
593467,2022-12-13,72.0,N,2022-12-31,N,0.0,1950.0,2022-12-31,,TECHNOLOGY INSURANCE CO. INC.,1A. PRIVATE,6165075,2. NON-COMP,SULLIVAN,N,BINGHAMTON,,F,,48.0,TRANSPORTATION AND WAREHOUSING,I,,25.0,FROM DIFFERENT LEVEL (ELEVATION),90.0,MULTIPLE PHYSICAL INJURIES ONLY,-9.0,MULTIPLE,12779,0.0,Not Work Related,3.0


In [29]:
wcb.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Accident Date,570337,5539,2020-03-01,1245
Alternative Dispute Resolution,574026,3,N,571412
Assembly Date,574026,897,2020-03-06,1413
Attorney/Representative,574026,2,N,392291
C-2 Date,559466,2475,2021-05-11,1847
C-3 Date,187245,1648,2021-04-21,350
Carrier Name,574026,2046,STATE INSURANCE FUND,111144
Carrier Type,574026,8,1A. PRIVATE,285368
Claim Injury Type,574026,8,2. NON-COMP,291078
County of Injury,574026,63,SUFFOLK,60430


In [30]:
wcb.describe(include='number').T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Age at Injury,574026.0,42.11427,14.256432,0.0,31.0,42.0,54.0,117.0
Average Weekly Wage,545375.0,491.0883,6092.91812,0.0,0.0,0.0,841.0,2828079.0
Birth Year,544948.0,1886.768,414.644423,0.0,1965.0,1977.0,1989.0,2018.0
Claim Identifier,574026.0,5778956.0,222308.226013,5393066.0,5586764.25,5778282.5,5971328.75,6165685.0
IME-4 Count,132803.0,3.207337,2.832303,1.0,1.0,2.0,4.0,73.0
Industry Code,564068.0,58.64531,19.644175,11.0,45.0,61.0,71.0,92.0
OIICS Nature of Injury Description,0.0,,,,,,,
WCIO Cause of Injury Code,558386.0,54.38114,25.874281,1.0,31.0,56.0,75.0,99.0
WCIO Nature of Injury Code,558369.0,41.01384,22.207521,1.0,16.0,49.0,52.0,91.0
WCIO Part Of Body Code,556944.0,39.73815,22.36594,-9.0,33.0,38.0,53.0,99.0


## 1.4 Incoherencies <a name="14-incoherencies"></a>

In [34]:
wcb.dtypes

Accident Date                          object
Age at Injury                         float64
Alternative Dispute Resolution         object
Assembly Date                          object
Attorney/Representative                object
Average Weekly Wage                   float64
Birth Year                            float64
C-2 Date                               object
C-3 Date                               object
Carrier Name                           object
Carrier Type                           object
Claim Identifier                        int64
Claim Injury Type                      object
County of Injury                       object
COVID-19 Indicator                     object
District Name                          object
First Hearing Date                     object
Gender                                 object
IME-4 Count                           float64
Industry Code                         float64
Industry Code Description              object
Medical Fee Region                

### Datatypes:

**Numerical Data: <br>
float -> int** <br>
`Age at Injury` <br>
`Birth Year` <br>
`IME-4 Count`  <br>
`Number of Dependents` <br>
`WCIO Cause of Injury Code` <br>
`WCIO Nature of Injury Code` <br>
`WCIO Part Of Body Code` <br>
`Industry Code` <br>


**Object -> Dates** <br>
`C-2 Date` <br>
`C-3 Date` <br>
`First Hearing Date` <br>
`Accident Date`  <br>
`Assembly Date` <br>

### Change in datatypes

**Numeric data from float to integer:**

In [39]:
wcb_float_to_int = ['Age at Injury', 'Birth Year', 'IME-4 Count', 'Number of Dependents', 'WCIO Cause of Injury Code',
                    'WCIO Nature of Injury Code', 'WCIO Part Of Body Code', 'Industry Code']

for col in wcb_float_to_int:
    # Convert the column to numeric, then to Int64 (nullable integer type)
    X_train[col] = pd.to_numeric(X_train[col], errors='coerce').astype('Int64')
    X_val[col] = pd.to_numeric(X_val[col], errors='coerce').astype('Int64')
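
The float-to-integer conversion above relies on pandas' nullable `Int64` dtype, which can hold whole numbers alongside missing entries. A minimal sketch of that behaviour, using a toy column with made-up values rather than the real data:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) stored as float because of the missing entry
ages = pd.Series([31.0, 46.0, np.nan, 61.0], name="Age at Injury")

# Nullable integer dtype: whole numbers become integers, NaN becomes <NA>
ages_int = pd.to_numeric(ages, errors="coerce").astype("Int64")

print(ages_int.dtype)
print(ages_int.isna().sum())
```

The plain `int64` dtype would raise an error on the missing value, which is presumably why the nullable variant is used in the cell above.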
    

**Dates from object to datetime format:**

In [42]:
# Convert dates treated as objects to datetime format
X_train['C-2 Date'] = pd.to_datetime(X_train['C-2 Date'], errors='coerce')
X_train['C-3 Date'] = pd.to_datetime(X_train['C-3 Date'], errors='coerce')
X_train['Accident Date'] = pd.to_datetime(X_train['Accident Date'], errors='coerce')
X_train['First Hearing Date'] = pd.to_datetime(X_train['First Hearing Date'], errors='coerce')
X_train['Assembly Date'] = pd.to_datetime(X_train['Assembly Date'], errors='coerce')

In [43]:
# Convert dates treated as objects to datetime format
X_val['C-2 Date'] = pd.to_datetime(X_val['C-2 Date'], errors='coerce')
X_val['C-3 Date'] = pd.to_datetime(X_val['C-3 Date'], errors='coerce')
X_val['Accident Date'] = pd.to_datetime(X_val['Accident Date'], errors='coerce')
X_val['First Hearing Date'] = pd.to_datetime(X_val['First Hearing Date'], errors='coerce')
X_val['Assembly Date'] = pd.to_datetime(X_val['Assembly Date'], errors='coerce')

In [45]:
X_train['Gender'] = X_train['Gender'].replace('U', np.nan)

In [48]:
X_val['Gender'] = X_val['Gender'].replace('U', np.nan)

In [50]:
wcb.dtypes  # wcb itself is unchanged: the conversions above were applied to X_train and X_val

Accident Date                          object
Age at Injury                         float64
Alternative Dispute Resolution         object
Assembly Date                          object
Attorney/Representative                object
Average Weekly Wage                   float64
Birth Year                            float64
C-2 Date                               object
C-3 Date                               object
Carrier Name                           object
Carrier Type                           object
Claim Identifier                        int64
Claim Injury Type                      object
County of Injury                       object
COVID-19 Indicator                     object
District Name                          object
First Hearing Date                     object
Gender                                 object
IME-4 Count                           float64
Industry Code                         float64
Industry Code Description              object
Medical Fee Region                

# CLAIM IDENTIFIER REPEATS ONE VALUE

## 1.5 Exploring Data Visually <a name="15-exploring-data-visually"></a>

### Numerical Data Visualization

Since the variable `Claim Identifier` doesn't have any variance, we think it won't be helpful for the development of the model and should therefore be dropped.

### Categorical Data Visualization

Since the variable `WCB Decision` doesn't have any variance, we think it won't be helpful for the development of the model and should therefore be dropped.

## 1.6 Pair-wise Relationships <a name="16-pair-wise-relationships"></a>


In [60]:
wcb.shape

(574026, 33)

# 2. Clean and Pre-process the Data <a name="clean-and-pre-process-the-data"></a>

## 2.1 Missing Values <a name="21-missing-values"></a>

`Step 1` **Disposable rows and columns**

Check the percentage of non-null values for each feature:

In [66]:
# Calculate the non-null percentage and null counts
non_null_percentage = X_train.notna().mean() * 100
null_counts = X_train.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_train.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)

Unnamed: 0,Variable,Non-Null Percentage,Null Values
31,Number of Dependents,100.0,0
14,District Name,100.0,0
2,Alternative Dispute Resolution,100.0,0
3,Assembly Date,100.0,0
4,Attorney/Representative,100.0,0
30,WCB Decision,100.0,0
29,Agreement Reached,100.0,0
20,Medical Fee Region,100.0,0
1,Age at Injury,100.0,0
9,Carrier Name,100.0,0
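
The same non-null summary is recomputed several times in this section. A minimal sketch of how it could be wrapped in a reusable function; `null_summary` is a hypothetical name and the toy frame is illustrative only:

```python
import numpy as np
import pandas as pd

def null_summary(df):
    """Per-column non-null percentage and null count, most complete first."""
    summary = pd.DataFrame({
        "Variable": df.columns,
        "Non-Null Percentage": df.notna().mean().values * 100,
        "Null Values": df.isna().sum().values,
    })
    return summary.sort_values("Non-Null Percentage", ascending=False)

# Toy frame: column "a" is complete, column "b" is 75% missing
toy = pd.DataFrame({"a": [1, 2, 3, 4], "b": [1.0, np.nan, np.nan, np.nan]})
summary = null_summary(toy)
print(summary)
```

Calling `null_summary(X_train)` and `null_summary(X_val)` would then reproduce the tables shown in this section.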


**Drop column with all null values:**

In [70]:
X_train.drop(columns=['OIICS Nature of Injury Description'], inplace=True)

In [72]:
X_val.drop(columns=['OIICS Nature of Injury Description'], inplace=True)

Check how many rows with missing values there are throughout all dataset columns: <br>
*Excluding 'Assembly Date' and 'Claim Identifier', which have values for every row but aren't relevant enough to keep if all others are NaN*

In [75]:
# Exclude columns 'Assembly Date' and 'Claim Identifier' (X_train and X_val share the same columns)
nan_columns = X_train.columns.drop(['Assembly Date', 'Claim Identifier'])
# Count how many rows are NaN or 0 in every selected column (only the last expression is displayed)
((X_train[nan_columns].isnull()) | (X_train[nan_columns] == 0)).all(axis=1).sum()
((X_val[nan_columns].isnull()) | (X_val[nan_columns] == 0)).all(axis=1).sum()


# nothing is dropped here because these rows had to be removed above, before the split


0
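
The check above treats a row as empty when every selected column is either NaN or 0. A small sketch on made-up data shows how that mask behaves, and how it differs from `dropna(how='all')`, which only looks at NaN:

```python
import numpy as np
import pandas as pd

# Toy claims (hypothetical values): row 1 is all NaN, row 2 is NaN or 0
toy = pd.DataFrame({
    "Assembly Date": ["2020-01-01", "2020-01-01", "2020-01-02"],
    "Claim Identifier": [1, 2, 3],
    "Age at Injury": [31.0, np.nan, 0.0],
    "Gender": ["M", np.nan, np.nan],
})

# Ignore the two columns that are always populated
check_cols = toy.columns.drop(["Assembly Date", "Claim Identifier"])

# A row counts as empty when every remaining column is NaN or 0
is_empty = (toy[check_cols].isna() | (toy[check_cols] == 0)).all(axis=1)
n_empty = int(is_empty.sum())

# dropna(how="all") ignores zeros, so it keeps row 2
kept = toy.dropna(subset=list(check_cols), how="all")
print(n_empty, len(kept))
```

If zeros should also count as missing, the boolean mask is the stricter of the two criteria.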

In [76]:
'''# Exclude columns 'Assembly Date' and 'Claim Identifier'
nan_columns = X_train.columns.drop(['Assembly Date', 'Claim Identifier'])

# Identify the rows to drop (all NaN or 0 in the selected columns)
rows_to_drop = X_train[((X_train[nan_columns].isnull()) | (X_train[nan_columns] == 0)).all(axis=1)].index

# Remove those rows from X_train
X_train = X_train.drop(rows_to_drop)

# Remove the same rows from y_train
y_train = y_train.drop(rows_to_drop)

# (Optional) Reset the indices, if needed
X_train = X_train.reset_index(drop=True)
y_train = y_train.reset_index(drop=True)
'''

# since no rows are being dropped, this code is not needed

**Drop rows with all NaN values, identified in previous step for the selected columns (all but 2):**

In [80]:
X_train = X_train.dropna(subset = nan_columns, how = 'all')
X_val = X_val.dropna(subset = nan_columns, how = 'all')

In [81]:
X_train.shape
# Calculate the non-null percentage and null counts
non_null_percentage = X_train.notna().mean() * 100
null_counts = X_train.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_train.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
14,District Name,100.0,0
2,Alternative Dispute Resolution,100.0,0
3,Assembly Date,100.0,0
4,Attorney/Representative,100.0,0
29,WCB Decision,100.0,0
28,Agreement Reached,100.0,0
20,Medical Fee Region,100.0,0
1,Age at Injury,100.0,0
9,Carrier Name,100.0,0


In [84]:
X_val.shape
# Calculate the non-null percentage and null counts
non_null_percentage = X_val.notna().mean() * 100
null_counts = X_val.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_val.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
14,District Name,100.0,0
2,Alternative Dispute Resolution,100.0,0
3,Assembly Date,100.0,0
4,Attorney/Representative,100.0,0
29,WCB Decision,100.0,0
28,Agreement Reached,100.0,0
20,Medical Fee Region,100.0,0
1,Age at Injury,100.0,0
9,Carrier Name,100.0,0


In [86]:
wcb.shape

(574026, 33)

In [88]:
# Drop rows of X_train that have missing values in the specified columns
X_train = X_train.dropna(subset=['WCIO Part Of Body Description'])
X_train = X_train.dropna(subset=['WCIO Cause of Injury Description'])
X_train = X_train.dropna(subset=['WCIO Nature of Injury Description'])
X_train = X_train.dropna(subset=['Industry Code'])

# Keep the indices of y_train aligned with those of X_train
y_train = y_train.loc[X_train.index]

In [89]:
# Drop rows of X_val that have missing values in the specified columns
X_val = X_val.dropna(subset=['WCIO Part Of Body Description'])
X_val = X_val.dropna(subset=['WCIO Cause of Injury Description'])
X_val = X_val.dropna(subset=['WCIO Nature of Injury Description'])
X_val = X_val.dropna(subset=['Industry Code'])

# Keep the indices of y_val aligned with those of X_val
y_val = y_val.loc[X_val.index]
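
Whenever rows are dropped from the features, the target has to be filtered with the same index, otherwise the two no longer line up. A minimal sketch of that alignment step on toy data (values are made up):

```python
import numpy as np
import pandas as pd

# Toy features and target sharing a non-contiguous index, as after a split
X = pd.DataFrame({"Industry Code": [44.0, np.nan, 62.0]}, index=[10, 11, 12])
y = pd.Series(["2. NON-COMP", "4. TEMPORARY", "2. NON-COMP"], index=[10, 11, 12])

# Drop the incomplete row from the features ...
X = X.dropna(subset=["Industry Code"])

# ... then keep only the matching labels
y = y.loc[X.index]

print(X.shape, y.shape)
```

Checking `X.index.equals(y.index)` after each dropping step is a cheap safeguard against silently misaligned labels.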

In summary: <br>
We **treated missing values for a total of 15 features** with this operation.

**Columns to drop** = 1, `OIICS Nature of Injury Description` <br>
**Rows to drop in total** = 19445 <br>

By removing all-null rows (ignoring `Assembly Date` and `Claim Identifier`), all the null values of the target variable `Claim Injury Type` are also removed, so no separate step is needed to delete them.

**Change in 'IME-4 Count' null values**

Since `IME-4 Count` contains only non-zero integers, we assume that a null value means that no independent medical examination was requested; therefore, all null values are replaced with 0.

In [95]:
X_train['IME-4 Count'] = X_train['IME-4 Count'].fillna(0)
X_train['IME-4 Count'].value_counts()
X_train['IME-4 Count'].quantile(0.5)

0.0

In [97]:
X_val['IME-4 Count'] = X_val['IME-4 Count'].fillna(0)
X_val['IME-4 Count'].value_counts()
X_val['IME-4 Count'].quantile(0.5)

0.0

In [99]:
# Calculate the non-null percentage and null counts
non_null_percentage = X_train.notna().mean() * 100
null_counts = X_train.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_train.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
24,WCIO Nature of Injury Description,100.0,0
22,WCIO Cause of Injury Description,100.0,0
21,WCIO Cause of Injury Code,100.0,0
20,Medical Fee Region,100.0,0
19,Industry Code Description,100.0,0
18,Industry Code,100.0,0
17,IME-4 Count,100.0,0
25,WCIO Part Of Body Code,100.0,0
1,Age at Injury,100.0,0


In [100]:
# Calculate the non-null percentage and null counts
non_null_percentage = X_val.notna().mean() * 100
null_counts = X_val.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_val.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
24,WCIO Nature of Injury Description,100.0,0
22,WCIO Cause of Injury Description,100.0,0
21,WCIO Cause of Injury Code,100.0,0
20,Medical Fee Region,100.0,0
19,Industry Code Description,100.0,0
18,Industry Code,100.0,0
17,IME-4 Count,100.0,0
25,WCIO Part Of Body Code,100.0,0
1,Age at Injury,100.0,0


**Since the median value of `Average Weekly Wage` is 0, and more than half of the values of this variable are also 0, we think that replacing null values with the median is the most appropriate approach.** <br>
The mean would be a plausible value under real-world conditions; however, the values range from 0 to 2.8 million, which makes the mean unreliable.

In [104]:
X_train['Average Weekly Wage'] = X_train['Average Weekly Wage'].fillna(0)

In [106]:
X_val['Average Weekly Wage'] = X_val['Average Weekly Wage'].fillna(0)
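
The choice of the median over the mean for `Average Weekly Wage` can be illustrated with a toy series. The values below are made up, with one extreme entry standing in for the 2.8 million maximum:

```python
import numpy as np
import pandas as pd

# Toy wages (hypothetical): mostly zeros, one extreme value, one missing
wages = pd.Series([0.0, 0.0, 0.0, 0.0, 841.0, 2_828_079.0, np.nan])

mean_wage = wages.mean()      # pulled far upwards by the single extreme value
median_wage = wages.median()  # unaffected by the extreme value

# Impute the missing entry with the median
filled = wages.fillna(median_wage)

print(mean_wage, median_wage)
```

In this sketch the mean exceeds 470,000 while the median stays at 0, so mean imputation would assign an implausible wage to every missing claim.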

**Since the median difference between `Assembly Date` and `C-2 Date` is 0 days and the mean is 4 days, we believe that the best approach is to fill the null values of `C-2 Date` with the corresponding values of `Assembly Date`.** <br>
We chose `Assembly Date` to replace the null values because these two variables are highly correlated (we show this further on in the code).

In [109]:
# Fill missing C-2 dates with the Assembly Date of the same row
X_train['C-2 Date'] = X_train['C-2 Date'].fillna(X_train['Assembly Date'])

In [110]:
# Fill missing C-2 dates with the Assembly Date of the same row
X_val['C-2 Date'] = X_val['C-2 Date'].fillna(X_val['Assembly Date'])
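
The gap statistics quoted above are not computed in the cells shown here. A sketch of how they could be obtained, and of the resulting fill, on toy dates chosen purely for illustration:

```python
import pandas as pd

# Toy claims (hypothetical dates); the last C-2 Date is missing
toy = pd.DataFrame({
    "Assembly Date": pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2020-01-10", "2020-01-12"]),
    "C-2 Date": pd.to_datetime(
        ["2020-01-01", "2020-01-05", "2019-12-29", None]),
})

# Gap in days between assembly and C-2 receipt; missing dates give NaN
gap_days = (toy["Assembly Date"] - toy["C-2 Date"]).dt.days
median_gap = gap_days.median()
mean_gap = gap_days.mean()

# Fill each missing C-2 Date with the Assembly Date of the same row
filled = toy["C-2 Date"].fillna(toy["Assembly Date"])

print(median_gap, mean_gap)
```

A median gap of 0 means that, for at least half of the claims, both dates coincide, which is what makes `Assembly Date` a reasonable stand-in.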

In [111]:
X_train.shape

(444027, 31)

In [112]:
y_train.shape

(444027,)

**Encode the variable `Gender` as numeric, where 0 is Male, 1 is Female and 2 is Non-Binary. Null values are filled with the mode, which is Male in this case.**

In [115]:
X_train['Gender'] = X_train['Gender'].fillna(0)

X_train['Gender'] = X_train['Gender'].replace({'M': 0, 'F': 1, 'X': 2}).astype('Int64')

In [116]:
X_val['Gender'] = X_val['Gender'].fillna(0)

X_val['Gender'] = X_val['Gender'].replace({'M': 0, 'F': 1, 'X': 2}).astype('Int64')
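
The cells above hard-code 0 as the fill value. A possible alternative, sketched here on made-up data, is to compute the mode from the training set and reuse it for the validation set, so the fill value follows the data. The series names are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels; None marks a missing value.
train_gender = pd.Series(["M", "F", "M", None, "X", "M"])
val_gender = pd.Series(["F", None, "F"])

gender_codes = {"M": 0, "F": 1, "X": 2}

# Most frequent label in the training data (missing values are ignored).
train_mode = train_gender.mode().iloc[0]

# Fill with the training mode first, then map labels to integer codes.
train_encoded = train_gender.fillna(train_mode).map(gender_codes).astype("Int64")
val_encoded = val_gender.fillna(train_mode).map(gender_codes).astype("Int64")

print(train_encoded.tolist(), val_encoded.tolist())
```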

**Fill missing `Zip Code` values with "Unknown"**

In [123]:
X_train['Zip Code'] = X_train['Zip Code'].fillna('Unknown')

In [125]:
X_val['Zip Code'] = X_val['Zip Code'].fillna('Unknown')

In [127]:
# Calculate the non-null percentage and null counts
non_null_percentage = X_train.notna().mean() * 100
null_counts = X_train.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_train.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)    

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
14,District Name,100.0,0
25,WCIO Part Of Body Code,100.0,0
24,WCIO Nature of Injury Description,100.0,0
23,WCIO Nature of Injury Code,100.0,0
22,WCIO Cause of Injury Description,100.0,0
21,WCIO Cause of Injury Code,100.0,0
20,Medical Fee Region,100.0,0
19,Industry Code Description,100.0,0
18,Industry Code,100.0,0


In [129]:
# Calculate the non-null percentage and null counts
non_null_percentage = X_val.notna().mean() * 100
null_counts = X_val.isna().sum()

# Convert to a DataFrame for better display
non_null_df = pd.DataFrame({
    'Variable': X_val.columns,
    'Non-Null Percentage': non_null_percentage.values,
    'Null Values': null_counts.values,})

# Sort the DataFrame by Non-Null Percentage
non_null_df.sort_values('Non-Null Percentage', ascending=False)    

Unnamed: 0,Variable,Non-Null Percentage,Null Values
30,Number of Dependents,100.0,0
14,District Name,100.0,0
25,WCIO Part Of Body Code,100.0,0
24,WCIO Nature of Injury Description,100.0,0
23,WCIO Nature of Injury Code,100.0,0
22,WCIO Cause of Injury Description,100.0,0
21,WCIO Cause of Injury Code,100.0,0
20,Medical Fee Region,100.0,0
19,Industry Code Description,100.0,0
18,Industry Code,100.0,0


## 2.2 Duplicates <a name="22-duplicates"></a>

In [132]:
X_train[['Industry Code', 'Industry Code Description']].nunique()

X_train.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]

# Drop duplicate codes, keeping only the first occurrence of each code
wcb_unicos = X_train.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]

# Count how often each description appears among the unique codes
descricao_repetidas = wcb_unicos['Industry Code Description'].value_counts()

# Keep only the descriptions that are shared by more than one code
descricao_repetidas[descricao_repetidas > 1]


# X_train[X_train['Industry Code Description'].isin(['MANUFACTURING', 'RETAIL TRADE', 'TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]
X_train[X_train['Industry Code Description'].isin(['TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]

X_train['Industry Code'] = X_train['Industry Code'].replace({45: 44, 32: 31, 33: 31, 49: 48})

In [134]:
X_val[['Industry Code', 'Industry Code Description']].nunique()

X_val.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]

# Drop duplicate codes, keeping only the first occurrence of each code
wcb_unicos = X_val.drop_duplicates(subset=['Industry Code'])[['Industry Code', 'Industry Code Description']]

# Count how often each description appears among the unique codes
descricao_repetidas = wcb_unicos['Industry Code Description'].value_counts()

# Keep only the descriptions that are shared by more than one code
descricao_repetidas[descricao_repetidas > 1]


# X_train[X_train['Industry Code Description'].isin(['MANUFACTURING', 'RETAIL TRADE', 'TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]
X_val[X_val['Industry Code Description'].isin(['TRANSPORTATION AND WAREHOUSING'])][['Industry Code', 'Industry Code Description']]

X_val['Industry Code'] = X_val['Industry Code'].replace({45: 44, 32: 31, 33: 31, 49: 48})
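
The check above (several codes sharing one description) can also be expressed with a `groupby`. The sketch below uses a made-up frame. Keeping the smallest code per description is an assumption, chosen because it matches the replacements applied above (45 to 44, 32 and 33 to 31).

```python
import pandas as pd

# Hypothetical toy frame: two descriptions are shared by several codes.
df = pd.DataFrame({
    "Industry Code": [44, 45, 31, 32, 33, 62],
    "Industry Code Description": [
        "RETAIL TRADE", "RETAIL TRADE",
        "MANUFACTURING", "MANUFACTURING", "MANUFACTURING",
        "HEALTH CARE AND SOCIAL ASSISTANCE",
    ],
})

# Number of distinct codes attached to each description.
codes_per_description = (
    df.groupby("Industry Code Description")["Industry Code"].nunique()
)
duplicated = codes_per_description[codes_per_description > 1]
print(duplicated)

# Collapse each group onto its smallest code (assumed canonical code).
df["Industry Code"] = (
    df.groupby("Industry Code Description")["Industry Code"].transform("min")
)
print(df)
```

Deriving the mapping from the training set and applying the same mapping to the validation set would keep both sets consistent.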

In [136]:
X_train[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']].nunique()

X_train.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

# Drop duplicate codes, keeping only the first occurrence of each code
wcb_unicos = X_train.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

# Count how often each description appears among the unique codes
descricao_repetidas = wcb_unicos['WCIO Cause of Injury Description'].value_counts()

# Keep only the descriptions that are shared by more than one code
descricao_repetidas[descricao_repetidas > 1]

#X_train[X_train['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED', 'REPETITIVE MOTION'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

X_train[X_train['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

X_train['WCIO Cause of Injury Code'] = X_train['WCIO Cause of Injury Code'].replace({79: 17, 66: 17, 97: 94})

In [138]:
X_val[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']].nunique()

X_val.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

# Drop duplicate codes, keeping only the first occurrence of each code
wcb_unicos = X_val.drop_duplicates(subset=['WCIO Cause of Injury Code'])[['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

# Count how often each description appears among the unique codes
descricao_repetidas = wcb_unicos['WCIO Cause of Injury Description'].value_counts()

# Keep only the descriptions that are shared by more than one code
descricao_repetidas[descricao_repetidas > 1]

#X_train[X_train['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED', 'REPETITIVE MOTION'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

X_val[X_val['WCIO Cause of Injury Description'].isin(['OBJECT BEING LIFTED OR HANDLED'])][['WCIO Cause of Injury Code', 'WCIO Cause of Injury Description']]

X_val['WCIO Cause of Injury Code'] = X_val['WCIO Cause of Injury Code'].replace({79: 17, 66: 17, 97: 94})

## 2.3 Outliers <a name="23-outliers"></a>

In [141]:
print(X_val['Age at Injury'].quantile(0.995))
print(X_val['Age at Injury'].quantile(0.005))

X_val['Age at Injury'] = X_val.apply(lambda x: 16 if 16 >= x['Age at Injury'] else x['Age at Injury'], axis=1)
X_val['Age at Injury'] = X_val.apply(lambda x: 85 if 85 <= x['Age at Injury'] else x['Age at Injury'], axis=1)

75.0
16.0
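
The two row-wise `apply` calls cap `Age at Injury` at 16 and 85. A vectorised equivalent is `Series.clip`, sketched here on made-up ages. The bounds 16 and 85 are taken from the cell above.

```python
import pandas as pd

# Hypothetical ages, including implausibly low and high values.
ages = pd.Series([0, 14, 16, 40, 85, 117])

# Values below 16 become 16 and values above 85 become 85.
clipped = ages.clip(lower=16, upper=85)
print(clipped.tolist())
```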


In [142]:
Q1 = X_val['Accident Date'].quantile(0.25)
Q3 = X_val['Accident Date'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower outlier bound as a date (time of day dropped)
lower_bound = pd.to_datetime((Q1 - pd.Timedelta(days=1.5 * IQR.days)).strftime('%Y-%m-%d'))

# Keep only the rows whose 'Accident Date' is on or after the lower bound
X_val = X_val[(X_val['Accident Date'] >= lower_bound)]

# Keep y_val aligned with the filtered X_val
y_val = y_val.loc[X_val.index]

In [143]:
'''box_plot_features = ['Age at Injury', 'Average Weekly Wage', 'Claim Injury Type', 'First Hearing Date', 'Accident Date', 'Assembly Date']


sns.set()
fig, axes = plt.subplots(2, ceil(len(box_plot_features) / 2), figsize=(40, 11))
for ax, feat in zip(axes.flatten(), box_plot_features):
    sns.boxplot(x=wcb[feat], ax=ax)

title = "Numeric Variables' Box Plots"
plt.suptitle(title)
plt.show()'''

In [144]:
Q1 = X_train['Accident Date'].quantile(0.25)
Q3 = X_train['Accident Date'].quantile(0.75)
IQR = Q3 - Q1

# Define the lower outlier bound as a date (time of day dropped)
lower_bound = pd.to_datetime((Q1 - pd.Timedelta(days=1.5 * IQR.days)).strftime('%Y-%m-%d'))

# Keep only the rows whose 'Accident Date' is on or after the lower bound
X_train = X_train[(X_train['Accident Date'] >= lower_bound)]

# Keep y_train aligned with the filtered X_train
y_train = y_train.loc[X_train.index]
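
The IQR rule used above also works directly on datetime values, because subtracting two timestamps gives a `Timedelta` that can be scaled. The sketch below runs on made-up dates and labels. It is a simplified variant, not the exact computation above, since it does not round the IQR down to whole days before scaling.

```python
import pandas as pd

# Hypothetical accident dates with one very old record, plus matching labels.
dates = pd.Series(pd.to_datetime([
    "1980-01-01", "2020-01-01", "2020-07-01",
    "2021-01-01", "2021-07-01", "2022-01-01",
]))
labels = pd.Series(list("abcdef"))

q1 = dates.quantile(0.25)
q3 = dates.quantile(0.75)
iqr = q3 - q1  # a Timedelta

# Lower bound at midnight, so the comparison ignores the time of day.
lower_bound = (q1 - 1.5 * iqr).normalize()

# Rows with a missing date compare as False and are therefore dropped too.
dates_kept = dates[dates >= lower_bound]

# Keep the target aligned with the filtered rows.
labels_kept = labels.loc[dates_kept.index]
print(lower_bound, labels_kept.tolist())
```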

## 2.4 Categorical Data <a name="24-categorical-data"></a>

**Turn 'C-3 Date' into a binary outcome feature, where 0 = "no form received" and 1 = "at least 1 form received"**

In [151]:
X_train['C-3 Date'] = X_train['C-3 Date'].apply(lambda x: 0 if pd.isna(x) else 1)

In [152]:
X_val['C-3 Date'] = X_val['C-3 Date'].apply(lambda x: 0 if pd.isna(x) else 1)

**Turn 'First Hearing Date' into a binary outcome feature, where 0 = "there was no hearing" and 1 = "there was a hearing"**

In [156]:
X_train['First Hearing Date'] = X_train['First Hearing Date'].apply(lambda x: 0 if pd.isna(x) else 1)

In [158]:
X_val['First Hearing Date'] = X_val['First Hearing Date'].apply(lambda x: 0 if pd.isna(x) else 1)
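
The presence/absence indicators built above with `apply` can be obtained in a vectorised way with `notna()`. A minimal sketch on a made-up series:

```python
import pandas as pd

# Hypothetical hearing dates; None means no hearing was recorded.
hearing = pd.Series(pd.to_datetime(["2021-03-01", None, "2022-06-15", None]))

# True where a date exists, converted to 1/0.
had_hearing = hearing.notna().astype(int)
print(had_hearing.tolist())
```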

In [161]:
X_train['Alternative Dispute Resolution'] = X_train['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1})

In [163]:
X_val['Alternative Dispute Resolution'] = X_val['Alternative Dispute Resolution'].replace({'N': 0, 'Y': 1})

**Turn `COVID-19 Indicator` into a binary outcome feature, where N:0 and Y:1**

In [168]:
X_train['COVID-19 Indicator'] = X_train['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})

In [170]:
X_val['COVID-19 Indicator'] = X_val['COVID-19 Indicator'].replace({'N': 0, 'Y': 1})

In [173]:
X_train['Carrier Type'] = X_train['Carrier Type'].replace({
    '1A. PRIVATE': 1,
    '2A. SIF': 2,
    '3A. SELF PUBLIC': 3,
    '4A. SELF PRIVATE': 4,
    '5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)': 5,
    '5C. SPECIAL FUND - POI CARRIER WCB MENANDS': 5,
    '5D. SPECIAL FUND - UNKNOWN': 5,
    'UNKNOWN': 0
}).astype('Int64')
X_train['Carrier Type'].value_counts()

Carrier Type
1    219792
3     93736
2     86467
4     37997
5       513
0       257
Name: count, dtype: Int64

In [175]:
X_val['Carrier Type'] = X_val['Carrier Type'].replace({
    '1A. PRIVATE': 1,
    '2A. SIF': 2,
    '3A. SELF PUBLIC': 3,
    '4A. SELF PRIVATE': 4,
    '5A. SPECIAL FUND - CONS. COMM. (SECT. 25-A)': 5,
    '5C. SPECIAL FUND - POI CARRIER WCB MENANDS': 5,
    '5D. SPECIAL FUND - UNKNOWN': 5,
    'UNKNOWN': 0
}).astype('Int64')
X_val['Carrier Type'].value_counts()

Carrier Type
1    54943
3    23441
2    21702
4     9449
5      136
0       65
Name: count, dtype: Int64

In [177]:
X_train['Attorney/Representative'] = X_train['Attorney/Representative'].replace({'N': 0, 'Y': 1})

In [179]:
X_val['Attorney/Representative'] = X_val['Attorney/Representative'].replace({'N': 0, 'Y': 1})

In [181]:
y_train = y_train.replace({
    '1. CANCELLED': 1, 
    '2. NON-COMP': 2, 
    '3. MED ONLY': 3, 
    '4. TEMPORARY': 4, 
    '5. PPD SCH LOSS': 5, 
    '6. PPD NSL': 6, 
    '7. PTD': 7, 
    '8. DEATH': 8
})



In [183]:
y_val = y_val.replace({
    '1. CANCELLED': 1, 
    '2. NON-COMP': 2, 
    '3. MED ONLY': 3, 
    '4. TEMPORARY': 4, 
    '5. PPD SCH LOSS': 5, 
    '6. PPD NSL': 6, 
    '7. PTD': 7, 
    '8. DEATH': 8
})



In [185]:
X_train['District Name'] = X_train['District Name'].replace({
    'NYC': 1,
    'ALBANY': 2,
    'HAUPPAUGE': 3,
    'BUFFALO': 4,
    'SYRACUSE': 5,
    'ROCHESTER': 6,
    'BINGHAMTON': 7,
    'STATEWIDE': 8
})

In [187]:
X_val['District Name'] = X_val['District Name'].replace({
    'NYC': 1,
    'ALBANY': 2,
    'HAUPPAUGE': 3,
    'BUFFALO': 4,
    'SYRACUSE': 5,
    'ROCHESTER': 6,
    'BINGHAMTON': 7,
    'STATEWIDE': 8
})

In [189]:
X_train['Medical Fee Region'] = X_train['Medical Fee Region'].replace({
    'I': 1,
    'II': 2,
    'III': 3,
    'IV': 4,
    'UK': 5
})


In [191]:
X_val['Medical Fee Region'] = X_val['Medical Fee Region'].replace({
    'I': 1,
    'II': 2,
    'III': 3,
    'IV': 4,
    'UK': 5
})


In [193]:
X_train['WCIO Part Of Body Code'] = X_train['WCIO Part Of Body Code'].replace({47: 23, 43: 22, 25: 18})

In [195]:
X_val['WCIO Part Of Body Code'] = X_val['WCIO Part Of Body Code'].replace({47: 23, 43: 22, 25: 18})

In [197]:
X_train['WCIO Part Of Body Code'] = X_train['WCIO Part Of Body Code'].replace(-9, 9)

In [199]:
X_val['WCIO Part Of Body Code'] = X_val['WCIO Part Of Body Code'].replace(-9, 9)

## 2.5 Aggregations <a name="25-aggregations"></a>

## 2.6 Feature Engineering <a name="26-feature-engineering"></a>

In [203]:
X = (X_train['Assembly Date'] - X_train['Accident Date']).dt.days
X.describe()

count    438762.000000
mean         30.360081
std          76.595037
min        -929.000000
25%           5.000000
50%           9.000000
75%          22.000000
max        1589.000000
dtype: float64

In [205]:
X = (X_val['Assembly Date'] - X_val['Accident Date']).dt.days
X.describe()

count    109736.000000
mean         30.291636
std          76.657304
min        -701.000000
25%           5.000000
50%           9.000000
75%          22.000000
max        1411.000000
dtype: float64

In [207]:
X_train['Accident Date'] = X_train.apply(lambda x: x['Assembly Date'] - pd.Timedelta(days=10) if pd.isna(x['Accident Date']) else x['Accident Date'], axis=1)

In [208]:
X_train['Accident Date'].describe()

count                           438762
mean     2021-06-21 00:55:45.631572480
min                2018-06-14 00:00:00
25%                2020-09-25 00:00:00
50%                2021-07-06 00:00:00
75%                2022-03-25 00:00:00
max                2023-09-29 00:00:00
Name: Accident Date, dtype: object

In [211]:
X_val['Accident Date'] = X_val.apply(lambda x: x['Assembly Date'] - pd.Timedelta(days=10) if pd.isna(x['Accident Date']) else x['Accident Date'], axis=1)

In [212]:
X_val['Accident Date'].describe()

count                           109736
mean     2021-06-21 08:08:30.461471488
min                2018-06-25 00:00:00
25%                2020-09-28 00:00:00
50%                2021-07-06 00:00:00
75%                2022-03-25 00:00:00
max                2023-09-11 00:00:00
Name: Accident Date, dtype: object

**Most of the missing `Birth Year` values can be computed by subtracting `Age at Injury` from the year of `Accident Date`. Not all of them can, since `Accident Date` still has some null values.**

In [216]:
X_train['Birth Year'] = X_train['Birth Year'].fillna(X_train['Accident Date'].dt.year - X_train['Age at Injury'])
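
A small worked example of this fill, on made-up rows, shows the arithmetic and why a row with a missing `Accident Date` stays unfilled. Only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical rows: the third one has no accident date.
df = pd.DataFrame({
    "Accident Date": pd.to_datetime(["2021-07-06", "2020-01-15", None]),
    "Age at Injury": [44, 30, 50],
    "Birth Year": [float("nan"), 1990.0, float("nan")],
})

# Estimated birth year = accident year - age at injury (NaN if the date is missing).
estimated = df["Accident Date"].dt.year - df["Age at Injury"]

# Only missing birth years are replaced; existing values are kept.
df["Birth Year"] = df["Birth Year"].fillna(estimated)
print(df)
```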

In [218]:
X_train['Birth Year'].describe()

count       438762.0
mean     1899.088959
std       387.386489
min              0.0
25%           1965.0
50%           1977.0
75%           1989.0
max           2022.0
Name: Birth Year, dtype: Float64

In [220]:
X_val['Birth Year'] = X_val['Birth Year'].fillna(X_val['Accident Date'].dt.year - X_val['Age at Injury'])

In [222]:
X_val['Birth Year'].describe()

count       109736.0
mean     1897.813015
std       390.446306
min              0.0
25%           1965.0
50%           1977.0
75%           1989.0
max           2018.0
Name: Birth Year, dtype: Float64

In [224]:
Y = (X_train['Assembly Date'] - X_train['C-2 Date']).dt.days
Y.describe()

count    438762.000000
mean         -6.709444
std          60.004484
min       -1395.000000
25%           0.000000
50%           0.000000
75%           0.000000
max        8847.000000
dtype: float64

In [226]:
Y = (X_val['Assembly Date'] - X_val['C-2 Date']).dt.days
Y.describe()

count    109736.000000
mean         -6.749180
std          52.162541
min       -1444.000000
25%           0.000000
50%           0.000000
75%           0.000000
max        7765.000000
dtype: float64

In [228]:
Z = (X_train['Assembly Date'] - X_train['Accident Date']).dt.days
Z.describe()

count    438762.000000
mean         30.360081
std          76.595037
min        -929.000000
25%           5.000000
50%           9.000000
75%          22.000000
max        1589.000000
dtype: float64

In [230]:
Z = (X_val['Assembly Date'] - X_val['Accident Date']).dt.days
Z.describe()

count    109736.000000
mean         30.291636
std          76.657304
min        -701.000000
25%           5.000000
50%           9.000000
75%          22.000000
max        1411.000000
dtype: float64

In [232]:
wcb.shape

(574026, 33)

In [236]:
'''import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import chi2_contingency

# Step 1: Select only categorical columns
categorical_df = wcb.select_dtypes(include=['object', 'datetime64[ns]'])

# Step 2: Define a function to calculate Cramér's V
def cramers_v(x, y):
    confusion_matrix = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum().sum()
    r, k = confusion_matrix.shape
    return np.sqrt(chi2 / (n * (min(r, k) - 1)))

# Step 3: Create an empty DataFrame to store Cramér's V values
cramers_v_matrix = pd.DataFrame(np.zeros((categorical_df.shape[1], categorical_df.shape[1])), 
                                columns=categorical_df.columns, 
                                index=categorical_df.columns)

# Step 4: Calculate Cramér's V for each pair of categorical variables
for col1 in categorical_df.columns:
    for col2 in categorical_df.columns:
        if col1 != col2:
            cramers_v_matrix.loc[col1, col2] = cramers_v(categorical_df[col1], categorical_df[col2])
        else:
            cramers_v_matrix.loc[col1, col2] = 1  # Set diagonal to 1

# Step 5: Apply a mask for values below 0.4
mask = (cramers_v_matrix > -0.4) & (cramers_v_matrix < 0.4)

# Step 6: Visualize the correlation matrix with masking
plt.figure(figsize=(10, 8))
sns.heatmap(cramers_v_matrix, annot=True, cmap='coolwarm', linewidths=0.5, mask=mask)

# Display the plot
plt.title('Cramér\'s V Correlation Matrix (Categorical Variables Only, Correlations ≥ 0.4)')
plt.show()

'''

In [238]:
print(X_train['Age at Injury'].quantile(0.995))
print(X_train['Age at Injury'].quantile(0.005))

X_train['Age at Injury'] = X_train.apply(lambda x: 16 if 16 >= x['Age at Injury'] else x['Age at Injury'], axis=1)
X_train['Age at Injury'] = X_train.apply(lambda x: 85 if 85 <= x['Age at Injury'] else x['Age at Injury'], axis=1)

75.0
18.0


In [239]:
X_train['Birth Year'] = X_train['Accident Date'].dt.year - X_train['Age at Injury']

In [240]:
X_val['Birth Year'] = X_val['Accident Date'].dt.year - X_val['Age at Injury']

In [241]:
X_train.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Carrier Name,438762,1900,STATE INSURANCE FUND,86467
County of Injury,438762,63,SUFFOLK,46111
Industry Code Description,438762,20,HEALTH CARE AND SOCIAL ASSISTANCE,89402
WCIO Cause of Injury Description,438762,74,LIFTING,36611
WCIO Nature of Injury Description,438762,56,STRAIN OR TEAR,120285
WCIO Part Of Body Description,438762,54,LOWER BACK AREA,40791
Zip Code,438762,7268,Unknown,21782
WCB Decision,438762,1,Not Work Related,438762


In [242]:
X_val.describe(include='object').T

Unnamed: 0,count,unique,top,freq
Carrier Name,109736,1592,STATE INSURANCE FUND,21702
County of Injury,109736,63,SUFFOLK,11595
Industry Code Description,109736,20,HEALTH CARE AND SOCIAL ASSISTANCE,22288
WCIO Cause of Injury Description,109736,74,LIFTING,9175
WCIO Nature of Injury Description,109736,55,STRAIN OR TEAR,30110
WCIO Part Of Body Description,109736,54,LOWER BACK AREA,10140
Zip Code,109736,3988,Unknown,5571
WCB Decision,109736,1,Not Work Related,109736


In [243]:
X_train['District Name'].value_counts()

District Name
1    206447
2     66464
3     46192
4     35052
5     34520
6     30607
7     17043
8      2437
Name: count, dtype: int64

In [244]:
X_val['District Name'].value_counts()

District Name
1    51470
2    16821
3    11608
4     8756
5     8711
6     7566
7     4190
8      614
Name: count, dtype: int64

## 2.7 Multivariate Relationships <a name="27-multivariate-relationships"></a>

## 2.8 Data Scaling <a name="28-data-scaling"></a>

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder
scalers = {
    'minmax': MinMaxScaler(),
    'standard': StandardScaler(),
    'ordinal': OrdinalEncoder()
}

# Apply the appropriate scaler/encoder to each column
def apply_scaling(X_val):
    for column in X_val.columns:
        num_unique_values = X_val[column].nunique()  # Number of unique values
        print(f'Column: {column}, Unique values: {num_unique_values}')

        # Choose the technique based on the number of unique values
        if num_unique_values <= 10:  # Few unique values (10 or fewer)
            X_val[column] = scalers['ordinal'].fit_transform(X_val[[column]])
        else:  # Many unique values
            # MinMaxScaler or StandardScaler could be used here, depending on the case
            X_val[column] = scalers['standard'].fit_transform(X_val[[column]])

    return X_val

# Apply the scaling to the DataFrame
X_val_scaled = apply_scaling(X_val)

# Show the first rows to check that the scaling was applied
print(X_val_scaled.head())

In [260]:
from sklearn.preprocessing import OrdinalEncoder

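# Fit on the training set only; categories unseen in training are encoded as -1
encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
X_train["County of Injury"] = encoder.fit_transform(X_train[["County of Injury"]])
X_val["County of Injury"] = encoder.transform(X_val[["County of Injury"]])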

In [262]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OrdinalEncoder
scalers = {
    'minmax': MinMaxScaler(),
    'standard': StandardScaler(),
    'ordinal': OrdinalEncoder()
}

# Apply the appropriate scaler/encoder to each column
def apply_scaling(X_train):
    for column in X_train.columns:
        num_unique_values = X_train[column].nunique()  # Number of unique values
        print(f'Column: {column}, Unique values: {num_unique_values}')

        # Choose the technique based on the number of unique values
        if num_unique_values <= 10:  # Few unique values (10 or fewer)
            X_train[column] = scalers['ordinal'].fit_transform(X_train[[column]])
        else:  # Many unique values
            # MinMaxScaler or StandardScaler could be used here, depending on the case
            X_train[column] = scalers['standard'].fit_transform(X_train[[column]])

    return X_train

# Apply the scaling to the DataFrame
X_train_scaled = apply_scaling(X_train)

# Show the first rows to check that the scaling was applied
print(X_train_scaled.head())


Column: Accident Date, Unique values: 1672
Column: Age at Injury, Unique values: 70
Column: Alternative Dispute Resolution, Unique values: 2
Column: Assembly Date, Unique values: 889
Column: Attorney/Representative, Unique values: 2
Column: Average Weekly Wage, Unique values: 102436
Column: Birth Year, Unique values: 74
Column: C-2 Date, Unique values: 1426
Column: C-3 Date, Unique values: 2
Column: Carrier Name, Unique values: 1900


ValueError: could not convert string to float: 'NEW YORK STATE MUNICIPAL'
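
The error above arises because `StandardScaler` receives a text column. One way around it, sketched below on made-up frames, is to scale only the numeric columns and to fit the scaler on the training set alone, reusing its statistics for the validation set. The frame names and values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frames with one numeric and one text column.
train = pd.DataFrame({"Age at Injury": [20.0, 40.0, 60.0],
                      "Carrier Name": ["A", "B", "A"]})
val = pd.DataFrame({"Age at Injury": [40.0, 80.0],
                    "Carrier Name": ["B", "C"]})

# Select the numeric columns; text columns are left untouched.
numeric_cols = train.select_dtypes(include="number").columns

# Learn mean and standard deviation from the training data only.
scaler = StandardScaler().fit(train[numeric_cols])

train_scaled = train.copy()
val_scaled = val.copy()
train_scaled[numeric_cols] = scaler.transform(train[numeric_cols])
val_scaled[numeric_cols] = scaler.transform(val[numeric_cols])

print(train_scaled)
print(val_scaled)
```

Working on copies also avoids overwriting the original frames, which the `apply_scaling` function does in place.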

In [264]:
from sklearn.preprocessing import LabelEncoder


# Initialise the LabelEncoder
label_encoder = LabelEncoder()

# Encode the 'Carrier Name' column
X_train['Carrier Name'] = label_encoder.fit_transform(X_train['Carrier Name'])
X_val['Carrier Name'] = label_encoder.fit_transform(X_val['Carrier Name'])
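
Calling `fit_transform` separately on the training and validation sets can assign different integers to the same carrier, because each fit builds its own category list. A possible alternative, sketched here on made-up names, swaps in `OrdinalEncoder` fitted on the training set and sends unseen validation categories to -1. The carrier `"HYPOTHETICAL CARRIER"` is invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy frames; the validation set contains an unseen carrier.
train = pd.DataFrame({"Carrier Name": [
    "STATE INSURANCE FUND", "NEW YORK STATE MUNICIPAL", "STATE INSURANCE FUND",
]})
val = pd.DataFrame({"Carrier Name": [
    "NEW YORK STATE MUNICIPAL", "HYPOTHETICAL CARRIER",
]})

# Unknown categories at transform time are encoded as -1 instead of raising.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_codes = encoder.fit_transform(train[["Carrier Name"]]).ravel()
val_codes = encoder.transform(val[["Carrier Name"]]).ravel()

print(train_codes.tolist(), val_codes.tolist())
```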



In [266]:
# Columns that must keep their original types
cols_to_exclude = ['Accident Date', 'Assembly Date', 'C-2 Date', 'Zip Code', 'Average Weekly Wage']

# Convert every other column to Int64
for col in X_train.columns:
    if col not in cols_to_exclude:
        X_train[col] = X_train[col].astype('Int64')

# Check the data types after the conversion
print(X_train.dtypes)


TypeError: cannot safely cast non-equivalent object to int64

In [268]:
# Columns that must keep their original types
cols_to_exclude = ['Accident Date', 'Assembly Date', 'C-2 Date', 'Zip Code', 'Average Weekly Wage']

# Convert every other column to Int64
for col in X_val.columns:
    if col not in cols_to_exclude:
        X_val[col] = X_val[col].astype('Int64')

# Check the data types after the conversion
print(X_val.dtypes)

ValueError: invalid literal for int() with base 10: 'TRANSPORTATION AND WAREHOUSING'

In [270]:
# Step 1: Drop the date columns
X_train.drop(['Accident Date', 'Assembly Date', 'C-2 Date'], axis=1, inplace=True)

# Step 2: Convert the year and month columns to Int64
X_train['Accident Year'] = X_train['Accident Year'].astype('Int64')
X_train['Accident Month'] = X_train['Accident Month'].astype('Int64')

X_train['Assembly Year'] = X_train['Assembly Year'].astype('Int64')
X_train['Assembly Month'] = X_train['Assembly Month'].astype('Int64')

X_train['C-2 Year'] = X_train['C-2 Year'].astype('Int64')
X_train['C-2 Month'] = X_train['C-2 Month'].astype('Int64')

# Check the data types after the conversion
print(X_train.dtypes)


KeyError: 'Accident Year'

In [None]:
# Step 1: Drop the date columns
X_val.drop(['Accident Date', 'Assembly Date', 'C-2 Date'], axis=1, inplace=True)

# Step 2: Convert the year and month columns to Int64
X_val['Accident Year'] = X_val['Accident Year'].astype('Int64')
X_val['Accident Month'] = X_val['Accident Month'].astype('Int64')

X_val['Assembly Year'] = X_val['Assembly Year'].astype('Int64')
X_val['Assembly Month'] = X_val['Assembly Month'].astype('Int64')

X_val['C-2 Year'] = X_val['C-2 Year'].astype('Int64')
X_val['C-2 Month'] = X_val['C-2 Month'].astype('Int64')

# Check the data types after the conversion
print(X_val.dtypes)


In [273]:
X_train['Accident Year'] = X_train['Accident Date'].dt.year
X_train['Accident Month'] = X_train['Accident Date'].dt.month

X_train['Assembly Year'] = X_train['Assembly Date'].dt.year
X_train['Assembly Month'] = X_train['Assembly Date'].dt.month

X_train['C-2 Year'] = X_train['C-2 Date'].dt.year
X_train['C-2 Month'] = X_train['C-2 Date'].dt.month


KeyError: 'Accident Date'

In [275]:
X_val['Accident Year'] = X_val['Accident Date'].dt.year
X_val['Accident Month'] = X_val['Accident Date'].dt.month

X_val['Assembly Year'] = X_val['Assembly Date'].dt.year
X_val['Assembly Month'] = X_val['Assembly Date'].dt.month

X_val['C-2 Year'] = X_val['C-2 Date'].dt.year
X_val['C-2 Month'] = X_val['C-2 Date'].dt.month
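
The two `KeyError`s above come from cell ordering: the date columns were dropped before the year and month features were derived from them. A minimal sketch of an order that avoids this, on a made-up frame with a single date column:

```python
import pandas as pd

# Hypothetical toy frame with one date column.
df = pd.DataFrame({"Accident Date": pd.to_datetime(["2021-07-06", "2022-03-25"])})

# 1) Derive the year and month features while the date column still exists.
df["Accident Year"] = df["Accident Date"].dt.year
df["Accident Month"] = df["Accident Date"].dt.month

# 2) Only then drop the original date column.
df = df.drop(columns=["Accident Date"])
print(df)
```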

In [277]:
X_train['Carrier Name'].nunique()

1900

In [279]:
X_val['Carrier Name'].nunique()

1592

In [281]:
X_train['Carrier Name'].unique()

array([1136,    1, 1255, ...,   76, 1459,  374])

In [283]:
print(X_train["Carrier Name"].dtype)


int64


In [285]:
X_val['Carrier Name'].unique()

<IntegerArray>
[1319,  238, 1104,  570, 1454,  327, 1078,  922,  498, 1325,
 ...
 1000,   61,   92, 1486, 1313,  478,  564,  682,  676,  506]
Length: 1592, dtype: Int64

In [287]:
print(X_val["Carrier Name"].dtype)

Int64


In [289]:
X_train['Agreement Reached']

65669     0.0
489008    0.0
502078    0.0
81536     0.0
453257    0.0
         ... 
486338    1.0
121348    0.0
459105    0.0
568426    0.0
198794    0.0
Name: Agreement Reached, Length: 438762, dtype: float64

In [291]:
X_val['Agreement Reached']

507028    0.0
22703     0.0
580762    0.0
219837    0.0
111908    0.0
         ... 
571153    0.0
459293    0.0
229483    0.0
269008    0.0
590323    0.0
Name: Agreement Reached, Length: 109736, dtype: float64

In [293]:
print(X_train.dtypes)

Age at Injury                        float64
Alternative Dispute Resolution       float64
Attorney/Representative              float64
Average Weekly Wage                  float64
Birth Year                           float64
C-3 Date                             float64
Carrier Name                           int64
Carrier Type                           Int64
Claim Identifier                       int64
County of Injury                     float64
COVID-19 Indicator                     int64
District Name                          int64
First Hearing Date                     int64
Gender                                 Int64
IME-4 Count                            Int64
Industry Code                          Int64
Industry Code Description             object
Medical Fee Region                     int64
WCIO Cause of Injury Code              Int64
WCIO Cause of Injury Description      object
WCIO Nature of Injury Code             Int64
WCIO Nature of Injury Description     object
WCIO Part 

In [None]:
print(X_train.nunique())

In [None]:
X_train.head()

In [None]:
X_val.head()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

def cor_heatmap(cor):
    # Work on a copy of the correlation matrix so the mask does not alter the original
    cor_copy = cor.copy()
    
    # Apply the mask: replace values below 0.5 with NaN (not shown in the heatmap)
    cor_copy[cor_copy < 0.5] = np.nan
    
    # Plot the heatmap
    plt.figure(figsize=(12,10))
    sns.heatmap(data = cor_copy, annot=True, cmap=plt.cm.Reds, fmt='.1f', cbar=True)
    plt.show()

In [None]:
cor_spearman = X_train_scaled.corr(method ='spearman')
cor_spearman

In [None]:
cor_heatmap(cor_spearman)

In [None]:
X_train

# 3. Feature Selection <a name="3- Feature Selection"></a>

## 3.1 Filter Methods <a name="31- Filter Methods"></a>

## 3.1.1 Univariate Variables <a name="311-univariate-variables"></a>

In [None]:

# Assumes X_train is already loaded in the working environment
# Identify the categorical columns
categorical_columns = X_train.select_dtypes(include='object').columns

# Compute the variance of the category frequencies for each categorical column
variance_results = {}

for col in categorical_columns:
    frequencies = X_train[col].value_counts()  # Count the frequency of each category
    variance_results[col] = frequencies.var()  # Variance of those frequencies

# Convert the results dictionary to a DataFrame for display
variance_df = pd.DataFrame(variance_results.items(), columns=["Variable", "Variance"])

# Show the result
print(variance_df)


In [None]:
# Assumes X_train is already loaded in the working environment
# Identify the numeric columns (the variable name is kept from the previous cell)
categorical_columns = X_train.select_dtypes(include='number').columns

# Compute the variance of the value frequencies for each numeric column
variance_results = {}

for col in categorical_columns:
    frequencies = X_train[col].value_counts()  # Count the frequency of each distinct value
    variance_results[col] = frequencies.var()  # Variance of those frequencies

# Convert the results dictionary to a DataFrame for display
variance_df = pd.DataFrame(variance_results.items(), columns=["Variable", "Variance"])

# Show the result
print(variance_df)


## 3.1.2 Correlation Indices <a name="312-correlation-indices"></a>

In [None]:
# Step 1: Select only numerical columns
numerical_df = X_train.select_dtypes(include='number')

# Step 2: Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Step 3: Visualize the correlation matrix
plt.figure(figsize=(10, 8))  # Optional: Adjusts the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

# Display the plot
plt.title('Correlation Matrix (Numerical Variables Only)')
plt.show()

In [None]:
# Step 1: Select only numerical columns
numerical_df = X_val.select_dtypes(include='number')

# Step 2: Calculate the correlation matrix
correlation_matrix = numerical_df.corr()

# Step 3: Visualize the correlation matrix
plt.figure(figsize=(10, 8))  # Optional: Adjusts the size of the plot
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

# Display the plot
plt.title('Correlation Matrix (Numerical Variables Only)')
plt.show()

In [None]:
X_train = X_train.drop(columns=["Number of Dependents", "WCB Decision", "Industry Code Description", "WCIO Cause of Injury Description", "WCIO Nature of Injury Description", "WCIO Part Of Body Description"])

In [None]:
X_val = X_val.drop(columns=["Number of Dependents", "WCB Decision", "Industry Code Description", "WCIO Cause of Injury Description", "WCIO Nature of Injury Description", "WCIO Part Of Body Description"])

In [None]:
X_train = X_train.drop(columns=["Zip Code"])

In [None]:
X_val = X_val.drop(columns=["Zip Code"])

**Test the model**

In [None]:
modelKNN = KNeighborsClassifier()

In [None]:
modelKNN.fit(X = X_train, y = y_train)

In [None]:
model = LogisticRegression()

In [None]:
labels_train = modelKNN.predict(X_train)
labels_val = modelKNN.predict(X_val)
labels_train

In [None]:
modelKNN.predict_proba(X_val)

In [None]:
print(modelKNN.score(X_train, y_train))
print(modelKNN.score(X_val, y_val))

In [None]:
log_model = LogisticRegression()

In [None]:
log_model.fit(X_train, y_train)

## 3.1.3 Chi-Square <a name="313-chi-square"></a>