# **TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS - PART 1: INITIAL INSPECTION**

## 1. Imports, Options and Ingestion

In [1]:
# importing the libraries
import pandas as pd
import numpy as np
from datetime import timedelta
from matplotlib import pyplot as plt
import re
import json

In [2]:
# setting the options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
pd.set_option('display.max_rows', None)
np.set_printoptions(threshold=np.inf)
plt.style.use('seaborn-v0_8-dark')

In [3]:
# importing the data
df = pd.read_csv('../Data/train_data.csv', sep=',')
df_test = pd.read_csv('../Data/test_data.csv', sep=',')




Just by importing the dataset, we get a warning saying column 29 has mixed data types - we will check this in a bit.
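One way to avoid this warning at load time is to declare the type of the offending column when reading the file. The snippet below is a minimal sketch, assuming the column in question is 'Zip Code' (as Section 4 finds) and using a small in-memory CSV with made-up rows instead of the real files:

```python
import io
import pandas as pd

# toy CSV standing in for train_data.csv (hypothetical rows)
csv_text = "Claim Identifier,Zip Code\n1,13662\n2,L1N 5\n3,\n"

# reading 'Zip Code' as a string keeps a single type for the whole column
sample = pd.read_csv(io.StringIO(csv_text), dtype={'Zip Code': str})

# every non-missing value is a str, while missing values stay missing
types = set(sample['Zip Code'].dropna().map(type))
print(types)
```

With this option the numeric zip codes are never parsed as floats, so they keep their original text form.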

## 2. Macro-inspection

In [4]:
# check the shape of the dataset
df.shape

(593471, 33)

In [5]:
# check the first rows of the dataset
df.head()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Identifier,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
0,2019-12-30,31.0,N,2020-01-01,N,0.0,1988.0,2019-12-31,,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,5393875,2. NON-COMP,ST. LAWRENCE,N,SYRACUSE,,M,,44.0,RETAIL TRADE,I,,27.0,FROM LIQUID OR GREASE SPILLS,10.0,CONTUSION,62.0,BUTTOCKS,13662.0,0.0,Not Work Related,1.0
1,2019-08-30,46.0,N,2020-01-01,Y,1745.93,1973.0,2020-01-01,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,5393091,4. TEMPORARY,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23.0,CONSTRUCTION,I,,97.0,REPETITIVE MOTION,49.0,SPRAIN OR TEAR,38.0,SHOULDER(S),14569.0,1.0,Not Work Related,4.0
2,2019-12-06,40.0,N,2020-01-01,N,1434.8,1979.0,2020-01-01,,INDEMNITY INSURANCE CO OF,1A. PRIVATE,5393889,4. TEMPORARY,ORANGE,N,ALBANY,,M,,56.0,ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMENT AND REMEDIAT,II,,79.0,OBJECT BEING LIFTED OR HANDLED,7.0,CONCUSSION,10.0,MULTIPLE HEAD INJURY,12589.0,0.0,Not Work Related,6.0
3,,,,2020-01-01,,,,,,,,957648180,,,,,,,,,,,,,,,,,,,,,
4,2019-12-30,61.0,N,2020-01-01,N,,1958.0,2019-12-31,,STATE INSURANCE FUND,2A. SIF,5393887,2. NON-COMP,DUTCHESS,N,ALBANY,,M,,62.0,HEALTH CARE AND SOCIAL ASSISTANCE,II,,16.0,"HAND TOOL, UTENSIL; NOT POWERED",43.0,PUNCTURE,36.0,FINGER(S),12603.0,0.0,Not Work Related,1.0


In [6]:
# check if there are any aggregation rows at the end of the dataset
df.tail()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Identifier,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
593466,,,,2022-12-31,,,,,,,,327160035,,,,,,,,,,,,,,,,,,,,,
593467,2022-12-13,72.0,N,2022-12-31,N,0.0,1950.0,2022-12-31,,TECHNOLOGY INSURANCE CO. INC.,1A. PRIVATE,6165075,2. NON-COMP,SULLIVAN,N,BINGHAMTON,,F,,48.0,TRANSPORTATION AND WAREHOUSING,I,,25.0,FROM DIFFERENT LEVEL (ELEVATION),90.0,MULTIPLE PHYSICAL INJURIES ONLY,-9.0,MULTIPLE,12779.0,0.0,Not Work Related,3.0
593468,,,,2022-12-31,,,,,,,,249875936,,,,,,,,,,,,,,,,,,,,,
593469,,,,2022-12-31,,,,,,,,120584215,,,,,,,,,,,,,,,,,,,,,
593470,,,,2022-12-31,,,,,,,,818961390,,,,,,,,,,,,,,,,,,,,,


In [7]:
# check the datatypes and null counts
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 593471 entries, 0 to 593470
Data columns (total 33 columns):
 #   Column                              Non-Null Count   Dtype  
---  ------                              --------------   -----  
 0   Accident Date                       570337 non-null  object 
 1   Age at Injury                       574026 non-null  float64
 2   Alternative Dispute Resolution      574026 non-null  object 
 3   Assembly Date                       593471 non-null  object 
 4   Attorney/Representative             574026 non-null  object 
 5   Average Weekly Wage                 545375 non-null  float64
 6   Birth Year                          544948 non-null  float64
 7   C-2 Date                            559466 non-null  object 
 8   C-3 Date                            187245 non-null  object 
 9   Carrier Name                        574026 non-null  object 
 10  Carrier Type                        574026 non-null  object 
 11  Claim Identifier          

**Data type analysis:**

Features that should be integers:
- 'Age at Injury'
- 'Birth Year'
- 'IME-4 Count'
- 'Industry Code'
- 'WCIO Cause of Injury Code'
- 'WCIO Nature of Injury Code'
- 'WCIO Part Of Body Code'
- 'Number of Dependents'

Features that should be booleans:
- 'Agreement Reached'

These features will be converted to the appropriate data types in [Section 4](#4-data-types).

### 2.1. Claim Identifier

This feature is the unique identifier of each claim - we will analyse this column in more depth in an attempt to assign it as the index of our dataframe.

In [8]:
# check if there are any duplicate values for this column
df[df['Claim Identifier'].duplicated(keep=False)]

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Identifier,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents
257901,,,,2021-05-21,,,,,,,,292668076,,,,,,,,,,,,,,,,,,,,,
526445,,,,2022-09-05,,,,,,,,292668076,,,,,,,,,,,,,,,,,,,,,


There is one duplicated 'Claim Identifier', and in both rows every value except 'Assembly Date' is NaN. <br>
We will rely on the default behaviour of pandas `duplicated()` (`keep='first'`) and drop the second appearance of the repeated 'Claim Identifier'. <br>
As we saw previously, this column has no null values, so we can set it as the dataframe index.

In [9]:
df = df[~df['Claim Identifier'].duplicated()].set_index('Claim Identifier').rename_axis(None)

In [10]:
# we set the index on the test data
df_test = df_test.set_index('Claim Identifier').rename_axis(None)
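The difference between `duplicated(keep=False)`, used for the inspection, and the default `duplicated()`, used for the removal, can be seen on a toy frame. The values below are made up for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    'Claim Identifier': [10, 20, 20, 30],
    'Assembly Date': ['2021-05-21', '2021-06-01', '2022-09-05', '2022-10-10'],
})

# keep=False marks every member of a duplicated group (useful for inspection)
all_dupes = toy['Claim Identifier'].duplicated(keep=False)

# the default keep='first' marks only the later repeats
later_dupes = toy['Claim Identifier'].duplicated()

# dropping the later repeats leaves a unique column that can become the index
deduped = toy[~later_dupes].set_index('Claim Identifier').rename_axis(None)
print(deduped)
```

Only the first row of each duplicated identifier survives, which is why the resulting index is unique.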

## 3. Missing Values

In [11]:
# calculate the number of NaNs for each column
nan_counts = df.isna().sum()

# get the total number of rows
total_rows = df.shape[0]

# calculate the percentage of NaN values for each column
percentage_nans = (nan_counts / total_rows) * 100

# format the percentages with '%' sign
percentage_nans = percentage_nans.apply(lambda x: f"{x:.2f}%")

# combine all the information into a DataFrame
nan_summary = pd.DataFrame({
    'NaN Count': nan_counts
    , 'Total Values': [total_rows] * len(nan_counts)
    , 'Percentage NaN': percentage_nans
})

display(nan_summary)

Unnamed: 0,NaN Count,Total Values,Percentage NaN
Accident Date,23133,593470,3.90%
Age at Injury,19444,593470,3.28%
Alternative Dispute Resolution,19444,593470,3.28%
Assembly Date,0,593470,0.00%
Attorney/Representative,19444,593470,3.28%
Average Weekly Wage,48095,593470,8.10%
Birth Year,48522,593470,8.18%
C-2 Date,34004,593470,5.73%
C-3 Date,406225,593470,68.45%
Carrier Name,19444,593470,3.28%
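The same summary can be built more compactly with `isna().mean()`. Keeping the percentage numeric, rather than formatting it as a string, also allows the table to be sorted. This is a sketch on a toy frame, and `nan_summary_table` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def nan_summary_table(frame):
    """Return NaN count and NaN percentage per column, sorted by percentage."""
    summary = pd.DataFrame({
        'NaN Count': frame.isna().sum(),
        'Total Values': len(frame),
        'Percentage NaN': (frame.isna().mean() * 100).round(2),
    })
    return summary.sort_values('Percentage NaN', ascending=False)

# toy data with a known number of missing values per column
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})
result = nan_summary_table(toy)
print(result)
```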


By analysing the output above, we conclude:
- **'C-3 Date'**: about 68% of the data for this feature is missing, which may be due to process status - the employee has not yet sent the report
- **'First Hearing Date'**: around 75% of the data for this feature is missing, which suggests that the hearing has not yet been held
- **'IME-4 Count'**: more than 75% of the data for this feature is missing, which may be due to process status - the independent examiner has not sent the report
- **'OIICS Nature of Injury Description'**: only has null values

For **'C-3 Date'** and **'First Hearing Date'**, the missing values will be imputed using an imputation algorithm in the Notebook 'PART 3 - PreProcess, Feature Selection, Model Assessment and Evaluation', whereas the missing values of **'IME-4 Count'** will be imputed with the value 0 and **'OIICS Nature of Injury Description'** will be dropped from the dataset.

## 4. Data Types

When we imported the data, we got a warning saying that a column had mixed data types. We shall take a look at this issue now.

In [12]:
def check_mixed_types(column):
    # True if the column holds more than one Python type (NaN values are excluded, as they are stored as floats)
    return len(set(column[pd.notna(column)].apply(type))) > 1

In [13]:
mixed_type_columns = [col for col in df.columns if check_mixed_types(df[col])]

print(mixed_type_columns)

['Zip Code']


In [14]:
for col in mixed_type_columns:
    print(df[col].apply(type).value_counts())

Zip Code
<class 'str'>      530364
<class 'float'>     63106
Name: count, dtype: int64


The values for 'Zip Code' are split across two data types - string and float. We shall look at the feature values in order to make a decision.

In [15]:
# check if there are any values that contain any non numeric character to make a decision about the final data type
print(df[df['Zip Code'].apply(lambda x: bool(re.search(r'\D', str(x))))]['Zip Code'].unique())

[nan 'L1N 5' 'T1B0P' 'L6Y 1' 'JMDMR' 'N2P 1' 'H7X3Z' 'L2N 3' 'L0R 1'
 'L1A 3' 'T4R1E' 'L7L 6' 'L2W 1' 'K9K 1' '.1605' 'M4E1S' 'L2A 5' 'UNKNO'
 'T2V' 'J0L1B' 'AB106' 'J1H2Y' 'L2A 1' 'H2T2W' 'T1J5G' '22-40' 'M3K2B'
 'M3K2C' 'T1S 0' 'N6C4E' 'H1V2L' 'V6Z3G' 'KOC 1' 'M6S5B' 'V3H1H' 'L2H0H'
 'L1C0K' 'JOL1B' 'SLV' 'BT48-' 'L5N7C' 'JMDCN' 'V6T1Z' 'T2X3P' 'L2A1R'
 'L4A0G' 'H1V3S' 'K6H2L' 'JMAKN' 'L6H 3' 'J0J1S' 'B3E1A' 'V6T5C' 'JMBTS'
 'V2B0E' 'T6R 0' 'JMDWD' 'M6K3C' 'L9W' 'S7R 0' 'F94A0' 'P7G' 'Z2E1N'
 'L0S1N' 'K2L 4' 'H1G0A' 'B3M 3' 10567.0 11213.0 11421.0 11378.0 10035.0
 11385.0 10302.0 10305.0 12550.0 10027.0 11710.0 12180.0 11365.0 11418.0
 10598.0 10465.0 18851.0 6820.0 10952.0 14557.0 72524.0 7882.0 10923.0
 12901.0 11207.0 12590.0 11580.0 11725.0 10033.0 10532.0 12804.0 14614.0
 11779.0 11224.0 11435.0 11735.0 11367.0 11772.0 11218.0 11434.0 11901.0
 11577.0 11004.0 13083.0 10037.0 11214.0 10458.0 11422.0 14590.0 7822.0
 10001.0 10306.0 11704.0 10453.0 11414.0 11206.0 11203.0 11212.0 1

Since some values of 'Zip Code' contain non-numeric characters, we shall set this feature as a string.

In [16]:
# change the data type to string
df['Zip Code'] = df['Zip Code'].astype(str)

In [17]:
# re-checking the data type counts
print(df['Zip Code'].apply(type).value_counts())

Zip Code
<class 'str'>    593470
Name: count, dtype: int64


In [18]:
# apply transformation to the test data
df_test['Zip Code'] = df_test['Zip Code'].astype(str)
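One caveat of a direct cast: in the output above every value, including the previously missing ones, is now a str, so NaN has become the literal text 'nan' and float codes carry a trailing '.0'. A possible refinement, sketched here on toy values with a hypothetical helper `normalise_zip` (not part of the original pipeline), keeps missing values missing and strips the float suffix:

```python
import numpy as np
import pandas as pd

def normalise_zip(value):
    """Return the zip code as a clean string, or pd.NA when it is missing."""
    if pd.isna(value):
        return pd.NA
    # floats such as 10567.0 become '10567'
    if isinstance(value, float) and value.is_integer():
        return str(int(value))
    return str(value).strip()

# toy column mixing floats, strings and a missing value
toy = pd.Series([10567.0, 'L1N 5', np.nan, '13662'], dtype=object)
clean = toy.map(normalise_zip)
print(clean.tolist())
```

Whether this refinement is needed depends on how 'Zip Code' is used downstream, for example whether 'nan' should count as a category of its own.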

Let us now fix the remaining data types, as identified in [Section 2](#2-macro-inspection).

In [19]:
features_to_cast = [
    'Age at Injury'
    , 'Birth Year'
    , 'IME-4 Count'
    , 'Industry Code'
    , 'WCIO Cause of Injury Code'
    , 'WCIO Nature of Injury Code'
    , 'WCIO Part Of Body Code'
    , 'Number of Dependents'
    , 'Agreement Reached'
]

In [20]:
for feature in features_to_cast:
    try:
        df[feature] = pd.to_numeric(df[feature], errors='coerce').astype('Int64')
        df_test[feature] = pd.to_numeric(df_test[feature], errors='coerce').astype('Int64')
    
    except KeyError:
        pass

In [21]:
# checking if the conversion was successful
for feature in features_to_cast:
    print(feature, '\t', df[feature].dtypes)

Age at Injury 	 Int64
Birth Year 	 Int64
IME-4 Count 	 Int64
Industry Code 	 Int64
WCIO Cause of Injury Code 	 Int64
WCIO Nature of Injury Code 	 Int64
WCIO Part Of Body Code 	 Int64
Number of Dependents 	 Int64
Agreement Reached 	 Int64
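Section 2 listed 'Agreement Reached' as a boolean, whereas the loop above stores it as Int64. If a true boolean is preferred, the nullable 'boolean' dtype of pandas is one option. This is a sketch on toy values and is not part of the original pipeline:

```python
import numpy as np
import pandas as pd

# toy column with the same 0.0 / 1.0 / NaN encoding as the raw data
toy = pd.Series([0.0, 1.0, np.nan, 1.0], name='Agreement Reached')

# go through Int64 first so that 0.0 and 1.0 become exact integers,
# then cast to the nullable boolean dtype, which keeps missing values as <NA>
as_bool = pd.to_numeric(toy, errors='coerce').astype('Int64').astype('boolean')
print(as_bool)
```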


## 5. Date Formats

In [22]:
date_columns = [
    'Accident Date'
    , 'Assembly Date'
    , 'C-2 Date'
    , 'C-3 Date'
    , 'First Hearing Date'
]

In [23]:
def check_date_yyyymmdd(df, col):
    date_pattern = r'^(\d{4})-(\d{2})-(\d{2})$'

    def validate(date):
        # missing values are analysed separately, so they are not flagged here
        if pd.isna(date):
            return True

        if isinstance(date, str):
            # check if the date is in the correct format
            match = re.match(date_pattern, date)

            # if it matches the format
            if match:

                # define month and day (group 1 is the year)
                month = int(match.group(2))
                day = int(match.group(3))

                # check if the day is valid for the month (February 29 is accepted in any year)
                if (month == 2 and 1 <= day <= 29) or \
                (month in [4, 6, 9, 11] and 1 <= day <= 30) or \
                (month in [1, 3, 5, 7, 8, 10, 12] and 1 <= day <= 31):

                    return True

        return False

    return df[col].apply(validate)

In [24]:
# check if there are any values in these date columns that do not follow this format
for column in date_columns:
    invalid = ~check_date_yyyymmdd(df, column)
    if invalid.any():
        display(df.loc[invalid, column])

In [25]:
# check if there are any values in these date columns of the test data that do not follow this format
for column in date_columns:
    invalid = ~check_date_yyyymmdd(df_test, column)
    if invalid.any():
        display(df_test.loc[invalid, column])

Since we have assessed that all dates are in the YYYY-MM-DD format, we are going to convert these columns to datetime.

In [26]:
for col in date_columns:
    df[col] = pd.to_datetime(df[col], errors='coerce', format='%Y-%m-%d')
    df_test[col] = pd.to_datetime(df_test[col], errors='coerce', format='%Y-%m-%d')

In [27]:
df[date_columns].dtypes

Accident Date         datetime64[ns]
Assembly Date         datetime64[ns]
C-2 Date              datetime64[ns]
C-3 Date              datetime64[ns]
First Hearing Date    datetime64[ns]
dtype: object
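As a cross-check on the regular-expression approach, `pd.to_datetime` with `errors='coerce'` can flag malformed dates directly: any value that was present before parsing but is NaT afterwards did not match the format. It has to be applied to the raw string columns, before they are converted. This is a sketch on toy strings, and `find_bad_dates` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def find_bad_dates(series, fmt='%Y-%m-%d'):
    """Return the non-missing values that cannot be parsed with the given format."""
    parsed = pd.to_datetime(series, errors='coerce', format=fmt)
    # present before parsing but missing afterwards -> invalid date string
    return series[series.notna() & parsed.isna()]

# toy column: two valid dates, one impossible date, one wrong format, one missing
toy = pd.Series(['2019-12-30', '2021-02-30', '30/12/2019', np.nan, '2020-02-29'])
bad = find_bad_dates(toy)
print(bad)
```

Unlike a fixed day limit per month, this check also takes leap years into account.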

## 6. Unique Values

In [28]:
# check the number of unique values for each column
df.nunique()

Accident Date                           5539
Age at Injury                            108
Alternative Dispute Resolution             3
Assembly Date                           1096
Attorney/Representative                    2
Average Weekly Wage                   120024
Birth Year                               107
C-2 Date                                2475
C-3 Date                                1648
Carrier Name                            2046
Carrier Type                               8
Claim Injury Type                          8
County of Injury                          63
COVID-19 Indicator                         2
District Name                              8
First Hearing Date                      1094
Gender                                     4
IME-4 Count                               41
Industry Code                             24
Industry Code Description                 20
Medical Fee Region                         5
OIICS Nature of Injury Description         0
WCIO Cause

In [29]:
# check the unique values for each column
pd.DataFrame({
    'feature_type': [df[column].dtype for column in df.columns],
    'unique_values': df.apply(lambda col: sorted(pd.Series(col.dropna().unique().tolist())))  # we disregard NaN values, so we can sort the unique values
})

Unnamed: 0,feature_type,unique_values
Accident Date,datetime64[ns],"[1961-09-06 00:00:00, 1963-10-01 00:00:00, 1966-06-01 00:00:00, 1966-09-13 00:00:00, 1967-01-01 00:00:00, 1967-04-05 00:00:00, 1967-10-26 00:00:00, 1969-04-01 00:00:00, 1969-05-03 00:00:00, 1970-12-10 00:00:00, 1971-03-17 00:00:00, 1971-04-27 00:00:00, 1971-05-08 00:00:00, 1972-05-26 00:00:00, 1972-09-26 00:00:00, 1972-11-15 00:00:00, 1972-12-26 00:00:00, 1973-01-15 00:00:00, 1973-02-27 00:00:00, 1973-04-24 00:00:00, 1973-05-01 00:00:00, 1973-05-31 00:00:00, 1973-12-12 00:00:00, 1973-12-24 00:00:00, 1974-02-22 00:00:00, 1974-06-06 00:00:00, 1974-06-22 00:00:00, 1974-07-09 00:00:00, 1974-10-29 00:00:00, 1975-05-01 00:00:00, 1975-05-06 00:00:00, 1975-06-09 00:00:00, 1975-06-17 00:00:00, 1975-07-01 00:00:00, 1975-08-06 00:00:00, 1975-10-16 00:00:00, 1975-12-22 00:00:00, 1976-07-21 00:00:00, 1976-08-17 00:00:00, 1976-10-14 00:00:00, 1977-03-04 00:00:00, 1977-04-04 00:00:00, 1977-06-15 00:00:00, 1977-06-17 00:00:00, 1977-06-21 00:00:00, 1977-07-22 00:00:00, 1977-09-16 00:00:00, 1977-12-29 00:00:00, 1978-01-23 00:00:00, 1978-02-02 00:00:00, 1978-04-04 00:00:00, 1978-05-19 00:00:00, 1978-07-28 00:00:00, 1978-08-14 00:00:00, 1979-03-01 00:00:00, 1979-08-28 00:00:00, 1979-10-08 00:00:00, 1979-11-02 00:00:00, 1980-01-01 00:00:00, 1980-01-16 00:00:00, 1980-03-26 00:00:00, 1980-03-28 00:00:00, 1980-04-24 00:00:00, 1980-04-28 00:00:00, 1980-05-01 00:00:00, 1980-05-09 00:00:00, 1980-05-17 00:00:00, 1980-10-06 00:00:00, 1980-12-16 00:00:00, 1980-12-17 00:00:00, 1981-01-28 00:00:00, 1981-02-09 00:00:00, 1981-02-16 00:00:00, 1981-04-21 00:00:00, 1981-06-22 00:00:00, 1981-09-23 00:00:00, 1981-11-12 00:00:00, 1981-12-09 00:00:00, 1981-12-30 00:00:00, 1982-01-01 00:00:00, 1982-02-01 00:00:00, 1982-03-12 00:00:00, 1982-03-25 00:00:00, 1982-05-21 00:00:00, 1982-06-10 00:00:00, 1982-06-14 00:00:00, 1982-08-01 00:00:00, 1982-10-15 00:00:00, 1982-12-07 00:00:00, 1983-01-18 00:00:00, 1983-02-14 00:00:00, 1983-02-19 00:00:00, 1983-05-01 00:00:00, 1983-06-06 
00:00:00, 1983-07-12 00:00:00, 1983-07-18 00:00:00, 1983-07-28 00:00:00, 1983-08-22 00:00:00, 1984-01-16 00:00:00, 1984-01-30 00:00:00, ...]"
Age at Injury,Int64,"[0, 1, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 104, ...]"
Alternative Dispute Resolution,object,"[N, U, Y]"
Assembly Date,datetime64[ns],"[2020-01-01 00:00:00, 2020-01-02 00:00:00, 2020-01-03 00:00:00, 2020-01-04 00:00:00, 2020-01-05 00:00:00, 2020-01-06 00:00:00, 2020-01-07 00:00:00, 2020-01-08 00:00:00, 2020-01-09 00:00:00, 2020-01-10 00:00:00, 2020-01-11 00:00:00, 2020-01-12 00:00:00, 2020-01-13 00:00:00, 2020-01-14 00:00:00, 2020-01-15 00:00:00, 2020-01-16 00:00:00, 2020-01-17 00:00:00, 2020-01-18 00:00:00, 2020-01-19 00:00:00, 2020-01-20 00:00:00, 2020-01-21 00:00:00, 2020-01-22 00:00:00, 2020-01-23 00:00:00, 2020-01-24 00:00:00, 2020-01-25 00:00:00, 2020-01-26 00:00:00, 2020-01-27 00:00:00, 2020-01-28 00:00:00, 2020-01-29 00:00:00, 2020-01-30 00:00:00, 2020-01-31 00:00:00, 2020-02-01 00:00:00, 2020-02-02 00:00:00, 2020-02-03 00:00:00, 2020-02-04 00:00:00, 2020-02-05 00:00:00, 2020-02-06 00:00:00, 2020-02-07 00:00:00, 2020-02-08 00:00:00, 2020-02-09 00:00:00, 2020-02-10 00:00:00, 2020-02-11 00:00:00, 2020-02-12 00:00:00, 2020-02-13 00:00:00, 2020-02-14 00:00:00, 2020-02-15 00:00:00, 2020-02-16 00:00:00, 2020-02-17 00:00:00, 2020-02-18 00:00:00, 2020-02-19 00:00:00, 2020-02-20 00:00:00, 2020-02-21 00:00:00, 2020-02-22 00:00:00, 2020-02-23 00:00:00, 2020-02-24 00:00:00, 2020-02-25 00:00:00, 2020-02-26 00:00:00, 2020-02-27 00:00:00, 2020-02-28 00:00:00, 2020-02-29 00:00:00, 2020-03-01 00:00:00, 2020-03-02 00:00:00, 2020-03-03 00:00:00, 2020-03-04 00:00:00, 2020-03-05 00:00:00, 2020-03-06 00:00:00, 2020-03-07 00:00:00, 2020-03-08 00:00:00, 2020-03-09 00:00:00, 2020-03-10 00:00:00, 2020-03-11 00:00:00, 2020-03-12 00:00:00, 2020-03-13 00:00:00, 2020-03-14 00:00:00, 2020-03-15 00:00:00, 2020-03-16 00:00:00, 2020-03-17 00:00:00, 2020-03-18 00:00:00, 2020-03-19 00:00:00, 2020-03-20 00:00:00, 2020-03-21 00:00:00, 2020-03-22 00:00:00, 2020-03-23 00:00:00, 2020-03-24 00:00:00, 2020-03-25 00:00:00, 2020-03-26 00:00:00, 2020-03-27 00:00:00, 2020-03-28 00:00:00, 2020-03-29 00:00:00, 2020-03-30 00:00:00, 2020-03-31 00:00:00, 2020-04-01 00:00:00, 2020-04-02 00:00:00, 2020-04-03 
00:00:00, 2020-04-04 00:00:00, 2020-04-05 00:00:00, 2020-04-06 00:00:00, 2020-04-07 00:00:00, 2020-04-08 00:00:00, 2020-04-09 00:00:00, ...]"
Attorney/Representative,object,"[N, Y]"
Average Weekly Wage,float64,"[0.0, 13.94, 17.41, 23.27, 27.49, 30.88, 31.54, 32.0, 34.23, 35.0, 36.86, 38.13, 38.46, 40.0, 41.73, 42.92, 43.5, 44.65, 46.15, 48.39, 48.92, 50.0, 50.33, 50.53, 50.83, 51.15, 51.76, 52.24, 52.63, 55.0, 57.59, 58.37, 58.72, 59.14, 61.08, 62.43, 62.62, 62.77, 63.0, 65.95, 66.33, 66.66, 66.97, 67.03, 67.25, 69.15, 70.0, 71.33, 72.26, 72.28, 75.0, 76.0, 76.42, 77.04, 78.27, 78.31, 78.47, 80.0, 80.22, 80.34, 80.35, 81.25, 81.67, 82.0, 82.92, 83.51, 84.62, 86.05, 86.25, 87.83, 88.0, 88.4, 89.17, 90.0, 90.39, 90.75, 91.0, 91.08, 91.27, 91.92, 92.4, 92.54, 93.02, 93.19, 94.0, 94.34, 95.0, 95.88, 96.0, 96.08, 96.3, 96.95, 97.83, 98.28, 98.61, 99.5, 99.71, 99.95, 100.0, 100.2, ...]"
Birth Year,Int64,"[0, 1900, 1901, 1902, 1903, 1907, 1909, 1910, 1911, 1916, 1917, 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929, 1930, 1931, 1932, 1933, 1934, 1935, 1936, 1937, 1938, 1939, 1940, 1941, 1942, 1943, 1944, 1945, 1946, 1947, 1948, 1949, 1950, 1951, 1952, 1953, 1954, 1955, 1956, 1957, 1958, 1959, 1960, 1961, 1962, 1963, 1964, 1965, 1966, 1967, 1968, 1969, 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, 1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, ...]"
C-2 Date,datetime64[ns],"[1996-01-12 00:00:00, 1996-01-31 00:00:00, 1996-02-14 00:00:00, 1996-03-06 00:00:00, 1996-03-29 00:00:00, 1996-04-25 00:00:00, 1996-05-09 00:00:00, 1996-05-10 00:00:00, 1996-05-16 00:00:00, 1996-05-22 00:00:00, 1996-06-04 00:00:00, 1996-06-06 00:00:00, 1996-06-18 00:00:00, 1996-06-19 00:00:00, 1996-06-20 00:00:00, 1996-06-27 00:00:00, 1996-06-28 00:00:00, 1996-07-01 00:00:00, 1996-07-08 00:00:00, 1996-07-23 00:00:00, 1996-07-29 00:00:00, 1996-08-08 00:00:00, 1996-08-12 00:00:00, 1996-08-13 00:00:00, 1996-08-15 00:00:00, 1996-08-20 00:00:00, 1996-09-17 00:00:00, 1996-10-09 00:00:00, 1996-10-16 00:00:00, 1996-10-29 00:00:00, 1996-11-06 00:00:00, 1996-11-08 00:00:00, 1996-11-14 00:00:00, 1996-11-21 00:00:00, 1996-12-05 00:00:00, 1996-12-19 00:00:00, 1996-12-23 00:00:00, 1997-01-15 00:00:00, 1997-01-21 00:00:00, 1997-01-24 00:00:00, 1997-02-07 00:00:00, 1997-02-10 00:00:00, 1997-02-12 00:00:00, 1997-02-13 00:00:00, 1997-02-19 00:00:00, 1997-02-20 00:00:00, 1997-03-06 00:00:00, 1997-03-18 00:00:00, 1997-03-24 00:00:00, 1997-04-01 00:00:00, 1997-04-02 00:00:00, 1997-04-10 00:00:00, 1997-04-17 00:00:00, 1997-04-24 00:00:00, 1997-05-07 00:00:00, 1997-06-12 00:00:00, 1997-06-16 00:00:00, 1997-06-20 00:00:00, 1997-06-25 00:00:00, 1997-07-01 00:00:00, 1997-07-03 00:00:00, 1997-07-17 00:00:00, 1997-07-18 00:00:00, 1997-09-05 00:00:00, 1997-09-10 00:00:00, 1997-09-16 00:00:00, 1997-09-17 00:00:00, 1997-09-24 00:00:00, 1997-10-15 00:00:00, 1997-10-16 00:00:00, 1997-10-22 00:00:00, 1997-10-23 00:00:00, 1997-10-24 00:00:00, 1997-11-24 00:00:00, 1997-12-05 00:00:00, 1997-12-08 00:00:00, 1997-12-15 00:00:00, 1998-01-26 00:00:00, 1998-02-05 00:00:00, 1998-02-27 00:00:00, 1998-03-17 00:00:00, 1998-03-18 00:00:00, 1998-04-06 00:00:00, 1998-04-17 00:00:00, 1998-06-04 00:00:00, 1998-06-08 00:00:00, 1998-06-10 00:00:00, 1998-06-23 00:00:00, 1998-06-26 00:00:00, 1998-06-29 00:00:00, 1998-07-14 00:00:00, 1998-07-23 00:00:00, 1998-07-27 00:00:00, 1998-08-10 00:00:00, 
1998-10-01 00:00:00, 1998-10-13 00:00:00, 1998-10-26 00:00:00, 1998-10-27 00:00:00, 1998-10-30 00:00:00, 1998-11-17 00:00:00, ...]"
C-3 Date,datetime64[ns],"[1992-11-13 00:00:00, 2010-05-14 00:00:00, 2010-07-29 00:00:00, 2010-08-27 00:00:00, 2010-10-08 00:00:00, 2011-01-11 00:00:00, 2011-03-03 00:00:00, 2011-05-26 00:00:00, 2012-07-12 00:00:00, 2012-11-02 00:00:00, 2013-04-03 00:00:00, 2013-08-07 00:00:00, 2013-12-25 00:00:00, 2016-06-20 00:00:00, 2016-11-08 00:00:00, 2017-02-09 00:00:00, 2017-09-01 00:00:00, 2017-09-07 00:00:00, 2017-09-12 00:00:00, 2017-11-08 00:00:00, 2018-01-30 00:00:00, 2018-02-05 00:00:00, 2018-02-13 00:00:00, 2018-02-26 00:00:00, 2018-03-09 00:00:00, 2018-04-04 00:00:00, 2018-05-17 00:00:00, 2018-06-13 00:00:00, 2018-06-26 00:00:00, 2018-06-28 00:00:00, 2018-07-11 00:00:00, 2018-07-19 00:00:00, 2018-07-30 00:00:00, 2018-10-24 00:00:00, 2018-10-30 00:00:00, 2018-11-09 00:00:00, 2018-11-23 00:00:00, 2018-12-14 00:00:00, 2019-01-04 00:00:00, 2019-01-14 00:00:00, 2019-01-17 00:00:00, 2019-02-25 00:00:00, 2019-02-27 00:00:00, 2019-03-25 00:00:00, 2019-03-28 00:00:00, 2019-03-29 00:00:00, 2019-04-03 00:00:00, 2019-04-04 00:00:00, 2019-05-02 00:00:00, 2019-05-15 00:00:00, 2019-05-17 00:00:00, 2019-05-21 00:00:00, 2019-05-23 00:00:00, 2019-05-24 00:00:00, 2019-06-03 00:00:00, 2019-06-21 00:00:00, 2019-06-28 00:00:00, 2019-07-03 00:00:00, 2019-07-12 00:00:00, 2019-07-18 00:00:00, 2019-07-19 00:00:00, 2019-07-22 00:00:00, 2019-07-24 00:00:00, 2019-07-25 00:00:00, 2019-07-26 00:00:00, 2019-07-29 00:00:00, 2019-07-31 00:00:00, 2019-08-07 00:00:00, 2019-08-08 00:00:00, 2019-08-14 00:00:00, 2019-08-16 00:00:00, 2019-08-19 00:00:00, 2019-08-22 00:00:00, 2019-08-29 00:00:00, 2019-09-06 00:00:00, 2019-09-10 00:00:00, 2019-09-13 00:00:00, 2019-09-17 00:00:00, 2019-09-19 00:00:00, 2019-10-03 00:00:00, 2019-10-07 00:00:00, 2019-10-08 00:00:00, 2019-10-11 00:00:00, 2019-10-21 00:00:00, 2019-10-22 00:00:00, 2019-10-27 00:00:00, 2019-10-28 00:00:00, 2019-10-29 00:00:00, 2019-10-30 00:00:00, 2019-10-31 00:00:00, 2019-11-01 00:00:00, 2019-11-04 00:00:00, 2019-11-06 00:00:00, 2019-11-07 00:00:00, 
2019-11-08 00:00:00, 2019-11-11 00:00:00, 2019-11-12 00:00:00, 2019-11-13 00:00:00, 2019-11-14 00:00:00, 2019-11-15 00:00:00, ...]"
Carrier Name,object,"[*** CARRIER UNDETERMINED ***, A I U INSURANCE COMPANY, ABF FREIGHT SYSTEM, INC., ABF FREIGHT SYSTEMS INC DEL, ACADIA INSURANCE COMPANY, ACCIDENT FUND GENERAL, ACCIDENT FUND GENERAL INS CO, ACCIDENT FUND INSURANCE, ACCIDENT FUND INSURANCE CO, ACCIDENT FUND NATIONAL INS CO, ACCIDENT FUND NATONAL INS CO, ACCREDITED SURETY AND CASUALTY, ACE AMERICAN INSURANCE CO, ACE AMERICAN INSURANCE CO., ACE FIRE UNDERWRITERS, ACE FIRE UNDERWRITERS INS, ACE PROPERTY & CASUALTY INS CO, ACE PROPERTY AND CASUALTY, ACIG INSURANCE COMPANY, ADDISON CENTRAL SCHOOL DISTRIC, ADDISON CSD, ADIRONDACK CENTRAL SCHOOL, ADIRONDACK CSD, ADMIRAL INDEMNITY COMPANY, ADMIRAL INDEMNITY CORP., AIG PROPERTY CASUALTY CO., AIG PROPERTY CASUALTY COMPANY, AIRBORNE EXPRESS, INC., AIU INSURANCE CO, AKRON CENTRAL SCHOOL DISTRICT, AKRON CSD, ALBANY CITY OF, ALBANY CITY SCHOOL DISTRICT, ALBANY, COUNTY OF, ALBION CENTRAL SCHOOL DIST, ALBION CENTRAL SCHOOL DISTRICT, ALDEN C.S.D., ALDEN CENTRAL SCHOOL DIST, ALEXANDRIA CENTRAL SCHOOL DIST, ALFRED-ALMOND CENTRAL SCHOOL, ALFRED-ALMOND CSD, ALL AMERICA INSURANCE CO, ALLEGANY COUNTY MUTUAL, ALLEGANY-LIMESTONE, ALLEGANY-LIMESTONE CSD, ALLIANCE NATIONAL INS CO, ALLIANCE NATIONAL INSURANCE CO, ALLIANZ GLOBAL RISKS US, ALLIED EASTERN INDEMNITY CO, ALLIED PROPERTY & CASUALTY, ALLIED PROPERTY AND CASUALTY, ALLMERICA FIN BENEFIT INS CO, ALLMERICA FINANCIAL ALLIANCE, ALLMERICA FINANCIAL BENEFIT, ALLSTATE INSURANCE COMPANY, ALTMAR-PARISH-WILLIAMSTOWN CSD, AMAGANSETT UFSD, AMCO INSURANCE COMPANY, AMERICAN AUTOMOBILE INS CO, AMERICAN AUTOMOBILE INS. CO., AMERICAN CASUALTY CO, AMERICAN CASUALTY CO OF, AMERICAN FIRE & CASUALTY CO, AMERICAN FIRE & CASUALTY CO., AMERICAN GUAR & LIAB INS CO, AMERICAN GUARANTEE & LIABILITY, AMERICAN HOME ASSURANCE CO, AMERICAN INSURANCE CO, AMERICAN MFG. MUT. INS. 
CO., AMERICAN MOTORISTS INS CO, AMERICAN PROTECTION INS CO., AMERICAN ZURICH INSURANCE CO, AMERISURE INS CO, AMERISURE INSURANCE COMPANY, AMERISURE MUTUAL INSURANCE CO, AMGUARD INSURANCE COMPANY, AMHERST C.S.D, AMHERST CENTRAL SCHOOL DIST, AMITYVILLE UFSD, AMITYVILLE UNION FREE, AMTRUST INS CO OF KANSAS INC, AMTRUST INSURANCE COMPANY, AMTRUST INSURANCE COMPANY OF, ANDES CENTRAL SCHOOL, ANDES CENTRAL SCHOOL DISTRICT, ANDOVER CENTRAL SCHOOL, ARCH INDEMNITY INSURANCE CO, ARCH INDEMNITY INSURANCE CO., ARCH INSURANCE CO, ARCH INSURANCE COMPANY, ARDSLEY UFSD, ARDSLEY UNION FREE SCHL DIST, ARGONAUT INSURANCE COMPANY, ARGONAUT-MIDWEST INS CO, ARGONAUT-MIDWEST INS, CO, ARGYLE CENTRAL SCHOOL DIST, ARI INSURANCE COMPANY, ARKPORT CENTRAL SCHOOL, ARLINGTON CENTRAL, ARLINGTON CSD, ...]"


Looking at this information, we can point out the following:
- **Age at Injury**: since this dataset concerns workplace injuries, values under 14 (the minimum legal working age for most jobs in the US) are implausible
- **Birth Year**: has values equal to 0
- **OIICS Nature of Injury Description**: has no values
- **WCIO Part Of Body Code**: has a negative value
- **WCB Decision**: only has one value
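Observations such as "has no values" or "only has one value" can also be found programmatically rather than by reading the table. This is a sketch on a toy frame, and `low_information_columns` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def low_information_columns(frame):
    """Split columns into those with no values and those with a single distinct value."""
    n_unique = frame.nunique(dropna=True)
    empty = n_unique[n_unique == 0].index.tolist()
    constant = n_unique[n_unique == 1].index.tolist()
    return empty, constant

# toy frame mirroring the three situations seen above
toy = pd.DataFrame({
    'OIICS Nature of Injury Description': [np.nan, np.nan, np.nan],
    'WCB Decision': ['Not Work Related'] * 3,
    'Gender': ['M', 'F', 'M'],
})
empty, constant = low_information_columns(toy)
print(empty, constant)
```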

## 7. Data Consistency

### 7.1. Birth Year

A birth year of 0 is not valid. We will try to replace the 0 values by computing a new value from 'Accident Date' and 'Age at Injury'.

In [30]:
# amount of 0 values
len(df[df['Birth Year'] == 0])

25081

In [31]:
# amount of values we can fix using valid values of the other two columns
len(df[(df['Birth Year'] == 0) & (df['Accident Date'].notna()) & (df['Age at Injury'] >= 14)])

23141

In [32]:
# define the update condition
condition = (df['Birth Year'] == 0) & (df['Accident Date'].notna()) & (df['Age at Injury'] >= 14)

# update the values based on the condition
df.loc[condition, 'Birth Year'] = df['Accident Date'].dt.year - df['Age at Injury']

In [33]:
# amount of 0 values after the changes
len(df[df['Birth Year'] == 0])

1940

In [34]:
# check the test data
len(df_test[df_test['Birth Year'] == 0])

19596

In [35]:
# amount of values we can fix using valid values of the other two columns
len(df_test[(df_test['Birth Year'] == 0) & (df_test['Accident Date'].notna()) & (df_test['Age at Injury'] >= 14)])

18765

In [36]:
# define the update condition
condition = (df_test['Birth Year'] == 0) & (df_test['Accident Date'].notna()) & (df_test['Age at Injury'] >= 14)

# update the values based on the condition
df_test.loc[condition, 'Birth Year'] = df_test['Accident Date'].dt.year - df_test['Age at Injury']

In [37]:
# amount of 0 values after the changes
len(df_test[df_test['Birth Year'] == 0])

831

We will set the remaining 0 values to NaN, since they could not be recomputed and are inconsistent with the rest of the data.

In [38]:
df.loc[df['Birth Year'] == 0, 'Birth Year'] = np.nan
df_test.loc[df_test['Birth Year'] == 0, 'Birth Year'] = np.nan
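Because the same steps run on both the training and the test data, they could be wrapped in a single function. The following is a sketch under the same rules as above (0 is treated as missing, and the age must be at least 14 to be trusted); `fix_birth_year` and `min_age` are hypothetical names, and the toy rows are made up:

```python
import pandas as pd

def fix_birth_year(frame, min_age=14):
    """Rebuild 'Birth Year' where it is 0, then blank out any remaining zeros."""
    frame = frame.copy()
    can_fix = (
        (frame['Birth Year'] == 0)
        & frame['Accident Date'].notna()
        & (frame['Age at Injury'] >= min_age)
    ).fillna(False)
    # year of the accident minus the age, kept as a nullable integer
    rebuilt = frame['Accident Date'].dt.year.astype('Int64') - frame['Age at Injury']
    frame.loc[can_fix, 'Birth Year'] = rebuilt[can_fix]
    # zeros that could not be rebuilt are treated as missing
    still_zero = (frame['Birth Year'] == 0).fillna(False)
    frame.loc[still_zero, 'Birth Year'] = pd.NA
    return frame

toy = pd.DataFrame({
    'Accident Date': pd.to_datetime(['2020-03-01', None, '2021-07-15', '2019-01-10']),
    'Age at Injury': pd.array([40, 35, 0, 50], dtype='Int64'),
    'Birth Year': pd.array([0, 0, 0, 1969], dtype='Int64'),
})
fixed = fix_birth_year(toy)
print(fixed['Birth Year'].tolist())
```

The function works on a copy, so it can be applied as `df = fix_birth_year(df)` and `df_test = fix_birth_year(df_test)` without side effects.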

### 7.2. Age at Injury

We will check the values that are under the minimum legal working age.

In [39]:
len(df[df['Age at Injury'] < 14])

5511

In [40]:
# amount of values we will be able to correct using 'Accident Date' and 'Birth Year'
len(df[(df['Age at Injury'] < 14) & (df['Accident Date'].notna()) & (df['Birth Year'] != 0)])

122

In [41]:
age_condition = (df['Age at Injury'] < 14) & (df['Accident Date'].notna()) & (df['Birth Year'] != 0)

df.loc[age_condition, 'Age at Injury'] = df['Accident Date'].dt.year - df['Birth Year']

In [42]:
# amount of values that remained inconsistent
len(df[(df['Age at Injury'] < 14) & (df['Accident Date'].notna()) & (df['Birth Year'] != 0)])

27

In [43]:
# amount of values we will be able to compute after imputation of missing values
len(df[(df['Age at Injury'] < 14) & ((df['Accident Date'].isna()) | (df['Birth Year'].isna()))])

5389

In [44]:
# amount of values we cannot fix
len(df[(df['Age at Injury'] < 14) & ~((df['Accident Date'].isna()) | (df['Birth Year'].isna()))])

27

We will later set the inconsistent values as NaN for imputation.

In [45]:
# check the test data
len(df_test[df_test['Age at Injury'] < 14])

3283

In [46]:
# amount of values we will be able to correct using 'Accident Date' and 'Birth Year'
len(df_test[(df_test['Age at Injury'] < 14) & (df_test['Accident Date'].notna()) & (df_test['Birth Year'] != 0)])

176

In [47]:
age_condition = (df_test['Age at Injury'] < 14) & (df_test['Accident Date'].notna()) & (df_test['Birth Year'] != 0)

df_test.loc[age_condition, 'Age at Injury'] = df_test['Accident Date'].dt.year - df_test['Birth Year']

In [48]:
# amount of values that remained inconsistent
len(df_test[(df_test['Age at Injury'] < 14) & (df_test['Accident Date'].notna()) & (df_test['Birth Year'] != 0)])

16

In [49]:
# amount of values we will be able to compute after imputation of missing values
len(df_test[(df_test['Age at Injury'] < 14) & ((df_test['Accident Date'].isna()) | (df_test['Birth Year'].isna()))])

3107

In [50]:
# amount of values we cannot fix
len(df_test[(df_test['Age at Injury'] < 14) & ~((df_test['Accident Date'].isna()) | (df_test['Birth Year'].isna()))])

16

We will later set the inconsistent values as NaN for imputation.
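
One detail worth keeping in mind: the zero values of 'Birth Year' were replaced with NaN in section 7.1, and in pandas `NaN != 0` evaluates to True. A `!= 0` filter may therefore also let through rows whose birth year is missing. The sketch below only illustrates the comparison semantics on toy values; the `is_known` mask is a suggestion of ours, not part of the original cells.

```python
import numpy as np
import pandas as pd

birth_year = pd.Series([1980.0, np.nan, 0.0])

# NaN compares as "different from 0", so the missing value passes this filter.
not_zero = birth_year != 0

# Requiring a non-missing value as well keeps only usable birth years.
is_known = birth_year.notna() & (birth_year != 0)
```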

### 7.3. Gender

In [51]:
df['Gender'].value_counts()

Gender
M    335218
F    234051
U      4711
X        46
Name: count, dtype: int64

In [52]:
df.groupby(['Gender', 'Claim Injury Type'])['Claim Injury Type'].count()

Gender  Claim Injury Type
F       1. CANCELLED           4171
        2. NON-COMP          129243
        3. MED ONLY           28110
        4. TEMPORARY          56128
        5. PPD SCH LOSS       14625
        6. PPD NSL             1691
        7. PTD                   23
        8. DEATH                 60
M       1. CANCELLED           8131
        2. NON-COMP          158751
        3. MED ONLY           40326
        4. TEMPORARY          91551
        5. PPD SCH LOSS       33468
        6. PPD NSL             2507
        7. PTD                   74
        8. DEATH                410
U       1. CANCELLED            172
        2. NON-COMP            3072
        3. MED ONLY             452
        4. TEMPORARY            823
        5. PPD SCH LOSS         179
        6. PPD NSL               13
X       1. CANCELLED              3
        2. NON-COMP              12
        3. MED ONLY              18
        4. TEMPORARY              5
        5. PPD SCH LOSS           8
Name: Claim Injury Type, dtype: int64

Looking at these categories, and after a little research, we can assume that F, M, U and X represent Female, Male, Unknown and Other, respectively.</br>
No 'Claim Injury Type' above 6 (that is, 7. PTD or 8. DEATH) was attributed to genders other than M and F.

In [53]:
# create a new dataframe with dummified 'Gender' and the target 
df_gender = pd.get_dummies(df, columns=['Gender'], prefix='Gender')
df_gender = df_gender[[col for col in df_gender.columns if col.startswith('Gender_')] + ['Claim Injury Type']]
# keep only the leading class digit and make it numeric for the correlation
df_gender['Claim Injury Type'] = pd.to_numeric(df_gender['Claim Injury Type'].str[0])

In [54]:
# check the correlation of the gender categories with the target
df_gender[[col for col in df_gender if col.startswith('Gender_')]].corrwith(df_gender['Claim Injury Type'])

Gender_F   -0.075761
Gender_M    0.080968
Gender_U   -0.029821
Gender_X    0.001288
dtype: float64

The gender categories show almost no correlation with the target. Since category 'X' has only 46 values, a drop in the ocean next to almost 600k rows, it is best to remove these values and the category itself.
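
A Pearson correlation treats the class digits as evenly spaced numbers, which the claim types are not necessarily. As a complementary view, we could look at the share of each claim type within each gender. The sketch below uses made-up rows and `pd.crosstab` with `normalize='index'`; it is an illustration, not a cell from the original analysis.

```python
import pandas as pd

# Toy data (hypothetical rows).
toy = pd.DataFrame({
    'Gender': ['M', 'M', 'F', 'F', 'F', 'U'],
    'Claim Injury Type': ['2. NON-COMP', '4. TEMPORARY', '2. NON-COMP',
                          '2. NON-COMP', '3. MED ONLY', '2. NON-COMP'],
})

# Each row of the result sums to 1: the distribution of claim types per gender.
shares = pd.crosstab(toy['Gender'], toy['Claim Injury Type'], normalize='index')
```

If the rows of `shares` look alike across genders, the category carries little information about the target.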

In [55]:
df['Gender'].isna().sum()

19444

In [56]:
# check the test data
df_test['Gender'].isna().sum()

0

It does not make much sense to keep an 'Unknown' category, so we will simply cast these values to NaN.

In [57]:
df.loc[df['Gender'] == 'U', 'Gender'] = np.nan

In [58]:
df_test.loc[df_test['Gender'] == 'U', 'Gender'] = np.nan

Now, as we want to delete the "X" category, we will set these values to NaN and later perform imputation.

In [59]:
df.loc[df['Gender'] == 'X', 'Gender'] = np.nan

In [60]:
# apply the transformation to the test data
df_test.loc[df_test['Gender'] == 'X', 'Gender'] = np.nan

In [61]:
df['Gender'].isna().sum()

24201

In [62]:
df_test['Gender'].isna().sum()

5613
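
The two replacements ('U' and 'X') could also be written as a single step that keeps only the categories we trust. This is a minimal sketch on a toy Series, assuming 'M' and 'F' are the only values we want to keep.

```python
import numpy as np
import pandas as pd

gender = pd.Series(['M', 'F', 'U', 'X', np.nan, 'F'])

# `where` keeps the values for which the condition holds and sets the rest to NaN.
cleaned = gender.where(gender.isin(['M', 'F']))
```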

### 7.4. Dates

In this section we will check the consistency between the date columns.

We know that:
- 'Accident Date' should be the smallest (earliest) date value
- 'First Hearing Date' should be the largest (latest) date value

We have no knowledge of a required order between 'C-2 Date', 'C-3 Date' and 'Assembly Date'.

#### 7.4.1. Accident Date

In [63]:
len(df[
    (df['Accident Date'] > df['Assembly Date']) |  # accident happened after claim assembly
    (df['Accident Date'] > df['C-2 Date']) |  # accident happened after c-2 form delivery
    (df['Accident Date'] > df['C-3 Date']) |  # accident happened after c-3 form delivery
    (df['Accident Date'] > df['First Hearing Date'])  # accident happened after first hearing
])

1701

In [64]:
# check the test data
len(df_test[
    (df_test['Accident Date'] > df_test['Assembly Date']) |  # accident happened after claim assembly
    (df_test['Accident Date'] > df_test['C-2 Date']) |  # accident happened after c-2 form delivery
    (df_test['Accident Date'] > df_test['C-3 Date']) |  # accident happened after c-3 form delivery
    (df_test['Accident Date'] > df_test['First Hearing Date'])  # accident happened after first hearing
])

315

Since we know the value for 'Accident Date' should be the smallest, we will swap that column's value with the smallest of the dates.

In [65]:
def update_accident(row):
   date_columns = ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date', 'First Hearing Date']

   date_values = pd.to_datetime(row[date_columns], errors='coerce')

   earliest_date_column = date_values.idxmin(skipna=True)

   row[earliest_date_column], row['Accident Date'] = row['Accident Date'], row[earliest_date_column]

   return row

In [66]:
train_accident_condition = (
    (
        (df['Accident Date'] > df['Assembly Date']) |
        (df['Accident Date'] > df['C-2 Date']) |
        (df['Accident Date'] > df['C-3 Date']) |
        (df['Accident Date'] > df['First Hearing Date'])
    ) &
    (df['Accident Date'].notna())
)

In [67]:
df.loc[train_accident_condition] = df.loc[train_accident_condition].apply(update_accident, axis=1)

In [68]:
# check transformation
len(df[
    (df['Accident Date'] > df['Assembly Date']) |  # accident happened after claim assembly
    (df['Accident Date'] > df['C-2 Date']) |  # accident happened after c-2 form delivery
    (df['Accident Date'] > df['C-3 Date']) |  # accident happened after c-3 form delivery
    (df['Accident Date'] > df['First Hearing Date'])  # accident happened after first hearing
])

0

In [69]:
test_accident_condition = (
    (
        (df_test['Accident Date'] > df_test['Assembly Date']) |
        (df_test['Accident Date'] > df_test['C-2 Date']) |
        (df_test['Accident Date'] > df_test['C-3 Date']) |
        (df_test['Accident Date'] > df_test['First Hearing Date'])
    ) &
    (df_test['Accident Date'].notna())
)

In [70]:
# apply transformation to the test data
df_test.loc[test_accident_condition] = df_test.loc[test_accident_condition].apply(update_accident, axis=1)

In [71]:
# check the test data
len(df_test[
    (df_test['Accident Date'] > df_test['Assembly Date']) |  # accident happened after claim assembly
    (df_test['Accident Date'] > df_test['C-2 Date']) |  # accident happened after c-2 form delivery
    (df_test['Accident Date'] > df_test['C-3 Date']) |  # accident happened after c-3 form delivery
    (df_test['Accident Date'] > df_test['First Hearing Date'])  # accident happened after first hearing
])

0

#### 7.4.2. First Hearing Date

We will set this value as the latest date.

In [72]:
len(df[
    (df['First Hearing Date'] < df['Accident Date']) |
    (df['First Hearing Date'] < df['C-2 Date']) |
    (df['First Hearing Date'] < df['C-3 Date']) |
    (df['First Hearing Date'] < df['Assembly Date'])
])

7576

In [73]:
# check the test data
len(df_test[
    (df_test['First Hearing Date'] < df_test['Accident Date']) |
    (df_test['First Hearing Date'] < df_test['C-2 Date']) |
    (df_test['First Hearing Date'] < df_test['C-3 Date']) |
    (df_test['First Hearing Date'] < df_test['Assembly Date'])
])

1364

Since we know the value for 'First Hearing Date' should be the largest, we will swap that column's value with the largest of the dates.

In [74]:
def update_hearing(row):
   date_columns = ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date', 'First Hearing Date']

   date_values = pd.to_datetime(row[date_columns], errors='coerce')

   latest_date_column = date_values.idxmax(skipna=True)

   row[latest_date_column], row['First Hearing Date'] = row['First Hearing Date'], row[latest_date_column]

   return row
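
`update_accident` and `update_hearing` differ only in the target column and in the use of `idxmin` versus `idxmax`. A single parameterized helper could cover both cases. The name `swap_with_extreme`, its parameters and the toy rows below are hypothetical; the swap logic itself mirrors the two functions above.

```python
import pandas as pd

DATE_COLUMNS = ['Accident Date', 'Assembly Date', 'C-2 Date', 'C-3 Date', 'First Hearing Date']

def swap_with_extreme(row, target, earliest=True):
    """Swap `target` with the earliest (or latest) non-missing date in the row."""
    dates = pd.to_datetime(row[DATE_COLUMNS], errors='coerce')
    extreme = dates.idxmin() if earliest else dates.idxmax()
    row[extreme], row[target] = row[target], row[extreme]
    return row

# Toy data (hypothetical rows); the first row has an accident after the assembly.
toy = pd.DataFrame({
    'Claim Identifier': [1, 2],
    'Accident Date': pd.to_datetime(['2020-03-10', '2020-01-05']),
    'Assembly Date': pd.to_datetime(['2020-03-01', '2020-01-20']),
    'C-2 Date': pd.to_datetime(['2020-03-05', None]),
    'C-3 Date': pd.to_datetime([None, None]),
    'First Hearing Date': pd.to_datetime(['2020-03-07', '2020-01-10']),
})

fixed = toy.apply(swap_with_extreme, axis=1, target='Accident Date', earliest=True)
fixed = fixed.apply(swap_with_extreme, axis=1, target='First Hearing Date', earliest=False)
```

Note that a swap moves the displaced value into another column, so it can create a new inconsistency there; rechecking after each pass remains necessary.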

In [75]:
condition = (
    (
        (df['First Hearing Date'] < df['Accident Date']) |
        (df['First Hearing Date'] < df['C-2 Date']) |
        (df['First Hearing Date'] < df['C-3 Date']) |
        (df['First Hearing Date'] < df['Assembly Date'])
    ) &
    (df['First Hearing Date'].notna())
)

In [76]:
df.loc[condition] = df.loc[condition].apply(update_hearing, axis=1)

In [77]:
# check transformation
len(df[
    (df['First Hearing Date'] < df['Accident Date']) |
    (df['First Hearing Date'] < df['C-2 Date']) |
    (df['First Hearing Date'] < df['C-3 Date']) |
    (df['First Hearing Date'] < df['Assembly Date'])
])

0

In [78]:
condition = (
    (
        (df_test['First Hearing Date'] < df_test['Accident Date']) |
        (df_test['First Hearing Date'] < df_test['C-2 Date']) |
        (df_test['First Hearing Date'] < df_test['C-3 Date']) |
        (df_test['First Hearing Date'] < df_test['Assembly Date'])
    ) &
    (df_test['First Hearing Date'].notna())
)

In [79]:
df_test.loc[condition] = df_test.loc[condition].apply(update_hearing, axis=1)

In [80]:
len(df_test[
    (df_test['First Hearing Date'] < df_test['Accident Date']) |
    (df_test['First Hearing Date'] < df_test['C-2 Date']) |
    (df_test['First Hearing Date'] < df_test['C-3 Date']) |
    (df_test['First Hearing Date'] < df_test['Assembly Date'])
])

0

We have fixed the inconsistencies related to dates! Or so we thought...

#### 7.4.3. Assembly Date

In [81]:
len(df[df['Assembly Date'].dt.year > 2022])

16

In [82]:
len(df_test[df_test['Assembly Date'].dt.year < 2022])

0

With all these transformations, we happened to create a new inconsistency: our training data is only supposed to contain claims assembled until the end of 2022, yet some 'Assembly Date' values now fall after 2022. To fix this, we will simply set the year back to 2022 and recheck the previous inconsistencies.

In [83]:
def fix_date(date):
    if date.year > 2022:
        if date.month == 2 and date.day == 29:
            return date.replace(day=28, year=2022)

        return date.replace(year=2022)

    else:
        return date
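
The special case for February 29 is needed because 2022 is not a leap year, so `replace(year=2022)` alone would raise a `ValueError` for that day. The sketch below repeats the function so that it runs on its own and applies it to a few made-up dates.

```python
import pandas as pd

def fix_date(date):
    # Same logic as the function above, repeated so the example is self-contained.
    if date.year > 2022:
        if date.month == 2 and date.day == 29:
            return date.replace(day=28, year=2022)
        return date.replace(year=2022)
    return date

# Toy dates (hypothetical): one valid, two after 2022 (one on a leap day), one missing.
samples = pd.Series(pd.to_datetime(['2021-07-04', '2023-05-17', '2024-02-29', None]))
fixed = samples.apply(fix_date)
```

Missing dates pass through unchanged, because the `.year` of `NaT` is NaN and the comparison with 2022 is False.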

In [84]:
df['Assembly Date'] = df['Assembly Date'].apply(fix_date)

In [85]:
len(df[df['Assembly Date'].dt.year > 2022])

0

In [86]:
# accident inconsistencies
len(df[
    (df['Accident Date'] > df['Assembly Date']) |  # accident happened after claim assembly
    (df['Accident Date'] > df['C-2 Date']) |  # accident happened after c-2 form delivery
    (df['Accident Date'] > df['C-3 Date']) |  # accident happened after c-3 form delivery
    (df['Accident Date'] > df['First Hearing Date'])  # accident happened after first hearing
])

13

In [87]:
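# fixing them (the mask from In [66] still covers these rows; rows that are already consistent are left unchanged)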
df.loc[train_accident_condition] = df.loc[train_accident_condition].apply(update_accident, axis=1)

In [88]:
# rechecking
len(df[
    (df['Accident Date'] > df['Assembly Date']) |  # accident happened after claim assembly
    (df['Accident Date'] > df['C-2 Date']) |  # accident happened after c-2 form delivery
    (df['Accident Date'] > df['C-3 Date']) |  # accident happened after c-3 form delivery
    (df['Accident Date'] > df['First Hearing Date'])  # accident happened after first hearing
])

0

In [89]:
# hearing inconsistencies
len(df[
    (df['First Hearing Date'] < df['Accident Date']) |
    (df['First Hearing Date'] < df['C-2 Date']) |
    (df['First Hearing Date'] < df['C-3 Date']) |
    (df['First Hearing Date'] < df['Assembly Date'])
])

0

#### 7.4.4. C-2 Date Timing

After some brief research, we know that the C-2 form must be submitted by the employer within 10 days of knowing about the accident.

In [90]:
len(df[df['C-2 Date'] > (df['Accident Date'] + timedelta(days=10))])

266581

Almost half of the data (about 45%) misses this deadline, so we will check how these claims are distributed across the target variable.

In [91]:
c2_timing = df[df['C-2 Date'] > (df['Accident Date'] + timedelta(days=10))].copy()

In [92]:
c2_timing['Claim Injury Type'].value_counts(normalize=True)

Claim Injury Type
2. NON-COMP        0.468158
4. TEMPORARY       0.265844
3. MED ONLY        0.127560
5. PPD SCH LOSS    0.111144
1. CANCELLED       0.017199
6. PPD NSL         0.008511
8. DEATH           0.001380
7. PTD             0.000203
Name: proportion, dtype: float64

Around half of these claims (about 49%) end up either cancelled or non-compensated, so this information could potentially be a good predictor of the target.</br>
For that reason, we create a new binary variable 'C-2 Missed Timing', which tells us whether the C-2 form missed the allowed timeframe (1) or not (0).

In [93]:
df['C-2 Missed Timing'] = np.where(
    df['C-2 Date'].notna(),  # Check if 'C-2 Date' is not null
    np.where(df['C-2 Date'] > (df['Accident Date'] + timedelta(days=10)), 1, 0),
    np.nan  # NaN in 'C-2 Date'
)

In [94]:
df_test['C-2 Missed Timing'] = np.where(
    df_test['C-2 Date'].notna(),  # Check if 'C-2 Date' is not null
    np.where(df_test['C-2 Date'] > (df_test['Accident Date'] + timedelta(days=10)), 1, 0),
    np.nan  # NaN in 'C-2 Date'
)

#### 7.4.5. C-3 Date Timing

This form must be delivered within 2 years of the accident.

In [95]:
len(df[df['C-3 Date'] > (df['Accident Date'] + timedelta(days=(365*2)))])

2097

In [96]:
c3_timing = df[df['C-3 Date'] > (df['Accident Date'] + timedelta(days=(365*2)))].copy()

In [97]:
c3_timing['Claim Injury Type'].value_counts(normalize=True)

Claim Injury Type
2. NON-COMP        0.424893
4. TEMPORARY       0.209347
5. PPD SCH LOSS    0.158321
3. MED ONLY        0.143538
1. CANCELLED       0.056748
6. PPD NSL         0.005722
8. DEATH           0.000954
7. PTD             0.000477
Name: proportion, dtype: float64

About 42% of the claims that miss the C-3 delivery deadline belong to the non-compensated class. We will create a new binary feature, 'C-3 Missed Timing', which tells us whether the C-3 form missed the allowed timeframe (1) or not (0), in an attempt to improve model performance.

In [98]:
df['C-3 Missed Timing'] = np.where(
    df['C-3 Date'].notna(),  # Check if 'C-3 Date' is not null
    np.where(df['C-3 Date'] > (df['Accident Date'] + timedelta(days=(365*2))), 1, 0),
    np.nan  # NaN in 'C-3 Date'
)

In [99]:
df_test['C-3 Missed Timing'] =  np.where(
    df_test['C-3 Date'].notna(),  # Check if 'C-3 Date' is not null
    np.where(df_test['C-3 Date'] > (df_test['Accident Date'] + timedelta(days=(365*2))), 1, 0),
    np.nan  # NaN in 'C-3 Date'
)
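
The four cells above share the same pattern, so they could be wrapped in one helper. The function name `missed_deadline` and the toy rows are hypothetical. The sketch also makes one behavior visible: when 'Accident Date' is missing but the form date is present, the comparison is False and the flag becomes 0 rather than NaN.

```python
from datetime import timedelta

import numpy as np
import pandas as pd

def missed_deadline(frame, form_column, allowed_days):
    """1.0 if the form arrived after the deadline, 0.0 if not, NaN if the form date is missing."""
    deadline = frame['Accident Date'] + timedelta(days=allowed_days)
    late = frame[form_column] > deadline  # any comparison involving NaT is False
    return np.where(frame[form_column].notna(), np.where(late, 1, 0), np.nan)

# Toy data (hypothetical rows).
toy = pd.DataFrame({
    'Accident Date': pd.to_datetime(['2021-01-01', '2021-01-01', None, '2021-01-01']),
    'C-2 Date': pd.to_datetime(['2021-01-05', '2021-02-01', '2021-03-01', None]),
})
toy['C-2 Missed Timing'] = missed_deadline(toy, 'C-2 Date', allowed_days=10)
```

The same call with `'C-3 Date'` and `allowed_days=365 * 2` would reproduce the C-3 flag.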

### 7.5. Covid-19 Indicator

In [100]:
df[
    (df['COVID-19 Indicator'] == "Y") &
    (df['Accident Date'].dt.year < 2020)
]

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing
5401736,2019-12-17,29,N,2020-01-13,Y,2141.8,1990,2020-01-30,2020-01-08,"POLICE, FIRE, SANITATION",3A. SELF PUBLIC,5. PPD SCH LOSS,ROCKLAND,Y,NYC,NaT,F,3.0,92,PUBLIC ADMINISTRATION,III,,83,PANDEMIC,83,COVID-19,48,INTERNAL ORGANS,10954.0,0,Not Work Related,3,1.0,0.0
5481714,2012-01-27,55,N,2020-05-13,Y,2287.26,1965,2020-04-12,2020-08-07,STATE INSURANCE FUND,2A. SIF,4. TEMPORARY,BRONX,Y,NYC,2020-11-18,M,,92,PUBLIC ADMINISTRATION,IV,,83,PANDEMIC,83,COVID-19,60,LUNGS,10475.0,0,Not Work Related,2,1.0,1.0
5628807,2018-02-05,48,N,2021-01-04,Y,1421.29,1972,2021-01-28,2020-10-23,MANHATTAN & BRONX SURFACE,3A. SELF PUBLIC,4. TEMPORARY,KINGS,Y,NYC,2021-03-25,F,,48,TRANSPORTATION AND WAREHOUSING,IV,,83,PANDEMIC,83,COVID-19,65,INSUFFICIENT INFO TO PROPERLY IDENTIFY - UNCLASSIFIED,11236.0,0,Not Work Related,4,1.0,1.0
5642885,2019-12-22,53,N,2021-01-25,N,0.0,1966,2021-01-23,NaT,NYU LANGONE HOSPITALS,4A. SELF PRIVATE,2. NON-COMP,SUFFOLK,Y,HAUPPAUGE,NaT,M,,62,HEALTH CARE AND SOCIAL ASSISTANCE,IV,,83,PANDEMIC,83,COVID-19,91,BODY SYSTEMS AND MULTIPLE BODY SYSTEMS,11787.0,0,Not Work Related,0,1.0,
5837852,2010-07-29,43,N,2021-10-20,Y,1482.71,1978,2021-11-01,2021-09-10,NYC TRANSIT AUTHORITY,3A. SELF PUBLIC,4. TEMPORARY,KINGS,Y,NYC,2021-12-23,F,,48,TRANSPORTATION AND WAREHOUSING,IV,,83,PANDEMIC,83,COVID-19,91,BODY SYSTEMS AND MULTIPLE BODY SYSTEMS,,0,Not Work Related,3,1.0,1.0


It looks like the date changes caused some inconsistencies between 'Accident Date' and 'COVID-19 Indicator'. We will drop the 3 inconsistent rows (the ones whose accident occurred before December 2019).

In [101]:
df.drop(index=[5481714, 5628807, 5837852], inplace=True)

In [102]:
df_test[
    (df_test['COVID-19 Indicator'] == "Y") &
    (df_test['Accident Date'].dt.year < 2020)
]

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing


No inconsistencies in the test data.
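
Instead of hard-coding the index labels, the same rows could be selected with a boolean mask. This is a sketch on made-up rows; the cut-off of 1 December 2019 is our assumption, taken from the rule stated above.

```python
import pandas as pd

# Toy data (hypothetical rows and index labels).
toy = pd.DataFrame(
    {
        'Accident Date': pd.to_datetime(['2019-12-17', '2012-01-27', '2020-04-02', '2018-02-05']),
        'COVID-19 Indicator': ['Y', 'Y', 'Y', 'N'],
    },
    index=[101, 102, 103, 104],
)

cutoff = pd.Timestamp('2019-12-01')  # assumption: earliest plausible COVID-19 accident date

# A COVID-19 claim whose accident predates the cut-off is treated as inconsistent.
implausible = (toy['COVID-19 Indicator'] == 'Y') & (toy['Accident Date'] < cutoff)
cleaned = toy[~implausible]
```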

### 7.6. The Wanna-be NaN

As we have seen in [Chapter 6](#6-unique-values), we have string representations of missing values in 'Zip Code' - we shall convert these to the default NaN from numpy.

In [103]:
df['Zip Code'] = df['Zip Code'].replace('nan', np.nan)

In [104]:
# check for this inconsistency in the test data
len(df_test[df_test['Zip Code'] == 'nan'])

19342

In [105]:
# fixing the inconsistency in the test data
df_test['Zip Code'] = df_test['Zip Code'].replace('nan', np.nan)
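
The difference between the string 'nan' and a real missing value can be seen on a toy Series (made-up values): `isna()` does not detect the string until it is replaced.

```python
import numpy as np
import pandas as pd

zip_code = pd.Series(['13662', 'nan', '14569', np.nan], dtype='object')

string_nans = (zip_code == 'nan').sum()   # the text 'nan' is an ordinary string
missing_before = zip_code.isna().sum()    # only the real NaN is counted

cleaned = zip_code.replace('nan', np.nan)
missing_after = cleaned.isna().sum()      # now both entries count as missing
```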

### 7.7. Duplicated Values

In [106]:
# check how many duplicated rows we have
df.duplicated(keep=False).sum()

19446

In [107]:
# check how many of these are distinct
len(df[df.duplicated(keep=False)].drop_duplicates())

1097

In fact, even though we have 19k+ duplicated rows, only about 1.1k of them are distinct, i.e., some rows have more than one duplicate.
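
The three counts involved (all copies, extra copies and distinct patterns) are easy to mix up, so here is a small sketch on made-up rows showing how they relate.

```python
import pandas as pd

# Toy data (hypothetical): one pattern appears 3 times, another twice, one row is unique.
toy = pd.DataFrame({
    'Assembly Date': ['2020-01-01', '2020-01-01', '2020-01-01',
                      '2020-01-02', '2020-01-02', '2020-01-03'],
    'Gender': [None, None, None, None, None, 'M'],
})

all_copies = toy.duplicated(keep=False).sum()   # every row that has a twin
extra_copies = toy.duplicated().sum()           # rows that drop_duplicates would remove
distinct_patterns = len(toy[toy.duplicated(keep=False)].drop_duplicates())
rows_after_drop = len(toy.drop_duplicates())
```

Here `all_copies` equals `extra_copies + distinct_patterns`, which is also how the counts in this section fit together.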

In [108]:
df[df.duplicated()].drop_duplicates().head()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing
363651794,NaT,,,2020-01-01,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
237445888,NaT,,,2020-01-02,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
573254410,NaT,,,2020-01-03,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
475623942,NaT,,,2020-01-04,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
312265350,NaT,,,2020-01-05,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,


We shall check whether all values in these rows are NaN, with the exception of 'Assembly Date' and the newly created columns 'C-2 Missed Timing' and 'C-3 Missed Timing'.

In [109]:
# drop duplicates from those duplicated rows
unique_duplicated_rows = df[df.duplicated(keep=False)].drop_duplicates()

# check if all values (excluding 'Assembly Date', 'C-2 Missed Timing' and 'C-3 Missed Timing') are NaN
nan_check = unique_duplicated_rows.drop(columns=['Assembly Date', 'C-2 Missed Timing', 'C-3 Missed Timing']).isna().all(axis=1)

# check the rows that have other than missing values
unique_duplicated_rows[~nan_check]

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing
5686981,NaT,0,N,2021-03-25,Y,0.0,1968,NaT,2021-03-16,ACE AMERICAN INSURANCE CO,1A. PRIVATE,1. CANCELLED,UNKNOWN,N,NYC,2022-10-24,M,,,,UK,,,,,,,,7032,0,Not Work Related,0,,0.0


This means that, except for one row, all duplicated rows have missing values in every column but 'Assembly Date'.

In [110]:
# percentage of duplicate values
df.duplicated().sum() / df.shape[0] * 100

3.091831559294788

The duplicated (and "empty") rows we have been discussing account for around 3.1% of our data - we will drop these values.

In [111]:
df.drop_duplicates(inplace=True)

### 7.8. The Target

In [112]:
# check how many missing values we have in our target variable
df['Claim Injury Type'].isna().sum()

1096

In [113]:
df[df['Claim Injury Type'].isna()].head()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing
957648180,NaT,,,2020-01-01,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
653337074,NaT,,,2020-01-02,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
489154817,NaT,,,2020-01-03,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
351928125,NaT,,,2020-01-04,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,
454184427,NaT,,,2020-01-05,,,,NaT,NaT,,,,,,,NaT,,,,,,,,,,,,,,,,,,


At first glance, it looks like these rows are all NaN values with the exception of 'Assembly Date' and the new columns - just like before. Let us check if this is indeed the case.

In [114]:
no_target_df = df[df['Claim Injury Type'].isna()]

# check if all values (excluding 'Assembly Date', 'C-2 Missed Timing' and 'C-3 Missed Timing') are NaN
nan_check = no_target_df.drop(columns=['Assembly Date', 'C-2 Missed Timing', 'C-3 Missed Timing']).isna().all(axis=1)

# check the rows that have other than missing values
no_target_df[~nan_check]

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing


It seems like all these rows have missing values in every column but 'Assembly Date' - we will also remove these rows.

In [115]:
df = df[~df['Claim Injury Type'].isna()]

Knowing the original dataset had 593471 rows, let us check what percentage of the data we have removed so far.

In [116]:
(1 - (df.shape[0] / 593471)) * 100

3.277160973324733

We are still below the 5% threshold, so we are good to go!

## 8. Feature Engineering

In [117]:
# amount of days between the accident and the claim assembly
df['Days Difference'] = (df['Assembly Date'] - df['Accident Date']).dt.days
df_test['Days Difference'] = (df_test['Assembly Date'] - df_test['Accident Date']).dt.days

# binary columns if the date is missing or not
df['C-2 Missing'] = df['C-2 Date'].isnull().astype(int)
df_test['C-2 Missing'] = df_test['C-2 Date'].isnull().astype(int)
df['C-3 Missing'] = df['C-3 Date'].isnull().astype(int)
df_test['C-3 Missing'] = df_test['C-3 Date'].isnull().astype(int)
df['Has Hearing'] = np.where(df['First Hearing Date'].notna(), 1, 0)
df_test['Has Hearing'] = np.where(df_test['First Hearing Date'].notna(), 1, 0)
df['Has IME-4 Report'] = np.where(df['IME-4 Count'].notna(), 1, 0)
df_test['Has IME-4 Report'] = np.where(df_test['IME-4 Count'].notna(), 1, 0)
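
A toy example (made-up dates) shows how these features behave when a date is missing: the day difference becomes NaN, while the missing-value flag is always 0 or 1.

```python
import pandas as pd

# Toy data (hypothetical rows).
toy = pd.DataFrame({
    'Accident Date': pd.to_datetime(['2020-01-01', None]),
    'Assembly Date': pd.to_datetime(['2020-01-11', '2020-02-01']),
    'C-3 Date': pd.to_datetime([None, '2020-02-10']),
})

# Subtracting a missing date gives NaT, which `.dt.days` turns into NaN.
toy['Days Difference'] = (toy['Assembly Date'] - toy['Accident Date']).dt.days

# 1 when the form date is missing, 0 otherwise.
toy['C-3 Missing'] = toy['C-3 Date'].isnull().astype(int)
```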

In [118]:
numeric_features = [
    'Age at Injury'
    , 'Average Weekly Wage'
    , 'Birth Year'
    , 'IME-4 Count'
    , 'Number of Dependents'
]

In [119]:
# log transformation of numeric features
# (a very small constant is added to avoid the log(0) issue)
for feature in numeric_features:
    df[f'Log {feature}'] = np.log(df[feature] + 1e-9)
    df_test[f'Log {feature}'] = np.log(df_test[feature] + 1e-9)
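
Adding `1e-9` means that every zero is mapped to `log(1e-9)`, roughly -20.72, far away from the rest of the values. The sketch below compares this with `np.log1p`, which computes `log(1 + x)` and maps zero to zero. It is shown only as a possible alternative, not as what this notebook uses.

```python
import numpy as np

values = np.array([0.0, 1.0, 1745.93])

shifted_log = np.log(values + 1e-9)  # approach used above: zero becomes about -20.72
log1p = np.log1p(values)             # alternative: log(1 + x), zero stays at 0
```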

In [120]:
df.head()

Unnamed: 0,Accident Date,Age at Injury,Alternative Dispute Resolution,Assembly Date,Attorney/Representative,Average Weekly Wage,Birth Year,C-2 Date,C-3 Date,Carrier Name,Carrier Type,Claim Injury Type,County of Injury,COVID-19 Indicator,District Name,First Hearing Date,Gender,IME-4 Count,Industry Code,Industry Code Description,Medical Fee Region,OIICS Nature of Injury Description,WCIO Cause of Injury Code,WCIO Cause of Injury Description,WCIO Nature of Injury Code,WCIO Nature of Injury Description,WCIO Part Of Body Code,WCIO Part Of Body Description,Zip Code,Agreement Reached,WCB Decision,Number of Dependents,C-2 Missed Timing,C-3 Missed Timing,Days Difference,C-2 Missing,C-3 Missing,Has Hearing,Has IME-4 Report,Log Age at Injury,Log Average Weekly Wage,Log Birth Year,Log IME-4 Count,Log Number of Dependents
5393875,2019-12-30,31,N,2020-01-01,N,0.0,1988,2019-12-31,NaT,NEW HAMPSHIRE INSURANCE CO,1A. PRIVATE,2. NON-COMP,ST. LAWRENCE,N,SYRACUSE,NaT,M,,44,RETAIL TRADE,I,,27,FROM LIQUID OR GREASE SPILLS,10,CONTUSION,62,BUTTOCKS,13662,0,Not Work Related,1,0.0,,2.0,0,1,0,0,3.433987,-20.723266,7.594884,,0.0
5393091,2019-08-30,46,N,2020-01-01,Y,1745.93,1973,2020-01-01,2020-01-14,ZURICH AMERICAN INSURANCE CO,1A. PRIVATE,4. TEMPORARY,WYOMING,N,ROCHESTER,2020-02-21,F,4.0,23,CONSTRUCTION,I,,97,REPETITIVE MOTION,49,SPRAIN OR TEAR,38,SHOULDER(S),14569,1,Not Work Related,4,1.0,0.0,124.0,0,0,1,1,3.828641,7.465043,7.587311,1.386294,1.386294
5393889,2019-12-06,40,N,2020-01-01,N,1434.8,1979,2020-01-01,NaT,INDEMNITY INSURANCE CO OF,1A. PRIVATE,4. TEMPORARY,ORANGE,N,ALBANY,NaT,M,,56,ADMINISTRATIVE AND SUPPORT AND WASTE MANAGEMENT AND REMEDIAT,II,,79,OBJECT BEING LIFTED OR HANDLED,7,CONCUSSION,10,MULTIPLE HEAD INJURY,12589,0,Not Work Related,6,1.0,,26.0,0,1,0,0,3.688879,7.268781,7.590347,,1.791759
5393887,2019-12-30,61,N,2020-01-01,N,,1958,2019-12-31,NaT,STATE INSURANCE FUND,2A. SIF,2. NON-COMP,DUTCHESS,N,ALBANY,NaT,M,,62,HEALTH CARE AND SOCIAL ASSISTANCE,II,,16,"HAND TOOL, UTENSIL; NOT POWERED",43,PUNCTURE,36,FINGER(S),12603,0,Not Work Related,1,0.0,,2.0,0,1,0,0,4.110874,,7.579679,,0.0
5393863,2019-12-26,67,N,2020-01-01,N,0.0,1952,2019-12-31,NaT,INDEMNITY INS. OF N AMERICA,1A. PRIVATE,3. MED ONLY,SUFFOLK,N,HAUPPAUGE,NaT,M,,44,RETAIL TRADE,IV,,31,"FALL, SLIP OR TRIP, NOC",10,CONTUSION,38,SHOULDER(S),11772,0,Not Work Related,5,0.0,,6.0,0,1,0,0,4.204693,-20.723266,7.57661,,1.609438


## 9. Data Statistics

Let us recheck our count of missing values, now that we have fixed some inconsistencies, and look at some summary statistics.

### 9.1. Missing Values (again)

In [121]:
# calculate the number of NaNs for each column
nan_counts = df.isna().sum()

# get the total number of rows
total_rows = df.shape[0]

# calculate the percentage of NaN values for each column
percentage_nans = (nan_counts / total_rows) * 100

# format the percentages with '%' sign
percentage_nans = percentage_nans.apply(lambda x: f"{x:.2f}%")

# combine all the information into a DataFrame
nan_summary = pd.DataFrame({
    'NaN Count': nan_counts
    , 'Total Values': [total_rows] * len(nan_counts)
    , 'Percentage NaN': percentage_nans
})

display(nan_summary)

Unnamed: 0,NaN Count,Total Values,Percentage NaN
Accident Date,3688,574022,0.64%
Age at Injury,0,574022,0.00%
Alternative Dispute Resolution,0,574022,0.00%
Assembly Date,0,574022,0.00%
Attorney/Representative,0,574022,0.00%
Average Weekly Wage,28651,574022,4.99%
Birth Year,31018,574022,5.40%
C-2 Date,14559,574022,2.54%
C-3 Date,386781,574022,67.38%
Carrier Name,0,574022,0.00%


We have considerably reduced the number of missing values - nice!
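
The same summary can be built more compactly, since the mean of a boolean mask is the share of missing values. The sketch below uses a made-up frame and keeps the percentage numeric, which makes it possible to sort by it before formatting.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values).
toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, np.nan], 'b': ['x', 'y', None, 'z']})

summary = pd.DataFrame({
    'NaN Count': toy.isna().sum(),
    'Total Values': len(toy),
    'Percentage NaN': (toy.isna().mean() * 100).round(2),
})

# Columns with the most missing values first.
summary = summary.sort_values('Percentage NaN', ascending=False)
```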

### 9.2. Summary Statistics

In [122]:
# defining the numeric, date and categorical features
numeric_features.extend(['Days Difference', *[feat for feat in df.columns if 'Log' in feat]])

date_features = [
    'Accident Date'
    , 'Assembly Date'
    , 'C-2 Date'
    , 'C-3 Date'
    , 'First Hearing Date'
]

categorical_features = [
    feat for feat in df.columns
    if feat not in numeric_features
    and feat not in date_features
    and feat != 'OIICS Nature of Injury Description'
]

In [123]:
# helper functions for aggregation
def freq(x): return x.value_counts().iloc[0] if not x.empty else None
def mode(x): return x.mode().iloc[0] if not x.mode().empty else None
def _25(x): return x.quantile(0.25)
def _75(x): return x.quantile(0.75)
def _90(x): return x.quantile(0.90)
def _95(x): return x.quantile(0.95)
def _98(x): return x.quantile(0.98)        


# metric aggregations 
metric_functions = [
    'sum'
    , 'mean'
    , 'std'
    , 'var'
    , 'skew'
    , 'kurt'
    , 'min'
    , _25
    , 'median'
    , _75
    , _90
    , _95
    , _98
    , 'max'
    , mode
]

# date aggregations
date_functions = [
    'count'
    , 'nunique'
    , mode
    , 'min'
    , _25
    , 'median'
    , _75
    , _90
    , _95
    , _98
    , 'max'
] 

# categorical aggregations
categorical_functions = [
    'count'
    , 'nunique'
    , mode
    , freq
]

def get_aggregations(_data, _type, selected):  
    agg_dict = {column : _type for column in _data[selected].columns}

    return _data[selected].agg(agg_dict).round(2).T 
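
The percentile helpers (`_25`, `_75`, ...) all follow the same pattern, so they could be generated by a small factory. The name `make_quantile` and the toy frame are hypothetical. Setting `__name__` is what gives each function its label in the aggregated table.

```python
import pandas as pd

def make_quantile(q):
    """Build an aggregation function for quantile `q`, labelled e.g. '_25' for q=0.25."""
    def quantile(x):
        return x.quantile(q)
    quantile.__name__ = f"_{round(q * 100)}"
    return quantile

# Toy data (hypothetical values).
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],
    'b': [10.0, 20.0, 30.0, 40.0, 50.0],
})

functions = ['min', make_quantile(0.25), 'median', make_quantile(0.75), 'max']
summary = toy.agg(functions).round(2).T
```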

In [124]:
get_aggregations(df, metric_functions, numeric_features)

Unnamed: 0,sum,mean,std,var,skew,kurt,min,_25,median,_75,_90,_95,_98,max,mode
Age at Injury,24185620.0,42.13,14.28,203.91,-0.01,-0.32,0.0,31.0,42.0,54.0,61.0,64.0,68.0,120.0,31.0
Average Weekly Wage,267822100.0,491.08,6092.94,37123913.76,378.98,160019.95,0.0,0.0,0.0,841.0,1550.08,1977.44,2523.17,2828079.0,0.0
Birth Year,1073951000.0,1977.8,13.85,191.82,-0.15,-0.7,1900.0,1966.0,1978.0,1989.0,1996.0,1999.0,2001.0,2018.0,1990.0
IME-4 Count,425944.0,3.21,2.83,8.02,2.59,13.21,1.0,1.0,2.0,4.0,7.0,9.0,12.0,73.0,1.0
Number of Dependents,1725834.0,3.01,2.0,4.0,-0.0,-1.25,0.0,1.0,3.0,5.0,6.0,6.0,6.0,6.0,6.0
Days Difference,47211040.0,82.78,605.48,366602.77,14.52,251.83,0.0,5.0,10.0,25.0,74.0,166.0,421.0,21503.0,6.0
Log Age at Injury,1989065.0,3.47,2.38,5.64,-9.87,97.55,-20.72,3.43,3.74,3.99,4.11,4.16,4.22,4.79,3.43
Log Average Weekly Wage,-5496489.0,-10.08,13.46,181.22,0.48,-1.77,-20.72,-20.72,-20.72,6.73,7.35,7.59,7.83,14.86,-20.72
Log Birth Year,4121245.0,7.59,0.01,0.0,-0.17,-0.68,7.55,7.58,7.59,7.6,7.6,7.6,7.6,7.61,7.6
Log IME-4 Count,116087.6,0.87,0.74,0.55,0.42,-0.68,0.0,0.0,0.69,1.39,1.95,2.2,2.48,4.29,0.0


- **Age at Injury**: the average age at the time of injury is 42 years; there are very low values (minimum 0, which we will fix after imputing other columns) and very high values (maximum 120). A brief online search suggests the average retirement age is between 63 and 65, so values far above that range are questionable
- **Average Weekly Wage**: at least half of the people in the dataset have an average weekly wage of 0 (the median is 0); the highest earner is recorded at more than $2.8M a week
- **IME-4 Count**: among claims with at least one report, a claim gets about 3 examiner reports on average, but the maximum for a single claim is 73 (which is unusual)
- **Number of Dependents**: people have on average 3 dependents
- **Days Difference**: on average, it takes about 83 days (close to 3 months) for a claim to be assembled; however, the mean is pulled up by extreme values, since 75% of claims were assembled within 25 days while the maximum is 21,503 days
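- **Log Transformations**: the log-transformed features have means closer to 0 and much smaller standard deviations than the originals (e.g. 0.74 instead of 2.83 for IME-4 Count), and Log IME-4 Count is far less skewed (0.42 instead of 2.59). However, the minimum of -20.72 in Log Age at Injury and Log Average Weekly Wage corresponds to zero values in the original columns, so these two features still do not resemble a normal distribution. Number of Dependents, which resembles a uniform distribution, is the exception and has no log version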
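
The value -20.72 that appears in the log features is what a natural log returns for roughly 1e-9. A minimal sketch, assuming the log columns were built by adding a small constant before taking the log (the constant `EPS = 1e-9` is an assumption; the actual transformation is not shown in this section):

```python
import numpy as np
import pandas as pd

EPS = 1e-9  # assumed offset to avoid log(0); not confirmed by the notebook

# two zero wages and two positive wages taken from the percentile table
wage = pd.Series([0.0, 0.0, 841.0, 1550.08])
log_wage = np.log(wage + EPS)

# zeros collapse to a single large negative value, far from the rest
print(log_wage.round(2).tolist())
```

Under this assumption, every zero maps to the same large negative value, which would explain why the 25th percentile and the median of Log Average Weekly Wage are both -20.72.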

In the notebook 'PART 3 - PreProcess, Feature Selection, Model Assessment and Evaluation', a threshold will be applied to **Age at Injury**, **Average Weekly Wage** and **IME-4 Count** to handle the extreme values.
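
One common way to apply such a threshold is to cap values at a chosen upper quantile. This is only a sketch of the idea: the actual rule is defined in the Part 3 notebook and may differ, and both the data and the 0.75 cut-off below are made up.

```python
import pandas as pd

# made-up wages with one extreme value
wage = pd.Series([0.0, 500.0, 841.0, 1550.0, 2828079.0])

cap = wage.quantile(0.75)        # hypothetical cut-off
capped = wage.clip(upper=cap)    # values above the cap are set to the cap

print(cap, capped.tolist())
```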

In [125]:
get_aggregations(df, date_functions, date_features)

Unnamed: 0,count,nunique,mode,min,_25,median,_75,_90,_95,_98,max
Accident Date,570334,5505,2020-03-01 00:00:00,1961-09-06 00:00:00,2020-09-14 00:00:00,2021-06-27 00:00:00,2022-03-21 00:00:00,2022-08-25 00:00:00,2022-10-19 00:00:00,2022-11-23 00:00:00,2022-12-30 00:00:00
Assembly Date,574022,902,2020-03-06 00:00:00,2020-01-01 00:00:00,2020-10-27 00:00:00,2021-07-30 00:00:00,2022-04-20 00:00:00,2022-09-19 00:00:00,2022-11-04 00:00:00,2022-12-08 00:00:00,2022-12-31 00:00:00
C-2 Date,559463,2426,2021-05-11 00:00:00,1996-01-12 00:00:00,2020-11-05 00:00:00,2021-08-06 00:00:00,2022-04-26 00:00:00,2022-09-22 00:00:00,2022-11-09 00:00:00,2022-12-13 00:00:00,2024-05-31 00:00:00
C-3 Date,187241,1632,2021-04-21 00:00:00,2010-05-14 00:00:00,2020-10-27 00:00:00,2021-07-20 00:00:00,2022-04-18 00:00:00,2022-10-05 00:00:00,2022-12-05 00:00:00,2023-02-09 00:00:00,2024-05-31 00:00:00
First Hearing Date,150794,1279,2022-04-15 00:00:00,2020-01-30 00:00:00,2021-06-08 00:00:00,2022-03-15 00:00:00,2023-01-13 00:00:00,2023-07-19 00:00:00,2023-11-14 00:00:00,2024-03-01 00:00:00,2024-06-07 00:00:00


- **Accident Date**: at least 75% of accidents happened from September 2020 onwards, even though the earliest recorded accident dates back to 1961
- **First Hearing Date**: at least half of the first hearings were held in March 2022 or later (the median is 2022-03-15)
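
The percentile columns for dates read the same way as for numbers: the `_25` value is the date by which a quarter of the records had occurred. A small sketch with made-up dates (only the first and last resemble values from the table):

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    '1961-09-06', '2020-09-14', '2021-06-27', '2022-03-21', '2022-12-30',
]))

q25 = dates.quantile(0.25)                 # a Timestamp, not a number
share_on_or_after = (dates >= q25).mean()  # fraction of records at or after q25
share_before_2020 = (dates < pd.Timestamp('2020-01-01')).mean()

print(q25, share_on_or_after, share_before_2020)
```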

In [126]:
get_aggregations(df, categorical_functions, categorical_features)

Unnamed: 0,count,nunique,mode,freq
Alternative Dispute Resolution,574022.0,3.0,N,571408.0
Attorney/Representative,574022.0,2.0,N,392291.0
Carrier Name,574022.0,2046.0,STATE INSURANCE FUND,111143.0
Carrier Type,574022.0,8.0,1A. PRIVATE,285367.0
Claim Injury Type,574022.0,8.0,2. NON-COMP,291078.0
County of Injury,574022.0,63.0,SUFFOLK,60430.0
COVID-19 Indicator,574022.0,2.0,N,546504.0
District Name,574022.0,8.0,NYC,270775.0
Gender,569265.0,2.0,M,335216.0
Industry Code,564065.0,24.0,62,114339.0


- **Attorney/Representative**: most claims (about 68%) do not have an attorney/representative
- **Gender**: males account for most claims (about 59% of the non-missing values)
- **COVID-19 Indicator**: around 95% of claims have a negative value, suggesting that pandemic-related claims are a minor portion of the data
- **Industry Code Description**: the industry with the most claims is HEALTH CARE AND SOCIAL ASSISTANCE
- **WCIO Cause of Injury Description**: LIFTING is the most frequent cause of injury
- **WCIO Nature of Injury Description**: STRAIN OR TEAR is the most frequent nature of injury
- **WCIO Part Of Body Description**: the most frequently injured body part is LOWER BACK AREA
- **Zip Code**: such low frequency for the most frequent value indicates the claims are geographically dispersed at this granularity
- **Agreement Reached**: most claims need the involvement of the WCB to get closed
- **C-2 Missed Timing**: the C-2 form delivery date is missed more than half of the time
- **C-3 Missed Timing**: the C-3 form delivery date is missed more than 75% of the time
- **C-2 Missing**: the C-2 form is missing for only a small share of claims (C-2 Date is filled in for 559,463 of the 574,022 claims with an Assembly Date, about 97%)
- **Has IME-4 Report**: more than 75% of claims do not have an examiner report
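
The share of the most frequent value can be read off the table by dividing `freq` by `count`. The sketch below does this for three rows copied from the output above; the `mode_share` column name is ours and is not part of the notebook.

```python
import pandas as pd

# count and freq copied from the categorical summary table
summary = pd.DataFrame(
    {'count': [574022, 569265, 574022], 'freq': [392291, 335216, 546504]},
    index=['Attorney/Representative', 'Gender', 'COVID-19 Indicator'],
)

# fraction of non-missing rows that take the most frequent value
summary['mode_share'] = (summary['freq'] / summary['count']).round(3)
print(summary)
```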

## 10. Export

We export the data with the changes made so far and continue the analysis in the notebook 'Part2-VisualExploration.ipynb'.

In [127]:
df_dict = {
    'train': df
    ,'test': df_test
}

for name, dataframe in df_dict.items():
    # save dataframe datatypes to dictionary
    dtypes_dict = dataframe.dtypes.astype(str).to_dict()

    # save dataframe datatypes to json
    with open(f'../Data/{name}_dtypes_initial_inspection.json', 'w') as f:
        json.dump(dtypes_dict, f)

In [128]:
df.to_csv('../Data/train_data_initial_inspection.csv', index=True)
df_test.to_csv('../Data/test_data_initial_inspection.csv', index=True)
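
When the exported files are read back, the saved dtypes can be reapplied. A minimal sketch, assuming datetime columns are passed through `parse_dates` (since `read_csv` does not accept datetime dtypes in `dtype`); it uses an in-memory buffer and a made-up frame instead of the real files, and how the next notebook actually reloads the data may differ.

```python
import io
import json
import pandas as pd

# made-up frame standing in for df
toy = pd.DataFrame({
    'Age at Injury': [31.0, 46.0],
    'Gender': ['M', 'F'],
    'Accident Date': pd.to_datetime(['2019-12-30', '2019-08-30']),
})

# export step: dtypes to JSON text, data to CSV text (index kept, as above)
dtypes_json = json.dumps(toy.dtypes.astype(str).to_dict())
csv_text = toy.to_csv(index=True)

# reload step: split datetime columns from the rest
saved = json.loads(dtypes_json)
date_cols = [col for col, dtype in saved.items() if dtype.startswith('datetime64')]
other_dtypes = {col: dtype for col, dtype in saved.items() if col not in date_cols}

restored = pd.read_csv(
    io.StringIO(csv_text),
    index_col=0,            # the first CSV column is the saved index
    dtype=other_dtypes,
    parse_dates=date_cols,
)
print(restored.dtypes)
```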