#### Imports

In [None]:
# importing the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from imblearn.pipeline import Pipeline
from sklearn.metrics import make_scorer, f1_score
from xgboost import XGBClassifier
from sklearn.feature_selection import RFE


In [None]:
# setting the options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
np.set_printoptions(threshold=np.inf)

In [15]:
df_train = pd.read_csv('train_data_preproc.csv', sep=',').set_index('Unnamed: 0')
df_val = pd.read_csv('validation_data_preproc.csv', sep=',').set_index('Unnamed: 0')
df_test = pd.read_csv('test_data_preproc.csv', sep=',').set_index('Unnamed: 0')

In [16]:
X_train = df_train.drop(columns=['Claim Injury Type'])
y_train = df_train['Claim Injury Type']

X_val = df_val.drop(columns=['Claim Injury Type'])
y_val = df_val['Claim Injury Type']

### Feature selection

First, we define the model. We chose XGBoost for now because we believe it handles complex relationships and noisy data well, and after testing other models such as Random Forest we were happier with the results it provided. We set the number of trees to 200 and the maximum depth of each tree to 6 to try to limit overfitting. <br>

We use a wrapper method, specifically Recursive Feature Elimination (RFE), to find the optimal number of features for the model we defined. We also use Stratified K-Fold so that each fold keeps the same class distribution as the training dataset. <br>
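
To illustrate what stratification means, here is a minimal hand-rolled sketch that deals the samples of each class round-robin across the folds. It is not scikit-learn's `StratifiedKFold` implementation (which can also shuffle); the function name `stratified_folds` and the toy labels are assumptions made for this example only.

```python
from collections import Counter, defaultdict


def stratified_folds(labels, n_splits):
    """Assign sample indices to folds so each fold keeps roughly the same class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        # Deal the indices of one class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)
    return folds


# Toy imbalanced target (hypothetical): 12 samples of class 0, 6 of class 1, 3 of class 2.
toy_labels = [0] * 12 + [1] * 6 + [2] * 3
toy_folds = stratified_folds(toy_labels, n_splits=3)
fold_class_counts = [Counter(toy_labels[i] for i in fold) for fold in toy_folds]

for fold_number, counts in enumerate(fold_class_counts):
    print(f"fold {fold_number}: {dict(counts)}")
```

Every fold of the toy example receives the same 4 : 2 : 1 class mix as the full target, which is the property we rely on when scoring an imbalanced multiclass problem fold by fold.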

To evaluate model performance we use macro F1-Score, which is the simple average of the F1-scores for each class in a multiclass problem. This way, we are also consistent with Kaggle.
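
As a small illustration of the metric, the sketch below computes the macro F1-score by hand in plain Python on a toy example. The helper names and the toy labels are assumptions for illustration; in the notebook itself the score comes from `f1_score(..., average='macro')`.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 for one class: 2*TP / (2*TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    denominator = 2 * tp + fp + fn
    # Convention used in this sketch: a class with no true or predicted samples scores 0.
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-scores, so rare classes count as much as common ones."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Toy labels (hypothetical), three classes with different frequencies.
toy_true = [0, 0, 0, 0, 1, 1, 2, 2]
toy_pred = [0, 0, 0, 1, 1, 1, 2, 0]

toy_per_class = [per_class_f1(toy_true, toy_pred, label) for label in (0, 1, 2)]
toy_macro_f1 = macro_f1(toy_true, toy_pred)
print(toy_per_class, toy_macro_f1)
```

Because every class contributes equally to the average, a model that ignores the rare injury types is penalised even if its overall accuracy is high.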

In [None]:
# defining the model

model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    scale_pos_weight=1,
    random_state=42,
    use_label_encoder=False,
    eval_metric='mlogloss'
)

kf = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

best_f1_score = 0
best_num_features = 0
best_selected_features = []

results = []


Next, we loop over different numbers of features to determine the optimal count. Because this code is computationally expensive, we ran the feature ranges in smaller batches rather than testing all of them at once. The code below shows one range as an example, and we present the combined results later. We keep the score on the training folds as well as on the validation folds so that we can evaluate whether the model overfits.

In [None]:
for n_features in range(18, 20):

    rfe = RFE(estimator=model, n_features_to_select=n_features, step=1)
    X_train_rfe = rfe.fit_transform(X_train, y_train)
    
    fold_val_scores = []
    fold_train_scores = []
    
    for train_index, val_index in kf.split(X_train_rfe, y_train):
        X_fold_train, X_fold_val = X_train_rfe[train_index], X_train_rfe[val_index]
        y_fold_train, y_fold_val = y_train.iloc[train_index], y_train.iloc[val_index]
        
        model.fit(X_fold_train, y_fold_train)
        
        y_train_pred = model.predict(X_fold_train)
        y_val_pred = model.predict(X_fold_val)
        
        train_f1 = f1_score(y_fold_train, y_train_pred, average='macro')
        val_f1 = f1_score(y_fold_val, y_val_pred, average='macro')
        
        fold_train_scores.append(train_f1)
        fold_val_scores.append(val_f1)
    
    avg_train_f1 = np.mean(fold_train_scores)
    avg_val_f1 = np.mean(fold_val_scores)
    overfit_percentage = ((avg_train_f1 - avg_val_f1) / avg_train_f1) * 100
    
    print(f"Number of features: {n_features} | Avg Train F1-Score: {avg_train_f1:.4f} | Avg Val F1-Score: {avg_val_f1:.4f} | Overfit %: {overfit_percentage:.2f}%")
    
    results.append({
        'Number of features': n_features,
        'Average Training F1-Score Macro': avg_train_f1,
        'Average Validation F1-Score Macro': avg_val_f1,
        'Overfit Percentage': overfit_percentage
    })
    
    if avg_val_f1 > best_f1_score:
        best_f1_score = avg_val_f1
        best_num_features = n_features
        best_selected_features = X_train.columns[rfe.support_]

Finally, we print the best result and convert the results to a DataFrame so that we can save them to a CSV file.

In [None]:
results_df = pd.DataFrame(results)

print(f"\nBest number of features: {best_num_features}")
print(f"Best F1-Score Macro on validation: {best_f1_score:.4f}")
print("Selected features:", best_selected_features.tolist())

results_df.to_csv("feature_selection_results_(18-19).csv", index=False)

In [29]:
# Importing the results we obtained before

file_0 = pd.read_csv("feature_selection_results_(18-19).csv", sep=',')
file_1 = pd.read_csv("feature_selection_results_(20-24).csv", sep=',')
file_2 = pd.read_csv("feature_selection_results_(25-31).csv", sep=',')
file_3 = pd.read_csv("feature_selection_results_(32-48).csv", sep=',')

combined_results = pd.concat([file_0, file_1, file_2, file_3], ignore_index=True)

combined_results

Unnamed: 0,Number of features,Average Training F1-Score Macro,Average Validation F1-Score Macro,Overfit Percentage
0,18,0.477194,0.416142,12.794023
1,19,0.503021,0.41836,16.830404
2,20,0.528748,0.426781,19.284629
3,21,0.532941,0.426205,20.027617
4,22,0.529757,0.426111,19.564871
5,23,0.552152,0.428181,22.452409
6,24,0.552132,0.429018,22.297988
7,25,0.572365,0.429646,24.934956
8,26,0.564921,0.430286,23.832486
9,27,0.568166,0.429675,24.375122


Analysing the results, we see that the fewer features we use, the less the model overfits, while the validation F1-score changes little. We are aware that we are still incurring substantial overfitting and that the model isn't generalising as well as we'd like, so we will check how the validation score changes across the folds of the cross-validation to see whether it is stable. If it is, then for now we prioritise the best validation score, so we choose 26 as the number of features.
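
As a quick check of how the Overfit Percentage column is derived, the sketch below recomputes it for two rows of the table (18 and 26 features) from the rounded scores shown there. The function name `overfit_percentage` is chosen for this example; the formula is the one used in the loop, the relative gap between the training and validation scores.

```python
def overfit_percentage(train_score, val_score):
    """Relative drop from the training score to the validation score, in percent."""
    return (train_score - val_score) / train_score * 100


# (average training F1, average validation F1) copied from the results table.
table_rows = {
    18: (0.477194, 0.416142),
    26: (0.564921, 0.430286),
}

gaps = {n: overfit_percentage(train, val) for n, (train, val) in table_rows.items()}
for n_features, gap in gaps.items():
    print(f"{n_features} features: {gap:.2f}% overfit")
```

Going from 18 to 26 features raises the validation score by roughly 0.014 while the relative gap grows from about 12.8% to about 23.8%, which is the trade-off discussed above.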

### Model Assessment - XGBoost 

Now that we chose to keep 26 features, we run RFE again just to select these 26 best features for the model. We will define a pipeline to facilitate the process in case we want to apply techniques to deal with class imbalance like SMOTE.

In [None]:
num_features = 26 

model = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
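    scale_pos_weight=1,       # reported as unused by XGBoost in the output below
    random_state=42,
    use_label_encoder=False,  # reported as unused by XGBoost in the output below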
    eval_metric='mlogloss'
)

pipeline = Pipeline([
    ('classifier', model)
])

rfe = RFE(estimator=model, n_features_to_select=num_features, step=5)
X_train_rfe = rfe.fit_transform(X_train, y_train)

all_features = X_train.columns
selected_features = all_features[rfe.support_]
removed_features = all_features[~rfe.support_]

X_train_sample_selected = X_train[selected_features]
X_val_sample_selected = X_val[selected_features]
print("Selected features:", selected_features.tolist())


Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



Selected features: ['Alternative Dispute Resolution', 'Attorney/Representative', 'Average Weekly Wage', 'Carrier Name', 'COVID-19 Indicator', 'IME-4 Count', 'Industry Code', 'WCIO Nature of Injury Code', 'WCIO Part Of Body Code', 'C-2 Missed Timing', 'Days Difference', 'C-2 Missing', 'C-3 Missing', 'Has Hearing', 'Has IME-4 Report', 'Accident Date_year', 'Assembly Date_year', 'C-2 Date_year', 'C-3 Date_year', 'First Hearing Date_year', 'Carrier Type_3A. SELF PUBLIC', 'Carrier Type_5D. SPECIAL FUND - UNKNOWN', 'Carrier Type_UNKNOWN', 'District Name_NYC', 'District Name_ROCHESTER', 'District Name_STATEWIDE']


We use Stratified K-Fold to evaluate the model on each of the folds and to make the most of the data we have.

In [None]:
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

train_fold_scores = []
val_fold_scores = []

for train_index, val_index in kf.split(X_train_sample_selected, y_train):
    X_fold_train, X_fold_val = X_train_sample_selected.iloc[train_index], X_train_sample_selected.iloc[val_index]
    y_fold_train, y_fold_val = y_train.iloc[train_index], y_train.iloc[val_index]
    
    pipeline.fit(X_fold_train, y_fold_train)
    
    y_train_pred = pipeline.predict(X_fold_train)
    y_val_pred = pipeline.predict(X_fold_val)
    
    train_f1 = f1_score(y_fold_train, y_train_pred, average='macro')
    val_f1 = f1_score(y_fold_val, y_val_pred, average='macro')
    
    train_fold_scores.append(train_f1)
    val_fold_scores.append(val_f1)

    print(f"Fold - F1-Score Macro (Train): {train_f1:.4f} | F1-Score Macro (Validation): {val_f1:.4f}")

average_train_f1 = np.mean(train_fold_scores)
average_val_f1 = np.mean(val_fold_scores)

print(f"\nAverage F1-Score Macro on Train: {average_train_f1:.4f}")
print(f"Average F1-Score Macro on Validation: {average_val_f1:.4f}")

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.

Fold - F1-Score Macro (Train): 0.5424 | F1-Score Macro (Validation): 0.4379
Fold - F1-Score Macro (Train): 0.5588 | F1-Score Macro (Validation): 0.4223
Fold - F1-Score Macro (Train): 0.5486 | F1-Score Macro (Validation): 0.4417
Fold - F1-Score Macro (Train): 0.5569 | F1-Score Macro (Validation): 0.4296
Fold - F1-Score Macro (Train): 0.5538 | F1-Score Macro (Validation): 0.4241

Average F1-Score Macro on Train: 0.5521
Average F1-Score Macro on Validation: 0.4311


We see that the validation and training scores don't vary much between folds, so we are keeping these results for now. Again, we could choose a model that performs worse on the validation data but doesn't overfit, but we preferred a better score on unseen data.
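
One way to put a number on "don't vary much" is the standard deviation of the fold scores. The sketch below uses only the five per-fold scores printed above; the variable names are chosen for this example, and the use of the sample standard deviation is an assumption, not something the original analysis computed.

```python
from statistics import mean, stdev

# Per-fold macro F1-scores copied from the cross-validation output.
reported_train_scores = [0.5424, 0.5588, 0.5486, 0.5569, 0.5538]
reported_val_scores = [0.4379, 0.4223, 0.4417, 0.4296, 0.4241]

train_mean, train_std = mean(reported_train_scores), stdev(reported_train_scores)
val_mean, val_std = mean(reported_val_scores), stdev(reported_val_scores)

print(f"Train: mean={train_mean:.4f}, std={train_std:.4f}")
print(f"Validation: mean={val_mean:.4f}, std={val_std:.4f}")
print(f"Validation spread relative to its mean: {val_std / val_mean * 100:.1f}%")
```

For both sets of scores the spread across folds stays below 0.01, small compared with the gap of about 0.12 between the training and validation means.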

In [None]:
# Training the model on the entire training dataset before predicting on the test data

pipeline.fit(X_train_sample_selected, y_train)

Parameters: { "scale_pos_weight", "use_label_encoder" } are not used.



In [37]:
X_test_selected = df_test[selected_features]

y_test_pred = pipeline.predict(X_test_selected)

class_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH"
}

df_submission = pd.DataFrame({
    'Claim Identifier': df_test.index,
    'Claim Injury Type': y_test_pred
})

df_submission['Claim Injury Type'] = df_submission['Claim Injury Type'].map(class_mapping)

df_submission.to_csv("Group43_Version09.csv", index=False)
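
Since pandas `Series.map` returns `NaN` for any value missing from the dictionary, it can be worth checking the predicted codes before writing the submission. The plain-Python sketch below shows one possible check; `decode_predictions` and the toy predictions are hypothetical and not part of the original notebook.

```python
class_mapping = {
    0: "1. CANCELLED",
    1: "2. NON-COMP",
    2: "3. MED ONLY",
    3: "4. TEMPORARY",
    4: "5. PPD SCH LOSS",
    5: "6. PPD NSL",
    6: "7. PTD",
    7: "8. DEATH",
}


def decode_predictions(encoded_predictions, mapping):
    """Translate encoded class ids to labels, failing loudly on ids the mapping does not cover."""
    unknown = sorted(set(encoded_predictions) - set(mapping))
    if unknown:
        raise ValueError(f"Predicted codes missing from mapping: {unknown}")
    return [mapping[code] for code in encoded_predictions]


# Toy predictions (hypothetical), standing in for y_test_pred.
toy_predictions = [1, 3, 3, 0, 7]
decoded = decode_predictions(toy_predictions, class_mapping)
print(decoded)
```

In the notebook the same idea could be applied by passing `y_test_pred.tolist()` to the helper, assuming the classifier's outputs are the integer codes 0 to 7 used during preprocessing.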