<div class="alert alert-block alert-info">

## <center> GROUP PROJECT - TO GRANT OR NOT TO GRANT: DECIDING ON COMPENSATION BENEFITS </center> <br>
#  <center> <b> K Nearest Neighbors Classifier </b> </center> <br>
## <center> Fall Semester 2024-2025 </center>
<br>
<center> Group 46: </center> <br>
<center>Afonso Ascensão, 20240684 <br></center>
<center>Duarte Marques, 20240522 <br></center>
<center>Joana Esteves, 20240746 <br></center>
<center>Rita Serra, 20240515 <br></center>
<center>Rodrigo Luís, 20240742 <br></center>

</div>

**Description of contents:**
- Apply the preprocessing pipeline to the data.
- Implement the K Nearest Neighbors algorithm and tune its hyperparameters with a grid search.
- Assess the model using cross-validation.
- Generate predictions for the test sample.

**Table of Contents**
- [1. Import Libraries](#section_1)
- [2. Import Dataset and Pipeline](#section_2)
- [3. Preprocessing](#section_3)
- [4. kNN Classifier](#section_4)


<a class="anchor" id="section_1">

# 1. Import Libraries

</a>

In [1]:

import pandas as pd
import numpy as np

# Preprocessing
## Pipeline
from sklearn.pipeline import Pipeline
from joblib import load
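# Custom transformer classes (assumed to come from the project's local
# transformers module, needed so joblib can load the saved pipeline)
from transformers import *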
## Target Encoding
from sklearn.preprocessing import LabelEncoder

# Model algorithm - kNN
from sklearn.neighbors import KNeighborsClassifier

# Evaluation metrics
from sklearn.metrics import classification_report, make_scorer, f1_score

# Cross validation, parameter tuning
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold

# Define a seed
random_state = 42
np.random.seed(random_state)

# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option("display.max_columns", None)

<a class="anchor" id="section_2">

# 2. Import Dataset and Pipeline

</a>

In [2]:
# Train and validation w/ split, separate X and y to apply preprocessing
transformed_train_split = pd.read_parquet("transformed_train_split.parquet")
transformed_val_split = pd.read_parquet("transformed_val_split.parquet")

# Test set with predicted agreement column, apply preprocessing 
test_transformed_agreement = pd.read_parquet("test_transformed_agreement.parquet")

# Dataset with no split for cross validation, apply pipeline inside cross validation
transformed_train_data = pd.read_parquet("transformed_train_data.parquet")

In [3]:
# Load pipeline
pipeline = load('pipeline.joblib') 

<a class="anchor" id="section_3">

# 3. Preprocessing

</a>

In [4]:
# Separate X and y for train after split
X_train = transformed_train_split.drop(['Claim Injury Type'], axis = 1)
y_train = transformed_train_split['Claim Injury Type']

# Separate X and y for validation after split
X_val = transformed_val_split.drop(['Claim Injury Type'], axis = 1)
y_val = transformed_val_split['Claim Injury Type']

# Separate X and y for dataset before split
X = transformed_train_data.drop(['Claim Injury Type'], axis = 1)
y = transformed_train_data['Claim Injury Type']

In [5]:
# Apply encoding of y for train and validation sets

# Initialize target encoder
label_encoder = LabelEncoder()

# Encode target
y_train_encoded = label_encoder.fit_transform(y_train)
y_val_encoded = label_encoder.transform(y_val)
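As a minimal sketch of what the target encoding does, assuming toy labels (the strings below are illustrative placeholders, not values from the dataset): `LabelEncoder` learns the mapping on the train labels, reuses it on the validation labels, and `inverse_transform` recovers the original labels, which is what the submission step relies on.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
toy_train = ["B", "A", "C", "A"]
toy_val = ["C", "A"]

enc = LabelEncoder()
toy_train_encoded = enc.fit_transform(toy_train)  # learn the mapping on train
toy_val_encoded = enc.transform(toy_val)          # reuse the same mapping

print(enc.classes_)                            # sorted unique labels
print(toy_train_encoded)                       # integer codes for train
print(enc.inverse_transform(toy_val_encoded))  # back to the original labels
```

Classes are numbered in sorted order, so the integer codes in the classification reports follow the alphabetical order of the original labels.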

In [6]:
# Apply preprocessing pipeline to the train, validation and test sets
X_train_preprocessed = pipeline.fit_transform(X_train, y_train_encoded)
X_val_preprocessed = pipeline.transform(X_val)
test_data_preprocessed = pipeline.transform(test_transformed_agreement)

In [7]:
print("Selected features:", X_train_preprocessed.columns.values)

Selected features: ['Attorney/Representative' 'Average Weekly Wage Log' 'C-2 Delivered'
 'Industry Code' 'Time Assembly to Hearing' 'Hearing Held'
 'Agreement Reached' 'C-3 Delivered on Time' 'Part of Body Group_Trunk'
 'Part of Body Group_Lower Extremities' 'IME-4 Count Log'
 'District Name_NYC' 'Part of Body Group_Upper Extremities' 'Gender'
 'Carrier Type_2A. SIF' 'Cause of Injury Group_X' 'Assembly Year'
 'Cause of Injury Group_VI']


<a class="anchor" id="section_4">

# 4. kNN Classifier

</a>

**Variables for model:**
- X_train_preprocessed;
- y_train_encoded;
- X_val_preprocessed;
- y_val_encoded;
- test_data_preprocessed.

**Variables for CV:**
- X: no preprocessing and no split;
- y: no preprocessing and no split;
- Apply pipeline inside cv.
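The idea of applying the pipeline inside cross-validation can be sketched as follows. This is a hedged illustration on synthetic data: `StandardScaler` stands in for the project's preprocessing steps, and all `toy_*` names are hypothetical. Because the preprocessing is a pipeline step, it is refit on the training part of each fold only, so nothing from the held-out fold leaks into it.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the project dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

# StandardScaler is a placeholder for the project's preprocessing steps
toy_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", KNeighborsClassifier(n_neighbors=10)),
])

# Each fold clones the pipeline and fits scaler + model on that fold's train part
toy_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
toy_scores = cross_val_score(
    toy_pipeline, X_toy, y_toy, cv=toy_cv, scoring="f1_macro"
)
print(toy_scores)
```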


In [10]:
# GridSearchCV

# Append the model to the imported pipeline
full_pipeline = Pipeline(
    pipeline.steps + [('model', KNeighborsClassifier(algorithm="kd_tree"))]
)

# Parameter grid
param_grid = {
    'model__n_neighbors': [10, 13, 15],
    'model__weights':["distance", "uniform"]
}

# GridSearchCV
grid_search = GridSearchCV(
    estimator=full_pipeline,
    param_grid=param_grid,
    scoring='f1_macro',
    cv=3,
    n_jobs=-1
)

# Encode target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Fit GridSearchCV
grid_search.fit(X, y_encoded)

# Display best parameters and best cv score 
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_) 

Best parameters: {'model__n_neighbors': 10, 'model__weights': 'uniform'}
Best score: 0.3278968095380786
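Beyond the single best combination, the full grid can be inspected through `cv_results_`. The sketch below runs the same parameter grid on synthetic data without the preprocessing pipeline (an assumption made to keep it self-contained; `toy_search` and `results` are hypothetical names), and ranks every combination by its mean F1-macro.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data (assumption: not the project dataset)
X_toy, y_toy = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

toy_search = GridSearchCV(
    estimator=KNeighborsClassifier(algorithm="kd_tree"),
    param_grid={"n_neighbors": [10, 13, 15], "weights": ["distance", "uniform"]},
    scoring="f1_macro",
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(toy_search.cv_results_)
cols = ["param_n_neighbors", "param_weights",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
```

Comparing `mean_test_score` against `std_test_score` shows whether the winning combination is clearly better or within fold-to-fold noise of the others.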


**Model with selected hyperparameters:**

In [8]:
knn = KNeighborsClassifier(
    n_neighbors=10,
    algorithm="kd_tree", 
    weights="uniform"
)

knn.fit(X_train_preprocessed, y_train_encoded)

# Predictions and evaluation
y_pred_train = knn.predict(X_train_preprocessed)
y_pred_val = knn.predict(X_val_preprocessed)

# First report: train set; second report: validation set
print("\nClassification Report:\n", classification_report(y_train_encoded, y_pred_train))
print("\nClassification Report:\n", classification_report(y_val_encoded, y_pred_val))




Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.42      0.51     11229
           1       0.80      0.96      0.87    261970
           2       0.46      0.15      0.23     62016
           3       0.75      0.75      0.75    133656
           4       0.67      0.61      0.64     43452
           5       0.51      0.02      0.05      3790
           6       0.00      0.00      0.00        87
           7       0.58      0.15      0.24       423

    accuracy                           0.76    516623
   macro avg       0.55      0.38      0.41    516623
weighted avg       0.73      0.76      0.73    516623


Classification Report:
               precision    recall  f1-score   support

           0       0.61      0.41      0.49      1248
           1       0.78      0.95      0.86     29108
           2       0.29      0.09      0.14      6890
           3       0.70      0.69      0.70     14851
           4       0.59     



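- The scores are broadly similar for train and validation, although class 2 drops from an F1 of 0.23 on train to 0.14 on validation.
- The model rarely or never predicts classes 6 and 7 (on train, class 6 has an F1 of 0.00 and class 7 a recall of 0.15), most likely because these classes have too few observations.
- The minority classes have very low F1 scores.
- The F1-macro scores are low for both train and validation, which indicates the model may be underfitting.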
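The gap between the macro and weighted averages in the training report can be reproduced by hand. The sketch below copies the per-class F1 scores and supports from that report: the macro average treats every class equally, so the weak minority classes pull it down, while the weighted average is dominated by the large classes.

```python
import numpy as np

# Per-class F1 and support copied from the training classification report
f1_per_class = np.array([0.51, 0.87, 0.23, 0.75, 0.64, 0.05, 0.00, 0.24])
support = np.array([11229, 261970, 62016, 133656, 43452, 3790, 87, 423])

macro_f1 = f1_per_class.mean()                           # unweighted mean
weighted_f1 = np.average(f1_per_class, weights=support)  # support-weighted mean

print(f"macro F1:    {macro_f1:.2f}")     # 0.41, as in the report
print(f"weighted F1: {weighted_f1:.2f}")  # 0.73, as in the report
```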

**Cross-validation w/ 3 splits for final assessment of the model with selected hyperparameters:**

In [9]:
# Append the model to the imported pipeline
full_pipeline = Pipeline(
    pipeline.steps + [('model', knn)]
)

# Cross-validation
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_state)

# Encode target
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Use F1-macro because of class imbalance
scorer = make_scorer(f1_score, average='macro')

# Get scores for cross-validation
# Apply preprocessing inside cv for X and y_encoded
cv_scores = cross_val_score(full_pipeline, X, y_encoded, cv=cv, scoring=scorer)

# Print results
print("Cross-validation scores (F1-macro):", cv_scores)
print("Mean CV score:", cv_scores.mean())

Cross-validation scores (F1-macro): [0.36470054 0.36521196 0.35786806]
Mean CV score: 0.36259351931565237


- Mean CV F1-macro: 0.3626.
- This value will be compared with the other models' mean CV scores to select the best model.
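When comparing models on their mean CV score, the spread across folds is also worth reporting. A minimal sketch, using the three fold scores printed above (`cv_scores_reported` is a hypothetical name for that copied array):

```python
import numpy as np

# Fold scores copied from the cross-validation output above
cv_scores_reported = np.array([0.36470054, 0.36521196, 0.35786806])

mean_score = cv_scores_reported.mean()
std_score = cv_scores_reported.std()  # population standard deviation across folds

print(f"F1-macro: {mean_score:.4f} +/- {std_score:.4f}")
```

A difference between two models that is smaller than this fold-to-fold spread is weak evidence that one is actually better.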

**Predictions for the test set using all the initial train data:**

In [10]:
# Encode target
y_encoded = label_encoder.fit_transform(y)

# Fit pipeline on X and apply transformations
X_preprocessed = pipeline.fit_transform(X, y_encoded)

# Apply pipeline fitted on X
test_data_preprocessed_X = pipeline.transform(test_transformed_agreement)

# Fit on X 
knn.fit(X_preprocessed,y_encoded)

# Make predictions on test data
y_pred = knn.predict(test_data_preprocessed_X)

# Get original y for submission
y_pred_categorical = label_encoder.inverse_transform(y_pred) 

# Use the index of the test data transformed by the pipeline fitted on X
submission = pd.DataFrame({
    "Claim Identifier": test_data_preprocessed_X.index,
    "Claim Injury Type": y_pred_categorical
})


# Save to CSV to upload on kaggle
submission.to_csv("Group46_versionX.csv", index=False)

- Kaggle F1-macro score on the test set: 0.32389.