# Diplomatura de Datos: Kaggle Competition

## Model - Random Forest

We need to beat the *baseline* set in the **Kaggle** competition.

In this *notebook*, we will use **Random Forest**.

In [1]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Reading the Data

In [2]:
train_df = pd.read_csv('DataSet/travel_insurance_prediction_train.csv')
test_df = pd.read_csv('DataSet/travel_insurance_prediction_test.csv')

#### Data Preprocessing

In [3]:
# Data Transformation
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

bin_cols = ['Age', 'AnnualIncome']
hot_cols = ['Employment Type', 'GraduateOrNot', 'FamilyMembers', 'FrequentFlyer', 'EverTravelledAbroad']

transformer = make_column_transformer(
    # Bin the @bin_cols columns into 5 quantile-based intervals (ordinal codes).
    (KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), bin_cols),
    # One-hot encode the @hot_cols columns; categories unseen during fit become all zeros.
    (OneHotEncoder(categories='auto', dtype='int', handle_unknown='ignore'), hot_cols),
    # Pass the remaining columns through unchanged.
    remainder='passthrough')

In [4]:
# Training Data
X_train = transformer.fit_transform(train_df.drop(columns=['Customer', 'TravelInsurance']))
y_train = train_df['TravelInsurance'].values

# Testing Data
X_test = transformer.transform(test_df.drop(columns=['Customer']))
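
To make the effect of the transformer more concrete, the following is a minimal sketch on a made-up DataFrame. The column `OtherFlag`, the toy values, the smaller `n_bins=3`, and `sparse_threshold=0` (used here only to force a dense array) are assumptions for illustration and are not part of the competition pipeline.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical toy data; the values are invented for illustration only.
toy_df = pd.DataFrame({
    'Age': [25, 28, 31, 34, 35, 29],
    'FrequentFlyer': ['No', 'Yes', 'No', 'No', 'Yes', 'No'],
    'OtherFlag': [0, 1, 0, 0, 1, 0],
})

toy_transformer = make_column_transformer(
    # Replace each age by the index of its quantile bin.
    (KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='quantile'), ['Age']),
    # Expand the categorical column into one 0/1 column per category.
    (OneHotEncoder(dtype='int', handle_unknown='ignore'), ['FrequentFlyer']),
    # Keep 'OtherFlag' as it is.
    remainder='passthrough',
    # Assumption for readability: always return a dense array.
    sparse_threshold=0)

toy_X = toy_transformer.fit_transform(toy_df)
print(toy_X.shape)  # 1 binned column + 2 one-hot columns + 1 passthrough column

# A category that was not seen during fit is encoded as all zeros.
new_row = pd.DataFrame({'Age': [30], 'FrequentFlyer': ['Maybe'], 'OtherFlag': [1]})
new_X = toy_transformer.transform(new_row)
print(new_X)
```

The output columns follow the order in which the transformers are declared, with the passthrough columns last, which is why `X_train` and `X_test` above share the same layout.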

#### Model Definition (Default)

In [5]:
# Default Model
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(random_state=123)
clf.fit(X_train, y_train);

In [6]:
# Verification on the training data
from sklearn.metrics import classification_report

predictions = clf.predict(X_train)
print(classification_report(y_train, predictions))

              precision    recall  f1-score   support

           0       0.88      0.97      0.92       958
           1       0.93      0.76      0.83       532

    accuracy                           0.89      1490
   macro avg       0.90      0.86      0.88      1490
weighted avg       0.90      0.89      0.89      1490



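This yields an `F1-Score` of **0.83** for the positive class. Note that the report is computed on the training data, so it likely overestimates performance on unseen data.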
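
Because the report above scores the model on the same rows it was fitted on, a cross-validated estimate may give a more realistic picture. The following is a minimal sketch using `cross_val_score`; the data comes from `make_classification` as a synthetic stand-in for `X_train` and `y_train`, so the numbers it prints say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for (X_train, y_train); not the competition data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.65, 0.35], random_state=123)

demo_clf = RandomForestClassifier(random_state=123)

# F1 of the positive class on each of 5 held-out folds.
cv_f1 = cross_val_score(demo_clf, X_demo, y_demo, cv=5, scoring='f1')

# F1 on the training rows themselves, for comparison.
demo_clf.fit(X_demo, y_demo)
train_f1 = f1_score(y_demo, demo_clf.predict(X_demo))

print('cross-validated F1:', cv_f1.mean())
print('training F1:', train_f1)
```

In the notebook, the same call could be made with `clf`, `X_train`, and `y_train` in place of the demo objects.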

#### Model Definition (Parameter Search)

In [7]:
# Parameter Search
from sklearn.model_selection import GridSearchCV

search_params = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 5, 10],
    'min_samples_leaf': [1, 2, 3],
    'n_estimators': [100, 50, 25, 10, 5],
}

forest = RandomForestClassifier(random_state=123)
forest_clf = GridSearchCV(forest, search_params, cv=5, scoring='f1', n_jobs=-1)
forest_clf.fit(X_train, y_train)

best_forest_clf = forest_clf.best_estimator_
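
After a grid search, it can help to look at which combination won and at its cross-validated score, not only at the refitted estimator. The sketch below does this on synthetic data with a deliberately small grid; `small_grid`, `demo_search`, and the data are assumptions for illustration, while `best_params_`, `best_score_`, and `cv_results_` are standard `GridSearchCV` attributes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (X_train, y_train); not the competition data.
X_grid, y_grid = make_classification(n_samples=200, n_features=8, random_state=123)

# A reduced grid so the example runs quickly.
small_grid = {'max_depth': [None, 5], 'n_estimators': [10, 25]}

demo_search = GridSearchCV(
    RandomForestClassifier(random_state=123), small_grid, cv=3, scoring='f1')
demo_search.fit(X_grid, y_grid)

# Winning parameter combination and its mean cross-validated F1.
print(demo_search.best_params_)
print(demo_search.best_score_)

# One row per combination, ordered from best to worst.
results = pd.DataFrame(demo_search.cv_results_)[
    ['params', 'mean_test_score', 'rank_test_score']]
print(results.sort_values('rank_test_score'))
```

In the notebook, the equivalent attributes would be `forest_clf.best_params_` and `forest_clf.best_score_`.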

In [8]:
# Verification on the training data
predictions = best_forest_clf.predict(X_train)
print(classification_report(y_train, predictions))

              precision    recall  f1-score   support

           0       0.84      0.96      0.89       958
           1       0.89      0.67      0.76       532

    accuracy                           0.85      1490
   macro avg       0.87      0.81      0.83      1490
weighted avg       0.86      0.85      0.85      1490



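This yields an `F1-Score` of **0.76** for the positive class on the training data. The grid search selects parameters by cross-validated F1, so a training score below the default model's 0.83 does not by itself mean that the tuned model generalizes worse.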

#### Generating the Submission

In [9]:
test_id = test_df['Customer']
# Note: predictions come from the default model (clf), not the tuned best_forest_clf.
test_pred = clf.predict(X_test)

# One row per test customer, with the predicted label.
submission = pd.DataFrame({'Customer': test_id, 'TravelInsurance': test_pred})
submission.to_csv('DataSet/travel_insurance_submission_RandomForest.csv', header=True, index=False)
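
Before uploading, a quick sanity check of the submission layout can catch simple mistakes. The sketch below builds a tiny submission from invented ids and predictions, writes it to an in-memory buffer instead of a file, and reads it back; the checks it performs (column names, one row per id, binary labels) are assumptions about what a valid file looks like, not rules stated by the competition.

```python
import io

import pandas as pd

# Hypothetical ids and predictions, invented for illustration.
demo_ids = pd.Series([101, 102, 103], name='Customer')
demo_pred = [0, 1, 0]

demo_submission = pd.DataFrame({'Customer': demo_ids, 'TravelInsurance': demo_pred})

# Write to an in-memory buffer and read it back, as a stand-in for the CSV file.
buffer = io.StringIO()
demo_submission.to_csv(buffer, header=True, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

# Basic layout checks.
columns_ok = list(round_trip.columns) == ['Customer', 'TravelInsurance']
rows_ok = len(round_trip) == len(demo_ids) and round_trip['Customer'].is_unique
labels_ok = round_trip['TravelInsurance'].isin([0, 1]).all()
print(columns_ok, rows_ok, labels_ok)
```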