# Diplomatura de Datos: Kaggle Competition

## Model - SVC

We need to beat the *baseline* set in the **Kaggle** competition.

In this *notebook*, we will use **svm.SVC**, the support vector classifier from scikit-learn.

In [1]:
# Import the required packages
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Reading the Data

In [2]:
train_df = pd.read_csv('DataSet/travel_insurance_prediction_train.csv')
test_df = pd.read_csv('DataSet/travel_insurance_prediction_test.csv')

#### Data Preprocessing

In [3]:
# Data Transformation
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

bin_cols = ['Age', 'AnnualIncome']
hot_cols = ['Employment Type', 'GraduateOrNot', 'FamilyMembers', 'FrequentFlyer', 'EverTravelledAbroad']

transformer = make_column_transformer(
    # Discretize the @bin_cols columns into 5 quantile-based intervals (ordinal codes 0-4).
    (KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile'), bin_cols),
    # One-hot encode the @hot_cols columns; categories unseen during fit become all zeros.
    (OneHotEncoder(categories='auto', dtype='int', handle_unknown='ignore'), hot_cols),
    # Pass the remaining columns through unchanged.
    remainder='passthrough')

In [4]:
# Training Data
X_train = transformer.fit_transform(train_df.drop(columns=['Customer', 'TravelInsurance']))
y_train = train_df['TravelInsurance'].values

# Testing Data
X_test = transformer.transform(test_df.drop(columns=['Customer']))
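The transformer is fitted on the training data only and then reused on the test data, so a category that appears only in the test set has no column of its own. The sketch below illustrates how `handle_unknown='ignore'` treats that case. The toy values `'Yes'`, `'No'` and `'Maybe'` are made up for illustration and are not taken from the competition data.

```python
from sklearn.preprocessing import OneHotEncoder

# Same encoder settings as in the transformer above.
encoder = OneHotEncoder(categories='auto', dtype='int', handle_unknown='ignore')

# Fit on two known categories (toy values, assumed for illustration).
encoder.fit([['Yes'], ['No']])

# 'Maybe' was never seen during fit, so its row is encoded as all zeros.
encoded = encoder.transform([['Yes'], ['Maybe']]).toarray()

print(encoder.categories_[0])
print(encoded)
```

With the default `handle_unknown='error'`, the same `transform` call would raise an error instead.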

#### Model Definition (Default)

In [10]:
# Default Model
import sklearn.svm

clf = sklearn.svm.SVC(random_state=123)
clf.fit(X_train, y_train)

SVC(random_state=123)

In [11]:
# Evaluation on the training data (not a held-out set)
from sklearn.metrics import classification_report

predictions = clf.predict(X_train)
print(classification_report(y_train, predictions))

              precision    recall  f1-score   support

           0       0.79      0.96      0.87       958
           1       0.87      0.55      0.67       532

    accuracy                           0.81      1490
   macro avg       0.83      0.75      0.77      1490
weighted avg       0.82      0.81      0.80      1490



This yields an `F1-Score` of **0.67** for the positive class, measured on the training data.
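As a quick check, the F1 score is the harmonic mean of precision and recall, so it can be recomputed from the class-1 row of the report above. The helper name `f1_from` is hypothetical, and the inputs are the rounded values printed in the report, so this is only an approximate reconstruction.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class-1 precision and recall as printed (rounded) in the report above.
print(round(f1_from(0.87, 0.55), 2))  # → 0.67
```

The low recall (0.55) is what pulls the score down, even though precision is high.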

#### Model Definition (Parameter Search)

In [12]:
# Parameter Search
from sklearn.model_selection import GridSearchCV

search_params = {
    'kernel': ['linear','poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'C': [0.05,0.1,0.5,1.]
}

svm = sklearn.svm.SVC()
svm_clf = GridSearchCV(svm, search_params, cv=5, scoring='f1', n_jobs=-1)
svm_clf.fit(X_train, y_train)

best_svm_clf = svm_clf.best_estimator_
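To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below copies the grid into a separate variable (`grid`, a name chosen here for illustration) and counts the candidates and the number of model fits with `cv=5`. It also counts how many candidates are actually distinct, given that `gamma` is not used by the `linear` kernel.

```python
from itertools import product

# Copy of the search grid defined in the cell above.
grid = {
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
    'gamma': ['scale', 'auto'],
    'C': [0.05, 0.1, 0.5, 1.0],
}

# Every combination of the listed values is one candidate.
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
n_candidates = len(candidates)

# Each candidate is fitted once per fold (the final refit is not counted here).
n_fits = n_candidates * 5

# 'gamma' has no effect on the linear kernel, so those candidates repeat.
distinct = {
    (p['kernel'], p['C'], None if p['kernel'] == 'linear' else p['gamma'])
    for p in candidates
}
n_distinct = len(distinct)

print(n_candidates, n_fits, n_distinct)  # → 32 160 28
```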

In [13]:
# Evaluation on the training data (not a held-out set)
predictions = best_svm_clf.predict(X_train)
print(classification_report(y_train, predictions))

              precision    recall  f1-score   support

           0       0.81      0.96      0.88       958
           1       0.88      0.60      0.71       532

    accuracy                           0.83      1490
   macro avg       0.85      0.78      0.80      1490
weighted avg       0.84      0.83      0.82      1490



This yields an `F1-Score` of **0.71** for the positive class, measured on the training data (0.83 is the overall accuracy).
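The report above is computed on the same rows the model was trained on, so it is likely an optimistic estimate. `GridSearchCV` also keeps the mean cross-validated score of the best candidate in `best_score_`, which is usually closer to what a held-out set would show. Below is a minimal sketch on synthetic data. The data from `make_classification` and the reduced grid are assumptions for illustration, not the competition setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the competition data (assumption, illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Reduced grid so the sketch runs quickly.
demo_search = GridSearchCV(
    SVC(),
    {'kernel': ['linear', 'rbf'], 'C': [0.1, 1.0]},
    cv=5,
    scoring='f1',
)
demo_search.fit(X_demo, y_demo)

# Best parameter combination and its mean F1 over the 5 validation folds.
print(demo_search.best_params_)
print(round(demo_search.best_score_, 2))
```

In this notebook, the equivalent values would be `svm_clf.best_params_` and `svm_clf.best_score_`.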

#### Generating the Submission

In [14]:
test_id = test_df['Customer']
test_pred = best_svm_clf.predict(X_test)

submission = pd.DataFrame({'Customer': test_id, 'TravelInsurance': test_pred})
submission.to_csv('DataSet/travel_insurance_submission_svm_GridSearch.csv', header=True, index=False)