# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">Porto Seguro’s Safe Driver Prediction - Kaggle</p>

## Predict if a driver will file an insurance claim next year

<img src='./img/safe-driving-730x432.jpeg' style='height:400px; border-radius: 5px;'/>

---
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">Contents:</p>

* [1. Introduction](#1)
    * [1.1 Data Description](#1.1)
    * [1.2 Libraries](#1.2)
    * [1.3 Loading Dataset](#1.3)
* [2. Preprocessing & Feature Engineering](#2)
    * [2.1 Metadata](#2.1)
    * [2.2 Feature scaling using MinMaxScaler](#2.2)
    * [2.3 Categorical encoding using One-Hot-Encoder](#2.3)
* [3. Models](#3)
    * [3.1 Lightgbm](#3.1)
    * [3.2 Neural Networks](#3.2)
* [4. Evaluation](#4)
* [5. Kaggle Submission](#5)

<a id="1"></a>
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">1- Introduction</p>

<img title="Porto Seguro" src="img/porto-seguro-logo-1-3.png" alt="Porto Seguro" style='height:80px; padding: 10px; padding-right: 15px;' align="center"> 

[Porto Seguro](https://www.portoseguro.com.br/en/institutional) is one of the largest insurance companies in Brazil, specializing in car and home insurance. Located in São Paulo, Porto Seguro has been one of the leading insurers in Brazil since its foundation in 1945.

A key challenge faced by all major car insurers is fairness towards good drivers: how to avoid penalizing those with a good driving history on account of a few bad drivers. Inaccurate claim predictions usually raise the cost of insurance for good drivers and reduce the price for bad ones.

Porto Seguro has been applying Machine Learning for more than 20 years and intends to make car insurance more accessible to everyone. With that in mind, the company created an online competition to help it explore new and more powerful ML methods.

<img title="Kaggle" src="img/Kaggle_logo.png" alt="Kaggle" style='height:80px; padding: 10px; padding-right: 15px' align="center">

[Kaggle](https://www.kaggle.com/) is an online community of data scientists that allows users to find and publish data sets, explore and build ML models, and enter competitions to solve data science challenges.

In this [competition](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/), the challenge is to build a model that predicts the probability that a car insurance policy holder will file a claim next year.

<a id="1.1"></a>
## Data Description

In the train and test data:

- Features that belong to similar groupings are tagged as such in the feature names (e.g., `ind`, `reg`, `car`, `calc`). 
- Feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features.
- Features __without__ these designations are either __continuous or ordinal__.
- Values of `-1` indicate that the feature was missing from the observation. 
- The `target` column signifies whether or not a claim was filed for that policy holder.

- `ind` - individuals
- `reg` - regions
- `car` - cars
- `calc` - calculated features

- `_bin` - binary 
- `_cat` - categorical
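
The naming convention above can be turned into a small helper. The sketch below is illustrative only: `describe_feature` is a hypothetical function (it is not part of this notebook) and it assumes column names of the form `ps_<group>_<number>` with an optional `_bin` or `_cat` postfix.

```python
def describe_feature(name):
    """Split a column name into (group, kind).

    Hypothetical helper for illustration; assumes names look like
    'ps_<group>_<number>[_bin|_cat]'.
    """
    parts = name.split("_")
    group = parts[1]                      # ind, reg, car or calc
    if parts[-1] == "bin":
        kind = "binary"
    elif parts[-1] == "cat":
        kind = "categorical"
    else:
        kind = "continuous or ordinal"    # no postfix
    return group, kind


for col in ["ps_ind_10_bin", "ps_car_01_cat", "ps_reg_03", "ps_calc_10"]:
    print(col, "->", describe_feature(col))
```

The same split by postfix is what the metadata table built later in the notebook relies on.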

<a id="1.2"></a>
## Libraries

In [73]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import seaborn as sns
import random

%matplotlib inline
matplotlib.style.use("ggplot")

import plotly
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

import warnings
warnings.simplefilter('ignore', FutureWarning)

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from keras.utils import to_categorical

In [74]:
from platform import python_version
print('Python version:', python_version())

Python version: 3.8.5


In [75]:
# !pip install -q -U watermark
%reload_ext watermark
%watermark -a "Leandro Pessini" --iversions

Author: Leandro Pessini

matplotlib: 3.3.2
seaborn   : 0.11.1
plotly    : 4.14.3
pandas    : 1.1.3
numpy     : 1.19.2



<a id="1.3"></a>
## Loading Dataset

In [4]:
# Kaggle
#kaggle_path = "../input/porto-seguro-safe-driver-prediction/"

In [5]:
# Local
local_path = "./data/"

In [6]:
# Loading training dataset and converting missing data value to NaN
#train_df = pd.read_csv('../input/porto-seguro-safe-driver-prediction/train.csv', low_memory=False, na_values= '-1')

In [41]:
train_df = pd.read_csv(local_path + "train.csv").set_index('id')
test_df = pd.read_csv(local_path + "test.csv").set_index('id')

In [42]:
print('Train Dataset - Number of rows are',train_df.shape[0], 'and number of columns are ',train_df.shape[1])
print('Test Dataset - Number of rows are',test_df.shape[0], 'and number of columns are ',test_df.shape[1])

Train Dataset - Number of rows are 595212 and number of columns are  58
Test Dataset - Number of rows are 892816 and number of columns are  57


<a id="2"></a>
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">2- Preprocessing & Feature Engineering</p>

<a id="2.1"></a>
## Metadata

To make data management easier, metadata about the variables is stored in a separate DataFrame. It will help with handling those variables later in the analysis, data visualization and modeling.

- __role__: input, ID, target
- __level__: categorical, numerical, binary
- __keep__: True or False
- __dtype__: int, float, str

We have no information on which features are ordinal, so every feature that is neither binary nor categorical is given the level __numerical__ so that __normalization__ can be applied later.

In [43]:
data = []
for f in train_df.columns:
    # Defining the role
    if f == 'target':
        role = 'target'
    else:
        role = 'input'
         
    # Defining the level
    if 'bin' in f or f == 'target':
        level = 'binary'
        train_df[f] = train_df[f].astype('int8')
    elif 'cat' in f:
        level = 'categorical'
        train_df[f] = train_df[f].astype('int32')
    elif train_df[f].dtype == float:
        level = 'numerical'
        train_df[f] = train_df[f].astype('float32')
    elif train_df[f].dtype == int:
        level = 'numerical'
        train_df[f] = train_df[f].astype('int32')
        
    # Initialize keep to True for all variables
    keep = True
    
    # Defining the data type 
    dtype = train_df[f].dtype
    
    # Creating a Dict that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    
    data.append(f_dict)
    
meta = pd.DataFrame(data, columns=['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace=True)

In [44]:
# Example to extract all categorical variables that are not dropped
meta[(meta.level == 'categorical') & (meta.keep)].index

Index(['ps_ind_02_cat', 'ps_ind_04_cat', 'ps_ind_05_cat', 'ps_car_01_cat',
       'ps_car_02_cat', 'ps_car_03_cat', 'ps_car_04_cat', 'ps_car_05_cat',
       'ps_car_06_cat', 'ps_car_07_cat', 'ps_car_08_cat', 'ps_car_09_cat',
       'ps_car_10_cat', 'ps_car_11_cat'],
      dtype='object', name='varname')

### Number of variables per role and level 

In [45]:
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()

| | role | level | count |
|---|---|---|---|
| 0 | input | binary | 17 |
| 1 | input | categorical | 14 |
| 2 | input | numerical | 26 |
| 3 | target | binary | 1 |


---
### Numerical variables

In [46]:
numerical_features = meta[(meta.level == 'numerical') & (meta.keep)].index
train_df[numerical_features].describe()

| | ps_ind_01 | ps_ind_03 | ps_ind_14 | ps_ind_15 | ps_reg_01 | ps_reg_02 | ps_reg_03 | ps_car_11 | ps_car_12 | ps_car_13 | ... | ps_calc_05 | ps_calc_06 | ps_calc_07 | ps_calc_08 | ps_calc_09 | ps_calc_10 | ps_calc_11 | ps_calc_12 | ps_calc_13 | ps_calc_14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | ... | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 | 595212.0 |
| mean | 1.900378 | 4.423318 | 0.012451 | 7.299922 | 0.610991 | 0.439184 | 0.551102 | 2.346072 | 0.379945 | 0.813265 | ... | 1.885886 | 7.689445 | 3.005823 | 9.225904 | 2.339034 | 8.43359 | 5.441382 | 1.441918 | 2.872288 | 7.539026 |
| std | 1.983789 | 2.699902 | 0.127545 | 3.546042 | 0.287643 | 0.404264 | 0.793506 | 0.832548 | 0.058327 | 0.224588 | ... | 1.134927 | 1.334312 | 1.414564 | 1.459672 | 1.246949 | 2.904597 | 2.332871 | 1.202963 | 1.694887 | 2.746652 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -1.0 | -1.0 | -1.0 | 0.250619 | ... | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 0.0 | 5.0 | 0.4 | 0.2 | 0.525 | 2.0 | 0.316228 | 0.670867 | ... | 1.0 | 7.0 | 2.0 | 8.0 | 1.0 | 6.0 | 4.0 | 1.0 | 2.0 | 6.0 |
| 50% | 1.0 | 4.0 | 0.0 | 7.0 | 0.7 | 0.3 | 0.720677 | 3.0 | 0.374166 | 0.765811 | ... | 2.0 | 8.0 | 3.0 | 9.0 | 2.0 | 8.0 | 5.0 | 1.0 | 3.0 | 7.0 |
| 75% | 3.0 | 6.0 | 0.0 | 10.0 | 0.9 | 0.6 | 1.0 | 3.0 | 0.4 | 0.90619 | ... | 3.0 | 9.0 | 4.0 | 10.0 | 3.0 | 10.0 | 7.0 | 2.0 | 4.0 | 9.0 |
| max | 7.0 | 11.0 | 4.0 | 13.0 | 0.9 | 1.8 | 4.037945 | 3.0 | 1.264911 | 3.720626 | ... | 6.0 | 10.0 | 9.0 | 12.0 | 7.0 | 25.0 | 19.0 | 10.0 | 13.0 | 23.0 |


<a id="2.2"></a>
## Feature scaling using MinMaxScaler

In [47]:
# from sklearn.preprocessing import MinMaxScaler

features_n = numerical_features.to_list()
train_normalized = train_df.copy()

features = train_normalized[features_n]

scaler = MinMaxScaler().fit(features.values)
features = scaler.transform(features.values)

train_normalized[features_n] = features
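
For reference, min-max scaling maps each column to the range [0, 1] with `(x - min) / (max - min)`. The toy values below are made up for illustration. They only show that the `-1` missing-value code becomes the column minimum, so it is mapped to 0 unless missing values are handled first.

```python
import numpy as np

# Toy column with one missing value coded as -1 (values are illustrative)
x = np.array([-1.0, 0.5, 1.0, 2.0])

# Min-max scaling: (x - min) / (max - min)
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # the -1 entry is mapped to 0.0 and the maximum to 1.0
```

Whether that is acceptable depends on how the missing values are treated later; this sketch does not settle that question.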

<a id="2.3"></a>
## Categorical encoding using One-Hot-Encoder

### One-hot encoding with Keras

The categorical features in this dataset are already integer encoded.

These integers could be used directly after some scaling, or they can be one-hot encoded. One-hot encoding is the safer choice when the integers have no real ordinal relationship and are merely placeholders for labels.

The Keras library offers the `to_categorical()` function, which one-hot encodes integer data.
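
To illustrate what one-hot encoding of integer codes produces, here is a minimal NumPy sketch. It is not the Keras implementation; the toy codes and the `+ 1` shift (so that the `-1` missing code gets its own column instead of acting as a negative index) are assumptions made for this example.

```python
import numpy as np

# Toy integer-encoded categorical column; -1 marks a missing value
codes = np.array([0, 2, -1, 1, 2])

# Shift so that the smallest code (-1) becomes index 0
shifted = codes + 1
n_classes = shifted.max() + 1

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes, dtype=int)[shifted]

print(one_hot.shape)  # (5, 4): five rows, four classes (missing, 0, 1, 2)
```

Each row contains exactly one 1, and the column position carries no ordering information.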

In [48]:
# from sklearn.preprocessing import OneHotEncoder



categorical_features = meta[(meta.level == 'categorical') & (meta.keep)].index # Categorical features
name_features_c = categorical_features.to_list()

train_transformed = train_normalized.copy()

features_c = train_transformed[name_features_c]

In [76]:
# from keras.utils import to_categorical

encoded = to_categorical(features_c)



In [None]:
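# Concatenate with the original data and drop the raw categorical columns
# (ohe_df must hold the one-hot encoded columns as a DataFrame; it is not defined above)
train_transformed = pd.concat([train_transformed, ohe_df], axis=1).drop(name_features_c, axis=1)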

__Other Options__: If you are doing hyperparameter tuning with [GridSearch](https://scikit-learn.org/stable/modules/grid_search.html) it's recommended to use [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) and [FeatureUnion](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html) with [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) or directly [make_column_transformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.make_column_transformer.html)
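
As a rough sketch of that option, the example below combines scaling and one-hot encoding in a `ColumnTransformer` inside a `Pipeline`. The toy data is made up, and `LogisticRegression` is only a stand-in estimator chosen for this illustration; it is not the model used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy frame with one numerical and one categorical column (illustrative values)
toy = pd.DataFrame({
    "ps_reg_01": [0.1, 0.4, 0.9, 0.7, 0.2, 0.5],
    "ps_car_01_cat": [0, 1, 2, 1, 0, 2],
})
y_toy = np.array([0, 0, 1, 1, 0, 1])

# Each transformer is applied only to the columns listed for it
preprocess = ColumnTransformer([
    ("num", MinMaxScaler(), ["ps_reg_01"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["ps_car_01_cat"]),
])

# Preprocessing and model are fitted together, so a grid search can
# refit the preprocessing inside every cross-validation fold
pipe = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
pipe.fit(toy, y_toy)

proba = pipe.predict_proba(toy)[:, 1]
print(proba.shape)
```

Keeping the preprocessing inside the pipeline avoids fitting the scaler or encoder on validation rows.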

In [None]:
train_transformed[numerical_features].describe()

In [None]:
train_transformed[categorical_features].describe()

In [None]:
# Generic one-hot encoding pattern; `data` and 'Profession' are placeholders
# and do not refer to columns of this dataset
jobs_encoder = OneHotEncoder()
jobs_encoder.fit(data['Profession'].unique().reshape(-1, 1))
transformed = jobs_encoder.transform(data['Profession'].to_numpy().reshape(-1, 1))
# Create a pandas DataFrame of the one-hot encoded column (transform returns a sparse matrix)
ohe_df = pd.DataFrame(transformed.toarray(), columns=jobs_encoder.get_feature_names())
# Concatenate with the original data
data = pd.concat([data, ohe_df], axis=1).drop(['Profession'], axis=1)

In [None]:
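# Apply One-Hot encoding
onehot_encoder = OneHotEncoder(sparse=False, drop='first')

encoding_cat = onehot_encoder.fit_transform(features_c.values)

In [None]:
encoding_cat

In [None]:
# The encoded array has one column per category level, so it cannot be assigned
# back to the original categorical columns; wrap it in a DataFrame instead
encoding_df = pd.DataFrame(encoding_cat, index=train_transformed.index,
                           columns=onehot_encoder.get_feature_names(name_features_c))
train_transformed = pd.concat(
    [train_transformed.drop(columns=name_features_c, errors='ignore'), encoding_df], axis=1)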

__OneHotEncoder__: Although LightGBM can handle "true" categorical variables natively, I applied one-hot encoding to reduce noise when finding splits on the most useful categories.

## Missing values

>Values of `-1` indicate that the feature was missing from the observation.

In [None]:
vars_with_missing = []

for f in train_df.columns:
    missings = train_df[train_df[f] == -1][f].count()
    if missings > 0:
        vars_with_missing.append(f)
        missings_perc = missings/train_df.shape[0]
        
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))
        
print('\nIn total, there are {} variables with missing values'.format(len(vars_with_missing)))

__Feature elimination__: I dropped all of the `calc` features as well as `ps_ind_14`, `ps_car_10_cat`, `ps_car_14`, `ps_ind_10_bin`, `ps_ind_11_bin`, `ps_ind_12_bin`, `ps_ind_13_bin`, `ps_car_11` and `ps_car_12`. I excluded them one by one in a greedy fashion, checking the LightGBM cross-validation score after each removal.

---
Only `ps_car_03_cat` and `ps_car_05_cat` have a large number of missing values.
- ps_car_03_cat has 411231 records (69.09%)
- ps_car_05_cat has 266551 records (44.78%)

__Should these variables be dropped?__

__Add the number of missing values per row as a new feature?__ `train_df['num_missing'] = (all_data==-1).sum(axis=1)`

Then add `num_missing` to the remaining features: `remaining_features.append('num_missing')`
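
The row-wise missing count idea can be sketched on a toy frame. The values below are made up; only the column names are taken from the dataset.

```python
import pandas as pd

# Toy rows where -1 marks a missing value (illustrative values)
toy = pd.DataFrame({
    "ps_car_03_cat": [-1, 1, -1],
    "ps_car_05_cat": [-1, 0, 1],
    "ps_reg_03": [0.7, -1.0, 0.5],
})

# Count the -1 codes in each row
toy["num_missing"] = (toy == -1).sum(axis=1)

print(toy["num_missing"].tolist())  # [2, 1, 1]
```

The count has to be computed before the `-1` codes are replaced or imputed, otherwise the information is lost.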

## Checking the cardinality of the categorical variables - Future Work?

### Distinct values - one-hot encoding
Cardinality refers to the number of distinct values in a variable. Since dummy variables will be created from the categorical variables later on, we need to check whether any of them has many distinct values. Such variables should be handled differently, as they would result in a large number of dummy variables.
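
A cardinality check can be sketched as follows. The toy values and the threshold of 4 are assumptions chosen for this example, not values derived from the real data.

```python
import pandas as pd

# Toy categorical columns (illustrative values)
toy = pd.DataFrame({
    "ps_car_02_cat": [0, 1, 1, 0, 1, -1],
    "ps_car_11_cat": [12, 104, 87, 12, 33, 64],
})

# Number of distinct values per categorical column, highest first
cardinality = toy.nunique().sort_values(ascending=False)
print(cardinality)

# Columns above a (hypothetical) threshold would create many dummy variables
high_cardinality = cardinality[cardinality > 4].index.tolist()
print(high_cardinality)  # ['ps_car_11_cat']
```

On the real data the same call would be applied to the categorical columns listed in `meta`.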

In [None]:
train_copy = train_df.copy().replace(-1, np.nan)
train_copy = train_copy.dropna()

# meta has no 'interval' level; continuous features are the numerical ones stored as float32
v = meta[(meta.level == 'numerical') & (meta['dtype'].astype(str) == 'float32') & (meta.keep)].index

plt.figure(figsize=(10, 8))
cont_corr = train_copy[v].corr() # Correlation between continuous features
sns.heatmap(cont_corr, annot=True, cmap='OrRd'); # Plot heatmap

In [None]:
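# meta has no 'ordinal' level; integer-valued numerical features are treated as ordinal here
v = meta[(meta.level == 'numerical') & (meta['dtype'].astype(str) == 'int32') & (meta.keep)].index

plt.figure(figsize=(15, 10))
ord_corr = train_copy[v].corr() # Correlation between ordinal features
sns.heatmap(ord_corr, annot=True); # Plot heatmap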

## Handling imbalanced classes

A `target` value of 1 means that a claim was filed and 0 means that no claim was filed.

In [None]:
y = train_df.target
ax = sns.countplot(x=y, label="Count")

total_size = len(train_df)

# Display the target value ratio at the top of the bar
for patch in ax.patches:
    height = patch.get_height()
    width = patch.get_width()
    left_coord = patch.get_x()
    percent = height/total_size*100

    ax.text(x=left_coord + width/2.0, 
            y=height + 3000,
            s='{:1.1f}%'.format(percent),
            ha='center')

ax.set_title('Target Distribution');

### SMOTE
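
As a placeholder for this section, here is a simplified NumPy sketch of the core SMOTE idea: new minority samples are created by interpolating between a minority sample and one of its nearest minority neighbours. It is an illustration under made-up data, not the implementation from the imbalanced-learn library; `smote_sample` and its parameters are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy minority-class samples with two features (values are illustrative)
X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])


def smote_sample(X, k=2, n_new=5):
    """Create n_new synthetic rows by interpolating towards a random
    one of the k nearest neighbours of a randomly chosen sample."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X))
        # Euclidean distances from sample i to every minority sample
        dist = np.linalg.norm(X - X[i], axis=1)
        neighbours = np.argsort(dist)[1:k + 1]   # skip the sample itself
        j = rng.choice(neighbours)
        gap = rng.random()                       # position along the segment
        synthetic.append(X[i] + gap * (X[j] - X[i]))
    return np.array(synthetic)


X_new = smote_sample(X_min)
print(X_new.shape)  # (5, 2)
```

Because every synthetic row lies on a segment between two real minority rows, oversampling should be applied to the training folds only, never to validation data.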

## Feature Importance

__Reference__
>Raschka, S., & Mirjalili, V. (2019). Python Machine Learning. Zaltbommel, Netherlands: Van Haren Publishing.

In [None]:
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold
import time
from lightgbm import LGBMClassifier
import lightgbm as lgb

In [None]:
import warnings
warnings.simplefilter('ignore', UserWarning)

In [None]:
lgb_params = {
        'class_weight': 'balanced', # as we have imbalanced class
        'objective': 'binary',
        'boosting_type': 'dart',
        'learning_rate': 0.01,
        'n_estimators': 500,
        'n_jobs': 4, # number of parallel threads
    }
features_classifier = lgb.LGBMClassifier()  
features_classifier.set_params(**lgb_params)

In [None]:
y = train_df.target
X = train_df.drop(['target'], inplace=False, axis=1)

X_test = test_df

In [None]:
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(n_splits=3, shuffle=True)
predicts = []

for train_index, test_index in kf.split(X, y):
    print("---")
    X_train, X_val = X.iloc[train_index], X.iloc[test_index]
    y_train, y_val = y.iloc[train_index], y.iloc[test_index]
    
    features_classifier.fit(X_train, y_train, eval_set=[(X_val, y_val)])
    # Store class-1 probabilities rather than hard labels, as needed for the Gini metric
    predicts.append(features_classifier.predict_proba(X_test)[:, 1])

In [None]:
#Plotting features importances

feature_imp = pd.DataFrame(sorted(zip(features_classifier.feature_importances_,X.columns)), columns=['Value','Feature'])

fig = plt.figure(figsize=(20, 10))
sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False))
plt.title('LightGBM Feature Importance (model from the last fold)')
plt.tight_layout()
plt.savefig('./plots/lgbm_importances-01.png', dpi=fig.dpi)
plt.show()

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit_transform(train_df.drop(['target'], axis=1))

---

__SelectFromModel__ - Meta-transformer for selecting features based on importance weights.

With the get_support method we can then limit the number of variables in the train data.

>https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html
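
A minimal sketch of `SelectFromModel` with `get_support` on synthetic data follows. The decision tree, the `"mean"` threshold and the toy features are assumptions made for this illustration; the notebook itself uses a LightGBM model, and here the selector fits the estimator itself instead of receiving a prefit one.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_toy = rng.random((200, 4))
y_toy = (X_toy[:, 0] > 0.5).astype(int)   # only the first feature matters

feat_labels = np.array(["f0", "f1", "f2", "f3"])

# Fit the selector; features with importance below the mean are dropped
sfm = SelectFromModel(DecisionTreeClassifier(random_state=0), threshold="mean")
sfm.fit(X_toy, y_toy)

mask = sfm.get_support()              # boolean mask over the input features
X_selected = sfm.transform(X_toy)     # keeps only the selected columns
print(list(feat_labels[mask]), X_selected.shape)
```

The boolean mask from `get_support` is what lets the selected column names be recovered from the original labels.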

In [None]:
from sklearn.feature_selection import SelectFromModel
#sfm = SelectFromModel(model, threshold='median', prefit=True)

# Assumption: the fitted model is the LightGBM classifier trained above
# and the feature labels are the training column names
feat_labels = X_train.columns
sfm = SelectFromModel(features_classifier, threshold=100, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features before selection: {}'.format(X_train.shape[1]))
n_features = X_selected.shape[1]
print('Number of features after selection: {}'.format(n_features))
selected_vars = list(feat_labels[sfm.get_support()])

In [None]:
train = train_df[selected_vars + ['target']]

In [None]:
train.head()

<a id="3"></a>
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">3- Models</p>

For this project I will use 2 Ensemble ML models: `LightGBM` and `XGBoost`

In [None]:
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb

## Tuning the hyper-parameters

- Random Search for Hyper-Parameter Optimization
- Grid search
- Bayesian optimization

<a id="4"></a>
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">4- Evaluation</p>

## Normalized Gini coefficient

$$
\text{Gini} = 2 \cdot \text{AUC} - 1
$$

As for why this is used instead of the more common AUC, the only reason I can think of is that a random prediction yields a Gini score of 0, whereas its AUC is 0.5.

Besides setting the score of a random classifier to 0, the normalization also fixes the other end of the scale, so that a perfect classifier scores exactly 1 rather than some maximum achievable value below 1. Whether that is an improvement depends on how much you value a more intuitive scale, although one could argue that the normalization also makes scores easier to compare across different datasets.
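
The relation between Gini and AUC can be checked on a toy example. The sketch below computes AUC directly as the fraction of correctly ranked (positive, negative) pairs instead of calling scikit-learn; the `auc` and `gini` helpers and the scores are made up for illustration.

```python
import numpy as np


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size


def gini(y_true, scores):
    return 2 * auc(y_true, scores) - 1


y = [0, 0, 1, 1, 0, 1]
print(gini(y, [0.1, 0.2, 0.8, 0.9, 0.3, 0.7]))  # perfect ranking -> 1.0
print(gini(y, [0.5] * 6))                        # uninformative scores -> 0.0
print(gini(y, [0.9, 0.8, 0.2, 0.1, 0.7, 0.3]))  # reversed ranking -> -1.0
```

An uninformative model therefore sits at 0 on the Gini scale, while it sits at 0.5 on the AUC scale.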

__Reference:__

> Hand, D. J., & Till, R. J. (2001). A Simple Generalisation of the Area Under the ROC Curve for Multiple Class Classification Problems. Machine Learning, 45(2), 171–186. https://doi.org/10.1023/a:1010920819831

In [None]:
from sklearn.model_selection import learning_curve
from sklearn.metrics import make_scorer, roc_auc_score

In [None]:
def gini_normalized(y_actual, y_pred):
    """Simple normalized Gini based on Scikit-Learn's roc_auc_score"""
    gini = lambda a, p: 2 * roc_auc_score(a, p) - 1
    return gini(y_actual, y_pred) / gini(y_actual, y_actual)

One pitfall is how scikit-learn scores the hold-out set. By default, it scores with the output of the model's `predict` method rather than `predict_proba`. For a classification problem, `predict` returns class labels while `predict_proba` returns the probability of each class. For computing the Gini value, the output of `predict_proba` is more appropriate.

To ensure this happens, we modify the `gini_normalized` function so that it also accepts the two-column output of `predict_proba`.

In [None]:
def gini_normalized(y_actual, y_pred):
    """Simple normalized Gini based on Scikit-Learn's roc_auc_score"""
    
    # If the predictions y_pred are binary class probabilities
    if y_pred.ndim == 2:
        if y_pred.shape[1] == 2:
            y_pred = y_pred[:, 1]
    gini = lambda a, p: 2 * roc_auc_score(a, p) - 1
    return gini(y_actual, y_pred) / gini(y_actual, y_actual)
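
A short usage sketch shows why probabilities matter here. The function is repeated so the example is self-contained, and the toy labels and probabilities are made up; the 0.5 cut-off only mimics the kind of hard labels a `predict` call would return.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def gini_normalized(y_actual, y_pred):
    """Same logic as the function above, repeated for a self-contained example."""
    if y_pred.ndim == 2 and y_pred.shape[1] == 2:
        y_pred = y_pred[:, 1]          # keep the positive-class probability
    gini = lambda a, p: 2 * roc_auc_score(a, p) - 1
    return gini(y_actual, y_pred) / gini(y_actual, y_actual)


y_true = np.array([0, 0, 1, 1, 0, 1])
# Toy predict_proba-style output: columns are P(class 0), P(class 1)
proba = np.array([[0.9, 0.1], [0.7, 0.3], [0.55, 0.45],
                  [0.2, 0.8], [0.8, 0.2], [0.3, 0.7]])
# Hard labels at a 0.5 cut-off, standing in for the output of predict
labels = (proba[:, 1] >= 0.5).astype(int)

gini_proba = gini_normalized(y_true, proba)
gini_labels = gini_normalized(y_true, labels)
print(gini_proba, gini_labels)
```

In this toy case the probabilities rank every positive above every negative, but thresholding discards part of that ranking, so the score computed from labels is lower.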

<a id="5"></a>
# <p style="background-color:#018cb7; font-size:100%; text-align:left; color:#FFFFFF; padding: 15px 5px 15px 25px; border-radius: 15px;">5- Kaggle Submission</p>