# Summary
1. Introduction
    1. Importing Modules
    1. Defining Score Metric
1. Feature Selection
    1. Data Analysis
    1. Data Manipulation
1. Classification
1. Submission
1. Results
1. The Team

# 1. Introduction

This is a complete notebook on how to develop a working solution for the [Porto Seguro](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction) competition. It is based on our own insights, as well as insights from other kernels. The goal is to provide a beginner-friendly kernel with exploratory and visual considerations about why the good kernels do what they do with the data.

## 1.A. Importing Modules

In [None]:
# data mining
import numpy as np
import pandas as pd

# data visualization
import seaborn as sns
import missingno as msno
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from lightgbm import LGBMClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score

from subprocess import check_output 
print(check_output(["ls", "../input"]).decode("utf8"))

## 1.B. Defining Score Metric
This competition uses the Normalized Gini Coefficient to score predictions. A better understanding of this metric can be obtained [here](https://www.kaggle.com/batzner/gini-coefficient-an-intuitive-explanation).

In [None]:
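def gini(actual, pred):
    assert (len(actual) == len(pred))
    # columns: actual value, prediction, original index (used to break ties)
    # note: the np.float alias was removed from recent NumPy releases, so use float
    data = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    # sort by prediction (descending), then by original index
    data = data[np.lexsort((data[:, 2], -1 * data[:, 1]))]
    totalLosses = data[:, 0].sum()
    giniSum = data[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)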


def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def gini_xgb(preds, dtrain):
    labels = dtrain.get_label()
    gini_score = gini_normalized(labels, preds)
    return 'gini', gini_score
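
A quick sanity check of the metric, as a sketch on synthetic data: a perfect ranking should score 1, and for binary targets without tied predictions the normalized Gini equals 2 * AUC - 1. The `gini` below is a compact restatement of the function above; `auc_pairwise` and the random data are illustrative assumptions, not part of the original kernel.

```python
import numpy as np

def gini(actual, pred):
    # sort by prediction (descending), breaking ties by original index
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    order = np.lexsort((np.arange(len(actual)), -pred))
    sorted_actual = actual[order]
    gini_sum = sorted_actual.cumsum().sum() / sorted_actual.sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def auc_pairwise(actual, pred):
    # hypothetical helper: share of (positive, negative) pairs ranked correctly
    # assumes binary labels and no tied predictions
    actual = np.asarray(actual)
    pred = np.asarray(pred, dtype=float)
    pos = pred[actual == 1]
    neg = pred[actual == 0]
    return (pos[:, None] > neg[None, :]).mean()

# synthetic labels and scores (made up for illustration)
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)
y_score = rng.random(200)

perfect = gini_normalized(y_true, y_true)
random_like = gini_normalized(y_true, y_score)
from_auc = 2 * auc_pairwise(y_true, y_score) - 1

print('perfect ranking:', perfect)
print('random scores:  ', random_like)
print('2 * AUC - 1:    ', from_auc)
```

Because the score depends only on the ordering of the predictions, any monotonic rescaling of `y_score` leaves it unchanged.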

# 2. Feature Selection

## 2.A. Data Analysis
In summary, the data analysis shows that the dataset is highly unbalanced, that there are many missing values, and that some columns show no correlation with the target and should be dropped.

### Importing Data

In [None]:
df = pd.read_csv('../input/train.csv', na_values='-1')
test_df = pd.read_csv('../input/test.csv', na_values='-1')
df.head()

### Class
The data is split into two classes. Unfortunately, the vast majority of the samples belong to the same class: only 3.64% belong to the other, which makes the dataset highly unbalanced. A model trained on it will be biased toward predicting 0 every time, so techniques for dealing with unbalanced training sets should be used.


In [None]:
entries = df.shape[0]
plot = sns.countplot(x='target', data=df)
for p in plot.patches:
    plot.annotate('{:.2f}%'.format(100*p.get_height()/entries), (p.get_x()+ 0.3, p.get_height()+10000))
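
The imbalance can also be quantified numerically. This is a minimal sketch on a made-up `target` column, not the competition data. The negative-to-positive ratio it prints is the kind of value that could be tried as a positive-class weight, for example through LightGBM's `scale_pos_weight` parameter; whether that helps here is not something the notebook tests.

```python
import pandas as pd

# toy stand-in for df['target']: 3 positives out of 50 rows (made-up data)
toy = pd.DataFrame({'target': [1] * 3 + [0] * 47})

# share of each class
class_share = toy['target'].value_counts(normalize=True)
positive_rate = class_share.loc[1]

# number of negatives per positive sample
neg_pos_ratio = (toy['target'] == 0).sum() / (toy['target'] == 1).sum()

print('positive rate: {:.2%}'.format(positive_rate))
print('negatives per positive: {:.1f}'.format(neg_pos_ratio))
```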

### Null values
We then search the dataset for missing entries. A quick analysis shows that many columns have multiple missing values, which requires our attention. We should either fill these values with reasonable ones or delete the affected features from our dataset, so a feature-by-feature analysis is required.

In [None]:
msno.matrix(df=df.iloc[:, :], figsize=(20, 14), color=(0.8, 0.5, 0.2))   

In [None]:
print('Column \t\t Number of Null')
for column in df.columns:
    n_null = df[column].isnull().sum()
    print('{}:\t {} ({:.2f}%)'.format(column, n_null, 100 * n_null / entries))
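
The two options mentioned above (filling or deleting) can be sketched as follows. The notebook itself leaves the missing values as NaN, so this is only an illustration on a made-up frame: the column names, the `MAX_NULL_SHARE` cutoff, and the median/mode fill rules are all assumptions.

```python
import numpy as np
import pandas as pd

# toy frame with made-up values; in the notebook, -1 is already parsed as NaN
toy = pd.DataFrame({
    'ps_num': [0.5, np.nan, 0.7, 0.9],
    'ps_demo_cat': [1.0, 1.0, np.nan, 2.0],
    'ps_sparse': [np.nan, np.nan, np.nan, 3.0],
})

# share of missing entries per column
null_share = toy.isnull().mean()

# hypothetical cutoff: drop columns that are mostly empty
MAX_NULL_SHARE = 0.5
filled = toy.drop(columns=null_share[null_share > MAX_NULL_SHARE].index)

for col in filled.columns:
    if col.endswith('cat'):
        # categorical feature: fill with the most frequent value
        filled[col] = filled[col].fillna(filled[col].mode()[0])
    else:
        # numeric feature: fill with the median
        filled[col] = filled[col].fillna(filled[col].median())

print(null_share)
print(filled)
```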

### Correlation Matrix

The correlation between pairs of features shows that the *ps_calc_etc* features (the columns prefixed with `ps_calc_`) have essentially no linear correlation with the *target* or with any other feature. Dropping them reduces the number of dimensions and helps avoid the curse of dimensionality.

In [None]:
corr = df.corr()
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()
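
Besides reading the heatmap, weakly correlated features can be listed programmatically. This is a sketch on synthetic data: the column names, the sample size, and the `THRESHOLD` value are assumptions chosen for illustration, not values used by the kernel.

```python
import numpy as np
import pandas as pd

# synthetic data: one informative feature and one pure-noise feature
rng = np.random.default_rng(0)
n = 100_000
signal = rng.normal(size=n)
toy = pd.DataFrame({
    'target': (signal + rng.normal(scale=0.5, size=n) > 1.0).astype(int),
    'ps_ind_signal': signal,
    'ps_calc_noise': rng.normal(size=n),
})

# absolute linear correlation of every feature with the target
target_corr = toy.corr()['target'].drop('target').abs()

# hypothetical cutoff below which a feature is treated as uninformative
THRESHOLD = 0.05
weak = target_corr[target_corr < THRESHOLD].index.tolist()

print(target_corr.sort_values())
print('candidates to drop:', weak)
```

Linear correlation only captures linear relationships, so a low value is a hint rather than proof that a feature is useless.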

## 2.B. Data Manipulation
Based on the data analysis step, we are going to remove the *ps_calc_etc* features and convert the categorical features to dummy columns.

### Removing Features
Since the *ps_calc_etc* features aren't related to *target*, removing them keeps random noise from affecting our model and improves training and classification times.

In [None]:
unwanted = df.columns[df.columns.str.startswith('ps_calc_')]
df = df.drop(unwanted, axis=1)
test_df = test_df.drop(unwanted, axis=1)
df.head()

### Changing Categorical Features to Dummy Values
Why do we need to convert categorical features to dummy values? Let's call your variable X and assume it takes on values "1", "2", "3", "4" or "5". If you feed X into the model as numbers, the model will estimate only a single parameter, which is the effect on the target variable of increasing X by 1 unit. So if you hold everything else constant and increase X from 1 to 2, that affects the target variable the same way as increasing it from 2 to 3 or from 4 to 5.

If instead you model X as categorical, you will estimate 4 parameters: the effect of increasing X from 1 to 2, from 2 to 3, and so on. These values could all be different. And that's what we wanted to extract from categorical variables from the very beginning: the weight of each category.

In [None]:
cat_columns = [a for a in df.columns if a.endswith('cat')]

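# prefix each dummy column with its source feature so that names stay unique
# (without a prefix, different features produce identically named columns)
for col in cat_columns:
    dummy = pd.get_dummies(df[col], prefix=col)
    df = pd.concat([df, dummy], axis=1)
    df = df.drop([col], axis=1)

for col in cat_columns:
    dummy = pd.get_dummies(test_df[col], prefix=col)
    test_df = pd.concat([test_df, dummy], axis=1)
    test_df = test_df.drop([col], axis=1)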
    
df.head()
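
One thing to watch when encoding the train and test sets separately is that a category may appear in only one of them, leaving the two frames with different columns. The sketch below uses made-up toy frames and shows one possible way to align them with `reindex`; it is an illustration, not a step taken by the original kernel.

```python
import pandas as pd

# toy frames (made-up values): category 2 is missing from test, category 3 from train
train_toy = pd.DataFrame({'ps_car_01_cat': [0, 1, 2], 'ps_ind_02_cat': [1, 1, 0]})
test_toy = pd.DataFrame({'ps_car_01_cat': [0, 1, 1], 'ps_ind_02_cat': [0, 1, 3]})

cat_cols = ['ps_car_01_cat', 'ps_ind_02_cat']

# one dummy column per (feature, category) pair, named <feature>_<category>
train_dum = pd.get_dummies(train_toy, columns=cat_cols)
test_dum = pd.get_dummies(test_toy, columns=cat_cols)

# keep exactly the train columns, in the same order:
# categories unseen in test become all-zero columns, extra ones are dropped
test_dum = test_dum.reindex(columns=train_dum.columns, fill_value=0)

print(train_dum.columns.tolist())
print(test_dum)
```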


# 3. Classification
Since we have an unbalanced training set, we are going to use an ensemble of classifiers to produce a better prediction, combined with a StratifiedKFold strategy so that each base model is trained and validated on folds that preserve the class ratio of the full training set.
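
As a small illustration of what the stratified split does, the sketch below builds made-up labels with a 10% positive rate and checks the positive rate inside each validation fold. The toy arrays are assumptions; `n_splits` and `random_state` mirror the values used later in the notebook.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# toy labels (made up): 6 positives out of 60 samples
y_toy = np.array([1] * 6 + [0] * 54)
x_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=2016)

# positive rate inside each validation fold
fold_rates = [y_toy[valid_idx].mean() for _, valid_idx in skf.split(x_toy, y_toy)]
print(fold_rates)
```

Each fold keeps the same share of positives as the full set, so no fold ends up without any positive samples.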

### Ensemble Class Creation

In [None]:
class Ensemble(object):
    def __init__(self, kfold, stacker, models):
        self.kfold = kfold
        self.stacker = stacker
        self.models = models

    def fit_predict(self, x, y, test):
        x = np.array(x)
        y = np.array(y)
        t = np.array(test)
        
        train = np.zeros((x.shape[0], len(self.models)))
        test = np.zeros((t.shape[0], len(self.models)))
        
        skf = list(StratifiedKFold(n_splits=self.kfold, shuffle=True, random_state=2016).split(x, y))
        
        for i, model in enumerate(self.models):

            test_i = np.zeros((t.shape[0], self.kfold))

            for j, (train_idx, test_idx) in enumerate(skf):
                x_train = x[train_idx]
                y_train = y[train_idx]
                x_valid = x[test_idx]
                y_valid = y[test_idx]

                print ("Fit %s fold %d" % (str(model).split('(')[0], j+1))
                
                model.fit(x_train, y_train)
                y_train_pred = model.predict_proba(x_train)[:,1]
                y_pred = model.predict_proba(x_valid)[:,1]   
                
                print("[Train] Gini score: %.6lf" % gini_normalized(y_train, y_train_pred))
                print("[Valid] Gini score: %.6lf\n" % gini_normalized(y_valid, y_pred))

                train[test_idx, i] = y_pred
                test_i[:, j] = model.predict_proba(t)[:,1]
            test[:, i] = test_i.mean(axis=1)

        self.stacker.fit(train, y)
        valid = self.stacker.predict_proba(train)[:,1]
        res = self.stacker.predict_proba(test)[:,1]
        print("Stacker Gini Score: %.6lf" % gini_normalized(y, valid))
        return res

### Preparing Data for Training/Predict

In [None]:
x = df.drop(['id', 'target'], axis=1)
y = df['target'].values
test_id = test_df['id']
test_df = test_df.drop('id', axis=1)

### Defining Base Models for Ensemble
These are dummy parameters for creating the models. The real values are given in the comments next to them. To run with the real parameters, you need to run the code locally on your machine, because Kaggle Kernels time out after 1 hour of training.

In [None]:
lgb_params = {
    'learning_rate': 0.02,
    'n_estimators': 1, # use 650 for real model
    'max_bin': 10,
    'subsample': 0.8,
    'subsample_freq': 10,
    'colsample_bytree': 0.8,
    'min_child_samples': 500,
    'random_state': 99
}

lgb_model = LGBMClassifier(**lgb_params)

lgb2_params = {
    'learning_rate': 0.02,
    'n_estimators': 1, #use 1090 for real model
    'colsample_bytree': 0.3,
    'subsample': 0.7,
    'subsample_freq': 2,
    'num_leaves': 16,
    'random_state': 99
}

lgb_model2 = LGBMClassifier(**lgb2_params)

lgb3_params = {
    'n_estimators': 1, #use 1100 for real model
    'max_depth': 4,
    'learning_rate': 0.02,
    'random_state': 99
}

lgb_model3 = LGBMClassifier(**lgb3_params)

log_model = LogisticRegression()

### Fit/Prediction
Gini score is calculated for each model of the Ensemble, as well as for the final combined classifier.

In [None]:
stack = Ensemble(kfold=3,
        stacker = log_model,
        models = (lgb_model, lgb_model2, lgb_model3))        
        
y_pred = stack.fit_predict(x, y, test_df)

# 4. Submission
This output is just a dummy. In order to get the real output, you must use the right parameters for the models and run the code locally on your machine.

In [None]:
sub = pd.DataFrame()
sub['id'] = test_id
sub['target'] = y_pred
sub.to_csv('output.csv', index=False)
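
Before uploading, it can be worth checking the shape of the submission frame. The helper below is a hypothetical sketch: `validate_submission` and the toy values are assumptions, and the checks only cover what the cell above produces (an `id` column and a `target` column of probabilities).

```python
import numpy as np
import pandas as pd

def validate_submission(frame):
    """Hypothetical helper: return a list of problems found in a submission frame."""
    problems = []
    if list(frame.columns) != ['id', 'target']:
        problems.append('columns should be exactly id, target')
        return problems
    if not frame['id'].is_unique:
        problems.append('duplicate ids')
    if frame['target'].isnull().any():
        problems.append('missing predictions')
    elif not frame['target'].between(0, 1).all():
        problems.append('predictions outside [0, 1]')
    return problems

# toy stand-ins for test_id and y_pred (made-up values)
good = pd.DataFrame({'id': [0, 1, 2, 4], 'target': [0.03, 0.05, 0.02, 0.10]})
bad = pd.DataFrame({'id': [0, 1, 1, 4], 'target': [0.03, np.nan, 0.02, 0.10]})

print(validate_submission(good))
print(validate_submission(bad))
```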

# 5. Results

With this approach, we managed to score 0.28960 on the leaderboard. The top submission scored 0.29698. That's pretty close, but we are still at position 1260 out of 5169 on the leaderboard, because in this competition the tiniest improvement makes you jump hundreds of positions up. These positions were last updated on December 5th, 2017.

# 6. The Team

Our team's name on the leaderboard is "Discípulos de Cleber". We are three students of the Federal University of Pernambuco:
* Higor Cavalcanti
* Lavínia Francesca
* João Vasconcelos