## Benchmark quality of synthetic data generated using GANs on a predictive model

### Overview

Nothing ruins the thrill of buying a brand new car more quickly than seeing your new insurance bill. The sting’s even more painful when you know you’re a good driver. It doesn’t seem fair that you have to pay so much if you’ve been cautious on the road for years.

Porto Seguro, one of Brazil’s largest auto and homeowner insurance companies, completely agrees. Inaccuracies in car insurance company’s claim predictions raise the cost of insurance for good drivers and reduce the price for bad ones.

For more details refer [here](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/overview).

### Objective

Build a model that predicts the probability that a driver will initiate an auto insurance claim in the next year. Given that the sample class of drivers initiating auto insurance is biased, augment data using GANs and other methods to improve accuracy and stabilise the predictive model. Compare the accuracy/ stability before and after data augmentation which serves as a proxy for the quality of synthetic data generated. 

In [8]:
# Load the required packages
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier

from ctgan import CTGANSynthesizer

# Use conda install scikit-learn to overcome sklearn testing import issue

In [9]:
# Load data
test = pd.read_csv('data/porto_seguro_safe_driver/test.csv')  # to test the predictions in the test set, install Kaggle API and run it on the competition kernel
train = pd.read_csv('data/porto_seguro_safe_driver/train.csv')

In [10]:
# Handle missing values

def filling_missing_values(data):
    '''A function to fill in the missing values of categorical features'''
    for i in data.columns.values:
        if data.isnull().values.any():
            if i == 'ps_car_03_cat' or i == 'ps_car_05_cat':
                continue
            elif i == 'ps_ind_05_cat' or i == 'ps_car_07_cat':
                data[i].fillna(data[i].mode()[0], inplace=True)
            else:
                data[i].fillna(data[i].mean(), inplace=True)
        else:
            continue
    return data


# Determine missing values in each column of the given dataframe
def missing_values(data):
    '''Function to find the percentage of missing values in each column of a DataFrame passed'''
    for i in data.columns.values:
        count =  data[data[i] == -1].shape[0]
        print("Missing Values in '{}' : {:.4f} %".format(i, (count/data.shape[0])*100))
        

train = train.replace(-1, np.nan)
train = train.replace(-1, np.nan)
        
# Fill missing values in train and test        
train = filling_missing_values(train)
test = filling_missing_values(test)       

# Check for missing values after filling
# missing_values(train)
# missing_values(test)

# Drop columns that are not needed
col_to_drop = list(train.columns[train.columns.str.startswith('ps_calc_')])
# Drop columns that are missing a lot as values
col_to_drop += ['ps_car_03_cat', 'ps_car_05_cat']
train = train.drop(col_to_drop, axis=1)  
test = test.drop(col_to_drop, axis=1)


In [11]:
# Generate data using GANs to handle class imbalance

# Preprocess data 

train_gen = train.loc[train.target==1, train.columns != 'target'].copy()
train_gen = train_gen.drop(['id'], axis=1)

# List of categorical features
cat_features = [a for a in train_gen.columns if a.endswith('cat')]

ctgan = CTGANSynthesizer()
ctgan.fit(train_gen, cat_features, epochs=5)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  out = sparse.csr_matrix((data, indices, indptr),
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  out = sparse.csr_matrix((data, indices, indptr),
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  out = sparse.csr_matrix((data, indices, indptr),
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  out = sparse.csr_matrix((data, indices, indptr),
In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
  out = sparse.csr_matrix((data, indices, indptr),
In case you use

Epoch 1, Loss G: 1.3501, Loss D: 0.1074
Epoch 2, Loss G: 0.8529, Loss D: -0.1343
Epoch 3, Loss G: 0.3041, Loss D: 0.1112
Epoch 4, Loss G: 0.2197, Loss D: 0.0399
Epoch 5, Loss G: -0.1111, Loss D: 0.2283


In [12]:
# adding 30000 samples to train dataset
samples = ctgan.sample(30000)
samples['id'] = 'generated'
samples['target'] = 1
print(samples.shape)
samples.head()

(30000, 37)


Unnamed: 0,ps_ind_01,ps_ind_02_cat,ps_ind_03,ps_ind_04_cat,ps_ind_05_cat,ps_ind_06_bin,ps_ind_07_bin,ps_ind_08_bin,ps_ind_09_bin,ps_ind_10_bin,...,ps_car_09_cat,ps_car_10_cat,ps_car_11_cat,ps_car_11,ps_car_12,ps_car_13,ps_car_14,ps_car_15,id,target
0,4.877373,1.0,0.353685,1.0,0.0,0.989327,1.015544,0.004155,1.013882,0.001548,...,0.0,1.0,65.0,1.972799,0.401574,1.732962,0.36139,3.426562,generated,1
1,0.972134,3.0,2.888083,0.0,0.0,0.009558,-0.010918,0.005899,-0.007671,0.003505,...,0.0,1.0,104.0,2.015437,0.449782,1.015299,0.459155,3.432269,generated,1
2,0.969146,2.0,3.089348,1.0,0.0,-0.001497,0.987894,0.001142,0.993559,0.001119,...,1.0,1.0,5.0,2.980083,0.375507,0.896821,0.344558,3.462962,generated,1
3,2.079132,2.0,7.175427,1.0,0.0,0.01133,0.002391,0.002441,-0.001652,1e-06,...,2.0,1.0,87.0,2.004126,0.376327,0.902693,0.359034,3.585055,generated,1
4,2.878365,4.0,4.974337,0.0,0.0,0.011478,-0.012287,-0.001791,0.009415,-0.000182,...,2.0,1.0,1.0,2.009672,0.449358,0.946038,0.387174,2.693171,generated,1


In [13]:
# Append to train data
train = pd.concat([train, samples], axis=0, sort=True)

In [14]:
# Take a random 20% of the dataset as validation data
x_train, x_valid, y_train, y_valid = train_test_split(train, train['target'], test_size=0.2, random_state=1243)
print('Train samples: {} & Validation samples: {}'.format(len(x_train), len(x_valid)))
print('\n', y_train.value_counts())
print(y_valid.value_counts())


Train samples: 500169 & Validation samples: 125043

 0    458781
1     41388
Name: target, dtype: int64
0    114737
1     10306
Name: target, dtype: int64


In [15]:
# Preprocessing 
id_valid = x_valid['id'].values
id_test = test['id'].values
target_train = x_train['target'].values
target_valid = x_valid['target'].values

x_train = x_train.drop(['target','id'], axis = 1)
x_valid = x_valid.drop(['id', 'target'], axis = 1)
test = test.drop(['id'], axis = 1) 

def one_hot_encoding(df):
    cat_features = [a for a in df.columns if a.endswith('cat')]

    for column in cat_features:
        temp = pd.get_dummies(pd.Series(df[column]))
        df = pd.concat([df,temp],axis=1)
        df = df.drop([column],axis=1)
    return df

x_train['flag'] = 'train'
x_valid['flag'] = 'valid'
test['flag'] = 'test'

total = x_train.append([x_valid, test])
total_coded = one_hot_encoding(total.loc[:, total.columns != 'flag'])
total_coded['flag'] = total['flag']

x_train = total_coded.loc[total_coded.flag=='train', total_coded.columns != 'flag']
x_valid = total_coded.loc[total_coded.flag=='valid', total_coded.columns != 'flag']
test = total_coded.loc[total_coded.flag=='test', total_coded.columns != 'flag']

print(x_train.values.shape, x_valid.values.shape, test.values.shape)


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


(500169, 206) (125043, 206) (892816, 206)


In [139]:
# Ensemble model class
class Ensemble(object):
    def __init__(self, stacker, base_models):
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):
            clf.fit(X, y)
            S_train[:, i]= clf.predict_proba(X)[:,1]                
            S_test[:, i] = clf.predict_proba(T)[:,1]
            # S_test = S_test.mean(axis=1)

        results = cross_val_score(self.stacker, S_train, y, cv=3, scoring='roc_auc')
        print("Stacker score: %.5f" % (results.mean()))

        self.stacker.fit(S_train, y)
        res = self.stacker.predict_proba(S_test)[:,1]
        return res
    
# RandomForest params
rf_params = {}
rf_params['n_estimators'] = 200
rf_params['max_depth'] = 6
rf_params['min_samples_split'] = 70
rf_params['min_samples_leaf'] = 30



In [140]:
# Build the model

# Stacking model here is logistic regression 
log_model = LogisticRegression()

# Base models - Random Forest and logistic regression
random_forest_model = RandomForestClassifier(**rf_params)
        
stack = Ensemble(stacker = log_model,
        base_models = (log_model, random_forest_model))        
        
y_pred = stack.fit_predict(x_train, target_train, x_valid)



Stacker score: 0.82839




In [143]:
# Currently the threshold is taken to be 0.5
y_pred[y_pred>=0.5] = 1
y_pred[y_pred<0.5] = 0

accuracy_score(target_valid, y_pred)
confusion_matrix(target_valid, y_pred)

array([[114737,      0],
       [  4330,   5976]])

In [145]:
# Original matrix
confusion_matrix(target_valid, target_valid)

array([[114737,      0],
       [     0,  10306]])

In [149]:
# without sampling the confusion matrix looks as below

# array([[114729,      8],
#        [     4544,  8]])

# roc_auc improved from 0.62 to 0.82