# *Boosting Claims Predictions: an Analysis on Kaggle Data* 


# Abstract

Predictive modeling is performed on data from Porto Seguro's Safe Driver Prediction competition, hosted on the Kaggle platform, using machine learning boosting algorithms (`AdaBoost` and `XGBoost`, among others). We refer to the document "On Boosting: Theory and Applications" to complement the analysis presented in this notebook.

# Getting Started with Python and Jupyter Notebook

In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the [PEP8 standard](https://www.python.org/dev/peps/pep-0008/) ("PEP = Python Enhancement Proposal") is followed, with minor variations to improve readability.

In [1]:
# Notebook settings
###################

# resetting variables
get_ipython().run_line_magic('reset', '-sf')

# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

# plotting
%matplotlib inline

In [2]:
# loading Python packages
#########################

# scientific packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy

# boosting
import xgboost
from xgboost import XGBClassifier
from xgboost import plot_importance

# scipy
from scipy.stats import chi2

# pandas: selected modules and functions
from pandas.plotting import scatter_matrix

# sklearn: selected modules and functions
from sklearn import decomposition
from sklearn.decomposition import PCA

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import scale

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier

from sklearn.utils import shuffle

from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import SelectFromModel

from sklearn.model_selection import train_test_split
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

from sklearn.metrics import roc_curve
from sklearn.metrics import auc 
from sklearn.metrics import roc_auc_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import zero_one_loss

# utilities
from datetime import datetime

# Data Import

For the given Kaggle competition, two datasets are provided: one for training and one for generating the predictions to be submitted. The latter lacks the target variable (denoted by `target` in the former) and is not imported in this notebook; the whole analysis is therefore carried out on the original 'training' data. A remark on the data acquisition procedure: the data are downloaded directly from the Kaggle [Porto Seguro Challenge website](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) into a landing folder on a local machine, then copied to a separate folder and unzipped there. The unzipped data are imported using `pandas` as follows:

In [3]:
# data import: specify the path to the competition training data
path = 'D:\\学习资料 RUC 硕\\硕一上\\现代统计精算模型\\boosting\\train.csv'
data = pd.read_csv(path)

The imported dataset, i.e. `data`, comprises 595212 records and 59 columns:

In [4]:
print('Structure of imported data:', data.shape)

Structure of imported data: (595212, 59)


# Structural Data Analysis

The data description on the Kaggle [Porto Seguro Challenge website](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data) provides the following information on the features in `data`:
- Features that belong to similar groupings are tagged as such in the feature names (e.g.,  `ind`, `reg`, `car`, `calc`).
- Feature names include the postfix `bin` to indicate binary features and `cat` to indicate categorical features. 
- Features without these designations are either continuous or ordinal. 
- Values of `-1` indicate that the feature was missing from the observation. 
- The target column `target` signifies whether or not a claim was filed for that policy holder.

The variable `target` is therefore the label used to train the classifiers. Data types for all columns in `data` can be inspected with `data.dtypes`.

Both integer and float data types are present, and the variable `id` uniquely identifies data records. A quick check shows that no duplicate records exist:

In [5]:
# duplicates? None
data.drop_duplicates().shape

(595212, 59)
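
Note that `drop_duplicates()` compares entire rows, so the check above rules out fully duplicated records; whether `id` alone is unique can be verified separately. A minimal sketch on a small hypothetical frame (the toy values are illustrative and not taken from the competition data):

```python
import pandas as pd

# hypothetical toy frame standing in for `data`
toy = pd.DataFrame({
    'id': [7, 9, 13, 13],
    'target': [0, 0, 0, 1],
})

# full-row duplicates: none here, because the last two rows differ in `target`
n_full_duplicates = int(toy.duplicated().sum())

# key uniqueness: `id` is repeated, which the full-row check does not reveal
id_is_unique = bool(toy['id'].is_unique)

print('full-row duplicates:', n_full_duplicates)
print('id is unique:', id_is_unique)
```

On the real data, `data['id'].is_unique` would be the corresponding one-line check.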

One continues by checking the first 10 entries of the `data` dataset (transposed, so that each row is a feature) before moving to a high-level summary using `describe()`.

In [6]:
# imported data: first 10 entries
data.head(10).T

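| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 7.0 | 9.0 | 13.0 | 16.0 | 17.0 | 19.0 | 20.0 | 22.0 | 26.0 | 28.0 |
| target | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| ps_ind_01 | 2.0 | 1.0 | 5.0 | 0.0 | 0.0 | 5.0 | 2.0 | 5.0 | 5.0 | 1.0 |
| ps_ind_02_cat | 2.0 | 1.0 | 4.0 | 1.0 | 2.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| ps_ind_03 | 5.0 | 7.0 | 9.0 | 2.0 | 0.0 | 4.0 | 3.0 | 4.0 | 3.0 | 2.0 |
| ps_ind_04_cat | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| ps_ind_05_cat | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ps_ind_06_bin | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| ps_ind_07_bin | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| ps_ind_08_bin | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |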


### Missing Values

Missing values are encoded with `-1`, as stated in the official data description on the Kaggle [Porto Seguro Challenge website](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction/data). The following code gives an overview of all variables with missing values:

In [7]:
# missing values (encoded as -1)
feat_missing = []

for f in data.columns:
    missings = data[data[f] == -1][f].count()
    if missings > 0:
        feat_missing.append(f)
        missings_perc = missings/data.shape[0]
        
        # printing summary of missing values
        print('Variable {} has {} records ({:.2%}) with missing values'.format(f, missings, missings_perc))

# how many variables have missing values?
print()
print('In total, there are {} variables with missing values'.format(len(feat_missing)))

Variable ps_ind_02_cat has 216 records (0.04%) with missing values
Variable ps_ind_04_cat has 83 records (0.01%) with missing values
Variable ps_ind_05_cat has 5809 records (0.98%) with missing values
Variable ps_reg_03 has 107772 records (18.11%) with missing values
Variable ps_car_01_cat has 107 records (0.02%) with missing values
Variable ps_car_02_cat has 5 records (0.00%) with missing values
Variable ps_car_03_cat has 411231 records (69.09%) with missing values
Variable ps_car_05_cat has 266551 records (44.78%) with missing values
Variable ps_car_07_cat has 11489 records (1.93%) with missing values
Variable ps_car_09_cat has 569 records (0.10%) with missing values
Variable ps_car_11 has 5 records (0.00%) with missing values
Variable ps_car_12 has 1 records (0.00%) with missing values
Variable ps_car_14 has 42620 records (7.16%) with missing values

In total, there are 13 variables with missing values
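
The per-column loop above can also be written in vectorized form. The sketch below is one possible variant on a hypothetical toy frame (column names and values are made up for illustration); it counts the `-1` sentinels per column and reports the share of affected records:

```python
import pandas as pd

# hypothetical toy frame; -1 marks a missing value, as in the competition data
toy = pd.DataFrame({
    'ps_a_cat': [0, -1, 2, 1],
    'ps_b': [0.3, 0.5, -1.0, -1.0],
    'ps_c_bin': [0, 1, 1, 0],
})

# count the -1 sentinels per column and keep only the affected columns
missing_counts = (toy == -1).sum()
missing_counts = missing_counts[missing_counts > 0]

# share of affected records per column
missing_share = missing_counts / len(toy)

summary = pd.DataFrame({'missing': missing_counts, 'share': missing_share})
print(summary)
```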


Missing value imputation will be discussed in section [Imputation of missing values](#Imputation-of-missing-values).
Due to the high number and different types of the features in `data`, univariate analysis is performed to gather insights on the provided data, as shown in the forthcoming section. 

# Univariate Analysis

## Meta-Information-Encoding

One starts the univariate analysis of `data` by recording the feature type in a meta-information data frame, as shown in [B. Carremans' kernel](https://www.kaggle.com/bertcarremans/data-preparation-exploration). This encoding is useful in what follows, because it allows the analysis to be targeted at a given feature type.

In [8]:
# B. Carremans: recording meta-information for each column in train, following the official data description on the Kaggle Porto Seguro Challenge
info = []

for f in data.columns:

    # defining the role (target and id have to be separated from the other features)
    if f == 'target':
        role = 'target'
    elif f == 'id':
        role = 'id'
    else:
        role = 'input'
         
    # defining the levels    
    
    # _bin postfix = binary feature (target is binary as well)
    if 'bin' in f or f == 'target':
        level = 'binary'
    
    # _cat postfix = categorical feature
    elif 'cat' in f or f == 'id':
        level = 'categorical'
        
    # continuous or ordinal features: those which are neither _bin nor _cat    
    elif data[f].dtype == float:
        level = 'interval'
    else:
        level = 'ordinal'    
        
    # initialize 'keep' to True for all variables except for id
    keep = True
    if f == 'id':
        keep = False
    
    # defining the data type 
    dtype = data[f].dtype
    
    # creating a dictionary that contains all the metadata for the variable
    f_dict = {
        'varname': f,
        'role': role,
        'level': level,
        'keep': keep,
        'dtype': dtype
    }
    info.append(f_dict)

# collecting all meta-information into a meta dataframe
meta = pd.DataFrame(info, columns = ['varname', 'role', 'level', 'keep', 'dtype'])
meta.set_index('varname', inplace = True)

In summary, variables in `data` are of level `categorical`, `binary`, `ordinal` (i.e. categorical variables with an ordering of levels) and `interval` (i.e. discrete or semi-continuous numerical variables). By construction, the `meta` data frame collects all this meta-information.

In [9]:
# showing meta-information data frame
print(meta.shape)
meta

(59, 4)


| varname | role | level | keep | dtype |
|---|---|---|---|---|
| id | id | categorical | False | int64 |
| target | target | binary | True | int64 |
| ps_ind_01 | input | ordinal | True | int64 |
| ps_ind_02_cat | input | categorical | True | int64 |
| ps_ind_03 | input | ordinal | True | int64 |
| ps_ind_04_cat | input | categorical | True | int64 |
| ps_ind_05_cat | input | categorical | True | int64 |
| ps_ind_06_bin | input | binary | True | int64 |
| ps_ind_07_bin | input | binary | True | int64 |
| ps_ind_08_bin | input | binary | True | int64 |


An aggregated view of `meta` is obtained by grouping by `role` and `level`:

In [10]:
# showing meta-information aggregated view
pd.DataFrame({'count' : meta.groupby(['role', 'level'])['role'].size()}).reset_index()

| | role | level | count |
|---|---|---|---|
| 0 | id | categorical | 1 |
| 1 | input | binary | 17 |
| 2 | input | categorical | 14 |
| 3 | input | interval | 10 |
| 4 | input | ordinal | 16 |
| 5 | target | binary | 1 |
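
Besides aggregation, the `meta` frame can be queried with boolean masks to obtain the names of all features of a given level or role. A minimal sketch on a hypothetical five-row excerpt of `meta` (the excerpt is an assumption for illustration):

```python
import pandas as pd

# hypothetical excerpt of the meta-information frame built above
meta_toy = pd.DataFrame(
    {
        'role': ['id', 'target', 'input', 'input', 'input'],
        'level': ['categorical', 'binary', 'ordinal', 'categorical', 'interval'],
        'keep': [False, True, True, True, True],
    },
    index=pd.Index(['id', 'target', 'ps_ind_01', 'ps_ind_02_cat', 'ps_reg_01'],
                   name='varname'),
)

# names of the categorical features that are still kept (id is excluded via keep)
cat_features = meta_toy[(meta_toy.level == 'categorical') & (meta_toy.keep)].index.tolist()

# names of all kept input features, regardless of level
input_features = meta_toy[(meta_toy.role == 'input') & (meta_toy.keep)].index.tolist()

print(cat_features)
print(input_features)
```

The same kind of mask is used later in the notebook to select the categorical features before creating dummy variables.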


## `target` variable: class imbalance

The `target` variable in the `data` dataset is highly imbalanced, as shown in the following code chunk:

In [11]:
# levels for the target variable 
lev_target = ( pd.crosstab(index = data['target'], columns = 'Frequency') / data.shape[0] ) * 100
lev_target.round(2)

| target | Frequency (%) |
|---|---|
| 0 | 96.36 |
| 1 | 3.64 |
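
With such an imbalance, plain accuracy is a weak yardstick: a classifier that always predicts "no claim" is correct for every record of class 0. A minimal sketch with synthetic labels drawn to mimic the 3.64% claim rate (the labels and the seed are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic labels with roughly the same positive rate as `target`
y = (rng.random(100_000) < 0.0364).astype(int)

# trivial classifier: always predict the majority class 0
y_pred = np.zeros_like(y)

# fraction of correct predictions, and fraction of actual claims that are detected
accuracy = (y_pred == y).mean()
recall_claims = 0.0 if y.sum() == 0 else (y_pred[y == 1] == 1).mean()

print('accuracy of the all-zero classifier: {:.4f}'.format(accuracy))
print('recall on class 1: {:.4f}'.format(recall_claims))
```

Rank-based measures such as the AUC do not reward this kind of trivial prediction, since a constant score yields an AUC of 0.5.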


# Feature Engineering & Data Preparation for Modeling

In this section we perform a series of transformations on `data` to pave the way to predictive modeling. 

## On numerical `calc` variables and univariate screening
As mentioned in the document, the numerical `calc` variables are dropped from further analysis. A quick structural check on `data` is performed before and after the drop.

In [12]:
print('Structure of data before calc variable drop:', data.shape)

Structure of data before calc variable drop: (595212, 59)


In [13]:
# dropping 'ps_calc_01', ..., 'ps_calc_14' variables and updating meta information
vars_to_drop = ['ps_calc_{:02d}'.format(i) for i in range(1, 15)]
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[vars_to_drop, 'keep'] = False

In [14]:
print('Structure of data after calc variable drop:', data.shape)

Structure of data after calc variable drop: (595212, 45)


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 595212 entries, 0 to 595211
Data columns (total 45 columns):
id                595212 non-null int64
target            595212 non-null int64
ps_ind_01         595212 non-null int64
ps_ind_02_cat     595212 non-null int64
ps_ind_03         595212 non-null int64
ps_ind_04_cat     595212 non-null int64
ps_ind_05_cat     595212 non-null int64
ps_ind_06_bin     595212 non-null int64
ps_ind_07_bin     595212 non-null int64
ps_ind_08_bin     595212 non-null int64
ps_ind_09_bin     595212 non-null int64
ps_ind_10_bin     595212 non-null int64
ps_ind_11_bin     595212 non-null int64
ps_ind_12_bin     595212 non-null int64
ps_ind_13_bin     595212 non-null int64
ps_ind_14         595212 non-null int64
ps_ind_15         595212 non-null int64
ps_ind_16_bin     595212 non-null int64
ps_ind_17_bin     595212 non-null int64
ps_ind_18_bin     595212 non-null int64
ps_reg_01         595212 non-null float64
ps_reg_02         595212 non-null float64
ps_re

## Imputation of missing values
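As discussed in the document, a straightforward imputation scheme is chosen: the two categorical features with the largest share of missing values (`ps_car_03_cat` and `ps_car_05_cat`) are dropped, the continuous features `ps_reg_03`, `ps_car_12` and `ps_car_14` are imputed with the mean, and the ordinal feature `ps_car_11` is imputed with the mode. The remaining categorical features keep `-1` as a level of its own. The scheme is applied in the forthcoming code chunk: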

In [16]:
# dropping 'ps_car_03_cat', 'ps_car_05_cat' and updating meta information
vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
data.drop(vars_to_drop, inplace = True, axis = 1)
meta.loc[(vars_to_drop), 'keep'] = False  

# imputing with the mean or mode using SimpleImputer from sklearn.impute
# (the Imputer class of sklearn.preprocessing was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer

mean_imp = SimpleImputer(missing_values = -1, strategy = 'mean')
mode_imp = SimpleImputer(missing_values = -1, strategy = 'most_frequent')

data['ps_reg_03'] = mean_imp.fit_transform(data[['ps_reg_03']]).ravel()
data['ps_car_12'] = mean_imp.fit_transform(data[['ps_car_12']]).ravel()
data['ps_car_14'] = mean_imp.fit_transform(data[['ps_car_14']]).ravel()
data['ps_car_11'] = mode_imp.fit_transform(data[['ps_car_11']]).ravel()



After imputation, `data` has the following shape:

In [17]:
print(data.shape)

(595212, 43)


## Dummies
Dummy variables are created for the categorical features in `data`. The function `get_dummies` replaces each categorical variable with one indicator column per level; with `drop_first = True`, the column of the first level is dropped as well, so a feature with k levels yields k - 1 dummies. Plain one-hot encoding, by contrast, keeps all k levels.

In [18]:
# selecting categorical variables
v = meta[(meta.level == 'categorical') & (meta.keep)].index
print('Before dummification we have {} variables in data'.format(data.shape[1]))

# creating dummy variables
data = pd.get_dummies(data, columns = v, drop_first = True)
print('After dummification we have {} variables in data'.format(data.shape[1]))

Before dummification we have 43 variables in data
After dummification we have 197 variables in data
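
The difference between dummy encoding and full one-hot encoding can be seen on a single hypothetical categorical feature (the name `ps_x_cat` and its values are made up for illustration). Since `-1` is the smallest level, it is the one dropped by `drop_first = True`:

```python
import pandas as pd

# hypothetical categorical feature; -1 is the missing-value sentinel
toy = pd.DataFrame({'ps_x_cat': [-1, 0, 1, 0]})

# full one-hot encoding: one column per level, including -1
one_hot = pd.get_dummies(toy, columns=['ps_x_cat'])

# dummy encoding: the first level (here -1) is dropped as the reference level
dummies = pd.get_dummies(toy, columns=['ps_x_cat'], drop_first=True)

print(list(one_hot.columns))
print(list(dummies.columns))
```

For features without missing values, the dropped reference level is the smallest observed category instead.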


The generation of dummy variables increases the memory usage of the `data` dataset:

In [19]:
print('Memory usage of `data` (in bytes):', data.memory_usage(index = True, deep = True).sum())

Memory usage of `data` (in bytes): 246417848


A quick check on the columns of `data` after dummification ends this section:

In [20]:
print(data.columns.values)

['id' 'target' 'ps_ind_01' 'ps_ind_03' 'ps_ind_06_bin' 'ps_ind_07_bin'
 'ps_ind_08_bin' 'ps_ind_09_bin' 'ps_ind_10_bin' 'ps_ind_11_bin'
 'ps_ind_12_bin' 'ps_ind_13_bin' 'ps_ind_14' 'ps_ind_15' 'ps_ind_16_bin'
 'ps_ind_17_bin' 'ps_ind_18_bin' 'ps_reg_01' 'ps_reg_02' 'ps_reg_03'
 'ps_car_11' 'ps_car_12' 'ps_car_13' 'ps_car_14' 'ps_car_15'
 'ps_calc_15_bin' 'ps_calc_16_bin' 'ps_calc_17_bin' 'ps_calc_18_bin'
 'ps_calc_19_bin' 'ps_calc_20_bin' 'ps_ind_02_cat_1' 'ps_ind_02_cat_2'
 'ps_ind_02_cat_3' 'ps_ind_02_cat_4' 'ps_ind_04_cat_0' 'ps_ind_04_cat_1'
 'ps_ind_05_cat_0' 'ps_ind_05_cat_1' 'ps_ind_05_cat_2' 'ps_ind_05_cat_3'
 'ps_ind_05_cat_4' 'ps_ind_05_cat_5' 'ps_ind_05_cat_6' 'ps_car_01_cat_0'
 'ps_car_01_cat_1' 'ps_car_01_cat_2' 'ps_car_01_cat_3' 'ps_car_01_cat_4'
 'ps_car_01_cat_5' 'ps_car_01_cat_6' 'ps_car_01_cat_7' 'ps_car_01_cat_8'
 'ps_car_01_cat_9' 'ps_car_01_cat_10' 'ps_car_01_cat_11' 'ps_car_02_cat_0'
 'ps_car_02_cat_1' 'ps_car_04_cat_1' 'ps_car_04_cat_2' 'ps_car_04_cat_3'
 'ps_car

## On Randomness

We now fix a `random_state` for reproducibility of results; it will be used in both train vs. test dataset splitting and modeling.

In [21]:
random_state = 123

## Sampling

To speed up execution, a 10% sample of the records is drawn here, separately within each level of `target`.

In [22]:
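def typicalSampling(group, typicalFracDict):
    # sampling a class-specific fraction of the records in each target group
    # (no seed is passed to sample(), so the draw changes between runs)
    name = group.name
    frac = typicalFracDict[name]
    return group.sample(frac=frac)
 
typicalFracDict = {  # sampling fraction per target class
    1: 0.1,
    0: 0.1
}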

data = data.groupby('target', group_keys=False).apply(typicalSampling, typicalFracDict)


In [23]:
print(data.shape)
# levels for the target variable 
lev_target = ( pd.crosstab(index = data['target'], columns = 'Frequency') / data.shape[0] ) * 100
lev_target.round(2)

(59521, 197)


col_0,Frequency
target,Unnamed: 1_level_1
0,96.36
1,3.64
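
Sampling the same fraction within each class keeps the class shares unchanged, which is why the frequencies above match those of the full dataset. The sketch below shows the idea on hypothetical toy data; it uses `DataFrameGroupBy.sample` (available since pandas 1.1) rather than the `groupby(...).apply(...)` helper above, and passes a seed so that the draw is repeatable. Both choices are assumptions of this sketch and not part of the original cell.

```python
import pandas as pd

# hypothetical toy data: 900 records of class 0 and 100 of class 1
toy = pd.DataFrame({
    'target': [0] * 900 + [1] * 100,
    'x': range(1000),
})

# draw 10% within each class; the seed makes the draw repeatable
sample = toy.groupby('target').sample(frac=0.1, random_state=123)

# class counts in the sample: the 9-to-1 ratio of the toy data is preserved
counts = sample['target'].value_counts().sort_index()
print(counts)
```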


## `train` and `test` datasets

As mentioned in the document, the `data` dataset is randomly split into `train` and `test` datasets. A non-stratified split is chosen, and the columns `id` and `target` are removed from the feature matrices of both datasets.

In [24]:
# split 80-20% (no stratification)
X_train, X_test, y_train, y_test = train_test_split(data.drop(['id', 'target'], axis=1), 
                                                    data['target'], 
                                                    test_size=0.2,
                                                    random_state=random_state
                                                   )

After the split, the following checks on the `train` and `test` datasets are performed:

In [25]:
# structural checks
print('Training dataset - dimensions:', X_train.shape)
print('Test dataset - dimensions:', X_test.shape)
print()
print('Random split check:', X_train.shape[0] + X_test.shape[0] == data.shape[0])
print()

# class imbalance: check
lev_target = ( pd.crosstab(index = data['target'], columns = 'count') / data.shape[0] ) * 100
lev_target_train = ( pd.crosstab(index = y_train, columns = 'count') / y_train.shape[0] ) * 100
lev_target_test = ( pd.crosstab(index = y_test, columns = 'count') / y_test.shape[0] ) * 100

print('target class imbalance data:')
print(lev_target)
print()
print('target class imbalance train:')
print(lev_target_train)
print()
print('target class imbalance test:')
print(lev_target_test)

Training dataset - dimensions: (47616, 195)
Test dataset - dimensions: (11905, 195)

Random split check: True

target class imbalance data:
col_0       count
target           
0       96.355908
1        3.644092

target class imbalance train:
col_0       count
target           
0       96.385669
1        3.614331

target class imbalance test:
col_0       count
target           
0       96.236875
1        3.763125
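
The class shares above differ slightly between `train` and `test` because the split is not stratified. For comparison only, and not as the approach used in this notebook, `train_test_split` accepts a `stratify` argument that preserves the class shares in both parts. A minimal sketch on hypothetical toy data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# hypothetical toy data: 1000 records, 10% of them in class 1
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

# stratified 80-20% split: class shares are preserved in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

print('share of class 1 in train:', y_tr.mean())
print('share of class 1 in test: ', y_te.mean())
```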


We then remove the original dataset `data` to free up some memory.

In [26]:
del data

# Modeling

## Normalized Gini coefficient

In the Porto Seguro Kaggle challenge, fitted models are evaluated on out-of-sample data with an ad-hoc performance measure, the normalized Gini coefficient, as discussed [here](https://www.kaggle.com/c/porto-seguro-safe-driver-prediction#evaluation). Code for computing the normalized Gini coefficient is provided in this [thread](https://www.kaggle.com/c/ClaimPredictionChallenge/discussion/703) for several programming languages. The Python implementation is presented below.

In [27]:
from sklearn.metrics import make_scorer

# Gini coefficient
def gini(actual, pred):
    
    # a structural check
    assert (len(actual) == len(pred))
    
    # stacking actual values, predictions and original positions as columns of one array
    # (the built-in float replaces np.float, which was removed from NumPy)
    arr = np.asarray(np.c_[actual, pred, np.arange(len(actual))], dtype=float)
    
    # sorting by predicted probabilities (descending order); ties are broken by the original position arr[:, 2]
    arr = arr[np.lexsort((arr[:, 2], -1 * arr[:, 1]))]

    # towards the Gini coefficient
    totalLosses = arr[:, 0].sum()
    giniSum = arr[:, 0].cumsum().sum() / totalLosses

    giniSum -= (len(actual) + 1) / 2.
    return giniSum / len(actual)

# normalized Gini coefficient
def gini_normalized_score(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

# score using the normalized Gini
score_gini = make_scorer(gini_normalized_score, greater_is_better=True, needs_threshold = True)
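
For a binary target and predictions without ties, the normalized Gini coefficient equals `2 * AUC - 1`. The sketch below re-implements the Gini computation in compact form and checks this identity on synthetic data. The helper `auc_from_ranks` (a rank-based AUC), the random labels and the random scores are assumptions introduced here for illustration.

```python
import numpy as np

def gini(actual, pred):
    actual = np.asarray(actual, dtype=float)
    pred = np.asarray(pred, dtype=float)
    # sort by descending prediction; ties are broken by the original position
    order = np.lexsort((np.arange(len(actual)), -pred))
    sorted_actual = actual[order]
    gini_sum = sorted_actual.cumsum().sum() / sorted_actual.sum()
    gini_sum -= (len(actual) + 1) / 2.0
    return gini_sum / len(actual)

def gini_normalized(actual, pred):
    return gini(actual, pred) / gini(actual, actual)

def auc_from_ranks(actual, pred):
    # AUC from the rank sum of the positive class (valid when pred has no ties)
    actual = np.asarray(actual)
    ranks = np.argsort(np.argsort(pred)) + 1   # 1-based ascending ranks
    n_pos = actual.sum()
    n_neg = len(actual) - n_pos
    return (ranks[actual == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)   # synthetic binary labels
p = rng.random(500)                # synthetic continuous scores

g = gini_normalized(y, p)
a = auc_from_ranks(y, p)
print('normalized Gini:', g)
print('2 * AUC - 1:    ', 2 * a - 1)
```

As a consequence, ranking models by AUC and ranking them by normalized Gini give the same order.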

## Original AdaBoost, LogitBoost, SAMME, SAMME.R, Gradient Boosting, XGBoost

In [28]:
from logitboost import LogitBoost
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingClassifier

In [29]:
# weak learner: tree stump 
tree = DecisionTreeClassifier(criterion='gini',                 #only Gini and information gain criteria are supported
                              random_state= random_state,       #DecisionTreeClassifier() is not fully deterministic 
                              max_depth=1)                      #stump
tree = tree.fit(X_train, y_train)


# Original AdaBoost
# (binary classification: with learning_rate=1, SAMME is equivalent to the original AdaBoost)
adab = AdaBoostClassifier(base_estimator=tree,
                         algorithm='SAMME',
                         learning_rate=1,
                         n_estimators=300,
                         random_state= random_state)
adab = adab.fit(X_train, y_train)


# LogitBoost
logb = LogitBoost(DecisionTreeRegressor(max_depth=3),
                  n_estimators=300,
                  learning_rate=1,
                  random_state= random_state)
logb = logb.fit(X_train, y_train)


# SAMME
sam = AdaBoostClassifier(base_estimator=tree, 
                         algorithm='SAMME',
                         learning_rate=0.1,
                         n_estimators=300,
                         random_state= random_state)
sam = sam.fit(X_train, y_train)


# SAMME.R
sam_R = AdaBoostClassifier(base_estimator=tree, 
                           algorithm='SAMME.R', 
                           learning_rate=0.1,
                           n_estimators=300,
                           random_state= random_state)
sam_R = sam_R.fit(X_train, y_train)


# Gradient Boosting
grab = GradientBoostingClassifier(random_state=random_state,
                                  n_estimators=300,
                                  learning_rate=1)
grab = grab.fit(X_train, y_train)


# XGBoost
xgb = XGBClassifier(random_state=random_state,
                    learning_rate=0.1,
                    max_depth=3,
                    n_estimators=300)
xgb = xgb.fit(X_train, y_train)
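
The equivalence between SAMME and the original AdaBoost noted in the cell above, for a binary target and `learning_rate=1`, can be read off the weight given to each weak learner. In SAMME, a learner with weighted error `err` receives the weight `learning_rate * (log((1 - err) / err) + log(K - 1))`, where `K` is the number of classes; for `K = 2` the last term vanishes. The sketch below evaluates this formula; the function names are hypothetical and the error value is chosen for illustration.

```python
import numpy as np

def samme_alpha(err, n_classes=2, learning_rate=1.0):
    """Weight of a weak learner with weighted error `err` under SAMME."""
    return learning_rate * (np.log((1.0 - err) / err) + np.log(n_classes - 1.0))

def adaboost_m1_alpha(err):
    """Weight of a weak learner under AdaBoost.M1."""
    return np.log((1.0 - err) / err)

err = 0.25

# two classes, learning_rate=1: both weights equal log(3)
print(samme_alpha(err, n_classes=2))
print(adaboost_m1_alpha(err))

# a smaller learning rate shrinks the contribution of each weak learner
print(samme_alpha(err, n_classes=2, learning_rate=0.1))
```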

Computing the AUC on the test dataset:

In [30]:
# original AdaBoost classifier performance
y_pred_proba_adab = adab.predict_proba(X_test)
fpr_adab, tpr_adab, _ = roc_curve(y_test, y_pred_proba_adab[:, 1])
roc_auc_adab = auc(fpr_adab, tpr_adab)

# LogitBoost classifier performance
y_pred_proba_logb = logb.predict_proba(X_test)
fpr_logb, tpr_logb, _ = roc_curve(y_test, y_pred_proba_logb[:, 1])
roc_auc_logb = auc(fpr_logb, tpr_logb)

# SAMME classifier performance
y_pred_proba_sam = sam.predict_proba(X_test)
fpr_sam, tpr_sam, _ = roc_curve(y_test, y_pred_proba_sam[:, 1])
roc_auc_sam = auc(fpr_sam, tpr_sam)

# SAMME.R classifier performance
y_pred_proba_sam_R = sam_R.predict_proba(X_test)
fpr_sam_R, tpr_sam_R, _ = roc_curve(y_test, y_pred_proba_sam_R[:, 1])
roc_auc_sam_R = auc(fpr_sam_R, tpr_sam_R)

# Gradient Boosting classifier performance
y_pred_proba_grab = grab.predict_proba(X_test)
fpr_grab, tpr_grab, _ = roc_curve(y_test, y_pred_proba_grab[:, 1])
roc_auc_grab = auc(fpr_grab, tpr_grab)

# XGBoost classifier performance
y_pred_proba_xgb = xgb.predict_proba(X_test)
fpr_xgb, tpr_xgb, _ = roc_curve(y_test, y_pred_proba_xgb[:, 1])
roc_auc_xgb = auc(fpr_xgb, tpr_xgb)


# AUC on test dataset
print('AUC Original AdaBoost:', roc_auc_score(y_test, y_pred_proba_adab[:, 1]).round(3))
print('AUC LogitBoost:', roc_auc_score(y_test, y_pred_proba_logb[:, 1]).round(3))
print('AUC SAMME:', roc_auc_score(y_test, y_pred_proba_sam[:, 1]).round(3))
print('AUC SAMME.R:', roc_auc_score(y_test, y_pred_proba_sam_R[:, 1]).round(3))
print('AUC Gradient Boosting:', roc_auc_score(y_test, y_pred_proba_grab[:, 1]).round(3))
print('AUC XGBoost:', roc_auc_score(y_test, y_pred_proba_xgb[:, 1]).round(3))


AUC Original AdaBoost: 0.616
AUC LogitBoost: 0.562
AUC SAMME: 0.608
AUC SAMME.R: 0.625
AUC Gradient Boosting: 0.535
AUC XGBoost: 0.611
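
Since the competition is scored with the normalized Gini coefficient, it can help to translate the AUC values above with the relation Gini = 2 * AUC - 1, which holds for a binary target. The sketch below applies this relation to the rounded AUC values printed above, so the resulting figures are approximations derived from those values and not separately computed scores.

```python
# AUC values as printed above (rounded to three decimals)
auc_test = {
    'Original AdaBoost': 0.616,
    'LogitBoost': 0.562,
    'SAMME': 0.608,
    'SAMME.R': 0.625,
    'Gradient Boosting': 0.535,
    'XGBoost': 0.611,
}

# normalized Gini implied by each AUC, via Gini = 2 * AUC - 1
gini_test = {name: 2 * value - 1 for name, value in auc_test.items()}

# models ranked from best to worst; the ranking is the same under both measures
ranking = sorted(auc_test, key=auc_test.get, reverse=True)

for name in ranking:
    print('{:<18} AUC = {:.3f}   Gini ~ {:.3f}'.format(name, auc_test[name], gini_test[name]))
```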
