# Credit Card Fraud Detection using Stacked Ensemble Technique

# <a id="1">Introduction</a>  

The dataset contains transactions made by credit cards in **September 2013** by European cardholders. It covers transactions that occurred over two days, with **492 frauds** out of **284,807 transactions**. The dataset is **highly unbalanced**: the **positive class (frauds)** accounts for **0.172%** of all transactions.  

It contains only numerical input variables which are the result of a **PCA transformation**.   

Due to confidentiality issues, the original features and further background information about the data are not provided.  

* Features **V1**, **V2**, ... **V28** are the **principal components** obtained with **PCA**;  
* The only features that have not been transformed with PCA are **Time** and **Amount**. **Time** contains the seconds elapsed between each transaction and the first transaction in the dataset. **Amount** is the transaction amount; this feature can be used, for instance, for example-dependent cost-sensitive learning.   
* Feature **Class** is the response variable and it takes value **1** in case of fraud and **0** otherwise.  



# <a id="2">Load packages</a>

In [1]:
import pandas as pd 
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 
import plotly.graph_objs as go
import plotly.figure_factory as ff
from plotly import tools
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)

from sklearn.model_selection import train_test_split, cross_val_score,RepeatedStratifiedKFold
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score, accuracy_score
from sklearn.ensemble import RandomForestClassifier

from sklearn.tree import DecisionTreeClassifier as DT
from sklearn.linear_model import LogisticRegression as LogReg
from sklearn.svm import LinearSVC as SVC
from sklearn.naive_bayes import GaussianNB as NB
from sklearn.neighbors import KNeighborsClassifier as Knn
from xgboost import XGBClassifier as XGB

pd.set_option('display.max_columns', 100)

# <a id="3">Read the data</a>

In [3]:
data_df = pd.read_csv("C:/Users/abhin/Desktop/NCI 2022/DMML1/datasetsand project semester/Final Submission DMML/Credit card UC Datasets/Fraud detcn/creditcard_fraud.csv")

# <a id="4">Check the data</a>

In [3]:
print("Credit Card Fraud Detection data -  rows:",data_df.shape[0]," columns:", data_df.shape[1])

Credit Card Fraud Detection data -  rows: 284807  columns: 31


## <a id="41">Glimpse the data</a>

We start by looking at the data features (first 5 rows).

In [4]:
data_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


Let's look at the data in more detail.

In [5]:
data_df.describe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
count,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0,284807.0
mean,94813.859575,3.91956e-15,5.688174e-16,-8.769071e-15,2.782312e-15,-1.552563e-15,2.010663e-15,-1.694249e-15,-1.927028e-16,-3.137024e-15,1.768627e-15,9.170318e-16,-1.810658e-15,1.693438e-15,1.479045e-15,3.482336e-15,1.392007e-15,-7.528491e-16,4.328772e-16,9.049732e-16,5.085503e-16,1.537294e-16,7.959909e-16,5.36759e-16,4.458112e-15,1.453003e-15,1.699104e-15,-3.660161e-16,-1.206049e-16,88.349619,0.001727
std,47488.145955,1.958696,1.651309,1.516255,1.415869,1.380247,1.332271,1.237094,1.194353,1.098632,1.08885,1.020713,0.9992014,0.9952742,0.9585956,0.915316,0.8762529,0.8493371,0.8381762,0.8140405,0.770925,0.734524,0.7257016,0.6244603,0.6056471,0.5212781,0.482227,0.4036325,0.3300833,250.120109,0.041527
min,0.0,-56.40751,-72.71573,-48.32559,-5.683171,-113.7433,-26.16051,-43.55724,-73.21672,-13.43407,-24.58826,-4.797473,-18.68371,-5.791881,-19.21433,-4.498945,-14.12985,-25.1628,-9.498746,-7.213527,-54.49772,-34.83038,-10.93314,-44.80774,-2.836627,-10.2954,-2.604551,-22.56568,-15.43008,0.0,0.0
25%,54201.5,-0.9203734,-0.5985499,-0.8903648,-0.8486401,-0.6915971,-0.7682956,-0.5540759,-0.2086297,-0.6430976,-0.5354257,-0.7624942,-0.4055715,-0.6485393,-0.425574,-0.5828843,-0.4680368,-0.4837483,-0.4988498,-0.4562989,-0.2117214,-0.2283949,-0.5423504,-0.1618463,-0.3545861,-0.3171451,-0.3269839,-0.07083953,-0.05295979,5.6,0.0
50%,84692.0,0.0181088,0.06548556,0.1798463,-0.01984653,-0.05433583,-0.2741871,0.04010308,0.02235804,-0.05142873,-0.09291738,-0.03275735,0.1400326,-0.01356806,0.05060132,0.04807155,0.06641332,-0.06567575,-0.003636312,0.003734823,-0.06248109,-0.02945017,0.006781943,-0.01119293,0.04097606,0.0165935,-0.05213911,0.001342146,0.01124383,22.0,0.0
75%,139320.5,1.315642,0.8037239,1.027196,0.7433413,0.6119264,0.3985649,0.5704361,0.3273459,0.597139,0.4539234,0.7395934,0.618238,0.662505,0.4931498,0.6488208,0.5232963,0.399675,0.5008067,0.4589494,0.1330408,0.1863772,0.5285536,0.1476421,0.4395266,0.3507156,0.2409522,0.09104512,0.07827995,77.165,0.0
max,172792.0,2.45493,22.05773,9.382558,16.87534,34.80167,73.30163,120.5895,20.00721,15.59499,23.74514,12.01891,7.848392,7.126883,10.52677,8.877742,17.31511,9.253526,5.041069,5.591971,39.4209,27.20284,10.50309,22.52841,4.584549,7.519589,3.517346,31.6122,33.84781,25691.16,1.0


Looking at the **Time** feature, we can confirm that the data contains **284,807** transactions spanning 2 consecutive days (**172,792** seconds).

## <a id="42">Check missing data</a>  

Let's check if there is any missing data.

In [6]:
total = data_df.isnull().sum().sort_values(ascending = False)
percent = (data_df.isnull().sum()/data_df.isnull().count()*100).sort_values(ascending = False)
pd.concat([total, percent], axis=1, keys=['Total', 'Percent']).transpose()

Unnamed: 0,Class,V14,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V15,Amount,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Time
Total,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Percent,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


There is no missing data in the entire dataset.

## <a id="43">Data unbalance</a>

Let's check the data unbalance with respect to the *target* value, i.e. **Class**.

In [7]:
temp = data_df["Class"].value_counts()
df = pd.DataFrame({'Class': temp.index,'values': temp.values})

trace = go.Bar(
    x = df['Class'],y = df['values'],
    name="Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)",
    marker=dict(color="Red"),
    text=df['values']
)
data = [trace]
layout = dict(title = 'Credit Card Fraud Class - data unbalance (Not fraud = 0, Fraud = 1)',
          xaxis = dict(title = 'Class', showticklabels=True), 
          yaxis = dict(title = 'Number of transactions'),
          hovermode = 'closest',width=600
         )
fig = dict(data=data, layout=layout)
iplot(fig, filename='class')

Only **492** (or **0.172%**) of the transactions are fraudulent. That means the data is highly unbalanced with respect to the target variable **Class**.
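
The class counts above also fix the accuracy of a trivial classifier that never predicts fraud, which is a useful reference point when reading accuracy figures on the raw data. The short calculation below is only an arithmetic sketch using the counts reported in this notebook; the variable names are illustrative.

```python
# Counts taken from the class distribution reported in this notebook.
n_total = 284_807
n_fraud = 492
n_legit = n_total - n_fraud

fraud_share = n_fraud / n_total          # fraction of fraudulent transactions
baseline_accuracy = n_legit / n_total    # accuracy of always predicting class 0

print(f"Fraud share: {fraud_share:.4%}")
print(f"Accuracy of an 'always not fraud' classifier: {baseline_accuracy:.4%}")
```

On the unbalanced data, a model has to beat this baseline before its accuracy says anything about fraud detection.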

# <a id="5">EDA and transformations</a>

## Transactions in time

In [8]:
class_0 = data_df.loc[data_df['Class'] == 0]["Time"]
class_1 = data_df.loc[data_df['Class'] == 1]["Time"]

hist_data = [class_0, class_1]
group_labels = ['Not Fraud', 'Fraud']

fig = ff.create_distplot(hist_data, group_labels, show_hist=False, show_rug=False)
fig['layout'].update(title='Credit Card Transactions Time Density Plot', xaxis=dict(title='Time [s]'))
iplot(fig, filename='dist_only')

Fraudulent transactions are distributed more evenly over time than legitimate ones: they occur at a roughly constant rate, including during the night (European time), when the volume of legitimate transactions is low.

Let's look in more detail at the time distribution of transactions in both classes, as well as at per-hour aggregates of transaction count and amount. We assume (based on the observed time distribution of transactions) that the time unit is seconds.

In [9]:
# Bucket each transaction into an hour index (0 = first hour of the recording)
data_df['Hour'] = np.floor(data_df['Time'] / 3600)

tmp = data_df.groupby(['Hour', 'Class'])['Amount'].aggregate(['min', 'max', 'count', 'sum', 'mean', 'median', 'var']).reset_index()
df = pd.DataFrame(tmp)
df.columns = ['Hour', 'Class', 'Min', 'Max', 'Transactions', 'Sum', 'Mean', 'Median', 'Var']
df.head()

Unnamed: 0,Hour,Class,Min,Max,Transactions,Sum,Mean,Median,Var
0,0.0,0,0.0,7712.43,3961,256572.87,64.774772,12.99,45615.821201
1,0.0,1,0.0,529.0,2,529.0,264.5,264.5,139920.5
2,1.0,0,0.0,1769.69,2215,145806.76,65.82698,22.82,20053.61577
3,1.0,1,59.0,239.93,2,298.93,149.465,149.465,16367.83245
4,2.0,0,0.0,4002.88,1555,106989.39,68.803466,17.9,45355.430437


## Transactions amount

In [10]:
tmp = data_df[['Amount','Class']].copy()
class_0 = tmp.loc[tmp['Class'] == 0]['Amount']
class_1 = tmp.loc[tmp['Class'] == 1]['Amount']
class_0.describe()

count    284315.000000
mean         88.291022
std         250.105092
min           0.000000
25%           5.650000
50%          22.000000
75%          77.050000
max       25691.160000
Name: Amount, dtype: float64

In [11]:
class_1.describe()

count     492.000000
mean      122.211321
std       256.683288
min         0.000000
25%         1.000000
50%         9.250000
75%       105.890000
max      2125.870000
Name: Amount, dtype: float64

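Legitimate transactions have a smaller mean amount (88.29 vs. 122.21), a larger first quartile (5.65 vs. 1.00) and median (22.00 vs. 9.25), a smaller third quartile (77.05 vs. 105.89), and a far larger maximum (25,691.16 vs. 2,125.87). Fraudulent amounts are therefore more spread out in the middle of the distribution but lack the extreme outliers seen among legitimate transactions.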
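
The two `describe()` outputs can also be placed side by side with a single `groupby`, which makes the per-class comparison easier to read. The sketch below uses a tiny made-up frame (`toy`) rather than `data_df`, so the numbers it prints are purely illustrative.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not taken from the dataset.
toy = pd.DataFrame({
    "Amount": [5.0, 20.0, 80.0, 150.0, 1.0, 9.0, 110.0, 300.0],
    "Class":  [0, 0, 0, 0, 1, 1, 1, 1],
})

# One row per class, one column per summary statistic.
summary = toy.groupby("Class")["Amount"].describe()
print(summary[["mean", "25%", "50%", "75%", "max"]])
```

Applied to the notebook's data, the same call would be `data_df.groupby("Class")["Amount"].describe()`.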

# <a id="6">Model Building</a>  



In [12]:
target = 'Class'
predictors = ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',\
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',\
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',\
       'Amount']

### Resampling dataset using SMOTE Oversampling technique

In [13]:
from collections import Counter
from imblearn.over_sampling import SMOTE

X= data_df[predictors]
y= data_df[target]

counter = Counter(y)
print(counter)

# Oversample the minority class with SMOTE until both classes have the same count
oversample = SMOTE()
X_smote,y_smote = oversample.fit_resample(X,y)
print(Counter(y_smote))

Counter({0: 284315, 1: 492})
Counter({0: 284315, 1: 284315})
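
SMOTE balances the classes by creating synthetic minority rows that lie on the line segment between a minority point and one of its nearest minority neighbours, which is why resampled rows can carry fractional **Time** values. The function below is a simplified NumPy sketch of that interpolation idea, not imblearn's implementation; `smote_like_samples` and the toy `minority` array are hypothetical names introduced only for illustration.

```python
import numpy as np

def smote_like_samples(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Simplified illustration only; imblearn's SMOTE has more options and checks.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)

    # Pairwise Euclidean distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base_idx = rng.integers(0, n, size=n_new)    # randomly chosen minority points
    nb_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]  # one neighbour each
    gap = rng.random((n_new, 1))                 # position along the segment, in [0, 1)

    return X_minority[base_idx] + gap * (X_minority[nb_idx] - X_minority[base_idx])

# Hypothetical 2-D minority points (think Time and Amount), not real data.
minority = np.array([[100.0, 1.0], [200.0, 3.0], [150.0, 2.0], [400.0, 9.0]])
synthetic = smote_like_samples(minority, n_new=5)
print(synthetic)
```

Because every synthetic row is a convex combination of two real minority rows, it always stays inside the range spanned by the original minority data.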


### Random sampling to reduce computational load for analysis

In [14]:
# Work on a copy so that adding the target column does not modify X_smote
tmp_df = X_smote.copy()
tmp_df['Class'] = y_smote.values

In [15]:
tmp_df.shape

(568630, 31)

In [16]:
sample_df = tmp_df.sample(frac = 0.05, random_state=21) # without replacement
sample_df.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
404558,149605.669283,-2.225886,1.812346,-4.409791,3.300208,-0.027476,-1.709894,-2.422206,-0.19787,-0.150168,-3.246689,3.444231,-5.607895,-1.123935,-9.412487,-0.809212,-4.751887,-6.061418,-2.219664,1.366183,0.033088,0.079132,-0.479679,-0.332377,-0.045777,-0.148144,-0.788055,-0.946941,0.409877,1.467282,1
472650,149436.254618,0.087318,0.325344,-2.027639,3.216261,2.848675,-1.378014,-0.291497,0.149102,-0.812241,-1.255868,1.837064,-0.311683,-0.20756,-4.868545,-0.523102,1.842955,3.346019,2.668954,-0.810928,0.343168,-0.008306,-0.243897,0.072552,-0.397797,0.13411,-0.1963,0.025116,0.089919,8.459485,1
441536,101199.65983,-25.289514,18.447745,-24.463592,10.6766,-16.550888,3.54404,-35.947019,-26.662056,-11.043848,-22.239921,3.622014,-10.631067,-2.327078,-2.653829,-4.090996,-6.863382,-14.394283,-6.302027,-1.251208,7.294692,-15.667118,5.724197,3.288642,0.188089,-0.944938,-0.282328,-4.855651,-0.415241,1.907055,1
279614,168988.0,1.985813,-0.327462,-0.264264,0.486822,-0.653677,-0.510212,-0.523177,-0.120598,1.306577,-0.198753,-0.924415,1.045144,1.283097,-0.434115,0.359166,0.061927,-0.615985,0.10349,-0.186204,-0.143791,0.197497,0.885295,0.087248,-0.053729,-0.038774,-0.209898,0.049034,-0.035281,9.99,0
531919,152077.380612,-4.433723,3.519447,-6.631418,6.900381,-0.108235,-1.99529,-2.704084,-0.320412,-1.833275,-3.10506,4.500732,-9.11004,-1.451848,-14.014982,0.705371,-4.137342,-6.314143,-1.372595,0.200174,-0.236931,0.501529,0.505426,0.061737,-0.62146,0.341376,0.310898,-2.65855,0.494739,1.0,1


In [17]:
sample_df.shape

(28432, 31)

### Split data into train and test sets


In [19]:
X_sample = sample_df[predictors]
y_sample = sample_df[target]

In [20]:
X_train, X_test, y_train, y_test = train_test_split(X_sample, y_sample, test_size=0.3, \
                                                    random_state=42)
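
The split above is a plain random split. One possible variant, sketched here on synthetic stand-in data (`X_demo`, `y_demo` are not the notebook's variables), passes `stratify` so that both partitions keep the same class ratio. A related design choice, offered only as an assumption about what one might prefer, is to split first and oversample only the training portion, so that no synthetic row is interpolated from a test row.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10% positives (not the notebook's data).
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 100 + [0] * 900)

# stratify=y_demo keeps the positive rate (nearly) identical in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```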

## <a id="61">RandomForestClassifier</a>





In [23]:
clf = RandomForestClassifier()

In [24]:
clf.fit(X_train, y_train)

RandomForestClassifier()

In [25]:
preds = clf.predict(X_test)

In [26]:
roc_auc_score(y_test, preds)

0.9938064254226725

In [27]:
accuracy_score(y_test, preds)

0.9937866354044549

The **ROC-AUC** score obtained with **RandomForestClassifier** is **0.994** and the accuracy is **99.38%**.
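
The ROC-AUC values in this notebook are computed from hard class labels (`clf.predict`). For binary labels the ROC curve has a single interior point, so the resulting area equals the balanced accuracy, which is why ROC-AUC and accuracy are nearly identical on the balanced sample. The sketch below, run on synthetic data with illustrative variable names, contrasts that with the threshold-free ROC-AUC obtained from `predict_proba` scores.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (not the credit card dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

hard_labels = model.predict(X_te)                # 0/1 predictions
fraud_scores = model.predict_proba(X_te)[:, 1]   # probability of the positive class

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_scores = roc_auc_score(y_te, fraud_scores)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("ROC-AUC from hard labels:", auc_from_labels)
print("Balanced accuracy:       ", bal_acc)
print("ROC-AUC from scores:     ", auc_from_scores)
```

In the notebook this would amount to passing `clf.predict_proba(X_test)[:, 1]` to `roc_auc_score` instead of `preds`.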





## <a id="62">Decision Tree Classifier</a>


In [28]:
clf = DT()

In [29]:
clf.fit(X_train, y_train)

DecisionTreeClassifier()

In [30]:
preds = clf.predict(X_test)

In [31]:
roc_auc_score(y_test, preds)

0.9853459930940361

In [32]:
accuracy_score(y_test, preds)

0.9853458382180539

The ROC-AUC score obtained with DT is **0.985** and the accuracy is **98.53%**.

## <a id="63">Logistic Regression</a>



In [33]:
clf = LogReg()

In [34]:
clf.fit(X_train, y_train)

LogisticRegression()

In [35]:
preds = clf.predict(X_test)

In [36]:
roc_auc_score(y_test, preds)

0.9591366357814423

In [37]:
accuracy_score(y_test, preds)

0.9590855803048066

The ROC-AUC score obtained with LogReg is **0.959** and the accuracy is **95.91%**.

## <a id="63">XGB</a>

In [38]:
clf = XGB()
clf.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', callbacks=None,
              colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
              early_stopping_rounds=None, enable_categorical=False,
              eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
              max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
              missing=nan, monotone_constraints='()', n_estimators=100,
              n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0,
              reg_alpha=0, reg_lambda=1, ...)

In [39]:
preds= clf.predict(X_test)

In [40]:
roc_auc_score(y_test, preds)

0.9973042683535597

In [41]:
accuracy_score(y_test, preds)

0.9973036342321219

The ROC-AUC score obtained with XGB on the test set is **0.997** and the accuracy is **99.73%**.

## <a id="64">KNN, NB and SVC</a>

In [42]:
clf_k = Knn()
clf_n = NB()
clf_s = SVC()

clf_k.fit(X_train, y_train)
clf_n.fit(X_train, y_train)
clf_s.fit(X_train, y_train)


Liblinear failed to converge, increase the number of iterations.



LinearSVC()

In [43]:
%%time
preds_k= clf_k.predict(X_test)
print(roc_auc_score(y_test, preds_k))
print(accuracy_score(y_test, preds_k))

0.8428275857670823
0.8430246189917937
Wall time: 3.35 s


In [44]:
%%time
preds_n= clf_n.predict(X_test)
print(roc_auc_score(y_test, preds_n))
print(accuracy_score(y_test, preds_n))

0.8722820027558762
0.8715123094958969
Wall time: 12.6 ms


In [45]:
%%time
preds_s= clf_s.predict(X_test)
print(roc_auc_score(y_test, preds_s))
print(accuracy_score(y_test, preds_s))

0.661052620699331
0.6630715123094959
Wall time: 7.07 ms
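
The Liblinear convergence warning and the low LinearSVC score may be related to feature scale: **Time** and **Amount** are orders of magnitude larger than the PCA components, and linear SVMs are sensitive to that. The following is a minimal sketch, on synthetic data with one deliberately inflated feature, of wrapping `LinearSVC` in a scaling pipeline; the data, the `max_iter` value and the variable names are assumptions, and no claim is made about the score this would give on the notebook's sample.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in data (not the credit card dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, class_sep=2.0, random_state=0,
)
X_demo[:, 0] *= 1e5   # mimic one feature on a much larger scale

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# Standardise every feature before the linear SVM sees it.
scaled_svc = make_pipeline(StandardScaler(), LinearSVC(max_iter=5000, random_state=0))
scaled_svc.fit(X_tr, y_tr)
svc_accuracy = scaled_svc.score(X_te, y_te)
print("Accuracy with scaling:", svc_accuracy)
```

The same pipeline object can be passed anywhere a plain estimator is expected, including as a level-0 model in a stacking ensemble.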


### Training and validation using cross-validation


In [None]:
# %%time
# # Too computationally intensive to run here
# my_models = {'DT':DT(), 'LogReg':LogReg(), 'SVC':SVC(), 'NB':NB(), 'RF': RandomForestClassifier()\
#                 , 'Xgb': XGB()}

# for key, model in my_models.items():
#     # K-fold CV score (cross_val_score defaults to accuracy for classifiers)
#     kfs = KFold(n_splits= 5)
#     kf_scores = cross_val_score(model, X_sample, y_sample, cv=kfs) # cv is k-folds
#     # print(kf_scores)
#     print( "Mean KFold accuracy %s: "%key, kf_scores.mean())

## <a id="64">Stacking Ensemble Technique (DT, LR, SVM, NB)</a>

In [23]:
from numpy import mean
from numpy import std
from sklearn.ensemble import StackingClassifier

# get a stacking ensemble of models lv1 :LogReg
def get_stackingLR(my_models_lv0):
    # define the base models

    level0 = list(my_models_lv0.items())

    # define meta learner model
    level1 = LogReg()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=3)
    return model
 
# evaluate a given model using repeated stratified k-fold cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=21)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores
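
`evaluate_model` reports accuracy only. If several metrics are of interest, scikit-learn's `cross_validate` accepts a list of scorers in one pass. The sketch below runs a small two-model stack on synthetic data; the dataset, the fold settings and the choice of metrics are illustrative assumptions, not the configuration used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the credit card dataset).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=21)

# Level-0 models feed out-of-fold predictions to the level-1 logistic regression.
stack = StackingClassifier(
    estimators=[("DT", DecisionTreeClassifier(random_state=21)), ("NB", GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=3,
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=21)
results = cross_validate(
    stack, X_demo, y_demo, cv=cv, scoring=["accuracy", "roc_auc", "f1"]
)

for metric in ("accuracy", "roc_auc", "f1"):
    scores = results[f"test_{metric}"]
    print(f"{metric}: {scores.mean():.4f} ({scores.std():.3f})")
```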

### Iteration 1 for Stacked Model: lv0: (DT, SVC, NB) and lv1: LogReg

In [24]:
# input_models = {'DT':DT(), 'LogReg':LogReg(), 'SVC':SVC(), 'NB':NB()}
input_models = {'DT':DT(), 'SVC':SVC(), 'NB':NB()} 
stacked_LR1 = get_stackingLR(input_models)
stacked_LR1

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('SVC', LinearSVC()), ('NB', GaussianNB())],
                   final_estimator=LogisticRegression())

In [25]:
%%time
# Model Evaluation for Iter1

scores = evaluate_model(stacked_LR1, X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_LR1, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('SVC', LinearSVC()), ('NB', GaussianNB())],
                   final_estimator=LogisticRegression()) Accuracy:  0.9550 (0.079)
Wall time: 2min 19s


### Iteration 2 for Stacked Model: lv0: (DT, LogReg) and lv1: LogReg

In [26]:
# Model selection and evaluation
input_models = {'DT':DT(), 'LogReg':LogReg()}
stacked_LR2 = get_stackingLR(input_models)
stacked_LR2

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression())],
                   final_estimator=LogisticRegression())

In [27]:
%%time
scores = evaluate_model(stacked_LR2,  X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_LR2, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression())],
                   final_estimator=LogisticRegression()) Accuracy:  0.9854 (0.001)
Wall time: 19.9 s


### Iteration 3 for Stacked Model: lv0: (DT, LogReg,NB) and lv1: LogReg

In [28]:
# Model selection and evaluation
input_models = {'DT':DT(), 'LogReg':LogReg(), 'NB':NB()}
stacked_LR3 = get_stackingLR(input_models)
stacked_LR3

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=LogisticRegression())

In [29]:
%%time
scores = evaluate_model(stacked_LR3, X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_LR3, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=LogisticRegression()) Accuracy:  0.9856 (0.001)
Wall time: 19.4 s


### Iteration 4 for Stacked Model: lv0: (DT, NB) and lv1: LogReg

In [30]:
# Model selection and evaluation
input_models = {'DT':DT(), 'NB':NB()}
stacked_LR4 = get_stackingLR(input_models)
stacked_LR4

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('NB', GaussianNB())],
                   final_estimator=LogisticRegression())

In [31]:
%%time
scores = evaluate_model(stacked_LR4,  X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_LR4, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('NB', GaussianNB())],
                   final_estimator=LogisticRegression()) Accuracy:  0.9851 (0.002)
Wall time: 18.1 s


### Iteration 5 for Stacked Model: lv0: (DT, LogReg,NB) and lv1: LogReg

In [None]:
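# Model selection and evaluation
# NOTE: model_DT and model_LR are not defined anywhere in this notebook;
# they must be created (as DT and LogReg instances) before this cell can run.
input_models = {'DT':model_DT, 'LogReg':model_LR, 'NB':NB()}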
stacked_LR5 = get_stackingLR(input_models)
stacked_LR5

In [None]:
%%time
scores = evaluate_model(stacked_LR5,  X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_LR5, mean(scores), std(scores)))

### Iteration 6 for Stacked Model: lv0: (SVC, LogReg,NB) and lv1: DT

In [42]:
from numpy import mean
from numpy import std
from sklearn.ensemble import StackingClassifier

# get a stacking ensemble of models lv1 :DT
def get_stackingDT(my_models_lv0):
    # define the base models

    level0 = list(my_models_lv0.items())

    # define meta learner model
    level1 = DT()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=3)
    return model
 
# evaluate a given model using repeated stratified k-fold cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=21)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

In [43]:
# Model selection and evaluation
input_models = {'SVC':SVC(), 'LogReg':LogReg(), 'NB':NB()}
stacked_DT1 = get_stackingDT(input_models)
stacked_DT1

StackingClassifier(cv=3,
                   estimators=[('SVC', LinearSVC()),
                               ('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=DecisionTreeClassifier())

In [44]:
%%time
scores = evaluate_model(stacked_DT1,  X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_DT1, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('SVC', LinearSVC()),
                               ('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=DecisionTreeClassifier()) Accuracy:  0.9551 (0.011)
Wall time: 39.9 s


### Iteration 7 for Stacked Model: lv0: (LogReg,NB) and lv1: DT

In [35]:
# Model selection and evaluation
input_models = {'LogReg':LogReg(), 'NB':NB()}
stacked_DT2 = get_stackingDT(input_models)
stacked_DT2

StackingClassifier(cv=3,
                   estimators=[('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=DecisionTreeClassifier())

In [36]:
%%time
scores = evaluate_model(stacked_DT2, X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_DT2, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('LogReg', LogisticRegression()),
                               ('NB', GaussianNB())],
                   final_estimator=DecisionTreeClassifier()) Accuracy:  0.9606 (0.003)
Wall time: 6.5 s


### Iteration 8 for Stacked Model: lv0: (DT, LogReg) and lv1: SVC

In [37]:
from numpy import mean
from numpy import std
from sklearn.ensemble import StackingClassifier

# get a stacking ensemble of models lv1 :SVC
def get_stackingSVC(my_models_lv0):
    # define the base models

    level0 = list(my_models_lv0.items())

    # define meta learner model
    level1 = SVC()
    # define the stacking ensemble
    model = StackingClassifier(estimators=level0, final_estimator=level1, cv=3)
    return model
 
# evaluate a given model using repeated stratified k-fold cross-validation
def evaluate_model(model, X, y):
    cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=2, random_state=21)
    scores = cross_val_score(model, X, y, scoring='accuracy', cv=cv, n_jobs=-1, error_score='raise')
    return scores

In [38]:
# Model selection and evaluation
input_models = {'DT':DT(), 'LogReg':LogReg()}
stacked_SVC1 = get_stackingSVC(input_models)
stacked_SVC1

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression())],
                   final_estimator=LinearSVC())

In [39]:
%%time
scores = evaluate_model(stacked_SVC1, X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_SVC1, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('LogReg', LogisticRegression())],
                   final_estimator=LinearSVC()) Accuracy:  0.9850 (0.002)
Wall time: 22.8 s


### Iteration 9 for Stacked Model: lv0: (DT, NB) and lv1: SVC

In [40]:
# Model selection and evaluation
input_models = {'DT':DT(), 'NB':NB()}
stacked_SVC2 = get_stackingSVC(input_models)
stacked_SVC2

StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('NB', GaussianNB())],
                   final_estimator=LinearSVC())

In [41]:
%%time
scores = evaluate_model(stacked_SVC2,  X_sample, y_sample)

print('\n%s Accuracy:  %.4f (%.3f)' % (stacked_SVC2, mean(scores), std(scores)))


StackingClassifier(cv=3,
                   estimators=[('DT', DecisionTreeClassifier()),
                               ('NB', GaussianNB())],
                   final_estimator=LinearSVC()) Accuracy:  0.9856 (0.002)
Wall time: 17 s
