In this kernel, I demonstrate the performance of a Gradient Boosting Classification Tree (GBCT) built with Scikit-Learn. In addition, I use the Synthetic Minority Over-sampling Technique (SMOTE) to oversample the minority class, i.e., fraudulent transactions, to counter the heavy class imbalance of this data set.

Credit: The general pipeline is built upon [the work](https://www.kaggle.com/joparga3/d/dalpozz/creditcardfraud/in-depth-skewed-data-classif-93-recall-acc-now) of `joparga3`.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

A) DATA PREPROCESSING
===
Loading
---

In [None]:
df = pd.read_csv("../input/creditcard.csv")
df.head()

Normalization
---

In [None]:
from sklearn.preprocessing import StandardScaler
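df_scaled = df.copy()  # Real copy, so adding columns does not modify df

# Standardize 'Amount' to zero mean and unit variance, then drop the raw column
df_scaled['normAmount'] = StandardScaler().fit_transform(df_scaled['Amount'].values.reshape(-1, 1)).ravel()
df_scaled = df_scaled.drop(['Amount'], axis=1)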
df_scaled.head()
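
As a rough illustration of what the scaling step computes, standardization subtracts the column mean and divides by the column standard deviation. The sketch below reproduces that arithmetic with NumPy on a made-up `amounts` array (hypothetical values, not taken from the data set). `StandardScaler` uses the population standard deviation (`ddof=0`), which is also NumPy's default.

```python
import numpy as np

# Hypothetical transaction amounts, for illustration only
amounts = np.array([10.0, 20.0, 30.0, 40.0])

# z-score: subtract the mean, divide by the population standard deviation
norm_amounts = (amounts - amounts.mean()) / amounts.std()

print(norm_amounts.mean(), norm_amounts.std())
```

After this transformation the column has mean 0 and standard deviation 1 (up to floating-point error), which is the property the `normAmount` column has.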

In [None]:
# Number of data points in the minority class && Indices Picking
number_records_fraud = len(df_scaled[df_scaled.Class == 1]) #492 Fraud Cases
fraud_indices = np.array(df_scaled[df_scaled.Class == 1].index)
print("Number of Fraud Cases: ", number_records_fraud)


# Number of data points in the majority class && Indices Picking
number_records_normal = len(df_scaled[df_scaled.Class != 1])
normal_indices = df_scaled[df_scaled.Class == 0].index
print("Number of Normal Cases: ", number_records_normal)

# Get fraud transactions by filtering
# (.loc is used because .ix has been removed from pandas, and the indices are labels)
df_fraud = df_scaled.loc[fraud_indices]
X_fraud = df_fraud.loc[:, df_fraud.columns != 'Class']
y_fraud = df_fraud.loc[:, df_fraud.columns == 'Class']

# Get normal transactions by filtering
df_normal = df_scaled.loc[normal_indices]
X_normal = df_normal.loc[:, df_normal.columns != 'Class']
y_normal = df_normal.loc[:, df_normal.columns == 'Class']

# Make X, y for classification
X = df_scaled.loc[:, df_scaled.columns != 'Class']
yy = df_scaled.loc[:, df_scaled.columns == 'Class']
y = np.asarray(yy['Class'])


Resampling: oversampling using SMOTE
---
In order to balance the number of samples in our two classes, we will oversample the fraud transaction class by artificially synthesizing fraud transactions using [SMOTE](http://contrib.scikit-learn.org/imbalanced-learn/auto_examples/over-sampling/plot_smote.html#sphx-glr-auto-examples-over-sampling-plot-smote-py).
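
To give a feel for how the synthetic samples arise, here is a minimal sketch of the core idea behind regular SMOTE: pick a minority sample, pick one of its nearest minority neighbours, and place a new point at a random position on the segment between them. The function `smote_like_point` and the toy array `X_minority_toy` are hypothetical names used only for illustration; this is not the imbalanced-learn implementation, and it assumes the minority samples contain no duplicate rows.

```python
import numpy as np

def smote_like_point(X_minority, k, rng):
    """Synthesize one point between a minority sample and one of its k nearest neighbours."""
    # Pick a random minority sample as the base point
    base = X_minority[rng.integers(len(X_minority))]
    # Euclidean distance from the base to every minority sample
    dists = np.linalg.norm(X_minority - base, axis=1)
    # The closest entry is the base itself (distance 0), so skip it
    neighbour_ids = np.argsort(dists)[1:k + 1]
    neighbour = X_minority[rng.choice(neighbour_ids)]
    # Move a random fraction of the way from the base toward the neighbour
    gap = rng.random()
    return base + gap * (neighbour - base)

rng = np.random.default_rng(0)
# Toy minority class: five points in the unit square
X_minority_toy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_point(X_minority_toy, k=2, rng=rng)
print(synthetic)
```

Because every synthetic point lies on a segment between two real minority samples, it stays inside the region already occupied by the minority class.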

In [None]:
from imblearn.over_sampling import SMOTE
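# Apply regular SMOTE
# Note: recent imbalanced-learn releases no longer accept the `kind` argument
# (plain SMOTE() is the regular variant) and renamed fit_sample to fit_resample.
sm = SMOTE()
X_res, y_res = sm.fit_resample(X, y)

print("Resampled Dataset has shape: ", X_res.shape)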
print("Number of Fraud Cases (Real && Synthetic): ", np.sum(y_res))

Training Classifier
===
We apply a train/test split to both the resampled (balanced) data set and the raw (highly skewed) data set. The classifier is trained on the training portion of the resampled data set and then tested against the raw data set.
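
One common variation, which is not what this kernel does, is to hold out the test rows first and oversample only the training portion, so that no held-out row can also appear in the training data. The sketch below illustrates that ordering on toy labels with NumPy only. The helper `stratified_holdout` and the toy array `y_toy` are hypothetical, and plain random duplication stands in for SMOTE to keep the example self-contained.

```python
import numpy as np

def stratified_holdout(y, test_fraction, rng):
    """Return (train_idx, test_idx), holding out the same fraction of every class."""
    test_parts = []
    for label in np.unique(y):
        members = rng.permutation(np.flatnonzero(y == label))
        n_test = max(1, int(round(test_fraction * len(members))))
        test_parts.append(members[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
# Toy labels: 95 normal (0) and 5 fraud (1) transactions
y_toy = np.array([0] * 95 + [1] * 5)

# 1) Hold out the test rows before any resampling
train_idx, test_idx = stratified_holdout(y_toy, test_fraction=0.25, rng=rng)

# 2) Oversample the minority class inside the training portion only
#    (random duplication here, as a stand-in for SMOTE)
minority = train_idx[y_toy[train_idx] == 1]
majority = train_idx[y_toy[train_idx] == 0]
extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
train_balanced_idx = np.concatenate([train_idx, extra])

print("Training class balance:", y_toy[train_balanced_idx].mean())
print("Rows shared with test set:", np.intersect1d(train_balanced_idx, test_idx).size)
```

With this ordering the training set is balanced while the test set keeps the original skew and shares no rows with the training data.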

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

print("Number transactions train dataset: ", len(X_train))
print("Number transactions test dataset: ", len(X_test))
print("Total number of transactions: ", len(X_train)+len(X_test))



X_train_res, X_test_res, y_train_res, y_test_res = train_test_split(X_res, y_res)

print("")
print("Number transactions train dataset: ", len(X_train_res))
print("Number transactions test dataset: ", len(X_test_res))
print("Total number of transactions: ", len(X_train_res)+len(X_test_res))

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
est = GradientBoostingClassifier(n_estimators=200, max_depth=3, learning_rate=1,
                                random_state=0, verbose = 1)
est.fit(X_train_res, y_train_res)

In [None]:
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    # Normalize first so the image and the cell labels show the same values
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    # Cells darker than half the maximum get white text for readability
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
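
To make the `normalize=True` option concrete, the sketch below applies the same row normalization to a made-up confusion matrix (`cm_toy`, hypothetical counts, not results from this data set). Rows are true labels and columns are predicted labels, matching the axis labels in the plot, so each normalized row sums to 1 and the diagonal holds the per-class recall.

```python
import numpy as np

# Hypothetical counts: rows = true label (normal, fraud), columns = predicted label
cm_toy = np.array([[90, 10],
                   [ 2,  8]])

# Divide every row by its total, as plot_confusion_matrix does when normalize=True
cm_norm = cm_toy.astype(float) / cm_toy.sum(axis=1)[:, np.newaxis]

# Recall for the fraud class: correctly flagged frauds / all true frauds
recall_fraud = cm_norm[1, 1]
print(cm_norm)
print("Fraud recall:", recall_fraud)
```

In this toy case 8 of the 10 true fraud cases are caught, so the bottom-right cell of the normalized matrix is 0.8.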

On Original Data Space
---