Introduction
====

In this notebook I will continue the analysis from my [previous notebook](https://www.kaggle.com/dstuerzer/d/dalpozz/creditcardfraud/optimized-logistic-regression), in which I discussed the peculiarities of highly imbalanced datasets and tried to optimize a Logistic Regression to predict frauds. Even though the results were quite satisfactory, better models can still be found. Here I will discuss the application of the very popular XGBoost method, which will indeed significantly improve the prediction accuracy.

Due to the runtime limitations for Kaggle notebooks, this is only an abridged version, where I have left out the grid search and parameter optimization, as well as the high-resolution plots. However, they are an essential part of this analysis, so you might want to have a look at the full version on [GitHub](https://github.com/dstuerzer/Kaggle/blob/master/credit_card_fraud/XGBoost.ipynb). The runtime of the latter notebook was several days.

First I load packages and a function to visualize the confusion matrix (you can skip this).

In [None]:
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve, roc_auc_score, f1_score, precision_recall_curve
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
import itertools

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

def show_data(cm, print_res = 0):
    tp = cm[1,1]
    fn = cm[1,0]
    fp = cm[0,1]
    tn = cm[0,0]
    if print_res == 1:
        print('Precision =     {:.3f}'.format(tp/(tp+fp)))
        print('Recall (TPR) =  {:.3f}'.format(tp/(tp+fn)))
        print('Fallout (FPR) = {:.3e}'.format(fp/(fp+tn)))
    return tp/(tp+fp), tp/(tp+fn), fp/(fp+tn)

We read the data and split it into the training data (X_, y_), which will be used to optimize the parameters and train our final model, and the test set (X_test, y_test), on which the final model will then be validated.

In [None]:
df = pd.read_csv("../input/creditcard.csv")
y = np.array(df.Class.tolist())     #classes: 1..fraud, 0..no fraud
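df = df.drop('Class', axis=1)   # pass axis by keyword; the positional form is no longer accepted by recent pandas
X = df.values   # features (as_matrix() was removed from pandas)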

X_, X_test, y_, y_test = train_test_split(X, y, test_size = 0.2)

From the extensive discussion of the dataset in my [previous notebook](https://www.kaggle.com/dstuerzer/d/dalpozz/creditcardfraud/optimized-logistic-regression) we know that the classes "Fraud" and "Non-Fraud" are highly imbalanced, and that Precision, Recall and Fallout are meaningful measures for the accuracy of a model.
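Because frauds are so rare, a purely random split can leave somewhat different fraud fractions in the two parts. Here is a minimal sketch, on synthetic labels rather than the credit-card data, of how `train_test_split` can keep the class ratio fixed via its `stratify` argument. The toy arrays, the 0.5% fraud rate and the `random_state` value are assumptions for illustration only; the split used in this notebook is the plain one shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 10,000 samples, 0.5% positives (assumed numbers).
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(10000, 3))
y_toy = np.zeros(10000, dtype=int)
y_toy[:50] = 1

# stratify=y_toy keeps the fraud fraction (nearly) identical in both parts;
# random_state makes the split reproducible across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

print(y_tr.mean(), y_te.mean())
```

Fixing `random_state` also removes one source of run-to-run variation when comparing parameters.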


Strategies
===

Convenient ways to characterize the performance of a classifier here are Fallout and Recall (and the corresponding ROC-curve), and possibly the Precision (check out [Wikipedia](https://en.wikipedia.org/wiki/Evaluation_of_binary_classifiers) for definitions). In order to find good parameters we might try the following approaches:

* Maximize the Recall for fixed, sufficiently small Fallout.
* Maximize the ROC-AUC. This metric is built in, but might lead to different results than the previous strategy.
* Maximize the F1-score, which is the harmonic mean of Precision and Recall.

Here, I will compare all three strategies.
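To make the quantities in this list concrete, here is a small sketch that computes Precision, Recall, Fallout and the F1-score from the four confusion-matrix counts. The helper name `rates` and the counts are made up for illustration; they are not results from the dataset.

```python
def rates(tp, fp, fn, tn):
    """Return (precision, recall, fallout, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp)   # fraction of alerts that are real frauds
    recall = tp / (tp + fn)      # fraction of frauds that are detected (TPR)
    fallout = fp / (fp + tn)     # fraction of non-frauds that raise an alert (FPR)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, fallout, f1

# Made-up counts: 100 frauds among 100,000 transactions.
precision, recall, fallout, f1 = rates(tp=80, fp=20, fn=20, tn=99880)
print(precision, recall, fallout, f1)
```

With these made-up counts the plain accuracy would be 99.96%, while a model that never raises an alert would already reach 99.9%, which is why accuracy alone says little here.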

Before we start, it might be useful to recall our goal. Most importantly, we want to maximize the Recall, which is the probability of actually detecting a fraud. Undetected frauds are what actually creates costs for the bank and/or the customer. However, if we increase the Recall, we also increase the rate of 'false alarms' (i.e., the Fallout). This can be very annoying, even though it does not directly generate costs: a customer who gets a false fraud alert every week will consider changing to another bank. Keeping false alarms small means keeping the Fallout small, or (almost) equivalently, keeping the Precision high. Hence, we will either:

* Find a tradeoff between high Recall and low Fallout, or
* simply maximize the F1-score, therefore 'maximizing' Precision and Recall at the same time.
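The first option, a tradeoff between Recall and Fallout, can be phrased as: among all probability thresholds that keep the Fallout below a chosen limit, take the one with the highest Recall. The following is a minimal NumPy sketch of that idea on toy scores; the function name `recall_at_fallout` and the data are hypothetical, and the brute-force loop is only meant for small examples.

```python
import numpy as np

def recall_at_fallout(y_true, y_score, max_fallout):
    """Largest Recall reachable while keeping the Fallout <= max_fallout.

    Predicts 'fraud' when score > threshold and tries every distinct score
    (plus -inf) as threshold. Returns (recall, threshold).
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    best = (0.0, np.inf)  # threshold +inf: nothing is flagged, Recall is 0
    for t in np.concatenate(([-np.inf], np.unique(y_score))):
        pred = y_score > t
        fallout = np.mean(pred[~y_true])
        recall = np.mean(pred[y_true])
        if fallout <= max_fallout and recall > best[0]:
            best = (recall, t)
    return best

# Toy data: 8 non-frauds and 2 frauds with made-up scores.
y_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
s_demo = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 0.8, 0.95]
print(recall_at_fallout(y_demo, s_demo, max_fallout=0.0))
print(recall_at_fallout(y_demo, s_demo, max_fallout=0.125))
```

Allowing a single false alarm among the eight non-frauds is enough here to catch both frauds, which is the kind of tradeoff the ROC-curve displays.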

Grid Search - Maximizing the F1-Score
===

That is our first approach. I will only initialize the grid search here, but not execute it, since it would be too time-consuming here (see my GitHub-repo for this).

In [None]:
cv_params = {'max_depth': [1,2,3,4,5,6], 'min_child_weight': [1,2,3,4]}    # parameters to be tried in the grid search
fix_params = {'learning_rate': 0.2, 'n_estimators': 100, 'objective': 'binary:logistic'}   #other parameters, fixed for the moment 
csv = GridSearchCV(xgb.XGBClassifier(**fix_params), cv_params, scoring = 'f1', cv = 5)

```
csv.fit(X_, y_) 
```

runs the search, and 

```
csv.best_params_
```

returns the optimal parameters within the list. We "see" that the particular choice of min_child_weight does not have a significant impact on the F1-score. Let's still pick the best results of this search, i.e., max_depth = 5 and min_child_weight = 3, and continue the grid search.
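One way to inspect how much each parameter matters is to tabulate the per-combination scores stored in the `cv_results_` attribute of a fitted `GridSearchCV`. The sketch below does this on a small synthetic problem, with a `DecisionTreeClassifier` as a fast stand-in for `XGBClassifier`; the data, the grid and the stand-in model are assumptions for illustration, not the search described above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced toy problem (roughly 5% positives); not the credit-card data.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95], random_state=0)

grid = {'max_depth': [1, 3, 5], 'min_samples_leaf': [1, 5]}
search = GridSearchCV(DecisionTreeClassifier(random_state=0), grid,
                      scoring='f1', cv=5)
search.fit(X_toy, y_toy)

# One row per parameter combination, with mean and spread of the CV score.
table = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'param_min_samples_leaf',
     'mean_test_score', 'std_test_score']]
print(table.sort_values('mean_test_score', ascending=False))
```

Comparing `mean_test_score` differences against `std_test_score` gives a rough feeling for whether a parameter's effect is larger than the noise between folds.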

In [None]:
cv_params = {'subsample': [0.8,0.9,1], 'max_delta_step': [0,1,2,4]}
fix_params = {'learning_rate': 0.2, 'n_estimators': 100, 'objective': 'binary:logistic', 'max_depth': 5, 'min_child_weight':3}

```
csv = GridSearchCV(xgb.XGBClassifier(**fix_params), cv_params, scoring = 'f1', cv = 5) 
csv.fit(X_, y_)
csv.cv_results_
csv.best_params_
```

We obtain that max_delta_step = 1 and subsample = 0.8 are (quasi-)optimal. Finally we search for an optimal learning rate:

In [None]:
cv_params = {'learning_rate': [0.05, 0.1, 0.15, 0.2, 0.25, 0.3]}
fix_params['max_delta_step'] = 1
fix_params['subsample'] = 0.8

```
csv = GridSearchCV(xgb.XGBClassifier(**fix_params), cv_params, scoring = 'f1', cv = 5) 
csv.fit(X_, y_)
csv.cv_results_
csv.best_params_
```

Unfortunately, running the same grid search again often yields different 'optimal' parameters. We settle for the rate 0.2, and here are our final parameters (if we wanted, we could rerun the grid searches another time, but I believe that the gain will be minimal compared to the associated computational costs).

In [None]:
fix_params['learning_rate'] = 0.2
params_final =  fix_params
print(params_final)

Now we train our final model on the entire training set, and evaluate it on the still unused testing set:

In [None]:
xgdmat_train = xgb.DMatrix(X_, y_)
xgdmat_test = xgb.DMatrix(X_test, y_test)
xgb_final = xgb.train(params_final, xgdmat_train, num_boost_round = 100)

In [None]:
y_pred = xgb_final.predict(xgdmat_test)
thresh = 0.08
y_pred [y_pred > thresh] = 1
y_pred [y_pred <= thresh] = 0
cm = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cm, ['0', '1'], )
pr, tpr, fpr = show_data(cm, print_res = 1);

We can still play around with the probability threshold, such that Precision and Recall are to our liking. As far as we can tell from the evaluation of the final model on the testing set, it seems at least comparable to the Logistic Regression.
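Playing with the threshold can be done without overwriting the predicted probabilities. Here is a minimal sketch on made-up probabilities (the helper `sweep_thresholds` and the toy arrays are hypothetical) that reports Precision and Recall for a few candidate thresholds:

```python
import numpy as np

def sweep_thresholds(y_true, y_prob, thresholds):
    """Precision and Recall for each threshold, leaving y_prob untouched."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    rows = []
    for t in thresholds:
        pred = y_prob > t                 # boolean prediction for this threshold
        tp = np.sum(pred & y_true)
        fp = np.sum(pred & ~y_true)
        fn = np.sum(~pred & y_true)
        precision = tp / (tp + fp) if tp + fp > 0 else float('nan')
        recall = tp / (tp + fn)
        rows.append((t, precision, recall))
    return rows

# Toy data: 6 non-frauds, 4 frauds, made-up probabilities.
y_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
p_demo = [0.01, 0.02, 0.05, 0.10, 0.30, 0.60, 0.07, 0.40, 0.80, 0.95]
rows = sweep_thresholds(y_demo, p_demo, [0.05, 0.08, 0.5])
for t, precision, recall in rows:
    print('thresh = {:.2f}  precision = {:.3f}  recall = {:.3f}'.format(t, precision, recall))
```

Raising the threshold trades Recall for Precision in this toy example; on real data the relation need not be monotone in Precision.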

Grid Search - ROC-AUC
===

We can repeat exactly the same reasoning, just with scoring = 'roc_auc' in the CV. Indeed, we get similar near-optimal parameters. I do not show this here.
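For intuition on what this score rewards: the ROC-AUC equals the probability that a randomly chosen fraud receives a higher score than a randomly chosen non-fraud. The sketch below computes it that way on toy data; the function name `auc_by_ranking` and the numbers are made up, and the pairwise comparison is only practical for small samples.

```python
import numpy as np

def auc_by_ranking(y_true, y_score):
    """ROC-AUC as the fraction of (fraud, non-fraud) pairs ranked correctly.

    Ties count as half. Memory grows with n_pos * n_neg, so this is for demos.
    """
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    pos = y_score[y_true][:, None]    # column of fraud scores
    neg = y_score[~y_true][None, :]   # row of non-fraud scores
    return np.mean((pos > neg) + 0.5 * (pos == neg))

y_demo = [0, 0, 0, 1, 1]
s_demo = [0.1, 0.4, 0.35, 0.8, 0.3]
auc = auc_by_ranking(y_demo, s_demo)
print(auc)
```

Since this average runs over all pairs, it reflects the whole ROC-curve rather than only the region of very small Fallout, which is one reason it may favor different parameters than the fixed-Fallout strategy.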

Maximized Recall for given Fallout
===

Here I will apply a similar strategy to the one described in my previous notebook about the Logistic Regression. I will repeat CV multiple times in order to get averaged, precise ROC-curves and Precision-Recall-curves (PR). If we specify a Fallout or a desired Precision, we can use the curves to find the optimal parameters such that the Recall is then maximized. Since I will average over many curves, this is a very precise approach, but it is bound to be very computation-intensive.
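The averaging idea relies on the fact that each run yields a curve sampled at different FPR values, so the curves are first interpolated onto a common grid and then averaged pointwise. Here is a toy sketch with two made-up ROC curves; the numbers and the coarse grid are assumptions chosen for readability.

```python
import numpy as np

# Two toy ROC curves given at different FPR values (made-up numbers).
fpr_a, tpr_a = np.array([0.0, 0.1, 1.0]), np.array([0.0, 0.8, 1.0])
fpr_b, tpr_b = np.array([0.0, 0.2, 0.5, 1.0]), np.array([0.0, 0.6, 0.9, 1.0])

grid = np.linspace(0, 1, 11)      # common FPR grid
mean_tpr = np.zeros_like(grid)
for fpr, tpr in [(fpr_a, tpr_a), (fpr_b, tpr_b)]:
    # np.interp expects increasing x-coordinates, which holds for FPR.
    mean_tpr += np.interp(grid, fpr, tpr)
mean_tpr /= 2

print(np.round(mean_tpr, 3))
```

The same requirement of increasing x-coordinates is why a Recall array that comes in decreasing order has to be reversed before it can be used as the x-axis for `np.interp`.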

Let us specify the XGBoost-parameters we for now assume best, e.g. the ones we have obtained above.

In [None]:
par = params_final

The following function splits the training data into a further training and a validation set, and generates the ROC- and the PR-curve based on the validation data:

In [None]:
def get_curves(X_, y_, pars):
    X_train, X_val, y_train, y_val = train_test_split(X_, y_, test_size = 0.2)
    clf = xgb.XGBClassifier(**pars)
    clf.fit(X_train, y_train)
    y_prob = clf.predict_proba(X_val)[:,clf.classes_[1]]
    fpr, tpr, thresholds_roc = roc_curve(y_val, y_prob)
    prec, rec, thresholds_pr = precision_recall_curve(y_val, y_prob)
    return fpr, tpr, prec, rec

This function now calls 'get_curves' *N_iter* times and computes averaged ROC- and PR-curves. I have chosen *N_iter = 300*, and obtained very smooth ROC-curves. This is important, since we want to use the curves to find the optimal parameters. Too much noise (i.e. not enough smoothing) will likely obfuscate the results.

In [None]:
def gen_curves(X_, y_, pars):
    N_iter = 300
    mean_tpr = 0.0
    mean_fpr = np.linspace(0, 1, 100000)
    
    mean_prec = 0.0
    mean_rec = np.linspace(0, 1, 100000)
    
    for n in range(N_iter):
        fpr, tpr, prec, rec = get_curves(X_, y_, pars)
        prec = list(reversed(prec)) # reverse so that rec is increasing; otherwise np.interp does not work
        rec = list(reversed(rec))
        mean_tpr  += np.interp(mean_fpr, fpr, tpr)
        mean_prec += np.interp(mean_rec, rec, prec)

    mean_tpr /= N_iter
    mean_prec /= N_iter
    
    return mean_fpr, mean_tpr, mean_prec, mean_rec

And this function finally plots the curves for different parameters:

In [None]:
def plot_roc(X_, y_, par, name_par, list_par):
    f, (ax1, ax2) = plt.subplots(1, 2, figsize = (18,7));
    for l in list_par:
        par[name_par] = l
        print(par)
        mean_fpr, mean_tpr, mean_prec, mean_rec = gen_curves(X_, y_, par)
        ax1.plot(mean_fpr, mean_tpr, label = name_par+" = "+str(l))
        ax2.plot(mean_rec, mean_prec, label = name_par+" = "+str(l))
    ax1.set_xlim([0, 0.0005])
    ax1.set_ylim([0.5, 0.95])
    ax1.axvline(2e-4, color='b', linestyle='dashed', linewidth=2)
    ax1.legend(loc="lower right")
    ax1.set_xlabel('FPR/Fallout')
    ax1.set_ylabel('TPR/Recall')
    ax2.set_xlabel('Recall')
    ax2.set_ylabel('Precision')
    ax1.set_title('ROC')
    ax2.set_title('PR')
    ax2.legend(loc = "lower left")
    ax2.set_xlim([0.5, 1])
    plt.show()

Let us first vary 'max_depth':

```
plot_roc(X_, y_, par, 'max_depth', [1,2,3,4,5,7,10,15])
```

We see that small values of 'max_depth' are not so good (again, please check out [GitHub](https://github.com/dstuerzer/Kaggle/blob/master/credit_card_fraud/XGBoost.ipynb) for high-resolution and high-precision plots), and fix max_depth = 5. We proceed with a grid search for 'learning_rate' (again, on GitHub I use a finer grid):

In [None]:
par['max_depth'] = 5

```
plot_roc(X_, y_, par, 'learning_rate',  [0.05,0.1,0.15,0.2,0.25,0.3])
```

From the plots here we see that a learning rate around 0.2 seems best. In the grid search for the F1 score above we have observed that it was mostly those two parameters that had an influence on the performance of the model. This suggests we stop here, and set our final parameters.

In [None]:
par['learning_rate'] = 0.2

We now train the final model on the full training set (X_, y_) and apply to the still untouched testing set (X_test, y_test).

In [None]:
xgdmat_train = xgb.DMatrix(X_, y_)
xgdmat_test = xgb.DMatrix(X_test, y_test)
xgb_final = xgb.train(par, xgdmat_train, num_boost_round = 100)
y_pred = xgb_final.predict(xgdmat_test)
fpr, tpr, thresholds_roc = roc_curve(y_test, y_pred)
prec, rec, thresholds_pr = precision_recall_curve(y_test, y_pred)

mean_fpr = np.linspace(0, 1, 100000)
mean_rec = np.linspace(0, 1, 1000)

prec = list(reversed(prec)) # reverse so that rec is increasing; otherwise np.interp does not work
rec = list(reversed(rec))
mean_tpr = np.interp(mean_fpr, fpr, tpr)
mean_prec = np.interp(mean_rec, rec, prec)

In [None]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize = (18,7));
ax1.plot(mean_fpr, mean_tpr);

ax2.plot(mean_rec, mean_prec);

ax1.set_xlim([0, 0.0005])
ax1.set_ylim([0.5, 0.95])
ax1.axvline(2e-4, color='b', linestyle='dashed', linewidth=2)
ax1.set_xlabel('FPR/Fallout')
ax1.set_ylabel('TPR/Recall')
ax2.set_xlabel('Recall')
ax2.set_ylabel('Precision')
ax1.set_title('ROC')
ax2.set_title('PR')
ax2.set_xlim([0.5, 1])
plt.show()

Above we see the ROC- and the PR-curve for our model on the testing set. The results on the training set indicate that our XGBoost-model performs better than the Logistic Regression (compare to my previous notebook): Especially for the smoothed curves (again, see GitHub) and a fixed Fallout of 2e-4 the (average) Recall is about 5% higher. On the testing set we expect higher fluctuations, but nevertheless, the model performs well there too.

The prediction vector *y_pred* now contains the predicted probabilities of fraud. By varying the threshold, we obtain the ROC- and PR-curves, and can adapt the sensitivity of the model to our liking (keeping an eye on the curves, so we know what we can expect).

In [None]:
y_final = np.copy(y_pred)
thresh = 0.08
y_final [y_final > thresh] = 1
y_final [y_final <= thresh] = 0
cm = confusion_matrix(y_test, y_final)
plot_confusion_matrix(cm, ['0', '1'], )
pr, tpr, fpr = show_data(cm, print_res = 1);

Unfortunately, we have to expect a high variation in the testing results, since the testing set contains only very few frauds. The averaged ROC- and PR-curves might still be a better indication of the actual quality of the model.

One of the main lessons of this analysis is that XGBoost will (on average) outperform the Logistic Regression, and is better suited to predict frauds. In particular, the ROC-curve is much steeper around the origin, and we can achieve similar Recall rates with only half of the Fallout (i.e. a Recall of ~80% with a Fallout of just 1e-4). Alternatively, we can expect a Precision *and* a Recall both of ~85%. See GitHub for the confirmation of these results.