# Kaggle's Credit Card Fraud Dataset - RF
In this notebook I'll apply a Random Forest classifier to the problem, but first I'll address the dataset's severe class imbalance using SMOTE-ENN, a combined over- and under-sampling technique.

In [None]:
%pylab inline
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN 
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve,auc
from sklearn.metrics import confusion_matrix

In [None]:
data = pd.read_csv('../input/creditcard.csv')
data.head()

In [None]:
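# Stratify so the rare fraud class is represented proportionally in both splits; fix the seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(data.drop('Class', axis=1), data['Class'], test_size=0.33, stratify=data['Class'], random_state=42)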

### Under/over-sample with SMOTE ENN to overcome class imbalance
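While Random Forest classifiers are often considered fairly robust to class imbalance, in this case the imbalance is severe enough that the model becomes biased toward the majority class. Only the training split is resampled, so the test set keeps the original class distribution.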

In [None]:
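sme = SMOTEENN(random_state=42)
# fit_resample replaces fit_sample, which recent imbalanced-learn releases no longer provide
X_train, y_train = sme.fit_resample(X_train, y_train)
# Class counts after resampling
np.unique(y_train, return_counts=True)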
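
As a rough illustration of the over-sampling half of SMOTE-ENN, the sketch below creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours. It is a simplified stand-in, not imbalanced-learn's implementation: the function name `smote_like_sample` and the sample values are hypothetical, and the ENN cleaning step (which removes samples whose nearest neighbours mostly belong to the other class) is left out.

```python
import random

def smote_like_sample(minority, k, rng):
    """Return one synthetic point between a minority sample and a near neighbour."""
    i = rng.randrange(len(minority))
    base = minority[i]
    # Rank the other minority points by squared Euclidean distance to the base point.
    others = [p for j, p in enumerate(minority) if j != i]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation weight in [0, 1)
    # The new point lies on the line segment joining base and neighbour.
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical 2-D minority-class samples (illustrative values only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
for point in synthetic:
    print([round(v, 3) for v in point])
```

Because each synthetic point is a convex combination of two real minority samples, it always falls inside the region the minority class already occupies rather than duplicating an existing row.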

### Train & Predict

In [None]:
clf = RandomForestClassifier(random_state=42)
clf = clf.fit(X_train,y_train)

y_test_hat = clf.predict(X_test)

## Evaluate predictions

### Extremely Accurate?
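While the standard accuracy metric makes our predictions look near-perfect, we should bear in mind that the class imbalance skews this metric: with so few fraudulent transactions, a model that labelled every transaction as valid would also score very highly.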

In [None]:
accuracy_score(y_test,y_test_hat)
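
To make the skew concrete, here is a small sketch with made-up numbers (a hypothetical test set of 1,000 transactions, 2 of them fraudulent, not the real dataset). A "model" that always predicts the majority class scores almost perfectly on accuracy while catching no fraud at all.

```python
# Hypothetical labels: 998 valid transactions (0) and 2 fraudulent ones (1).
y_true = [0] * 998 + [1] * 2
# A trivial baseline that always predicts "valid".
y_pred = [0] * 1000

# Accuracy: share of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on the fraud class: share of true frauds that were caught.
true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_pos / sum(y_true)

print('accuracy:', accuracy)  # 0.998
print('recall:', recall)      # 0.0
```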

scikit-learn's classification report gives us a more complete picture, with per-class precision, recall and F1 score.

In [None]:
print(classification_report(y_test, y_test_hat))

### ROC Curve & AUC
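We'll plot the true positive rate (recall) against the false positive rate across all decision thresholds, and compute the area under this curve for a more informative metric than accuracy. Note that the false positive rate is not precision: it is the share of valid transactions flagged as fraud, whereas precision is the share of flagged transactions that are truly fraudulent.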

In [None]:
y_score = clf.predict_proba(X_test)[:,1]
fpr, tpr, _ = roc_curve(y_test, y_score)

title('Random Forest ROC curve: CC Fraud')
xlabel('False positive rate')
ylabel('True positive rate (recall)')

plot(fpr, tpr)
# Diagonal reference line: the expected ROC of a random classifier
plot((0, 1), (0, 1), ls='dashed', color='black')
plt.show()
print('Area under curve (AUC):', auc(fpr, tpr))
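
For intuition about what `roc_curve` and `auc` compute, the sketch below rebuilds a ROC curve by hand on a toy example: it lowers the threshold one score at a time, records the resulting (FPR, TPR) point, and integrates with the trapezoidal rule. The helper names (`roc_points`, `trapezoid_auc`) and the six scores are hypothetical, and the sketch assumes no tied scores; scikit-learn handles ties and other edge cases that this does not.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists found by lowering the threshold one score at a time."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Highest score first: each step admits one more sample as "predicted fraud".
    ranked = sorted(zip(scores, labels), reverse=True)
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / n_neg)
        tpr.append(tp / n_pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Toy data: 3 frauds (1) and 3 valid transactions (0) with made-up scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
toy_fpr, toy_tpr = roc_points(labels, scores)
toy_auc = trapezoid_auc(toy_fpr, toy_tpr)
print(round(toy_auc, 4))  # 0.8889
```

The result, 8/9, matches the ranking interpretation of AUC: of the 9 possible (fraud, valid) pairs in the toy data, 8 have the fraud scored higher.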

### Confusion Matrix
Another valuable way to visualize our predictions is to plot them in a confusion matrix, which shows the frequency of correct and incorrect predictions for each class.

In [None]:
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap) 
    plt.title(title)
    class_labels = ['Valid','Fraudulent']
    plt.colorbar()
    
    tick_marks = np.arange(len(class_labels)) 
    plt.xticks(tick_marks, class_labels, rotation=90) 
    plt.yticks(tick_marks, class_labels) 
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
cm = confusion_matrix(y_test, y_test_hat)
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis] 
plt.figure(figsize=(5,5))
plot_confusion_matrix(cm_normalized, title='Normalized confusion matrix')
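
To show what the row normalization above does, here is a tiny sketch on hypothetical counts (not results from this notebook). Each row is divided by its own total, so every cell becomes the fraction of that true class receiving each predicted label, and the diagonal holds the per-class recall.

```python
# Hypothetical confusion counts: rows are true labels, columns are predictions.
#            pred valid  pred fraud
counts = [[9860, 40],   # true valid
          [20, 80]]     # true fraudulent

# Divide every cell by its row total so each row sums to 1.
normalized = [[cell / sum(row) for cell in row] for row in counts]

for row in normalized:
    print([round(v, 4) for v in row])
```

Without this step the colour scale would be dominated by the large count of valid transactions, hiding how well the rare fraud class is handled.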