# Credit Card Fraud Detection (Full Solution: XGBoost Performs Better)

**1) Models compared: Logistic Regression (used as the benchmark) and the ensemble methods AdaBoost, XGBoost and Random Forest.**

**2) SMOTE (Synthetic Minority Over-sampling Technique) is used to oversample the fraud class, since this is a highly imbalanced, anomaly-detection-style problem.**

**3) Metrics used to evaluate the algorithms: precision, recall, F1 score, area under the precision-recall curve, and accuracy.**

* Priority is given to recall, so that as few fraudulent transactions as possible are missed.
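
To illustrate why recall is prioritised over accuracy on data this skewed, here is a minimal sketch on made-up labels (the numbers are hypothetical and not taken from the dataset). A "model" that never flags fraud still reaches very high accuracy, while its recall is zero.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score)

# Hypothetical toy labels: 995 normal transactions and 5 frauds.
y_true = np.array([0] * 995 + [1] * 5)

# A "model" that predicts normal for every transaction.
y_all_normal = np.zeros_like(y_true)

# A model that catches 4 of the 5 frauds and raises 2 false alarms.
y_pred = y_true.copy()
y_pred[-1] = 0   # one missed fraud
y_pred[:2] = 1   # two false alarms

print('All-normal accuracy:', accuracy_score(y_true, y_all_normal))
print('All-normal recall  :', recall_score(y_true, y_all_normal))
print('Accuracy :', accuracy_score(y_true, y_pred))
print('Recall   :', recall_score(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('F1       :', f1_score(y_true, y_pred))
```

Both toy models have accuracy above 99%, so accuracy alone cannot separate them; recall and precision can.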

In [None]:
import warnings
warnings.filterwarnings("ignore")  # silence library warnings for the whole notebook

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection  import train_test_split,KFold, cross_val_score,GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix,precision_recall_curve,auc,roc_auc_score,roc_curve,\
recall_score,classification_report,accuracy_score,precision_score,f1_score,make_scorer,average_precision_score
from imblearn.over_sampling import SMOTE
from time import time
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from xgboost import XGBClassifier
import seaborn as sns
from matplotlib.colors import ListedColormap
%matplotlib inline

In [None]:
data = pd.read_csv("../input/creditcard.csv")
data.head()

In [None]:
normal_trans_perc=sum(data['Class']==0)/(sum(data['Class']==0)+sum(data['Class']==1))
fraud_trans_perc=1-normal_trans_perc
print('Total number of records : {} '.format(len(data)))
print('Total number of normal transactions : {}'.format(sum(data['Class']==0)))
print('Total number of  fraudulent transactions : {}'.format(sum(data['Class']==1)))
print('Percent of normal transactions is : {:.4f}%,  fraudulent transactions is : {:.4f}%'.format(normal_trans_perc*100,fraud_trans_perc*100))

In [None]:
plt.figure(figsize=(4, 4))
cards_trans_count= pd.Series([normal_trans_perc*100,fraud_trans_perc*100],index=['Normal','Fraud'])
cards_trans_count.plot(kind='bar', title='Card Transactions Split')
plt.ylabel('Transaction %')

In [None]:
# Standardize the Amount column (the other columns are already scaled) and drop Time, which is only a sequence indicator
data['NormAmount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))
data = data.drop(['Time','Amount'],axis=1)
data.head()

In [None]:
fig, ax = plt.subplots(figsize=(7,7))  
sns.heatmap(data.corr(),annot_kws={"size":4})

In [None]:
X_raw = data.loc[:, data.columns != 'Class']
y_raw = data.loc[:, data.columns == 'Class']

In [None]:
# Drop columns (V22 to V28) that show low correlation scores
X=X_raw.drop(['V22','V23','V24','V25','V26','V27','V28'], axis = 1)
y=y_raw

### Split into train and test data
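
With so few fraud rows, a purely random split can leave the train and test sets with slightly different fraud rates. One option, which this notebook does not use, is to pass `stratify` to `train_test_split`. The sketch below runs on a made-up toy frame (`toy_X` and `toy_y` are hypothetical names), not on the credit card data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, 1% positive class.
rng = np.random.RandomState(0)
toy_X = pd.DataFrame(rng.normal(size=(1000, 3)), columns=['a', 'b', 'c'])
toy_y = pd.Series([1] * 10 + [0] * 990, name='Class')

# stratify keeps the class proportions (nearly) identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.3, random_state=0, stratify=toy_y)

print('Positive share in train: {:.4f}'.format(y_tr.mean()))
print('Positive share in test : {:.4f}'.format(y_te.mean()))
```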

In [None]:
# Split into train and test dataset
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.3, random_state = 0)

In [None]:
#Stats of training data 
print('---------Training data statistics-----------')
normal_trans_perc=sum( y_train['Class']==0)/(sum( y_train['Class']==0)+sum( y_train['Class']==1))
fraud_trans_perc=1-normal_trans_perc
print('Total number of records : {} '.format(len(y_train)))
print('Total number of normal transactions : {}'.format(sum(y_train['Class']==0)))
print('Total number of  fraudulent transactions : {}'.format(sum(y_train['Class']==1)))
print('Percent of normal transactions is : {:.4f}%,  fraudulent transactions is : {:.4f}%'.format(normal_trans_perc*100,fraud_trans_perc*100))

In [None]:
#Stats of testing data 
print('---------Testing data statistics-----------')
normal_trans_perc=sum( y_test['Class']==0)/(sum( y_test['Class']==0)+sum( y_test['Class']==1))
fraud_trans_perc=1-normal_trans_perc
print('Total number of records : {} '.format(len(y_test)))
print('Total number of normal transactions : {}'.format(sum(y_test['Class']==0)))
print('Total number of  fraudulent transactions : {}'.format(sum(y_test['Class']==1)))
print('Percent of normal transactions is : {:.4f}%,  fraudulent transactions is : {:.4f}%'.format(normal_trans_perc*100,fraud_trans_perc*100))

### Resampling of data using SMOTE Technique 
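
SMOTE creates a synthetic minority sample by picking a minority point, choosing one of its nearest minority-class neighbours, and placing a new point somewhere on the line segment between the two. The NumPy sketch below illustrates only that interpolation step on made-up 2-D points. `smote_like_samples` is a hypothetical helper, not the imbalanced-learn implementation; the borderline variant additionally restricts which minority points are used as starting points.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between
    minority samples and their k nearest minority neighbours."""
    rng = np.random.RandomState(seed)
    # Pairwise Euclidean distances between minority points.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.randint(len(minority))      # random starting point
        nb = neighbours[base, rng.randint(k)]  # one of its k neighbours
        gap = rng.uniform(0, 1)                # position along the segment
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Hypothetical minority-class points.
fraud_points = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 0.8], [1.2, 1.9]])
new_points = smote_like_samples(fraud_points, n_new=6)
print(new_points.shape)
```

Because every synthetic point lies between two existing minority points, the new samples stay inside the region already occupied by the minority class.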

In [None]:
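# BorderlineSMOTE with kind='borderline-1' is the current imbalanced-learn
# equivalent of the older SMOTE(ratio=.02, kind='borderline1') call.
from imblearn.over_sampling import BorderlineSMOTE
sm = BorderlineSMOTE(sampling_strategy=.02, kind='borderline-1', random_state=0)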

In [None]:
X_resampled_train, y_resampled_train = sm.fit_resample(X_train, y_train.values.ravel())

In [None]:
# Convert to dataframe
X_resampled_train=pd.DataFrame(X_resampled_train,columns=
['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 
 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'NormAmount'])
y_resampled_train=pd.DataFrame(y_resampled_train,columns=['Class'])

In [None]:
print('---------------Resampled data statistics---------------')
normal_trans_perc=sum(y_resampled_train['Class']==0)/(sum(y_resampled_train['Class']==0)+sum(y_resampled_train['Class']==1))
fraud_trans_perc=1-normal_trans_perc
print('Total number of records : {} '.format(len(y_resampled_train)))
print('Total number of normal transactions : {}'.format(sum(y_resampled_train['Class']==0)))
print('Total number of  fraudulent transactions : {}'.format(sum(y_resampled_train['Class']==1)))
print('Percent of normal transactions is : {:.4f}%,  fraudulent transactions is : {:.4f}%'.format(normal_trans_perc*100,fraud_trans_perc*100))

In [None]:
X_resampled_train.to_csv('x_train.csv')
y_resampled_train.to_csv('y_train.csv')
X_test.to_csv('x_test.csv')
y_test.to_csv('y_test.csv')

In [None]:
def train_predict(learner, X_train, y_train, X_test, y_test): 
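    '''
    Fit a classifier, time it, and score it on the train and test sets.

    inputs:
       - learner: the learning algorithm to be trained and predicted on
       - X_train: features training set
       - y_train: class labels (0 = normal, 1 = fraud) training set
       - X_test: features testing set
       - y_test: class labels testing set

    returns a dict of timings and accuracy, recall, precision, F1 and
    average-precision scores for both sets.
    '''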
    
    results = {}
   
    start = time() # Get start time
    learner.fit(X_train, y_train)
    end = time() # Get end time
      
    results['train_time'] = end - start
        
    start = time() # Get start time
    predictions_test = learner.predict(X_test)
    predictions_train = learner.predict(X_train)
    
    predictions_test_prob = learner.predict_proba(X_test)[:,1]
    predictions_train_prob = learner.predict_proba(X_train)[:,1]
    
    
    end = time() # Get end time
        
    results['pred_time'] =end - start
            
    
    results['acc_train'] = accuracy_score(y_train, predictions_train)      
    results['acc_test'] = accuracy_score(y_test, predictions_test)
    
    results['rec_train'] = recall_score(y_train, predictions_train)      
    results['rec_test'] = recall_score(y_test, predictions_test)
    
    results['prec_train'] = precision_score(y_train, predictions_train)      
    results['prec_test'] = precision_score(y_test, predictions_test)
    
    
    results['f1_train'] = f1_score(y_train, predictions_train)
    results['f1_test'] = f1_score(y_test, predictions_test)
    
    # Average precision summarises the precision-recall curve; it is stored under the 'auc' keys
    results['auc_train'] = average_precision_score(y_train, predictions_train_prob,average='weighted')
    results['auc_test'] = average_precision_score(y_test, predictions_test_prob,average='weighted')
    
    
       
    # Success: report the training time (not the prediction time)
    print("{} trained in time {:.4f} ".format(learner.__class__.__name__, results['train_time']))
        
    # Return the results
    return results

In [None]:
# Initialize and train the models
clf_lr = LogisticRegression(random_state=0)
clf_rf = RandomForestClassifier(random_state=0)
clf_ab = AdaBoostClassifier(random_state=0)
clf_xg = XGBClassifier()

# Collect results on the learners
results = {}
for clf in [clf_lr, clf_ab,clf_rf,clf_xg]:
    clf_name = clf.__class__.__name__
    results[clf_name] = {}
    results[clf_name] = train_predict(clf, X_resampled_train, y_resampled_train.values.ravel(), X_test, y_test.values.ravel())


In [None]:
lr_res=pd.DataFrame(results['LogisticRegression'],index=['LR'])
ab_res=pd.DataFrame(results['AdaBoostClassifier'],index=['AB'])
rf_res=pd.DataFrame(results['RandomForestClassifier'],index=['RF'])
xg_res=pd.DataFrame(results['XGBClassifier'],index=['XG'])
all_res= pd.concat([lr_res,rf_res,ab_res,xg_res])

In [None]:
#Untuned Classifiers scores
all_res[['train_time','pred_time','acc_train','acc_test','rec_train','rec_test',\
         'prec_train','prec_test','f1_train','f1_test','auc_train','auc_test']]

In [None]:

import warnings
warnings.filterwarnings("ignore", category = UserWarning, module = "matplotlib")

# Plotting helpers used to compare the classifiers
import matplotlib.pyplot as pl
import matplotlib.patches as mpatches


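def evaluate(results):
    """Plot train (top row) and test (bottom row) metrics for every learner in `results`."""

    # Create figure
    fig, ax = pl.subplots(2, 5, figsize = (12,7))
    tit_label={0:'Training ',1:'Testing '}

    # Constants
    bar_width = 0.2
    colors = ['#5F9EA0','#6495ED','#90EE90','#9ACD32']

    # Plot one bar per learner in each of the ten panels (2 rows x 5 metrics)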
    for k, learner in enumerate(results.keys()):
        for j, metric in enumerate(['acc_train','rec_train','prec_train','f1_train','auc_train']):                 
           # Creative plot code
           ax[0, j].bar(k*bar_width, results[learner][metric], width = bar_width, color = colors[k])
           ax[0, j].set_xlim((-0.1, .9))
           ax[0,j].set_facecolor('white')
           pl.setp(ax[0,j].get_xticklabels(),visible=False)
           
        for j, metric in enumerate(['acc_test','rec_test','prec_test','f1_test','auc_test']):                 
           # Creative plot code
           ax[1, j].bar(k*bar_width, results[learner][metric], width = bar_width, color = colors[k])
           ax[1, j].set_xlim((-0.1, .9))
           ax[1,j].set_facecolor('white')
      
    for r in range(2):
        # Add unique y-labels
        ax[r, 0].set_ylabel("Accuracy Score")
        ax[r, 1].set_ylabel("Recall Score")
        ax[r, 2].set_ylabel("Precision score")
        ax[r, 3].set_ylabel("F1 - Score")
        ax[r, 4].set_ylabel("AUC-score")
        # Add titles
        ax[r, 0].set_title(tit_label[r]+"Accuracy Score")
        ax[r, 1].set_title(tit_label[r]+"Recall Score")
        ax[r, 2].set_title(tit_label[r]+"Precision score")
        ax[r, 3].set_title(tit_label[r]+"F1 - Score")
        ax[r, 4].set_title(tit_label[r]+"AUC-score")
		
    
   

    # Create patches for the legend, then draw the legend once
    patches = []
    for i, learner in enumerate(results.keys()):
        patches.append(mpatches.Patch(color = colors[i], label = learner))
    pl.legend(handles = patches, bbox_to_anchor = (-2, 2.4), \
               loc = 'upper center', borderaxespad = 0., ncol = 4, fontsize = 'x-large')

    # Aesthetics
    pl.suptitle("Performance Metrics for Four Supervised Learning Models", fontsize = 16, y = 1.10)
    pl.tight_layout()
    pl.show()
    

def comp_stats(results):
    """Plot test-set metrics side by side for the two models in `results`."""

    # Create figure
    fig, ax = pl.subplots(1, 1, figsize = (4,4))

    # Constants
    bar_width = 0.2
    colors = ['c','g']

    # Draw the five test metrics for each model, offsetting the second model's bars
    for k, learner in enumerate(results.keys()):
        if (k==0):
          bar_l=-0.6
        else:
          bar_l=-0.4
        for j, metric in enumerate(['acc_test','rec_test','prec_test','f1_test','auc_test']):                 
           bar_l=bar_l+.6 
           ax.bar(bar_l, results[learner][metric], width = bar_width, color = colors[k])
           

    ax.set_xlim((0, 3))
    ax.set_xticks([.2, .8, 1.4,2,2.6])
    ax.set_xticklabels(["Accuracy", "Recall", "Precision","F1","AUC"])
    ax.set_ylabel("Score")
    ax.set_facecolor('white')   
    # Create patches for the legend, then draw the legend once
    patches = []
    for i, learner in enumerate(results.keys()):
        patches.append(mpatches.Patch(color = colors[i], label = learner))
    pl.legend(handles = patches, bbox_to_anchor = (0.4, 1.22), \
               loc = 'upper center', borderaxespad = 0., ncol = 2, fontsize = 10)
    
    rects = ax.patches
    labels=[]

    # Collect the value labels in the same order the bars were drawn
    for k, learner in enumerate(results.keys()):
        for j, metric in enumerate(['acc_test','rec_test','prec_test','f1_test','auc_test']):                 
           labels.append("%.4f" % results[learner][metric])

    for rect, label in zip(rects, labels):
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2, height*1.02, label, ha='center', va='bottom',rotation='vertical')



    # Aesthetics
    pl.suptitle("Metrics for Tuned Models", fontsize = 14, y = 1.20)
    pl.tight_layout()
    pl.show()
    


In [None]:
evaluate(results)

Based on the graph above, Random Forest and XGBoost give the best scores, so both are taken forward for further tuning.

#### For XGBoost, the following parameter grids were passed to GridSearchCV (sklearn.model_selection)

* {'max_depth': (8, 10)}
* {'min_child_weight': (2, 4)}
* {'reg_lambda': (0.391, 0.395, 0.399)}
* {'n_estimators': (150, 200)}

max_depth=8 and min_child_weight=2 gave the best results, so they are used for the final model.

#### For RandomForestClassifier, the following parameter grids were passed to GridSearchCV (sklearn.model_selection)

* {'max_depth': (15, 20)}
* {'min_samples_split': (3, 4)}
* {'min_samples_leaf': (4, 5)}
* {'n_estimators': (12, 14, 16)}

max_depth=20 and min_samples_leaf=5 gave the best results, so they are used for the final model.

**PLEASE NOTE** - *The GridSearchCV code is not included here: it timed out in the Kaggle environment, so I ran it in my local environment.*
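
Since the grid search code is not part of this notebook, here is a minimal sketch of what such a search could look like. It is an illustration under assumptions, not the code that was actually run: it assumes recall as the scoring metric and 5-fold cross-validation, it uses a small synthetic dataset from `make_classification` instead of the credit card data, and the grid values are shrunk so that it finishes quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced stand-in for the training data (assumption).
X_toy, y_toy = make_classification(n_samples=600, n_features=8,
                                   weights=[0.9, 0.1], random_state=0)

# One parameter per search, as in the separate grids listed above.
param_grid = {'max_depth': (3, 5)}

grid = GridSearchCV(RandomForestClassifier(n_estimators=10, random_state=0),
                    param_grid=param_grid,
                    scoring='recall',   # assumed metric: recall is the priority
                    cv=5)
grid.fit(X_toy, y_toy)

print('Best parameters:', grid.best_params_)
print('Best CV recall : {:.4f}'.format(grid.best_score_))
```

The same pattern should carry over to `XGBClassifier`, which follows the scikit-learn estimator interface.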

### Comparison Between Pre- and Post-Tuned XGB Models (Test Data)

In [None]:
clf_pre_tune = XGBClassifier()
##Final Tuned XGB Model
clf_post_tune = XGBClassifier(max_depth=8,min_child_weight=2)
results_tune={}
results_tune['XG-PRE-TUNE'] = {}
results_tune['XG-PRE-TUNE'] = train_predict(clf_pre_tune, X_resampled_train, y_resampled_train.values.ravel(), X_test, y_test.values.ravel())
results_tune['XG-POST-TUNE'] = {}
results_tune['XG-POST-TUNE'] = train_predict(clf_post_tune, X_resampled_train, y_resampled_train.values.ravel(), X_test, y_test.values.ravel())

xg_pre_tune=pd.DataFrame(results_tune['XG-PRE-TUNE'],index=['XG-PRE-TUNE'])
xg_post_tune=pd.DataFrame(results_tune['XG-POST-TUNE'],index=['XG-POST-TUNE'])
all_res= pd.concat([xg_pre_tune,xg_post_tune])

all_res[['acc_train','acc_test','rec_train','rec_test',\
         'prec_train','prec_test','f1_train','f1_test','auc_train','auc_test']]



In [None]:
##Comparison between Pre-tuned and post-tuned
comp_stats(results_tune)

Comparison Between Post-Tuned XGB and RandomForest Classifiers (Test Data)
------------------------------------------------------------------------

In [None]:
## Referred from Random Forest Tuning Notebook
clf_post_tune_rf = RandomForestClassifier(random_state=0,max_depth=20,min_samples_leaf=5)
##Final Tuned XGB Model
clf_post_tune_xgb = XGBClassifier(max_depth=8,min_child_weight=2)
results_tune_1={}
results_tune_1['RF-POST-TUNE'] = {}
results_tune_1['RF-POST-TUNE'] = train_predict(clf_post_tune_rf, X_resampled_train, y_resampled_train.values.ravel(), X_test, y_test.values.ravel())
results_tune_1['XG-POST-TUNE'] = {}
results_tune_1['XG-POST-TUNE'] = train_predict(clf_post_tune_xgb, X_resampled_train, y_resampled_train.values.ravel(), X_test, y_test.values.ravel())

rf_post_tune_1=pd.DataFrame(results_tune_1['RF-POST-TUNE'],index=['RF-POST-TUNE'])
xg_post_tune_1=pd.DataFrame(results_tune_1['XG-POST-TUNE'],index=['XG-POST-TUNE'])
all_res_1= pd.concat([rf_post_tune_1,xg_post_tune_1])

all_res_1[['acc_train','acc_test','rec_train','rec_test',\
         'prec_train','prec_test','f1_train','f1_test','auc_train','auc_test']]


In [None]:
comp_stats(results_tune_1)

## Based on the stats above, XGBoost scores higher and is the best model

Features Importance Chart
-------------------------

In [None]:
# feature importance
plt.figure(figsize=(6, 6))
feature_importance = pd.Series(clf_post_tune.get_booster().get_fscore()).sort_values(ascending=False)
feature_importance.plot(kind='bar', title='Feature Importance')
plt.ylabel('Feature Importance Score')

Final Reflection
================
This is a highly imbalanced classification problem: more than 99% of transactions are normal and fewer than 1% are fraudulent. The features provided are PCA-transformed columns, because the original columns were not shared for privacy reasons.

Given this skew, classifiers tend to favour the normal (negative) class and struggle to identify the fraud (positive) class. To counter this, SMOTE (Synthetic Minority Over-sampling Technique) was used to oversample the positive class in the training data, so that the classifiers have more positive examples to learn from.

Ensemble methods (AdaBoost, RandomForest, XGBoost) were trained on the data, as ensemble methods tend to predict better for this kind of problem[4]. RandomForest and XGBoost gave good metrics and were selected for tuning.

RandomForest and XGBoost were tuned using GridSearchCV with 5-fold cross-validation. Parameters were tuned one at a time, because tuning several parameters together made the search run for a very long time and required a lot of memory.

***Finally, between the two classifiers, XGBoost is chosen as the preferred classifier for this problem based on its better scores.***