# Credit Card Fraud Detection

The dataset contains transactions made with credit cards in September 2013 by European cardholders.
It covers two days of transactions, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables. `V1` to `V28` are the result of a PCA transformation, while `Time` and `Amount` are left in their original form.

### Goal : Classify whether a transaction is fraudulent or not.

The following steps will be taken: 

 1. EDA
 2. Fit logistic regression with all given features
 3. Analyze the result
 4. Resample to fix the class imbalance in the dataset
 5. Compare the two results and conclude

In [None]:
from __future__ import division  # __future__ imports must come before any other statement
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.model_selection import train_test_split


# Loading the data

In [None]:
df = pd.read_csv('../input/creditcard.csv')
df.describe()

In [None]:
# As the data description says, the V1..V28 columns are the result of a PCA transformation
df.head()

# Explore the data

In [None]:
# checking null value in dataset
df.isnull().sum()

In [None]:
# number of label "1" in whole dataset.
sum(df['Class']==1)

In [None]:
# visualize
count_classes = df['Class'].value_counts(sort=True)  # pd.value_counts() is deprecated in recent pandas
count_classes.plot(kind='bar')
plt.title("Fraud class histogram")
plt.xlabel("Class")
plt.ylabel("Frequency")

In [None]:
492/284807 * 100

The dataset is clearly skewed: fraudulent transactions make up only about 0.17% of it.
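
To make the consequence of this skew concrete, here is a small sketch using only the two counts quoted in the dataset description. It shows why plain accuracy is a misleading metric here: a hypothetical "classifier" that labels every transaction as normal scores very high accuracy while catching no fraud at all.

```python
n_total = 284807   # transactions in the dataset (from the description)
n_fraud = 492      # fraudulent transactions

# Hypothetical baseline: predict "normal" for every transaction.
# It is correct on every normal row...
baseline_accuracy = (n_total - n_fraud) / n_total
# ...but it detects none of the frauds.
baseline_recall = 0 / n_fraud

print("baseline accuracy: %.4f" % baseline_accuracy)
print("baseline recall  : %.4f" % baseline_recall)
```

This is why the rest of the notebook looks at precision, recall and F1 rather than accuracy.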

In [None]:
# plot correlation heatmap
corr = df.corr()
# correlations range from -1 to 1; vmin=0 would hide the negative ones
sns.heatmap(corr, vmin=-1, vmax=1, cmap='coolwarm')

The 'Time' column records when each transaction occurred.
Looking at the correlations between the variables, I couldn't find any significant clue.

In [None]:
# make the data frame easier to read: reorder the columns and drop the Time column.
col = ['V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19',
       'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28',
       'Amount', 'Class']

# assign new column order to data frame
df = df.reindex(columns=col)
df.head()

In [None]:
# plot histogram for all variables
import matplotlib.gridspec as gridspec

features = df.iloc[:,:29].columns
plt.figure(figsize=(15,30*4))
gs = gridspec.GridSpec(30, 1)
for i, col in enumerate(df[features]):
    ax = plt.subplot(gs[i])
    sns.distplot(df[col][df.Class==0], bins=100)
    sns.distplot(df[col][df.Class==1], bins=100, color='r')
    ax.set_xlabel('')
    ax.set_title(col)
plt.show()

Most of the features are distributed around 0.

**Note**
`df.hist()` plots histograms for all columns, but the column order isn't preserved unless you rename the columns.

#### I wonder how standardization affects the logistic regression result. Do I have to standardize features every time, with no exceptions?

To find out, I will test two cases:
 * Test 1 : fit logistic regression with the 'Amount' feature NOT standardized.
 * Test 2 : fit logistic regression with the 'Amount' feature standardized.

## Test 1

In [None]:
# copy data frame for testing (a plain assignment would only create an alias of df)
df_test = df.copy()
df_test.head()

In [None]:
X = df_test.loc[:, df_test.columns != 'Class']
y = df_test.loc[:, df_test.columns == 'Class']

In [None]:
# divide data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# choose Logistic regression to classify whether it is fraudulent transaction or not.
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test)

In [None]:
# confusion matrix
from sklearn.metrics import confusion_matrix
import itertools

def plot_confusion_matrix(cm, classes, normalize=False, title='Confusion matrix', cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # normalize each row first, so the image and the cell labels show the same values
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=15)
    plt.yticks(tick_marks, classes, rotation=15)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

In [None]:
# Compute confusion matrix
class_set = [0, 1]
cnf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
np.set_printoptions(precision=2)
print ("        Confusion matrix not standardized")

# Plot non-standardized confusion matrix
plot_confusion_matrix(cm=cnf_matrix, classes=class_set)
plt.show()

In [None]:
# calculate Precision, Recall and F1-score.

# Precision = TP / (TP + FP)
# Recall = TP / (TP + FN)

precision = cnf_matrix[1,1] / (cnf_matrix[1,1] + cnf_matrix[0,1])
recall = cnf_matrix[1,1] / (cnf_matrix[1,1] + cnf_matrix[1,0])

print ("Not standardized\n")
print ("------------------------------------------------------------------------")
print ("Precision : %.4f" %precision)
print ("------------------------------------------------------------------------")
print ("Recall : %.4f" %recall)
print ("------------------------------------------------------------------------")
print ("F1-Score : %.4f" % ((precision*recall*2)/(precision+recall)))
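
The same three metrics are computed again several times below, so the calculation could be wrapped in a small helper. The following is only a sketch: the function name `prf_from_confusion` and the example counts are made up for illustration, and it assumes the scikit-learn confusion matrix layout (rows are true labels, columns are predictions). It also guards against division by zero, which the inline formulas would hit if the model predicted no positives.

```python
import numpy as np

def prf_from_confusion(cm):
    """Return (precision, recall, f1) for the positive class of a 2x2 confusion matrix.

    Assumes the scikit-learn layout: rows are true labels, columns are predictions.
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# Made-up counts for illustration only (not results from this notebook).
example_cm = np.array([[90, 10],
                       [5, 15]])
precision, recall, f1 = prf_from_confusion(example_cm)
print("Precision : %.4f" % precision)
print("Recall : %.4f" % recall)
print("F1-Score : %.4f" % f1)
```

With the notebook's own matrices, the call would look like `prf_from_confusion(cnf_matrix)`.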

## Test 2

In [None]:
# standardize
from sklearn.preprocessing import StandardScaler

df_test['Amount_rescale'] = StandardScaler().fit_transform(df_test['Amount'].values.reshape(-1, 1))
df_test.drop('Amount', axis=1, inplace=True)
df_test.head()

In [None]:
X_std = df_test.loc[:, df_test.columns != 'Class']
y_std = df_test.loc[:, df_test.columns == 'Class']

In [None]:
# divide data
X_train_std, X_test_std, y_train_std, y_test_std = train_test_split(X_std, y_std, test_size=0.3, random_state=0)

# train and predict
lr_std = LogisticRegression()
lr_std.fit(X_train_std, y_train_std.values.ravel())
y_pred_std = lr_std.predict(X_test_std)

In [None]:
# Compute confusion matrix
cnf_matrix_std = confusion_matrix(y_true=y_test_std, y_pred=y_pred_std)
np.set_printoptions(precision=2)
print ("        Confusion matrix standardized")

# Plot standardized confusion matrix
plot_confusion_matrix(cm=cnf_matrix_std, classes=class_set)
plt.show()

In [None]:
# calculate Precision, Recall and F1-score.
# overwrite variables
precision = cnf_matrix_std[1,1] / (cnf_matrix_std[1,1] + cnf_matrix_std[0,1])
recall = cnf_matrix_std[1,1] / (cnf_matrix_std[1,1] + cnf_matrix_std[1,0])

print ("Standardized\n")
print ("------------------------------------------------------------------------")
print ("Precision : %.4f" %precision)
print ("------------------------------------------------------------------------")
print ("Recall : %.4f" %recall)
print ("------------------------------------------------------------------------")
print ("F1-Score : %.4f" % ((precision*recall*2)/(precision+recall)))

#### I don't see any difference between the Test 1 and Test 2 results.

#### However, standardization matters for regularization to work properly: all features need to be on comparable scales.
So I will standardize the feature.
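
One detail worth noting: in Test 2 the scaler is fitted on the whole dataset before splitting, so the test rows influence the scaling. A common alternative is to put the scaler and the model in a pipeline, so the scaler is fitted on the training split only. The sketch below is not the notebook's method; it uses synthetic data from `make_classification` as a stand-in for the real dataset, and the names `X_demo`, `y_demo` and `scaled_lr` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (assumption: this is NOT the credit card dataset).
X_demo, y_demo = make_classification(n_samples=2000, n_features=5,
                                     weights=[0.9, 0.1], random_state=0)
X_demo[:, 0] *= 1000.0  # put one feature on a much larger scale, like 'Amount'

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3,
                                          stratify=y_demo, random_state=0)

# The scaler is fitted inside the pipeline, on the training split only.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)

scaler = scaled_lr.named_steps['standardscaler']
print("training mean of the large-scale feature:", scaler.mean_[0])
print("test accuracy:", scaled_lr.score(X_te, y_te))
```

Calling `scaled_lr.predict(X_te)` applies the training-set mean and standard deviation to the test rows automatically.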

In [None]:
df = df_test

# Handle skewed data

We cannot collect more data, so we address the class imbalance by oversampling or undersampling.

Let's try undersampling. The main idea is to drop samples of the majority class at random until the dataset is balanced 50/50. We have 492 fraudulent transactions, so we randomly choose 492 normal transactions and concatenate them with the 492 fraudulent ones.

In [None]:
# get fraudulent transaction indices
len_fraud = len(df[df['Class']==1])
indices_fraud = np.array(df[df['Class']==1].index)

# get normal transaction indices
indices_normal = np.array(df[df['Class']==0].index)
indices_normal = np.random.choice(indices_normal, len_fraud, replace=False)

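# make an undersampled dataframe (the values are index labels, so select with .loc)
undersample_indices = np.concatenate([indices_normal, indices_fraud])
under_df = df.loc[undersample_indices, :]

# reindexing without hard-coding the number of rows
under_df = under_df.reset_index(drop=True)
under_df.shape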

Fit the logistic regression model on the undersampled data.

In [None]:
# shuffle rows in dataframe
under_df = under_df.sample(frac=1).reset_index(drop=True)

In [None]:
# divide data
under_X = under_df.loc[:, under_df.columns.values != 'Class']
under_y = under_df.loc[:, under_df.columns.values == 'Class']

In [None]:
X_train_und, X_test_und, y_train_und, y_test_und = train_test_split(under_X, under_y, test_size=0.3, random_state=0)

lr_und = LogisticRegression()
lr_und.fit(X_train_und, y_train_und.values.ravel())
y_pred_und = lr_und.predict(X_test_und)

In [None]:
cnf_matrix_und = confusion_matrix(y_true=y_test_und, y_pred=y_pred_und)
np.set_printoptions(precision=2)
print ("Confusion matrix undersampled")

plot_confusion_matrix(cm=cnf_matrix_und, classes=class_set)
plt.show()

In [None]:
# calculate Precision, Recall and F1-score.
# overwrite variables
precision = cnf_matrix_und[1,1] / (cnf_matrix_und[1,1] + cnf_matrix_und[0,1])
recall = cnf_matrix_und[1,1] / (cnf_matrix_und[1,1] + cnf_matrix_und[1,0])


print ("Undersampled\n")
print ("------------------------------------------------------------------------")
print ("Precision : %.4f" %precision)
print ("------------------------------------------------------------------------")
print ("Recall : %.4f" %recall)
print ("------------------------------------------------------------------------")
print ("F1-Score : %.4f" % ((precision*recall*2)/(precision+recall)))

#### Test the whole dataset with the model fitted on the undersampled data.

In [None]:
# to test whole dataset, reset X and y
X = df.loc[:, df.columns.values != 'Class']
y = df.loc[:, df.columns.values == 'Class']

In [None]:
# first fit model, compute confusion matrix and then plot ROC, AUC curve
from sklearn.metrics import roc_curve, auc

# the model we will use is 'lr_und'
# train with undersampled data
lr_und = LogisticRegression()
lr_und.fit(X_train_und, y_train_und.values.ravel())

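# test on the held-out 30% of the whole dataset
# note: this split is independent of the undersampled split, so some of these
# test rows may also have been used to train lr_und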
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.3)
y_pred = lr_und.predict(X_test)

#compute confusion matrix
cnf_matrix = confusion_matrix(y_true=y_test, y_pred=y_pred)
plot_confusion_matrix(cnf_matrix, classes=class_set, title='Final Confusion matrix')
plt.show()


In [None]:
# calculate Precision, Recall and F1-score.
# overwrite variables
precision = cnf_matrix[1,1] / (cnf_matrix[1,1] + cnf_matrix[0,1])
recall = cnf_matrix[1,1] / (cnf_matrix[1,1] + cnf_matrix[1,0])


print ("fit whole data into model \n")
print ("------------------------------------------------------------------------")
print ("Precision : %.4f" %precision)
print ("------------------------------------------------------------------------")
print ("Recall : %.4f" %recall)
print ("------------------------------------------------------------------------")
print ("F1-Score : %.4f" % ((precision*recall*2)/(precision+recall)))

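The precision is very low: the normal class is so large that even a small false positive rate produces far more false alarms than true frauds.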
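
The drop in precision when moving from the balanced sample to the whole dataset follows from the class ratio alone. The sketch below works through the arithmetic. The recall and false positive rate are hypothetical values chosen for illustration, not results from this notebook; only the prevalence comes from the dataset description.

```python
def precision_at_prevalence(recall, false_positive_rate, prevalence):
    """Precision implied by a fixed recall and false positive rate at a given fraud prevalence."""
    true_pos = recall * prevalence
    false_pos = false_positive_rate * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point of a classifier (assumed values).
recall = 0.90
false_positive_rate = 0.03

balanced = precision_at_prevalence(recall, false_positive_rate, 0.5)
real_world = precision_at_prevalence(recall, false_positive_rate, 492 / 284807)

print("precision on a 50/50 sample : %.4f" % balanced)
print("precision on the full data  : %.4f" % real_world)
```

The same classifier looks precise on the undersampled data and imprecise on the full data, because the number of normal transactions it can misclassify grows by orders of magnitude.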

## Compare the two ROC curves and AUC values

Plot the ROC curve and AUC for the whole dataset first.

In [None]:
# plot ROC curve and AUC
# the code below uses the model trained on undersampled data and scores the whole-dataset test split.
y_pred_score = lr_und.fit(X_train_und, y_train_und.values.ravel()).decision_function(X_test)

fpr, tpr, thresholds = roc_curve(y_test.values.ravel(), y_pred_score)
roc_auc = auc(fpr, tpr)

# plot
plt.plot(fpr, tpr, 'darkorange', label='AUC = %0.2f'% roc_auc)
plt.plot([0,1],[0,1],'--', color='navy')
plt.xlim([-0.05,1.0])
plt.ylim([0.0,1.05])
plt.title('Receiver Operating Characteristic')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()


Now plot the ROC curve and AUC for the undersampled dataset.

In [None]:
# plot ROC curve and AUC
# the code below uses the model trained on undersampled data and scores the undersampled test split.
y_pred_score_und = lr_und.fit(X_train_und, y_train_und.values.ravel()).decision_function(X_test_und)

fpr_und, tpr_und, thresholds_und = roc_curve(y_test_und.values.ravel(), y_pred_score_und)
roc_auc_und = auc(fpr_und, tpr_und)

# plot
plt.plot(fpr_und, tpr_und, 'darkorange', label='AUC = %.2f'% roc_auc_und)
plt.plot([0,1],[0,1],'--', color='navy')
plt.xlim([-0.05,1.0])
plt.ylim([0.0,1.05])
plt.title('Receiver Operating Characteristic')
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.legend(loc='lower right')
plt.show()


## K-fold

Use K-fold cross validation to find the best value of the parameter C.
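
For comparison, scikit-learn can run this kind of search for you. The sketch below is an alternative to a hand-written loop, not the method this notebook uses: it combines `GridSearchCV` with `StratifiedKFold` and scores by recall. The data is synthetic (`make_classification`), and the names `X_demo`, `y_demo` and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, roughly balanced stand-in for the undersampled data (assumption).
X_demo, y_demo = make_classification(n_samples=600, n_features=10,
                                     weights=[0.5, 0.5], random_state=0)

param_grid = {"C": [0.001, 0.01, 0.1, 1, 10.0, 100.0, 1000.0]}

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=7, shuffle=True, random_state=0)

# scoring="recall" selects the C with the highest mean recall across folds.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="recall", cv=cv)
search.fit(X_demo, y_demo)

print("best C:", search.best_params_["C"])
print("best mean recall:", search.best_score_)
```

Selecting on recall alone tends to favor models that flag many transactions, so it can be worth checking precision for the chosen C as well.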

In [None]:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.metrics import recall_score

In [None]:
under_X = under_df.loc[:, under_df.columns.values != 'Class']
under_y = under_df.loc[:, under_df.columns.values == 'Class']

In [None]:
kf = KFold(n_splits=7)
c_params = [0.001, 0.01, 0.1, 1, 10.0, 100.0, 1000.0]
results = []

for c_param in c_params:
    print ('--------------------------------------------------')
    print ('C parameter: ', c_param)
    print ('--------------------------------------------------')
    print ('')
    
    recall_accs = []
    
    for k, (train, test) in enumerate(kf.split(under_X, under_y)):
        # use L2 penalization
        lr_kf = LogisticRegression(C = c_param, penalty='l2')
        lr_kf.fit(under_X.iloc[train], under_y.iloc[train].values.ravel())
        y_pred_under = lr_kf.predict(under_X.iloc[test].values)
        
        # compute recall because our goal is to find fraudulent transactions. We should minimize false negatives (FN):
        # transactions that are actually fraudulent but are predicted as normal.
        recall_acc = recall_score(under_y.iloc[test].values.ravel(), y_pred_under)
        recall_accs.append(recall_acc)
        print ('Iteration: ',k+1 ,'recall score = ', recall_acc)
    
    # The mean value of those recall scores is the metric we want to save and get hold of.
    results.append(np.mean(recall_accs))
    
    print ('')
    print ('Mean recall score ', np.mean(recall_accs))
    print ('')

# Finally, we can check which C parameter is the best amongst the chosen:
# pick the C value whose mean recall is highest (not the recall score itself).
best_c = c_params[int(np.argmax(results))]

print ('Best mean recall score is', max(results))
print ('Best C parameter is', best_c)
print ('')


In [None]:
# zip() returns a lazy iterator in Python 3, so wrap it in list() to print the pairs
print (list(zip(c_params, results)))


The best C parameter is 0.001.

### Let's fit the model again with the best C parameter, 0.001, and see what changes.

In [None]:
X_train_und, X_test_und, y_train_und, y_test_und = train_test_split(under_X, under_y, test_size=0.3, random_state=0)

lr_c = LogisticRegression(C=best_c)
lr_c.fit(X_train_und, y_train_und.values.ravel())
y_pred_c = lr_c.predict(X_test_und)

In [None]:
cnf_matrix_c = confusion_matrix(y_true=y_test_und, y_pred=y_pred_c)
np.set_printoptions(precision=2)
print ("Confusion matrix with best C parameter")

plot_confusion_matrix(cm=cnf_matrix_c, classes=class_set)
plt.show()

In [None]:
precision = cnf_matrix_c[1,1] / (cnf_matrix_c[1,1] + cnf_matrix_c[0,1])
recall = cnf_matrix_c[1,1] / (cnf_matrix_c[1,1] + cnf_matrix_c[1,0])


print ("result fit model with best C")
print ("------------------------------------------------------------------------")
print ("Precision : %.4f" %precision)
print ("------------------------------------------------------------------------")
print ("Recall : %.4f" %recall)
print ("------------------------------------------------------------------------")
print ("F1-Score : %.4f" % ((precision*recall*2)/(precision+recall)))