# About the data

Dataset description: 

The dataset contains transactions made by credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.

It contains only numerical input variables, which are the result of a PCA transformation. Unfortunately, due to confidentiality issues, we cannot provide the original features or more background information about the data. Features V1, V2, ... V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are 'Time' and 'Amount'. Feature 'Time' contains the seconds elapsed between each transaction and the first transaction in the dataset. Feature 'Amount' is the transaction amount; it can be used, for example, in example-dependent cost-sensitive learning. Feature 'Class' is the response variable and takes the value 1 in case of fraud and 0 otherwise.

Identify fraudulent credit card transactions.

Given the class imbalance ratio, we recommend measuring performance with the Area Under the Precision-Recall Curve (AUPRC). Accuracy computed from the confusion matrix is not meaningful for unbalanced classification.
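To see why plain accuracy misleads at this imbalance ratio, consider a hypothetical classifier that labels every transaction as non-fraud. The sketch below uses only the counts quoted in the description; `always_negative_metrics` is an illustrative helper, not part of the original notebook.

```python
def always_negative_metrics(n_total, n_fraud):
    """Metrics for a trivial classifier that predicts class 0 for every row."""
    true_negatives = n_total - n_fraud   # every legitimate row counts as correct
    true_positives = 0                   # no fraud is ever flagged
    accuracy = (true_negatives + true_positives) / n_total
    recall = true_positives / n_fraud    # share of frauds that were caught
    return accuracy, recall


accuracy, recall = always_negative_metrics(n_total=284807, n_fraud=492)
print('Accuracy of always predicting 0: {:.3%}'.format(accuracy))
print('Recall on the fraud class: {:.1%}'.format(recall))
```

The accuracy is very high even though not a single fraud is detected, which is why a precision-recall based metric is the more informative choice here.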

In [39]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

from sklearn.utils import resample
from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import auc, precision_recall_curve, average_precision_score

from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LogisticRegressionCV
from sklearn.neighbors import KNeighborsClassifier

%matplotlib inline


df = pd.read_csv('creditcard.csv')
df.head()

In [9]:
df.shape

(284807, 31)

In [3]:
df.Class.value_counts()

0    284315
1       492
Name: Class, dtype: int64

There is a class imbalance problem here, which we can address in two ways: downsample the majority class, or upsample the minority class. Downsampling produces a much smaller dataset, because the majority class is cut down to the 492 rows of the minority class.
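The notebook continues with downsampling only. For comparison, a minimal sketch of the upsampling option with the same `resample` helper might look like this; the toy frame and the `toy_*` names are assumptions made for illustration and are not the credit card data.

```python
import pandas as pd
from sklearn.utils import resample

# Toy stand-in for the real data: 8 majority rows, 2 minority rows (assumption).
toy = pd.DataFrame({'Amount': [10, 20, 30, 40, 50, 60, 70, 80, 900, 950],
                    'Class':  [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]})
toy_majority = toy[toy.Class == 0]
toy_minority = toy[toy.Class == 1]

# Upsample: draw minority rows WITH replacement until they match the majority.
toy_minority_upsampled = resample(toy_minority,
                                  replace=True,
                                  n_samples=len(toy_majority),
                                  random_state=0)   # fixed seed for repeatability

# Combine the untouched majority class with the upsampled minority class.
toy_upsampled = pd.concat([toy_majority, toy_minority_upsampled])
print(toy_upsampled.Class.value_counts())
```

Because upsampling duplicates rows, it is usually safer to apply it to the training portion only; otherwise copies of the same row can land in both the training and the evaluation data.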

In [5]:
# Separate majority and minority classes
df_majority = df[df.Class==0]
df_minority = df[df.Class==1]

# Downsampling

In [6]:
# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=len(df_minority))    # to match minority class
                                
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.Class.value_counts()

1    492
0    492
Name: Class, dtype: int64

In [8]:
df_downsampled.shape

(984, 31)

In [19]:
X = df_downsampled.loc[:, ~df_downsampled.columns.isin(['Class'])]
Y = df_downsampled['Class']

X.shape, Y.shape

((984, 30), (984,))

In [33]:
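bnb = BernoulliNB()
# Out-of-fold class predictions: hard 0/1 labels, not probabilities
Y_pred = cross_val_predict(bnb, X, Y, cv=5)
# Because Y_pred holds only 0s and 1s, the curve has just three points
precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
average_precision = average_precision_score(Y, Y_pred)

# Note: auc(x, y) integrates y over x, so this call puts precision on the x-axis
print(auc(precision, recall))
print('Average precision score: {0:0.2f}'.format(average_precision))

# conf_mat[1,1] = true positives, conf_mat[0,1] = false positives, conf_mat[1,0] = false negatives
conf_mat = confusion_matrix(Y, Y_pred)
print('Precision from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[0,1]) )
print('Recall from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[1,0]) )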

0.450047291314
Average precision score: 0.90
Precision from Confusion Matrix  0.992574257426
Recall from Confusion Matrix  0.815040650407


# What should we interpret from precision_recall_curve? What is auc(precision, recall) computing? Why don't the scores match?
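One possible explanation, sketched in plain Python: `auc(x, y)` applies the trapezoidal rule with its first argument as the x-axis, so `auc(precision, recall)` integrates recall over precision rather than precision over recall. In addition, hard 0/1 predictions give a curve with only three points. The counts below are back-calculated from the BernoulliNB output (recall 0.8150 = 401/492, precision 0.9926 = 401/404) and should be read as an assumption; `trapezoid_area` is an illustrative helper.

```python
def trapezoid_area(x, y):
    """Area under the piecewise-linear curve through the points (x[i], y[i])."""
    area = 0.0
    for i in range(1, len(x)):
        area += (x[i] - x[i - 1]) * (y[i] + y[i - 1]) / 2.0
    return abs(area)   # the sign only reflects the direction in which x runs


# Counts back-calculated from the printed precision and recall (assumption).
tp, fp, n_pos, n_total = 401, 3, 492, 984

# With hard 0/1 predictions used as scores, the PR curve has three points.
precision = [n_pos / n_total, tp / (tp + fp), 1.0]
recall = [1.0, tp / n_pos, 0.0]

swapped = trapezoid_area(precision, recall)   # x=precision, y=recall
auprc = trapezoid_area(recall, precision)     # x=recall, y=precision
# Step-wise average precision: each gain in recall times the precision there.
avg_precision = sum((recall[i] - recall[i + 1]) * precision[i] for i in range(2))

print('swapped axes: {:.6f}'.format(swapped))
print('AUPRC:        {:.6f}'.format(auprc))
print('average precision: {:.2f}'.format(avg_precision))
```

Under these assumptions the swapped-axis area matches the first number printed by the cell, while the step-wise sum matches the reported average precision. The trapezoidal AUPRC and the average precision still differ from each other because one interpolates linearly between points and the other does not.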

In [34]:
logreg = LogisticRegression(C = 1e9)
Y_pred = cross_val_predict(logreg, X, Y, cv=5)

precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
average_precision = average_precision_score(Y, Y_pred)
    #print('fold {} precision {} recall {} thresholds {}'.format(test_no,precision, recall, thresholds))
print(auc(precision, recall))
print('Average precision score: {0:0.2f}'.format(average_precision))

conf_mat = confusion_matrix(Y, Y_pred)
print('Precision from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[0,1]) )
print('Recall from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[1,0]) )

0.451681448633
Average precision score: 0.91
Precision from Confusion Matrix  0.962305986696
Recall from Confusion Matrix  0.882113821138
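The cells above pass hard class predictions to `precision_recall_curve`. A sketch of the same evaluation with out-of-fold probabilities, which gives the curve many thresholds to sweep over, could look like this. It runs on synthetic data from `make_classification` (an assumption, so the numbers say nothing about the credit card data), and `X_demo`, `y_demo` and `y_score` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, average_precision_score, precision_recall_curve
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for X and Y (assumption; not the credit card data).
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     random_state=0)

clf = LogisticRegression(max_iter=1000)
# Out-of-fold probability of the positive class for every row.
y_score = cross_val_predict(clf, X_demo, y_demo, cv=5,
                            method='predict_proba')[:, 1]

precision, recall, thresholds = precision_recall_curve(y_demo, y_score)
print('Points on the PR curve:', len(precision))
# Recall goes on the x-axis, precision on the y-axis.
print('AUPRC (trapezoidal): {:.3f}'.format(auc(recall, precision)))
print('Average precision:   {:.3f}'.format(average_precision_score(y_demo, y_score)))
```

With continuous scores, the trapezoidal AUPRC and the average precision are typically much closer to each other than with 0/1 predictions.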


In [43]:
x_train,x_test,y_train,y_test = train_test_split(X,Y, stratify=Y,test_size=0.35, random_state=1)

print('y_train class counts')
print(y_train.value_counts())
print('')
print('y_test class counts')
print(y_test.value_counts())

lr_model = LogisticRegression()
lr_model.fit(x_train,y_train)

pred = lr_model.predict(x_test)
conf_mat = confusion_matrix(y_test,pred)
print('Precision from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[0,1]) )
print('Recall from Confusion Matrix ', conf_mat[1,1] / (conf_mat[1,1]+conf_mat[1,0]) )
print(conf_mat)
print(classification_report(y_test,pred))

y_train class counts
0    320
1    319
Name: Class, dtype: int64

y_test class counts
1    173
0    172
Name: Class, dtype: int64
Precision from Confusion Matrix  0.974842767296
Recall from Confusion Matrix  0.895953757225
[[168   4]
 [ 18 155]]
             precision    recall  f1-score   support

          0       0.90      0.98      0.94       172
          1       0.97      0.90      0.93       173

avg / total       0.94      0.94      0.94       345



# Why do the precision and recall calculated from the confusion matrix look different from classification_report?
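A quick check in plain Python suggests the two actually agree. The manual formulas treat class 1 as the positive class, which corresponds to the row labelled 1 in the report; the report simply rounds to two decimals and also lists class 0 and an average. `per_class_scores` is an illustrative helper, not a scikit-learn function.

```python
# Confusion matrix printed above: rows are true classes, columns are predictions.
conf_mat = [[168, 4],
            [18, 155]]


def per_class_scores(cm, k):
    """Precision, recall and support when class k is the positive class."""
    tp = cm[k][k]
    predicted_k = cm[0][k] + cm[1][k]   # column sum: everything predicted as k
    actual_k = cm[k][0] + cm[k][1]      # row sum: everything that truly is k
    return tp / predicted_k, tp / actual_k, actual_k


for k in (0, 1):
    p, r, support = per_class_scores(conf_mat, k)
    print('class {}: precision {:.2f}  recall {:.2f}  support {}'.format(
        k, p, r, support))
```

For class 1 this gives 155/159 and 155/173, the same values as the manual calculation, shown at the report's two-decimal precision.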

In [41]:
lr_model = LogisticRegression(class_weight='balanced')
lr_model.fit(x_train,y_train)

pred = lr_model.predict(x_test)
cfn_matrix = confusion_matrix(y_test,pred)
print(cfn_matrix)
print(classification_report(y_test,pred))

[[168   4]
 [ 18 155]]
             precision    recall  f1-score   support

          0       0.90      0.98      0.94       172
          1       0.97      0.90      0.93       173

avg / total       0.94      0.94      0.94       345
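The run with `class_weight='balanced'` prints the same confusion matrix as the default run. A likely reason is that the downsampled training split is already almost perfectly balanced, so the weights come out very close to 1. The sketch below applies the documented heuristic `n_samples / (n_classes * count)` to the training counts printed earlier; the variable names are illustrative.

```python
# Training class counts printed by the train_test_split cell.
class_counts = {0: 320, 1: 319}
n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# Heuristic used for class_weight='balanced': n_samples / (n_classes * count).
balanced_weights = {label: n_samples / (n_classes * count)
                    for label, count in class_counts.items()}
print(balanced_weights)
```

On the original, un-resampled data the same formula would produce a far larger weight for the fraud class, and the option would be expected to matter much more.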



In [29]:
lasso = LogisticRegressionCV(penalty='l1', solver = 'liblinear')
Y_pred = cross_val_predict(lasso, X, Y, cv=5)

precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
average_precision = average_precision_score(Y, Y_pred)
    #print('fold {} precision {} recall {} thresholds {}'.format(test_no,precision, recall, thresholds))
print(auc(precision, recall))
print('Average precision score: {0:0.2f}'.format(average_precision))

0.462464796217
Average precision score: 0.93


In [31]:
lasso = LogisticRegressionCV(penalty='l1', solver = 'liblinear')
lasso.fit(X,Y)
Y_pred = cross_val_predict(lasso, X, Y, cv=5)

precision, recall, thresholds = precision_recall_curve(Y, Y_pred)
average_precision = average_precision_score(Y, Y_pred)
    #print('fold {} precision {} recall {} thresholds {}'.format(test_no,precision, recall, thresholds))
print(auc(precision, recall))
print('Average precision score: {0:0.2f}'.format(average_precision))
print(lasso.coef_)

0.462464796217
Average precision score: 0.93
[[ -1.11892203e-05   0.00000000e+00   0.00000000e+00  -2.15070566e-01
    8.28896994e-01   2.07788599e-01  -1.44692330e-01   0.00000000e+00
   -1.71904787e-01  -1.64278565e-01  -5.01993888e-01   1.48124853e-01
   -3.93618900e-01  -1.10503873e-01  -8.40190365e-01  -7.98821663e-02
    0.00000000e+00   0.00000000e+00   1.14829589e-01  -3.05332237e-02
   -2.51288453e-01   0.00000000e+00   2.61790349e-01  -1.73931929e-01
    1.34162309e-01  -1.07224710e-01   0.00000000e+00   0.00000000e+00
    0.00000000e+00   2.39610701e-03]]


# With the L1 (lasso) penalty, 9 of the 30 coefficients were set to zero. (Is it OK to use an L1-penalized model here as a classifier?)
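On the question itself: `LogisticRegressionCV(penalty='l1')` is still logistic regression, that is, a classifier with an L1 penalty rather than the Lasso regressor, so using it for this task appears reasonable. To see which features were dropped, the zero coefficients can be matched to column names. The sketch below copies the printed coefficients and assumes the column order of `X` is Time, V1 to V28, Amount.

```python
# Coefficients copied from the lasso.coef_ output above.
coef = [-1.11892203e-05, 0.0, 0.0, -2.15070566e-01,
        8.28896994e-01, 2.07788599e-01, -1.44692330e-01, 0.0,
        -1.71904787e-01, -1.64278565e-01, -5.01993888e-01, 1.48124853e-01,
        -3.93618900e-01, -1.10503873e-01, -8.40190365e-01, -7.98821663e-02,
        0.0, 0.0, 1.14829589e-01, -3.05332237e-02,
        -2.51288453e-01, 0.0, 2.61790349e-01, -1.73931929e-01,
        1.34162309e-01, -1.07224710e-01, 0.0, 0.0,
        0.0, 2.39610701e-03]

# Assumed column order of X: Time, V1..V28, Amount.
feature_names = ['Time'] + ['V{}'.format(i) for i in range(1, 29)] + ['Amount']

# Features whose coefficient was shrunk exactly to zero by the L1 penalty.
dropped = [name for name, c in zip(feature_names, coef) if c == 0.0]
print(len(dropped), dropped)
```

In the notebook itself, the same list could be obtained from the fitted model with `X.columns[lasso.coef_[0] == 0]`, which avoids relying on the assumed column order.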