In [1]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

In [2]:
df = pd.read_csv('creditcard.csv')

From the previous data exploration, we concluded a few points:
1. The classes are extremely imbalanced
2. The distributions of the features "Time" and "Amount" are similar between the 2 classes

For the 2nd point, we can simply exclude the 2 features, or even leave them in if we are using an SVM or a similar decision-boundary method.
For the imbalance problem, there are a few approaches to try in this scenario:
1. Upsampling/oversampling the class with fewer observations
2. Downsampling the class with more observations
3. Using weighted loss functions to penalize misclassification of the class with fewer observations
4. Using an evaluation metric that takes class imbalance into consideration
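
As a rough sketch of option 3: scikit-learn classifiers such as `LogisticRegression` accept `class_weight='balanced'`, which weights each class inversely to its frequency. The data below comes from `make_classification` and is only a synthetic stand-in (an assumption for illustration), not the credit card set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Synthetic stand-in data (assumption, illustration only): roughly 2% positives
X_demo, y_demo = make_classification(n_samples=5000, n_features=10,
                                     weights=[0.98, 0.02], random_state=0)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
print(dict(zip(classes, weights)))

# The same weighting applied inside the loss while fitting
weighted_model = LogisticRegression(class_weight='balanced', max_iter=1000)
weighted_model.fit(X_demo, y_demo)
```

The minority class gets a much larger weight, so each misclassified minority sample costs more during training. No rows are added or removed, unlike options 1 and 2.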

Approach:
1. Choose a sampling method using a simple logistic regression model
2. Evaluate the sampling method by F1/Recall/Precision etc.
3. Try more complex algorithms to see if we can achieve better results

In [3]:
from sklearn.preprocessing import StandardScaler

df['normAmount'] = StandardScaler().fit_transform(df['Amount'].to_numpy().reshape(-1,1))
df = df.drop(['Time','Amount'],axis=1)
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,...,V21,V22,V23,V24,V25,V26,V27,V28,Class,normAmount
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0,0.244964
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0,-0.342475
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0,1.160686
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0,0.140534
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0,-0.073403


In [4]:
X = df.iloc[:, df.columns != 'Class'].to_numpy(dtype=float) #data
Y = df.iloc[:, df.columns == 'Class'].to_numpy(dtype=float) #labels

We start without any sampling to get a baseline: a simple train-test split, followed by cross-validation on the training portion.

In [5]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0, stratify=Y)
# loss='log' gives logistic regression; scikit-learn >= 1.1 renamed it to 'log_loss'
model = SGDClassifier(loss='log', max_iter=1000, tol=1e-3)

Note that from now on all "testing" is done on subsets of the training set. The actual test set will not be touched until we have settled on the sampling method, the classification algorithm, and the hyper-parameters.

In [6]:
print(X_train.shape)
print(Y_train.shape)

(227845, 29)
(227845, 1)


In [7]:
from sklearn.model_selection import cross_validate
scoring = ['accuracy', 'precision', 'recall', 'f1']
scores = cross_validate(model, X_train, Y_train.ravel(), cv=5,
                        scoring=scoring,
                        return_train_score=True)

In [8]:
sorted(scores.keys())

['fit_time',
 'score_time',
 'test_accuracy',
 'test_f1',
 'test_precision',
 'test_recall',
 'train_accuracy',
 'train_f1',
 'train_precision',
 'train_recall']

In [9]:
scores['test_accuracy']

array([0.99914417, 0.99910027, 0.99901249, 0.99918804, 0.9990783 ])

In [10]:
scores['test_precision']

array([0.78571429, 0.86538462, 0.925     , 0.90384615, 0.875     ])

In [11]:
scores['test_recall']

array([0.69620253, 0.56962025, 0.46835443, 0.59493671, 0.53846154])

In [12]:
scores['test_f1']

array([0.73825503, 0.6870229 , 0.62184874, 0.71755725, 0.66666667])

Looking at the results, the model did not perform badly for extremely skewed data in such a simple setting! The features already seem to carry enough information for the model to work (more or less).

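In this case, accuracy is largely irrelevant, since predicting the majority class everywhere would already score very high. The logistic regression achieves fairly high precision (about 0.79 to 0.93 across folds), but recall is low (about 0.47 to 0.70).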
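
To make the point about accuracy concrete, here is a minimal sketch with made-up labels (10,000 transactions, 20 of them fraud; these numbers are hypothetical, not taken from the dataset). A "model" that never flags fraud still gets near-perfect accuracy.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10,000 transactions, 20 of them fraud (0.2%)
y_true = np.zeros(10_000, dtype=int)
y_true[:20] = 1

# A "model" that never flags fraud
y_never = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_never)   # fraction of correct predictions
rec = recall_score(y_true, y_never)     # fraction of frauds caught
print(acc, rec)  # 0.998 0.0
```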

Of course, this is without parameter tuning, but the goal of this step is to set a baseline and see what we can improve. 
And since recall = TP/(TP+FN), we will try to lower the number of false negatives further down the line. 
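
Recall is pulled down by false negatives, while precision is pulled down by false positives. A small sketch with made-up predictions (hypothetical, for illustration only) shows how both fall out of the confusion matrix:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])

# For binary labels {0, 1}, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # hurt by false positives
recall = tp / (tp + fn)      # hurt by false negatives
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```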

For now, we should try the sampling methods. 
Let's start with upsampling.

In [13]:
from imblearn.over_sampling import SMOTE

Note that since we are over-sampling, we should set aside a validation set before doing so. The new, synthetic samples are generated from the already existing ones, so running CV directly on the over-sampled set is not a fair comparison: synthetic points derived from a training sample can land in the validation fold and leak information.
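
One way to avoid this kind of leak inside cross-validation is to resample only the training part of each fold. The sketch below is an illustration under stated assumptions, not the method used in this notebook: it uses plain random duplication of minority rows instead of SMOTE, synthetic data from `make_classification`, and a hypothetical helper `oversample_minority` that assumes class 1 is the minority.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption): about 3% positives
X_demo, y_demo = make_classification(n_samples=4000, n_features=10,
                                     weights=[0.97, 0.03], random_state=0)
rng = np.random.default_rng(0)

def oversample_minority(X, y, rng):
    """Randomly duplicate minority rows (class 1) until both classes are equal in size."""
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

fold_scores = []
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, val_idx in cv.split(X_demo, y_demo):
    # Resample the training fold only; the validation fold keeps its real class ratio
    X_tr, y_tr = oversample_minority(X_demo[train_idx], y_demo[train_idx], rng)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    y_hat = clf.predict(X_demo[val_idx])
    fold_scores.append((precision_score(y_demo[val_idx], y_hat, zero_division=0),
                        recall_score(y_demo[val_idx], y_hat)))

print(fold_scores)
```

Because every validation fold contains only original samples at the original class ratio, the scores are comparable to those of the un-resampled baseline. The `Pipeline` class in `imblearn.pipeline` is meant for the same fold-wise resampling and may be a more convenient choice.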

In [14]:
X_train_over, X_val, Y_train_over, Y_val = train_test_split(X_train,Y_train,test_size = 0.1, random_state = 0, stratify=Y_train)
model_over = SGDClassifier(loss='log', max_iter=1000, tol=1e-3)

In [15]:
sm = SMOTE(random_state=1)
X_res, Y_res = sm.fit_resample(X_train_over, Y_train_over)
print("size of training before oversampling: ", X_train_over.shape)
print("size of training after oversampling : ", X_res.shape)

  y = column_or_1d(y, warn=True)


size of training before oversampling:  (205060, 29)
size of training after oversampling :  (409410, 29)


In [16]:
scores = cross_validate(model_over, X_res, Y_res.ravel(), cv=5,
                        scoring=scoring,
                        return_train_score=True)

In [17]:
print("test acc:      ", scores['test_accuracy'])
print("test precision:", scores['test_precision'])
print("test recall   :", scores['test_recall'])

test acc:       [0.94974475 0.94761975 0.94886544 0.94512836 0.94845021]
test precision: [0.97448848 0.97214922 0.96927988 0.9782198  0.96927717]
test recall   : [0.92367065 0.92164334 0.92711463 0.91052979 0.92625974]


In [18]:
model_over.fit(X_res, Y_res)
Y_val_pred = model_over.predict(X_val)

In [19]:
from sklearn.metrics import precision_recall_fscore_support
precision, recall, _, _ = precision_recall_fscore_support(Y_val, Y_val_pred, average='binary')

In [20]:
precision

0.053857350800582245

In [21]:
recall

0.9487179487179487

Here we can see that precision on the untouched validation set is extremely low (~0.05, versus ~0.97 in CV on the balanced, over-sampled data), while recall (sensitivity) is high (~0.95). This means:
1. There are a lot of false positives
2. However, the model captures most positive cases and leaves few false negatives


Imagine deploying this model in the real world. It would probably prevent most credit card fraud, but it would also block a lot of normal daily usage by card holders. 
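
To get a feel for the scale, the precision and recall printed above can be turned back into approximate counts. This is a back-of-the-envelope sketch: the figure of 39 fraud cases in the validation set is an assumption, inferred from the recall being equal to 37/39, and is not printed anywhere in the notebook.

```python
# Values printed above for the fully balanced SMOTE model
precision = 0.053857350800582245
recall = 0.9487179487179487

# Assumption: 39 fraud cases in the validation set (inferred, since recall == 37/39)
n_fraud = 39

tp = round(recall * n_fraud)       # frauds caught
fn = n_fraud - tp                  # frauds missed
flagged = round(tp / precision)    # all transactions flagged as fraud
fp = flagged - tp                  # legitimate transactions flagged

print(tp, fn, flagged, fp)  # 37 2 687 650
```

Under that assumption, the model catches 37 of 39 frauds but flags about 650 legitimate transactions to do so.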

We can try lowering the over-sampling rate: instead of making the classes fully balanced, this time we grow the minority class only up to a chosen fraction of the majority class. 

In [22]:
sm = SMOTE(random_state=1, sampling_strategy=0.1) #sampling_strategy=float to specify ratio of minor/major
X_res, Y_res = sm.fit_resample(X_train_over, Y_train_over)
print("size of training before oversampling: ", X_train_over.shape)
print("size of training after oversampling : ", X_res.shape)
model_over_01 = SGDClassifier(loss='log', max_iter=1000, tol=1e-3)
model_over_01.fit(X_res, Y_res)
Y_val_pred = model_over_01.predict(X_val)
precision, recall, _, _ = precision_recall_fscore_support(Y_val, Y_val_pred, average='binary')
print('precision: ',precision)
print('recall: ', recall)

  y = column_or_1d(y, warn=True)


size of training before oversampling:  (205060, 29)
size of training after oversampling :  (225175, 29)
precision:  0.26865671641791045
recall:  0.9230769230769231


The results show that the over-sampling ratio does affect the results. Changing the ratio trades precision against recall. 

In the first run, where no ratio was specified, SMOTE over-sampled the minority class to the same size as the majority class, so the training set roughly doubled. This resulted in extremely low precision but high recall. 

In the run where the ratio was set to .1, precision increased substantially (from ~.05 to ~.27) while recall decreased by ~.03 (from ~.95 to ~.92).
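
The training-set sizes printed above can be reproduced with a little arithmetic. The sketch assumes that the minority target is floor(ratio * majority count); that assumption is consistent with the sizes printed by the two SMOTE cells.

```python
# Sizes printed by the cells above
n_before = 205060      # training rows before resampling
n_balanced = 409410    # rows after default SMOTE (classes made equal)

n_major = n_balanced // 2       # majority rows (unchanged by SMOTE)
n_minor = n_before - n_major    # original minority rows

# Assumption: minority target = floor(ratio * majority count)
ratio = 0.1
n_minor_target = int(ratio * n_major)
n_after = n_major + n_minor_target

print(n_minor, n_minor_target, n_after)  # 355 20470 225175
```

So with a ratio of .1 the 355 original minority rows grow to 20470, instead of 204705 in the fully balanced case.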

For now let's move on to the other sampling method: down-sampling.
The imblearn package also includes down-sampling methods, but let's write one ourselves here for practice. 

In [23]:
def downsample(data, labels, ratio=1):
    # Assumes binary classification and 2-D labels of shape (n, 1).
    # Majority samples are chosen uniformly at random, without replacement.
    unique, counts = np.unique(labels, return_counts=True)
    maj_class = unique[np.argmax(counts)]
    mnr_class = unique[np.argmin(counts)]
    maj_idx = np.nonzero(labels==maj_class)[0]  # row indices of the majority class
    mnr_idx = np.nonzero(labels==mnr_class)[0]  # row indices of the minority class
    # keep ratio * (minority count) majority rows; int() allows a float ratio,
    # min() keeps the request within the available majority rows
    n_maj_keep = min(int(len(mnr_idx) * ratio), len(maj_idx))
    rand_samp_maj = np.random.choice(maj_idx, n_maj_keep, replace=False)
    rand_samp_data = np.concatenate((data[rand_samp_maj, :], data[mnr_idx,:]), axis=0)
    rand_samp_lbl = np.concatenate((labels[rand_samp_maj, :], labels[mnr_idx,:]), axis=0)
    return rand_samp_data, rand_samp_lbl

In [24]:
X_train_down, X_val, Y_train_down, Y_val = train_test_split(X_train,Y_train,test_size = 0.1, random_state = 0, stratify=Y_train)
X_res, Y_res = downsample(X_train_down, Y_train_down)
print("size of training before downsampling: ", X_train_down.shape)
print("size of training after downsampling : ", X_res.shape)

(205060, 1)
size of training before downsampling:  (205060, 29)
size of training after downsampling :  (710, 29)


In [25]:
model_down = SGDClassifier(loss='log', max_iter=1000, tol=1e-3)
model_down.fit(X_res, Y_res)
Y_val_pred = model_down.predict(X_val)
precision, recall, _, _ = precision_recall_fscore_support(Y_val, Y_val_pred, average='binary')
print('precision: ',precision)
print('recall: ', recall)

precision:  0.02238354506957048
recall:  0.9487179487179487


  y = column_or_1d(y, warn=True)
