# Plan

1. Undersample data
    - Undersample raw data
    - Undersample cleaned data
2. LogReg with raw data (imbalanced)
3. LogReg with data cleaned of extreme values (imbalanced)
4. GNB with undersampled raw data (balanced)
5. GNB with undersampled cleaned data (balanced)

Preprocessing: 
1. Stratified split train-test 
2. Transform training data 
    - log(amount+0.01)
    - clean extreme values 
    - scale all features using a robust scaler 
3. Feature selection 
    - drop time 
    - drop features whose distributions of class 0 & class 1 are similar 
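
The transform step in the list above can be sketched as follows. This is a minimal illustration on synthetic amounts (an assumption, not the real data): it applies `log(amount + 0.01)` so that zero amounts stay finite, then centers by the median and scales by the interquartile range, which is what `sklearn.preprocessing.RobustScaler` does with its default settings. The statistics are computed on the training part only and reused for the test part.

```python
import numpy as np

rng = np.random.default_rng(0)

# synthetic, right-skewed "Amount" values (assumption: stand-in for the real column)
amount_train = rng.lognormal(mean=3.0, sigma=1.5, size=1000)
amount_test = rng.lognormal(mean=3.0, sigma=1.5, size=300)
amount_train[:5] = 0.0  # a few zero-amount transactions

# step 1: log(amount + 0.01) keeps zero amounts finite
ln_train = np.log(amount_train + 0.01)
ln_test = np.log(amount_test + 0.01)

# step 2: robust scaling, statistics taken from the training data only
median = np.median(ln_train)
q1, q3 = np.percentile(ln_train, [25, 75])
iqr = q3 - q1

scaled_train = (ln_train - median) / iqr
scaled_test = (ln_test - median) / iqr  # reuse training statistics, no refit on test

print("train median after scaling:", np.median(scaled_train))
print("train IQR after scaling:", np.subtract(*np.percentile(scaled_train, [75, 25])))
```

Fitting the scaler on the training split alone avoids leaking information about the test distribution into preprocessing.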

# Import

In [15]:
import os

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn

from sklearn.model_selection import train_test_split

### Classifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold 
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB 

import collections

from sklearn.metrics import confusion_matrix,auc,roc_auc_score, ConfusionMatrixDisplay
from sklearn.metrics import recall_score, precision_score, accuracy_score, f1_score

In [4]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Helper functions 

In [5]:
def get_data_by_class(data):
    '''
    separate the data set into fraud transactions (Class 1) and genuine transactions (Class 0)

    input: 
        data: entire data set 
    output: 
        a set of fraud transactions
        a set of genuine transactions 
        (in that order) 
    '''
    fraud = data[data.Class==1]
    not_fraud = data[data.Class==0]

    return fraud, not_fraud

In [6]:
def values_count_bin(data, col):
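    '''
    made for testing, we don't actually have to use this
    prints the count and percentage of rows of data[col] that fall in each of 10 equal-width bins
    '''
    counts, bin_edges = np.histogram(data[col], bins=10)
    for i in range(len(counts)):
        print(f"Between {round(bin_edges[i],1)} and {round(bin_edges[i+1],1)}: \t", end='')
        # percentage is relative to the data passed in, not to the global not_fraud
        print(f"{counts[i]} ({round(100*counts[i]/len(data),2)}%)")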

In [7]:
def eliminate_extreme_amount(data, col='Amount', threshold=5000):
    '''
    eliminate any transactions with Amount higher than a specific threshold. 
    default threshold is $5000

    input: 
        data: entire data set 
        col: default 'Amount'
        threshold: default to be $5000
    output: 
        new data set that only have entries with values <= threshold 
    '''
    new_data = data[(data[col] <= threshold)].copy()

    return new_data

In [8]:
def stratified_train_test_byclass(data, feature_names, by, test_size=0.3, class_name='Class'):
    '''
    split train & test set using stratified split (not random split) so as to avoid 
    distribution shift 
    
    input: 
        data: entire data set 
        feature_names: features that we care about 
        by: list of columns to stratify the split by 
        test_size: fraction of the data held out as the test set (default 0.3)
        class_name: name of the label column (default 'Class')
    output: 
        X_train: set of training observations 
        X_test: set of test observations 
        y_train: label for training set 
        y_test: label for test set
    '''
    X = data[feature_names]
    # the label column is passed in explicitly instead of being read from a global
    y = data[class_name]

    X_train, X_test, y_train, y_test = \
        train_test_split(X, y, test_size=test_size, random_state=1, stratify=data[by])

    return X_train, X_test, y_train, y_test
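
As a rough check of what stratifying on several columns does, the sketch below splits a small synthetic frame (an assumption, not the credit card data) on both `Class` and `TimeHour` and compares the fraud rate in the two parts. `train_test_split` accepts a two-column `stratify` argument and treats each distinct row combination as one stratum, so every combination needs at least two rows.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 4000

# synthetic stand-in: one feature, an hour bucket, and about 5% positives
toy = pd.DataFrame({
    "V1": rng.normal(size=n),
    "TimeHour": rng.integers(0, 4, size=n),
    "Class": (rng.random(n) < 0.05).astype(int),
})

# stratify on the combination of label and hour bucket
train, test = train_test_split(
    toy, test_size=0.3, random_state=1, stratify=toy[["Class", "TimeHour"]]
)

print("fraud rate in train:", train.Class.mean())
print("fraud rate in test: ", test.Class.mean())
```

The two fraud rates should be nearly identical, which is the property the helper above relies on to avoid distribution shift between train and test.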

In [9]:
def get_predictions(clf, X_train, y_train, X_test): 
    '''
    train model according to the specified classifier
    
    input: 
        clf: classifier (as imported from sklearn module)
        X_train: training observations 
        y_train: label for training observations 
        X_test: test observations 
    output: 
        y_predict: predicted label for the test observations 
        y_predict_proba: predicted class probability for the test observations
    '''
    # clf is already an instantiated classifier; fit it to the training data 
    clf.fit(X_train, y_train)
    # predict labels on test data 
    y_predict = clf.predict(X_test)
    # predicted class probabilities, one column per class (column 1 is Class 1) 
    y_predict_proba = clf.predict_proba(X_test)
    return y_predict, y_predict_proba

In [33]:
def print_scores(y_test, y_predict, y_predict_proba):
    '''
    print the scores
    '''
    print("test set confusion matrix:\n", confusion_matrix(y_test, y_predict))
    print("recall score: ", recall_score(y_test, y_predict))
    print("precision score: ", precision_score(y_test,y_predict))
    print("accuracy score: ", accuracy_score(y_test, y_predict))
    print("f1 score: ", f1_score(y_test,y_predict))
    print("ROC AUC: {}".format(roc_auc_score(y_test, y_predict_proba[:,1])))

In [49]:
def random_split_data(df, drop_list=None):
    '''
    drop the columns in drop_list, then split into train & test sets (70/30)
    note: despite the name, the split is stratified on Class
    '''
    # avoid a mutable default argument
    df = df.drop(drop_list or [], axis=1)
    print(df.columns)
    # train_test_split is already imported at the top of the notebook
    y = df['Class'].values #target
    X = df.drop(['Class'],axis=1).values #features
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)

    print("train-set size: ", len(y_train),
      "\ntest-set size: ", len(y_test))
    print("fraud cases in test-set: ", sum(y_test))
    return X_train, X_test, y_train, y_test

# Splitting into Train & Test set 

In [11]:
pj_path = os.getcwd() ## get current path
data_path = os.path.join(pj_path, 'creditcard.csv')

df = pd.read_csv(data_path) # load data from the given csv file

## get feature and class names
feature_names = df.columns[:-1]
class_name = df.columns[-1]

print(df.shape)
df.head()

(284807, 31)


Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0.0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0
1,0.0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0
2,1.0,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0
3,1.0,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0
4,2.0,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0


In [12]:
# eliminate some features in which Fraud and Not Fraud have similar dist (mostly normal)
not_used = ['V15', 'V20', 'V22', 'V24', 'V25', 'V26', 'V28']
feature_names = np.setdiff1d(feature_names, not_used)

In [13]:
fraud, not_fraud = get_data_by_class(df)

Q1 = not_fraud[np.setdiff1d(feature_names, ['Time', 'Amount', 'V1', 'TimeHour'])].quantile(0.25)
Q3 = not_fraud[np.setdiff1d(feature_names, ['Time', 'Amount', 'V1', 'TimeHour'])].quantile(0.75)
IQR = Q3 - Q1
lower_whisker = Q1 - 2.5 * IQR
higher_whisker = Q3 + 2.5 * IQR

not_fraud = not_fraud[~((not_fraud < lower_whisker) |(not_fraud > higher_whisker)).any(axis=1)]
print("Genuine transactions shape: ", not_fraud.shape)
print("Fraud transactions shape: ", fraud.shape)

trimmed_data = pd.concat([fraud, not_fraud]).reset_index(drop=True)
print("Trimmed data shape: ", trimmed_data.shape)

Genuine transactions shape:  (230628, 31)
Fraud transactions shape:  (492, 31)
Trimmed data shape:  (231120, 31)
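
The trimming step above can be reproduced on a small example. The sketch below uses synthetic normal data with one planted extreme value (both assumptions) and the same fence of 2.5 times the IQR; a row is dropped if any of its columns falls outside the fence.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)

# synthetic stand-in for a few of the V-features
toy = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["V2", "V3", "V4"])
toy.loc[0, "V2"] = 50.0  # one planted extreme value

# per-column quartiles and fences
q1 = toy.quantile(0.25)
q3 = toy.quantile(0.75)
iqr = q3 - q1
lower = q1 - 2.5 * iqr
upper = q3 + 2.5 * iqr

# comparisons align the fence Series with the frame's columns
outside = (toy < lower) | (toy > upper)
kept = toy[~outside.any(axis=1)]

print("rows before:", len(toy), "rows after:", len(kept))
```

Because the fences are computed per column, a single extreme coordinate is enough to remove the whole transaction.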


In [14]:
## convert seconds to hour of day (0-23), assign this info to a new column
## (assuming Time spans at most two days, % 24 matches subtracting 24 from hours >= 24)
trimmed_data['TimeHour'] = np.floor(trimmed_data.Time/(60*60)).astype(int) % 24

## eliminate data with Amount > 3000
trimmed_data = eliminate_extreme_amount(trimmed_data, threshold=3000)

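## log(Amount): rows with Amount == 0 are dropped first because log(0) is undefined;
## this also drops the fraud transactions with Amount == 0
trimmed_data = trimmed_data[trimmed_data.Amount != 0].reset_index(drop=True)
trimmed_data['lnAmount'] = np.log(trimmed_data['Amount'])
## note: lnAmount and TimeHour are not in feature_names, so X_train / X_test below still contain the untransformed Amount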

## split train-test sets using stratified split with respect to Class (Fraud or Non-fraud) and Time (24 hours a day)
X_train, X_test, y_train, y_test = \
    stratified_train_test_byclass(trimmed_data, feature_names, [class_name, 'TimeHour'])

# Resampling 

## Undersampling

In [24]:
full_fraud_index = np.array(df[df.Class==1].index)
n_fraud = len(full_fraud_index)
n_fraud

492

In [31]:
# missing 27 fraud transactions with amount=0
len(trimmed_data[trimmed_data.Class==1])

465

In [28]:
# RAW DATA (np.random.choice is unseeded here, so the sampled genuine rows differ between runs)
full_genuine_index = df[df.Class==0].index

full_rus_genuine_index = np.array(np.random.choice(a=full_genuine_index, size=n_fraud, replace=False))
full_rus_index = np.concatenate([full_fraud_index, full_rus_genuine_index])
full_undersample_df = df.iloc[full_rus_index, :]

full_y_undersample = full_undersample_df['Class'].values
full_X_undersample = full_undersample_df.drop(['Class'], axis=1).values

print("Number of transactions in full undersampled data: ", len(full_undersample_df))
print("Ratio between Class 0 & Class 1 in full undersampled data: ", len(full_undersample_df[full_undersample_df.Class==0])/len(full_undersample_df[full_undersample_df.Class==1]))



Number of transactions in full undersampled data:  984
Ratio between Class 0 & Class 1 in full undersampled data:  1.0
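
A reproducible variant of the random undersampling above might look like the sketch below. The fixed seed, the toy frame, and the use of label-based `.loc` are all assumptions made for illustration; the idea is the same: keep every fraud row and draw an equal number of genuine rows without replacement.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed (assumption) so the sample is repeatable

# toy frame: 1000 rows, the first 20 labelled as fraud
toy = pd.DataFrame({"Amount": rng.random(1000), "Class": 0})
toy.loc[:19, "Class"] = 1

fraud_idx = toy.index[toy.Class == 1].to_numpy()
genuine_idx = toy.index[toy.Class == 0].to_numpy()

# draw as many genuine rows as there are fraud rows, without replacement
sampled_genuine_idx = rng.choice(genuine_idx, size=len(fraud_idx), replace=False)

# select by index label, so this also works after rows have been filtered out
balanced = toy.loc[np.concatenate([fraud_idx, sampled_genuine_idx])]

print("rows:", len(balanced), "fraud share:", balanced.Class.mean())
```

Selecting with `.loc` rather than `.iloc` matters once the frame has been filtered, because index labels and row positions no longer coincide.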


In [30]:
# # TRIMMED DATA
# trimmed_genuine_index = trimmed_data[trimmed_data.Class==0].index

# trimmed_rus_genuine_index = np.array(np.random.choice(a=trimmed_genuine_index, size=n_fraud, replace=False))
# trimmed_rus_index = np.concatenate([full_fraud_index, trimmed_rus_genuine_index])
# trimmed_undersample_df = trimmed_data.iloc[trimmed_rus_index, :]

# trimmed_undersample_df

# Logistic Regression 

## Model Assumptions

- LogReg assumes the dependent variable is binary, which holds here: we have 2 labels, 0 (Genuine) and 1 (Fraud). 
- LogReg assumes the log-odds of the positive class are a linear function of the features (it does not require the classes to be linearly separable). We have no direct way to verify this for our data, so we proceed without checking it. 
- LogReg assumes there are no highly influential outliers. Thus, we clean the data set of extreme values to improve the model's performance. 
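
To make the linearity assumption concrete, the sketch below uses a hand-written logistic model with hypothetical weights (not fitted to any data) and checks that the log-odds of the predicted probability equal the linear score, and that changing one feature by one unit shifts the log-odds by exactly that feature's weight.

```python
import math

def sigmoid(z):
    """Map a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# hypothetical weights and intercept, chosen only for illustration
w = [0.8, -1.5]
b = -2.0

def predict_proba_fraud(x):
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return sigmoid(z)

x = [1.0, -0.5]
p = predict_proba_fraud(x)
p_shifted = predict_proba_fraud([x[0] + 1.0, x[1]])

print("log-odds at x:", log_odds(p))
print("change in log-odds after x[0] += 1:", log_odds(p_shifted) - log_odds(p))
```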

## On raw data set

In [50]:
X_train_raw, X_test_raw, y_train_raw, y_test_raw = random_split_data(df)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
train-set size:  199364 
test-set size:  85443
fraud cases in test-set:  148


In [51]:
y_raw_LRpred, y_raw_LRpred_proba = get_predictions(LogisticRegression(C=0.01, penalty='l1', solver='liblinear'),
                                                                X_train_raw, y_train_raw, X_test_raw)

print('LOGISTIC REGRESSION ON RAW DATA RESULTS')
print_scores(y_test_raw, y_raw_LRpred, y_raw_LRpred_proba)

LOGISTIC REGRESSION ON RAW DATA RESULTS
test set confusion matrix:
 [[85280    15]
 [   68    80]]
recall score:  0.5405405405405406
precision score:  0.8421052631578947
accuracy score:  0.9990285921608558
f1 score:  0.6584362139917695
ROC AUC: 0.9380979050449711
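
The scores printed above follow directly from the confusion matrix, which sklearn lays out as `[[TN, FP], [FN, TP]]`. The short check below recomputes them by hand from the four counts in that output.

```python
# counts copied from the confusion matrix printed above
tn, fp = 85280, 15
fn, tp = 68, 80

recall = tp / (tp + fn)                     # share of fraud cases that were caught
precision = tp / (tp + fp)                  # share of fraud alerts that were correct
accuracy = (tp + tn) / (tp + tn + fp + fn)  # dominated by the genuine class
f1 = 2 * precision * recall / (precision + recall)

print("recall:", recall)
print("precision:", precision)
print("accuracy:", accuracy)
print("f1:", f1)
```

Accuracy stays above 99.9% even though 68 of the 148 fraud cases are missed, which is why recall and F1 are the more informative numbers here.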


## On cleaned data set


In [44]:
y_cleaned_LRpred, y_cleaned_LRpred_proba = get_predictions(LogisticRegression(C=0.01, penalty='l1', solver='liblinear'),
                                                                X_train, y_train, X_test)
print('LOGISTIC REGRESSION ON CLEANED DATA RESULTS')
print_scores(y_test, y_cleaned_LRpred, y_cleaned_LRpred_proba)

LOGISTIC REGRESSION ON CLEANED DATA RESULTS
test set confusion matrix:
 [[68754     0]
 [   31   106]]
recall score:  0.7737226277372263
precision score:  1.0
accuracy score:  0.9995500137898999
f1 score:  0.8724279835390947
ROC AUC: 0.9492360258694438


## On undersampled data set 

In [53]:
X_und_train, X_und_test, y_und_train, y_und_test = \
random_split_data(full_undersample_df)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')
train-set size:  688 
test-set size:  296
fraud cases in test-set:  148


In [56]:
y_und_LRpred, y_und_LRpred_proba = get_predictions(LogisticRegression(C=0.01, penalty='l1', solver='liblinear'),
                                                   X_und_train, y_und_train, X_und_test)

print('LOGISTIC REGRESSION ON UNDERSAMPLED DATA RESULTS')
print_scores(y_und_test, y_und_LRpred, y_und_LRpred_proba)

LOGISTIC REGRESSION ON UNDERSAMPLED DATA RESULTS
test set confusion matrix:
 [[145   3]
 [ 19 129]]
recall score:  0.8716216216216216
precision score:  0.9772727272727273
accuracy score:  0.9256756756756757
f1 score:  0.9214285714285714
ROC AUC: 0.9639791818845872


# Gaussian Naive Bayes 

## Model assumptions

- GNB assumes the features are conditionally independent given the class. Real data rarely satisfies this exactly, so we proceed anyway :) This assumption is why the model is named "Naive". 
- GNB assumes that, within each class, every feature follows a Gaussian distribution. This is a stretch for our data, because many features are not normally distributed, but it is the best we can do for now :) 
- The class prior $p(y)$ in the formula in the model summary below means that a heavily imbalanced data set like this one pulls GNB's predictions toward Class 0. A blind guess of Class 0 already yields accuracy >= 99%, which is useless for fraud detection purposes, so we need to balance the data. 
    - Undersampling is preferred because oversampling from about 500 fraud observations to match the roughly 230k genuine observations means over 99% of the fraud observations would be synthetic data, which is hard to trust. 
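
The two figures quoted in the list above can be checked with a few lines of arithmetic. The counts are taken from the outputs earlier in the notebook (284807 rows in the raw data, 492 of them fraud); the "always predict Class 0" classifier is a hypothetical baseline, not one of the models trained here.

```python
n_total = 284807            # rows in the raw data set
n_fraud = 492               # fraud rows in the raw data set
n_genuine = n_total - n_fraud

# baseline that labels every transaction as genuine
baseline_accuracy = n_genuine / n_total
baseline_recall = 0 / n_fraud  # it never flags a fraud case

# oversampling fraud up to the size of the genuine class
synthetic_share = (n_genuine - n_fraud) / n_genuine

print("baseline accuracy:", round(baseline_accuracy, 5))
print("baseline recall:", baseline_recall)
print("share of synthetic fraud rows after oversampling:", round(synthetic_share, 5))
```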

## Model summary

\begin{align}
p(y|data) &= \frac{p(data|y)*p(y)}{p(data)} \\
p(y|data) &\propto p(data|y)*p(y)
\end{align}

Assuming the features $x_1, x_2,..., x_T$ of data are conditionally independent given the class $y$: 
\begin{align}
p(data|y) = p(x_1,..., x_T|y) = \prod_{i=1}^T p(x_i|y)
\end{align}

To simplify the model, we can eliminate features whose distributions are similar in Class 0 and Class 1. In other words, if $p(x_j|y_1) \approx p(x_j|y_2)$ for some feature $x_j$, we can eliminate $x_j$ from the model because: 

\begin{align}
p(y_1|data) &\propto p(data|y_1)*p(y_1) = [\prod_{i=1}^T p(x_i|y_1)] * p(y_1) \\
p(y_2|data) &\propto p(data|y_2)*p(y_2) = [\prod_{i=1}^T p(x_i|y_2)] * p(y_2) 
\end{align}
If $p(x_j|y_1) \approx p(x_j|y_2)$, the factor for $x_j$ nearly cancels between the numerator and the denominator, so the following ratio stays roughly the same whether or not $x_j$ is included. 

\begin{align}
\frac{p(y_1|data)}{p(y_2|data)} = \frac{[\prod_{i=1}^T p(x_i|y_1)] * p(y_1)}{[\prod_{i=1}^T p(x_i|y_2)] * p(y_2)}
\end{align}

In other words, the final classification is not much influenced by the removal of feature $x_j$. 
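
The cancellation argument can be checked numerically. The sketch below writes the Gaussian Naive Bayes log-posterior ratio by hand, using hypothetical per-class means and standard deviations (not estimated from the data). Feature `x2` is given the same distribution in both classes, so dropping it should leave the ratio unchanged.

```python
import math

def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# hypothetical (mean, std) per class and feature; x2 is identical in both classes
params = {
    0: {"x1": (0.0, 1.0), "x2": (0.0, 1.0)},
    1: {"x1": (-3.0, 2.0), "x2": (0.0, 1.0)},
}
log_prior = {0: math.log(0.5), 1: math.log(0.5)}  # balanced classes

def log_posterior_ratio(sample, features):
    """log p(y=1|data) - log p(y=0|data), using only the listed features."""
    score = {}
    for y in (0, 1):
        score[y] = log_prior[y] + sum(
            gaussian_logpdf(sample[f], *params[y][f]) for f in features
        )
    return score[1] - score[0]

sample = {"x1": -2.0, "x2": 0.7}
with_x2 = log_posterior_ratio(sample, ["x1", "x2"])
without_x2 = log_posterior_ratio(sample, ["x1"])

print("log ratio with x2:   ", with_x2)
print("log ratio without x2:", without_x2)
```

In practice the two class-conditional distributions are only approximately equal, so the ratio shifts slightly rather than staying exactly the same.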

## On undersampled data set 

In [55]:
y_und_GNBpred, y_und_GNBpred_proba = get_predictions(GaussianNB(), X_und_train, y_und_train, X_und_test)

print('GAUSSIAN NAIVE BAYES ON UNDERSAMPLED DATA RESULTS')
print_scores(y_und_test, y_und_GNBpred, y_und_GNBpred_proba)

GAUSSIAN NAIVE BAYES ON UNDERSAMPLED DATA RESULTS
test set confusion matrix:
 [[145   3]
 [ 44 104]]
recall score:  0.7027027027027027
precision score:  0.9719626168224299
accuracy score:  0.8412162162162162
f1 score:  0.815686274509804
ROC AUC: 0.9693206720233747
