# Preprocessing of ULB Creditcard Dataset

### Introduction
We use the [ULB Creditcard Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to train our fraud detection model. This notebook preprocesses the dataset and generates features, drawing on the following excellent work:

* **Fraud detection handbook**: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html
* **AWS creditcard fraud detector**: https://github.com/awslabs/fraud-detection-using-machine-learning/blob/master/source/notebooks/sagemaker_fraud_detection.ipynb
* **Creditcard fraud detection predictive models**: https://www.kaggle.com/code/gpreda/credit-card-fraud-detection-predictive-models

In [1]:
import numpy as np 
import pandas as pd

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

### Train and Test split

We assume the creditcard dataset has already been downloaded from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) as `creditcard.csv`. We split it into a train set (90%) and a test set (10%) so that we can evaluate the performance of our models on unseen data.

In [2]:
data = pd.read_csv('creditcard.csv', delimiter=',')
print(data.columns)

Index(['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10',
       'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20',
       'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount',
       'Class'],
      dtype='object')


In [3]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Fraction of fraudulent data:', frauds/(frauds + nonfrauds))

Number of frauds:  492
Number of non-frauds:  284315
Fraction of fraudulent data: 0.001727485630620034


In [4]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

In [5]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

In [6]:
train = pd.DataFrame(np.column_stack([X_train, y_train]), columns = data.columns)
test = pd.DataFrame(np.column_stack([X_test, y_test]), columns = data.columns)

In [7]:
print(sorted(Counter(y_train).items()))
print(sorted(Counter(y_test).items()))

[(0.0, 255880), (1.0, 446)]
[(0.0, 28435), (1.0, 46)]
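
The split above is purely random, so the share of frauds in the train and test sets can differ slightly when the positive class is this rare. As a minimal sketch, assuming one wanted both parts to keep the same class ratio, `train_test_split` accepts a `stratify` argument. The synthetic arrays `X_demo` and `y_demo` below are made up for illustration and are not part of the notebook's pipeline.

```python
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 10,000 rows, 20 of them labelled fraud.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(10_000, 3)).astype("float32")
y_demo = np.zeros(10_000, dtype="float32")
y_demo[:20] = 1.0

# stratify=y_demo asks scikit-learn to preserve the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.1, random_state=42, stratify=y_demo)

train_counts = Counter(y_tr.tolist())
test_counts = Counter(y_te.tolist())
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Whether stratification is worth adopting here is a judgment call; the rest of this notebook keeps the unstratified split shown above.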


### SMOTE
We use the [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/abs/1106.1813), which oversamples the minority class by interpolating new data points between existing minority samples and their nearest minority-class neighbours.
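
The interpolation idea can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not imbalanced-learn's implementation: the function name `smote_like_sample`, the toy points in `minority_demo`, and the choice `k=2` are all hypothetical.

```python
import numpy as np

def smote_like_sample(minority, k=2, rng=None):
    """Return one synthetic point on the segment between a random minority
    sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    base = minority[rng.integers(len(minority))]
    # Euclidean distance from the base point to every minority point.
    dists = np.linalg.norm(minority - base, axis=1)
    # The closest entry is the base point itself (assuming no duplicates), so skip it.
    neighbours = np.argsort(dists)[1:k + 1]
    neighbour = minority[rng.choice(neighbours)]
    lam = rng.random()  # interpolation weight, uniform in [0, 1)
    return base + lam * (neighbour - base)

# Toy minority class: the four corners of the unit square.
minority_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
rng = np.random.default_rng(42)
synthetic = np.array(
    [smote_like_sample(minority_demo, k=2, rng=rng) for _ in range(5)])
print(synthetic)
```

With `k=2`, each corner's nearest neighbours are the two adjacent corners, so every synthetic point falls on an edge of the square rather than in its interior.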

In [8]:
# Hyperparameters
sampling_ratio = 0.1
seed = 42

In [9]:
smote = SMOTE(sampling_strategy=sampling_ratio, random_state=seed)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

In [10]:
print(sorted(Counter(y_smote).items()))

[(0.0, 255880), (1.0, 25588)]
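
These counts follow from the meaning of a float `sampling_strategy`, which imbalanced-learn documents as the desired ratio of minority to majority samples after resampling. The arithmetic below reproduces the numbers from the training split counts printed earlier; the variable names are chosen for this illustration only.

```python
n_majority = 255880   # non-fraud rows in the training split
n_minority = 446      # fraud rows in the training split
sampling_ratio = 0.1  # same value passed as sampling_strategy

# Desired minority count after resampling: ratio times the majority count.
target_minority = round(sampling_ratio * n_majority)
# Rows SMOTE has to synthesise to reach that target.
n_synthetic = target_minority - n_minority
# Total rows in the resampled training set.
n_total = n_majority + target_minority
print(target_minority, n_synthetic, n_total)
```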


In [11]:
train_smote = pd.DataFrame(np.column_stack([X_smote, y_smote]), columns = data.columns)
train_smote

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,28515.000000,1.226643,0.101988,-0.087072,0.111524,-0.281992,-1.356027,0.469050,-0.371725,-0.153672,...,-0.355100,-1.153663,0.109793,0.420318,0.197932,0.699218,-0.114861,0.007583,50.400002,0.0
1,83125.000000,1.124848,0.125602,0.249962,0.489744,-0.040386,0.167561,-0.247614,0.284736,-0.067302,...,-0.192467,-0.576819,0.190343,-0.357451,0.000870,0.139971,-0.000993,0.011505,1.980000,0.0
2,75537.000000,-0.307902,1.003715,1.404277,0.592627,0.311014,-0.382106,0.531393,-0.015292,-0.758638,...,-0.131802,-0.329268,0.046990,0.057413,-0.656960,0.193192,0.142038,0.157501,1.980000,0.0
3,156358.000000,2.174919,-1.535441,-0.726428,-1.430792,-1.517258,-0.751038,-1.155344,-0.180811,-1.111885,...,-0.112766,0.050018,0.294666,1.123322,-0.306025,-0.241343,0.006553,-0.027567,64.000000,0.0
4,162523.000000,-2.221556,1.261987,2.047642,4.659268,-0.535941,4.542044,-3.715525,-5.311701,-0.955321,...,-1.820388,0.873723,-2.648598,-0.162180,-0.492111,0.601490,0.627030,0.088289,379.290009,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
281463,41980.152344,-5.695574,3.813206,-5.777938,3.941758,-3.512781,-1.134814,-6.774862,0.462824,-2.899069,...,2.298104,0.021073,-0.337944,-0.028347,-0.549564,-0.262799,-0.081246,0.032818,11.762793,1.0
281464,42195.148438,-4.260610,3.362606,-4.961251,4.681360,-0.257067,-1.747745,-3.910412,-0.847846,-3.067478,...,1.003952,0.292121,0.016066,-0.158968,-0.020609,0.064920,0.425664,-0.004353,1.000000,1.0
281465,29533.275391,1.097774,2.759352,-3.903291,4.623804,2.770598,-1.768282,1.518948,-0.255309,-2.680552,...,-0.078706,-0.201411,-0.486597,-0.431365,1.204352,0.486749,0.022628,0.147613,1.512134,1.0
281466,85866.867188,-3.577076,2.623373,-5.572700,3.643508,-4.423023,-0.231178,-4.154671,2.416637,-3.222463,...,1.017770,0.405062,0.146872,-0.543900,0.165695,-0.149184,0.710190,-0.281908,296.619537,1.0


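As a sanity check, we look up the rows with `Time == 28515` in both the original data and the **SMOTE**-resampled training set to confirm that the original samples are kept unchanged.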

In [12]:
data[data['Time']==28515.000000]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
17193,28515.0,-1.252973,1.537614,0.522499,0.817049,-0.456092,-0.249557,-0.156303,1.012529,-0.953321,...,0.293052,0.654299,-0.023024,0.21612,-0.32349,-0.323814,0.049625,0.066908,11.48,0
17194,28515.0,-0.1683,0.879467,-0.5727,-0.345212,3.136936,3.264143,0.712733,0.460367,-0.818549,...,0.044611,0.160078,-0.278971,1.013789,0.025695,-0.328249,-0.14769,-0.07013,1.49,0
17195,28515.0,-1.749419,-0.718019,2.503768,0.490997,-0.151449,-0.151134,0.144573,0.408038,0.000561,...,0.385483,0.497336,0.196407,0.267549,0.545669,-0.344469,-0.023884,0.093717,189.0,0
17196,28515.0,1.226643,0.101988,-0.087072,0.111524,-0.281992,-1.356027,0.46905,-0.371725,-0.153672,...,-0.3551,-1.153663,0.109793,0.420318,0.197932,0.699218,-0.114861,0.007583,50.4,0


In [13]:
train_smote[train_smote['Time']==28515.000000]

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,28515.0,1.226643,0.101988,-0.087072,0.111524,-0.281992,-1.356027,0.46905,-0.371725,-0.153672,...,-0.3551,-1.153663,0.109793,0.420318,0.197932,0.699218,-0.114861,0.007583,50.400002,0.0
59453,28515.0,-1.749419,-0.718019,2.503768,0.490997,-0.151449,-0.151134,0.144573,0.408038,0.000561,...,0.385483,0.497336,0.196407,0.267549,0.545669,-0.344469,-0.023884,0.093717,189.0,0.0
208129,28515.0,-1.252973,1.537614,0.522499,0.817049,-0.456092,-0.249557,-0.156303,1.012529,-0.953321,...,0.293052,0.654299,-0.023024,0.21612,-0.32349,-0.323814,0.049625,0.066908,11.48,0.0
250043,28515.0,-0.1683,0.879467,-0.5727,-0.345212,3.136936,3.264143,0.712733,0.460367,-0.818549,...,0.044611,0.160078,-0.278971,1.013789,0.025695,-0.328249,-0.14769,-0.07013,1.49,0.0
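
The same four rows appear in both frames. The only visible difference is `Amount`, shown as 50.4 in the original data and 50.400002 after resampling, which is consistent with the earlier cast of the features to `float32`. A minimal sketch of that rounding effect, using a single value rather than the full frames:

```python
import numpy as np

amount_original = 50.4                        # value as read from the CSV (float64)
amount_float32 = np.float32(amount_original)  # what astype('float32') stores

# Widening back to float64 exposes the rounding introduced by float32.
amount_roundtrip = float(amount_float32)
amount_display = f"{amount_roundtrip:.6f}"
print(amount_display)

# Exact equality fails, so compare with a tolerance suited to float32.
exactly_equal = amount_roundtrip == amount_original
approximately_equal = bool(np.isclose(amount_roundtrip, amount_original, rtol=1e-6))
print(exactly_equal, approximately_equal)
```

When comparing original and resampled rows programmatically, a tolerance-based comparison such as `np.isclose` is therefore safer than `==`.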


Finally, we write the train, test, and SMOTE-resampled train sets to CSV files and upload them to S3 object storage.

In [14]:
train.to_csv('creditcard_train.csv', sep=',', index=False, encoding='utf-8')
test.to_csv('creditcard_test.csv', sep=',', index=False, encoding='utf-8')
train_smote.to_csv('creditcard_train_smote.csv', sep=',', index=False, encoding='utf-8')
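
The `index=False` argument matters when the files are read back later. As a small illustration on a made-up frame (`demo` and the helper `roundtrip_columns` are hypothetical names used only here), writing the index adds an extra unnamed column that `pd.read_csv` then surfaces as `Unnamed: 0`:

```python
import io
import pandas as pd

demo = pd.DataFrame(
    {"Time": [0.0, 1.0], "Amount": [50.4, 1.98], "Class": [0.0, 1.0]})

def roundtrip_columns(frame, index):
    """Write the frame to an in-memory CSV and return the columns read back."""
    buffer = io.StringIO()
    frame.to_csv(buffer, sep=",", index=index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

cols_without_index = roundtrip_columns(demo, index=False)
cols_with_index = roundtrip_columns(demo, index=True)
print(cols_without_index)
print(cols_with_index)
```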

In [15]:
!aws s3 cp ./creditcard_train.csv s3://dmetasoul-bucket/demo/risk/ulb/
!aws s3 cp ./creditcard_test.csv s3://dmetasoul-bucket/demo/risk/ulb/
!aws s3 cp ./creditcard_train_smote.csv s3://dmetasoul-bucket/demo/risk/ulb/

upload: ./creditcard_train.csv to s3://dmetasoul-bucket/demo/risk/ulb/creditcard_train.csv
upload: ./creditcard_test.csv to s3://dmetasoul-bucket/demo/risk/ulb/creditcard_test.csv
upload: ./creditcard_train_smote.csv to s3://dmetasoul-bucket/demo/risk/ulb/creditcard_train_smote.csv


In [16]:
!aws s3 ls s3://dmetasoul-bucket/demo/risk/ulb/

2022-07-20 06:08:16          0 
2022-07-20 06:08:35  150828752 creditcard.csv
2022-07-21 07:45:38    9368900 creditcard_test.csv
2022-07-21 07:45:35   84318327 creditcard_train.csv
2022-07-21 07:45:40   92424439 creditcard_train_smote.csv
