# Preprocessing of ULB Credit Card Dataset

### Introduction
We use the [ULB Credit Card Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to train our fraud detection model. In this notebook, we preprocess the dataset and generate features, drawing on the excellent work listed below:

* **Fraud detection handbook**: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html
* **AWS creditcard fraud detector**: https://github.com/awslabs/fraud-detection-using-machine-learning/blob/master/source/notebooks/sagemaker_fraud_detection.ipynb
* **Creditcard fraud detection predictive models**: https://www.kaggle.com/code/gpreda/credit-card-fraud-detection-predictive-models

In [None]:
import numpy as np 
import pandas as pd

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from collections import Counter

### Train and Test split

We assume the credit card dataset has already been downloaded from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud). We now split it into a training set and a test set so that we can evaluate the performance of our models.

In [None]:
data = pd.read_csv('creditcard.csv', delimiter=',')
print(data.columns)

In [None]:
nonfrauds, frauds = data.groupby('Class').size()
print('Number of frauds: ', frauds)
print('Number of non-frauds: ', nonfrauds)
print('Fraction of fraudulent data:', frauds / (frauds + nonfrauds))

In [None]:
feature_columns = data.columns[:-1]
label_column = data.columns[-1]

features = data[feature_columns].values.astype('float32')
labels = (data[label_column].values).astype('float32')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.1, random_state=42)

In [None]:
train = pd.DataFrame(np.column_stack([X_train, y_train]), columns = data.columns)
test = pd.DataFrame(np.column_stack([X_test, y_test]), columns = data.columns)

In [None]:
print(sorted(Counter(y_train).items()))
print(sorted(Counter(y_test).items()))
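With a class as rare as fraud, a purely random split can leave the test set with a slightly different fraud share than the training set. The split above does not stratify; scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The following is a minimal standard-library sketch of the idea, not the code used in this notebook. The function name `stratified_split` and the toy labels are illustrative assumptions.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so each class keeps roughly the same share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # shuffle within each class
        n_test = round(len(indices) * test_size)  # per-class test quota
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels (hypothetical): 990 negatives and 10 positives.
labels = [0] * 990 + [1] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.1, seed=42)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Each class contributes the same proportion of its samples to the test set, so the positive share is the same in both splits for this toy data.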

### SMOTE
We will be using the [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/abs/1106.1813), which oversamples the minority class by interpolating new data points between existing ones.
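The interpolation step can be illustrated with a small standard-library sketch. It is a simplification under stated assumptions: it uses only the single nearest neighbor, whereas SMOTE picks at random among several nearest minority neighbors. The helper names and the toy points are hypothetical.

```python
import random

def nearest_neighbor(point, candidates):
    """Return the candidate closest to `point` (squared Euclidean distance)."""
    others = [c for c in candidates if c is not point]
    return min(others, key=lambda c: sum((a - b) ** 2 for a, b in zip(point, c)))

def interpolate(point, neighbor, rng):
    """Return a synthetic point on the line segment from `point` to `neighbor`."""
    gap = rng.random()  # uniform in [0, 1)
    return [p + gap * (n - p) for p, n in zip(point, neighbor)]

rng = random.Random(42)

# Toy minority-class samples with two features each (hypothetical values).
minority = [[1.0, 2.0], [1.5, 2.5], [4.0, 0.5]]

base = minority[0]
neighbor = nearest_neighbor(base, minority)
synthetic = interpolate(base, neighbor, rng)
print(synthetic)
```

The synthetic point always lies between an existing minority sample and one of its neighbors, so it stays inside the region already occupied by the minority class.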

In [None]:
# Hyperparameters
sampling_ratio = 0.1
seed = 42

In [None]:
smote = SMOTE(sampling_strategy=sampling_ratio, random_state=seed)
X_smote, y_smote = smote.fit_resample(X_train, y_train)

In [None]:
print(sorted(Counter(y_smote).items()))
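When `sampling_strategy` is a float, imbalanced-learn interprets it as the desired ratio of minority to majority samples after resampling. The arithmetic behind the counts can be sketched as follows. The counts are hypothetical rather than taken from this dataset, `synthetic_count` is an illustrative helper, and the library's exact rounding may differ slightly.

```python
def synthetic_count(n_majority, n_minority, ratio):
    """Minority samples to add so that minority / majority is about `ratio`."""
    target_minority = round(n_majority * ratio)
    # Clamped at zero in this sketch if the ratio is already met.
    return max(target_minority - n_minority, 0)

# Hypothetical class counts, for illustration only.
n_majority, n_minority = 10_000, 20

n_new = synthetic_count(n_majority, n_minority, ratio=0.1)
final_ratio = (n_minority + n_new) / n_majority
print(n_new, final_ratio)
```

The majority class is left untouched; only the minority count grows until it reaches the requested share of the majority count.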

In [None]:
train_smote = pd.DataFrame(np.column_stack([X_smote, y_smote]), columns = data.columns)
train_smote

To check the data after **SMOTE** sampling, we compare the rows with a given `Time` value in the original data and in the resampled training set.

In [None]:
data[data['Time']==28515.000000]

In [None]:
train_smote[train_smote['Time']==28515.000000]

We write the splits to local CSV files, which can then be uploaded to S3 cloud object storage.

In [None]:
train.to_csv('creditcard_train.csv', sep=',', index=False, encoding='utf-8')
test.to_csv('creditcard_test.csv', sep=',', index=False, encoding='utf-8')
train_smote.to_csv('creditcard_train_smote.csv', sep=',', index=False, encoding='utf-8')

Replace the `${MY_S3_BUCKET}` placeholder with your actual S3 bucket location before executing the code cells below that contain it.

In [None]:
!aws s3 cp ./creditcard_train.csv ${MY_S3_BUCKET}/risk/ulb/
!aws s3 cp ./creditcard_test.csv ${MY_S3_BUCKET}/risk/ulb/
!aws s3 cp ./creditcard_train_smote.csv ${MY_S3_BUCKET}/risk/ulb/

In [None]:
!aws s3 ls ${MY_S3_BUCKET}/risk/ulb/