# Autoencoder for credit card fraud detection (PyTorch)

The objective of this exercise is to build a model able to detect fraudulent credit card transactions among normal transactions. For this we train a special type of neural network called an autoencoder. This network has as many output nodes as input nodes, and several hidden layers that usually have lower dimensions. It is trained to reproduce its input, so a transaction that is poorly reconstructed is a candidate for fraud.

The dataset we're going to use can be downloaded from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud) (big file: 144 MB). It contains data about credit card transactions that occurred during a period of two days, with 492 frauds out of 284,807 transactions.

All 30 features in the dataset are numerical. For privacy reasons, the data has been transformed using PCA. The only two features that have not been transformed are `Time` and `Amount`. `Time` contains the number of seconds elapsed between each transaction and the first transaction in the dataset.

The dataset also contains the class of each transaction: 0 = normal transaction; 1 = fraudulent transaction.

## Initialize

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

from sklearn.utils import shuffle

import pandas as pd

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import MinMaxScaler

import torch
import torch.nn as nn

import sys
print(sys.version)
print(torch.__version__)
print('cuda:',torch.version.cuda)

# Choose cpu/gpu (falls back to cpu when CUDA is not available)
use_gpu = torch.cuda.is_available() # set to False to force cpu
if use_gpu:
    print('\nEnable gpu')
    dtype = torch.cuda.FloatTensor
    device = torch.device("cuda")
else:
    print('\nRun on cpu')
    dtype = torch.FloatTensor
    device = torch.device("cpu")


## 1. Explore Data

a) Download the dataset and load it into a pandas DataFrame. Look at the first 10 examples.

b) Separate the data into two classes, `normal` and `fraud`, then remove the class label from these datasets in order to keep only the features.

c) Plot the first 5 features of both normal and fraud data (plotting all features is time-consuming).

d) Split the `normal` dataset into a training and a test sample of equal size.

After the last step you should have 3 datasets:
* normal data used for training
* normal data used for testing
* fraud data used for testing
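
The steps above could look like the following minimal sketch. It runs on a small synthetic DataFrame rather than the Kaggle file, so the column names `V1`, `V2`, `Amount` and all values are assumptions made for illustration; only the `Class` convention (0 = normal, 1 = fraud) comes from the dataset description. With the real data you would start from `pd.read_csv` on the downloaded file instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (hypothetical columns and values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "V1": rng.normal(size=1000),
    "V2": rng.normal(size=1000),
    "Amount": rng.exponential(scale=50.0, size=1000),
    "Class": (rng.random(1000) < 0.02).astype(int),
})
print(df.head(10))

# Separate the two classes and drop the label to keep only the features.
normal = df[df["Class"] == 0].drop(columns="Class")
fraud = df[df["Class"] == 1].drop(columns="Class")

# Split the normal data into a training and a test sample of equal size.
x_train, x_test_normal = train_test_split(normal, test_size=0.5, random_state=0)
x_test_fraud = fraud

print(x_train.shape, x_test_normal.shape, x_test_fraud.shape)
```

The fraud data is not split because it is only used for testing.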

## 2. Rescale data

Since the features have different ranges, we rescale each of them. For this we use the `MinMaxScaler`, which scales and translates each feature individually so that it lies in a given range on the training set, e.g. between zero and one.

See: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

a) Fit and transform the training dataset using the scaler with the `fit_transform` method.

b) Apply the transformation to the test samples using the `transform` method.

c) Plot the first 5 features of the normal and fraud test data and see how they changed.
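
As an illustration of the difference between `fit_transform` and `transform`, here is a small sketch on toy arrays. The arrays `x_train` and `x_test` below are made-up values, not the credit card data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy samples with two features of very different ranges.
x_train = np.array([[0.0, 100.0], [5.0, 300.0], [10.0, 500.0]])
x_test = np.array([[2.5, 200.0], [20.0, 900.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
x_train_scaled = scaler.fit_transform(x_train)  # learns min/max on the training data
x_test_scaled = scaler.transform(x_test)        # reuses the training min/max

print(x_train_scaled)  # rows: [0, 0], [0.5, 0.5], [1, 1]
print(x_test_scaled)   # rows: [0.25, 0.25], [2, 2]
```

Note that the second test row ends up outside [0, 1]: the range is only guaranteed on the training set, which is why the test samples must not be refitted.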

## 3. Partition training data

To monitor how well the model generalizes, split the training data into two parts: a training set and a validation set. You will train your model on 80% of the data and validate it on the remaining 20%.
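
One possible way to do this split is `train_test_split` with `test_size=0.2`. In the sketch below the random array stands in for the rescaled normal training data; the name `x_train_train` matches the one used later in this notebook, while `x_train_valid` and `random_state=42` are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the rescaled normal training data (1000 rows, 30 features).
x_train = np.random.default_rng(0).random((1000, 30))

# 80% for training, 20% for validation.
x_train_train, x_train_valid = train_test_split(
    x_train, test_size=0.2, random_state=42
)
print(x_train_train.shape, x_train_valid.shape)  # (800, 30) (200, 30)
```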

## 4. AutoEncoder model

Now we create the AutoEncoder model. 

Complete the network structure below using linear functions `nn.Linear(dim1,dim2)` (where `dim1` is the input dim of the layer and `dim2` the dimension of the layer output) and sigmoid activation functions `nn.Sigmoid()`:

a) in the encoding part, create layers of dimension 30 (input) - 30 (hidden layer 1) - 25 (hidden layer 2) - 20 (latent space), each with a sigmoid activation function

b) in the decoding part, create layers of dimension 25 (hidden layer 2) - 30 (hidden layer 1) - 30 (output), where only the 1st layer has a sigmoid activation function

c) look at the forward function: what does it return?
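
To show the pattern without giving away the exercise, here is a sketch of a much smaller autoencoder. The class name `ToyAutoencoder` and the dimensions 8 - 6 - 4 are hypothetical and deliberately differ from the ones requested above.

```python
import torch
import torch.nn as nn

class ToyAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: 8 -> 6 -> 4, each linear layer followed by a sigmoid.
        self.encoder = nn.Sequential(
            nn.Linear(8, 6), nn.Sigmoid(),
            nn.Linear(6, 4), nn.Sigmoid(),
        )
        # Decoder: 4 -> 6 -> 8, no activation on the output layer.
        self.decoder = nn.Sequential(
            nn.Linear(4, 6), nn.Sigmoid(),
            nn.Linear(6, 8),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

toy = ToyAutoencoder()
batch = torch.rand(5, 8)           # 5 examples with 8 features each
print(toy.encoder(batch).shape)    # torch.Size([5, 4])
print(toy(batch).shape)            # torch.Size([5, 8])
```

The output has the same shape as the input, which is what allows the reconstruction to be compared with the original example.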


### Hyperparameters of the network

In [None]:
num_epochs = 200
batch_size = 2048
hidden_layer1 = 30
hidden_layer2 = 25
encoding_dim = 20

### AutoEncoder structure

In [None]:
input_dim = x_train_train.shape[1]

class autoencoder(nn.Module):
    def __init__(self):
        super(autoencoder, self).__init__()
        self.encoder = nn.Sequential(
            # FILL HERE
        )
        self.decoder = nn.Sequential(
            # FILL HERE
        )

    def forward(self, x):
        x = self.encoder(x)
        x = self.decoder(x)
        return x

# Settings
if (use_gpu):
    model = autoencoder().cuda() # enable GPU
else:
    model = autoencoder()


## 5. Set the data loading utilities

We now call the [DataLoader](https://pytorch.org/docs/stable/data.html) constructors for the following datasets:
* normal data used for training
* normal data used for validation
* normal data used for testing
* fraud data used for testing

We shuffle the train and validation datasets at loading time so that the learning process does not depend on the order of the data. The test datasets keep their original order, to check that the model copes with inputs that arrive in an arbitrary, possibly biased, order.

See how this is done below (you need to replace your own dataset names where appropriate).

In [None]:
# For training on normal samples
train_loader = torch.utils.data.DataLoader(dataset=#FILL HERE#,
                                          batch_size=batch_size,
                                          shuffle=True)

valid_loader = torch.utils.data.DataLoader(dataset=#FILL HERE#,
                                          batch_size=batch_size,
                                          shuffle=True)

# For testing on fraud examples (shuffle=False)
test_fraud_loader = torch.utils.data.DataLoader(dataset=#FILL HERE#,
                                          batch_size=batch_size,
                                          shuffle=False)

# For testing on unseen normal sample (shuffle=False)
test_normal_loader = torch.utils.data.DataLoader(dataset=#FILL HERE#,
                                          batch_size=batch_size,
                                          shuffle=False)
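
For reference, the following sketch shows what a `DataLoader` yields when it is given a NumPy array. The array `x` and the batch size of 4 are made up for the illustration.

```python
import numpy as np
import torch

# Hypothetical array standing in for one of the prepared datasets (10 rows, 2 features).
x = np.arange(20, dtype=np.float32).reshape(10, 2)

loader = torch.utils.data.DataLoader(dataset=x, batch_size=4, shuffle=False)

# Each iteration returns one batch as a tensor; the last batch may be smaller.
batches = [batch for batch in loader]
print(len(batches))        # 3
print(batches[0].shape)    # torch.Size([4, 2])
print(batches[-1].shape)   # torch.Size([2, 2])
```

If your data is a pandas DataFrame, pass its underlying array (e.g. `.values`), since integer indexing on a DataFrame selects columns rather than rows.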


## 6. Training on normal samples

Run the training of the network on the training sample. To do so, complete the code below by working through the following steps:

a) Choose the mean squared error loss function. See https://pytorch.org/docs/master/nn.html#loss-functions

b) Select the Adam optimizer (= minimization) method with a learning rate of 0.001. See https://pytorch.org/docs/stable/optim.html.

c) Fill the validation step knowing that it is the same structure as the training step but without the minimization part (not needed for validation).

d) Record for each epoch the loss value calculated for the training and validation steps. Make a figure of the training and validation losses as a function of the number of epochs. Do the two curves agree?
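
The structure of a training step followed by a validation step can be sketched on a toy problem as below. The model `toy_model`, the random tensors and the 5 epochs are assumptions for illustration; the sketch processes the whole sample at once instead of looping over a `DataLoader`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
toy_model = nn.Sequential(nn.Linear(4, 2), nn.Sigmoid(), nn.Linear(2, 4))
toy_train = torch.rand(64, 4)   # hypothetical training sample
toy_valid = torch.rand(16, 4)   # hypothetical validation sample

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(toy_model.parameters(), lr=0.001)

train_losses, valid_losses = [], []
for epoch in range(5):
    # Training step: forward pass, loss, backward pass, parameter update.
    toy_model.train()
    loss = criterion(toy_model(toy_train), toy_train)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    train_losses.append(loss.item())

    # Validation step: same forward pass and loss, but no minimization.
    toy_model.eval()
    with torch.no_grad():
        valid_loss = criterion(toy_model(toy_valid), toy_valid)
    valid_losses.append(valid_loss.item())

print(train_losses)
print(valid_losses)
```

The two lists can then be plotted against the epoch number, for example with `plt.plot`.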


In [None]:
criterion = # FILL HERE #

optimizer = # FILL HERE#

for epoch in range(num_epochs):
    
    ###################
    # train the model #
    ###################
    model.train() # prepare model for training
    for data in train_loader:
        data = data.type(dtype)
        output = model(data)
        loss = criterion(output, data)       
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


    ######################    
    # validate the model #
    ######################
    model.eval() # prepare model for evaluation
    for data in valid_loader:
        
        # FILL HERE#
    



## 7. Calculate autoencoder distances

Now we calculate the Euclidean distance between the autoencoder input and output.

$$ \text{distance} = \sqrt{ ||x_{\text{input}} - x_{\text{output}}||^2} = \sqrt{ \sum_i (x^i_{\text{input}} - x^i_{\text{output}})^2}$$

a) See below how this is done for the normal test data, and do the same for the fraud test data.

b) Plot the histograms of the calculated distances of the normal and fraud test data. For better viewing choose a logarithmic scale for the y axis. Comment on the result.
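
As a quick check of the formula, the distance can be evaluated by hand on a tiny example. The tensors `x_input` and `x_output` below are made-up values rather than real autoencoder outputs.

```python
import torch

# Three hypothetical examples with two features each.
x_input = torch.tensor([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
x_output = torch.tensor([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])

# Sum the squared differences over the features (dim=1), then take the square root.
distance = torch.sqrt(torch.sum((x_input - x_output) ** 2, dim=1))
print(distance)  # tensor([0., 5., 1.])
```

Summing over `dim=1` gives one distance per example, which is what the histograms require.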

In [None]:
model.eval() # Sets the module in evaluation mode.
model.cpu()  # Moves all model parameters and buffers to the CPU to avoid out of memory

# Normal test dataset
#--------------------
test_normal_distance = []
with torch.no_grad(): # no gradients are needed to evaluate distances
    for data in test_normal_loader:
        data = data.float().cpu()
        output = model(data)
        # Euclidean distance between input and output, one value per transaction
        test_normal_distance.append(torch.sqrt(torch.sum((data-output)**2, dim=1)))

# concatenate the per-batch tensors and convert to a numpy array
test_normal_distance = torch.cat(test_normal_distance).numpy()

# Fraud test dataset
#-------------------

# FILL HERE #

## 8. Confusion matrix

Build a confusion matrix with a threshold on the distance such that 50% of fraud transactions are detected. What are the true positive and false positive rates in this case? Is this threshold useful?
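
One way to obtain such a threshold is to take the median of the fraud distances. The sketch below assumes this choice and uses randomly generated distances (`normal_dist`, `fraud_dist`) in place of the ones computed by the autoencoder, so the resulting rates are not representative of the real data.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(0)
normal_dist = rng.normal(loc=1.0, scale=0.3, size=1000)  # hypothetical distances
fraud_dist = rng.normal(loc=2.0, scale=0.8, size=50)     # hypothetical distances

# Half of the fraud distances lie above their median.
threshold = np.median(fraud_dist)

distances = np.concatenate([normal_dist, fraud_dist])
y_true = np.concatenate([np.zeros(len(normal_dist), dtype=int),
                         np.ones(len(fraud_dist), dtype=int)])
y_pred = (distances > threshold).astype(int)  # 1 = flagged as fraud

# With labels=[0, 1] the flattened matrix is ordered tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print("TPR:", tp / (tp + fn))
print("FPR:", fp / (fp + tn))
```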

## 9. ROC Curve

Draw the ROC curve for the test sample.
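
A possible approach is to use the distance itself as the score passed to scikit-learn's `roc_curve`. As in any illustration without the real data, the distances below are randomly generated placeholders, so the printed AUC says nothing about the actual model.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

rng = np.random.default_rng(0)
normal_dist = rng.normal(loc=1.0, scale=0.3, size=1000)  # hypothetical distances
fraud_dist = rng.normal(loc=2.0, scale=0.8, size=50)     # hypothetical distances

# A larger distance means the transaction looks more like a fraud.
scores = np.concatenate([normal_dist, fraud_dist])
labels = np.concatenate([np.zeros(1000, dtype=int), np.ones(50, dtype=int)])

fpr, tpr, thresholds = roc_curve(labels, scores)
roc_auc = auc(fpr, tpr)
print("AUC:", roc_auc)
```

Calling `plt.plot(fpr, tpr)` then draws the curve, with the false positive rate on the x axis and the true positive rate on the y axis.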

## 10. Optimise the performance of the NN (optional)

Try the following:
* Change hyperparameters values
* Modify activation functions
* Add one or more layers
* Try [dropout](https://pytorch.org/docs/stable/generated/torch.nn.Dropout.html#torch.nn.Dropout)
* ...
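
If you try dropout, keep in mind that it behaves differently in training and evaluation mode. The sketch below illustrates this on a small hypothetical layer stack; the dimensions and `p=0.5` are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
layer = nn.Sequential(nn.Linear(4, 4), nn.Sigmoid(), nn.Dropout(p=0.5))
x = torch.rand(3, 4)

layer.train()
out_train = layer(x)    # activations are randomly zeroed, the rest are rescaled

layer.eval()
out_eval_1 = layer(x)   # dropout is inactive in evaluation mode
out_eval_2 = layer(x)
print(torch.equal(out_eval_1, out_eval_2))  # True
```

This is why `model.eval()` is called before the validation and test steps.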