# Unsupervised Neural Networks for Anomaly Detection

What if density estimation is not good enough to identify the outliers in our data? Could we use an actual neural network for this purpose?

Recall from previous tutorials that neural networks rely on a loss function for the optimization steps of the training process.
Could we have a loss function without a target for our output? No, not really, but we don't need actual labels to be able to give our neural network a target output. 

What if our target is the same as our input? 🤯

So far this just sounds like a glorified and computationally expensive identity function, but we have one more trick up our sleeve. Enter **the autoencoder** . . . 

## The Autoencoder 

This neural network may be thought of as having two parts:
- **the Encoder**, which takes our initial input and creates a representation of it in a space with a smaller number of dimensions
- **the Decoder**, which takes the *latent representation* (the encoder output) and tries to reconstruct the original input

By adding this *latent space*, we force our network to effectively compress our data. This *informational bottleneck* is what prevents the network from just becoming an identity function.

<img src="./resources/ae.svg" width=400 style="background-color:white;">

Thus, we no longer need labels: we train the network to create a lower-dimensional representation of our inputs and to reconstruct them from it. We could use any distance metric as our loss function, such as the *Mean Squared Error*.
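Written out, and assuming that $f$ denotes the encoder, $g$ the decoder and $x_1, \dots, x_N$ the training samples (these symbols are introduced here only for illustration), the Mean Squared Error objective would read, up to a constant factor given by the number of features:

$$ \mathcal{L} = \frac{1}{N} \sum_{n=1}^{N} || x_n - g(f(x_n)) ||^2 $$

The target inside the norm is the input $x_n$ itself, which is why no labels appear.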

Now you may ask: what does this compression method have to do with detecting anomalies in our data?

Well, the main assumption of this model is that our data can be compressed, which implies that it has an internal structure that may be represented with fewer, more expressive variables. If we run samples through the model that are significantly different from what it learned, it will not be as good at reconstructing them. So we may use the reconstruction error between the input and the output as an **anomaly score**.
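To make the idea concrete without training a network, here is a minimal sketch that swaps the autoencoder for a linear stand-in: a projection onto the top principal directions (PCA via SVD). This is not the model built later in this notebook; the synthetic data and the names `encode`, `decode` and `anomaly_score` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "normal" data: 5 features driven by only 2 hidden factors, plus small noise.
factors = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
normal = factors @ mixing + 0.05 * rng.normal(size=(500, 5))

# Linear stand-in for an autoencoder: keep the top-2 principal directions.
mean = normal.mean(axis=0)
_, _, vt = np.linalg.svd(normal - mean, full_matrices=False)
components = vt[:2]  # shape (2, 5)


def encode(x):
    """Compress 5 features into a 2-dimensional latent representation."""
    return (x - mean) @ components.T


def decode(z):
    """Reconstruct the 5 features from the latent representation."""
    return z @ components + mean


def anomaly_score(x):
    """Mean squared reconstruction error per sample."""
    return np.mean((x - decode(encode(x))) ** 2, axis=1)


# Samples that do not follow the 2-factor structure learned from the normal data.
unusual = 3.0 * rng.normal(size=(20, 5))

print("mean score, normal :", anomaly_score(normal).mean())
print("mean score, unusual:", anomaly_score(unusual).mean())
```

The unusual samples should receive a much larger score, because the compressed representation only captures the structure present in the normal data.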

# Hands-on: Autoencoder for credit card fraud detection

### The Data
The objective of this exercise is to build a model able to detect fraudulent credit card transactions among normal transactions. For this we train a special type of neural network called an autoencoder. This network has as many input nodes as output nodes, and several hidden layers, usually with fewer dimensions.

The dataset we're going to use can be downloaded from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud) (big file: 144 MB). It contains data about credit card transactions that occurred during a period of two days, with 492 frauds out of 284,807 transactions.

All 30 features in the dataset are numerical. For privacy reasons, most of them are the result of a PCA transformation; the only two features that haven't been changed are `Time` and `Amount`. `Time` contains the seconds elapsed between each transaction and the first transaction in the dataset.

The dataset also contains the class of event: 0 = normal transaction; 1 = fraudulent transaction.

### The tasks
For this exercise, nothing has been set up for you. Use what you have already learned to complete all the tasks. Some of the tasks include things that were not discussed in the previous tutorials, but they come with references to resources where you may learn about them. Do as much as you can and remember: *Google is your friend*.

## 0. Import Libraries you may need

In [None]:
# import anything you may need
raise NotImplementedError

## 1. Explore Data

a) Download the dataset and load it in a pandas dataframe. Look at the first 10 examples.

b) Separate the data in two classes `normal` and `fraud`, then remove the class label from these datasets in order to conserve only the features.

c) Plot the first 5 features of both normal and fraud data (plotting all features is time consuming).

d) Split the `normal` dataset into a training and a test sample (each of the same size).

After the last step you should have 3 datasets:
* normal data used for training
* normal data used for testing
* fraud data used for testing
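One possible way to approach steps a), b) and d) is sketched below. So that it runs without the download, it builds a small synthetic DataFrame instead of reading the real file; the file name `creditcard.csv` in the comment and the column names (`Time`, `V1` to `V28`, `Amount`, `Class`, as listed on the Kaggle page) are assumptions to check against your copy of the data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# With the real data this would be something like:
#   df = pd.read_csv("creditcard.csv")   # path is an assumption
# Synthetic stand-in with the same column layout, so the sketch is self-contained.
rng = np.random.default_rng(0)
columns = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount"]
df = pd.DataFrame(rng.normal(size=(1000, 30)), columns=columns)
df["Class"] = (rng.random(1000) < 0.02).astype(int)

# a) look at the first 10 examples
print(df.head(10))

# b) separate the two classes, then drop the label to keep only the features
normal = df[df["Class"] == 0].drop(columns="Class")
fraud = df[df["Class"] == 1].drop(columns="Class")

# d) split the normal data into two halves: one for training, one for testing
normal_train, normal_test = train_test_split(normal, test_size=0.5, random_state=42)

print(normal_train.shape, normal_test.shape, fraud.shape)
```

For step c), calling `normal.iloc[:, :5].hist()` and `fraud.iloc[:, :5].hist()` is one quick option, assuming matplotlib is installed.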

In [None]:
# explore the given dataset
raise NotImplementedError

## 2. Rescale data

Since the features have different ranges, we apply a transformation to each feature. For this we use the `MinMaxScaler`, which scales and translates each feature individually so that, on the training set, it lies in a given range, e.g. between zero and one.

See: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html

a) Fit and transform the training dataset using the scaler with the `fit_transform` method.

b) Apply the transformation to the test samples using the `transform` method.

c) Plot the first 5 features of the normal and fraud test data and see how they changed.
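The difference between `fit_transform` and `transform` can be seen on a toy example. This is only a sketch with made-up numbers, not the credit card data: the scaler learns the minimum and maximum from the training array and reuses them for the test array, so test values may end up outside the chosen range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data: two features with very different ranges.
train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
test = np.array([[12.0, 15.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler.fit_transform(train)  # learns min/max from the training data only
test_scaled = scaler.transform(test)        # reuses the training min/max

print(train_scaled)
print(test_scaled)  # the first feature lands above 1: 12 exceeds the training maximum of 10
```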

In [None]:
# rescale the data
raise NotImplementedError

## 3. Partition training data

After all of this, it's important to partition the data. To check that your model generalizes well, split the training data into two parts: a training set and a validation set. You will train your model on 80% of the data and validate it on the remaining 20%. The validation dataset is just like a test set that gets evaluated during training. You will give this validation sample to the `fit` method of your model. You may take a look [here](https://www.tensorflow.org/guide/keras/train_and_evaluate#using_a_validation_dataset) to get a better idea of how this works.
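One way to obtain the 80/20 partition is scikit-learn's `train_test_split`, sketched here on a random stand-in array; the variable names are assumptions, and in the exercise the input would be your rescaled normal training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x_normal_train = rng.random((1000, 30))  # stand-in for the rescaled training sample

# Keep 80% for training and hold out 20% for validation.
x_train, x_val = train_test_split(x_normal_train, test_size=0.2, random_state=42)

print(x_train.shape, x_val.shape)
```

Since an autoencoder uses its input as its target, the validation sample would later be passed to Keras as `validation_data=(x_val, x_val)`.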

In [None]:
# partition the training dataset
raise NotImplementedError

## 4. AutoEncoder model

Now we create the AutoEncoder model. 

Complete the network structure below using `Dense` layers and sigmoid activation functions:

a) in the encoding part create 3 layers of dimension 30, 25 and 20, each with a sigmoid activation function

b) in the decoding part create 2 layers of dimension 25, 30, where only the 1st layer has a sigmoid activation function

The encoder and decoder may be part of the same model, for an easier model definition. You may just stack all the layers together in a `Sequential` model.
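A minimal sketch of the architecture described above, assuming TensorFlow's bundled Keras and 30 input features; the variable names are illustrative, and this is one possible reading of the task rather than a reference solution.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_features = 30  # assumed number of input features

autoencoder = keras.Sequential(
    [
        keras.Input(shape=(n_features,)),
        # encoder: 30 -> 25 -> 20, each with a sigmoid activation
        layers.Dense(30, activation="sigmoid"),
        layers.Dense(25, activation="sigmoid"),
        layers.Dense(20, activation="sigmoid"),
        # decoder: 25 -> 30, only the first layer has a sigmoid activation
        layers.Dense(25, activation="sigmoid"),
        layers.Dense(30),  # no activation argument, so this output layer is linear
    ]
)

autoencoder.summary()
```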


In [None]:
# define the model here
raise NotImplementedError

## 5. Training on normal samples

Compile the model and train the network on the training sample. To do so, complete the following steps:

a) Choose the mean squared error loss function.

b) Select the Adam optimizer with a learning rate of 0.001 and compile the model.

c) Fit the model using the training and validation data.
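The three steps could look roughly like the sketch below. It is self-contained, so it rebuilds the model and trains on random stand-in data; the number of epochs and the batch size are arbitrary assumptions, not values prescribed by this exercise.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Random stand-ins for the rescaled training and validation samples.
rng = np.random.default_rng(0)
x_train = rng.random((800, 30)).astype("float32")
x_val = rng.random((200, 30)).astype("float32")

autoencoder = keras.Sequential(
    [
        keras.Input(shape=(30,)),
        layers.Dense(30, activation="sigmoid"),
        layers.Dense(25, activation="sigmoid"),
        layers.Dense(20, activation="sigmoid"),
        layers.Dense(25, activation="sigmoid"),
        layers.Dense(30),
    ]
)

# a) + b) mean squared error loss, Adam optimizer with learning rate 0.001
autoencoder.compile(
    optimizer=keras.optimizers.Adam(learning_rate=0.001),
    loss="mse",
)

# c) the input is also the target, for both training and validation data
history = autoencoder.fit(
    x_train,
    x_train,
    validation_data=(x_val, x_val),
    epochs=5,        # arbitrary small value to keep the sketch fast
    batch_size=64,   # arbitrary choice
    verbose=0,
)

print(history.history["loss"][-1], history.history["val_loss"][-1])
```

Comparing `history.history["loss"]` with `history.history["val_loss"]` over the epochs is a simple way to spot overfitting.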



In [None]:
# train the model here
raise NotImplementedError

## 6. Calculate autoencoder distances

Now we calculate the Euclidean distance between the autoencoder input and output.

$$ \text{distance} = \sqrt{ ||x_{\text{input}} - x_{\text{output}}||^2} = \sqrt{ \sum_i (x^i_{\text{input}} - x^i_{\text{output}})^2}$$

a) Compute those distances on both the normal test data and the fraud test data.

b) Plot the histograms of the calculated distances of the normal and fraud test data. For better viewing choose a logarithmic scale for the y axis. Comment on the result.
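The distance formula above translates almost directly into NumPy. In this sketch the helper name `reconstruction_distance` is an assumption, and the toy arrays stand in for the model input and for the output you would get from `autoencoder.predict(x_input)`.

```python
import numpy as np


def reconstruction_distance(x_input, x_output):
    """Euclidean distance between each input row and its reconstruction."""
    return np.sqrt(np.sum((x_input - x_output) ** 2, axis=1))


# Toy check: the second row is a 3-4-5 triangle away from its "reconstruction".
x_in = np.array([[0.0, 0.0], [3.0, 4.0]])
x_out = np.array([[0.0, 0.0], [0.0, 0.0]])

print(reconstruction_distance(x_in, x_out))  # [0. 5.]
```

For the histograms, `matplotlib.pyplot.hist` accepts `log=True`, which gives the logarithmic y axis requested in b).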

In [None]:
# calculate the distances here
raise NotImplementedError

## 7. Confusion matrix

A confusion matrix is a good way of visualizing how well our model classifies. Check out the [Wikipedia](https://en.wikipedia.org/wiki/Confusion_matrix) page to get a better idea about it.

Take a look at [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) to find out how to compute one easily. 

Build a confusion matrix with a threshold on the distance such that 50% of fraud transactions are detected (that is, a true positive rate of 50%). What is the false positive rate in this case? Is this threshold interesting?
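One way to pick such a threshold is to take the median of the fraud distances, since half of the fraud samples lie above it. The sketch below uses synthetic distances drawn from exponential distributions, which is purely an assumption for illustration; in the exercise you would use the distances computed in the previous step.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Synthetic stand-ins for the reconstruction distances.
rng = np.random.default_rng(0)
dist_normal = rng.exponential(scale=1.0, size=2000)
dist_fraud = rng.exponential(scale=3.0, size=100)

# Half of the fraud distances lie above their median.
threshold = np.median(dist_fraud)

distances = np.concatenate([dist_normal, dist_fraud])
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])
y_pred = (distances > threshold).astype(int)  # 1 = flagged as fraud

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

print(cm)
print("true positive rate :", tp / (tp + fn))
print("false positive rate:", fp / (fp + tn))
```

With scikit-learn's convention, rows of the matrix are the true classes and columns are the predicted classes.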

In [None]:
# plot a confusion matrix
raise NotImplementedError

## 8. ROC Curve

Another useful way of visualizing performance is through the use of ROC-curves. You can check what those are [here](https://en.wikipedia.org/wiki/Receiver_operating_characteristic). 

Note that `scikit-learn` has functions to calculate those as well. You just need to figure out how to use them. Maybe [this page](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html) can help you.

Draw the ROC curve for the test sample.
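As a starting point, `sklearn.metrics.roc_curve` can take the reconstruction distance directly as the score, since a larger distance is meant to indicate fraud. The distances below are synthetic stand-ins (an assumption for illustration), to be replaced by the ones from your model.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Synthetic stand-ins for the reconstruction distances on the test samples.
rng = np.random.default_rng(0)
dist_normal = rng.exponential(scale=1.0, size=2000)
dist_fraud = rng.exponential(scale=3.0, size=100)

scores = np.concatenate([dist_normal, dist_fraud])
y_true = np.concatenate([np.zeros(2000, dtype=int), np.ones(100, dtype=int)])

# Each threshold on the distance yields one (false positive rate, true positive rate) point.
fpr, tpr, thresholds = roc_curve(y_true, scores)
auc = roc_auc_score(y_true, scores)

print("number of ROC points:", len(fpr))
print("area under the curve:", auc)
```

Plotting `tpr` against `fpr`, for example with `matplotlib.pyplot.plot(fpr, tpr)`, gives the ROC curve; the diagonal corresponds to random guessing.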

In [None]:
# plot the ROC curve
raise NotImplementedError

## 9. Optimize the performance of the model (optional)

Try the following:
* Change hyperparameter values
* Modify activation functions
* Add one or more layers
* Try [dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout)
* Anything else you may think of . . .

In [None]:
# optional optimization
raise NotImplementedError