## Unsupervised Anomaly Detection
```In this exercise you will use concepts you already know, and perhaps some you are about to meet, to find anomalies in a dataset of credit card transactions.
We will approach the problem the way real anomaly detection problems are approached: your goal is to choose the 1,000 most anomalous samples from the dataset, the ones you suspect to be true anomalies. In real-life problems, those samples are handed to a human researcher for verification. Obviously, if you hand over a lot of regular samples, the researcher will get angry.```

```~Ittai Haran```

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

```Load the dataset. Notice that it is labeled: the labels are there so you can test yourself. In real-life problems you won't have them. Normalize the dataset as you see fit.```

In [None]:
from google.colab import files
uploaded = files.upload()

Saving creditcard.csv to creditcard.csv


In [None]:
df = pd.read_csv('creditcard.csv') ## can be found in: https://drive.google.com/open?id=1wyz2czVFaQWdqRmLAtT5MwCOSlnZ6od9
df.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,Class
0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,0.090794,-0.5516,-0.617801,-0.99139,-0.311169,1.468177,-0.470401,0.207971,0.025791,0.403993,0.251412,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,0
1,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,-0.166974,1.612727,1.065235,0.489095,-0.143772,0.635558,0.463917,-0.114805,-0.183361,-0.145783,-0.069083,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,0
2,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,0.207643,0.624501,0.066084,0.717293,-0.165946,2.345865,-2.890083,1.109969,-0.121359,-2.261857,0.52498,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,0
3,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,-0.054952,-0.226487,0.178228,0.507757,-0.287924,-0.631418,-1.059647,-0.684093,1.965775,-1.232622,-0.208038,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,0
4,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,0.753074,-0.822843,0.538196,1.345852,-1.11967,0.175121,-0.451449,-0.237033,-0.038195,0.803487,0.408542,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,0


In [None]:
from sklearn.preprocessing import MinMaxScaler

label = df['Class']

data = df.drop(columns=["Class"])
# scale every feature to [0, 1]; fit_transform already returns a NumPy array
data = MinMaxScaler().fit_transform(data)

print("data", data.shape)
print("label", label.shape)

data (284807, 28)
label (284807,)


```Your first task is to formulate a method for evaluating your anomalies. Write an evaluation method that will help you compare different ways of detecting anomalies. Notice that this isn't a classification task, and keep your true goal in mind: to mark the 1,000 most anomalous samples.```

In [None]:
from sklearn.metrics import roc_auc_score

def evaluate_method(y_true, grades):
    # y_true is the class: 0 for regular, 1 for anomaly
    # grades indicate how anomalous each sample looks: the higher the grade, the more suspicious the sample
    # NOTE: [:-1000] keeps every sample except the 1,000 highest-graded ones,
    # so the ROC AUC is computed on that remainder; [-1000:] would select the top 1,000 instead
    indices_higher_grades = np.argsort(grades)[:-1000]
    return roc_auc_score(y_true[indices_higher_grades], grades[indices_higher_grades])

all_grades = np.zeros(shape=(data.shape[0], 4))
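
Since the stated goal is to hand over the 1,000 highest-graded samples, one possible complement to ROC AUC is precision at k: the fraction of true anomalies among the k top-graded samples. The sketch below is an illustration only; `precision_at_k` and the toy arrays are hypothetical names that are not part of the exercise.

```python
import numpy as np

def precision_at_k(y_true, grades, k=1000):
    """Fraction of true anomalies among the k highest-graded samples."""
    y_true = np.asarray(y_true)
    grades = np.asarray(grades)
    top_k = np.argsort(grades)[-k:]  # indices of the k largest grades
    return y_true[top_k].mean()

# Toy check: 10 samples, 2 anomalies, both of them graded highest.
toy_labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_grades = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.3, 0.9, 0.8])
toy_precision = precision_at_k(toy_labels, toy_grades, k=2)
print(toy_precision)
```

On the real data this could be called as `precision_at_k(label, distances)` for any of the grade vectors computed later.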

```We can now examine different methods for anomaly detection. Evaluate each method and compare it to the others.```

```The first method we will try is to grade the samples by their distance from the 'mean sample', in units of standard deviation. You can also think of the features as independent Gaussian distributions and grade a sample by its distance from the Gaussian's mean, for every feature.```

In [None]:
data_mean = data.mean(axis=0)
data_std = data.std(axis=0)
distances = np.sum(np.abs(data - data_mean) / data_std, axis=1)

all_grades[:, 0] = distances
print(evaluate_method(label, distances))

0.9228197343854639
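
The independent-Gaussian view mentioned above can be made explicit: if every feature is modeled as its own Gaussian, the negative log-likelihood of a sample is, up to constants, half the sum of its squared z-scores. The sketch below uses squared z-scores rather than the absolute z-scores used in the cell above, and runs on synthetic data; `gaussian_nll_grades` is a hypothetical helper name.

```python
import numpy as np

def gaussian_nll_grades(X):
    """Negative log-likelihood of each row under independent per-feature Gaussians."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # guard against constant features
    z = (X - mean) / std
    constant = np.sum(np.log(std)) + 0.5 * X.shape[1] * np.log(2 * np.pi)
    return 0.5 * np.sum(z ** 2, axis=1) + constant

# Synthetic data: 500 standard-normal samples with one planted outlier in row 0.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 3))
X_toy[0] = [8.0, 8.0, 8.0]

nll_grades = gaussian_nll_grades(X_toy)
print(nll_grades.argmax())  # index of the least likely sample
```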


```What hidden assumption did you make during "training"? Which part of the data did you train on?```

In [None]:
"""
I trained on all the data available (284,807 samples). Since I didn't use the labels, there is no leakage of label information.
The hidden assumption is that the vast majority of the samples are not anomalies, so the 'mean sample' is a good representation of normal credit card transactions.
"""

```Try using PCA: project the dataset into a lower-dimensional space, and then use the "inverse" transformation (why the quotation marks?) to get approximated samples. Compare the samples you got to the samples you started with.```

In [None]:
from sklearn.decomposition import PCA

pca_model = PCA(15)
pca_model.fit(data)

reconstruct_data = pca_model.inverse_transform(pca_model.transform(data))
distances_PCA = np.sum(np.abs(data - reconstruct_data), axis=1)

all_grades[:, 1] = distances_PCA
print(evaluate_method(label, distances_PCA))

0.9159451148290565
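
A small illustration of why the "inverse" transformation is only approximate: once components are dropped, the projection loses information, so mapping back does not recover the original samples, and samples that lie far from the retained subspace get the largest reconstruction error. The data below is synthetic and all variable names are made up for this sketch.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 200 points lying close to a line in 2-D, plus one point far off that line (row 0).
t = rng.normal(size=200)
X_toy = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=200)])
X_toy[0] = [3.0, -6.0]

# Keep a single component: projecting and mapping back is lossy.
pca_1d = PCA(n_components=1).fit(X_toy)
X_hat = pca_1d.inverse_transform(pca_1d.transform(X_toy))
recon_error = np.abs(X_toy - X_hat).sum(axis=1)

# Keeping all components makes the round trip exact (up to floating point).
pca_full = PCA(n_components=2).fit(X_toy)
X_full = pca_full.inverse_transform(pca_full.transform(X_toy))

print(recon_error.argmax())  # the off-line point has the largest error
```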


```Read about one-class SVM. Use it to evaluate your samples. Notice that this algorithm is very slow compared to those you tried earlier. Consider training it on only a fraction of the samples.
Hint: you can use the decision function directly to get the distance of a sample from the decision boundary.```

In [None]:
from sklearn.svm import OneClassSVM

svm_model = OneClassSVM(gamma='auto', verbose=True)
svm_model.fit(data[0:100000,:])

decisions = svm_model.decision_function(data)

[LibSVM]

In [None]:
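# decision_function is positive for inliers and negative for outliers; negate it so that higher = more anomalous
decisions = -decisions
# clip the inlier side to 0: every sample inside the boundary gets the same grade
decisions[decisions<0] = 0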

all_grades[:, 2] = decisions
print(evaluate_method(label, decisions))

0.8657004587518036
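
The cell above trains on the first 100,000 rows. A possible alternative, sketched here on synthetic data, is to train on a random subsample so that the fraction does not depend on row order. The sizes, seed, and variable names are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Synthetic stand-in for the scaled features: a dense cloud plus 5 far-away points.
X_toy = rng.normal(size=(2000, 2))
X_toy[:5] = rng.normal(loc=6.0, scale=0.1, size=(5, 2))

# Train on a random 25% subsample rather than on the first rows.
train_idx = rng.choice(len(X_toy), size=500, replace=False)
svm_toy = OneClassSVM(gamma='auto').fit(X_toy[train_idx])

# decision_function is positive for inliers and negative for outliers,
# so negating it gives "higher = more anomalous".
svm_grades = -svm_toy.decision_function(X_toy)
print(svm_grades[:5])
```

Leaving the grades unclipped keeps the ordering among inliers as well, which can matter for ranking-based evaluation.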


```Now try clustering your data, and use the distance from the clusters (you will have to define it) to grade the samples. Think about changing your normalization method when clustering. Here, too, you might want to train on only a fraction of the samples.```

In [None]:
from sklearn.cluster import KMeans
from numpy.linalg import norm
from tqdm import tqdm

def clustering(nb_cluster):
    # n_jobs was removed from KMeans in scikit-learn 1.0, so it is not passed here
    kmeans_model = KMeans(nb_cluster)
    kmeans_model.fit(data[:100000])

    centers_clusters = kmeans_model.cluster_centers_

    # grade = Euclidean distance to the nearest cluster center
    grades = np.zeros(shape=data.shape[0])
    for i in tqdm(range(data.shape[0])):
        distance_closest_cluster = norm(data[i] - centers_clusters, axis=1)
        grades[i] = np.min(distance_closest_cluster)

    # keep the 10-cluster grades for the combination step
    if nb_cluster == 10:
        all_grades[:, 3] = grades
    print("for ", nb_cluster, "clusters, roc=", evaluate_method(label, grades))

clustering(nb_cluster=2)
clustering(nb_cluster=10)
clustering(nb_cluster=20)

100%|██████████| 284807/284807 [00:06<00:00, 43545.72it/s]


for  2 clusters, roc= 0.8726944576139956


100%|██████████| 284807/284807 [00:07<00:00, 37421.46it/s]


for  10 clusters, roc= 0.9182799531361588


100%|██████████| 284807/284807 [00:04<00:00, 66904.29it/s]


for  20 clusters, roc= 0.9168329281468247
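
The per-sample loop can also be written without explicit iteration: `KMeans.transform` returns the distance from every sample to every cluster center, so the grade is the row-wise minimum. A minimal sketch on synthetic data (the blobs and the isolated point are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two tight blobs plus one isolated point (last row, index 600).
X_toy = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(300, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(300, 2)),
    [[2.5, 10.0]],
])

kmeans_toy = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_toy)

# transform() has shape (n_samples, n_clusters): distance to each center.
kmeans_grades = kmeans_toy.transform(X_toy).min(axis=1)

# Same quantity computed with an explicit loop, for comparison.
loop_grades = np.array([
    np.linalg.norm(x - kmeans_toy.cluster_centers_, axis=1).min() for x in X_toy
])
print(kmeans_grades.argmax())  # index of the sample farthest from any center
```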


```Try combining the grades you got from the different methods into a single grade. Did you get a better detector? Why or why not?```

In [None]:
from sklearn.mixture import GaussianMixture

clf_gaussian = GaussianMixture(n_components=10)
clf_gaussian.fit(all_grades)

centers_clusters = clf_gaussian.means_

final_grade = np.zeros(shape=all_grades.shape[0])
for i in tqdm(range(all_grades.shape[0])):
  distance_closest_cluster = norm(all_grades[i]-centers_clusters, axis=1)
  final_grade[i] = np.min(distance_closest_cluster)

print(evaluate_method(label, final_grade))

100%|██████████| 284807/284807 [00:06<00:00, 42036.95it/s]


0.8594277689633578
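
The four grade columns live on very different scales, which affects any distance computed on them directly. One simple alternative to the mixture-model approach above, shown here only as a sketch and not as the intended solution, is to convert each column to ranks and average them. `combine_by_rank` is a hypothetical helper and the toy matrix is invented.

```python
import numpy as np
from scipy.stats import rankdata

def combine_by_rank(grade_matrix):
    """Average the per-method ranks; output lies in (0, 1], higher = more anomalous."""
    ranks = np.column_stack([rankdata(col) for col in grade_matrix.T])
    return ranks.mean(axis=1) / grade_matrix.shape[0]

# Toy example: 4 samples graded by 2 methods on very different scales.
toy_grades = np.array([
    [0.1, 100.0],
    [0.2, 300.0],
    [0.9, 900.0],
    [0.3, 200.0],
])
combined = combine_by_rank(toy_grades)
print(combined)
```

With the notebook's variables this would be `combine_by_rank(all_grades)`, followed by `evaluate_method(label, ...)` on the result.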


```Now we will experiment with deep autoencoders. The idea is to create a neural network that gets the samples as input and tries to predict the very same samples. The difficulty comes from the fact that the network gets narrower toward the middle, which creates an information bottleneck. The grade each sample gets is its reconstruction error: the difference between the output and the input. You can read more about autoencoders in the literature.
(If you want to know more about autoencoders, read also about the variational autoencoder.)```

In [None]:
import tensorflow as tf
# import Keras through TensorFlow only, to avoid mixing standalone keras with tf.keras
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Sequential

In [None]:
class VAE(tf.keras.Model):
    # Despite the name, this is a plain (deterministic) autoencoder: there is no
    # sampling step and no KL term, only an encoder and a decoder around a bottleneck.

    def __init__(self, input_size):
        super().__init__()

        # encoder: progressively narrower layers down to the bottleneck
        # (layer widths are truncated to integers, e.g. 28 * 0.2 -> 5 units)
        self.encoder = Sequential()
        self.encoder.add(Input(shape=(input_size,)))
        self.encoder.add(Dense(int(input_size * 0.8), activation='tanh'))
        self.encoder.add(Dense(int(input_size * 0.6), activation='tanh'))
        self.encoder.add(Dense(int(input_size * 0.4), activation='tanh'))
        self.encoder.add(Dense(int(input_size * 0.2), activation='tanh'))

        # decoder: mirrors the encoder and maps back to the input dimension
        self.decoder = Sequential()
        self.decoder.add(Dense(int(input_size * 0.2), activation='tanh'))
        self.decoder.add(Dense(int(input_size * 0.4), activation='tanh'))
        self.decoder.add(Dense(int(input_size * 0.6), activation='tanh'))
        self.decoder.add(Dense(int(input_size * 0.8), activation='tanh'))
        self.decoder.add(Dense(input_size, activation='sigmoid'))

    def call(self, inputs):
        h = self.encoder(inputs)
        inputs_recon = self.decoder(h)
        return inputs_recon

In [None]:
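model = VAE(input_size=28)
model.compile(optimizer='rmsprop', loss="mse")

label = df['Class']

# NOTE: this rebinds `data` to the raw, unscaled features (the MinMax-scaled array is discarded).
# The decoder ends in a sigmoid, whose outputs lie in (0, 1), so values outside that range cannot be reconstructed.
data = df.drop(columns=["Class"])
model.fit(data, data, batch_size=500, epochs=20, verbose=2)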

model.summary()

Epoch 1/20
570/570 - 2s - loss: 1.0369
Epoch 2/20
570/570 - 2s - loss: 0.9553
Epoch 3/20
570/570 - 2s - loss: 0.9405
Epoch 4/20
570/570 - 2s - loss: 0.9340
Epoch 5/20
570/570 - 2s - loss: 0.9300
Epoch 6/20
570/570 - 2s - loss: 0.9244
Epoch 7/20
570/570 - 2s - loss: 0.9202
Epoch 8/20
570/570 - 2s - loss: 0.9171
Epoch 9/20
570/570 - 2s - loss: 0.9132
Epoch 10/20
570/570 - 2s - loss: 0.9090
Epoch 11/20
570/570 - 2s - loss: 0.9055
Epoch 12/20
570/570 - 2s - loss: 0.9030
Epoch 13/20
570/570 - 2s - loss: 0.9007
Epoch 14/20
570/570 - 2s - loss: 0.8987
Epoch 15/20
570/570 - 2s - loss: 0.8967
Epoch 16/20
570/570 - 2s - loss: 0.8945
Epoch 17/20
570/570 - 2s - loss: 0.8922
Epoch 18/20
570/570 - 2s - loss: 0.8903
Epoch 19/20
570/570 - 2s - loss: 0.8889
Epoch 20/20
570/570 - 2s - loss: 0.8876
Model: "vae_23"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
sequential_7 (Sequential)    (None, 5)                 1253  

In [None]:
VAE_grades = np.linalg.norm(model.predict(data, batch_size=500)-data, axis = 1)
evaluate_method(label, VAE_grades)

0.9241341865651618

## Bonus

```Try thinking about other methods to detect anomalies in your data, and find a way to get better results.```

In [None]:
from sklearn.cluster import SpectralClustering

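def spectral_clustering(nb_cluster):
    spectral_model = SpectralClustering(nb_cluster, n_jobs=3)
    # fit on the first 10,000 rows only: the affinity matrix is n_samples x n_samples
    spectral_model.fit(data[:10000])

    # with the default RBF affinity this matrix does not depend on nb_cluster,
    # so the grades below are the same for every call
    affinity_matrix = spectral_model.affinity_matrix_

    # a sample with weak affinity to all others has a small column norm, hence a high grade
    grades = 1 / np.linalg.norm(affinity_matrix, axis=0)
    # grades has 10,000 entries, so evaluate_method only looks at labels among the first 10,000 rows
    print("for ", nb_cluster, "clusters, roc=", evaluate_method(label, grades))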

spectral_clustering(nb_cluster=2)
spectral_clustering(nb_cluster=10)
spectral_clustering(nb_cluster=20)

[0.01109431 0.0107819  0.01162391 ... 0.01093071 0.01085852 0.01121639]
for  2 clusters, roc= 0.8978775419491054
[0.01109431 0.0107819  0.01162391 ... 0.01093071 0.01085852 0.01121639]
for  10 clusters, roc= 0.8978775419491054
[0.01109431 0.0107819  0.01162391 ... 0.01093071 0.01085852 0.01121639]
for  20 clusters, roc= 0.8978775419491054
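
One further method that could be tried for the bonus, offered here as a suggestion rather than as part of the original exercise, is scikit-learn's `IsolationForest`. Its `score_samples` returns lower values for more abnormal samples, so the sketch negates it to match the "higher = more anomalous" convention used throughout. The data is synthetic and the variable names are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 1,000 standard-normal samples, the first 3 shifted far away.
X_toy = rng.normal(size=(1000, 4))
X_toy[:3] += 8.0

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_toy)

# score_samples: the lower, the more abnormal; negate to get anomaly grades.
forest_grades = -forest.score_samples(X_toy)
print(forest_grades[:3])
```

On the notebook's data the grades would be passed to `evaluate_method(label, forest_grades)` like those of the other methods.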
