<h1> Anomaly Detection </h1>

<p>
    Anomaly detection algorithms are a class of machine learning techniques used to identify data points that differ markedly from the majority of the data. Such anomalous data may be true outliers or erroneous records that you wish to remove during cleaning to improve the quality of the training data. Alternatively, the anomalies may themselves be what we are interested in identifying. For example, we may wish to detect unusual network traffic caused by a malicious user gaining unauthorized access, or to detect a tumor in MRI scans. Here I apply anomaly detection to credit card fraud.
 </p>
 <p>
    Anomaly detection algorithms are closely related to regular classification and clustering tasks, but they are distinguished by the very low frequency of anomalous data. Anomalies represent a very small fraction of the data: typically less than 1%. One common approach is to use the normal (non-fraudulent) data to build a profile of the expected behaviour. Any data point which would not reasonably belong to this profile is then marked as anomalous.
</p>
<p>
    More formally, we can build a multivariate probability density function (PDF) from the normal data and then, much as in hypothesis testing, ask how likely a datapoint is under this PDF. We then choose a cutoff below which we classify the datapoint as anomalous.
</p>
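
<p>
    The idea can be sketched on synthetic data as follows. Everything here is an illustrative assumption: the two-dimensional sample, the names <code>normal_points</code> and <code>is_anomalous</code>, and the choice of the 1st percentile as the cutoff. It is not the model fitted later in this notebook.
</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical "normal" data: 2-D points clustered around the origin.
normal_points = rng.normal(loc=0.0, scale=1.0, size=(5000, 2))

# Build the profile: estimate the mean vector and covariance matrix.
mu = normal_points.mean(axis=0)
sigma = np.cov(normal_points, rowvar=False)
density = multivariate_normal(mean=mu, cov=sigma)

# Illustrative cutoff: the 1st percentile of the log-density of the normal data.
cutoff = np.percentile(density.logpdf(normal_points), 1)


def is_anomalous(points):
    """Flag points whose log-density under the fitted profile is below the cutoff."""
    log_density = np.atleast_1d(density.logpdf(np.asarray(points, dtype=float)))
    return log_density < cutoff


typical_flag = is_anomalous([0.1, -0.2])[0]   # close to the bulk of the data
distant_flag = is_anomalous([6.0, 6.0])[0]    # far from anything seen so far
print(typical_flag, distant_flag)
```

<p>
    Working with the log-density rather than the density itself avoids numerical underflow when there are many features.
</p>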

<h1> The Dataset </h1>

<p> This dataset consists of 284,807 transactions made by European credit card holders over 2 days in September 2013. 492 of these transactions are fraudulent. Due to confidentiality, the original features are not available. Instead we have the amount, the time and 28 features named V1, V2, ..., V28, which were obtained using principal component analysis.
</p>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.style
from sklearn.metrics import fbeta_score, precision_score, recall_score, confusion_matrix
import itertools

from plotly.offline import download_plotlyjs,init_notebook_mode,plot,iplot

import warnings  
warnings.filterwarnings('ignore')

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    Plot the confusion matrix `cm` with one tick label per entry in `classes`.
    The `normalize` argument is accepted but is not used by this version.
    Copied from a kernel by joparga3 https://www.kaggle.com/joparga3/kernels
    """
    plt.figure()
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.show()

<h1>Exploratory Data Analysis</h1>

In [None]:
matplotlib.style.use('ggplot')

In [None]:
dataset = pd.read_csv('../input/creditcard.csv')#.drop('Time', axis=1)

dataset['Amount'] = np.log(dataset['Amount'] + 1)
dataset['Time'] = np.log(dataset['Time'] + 1)
normal = dataset[dataset['Class'] == 0]

anomaly = dataset[dataset['Class'] == 1]
print(normal.shape)
print(anomaly.shape)

There are 284,315 valid transactions and 492 fraudulent transactions.

Next we split the data into training, validation and test datasets. 

In [None]:
from sklearn.model_selection import train_test_split

# The training set contains normal transactions only (80% of them).
train, normal_test = train_test_split(normal, test_size=.2, random_state=42)

# The remaining normal transactions and all anomalies are split evenly
# between the validation and test sets.
normal_valid, normal_test = train_test_split(normal_test, test_size=.5, random_state=42)
anormal_valid, anormal_test = train_test_split(anomaly, test_size=.5, random_state=42)

train = train.reset_index(drop=True)
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
valid = pd.concat([normal_valid, anormal_valid]).sample(frac=1).reset_index(drop=True)
test = pd.concat([normal_test, anormal_test]).sample(frac=1).reset_index(drop=True)

print('Train shape: ', train.shape)
print('Proportion of anomalies in training set: %.4f\n' % train['Class'].mean())
print('Valid shape: ', valid.shape)
print('Proportion of anomalies in validation set: %.4f\n' % valid['Class'].mean())
print('Test shape: ', test.shape)
print('Proportion of anomalies in test set: %.4f\n' % test['Class'].mean())
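
The shape of this split can be checked on a toy table. The sketch below uses made-up data (1000 normal rows, 20 anomalies) and hypothetical names such as `toy_train`; it only illustrates that the training set ends up free of anomalies while the validation and test sets share them equally.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical toy data: 1000 normal rows (Class 0) and 20 anomalies (Class 1).
toy = pd.DataFrame({
    'feature': rng.normal(size=1020),
    'Class': np.r_[np.zeros(1000, dtype=int), np.ones(20, dtype=int)],
})
toy_normal = toy[toy['Class'] == 0]
toy_anomaly = toy[toy['Class'] == 1]

# 80% of the normal rows go to training; the rest is held out.
toy_train, toy_holdout = train_test_split(toy_normal, test_size=0.2, random_state=42)

# The held-out normal rows and the anomalies are each halved
# between the validation and test sets.
toy_normal_valid, toy_normal_test = train_test_split(toy_holdout, test_size=0.5, random_state=42)
toy_anomaly_valid, toy_anomaly_test = train_test_split(toy_anomaly, test_size=0.5, random_state=42)

toy_valid = pd.concat([toy_normal_valid, toy_anomaly_valid]).reset_index(drop=True)
toy_test = pd.concat([toy_normal_test, toy_anomaly_test]).reset_index(drop=True)

# Number of anomalies in each set.
print(toy_train['Class'].sum(), toy_valid['Class'].sum(), toy_test['Class'].sum())
```

Keeping anomalies out of the training set is what lets the fitted density describe normal behaviour only.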

It is very useful to see how the distributions of normal and fraudulent transactions compare for each feature.

In [None]:
# Amount, Time and the 28 PCA features V1..V28
features = ['Amount', 'Time'] + ['V%d' % i for i in range(1, 29)]

for feature in features:
    ax = plt.subplot()
    sns.distplot(dataset[feature][dataset.Class == 1], bins=50, label='Fraudulent')
    sns.distplot(dataset[feature][dataset.Class == 0], bins=50, label='Normal')
    ax.set_xlabel('')
    ax.set_title('histogram of feature: ' + str(feature))
    plt.legend(loc='best')
    plt.show()


<p>The preceding graphs compare the profile of the normal data with the profile of the fraudulent data for all 30 features. It is clear that some features will be much more important for detecting fraudulent data than others. As an example, compare features V15 and V14. Feature V15 has almost identical distributions for normal and fraudulent data, whereas feature V14 has a sharp peak centered at 0 for normal data but a flatter peak centered at -6 for fraudulent data. A datapoint with a value of -15 for V14 therefore has quite a high probability of being fraudulent. </p>
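
<p>One way to put a number on this visual comparison is a two-sample Kolmogorov-Smirnov statistic per feature. The sketch below is an assumption-laden illustration: it uses synthetic stand-ins rather than the real V14 and V15 columns, and the KS statistic is a swapped-in measure that the analysis above does not itself use.</p>

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions, not the real columns):
# an uninformative feature with the same distribution in both classes ...
uninformative_normal = rng.normal(0.0, 1.0, size=5000)
uninformative_fraud = rng.normal(0.0, 1.0, size=500)
# ... and an informative one where the fraud class is shifted and flatter.
informative_normal = rng.normal(0.0, 1.0, size=5000)
informative_fraud = rng.normal(-6.0, 4.0, size=500)

# The KS statistic is the largest gap between the two empirical CDFs:
# close to 0 for matching distributions, close to 1 for disjoint ones.
ks_uninformative = ks_2samp(uninformative_normal, uninformative_fraud).statistic
ks_informative = ks_2samp(informative_normal, informative_fraud).statistic

print('KS statistic, uninformative feature: %.3f' % ks_uninformative)
print('KS statistic, informative feature:   %.3f' % ks_informative)
```

<p>Applied to the real data, the same call could be run once per column to rank the features by how well they separate the two classes.</p>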

<h1> Training and testing </h1>
<p> We will start with a multivariate normal model, one of the simplest anomaly detection algorithms. It fits a multivariate normal (that is, multivariate Gaussian) distribution to the normal training data and marks as an anomaly any point with a sufficiently low density, i.e. one that lies far from the mean relative to the covariance. </p>
<p>The first step is to estimate the mean vector mu and the covariance matrix sigma from the training data and create the model.</p>
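
<p>For reference, the log-density of a d-dimensional normal distribution is -0.5 * (d * log(2 * pi) + log|sigma| + (x - mu)^T sigma^-1 (x - mu)). The sketch below writes this out term by term on a made-up 3-feature sample and compares it with <code>scipy.stats.multivariate_normal</code>. The helper <code>gaussian_logpdf</code> is hypothetical and assumes a non-singular covariance matrix.</p>

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Hypothetical correlated 3-feature sample standing in for the training data.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 2.0]])
sample = rng.normal(size=(2000, 3)) @ mixing

mu_hat = sample.mean(axis=0)
sigma_hat = np.cov(sample, rowvar=False)


def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at each row of x (sigma must be invertible)."""
    d = mu.shape[0]
    diff = x - mu
    # Squared Mahalanobis distance of each row from the mean.
    mahalanobis_sq = np.einsum('ij,ij->i', diff, np.linalg.solve(sigma, diff.T).T)
    _, log_det = np.linalg.slogdet(sigma)
    return -0.5 * (d * np.log(2 * np.pi) + log_det + mahalanobis_sq)


manual = gaussian_logpdf(sample[:5], mu_hat, sigma_hat)
reference = multivariate_normal(mean=mu_hat, cov=sigma_hat).logpdf(sample[:5])
print(np.allclose(manual, reference))
```

<p>The only term that depends on the datapoint is the squared Mahalanobis distance, so thresholding the log-density is equivalent to thresholding that distance.</p>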

In [None]:
from scipy.stats import multivariate_normal

mu = train.drop('Class', axis=1).mean(axis=0).values
sigma = train.drop('Class', axis=1).cov().values
model = multivariate_normal(cov=sigma, mean=mu, allow_singular=True)

# Median log-density of the normal, then the fraudulent, validation transactions
print(np.median(model.logpdf(valid[valid['Class'] == 0].drop('Class', axis=1).values)))
print(np.median(model.logpdf(valid[valid['Class'] == 1].drop('Class', axis=1).values)))

We then have to determine a threshold value, below which we classify a datapoint as an outlier. This threshold is a hyperparameter. Here it is tuned with a coarse grid search on the validation set, keeping the value that maximizes the F2 score. Plain accuracy would be a poor guide because the classes are so imbalanced.
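
The grid search scores each threshold with `fbeta_score(..., beta=2)`. As a small worked example on made-up labels (the arrays below are hypothetical), the F-beta score combines precision P and recall R as (1 + beta^2) * P * R / (beta^2 * P + R), so beta = 2 weights recall more heavily than precision:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score

# Hypothetical labels: 4 true anomalies among 10 points.
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
# A detector that finds 3 of the 4 anomalies and raises 3 false alarms.
y_pred = np.array([1, 1, 1, 0, 1, 1, 1, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # 3 true positives / 6 flagged
recall = recall_score(y_true, y_pred)        # 3 true positives / 4 anomalies

beta = 2
f2_manual = (1 + beta**2) * precision * recall / (beta**2 * precision + recall)
f2_sklearn = fbeta_score(y_true, y_pred, beta=beta)
f1_sklearn = fbeta_score(y_true, y_pred, beta=1)

print('precision=%.3f recall=%.3f F1=%.3f F2=%.3f'
      % (precision, recall, f1_sklearn, f2_manual))
```

Because recall exceeds precision in this example, the F2 score comes out higher than the F1 score. Choosing F2 reflects a judgment that missing a fraud is worse than raising a false alarm.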

In [None]:
tresholds = np.linspace(-1000,-10, 150)
scores = []
for treshold in tresholds:
    y_hat = (model.logpdf(valid.drop('Class', axis=1).values) < treshold).astype(int)
    scores.append([recall_score(y_pred=y_hat, y_true=valid['Class'].values),
                 precision_score(y_pred=y_hat, y_true=valid['Class'].values),
                 fbeta_score(y_pred=y_hat, y_true=valid['Class'].values, beta=2)])

scores = np.array(scores)
print(scores[:, 2].max(), scores[:, 2].argmax())

Next we plot the recall, precision and F2 score against the threshold to see how the choice of threshold trades one off against the other and how well the model separates the two classes.

In [None]:
plt.plot(tresholds, scores[:, 0], label='$Recall$')
plt.plot(tresholds, scores[:, 1], label='$Precision$')
plt.plot(tresholds, scores[:, 2], label='$F_2$')
plt.ylabel('Score')
# plt.xticks(np.logspace(-10, -200, 3))
plt.xlabel('Threshold')
plt.legend(loc='best')
plt.show()

In [None]:
final_tresh = tresholds[scores[:, 2].argmax()]
y_hat_test = (model.logpdf(test.drop('Class', axis=1).values) < final_tresh).astype(int)

print('Final threshold: %d' % final_tresh)
print('Test Recall Score: %.3f' % recall_score(y_pred=y_hat_test, y_true=test['Class'].values))
print('Test Precision Score: %.3f' % precision_score(y_pred=y_hat_test, y_true=test['Class'].values))
print('Test F2 Score: %.3f' % fbeta_score(y_pred=y_hat_test, y_true=test['Class'].values, beta=2))

cnf_matrix = confusion_matrix(test['Class'].values, y_hat_test)
plot_confusion_matrix(cnf_matrix, classes=['Normal','Anormal'], title='Confusion matrix')

From the confusion matrix it is clear that this model gives fairly good predictions.

We can, however, improve on these by using a Gaussian mixture model (GMM), a probabilistic clustering technique that can be seen as a soft-assignment generalization of K-means. It assumes that the data is produced by a finite number of Gaussian distributions with unknown parameters. We once again fit the model to the normal data only, then use it to score data containing the anomalies we wish to detect.
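
To see why a mixture can help, consider a sketch on synthetic data in which the normal behaviour forms two separate clusters. The data, the variable names and the choice of two components are illustrative assumptions, not properties of the credit card dataset.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Hypothetical "normal" data drawn from two well-separated clusters.
cluster_a = rng.normal(loc=[-5.0, 0.0], scale=1.0, size=(1000, 2))
cluster_b = rng.normal(loc=[5.0, 0.0], scale=1.0, size=(1000, 2))
normal_data = np.vstack([cluster_a, cluster_b])

# One component is equivalent to a single multivariate Gaussian.
single = GaussianMixture(n_components=1, random_state=42).fit(normal_data)
mixture = GaussianMixture(n_components=2, n_init=4, random_state=42).fit(normal_data)

between = np.array([[0.0, 0.0]])     # midway between the clusters, no data nearby
in_cluster = np.array([[5.0, 0.0]])  # centre of one cluster

# score_samples returns the log-density of each row under the fitted model.
single_between = single.score_samples(between)[0]
single_in_cluster = single.score_samples(in_cluster)[0]
mixture_between = mixture.score_samples(between)[0]
mixture_in_cluster = mixture.score_samples(in_cluster)[0]

print('single Gaussian: between=%.2f in_cluster=%.2f' % (single_between, single_in_cluster))
print('2-component GMM: between=%.2f in_cluster=%.2f' % (mixture_between, mixture_in_cluster))
```

The single Gaussian places its mean in the empty region between the clusters and so rates that region as more typical than a cluster centre. The mixture assigns it a much lower log-density, which is the behaviour we want from an anomaly detector.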

In [None]:
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, n_init=4, random_state=42)
gmm.fit(train.drop('Class', axis=1).values)
print(gmm.score(valid[valid['Class'] == 0].drop('Class', axis=1).values))
print(gmm.score(valid[valid['Class'] == 1].drop('Class', axis=1).values))

In [None]:
tresholds = np.linspace(-400, 0, 100)
y_scores = gmm.score_samples(valid.drop('Class', axis=1).values)
scores = []
for treshold in tresholds:
    y_hat = (y_scores < treshold).astype(int)
    scores.append([recall_score(y_pred=y_hat, y_true=valid['Class'].values),
                 precision_score(y_pred=y_hat, y_true=valid['Class'].values),
                 fbeta_score(y_pred=y_hat, y_true=valid['Class'].values, beta=2)])

scores = np.array(scores)
print(scores[:, 2].max(), scores[:, 2].argmax())

In [None]:
plt.plot(tresholds, scores[:, 0], label='$Recall$')
plt.plot(tresholds, scores[:, 1], label='$Precision$')
plt.plot(tresholds, scores[:, 2], label='$F_2$')
plt.ylabel('Score')
plt.xlabel('Threshold')
plt.legend(loc='best')
plt.show()

In [None]:
final_tresh = tresholds[scores[:, 2].argmax()]
y_hat_test = (gmm.score_samples(test.drop('Class', axis=1).values) < final_tresh).astype(int)

print('Final threshold: %f' % final_tresh)
print('Test Recall Score: %.3f' % recall_score(y_pred=y_hat_test, y_true=test['Class'].values))
print('Test Precision Score: %.3f' % precision_score(y_pred=y_hat_test, y_true=test['Class'].values))
print('Test F2 Score: %.3f' % fbeta_score(y_pred=y_hat_test, y_true=test['Class'].values, beta=2))

cnf_matrix = confusion_matrix(test['Class'].values, y_hat_test)
plot_confusion_matrix(cnf_matrix, classes=['Normal','Anormal'], title='Confusion matrix')

<h1> Ethical Considerations </h1>

It is clear that fraud detection greatly benefits everyone. However, deploying these models encourages fraudsters to become smarter and to blend their transactions into the 'regular' ones, making fraud ever more difficult to detect and to prove. It might be better to build a 'profile' from each individual cardholder's transactions and then look for anomalous data relative to that profile.

We must also be careful that a cardholder's legitimate transactions are not cancelled. False positives are definitely not acceptable in these circumstances, as they could, for example, leave someone unable to use their card in an emergency.