# About H2O
The machine learning platform used here is H2O, a fast, scalable, open-source application for machine learning and deep learning.
Big names such as PayPal, Booking.com and Cisco use H2O as their ML platform.
H2O's speciality is in-memory compression, which lets it handle billions of data rows in memory, even on a small cluster.
It offers easy-to-use APIs for R, Python, Scala, Java and JSON, as well as a built-in web interface, Flow.
You can find more information here: https://www.h2o.ai


In [None]:
    import h2o
    from IPython import get_ipython
    import jupyter
    import matplotlib.pyplot as plt
    from pylab import rcParams
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    import os
    from h2o.estimators.deeplearning import H2OAutoEncoderEstimator, H2ODeepLearningEstimator

    h2o.init(max_mem_size = 2) # start (or connect to) a local H2O server with up to 2 GB of memory
    h2o.remove_all() # clear any objects left over in the cluster

# Loading the Dataset
H2O also has a frame type similar to the pandas DataFrame, so most of the data handling can be done with an H2OFrame instead of a DataFrame.

In [None]:
    creditData = pd.read_csv("../input/creditcard.csv") # read data using pandas
    # creditData_df = h2o.import_file(r"File_Path\creditcard.csv") # H2O method
    creditData.describe()

## About the Dataset
The dataset contains 284,807 transactions in total, of which 492 are fraudulent, so the data is highly imbalanced. It contains only numeric input variables. The target variable is 'Class'.
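
To put the imbalance into numbers, here is a small sketch that uses only the two counts quoted above. The "all-normal" baseline is a hypothetical classifier introduced for illustration: it labels every transaction as normal and still reaches a very high accuracy, which is why accuracy alone says little on this dataset.

```python
# Class counts quoted in the dataset description above.
total_transactions = 284_807
fraud_transactions = 492
normal_transactions = total_transactions - fraud_transactions

# Share of fraud in the whole dataset.
fraud_ratio = fraud_transactions / total_transactions

# Accuracy of a hypothetical baseline that labels every transaction "normal".
baseline_accuracy = normal_transactions / total_transactions

print(f"Fraud share: {fraud_ratio:.4%}")
print(f"All-normal baseline accuracy: {baseline_accuracy:.4%}")
```

Fraud makes up roughly 0.17% of the rows, so a model has to be judged on how well it finds that small minority.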

In [None]:
    print("Few Entries: ")
    print(creditData.head())
    print("Dataset Shape: ", creditData.shape)
    print("Maximum Transaction Value: ", np.max(creditData.Amount))
    print("Minimum Transaction Value: ", np.min(creditData.Amount))

In [None]:
    # Turn the pandas DataFrame into an H2OFrame
    creditData_h2o = h2o.H2OFrame(creditData)
    # check if there are any null values
    # creditData.isnull().sum() # pandas method
    creditData_h2o = creditData_h2o.na_omit() # h2o method: returns a new frame without rows containing NAs
    creditData_h2o.nacnt() # no missing values found

# Data Visualization

In [None]:
    # Let's plot the Transaction class against the Frequency
    labels = ['normal','fraud']
    classes = creditData['Class'].value_counts(sort = True)
    classes.plot(kind = 'bar', rot=0)
    plt.title("Transaction class distribution")
    plt.xticks(range(2), labels)
    plt.xlabel("Class")
    plt.ylabel("Frequency")

In [None]:
    fraud = creditData[creditData.Class == 1]
    normal = creditData[creditData.Class == 0]

In [None]:
    # Amount vs Class
    f, (ax1, ax2) = plt.subplots(2,1,sharex=True)
    f.suptitle('Amount per transaction by class')

    ax1.hist(fraud.Amount, bins = 50)
    ax1.set_title('Fraud List')

    ax2.hist(normal.Amount, bins = 50)
    ax2.set_title('Normal')

    plt.xlabel('Amount')
    plt.ylabel('Number of Transactions')
    plt.xlim((0, 10000))
    plt.yscale('log')
    plt.show()

In [None]:
    # time vs Amount
    f, (ax1, ax2) = plt.subplots(2, 1, sharex=True)
    f.suptitle('Time of transaction vs Amount by class')

    ax1.scatter(fraud.Time, fraud.Amount)
    ax1.set_title('Fraud List')

    ax2.scatter(normal.Time, normal.Amount)
    ax2.set_title('Normal')

    plt.xlabel('Time (in seconds)')
    plt.ylabel('Amount')
    plt.show()

In [None]:
    #plotting the dataset considering the class
    color = {1:'red', 0:'yellow'}
    fraudlist = creditData[creditData.Class == 1]
    normal = creditData[creditData.Class == 0]
    fig,axes = plt.subplots(1,2)

    axes[0].scatter(list(range(1,fraudlist.shape[0] + 1)), fraudlist.Amount,color='red')
    axes[1].scatter(list(range(1, normal.shape[0] + 1)), normal.Amount,color='yellow')
    plt.show()

The *Time* variable has little impact on the model's predictions, as the visualizations above suggest.
Before moving on to training, we need to work out which variables are important and which are not,
so that we can drop the unwanted ones.

In [None]:
    features= creditData_h2o.drop(['Time'], axis=1)

# Split the Frame

In [None]:
    # 80% for the training set and 20% for the testing set
    train, test = features.split_frame([0.8])
    print(train.shape)
    print(test.shape)
    #train.describe()
    #test.describe()

In [None]:
    train_df = train.as_data_frame()
    test_df = test.as_data_frame()

    train_df = train_df[train_df['Class'] == 0]
    train_df = train_df.drop(['Class'], axis=1)

    Y_test_df = test_df['Class']

    test_df = test_df.drop(['Class'], axis=1)

    train_df.shape

In [None]:
    train_h2o = h2o.H2OFrame(train_df) # converting to h2o frame
    test_h2o = h2o.H2OFrame(test_df)
    x = train_h2o.columns

# Anomaly Detection
I used an anomaly detection technique for this dataset.
Anomaly detection is a technique for identifying unusual patterns that do not conform to the expected behaviour. These unusual points are called outliers.
It has many applications in business, from fraud detection in credit card transactions to fault detection in operating environments.
Machine learning approaches for anomaly detection:
1.     K-Nearest Neighbour
2.     Autoencoders - Deep neural network
3.     K-means
4.     Support Vector Machine
5.     Naive Bayes
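
To give a feel for how one of the listed approaches scores outliers, here is a minimal K-Nearest Neighbour sketch in NumPy. It is not part of this notebook's pipeline: the function name `knn_anomaly_scores` and the toy data are made up for illustration, and the score used (mean distance to the k nearest neighbours) is only one of several common choices.

```python
import numpy as np

def knn_anomaly_scores(points, k=3):
    """Return the mean distance from each point to its k nearest neighbours."""
    # Pairwise Euclidean distances, shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=2))
    # Sort each row; column 0 is the distance to the point itself (zero), so skip it.
    nearest = np.sort(distances, axis=1)[:, 1:k + 1]
    return nearest.mean(axis=1)

# Toy data (hypothetical): a tight cluster plus one far-away point.
rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
outlier = np.array([[8.0, 8.0]])
points = np.vstack([cluster, outlier])

scores = knn_anomaly_scores(points, k=3)
most_anomalous = int(np.argmax(scores))
print("Index with the highest score:", most_anomalous)
```

Points that sit far from all their neighbours get a large score, which plays the same role as the reconstruction error used later with the autoencoder.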


# Autoencoders
As the algorithm I chose **Autoencoders**, an unsupervised deep learning algorithm.
"Autoencoding" is a data compression algorithm that passes the input through a compressed representation and produces a reconstructed output.


When building the model, 4 fully connected hidden layers were chosen, with [14,7,7,14] nodes per layer.
The first two form the encoder and the last two the decoder.
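
The shape of this architecture can be traced with a small NumPy sketch. It is only an illustration under stated assumptions: the input width of 29 assumes the features are V1 to V28 plus Amount once Time and Class are dropped, the weights are random and untrained, and the linear output layer is a simplification that may not match what H2O does internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed input width: 29 features (V1..V28 and Amount) after dropping Time and Class.
n_features = 29
layer_sizes = [n_features, 14, 7, 7, 14, n_features]

# Random, untrained weights and zero biases, used only to trace the shapes.
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(batch):
    """Run a batch through every layer and return the activations of each one."""
    activations = [batch]
    for i, (w, b) in enumerate(zip(weights, biases)):
        z = activations[-1] @ w + b
        is_output = i == len(weights) - 1
        # Tanh on hidden layers; the output layer is left linear in this sketch.
        activations.append(z if is_output else np.tanh(z))
    return activations

batch = rng.normal(size=(5, n_features))  # five hypothetical standardized rows
activations = forward(batch)
print([a.shape for a in activations])
```

The data is squeezed from 29 columns down to 7 in the encoder and expanded back to 29 in the decoder, so the network has to learn a compact description of a normal transaction.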

In [None]:
    anomaly_model = H2ODeepLearningEstimator(activation = "Tanh",
                                             hidden = [14,7,7,14],
                                             epochs = 100,
                                             standardize = True,
                                             stopping_metric = 'MSE', # MSE for autoencoders
                                             loss = 'automatic',
                                             train_samples_per_iteration = 32,
                                             shuffle_training_data = True,
                                             autoencoder = True,
                                             l1 = 10e-5)
    anomaly_model.train(x=x, training_frame = train_h2o)

## Variable Importance
H2O provides a built-in way to see which variables had the most impact on the model.

In [None]:
    anomaly_model._model_json['output']['variable_importances'].as_data_frame()

In [None]:
    # plotting the variable importance
    rcParams['figure.figsize'] = 14, 8
    #plt.rcdefaults()
    fig, ax = plt.subplots()

    variables = anomaly_model._model_json['output']['variable_importances']['variable']
    var = variables[0:15]
    y_pos = np.arange(len(var))

    scaled_importance = anomaly_model._model_json['output']['variable_importances']['scaled_importance']
    sc = scaled_importance[0:15]

    ax.barh(y_pos, sc, align='center', color='green', ecolor='black')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(var) # label only the 15 plotted variables
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title('Variable Importance')
    plt.show()

In [None]:
    # plotting the loss
    %matplotlib inline
    scoring_history = anomaly_model.scoring_history() # score_history() is a deprecated alias
    rcParams['figure.figsize'] = 14, 8
    plt.plot(scoring_history['training_mse'])
    #plt.plot(scoring_history['validation_mse'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')

## Evaluating the Testing set
The testing set contains both normal and fraud transactions.
Because the model is trained only on normal transactions, it learns the pattern of normal input data.
If an anomalous test point does not match the learned pattern, the autoencoder will likely reconstruct it with a high error, which marks the point as anomalous.
This is how we identify the anomalies in the data.
The error is measured with the **Mean Squared Error** (MSE).
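
The per-row error can be sketched in a few lines of NumPy. The arrays below are hypothetical values, not model output, and the helper `reconstruction_mse` is an illustrative name; the scaling H2O applies before computing its own `Reconstruction.MSE` column may differ.

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    """Per-row mean squared error between inputs and their reconstructions."""
    return np.mean((original - reconstructed) ** 2, axis=1)

# Hypothetical values: two rows are reconstructed well, the last one is not.
original = np.array([[0.1, -0.2, 0.3],
                     [0.0,  0.5, -0.4],
                     [2.0, -3.0, 4.0]])
reconstructed = np.array([[0.1, -0.1, 0.3],
                          [0.1,  0.5, -0.4],
                          [0.5, -0.5, 1.0]])

errors = reconstruction_mse(original, reconstructed)
print(errors)
```

The third row has by far the largest error, so it is the one an error threshold would flag as anomalous.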

In [None]:
    test_rec_error = anomaly_model.anomaly(test_h2o) 
    # anomaly is a H2O function which calculates the error for the dataset
    test_rec_error_df = test_rec_error.as_data_frame() # converting to pandas dataframe

    # plotting the testing dataset against the error
    test_rec_error_df['id']=test_rec_error_df.index
    rcParams['figure.figsize'] = 14, 8
    test_rec_error_df.plot(kind="scatter", x='id', y="Reconstruction.MSE")
    plt.show()

In [None]:
    # predict() on an autoencoder returns the reconstructed features, not class labels
    predictions = anomaly_model.predict(test_h2o)

    error_df = pd.DataFrame({'reconstruction_error': test_rec_error_df['Reconstruction.MSE'],
                            'true_class': Y_test_df})
    error_df.describe()

In [None]:
    # reconstruction error for the normal transactions in the testing dataset
    fig = plt.figure()
    ax = fig.add_subplot(111)
    rcParams['figure.figsize'] = 14, 8
    normal_error_df = error_df[(error_df['true_class']== 0) & (error_df['reconstruction_error'] < 10)]
    _ = ax.hist(normal_error_df.reconstruction_error.values, bins=10)

In [None]:
    # reconstruction error for the fraud transactions in the testing dataset
    fig = plt.figure()
    ax = fig.add_subplot(111)
    rcParams['figure.figsize'] = 14, 8
    fraud_error_df = error_df[error_df['true_class'] == 1]
    _ = ax.hist(fraud_error_df.reconstruction_error.values, bins=10)

### ROC Curve

In [None]:
    from sklearn.metrics import (confusion_matrix, precision_recall_curve, auc,
                                 roc_curve, recall_score, classification_report, f1_score,
                                 precision_recall_fscore_support)
    fpr, tpr, thresholds = roc_curve(error_df.true_class, error_df.reconstruction_error)
    roc_auc = auc(fpr, tpr)

    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, label='AUC = %0.4f'% roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0,1],[0,1],'r--')
    plt.xlim([-0.001, 1])
    plt.ylim([0, 1.001])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show();

### Precision and Recall
Since the data is highly imbalanced, accuracy alone is not a meaningful measure of performance.
Precision and recall were chosen as the metrics for the classification task.

**Precision**: Measuring the relevancy of obtained results. 
[ True positives / (True positives + False positives)]

**Recall**: Measuring how many relevant results are returned.
[ True positives / (True positives + False negatives)]

**True Positives** - Number of actual frauds predicted as frauds

**False Positives** - Number of non-frauds predicted as frauds

**False Negatives** - Number of frauds predicted as non-frauds.
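
The two formulas translate directly into code. The counts in this sketch are hypothetical and are not results from this notebook; `precision_recall` is an illustrative helper name.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Compute precision and recall from confusion-matrix counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Hypothetical counts, not results from this notebook.
precision, recall = precision_recall(true_positives=80,
                                     false_positives=20,
                                     false_negatives=10)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

With these counts, 80 of the 100 flagged transactions are real frauds (precision) and 80 of the 90 real frauds are caught (recall).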


In [None]:
    precision, recall, th = precision_recall_curve(error_df.true_class, error_df.reconstruction_error)
    plt.plot(recall, precision, 'b', label='Precision-Recall curve')
    plt.title('Recall vs Precision')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.show()

We need to find a better threshold that can separate the anomalies from the normal transactions. One way is to take the point where the precision and recall curves intersect on the **Precision/Recall vs Threshold** graph.
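
A rough sketch of how such a crossover point could be located numerically: scan the candidate thresholds and keep the one where precision and recall are closest. The helper `crossover_threshold` and the toy scores are hypothetical, and a point is counted as fraud when its score is greater than or equal to the threshold, which is an assumption of this sketch.

```python
import numpy as np

def crossover_threshold(y_true, scores):
    """Return (threshold, precision, recall) where precision and recall are closest."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    best = None
    for threshold in np.unique(scores):
        predicted = scores >= threshold
        tp = np.sum(predicted & (y_true == 1))
        fp = np.sum(predicted & (y_true == 0))
        fn = np.sum(~predicted & (y_true == 1))
        if tp == 0:
            continue  # precision is undefined or zero here, so skip
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        gap = abs(precision - recall)
        if best is None or gap < best[0]:
            best = (gap, float(threshold), float(precision), float(recall))
    if best is None:
        raise ValueError("no threshold yields a true positive")
    return best[1:]

# Hypothetical reconstruction errors and labels (1 = fraud).
y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.01, 0.02, 0.02, 0.03, 0.04, 0.05, 0.06, 0.30, 0.40, 0.90]

threshold, precision, recall = crossover_threshold(y_true, scores)
print(threshold, precision, recall)
```

The crossover is only a starting point. In fraud detection a missed fraud usually costs more than a false alarm, so a lower threshold that favours recall may be preferable.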

In [None]:
    # precision and recall have one more element than th; the last pair has no threshold, so drop it
    plt.plot(th, precision[:-1], label="Precision",linewidth=5)
    plt.plot(th, recall[:-1], label="Recall",linewidth=5)
    plt.title('Precision and recall for different threshold values')
    plt.xlabel('Threshold')
    plt.ylabel('Precision/Recall')
    plt.legend()
    plt.show()

In [None]:
    # plot the testing set with the threshold
    threshold = 0.01
    groups = error_df.groupby('true_class')
    fig, ax = plt.subplots()

    for name, group in groups:
        ax.plot(group.index, group.reconstruction_error, marker='o', ms=3.5, linestyle='',
                label= "Fraud" if name == 1 else "Normal")
    ax.hlines(threshold, ax.get_xlim()[0], ax.get_xlim()[1], colors="r", zorder=100, label='Threshold')
    ax.legend()
    plt.title("Reconstruction error for different classes")
    plt.ylabel("Reconstruction error")
    plt.xlabel("Data point index")
    plt.show();

### Confusion Matrix

In [None]:
    import seaborn as sns
    LABELS = ['Normal', 'Fraud']
    y_pred = [1 if e > threshold else 0 for e in error_df.reconstruction_error.values]
    conf_matrix = confusion_matrix(error_df.true_class, y_pred)
    plt.figure(figsize=(12, 12))
    sns.heatmap(conf_matrix, xticklabels=LABELS, yticklabels=LABELS, annot=True, fmt="d");
    plt.title("Confusion matrix")
    plt.ylabel('True class')
    plt.xlabel('Predicted class')
    plt.show()


### Classification Report
      

In [None]:
    csr = classification_report(error_df.true_class, y_pred)
    print(csr)