# 0. Introduction

This notebook runs a fraud detection model for credit card transactions using the Google Cloud Platform (GCP). GCP has several advantages for Machine Learning. It lets you store and process, in a distributed manner, large datasets that do not fit in memory. It is flexible and can automatically scale computing resources to your needs (for instance, different scale tiers are offered at different prices and performance levels). In addition, Google Cloud provides several ML APIs (e.g. Vision, Speech recognition) that leverage state-of-the-art pre-trained models, which makes it easier to train performant models and use them in production.

In this notebook we will execute code to process data, train a Tensorflow model with hyperparameter tuning, run predictions on new data and assess model performance.
We will leverage the following GCP capabilities:
- Google BigQuery
- Cloud Dataflow (Apache Beam + Tensorflow-Transform)
- Cloud ML Engine (Tensorflow).





**Before you start, you will need to:**

* Create a Google Cloud project and a bucket in Google Cloud Storage
* Install gcloud SDK to run gcloud commands: https://cloud.google.com/sdk/downloads

**You can also:**

* Go through the python code available on [GitHub here](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-fraud-detection).

**Recommendations for using this notebook:**

This notebook provides one example of how to approach the fraud detection problem; there are many other ways to tackle or improve it. Once you have a good understanding of the current solution, try modifying the code and running it in GCP to see the impact.

Throughout the notebook we provide suggestions for improvements. Feel free to try some of these or your own ideas.



**Acknowledgements:**

The dataset has been collected and analysed during a research collaboration of Worldline and the Machine Learning Group (http://mlg.ulb.ac.be) of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available on http://mlg.ulb.ac.be/BruFence and http://mlg.ulb.ac.be/ARTML.
Dataset provided thanks to: Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.


# 1. Environment set-up

## Requirements

Make sure that the following requirements are met:

```
tensorflow==1.13.1
apache_beam[gcp]==2.12.0
tensorflow-transform==0.4.0
matplotlib
scikit-learn
numpy
scipy
```

Otherwise the notebook might not run as expected.


In [0]:
%%bash
python2 -m pip install tensorflow==1.13.1 --user
python2 -m pip install apache_beam[gcp]==2.12.0 --user
python2 -m pip install tensorflow-transform==0.4.0 --user
python2 -m pip install matplotlib --user
python2 -m pip install scikit-learn --user
python2 -m pip install numpy --user
python2 -m pip install scipy --user

## Project information

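Hard-code your project information as environment variables to be used in the next steps.
* PROJECT_ID is the ID of your GCP project; the `gcloud` and `bq` commands below read it, so replace the placeholder with your own value
* BUCKET_ID is the name of your bucket to store outputs for this example

In [None]:
%env PROJECT_ID=your-project-id
%env BUCKET_ID=anz-uc-team-1-storage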


## Set your GCP project id

In [None]:
%%bash
gcloud config set project $PROJECT_ID

## Set up the dataset using Google BigQuery

In your project, create a BigQuery dataset, then create a table from the data stored in the public GCP bucket by running the commands below.

Set the public path to the data and the BigQuery dataset and table names; **please leave as is**:

In [None]:
%env DATASET_PUBLIC_PATH=gs://fraud-detection-example/creditcard_proc.csv
%env DATASET=fraud_detection
%env DATASET_LOCATION=US
%env BQ_TABLE_NAME=raw_data

In [None]:
%%bash
bq mk --dataset --data_location $DATASET_LOCATION --project_id $PROJECT_ID $DATASET
bq load --project_id $PROJECT_ID --autodetect \
--source_format=CSV $DATASET.$BQ_TABLE_NAME $DATASET_PUBLIC_PATH

## Clone the git repository containing the necessary code

It contains Python code to process data, train a Tensorflow model and run predictions.

**Note: you may need to request access to the GCP bucket and repository**

In [0]:
!git clone https://github.com/kasna-cloud/professional-services.git

**Current timestamp for unique naming**

We use the current timestamp as a string to ensure unique naming through the different steps of the process.

In [None]:
import os
import time

current_time = str(time.time()).replace('.', '')
current_time

# 2. Brief data exploration


The dataset is a list of credit card transactions, with features describing each transaction and a flag indicating whether the transaction was fraudulent.
The features include the time of the transaction, its amount and 28 components from a PCA.

See [here](https://www.kaggle.com/mlg-ulb/creditcardfraud/data) for more information.


## Load data from GBQ into Pandas DataFrame

In [None]:
%%bash
python2 -m pip install pandas-gbq --user

In [0]:
import pandas as pd
from pandas.io import gbq

In [None]:
import os
project_id = os.environ.get('PROJECT_ID')
table_name = project_id + '.fraud_detection.raw_data'

data = gbq.read_gbq(query='select * from `{}` limit 1000'.format(table_name),
                    project_id=project_id, dialect='standard')
print data.shape
data.head()

## Count occurrences for each class
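
A minimal sketch of the class count, assuming the label column is named `Class` (the name used by the evaluation code in section 6). The helper `class_balance` is a hypothetical name introduced here; it accepts any iterable of labels, so the sketch also runs on toy data without BigQuery access.

```python
from collections import Counter


def class_balance(labels):
  """Returns {label: (count, fraction)} for an iterable of class labels."""
  counts = Counter(labels)
  total = float(sum(counts.values()))
  return {label: (n, n / total) for label, n in counts.items()}


# Toy labels standing in for data['Class']: 0 = legitimate, 1 = fraudulent.
toy_labels = [0] * 998 + [1] * 2
balance = class_balance(toy_labels)
for label in sorted(balance):
  count, fraction = balance[label]
  print('class {}: {} rows ({:.2%})'.format(label, count, fraction))
```

In the notebook, `class_balance(data['Class'])` would give the counts for the sampled rows. Because the query above uses `limit 1000`, the sample may contain very few fraudulent rows.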


**Features overview**

In [0]:
pd.concat([data.mean(), data.std()], axis=1)

The V1-V28 PCA components are centered, but the other features are not. We will center and scale each feature in the data processing step using the *Tensorflow-Transform* library.
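
To make the centering and scaling concrete, here is a plain-Python sketch of the z-score transformation (subtract the mean, divide by the standard deviation). It only illustrates the arithmetic and is not the pipeline's implementation, which relies on Tensorflow-Transform. The function name `z_score` and the toy values are introduced here for illustration.

```python
import math


def z_score(values):
  """Centers values on their mean and scales them by the population std."""
  n = float(len(values))
  mean = sum(values) / n
  variance = sum((v - mean) ** 2 for v in values) / n
  std = math.sqrt(variance)
  if std == 0:
    # A constant feature carries no information; map it to zeros.
    return [0.0 for _ in values]
  return [(v - mean) / std for v in values]


amounts = [10.0, 20.0, 30.0, 40.0]
scaled = z_score(amounts)
print(scaled)
```

After the transformation the values have mean 0 and unit variance, which puts features such as the amount on the same scale as the PCA components.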

**Suggestions for improvements**

Try to further explore the data and create additional features. For instance:
- features describing preceding transactions
- feature percentiles
- transaction clustering, to then compute the distances between a transaction and each cluster.
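
As one possible starting point for the percentile idea, the sketch below computes the percentile rank of each transaction amount. The function name `percentile_ranks` and the toy amounts are assumptions made for illustration; they are not part of the repository code.

```python
import bisect


def percentile_ranks(values):
  """Returns, for each value, the fraction of values less than or equal to it."""
  ordered = sorted(values)
  n = float(len(ordered))
  return [bisect.bisect_right(ordered, v) / n for v in values]


amounts = [5.0, 120.0, 37.5, 5.0, 980.0]
print(percentile_ranks(amounts))
```

In a real pipeline the sorted reference values would usually come from the training set only, so that validation and test data do not leak into the feature.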

# 3. Run data preprocessing in Google Cloud Dataflow

In this part we run the data preprocessing in Google Cloud Dataflow using the Apache-Beam and Tensorflow-Transform libraries.

**Brief overview of the tools:**
- Dataflow: data processing framework in Google Cloud relying on Apache Beam. Scales well to multiple workers (more information [here](https://cloud.google.com/dataflow/)).
- TFT (Tensorflow-Transform): wrapper on top of Tensorflow for preprocessing operations (e.g. standardizing data, preprocessing text). It has two main advantages: it scales with Dataflow, and the resulting graph can be reused for inference (more information [here](https://github.com/tensorflow/transform)).

**This step includes the following:**
- Reads data from BigQuery.
- Adds a hash key to each row, so that we can score the data and compare predictions with the true labels later on.
- Normalizes data.
- Shuffles and splits data into train / validation / test sets.
- Oversamples train data, to make up for the strong class imbalance and allow for proper training in batches.
- Stores data as TFRecord, the preferred Tensorflow format for data storage, which feeds data into Tensorflow graphs more efficiently.
- Splits and stores test data into separate label and feature files, to ensure that inference is done on unlabelled data.
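
The keying, splitting and oversampling steps above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the logic of `preprocess.py`: the SHA-256 key, the 80/10/10 split ratios, the oversampling factor and all function names are hypothetical choices made for this example.

```python
import hashlib
import random


def hash_key(row):
  """Builds a stable identifier from the row's feature values."""
  payload = ','.join(str(v) for v in row['features'])
  return hashlib.sha256(payload.encode('utf-8')).hexdigest()


def assign_split(key, train=0.8, validation=0.1):
  """Maps a hash key to TRAIN / VALIDATION / TEST deterministically."""
  bucket = int(key[:8], 16) / float(16 ** 8)  # Pseudo-uniform in [0, 1).
  if bucket < train:
    return 'TRAIN'
  if bucket < train + validation:
    return 'VALIDATION'
  return 'TEST'


def oversample(rows, factor, seed=0):
  """Repeats each positive row `factor` times in total, then shuffles."""
  out = []
  for row in rows:
    out.extend([row] * (factor if row['Class'] == 1 else 1))
  random.Random(seed).shuffle(out)
  return out


# Toy rows: one positive example every 50 rows.
rows = [{'features': [i, i * 0.5], 'Class': int(i % 50 == 0)}
        for i in range(200)]
for row in rows:
  row['key'] = hash_key(row)
train_rows = [r for r in rows if assign_split(r['key']) == 'TRAIN']
balanced = oversample(train_rows, factor=10)
print('train rows: {}, after oversampling: {}'.format(
    len(train_rows), len(balanced)))
```

Deriving the split from the key keeps the assignment reproducible across runs, and oversampling only the train split leaves the validation and test distributions untouched.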

**Output path**

Hard-code the output path of data processing to be used in the following steps.

In [None]:
os.environ['DATAFLOW_OUTPUT_DIR'] = 'data_flow_output_dir-{}/'.format(current_time)
!echo $DATAFLOW_OUTPUT_DIR

**Launch data processing job**

You need to pass the name of the table you stored in BigQuery with the '--bq_table' argument. You also need to specify several other arguments (run `python preprocess.py --help` to list them).

You may have to run the following before starting the preprocessing job.

In [None]:
%%bash
gcloud compute networks create default
gcloud services enable dataflow.googleapis.com

Start the preprocessing job.

NOTE: The job below takes about half an hour to run. We have already run it, and you can
find the output in `team-storage/data_flow_output_dir`.

In [None]:
%%bash
cd professional-services/examples/cloudml-fraud-detection/
python preprocess.py \
--cloud \
--bq_table $BQ_TABLE_NAME \
--output_dir ${DATAFLOW_OUTPUT_DIR} \
--project_id $PROJECT_ID \
--bucket_id $BUCKET_ID \
--subnet https://www.googleapis.com/compute/v1/projects/anz-uc-host/regions/australia-southeast1/subnetworks/anz-uc-team-1-subnet \
--zone australia-southeast1-a

If you didn't run the command above, set up your environment variable to point to the prepopulated data:

In [None]:
%env DATAFLOW_OUTPUT_DIR=data_flow_output_dir/

**Monitor your Dataflow job in the console**

Go to [the GCP Dataflow console](https://cloud.google.com/dataflow/), select 'view console' and click on your job. You can follow the progress of the processing through each step.


**List files in output directory**

In [None]:
!gsutil ls gs://$BUCKET_ID/$DATAFLOW_OUTPUT_DIR

**Suggestions for improvements**

The current solution stores the processed data as TFRecord files. This is ideal from a performance perspective, but less so in terms of readability, for instance if you would like to debug the processing pipeline and read its output. Instead of the `tfrecordio.WriteToTFRecord` function used by the current solution, you can use `beam.io.WriteToText`, which stores the data as readable text files.
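
Text output is most useful when each record is serialized in a readable form first. The sketch below shows one way to do that with JSON lines. The function name `to_json_line` and the toy record are hypothetical; in a Beam pipeline such a function could be applied with `beam.Map` before `beam.io.WriteToText`, but that wiring is omitted here because it depends on how `preprocess.py` builds its collections.

```python
import json


def to_json_line(record):
  """Serializes one processed record as a single-line JSON string."""
  return json.dumps(record, sort_keys=True)


# Toy record; the real records contain the key, all features and the label.
record = {'key': 'abc123', 'Amount': 0.25, 'V1': -1.3, 'Class': 0}
line = to_json_line(record)
print(line)
```

Sorting the keys keeps the field order stable between records, which makes the files easier to compare when debugging.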

# 4. Training in Google Cloud ML Engine

**Set up environment variables**

Tensorflow can store temporary model files and the trained model in a given directory. To this end, we hard-code the output path for the training step as well as the job name, to be used as arguments in the following steps.

In [0]:
os.environ['TRAINING_JOB_NAME'] = 'fraud_detection_training_job_{}'.format(
    current_time)
os.environ['TRAINING_OUTPUT_DIR'] = 'gs://{}/training_output_dir-{}'.format(
    os.environ['BUCKET_ID'], current_time)
print os.environ['TRAINING_JOB_NAME'], os.environ['TRAINING_OUTPUT_DIR']

**Set hyperparameters search configuration**

The training step includes hyperparameter tuning. Hyperparameter tuning (described in more detail here: https://cloud.google.com/blog/big-data/2018/03/hyperparameter-tuning-on-google-cloud-platform-is-now-faster-and-smarter) consists of training and evaluating different parametrizations of a model, then picking the best performing one.
For this purpose we need to indicate which parameters we want to tune and the range of values to try for each.
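
The configuration used in this section gives each parameter a range and a scale type. The sketch below illustrates why a learning rate is typically searched on a log scale: with the range 0.0001 to 0.1, uniform draws almost never land in the lowest decade, while log-uniform draws cover each decade equally. The tuning service picks trial values with its own search algorithm; the random draws and function names here are assumptions used only to show the effect of the scale.

```python
import math
import random


def sample_linear(min_value, max_value, rng):
  """Draws uniformly between min_value and max_value."""
  return rng.uniform(min_value, max_value)


def sample_log(min_value, max_value, rng):
  """Draws uniformly in log space: each decade is equally likely."""
  log_value = rng.uniform(math.log10(min_value), math.log10(max_value))
  return 10 ** log_value


rng = random.Random(42)
linear_draws = [sample_linear(0.0001, 0.1, rng) for _ in range(10000)]
log_draws = [sample_log(0.0001, 0.1, rng) for _ in range(10000)]

# Fraction of draws below 0.001, i.e. in the lowest of the three decades.
print(sum(d < 0.001 for d in linear_draws) / 10000.0)
print(sum(d < 0.001 for d in log_draws) / 10000.0)
```

The first fraction is close to 1%, the second close to one third, which is why parameters spanning several orders of magnitude are usually given a log scale.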

The ML Engine command takes as input a '.yaml' file that contains the configuration to use for hyperparameter tuning.

In [0]:
%%writefile professional-services/examples/cloudml-fraud-detection/hyperparams.yaml
trainingInput:
  scaleTier: STANDARD_1
  hyperparameters:
    maxTrials: 10
    maxParallelTrials: 2
    enableTrialEarlyStopping: True
    goal: MAXIMIZE
    hyperparameterMetricTag: auc_precision_recall
    params:
    - parameterName: first_layer_size
      type: INTEGER
      minValue: 5
      maxValue: 50
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: num_layers
      type: INTEGER
      minValue: 1
      maxValue: 2
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: dropout
      type: DOUBLE
      minValue: 0.10
      maxValue: 0.50
      scaleType: UNIT_LINEAR_SCALE
    - parameterName: learning_rate
      type: DOUBLE
      minValue: 0.0001
      maxValue: 0.1
      scaleType: UNIT_LOG_SCALE

**Submit training job**

You can specify the maximum number of training steps with the '--max_steps' argument.

NOTE: The job below takes about half an hour to run. We have already run it, and you can
find the output in `team-storage/training_output_dir`.

In [0]:
%%bash
cd professional-services/examples/cloudml-fraud-detection/
MAX_STEP=10000
gcloud ml-engine jobs submit training $TRAINING_JOB_NAME \
--module-name trainer.task \
--staging-bucket gs://${BUCKET_ID} \
--package-path ./trainer \
--region=us-central1 \
--runtime-version 1.5 \
--config=hyperparams.yaml \
-- \
--input_dir gs://${BUCKET_ID}/${DATAFLOW_OUTPUT_DIR} \
--output_dir ${TRAINING_OUTPUT_DIR} \
--max_steps $MAX_STEP

If you didn't run the command above, set up your environment variable to point to the prepopulated data:

In [None]:
%env TRAINING_OUTPUT_DIR=gs://anz-uc-team-1-storage/training_output_dir


**Monitor overall training in Google Cloud ML-engine**

You can also monitor the overall training in GCP, access the logs and monitor the results of hyperparameter tuning in the ML-engine console:

Go to the console: https://cloud.google.com/ml-engine/ and select 'View Console'.

**Suggestions for improvements**

The current data processing applies oversampling to the positive class. There are other ways to deal with class imbalance, for instance:
- undersampling the negative class
- changing the loss function to use different class weights.

Feel free to change the code to implement some of these options and see how they perform.
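
As an illustration of the weighted-loss option, the sketch below computes class weights inversely proportional to class frequency and applies them to a binary cross-entropy. It is a plain-Python sketch, not the trainer's code: the function names, the weighting formula and the toy data are assumptions, and both classes are assumed to be present in the labels.

```python
import math


def class_weights(labels):
  """Weights each class inversely to its frequency (mean example weight 1)."""
  n = float(len(labels))
  positives = sum(labels)
  negatives = n - positives
  return {0: n / (2.0 * negatives), 1: n / (2.0 * positives)}


def weighted_log_loss(labels, probabilities, weights, eps=1e-7):
  """Binary cross-entropy where each example is scaled by its class weight."""
  total = 0.0
  for y, p in zip(labels, probabilities):
    p = min(max(p, eps), 1.0 - eps)  # Avoid log(0).
    loss = -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    total += weights[y] * loss
  return total / len(labels)


labels = [0] * 98 + [1] * 2
weights = class_weights(labels)
print(weights)

# A model that always predicts a low fraud probability is penalized more
# heavily once the rare positive class is up-weighted.
always_low = [0.01] * len(labels)
unweighted = weighted_log_loss(labels, always_low, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, always_low, weights)
print(unweighted)
print(weighted)
```

Undersampling the negative class has a similar effect on the loss, at the price of discarding data; weighting keeps every example.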

# 5. Inferences in Google Cloud ML Engine

Once our model is trained and stored in Google Cloud Storage, we can add it to Cloud ML Engine and use it for batch inference on new data, among other things.
Different versions of the same model can be stored in ML Engine. We will specify a name for the model and a unique name for the current version, based on the current timestamp.

**Pick the best model trial from hyperparameter tuning to use moving forward**

After hyperparameter tuning we may have to pick between different trials and select the best performing one to use for inference. The next command defines which trial to use moving forward. In this example the trial picked is 'TRIAL_NUMBER=1', but **you should adapt it to pick the best performing model that you obtained after training.**

In [0]:
%env TRIAL_NUMBER=1

**Set up environment variables**

In [0]:
os.environ['MODEL_NAME'] = 'fraud_detection_test_nb'
os.environ['MODEL_VERSION'] = 'v_{}'.format(time.time()).replace('.', '')
temp = !echo $(gsutil ls ${TRAINING_OUTPUT_DIR}/trials/${TRIAL_NUMBER}/export/exporter/ | tail -1)
os.environ['MODEL_SAVED_NAME'] = temp[0]
del temp
print os.environ['MODEL_SAVED_NAME']

**Your models are stored as .pb files in Google Cloud Storage.**

List of trials in Google Cloud Storage:

In [0]:
!gsutil ls ${TRAINING_OUTPUT_DIR}/trials/

Files exported for the trial you selected:

In [0]:
!gsutil ls $MODEL_SAVED_NAME

**Create and save model in Google Cloud ML-engine**

We first need to add a new model to Cloud ML Engine...

In [0]:
%%bash
gcloud ml-engine models create $MODEL_NAME \
--regions us-central1

... and then add the current version to the model created, by specifying the path in Google Cloud Storage to the model we would like to use.

In [0]:
%%bash
gcloud ml-engine versions create $MODEL_VERSION \
--model $MODEL_NAME \
--origin $MODEL_SAVED_NAME \
--runtime-version 1.5

Now that the model is saved, we can launch an inference job by indicating the name of the model we just stored and the path to the data to be scored (here the test sample data from the Dataflow step).

**Set up environment variables**

We need to specify the following:

* JOB_NAME: unique name for the inference job
* FEATURES_INPUT_PATH: the path to the features to use for prediction (here it is in the output of the Dataflow job)
* PREDICTIONS_OUTPUT_PATH: the path to the directory in which to store the predictions.

In [0]:
os.environ['JOB_NAME'] = '{}_{}'.format(
    os.environ['MODEL_NAME'],
    os.environ['MODEL_VERSION'])
os.environ['FEATURES_INPUT_PATH'] = 'gs://{}/{}split_data/split_data_TEST_features.txt*'.format(
    os.environ['BUCKET_ID'],
    os.environ['DATAFLOW_OUTPUT_DIR'])
os.environ['PREDICTIONS_OUTPUT_PATH'] = 'gs://{}/predictions/{}'.format(
    os.environ['BUCKET_ID'],
    os.environ['JOB_NAME'])
print os.environ['FEATURES_INPUT_PATH']
print os.environ['PREDICTIONS_OUTPUT_PATH']
print os.environ['JOB_NAME']

**Submit prediction job**

In [0]:
%%bash
gcloud ml-engine jobs submit prediction $JOB_NAME \
--model $MODEL_NAME \
--input-paths $FEATURES_INPUT_PATH \
--output-path $PREDICTIONS_OUTPUT_PATH \
--region us-central1 \
--data-format TEXT \
--version $MODEL_VERSION

**Monitor prediction job in Google Cloud ML-engine**

You can monitor the prediction job in the GCP ML Engine console:

Go to https://cloud.google.com/ml-engine/ and select 'View Console'.

# 6. Assess model performance on out-of-sample data

We can assess how well our model performs on the data previously scored. For this purpose we will compare predictions to true labels and derive the precision-recall curve and its AUC. The precision-recall curve AUC is a relevant metric for this model because it reflects both the fraction of fraudulent transactions we are able to detect (recall) and how often a flagged transaction is actually fraudulent (precision). Compared to other metrics (accuracy, for instance), it is less misleading under the strong class imbalance of this example.
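
A small worked example shows why accuracy alone says little here. The toy data and the two classifiers below are made up for illustration: both reach the same accuracy, yet only precision and recall reveal that one of them never detects any fraud.

```python
def confusion_counts(labels, predictions):
  """Returns the counts (tp, fp, fn, tn) for binary labels and predictions."""
  tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
  fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
  fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
  tn = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
  return tp, fp, fn, tn


# Toy data: 1000 transactions, 2 of them fraudulent.
labels = [1, 1] + [0] * 998
never_flag = [0] * 1000            # Classifier that never flags anything.
flag_some = [1, 0, 1] + [0] * 997  # Catches one fraud, raises one false alarm.

for name, predictions in [('never_flag', never_flag), ('flag_some', flag_some)]:
  tp, fp, fn, tn = confusion_counts(labels, predictions)
  accuracy = (tp + tn) / float(len(labels))
  recall = tp / float(tp + fn)
  # Precision is undefined when nothing is flagged; report 0.0 in that case.
  precision = tp / float(tp + fp) if (tp + fp) else 0.0
  print('{}: accuracy={:.3f} precision={:.2f} recall={:.2f}'.format(
      name, accuracy, precision, recall))
```

The precision-recall curve repeats this computation for every possible probability threshold, and its AUC summarizes the result in a single number.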

In [0]:
import argparse
import json
import os
import sys

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score

%matplotlib inline

def extract_from_json(path, key, values, proc=(lambda x: x)):
  """Extracts and parses data from json files and returns a dictionary.

  Args:
    path: string, path to input data.
    key: string, name of key column.
    values: string, name of column containing values to extract.
    proc: function, used to process values from input. Follows the signature:
      * Args:
        * x: string or tuple of string
      * Returns:
        string

  Returns:
    Dictionary of parsed data.
  """

  res = {}
  counts = {}
  with open(path) as f:
    for line in f:
      record = json.loads(line)
      item_key = proc(record[key])
      res[item_key] = record[values]
      counts[item_key] = counts.get(item_key, 0) + 1
  # Keep only keys that appear exactly once, to avoid ambiguous matches.
  return {k: v for k, v in res.items() if counts[k] == 1}

def compute_and_print_pr_auc(labels, probabilities):
  """Computes statistic on predictions, based on true labels.

  Prints the precision-recall curve AUC and plots the curve inline.

  Args:
    labels: np.array, vector containing true labels.
    probabilities: np.array, 2-dimensional vector containing inferred
      probabilities.
  """

  average_precision = average_precision_score(labels, probabilities[:, 1])

  precision, recall, _ = precision_recall_curve(labels, probabilities[:, 1])
  plt.step(recall, precision, color='b', alpha=0.2, where='post')
  plt.fill_between(recall, precision, step='post', alpha=0.2, color='b')
  plt.xlabel('Recall')
  plt.ylabel('Precision')
  plt.ylim([0.0, 1.05])
  plt.xlim([0.0, 1.0])
  plt.title(
    'Precision-Recall curve: AUC={0:0.2f}'.format(average_precision))
  plt.show()
  print 'Precision-Recall AUC: {0:0.2f}'.format(average_precision)

def run(labels_path, predictions_path):
  """Reads input data and runs analysis on predictions.

  Args:
    labels_path: string, path to true labels.
    predictions_path: string, path to inferred probabilities.
  """

  labels = extract_from_json(labels_path, 'key', 'Class')
  proba = extract_from_json(
    predictions_path, 'key', 'probabilities', proc=(lambda x: x[0]))

  keys = set(labels.keys()) & set(proba.keys())
  labels = np.array([labels[key] for key in keys])
  proba = np.array([proba[key] for key in keys])

  compute_and_print_pr_auc(
    labels=labels, probabilities=proba)

run(
  labels_path='labels.txt',
  predictions_path='predictions.txt')


**Suggestions for improvements**

Once the classification model is trained and tested, we can think about implementation and how to take action using the model. One important question is to decide which action should be taken based on the model output for a given transaction.

In this classification example the output is the probability that a transaction is fraudulent. We need to decide when to actually flag a transaction as fraudulent. There are several ways to approach the problem (the list is neither exhaustive nor mutually exclusive):
- Set a fixed threshold for the probabilities; above the threshold the transaction is flagged, below it is not.
- Balance the costs and benefits of true / false positives / negatives and optimize the overall profit (linear optimization problem).
- Flag a transaction as fraudulent in a probabilistic manner, depending on the output of the model: the higher the output probability, the more likely we are to flag it.

The approach to take depends on the business problem to solve:
- The costs and benefits of correct classification and misclassification.
- The cost of checking that a transaction is actually fraudulent, once we flagged it.
- If the business wants to gather data to further improve the model.
- The business impact of acting in a probabilistic manner.

Feel free to explore different options. You can also find more online: https://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall07/dtheory.pdf
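
As a starting point for the cost-based approach, the sketch below scans candidate thresholds and keeps the one with the lowest total cost on a labelled sample. The cost values, the toy scores and the function names are hypothetical; real costs would have to come from the business.

```python
def total_cost(labels, probabilities, threshold, cost_fp, cost_fn):
  """Cost of flagging every transaction scored at or above the threshold."""
  cost = 0.0
  for y, p in zip(labels, probabilities):
    flagged = p >= threshold
    if flagged and y == 0:
      cost += cost_fp  # Legitimate transaction checked for nothing.
    elif not flagged and y == 1:
      cost += cost_fn  # Fraud that went through undetected.
  return cost


def best_threshold(labels, probabilities, cost_fp, cost_fn):
  """Scans candidate thresholds and returns the cheapest one."""
  # 1.01 stands for "never flag anything".
  candidates = sorted(set(probabilities)) + [1.01]
  return min(
      candidates,
      key=lambda t: total_cost(labels, probabilities, t, cost_fp, cost_fn))


labels = [0, 0, 0, 0, 1, 0, 1, 1]
probabilities = [0.02, 0.05, 0.10, 0.30, 0.35, 0.60, 0.80, 0.95]

# Hypothetical costs: a manual check costs 1, a missed fraud costs 20.
cheap_checks = best_threshold(labels, probabilities, cost_fp=1.0, cost_fn=20.0)
# Reverse situation: checks are expensive and misses are cheap.
costly_checks = best_threshold(labels, probabilities, cost_fp=20.0, cost_fn=1.0)
print(cheap_checks)
print(costly_checks)
```

When missed fraud is expensive the chosen threshold is lower, so more transactions are flagged; when checks are expensive the threshold moves up. The threshold should be chosen on validation data rather than on the test set used for the final assessment.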
