## Import Libraries and specify Amazon S3 Bucket

* `bucket` - An S3 bucket accessible by this account.
* `prefix` - The location in the bucket where this notebook's input and output data will be stored. (The default value is sufficient.)
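
As a rough illustration of how `bucket` and `prefix` combine into the S3 locations used later in this notebook, here is a minimal sketch. The helper `s3_uri` and the bucket name `my-example-bucket` are hypothetical and are not part of the SageMaker SDK.

```python
import posixpath


def s3_uri(bucket, prefix, *parts):
    """Join a bucket, a key prefix, and extra key parts into an s3:// URI."""
    # posixpath always joins with forward slashes, which is what S3 keys use.
    key = posixpath.join(prefix, *parts)
    return "s3://{}/{}".format(bucket, key)


# Hypothetical bucket name, for illustration only.
example_bucket = "my-example-bucket"
example_prefix = "DEMO-fraud-detection"

train_uri = s3_uri(example_bucket, example_prefix, "train", "data.pbr")
output_uri = s3_uri(example_bucket, example_prefix, "output")
print(train_uri)
print(output_uri)
```

Keeping every object under one prefix makes it easy to find, and later delete, everything this notebook wrote.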

In [None]:
# Define IAM role
import boto3
import re
import sagemaker
import seaborn as sns

role = sagemaker.get_execution_role()

# The Session manages interactions with the Amazon SageMaker APIs and any other AWS services needed.
# It also manipulates the entities and resources that SageMaker uses, such as training jobs, endpoints, and input datasets in S3.
sess = sagemaker.Session()
bucket = sess.default_bucket()
prefix = 'DEMO-fraud-detection'

## Obtain and Inspect Sample Data


Our data comes from the Numenta Anomaly Benchmark (NAB) [[1](https://github.com/numenta/NAB/blob/master/data/realKnownCause/machine_temperature_system_failure.csv)]. The dataset records temperature sensor readings from an internal component of a large industrial machine. The readings were collected over the course of 3 months and aggregated into 5-minute buckets.
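
To make "aggregated into 5-minute buckets" concrete, the sketch below floors each timestamp to the start of its 5-minute window and averages the readings in that window. The readings are made up, and averaging is an assumption: the source does not say which aggregate was used to build the dataset.

```python
from collections import defaultdict
from datetime import datetime


def floor_to_5_minutes(ts):
    """Round a datetime down to the start of its 5-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 5, second=0, microsecond=0)


def aggregate_5_minutes(readings):
    """Average (timestamp, value) pairs per 5-minute window, sorted by time."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[floor_to_5_minutes(ts)].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Made-up raw readings, for illustration only.
raw = [
    (datetime(2013, 12, 2, 21, 15, 10), 70.0),
    (datetime(2013, 12, 2, 21, 17, 45), 72.0),
    (datetime(2013, 12, 2, 21, 20, 5), 75.0),
]
aggregated = aggregate_5_minutes(raw)
for start, mean_value in aggregated:
    print(start, mean_value)
```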

> https://github.com/numenta/NAB/blob/master/data/realKnownCause/machine_temperature_system_failure.csv

In [None]:
%%time

import pandas as pd
import urllib.request

data_filename = '2018.csv'#'machine_temperature_system_failure.csv'
data_source = 'https://raw.githubusercontent.com/numenta/NAB/master/data/realKnownCause/machine_temperature_system_failure.csv'

#urllib.request.urlretrieve(data_source, data_filename)
temp_data = pd.read_csv(data_filename, delimiter=',')
prediction_data = pd.read_csv("2018.csv", delimiter=',')


## Data Inspection

Before training any models, it is important to inspect the data. Perhaps there are underlying patterns or structures that we could provide as "hints" to the model, or noise that we could pre-process away. The raw data looks like this:

In [None]:
temp_data.head()

### Visualize the dataset

If you're running this notebook on Amazon SageMaker, install the `bokeh` library manually in the terminal of the notebook instance, or run `!sudo pip install bokeh` in one of the cells.
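
One way to check whether the install is needed at all is to ask Python whether it can locate the module. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper name.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if not is_installed("bokeh"):
    print("bokeh is missing; install it before running the plotting cells")
```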

In [None]:
import bokeh
import bokeh.io
from bokeh.models import HoverTool
bokeh.io.output_notebook()
from bokeh.plotting import figure, show, output_file, ColumnDataSource
from bokeh.models.formatters import DatetimeTickFormatter
import bokeh.palettes
temp_data['timestamp'] = pd.to_datetime(temp_data['timestamp'])
output_file("datetime.html")
date = temp_data['timestamp']
temp = temp_data['value']

x = date.values
y = temp.values
score = []

string_x = list(map(str, x))
source = ColumnDataSource(temp_data)
source.add(temp_data['timestamp'].apply(lambda d: d.strftime('%Y-%m-%d %H:%M:%S')), 'event_date_formatted')
#Hover Tool

hover = HoverTool(
    names = ["temp"],
    tooltips=[
        ( 'date',   '@event_date_formatted'),
        ( 'temperature',  '$y' ),        
    ],


    # display a tooltip whenever the cursor is vertically in line with a glyph
    mode='vline'
)

p = figure(x_axis_type='datetime', plot_width=900, plot_height=400, tools=[hover, 'pan','wheel_zoom','box_zoom','reset']) 
p.line( name="temp", x='timestamp',y= 'value',source=source, line_width=2,color='navy', alpha=0.5)
show(p)

## Store Data on S3

The Random Cut Forest algorithm accepts data in [RecordIO](https://mxnet.apache.org/api/python/io/io.html#module-mxnet.recordio) [Protobuf](https://developers.google.com/protocol-buffers/) format. The SageMaker Python API provides helper functions for easily converting your data into this format. Below we convert the temperature sensor data and upload it to the `bucket + prefix` Amazon S3 destination specified at the beginning of this notebook in the [Import Libraries and specify Amazon S3 Bucket](#Import-Libraries-and-specify-Amazon-S3-Bucket) section.

In [None]:
def convert_and_upload_training_data(ndarray, bucket, prefix, filename='data.pbr'):
    import boto3
    import os
    from sagemaker.amazon.common import numpy_to_record_serializer
    
    # convert numpy array to Protobuf RecordIO format
    serializer = numpy_to_record_serializer()
    buffer = serializer(ndarray)
    
    # Upload to S3
    s3_object = os.path.join(prefix, 'train', filename)
    boto3.Session().resource('s3').Bucket(bucket).Object(s3_object).upload_fileobj(buffer)
    
    s3_path = 's3://{}/{}'.format(bucket, s3_object)
    return s3_path

# RCF expects a 2-D array of shape (num_points, feature_dim).
# as_matrix() was removed from pandas; .values returns the same NumPy array.
s3_train_data = convert_and_upload_training_data(
    temp_data.value.values.reshape(-1,1),
    bucket,
    prefix)
print('Uploaded data to {}'.format(s3_train_data))

# Training

***

We have created a training data set and uploaded it to S3. Next, we configure a SageMaker training job to use the Random Cut Forest (RCF) algorithm on said training data.

The first step is to specify the location of the Docker image containing the SageMaker Random Cut Forest algorithm. To minimize communication latency, containers are provided for each AWS region in which SageMaker is available. The code below uses the `us-east-1` container; if you run this notebook in a different region, substitute that region's container URI.
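
A per-region lookup could be sketched as below. Only the `us-east-1` URI appears in this notebook; entries for other regions would have to be filled in from the AWS documentation, and `RCF_CONTAINERS` and `rcf_container_for` are hypothetical names. In a notebook, the current region is available as `boto3.Session().region_name`.

```python
# Only the us-east-1 entry is taken from this notebook.
# Add other regions from the AWS documentation as needed.
RCF_CONTAINERS = {
    "us-east-1": "382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest",
}


def rcf_container_for(region):
    """Return the RCF container URI for a region, or fail with a clear message."""
    try:
        return RCF_CONTAINERS[region]
    except KeyError:
        raise ValueError("No RCF container registered for region {!r}".format(region))


container = rcf_container_for("us-east-1")
print(container)
```

Failing loudly on an unknown region is a deliberate choice here: silently falling back to another region's container would work but would add cross-region latency.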

## Hyperparameters and Job Definition

Particular to a SageMaker RCF training job are the following hyperparameters:

* **`num_samples_per_tree`** - the number of randomly sampled data points sent to each tree. As a general rule, `1/num_samples_per_tree` should approximate the estimated ratio of anomalies to normal points in the dataset.
* **`num_trees`** - the number of trees to create in the forest. Each tree learns a separate model from different samples of data. The full forest model uses the mean predicted anomaly score from each constituent tree.
* **`feature_dim`** - the dimension of each data point.
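
The rule of thumb for `num_samples_per_tree` and the averaging described for `num_trees` can be worked through with plain arithmetic. This is only a sketch of the arithmetic, not of the RCF algorithm itself; `samples_per_tree_for` and `forest_score` are hypothetical helper names.

```python
def samples_per_tree_for(anomaly_ratio):
    """Pick num_samples_per_tree so that 1/num_samples_per_tree ~ anomaly_ratio."""
    if not 0 < anomaly_ratio < 1:
        raise ValueError("anomaly_ratio must be strictly between 0 and 1")
    return round(1 / anomaly_ratio)


def forest_score(tree_scores):
    """Combine per-tree anomaly scores into one forest score by taking the mean."""
    return sum(tree_scores) / len(tree_scores)


# With 500 samples per tree, the implied anomaly ratio is 1/500 = 0.2%.
implied_ratio = 1 / 500
print(implied_ratio)

# Going the other way: an estimated 0.2% anomaly ratio suggests 500 samples per tree.
print(samples_per_tree_for(0.002))
```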

In addition to these RCF model hyperparameters, we provide additional parameters defining things like the EC2 instance type on which training will run, the S3 bucket containing the data, and the AWS access role. Note the following:

* Recommended instance type: `ml.m4`, `ml.c4`, or `ml.c5`
* Current limitations:
  * The RCF algorithm does not take advantage of GPU hardware.

In [None]:
%%time
session = sagemaker.Session()

# Specify the location of the training container
container = '382416733822.dkr.ecr.us-east-1.amazonaws.com/randomcutforest:latest'

# specify general training job information
rcf = sagemaker.estimator.Estimator(
    container,
    role,  # IAM role obtained from sagemaker.get_execution_role() above
    input_mode='File',
    output_path='s3://{}/{}/output'.format(bucket, prefix),
    train_instance_count=1,
    train_instance_type='ml.m4.xlarge',
    sagemaker_session=session,
)

# set algorithm-specific hyperparameters
rcf.set_hyperparameters(
    num_samples_per_tree = 500,
    num_trees = 100,
    feature_dim = 1,
)

# RCF training requires sharded data. See documentation for
# more information.

s3_train_input = sagemaker.session.s3_input(
    s3_train_data,
    distribution='ShardedByS3Key',
    content_type='application/x-recordio-protobuf',
)


# run the training job on input data stored in S3
rcf.fit({'train': s3_train_input})

If you see the message

> `===== Job Complete =====`

at the bottom of the output logs, training completed successfully and the output RCF model was stored in the specified output path. You can also view information about a training job, including its status, in the Amazon SageMaker console. Click the "Jobs" tab and select the training job matching the training job name printed below:

In [None]:
print('Training job name: {}'.format(rcf.latest_training_job.job_name))

# Inference

***

A trained Random Cut Forest model does nothing on its own. We now want to use the model we computed to perform inference on data. In this case, it means computing anomaly scores from input time series data points.

We create an inference endpoint by calling the SageMaker Python SDK `deploy()` method on the estimator we trained above. We specify the instance type where inference is computed as well as an initial number of instances to spin up. We recommend the `ml.c5` instance type, as it provides the fastest inference time at the lowest cost; the cell below uses `ml.m4.xlarge`, which you can swap for an `ml.c5` type if one is available in your account.

In [None]:
rcf_inference = rcf.deploy(
    initial_instance_count=1,
    instance_type='ml.m4.xlarge',
)

Congratulations! You now have a functioning SageMaker RCF inference endpoint. You can confirm the endpoint configuration and status by navigating to the "Endpoints" tab in the AWS SageMaker console and selecting the endpoint matching the endpoint name, below: 

In [None]:
print('Endpoint name: {}'.format(rcf_inference.endpoint))

## Data Serialization/Deserialization

We can pass data in a variety of formats to our inference endpoint. In this example we demonstrate passing CSV-formatted data. Other available formats are JSON and RecordIO Protobuf. We use the SageMaker Python SDK utilities `csv_serializer` and `json_deserializer` when configuring the inference endpoint.
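
To give a feel for what travels over the wire, the sketch below builds a CSV payload by hand and parses a JSON response. It does not call the SageMaker SDK. The response shape (a `scores` list of objects with a `score` field) is inferred from how this notebook reads the results later on, and the numbers are made up.

```python
import json


def to_csv_payload(rows):
    """Serialize rows of numbers as CSV text: one data point per line."""
    return "\n".join(",".join(str(v) for v in row) for row in rows)


def scores_from_response(body):
    """Extract the anomaly scores from a JSON response body."""
    return [record["score"] for record in json.loads(body)["scores"]]


# Made-up one-dimensional data points, shaped like the (n, 1) array sent to the endpoint.
payload = to_csv_payload([[73.97], [74.94], [76.12]])
print(payload)

# Made-up response body, for illustration only.
fake_response = '{"scores": [{"score": 1.1}, {"score": 0.9}, {"score": 1.3}]}'
scores = scores_from_response(fake_response)
print(scores)
```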

In [None]:
from sagemaker.predictor import csv_serializer, json_deserializer

rcf_inference.content_type = 'text/csv'
rcf_inference.serializer = csv_serializer
rcf_inference.deserializer = json_deserializer

Let's pass the training dataset, in CSV format, to the inference endpoint so we can automatically detect the anomalies we saw with our eyes in the plots, above. Note that the serializer and deserializer will automatically take care of the datatype conversion from Numpy NDArrays.

For starters, let's only pass in the first six datapoints so we can see what the output looks like.

In [None]:
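prediction_data = pd.read_csv("2018.csv", delimiter=',')
# as_matrix() was removed from pandas; .values returns the same NumPy array
prediction_data_numpy = prediction_data.value.values.reshape(-1,1)
# Score only the first six data points so the response is easy to read
results = rcf_inference.predict(prediction_data_numpy[:6])
print(results)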

## Computing Anomaly Scores

Now, let's compute and plot the anomaly scores from the entire temperature dataset.

In [None]:
from bokeh.models import Range1d, LinearAxis
def prediction(data):
    prediction_data = data
    prediction_data['timestamp'] = pd.to_datetime(prediction_data['timestamp'])
    output_file("datetime.html")
    temp = prediction_data['value']
    scores = prediction_data['score']

    hover = HoverTool(
        names = ["temp","score"],
        tooltips=[
            ( 'date',   '@event_date_formatted'),
            ( 'temp',  '@temp_y' ), 
           ( 'Score',  '@score_y' ), 
        ],
    )
    source = ColumnDataSource(prediction_data)
    source.add(prediction_data['timestamp'].apply(lambda d: d.strftime('%Y-%m-%d %H:%M:%S')), 'event_date_formatted')
    source.add(temp, 'temp_y')
    source.add(scores, 'score_y')

    p = figure(x_axis_type='datetime', plot_width=1000, plot_height=350, tools=[hover, 'pan','wheel_zoom','box_zoom','reset']) 
    p.line( name = "temp", x='timestamp',y= 'value',source=source, line_width=2,color='navy', alpha=0.5, legend=["Temperature"])
    p.extra_y_ranges = {"Anomaly": Range1d(start=0, end=10)}
    p.add_layout(LinearAxis(y_range_name="Anomaly"), 'right')
    p.line( name = "score", x='timestamp',y='score',source=source, line_width=2,color='red', alpha=0.5, y_range_name="Anomaly",legend=["Score"])
    p.legend.location = "top_left"
    p.legend.click_policy="hide"
    p.title.text = "Anomaly Detection for Device Temperature"



    #select the highest anomaly scores
    score_mean = prediction_data['score'].mean()
    score_std = prediction_data['score'].std()
    score_cutoff = score_mean + 3*score_std
    print(score_cutoff)
    anomalies = prediction_data[prediction_data['score'] > score_cutoff]
    sorted_anomalies = anomalies.sort_values(by=['score'], ascending=False)
    print(sorted_anomalies)
    source = ColumnDataSource(prediction_data)
    #p.circle( sorted_anomalies['timestamp'],sorted_anomalies['score'], line_width=8,color='black', alpha=0.5, y_range_name="Anomaly")
    p.circle( sorted_anomalies['timestamp'],sorted_anomalies['value'], line_width=5,color='black')


    show(p)
    
def process_data(file):
    prediction_data = pd.read_csv(file, delimiter=',')
    # as_matrix() was removed from pandas; .values returns the same NumPy array
    prediction_data_numpy = prediction_data.value.values.reshape(-1,1)
    results = rcf_inference.predict(prediction_data_numpy)
    scores = [datum['score'] for datum in results['scores']]
    prediction_data['score'] = pd.Series(scores, index=prediction_data.index)
    return prediction_data
#prediction(prediction_data)

## Make prediction on the whole dataset

In [None]:
prediction(process_data("2018.csv"))

## Make predictions on new data

In [None]:
prediction(process_data("temp_predictions.csv"))

Note that the anomaly score spikes where our eyeball-norm method suggests there is an anomalous data point as well as in some places where our eyeballs are not as accurate.

The `prediction()` function above prints and plots any data points whose scores are more than 3 standard deviations above the mean score (approximately the 99.9th percentile if the scores were normally distributed).
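
The three-standard-deviation cutoff can be reproduced without pandas. The sketch below uses the sample standard deviation, which matches the default of pandas `Series.std()`. The scores are made up, and `score_cutoff` and `flag_anomalies` are hypothetical helper names.

```python
import statistics


def score_cutoff(scores, num_std=3):
    """Mean score plus num_std sample standard deviations."""
    return statistics.mean(scores) + num_std * statistics.stdev(scores)


def flag_anomalies(scores, num_std=3):
    """Return (index, score) pairs whose score exceeds the cutoff."""
    cutoff = score_cutoff(scores, num_std)
    return [(i, s) for i, s in enumerate(scores) if s > cutoff]


# Made-up anomaly scores: twenty ordinary values and one spike.
scores = [0.9, 1.0, 1.1, 1.0] * 5 + [5.0]
flagged = flag_anomalies(scores)
print(score_cutoff(scores))
print(flagged)
```

Note that a single point can only exceed a three-sigma cutoff when the sample is large enough (at least 11 points), because the outlier itself inflates both the mean and the standard deviation.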

Three events are known in this dataset. The first anomaly was a planned shutdown, and the third was a catastrophic system failure. The anomalies in between, a subtle but observable change in behavior, marked the actual onset of the problem that led to the eventual system failure.

* `2013-12-16` - A planned shutdown
* `2014-02-01` - Subtle anomalies that led to the failure.
* `2014-02-08` - Catastrophic System Failure
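
One way to compare flagged points against these known events is to look up each flagged timestamp by calendar date. This is a sketch under an assumption: each event is matched only on the exact date listed above, even though the subtle anomalies may span several days. The flagged timestamps are made up.

```python
from datetime import date, datetime

# Event dates and descriptions taken from the list above.
KNOWN_EVENTS = {
    date(2013, 12, 16): "planned shutdown",
    date(2014, 2, 1): "subtle anomalies leading to the failure",
    date(2014, 2, 8): "catastrophic system failure",
}


def label_flagged(timestamps):
    """Pair each flagged timestamp with a known event label, if any."""
    return [(ts, KNOWN_EVENTS.get(ts.date(), "unlabeled")) for ts in timestamps]


# Made-up flagged timestamps, for illustration only.
flagged_times = [datetime(2013, 12, 16, 17, 25), datetime(2014, 1, 5, 3, 0)]
labels = label_flagged(flagged_times)
for ts, label in labels:
    print(ts, label)
```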

Note that our algorithm managed to capture these events along with quite a few others. The black circles in the plots above mark the flagged points.

With the current hyperparameter choices, the three-standard-deviation threshold captures the known anomalies as well as the ones apparent in the temperature plot, but it is rather sensitive to fine-grained perturbations and anomalous behavior. Adding trees to the SageMaker RCF model, or using a larger data set, could smooth out the results.
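
Why more trees can smooth the scores is easiest to see in a toy simulation. The sketch below is not RCF: it assumes each "tree" returns the same true score plus independent Gaussian noise, then measures how much the averaged score varies across repeated trials. Real trees are not independent, so treat this only as intuition.

```python
import random
import statistics


def forest_score_spread(num_trees, trials=200, seed=0):
    """Standard deviation of the averaged score across repeated simulated forests."""
    rng = random.Random(seed)
    forest_scores = []
    for _ in range(trials):
        # Toy assumption: every tree reports 1.0 plus independent noise.
        tree_scores = [rng.gauss(1.0, 0.5) for _ in range(num_trees)]
        forest_scores.append(statistics.mean(tree_scores))
    return statistics.stdev(forest_scores)


spread_10 = forest_score_spread(10)
spread_100 = forest_score_spread(100)
print(spread_10, spread_100)
```

Under this assumption the spread shrinks roughly with the square root of the number of trees, which is why a larger forest gives steadier scores at the cost of more training and inference work.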

## Stop and Delete the Endpoint

Finally, we should delete the endpoint before we close the notebook.

To do so, execute the cell below. Alternatively, you can navigate to the "Endpoints" tab in the SageMaker console, select the endpoint with the name printed earlier (the value of `rcf_inference.endpoint`), and select "Delete" from the "Actions" dropdown menu.

In [None]:
sagemaker.Session().delete_endpoint(rcf_inference.endpoint)