# Data Drift Detector for Time Series

Data evolves over time, which changes both its distribution and how it should be interpreted. This change is known as drift, and it degrades the performance of ML models. The Drift Detector detects such changes in the incoming data and gives the user useful insight into the data and the model's behavior. The solution can also trigger an alert for model retraining based on the drift detection results.

## Contents

1. [Prerequisites](#Prerequisites)
1. [Data Dictionary](#Data-Dictionary)
1. [Create The Session](#Create-the-session)
1. [Create The Model](#Create-Model)
1. [Batch Transform Job](#Batch-Transform-Job)
1. [Invoke Endpoint](#Invoking-through-Endpoint)

### Prerequisites

To run this algorithm you need the following AWS resources:
- Access to Amazon SageMaker and a subscription to the model package.
- An S3 bucket for input and output data.
- An IAM role that allows SageMaker to read input from and write output to S3.


### Data Dictionary

- The input must be a '.csv' file with 'utf-8' encoding (content type 'text/csv'). PLEASE NOTE: If your input .csv file is not 'utf-8' encoded, the model will not perform as expected.
- The algorithm works with a time-series dataset containing at least 100 rows.
- The input must contain the columns 'class' and 'predicted'.
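
The requirements above can be checked locally before sending a file to the model. The following is a minimal sketch using only the standard library; the function name `validate_input_csv` and the wording of its messages are illustrative assumptions, not part of the model package.

```python
import csv
import io

REQUIRED_COLUMNS = {"class", "predicted"}  # columns named in the Data Dictionary
MIN_ROWS = 100  # minimum row count stated in the Data Dictionary


def validate_input_csv(raw_bytes):
    """Return a list of problems found in the CSV payload (empty if none).

    Hypothetical helper: it only mirrors the documented input rules.
    """
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not UTF-8 encoded"]

    problems = []
    reader = csv.DictReader(io.StringIO(text))
    # Reading fieldnames consumes the header row
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        problems.append("missing columns: " + ", ".join(sorted(missing)))
    # Count the remaining data rows
    row_count = sum(1 for _ in reader)
    if row_count < MIN_ROWS:
        problems.append(f"only {row_count} rows; at least {MIN_ROWS} required")
    return problems
```

To check a local file, read it in binary mode (for example `open("prediction.csv", "rb").read()`) and pass the bytes to the function; an empty list means none of the documented rules were violated.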


### Sample input data

In [1]:
import pandas as pd
df = pd.read_csv("prediction.csv")
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated since pandas 1.0
df.head(10)

Unnamed: 0.1,Unnamed: 0,date,day,period,nswprice,nswdemand,vicprice,vicdemand,transfer,class,predicted
0,0,0.0,2,0.0,0.056443,0.439155,0.003467,0.422915,0.414912,UP,UP
1,1,0.0,2,0.021277,0.051699,0.415055,0.003467,0.422915,0.414912,UP,UP
2,2,0.0,2,0.042553,0.051489,0.385004,0.003467,0.422915,0.414912,UP,UP
3,3,0.0,2,0.06383,0.045485,0.314639,0.003467,0.422915,0.414912,UP,UP
4,4,0.0,2,0.085106,0.042482,0.251116,0.003467,0.422915,0.414912,DOWN,DOWN
5,5,0.0,2,0.106383,0.041161,0.207528,0.003467,0.422915,0.414912,DOWN,DOWN
6,6,0.0,2,0.12766,0.041161,0.171824,0.003467,0.422915,0.414912,DOWN,DOWN
7,7,0.0,2,0.148936,0.041161,0.152782,0.003467,0.422915,0.414912,DOWN,DOWN
8,8,0.0,2,0.170213,0.041161,0.13493,0.003467,0.422915,0.414912,DOWN,DOWN
9,9,0.0,2,0.191489,0.041161,0.140583,0.003467,0.422915,0.414912,DOWN,DOWN


### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [2]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role

sess = sage.Session()
role = get_execution_role()

## Create Model

Now we use the model package to create a model.

In [3]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
from sagemaker import ModelPackage

model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/numerical-drift-v1'

# Reuse the role and session created in the previous cell
sagemaker_session = sess
model = ModelPackage(model_package_arn=model_package_arn,
                     role=role,
                     sagemaker_session=sagemaker_session)


## Input File

Now we pull a sample input file for testing the model.

In [4]:
sample_txt="s3://aws-marketplace-mphasis-assets/marketplace-drift-numerical/prediction.csv"

## Batch Transform Job

Now let's use the model to run a batch inference job and verify that it works.

In [5]:
# Run batch inference on a single ml.m5.xlarge instance
transformer = model.transformer(1, 'ml.m5.xlarge')
transformer.transform(sample_txt, content_type='text/csv')
transformer.wait()  # block until the job finishes
print("Batch Transform complete")


................. * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [30/Jun/2020 04:23:26] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [30/Jun/2020 04:23:26] "GET /execution-parameters HTTP/1.1" 404 -
       Unnamed: 0    date  day    period  ...  vicdemand  transfer  class  predicted
0               0  0.0000    2  0.000000  ...   0.422915  0.414912     UP         UP
1               1  0.0000    2  0.021277  ...   0.422915  0.414912     UP         UP
2               2  0.0000    2  0.042553  ...   0.422915  0.414912     UP         UP
3               3  0.0000    2  0.063830  ...   0.422915  0.414912     UP         UP
4               4  0.0000    2  0.085106  ...   0.422915  0.414912   DOWN       DOWN
...

## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [6]:
import boto3

print(transformer.output_path)
# output_path has the form s3://<bucket>/<job-folder>; split it into its parts
path_parts = transformer.output_path.split('/')
bucket_name = path_parts[2]
bucketFolder = path_parts[3]

s3_conn = boto3.client("s3")
# Batch Transform names each output file after its input file plus '.out'
with open('result.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder + '/prediction.csv.out', f)
print("Output file loaded from bucket")

s3://sagemaker-us-east-2-786796469737/numerical-drift-v1-2020-06-30-04-20-23--2020-06-30-04-20-24-073
Output file loaded from bucket
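
Index-based splitting of the output path only works when the output prefix is a single folder. As a sketch of a slightly more general approach, the helper below parses the S3 URI with `urllib.parse` and derives the output key from the input file name. The function name `batch_output_location` is an assumption made for illustration; the `.out` suffix follows the Batch Transform naming seen in the cell above.

```python
import posixpath
from urllib.parse import urlparse


def batch_output_location(output_path, input_uri):
    """Return (bucket, key) of the Batch Transform output for one input file.

    Hypothetical helper, not part of the SageMaker SDK.
    """
    parsed = urlparse(output_path)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {output_path!r}")
    # Everything after the bucket name is the key prefix (may be nested)
    prefix = parsed.path.strip("/")
    input_name = posixpath.basename(urlparse(input_uri).path)
    # Each result is written as '<input file name>.out'
    key = posixpath.join(prefix, input_name + ".out")
    return parsed.netloc, key
```

The returned pair can be passed directly to `download_fileobj` in place of the manually split values, for example `bucket_name, key = batch_output_location(transformer.output_path, sample_txt)`.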


## Result - Data points where drift has occurred

In [8]:
df = pd.read_csv("result.csv")
df=df.drop(columns='Unnamed: 0')
df

Unnamed: 0,date,day,period,nswprice,nswdemand,vicprice,vicdemand,transfer,class,predicted
0,0.000177,6,0.659574,0.072745,0.287266,0.003467,0.422915,0.414912,UP,UP
1,0.000221,7,0.319149,0.047256,0.180452,0.003467,0.422915,0.414912,DOWN,UP
2,0.000221,7,1.0,0.047106,0.293663,0.003467,0.422915,0.414912,DOWN,UP
3,0.000354,3,0.659574,0.045214,0.450164,0.003467,0.422915,0.414912,DOWN,UP
4,0.004248,1,1.0,0.076108,0.475156,0.003467,0.422915,0.414912,UP,UP
5,0.004292,2,0.659574,0.046806,0.505653,0.003467,0.422915,0.414912,DOWN,UP
6,0.004336,3,1.0,0.045394,0.506397,0.003467,0.422915,0.414912,DOWN,UP
7,0.005044,5,1.0,0.047886,0.514281,0.003467,0.422915,0.414912,DOWN,UP
8,0.005133,7,0.319149,0.068362,0.231627,0.003467,0.422915,0.414912,UP,UP
9,0.009646,4,0.319149,0.045605,0.631508,0.003467,0.422915,0.414912,DOWN,UP


## Invoking through Endpoint
Deploying the model to an endpoint is another option, which returns results as real-time inference. Here is a sample endpoint for reference.

In [9]:
import boto3
import sagemaker as sage
from sagemaker import ModelPackage, get_execution_role

role = get_execution_role()

sagemaker_session = sage.Session()
bucket = sagemaker_session.default_bucket()

In [10]:
content_type='text/csv'
model_name='drift-detector'
real_time_inference_instance_type='ml.c4.2xlarge'

In [11]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/numerical-drift-v1'

In [12]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [13]:
# Define a wrapper that builds a predictor sending CSV payloads.
# Note: RealTimePredictor is the SageMaker Python SDK v1 name; SDK v2 renamed it to Predictor.
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)

# Create a deployable model from the model package.
model = ModelPackage(role=role,
                     model_package_arn=model_package_arn,
                     sagemaker_session=sagemaker_session,
                     predictor_cls=predict_wrapper)

In [None]:
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

### 1. Invoking the endpoint through a CLI command

In [19]:
file_name="prediction.csv"

In [20]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'text/csv' --region us-east-2 result1.csv

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}


In [21]:
df = pd.read_csv("result1.csv")
df=df.drop(columns='Unnamed: 0')
df.head(20)

Unnamed: 0,date,day,period,nswprice,nswdemand,vicprice,vicdemand,transfer,class,predicted
0,0.000177,6,0.659574,0.072745,0.287266,0.003467,0.422915,0.414912,UP,UP
1,0.000221,7,0.319149,0.047256,0.180452,0.003467,0.422915,0.414912,DOWN,UP
2,0.000221,7,1.0,0.047106,0.293663,0.003467,0.422915,0.414912,DOWN,UP
3,0.000354,3,0.659574,0.045214,0.450164,0.003467,0.422915,0.414912,DOWN,UP
4,0.004248,1,1.0,0.076108,0.475156,0.003467,0.422915,0.414912,UP,UP
5,0.004292,2,0.659574,0.046806,0.505653,0.003467,0.422915,0.414912,DOWN,UP
6,0.004336,3,1.0,0.045394,0.506397,0.003467,0.422915,0.414912,DOWN,UP
7,0.005044,5,1.0,0.047886,0.514281,0.003467,0.422915,0.414912,DOWN,UP
8,0.005133,7,0.319149,0.068362,0.231627,0.003467,0.422915,0.414912,UP,UP
9,0.009646,4,0.319149,0.045605,0.631508,0.003467,0.422915,0.414912,DOWN,UP


### 2. Invoking the endpoint through Python code

In [22]:
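# Read the CSV as text (closing the file afterwards) and send it to the endpoint
with open('./prediction.csv', mode='r', encoding='utf-8') as f:
    data = f.read()
prediction = predictor.predict(data)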

In [23]:
from io import StringIO

# The endpoint returns CSV bytes; decode them and load the result into a DataFrame
s = prediction.decode('utf-8')
df = pd.read_csv(StringIO(s))
df = df.drop(columns='Unnamed: 0')
df

Unnamed: 0,date,day,period,nswprice,nswdemand,vicprice,vicdemand,transfer,class,predicted
0,0.000177,6,0.659574,0.072745,0.287266,0.003467,0.422915,0.414912,UP,UP
1,0.000221,7,0.319149,0.047256,0.180452,0.003467,0.422915,0.414912,DOWN,UP
2,0.000221,7,1.0,0.047106,0.293663,0.003467,0.422915,0.414912,DOWN,UP
3,0.000354,3,0.659574,0.045214,0.450164,0.003467,0.422915,0.414912,DOWN,UP
4,0.004248,1,1.0,0.076108,0.475156,0.003467,0.422915,0.414912,UP,UP
5,0.004292,2,0.659574,0.046806,0.505653,0.003467,0.422915,0.414912,DOWN,UP
6,0.004336,3,1.0,0.045394,0.506397,0.003467,0.422915,0.414912,DOWN,UP
7,0.005044,5,1.0,0.047886,0.514281,0.003467,0.422915,0.414912,DOWN,UP
8,0.005133,7,0.319149,0.068362,0.231627,0.003467,0.422915,0.414912,UP,UP
9,0.009646,4,0.319149,0.045605,0.631508,0.003467,0.422915,0.414912,DOWN,UP
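
The endpoint can also be called from Python without the SageMaker SDK predictor, using the boto3 `sagemaker-runtime` client and its `invoke_endpoint` call, which is what the CLI command in the previous section uses under the hood. The sketch below takes the client as a parameter so that it can be exercised without AWS access; the function name `invoke_csv_endpoint` is an illustrative assumption.

```python
def invoke_csv_endpoint(runtime_client, endpoint_name, csv_text):
    """Send CSV text to a SageMaker endpoint and return the CSV response as str.

    `runtime_client` is expected to be a boto3 'sagemaker-runtime' client.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_text.encode("utf-8"),
    )
    # 'Body' is a streaming object; read it fully and decode the bytes
    return response["Body"].read().decode("utf-8")


# Example wiring (requires AWS credentials and a live endpoint):
# import boto3
# runtime = boto3.client("sagemaker-runtime", region_name="us-east-2")
# with open("prediction.csv", encoding="utf-8") as f:
#     result_csv = invoke_csv_endpoint(runtime, "drift-detector", f.read())
```

The returned string can be loaded with `pd.read_csv(StringIO(result_csv))` in the same way as the predictor output above.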


In [24]:
predictor.delete_endpoint()
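
A running endpoint keeps incurring charges until it is deleted, so it is worth making sure that `delete_endpoint()` runs even when an inference call fails. One possible pattern, sketched here under the assumption that the model object exposes the same `deploy` call used above, is a small context manager; the name `temporary_endpoint` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_endpoint(model, instance_type, endpoint_name):
    """Deploy `model`, yield its predictor, and always delete the endpoint."""
    predictor = model.deploy(1, instance_type, endpoint_name=endpoint_name)
    try:
        yield predictor
    finally:
        # Runs on normal exit and on exceptions raised inside the with-block
        predictor.delete_endpoint()
```

With this helper, the deploy, predict, and delete cells could be combined as `with temporary_endpoint(model, real_time_inference_instance_type, model_name) as predictor: prediction = predictor.predict(data)`.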