# Mphasis Persistent Disk Storage Forecasting

Persistent disk storage forecasting helps businesses assess their local disk storage utilization based on historical usage patterns. It gives businesses a clearer view of the disk storage attached to their virtual machines, which helps them manage their infrastructure better. The solution uses ensemble ML algorithms with automated model selection, which applies a suitable model based on the input data. The ensemble learning approach is what provides consistent and better results.

## Contents

1. [Prerequisites](#Prerequisites)
1. [Data Dictionary](#Data-Dictionary)
1. [Create The Model](#Create-Model)
1. [Batch Transform Job](#Batch-Transform-Job)
1. [Invoke Endpoint](#Invoking-through-Endpoint)

### Prerequisites

To run this algorithm you need the following:
- Access to AWS SageMaker and the model package.
- An S3 bucket for the input and output files.
- A role that allows AWS SageMaker to read the input from and write the output to S3.


### Data Dictionary

- The input has to be a '.csv' file with 'utf-8' encoding. PLEASE NOTE: If your input .csv file is not 'utf-8' encoded, the model will not perform as expected.
1. The file must have a unique identifier column called 'maskedsku', e.g. 'maskedsku' can be destinationId or sourceId.
2. The date columns should be named in the format 'YYYY-MM-DD HH:MM'.
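The requirements above can be checked before uploading a file. The following is a minimal sketch using only the standard library; the helper name `check_input_csv` is hypothetical, and skipping an empty header cell (the index column that pandas shows as 'Unnamed: 0') is an assumption based on the sample file.

```python
import csv
import io
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M"  # 'YYYY-MM-DD HH:MM' from the data dictionary


def check_input_csv(raw_bytes):
    """Return a list of problems found in the raw CSV bytes (empty list = none found)."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not UTF-8 encoded"]

    header = next(csv.reader(io.StringIO(text)), [])
    problems = []
    if "maskedsku" not in header:
        problems.append("missing 'maskedsku' column")

    for name in header:
        # Assumption: an empty header cell is a saved index column, so skip it.
        if name in ("", "maskedsku"):
            continue
        try:
            datetime.strptime(name, DATE_FORMAT)
        except ValueError:
            problems.append(f"column {name!r} is not in YYYY-MM-DD HH:MM format")
    return problems
```

For a local file, `check_input_csv(open("sample_disk_space.csv", "rb").read())` returns an empty list when no problem is detected. The check only looks at the header row, not at the values.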

### Sample input data

In [10]:
import pandas as pd
import boto3
import re
df = pd.read_csv("sample_disk_space.csv")
df.head(10)

Unnamed: 0,maskedsku,2018-08-01 12:00,2018-08-01 13:00,2018-08-01 14:00,2018-08-01 15:00,2018-08-01 16:00,2018-08-01 17:00,2018-08-01 18:00,2018-08-01 19:00,2018-08-01 20:00,...,2018-08-02 13:00,2018-08-02 14:00,2018-08-02 15:00,2018-08-02 16:00,2018-08-02 17:00,2018-08-02 18:00,2018-08-02 19:00,2018-08-02 20:00,2018-08-02 21:00,2018-08-02 22:00
0,product_1,13380.82192,15244.93151,14925.20548,13585.9726,11365.47945,20060.54795,12861.36986,14945.2274,14490.37808,...,15046.35616,19864.93151,14184.9863,12370.84932,19949.58904,14228.38356,19529.55616,16279.7589,14330.9589,15056.87671
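The sample is in wide format: one row per 'maskedsku' and one column per hourly timestamp. For plotting or inspection it can help to turn a row into a list of (timestamp, value) pairs. A minimal sketch under that reading of the sample follows; `wide_row_to_series` is a hypothetical helper, not part of the model package.

```python
import csv
import io
from datetime import datetime


def wide_row_to_series(csv_text, sku):
    """Return [(timestamp, value), ...] for the row whose 'maskedsku' equals `sku`."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row.get("maskedsku") != sku:
            continue
        series = []
        for name, value in row.items():
            try:
                stamp = datetime.strptime(name, "%Y-%m-%d %H:%M")
            except (TypeError, ValueError):
                continue  # not a timestamp column, e.g. 'maskedsku' or the index
            series.append((stamp, float(value)))
        return series
    return []  # no row with that identifier
```

Columns whose names do not parse as timestamps are skipped, so the identifier and index columns do not end up in the series.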


### Create the session

The session remembers our connection parameters to SageMaker. We'll use it to perform all of our SageMaker operations.

In [37]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/forecasting-disk-space-usage'
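The ARN encodes the region and account of the model package, so it is worth confirming that its region matches the region you run the notebook in. The sketch below splits an ARN into its fields; `parse_model_package_arn` is a hypothetical helper and assumes the standard `arn:partition:service:region:account:resource` layout.

```python
def parse_model_package_arn(arn):
    """Split a model package ARN into its named parts."""
    parts = arn.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    _, partition, service, region, account, resource = parts
    # The resource field looks like 'model-package/<name>'.
    resource_type, _, name = resource.partition("/")
    return {
        "partition": partition,
        "service": service,
        "region": region,
        "account": account,
        "resource_type": resource_type,
        "name": name,
    }
```

For the ARN above, the `region` field can be compared with the region of your SageMaker session before creating the model.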

In [12]:
import sagemaker as sage
from time import gmtime, strftime
from sagemaker import get_execution_role


role = get_execution_role()
sess = sage.Session()

## Create Model

Now we use the model package to create a model.

In [13]:


from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()
model = ModelPackage(model_package_arn=model_package_arn,
                     role=role,
                     sagemaker_session=sagemaker_session)


## Input File

Now we pull a sample input file for testing the model.

In [14]:
sample_txt="s3://mphasis-marketplace/disk-space-usage/sample_disk_space.csv"

## Batch Transform Job

Now let's use the model built to run a batch inference job and verify it works.

In [15]:
# Run the batch transform job on a single ml.m5.xlarge instance
transformer = model.transformer(1, 'ml.m5.xlarge')
transformer.transform(sample_txt, content_type='text/csv')
# Block until the job finishes; results are written under transformer.output_path
transformer.wait()
print("Batch Transform complete")


.................Importing plotly failed. Interactive plots will not work.
 * Serving Flask app "serve" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: off
 * Running on http://0.0.0.0:8080/ (Press CTRL+C to quit)
169.254.255.130 - - [26/May/2020 16:24:32] "GET /ping HTTP/1.1" 200 -
169.254.255.130 - - [26/May/2020 16:24:32] "GET /execution-parameters HTTP/1.1" 404 -
   maskedsku  2018-08-01 12:00  ...  2018-08-02 21:00  2018-08-02 22:00
0  product_1       13380.82192  ...        14330.9589       15056.87671

[1 rows x 36 columns]
35
2020-05-26T16:24:32.627:[sagemaker logs]: MaxConcurrentTransforms=1, MaxPayloadInMB=6, BatchStrategy=MULTI_RECORD
0    [18800.868403010776, 17346.997223684073, 16615...
dtype: object

0  product_1          13380.8  ...               16340.7               18376.9

[1 rows x 60 columns]
169.254.255.130 - - [26/May/2020 16:25:08] "POST /invocations HTTP/1.1" 200 -
INFO:werkzeug:169.254.255.130 - - [26/May/2020 16:25:08] "POST /invocations HTTP/1.1" 200 -

Batch Transform complete


## Output from Batch Transform

Note: Ensure that the boto3 package is installed on the local system.

In [16]:
import boto3

print(transformer.output_path)
# output_path has the form s3://<bucket>/<prefix>; split it into bucket name and key prefix
bucket_name, bucketFolder = transformer.output_path.replace('s3://', '', 1).split('/', 1)

s3_conn = boto3.client("s3")
# Batch transform names each result <input file name>.out under the output prefix
with open('sample_disk_result.csv', 'wb') as f:
    s3_conn.download_fileobj(bucket_name, bucketFolder + '/sample_disk_space.csv.out', f)
    print("Output file loaded from bucket")

s3://sagemaker-us-east-2-786796469737/forecasting-disk-space-usage-2020-05-26-2020-05-26-16-21-29-675
Output file loaded from bucket
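The download above relies on batch transform writing each result under the job's output path with `.out` appended to the input file name. A minimal sketch of deriving the bucket and key from the two S3 URIs follows; `transform_output_location` is a hypothetical helper name.

```python
import posixpath
from urllib.parse import urlparse


def transform_output_location(output_path, input_uri):
    """Return (bucket, key) of the batch transform result for one input file."""
    parsed = urlparse(output_path)           # s3://<bucket>/<prefix>
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    input_name = posixpath.basename(urlparse(input_uri).path)
    key = posixpath.join(prefix, input_name + ".out")
    return bucket, key
```

With the values from this notebook, `transform_output_location(transformer.output_path, sample_txt)` gives the bucket and key that can be passed to `download_fileobj`.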


In [34]:
df = pd.read_csv("sample_disk_result.csv")
#df  = df.drop('Unnamed: 0',1)
df.head(10)

Unnamed: 0,maskedsku,2018-08-01 12:00,2018-08-01 13:00,2018-08-01 14:00,2018-08-01 15:00,2018-08-01 16:00,2018-08-01 17:00,2018-08-01 18:00,2018-08-01 19:00,2018-08-01 20:00,...,201808031300_forecast,201808031400_forecast,201808031500_forecast,201808031600_forecast,201808031700_forecast,201808031800_forecast,201808031900_forecast,201808032000_forecast,201808032100_forecast,201808032200_forecast
0,product_1,13380.82192,15244.93151,14925.20548,13585.9726,11365.47945,20060.54795,12861.36986,14945.2274,14490.37808,...,18002.753231,21782.566594,17195.127321,16008.596415,24668.373953,18218.794262,20839.149042,20710.743651,16340.686193,18376.925103
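In the sample output, the input columns are kept and the forecast columns are appended with names such as '201808031300_forecast'. A minimal sketch of separating the two groups by that naming pattern follows; the helper name `split_forecast_columns` is hypothetical, and the `YYYYMMDDHHMM_forecast` pattern is inferred from the sample output rather than from a documented specification.

```python
from datetime import datetime

FORECAST_SUFFIX = "_forecast"


def split_forecast_columns(columns):
    """Return (input_columns, {forecast_column: timestamp}) from a list of column names."""
    history, forecast = [], {}
    for name in columns:
        if name.endswith(FORECAST_SUFFIX):
            stamp = name[: -len(FORECAST_SUFFIX)]
            # Forecast column names carry a compact timestamp such as 201808031300
            forecast[name] = datetime.strptime(stamp, "%Y%m%d%H%M")
        else:
            history.append(name)
    return history, forecast
```

Calling `split_forecast_columns(list(df.columns))` on the result DataFrame then lets you select only the forecast values with `df[list(forecast)]`.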


## Invoking through Endpoint
Deploying the model to an endpoint is another option; it returns results as real-time inference. A sample endpoint is created here for reference.

In [42]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role
import boto3

role = get_execution_role()

sagemaker_session = sage.Session()
bucket=sagemaker_session.default_bucket()

In [43]:
content_type='text/csv'
model_name='disk-space-usage'
real_time_inference_instance_type='ml.c4.2xlarge'

In [44]:
# Please use the appropriate ARN obtained after subscribing to the model to define 'model_package_arn'
model_package_arn = 'arn:aws:sagemaker:us-east-2:786796469737:model-package/forecasting-disk-space-usage'

In [45]:
from sagemaker import ModelPackage
import sagemaker as sage
from sagemaker import get_execution_role

role = get_execution_role()
sagemaker_session = sage.Session()

In [46]:
# Wrap the endpoint in a predictor that sends 'text/csv' payloads.
# Note: RealTimePredictor is the SageMaker Python SDK v1 name; SDK v2 renamed it to Predictor.
def predict_wrapper(endpoint, session):
    return sage.RealTimePredictor(endpoint, session, content_type=content_type)

# Create a deployable model from the model package.
model = ModelPackage(role=role,
                     model_package_arn=model_package_arn,
                     sagemaker_session=sagemaker_session,
                     predictor_cls=predict_wrapper)

In [47]:
predictor = model.deploy(1, real_time_inference_instance_type, endpoint_name=model_name)

-------------!

### 1. Invoking endpoint result through CLI command

In [48]:
file_name="sample_disk_space.csv"

In [49]:
!aws sagemaker-runtime invoke-endpoint --endpoint-name $model_name --body fileb://$file_name --content-type 'text/csv' --region us-east-2 result_energy.csv

{
    "ContentType": "text/csv; charset=utf-8",
    "InvokedProductionVariant": "AllTraffic"
}


In [50]:
df = pd.read_csv("result_energy.csv")
#df  = df.drop('Unnamed: 0',1)
df.head(10)

Unnamed: 0,maskedsku,2018-08-01 12:00,2018-08-01 13:00,2018-08-01 14:00,2018-08-01 15:00,2018-08-01 16:00,2018-08-01 17:00,2018-08-01 18:00,2018-08-01 19:00,2018-08-01 20:00,...,201808031300_forecast,201808031400_forecast,201808031500_forecast,201808031600_forecast,201808031700_forecast,201808031800_forecast,201808031900_forecast,201808032000_forecast,201808032100_forecast,201808032200_forecast
0,product_1,13380.82192,15244.93151,14925.20548,13585.9726,11365.47945,20060.54795,12861.36986,14945.2274,14490.37808,...,18002.753231,21782.566594,17195.127321,16008.596415,24668.373953,18218.794262,20839.149042,20710.743651,16340.686193,18376.925103
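The CLI command above corresponds to the `invoke_endpoint` operation of the SageMaker runtime API. A minimal sketch of the same call from Python follows; the helper name `invoke_csv_endpoint` is hypothetical, and the client is passed in as an argument so that it can be created separately, for example with `boto3.client("sagemaker-runtime", region_name="us-east-2")`.

```python
def invoke_csv_endpoint(runtime_client, endpoint_name, csv_bytes):
    """Send CSV bytes to a SageMaker endpoint and return the response body as text.

    `runtime_client` is expected to be a boto3 'sagemaker-runtime' client.
    """
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="text/csv",
        Body=csv_bytes,
    )
    # The response body is a stream; the CLI output above reports UTF-8 CSV.
    return response["Body"].read().decode("utf-8")
```

The returned text can be loaded with `pd.read_csv(StringIO(text))` in the same way as the file written by the CLI.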


### 2. Invoking endpoint result through Python code

In [51]:
# Read the sample file and close it before sending the payload to the endpoint
with open('./sample_disk_space.csv', mode='r') as f:
    data = f.read()
prediction = predictor.predict(data)

In [53]:
from io import StringIO

s=str(prediction,'utf-8')
data = StringIO(s) 
df=pd.read_csv(data)
#df  = df.drop('Unnamed: 0',1)
df

Unnamed: 0,maskedsku,2018-08-01 12:00,2018-08-01 13:00,2018-08-01 14:00,2018-08-01 15:00,2018-08-01 16:00,2018-08-01 17:00,2018-08-01 18:00,2018-08-01 19:00,2018-08-01 20:00,...,201808031300_forecast,201808031400_forecast,201808031500_forecast,201808031600_forecast,201808031700_forecast,201808031800_forecast,201808031900_forecast,201808032000_forecast,201808032100_forecast,201808032200_forecast
0,product_1,13380.82192,15244.93151,14925.20548,13585.9726,11365.47945,20060.54795,12861.36986,14945.2274,14490.37808,...,18002.753231,21782.566594,17195.127321,16008.596415,24668.373953,18218.794262,20839.149042,20710.743651,16340.686193,18376.925103


In [54]:
predictor.delete_endpoint()