### from the AICamp Workshop: Unsupervised Learning and Deep Learning Based Forecasting

https://www.youtube.com/watch?v=amTzgvJg-ZE

Using the Rossmann Store Sales dataset to predict sales one week into the future

In [25]:
import pandas as pd # A tool we'll use to download and preview CSV files
import pprint # A tool to pretty print dictionary outputs
pp = pprint.PrettyPrinter(indent=2)


In [6]:
import creds
api_key = creds.apiKey

In [7]:
from realityengines.client import ReClient
client = ReClient(api_key)

In [8]:
client.list_use_cases()

[UseCase(use_case='CUSTOMER_CHURN', pretty_name='Customer Churn Prediction', description='Identify customers who are most likely to churn out of your system and send them marketing promotions/emails to retain them. Deploy a deep learning, real-time model that identifies customers who are most likely to leave and increase retention.'),
 UseCase(use_case='ENERGY', pretty_name='Real-Time Forecasting', description='Accurately forecast energy or computation usage in real-time. Make downstream planning decision based on your predictions. We use generative modeling and deep learning to augment your dataset with synthetic data. This unique approach allows us to make accurate predictions in real-time, even when you have little historical data.'),
 UseCase(use_case='FINANCIAL_METRICS', pretty_name='Financial Metrics Forecasting', description='Accurately plan your cash flow, revenue and sales with state-of-the-art deep learning-based forecasting. We use generative modeling and deep learning to au

In [9]:
use_case = 'RETAIL'  #@param {type: "string"}

In [10]:
for requirement in client.describe_use_case_requirements(use_case):
  pp.pprint(requirement.to_dict())

{ 'allowed_column_mappings': { 'DATE': { 'allowed_data_types': ['TIMESTAMP'],
                                         'description': 'Date (day, year or '
                                                        'month) that '
                                                        'corresponds to the '
                                                        'demand value}',
                                         'required': True},
                               'DEMAND': { 'allowed_data_types': ['NUMERICAL'],
                                           'description': 'The demand value '
                                                          'you are '
                                                          'forecasting. (e.g '
                                                          'sales)',
                                           'required': True},
                               'FUTURE': { 'description': 'Known values ahead '
                                              

In [11]:
forecasting_project = client.create_project(name='Store Sales Forecasting', use_case=use_case)
forecasting_project.to_dict()

{'project_id': '1198584b44',
 'name': 'Store Sales Forecasting',
 'use_case': 'RETAIL',
 'created_at': '2020-04-23T17:49:20+00:00'}

In [12]:
sales_history = pd.read_csv('https://s3.amazonaws.com/realityengines.exampledatasets/sales_forecasting/store_sales_timeseries.csv')
sales_history.to_csv('sales_history.csv', index=False)
sales_history

Unnamed: 0,Store,Date,Sales,Customers,Promo,StateHoliday,SchoolHoliday
0,1,2015-07-31,5263,555,1,0,1
1,2,2015-07-31,6064,625,1,0,1
2,3,2015-07-31,8314,821,1,0,1
3,4,2015-07-31,13995,1498,1,0,1
4,5,2015-07-31,4822,559,1,0,1
5,6,2015-07-31,5651,589,1,0,1
6,7,2015-07-31,15344,1414,1,0,1
7,8,2015-07-31,8492,833,1,0,1
8,9,2015-07-31,8565,687,1,0,1
9,10,2015-07-31,7185,681,1,0,1


In [13]:
store_attr = pd.read_csv('https://s3.amazonaws.com/realityengines.exampledatasets/sales_forecasting/store_metadata.csv')
store_attr.to_csv('store_attr.csv', index=False)
store_attr

Unnamed: 0,Store,StoreType,Assortment,CompetitionDistance,CompetitionOpenSinceMonth,CompetitionOpenSinceYear,Promo2,Promo2SinceWeek,Promo2SinceYear,PromoInterval
0,1,c,a,1270.0,9.0,2008.0,0,,,
1,2,a,a,570.0,11.0,2007.0,1,13.0,2010.0,"Jan,Apr,Jul,Oct"
2,3,a,a,14130.0,12.0,2006.0,1,14.0,2011.0,"Jan,Apr,Jul,Oct"
3,4,c,c,620.0,9.0,2009.0,0,,,
4,5,a,a,29910.0,4.0,2015.0,0,,,
5,6,a,a,310.0,12.0,2013.0,0,,,
6,7,a,c,24000.0,4.0,2013.0,0,,,
7,8,a,a,7520.0,10.0,2014.0,0,,,
8,9,a,c,2030.0,8.0,2000.0,0,,,
9,10,a,a,3160.0,9.0,2009.0,0,,,


In [None]:
# Upload the datasets to RealityEngines.AI
sales_upload = client.create_dataset_from_local_file(
    'Store Sales History',
    project_id=forecasting_project.project_id,
    dataset_type='TIMESERIES')
with open('sales_history.csv') as file:
    sales_dataset = sales_upload.upload_file(file)
    print("Sales History Dataset Uploaded")

store_upload = client.create_dataset_from_local_file(
    'Store Attributes',
    project_id=forecasting_project.project_id,
    dataset_type='ITEM_ATTRIBUTES')
with open('store_attr.csv') as file:
    store_dataset = store_upload.upload_file(file)
    print("Store Attributes Dataset Uploaded")

# Keep both dataset handles so later cells can loop over them
datasets = [sales_dataset, store_dataset]

**Check Dataset Processing Status**

Once a file is uploaded, RealityEngines.AI inspects and featurizes the dataset to detect its schema automatically. The following cell waits until inspection has finished for each dataset, then prints the schema that RealityEngines.AI detected.


In [15]:
for dataset in datasets:
    dataset.wait_for_inspection()
    print(f'{dataset.name} Schema:')
    pp.pprint(client.get_schema(forecasting_project.project_id, dataset.dataset_id))

Store Sales History Schema:
[ Schema(name='Store', column_mapping='ITEM_ID', column_data_type='IDENTIFIER'),
  Schema(name='Date', column_mapping='DATE', column_data_type='TIMESTAMP'),
  Schema(name='Sales', column_mapping='DEMAND', column_data_type='NUMERICAL'),
  Schema(name='Customers', column_mapping=None, column_data_type='NUMERICAL'),
  Schema(name='Promo', column_mapping=None, column_data_type='CATEGORICAL'),
  Schema(name='StateHoliday', column_mapping=None, column_data_type='CATEGORICAL'),
  Schema(name='SchoolHoliday', column_mapping=None, column_data_type='CATEGORICAL')]
Store Attributes Schema:
[ Schema(name='Store', column_mapping='ITEM_ID', column_data_type='IDENTIFIER'),
  Schema(name='StoreType', column_mapping=None, column_data_type='CATEGORICAL'),
  Schema(name='Assortment', column_mapping=None, column_data_type='CATEGORICAL'),
  Schema(name='CompetitionDistance', column_mapping=None, column_data_type='NUMERICAL'),
  Schema(name='CompetitionOpenSinceMonth', column_map

Each column in a Dataset gets a Column Data Type and all required Column Mappings will be set using heuristics. 

There are some additional features that we'd like to set on the Store Sales Dataset.

For each Dataset Type in a Use Case, there are special Column Mappings that can be applied to a column. We can find the list of available Column Mappings by calling the Describe Use Case Requirements API.


In [16]:
client.describe_use_case_requirements(use_case)[0].allowed_column_mappings

{'ITEM_ID': {'description': 'The unique identifier of the item whose demand you are forecasting (e.g product Id, sku id)',
  'allowed_data_types': ['CATEGORICAL'],
  'required': True},
 'DEMAND': {'description': 'The demand value you are forecasting. (e.g sales)',
  'allowed_data_types': ['NUMERICAL'],
  'required': True},
 'DATE': {'description': 'Date (day, year or month) that corresponds to the demand value}',
  'allowed_data_types': ['TIMESTAMP'],
  'required': True},
 'FUTURE': {'description': 'Known values ahead of time (eg., Holidays) that will be present during prediction',
  'multiple': True},
 'IGNORE': {'description': 'Ignore this column in training', 'multiple': True}}
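
As a quick sanity check before changing any mappings, the sketch below compares a detected schema against the allowed mappings and reports which required mappings are still unassigned and which columns have no mapping yet. It works on plain dictionaries and tuples copied from the outputs above rather than on the client's own objects; the helper names `missing_required_mappings` and `unmapped_columns` are hypothetical and not part of the RealityEngines.AI client.

```python
# Allowed mappings, copied from the output above (descriptions omitted).
allowed_column_mappings = {
    'ITEM_ID': {'allowed_data_types': ['CATEGORICAL'], 'required': True},
    'DEMAND': {'allowed_data_types': ['NUMERICAL'], 'required': True},
    'DATE': {'allowed_data_types': ['TIMESTAMP'], 'required': True},
    'FUTURE': {'multiple': True},
    'IGNORE': {'multiple': True},
}

# (column name, column mapping) pairs, copied from the detected sales schema.
sales_schema = [
    ('Store', 'ITEM_ID'), ('Date', 'DATE'), ('Sales', 'DEMAND'),
    ('Customers', None), ('Promo', None),
    ('StateHoliday', None), ('SchoolHoliday', None),
]


def missing_required_mappings(allowed, schema):
    """Return the required mappings that no column in the schema uses."""
    used = {mapping for _, mapping in schema if mapping is not None}
    return sorted(name for name, spec in allowed.items()
                  if spec.get('required') and name not in used)


def unmapped_columns(schema):
    """Return the columns that have no mapping assigned yet."""
    return [name for name, mapping in schema if mapping is None]


print(missing_required_mappings(allowed_column_mappings, sales_schema))
print(unmapped_columns(sales_schema))
```

The unmapped columns are the candidates for an optional mapping such as FUTURE or IGNORE.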

We want to mark the two holiday columns as FUTURE in our time series dataset, because holiday dates are known in advance.

The result of this call will be the updated schema after the Column Mappings are applied.


In [17]:
client.set_column_mapping(project_id=forecasting_project.project_id, dataset_id=sales_dataset.dataset_id,
                          column='StateHoliday', column_mapping='FUTURE')
client.set_column_mapping(project_id=forecasting_project.project_id, dataset_id=sales_dataset.dataset_id,
                          column='SchoolHoliday', column_mapping='FUTURE')

[Schema(name='Store', column_mapping='ITEM_ID', column_data_type='IDENTIFIER'),
 Schema(name='Date', column_mapping='DATE', column_data_type='TIMESTAMP'),
 Schema(name='Sales', column_mapping='DEMAND', column_data_type='NUMERICAL'),
 Schema(name='Customers', column_mapping=None, column_data_type='NUMERICAL'),
 Schema(name='Promo', column_mapping=None, column_data_type='CATEGORICAL'),
 Schema(name='StateHoliday', column_mapping='FUTURE', column_data_type='CATEGORICAL'),
 Schema(name='SchoolHoliday', column_mapping='FUTURE', column_data_type='CATEGORICAL')]

### Train a model


In [19]:
forecasting_project.validate()

ProjectValidation(valid=True, dataset_errors=[])

In [26]:
# PREDICTION_LENGTH (default 7) sets how many time steps, here days, the model forecasts into the future
forecasting_project.get_training_config_options()

[TrainingConfigOptions(name='TEST_SPLIT', data_type='INTEGER', value=None, default=10, options={'range': [5, 20]}, description='Percent of dataset to use for test data. We support using a range between 5% to 20% of your dataset to use as test data.', required=None, last_model_value=None),
 TrainingConfigOptions(name='DROPOUT_RATE', data_type='INTEGER', value=None, default=None, options={'range': [1, 10]}, description='Dropout percentage rate.', required=None, last_model_value=None),
 TrainingConfigOptions(name='BATCH_SIZE', data_type='ENUM', value=None, default=None, options={'values': [16, 32, 64, 128]}, description='Batch size.', required=None, last_model_value=None),
 TrainingConfigOptions(name='PREDICTION_LENGTH', data_type='INTEGER', value=None, default=7, options={'range': [1, 100]}, description='How many timesteps in the future to predict', required=None, last_model_value=None),
 TrainingConfigOptions(name='PROBABILITY_QUANTILES', data_type='MULTI_ENUM', value=None, default=[0.1
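
The options above describe the allowed range or values for each setting. A minimal sketch of checking a candidate configuration against them before training is shown below. It assumes that option names such as `PREDICTION_LENGTH` can be used as keys of the `training_config` dictionary, which this notebook does not confirm (it only passes an empty dictionary). The helper `validate_training_config` is hypothetical, and `PROBABILITY_QUANTILES` is left out because its output is truncated above.

```python
# Options copied from the output above (only the fields needed for validation).
config_options = {
    'TEST_SPLIT': {'data_type': 'INTEGER', 'range': [5, 20]},
    'DROPOUT_RATE': {'data_type': 'INTEGER', 'range': [1, 10]},
    'BATCH_SIZE': {'data_type': 'ENUM', 'values': [16, 32, 64, 128]},
    'PREDICTION_LENGTH': {'data_type': 'INTEGER', 'range': [1, 100]},
}


def validate_training_config(config, options):
    """Return a list of problems; an empty list means the config looks valid."""
    problems = []
    for name, value in config.items():
        spec = options.get(name)
        if spec is None:
            problems.append(f'{name}: unknown option')
        elif 'range' in spec:
            low, high = spec['range']
            if not isinstance(value, int) or not low <= value <= high:
                problems.append(f'{name}: {value!r} is not an integer in [{low}, {high}]')
        elif 'values' in spec and value not in spec['values']:
            problems.append(f'{name}: {value!r} is not one of {spec["values"]}')
    return problems


# A two-week forecast horizon with the default test split.
candidate = {'PREDICTION_LENGTH': 14, 'TEST_SPLIT': 10}
print(validate_training_config(candidate, config_options))
```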

In [21]:
forecasting_model = forecasting_project.train_model(training_config={})
forecasting_model.to_dict()

{'name': 'Store Sales Forecasting Model',
 'model_id': '7490a3bef',
 'model_config': {},
 'created_at': '2020-04-23T18:01:09+00:00',
 'project_id': '1198584b44',
 'latest_model_instance': {'model_instance_id': 'd22979a4',
  'status': 'PENDING',
  'model_id': '7490a3bef',
  'training_started_at': '2020-04-23T18:01:09+00:00',
  'training_completed_at': None}}

In [22]:
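# Note: this deployment ID and token are hard-coded and differ from the ones
# created later in this notebook (In [28] and In [29]).
ReClient().get_forecast(deployment_token="7f5305fba2004199a96bfbca88669c25",
                        deployment_id="13d9392c63",
                        query_data={"Store": "1"})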

{'p10': [{'Date': '2015-08-01T00:00:00', 'Sales': 2359.1279296875},
  {'Date': '2015-08-02T00:00:00', 'Sales': 233.60757446289062},
  {'Date': '2015-08-03T00:00:00', 'Sales': 3945.16455078125},
  {'Date': '2015-08-04T00:00:00', 'Sales': 4377.07763671875},
  {'Date': '2015-08-05T00:00:00', 'Sales': 2910.220458984375},
  {'Date': '2015-08-06T00:00:00', 'Sales': 901.6558227539062},
  {'Date': '2015-08-07T00:00:00', 'Sales': 3536.629150390625}],
 'p50': [{'Date': '2015-08-01T00:00:00', 'Sales': 2895.922607421875},
  {'Date': '2015-08-02T00:00:00', 'Sales': 533.9945068359375},
  {'Date': '2015-08-03T00:00:00', 'Sales': 4904.3271484375},
  {'Date': '2015-08-04T00:00:00', 'Sales': 5133.029296875},
  {'Date': '2015-08-05T00:00:00', 'Sales': 3505.1728515625},
  {'Date': '2015-08-06T00:00:00', 'Sales': 1551.3121337890625},
  {'Date': '2015-08-07T00:00:00', 'Sales': 4124.46630859375}],
 'p90': [{'Date': '2015-08-01T00:00:00', 'Sales': 3632.3525390625},
  {'Date': '2015-08-02T00:00:00', 'Sales': 1

#### Waiting for training and evaluation can take a long time, possibly hours

In [23]:
forecasting_model.wait_for_evaluation()

Model(name='Store Sales Forecasting Model', model_id='7490a3bef', model_config={}, created_at='2020-04-23T18:01:09+00:00', project_id='1198584b44', latest_model_instance=ModelInstance(model_instance_id='d22979a4', status='COMPLETE', model_id='7490a3bef', training_started_at='2020-04-23T18:01:09+00:00', training_completed_at='2020-04-23T18:29:14+00:00'))
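
`wait_for_evaluation()` blocks until the model reaches a finished state. How the client does this internally is not shown in this notebook. If you would rather bound the wait yourself, a generic polling loop with a timeout might look like the sketch below. The function `wait_for_status` and its `get_status` callback are hypothetical stand-ins, not part of the RealityEngines.AI client; only the `'PENDING'` and `'COMPLETE'` status strings come from the outputs above.

```python
import time


def wait_for_status(get_status, done_states=('COMPLETE',),
                    poll_seconds=30.0, timeout_seconds=4 * 60 * 60,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns a state in done_states or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in done_states:
            return status
        if clock() >= deadline:
            raise TimeoutError(f'still {status!r} after {timeout_seconds} seconds')
        sleep(poll_seconds)


# Stand-in for a real status call: PENDING twice, then COMPLETE.
fake_statuses = iter(['PENDING', 'PENDING', 'COMPLETE'])
print(wait_for_status(lambda: next(fake_statuses), poll_seconds=0))
```

Passing `sleep` and `clock` as parameters keeps the loop easy to test without actually waiting.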

## Checkpoint - this block encapsulates everything up until now
import pandas as pd
import pprint
pp = pprint.PrettyPrinter(indent=2)
api_key = ''  #@param {type: "string"}
from realityengines.client import ReClient
client = ReClient(api_key)
forecasting_project = next(project for project in client.list_projects() if project.use_case == 'RETAIL')
forecasting_model = forecasting_project.list_models()[-1]
forecasting_model.wait_for_evaluation()



In [27]:
# this runs after the evaluation is over
pp.pprint(forecasting_model.get_metrics().to_dict())

{ 'metric_names': [ {'nrmse': 'Normalized Root Mean Square Error'},
                    {'smape': 'Symmetric Mean Absolute Percent Error'}],
  'metrics': {'nrmse': 0.23895544855275425, 'smape': 32.15120281673451},
  'model_id': '7490a3bef',
  'model_instance_id': 'd22979a4'}


### To get a better understanding of what these metrics mean:

    Normalized Root Mean Square Error (nrmse):

    NRMSE measures forecast error: it is the root mean square of the differences between the predicted and the observed values, normalized so that it can be compared across series of different scales. The lower the value, the better. We report the metric as a percentage value. Normally, NRMSE values below 30% are considered good.

    Symmetric Mean Absolute Percent Error (smape):

    SMAPE averages, over all data points, the absolute difference between the predicted and the observed values relative to their magnitudes. It is reported on a scale from 0% to 100%. A score of 0% means that the model forecasted perfectly. A score of 100% means that the model was completely inaccurate.
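
To make the two metrics concrete, here is a minimal pure-Python sketch. The exact formulas RealityEngines.AI uses are not given in this notebook, so two common conventions are assumed: NRMSE is normalized by the mean of the observed values, and SMAPE uses the variant bounded between 0% and 100%. The sample numbers are made up for illustration.

```python
import math


def nrmse(actual, predicted):
    """RMSE divided by the mean of the observed values (assumed normalization)."""
    n = len(actual)
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    return rmse / (sum(actual) / n)


def smape(actual, predicted):
    """SMAPE on a 0-100 scale: mean of |p - a| / (|a| + |p|), times 100."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = abs(a) + abs(p)
        # Treat a point where both values are zero as a perfect prediction.
        terms.append(0.0 if denom == 0 else abs(p - a) / denom)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical daily sales and forecasts, for illustration only.
actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]
print(nrmse(actual, predicted), smape(actual, predicted))
```

With this SMAPE variant, a forecast of zero for a non-zero observation scores 100% for that point, which matches the "completely inaccurate" end of the scale.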


### 4. Deploy Model

After the model has been trained, we need to deploy the model to be able to start making predictions. Deploying a model will reserve cloud resources to host the model for realtime and batch predictions.

In [28]:
forecasting_deployment = forecasting_model.create_deployment('Store Sales Deployment')
forecasting_deployment.wait_for_deployment()

Deployment(deployment_id='15427b1e01', name='Store Sales Deployment', status='ACTIVE', description='', deployment_config=None, deployed_at='2020-04-23T18:53:00+00:00', created_at='2020-04-23T18:52:59+00:00', project_id='1198584b44', model_id='7490a3bef')

After the model is deployed, we need to create a deployment token for authenticating prediction requests. This token is only authorized to predict on deployments in this project, so it's safe to embed this token inside of a user-facing application or website. 

In [29]:
deployment_token = forecasting_project.create_deployment_token().deployment_token
deployment_token

'1c5444d7c1694cc1b731a09bdd0ffeba'

### 5. Predict
Now that you have an Active deployment and a Deployment Token to authenticate requests, you can call the prediction command below.

This command will return a JSON Map containing the Prediction Quantiles P10, P50 and P90, each containing an Array of objects representing the prediction of Sales for each Date for the store with ID "1".


In [30]:
client.get_forecast(deployment_token=deployment_token, 
                              deployment_id=forecasting_deployment.deployment_id, 
                              query_data={"Store":"1"})

{'p10': [{'Date': '2015-08-01T00:00:00', 'Sales': 3661.40966796875},
  {'Date': '2015-08-02T00:00:00', 'Sales': 78.0924301147461},
  {'Date': '2015-08-03T00:00:00', 'Sales': 4273.931640625},
  {'Date': '2015-08-04T00:00:00', 'Sales': 2486.921875},
  {'Date': '2015-08-05T00:00:00', 'Sales': 1734.584716796875},
  {'Date': '2015-08-06T00:00:00', 'Sales': 156.35986328125},
  {'Date': '2015-08-07T00:00:00', 'Sales': 1636.0501708984375}],
 'p50': [{'Date': '2015-08-01T00:00:00', 'Sales': 4393.29248046875},
  {'Date': '2015-08-02T00:00:00', 'Sales': 359.0133361816406},
  {'Date': '2015-08-03T00:00:00', 'Sales': 5258.83056640625},
  {'Date': '2015-08-04T00:00:00', 'Sales': 3108.142333984375},
  {'Date': '2015-08-05T00:00:00', 'Sales': 2379.417236328125},
  {'Date': '2015-08-06T00:00:00', 'Sales': 718.6151123046875},
  {'Date': '2015-08-07T00:00:00', 'Sales': 2432.581298828125}],
 'p90': [{'Date': '2015-08-01T00:00:00', 'Sales': 5015.91357421875},
  {'Date': '2015-08-02T00:00:00', 'Sales': 1056
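
The response groups predictions by quantile, which is awkward to read date by date. The sketch below merges the three lists into one row per date so that the P10 to P90 band is visible at a glance. It runs on a shortened copy of the response with rounded values; the second `p90` value is a placeholder rather than the real (truncated) output, and the helper name `forecast_rows` is hypothetical.

```python
# A shortened response in the same shape as the output above.
forecast = {
    'p10': [{'Date': '2015-08-01T00:00:00', 'Sales': 3661.4},
            {'Date': '2015-08-02T00:00:00', 'Sales': 78.1}],
    'p50': [{'Date': '2015-08-01T00:00:00', 'Sales': 4393.3},
            {'Date': '2015-08-02T00:00:00', 'Sales': 359.0}],
    'p90': [{'Date': '2015-08-01T00:00:00', 'Sales': 5015.9},
            {'Date': '2015-08-02T00:00:00', 'Sales': 1100.0}],  # placeholder value
}


def forecast_rows(forecast, target='Sales'):
    """Merge the per-quantile lists into one dictionary per date."""
    rows = {}
    for quantile, points in forecast.items():
        for point in points:
            row = rows.setdefault(point['Date'], {'Date': point['Date']})
            row[quantile] = point[target]
    # ISO 8601 timestamps sort chronologically as plain strings.
    return [rows[date] for date in sorted(rows)]


for row in forecast_rows(forecast):
    band = round(row['p90'] - row['p10'], 1)
    print(row['Date'][:10], row['p10'], row['p50'], row['p90'], 'band width', band)
```

The resulting list of dictionaries can be passed directly to `pd.DataFrame(...)` for plotting or further analysis.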

This method gets you forecasts for a single Store, but what about all of your stores? If you need the results immediately, you can call get_forecast once for each of your Store IDs. If you can wait for the results asynchronously, you can instead call the batch prediction API to generate the latest forecasts for all of your Stores. The API then writes the result of each prediction to a JSON Lines file, which you can retrieve and use.

In [None]:
batch_predict_job = client.batch_predict(deployment_id=forecasting_deployment.deployment_id)
batch_predict_job.wait_for_predictions()

#### Read results using pandas

In [None]:
pd.read_json(batch_predict_job.output_location, lines=True)
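
For reference, a JSON Lines file holds one complete JSON document per line, which is what `lines=True` tells pandas to expect. The sketch below parses such content with only the standard library. The record layout of the real batch output is not shown in this notebook, so the two sample lines and their field names are assumptions made for illustration, and `read_json_lines` is a hypothetical helper.

```python
import io
import json

# Two hypothetical result lines standing in for the batch prediction output.
sample_output = io.StringIO(
    '{"Store": "1", "p50": [{"Date": "2015-08-01T00:00:00", "Sales": 4393.3}]}\n'
    '{"Store": "2", "p50": [{"Date": "2015-08-01T00:00:00", "Sales": 5120.0}]}\n'
)


def read_json_lines(lines):
    """Parse one JSON document per non-empty line."""
    return [json.loads(line) for line in lines if line.strip()]


records = read_json_lines(sample_output)
print(len(records), [record['Store'] for record in records])
```

Because each line is independent, a large results file can also be processed one line at a time instead of being loaded all at once.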