Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 03b Forecasting Pipeline
---


In this notebook we create a pipeline for forecasting with 11,973 AutoML models. These models were trained and scored in the Training notebook in this repository. We set up the pipeline to forecast over the desired forecasting horizon, and we use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the process. For more information about the data and models, refer to the Data Preparation and Training notebooks.

The pipeline setup is similar to the Training Pipeline in this repository. For more details on the steps and functions, refer to the Training folder.

### Prerequisites 
At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](../00_Setup_AML_Workspace.ipynb)
2. Run [01b_Data_Preparation.ipynb](../01b_Data_Preparation/01b_Data_Preparation.ipynb) to create the dataset
3. Run [02b_Train_AutomatedML.ipynb](../02b_Train_AutoML/02b_Train_AutoML.ipynb) to train the models

## 1.0 Call the Workspace, Datastore, and Compute

As we did in the Training Pipeline notebook, we need to call the Workspace. We also want to create variables for the datastore and compute cluster. 

### Connect to the workspace


In [None]:
import azureml.core
from azureml.core import Workspace, Datastore
import pandas as pd

# set up workspace
ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

# set up datastores
dstore = ws.get_default_datastore()

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Default datastore name'] = dstore.name
# pandas >= 1.0 expects None to disable column truncation; older versions used -1
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

### Attach existing compute resource


In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "train-many-model"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D13_V2',
                                                           min_nodes=3,
                                                           max_nodes=20)
    # Create the cluster.
    compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
# For a more detailed view of current AmlCompute status, use get_status().

### Set up an Experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'manymodels-forecasting-pipeline')

### Call the Datastore

In [None]:
from azureml.core import Datastore

dstore = ws.get_default_datastore()

## 2.0 Call Registered FileDataset
In the Data Preparation notebook, we registered the orange juice inference data to the Workspace. You can choose to run the pipeline on the subset of 10 series or the full dataset of 11,973 series. We recommend starting with 10 series and then expanding.

In [None]:
from azureml.core.dataset import Dataset

filedst_10_models = Dataset.get_by_name(ws, name='oj_inference_small')
filedst_10_models_input = filedst_10_models.as_named_input('forecast_10_models')
 
filedst_all_models = Dataset.get_by_name(ws, name='oj_inference')
filedst_all_models_input = filedst_all_models.as_named_input('forecast_all_models')

## 3.0 Build forecasting pipeline
Now that the data, models, and compute resources are set up, we can put together a pipeline for forecasting. 
### Set up the environment to run the script
Specify the conda dependencies for your script. This will allow us to install packages and configure the environment.

In [None]:
from scripts.helper import get_automl_environment
forecast_env = get_automl_environment()

### Create the configuration to wrap the entry script 
[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_config.parallelrunconfig) holds the configuration for the parallel run step. You will need to determine the number of workers and nodes appropriate for your use case. The process_count_per_node is based on the number of cores of the compute VM. The node_count determines the number of compute nodes to use; increasing the node count will speed up the forecasting process.


* <b>node_count</b>: The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing the node_count if forecasting is taking too long.

* <b>process_count_per_node</b>: The number of processes per node.

* <b>run_invocation_timeout</b>: The run() method invocation timeout in seconds. Set it to the maximum time one run() invocation is expected to take (with some buffer); the default is 60 seconds.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend setting the parallelism to a maximum of 20 runs per experiment per workspace. Users who exceed this limit may encounter Too Many Requests errors (HTTP 429). </b></span>
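
As a rough sanity check before choosing these settings, the sketch below multiplies node_count by process_count_per_node and compares the result with the recommended ceiling of 20 from the note above. It assumes that each worker process issues at most one run at a time, which is a simplification and not something this notebook confirms. The helper names `total_parallel_workers` and `within_recommended_limit` are hypothetical and are not part of the Azure ML SDK.

```python
# Recommended ceiling on concurrent runs, taken from the note above.
RECOMMENDED_MAX_PARALLEL_RUNS = 20


def total_parallel_workers(node_count: int, process_count_per_node: int) -> int:
    """Return the number of worker processes that run at the same time."""
    if node_count < 1 or process_count_per_node < 1:
        raise ValueError("node_count and process_count_per_node must be >= 1")
    return node_count * process_count_per_node


def within_recommended_limit(node_count: int, process_count_per_node: int) -> bool:
    """Compare the worker count with the ceiling (assumes one run per worker)."""
    workers = total_parallel_workers(node_count, process_count_per_node)
    return workers <= RECOMMENDED_MAX_PARALLEL_RUNS


# 3 nodes x 6 processes = 18 workers, which stays under the ceiling of 20.
print(total_parallel_workers(3, 6), within_recommended_limit(3, 6))
# 4 nodes x 6 processes = 24 workers, which exceeds it.
print(total_parallel_workers(4, 6), within_recommended_limit(4, 6))
```

If the check fails, lowering either setting brings the product back under the ceiling; which one to lower depends on the core count of the VM size you chose.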

In [None]:
#!pip install azureml.contrib.pipeline.steps

In [None]:
from scripts.helper import build_parallel_run_config_for_forecasting

# PLEASE MODIFY the following three settings based on your compute and experiment timeout.
node_count=3
process_count_per_node=6
run_invocation_timeout=300 # timeout in seconds; increase it for larger models


parallel_run_config = build_parallel_run_config_for_forecasting(forecast_env, compute, node_count, process_count_per_node, run_invocation_timeout)

### Create the ParallelRunStep
 The [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) is the main step in our pipeline. We specified the following parameters: **input**, **output**, and **arguments**. We also set the output directory.   

For the orange juice sales forecasting, we pass three **arguments** to the entry_script:
- **group_column_names**: list of column names that identify the groups
- **target_column_name**: [Optional] column name, needed only if the inference dataset contains the target
- **time_column_name**: [Optional] column name, needed only if the data is a time series

*arguments* and *inputs* are the two parameters that can pass information to the entry_script.
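
To make the argument format concrete, here is a minimal sketch of how an entry script could read these values with `argparse`. The actual entry script used by this pipeline is not shown in this notebook, so the function name `parse_forecast_args` and the parsing choices are assumptions for illustration only.

```python
import argparse


def parse_forecast_args(argv):
    """Parse arguments in the format passed to the ParallelRunStep."""
    parser = argparse.ArgumentParser()
    # One or more column names that together identify a series (a group).
    parser.add_argument("--group_column_names", nargs="+", required=True)
    # Optional: present only when the inference data contains the target.
    parser.add_argument("--target_column_name", default=None)
    # Optional: needed only for time series data.
    parser.add_argument("--time_column_name", default=None)
    # parse_known_args tolerates extra options that a framework might append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_forecast_args([
    "--group_column_names", "Store", "Brand",
    "--target_column_name", "Quantity",
    "--time_column_name", "WeekStarting",
])
print(args.group_column_names, args.target_column_name, args.time_column_name)
```

Because `--group_column_names` uses `nargs="+"`, the two values `Store` and `Brand` are collected into one list, which matches how they are passed as separate items in the **arguments** list.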

You can change between running the pipeline on a subset of models or the full data set by changing the inputs parameter. 

In [None]:
from azureml.pipeline.core import PipelineData
from azureml.contrib.pipeline.steps import ParallelRunStep

forecasting_output_name = 'forecasting_output'

output_dir = PipelineData(name = forecasting_output_name, 
                          datastore = dstore)

parallelrun_step = ParallelRunStep(
    name="many-models-forecasting",
    parallel_run_config=parallel_run_config,
    inputs=[filedst_10_models_input], 
    #inputs=[filedst_all_models_input],
    output=output_dir,
    models= [], 
    arguments=['--group_column_names', 'Store', 'Brand',
              '--target_column_name', 'Quantity', # this is optional, and needs to be passed only if inference data contains target column
              '--time_column_name', 'WeekStarting'  # this is needed for timeseries
              ])

## 4.0 Run the forecast pipeline
We can use the Experiment we created to track the runs of the pipeline and review the output.

In [None]:
from azureml.pipeline.core import Pipeline

# steps is documented as a list of pipeline steps
pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
run = experiment.submit(pipeline)

You can run the following command if you'd like to monitor the forecasting process in a Jupyter notebook. It will stream logs live while forecasting.

**Note**: this command may not work on a Notebook VM; however, it should work on your local laptop.

In [None]:
run.wait_for_completion(show_output=True)

Successfully forecasted on AutoML models.

## 5.0 Pipeline Outputs
The forecasting pipeline forecasts the orange juice quantity for a Store by Brand. The pipeline returns one file with the predictions for each store and outputs the result to the forecasting_output Blob container. The details of the blob container are listed in 'forecasting_output.txt' under Outputs+logs.

The following code snippet:
1. Downloads the contents of the output folder that is passed in the parallel run step
2. Reads the parallel_run_step.txt file, which contains the predictions, into a pandas DataFrame
3. Displays the first 10 rows of the predictions

In [None]:
import pandas as pd
import shutil
import os
import sys 
from scripts.helper import get_forecasting_output

forecasting_results_name = "forecasting_results"

forecast_file = get_forecasting_output(run, forecasting_results_name, forecasting_output_name)
df = pd.read_csv(forecast_file, delimiter=" ", header=None)
df.columns = ["Week Starting", "Store", "Brand", "Quantity", "Advert", "Price", "Revenue", "Predicted"]
print(f"Prediction has {df.shape[0]} rows. Here the first 10 rows are being displayed.")
df.head(10)
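
When the inference data contains the target column, as it does here with Quantity, a natural next step is to compare Quantity with Predicted for each Store and Brand series. The sketch below computes a mean absolute error per series. It uses only the standard library so that it runs without a workspace, and the four sample rows are synthetic values made up for illustration, not real pipeline output. It assumes the same space-delimited, headerless layout and column order as the cell above.

```python
import csv
import io
from collections import defaultdict

# Column order assumed to match the forecasting output read above.
COLUMNS = ["Week Starting", "Store", "Brand", "Quantity",
           "Advert", "Price", "Revenue", "Predicted"]

# Synthetic rows for illustration only (not real forecasting results).
SAMPLE = """\
1992-10-08 2 dominicks 100 1 2.0 200.0 110
1992-10-15 2 dominicks 120 0 2.0 240.0 110
1992-10-08 2 tropicana 50 0 3.0 150.0 45
1992-10-15 2 tropicana 60 1 3.0 180.0 65
"""


def mean_absolute_error_by_series(text):
    """Return {(store, brand): mean absolute error} for space-delimited rows."""
    errors = defaultdict(list)
    for values in csv.reader(io.StringIO(text), delimiter=" "):
        if not values:  # skip blank lines
            continue
        row = dict(zip(COLUMNS, values))
        key = (row["Store"], row["Brand"])
        errors[key].append(abs(float(row["Quantity"]) - float(row["Predicted"])))
    return {key: sum(errs) / len(errs) for key, errs in errors.items()}


mae = mean_absolute_error_by_series(SAMPLE)
for series, value in sorted(mae.items()):
    print(series, value)
```

With the pandas DataFrame from the cell above, the same comparison could be expressed as a group-by on Store and Brand; the standard-library version is used here only to keep the sketch self-contained.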