Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 04 Forecasting Pipeline

In this notebook we create a pipeline for forecasting with 11,973 ARIMA models. The training and scoring of these models were completed in the Training and Scoring notebooks in this repository. Here we set up the pipeline to forecast over a desired forecasting horizon. We use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the process. For more information about the data and models, refer to the Data Preparation and Training notebooks.

## Prerequisites 

This example runs on an Azure Machine Learning Notebook VM. We are calling models that have **already been trained and registered** to the Workspace. If you have already run the Environment Setup and Training Pipeline notebooks, or you have an AML Notebook VM set up with models registered to the Workspace, you are all set.

## 1.0 Call the Workspace, Datastore, and Compute

As we did in the Training Pipeline notebook, we need to call the Workspace. We also want to create variables for the datastore and compute cluster. 

### Connect to the workspace
Create a workspace object. *Workspace.from_config()* reads the config.json file and loads the details into an object named ws. 

In [None]:
from azureml.core import Workspace 

ws = Workspace.from_config(path='../aml_config/config.json')

# ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
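
For reference, the config.json that *Workspace.from_config()* reads is a small JSON file with three keys: `subscription_id`, `resource_group`, and `workspace_name`. The sketch below is only a local sanity check and is not part of the pipeline. The helper `missing_config_keys` and the placeholder values are hypothetical, and the code uses just the standard library so it runs without the Azure ML SDK.

```python
import json
import os
import tempfile

# Keys that Workspace.from_config() expects to find in config.json.
REQUIRED_KEYS = ('subscription_id', 'resource_group', 'workspace_name')

def missing_config_keys(path):
    """Return the required keys that are absent or empty in the file at `path`."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# Write a placeholder config to a temporary folder and check it.
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, 'config.json')
    with open(config_path, 'w') as f:
        json.dump({'subscription_id': '<your-subscription-id>',
                   'resource_group': '<your-resource-group>',
                   'workspace_name': '<your-workspace-name>'}, f, indent=2)
    missing = missing_config_keys(config_path)

print('Missing keys:', missing)
```

Running the same check against `../aml_config/config.json` before calling *Workspace.from_config()* can help separate a malformed file from an authentication problem.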

### Attach existing compute resource
In the Environment Setup notebook, we created a compute cluster. Here we attach it by name.

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

compute = AmlCompute(ws, 'train-many-model')

### Call the Datastore containing the Orange Juice sales data
In the Data Preparation notebook, we uploaded the CSV files for each Store and Brand combination. Use *.get_default_datastore()* to retrieve the datastore we uploaded the files into.

In [None]:
dstore = ws.get_default_datastore()

## 2.0 Call the Registered FileDataset
In the Data Preparation notebook, we uploaded our data to Blob storage and then registered the folder of data as a FileDataset to the Workspace. We now retrieve that Dataset so we can pass it as an input to our ParallelRunStep.

In [None]:
from azureml.core.dataset import Dataset

FileDst10Models = Dataset.get_by_name(ws, name='oj_data_small')
FileDst10ModelsInput = FileDst10Models.as_named_input('Forecast10Models')

In [None]:
FileDstAllModels = Dataset.get_by_name(ws, name='oj_data')
FileDstAllModelsInputs = FileDstAllModels.as_named_input('ForecastAllmodels')

## 3.0 Build forecasting pipeline
Now that the data, models, and compute resources are set up, we can put together a pipeline for forecasting. 
### Set up the environment to run the script
Specify the conda dependencies for your script. This will allow us to install packages and configure the environment. 

In [None]:
from azureml.core import Environment
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies

# set up the batch environment settings
# ('scikit-learn' is the PyPI package name; the 'sklearn' alias is deprecated)
batch_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn', 'pmdarima'])

batch_env = Environment(name="manymodels_environment")
batch_env.python.conda_dependencies = batch_conda_deps
batch_env.docker.enabled = True
batch_env.docker.base_image = DEFAULT_CPU_IMAGE

### Create the configuration to wrap the inference script 
In the [ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py), we will call the entry script, environment configuration, and parameters. You will want to determine the number of workers and nodes appropriate for your use case.
- The **workercount** is based on the number of cores on the VM. Each node in the compute cluster we set up has 8 cores, so we set our worker count to 8.
- The **nodecount** determines the number of nodes to use. Increasing the node count should speed up the process. We started with 3, then increased the number so that the job completes in about an hour.
- You should set the **timeout** to be slightly longer than the amount of time it takes for one iteration of your script to complete. In this example, that is the amount of time to pull down a model and make predictions. The default is 60 seconds.

**include info here about runs we tried**

We added tags for additional information about our settings for the step. 

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig 

workercount = 8
nodecount = 5
timeout = 500

tags1 = {}
tags1['nodes'] = nodecount
tags1['workers-per-node'] = workercount
tags1['timeout'] = timeout 

parallel_run_config = ParallelRunConfig(
    source_directory = './scripts',
    entry_script = 'forecast.py',
    mini_batch_size = '1',
    run_invocation_timeout = timeout, 
    error_threshold = 10,
    output_action = 'append_row', 
    environment = batch_env, 
    process_count_per_node = workercount, 
    compute_target = compute, 
    node_count = nodecount
)
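
As a rough way to reason about these settings, the sketch below works out how many mini-batches each worker process handles and what that implies for the average time per model. It assumes mini-batches are spread evenly across all worker processes, which is an idealization, and it uses the approximate 1h 5m run time reported later in this notebook. The helper name `batches_per_worker` is hypothetical.

```python
import math

def batches_per_worker(n_items, node_count, process_count_per_node, mini_batch_size=1):
    """Mini-batches handled by the busiest worker, assuming an even split."""
    total_workers = node_count * process_count_per_node
    n_batches = math.ceil(n_items / mini_batch_size)
    return math.ceil(n_batches / total_workers)

# 11,973 input files, 5 nodes, 8 worker processes per node, one file per mini-batch.
per_worker = batches_per_worker(11973, node_count=5, process_count_per_node=8)

# Upper bound on the average time per mini-batch: the whole run (about 65 minutes)
# also includes cluster start-up and the upload step, so the true average is lower.
run_seconds = 65 * 60
max_avg_seconds = run_seconds / per_worker

print(per_worker, 'mini-batches per worker')
print(max_avg_seconds, 'seconds per mini-batch at most, on average')
```

Under these assumptions each worker processes about 300 models at no more than roughly 13 seconds each on average, well below the 500 second timeout, which leaves headroom for slow series.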

### Create the ParallelRunStep
This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) is the main step in our pipeline.

We can specify the following parameters:
- **inputs** : The data that will be used in the entry_script.
- **output** : This is the directory where the output of the step will be written. 
- **models** : This provides additional metadata about the models used in the step. 
- **arguments** : You can specify arguments that you want to pass to the entry_script with this argument.

*arguments* and *inputs* are the two parameters that can pass information to the entry_script.

We also need to specify an output directory. This is where output from the step will be stored.
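
To make the role of *inputs* concrete: when the input is a FileDataset, ParallelRunStep calls the entry script's `run(mini_batch)` function with a list of file paths, and with `output_action = 'append_row'` the values returned from each call are appended to a single output file. Our forecast.py is not shown in this notebook, so the following is only a minimal sketch of that contract. The file name, the column names, and the returned row format are hypothetical, and the model lookup is left as a comment.

```python
import csv
import os
import tempfile

def init():
    """Called once per worker process; a real script would set up shared state here."""
    pass

def run(mini_batch):
    """Receive a list of file paths and return one result row per file."""
    results = []
    for file_path in mini_batch:
        with open(file_path, newline='') as f:
            rows = list(csv.DictReader(f))
        # Derive the series name from the file name (hypothetical naming scheme).
        series_name = os.path.splitext(os.path.basename(file_path))[0]
        # A real script would load the registered model for this series
        # and produce forecasts here; this sketch only reports the row count.
        results.append('{},{}'.format(series_name, len(rows)))
    return results

# Local smoke test with one small, made-up input file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'Store1_BrandA.csv')
    with open(demo_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['WeekStarting', 'Quantity'])
        writer.writerow(['1992-09-24', 100])
        writer.writerow(['1992-10-01', 120])
    init()
    demo_output = run([demo_path])

print(demo_output)
```

Because `mini_batch_size` is '1' in our configuration, each call to `run` receives a single file, so one slow or failing series affects only its own mini-batch.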

#### For the orange juice sales forecasting, we have two arguments passed to the entry_script. 
- **forecast_horizon** is how far into the future the forecast should extend.
- **starting_date** is the date to begin forecasting.

In [None]:
from azureml.pipeline.core import PipelineData

datasetname = 'store'
output_dir = PipelineData(name = 'forecasting_output', 
                         datastore = dstore)

parallelrun_step = ParallelRunStep(
    name="many-models-forecasting",
    parallel_run_config=parallel_run_config,
    inputs=[FileDstAllModelsInputs],  
    output=output_dir,
    models= [],
    arguments=['--forecast_horizon', 8,
              '--starting_date', '1992-10-01'],
    allow_reuse = False
    )
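
On the script side, these arguments arrive as command-line strings. The sketch below shows one way an entry script might parse them; it is an assumption about how forecast.py could be written, not its actual code. The helper names are hypothetical, and the weekly spacing of forecast dates follows the weekly sales data described in this notebook. `parse_known_args` is used so that any extra arguments on the command line are ignored rather than raising an error.

```python
import argparse
from datetime import datetime, timedelta

def parse_forecast_args(argv):
    """Parse the two forecasting arguments, ignoring any unrecognized ones."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--forecast_horizon', type=int, required=True)
    parser.add_argument('--starting_date', required=True,
                        type=lambda s: datetime.strptime(s, '%Y-%m-%d'))
    args, _unknown = parser.parse_known_args(argv)
    return args

def forecast_dates(args):
    """Dates to forecast: one per week, beginning at the starting date."""
    return [args.starting_date + timedelta(weeks=i)
            for i in range(args.forecast_horizon)]

# Simulate the command line the step would build from `arguments`.
demo_args = parse_forecast_args(['--forecast_horizon', '8',
                                 '--starting_date', '1992-10-01',
                                 '--some_other_flag', 'ignored'])
demo_dates = forecast_dates(demo_args)

print(demo_dates[0].date(), 'to', demo_dates[-1].date())
```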

We created a PythonScriptStep to retrieve the output of the ParallelRunStep and upload this final prediction file to a dedicated blob container. 

We first set up the RunConfiguration to make sure we have the right environment and the necessary conda packages installed.

### Set up RunConfiguration for PythonScriptStep

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

conda_run_config = RunConfiguration(framework="python")
conda_run_config.target = compute
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
cd = CondaDependencies.create(pip_packages=['azureml-pipeline-core'], conda_packages=['pandas'])
conda_run_config.environment.python.conda_dependencies = cd

We then set up the PythonScriptStep. This step is very similar to the one used in the Training and Scoring Pipeline notebooks.

### Create PythonScriptStep

In [None]:
from azureml.pipeline.steps import PythonScriptStep

upload_predictions_python_script_step = PythonScriptStep(
    name="upload_predictions",
    script_name="upload_predictions.py",
    compute_target=compute,
    source_directory='./scripts',
    runconfig=conda_run_config,
    arguments=['--ParallelRunStep_name', 'many-models-forecasting',
               '--pipeline_output_name', 'forecasting_output',
               '--datastore', 'forecasting_output_datastore',
               '--experiment', 'automl-ojforecasting',
               '--overwrite_predictions', True],
    allow_reuse=False)

### Set up the step sequence

We set up a step sequence to make sure ParallelRunStep and PythonScriptStep execute in the correct order. In this example, we run PythonScriptStep after ParallelRunStep because it retrieves ParallelRunStep's output data.

In [None]:
from azureml.pipeline.core import StepSequence

all_steps = StepSequence(steps=[parallelrun_step, upload_predictions_python_script_step])

## 4.0 Run forecasting pipeline

### Submit and Run the Pipeline
Create an Experiment to track the runs of the pipeline. Then, you can run the pipeline and review the output. With the current settings and the Standard_D13_V2 VM the pipeline takes approximately 1h 5m to run forecasts for all 11,973 models.  

In [None]:
# set up the experiment
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

experiment = Experiment(ws, 'automl-ojforecasting')

pipeline = Pipeline(workspace = ws, steps=all_steps)

run = experiment.submit(pipeline, tags=tags1)

In [None]:
run.wait_for_completion(show_output=True)

Successfully forecasted and logged 11,973 models.

## 5.0 Publish the pipeline

After a successful run, we publish the pipeline to the Workspace.

In [None]:
published_pipeline = pipeline.publish(name = 'forecast-all-models',
                                     description = 'forecast 11,973 models and log the run',
                                     version = '1',
                                     continue_on_step_failure = True
                                     )

## 6.0 Schedule the pipeline to run weekly

A published pipeline represents a Pipeline to be submitted without the Python code which constructed it.

In addition, a [PublishedPipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.publishedpipeline?view=azure-ml-py) can be used to resubmit a Pipeline with different PipelineParameter values and inputs.

We retrieve the id of the pipeline we just published and then schedule the pipeline at the desired cadence. In this case, we schedule the forecasting pipeline to run once a week, starting Jan 1, 2020 at 1PM UTC.

In [None]:
from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
forecast_pipeline_id = published_pipeline.id

recurrence = ScheduleRecurrence(frequency="Week", interval=1, start_time="2020-01-01T13:00:00")
recurring_schedule = Schedule.create(ws, name="Forecasting-Pipeline-Recurring-Schedule", 
                            description="Schedule forecasting Pipeline to run weekly starting Jan 1, 2020 at 1PM",
                            pipeline_id=forecast_pipeline_id, 
                            experiment_name=experiment.name, 
                            recurrence=recurrence)
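
To see what this recurrence means in calendar terms, the sketch below lists the nominal run times for `frequency="Week"` and `interval=1`. It assumes the schedule fires at `start_time` and then every `interval` weeks after it, which is our reading of the settings rather than something verified against the service. Since the start time is in the past, the first real run would fall on the next occurrence after the schedule is created. The helper `weekly_run_times` is hypothetical and uses only the standard library.

```python
from datetime import datetime, timedelta

def weekly_run_times(start_time, interval_weeks, count):
    """Nominal run times: start_time, then every `interval_weeks` weeks after it."""
    start = datetime.strptime(start_time, '%Y-%m-%dT%H:%M:%S')
    return [start + timedelta(weeks=interval_weeks * i) for i in range(count)]

first_runs = weekly_run_times('2020-01-01T13:00:00', interval_weeks=1, count=3)
for run_time in first_runs:
    print(run_time.isoformat())
```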

## 7.0 Review outputs of the forecasting pipeline

The forecasting pipeline will predict the sales quantity for the next 8 weeks for all 11,973 models. 

The pipeline generates forecasting results for each model and uploads a concatenated prediction file to the forecasting output blob container, in a folder named 'oj_forecasts'.