# Forecasting Pipeline

In this notebook we create a pipeline for forecasting with 11,973 ARIMA models. The training and scoring of these models were completed in the Training and Scoring notebooks in this repository. We set up the pipeline to produce forecasts for a desired forecasting horizon, and we use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the process. For more information about the data and models, refer to the Data Preparation and Training notebooks. 

# Prerequisites 

This example runs on an Azure Machine Learning Notebook VM. We are calling models that have **already been trained and registered** to the Workspace. If you have already run the Environment Setup and Training Pipeline notebooks, or you have an AML Notebook set up with models registered to the Workspace, you are all set. 

## Call the Workspace, Datastore, and Compute

As we did in the Training Pipeline notebook, we need to call the Workspace. We also want to create variables for the datastore and compute cluster. 

### Connect to the workspace
Create a workspace object. *Workspace.from_config()* reads the config.json file and loads the details into an object named ws. 

In [None]:
from azureml.core import Workspace 

# ws = Workspace(subscription_id="bbd86e7d-3602-4e6d-baa4-40ae2ad9303c", resource_group="ManyModelsSA", workspace_name="ManyModelsSAv1")

ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

### Attach existing compute resource
In the Environment Setup notebook, we created a compute cluster. Here we attach it by name. 

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

compute = AmlCompute(ws, 'train-many-model')

### Call the Datastore containing the Orange Juice sales data
In the Data Preparation notebook, we uploaded a CSV file for each store and brand combination. Use *.get_default_datastore()* to get the datastore we uploaded the files into. 

In [None]:
dstore = ws.get_default_datastore()

### Call Registered FileDataset
In the Data Preparation notebook, we uploaded our data to Blob storage and then registered the folder of data as a FileDataset in the Workspace. We call that Dataset here in order to pass it as an input to our ParallelRunStep. 

In [None]:
from azureml.core.dataset import Dataset

FileDst10Models = Dataset.get_by_name(ws, name='10modelsfiledataset')
FileDst10ModelsInput = FileDst10Models.as_named_input('Train10Models')

In [None]:
from azureml.core.dataset import Dataset

FileDstAllModels = Dataset.get_by_name(ws, name='AllDataProd')
FileDstAllModelsInputs = FileDstAllModels.as_named_input('ForecastAllmodels')

In [None]:
# from azureml.core.dataset import Dataset

# ds_name = 'AllDataProd'
# oj_ds = Dataset.get_by_name(ws, name = ds_name)

### Subset the number of models to run
If you would like to test the forecasting pipeline with a subset of models, use the following code to take a sample of the full dataset. The *take_sample* function takes a random sample, so use the same seed value across the Training, Scoring and Forecasting notebooks to make sure the same subset is selected. Otherwise, run the commented-out line to run all 11,000+ models. 

In [None]:
# # Subset to 14 models
# oj_ds_subset = oj_ds.take_sample(0.001, seed = 1248) 

# # Subset to 50 models
# #oj_ds_subset = oj_ds.take_sample(0.005, seed = 1248)

# # subset to 1,000 models 
# #oj_ds_subset = oj_ds.take_sample(0.084, seed = 1248)

# oj_input_data = oj_ds_subset.as_named_input('oj_series')

# # This line will create a dataset for the full 11,973 models. 
# #oj_input_data = oj_ds.as_named_input('oj_series')
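
To illustrate why reusing the seed matters, here is a minimal standard-library sketch of seeded random sampling. It is not how *take_sample* is implemented; the helper `sample_by_probability` and the file names are hypothetical. Each item is kept independently with the given probability, so the subset size varies around probability × total, but the same seed always selects the same items.

```python
import random

def sample_by_probability(items, probability, seed):
    """Keep each item independently with the given probability (hypothetical helper)."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the draw
    return [item for item in items if rng.random() < probability]

# Hypothetical file names standing in for the 11,973 store/brand series.
all_files = [f"series_{i}.csv" for i in range(11973)]

subset_a = sample_by_probability(all_files, 0.001, seed=1248)
subset_b = sample_by_probability(all_files, 0.001, seed=1248)
subset_c = sample_by_probability(all_files, 0.001, seed=999)

print(subset_a == subset_b)          # same seed, same subset
print(len(subset_a), len(subset_c))  # sizes are near 0.001 * 11973, not exact
```

Because the selection is probabilistic, the counts quoted in the comments above (14, 50, 1,000) should be read as approximate.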

## Create entry script for Forecasting
To use the models to make forecasts, you need an **entry script** and a list of **dependencies**:

#### An entry script
This script accepts requests, scores the requests by using the model, and returns the results.
- __init()__ - Typically this function loads the model into a global object. It runs only once at the start of batch processing per worker node/process. The init method can use the following environment variable (a ParallelRunStep input):
    - AZUREML_BI_OUTPUT_PATH: the output folder path
- __run(mini_batch)__ - The method to be parallelized. Each invocation receives one mini-batch.

    - __mini_batch__: Batch inference invokes the run method and passes either a list or a Pandas DataFrame as the argument. Each entry in mini_batch is a file path if the input is a FileDataset, or a Pandas DataFrame if the input is a TabularDataset.

    - __run__ method response: run() should return a Pandas DataFrame or an array. For the append_row output_action, the returned elements are appended to the common output file. For summary_only, the contents of the elements are ignored. For all output actions, each returned element indicates one successful inference for an input element in the mini-batch.
    Make sure the inference result includes enough data to map each output back to its input. Inference output is written to the output file in no guaranteed order, so include a key in the output that identifies the input.
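
The init/run contract described above can be sketched as follows. This is a minimal illustration and not the repository's `forecast.py`: it assumes each input file is a CSV with a `Quantity` column (an assumed column name), and it uses a placeholder forecast that repeats the last observed value instead of loading a registered ARIMA model. The optional `argv` parameter on `init` is added here only so the sketch can be exercised outside a pipeline.

```python
import argparse
import csv
import os

forecast_horizon = None  # populated once per worker process by init()

def init(argv=None):
    """Run once per worker process: read step arguments into globals.

    A real script would typically also prepare model access here.
    argv=None makes argparse read sys.argv, as it would inside the step.
    """
    global forecast_horizon
    parser = argparse.ArgumentParser()
    parser.add_argument('--forecast_horizon', type=int, default=8)
    args, _unknown = parser.parse_known_args(argv)
    forecast_horizon = args.forecast_horizon

def run(mini_batch):
    """Process one mini-batch of file paths and return one row per file."""
    results = []
    for file_path in mini_batch:
        # Use the file name as the key that maps each output row back to its input.
        series_key = os.path.splitext(os.path.basename(file_path))[0]
        with open(file_path, newline='') as f:
            rows = list(csv.DictReader(f))
        # Placeholder forecast (NOT ARIMA): repeat the last observed value.
        last_value = float(rows[-1]['Quantity'])  # 'Quantity' is an assumed column name
        forecast = [last_value] * forecast_horizon
        results.append(f"{series_key},{forecast}")
    return results
```

Returning one element per input file keeps the count of returned elements equal to the count of successfully processed inputs, and the leading key makes each row traceable even though the rows are appended in no particular order.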
    

#### Dependencies
Helper scripts or Python/Conda packages are required to run the entry script or model.

The deployment configuration for the compute target that hosts the deployed model. This configuration describes things like memory and CPU requirements needed to run the model.

These items are encapsulated into an inference configuration and a deployment configuration. The inference configuration references the entry script and other dependencies. You define these configurations programmatically when you use the SDK to perform the deployment. You define them in JSON files when you use the CLI.

## Build and Run the batch inference pipeline
Now that the data, models, and compute resources are set up, we can put together a pipeline for forecasting. 
### Set up the environment to run the script
Specify the conda dependencies for your script. This will allow us to install packages and configure the environment. 

In [None]:
from azureml.core import Environment
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies

# set up the batch environment settings
# 'scikit-learn' is the PyPI package name; the 'sklearn' alias is deprecated
batch_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn','pmdarima'])

batch_env = Environment(name="manymodels_environment")
batch_env.python.conda_dependencies = batch_conda_deps
batch_env.docker.enabled = True
batch_env.docker.base_image = DEFAULT_CPU_IMAGE

### Create the configuration to wrap the inference script 
In the [ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py), we will call the entry script, environment configuration, and parameters. You will want to determine the number of workers and nodes appropriate for your use case.
- The **workercount** is based on the number of cores on the VM. The compute cluster we set up has 8 cores, so we set our worker count to 8.  
- The **nodecount** determines the number of nodes to use. Increasing the node count should help to speed up the process. We started with 3, then increased the number in order for the job to complete in under an hour. 
- Set the **timeout** to be slightly longer than the amount of time it takes for one iteration of your script to complete. In this example, that is the amount of time to pull down a model and make predictions. The default time is 60 seconds. 
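
As a rough way to reason about these settings, the sketch below computes how many mini-batches can run at once and how many "waves" of work that implies. It assumes one mini-batch per model (11,973 in total) and that every worker process stays busy; the helper name `parallel_capacity` is hypothetical and the result is an idealized estimate, not a measurement.

```python
import math

def parallel_capacity(total_mini_batches, node_count, process_count_per_node):
    """Return (concurrent worker processes, number of waves needed)."""
    concurrent = node_count * process_count_per_node
    waves = math.ceil(total_mini_batches / concurrent)
    return concurrent, waves

# 3 nodes (our starting point) versus 5 nodes, each with 8 worker processes.
print(parallel_capacity(11973, node_count=3, process_count_per_node=8))  # (24, 499)
print(parallel_capacity(11973, node_count=5, process_count_per_node=8))  # (40, 300)
```

Multiplying the number of waves by the typical time for one invocation gives a first estimate of the step's duration before overhead such as node startup.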

**include info here about runs we tried**

We added tags for additional information about our settings for the step. 

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig 

workercount = 8
nodecount = 5
timeout = 500

tags1 = {}
tags1['nodes'] = nodecount
tags1['workers-per-node'] = workercount
tags1['timeout'] = timeout 

parallel_run_config = ParallelRunConfig(
    source_directory = './scripts',
    entry_script = 'forecast.py',
    mini_batch_size = '1',
    run_invocation_timeout = timeout, 
    error_threshold = 10,
    output_action = 'append_row', 
    environment = batch_env, 
    process_count_per_node = workercount, 
    compute_target = compute, 
    node_count = nodecount
)

### Create the ParallelRunStep
 This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) is the main step in our pipeline.  

We can specify the following parameters: 
- **inputs** : We will provide the data that will be used in the entry_script. 
- **output** : This is the directory where the output of the step will be written. 
- **models** : This provides additional metadata about the models used in the step. 
- **arguments** : You can specify arguments that you want to pass to the entry_script with this argument.

*arguments* and *inputs* are the two parameters that can pass information to the entry_script.

We also need to specify an output directory. This is where output from the step will be stored. 

#### For the orange juice sales forecasting, we have two arguments passed to the entry_script. 
- **forecast_horizon** is how far into the future the forecast should extend.
- **starting_date** is the date to begin forecasting. 

In [None]:
from azureml.pipeline.core import PipelineData

datasetname = 'store'
output_dir = PipelineData(name = 'forecasting_output', 
                         datastore = dstore)

parallelrun_step = ParallelRunStep(
    name="many-models-forecasting",
    parallel_run_config=parallel_run_config,
    inputs=[FileDstAllModelsInputs],  
    output=output_dir,
    models= [], # this is just for logging
    arguments=['--forecast_horizon', 8,
              '--starting_date', '1992-10-01',
              '--output_datastore', 'forecasting_output_datastore',
              '--overwrite_forecasting', True],
    allow_reuse = False
    )
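
On the receiving side, the entry script sees these arguments as command-line strings, so values such as `8` and `True` need to be converted back. The sketch below shows one way this could be done with `argparse`. It is an assumption about how an entry script might parse them, not the contents of the repository's `forecast.py`; the helpers `str2bool`, `parse_step_arguments`, and `forecast_dates` are hypothetical, and the weekly step between forecast dates is also an assumption.

```python
import argparse
from datetime import date, timedelta

def str2bool(value):
    """Convert 'True'/'False' text to a bool (plain bool('False') would be True)."""
    return str(value).strip().lower() in ('true', '1', 'yes')

def parse_step_arguments(argv):
    """Parse the arguments passed by the step; unknown arguments are ignored."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--forecast_horizon', type=int)
    parser.add_argument('--starting_date', type=date.fromisoformat)
    parser.add_argument('--output_datastore', type=str)
    parser.add_argument('--overwrite_forecasting', type=str2bool)
    args, _unknown = parser.parse_known_args(argv)
    return args

def forecast_dates(start, horizon, step_days=7):
    """List the dates covered by the forecast, assuming weekly data."""
    return [start + timedelta(days=step_days * i) for i in range(horizon)]

args = parse_step_arguments(['--forecast_horizon', '8',
                             '--starting_date', '1992-10-01',
                             '--output_datastore', 'forecasting_output_datastore',
                             '--overwrite_forecasting', 'True'])
dates = forecast_dates(args.starting_date, args.forecast_horizon)
print(dates[0], dates[-1])  # 1992-10-01 1992-11-19
```

Using `parse_known_args` rather than `parse_args` keeps the script tolerant of any extra arguments appended to the command line.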

### Set up RunConfiguration for PythonScriptStep

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.runconfig import CondaDependencies

conda_run_config = RunConfiguration(framework="python")
conda_run_config.target = compute
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
cd = CondaDependencies.create(pip_packages=['azureml-pipeline-core'], conda_packages=['pandas'])
conda_run_config.environment.python.conda_dependencies = cd

### Set up PythonScriptStep

In [None]:
from azureml.pipeline.steps import PythonScriptStep

log_python_script_step = PythonScriptStep(name="logging",
                        script_name="log.py",
                        compute_target=compute,
                        source_directory='./scripts',
                        runconfig=conda_run_config,
                        arguments=['--ParallelRunStep_name','many-models-forecasting', '--pipeline_output_name', 'forecasting_output', '--datastore', 'forecasting_output_datastore', '--experiment', 'automl-ojforecasting', '--overwrite_logs', True],
                        allow_reuse=False)

### Set up the step sequence

In [None]:
from azureml.pipeline.core import StepSequence

all_steps = StepSequence(steps=[parallelrun_step, log_python_script_step])

### Submit and Run the Pipeline
Create an Experiment to track the runs of the pipeline. Then, you can run the pipeline and review the output. With the current settings and the Standard_D13_V2 VM the pipeline takes approximately 1h 5m to run forecasts for all 11,973 models.  

In [None]:
# set up the experiment
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

experiment = Experiment(ws, 'automl-ojforecasting')

pipeline = Pipeline(workspace = ws, steps=[all_steps])

run = experiment.submit(pipeline, tags=tags1)

In [None]:
run.wait_for_completion(show_output=True)

Successfully forecasted and logged 11,973 models.
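
For a sense of scale, the reported run time can be turned into per-model figures. This is back-of-the-envelope arithmetic only: it assumes the 1h 5m wall-clock time applies to 5 nodes with 8 worker processes each, and it folds all overhead (node startup, environment setup, the logging step) into the average. The helper name `implied_timings` is hypothetical.

```python
def implied_timings(total_models, wall_clock_seconds, node_count, process_count_per_node):
    """Return (wall-clock seconds per model, worker-seconds per model)."""
    workers = node_count * process_count_per_node
    wall_per_model = wall_clock_seconds / total_models
    worker_seconds_per_model = wall_clock_seconds * workers / total_models
    return wall_per_model, worker_seconds_per_model

wall, per_worker = implied_timings(11973, 65 * 60, node_count=5, process_count_per_node=8)
print(round(wall, 2), round(per_worker, 1))  # 0.33 13.0
```

Under these assumptions, each worker process spends roughly 13 seconds per model on average, overhead included.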

## Publish the pipeline

After a successful run, we publish the pipeline to the Workspace.

In [None]:
published_pipeline = pipeline.publish(name = 'forecast-all-models',
                                     description = 'forecast 11,973 models and log the run',
                                     version = '1',
                                     continue_on_step_failure = True
                                     )

## Schedule the pipeline to run weekly

A published pipeline represents a Pipeline to be submitted without the Python code which constructed it.

In addition, a [PublishedPipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.publishedpipeline?view=azure-ml-py) can be used to resubmit a Pipeline with different PipelineParameter values and inputs. The following code block will retrieve all the published pipelines in the Workspace.

In [None]:
from azureml.pipeline.core import PublishedPipeline

published_pipelines = PublishedPipeline.list(ws)
# use a separate loop variable so the published_pipeline object created above is not overwritten
for pipeline_item in published_pipelines:
    print(f"{pipeline_item.name},'{pipeline_item.id}'")
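
Rather than copying an ID from the printed list by hand, the ID could be looked up by name. The sketch below shows the idea with a hypothetical helper, `find_pipeline_id`, that works on any objects exposing `name` and `id` attributes. The stand-in objects and their IDs are made up so the example runs without a Workspace.

```python
from types import SimpleNamespace

def find_pipeline_id(pipelines, name):
    """Return the id of the single pipeline with the given name."""
    matches = [p.id for p in pipelines if p.name == name]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one pipeline named {name!r}, found {len(matches)}")
    return matches[0]

# Stand-in objects with made-up IDs, in place of PublishedPipeline.list(ws).
fake_pipelines = [
    SimpleNamespace(name='forecast-all-models', id='id-forecast'),
    SimpleNamespace(name='some-other-pipeline', id='id-other'),
]
print(find_pipeline_id(fake_pipelines, 'forecast-all-models'))  # id-forecast
```

Raising an error when the name is missing or duplicated avoids silently scheduling the wrong pipeline, which matters if the same name has been published more than once.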

After we identify the pipeline_id of the just-published pipeline, we can schedule it to run at a desired cadence. In this case, we schedule the Pipeline to run once a week, starting Jan 1, 2020 at 1PM UTC.

In [None]:
from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
forecast_pipeline_id = 'c0fa9dce-441d-450b-93a6-3d2afccc3036'

recurrence = ScheduleRecurrence(frequency="Week", interval=1, start_time="2020-01-01T13:00:00")
recurring_schedule = Schedule.create(ws, name="Forecasting-Pipeline-Recurring-Schedule", 
                            description="Schedule forecasting Pipeline to run weekly starting Jan 1, 2020 at 1PM",
                            pipeline_id=forecast_pipeline_id, 
                            experiment_name=experiment.name, 
                            recurrence=recurrence)
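
To make the cadence concrete, the sketch below lists the nominal run times implied by a recurrence with frequency="Week", interval=1, and the start time used above. It is plain date arithmetic with a hypothetical helper, `weekly_run_times`, and does not reproduce the scheduler's own logic; the actual first trigger may depend on when the schedule is created.

```python
from datetime import datetime, timedelta

def weekly_run_times(start_time, interval_weeks, count):
    """Return the first `count` nominal run times of a weekly recurrence."""
    start = datetime.fromisoformat(start_time)
    return [start + timedelta(weeks=interval_weeks * i) for i in range(count)]

for run_time in weekly_run_times("2020-01-01T13:00:00", interval_weeks=1, count=4):
    print(run_time.isoformat())
# 2020-01-01T13:00:00
# 2020-01-08T13:00:00
# 2020-01-15T13:00:00
# 2020-01-22T13:00:00
```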