Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 03 Scoring Pipeline

In this notebook we create a pipeline for scoring the 11,973 ARIMA models that we built in the Training Pipeline of this repository. For the first run, we recommend using the subset of 10 models, which is the default setup for this notebook. We use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize scoring across the models. For more information about the data and models, refer to the Data Preparation and Training notebooks.

The pipeline setup is similar to the Training Pipeline in this repository. For more details on the steps and functions, refer to the Training folder.

# Prerequisites 

This example runs on an Azure Machine Learning Notebook VM. We call models that have **already been trained and registered** to the Workspace, along with a FileDataset that has also been registered to the Workspace. If you have already run the Environment Setup, Data Preparation, and Training Pipeline notebooks, or you have an AML Notebook VM set up with models and data registered to the Workspace, you are all set.

## 1.0 Call the Workspace, Datastore, and Compute

As we did in the Training Pipeline notebook, we need to call the Workspace. We also want to create variables for the datastore and compute cluster. 

### Connect to the workspace

In [None]:
from azureml.core import Workspace 

ws= Workspace.from_config(path='../aml_config/.azureml/ws_config.json')

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')
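
If `Workspace.from_config` fails, a common cause is an incomplete config file. The sketch below is a small pre-flight check, not part of the Azure ML SDK: the helper names `missing_config_keys` and `load_workspace_config` are hypothetical, and the check only assumes the three fields that a workspace config file normally carries (`subscription_id`, `resource_group`, `workspace_name`).

```python
import json
from pathlib import Path

# Fields a workspace config file is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config):
    """Return the required keys that are absent or empty in a config dict."""
    return [key for key in REQUIRED_KEYS if not config.get(key)]


def load_workspace_config(config_path):
    """Read the JSON config file and return it as a dict (hypothetical helper)."""
    return json.loads(Path(config_path).read_text())
```

For example, `missing_config_keys(load_workspace_config('../aml_config/.azureml/ws_config.json'))` should return an empty list before the workspace call is attempted.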

### Attach existing compute resource

In [None]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, 'train-many-model')

### Set up an Experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'scoring-pipeline')

### Call the datastore

In [None]:
from azureml.core import Datastore

dstore = ws.get_default_datastore()
score_output_dstore = Datastore(ws,'scoring_output_datastore' )

## 2.0 Call Registered FileDataset
In the Data Preparation notebook, we registered the orange juice data to the Workspace. You can choose to run the pipeline on the subset of 10 series or the full dataset of 11,973 series. We recommend starting with 10 series and then expanding.

In [None]:
from azureml.core.dataset import Dataset

filedst_10_models = Dataset.get_by_name(ws, name='oj_data_small')
filedst_10_models_input = filedst_10_models.as_named_input('forecast_10_models')

filedst_all_models = Dataset.get_by_name(ws, name='oj_data')
filedst_all_models_input = filedst_all_models.as_named_input('forecast_all_models')

## 3.0 Build the scoring pipeline
Now that the data, models, and compute resources are set up, we can put together a pipeline for scoring. 
### Set up the environment to run the script
Specify the conda dependencies for your script. This will allow us to install packages and configure the environment. 

In [None]:
from azureml.core import Environment
from azureml.core.runconfig import DEFAULT_CPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies

score_env = Environment(name="many_models_environment")
# 'scikit-learn' is the PyPI package name; 'sklearn' is a deprecated alias
score_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn','pmdarima'])
score_env.python.conda_dependencies = score_conda_deps
score_env.docker.enabled = True
score_env.docker.base_image = DEFAULT_CPU_IMAGE

### Create the configuration to wrap the entry script 
In the [ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunconfig?view=azure-ml-py), we specify the entry script, the environment, and the parallelization parameters. You will want to determine the number of workers and nodes appropriate for your use case. We use the same settings determined in the Training notebook. Refer to the Additional_Docs folder to create a custom entry_script.

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig 

workercount = 8
nodecount = 5
timeout = 500

tags1 = {}
tags1['nodes'] = nodecount
tags1['workers-per-node'] = workercount
tags1['timeout'] = timeout 

parallel_run_config = ParallelRunConfig(
    source_directory = './scripts',
    entry_script = 'score.py',
    mini_batch_size = '1',
    run_invocation_timeout = timeout, 
    error_threshold = 10,
    output_action = 'append_row', 
    environment = score_env, 
    process_count_per_node = workercount, 
    compute_target = compute, 
    node_count = nodecount
)
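
To reason about the worker and node settings, a back-of-the-envelope calculation can help. The sketch below is illustrative arithmetic only: the function names are hypothetical, it assumes one series file per mini-batch (as with `mini_batch_size = '1'` on a FileDataset), and it ignores cluster start-up and scheduling overhead. The average seconds per mini-batch is a value you would have to measure yourself.

```python
import math


def parallel_capacity(node_count, process_count_per_node):
    """Number of mini-batches that can be processed at the same time."""
    return node_count * process_count_per_node


def scoring_waves(n_mini_batches, node_count, process_count_per_node):
    """Number of sequential rounds needed if every worker stays busy."""
    capacity = parallel_capacity(node_count, process_count_per_node)
    return math.ceil(n_mini_batches / capacity)


def estimated_minutes(n_mini_batches, node_count, process_count_per_node,
                      avg_seconds_per_batch):
    """Rough run time in minutes, given a measured average per mini-batch."""
    waves = scoring_waves(n_mini_batches, node_count, process_count_per_node)
    return waves * avg_seconds_per_batch / 60


# Settings from this notebook: 5 nodes with 8 workers each.
print(parallel_capacity(5, 8))     # 40 concurrent workers
print(scoring_waves(11973, 5, 8))  # 300 rounds for the full dataset
print(scoring_waves(10, 5, 8))     # 1 round for the 10-series subset
```

With the 10-series subset, most of the 40 workers sit idle, so a smaller cluster may be sufficient for a first run.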

### Create the ParallelRunStep
The [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) is the main step in our pipeline. We specify the following parameters: **inputs**, **output**, and **arguments**. We also set the output directory.

For the orange juice sales example, we pass four **arguments** to the entry_script.
- **n_test_periods** is the length of the test set, which is also the number of predictions you would like to make.
- **timestamp_column** is the name of the date column in the data.
- **output_datastore** is the name of the Blob datastore that the output is written to.
- **overwrite_scoring** controls whether existing forecasts are overwritten.

*arguments* and *inputs* are the two parameters that can pass information to the entry_script.

You can change between running the pipeline on a subset of models or the full data set by changing the inputs parameter. 

In [None]:
from azureml.pipeline.core import PipelineData

datasetname = 'store'
output_dir = PipelineData(name = 'scoringOutput', 
                         datastore = dstore)

parallelrun_step = ParallelRunStep(
    name="many-models-scoring",
    parallel_run_config=parallel_run_config,
    inputs=[filedst_10_models_input], 
    #inputs=[filedst_all_models_input],
    output=output_dir,
    models= [], 
    arguments=['--n_test_periods', 6,
              '--timestamp_column', 'WeekStarting',
              '--output_datastore', score_output_dstore.name,
              '--overwrite_scoring', True])
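
The values in `arguments` reach the entry script as command-line strings, so `6` arrives as `'6'` and `True` as `'True'`. The sketch below shows one way an entry script could read them. It is not the repository's `score.py`: the helper names `str_to_bool` and `parse_scoring_args` are hypothetical, the `argv` parameter on `init` is added only to make the sketch testable, and the body of `run` is a placeholder. It follows the `init()` / `run(mini_batch)` layout that ParallelRunStep entry scripts use, where `mini_batch` is a list of file paths for a FileDataset input.

```python
import argparse


def str_to_bool(value):
    """Convert command-line text such as 'True' or 'false' to a bool.

    argparse's type=bool would treat any non-empty string, including
    'False', as True, so an explicit converter is used instead.
    """
    text = str(value).strip().lower()
    if text in ("true", "1", "yes"):
        return True
    if text in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean, got {value!r}")


def parse_scoring_args(argv=None):
    """Parse the four arguments passed by the ParallelRunStep above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--n_test_periods", type=int, default=6)
    parser.add_argument("--timestamp_column", type=str, default="WeekStarting")
    parser.add_argument("--output_datastore", type=str, default=None)
    parser.add_argument("--overwrite_scoring", type=str_to_bool, default=False)
    # parse_known_args tolerates any extra arguments on the command line.
    args, _ = parser.parse_known_args(argv)
    return args


def init(argv=None):
    """Called once per worker process; stores the parsed arguments."""
    global ARGS
    ARGS = parse_scoring_args(argv)


def run(mini_batch):
    """Called once per mini-batch; returns one result row per input file."""
    results = []
    for file_path in mini_batch:
        # Placeholder: a real script would load the model and forecast here.
        results.append(f"{file_path}: scored {ARGS.n_test_periods} periods")
    return results
```

With `output_action = 'append_row'`, each item returned by `run` becomes one row in the step's aggregated output.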

### Create Output Step
We create a [PythonScriptStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep) as a second step in the pipeline to retrieve the output from the ParallelRunStep and upload the results to a specified path in Blob storage.

### Set up RunConfiguration

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies

log_run_config = RunConfiguration(framework="python")
log_run_config.target = compute
log_run_config.environment.docker.enabled = True
log_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
log_cd = CondaDependencies.create(pip_packages=['azureml-pipeline-core'], conda_packages=['pandas'])
log_run_config.environment.python.conda_dependencies = log_cd

### Create PythonScriptStep 

In [None]:
from azureml.pipeline.steps import PythonScriptStep

log_python_step = PythonScriptStep(name="logging",
                        script_name="log.py",
                        compute_target=compute,
                        source_directory='./scripts',
                        runconfig=log_run_config,
                        arguments=['--parallelrunstep_name',parallelrun_step.name, 
                                   '--pipeline_output_name', output_dir.name, 
                                   '--datastore', score_output_dstore.name, 
                                   '--experiment', experiment.name, 
                                   '--overwrite_predictions', True])

### Set up step sequence  
We set up a step sequence to ensure the ParallelRunStep executes before the PythonScriptStep. The ParallelRunStep must fully complete before the PythonScriptStep can collect the logs from each run and write them to the specified directory.

In [None]:
from azureml.pipeline.core import StepSequence

scoring_steps = StepSequence(steps=[parallelrun_step, log_python_step])

## 4.0 Run the scoring pipeline
We can use the Experiment we created to track the runs of the pipeline and review the output. With the current settings and the Standard_D13_V2 VM the pipeline takes approximately 1h 8m to run all 11,973 models.  

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace = ws, steps=scoring_steps)
run = experiment.submit(pipeline, tags=tags1)

In [None]:
run.wait_for_completion(show_output=True)

## 5.0 Publish the pipeline
After successfully setting up the pipeline, we publish it to the Workspace. [Published pipelines](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.publishedpipeline?view=azure-ml-py) expose an endpoint that runs the pipeline without opening the code used to create it. The endpoint can also be used to resubmit the pipeline with different parameter inputs.

In [None]:
published_pipeline = pipeline.publish(name = 'score_many_models',
                                     description = 'score many models and log the run',
                                     version = '1',
                                     continue_on_step_failure = False)
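
A published pipeline can be triggered with an authenticated HTTP POST to its endpoint. The sketch below only builds such a request with the standard library and does not send it. The helper name `build_pipeline_request` is hypothetical, and the `ExperimentName` and `ParameterAssignments` body fields are assumed from the Azure ML documentation for published pipeline endpoints. Note that the pipeline in this notebook defines no pipeline parameters, so `ParameterAssignments` would only apply after adding some.

```python
import json
import urllib.request


def build_pipeline_request(endpoint_url, auth_header, experiment_name,
                           parameters=None):
    """Build (but do not send) a POST request that would trigger a run."""
    body = {"ExperimentName": experiment_name}
    if parameters:
        # Only meaningful if the pipeline was built with pipeline parameters.
        body["ParameterAssignments"] = parameters
    headers = dict(auth_header)
    headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

In the notebook, the URL would come from `published_pipeline.endpoint` and the header from an authentication object, for example `InteractiveLoginAuthentication().get_authentication_header()`. The request could then be sent with `urllib.request.urlopen`.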

## 6.0 Schedule the pipeline monthly
We can use the pipeline id to schedule the pipeline to run at a specified interval. We schedule the scoring pipeline to run monthly.

In [None]:
from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
score_pipeline_id = published_pipeline.id

recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T11:00:00")
recurring_schedule = Schedule.create(ws, name="Scoring-Pipeline-Recurring-Schedule", 
                            description="Schedule scoring Pipeline to run on the first day of every month starting Jan 1, 2020 at 11AM",
                            pipeline_id=score_pipeline_id, 
                            experiment_name=experiment.name, 
                            recurrence=recurrence)
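
To sanity-check what a monthly recurrence means for a given start time, the expected run dates can be previewed locally. The sketch below is plain calendar arithmetic, not how the Azure ML service computes its triggers: `monthly_run_times` is a hypothetical helper, and clamping to the last day of shorter months is an assumption made here for illustration.

```python
import calendar
from datetime import datetime


def monthly_run_times(start_time, count, interval=1):
    """Return the first `count` run times of a monthly recurrence."""
    start = datetime.fromisoformat(start_time)
    runs = []
    for i in range(count):
        # Count months from the start, rolling over into later years.
        months = start.month - 1 + i * interval
        year = start.year + months // 12
        month = months % 12 + 1
        # Assumption: clamp the day for months shorter than the start month.
        day = min(start.day, calendar.monthrange(year, month)[1])
        runs.append(start.replace(year=year, month=month, day=day))
    return runs


for run_time in monthly_run_times("2020-01-01T11:00:00", 3):
    print(run_time.isoformat())
```

For the schedule above, this lists the first of January, February, and March 2020 at 11:00.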

## 7.0 Pipeline Outputs
The scoring pipeline returns one file with predictions for each store along with accuracy metrics. The results are output to the scoring_output Blob container.