Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# 02 Training Pipeline

This notebook demonstrates how to create a pipeline that trains and registers 11,973 ARIMA models. We use the [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) to parallelize the training of these models. For this solution accelerator we use an orange juice sales dataset to predict the orange juice quantity for each brand and each store. For more information about the data, refer to the Data Preparation notebook.

## Prerequisites

This example runs on an Azure Machine Learning Notebook VM. If you have already run the Environment Setup and Data Preparation notebooks you are all set.

## 1.0 Set up workspace, datastore, experiment

In [None]:
from azureml.core import Workspace, Datastore

# set up workspace
# If running on an Azure Machine Learning studio notebook VM, use ws = Workspace.from_config() instead
ws = Workspace.from_config(path='../aml_config/ws_config.json')
# ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

# set up datastores
dstore = ws.get_default_datastore()
train_output_dstore = Datastore(ws, 'training_output_datastore')

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, 
      'Training datastore name: '+ train_output_dstore.name,
      'Default datastore name: '+ dstore.name,
      sep = '\n')

### Choose an experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'automl-ojforecasting')

print('Experiment name: ' + experiment.name)

## 2.0 Call the registered filedataset

We use 11,973 datasets and ParallelRunStep to build 11,973 time-series ARIMA models that predict the quantity sold for each brand at each store.

Each dataset holds 2 years of orange juice sales data for one brand and contains 7 columns and 122 rows. 

You will need to register the datasets in the Workspace first. The Data Preparation notebook demonstrates how to register two datasets to the workspace. 

The registered 'oj_data_small' file dataset contains the first 10 csv files and 'oj_data' contains all 11,973 csv files. You can pass either filedst_10_models_input or filedst_all_models_inputs to the ParallelRunStep.

We recommend that you **start with filedst_10_models_input** and make sure everything runs successfully, then scale up to filedst_all_models_inputs.

In [None]:
from azureml.core.dataset import Dataset

filedst_10_models = Dataset.get_by_name(ws, name='oj_data_small')
filedst_10_models_input = filedst_10_models.as_named_input('train_10_models')

In [None]:
filedst_all_models = Dataset.get_by_name(ws, name='oj_data')
filedst_all_models_inputs = filedst_all_models.as_named_input('train_all_models')

## 3.0 Build the training pipeline
Now that the dataset, workspace, and datastore are set up, we can put together a pipeline for training. 

### Set up environment for ParallelRunStep

An [Environment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py) defines the collection of resources that we need to run our pipelines. We configure a reproducible Python environment for our training script, which uses two Python libraries: scikit-learn and pmdarima. 

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

# 'scikit-learn' is the pip package name for the sklearn library
batch_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn', 'pmdarima', 'azureml-pipeline-core'])
batch_env = Environment(name="manymodels_environment")
batch_env.python.conda_dependencies = batch_conda_deps
batch_env.docker.enabled = True
batch_env.docker.base_image = DEFAULT_CPU_IMAGE

### Choose a compute target

Currently ParallelRunConfig only supports AmlCompute. You can switch to a different compute cluster if one fails.

This is the compute target we will pass into our ParallelRunConfig.

In [None]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, "train-many-model")

### Set up ParallelRunConfig

[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_config.parallelrunconfig) is the configuration for the ParallelRunStep. You will need to determine the number of workers and nodes appropriate for your use case. The worker count (workercount) is based on the number of cores of the compute VM. The node count (nodecount) determines the number of compute nodes to use. In the time-series ARIMA scenario, increasing the node count speeds up the training process.


* <b>node_count</b>: The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing node_count if training takes too long.

* <b>process_count_per_node</b>: The worker count, that is, the number of processes per node. We are using an 8-core compute cluster, so we set it to 8.

* <b>compute_target</b>: Only AmlCompute is supported. You can change to a different compute cluster if one fails.

* <b>run_invocation_timeout</b>: The timeout, in seconds, for each invocation of the run() method. Set it higher than the maximum training time of a single model; the default is 60. Since the slowest model takes about 120 seconds to train, we set it to 500, which leaves ample headroom.
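To get a feel for how node_count and process_count_per_node interact, the following is a minimal sketch in plain Python (it does not call Azure ML). It assumes each worker process trains one model at a time, and the helper names `plan_parallel_run` and `timeout_has_headroom` are hypothetical, introduced only for illustration.

```python
import math


def plan_parallel_run(n_models, node_count, process_count_per_node):
    """Return (total worker processes, rounds of training needed).

    Assumes every worker process handles one model at a time, so the
    number of rounds is the model count divided by the worker count,
    rounded up.
    """
    total_workers = node_count * process_count_per_node
    rounds = math.ceil(n_models / total_workers)
    return total_workers, rounds


def timeout_has_headroom(run_invocation_timeout, longest_training_seconds):
    """True if the timeout exceeds the slowest single-model training time."""
    return run_invocation_timeout > longest_training_seconds


# Values taken from this notebook: 11,973 models, 5 nodes, 8 processes per node.
workers, rounds = plan_parallel_run(n_models=11973, node_count=5,
                                    process_count_per_node=8)
print("worker processes:", workers)
print("rounds of training:", rounds)
print("timeout ok:", timeout_has_headroom(500, 120))
```

Doubling node_count roughly halves the number of rounds, which is why adding nodes shortens the overall training time in this scenario.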

We also added tags to preserve the information about our training cluster's node count, process count per node and dataset name. You can find the 'Tags' column in Azure Machine Learning Studio.

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunConfig

workercount=8
nodecount=5
timeout=500

datasetname='all_stores_filedatasets'

tags1={}
tags1['DatasetName']=datasetname
tags1['Nodes']=nodecount
tags1['WorkersPerNode']=workercount
tags1['Timeout']=timeout

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script='train.py',
    mini_batch_size="1",
    run_invocation_timeout=timeout,
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    process_count_per_node=workercount,
    compute_target=compute,
    node_count=nodecount)
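The entry script `train.py` is not shown in this notebook. As a rough sketch of the shape a ParallelRunStep entry script takes, it defines an `init()` function that runs once per worker process and a `run(mini_batch)` function that is called for each mini-batch. With a file dataset and `mini_batch_size="1"`, each call receives a list containing one file path. The body below is a placeholder, not the accelerator's actual training logic, and the file name used in the example call is hypothetical.

```python
import os


def init():
    """Called once per worker process, before any mini-batch is processed."""
    print("worker initialised")


def run(mini_batch):
    """Called once per mini-batch.

    For a file dataset, mini_batch is a list of file paths. Each returned
    element becomes one row of the combined output when
    output_action="append_row" is used.
    """
    results = []
    for file_path in mini_batch:
        # Derive a name from the csv file name (placeholder logic).
        model_name = os.path.splitext(os.path.basename(file_path))[0]
        # A real script would read the csv, fit an ARIMA model and
        # register it here; this sketch only records the name.
        results.append(model_name)
    return results


# Local smoke test with a hypothetical file path.
init()
rows = run(["/tmp/example_store_brand.csv"])
print(rows)
```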

### Set up ParallelRunStep

This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) is the main step in our pipeline. First, we set up the output directory and define the Pipeline's output name. The datastore that stores the pipeline's output data is Workspace's default datastore.

In [None]:
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name="all_ARIMA_models", 
                          datastore=dstore)

We created 4 input arguments that the user can adjust for the forecasting use case.

* <b>target_column</b>: The name of the column you'd like to predict.
* <b>n_test_periods</b>: The number of periods you'd like to hold out for testing/scoring.
* <b>timestamp_column</b>: We set the timestamp column to be the index column for the ARIMA models to train on. 
* <b>stepwise_training</b>: Stepwise training can be set to 'True' or 'False'. 'False' conducts a full grid search on each model during training and therefore takes longer to complete. 'True' performs stepwise training: the grid search stops as soon as one of the thresholds is hit, and the best-fit model at that time is returned. 'True' speeds up the training process dramatically.
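The four arguments above reach the entry script as command-line arguments. The following is a minimal sketch of how a script might parse them with `argparse`; the actual parsing in `train.py` is not shown in this notebook, and the `str2bool` helper is an assumption added here because a boolean passed on the command line arrives as the string `'True'` or `'False'`.

```python
import argparse


def str2bool(value):
    """Convert a command-line string such as 'True' or 'false' to a bool."""
    return str(value).strip().lower() in ("true", "1", "yes")


def parse_training_args(argv):
    """Parse the four forecasting arguments described above."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--target_column", type=str, required=True)
    parser.add_argument("--n_test_periods", type=int, required=True)
    parser.add_argument("--timestamp_column", type=str, required=True)
    parser.add_argument("--stepwise_training", type=str2bool, default=True)
    # parse_known_args tolerates extra arguments the host may append.
    args, _unknown = parser.parse_known_args(argv)
    return args


args = parse_training_args([
    "--target_column", "Quantity",
    "--n_test_periods", "6",
    "--timestamp_column", "WeekStarting",
    "--stepwise_training", "True",
])
print(args.target_column, args.n_test_periods,
      args.timestamp_column, args.stepwise_training)
```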



We specify the following parameters:

* <b>name</b>: We set a name for our ParallelRunStep.

* <b>parallel_run_config</b>: We then pass the previously defined ParallelRunConfig.

* <b>inputs</b>: We use the registered FileDataset that we retrieved earlier in the notebook. _inputs_ points to a registered file dataset in AML studio that points to a path in the blob container. The number of files in that path determines the number of models that will be trained in the ParallelRunStep. 

* <b>output</b>: The PipelineData object we just defined, which corresponds to the output directory.

* <b>models</b>: Zero or more model names already registered in the Azure Machine Learning model registry.


In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="many_models_training",
    parallel_run_config=parallel_run_config,
    inputs=[filedst_10_models_input],
#    inputs=[filedst_all_models_inputs],  # switch to this input to train all models
    output=output_dir,
    models=[],
    arguments=['--target_column','Quantity', 
               '--n_test_periods',6, 
               '--timestamp_column','WeekStarting', 
               '--stepwise_training',True]
    )
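To illustrate what `n_test_periods` means in practice, here is a small sketch that holds out the last periods of a series for testing. It assumes a chronological split in which the final `n_test_periods` rows form the test set; the actual split in `train.py` is not shown here, and `split_train_test` is a hypothetical helper.

```python
def split_train_test(rows, n_test_periods):
    """Split time-ordered rows into (train, test), holding out the last rows."""
    if not 0 < n_test_periods < len(rows):
        raise ValueError("n_test_periods must be between 1 and len(rows) - 1")
    return rows[:-n_test_periods], rows[-n_test_periods:]


# Each dataset in this notebook has 122 rows; we stand in for them with indices.
weekly_rows = list(range(122))
train_rows, test_rows = split_train_test(weekly_rows, n_test_periods=6)
print(len(train_rows), len(test_rows))
```

With 122 rows per dataset and `n_test_periods` set to 6, each model would be fitted on 116 rows under this assumption.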

### Create the PythonScriptStep

We then set up a PythonScriptStep to retrieve the output of the training pipeline (in this case, a file with trained models' logging information) and upload it to a dedicated blob path.

### Set up RunConfiguration for PythonScriptStep

A run configuration describes how experiment runs are executed on different compute targets in Azure Machine Learning. The RunConfiguration object encapsulates the information necessary to submit a training run in an experiment. Here we specify the azureml-pipeline-core and pandas packages as dependencies.

In [None]:
from azureml.core.runconfig import RunConfiguration, DEFAULT_CPU_IMAGE
from azureml.core.conda_dependencies import CondaDependencies

conda_run_config = RunConfiguration(framework="python")
conda_run_config.target = compute
conda_run_config.environment.docker.enabled = True
conda_run_config.environment.docker.base_image = DEFAULT_CPU_IMAGE
cd = CondaDependencies.create(pip_packages=['azureml-pipeline-core'], conda_packages=['pandas'])
conda_run_config.environment.python.conda_dependencies = cd

### Set up PythonScriptStep

We then set up a [PythonScriptStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.python_script_step.pythonscriptstep?view=azure-ml-py) to process our log file.

We specify the following parameters:

* <b>name</b>: We set a name for our PythonScriptStep.

* <b>script_name</b>: The name of the log script.

* <b>compute_target</b>: Set AML compute target. 

* <b>runconfig</b>: The RunConfiguration defined during the previous step.

* <b>arguments</b>: The arguments you can specify based on your setup.


We created 5 input arguments that the user can adjust for the forecasting use case.

* <b>ParallelRunStep_name</b>: The ParallelRunStep name that you defined in the ParallelRunStep.
* <b>pipeline_output_name</b>: The ParallelRunStep output directory name. 
* <b>datastore</b>: The registered datastore name where you put your log file.
* <b>experiment</b>: The name of the experiment that ParallelRunStep is running on. 
* <b>overwrite_logs</b>: 'True' will overwrite the existing log files. 'False' will not overwrite.


In [None]:
from azureml.pipeline.steps import PythonScriptStep

log_python_script_step = PythonScriptStep(name="logging",
                        script_name="log.py",
                        compute_target=compute,
                        source_directory='./scripts',
                        runconfig=conda_run_config,
                        arguments=['--ParallelRunStep_name', parallel_run_step.name,
                                   '--datastore', train_output_dstore.name, 
                                   '--experiment', experiment.name, 
                                   '--overwrite_logs', True, 
                                   '--pipeline_output_name', output_dir.name]
                                         )

### Set up the step sequence

We set up a step sequence to make sure that ParallelRunStep and PythonScriptStep execute in the correct order. In this example, we run PythonScriptStep after ParallelRunStep because our logging script retrieves ParallelRunStep's output data.

In [None]:
from azureml.pipeline.core import StepSequence

all_steps = StepSequence(steps=[parallel_run_step, log_python_script_step])

## 4.0 Run the training pipeline

### Submit the pipeline to run

Next we submit our pipeline to run. The whole training pipeline takes about 1h 11m using a Standard_D13_V2 VM with our current ParallelRunConfig setting.

In [None]:
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

pipeline = Pipeline(workspace=ws, steps=all_steps)
run = experiment.submit(pipeline,tags=tags1)
RunDetails(run).show()

You can run the following command if you'd like to monitor the training process in the Jupyter notebook. It streams logs live while training.

In [None]:
run.wait_for_completion(show_output=True)

Successfully trained, registered, and logged 11,973 ARIMA models. 

## 5.0 Publish the pipeline

After a successful run, we [publish the pipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.publishedpipeline?view=azure-ml-py) to the Workspace.

* <b>continue_on_step_failure</b>: Indicates whether to continue execution of other steps in the PipelineRun if a step fails; the default is False. If True, only steps that have no dependency on the output of the failed step will continue execution. The PythonScriptStep depends on the preceding ParallelRunStep, so we set it to False.

In [None]:
published_pipeline = pipeline.publish(name = 'train_many_models',
                                     description = 'train many models and log the run',
                                     version = '1',
                                     continue_on_step_failure = False
                                     )

## 6.0 Schedule the pipeline to run monthly

A published pipeline represents a Pipeline to be submitted without the Python code which constructed it.

In addition, a [PublishedPipeline](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-core/azureml.pipeline.core.publishedpipeline?view=azure-ml-py) can be used to resubmit a Pipeline with different PipelineParameter values and inputs. The following code block uses the pipeline we just published.

We retrieve the published pipeline ID and then schedule this pipeline at the desired cadence. In this case, we schedule the training pipeline to run on the first day of every month starting Jan 1, 2020 at 9AM UTC.

In [None]:
from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
training_pipeline_id = published_pipeline.id

recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T09:00:00")
recurring_schedule = Schedule.create(ws, name="training_pipeline_recurring_schedule", 
                            description="Schedule Training Pipeline to run on the first day of every month starting Jan 1, 2020 at 9AM",
                            pipeline_id=training_pipeline_id, 
                            experiment_name=experiment.name, 
                            recurrence=recurrence)
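To make the intended cadence concrete, the sketch below lists the first few run times implied by a monthly recurrence that starts on Jan 1, 2020 at 9AM. It uses only the Python standard library and does not query Azure ML; actual trigger times are determined by the service. The helper `monthly_run_times` is hypothetical.

```python
from datetime import datetime


def monthly_run_times(start, count):
    """Return `count` datetimes spaced one calendar month apart from `start`.

    Assumes the start day exists in every month (true for the 1st).
    """
    times = []
    year, month = start.year, start.month
    for _ in range(count):
        times.append(start.replace(year=year, month=month))
        month += 1
        if month > 12:
            month = 1
            year += 1
    return times


first_runs = monthly_run_times(datetime(2020, 1, 1, 9, 0, 0), count=3)
for run_time in first_runs:
    print(run_time.isoformat())
```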

## 7.0 Review outputs of the training pipeline

The training pipeline trains and registers 11,973 models in the Workspace. You can review the trained models in Azure Machine Learning Studio under 'Models'.

The pipeline also generates a log file that contains each model's training information. The log file is automatically uploaded to the training output blob container, in a folder named 'training_log', for monitoring. 