Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.


# 02b Train Automated ML
# Automated Machine Learning
_**Training many models using Automated Machine Learning**_

This notebook demonstrates how to train and register 11,973 models using Automated Machine Learning. We use [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_step.parallelrunstep?view=azure-ml-py) to parallelize the training. The notebook uses an orange juice sales dataset to predict the orange juice quantity for each brand and each store. For more information about the data, refer to the Data Preparation notebook.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend setting the parallelism to a maximum of 20 runs per experiment per workspace. Exceeding this limit may cause Too Many Requests errors (HTTP 429). </b></span>

## Prerequisites

At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](../../00_Setup_AML_Workspace.ipynb)
2. Run [01b_Data_Preparation.ipynb](../01b_Data_Preparation/01b_Data_Preparation.ipynb) to create the dataset

## 1.0 Set up workspace, datastore, experiment

In [None]:
import azureml.core
from azureml.core import Workspace, Datastore
import pandas as pd

# set up workspace
ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

# set up datastores
dstore = ws.get_default_datastore()

output = {}
output['SDK version'] = azureml.core.VERSION
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Default datastore name'] = dstore.name
# None disables truncation; pandas versions before 1.0 expect -1 here instead.
pd.set_option('display.max_colwidth', None)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

### Choose an experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'manymodels-training-pipeline')

print('Experiment name: ' + experiment.name)

## 2.0 Call the registered filedataset

We use 11,973 datasets and ParallelRunStep to build 11,973 time-series models that predict the quantity sold for each store and brand.

Each dataset holds two years of orange juice sales data for one brand in one store and contains 7 columns and 122 rows.

You will need to register the datasets in the Workspace first. The Data Preparation notebook demonstrates how to register two datasets to the workspace.

The registered 'oj_data_small' file dataset contains the first 10 csv files and 'oj_data' contains all 11,973 csv files. You can pass either filedst_10_models_input or filedst_all_models_inputs to the ParallelRunStep.

We recommend that you **start with filedst_10_models_input** and make sure everything runs successfully, then scale up to filedst_all_models_inputs.

In [None]:
from azureml.core.dataset import Dataset

filedst_10_models = Dataset.get_by_name(ws, name='oj_data_small')
filedst_10_models_input = filedst_10_models.as_named_input('train_10_models')

filedst_all_models = Dataset.get_by_name(ws, name='oj_data')
filedst_all_models_inputs = filedst_all_models.as_named_input('train_all_models')

## 3.0 Build the training pipeline
Now that the dataset, workspace, and datastore are set up, we can put together a pipeline for training.

### Set up environment for ParallelRunStep

[Environment](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.environment.environment?view=azure-ml-py) defines a collection of resources that we will need to run our pipelines. We configure a reproducible Python environment for our training script. 

In [None]:
from scripts.helper import get_automl_environment
train_env = get_automl_environment()

### Choose a compute target

Currently, ParallelRunConfig only supports AmlCompute. You can switch to a different compute cluster if one fails.

This is the compute target we will pass into our ParallelRunConfig.

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
amlcompute_cluster_name = "train-many-model"

found = False
# Check if this compute target already exists in the workspace.
cts = ws.compute_targets
if amlcompute_cluster_name in cts and cts[amlcompute_cluster_name].type == 'AmlCompute':
    found = True
    print('Found existing compute target.')
    compute = cts[amlcompute_cluster_name]
    
if not found:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D13_V2',
                                                           min_nodes=2,
                                                           max_nodes=20)
    # Create the cluster.
    compute = ComputeTarget.create(ws, amlcompute_cluster_name, provisioning_config)
    
print('Checking cluster status...')
# Can poll for a minimum number of nodes and for a specific timeout.
# If no min_node_count is provided, it will use the scale settings for the cluster.
compute.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    
# For a more detailed view of current AmlCompute status, use get_status().

## Train

This dictionary defines the AutoML settings. For this forecasting task, we add the name of the time column and the maximum forecast horizon.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**blacklist_models**|Models in the blacklist won't be used by AutoML. All supported models are listed [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|
|**iterations**|Number of models to train. This is optional but gives you greater control.|
|**iteration_timeout_minutes**|Maximum amount of time in minutes that a single model can train. This is optional and depends on the dataset. We recommend experimenting to get approximate training times for your dataset. For the OJ dataset, the settings below use 10 minutes.|
|**experiment_timeout_hours**|Maximum amount of time in hours that the experiment can take before it terminates.|
|**label_column_name**|The name of the label column.|
|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|
|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|
|**time_column_name**|The name of your time column.|
|**max_horizon**|The number of periods out you would like to predict past your training data. Periods are inferred from your data.|
|**grain_column_names**|The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp.|
|**group_column_names**|The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series.|
|**track_child_runs**|Flag to disable tracking of child runs. Only best run (metrics and model) is tracked if the flag is set to False.|
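
To make the n_cross_validations and max_horizon settings more concrete, the following is a minimal sketch of one way rolling origin splits can be generated for a single 122-row series. It is an illustration only, not AutoML's internal implementation. The helper name `rolling_origin_splits` and the choice of a validation window equal to the horizon are assumptions made for this example.

```python
def rolling_origin_splits(n_rows, n_splits, horizon):
    """Yield (train_indices, validation_indices) pairs, oldest fold first.

    Hypothetical helper for illustration: each validation window is
    `horizon` rows long and always comes after its training rows.
    """
    for fold in range(n_splits, 0, -1):
        # The validation window of this fold ends (fold - 1) horizons before the last row.
        valid_end = n_rows - (fold - 1) * horizon
        valid_start = valid_end - horizon
        if valid_start <= 0:
            raise ValueError("not enough rows for the requested number of splits")
        yield range(0, valid_start), range(valid_start, valid_end)


# Values taken from this notebook: 122 rows per series, 3 splits, horizon of 6.
splits = list(rolling_origin_splits(n_rows=122, n_splits=3, horizon=6))
for train_idx, valid_idx in splits:
    print(f"train rows 0-{train_idx[-1]}, validate rows {valid_idx[0]}-{valid_idx[-1]}")
```

Each fold trains only on rows that come before its validation window, which is what keeps the split temporally consistent.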

In [None]:
import logging
from scripts.helper import write_automl_settings_to_file

automl_settings = {
    "task" : 'forecasting',
    "primary_metric" : 'r2_score',
    "iteration_timeout_minutes" : 10, # Change this based on the dataset. Check how long training takes before setting this value.
    "iterations" : 15,
    "experiment_timeout_hours" : 1,
    "label_column_name" : 'Quantity',
    "n_cross_validations" : 3,
    "verbosity" : logging.INFO, 
    "debug_log": 'automl_oj_sales_debug.txt',
    "time_column_name": 'WeekStarting',
    "max_horizon" : 6,
    "group_column_names": ['Store', 'Brand'],
    "grain_column_names": ['Store', 'Brand']
}

write_automl_settings_to_file(automl_settings)
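
As a rough illustration of what grouping by 'Store' and 'Brand' means, the sketch below partitions a few rows into one time series per (Store, Brand) pair using only the standard library. The sample rows, dates, and brand labels are made up for this example, and this is not how AutoML or the helper scripts process the data internally.

```python
from collections import defaultdict

# Made-up sample rows: (WeekStarting, Store, Brand, Quantity).
rows = [
    ("1990-06-14", 2, "A", 100),
    ("1990-06-21", 2, "A", 120),
    ("1990-06-14", 2, "B", 80),
    ("1990-06-14", 5, "A", 60),
    ("1990-06-21", 5, "A", 75),
]

# Collect the (WeekStarting, Quantity) points of each (Store, Brand) series.
series = defaultdict(list)
for week_starting, store, brand, quantity in rows:
    series[(store, brand)].append((week_starting, quantity))

for key, points in sorted(series.items()):
    print(key, points)
```

Every key corresponds to one whole time series, so one model is trained per key.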

### Set up ParallelRunConfig

[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_config.parallelrunconfig) is the configuration for the parallel run step. You will need to determine the number of workers and nodes appropriate for your use case. The process_count_per_node is based on the number of cores of the compute VM. The node_count determines the number of master nodes to use; increasing the node count will speed up the training process.


* <b>node_count</b>: The number of compute nodes to be used for running the user script. We recommend starting with 3 and increasing the node_count if training takes too long.

* <b>process_count_per_node</b>: The number of processes per node.

* <b>run_invocation_timeout</b>: The run() method invocation timeout in seconds. Set the timeout to the maximum training time of one AutoML run (with some buffer). The default is 60 seconds.

<span style="color:red"><b>NOTE: There are limits on how many runs can execute in parallel per workspace. We currently recommend setting the parallelism to a maximum of 20 runs per experiment per workspace. Exceeding this limit may cause Too Many Requests errors (HTTP 429). </b></span>
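
The arithmetic behind these settings can be sketched as follows. The two helper functions are hypothetical and exist only for this illustration. The sketch assumes that each worker process handles one AutoML run at a time, so the number of parallel runs is node_count multiplied by process_count_per_node, and that a 100-second buffer on top of the tighter of the two AutoML time limits is sufficient.

```python
MAX_RECOMMENDED_PARALLEL_RUNS = 20  # recommendation stated in this notebook


def total_parallel_runs(node_count, process_count_per_node):
    # Assumption: one AutoML run per worker process at any given time.
    return node_count * process_count_per_node


def suggested_invocation_timeout(iterations, iteration_timeout_minutes,
                                 experiment_timeout_hours, buffer_seconds=100):
    # An AutoML run stops at whichever limit is reached first, so take the smaller bound.
    per_iteration_bound = iterations * iteration_timeout_minutes * 60
    experiment_bound = int(experiment_timeout_hours * 3600)
    return min(per_iteration_bound, experiment_bound) + buffer_seconds


# Values used in this notebook.
parallel_runs = total_parallel_runs(node_count=2, process_count_per_node=6)
timeout_seconds = suggested_invocation_timeout(
    iterations=15, iteration_timeout_minutes=10, experiment_timeout_hours=1)

print("parallel runs:", parallel_runs)
print("within recommended limit:", parallel_runs <= MAX_RECOMMENDED_PARALLEL_RUNS)
print("run_invocation_timeout (seconds):", timeout_seconds)
```

With the notebook's settings, the experiment timeout of one hour is the tighter bound, which is why the timeout lands slightly above 3600 seconds.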


In [None]:
#!pip install azureml.contrib.pipeline.steps

In [None]:
from scripts.helper import build_parallel_run_config

# PLEASE MODIFY the following three settings based on your compute and experiment timeout.
node_count=2
process_count_per_node=6
run_invocation_timeout=3700 # timeout in seconds; keep it in line with the AutoML experiment timeout or (number of iterations * iteration timeout), plus a buffer

parallel_run_config = build_parallel_run_config(train_env, compute, node_count, process_count_per_node, run_invocation_timeout)

### Set up ParallelRunStep

This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) is the main step in our pipeline. First, we set up the output directory and define the pipeline's output name. The pipeline's output data is stored in the workspace's default datastore.

In [None]:
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name="training_output", 
                          datastore=dstore)

We specify the following parameters:

* <b>name</b>: We set a name for our ParallelRunStep.

* <b>parallel_run_config</b>: We then pass the previously defined ParallelRunConfig.

* <b>allow_reuse</b>: Indicates whether the step should reuse previous results when re-run with the same settings. 

* <b>inputs</b>: We are going to use the registered FileDataset that we retrieved earlier in the notebook. _inputs_ points to a registered file dataset in AML studio that points to a path in the blob container. The number of files in that path determines the number of models that will be trained in the ParallelRunStep. 

* <b>output</b>: The output directory we just defined. A PipelineData object that corresponds to the output directory.

* <b>models</b>: Zero or more model names already registered in the Azure Machine Learning model registry.


In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="many-models-training",
    parallel_run_config=parallel_run_config,
    allow_reuse = False,
    inputs=[filedst_10_models_input], # train 10 models
    #inputs=[filedst_all_models_inputs], # switch to this input to train all 11,973 models
    output=output_dir,
    models=[],
    arguments=[]              
)

## 4.0 Run the training pipeline

### Submit the pipeline to run

Next we submit our pipeline to run. The whole training pipeline takes about 1h 11m using a Standard_D13_V2 VM with our current ParallelRunConfig setting.

In [None]:
from azureml.pipeline.core import Pipeline
#from azureml.widgets import RunDetails

# Pipeline expects its steps as a list.
pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
run = experiment.submit(pipeline)
#RunDetails(run).show()

You can run the following command if you'd like to monitor the training process in the Jupyter notebook. It will stream logs live while training. 

**Note**: This command may not work on a Notebook VM; however, it should work on your local laptop.

In [None]:
run.wait_for_completion(show_output=True)

Successfully trained and registered Automated ML models. 

## 5.0 Review outputs of the training pipeline

The training pipeline will train and register models to the Workspace. You can review trained models in the Azure Machine Learning Studio under 'Models'.
If there are any issues with training, you can go to the 'many-models-training' run under the pipeline run and explore logs under 'Logs'.
You can look at the stdout and stderr output under `logs/user/worker/<ip>` for more details.


## 6.0 Get list of AutoML runs along with registered model names and tags

The following code snippet will iterate through all the automl runs for the experiment and list the details.

**Run Id** - AutoML run id, **Status** - AutoML run status, **BestScore** - best score for AutoML run, **Model Name** - Registered model name, **Model Tags** - Tags for model.

Please **note** that this cell might take a long time to finish executing if the experiment contains a lot of runs and models.


In [None]:
from azureml.train.automl.run import AutoMLRun
from azureml.core.model import Model

all_runs = experiment.get_runs()
run_summary = {}
# Use a distinct loop variable so the pipeline `run` submitted earlier is not overwritten.
for idx, automl_run in enumerate(all_runs):
    if automl_run.type == 'automl':
        best_score = automl_run.tags.get('best_score', 'Nan')
        print('Run_id: ' + automl_run.id, ', Status: ' + automl_run.status, ', Best Score: ' + best_score)
        # Default the model fields to None so every row has exactly five values.
        model_name, model_tags = None, None
        if automl_run.status == 'Completed':
            for child_run in automl_run.get_children():
                for model in Model.list(ws, run_id=child_run.id):
                    print('Model name: ' + model.name, ', Model tags: ' + str(model.tags))
                    # If several models are found, the last one listed is kept.
                    model_name, model_tags = model.name, model.tags
        run_summary[idx] = [automl_run.id, automl_run.status, best_score, model_name, model_tags]

pd.DataFrame.from_dict(run_summary, orient='index',
                       columns=['Run Id', 'Status', 'BestScore', 'Model Name', 'Model Tags'])