Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Training Pipeline
---
This notebook demonstrates how to create a pipeline that trains, scores, and registers many models. We use ParallelRunStep to train the models in parallel. For this solution accelerator, we use the orange juice dataset to predict the quantity of orange juice sold for each brand and each store.


### Prerequisites
At this point, you should have already:

1. Created your AML Workspace
2. Run 00_Environment_Setup.ipynb to configure your workspace and create the dataset

## 1.0 Connect to workspace

In [None]:
from azureml.core import Workspace

# set up workspace
ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

# set up datastores
dstore = ws.get_default_datastore()

print('Workspace Name: ' + ws.name, 
      'Azure Region: ' + ws.location, 
      'Subscription Id: ' + ws.subscription_id, 
      'Resource Group: ' + ws.resource_group, 
      sep = '\n')

## 2.0 Create an experiment

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'training_pipeline')

print('Experiment name: ' + experiment.name)

## 3.0 Get the dataset

In [None]:
from azureml.core.dataset import Dataset

dataset_name = 'oj_data_small'
small_dataset = Dataset.get_by_name(ws, name=dataset_name)
small_dataset_input = small_dataset.as_named_input(dataset_name)

## 4.0 Create the training pipeline
Now that the workspace, experiment, and dataset are set up, we can put together a pipeline for training.

### 4.1 Configure environment for ParallelRunStep
An [environment](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments) defines a collection of resources that we will need to run our pipelines. We configure a reproducible Python environment for our training script that includes the [scikit-learn](https://scikit-learn.org/stable/index.html) Python library.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

train_env = Environment(name="many_models_environment")
# Install scikit-learn under its PyPI name; the 'sklearn' alias is deprecated
train_conda_deps = CondaDependencies.create(pip_packages=['scikit-learn'])
train_env.python.conda_dependencies = train_conda_deps

### 4.2 Choose a compute target 

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunConfig
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, "cpucluster")

### 4.3 Set up ParallelRunConfig

[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallel_run_config.parallelrunconfig?view=azure-ml-py) provides the configuration for the ParallelRunStep we'll be creating next. Here we specify the environment and compute target we created above, along with the entry script that will be run for each batch.

There are several important parameters to configure:
- **mini_batch_size**: The number of files per batch. If you have 500 files and mini_batch_size is 10, 50 batches are created containing 10 files each. Batches are split across the various nodes.

- **node_count**: The number of compute nodes used to run the user script. We recommend starting with 5 and increasing the node_count from there. If you increase the node count here, you'll need to increase max_nodes for the compute cluster as well.

- **process_count_per_node**: The number of processes per node. The compute cluster we are using has 8 cores, so we set this parameter to 8.

- **run_invocation_timeout**: The timeout for each run() method invocation, in seconds. It should be higher than the maximum training time of one model; the default is 60. Since the slowest model takes about 120 seconds to train, we set it to 500 to give the method adequate time to run.
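
To make the arithmetic behind these parameters concrete, here is a rough sketch in plain Python. The helper `plan_parallel_run` is hypothetical and not part of the Azure ML SDK, and the idea of batches being consumed in even "waves" is a simplifying assumption; the real scheduler hands out mini-batches dynamically.

```python
import math

def plan_parallel_run(n_files, mini_batch_size, node_count, process_count_per_node):
    """Estimate how input files are split into mini-batches and spread over workers.

    Hypothetical helper for illustration only; it is not an Azure ML API.
    """
    # Each mini-batch holds up to mini_batch_size files.
    n_batches = math.ceil(n_files / mini_batch_size)
    # Every process on every node can work on one mini-batch at a time.
    n_workers = node_count * process_count_per_node
    # Simplifying assumption: batches are consumed in even "waves" across workers.
    n_waves = math.ceil(n_batches / n_workers)
    return n_batches, n_workers, n_waves

# The example from the text: 500 files, 10 files per mini-batch.
print(plan_parallel_run(500, 10, node_count=5, process_count_per_node=8))  # (50, 40, 2)

# This notebook's settings: 10 files, one file per mini-batch.
print(plan_parallel_run(10, 1, node_count=5, process_count_per_node=8))    # (10, 40, 1)
```

With the small dataset, only 10 of the 40 worker processes receive a file, so adding nodes pays off only once the number of mini-batches exceeds the number of workers.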


We also add tags to record the dataset name, the training cluster's node count, the process count per node, and the timeout. You can find these in the 'Tags' column in Azure Machine Learning Studio.

In [None]:
processes_per_node = 8
node_count = 5
timeout = 500

tags = {}
tags['dataset_name'] = dataset_name
tags['node_count'] = node_count
tags['process_count_per_node'] = processes_per_node
tags['timeout'] = timeout

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script='train.py',
    mini_batch_size="1",
    run_invocation_timeout=timeout,
    error_threshold=10,
    output_action="append_row",
    environment=train_env,
    process_count_per_node=processes_per_node,
    compute_target=compute,
    node_count=node_count)
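
The entry script named above, `train.py`, is not shown in this notebook. As a rough sketch of the shape ParallelRunStep expects, an entry script defines an `init()` function, called once per worker process, and a `run(mini_batch)` function, called once per mini-batch. The body below is a placeholder rather than the accelerator's actual training logic, and the file name in the usage line is made up for illustration.

```python
import os

def init():
    """Called once per worker process before any mini-batch is handled."""
    # One-time setup (parsing arguments, loading shared resources) would go here.
    pass

def run(mini_batch):
    """Called once per mini-batch; with file inputs, mini_batch is a list of file paths."""
    results = []
    for file_path in mini_batch:
        # Placeholder for the real work: read the file, fit a model, register it.
        series_name = os.path.splitext(os.path.basename(file_path))[0]
        results.append(f"{series_name},trained")
    # With output_action="append_row", the returned items are collected
    # into the step's single output file.
    return results

# Local smoke test with a made-up file path.
init()
print(run(["/tmp/Store1_BrandA.csv"]))  # ['Store1_BrandA,trained']
```

Because `mini_batch_size` is "1" here, each `run()` call would receive a single file, which is why `run_invocation_timeout` needs to cover the training time of only one model.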

### 4.4 Set up ParallelRunStep

This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-contrib-pipeline-steps/azureml.contrib.pipeline.steps.parallelrunstep?view=azure-ml-py) is the main step in our training pipeline. 

First, we set up the output directory and define the pipeline's output name. The pipeline's output data is stored in the workspace's default datastore.

In [None]:
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name="training_output", 
                          datastore=dstore)

We provide our ParallelRunStep with a name, the ParallelRunConfig created above and several other parameters:

- **inputs**: A list of input datasets. Here we'll use the dataset created in the previous notebook. The number of files in that dataset determines the number of models that will be trained in the ParallelRunStep.

- **output**: The PipelineData object we just defined, which corresponds to the output directory.

- **arguments**: A list of arguments required for the train.py entry script.

In [None]:
from azureml.contrib.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name="many-models-training",
    parallel_run_config=parallel_run_config,
    inputs=[small_dataset_input],
    output=output_dir,
    arguments=['--target_column', 'Quantity', 
               '--n_test_periods', 6, 
               '--forecast_granularity', 7, 
               '--timestamp_column', 'WeekStarting', 
               '--model_type', 'linear_regression'])
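
The `arguments` list reaches the entry script as command-line strings, so numeric values such as `6` and `7` need to be converted back to integers on the receiving side. Below is a minimal sketch of how an entry script might read them with `argparse`; the function name `parse_args` is hypothetical, and the real `train.py` may handle its arguments differently.

```python
import argparse

def parse_args(argv=None):
    """Parse the training arguments passed through ParallelRunStep (illustrative only)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--target_column", type=str, required=True)
    parser.add_argument("--n_test_periods", type=int, required=True)
    parser.add_argument("--forecast_granularity", type=int, required=True)
    parser.add_argument("--timestamp_column", type=str, required=True)
    parser.add_argument("--model_type", type=str, required=True)
    # parse_known_args tolerates extra arguments that the runtime may append.
    args, _unknown = parser.parse_known_args(argv)
    return args

# Simulate the command line that the step above would produce.
args = parse_args(["--target_column", "Quantity",
                   "--n_test_periods", "6",
                   "--forecast_granularity", "7",
                   "--timestamp_column", "WeekStarting",
                   "--model_type", "linear_regression"])
print(args.target_column, args.n_test_periods, args.model_type)
```

Using `parse_known_args` rather than `parse_args` is a defensive choice here, on the assumption that the pipeline runtime may pass additional arguments of its own.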

## 5.0 Run the pipeline
Next, we submit our pipeline to run. With 10 files, this should take only a few minutes, but with the full dataset it can take over an hour.

In [None]:
from azureml.pipeline.core import Pipeline
from azureml.widgets import RunDetails

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])
run = experiment.submit(pipeline, tags=tags)
RunDetails(run).show()

In [None]:
# run.wait_for_completion(show_output=True)

## 6.0 View results of training pipeline

## 7.0 Publish and schedule the pipeline

### 7.1 Publish the pipeline

In [None]:
# published_pipeline = pipeline.publish(name = 'train_many_models',
#                                      description = 'train many models and log the run',
#                                      version = '1',
#                                      continue_on_step_failure = False)

### 7.2 Schedule the pipeline

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
# training_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(ws, name="training_pipeline_recurring_schedule", 
#                             description="Schedule Training Pipeline to run on the first day of every month",
#                             pipeline_id=training_pipeline_id, 
#                             experiment_name=experiment.name, 
#                             recurrence=recurrence)