Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Training Pipeline
---

This notebook demonstrates how to train and register many models. We use the [ParallelRunStep](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-parallel-run-step) in a pipeline to train the models in parallel, which makes the process more efficient.

For this solution accelerator, we use the [OJ Sales Dataset](https://azure.microsoft.com/en-us/services/open-datasets/catalog/sample-oj-sales-simulated/) to train 11,973 individual models that predict sales for each store and brand of orange juice. For more information about the data, refer to the [data preparation notebook](01_Data_Preparation.ipynb).

There are two ways to train many models:

- Using a custom training script
- Using Automated ML


### Prerequisites

At this point, you should have already:

1. Created your AML Workspace using the [00_Setup_AML_Workspace notebook](00_Setup_AML_Workspace.ipynb)
2. Run [01_Data_Preparation.ipynb](01_Data_Preparation.ipynb) to create the dataset

Please ensure you have the latest version of the Azure ML SDK and that the pipeline steps package is installed:

In [None]:
!pip install --upgrade azureml-sdk azureml-pipeline-steps

If you are planning to train the models using Automated ML you should also have the latest version of the `automl` extension:

In [None]:
!pip install --upgrade azureml-sdk[automl]

## 1.0 Connect to workspace and datastore

In [None]:
import azureml.core
from azureml.core import Workspace

# Connect to workspace
ws = Workspace.from_config()

# Get datastore
dstore = ws.get_default_datastore()

print('SDK version: ' + azureml.core.VERSION, 
      'Workspace Name: ' + ws.name,
      'Azure Region: ' + ws.location,
      'Subscription ID: ' + ws.subscription_id,
      'Resource Group: ' + ws.resource_group,
      sep='\n')

## 2.0 Get the training Dataset

Next, we get the training Dataset using the [Dataset.get_by_name()](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.dataset.dataset?view=azure-ml-py#get-by-name-workspace--name--version--latest--) method.

This is the training dataset we created and registered in the [data preparation notebook](01_Data_Preparation.ipynb). If you chose to use only a subset of the files, the training dataset name will be `oj_data_small_train`. Otherwise, the name you'll have to use is `oj_data_train`.

We recommend that you **start with the small dataset** and make sure everything runs successfully, then scale up to the full dataset.

In [None]:
dataset_name = 'oj_data_small_train'

In [None]:
from azureml.core.dataset import Dataset

dataset = Dataset.get_by_name(ws, name=dataset_name)
dataset_input = dataset.as_named_input(dataset_name)

## 3.0 Choose a compute target

Currently, ParallelRunConfig only supports AmlCompute as a compute target. This is the compute cluster you created in the [setup notebook](00_Setup_AML_Workspace.ipynb#3.0-Create-compute-cluster).

In [None]:
cpu_cluster_name = 'cpucluster'

In [None]:
from azureml.core.compute import AmlCompute

compute = AmlCompute(ws, cpu_cluster_name)

## 4.0 Build the training pipeline

Now that the workspace, dataset and compute are set up, we can put together a pipeline for training. 

In the following subsections you'll need to pick an option to run depending on your choice for training:
- Custom script
- Automated ML

The model we use as an example in the **custom script** version is a simple, regression-based forecaster built on scikit-learn and pandas utilities. See the [custom training script](scripts/train.py) for how the forecaster is constructed. This forecaster is intended for demonstration purposes, so it does not handle the large variety of special cases that one encounters in time-series modeling. For instance, the model here assumes that every time series consists of regularly sampled observations on a contiguous interval with no missing values. The model does not include any handling of categorical variables. You can of course modify this script to include feature engineering and the machine learning model of your choice.

For a more general-use forecaster that handles missing data, advanced featurization, and automatic model selection, you can train using **Automated ML** with the [AutoML Forecasting task](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-auto-train-forecast). See the [AutoML training script](scripts/train_automl.py) for how the forecaster is constructed.

### 4.1 Configure environment for ParallelRunStep

An [environment](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments) defines a collection of resources that we will need to run our pipelines. We configure a reproducible Python environment for our training script. That environment will be replicated in the compute cluster for training.

#### [Option A] Environment for custom script

This environment will include the [scikit-learn](https://scikit-learn.org/stable/index.html) python library for training with a custom model.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

train_env = Environment(name='many_models_environment_customscript')
train_conda_deps = CondaDependencies.create(
    pip_packages=['scikit-learn', 'pandas', 'joblib', 'azureml-core', 'azureml-dataprep[fuse]'])
train_env.python.conda_dependencies = train_conda_deps

#### [Option B] Environment for AutoML

Training with Automated ML requires a more complex environment. We will create it using the `get_automl_environment` helper function located under the scripts folder of this repository.

In [None]:
from scripts.notebooks.modeling import get_automl_environment
train_env = get_automl_environment(name='many_models_environment_automl')

### 4.2 Set up ParallelRunConfig

[ParallelRunConfig](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallel_run_config.parallelrunconfig?view=azure-ml-py) provides the configuration for the ParallelRunStep we'll be creating next. Here we specify the environment and compute target we created above along with the entry script that will be run for each batch.
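
To make the role of the entry script more concrete, here is a minimal sketch of the `init()` / `run(mini_batch)` contract that ParallelRunStep expects. It is not the code in `train.py`: the row-counting logic and the hard-coded `model_type` value are placeholders chosen for illustration, and it assumes each mini-batch is a list of CSV file paths, as is the case for a FileDataset input.

```python
import csv
import os


def init():
    """Called once per worker process before any mini-batch is handled."""
    global model_type
    model_type = 'lr'  # hypothetical setting; a real script would read its arguments


def run(mini_batch):
    """Called once per mini-batch; here a mini-batch is a list of file paths."""
    results = []
    for file_path in mini_batch:
        with open(file_path, newline='') as f:
            n_rows = sum(1 for _ in csv.DictReader(f))
        # One result row per input file; a real script would fit a model here.
        results.append(f'{os.path.basename(file_path)} {model_type} {n_rows}')
    return results
```

With `output_action="append_row"`, the values returned by every `run()` call are collected into a single output file, which is why each file in the sketch produces exactly one row.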

There are a number of important parameters to configure, including:
- **mini_batch_size**: The number of files per batch. If you have 500 files and mini_batch_size is 10, 50 batches would be created containing 10 files each. Batches are split across the various nodes. We'll set this to 1.

- **node_count**: The number of compute nodes to be used for running the user script. For the small sample of OJ datasets, we only need a single node, but you will likely need to increase this number for larger datasets composed of more files. If you increase the node count beyond 20 here, you may need to increase the max_nodes for the compute cluster as well.

- **process_count_per_node**: The number of processes per node. Each node in the compute cluster we are using has 8 cores, so 8 should be the upper limit.

- **run_invocation_timeout**: The run() method invocation timeout in seconds. Set it higher than the maximum training time of one model (in seconds). The default is 60.

We have determined the settings that work best for the orange juice use case depending on the type of training we are using, but you can experiment with the parameters and see how they affect training time.
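
The arithmetic behind these parameters can be sketched as follows. The helper `plan_batches` is hypothetical (it is not part of the Azure ML SDK) and assumes mini-batches are spread evenly over all worker processes, whereas the service hands out batches dynamically, so treat the last number as a rough estimate only.

```python
import math


def plan_batches(n_files, mini_batch_size, node_count, process_count_per_node):
    """Return (mini-batches, worker processes, mini-batches per worker rounded up)."""
    n_batches = math.ceil(n_files / mini_batch_size)
    n_workers = node_count * process_count_per_node
    batches_per_worker = math.ceil(n_batches / n_workers)
    return n_batches, n_workers, batches_per_worker


# The example from the text: 500 files with mini_batch_size=10 gives 50 batches.
print(plan_batches(500, 10, node_count=1, process_count_per_node=8))  # (50, 8, 7)
```

Multiplying the per-worker batch count by a typical single-model training time gives a first guess at total run time, which can help when deciding whether to add nodes.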

[Option A] Configuration for custom script:

In [None]:
node_count = 1
process_count_per_node = 8
run_invocation_timeout = 180
training_script = 'train.py'

[Option B] Configuration for AutoML:

In [None]:
node_count = 2
process_count_per_node = 6
run_invocation_timeout = 3700
training_script = 'train_automl.py'

Now let's create the `ParallelRunConfig` object:

In [None]:
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory='./scripts',
    entry_script=training_script,
    mini_batch_size="1",
    run_invocation_timeout=run_invocation_timeout,
    error_threshold=-1,
    output_action="append_row",
    environment=train_env,
    process_count_per_node=process_count_per_node,
    compute_target=compute,
    node_count=node_count
)

And validate it:

In [None]:
from scripts.notebooks.modeling import validate_parallel_run_config

validate_parallel_run_config(parallel_run_config)

### 4.3  *[Optional]* Define AutoML settings

You'll need to run this section if you are planning to train the models using Automated ML.

This dictionary defines the AutoML settings. For this forecasting task, we add the name of the time column and the maximum forecast horizon.

|Property|Description|
|-|-|
|**task**|forecasting|
|**primary_metric**|This is the metric that you want to optimize.<br> Forecasting supports the following primary metrics <br><i>spearman_correlation</i><br><i>normalized_root_mean_squared_error</i><br><i>r2_score</i><br><i>normalized_mean_absolute_error</i>|
|**blacklist_models**|Models in the blacklist won't be used by AutoML. All supported models can be found [here](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.constants.supportedmodels.forecasting?view=azure-ml-py).|
|**iterations**|Number of models to train. This is optional but gives you greater control.|
|**iteration_timeout_minutes**|Maximum amount of time in minutes that the model can train. This is optional and depends on the dataset. We recommend exploring a bit to get approximate training times for your dataset. For the OJ dataset, we set it to 10 minutes.|
|**experiment_timeout_hours**|Maximum amount of time in hours that the experiment can take before it terminates.|
|**label_column_name**|The name of the label column.|
|**n_cross_validations**|Number of cross validation splits. Rolling Origin Validation is used to split time-series in a temporally consistent way.|
|**enable_early_stopping**|Flag to enable early termination if the score is not improving in the short term.|
|**time_column_name**|The name of your time column.|
|**max_horizon**|The number of periods out you would like to predict past your training data. Periods are inferred from your data.|
|**grain_column_names**|The column names used to uniquely identify timeseries in data that has multiple rows with the same timestamp.|
|**group_column_names**|The names of columns used to group your models. For timeseries, the groups must not split up individual time-series. That is, each group must contain one or more whole time-series.|
|**track_child_runs**|Flag to disable tracking of child runs. Only best run (metrics and model) is tracked if the flag is set to False.|
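
To illustrate what "temporally consistent" means for `n_cross_validations`, the sketch below builds rolling-origin splits over the indices of a single time series. It is a generic illustration with a hypothetical helper name, assuming expanding training windows and validation windows of `horizon` periods; the folds AutoML actually constructs may differ in detail.

```python
def rolling_origin_splits(n_obs, n_splits, horizon):
    """Yield (train_indices, validation_indices) with expanding training windows."""
    for k in range(n_splits, 0, -1):
        cutoff = n_obs - k * horizon
        if cutoff <= 0:
            raise ValueError('Not enough observations for the requested splits.')
        # Training data always precedes the validation window in time.
        yield list(range(cutoff)), list(range(cutoff, cutoff + horizon))


for train_idx, valid_idx in rolling_origin_splits(n_obs=30, n_splits=3, horizon=6):
    print(len(train_idx), valid_idx[0], valid_idx[-1])
```

Unlike shuffled k-fold splitting, no fold ever trains on observations that come after the ones it is validated on.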

In [None]:
import logging
from scripts.notebooks.modeling import write_automl_settings_to_file

automl_settings = {
    'task': 'forecasting',
    'primary_metric': 'normalized_root_mean_squared_error',
    'iteration_timeout_minutes': 10, # Change this based on the dataset. Explore how long training takes before setting this value
    'iterations': 15,
    'experiment_timeout_hours': 1,
    'label_column_name': 'Quantity',
    'n_cross_validations' : 3,
    'verbosity': logging.INFO, 
    'debug_log': 'automl_oj_sales_debug.txt',
    'time_column_name': 'WeekStarting',
    'max_horizon': 6,
    'group_column_names': ['Store', 'Brand'],
    'grain_column_names': ['Store', 'Brand']
}

write_automl_settings_to_file(automl_settings)

### 4.4 Set up ParallelRunStep

This [ParallelRunStep](https://docs.microsoft.com/en-us/python/api/azureml-pipeline-steps/azureml.pipeline.steps.parallelrunstep?view=azure-ml-py) is the main step in our training pipeline. 

First, we set up the output directory and define the pipeline's output name. The pipeline's output data is stored in the workspace's default datastore.

In [None]:
from azureml.pipeline.core import PipelineData

output_dir = PipelineData(name='training_output', datastore=dstore)

We provide our ParallelRunStep with a name, the ParallelRunConfig created above and several other parameters:

- **inputs**: A list of input datasets. Here we'll use the dataset created in the data preparation notebook and retrieved in step 2. The number of files in the FileDataset determines the number of models that will be trained in the ParallelRunStep.

- **output**: A PipelineData object that corresponds to the output directory. We'll use the output directory we just defined. 

- **allow_reuse**: Indicates whether the step should reuse previous results when re-run with the same settings. 

- **arguments**: A list of arguments required for the train.py entry script. Here, we provide the schema for the timeseries data - i.e. the names of target, timestamp, and id columns - as well as columns that should be dropped prior to modeling and a string identifying the model type.
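
As a rough sketch of how an entry script might consume such a list, the following uses `argparse` with the Option A argument names. The function name `parse_step_arguments` is hypothetical and the actual parsing in `train.py` may differ; the required/optional choices here are assumptions made for the example.

```python
import argparse


def parse_step_arguments(argv):
    """Parse a flat list of step arguments into named values."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--target_column', type=str, required=True)
    parser.add_argument('--timestamp_column', type=str, required=True)
    # nargs='*' collects every value up to the next '--' option into a list.
    parser.add_argument('--timeseries_id_columns', type=str, nargs='*', default=[])
    parser.add_argument('--drop_columns', type=str, nargs='*', default=[])
    parser.add_argument('--model_type', type=str, required=True)
    # parse_known_args tolerates any extra arguments it does not recognize.
    args, _ = parser.parse_known_args(argv)
    return args


args = parse_step_arguments([
    '--target_column', 'Quantity',
    '--timestamp_column', 'WeekStarting',
    '--timeseries_id_columns', 'Store', 'Brand',
    '--drop_columns', 'Revenue', 'Store', 'Brand',
    '--model_type', 'lr',
])
print(args.timeseries_id_columns, args.drop_columns)
```

Multi-valued options such as `--drop_columns` are passed as separate list items rather than one comma-joined string, which is why the sketch uses `nargs='*'`.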


[Option A] Parameters for custom script:

In [None]:
step_arguments = [
    '--target_column', 'Quantity', 
    '--timestamp_column', 'WeekStarting', 
    '--timeseries_id_columns', 'Store', 'Brand',
    '--drop_columns', 'Revenue', 'Store', 'Brand',
    '--model_type', 'lr'
]

[Option B] Parameters for Automated ML:

In [None]:
step_arguments = [
    '--drop_columns', 'Revenue',
  # '--retrain_failed_models', 'True'  # Uncomment this if you want to retrain only failed models
]

Now create the ParallelRunStep object:

In [None]:
from azureml.pipeline.steps import ParallelRunStep

parallel_run_step = ParallelRunStep(
    name='many-models-parallel-training',
    parallel_run_config=parallel_run_config,
    inputs=[dataset_input],
    output=output_dir,
    allow_reuse=False,
    arguments=step_arguments
)

### 4.5 Create pipeline

Finally, we will create the training pipeline, solely composed of the ParallelRunStep we have just defined.

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[parallel_run_step])

## 5.0 Run the training pipeline

Next, we submit our pipeline to run. The run will train models for each dataset and compute **in-sample** accuracy metrics for the fits. 

Training time will depend on the method chosen and the number of files in the dataset. With our current ParallelRunConfig settings and using a Standard_D13_V2 VM:
- Custom script: should take around 5-10 minutes with 10 files. With the full dataset it can take over an hour.
- Automated ML: the whole training pipeline takes 20-30 minutes with 10 files, as it tries many different algorithms and data transformation options and picks the one that yields the best results. The full dataset takes hours to train.

In [None]:
from azureml.core import Experiment

experiment = Experiment(ws, 'many-models-training')
print('Experiment name: ' + experiment.name)

run = experiment.submit(pipeline)
print('Run ID:', run.id)

You can run the following command if you'd like to monitor the training process in a Jupyter notebook. It streams logs live while training. This command may not work on a Notebook VM; however, it should work on your local machine.

In [None]:
run.wait_for_completion(show_output=True, raise_on_error=True)

## 6.0 Monitor the training pipeline

The run submitted to the Azure Machine Learning Training Compute Cluster may take a while to complete. You can monitor the status of the run in the [Azure Machine Learning Portal](https://ml.azure.com). When finished, the training pipeline will have trained the models and registered them in the Workspace. Results can also be reviewed in the AML Portal.

If there are any issues with training, you can inspect the run and explore logs under 'Outputs+logs'. You can look at the stdout and stderr output under logs/user/worker/*< worker ip >* for more details.

In [None]:
step_run = run.find_step_run('many-models-parallel-training')[0]
print('URL to run:', step_run.get_portal_url())

You can review registered models under the 'Models' section in the AML Portal:

In [None]:
workspace_id = ws.get_details()['id']
models_url = f'https://ml.azure.com/model/list?wsid={workspace_id}'
print('URL to models list:', models_url)

If you need to cancel the runs of the training experiment you can use the following:

In [None]:
# for run in experiment.get_runs():
#     if run.status == 'Running':
#         print('Canceling run:', run.id)
#         try:
#             run.cancel() 
#         except Exception as e:
#             print('Canceling run failed due to', e)

## 7.0 Analyze results of training pipeline

The dataframe we return in the run method of the training script ([train.py](scripts/train.py) or [train_automl.py](scripts/train_automl.py)) is written to *parallel_run_step.txt*. To analyze the results of our training pipeline, we'll download that file, read the data into a DataFrame, and then visualize the results, including the in-sample metrics. The output is not generated until the run is complete.

### 7.1 Download parallel_run_step.txt locally

If this notebook has been restarted since you ran training, you might need to run the following to get the `run` object:

In [None]:
# from azureml.core import Experiment
# from azureml.pipeline.core import PipelineRun

# experiment = Experiment(ws, 'many-models-training')
# run = PipelineRun(experiment, '<your-run-id>')

And then download the results:

In [None]:
import os

def download_results(run, target_dir=None, step_name='many-models-parallel-training', output_name='training_output'):
    # Fall back to the current directory so the returned path can always be built
    target_dir = target_dir or '.'
    stitch_run = run.find_step_run(step_name)[0]
    port_data = stitch_run.get_output_data(output_name)
    port_data.download(target_dir, show_progress=True)
    return os.path.join(target_dir, 'azureml', stitch_run.id, output_name)

file_path = download_results(run, 'output')
file_path

### 7.2 Convert the file to a dataframe

In [None]:
import pandas as pd

df = pd.read_csv(file_path + '/parallel_run_step.txt', 
                 sep=' ',
                 names=['Store', 'Brand', 'ModelType', 'FileName', 'ModelName', 'StartTime', 'EndTime',
                        'RMSE', 'MAE', 'MAPE', 'Index', 'Number of Models', 'Status', 'ErrorType', 'ErrorMessage'], 
                 parse_dates=['StartTime', 'EndTime'])

df['MSE'] = df['RMSE'] ** 2
df['Duration'] = df['EndTime'] - df['StartTime']

df.head()

### 7.3 Review Results

In [None]:
total = df['EndTime'].max() - df['StartTime'].min()

print('Number of Models: ' + str(len(df)))
print('Total Duration: ' + str(total)[7:])

In [None]:
print('Average MAPE: ' + str(round(df['MAPE'].mean(), 5)))
print('Average MSE: ' + str(round(df['MSE'].mean(), 5)))
print('Average RMSE: ' + str(round(df['RMSE'].mean(), 5)))
print('Average MAE: '+ str(round(df['MAE'].mean(), 5)))

In [None]:
print('Maximum Duration: '+ str(df['Duration'].max())[7:])
print('Minimum Duration: ' + str(df['Duration'].min())[7:])
print('Average Duration: ' + str(df['Duration'].mean())[7:])

### 7.4 Visualize Performance across models

Here, we produce some charts from the error metrics calculated during the run. It is important to note that these metrics are computed over the training data (that is, they're in-sample metrics) and, therefore, may not reflect true forecasting accuracy. Please see the [forecasting notebook](03_Forecasting_Pipeline.ipynb) for an example of out-of-sample evaluation that is more appropriate for assessing forecast accuracy.
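
For reference, the sketch below shows the usual definitions of the three error metrics that appear in the results. The helper `in_sample_metrics` is hypothetical, and it expresses MAPE as a percentage, which is an assumption; the training scripts may scale or compute these values differently.

```python
import math


def in_sample_metrics(actual, predicted):
    """Compute MAE, RMSE and MAPE (in percent) for paired observations."""
    errors = [a - p for a, p in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined when an actual value is zero.
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / n
    return {'MAE': mae, 'RMSE': rmse, 'MAPE': mape}


print(in_sample_metrics([100, 200, 400], [110, 190, 380]))
```

Because MAPE divides by the actual value, series with very small quantities can dominate the average, which is worth keeping in mind when reading the charts.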

First, we examine the distribution of mean absolute percentage error (MAPE) over all the models:

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

fig = sns.boxplot(y='MAPE', data=df)
fig.set_title('MAPE across all models')

Next, we can break that down by Brand or Store to see variations in error across our models:

In [None]:
fig = sns.boxplot(x='Brand', y='MAPE', data=df)
fig.set_title('MAPE by Brand')

We can also look at how long models for different brands took to train:

In [None]:
brand = df.groupby('Brand')
brand = brand['Duration'].sum()
brand = pd.DataFrame(brand)
brand['time_in_seconds'] = brand['Duration'].dt.total_seconds()

brand.drop(columns=['Duration']).plot(kind='bar')
plt.xlabel('Brand')
plt.ylabel('Seconds')
plt.title('Total Training Time by Brand')
plt.show()

## 8.0 *[Optional]* Publish and schedule the pipeline 


### 8.1 Publish the pipeline
Once you have a pipeline you're happy with, you can publish it so you can call it programmatically later on. See this [tutorial](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-your-first-pipeline#publish-a-pipeline) for additional information on publishing and calling pipelines.

In [None]:
# published_pipeline = pipeline.publish(
#     name='many-models-training',
#     description='Many Models training pipeline',
#     version='1',
#     continue_on_step_failure=False
# )

### 8.2 Schedule the pipeline
You can also [schedule the pipeline](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-schedule-pipelines) to run on a time-based or change-based schedule. This could be used to automatically retrain models every month or based on another trigger such as data drift.

In [None]:
# from azureml.pipeline.core import Schedule, ScheduleRecurrence
    
# training_pipeline_id = published_pipeline.id

# recurrence = ScheduleRecurrence(frequency="Month", interval=1, start_time="2020-01-01T09:00:00")
# recurring_schedule = Schedule.create(
#     ws, 
#     name="training_pipeline_recurring_schedule", 
#     description="Schedule Training Pipeline to run on the first day of every month",
#     pipeline_id=training_pipeline_id, 
#     experiment_name=experiment.name, 
#     recurrence=recurrence
# )

## Next Steps

Now that you've trained and scored the models, move on to [03_Forecasting_Pipeline.ipynb](03_Forecasting_Pipeline.ipynb) to make forecasts with your models.