# Training several models in parallel #

This notebook shows how to use the Azure Machine Learning (AML) service to automate Form Recognizer training. You will see how to set up an AML workspace, create a compute cluster, execute basic Python scripts as pipeline steps, and store all metadata from Form Recognizer in the AML model store. The notebook uses ParallelRunStep to execute several training processes in parallel.

In [None]:
from azureml.core import Workspace
from azureml.core.datastore import Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import PipelineData
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment
from msrest.exceptions import HttpOperationError
from azureml.data.data_reference import DataReference
from azureml.core.runconfig import Environment
from azureml.contrib.pipeline.steps import ParallelRunConfig, ParallelRunStep
from azureml.core import Keyvault

To execute this notebook, you need to provide some parameters. They fall into the categories below.

### Azure Machine Learning Workspace parameters ###

- subscription_id: the ID of the subscription where you host, or are going to create, the Azure Machine Learning workspace
- wrksp_name: the name of the Azure Machine Learning workspace
- resource_group: the name of the resource group that contains (or will contain) your AML workspace

### Form Recognizer Parameters ###
- fr_endpoint: Form Recognizer endpoint
- fr_key: Form Recognizer key to invoke REST API

### Input data parameters ###
- sas_uri: a Shared Access Signature (SAS) URI for the **container** that holds your data. Create a container and place your data in separate folders, one folder per Form Recognizer model. You can generate the SAS in Storage Explorer or from the command line
- storage_name: the name of the storage account that contains the input data
- storage_key: a key that grants access to the storage account with the input data
- container_name: the name of the container that contains the folders with input data

You can leave all other parameters as is or modify some of them.
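
Since sas_uri has to be a SAS for the container rather than for a single blob, a quick structural check can catch common mistakes before the pipeline runs. The sketch below is only an illustration: the helper name `check_container_sas` and the example URI are hypothetical, and the requirement for read and list permissions is an assumption about what training needs.

```python
from urllib.parse import urlparse, parse_qs


def check_container_sas(sas_uri):
    """Return a list of problems found in a container-level SAS URI (empty if none)."""
    problems = []
    parsed = urlparse(sas_uri)
    query = parse_qs(parsed.query)

    if parsed.scheme != "https":
        problems.append("URI should start with https://")
    # A container URL has exactly one path segment: /<container>
    if len([part for part in parsed.path.split("/") if part]) != 1:
        problems.append("path should point at a container")
    if "sig" not in query:
        problems.append("missing signature (sig) parameter")
    # sr=c marks a service SAS scoped to a container
    if query.get("sr") != ["c"]:
        problems.append("signed resource (sr) is not 'c' (container)")
    # Assumption: training needs at least read (r) and list (l) permissions
    permissions = query.get("sp", [""])[0]
    for needed in "rl":
        if needed not in permissions:
            problems.append(f"permission '{needed}' missing from sp")
    return problems


# Made-up URI used only to exercise the check
example = "https://myaccount.blob.core.windows.net/forms?sv=2019-12-12&sr=c&sp=rl&sig=abc123"
print(check_container_sas(example))
```

An empty list means none of these checks failed. It does not prove that the signature is valid or unexpired.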

In [None]:
subscription_id = "<to provide>"
wrksp_name = "<to provide>"
resource_group = "<to provide>"
region = "westus2"
compute_name = "mycluster"
min_nodes = 0
max_nodes = 4
vm_priority = "dedicated"
vm_size = "Standard_DS2_v2"
project_folder = "parallel_training_steps"
fr_endpoint = "<to provide>"
fr_key = "<to provide>"
sas_uri = "<to provide>"
storage_name = "<to provide>"
storage_key = "<to provide>"
container_name = "<to provide>"
datastore_name = "training_ds"
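
Before running the remaining cells, it can help to confirm that no parameter still holds the `<to provide>` placeholder. The following is a minimal sketch. The helper `find_missing` and the sample values are hypothetical, and in the notebook you would pass the real variables instead.

```python
PLACEHOLDER = "<to provide>"


def find_missing(params):
    """Return the names of parameters that are empty or still hold the placeholder."""
    return sorted(
        name for name, value in params.items()
        if value is None or str(value).strip() in ("", PLACEHOLDER)
    )


# Illustrative values only
params = {
    "subscription_id": "<to provide>",
    "wrksp_name": "forms-ws",
    "fr_key": "",
    "region": "westus2",
}
missing = find_missing(params)
print("Missing parameters:", missing)
```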

First, we need a reference to the Azure Machine Learning workspace. We will use this reference to create all the entities we need. If the workspace doesn't exist, we create a new one based on the provided parameters.

In [None]:
try:
    aml_workspace = Workspace.get(
        name=wrksp_name,
        subscription_id=subscription_id,
        resource_group=resource_group)
    print("Found the existing Workspace")
except Exception:
    print(f"Creating AML Workspace: {wrksp_name}")
    aml_workspace = Workspace.create(
        name=wrksp_name,
        subscription_id=subscription_id,
        resource_group=resource_group,
        create_resource_group=True,
        location=region)

This notebook also shows how to pass secure values to a pipeline step through Key Vault integration. A Key Vault is deployed together with the AML workspace and is available to any AML pipeline. The next cell creates or updates Key Vault secrets with the SAS URI and the Form Recognizer key.

In [None]:
keyvault = aml_workspace.get_default_keyvault()
keyvault.set_secret(name="frkey", value=fr_key)
keyvault.set_secret(name="sasuri", value=sas_uri)
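
Inside a step script, the secrets stored above can be read back through the run context with `Run.get_context().get_secret(name=...)` from azureml-core. The sketch below wraps that lookup in a hypothetical helper, `load_secrets`, which takes the getter as a parameter so that it can be tried locally. The dictionary standing in for Key Vault and its values are made up.

```python
def load_secrets(get_secret, names=("frkey", "sasuri")):
    """Fetch each named secret through the supplied getter and fail early if one is empty."""
    secrets = {}
    for name in names:
        value = get_secret(name)
        if not value:
            raise ValueError(f"Secret '{name}' is missing or empty")
        secrets[name] = value
    return secrets


# Inside a pipeline step the getter would typically be built like this:
#   from azureml.core import Run
#   run = Run.get_context()
#   get_secret = lambda name: run.get_secret(name=name)
# Here a plain dict stands in for Key Vault so the sketch runs locally.
fake_vault = {"frkey": "dummy-key", "sasuri": "https://example.invalid/container?sig=dummy"}
secrets = load_secrets(fake_vault.get)
print(sorted(secrets))
```

Failing early on an empty secret keeps a misconfigured vault from surfacing later as an opaque authentication error.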

We will have several steps in our machine learning pipeline. We will store all temporary data in the default blob storage that is associated with the AML workspace. In the cell below we get a reference to that blob storage.

In [None]:
blob_datastore = aml_workspace.get_default_datastore()

In the next cell we create the compute cluster that will run the pipeline. The cluster scales automatically and uses min_nodes as its minimum number of nodes. If this value is 0, the cluster provisions a node (or several) only when it needs to run a step. The listing and registration steps are basic Python scripts that each run on a single node, while the parallel training step can use up to max_nodes nodes.

In [None]:
if compute_name in aml_workspace.compute_targets:
    compute_target = aml_workspace.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print(f"Found existing compute target {compute_name} so using it")
else:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        vm_priority=vm_priority,
        min_nodes=min_nodes,
        max_nodes=max_nodes,
    )

    compute_target = ComputeTarget.create(
        aml_workspace, compute_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

This part differs from our basic pipeline. Because the container holds several folders, we need to list all of them, which means mounting the input storage container on the compute cluster. To do that, we register the blob container as a datastore in Azure ML and then create a data reference to a specific folder.

In [None]:
try:
    training_datastore = Datastore.get(aml_workspace, datastore_name)
    print("Found Blob Datastore with name: %s" % datastore_name)
except HttpOperationError:
    training_datastore = Datastore.register_azure_blob_container(
        workspace=aml_workspace,
        datastore_name=datastore_name,
        account_name=storage_name,
        container_name=container_name,
        account_key=storage_key)
    print("Registered blob datastore with name: %s" % datastore_name)

In [None]:
training_src = DataReference(
    datastore=training_datastore,
    data_reference_name="training_src",
    path_on_datastore="/")

Because we run the training step in parallel, we need to list all available folders in advance so that ParallelRunStep can split the execution into batches. To do that, we add one more step to our pipeline and use it to create a pipeline dataset that lists all the folders.
In the cell below we create the pipeline data that stores this dataset between steps.

In [None]:
list_folders = PipelineData("list_folders", datastore=blob_datastore)
list_folders_file_dataset = list_folders.as_dataset()

Now we can list all folders and store the result in CSV format in the provided pipeline data folder. The step is below.

In [None]:
list_step = PythonScriptStep(
    name="list_folders",
    script_name="list.py",
    inputs=[training_src],
    outputs=[list_folders_file_dataset],
    arguments=[
        "--list_output", list_folders_file_dataset,
        "--training_folder", training_src],
    compute_target=compute_target,
    source_directory=project_folder
)
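
The notebook does not show the contents of list.py, so the following is only a sketch of what such a script might look like. It accepts the `--list_output` and `--training_folder` arguments used by the step above, lists the immediate subfolders of the mounted training folder, and writes them to a CSV file. The file name `folders.csv` and the `folder` column header are assumptions made for illustration.

```python
import argparse
import csv
import os
import tempfile


def list_subfolders(training_folder):
    """Return the sorted names of the immediate subfolders of training_folder."""
    return sorted(entry.name for entry in os.scandir(training_folder) if entry.is_dir())


def write_folder_list(folders, output_dir, file_name="folders.csv"):
    """Write one folder name per row under a 'folder' header and return the file path."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, file_name)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["folder"])  # hypothetical column name
        writer.writerows([name] for name in folders)
    return path


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--list_output", required=True)
    parser.add_argument("--training_folder", required=True)
    args = parser.parse_args(argv)
    return write_folder_list(list_subfolders(args.training_folder), args.list_output)


# Local dry run against a temporary directory instead of a mounted datastore
with tempfile.TemporaryDirectory() as root:
    for name in ("invoices", "receipts"):
        os.makedirs(os.path.join(root, "data", name))
    csv_path = main([
        "--list_output", os.path.join(root, "out"),
        "--training_folder", os.path.join(root, "data"),
    ])
    with open(csv_path) as handle:
        listed = handle.read().split()
print(listed)
```

In a real step the script would end with a plain call to `main()`, so that the arguments supplied by the pipeline are read from the command line.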

Finally, we can convert our CSV file to a tabular dataset and use it as the input source for the parallel run step.

In [None]:
list_folders_tabular = list_folders_file_dataset.parse_delimited_files()

We will need an entity to pass data from one step to another, and we will use pipeline data for that. Every time we run the pipeline, it creates a unique folder in our default blob storage and stores the pipeline data there.

In [None]:
training_output = PipelineData(
    name="training_output",
    datastore=blob_datastore)

ParallelRunStep uses an environment that lets you configure your compute nodes and install additional components (or use your own container). We use a very basic environment because our code is pretty simple.

In [None]:
batch_env = Environment(name="batch_environment")

Now we can create and configure the parallel step that runs the trainings in parallel. We will use 4 nodes (max_nodes) to train 4 models at the same time. Each node receives mini-batches of size 1. Finally, all results end up in the training_output pipeline data folder. We will use this folder as is to register all results in the AML model store as a single entity.

In [None]:
training_config = ParallelRunConfig(
    source_directory=project_folder,
    entry_script="train.py",
    mini_batch_size="1",
    output_action="append_row",
    compute_target=compute_target,
    environment=batch_env,
    node_count=max_nodes,
    process_count_per_node=1,
    error_threshold=10)

training_step = ParallelRunStep(
    name="trainfr",
    models=[],
    parallel_run_config=training_config,
    inputs=[list_folders_tabular],
    output=training_output,
    arguments=[
        "--fr_endpoint", fr_endpoint
    ],
    allow_reuse=False)
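
ParallelRunStep expects the entry script to define an `init()` function, called once per worker process, and a `run(mini_batch)` function, called once per mini-batch. The notebook does not show train.py, so the sketch below only illustrates that contract. The `folder` column name, the `folder,model_id` result format, and the `train_folder` function (a stand-in for the real Form Recognizer call) are all hypothetical.

```python
import argparse

fr_endpoint = None


def init(argv=None):
    """Called once per worker process; parse the arguments passed to the step."""
    global fr_endpoint
    parser = argparse.ArgumentParser()
    parser.add_argument("--fr_endpoint", default="")
    args, _ = parser.parse_known_args(argv)
    fr_endpoint = args.fr_endpoint


def train_folder(endpoint, folder):
    """Hypothetical placeholder for the Form Recognizer training request."""
    return f"model-for-{folder}"


def run(mini_batch):
    """Called once per mini-batch; return one result line per input row."""
    # A tabular mini-batch arrives as a pandas DataFrame. A list of dicts is
    # also accepted here so that the sketch can be tried without pandas.
    if hasattr(mini_batch, "to_dict"):
        rows = mini_batch.to_dict("records")
    else:
        rows = list(mini_batch)
    results = []
    for row in rows:
        folder = row["folder"]  # hypothetical column name
        results.append(f"{folder},{train_folder(fr_endpoint, folder)}")
    return results


# Local dry run with a made-up endpoint and a single-row mini-batch
init(["--fr_endpoint", "https://example.invalid"])
batch_result = run([{"folder": "invoices"}])
print(batch_result)
```

With `output_action="append_row"`, the items returned by every `run()` call are appended to a single output file in the step's output folder.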

The next step takes the output of the training step and registers it in the AML model store. We could have implemented these two steps as a single step, but we wanted to show some aspects of AML (passing data between steps and multistep pipelines).

In [None]:
register_step = PythonScriptStep(
    name="registering",
    script_name="register.py",
    inputs=[training_output],
    outputs=[],
    arguments=["--input", training_output],
    compute_target=compute_target,
    source_directory=project_folder
)
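
The contents of register.py are not shown in this notebook. As a rough sketch, such a script could read the aggregated file that the parallel step writes into the folder passed as `--input` and then register that folder as a model. The sketch assumes the default append_row file name `parallel_run_step.txt` and a hypothetical `folder,model_id` line format. The registration call itself appears only as a comment because it needs a live workspace.

```python
import os
import tempfile


def read_results(input_dir, file_name="parallel_run_step.txt"):
    """Parse 'folder,model_id' lines from the aggregated output file into a dict."""
    results = {}
    with open(os.path.join(input_dir, file_name)) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            folder, model_id = line.split(",", 1)
            results[folder] = model_id
    return results


# In the real step, registration could follow along these lines
# (the model name is a made-up example):
#   from azureml.core import Run
#   from azureml.core.model import Model
#   workspace = Run.get_context().experiment.workspace
#   Model.register(workspace=workspace, model_path=input_dir,
#                  model_name="form_recognizer_models")

# Local dry run with a temporary folder standing in for training_output
with tempfile.TemporaryDirectory() as input_dir:
    with open(os.path.join(input_dir, "parallel_run_step.txt"), "w") as handle:
        handle.write("invoices,model-1\nreceipts,model-2\n")
    parsed = read_results(input_dir)
print(parsed)
```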

Finally, we can create a pipeline from the three steps above. We just need to combine all the steps in a list and create a Pipeline object from it.

In [None]:
steps = [list_step, training_step, register_step]

In [None]:
pipeline = Pipeline(workspace=aml_workspace, steps=steps)

It's time to execute our pipeline. We use the Experiment class to create an actual run, passing our pipeline as a parameter.

In [None]:
pipeline_run = Experiment(aml_workspace, 'train_parallel_exp').submit(pipeline)
pipeline_run.wait_for_completion()

If the run succeeds, we would like to publish the pipeline so that it can be executed later from the Azure portal, the Python SDK, or the REST API.

In [None]:
pipeline.publish(
    name="parallel_training",
    description="Training form recognizer based on several data folders in parallel")
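
A published pipeline exposes a REST endpoint (the `endpoint` attribute of the object returned by `pipeline.publish`), and a run can be started by sending an authenticated POST request with an experiment name in the JSON body. The sketch below only builds such a request with the standard library and does not send it. The endpoint URL and the token are placeholders, and `build_pipeline_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_pipeline_request(rest_endpoint, auth_header, experiment_name):
    """Build (but do not send) the POST request that triggers a published pipeline."""
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    headers = {"Content-Type": "application/json", **auth_header}
    return urllib.request.Request(rest_endpoint, data=body, headers=headers, method="POST")


# In practice the header could come from azureml-core, for example:
#   from azureml.core.authentication import InteractiveLoginAuthentication
#   auth_header = InteractiveLoginAuthentication().get_authentication_header()
request = build_pipeline_request(
    "https://example.invalid/pipelines/placeholder-id",  # placeholder endpoint
    {"Authorization": "Bearer <token>"},                  # placeholder token
    "train_parallel_exp",
)
print(request.get_method(), json.loads(request.data))
```

Sending the request would then be a matter of calling `urllib.request.urlopen(request)` with a real endpoint and token.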