# Single model training #

In this notebook we show how to use the Azure Machine Learning (AML) service to automate Form Recognizer training. You will see how to set up an AML workspace, create a compute target, execute a basic Python script as a pipeline step, and store the Form Recognizer model metadata in the AML model store.

In [None]:
from azureml.core import Workspace
from azureml.core.datastore import Datastore
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import PipelineData
from azureml.pipeline.core import Pipeline
from azureml.core import Experiment

To execute this notebook, you need to provide some parameters. They fall into the following categories.

### Azure Machine Learning Workspace parameters ###

- subscription_id: the ID of the subscription where you host, or are going to create, the Azure Machine Learning workspace
- wrksp_name: the name of the Azure Machine Learning workspace
- resource_group: the name of the resource group that contains, or will contain, your AML workspace

### Form Recognizer Parameters ###
- fr_endpoint: Form Recognizer endpoint
- fr_key: Form Recognizer key to invoke REST API

### Input data parameters ###
- sas_uri: a Shared Access Signature (SAS) URI for the blob **container** that holds the training data. The sample dataset that we use to train and test the model is available as a .zip file from [GitHub](https://go.microsoft.com/fwlink/?linkid=2090451). We assume that you have extracted this dataset to a blob container. You can generate the SAS in Storage Explorer or from the command line.

You can leave all other parameters as is or modify some of them.
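
Because the SAS must be issued for the container rather than for a single blob, it can be worth checking the URI before starting a run. The cell below is a minimal sketch, not part of the original notebook: `check_container_sas` is a hypothetical helper, the example URI is made up, and the check for read and list permissions reflects our understanding of what Form Recognizer training needs.

```python
from urllib.parse import urlparse, parse_qs


def check_container_sas(sas_uri):
    """Return a list of problems found in a container-level SAS URI."""
    problems = []
    parsed = urlparse(sas_uri)
    query = parse_qs(parsed.query)

    if parsed.scheme != "https":
        problems.append("URI should use https")
    # sr=c marks a SAS issued for a container (sr=b would be a single blob).
    if query.get("sr", [""])[0] != "c":
        problems.append("SAS is not scoped to a container (expected sr=c)")
    # sp holds the granted permissions; r = read, l = list.
    permissions = query.get("sp", [""])[0]
    for flag, label in (("r", "read"), ("l", "list")):
        if flag not in permissions:
            problems.append(f"SAS is missing the {label} permission")
    if "sig" not in query:
        problems.append("SAS has no signature (sig)")
    return problems


# Made-up URI for illustration only; it is not a working signature.
example = ("https://myaccount.blob.core.windows.net/forms"
           "?sv=2019-02-02&sr=c&sp=rl&sig=abc123")
print(check_container_sas(example))
```

An empty list means none of these basic checks failed. It does not prove that the signature is valid or unexpired.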

In [None]:
subscription_id = "<provide it here>"
wrksp_name = "<provide it here>"
resource_group = "<provide it here>"
region = "westus2"
compute_name = "mycluster"
min_nodes = 0
max_nodes = 4
vm_priority = "lowpriority"
vm_size = "Standard_F2s_v2"
project_folder = "basic_training_steps"
fr_endpoint = "<provide it here>"
fr_key = "<provide it here>"
sas_uri = "<provide it here>"

First, we need to get a reference to the Azure Machine Learning workspace. We will use this reference to create all the other entities. If the workspace doesn't exist, we create a new one based on the provided parameters.

In [None]:
try:
    aml_workspace = Workspace.get(
        name=wrksp_name,
        subscription_id=subscription_id,
        resource_group=resource_group)
    print("Found the existing Workspace")
except Exception:  # Workspace.get failed, e.g. the workspace does not exist yet
    print(f"Creating AML Workspace: {wrksp_name}")
    aml_workspace = Workspace.create(
        name=wrksp_name,
        subscription_id=subscription_id,
        resource_group=resource_group,
        create_resource_group=True,
        location=region)

Our machine learning pipeline has several steps. We store all temporary data in the default blob storage that is associated with the AML workspace. In the cell below we get a reference to that datastore.

In [None]:
blob_datastore = aml_workspace.get_default_datastore()

Next, we need to create the compute that will run the pipeline. The compute cluster scales automatically, and min_nodes sets its minimum number of nodes. If this value is 0, the cluster provisions a node (or several) only when it needs to run a step. In our case no more than one node is used at a time, because we have only two steps, the second depends on the output of the first, and both are basic Python scripts.

In [None]:
if compute_name in aml_workspace.compute_targets:
    compute_target = aml_workspace.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print(f"Found existing compute target {compute_name} so using it")
else:
    compute_config = AmlCompute.provisioning_configuration(
        vm_size=vm_size,
        vm_priority=vm_priority,
        min_nodes=min_nodes,
        max_nodes=max_nodes,
    )

    compute_target = ComputeTarget.create(
        aml_workspace, compute_name, compute_config)
    compute_target.wait_for_completion(show_output=True)

Now we have our workspace and a compute target in it. It's time to start creating the pipeline. It has two steps: train a Form Recognizer model, and store the model's metadata in the AML model store to make it available to the scoring pipeline.

Because we have two steps, we need an entity to pass data from one step to the other. We will use pipeline data for this. Every time we run the pipeline, it creates a unique folder in our default blob storage and stores the pipeline data there.

In [None]:
training_output = PipelineData(
    "training_output",
    datastore=blob_datastore)

The first step executes the training process implemented in the train.py script. The script takes several parameters (the SAS URI and the Form Recognizer endpoint and key) and saves its output in the pipeline data folder.

In [None]:
training_step = PythonScriptStep(
    name="training",
    script_name="train.py",
    inputs=[],
    outputs=[training_output],
    arguments=[
        "--sas_uri", sas_uri,
        "--output", training_output,
        "--fr_endpoint", fr_endpoint,
        "--fr_key", fr_key],
    compute_target=compute_target,
    source_directory=project_folder
)
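
The train.py script itself is not shown in this notebook. As a rough sketch of how a script with this interface could receive the four arguments and write its result into the pipeline data folder, consider the following. Only the argument names come from the step definition above. The helper names, the `model.json` file name, and the `modelId` field are assumptions, and the actual call to Form Recognizer is deliberately left out.

```python
import argparse
import json
import os
import tempfile


def parse_args(argv=None):
    """Parse the arguments that the training step passes to the script."""
    parser = argparse.ArgumentParser(description="Train a Form Recognizer model")
    parser.add_argument("--sas_uri", required=True)
    parser.add_argument("--output", required=True)
    parser.add_argument("--fr_endpoint", required=True)
    parser.add_argument("--fr_key", required=True)
    return parser.parse_args(argv)


def save_model_info(output_dir, model_info, file_name="model.json"):
    """Write the training result as JSON into the pipeline data folder."""
    # The folder may not exist yet on the compute node, so create it first.
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, file_name)
    with open(path, "w") as f:
        json.dump(model_info, f)
    return path


# Local dry run with placeholder values; no service is called here.
with tempfile.TemporaryDirectory() as tmp:
    args = parse_args([
        "--sas_uri", "https://example.invalid/forms?sig=placeholder",
        "--output", os.path.join(tmp, "training_output"),
        "--fr_endpoint", "https://example.invalid",
        "--fr_key", "not-a-real-key"])
    # A real script would train the model here and save the service response.
    saved = save_model_info(args.output, {"modelId": "placeholder-id"})
    print(os.path.basename(saved))
```

When the step runs in the pipeline, the value received as `--output` is the path that AML assigns to `training_output`, so anything written there becomes visible to the next step.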

The second step takes the output of the training step and registers it in the AML model store. We could have implemented both steps as a single step, but we wanted to show some aspects of AML: passing data between steps and building a multistep pipeline.

In [None]:
register_step = PythonScriptStep(
    name="registering",
    script_name="register.py",
    inputs=[training_output],
    outputs=[],
    arguments=["--input", training_output],
    compute_target=compute_target,
    source_directory=project_folder
)
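
The register.py script is not shown here either. The sketch below illustrates one way such a script could read the training output and register it. It assumes the training step wrote a `model.json` file containing a `modelId` field. That layout, the helper names, and the model name `form_recognizer_model` are all hypothetical. The `azureml` import sits inside the function so that the file-reading part can be tried without the SDK installed.

```python
import json
import os
import tempfile


def load_model_info(input_dir, file_name="model.json"):
    """Read the metadata file that the training step is assumed to have written."""
    with open(os.path.join(input_dir, file_name)) as f:
        return json.load(f)


def register_in_aml(input_dir, model_name="form_recognizer_model"):
    """Register the training output folder as a model in the AML workspace."""
    # Imported lazily; this function is only meant to run inside a pipeline step.
    from azureml.core import Run
    from azureml.core.model import Model

    # Inside a submitted run, the context gives access to the workspace.
    workspace = Run.get_context().experiment.workspace
    info = load_model_info(input_dir)
    return Model.register(
        workspace=workspace,
        model_path=input_dir,
        model_name=model_name,
        tags={"modelId": str(info.get("modelId", ""))})


# Local check of the file-reading part only, using a placeholder file.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "model.json"), "w") as f:
        json.dump({"modelId": "placeholder-id"}, f)
    print(load_model_info(tmp))
```

Storing the Form Recognizer model ID as a tag is one possible design. It would let a scoring pipeline look the ID up from the registered model without downloading any files.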

Finally, we can create a pipeline from the two steps above. We just need to put the steps in a list and use it to create a Pipeline object.

In [None]:
steps = [training_step, register_step]

In [None]:
pipeline = Pipeline(workspace=aml_workspace, steps=steps)

It's time to execute our pipeline. We create an Experiment and submit the pipeline to it, which starts an actual run. Then we wait for the run to complete.

In [None]:
pipeline_run = Experiment(aml_workspace, 'train_basic_exp').submit(pipeline)
pipeline_run.wait_for_completion()

If the run succeeds, we publish the pipeline so that it can be executed later from the Azure portal, the Python SDK, or the REST API.

In [None]:
pipeline.publish(
    name="basic_training",
    description="Training form recognizer based on one data folder only")
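
To illustrate the REST API option, the sketch below builds (but does not send) a POST request for a published pipeline. It assumes the endpoint URL is taken from the `endpoint` attribute of the object returned by `pipeline.publish`, and that the authorization header comes from an AML authentication object, for example `InteractiveLoginAuthentication().get_authentication_header()`. The URL and token shown are placeholders, and `build_pipeline_request` is a hypothetical helper.

```python
import json
import urllib.request


def build_pipeline_request(endpoint, auth_header, experiment_name):
    """Build a POST request that would trigger a published pipeline."""
    headers = dict(auth_header)
    headers["Content-Type"] = "application/json"
    # The request body names the experiment that the new run is recorded under.
    body = json.dumps({"ExperimentName": experiment_name}).encode("utf-8")
    return urllib.request.Request(
        endpoint, data=body, headers=headers, method="POST")


# Placeholder values for illustration; nothing is sent over the network.
request = build_pipeline_request(
    endpoint="https://example.invalid/pipelines/placeholder-id",
    auth_header={"Authorization": "Bearer <token>"},
    experiment_name="train_basic_exp")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would submit the run, given a real endpoint and a valid token.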