# Azure Machine Learning - AutoML for Images Training Pipeline
This notebook demonstrates creation of an Azure ML pipeline designed to load image data from an AML-linked blob storage account, convert that data into a labeled dataset (.jsonl format) using blob tags retrieved via the Azure blob storage SDK, and submit an AutoML for Images run to train a new object detection model.
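
As a rough illustration of the labeled dataset format mentioned above, the sketch below builds one `.jsonl` record for an object detection image. It assumes the commonly documented AutoML for Images layout (`image_url`, `image_details`, and a `label` list with box corners normalized to the 0-1 range). The helper name `to_jsonl_record`, the datastore path, and the pixel-coordinate input are hypothetical, and how the blob tags in your storage account map to boxes is not shown in this notebook.

```python
import json


def to_jsonl_record(image_url, width, height, boxes, image_format="jpg"):
    """Build one JSONL record (hypothetical helper).

    `boxes` holds (label, x_min, y_min, x_max, y_max) tuples in pixels;
    the coordinates are normalized by the image width and height.
    """
    labels = []
    for label, x_min, y_min, x_max, y_max in boxes:
        labels.append({
            "label": label,
            "topX": x_min / width,
            "topY": y_min / height,
            "bottomX": x_max / width,
            "bottomY": y_max / height,
            "isCrowd": 0,
        })
    return {
        "image_url": image_url,
        "image_details": {"format": image_format, "width": width, "height": height},
        "label": labels,
    }


# Assumed example values: a 640x480 image with a single labeled box.
record = to_jsonl_record(
    "AmlDatastore://workspaceblobstore/images/example.jpg",
    640,
    480,
    [("can", 64, 48, 320, 240)],
)
# Each image becomes one line of the .jsonl file.
jsonl_line = json.dumps(record)
```

Writing one such line per image produces the file from which a labeled dataset can be registered; check the field names against the AutoML for Images schema for your SDK version before relying on them.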

### Import required packages

In [None]:
from azureml.core import Workspace, Experiment, Datastore, Environment, Dataset, Model
from azureml.core.compute import ComputeTarget, AmlCompute, DataFactoryCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE, DEFAULT_GPU_IMAGE
from azureml.pipeline.core import Pipeline, PipelineParameter, PipelineData
from azureml.pipeline.steps import PythonScriptStep
from azureml.data.output_dataset_config import OutputTabularDatasetConfig, OutputDatasetConfig, OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.data.data_reference import DataReference
from azureml.data.sql_data_reference import SqlDataReference
from azureml.pipeline.steps import DataTransferStep
from datetime import datetime

import warnings
warnings.filterwarnings('ignore')

### Connect to AML workspace

In [None]:
ws = Workspace.from_config()
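
`Workspace.from_config()` reads a `config.json` file, which it looks for in the current directory, in an `.azureml` subdirectory, or in a parent directory. As a minimal sketch, the helper below writes such a file using only the standard library. The function name `write_workspace_config` and the placeholder values are assumptions for illustration; in practice you can download the file from the Azure portal.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write .azureml/config.json under `directory` and return its path (hypothetical helper)."""
    config_dir = Path(directory) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=2))
    return config_path


# Demonstration in a temporary directory with placeholder values.
with tempfile.TemporaryDirectory() as tmp:
    path = write_workspace_config(
        tmp, "<subscription-id>", "<resource-group>", "<workspace-name>"
    )
    loaded = json.loads(path.read_text())
```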

### Create and connect to ML training cluster
Get a reference to your existing AML compute cluster, or provision a new cluster if one with the given name is not found. The default blob datastore is retrieved in the run configuration cell further down. Note: for object detection model training, we use GPU compute for the training cluster. The default cluster definition below is configured to spin down individual nodes after 3 minutes of inactivity, which helps reduce training costs.

In [None]:
cluster_name = "gpu-cluster"

try:
    # Reuse the cluster if it already exists in the workspace.
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print("Found existing compute target.")
except ComputeTargetException:
    # The cluster was not found, so provision a new one.
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="Standard_NC6",  # GPU VM size
        idle_seconds_before_scaledown=180,  # scale idle nodes down after 3 minutes
        min_nodes=0,
        max_nodes=5,
    )
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20
    )

### Create Run Configuration
The `RunConfiguration` defines the environment used across all Python steps. You can optionally add further conda or pip packages to this environment. [More details here](https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.conda_dependencies.condadependencies?view=azure-ml-py).

Here, we also register the environment to the AML workspace so that it can be used for future retraining and inferencing operations.

In [None]:
default_ds = ws.get_default_datastore()

run_config = RunConfiguration()
run_config.environment.docker.base_image = DEFAULT_GPU_IMAGE
run_config.environment.python.conda_dependencies = CondaDependencies.create()
run_config.environment.python.conda_dependencies.add_conda_package("numpy==1.18.5")
run_config.environment.python.conda_dependencies.add_conda_package("libffi=3.3")
run_config.environment.python.conda_dependencies.set_pip_requirements([
    "azureml-core==1.37.0",
    "azureml-mlflow==1.37.0",
    "azureml-dataset-runtime==1.37.0",
    "azureml-telemetry==1.37.0",
    "azureml-responsibleai==1.37.0",
    "azureml-automl-core==1.37.0",
    "azureml-automl-runtime==1.37.0",
    "azureml-train-automl-client==1.37.0",
    "azureml-defaults==1.37.0",
    "azureml-interpret==1.37.0",
    "azureml-train-automl-runtime==1.37.0",
    "azureml-automl-dnn-vision==1.37.0",
    "azureml-dataprep>=2.24.4"
])
run_config.environment.python.conda_dependencies.set_python_version('3.7')
run_config.environment.name = "AutoMLForImagesEnv"
run_config.environment.register(ws)
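
The `azureml-*` packages above are all pinned to the same release, apart from `azureml-dataprep`, which uses a minimum version. If you add or change packages, a quick check like the sketch below can flag exact pins that have drifted apart. The helper `find_mismatched_pins` is hypothetical and only inspects `name==version` strings; it is not part of the Azure ML SDK, and whether mixed versions actually conflict depends on the packages involved.

```python
import re


def find_mismatched_pins(requirements, prefix="azureml-"):
    """Return {version: [packages]} when exact pins under `prefix` use more than one version.

    Requirements that are not exact `name==version` pins (for example `>=`) are skipped.
    Returns an empty dict when all exact pins agree.
    """
    versions = {}
    for requirement in requirements:
        match = re.fullmatch(r"([A-Za-z0-9_.-]+)==([A-Za-z0-9_.]+)", requirement.strip())
        if match and match.group(1).startswith(prefix):
            versions.setdefault(match.group(2), []).append(match.group(1))
    return versions if len(versions) > 1 else {}


# A shortened copy of the pin list used in this notebook.
pins = [
    "azureml-core==1.37.0",
    "azureml-automl-dnn-vision==1.37.0",
    "azureml-dataprep>=2.24.4",
]
consistent = find_mismatched_pins(pins)
# Adding a package at a different release is reported.
drifted = find_mismatched_pins(pins + ["azureml-mlflow==1.38.0"])
```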

### Define Pipeline Parameters
`PipelineParameter` objects serve as variable inputs to an Azure ML pipeline and can be specified at runtime. Below we define the following parameters for our Azure ML Pipeline:

| Parameter Name | Parameter Description |
|----------------|-----------------------|
| `model_name` | Name of the custom object detection model to be trained (used for model registration). |
| `dataset_name` | The name of the dataset to be created (using images from the attached datastore) upon execution of the pipeline. |
| `compute_name` | Name of the compute cluster to be used for model training. | 

In [None]:
model_name = PipelineParameter(name='model_name', default_value='Model_Name')
dataset_name = PipelineParameter(name='dataset_name', default_value='Dataset_Name')
compute_name = PipelineParameter(name='compute_name', default_value=cluster_name)

### Define Pipeline Steps
The pipeline below consists of a single step that executes an associated Python script located in the `./pipeline_step_scripts` directory. In this step we call the script at `./pipeline_step_scripts/automl_job.py`, which retrieves an Image Dataset from the AML workspace (referenced by the `dataset_name` parameter) and triggers execution of an AutoML for Images training job. Upon completion of this job, the trained model is automatically registered in the AML workspace under the name provided in the `model_name` parameter.

<i>Note:</i> The AutoML configuration settings can be modified inline by editing the `automl_job.py` file. Additionally, certain fields can be added as `PipelineParameters` and passed into the executed python script step. Finally, advanced logic to perform A/B testing against newly trained models and historical best-performers can be integrated into this step (or a secondary step) to ensure the registered model is always the best performer.

In [None]:
submit_job_step = PythonScriptStep(
    name='Submit AutoML for Images Training Job',
    script_name='automl_job.py',
    arguments=[
        '--model_name', model_name,
        '--dataset_name', dataset_name,
        '--compute_name', compute_name,
    ],
    compute_target=compute_target,
    source_directory='./pipeline_step_scripts',
    allow_reuse=False,
    runconfig=run_config
)
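
The `arguments` list above is passed to `automl_job.py` as ordinary command-line flags. The script itself is not shown in this notebook, so the following is only a sketch of how its argument handling might look, assuming it uses `argparse` with flag names matching the step definition. The AutoML configuration, job submission, and model registration that follow in the real script are omitted.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags passed by the PythonScriptStep (assumed script layout)."""
    parser = argparse.ArgumentParser(description="Submit an AutoML for Images training job")
    parser.add_argument("--model_name", type=str, required=True,
                        help="Name under which the trained model is registered")
    parser.add_argument("--dataset_name", type=str, required=True,
                        help="Name of the labeled image dataset in the workspace")
    parser.add_argument("--compute_name", type=str, required=True,
                        help="Name of the compute cluster used for training")
    return parser.parse_args(argv)


# Local dry run with the same flags the pipeline step would pass.
args = parse_args([
    "--model_name", "MY_MODEL",
    "--dataset_name", "MY_DATASET",
    "--compute_name", "gpu-cluster",
])
```

At pipeline runtime each `PipelineParameter` is resolved to its string value before the script starts, so the script only ever sees plain strings.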

### Create Pipeline
Create an Azure ML Pipeline by specifying the steps to be executed. Note: based on the dataset dependencies between steps, execution is ordered so that no step runs until all of its input datasets have been generated.

In [None]:
pipeline = Pipeline(workspace=ws, steps=[submit_job_step])

### Trigger a Pipeline Execution from the Notebook
You can create an Experiment (a logical collection of runs) and submit a pipeline run directly from this notebook by running the commands below.

In [None]:
experiment = Experiment(ws, 'MY_EXPERIMENT')
run = experiment.submit(
    pipeline,
    pipeline_parameters={"model_name": "MY_MODEL", "dataset_name": "MY_DATASET"},
)
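
Once submitted, the run can be followed from the notebook. The sketch below wraps the `wait_for_completion`, `get_status`, and `get_portal_url` methods of the returned run object in a small helper. The helper name `wait_and_summarize` and the `_StubRun` class are hypothetical; the stub exists only so the example can be dry-run without a workspace, and in the notebook you would pass the `run` object created above.

```python
def wait_and_summarize(pipeline_run):
    """Block until the run ends, then return its status and portal link."""
    pipeline_run.wait_for_completion(show_output=True)
    return {
        "status": pipeline_run.get_status(),
        "portal_url": pipeline_run.get_portal_url(),
    }


class _StubRun:
    """Stand-in for a pipeline run, used only for a local dry run (hypothetical)."""

    def __init__(self):
        self.waited = False

    def wait_for_completion(self, show_output=True):
        self.waited = True

    def get_status(self):
        return "Finished" if self.waited else "Running"

    def get_portal_url(self):
        return "https://example.invalid/run"


stub = _StubRun()
summary = wait_and_summarize(stub)
```

In the notebook, `wait_and_summarize(run)` would stream the step logs and return once the training job has finished or failed.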