# Azure AML - Streamlining Training

### Prerequisites

Understand the architecture and terms introduced by Azure Machine Learning (AML).

Install the Python SDK, making sure to include the `notebooks` and `contrib` extras:

```
conda create -n azureml -y python=3.6
source activate azureml
pip install --upgrade "azureml-sdk[notebooks,contrib]"
conda install ipywidgets
jupyter nbextension install --py --user azureml.widgets
jupyter nbextension enable azureml.widgets --user --py
```

You will need to restart Jupyter after this. Detailed instructions are [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python/?WT.mc_id=bert-notebook-abornst).

If you need a free trial account to get started, you can get one [here](https://azure.microsoft.com/en-us/offers/ms-azr-0044p/?WT.mc_id=bert-notebook-abornst).

## Initialize workspace

To create or access an Azure ML Workspace, import the AML library and have the following information ready:
* A name for your workspace
* Your subscription id
* The resource group name

Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace/?WT.mc_id=bert-notebook-abornst) object from the existing workspace you created in the Prerequisites step or create a new one. 

In [None]:
from azureml.core import Workspace

# First run only: fill in your details and uncomment these lines to write the local config file.
# subscription_id = ''
# resource_group  = ''
# workspace_name  = ''
# ws = Workspace(subscription_id=subscription_id, resource_group=resource_group, workspace_name=workspace_name)
# ws.write_config()

try:
    # Load the workspace from the saved config file
    ws = Workspace.from_config()
    print(ws.name, ws.location, ws.resource_group, sep='\t')
    print('Library configuration succeeded')
except Exception:
    print('Workspace not found')
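
`Workspace.from_config()` reads the workspace details from a small JSON file (`config.json`, typically inside a `.azureml` folder), which is what `ws.write_config()` produces. As a minimal sketch of what that file contains, the snippet below writes one by hand using only the standard library. The `write_workspace_config` helper and the placeholder values are assumptions for illustration and are not part of the SDK.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(root, subscription_id, resource_group, workspace_name):
    """Write <root>/.azureml/config.json holding the three workspace identifiers."""
    config_dir = Path(root) / ".azureml"
    config_dir.mkdir(parents=True, exist_ok=True)
    config_path = config_dir / "config.json"
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Placeholder values (hypothetical); replace them with your own details.
demo_root = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_root, "<subscription-id>", "<resource-group>", "<workspace-name>"
)
print(config_path.read_text())
```

Keeping this file out of version control is usually a good idea, since it identifies your subscription.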

## Compute

### Create a GPU remote compute target
We need to create a GPU [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to perform the fine-tuning. In this example, we create an AmlCompute cluster as our training compute resource. The table below lists the available Azure GPU VM sizes.

|       VM Size      | CPU |    GPU   | Storage (SSD) | GPU memory | InfiniBand |
|:------------------:|:---:|:--------:|:-------------:|:----------:|:----------:|
|    Standard_NC6    |  6  |  1 x K80 |    340 GiB    |    8 GiB   |     No     |
|    Standard_NC12   |  12 |  2 x K80 |    680 GiB    |   16 GiB   |     No     |
|    Standard_NC24   |  24 |  4 x K80 |    1440 GiB   |   32 GiB   |     No     |
|   Standard_NC24r   |  24 |  4 x K80 |    1440 GiB   |   32 GiB   |     Yes    |
|  Standard_NC6s_v3  |  6  | 1 x V100 |    736 GiB    |   16 GiB   |     No     |
|  Standard_NC12s_v3 |  12 | 2 x V100 |    1474 GiB   |   32 GiB   |     No     |
|  Standard_NC24s_v3 |  24 | 4 x V100 |    2948 GiB   |   64 GiB   |     No     |
| Standard_NC24rs_v3 |  24 | 4 x V100 |    2948 GiB   |   64 GiB   |     Yes    |
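
Choosing a VM size from the table comes down to matching GPU count, GPU type, and InfiniBand support. The sketch below transcribes those three columns from the table and returns the first row that satisfies a request. The `pick_vm_size` helper is hypothetical and only illustrates the lookup.

```python
# Rows transcribed from the VM size table: (name, GPU count, GPU type, InfiniBand)
VM_SIZES = [
    ("Standard_NC6", 1, "K80", False),
    ("Standard_NC12", 2, "K80", False),
    ("Standard_NC24", 4, "K80", False),
    ("Standard_NC24r", 4, "K80", True),
    ("Standard_NC6s_v3", 1, "V100", False),
    ("Standard_NC12s_v3", 2, "V100", False),
    ("Standard_NC24s_v3", 4, "V100", False),
    ("Standard_NC24rs_v3", 4, "V100", True),
]


def pick_vm_size(min_gpus, gpu_type, infiniband=False):
    """Return the first table row that satisfies the request, or None if nothing fits."""
    for name, gpus, kind, has_infiniband in VM_SIZES:
        if gpus >= min_gpus and kind == gpu_type and has_infiniband == infiniband:
            return name
    return None


print(pick_vm_size(2, "V100"))
print(pick_vm_size(4, "K80", infiniband=True))
```

The returned name can be passed as `vm_size` when provisioning the cluster.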

***Note that*** you need to request NCv3-series quota if you would like to use NVIDIA Tesla V100 GPUs.

This code creates a cluster for you if it does not already exist in your workspace.

__One Time Creation__


    # Choose a name for your GPU cluster
    cluster_name = "gpucluster"

    # Verify that cluster does not exist already
    try:
        gpu_cluster = ComputeTarget(workspace = ws, name = cluster_name)
    except ComputeTargetException:
        compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', min_nodes=0, max_nodes=1)
        gpu_cluster = ComputeTarget.create(ws, cluster_name, compute_config)

    gpu_cluster.wait_for_completion(show_output=True)

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your GPU cluster
cluster_name = "cluster"

# Verify that cluster does not exist already
try:
    cluster = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6',
                                                           min_nodes=1,
                                                           max_nodes=4)
    cluster = ComputeTarget.create(ws, cluster_name, compute_config)

cluster.wait_for_completion(show_output=True)

## Upload Data



In [None]:
from azureml.core import Datastore

# Register an Azure file share as a datastore in the workspace
file_datastore = Datastore.register_azure_file_share(workspace=ws,
                                                     datastore_name='<<SECRET>>',
                                                     file_share_name="<<SECRET>>",
                                                     account_name="<<SECRET>>",
                                                     account_key='<<SECRET>>')

# Get the named datastore from the current workspace
datastore = Datastore.get(ws, datastore_name='project_zero')
print(file_datastore)
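
The registration call above needs a storage account key, which is best kept out of the notebook itself. One option, sketched here under assumptions, is to read the four arguments from environment variables. The variable names (`AML_DATASTORE_NAME` and so on) and the `file_share_settings` helper are hypothetical.

```python
import os


def file_share_settings(env=os.environ):
    """Collect file-share registration arguments from environment variables.

    The variable names are hypothetical; pick whatever fits your setup.
    """
    required = {
        "datastore_name": "AML_DATASTORE_NAME",
        "file_share_name": "AML_FILE_SHARE_NAME",
        "account_name": "AML_STORAGE_ACCOUNT",
        "account_key": "AML_STORAGE_KEY",
    }
    missing = [var for var in required.values() if not env.get(var)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {arg: env[var] for arg, var in required.items()}


# Demo with a fake environment so the sketch runs without real credentials.
fake_env = {
    "AML_DATASTORE_NAME": "demo_store",
    "AML_FILE_SHARE_NAME": "demo-share",
    "AML_STORAGE_ACCOUNT": "demoaccount",
    "AML_STORAGE_KEY": "not-a-real-key",
}
settings = file_share_settings(fake_env)
print(sorted(settings))
```

With real values in the environment, the result could be passed along as `Datastore.register_azure_file_share(workspace=ws, **settings)`.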

## AML Training Environment Setup

In [None]:

from azureml.core.runconfig import RunConfiguration, DataReferenceConfiguration

proj_root = "project_zero/ds1"

# Note: path_on_compute (e.g. '/datastore') doesn't work with mount, so it is left unset
dr = DataReferenceConfiguration(datastore_name=datastore.name,
                                path_on_datastore=proj_root,
                                overwrite=True)

In [None]:
from azureml.core.container_registry import ContainerRegistry

container_registry = ContainerRegistry()
container_registry.address = '>>> SECRET <<<'
container_registry.username = '>>> SECRET <<<'
container_registry.password = '>>> SECRET <<<'

In [None]:
from azureml.core.conda_dependencies import CondaDependencies

run_config = RunConfiguration()
run_config.environment.docker.enabled = True
#run_config.environment.docker.base_image='computervisi6b3936b2.azurecr.io/dtron2:3'
run_config.environment.docker.base_image='yolact:1'
run_config.environment.docker.base_image_registry=container_registry


run_config.data_references = {datastore.name: dr}

# GPU support: Azure automatically detects and uses the NVIDIA Docker extension when it is available.
run_config.environment.python.user_managed_dependencies=True  # use your own installed packages instead of an AML created Conda env

run_config.target = cluster  # specify the compute target created above; without it the run fails with an obscure "`docker image` cannot run" error

## Create Azure Machine Learning experiment

In [None]:
from azureml.core import Experiment

experiment_name = 'zerowaste'
experiment = Experiment(workspace=ws, name=experiment_name)

In [None]:
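import os

# Mount the project folder of the datastore and derive the input paths from it
base_mount = datastore.path(proj_root).as_mount()
img_path = os.path.join(str(base_mount), 'parsed')
masks_path = os.path.join(str(base_mount), 'experiments/dataset_config')

# Update me: LOGS_AND_MODEL_PATH must be set for this experiment before running this cell
output_path = os.path.join(str(base_mount), LOGS_AND_MODEL_PATH)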

In [None]:
from azureml.train.estimator import Estimator

script_params = {
    '--data-folder': base_mount,
    '--img-folder': img_path,
    '--masks-folder': masks_path,
    '--output-folder': output_path,
    # '--config-file': 'configs/R_50_1x.yaml',
    # '--config-file': 'yolact_resnet50_zwds1_3label_config',
    '--config-file': 'yolact_plus_resnet50_zw1_3c_mnt_x1_config',
}

est = Estimator(source_directory="../../../yolact_alm/",
                script_params=script_params,
                compute_target=cluster,  # the cluster created in the Compute section
                entry_script="yolact_alm_runner.py",
                use_docker=True,
                # use_gpu=True,
                image_registry_details=container_registry,
                user_managed=True,
                custom_docker_image='yolact:1',  # notice this is the short name, different from ScriptRun
                inputs=[base_mount])  # tell the system to mount; if the script params contain ds.mount(), it will mount without this
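
Each key in `script_params` reaches the entry script as a command-line flag. The contents of `yolact_alm_runner.py` are not shown here, so the following is only a sketch of how such a script might read those flags with `argparse`. The `parse_runner_args` helper and the example paths are assumptions.

```python
import argparse


def parse_runner_args(argv=None):
    """Parse the flags that the script_params dictionary passes to the entry script."""
    parser = argparse.ArgumentParser(description="Hypothetical training entry point")
    parser.add_argument("--data-folder", required=True, help="root of the mounted datastore")
    parser.add_argument("--img-folder", required=True, help="folder with the input images")
    parser.add_argument("--masks-folder", required=True, help="folder with the dataset config")
    parser.add_argument("--output-folder", required=True, help="where logs and models are written")
    parser.add_argument("--config-file", required=True, help="name of the training config")
    return parser.parse_args(argv)


# Example invocation with made-up paths; on the cluster the values come from AML.
args = parse_runner_args([
    "--data-folder", "/mnt/data",
    "--img-folder", "/mnt/data/parsed",
    "--masks-folder", "/mnt/data/experiments/dataset_config",
    "--output-folder", "/mnt/data/outputs",
    "--config-file", "yolact_plus_resnet50_zw1_3c_mnt_x1_config",
])
print(args.img_folder, args.config_file)
```

Note that `argparse` turns the dashes in each flag into underscores, so `--img-folder` becomes `args.img_folder`.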


In [None]:
run = experiment.submit(est)
print(f"Run Details: {run.get_details()['runId']}")

print(f'Run Status: {run.get_status()}')

In [None]:
run.get_properties()

In [None]:
run.get_status()
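
Rather than re-running the status cell by hand, the status can be polled until the run reaches a terminal state. The helper below is a minimal sketch that takes any zero-argument callable, so it can be tried offline. The `wait_for_terminal_state` name and its defaults are assumptions, and the demo uses a stand-in for `run.get_status`.

```python
import time

# Terminal states reported by run.get_status()
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_state(get_status, poll_seconds=30, max_polls=1000):
    """Call get_status() repeatedly until it returns a terminal state."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"No terminal state after {max_polls} polls")


# Demo with a stand-in for run.get_status so the sketch runs offline.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final = wait_for_terminal_state(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

Against a real run this would be called as `wait_for_terminal_state(run.get_status)`. The SDK's own `run.wait_for_completion(show_output=True)` blocks in a similar way and also streams the logs.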

### Monitor Run

The table below shows the elapsed time per epoch on different Azure GPU VMs with the above hyperparameters.

|     GPU counts |    1 GPU    |       2 GPU |      4 GPU |
|---------------:|:-----------:|------------:|-----------:|
|      NC-series | 191 s/epoch | 105 s/epoch | 60 s/epoch |
|    NCv3-series |  36 s/epoch |  22 s/epoch | 13 s/epoch |
| NCv3 with fp16 |  32 s/epoch |  18 s/epoch | 12 s/epoch |
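
To make the scaling behaviour in the table easier to read, the per-epoch times can be converted into speedup (single-GPU time divided by multi-GPU time) and scaling efficiency (speedup divided by GPU count). The sketch below does this arithmetic on the values transcribed from the table; the `scaling` helper is introduced here for illustration.

```python
# Seconds per epoch, transcribed from the table above: {series: {GPU count: seconds}}
EPOCH_SECONDS = {
    "NC-series": {1: 191, 2: 105, 4: 60},
    "NCv3-series": {1: 36, 2: 22, 4: 13},
    "NCv3 with fp16": {1: 32, 2: 18, 4: 12},
}


def scaling(times):
    """Map GPU count to (speedup, efficiency) relative to the single-GPU time."""
    base = times[1]
    return {n: (base / t, base / t / n) for n, t in times.items()}


for series, times in EPOCH_SECONDS.items():
    for n_gpus, (speedup, efficiency) in scaling(times).items():
        print(f"{series:>15} {n_gpus} GPU: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

An efficiency well below 100% at 4 GPUs suggests that communication or data loading, rather than raw compute, is limiting the per-epoch time.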


## Fine-Tuning with Hyperparameter Tuning

TODO

### Additional Resources
- https://nbviewer.jupyter.org/github/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/Pretrained-BERT-GLUE.ipynb
- https://nbviewer.jupyter.org/github/microsoft/AzureML-BERT/blob/master/finetune/PyTorch/notebooks/Pretrained-BERT-NER.ipynb