# Train YOLOv3 with Azure Machine Learning

## Introduction
In this tutorial, we walk through the process of training YOLOv3 on AzureML. For the purposes of this guide, we train YOLOv3 on the VOC dataset to detect cars, among other object classes.

## Requirements:
* If you are using an Azure Machine Learning Notebook VM, you can go to the next section. Otherwise, you need to:
    * install [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) in order to use az commands
    * install the AML SDK
    * create a workspace and its configuration file (`config.json`)
* You will also need to have `tensorflow` and `keras` installed in the current Jupyter kernel.
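
Before going further, it can help to confirm that the required packages are importable from the current kernel. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is an assumption made for illustration and is not part of any SDK.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this kernel."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages this tutorial expects in the current Jupyter kernel.
required = ["tensorflow", "keras"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

If anything is reported as missing, install it with `pip` in the same environment that backs the kernel.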

#### Install the necessary packages
If you haven't installed the AML SDK, run the following command in your Python environment:
```sh
pip install --upgrade azureml-sdk
```

In [None]:
import azureml
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

#### Create an AzureML Workspace
You can skip this section if you have already set up an AzureML Workspace. If you haven't, sign in with the command below, or sign in to the [Azure portal](https://portal.azure.com/), and find the value for the `<azure-subscription-id>` parameter in the subscriptions list.

In [None]:
# Log in to Azure with the CLI; this prints some information about your Azure account
!az login

In [None]:
# Enter the resource group in Azure where you want to provision the resources 
resource_group_name = ""

# Enter Azure region where your services will be provisioned, for example "northeurope"
azure_region = ""

# Provide the Azure subscription ID (found in the step above) used to provision your services
subscription_id = ""

# Provide your Azure ML service workspace name 
# If you don't have a workspace, pick a name to create a new one
aml_workspace_name = ""

In [None]:
ws = Workspace.create(subscription_id = subscription_id,
                resource_group = resource_group_name,
                name = aml_workspace_name,
                location = azure_region)

# Create the config.json file
ws.write_config()
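
For readers who already have a workspace and only need the configuration file, the sketch below writes a `config.json` by hand. It assumes the file holds the three keys `subscription_id`, `resource_group`, and `workspace_name`; the helper name `write_workspace_config` is hypothetical and the values you pass must be replaced with your own.

```python
import json
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json describing an existing workspace (assumed key names)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=4))
    return path
```

Calling it with the variables defined earlier, for example `write_workspace_config(".", subscription_id, resource_group_name, aml_workspace_name)`, should produce a file equivalent to the one `ws.write_config()` creates, although the exact location the SDK writes to may differ between versions.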

Let's import some Python packages.

In [None]:
%matplotlib inline
import numpy as np
import os
import matplotlib.pyplot as plt

## Initialize workspace
Initialize a [Workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace) object from the existing workspace you created in the previous steps. `Workspace.from_config()` creates a workspace object from `config.json`.

In [None]:
ws = Workspace.from_config()

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

## Create an Azure ML experiment
Let's create an experiment named ``yolov3``. The training scripts are expected in the `./script` folder, which the estimator later uses as its source directory.

In [None]:
from azureml.core import Experiment

exp = Experiment(workspace=ws, name='yolov3')

## Download VOC dataset
Let's download the dataset. The cell below creates a `VOCdevkit/` folder containing the images and their annotations (as `.xml` files). `!` commands run in the underlying system shell (assumed here to be a UNIX shell), so on Windows you might need to install `wget` first.

In [None]:
# on Unix OS
!wget https://pjreddie.com/media/files/VOCtrainval_06-Nov-2007.tar
!wget https://pjreddie.com/media/files/VOCtest_06-Nov-2007.tar
!tar xf VOCtrainval_06-Nov-2007.tar
!tar xf VOCtest_06-Nov-2007.tar

# on Windows, you need to download the two .tar and extract them in the same folder as this notebook
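
As a platform-independent alternative to `wget` and `tar`, the same download and extraction can be sketched with the Python standard library. The helper names `download` and `extract` are assumptions made for this example; only extract archives from sources you trust.

```python
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest_dir="."):
    """Download `url` into `dest_dir`, skipping the download if the file exists."""
    target = Path(dest_dir) / url.rsplit("/", 1)[-1]
    if not target.exists():
        urllib.request.urlretrieve(url, str(target))
    return target


def extract(archive, dest_dir="."):
    """Extract a .tar archive into `dest_dir`."""
    with tarfile.open(str(archive)) as tar:
        tar.extractall(path=str(dest_dir))
```

With these helpers, calling `extract(download(url))` for each of the two `.tar` URLs from the cell above should leave a `VOCdevkit/` folder next to the notebook, on Windows as well as on UNIX.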

## Download the pre-trained weights
We first fine-tune the model (transfer learning) and only later unfreeze all layers, so we need pre-trained weights to start from. Download the weights pre-trained on the COCO dataset, which are available on the Darknet site.

In [None]:
!wget https://pjreddie.com/media/files/yolov3.weights # Unix command

## Convert to adapted Keras weights file
Converting the Darknet weights to a Keras-compatible weights file takes quite a bit of time.

In [None]:
!python convert.py -w yolov3.cfg yolov3.weights model_data/yolo_weights.h5

## Upload the dataset to default datastore 
A [datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data) is a place where data can be stored and then made accessible to a Run, either by mounting it on the compute target or by copying the data to it. A datastore can be backed either by Azure Blob Storage or by an Azure File Share (ADLS will be supported in the future). For simple data handling, each Azure ML workspace provides a default datastore that can be used when the data is not already in Blob Storage or a File Share.

In [None]:
ds = ws.get_default_datastore()

In this next step, we upload the training set to the workspace's default datastore, which we will later mount on an `AmlCompute` cluster for training. We also upload the model folder containing the YOLOv3 model pre-trained on the COCO dataset. This step takes a while too.

In [None]:
ds.upload(src_dir='./VOCdevkit', target_path='VOCdevkit', overwrite=True, show_progress=True)
ds.upload(src_dir='./model_data', target_path='model_data', overwrite=True, show_progress=True)

## Create or Attach existing AmlCompute
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, you create `AmlCompute` as your training compute resource.

If no cluster with the given name is found, we create a new one here: an `AmlCompute` cluster of `STANDARD_NC6` GPU VMs. This process is broken down into 3 steps:
1. create the configuration (this step is local and only takes a second)
2. create the cluster (this step will take about **20 seconds**)
3. provision the VMs to bring the cluster to the initial size (of 1 in this case). This step will take about **3-5 minutes** and provides only sparse output in the process. Please make sure to wait until the call returns before moving to the next cell.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "yolo-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6', 
                                                           max_nodes=1)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

## Create TensorFlow estimator & add Keras
Next, we construct an `azureml.train.dnn.TensorFlow` estimator object, use the `yolo-cluster` as compute target, and pass the mount points of the datastore folders to the training code as parameters.
The TensorFlow estimator provides a simple way of launching a TensorFlow training job on a compute target. It automatically provides a Docker image that has TensorFlow installed. In this case, we add the `keras` package (for the Keras framework), the `pillow` package, and the `matplotlib` package for plotting a "Loss vs. Accuracy" chart and recording it in the run history.

`train.py` calls `voc_annotation.py` to create the .txt file containing the path to the images with its associated ground-truth bounding boxes.
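
To illustrate what such an annotation file can look like, here is a minimal sketch that turns one VOC `.xml` annotation into a single text line. It assumes the line format `image_path x_min,y_min,x_max,y_max,class_id`, a convention used by common Keras YOLOv3 implementations; the actual `voc_annotation.py` is not shown in this notebook and may differ. The class list and the sample annotation are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical subset of the VOC class names, for illustration only.
CLASSES = ["car", "person", "dog"]

# Hand-written sample annotation (not taken from the dataset).
SAMPLE_XML = """
<annotation>
  <object>
    <name>car</name><difficult>0</difficult>
    <bndbox><xmin>156</xmin><ymin>97</ymin><xmax>351</xmax><ymax>270</ymax></bndbox>
  </object>
  <object>
    <name>person</name><difficult>1</difficult>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>30</xmax><ymax>40</ymax></bndbox>
  </object>
</annotation>
"""


def annotation_line(image_path, xml_text, classes=CLASSES):
    """Build 'path x_min,y_min,x_max,y_max,class_id ...' from a VOC annotation."""
    root = ET.fromstring(xml_text)
    boxes = []
    for obj in root.iter("object"):
        name = obj.findtext("name")
        difficult = obj.findtext("difficult", default="0")
        # Skip classes we do not train on and objects flagged as difficult.
        if name not in classes or int(difficult) == 1:
            continue
        box = obj.find("bndbox")
        coords = [int(float(box.findtext(tag))) for tag in ("xmin", "ymin", "xmax", "ymax")]
        boxes.append(",".join(str(v) for v in coords + [classes.index(name)]))
    return " ".join([image_path] + boxes)


print(annotation_line("VOCdevkit/example.jpg", SAMPLE_XML))
```

Each image contributes one such line, so the resulting `.txt` file can be read line by line during training.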

In [None]:
from azureml.train.dnn import TensorFlow

est = TensorFlow(source_directory='./script',
                 script_params={'--data_folder': ds.path('VOCdevkit').as_mount(), '--model': ds.path('model_data').as_mount()},
                 compute_target=compute_target,
                 pip_packages=['keras', 'pillow', 'matplotlib'],
                 entry_script='train.py', 
                 use_gpu=True)
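
The keys of `script_params` reach the entry script as command-line arguments. The sketch below shows one way `train.py` might read them with `argparse`; the real `train.py` is not shown in this notebook, so treat the function as an assumption about its structure rather than its actual code.

```python
import argparse


def parse_args(argv=None):
    """Parse the two arguments passed through `script_params` (sketch)."""
    parser = argparse.ArgumentParser(description="YOLOv3 training arguments (sketch)")
    parser.add_argument("--data_folder", type=str, required=True,
                        help="mount point of the VOCdevkit folder")
    parser.add_argument("--model", type=str, required=True,
                        help="mount point of the model_data folder")
    return parser.parse_args(argv)
```

On the compute target, each argument would receive the path where the corresponding datastore folder is mounted, so the script can treat them like local directories.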

## Submit job to run
Submit the estimator to the Azure ML experiment to kick off the execution.

In [None]:
run = exp.submit(est)

### Monitor the Run
As the Run is executed, it will go through the following stages:
1. Preparing: A Docker image is created matching the Python environment specified by the TensorFlow estimator, and it is uploaded to the workspace's Azure Container Registry. This step happens only once for each Python environment; the container is then cached for subsequent runs. Creating and uploading the image takes about **5 minutes**. While the job is preparing, logs are streamed to the run history and can be viewed to monitor the progress of the image creation.

2. Scaling: If the compute needs to be scaled up (i.e. the AmlCompute cluster requires more nodes to execute the run than are currently available), the cluster will attempt to scale up in order to make the required number of nodes available. Scaling typically takes about **5 minutes**.

3. Running: All scripts in the script folder are uploaded to the compute target, datastores are mounted or copied, and the `entry_script` is executed. While the job is running, stdout and the `./logs` folder are streamed to the run history and can be viewed to monitor the progress of the run.

4. Post-Processing: The `./outputs` folder of the run is copied over to the run history.

There are multiple ways to check the progress of a running job. We can use a Jupyter notebook widget. 

**Note: The widget automatically updates every 10-15 seconds, always showing you the most up-to-date information about the run.**

In [None]:
# To monitor a previously submitted run instead of submitting a new one, uncomment the next lines
#listrun = exp.get_runs()
#run = next(listrun)

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show() # requires the azureml-sdk[notebooks] extra

We can also periodically check the status of the run object, and navigate to the Azure portal to monitor the run.

In [None]:
print(run.get_metrics())

In [None]:
run

In [None]:
run.wait_for_completion(show_output=True)
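
`wait_for_completion` blocks until the run ends. If you prefer to check the status at your own pace, a small polling loop is one option. The sketch below is written against any zero-argument callable that returns a status string, so it can be tried without a live run; the helper name and the set of terminal states are assumptions based on the statuses a run usually reports.

```python
import time

# Statuses assumed to mean the run has finished.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_done(get_status, poll_seconds=30, sleep=time.sleep, max_polls=None):
    """Call `get_status` repeatedly until it returns a terminal state.

    Returns the last status seen. `max_polls` optionally bounds the loop.
    """
    polls = 0
    while True:
        status = get_status()
        polls += 1
        print("Run status:", status)
        if status in TERMINAL_STATES:
            return status
        if max_polls is not None and polls >= max_polls:
            return status
        sleep(poll_seconds)
```

With a submitted run, this could be used as `poll_until_done(run.get_status)`.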

In [None]:
print(run.get_metrics())
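
The dictionary returned by `get_metrics()` can mix single values with lists of values for metrics that were logged several times. The following sketch reduces it to the most recent value per metric; the helper name and the metric names in the usage note are hypothetical, since the metrics actually available depend on what `train.py` logs.

```python
def latest_values(metrics):
    """Map each metric name to its most recent value."""
    latest = {}
    for name, value in metrics.items():
        if isinstance(value, list):
            if value:  # ignore metrics with no recorded values
                latest[name] = value[-1]
        else:
            latest[name] = value
    return latest
```

For instance, `latest_values(run.get_metrics())` would turn a list of per-epoch losses into the final loss only.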