# Train HydraNet for the WikiSQL benchmark

## Prerequisites
- Understand the [architecture and terms](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture) introduced by Azure Machine Learning (AML)
- Install the Python SDK, making sure to include the `notebooks` and `contrib` extras:
    ```
    conda create -n azureml -y python=3.6
    source activate azureml
    pip install --upgrade azureml-sdk[notebooks,contrib]
    conda install ipywidgets
    jupyter nbextension install --py --user azureml.widgets
    jupyter nbextension enable azureml.widgets --user --py
    ```
 
You will need to restart Jupyter after this.
Detailed instructions are here: https://docs.microsoft.com/en-us/azure/machine-learning/service/quickstart-create-workspace-with-python

## Import AzureML SDK and initialize workspace

To create or access an Azure ML workspace, import the AML library and provide the following information:
- A name for your workspace
- Your subscription id
- The resource group name

In [1]:
import json

# Import AzureML libraries
import azureml.core
from azureml.core import (ContainerRegistry, Dataset, Datastore, Environment,
                          Experiment, Run, RunConfiguration, Workspace)
from azureml.core.compute_target import ComputeTargetException
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import MpiConfiguration

from azureml.train.estimator import Estimator
from azureml.train.dnn import PyTorch
from azureml.data.datapath import DataPath, DataPathComputeBinding
from azureml.data.data_reference import DataReference

from azureml.widgets import RunDetails

In [2]:
azureml.core.VERSION

'1.11.0'

In [3]:
with open('credentials.json', 'r') as config_file:
    credentials = json.load(config_file)
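
A later cell reads `credentials["registry_password"]`, so it can help to fail early when that key is absent. The following is a minimal sketch, assuming `credentials.json` is a flat JSON object; the helper name `load_credentials` is hypothetical and not part of the AzureML SDK.

```python
import json


def load_credentials(path, required_keys=("registry_password",)):
    """Load a JSON credentials file and check that the expected keys exist."""
    with open(path, "r") as config_file:
        credentials = json.load(config_file)
    # Treat empty values the same as missing ones.
    missing = [key for key in required_keys if not credentials.get(key)]
    if missing:
        raise KeyError(f"{path} is missing required keys: {missing}")
    return credentials
```

Calling `load_credentials('credentials.json')` in place of the plain `json.load` would surface a misconfigured file before any job is submitted.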

In [4]:
# Retrieve your workspace
ws = Workspace.get(name="xiaoyzhu-turingrg",
                   subscription_id='a6c2a7cc-d67e-4a1a-b765-983f08c0423a',
                   resource_group='xiaoyzhu-turingrg')
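
As an alternative to passing the three identifiers inline, the SDK can read them from a `config.json` file via `Workspace.from_config()`. The sketch below only writes such a file with the standard library; the helper name `write_workspace_config` is hypothetical, and the values in the commented usage lines are placeholders, not real identifiers.

```python
import json
import os


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write .azureml/config.json in the layout Workspace.from_config() reads."""
    config_dir = os.path.join(directory, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(config_path, "w") as config_file:
        json.dump(config, config_file, indent=4)
    return config_path


# Placeholder values for illustration only; substitute your own identifiers.
# write_workspace_config(".", "<subscription-id>", "<resource-group>", "<workspace-name>")
# ws = Workspace.from_config()  # looks for config.json in ./.azureml among other locations
```

Keeping the identifiers in a file that is excluded from version control avoids committing subscription details alongside the notebook.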

### Configure Data Stores
An AML workspace comes with a default datastore. The training, validation, and testing data are hosted <add description>

In [5]:
# Create the compute cluster
gpu_cluster_name = "nd40-ssh-2" 

# Verify that the cluster doesn't exist already
try:
    gpu_compute_target = ComputeTarget(workspace=ws, name=gpu_cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_ND40rs_v2', min_nodes=0, max_nodes=1)
    
    # create the cluster
    gpu_compute_target = ComputeTarget.create(ws, gpu_cluster_name, compute_config)
    gpu_compute_target.wait_for_completion(show_output=True)

Found existing compute target.


In [6]:
default_ds = ws.get_default_datastore()

# Print the datastore attributes
print('Datastore name: ' + default_ds.name, 
      'Container name: ' + default_ds.container_name, 
      'Datastore type: ' + default_ds.datastore_type, 
      'Workspace name: ' + default_ds.workspace.name, sep = '\n')

Datastore name: workspaceblobstore
Container name: azureml-blobstore-1df38ad3-d561-413a-b8c5-603f11808774
Datastore type: AzureBlob
Workspace name: xiaoyzhu-turingrg


## Create custom environment based on Docker image 

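We use a Docker image built from the [Dockerfile](../dockerfile). Azure ML lets users run Python scripts in the Python environment of a custom Docker image. Setting `user_managed_dependencies = True` tells Azure ML to use the Python environment already present in the image instead of building one.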


In [7]:
myenv = Environment(name="myenv")
myenv.docker.enabled = True
myenv.docker.base_image = "hydranet:v1"
myenv.docker.base_image_registry.address = "marinchwus2a448b3c27.azurecr.io"
myenv.docker.base_image_registry.username = "marinchwus2a448b3c27"
myenv.docker.base_image_registry.password = credentials["registry_password"]
myenv.python.user_managed_dependencies = True


## Run fine-tuning PyTorch script
Let us create a new PyTorch estimator to run the training script `notebooks/prep_and_train.py`.
 
Azure ML provides an MPI backend to launch Python scripts in a process group. The helper module `azureml_adapter.py` sets up the environment variables PyTorch needs, so that you can call `torch.distributed.init_process_group(backend="nccl")` to initialize a PyTorch distributed process group with the NCCL backend.
 
 
PyTorch estimator parameters:
- source_directory: the directory containing all source code to be executed on the remote compute target
- compute_target: the compute target (created in the preparation step)
- entry_script: the script to be launched
- node_count: the number of nodes to allocate within the compute target for this job
- distributed_training: we use MPI to launch multiple processes on each node, which allows PyTorch to initialize the process group using the NCCL backend. Note that the MPI configuration also sets the number of processes per node, which should equal the number of GPUs on each node.
- environment_definition: the custom environment to use. If omitted, Azure ML provides standard Docker images for different PyTorch versions.
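
The contents of `azureml_adapter.py` are not shown in this notebook. The following is only a sketch of the kind of mapping such a helper performs, assuming the processes are launched by Open MPI, which exposes each process's rank through `OMPI_COMM_WORLD_*` variables. The function name `torch_env_from_mpi` and the default master address and port are assumptions for illustration.

```python
import os


def torch_env_from_mpi(environ, master_addr="127.0.0.1", master_port=29500):
    """Translate Open MPI rank variables into the ones torch.distributed reads.

    master_addr and master_port are assumed defaults for a single-node job;
    a multi-node job needs the address of the node hosting rank 0 instead.
    """
    return {
        "RANK": environ["OMPI_COMM_WORLD_RANK"],
        "WORLD_SIZE": environ["OMPI_COMM_WORLD_SIZE"],
        # LOCAL_RANK is a convention for picking the GPU on the current node;
        # init_process_group itself does not read it.
        "LOCAL_RANK": environ["OMPI_COMM_WORLD_LOCAL_RANK"],
        "MASTER_ADDR": environ.get("MASTER_ADDR", master_addr),
        "MASTER_PORT": environ.get("MASTER_PORT", str(master_port)),
    }


# Inside a process launched by mpirun, one would typically do:
# os.environ.update(torch_env_from_mpi(os.environ))
# torch.distributed.init_process_group(backend="nccl")  # uses env:// by default
```

With 8 processes per node on one node, `WORLD_SIZE` is 8 and each process sees a distinct `RANK` from 0 to 7.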
 
We explain a few script arguments; the rest are self-explanatory. In the cell below, most of them are commented out and only `--gpu` is passed.
- model_name_or_path: The location of the pretrained TuringV3 model
- task_name: The GLUE task name. We use sst-2 here.
- data_dir: The input of the training data from GLUE 
- output_dir: The output path of the fine-tuned model
- per_gpu_train_batch_size: The batch size for each GPU in forward and backward steps
- num_train_epochs: The number of epochs to train
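
To make the relationship between a `script_params`-style dictionary and the entry script concrete, here is a minimal sketch of one way such a dictionary can map to command-line arguments and be parsed on the other side. It is not the SDK's implementation, and the parser is hypothetical: the actual argument definitions in `notebooks/prep_and_train.py` are not shown in this notebook.

```python
import argparse


def to_argv(script_params):
    """Flatten a {flag: value} dict into a command-line argument list."""
    argv = []
    for flag, value in script_params.items():
        argv.append(flag)
        if value != "":  # here an empty string marks a value-less switch
            argv.append(str(value))
    return argv


def parse_args(argv):
    """Hypothetical parser; the real entry script may define arguments differently."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--gpu", default="0", help="comma-separated GPU ids")
    parser.add_argument("--do_train", action="store_true")
    args = parser.parse_args(argv)
    args.gpu_ids = [int(i) for i in args.gpu.split(",")]
    return args


args = parse_args(to_argv({"--gpu": "0,1,2,3,4,5,6,7", "--do_train": ""}))
print(len(args.gpu_ids))  # number of GPUs requested
```

The length of `gpu_ids` is a convenient value to compare against the MPI process count per node, since the two should match.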


In [18]:
# workdir = "/turing/workdir"

# mpi = MpiConfiguration()
# mpi.process_count_per_node = 8 

source_directory = '..'
# model_checkpoint_path = "NLRv3-Base-Uncased/tnlrv3-base.pt"
# dataset_path = 'glue/SST-2/'
dataset_path = 'marinch/hydranet/'
# output_path = 'outputs-marinch/hydranet-outputs/'

train_est = PyTorch(source_directory=source_directory,
                    compute_target=gpu_compute_target,
                    script_params = {
#                         '--model_name_or_path': ds.path(model_checkpoint_path).as_mount(),
#                         '--task_name': 'sst-2',
#                         '--tokenizer_name': './tnlrv3/tokenizer/tnlrv3-base-uncased-vocab.txt',
#                         '--config_name': './tnlrv3/config/tnlrv3-base-uncased-config.json',
#                         '--do_train': '',
#                         '--do_lower_case':'', 
#                         '--evaluate_during_training':'',
#                         '--data_dir':default_ds.path(dataset_path).as_mount(),
#                         '--output_dir': default_ds.path(output_path).as_mount(),
                        '--gpu': '0,1,2,3,4,5,6,7',
#                         '--max_seq_length': 128, 
#                         '--per_gpu_train_batch_size': 32, 
#                         '--learning_rate': 7e-6,
#                         '--num_train_epochs':2.0, # 25.0,
#                         '--weight_decay':0.01, 
#                         '--warmup_ratio':0.2,
#                         '--fp16':'',
#                         '--fp16_opt_level': 'O2',
#                         '--overwrite_output_dir':'',
#                         '--do_eval':''
                    },
                    entry_script='notebooks/prep_and_train.py',
#                     node_count=1,
#                     distributed_training = mpi,
                    environment_definition = myenv,
                    )





Run the training. On a Standard_ND40rs_v2 VM with 8 V100 GPUs, it takes about 2 hours.

In [19]:
experiment = Experiment(ws, name="HydraNet")
run = experiment.submit(train_est)
RunDetails(run).show()

