# Challenge 1 - Basics of Azure ML

As part of this challenge you will get familiar with the basic concepts of [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/). Relevant links are provided in the notebook to help you solve the tasks.

Generally, a very good source of information is the [Python SDK reference](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/intro?view=azure-ml-py) for Azure Machine Learning.

## 1. Import Azure ML Python SDK

In [None]:
import azureml.core
print("SDK version:", azureml.core.VERSION)

## 2. Authentication and initializing Azure Machine Learning Workspace

As a first step you have to authenticate against the Azure [Machine Learning Workspace](https://ml.azure.com/). This can be achieved in different ways:

1. **Interactive Login Authentication:** The interactive authentication is suitable for local experimentation on your own computer.
2. **Azure CLI Authentication:** Azure CLI authentication is suitable if you are already using Azure CLI for managing Azure resources, and want to sign in only once.
3. **Managed Service Identity (MSI) Authentication:** The MSI authentication is suitable for automated workflows, for example as part of an Azure DevOps build.
4. **Service Principal Authentication:** The Service Principal authentication is suitable for automated workflows, for example as part of an Azure DevOps build.
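
For the automated options, credentials are usually supplied through the environment instead of an interactive prompt. The following is a minimal sketch for service principal authentication. The environment variable names and both helper names are hypothetical, and the `azureml-core` import is deferred so that the credential helper can be used on its own.

```python
import os


def read_sp_credentials(env=None):
    """Collect service principal settings from environment variables.

    The variable names below are hypothetical; use whatever your pipeline defines.
    """
    env = os.environ if env is None else env
    names = {
        "tenant_id": "AML_TENANT_ID",
        "service_principal_id": "AML_SP_ID",
        "service_principal_password": "AML_SP_PASSWORD",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in names.items()}


def service_principal_auth(credentials):
    """Build the auth object; requires the azureml-core package."""
    from azureml.core.authentication import ServicePrincipalAuthentication

    return ServicePrincipalAuthentication(**credentials)
```

The resulting object can be passed to the workspace, for example `Workspace.from_config(auth=service_principal_auth(read_sp_credentials()))`.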

For now, we will use the interactive authentication, which is the default mode when using the Azure ML SDK. When you connect to your workspace using `Workspace.from_config`, you will get an interactive login dialog.

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
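
`Workspace.from_config` reads the workspace coordinates from a `config.json` file, which it looks for in the current directory, in a `.azureml` subfolder, or in a parent directory. A minimal sketch that writes such a file with placeholder values (the helper name `write_workspace_config` is made up for this example):

```python
import json
import os


def write_workspace_config(folder, subscription_id, resource_group, workspace_name):
    """Write <folder>/.azureml/config.json with the three workspace coordinates."""
    config_dir = os.path.join(folder, ".azureml")
    os.makedirs(config_dir, exist_ok=True)
    config_path = os.path.join(config_dir, "config.json")
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)
    return config_path
```

Calling `write_workspace_config('.', '<your-subscription-id>', '<your-resource-group-name>', '<your-workspace-name>')` once should be enough for later `Workspace.from_config()` calls from the same folder.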

Note that the user you're authenticated as must have access to the subscription and resource group. If you receive an error like
```
AuthenticationException: You don't have access to xxxxxx-xxxx-xxx-xxx-xxxxxxxxxx subscription. All the subscriptions that you have access to = ...
```
check that you used the correct login and entered the correct subscription ID.

Alternatively, you can also specify the details of your workspace.

In [None]:
'''
# Alternative login method

from azureml.core.authentication import InteractiveLoginAuthentication

interactive_auth = InteractiveLoginAuthentication()

ws = Workspace(subscription_id='<your-subscription-id>',
               resource_group='<your-resource-group-name>',
               workspace_name='<your-workspace-name>',
               auth=interactive_auth)
'''

After logging in, we can print the workspace details.

**TASK**: Print the workspace details below. See here for the workspace object reference: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py

In [None]:
print("Workspace name: " + ws.name, 
      "Azure region: " + ws.location, 
      "Subscription id: " + ws.subscription_id, 
      "Resource group: " + ws.resource_group, sep = '\n')

## 3. Create and register datastore

**TODO**

To register an Azure blob container as a datastore, use [register_azure_blob_container()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)?view=azure-ml-py#register-azure-blob-container-workspace--datastore-name--container-name--account-name--sas-token-none--account-key-none--protocol-none--endpoint-none--overwrite-false--create-if-not-exists-false--skip-validation-false--blob-cache-timeout-none--grant-workspace-access-false--subscription-id-none--resource-group-none-).

The following code creates and registers the `blob_datastore_name` datastore to the `ws` workspace. This datastore accesses the `my-container-name` blob container on the `my-account-name` storage account, by using the provided account key.

In [None]:
from azureml.core import Datastore

blob_datastore_name='<my-name-in-ws>' # Name of the datastore in the workspace
container_name = '<my-container-name>' # Name of Azure blob container
account_name = '<my-account-name>' # Storage account name
account_key = '<my-account-key>' # Storage account key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)
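
The SDK reference describes the datastore name as case insensitive and limited to alphanumeric characters and underscores. Assuming that rule, a small check before registering can catch a bad name early. This is a minimal sketch; `is_valid_datastore_name` is a name made up for this example.

```python
import re

# Assumption taken from the SDK reference: letters, digits and underscores only
_DATASTORE_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_datastore_name(name):
    """Return True if the whole name consists of letters, digits and underscores."""
    return _DATASTORE_NAME.fullmatch(name) is not None


print(is_valid_datastore_name("my_blob_store"))  # True
print(is_valid_datastore_name("my-blob-store"))  # False, hyphens are rejected
```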

## 4. Upload and register data

Every workspace comes with a default [datastore](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data) (and you can register more) which is backed by the Azure blob storage account associated with the workspace. We can use it to transfer data from the local machine to the cloud and create a Dataset from it. We will now upload the training data to the default datastore (blob) within your workspace.

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

In [None]:
# List all datastores registered in the current workspace
datastores = ws.datastores
for name, datastore in datastores.items():
    print(name, datastore.datastore_type)

For this challenge we will use the [default datastore](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-access-data#get-datastores-from-your-workspace) that comes with the Azure Machine Learning Workspace.

**TASK**: Retrieve the default datastore for this workspace.

Hint: Same link as in the previous hint.

In [None]:
# get the default datastore
datastore = ws.get_default_datastore()
print(datastore.name, datastore.datastore_type, datastore.account_name, datastore.container_name, sep="\n")

**TASK**: Upload the local folder `./train_dataset` to the target path `train_dataset/image_dataset` on the default datastore.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.data.azure_storage_datastore.azureblobdatastore

In [None]:
import os

datastore.upload(src_dir = os.path.join('.', 'train_dataset'),
                      target_path = 'train_dataset/image_dataset',
                      overwrite = True,
                      show_progress = True)
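
The training script later in this notebook reads `<data path>/train` and `<data path>/val` with `torchvision.datasets.ImageFolder`, which expects one subfolder per class. Before uploading, it can help to check that the local folder has this layout. A minimal sketch (`list_classes` is a name made up for this example):

```python
import os


def list_classes(root, splits=("train", "val")):
    """Return {split: sorted class folder names} for an ImageFolder-style tree."""
    layout = {}
    for split in splits:
        split_dir = os.path.join(root, split)
        if not os.path.isdir(split_dir):
            raise FileNotFoundError("Missing folder: " + split_dir)
        # Only directories count as classes; loose files are ignored
        layout[split] = sorted(
            entry for entry in os.listdir(split_dir)
            if os.path.isdir(os.path.join(split_dir, entry))
        )
    return layout
```

For the folder uploaded above this would presumably be called as `list_classes(os.path.join('.', 'train_dataset', 'hymenoptera_data'))`; both splits should report the same class names.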

Now we will register a dataset in the Azure Machine Learning Workspace as a file dataset. A file dataset can be mounted to the compute engine. When you mount a file system, you attach that file system to a directory (mount point) and make it available to the system. Because mounting loads files at the time of processing, it is usually faster than downloading.
Note: mounting is only available for Linux-based compute (DSVM/VM, AmlCompute, HDInsight).

In [None]:
from azureml.core import Dataset

file_dataset = Dataset.File.from_files(path = [(datastore, 'train_dataset/image_dataset/hymenoptera_data')])
file_dataset = file_dataset.register(workspace=ws,
                                     name='hymenoptera_data',
                                     description='hymenoptera training dataset',
                                     create_new_version = True)

file_dataset.to_path()

## 3. Create Compute Engine

In this sample, we want to train a PyTorch model on a remote compute engine on Azure. To do so, we first must create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target).

In this challenge, we want to use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource. Once this is created, you are ready to train on your remote compute.

#### **TASK:** Create a machine learning compute target.

Create an Azure Machine Learning Compute cluster by following steps 1 to 4:
1. Check whether a cluster with the given name already exists.
2. Create the configuration (this step is local and only takes a second). Use the SKU `STANDARD_NC6` and a maximum of 4 nodes.
3. Create the cluster (this step takes about 20 seconds).
4. Provision the VMs to bring the cluster to its initial size. This step takes about 3-5 minutes and produces only sparse output. Make sure to wait until the call returns before moving on to the next cell.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.computetarget?view=azure-ml-py

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout.
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())
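
The configuration call also accepts scale settings. The following is a minimal sketch, assuming you want the cluster to scale down to zero nodes when idle. `cluster_settings` and `create_cluster` are hypothetical helper names and the 1800 second idle time is an arbitrary example value, while `min_nodes`, `max_nodes` and `idle_seconds_before_scaledown` are parameters of `AmlCompute.provisioning_configuration`. The `azureml-core` import is deferred so the settings helper can be checked without the SDK installed.

```python
def cluster_settings(vm_size, max_nodes, min_nodes=0, idle_seconds=1800):
    """Collect keyword arguments for AmlCompute.provisioning_configuration."""
    if min_nodes < 0 or max_nodes < 1 or min_nodes > max_nodes:
        raise ValueError("Expected 0 <= min_nodes <= max_nodes and max_nodes >= 1")
    return {
        "vm_size": vm_size,
        "min_nodes": min_nodes,
        "max_nodes": max_nodes,
        "idle_seconds_before_scaledown": idle_seconds,
    }


def create_cluster(ws, name, settings):
    """Create the cluster and block until provisioning finishes (needs azureml-core)."""
    from azureml.core.compute import AmlCompute, ComputeTarget

    config = AmlCompute.provisioning_configuration(**settings)
    target = ComputeTarget.create(ws, name, config)
    target.wait_for_completion(show_output=True)
    return target
```

With the names used above, this could be called as `create_cluster(ws, cluster_name, cluster_settings('STANDARD_NC6', max_nodes=4))`.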

## 4. Create a project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [None]:
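import os

TRAIN_FOLDER_NAME = 'train_new'
TRAIN_FILE_NAME = 'train.py'
os.makedirs(TRAIN_FOLDER_NAME, exist_ok=True)  # the %%writefile cell below needs this folder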
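
The training script written in the next step imports a local `utils` module, so that file has to be in the project folder as well. A minimal sketch that creates the folder and copies helper files into it; `prepare_project_dir` and the location of `utils.py` are assumptions made for this example:

```python
import os
import shutil


def prepare_project_dir(folder, extra_files=()):
    """Create the project folder and copy the given helper files into it."""
    os.makedirs(folder, exist_ok=True)
    copied = []
    for path in extra_files:
        # shutil.copy returns the path of the newly created copy
        copied.append(shutil.copy(path, folder))
    return copied
```

For example, `prepare_project_dir('train_new', ['utils.py'])` would place a copy of `utils.py` next to the training script, assuming `utils.py` sits in the notebook folder.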

## 5. Create a training script 

Now you will need to create your training scripts in your project folder. This will be done in the next step. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

If you would like to use Azure ML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of Azure ML code inside your training script.

In the training script, we will log some metrics to our Azure ML run. To do so, we will access the Azure ML Run object within the script:

```python
from azureml.core.run import Run
run = Run.get_context()
```

Further within the script, values are logged with `run.log`. For example, a script that takes `kernel` and `penalty` arguments could log them together with the highest accuracy the model achieves:

```python
run.log('Kernel type', str(args.kernel))
run.log('Penalty', float(args.penalty))

run.log('Accuracy', float(accuracy))
```
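
When the script runs outside a submitted run, `Run.get_context()` by default returns an offline run object, so the logging calls do not fail. If you also want to test the script on a machine where `azureml-core` is not installed, a stand-in object can take its place. This is a minimal sketch; `LocalRun` and `get_run` are hypothetical names and not part of the SDK.

```python
class LocalRun:
    """Hypothetical stand-in that mimics the run.log(name, value) call."""

    def __init__(self):
        self.metrics = {}

    def log(self, name, value):
        # Keep every logged value so repeated metrics form a series
        self.metrics.setdefault(name, []).append(value)


def get_run():
    """Return the Azure ML run if the SDK is available, else a LocalRun."""
    try:
        from azureml.core.run import Run
    except ImportError:
        return LocalRun()
    return Run.get_context()
```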

These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.

**TASK**: The training script below is missing a few of the metric logging calls. Find the `???` and complete the script.

Link: https://pytorch.org/docs/stable/torchvision/models.html

In [None]:
%%writefile $TRAIN_FOLDER_NAME/$TRAIN_FILE_NAME

from __future__ import print_function
import argparse
import datetime
import time
import os
import onnx

import torch
import torch.nn as nn
import torch.nn.parallel
import torch.optim as optim
import torch.utils.data
import torch.utils.data.distributed
import torchvision
from torchvision import transforms

import utils

from azureml.core import Dataset, Run
run = Run.get_context() # get the Azure ML run object


def train_one_epoch(model, criterion, optimizer, data_loader, device, epoch):
    start_time = time.time()

    # Create objects for tracking parameters
    img_processing = utils.AverageMeter()
    batch_time = utils.AverageMeter()
    losses = utils.AverageMeter()
    accuracies1 = utils.AverageMeter()
    accuracies5 = utils.AverageMeter()
    lr = utils.AverageMeter()
    
    # Put model in train mode
    print('Model in training state')
    model.train()

    for i, (image, target) in enumerate(data_loader):
        print('Iteration {0}'.format(i))
        image, target = image.to(device), target.to(device)
        output = model(image)
        loss = criterion(output, target)

        # Compute gradient and do SGD step
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Measure accuracy and loss
        acc1, acc5 = utils.accuracy(output, target, topk=(1, 5))
        batch_size = image.shape[0]
        losses.update(val=loss.item(), n=batch_size)
        accuracies1.update(val=acc1.item(), n=batch_size)
        accuracies5.update(val=acc5.item(), n=batch_size)
        batch_time.update(val=time.time() - start_time)
        img_processing.update(val=batch_size / (time.time() - start_time))
        lr.update(val=optimizer.param_groups[0]["lr"])

        # Log metrics to Azure ML
        run.log(name='train_loss', value=loss.item())
        run.log(name='train_acc1', value=acc1.item())
        run.log(name='train_acc5', value=acc5.item())
        run.log(name='train_imgs/s', value=batch_size / (time.time() - start_time))
        run.log(name='train_lr', value=optimizer.param_groups[0]["lr"])
    
    # Synchronize AverageMeters between processes --> leads to non working state in single node run
    #img_processing.synchronize_between_processes(device=device)
    #losses.synchronize_between_processes(device=device)
    #accuracies1.synchronize_between_processes(device=device)
    #accuracies5.synchronize_between_processes(device=device)

    # Log metrics to Azure ML
    run.log(name='train_loss_avg', value=losses.avg)
    run.log(name='train_acc1_avg', value=accuracies1.avg)
    run.log(name='train_acc5_avg', value=accuracies5.avg)
    run.log(name='train_imgs/s_avg', value=img_processing.avg)

    print('[Training] Epoch {epoch} Acc@1 {accuracies1.avg:.3f} Acc@5 {accuracies5.avg:.3f} Loss {losses.avg:.3f} Took {time}'
          .format(epoch=epoch, accuracies1=accuracies1, accuracies5=accuracies5, losses=losses, time=(time.time()-start_time)))


def evaluate(model, criterion, data_loader, device, epoch):
    # Create objects for tracking parameters
    losses = utils.AverageMeter()
    accuracies1 = utils.AverageMeter()
    accuracies5 = utils.AverageMeter()
    
    # Put model in eval mode 
    model.eval()

    with torch.no_grad():
        for i, (image, target) in enumerate(data_loader):
            image = image.to(device, non_blocking=True)
            target = target.to(device, non_blocking=True)
            output = model(image)
            loss = criterion(output, target)

            # Measure accuracy and loss
            acc1, acc5 = utils.accuracy(output, target, topk=(1, 5))
            batch_size = image.shape[0]
            losses.update(val=loss.item(), n=batch_size)
            accuracies1.update(val=acc1.item(), n=batch_size)
            accuracies5.update(val=acc5.item(), n=batch_size)

            # Log metrics to Azure ML
            run.log(name='val_loss', value=loss.item())
            run.log(name='val_acc1', value=acc1.item())
            run.log(name='val_acc5', value=acc5.item())
    
    # Synchronize AverageMeters between processes --> leads to non working state in single node run
    #losses.synchronize_between_processes(device=device)
    #accuracies1.synchronize_between_processes(device=device)
    #accuracies5.synchronize_between_processes(device=device)

    # Log metrics to Azure ML
    run.log(name='val_loss_avg', value=losses.avg)
    run.log(name='val_acc1_avg', value=accuracies1.avg)
    run.log(name='val_acc5_avg', value=accuracies5.avg)

    print('[Validation] Epoch {epoch} Acc@1 {accuracies1.avg:.3f} Acc@5 {accuracies5.avg:.3f} Loss {losses.avg:.3f}'
          .format(epoch=epoch, accuracies1=accuracies1, accuracies5=accuracies5, losses=losses))


def load_data(traindir, valdir, distributed, input_size):
    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                     std=[0.229, 0.224, 0.225])
    
    # Load training data
    print('Loading training data')
    st = time.time()
    dataset = torchvision.datasets.ImageFolder(
        traindir,
        transforms.Compose([
            transforms.RandomResizedCrop(input_size),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            normalize,
        ]))
    print('Took {0}'.format(time.time() - st))

    # Load validation data
    print('Loading validation data')
    dataset_test = torchvision.datasets.ImageFolder(
        valdir,
        transforms.Compose([
            transforms.Resize(input_size + 32),
            transforms.CenterCrop(input_size),
            transforms.ToTensor(),
            normalize,
        ]))
    
    # Create data sampler
    print('Creating data sampler')
    if distributed:
        train_sampler = torch.utils.data.distributed.DistributedSampler(dataset)
        test_sampler = torch.utils.data.distributed.DistributedSampler(dataset_test)
    else:
        train_sampler = None #torch.utils.data.RandomSampler(dataset) --> leads to non working state in single node run
        test_sampler = None #torch.utils.data.SequentialSampler(dataset_test) --> leads to non working state in single node run

    return dataset, dataset_test, train_sampler, test_sampler


def main(args):
    print(args)

    # Create output directory
    print('Create output dir and paths')
    if args.output_dir:
        utils.mkdir(path=os.path.join('.', args.output_dir))
    train_dir = os.path.join(args.data_path, 'train')
    val_dir = os.path.join(args.data_path, 'val')
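    # Note: len(train_dir) is the length of the path string, not the number of
    # class folders, so the output layer is sized by the path length.
    num_classes = len(train_dir)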
    
    # Initialize distributed mode
    print('Initialize distributed mode')
    utils.init_distributed_mode(args=args)

    # Create model
    print('Create model')
    model, input_size, params_to_update = utils.initialize_model(num_classes=num_classes, args=args)
    
    # Load data
    print('Load data')
    dataset, dataset_test, train_sampler, test_sampler = load_data(traindir=train_dir,
                                                                   valdir=val_dir,
                                                                   distributed=args.distributed,
                                                                   input_size=input_size)
    data_loader = torch.utils.data.DataLoader(dataset,
                                              batch_size=args.batch_size,
                                              sampler=train_sampler,
                                              num_workers=args.workers,
                                              shuffle=(train_sampler is None),
                                              pin_memory=True)
    data_loader_test = torch.utils.data.DataLoader(dataset_test,
                                                   batch_size=args.batch_size,
                                                   sampler=test_sampler, 
                                                   num_workers=args.workers,
                                                   shuffle=(test_sampler is None),
                                                   pin_memory=True)
    
    # Create criterion
    print('Creating criterion')
    criterion = nn.CrossEntropyLoss().to(args.device)

    # Create optimizer
    print('Creating optimizer')
    optimizer = torch.optim.SGD(
        params_to_update,
        lr=args.lr,
        momentum=args.momentum,
        weight_decay=args.weight_decay)
    
    # Create lr scheduler
    print('Creating lr scheduler')
    lr_scheduler = torch.optim.lr_scheduler.StepLR(
        optimizer,
        step_size=args.lr_step_size,
        gamma=args.lr_gamma)
    
    # Resume from checkpoint
    if args.resume:
        print('Resuming from checkpoint {0}'.format(args.resume))
        checkpoint = torch.load(args.resume, map_location='cpu')
        model.module.load_state_dict(checkpoint['model'])
        optimizer.load_state_dict(checkpoint['optimizer'])
        lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
        args.start_epoch = checkpoint['epoch'] + 1
    
    if args.test_only:
        # Test script only
        print('Testing only')
        evaluate(model, criterion, data_loader_test, device=args.device, epoch=0)

    else:
        # Train model
        print('Start training')
        start_time = time.time()

        for epoch in range(args.start_epoch, args.epochs):
            #if args.distributed: --> leads to non working state in single node run
            #    train_sampler.set_epoch(epoch)
            print('Epoch {0}'.format(epoch))

            # Train one epoch
            train_one_epoch(model, criterion, optimizer, data_loader, args.device, epoch)
            lr_scheduler.step()

            # Evaluate on val data
            evaluate(model, criterion, data_loader_test, args.device, epoch)

            # Save checkpoints after each epoch
            if args.output_dir:
                checkpoint = {
                    'model': model.module.state_dict(),
                    'optimizer': optimizer.state_dict(),
                    'lr_scheduler': lr_scheduler.state_dict(),
                    'epoch': epoch,
                    'args': args}
                utils.save_on_master(
                    checkpoint,
                    os.path.join(args.output_dir, 'model_{}.pth'.format(epoch)))
                utils.save_on_master(
                    checkpoint,
                    os.path.join(args.output_dir, 'checkpoint.pth'))

        total_time = time.time() - start_time
        total_time_str = str(datetime.timedelta(seconds=int(total_time)))
        print('Training time {}'.format(total_time_str))
    
    # Save model as pt and ONNX
    if isinstance(model, torch.nn.DataParallel) or isinstance(model, torch.nn.parallel.DistributedDataParallel):
        model = model.module
    dummy_input = torch.randn(args.batch_size, 3, input_size, input_size, requires_grad=True, device=args.device)
    torch.save(model, os.path.join(args.output_dir, 'model.pt'))
    torch.onnx.export(model,
                      dummy_input,
                      os.path.join(args.output_dir, 'model.onnx'),
                      export_params=True,
                      opset_version=10,
                      do_constant_folding=True,
                      verbose=True,
                      input_names = ['input'],
                      output_names = ['output'],
                      dynamic_axes={'input' : {0 : 'batch_size'},
                                    'output' : {0 : 'batch_size'}})
    
    # Check ONNX model
    print('Checking ONNX model')
    onnx_model = onnx.load(os.path.join(args.output_dir, 'model.onnx'))
    onnx.checker.check_model(onnx_model)


def parse_args():
    parser = argparse.ArgumentParser(description='PyTorch Classification Training')
    
    # Training parameters
    parser.add_argument('--data-path', dest='data_path', default='/tmp/dataset/',
                        help='dataset path')
    parser.add_argument('--dataset-name', dest='dataset_name', default=None,
                        help='dataset name')
    parser.add_argument('--model', dest='model', default='resnet18',
                        help='model name')
    parser.add_argument('--device', dest='device', default='cuda',
                        help='device')
    parser.add_argument('-b', '--batch-size', dest='batch_size', default=32, type=int,
                        help='input batch size for training (default: 32)')
    parser.add_argument('--epochs', dest='epochs', default=10, type=int, metavar='N',
                        help='number of epochs to train (default: 10)')
    parser.add_argument('-j', '--workers', dest='workers', default=16, type=int, metavar='N',
                        help='number of data loading workers (default: 16)')
    parser.add_argument('--lr', dest='lr', default=0.01, type=float,
                        help='initial learning rate (default 0.01)')
    parser.add_argument('--momentum', dest='momentum', default=0.9, type=float, metavar='M',
                        help='SGD momentum (default 0.9)')
    parser.add_argument('--wd', '--weight-decay', dest='weight_decay', default=1e-4, type=float, metavar='W',
                        help='weight decay (default: 1e-4)')
    parser.add_argument('--lr-step-size', dest='lr_step_size', default=30, type=int,
                        help='decrease lr every step-size epochs')
    parser.add_argument('--lr-gamma', dest='lr_gamma', default=0.1, type=float,
                        help='decrease lr by a factor of lr-gamma')
    parser.add_argument('--output-dir', dest='output_dir', default='outputs',
                        help='path where to save')
    parser.add_argument('--resume', dest='resume', default='',
                        help='resume from checkpoint')
    parser.add_argument('--start-epoch', dest='start_epoch', default=0, type=int, metavar='N',
                        help='start epoch')
    parser.add_argument('--test-only', dest='test_only', action='store_true',
                        help='Only test the model')
    parser.add_argument('--pretrained', dest='pretrained', action='store_true',
                        help='Use pre-trained models from torchvision')
    parser.add_argument('--finetuning', dest='finetuning', action='store_true',
                        help='Finetune only last layer of CNN')
    parser.add_argument('--seed', type=int, default=1, metavar='S',
                        help='random seed (default: 1)')
    
    # Distributed training parameters
    parser.add_argument('--world-size', dest='world_size', default=1, type=int,
                        help='number of distributed processes')
    parser.add_argument('--dist-backend', dest='dist_backend', default='nccl', type=str,
                        help='distributed backend')
    parser.add_argument('--dist-url', dest='dist_url', type=str,
                        help='url used to set up distributed training')
    parser.add_argument('--rank', dest='rank', default=-1, type=int,
                        help='rank of the worker')
    
    args = parser.parse_args()

    # Load data and checkpoint path from run
    try:
        args.data_path = run.input_datasets[args.dataset_name]
        print('Loaded registered dataset')
    except Exception:
        print('Could not find registered dataset. Loading default data path.')
    try:
        args.resume = run.input_datasets[args.resume]
        print('Loaded checkpoint path')
    except Exception:
        args.resume = None
        print('Could not find checkpoint path')
    
    # set distributed mode
    args.distributed = args.world_size >= 2
    return args


if __name__ == "__main__":
    args = parse_args()
    main(args=args)
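
The script relies on a local `utils` module that is not shown in this notebook. To illustrate what the metric tracking does, here is a minimal sketch of an `AverageMeter` that matches the way it is called above (`update(val=..., n=...)` and `.avg`). It is an assumption about the interface, not the original implementation.

```python
class AverageMeter:
    """Track a running, sample-weighted average of a scalar value."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.sum = 0.0
        self.count = 0

    def update(self, val, n=1):
        # Weight each value by the number of samples it was computed on
        self.sum += val * n
        self.count += n

    @property
    def avg(self):
        return self.sum / self.count if self.count else 0.0


meter = AverageMeter()
meter.update(val=2.0, n=2)  # e.g. a batch of 2 samples with loss 2.0
meter.update(val=4.0, n=6)  # e.g. a batch of 6 samples with loss 4.0
print(meter.avg)  # 3.5
```

Weighting by `n` keeps the epoch average correct when the last batch is smaller than the others.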


## 6. Create an experiment

An *Experiment* is a logical container in an Azure ML Workspace that represents a collection of trials (individual model runs). It hosts run records which can include run metrics and output artifacts from your experiments.

**TASK**: Fill in the missing values below to create a new experiment in your workspace

In [None]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='ch3-pytorch_sample')

## 7. Create Estimator

An estimator object is used to submit the run. Azure Machine Learning has pre-configured estimators for common machine learning frameworks, as well as a generic Estimator. Create a PyTorch estimator by specifying:

- The name of the estimator object, `est`
- The directory that contains your scripts. All the files in this directory are uploaded to the cluster nodes for execution.
- The training script name, `train.py`
- The input Dataset for training
- The compute target. In this case you will use the AmlCompute cluster you created
- The environment definition for the experiment

**TASK**: Complete the estimator creation below.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py

In [None]:
from azureml.train.dnn import PyTorch, Nccl

# distributed run with Nccl backend
script_params = {
    '--dataset-name': 'hymenoptera_data',
    '--dist-backend': 'nccl',
    '--dist-url': '$AZ_BATCHAI_PYTORCH_INIT_METHOD',
    '--rank': '$AZ_BATCHAI_TASK_INDEX',
    '--world-size': 2,
    '--epochs': 1,
    '--pretrained': None,
    '--finetuning': None
}

est = PyTorch(source_directory=TRAIN_FOLDER_NAME,
              entry_script=TRAIN_FILE_NAME,
              script_params=script_params,
              compute_target=compute_target,
              node_count=2,
              inputs=[file_dataset.as_named_input('hymenoptera_data').as_mount('tmp/dataset')],
              distributed_training=Nccl(),
              use_gpu=True,
              framework_version='1.3',
              pip_packages=['azureml-dataprep[pandas,fuse]', 'onnx', 'Pillow==6.1'])
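
The keys in `script_params` correspond to the arguments defined in `parse_args` of the training script. The following stand-alone sketch reproduces only a subset of that parser to show how the values are interpreted, assuming they reach the script as ordinary command-line arguments. `parse_subset` is a name made up for this example.

```python
import argparse


def parse_subset(argv):
    """Parse a subset of the training script's arguments from a list of strings."""
    parser = argparse.ArgumentParser(description='Subset of the training arguments')
    parser.add_argument('--dataset-name', dest='dataset_name', default=None)
    parser.add_argument('--world-size', dest='world_size', default=1, type=int)
    parser.add_argument('--epochs', dest='epochs', default=10, type=int)
    parser.add_argument('--pretrained', dest='pretrained', action='store_true')
    parser.add_argument('--finetuning', dest='finetuning', action='store_true')
    args = parser.parse_args(argv)
    # Same rule as in the training script: two or more processes means distributed
    args.distributed = args.world_size >= 2
    return args


args = parse_subset(['--dataset-name', 'hymenoptera_data',
                     '--world-size', '2', '--epochs', '1', '--pretrained'])
print(args.distributed, args.pretrained, args.finetuning)  # True True False
```

Flags declared with `action='store_true'` take no value: they are `True` when present and `False` when omitted.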

In [None]:
'''
# distributed run with Gloo backend
from azureml.train.dnn import PyTorch, Gloo

script_params = {
    '--dataset-name': 'hymenoptera_data',
    '--dist-backend': 'gloo',
    '--dist-url': '$AZ_BATCHAI_PYTORCH_INIT_METHOD',
    '--rank': '$AZ_BATCHAI_TASK_INDEX',
    '--world-size': 2,
    '--epochs': 1,
    '--pretrained': None,
    '--finetuning': None
}

est = PyTorch(source_directory=TRAIN_FOLDER_NAME,
              entry_script=TRAIN_FILE_NAME,
              script_params=script_params,
              compute_target=compute_target,
              node_count=2,
              inputs=[file_dataset.as_named_input('hymenoptera_data').as_mount('tmp/dataset')],
              distributed_training=Gloo(),
              use_gpu=True,
              framework_version='1.3',
              pip_packages=['azureml-dataprep[pandas,fuse]', 'onnx', 'Pillow==6.1'])
'''

In [None]:
'''              
# non distributed run
from azureml.train.dnn import PyTorch

script_params = {
    '--dataset-name': 'hymenoptera_data',
    '--world-size': 1,
    '--epochs': 1,
    '--pretrained': None,
    '--finetuning': None
}

est = PyTorch(source_directory=TRAIN_FOLDER_NAME,
              entry_script=TRAIN_FILE_NAME,
              script_params=script_params,
              compute_target=compute_target,
              node_count=1,
              inputs=[file_dataset.as_named_input('hymenoptera_data').as_mount('tmp/dataset')],
              use_gpu=True,
              framework_version='1.3',
              pip_packages=['azureml-dataprep[pandas,fuse]', 'onnx', 'Pillow==6.1'])
'''

## 8. Submit the job

Submit the estimator to the Azure ML experiment to kick off the execution.

**TASK**: Submit the experiment as a new run.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.experiment%28class%29?view=azure-ml-py#methods

In [None]:
script_params = {
    '--dataset-name': 'hymenoptera_data',
    '--world-size': 1,
    '--epochs': 3,
    '--pretrained': None,
    '--finetuning': None
}

est = PyTorch(source_directory=TRAIN_FOLDER_NAME,
              entry_script=TRAIN_FILE_NAME,
              script_params=script_params,
              compute_target=compute_target,
              node_count=1,
              inputs=[file_dataset.as_named_input('hymenoptera_data').as_mount('tmp/dataset')],
              use_gpu=True,
              framework_version='1.3',
              pip_packages=['azureml-dataprep[pandas,fuse]', 'onnx', 'Pillow==6.1'])

In [None]:
run = exp.submit(est)
run.wait_for_completion(show_output=True, wait_post_processing=True)

In [None]:
run.cancel()

You now have a model trained on a remote cluster. Retrieve all the metrics logged during the run, including the accuracy of the model:

**TASK**: Retrieve the metrics of the run.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run%28class%29?view=azure-ml-py#methods

In [None]:
run.get_metrics()
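
`run.get_metrics()` returns a dictionary keyed by metric name. A metric that was logged several times under the same name comes back as a list of values. The following is a minimal sketch for condensing such a dictionary; the helper name and the sample numbers are made up for illustration and are not results from a real run.

```python
def summarize_metrics(metrics):
    """Reduce each metric (scalar or list of scalars) to last, min, max and count."""
    summary = {}
    for name, value in metrics.items():
        values = value if isinstance(value, list) else [value]
        summary[name] = {
            "last": values[-1],
            "min": min(values),
            "max": max(values),
            "count": len(values),
        }
    return summary


# Made-up numbers purely for illustration, not output of an actual run
example = {"val_acc1_avg": [80.0, 90.0, 85.0], "train_lr": 0.01}
print(summarize_metrics(example)["val_acc1_avg"]["max"])  # 90.0
```

With a real run this would be called as `summarize_metrics(run.get_metrics())`.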

In [None]:
# register model from run
from azureml.core import Model

model = run.register_model(model_name='ch3-pytorch-model',
                           model_path='outputs/model.onnx',
                           model_framework=Model.Framework.ONNX,
                           model_framework_version='1.3',
                           datasets=[('Training dataset', file_dataset)],
                           description='PyTorch hymenoptera classification.',
                           tags={'area': 'hymenoptera_data', 'type': 'pytorch'})

**BROKEN FROM HERE**, and no-code deployment should work again.

## 9. Tune model hyperparameters

Now that we've seen how to do a simple PyTorch training run using the SDK, let's see if we can further improve the accuracy of our model. We can optimize the model's hyperparameters using Azure Machine Learning's hyperparameter tuning capabilities.

First, we define the hyperparameter space to sweep over. Let's tune the learning rate (`--lr`) and `--momentum` parameters. In this example we use random sampling to try different hyperparameter configurations and maximize our primary metric, `val_acc1_avg`.
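
To make the sampling expressions concrete, here is a minimal pure-Python sketch of what random sampling draws for each trial. It assumes the documented semantics of the HyperDrive parameter expressions: `uniform(min_value, max_value)` draws uniformly from the range, while `loguniform(min_value, max_value)` draws `exp(uniform(min_value, max_value))`, so its bounds are given in log space. The helper names are illustrative and not part of the SDK; the ranges aim at a learning rate between 0.0005 and 0.01 and a momentum between 0.45 and 0.55.

```python
import math
import random


def sample_uniform(rng, low, high):
    """Draw a value uniformly from [low, high]."""
    return rng.uniform(low, high)


def sample_loguniform(rng, log_low, log_high):
    """Draw exp(u) with u uniform in [log_low, log_high] (bounds in log space)."""
    return math.exp(rng.uniform(log_low, log_high))


def sample_configuration(rng):
    """Draw one illustrative hyperparameter configuration."""
    return {
        '--lr': sample_loguniform(rng, math.log(0.0005), math.log(0.01)),
        '--momentum': sample_uniform(rng, 0.45, 0.55),
    }


rng = random.Random(0)  # fixed seed so the sketch is repeatable
configs = [sample_configuration(rng) for _ in range(5)]
for config in configs:
    print(config)
```

Sampling the learning rate on a log scale spreads trials evenly across orders of magnitude instead of concentrating them near the upper bound.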

In [None]:
from azureml.train.dnn import Nccl, PyTorch
from azureml.train.hyperdrive.runconfig import HyperDriveConfig
from azureml.train.hyperdrive.sampling import RandomParameterSampling
from azureml.train.hyperdrive.run import PrimaryMetricGoal
from azureml.train.hyperdrive.parameter_expressions import choice, uniform, loguniform
from azureml.train.hyperdrive.policy import MedianStoppingPolicy

# Distributed run with the NCCL backend
script_params = {
    '--dataset-name': 'hymenoptera_data',
    '--dist-backend': 'nccl',
    '--dist-url': '$AZ_BATCHAI_PYTORCH_INIT_METHOD',
    '--rank': '$AZ_BATCHAI_TASK_INDEX',
    '--world-size': 2,
    '--epochs': 1,
    '--pretrained': None,
    '--finetuning': None
}

est = PyTorch(source_directory=TRAIN_FOLDER_NAME,
              entry_script=TRAIN_FILE_NAME,
              script_params=script_params,
              compute_target=compute_target,
              node_count=2,
              inputs=[file_dataset.as_named_input('hymenoptera_data').as_mount('tmp/dataset')],
              distributed_training=Nccl(),
              use_gpu=True,
              framework_version='1.3',
              pip_packages=['azureml-dataprep[pandas,fuse]', 'onnx', 'Pillow==6.1'])

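# loguniform(a, b) draws exp(uniform(a, b)), so its bounds are given in log space:
# exp(-7.6) is about 0.0005 and exp(-4.6) is about 0.01.
param_sampling = RandomParameterSampling({
    '--lr': loguniform(-7.6, -4.6),
    '--momentum': uniform(0.45, 0.55)
})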

# '--model': choice('resnet18', 'vgg11_bn', 'mobilenet_v2', 'shufflenet_v2_x0_5')

hyperdrive_run_config = HyperDriveConfig(estimator=est,
                                         hyperparameter_sampling=param_sampling, 
                                         primary_metric_name='val_acc1_avg',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=2,
                                         max_concurrent_runs=2,
                                         policy=MedianStoppingPolicy())

Finally, launch the hyperparameter tuning job.

**TASK**: Submit the hyperdrive run.

Hint: It is very similar to the experiment submission before.

In [None]:
hyperdrive_run = exp.submit(hyperdrive_run_config)
hyperdrive_run.wait_for_completion(show_output=True, wait_post_processing=True)

Finding the best hyperparameter values for your model is often an iterative process that needs multiple tuning runs, each learning from the previous ones. Reusing knowledge from previous runs accelerates hyperparameter tuning, which reduces the cost of tuning the model and can improve the primary metric of the resulting model. When you warm start a hyperparameter tuning experiment with Bayesian sampling, trials from the previous run are used as prior knowledge to pick new samples intelligently so as to improve the primary metric. With Random or Grid sampling, early termination decisions use metrics from the previous runs to identify poorly performing training runs.

Azure Machine Learning allows you to warm start your hyperparameter tuning run by leveraging knowledge from up to 5 previously completed hyperparameter tuning parent runs. 
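
As a rough sketch of how the parent-run limit could be enforced before building a configuration, the hypothetical helper below returns keyword arguments for `HyperDriveConfig`. It assumes the SDK's `resume_from` parameter is used for warm starting; the helper name and the placeholder strings standing in for parent runs are illustrative only.

```python
MAX_WARM_START_PARENTS = 5  # limit stated in the paragraph above


def warm_start_kwargs(parent_runs):
    """Return HyperDriveConfig keyword arguments for warm starting.

    parent_runs is expected to hold completed hyperparameter tuning parent runs.
    """
    parents = list(parent_runs)
    if not parents:
        return {}  # nothing to warm start from
    if len(parents) > MAX_WARM_START_PARENTS:
        raise ValueError(
            'Warm start supports at most %d parent runs, got %d'
            % (MAX_WARM_START_PARENTS, len(parents)))
    return {'resume_from': parents}


# Placeholder strings stand in for real parent run objects.
print(warm_start_kwargs(['parent_run_1', 'parent_run_2']))
```

In the notebook, the result could be unpacked into the configuration, for example `HyperDriveConfig(estimator=est, ..., **warm_start_kwargs([previous_hyperdrive_run]))`, where `previous_hyperdrive_run` is a hypothetical earlier tuning run.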

Additionally, individual training runs of a hyperparameter tuning experiment are sometimes cancelled due to budget constraints or fail for other reasons. You can resume such individual training runs from the last checkpoint (assuming your training script handles checkpoints). Resuming an individual training run uses the same hyperparameter configuration and mounts the storage used for that run. The training script should accept the `--resume-from` argument, which contains the checkpoint or model files from which to resume the training run. You can also resume individual runs as part of an experiment that spends additional budget on hyperparameter tuning. Any budget that remains after resuming the specified training runs is used to explore additional configurations.
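
A training script could handle the `--resume-from` argument roughly as sketched below. This is a minimal illustration, not the workshop's actual training script: it assumes the argument points to a directory and that checkpoints are named `epoch_<number>.pt`, both of which are conventions made up for this example.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command-line arguments of a hypothetical training script."""
    parser = argparse.ArgumentParser(description='Training script sketch')
    parser.add_argument('--epochs', type=int, default=1)
    parser.add_argument('--resume-from', type=str, default=None,
                        help='Directory holding checkpoints of the run to resume')
    return parser.parse_args(argv)


def latest_checkpoint_name(file_names):
    """Return the name with the highest epoch number, or None if none match."""
    prefix, suffix = 'epoch_', '.pt'
    epochs = {}
    for name in file_names:
        stem = name[len(prefix):-len(suffix)]
        if name.startswith(prefix) and name.endswith(suffix) and stem.isdigit():
            epochs[int(stem)] = name
    return epochs[max(epochs)] if epochs else None


def find_checkpoint(resume_from):
    """Return the path of the newest checkpoint in resume_from, or None."""
    if resume_from is None or not os.path.isdir(resume_from):
        return None
    name = latest_checkpoint_name(os.listdir(resume_from))
    return os.path.join(resume_from, name) if name else None


args = parse_args(['--epochs', '3'])  # no --resume-from given here
checkpoint = find_checkpoint(args.resume_from)
if checkpoint:
    print('Resuming from', checkpoint)
else:
    print('Training from scratch')
```

Falling back to training from scratch when no checkpoint is found keeps the same script usable for both fresh and resumed runs.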

For more information on warm starting and resuming hyperparameter tuning runs, refer to the [Hyperparameter Tuning for Azure Machine Learning documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-tune-hyperparameters).

When all jobs have finished, we can identify the run with the highest accuracy.
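
Conceptually, finding the best run means comparing the primary metric across the child runs. The sketch below illustrates the idea on a plain dictionary shaped like a mapping from run ID to logged metrics. The run IDs and metric values are made up for illustration, and comparing the last logged value is an assumption of this sketch that may differ from how the service ranks runs. In the notebook, the SDK does this selection for you.

```python
def best_run_id(metrics_by_run, metric_name, maximize=True):
    """Return the ID of the run with the best final value of metric_name."""
    def final_value(metrics):
        value = metrics[metric_name]
        # A metric logged several times is a list; compare its last entry.
        return value[-1] if isinstance(value, list) else value

    candidates = {
        run_id: final_value(metrics)
        for run_id, metrics in metrics_by_run.items()
        if metric_name in metrics
    }
    if not candidates:
        raise ValueError('No run logged the metric %r' % metric_name)
    chooser = max if maximize else min
    return chooser(candidates, key=candidates.get)


# Hypothetical metrics, not results of an actual run.
example_metrics = {
    'run_a': {'val_acc1_avg': [0.81, 0.90]},
    'run_b': {'val_acc1_avg': 0.93},
    'run_c': {'loss': 0.4},  # did not log the primary metric
}
best = best_run_id(example_metrics, 'val_acc1_avg')
print(best)
```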

**TASK**: Get the best run from the hyperdrive experiment

Hint: https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.hyperdrive.hyperdriverun

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])

## 10. Register model

The last step in the training script wrote the file `model.onnx` to a directory named `outputs` on the VM of the cluster where the job is executed. `outputs` is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace, so the model file is now also available in your workspace.
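
On the training-script side, making use of this convention only requires writing files below `outputs`. The following is a minimal sketch with a hypothetical helper and placeholder bytes instead of a real model; it writes into a temporary directory so that it can be run anywhere without leaving files behind.

```python
import os
import tempfile


def save_to_outputs(file_name, payload, base_dir='.'):
    """Write payload to <base_dir>/outputs/<file_name> and return the path."""
    output_dir = os.path.join(base_dir, 'outputs')
    os.makedirs(output_dir, exist_ok=True)  # create the directory if needed
    path = os.path.join(output_dir, file_name)
    with open(path, 'wb') as handle:
        handle.write(payload)
    return path


with tempfile.TemporaryDirectory() as work_dir:
    saved_path = save_to_outputs('example.bin', b'placeholder bytes', base_dir=work_dir)
    saved_ok = os.path.isfile(saved_path)
    print(os.path.relpath(saved_path, work_dir))
```

In a real training script, `base_dir` would be the working directory of the run, so that the file ends up in the uploaded `outputs` folder.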

You can see files associated with that run.

**TASK**: Get all the file names associated with the best run.

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run%28class%29

In [None]:
best_run.get_file_names()

Register the model in the workspace so that you (or other collaborators) can later query, examine, and deploy this model.

**TASK**: Fill in the missing values below to register the model

In [None]:
from azureml.core import Model
from azureml.core.resource_configuration import ResourceConfiguration

model = best_run.register_model(model_name='ch3-pytorch-model',
                                model_path='outputs/model.onnx',
                                model_framework=Model.Framework.ONNX,
                                model_framework_version='1.3',
                                datasets=[('Training dataset', file_dataset)],
                                description='PyTorch hymenoptera classification.',
                                tags={'area': 'hymenoptera_data', 'type': 'pytorch'})

print(model.name, model.id, model.version, sep='\n')

Now, your model is ready for deployment.

## 11. Deployment

No-code model deployment is currently in preview and supports various frameworks and model types, including the TensorFlow SavedModel format, ONNX models and scikit-learn models. No-code model deployment is supported for all built-in scikit-learn model types.

The deployment takes a few minutes and runs on an Azure Container Instance.

**TASK**: Fill in the missing values to deploy the model as a no-code webservice

In [None]:
service_no_code = Model.deploy(workspace=ws,
                               name='ch3-pytorch-service',
                               models=[model])
service_no_code.wait_for_deployment(show_output=True)

In [None]:
# If deployment fails, then retry with:
#service_no_code.update(models=[model])

**TASK**: Get the logs for the web service

Hint: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.webservice%28class%29

In [None]:
print(service_no_code.get_logs())

Convert this Webservice object into a JSON serialized dictionary, which lists all the details of the webservice.

In [None]:
service_no_code.serialize()

## 12. Test Service

The following code is an example of a Python client that sends a scoring request to the deployed web service.

**TASK**: Fill in the missing part to execute the request against the web service

Hint: Look in the same class as for the previous hint

In [None]:
import os
import requests

headers = {'Content-Type': 'application/json', 'Accept': 'application/json'}

# Add the authorization header if key-based or token-based auth is enabled.
if service_no_code.auth_enabled:
    headers['Authorization'] = 'Bearer ' + service_no_code.get_keys()[0]
elif service_no_code.token_auth_enabled:
    headers['Authorization'] = 'Bearer ' + service_no_code.get_token()[0]

scoring_uri = service_no_code.scoring_uri
print(scoring_uri)

# Send the sample input file as the request body.
with open(os.path.join('test_deployment', 'onnx-mnist-predict-input.json'), 'rb') as data_file:
    response = requests.post(scoring_uri, data=data_file, headers=headers)

print(response.status_code)
print(response.elapsed)
print(response.json())

Delete the service to save cost.

In [None]:
service_no_code.delete()