In [1]:
import os
from os import path
import json

# Train Deep Learning Model on Azure Batch Shipyard
This notebook is a simple example of how to train a deep learning model on GPUs using Docker containers and Azure Batch Shipyard. Most of the workflow can be orchestrated from a Jupyter notebook, which makes it easy to intermix commands, create JSON templates and see the output of the executed commands. Driving everything from a notebook is mainly a pedagogical choice and is not meant to be the recommended way of doing this. We will be using the Microsoft Data Science Virtual Machine (DSVM), which comes with many tools already installed, such as Anaconda, Docker, Azure CLI and blobxfer.

Most of the instructions apply to other Linux distros; just be mindful of the distro-specific requirements.
Tools you will need to install if you are not using a Microsoft DSVM:
* Azure CLI: Command-line tool to manage Azure resources. See [here](https://github.com/Azure/azure-cli) for installation instructions
* blobxfer: A file transfer tool. See [here](https://github.com/Azure/blobxfer) for installation instructions
* Docker: [See here](https://docs.docker.com/engine/installation/)
* Anaconda: [See here](https://docs.continuum.io/anaconda/install)

<span style="color:red">You need an Azure subscription with access to GPU enabled VMs to run this example</span>


**Many of the variables defined in the notebook can be called whatever you want. The only one that you need to be mindful of is the Azure subscription you wish to use.**

## Setting up Docker
<span style="color:blue">[If you already have Docker set up and can run it without sudo, you can ignore these instructions]</span>

The Docker engine comes preinstalled on the Microsoft DSVM. If you need to install Docker, look at [these instructions](https://docs.docker.com/engine/installation/).

Out of the box, Docker still has to be invoked with sudo, which we cannot do from within the Jupyter notebook. To work around this, we do the following. (Instructions taken from [here](https://docs.docker.com/engine/installation/linux/centos/))

##### Create a Docker group

The Docker daemon binds to a Unix socket instead of a TCP port. By default that Unix socket is owned by the user root and other users can access it with sudo. For this reason, Docker daemon always runs as the root user.

To avoid having to use sudo when you use the docker command, create a Unix group called docker and add users to it. When the docker daemon starts, it makes the ownership of the Unix socket read/writable by the docker group.

To create the docker group and add your user:

Log into your machine as a user with sudo or root privileges.

Create the docker group.
    
```bash
sudo groupadd docker
```

Add your user to docker group.

```bash
sudo usermod -aG docker your_username
```

Log out and log back in.
This ensures your user is running with the correct permissions.


Verify that your user is in the docker group by running docker without sudo.

In [166]:
!docker run --rm hello-world


Hello from Docker!
This message shows that your installation appears to be working correctly.

To generate this message, Docker took the following steps:
 1. The Docker client contacted the Docker daemon.
 2. The Docker daemon pulled the "hello-world" image from the Docker Hub.
 3. The Docker daemon created a new container from that image which runs the
    executable that produces the output you are currently reading.
 4. The Docker daemon streamed that output to the Docker client, which sent it
    to your terminal.

To try something more ambitious, you can run an Ubuntu container with:
 $ docker run -it ubuntu bash

Share images, automate workflows, and more with a free Docker ID:
 https://cloud.docker.com/

For more examples and ideas, visit:
 https://docs.docker.com/engine/userguide/
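
Group membership can also be checked without starting a container. The following is a minimal sketch, assuming a Unix system; the helper name `user_in_group` is our own and not part of Docker. It reads the system's group database, so it can report membership even before the current session has picked it up by logging out and back in.

```python
import grp
import pwd


def user_in_group(username, groupname):
    """Return True if `username` belongs to the Unix group `groupname`."""
    try:
        group = grp.getgrnam(groupname)
    except KeyError:
        # The group does not exist on this machine.
        return False
    if username in group.gr_mem:
        # Supplementary membership, e.g. added with `usermod -aG`.
        return True
    try:
        # The group may also be the user's primary group.
        return pwd.getpwnam(username).pw_gid == group.gr_gid
    except KeyError:
        # The user does not exist on this machine.
        return False


# Example: user_in_group("your_username", "docker")
```

If this returns `True` but `docker run` still fails with a permission error, the most likely cause is that the session predates the group change.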



## Deep Learning Model
The model we will be using is a simple convolutional network, which we will train and evaluate on the CIFAR-10 dataset.

The two notebooks are [Process CIFAR data.ipynb](process_cifar_data.ipynb) and [CNTK CIFAR10](cntk_cifar10.ipynb)

We first create the appropriate directory structure

In [105]:
!mkdir script
!mkdir script/code

mkdir: cannot create directory ‘script’: File exists
mkdir: cannot create directory ‘script/code’: File exists
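
The "File exists" messages above are harmless, but they appear every time the notebook is re-run. As a sketch of an idempotent alternative, the directories used in this notebook (`script/code`, and later `script/docker` and `config`) could be created from Python instead. The helper name `ensure_dirs` is hypothetical.

```python
import os


def ensure_dirs(base, subdirs):
    """Create each subdirectory under `base`, ignoring ones that already exist."""
    created = []
    for sub in subdirs:
        target = os.path.join(base, sub)
        os.makedirs(target, exist_ok=True)  # no error on a second run
        created.append(target)
    return created


# Example: ensure_dirs(".", ["script/code", "script/docker", "config"])
```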


Then we copy the notebooks to the appropriate locations

In [4]:
!cp process_cifar_data.ipynb script/code

In [5]:
!cp cntk_cifar10.ipynb script/code

Once copied, we convert them to Python files. This isn't strictly necessary, since we could run them as notebooks, but doing so would require some extra configuration steps that we avoid by converting them.

In [6]:
!jupyter nbconvert --to python script/code/process_cifar_data.ipynb --ExecutePreprocessor.kernel_name=cntk-py34

[NbConvertApp] Converting notebook script/code/process_cifar_data.ipynb to python
[NbConvertApp] Writing 6344 bytes to script/code/process_cifar_data.py


In [7]:
!jupyter nbconvert --to python script/code/cntk_cifar10.ipynb --ExecutePreprocessor.kernel_name=cntk-py34

[NbConvertApp] Converting notebook script/code/cntk_cifar10.ipynb to python
[NbConvertApp] Writing 7925 bytes to script/code/cntk_cifar10.py
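
To make the conversion less of a black box, here is a deliberately simplified sketch of what turning a notebook into a script involves. It is not how nbconvert is implemented: it only concatenates the code cells of a parsed `.ipynb` file, drops markdown cells, and leaves any IPython magics untouched, all of which the real tool handles for you. The function name `notebook_to_script` is our own.

```python
import json


def notebook_to_script(notebook):
    """Join the source of every code cell in a parsed .ipynb document."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells are skipped in this sketch
        source = cell.get("source", "")
        # The notebook format stores source as one string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        chunks.append(source.rstrip("\n"))
    return "\n\n".join(chunks) + "\n"


# Example (hypothetical usage):
# with open("script/code/cntk_cifar10.ipynb") as f:
#     print(notebook_to_script(json.load(f)))
```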


### Login and configure Azure CLI

In [None]:
!az login -o table

<span style="color:red">OUTPUT CLEARED FOR CONFIDENTIALITY REASONS</span>

You only need to do the following if you have more than one Azure subscription

In [111]:
selected_subscription = "'My Subscription'" # ADD THE NAME OR ID OF THE SUBSCRIPTION YOU WANT TO USE

In [112]:
!az account set --subscription $selected_subscription
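
The nested quotes in `selected_subscription` are there because IPython substitutes the variable into the shell command as-is, so a name containing spaces has to arrive already quoted. As a small sketch, the standard library's `shlex.quote` can add that quoting for us; the subscription name below is the same placeholder used above.

```python
import shlex

subscription_name = "My Subscription"  # placeholder, as in the cell above
selected_subscription = shlex.quote(subscription_name)
print(selected_subscription)  # 'My Subscription'
```

A subscription ID made only of letters, digits and hyphens is returned unchanged, so the same call works for either form.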

List Azure subscriptions and check we have selected the right one

In [None]:
!az account list -o table

<span style="color:red">OUTPUT CLEARED FOR CONFIDENTIALITY REASONS</span>

### Package up the model

Create the private Docker registry and collect the necessary information. For this example we will name the registry and its resource group as follows:

In [114]:
docker_registry = "mscontainer"
docker_registry_group = "mscontainergorup"

In [115]:
!az group create -n $docker_registry_group -l southcentralus -o table

Location        Name
--------------  ----------------
southcentralus  mscontainergorup


In [116]:
!az acr create -n $docker_registry -g $docker_registry_group -l southcentralus -o table

Create a new service principal and assign access:
  az ad sp create-for-rbac --scopes /subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourcegroups/mscontainergorup/providers/Microsoft.ContainerRegistry/registries/mscontainer --role Owner --password <password>

Use an existing service principal and assign access:
  az role assignment create --scope /subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourcegroups/mscontainergorup/providers/Microsoft.ContainerRegistry/registries/mscontainer --role Owner --assignee <app-id>
NAME         RESOURCE GROUP    LOCATION        LOGIN SERVER                      CREATION DATE                     ADMIN ENABLED
-----------  ----------------  --------------  --------------------------------  --------------------------------  ---------------
mscontainer  mscontainergorup  southcentralus  mscontainer-microsoft.azurecr.io  2017-02-23T13:01:48.164277+00:00  False


In [117]:
!az acr update -n $docker_registry --admin-enabled true -o table

NAME         RESOURCE GROUP    LOCATION        LOGIN SERVER                      CREATION DATE                     ADMIN ENABLED
-----------  ----------------  --------------  --------------------------------  --------------------------------  ---------------
mscontainer  mscontainergorup  southcentralus  mscontainer-microsoft.azurecr.io  2017-02-23T13:01:48.164277+00:00  True


In [118]:
json_data = !az acr credential show -n $docker_registry
acr_credentials = json.loads(''.join(json_data))  # parse the CLI output once
docker_username = acr_credentials['username']
docker_password = acr_credentials['password']

In [119]:
json_data = !az acr show -n $docker_registry
docker_registry_server = json.loads(''.join(json_data))['loginServer']
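
The pattern of capturing CLI output and passing it through `json.loads(''.join(...))` recurs throughout this notebook. One possible way to factor it out is sketched below; the helper names `parse_cli_json` and `az_json` are our own, and `az_json` assumes the Azure CLI is installed and logged in.

```python
import json
import subprocess


def parse_cli_json(lines):
    """Parse JSON captured as a list of output lines (as `!cmd` returns)."""
    return json.loads("".join(lines))


def az_json(*args):
    """Run `az <args> -o json` and return the parsed result."""
    completed = subprocess.run(
        ["az", *args, "-o", "json"],
        check=True,               # raise if the CLI exits with an error
        stdout=subprocess.PIPE,
        universal_newlines=True,  # decode stdout to str
    )
    return json.loads(completed.stdout)


# Example (requires a logged-in Azure CLI):
# login_server = az_json("acr", "show", "-n", "mscontainer")["loginServer"]
```

Using `check=True` means a failed CLI call raises an exception instead of handing an error message to the JSON parser.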

In [120]:
mkdir script/docker

mkdir: cannot create directory ‘script/docker’: File exists


##### Write the Dockerfile
The Dockerfile defines what we want in our Docker image. CNTK has some prebuilt images which we can use as a base to build upon. The base image has everything installed, so we don't have to worry about installing GPU drivers etc. We simply add the code directory we created earlier, which contains our model.

In [None]:
%%writefile script/docker/dockerfile
# Dockerfile for CNTK-GPU-OpenMPI for use with Batch Shipyard on Azure Batch

FROM microsoft/cntk:2.0.beta8.0-runtime-gpu-python3.4-cuda8.0-cudnn5.1
MAINTAINER Mathew Salvaris <mathew.salvaris@microsoft.com>
ADD code /code
ENV PATH /root/anaconda3/bin:$PATH
CMD [ "/bin/bash" ]

In [122]:
container_name = docker_registry_server+"/masalvar/cntkbatch"
application_path = 'script'
docker_file_location = path.join(application_path, 'docker/dockerfile')

In [123]:
!docker login $docker_registry_server -u $docker_username -p $docker_password

Login Succeeded
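
Passing the password with `-p` puts it on the command line, where it can end up in shell history or process listings. A hedged alternative is sketched below, assuming a Docker version that supports the `--password-stdin` flag; the helper names are hypothetical.

```python
import subprocess


def docker_login_command(server, username):
    """Build a `docker login` command that reads the password from stdin."""
    return ["docker", "login", server, "-u", username, "--password-stdin"]


def docker_login(server, username, password):
    """Log in without placing the password on the command line."""
    subprocess.run(
        docker_login_command(server, username),
        input=password.encode(),  # sent on stdin, not as an argument
        check=True,
    )


# Example: docker_login(docker_registry_server, docker_username, docker_password)
```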


##### Build Docker Image
It can take some time to pull the CNTK image.

In [124]:
!docker build -t $container_name -f $docker_file_location $application_path --no-cache

Sending build context to Docker daemon 45.57 kB
Step 1 : FROM microsoft/cntk:2.0.beta8.0-runtime-gpu-python3.4-cuda8.0-cudnn5.1
 ---> c5b08f2fba7b
Step 2 : MAINTAINER Mathew Salvaris <mathew.salvaris@microsoft.com>
 ---> Running in d4442f9f2dc3
 ---> ccecae2d8314
Removing intermediate container d4442f9f2dc3
Step 3 : ADD code /code
 ---> 3e73e652b70a
Removing intermediate container ce2395e793df
Step 4 : ENV PATH /root/anaconda3/bin:$PATH
 ---> Running in 0e1c677a2ff8
 ---> f19dd8675760
Removing intermediate container 0e1c677a2ff8
Step 5 : CMD /bin/bash
 ---> Running in e8c1f3277392
 ---> 17b08a178ef7
Removing intermediate container e8c1f3277392
Successfully built 17b08a178ef7


In [None]:
!docker push $container_name 

<span style="color:red">OUTPUT CLEARED FOR CONFIDENTIALITY REASONS</span>

## Install Batch Shipyard
<span style="color:blue">[If you have Batch Shipyard installed somewhere you can ignore these instructions]</span>

[Based on these instructions](https://github.com/Azure/batch-shipyard/blob/master/docs/01-batch-shipyard-installation.md)

In [68]:
!git clone https://github.com/Azure/batch-shipyard.git

Cloning into 'batch-shipyard'...
remote: Counting objects: 2957, done.
remote: Total 2957 (delta 0), reused 0 (delta 0), pack-reused 2956
Receiving objects: 100% (2957/2957), 978.75 KiB | 0 bytes/s, done.
Resolving deltas: 100% (1998/1998), done.


This step cannot be done from inside the notebook since it requires root privileges, so execute the following in a terminal:
```bash
cd batch-shipyard
pip install -r requirements.txt
./install.sh -3
```

The `-3` switch installs batch-shipyard for Python 3, which is the recommended option and the one we use here.

Create a reference to the scripts. Instead of doing this you could add it to your path or create a symlink.

In [126]:
batchshipyard = 'batch-shipyard/shipyard'

## Using Batch Shipyard
In order to use Batch Shipyard we need to prepare a number of JSON configuration files. [Look here for more details](https://github.com/Azure/batch-shipyard/blob/master/docs/10-batch-shipyard-configuration.md)

### Create Batch Account and Batch Storage
Batch Shipyard requires a storage account for storing metadata in order to execute across a distributed environment.

First we create the resource group.

In [127]:
group_name = 'msbatchexample'
location = 'southcentralus'

In [128]:
%%time
!az group create -n $group_name -l $location -o table

Location        Name
--------------  --------------
southcentralus  msbatchexample
CPU times: user 59.9 ms, sys: 27.2 ms, total: 87.1 ms
Wall time: 3 s


In [None]:
%%time
!az group list -o table

<span style="color:red">OUTPUT CLEARED FOR CONFIDENTIALITY REASONS</span>

Now we create the batch account and the storage account which we associate with the group we created earlier.

In [130]:
batch_account_name = "msbatchex"
storage_account_name = "msbatchstoreex"

Define an ARM template that creates the Batch account and its storage account:

In [131]:
template_dict = {
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "batchAccounts_name": {
            "defaultValue": batch_account_name,
            "type": "String"
        },
        "storageAccounts_name": {
            "defaultValue": storage_account_name,
            "type": "String"
        }
    },
    "variables": {},
    "resources": [
        {
            "type": "Microsoft.Batch/batchAccounts",
            "name": "[parameters('batchAccounts_name')]",
            "apiVersion": "2015-12-01",
            "location": location,
            "properties": {
                "autoStorage": {
                    "storageAccountId": "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_name'))]"
                }
            },
            "resources": [],
            "dependsOn": [
                "[resourceId('Microsoft.Storage/storageAccounts', parameters('storageAccounts_name'))]"
            ]
        },
        {
            "type": "Microsoft.Storage/storageAccounts",
            "sku": {
                "name": "Standard_LRS",
                "tier": "Standard"
            },
            "kind": "Storage",
            "name": "[parameters('storageAccounts_name')]",
            "apiVersion": "2016-01-01",
            "location": location,
            "tags": {},
            "properties": {},
            "resources": [],
            "dependsOn": []
        }
    ]
}

In [132]:
template_filename = 'template.json'

In [133]:
with open(template_filename, 'w') as outfile:
    json.dump(template_dict, outfile)
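
Before sending the template to Azure, a quick local check can catch simple mistakes such as a misspelled parameter name. The sketch below is not a substitute for `az group deployment validate`; the function `check_template` is our own, and the list of required keys is an assumption chosen for the two resources in this template.

```python
import re

REQUIRED_RESOURCE_KEYS = ("type", "name", "apiVersion", "location")
PARAM_REF = re.compile(r"parameters\('([^']+)'\)")


def check_template(template):
    """Return a list of problems found in an ARM template dictionary."""
    problems = []
    declared = set(template.get("parameters", {}))
    for index, resource in enumerate(template.get("resources", [])):
        for key in REQUIRED_RESOURCE_KEYS:
            if key not in resource:
                problems.append("resource %d is missing '%s'" % (index, key))
        # Every parameters('...') reference in the name must be declared.
        for ref in PARAM_REF.findall(str(resource.get("name", ""))):
            if ref not in declared:
                problems.append(
                    "resource %d uses undeclared parameter '%s'" % (index, ref)
                )
    return problems


# Example: check_template(template_dict) returns [] when nothing is flagged.
```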

##### Validate the template

In [134]:
!az group deployment validate --template-file $template_filename -g $group_name

{
  "error": null,
  "properties": {
    "correlationId": "60de92eb-a595-4016-b14d-d5a008cae09c",
    "debugSetting": null,
    "dependencies": [
      {
        "dependsOn": [
          {
            "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/msbatchexample/providers/Microsoft.Storage/storageAccounts/msbatchstoreex",
            "resourceGroup": "msbatchexample",
            "resourceName": "msbatchstoreex",
            "resourceType": "Microsoft.Storage/storageAccounts"
          }
        ],
        "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/msbatchexample/providers/Microsoft.Batch/batchAccounts/msbatchex",
        "resourceGroup": "msbatchexample",
        "resourceName": "msbatchex",
        "resourceType": "Microsoft.Batch/batchAccounts"
      }
    ],
    "mode": "Incremental",
    "outputs": null,
    "parameters": {
      "batchAccounts_name": {
        "type": "String",
        "value": "m

##### Deploy

In [135]:
%%time
!az group deployment create --template-file $template_filename -g $group_name --verbose

Starting long running operation 'Starting group deployment create'
Long running operation 'Starting group deployment create' completed with result {'id': '/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/msbatchexample/providers/Microsoft.Resources/deployments/template', 'name': 'template', 'properties': <azure.mgmt.resource.resources.models.deployment_properties_extended.DeploymentPropertiesExtended object at 0x7fb411618668>}
{
  "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/msbatchexample/providers/Microsoft.Resources/deployments/template",
  "name": "template",
  "properties": {
    "correlationId": "46dbfa1f-1578-4534-acf0-f67a6376b323",
    "debugSetting": null,
    "dependencies": [
      {
        "dependsOn": [
          {
            "id": "/subscriptions/edf507a2-6235-46c5-b560-fd463ba2e771/resourceGroups/msbatchexample/providers/Microsoft.Storage/storageAccounts/msbatchstoreex",
            "resourceGroup": "msb

##### Gather batch account info

In [136]:
json_data = !az batch account keys list -n $batch_account_name -g $group_name
batch_account_key = json.loads(''.join(json_data))['primary']

In [137]:
json_data = !az batch account list -g $group_name
batch_service_url = 'https://'+json.loads(''.join(json_data))[0]['accountEndpoint']

##### Get storage account key

In [138]:
json_data = !az storage account keys list -n $storage_account_name -g $group_name
storage_account_key = json.loads(''.join(json_data))[0]['value']

### Batch configuration files

In [140]:
storage_alias = "mystorageaccount"
storage_endpoint = "core.windows.net"

##### Credentials

In [141]:
credentials = {
    "credentials": {
        "batch": {
            "account": batch_account_name,
            "account_key": batch_account_key,
            "account_service_url": batch_service_url
        },
        "storage": {
            storage_alias : {
                    "account": storage_account_name,
                    "account_key": storage_account_key,
                    "endpoint": storage_endpoint
            }
        },
        "docker_registry": {
            docker_registry_server : {
                    "username": docker_username,
                    "password": docker_password
            }
        }   
    }
}

##### Config

In [142]:
config = {
    "batch_shipyard": {
        "storage_account_settings": storage_alias
    },
    "docker_registry": {
        "private": {
            "allow_public_docker_hub_pull_on_missing": True,
            "server": docker_registry_server
        }
    },
    "global_resources": {
        "docker_images": [
            container_name
        ]
    }
}


##### Pool configuration file

In [143]:
pool={
    "pool_specification": {
        "id": "scikit",
        "vm_size": "STANDARD_NC6",
        "vm_count": 1,
        "publisher": "Canonical",
        "offer": "UbuntuServer",
        "sku": "16.04.0-LTS",
        "ssh": {
            "username": "docker"
        },
        "reboot_on_start_task_failed": False,
        "block_until_all_global_resources_loaded": True,
    }
}


##### Jobs configuration file

In [144]:
jobs = {
    "job_specifications": [
        {
            "id": "cntkjob",
            "tasks": [
                {
                    "id": "run_cifar",  # This should be changed per task
                    "image": container_name,
                    "remove_container_after_exit": True,
                    "command": 'bash -c "source /cntk/activate-cntk;python /code/process_cifar_data.py;ipython /code/cntk_cifar10.py"',
                    "gpu": True,
                }
            ]
        }
    ]
}
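
Since each task id should be unique within a job, a job with several tasks needs a different id for each one. The following sketch generates the task list programmatically; the function `make_tasks` and its id scheme are hypothetical, while the dictionary keys are the ones used in the cell above.

```python
def make_tasks(image, commands, id_prefix="run_cifar"):
    """Build one task dictionary per command, each with a unique id."""
    tasks = []
    for index, command in enumerate(commands):
        tasks.append({
            "id": "%s_%d" % (id_prefix, index),  # unique within the job
            "image": image,
            "remove_container_after_exit": True,
            "command": command,
            "gpu": True,
        })
    return tasks


# Example: jobs["job_specifications"][0]["tasks"] = make_tasks(container_name, commands)
```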

In [82]:
mkdir config

mkdir: cannot create directory ‘config’: File exists


In [145]:
def write_json_to_file(json_dict, filename):
    with open(filename, 'w') as outfile:
        json.dump(json_dict, outfile)

In [146]:
write_json_to_file(credentials, path.join('config', 'credentials.json'))

In [147]:
write_json_to_file(config, path.join('config', 'config.json'))

In [148]:
write_json_to_file(pool, path.join('config', 'pool.json'))

In [149]:
write_json_to_file(jobs, path.join('config', 'jobs.json'))
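
Note that `credentials.json` holds the Batch, storage and registry keys in plain text. On a shared machine it may be worth restricting who can read it. The sketch below is one way to do that on a Unix system; `write_private_json` is a hypothetical variant of `write_json_to_file`.

```python
import json
import os


def write_private_json(json_dict, filename):
    """Write JSON to `filename`, readable and writable by the owner only."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    fd = os.open(filename, flags, 0o600)
    with os.fdopen(fd, "w") as outfile:
        json.dump(json_dict, outfile)
    # The mode passed to os.open only applies when the file is created,
    # so tighten the permissions of a file that already existed.
    os.chmod(filename, 0o600)


# Example: write_private_json(credentials, os.path.join("config", "credentials.json"))
```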

### Execute Configuration

Create the pool based on the configuration files we created earlier.

In [151]:
%%bash --bg --proc pool_proc
batch-shipyard/shipyard pool add --yes --configdir config

Starting job # 2 in a separate thread.


Wait a bit before adding the jobs to give the VM a chance to spin up.

Check status of the pool

In [167]:
!$batchshipyard pool list --configdir config

2017-02-23 19:18:44,345Z INFO convoy.batch:list_pools:556 pool_id=scikit [state=PoolState.active allocation_state=AllocationState.steady vm_size=standard_nc6, vm_count=1 target_vm_count=1]
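
Rather than waiting by hand, the status check can be repeated until the pool looks ready. The sketch below parses the log line format shown above, which is an assumption that may not hold for other Batch Shipyard versions. The names `pool_is_ready` and `wait_for_pool` are our own. A steady allocation state does not by itself guarantee that the node has finished its start task, so treat a positive result as a hint.

```python
import re
import subprocess
import time

POOL_LINE = re.compile(
    r"pool_id=(?P<pool_id>\S+) \[state=PoolState\.(?P<state>\w+) "
    r"allocation_state=AllocationState\.(?P<allocation>\w+) .*?"
    r"vm_count=(?P<current>\d+) target_vm_count=(?P<target>\d+)\]"
)


def pool_is_ready(output, pool_id):
    """True if `pool_id` is active, steady, and at its target VM count."""
    for match in POOL_LINE.finditer(output):
        if match.group("pool_id") != pool_id:
            continue
        return (
            match.group("state") == "active"
            and match.group("allocation") == "steady"
            and match.group("current") == match.group("target")
        )
    return False


def wait_for_pool(pool_id, shipyard="batch-shipyard/shipyard",
                  delay=30, attempts=40):
    """Poll `pool list` until the pool is ready or the attempts run out."""
    for _ in range(attempts):
        result = subprocess.run(
            [shipyard, "pool", "list", "--configdir", "config"],
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # capture log lines from either stream
            universal_newlines=True,
        )
        if pool_is_ready(result.stdout, pool_id):
            return True
        time.sleep(delay)
    return False


# Example: wait_for_pool("scikit")
```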


##### Add the job to the queue 

In [170]:
%%time
!$batchshipyard jobs add --configdir config --tail stdout.txt

2017-02-23 19:37:00,192Z INFO convoy.batch:add_jobs:1722 Adding job: cntkjob
2017-02-23 19:37:00,545Z DEBUG convoy.storage:upload_resource_files:333 remote file is the same for shipyardtaskrf-cntkjob/run_cifar.shipyard.envlist, skipping
2017-02-23 19:37:00,546Z INFO convoy.batch:add_jobs:1921 Adding task: run_cifar
2017-02-23 19:37:00,621Z DEBUG convoy.batch:stream_file_and_wait_for_task:1192 attempting to stream file stdout.txt from job=cntkjob task=run_cifar

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Done.
Extracting files...
Done.
Preparing train set...
Done.
Preparing test set...
Done.
Writing train text file...
Done.
Writing test text file...
Done.
Converting tra

##### Stream stdout and stderr 

In [159]:
!$batchshipyard data stream --configdir config --filespec cntkjob,run_cifar,stdout.txt

2017-02-23 14:08:39,292Z DEBUG convoy.batch:stream_file_and_wait_for_task:1192 attempting to stream file stdout.txt from job=cntkjob task=run_cifar

************************************************************
CNTK is activated.

Please checkout tutorials and examples here:
  /cntk/Tutorials
  /cntk/Examples

To deactivate the environment run

  source /root/anaconda3/bin/deactivate

************************************************************
Downloading http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
Done.
Extracting files...
Done.
Preparing train set...
Done.
Preparing test set...
Done.
Writing train text file...
Done.
Writing test text file...
Done.
Converting train data to png images...
Done.
Converting test data to png images...
Done.
Training 145578 parameters in 10 parameter tensors.
Finished Epoch[1 of 300]: [Training] loss = 1.789279 * 50000, metric = 65.5% * 50000 6.931s (7213.8 samples per second)
Finished Epoch[2 of 300]: [Training] loss = 1.439543 * 50000, metric =
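
The per-epoch progress lines in the streamed output are regular enough to parse, which is handy for plotting the loss after a run. A minimal sketch, assuming the line format shown above; the function `parse_epochs` is hypothetical and returns the metric exactly as reported, without interpreting it.

```python
import re

EPOCH_LINE = re.compile(
    r"Finished Epoch\[(?P<epoch>\d+) of (?P<total>\d+)\]: \[Training\] "
    r"loss = (?P<loss>[\d.]+) \* \d+, metric = (?P<metric>[\d.]+)% \* \d+"
)


def parse_epochs(text):
    """Extract (epoch, loss, metric percent) tuples from progress lines."""
    return [
        (int(m.group("epoch")), float(m.group("loss")), float(m.group("metric")))
        for m in EPOCH_LINE.finditer(text)
    ]


# Example: parse_epochs(open("stdout.txt").read())
```

Lines that are cut off before the metric value, like the last one above, are skipped.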

In [160]:
!$batchshipyard data stream --configdir config --filespec cntkjob,run_cifar,stderr.txt

2017-02-23 14:08:42,330Z DEBUG convoy.batch:stream_file_and_wait_for_task:1192 attempting to stream file stderr.txt from job=cntkjob task=run_cifar



##### Delete job information 

In [171]:
!$batchshipyard jobs del -y --configdir config --wait

2017-02-23 19:45:46,279Z INFO convoy.batch:del_jobs:652 Deleting job: cntkjob
2017-02-23 19:45:46,352Z DEBUG convoy.batch:del_jobs:660 waiting for job cntkjob to delete


##### Delete pool (deallocate VMs)

In [172]:
!$batchshipyard pool del -y --configdir config

2017-02-23 19:46:19,762Z INFO convoy.batch:del_pool:603 Deleting pool: scikit
2017-02-23 19:46:19,903Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyarddht
2017-02-23 19:46:20,002Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyardregistry
2017-02-23 19:46:20,052Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyardtorrentinfo
2017-02-23 19:46:20,060Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyardimages
2017-02-23 19:46:20,081Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyardgr
2017-02-23 19:46:20,101Z DEBUG convoy.storage:_clear_table:413 clearing table (pk=msbatchex$scikit): shipyardperf
2017-02-23 19:46:20,110Z DEBUG convoy.storage:delete_storage_containers:368 deleting container: shipyardrf-msbatchex-scikit
2017-02-23 19:46:20,298Z DEBUG convoy.storage:delete_storage_containers:368 deleting container: shipyard

## Clean up

In [102]:
!az group delete -n $group_name --yes --verbose

Starting long running operation 'Starting group delete'
Long running operation 'Starting group delete' completed with result None


In [103]:
!az group delete -n $docker_registry_group --yes --verbose

Starting long running operation 'Starting group delete'
Long running operation 'Starting group delete' completed with result None
