# Keras DSVM


## Introduction

This recipe shows how to run Keras using Batch AI on DSVM (Data Science Virtual Machine). DSVM supports the tensorflow, cntk and theano backends for running Keras. Currently, only the tensorflow and cntk backends support running on GPU.

## Details

- DSVM has the Keras framework preinstalled;
- The job runs the sample's own scripts (`download_data.py` and `model.py`, configured by `params.json`), which are uploaded to the file share;
- The job preparation step runs `download_data.py` to fetch the training data;
- Standard output of the job will be stored on an Azure File Share.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [71]:
from __future__ import print_function

from datetime import datetime
import sys

from azure.storage.file import FileService
import azure.mgmt.batchai.models as models

# utilities.py contains helper functions used by different notebooks
import utilities

cfg = utilities.Configuration('./configuration.json')
client = utilities.create_batchai_client(cfg)
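
The contents of `utilities.Configuration` are not shown in this recipe. As a rough sketch only, a loader of this kind might read the JSON file, check for the keys it needs, and expose them as attributes. The class name `SimpleConfiguration` and the `REQUIRED_KEYS` list are assumptions, picked from some of the `cfg.*` attributes this notebook reads; the real helper may differ.

```python
import json

# Hypothetical list: a subset of the attributes this notebook reads from `cfg`.
REQUIRED_KEYS = (
    'resource_group', 'location',
    'storage_account_name', 'storage_account_key',
    'admin',
)


class SimpleConfiguration(object):
    """Illustrative stand-in for utilities.Configuration (not the real class)."""

    def __init__(self, values):
        missing = [key for key in REQUIRED_KEYS if key not in values]
        if missing:
            raise ValueError('Missing configuration keys: ' + ', '.join(missing))
        # Expose every JSON key as an attribute, e.g. cfg.location.
        self.__dict__.update(values)

    @classmethod
    def from_file(cls, path):
        # Read the JSON document and validate it in the constructor.
        with open(path) as f:
            return cls(json.load(f))
```

Failing early on a missing key gives a clearer error than an `AttributeError` raised later, in the middle of cluster creation.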

### Create File Share

For this example we will create a new File Share named `batchaisample` under your storage account.

**Note** You don't need to create a new file share for every cluster. We do so in this sample to simplify resource management for you.

In [72]:
azure_file_share_name = 'batchaisample'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)
print('Done')

Done
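
Share creation fails if the name breaks the Azure file share naming rules (3 to 63 characters; lowercase letters, digits and hyphens only; starts and ends with a letter or digit; no consecutive hyphens). A minimal sketch of a local check follows; the function name `is_valid_share_name` is hypothetical and not part of `utilities.py` or the storage SDK.

```python
import re

# 3-63 characters, lowercase alphanumerics and hyphens, alphanumeric at both
# ends, and no two hyphens in a row (rejected by the negative lookahead).
_SHARE_NAME = re.compile(r'^(?!.*--)[a-z0-9][a-z0-9-]{1,61}[a-z0-9]$')


def is_valid_share_name(name):
    """Return True if `name` satisfies the share naming rules listed above."""
    return _SHARE_NAME.match(name) is not None
```

Checking the name locally avoids a round trip to the storage service for an obviously invalid value.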


### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC6` nodes. The cluster size is controlled by the auto-scale settings (1 initial node, scaling between 1 and 2 nodes); note that the `nodes_count` variable defined below is not passed to the cluster parameters;
- We will mount the file share at a folder named `external`. The full path of this folder on a compute node will be `$AZ_BATCHAI_MOUNT_ROOT/external`;
- We will call the cluster `nc6`.


So, the cluster will have the following parameters:

In [73]:
azure_file_share = 'external'
nodes_count = 2
cluster_name = 'nc6'

volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url = 'https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share)
    ]
)

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size="STANDARD_NC6",
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="microsoft-ads",
            offer="linux-data-science-vm-ubuntu",
            sku="linuxdsvmubuntu",
            version="latest")),
    scale_settings=models.ScaleSettings(
        auto_scale=models.AutoScaleSettings(initial_node_count=1, minimum_node_count=1, maximum_node_count=2)
    ),
    node_setup=models.NodeSetup(
        mount_volumes=volumes
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password,
        admin_user_ssh_public_key=cfg.admin_ssh_key
    )
)
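
The two strings that tie the file share to the cluster are the share URL and the mount path. The sketch below isolates how they are built so they can be checked on their own; the helper names `file_share_url` and `mounted_path` are hypothetical, while the string formats are the ones used in this recipe.

```python
def file_share_url(account_name, share_name):
    """Build the Azure Files URL in the form used for `azure_file_url` above."""
    return 'https://{0}.file.core.windows.net/{1}'.format(account_name, share_name)


def mounted_path(relative_mount_path, *parts):
    """Build a path under the mount root as seen on a compute node.

    The $AZ_BATCHAI_MOUNT_ROOT variable is left unexpanded on purpose:
    it is only resolved on the node itself.
    """
    return '/'.join(('$AZ_BATCHAI_MOUNT_ROOT', relative_mount_path) + parts)
```

For example, `mounted_path('external', 'learner-attrition')` gives `$AZ_BATCHAI_MOUNT_ROOT/external/learner-attrition`.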

### Create Compute Cluster

In [74]:
cluster = client.clusters.create(cfg.resource_group, cluster_name, parameters).result()

### Monitor Cluster Creation

`utilities.py` contains a helper function that prints the current cluster status. Re-run the cell until the cluster becomes available, that is, until all nodes are allocated and have finished preparation.

In [75]:
cluster = client.clusters.get(cfg.resource_group, cluster_name)
utilities.print_cluster_status(cluster)

Cluster state: AllocationState.steady Allocated: 1; Idle: 1; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
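
A status check like this can be turned into a wait loop by polling until a condition holds. The sketch below is generic and does not call the Batch AI SDK: `wait_for`, `get_state` and `is_ready` are hypothetical names, and the caller supplies both callables.

```python
import time


def wait_for(get_state, is_ready, timeout_sec=600, poll_interval_sec=5,
             sleep=time.sleep):
    """Poll get_state() until is_ready(state) is true, then return the state.

    Raises RuntimeError once roughly timeout_sec seconds have been spent
    sleeping. `sleep` is injectable so the loop can be tested without delays.
    """
    waited = 0
    while True:
        state = get_state()
        if is_ready(state):
            return state
        if waited >= timeout_sec:
            raise RuntimeError('Timed out after {0} seconds'.format(waited))
        sleep(poll_interval_sec)
        waited += poll_interval_sec
```

In this notebook, `get_state` could be `lambda: client.clusters.get(cfg.resource_group, cluster_name)`. A suitable `is_ready` depends on the fields of the returned cluster object, which this recipe does not show, so it is left to the reader.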


### Deploy Sample Script and Configure the Input Directories

- For each job we will create a folder containing copies of the sample files (`download_data.py`, `model.py` and `params.json`). This allows each job to have its own copy of the scripts (in case you would like to change them).

In [116]:
keras_sample_dir = "learner-attrition"
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, keras_sample_dir, fail_on_exist=False)
service.create_file_from_path(
    azure_file_share_name, keras_sample_dir, 'download_data.py', 'download_data.py')
service.create_file_from_path(
    azure_file_share_name, keras_sample_dir, 'model.py', 'model.py')
service.create_file_from_path(
    azure_file_share_name, keras_sample_dir, 'params.json', 'params.json')
print('Done')

Done
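
The three uploads above follow the same pattern, so they can also be driven from a list. The sketch below only prepares and validates the list of files; `plan_uploads` is a hypothetical helper, and the commented usage reuses the `create_file_from_path` call exactly as it appears in the cell above.

```python
import os


def plan_uploads(local_dir, file_names):
    """Return (file_name, local_path) pairs, failing early if a file is missing."""
    plan = [(name, os.path.join(local_dir, name)) for name in file_names]
    missing = [path for _, path in plan if not os.path.isfile(path)]
    if missing:
        raise IOError('Missing local files: ' + ', '.join(missing))
    return plan


# Usage in this notebook (requires `service` and the variables defined above):
# for name, path in plan_uploads('.', ['download_data.py', 'model.py', 'params.json']):
#     service.create_file_from_path(azure_file_share_name, keras_sample_dir, name, path)
```

Validating all paths before the first upload avoids leaving a partially populated directory on the share.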


- The job needs to know where to find the `download_data.py` and `model.py` scripts and the `params.json` file. So, we will configure an input directory for them:

In [117]:
input_directories = [
    models.InputDirectory(
        id='SCRIPT',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, keras_sample_dir))
]

The job will be able to reference this directory using the `$AZ_BATCHAI_INPUT_SCRIPT` environment variable.
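
The variable name is derived from the input directory `id`, and its value is the configured path with `$AZ_BATCHAI_MOUNT_ROOT` resolved on the node. The sketch below mimics that mapping locally; `input_env_var` and `resolve_on_node` are hypothetical helpers, and the mount root value in the usage note is taken from the traceback shown later in this notebook, so it may differ on other clusters.

```python
from string import Template


def input_env_var(directory_id):
    """Name of the environment variable for an input directory id."""
    return 'AZ_BATCHAI_INPUT_' + directory_id


def resolve_on_node(path, environment):
    """Expand $VARIABLES in `path` using an explicit environment mapping.

    Template.substitute raises KeyError for an unknown variable, which
    makes a misspelled variable name fail loudly.
    """
    return Template(path).substitute(environment)
```

With `{'AZ_BATCHAI_MOUNT_ROOT': '/mnt/batch/tasks/shared/LS_root/mounts'}` as the environment, the path configured above resolves to `/mnt/batch/tasks/shared/LS_root/mounts/external/learner-attrition`.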

### Configure Output Directories
We will store the standard output and error streams of the job in the File Share:

In [118]:
std_output_path_prefix = "$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share)

### Configure Job

- The job will use the input and output directories configured above;
- Before the job starts, a job preparation step will run `download_data.py`;
- The job will run `model.py` from the SCRIPT input directory using the custom toolkit settings;
- Keras will use the cntk backend. DSVM supports the cntk, tensorflow and theano backends for Keras; change `KERAS_BACKEND` to "tensorflow" or "theano" to use the corresponding backend. Note that the theano backend will run on CPU;
- The job will write its standard output and error streams to the file share.
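
Because the backend is selected through the `KERAS_BACKEND` variable at the start of the command line, switching backends is a one-word change. A minimal sketch of building that command line with a check on the backend name; `keras_command_line` and `SUPPORTED_BACKENDS` are hypothetical names, not part of the Batch AI SDK.

```python
# Backends that this recipe says are available on DSVM.
SUPPORTED_BACKENDS = ('cntk', 'tensorflow', 'theano')


def keras_command_line(backend, script_path, script_args=''):
    """Return a shell command that runs `script_path` with the given backend."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError('Unsupported Keras backend: {0!r}'.format(backend))
    command = 'KERAS_BACKEND={0} python {1}'.format(backend, script_path)
    if script_args:
        command += ' ' + script_args
    return command
```

For example, `keras_command_line('cntk', '$AZ_BATCHAI_INPUT_SCRIPT/model.py', '--num-epochs 1')` returns `KERAS_BACKEND=cntk python $AZ_BATCHAI_INPUT_SCRIPT/model.py --num-epochs 1`.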


In [119]:
job_name = datetime.utcnow().strftime("la_%m_%d_%Y_%H%M%S")
parameters = models.job_create_parameters.JobCreateParameters(
    location=cfg.location,
    cluster=models.ResourceId(cluster.id),
    node_count=1,
    input_directories=input_directories,
    std_out_err_path_prefix=std_output_path_prefix,
    job_preparation=models.JobPreparation(
        command_line="LC_ALL=C.UTF-8 LANG=C.UTF-8 python $AZ_BATCHAI_INPUT_SCRIPT/download_data.py --az-tenant-id {} --az-sp-client-id {} --az-sp-client-secret {} --datalake-store-name {}".format(
            cfg.aad_tenant_id, cfg.aad_client_id, cfg.aad_secret_key, cfg.datalake_store_name
        )
    ),
    custom_toolkit_settings = models.CustomToolkitSettings(
        command_line='KERAS_BACKEND=cntk python $AZ_BATCHAI_INPUT_SCRIPT/model.py --course-id Microsoft+DAT206x+4T2017 --train True --num-epochs 1 --batch-size 256 --positive-upweight 3 --lr 0.01 --layers-config-file $AZ_BATCHAI_INPUT_SCRIPT/params.json'
    )
)


### Create a Training Job


In [120]:
job = client.jobs.create(cfg.resource_group, job_name, parameters).result()
print('Created Job: {}'.format(job_name))

Created Job: la_01_28_2018_062241
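
The job name embeds a UTC timestamp so that repeated submissions do not collide. A small sketch of the same naming scheme as a function that takes the time as an argument, which makes it easy to check; `make_job_name` is a hypothetical helper.

```python
from datetime import datetime


def make_job_name(now, prefix='la'):
    """Return '<prefix>_MM_DD_YYYY_HHMMSS' for the given datetime."""
    return '{0}_{1}'.format(prefix, now.strftime('%m_%d_%Y_%H%M%S'))
```

For instance, `make_job_name(datetime(2018, 1, 28, 6, 22, 41))` gives `la_01_28_2018_062241`, the name shown in the output above.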


### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start running and prints the cluster state while waiting. While the job is running, the code prints the current content of stdout.txt.

**Note** Execution may take several minutes to complete.

In [121]:
utilities.wait_for_job_completion(client, cfg.resource_group, job_name, cluster_name, 'stdouterr', 'stdout.txt')

Cluster state: AllocationState.steady Allocated: 1; Idle: 0; Unusable: 0; Running: 1; Preparing: 0; Leaving: 0
Job state: running ExitCode: None
Waiting for job output to become available...
STARTING
ARGS ALL GOOD
GETTING DATA: 
model_data.csv does not exist for course:  Microsoft+DAT206x+JPN+1T2017
model_data.csv does not exist for course:  top_course_ids.txt
model_data.csv does not exist for course:  Microsoft+DAT207x+1T2018
model_data.csv does not exist for course:  Microsoft+DAT205x+3T2016
model_data.csv does not exist for course:  Microsoft+DAT206x+1T2018
model_data.csv does not exist for course:  Microsoft+DAT215.4x+1T2017
model_data.csv does not exist for course:  Microsoft+DAT206x+6T2016
model_data.csv does not exist for course:  Microsoft+DAT215.3x+3T2017
model_data.csv does not exist for course:  _SUCCESS
model_data.csv does not exist for course:  Microsoft+DAT206x+JPN+2T2017
Training data done.
Done.
Job state: failed ExitCode: 1
FailureDetails: 
ErrorCode:JobFailed
ErrorMes

### Download stdout.txt and stderr.txt files for the Job

In [122]:
files = client.jobs.list_output_files(cfg.resource_group, job_name, models.JobsListOutputFilesOptions("stdOuterr")) 
for f in list(files):
    utilities.download_file(f.download_url, f.name)
print("All files downloaded")

Downloading https://learnerattrition.file.core.windows.net/batchaisample/fd74930d-c060-4ff4-a7f1-9470f7ad7f8f/learner-attrition-supp/jobs/la_01_28_2018_062241/bce88626-cd3f-441a-8ef1-7a497febfea6/stderr-job_prep.txt?sv=2016-05-31&sr=f&sig=KGu9I%2F9L7Qe6rQp%2BWdLZERskvi05tzD1x%2FBVcOxTGsA%3D&se=2018-01-28T07%3A25%3A01Z&sp=rl ...Done
Downloading https://learnerattrition.file.core.windows.net/batchaisample/fd74930d-c060-4ff4-a7f1-9470f7ad7f8f/learner-attrition-supp/jobs/la_01_28_2018_062241/bce88626-cd3f-441a-8ef1-7a497febfea6/stderr.txt?sv=2016-05-31&sr=f&sig=cUmrsu8liWMYY7MPs7e73F4lKKSkCiqRMiKlsBWPpr0%3D&se=2018-01-28T07%3A25%3A01Z&sp=rl ...Done
Downloading https://learnerattrition.file.core.windows.net/batchaisample/fd74930d-c060-4ff4-a7f1-9470f7ad7f8f/learner-attrition-supp/jobs/la_01_28_2018_062241/bce88626-cd3f-441a-8ef1-7a497febfea6/stdout-job_prep.txt?sv=2016-05-31&sr=f&sig=NwBmyNKOsYnaRh5U9fkKplfLwdZcZbpOGxRYHYRScUQ%3D&se=2018-01-28T07%3A25%3A01Z&sp=rl ...Done
Downloading https:/
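
The download URLs carry a shared access signature whose `se` query parameter is the expiry time, so they stop working after that moment. A minimal sketch for reading the expiry from such a URL, assuming the timestamp format seen in the output above and Python 3; `sas_expiry` and `is_expired` are hypothetical helpers.

```python
from datetime import datetime
from urllib.parse import parse_qs, urlparse


def sas_expiry(download_url):
    """Return the `se` (signed expiry) time as a naive UTC datetime, or None."""
    query = parse_qs(urlparse(download_url).query)
    values = query.get('se')
    if not values:
        return None
    # parse_qs has already percent-decoded the value, e.g. %3A -> ':'.
    return datetime.strptime(values[0], '%Y-%m-%dT%H:%M:%SZ')


def is_expired(download_url, now):
    """True if the URL has an expiry time and `now` (UTC) is at or past it."""
    expiry = sas_expiry(download_url)
    return expiry is not None and now >= expiry
```

If a download fails some time after listing the files, listing them again should return URLs with a fresh expiry.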

In [123]:
print('stderr.txt content:')
with open('stderr.txt') as f:
    print(f.read())

stderr.txt content:
bash: /mnt/batch/tasks/workitems/la_01_28_2018_062241_bce88626-cd3f-441a-8ef1-7a497febfea6/job-1/la_01_28_2018_062241_bce88626-cd3f-441a-8ef1-7a497febfea6/wd/.bashrc: No such file or directory
Selected GPU[0] Tesla K80 as the process wide default device.
Using CNTK backend
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/mounts/external/learner-attrition/model.py", line 260, in <module>
    run_model(args['course_id'], args['train'], args['num_epochs'], args['batch_size'], args['positive_upweight'], args['lr'], args['layers_config_file'])
  File "/mnt/batch/tasks/shared/LS_root/mounts/external/learner-attrition/model.py", line 177, in run_model
    adam = optimizers.Adam(lr=learning_rate)
  File "/anaconda/envs/py35/lib/python3.5/site-packages/keras/optimizers.py", line 409, in __init__
    self.decay = K.variable(decay, name='decay')
  File "/anaconda/envs/py35/lib/python3.5/contextlib.py", line 77, in __exit__
    self.gen.throw(type, val

### Delete the Job

In [None]:
_ = client.jobs.delete(cfg.resource_group, job_name)

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cluster_name)

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)