# Keras DSVM Deep Learning PaaS


## Introduction

This recipe shows how to run Keras using Batch AI on a Data Science Virtual Machine (DSVM). The DSVM supports the tensorflow, cntk and theano backends for running Keras. Currently only the tensorflow and cntk backends support running on GPU.

## Details

- The DSVM has the Keras framework preinstalled;
- The sample code uses a simple text classification model built with GloVe + bidirectional LSTM;
- The script downloads the IMDB dataset on its own;
- Standard output of the job will be stored on Azure File Share.

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create the configuration file.

### Read Configuration and Create Batch AI client

In [1]:
from __future__ import print_function

import time
from datetime import datetime
import os
import sys
import zipfile

from azure.storage.file import FileService, FilePermissions
import azure.mgmt.batchai.models as models

# utilities.py contains helper functions used by different notebooks
sys.path.append('./')
import utilities

cfg = utilities.Configuration('configuration.json')
client = utilities.create_batchai_client(cfg)
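
The contents of `utilities.py` are not shown in this recipe, so the following is only a minimal sketch of what a configuration loader of this kind might look like. The class name `SketchConfiguration`, the function `load_configuration`, and the assumption that the JSON keys match the `cfg` attribute names used in this notebook are all hypothetical; the real `utilities.Configuration` may differ.

```python
import json

# Keys mirror the cfg attributes used in this recipe. Treating the last two
# as optional is an assumption of this sketch, not a documented rule.
REQUIRED_KEYS = ('resource_group', 'location', 'storage_account_name',
                 'storage_account_key', 'admin')
OPTIONAL_KEYS = ('admin_password', 'admin_ssh_key')


class SketchConfiguration(object):
    """Hypothetical stand-in for utilities.Configuration."""

    def __init__(self, values):
        # Fail early with a readable message instead of an AttributeError later.
        missing = [key for key in REQUIRED_KEYS if not values.get(key)]
        if missing:
            raise ValueError('configuration is missing: ' + ', '.join(missing))
        for key in REQUIRED_KEYS + OPTIONAL_KEYS:
            setattr(self, key, values.get(key))


def load_configuration(path):
    """Read a JSON file and expose its keys as attributes."""
    with open(path) as f:
        return SketchConfiguration(json.load(f))
```

Checking for missing keys up front makes a misconfigured `configuration.json` fail before any Azure resources are created.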

### Create File Share

For this example we will create a new File Share named `kimhoondong352` under your storage account.

**Note** You don't need to create a new file share for every cluster. We are doing this in this sample to simplify resource management for you.

In [3]:
azure_file_share_name = 'kimhoondong352'
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_share(azure_file_share_name, fail_on_exist=False)

True
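
Share creation fails if the name does not follow the Azure file share naming rules (3 to 63 characters; lowercase letters, digits and hyphens; starting and ending with a letter or digit; no consecutive hyphens). As a hedged sketch, a name can be checked locally before calling `create_share`. The helper `is_valid_share_name` is hypothetical and not part of the Azure SDK or `utilities.py`.

```python
import re

# A letter or digit first, then letters/digits, or a single hyphen that is
# always followed by a letter or digit (so no "--" and no trailing "-").
_SHARE_NAME = re.compile(r'^[a-z0-9](?:[a-z0-9]|-(?=[a-z0-9]))*$')


def is_valid_share_name(name):
    """Return True if name satisfies the share naming rules listed above."""
    return 3 <= len(name) <= 63 and _SHARE_NAME.match(name) is not None


for candidate in ('kimhoondong352', 'Bad_Name', 'a--b'):
    print(candidate, is_valid_share_name(candidate))
```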

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC6` nodes. The number of nodes in the cluster is configured with the `nodes_count` variable;
- We will mount the file share at a folder named `external`. The full path of this folder on a compute node will be `$AZ_BATCHAI_MOUNT_ROOT/external`;
- We will call the cluster `nc6-352`.


So, the cluster will have the following parameters:

In [4]:
azure_file_share = 'external'
nodes_count = 1
cluster_name = 'nc6-352'

volumes = models.MountVolumes(
    azure_file_shares=[
        models.AzureFileShareReference(
            account_name=cfg.storage_account_name,
            credentials=models.AzureStorageCredentialsInfo(
                account_key=cfg.storage_account_key),
            azure_file_url = 'https://{0}.file.core.windows.net/{1}'.format(
                cfg.storage_account_name, azure_file_share_name),
            relative_mount_path=azure_file_share)
    ]
)

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size="STANDARD_NC6",
    virtual_machine_configuration=models.VirtualMachineConfiguration(
        image_reference=models.ImageReference(
            publisher="microsoft-ads",
            offer="linux-data-science-vm-ubuntu",
            sku="linuxdsvmubuntu",
            version="latest")),
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    node_setup=models.NodeSetup(
        mount_volumes=volumes
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password,
        admin_user_ssh_public_key=cfg.admin_ssh_key
    )
)
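
To make the relationship between the share URL and the mount location concrete, here is a small sketch that composes both strings. The helper names `file_share_url` and `mounted_path` are hypothetical, the storage account name is a placeholder, and the example mount root is the one that appears in the path comment later in this notebook; on a real node the value comes from the `AZ_BATCHAI_MOUNT_ROOT` environment variable and may differ.

```python
import posixpath


def file_share_url(account_name, share_name):
    """Build the share URL in the same format used for azure_file_url."""
    return 'https://{0}.file.core.windows.net/{1}'.format(
        account_name, share_name)


def mounted_path(mount_root, relative_mount_path, *parts):
    """Join the mount root, the relative mount path and any sub-folders."""
    return posixpath.join(mount_root, relative_mount_path, *parts)


# Assumed value, for illustration only.
example_root = '/mnt/batch/tasks/shared/LS_root/mounts'
print(file_share_url('mystorageaccount', 'kimhoondong352'))
print(mounted_path(example_root, 'external', 'trainall_script'))
```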

### Create Compute Cluster

In [5]:
_ = client.clusters.create(cfg.resource_group, cluster_name, parameters)

### Monitor Cluster Creation

`utilities.py` contains helper functions for monitoring the cluster. Here we print the current cluster status; the cluster is available once all nodes are allocated and have finished preparation.

In [6]:
cluster = client.clusters.get(cfg.resource_group, cluster_name)
utilities.print_cluster_status(cluster)

Cluster state: AllocationState.steady Target: 1; Allocated: 1; Idle: 1; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0
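
The status line above can also be checked programmatically. The sketch below parses a line in exactly the format printed here and applies one possible readiness rule (steady state, all target nodes allocated, none still preparing). Both `parse_cluster_status` and `is_cluster_ready` are hypothetical helpers, and the readiness rule is an assumption based on the description in this section.

```python
import re


def parse_cluster_status(line):
    """Turn a printed status line into a state string and a dict of counts."""
    state = re.search(r'Cluster state:\s*(\S+)', line)
    counts = {name: int(value)
              for name, value in re.findall(r'(\w+):\s*(\d+)', line)}
    return {'state': state.group(1) if state else None, 'counts': counts}


def is_cluster_ready(status):
    """Assumed rule: steady, fully allocated, and no node still preparing."""
    counts = status['counts']
    return (status['state'] is not None
            and status['state'].endswith('steady')
            and counts.get('Allocated') == counts.get('Target')
            and counts.get('Preparing', 0) == 0)


line = ('Cluster state: AllocationState.steady Target: 1; Allocated: 1; '
        'Idle: 1; Unusable: 0; Running: 0; Preparing: 0; Leaving: 0')
print(is_cluster_ready(parse_cluster_status(line)))
```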


### Deploy Sample Script and Configure the Input Directories


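We upload the training script to a directory on the file share and pass that directory to the job as an input directory. The job will be able to reference it using the `$AZ_BATCHAI_INPUT_SCRIPTS` environment variable.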

In [8]:
script_directory = 'trainall_script'
#script_file = 'GloveBidirectionalLSTM_v3_1.py'
script_file = 'BidirectionalLSTM_v2_1_1host.py'

#dataset_file1 = 'glove.6B.100d.txt'
#dataset_file2 = 'labeledTrainData.tsv'

service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.create_directory(
    azure_file_share_name, script_directory, fail_on_exist=False)

#Script file
service.create_file_from_path(
    azure_file_share_name, script_directory, script_file, script_file)

#Dataset file
#service.create_file_from_path(
#    azure_file_share_name, script_directory, dataset_file1, dataset_file1)
#service.create_file_from_path(
#    azure_file_share_name, script_directory, dataset_file2, dataset_file2)

In [9]:
input_directories = [
    models.InputDirectory(
        id='SCRIPTS',
        path='$AZ_BATCHAI_MOUNT_ROOT/{0}/{1}'.format(azure_file_share, script_directory))
]

### Configure Output Directories
We will store the standard output and error streams of the job on the file share:

In [10]:
std_output_path_prefix = "$AZ_BATCHAI_MOUNT_ROOT/{0}".format(azure_file_share)

### Configure Job

- Will use the previously configured input and output directories;
- Will run `BidirectionalLSTM_v2_1_1host.py` from the SCRIPTS input directory using the custom toolkit settings;
- Keras will use the tensorflow backend; the DSVM supports cntk, tensorflow and theano backends for Keras, so just change `KERAS_BACKEND` to "cntk" or "theano" to use the corresponding backend. Note that the theano backend will run on CPU;
- Will output standard output and error streams to the file share.


In [11]:
job_name = datetime.utcnow().strftime("azure_batch_ai_keras_351_%m_%d_%Y_%H%M%S")
parameters = models.job_create_parameters.JobCreateParameters(
     location=cfg.location,
     cluster=models.ResourceId(cluster.id),
     node_count=nodes_count,
     input_directories=input_directories,
     std_out_err_path_prefix=std_output_path_prefix,
     custom_toolkit_settings = models.CustomToolkitSettings(
         command_line='KERAS_BACKEND=tensorflow python $AZ_BATCHAI_INPUT_SCRIPTS/'+script_file))
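
The command line above combines two conventions used in this recipe: the backend is selected through the `KERAS_BACKEND` environment variable, and the input directory with id `SCRIPTS` is reached through `$AZ_BATCHAI_INPUT_SCRIPTS`. As a rough sketch, the string could be assembled by a small helper that also rejects unknown backend names. `keras_command_line` and `input_directory_variable` are hypothetical names, not part of the Batch AI SDK.

```python
# Backends named in this recipe.
SUPPORTED_BACKENDS = ('tensorflow', 'cntk', 'theano')


def input_directory_variable(directory_id):
    """Name of the environment variable for an input directory id."""
    return '$AZ_BATCHAI_INPUT_' + directory_id


def keras_command_line(backend, script_file, directory_id='SCRIPTS'):
    """Build the job command line for the chosen Keras backend."""
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError('unsupported Keras backend: {0}'.format(backend))
    return 'KERAS_BACKEND={0} python {1}/{2}'.format(
        backend, input_directory_variable(directory_id), script_file)


print(keras_command_line('tensorflow', 'BidirectionalLSTM_v2_1_1host.py'))
```

Validating the backend name locally catches a typo before the job is submitted, rather than after it fails on the cluster.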

In [38]:
## /mnt/batch/tasks/shared/LS_root/mounts/external/trainall_script/

### Create a Training Job


In [12]:
_ = client.jobs.create(cfg.resource_group, job_name, parameters)
print('Created Job: {}'.format(job_name))

Created Job: azure_batch_ai_keras_351_01_27_2018_082235


### Wait for Job to Finish
The job will start running when the cluster has enough idle nodes. The following code waits for the job to start running, printing the cluster state while it waits. While the job runs, the code prints the current content of stdout.txt.

**Note** Execution may take several minutes to complete.

In [13]:
utilities.wait_for_job_completion(client, cfg.resource_group, job_name, cluster_name, 'stdouterr', 'stdout.txt')

Cluster state: AllocationState.steady Target: 1; Allocated: 1; Idle: 0; Unusable: 0; Running: 1; Preparing: 0; Leaving: 0
Job state: running ExitCode: None
Waiting for job output to become available...
Loading data...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz

    8192/17464789 [..............................] - ETA: 4s
 1589248/17464789 [=>............................] - ETA: 0s
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4

   32/25000 [..............................] - ETA: 2:35:54 - loss: 0.6927 - acc: 0.5000
   64/25000 [..............................] - ETA: 1:19:55 - loss: 0.6932 - acc: 0.5000
   96/25000 [..............................] - ETA: 54:48 - loss: 0.6930 - acc: 0.5000  
  128/25000 [..............................] - ETA: 41:56 - loss: 0.6925 - acc: 0.4844
  160/25000 [........

Epoch 2/4

   32/25000 [..............................] - ETA: 3:26 - loss: 0.1393 - acc: 1.0000
   64/25000 [..............................] - ETA: 3:25 - loss: 0.1482 - acc: 0.9844
   96/25000 [..............................] - ETA: 3:23 - loss: 0.1585 - acc: 0.9479
  128/25000 [..............................] - ETA: 3:24 - loss: 0.1669 - acc: 0.9453
  160/25000 [..............................] - ETA: 3:23 - loss: 0.1914 - acc: 0.9313
  192/25000 [..............................] - ETA: 3:28 - loss: 0.1953 - acc: 0.9271
  224/25000 [..............................] - ETA: 3:27 - loss: 0.2182 - acc: 0.9196
  256/25000 [..............................] - ETA: 3:28 - loss: 0.2240 - acc: 0.9102
  288/25000 [..............................] - ETA: 3:27 - loss: 0.2219 - acc: 0.9132
  320/25000 [..............................] - ETA: 3:26 - loss: 0.2232 - acc: 0.9156
  352/25000 [..............................] - ETA: 3:25 - loss: 0.2291 - acc: 0.9119
  384/25000 [..............................

 3264/25000 [==>...........................] - ETA: 3:07 - loss: 0.2220 - acc: 0.9225
 3296/25000 [==>...........................] - ETA: 3:07 - loss: 0.2207 - acc: 0.9229
 3328/25000 [==>...........................] - ETA: 3:06 - loss: 0.2194 - acc: 0.9234
 3360/25000 [===>..........................] - ETA: 3:06 - loss: 0.2185 - acc: 0.9235
 3392/25000 [===>..........................] - ETA: 3:06 - loss: 0.2188 - acc: 0.9233
 3424/25000 [===>..........................] - ETA: 3:05 - loss: 0.2189 - acc: 0.9232
 3456/25000 [===>..........................] - ETA: 3:05 - loss: 0.2192 - acc: 0.9230
 3488/25000 [===>..........................] - ETA: 3:04 - loss: 0.2192 - acc: 0.9226
 3520/25000 [===>..........................] - ETA: 3:04 - loss: 0.2185 - acc: 0.9227
 3552/25000 [===>..........................] - ETA: 3:04 - loss: 0.2197 - acc: 0.9229
 3584/25000 [===>..........................] - ETA: 3:03 - loss: 0.2187 - acc: 0.9233
 3616/25000 [===>..........................] - ETA: 3:

Epoch 3/4

   32/25000 [..............................] - ETA: 3:52 - loss: 0.1961 - acc: 0.9375
   64/25000 [..............................] - ETA: 3:38 - loss: 0.1482 - acc: 0.9531
   96/25000 [..............................] - ETA: 3:38 - loss: 0.1403 - acc: 0.9583
  128/25000 [..............................] - ETA: 3:40 - loss: 0.1357 - acc: 0.9688
  160/25000 [..............................] - ETA: 3:40 - loss: 0.1231 - acc: 0.9688
  192/25000 [..............................] - ETA: 3:37 - loss: 0.1227 - acc: 0.9635
  224/25000 [..............................] - ETA: 3:34 - loss: 0.1274 - acc: 0.9643
  256/25000 [..............................] - ETA: 3:32 - loss: 0.1323 - acc: 0.9609
  288/25000 [..............................] - ETA: 3:31 - loss: 0.1333 - acc: 0.9583
  320/25000 [..............................] - ETA: 3:29 - loss: 0.1547 - acc: 0.9437
  352/25000 [..............................] - ETA: 3:28 - loss: 0.1682 - acc: 0.9318
  384/25000 [..............................

 1056/25000 [>.............................] - ETA: 3:25 - loss: 0.1334 - acc: 0.9555
 1088/25000 [>.............................] - ETA: 3:24 - loss: 0.1317 - acc: 0.9559
 1120/25000 [>.............................] - ETA: 3:23 - loss: 0.1337 - acc: 0.9545
 1152/25000 [>.............................] - ETA: 3:23 - loss: 0.1327 - acc: 0.9540
 1184/25000 [>.............................] - ETA: 3:22 - loss: 0.1299 - acc: 0.9552
 1216/25000 [>.............................] - ETA: 3:22 - loss: 0.1307 - acc: 0.9539
 1248/25000 [>.............................] - ETA: 3:22 - loss: 0.1283 - acc: 0.9551
 1280/25000 [>.............................] - ETA: 3:21 - loss: 0.1264 - acc: 0.9555
 1312/25000 [>.............................] - ETA: 3:21 - loss: 0.1270 - acc: 0.9550
 1344/25000 [>.............................] - ETA: 3:21 - loss: 0.1266 - acc: 0.9554
 1376/25000 [>.............................] - ETA: 3:20 - loss: 0.1272 - acc: 0.9549
 1408/25000 [>.............................] - ETA: 3:

 4096/25000 [===>..........................] - ETA: 2:59 - loss: 0.1333 - acc: 0.9534
 4128/25000 [===>..........................] - ETA: 2:59 - loss: 0.1327 - acc: 0.9537
 4160/25000 [===>..........................] - ETA: 2:59 - loss: 0.1328 - acc: 0.9534
 4192/25000 [====>.........................] - ETA: 2:59 - loss: 0.1323 - acc: 0.9537
 4224/25000 [====>.........................] - ETA: 2:59 - loss: 0.1316 - acc: 0.9541
 4256/25000 [====>.........................] - ETA: 2:58 - loss: 0.1329 - acc: 0.9537
 4288/25000 [====>.........................] - ETA: 2:58 - loss: 0.1327 - acc: 0.9538
 4320/25000 [====>.........................] - ETA: 2:58 - loss: 0.1331 - acc: 0.9537
 4352/25000 [====>.........................] - ETA: 2:57 - loss: 0.1323 - acc: 0.9540
 4384/25000 [====>.........................] - ETA: 2:57 - loss: 0.1317 - acc: 0.9544
 4416/25000 [====>.........................] - ETA: 2:57 - loss: 0.1314 - acc: 0.9545
 4448/25000 [====>.........................] - ETA: 2:

Epoch 4/4

   32/25000 [..............................] - ETA: 3:26 - loss: 0.1072 - acc: 0.9688
   64/25000 [..............................] - ETA: 3:24 - loss: 0.0750 - acc: 0.9844
   96/25000 [..............................] - ETA: 3:25 - loss: 0.0669 - acc: 0.9896
  128/25000 [..............................] - ETA: 3:24 - loss: 0.0799 - acc: 0.9766
  160/25000 [..............................] - ETA: 3:23 - loss: 0.0812 - acc: 0.9750
  192/25000 [..............................] - ETA: 3:28 - loss: 0.0923 - acc: 0.9688
  224/25000 [..............................] - ETA: 3:26 - loss: 0.0842 - acc: 0.9732
  256/25000 [..............................] - ETA: 3:25 - loss: 0.0829 - acc: 0.9766
  288/25000 [..............................] - ETA: 3:25 - loss: 0.0784 - acc: 0.9792
  320/25000 [..............................] - ETA: 3:24 - loss: 0.0746 - acc: 0.9812
  352/25000 [..............................] - ETA: 3:23 - loss: 0.0704 - acc: 0.9830
  384/25000 [..............................

 1600/25000 [>.............................] - ETA: 3:22 - loss: 0.0603 - acc: 0.9800
 1632/25000 [>.............................] - ETA: 3:22 - loss: 0.0596 - acc: 0.9804
 1664/25000 [>.............................] - ETA: 3:21 - loss: 0.0587 - acc: 0.9808
 1696/25000 [=>............................] - ETA: 3:21 - loss: 0.0583 - acc: 0.9805
 1728/25000 [=>............................] - ETA: 3:20 - loss: 0.0576 - acc: 0.9809
 1760/25000 [=>............................] - ETA: 3:21 - loss: 0.0590 - acc: 0.9801
 1792/25000 [=>............................] - ETA: 3:21 - loss: 0.0585 - acc: 0.9805
 1824/25000 [=>............................] - ETA: 3:20 - loss: 0.0581 - acc: 0.9803
 1856/25000 [=>............................] - ETA: 3:20 - loss: 0.0611 - acc: 0.9801
 1888/25000 [=>............................] - ETA: 3:20 - loss: 0.0610 - acc: 0.9799
 1920/25000 [=>............................] - ETA: 3:19 - loss: 0.0623 - acc: 0.9797
 1952/25000 [=>............................] - ETA: 3:

 4896/25000 [====>.........................] - ETA: 2:55 - loss: 0.0613 - acc: 0.9804
 4928/25000 [====>.........................] - ETA: 2:55 - loss: 0.0618 - acc: 0.9801
 4960/25000 [====>.........................] - ETA: 2:55 - loss: 0.0624 - acc: 0.9798
 4992/25000 [====>.........................] - ETA: 2:54 - loss: 0.0621 - acc: 0.9800
 5024/25000 [=====>........................] - ETA: 2:54 - loss: 0.0619 - acc: 0.9801
 5056/25000 [=====>........................] - ETA: 2:54 - loss: 0.0620 - acc: 0.9798
 5088/25000 [=====>........................] - ETA: 2:54 - loss: 0.0619 - acc: 0.9800
 5120/25000 [=====>........................] - ETA: 2:53 - loss: 0.0622 - acc: 0.9799
 5152/25000 [=====>........................] - ETA: 2:53 - loss: 0.0620 - acc: 0.9800
 5184/25000 [=====>........................] - ETA: 2:53 - loss: 0.0616 - acc: 0.9801
 5216/25000 [=====>........................] - ETA: 2:53 - loss: 0.0616 - acc: 0.9801
 5248/25000 [=====>........................] - ETA: 2:

   32/25000 [..............................] - ETA: 48s
   64/25000 [..............................] - ETA: 50s
   96/25000 [..............................] - ETA: 51s
  128/25000 [..............................] - ETA: 51s
  160/25000 [..............................] - ETA: 51s
  192/25000 [..............................] - ETA: 51s
  224/25000 [..............................] - ETA: 51s
  256/25000 [..............................] - ETA: 51s
  288/25000 [..............................] - ETA: 51s
  320/25000 [..............................] - ETA: 51s
  352/25000 [..............................] - ETA: 51s
  384/25000 [..............................] - ETA: 51s
  416/25000 [..............................] - ETA: 51s
  448/25000 [..............................] - ETA: 51s
  480/25000 [..............................] - ETA: 51s
  512/25000 [..............................] - ETA: 51s
  544/25000 [..............................] - ETA: 51s
  576/25000 [..............................] - 

 3552/25000 [===>..........................] - ETA: 44s
 3584/25000 [===>..........................] - ETA: 44s
 3616/25000 [===>..........................] - ETA: 44s
 3648/25000 [===>..........................] - ETA: 44s
 3680/25000 [===>..........................] - ETA: 44s
 3712/25000 [===>..........................] - ETA: 44s
 3744/25000 [===>..........................] - ETA: 44s
 3776/25000 [===>..........................] - ETA: 44s
 3808/25000 [===>..........................] - ETA: 44s
 3840/25000 [===>..........................] - ETA: 44s
 3872/25000 [===>..........................] - ETA: 44s
 3904/25000 [===>..........................] - ETA: 44s
 3936/25000 [===>..........................] - ETA: 44s
 3968/25000 [===>..........................] - ETA: 44s
 4000/25000 [===>..........................] - ETA: 43s
 4032/25000 [===>..........................] - ETA: 43s
 4064/25000 [===>..........................] - ETA: 43s
 4096/25000 [===>..........................] - E

Test score: 0.708789501987
Test accuracy: 0.82368
Job state: succeeded ExitCode: 0
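
The log above moves from `Job state: running` to `Job state: succeeded`. The implementation of `utilities.wait_for_job_completion` is not shown in this recipe; as a generic illustration only, waiting of this kind is commonly written as a polling loop like the one below. The function `wait_for_state`, its parameters, and the choice of `failed` as a second terminal state are assumptions of this sketch.

```python
import time

# 'succeeded' appears in the log above; 'failed' is assumed for illustration.
TERMINAL_STATES = ('succeeded', 'failed')


def wait_for_state(get_state, interval_seconds=5.0, timeout_seconds=3600.0,
                   sleep=time.sleep):
    """Poll get_state() until it returns a terminal state or time runs out."""
    waited = 0.0
    while True:
        state = get_state()
        print('Job state: {0}'.format(state))
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise RuntimeError('timed out waiting for the job to finish')
        sleep(interval_seconds)
        waited += interval_seconds


# Demo with canned states instead of a real service call.
_states = iter(['running', 'running', 'succeeded'])
final_state = wait_for_state(lambda: next(_states), interval_seconds=0,
                             sleep=lambda seconds: None)
```

Passing `sleep` as a parameter keeps the loop easy to exercise without real delays; against a live job, `get_state` would wrap whatever call returns the current job state.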


### Download stdout.txt and stderr.txt files for the Job

In [67]:
files = client.jobs.list_output_files(cfg.resource_group, job_name, models.JobsListOutputFilesOptions("stdOuterr")) 
for file in list(files):
    utilities.download_file(file.download_url, file.name)
print("All files Downloaded")

Downloading https://aitrainigstorage.file.core.windows.net/kimhoondong351/1aa15964-43e9-4fab-9be5-81abdcb9c8d1/holkrazure/jobs/azure_batch_ai_keras_351_01_26_2018_162948/1dbe3e86-8c8a-44c9-8b20-5006d2977f24/stderr.txt?sv=2016-05-31&sr=f&sig=qHv2KEotM%2Fh3JPY5MD7drGBk2kyFXfrqKkX5GdalNm8%3D&se=2018-01-26T18%3A10%3A03Z&sp=rl ...Done
Downloading https://aitrainigstorage.file.core.windows.net/kimhoondong351/1aa15964-43e9-4fab-9be5-81abdcb9c8d1/holkrazure/jobs/azure_batch_ai_keras_351_01_26_2018_162948/1dbe3e86-8c8a-44c9-8b20-5006d2977f24/stdout.txt?sv=2016-05-31&sr=f&sig=iuFtzoJzZh7%2BRzUFR1xksMhMlMEqLE1eysvUBnrG1FY%3D&se=2018-01-26T18%3A10%3A03Z&sp=rl ...Done
All files Downloaded
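
The download URLs printed above carry a shared access signature in their query string. If such output is going to be shared, one option is to drop the query string before logging. The following is a small sketch using only the standard library; `redact_query` is a hypothetical helper and the example URL is made up.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_query(url):
    """Replace the query string (where the signature lives) by a marker."""
    parts = urlsplit(url)
    if not parts.query:
        return url
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, '<redacted>', ''))


# Made-up URL for illustration.
print(redact_query(
    'https://acct.file.core.windows.net/share/stdout.txt?sv=1&sig=abc'))
```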


In [68]:
print('stdout.txt content:')
with open('stdout.txt') as f:
    print(f.read())

print('stderr.txt content:')
with open('stderr.txt') as f:
    print(f.read())

stdout.txt content:
Loading data...
Downloading data from https://s3.amazonaws.com/text-datasets/imdb.npz

    8192/17464789 [..............................] - ETA: 3s
 3022848/17464789 [====>.........................] - ETA: 0s
25000 train sequences
25000 test sequences
Pad sequences (samples x time)
x_train shape: (25000, 100)
x_test shape: (25000, 100)
Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/4

   32/25000 [..............................] - ETA: 3:20:57 - loss: 0.6938 - acc: 0.5312
   64/25000 [..............................] - ETA: 1:42:12 - loss: 0.6927 - acc: 0.5469
   96/25000 [..............................] - ETA: 1:09:26 - loss: 0.6926 - acc: 0.5417
  128/25000 [..............................] - ETA: 52:56 - loss: 0.6928 - acc: 0.5391  
  160/25000 [..............................] - ETA: 42:58 - loss: 0.6927 - acc: 0.5375
  192/25000 [..............................] - ETA: 36:20 - loss: 0.6936 - acc: 0.5104
  224/25000 [..............
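
Because Keras writes one progress-bar line per batch, `stdout.txt` is dominated by lines such as `32/25000 [....] - ETA: ...`. As a hedged sketch, the log can be reduced to its informative lines and the final metrics pulled out with two regular expressions. `summarize_log` is a hypothetical helper, and the patterns are based only on the output format shown in this notebook.

```python
import re

# Progress lines look like "   32/25000 [......" in the output above.
PROGRESS_LINE = re.compile(r'^\s*\d+/\d+ \[')
METRIC_LINE = re.compile(r'^(Test score|Test accuracy):\s*([0-9.eE+-]+)\s*$')


def summarize_log(text):
    """Return (non-progress lines, dict of final metrics) for a Keras log."""
    kept, metrics = [], {}
    for line in text.splitlines():
        if not line.strip() or PROGRESS_LINE.match(line):
            continue
        kept.append(line)
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return kept, metrics


sample = (
    'Epoch 1/4\n'
    '\n'
    '   32/25000 [..............................] - ETA: 2:35:54\n'
    'Test score: 0.708789501987\n'
    'Test accuracy: 0.82368\n'
)
lines, metrics = summarize_log(sample)
print(lines)
print(metrics)
```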

### Delete the Job

In [51]:
client.jobs.delete(cfg.resource_group, job_name)

<msrestazure.azure_operation.AzureOperationPoller at 0x240b4457390>

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs you can delete the cluster using the following code.

In [52]:
client.clusters.delete(cfg.resource_group, cluster_name)

<msrestazure.azure_operation.AzureOperationPoller at 0x240b4457438>

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share, including all of its files, using the following code.

In [53]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)

True
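
The three cleanup steps above (job, cluster, file share) can be grouped so that a failure in one step does not prevent the remaining ones from running. This is only a sketch of that idea: `run_cleanup` is a hypothetical helper, and the commented usage simply wraps the delete calls already shown in this recipe.

```python
def run_cleanup(steps):
    """Run (label, callable) pairs in order; return the labels that failed."""
    failed = []
    for label, action in steps:
        try:
            action()
            print('Deleted {0}'.format(label))
        except Exception as error:  # keep going so later steps still run
            print('Could not delete {0}: {1}'.format(label, error))
            failed.append(label)
    return failed


# Possible usage with the objects defined earlier in this notebook:
# run_cleanup([
#     ('job', lambda: client.jobs.delete(cfg.resource_group, job_name)),
#     ('cluster', lambda: client.clusters.delete(cfg.resource_group, cluster_name)),
#     ('file share', lambda: service.delete_share(azure_file_share_name)),
# ])
```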

