# Creating a Job per File

## Introduction

This example shows how to map files in an Azure Storage volume to jobs. Potential use cases of this functionality include:
- creating a job for each input data file in a directory
- creating a job for each input script in a directory

In this recipe, we focus on the first use case. We will train three RNN models, one from each of three separate input files.

## Details
- We provide a TensorFlow example of an RNN (Recurrent Neural Network) that learns how to generate new text after training on a textual dataset.
- The RNN code is adapted from a sample available at https://github.com/sherjilozair/char-rnn-tensorflow
- We generate three jobs, one for each of three training data files: the text from the Linux kernel, Tolstoy's War and Peace, and Shakespeare's works

## Instructions

### Install Dependencies and Create Configuration File
Follow the [instructions](/recipes) to install all dependencies and create a configuration file.

### Read Configuration and Create Batch AI client

In [None]:
from __future__ import print_function

import sys
import threading
import logging

import azure.mgmt.batchai.models as models
from azure.storage.blob import BlockBlobService
from azure.storage.file import FileService

sys.path.append('../../..')
import utilities as utils
from utilities.job_factory import ParameterSweep, FileParamSpec

cfg = utils.config.Configuration('../../configuration.json')
client = utils.config.create_batchai_client(cfg)

Create the resource group and Batch AI workspace if they do not already exist:

In [None]:
utils.config.create_resource_group(cfg)
_ = client.workspaces.create(cfg.resource_group, cfg.workspace, cfg.location).result()

## 1. Prepare Training Dataset and Script in Azure Storage

### Create Azure Blob Container

We will create a new Blob Container named `batchaisample` under your storage account. This will be used to store the *input training dataset*.

**Note** You don't need to create a new Blob Container for every cluster. We do so in this sample to simplify resource management for you.

In [None]:
azure_blob_container_name = 'batchaisample'
blob_service = BlockBlobService(cfg.storage_account_name, cfg.storage_account_key)
blob_service.create_container(azure_blob_container_name, fail_on_exist=False)

### Upload RNN Dataset to Azure Blob Container

For demonstration purposes, we will download three textual datasets to the current directory and upload them to an Azure Blob Container directory named `rnn_dataset`.

There are multiple ways to create folders and upload files into an Azure Blob Container: you can use the [Azure Portal](https://ms.portal.azure.com), [Storage Explorer](http://storageexplorer.com/), [Azure CLI2](/azure-cli-extension) or the Azure SDK for your preferred programming language.
In this example we will use the Azure SDK for Python to copy files into Blob storage.
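As a rough illustration of what such an upload can look like, the sketch below copies local files into a virtual directory of a container by calling `create_blob_from_path` on a `BlockBlobService`-like object. The helper `upload_files_to_blob_directory` and the `RecordingBlobService` stand-in are hypothetical names introduced here so the sketch runs without a storage account; this is not necessarily how the `utilities` helper used below is implemented.

```python
import os


def upload_files_to_blob_directory(blob_service, container, directory, local_paths):
    """Upload each local file as '<directory>/<file name>' and return the blob names.

    Hypothetical helper for illustration; `blob_service` is expected to expose
    create_blob_from_path(container_name, blob_name, file_path).
    """
    uploaded = []
    for local_path in local_paths:
        # Blob storage has no real folders: the "directory" is a name prefix.
        blob_name = "{0}/{1}".format(directory, os.path.basename(local_path))
        blob_service.create_blob_from_path(container, blob_name, local_path)
        uploaded.append(blob_name)
    return uploaded


class RecordingBlobService(object):
    """Hypothetical stand-in that records calls instead of contacting Azure."""

    def __init__(self):
        self.calls = []

    def create_blob_from_path(self, container_name, blob_name, file_path):
        self.calls.append((container_name, blob_name, file_path))


fake_service = RecordingBlobService()
blob_names = upload_files_to_blob_directory(
    fake_service, "batchaisample", "rnn_dataset",
    ["./linux_input.txt", "./shakespeare_input.txt"])
print(blob_names)
```

With a real account, the `blob_service` object created earlier could be passed in place of the stand-in.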

In [None]:
rnn_dataset_directory = 'rnn_dataset'
utils.dataset.download_and_upload_rnn_dataset_to_blob(
    blob_service, azure_blob_container_name, rnn_dataset_directory)

### Create Azure File Share

For this example we will create a new File Share named `batchaisample` under your storage account. This will be used to share the *training script file* and *output file*.

**Note** You don't need to create a new File Share for every cluster. We do so in this sample to simplify resource management for you.

In [None]:
azure_file_share_name = 'batchaisample'
file_service = FileService(cfg.storage_account_name, cfg.storage_account_key)
file_service.create_share(azure_file_share_name, fail_on_exist=False)

Upload the training script [char_rnn.py](char_rnn.py) to a file share directory named `rnn_samples`.

In [None]:
script_dir = "rnn_samples"
file_service.create_directory(
    azure_file_share_name, script_dir, fail_on_exist=False)
file_service.create_file_from_path(
    azure_file_share_name, script_dir, 'char_rnn.py', 'char_rnn.py')

## 2. Create Azure Batch AI Compute Cluster

### Configure Compute Cluster

- For this example we will use a GPU cluster of `STANDARD_NC6` nodes. The number of nodes in the cluster is set by the `nodes_count` variable.
- We will call the cluster `nc6`.

So, the cluster will have the following parameters:

In [None]:
nodes_count = 3
cluster_name = 'nc6'

parameters = models.ClusterCreateParameters(
    location=cfg.location,
    vm_size='STANDARD_NC6',
    scale_settings=models.ScaleSettings(
        manual=models.ManualScaleSettings(target_node_count=nodes_count)
    ),
    user_account_settings=models.UserAccountSettings(
        admin_user_name=cfg.admin,
        admin_user_password=cfg.admin_password or None,
        admin_user_ssh_public_key=cfg.admin_ssh_key or None,
    )
)

### Create Compute Cluster

In [None]:
_ = client.clusters.create(cfg.resource_group, cfg.workspace, cluster_name, parameters).result()

### Monitor Cluster Creation

Monitor the newly created cluster. The `utilities` module contains a helper function that prints the detailed status of the cluster.

In [None]:
cluster = client.clusters.get(cfg.resource_group, cfg.workspace, cluster_name)
utils.cluster.print_cluster_status(cluster)
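If you prefer to block until the cluster is ready rather than re-running the cell above, a generic polling loop is one option. The sketch below is a minimal, hypothetical helper (`wait_for` is not part of the `utilities` module); it is exercised here with a fake sequence of states so that it runs without Azure. With the real client, `fetch` could wrap `client.clusters.get(...)`, but which attribute of the returned cluster signals readiness is an assumption to check against the SDK documentation.

```python
import time


def wait_for(fetch, is_done, timeout_sec=600, poll_sec=10, sleep=time.sleep):
    """Call fetch() repeatedly until is_done(value) is true, then return value.

    Hypothetical helper: raises RuntimeError once roughly timeout_sec
    seconds have been spent sleeping between polls.
    """
    waited = 0
    while True:
        value = fetch()
        if is_done(value):
            return value
        if waited >= timeout_sec:
            raise RuntimeError("condition not met after {0} seconds".format(waited))
        sleep(poll_sec)
        waited += poll_sec


# Demonstration with made-up states instead of a real cluster.
fake_states = iter(["resizing", "resizing", "steady"])
final_state = wait_for(
    fetch=lambda: next(fake_states),
    is_done=lambda state: state == "steady",
    sleep=lambda seconds: None)  # skip real sleeping in the demo
print(final_state)
```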

## 3. Mapping Files to Jobs with Parameter Sweep

The `ParameterSweep` class allows you to create a collection of jobs from a collection of files, with one job for each file.

We provide the storage account credentials from the configuration file. The `storage_type` is "BLOB" because the input files are stored in Azure Blob Storage. The `mount_method` states whether the storage is mounted to the cluster or to the job; here the volume is mounted to the job (as seen later). The `mount_path` is the `models.AzureBlobFileSystemReference.relative_mount_path` we use when mounting the volume. Finally, `filter_str` is a regex that the blob name must match.

The file paths generated will be:
```
['$AZ_BATCHAI_JOB_MOUNT_ROOT/bfs/rnn_dataset/linux_input.txt',
 '$AZ_BATCHAI_JOB_MOUNT_ROOT/bfs/rnn_dataset/shakespeare_input.txt',
 '$AZ_BATCHAI_JOB_MOUNT_ROOT/bfs/rnn_dataset/war_and_peace_input.txt']
```
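To make the relationship between `filter_str`, `mount_path` and the generated paths concrete, here is a minimal sketch that filters a list of blob names with a regex and prefixes the matches with the job mount root. The function `build_mounted_paths` is hypothetical, and the use of `re.match` (anchored at the start of the blob name) and of sorted output are assumptions; the actual `FileParamSpec` implementation may differ.

```python
import re


def build_mounted_paths(blob_names, filter_str, mount_path,
                        mount_root="$AZ_BATCHAI_JOB_MOUNT_ROOT"):
    """Return '<mount_root>/<mount_path>/<blob name>' for each matching blob.

    Hypothetical helper for illustration; assumes the regex is matched
    from the start of the blob name.
    """
    pattern = re.compile(filter_str)
    return sorted(
        "{0}/{1}/{2}".format(mount_root, mount_path, name)
        for name in blob_names
        if pattern.match(name)
    )


# Made-up container listing: three dataset files and one unrelated blob.
blob_names = [
    "rnn_dataset/war_and_peace_input.txt",
    "rnn_dataset/linux_input.txt",
    "other/readme.md",
    "rnn_dataset/shakespeare_input.txt",
]
paths = build_mounted_paths(blob_names, "rnn_dataset/.+", "bfs")
for path in paths:
    print(path)
```

The blob outside `rnn_dataset/` is dropped, which is why only three jobs are generated in this recipe.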

In [None]:
param_specs = [
    FileParamSpec(
        parameter_name="DATA_PATH",
        storage_account_name=cfg.storage_account_name,
        storage_account_key=cfg.storage_account_key,
        storage_type="BLOB",
        mount_method="JOB",
        container="batchaisample",
        mount_path="bfs",
        filter_str="rnn_dataset/.+"
    )
]

Create a parameter substitution object.

In [None]:
parameters = ParameterSweep(param_specs)

We will use the parameter substitution object to specify where we would like to substitute the parameters. We substitute
the values for `--data_path` into `models.JobCreateParameters.tensor_flow_settings.master_command_line_args`. Note that the `parameters` variable is used like a dict, with the `parameter_name` as the key that selects which parameter to substitute. When `parameters.generate_jobs` is called, each `parameters[name]` placeholder is replaced with an actual value.
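The idea of indexing an object to get a placeholder and expanding it later can be sketched in a few lines. The class below, `PlaceholderSweep`, is a hypothetical stand-in written for illustration only: the token format `<<NAME>>` and the `expand` method are assumptions and do not describe how `ParameterSweep` or `generate_jobs` work internally.

```python
class PlaceholderSweep(object):
    """Hypothetical stand-in showing dict-style placeholder substitution."""

    def __init__(self, name, values):
        self.name = name
        self.values = list(values)

    def __getitem__(self, name):
        # Indexing returns a token to embed in a template, not a real value.
        if name != self.name:
            raise KeyError(name)
        return "<<{0}>>".format(name)

    def expand(self, template):
        # Produce one concrete string per value by replacing the token.
        token = self[self.name]
        return [template.replace(token, value) for value in self.values]


sweep = PlaceholderSweep("DATA_PATH", [
    "bfs/rnn_dataset/linux_input.txt",
    "bfs/rnn_dataset/shakespeare_input.txt",
])
template = "--data_path {0} --save_dir out".format(sweep["DATA_PATH"])
command_lines = sweep.expand(template)
for line in command_lines:
    print(line)
```

Each expanded string corresponds to one job, which mirrors the one-job-per-file behaviour described above.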

In [None]:
azure_file_share_mount_path = 'afs'
azure_blob_mount_path = 'bfs'
jcp = models.JobCreateParameters(
    cluster=models.ResourceId(id=cluster.id),
    node_count=1,
    output_directories=[
        models.OutputDirectory(
            id='LOGS',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(
                azure_file_share_mount_path),
            path_suffix='logs'),
        models.OutputDirectory(
            id='SAVE',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(
                azure_file_share_mount_path),
            path_suffix='save'),
        models.OutputDirectory(
            id='OUT',
            path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(
                azure_file_share_mount_path),
            path_suffix='out')
    ],
    input_directories=[
        models.InputDirectory(
            id='SCRIPT',
            path='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}/{1}'.format(
                azure_file_share_mount_path, script_dir))],
    std_out_err_path_prefix='$AZ_BATCHAI_JOB_MOUNT_ROOT/{0}'.format(azure_file_share_mount_path),
    mount_volumes=models.MountVolumes(
        azure_file_shares=[
            models.AzureFileShareReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                azure_file_url='https://{0}.file.core.windows.net/{1}'.format(
                    cfg.storage_account_name, azure_file_share_name),
                relative_mount_path=azure_file_share_mount_path)
        ],
        azure_blob_file_systems=[
            models.AzureBlobFileSystemReference(
                account_name=cfg.storage_account_name,
                credentials=models.AzureStorageCredentialsInfo(
                    account_key=cfg.storage_account_key),
                container_name=azure_blob_container_name,
                relative_mount_path=azure_blob_mount_path)
        ]
    ),
    container_settings=models.ContainerSettings(
        image_source_registry=models.ImageSourceRegistry(image='tensorflow/tensorflow:1.8.0-gpu')),
    tensor_flow_settings=models.TensorFlowSettings(
        python_script_file_path='$AZ_BATCHAI_INPUT_SCRIPT/char_rnn.py',
        master_command_line_args="--data_path {0} --save_dir $AZ_BATCHAI_OUTPUT_SAVE "
                                 "--out_dir $AZ_BATCHAI_OUTPUT_OUT --log_dir $AZ_BATCHAI_OUTPUT_LOGS ".format(
                                    parameters["DATA_PATH"])
    )
)


Create a new experiment called `rnn_test`.

In [None]:
experiment_name = 'rnn_test'
_ = client.experiments.create(cfg.resource_group, cfg.workspace, experiment_name).result()

Next, we generate a list of jobs to submit and then submit the jobs to an experiment.

In [None]:
# Generate Jobs
jobs_to_submit = parameters.generate_jobs(jcp)

# Submit Jobs
experiment_utils = utils.experiment.ExperimentUtils(client, cfg.resource_group, cfg.workspace, experiment_name)
jobs = experiment_utils.submit_jobs(jobs_to_submit, 'rnn_test').result()

To view the progress of the jobs and their output files, open a job in the Azure Portal. On the left panel, click Environment Variables to see the parameters used to create the job, and Output Files -> OUT to see the output generated by the RNN (once the job is complete).

## 4. Clean Up (Optional)

### Delete the Experiment
Delete the experiment and the jobs inside it.

In [None]:
_ = client.experiments.delete(cfg.resource_group, cfg.workspace, experiment_name).result()

### Delete the Cluster
When you are finished with the sample and don't want to submit any more jobs, you can delete the cluster using the following code.

In [None]:
_ = client.clusters.delete(cfg.resource_group, cfg.workspace, cluster_name).result()

### Delete File Share
When you are finished with the sample and don't want to submit any more jobs, you can delete the file share and all of its files using the following code.

In [None]:
service = FileService(cfg.storage_account_name, cfg.storage_account_key)
service.delete_share(azure_file_share_name)