# Training with Chainer in a remote environment

In this tutorial, we demonstrate how to use the Azure ML Python SDK to train a Convolutional Neural Network (CNN) on a single-node GPU with Chainer to perform handwritten digit recognition on the popular MNIST dataset. 

## Check the Python SDK version

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.15


## Connect to the workspace
Connect to your Azure Machine Learning service [workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). This step requires authenticating to Azure.

In [2]:
from azureml.core.workspace import Workspace

ws = Workspace.get(name='azureml', 
                      subscription_id='9c0f91b8-eb2f-484c-979c-15848c098a6b', 
                      resource_group='dllab'
                   )
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

Falling back to use azure cli credentials. This fall back to use azure cli credentials will be removed in the next release. 
Make sure your code doesn't require 'az login' to have happened before using azureml-sdk, except the case when you are specifying AzureCliAuthentication in azureml-sdk.


Workspace name: azureml
Azure region: eastus
Subscription id: 9c0f91b8-eb2f-484c-979c-15848c098a6b
Resource group: dllab
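
As an alternative to hard-coding the subscription ID in the notebook, the three connection values can be kept in a `config.json` file. The sketch below uses only the standard library to write such a file. The helper name `write_workspace_config` and the placeholder subscription ID are assumptions made for illustration; the key names follow the format that `Workspace.from_config()` reads.

```python
import json
import os
import tempfile


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json holding the three values passed to Workspace.get()."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "config.json")
    with open(path, "w") as f:
        json.dump(config, f, indent=2)
    return path


# Placeholder subscription ID for illustration only; substitute your own.
config_path = write_workspace_config(
    tempfile.mkdtemp(),
    subscription_id="00000000-0000-0000-0000-000000000000",
    resource_group="dllab",
    workspace_name="azureml",
)

with open(config_path) as f:
    loaded = json.load(f)
print(loaded["workspace_name"], loaded["resource_group"])

# With the SDK installed, the file could then be loaded with:
#   ws = Workspace.from_config(path=config_path)
```

Keeping the file out of version control avoids publishing the subscription ID along with the notebook.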


## Create a new Machine Learning Compute (formerly Batch AI) target or attach an existing one
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) for training your model. In this tutorial, we use Azure ML managed compute ([AmlCompute](https://docs.microsoft.com/azure/machine-learning/service/how-to-set-up-training-targets#amlcompute)) for our remote training compute resource.

**Creation of AmlCompute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation step and reuses the existing cluster.

As with other Azure services, there are limits on certain resources (e.g. AmlCompute) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "gpucluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(vm_size='Standard_NC6s_v3', 
                                                           max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing compute target.
{'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-02-22T02:20:53.723000+00:00', 'creationTime': '2019-02-22T01:18:20.614322+00:00', 'currentNodeCount': 2, 'errors': None, 'modifiedTime': '2019-02-22T01:19:24.109719+00:00', 'nodeStateCounts': {'idleNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0, 'preparingNodeCount': 0, 'runningNodeCount': 2, 'unusableNodeCount': 0}, 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 2, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'targetNodeCount': 2, 'vmPriority': 'LowPriority', 'vmSize': 'STANDARD_NC6'}


The above code creates a GPU cluster. If you instead want to create a CPU cluster, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`.
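
The serialized status printed above is a plain dictionary, so it can be condensed into the few fields that matter when deciding whether the cluster is ready. The following is a minimal sketch; `summarize_status` is a hypothetical helper, not part of the SDK, and the sample dictionary is an abbreviated copy of the output shown above.

```python
def summarize_status(status):
    """Return a short summary of a serialized AmlCompute status dictionary."""
    counts = status.get("nodeStateCounts", {})
    scale = status.get("scaleSettings", {})
    current = status.get("currentNodeCount", 0)
    max_nodes = scale.get("maxNodeCount", 0)
    return {
        "vm_size": status.get("vmSize"),
        "priority": status.get("vmPriority"),
        "state": status.get("allocationState"),
        "running": counts.get("runningNodeCount", 0),
        "idle": counts.get("idleNodeCount", 0),
        "max_nodes": max_nodes,
        # True when the cluster still has room to add nodes.
        "can_scale_up": current < max_nodes,
    }


# Abbreviated copy of the status dictionary printed above.
status = {
    "allocationState": "Steady",
    "currentNodeCount": 2,
    "nodeStateCounts": {"idleNodeCount": 0, "runningNodeCount": 2},
    "scaleSettings": {"minNodeCount": 0, "maxNodeCount": 2},
    "vmPriority": "LowPriority",
    "vmSize": "STANDARD_NC6",
}

summary = summarize_status(status)
print(summary)
```

In the output above, the existing cluster reports `STANDARD_NC6` with at most 2 nodes rather than the values in the provisioning configuration, because that configuration is only applied when a new cluster is created.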

## Model development in a remote environment
Now that you have your data and training script prepared, you are ready to train on your remote compute cluster. You can take advantage of Azure compute to leverage GPUs to cut down your training time. 

### Create a project folder
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [4]:
import os

project_folder = './chainer-mnist'
os.makedirs(project_folder, exist_ok=True)

### Prepare the training script
Now you will need to create your training script. In this tutorial, the training script is already provided for you at `chainer_mnist.py`. In practice, you should be able to take any custom training script as is and run it with Azure ML without having to modify your code.

However, if you would like to use Azure ML's [tracking and metrics](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#metrics) capabilities, you will have to add a small amount of Azure ML code inside your training script. 

In `chainer_mnist.py`, we log some metrics to our Azure ML run. To do so, we access the Azure ML `Run` object within the script:
```Python
from azureml.core.run import Run
run = Run.get_context()
```
Further within `chainer_mnist.py`, we log the batch size and epochs parameters, and the highest accuracy the model achieves:
```Python
run.log('Batch size', int(args.batchsize))
run.log('Epochs', int(args.epochs))

run.log('Accuracy', float(val_accuracy))
```
These run metrics will become particularly important when we begin hyperparameter tuning our model in the "Tune model hyperparameters" section.
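
To make the logging pattern concrete, here is a minimal, self-contained sketch of how a training script might parse its arguments and record the best validation accuracy. The default values and the helpers `parse_args` and `log_run_metrics` are assumptions for illustration, not the contents of the tutorial's script. A plain dictionary stands in for the Azure ML run so the example executes without the SDK.

```python
import argparse


def parse_args(argv=None):
    """Parse the command-line arguments used in this tutorial (defaults are hypothetical)."""
    parser = argparse.ArgumentParser(description="MNIST training arguments (sketch)")
    parser.add_argument("--batchsize", type=int, default=100)
    parser.add_argument("--epochs", type=int, default=20)
    parser.add_argument("--output_dir", default="./outputs")
    return parser.parse_args(argv)


def log_run_metrics(log, args, val_accuracies):
    """Log the parameters and the highest validation accuracy via log(name, value)."""
    best = max(val_accuracies)
    log("Batch size", int(args.batchsize))
    log("Epochs", int(args.epochs))
    log("Accuracy", float(best))
    return best


# Stand-in for run.log: store each metric in a dictionary.
logged = {}


def record(name, value):
    logged[name] = value


args = parse_args(["--epochs", "10", "--batchsize", "128"])
best = log_run_metrics(record, args, [0.9226, 0.9450, 0.9522])
print(logged)
```

Inside a real training script, `record` would be replaced by the `run.log` method of the object returned by `Run.get_context()`.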

Once your script is ready, copy the training script `chainer_mnist.py` into your project directory.

In [5]:
import shutil

shutil.copy('chainer_mnist.py', project_folder)

'./chainer-mnist/chainer_mnist.py'

### Create an Experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track the Chainer model training runs.

In [6]:
from azureml.core import Experiment

experiment_name = 'chainer-mnist-remote'
experiment = Experiment(ws, name=experiment_name)

### Create a Chainer estimator
The Azure ML SDK's Chainer estimator enables you to easily submit Chainer training jobs for both single-node and distributed runs. The following code will define a single-node Chainer job.

In [7]:
from azureml.train.dnn import Chainer

script_params = {
    '--epochs': 10,
    '--batchsize': 128,
    '--output_dir': './outputs'
}

estimator = Chainer(source_directory=project_folder, 
                    script_params=script_params,
                    compute_target=compute_target,
                    pip_packages=['numpy', 'pytest','cupy-cuda90'],
                    entry_script='chainer_mnist.py',
                    use_gpu=True)

The `script_params` parameter is a dictionary containing the command-line arguments to your training script `entry_script`. To leverage the Azure VM's GPU for training, we set `use_gpu=True`.
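
One way to picture how `script_params` reaches the training script is as a flat list of command-line tokens. The sketch below illustrates that mapping. `script_params_to_argv` is a hypothetical helper written for this example and is not the SDK's own implementation.

```python
def script_params_to_argv(script_params):
    """Flatten a {flag: value} dictionary into a list of command-line tokens."""
    argv = []
    for name, value in script_params.items():
        argv.append(name)
        argv.append(str(value))  # command-line arguments are always strings
    return argv


script_params = {
    '--epochs': 10,
    '--batchsize': 128,
    '--output_dir': './outputs'
}

argv = script_params_to_argv(script_params)
print(argv)
# → ['--epochs', '10', '--batchsize', '128', '--output_dir', './outputs']
```

The result matches the `Arguments` list reported in the run details after the job completes, which is why the training script must convert numeric arguments back from strings (for example with `type=int` in `argparse`).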

### Submit the job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [8]:
run = experiment.submit(estimator)
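
Because submission returns immediately, the notebook needs some way to find out when the job has finished. The tutorial relies on the widget and on `run.wait_for_completion()` for this. Purely as an illustration of what "asynchronous" implies, the sketch below polls a status function until it reports a terminal state. `wait_until_done` and the set of terminal state names are assumptions made for this example, and a fake status source replaces the real run.

```python
import time

# Assumed terminal states for illustration.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_done(get_status, poll_seconds=10.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Call get_status() repeatedly until it returns a terminal state."""
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("run did not finish within the timeout")
        sleep(poll_seconds)
        waited += poll_seconds


# Fake status source standing in for a submitted run.
fake_states = iter(["Queued", "Running", "Running", "Completed"])
final = wait_until_done(lambda: next(fake_states), poll_seconds=0.0)
print(final)
```

With a real run, the status function could be `run.get_status`, although `run.wait_for_completion(show_output=True)` as used later in this tutorial is the simpler choice.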

### Monitor the run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes.

In [11]:
from azureml.widgets import RunDetails

RunDetails(run).show()


In [10]:
run.wait_for_completion(show_output=True)

RunId: chainer-mnist-remote_1550802908_3644a0b1

Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt

Streaming azureml-logs/80_driver_log.txt

Downloading from http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz...
Downloading from http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz...
epoch:01 train_loss:0.2911 val_loss:0.2650 val_accuracy:0.9226
epoch:02 train_loss:0.1974 val_loss:0.1865 val_accuracy:0.9450
epoch:03 train_loss:0.1718 val_loss:0.1581 val_accuracy:0.9522
epoch:04 train_loss:0.1350 val_loss:0.1396 val_accuracy:0.9562
epoch:05 train_loss:0.1398 val_loss:0.1177 val_accuracy:0.9639
epoch:06 train_loss:0.1115 val_loss:0.1053 val_accuracy:0.9690
epoch:07 train_loss:0.0660 val_loss:0.1013 val_accuracy:0.9694
epoch:08 train_loss:0.0913 val_loss:0.0958 val_accuracy

{'runId': 'chainer-mnist-remote_1550802908_3644a0b1',
 'target': 'gpucluster',
 'status': 'Completed',
 'startTimeUtc': '2019-02-22T02:44:29.083393Z',
 'endTimeUtc': '2019-02-22T02:47:01.678075Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '436df501-502e-4150-952c-c6b6376aed54'},
 'runDefinition': {'Script': 'chainer_mnist.py',
  'Arguments': ['--epochs',
   '10',
   '--batchsize',
   '128',
   '--output_dir',
   './outputs'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Communicator': 0,
  'Target': 'gpucluster',
  'DataReferences': {},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'NodeCount': 1,
  'Environment': {'Python': {'InterpreterPath': 'python',
    'UserManagedDependencies': False,
    'CondaDependencies': {'name': 'project_environment',
     'dependencies': ['python=3.6.2',
      {'pip': ['azureml-defaults',
        'chainer==5.1.0',
        'numpy',
        'pytest',
        'cupy-cuda90']