# Distributed training with ChainerMN using a Datastore
In this tutorial, you will run a Chainer training example on the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset using ChainerMN distributed training across a GPU cluster. During the notebook, the training data is uploaded to Azure Blob storage through the default "Datastore" of the Azure Machine Learning service workspace. After this experiment, you can keep using the Datastore to access the data in Azure Blob storage.

## Check the Azure ML service Python SDK version

In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.18
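If later cells depend on a minimum SDK version, a small guard can make that explicit. The following is a minimal sketch, assuming `1.0.18` (the version printed above) is the minimum you want to enforce; the helper names are hypothetical. It compares versions numerically, because a plain string comparison would rank `1.0.2` above `1.0.18`.

```python
import re

MINIMUM_SDK = "1.0.18"  # assumption: the version this notebook was run with


def version_tuple(version):
    """Turn '1.0.18' (or '1.0.18.post1') into (1, 0, 18), stopping at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def sdk_is_recent_enough(installed, minimum=MINIMUM_SDK):
    """Compare versions as integer tuples rather than as strings."""
    return version_tuple(installed) >= version_tuple(minimum)


print(sdk_is_recent_enough("1.0.18"))
print(sdk_is_recent_enough("1.0.2"))
```

In the notebook you would pass `azureml.core.VERSION` as the `installed` argument.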


## Connect to the workspace
Connect to your Azure Machine Learning service [workspace](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#workspace). This requires authenticating against Azure.

In [2]:
from azureml.core.workspace import Workspace

ws = Workspace.get(name='azureml', 
                      subscription_id='9c0f91b8-eb2f-484c-979c-15848c098a6b', 
                      resource_group='amlservice'
                   )
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.


Workspace name: azureml
Azure region: southeastasia
Subscription id: 9c0f91b8-eb2f-484c-979c-15848c098a6b
Resource group: amlservice
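The warning in the output above recommends `ServicePrincipalAuthentication` for unattended runs. A minimal sketch of that approach follows. The environment variable names (`AML_TENANT_ID`, `AML_SP_ID`, `AML_SP_SECRET`) and both helper functions are assumptions made for illustration, not part of the SDK; the SDK imports are deferred so the credential check can run on its own.

```python
import os

# Hypothetical environment variable names; use whatever your scheduler or CI system provides.
REQUIRED_VARS = ("AML_TENANT_ID", "AML_SP_ID", "AML_SP_SECRET")


def load_sp_credentials(env=None):
    """Read service principal credentials from the environment, failing early if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "tenant_id": env["AML_TENANT_ID"],
        "service_principal_id": env["AML_SP_ID"],
        "service_principal_password": env["AML_SP_SECRET"],
    }


def get_workspace_unattended(name, subscription_id, resource_group):
    """Connect to the workspace without an interactive login prompt."""
    # Imported lazily so load_sp_credentials can be used without the SDK installed.
    from azureml.core.authentication import ServicePrincipalAuthentication
    from azureml.core.workspace import Workspace

    auth = ServicePrincipalAuthentication(**load_sp_credentials())
    return Workspace.get(name=name,
                         subscription_id=subscription_id,
                         resource_group=resource_group,
                         auth=auth)
```

Keeping the secret in the environment rather than in the notebook avoids committing it alongside the code.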


## Create a new Machine Learning Compute (formerly Batch AI) cluster or reuse an existing one
You will need to create a [compute target](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#compute-target) to execute your training script on. In this tutorial, you create a Machine Learning Compute cluster (the successor to [Azure Batch AI](https://docs.microsoft.com/azure/batch-ai/overview)) as your training compute resource. This code creates a cluster for you if it does not already exist in your workspace.

**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in your workspace this code will skip the cluster creation process.

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

batchai_cluster_name = "gpu-ibclst"
vm_size = "Standard_NC24rs_v3"

try:
    # Check for existing cluster
    compute_target = ComputeTarget(ws,batchai_cluster_name)
    print('Found existing compute target ' + batchai_cluster_name)
except ComputeTargetException:
    # Else, create new one
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                    vm_priority = "lowpriority",
                                                                    min_nodes = 0, 
                                                                    max_nodes = 2)
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

Found existing compute target gpu-ibclst


The above code creates GPU compute. If you instead want to create CPU compute, provide a different VM size to the `vm_size` parameter, such as `STANDARD_D2_V2`.

## Model development in the remote environment
Now that we have the cluster ready to go, let's run our distributed training job.

### Create a project folder
Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script and any additional files your training script depends on.

In [4]:
import os

project_folder = './chainer-distr'
os.makedirs(project_folder, exist_ok=True)

### Prepare the training script
Now you will need to create your training script. In this tutorial, the script for distributed training of MNIST is already provided for you as `chainer_mnist.py`. In practice, you should be able to take any custom Chainer training script as is and run it with Azure ML without having to modify your code.

Once your script is ready, copy the training script `chainer_mnist.py` into the project directory.

In [5]:
import shutil

shutil.copy('chainer_mnist.py', project_folder)

'./chainer-distr/chainer_mnist.py'

### Download the training data

In [6]:
import urllib.request
import os

os.makedirs('./data/mnist', exist_ok=True)

urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz', filename='./data/mnist/train-images.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz', filename='./data/mnist/train-labels.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz', filename='./data/mnist/test-images.gz')
urllib.request.urlretrieve('http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz', filename='./data/mnist/test-labels.gz')

('./data/mnist/test-labels.gz', <http.client.HTTPMessage at 0x116e6a278>)
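Before uploading, it can be worth checking that each download really is an MNIST file and not, say, an error page. The sketch below reads only the first 8 bytes of the gzipped IDX file: a big-endian magic number (2051 for image files, 2049 for label files) followed by the item count. The function name `read_idx_header` is hypothetical.

```python
import gzip
import struct

# Magic numbers from the IDX format used by MNIST: 2051 = image file, 2049 = label file.
IDX_KINDS = {2051: "images", 2049: "labels"}


def read_idx_header(path):
    """Return (kind, item_count) from a gzipped IDX file without reading the payload."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
    if magic not in IDX_KINDS:
        raise ValueError("%s does not look like an MNIST IDX file (magic=%d)" % (path, magic))
    return IDX_KINDS[magic], count
```

For example, `read_idx_header('./data/mnist/train-images.gz')` should report an image file; the standard MNIST training set contains 60000 items.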

### Save the training data to the default Datastore

In [7]:
ds = ws.get_default_datastore()

In [8]:
ds.upload(src_dir='./data/mnist', target_path='mnist', overwrite=True, show_progress=True)

Uploading ./data/mnist/test-images.gz
Uploading ./data/mnist/test-labels.gz
Uploading ./data/mnist/train-images.gz
Uploading ./data/mnist/train-labels.gz
Uploaded ./data/mnist/test-labels.gz, 1 files out of an estimated total of 4
Uploaded ./data/mnist/train-labels.gz, 2 files out of an estimated total of 4
Uploaded ./data/mnist/test-images.gz, 3 files out of an estimated total of 4
Uploaded ./data/mnist/train-images.gz, 4 files out of an estimated total of 4


$AZUREML_DATAREFERENCE_ae7b7970a0b64a73a2c688d35245637a

Confirm in the Azure Portal, or with a tool such as Storage Explorer, that the data has been uploaded to the Blob container.



<div align="left"><img src="../images/defaultblob.png" width="500" align="left"></div>

### Create an Experiment
Create an [Experiment](https://docs.microsoft.com/azure/machine-learning/service/concept-azure-machine-learning-architecture#experiment) to track the Chainer model training.

In [9]:
from azureml.core import Experiment

experiment_name = 'chainermn-dirt'
experiment = Experiment(ws, name=experiment_name)

### Create a Chainer estimator
The Azure ML SDK's Chainer estimator enables you to easily submit Chainer training jobs for both single-node and distributed runs.

In [10]:
from azureml.train.dnn import Chainer


script_params = {
    '-d': ds.path('mnist').as_mount()
}


estimator = Chainer(source_directory=project_folder,
                      compute_target=compute_target,
                      entry_script='chainer_mnist.py',
                      script_params=script_params,
                      node_count=2,
                      process_count_per_node=1,
                      distributed_backend='mpi',     
                      pip_packages=['mpi4py','pytest'],
                      use_gpu=True)
 

The above code specifies that we will run our training script on `2` nodes, with one worker per node. In order to execute a distributed run using MPI, you must provide the argument `distributed_backend='mpi'`. Using this estimator with these settings, Chainer and its dependencies will be installed for you. However, if your script also uses other packages, make sure to install them via the `Chainer` constructor's `pip_packages` or `conda_packages` parameters.
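The notebook does not show how `chainer_mnist.py` consumes the `-d` entry in `script_params`. As a rough sketch, assuming each key/value pair reaches the entry script as command-line arguments, a script could read the data location as shown here. The long option name `--data-dir` and both helper functions are hypothetical; the file names are the ones used when the data was downloaded.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the entry-script arguments; '-d' matches the key used in script_params."""
    parser = argparse.ArgumentParser(description="Hypothetical entry-script arguments")
    # The long name and the default are assumptions made for this sketch.
    parser.add_argument("-d", "--data-dir", dest="data_dir", default="./data/mnist",
                        help="folder containing the gzipped MNIST files")
    return parser.parse_args(argv)


def mnist_paths(data_dir):
    """Map each MNIST file name used in this notebook to its full path under data_dir."""
    names = ("train-images.gz", "train-labels.gz", "test-images.gz", "test-labels.gz")
    return {name: os.path.join(data_dir, name) for name in names}


# Example value only; the real mount path is decided by the service at run time.
args = parse_args(["-d", "/mnt/datastore/mnist"])
print(mnist_paths(args.data_dir)["train-images.gz"])
```

Because the script only sees a directory path, the same code can be exercised locally against `./data/mnist` before it is submitted to the cluster.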

### Run the job
Run your experiment by submitting your estimator object. Note that this call is asynchronous.

In [11]:
run = experiment.submit(estimator)
print(run)

Run(Experiment: chainer-IB,
Id: chainer-IB_1553064401_e5ac1eef,
Type: azureml.scriptrun,
Status: Queued)
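Because `submit` returns immediately, the run is still `Queued` at this point. If you want to block in a script without using the widget, one option is to poll the run status yourself. The following is a minimal sketch: `wait_until_finished` is a hypothetical helper, and the set of terminal states is an assumption you should check against your SDK version. It takes any callable that returns a status string, so it runs here against a stand-in instead of a real run.

```python
import time

# Assumed terminal states for Run.get_status(); verify against your SDK version.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_finished(get_status, poll_seconds=15, timeout_seconds=3600, sleep=time.sleep):
    """Poll get_status() until a terminal state is reached or the timeout expires."""
    waited = 0
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if waited >= timeout_seconds:
            raise TimeoutError("Run still in state %r after %d s" % (status, waited))
        sleep(poll_seconds)
        waited += poll_seconds


# Stand-in for run.get_status so the sketch runs without a workspace.
fake_statuses = iter(["Queued", "Running", "Completed"])
print(wait_until_finished(lambda: next(fake_statuses), sleep=lambda seconds: None))
```

With a real run you would call `wait_until_finished(run.get_status)`.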


### Monitor the run
You can monitor the progress of the run with a Jupyter widget. Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. You can see that the widget automatically plots and visualizes the loss metric that we logged to the Azure ML run.


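### Check the training progress
A training run goes through the following four steps.

1. Preparing: A Docker image matching the Python environment specified by the Chainer estimator is built and uploaded to the workspace's Azure Container Registry. This step happens only once per Python environment (the container is cached for subsequent runs). Building and uploading the image takes about 5 minutes. While the job is being prepared, logs are streamed to the run history, where you can view them to monitor image creation.

2. Scaling: If the compute needs to scale up (that is, the run needs more nodes than the Batch AI cluster currently has available), the cluster attempts to scale up to make the required number of nodes available. Scaling typically takes about 5 minutes.

3. Running: All scripts in the script folder are uploaded to the compute target, the datastores are mounted or copied, and the entry_script is executed. While the job runs, stdout and the ./logs folder are streamed to the run history, where you can view them to monitor progress.

4. Post-processing: The run's ./outputs folder is copied to the run history.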
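The last step means that anything a training script writes into `./outputs` ends up in the run history. A minimal sketch of a script-side helper that relies on this follows; the function name, the file name `metrics.json`, and the metric keys are hypothetical.

```python
import json
import os


def save_metrics(metrics, output_dir="./outputs", filename="metrics.json"):
    """Write a small JSON file into the folder that is copied to the run history."""
    os.makedirs(output_dir, exist_ok=True)
    path = os.path.join(output_dir, filename)
    with open(path, "w") as f:
        json.dump(metrics, f, indent=2)
    return path


# Example call inside a training script (values are placeholders):
# save_metrics({"epoch": 20, "validation_accuracy": 0.98})
```

Model snapshots can be saved into the same folder in the same way, so they remain available after the cluster has scaled down.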

In [15]:
from azureml.widgets import RunDetails

RunDetails(run).show()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': True, 'log_level': 'INFO', 's…

In [13]:
#run.wait_for_completion(show_output=True)