# Run `mnist-*.py` on AML

## Initialize workspace

_Copied from AML [example >>](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb)_

Initialize a `Workspace` object from the existing workspace you created in the Prerequisites step. `Workspace.from_config()` creates a workspace object from the details stored in `config.json`.
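
For orientation, the file that `Workspace.from_config()` reads is a small JSON document. The sketch below writes one and reads it back using only the standard library. The three keys are the ones the SDK stores; the subscription value and the `demo_config.json` file name are placeholders chosen for illustration.

```python
import json
import os
import tempfile

# Placeholder values; substitute your own subscription, resource group, and workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "aml-test",
    "workspace_name": "aml-test-ws",
}

# Write to a temporary directory so the sketch does not touch a real config file.
config_path = os.path.join(tempfile.mkdtemp(), "demo_config.json")
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)

# Read it back, the same way any JSON config would be loaded.
with open(config_path) as f:
    loaded = json.load(f)

print(sorted(loaded))
```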

In [1]:
%matplotlib inline
import numpy as np
import os
import sys
import matplotlib.pyplot as plt
import azureml.core
from azureml.core import Workspace

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

from azureml.telemetry import set_diagnostics_collection

set_diagnostics_collection(send_diagnostics=True)

ws = None
try:
    ws = Workspace.from_config()
except Exception:
    subscription_id = os.getenv("SUBSCRIPTION_ID", default="f36a6329-7382-4b2e-b386-452fecdfcd73")
    resource_group = os.getenv("RESOURCE_GROUP", default="aml-test")
    workspace_name = os.getenv("WORKSPACE_NAME", default="aml-test-ws")
    workspace_region = os.getenv("WORKSPACE_REGION", default="eastus2")
    try:
        print("Connecting to workspace '%s'..." % workspace_name)
        ws = Workspace(subscription_id = subscription_id, resource_group = resource_group, workspace_name = workspace_name)
    except Exception:
        print("Workspace not accessible. Creating a new one...")
        try:
            ws = Workspace.create(
                name = workspace_name,
                subscription_id = subscription_id,
                resource_group = resource_group, 
                location = workspace_region,
                create_resource_group = True,
                exist_ok = True)
        except Exception:
            print("Failed to connect to workspace. Quit with error.")
            sys.exit(1)
    ws.write_config()
print(ws.get_details())

Azure ML SDK Version:  1.0.21
Turning diagnostics collection on. 
Found the config file in: /home/wuh/hello-tf/mnist/run-on-aml/aml_config/config.json
{'id': '/subscriptions/f36a6329-7382-4b2e-b386-452fecdfcd73/resourceGroups/aml-test/providers/Microsoft.MachineLearningServices/workspaces/aml-test-ws', 'name': 'aml-test-ws', 'location': 'eastus2', 'type': 'Microsoft.MachineLearningServices/workspaces', 'workspaceid': 'c8bab51f-43ca-4fcf-9122-bbccba990797', 'description': '', 'friendlyName': 'aml-test-ws', 'creationTime': '2019-03-20T11:41:32.5628663+00:00', 'containerRegistry': '/subscriptions/f36a6329-7382-4b2e-b386-452fecdfcd73/resourcegroups/aml-test/providers/microsoft.containerregistry/registries/amltestwacrzghwzoya', 'keyVault': '/subscriptions/f36a6329-7382-4b2e-b386-452fecdfcd73/resourcegroups/aml-test/providers/microsoft.keyvault/vaults/amltestwkeyvaultcombopoo', 'applicationInsights': '/subscriptions/f36a6329-7382-4b2e-b386-452fecdfcd73/resourcegroups/aml-test/providers/micro

## Attach the blobstore with the training data to the workspace

To make the data accessible for remote training, keep it in the cloud. AML provides a convenient way to do so via a [Datastore](https://docs.microsoft.com/azure/machine-learning/service/how-to-access-data), an abstraction over Azure Storage that lets you upload and download data and access it from your remote compute targets. A datastore can reference either an Azure Blob container or an Azure file share as the underlying storage.
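
As a minimal sketch of the upload path, the helper below copies a local folder into a blob datastore and returns a reference that a remote run can mount. The helper name `upload_training_data` and its arguments are hypothetical; it assumes `datastore` behaves like an `azureml.core` blob datastore exposing `upload()` and `path()`.

```python
def upload_training_data(datastore, local_dir, target_path):
    """Upload a local folder to the datastore and return a mountable reference.

    `datastore` is assumed to behave like an azureml.core blob datastore.
    """
    # Copy every file under local_dir to target_path inside the container.
    datastore.upload(
        src_dir=local_dir,
        target_path=target_path,
        overwrite=True,
        show_progress=True,
    )
    # Build a data reference that a remote compute target can mount.
    return datastore.path(target_path).as_mount()
```

Once a datastore object `ds` exists, a call such as `upload_training_data(ds, "./data/mnist", "data/mnist")` would stage the MNIST files; the local folder name here is only an example.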

All contents, except the 'out' folder, in [philly-gfs://philly/wu2/v-minghh/bert-re3qa](https://storage.wu2.philly.selfhost.corp.microsoft.com/msrnext/v-minghh/bert-re3qa) have been copied to [wuh-blob://demo-2/msrnext/v-minghh/bert-re3qa](https://ms.portal.azure.com/#blade/Microsoft_Azure_Storage/ContainerMenuBlade/overview/storageAccountId/%2Fsubscriptions%2Fa20c82c7-4497-4d44-952a-3105f790e26b%2FresourceGroups%2Faml-test%2Fproviders%2FMicrosoft.Storage%2FstorageAccounts%2Fwuhamltestsa/path/demo-2/etag/%220x8D6C84080A38A69%22)

In [2]:
from azureml.core import Datastore

datastore_name = "hellotfstore"

# To unregister every datastore except the workspace's built-in stores, uncomment:
#for name, store in ws.datastores.items():
#    if name != "workspaceblobstore" and name != "workspacefilestore":
#        store.unregister()

if datastore_name not in ws.datastores:
    ds = Datastore.register_azure_blob_container(
        workspace=ws, 
        datastore_name=datastore_name,
        container_name="hello-tf",
        account_name="wuhamltestsa",
        account_key=os.environ["STORAGE_ACCOUNT_KEY"]  # hypothetical variable name; read the key from the environment instead of hardcoding it
    )
    print("Datastore '%s' registered." % datastore_name)
else:
    ds = Datastore(ws, datastore_name)
    print("Datastore '%s' has already been registered." % datastore_name)

# List all registered datastores in the current workspace.
# Ref: https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#find--define-datastores
# The loop variable is named 'store' so that it does not overwrite 'ds' above.
print("All registered datastores:")
for name, store in ws.datastores.items():
    print("  - %s (%s)" % (name, store.datastore_type))

# Define the default datastore for the current workspace.
#ws.set_default_datastore(datastore_name)
#ds = ws.get_default_datastore()

# The difference between as_mount(), as_download(), and as_upload():
# https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data#access-datastores-during-training
print(ds.path("data/mnist").as_mount())

Datastore 'hellotfstore' has already been registered.
All registered datastores:
  - workspaceblobstore (AzureBlob)
  - workspacefilestore (AzureFile)
  - externalblobstore (AzureBlob)
  - hellotfstore (AzureBlob)
$AZUREML_DATAREFERENCE_1de05827546e4ffeb53db2f96460935c
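
The string printed above is a placeholder, not a real path. When the reference is passed to a run as a script parameter, the expectation is that AML substitutes the location where the data is mounted on the compute target. The sketch below shows one way an entry script might read that value; the `parse_args` helper is hypothetical and is not taken from the `mnist-*.py` scripts.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse the command line; --data-dir is the only option assumed here."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--data-dir",
        required=True,
        help="Folder holding the training data (a mounted path on the compute target).",
    )
    return parser.parse_args(argv)


# Local demonstration with an explicit argument list instead of sys.argv.
args = parse_args(["--data-dir", "."])
print(args.data_dir, os.path.isdir(args.data_dir))
```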


## Create or attach to existing AmlCompute

_Copied from AML [example >>](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/training-with-deep-learning/train-hyperparameter-tune-deploy-with-tensorflow/train-hyperparameter-tune-deploy-with-tensorflow.ipynb)_

In [3]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

compute_targets = ws.compute_targets
for name, ct in compute_targets.items():
    print(name, ct.type, ct.provisioning_state)

# choose a name for your cluster
#cluster_name = "gpucluster-nc24"
cluster_name = "cpucluster-II"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target')
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(
        #vm_size="STANDARD_NC24",
        vm_size="STANDARD_D2_V2",
        min_nodes=1,
        max_nodes=1
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

gpucluster AmlCompute Succeeded
gpucluster-nc24 AmlCompute Succeeded
cpucluster-II AmlCompute Succeeded
Found existing compute target
{'currentNodeCount': 0, 'targetNodeCount': 0, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 0, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2019-05-06T12:30:49.996000+00:00', 'errors': None, 'creationTime': '2019-04-25T11:46:49.467206+00:00', 'modifiedTime': '2019-04-25T12:34:59.560007+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 0, 'maxNodeCount': 1, 'nodeIdleTimeBeforeScaleDown': 'PT600S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_D2_V2'}
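
The status dictionary printed above can be checked before submitting a multi-node job, since a run that requests more nodes than `maxNodeCount` cannot obtain them all from this cluster. The following is a small sketch that assumes the dictionary has the same keys as the output above; the helper name `fits_cluster` is hypothetical.

```python
def fits_cluster(status, node_count):
    """Return True if the cluster can scale up to node_count nodes.

    `status` is assumed to look like compute_target.get_status().serialize().
    """
    return node_count <= status["scaleSettings"]["maxNodeCount"]


# A trimmed copy of the status shown above.
status = {
    "scaleSettings": {"minNodeCount": 0, "maxNodeCount": 1},
    "vmSize": "STANDARD_D2_V2",
}

print(fits_cluster(status, 1))
print(fits_cluster(status, 2))
```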


## Create an estimator

See https://docs.microsoft.com/en-us/python/api/azureml-train-core/azureml.train.estimator.estimator?view=azure-ml-py for the documentation of the `Estimator` class.

In [26]:
from azureml.train.estimator import Estimator

# Single node
est_1 = Estimator(
    compute_target=compute_target,
    use_gpu=False,
    node_count=1,
    pip_packages=['tensorflow==1.13.1'],
    source_directory="../",
    entry_script="mnist-mlp.py",
    script_params={
        "--data-dir": ds.path("data/mnist").as_mount()
    }
)

# Distributed with PS architecture
from azureml.train.dnn import TensorFlow
est_2 = TensorFlow(
    compute_target=compute_target,
    use_gpu=False,
    node_count=2,
    distributed_backend='ps',
    parameter_server_count=2,
    worker_count=2,
    source_directory="../",
    entry_script="mnist-mlp-dist-ps.py",
    script_params={
        "--data-dir": ds.path("data/mnist").as_mount()
    }
)


# Distributed with Horovod
est_3 = Estimator(
    compute_target=compute_target,
    use_gpu=False,
    node_count=2,
    distributed_backend='mpi',
    process_count_per_node=2,
    pip_packages=['tensorflow==1.13.1', 'horovod'],
    source_directory="../",
    entry_script="mnist-mlp-dist-hvd.py",
    script_params={
        "--data-dir": ds.path("data/mnist").as_mount()
    }
)

import datetime
print("[%s] %s" % (str(datetime.datetime.now()), str(est_1)))


[2019-05-15 07:20:08.720646] <azureml.train.estimator.Estimator object at 0x7f823e6b6a90>
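
For the parameter-server estimator, the entry script has to learn its own role in the cluster. TensorFlow's convention for this is the `TF_CONFIG` environment variable, and the sketch below assumes AML populates it for `distributed_backend='ps'` runs. The helper name `read_tf_config` and the host names are hypothetical, and the code is not taken from `mnist-mlp-dist-ps.py`.

```python
import json
import os


def read_tf_config(environ=os.environ):
    """Extract cluster layout and task role from TF_CONFIG, if it is set."""
    raw = environ.get("TF_CONFIG")
    if raw is None:
        return None  # not a distributed run
    cfg = json.loads(raw)
    cluster = cfg.get("cluster", {})
    task = cfg.get("task", {})
    return {
        "ps_hosts": cluster.get("ps", []),
        "worker_hosts": cluster.get("worker", []),
        "job_name": task.get("type"),
        "task_index": task.get("index"),
    }


# Example environment with made-up host names, for local experimentation.
example_env = {
    "TF_CONFIG": json.dumps({
        "cluster": {"ps": ["host0:2222"], "worker": ["host1:2222", "host2:2222"]},
        "task": {"type": "worker", "index": 1},
    })
}
print(read_tf_config(example_env))
```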


## Submit the job

In [31]:
from azureml.core import Experiment
exp = Experiment(workspace=ws, name='aml-hello-tf')

run = exp.submit(est_2)
print(run)

from azureml.widgets import RunDetails

RunDetails(run).show()

Run(Experiment: aml-hello-tf,
Id: aml-hello-tf_1557905898_efa3a62c,
Type: azureml.scriptrun,
Status: Running)
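
Outside a notebook, where the widget is not available, the run can be followed from plain Python. This is a minimal sketch that assumes `run` is the object returned by `exp.submit(...)`; the helper name `wait_and_report` is hypothetical.

```python
def wait_and_report(run):
    """Block until the run finishes, then return its final status and logged metrics."""
    # Stream the run's log output to the console while waiting.
    run.wait_for_completion(show_output=True)
    return run.get_status(), run.get_metrics()
```

Calling `status, metrics = wait_and_report(run)` after submission would then give the final state and whatever metrics the training script logged.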

