## Training DLWP on Azure with the Microsoft Azure Machine Learning service
For a reference on getting started with the Microsoft Azure Machine Learning service, refer to the [Microsoft documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/).

First, let's import the core AzureML Python modules.

In [None]:
import azureml.core
from azureml.core import Workspace
from azureml.core import Experiment

import os

#### Set the parameters for our model run
Here we set the directory where the predictor/target dataset is stored, the name of that dataset file, the name under which the model is saved, and the log directory. `tags` optionally records some parameters for easy reference in the list of experiment runs. `environment_name` is the name of the environment to use or create on Azure ML. Set it to `None` to fall back on the default TensorFlow configuration defined in the estimator cell below (framework version 2.0).

In [None]:
data_directory = '/home/disk/wave2/jweyn/Data/Azure'
predictor_file = 'era5/era5_2deg_3h_CS2_1979-2018_z-tau-t2_500-1000_tcwv_psi850.nc'
model_file = 'dlwp_era5_6h-3_CS48_tau-sfc1000-lsm-topo_UNET2-relumax-2'
log_file = 'logs/era5_6h-3_CS48_tau-sfc1000-lsm-topo_UNET2-relumax-2'
tags = {'in': 'tau-sfc-6h-3-lsm-topo', 'out': 'tau-sfc', 'arch': 'UNET2-relumax-T2-tf-2.1'}
environment_name = 'tf-2.1'

#### Create or import a workspace
In this example, we assume a workspace already exists, but it is easy to create a workspace on-the-fly with `Workspace.create()`. Use environment variables to load sensitive information such as `subscription_id` and authentication passwords.

In [None]:
ws = Workspace.get(
    name='dlwp-ml-scus',
    subscription_id=os.environ.get('AZURE_SUBSCRIPTION_ID'),
    resource_group='DLWP'
)
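
The advice about environment variables can be sketched as a small helper that fails fast when a required variable is missing, instead of silently passing `None` on to `Workspace.get()`. The name `require_env` is hypothetical and is not part of the AzureML SDK.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error.

    Hypothetical helper for illustration; it is not part of the AzureML SDK.
    """
    value = os.environ.get(name)
    if not value:
        # Covers both an unset variable and an empty string
        raise RuntimeError(
            "Environment variable %s is not set; export it before running "
            "this notebook." % name
        )
    return value
```

With this in place, `subscription_id=require_env('AZURE_SUBSCRIPTION_ID')` could replace the `os.environ.get(...)` call above.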

#### Set up the compute cluster
This code, adapted from the Microsoft documentation example, checks for existing compute resources in the workspace or creates them if they do not exist. We use GPU nodes, of which there are a few choices:
- STANDARD_NC6: Tesla K80
- STANDARD_NC6s_v2: Tesla P100
- STANDARD_NC6s_v3: Tesla V100
- STANDARD_ND6s: Tesla P40
- STANDARD_NV6: Tesla M60

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget

# Name of the cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "compute-NC6v3")
# Environment variables are strings when set, so convert the node counts to int
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 0))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 2))

# Set a GPU VM type
vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_NC6s_v3")

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('Using existing compute target (%s)' % compute_name)
else:
    print('Creating compute target (%s)' % compute_name)
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                                min_nodes=compute_min_nodes,
                                                                max_nodes=compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=10)
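
Values read from the environment arrive as strings, while the cluster's node counts are integers. The following is a minimal sketch of parsing and sanity-checking the scale settings before they reach the provisioning configuration. The helper names `env_int` and `cluster_scale` are hypothetical; the variable names and defaults are the ones used in the cell above.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to `default`."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError("%s must be an integer, got %r" % (name, raw))


def cluster_scale(min_name="AML_COMPUTE_CLUSTER_MIN_NODES",
                  max_name="AML_COMPUTE_CLUSTER_MAX_NODES"):
    """Return (min_nodes, max_nodes) after basic sanity checks."""
    min_nodes = env_int(min_name, 0)
    max_nodes = env_int(max_name, 2)
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("need 0 <= min_nodes <= max_nodes, got %d and %d"
                         % (min_nodes, max_nodes))
    return min_nodes, max_nodes
```

A minimum of 0 nodes lets the cluster scale down completely when idle, which is why it is a sensible default here.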

#### Copy data to the compute cluster
This optional step is needed if data hasn't yet been uploaded to a storage blob connected to the workspace.

In [None]:
from azureml.core import Datastore
ds = Datastore(ws, 'cs2deg3h')
print('Datastore type, account, container')
print(ds.datastore_type, ds.account_name, ds.container_name)

# ds.upload(src_dir=data_directory, target_path='era5', overwrite=False, show_progress=True)

#### Create the experiment

In [None]:
experiment_name = 'era-CS'

exp = Experiment(workspace=ws, name=experiment_name)

#### Optionally create a user-managed environment

Azure ML makes it possible to create a custom environment (e.g., a customized Docker image). This is useful for getting reproducible Python environments across multiple experiments. The cell below should build a usable environment for our purposes. If the build does not succeed, we instead pass package arguments directly to the TensorFlow estimator created later.

In [None]:
from azureml.core import Environment
from azureml.core.environment import CondaDependencies

conda_pkgs = ['netCDF4', 'dask', 'xarray', 'scikit-learn']
pip_pkgs = ['tensorflow==2.1.0']

if environment_name is not None:
    if environment_name in Environment.list(workspace=ws).keys():
        print('Using existing environment %s' % environment_name) 
        env = Environment.get(workspace=ws, name=environment_name)
        env_status = 'Succeeded'
    else:
        print('Creating environment %s' % environment_name)
        env = Environment(name=environment_name)
        env.docker.enabled = True
        env.docker.gpu_support = True
        env.docker.base_image = "mcr.microsoft.com/azureml/base-gpu:openmpi3.1.2-cuda10.1-cudnn7-ubuntu18.04"
        # Add this line to mount /datadrive on a remote VM
        # env.docker.arguments = "--mount 'type=volume,src=/datadrive,dst=/datadrive'"

        # Add conda and pip dependencies
        conda_dep = CondaDependencies()
#         conda_dep.add_tensorflow_pip_package(core_type='gpu', version='2.0')
        for pkg in conda_pkgs:
            conda_dep.add_conda_package(pkg)
        for pkg in pip_pkgs:
            conda_dep.add_pip_package(pkg)
        env.python.conda_dependencies = conda_dep

        # Register the environment
        env.register(workspace=ws)

        # Build the environment
        build = env.build(workspace=ws)
        env_status = build.wait_for_completion(show_output=True).status
else:
    env_status = None

#### Create a TensorFlow estimator
Now we create a TensorFlow estimator that will send our code to be executed on the compute target.  

If the environment built earlier succeeded, we use that image. Otherwise, we fall back on the estimator's built-in image creation options. In that case Azure creates a Docker image the first time this is run and can re-use existing images, including the automatically created one, on later runs. We upload all of the DLWP source code files located in the parent directory of this notebook.

The script we pass to the job is `train_cs.py`, located in this directory. Details about the option parameters (and configurable settings for the specific run) can be seen and set there.

In [None]:
from azureml.train.dnn import TensorFlow

script_params = {
    '--root-directory': ds.path().as_mount(),  # if using ML compute
#    '--root-directory': '/datadrive',  # if using remote VM
    '--predictor-file': predictor_file,
    '--model-file': model_file,
    '--log-directory': log_file,
#     '--temp-dir': '/mnt/tmp'
}

if env_status == 'Succeeded':
    est_args = {'environment_definition': env}
    print('Using environment %s' % environment_name)
else:
    est_args = {
        'framework_version': '2.0',
        'conda_packages': conda_pkgs,
        'pip_packages': pip_pkgs,
        'use_gpu': True
    }
    print('Using default AzureML TF environment')

tf_est = TensorFlow(source_directory=os.path.join(os.getcwd(), os.pardir),
                    script_params=script_params,
                    compute_target=compute_target,
                    entry_script=os.path.join(os.getcwd(), 'train_cs.py'),
                    **est_args)
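
The keys of `script_params` are handed to the entry script as command-line options. The following is a hypothetical sketch of how a training script might read them with `argparse`. It is not the actual DLWP entry script, and the help strings and the choice of which options are required are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the options passed through `script_params` (hypothetical sketch)."""
    parser = argparse.ArgumentParser(description="Train a DLWP model")
    parser.add_argument("--root-directory", required=True,
                        help="mount point or path of the datastore")
    parser.add_argument("--predictor-file", required=True,
                        help="predictor file, relative to the root directory")
    parser.add_argument("--model-file", required=True,
                        help="name under which the trained model is saved")
    parser.add_argument("--log-directory", required=True,
                        help="directory for training logs")
    parser.add_argument("--temp-dir", default=None,
                        help="optional scratch directory")
    return parser.parse_args(argv)
```

`argparse` turns the dashes into underscores, so the values are available as `args.root_directory`, `args.predictor_file`, and so on.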

#### Submit the experiment
...and also print a summary table.

In [None]:
run = exp.submit(config=tf_est, tags=tags)
run

#### Download the saved model
...once the run is complete.

In [None]:
if run.get_status() == 'Completed':
    ds.download('/Users/Jojo/Temp/DLWP', prefix=model_file)
else:
    print("run is in '%s' status; can't download files yet" % run.get_status())
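
Rather than re-running the status check by hand, one could poll until the run reaches a final state. The following is a minimal sketch that takes any zero-argument callable returning a status string, such as `run.get_status`. The function name, the polling interval, and the set of states treated as final are assumptions made for illustration.

```python
import time

# Assumption: these are the states after which the run's status no longer changes
TERMINAL_STATES = ("Completed", "Failed", "Canceled")


def wait_for_terminal_status(get_status, poll_seconds=30.0, timeout_seconds=None):
    """Poll `get_status()` until it returns a terminal state, then return it."""
    start = time.monotonic()
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if timeout_seconds is not None and time.monotonic() - start >= timeout_seconds:
            raise TimeoutError("run still in '%s' status after %s seconds"
                               % (status, timeout_seconds))
        time.sleep(poll_seconds)
```

For example, `if wait_for_terminal_status(run.get_status) == 'Completed':` could guard the download call. The SDK's own `run.wait_for_completion()` is an alternative that blocks until the run finishes.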

In [None]:
ds.download('/Users/Jojo/Temp/DLWP', prefix='dlwp_era5_6h-3_CS48_tau-sfc1000-tcwv-lsm-topo_UNET2-48-relumax')

### Optional extras

#### Register a storage account

This code snippet demonstrates how to register a storage account with the ML workspace. A storage account can be registered with multiple workspaces. The datastore cell above uses the account registered here.

In [None]:
from azureml.core import Datastore
Datastore.register_azure_blob_container(workspace=ws, 
                                        datastore_name='cs2deg3h', 
                                        container_name='cs2deg3h',
                                        account_name='era5',
                                        account_key='<long_key_ending_with_==>')

#### Register a remote VM as a compute target

The remote VM need not even be an Azure resource. However, note that you will not be able to mount a datastore on a remote VM. Instead, refer to the environment cell above, which shows how to add Docker run arguments that mount storage on the VM into the model run's container.

In [None]:
from azureml.core.compute import RemoteCompute, ComputeTarget

compute_target_name = "compute-ND12-1"
attach_config = RemoteCompute.attach_configuration(address='<ip_address>',
                                                   ssh_port=22,
                                                   username='<user>',
                                                   private_key_file='<path/to/key/on/this/machine>',
                                                   private_key_passphrase='<>',
                                                   password='<password>')
compute = ComputeTarget.attach(ws, compute_target_name, attach_config)
status = compute.wait_for_completion(show_output=True)

#### Just list environments

In [None]:
envs = Environment.list(workspace=ws)

for e in envs.keys():
    if 'gpu' in e.lower():
        print("Name", e)
        print("packages", envs[e].python.conda_dependencies.serialize_to_string())