# Part 2 - Model Training
In this section of the lab we will train a small fully connected neural network using the bottleneck features created in the previous part of the lab.


![Transfer Learning](https://github.com/jakazmie/AIDays/raw/master/DataScientistTrack/02-AML-EndToEndWalkthrough/images/TLArch.png)

We will use TensorFlow's Keras API to define the network and run training. 

We will run training on an Azure Batch AI GPU cluster. After the model is trained, it will be registered in the AML Model Registry.

![AML Arch](https://github.com/jakazmie/AIDays/raw/master/DataScientistTrack/02-AML-EndToEndWalkthrough/images/amlarch.png)



### Create training script

We will use a simple network with one hidden layer. The input to the network is a vector of 2048 floating-point numbers, matching the shape of the bottleneck features we created in the previous step of the lab. The output layer consists of 6 units, one for each of the six land type classes. To control overfitting, the network uses a Dropout layer between the hidden layer and the output layer, and l1 and l2 regularization in the output layer.
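
To make the regularization term concrete, here is a minimal sketch of the penalty that a combined l1/l2 regularizer adds to the loss: `l1` times the sum of absolute weights plus `l2` times the sum of squared weights. It is written in plain Python rather than TensorFlow, and the helper name `l1_l2_penalty` and the toy weight matrix are illustrative assumptions, not part of the lab code.

```python
def l1_l2_penalty(weights, l1=0.01, l2=0.01):
    """Return l1 * sum(|w|) + l2 * sum(w^2) over a 2-D list of weights."""
    flat = [w for row in weights for w in row]
    l1_term = l1 * sum(abs(w) for w in flat)
    l2_term = l2 * sum(w * w for w in flat)
    return l1_term + l2_term


# Toy 2x2 kernel (illustrative values only, not taken from the trained model)
toy_kernel = [[0.5, -1.0], [2.0, 0.0]]
penalty = l1_l2_penalty(toy_kernel, l1=0.01, l2=0.01)
print(penalty)
```

Larger `l1` and `l2` values increase this penalty and push the output-layer weights toward zero, which is why both are treated as hyperparameters.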

The number of units in the hidden layer, the l1 and l2 values, and the batch size are our tunable hyperparameters. The Dropout ratio is fixed at 0.5.
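
As a rough guide to how the hidden-layer size affects model capacity, the following sketch counts the trainable parameters of the network described above (a dense hidden layer followed by a dense output layer; Dropout adds no parameters). The helper name `count_parameters` is hypothetical and exists only for this illustration.

```python
def count_parameters(input_dim=2048, units=512, classes=6):
    """Count weights and biases of a one-hidden-layer dense classifier."""
    hidden = input_dim * units + units    # hidden layer: weights + biases
    output = units * classes + classes    # output layer: weights + biases
    return hidden + output


for units in (256, 512):
    print(units, count_parameters(units=units))
```

Almost all parameters sit in the hidden layer, so halving `units` roughly halves the size of the model.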

Since the bottleneck feature files are rather small, they are loaded into memory as NumPy arrays.

The trained model will be saved into the `./outputs` folder. This is one of the special folders in AML. The other one is the `./logs` folder. The content of these folders is automatically uploaded to the run history so that it is accessible after the run is finished.

#### Create a folder to hold the script

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

#### Use Jupyter `%%writefile` magic to write the script


In [2]:
%%writefile $script_folder/train.py

import os
import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout, Flatten, Input
from tensorflow.keras.regularizers import l1_l2

from azureml.core import Run

import numpy as np
import random
import h5py


class RunCallback(tf.keras.callbacks.Callback):
    """Logs per-epoch training and validation accuracy to the AML run history."""
    def __init__(self, run):
        super().__init__()
        self.run = run
        
    def on_epoch_end(self, epoch, logs=None):
        logs = logs or {}
        self.run.log(name="training_acc", value=float(logs.get('acc')))
        self.run.log(name="validation_acc", value=float(logs.get('val_acc')))


# Define network
def fcn_classifier(input_shape=(2048,), units=512, classes=6,  l1=0.01, l2=0.01):
    features = Input(shape=input_shape)
    x = Dense(units, activation='relu')(features)
    x = Dropout(0.5)(x)
    y = Dense(classes, activation='softmax', kernel_regularizer=l1_l2(l1=l1, l2=l2))(x)
    model = Model(inputs=features, outputs=y)
    model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])
    return model


def train_evaluate(run):
   
    print("Loading bottleneck features")
    train_file_name = os.path.join(FLAGS.data_folder, FLAGS.train_file_name)
    valid_file_name = os.path.join(FLAGS.data_folder, FLAGS.valid_file_name)
    
    # Load bottleneck training features and labels
    with h5py.File(train_file_name, "r") as hfile:
        train_features = np.array(hfile.get('features'))
        train_labels = np.array(hfile.get('labels'))
        
        
    # Load bottleneck validation features and labels
    with h5py.File(valid_file_name, "r") as hfile:
        valid_features = np.array(hfile.get('features'))
        valid_labels = np.array(hfile.get('labels'))
        
    # Create a network
    model = fcn_classifier(input_shape=(2048,), units=FLAGS.units, l1=FLAGS.l1, l2=FLAGS.l2)
    
    run_callback = RunCallback(run)
    
    # Start training
    print("Starting training")
    model.fit(train_features, train_labels,
          batch_size=FLAGS.batch_size,
          epochs=FLAGS.epochs,
          shuffle=True,
          validation_data=(valid_features, valid_labels),
          callbacks=[run_callback])
          
    # Save the trained model to outputs, which is a standard folder expected by AML
    print("Training completed.")
    os.makedirs('outputs', exist_ok=True)
    model_file = os.path.join('outputs', 'aerial_classifier.hd5')
    print("Saving model to: {0}".format(model_file))
    model.save(model_file)
    

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 32, "Number of images per batch")
tf.app.flags.DEFINE_integer('epochs', 10, "Number of epochs to train")
tf.app.flags.DEFINE_integer('units', 512, "Number of units in the hidden layer")
tf.app.flags.DEFINE_float('l1', 0.01, "l1 regularization")
tf.app.flags.DEFINE_float('l2', 0.01, "l2 regularization")
tf.app.flags.DEFINE_string('data_folder', './bottleneck', "Folder with bottleneck features and labels")
tf.app.flags.DEFINE_string('train_file_name', 'aerial_bottleneck_train.h5', "Training file name")
tf.app.flags.DEFINE_string('valid_file_name', 'aerial_bottleneck_valid.h5', "Validation file name")

def main(argv=None):
    # get hold of the current run
    run = Run.get_submitted_run()
    train_evaluate(run)
  

if __name__ == '__main__':
    tf.app.run()

Overwriting ./script/train.py
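
The script receives its hyperparameters as command-line flags, and any flag that is not passed falls back to the default declared in the script. The sketch below illustrates that behavior with the standard-library `argparse` module, which is swapped in here for `tf.app.flags` so that the example runs without TensorFlow. The function name `parse_flags` is an assumption made for this illustration; only the flag names and defaults come from `train.py`.

```python
import argparse


def parse_flags(argv):
    """Parse a subset of the train.py flags, using the same names and defaults."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--batch_size', type=int, default=32)
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--units', type=int, default=512)
    parser.add_argument('--l1', type=float, default=0.01)
    parser.add_argument('--l2', type=float, default=0.01)
    parser.add_argument('--data_folder', type=str, default='./bottleneck')
    return parser.parse_args(argv)


# Only --units and --l1 are overridden; everything else keeps its default.
flags = parse_flags(['--units', '256', '--l1', '0.001'])
print(flags.units, flags.l1, flags.epochs, flags.batch_size)
```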


### Connect to AML workspace


In [3]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/demouser/repos/AIDays/DataScientistTrack/02-AML-EndToEndWalkthrough/aml_config/config.json
jkaml
jkaml
southcentralus
952a710c-8d9c-40c1-9fec-f752138cc0b3


### Create Experiment

We will create a new experiment to manage the training logs.

In [4]:
experiment_name = 'aerial-classifier-train'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Create remote compute cluster

We will use an Azure Batch AI GPU cluster to run training and hyperparameter tuning.

The cluster is set up for autoscaling. It will start with a single node and can scale up to 4 nodes. The nodes are NC6 VMs, each with a Tesla K80 GPU.

**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in the workspace this code uses it and skips the creation process.

In [5]:
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
batchai_cluster_name = ws.name + 'gpucluster'

try:
    # look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=batchai_cluster_name)
    if type(compute_target) is BatchAiCompute:
        print('found compute target {}, just use it.'.format(batchai_cluster_name))
    else:
        print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(batchai_cluster_name))
except ComputeTargetException:
    print('creating a new compute target...')
    compute_config = BatchAiCompute.provisioning_configuration(vm_size="STANDARD_NC6", # GPU-based VM
                                                                #vm_priority='lowpriority', # optional
                                                                autoscale_enabled=True,
                                                                cluster_min_nodes=1, 
                                                                cluster_max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, compute_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

found compute target jkamlgpucluster, just use it.


### Configure datastore

The bottleneck files have been uploaded to the workspace's default datastore during the previous step. We will mount the store on the nodes of the cluster.

In [6]:
from azureml.core import Datastore

ds = ws.get_default_datastore()
print("Using the default datastore for output: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)


Using the default datastore for output: 
workspacefilestore AzureFile jkaml4481123098 azureml-filestore-0406c2cf-2116-47e4-82c7-da677f6c5858


### Create an estimator

As in the previous step, we will use **Estimator** to submit the training job.

The compute target is the Batch AI cluster.

In [7]:
from azureml.train.estimator import Estimator

script_params = {
    '--data_folder': ds.path('bottleneck').as_mount(),
    '--l1': 0.001,
    '--l2': 0.001,
    '--units': 512,
    '--epochs': 75
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                node_count=1,
                process_count_per_node=1,
                use_gpu=True,
                pip_packages=['h5py','pillow','tensorflow-gpu']
                )

### Submit the job to the cluster

Run the experiment by submitting the estimator object.

In [8]:
run = exp.submit(config=est)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-classifier-train,aerial-classifier-train_1539666733889,azureml.scriptrun,Running,Link to Azure Portal,Link to Documentation


Since the call is asynchronous, it returns a **Preparing** or **Running** state as soon as the job is started.

### Monitor a remote run

In total, the first run takes **approximately 10 minutes**. For subsequent runs, as long as the script dependencies don't change, the same image is reused, so the container start-up time is much shorter.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is uploaded to the workspace. This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. 

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This tutorial uses a Jupyter widget as well as a `wait_for_completion` method. 
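
A third option is to poll the run status yourself. The following is a minimal sketch of such a loop, not part of the lab. It assumes a status callable such as `run.get_status` and assumes that `Completed`, `Failed`, and `Canceled` are the terminal state names. The helper `poll_until_done` and the fake status sequence are hypothetical and are used so that the example runs without a workspace.

```python
import time

# Assumed terminal state names; adjust if the service reports others.
TERMINAL_STATES = {'Completed', 'Failed', 'Canceled'}


def poll_until_done(get_status, interval_seconds=15.0, max_polls=240):
    """Call get_status() until it returns a terminal state or polls run out."""
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(interval_seconds)
        status = get_status()
        polls += 1
    return status


# Stand-in for run.get_status so the sketch runs locally.
_fake_states = iter(['Preparing', 'Running', 'Running', 'Completed'])
final_status = poll_until_done(lambda: next(_fake_states), interval_seconds=0)
print(final_status)
```

With a real run, the call would look like `poll_until_done(run.get_status)`.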

### Jupyter widget

Watch the progress of the run with a Jupyter widget.  Like the run submission, the widget is asynchronous and provides live updates every 10-15 seconds until the job completes. 

Note: Currently, there is a problem with the RunDetails widget on the DSVM. 

In [9]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

_UserRun(widget_settings={'childWidgetDisplay': 'popup'})

### Get log results upon completion

Model training and monitoring happen in the background. Wait until the model has completed training before running more code. Use `wait_for_completion` to show when the model training is complete.

In [10]:
run.wait_for_completion(show_output=True) # specify True for a verbose log

RunId: aerial-classifier-train_1539666733889

Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt
Streaming log file azureml-logs/80_driver_log.txt

Streaming azureml-logs/80_driver_log.txt

Loading bottleneck features
Starting training
Train on 17673 samples, validate on 3093 samples
Epoch 1/75
2018-10-16 05:12:50.318182: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-10-16 05:12:50.428776: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1411] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 3552:00:00.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2018-10-16 05:12:50.428819: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1490] Adding visible gpu devices: 0
2018-10-16 05:12:50.708236: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] Device interconnect StreamExecutor with s

{'runId': 'aerial-classifier-train_1539666733889',
 'target': 'jkamlgpucluster',
 'status': 'Finalizing',
 'startTimeUtc': '2018-10-16T05:12:19.640215Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '5d0b91db-df91-409d-988f-e8a3c5d1f855'},
 'runDefinition': {'Script': 'train.py',
  'Arguments': ['--data_folder',
   '$AZUREML_DATAREFERENCE_ac3ac0d274e64c84a173c2e63e3cb87b',
   '--l1',
   '0.001',
   '--l2',
   '0.001',
   '--units',
   '512',
   '--epochs',
   '75'],
  'Framework': 0,
  'Target': 'jkamlgpucluster',
  'DataReferences': {'ac3ac0d274e64c84a173c2e63e3cb87b': {'DataStoreName': 'workspacefilestore',
    'Mode': 'Mount',
    'PathOnDataStore': 'bottleneck',
    'PathOnCompute': None,
    'Overwrite': False}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'Environment': {'Python': {'InterpreterPath': 'python',
    'UserManagedDependencies': False,
    'CondaDependencies': {'name': 'project_environment',
    

### Display run results

The training has completed. You can list the files generated during the run, including the logs and the saved model, by calling the `Run.get_file_names()` method.

In [11]:
print(run.get_file_names())

['azureml-logs/60_control_log.txt', 'azureml-logs/80_driver_log.txt', 'outputs/aerial_classifier.hd5', 'driver_log', 'azureml-logs/azureml.log', 'azureml-logs/55_batchai_execution.txt']


## Hyperparameter tuning

As you can see in the **Training Accuracy** and **Validation Accuracy** charts, the validation accuracy reaches a plateau while the training accuracy keeps climbing. This indicates that the model is overfitting. We may need to fine-tune the hyperparameters to achieve better performance.
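
The same check can be made numerically instead of by reading the charts. The sketch below computes the per-epoch gap between training and validation accuracy and reports whether it has been widening. The helper names are hypothetical and the accuracy curves are made-up illustrative values; in the lab, the real series would come from the `training_acc` and `validation_acc` metrics returned by `run.get_metrics()`.

```python
def accuracy_gap(train_acc, val_acc):
    """Per-epoch difference between training and validation accuracy."""
    return [t - v for t, v in zip(train_acc, val_acc)]


def gap_is_widening(train_acc, val_acc, window=3):
    """True if the gap grew over each of the last `window` epochs."""
    gaps = accuracy_gap(train_acc, val_acc)[-(window + 1):]
    return len(gaps) == window + 1 and all(b > a for a, b in zip(gaps, gaps[1:]))


# Made-up curves for illustration only.
train = [0.90, 0.93, 0.95, 0.97, 0.98]
valid = [0.90, 0.92, 0.93, 0.93, 0.93]
print(gap_is_widening(train, valid))
```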

As described before, our network has 4 hyperparameters: 
- Number of units in the hidden layer
- l1 and l2 regularization, and
- mini-batch size.

So far we have executed a single training run on one set of hyperparameters. 

Now, we are going to use an AML feature called *hyperdrive* to launch multiple runs on multiple cluster nodes using different combinations of hyperparameter values.

First, let's define the hyperparameter space using random sampling.


In [12]:
from azureml.train.hyperdrive import *

ps = RandomParameterSampling(
    {
        '--batch_size': choice(32, 64),
        '--units': choice(256, 512),
        '--l1': choice(0.01, 0.04),
        '--l2': choice(0.01, 0.04, 0.08)
    }
)
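
To get a feel for the size of this search space, the sketch below enumerates every combination of the `choice` values above and draws one random configuration. It uses only the Python standard library and illustrates the idea of random sampling over discrete choices; it is not how hyperdrive is implemented, and `sample_configuration` is a hypothetical helper.

```python
import itertools
import random

# Same discrete choices as in the sampling definition above.
search_space = {
    '--batch_size': [32, 64],
    '--units': [256, 512],
    '--l1': [0.01, 0.04],
    '--l2': [0.01, 0.04, 0.08],
}

# Full grid: the product of the number of choices per hyperparameter.
all_combinations = list(itertools.product(*search_space.values()))
print(len(all_combinations))


def sample_configuration(space, rng):
    """Pick one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # seeded only to make the sketch repeatable
sample = sample_configuration(search_space, rng)
print(sample)
```

The grid contains 2 x 2 x 2 x 3 = 24 combinations, so a budget of 20 randomly sampled runs cannot cover all of them.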

Next, we will create a new estimator without the above parameters, since they will be passed in later. Note that we still need to keep the *data_folder* parameter, since it is not a hyperparameter we will sweep.

In [13]:
script_params = {
    '--data_folder': ds.path('bottleneck').as_mount()
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                node_count=1,
                process_count_per_node=1,
                use_gpu=True,
                pip_packages=['h5py','pillow','tensorflow-gpu']
                )

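Now we will define an early termination policy. With `evaluation_interval=30`, the BanditPolicy is applied every 30 times the primary metric (defined later) is reported, and our script reports it once per epoch. With `slack_factor=0.1`, Azure ML terminates any run whose primary metric falls below the best run's value divided by 1.1, that is, below roughly 91% of the best value. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric. Note that a run only reaches the first evaluation if it trains for at least 30 epochs, while the script's default is 10, so the interval and `--epochs` should be chosen together.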

In [14]:
policy = BanditPolicy(evaluation_interval=30, slack_factor=0.1)
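
The termination rule can be sketched in a few lines. For a metric that is being maximized, the documented behavior is that a run is stopped when its metric falls below the best run's metric divided by `1 + slack_factor`. The function below is a hypothetical illustration of that rule, not the service's implementation, and the metric values are made up.

```python
def bandit_should_terminate(run_metric, best_metric, slack_factor=0.1):
    """Maximization case: stop a run whose metric is below best / (1 + slack_factor)."""
    threshold = best_metric / (1.0 + slack_factor)
    return run_metric < threshold


best = 0.95  # illustrative best validation accuracy so far
for candidate in (0.90, 0.85):
    print(candidate, bandit_should_terminate(candidate, best))
```

With a best value of 0.95 the threshold is about 0.864, so a run at 0.90 keeps going while a run at 0.85 would be stopped.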

Now we are ready to configure a run configuration object and specify the primary metric as *validation_acc*, which is recorded in our training runs. If you revisit the training script, you will notice that this value is logged after every epoch (one full pass over the training data). We also tell the service that we want to maximize this value. Finally, we set the total number of runs to 20 and the maximum number of concurrent runs to 4, which matches the maximum number of nodes in our compute cluster.

In [15]:
htc = HyperDriveRunConfig(estimator=est, 
                          hyperparameter_sampling=ps,
                          policy=policy,
                          primary_metric_name='validation_acc', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=20,
                          max_concurrent_runs=4)

Finally, let's launch the hyperparameter tuning job.

In [16]:
hdr = exp.submit(config=htc)
hdr

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-classifier-train,aerial-classifier-train_1539667013556,hyperdrive,Running,Link to Azure Portal,Link to Documentation


In [17]:
from azureml.train.widgets import RunDetails
RunDetails(hdr).show()

_HyperDrive(widget_settings={'childWidgetDisplay': 'popup'})

In [18]:
hdr.wait_for_completion(show_output=True) # specify True for a verbose log

RunId: aerial-classifier-train_1539667013556

Execution Summary
RunId: aerial-classifier-train_1539667013556



{'runId': 'aerial-classifier-train_1539667013556',
 'target': 'jkamlgpucluster',
 'status': 'Completed',
 'endTimeUtc': '2018-10-16T05:31:24.000Z',
 'properties': {'primary_metric_config': '{"name": "validation_acc", "goal": "maximize"}',
  'runTemplate': 'HyperDrive',
  'azureml.runsource': 'hyperdrive'},
 'logFiles': {'azureml-logs/hyperdrive.txt': 'https://jkaml4481123098.blob.core.windows.net/azureml/ExperimentRun/aerial-classifier-train_1539667013556/azureml-logs/hyperdrive.txt?sv=2017-04-17&sr=b&sig=%2FS9aauyHyz%2FPmDKNSotaLS9B88rpQ04MK9PmcZYjrJ8%3D&st=2018-10-16T05%3A21%3A24Z&se=2018-10-16T13%3A31%3A24Z&sp=r'}}

## Find and register best model
When all the jobs finish, we can find the run with the highest validation accuracy.

In [19]:
best_run = hdr.get_best_run_by_primary_metric()

In [22]:
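best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']
print(parameter_values)

# Arguments is a flat list of alternating flag names and values.
# Pair them up so the lookups do not depend on the argument order.
params = dict(zip(parameter_values[::2], parameter_values[1::2]))

print('Best Run Id: ', best_run.id)
print('\n Validation Accuracy:', best_run_metrics['validation_acc'])
print('\n Batch size:', params['--batch_size'])
print('\n Units:', params['--units'])
print('\n L1:', params['--l1'])
print('\n L2:', params['--l2'])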

['--data_folder', '$AZUREML_DATAREFERENCE_87670e04babb41e1a0967a1dde85e8e8', '--batch_size', '64', '--l1', '0.01', '--l2', '0.04', '--units', '512']
Best Run Id:  aerial-classifier-train_1539667013556_10

 Validation Accuracy: [0.9343679275976736, 0.9272550921628208, 0.9385709667375394, 0.9405108309277739, 0.9443905593660554, 0.94245069515655, 0.9456838021723922, 0.9434206272420319, 0.9489169091882345, 0.9469770449787291]

 Batch size: 64

 Units: 512

 L1: 0.01

 L2: 0.04


## Register model
The last step in the training script wrote the file `aerial_classifier.hd5` to the `outputs` directory. As noted before, `outputs` is a special directory: all content in it is automatically uploaded to your workspace. This content appears in the run record in the experiment under your workspace.

You can register the model so that it can later be queried, examined, and deployed.

In [None]:
model = best_run.register_model(model_name='aerial_classifier', model_path='outputs/aerial_classifier.hd5')
print(model.name, model.id, model.version, sep = '\t')

The model is ready for deployment.