# Lab 4 - Training and tuning FCNN classifier


In this lab we will attempt to further improve our image classifier by training a simple fully connected neural network using using the bottleneck features.


![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/TLArch.png)


We will first demonstrate who to run the training script locally to facilitate development and debugging.

After that will use AML feature called `Hyperdrive` to fine tune hyperparameters of our neural network. `Hyperdrive` will utilize Azure Batch AI GPU cluster to run and evaluate concurrent training jobs. After the model is fine tuned, the best version will be registered in AML Model Registry.

![AML Arch](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/amlarch.png)



In [None]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

## Connect to AML Workspace


In [1]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /data/home/demouser/notebooks/MTC_AzureAILabs/DataScienceTrack/01-aml-walkthrough/aml_config/config.json
jkamllab
jkamllab
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


## Create AML Experiment

To track training runs.

In [2]:
experiment_name = 'aerial-train-fcnn'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

## Create training script

In the training script, we use Tensorflow.Keras to define and train a simple fully connected neural network.

The network has one hidden layer. The input to the network is a vector of 2048 floating point numbers - the bottleneck features created in Lab 2. The output layer consists of 6 units - representing six land type classes. To control overfitting the network uses a Dropout layer between the hidden layer and the output layer and L1 and L2 regularization in the output layer.

The number of units in the hidden layer, L1 and L2 values, and batch size are all tuneable hyperparameters. The Dropout ratio is fixed at 0.5.

Since the bottleneck feature files are small (as compared to original image datasets) they can be loaded into memory all at once.

The trained model will be saved into the ./outputs folder. This is one of the special folders in AML. The other one is the ./logs folder. The content in these folders is automatically uploaded to the run history.

The script uses AML Run object to track two performane measures: training accuracy and validation accuracy. The metrics are captured at the end of each epoch.


### Create a folder to hold the script

In [3]:
import os
script_folder = './script'
script_name = 'train.py'
os.makedirs(script_folder, exist_ok=True)

### Use Jupyter `%%writefile` magic to write the script


In [4]:
%%writefile $script_folder/train.py

import os
import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout, Flatten, Input
from tensorflow.keras.regularizers import l1_l2
from sklearn.model_selection import train_test_split

from azureml.core import Run

import numpy as np
import random
import h5py


# Create custom callback to track accuracy measures in AML Experiment
class RunCallback(tf.keras.callbacks.Callback):
    def __init__(self, run):
        self.run = run
        
    def on_epoch_end(self, batch, logs={}):
        self.run.log(name="training_acc", value=float(logs.get('acc')))
        self.run.log(name="validation_acc", value=float(logs.get('val_acc')))


# Define network
def fcn_classifier(input_shape=(2048,), units=512, classes=6,  l1=0.01, l2=0.01):
    features = Input(shape=input_shape)
    x = Dense(units, activation='relu')(features)
    x = Dropout(0.5)(x)
    y = Dense(classes, activation='softmax', kernel_regularizer=l1_l2(l1=l1, l2=l2))(x)
    model = Model(inputs=features, outputs=y)
    model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Training regime
def train_evaluate(run):
   
    print("Loading bottleneck features")
    train_file_name = os.path.join(FLAGS.data_folder, FLAGS.train_file_name)
    
    # Load bottleneck training features and labels
    with h5py.File(train_file_name, "r") as hfile:
        features = np.array(hfile.get('features'))
        labels = np.array(hfile.get('labels'))
        
    
        
    # Split the data into training and validation partitions   
    X_train, X_validation, y_train, y_validation = train_test_split(features, labels,
                                                               test_size=0.1,
                                                               shuffle=True,
                                                               stratify=labels)
        
    # Convert labels into one-hot encoded format
    y_train = to_categorical(y_train, num_classes=6)
    y_validation = to_categorical(y_validation, num_classes=6)
    
    # Create a network
    model = fcn_classifier(input_shape=(2048,), units=FLAGS.units, l1=FLAGS.l1, l2=FLAGS.l2)
    
    # Create AML tracking callback
    run_callback = RunCallback(run)
    
    # Start training
    print("Starting training")
    model.fit(X_train, y_train,
          batch_size=FLAGS.batch_size,
          epochs=FLAGS.epochs,
          shuffle=True,
          validation_data=(X_validation, y_validation),
          callbacks=[run_callback])
          
    # Save the trained model to outputs which is a standard folder expected by AML
    print("Training completed.")
    os.makedirs('outputs', exist_ok=True)
    model_file = os.path.join('outputs', 'aerial_fcnn_classifier.hd5')
    print("Saving model to: {0}".format(model_file))
    model.save(model_file)
    

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 32, "Number of images per batch")
tf.app.flags.DEFINE_integer('epochs', 10, "Number of epochs to train")
tf.app.flags.DEFINE_integer('units', 512, "Number of epochs to train")
tf.app.flags.DEFINE_float('l1', 0.01, "l1 regularization")
tf.app.flags.DEFINE_float('l2', 0.01, "l2 regularization")
tf.app.flags.DEFINE_string('data_folder', './bottleneck', "Folder with bottleneck features and labels")
tf.app.flags.DEFINE_string('train_file_name', 'aerial_bottleneck_resnet50.h5', "Training file name")

def main(argv=None):
    
    # get hold of the current run
    run = Run.get_submitted_run()
    train_evaluate(run)
  

if __name__ == '__main__':
    tf.app.run()
    

Writing ./script/train.py


## Run the script locally

### Download bottleneck features 

The bottleneck features file has been uploaded to the default datastore in Lab 2.

In [5]:
data_folder = 'bottleneck_features'

In [None]:
ds = ws.get_default_datastore()
ds.download(target_path='/tmp', prefix=data_folder, overwrite=True)
!ls /tmp/bottleneck_features

### Configure Run environment
We will use a user-managed run, which means we assume that all the necessary packages are already available in the Python environment selected to run the script. In our case this is true, as we pre-installed all the dependencies during the lab setup. Alternatively, you can execute a local run in system-managed environment. In that case AML would build a new conda environment and execute the script in it.

*Make sure to modify the **interpreter_path** property to point to your Python environment. On DSVM this path is `/anaconda/envs/py36/bin/python`*

In [None]:
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
run_config.environment.python.user_managed_dependencies = True
run_config.environment.python.interpreter_path = '/anaconda/envs/py36/bin/python'


### Run the script. 
Note that we need to supply an absolute path to the folder with training  images. 

In [None]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory=script_folder,
    script=script_name,
    run_config=run_config,
    arguments=['--data_folder', '/tmp/bottleneck_features',
               '--train_file_name', 'aerial_bottleneck_resnet50.h5',
               '--l1', 0.001,
               '--l2', 0.001,
               '--units', 512,
               '--epochs', 40])

tags = {"Compute target": "Local"}
run = exp.submit(src, tags=tags)
run

### Monitor the run

In [None]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

Wait for the job completion.

In [None]:
run.wait_for_completion(show_output=False) # specify True for a verbose log

## Fine tune the network with `Hyperdrive`

If you analyze the *training_acc* and *validation_acc* charts you will notice that *training_acc* continues to climb while *validation_acc* reaches a plateua and may even start degrading. This is an indication of overfitting. In the next part we will tune hyper-parameters of the model to achieve better and more consistent performance.

We will use AML Hyperdrive and Azure Batch AI GPU cluster to run concurrent hyperparameter tuning jobs.

### Create remote compute cluster

We will use Azure Batch AI GPU cluster to run  hyper parameter tuning.

The cluster is set up for autoscaling. It will start with a single node and can scale to up to 4 nodes. The nodes are NC6 VMs with Tesla K80 GPU.

**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in the workspace this code uses it and skips the creation process.

In [6]:
from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
batchai_cluster_name = "batchaigpucls"

found = False
# Check if this compute target already exists in the workspace.
for ct_name, ct in ws.compute_targets.items():
    print(ct.name, ct.type)
    if (ct.name == batchai_cluster_name and ct.type == 'BatchAI'):
        found = True
        print('Found existing compute target.')
        bai_compute_target = ct
        break
        
if not found:
    print('Creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_NC6", 
                                                                autoscale_enabled = True,
                                                                cluster_min_nodes = 1, 
                                                                cluster_max_nodes = 4)

    # Create the cluster.
    bai_compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    bai_compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)

batchaicls BatchAI
Creating a new compute target...
Creating
succeeded..............
BatchAI wait for completion finished
Minimum number of nodes requested have been provisioned


### Configure datastore

The bottleneck files have been uploaded to the workspace's default datastore during Lab 2. We will mount the store on the nodes of the cluster.

In [7]:
from azureml.core import Datastore

ds = ws.get_default_datastore()
print("Using the default datastore for training data: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)


Using the default datastore for training data: 
workspacefilestore AzureFile jkamllabstoragefcedzgrt azureml-filestore-274a0f3b-af67-4dc1-9e18-76e7038eafde


### Run a test run on a single node of the cluster

In [8]:
from azureml.train.estimator import Estimator

script_params = {
    '--data_folder': ds.path('bottleneck_features').as_download(),
    '--training_file_name': 'aerial_bottleneck_resnet50.h5',
    '--l1': 0.001,
    '--l2': 0.001,
    '--units': 512,
    '--epochs': 10
}


pip_packages = ['h5py','pillow', 'scikit-learn', 'tensorflow-gpu']

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=bai_compute_target,
                entry_script=script_name,
                pip_packages=pip_packages,
                use_gpu=True,
                node_count=1,
                process_count_per_node=1
                )

In [9]:
tags = {"Compute target": "BAI", "Run Type": "Test drive"}
run = exp.submit(est, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-train-fcnn,aerial-train-fcnn_1541720931346,azureml.scriptrun,Queued,Link to Azure Portal,Link to Documentation


In [11]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

_UserRun(widget_settings={'childWidgetDisplay': 'popup'})

In [12]:
run.wait_for_completion(show_output=True)

RunId: aerial-train-fcnn_1541720931346

Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt
Streaming log file azureml-logs/80_driver_log.txt

Streaming azureml-logs/80_driver_log.txt

Loading bottleneck features
Starting training
Train on 19882 samples, validate on 2210 samples
Epoch 1/10
2018-11-08 23:52:05.236499: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-08 23:52:05.357021: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 57fc:00:00.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-11-08 23:52:05.357064: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2018-11-08 23:52:05.638159: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strengt

Epoch 3/10

   32/19882 [..............................] - ETA: 3s - loss: 0.2700 - acc: 0.937
  448/19882 [..............................] - ETA: 2s - loss: 0.2294 - acc: 0.950
  896/19882 [>.............................] - ETA: 2s - loss: 0.2248 - acc: 0.945
 1344/19882 [=>............................] - ETA: 2s - loss: 0.2023 - acc: 0.950
 1792/19882 [=>............................] - ETA: 2s - loss: 0.1919 - acc: 0.952
 2240/19882 [==>...........................] - ETA: 2s - loss: 0.1937 - acc: 0.953
 2688/19882 [===>..........................] - ETA: 2s - loss: 0.1927 - acc: 0.951
 3136/19882 [===>..........................] - ETA: 2s - loss: 0.1949 - acc: 0.952
 3584/19882 [====>.........................] - ETA: 1s - loss: 0.1946 - acc: 0.952
 4032/19882 [=====>........................] - ETA: 1s - loss: 0.1982 - acc: 0.952
 4480/19882 [=====>........................] - ETA: 1s - loss: 0.1997 - acc: 0.950
Epoch 4/10

   32/19882 [..............................] - ETA: 3s - loss: 

Epoch 6/10

   32/19882 [..............................] - ETA: 3s - loss: 0.2645 - acc: 0.937
  480/19882 [..............................] - ETA: 2s - loss: 0.1571 - acc: 0.947
  896/19882 [>.............................] - ETA: 2s - loss: 0.1584 - acc: 0.953
 1344/19882 [=>............................] - ETA: 2s - loss: 0.1606 - acc: 0.953
 1792/19882 [=>............................] - ETA: 2s - loss: 0.1579 - acc: 0.954
 2240/19882 [==>...........................] - ETA: 2s - loss: 0.1644 - acc: 0.953
 2688/19882 [===>..........................] - ETA: 2s - loss: 0.1605 - acc: 0.956
 3136/19882 [===>..........................] - ETA: 1s - loss: 0.1601 - acc: 0.956
 3584/19882 [====>.........................] - ETA: 1s - loss: 0.1616 - acc: 0.955
 4000/19882 [=====>........................] - ETA: 1s - loss: 0.1622 - acc: 0.954
 4416/19882 [=====>........................] - ETA: 1s - loss: 0.1647 - acc: 0.954
Epoch 7/10

   32/19882 [..............................] - ETA: 4s - loss: 

 4032/19882 [=====>........................] - ETA: 1s - loss: 0.1463 - acc: 0.959
 4480/19882 [=====>........................] - ETA: 1s - loss: 0.1483 - acc: 0.959
Epoch 9/10

   32/19882 [..............................] - ETA: 3s - loss: 0.3192 - acc: 0.937
  480/19882 [..............................] - ETA: 2s - loss: 0.1198 - acc: 0.970
  928/19882 [>.............................] - ETA: 2s - loss: 0.1287 - acc: 0.967
 1376/19882 [=>............................] - ETA: 2s - loss: 0.1159 - acc: 0.972
 1792/19882 [=>............................] - ETA: 2s - loss: 0.1109 - acc: 0.9738
 2208/19882 [==>...........................] - ETA: 2s - loss: 0.1278 - acc: 0.969
 2624/19882 [==>...........................] - ETA: 2s - loss: 0.1334 - acc: 0.965
 3040/19882 [===>..........................] - ETA: 2s - loss: 0.1352 - acc: 0.965
 3488/19882 [====>.........................] - ETA: 2s - loss: 0.1342 - acc: 0.964
 3936/19882 [====>.........................] - ETA: 1s - loss: 0.1337 - ac

Training completed.
Saving model to: outputs/aerial_fcnn_classifier.hd5


The experiment completed successfully. Finalizing run...
Logging experiment finalizing status in history service

Execution Summary
RunId: aerial-train-fcnn_1541720931346



{'runId': 'aerial-train-fcnn_1541720931346',
 'target': 'batchaigpucls',
 'status': 'Finalizing',
 'startTimeUtc': '2018-11-08T23:49:57.909342Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '6868fa55-9fdd-4b50-a14f-e9b522887eef'},
 'runDefinition': {'Script': 'train.py',
  'Arguments': ['--data_folder',
   '$AZUREML_DATAREFERENCE_bc3866b37c4e4fffbb265cbf96602e44',
   '--training_file_name',
   'aerial_bottleneck_resnet50.h5',
   '--l1',
   '0.001',
   '--l2',
   '0.001',
   '--units',
   '512',
   '--epochs',
   '10'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Communicator': 0,
  'Target': 'batchaigpucls',
  'DataReferences': {'bc3866b37c4e4fffbb265cbf96602e44': {'DataStoreName': 'workspacefilestore',
    'Mode': 'Download',
    'PathOnDataStore': 'bottleneck_features',
    'PathOnCompute': None,
    'Overwrite': False}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'NodeCount': 1,
  'Environment': 

### Configure Hyperdrive

As noted before, our network has 5 hyperparameters:

- Number of units in the hidden layer
- L1 and L2 regularization
- mini-batch size, and
- dropout ratio

As we have limited time to complete the lab, we are going to limit a number of hyperparameter combinations to try. We will use a fixed batch-size and dropout ratio and focus on hidden layer units and L1 and L2 regularization.

*Hyperdrive* supports many strategies for sampling the hyperparameter space. In this lab, we are going to use the simplest one - grid sampling.


In [13]:
from azureml.train.hyperdrive import *

ps = GridParameterSampling(
    {
        '--units': choice(256, 512),
        '--l1': choice(0.001, 0.01, 0.05),
        '--l2': choice(0.001, 0.01, 0.05)
    }
)

We will use **Estimator** object to configure the training job. Note how we pass the location of the bottleneck files to the estimator. The job will run on GPU VMs and as such we need to use the GPU version of Tensorflow.

In [14]:
from azureml.train.estimator import Estimator

script_params = {
    '--data_folder': ds.path('bottleneck_features').as_download(),
    '--training_file_name': 'aerial_bottleneck_resnet50.h5',
    '--epochs': 50
}

pip_packages = ['h5py','pillow', 'scikit-learn', 'tensorflow-gpu']

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=bai_compute_target,
                entry_script=script_name,
                pip_packages=pip_packages,
                use_gpu=True,
                node_count=1,
                process_count_per_node=1
                )

*Hyperdrive* supports early termination policies to limit exploration of hyperparameter combinations that don't show promise of helping reach the target metric. This feature is especially useful when traversing large hyperparameter spaces. Since we are going to run a small number of jobs we will not apply early termination.

In [15]:
policy = NoTerminationPolicy()

Now we are ready to configure a run configuration object, and specify the primary metric as *validation_acc* that's recorded in our training runs. If you go back to visit the training script, you will notice that this value is being logged after every run. We also want to tell the service that we are looking to maximizing this value. We also set the number of total runs to 12, and maximal concurrent job to 4, which is the same as the number of nodes in our computer cluster. 

In [16]:
htc = HyperDriveRunConfig(estimator=est, 
                          hyperparameter_sampling=ps,
                          policy=policy,
                          primary_metric_name='validation_acc', 
                          primary_metric_goal=PrimaryMetricGoal.MAXIMIZE, 
                          max_total_runs=12,
                          max_concurrent_runs=4)

Finally, let's launch the hyperparameter tuning job.

The first run takes longer as the system has to prepare and deploy a docker image with training job runtime dependencies. As long as the dependencies don't change the following runs will be much faster.

Here is what's happening whie you wait.

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. The image is uploaded to the workspace. This stage happens once for each Python environment since the container is cached for subsequent runs. During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically.

- **Running**: In this stage, the necessary scripts and files are sent to the compute target, then data stores are mounted/copied, then the entry_script is run. While the job is running, stdout and the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The ./outputs directory of the run is copied over to the run history in your workspace so you can access these results.


In [17]:
tags = {"Compute target": "BAI"}

hdr = exp.submit(config=htc, tags=tags)
hdr

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-train-fcnn,aerial-train-fcnn_1541721373750,hyperdrive,Running,Link to Azure Portal,Link to Documentation


In [18]:
from azureml.train.widgets import RunDetails
RunDetails(hdr).show()

_HyperDrive(widget_settings={'childWidgetDisplay': 'popup'})

In [None]:
hdr.wait_for_completion(show_output=False) # specify True for a verbose log

## Find and register best model
When all jobs finish, we can find out the one that has the highest accuracy.

In [None]:
best_run = hdr.get_best_run_by_primary_metric()

In [None]:
best_run_metrics = best_run.get_metrics()
parameter_values = best_run.get_details()['runDefinition']['Arguments']

print('Best Run Id: ', best_run.id)
print('\n Validation Accuracy:', best_run_metrics['validation_acc'])
print('\n Units:',parameter_values[7])
print('\n L1:',parameter_values[9])
print('\n L2:',parameter_values[11])


Check the output of the best run.

In [None]:
print(best_run.get_file_names())

## Register model
The last step in the training script wrote the file `aerial_fcnn_classifier.hd5` in the `outputs` directory. As noted before, `outputs` is a special directory in that all content in this  directory is automatically uploaded to your workspace.  This content appears in the run record in the experiment under your workspace. 

You can register the model so that it can be later queried, examined and deployed.

In [None]:
model = best_run.register_model(model_name='aerial_classifier', 
                                model_path='outputs/aerial_fcnn_classifier.hd5')
print(model.name, model.id, model.version, sep = '\t')

## Next Step

The model is now ready for deployment. 

## Clean up resources
Before you move to the next step, delete the cluster.

In [None]:
bai_compute_target.delete()