# Lab 2 - Feature Engineering
In this lab we will develop and run a Python script that pre-processes our image set into a set of powerful features - sometimes referred to as bottleneck features.

To create bottleneck features we will utilize a pre-trained Deep Learning network that was trained on a general computer vision domain. 

As explained by your instructor, this approach is called Transfer Learning. Transfer Learning is a powerful Machine Learning technique based on the observation that knowledge gained while solving one problem can be applied to a different (but related) problem.

In the context of an image classification task, a DNN trained on one visual domain can accelerate learning in another visual domain. Although our pre-trained network does not know how to classify aerial land plot images, it knows enough about representing image concepts that, if we use it to pre-process aerial images, the extracted image features can be used to effectively train a relatively simple classifier on a **limited number** of samples.

The diagram below represents the architecture of our solution.

![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/tlcl.png)

We will use **ResNet50** trained on the **ImageNet** dataset to extract features. We will occasionally refer to this component of the solution as a featurizer. The output of the featurizer is a vector of 2048 floating point numbers, each representing a feature extracted from an image.

We will then use the extracted features to train a scikit-learn classifier (in the next lab).
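
The division of labor between a frozen featurizer and a simple classifier can be sketched without any deep learning library. In the sketch below, `toy_featurizer` is a hypothetical stand-in for the pre-trained network (it returns 3 numbers rather than 2048), and a nearest-centroid rule stands in for the scikit-learn classifier used in the next lab. None of these names come from the lab code.

```python
from statistics import mean


def toy_featurizer(image):
    """Hypothetical stand-in for a frozen, pre-trained featurizer.

    Maps an 'image' (here just a list of pixel values) to a fixed-length
    feature vector. It is never trained or updated in this sketch.
    """
    return [mean(image), max(image), min(image)]


def fit_centroids(features, labels):
    """Train a very simple classifier: one mean feature vector per class."""
    centroids = {}
    for label in set(labels):
        rows = [f for f, l in zip(features, labels) if l == label]
        centroids[label] = [mean(column) for column in zip(*rows)]
    return centroids


def predict(centroids, feature):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, feature))
    return min(centroids, key=lambda label: distance(centroids[label]))


# A deliberately tiny labeled set: dark 'images' versus bright 'images'.
train_images = [[10, 12, 11], [9, 14, 10], [200, 210, 205], [190, 220, 215]]
train_labels = ["dark", "dark", "bright", "bright"]

# Only the classifier is fitted; the featurizer is used as-is.
train_features = [toy_featurizer(image) for image in train_images]
centroids = fit_centroids(train_features, train_labels)

print(predict(centroids, toy_featurizer([11, 13, 12])))
print(predict(centroids, toy_featurizer([198, 205, 211])))
```

Because the featurizer already separates the classes well, four training samples are enough here. The lab relies on the same effect at a larger scale.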


In [1]:
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)

SDK version: 1.0.2


## Connect to AML Workspace

In [2]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /Users/jarekk/repos/jakazmie/AMLsLabs/01-aml-walkthrough/aml_config/config.json
jkamlslab
jkamlslab
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


## Create AML Experiment
We will track runs of the feature engineering script in a dedicated Experiment.

In [3]:
from azureml.core import Experiment
experiment_name = 'aerial-feature-engineering'
exp = Experiment(workspace=ws, name=experiment_name)

## Prepare data pre-processing script

The Python script processes an input image dataset into an output bottleneck feature set. The script expects the images to be organized in the folder structure below, with one folder per class:
```
Barren/
Cultivated/
Developed/
Forest/
Herbaceous/
Shrub/
```
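
The mapping from this folder layout to labeled examples can be sketched as follows. This is a minimal illustration run against a throwaway directory tree. The file names `img_0.png` and `img_1.png` are hypothetical placeholders, and the numeric labels follow the order of the folders listed above.

```python
import os
import tempfile

CLASS_NAMES = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]

# Map each class folder name to a numeric label (Barren -> 0, ..., Shrub -> 5).
label_map = {name: index for index, name in enumerate(CLASS_NAMES)}


def list_labeled_images(img_dir):
    """Return (path, numeric label) pairs for every file in the class folders."""
    return [(os.path.join(img_dir, folder, name), label_map[folder])
            for folder in CLASS_NAMES
            for name in sorted(os.listdir(os.path.join(img_dir, folder)))]


# Build a temporary directory tree with two empty placeholder files per class.
with tempfile.TemporaryDirectory() as root:
    for folder in CLASS_NAMES:
        os.makedirs(os.path.join(root, folder))
        for i in range(2):
            open(os.path.join(root, folder, "img_{}.png".format(i)), "w").close()

    labeled = list_labeled_images(root)

print(len(labeled))
print(labeled[0][1], labeled[-1][1])
```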

The location of the input dataset and the location for saving the output dataset are passed to the script as command line parameters. The output dataset will be stored in HDF5, a binary data format commonly used in Machine Learning and High Performance Computing solutions.
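
The script in this lab declares its parameters with `tf.app.flags`. As a rough illustration of the same command line interface, the sketch below uses the standard library's `argparse` instead, which is an assumption made here only to keep the example free of TensorFlow. The parameter names and defaults are copied from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line parameters used by the feature extraction script."""
    parser = argparse.ArgumentParser(description="Extract bottleneck features")
    parser.add_argument("--input_data_dir", default="aerialtiny",
                        help="Folder with training and validation images")
    parser.add_argument("--output_data_dir", default="bottleneck_features",
                        help="A folder for saving bottleneck features")
    parser.add_argument("--file_name", default="aerial_bottleneck_resnet50.h5",
                        help="Name of output training file")
    parser.add_argument("--batch_size", type=int, default=64,
                        help="Number of images per batch")
    return parser.parse_args(argv)


# Pass an explicit argument list instead of reading sys.argv.
args = parse_args(["--input_data_dir", "/tmp/aerial-tiny",
                   "--output_data_dir", "/tmp/bottleneck_features"])
print(args.input_data_dir, args.output_data_dir, args.file_name, args.batch_size)
```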

The script is designed to work with a large number of images, so it does not load all input images into memory at once. Instead, the generator `get_batch` yields batches of image paths, and the utility function `load_images` reads each batch from disk as raw byte sequences. The featurizer graph then converts them into tensors in the format required by **ResNet50**.
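
The batching step can be sketched on its own. The generator below has the same shape as the script's `get_batch`. The file names are hypothetical placeholders, and 1326 is the number of images reported by the local run later in this lab.

```python
def get_batch(pathnames, batchsize=64):
    """Yield successive slices of at most `batchsize` path names."""
    for i in range(0, len(pathnames), batchsize):
        yield pathnames[i:i + batchsize]


# Hypothetical file names standing in for real image paths.
paths = ["image_{:04d}.png".format(i) for i in range(1326)]

batch_sizes = [len(batch) for batch in get_batch(paths)]
print(len(batch_sizes))   # number of batches
print(batch_sizes[-1])    # size of the final, partial batch
```

Only one batch is handed to the featurizer at a time, so the memory needed for image data depends on the batch size rather than on the size of the dataset.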

We will not attempt to run the script on the full dataset in a local environment. It is very computationally intensive and, unless you run it in an environment equipped with a powerful GPU, it would be very slow.

However, we will demonstrate how to run the script locally using the same small development dataset we used in the previous lab. Running the script locally under the control of Azure ML can be very useful during development and debugging.

To process the full dataset we will execute the script on a remote Azure ML Compute equipped with an NVIDIA GPU.

### Create a folder to hold the script

In [4]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

### Use Jupyter `%%writefile` magic to write the script

In [5]:
%%writefile $script_folder/extract.py

import os
import numpy as np
import random
import h5py
from tqdm import tqdm

import tensorflow as tf

import azureml.contrib.brainwave.models.utils as utils
from azureml.contrib.brainwave.models import QuantizedResnet50


def get_batch(pathnames, batchsize=64):
    """Yield successive batches of image paths"""
    for i in range(0, len(pathnames), batchsize):
        yield pathnames[i:i+batchsize]
        

def load_images(batch):
    """Return a batch of images as a list of bytes sequences"""
    images = []
    for path in batch:
        with open(path, 'rb') as f:
            images.append(f.read())
    return images

def create_bottleneck_features():
    """Create bottleneck features and save them to an HDF5 formatted file"""
    img_dir = FLAGS.input_data_dir
    
    # Label images 
    
    # Create the dictionary that maps class names into numeric labels   
    label_map = {
        "Barren": 0,
        "Cultivated": 1,
        "Developed": 2,
        "Forest": 3,
        "Herbaceous": 4,
        "Shrub": 5}    

    # Create a list of all images in a root folder with associated numeric labels
    folders = list(label_map.keys())
    labeled_image_list = [(os.path.join(img_dir, folder, image), label_map[folder]) 
                          for folder in folders 
                          for image in os.listdir(os.path.join(img_dir, folder))
                              ]
    # Shuffle the list
    random.shuffle(labeled_image_list)
    image_paths, labels = zip(*labeled_image_list)
    
    # Build featurizer graph
    
    # Convert input images (loaded as bytes sequences) into (224, 224, 3) tensors
    # with pixel values in Caffe encoding
    in_images = tf.placeholder(tf.string)
    image_tensors = utils.preprocess_array(in_images)

    # Create ResNet50
    model_path = os.path.expanduser('~/models')
    resnet = QuantizedResnet50(model_path, is_frozen=True)

    # Import ResNet50 graph
    features = resnet.import_graph_def(input_tensor=image_tensors)
    
    # Generate bottleneck features
    print("Generating bottleneck features")
    bottleneck_features = []
    with tf.Session() as sess:
        for paths in tqdm(get_batch(image_paths, FLAGS.batch_size)):
            image_batch = load_images(paths)
            result = sess.run([features], feed_dict={in_images: image_batch})
            result = np.reshape(result[0], (len(result[0]), 2048))
            bottleneck_features.extend(result)
        
    bottleneck_features = np.array(bottleneck_features)
    print(bottleneck_features.shape)
        
    # Save the bottleneck features to HDF5 file
    filename = FLAGS.file_name
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    labels = np.asarray(labels)
    print("Saving bottleneck features to {}".format(output_file))
    print("   Features: ", bottleneck_features.shape)
    print("   Labels: ", labels.shape)
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=bottleneck_features)
        labels_dset = hfile.create_dataset('labels', data=labels)
    
    print("Done")

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 64, "Number of images per batch")
tf.app.flags.DEFINE_string('input_data_dir', 'aerialtiny', "Folder with training and validation images")
tf.app.flags.DEFINE_string('output_data_dir', 'bottleneck_features', "A folder for saving bottleneck features")
tf.app.flags.DEFINE_string('file_name', 'aerial_bottleneck_resnet50.h5', "Name of output training file")


def main(argv=None):
    print("Starting")
    print("Reading images from:", FLAGS.input_data_dir)
    print("The output bottleneck file will be saved to:", FLAGS.output_data_dir)

    os.makedirs(FLAGS.output_data_dir, exist_ok=True)

    create_bottleneck_features()
  
if __name__ == '__main__':
    tf.app.run()

Overwriting ./script/extract.py


## Run the script locally

As noted in the introduction, we will first run the script locally.

### Configure Run environment
We will use a user-managed run, which means we assume that all the necessary packages are already available in the Python environment selected to run the script. In our case this is true, as we pre-installed all the dependencies during the lab setup. Alternatively, you can execute a local run in a system-managed environment. In that case AML would build a new conda environment and execute the script in it.

*Make sure to modify the **interpreter_path** property to point to your Python environment. On DSVM this path is `/anaconda/envs/py36/bin/python`*
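
One way to find a suitable value, assuming the notebook kernel already runs in the Python environment you want to use for the run, is to ask the interpreter for its own path. This is a small convenience sketch, not part of the lab code.

```python
import os
import sys

# Path of the interpreter that is running this notebook kernel.
interpreter_path = sys.executable
print(interpreter_path)

# A user-managed run needs this to be an existing, executable file.
is_usable = os.path.isfile(interpreter_path) and os.access(interpreter_path, os.X_OK)
print(is_usable)
```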

In [6]:
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
run_config.environment.python.user_managed_dependencies = True
#run_config.environment.python.interpreter_path = '/anaconda/envs/py36/bin/python'
run_config.environment.python.interpreter_path = '/anaconda3/envs/aml/bin/python'


### Run the script
Note that we need to supply an absolute path to the folder with training and validation images. 
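
If your images live in a folder given relative to the notebook or to your home directory, the path can be resolved before it is passed to the script. A minimal sketch follows. The helper `to_absolute` and the folder name `aerial-tiny` are illustrative assumptions.

```python
import os


def to_absolute(path):
    """Expand '~' and resolve a possibly relative path to an absolute one."""
    return os.path.abspath(os.path.expanduser(path))


# Hypothetical relative folder name, resolved against the current directory.
input_dir = to_absolute("aerial-tiny")
print(input_dir)
print(os.path.isabs(input_dir))
```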

In [7]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory='./script',
    script='extract.py',
    run_config=run_config,
    arguments=['--input_data_dir', '/tmp/aerial-tiny',
               '--output_data_dir', '/tmp/bottleneck_features',
               '--file_name', 'aerial_bottleneck_resnet50_brainwave.h5'])

tags = {"Compute target": "Local", "DNN": "ResNet50"}
run = exp.submit(src, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-feature-engineering,aerial-feature-engineering_1544056301_dea019bf,azureml.scriptrun,Running,Link to Azure Portal,Link to Documentation


Block until the run finishes and stream the output. Meanwhile, check CPU utilization on your workstation. On Linux, run the `htop` utility in a Jupyter terminal window.

In [8]:
run.wait_for_completion(show_output=True)

RunId: aerial-feature-engineering_1544056301_dea019bf

Streaming azureml-logs/80_driver_log.txt

Starting
Reading images from: /tmp/aerial-tiny
The output bottleneck file will be saved to: /tmp/bottleneck_features
2018-12-05 16:31:55.500224: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
Converted 265 variables to const ops.
Generating bottleneck features

0it [00:00, ?it/s]
1it [00:34, 34.69s/it]
2it [00:56, 30.70s/it]
3it [01:17, 27.83s/it]
4it [01:38, 25.85s/it]
5it [01:58, 24.24s/it]
6it [02:19, 23.05s/it]
7it [02:40, 22.40s/it]
8it [03:00, 21.79s/it]
9it [03:21, 21.53s/it]
10it [03:42, 21.28s/it]
11it [04:02, 21.06s/it]
12it [04:23, 21.06s/it]
13it [04:43, 20.77s/it]
14it [05:04, 20.91s/it]
15it [05:27, 21.53s/it]
16it [05:49, 21.47s/it]
17it [06:09, 20.97s/it]
18it [06:30, 21.13s/it]
19it [06:50, 20.69s/it]
20it [07:10, 20.67s/it]
21it [07:24, 18.65s/it]
(1326, 2048)
Saving bottlene

{'runId': 'aerial-feature-engineering_1544056301_dea019bf',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2018-12-06T00:31:42.227766Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': 'f7c7a2f1-07ac-4572-ba3f-1afe825c3304'},
 'runDefinition': {'Script': 'extract.py',
  'Arguments': ['--input_data_dir',
   '/tmp/aerial-tiny',
   '--output_data_dir',
   '/tmp/bottleneck_features',
   '--file_name',
   'aerial_bottleneck_resnet50_brainwave.h5'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Communicator': 0,
  'Target': 'local',
  'DataReferences': {},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'NodeCount': 1,
  'Environment': {'Python': {'InterpreterPath': '/anaconda3/envs/aml/bin/python',
    'UserManagedDependencies': True,
    'CondaDependencies': {'name': 'project_environment',
     'dependencies': ['python=3.6.2', {'pip': ['azureml-defaults']}]},
    'CondaDependenciesFile': None},
   

The logs from the run have been pushed to AML Experiment.

In [9]:
print(run.get_file_names())

['azureml-logs/60_control_log.txt', 'azureml-logs/80_driver_log.txt', 'driver_log', 'azureml-logs/azureml.log']


The bottleneck files can be found in a local directory passed to the run as a command line parameter.

In [10]:
%%sh

ls /tmp/bottleneck_features

aerial_bottleneck_resnet50_brainwave.h5


## Run the script on Azure Machine Learning Managed Compute

As you can see, even on a really small dataset the processing is very slow. In the next step, you will run the script on a full dataset using Azure ML Managed Compute. 
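
To put a rough number on "very slow", the throughput of the local run can be estimated from the progress log above. The figures below are read directly from that log. The elapsed time covers only the batch loop, not the time spent loading the model.

```python
# Figures read from the local run's progress log.
num_images = 1326               # first dimension of the printed feature shape
num_batches = 21                # final iteration count reported by tqdm
elapsed_seconds = 7 * 60 + 24   # 07:24 elapsed at the last batch

images_per_second = num_images / elapsed_seconds
seconds_per_batch = elapsed_seconds / num_batches

print("{:.2f} images/s".format(images_per_second))
print("{:.1f} s per batch".format(seconds_per_batch))
```

At roughly three images per second on a CPU, a dataset many times larger than the development set would take hours, which motivates the move to a GPU-equipped compute target.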

Azure Machine Learning Managed Compute (AmlCompute) is a managed service that enables you to train machine learning models on clusters of Azure virtual machines, including VMs with GPU support.


### Create Azure ML Managed Compute

We will use a single *Standard_NC6* VM equipped with a Tesla K80 GPU as the Azure ML Compute target.

In [11]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os


# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "gpucompute")
# environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 1))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 1))

vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_NC6")


if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use the 'status' property
    print(compute_target.status.serialize())

found compute target. just use it. gpucompute
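
A note on the cluster settings: values read from environment variables always arrive as strings, so node counts obtained this way need to be converted to integers before they are used as numbers. A minimal sketch, using a hypothetical helper `env_int` and an override that is set here only for demonstration:

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to `default`."""
    return int(os.environ.get(name, default))


# Demonstration only: make sure one variable is unset and override the other.
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "4"

min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 1)
max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 1)
print(min_nodes, max_nodes, type(max_nodes).__name__)
```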


### Configure Datastores 
The dataset we will use for training has been uploaded to a public Azure blob storage container. We will register this container as an AML Datastore within our workspace. Before the data prep script runs, the datastore's content - training images - will be copied to the local storage on the compute node.

After the script completes, its output - the bottleneck features file - will be uploaded by AML to the workspace's default datastore.

In [12]:
from azureml.core import Datastore

images_account = 'azureailabs'
images_container = 'aerial-med'
datastore_name = 'input_images'

# Check if the datastore exists. If not create a new one
try:
    input_ds = Datastore.get(ws, datastore_name)
    print('Found existing datastore for input images:', input_ds.name)
except Exception:
    # The datastore is not registered in this workspace yet
    print('Registering a new datastore for input images')
    input_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name=datastore_name,
                                            container_name=images_container,
                                            account_name=images_account)

 
   
print(input_ds.name, input_ds.datastore_type, input_ds.account_name, input_ds.container_name)

output_ds = ws.get_default_datastore()
print("Using the default datastore for output: ")
print(output_ds.name, output_ds.datastore_type, output_ds.account_name, output_ds.container_name)


Found existing datastore for input images: input_images
input_images AzureBlob azureailabs aerial-med
Using the default datastore for output: 
workspaceblobstore AzureBlob jkamlslastoragevfzvtchj azureml-blobstore-b9d096b6-8b2a-49bb-aef5-cc1bd0f6b751


### Start and monitor a remote run

We will run the script in a Docker container that AML creates automatically and configures with the specified dependencies.

The first run takes longer because the Docker image has to be built. Subsequent runs are much faster, as long as the script dependencies don't change.

You can check the progress of a running job in multiple ways: Azure Portal, AML Jupyter Widgets, log files streaming. We will use AML Jupyter Widgets.

In [13]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DataReferenceConfiguration
from azureml.core.runconfig import RunConfiguration
from azureml.core import ScriptRunConfig

script_folder = 'script'
script_name = 'extract.py'
output_dir = 'bottleneck_features'
input_dir = 'aerial'

# create a new RunConfig object
run_config = RunConfiguration(framework="python")

# Set compute target to Azure ML compute and configure docker base image
run_config.target = compute_target.name
run_config.environment.docker.enabled = True
run_config.environment.docker.base_image = azureml.core.runconfig.DEFAULT_GPU_IMAGE

# specify CondaDependencies obj
# pip_packages = ['azureml-sdk[contrib]']
conda_packages = ['h5py', 'tqdm', 'tensorflow-gpu==1.10']
#conda_dependencies = CondaDependencies.create(pip_packages=pip_packages, conda_packages=conda_packages)
conda_dependencies = CondaDependencies.create(conda_packages=conda_packages)
run_config.environment.python.conda_dependencies = conda_dependencies
    

# configure data references
input_dr = DataReferenceConfiguration(datastore_name=input_ds.name, 
                   path_on_compute=input_dir,                   
                   mode='download', # download files from datastore to compute target
                   overwrite=True)

output_dr = DataReferenceConfiguration(datastore_name=output_ds.name, 
                   path_on_datastore=output_dir, 
                   path_on_compute=output_dir,
                   mode='upload', # upload files from the compute to datastore
                   overwrite=True)

run_config.data_references = {input_ds.name: input_dr, output_ds.name: output_dr}
    

# Specify command line arguments
arguments = ['--input_data_dir', str(input_ds.as_download()),
             '--output_data_dir', output_dir,
             '--file_name', 'aerial_bottleneck_resnet50_brainwave.h5']


# Configure the script 
src = ScriptRunConfig(source_directory=script_folder, 
                      script=script_name, 
                      run_config=run_config, 
                      arguments=arguments 
                     ) 


Submit the run and start RunDetails widget.

In [15]:
from azureml.widgets import RunDetails

tags = {"Compute target": "AML Compute GPU", "DNN": "Brainwave ResNet50"}
run = exp.submit(src, tags=tags)

RunDetails(run).show()


Block until the run finishes.

In [16]:
run.wait_for_completion(show_output=False)

{'runId': 'aerial-feature-engineering_1544056948838',
 'target': 'gpucompute',
 'status': 'Finalizing',
 'startTimeUtc': '2018-12-06T00:42:48.125762Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': 'f7c7a2f1-07ac-4572-ba3f-1afe825c3304'},
 'runDefinition': {'Script': 'extract.py',
  'Arguments': ['--input_data_dir',
   '$AZUREML_DATAREFERENCE_input_images',
   '--output_data_dir',
   'bottleneck_features',
   '--file_name',
   'aerial_bottleneck_resnet50_brainwave.h5'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Communicator': 0,
  'Target': 'gpucompute',
  'DataReferences': {'input_images': {'DataStoreName': 'input_images',
    'Mode': 'Download',
    'PathOnDataStore': None,
    'PathOnCompute': 'aerial',
    'Overwrite': True},
   'workspaceblobstore': {'DataStoreName': 'workspaceblobstore',
    'Mode': 'Upload',
    'PathOnDataStore': 'bottleneck_features',
    'PathOnCompute': 'bottleneck_features',
    'Overwrite': True}},
  'JobName': No

After the run, AML copied the output bottleneck files to the default datastore. You can verify this in the Azure Portal.

## Clean up resources

Before you move to the next step, you can delete the GPU VM. We will not need it anymore.

In [None]:
compute_target.delete()

## Next Step
The run has completed. You are ready to move to the next part of the lab in which you are going to train a multinomial classification model using the bottleneck features.