# Part 1 - Feature Engineering
In this section of the lab we will develop and run a Python script that will pre-process training and validation images into a set of powerful features - refered to in the lab as bottleneck features.

To create bottleneck features we will utilize a pre-trained Deep Learning network that was trained on a general computer vision domain. 

As explained by your instructor this approach is called Transfer Learning. Transfer Learning is a powerful Machine Learning technique that is based on an observation that the knowledge gained while solving one problem can be applied to a different (but related problem).

In the context of an image classification task, a DNN trained on one visual domain can accelerate learing in another visual domain. Although, our pre-trained network does not know how to classify aerial land plot images, it knows enough about representing image concepts that if we use it to pre-process aerial images, the extracted image features can be used to effectively train a relatively simple classifier on a **limited number** of samples.

The below diagram represents the architecture of our solution.

![Transfer Learning](https://github.com/jakazmie/AIDays/raw/master/DataScientistTrack/02-AML-EndToEndWalkthrough/images/TLArch.png)

We will use a **ResNet50** trained on **imagenet** dataset to extract features. We will occasionally refer to this component of the solution as a featurizer. The output of the featurizer is a vector of 2048 floating point numbers, each representing a feature extracted from an image. 

We will then use extracted features to train a simple fully connected neural network (FCNN) to classify aerial land plot images. We could use any other classification algorithm - e.g. logistic regression or decision trees but FCNN gives us a lot flexibility in fine tuning the model.


The Python script generated by the next Jupyter cell processes an input image dataset into an output bottleneck feature set. The script expects the images to be organized in the below folder structure:
```
train/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
valid/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
```

The location of the input dataset and the location where to save the output dataset are passed to the script as command line parameters. The output dataset will be stored in a binary HDF5 data format used commonly in Machine Learning and High Performance Computing solutions.

The script is designed to work with a large number of images. As such it does not load all input images to memory at once. Instead it utilizes custom Python generator class - `ImageGenerator` to feed **ResNet50** featurizer. The class yields batches of images - as Numpy arrays - preprocessed to the format required by **ResNet50**. 

We will not attempt to run the script in a local environment. It is very computationally intensive and unless you run it in an evironment equipped with a powerful GPU it would be very slow. It would be *painfully* slow if you attempt to run it in Azure Notebooks.

Instead we will run the script on a remote Azure GPU VM.

Your instructor will dive into the code in the script and explain key snippets.

### Prepare data pre-processing script

#### Create a folder to hold the script

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

#### Use Jupyter `%%writefile` magic to write the script

In [21]:
%%writefile $script_folder/extract.py

import os
import numpy as np
import random
import h5py

import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical

# This is a generator that yields batches of preprocessed images
class ImageGenerator(tf.keras.utils.Sequence):    
    
    def __init__(self, img_dir, preprocess_fn=None, batch_size=64):
        
        # Create the dictionary that maps class names into numeric labels 
        folders = os.listdir(img_dir)
        folders.sort()
        indexes = range(len(folders))
        label_map = {key: value for (key, value) in zip(folders, indexes)}
        self.num_classes = len(label_map)
        
        # Create a list of all images in a root folder with associated numeric labels
        labeled_image_list = [(os.path.join(img_dir, folder, image), label_map[folder]) 
                              for folder in folders 
                              for image in os.listdir(os.path.join(img_dir, folder))
                              ]
        # Shuffle the list
        random.shuffle(labeled_image_list)
        # Set image list and associated label list
        self.image_list, self.label_list = zip(*labeled_image_list) 
        # Set batch size
        self.batch_size = batch_size
       
        # Set the pre-processing function passed as a parameter
        self.preprocess_fn = preprocess_fn
        
        # Set number of batches
        self.n_batches = len(self.image_list) // self.batch_size
        if len(self.image_list) % self.batch_size > 0:
            self.n_batches += 1
            
    def __len__(self):
        
        return self.n_batches
    
    def __getitem__(self, index):
        pathnames = self.image_list[index*self.batch_size:(index+1)*self.batch_size]
        images = self.__load_images(pathnames)
        
        return images
    
    # Load a set of images passed as a parameter into a NumPy array
    def __load_images(self, pathnames):
        images = []
        for pathname in pathnames:
            img = image.load_img(pathname, target_size=(224,224,3))
            img = image.img_to_array(img)
            images.append(img)
        images = np.asarray(images)
        if self.preprocess_fn != None:
            images = self.preprocess_fn(images)   
        
        return images
    
    # Return labels in one-hot encoding
    def get_labels(self):
        
        return to_categorical(np.asarray(self.label_list), self.num_classes)
    


def create_bottleneck_features():
    # Configure input directories
    train_images_dir = os.path.join(FLAGS.input_data_dir, 'train')
    valid_images_dir = os.path.join(FLAGS.input_data_dir, 'valid')

    # Create generators for training and validation data
    train_generator = ImageGenerator(train_images_dir, resnet50.preprocess_input)
    valid_generator = ImageGenerator(valid_images_dir, resnet50.preprocess_input)

    # Create a featurizer
    featurizer = resnet50.ResNet50(
                weights = 'imagenet', 
                input_shape=(224,224,3), 
                include_top = False,
                pooling = 'avg')

    # Generate training bottleneck features
    print("Generating training bottleneck features")
    features = featurizer.predict_generator(train_generator, verbose=1)
    labels = train_generator.get_labels()
    
    # Save the training dataset to HDF5 file
    filename = 'aerial_bottleneck_train.h5'
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    print("Saving training features to {}".format(output_file))
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)
    

     # Generate validation bottleneck features
    print("Generating validation bottleneck features")
    features = featurizer.predict_generator(valid_generator, verbose=1)
    labels = valid_generator.get_labels()
    
    # Save the training dataset to HDF5 file
    filename = 'aerial_bottleneck_valid.h5'
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    print("Saving validation features to {}".format(output_file))
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)
    
    print("Done")

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 64, "Number of images per batch")
tf.app.flags.DEFINE_string('input_data_dir', 'aerialsmall', "Folder with training and validation images")
tf.app.flags.DEFINE_string('output_data_dir', 'bottleneck', "A folder for saving bottleneck features")


def main(argv=None):
    print("Starting")
    print(FLAGS.input_data_dir)
    print(FLAGS.output_data_dir)
    
    print(os.listdir(FLAGS.input_data_dir))
    print(os.listdir(FLAGS.output_data_dir))
    return

    create_bottleneck_features()
  
if __name__ == '__main__':
    tf.app.run()

Overwriting ./script/extract.py


### Prepare Compute Target
Our script is now ready for execution. We will run the script on a single node of a remote GPU cluster. We will use Azure ML Python SDK to create and configure the cluster, datastores, and run configuration.

The first step is to intialize the AML workspace you created during the lab overview.

#### Initialize AML Workspace

In [3]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/demouser/repos/AIDays/DataScientistTrack/02-AML-EndToEndWalkthrough/aml_config/config.json
jkaml
jkaml
southcentralus
952a710c-8d9c-40c1-9fec-f752138cc0b3


#### Create Azure Batch AI cluster as a compute target

**Creation of the cluster takes approximately 5 minutes.** If the cluster is already in the workspace this code uses it and skips the creation process.

In [30]:
from azureml.core.compute import ComputeTarget, BatchAiCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
batchai_cluster_name = ws.name + 'gpucluster'

try:
    # look for the existing cluster by name
    compute_target = ComputeTarget(workspace=ws, name=batchai_cluster_name)
    if type(compute_target) is BatchAiCompute:
        print('found compute target {}, just use it.'.format(batchai_cluster_name))
    else:
        print('{} exists but it is not a Batch AI cluster. Please choose a different name.'.format(batchai_cluster_name))
except ComputeTargetException:
    print('creating a new compute target...')
    compute_config = BatchAiCompute.provisioning_configuration(vm_size="STANDARD_NC6", # GPU-based VM
                                                                #vm_priority='lowpriority', # optional
                                                                autoscale_enabled=True,
                                                                cluster_min_nodes=1, 
                                                                cluster_max_nodes=4)

    # create the cluster
    compute_target = ComputeTarget.create(ws, batchai_cluster_name, compute_config)
    
    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it uses the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)
    
    # Use the 'status' property to get a detailed status for the current cluster. 
    print(compute_target.status.serialize())

creating a new compute target...
Creating
succeeded................
BatchAI wait for completion finished
Minimum number of nodes requested have been provisioned
{'allocationState': 'steady', 'allocationStateTransitionTime': '2018-10-14T20:39:45.914000+00:00', 'creationTime': '2018-10-14T20:37:52.520000+00:00', 'currentNodeCount': 1, 'errors': None, 'nodeStateCounts': {'idleNodeCount': 0, 'leavingNodeCount': 0, 'preparingNodeCount': 1, 'runningNodeCount': 0, 'unusableNodeCount': 0}, 'provisioningState': 'succeeded', 'provisioningStateTransitionTime': '2018-10-14T20:38:06.284000+00:00', 'scaleSettings': {'manual': None, 'autoScale': {'maximumNodeCount': 4, 'minimumNodeCount': 1, 'initialNodeCount': 1}}, 'vmPriority': 'dedicated', 'vmSize': 'STANDARD_NC6'}


#### Configure Datastores 
The dataset we will use for training is located in a public Azure blob storage container. We will register this container as a datastore within our workspace.

The output of the script - bottleneck files - will be pushed to the default datastore that was created automatically when you created your workspace.

In [31]:
from azureml.core import Datastore

datastore_name = 'input_images'
try:
    input_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name=datastore_name,
                                            container_name='aerialsmall',
                                            account_name='azureailabs')
    print('Creating new datastore for input images')
except:
    input_ds = Datastore(ws, datastore_name)
    print('Found existing datastore for input images:', input_ds.name)
   
print(input_ds.name, input_ds.datastore_type, input_ds.account_name, input_ds.container_name)

output_ds = ws.get_default_datastore()
print("Using the default datastore for output: ")
print(output_ds.name, output_ds.datastore_type, output_ds.account_name, output_ds.container_name)


Registering datastore failed with 400 error code and error message
b'{\n  "error": {\n    "code": "UserError",\n    "message": "Another data store with the same name already exists but with different values. Please use patch to update.",\n    "target": null,\n    "details": [],\n    "innerError": null,\n    "debugInfo": {\n      "type": "Microsoft.MachineLearning.Common.WebApi.Exceptions.BadRequestException",\n      "message": "Another data store with the same name already exists but with different values. Please use patch to update.",\n      "stackTrace": "   at Microsoft.MachineLearning.DataStore.EntryPoints.Controllers.DataStoreController.CreateOrUpdate(DataStoreDto dto, Boolean create, Boolean createIfNotExists) in /home/vsts/work/1/s/src/azureml-api/src/DataStore/EntryPoints/Controllers/DataStoreController.cs:line 120\\n   at Microsoft.MachineLearning.DataStore.EntryPoints.Controllers.DataStoreController.Create(DataStoreDto dto, Boolean createIfNotExists) in /home/vsts/work/1/s/sr

Found existing datastore for input images: input_images
input_images AzureBlob azureailabs aerialsmall
Using the default datastore for output: 
workspacefilestore AzureFile jkamlstoragetuqtsfhp azureml-filestore-9548cf74-fb4f-4c04-85a5-1a93063dc9b5


#### Configure data access mode

We will now create `DataReferenceConfiguration` objects to inform AML runtime how to access input and output datastores while running the script on the cluster. Since we will be running the script on a single node of the cluster, the most efficient approach is to download the input dataset to the local storage on the node before running the script. The output datastore will be mounted as a remote file system so we can persist the bottleneck feature files outside of the cluster.

In [32]:
from azureml.core.runconfig import DataReferenceConfiguration
input_dr = DataReferenceConfiguration(datastore_name=input_ds.name, 
                                      path_on_datastore=None, 
                                      path_on_compute=input_ds.name,
                                      mode='download', # download files from datastore to compute target
                                      overwrite=True)


output_dr = DataReferenceConfiguration(datastore_name=output_ds.name,
                                       path_on_datastore='bottleneck',
                                       path_on_compute=output_ds.name,
                                       mode="mount")

#### Create Experiment
**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments. We will use **Experiment** to store logs generated by our script.

In [33]:
experiment_name = 'aerial-classifier'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Run

We are now ready to run the script on the cluster. There are multiple ways to run the job. We are goint to utilize a higher level **Estimator** object and run the script in a docker container. 

In [34]:
from azureml.train.estimator import Estimator

script_params = {
    '--input_data_dir': input_dr,
    '--output_data_dir': output_dr
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=compute_target,
                entry_script='train.py',
                node_count=1,
                process_count_per_node=1,
                use_gpu=True,
                pip_packages=['h5py','pillow','tensorflow-gpu']
                )


In [35]:
run.wait_for_completion(show_output=True)

RunId: aerial-classifier_1539548655517

Execution Summary
RunId: aerial-classifier_1539548655517



{'runId': 'aerial-classifier_1539548655517',
 'target': 'gpudsvm',
 'status': 'Failed',
 'startTimeUtc': '2018-10-14T20:24:18.563276Z',
 'endTimeUtc': '2018-10-14T20:24:45.330709Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': 'fde3f3aa-cba1-4e1f-907c-dc92233db180'},
 'runDefinition': {'Script': 'extract.py',
  'Arguments': ['--input_data_dir',
   '$AZUREML_DATAREFERENCE_input_images',
   '--output_data_dir',
   '$AZUREML_DATAREFERENCE_workspacefilestore'],
  'Framework': 0,
  'Target': 'gpudsvm',
  'DataReferences': {'input_images': {'DataStoreName': 'input_images',
    'Mode': 'Download',
    'PathOnDataStore': None,
    'PathOnCompute': 'input_images',
    'Overwrite': True},
   'workspacefilestore': {'DataStoreName': 'workspacefilestore',
    'Mode': 'Download',
    'PathOnDataStore': 'bottleneck',
    'PathOnCompute': 'workspacefilestore',
    'Overwrite': True}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  '