# Part 1 - Feature Engineering
In this section of the lab we will develop and run a Python script that pre-processes training and validation images into a set of powerful features, referred to in this lab as bottleneck features.

To create bottleneck features we will utilize a pre-trained Deep Learning network that was trained on a general computer vision domain. 

As explained by your instructor, this approach is called Transfer Learning. Transfer Learning is a powerful Machine Learning technique based on the observation that the knowledge gained while solving one problem can be applied to a different (but related) problem.

In the context of an image classification task, a DNN trained on one visual domain can accelerate learning in another visual domain. Although our pre-trained network does not know how to classify aerial land plot images, it knows enough about representing image concepts that, if we use it to pre-process aerial images, the extracted image features can be used to effectively train a relatively simple classifier on a **limited number** of samples.

The diagram below represents the architecture of our solution.

![Transfer Learning](../images/TLArch.png)

We will use **ResNet50** trained on the **ImageNet** dataset to extract features. We will occasionally refer to this component of the solution as a featurizer. The output of the featurizer is a vector of 2048 floating point numbers, each representing a feature extracted from an image.
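
To make the shape of the featurizer output concrete: for a 224 x 224 input, the last convolutional block of ResNet50 produces a 7 x 7 x 2048 feature map, and global average pooling (the `pooling = 'avg'` option in the script) reduces it to one number per channel. The following is only a toy sketch of that pooling step in plain Python; the function name `global_average_pool` and the tiny 2 x 2 x 3 input are illustrative assumptions, not part of the lab code.

```python
def global_average_pool(feature_map):
    """Average a height x width x channels nested list over all spatial positions."""
    # Flatten the two spatial dimensions into a single list of per-position channel vectors
    positions = [pixel for row in feature_map for pixel in row]
    n_channels = len(positions[0])
    # One average per channel
    return [sum(p[c] for p in positions) / len(positions) for c in range(n_channels)]


# Toy 2 x 2 feature map with 3 channels (hypothetical values)
toy_feature_map = [
    [[1.0, 10.0, 0.0], [3.0, 20.0, 0.0]],
    [[5.0, 30.0, 0.0], [7.0, 40.0, 4.0]],
]

pooled = global_average_pool(toy_feature_map)
print(pooled)  # one value per channel
```

With a real 7 x 7 x 2048 feature map the same operation would yield the 2048-element vector described above.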

We will then use the extracted features to train a simple fully connected neural network (FCNN) to classify aerial land plot images. We could use any other classification algorithm (e.g. logistic regression or decision trees), but an FCNN gives us a lot of flexibility in fine-tuning the model.


The Python script generated by the next Jupyter cell processes an input image dataset into an output bottleneck feature set. The script expects the images to be organized in the folder structure below:
```
train/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
valid/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
```
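
The folder names double as class names: the script sorts them and assigns each one an integer label. The sketch below illustrates that mapping on a throwaway copy of the layout. The helper name `build_label_map` is hypothetical, and the `os.path.isdir` filter is an addition that the lab script does not perform.

```python
import os
import tempfile


def build_label_map(img_dir):
    """Map each class sub-folder name to an integer label, in sorted order."""
    folders = sorted(
        name for name in os.listdir(img_dir)
        if os.path.isdir(os.path.join(img_dir, name))  # ignore stray files (assumption)
    )
    return {name: index for index, name in enumerate(folders)}


# Recreate the expected layout in a temporary directory to demonstrate the mapping
classes = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]
with tempfile.TemporaryDirectory() as root:
    train_dir = os.path.join(root, "train")
    for name in classes:
        os.makedirs(os.path.join(train_dir, name))
    label_map = build_label_map(train_dir)

print(label_map)
```

Because the labels depend only on the sorted folder names, `train/` and `valid/` receive the same class-to-label mapping as long as they contain the same sub-folders.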

The location of the input dataset and the location in which to save the output dataset are passed to the script as command line parameters. The output dataset will be stored in HDF5, a binary data format commonly used in Machine Learning and High Performance Computing solutions.
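
The lab script defines these parameters with `tf.app.flags`. As a rough illustration of the same interface, here is a sketch that uses the standard library's `argparse` instead; this is a swapped-in technique, not what the script does. The parameter names and defaults mirror the script, while the function name `parse_args` and the example path `/data/aerial` are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two locations the data-prep script needs."""
    parser = argparse.ArgumentParser(description="Extract bottleneck features")
    parser.add_argument("--input_data_dir", default="aerialsmall",
                        help="Folder with training and validation images")
    parser.add_argument("--output_data_dir", default="bottleneck",
                        help="A folder for saving bottleneck features")
    return parser.parse_args(argv)


# Simulate a command line such as:
#   python extract.py --input_data_dir /data/aerial --output_data_dir ./bottleneck
args = parse_args(["--input_data_dir", "/data/aerial",
                   "--output_data_dir", "./bottleneck"])
print(args.input_data_dir, args.output_data_dir)
```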

The script is designed to work with a large number of images. As such, it does not load all input images into memory at once. Instead, it utilizes a custom Python generator class, `ImageGenerator`, to feed the **ResNet50** featurizer. The class yields batches of images, as NumPy arrays, preprocessed to the format required by **ResNet50**.
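
The batching arithmetic behind this can be sketched without any deep learning library. The example below shows how a list of file paths is split into fixed-size batches, with a smaller final batch when the total is not a multiple of the batch size. The function names and the file names are hypothetical; only the slicing logic follows the script.

```python
def count_batches(n_items, batch_size):
    """Number of batches needed to cover n_items (ceiling division)."""
    return (n_items + batch_size - 1) // batch_size


def batch_slice(items, index, batch_size):
    """Return the items that belong to batch number `index`."""
    return items[index * batch_size:(index + 1) * batch_size]


# 130 hypothetical image paths, split into batches of 64
paths = ["img_{:03d}.png".format(i) for i in range(130)]
n_batches = count_batches(len(paths), 64)
sizes = [len(batch_slice(paths, i, 64)) for i in range(n_batches)]
print(n_batches, sizes)
```

Only the paths for one batch are resolved into image arrays at a time, which is what keeps memory use bounded regardless of dataset size.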

We will not attempt to run the script in a local environment. It is very computationally intensive, and unless you run it in an environment equipped with a powerful GPU it would be very slow. It would be *painfully* slow if you attempted to run it in Azure Notebooks.

Instead we will run the script on a remote Azure GPU VM.

Your instructor will dive into the code in the script and explain key snippets.

### Prepare data pre-processing script

#### Create a folder to hold the script

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

#### Use Jupyter `%%writefile` magic to write the script

In [11]:
%%writefile $script_folder/extract.py

import os
import numpy as np
import random
import h5py

import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.applications.vgg16 import VGG16

from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical

# This is a generator that yields batches of preprocessed images
class ImageGenerator(tf.keras.utils.Sequence):    
    
    def __init__(self, img_dir, preprocess_fn=None, batch_size=64):
        
        # Create the dictionary that maps class names into numeric labels 
        folders = os.listdir(img_dir)
        folders.sort()
        indexes = range(len(folders))
        label_map = {key: value for (key, value) in zip(folders, indexes)}
        self.num_classes = len(label_map)
        
        # Create a list of all images in a root folder with associated numeric labels
        labeled_image_list = [(os.path.join(img_dir, folder, image), label_map[folder]) 
                              for folder in folders 
                              for image in os.listdir(os.path.join(img_dir, folder))
                              ]
        # Shuffle the list
        random.shuffle(labeled_image_list)
        # Set image list and associated label list
        self.image_list, self.label_list = zip(*labeled_image_list) 
        # Set batch size
        self.batch_size = batch_size
       
        # Set the pre-processing function passed as a parameter
        self.preprocess_fn = preprocess_fn
        
        # Set number of batches
        self.n_batches = len(self.image_list) // self.batch_size
        if len(self.image_list) % self.batch_size > 0:
            self.n_batches += 1
            
    def __len__(self):
        
        return self.n_batches
    
    def __getitem__(self, index):
        pathnames = self.image_list[index*self.batch_size:(index+1)*self.batch_size]
        images = self.__load_images(pathnames)
        
        return images
    
    # Load a set of images passed as a parameter into a NumPy array
    def __load_images(self, pathnames):
        images = []
        for pathname in pathnames:
            # target_size is (height, width); the channel count is not part of it
            img = image.load_img(pathname, target_size=(224, 224))
            img = image.img_to_array(img)
            images.append(img)
        images = np.asarray(images)
        if self.preprocess_fn is not None:
            images = self.preprocess_fn(images)   
        
        return images
    
    # Return labels in one-hot encoding
    def get_labels(self):
        
        return to_categorical(np.asarray(self.label_list), self.num_classes)
    


def create_bottleneck_features():
    # Configure input directories
    train_images_dir = os.path.join(FLAGS.input_data_dir, 'train')
    valid_images_dir = os.path.join(FLAGS.input_data_dir, 'valid')

    # Create generators for training and validation data
    #train_generator = ImageGenerator(train_images_dir, resnet50.preprocess_input)
    #valid_generator = ImageGenerator(valid_images_dir, resnet50.preprocess_input)
    
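    # VGG16 is imported as a model constructor, so its preprocessing function
    # is referenced through the tf.keras.applications.vgg16 module
    preprocess_fn = tf.keras.applications.vgg16.preprocess_input
    train_generator = ImageGenerator(train_images_dir, preprocess_fn)
    valid_generator = ImageGenerator(valid_images_dir, preprocess_fn)

    # Create a featurizer
    #featurizer = resnet50.ResNet50(
    #            weights = 'imagenet', 
    #            input_shape=(224,224,3), 
    #            include_top = False,
    #            pooling = 'avg')
    
    featurizer = VGG16(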
                weights = 'imagenet', 
                input_shape=(224,224,3), 
                include_top = False,
                pooling = 'avg')
    

    # Generate training bottleneck features
    print("Generating training bottleneck features")
    features = featurizer.predict_generator(train_generator, verbose=1)
    labels = train_generator.get_labels()
    
    # Save training dataset to HDF5 file
    filename = 'aerial_bottleneck_train_vgg16.h5'
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    print("Saving training features to {}".format(output_file))
    print("   Training features: ", features.shape)
    print("   Training labels: ", labels.shape)
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)

    # Generate validation bottleneck features
    print("Generating validation bottleneck features")
    features = featurizer.predict_generator(valid_generator, verbose=1)
    labels = valid_generator.get_labels()
    
    # Save validation dataset to HDF5 file
    filename = 'aerial_bottleneck_valid_vgg16.h5'
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    print("Saving validation features to {}".format(output_file))
    print("   Validation features: ", features.shape)
    print("   Validation labels: ", labels.shape)
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)
    
    print("Done")

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 64, "Number of images per batch")
tf.app.flags.DEFINE_string('input_data_dir', 'aerialsmall', "Folder with training and validation images")
tf.app.flags.DEFINE_string('output_data_dir', 'bottleneck', "A folder for saving bottleneck features")


def main(argv=None):
    print("Starting")
    print("Reading training data from:", FLAGS.input_data_dir)
    print("Output bottleneck files will be saved to:", FLAGS.output_data_dir)
    os.makedirs(FLAGS.output_data_dir, exist_ok=True)
   
    create_bottleneck_features()
  
if __name__ == '__main__':
    tf.app.run()

Overwriting script/extract.py
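
The `get_labels` method in the script returns the labels in one-hot encoding via `to_categorical`. As a rough, dependency-free sketch of what that encoding looks like for the six land-use classes, consider the following. The function name `one_hot` and the sample labels are illustrative assumptions.

```python
def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1.0 at the label's index."""
    return [
        [1.0 if position == label else 0.0 for position in range(num_classes)]
        for label in labels
    ]


# Three hypothetical samples labeled 0, 2 and 5, out of 6 classes
encoded = one_hot([0, 2, 5], 6)
for row in encoded:
    print(row)
```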


### Prepare Compute Target
Our script is now ready for execution. We will run the script on a remote GPU-equipped Azure Virtual Machine. We will use the Azure ML Python SDK to create and configure the VM, datastores, and run configuration.

The first step is to initialize the AML workspace you created during the lab overview.

#### Initialize AML Workspace

In [12]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config('../aml_config/config.json')
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/demouser/repos/HighPerformanceTensorFlowOnAzure/aml_config/config.json
jkaml
jkaml
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


#### Create Azure DSVM as a compute target

We will use an Azure Data Science Virtual Machine (DSVM) equipped with a Tesla K80 GPU as a compute target. If the VM is already in the workspace, this code uses it and skips the creation process.

In [13]:
from azureml.core.compute import DsvmCompute
from azureml.core.compute_target import ComputeTargetException

compute_target_name = 'gpudsvm'
compute_target_type = 'Standard_NC6'

try:
    dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)
    print('Found existing DSVM:', dsvm_compute.name)
except ComputeTargetException:
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size=compute_target_type)
    dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)
    dsvm_compute.wait_for_completion(show_output=True)

Found existing DSVM: gpudsvm


#### Configure Datastores 
The dataset we will use for training has been uploaded to a public Azure blob storage container. We will register this container as a datastore within our workspace. Before the data prep script runs, the datastore's content (training and validation images) will be copied to the local storage on the DSVM.

The output of the script - bottleneck files - will be pushed to the default datastore that was created automatically when you created your workspace.

In [14]:
from azureml.core import Datastore

images_account = 'azureailabs'
images_container = 'aerialmed'
datastore_name = 'input_images'

try:
    input_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name=datastore_name,
                                            container_name=images_container,
                                            account_name=images_account)
    print('Registered new datastore for input images')
except Exception:
    input_ds = Datastore(ws, datastore_name)
    print('Found existing datastore for input images:', input_ds.name)
   
print(input_ds.name, input_ds.datastore_type, input_ds.account_name, input_ds.container_name)

output_ds = ws.get_default_datastore()
print("Using the default datastore for output: ")
print(output_ds.name, output_ds.datastore_type, output_ds.account_name, output_ds.container_name)


Registering datastore failed with 400 error code and error message
b'{\n  "error": {\n    "code": "UserError",\n    "message": "Another data store with the same name already exists but with different values. Please use patch to update.",\n    "target": null,\n    "details": [],\n    "innerError": null,\n    "debugInfo": {\n      "type": "Microsoft.MachineLearning.Common.WebApi.Exceptions.BadRequestException",\n      "message": "Another data store with the same name already exists but with different values. Please use patch to update.",\n      "stackTrace": "   at Microsoft.MachineLearning.DataStore.EntryPoints.Controllers.DataStoreController.CreateOrUpdate(DataStoreDto dto, Boolean create, Boolean createIfNotExists) in /home/vsts/work/1/s/src/azureml-api/src/DataStore/EntryPoints/Controllers/DataStoreController.cs:line 120\\n   at Microsoft.MachineLearning.DataStore.EntryPoints.Controllers.DataStoreController.Create(DataStoreDto dto, Boolean createIfNotExists) in /home/vsts/work/1/s/sr

Found existing datastore for input images: input_images
input_images AzureBlob azureailabs aerialmed
Using the default datastore for output: 
workspacefilestore AzureFile jkamlstoragekhsgpljj azureml-filestore-381bfd16-8663-49a4-9e97-330161b029bd


#### Create Experiment
An **Experiment** is a logical container in an Azure ML Workspace. It hosts run records, which can include run metrics and output artifacts from your experiments. We will use an **Experiment** to store logs generated by our script.

In [15]:
experiment_name = 'aerial-classifier-dataprep'

from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)

### Run

We are now ready to run the script on the remote DSVM. There are multiple ways to run the job. We are going to utilize a higher-level **Estimator** object and run the script in a Docker container on the remote DSVM.

The call to start the run is asynchronous. It returns a **Preparing** or **Running** state as soon as the job is started.
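
Because the submit call returns immediately, the caller has to check back for the final result. The sketch below simulates that pattern with a plain polling loop. It is not the Azure ML SDK and makes no claim about how the SDK works internally: `wait_until_done`, the `get_status` callable, and the set of terminal state names are assumptions made for illustration.

```python
import time

# Assumed terminal states for this illustration
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_until_done(get_status, poll_seconds=0.0, max_polls=100):
    """Call get_status() repeatedly until a terminal state appears; return all states seen."""
    seen = []
    for _ in range(max_polls):
        status = get_status()
        seen.append(status)
        if status in TERMINAL_STATES:
            return seen
        time.sleep(poll_seconds)
    raise TimeoutError("run did not reach a terminal state within max_polls")


# Simulated status sequence standing in for a remote run
simulated = iter(["Preparing", "Running", "Running", "Finalizing", "Completed"])
history = wait_until_done(lambda: next(simulated))
print(history)
```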

### Monitor a remote run

The first run takes longer. The subsequent runs, as long as the script dependencies don't change, are much faster.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the estimator. In our case, this will be a base GPU image with the latest version of `tensorflow-gpu`, `h5py`, and `pillow`. The image is uploaded to the workspace. This stage happens once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Running**: In this stage, the dataprep script is sent to the DSVM, then the data in the input datastore is copied to the local storage on the DSVM, and then the script is run. While the job is running, stdout and the ./logs directory are streamed to the run history. You can monitor the run's progress using these logs.

- **Post-Processing**: The created bottleneck files are copied to the default datastore. The ./outputs directory on the DSVM is copied over to the run history in your workspace so you can access these results.


You can check the progress of a running job in multiple ways. This lab uses a Jupyter widget as well as the `wait_for_completion` method.

In [16]:
from azureml.train.estimator import Estimator

# Define the location of the dataprep script and the location for the output bottleneck files
script_folder = 'script'
script_name = 'extract.py'
output_dir = './bottleneck'
pip_packages = ['h5py','pillow','tensorflow-gpu']

script_params = {
    '--input_data_dir': input_ds.as_download(),
    '--output_data_dir': output_dir
}

est = Estimator(source_directory=script_folder,
                script_params=script_params,
                compute_target=dsvm_compute,
                entry_script=script_name,
                node_count=1,
                process_count_per_node=1,
                use_gpu=True,
                pip_packages=pip_packages,
                inputs=[output_ds.path(output_dir).as_upload(path_on_compute=output_dir)])
                

In [17]:
tags = {"DNN": "VGG16"}
run = exp.submit(config=est, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-classifier-dataprep,aerial-classifier-dataprep_1540157141770,azureml.scriptrun,Running,Link to Azure Portal,Link to Documentation


Monitor the run using the Jupyter Run widget.

In [18]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

_UserRun(widget_settings={'childWidgetDisplay': 'popup'})

Monitor the run using the `wait_for_completion` method.

In [19]:
run.wait_for_completion(show_output=False)

{'runId': 'aerial-classifier-dataprep_1540157141770',
 'target': 'gpudsvm',
 'status': 'Finalizing',
 'startTimeUtc': '2018-10-21T21:25:44.480345Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '1e46a00d-6dc0-4169-a196-d42d369122b1'},
 'runDefinition': {'Script': 'extract.py',
  'Arguments': ['--input_data_dir',
   '$AZUREML_DATAREFERENCE_input_images',
   '--output_data_dir',
   './bottleneck'],
  'Framework': 0,
  'Target': 'gpudsvm',
  'DataReferences': {'93ef68f2ce0a47ab9ebca75ba4b64199': {'DataStoreName': 'workspacefilestore',
    'Mode': 'Upload',
    'PathOnDataStore': './bottleneck',
    'PathOnCompute': './bottleneck',
    'Overwrite': False},
   'input_images': {'DataStoreName': 'input_images',
    'Mode': 'Download',
    'PathOnDataStore': None,
    'PathOnCompute': None,
    'Overwrite': False}},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'Environment': {'Python': {'InterpreterPath': 'python',
    'Us

## Next Step
The run has completed. The bottleneck files have been copied to the workspace's default datastore. You are ready to move to the next part of the lab in which you are going to train a small fully connected neural network using the bottleneck features.

To proceed, start the `02-train.ipynb` notebook.
