# Part 2 - Feature Engineering
In this lab we will develop and run a Python script that pre-processes training and validation images into a set of powerful features, referred to in the lab as bottleneck features.

To create bottleneck features we will utilize a pre-trained Deep Learning network that was trained on a general computer vision domain. 

As explained by your instructor, this approach is called Transfer Learning. Transfer Learning is a powerful Machine Learning technique based on the observation that the knowledge gained while solving one problem can be applied to a different (but related) problem.

In the context of an image classification task, a DNN trained on one visual domain can accelerate learning in another visual domain. Although our pre-trained network does not know how to classify aerial land plot images, it knows enough about representing image concepts that, if we use it to pre-process aerial images, the extracted image features can be used to effectively train a relatively simple classifier on a **limited number** of samples.

The diagram below represents the architecture of our solution.

![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/tlcl.png)

We will use **ResNet50** trained on the **ImageNet** dataset to extract features. We will occasionally refer to this component of the solution as a featurizer. The output of the featurizer is a vector of 2048 floating point numbers, each representing a feature extracted from an image.
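
To make the 2048-element output concrete: for a 224x224 input, the last convolutional block of ResNet50 produces a 7x7 grid of 2048-channel activations, and global average pooling (the `pooling = 'avg'` option used later in the script) collapses each channel to a single number. The sketch below mimics only that pooling step with NumPy. The array is random stand-in data, not real network output.

```python
import numpy as np

# Stand-in for the final ResNet50 feature maps of a batch of 4 images:
# (batch, height, width, channels). Random values, for illustration only.
rng = np.random.default_rng(0)
feature_maps = rng.random((4, 7, 7, 2048), dtype=np.float32)

# Global average pooling: average over the two spatial axes,
# leaving one value per channel for every image.
bottleneck = feature_maps.mean(axis=(1, 2))

print(bottleneck.shape)  # (4, 2048)
```

Each row of `bottleneck` plays the role of one image's bottleneck feature vector.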

We will then use the extracted features to train a scikit-learn classifier in the next lab.

The Python script processes an input image dataset into an output bottleneck feature set. The script expects the images to be organized in the following folder structure:
```
train/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
valid/
   Barren/
   Cultivated/
   Developed/
   Forest/
   Herbaceous/
   Shrub/
```
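
The script turns this layout into numeric labels by sorting the class folder names and numbering them. The following sketch reproduces that idea on a throwaway directory with one empty placeholder file per class. The temporary directory and the `sample.png` file name are assumptions made for illustration only.

```python
import os
import tempfile

classes = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]

# Build a throwaway copy of the train/ layout with one placeholder file per class.
root = tempfile.mkdtemp()
train_dir = os.path.join(root, "train")
for name in classes:
    os.makedirs(os.path.join(train_dir, name))
    open(os.path.join(train_dir, name, "sample.png"), "w").close()

# Sorted folder names become the numeric labels 0..5.
folders = sorted(os.listdir(train_dir))
label_map = {folder: index for index, folder in enumerate(folders)}

# Pair every file with the label of its parent folder.
labeled_files = [
    (os.path.join(train_dir, folder, fname), label_map[folder])
    for folder in folders
    for fname in sorted(os.listdir(os.path.join(train_dir, folder)))
]

print(label_map)
```

Because the labels depend only on the sorted folder names, `train/` and `valid/` receive the same mapping as long as they contain the same class folders.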

The location of the input dataset and the location where the output dataset should be saved are passed to the script as command line parameters. The output dataset will be stored in HDF5, a binary data format commonly used in Machine Learning and High Performance Computing solutions.
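
As a rough illustration of the output format, the sketch below writes and then reads back an HDF5 file with `h5py`, using the same two dataset names (`features` and `labels`) as the script. The file name and the placeholder arrays are assumptions, not real bottleneck features.

```python
import os
import tempfile

import h5py
import numpy as np

# Placeholder data shaped like the script's output: 10 samples, 2048 features each.
features = np.zeros((10, 2048), dtype=np.float32)
labels = np.arange(10) % 6  # integer class labels in the range 0..5

# Hypothetical file name in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "example_bottleneck.h5")

# Write the two datasets under the names the script uses.
with h5py.File(path, "w") as hfile:
    hfile.create_dataset("features", data=features)
    hfile.create_dataset("labels", data=labels)

# Read them back as NumPy arrays.
with h5py.File(path, "r") as hfile:
    loaded_features = hfile["features"][:]
    loaded_labels = hfile["labels"][:]

print(loaded_features.shape, loaded_labels.shape)  # (10, 2048) (10,)
```

The same read pattern should apply to the files produced by the script, since they contain datasets with these two names.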

The script is designed to work with a large number of images. As such, it does not load all input images into memory at once. Instead, it uses a custom Python generator class, `ImageGenerator`, to feed the featurizer. The class yields batches of images, as NumPy arrays, preprocessed to the format required by **ResNet50**.
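
The batching arithmetic behind this can be sketched without any image loading. The example below uses 150 made-up file names and a batch size of 64, and follows the same slicing scheme as the script: full batches first, then one partial batch for the remainder.

```python
# 150 hypothetical image paths and a batch size of 64.
paths = ["img_{:03d}.png".format(i) for i in range(150)]
batch_size = 64

# Full batches, plus one partial batch if there is a remainder.
n_batches = len(paths) // batch_size
if len(paths) % batch_size > 0:
    n_batches += 1


def get_batch(index):
    """Return the slice of paths that belongs to batch `index`."""
    return paths[index * batch_size:(index + 1) * batch_size]


sizes = [len(get_batch(i)) for i in range(n_batches)]
print(n_batches, sizes)  # 3 [64, 64, 22]
```

Only one batch of images has to be held in memory at a time, which is what keeps the memory footprint independent of the dataset size.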

We will not attempt to run the script on the full dataset in a local environment. It is very computationally intensive and, unless you run it in an environment equipped with a powerful GPU, it would be very slow.

However, we will demonstrate how to run the script locally using the same small development dataset we used in the previous lab. Running the script locally under the control of Azure ML is very useful during script development and debugging.

To process the full dataset we will send the debugged script to a remote Azure GPU VM.

Your instructor will dive into the code in the script and explain key snippets.

### Prepare data pre-processing script

#### Create a folder to hold the script

In [1]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

#### Use Jupyter `%%writefile` magic to write the script

In [2]:
%%writefile $script_folder/extract.py

import os
import numpy as np
import random
import h5py

import tensorflow as tf
from tensorflow.keras.applications import resnet50

from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical

# This is a generator that yields batches of preprocessed images
class ImageGenerator(tf.keras.utils.Sequence):    
    
    def __init__(self, img_dir, preprocess_fn=None, batch_size=64):
        
        # Create the dictionary that maps class names into numeric labels 
        folders = os.listdir(img_dir)
        folders.sort()
        indexes = range(len(folders))
        label_map = {key: value for (key, value) in zip(folders, indexes)}
        self.num_classes = len(label_map)
        
        # Create a list of all images in a root folder with associated numeric labels
        labeled_image_list = [(os.path.join(img_dir, folder, image), label_map[folder]) 
                              for folder in folders 
                              for image in os.listdir(os.path.join(img_dir, folder))
                              ]
        # Shuffle the list
        random.shuffle(labeled_image_list)
        # Set image list and associated label list
        self.image_list, self.label_list = zip(*labeled_image_list) 
        # Set batch size
        self.batch_size = batch_size
       
        # Set the pre-processing function passed as a parameter
        self.preprocess_fn = preprocess_fn
        
        # Set number of batches
        self.n_batches = len(self.image_list) // self.batch_size
        if len(self.image_list) % self.batch_size > 0:
            self.n_batches += 1
            
    def __len__(self):
        
        return self.n_batches
    
    def __getitem__(self, index):
        pathnames = self.image_list[index*self.batch_size:(index+1)*self.batch_size]
        images = self.__load_images(pathnames)
        
        return images
    
    # Load a set of images passed as a parameter into a NumPy array
    def __load_images(self, pathnames):
        images = []
        for pathname in pathnames:
            # load_img expects target_size as (height, width)
            img = image.load_img(pathname, target_size=(224, 224))
            img = image.img_to_array(img)
            images.append(img)
        images = np.asarray(images)
        if self.preprocess_fn is not None:
            images = self.preprocess_fn(images)
        
        return images
    
    # Return integer class labels (the one-hot variant is commented out below)
    def get_labels(self):
        
        #return to_categorical(np.asarray(self.label_list), self.num_classes)
        return np.asarray(self.label_list)


def create_bottleneck_features():
    # Configure input directories
    train_images_dir = os.path.join(FLAGS.input_data_dir, 'train')
    #valid_images_dir = os.path.join(FLAGS.input_data_dir, 'valid')

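    # Pass the batch size from the command line flag instead of relying on the class default
    train_generator = ImageGenerator(train_images_dir, resnet50.preprocess_input, FLAGS.batch_size)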
    #valid_generator = ImageGenerator(valid_images_dir, resnet50.preprocess_input)
    
    featurizer = resnet50.ResNet50(
                weights = 'imagenet', 
                input_shape=(224,224,3), 
                include_top = False,
                pooling = 'avg')
    

    # Generate training bottleneck features
    print("Generating training bottleneck features")
    features = featurizer.predict_generator(train_generator, verbose=1)
    labels = train_generator.get_labels()
    
    # Save training dataset to HDF5 file
    filename = FLAGS.training_file_name
    output_file = os.path.join(FLAGS.output_data_dir, filename)
    print("Saving training features to {}".format(output_file))
    print("   Training features: ", features.shape)
    print("   Training labels: ", labels.shape)
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)

    # Generate validation bottleneck features
    #print("Generating validation bottleneck features")
    #features = featurizer.predict_generator(valid_generator, verbose=1)
    #labels = valid_generator.get_labels()
    
    # Save validation dataset to HDF5 file
    #filename = FLAGS.validation_file_name
    #output_file = os.path.join(FLAGS.output_data_dir, filename)
    #print("Saving validation features to {}".format(output_file))
    #print("   Validation features: ", features.shape)
    #print("   Validation labels: ", labels.shape)
    #with h5py.File(output_file, "w") as hfile:
    #    features_dset = hfile.create_dataset('features', data=features)
    #    labels_dset = hfile.create_dataset('labels', data=labels)
    
    #print("Done")

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 64, "Number of images per batch")
tf.app.flags.DEFINE_string('input_data_dir', 'aerialtiny', "Folder with training and validation images")
tf.app.flags.DEFINE_string('output_data_dir', 'bottleneck', "A folder for saving bottleneck features")
tf.app.flags.DEFINE_string('training_file_name', 'aerial_bottleneck_train_resnet50.h5', "Name of output training file")
#tf.app.flags.DEFINE_string('validation_file_name', 'aerial_bottleneck_valid_resnet50.h5', "Name of output validation file")

def main(argv=None):
    print("Starting")
    print("Reading training data from:", FLAGS.input_data_dir)
    print("Output bottleneck files will be saved to:", FLAGS.output_data_dir)

    os.makedirs(FLAGS.output_data_dir, exist_ok=True)

    create_bottleneck_features()
  
if __name__ == '__main__':
    tf.app.run()

Overwriting ./script/extract.py
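
The script reads its parameters through `tf.app.flags`, which is a TensorFlow 1.x API. If you ever need the same command line interface without that dependency, the standard library `argparse` module can express it. The sketch below is an alternative and not what the lab script does. It reuses the script's flag names and defaults, and `parse_args` is a hypothetical helper name.

```python
import argparse


def parse_args(argv=None):
    """Parse the same flags that extract.py defines with tf.app.flags."""
    parser = argparse.ArgumentParser(description="Create bottleneck features")
    parser.add_argument("--batch_size", type=int, default=64,
                        help="Number of images per batch")
    parser.add_argument("--input_data_dir", default="aerialtiny",
                        help="Folder with training and validation images")
    parser.add_argument("--output_data_dir", default="bottleneck",
                        help="A folder for saving bottleneck features")
    parser.add_argument("--training_file_name",
                        default="aerial_bottleneck_train_resnet50.h5",
                        help="Name of output training file")
    return parser.parse_args(argv)


# Example: the arguments used for the local run in this lab.
args = parse_args(["--input_data_dir", "/tmp/aerialtiny",
                   "--output_data_dir", "bottleneck"])
print(args.input_data_dir, args.output_data_dir, args.batch_size)
```

Flags that are not supplied on the command line fall back to their defaults, which is also how the `tf.app.flags` definitions in the script behave.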


### Run the script locally

As noted in the introduction, we will first run the script locally.

#### Initialize AML Workspace

In [3]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config('../aml_config/config.json')
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /home/demouser/repos/MTC_AzureAILabs/DataScienceTrack/01-aml-walkthrough-sklearn/aml_config/config.json
jkamllab
jkamllab
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


#### Create an Experiment
We will track local runs in a dedicated Experiment.

In [4]:
from azureml.core import Experiment
experiment_name = 'aerial-feature-engineering'
exp = Experiment(workspace=ws, name=experiment_name)

#### Configure Run environment
We will use a user-managed run, which means we assume that all the necessary packages are already available in the Python environment selected to run the script. In our case this is true, as we pre-installed all the dependencies during the lab setup. Alternatively, you can execute a local run in a system-managed environment. In that case, AML builds a new conda environment and executes the script in it.

In [5]:
from azureml.core.runconfig import RunConfiguration

run_config = RunConfiguration()
run_config.environment.python.user_managed_dependencies = True
#run_config.environment.python.interpreter_path = '/anaconda/envs/py36/bin/python' # AML env
run_config.environment.python.interpreter_path = '/home/demouser/anaconda3/envs/aml/bin/python'


Run the script. Note that we need to supply an absolute path to the folder with training and validation images. 

In [6]:
from azureml.core import ScriptRunConfig

src = ScriptRunConfig(
    source_directory='./script',
    script='extract.py',
    run_config=run_config,
    arguments=['--input_data_dir', '/tmp/aerialtiny',
               '--output_data_dir', 'bottleneck'] )

tags = {"Compute target": "Local", "DNN": "ResNet50"}
run = exp.submit(src, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-feature-engineering,aerial-feature-engineering_1541173516_04b87551,azureml.scriptrun,Running,Link to Azure Portal,Link to Documentation


Block until the run finishes and stream the output.

In [7]:
run.wait_for_completion(show_output=True)

RunId: aerial-feature-engineering_1541173516_04b87551

Streaming azureml-logs/60_control_log.txt

Streaming log file azureml-logs/60_control_log.txt
Running: ['/home/demouser/anaconda3/envs/aml/bin/python', 'azureml-setup/run_script.py', '/home/demouser/anaconda3/envs/aml/bin/python', 'azureml-setup/context_manager_injector.py', '-i', 'ProjectPythonPath:context_managers.ProjectPythonPath', '-i', 'OutputCollection:context_managers.RunHistory', 'extract.py', '--input_data_dir', '/tmp/aerialtiny', '--output_data_dir', 'bottleneck']
Logging experiment running status in history service.
Streaming log file azureml-logs/80_driver_log.txt

Streaming azureml-logs/80_driver_log.txt

Starting
Reading training data from: /tmp/aerialtiny
Output bottleneck files will be saved to: bottleneck
2018-11-02 15:45:33.148338: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-11-02 15:45:36.669586: I tensorfl

{'runId': 'aerial-feature-engineering_1541173516_04b87551',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2018-11-02T15:45:16.885403Z',
 'properties': {'azureml.runsource': 'experiment',
  'ContentSnapshotId': '23163768-4467-4dec-84a4-3aa706d29dfa'},
 'runDefinition': {'Script': 'extract.py',
  'Arguments': ['--input_data_dir',
   '/tmp/aerialtiny',
   '--output_data_dir',
   'bottleneck'],
  'SourceDirectoryDataStore': None,
  'Framework': 0,
  'Target': 'local',
  'DataReferences': {},
  'JobName': None,
  'AutoPrepareEnvironment': True,
  'MaxRunDurationSeconds': None,
  'Environment': {'Python': {'InterpreterPath': '/home/demouser/anaconda3/envs/aml/bin/python',
    'UserManagedDependencies': True,
    'CondaDependencies': {'name': 'project_environment',
     'dependencies': ['python=3.6.2', {'pip': ['azureml-defaults']}]},
    'CondaDependenciesFile': None},
   'EnvironmentVariables': {'EXAMPLE_ENV_VAR': 'EXAMPLE_VALUE'},
   'Docker': {'BaseImage': 'mcr.microsoft.

The logs from the run have been pushed to AML Experiment.

In [8]:
print(run.get_file_names())

['azureml-logs/60_control_log.txt', 'azureml-logs/80_driver_log.txt', 'driver_log', 'azureml-logs/azureml.log']


The bottleneck files can be found in a local folder associated with the run.

In [9]:
%env RUN_ID=$run.id

env: RUN_ID=aerial-feature-engineering_1541173516_04b87551


In [10]:
%%sh

ls /tmp/azureml_runs/$RUN_ID/

aml_config
assets
azureml-logs
azureml-setup
bottleneck
extract.py
invocation.zip
outputs


### Run the script on a remote GPU VM

As you can see, even on a really small dataset the processing is very slow. In the next step, you will run the script on the full dataset using a remote GPU-equipped VM.

#### Create Azure DSVM as a compute target

We will use a *Standard_NC6* VM equipped with a Tesla K80 GPU as the compute target. If the VM is already in the workspace, this code uses it and skips the creation process.

In [None]:
from azureml.core.compute import DsvmCompute
from azureml.core.compute_target import ComputeTargetException

compute_target_name = 'gpudsvm'
compute_target_type = 'Standard_NC6'

try:
    dsvm_compute = DsvmCompute(workspace=ws, name=compute_target_name)
    print('Found existing DSVM:', dsvm_compute.name)
except ComputeTargetException:
    dsvm_config = DsvmCompute.provisioning_configuration(vm_size=compute_target_type)
    dsvm_compute = DsvmCompute.create(ws, name=compute_target_name, provisioning_configuration=dsvm_config)
    dsvm_compute.wait_for_completion(show_output=True)

Creating........................

#### Configure Datastores 
The dataset we will use for training has been uploaded to a public Azure blob storage container. We will register this container as a datastore within our workspace. Before the data prep script runs, the datastore's content (training and validation images) will be copied to local storage on the DSVM.



In [None]:
from azureml.core import Datastore

images_account = 'azureailabs'
images_container = 'aerialmed'
datastore_name = 'input_images'

# Check if the datastore exists. If not create a new one
try:
    input_ds = Datastore.get(ws, datastore_name)
    print('Found existing datastore for input images:', input_ds.name)
except Exception:
    print('Creating new datastore for input images')
    input_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name=datastore_name,
                                            container_name=images_container,
                                            account_name=images_account)

 
   
print(input_ds.name, input_ds.datastore_type, input_ds.account_name, input_ds.container_name)

output_ds = ws.get_default_datastore()
print("Using the default datastore for output: ")
print(output_ds.name, output_ds.datastore_type, output_ds.account_name, output_ds.container_name)


#### Start and monitor a remote run

We will run the script in a new Conda environment that AML creates automatically and configures with the specified dependencies.

The first run takes longer. The subsequent runs, as long as the script dependencies don't change, are much faster.

You can check the progress of a running job in multiple ways. We will use AML Jupyter widgets.

In [None]:
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DataReferenceConfiguration
from azureml.core.runconfig import RunConfiguration
from azureml.core import ScriptRunConfig

script_folder = 'script'
script_name = 'extract.py'
output_dir = 'bottleneck'
input_dir = 'aerial'

# create a new RunConfig object
conda_run_config = RunConfiguration(framework="python")

# Set compute target to DSVM
conda_run_config.target = dsvm_compute.name

# specify CondaDependencies obj
conda_packages = ['scikit-image', 'h5py', 'tensorflow-gpu']
conda_run_config.environment.python.conda_dependencies = \
    CondaDependencies.create(conda_packages=conda_packages)
    

# configure data references
input_dr = DataReferenceConfiguration(datastore_name=input_ds.name, 
                   path_on_compute=input_dir,                   
                   mode='download', # download files from datastore to compute target
                   overwrite=True)

output_dr = DataReferenceConfiguration(datastore_name=output_ds.name, 
                   path_on_datastore=output_dir, 
                   path_on_compute=output_dir,
                   mode='upload', # upload files from the compute to datastore
                   overwrite=True)

conda_run_config.data_references = {input_ds.name: input_dr, output_ds.name: output_dr}
    

# Specify command line arguments
arguments = ['--input_data_dir', str(input_ds.as_download()),
             '--output_data_dir', output_dir]


# Configure the script 
src = ScriptRunConfig(source_directory=script_folder, 
                      script=script_name, 
                      run_config=conda_run_config, 
                      arguments=arguments 
                     ) 


Submit the run and start the RunDetails widget.

In [None]:
from azureml.train.widgets import RunDetails

tags = {"Compute target": "DSVM", "DNN": "ResNet50"}
run = exp.submit(src, tags=tags)

RunDetails(run).show()

Block until the run finishes. The output is not streamed here (`show_output=False`); use the RunDetails widget above to follow progress.

In [None]:
run.wait_for_completion(show_output=False)

After the run completes, AML copies the output bottleneck files to the default datastore. You can verify this in the Azure Portal.

## Clean up resources

Before you move to the next step, you can delete the GPU VM. We will not need it anymore.

In [None]:
dsvm_compute.delete()

## Next Step
The run has completed. You are ready to move to the next part of the lab in which you are going to train a multinomial classification model using the bottleneck features.