# Developing a Custom Image Classification Model
## Azure Machine Learning Pipelines version

In this lab, you will develop a custom image classification model to automatically classify the type of land shown in aerial images of 224-meter x 224-meter plots. Land use classification models can be used to track urbanization, deforestation, loss of wetlands, and other major environmental trends using periodically collected aerial imagery. The images used in this lab are based on imagery from the U.S. National Land Cover Database, which defines six primary classes of land use: *Developed*, *Barren*, *Forested*, *Grassland*, *Shrub*, and *Cultivated*. Example images from each land use class are shown here:

Developed | Cultivated | Barren
--------- | ------ | ----------
![Developed](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/developed1.png) | ![Cultivated](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/cultivated1.png) | ![Barren](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/barren1.png)

Forested | Grassland | Shrub
---------| ----------| -----
![Forested](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/forest1.png) | ![Grassland](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/grassland1.png) | ![Shrub](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/shrub1.png)

You will employ a machine learning technique called transfer learning. Transfer learning is one of the fastest ways, in both coding effort and run time, to start using deep learning. It allows knowledge gained while solving one problem to be reused on a different but related problem. For example, knowledge gained while learning to recognize landmarks and landscapes could apply when trying to recognize aerial land plots. Transfer learning makes it feasible to train very effective ML models on relatively small training data sets.

The primary goal of this lab is to understand how to use Azure ML to orchestrate deep learning workflows rather than to dive into deep learning techniques. If you want to better understand the approach used in the lab, ask the instructor.

You will start by pre-processing training images into a set of powerful features - sometimes referred to as bottleneck features.

To create bottleneck features you will utilize a pre-trained Deep Learning network that was trained on a general computer vision domain. 

Although the pre-trained network does not know how to classify aerial land plot images, it knows enough about representing image concepts that, if we use it to pre-process aerial images, the extracted image features can be used to effectively train a relatively simple classifier on a **limited number** of samples.

The diagram below represents the architecture of our solution.

![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/TLArch.png)

We will use **ResNet50** trained on the **ImageNet** dataset to extract features. We will occasionally refer to this component of the solution as a featurizer. The output of the featurizer is a vector of 2048 floating point numbers, each representing a feature extracted from an image.

We will then use the extracted features to train a simple fully connected neural network (the top) that will perform the final image classification.
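
To get a feel for how small the top network is, the following sketch counts its trainable parameters. It assumes the layout used by the training script later in this lab (a 2048-value input, one hidden layer of 512 units, six output classes). `dense_params` is a helper name introduced here for illustration only.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


FEATURES = 2048  # length of the featurizer output vector
HIDDEN = 512     # assumed hidden layer width (default in the lab's train.py)
CLASSES = 6      # number of land use classes

hidden_layer = dense_params(FEATURES, HIDDEN)
output_layer = dense_params(HIDDEN, CLASSES)
total = hidden_layer + output_layer

print("hidden layer:", hidden_layer)
print("output layer:", output_layer)
print("total:", total)
```

Only these parameters are trained in this lab; the featurizer's weights are used as-is.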

To orchestrate this workflow, we will use *Azure Machine Learning Pipelines*. Our pipeline will have 3 steps:

1. Copy training images from a public Azure Blob Storage container to the default storage in AML Workspace
2. Extract features 
3. Train the top network

**NOTE: Please make sure to update Azure ML SDK to the latest version.**

## Connect to AML Workspace

In [None]:
# Check core SDK version number
import azureml.core
print("SDK version:", azureml.core.VERSION)

In [None]:
import azureml.core
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

## Create Azure ML Managed Compute

To run the lab's scripts, we will use Azure ML managed compute resources: specifically, an autoscale cluster of *Standard_NC6* VMs (each equipped with a Tesla K80 GPU).

In [None]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os


# choose a name for your cluster
compute_name = os.environ.get("AML_COMPUTE_CLUSTER_NAME", "gpu-cluster")
# environment variables are strings, so convert the node counts to integers
compute_min_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MIN_NODES", 1))
compute_max_nodes = int(os.environ.get("AML_COMPUTE_CLUSTER_MAX_NODES", 4))

vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_NC6")
#vm_size = os.environ.get("AML_COMPUTE_CLUSTER_SKU", "STANDARD_ND6s")

if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('found compute target. just use it. ' + compute_name)
else:
    print('creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size,
                                                                min_nodes = compute_min_nodes, 
                                                                max_nodes = compute_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)

    # can poll for a minimum number of nodes and for a specific timeout. 
    # if no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current AmlCompute status, use the 'status' property
    print(compute_target.status.serialize())
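
Values read from environment variables are always strings, while node counts are integers. The sketch below shows one way to read such settings with a typed fallback. `env_int` is a hypothetical helper, not part of the Azure ML SDK, and the override of `AML_COMPUTE_CLUSTER_MAX_NODES` is simulated here only for demonstration.

```python
import os


def env_int(name, default):
    """Read an integer setting from the environment, falling back to a default."""
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


# Simulate one overridden setting and one that is left unset.
os.environ["AML_COMPUTE_CLUSTER_MAX_NODES"] = "8"
os.environ.pop("AML_COMPUTE_CLUSTER_MIN_NODES", None)

min_nodes = env_int("AML_COMPUTE_CLUSTER_MIN_NODES", 1)
max_nodes = env_int("AML_COMPUTE_CLUSTER_MAX_NODES", 4)
print(min_nodes, max_nodes)
```

A non-numeric value raises `ValueError` at this point, which surfaces a misconfiguration before any cluster is requested.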

## Data Copy Step

We will use AML's built-in **DataTransferStep** to copy training images from the remote container to the default storage in the Workspace.

A **Step** is a unit of execution in Azure ML Pipelines. A step typically needs a target of execution (compute target) and a script to execute, may require script arguments and inputs, and can produce outputs. A step can also take a number of other parameters. Azure Machine Learning Pipelines provides the following built-in Steps:

- PythonScriptStep: Add a step to run a Python script in a Pipeline.
- AdlaStep: Adds a step to run U-SQL script using Azure Data Lake Analytics.
- DataTransferStep: Transfers data between Azure Blob and Data Lake accounts.
- DatabricksStep: Adds a Databricks notebook as a step in a Pipeline.
- HyperDriveStep: Creates a HyperDrive step for hyperparameter tuning in a Pipeline.

**DataTransferStep** is used to transfer data between Azure Blob, Azure Data Lake Store, and Azure SQL database.

To configure **DataTransferStep** you need to provide:

- **name:** Name of module
- **source_data_reference:** Input connection that serves as source of data transfer operation.
- **destination_data_reference:** Input connection that serves as destination of data transfer operation.
- **compute_target:** Azure Data Factory to use for transferring data.
- **allow_reuse:** Whether the step should reuse results of previous DataTransferStep when run with same inputs. Set as False to force data to be transferred again.

The following optional arguments explicitly specify whether a path corresponds to a file or a directory. They are useful when the storage contains both a file and a directory with the same name, or when creating a new destination path.

- **source_reference_type:** An optional string specifying the type of source_data_reference. Possible values include: 'file', 'directory'. When not specified, we use the type of existing path or directory if it's a new path.
- **destination_reference_type:** An optional string specifying the type of destination_data_reference. Possible values include: 'file', 'directory'. When not specified, we use the type of existing path or directory if it's a new path.

The **DataTransferStep** utilizes Azure Data Factory as a processing engine.

### Register input Datastore

This is where the training images are located.

In [None]:
from azureml.core import Datastore

# Create input data store
images_account = 'azureailabs'
images_container = 'aerialsmall'
datastore_name = 'input_data'
SAS_TOKEN="?sv=2018-03-28&ss=b&srt=co&sp=rdl&se=2019-12-31T08:42:04Z&st=2019-02-07T00:42:04Z&spr=https&sig=a1PW3b6%2FNvWWBo3m8luTQkdbyj%2FZW%2FBJDR9RFpjUf%2BQ%3D"

# Check if the datastore exists. If not create a new one
try:
    input_ds = Datastore.get(ws, datastore_name)
    print('Found existing datastore for input images:', input_ds.name)
except Exception:
    input_ds = Datastore.register_azure_blob_container(workspace=ws, datastore_name=datastore_name,
                                            container_name=images_container,
                                            account_name=images_account,
                                            sas_token=SAS_TOKEN)
    print('Creating new datastore for input images')
    
print(input_ds.name, input_ds.datastore_type, input_ds.account_name, input_ds.container_name)
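
The datastore above is registered with a shared access signature (SAS) token, which stops working after its expiry time. In a SAS query string, the `se` field carries the expiry time and `sp` the granted permissions. The sketch below parses those fields using only the standard library. The token shown is a placeholder with a made-up signature, and `sas_expiry` and `sas_is_expired` are illustrative helper names.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def sas_expiry(sas_token):
    """Return the expiry time (the 'se' field) of a SAS token as a UTC datetime."""
    fields = parse_qs(sas_token.lstrip("?"))
    expiry = datetime.strptime(fields["se"][0], "%Y-%m-%dT%H:%M:%SZ")
    return expiry.replace(tzinfo=timezone.utc)


def sas_is_expired(sas_token, now=None):
    """Return True if the token's expiry time has already passed."""
    now = now or datetime.now(timezone.utc)
    return now >= sas_expiry(sas_token)


# Placeholder token for demonstration; the signature is not real.
token = "?sv=2018-03-28&sp=rdl&se=2019-12-31T08:42:04Z&sig=PLACEHOLDER"

print(sas_expiry(token).isoformat())
print(sas_is_expired(token, now=datetime(2019, 6, 1, tzinfo=timezone.utc)))
print(sas_is_expired(token, now=datetime(2020, 1, 1, tzinfo=timezone.utc)))
```

If the data copy fails with an authorization error, checking the expiry of the token in use is a reasonable first step.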

### Get the default Datastore

In [None]:
default_ds = Datastore(ws, "workspaceblobstore")
print("Blobstore's name: {}".format(default_ds.name))

### Create DataReferences

Data references point to specific locations within datastores.

In [None]:
from azureml.data.data_reference import DataReference

remote_data_ref = DataReference(
    datastore=input_ds,
    data_reference_name="remote_container",
    path_on_datastore=None)


In [None]:
images_data_ref = DataReference(
    datastore=default_ds,
    data_reference_name="input_images",
    path_on_datastore="input_images")


### Set up Data Factory Account

In [None]:
from azureml.core.compute import DataFactoryCompute
from azureml.exceptions import ComputeTargetException


data_factory_name = 'jkmlwsadf'

def get_or_create_data_factory(workspace, factory_name):
    try:
        return DataFactoryCompute(workspace, factory_name)
    except ComputeTargetException as e:
        if 'ComputeTargetNotFound' in e.message:
            print('Data factory not found, creating...')
            provisioning_config = DataFactoryCompute.provisioning_configuration()
            data_factory = ComputeTarget.create(workspace, factory_name, provisioning_config)
            data_factory.wait_for_completion()
            return data_factory
        else:
            raise e
            
data_factory_compute = get_or_create_data_factory(ws, data_factory_name)

print("setup data factory account complete")

### Create DataTransferStep



In [None]:
from azureml.pipeline.steps import DataTransferStep

transfer_images_step = DataTransferStep(
    name="transfer_images",
    source_data_reference=remote_data_ref,
    destination_data_reference=images_data_ref,
    source_reference_type='directory',
    destination_reference_type='directory',
    allow_reuse=False,
    compute_target=data_factory_compute)

print("data transfer step created")

**Note**: At this point, there does not seem to be a way (at least not a well-documented one) to sequentially link a DataTransferStep with a PythonScriptStep in the same Pipeline. As such, we will execute this step in a separate Pipeline.

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[transfer_images_step])
pipeline.validate()

In [None]:
from azureml.core import Experiment
from azureml.widgets import RunDetails

experiment_name = 'aerial-transfer-pipeline'
exp = Experiment(workspace=ws, name=experiment_name)

pipeline_run = exp.submit(pipeline)
RunDetails(pipeline_run).show()

#### Clean up

In [None]:
data_factory_compute.delete()

## Feature Extraction Step

As noted in the introduction, we will use a pretrained **ResNet50** convolutional neural network (CNN) as a featurizer. This CNN was trained on the *ImageNet* dataset.

We will encapsulate feature extraction as an Azure Machine Learning Pipeline *PythonScriptStep*.

The following code will create the *PythonScriptStep* to be executed in the Azure Machine Learning Compute using code in the `extract.py` script.

The script processes an input image dataset into an output set of bottleneck features. The script expects the images to be organized in the folder structure below:
```
Barren/
Cultivated/
Developed/
Forest/
Herbaceous/
Shrub/
```
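
Before submitting a run, it can be worth confirming that an image folder really contains one subfolder per class, because a missing folder would otherwise only show up on the remote compute. The following sketch does this check with the standard library. `missing_class_folders` is a hypothetical helper, and the demonstration uses a temporary directory rather than the lab's data.

```python
from pathlib import Path
import tempfile

CLASSES = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]


def missing_class_folders(split_dir, classes=CLASSES):
    """Return the class names that have no subfolder under split_dir."""
    split_dir = Path(split_dir)
    return [name for name in classes if not (split_dir / name).is_dir()]


# Demonstration on a throwaway directory with one folder left out on purpose.
with tempfile.TemporaryDirectory() as root:
    for name in CLASSES[:-1]:
        (Path(root) / name).mkdir()
    missing = missing_class_folders(root)

print(missing)
```

The lab's script reads a `train` and a `valid` subfolder, so the same check would apply to each of them.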

The location of the input dataset and the location where to save the output dataset are passed to the script as command line parameters. The output dataset will be stored in a binary HDF5 data format used commonly in Machine Learning and High Performance Computing solutions.
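
The `extract.py` script in this lab reads these parameters with `tf.app.flags`, which is a TensorFlow 1.x API. As a rough, framework-independent sketch of the same command-line contract, the standard library's `argparse` could parse the two locations. The flag names and defaults mirror the lab's script. `parse_args` is a hypothetical helper and the paths in the demonstration are made up.

```python
import argparse


def parse_args(argv=None):
    """Parse the input and output locations passed on the command line."""
    parser = argparse.ArgumentParser(description="Extract bottleneck features")
    parser.add_argument("--data_folder", default="aerialsmall",
                        help="Folder with training and validation images")
    parser.add_argument("--output_data_dir", default="bottleneck",
                        help="Folder for saving bottleneck features")
    return parser.parse_args(argv)


# Made-up paths standing in for the locations a pipeline step would pass.
args = parse_args(["--data_folder", "/mnt/input_images",
                   "--output_data_dir", "/mnt/bottleneck_features"])
print(args.data_folder)
print(args.output_data_dir)
```

When the script runs as a pipeline step, the actual paths are supplied through the step's `arguments` list.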


### Create a feature extraction script

In [None]:
import os
script_folder = './script'
os.makedirs(script_folder, exist_ok=True)

In [None]:
%%writefile $script_folder/extract.py

import os
import numpy as np
import random
import h5py

import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical


def create_bottleneck_features(data_dir, output_file, featurizer):
    
    # A hack to mitigate a bug in TF.Keras 1.12
    def preprocess_input_new(x):
        img = resnet50.preprocess_input(image.img_to_array(x))
        return image.array_to_img(img)
    
    # Create a Keras generator to read and pre-process images. The generator
    # will return batches of numpy arrays representing pre-processed images
    batchsize=64
    classes = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]
    datagen = ImageDataGenerator(preprocessing_function=preprocess_input_new)
    generator = datagen.flow_from_directory(
        directory=data_dir,
        target_size=(224, 224),
        classes=classes,
        batch_size=batchsize)

    # Generate bottleneck features
    # Due to the bug in Tensorflow 1.12 we cannot use predict_generator 
    # Instead we are invoking model.predict in an explicit loop
    features = []
    labels = []
    batches = len(generator)
    for i in range(batches):
        image_batch, label_batch = next(generator)
        features.extend(featurizer.predict(image_batch, batch_size=batchsize))
        labels.extend(label_batch)
        print("Processed batch: {} out of {}".format(i+1, batches))
        
    features = np.asarray(features)
    labels = np.asarray(labels)   
 
    # Save dataset to HDF5 file
    print("Saving features to {}".format(output_file))
    print("   Features: ", features.shape)
    print("   Labels: ", labels.shape)
    with h5py.File(output_file, "w") as hfile:
        features_dset = hfile.create_dataset('features', data=features)
        labels_dset = hfile.create_dataset('labels', data=labels)


FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 64, "Number of images per batch")
tf.app.flags.DEFINE_string('data_folder', 'aerialsmall', "Folder with training and validation images")
tf.app.flags.DEFINE_string('output_data_dir', 'bottleneck', "A folder for saving bottleneck features")
tf.app.flags.DEFINE_string('output_file_suffix', 'aerial_bottleneck_keras.h5', "Filename template for output features")


def main(argv=None):
    print("Starting")
    print("Reading training data from:", FLAGS.data_folder)
    print("Output bottleneck files will be saved to:", FLAGS.output_data_dir)
    os.makedirs(FLAGS.output_data_dir, exist_ok=True)
   
    train_data_dir = os.path.join(FLAGS.data_folder, 'train')
    valid_data_dir = os.path.join(FLAGS.data_folder, 'valid')
    
    train_output_file = os.path.join(FLAGS.output_data_dir, 'train_' + FLAGS.output_file_suffix)
    valid_output_file = os.path.join(FLAGS.output_data_dir, 'valid_' + FLAGS.output_file_suffix)
    
    print(train_output_file)
    print(valid_output_file)
    
    # Create a featurizer
    featurizer = resnet50.ResNet50(
                weights = 'imagenet', 
                input_shape=(224,224,3), 
                include_top = False,
                pooling = 'avg')

    print("Creating training bottleneck features")
    create_bottleneck_features(train_data_dir, train_output_file, featurizer)
    print("Creating validation bottleneck features")
    create_bottleneck_features(valid_data_dir, valid_output_file, featurizer)
  
if __name__ == '__main__':
    tf.app.run()
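
The Keras directory generator used in the script yields labels in its default categorical (one-hot) form, so the saved `labels` dataset holds one row of six values per image, ordered as in the script's `classes` list. A minimal pure-Python sketch of that encoding follows. `one_hot` and `decode` are illustrative names, not functions from the script.

```python
CLASSES = ["Barren", "Cultivated", "Developed", "Forest", "Herbaceous", "Shrub"]


def one_hot(label, classes=CLASSES):
    """Encode a class name as a vector with 1.0 at the class index."""
    vector = [0.0] * len(classes)
    vector[classes.index(label)] = 1.0
    return vector


def decode(vector, classes=CLASSES):
    """Return the class name at the position of the largest value."""
    best = max(range(len(vector)), key=vector.__getitem__)
    return classes[best]


encoded = one_hot("Developed")
print(encoded)
print(decode(encoded))
```

The same `decode` logic applies to the classifier's softmax output, where the largest of the six probabilities marks the predicted class.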

### Configure Datastores and DataReferences

Our training images are located in the Workspace's default datastore in the `input_images` folder. After the script completes, its output - the bottleneck features file - will be uploaded to the `bottleneck_features` folder in the default datastore.

Intermediate data (or output of a PythonScriptStep) is represented by a *PipelineData* object. *PipelineData* can be produced by one step and consumed in another step by providing the PipelineData object as an output of one step and the input of one or more steps.

Constructing PipelineData
- name: [Required] Name of the data item within the pipeline graph
- datastore_name: Name of the Datastore to write this output to
- output_name: Name of the output
- output_mode: Specifies "upload" or "mount" modes for producing output (default: mount)
- output_path_on_compute: For "upload" mode, the path to which the module writes this output during execution
- output_overwrite: Flag to overwrite pre-existing data



In [None]:
from azureml.pipeline.core import PipelineData

# Create output PipelineData object
bottleneck_features_ref = PipelineData(
    name = "bottleneck_features",
    datastore=default_ds)

### Create PythonScriptStep


In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.pipeline.steps import PythonScriptStep

# Configure runtime for script execution
run_config = RunConfiguration(framework="python")
run_config.environment.docker.enabled = True
run_config.environment.docker.gpu_support = True
# There seems to be a bug in RunConfiguration. Despite setting gpu_support to true
# it is pulling the base CPU image. 
run_config.environment.docker.base_image = 'mcr.microsoft.com/azureml/base-gpu:0.2.1'

pip_packages = ['h5py', 
                'pillow', 
                'scipy',
                'tensorflow-gpu']

conda_dependencies = CondaDependencies.create(pip_packages=pip_packages)
run_config.environment.python.conda_dependencies = conda_dependencies

# Define command line arguments for the script
arguments = ["--data_folder", images_data_ref,
             "--output_data_dir", bottleneck_features_ref]

# Create the step
extract_step = PythonScriptStep(
    script_name='extract.py',
    arguments=arguments,
    inputs=[images_data_ref],
    outputs=[bottleneck_features_ref],
    compute_target=compute_target,
    source_directory=script_folder,
    allow_reuse=True,
    runconfig=run_config
    )


## Training Step

Finally, we will define the training step. We are going to use *PythonScriptStep* and reuse the same runtime configuration as for the feature extraction step.

### Create the training script

In [None]:
script_name = 'train.py'

In [None]:
%%writefile $script_folder/train.py

import os
import tensorflow as tf
from tensorflow.keras.applications import resnet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D, Dropout, Flatten, Input
from tensorflow.keras.regularizers import l1_l2


import numpy as np
import random
import h5py


# Define network
def fcn_classifier(input_shape=(2048,), units=512, classes=6,  l1=0.01, l2=0.01):
    features = Input(shape=input_shape)
    x = Dense(units, activation='relu')(features)
    x = Dropout(0.5)(x)
    y = Dense(classes, activation='softmax', kernel_regularizer=l1_l2(l1=l1, l2=l2))(x)
    model = Model(inputs=features, outputs=y)
    model.compile(optimizer='adadelta', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Training regime
def train_evaluate():
   
    print("Loading bottleneck features")
    train_file_name = os.path.join(FLAGS.data_folder, FLAGS.train_file_name)
    valid_file_name = os.path.join(FLAGS.data_folder, FLAGS.valid_file_name)
    
    # Load bottleneck training features and labels
    with h5py.File(train_file_name, "r") as hfile:
        X_train = np.array(hfile.get('features'))
        y_train = np.array(hfile.get('labels'))
        
    # Load bottleneck validation features and labels
    with h5py.File(valid_file_name, "r") as hfile:
        X_validation = np.array(hfile.get('features'))
        y_validation = np.array(hfile.get('labels'))
        
             
    print(y_train.shape)
    print(y_validation.shape)
    
    # Create a network
    model = fcn_classifier(input_shape=(2048,), units=FLAGS.units, l1=FLAGS.l1, l2=FLAGS.l2)
    
    
    # Start training
    print("Starting training")
    model.fit(X_train, y_train,
          batch_size=FLAGS.batch_size,
          epochs=FLAGS.epochs,
          shuffle=True,
          validation_data=(X_validation, y_validation))
          
    # Save the trained model to outputs which is a standard folder expected by AML
    print("Training completed.")
    os.makedirs('outputs', exist_ok=True)
    model_file = os.path.join('outputs', 'aerial_fcnn_classifier.h5')
    print("Saving model to: {0}".format(model_file))
    model.save(model_file)
    

FLAGS = tf.app.flags.FLAGS

# Default global parameters
tf.app.flags.DEFINE_integer('batch_size', 32, "Number of images per batch")
tf.app.flags.DEFINE_integer('epochs', 10, "Number of epochs to train")
tf.app.flags.DEFINE_integer('units', 512, "Number of units in the hidden layer")
tf.app.flags.DEFINE_float('l1', 0.01, "l1 regularization")
tf.app.flags.DEFINE_float('l2', 0.01, "l2 regularization")
tf.app.flags.DEFINE_string('data_folder', './bottleneck', "Folder with bottleneck features and labels")
tf.app.flags.DEFINE_string('train_file_name', 'train_aerial_bottleneck_keras.h5', "Training file name")
tf.app.flags.DEFINE_string('valid_file_name', 'valid_aerial_bottleneck_keras.h5', "Validation file name")


def main(argv=None):
    
    train_evaluate()
  

if __name__ == '__main__':
    tf.app.run()
    

### Create the training step

Notice that we are using the output of the `extract_step` as the input to the `train_step`. This creates a pipeline with a data dependency: the `train_step` will not start until the `extract_step` has completed.

In [None]:
# Define command line arguments for the script
arguments = ["--data_folder", bottleneck_features_ref,
             '--train_file_name', 'train_aerial_bottleneck_keras.h5',
             '--valid_file_name', 'valid_aerial_bottleneck_keras.h5',
             '--epochs', 50]

# Create the step
train_step = PythonScriptStep(
    script_name='train.py',
    arguments=arguments,
    inputs=[bottleneck_features_ref],
    compute_target=compute_target,
    source_directory=script_folder,
    allow_reuse=True,
    runconfig=run_config
    )

## Build the pipeline

Since there is a data dependency between the extraction and training steps, we only need to specify the last step (`train_step`) in the pipeline's configuration.
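
The idea of recovering the upstream steps from the last one can be illustrated with a small dependency walk. This is a conceptual sketch only, not how Azure ML implements pipeline graph construction. `steps_to_run` and the two dictionaries are hypothetical and simply model which step produces and which step consumes each piece of data.

```python
def steps_to_run(final_step, produced_by, inputs_of):
    """Walk data dependencies upstream from final_step; return steps in run order."""
    ordered = []

    def visit(step):
        if step in ordered:
            return
        for data in inputs_of.get(step, []):
            producer = produced_by.get(data)
            if producer is not None:  # data with no producer is an external input
                visit(producer)
        ordered.append(step)

    visit(final_step)
    return ordered


# Which step writes each piece of intermediate data.
produced_by = {"bottleneck_features": "extract_step"}
# Which data each step reads.
inputs_of = {
    "extract_step": ["input_images"],
    "train_step": ["bottleneck_features"],
}

order = steps_to_run("train_step", produced_by, inputs_of)
print(order)
```

Here `input_images` has no producing step, which mirrors the lab: the images are copied in a separate pipeline and enter this one as an existing data reference.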

In [None]:
from azureml.pipeline.core import Pipeline

pipeline = Pipeline(workspace=ws, steps=[train_step])
pipeline.validate()

### Start and monitor a remote run



In [None]:
from azureml.core import Experiment
from azureml.widgets import RunDetails

experiment_name = 'aerial-classifier-training-pipeline'
exp = Experiment(workspace=ws, name=experiment_name)

pipeline_run = exp.submit(pipeline)
RunDetails(pipeline_run).show()


In [None]:
pipeline_run.get_status()
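
`get_status()` reports the state at a single moment. Outside a notebook, where the `RunDetails` widget is not available, a polling loop is one way to wait for the run to end. The sketch below accepts any zero-argument callable that returns a status string, so it runs without Azure. The terminal state names are an assumption and should be checked against the SDK documentation, and `wait_until_done` is a hypothetical helper.

```python
import time


def wait_until_done(get_status, terminal=("Finished", "Failed", "Canceled"),
                    poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Poll get_status() until it returns a terminal state or polls run out."""
    for _ in range(max_polls):
        status = get_status()
        if status in terminal:
            return status
        sleep(poll_seconds)
    raise TimeoutError("run did not reach a terminal state in time")


# Stand-in for pipeline_run.get_status, replaying a fixed sequence of states.
fake_statuses = iter(["NotStarted", "Running", "Running", "Finished"])
final = wait_until_done(lambda: next(fake_statuses), poll_seconds=0)
print(final)
```

With a real run, `pipeline_run.get_status` would be passed as the callable. The SDK's own `pipeline_run.wait_for_completion()` is likely the simpler choice when no custom handling is needed.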

In [None]:
pipeline_run.cancel()

## Clean up resources
If you are not going to walk through the other labs, delete the cluster.

In [None]:
compute_target.delete()