# Lab 3 - Model training and evaluation
In this lab we will train a multinomial classification model using the bottleneck features created in Lab 2.


![Transfer Learning](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/tlcl.png)

We will start by training a simple logistic regression model in a local environment to validate that the bottleneck features can improve our classifier. We will then use the Azure ML feature called `Automated ML` to find the best model for our image classification task.

![AML Arch](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/automated-machine-learning.png)



In [1]:
# Check core SDK version number
import azureml.core

print("SDK version:", azureml.core.VERSION)

SDK version: 0.1.74


## Connect to AML workspace


In [2]:
import azureml.core
from azureml.core import Workspace
import os

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

Found the config file in: /data/home/demouser/notebooks/MTC_AzureAILabs/DataScienceTrack/01-aml-walkthrough/aml_config/config.json
jkamllab
jkamllab
eastus2
952a710c-8d9c-40c1-9fec-f752138cc0b3


## Train logistic regression model locally

We will start by training a logistic regression model locally using the bottleneck features created in Lab 2.

### Download bottleneck features

In the previous lab, the bottleneck features were uploaded to the `bottleneck_features` folder in the default datastore.

In [3]:
data_folder = 'bottleneck_features'

In [4]:
ds = ws.get_default_datastore()
ds.download(target_path='.', prefix=data_folder, overwrite=True)
!ls bottleneck_features

aerial_bottleneck_resnet50_brainwave.h5  aerial_bottleneck_resnet50.h5


### Train logistic regression using bottleneck features

In [5]:
import h5py
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


# Load features and labels into numpy arrays
file_name = os.path.join(data_folder, 'aerial_bottleneck_resnet50_brainwave.h5')
with h5py.File(file_name, "r") as hfile:
    features = np.array(hfile.get('features'))
    labels = np.array(hfile.get('labels'))
 
# Split the data into training and validation partitions   
X_train, X_validation, y_train, y_validation = train_test_split(features, labels,
                                                               test_size=0.1,
                                                               shuffle=True,
                                                               stratify=labels)
    
# Train logistic regression model
reg = 0.1
clf = LogisticRegression(
        C=1.0/reg, 
        multi_class='multinomial',
        solver='lbfgs',
        random_state=1)

clf.fit(X_train, y_train)

LogisticRegression(C=10.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='multinomial',
          n_jobs=1, penalty='l2', random_state=1, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)
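
The `stratify=labels` argument asks `train_test_split` to keep the class proportions the same in both partitions. The sketch below illustrates the idea in plain Python on toy labels. It is not scikit-learn's implementation, and `stratified_split` and `toy_labels` are hypothetical names introduced only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.1, seed=0):
    """Return (train_idx, valid_idx) so that each class keeps a similar share."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, valid_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Hold out about test_size of every class, and at least one sample.
        n_valid = max(1, round(len(indices) * test_size))
        valid_idx.extend(indices[:n_valid])
        train_idx.extend(indices[n_valid:])
    return sorted(train_idx), sorted(valid_idx)


# Toy labels: 6 classes with 20 samples each (illustrative, not the lab data).
toy_labels = [c for c in range(6) for _ in range(20)]
train_idx, valid_idx = stratified_split(toy_labels, test_size=0.1)

valid_counts = defaultdict(int)
for i in valid_idx:
    valid_counts[toy_labels[i]] += 1
print(len(train_idx), len(valid_idx), dict(valid_counts))
```

Every class contributes the same number of validation samples here, which is the property that keeps a validation score from being dominated by whichever classes happened to land in the hold-out set.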

### Evaluate the logistic regression model

In [6]:
# Validate
y_hat = clf.predict(X_validation)
    
# Calculate accuracy 
acc = np.average(y_hat == y_validation)
print('Validation accuracy is:', acc)

Validation accuracy is: 0.9407239819004525
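
Overall accuracy can hide classes that the model handles poorly. As a minimal sketch, accuracy can also be broken down per class. The helper `per_class_accuracy` and the toy labels below are made up for illustration and are not part of the lab.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Map each class label to the fraction of its samples predicted correctly."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in sorted(totals)}


# Made-up labels and predictions for three classes (illustrative only).
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 2, 1]

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print("overall:", overall)
print("per class:", per_class_accuracy(y_true, y_pred))
```

In the notebook, the same helper could be applied to `y_validation.tolist()` and `y_hat.tolist()`, assuming the labels are stored as a one-dimensional array of class ids.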


As anticipated, the model trained on the bottleneck features performs much better than the model trained on raw images. Our approach works. The next step is to find the best classifier. We will accomplish that using the AML feature called `Automated ML`.

## Automated model selection

Automated machine learning (automated ML) automatically picks an algorithm and hyperparameters that optimize a given primary metric. The resulting model can also be downloaded and customized further. There are several options that you can use to configure Automated ML experiments. In this section of the lab we will go through the steps to configure and run the `Automated ML` job.

### Create Experiment

We will create a new experiment to manage `Automated ML` runs.

In [5]:
from azureml.core import Experiment

experiment_name = 'aerial-automatedML'
exp = Experiment(workspace=ws, name=experiment_name)

### Create compute target

Automated ML supports running concurrent experiments on Azure Batch AI clusters. This can significantly shorten the model selection and hyperparameter optimization process. 

**Note:** Creating the Batch AI cluster can take over 10 minutes. Please be patient.


In [6]:
from azureml.core.compute import BatchAiCompute
from azureml.core.compute import ComputeTarget

# Choose a name for your cluster.
batchai_cluster_name = "batchaicls"

found = False
# Check if this compute target already exists in the workspace.
for ct_name, ct in ws.compute_targets.items():
    print(ct.name, ct.type)
    if (ct.name == batchai_cluster_name and ct.type == 'BatchAI'):
        found = True
        print('Found existing compute target.')
        bai_compute_target = ct
        break
        
if not found:
    print('Creating a new compute target...')
    provisioning_config = BatchAiCompute.provisioning_configuration(vm_size = "STANDARD_DS2_V2", 
                                                                autoscale_enabled = True,
                                                                cluster_min_nodes = 1, 
                                                                cluster_max_nodes = 5)

    # Create the cluster.
    bai_compute_target = ComputeTarget.create(ws, batchai_cluster_name, provisioning_config)
    
    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    bai_compute_target.wait_for_completion(show_output = True, min_node_count = None, timeout_in_minutes = 20)
    

gpudsvm VirtualMachine
batchaicls BatchAI
Found existing compute target.


### Create Get Data script

If you use a remote compute target to run your Automated ML experiments, the data fetch must be wrapped in a separate Python script that implements a `get_data()` function. This script runs on the remote compute target where the automated ML experiment is run. `get_data()` eliminates the need to fetch the data over the wire for each iteration.

In [7]:
import os
script_folder = './script'
script_name = 'get_data.py'
os.makedirs(script_folder, exist_ok=True)

In [8]:
%%writefile $script_folder/get_data.py
import h5py
import numpy as np
import os
from sklearn.model_selection import train_test_split

def get_data():
    # Load bottleneck features
    data_folder = os.environ["AZUREML_DATAREFERENCE_workspacefilestore"]
    file_name = os.path.join(data_folder, 'aerial_bottleneck_resnet50_brainwave.h5')
    
    print("Data folder:", data_folder)
    print("Bottleneck features file:", file_name)
    print("Data folder content:", os.listdir(data_folder))
    
    with h5py.File(file_name, "r") as hfile:
        features = np.array(hfile.get('features'))
        labels = np.array(hfile.get('labels'))
        
    # Split the data into training and validation partitions   
    X_train, X_validation, y_train, y_validation = train_test_split(features, labels,
                                                               test_size=0.1,
                                                               shuffle=True,
                                                               stratify=labels)
        

    return {'X': X_train, 'y': y_train, 'X_valid': X_validation, 'y_valid': y_validation}

Overwriting ./script/get_data.py
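
Because the script only runs on the remote compute target, a mistake in the returned dictionary surfaces late. A minimal sketch of a local sanity check follows. It assumes the four keys returned by the `get_data()` script in this lab, and `check_get_data_output` is a hypothetical helper, not part of the Azure ML SDK.

```python
REQUIRED_KEYS = ("X", "y", "X_valid", "y_valid")


def check_get_data_output(data):
    """Raise ValueError if the dictionary lacks a key or has mismatched row counts."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    if len(data["X"]) != len(data["y"]):
        raise ValueError("X and y have different numbers of rows")
    if len(data["X_valid"]) != len(data["y_valid"]):
        raise ValueError("X_valid and y_valid have different numbers of rows")
    return True


# Toy stand-in for the real features and labels (illustrative only).
toy = {
    "X": [[0.1, 0.2], [0.3, 0.4]],
    "y": [0, 1],
    "X_valid": [[0.5, 0.6]],
    "y_valid": [1],
}
print(check_get_data_output(toy))
```

The check relies only on `len()`, so it should work the same way with the NumPy arrays that `get_data()` returns.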


### Configure datastore and data reference

The bottleneck files have been uploaded to the workspace's default datastore during the previous step. We will download the files onto the nodes of the cluster.

In [9]:
from azureml.core import Datastore
from azureml.core.runconfig import DataReferenceConfiguration

ds = ws.get_default_datastore()
print("Using the default datastore for training data: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)

dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='bottleneck_features', 
                   path_on_compute='bottleneck_features',
                   mode='download', # download files from datastore to compute target
                   overwrite=True)


Using the default datastore for training data: 
workspacefilestore AzureFile jkamllab3650394639 azureml-filestore-bc740c20-4b07-49e7-92ba-c5bf27a7cb86


### Create Docker run configuration

We will run `Automated ML` jobs in a custom Docker image that includes the dependencies required by the `get_data()` script.


In [10]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Run
from azureml.core import ScriptRunConfig

# create a new RunConfig object
run_config = RunConfiguration(framework="python")

# Automated ML jobs on an Azure Batch AI cluster require Docker.
run_config.environment.docker.enabled = True

# Set compute target to BAI cluster
run_config.target = bai_compute_target.name

# Set data references
run_config.data_references = {ds.name: dr}

# specify packages required by get_data
run_config.environment.python.conda_dependencies = \
   CondaDependencies.create(conda_packages=['h5py'])
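
The `get_data()` script reads the data folder from the `AZUREML_DATAREFERENCE_workspacefilestore` environment variable, and `workspacefilestore` is the datastore name used as the key in `run_config.data_references`. The sketch below assumes that the variable name is that prefix followed by the data reference key. This is inferred from the lab's own code rather than confirmed against SDK documentation. `resolve_data_folder` is a hypothetical helper that falls back to a local folder so the same lookup can be tried outside the cluster.

```python
import os


def resolve_data_folder(datastore_name, local_fallback="bottleneck_features"):
    """Return the folder named by the data-reference variable, or a local fallback."""
    # Assumption: the variable name is the prefix used in get_data.py plus the
    # data reference key (the datastore name in this lab).
    env_var = "AZUREML_DATAREFERENCE_" + datastore_name
    return os.environ.get(env_var, local_fallback)


print(resolve_data_folder("workspacefilestore"))
```

On a machine where the variable is not set, the call returns the fallback, which matches the folder downloaded earlier in this lab.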


### Configure Automated ML run.

Automated ML runs can be controlled using a number of configuration parameters. 


|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|
|**max_time_sec**|Time limit in seconds for each iteration|
|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|
|**n_cross_validations**|Number of cross validation splits|
|**concurrent_iterations**|Max number of iterations that would be executed in parallel. |
|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|
|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|
|**exit_score**|*double* value indicating the target for *primary_metric*. <br>Once the target is surpassed the run terminates.|
|**blacklist_algos**|*List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run.<br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGDClassifierWrapper</i><br><i>NBWrapper</i><br><i>BernoulliNB</i><br><i>SVCWrapper</i><br><i>LinearSVMWrapper</i><br><i>KNeighborsClassifier</i><br><i>GradientBoostingClassifier</i><br><i>DecisionTreeClassifier</i><br><i>RandomForestClassifier</i><br><i>ExtraTreesClassifier</i><br><i>LightGBMClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoostingRegressor</i><br><i>DecisionTreeRegressor</i><br><i>KNeighborsRegressor</i><br><i>LassoLars</i><br><i>SGDRegressor</i><br><i>RandomForestRegressor</i><br><i>ExtraTreesRegressor</i>|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
    
For the best results from `Automated ML`, it is recommended to run at least 100 iterations. Due to the lab's time constraints we will run only 25 iterations. We will also limit the number of algorithms tried by using the `blacklist_algos` parameter.

In [11]:
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import logging


automl_config = AutoMLConfig(run_configuration = run_config,
                             task = 'classification',
                             num_classes = 6,
                             debug_log = 'automl_errors.log',
                             primary_metric = 'accuracy',
                             max_time_sec = 1800,
                             iterations = 25,
                             concurrent_iterations = 5,
                             max_cores_per_iteration = 1,
                             preprocess = False,
                             exit_score = 0.96,
                             blacklist_algos = ['KNeighborsClassifier',
                                                'LinearSVMWrapper',
                                                'NBWrapper',
                                                'BernoulliNB',
                                                'GradientBoostingClassifier',
                                                'SGDClassifierWrapper'],
                             verbosity = logging.INFO,
                             path = script_folder,
                             data_script = os.path.join(script_folder, script_name))
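
The settings above imply a rough upper bound on how long the job can run. The sketch below assumes iterations are scheduled in full waves of `concurrent_iterations` and that each one uses its entire `max_time_sec` limit. The real scheduler may behave differently, and the figure ignores image creation and cluster scaling. `worst_case_seconds` is a hypothetical helper.

```python
import math


def worst_case_seconds(iterations, concurrent_iterations, max_time_sec):
    """Upper bound on training time if every iteration hits its time limit."""
    waves = math.ceil(iterations / concurrent_iterations)
    return waves * max_time_sec


budget = worst_case_seconds(iterations=25, concurrent_iterations=5, max_time_sec=1800)
print(budget, "seconds =", budget / 3600, "hours")
```

With the values used in this lab, the bound is 9000 seconds (2.5 hours). In practice the run is usually shorter, because most iterations finish before the limit and the run stops early once `exit_score` is surpassed.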

### Run Automated ML job

In [12]:
tags = {"Desc": "automated ml"}
run = exp.submit(config=automl_config, tags=tags)
run

Experiment,Id,Type,Status,Details Page,Docs Page
aerial-automatedML,AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab,automl,Preparing,Link to Azure Portal,Link to Documentation


The call to `submit` returns an `AutoMLRun` object that can be used to track the run.

Since the call is asynchronous, it reports a **Preparing** or **Running** state as soon as the job is started.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the RunConfiguration. The image is uploaded to the workspace. This happens only once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. 

- **Running**: In this stage, Automated ML takes over and starts running experiments.



You can check the progress of a running job in multiple ways: Azure Portal, AML Widgets or streaming logs.

#### Monitor the run.

We will use the AML widget to monitor the run. The widget will first report a "loading" status while running the first iteration. After completing the first iteration, an auto-updating graph and table will be shown. The widget will refresh once per minute, so you should see the graph update as child runs complete.

The widget is asynchronous - it does not block the notebook. You can execute other cells while the widget is running.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [13]:
from azureml.train.widgets import RunDetails
RunDetails(run).show()

_AutoML(widget_settings={'childWidgetDisplay': 'popup'})

_UserRun(widget_settings={'childWidgetDisplay': 'popup', 'display': 'popup'})

_UserRun(widget_settings={'childWidgetDisplay': 'popup', 'display': 'popup'})

#### Cancelling Runs

You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations.
# run.cancel()

# Cancel iteration 1 and move onto iteration 2.
# run.cancel_iteration(1)

#### Analyze the run

You can use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [18]:
import pandas as pd

children = list(run.get_children())
metricslist = {}
for child in children:
    properties = child.get_properties()
    metrics = {k: v for k, v in child.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

# Sort the columns (one per iteration) by iteration number
rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,15,16,17,18,19,20,21,22,23,24
AUC_macro,0.994984,0.928042,0.996208,0.896381,,0.953435,0.995235,0.967551,0.995921,0.989519,...,,0.990245,0.991259,0.984884,0.993205,0.993959,0.992093,0.977915,,0.990631
AUC_micro,0.995817,0.936663,0.996712,0.913619,,0.956,0.995898,0.968449,0.996541,0.990898,...,,0.991174,0.991835,0.986213,0.993833,0.994692,0.9918,0.977442,,0.991687
AUC_weighted,0.994985,0.928087,0.996208,0.896389,,0.953427,0.995235,0.967548,0.995922,0.989524,...,,0.990248,0.991262,0.984875,0.993207,0.993957,0.992095,0.977914,,0.990634
accuracy,0.947059,0.700905,0.947964,0.676018,,0.785068,0.934842,0.802715,0.947964,0.904977,...,,0.91267,0.91629,0.881448,0.933937,0.934389,0.927602,0.828507,,0.909955
accuracy_max,0.947059,0.700905,0.947964,0.676018,0.0,0.947059,0.947964,0.947964,0.947964,0.947964,...,,0.947964,0.947964,0.947964,0.947964,0.959276,0.959276,0.959276,,0.959276
average_precision_score_macro,0.98223,0.807306,0.984755,0.69253,,0.853102,0.979521,0.884399,0.983521,0.959627,...,,0.961124,0.964038,0.939151,0.973369,0.976778,0.96837,0.913278,,0.961575
average_precision_score_micro,0.985318,0.802095,0.987126,0.745719,,0.851408,0.983464,0.895054,0.986216,0.967261,...,,0.967948,0.969875,0.949393,0.977267,0.98105,0.971437,0.914418,,0.968285
average_precision_score_weighted,0.982235,0.807421,0.984759,0.692512,,0.853077,0.979524,0.884371,0.983526,0.95965,...,,0.961136,0.964058,0.9391,0.973372,0.976763,0.968378,0.913286,,0.96159
balanced_accuracy,0.947049,0.70074,0.947961,0.676213,,0.785039,0.934839,0.802738,0.947956,0.90494,...,,0.91267,0.916263,0.881517,0.933915,0.934416,0.927589,0.828477,,0.909948
f1_score_macro,0.947024,0.672795,0.948008,0.633855,,0.782397,0.935211,0.801225,0.948108,0.905313,...,,0.913345,0.916561,0.881003,0.93379,0.93453,0.927734,0.814791,,0.910211
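
The per-iteration metrics can also be ranked programmatically. The sketch below picks the iteration with the highest value of a chosen metric and skips iterations that logged nothing. `best_iteration` is a hypothetical helper, and `sample` holds only the accuracy values of iterations 0 to 3 copied from the table above, so it does not reflect the full run.

```python
import math


def best_iteration(metrics_by_iteration, metric):
    """Return (iteration, value) with the highest value of `metric`, ignoring gaps."""
    scored = [
        (values[metric], iteration)
        for iteration, values in metrics_by_iteration.items()
        if metric in values and not math.isnan(values[metric])
    ]
    if not scored:
        raise ValueError(f"no iteration logged {metric!r}")
    value, iteration = max(scored)
    return iteration, value


# Iterations 0-3 from the table above; iteration 4 has no accuracy value.
sample = {
    0: {"accuracy": 0.947059},
    1: {"accuracy": 0.700905},
    2: {"accuracy": 0.947964},
    3: {"accuracy": 0.676018},
    4: {},
}
print(best_iteration(sample, "accuracy"))
```

The `metricslist` dictionary built in the cell above has the same shape (iteration number mapped to a dictionary of float metrics), so it could be passed to the helper directly.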


#### Waiting until the run finishes

The `wait_for_completion` method blocks until the run finishes.

In [19]:
# Wait until the run finishes.
run.wait_for_completion(show_output = True)


***********************************************************************************************
ITERATION: The iteration being evaluated.
PIPELINE: A summary description of the pipeline being evaluated.
DURATION: Time taken for the current iteration.
METRIC: The result of computing score on the fitted pipeline.
BEST: The best observed score thus far.
***********************************************************************************************

 ITERATION     PIPELINE                               DURATION                METRIC      BEST
         0     StandardScalerWrapper LogisticRegressio367                      0.947     0.947
         1     MaxAbsScaler RandomForestClassifier    992                      0.701     0.947
         2     MaxAbsScaler LightGBMClassifier        208                      0.948     0.948
         3     StandardScalerWrapper RandomForestClass474                      0.676     0.948
         4     MinMaxScaler RandomForestClassifier    0                   

{'runId': 'AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab',
 'target': 'batchaicls',
 'status': 'Running',
 'startTimeUtc': '2018-11-20T15:52:17.503812Z',
 'properties': {'num_iterations': '25',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'accuracy',
  'train_split': '0',
  'max_time_seconds': '1800',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'batchaicls',
  'DataPrepJsonString': None,
  'EnableSubsampling': 'False',
  'runTemplate': 'AutoML',
  'azureml.runsource': 'automl',
  'SetupRunId': 'AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab_setup',
  'snapshotId': '68569f84-40d9-4c8b-a618-4fc69d48c728',
  'ProblemInfoJsonString': '{"dataset_num_categorical": 0, "dataset_classes": 6, "dataset_features": 2048, "dataset_samples": 19882, "is_sparse": false}'},
 'logFiles': {}}

### Retrieve the best model

Below we select the best pipeline from our iterations. The `get_output` method returns the best run and the fitted model. The model includes the pipeline and any pre-processing. Overloads on `get_output` allow you to retrieve the best run and fitted model for *any* logged metric or for a particular *iteration*.

#### Best model based on the primary metric

In [20]:
best_run, fitted_model = run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: aerial-automatedML,
Id: AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab_10,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fcea81f1d68>), ('SVCWrapper', SVCWrapper(C=4714.8663634573895, class_weight=None, kernel='poly',
      probability=True, random_state=None))])


#### Best model based on any metric
You can also retrieve the best model based on an arbitrary metric. For example, we will retrieve the model with the highest AUC value:

In [21]:
lookup_metric = "AUC_weighted"
best_run_auc, fitted_model_best_auc = run.get_output(metric = lookup_metric)
print(best_run_auc)
print(fitted_model_best_auc)

Run(Experiment: aerial-automatedML,
Id: AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab_10,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fce9467a9e8>), ('SVCWrapper', SVCWrapper(C=4714.8663634573895, class_weight=None, kernel='poly',
      probability=True, random_state=None))])


#### Model from a specific iteration
Show the run and the model from the third iteration:

In [22]:
iteration = 3
third_run, third_model = run.get_output(iteration=iteration)
print(third_run)
print(third_model)

Run(Experiment: aerial-automatedML,
Id: AutoML_ac5fd8f3-7e2d-482b-ada1-b2f3f270a5ab_3,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(memory=None,
     steps=[('StandardScalerWrapper', <automl.client.core.common.model_wrappers.StandardScalerWrapper object at 0x7fcea848ebe0>), ('RandomForestClassifier', RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features=None, max_leaf_nodes=None,
    ...imators=100, n_jobs=1,
            oob_score=True, random_state=None, verbose=0, warm_start=False))])


### Register the best model

You can use `AutoMLRun` to register the trained model with AML Model Registry.

If neither `metric` nor `iteration` is specified in the `register_model` call, the iteration with the best primary metric is registered.

We are going to add two properties to the registration:
- FriendlyName. We will use this property to retrieve the model in the next lab
- RunID. This property will store the ID of the run that created the model. We will use it to retrieve runtime dependencies in the next lab.

In [23]:
description = 'Aerial Classifier Model - Best Accuracy'
model_name = 'AerialClassifier'

model = run.register_model(description=description, tags=tags)
properties = {"FriendlyName": "Lab3BestModel", "RunID": run.id}
model.add_properties(properties)


Registering model AutoMLac5fd8f37best
Model properties add operation complete.


## Next Step

The model is ready for deployment. In the next lab you will deploy the model to Azure Container Instances.

## Clean up resources

Before you move to the next step, you can delete the BAI cluster. We will not need it anymore.

In [None]:
bai_compute_target.delete()