# Lab 3 - Model Training with AutomatedML

In this lab you will use the automated machine learning (AutoML) capabilities of the Azure Machine Learning service to automatically train multiple models with varying algorithms and hyperparameters, select the best-performing model, and register that model.

![AutomatedML](https://github.com/jakazmie/images-for-hands-on-labs/raw/master/automated-machine-learning.png)

In [None]:
# Verify AML SDK Installed
# view version history at https://pypi.org/project/azureml-sdk/#history 
import azureml.core
print("SDK Version:", azureml.core.VERSION)

## Connect to the workspace

In [None]:
from azureml.core import Workspace

# Read the workspace config from file
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

## Train a model using AutomatedML

This lab builds on the lessons learned in the previous lab, but it is self-contained, so you can work through it without having run the previous lab.

To train a model using AutoML you only need to provide a configuration that defines items such as the type of model (classification or regression), the performance metric to optimize, the exit criteria (maximum training time, number of iterations, and desired performance), any algorithms that should not be used, and the path into which to output the results. This configuration is specified using the `AutoMLConfig` class, which is then used to drive the submission of an experiment via `experiment.submit`. When AutoML finishes the parent run, you can get the best-performing run and model from the returned run object by calling `run.get_output()`.

### Create/Get Azure ML Compute cluster

We will run the AutomatedML job in parallel on an Azure ML Compute cluster.

In [None]:
# Create (or reuse) an Azure ML Compute cluster
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
import os

# choose a name for your cluster
cluster_name = "cpu-bai-cluster"
cluster_min_nodes = 1
cluster_max_nodes = 3
vm_size = "STANDARD_DS11_V2"

if cluster_name in ws.compute_targets:
    compute_target = ws.compute_targets[cluster_name]
    if compute_target and isinstance(compute_target, AmlCompute):
        print('Found existing compute target, using it instead of creating a new one: ' + cluster_name)
    else:
        print("Error: A compute target named", cluster_name, "was found, but it is not of type AmlCompute.")
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size = vm_size, 
                                                                min_nodes = cluster_min_nodes, 
                                                                max_nodes = cluster_max_nodes)

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, provisioning_config)
    
    # Optionally poll for a minimum number of nodes and a specific timeout.
    # If no min node count is provided, the cluster's scale settings are used.
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of the current cluster status, use the 'status' property
    print(compute_target.status.serialize())

### Create Get Data script

If you use remote compute to run your Automated ML experiments, the data fetch must be wrapped in a separate Python script that implements a `get_data()` function. This script runs on the remote compute where the Automated ML experiment runs. Using `get_data()` eliminates the need to fetch the data over the wire for each iteration.

In [None]:
import os
project_folder = './project'
script_name = 'get_data.py'
os.makedirs(project_folder, exist_ok=True)

In [None]:
%%writefile $project_folder/get_data.py
import numpy as np
import pandas as pd
import os
from sklearn.model_selection import train_test_split

def get_data():
    # Locate the used cars dataset downloaded from the datastore
    data_folder = os.environ["AZUREML_DATAREFERENCE_workspaceblobstore"]
    file_name = os.path.join(data_folder, 'UsedCars_Affordability.csv')
    
    print("Data folder:", data_folder)
    print("Dataset:", file_name)
    print("Data folder content:", os.listdir(data_folder))
    
    df_affordability = pd.read_csv(file_name, delimiter=',')

    features = df_affordability[["Age", "KM"]]
    labels = df_affordability[["Affordable"]]

        
    # Split the data into training and validation partitions
    train_X, test_X, train_Y, test_Y = train_test_split(features, labels,
                                                        test_size=0.2,
                                                        shuffle=True)

    # Flatten labels
    train_Y = np.ravel(train_Y)
    test_Y = np.ravel(test_Y)

    # Convert to float
    train_X = train_X.astype(float)
    test_X = test_X.astype(float)

    return {'X': train_X, 'y': train_Y, 'X_valid': test_X, 'y_valid': test_Y}
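
Before submitting a remote run, it can help to check locally that the dictionary has the shape the script above returns. The following is a minimal sketch, not part of the Azure ML SDK: `check_get_data_output` is a hypothetical helper, and the sample values are invented for illustration.

```python
def check_get_data_output(data):
    """Raise ValueError if `data` does not look like the dictionary built by get_data()."""
    required = ("X", "y", "X_valid", "y_valid")
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError("missing keys: " + ", ".join(missing))
    # Features and labels must describe the same number of rows in each partition.
    for features_key, labels_key in (("X", "y"), ("X_valid", "y_valid")):
        n_rows = len(data[features_key])
        n_labels = len(data[labels_key])
        if n_rows != n_labels:
            raise ValueError(
                f"{features_key} has {n_rows} rows but {labels_key} has {n_labels} labels")
    return True


# Tiny hand-made example (hypothetical values, not taken from the lab dataset).
sample = {
    "X": [[60.0, 4000.0], [24.0, 90000.0]],
    "y": [1, 0],
    "X_valid": [[36.0, 50000.0]],
    "y_valid": [0],
}
print(check_get_data_output(sample))
```

Because the helper only relies on `len()`, it should also accept pandas DataFrames and NumPy arrays such as those produced by `get_data()`.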

### Configure datastore and data reference

The training files were uploaded to the workspace's default datastore during the previous step. We will download the files onto the nodes of the cluster.

In [None]:
from azureml.core import Datastore
from azureml.core.runconfig import DataReferenceConfiguration

ds = ws.get_default_datastore()
print("Using the default datastore for training data: ")
print(ds.name, ds.datastore_type, ds.account_name, ds.container_name)

dr = DataReferenceConfiguration(datastore_name=ds.name, 
                   path_on_datastore='used_cars', 
                   path_on_compute='used_cars',
                   mode='download', # download files from datastore to compute target
                   overwrite=True)
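
The `get_data()` script reads the download location from the `AZUREML_DATAREFERENCE_workspaceblobstore` environment variable, which is only expected to exist on the remote compute. A minimal sketch of a lookup with a local fallback is shown below; `resolve_data_folder` and the `./data` path are hypothetical names used for illustration, and only the variable prefix is taken from the script.

```python
import os


def resolve_data_folder(datastore_name, local_fallback):
    """Return the folder that holds the training files.

    Uses the data reference variable when it is set (remote compute),
    otherwise falls back to `local_fallback` (local testing).
    """
    variable = "AZUREML_DATAREFERENCE_" + datastore_name
    return os.environ.get(variable, local_fallback)


print(resolve_data_folder("workspaceblobstore", "./data"))
```

This keeps the script testable on a workstation without changing how it behaves on the cluster.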


### Create Docker run configuration
We will run Automated ML jobs in a custom Docker image that includes the dependencies required by the `get_data()` script.

In [None]:
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Run
from azureml.core import ScriptRunConfig

# create a new RunConfig object
run_config = RunConfiguration(framework="python")

# Azure ML Compute clusters require Docker for Automated ML jobs.
run_config.environment.docker.enabled = True

# Set compute target to BAI cluster
run_config.target = compute_target.name

# Set data references
run_config.data_references = {ds.name: dr}

# specify packages required by get_data
# run_config.environment.python.conda_dependencies = \
#   CondaDependencies.create(conda_packages=['h5py'])


### Configure Automated ML run.

Automated ML runs can be controlled using a number of configuration parameters. 


|Property|Description|
|-|-|
|**task**|classification or regression|
|**primary_metric**|This is the metric that you want to optimize.<br> Classification supports the following primary metrics <br><i>accuracy</i><br><i>AUC_weighted</i><br><i>balanced_accuracy</i><br><i>average_precision_score_weighted</i><br><i>precision_score_weighted</i>|
|**max_time_sec**|Time limit in seconds for each iteration|
|**iterations**|Number of iterations. In each iteration Auto ML trains a specific pipeline with the data|
|**n_cross_validations**|Number of cross validation splits|
|**max_concurrent_iterations**|Maximum number of iterations executed in parallel.|
|**preprocess**| *True/False* <br>Setting this to *True* enables Auto ML to perform preprocessing <br>on the input to handle *missing data*, and perform some common *feature extraction*|
|**max_cores_per_iteration**| Indicates how many cores on the compute target would be used to train a single pipeline.<br> Default is *1*, you can set it to *-1* to use all cores|
|**experiment_exit_score**|*double* value indicating the target for *primary_metric*. <br>Once the target is surpassed the run terminates.|
|**blacklist_models**|*List* of *strings* indicating machine learning algorithms for AutoML to avoid in this run.<br><br> Allowed values for **Classification**<br><i>LogisticRegression</i><br><i>SGDClassifierWrapper</i><br><i>NBWrapper</i><br><i>BernoulliNB</i><br><i>SVCWrapper</i><br><i>LinearSVMWrapper</i><br><i>KNeighborsClassifier</i><br><i>GradientBoostingClassifier</i><br><i>DecisionTreeClassifier</i><br><i>RandomForestClassifier</i><br><i>ExtraTreesClassifier</i><br><i>LightGBMClassifier</i><br><br>Allowed values for **Regression**<br><i>ElasticNet</i><br><i>GradientBoostingRegressor</i><br><i>DecisionTreeRegressor</i><br><i>KNeighborsRegressor</i><br><i>LassoLars</i><br><i>SGDRegressor</i><br><i>RandomForestRegressor</i><br><i>ExtraTreesRegressor</i>|
|**path**|Relative path to the project folder. AutoML stores configuration files for the experiment under this folder. You can specify a new empty folder.|
    
For optimal performance of `AutomatedML` it is recommended to run at least 100 iterations. Due to the lab's time constraints we will only run 50 iterations. We will also limit the number of algorithms tried by using the `blacklist_models` parameter.

In [None]:
from azureml.train.automl import AutoMLConfig
from azureml.train.automl.run import AutoMLRun
import logging


automl_config = AutoMLConfig(run_configuration = run_config,
                             task = 'classification',
                             debug_log = 'automl_errors.log',
                             primary_metric = 'accuracy',
                             iterations = 50,
                             max_concurrent_iterations = cluster_max_nodes,
                             max_cores_per_iteration = 1,
                             preprocess = False,
                             experiment_exit_score = 0.98,
                             #blacklist_models = ['kNN','LinearSVM'],
                             blacklist_models = ['KNeighborsClassifier','LinearSVMWrapper'],
                             verbosity = logging.INFO,
                             path = project_folder,
                             data_script = os.path.join(project_folder, script_name))
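
The configuration above combines two exit criteria: a cap on `iterations` and a target `experiment_exit_score`. The sketch below illustrates how such criteria interact in a simplified, strictly sequential loop. It is an illustration only, not how AutoML is implemented: `run_until_exit` is a hypothetical function, the scores are invented, and a real run with concurrent iterations may behave differently.

```python
def run_until_exit(scores, max_iterations, exit_score):
    """Return (iterations_run, best_score) for a sequence of per-iteration scores."""
    best_score = float("-inf")
    iterations_run = 0
    for score in scores[:max_iterations]:
        iterations_run += 1
        best_score = max(best_score, score)
        # Stop early once the target for the primary metric is reached.
        if best_score >= exit_score:
            break
    return iterations_run, best_score


# Hypothetical accuracy values for five iterations.
print(run_until_exit([0.91, 0.95, 0.985, 0.97, 0.99], max_iterations=50, exit_score=0.98))
```

With these invented scores the loop stops after the third iteration, well before the iteration cap is reached.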



### Run AutomatedML job.

In [None]:
from azureml.core import Experiment

experiment_name = "usedcars_training_automatedml"
exp = Experiment(ws, experiment_name)
tags = {"Desc": "automated ml"}
run = exp.submit(config=automl_config, tags=tags)
run

The call to `exp.submit` returns an `AutoMLRun` object that can be used to track the run.

Since the call is asynchronous, it reports a **Preparing** or **Running** state as soon as the job is started.

Here is what's happening while you wait:

- **Image creation**: A Docker image is created matching the Python environment specified by the RunConfiguration. The image is uploaded to the workspace. This happens only once for each Python environment since the container is cached for subsequent runs.  During image creation, logs are streamed to the run history. You can monitor the image creation progress using these logs.

- **Scaling**: If the remote cluster requires more nodes to execute the run than currently available, additional nodes are added automatically. 

- **Running**: In this stage, Automated ML takes over and starts running experiments.



You can check the progress of a running job in multiple ways: Azure Portal, AML Widgets or streaming logs.
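
Because the submission is asynchronous, a script running outside the notebook may prefer to poll the run status until it reaches a final state. The sketch below shows the idea with a generic helper. `wait_for_terminal_state` is a hypothetical function, and the set of terminal state names is an assumption to be checked against the SDK documentation; a stand-in sequence of states is used so the example runs without a workspace.

```python
import time

# Assumed terminal states; check the SDK documentation for the full list.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_terminal_state(get_status, poll_seconds=30, max_polls=120):
    """Poll `get_status()` until it returns a terminal state or `max_polls` is used up."""
    status = get_status()
    polls = 1
    while status not in TERMINAL_STATES and polls < max_polls:
        time.sleep(poll_seconds)
        status = get_status()
        polls += 1
    return status


# Stand-in for a real run: yields a fixed sequence of states.
fake_states = iter(["Preparing", "Running", "Running", "Completed"])
print(wait_for_terminal_state(lambda: next(fake_states), poll_seconds=0))
```

With a real run, the run object's `get_status` method could be passed as the `get_status` argument.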

### Monitor the run.

We will use the AML Widget to monitor the run. The widget first reports a "loading" status while the first iteration runs. After the first iteration completes, an auto-updating graph and table are shown. The widget refreshes once per minute, so you should see the graph update as child runs complete.

The widget is asynchronous - it does not block the notebook. You can execute other cells while the widget is running.

**Note:** The widget displays a link at the bottom. Use this link to open a web interface to explore the individual run details.

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

### Cancelling Runs

You can cancel ongoing remote runs using the `cancel` and `cancel_iteration` functions.

In [None]:
# Cancel the ongoing experiment and stop scheduling new iterations.
# run.cancel()

# Cancel iteration 1 and move onto iteration 2.
# run.cancel_iteration(1)

### Analyze the run

You can use SDK methods to fetch all the child runs and see the individual metrics that are logged.

In [None]:
import pandas as pd

children = list(run.get_children())
metricslist = {}
for child in children:
    properties = child.get_properties()
    metrics = {k: v for k, v in child.get_metrics().items() if isinstance(v, float)}
    metricslist[int(properties['iteration'])] = metrics

rundata = pd.DataFrame(metricslist).sort_index(axis=1)
rundata
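
The per-iteration metrics collected above can also be scanned programmatically. The following minimal sketch picks the iteration with the highest value of one metric from a dictionary shaped like `metricslist`. `best_iteration` is a hypothetical helper and the numbers are invented, not results from this lab.

```python
def best_iteration(metrics_by_iteration, metric="accuracy"):
    """Return (iteration, value) with the highest `metric`, skipping iterations that lack it."""
    candidates = {
        iteration: metrics[metric]
        for iteration, metrics in metrics_by_iteration.items()
        if metric in metrics
    }
    if not candidates:
        raise ValueError("no iteration logged the metric " + repr(metric))
    iteration = max(candidates, key=candidates.get)
    return iteration, candidates[iteration]


# Hypothetical metrics in the same shape as `metricslist`.
example = {
    0: {"accuracy": 0.91, "AUC_weighted": 0.95},
    1: {"accuracy": 0.94, "AUC_weighted": 0.97},
    2: {"AUC_weighted": 0.90},  # e.g. a child run that has not logged accuracy yet
}
print(best_iteration(example))
```

Skipping iterations without the metric avoids a `KeyError` when some child runs are still in progress or have failed.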

### Waiting until the run finishes

The `wait_for_completion` method blocks until the run finishes.

In [None]:
# Wait until the run finishes.
run.wait_for_completion(show_output = True)

## Try the best model

In [None]:
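# Retrieve the best run and its fitted model from the completed AutoML run
best_run, best_model = run.get_output()

age = 60
km = 4000

print(best_model.predict([[age, km]]))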

## Register the best performing model for later use and deployment

In [None]:
# notice the use of the root run (not best_run) to register the best model
run.register_model(description='AutoML trained used cars classifier')