# Lab 5 - Automated ML 

In the previous lab you trained a first machine learning (ML) model on the automobile prices dataset. Model training is an iterative process and typically takes several rounds to improve upon an existing model.

Automated machine learning - or Auto ML for short - is the process of automating the time-consuming, iterative tasks of machine learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. Automated ML is based on a breakthrough from our [Microsoft Research division](https://arxiv.org/abs/1705.05355).

Apply automated ML when you want Azure Machine Learning to train and tune a model for you using the target metric you specify. The service then iterates through ML algorithms paired with feature selections, where each iteration produces a model with a training score. The higher the score, the better the model is considered to fit your data.
With automated machine learning, you'll accelerate the time it takes to get production-ready ML models with great ease and efficiency.
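
The iteration described above can be pictured with a small sketch. It is not how Azure Machine Learning implements its search: the two toy algorithms (`majority_class`, `nearest_centroid`), the `toy_automl` helper, and the six made-up rows are assumptions introduced purely for illustration. The sketch tries every algorithm with every feature subset, scores each resulting model on the training rows, and keeps the highest score.

```python
from itertools import combinations

# Hypothetical toy data, not the lab's dataset: (features, label) pairs.
DATA = [
    ({"age": 25, "noise": 1}, 0),
    ({"age": 30, "noise": 5}, 0),
    ({"age": 28, "noise": 3}, 0),
    ({"age": 35, "noise": 2}, 0),
    ({"age": 50, "noise": 4}, 1),
    ({"age": 60, "noise": 3}, 1),
]

def majority_class(train, features):
    """Toy algorithm: always predict the most common label."""
    labels = [label for _, label in train]
    top = max(set(labels), key=labels.count)
    return lambda row: top

def nearest_centroid(train, features):
    """Toy algorithm: predict the label whose feature means are closest."""
    centroids = {}
    for label in {lab for _, lab in train}:
        rows = [row for row, lab in train if lab == label]
        centroids[label] = {f: sum(r[f] for r in rows) / len(rows) for f in features}

    def predict(row):
        def squared_distance(label):
            return sum((row[f] - centroids[label][f]) ** 2 for f in features)
        return min(centroids, key=squared_distance)

    return predict

def accuracy(model, rows):
    """Fraction of rows for which the model predicts the right label."""
    return sum(model(row) == label for row, label in rows) / len(rows)

def toy_automl(data, algorithms, feature_names):
    """Try every algorithm with every feature subset and keep the best score."""
    best = None
    for algorithm in algorithms:
        for size in range(1, len(feature_names) + 1):
            for features in combinations(feature_names, size):
                model = algorithm(data, features)
                score = accuracy(model, data)  # training score of this iteration
                if best is None or score > best[0]:
                    best = (score, algorithm.__name__, features)
    return best

best_score, best_algorithm, best_features = toy_automl(
    DATA, [majority_class, nearest_centroid], ["age", "noise"]
)
print(best_score, best_algorithm, best_features)
```

Scoring on the training rows keeps the sketch short; a real search would score each candidate on held-out data, for example with cross-validation.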

<img src="https://docs.microsoft.com/en-us/azure/machine-learning/service/media/tutorial-auto-train-models/flow2.png" style="width: 500px;"/>


## When to use Auto ML

Automated ML democratizes the machine learning model development process, and empowers its users, no matter their data science expertise, to identify an end-to-end machine learning pipeline for any problem.
Data scientists, analysts and developers across industries can use automated ML to:

- Implement machine learning solutions without extensive programming knowledge
- Save time and resources
- Leverage data science best practices
- Provide agile problem-solving

Automated machine learning picks an algorithm and hyperparameters for you and generates a model that is ready for deployment, should you choose to deploy it. It then creates another model with a different algorithm and different hyperparameters, trying to improve on the specified metric.

Much like before, we first import the required packages and then establish the connection to our Azure ML workspace.

In [None]:
import logging

from matplotlib import pyplot as plt
import pandas as pd
import os

import azureml.core
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from azureml.train.automl import AutoMLConfig

## Connecting to workspace

By now you are probably familiar with this first step of establishing a connection to our Azure ML workspace.

In [None]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'german_credit_data_automl'

experiment=Experiment(ws, experiment_name)

output = {}
output['Subscription ID'] = ws.subscription_id
output['Workspace'] = ws.name
output['Resource Group'] = ws.resource_group
output['Location'] = ws.location
output['Experiment Name'] = experiment.name
pd.set_option('display.max_colwidth', None)  # None disables truncation in pandas 1.0+ (-1 is deprecated)
outputDf = pd.DataFrame(data = output, index = [''])
outputDf.T

## Remote compute target

Previous tutorials ran inside your Compute Instance. In this tutorial, you will train a machine learning model on **remote** compute resources, using the training and deployment workflow for Azure Machine Learning in a Python Jupyter notebook. You can later use the notebook as a template to train your own machine learning model with your own data.

The selected VM size "Standard DS11 v2" features 2 vCPUs and 14 GB of RAM, which is sufficient for our purposes. You may select larger VMs as well as a higher node count to build a larger cluster. Note that VMs with GPUs are also available.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                           min_nodes = 0, max_nodes=2)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

Next, we are going to retrieve our dataset. Here the fact that the dataset is already centrally registered comes in handy, as the compute target will also be able to access it without any additional work.

In [None]:
dataset = Dataset.get_by_name(ws, name='german_credit_dataset')

training_data, validation_data = dataset.random_split(percentage=0.8, seed=23)
label_column_name = 'Risk'
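
To make the role of `percentage` and `seed` in the split above more concrete, here is a minimal sketch in plain Python. The helper `seeded_split` and the ten integer records are hypothetical; Azure ML's own `random_split` is documented as splitting approximately by the given percentage, so its row counts need not match this exact cut.

```python
import random

def seeded_split(rows, percentage, seed):
    """Shuffle a copy of rows with a fixed seed, then cut at `percentage`."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed, same order on every run
    cut = int(len(shuffled) * percentage)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical records standing in for dataset rows.
records = list(range(10))
train_rows, valid_rows = seeded_split(records, percentage=0.8, seed=23)
print(len(train_rows), len(valid_rows))
```

Fixing the seed is what makes the split reproducible: calling the helper again with the same arguments returns the same two lists.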

Now it is time to define exactly what Auto ML should do.

- *experiment_timeout_hours* determines how long the Auto ML job is allowed to run at most. Once this amount of time is exceeded, the experiment will stop regardless of current results. 
- *primary_metric* specifies which model metric to optimize for. Refer to [this list](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-understand-automated-ml#classification) to view all possible metrics.
- *featurization* allows Auto ML to first pre-process the data before fitting a model. More information about the types of pre-processing that can be applied may be viewed [here](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-automated-ml#preprocessing) and can include missing value imputation.
- *n_cross_validations* tells the Auto ML engine to run K-fold cross-validation, where K is the value you pass. Alternatively, it's also possible to use Monte Carlo cross-validation or a custom validation (e.g. for imbalanced datasets).
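
As a rough illustration of what K-fold cross-validation means for the rows of a dataset, the sketch below generates the train and validation indices for each fold. The helper `k_fold_indices` is hypothetical and uses plain contiguous folds; it says nothing about how Auto ML shuffles or stratifies its folds internally.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        yield train_idx, valid_idx
        start = stop

# With K = 3, every row is used for validation exactly once.
folds = list(k_fold_indices(n_rows=10, k=3))
for train_idx, valid_idx in folds:
    print(len(train_idx), "training rows, validation rows:", valid_idx)
```

Each model is trained K times, once per fold, and the metric is averaged over the K validation parts.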

The [documentation on the AutoMLConfig class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=experimental) lists many other arguments that can be passed to define the Auto ML run. A particularly useful one is *experiment_exit_score*, which serves as an exit criterion and terminates the experiment once the primary metric reaches the specified score.
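
As a sketch of how such an exit criterion could sit next to the time limit, the settings dictionary below adds *experiment_exit_score* to the kind of settings used in this lab. The variable name `automl_settings_with_exit` and the threshold of 0.90 are assumptions chosen only for illustration; pick a value that makes sense for your primary metric.

```python
import logging

# Settings in the style of this lab, plus an exit score (assumed threshold).
automl_settings_with_exit = {
    "n_cross_validations": 3,
    "primary_metric": "average_precision_score_weighted",
    "featurization": "auto",
    "experiment_timeout_hours": 0.25,  # hard time limit: a quarter of an hour
    "experiment_exit_score": 0.90,     # hypothetical target for the primary metric
    "verbosity": logging.INFO,
}

# Convert the time limit to minutes to make the stopping rule explicit.
timeout_minutes = automl_settings_with_exit["experiment_timeout_hours"] * 60
print(
    f"Stops after {timeout_minutes:.0f} minutes or once the primary metric "
    f"reaches {automl_settings_with_exit['experiment_exit_score']}"
)
```

Such a dictionary would be unpacked into `AutoMLConfig` with `**`, in the same way as the settings in the next cell.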

In [None]:
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": 'average_precision_score_weighted',
    "featurization": 'auto',
    "enable_early_stopping": False,
    "max_concurrent_iterations": 2, # Limit for testing purposes; increase it to match the cluster size
    "experiment_timeout_hours": 0.25, # Time limit for testing purposes; remove it for real use cases, as it drastically limits the ability to find the best model
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(task = 'classification',
                             debug_log = 'automl_errors.log',
                             compute_target = compute_target,
                             training_data = training_data,
                             label_column_name = label_column_name,
                             **automl_settings
                            )

To submit the Auto ML experiment, run the next code cell. As with the models we trained previously, you may navigate to [Azure ML Studio](https://ml.azure.com) and follow the progress on the *Experiments* page. Look for the experiment with the name you assigned in the code. After clicking the Run ID of the run that is currently executing, you can open the *Models* tab, which shows the same information as the output below but in the web portal.

Note that this will take at least 15 minutes (time specified for *experiment_timeout_hours*) plus some additional time to actually allocate the resources for the compute target in Azure ML as we have previously set the minimum number of nodes to 0.

In [None]:
remote_run = experiment.submit(automl_config, show_output = True)

Once the Auto ML process has completed, we may inspect the individual runs in the portal or using the widget below.

In [None]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()

A model explanation will automatically be created for the best model. This can also be viewed in [Azure ML Studio](https://ml.azure.com). To view it, under *Experiments* click the name of the experiment of your AutoML run (here we used <code>german_credit_data_automl</code> if you have not changed it). In the list of runs, select the *Run ID* of your last run (you may only see one). This will open the details page as shown below where you click on *Models*.

<img src="images/automl-run.jpg" alt="Drawing" style="width: 800px;"/>

All models trained in this run are now displayed. The column *Explained* shows that the explanation dashboard is only available for the best model. Click on the link *View explanation* as shown below.

<img src="images/automl-view-explanation.jpg" alt="Drawing" style="width: 800px;"/>

By default, the explanation will show the **Global Importance** of features in the model. Click on *Summary Importance* to view more detailed information.

<img src="images/automl-explanations.jpg" alt="Drawing" style="width: 800px;"/>


## Disclaimer

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.