Written by Jun Choi · 2021-08-09 10:22.

# Introduction
In this notebook, you will learn how to set up an AutoML training run with the Azure Machine Learning Python SDK. AutoML automatically picks an algorithm and hyperparameters for you and generates a model ready for deployment. 

# Prerequisites
- [Azure Machine Learning workspace](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace?tabs=python).
- [Azure Machine Learning Python SDK](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py#default-install).

> **WARNING**: Python 3.8+ is not compatible with `automl`. Recommended for this tutorial: Python 3.7.9.
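
A quick guard at the top of the notebook can catch an unsupported interpreter before any packages are installed. The sketch below is a minimal check that assumes the 3.8 cutoff stated in the warning above; `is_automl_compatible` is a hypothetical helper name, not part of the SDK.

```python
import sys

# Cutoff taken from the warning above: Python 3.8 and later are not supported.
FIRST_INCOMPATIBLE_VERSION = (3, 8)


def is_automl_compatible(version_info=None):
    """Return True if the (major, minor, ...) version is below the cutoff."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) < FIRST_INCOMPATIBLE_VERSION


if not is_automl_compatible():
    print('Python {}.{} may not work with AutoML; this tutorial recommends 3.7.9.'
          .format(*sys.version_info[:2]))
```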

Required package installations:

```
pip install azureml-core==1.32.0
pip install azureml-train-automl==1.32.0
pip install azureml-automl-runtime==1.32.0
pip install onnxruntime==1.8.0
```

Alternatively, you can just run:

```
pip install -r requirements.txt
```

# Data source
This notebook uses the Auto MPG Dataset from [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) and builds a model to predict the fuel efficiency (MPG) of late-1970s and early-1980s automobiles.

# References
- https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train
- https://docs.microsoft.com/en-us/azure/machine-learning/tutorial-auto-train-models
- https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/classification-bank-marketing-all-features/auto-ml-classification-bank-marketing-all-features.ipynb

# Preprocess dataset
The following code downloads the Auto MPG Dataset and loads it with pandas. It drops rows that contain null values and maps the numeric "Origin" codes to country names so that the column is treated as categorical. There is no need to one-hot encode categorical columns or scale numerical columns, since AutoML handles both. The data is then split 80/20 into training and test sets, and each set is saved as a CSV file.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split

url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data'
column_names = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
                'Acceleration', 'Model Year', 'Origin']

raw_dataset = pd.read_csv(url, names=column_names,
                          na_values='?', comment='\t',
                          sep=' ', skipinitialspace=True)

dataset = raw_dataset.copy()
dataset = dataset.dropna()
dataset['Origin'] = dataset['Origin'].map({1: 'USA', 2: 'Europe', 3: 'Japan'})
dataset.to_csv('../data/autompg.csv', index=False)

train, test = train_test_split(dataset, random_state=42, test_size=0.2)
train.to_csv('../data/autompg_train.csv', index=False)
test.to_csv('../data/autompg_test.csv', index=False)
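
The `read_csv` arguments above reflect the layout of the raw file: numeric fields separated by runs of spaces, then a tab followed by the quoted car name, with `?` standing in for a missing value. The following pure-Python sketch parses two sample lines written in that layout to show what `sep=' '`, `skipinitialspace=True`, `comment='\t'`, and `na_values='?'` accomplish. `parse_line` is a hypothetical helper for illustration only, and unlike pandas it converts every numeric field to `float`.

```python
COLUMNS = ['MPG', 'Cylinders', 'Displacement', 'Horsepower', 'Weight',
           'Acceleration', 'Model Year', 'Origin']
ORIGIN_NAMES = {1: 'USA', 2: 'Europe', 3: 'Japan'}

# Sample lines in the raw file's layout; the second has a missing horsepower.
SAMPLE_LINES = [
    '18.0   8   307.0      130.0      3504.      12.0   70  1\t"chevrolet chevelle malibu"',
    '25.0   4   98.00      ?          2046.      19.0   71  1\t"ford pinto"',
]


def parse_line(line):
    """Parse one raw line into a dict, or return None if a value is missing."""
    # Everything after the first tab is the car name, which read_csv
    # discards because the tab is declared as the comment character.
    numeric_part = line.split('\t', 1)[0]
    # split() with no argument collapses runs of spaces between fields.
    fields = numeric_part.split()
    # '?' marks a missing value; skipping the row mirrors dropna().
    if len(fields) != len(COLUMNS) or '?' in fields:
        return None
    row = {name: float(value) for name, value in zip(COLUMNS, fields)}
    # Replace the numeric origin code with a readable category label.
    row['Origin'] = ORIGIN_NAMES[int(row['Origin'])]
    return row


rows = [parse_line(line) for line in SAMPLE_LINES]
print(rows[0]['MPG'], rows[0]['Origin'])
print(rows[1])
```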

# Initialize workspace and datastore

The following code creates a `Workspace` object from a `config.json` file, which you can download from the Azure Machine Learning studio website (the URL begins with https://ml.azure.com/). Place the downloaded `config.json` file in the same directory as this notebook.

![config.json](../assets/config_json.png)
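
Before calling `Workspace.from_config()`, it can help to confirm that the downloaded file has the expected shape. The sketch below assumes the usual `config.json` layout with `subscription_id`, `resource_group`, and `workspace_name` keys. `missing_config_keys` is a hypothetical helper, and the placeholder values are not real identifiers.

```python
import json
import os
import tempfile

# Keys assumed to be present in a workspace config.json file.
REQUIRED_KEYS = ('subscription_id', 'resource_group', 'workspace_name')


def missing_config_keys(path):
    """Return the required keys that are absent or empty in the JSON file."""
    with open(path) as f:
        config = json.load(f)
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Demonstrate on a throwaway file filled with placeholder values.
example = {
    'subscription_id': '<subscription-id>',
    'resource_group': '<resource-group>',
    'workspace_name': '<workspace-name>',
}
with tempfile.TemporaryDirectory() as tmp_dir:
    example_path = os.path.join(tmp_dir, 'config.json')
    with open(example_path, 'w') as f:
        json.dump(example, f)
    missing = missing_config_keys(example_path)

print('Missing keys:', missing)
```

In the notebook itself, you would call `missing_config_keys('config.json')` on the downloaded file instead of the temporary one.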

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
datastore = ws.get_default_datastore()

# Configure compute cluster

[Learn more about compute targets and clusters in Azure Machine Learning.](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target)

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.compute_target import ComputeTargetException

cluster_name = 'AMLCluster'

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing cluster, using it.')
except ComputeTargetException:
    print('Cluster not found, creating new 6 node STANDARD_D2_V2.')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='STANDARD_D2_V2', max_nodes=6
    )
    compute_target = ComputeTarget.create(
        workspace=ws,
        name=cluster_name,
        provisioning_configuration=compute_config,
    )
compute_target.wait_for_completion(show_output=True)

# Configure the training run

The runtime is set by creating and configuring a `RunConfiguration` object. Here we set the compute target.

In [None]:
from azureml.core.runconfig import RunConfiguration

aml_run_config = RunConfiguration()
aml_run_config.target = compute_target

# Upload training data to the datastore and register the dataset

Automated machine learning supports data that resides on your local desktop or in the cloud, such as in Azure Blob Storage. The data can be read into a pandas DataFrame or an Azure Machine Learning `TabularDataset`.

Requirements for training data in machine learning:
- Data must be in tabular form.
- The value to predict (the target column) must be in the data.

For remote experiments, training data must be accessible from the remote compute. ***Automated ML only accepts Azure Machine Learning `TabularDatasets` when working on a remote compute.***
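
The target-column requirement is easy to check locally before uploading. The following sketch reads only the header row of a CSV file and confirms that the target column is present. `has_target_column` is a hypothetical helper, and the inline sample stands in for `autompg_train.csv`.

```python
import csv
import io


def has_target_column(csv_file, target):
    """Return True if the CSV header read from csv_file contains target."""
    # Read just the first row; an empty file yields an empty header.
    header = next(csv.reader(csv_file), [])
    return target in header


# Inline sample with the same header as the training CSV.
sample = io.StringIO(
    'MPG,Cylinders,Displacement,Horsepower,Weight,Acceleration,Model Year,Origin\n'
    '18.0,8,307.0,130.0,3504.0,12.0,70,USA\n'
)
print(has_target_column(sample, 'MPG'))
```

With a real file, pass an open handle instead, for example `open('../data/autompg_train.csv', newline='')`.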

In [None]:
from azureml.core import Dataset

file_path = '../data/autompg_train.csv'
datastore.upload_files(
    files=[file_path],
    target_path='workspaceblobstore',
    overwrite=True,
    show_progress=True
)


datastore_path = 'workspaceblobstore/autompg_train.csv'
ds_path = [(datastore, datastore_path)]
ds_tab = Dataset.Tabular.from_delimited_files(path=ds_path)
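try:
    ds_tab.register(workspace=ws, name='autompg_train', description=None)
    print('Registered data successfully.')
except Exception as e:
    # Report why registration failed instead of silently swallowing the error.
    print('Data not registered: {}'.format(e))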

# Configure experiment settings

There are several options that you can use to configure your automated ML experiment. These parameters are set by instantiating an `AutoMLConfig` object. See the [`AutoMLConfig` class](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py) for a full list of parameters.

In [None]:
from azureml.train.automl import AutoMLConfig

automl_settings = {
    'experiment_timeout_hours': 0.5,
    'iterations': 5,
    'label_column_name': 'MPG',
    'iteration_timeout_minutes': 5,
    'max_concurrent_iterations': 4,
    'max_cores_per_iteration': -1,
    'primary_metric': 'normalized_root_mean_squared_error',
    'track_child_runs': True,
}

automl_config = AutoMLConfig(
    task='regression',
    compute_target=compute_target,
    featurization='auto',
    training_data=ds_tab,
    enable_onnx_compatible_models=True,
    **automl_settings
)
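
The `primary_metric` chosen above, `normalized_root_mean_squared_error`, is the root mean squared error scaled so that targets measured in different units are comparable. The sketch below assumes normalization by the range of the true values, which is how the Azure documentation describes this metric. It is a plain-Python illustration rather than the SDK's implementation, and `nrmse` is a hypothetical name.

```python
import math


def nrmse(y_true, y_pred):
    """Root mean squared error divided by the range of the true values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must be non-empty and equal in length')
    value_range = max(y_true) - min(y_true)
    if value_range == 0:
        raise ValueError('y_true must contain at least two distinct values')
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse) / value_range


# Hypothetical MPG values, not taken from the dataset.
print(nrmse([10.0, 20.0, 30.0], [12.0, 18.0, 30.0]))
```

Lower values are better, and a score of 0 means the predictions match the true values exactly.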

# Run experiment

For AutoML, you create an `Experiment` object, which is a named object in a `Workspace` used to run experiments.

In [None]:
from azureml.core import Experiment

experiment_name = 'automl-tutorial'
experiment = Experiment(ws, experiment_name)
run = experiment.submit(automl_config, show_output=False)
run.wait_for_completion()

# Save the best ONNX model and ONNX resource json file

The following code retrieves the best model from the completed AutoML run, saves it as an ONNX file, and downloads the `onnx_resource.json` file that `OnnxInferenceHelper` needs for inference.

In [None]:
from azureml.automl.runtime.onnx_convert import OnnxConverter
from azureml.train.automl import constants

best_run, best_model = run.get_output(return_onnx_model=True)

onnx_path = '../data/best_model.onnx'
OnnxConverter.save_onnx_model(best_model, onnx_path)

output_file_path = '../data/onnx_resource.json'
best_run.download_file(constants.MODEL_RESOURCE_PATH_ONNX, output_file_path=output_file_path)

# Predict with the ONNX model, using onnxruntime package

We will calculate the mean absolute error (MAE) and the mean squared error (MSE) to evaluate the model. Both are common loss functions for regression problems. MAE is less sensitive to outliers than MSE because it does not square the errors.
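
To make the difference between the two metrics concrete, the following sketch computes both in plain Python on a small set of hypothetical MPG values (not taken from the dataset), first with uniformly small errors and then with one large miss. The helper names `mae` and `mse` are illustrative; the notebook itself uses the scikit-learn functions.

```python
def mae(y_true, y_pred):
    """Mean of the absolute errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean of the squared errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [20.0, 25.0, 30.0, 35.0]
close_pred = [21.0, 24.0, 31.0, 34.0]    # every prediction is off by 1
outlier_pred = [21.0, 24.0, 31.0, 45.0]  # the last prediction is off by 10

print('close:   MAE={:.2f}  MSE={:.2f}'.format(mae(y_true, close_pred), mse(y_true, close_pred)))
print('outlier: MAE={:.2f}  MSE={:.2f}'.format(mae(y_true, outlier_pred), mse(y_true, outlier_pred)))
```

The single large miss raises MAE from 1.00 to 3.25 but raises MSE from 1.00 to 25.75, because squaring amplifies the largest error.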

In [None]:
import sys
import json
import onnx

from azureml.automl.core.onnx_convert.onnx_convert_constants import OnnxConvertConstants
from azureml.automl.runtime.onnx_convert import OnnxInferenceHelper
from sklearn.metrics import mean_absolute_error, mean_squared_error

# ONNX inference is only supported below the Python version that AutoML flags as incompatible.
python_version_compatible = sys.version_info < OnnxConvertConstants.OnnxIncompatiblePythonVersion

if python_version_compatible:
    with open(output_file_path) as f:
        onnx_res = json.load(f)
    data = pd.read_csv('../data/autompg_test.csv')
    onnx_model = onnx.load(onnx_path)
    mdl_bytes = onnx_model.SerializeToString()
    onnxrt_helper = OnnxInferenceHelper(mdl_bytes, onnx_res)
    pred_onnx, pred_prob_onnx = onnxrt_helper.predict(data)

    # Evaluate only when predictions exist; otherwise pred_onnx would be undefined.
    y_true = test['MPG']
    print('MAE (Test): {:0.4f}'.format(mean_absolute_error(y_true, pred_onnx)))  # 2.0110
    print('MSE (Test): {:0.4f}'.format(mean_squared_error(y_true, pred_onnx)))  # 7.3687
else:
    print('Skipping ONNX inference: this Python version is not supported.')