# Tutorial #4: Automated ML using Azure Machine Learning SDK
In this tutorial, we explore how to code [AutoML](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml) using [Azure Machine Learning SDK for Python](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/?view=azure-ml-py).
Automated ML picks an algorithm and hyperparameters for you and generates a model ready for deployment.

Note: This notebook was tested with Azure ML SDK version 1.24.0. When running this notebook in Azure, make sure you change the kernel to "Python 3.6 - AzureML".

## Configure workspace

Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [None]:
import azureml.core
from azureml.core.workspace import Workspace
ws = Workspace.from_config()
print("Azure ML SDK Version: ", azureml.core.VERSION)
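
For reference, here is a minimal sketch of what **config.json** usually contains, assuming the common three-key layout (`subscription_id`, `resource_group`, `workspace_name`). The values are placeholders and the helper `missing_config_keys` is hypothetical, not part of the SDK.

```python
import json

# Placeholder values; replace them with your own Azure details.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Assumption: these are the keys the workspace configuration file needs.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def missing_config_keys(cfg):
    """Return the required keys that are absent from a config dictionary."""
    return sorted(REQUIRED_KEYS - set(cfg))


config_text = json.dumps(config, indent=4)
print(config_text)
print("Missing keys:", missing_config_keys(json.loads(config_text)))
```

Checking the keys before calling `Workspace.from_config()` can make a misconfigured file easier to diagnose.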

## Prepare the data

In [None]:
import pandas as pd

# Read dataset.
clean_df = pd.read_csv('training-data.csv').rename(columns={"sales": "department"})

# Map salary into integers
salary_map = {"low": 0, "medium": 1, "high": 2}
clean_df["salary"] = clean_df["salary"].map(salary_map)

# Create dummy variables for department feature
clean_df = pd.get_dummies(clean_df, columns=["department"], drop_first=True)

print(clean_df.info())

# Note: the "left" column is the label (1=left the company, 0=stay)
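
To see what the two encoding steps do, the following sketch applies them to a small hypothetical frame (the rows, including the `"very high"` salary, are made up for illustration). It also shows one thing worth checking: values missing from `salary_map` become `NaN` after `.map()`.

```python
import pandas as pd

# Hypothetical rows that mimic the tutorial's columns.
toy_df = pd.DataFrame({
    "salary": ["low", "high", "medium", "very high"],
    "department": ["sales", "technical", "hr", "sales"],
    "left": [1, 0, 0, 1],
})

salary_map = {"low": 0, "medium": 1, "high": 2}
toy_df["salary"] = toy_df["salary"].map(salary_map)

# Values missing from the mapping become NaN, so count them explicitly.
unmapped = int(toy_df["salary"].isna().sum())
print("Unmapped salary values:", unmapped)

# drop_first=True drops the first category in sorted order ("hr" here),
# leaving one indicator column for each remaining department.
encoded = pd.get_dummies(toy_df, columns=["department"], drop_first=True)
print(list(encoded.columns))
```

On the real data, a nonzero unmapped count would suggest that `salary_map` needs another entry.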

## Configure Automated Machine Learning

The automated model training involves the following steps:

1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.


2. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.


The studio automated ML (web interface) uses a remote compute target for model training. But when you use the Python SDK, you can choose either a local compute or a remote compute target:

- Local compute: Training occurs on your local laptop or VM compute.
- Remote compute: Training occurs on Machine Learning compute clusters.

See [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#choose-compute-target) for pros and cons of local vs remote compute.

Automated ML settings are classified as:
- [Experiment settings](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#experiment-settings)
- [Model settings](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#model-settings)
- [Run control settings](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#run-control-settings)

You set these settings by instantiating an [`AutoMLConfig`](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig.automlconfig?view=azure-ml-py) object and specifying values for its parameters.

|Parameter| Value |Description|
|----|----|---|
|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.|
|**iterations**|6|The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.|                         
|**experiment_timeout_hours**|0.3|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
|**enable_early_stopping**|True|Flag to enable early termination if the score is not improving in the short term.|
|**primary_metric**| AUC_weighted | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
|**featurization**| auto | By using auto, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
|**verbosity**| logging.INFO | Controls the level of logging.|

See [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#configure-your-experiment-settings) for examples of configuring `AutoMLConfig`.


In [None]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 6,
    "experiment_timeout_hours": 0.3,
    "enable_early_stopping": True,
    "primary_metric": 'AUC_weighted',
    "featurization": 'auto',
    "verbosity": logging.INFO
}
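
As a quick sanity check on these numbers, the sketch below converts the experiment timeout to minutes and compares it with the worst-case total iteration time. It assumes iterations run one at a time, which may not hold for every compute target.

```python
# Only the time-related settings from the dictionary above.
automl_settings = {
    "iteration_timeout_minutes": 10,
    "iterations": 6,
    "experiment_timeout_hours": 0.3,
}

experiment_budget_minutes = automl_settings["experiment_timeout_hours"] * 60

# Assumption: iterations run sequentially and each one uses its full timeout.
worst_case_minutes = (
    automl_settings["iterations"] * automl_settings["iteration_timeout_minutes"]
)

print(f"Experiment budget: {experiment_budget_minutes:.0f} minutes")
print(f"Worst case for all iterations: {worst_case_minutes} minutes")
print("Experiment timeout can end the run first:",
      experiment_budget_minutes < worst_case_minutes)
```

With these values, the experiment-level timeout is the tighter limit, so slow iterations could end the run before all six complete.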

#### Other settings include:
#### 1. Experiment type

The kind of machine learning problem you are solving is specified in the `task` parameter.

The supported `task` types are `classification`, `regression`, and `forecasting`.

When to use which `task` is defined [here](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#when-to-use-automl-classify-regression--forecast).

#### 2. Allowed or blocked models

Automated machine learning tries different algorithms during the automation and tuning process.

By default, the three different `task` parameter values determine the list of algorithms, or models, to apply.

However, you can use the `allowed_models` or `blocked_models` parameters to include or exclude algorithms.

See supported models by task type [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#supported-models).

#### 3. Automatic data featurization

The following table shows the accepted settings for featurization in the AutoMLConfig class:

|Featurization configuration|Description|
|----|----|
|"featurization": 'auto'|Specifies that, as part of preprocessing, [data guardrails](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#data-guardrails) and<br>[featurization steps](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#featurization) are to be done automatically.<br>This setting is the default.|
|"featurization": 'off'|Specifies that [featurization steps](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#featurization) are not to be done automatically.|
|"featurization": 'FeaturizationConfig'|Specifies that [customized featurization](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features#customize-featurization) steps are to be used.|
                                                                                                                                                                     
Note: [Automated machine learning featurization](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-features) steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same featurization steps applied during training are applied to your input data automatically.
                                         
#### 4. Primary Metric
The primary metric parameter determines the metric to be used during model training for optimization. 

The metrics you can select are determined by the task type you choose.
                                                                                  
See valid primary metrics for each task type [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#primary-metric).

#### 5. K-fold cross-validation

To perform k-fold cross-validation, include the `n_cross_validations` parameter.
                                                    
This parameter sets how many cross validations to perform, based on the same number of folds.

In the following code, five folds for cross-validation are defined: `n_cross_validations=5`

This results in five different trainings, each using 4/5 of the data, with each validation using the remaining 1/5 as a different holdout fold each time. The reported metrics are the average of the five validation metrics.
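
The five-fold procedure can be sketched in plain Python as follows. This is an illustration of the general k-fold idea only: the helper `k_fold_indices` and the per-fold scores are hypothetical, and automated ML may assign rows to folds differently (for example, after shuffling).

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) for each fold."""
    indices = list(range(n_rows))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(n_rows=10, n_folds=5))
for train, validation in folds:
    print("train:", train, "validation:", validation)

# Hypothetical per-fold scores; the reported metric is their mean.
fold_scores = [0.91, 0.93, 0.90, 0.94, 0.92]
mean_score = sum(fold_scores) / len(fold_scores)
print("Mean validation score:", round(mean_score, 2))
```

Every row appears in exactly one validation fold, which is what distinguishes k-fold from repeated random splitting.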

Note: If you do not explicitly specify either a `validation_data` or `n_cross_validations` parameter, automated ML applies default techniques depending on the number of rows provided in the single dataset `training_data`:

|Training data size|Validation technique|
|----|----|
|Larger than 20,000 rows|Train/validation data split is applied.<br><br>The default is to take 10% of the initial training data set as the<br>validation set. In turn, that validation set is used for metrics calculation.|
|Smaller than 20,000 rows|Cross-validation approach is applied.<br><br>The default number of folds depends on the number of rows.<br><br>If the dataset has fewer than 1,000 rows, 10 folds are used.<br>If the dataset has between 1,000 and 20,000 rows, three folds are used.|
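
The table can be read as a simple decision rule. The function below is a hypothetical helper that restates it, not an SDK call. The table does not say what happens at exactly 20,000 rows, so treating that case as cross-validation is an assumption.

```python
def default_validation_technique(n_rows):
    """Describe the default validation approach for a given row count."""
    if n_rows > 20_000:
        return "train/validation split (10% held out)"
    if n_rows < 1_000:
        return "cross-validation with 10 folds"
    # Assumption: exactly 20,000 rows is treated like the 1,000-20,000 range.
    return "cross-validation with 3 folds"


for rows in (500, 15_000, 50_000):
    print(rows, "rows ->", default_validation_technique(rows))
```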

Note: The `n_cross_validations` parameter is not supported in classification scenarios that use deep neural networks.

#### 6. Ensemble configuration

[Ensemble models](https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#ensemble) are enabled by default, and appear as the final run iterations in an AutoML run.

Currently VotingEnsemble and StackEnsemble are supported.

Ensemble training can be disabled by using the `enable_voting_ensemble` and `enable_stack_ensemble` boolean parameters.
                  
See [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train#ensemble-configuration) for configuration.
                                                                                                                               
                                                                                                                               
#### 7. Exit criteria

Define the exit criteria in your AutoMLConfig to end your experiment.

|Criteria|Description|
|----|----|
|No criteria|If you do not define any exit parameters the experiment continues<br>until no further progress is made on your primary metric.|
|After a length of time|Use `experiment_timeout_minutes` in your settings<br>to define how long, in minutes, your experiment should continue to run.<br><br>To help avoid experiment time out failures, there is a minimum of<br>15 minutes, or 60 minutes if your row by column size exceeds 10 million.|
|A score has been reached|Use `experiment_exit_score` to complete the experiment<br>after a specified primary metric score has been reached.|
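
The minimum timeout rule in the table can be checked with a few lines of arithmetic. The helper name and the example dataset shape below are hypothetical.

```python
def minimum_experiment_timeout_minutes(n_rows, n_columns):
    """Return the minimum timeout implied by the exit criteria table."""
    if n_rows * n_columns > 10_000_000:
        return 60
    return 15


# Hypothetical dataset shape, for illustration only.
n_rows, n_columns = 15_000, 20
minimum = minimum_experiment_timeout_minutes(n_rows, n_columns)
requested_minutes = 18  # for example, 0.3 hours

print("Minimum timeout:", minimum, "minutes")
print("Requested timeout is long enough:", requested_minutes >= minimum)
```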

                                                                                                                             

#### 8. Provide validation data or configure a validation data size

You can either start with a single data file and split it into training and validation datasets, or provide a separate data file for the validation set.

The `validation_data` parameter specifies which data to use as your validation set. It accepts an Azure Machine Learning dataset or a pandas dataframe.

Note: Use your previously defined training settings as a `**kwargs` parameter to an `AutoMLConfig` object.

In [None]:
from azureml.train.automl import AutoMLConfig
from sklearn.model_selection import train_test_split

#Note the split here does not separate the label column "left".
x_train, x_test = train_test_split(clean_df, test_size=0.2, random_state=223)

automl_config = AutoMLConfig(task='classification',
                             training_data=x_train,
                             validation_data=x_test, # Cannot specify n_cross_validations when validation_data is specified
                             label_column_name="left", # Indicate the label column here
                             blocked_models=['RandomForest'],
                             enable_voting_ensemble=False,
                             enable_stack_ensemble=False,
                             debug_log='automl_errors.log',
                             **automl_settings)

Alternatively, you can set the `validation_size` parameter to hold out a portion of the training data for validation. This means that the validation set will be split by automated ML from the initial `training_data` provided.

This value should be between `0.0` and `1.0` non-inclusive (for example, 0.2 means 20% of the data is held out for validation data).

Note: The `validation_size` parameter is not supported in forecasting scenarios.

Note: To perform [Monte Carlo cross-validation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-configure-cross-validation-data-splits#monte-carlo-cross-validation), you must specify both the `validation_size` and `n_cross_validations` parameters.

    automl_config = AutoMLConfig(task='classification',
                                 training_data=clean_df,
                                 validation_size=0.2,
                                 n_cross_validations=5,
                                 label_column_name="left",
                                 ...)
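
To illustrate what Monte Carlo cross-validation means in general, the sketch below draws five independent random splits, each holding out 20% of the rows. The helper `monte_carlo_splits` is hypothetical and does not reproduce automated ML's internal splitting logic.

```python
import random


def monte_carlo_splits(n_rows, validation_size, n_splits, seed=0):
    """Return a list of (train_indices, validation_indices) random splits."""
    rng = random.Random(seed)
    n_validation = int(round(n_rows * validation_size))
    splits = []
    for _ in range(n_splits):
        shuffled = list(range(n_rows))
        rng.shuffle(shuffled)
        validation = sorted(shuffled[:n_validation])
        train = sorted(shuffled[n_validation:])
        splits.append((train, validation))
    return splits


splits = monte_carlo_splits(n_rows=10, validation_size=0.2, n_splits=5)
for train, validation in splits:
    print("validation rows:", validation)
```

Unlike k-fold cross-validation, the validation sets here are drawn independently, so the same row can be held out in more than one split.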

## Train the model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs.

Pass the defined `automl_config` object to the experiment, and set `show_output` to `True` to view progress during the run.

After starting the experiment, the output shown updates live as the experiment runs.

For each iteration, it shows the pipeline summary, run duration, and the primary metric score. The `BEST` field tracks the best metric score.

In [None]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "predict-employee-retention-automl-sdk")
local_run = experiment.submit(automl_config, show_output=True)
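
The `BEST` field behaves like a running best over the iteration scores. A minimal sketch, assuming a metric where higher is better (as with `AUC_weighted`) and using made-up scores:

```python
def running_best(scores):
    """Return the best score seen so far after each iteration."""
    best_so_far = []
    current = float("-inf")
    for score in scores:
        current = max(current, score)
        best_so_far.append(current)
    return best_so_far


# Hypothetical primary metric scores for six iterations.
iteration_scores = [0.78, 0.81, 0.80, 0.84, 0.83, 0.82]
best_scores = running_best(iteration_scores)

for i, (score, best) in enumerate(zip(iteration_scores, best_scores)):
    print(f"iteration {i}: score={score:.2f} best={best:.2f}")
```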

## Explore the results

Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py).

The widget lets you see a graph and table of all individual run iterations and their metric scores after the run has completed.

In [None]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

### (optional) Monitor automated machine learning runs

To access the charts from a previous automated ML run, replace the `run_id` in the code below:

In [None]:
from azureml.widgets import RunDetails
from azureml.core.run import Run
from azureml.core.experiment import Experiment

run_id = 'AutoML_5cb4b0d0-99f2-4951-bf12-15d70fc83843'
experiment = Experiment(ws,"predict-employee-retention-automl-sdk")
run = Run(experiment, run_id)
RunDetails(run).show()

### Retrieve the best model

To select the best model from the iterations, use `get_output` to get the best run and the fitted model.

In [None]:
from pprint import pprint

def print_model(model, prefix=""):
    for step in model.steps:
        print(prefix + step[0])
        if hasattr(step[1], 'estimators') and hasattr(step[1], 'weights'):
            pprint({'estimators': list(e[0] for e in step[1].estimators), 'weights': step[1].weights})
            print()
            for estimator in step[1].estimators:
                print_model(estimator[1], estimator[0]+ ' - ')
        elif hasattr(step[1], '_base_learners') and hasattr(step[1], '_meta_learner'):
            print("\nMeta Learner")
            pprint(step[1]._meta_learner)
            print()
            for estimator in step[1]._base_learners:
                print_model(estimator[1], estimator[0]+ ' - ')
        else:
            pprint(step[1].get_params())
            print()   

In [None]:
best_run, fitted_model = local_run.get_output()

print('BEST RUN:')
print(best_run)
print('\nFITTED MODEL:')
#print(fitted_model)
print_model(fitted_model)
print('\nFITTED MODEL STEPS:')
print(fitted_model.steps)
print('\nMETRICS:')
metrics = best_run.get_metrics()
for metric_name in metrics:
    print(metric_name, ":", metrics[metric_name])

### Test the best model accuracy

Use the best model to run predictions on the test data set (i.e. to predict whether an employee will leave the company).

Print the first 10 predictions.

In [None]:
y_test = x_test.pop("left")

In [None]:
#print(y_test.head())
#print(x_test.head())

In [None]:
y_predict = fitted_model.predict(x_test)
print(y_predict[:10])
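
The cell above prints predictions but does not score them. The sketch below shows one way to compute accuracy and confusion counts in plain Python. The label lists are hypothetical stand-ins, and in the notebook you could pass `y_test` and `y_predict` instead (or use `sklearn.metrics.accuracy_score`).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary label."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for t, p in zip(y_true, y_pred):
        if p == positive:
            counts["tp" if t == positive else "fp"] += 1
        else:
            counts["fn" if t == positive else "tn"] += 1
    return counts


# Hypothetical labels: 1 = left the company, 0 = stayed.
y_true_example = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_pred_example = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]

print("Accuracy:", accuracy(y_true_example, y_pred_example))
print("Confusion counts:", confusion_counts(y_true_example, y_pred_example))
```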

### Retrieve auto engineered features

The `get_engineered_feature_names()` method returns a list of engineered feature names.

Note: Use `timeseriestransformer` for `task='forecasting'`; otherwise, use `datatransformer` for a `regression` or `classification` task.

In [None]:
fitted_model.named_steps['datatransformer'].get_engineered_feature_names()

The `get_featurization_summary()` method returns a featurization summary of all the input features.

In [None]:
fitted_model.named_steps['datatransformer'].get_featurization_summary()

### Download the engineered feature importances from the best run

You can use [`ExplanationClient`](https://docs.microsoft.com/en-us/python/api/azureml-interpret/azureml.interpret.explanationclient?view=azure-ml-py) to download the engineered feature explanations from the artifact store of the `best_run`.

In [None]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)  # raw=False requests the engineered-feature explanation
feature_importances = engineered_explanations.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)

### Download the raw feature importances from the best run

You can use `ExplanationClient` to download the raw feature explanations from the artifact store of the `best_run`.

In [None]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
raw_explanations = client.download_model_explanation(raw=True)  # raw=True requests the raw-feature explanation
feature_importances = raw_explanations.get_feature_importance_dict() 

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)
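
When there are many features, it can help to rank them before printing. The sketch below sorts an importance dictionary explicitly rather than relying on its order. The feature names and values are hypothetical, and `top_features` is an illustrative helper, not an SDK function.

```python
def top_features(importances, k=3):
    """Return the k (feature, importance) pairs with the largest magnitude."""
    ranked = sorted(importances.items(),
                    key=lambda item: abs(item[1]),
                    reverse=True)
    return ranked[:k]


# Hypothetical importance values, for illustration only.
example_importances = {
    "salary": 0.05,
    "feature_a": 0.42,
    "feature_b": 0.18,
    "department_sales": 0.01,
}

print("Feature\tImportance")
for name, value in top_features(example_importances, k=2):
    print(name, "\t", value)
```

In the notebook, the same helper could be applied to the `feature_importances` dictionary from the cell above.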

## Model explanations in automated ML (Preview)

### Set up the model explanations

Use [`automl_setup_model_explanations`](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-runtime/azureml.train.automl.runtime.automl_explain_utilities?view=azure-ml-py#automl-setup-model-explanations-fitted-model--typing-union-sklearn-pipeline-pipeline--azureml-automl-runtime-streaming-pipeline-wrapper-streamingpipelinewrapper---task--str--x--typing-union-numpy-ndarray--pandas-core-frame-dataframe--scipy-sparse-base-spmatrix--azureml-dataprep-api-dataflow-dataflow--azureml-data-tabular-dataset-tabulardataset--nonetype----none--x-test--typing-union-numpy-ndarray--pandas-core-frame-dataframe--scipy-sparse-base-spmatrix--azureml-dataprep-api-dataflow-dataflow--azureml-data-tabular-dataset-tabulardataset--nonetype----none--y--typing-union-numpy-ndarray--pandas-core-series-series--pandas-core-arrays-categorical-categorical--azureml-dataprep-api-dataflow-dataflow--azureml-data-tabular-dataset-tabulardataset--nonetype----none--y-test--typing-union-numpy-ndarray--pandas-core-series-series--pandas-core-arrays-categorical-categorical--azureml-dataprep-api-dataflow-dataflow--azureml-data-tabular-dataset-tabulardataset--nonetype----none--features--typing-union-typing-list-str---nonetype----none--automl-run--typing-union-azureml-core-run-run--nonetype----none----kwargs--typing-any-----azureml-train-automl-runtime-automl-explain-utilities-automlexplainersetupclass) to get the engineered and raw explanations.
                                        
The `fitted_model` can generate the following items:
- Featurized data from training or test samples
- Engineered feature name lists
- Findable classes in your label column in classification scenarios

`automl_setup_model_explanations` returns an [`AutoMLExplainerSetupClass`](https://docs.microsoft.com/en-us/python/api/azureml-train-automl-runtime/azureml.train.automl.runtime.automl_explain_utilities.automlexplainersetupclass?view=azure-ml-py) object that contains all the structures from the list above.

In [None]:
y_train = x_train.pop("left")

In [None]:
#print(y_train.head())
#print(x_train.head())

In [None]:
from azureml.train.automl.runtime.automl_explain_utilities import automl_setup_model_explanations

automl_explainer_setup_obj = automl_setup_model_explanations(fitted_model,
                                                             X=x_train, 
                                                             X_test=x_test,
                                                             y=y_train, 
                                                             task='classification')

### Initialize the Mimic Explainer for feature importance

To generate an explanation for automated ML models, one way is to use the [`MimicWrapper`](https://docs.microsoft.com/en-us/python/api/azureml-interpret/azureml.interpret.mimic_wrapper.mimicwrapper?view=azure-ml-py) class. You can initialize the `MimicWrapper` with these parameters:
- The explainer setup object
- Your workspace
- A surrogate model to explain the `fitted_model` automated ML model

The `MimicWrapper` also takes the `best_run` object where the engineered explanations will be uploaded.

In [None]:
from azureml.interpret import MimicWrapper

# Initialize the Mimic Explainer to explain transformed features.
explainer = MimicWrapper(ws, 
                         automl_explainer_setup_obj.automl_estimator,
                         explainable_model=automl_explainer_setup_obj.surrogate_model, 
                         init_dataset=automl_explainer_setup_obj.X_transform,
                         run=best_run,
                         features=automl_explainer_setup_obj.engineered_feature_names, 
                         feature_maps=[automl_explainer_setup_obj.feature_map],
                         classes=automl_explainer_setup_obj.classes,
                         explainer_kwargs=automl_explainer_setup_obj.surrogate_model_params
                        )

### Use Mimic Explainer for computing and visualizing engineered feature importance

You can call the `explain()` method in `MimicWrapper` with the transformed test samples to get the feature importance for the generated engineered features.

You can also sign in to Azure Machine Learning studio to view the explanations dashboard visualization of the feature importance values of the generated engineered features by automated ML featurizers.

In [None]:
engineered_explanations = explainer.explain(explanation_types=['local', 'global'],
                                            eval_dataset=automl_explainer_setup_obj.X_test_transform)

feature_importances = engineered_explanations.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)

You can visualize the explanation results with `ExplanationDashboard` from the `interpret-community` package.

In [None]:
#pip install interpret-community[visualization]

In [None]:
from interpret_community.widget import ExplanationDashboard

ExplanationDashboard(engineered_explanations, 
                     automl_explainer_setup_obj.automl_estimator, 
                     datasetX=automl_explainer_setup_obj.X_test_transform)

You can also get the feature importance of the raw features by setting `get_raw=True` and specifying `raw_feature_names` to display the feature names.

Note: The dashboard visualization of the raw features can only be viewed in the Machine Learning studio. See [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models#model-explanations-preview) for the steps.

In [None]:
raw_explanations = explainer.explain(explanation_types=['local', 'global'],
                                     eval_dataset=automl_explainer_setup_obj.X_test_transform,
                                     get_raw=True,
                                     raw_feature_names=automl_explainer_setup_obj.raw_feature_names)

feature_importances = raw_explanations.get_feature_importance_dict()

# Overall feature importance
print('Feature\tImportance')
for key, value in feature_importances.items():
    print(key, '\t', value)

### Register model

To register the model from an automated ML run, use the `register_model()` method.

In [None]:
# Illustrate how to use tags.
tags = {}
tags['Best AUC_weighted'] = local_run.get_metrics('AUC_weighted').get('AUC_weighted')
print(tags)

# Give a model name and description.
model_name = 'predict-employee-retention-automl-model-' + best_run.properties['model_name']
description = 'Predict employee retention best model from auto ML run.'
model = local_run.register_model(model_name=model_name,
                                 description=description,
                                 tags=tags)
print(model)

### Deploy your model without writing code

1. Once the model is registered, you can find it in the studio by selecting Models on the left pane.


2. Open the model and select the Deploy button at the top of the screen.


See [here](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-use-automated-ml-for-ml-models#deploy-your-model) for the deployment steps.