# Remote Training Via Azure ML Compute (AML Cluster) and HyperDrive (Hyper-parameter Tuning with Multiple Children Runs) 
_**This notebook showcases the creation of a ScikitLearn Binary classification model by remotely training on Azure ML Compute Target (AMLCompute Cluster).**_

_**It shows multiple ways of remote training, such as using a single Estimator, a ScriptRunConfig, and hyper-parameter tuning with HyperDrive across multiple child runs.**_

## Check library versions
This matters because code runs in two places: the remote compute environment (the cluster) and the instance/VM hosting this Jupyter notebook.
If the library versions differ, you can run into errors when a .pkl file created on the cluster is downloaded and loaded in the notebook.

In [None]:
# Check versions
import azureml.core
import sklearn
import joblib
import pandas

print("Azure SDK version:", azureml.core.VERSION)
print('scikit-learn version is {}.'.format(sklearn.__version__))
print('joblib version is {}.'.format(joblib.__version__))
print('pandas version is {}.'.format(pandas.__version__))
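
As a rough illustration of the check described above, the sketch below compares two sets of package versions and reports any that differ. The helper name `find_version_mismatches` and the version numbers are hypothetical; in practice the "cluster" side would come from the environment definition used for the remote run.

```python
def find_version_mismatches(local_versions, remote_versions):
    """Return {package: (local, remote)} for packages whose versions differ."""
    mismatches = {}
    for package, remote_version in remote_versions.items():
        local_version = local_versions.get(package)
        if local_version != remote_version:
            mismatches[package] = (local_version, remote_version)
    return mismatches


# Hypothetical version numbers, for illustration only.
local_versions = {"scikit-learn": "0.22.2.post1", "joblib": "0.14.1", "pandas": "1.0.0"}
remote_versions = {"scikit-learn": "0.20.3", "joblib": "0.14.1", "pandas": "1.0.0"}

mismatches = find_version_mismatches(local_versions, remote_versions)
for package, (local, remote) in mismatches.items():
    print(f"{package}: notebook has {local}, cluster has {remote}")
```

A mismatch in scikit-learn or joblib is the most likely one to cause trouble when unpickling a model.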

## Setup and connect to AML Workspace

In [None]:
from azureml.core import Workspace

ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')

## Create An Experiment

**Experiment** is a logical container in an Azure ML Workspace. It hosts run records which can include run metrics and output artifacts from your experiments.

In [None]:
from azureml.core import Experiment
experiment_name = 'aml-wrkshp-remote-training-amlcompute'
experiment = Experiment(workspace=ws, name=experiment_name)

## Introduction to AmlCompute

Azure Machine Learning Compute is managed compute infrastructure that allows the user to easily create single to multi-node compute of the appropriate VM Family. It is created **within your workspace region** and is a resource that can be used by other users in your workspace. It autoscales by default to the max_nodes, when a job is submitted, and executes in a containerized environment packaging the dependencies as specified by the user. 

Since it is managed compute, job scheduling and cluster management are handled internally by Azure Machine Learning service. 

For more information on Azure Machine Learning Compute Targets, please read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/concept-compute-target)

**Note**: As with other Azure services, there are limits on certain resources (e.g., AmlCompute quota) associated with the Azure Machine Learning service. Please read [this article](https://docs.microsoft.com/azure/machine-learning/service/how-to-manage-quotas) on the default limits and how to request more quota.

Here is a picture that explains the architecture behind Azure ML remote training:

![](img/aml-run.png)

### Create project directory and copy the training script into the project directory

Create a directory that will contain all the necessary code from your local machine that you will need access to on the remote resource. This includes the training script, and any additional files your training script depends on.

In [None]:
import os
import shutil

os.getcwd()

In [None]:
project_folder = './classif-attrition-amlcompute'
os.makedirs(project_folder, exist_ok=True)

# Copy the training script into the project directory
shutil.copy('train.py', project_folder)
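
If the training script depends on additional files, they need to be copied as well. The sketch below is one possible way to stage several files at once; the helper `stage_project_files` and the file `utils.py` are hypothetical, and a temporary directory is used so the example does not touch your real project folder.

```python
import shutil
import tempfile
from pathlib import Path


def stage_project_files(source_files, project_folder):
    """Copy each source file into project_folder and return the copied file names."""
    target = Path(project_folder)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in source_files:
        shutil.copy(source, target)
        copied.append(Path(source).name)
    return sorted(copied)


# Demonstrate with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    script = Path(workdir) / "train.py"
    helper = Path(workdir) / "utils.py"  # hypothetical helper module
    script.write_text("print('training')\n")
    helper.write_text("# shared helpers\n")

    staged = stage_project_files([script, helper], Path(workdir) / "project")
    print(staged)
```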

### Connect or Create a Remote AML compute cluster

Try to use the compute target you had created before (make sure you provide the same name here in the variable `cpu_cluster_name`).
If not available, create a new cluster from the code.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cluster"

# Reuse the cluster if it already exists; otherwise create a new one
try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS3_v2',
                                                           max_nodes=10)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)
    
# For a more detailed view of current AmlCompute status, use get_status().

### Fetch the AML Dataset

In [None]:
aml_dataset = ws.datasets['IBM-Employee-Attrition']

## Create Environment 

Azure Machine Learning environments are an encapsulation of the environment where your machine learning training happens. They specify the Python packages, environment variables, and software settings around your training and scoring scripts. They also specify run times (Python, Spark, or Docker). The environments are managed and versioned entities within your Machine Learning workspace that enable reproducible, auditable, and portable machine learning workflows across a variety of compute targets.

You can use an Environment object on your local compute to:

* Develop your training script.
* Reuse the same environment on Azure Machine Learning Compute for model training at scale.
* Deploy your model with that same environment.
* Revisit the environment in which an existing model was trained.

Read [this article](https://docs.microsoft.com/en-us/azure/machine-learning/concept-environments) for more details.

#### List all the curated environments and packages in your AML Workspace

In [None]:
from azureml.core import Environment

envs = Environment.list(workspace=ws)

# List the names of the curated environments (their names start with "AzureML")
for env in envs:
    if env.startswith("AzureML"):
        print("Name",env)

In [None]:
# Use curated environment from AML named "AzureML-Tutorial"
curated_environment = Environment.get(workspace=ws, name="AzureML-Tutorial")

# Get environment's details
print("packages", curated_environment.python.conda_dependencies.serialize_to_string())

## Create a Custom Environment (optional)

You can also start building your custom environment using a curated environment as baseline. You have to save curated environment definition files into a folder and then edit them according to your needs.

In [None]:
# Save curated environment definition to folder (Two files, one for conda_dependencies.yml and another file for azureml_environment.json)
curated_environment.save_to_directory(path="./curated_environment_definition", overwrite=True)

# Create custom Environment from Conda specification file
custom_environment = Environment.from_conda_specification(name="custom-workshop-environment", file_path="./curated_environment_definition/conda_dependencies.yml")

# Save the custom environment definition to its own folder (again two files: conda_dependencies.yml and azureml_environment.json)
custom_environment.save_to_directory(path="./custom_environment_definition", overwrite=True)

custom_environment.register(ws)

envs = Environment.list(workspace=ws)

# List Environments and packages in my workspace
for env in envs:
    if env.startswith("custom"):
        print("Environment Name",env)
        print("packages", envs[env].python.conda_dependencies.serialize_to_string())

## (Option A) Configure & Run using ScriptRunConfig & Environment 

### Easiest path using curated environments, but less flexible than Estimator

The executed "run" will be a *ScriptRun* object, which can be used to monitor the asynchronous execution of the run, log metrics and store output of the run, and analyze results and access artifacts generated by the run.

In [None]:
# Add training script to run config
from azureml.core import ScriptRunConfig, RunConfiguration, Experiment

# # First run
# script_runconfig = ScriptRunConfig(
#     source_directory=project_folder,
#     script="train.py",
#     arguments=[aml_dataset.as_named_input('attrition')]
# )



# # Second run
# solver   = 'saga'
# penalty  = 'elasticnet'
# l1_ratio = '0.2'

# Third run
solver   = 'saga'
penalty  = 'elasticnet'
l1_ratio = '0.3'

script_runconfig = ScriptRunConfig(
    source_directory=project_folder,
    script="train.py",
    arguments=[aml_dataset.as_named_input('attrition'), '--solver', solver, '--penalty', penalty, '--l1_ratio', l1_ratio]
)

# Attach compute target to run config
# Use script_runconfig.run_config.target = "local" to execute the run in your current environment
script_runconfig.run_config.target = compute_target


# Attach environment to run config
script_runconfig.run_config.environment = curated_environment
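
The `arguments` list above is passed to `train.py` on the command line. The training script itself is not shown in this notebook, so the following is only a sketch of how such a script might parse these flags with `argparse`. The function name `parse_training_args` is hypothetical, the defaults mirror scikit-learn's `LogisticRegression` defaults, and the value substituted for the dataset input is represented by a placeholder string that the parser simply sets aside.

```python
import argparse


def parse_training_args(argv):
    """Parse the hyperparameter flags; unrecognized arguments are returned separately."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--solver", type=str, default="lbfgs")
    parser.add_argument("--penalty", type=str, default="l2")
    parser.add_argument("--l1_ratio", type=float, default=None)
    parser.add_argument("--C", type=float, default=1.0)
    # parse_known_args tolerates extra values such as the dataset argument
    return parser.parse_known_args(argv)


# "dataset-placeholder" stands in for whatever value the service passes for the dataset.
args, unknown = parse_training_args(
    ["dataset-placeholder", "--solver", "saga", "--penalty", "elasticnet", "--l1_ratio", "0.3"]
)
print(args.solver, args.penalty, args.l1_ratio, unknown)
```

Note that the notebook passes `l1_ratio` as a string; declaring `type=float` in the parser converts it back to a number.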

Let's check what the run_config JSON settings look like:

In [None]:
script_runconfig.run_config

### Let's run the training script on the AML Compute Cluster

In [None]:
# Submit the Experiment Run to the AML Compute
run = experiment.submit(script_runconfig)
run

### Monitor your Run using the Widget

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

#### Get log results upon completion
Model training and monitoring happen in the background. Wait until the model has finished training before you run more code. Use *wait_for_completion* to show when the model training is finished:

In [None]:
run.wait_for_completion(show_output=False)

You can execute this section again with the other setups above (the commented-out first and second runs) in order to log more than one model.


## (Option B.1) Configure an Estimator with specific pkgs versions (using pip and conda)

### Risky! Overriding remote compute Docker image packages with pip and conda might cause issues with inconsistent package versions.

*Estimator* represents a generic estimator to train data using any supplied framework. Compared to a ScriptRunConfig, which only accepts an Environment object, the Estimator can directly receive lists of dependencies as parameters.

This class is designed for use with machine learning frameworks that do not already have an Azure Machine Learning pre-configured estimator. Pre-configured estimators exist for *[Chainer](https://chainer.org/)*, *PyTorch*, *TensorFlow* and *SKLearn*.

The SKLearn pre-configured estimator is used for training in Scikit-learn experiments. When submitting a training job, Azure ML runs your script in a conda environment within a Docker container. SKLearn containers have the following dependencies installed:

* Python (3.6.2)
* azureml-defaults (Latest)
* IntelMpi (2018.3.222)
* scikit-learn (0.20.3)
* numpy (1.16.2)
* miniconda (4.5.11)
* scipy (1.2.1)
* joblib (0.13.2)
* git (2.7.4)

Except for the SKLearn one, estimators support single-node as well as multi-node execution. A distributed training job can be run using, for example, Message Passing Interface (MPI) objects. Obviously, a distributed training job can be run:

* only upon the Compute Target Cluster
* only if the used libraries provide distributed backends.

The distributed training backends supported by Azure ML are:

* Message Passing Interface (MPI)
* NVIDIA Collective Communication Library (NCCL)
* Gloo

An example of Fast.AI distributed training is available [here](https://github.com/nash-lian/Distributed-training-Image-segmentation-Azure-ML).

In [None]:
from azureml.train.estimator import Estimator
from azureml.train.sklearn import SKLearn

script_params = {
    "--solver": 'saga',
    "--penalty": 'elasticnet',
    "--l1_ratio": 0.4
}

pip_packages = [
                'azureml-core==1.17.0', 'azureml-telemetry==1.17.0', 'azureml-dataprep==2.4.2',
                'joblib==0.14.1', 'pandas==1.0.0', 'sklearn-pandas==2.0.2' 
               ]

# Using plain Estimator class
estimator = Estimator(source_directory=project_folder, 
                      script_params=script_params,
                      compute_target=compute_target,
                      entry_script='train.py',
                      pip_packages=pip_packages,
                      conda_packages=['scikit-learn==0.22.2.post1'],
                      inputs=[ws.datasets['IBM-Employee-Attrition'].as_named_input('attrition')])


# # Using SKLearn estimator class
# estimator = SKLearn(source_directory=project_folder, 
#                     script_params=script_params,
#                     compute_target=compute_target,
#                     entry_script='train.py',
#                     pip_packages=pip_packages,
#                     conda_packages=['scikit-learn==0.22.2.post1'],
#                     inputs=[aml_dataset.as_named_input('attrition')])



In [None]:
run = experiment.submit(estimator)
run

In [None]:
from azureml.widgets import RunDetails
RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=False)

## (Option B.2) Configure an Estimator with Environment

### Better! Easier! Consistent!

### Using an Estimator with Curated Environment 

In [None]:
from azureml.train.estimator import Estimator

# # Load Custom Environment from Workspace
# custom_environment = Environment.get(workspace=ws,name="custom-workshop-environment")  # ,version="1"
# print(custom_environment)

script_params = {
    '--solver': 'liblinear',
    '--penalty': 'l2'
}

# Using plain Estimator class with an Environment (curated here; the custom one is the commented-out alternative)
estimator = Estimator(source_directory=project_folder, 
                      script_params=script_params,
                      compute_target=compute_target,
                      entry_script='train.py',
                      environment_definition=curated_environment,
                      #environment_definition=custom_environment,
                      inputs=[ws.datasets['IBM-Employee-Attrition'].as_named_input('attrition')])

In [None]:
run = experiment.submit(estimator)
run

In [None]:
from azureml.widgets import RunDetails

RunDetails(run).show()

In [None]:
run.wait_for_completion(show_output=False)

Note: if you need to cancel a run, you can follow [these instructions](https://aka.ms/aml-docs-cancel-run).

## (Option C) Configure and Run with Intelligent hyperparameter tuning (HyperDrive using Estimator)

IMPORTANT: You need to have created either an *Estimator* or a *ScriptRunConfig* in the previous steps. 

The adjustable parameters that govern the training process are referred to as the **hyperparameters** of the model. The goal of hyperparameter tuning is to search across various hyperparameter configurations and find the configuration that results in the best performance.

To demonstrate how Azure Machine Learning can help you automate the process of hyperparameter tuning, we will launch multiple runs with different hyperparameter values. First let's define the parameter space using random sampling.

### Create a hyperparameter sweep
First, we will define the hyperparameter space to sweep over. 
In this example we will use random sampling to try different configuration sets of hyperparameters to maximize our primary metric, ROC-AUC.
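
To make the idea of random sampling concrete, here is a minimal standard-library sketch that draws configurations from a search space of uniform ranges and discrete choices. It is not how HyperDrive is implemented; the tuple-based space format and the helper `sample_configuration` are assumptions made for illustration.

```python
import random


def sample_configuration(space, rng):
    """Draw one configuration from specs of the form ('uniform', lo, hi) or ('choice', options)."""
    config = {}
    for name, spec in space.items():
        kind = spec[0]
        if kind == "uniform":
            config[name] = rng.uniform(spec[1], spec[2])
        elif kind == "choice":
            config[name] = rng.choice(spec[1])
        else:
            raise ValueError(f"unknown distribution: {kind}")
    return config


# Hypothetical representation of a search space with one continuous and two discrete parameters.
space = {
    "--C": ("uniform", 0.0, 1.0),
    "--solver": ("choice", ["newton-cg", "lbfgs", "sag", "saga"]),
    "--penalty": ("choice", ["none", "l2"]),
}

rng = random.Random(0)  # fixed seed so the sketch is repeatable
configurations = [sample_configuration(space, rng) for _ in range(20)]
print(configurations[0])
```

Each sampled dictionary corresponds to the command-line arguments of one child run.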

In [None]:
from azureml.train.hyperdrive import RandomParameterSampling, BayesianParameterSampling 
from azureml.train.hyperdrive import BanditPolicy, HyperDriveConfig, PrimaryMetricGoal
from azureml.train.hyperdrive import choice, uniform
    
# Values for "solver": {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
# Values for "penalty": {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
# Note that some penalties are not supported by some solvers. For example, 'elasticnet' is only supported by the 'saga' solver.
param_sampling = RandomParameterSampling( {
    "--C": uniform(0.0, 1.0),
    "--solver": choice('newton-cg', 'lbfgs', 'sag', 'saga'),
    "--penalty": choice('none', 'l2')
    }
)

# Details on Scikit-Learn LogisticRegression hyper-parameters:
# https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
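
Because not every solver accepts every penalty, it can help to check a search space before submitting it. The sketch below encodes the solver/penalty combinations documented for `LogisticRegression` in the scikit-learn releases this notebook targets (0.22 and 0.23; later releases changed how the "no penalty" option is spelled). The names `SUPPORTED_PENALTIES` and `is_supported` are hypothetical helpers, not part of scikit-learn or Azure ML.

```python
# Solver -> penalties it accepts, per the scikit-learn 0.22/0.23 LogisticRegression docs.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2", "none"},
    "lbfgs": {"l2", "none"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2", "none"},
    "saga": {"elasticnet", "l1", "l2", "none"},
}


def is_supported(solver, penalty):
    """Return True when the solver accepts the given penalty."""
    return penalty in SUPPORTED_PENALTIES.get(solver, set())


# Every combination in a sweep over these solvers and penalties is valid.
sweep_solvers = ["newton-cg", "lbfgs", "sag", "saga"]
sweep_penalties = ["none", "l2"]
invalid = [
    (solver, penalty)
    for solver in sweep_solvers
    for penalty in sweep_penalties
    if not is_supported(solver, penalty)
]
print("invalid combinations:", invalid)
```

This also suggests why `liblinear` is left out of a sweep that includes `'none'`: it does not support an unpenalized fit.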


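Now we will define an early termination policy. The *BanditPolicy* below is applied every 2 evaluation intervals, that is, every second time the primary metric (defined later) is reported. With `slack_factor=0.1`, Azure ML terminates any run whose primary metric falls outside the allowed slack relative to the best-performing run; for a maximization goal this means roughly below 91% of the best value. This saves us from continuing to explore hyperparameters that don't show promise of helping reach our target metric.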

In [None]:
early_termination_policy = BanditPolicy(evaluation_interval=2, slack_factor=0.1)

# Note that early termination policy is currently NOT supported with Bayesian sampling
# Check here for recommendations on the multiple policies:
# https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters#picking-an-early-termination-policy
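
The slack rule can be expressed in a few lines. This is only a sketch of the comparison described in the BanditPolicy documentation for a maximization goal (a run with metric Y is cancelled when Y + Y * slack_factor is smaller than the best metric so far); the function `should_terminate` and the metric values are hypothetical.

```python
def should_terminate(run_metric, best_metric, slack_factor):
    """Bandit check for a MAXIMIZE goal: cancel when run_metric * (1 + slack_factor) < best_metric."""
    return run_metric * (1 + slack_factor) < best_metric


best_so_far = 0.80   # hypothetical best ROC-AUC reported so far
slack_factor = 0.1

for candidate in (0.72, 0.75):
    decision = "terminate" if should_terminate(candidate, best_so_far, slack_factor) else "keep"
    print(f"run with metric {candidate}: {decision}")
```

With these numbers, a run at 0.72 is cancelled (0.72 * 1.1 is below 0.80) while a run at 0.75 continues.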

Now we are ready to configure the HyperDrive run and specify the primary metric 'ROC-AUC' that is recorded in your training runs. 
If you go back to the training script, you will notice that this value is being logged. 
We also tell the service that we want to maximize this value. 
Finally, we set the maximum total number of runs to 20 and the maximum number of concurrent runs to 4.

In [None]:
# Note that in this case, when using HyperDrive, we are using the script_runconfig configuration,
# and not the original Estimator's parameters. You can only use one of the two configurations.
hyperdrive_config = HyperDriveConfig(
    run_config=script_runconfig, 
    #estimator=estimator,
    
    hyperparameter_sampling=param_sampling, 
    policy=early_termination_policy,
    
    # Here the primary metric is the label of one of logged metrics in the training run
    # So, in order to use HyperDrive you MUST log at least one metric and use it as parameter
    primary_metric_name='ROC-AUC',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4)

Finally, launch the hyperparameter tuning job.

In [None]:
# start the HyperDrive run
hyperdrive_run = experiment.submit(hyperdrive_config)

# Check here how to submit the hyperdrive run as a step of an AML Pipeline:
# https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/machine-learning-pipelines/intro-to-pipelines/aml-pipelines-parameter-tuning-with-hyperdrive.ipynb

In [None]:
from azureml.widgets import RunDetails

RunDetails(hyperdrive_run).show()

In [None]:
hyperdrive_run.wait_for_completion(show_output=False)

#### Let's try now with the Bayesian Parameter Sampling

In [None]:
param_bayes_sampling = BayesianParameterSampling( {
    "--C": uniform(0.0, 1.0),
    "--solver": choice('newton-cg'),
    "--penalty": choice('none', 'l2')
    }
)

hyperdrive_bayes_config = HyperDriveConfig(
    run_config=script_runconfig, 
    #estimator=estimator,
    
    hyperparameter_sampling=param_bayes_sampling, 
    
    # No early termination is allowed when using the bayesian parameter sampling
    #policy=early_termination_policy,
    
    # Here the primary metric is the label of one of logged metrics in the training run
    # So, in order to use HyperDrive you MUST log at least one metric and use it as parameter
    primary_metric_name='ROC-AUC',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=20,
    max_concurrent_runs=4)

hyperdrive_bayes_run = experiment.submit(hyperdrive_bayes_config)

In [None]:
RunDetails(hyperdrive_bayes_run).show()

In [None]:
hyperdrive_bayes_run.wait_for_completion(show_output=False)

#### (Optional) Let's try now with the Bayesian Parameter Sampling for just the ElasticNet penalty (saga solver)

In [None]:
param_elasticnet_bayes_sampling = BayesianParameterSampling( {
    "--C": uniform(0.0, 1.0),
    "--solver": choice('saga'),
    "--penalty": choice('elasticnet'),
    "--l1_ratio": uniform(0.0, 1.0)
    }
)

hyperdrive_elasticnet_bayes_config = HyperDriveConfig(
    run_config=script_runconfig, 
    #estimator=estimator,
    
    hyperparameter_sampling=param_elasticnet_bayes_sampling, 
    
    # No early termination is allowed when using the bayesian parameter sampling
    #policy=early_termination_policy,
    
    # Here the primary metric is the label of one of logged metrics in the training run
    # So, in order to use HyperDrive you MUST log at least one metric and use it as parameter
    primary_metric_name='ROC-AUC',
    primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
    max_total_runs=24,
    max_concurrent_runs=4)

hyperdrive_elasticnet_bayes_run = experiment.submit(hyperdrive_elasticnet_bayes_config)

In [None]:
RunDetails(hyperdrive_elasticnet_bayes_run).show()

In [None]:
hyperdrive_elasticnet_bayes_run.wait_for_completion(show_output=False)

### Find and get the best model found by HyperDrive

When all jobs finish, we can find the one with the best value of the primary metric (ROC-AUC). Let's get the best model from the first HyperDrive run, which used random sampling with the Bandit policy.

In [None]:
best_run = hyperdrive_run.get_best_run_by_primary_metric()
print(best_run.get_details()['runDefinition']['arguments'])
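
Conceptually, picking the best run means comparing the primary metric across all child runs. The sketch below shows that selection on plain dictionaries; the run records, their ids, and the helper `best_by_metric` are hypothetical and only stand in for what HyperDrive tracks for you.

```python
def best_by_metric(child_runs, metric_name, maximize=True):
    """Return the run record with the best value of metric_name, ignoring runs that never logged it."""
    scored = [run for run in child_runs if metric_name in run["metrics"]]
    if not scored:
        raise ValueError(f"no run logged the metric {metric_name!r}")
    pick = max if maximize else min
    return pick(scored, key=lambda run: run["metrics"][metric_name])


# Hypothetical child-run records, for illustration only.
child_runs = [
    {"id": "run_0", "arguments": ["--solver", "saga", "--penalty", "l2"], "metrics": {"ROC-AUC": 0.71}},
    {"id": "run_1", "arguments": ["--solver", "lbfgs", "--penalty", "l2"], "metrics": {"ROC-AUC": 0.78}},
    {"id": "run_2", "arguments": ["--solver", "sag", "--penalty", "none"], "metrics": {}},  # cancelled early
]

best = best_by_metric(child_runs, "ROC-AUC")
print(best["id"], best["arguments"])
```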

In [None]:
# Copy 'best_run' to 'run' to re-use the same code also used without HyperDrive
run = best_run

## Display run metrics results
You now have a model trained on a remote cluster. Retrieve the metrics logged by the run:

In [None]:
print(run.get_metrics())

## See files associated with the run

In [None]:
print(run.get_file_names())

run.download_file('azureml-logs/70_driver_log.txt')

In [None]:
run.get_details()

## Register the model
Once you've trained the model, you can save and register it to your workspace. Model registration lets you store and version your models in your workspace to simplify model management and deployment.

A registered model is a logical container for one or more files that make up your model. For example, if you have a model that's stored in multiple files, you can register them as a single model in the workspace. After you register the files, you can then download or deploy the registered model.

With the Model class, you can package models for use with Docker and deploy them as a real-time endpoint that can be used for inference requests.

Running the following code will register the model to your workspace, and will make it available to reference by name in remote compute contexts or deployment scripts. 

In [None]:
# First of all, download the trained model from the best HyperDrive run
run.download_file('outputs/classif-empl-attrition.pkl')

In [None]:
from azureml.core.model import Model

model_name = 'aml-wrkshp-classif-empl-attrition'

model_reg = run.register_model(
    model_name=model_name,  # Name of the registered model in your workspace.
    description='Binary classification model for employees attrition',
    model_path='outputs/classif-empl-attrition.pkl', # Path of the model file within the run's outputs (a cloud path, not a local file).
    model_framework=Model.Framework.SCIKITLEARN,     # Framework used to create the model. Supported frameworks: TensorFlow, ScikitLearn, Onnx, Custom
    model_framework_version='0.23.2',                # Version of scikit-learn used to create the model.
    tags={'ml-task': "binary-classification", 'business-area': "HR"},
    properties={'joblib-version': "0.14.1", 'pandas-version': "1.0.0"},
    sample_input_dataset=aml_dataset
)

model_reg

### How to download Scikit-Learn model pickle file from the model registry

In [None]:
print(Model.get_model_path(model_name, _workspace=ws))

model_from_registry = Model(ws, model_name)
model_from_registry.download(target_dir='.', exist_ok=True)

# Try model predictions in this notebook

### Load model into memory

In [None]:
# Load the model into memory
model = joblib.load('classif-empl-attrition.pkl')

model

In [None]:
# Load the test datasets from .pkl files

# Download the test datasets to local
run.download_file('outputs/x_test.pkl')
run.download_file('outputs/y_test.pkl')

# Load the test datasets into memory
x_test = joblib.load('x_test.pkl')
y_test = joblib.load('y_test.pkl')

## Make Predictions and calculate Accuracy metric

In [None]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import average_precision_score

# Make Multiple Predictions
y_predictions = model.predict(x_test)

accuracy = accuracy_score(y_test, y_predictions)
rocauc = roc_auc_score(y_test, y_predictions)
average_precision = average_precision_score(y_test, y_predictions)

model_details_df = pd.DataFrame([accuracy, rocauc, average_precision],
                                columns = ['LogisticRegression'],
                                index=['Accuracy','ROC-AUC','Avg Precision'])

model_details_df
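
One detail worth keeping in mind: ROC-AUC is a ranking metric, so it generally carries more information when computed from predicted probabilities than from hard class labels. The pure-Python sketch below illustrates the difference with a pairwise definition of ROC-AUC; the function `roc_auc` and the toy data are hypothetical and are not the notebook's actual test set.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    total = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positives) * len(negatives))


# Toy data, for illustration only.
y_true = [0, 0, 0, 1, 1, 1]
probabilities = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]
hard_labels = [1 if p >= 0.5 else 0 for p in probabilities]

auc_scores = roc_auc(y_true, probabilities)
auc_labels = roc_auc(y_true, hard_labels)
print(f"AUC from probabilities: {auc_scores:.3f}")
print(f"AUC from hard labels:   {auc_labels:.3f}")
```

Thresholding discards the ordering among examples on the same side of the cut, which is why the two values differ here.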

## Confusion Matrix

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import plot_confusion_matrix

class_names = y_test.unique()

# Plot both the non-normalized and the normalized confusion matrix
titles_options = [("Confusion matrix, without normalization", None),
                  ("Normalized confusion matrix", 'true')]
for title, normalize in titles_options:
    disp = plot_confusion_matrix(model, x_test, y_test,
                                 display_labels=class_names,
                                 cmap=plt.cm.Blues,
                                 normalize=normalize)
    disp.ax_.set_title(title)

    print(title)
    print(disp.confusion_matrix)

plt.show()
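
For readers unfamiliar with the `normalize='true'` option, the following sketch builds a confusion matrix by hand and normalizes each row by the number of true examples of that class. Rows are true labels and columns are predicted labels, following the scikit-learn convention. The helper names and the toy labels are hypothetical.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Count predictions: rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, prediction in zip(y_true, y_pred):
        matrix[index[truth]][index[prediction]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total, so every row shows per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


# Toy labels, for illustration only.
y_true = ["No", "No", "No", "No", "Yes", "Yes"]
y_pred = ["No", "No", "No", "Yes", "Yes", "No"]

counts = confusion_matrix(y_true, y_pred, ["No", "Yes"])
rates = normalize_rows(counts)
print(counts)
print(rates)
```

The normalized view is often easier to read for imbalanced problems such as attrition, where one class dominates the raw counts.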

In [None]:
# Index of the instance you want to use as input for a prediction
instance_num = 6 # The seventh instance from the beginning (0-based)

# Get the predictions and probabilities, then inspect the instance at the index defined above
prediction_values = model.predict(x_test)
prediction_probs = model.predict_proba(x_test)

print("Classes:")
print(model.classes_)

print("Prediction label for instance", instance_num, ":")
print(prediction_values[instance_num])

print("True label for instance", instance_num, ":")
print(y_test.values[instance_num])

print("Prediction probabilities for instance", instance_num, ":")
print(prediction_probs[instance_num])

x_test.iloc[instance_num]
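
The printed probabilities are ordered like `model.classes_`, so the predicted label is the class at the position of the highest probability. A minimal sketch of that mapping, using hypothetical class names and probabilities rather than real model output:

```python
def label_from_probabilities(classes, probabilities):
    """Return the class whose probability is highest (first one wins on ties)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return classes[best_index]


classes = ["No", "Yes"]            # hypothetical stand-in for model.classes_
instance_probs = [0.32, 0.68]      # hypothetical stand-in for one row of predict_proba

predicted_label = label_from_probabilities(classes, instance_probs)
print(predicted_label)
```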