# Exercise 3 - Compute Contexts

In [the previous exercise](./02%20-%20From%20Data%20to%20Model.ipynb), you used *datastores* and *datasets* to define shared sources of data that can be consumed in *experiments* and used to train machine learning models. In this exercise, you'll extend your experiments beyond the local compute context and take advantage of the cloud to run experiments in dynamically created compute contexts.

> **Important**: This exercise assumes you have completed the previous exercises in this series - specifically, you must have:
>
> - Created an Azure ML Workspace.
> - Uploaded the diabetes.csv data file to the workspace's default datastore.
> - Registered a **Diabetes Dataset** dataset in the workspace.
>
> If you haven't done that, go back and do it now - we'll wait for you!

## Task 1: Connect to Your Workspace

The first thing you need to do is connect to your workspace using the Azure ML SDK. Let's start by ensuring you still have the latest version installed (if you ended and restarted your Azure Notebooks session, the environment may have been reset).

In [None]:
!pip install --upgrade azureml-sdk[notebooks]
import azureml.core
print("Ready to use Azure ML", azureml.core.VERSION)

Now you're ready to connect to your workspace. When you created it in the previous exercise, you saved its configuration; so now you can simply load the workspace from its configuration file.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [None]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

## Task 2: Run an Experiment in a Custom Python Environment

In the previous exercise, you ran various iterations of an experiment, but always using the current user-managed Python environment. Sometimes, it can be useful to create a dedicated, "clean" environment for your experiment, explicitly specifying the Python packages that should be installed to support your experiment script.

The following cell contains code that defines a custom Python environment, with specific package dependencies.

In [None]:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Create a Python environment
diabetes_env = Environment("diabetes-env")
diabetes_env.python.user_managed_dependencies = False # we're going to create a custom environment
diabetes_env.docker.enabled = False # Don't use a docker container (default is true)

# Create a set of package dependencies (conda or pip as required)
diabetes_conda = CondaDependencies.create(conda_packages=['pandas','scikit-learn','joblib','ipykernel','matplotlib'],
                                          pip_packages=['azureml-sdk','argparse','pyarrow'])

# Add the dependencies to the environment
diabetes_env.python.conda_dependencies = diabetes_conda

# View the environment definition (JSON)
diabetes_env
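
The two package lists passed to `CondaDependencies.create` follow the same split that a conda environment file uses: packages resolved by conda, and packages installed by pip and nested under a `pip:` entry. As a rough sketch of that file shape only (not the exact definition the SDK generates, which may add a Python version, channels, and other defaults), the helper below builds the text by hand. The `conda_spec_text` name is invented for this illustration.

```python
def conda_spec_text(name, conda_packages, pip_packages):
    """Build the text of a conda environment file from two package lists."""
    lines = ['name: ' + name, 'dependencies:']
    # Packages resolved and installed by conda
    lines += ['  - ' + pkg for pkg in conda_packages]
    if pip_packages:
        # pip itself is a conda package; pip-installed packages nest under "pip:"
        lines.append('  - pip')
        lines.append('  - pip:')
        lines += ['    - ' + pkg for pkg in pip_packages]
    return '\n'.join(lines)

# The same package lists used in the environment definition above
print(conda_spec_text('diabetes-env',
                      ['pandas', 'scikit-learn', 'joblib', 'ipykernel', 'matplotlib'],
                      ['azureml-sdk', 'argparse', 'pyarrow']))
```

Keeping the lists separate matters because conda and pip resolve dependencies independently; a package should normally appear in only one of them.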

Now that you've defined the Python environment you need, you can use it to run an experiment.

The following code creates the experiment and a folder for its files (which may already exist from the previous exercise, but run it anyway!).

In [None]:
import os
from azureml.core import Experiment

# Create an experiment
experiment_name = 'diabetes_training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Create a folder for the experiment files
experiment_folder = './' + experiment_name
os.makedirs(experiment_folder, exist_ok=True)

print("Experiment:", experiment.name)

Next, create the Python script file for the experiment. This will overwrite the script you used in the previous exercise.

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
import joblib
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Set regularization parameter
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
dataset_name = 'Diabetes Dataset'
print("Loading data from " + dataset_name)
diabetes = Dataset.get_by_name(workspace=run.experiment.workspace, name=dataset_name).to_pandas_dataframe()

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

# Save the trained model
os.makedirs('outputs', exist_ok=True)
# Note: files saved in the outputs folder are automatically uploaded to the experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()
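
The script passes `C=1/reg` to `LogisticRegression` because scikit-learn's `C` parameter is the inverse of the regularization strength: a larger `--regularization` value gives a smaller `C` and a more strongly regularized model. The following minimal sketch repeats only the argument-parsing step from the script, using the standard `argparse` module, to show how the command-line value turns into `C`. The `parse_reg_rate` helper name is invented for this illustration.

```python
import argparse

def parse_reg_rate(argv):
    """Parse --regularization the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--regularization', type=float, dest='reg_rate',
                        default=0.01, help='regularization rate')
    return parser.parse_args(argv).reg_rate

# No argument supplied: the default of 0.01 is used
default_reg = parse_reg_rate([])
# Equivalent to running the script with: --regularization 0.1
custom_reg = parse_reg_rate(['--regularization', '0.1'])

for reg in (default_reg, custom_reg):
    print('regularization rate:', reg, '-> C =', 1 / reg)
```

Because the argument has a default, the script still runs when no parameter is passed, which makes it easy to test outside an experiment.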

Now you're ready to run the experiment using an estimator. You'll run it on the *local* compute (in this case the Azure Notebooks container), but you'll specify the custom Python environment you created previously. 

> **Note**: The experiment will take quite a lot longer because the Python environment must be created before the script can be run. For a simple experiment like the diabetes training script, this may seem inefficient; but imagine you needed to run a more complex experiment that takes several hours - you wouldn't want it to fail half-way through because of some issue with the version of a package installed in your environment, would you?

In [None]:
from azureml.train.estimator import Estimator

# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
              script_params=script_params,
              compute_target = 'local',
              environment_definition = diabetes_env,
              entry_script='diabetes_training.py')

# Run the experiment
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

As before, you can view the results by visualizing the **run** variable:

In [None]:
run

Now you can register the model that was trained by the experiment.

In [None]:
from azureml.core import Model

# Register the model
metrics = run.get_metrics()
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'custom local environment'},
                   properties={'AUC': metrics['AUC'], 'Accuracy': metrics['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')
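
Recording metrics as model properties makes it possible to compare registered versions later. The sketch below is one hedged way to pick the version with the highest value for a given property. It assumes only that each model object exposes `version` and a `properties` dictionary, as the listing code above uses, and that property values may come back as strings, so they are converted with `float`. The `best_by_property` helper is invented for this illustration, and the stand-in objects hold placeholder numbers, not results from this exercise.

```python
from types import SimpleNamespace

def best_by_property(models, prop_name):
    """Return the model whose named property is numerically largest, or None."""
    # Ignore models that were registered without this property
    scored = [m for m in models if prop_name in m.properties]
    if not scored:
        return None
    # Property values may be stored as strings, so convert before comparing
    return max(scored, key=lambda m: float(m.properties[prop_name]))

# Stand-in objects with placeholder values (hypothetical, for illustration only)
candidates = [
    SimpleNamespace(name='diabetes_model', version=1, properties={'AUC': '0.80'}),
    SimpleNamespace(name='diabetes_model', version=2, properties={'AUC': '0.85'}),
    SimpleNamespace(name='diabetes_model', version=3, properties={}),
]

best = best_by_property(candidates, 'AUC')
print(best.name, 'version:', best.version)
```

In the notebook, the same helper could be given the result of `Model.list(ws)` in place of the stand-in list.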

## Task 3: Run an Experiment on Remote Compute

In the previous task, you created a custom Python environment for an experiment; but the experiment was still run on the local compute (in this case, the Azure Notebooks container hosting this notebook). In many cases, your local compute resources may not be sufficient to process a complex or long-running experiment that needs to process a large volume of data; and you may want to take advantage of the ability to dynamically create and use compute resources in the cloud.

Azure ML supports a range of compute targets, which you can define in your workspace and use to run experiments, paying for the resources only when you use them.

> **Note**: In this exercise, you'll use an *Azure Machine Learning Compute* container cluster. For more details of the options for compute targets, see the [Azure ML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/concept-compute-target).

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpu-cluster"

# Check whether the cluster already exists
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # Create an Azure ML Compute resource (a container cluster)
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', 
                                                           vm_priority='lowpriority', 
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)
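
The cell above follows a "get or create" pattern: look the cluster up by name, and provision it only if the lookup raises `ComputeTargetException`. As a sketch, the same logic can be wrapped in a reusable function. The `get_or_create_cpu_cluster` name and the returned `created` flag are invented for this illustration. The `min_nodes` and `idle_seconds_before_scaledown` arguments are optional parameters of `AmlCompute.provisioning_configuration`; the values shown are illustrative choices, not recommendations from this exercise.

```python
def get_or_create_cpu_cluster(ws, name='cpu-cluster', max_nodes=4):
    """Return (cluster, created) for the named compute target."""
    # Imported inside the function so the helper can be defined
    # before the Azure ML SDK has been installed in the session
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Look up an existing compute target by name
        cluster = ComputeTarget(workspace=ws, name=name)
        created = False
    except ComputeTargetException:
        # Not found: provision a new cluster that can scale down to zero nodes
        config = AmlCompute.provisioning_configuration(
            vm_size='STANDARD_D2_V2',
            vm_priority='lowpriority',
            min_nodes=0,                         # illustrative value
            max_nodes=max_nodes,
            idle_seconds_before_scaledown=1800)  # illustrative value
        cluster = ComputeTarget.create(ws, name, config)
        created = True

    # Block until provisioning (if any) has finished
    cluster.wait_for_completion(show_output=True)
    return cluster, created
```

With a workspace loaded, `cpu_cluster, created = get_or_create_cpu_cluster(ws)` would behave like the cell above, and `created` tells you whether a new cluster was provisioned.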

Look at the **Compute** tab in the workspace in the [Azure portal](https://portal.azure.com) to verify that the compute resource has been created.

You can also use the following code to enumerate the compute targets in your workspace.

In [None]:
for target_name in ws.compute_targets:
    target = ws.compute_targets[target_name]
    print(target.name, target.type)

Now you're ready to run the experiment on the remote compute.

This time, rather than the generic **Estimator** class, you'll use the **SKLearn** class, which is an estimator designed specifically for scikit-learn model training. You'll also specify the package dependencies in the estimator's constructor rather than creating a separate environment as before (note that you don't need to specify the scikit-learn library, as it's already included in this estimator). This is done only to show that there are different ways to accomplish essentially the same task.

> **Note**: Once again, this will take a while to run as the nodes in the remote compute must be started and configured before the experiment script is run.

In [None]:
from azureml.train.sklearn import SKLearn

# Set the script parameters
script_params = {
    '--regularization': 0.1
}


# Create a new estimator that uses the remote compute
remote_estimator = SKLearn(source_directory=experiment_folder,
                           script_params=script_params,
                           compute_target = cpu_cluster,
                           conda_packages=['pandas','ipykernel','matplotlib'],
                           pip_packages=['azureml-sdk','argparse','pyarrow'],
                           entry_script='diabetes_training.py')

# Run the experiment
run = experiment.submit(config=remote_estimator)
run.wait_for_completion(show_output=True)


In [None]:
run

Now let's register the new version of the model.

In [None]:
# Register model
metrics = run.get_metrics()
run.register_model(model_path='outputs/diabetes_model.pkl',
                   model_name='diabetes_model',
                   tags={'Training context': 'remote compute'},
                   properties={'AUC': metrics['AUC'], 'Accuracy': metrics['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

So far, you've trained the model using a variety of compute options, but always using the same basic algorithm and parameters. As a result, the performance of the model has remained fairly consistent no matter how you've run the training script - and it's not really all that good!

Now that you've seen how to control compute options for a model training experiment, it's time to see how you can leverage the compute scalability of the cloud to experiment with different algorithms and parameters in order to find the best possible model for your data.