# Exercise 2 - From Data to Model

In [the previous exercise](./01%20-%20Getting%20Started%20with%20Azure%20ML.ipynb), you created an Azure ML workspace and ran a simple experiment based on data in a CSV file in the **data** folder where this notebook is stored. Although it's fairly common for data scientists to work with data on their local file system, in an enterprise environment it can be more effective to store the data in a central location where multiple data scientists can access it. In this exercise, you'll explore some Azure ML features that make it easier to work with data in a high-scale, collaborative environment.

> **Important**: This exercise assumes you have completed the previous exercise in this series - specifically, you must have:
>
> - Created an Azure ML Workspace, and saved its configuration in this Azure Notebooks project.
>
> If you haven't done that, do it now - it'll only take a few minutes!

## Task 1: Connect to Your Workspace

The first thing you need to do is connect to your workspace using the Azure ML SDK. Let's start by ensuring you still have the latest version installed (if you ended and restarted your Azure Notebooks session, the environment may have been reset).

In [None]:
!pip install --upgrade azureml-sdk[notebooks]
import azureml.core
print("Ready to use Azure ML", azureml.core.VERSION)

Now you're ready to connect to your workspace. When you created it in the previous exercise, you saved its configuration; so now you can simply load the workspace from its configuration file.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [None]:
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to work with', ws.name)

## Task 2: Upload Data to a Datastore

In Azure ML, *datastores* are references to storage locations, such as Azure Storage blob containers. Every workspace has a *default* datastore - usually the Azure storage blob container that was created with the workspace. If you need to work with data that is stored in different locations, you can add custom datastores to your workspace and set any of them to be the default.
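
As a rough illustration of adding a custom datastore and making it the default, the sketch below wraps the SDK calls in a helper function. The function name, datastore name, container name, account name, and the `STORAGE_ACCOUNT_KEY` environment variable are all hypothetical placeholders, not values used elsewhere in this exercise; it assumes the blob container already exists.

```python
import os

def add_blob_datastore(ws, make_default=False):
    """Register an existing blob container as a datastore in workspace `ws`."""
    # Imported here so the function can be defined without the SDK installed
    from azureml.core import Datastore

    blob_ds = Datastore.register_azure_blob_container(
        workspace=ws,
        datastore_name='my_blob_ds',                    # hypothetical name
        container_name='my-container',                  # hypothetical container
        account_name='mystorageaccount',                # hypothetical account
        account_key=os.environ['STORAGE_ACCOUNT_KEY'])  # hypothetical variable

    # Optionally make the new datastore the workspace default
    if make_default:
        ws.set_default_datastore(blob_ds.name)
    return blob_ds
```

You don't need to run this for the exercise; the default datastore created with the workspace is all that's required here.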

Run the following code to determine the datastores in your workspace:

In [None]:
from azureml.core import Datastore

# Get the default datastore
default_ds = ws.get_default_datastore()

# Enumerate all datastores, indicating which is the default
for ds_name in ws.datastores:
    print(ds_name, "- Default =", ds_name == default_ds.name)

Now that you have determined the available datastores, you can upload files from your local file system to a datastore so that they will be accessible to experiments running in the workspace, regardless of where the experiment script is actually being run.

In [None]:
default_ds.upload_files(files=['./data/diabetes.csv'], # Upload the data/diabetes.csv file
                       target_path='diabetes-data/', # Put it in a folder path in the datastore
                       overwrite=True, # Replace existing files of the same name
                       show_progress=True)

Note that the upload to the datastore results in the creation of a *data reference*, which is an abstraction that represents the connection to the data in the datastore.

So now there's a copy of the diabetes data in the default datastore for the workspace that you can use in future experiments. If you like, you can use the *Storage Explorer* interface for the Azure Storage account that was created with your Azure ML workspace in the [Azure portal](https://portal.azure.com) to verify that the *diabetes.csv* file has been uploaded to a *diabetes-data* folder in the *azureml-blobstore-nnnn...* blob container.

We'll return to datastores later, but for now, let's turn our attention to another data-related object in Azure ML - the *dataset*.

## Task 3: Create and Register a Dataset

A dataset is an object that encapsulates a specific data source. Let's create a dataset from the diabetes data you uploaded to the datastore, and view the first 20 records. In this case, the data is in a structured format in a CSV file, so we'll use a *Tabular* dataset.

> **More Information**: See the [documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-create-register-datasets) for more information about creating datasets.

In [None]:
from azureml.core import Dataset

# Create a tabular dataset from the path on the datastore (this may take a short while)
data_set = Dataset.Tabular.from_delimited_files(path=(default_ds, 'diabetes-data/diabetes.csv'))

# Display the first 20 rows as a Pandas dataframe
data_set.take(20).to_pandas_dataframe()

As you can see in the code above, it's easy to convert the dataset to a Pandas dataframe, enabling you to work with the data using common Python techniques.

Now that we have a dataset that references the diabetes data, we can register it to make it easily accessible to any experiment being run in the workspace.

In [None]:
# Register the dataset
dataset_name = 'Diabetes Dataset'
data_set = data_set.register(workspace=ws, 
                           name=dataset_name,
                           description='diabetes data',
                           tags = {'year':'2019', 'category':'Diabetes'},
                           create_new_version=True)

# List the datasets registered in the workspace
for ds in ws.datasets:
    print(ds)

## Task 4: Train a Model from a Dataset

OK, now we're ready to start training models. We'll use the sampled diabetes data to train a classification model that predicts whether or not a patient is likely to be diabetic. This is a fairly simple example that can be accomplished easily using a statistical machine learning library like scikit-learn, but it will give us an opportunity to explore how Azure ML can be used to manage model training.

In the previous exercise, you learned how to use Azure ML to run Python code as an *experiment*, and log metrics for later analysis. Now you'll use this same technique to train a model, but with one slight difference - rather than run the experiment code directly in this notebook, you'll create a separate Python script file. This enables greater flexibility in terms of the Python environment, or even the compute platform, on which the experiment code is to be run; and makes it easier for you to manage experiment scripts in a source-controlled environment.

First, let's create an experiment and a local folder into which we'll put the files needed to run it.

In [None]:
import os
from azureml.core import Experiment

# Create an experiment
experiment_name = 'diabetes_training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Create a folder for the experiment files
experiment_folder = './' + experiment_name
os.makedirs(experiment_folder, exist_ok=True)

print("Experiment:", experiment.name)

Next, create a Python script file for the experiment (we're creating the script file dynamically here, but in production you'd likely create it separately and store it in a source control system).

If you're familiar with scikit-learn, this script shouldn't contain too many surprises for you. The key Azure ML-specific points to note are:

- The script uses the experiment run context to log metrics, just as your code in the previous exercise did.
- The script obtains the training data from the sample dataset you registered in the workspace.

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Set regularization rate
reg = 0.01

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
dataset_name = 'Diabetes Dataset'
print("Loading data from " + dataset_name)
diabetes = Dataset.get_by_name(workspace=run.experiment.workspace, name=dataset_name).to_pandas_dataframe()

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
# Log values as built-in floats (np.float was removed in NumPy 1.24)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Now you're almost ready to run the experiment. There are just a few configuration issues you need to deal with:

1. Create a *Run Configuration* that defines the Python code execution environment - in this case, you'll just use the current user environment.
2. Create a *Script Configuration* that identifies the Python script to be run in the experiment.

The following cell sets up these configuration objects, and then submits the experiment.

In [None]:
import os
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core import Experiment, RunConfiguration
from azureml.core import ScriptRunConfig

# create a new RunConfig object
run_config = RunConfiguration()
run_config.environment.python.user_managed_dependencies = True # Use the current user environment

# Create a script config
src = ScriptRunConfig(source_directory=experiment_folder, 
                      script='diabetes_training.py',
                      run_config=run_config) 

# submit the experiment
run = experiment.submit(config=src)
run.wait_for_completion(show_output=True)

Run the following cell to see the status of the run, and the link to view the experiment details in the Azure portal. Then click the link and view the experiment details and output.

In [None]:
run

You can get the metrics for the experiment run like this:

In [None]:
run.get_metrics()
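
The *Accuracy* and *AUC* values in these metrics are calculated by the training script using NumPy and scikit-learn. To make clear what the two numbers measure, here is a minimal pure-Python sketch; the helper functions and the four-row sample are made up for illustration and are not part of the exercise data or the script.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def auc(y_true, y_score):
    """Chance that a random positive case is scored above a random negative case (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up labels and predicted probabilities for four patients
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Turn probabilities into class predictions using a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print('Accuracy:', accuracy(y_true, y_pred))
print('AUC:', auc(y_true, y_score))
```

Accuracy depends on the threshold used to turn probabilities into class predictions, whereas AUC summarizes how well the scores rank positive cases above negative ones across all thresholds.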

The experiment also produced a number of output files, including logs, images, and the model itself - let's take a look.

In [None]:
run.get_file_names()

The model in this case is a single *.pkl* file in the **outputs** folder. Since we've gone to the effort of creating it, let's register it in the workspace so we can easily retrieve it again later.

> **Note**: In a later exercise, you'll explore how to deploy a registered model into production.

In [None]:
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model', tags={'Training context':'Experiment script'}, properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

Switch to the [Azure portal](https://portal.azure.com) and view the **Models** in your Azure ML workspace. The *diabetes_model* model should be listed, and clicking it reveals its details.

You can also examine the models in a workspace using code, like this:

In [None]:
from azureml.core import Model

for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

## Task 5: Use an Estimator

In the previous example, you ran an experiment using the *RunConfiguration* and *ScriptRunConfig* classes to define the execution environment and the script to be run. This approach works well for many kinds of experiment that you might want to run in the Azure ML workspace, including extracting statistical insights from data or training models. However, Azure ML also supports the use of an *Estimator* object that is designed specifically for model training. There's a generic estimator, and some framework-specific estimators for common machine learning frameworks like Scikit-Learn, PyTorch, and TensorFlow.

> **More Information**: For more details about estimators, see the [Azure ML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-train-ml-models).

Let's rerun the experiment to train our diabetes model using an estimator. Additionally, instead of hard-coding the regularization rate, this time we'll pass it to the script as a parameter - so first we'll need to modify the training script.

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Set regularization parameter
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
dataset_name = 'Diabetes Dataset'
print("Loading data from " + dataset_name)
diabetes = Dataset.get_by_name(workspace=run.experiment.workspace, name=dataset_name).to_pandas_dataframe()

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
# Log values as built-in floats (np.float was removed in NumPy 1.24)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Now you're ready to run the experiment using an estimator.

> **Note**: Don't worry too much about the environment and compute target for the moment - we'll explore those in the next exercise!

In [None]:
from azureml.train.estimator import Estimator
from azureml.core import Environment

# Create a Python environment (based on current user environment)
user_managed_env = Environment("user-managed-env")
user_managed_env.python.user_managed_dependencies = True

# Set the script parameters
script_params = {
    '--regularization': 0.1
}

# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
              script_params=script_params,
              compute_target = 'local',
              environment_definition = user_managed_env,
              entry_script='diabetes_training.py')

# Run the experiment
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)
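
The script parameters are handed to the training script as command-line arguments, which is why the script parses them with `argparse`. The following is a minimal sketch of that hand-off using only the standard library; the `argv` list is built by hand here purely for illustration and is an assumption about the shape of the arguments, not the SDK's actual internal code.

```python
import argparse

# Stand-in for the dictionary passed to the estimator
script_params = {'--regularization': 0.1}

# Flatten the dictionary into a command-line style list: ['--regularization', '0.1']
argv = []
for name, value in script_params.items():
    argv.extend([name, str(value)])

# Same argument definition as in diabetes_training.py
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')

args = parser.parse_args(argv)        # value supplied as a parameter
default_args = parser.parse_args([])  # no parameter supplied, so the default applies

print('Supplied rate:', args.reg_rate)
print('Default rate:', default_args.reg_rate)

# The script passes the inverse of the rate to LogisticRegression as C
C = 1 / args.reg_rate
print('C:', C)
```

Because the argument has a default, the script still runs when no `--regularization` value is passed.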

As before, you can view the results by visualizing the **run** variable:

In [None]:
run

Let's register the model that was produced.

In [None]:
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model', tags={'Training context':'Estimator'}, properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List the registered models
print("Registered Models:")
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

Switch to the [Azure portal](https://portal.azure.com) and view the **Models** in your Azure ML workspace. There should be a new version of *diabetes_model*.

## Task 6: Train a Model from a Datastore

In the previous two tasks, you've trained a model by consuming data from the dataset that you registered previously. Datasets abstract the underlying datastore, and work really well for many machine learning scenarios. However, in some cases, your training script might need to access the datastore directly; and to accomplish this, you'll need to make use of data references.

A data reference provides a way to pass the path to a folder in a datastore to a script, regardless of where the script is being run, so that the script can access data in the datastore location.

The following code gets a reference to the datastore location where you uploaded the diabetes.csv file earlier, and adds the reference to a set of script parameters.

> **More Information**: For more details about using datastores, see the [Azure ML documentation](https://docs.microsoft.com/en-us/azure/machine-learning/service/how-to-access-data).

In [None]:
data_ref = default_ds.path('diabetes-data').as_download(path_on_compute='diabetes_data')
print(data_ref)

script_params = {
    '--regularization': 0.1,
    '--data-folder': data_ref
}

Note that the data reference is an encoded value that abstracts the actual data source.
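
From the script's point of view, the pattern is simply "receive a folder path as an argument, then read a file from it". The sketch below shows that pattern in isolation, assuming the reference resolves to a local folder at run time. A temporary folder with a made-up two-row CSV stands in for the downloaded data, and the standard `csv` module is used instead of pandas to keep the example dependency-free.

```python
import argparse
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as data_folder:
    # Stand-in for the downloaded file (made-up rows, not the real data)
    with open(os.path.join(data_folder, 'diabetes.csv'), 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Age', 'BMI', 'Diabetic'])
        writer.writerow(['21', '43.5', '0'])
        writer.writerow(['43', '21.2', '1'])

    # The script receives the folder path through a command-line argument
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder reference')
    args = parser.parse_args(['--data-folder', data_folder])

    # Read the file from the folder that was passed in
    with open(os.path.join(args.data_folder, 'diabetes.csv'), newline='') as f:
        rows = list(csv.DictReader(f))

print(len(rows), 'rows loaded')
```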

Now you need to modify the training script to use the data reference.

In [None]:
%%writefile $experiment_folder/diabetes_training.py
# Import libraries
import os
import argparse
from azureml.core import Workspace, Dataset, Experiment, Run
import pandas as pd
import numpy as np
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get parameters
parser = argparse.ArgumentParser()
parser.add_argument('--regularization', type=float, dest='reg_rate', default=0.01, help='regularization rate')
parser.add_argument('--data-folder', type=str, dest='data_folder', help='data folder reference')
args = parser.parse_args()
reg = args.reg_rate

# Get the experiment run context
run = Run.get_context()

# load the diabetes data from the data reference
data_folder = args.data_folder
print("Loading data from", data_folder)
diabetes = pd.read_csv(os.path.join(data_folder,"diabetes.csv"))

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
# Log values as built-in floats (np.float was removed in NumPy 1.24)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
fig = plt.figure(figsize=(6, 4))
# Plot the diagonal 50% line
plt.plot([0, 1], [0, 1], 'k--')
# Plot the FPR and TPR achieved by our model
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
run.log_image(name = "ROC", plot = fig)
plt.show()

os.makedirs('outputs', exist_ok=True)
# note file saved in the outputs folder is automatically uploaded into experiment record
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

The script will now load the training data from the data reference passed to it as a parameter.

Let's try it.

In [None]:
# Create an estimator
estimator = Estimator(source_directory=experiment_folder,
              script_params=script_params,
              compute_target = 'local',
              environment_definition = user_managed_env,
              entry_script='diabetes_training.py')

# Run the experiment
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

As before, you can view the output and get a link to the details page in the Azure portal.

In [None]:
run

Once again, you can register the model that was trained by the experiment.

In [None]:
# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model', tags={'Training context':'Estimator (from Datastore)'}, properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List the registered models
print("Registered Models:")
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

In this exercise, you've explored some options for working with data in the form of *datastores* and *datasets*, and you've learned how to run an *experiment* that ingests data from these objects in order to train a machine learning model.

All of the experiments you've seen so far have been run in the context of the current Python environment on the local computer (in this case, the container where your Azure Notebooks library is hosted). In the next exercise, we'll look at ways you can run experiment code in different compute contexts.