# Training Models

The central goal of machine learning is to train predictive models that can be used by applications. In Azure Machine Learning, you can use scripts to train models with common machine learning frameworks such as Scikit-Learn, TensorFlow, PyTorch, SparkML, and others. You can run these training scripts as experiments to track metrics and outputs, in particular the trained models.



## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If you do not have a current authenticated session with your Azure subscription, you'll be prompted to authenticate. Follow the instructions to authenticate using the code provided.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.40.0 to work with azureml-workspace


## Create a Training Script

You're going to use a Python script to train a machine learning model based on the diabetes data, so let's start by creating a folder for the script and data files.

In [2]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training\\diabetes.csv'

Now you're ready to create the training script and save it in the folder.

In [3]:
%%writefile $training_folder/diabetes_training.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.01

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing diabetes-training/diabetes_training.py
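
To make the two logged metrics concrete, here is a minimal pure-Python sketch of what accuracy and AUC measure. The labels and scores are made-up toy values, not rows from the diabetes data, and `accuracy` and `pairwise_auc` are hypothetical helper names; the script itself relies on `np.average` and scikit-learn's `roc_auc_score`. The pairwise formulation (the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as half) should agree with the area under the ROC curve.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case scored higher: correct ranking
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities for the positive class (illustrative only)
y_true = [0, 0, 1, 1, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

# Turn probabilities into class predictions with a 0.5 threshold
y_pred = [1 if s >= 0.5 else 0 for s in scores]

acc = accuracy(y_true, y_pred)
auc = pairwise_auc(y_true, scores)
print('Accuracy:', acc)
print('AUC:', auc)
```

Accuracy depends on the chosen threshold, while AUC only depends on how the scores rank the cases, which is why the training script passes the positive-class probabilities (`y_scores[:,1]`) rather than the predicted labels when computing AUC.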


## Use an Estimator to Run the Script as an Experiment

You can run experiment scripts using a **RunConfiguration** and a **ScriptRunConfig**, or you can use an **Estimator**, which abstracts both of these configurations in a single object.

In this case, the code below passes a **ScriptRunConfig** to the experiment, with an **Environment** created from a `requirements.txt` file, rather than constructing a generic **Estimator** object. Because `user_managed_dependencies` is set to `True`, Azure Machine Learning does not build a new conda environment for the run; the script runs in your existing local Python environment, so **scikit-learn** and the other packages the script imports must already be installed there. If you let Azure Machine Learning manage the dependencies instead, the environment is built on demand the first time it is used and then cached, so the first run takes a little longer and later runs with the same configuration complete more quickly.

In [9]:
from azureml.train.estimator import Estimator
from azureml.core import Experiment, Environment, ScriptRunConfig
from azureml.widgets import RunDetails


myenv = Environment.from_pip_requirements("ovo-azureml", "requirements.txt")
myenv.python.user_managed_dependencies = True
myenv.register(ws)

# Create a script run configuration (stored in a variable named estimator)
estimator = ScriptRunConfig(source_directory=training_folder,
                            script='diabetes_training.py',
                            compute_target='local',
                            environment=myenv)

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
RunDetails(run_instance=run).show()
run.wait_for_completion(show_output=True)

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

RunId: diabetes-training_1649521014_99425c33
Web View: https://ml.azure.com/runs/diabetes-training_1649521014_99425c33?wsid=/subscriptions/8ef9d38c-cd92-4cea-b46c-0527c93d4acb/resourcegroups/azure-ml-sdk-project/workspaces/azureml-workspace&tid=babcc559-b87a-427b-9175-f021a70345c0

Streaming azureml-logs/70_driver_log.txt

[2022-04-09T16:16:58.393308] Entering context manager injector.
[2022-04-09T16:16:59.434697] context_manager_injector.py Command line Options: Namespace(inject=['ProjectPythonPath:context_managers.ProjectPythonPath', 'RunHistory:context_managers.RunHistory', 'TrackUserError:context_managers.TrackUserError'], invocation=['diabetes_training.py'])
Script type = None
[2022-04-09T16:16:59.441704] Entering Run History Context Manager.
[2022-04-09T16:17:01.414806] Current directory: C:\Users\OVOOKP~1\AppData\Local\Temp\azureml_runs\diabetes-training_1649521014_99425c33
[2022-04-09T16:17:01.414806] Preparing to call script [diabetes_training.py] with arguments:[]
[2022-04-09

{'runId': 'diabetes-training_1649521014_99425c33',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2022-04-09T16:16:57.449055Z',
 'endTimeUtc': '2022-04-09T16:17:14.260357Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'fb6b5262-04cf-4199-8ef4-f2dbe6ca5100',
  'azureml.git.repository_uri': 'https://github.com/ovokpus/Azure-Machine-Learning-Project.git',
  'mlflow.source.git.repoURL': 'https://github.com/ovokpus/Azure-Machine-Learning-Project.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': 'd1aabb6aca40b9f8c61269b61b807d44e2309e31',
  'mlflow.source.git.commit': 'd1aabb6aca40b9f8c61269b61b807d44e2309e31',
  'azureml.git.dirty': 'False'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator

As with any experiment run, you can use the **RunDetails** widget to view information about the run and get a link to it in Azure Machine Learning studio.

You can also retrieve the metrics and outputs from the **Run** object.

In [10]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Regularization Rate 0.01
Accuracy 0.774
AUC 0.848370565699786


azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/9620_azureml.log
outputs/diabetes_model.pkl
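
If you want to reuse the logged values rather than just print them, a small helper can turn the metrics dictionary into a one-line summary. This is a sketch under assumptions: `summarize_metrics` is a hypothetical name, the dictionary literal stands in for the result of `run.get_metrics()` using the values printed above, and the list handling assumes that a metric logged more than once under the same name comes back as a list of values.

```python
def summarize_metrics(metrics):
    """Return 'name=value' pairs joined by ', ', keeping the last value of any list."""
    parts = []
    for name, value in metrics.items():
        if isinstance(value, list):
            # Assumption: repeated logging of one metric name yields a list
            value = value[-1]
        parts.append('{}={}'.format(name, value))
    return ', '.join(parts)


# Stand-in for run.get_metrics(), filled with the values from the output above
logged = {'Regularization Rate': 0.01, 'Accuracy': 0.774, 'AUC': 0.848370565699786}
summary = summarize_metrics(logged)
print(summary)
```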


## Register the Trained Model

Note that the outputs of the experiment include the trained model file (**diabetes_model.pkl**). You can register this model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.

In [11]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Estimator'},
                   properties={
                                'AUC': run.get_metrics()['AUC'], 
                               'Accuracy': run.get_metrics()['Accuracy']
                               })

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 1
	 Training context : Estimator
	 AUC : 0.848370565699786
	 Accuracy : 0.774




## Create a Parameterized Training Script

You can increase the flexibility of your training experiment by adding parameters to your script, enabling you to repeat the same training experiment with different settings. In this case, you'll add a parameter for the regularization rate used by the Logistic Regression algorithm when training the model.

Again, let's start by creating a folder for the parameterized script and the training data.

In [12]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training-params'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training-params\\diabetes.csv'

Now let's create a script containing a parameter for the regularization rate hyperparameter.

In [14]:
%%writefile $training_folder/diabetes_training_params.py
# Import libraries
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import os
import argparse
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Writing diabetes-training-params/diabetes_training_params.py
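
The argument handling can be checked on its own, without submitting a run. The sketch below reuses the script's `--reg_rate` definition and passes explicit argument lists to `parse_args`; `parse_reg_rate` is a hypothetical wrapper added for illustration. It also shows how the rate maps to scikit-learn's `C` parameter, which is the inverse of regularization strength, so a smaller rate means a larger `C` and weaker regularization.

```python
import argparse


def parse_reg_rate(argv):
    """Parse --reg_rate from a list of arguments, as the training script does."""
    parser = argparse.ArgumentParser()
    # dest='reg' stores the value as args.reg; the default applies when the flag is omitted
    parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
    args = parser.parse_args(argv)
    return args.reg


# No arguments: the default rate is used
default_reg = parse_reg_rate([])

# Same argument list that the run configuration passes to the script
custom_reg = parse_reg_rate(['--reg_rate', '0.001'])

# LogisticRegression receives the inverse of the rate as C
default_c = 1 / default_reg
custom_c = 1 / custom_reg

print('default:', default_reg, 'C =', default_c)
print('custom:', custom_reg, 'C =', custom_c)
```

Passing the value as the string `'0.001'` is fine because `type=float` converts it during parsing.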


## Use a Framework-Specific Estimator

The previous run used a generic configuration, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. For Scikit-Learn, that is the **SKLearn** estimator, which spares you from specifying the **scikit-learn** package in the configuration. The cell below imports **SKLearn** but again submits a **ScriptRunConfig**, using its `arguments` list to pass `--reg_rate 0.001` to the parameterized script.

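> **Note**: When Azure Machine Learning manages the dependencies, a new environment must be created the first time it is used. Here `user_managed_dependencies` is `True`, so the run uses your local Python environment instead.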

In [16]:
from azureml.train.sklearn import SKLearn
from azureml.widgets import RunDetails
from azureml.core import ScriptRunConfig, Environment


myenv = Environment.from_pip_requirements("ovo-azureml", "requirements.txt")
myenv.python.user_managed_dependencies = True

# Create a script run configuration that passes the regularization rate to the script
estimator = ScriptRunConfig(source_directory=training_folder,
                            script='diabetes_training_params.py',
                            arguments=['--reg_rate', '0.001'],
                            compute_target='local',
                            environment=myenv)

# Create an experiment
experiment_name = 'diabetes-training'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()
run.wait_for_completion()

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

{'runId': 'diabetes-training_1649521670_a56fc174',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2022-04-09T16:27:52.372911Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  'ContentSnapshotId': 'db107c4b-ec3e-4fb0-a56c-f36d712483d9',
  'azureml.git.repository_uri': 'https://github.com/ovokpus/Azure-Machine-Learning-Project.git',
  'mlflow.source.git.repoURL': 'https://github.com/ovokpus/Azure-Machine-Learning-Project.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': 'd1aabb6aca40b9f8c61269b61b807d44e2309e31',
  'mlflow.source.git.commit': 'd1aabb6aca40b9f8c61269b61b807d44e2309e31',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training_params.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--reg_rate', '0.001'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'targ

Once again, you can get the metrics and outputs from the run.

In [17]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Regularization Rate 0.001
Accuracy 0.774
AUC 0.8484919647743668


azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/10344_azureml.log
outputs/diabetes_model.pkl


## Register a New Version of the Model

Now that you've trained a new model, you can register it as a new version in the workspace.

In [18]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Parameterized SKLearn Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 2
	 Training context : Parameterized SKLearn Estimator
	 AUC : 0.8484919647743668
	 Accuracy : 0.774


diabetes_model version: 1
	 Training context : Estimator
	 AUC : 0.848370565699786
	 Accuracy : 0.774
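
Because each version carries its metrics as properties, you can compare versions programmatically. The following sketch is an illustration under stated assumptions: plain dictionaries stand in for the objects returned by `Model.list(ws)`, filled with the values printed above; `best_by_property` is a hypothetical helper; and the property values are written as strings on the assumption that the registry stores them that way, which is why they are converted with `float` before comparison.

```python
def best_by_property(models, prop_name):
    """Return the model entry with the highest numeric value for prop_name."""
    return max(models, key=lambda m: float(m['properties'][prop_name]))


# Stand-ins for registered model versions, using the values listed above
registered = [
    {'name': 'diabetes_model', 'version': 2,
     'properties': {'AUC': '0.8484919647743668', 'Accuracy': '0.774'}},
    {'name': 'diabetes_model', 'version': 1,
     'properties': {'AUC': '0.848370565699786', 'Accuracy': '0.774'}},
]

best = best_by_property(registered, 'AUC')
print(best['name'], 'version:', best['version'])
```

The two versions differ only slightly in AUC and not at all in accuracy, so a comparison like this is more useful once you have tried a wider range of regularization rates.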




## Clean Up

If you've finished exploring, you can close this notebook and shut down your Compute Instance.