# Training Models

The central goal of machine learning is to train predictive models that applications can use. In Azure Machine Learning, you can use scripts to train models with common machine learning frameworks such as Scikit-Learn, TensorFlow, PyTorch, SparkML, and others. You can run these training scripts as experiments to track metrics and outputs, in particular the trained models.

## Connect to Your Workspace

The first thing you need to do is to connect to your workspace using the Azure ML SDK.

> **Note**: If the authenticated session with your Azure subscription has expired since you completed the previous exercise, you'll be prompted to reauthenticate.

In [1]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()
print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.58.0 to work with wsp-mlopsdemo


## Create a Training Script

You're going to use a Python script to train a machine learning model based on the diabetes data, so let's start by creating a folder for the script and data files.

In [2]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training/diabetes.csv'

Now you're ready to create the training script and save it in the folder.

In [3]:
%%writefile $training_folder/diabetes_training.py
# Import libraries
import os
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization hyperparameter
reg = 0.01

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting diabetes-training/diabetes_training.py


In [4]:
!pip show pandas

Name: pandas
Version: 1.3.5
Summary: Powerful data structures for data analysis, time series, and statistics
Home-page: https://pandas.pydata.org
Author: The Pandas Development Team
Author-email: pandas-dev@python.org
License: BSD-3-Clause
Location: /anaconda/envs/azureml_py38/lib/python3.10/site-packages
Requires: numpy, python-dateutil, pytz
Required-by: arch, azureml-automl-runtime, azureml-datadrift, azureml-opendatasets, azureml-train-automl-runtime, azureml-training-tabular, cmdstanpy, dask-expr, dask_sql, datasets, dice-ml, econml, erroranalysis, fairlearn, fastai, fastparquet, interpret-core, interpret_community, ml_wrappers, pandas-ml, pmdarima, prophet, raiutils, raiwidgets, responsibleai, scrapbook, seaborn, shap, sklearn-pandas, statsmodels, torch-tb-profiler


## Use an Estimator to Run the Script as an Experiment

Previously, you ran experiment scripts using a **RunConfiguration** and a **ScriptRunConfig**. You can use the same approach to run a training script, but using an **Estimator** is generally easier as it abstracts both of these configurations in a single object.

In this case, we'll use a generic **Estimator** object to run the training experiment. Note that the default environment for this estimator does not include the **scikit-learn** package, so you need to add it to the configuration explicitly. The conda environment is built on demand the first time the estimator is used and is cached for future runs that use the same configuration, so the first run takes a little longer.

In [5]:
from azureml.train.estimator import Estimator
from azureml.core import Experiment

# Create an estimator
estimator = Estimator(source_directory=training_folder,
                      entry_script='diabetes_training.py',
                      compute_target='local',
                      conda_packages=['scikit-learn','pandas']
                      )

# Create an experiment
experiment_name = 'diabetes-training-19oct2024-1302'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment based on the estimator
run = experiment.submit(config=estimator)
run.wait_for_completion(show_output=True)

'Estimator' is deprecated. Please use 'ScriptRunConfig' from 'azureml.core.script_run_config' with your own defined environment or an Azure ML curated environment.


RunId: diabetes-training-19oct2024-1302_1729323306_692f3e2a
Web View: https://ml.azure.com/runs/diabetes-training-19oct2024-1302_1729323306_692f3e2a?wsid=/subscriptions/260ec2ba-a1f7-42aa-99f3-29ace2204416/resourcegroups/mlopsdemo/workspaces/wsp-mlopsdemo&tid=5e6bdc09-77ab-414c-9bfc-0ca06b020916

Streaming azureml-logs/60_control_log.txt

[2024-10-19T07:35:12.619297] Using urllib.request Python 3.0 or later
Streaming log file azureml-logs/60_control_log.txt
Starting the daemon thread to refresh tokens in background for process with pid = 36959
Running: ['/bin/bash', '/tmp/azureml_runs/diabetes-training-19oct2024-1302_1729323306_692f3e2a/azureml-environment-setup/docker_env_checker.sh']

Found materialized image on target: azureml/azureml_a67ef4dfa50622d413c5e07c5b80c41b


[2024-10-19T07:35:13.148400] Logging experiment running status in history service.

Streaming azureml-logs/70_driver_log.txt

[2024-10-19T07:35:16.449964] Entering context manager injector.
Cannot provide tracer witho

{'runId': 'diabetes-training-19oct2024-1302_1729323306_692f3e2a',
 'target': 'local',
 'status': 'Completed',
 'startTimeUtc': '2024-10-19T07:35:14.387899Z',
 'endTimeUtc': '2024-10-19T07:35:27.731014Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  '_azureml.ClusterName': 'local',
  'ContentSnapshotId': '199bcea9-22f4-45cf-9831-a5b8996ab6a9',
  'azureml.git.repository_uri': 'https://github.com/TheJamesHerring/DP100.git',
  'mlflow.source.git.repoURL': 'https://github.com/TheJamesHerring/DP100.git',
  'azureml.git.branch': 'master',
  'mlflow.source.git.branch': 'master',
  'azureml.git.commit': 'ce2cad8286b7818c96b0ca6979b84402376843ff',
  'mlflow.source.git.commit': 'ce2cad8286b7818c96b0ca6979b84402376843ff',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes_training.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Pytho
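The deprecation warning in the output above suggests using **ScriptRunConfig** with a defined environment instead of **Estimator**. The following is a minimal sketch of that approach, assuming an Azure ML SDK v1 release in which `ScriptRunConfig` accepts an `environment` argument. The helper name `build_script_run_config` and the environment name `diabetes-sklearn-env` are hypothetical and chosen only for illustration. The SDK imports are placed inside the function so that the definition loads even where the SDK is not installed; in a notebook you would normally import at the top of the cell.

```python
def build_script_run_config(source_directory,
                            script='diabetes_training.py',
                            reg_rate=None):
    """Build a ScriptRunConfig roughly equivalent to the Estimator used above."""
    # Deferred imports: these require the azureml-core package (SDK v1)
    from azureml.core import Environment, ScriptRunConfig
    from azureml.core.conda_dependencies import CondaDependencies

    # Hypothetical environment name; the package list mirrors the Estimator call
    env = Environment('diabetes-sklearn-env')
    env.python.conda_dependencies = CondaDependencies.create(
        conda_packages=['scikit-learn', 'pandas'],
        # Assumption: azureml-defaults is needed so the script can log to the run
        pip_packages=['azureml-defaults'])

    # Only pass --reg_rate if the script is the parameterized version
    arguments = [] if reg_rate is None else ['--reg_rate', str(reg_rate)]

    return ScriptRunConfig(source_directory=source_directory,
                           script=script,
                           arguments=arguments,
                           compute_target='local',
                           environment=env)
```

Under these assumptions, `experiment.submit(config=build_script_run_config(training_folder))` would take the place of submitting the estimator.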

As with any experiment run, you can use the **RunDetails** widget to view information about the run and get a link to it in Azure Machine Learning studio.

You can also retrieve the metrics and outputs from the **Run** object.

In [6]:
# Get logged metrics
metrics = run.get_metrics()
for key in metrics.keys():
    print(key, metrics.get(key))
print('\n')
for file in run.get_file_names():
    print(file)

Accuracy 0.774
Regularization Rate 0.01
AUC 0.8483208119806956


azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/7_azureml.log
outputs/diabetes_model.pkl
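To make the logged **Accuracy** and **AUC** metrics more concrete, here is a small pure-Python sketch on made-up toy data (not the diabetes dataset). The functions `accuracy` and `auc` are illustrative helpers, not scikit-learn's implementation: accuracy is the fraction of matching predictions, and AUC is computed as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)


def auc(y_true, scores):
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities of class 1 (invented for illustration)
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Assumption for this toy example: predict class 1 when the score is at least 0.5
y_pred = [1 if s >= 0.5 else 0 for s in scores]

print('Accuracy:', accuracy(y_true, y_pred))
print('AUC:', auc(y_true, scores))
```

Accuracy depends on the chosen threshold, whereas AUC depends only on how the scores rank the cases, which is why the two metrics can differ for the same model.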


## Register the Trained Model

Note that the outputs of the experiment include the trained model file (**diabetes_model.pkl**). You can register this model in your Azure Machine Learning workspace, making it possible to track model versions and retrieve them later.

In [7]:
from azureml.core import Model

# Register the model
run.register_model(model_path='outputs/diabetes_model.pkl', model_name='diabetes_model',
                   tags={'Training context':'Estimator'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['Accuracy']})

# List registered models
for model in Model.list(ws):
    print(model.name, 'version:', model.version)
    for tag_name in model.tags:
        tag = model.tags[tag_name]
        print ('\t',tag_name, ':', tag)
    for prop_name in model.properties:
        prop = model.properties[prop_name]
        print ('\t',prop_name, ':', prop)
    print('\n')

diabetes_model version: 3
	 Training context : Estimator
	 AUC : 0.8483208119806956
	 Accuracy : 0.774


diabetes_model version: 2
	 Training context : Estimator
	 AUC : 0.8483208119806956
	 Accuracy : 0.774


diabetes_model version: 1
	 Training context : Inline Training
	 AUC : 0.88160336081812
	 Accuracy : 0.8968888888888888
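Because each registered version carries its metrics as properties, you can compare versions after the fact. The sketch below uses plain dictionaries as stand-ins for the registered models, with values copied from the listing above; it does not call the Azure ML SDK. The helper `best_by_property` is hypothetical, and the property values are written as strings on the assumption that the registry returns them that way (`float()` handles either form).

```python
# Stand-ins for the registered model versions listed above
registered = [
    {'name': 'diabetes_model', 'version': 3,
     'properties': {'AUC': '0.8483208119806956', 'Accuracy': '0.774'}},
    {'name': 'diabetes_model', 'version': 2,
     'properties': {'AUC': '0.8483208119806956', 'Accuracy': '0.774'}},
    {'name': 'diabetes_model', 'version': 1,
     'properties': {'AUC': '0.88160336081812', 'Accuracy': '0.8968888888888888'}},
]


def best_by_property(models, prop_name):
    """Return the model entry with the highest numeric value for prop_name."""
    return max(models, key=lambda m: float(m['properties'][prop_name]))


best = best_by_property(registered, 'AUC')
print(best['name'], 'version:', best['version'], 'AUC:', best['properties']['AUC'])
```

With the real SDK, the same comparison could be applied to the objects returned by `Model.list(ws)` by reading their `version` and `properties` attributes, as the listing cell above does.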




## Create a Parameterized Training Script

You can increase the flexibility of your training experiment by adding parameters to your script, enabling you to repeat the same training experiment with different settings. In this case, you'll add a parameter for the regularization rate used by the Logistic Regression algorithm when training the model.

Again, let's start by creating a folder for the parameterized script and the training data.

In [8]:
import os, shutil

# Create a folder for the experiment files
training_folder = 'diabetes-training-params'
os.makedirs(training_folder, exist_ok=True)

# Copy the data file into the experiment folder
shutil.copy('data/diabetes.csv', os.path.join(training_folder, "diabetes.csv"))

'diabetes-training-params/diabetes.csv'

Now let's create a script containing a parameter for the regularization rate hyperparameter.

In [9]:
%%writefile $training_folder/diabetes_training.py
# Import libraries
import os
from azureml.core import Run
import pandas as pd
import numpy as np
import joblib
import argparse
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context
run = Run.get_context()

# Set regularization hyperparameter
parser = argparse.ArgumentParser()
parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
args = parser.parse_args()
reg = args.reg

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels
X, y = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values, diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
run.log('Regularization Rate', float(reg))
model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', acc)
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the trained model in the outputs folder
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename='outputs/diabetes_model.pkl')

run.complete()

Overwriting diabetes-training-params/diabetes_training.py
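The argument handling in the script can be tried on its own, without submitting a run. The sketch below reuses the same `add_argument` call and passes the argument list explicitly to `parse_args`; the wrapper function `parse_reg_rate` is a hypothetical helper added for illustration. It also prints the value passed to `LogisticRegression` as `C`, which in scikit-learn is the inverse of the regularization strength, so a larger regularization rate yields a smaller `C`.

```python
import argparse


def parse_reg_rate(argv):
    """Parse --reg_rate the same way the training script does."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--reg_rate', type=float, dest='reg', default=0.01)
    args = parser.parse_args(argv)
    return args.reg


# No arguments supplied: the default of 0.01 applies
default_reg = parse_reg_rate([])

# Explicit value, as the command line "--reg_rate 0.1" would supply it
custom_reg = parse_reg_rate(['--reg_rate', '0.1'])

for reg in (default_reg, custom_reg):
    # The script passes C=1/reg to LogisticRegression
    print('reg_rate =', reg, '-> C =', 1 / reg)
```

Passing a list to `parse_args` rather than relying on `sys.argv` keeps the parsing logic easy to check in a notebook cell.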


## Use a Framework-Specific Estimator

You used a generic **Estimator** class to run the training script, but you can also take advantage of framework-specific estimators that include environment definitions for common machine learning frameworks. In this case, you're using Scikit-Learn, so you can use the **SKLearn** estimator. This means that you don't need to specify the **scikit-learn** package in the configuration.

In [10]:
from azureml.train.sklearn import SKLearn
from azureml.widgets import RunDetails

# Create an estimator
estimator = SKLearn(source_directory=training_folder,
                    entry_script='diabetes_training.py',
                    script_params = {'--reg_rate': 0.1},
                    compute_target='local'
                    )

# Create an experiment
experiment_name = 'diabetes-training-19oct2024-1302'
experiment = Experiment(workspace = ws, name = experiment_name)

# Run the experiment
run = experiment.submit(config=estimator)

# Show the run details while running
RunDetails(run).show()


2024-10-19 07:35:42.000619: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-10-19 07:35:42.027364: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-10-19 07:35:42.035357: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-19 07:35:42.055490: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

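Once again, we can get the metrics and outputs from the run. Because the previous cell does not wait for the run to finish, the run may still be in progress when you query it; in that case the metrics dictionary is empty and the model file is not yet listed, as in the output below. Call **run.wait_for_completion()** first, or rerun the cells once the run has completed.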

In [11]:
# Get logged metrics
print(run)
metrics = run.get_metrics()
print(metrics)
for key in metrics.keys():
    print(key, metrics.get(key))


Run(Experiment: diabetes-training-19oct2024-1302,
Id: diabetes-training-19oct2024-1302_1729323334_f0ea7652,
Type: azureml.scriptrun,
Status: Running)
{}


In [12]:
print('\n')
for file in run.get_file_names():
    print(file)



azureml-logs/60_control_log.txt
azureml-logs/70_driver_log.txt
logs/azureml/9_azureml.log
