# Run a training script with the Python SDK

You can use the Python SDK for Azure Machine Learning to submit scripts as jobs. By using jobs, you can easily keep track of the input parameters and outputs when training a machine learning model.

## Before you start

You'll need the latest version of the **azureml-core** package to run the code in this notebook. Run the cell below to verify that it is installed.

> **Note**:
> If the **azureml-core** package is not installed, run `pip install azureml-core` to install it.

In [2]:
%pip show azureml-core

Name: azureml-core
Version: 1.57.0.post1
Summary: Azure Machine Learning core packages, modules, and classes
Home-page: https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py
Author: Microsoft Corp
Author-email: 
License: https://aka.ms/azureml-sdk-license
Location: /opt/anaconda3/envs/automate/lib/python3.12/site-packages
Requires: adal, argcomplete, azure-common, azure-core, azure-graphrbac, azure-mgmt-authorization, azure-mgmt-containerregistry, azure-mgmt-keyvault, azure-mgmt-network, azure-mgmt-resource, azure-mgmt-storage, backports.tempfile, contextlib2, docker, humanfriendly, jmespath, jsonpickle, knack, msal, msal-extensions, msrest, msrestazure, ndg-httpsclient, packaging, paramiko, pathspec, pkginfo, PyJWT, pyopenssl, python-dateutil, pytz, requests, SecretStorage, urllib3
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, you're now ready to connect to your workspace.

To connect to a workspace, you need three identifying parameters: a subscription ID, a resource group name, and a workspace name. A `config.json` file containing these parameters can be downloaded from the Azure Machine Learning workspace or the Azure portal.
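As a rough illustration of the file that `Workspace.from_config` reads, the sketch below writes a `config.json` and reads it back using only the standard library. The three key names follow the downloaded file; the placeholder values and the helper `load_workspace_config` are hypothetical and exist only for this example.

```python
import json
import tempfile
from pathlib import Path

# The three identifiers a workspace config file is expected to carry.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Hypothetical helper: read a config file and check the identifiers are present."""
    config = json.loads(Path(path).read_text())
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config.json is missing: {', '.join(missing)}")
    return config


# Placeholder values (assumptions); replace them with your own identifiers.
example = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

# Write the file to a temporary folder and read it back.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(json.dumps(example, indent=4))
    loaded = load_workspace_config(config_path)

print(sorted(loaded))
```

A check like this can catch an incomplete file before the SDK call fails with a less specific error.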

In [3]:
from azureml.core import Workspace

ws = Workspace.from_config(path="../config.json")

## Create the Experiment

In [4]:
from azureml.core import Experiment
import pandas as pd

# create the experiment
experiment = Experiment(workspace=ws, name="diabetes-train-predict")


## Register the dataset

Azure Machine Learning provides several types of datastore, each of which can back a Dataset. Consider the type of datastore, your use case, and the associated costs when choosing the best data source. Here we use the default datastore, which is a `blob` datastore.

Due to dependency resolution challenges, I am skipping this step in the notebook. Instead, the data store was created manually in the Azure workspace.

To do this with Python instead:

1. Install the dependency:

    ```
    %pip install azureml-dataset-runtime
    ```

2. Run the code provided:

    ```python
    from azureml.core import Dataset

    blob_data_store = ws.get_default_datastore()
    diabetes_data_path = [
        (
            blob_data_store,
            "./diabetes.csv",
        )  # diabetes.csv should be in the same folder as this notebook
    ]
    diabetes_ds = Dataset.Tabular.from_delimited_files(path=diabetes_data_path)
    diabetes_ds = diabetes_ds.register(workspace=ws, name="diabetes_csv_data")
    ```

## Create the Experiment Python script to train and score a model

To train a model, you'll first create the **diabetes-training.py** script in the same folder as this notebook. The script uses the **diabetes.csv** file in that folder as the training data.

In [5]:
%%writefile diabetes-training.py
# import libraries
import os
from azureml.core import Run, Model
import joblib
import argparse
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Get the experiment run context 
run = Run.get_context()

# Parse the regularization rate and test-size hyperparameters
parser = argparse.ArgumentParser()
parser.add_argument('--reg-rate', type=float, dest='reg_rate', default=0.01)
parser.add_argument('--test-size', type=float, dest='test_size', default=0.30)
args = parser.parse_args()

reg_rate = args.reg_rate
test_size = args.test_size

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv('diabetes.csv')

# separate features and labels
X = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']].values
y = diabetes['Diabetic'].values

# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)

# train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg_rate)
model = LogisticRegression(C=1/reg_rate, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', float(acc))
run.log('Accuracy', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))
run.log('AUC', float(auc))

# Save the model
filename = 'outputs/model.pkl'
os.makedirs('outputs', exist_ok=True)
joblib.dump(value=model, filename= filename)

# complete the experiment
run.complete()

Overwriting diabetes-training.py
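To make the two logged metrics concrete, here is a minimal standard-library sketch of what they measure. It is not how scikit-learn computes them internally: accuracy is the fraction of matching predictions, and AUC is taken here as the fraction of positive/negative pairs in which the positive case gets the higher score (ties count as half). The toy labels, scores, the 0.5 threshold, and the function names are assumptions for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    matches = [int(t == p) for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Toy data (assumed): true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.4, 0.35, 0.8]

# Turn probabilities into class predictions with a 0.5 threshold.
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]

print("Accuracy:", accuracy(y_true, y_pred))    # 3 of 4 correct -> 0.75
print("AUC:", pairwise_auc(y_true, y_scores))   # 3 of 4 pairs ordered correctly -> 0.75
```

Accuracy depends on the chosen threshold, while AUC depends only on how the scores rank the cases, which is why the script logs both.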


## Submit the script to run as an experiment

Submit the script, which trains a classification model to predict diabetes, to run as an experiment. This creates a job based on the `ScriptRunConfig`.

Here, the experiment runs on the `local` compute target. An Azure compute instance can be targeted instead through the `RunConfiguration` of the `ScriptRunConfig`.

The test data size (`test_size`) and regularization rate (`reg_rate`) for the `LogisticRegression` model are passed as script arguments. Other parameters, such as a registered dataset in a datastore, could also be passed.
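To see how such arguments reach the script, here is a small standalone sketch that mirrors the parser in `diabetes-training.py`. It assumes the values arrive as command-line strings and that `argparse` converts them back to floats. The explicit lists passed to `parse_args` and the sample values are for illustration only.

```python
import argparse

# Same parser definition as in the training script.
parser = argparse.ArgumentParser()
parser.add_argument("--reg-rate", type=float, dest="reg_rate", default=0.01)
parser.add_argument("--test-size", type=float, dest="test_size", default=0.30)

# Simulate the command line built from an `arguments` list (sample values).
args = parser.parse_args(["--reg-rate", "0.05", "--test-size", "0.2"])
print(args.reg_rate, args.test_size)

# Arguments that are omitted fall back to the parser defaults.
defaults = parser.parse_args([])
print(defaults.reg_rate, defaults.test_size)

# The script passes the inverse of the rate to LogisticRegression as C,
# so a larger reg_rate means a smaller C and stronger regularization.
inverse_strength = 1 / args.reg_rate
print(inverse_strength)
```

Keeping defaults in the parser means the script still runs when it is launched without any arguments, for example during local debugging.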

In [10]:
from azureml.core import ScriptRunConfig, Environment

env = Environment.from_existing_conda_environment(name="automate", conda_environment_name= "automate")
env.python.user_managed_dependencies = True
env.register(workspace=ws)

# Define arguments / parameters

test_size = 0.30
reg_rate = 0.01

script_config = ScriptRunConfig(
    source_directory=".",
    script="diabetes-training.py",
    arguments=["--reg-rate", reg_rate, "--test-size", test_size],
    environment=env,
)

run = experiment.submit(config=script_config)
run.wait_for_completion(show_output=False)

Exporting conda specifications for existing conda environment: automate


No Python version provided, defaulting to "3.9.12"


{'runId': 'diabetes-train-predict_1726140719_e21f6e47',
 'target': 'local',
 'status': 'Finalizing',
 'startTimeUtc': '2024-09-12T11:32:01.943364Z',
 'services': {},
 'properties': {'_azureml.ComputeTargetType': 'local',
  '_azureml.ClusterName': 'local',
  'ContentSnapshotId': 'c691e897-eaa0-49bc-abb4-67efd3a0283e',
  'azureml.git.repository_uri': 'https://github.com/kennedyopokuasare/Azure_datascience.git',
  'mlflow.source.git.repoURL': 'https://github.com/kennedyopokuasare/Azure_datascience.git',
  'azureml.git.branch': 'main',
  'mlflow.source.git.branch': 'main',
  'azureml.git.commit': '5a488dd27302ac742cd34e097bd2c8fbd42e879a',
  'mlflow.source.git.commit': '5a488dd27302ac742cd34e097bd2c8fbd42e879a',
  'azureml.git.dirty': 'True'},
 'inputDatasets': [],
 'outputDatasets': [],
 'runDefinition': {'script': 'diabetes-training.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--reg-rate', '0.01', '--test-size', '0.3'],
  'sourceDirectoryDataStore': None,
  'framewo

## Register the model output

In [7]:
import sklearn
from azureml.core import Model

filename = 'outputs/model.pkl'

run.register_model(
    model_name="diabetes-classification-model",
    model_path = filename,
    description = "A LogisticRegression classification model for Diabetes",
    tags = { 'data-format':"CSV", "regularization-rate":reg_rate, "test-size": test_size},
    model_framework = Model.Framework.SCIKITLEARN,
    model_framework_version = str(sklearn.__version__)
)

Model(workspace=Workspace.create(name='Diabetes-prediction', subscription_id='ed463f81-92a5-476c-b6e1-82f1a28d21e2', resource_group='ml-prod-scale'), name=diabetes-classification-model, id=diabetes-classification-model:2, version=2, tags={'data-format': 'CSV', 'regularization-rate': '0.01', 'test-size': '0.3'}, properties={})