# Run Experiments with Scripts with Azure Machine Learning Python SDK (v2)

You can use the Python SDK for Azure Machine Learning to submit scripts as jobs. By using jobs, you can easily keep track of the input parameters and outputs when training a machine learning model.

## Before you start

You'll need the latest version of the **azure-ai-ml** package (imported as `azure.ai.ml`) to run the code in this notebook. Run the cell below to verify that it is installed.

> **Note**:
> If the **azure-ai-ml** package is not installed, run `pip install azure-ai-ml` to install it.

In [1]:
%pip show azure.ai.ml

Name: azure-ai-ml
Version: 1.20.0
Summary: Microsoft Azure Machine Learning Client Library for Python
Home-page: https://github.com/Azure/azure-sdk-for-python
Author: Microsoft Corporation
Author-email: azuresdkengsysadmins@microsoft.com
License: MIT License
Location: /opt/anaconda3/envs/automate/lib/python3.12/site-packages
Requires: azure-common, azure-core, azure-mgmt-core, azure-storage-blob, azure-storage-file-datalake, azure-storage-file-share, colorama, isodate, jsonschema, marshmallow, msrest, opencensus-ext-azure, opencensus-ext-logging, pydash, pyjwt, pyyaml, strictyaml, tqdm, typing-extensions
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## Connect to your workspace

With the required SDK packages installed, you're now ready to connect to your workspace.

To connect to a workspace, you need three identifiers: a subscription ID, a resource group name, and a workspace name. A `config.json` file containing these values can be downloaded from the Azure Machine Learning workspace or the Azure portal.

In [2]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential(), path="../../config.json")


Found the config file in: ../../config.json
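
`MLClient.from_config` reads the workspace identifiers from `config.json`. The sketch below shows the shape that file is expected to have, plus a quick standard-library check that the three identifiers are present before connecting. The helper name `load_workspace_config` and the placeholder values are illustrative assumptions, not part of the SDK.

```python
import json
import os
import tempfile
from pathlib import Path

# Keys that the downloaded config.json is expected to contain.
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def load_workspace_config(path):
    """Read a workspace config file and check the expected identifiers.

    Hypothetical helper for illustration; the SDK does its own parsing.
    """
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"config file is missing: {', '.join(missing)}")
    return {key: config[key] for key in REQUIRED_KEYS}


# Placeholder values only; a real file comes from the workspace or the portal.
sample = {
    "subscription_id": "<subscription-id>",
    "resource_group": "<resource-group>",
    "workspace_name": "<workspace-name>",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "config.json")
    Path(sample_path).write_text(json.dumps(sample, indent=4), encoding="utf-8")
    print(load_workspace_config(sample_path))
```

If no config file is available, the same three values can be passed to the `MLClient` constructor directly through its `subscription_id`, `resource_group_name`, and `workspace_name` parameters.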


## Register the dataset

Azure Machine Learning provides several types of datastore that can hold a data asset. Consider the datastore type, your use case, and the associated costs when choosing one. Here we use the default datastore, which is a `blob` datastore.

**Authentication with the `Azure CLI` is required here.**
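
A registered data version cannot be created twice: re-running the registration for an existing name and version fails with a `UserError`, which is why the cell below wraps the call in `try`/`except`. One hedged alternative is to look the version up first and create it only when it is absent. The sketch below is written against a duck-typed `data_ops` object so that it runs without a workspace. The helper `register_if_missing` and the `InMemoryDataOps` stand-in are hypothetical. In the notebook, `data_ops` would be `ml_client.data`, and `not_found_error` would presumably be `azure.core.exceptions.ResourceNotFoundError` (an assumption worth confirming against the installed SDK version).

```python
from types import SimpleNamespace


def register_if_missing(data_ops, asset, not_found_error):
    """Return the existing data version, or create it when it is absent.

    data_ops        -- exposes get(name=..., version=...) and create_or_update(asset)
    asset           -- object with .name and .version attributes
    not_found_error -- exception class that get() raises for a missing version
    """
    try:
        return data_ops.get(name=asset.name, version=asset.version)
    except not_found_error:
        return data_ops.create_or_update(asset)


class InMemoryDataOps:
    """Stand-in for ml_client.data, used only to exercise the helper."""

    def __init__(self):
        self._store = {}
        self.create_calls = 0

    def get(self, name, version):
        try:
            return self._store[(name, version)]
        except KeyError:
            raise LookupError(f"{name}:{version} not found") from None

    def create_or_update(self, asset):
        self.create_calls += 1
        self._store[(asset.name, asset.version)] = asset
        return asset


ops = InMemoryDataOps()
asset = SimpleNamespace(name="diabetes_csv", version="1.0.0")

# The first call creates the version; the second call finds it and skips creation.
register_if_missing(ops, asset, LookupError)
register_if_missing(ops, asset, LookupError)
print("create_or_update calls:", ops.create_calls)  # → create_or_update calls: 1
```

Passing the exception class in as a parameter keeps the helper free of SDK imports. Against a real workspace, any other error (for example an authentication failure) still propagates instead of being swallowed.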

In [3]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

file = "../../data/diabetes.csv"

diabetes_data = Data(
    name="diabetes_csv",
    path=file,
    type=AssetTypes.URI_FILE,
    description="Dataset for Diabetes model training",
    tags={"source_type": "file", "source": "Local file"},
    version="1.0.0",
)

try:
    # Registration fails if this name and version already exist
    diabetes_data = ml_client.data.create_or_update(diabetes_data)
except Exception as ex:
    print("Exception while registering dataset ", ex)

Exception while registering dataset  (UserError) A data version with this name and version already exists. If you are trying to create a new data version, use a different name or version. If you are trying to update an existing data version, the existing asset's data uri cannot be changed. Only tags, description, and isArchived can be updated.
Code: UserError
Message: A data version with this name and version already exists. If you are trying to create a new data version, use a different name or version. If you are trying to update an existing data version, the existing asset's data uri cannot be changed. Only tags, description, and isArchived can be updated.
Additional Information:Type: ComponentName
Info: {
    "value": "managementfrontend"
}Type: Correlation
Info: {
    "value": {
        "operation": "ea58a3e20ff445203b9df2e7f582fb9e",
        "request": "75ae9d9acc26925e"
    }
}Type: Environment
Info: {
    "value": "northeurope"
}Type: Location
Info: {
    "value": "northeurope"

## Create the Python script to train and score a model

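To train a model, you'll first create the **diabetes-training.py** script in the **src** folder. The script reads the **diabetes.csv** data, trains a logistic regression classifier, prints its accuracy and AUC, and saves the model to **outputs/model.pkl**.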

In [39]:
%%writefile src/diabetes-training.py
# import libraries
import os
import argparse
import pandas as pd
import numpy as np
import pickle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve

# Parse job parameters
parser = argparse.ArgumentParser()
parser.add_argument('--reg-rate', type=float, dest='reg_rate', default=0.01)
parser.add_argument('--test-size', type=float, dest='test_size', default=0.30)
parser.add_argument('--data-set', type=str,dest="data")
args = parser.parse_args()

reg_rate = args.reg_rate
test_size = args.test_size
print("Test data size:", test_size)
print("Regularization rate:", reg_rate)

# load the diabetes dataset
print("Loading Data...")
diabetes = pd.read_csv(args.data, header=0)

print("num_samples:", diabetes.shape[0])
features = diabetes[['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']]
print("num_features:", features.shape[1])
print("features:", features.columns.values)

# separate features and labels
X = features.values
y = diabetes['Diabetic'].values

# split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=0)

# train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg_rate)
model = LogisticRegression(C=1/reg_rate, solver="liblinear").fit(X_train, y_train)

# calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print('Accuracy:', float(acc))

# calculate AUC
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test,y_scores[:,1])
print('AUC: ' + str(auc))

# Save the model to file
print("Saving model to file")
filename = 'outputs/model.pkl'
os.makedirs('outputs', exist_ok=True)
with open(filename, 'wb') as file:
    pickle.dump(model,file)

Overwriting src/diabetes-training.py
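
The script declares its data argument as `--data-set`, while the job command in the next section passes `--data`. This works because `argparse` accepts unambiguous prefixes of long options by default. The sketch below rebuilds the same three arguments in a hypothetical `build_parser` function and parses explicit argument lists, so the behaviour can be checked locally without submitting a job. The file name passed in is just a placeholder string.

```python
import argparse


def build_parser():
    """Recreate the training script's arguments for a local check."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--reg-rate", type=float, dest="reg_rate", default=0.01)
    parser.add_argument("--test-size", type=float, dest="test_size", default=0.30)
    parser.add_argument("--data-set", type=str, dest="data")
    return parser


parser = build_parser()

# Full option name, with --reg-rate left at its default.
full = parser.parse_args(["--data-set", "diabetes.csv", "--test-size", "0.2"])
print(full.data, full.test_size, full.reg_rate)  # → diabetes.csv 0.2 0.01

# Abbreviated option name: --data is a unique prefix of --data-set.
short = parser.parse_args(["--data", "diabetes.csv", "--reg-rate", "0.1"])
print(short.data, short.test_size, short.reg_rate)  # → diabetes.csv 0.3 0.1
```

Prefix matching stops working as soon as a second option starting with `--data` is added, and it can be switched off with `argparse.ArgumentParser(allow_abbrev=False)`. Using the full option name in the job command is the more robust choice.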


## Submit the script to run as a job

Submit the script that trains a classification model to predict diabetes so that it runs on Azure ML. This creates a job based on the specifications of the command.

The `environment` was created in the Azure ML workspace beforehand, but it can also be created with a script.

The test data size (`test_size`) and regularization rate (`reg_rate`) for the `LogisticRegression` model are passed as parameters. Other parameters, such as a registered dataset in a datastore, can also be passed.

In [41]:
from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes, InputOutputModes

# Look up the registered data asset to pass as the job input
diabetes_data = ml_client.data.get("diabetes_csv", version="1.0.0")

# Define arguments / parameters
test_size = 0.30
reg_rate = 0.01

run_command = command(
    code="./src",
    command="python diabetes-training.py --data ${{inputs.data}} --test-size ${{inputs.test_size}} --reg-rate ${{inputs.reg_rate}} ",
    inputs=dict(
        data= Input(
            path=diabetes_data.id,
            type=AssetTypes.URI_FILE,
            mode=InputOutputModes.RO_MOUNT,
        ),
        reg_rate = reg_rate,
        test_size = test_size,
    ),
    environment="diabest-train:8",
    experiment_name = "diabetes-training"
)

returned_job = ml_client.jobs.create_or_update(run_command)
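
The `${{inputs.<name>}}` placeholders in the command string are filled in from the `inputs` dictionary when the job runs. The sketch below only mimics that effect with a regular expression, to make the resulting command line easy to picture. It is not how Azure ML performs the substitution. The `render_command` helper and the mounted data path are made-up placeholders.

```python
import re

# Matches placeholders of the form ${{inputs.name}}.
PLACEHOLDER = re.compile(r"\$\{\{\s*inputs\.(\w+)\s*\}\}")


def render_command(template, inputs):
    """Replace each ${{inputs.name}} with the string form of inputs[name]."""

    def substitute(match):
        name = match.group(1)
        if name not in inputs:
            raise KeyError(f"no input named {name!r}")
        return str(inputs[name])

    return PLACEHOLDER.sub(substitute, template)


template = (
    "python diabetes-training.py --data ${{inputs.data}} "
    "--test-size ${{inputs.test_size}} --reg-rate ${{inputs.reg_rate}}"
)
inputs = {
    # Hypothetical path; the real mount location is chosen by the service.
    "data": "/mnt/inputs/data/diabetes.csv",
    "test_size": 0.30,
    "reg_rate": 0.01,
}

print(render_command(template, inputs))
# → python diabetes-training.py --data /mnt/inputs/data/diabetes.csv --test-size 0.3 --reg-rate 0.01
```

Rendering the command locally like this is a cheap way to catch a placeholder that has no matching key in `inputs` before submitting the job.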