# Training with Azure ML

Azure Machine Learning (Azure ML) provides a robust platform for training machine learning models at scale. This notebook guides you through the process of configuring and submitting a training job to Azure ML's managed compute resources, allowing you to leverage powerful GPU instances without managing the underlying infrastructure.

This notebook demonstrates how to submit a training job to Azure ML using the Python SDK. The job will:
- Train a pneumonia detection model using PyTorch
- Run on GPU compute for accelerated training
- Leverage the registered RSNA pneumonia dataset
- Use custom training parameters (epochs, learning rate, batch size)
- Track metrics and artifacts automatically with MLflow

By defining your training as an Azure ML job, you gain several advantages over running locally. Your experiments are automatically tracked, resources are provisioned on demand, and your training can scale to multiple GPUs if needed. The job definition includes everything needed to reproduce your training, from code and data to environment configuration and compute resources.

Once your job is submitted, you can monitor its progress through logs, metrics, and the Azure ML Studio interface. After training completes, the resulting model will be registered and can be deployed for inference or used in subsequent workflows.

## Setup Prerequisites

Before starting, ensure you have the following ready:

* You need access to an Azure ML workspace with appropriate permissions.
* The RSNA pneumonia detection dataset should already be registered in your workspace.
* Your Azure ML workspace should have quota for GPU compute resources.

This notebook assumes you've already completed the data preparation steps and have registered your dataset with Azure ML. We'll be using that registered dataset as input to our training job, making it available to our compute cluster during training.

## What you will do:

* Connect to your Azure ML workspace using the Python SDK.
* Set up GPU compute resources for model training.
* Create and register a GPU-enabled environment with PyTorch.
* Retrieve the RSNA pneumonia detection dataset from your workspace.
* Configure and submit a training job to Azure ML.
* Monitor the training job's progress by streaming its logs and through Azure ML Studio.
* Tune hyperparameters with a sweep job.

The Azure ML Python SDK v2 simplifies these tasks through a streamlined interface, allowing you to focus on model development rather than infrastructure management. Throughout this notebook, you'll learn how to use key SDK components to orchestrate your training workflow.
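
`MLClient.from_config` reads the workspace details from a `config.json` file. As a minimal sketch, assuming the file uses the usual three keys (`subscription_id`, `resource_group`, `workspace_name`), a small check like the one below can catch an incomplete file before you try to connect. The helper name `missing_config_keys` is hypothetical and not part of the SDK.

```python
import json
from pathlib import Path

# Keys assumed to be required in a workspace config.json
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_config_keys(config_path):
    """Return the required workspace keys that are absent or empty in config_path."""
    config = json.loads(Path(config_path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]
```

An empty list from `missing_config_keys("config.json")` suggests the file has everything the connection step needs.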



In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
from workshop_helpers.utils import get_unique_name

credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential)

unique_name = get_unique_name(credential)


## GPU Compute Target Setup

This cell configures a dedicated GPU compute environment for our machine learning training jobs. GPU acceleration is essential for efficiently training deep learning models, especially for computer vision tasks that require significant computational power. By establishing this compute target, we ensure our training pipeline has access to the necessary GPU resources without manually managing infrastructure.

The code performs a critical infrastructure step that:

- **Automates resource provisioning**: Eliminates the need for manual cluster creation through the Azure portal
- **Enables scalability**: Configures auto-scaling to optimize resource usage during training
- **Standardizes the environment**: Ensures all training jobs run on consistent, reproducible hardware
- **Supports cost management**: Sets minimum instances to zero to avoid charges when the cluster is idle

Steps:
- Check if a GPU compute cluster named `gpucluster-<unique_name>` already exists
- If not found, create a new compute target with:
  - GPU-enabled VM size (Standard_NC6s_v3)
  - Auto-scaling configuration (0 minimum to 5 maximum instances)
- Wait for cluster creation to complete
- Print the status of the compute resource

In [None]:
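from azure.ai.ml.entities import AmlCompute
from azure.core.exceptions import ResourceNotFoundError

# The unique suffix keeps this cluster name from colliding with other clusters
gpu_cluster_name = f"gpucluster-{unique_name}"
try:
    compute_target = ml_client.compute.get(gpu_cluster_name)
    print(f"Using existing GPU compute cluster: {gpu_cluster_name}")
except ResourceNotFoundError:
    # Only create the cluster when it does not exist; other errors should surface
    print(f"Creating new GPU compute cluster: {gpu_cluster_name}")
    compute_target = AmlCompute(
        name=gpu_cluster_name,
        size="Standard_NC6s_v3",  # GPU-enabled VM size, adjust if needed
        min_instances=0,  # Scale to zero when idle to avoid charges
        max_instances=5,
    )
    # .result() blocks until provisioning has finished
    compute_target = ml_client.compute.begin_create_or_update(compute_target).result()
    print(f"Created GPU compute cluster: {gpu_cluster_name}")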


## Machine Learning Environment Configuration

This cell defines and registers a specialized GPU-enabled environment that will be used to execute our training code in the cloud. The environment configuration is a critical component of our machine learning workflow as it ensures all dependencies, frameworks, and GPU drivers are consistently available across training runs.

By creating a dedicated environment with PyTorch and CUDA support, we establish a reproducible foundation that:

- **Guarantees compatibility**: The curated Docker image provides pre-tested PyTorch with CUDA drivers that are optimized for deep learning
- **Maintains consistency**: Environment versioning ensures all experiments run with identical dependencies
- **Enables reproducibility**: The combination of the base image and conda file creates a fully documented, reusable environment
- **Optimizes performance**: The environment is specifically configured to leverage GPU acceleration for faster model training

Steps:
- Define a GPU-enabled environment using a PyTorch/CUDA Docker image
- Incorporate additional dependencies from a conda YAML file
- Add metadata tags for environment categorization
- Register the environment in the Azure ML workspace
- Capture the environment name and version for reference in training jobs

In [None]:
from azure.ai.ml.entities import Environment

# This assumes 'src/environment.yml' exists and defines your Conda dependencies.
gpu_environment = Environment(
    name=f"uw-workshop-gpu-env-{unique_name}",
    description="GPU enabled environment",
    image="mcr.microsoft.com/azureml/curated/acpt-pytorch-1.13-cuda11.7:latest",
    conda_file="src/environment.yml",
    tags={"gpu": "true"}
)

# Register or update the environment in your workspace.
registered_env = ml_client.environments.create_or_update(gpu_environment)
print(f"Registered environment: {registered_env.name}:{registered_env.version}")


## Dataset Retrieval

This cell accesses our previously registered pneumonia dataset from the Azure ML workspace, ensuring our training job can reference the correct data assets. By retrieving the latest version of the dataset, we establish a connection to the centralized data resource that maintains consistency across all training runs and enables reproducibility of our experiments.

Steps:
- Specify the dataset name to retrieve
- Fetch the latest version of the registered dataset from the Azure ML workspace
- Confirm successful retrieval by displaying dataset name and version information
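
Fetching by the `latest` label is convenient, but the dataset it resolves to can change when a new version is registered. As a minimal sketch, the hypothetical helper below wraps `ml_client.data.get` so a notebook can either follow `latest` or pin an explicit version for stricter reproducibility. The helper name and its `version` argument are illustrative assumptions.

```python
def get_data_asset(ml_client, name, version=None):
    """Fetch a data asset by explicit version, or the latest one if version is None."""
    if version is None:
        # Follow whatever version currently carries the "latest" label
        return ml_client.data.get(name=name, label="latest")
    # Pin a specific version so reruns see exactly the same data
    return ml_client.data.get(name=name, version=str(version))
```

For example, `get_data_asset(ml_client, data_name, version=1)` would request version `"1"`, assuming that version exists in your workspace.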

In [None]:
data_name = f"rsna_pneumonia_dataset-{unique_name}"

# Retrieve the registered dataset
registered_data = ml_client.data.get(name=data_name, label="latest")
print(f"Retrieved dataset: {registered_data.name}, version: {registered_data.version}")


## Configuring and Submitting the Training Job

This cell represents the culmination of our workflow setup, where we define and launch the actual machine learning training job in Azure ML. By configuring a fully parameterized job definition, we create a reproducible, scalable, and tracked training process that can run on cloud infrastructure without manual intervention.

The job configuration brings together all previously established components:
- **Input Data**: References our registered pneumonia dataset
- **Hyperparameters**: Defines training parameters that control model behavior
- **Compute Resources**: Utilizes our GPU cluster for accelerated training
- **Environment**: Uses our custom PyTorch environment with all dependencies
- **Code Assets**: Points to our training script within the source directory
- **Output Management**: Configures MLflow tracking for experiment monitoring

This approach offers several critical advantages:
- **Reproducibility**: Every aspect of the training process is defined and versioned
- **Scalability**: Jobs can be executed on powerful cloud infrastructure
- **Separation of Concerns**: Code, data, compute, and parameters are managed independently
- **Experiment Tracking**: Integration with MLflow captures metrics and artifacts automatically
- **Workflow Automation**: The job can be incorporated into larger ML pipelines

Steps:
- Define job inputs including dataset path and hyperparameters
- Configure a command job with:
  - Source code location
  - Execution command with parameterized inputs
  - Compute target configuration
  - Environment specification
  - Experiment organization details
- Submit the job to Azure ML for execution
- Return the job object for tracking and monitoring
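
The execution command passes each input to `train.py` as a command-line flag. The training script itself is not shown in this notebook, so the following is only a hypothetical sketch of how it might parse those flags with `argparse`. The flag names come from the job command, while the defaults are assumptions that mirror the notebook's input values.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the job command passes to the training script."""
    parser = argparse.ArgumentParser(description="Hypothetical train.py arguments")
    parser.add_argument("--data_dir", type=str, required=True)
    parser.add_argument("--max_epochs", type=int, default=1)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--batch_size", type=int, default=32)
    parser.add_argument("--mlflow_model_dir", type=str, default="outputs/mlflow_model_dir")
    # argv=None makes argparse read sys.argv, which is what happens inside the job
    return parser.parse_args(argv)
```

Declaring `type=int` and `type=float` matters here because every substituted value arrives in the script as a string.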

In [None]:
from azure.ai.ml import command, Input
from azure.ai.ml.constants import AssetTypes

# Define job inputs including dataset and training hyperparameters
inputs = {
    "data_dir": Input(type=AssetTypes.URI_FOLDER, path=registered_data.path),  # Reference to pneumonia dataset
    "max_epochs": 1,
    "learning_rate": 0.001,
    "batch_size": 32,

    "mlflow_model_dir": "outputs/mlflow_model_dir"  # Location to store MLflow model
    }

# Configure the training job with code, compute, and environment
job = command(
    code="./src",  # Source code directory containing training script
    command="python train.py --data_dir ${{inputs.data_dir}} --max_epochs ${{inputs.max_epochs}} --learning_rate ${{inputs.learning_rate}} --batch_size ${{inputs.batch_size}} --mlflow_model_dir ${{inputs.mlflow_model_dir}}",
    compute=gpu_cluster_name,  # Use the GPU cluster defined earlier
    environment=registered_env,  # Use PyTorch GPU environment with dependencies
    inputs=inputs,
    experiment_name=f"workshop-{unique_name}",  # For organizing related jobs in AzureML
    display_name="test_job"  # Human-readable name in the AzureML UI
)

print("Job created with dataset input. Ready to submit.")
# Submit the job to AzureML and get a reference to the running job
test_job = ml_client.jobs.create_or_update(job)
test_job


## Job Output Streaming

This cell connects to the running training job and displays its logs in real-time, allowing us to monitor progress, track metrics, and detect any issues as they occur without leaving the notebook environment.

**_Note_: This cell will not complete until the job finishes execution.**
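
If you would rather not block the notebook on streamed logs, one alternative is to poll the job status. The sketch below assumes that `ml_client.jobs.get(name)` returns an object with a `status` attribute and that `Completed`, `Failed`, and `Canceled` are the terminal states of interest. The helper name and the `max_polls` argument are hypothetical.

```python
import time

# Statuses treated as final in this sketch (an assumption, adjust as needed)
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(ml_client, job_name, poll_seconds=30, max_polls=None):
    """Poll a job until it reaches a terminal status or max_polls is exhausted."""
    polls = 0
    while True:
        status = ml_client.jobs.get(job_name).status
        if status in TERMINAL_STATUSES:
            return status
        polls += 1
        if max_polls is not None and polls >= max_polls:
            # Give up waiting and report the last status seen
            return status
        time.sleep(poll_seconds)
```

Calling `wait_for_job(ml_client, test_job.name, max_polls=1)` gives a quick one-off status check.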

In [None]:
ml_client.jobs.stream(test_job.name)


## Hyperparameter Optimization with Sweep Job

This cell configures and launches an automated hyperparameter tuning experiment that systematically explores different learning rate and batch size values to optimize model performance. By leveraging Azure ML's sweep capabilities, we can efficiently discover optimal hyperparameters without manually testing each configuration.

### Parameter Sweeps
Parameter sweeps are essential in machine learning development for several key reasons:
- **Objective optimization**: Automatically identifies parameter combinations that maximize model performance
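- **Efficient exploration**: LogUniform sampling spreads trials evenly on a log scale. In Azure ML, `LogUniform(-10, -2)` samples exp(u) with u uniform on [-10, -2], so learning rates range from e^-10 to e^-2 (roughly 4.5e-5 to 0.135)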
- **Reduced human bias**: Systematic exploration may discover unintuitive but effective parameters humans might not try
- **Time savings**: Parallel trials dramatically reduce the time needed to find optimal configurations
- **Reproducibility**: The structured approach ensures the parameter search process is documented and repeatable

In this specific sweep, we're exploring the learning rate, one of the most critical hyperparameters that influences convergence speed and final model quality, together with the batch size.
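
To make the learning rate range concrete, here is a minimal standard-library sketch of log-uniform sampling as Azure ML defines it: the bounds are natural-log exponents, and each sample is exp(u) for u drawn uniformly between them. The function name `sample_log_uniform` is hypothetical, and the bounds `-10` and `-2` are taken from the sweep in this notebook.

```python
import math
import random


def sample_log_uniform(min_value, max_value, rng):
    """Draw exp(u) where u is uniform on [min_value, max_value]."""
    return math.exp(rng.uniform(min_value, max_value))


rng = random.Random(0)  # Seeded only so the sketch is repeatable
samples = [sample_log_uniform(-10, -2, rng) for _ in range(1000)]

# The bounds are exponents of e, so the range is about 4.5e-05 to 0.135
low, high = math.exp(-10), math.exp(-2)
print(f"Possible range: {low:.2e} to {high:.2e}")
print(f"Sampled range:  {min(samples):.2e} to {max(samples):.2e}")
```

To search a range expressed in powers of ten, convert the bounds with `math.log` first, for example `math.log(1e-5)` and `math.log(1e-2)`.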

### Compute Clusters with AzureML
Our sweep job leverages the GPU compute cluster we configured earlier, which provides several advantages:
- **Resource elasticity**: Scales between 0 and 5 nodes as needed, maximizing resource utilization
- **Parallel execution**: Runs multiple trials simultaneously, accelerating the optimization process
- **Cost management**: Auto-scaling ensures we only pay for compute when actively using it
- **Hardware optimization**: GPU acceleration dramatically speeds up each individual training run
- **Centralized management**: All compute resources are managed through the AzureML platform

Steps:
- Create a new job configuration that sweeps the learning rate (`LogUniform(-10, -2)`, which samples values between e^-10 and e^-2, roughly 4.5e-5 to 0.135) and the batch size (a choice of 16, 32, or 48), with `max_epochs` fixed at 20
- Configure sweep settings:
  - Random sampling strategy to explore parameter space
  - Maximum of 24 total trials with 3 running concurrently
  - Early termination policy (Bandit) to automatically stop underperforming trials
  - Target metric set to "val_goal_metric_val" for maximization
- Submit the sweep job for execution on our GPU cluster
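
The Bandit early termination rule can be sketched in a few lines. This is a simplified illustration, not the service's implementation: it assumes a positive metric that is being maximized, and that the policy is applied at every multiple of `evaluation_interval` at or beyond `delay_evaluation`. The function name is hypothetical, and the defaults mirror the values used in this notebook.

```python
def bandit_should_stop(trial_best, overall_best, interval,
                       slack_factor=0.15, evaluation_interval=1, delay_evaluation=3):
    """Return True if a trial would be stopped under a slack-factor rule (goal: Maximize)."""
    # Skip intervals before the delay and those that are not evaluation points
    if interval < delay_evaluation or interval % evaluation_interval != 0:
        return False
    # A trial survives only if it is within the slack of the best trial so far
    threshold = overall_best / (1 + slack_factor)
    return trial_best < threshold
```

With a best metric of 0.80 and a slack factor of 0.15, the threshold is 0.80 / 1.15, or about 0.696, so a trial at 0.60 would be stopped from interval 3 onward while one at 0.75 would continue.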

In [None]:
from azure.ai.ml import sweep

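# Calling the command job with new inputs returns a new job whose inputs are search spaces.
# LogUniform(-10, -2) samples exp(uniform(-10, -2)), i.e. values between e^-10 and e^-2.
command_for_sweep_job = job(
    learning_rate=sweep.LogUniform(-10, -2),
    max_epochs=20,
    batch_size=sweep.Choice([16, 32, 48]),
)
command_for_sweep_job.display_name = None
command_for_sweep_job.experiment_name = job.experiment_name

sweep_job = command_for_sweep_job.sweep(
    sampling_algorithm="random",
    primary_metric="val_goal_metric_val",  # Must match a metric name logged by train.py
    goal="Maximize",
    max_total_trials=24,
    max_concurrent_trials=3,
    early_termination_policy=sweep.BanditPolicy(
        slack_factor=0.15, evaluation_interval=1, delay_evaluation=3
    ),
)

# Submit the sweep and keep a reference to the returned job
returned_sweep_job = ml_client.jobs.create_or_update(sweep_job)
returned_sweep_job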
