# Azure ML Training Tutorial: LSTM Time Series Forecasting

This notebook provides a step-by-step guide to prepare, submit, and monitor LSTM training jobs on Azure ML. You'll learn how to:

1. **Setup Azure ML Environment** - Connect to workspace and configure resources
2. **Create Training Scripts** - Develop Azure ML optimized training code
3. **Submit Remote Jobs** - Execute training on Azure ML compute clusters
4. **Monitor Progress** - Track job status and retrieve results

## Prerequisites

- Azure ML workspace (created in `01_setup_workspace.ipynb`)
- Environment variables configured for Azure authentication
- Azure CLI authenticated or service principal setup

## Learning Objectives

By the end of this tutorial, you will:
- ✅ Understand Azure ML training job workflow
- ✅ Create self-contained training scripts for remote execution
- ✅ Configure environments and compute resources
- ✅ Submit and monitor training jobs
- ✅ Retrieve training outputs and artifacts

Let's get started! 🚀

## Step 1: Install and Import Required Libraries

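First, we'll import the libraries needed for Azure ML operations and load settings from the `.env` file. If any import fails, install `azure-ai-ml`, `azure-identity`, and `python-dotenv` before running the cell.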

In [None]:
# Import necessary libraries
import os
import sys
import time
from pathlib import Path

# Azure ML imports
from azure.ai.ml import MLClient, command
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Utility imports
from dotenv import find_dotenv, load_dotenv

print("📚 Importing libraries...")

# Load environment variables
load_dotenv(find_dotenv(".env"))

print("✅ All imports successful!")
print("🔧 Environment variables loaded from .env file")

In [None]:
# Add parent directory to path for module imports
parent_dir = os.path.dirname(os.path.dirname(os.path.abspath('__file__')))
modules_dir = os.path.join(parent_dir, 'src')
if modules_dir not in sys.path:
    sys.path.append(modules_dir)
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

print(f"Parent directory: {parent_dir}")
print(f"Modules directory: {modules_dir}")

## Step 2: Configure Azure ML Workspace Connection

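Next, we'll establish a connection to your Azure ML workspace using `DefaultAzureCredential`, which tries several authentication methods in turn (environment variables, managed identity, Azure CLI, and others).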

In [None]:
# Load Azure ML workspace configuration from environment variables
subscription_id = os.getenv("AZURE_SUBSCRIPTION_ID")
resource_group = os.getenv("AZURE_RESOURCE_GROUP")
workspace_name = os.getenv("AZURE_ML_WORKSPACE")

print("🔧 Azure ML Configuration:")
print(f"   Subscription ID: {subscription_id}")
print(f"   Resource Group: {resource_group}")
print(f"   Workspace Name: {workspace_name}")

# Validate required configuration
if not all([subscription_id, resource_group, workspace_name]):
    print("\n❌ Missing required environment variables!")
    print("Please set the following in your .env file:")
    print("   - AZURE_SUBSCRIPTION_ID")
    print("   - AZURE_RESOURCE_GROUP")
    print("   - AZURE_ML_WORKSPACE")
    raise ValueError("Missing Azure ML configuration")

print("\n✅ Configuration validation passed!")

In [None]:
# Initialize Azure ML client with DefaultAzureCredential
print("🔐 Authenticating with Azure...")

try:
    # Use DefaultAzureCredential for secure authentication
    # This supports multiple auth methods: managed identity, Azure CLI, etc.
    credential = DefaultAzureCredential()

    # Create Azure ML client
    ml_client = MLClient(
        credential=credential,
        subscription_id=subscription_id,
        resource_group_name=resource_group,
        workspace_name=workspace_name
    )

    # Test connection by retrieving workspace details
    workspace = ml_client.workspaces.get(workspace_name)

    print("✅ Successfully connected to Azure ML workspace!")
    print(f"   Workspace: {workspace.name}")
    print(f"   Location: {workspace.location}")
    print(f"   Resource Group: {workspace.resource_group}")

except Exception as e:
    print(f"❌ Error connecting to Azure ML workspace: {str(e)}")
    print("\n🔧 Troubleshooting tips:")
    print("   1. Ensure you're authenticated with Azure CLI: 'az login'")
    print("   2. Verify workspace exists and you have access")
    print("   3. Check environment variables are correct")
    raise

## Step 3: Setup and Validate Compute Resources

We need compute resources to run our training jobs. Let's check existing compute or create new ones.

In [None]:
# List existing compute resources
print("🖥️ Checking existing compute resources...")

try:
    compute_list = list(ml_client.compute.list())

    if compute_list:
        print(f"✅ Found {len(compute_list)} compute resource(s):")
        for compute in compute_list:
            print(f"   - {compute.name} ({compute.type}) - {compute.provisioning_state}")
    else:
        print("⚠️ No compute resources found in workspace")

except Exception as e:
    print(f"❌ Error listing compute resources: {str(e)}")
    compute_list = []
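
Once the workspace's compute resources are listed, one of them has to be chosen as the training target. The sketch below shows one possible selection rule: prefer a named cluster and otherwise fall back to the first usable one. `ComputeInfo` is a hypothetical stand-in for the objects returned by `ml_client.compute.list()`, and the `"amlcompute"` type and `"Succeeded"` state strings are assumptions about how those objects report themselves.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ComputeInfo:
    """Hypothetical stand-in for an entry of ml_client.compute.list()."""
    name: str
    type: str
    provisioning_state: str


def pick_cluster(computes: Iterable[ComputeInfo], preferred: str) -> Optional[str]:
    """Return the preferred cluster if usable, else the first usable cluster."""
    # Assumption: training clusters report type "amlcompute" and state "Succeeded".
    usable = [
        c for c in computes
        if c.type.lower() == "amlcompute" and c.provisioning_state == "Succeeded"
    ]
    for c in usable:
        if c.name == preferred:
            return c.name
    return usable[0].name if usable else None


computes = [
    ComputeInfo("dev-instance", "computeinstance", "Succeeded"),
    ComputeInfo("gpu-cluster", "amlcompute", "Creating"),
    ComputeInfo("cpu-cluster", "amlcompute", "Succeeded"),
]
print(pick_cluster(computes, preferred="training-cluster"))  # cpu-cluster
```

Returning `None` rather than raising leaves it to the caller to decide whether to create a new cluster or stop.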

In [None]:
# # Create compute cluster if needed
# from azure.ai.ml.entities import AmlCompute

# compute_name = "training-cluster"
# found_compute = None

# # Check if our target compute exists
# for compute in compute_list:
#     if compute.name == compute_name:
#         found_compute = compute
#         break

# if found_compute:
#     print(f"✅ Using existing compute cluster: {compute_name}")
#     print(f"   State: {found_compute.provisioning_state}")
#     print(f"   VM Size: {found_compute.size}")
# else:
#     print(f"🔨 Creating new compute cluster: {compute_name}")

#     # Define compute cluster configuration
#     compute_cluster = AmlCompute(
#         name=compute_name,
#         description="CPU cluster for LSTM training",
#         size="Standard_D2s_v3",  # 2 cores, 8GB RAM - good for small models
#         min_instances=0,         # Scale to zero when not in use
#         max_instances=4,         # Maximum nodes
#         idle_time_before_scale_down=1800  # 30 minutes
#     )

#     try:
#         # Create the compute cluster
#         created_compute = ml_client.compute.begin_create_or_update(compute_cluster)
#         print("⏳ Creating compute cluster (this may take a few minutes)...")

#         # Note: We don't wait for completion as it can take several minutes
#         print("✅ Compute cluster creation initiated!")
#         print(f"   Monitor progress in Azure ML Studio")

#     except Exception as e:
#         print(f"❌ Error creating compute cluster: {str(e)}")
#         print("💡 You can use any existing compute cluster for training")

#         # Fallback to first available compute
#         if compute_list:
#             compute_name = compute_list[0].name
#             print(f"🔄 Using fallback compute: {compute_name}")
#         else:
#             raise RuntimeError("No compute resources available")

# Setup compute cluster
from mlops.compute.setup_compute import ComputeManager

compute_manager = ComputeManager()

# Name of the training cluster; the job submission cells refer to it as compute_name
compute_name = "cpu-cluster"

# Create CPU compute cluster
cpu_cluster = compute_manager.create_compute_cluster(
    cluster_name=compute_name,
    vm_size="Standard_D32ds_v5",
    max_instances=4
)

print(f"\n🎯 Target compute for training: {cpu_cluster}")

## Step 4: Create and Register Azure ML Environment

We need to define the software environment (Python packages) for our training job.

In [None]:
# Create training environment directory
training_dir = Path("../src/azure_ml_training")
training_dir.mkdir(parents=True, exist_ok=True)

print(f"📁 Created training directory: {training_dir}")

# Create conda environment specification
environment_content = """
name: pytorch-lstm-env
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.9
  - pytorch>=1.12.0
  - numpy
  - pandas
  - scikit-learn
  - pip
  - pip:
    - mlflow>=2.0.0
    - azure-ai-ml
    - joblib
"""

# Write environment file
env_file_path = training_dir / "environment.yml"
with open(env_file_path, 'w') as f:
    f.write(environment_content.strip())

print(f"✅ Created environment file: {env_file_path}")

# Also create requirements.txt for reference
requirements_content = """
torch>=1.12.0
numpy
pandas
scikit-learn
mlflow>=2.0.0
azure-ai-ml
joblib
"""

requirements_path = training_dir / "requirements.txt"
with open(requirements_path, 'w') as f:
    f.write(requirements_content.strip())

print(f"✅ Created requirements file: {requirements_path}")
print("📦 Environment includes: PyTorch, MLflow, scikit-learn, and Azure ML SDK")
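
The conda specification and `requirements.txt` above list the same packages twice, so they can drift apart when one is edited. As a sketch of one way to avoid that, both files can be rendered from a single pair of lists. The helper names (`render_conda_yaml`, `render_requirements`) and the `CONDA_TO_PIP` mapping are illustrative and not part of the tutorial's code.

```python
import re

CONDA_PACKAGES = ["python=3.9", "pytorch>=1.12.0", "numpy", "pandas", "scikit-learn", "pip"]
PIP_PACKAGES = ["mlflow>=2.0.0", "azure-ai-ml", "joblib"]
# conda and pip use different names for PyTorch.
CONDA_TO_PIP = {"pytorch": "torch"}


def render_conda_yaml(name, channels, conda_pkgs, pip_pkgs):
    """Build the text of a conda environment file from package lists."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {pkg}" for pkg in conda_pkgs]
    lines.append("  - pip:")
    lines += [f"    - {pkg}" for pkg in pip_pkgs]
    return "\n".join(lines)


def render_requirements(conda_pkgs, pip_pkgs):
    """Build requirements.txt text, translating conda names to pip names."""
    requirements = []
    for spec in conda_pkgs:
        pkg = re.split(r"[=<>!]", spec, maxsplit=1)[0]
        if pkg in ("python", "pip"):
            continue  # interpreter and installer are not pip requirements
        requirements.append(CONDA_TO_PIP.get(pkg, pkg) + spec[len(pkg):])
    return "\n".join(requirements + list(pip_pkgs))


print(render_conda_yaml("pytorch-lstm-env", ["pytorch", "conda-forge", "defaults"],
                        CONDA_PACKAGES, PIP_PACKAGES))
print(render_requirements(CONDA_PACKAGES, PIP_PACKAGES))
```

The rendered strings can be written to `environment.yml` and `requirements.txt` in place of the hand-written literals.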

In [None]:
# Register the environment with Azure ML
environment_name = "pytorch-lstm-env"

print(f"🔄 Registering environment: {environment_name}")

try:
    # Create environment definition
    pytorch_env = Environment(
        name=environment_name,
        description="PyTorch environment for LSTM time series forecasting",
        conda_file=str(env_file_path),
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest"
    )

    # Register with Azure ML
    registered_env = ml_client.environments.create_or_update(pytorch_env)

    print("✅ Environment registered successfully!")
    print(f"   Name: {registered_env.name}")
    print(f"   Version: {registered_env.version}")
    print("   Status: Ready for use")

    # Store for later use
    env_reference = f"{registered_env.name}:{registered_env.version}"

except Exception as e:
    print(f"⚠️ Error registering environment: {str(e)}")
    print("🔄 Falling back to curated environment...")

    # Use a curated PyTorch environment as fallback.
    # The newest version is referenced as "name@latest"; "name:latest" would be
    # read as a literal version called "latest".
    env_reference = "AzureML-pytorch-1.13-ubuntu20.04-py38-cpu-inference@latest"
    print(f"✅ Using curated environment: {env_reference}")

print(f"\n🎯 Environment for training: {env_reference}")
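
A job refers to its environment by a string. In the Azure ML SDK v2, a pinned version is written `name:version`, and the newest registered version is written `name@latest`. The helper below is a small sketch of that convention; the function name `environment_reference` is illustrative.

```python
from typing import Optional


def environment_reference(name: str, version: Optional[str] = None) -> str:
    """Return 'name:version' when pinned, otherwise 'name@latest'."""
    if version is None:
        return f"{name}@latest"
    return f"{name}:{version}"


print(environment_reference("pytorch-lstm-env", "3"))  # pytorch-lstm-env:3
print(environment_reference("pytorch-lstm-env"))       # pytorch-lstm-env@latest
```

Pinning a version keeps reruns reproducible, while `@latest` picks up whatever was registered most recently.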

## Step 5: Create Training Script for Azure ML

Now we'll create a training script specifically optimized for Azure ML. This script will include data generation, model training, and MLflow integration for experiment tracking.

In [None]:
# Create Azure ML optimized training script
training_script_content = '''
import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
import mlflow
import mlflow.pytorch
import argparse
import os
from sklearn.metrics import mean_squared_error, mean_absolute_error
import json


class LSTMTimeSeriesModel(nn.Module):
    """LSTM model for time series forecasting"""

    def __init__(self, input_size=1, hidden_size=50, num_layers=2, output_size=1, dropout=0.2):
        super(LSTMTimeSeriesModel, self).__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.lstm = nn.LSTM(input_size, hidden_size, num_layers,
                           batch_first=True, dropout=dropout)
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        # Initialize hidden and cell state on the same device as the input
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size, device=x.device)

        # Forward propagate LSTM
        out, _ = self.lstm(x, (h0, c0))

        # Apply dropout and linear layer to the last output
        out = self.dropout(out[:, -1, :])
        out = self.linear(out)
        return out


def generate_synthetic_data(n_samples=1000, sequence_length=50):
    """Generate synthetic time series data"""
    print(f"🔄 Generating {n_samples} samples with sequence length {sequence_length}")

    # Generate time series with trend and seasonality
    t = np.linspace(0, 4*np.pi, n_samples + sequence_length)

    # Create complex time series
    trend = 0.01 * t
    seasonal = 2 * np.sin(t) + 0.5 * np.sin(3*t)
    noise = 0.1 * np.random.randn(len(t))

    data = trend + seasonal + noise

    # Create sequences
    X, y = [], []
    for i in range(len(data) - sequence_length):
        X.append(data[i:(i + sequence_length)])
        y.append(data[i + sequence_length])

    X = np.array(X).reshape(-1, sequence_length, 1)
    y = np.array(y).reshape(-1, 1)

    # Train/test split
    split_idx = int(0.8 * len(X))
    X_train, X_test = X[:split_idx], X[split_idx:]
    y_train, y_test = y[:split_idx], y[split_idx:]

    print(f"✅ Data generated - Train: {X_train.shape}, Test: {X_test.shape}")

    return X_train, X_test, y_train, y_test


def train_model(X_train, y_train, X_test, y_test, config):
    """Train the LSTM model with MLflow tracking"""

    # Convert to PyTorch tensors
    X_train_tensor = torch.FloatTensor(X_train)
    y_train_tensor = torch.FloatTensor(y_train)
    X_test_tensor = torch.FloatTensor(X_test)
    y_test_tensor = torch.FloatTensor(y_test)

    # Initialize model
    model = LSTMTimeSeriesModel(
        input_size=config['input_size'],
        hidden_size=config['hidden_size'],
        num_layers=config['num_layers'],
        output_size=config['output_size'],
        dropout=config['dropout']
    )

    # Loss and optimizer
    criterion = nn.MSELoss()
    optimizer = optim.Adam(model.parameters(), lr=config['learning_rate'])

    print(f"🚀 Starting training for {config['epochs']} epochs")

    # Training loop
    model.train()
    for epoch in range(config['epochs']):
        optimizer.zero_grad()

        # Forward pass
        outputs = model(X_train_tensor)
        loss = criterion(outputs, y_train_tensor)

        # Backward pass
        loss.backward()
        optimizer.step()

        # Log metrics every 10 epochs
        if (epoch + 1) % 10 == 0:
            model.eval()
            with torch.no_grad():
                test_outputs = model(X_test_tensor)
                test_loss = criterion(test_outputs, y_test_tensor)

                # Log to MLflow
                mlflow.log_metric("train_loss", loss.item(), step=epoch)
                mlflow.log_metric("test_loss", test_loss.item(), step=epoch)

                print(f"Epoch [{epoch+1}/{config['epochs']}] - "
                      f"Train Loss: {loss.item():.6f}, Test Loss: {test_loss.item():.6f}")

            model.train()

    # Final evaluation
    model.eval()
    with torch.no_grad():
        train_pred = model(X_train_tensor).numpy()
        test_pred = model(X_test_tensor).numpy()

        # Calculate final metrics
        train_mse = mean_squared_error(y_train, train_pred)
        test_mse = mean_squared_error(y_test, test_pred)
        train_mae = mean_absolute_error(y_train, train_pred)
        test_mae = mean_absolute_error(y_test, test_pred)

        # Log final metrics
        mlflow.log_metric("final_train_mse", train_mse)
        mlflow.log_metric("final_test_mse", test_mse)
        mlflow.log_metric("final_train_mae", train_mae)
        mlflow.log_metric("final_test_mae", test_mae)

        print(f"\\n📊 Final Results:")
        print(f"   Train MSE: {train_mse:.6f}")
        print(f"   Test MSE: {test_mse:.6f}")
        print(f"   Train MAE: {train_mae:.6f}")
        print(f"   Test MAE: {test_mae:.6f}")

    return model, test_mse


def main():
    """Main training function"""
    parser = argparse.ArgumentParser(description='LSTM Time Series Training')
    parser.add_argument('--epochs', type=int, default=100, help='Number of epochs')
    parser.add_argument('--learning_rate', type=float, default=0.001, help='Learning rate')
    parser.add_argument('--hidden_size', type=int, default=50, help='LSTM hidden size')
    parser.add_argument('--num_layers', type=int, default=2, help='Number of LSTM layers')
    parser.add_argument('--dropout', type=float, default=0.2, help='Dropout rate')
    parser.add_argument('--sequence_length', type=int, default=50, help='Input sequence length')
    parser.add_argument('--n_samples', type=int, default=1000, help='Number of data samples')

    args = parser.parse_args()

    # Configuration
    config = {
        'epochs': args.epochs,
        'learning_rate': args.learning_rate,
        'hidden_size': args.hidden_size,
        'num_layers': args.num_layers,
        'dropout': args.dropout,
        'sequence_length': args.sequence_length,
        'n_samples': args.n_samples,
        'input_size': 1,
        'output_size': 1
    }

    print("🎯 Starting Azure ML LSTM Training")
    print(f"📋 Configuration: {json.dumps(config, indent=2)}")

    # Set up MLflow
    mlflow.start_run()

    try:
        # Log parameters
        mlflow.log_params(config)

        # Generate data
        X_train, X_test, y_train, y_test = generate_synthetic_data(
            n_samples=config['n_samples'],
            sequence_length=config['sequence_length']
        )

        # Train model
        model, test_mse = train_model(X_train, y_train, X_test, y_test, config)

        # Save model
        model_path = "lstm_model"
        mlflow.pytorch.log_model(model, model_path)

        # Create model info file
        model_info = {
            "model_type": "LSTM Time Series",
            "framework": "PyTorch",
            "final_test_mse": float(test_mse),
            "parameters": config
        }

        # Save model info
        with open("model_info.json", "w") as f:
            json.dump(model_info, f, indent=2)

        mlflow.log_artifact("model_info.json")

        print(f"\\n✅ Training completed successfully!")
        print(f"📁 Model saved to MLflow")
        print(f"🎯 Final Test MSE: {test_mse:.6f}")

    except Exception as e:
        print(f"❌ Training failed: {str(e)}")
        mlflow.log_param("status", "failed")
        mlflow.log_param("error", str(e))
        raise

    finally:
        mlflow.end_run()


if __name__ == "__main__":
    main()
'''

# Create the training script file
training_script_path = Path("../src/azure_ml_training/train_lstm_azureml.py")
training_script_path.parent.mkdir(parents=True, exist_ok=True)

with open(training_script_path, 'w') as f:
    f.write(training_script_content)

print(f"✅ Training script created: {training_script_path}")
print("📋 Script features:")
print("   • LSTM model with configurable architecture")
print("   • Synthetic time series data generation")
print("   • MLflow experiment tracking")
print("   • Command-line argument parsing")
print("   • Error handling and logging")
print("   • Model saving and artifact logging")
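
The training script turns one long series into supervised pairs with a sliding window: each input is `sequence_length` consecutive values and the target is the value that follows. The sketch below repeats that loop on a tiny list so the shapes are easy to check by hand, then works through the arithmetic for the script's default arguments. `make_windows` is an illustrative name that mirrors the loop in `generate_synthetic_data`.

```python
def make_windows(series, sequence_length):
    """Split a series into (window, next value) pairs."""
    X, y = [], []
    for i in range(len(series) - sequence_length):
        X.append(series[i:i + sequence_length])
        y.append(series[i + sequence_length])
    return X, y


series = list(range(10))
X, y = make_windows(series, sequence_length=3)
print(X[0], y[0])    # [0, 1, 2] 3
print(X[-1], y[-1])  # [6, 7, 8] 9
print(len(X))        # 7

# With the script defaults the series has n_samples + sequence_length points,
# so the window count equals n_samples and the 80/20 split is 800/200.
n_samples, sequence_length = 1000, 50
n_windows = (n_samples + sequence_length) - sequence_length
split_idx = int(0.8 * n_windows)
print(n_windows, split_idx, n_windows - split_idx)  # 1000 800 200
```

Because the split is by position rather than random, the test set is the final 20% of the timeline, which avoids leaking future values into training.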

## Step 6: Submit Training Job to Azure ML

Now we'll create and submit a training job to Azure ML. This will run our training script on the remote compute cluster.

In [None]:
# Training job configuration (the `command` builder was imported in Step 1)
job_config = {
    "experiment_name": "lstm-time-series-tutorial",
    "display_name": "LSTM Time Series Training - Tutorial",
    "description": "Training LSTM model for time series forecasting on Azure ML",
    "code_path": "../src/azure_ml_training",  # Directory containing our training script
    "command": "python train_lstm_azureml.py --epochs 50 --learning_rate 0.001 --hidden_size 64 --num_layers 2",
    "environment": env_reference,  # Environment we created earlier
    "compute_target": compute_name,  # Compute cluster we verified earlier
    "instance_count": 1,
    "max_duration_in_seconds": 3600  # 1 hour timeout
}

print("🎯 Job Configuration:")
print(f"   Experiment: {job_config['experiment_name']}")
print(f"   Environment: {job_config['environment']}")
print(f"   Compute: {job_config['compute_target']}")
print(f"   Command: {job_config['command']}")
print(f"   Timeout: {job_config['max_duration_in_seconds']} seconds")
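
A typo in the command string only surfaces after the job has been queued and started, which can take several minutes. As a sketch of a cheap local check, the command can be parsed with a copy of the training script's argument parser before submission. `build_parser` duplicates the flags defined in `train_lstm_azureml.py`, and `parse_job_command` is an illustrative helper that assumes the command has the form `python <script> <flags>`.

```python
import argparse
import shlex


def build_parser():
    """Mirror of the argument parser in train_lstm_azureml.py."""
    parser = argparse.ArgumentParser(description="LSTM Time Series Training")
    parser.add_argument("--epochs", type=int, default=100)
    parser.add_argument("--learning_rate", type=float, default=0.001)
    parser.add_argument("--hidden_size", type=int, default=50)
    parser.add_argument("--num_layers", type=int, default=2)
    parser.add_argument("--dropout", type=float, default=0.2)
    parser.add_argument("--sequence_length", type=int, default=50)
    parser.add_argument("--n_samples", type=int, default=1000)
    return parser


def parse_job_command(command_line):
    """Parse the flags of a 'python script.py --flag value ...' command."""
    tokens = shlex.split(command_line)
    # tokens[0] is the interpreter and tokens[1] is the script name.
    return build_parser().parse_args(tokens[2:])


args = parse_job_command(
    "python train_lstm_azureml.py --epochs 50 --learning_rate 0.001 "
    "--hidden_size 64 --num_layers 2"
)
print(vars(args))
```

An unknown or misspelled flag makes `argparse` exit with an error locally, before any compute time is spent.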

In [None]:
# Create and submit the training job
print("🚀 Creating Azure ML training job...")

try:
    # Create the command job
    training_job = command(
        code=job_config["code_path"],
        command=job_config["command"],
        environment=job_config["environment"],
        compute=job_config["compute_target"],
        display_name=job_config["display_name"],
        description=job_config["description"],
        experiment_name=job_config["experiment_name"],
        # Resource configuration
        instance_count=job_config["instance_count"],
        # Timeout configuration
        timeout=job_config["max_duration_in_seconds"]
    )

    print("📤 Submitting job to Azure ML...")

    # Submit the job
    submitted_job = ml_client.jobs.create_or_update(training_job)

    print("✅ Job submitted successfully!")
    print("📋 Job Details:")
    print(f"   Job Name: {submitted_job.name}")
    print(f"   Job ID: {submitted_job.id}")
    print(f"   Status: {submitted_job.status}")
    print(f"   Experiment: {submitted_job.experiment_name}")

    # Store job info for monitoring
    job_name = submitted_job.name
    job_id = submitted_job.id

    print("\n🔗 Job URL:")
    print(f"   Studio URL: {submitted_job.studio_url}")

except Exception as e:
    print(f"❌ Error submitting job: {str(e)}")
    print("🔍 Check that:")
    print(f"   • Compute cluster '{compute_name}' is available")
    print(f"   • Environment '{env_reference}' is valid")
    print(f"   • Training script exists at '{job_config['code_path']}'")
    raise

## Step 7: Monitor Training Job

Let's monitor the training job progress and check its status. We can view logs and track the training metrics in real-time.

In [None]:
# Check job status
print(f"🔍 Checking status of job: {job_name}")

try:
    # Get current job status
    current_job = ml_client.jobs.get(job_name)

    print(f"📊 Job Status: {current_job.status}")
    print(f"🕐 Created: {current_job.creation_context.created_at}")
    print(f"🎯 Experiment: {current_job.experiment_name}")

    # Display different information based on job status
    if current_job.status == "Completed":
        print("✅ Job completed successfully!")
        print(f"⏱️ Duration: {current_job.creation_context.last_modified_at - current_job.creation_context.created_at}")

    elif current_job.status == "Running":
        print("🏃‍♂️ Job is currently running...")
        print(f"⏱️ Running for: {current_job.creation_context.last_modified_at - current_job.creation_context.created_at}")

    elif current_job.status == "Failed":
        print("❌ Job failed!")
        print("🔍 Check the logs for error details")

    elif current_job.status in ["Queued", "Starting", "Preparing"]:
        print(f"⏳ Job is {current_job.status.lower()}...")
        print("💡 This may take a few minutes as Azure ML provisions resources")

    else:
        print(f"📋 Current status: {current_job.status}")

    # Show studio URL for detailed monitoring
    print("\n🌐 Monitor in Azure ML Studio:")
    print(f"   {current_job.studio_url}")

except Exception as e:
    print(f"❌ Error checking job status: {str(e)}")
    print("💡 Job might not exist or you may not have access")

In [None]:
# Wait for job completion (optional)

def wait_for_job_completion(job_name, max_wait_minutes=30, check_interval=30):
    """
    Wait for job completion with periodic status updates

    Args:
        job_name: Name of the Azure ML job
        max_wait_minutes: Maximum time to wait (default: 30 minutes)
        check_interval: Time between status checks in seconds (default: 30 seconds)
    """
    print(f"⏳ Waiting for job completion (max {max_wait_minutes} minutes)...")
    print(f"🔄 Checking every {check_interval} seconds")

    start_time = time.time()
    max_wait_seconds = max_wait_minutes * 60

    while time.time() - start_time < max_wait_seconds:
        try:
            current_job = ml_client.jobs.get(job_name)
            elapsed_minutes = (time.time() - start_time) / 60

            print(f"[{elapsed_minutes:.1f}m] Status: {current_job.status}")

            if current_job.status == "Completed":
                print("✅ Job completed successfully!")
                return True
            elif current_job.status == "Failed":
                print("❌ Job failed!")
                return False
            elif current_job.status == "Canceled":
                print("⚠️ Job was canceled!")
                return False

            time.sleep(check_interval)

        except Exception as e:
            # Stop polling on errors instead of reporting a timeout
            print(f"❌ Error checking job status: {str(e)}")
            return False

    print(f"⏰ Timeout reached after {max_wait_minutes} minutes")
    print("💡 Job may still be running - check Azure ML Studio for updates")
    return False

# Uncomment the following line to wait for job completion
# wait_for_job_completion(job_name, max_wait_minutes=15)

print("💡 To wait for job completion, uncomment and run the above function call")
print(f"🌐 Or monitor progress in Azure ML Studio: {current_job.studio_url}")
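
The polling loop above is tied to `ml_client`, so it cannot be exercised without a live job. A minimal sketch of the same idea with the status lookup passed in as a function is shown below; it can then be tried with canned statuses. `poll_until_terminal` and its parameters are illustrative names, and the set of terminal states is taken from the statuses checked in the cell above.

```python
import time
from typing import Callable, Optional

TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def poll_until_terminal(
    get_status: Callable[[], str],
    check_interval: float = 30.0,
    max_wait_seconds: float = 1800.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[str]:
    """Return the terminal status, or None if the deadline passes first."""
    deadline = clock() + max_wait_seconds
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            return None
        sleep(check_interval)


# Dry run with canned statuses instead of a real job.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
result = poll_until_terminal(
    lambda: next(fake_statuses), check_interval=0, sleep=lambda seconds: None
)
print(result)  # Completed
```

Against a real job the lookup would be `lambda: ml_client.jobs.get(job_name).status`. The SDK also provides `ml_client.jobs.stream(job_name)`, which blocks and prints the job's logs until it finishes.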

## Step 8: Retrieve Results and Manage Models

Once the training job is complete, we can retrieve the trained model, view metrics, and register the model for future use.

In [None]:
# Download job outputs and artifacts
print(f"📥 Retrieving job outputs for: {job_name}")

try:
    # Get the completed job
    completed_job = ml_client.jobs.get(job_name)

    if completed_job.status == "Completed":
        print("✅ Job completed successfully!")

        # Create outputs directory
        outputs_dir = Path("../outputs/job_artifacts")
        outputs_dir.mkdir(parents=True, exist_ok=True)

        print(f"📁 Downloading artifacts to: {outputs_dir}")

        # Download job outputs
        try:
            ml_client.jobs.download(
                name=job_name,
                download_path=outputs_dir,
                output_name="default"
            )
            print("✅ Artifacts downloaded successfully!")

            # List downloaded files
            print("\n📋 Downloaded files:")
            for file_path in outputs_dir.rglob("*"):
                if file_path.is_file():
                    relative_path = file_path.relative_to(outputs_dir)
                    print(f"   📄 {relative_path}")

        except Exception as download_error:
            print(f"⚠️ Could not download artifacts: {str(download_error)}")
            print("💡 This can happen if the job produced no artifacts under this output name")

    else:
        print(f"⚠️ Job status: {completed_job.status}")
        print("💡 Wait for job completion before downloading artifacts")

except Exception as e:
    print(f"❌ Error retrieving job: {str(e)}")
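
The training script logs a `model_info.json` artifact with the model type, framework, final test MSE, and parameters. After a download, that file sits somewhere under the outputs directory, and its exact subfolder depends on how the artifacts were fetched. The sketch below searches for it recursively instead of assuming a fixed path. `load_model_info` is an illustrative helper, and the demo writes a made-up file into a temporary directory, so the MSE value shown is not a real result.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def load_model_info(root: Path, filename: str = "model_info.json") -> Optional[dict]:
    """Return the first matching JSON file under root, or None if absent."""
    matches = sorted(Path(root).rglob(filename))
    if not matches:
        return None
    with open(matches[0], encoding="utf-8") as f:
        return json.load(f)


# Demo against a temporary directory standing in for ../outputs/job_artifacts.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "artifacts"
    nested.mkdir()
    sample = {"framework": "PyTorch", "final_test_mse": 0.0123}  # made-up values
    (nested / "model_info.json").write_text(json.dumps(sample), encoding="utf-8")
    info = load_model_info(Path(tmp))
    print(info["framework"], info["final_test_mse"])  # PyTorch 0.0123
```

With real artifacts the call would be `load_model_info(outputs_dir)`.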

In [None]:
# View training metrics from MLflow
# MLClient exposes no experiments collection, so metrics are read through MLflow.
# This cell needs the `mlflow` and `azureml-mlflow` packages installed locally.
import mlflow

print("📊 Retrieving training metrics...")

experiment_name = job_config["experiment_name"]

try:
    # Point MLflow at the workspace tracking server
    tracking_uri = ml_client.workspaces.get(workspace_name).mlflow_tracking_uri
    mlflow.set_tracking_uri(tracking_uri)

    # In Azure ML, the job name is also the MLflow run ID
    run = mlflow.get_run(job_name)

    print(f"\n✅ Found run: {run.info.run_id} - Status: {run.info.status}")
    print("\n📋 Logged metrics (latest values):")
    for metric_name, value in sorted(run.data.metrics.items()):
        print(f"   📈 {metric_name}: {value:.6f}")

    # List jobs that belong to the same experiment
    print(f"\n🔍 Jobs in experiment '{experiment_name}':")
    for job in ml_client.jobs.list():
        if job.experiment_name == experiment_name:
            print(f"   🎯 {job.name} - Status: {job.status}")

except Exception as e:
    print(f"❌ Error retrieving metrics: {str(e)}")
    print("💡 The run may not exist yet, or MLflow may not be configured locally")

print("\n💡 For detailed metrics and visualizations:")
print("   🌐 Visit Azure ML Studio: https://ml.azure.com")
print(f"   📊 Navigate to Experiments → {experiment_name}")
print("   📈 View metrics, logs, and model artifacts")

In [None]:
# Register the trained model (optional)
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Model

print("🎯 Model Registration (Optional)")
print("This step registers your trained model in Azure ML Model Registry")
print("for easy deployment and versioning.\n")

def register_model_from_job(job_name, model_name="lstm-time-series-model"):
    """Register a model from a completed training job"""

    try:
        # Get the completed job
        completed_job = ml_client.jobs.get(job_name)

        if completed_job.status != "Completed":
            print(f"⚠️ Job status: {completed_job.status}")
            print("💡 Model registration requires a completed job")
            return None

        print(f"📝 Registering model: {model_name}")

        # Create model entity
        model = Model(
            name=model_name,
            description="LSTM model for time series forecasting trained on Azure ML",
            type=AssetTypes.MLFLOW_MODEL,
            # Artifacts logged through MLflow live under .../outputs/artifacts/paths/<artifact path>
            path=f"azureml://jobs/{job_name}/outputs/artifacts/paths/lstm_model/",
            tags={
                "model_type": "LSTM",
                "framework": "PyTorch",
                "task": "time_series_forecasting",
                "training_job": job_name,
                "experiment": completed_job.experiment_name
            }
        )

        # Register the model
        registered_model = ml_client.models.create_or_update(model)

        print("✅ Model registered successfully!")
        print("📋 Model Details:")
        print(f"   Name: {registered_model.name}")
        print(f"   Version: {registered_model.version}")
        print(f"   ID: {registered_model.id}")
        print(f"   Type: {registered_model.type}")

        return registered_model

    except Exception as e:
        print(f"❌ Error registering model: {str(e)}")
        return None

# Uncomment to register the model
# registered_model = register_model_from_job(job_name)

print("💡 To register your trained model:")
print("   1. Uncomment the line above")
print("   2. Ensure your training job has completed successfully")
print("   3. Run this cell")
print("\n🔗 You can also register models manually in Azure ML Studio")

## Summary and Next Steps

🎉 **Congratulations!** You've successfully completed the Azure ML training tutorial. Here's what you've accomplished:

### ✅ What You've Learned

1. **Azure ML Setup**: Configured Azure ML workspace and authentication
2. **Compute Resources**: Set up and validated compute clusters for training
3. **Environment Management**: Created custom conda environments for PyTorch
4. **Training Scripts**: Built Azure ML-optimized training code with MLflow tracking
5. **Job Submission**: Submitted remote training jobs to Azure ML compute
6. **Monitoring**: Tracked job progress and status in real-time
7. **Model Management**: Retrieved artifacts and learned about model registration

### 🚀 Next Steps

- **Experiment with hyperparameters**: Modify the training command to test different model architectures
- **Scale up training**: Use larger compute instances or distributed training
- **Deploy models**: Create real-time or batch inference endpoints
- **Set up pipelines**: Automate training workflows with Azure ML Pipelines
- **Add data sources**: Connect to Azure storage for larger datasets
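
As a starting point for the hyperparameter experiments mentioned above, the sketch below expands a small grid into one command string per combination, in the same format as the `command` entry of `job_config`. `expand_grid` is an illustrative helper; each resulting string would still need to be submitted as its own job. Azure ML also offers sweep jobs for managed hyperparameter search.

```python
import itertools
import shlex


def expand_grid(script, grid):
    """Return one 'python script --flag value ...' command per grid point."""
    keys = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[key] for key in keys)):
        args = []
        for key, value in zip(keys, values):
            args += [f"--{key}", str(value)]
        commands.append(shlex.join(["python", script] + args))
    return commands


grid = {"hidden_size": [32, 64], "learning_rate": [0.001, 0.01]}
commands = expand_grid("train_lstm_azureml.py", grid)
for cmd in commands:
    print(cmd)
```

Sorting the keys keeps the flag order stable, so the same grid always produces the same command strings.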

### 📚 Additional Resources

- [Azure ML Documentation](https://docs.microsoft.com/azure/machine-learning/)
- [MLflow Integration](https://docs.microsoft.com/azure/machine-learning/how-to-use-mlflow)
- [PyTorch on Azure ML](https://docs.microsoft.com/azure/machine-learning/how-to-train-pytorch)

### 💡 Tips for Production

- Use Azure ML Pipelines for reproducible workflows
- Implement proper data versioning and lineage
- Set up automated model validation and testing
- Configure monitoring and alerts for model performance
- Use Azure ML's built-in security and compliance features