# Infrastructure Setup

This notebook provisions and validates Azure infrastructure for the Resume NER training pipeline.

## Overview

- **Step 1**: Load Configuration
- **Step 2**: Validate Environment Variables
- **Step 3**: Create/Verify Azure ML Workspace
- **Step 4**: Create/Verify Storage Account and Containers
- **Step 5**: Create/Verify Compute Clusters
- **Step 6**: (Optional) Validate Infrastructure

## Prerequisites

1. **Authenticate with Azure** (via `DefaultAzureCredential`):
   - Azure CLI: `az login`
   - VS Code Azure extension
   - Managed Identity
   - Service Principal environment variables

2. **Install dependencies**:
   ```bash
   pip install -r setup/requirements.txt
   ```

3. **Configure environment variables**:
   ```bash
   cp config.env.example config.env
   # Edit config.env with your values
   ```

## Configuration

Edit `config/infrastructure.yaml` to customize resource names, VM sizes, and auto-scale settings.
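
The code in this notebook reads a fixed set of keys from that file. The sketch below mirrors the expected layout as a Python dictionary and checks it for missing keys. Every value (resource names, region, VM size, idle time) is a placeholder assumption, and `find_missing_keys` is a hypothetical helper rather than part of the notebook.

```python
from typing import Any, Dict, List, Sequence

# Layout the notebook expects; all values are placeholders, not recommendations.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "azure": {
        "subscription_id": "00000000-0000-0000-0000-000000000000",
        "resource_group": "example-rg",
        "location": "westeurope",
    },
    "workspace": {"name": "example-ws", "description": "Example workspace"},
    "storage": {
        "account_name": "examplestorage",
        "containers": [{"name": "datasets", "public_access": "None"}],
    },
    "compute": {
        # "gpu_cluster" is optional and uses the same fields as "cpu_cluster".
        "cpu_cluster": {
            "name": "cpu-cluster",
            "vm_size": "Standard_DS3_v2",
            "min_nodes": 0,
            "max_nodes": 2,
            "idle_time_before_scale_down": 120,
        },
    },
}

REQUIRED_KEYS = (
    "azure.subscription_id",
    "azure.resource_group",
    "azure.location",
    "workspace.name",
    "storage.account_name",
    "storage.containers",
    "compute.cpu_cluster.name",
    "compute.cpu_cluster.vm_size",
    "compute.cpu_cluster.min_nodes",
    "compute.cpu_cluster.max_nodes",
    "compute.cpu_cluster.idle_time_before_scale_down",
)


def find_missing_keys(config: Dict[str, Any], required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the dotted paths from `required` that are absent from `config`."""
    missing = []
    for dotted in required:
        node: Any = config
        for part in dotted.split("."):
            if not isinstance(node, dict) or part not in node:
                missing.append(dotted)
                break
            node = node[part]
    return missing
```

Running a check like this right after loading the file reports every missing key at once, instead of failing with a `KeyError` on the first one partway through provisioning.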

## Notes

- Operations are idempotent, so the notebook is safe to run multiple times.
- Compute clusters scale down to `min_nodes` (0 for cost savings) after the configured idle time.
- The infrastructure must exist before you run the orchestration notebook.
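
The idempotency noted above comes from a get-or-create pattern: look the resource up first and create it only when the lookup reports that it is missing. Here is a minimal sketch of that pattern using an in-memory stand-in instead of Azure. `NotFound`, `get_or_create`, `fake_get`, and `fake_create` are hypothetical names used only for illustration.

```python
from typing import Callable, Dict, List, TypeVar

T = TypeVar("T")


class NotFound(Exception):
    """Stand-in for azure.core.exceptions.ResourceNotFoundError."""


def get_or_create(name: str, get: Callable[[str], T], create: Callable[[str], T]) -> T:
    """Return the existing resource, creating it only if the lookup fails."""
    try:
        return get(name)
    except NotFound:
        return create(name)


# In-memory stand-in for a cloud resource registry.
registry: Dict[str, str] = {}
created: List[str] = []


def fake_get(name: str) -> str:
    if name not in registry:
        raise NotFound(name)
    return registry[name]


def fake_create(name: str) -> str:
    registry[name] = f"resource:{name}"
    created.append(name)
    return registry[name]


# Running the same step twice creates the resource only once.
first = get_or_create("example-ws", fake_get, fake_create)
second = get_or_create("example-ws", fake_get, fake_create)
```

Only the not-found error triggers creation. Any other failure, such as an authentication error, propagates to the caller instead of being treated as a missing resource.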


## Step 1: Load Configuration

In [23]:
import os
from pathlib import Path
import yaml
from typing import Dict, Any, Tuple
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Workspace, AmlCompute
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    StorageAccountCreateParameters,
    Sku,
    SkuName,
    Kind,
    AccessTier,
)
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import ResourceGroup
# PublicAccess comes from the blob data-plane SDK because its values are
# passed to ContainerClient.create_container in Step 4.
from azure.storage.blob import BlobServiceClient, PublicAccess
from azure.core.exceptions import ResourceNotFoundError
from azure.identity import DefaultAzureCredential
from dotenv import load_dotenv

CONFIG_PATH = Path("../config/infrastructure.yaml")
ENV_PATH = Path("../config.env")
REQUIRED_ENV_VARS = ["AZURE_SUBSCRIPTION_ID", "AZURE_RESOURCE_GROUP"]
CONNECTION_STRING_TEMPLATE = "DefaultEndpointsProtocol=https;AccountName={account_name};AccountKey={account_key};EndpointSuffix=core.windows.net"

if ENV_PATH.exists():
    load_dotenv(ENV_PATH)


In [24]:
def load_config() -> Dict[str, Any]:
    """
    Load and resolve infrastructure configuration from config/infrastructure.yaml.
    
    Returns:
        dict: Configuration dictionary with resolved environment variables
        
    Raises:
        FileNotFoundError: If config file does not exist
    """
    if not CONFIG_PATH.exists():
        raise FileNotFoundError(f"Config file not found: {CONFIG_PATH}")
    
    with open(CONFIG_PATH, "r") as f:
        config = yaml.safe_load(f)
    
    config["azure"]["subscription_id"] = os.getenv("AZURE_SUBSCRIPTION_ID", config["azure"]["subscription_id"])
    config["azure"]["resource_group"] = os.getenv("AZURE_RESOURCE_GROUP", config["azure"]["resource_group"])
    config["azure"]["location"] = os.getenv("AZURE_LOCATION", config["azure"]["location"])
    
    return config


config = load_config()
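
`load_config` lets environment variables take precedence over the values in the YAML file. One edge case is worth knowing: `os.getenv(name, default)` returns an empty string when the variable is set but empty, so an empty `AZURE_LOCATION=` line in `config.env` would replace the file value with `""`. The sketch below shows one way to treat empty values as unset. `env_or_default` is a hypothetical helper, not part of the notebook.

```python
import os
from typing import Mapping, Optional


def env_or_default(
    name: str,
    default: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return env[name] unless it is unset or blank, in which case return default."""
    value = env.get(name)
    if value is None or not value.strip():
        return default
    return value


# The mapping is passed explicitly here so the example does not depend on the real environment.
from_file = env_or_default("AZURE_LOCATION", "westeurope", env={})
from_blank = env_or_default("AZURE_LOCATION", "westeurope", env={"AZURE_LOCATION": ""})
from_env = env_or_default("AZURE_LOCATION", "westeurope", env={"AZURE_LOCATION": "eastus"})
```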


## Step 2: Validate Environment Variables

Ensure required environment variables are set.


In [3]:
def validate_environment_variables() -> None:
    """
    Validate required environment variables are set.
    
    Raises:
        ValueError: If required variables are missing
    """
    missing = [var for var in REQUIRED_ENV_VARS if not os.getenv(var)]
    if missing:
        raise ValueError(f"Missing environment variables: {', '.join(missing)}")


validate_environment_variables()


## Step 3: Create/Verify Azure ML Workspace

Create or retrieve the Azure ML workspace. The resource group is created first if it does not already exist.


In [4]:
def create_or_get_resource_group(config: Dict[str, Any]) -> None:
    """
    Create resource group if it doesn't exist.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Raises:
        Exception: If resource group creation fails
    """
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    location = config["azure"]["location"]
    credential = DefaultAzureCredential()
    
    resource_client = ResourceManagementClient(credential, subscription_id)
    
    try:
        resource_client.resource_groups.get(resource_group)
    except ResourceNotFoundError:
        resource_group_params = ResourceGroup(location=location)
        resource_client.resource_groups.create_or_update(resource_group, resource_group_params)


def create_or_get_workspace(config: Dict[str, Any]) -> MLClient:
    """
    Create or retrieve Azure ML Workspace.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Returns:
        MLClient: MLClient instance connected to the workspace
        
    Raises:
        Exception: If workspace creation or access fails
    """
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    workspace_name = config["workspace"]["name"]
    credential = DefaultAzureCredential()
    
    create_or_get_resource_group(config)
    
    try:
        ml_client = MLClient(credential, subscription_id, resource_group, workspace_name)
        ml_client.workspaces.get(workspace_name)
        return ml_client
    except ResourceNotFoundError:
        workspace = Workspace(
            name=workspace_name,
            location=config["azure"]["location"],
            description=config["workspace"].get("description", ""),
            display_name=workspace_name,
        )
        ml_client = MLClient(credential, subscription_id, resource_group)
        ml_client.workspaces.begin_create(workspace).result()
        return MLClient(credential, subscription_id, resource_group, workspace_name)


ml_client = create_or_get_workspace(config)


The deployment request resume-ner-ws-4334666 was accepted. ARM deployment URI for reference: 
https://portal.azure.com//#blade/HubsExtension/DeploymentDetailsBlade/overview/id/%2Fsubscriptions%2Fa23fa87c-802c-4fdf-9e59-e3d7969bcf31%2FresourceGroups%2Fresume_ner_2025-12-14-13-17-35%2Fproviders%2FMicrosoft.Resources%2Fdeployments%2Fresume-ner-ws-4334666
Creating Key Vault: (resumenekeyvaulta206d986  ) ..  Done (18s)
Creating Log Analytics Workspace: (resumenelogalyti0262baf5  )   Done (22s)
Creating AzureML Workspace: (res

## Step 4: Create/Verify Storage Account and Containers

Create or retrieve Azure Blob Storage account and required containers.
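
Storage account names are more restricted than most Azure resource names: they must be 3 to 24 characters long, use only lowercase letters and digits, and be globally unique. A quick local check can catch a bad `account_name` before the creation request is sent. This is a minimal sketch; `is_valid_storage_account_name` is a hypothetical helper, and it cannot check global uniqueness.

```python
import re

# 3-24 characters, lowercase letters and digits only.
_STORAGE_NAME_RE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Check the length and character rules for a storage account name."""
    return _STORAGE_NAME_RE.fullmatch(name) is not None
```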


In [6]:
def build_connection_string(account_name: str, account_key: str) -> str:
    """
    Build storage account connection string.
    
    Args:
        account_name: Storage account name
        account_key: Storage account key
        
    Returns:
        str: Connection string for blob service client
    """
    return CONNECTION_STRING_TEMPLATE.format(account_name=account_name, account_key=account_key)


def create_or_get_storage(config: Dict[str, Any]) -> BlobServiceClient:
    """
    Create or retrieve Azure Blob Storage account and containers.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Returns:
        BlobServiceClient: BlobServiceClient instance
        
    Raises:
        Exception: If storage creation or access fails
    """
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    location = config["azure"]["location"]
    account_name = config["storage"]["account_name"]
    
    credential = DefaultAzureCredential()
    storage_management = StorageManagementClient(credential, subscription_id)
    
    try:
        storage_management.storage_accounts.get_properties(resource_group, account_name)
    except ResourceNotFoundError:
        params = StorageAccountCreateParameters(
            sku=Sku(name=SkuName.STANDARD_LRS),
            kind=Kind.STORAGE_V2,
            location=location,
            access_tier=AccessTier.HOT,
        )
        storage_management.storage_accounts.begin_create(resource_group, account_name, params).result()
    
    keys = storage_management.storage_accounts.list_keys(resource_group, account_name)
    connection_string = build_connection_string(account_name, keys.keys[0].value)
    blob_client = BlobServiceClient.from_connection_string(connection_string)
    
    for container_config in config["storage"]["containers"]:
        container_name = container_config["name"]
        public_access_str = container_config.get("public_access", "None")
        
        if public_access_str is None or public_access_str.lower() == "none":
            public_access = None
        else:
            public_access = getattr(PublicAccess, public_access_str.upper(), None)
            if public_access is None:
                raise ValueError(f"Invalid public_access value: {public_access_str}")
        
        container = blob_client.get_container_client(container_name)
        if not container.exists():
            container.create_container(public_access=public_access)
    
    return blob_client


blob_client = create_or_get_storage(config)
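
The `public_access` handling inside `create_or_get_storage` can also be written as a small pure function, which makes it testable without an Azure connection. The sketch below is an illustration, not the notebook's implementation: `parse_public_access` is a hypothetical helper, and it assumes the lowercase strings `"container"` and `"blob"` are the access levels the container creation call accepts.

```python
from typing import Optional

# Assumed set of public access levels; anything else is rejected.
_ALLOWED_LEVELS = {"container", "blob"}


def parse_public_access(value: Optional[str]) -> Optional[str]:
    """Map a config value to an access level, or None for a private container."""
    if value is None:
        return None
    normalized = value.strip().lower()
    if normalized == "none":
        return None
    if normalized not in _ALLOWED_LEVELS:
        raise ValueError(f"Invalid public_access value: {value}")
    return normalized
```

Keeping containers private (`None`) is the default in the notebook's configuration handling, so a missing `public_access` key never exposes data by accident.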


## Step 5: Create/Verify Compute Clusters

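Create or retrieve the compute clusters. The CPU cluster is required; the GPU cluster is optional and is created only if it appears in the configuration.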


In [26]:
def create_or_get_compute_cluster(
    ml_client: MLClient,
    cluster_name: str,
    vm_size: str,
    min_nodes: int,
    max_nodes: int,
    idle_time_before_scale_down: int,
) -> AmlCompute:
    """
    Create or retrieve a single compute cluster.
    
    Args:
        ml_client: MLClient instance
        cluster_name: Name of the compute cluster
        vm_size: VM size (e.g., "Standard_NC6s_v3")
        min_nodes: Minimum number of nodes (0 for cost savings)
        max_nodes: Maximum number of nodes
        idle_time_before_scale_down: Idle time in seconds before scaling down
        
    Returns:
        AmlCompute instance
        
    Raises:
        Exception: If cluster creation or update fails
    """
    try:
        compute = ml_client.compute.get(cluster_name)
        
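        # Compare every configurable setting, including the idle timeout,
        # so a change to any one of them triggers an update.
        needs_update = (
            compute.size != vm_size
            or compute.min_instances != min_nodes
            or compute.max_instances != max_nodes
            or compute.idle_time_before_scale_down != idle_time_before_scale_down
        )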
        
        if needs_update:
            compute.size = vm_size
            compute.min_instances = min_nodes
            compute.max_instances = max_nodes
            compute.idle_time_before_scale_down = idle_time_before_scale_down
            
            if hasattr(compute, 'network_settings') and compute.network_settings is not None:
                if compute.network_settings.subnet is None:
                    compute.network_settings = None
            
            ml_client.compute.begin_create_or_update(compute).wait()
        
        return compute
        
    except ResourceNotFoundError:
        compute = AmlCompute(
            name=cluster_name,
            size=vm_size,
            min_instances=min_nodes,
            max_instances=max_nodes,
            idle_time_before_scale_down=idle_time_before_scale_down,
        )
        
        # Return the cluster as reported by the service once provisioning finishes.
        return ml_client.compute.begin_create_or_update(compute).result()


def create_or_get_compute_clusters(ml_client: MLClient, config: Dict[str, Any]) -> None:
    """
    Create or retrieve compute clusters based on configuration.
    
    GPU cluster is optional and only created if present in config.
    CPU cluster is required.
    
    Args:
        ml_client: MLClient instance
        config: Infrastructure configuration dictionary
        
    Raises:
        KeyError: If compute config or CPU cluster config is missing
    """
    if "compute" not in config or not config["compute"]:
        raise KeyError("'compute' section not found in config. Please ensure config is loaded correctly.")
    
    compute_config = config["compute"]
    
    if "gpu_cluster" in compute_config and compute_config["gpu_cluster"] is not None:
        gpu_config = compute_config["gpu_cluster"]
        create_or_get_compute_cluster(
            ml_client=ml_client,
            cluster_name=gpu_config["name"],
            vm_size=gpu_config["vm_size"],
            min_nodes=gpu_config["min_nodes"],
            max_nodes=gpu_config["max_nodes"],
            idle_time_before_scale_down=gpu_config["idle_time_before_scale_down"],
        )
    
    if "cpu_cluster" not in compute_config:
        raise KeyError("'cpu_cluster' configuration is required but not found in config.")
    
    cpu_config = compute_config["cpu_cluster"]
    create_or_get_compute_cluster(
        ml_client=ml_client,
        cluster_name=cpu_config["name"],
        vm_size=cpu_config["vm_size"],
        min_nodes=cpu_config["min_nodes"],
        max_nodes=cpu_config["max_nodes"],
        idle_time_before_scale_down=cpu_config["idle_time_before_scale_down"],
    )


create_or_get_compute_clusters(ml_client, config)
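
The update path in `create_or_get_compute_cluster` is a drift check: compare the existing cluster's settings with the configured ones and update only when they differ. The sketch below isolates that comparison so it can be tested offline. `ClusterSpec` and `changed_fields` are hypothetical names. Comparing the VM size case-insensitively is an assumption, made in case the service reports the size with different casing than the configuration.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class ClusterSpec:
    vm_size: str
    min_nodes: int
    max_nodes: int
    idle_time_before_scale_down: int  # seconds


def changed_fields(current: ClusterSpec, desired: ClusterSpec) -> List[str]:
    """Return the names of the settings that differ between the two specs."""
    changed = []
    for field in fields(ClusterSpec):
        a = getattr(current, field.name)
        b = getattr(desired, field.name)
        # Assumption: string settings such as the VM size are case-insensitive.
        if isinstance(a, str) and isinstance(b, str):
            a, b = a.casefold(), b.casefold()
        if a != b:
            changed.append(field.name)
    return changed


current = ClusterSpec("STANDARD_DS3_V2", 0, 2, 120)
desired = ClusterSpec("Standard_DS3_v2", 0, 4, 300)
drift = changed_fields(current, desired)
```

An empty result means the cluster already matches the configuration and no update request is needed.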

## Step 6: (Optional) Validate Infrastructure

Validate that all infrastructure components exist and are accessible.


In [27]:
def validate_workspace(config: Dict[str, Any]) -> Tuple[bool, list]:
    """
    Validate Azure ML Workspace exists and is accessible.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Returns:
        tuple: (success, list of errors)
    """
    errors = []
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    workspace_name = config["workspace"]["name"]
    
    try:
        ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
        ml_client.workspaces.get(workspace_name)
        return True, errors
    except ResourceNotFoundError:
        errors.append(f"Workspace '{workspace_name}' not found")
        return False, errors
    except Exception as e:
        errors.append(f"Error accessing workspace: {e}")
        return False, errors


def validate_storage(config: Dict[str, Any]) -> Tuple[bool, list]:
    """
    Validate Storage Account and Containers exist.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Returns:
        tuple: (success, list of errors)
    """
    errors = []
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    account_name = config["storage"]["account_name"]
    
    try:
        storage_management = StorageManagementClient(DefaultAzureCredential(), subscription_id)
        storage_management.storage_accounts.get_properties(resource_group, account_name)
        
        keys = storage_management.storage_accounts.list_keys(resource_group, account_name)
        connection_string = build_connection_string(account_name, keys.keys[0].value)
        blob_client = BlobServiceClient.from_connection_string(connection_string)
        
        for container_config in config["storage"]["containers"]:
            container_name = container_config["name"]
            if not blob_client.get_container_client(container_name).exists():
                errors.append(f"Container '{container_name}' not found")
        
        return len(errors) == 0, errors
    except ResourceNotFoundError:
        errors.append(f"Storage account '{account_name}' not found")
        return False, errors
    except Exception as e:
        errors.append(f"Error accessing storage: {e}")
        return False, errors


def validate_compute(config: Dict[str, Any]) -> Tuple[bool, list]:
    """
    Validate Compute Clusters exist and are accessible.
    
    Only validates clusters that are present in the configuration.
    GPU cluster is optional, CPU cluster is required.
    
    Args:
        config: Infrastructure configuration dictionary
        
    Returns:
        tuple: (success, list of errors)
    """
    errors = []
    subscription_id = config["azure"]["subscription_id"]
    resource_group = config["azure"]["resource_group"]
    workspace_name = config["workspace"]["name"]
    
    try:
        ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace_name)
        compute_config = config.get("compute", {})
        
        if "gpu_cluster" in compute_config and compute_config["gpu_cluster"] is not None:
            gpu_cluster_name = compute_config["gpu_cluster"]["name"]
            try:
                ml_client.compute.get(gpu_cluster_name)
            except ResourceNotFoundError:
                errors.append(f"GPU cluster '{gpu_cluster_name}' not found")
            except Exception as e:
                errors.append(f"Error accessing GPU cluster '{gpu_cluster_name}': {e}")
        
        if "cpu_cluster" in compute_config:
            cpu_cluster_name = compute_config["cpu_cluster"]["name"]
            try:
                ml_client.compute.get(cpu_cluster_name)
            except ResourceNotFoundError:
                errors.append(f"CPU cluster '{cpu_cluster_name}' not found")
            except Exception as e:
                errors.append(f"Error accessing CPU cluster '{cpu_cluster_name}': {e}")
        else:
            errors.append("CPU cluster configuration is required but not found")
        
        return len(errors) == 0, errors
    except Exception as e:
        errors.append(f"Error accessing workspace: {e}")
        return False, errors


all_errors = []
_, errors = validate_workspace(config)
all_errors.extend(errors)
_, errors = validate_storage(config)
all_errors.extend(errors)
_, errors = validate_compute(config)
all_errors.extend(errors)

if all_errors:
    raise ValueError(f"Validation failed: {', '.join(all_errors)}")
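
Because each validator returns a `(success, errors)` tuple, the results can also be grouped by component rather than flattened into one list, which makes a failure report easier to read. This is a minimal sketch with stub validators standing in for `validate_workspace`, `validate_storage`, and `validate_compute`; `run_validators` and the stubs are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Validator = Callable[[Dict[str, Any]], Tuple[bool, List[str]]]


def run_validators(config: Dict[str, Any], validators: Dict[str, Validator]) -> Dict[str, List[str]]:
    """Run each validator and return the errors of the failing ones, keyed by label."""
    report: Dict[str, List[str]] = {}
    for label, validator in validators.items():
        ok, errors = validator(config)
        if not ok:
            report[label] = errors or ["validation failed without details"]
    return report


# Stub validators used only to demonstrate the report shape.
def stub_ok(config: Dict[str, Any]) -> Tuple[bool, List[str]]:
    return True, []


def stub_missing_container(config: Dict[str, Any]) -> Tuple[bool, List[str]]:
    return False, ["Container 'datasets' not found"]


report = run_validators({}, {"workspace": stub_ok, "storage": stub_missing_container})
```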


