# AutoML: Train a classifier for your Claims dataset

**Requirements**
- A basic understanding of Machine Learning
- An Azure account with an active subscription. [Create one for free](https://azure.microsoft.com/free/)
- An Azure ML Workspace (see [this starter notebook](../../../resources/workspace/workspace.ipynb))
- Python environment with Azure ML SDK v2:  
  ```bash
  pip install azure-ai-ml azure-identity mltable mlflow azureml-mlflow
  ```

**Learning Objectives**
- Connect to your Azure ML workspace via the Python SDK
- Create and run an AutoML **classification** job targeting `long_term_3y`
- Use **serverless compute** (preview) for your job
- Retrieve the best model via MLflow and score new data


# 1. Connect to Azure Machine Learning Workspace

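First, authenticate and create an `MLClient` bound to your workspace. `MLClient.from_config` reads the workspace details from a `config.json` file; if none is found, the code falls back to identifiers you enter manually.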

In [12]:
# Import required libraries
import os
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, automl, Input
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import ResourceConfiguration
from azure.ai.ml.automl import ClassificationModels

# For MLTable creation
import mltable
from mltable import DataType

# For retrieving model via MLFlow
import mlflow
from mlflow.tracking.client import MlflowClient

In [13]:
# Authenticate and connect to your workspace
credential = DefaultAzureCredential()
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print("Could not load from config:\n", ex)
    # Fallback: manually specify
    SUBSCRIPTION_ID = "<your-subscription>"
    RESOURCE_GROUP  = "<your-resource-group>"
    WORKSPACE_NAME  = "<your-workspace-name>"
    ml_client = MLClient(
        credential,
        subscription_id=SUBSCRIPTION_ID,
        resource_group_name=RESOURCE_GROUP,
        workspace_name=WORKSPACE_NAME,
    )

# Display workspace info
ws = ml_client.workspaces.get(ml_client.workspace_name)
print(f"Connected to {ws.name} in {ws.location} (RG: {ws.resource_group})")

Found the config file in: /config.json


Connected to uhg-rx-aml in westus (RG: uhg-rx-owca)
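
If `MLClient.from_config` cannot find a `config.json`, one option is to write the file yourself. The sketch below is a minimal example using only the standard library. The three keys are the ones the workspace config file uses; the placeholder values and the helper name `write_workspace_config` are assumptions for illustration, and the demo writes to a temporary directory so that it does not touch your project.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three workspace identifiers (hypothetical helper)."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2))
    return path


# Placeholder values (assumptions): replace them with your own identifiers.
# In a real project, pass the folder that holds your notebook instead of a temp dir.
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir,
    subscription_id="<your-subscription>",
    resource_group="<your-resource-group>",
    workspace_name="<your-workspace-name>",
)
print(config_path.read_text())
```

With the file in place, calling `MLClient.from_config(credential)` from that folder should pick it up without reaching the manual fallback.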


# 2. Prepare your Claims data as an MLTable

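AutoML reads training data as an MLTable: a folder that holds the data file plus an `MLTable` definition file describing how to load it. The first cell shows how to build that folder from `claims.csv`. It is commented out here, so run it once if you still need to create the folder.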


In [14]:
# # 2.1 ── Create the MLTable from your CSV
# os.makedirs("data/claims-mltable", exist_ok=True)

# # Write CSV into the MLTable folder
# import shutil
# shutil.copy("claims.csv", "data/claims-mltable/claims.csv")

# # Use mltable API to scaffold the MLTable file
# paths = [{"file": "./data/claims-mltable/claims.csv"}]
# tbl = mltable.from_delimited_files(paths=paths)
# # Ensure the target is read as an integer label
# tbl = tbl.convert_column_types({
#     "long_term_3y": DataType.to_int()
# })
# # Save the `MLTable` definition file into the folder
# tbl.save("data/claims-mltable")
# print("MLTable created at ./data/claims-mltable")
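
To make the result of that cell less opaque, here is a rough sketch that builds an equivalent folder by hand with the standard library. The YAML keys (`paths`, `read_delimited`, `convert_column_types`) reflect my understanding of the MLTable file layout and should be treated as assumptions: compare them against a file produced by `tbl.save(...)` before relying on them. The sample rows, the feature column names, and the helper `write_mltable_folder` are made up for illustration.

```python
import tempfile
from pathlib import Path

# Assumed layout of the MLTable definition file (verify against a saved one).
MLTABLE_YAML = """\
paths:
  - file: ./claims.csv
transformations:
  - read_delimited:
      delimiter: ","
      encoding: utf8
      header: all_files_same_headers
  - convert_column_types:
      - columns: long_term_3y
        column_type: int
"""


def write_mltable_folder(folder, csv_text):
    """Create a folder containing claims.csv and an MLTable file (hypothetical helper)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "claims.csv").write_text(csv_text)
    # The definition file is named exactly "MLTable", with no extension.
    (folder / "MLTable").write_text(MLTABLE_YAML)
    return folder


# Made-up sample rows; only the long_term_3y column name comes from this notebook.
sample_csv = "claim_id,claim_amount,long_term_3y\n1,120.5,0\n2,43.2,1\n"
demo_folder = write_mltable_folder(Path(tempfile.mkdtemp()) / "claims-mltable", sample_csv)
print(sorted(p.name for p in demo_folder.iterdir()))
```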

In [15]:
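# 2.2 ── Reference the MLTable folder as a job Input
# `path` must point at the folder that contains the `MLTable` file
# (use "./data/claims-mltable" if you created it with cell 2.1).
claims_data = Input(
    type=AssetTypes.MLTABLE,
    path="./claims"
)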
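
Before handing the data to AutoML, it can be worth confirming that the target column exists and checking how balanced its classes are, since accuracy can look strong on imbalanced data simply by favoring the majority class. The following is a minimal sketch using only the standard library; the sample rows and the feature column names are invented, and `target_distribution` is a hypothetical helper.

```python
import csv
import io
from collections import Counter


def target_distribution(csv_file, target="long_term_3y"):
    """Return the share of rows per target label in an open CSV file."""
    reader = csv.DictReader(csv_file)
    if target not in (reader.fieldnames or []):
        raise KeyError(f"Column {target!r} not found in {reader.fieldnames}")
    counts = Counter(row[target] for row in reader)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Invented sample standing in for claims.csv.
sample = io.StringIO(
    "claim_id,claim_amount,long_term_3y\n"
    "1,120.5,0\n"
    "2,80.0,0\n"
    "3,43.2,1\n"
    "4,300.0,0\n"
)
shares = target_distribution(sample)
print(shares)
```

On the real file, open it with `open("claims.csv", newline="")` and pass the file object in. If one class dominates, a metric such as `AUC_weighted` may be a more informative `primary_metric` than `accuracy`.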

# 3. Configure and run the AutoML Classification job

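Next, configure the job with `automl.classification()`, set `long_term_3y` as the target column, request **serverless compute**, and submit the job. Because no `compute` target is named, the `resources` setting determines the VM size and node count that Azure ML provisions for the run.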

In [18]:
# 3.1 ── Job parameters
experiment_name = "claims-long-term-3y-classifier"
max_trials       = 20

# Create the AutoML classification job
classification_job = automl.classification(
    experiment_name=experiment_name,
    training_data=claims_data,
    target_column_name="long_term_3y",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"dataset": "claims", "task": "long_term_3y"},
)

# Set limits for the AutoML run
classification_job.set_limits(
    timeout_minutes=60,
    trial_timeout_minutes=10,
    max_trials=max_trials,
    enable_early_termination=True,
)

# Serverless compute resources used to run the job
classification_job.resources = ResourceConfiguration(instance_type="Standard_E4s_v3", instance_count=6)
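
The limits above interact: 20 trials at up to 10 minutes each add up to 200 trial-minutes in the worst case, while the whole job is capped at 60 minutes. The sketch below works through that arithmetic to estimate how many trials would need to run side by side for all of them to fit. It is a simplified worst-case budget that ignores setup, featurization, ensembling, and early termination; `min_concurrency` is a hypothetical helper, not part of the SDK.

```python
import math


def min_concurrency(max_trials, trial_timeout_minutes, timeout_minutes):
    """Smallest number of parallel trials so that every trial fits in the job timeout,
    assuming each trial uses its full per-trial timeout."""
    if trial_timeout_minutes > timeout_minutes:
        raise ValueError("A single trial cannot exceed the overall job timeout")
    # Number of back-to-back worst-case trials that fit into the job timeout.
    waves = timeout_minutes // trial_timeout_minutes
    return math.ceil(max_trials / waves)


worst_case_minutes = 20 * 10          # 200 trial-minutes if run one after another
needed = min_concurrency(max_trials=20, trial_timeout_minutes=10, timeout_minutes=60)
print(f"Worst case sequential: {worst_case_minutes} min; parallel trials needed: {needed}")
```

`set_limits` also accepts `max_concurrent_trials` if you want to control parallelism explicitly. In practice many trials finish well under their timeout, so the job often completes all trials with less parallelism than this estimate suggests.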

In [19]:
# 3.2 ── Submit the job
returned_job = ml_client.jobs.create_or_update(classification_job)
print(f"Submitted AutoML job: {returned_job.name}")

# Stream logs
ml_client.jobs.stream(returned_job.name)

Submitted AutoML job: serene_honey_4dw3g1xqhn
RunId: serene_honey_4dw3g1xqhn
Web View: https://ml.azure.com/runs/serene_honey_4dw3g1xqhn?wsid=/subscriptions/28d2df62-e322-4b25-b581-c43b94bd2607/resourcegroups/uhg-rx-owca/workspaces/uhg-rx-aml
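
`jobs.stream` blocks until the job ends, which is inconvenient if the notebook kernel disconnects. A possible alternative is to poll the job status. The sketch below takes a `get_status` callable so that it runs without an Azure connection; the terminal state names and the helper `wait_for_job` are assumptions, so check them against the statuses your workspace reports.

```python
import time

# Assumed names of the states in which a job will no longer change.
TERMINAL_STATES = {"Completed", "Failed", "Canceled"}


def wait_for_job(get_status, poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Call get_status() until it returns a terminal state, then return that state."""
    status = None
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"Job still in state {status!r} after {max_polls} polls")


# Demo with a fake status sequence instead of a real job.
fake_statuses = iter(["Queued", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0)
print(final_status)
```

In the notebook, the callable could be `lambda: ml_client.jobs.get(returned_job.name).status`.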


# 4. Retrieve the Best Model via MLflow

AutoML logs every trial as an MLflow child run of the job. Point MLflow at the workspace's tracking URI, load the parent run, then pick the best child by the primary metric.


In [None]:
# 4.1 ── Configure MLflow
tracking_uri = ml_client.workspaces.get(ml_client.workspace_name).mlflow_tracking_uri
mlflow.set_tracking_uri(tracking_uri)
print("MLflow tracking @", mlflow.get_tracking_uri())

mlflow_client = MlflowClient()
parent_run = mlflow_client.get_run(returned_job.name)
print("Parent run status:", parent_run.info.status)

In [None]:
# 4.2 ── List child runs and pick the best by primary metric
# MlflowClient.search_runs takes experiment IDs plus a filter string
children = mlflow_client.search_runs(
    experiment_ids=[parent_run.info.experiment_id],
    filter_string=f"tags.mlflow.parentRunId = '{returned_job.name}'",
)
# Skip child runs that never logged the metric (for example, failed trials)
scored = [r for r in children if "accuracy" in r.data.metrics]
best = max(scored, key=lambda r: r.data.metrics["accuracy"])
print(f"Best child run ID: {best.info.run_id}, accuracy: {best.data.metrics['accuracy']:.4f}")
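
The selection step can be tried in isolation, without a tracking server. The sketch below uses stand-in objects that mimic only the two attributes the cell relies on (`run.info.run_id` and `run.data.metrics`); the run IDs and scores are invented, and `pick_best_run` and `fake_run` are hypothetical helpers.

```python
from types import SimpleNamespace


def pick_best_run(runs, metric="accuracy"):
    """Return the run with the highest value of `metric`, ignoring runs that lack it."""
    scored = [r for r in runs if metric in r.data.metrics]
    if not scored:
        raise ValueError(f"No run logged the metric {metric!r}")
    return max(scored, key=lambda r: r.data.metrics[metric])


def fake_run(run_id, **metrics):
    """Build a minimal stand-in for an MLflow Run object."""
    return SimpleNamespace(
        info=SimpleNamespace(run_id=run_id),
        data=SimpleNamespace(metrics=metrics),
    )


# Invented child runs; child_1 logged no accuracy, as a failed trial might.
demo_runs = [
    fake_run("child_0", accuracy=0.81),
    fake_run("child_1"),
    fake_run("child_2", accuracy=0.87),
]
best_demo = pick_best_run(demo_runs)
print(best_demo.info.run_id, best_demo.data.metrics["accuracy"])
```

Filtering before comparing matters because a missing metric would otherwise surface as `None` and make the comparison fail.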

# 5. Score New Data

Use the best model to score a hold-out set or a new CSV of claims. The example below scores the training file for simplicity, so the predictions will look better than they would on unseen data.

In [None]:
from mlflow.pyfunc import load_model
import pandas as pd

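# Load the best model as an MLflow pyfunc
# AutoML typically stores the MLflow model under outputs/mlflow-model;
# adjust the artifact path if your run's artifacts are laid out differently
model_uri = f"runs:/{best.info.run_id}/outputs/mlflow-model"
clf = load_model(model_uri)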

# Read raw CSV
df = pd.read_csv("claims.csv")
preds = clf.predict(df.drop(columns=['long_term_3y']))
df['predicted_long_term_3y'] = preds
df.head()
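
Once predictions sit next to the true labels, a quick comparison shows how the model behaves per class. The following is a minimal sketch on plain Python lists with invented labels; `accuracy_and_confusion` is a hypothetical helper. In the notebook you could pass `df["long_term_3y"]` and `df["predicted_long_term_3y"]` instead.

```python
from collections import Counter


def accuracy_and_confusion(y_true, y_pred):
    """Return overall accuracy and a Counter keyed by (true_label, predicted_label)."""
    y_true, y_pred = list(y_true), list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("Cannot evaluate an empty set of predictions")
    confusion = Counter(zip(y_true, y_pred))
    correct = sum(n for (truth, pred), n in confusion.items() if truth == pred)
    return correct / len(y_true), confusion


# Invented labels and predictions for illustration.
demo_true = [0, 0, 1, 1, 0, 1]
demo_pred = [0, 0, 1, 0, 0, 1]
demo_accuracy, demo_confusion = accuracy_and_confusion(demo_true, demo_pred)
print(f"accuracy={demo_accuracy:.3f}")
print(dict(demo_confusion))
```

Numbers computed on the rows the model was trained on are optimistic, so treat them as a smoke test and use a hold-out file for a real estimate.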

You’ve now trained, retrieved, and scored your claims classifier, all with AutoML on serverless compute. From here you can register the model, deploy it to an endpoint, or integrate it into your pipelines.