# Model-as-a-Judge Evaluation Walkthrough
This notebook demonstrates how Riley Inc. can launch Amazon Bedrock evaluation jobs that use the Nova Pro model as an evaluator (model-as-a-judge) to benchmark multiple candidate models.

## Prerequisites
- AWS credentials configured for the account where evaluations will run.
- The IAM role `Amazon-Bedrock-IAM-Role-20250928T210855` with permissions to start Bedrock eval jobs and access the target S3 buckets.
- A Python environment with recent versions of the `boto3` and `botocore` libraries installed.
- The prompt dataset stored in S3 at `s3://riley-inc-rag-knowledge-base-20250929-210517/prompt_datasets/general_trivia_no_groundtruth.jsonl`.
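Before uploading a prompt dataset, it can help to check the JSONL file locally. The sketch below assumes each line is a JSON object with a non-empty `prompt` string, which is how the dataset name here (no ground truth) suggests the file is laid out. The helper name `validate_prompt_records` and the sample lines are illustrative and not part of the original notebook.

```python
import json


def validate_prompt_records(lines):
    """Return (line_number, problem) tuples for lines that look malformed.

    Assumption: every record is a JSON object with a non-empty 'prompt' string.
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            continue  # Blank lines are skipped rather than reported.
        try:
            record = json.loads(text)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "record is not a JSON object"))
            continue
        prompt = record.get("prompt")
        if not isinstance(prompt, str) or not prompt.strip():
            problems.append((number, "missing or empty 'prompt'"))
    return problems


# Hypothetical sample lines, used only to exercise the check.
sample = [
    '{"prompt": "What is the capital of France?"}',
    '{"category": "geography"}',
    'not json',
]
for line_number, problem in validate_prompt_records(sample):
    print(f"line {line_number}: {problem}")
```

To check a real file, pass an open file handle instead of the list, for example `validate_prompt_records(open(path, encoding="utf-8"))`.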

In [18]:
import json
import re
from copy import deepcopy
from datetime import datetime

import boto3
from botocore.exceptions import BotoCoreError, ClientError

## Configuration
Define the AWS region, IAM role, dataset location, judge metrics, candidate models, evaluator model, and the shared output prefix for this evaluation run.

In [13]:
aws_region = "us-west-2"
role_arn = "arn:aws:iam::187899929471:role/service-role/Amazon-Bedrock-IAM-Role-20250928T210855"
eval_results_bucket = "riley-inc-rag-eval-results-20250929-210517"  # Replace with the eval bucket created in 00_infra_setup.ipynb
results_prefix = "baseline-evals/"
prompt_dataset_s3_uri = "s3://riley-inc-rag-knowledge-base-20250929-210517/prompt_datasets/general_trivia_no_groundtruth.jsonl"
dataset_name = "GeneralTriviaNoGroundTruth"
datasets = [
    {
        'name': dataset_name,
        'datasetLocation': {'s3Uri': prompt_dataset_s3_uri}
    }
]
core_judge_metrics = [
    'Builtin.Correctness',
    'Builtin.Completeness',
    'Builtin.Faithfulness',
    'Builtin.Helpfulness',
    'Builtin.Coherence',
    'Builtin.Relevance',
    'Builtin.FollowingInstructions',
    'Builtin.ProfessionalStyleAndTone',
]
safety_judge_metrics = [
    'Builtin.Harmfulness',
    'Builtin.Stereotyping',
    'Builtin.Refusal',
]
llm_judge_metrics = core_judge_metrics + safety_judge_metrics
judge_task_type = "General"
primary_model_id = 'us.amazon.nova-lite-v1:0'
candidate_models_default = [
    'us.amazon.nova-lite-v1:0',
    'us.amazon.nova-micro-v1:0',
    'us.amazon.nova-pro-v1:0'
]
evaluator_model_id = 'us.amazon.nova-pro-v1:0'
evaluator_model_config = {
    'bedrockEvaluatorModels': [
        {
            'modelIdentifier': evaluator_model_id
        }
    ]
}
application_type = 'ModelEvaluation'
comparison_run_id = datetime.now().strftime('%Y%m%d-%H%M%S')
shared_output_uri = f"s3://{eval_results_bucket}/{results_prefix}model-judge-run-{comparison_run_id}/"

if eval_results_bucket.endswith('REPLACE-ME'):
    raise ValueError('Update eval_results_bucket with the actual bucket name before running the notebook.')
if not datasets or not llm_judge_metrics:
    raise ValueError('Datasets and metrics must be non-empty lists.')

print(f'Evaluation artifacts will be stored under: {shared_output_uri}')
print(f'Using evaluator model: {evaluator_model_id}')
print(f'Datasets: {datasets}')
print(f'Metrics ({len(llm_judge_metrics)}): {llm_judge_metrics}')


Evaluation artifacts will be stored under: s3://riley-inc-rag-eval-results-20250929-210517/baseline-evals/model-judge-run-20250929-232500/
Using evaluator model: us.amazon.nova-pro-v1:0
Datasets: [{'name': 'GeneralTriviaNoGroundTruth', 'datasetLocation': {'s3Uri': 's3://riley-inc-rag-knowledge-base-20250929-210517/prompt_datasets/general_trivia_no_groundtruth.jsonl'}}]
Metrics (11): ['Builtin.Correctness', 'Builtin.Completeness', 'Builtin.Faithfulness', 'Builtin.Helpfulness', 'Builtin.Coherence', 'Builtin.Relevance', 'Builtin.FollowingInstructions', 'Builtin.ProfessionalStyleAndTone', 'Builtin.Harmfulness', 'Builtin.Stereotyping', 'Builtin.Refusal']
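The configuration cell assembles S3 URIs with string formatting, so a malformed bucket name or prefix would only surface when the job is submitted. A small check like the following could catch that earlier. It is a sketch: `split_s3_uri` is a hypothetical helper, not something the notebook defines.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple:
    """Split 's3://bucket/key' into (bucket, key); raise ValueError otherwise."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid S3 URI: {uri!r}")
    # The leading slash belongs to the URI syntax, not to the object key.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, key = split_s3_uri(
    "s3://riley-inc-rag-eval-results-20250929-210517/"
    "baseline-evals/model-judge-run-20250929-232500/"
)
print(bucket)
print(key)
```

The same helper is useful later when listing result objects, because S3 list calls take the bucket and the key prefix as separate arguments.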


## Initialize Bedrock Client
Establish a client connection to Amazon Bedrock in the selected region.

In [14]:
bedrock = boto3.client('bedrock', region_name=aws_region)
print('Bedrock client ready.')


Bedrock client ready.


## Submit Baseline Model-as-a-Judge Eval Job
Create an evaluation job that uses the Nova Pro evaluator to score a single candidate model (`primary_model_id`, Nova Lite) against the prompt dataset.

In [16]:
# Build a timestamped job name
job_name = f"baseline-eval-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
print(f'Creating evaluation job: {job_name} → {shared_output_uri}')

dataset_metric_configs = [
    {
        'taskType': judge_task_type,
        'dataset': deepcopy(dataset),
        'metricNames': llm_judge_metrics
    }
    for dataset in datasets
]

evaluation_request = {
    'jobName': job_name,
    'roleArn': role_arn,
    'applicationType': application_type,
    'evaluationConfig': {
        'automated': {
            'datasetMetricConfigs': dataset_metric_configs,
            'evaluatorModelConfig': evaluator_model_config
        }
    },
    'inferenceConfig': {
        'models': [
            {
                'bedrockModel': {
                    'modelIdentifier': primary_model_id
                }
            }
        ]
    },
    'outputDataConfig': {'s3Uri': shared_output_uri}
}

try:
    create_response = bedrock.create_evaluation_job(**evaluation_request)
    print(json.dumps(create_response, indent=2))
except (ClientError, BotoCoreError) as err:
    raise RuntimeError('Failed to create evaluation job.') from err


Creating evaluation job: baseline-eval-job-20250929-232655 → s3://riley-inc-rag-eval-results-20250929-210517/baseline-evals/model-judge-run-20250929-232500/
{
  "ResponseMetadata": {
    "RequestId": "9244a0e1-c964-449f-8395-6c060e0a12f1",
    "HTTPStatusCode": 202,
    "HTTPHeaders": {
      "date": "Mon, 29 Sep 2025 15:26:56 GMT",
      "content-type": "application/json",
      "content-length": "79",
      "connection": "keep-alive",
      "x-amzn-requestid": "9244a0e1-c964-449f-8395-6c060e0a12f1"
    },
    "RetryAttempts": 0
  },
  "jobArn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/ks5dd1pskn07"
}


## Check Job Status
Retrieve the evaluation job status to confirm that it is running.

In [17]:
job_arn = create_response.get('jobArn') or create_response.get('evaluationJobArn')
if not job_arn:
    raise ValueError('Could not find the job ARN in the create response.')

try:
    job_details = bedrock.get_evaluation_job(jobIdentifier=job_arn)
    print(json.dumps(job_details, indent=2, default=str))
except (ClientError, BotoCoreError) as err:
    raise RuntimeError('Unable to fetch evaluation job status.') from err


{
  "ResponseMetadata": {
    "RequestId": "b49565fd-12c3-41b4-9db4-4fa28db2c3ee",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 29 Sep 2025 15:28:03 GMT",
      "content-type": "application/json",
      "content-length": "1443",
      "connection": "keep-alive",
      "x-amzn-requestid": "b49565fd-12c3-41b4-9db4-4fa28db2c3ee"
    },
    "RetryAttempts": 0
  },
  "jobName": "baseline-eval-job-20250929-232655",
  "status": "InProgress",
  "jobArn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/ks5dd1pskn07",
  "roleArn": "arn:aws:iam::187899929471:role/service-role/Amazon-Bedrock-IAM-Role-20250928T210855",
  "jobType": "Automated",
  "applicationType": "ModelEvaluation",
  "evaluationConfig": {
    "automated": {
      "datasetMetricConfigs": [
        {
          "taskType": "General",
          "dataset": {
            "name": "GeneralTriviaNoGroundTruth",
            "datasetLocation": {
              "s3Uri": "s3://riley-inc-rag-knowledge-base-20250929
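The full `get_evaluation_job` response is long, and usually only a few fields matter when checking progress. The sketch below pulls those fields into a compact dictionary. The helper name `summarize_job` is hypothetical. The `inferenceConfig` layout is assumed to mirror the request that was submitted, and `failureMessages` is assumed to be present only when something went wrong, so both are read defensively.

```python
def summarize_job(job_details: dict) -> dict:
    """Reduce a get_evaluation_job response to the fields needed for a status check."""
    models = [
        model.get("bedrockModel", {}).get("modelIdentifier")
        for model in job_details.get("inferenceConfig", {}).get("models", [])
    ]
    return {
        "jobName": job_details.get("jobName"),
        "status": job_details.get("status"),
        "jobArn": job_details.get("jobArn"),
        "models": models,
        # Assumed to be absent unless the job reported problems.
        "failureMessages": job_details.get("failureMessages", []),
    }


# Values copied from the response above; inferenceConfig mirrors the request.
example_details = {
    "jobName": "baseline-eval-job-20250929-232655",
    "status": "InProgress",
    "jobArn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/ks5dd1pskn07",
    "inferenceConfig": {
        "models": [{"bedrockModel": {"modelIdentifier": "us.amazon.nova-lite-v1:0"}}]
    },
}
summary = summarize_job(example_details)
print(summary)
```

In the notebook this would be called as `summarize_job(job_details)` right after the status request.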

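## Define a Reusable Launch Helper
Wrap the request construction in a function so the same evaluation settings can be reused for each candidate model.

In [22]: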
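def launch_eval_job(model_id: str, task_type: str, datasets: list, metrics: list, evaluator_config: dict):
    """Create one evaluation job for model_id and return its name, ARN, and raw response.

    Relies on the notebook globals bedrock, role_arn, application_type, and shared_output_uri.
    """
    # Validate inputs before building the request.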
    if not isinstance(datasets, list) or not datasets:
        raise ValueError('datasets must be a non-empty list of dataset definitions.')
    if not isinstance(metrics, list) or not metrics:
        raise ValueError('metrics must be a non-empty list of metric names.')
    if not isinstance(evaluator_config, dict) or 'bedrockEvaluatorModels' not in evaluator_config:
        raise ValueError('evaluator_config must include bedrockEvaluatorModels.')

    dataset_metric_configs = []
    for dataset in datasets:
        if isinstance(dataset, str):
            dataset_metric_configs.append({
                'taskType': task_type,
                'dataset': {
                    'name': 'CustomDataset',
                    'datasetLocation': {'s3Uri': dataset}
                },
                'metricNames': metrics
            })
        elif isinstance(dataset, dict) and 'datasetLocation' in dataset:
            dataset_metric_configs.append({
                'taskType': task_type,
                'dataset': deepcopy(dataset),
                'metricNames': metrics
            })
        else:
            raise ValueError('Each dataset entry must be a dataset string or dict with datasetLocation.')

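    # Job names are assumed to accept only lowercase letters, digits, and hyphens, so normalize the model ID.
    sanitized_model = re.sub(r'[^a-z0-9-]+', '-', model_id.split('/')[-1].lower())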
    job_suffix = datetime.now().strftime('%Y%m%d-%H%M%S')
    job_name = f"baseline-eval-job-{sanitized_model}-{job_suffix}"

    request_payload = {
        'jobName': job_name,
        'roleArn': role_arn,
        'applicationType': application_type,
        'evaluationConfig': {
            'automated': {
                'datasetMetricConfigs': dataset_metric_configs,
                'evaluatorModelConfig': evaluator_config
            }
        },
        'inferenceConfig': {
            'models': [
                {
                    'bedrockModel': {
                        'modelIdentifier': model_id
                    }
                }
            ]
        },
        'outputDataConfig': {
            's3Uri': shared_output_uri
        }
    }

    try:
        response = bedrock.create_evaluation_job(**request_payload)
    except (ClientError, BotoCoreError) as err:
        raise RuntimeError(f'Failed to create evaluation job for {model_id}.') from err

    job_arn = response.get('jobArn') or response.get('evaluationJobArn')
    print(f"Launched {job_name} (model={model_id}) -> {job_arn}")
    return {
        'job_name': job_name,
        'job_arn': job_arn,
        'response': response
    }
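The job-name logic inside `launch_eval_job` can be isolated and checked before any API call is made. The sketch below assumes the service restricts job names to lowercase letters, digits, and hyphens, with at most 63 characters. That constraint is stated from memory of the API reference and should be verified there. `build_job_name` and `JOB_NAME_PATTERN` are hypothetical names.

```python
import re
from datetime import datetime

# Assumed constraint: lowercase alphanumerics separated by hyphens, max 63 characters.
JOB_NAME_PATTERN = re.compile(r"[a-z0-9](-*[a-z0-9]){0,62}")
MAX_JOB_NAME_LENGTH = 63


def build_job_name(prefix: str, model_id: str, now: datetime = None) -> str:
    """Build '<prefix>-<sanitized model>-<timestamp>' and validate it locally."""
    now = now or datetime.now()
    # Keep only the last path segment (for ARNs), then collapse other characters to hyphens.
    model_part = re.sub(r"[^a-z0-9]+", "-", model_id.split("/")[-1].lower()).strip("-")
    name = f"{prefix}-{model_part}-{now.strftime('%Y%m%d-%H%M%S')}"
    if len(name) > MAX_JOB_NAME_LENGTH or not JOB_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"Job name {name!r} violates the assumed naming constraint.")
    return name


print(build_job_name("baseline-eval-job", "us.amazon.nova-lite-v1:0"))
```

Failing locally gives a clearer message than a validation error returned by the service, and it keeps the timestamp format in one place.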


## Launch Batch of Model-as-a-Judge Eval Jobs
Launch one evaluation job per Nova-family candidate model, write all outputs to the shared comparison prefix, and print the evaluator configuration so the judge model is explicit.

In [23]:
# Reuse the candidate list defined in the Configuration cell.
candidate_models = list(candidate_models_default)

metrics = llm_judge_metrics

batch_jobs = []
for candidate_model in candidate_models:
    job_info = launch_eval_job(
        model_id=candidate_model,
        task_type=judge_task_type,
        datasets=datasets,
        metrics=metrics,
        evaluator_config=evaluator_model_config
    )
    batch_jobs.append(job_info)

print(json.dumps(batch_jobs, indent=2, default=str))
print(f'Evaluator model config used: {json.dumps(evaluator_model_config, indent=2)}')


Launched baseline-eval-job-us-amazon-nova-lite-v1-0-20250929-233009 (model=us.amazon.nova-lite-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/cd3wock08fmg
Launched baseline-eval-job-us-amazon-nova-micro-v1-0-20250929-233011 (model=us.amazon.nova-micro-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/lg3t3axuzgwo
Launched baseline-eval-job-us-amazon-nova-pro-v1-0-20250929-233012 (model=us.amazon.nova-pro-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/21qda9111ffq
[
  {
    "job_name": "baseline-eval-job-us-amazon-nova-lite-v1-0-20250929-233009",
    "job_arn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/cd3wock08fmg",
    "response": {
      "ResponseMetadata": {
        "RequestId": "69de510b-d45c-472f-a4f9-251c22f55d2b",
        "HTTPStatusCode": 202,
        "HTTPHeaders": {
          "date": "Mon, 29 Sep 2025 15:30:11 GMT",
          "content-type": "application/json",
          "content-length": "79",
          "connection": 

## Next Steps
- Monitor each job until it reaches a terminal state (`Completed`, `Failed`, or `Stopped`).
- Review judge scores and rationales stored within the shared output prefix.
- Compare candidate model performance side-by-side in the Bedrock console or by aggregating the S3 outputs.
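The monitoring step above can be sketched as a simple polling loop. This is a minimal illustration rather than a definitive implementation: `wait_for_job` is a hypothetical helper, the polling interval and attempt limit are arbitrary defaults, and the terminal states are the three listed above. The client is passed in as a parameter so the function works with the notebook's `bedrock` client or with a stub.

```python
import time

# Terminal states as listed in the Next Steps section.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_job(client, job_arn, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_evaluation_job until the job reaches a terminal state.

    Returns the final status string, or raises TimeoutError after max_polls checks.
    """
    for attempt in range(max_polls):
        status = client.get_evaluation_job(jobIdentifier=job_arn)["status"]
        if status in TERMINAL_STATES:
            return status
        if attempt < max_polls - 1:
            sleep(poll_seconds)  # Wait before the next check.
    raise TimeoutError(f"{job_arn} did not finish after {max_polls} status checks.")


# Example usage inside the notebook (requires AWS credentials, so it is left commented out):
# for job in batch_jobs:
#     print(job["job_name"], wait_for_job(bedrock, job["job_arn"]))
```

Polling the jobs one after another is the simplest option. Because the jobs run concurrently on the service side, the total wait is roughly that of the slowest job.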