# Baseline Evaluation Walkthrough
This notebook shows how Riley Inc. can launch a baseline Amazon Bedrock model evaluation job using the Amazon Nova Lite foundation model and the general text generation task type, then repeat the evaluation across several Nova models.

## Prerequisites
- AWS credentials configured for the account where evaluations will run.
- The IAM service role `Amazon-Bedrock-IAM-Role-20250928T210855`, which Amazon Bedrock assumes to run the job. It needs permission to invoke the models under evaluation and to write results to the target S3 bucket.
- Python environment with the latest `boto3` and `botocore` libraries installed.
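
The first two prerequisites can be sanity-checked before any job is submitted. The sketch below assumes the evaluation role lives in the same account as the active credentials, which may not hold in every setup. `caller_matches_role_account` is a hypothetical helper, not part of boto3; it takes an STS client as an argument and compares the caller's account ID with the account ID embedded in the role ARN.

```python
def caller_matches_role_account(sts_client, role_arn: str) -> bool:
    """Return True if the active credentials belong to the account that owns role_arn.

    Assumption: the evaluation role and the caller are expected to share one account.
    """
    # get_caller_identity reports the account behind the current credentials.
    caller_account = sts_client.get_caller_identity()["Account"]
    # IAM ARNs look like arn:aws:iam::<account-id>:role/<path>, so the account is field 4.
    role_account = role_arn.split(":")[4]
    return caller_account == role_account
```

In the notebook this could be called as `caller_matches_role_account(boto3.client("sts"), role_arn)` once `role_arn` is set in the Configuration cell.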


In [43]:
import json
import re
from datetime import datetime

import boto3
from botocore.exceptions import BotoCoreError, ClientError


## Configuration
Set your AWS Region, evaluation role, and S3 destination. Update `eval_results_bucket` to match the evaluation bucket provisioned in `00_infra_setup.ipynb`.

In [None]:
aws_region = "us-west-2" # Update with your selected region
role_arn = "arn:aws:iam::187899929471:role/service-role/Amazon-Bedrock-IAM-Role-20250928T210855" # Update with your own role arn
model_id = "us.amazon.nova-lite-v1:0" # Update with your selected model id
eval_results_bucket = "riley-inc-rag-eval-results-20250929-210517"  # Paste the exact bucket name created in 00_infra_setup.ipynb
results_prefix = "baseline-evals/"
datasets = ["Builtin.T-REx"]
metrics = ["Builtin.Accuracy"]
task_type = "Generation"
comparison_run_id = datetime.now().strftime('%Y%m%d-%H%M%S')
shared_output_uri = f"s3://{eval_results_bucket}/{results_prefix}comparison-run-{comparison_run_id}/"

if eval_results_bucket.endswith("REPLACE-ME"):
    raise ValueError("Update eval_results_bucket with the actual bucket name before running the notebook.")
if not datasets or not metrics:
    raise ValueError("Datasets and metrics must be non-empty lists.")

print(f"Evaluation artifacts will be stored under: {shared_output_uri}")
print(f"Datasets: {datasets}; metrics: {metrics}")


Evaluation artifacts will be stored under: s3://riley-inc-rag-eval-results-20250929-210517/baseline-evals/comparison-run-20250929-224333/
Datasets: ['Builtin.T-REx']; metrics: ['Builtin.Accuracy']


## Initialize Bedrock Client
Establish a client connection to Amazon Bedrock in the selected region.

In [45]:
bedrock = boto3.client("bedrock", region_name=aws_region)
print("Bedrock client ready.")

Bedrock client ready.


## Submit Baseline Eval Job
Create a general text generation evaluation job for the Nova Lite model using accuracy as the metric.

In [46]:
# Build a timestamped job name
job_name = f"baseline-eval-job-{datetime.now().strftime('%Y%m%d-%H%M%S')}"
output_uri = shared_output_uri
print(f"Creating evaluation job: {job_name} → {output_uri}")

dataset_metric_configs = [
    {
        'taskType': task_type,
        'dataset': {'name': dataset_name},
        'metricNames': metrics
    }
    for dataset_name in datasets
]

evaluation_request = {
    'jobName': job_name,
    'roleArn': role_arn,
    'evaluationConfig': {
        'automated': {
            'datasetMetricConfigs': dataset_metric_configs
        }
    },
    'inferenceConfig': {
        'models': [
            {
                'bedrockModel': {
                    'modelIdentifier': model_id
                }
            }
        ]
    },
    'outputDataConfig': {'s3Uri': output_uri}
}

try:
    create_response = bedrock.create_evaluation_job(**evaluation_request)
    print(json.dumps(create_response, indent=2))
except (ClientError, BotoCoreError) as err:
    raise RuntimeError('Failed to create evaluation job.') from err


Creating evaluation job: baseline-eval-job-20250929-224333 → s3://riley-inc-rag-eval-results-20250929-210517/baseline-evals/comparison-run-20250929-224333/
{
  "ResponseMetadata": {
    "RequestId": "5953de9e-2407-4a34-a5b0-556bc5032b9e",
    "HTTPStatusCode": 202,
    "HTTPHeaders": {
      "date": "Mon, 29 Sep 2025 14:43:35 GMT",
      "content-type": "application/json",
      "content-length": "79",
      "connection": "keep-alive",
      "x-amzn-requestid": "5953de9e-2407-4a34-a5b0-556bc5032b9e"
    },
    "RetryAttempts": 0
  },
  "jobArn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/r2bv25n2n9zy"
}


## Check Job Status
Retrieve the evaluation job status to confirm that it is running.

In [47]:
job_arn = create_response.get("jobArn") or create_response.get("evaluationJobArn")
if not job_arn:
    raise ValueError("Could not find the job ARN in the create response.")

try:
    job_details = bedrock.get_evaluation_job(jobIdentifier=job_arn)
    print(json.dumps(job_details, indent=2, default=str))
except (ClientError, BotoCoreError) as err:
    raise RuntimeError("Unable to fetch evaluation job status.") from err

{
  "ResponseMetadata": {
    "RequestId": "434ede68-9f7e-4f69-982e-113d7f28ed25",
    "HTTPStatusCode": 200,
    "HTTPHeaders": {
      "date": "Mon, 29 Sep 2025 14:43:35 GMT",
      "content-type": "application/json",
      "content-length": "1012",
      "connection": "keep-alive",
      "x-amzn-requestid": "434ede68-9f7e-4f69-982e-113d7f28ed25"
    },
    "RetryAttempts": 0
  },
  "jobName": "baseline-eval-job-20250929-224333",
  "status": "InProgress",
  "jobArn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/r2bv25n2n9zy",
  "roleArn": "arn:aws:iam::187899929471:role/service-role/Amazon-Bedrock-IAM-Role-20250928T210855",
  "jobType": "Automated",
  "applicationType": "ModelEvaluation",
  "evaluationConfig": {
    "automated": {
      "datasetMetricConfigs": [
        {
          "taskType": "Generation",
          "dataset": {
            "name": "Builtin.T-REx"
          },
          "metricNames": [
            "Builtin.Accuracy"
          ]
        }
      ]
    }
  }
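
Dumping the whole response is noisy when only a few fields matter. As a minimal sketch, the hypothetical helper `summarize_job` below pulls out the job name, status, ARN, datasets, and metrics from a `get_evaluation_job` response shaped like the output above. It reads only keys that appear in that output and tolerates missing ones.

```python
def summarize_job(job_details: dict) -> dict:
    """Reduce a get_evaluation_job response to the fields needed for a status check."""
    configs = (
        job_details.get("evaluationConfig", {})
        .get("automated", {})
        .get("datasetMetricConfigs", [])
    )
    return {
        "jobName": job_details.get("jobName"),
        "status": job_details.get("status"),
        "jobArn": job_details.get("jobArn"),
        # One entry per dataset/metric configuration in the request.
        "datasets": [config["dataset"]["name"] for config in configs],
        # Metrics may repeat across datasets, so de-duplicate and sort them.
        "metrics": sorted({name for config in configs for name in config.get("metricNames", [])}),
    }
```

Calling `summarize_job(job_details)` after the status cell gives a compact dictionary that is easier to scan or log than the full response.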

## Automate Multi-Model Eval Launch
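Define a helper that accepts a model ID, task type, datasets, and metrics so you can launch multiple evaluations with consistent settings. The helper reuses the `bedrock` client, `role_arn`, and `shared_output_uri` defined in earlier cells.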

In [50]:
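def launch_eval_job(model_id: str, task_type: str, datasets: list, metrics: list) -> dict:
    """Submit one automated evaluation job for model_id and return its name, ARN, and raw response.

    Relies on the notebook-level bedrock, role_arn, and shared_output_uri from earlier cells.
    """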
    if not isinstance(datasets, list) or not datasets:
        raise ValueError('datasets must be a non-empty list of dataset names.')
    if not isinstance(metrics, list) or not metrics:
        raise ValueError('metrics must be a non-empty list of metric names.')

    dataset_metric_configs = [
        {
            'taskType': task_type,
            'dataset': {'name': dataset_name},
            'metricNames': metrics
        }
        for dataset_name in datasets
    ]

    sanitized_model = re.sub(r'[^a-zA-Z0-9-]+', '-', model_id.split('/')[-1])
    job_suffix = datetime.now().strftime('%Y%m%d-%H%M%S')
    job_name = f"baseline-eval-job-{sanitized_model}-{job_suffix}"
    output_uri = shared_output_uri

    request_payload = {
        'jobName': job_name,
        'roleArn': role_arn,
        'evaluationConfig': {
            'automated': {
                'datasetMetricConfigs': dataset_metric_configs
            }
        },
        'inferenceConfig': {
            'models': [
                {
                    'bedrockModel': {
                        'modelIdentifier': model_id
                    }
                }
            ]
        },
        'outputDataConfig': {
            's3Uri': output_uri
        }
    }

    try:
        response = bedrock.create_evaluation_job(**request_payload)
    except (ClientError, BotoCoreError) as err:
        raise RuntimeError(f'Failed to create evaluation job for {model_id}.') from err

    job_arn = response.get('jobArn') or response.get('evaluationJobArn')
    print(f"Launched {job_name} (model={model_id}) -> {job_arn}")
    return {
        'job_name': job_name,
        'job_arn': job_arn,
        'response': response
    }
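
The job name built inside `launch_eval_job` grows with the model ID, so a long identifier could exceed the service's naming rules. The sketch below is one way to guard against that. It assumes job names must be lowercase alphanumerics and hyphens with a 63-character limit; treat both as assumptions to confirm against the `CreateEvaluationJob` reference. `make_job_name`, `MAX_JOB_NAME_LENGTH`, and `JOB_PREFIX` are hypothetical names, and unlike the helper above this version lowercases the model ID.

```python
import re
from datetime import datetime

MAX_JOB_NAME_LENGTH = 63  # Assumed limit; confirm in the CreateEvaluationJob reference.
JOB_PREFIX = "baseline-eval-job-"


def make_job_name(model_id: str, now: datetime) -> str:
    """Build a lowercase, hyphen-separated job name that fits the assumed length limit."""
    suffix = now.strftime("%Y%m%d-%H%M%S")
    # Keep only the final path segment so ARN-style identifiers also work.
    model_part = re.sub(r"[^a-z0-9]+", "-", model_id.split("/")[-1].lower()).strip("-")
    # Reserve room for the prefix, the timestamp, and the hyphen that joins them.
    room = MAX_JOB_NAME_LENGTH - len(JOB_PREFIX) - len(suffix) - 1
    model_part = model_part[:room].rstrip("-")
    return f"{JOB_PREFIX}{model_part}-{suffix}"
```

For the model IDs used in this notebook the result matches the names produced by `launch_eval_job`; only longer or mixed-case identifiers come out differently.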


## Launch Batch of Eval Jobs
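Trigger three evaluations, one per Nova-family model, reusing the same dataset and task type. This batch also adds `Builtin.Robustness` to the metric list, so its results cover more than the accuracy-only job submitted earlier.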

In [51]:
candidate_models = [
    'us.amazon.nova-lite-v1:0',
    'us.amazon.nova-micro-v1:0',
    'us.amazon.nova-pro-v1:0'
]

task_type = 'Generation'
datasets = ["Builtin.T-REx"]
metrics = ["Builtin.Accuracy", "Builtin.Robustness"]

batch_jobs = []
for candidate_model in candidate_models:
    job_info = launch_eval_job(
        model_id=candidate_model,
        task_type=task_type,
        datasets=datasets,
        metrics=metrics
    )
    batch_jobs.append(job_info)

print(json.dumps(batch_jobs, indent=2, default=str))


Launched baseline-eval-job-us-amazon-nova-lite-v1-0-20250929-224401 (model=us.amazon.nova-lite-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/p9mnmfi9e41v
Launched baseline-eval-job-us-amazon-nova-micro-v1-0-20250929-224402 (model=us.amazon.nova-micro-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/6ib4dw60o4ph
Launched baseline-eval-job-us-amazon-nova-pro-v1-0-20250929-224404 (model=us.amazon.nova-pro-v1:0) -> arn:aws:bedrock:us-west-2:187899929471:evaluation-job/12qagahudtht
[
  {
    "job_name": "baseline-eval-job-us-amazon-nova-lite-v1-0-20250929-224401",
    "job_arn": "arn:aws:bedrock:us-west-2:187899929471:evaluation-job/p9mnmfi9e41v",
    "response": {
      "ResponseMetadata": {
        "RequestId": "9ef80206-fc80-436a-a2aa-25840ae1f980",
        "HTTPStatusCode": 202,
        "HTTPHeaders": {
          "date": "Mon, 29 Sep 2025 14:44:02 GMT",
          "content-type": "application/json",
          "content-length": "79",
          "connection": 

## Next Steps
- Monitor each job until it reaches a terminal state (`Completed`, `Failed`, or `Stopped`).
- Review the metrics and detailed outputs stored in the S3 results prefix.
- Iterate on prompts, models, or datasets as needed to improve baseline quality.
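
The first two steps above can be sketched as a polling loop plus an S3 listing. This is a minimal sketch, not a definitive implementation: `wait_for_jobs` and `list_result_keys` are hypothetical helpers, the 60-second interval and 120-poll cap are arbitrary defaults, and the exact layout of result files under the prefix is not assumed. Both functions take their clients as arguments, so the notebook's `bedrock` client and a `boto3.client("s3")` can be passed in.

```python
import time

# Terminal states as listed in the Next Steps above.
TERMINAL_STATES = {"Completed", "Failed", "Stopped"}


def wait_for_jobs(bedrock_client, job_arns, poll_seconds=60, max_polls=120, sleep=time.sleep):
    """Poll get_evaluation_job until every job is terminal or max_polls is reached.

    Returns a dict mapping each job ARN to its last observed status.
    """
    statuses = {arn: None for arn in job_arns}
    for _ in range(max_polls):
        for arn in job_arns:
            # Skip jobs that already finished to avoid unnecessary API calls.
            if statuses[arn] not in TERMINAL_STATES:
                details = bedrock_client.get_evaluation_job(jobIdentifier=arn)
                statuses[arn] = details["status"]
        if all(status in TERMINAL_STATES for status in statuses.values()):
            break
        sleep(poll_seconds)
    return statuses


def list_result_keys(s3_client, bucket, prefix):
    """Return every object key under s3://bucket/prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys
```

A typical call would be `wait_for_jobs(bedrock, [job["job_arn"] for job in batch_jobs])`, followed by `list_result_keys(boto3.client("s3"), eval_results_bucket, results_prefix)` once the jobs report `Completed`.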