# Explore Braintrust Eval Dataset

Goal: Load the `SQL-agent-annotated` dataset from Braintrust, inspect its structure, and figure out how to extract:

1.  **Input:** Original user query.
2.  **Expected:** Human review scores (Pass/Fail) and reasons.
3.  **Metadata:** Actual agent outputs (final answer, tool calls) needed for scoring.

In [1]:
import braintrust
import os
import json
from dotenv import load_dotenv

# Load environment variables (e.g., BRAINTRUST_API_KEY)
# Make sure you have a .env file in the root or the key set in your environment
load_dotenv() 

print("Braintrust SDK imported.")
# Ensure API key is loaded (optional check)
if not os.getenv("BRAINTRUST_API_KEY"):
    print("Warning: BRAINTRUST_API_KEY not found in environment.")
else:
    print("BRAINTRUST_API_KEY found.")

Braintrust SDK imported.
BRAINTRUST_API_KEY found.


## Configuration

In [2]:
PROJECT_NAME = "Transcript Agent" # Make sure this matches your Braintrust project name
DATASET_NAME = "SQL-agent-annotated" # The dataset created via the UI

print(f"Project: {PROJECT_NAME}")
print(f"Dataset: {DATASET_NAME}")

Project: Transcript Agent
Dataset: SQL-agent-annotated


## Initialize and Fetch Dataset Records

In [3]:
# Initialize connection to the dataset
# This uses the API key loaded from the environment
try:
    # Use 'with' context manager to ensure resources are cleaned up
    with braintrust.init_dataset(
        project=PROJECT_NAME,
        name=DATASET_NAME
    ) as dataset:
        print(f"Successfully initialized dataset: {DATASET_NAME}")
        
        # Fetch records (use an iterator to avoid loading everything at once if large)
        fetched_records = []
        # Use the dataset object as an iterator
        for i, record in enumerate(dataset):
            # Each record is returned as a plain dict (confirmed by the type check below);
            # store it unmodified for inspection
            fetched_records.append(record)

            # Fetch only a few records for initial inspection
            # Adjust the number as needed (fetching 3 here)
            if i >= 2:
                break
                 
        print(f"Fetched {len(fetched_records)} records for inspection.")

except Exception as e:
    print(f"Error initializing or fetching from dataset: {e}")
    fetched_records = [] # Ensure it's empty on error

# Quick check if records were fetched
if 'fetched_records' in locals() and fetched_records:
    print(f"Type of first record: {type(fetched_records[0])}")
else:
    print("No records seem to have been fetched.")


Successfully initialized dataset: SQL-agent-annotated
Fetched 3 records for inspection.
Type of first record: <class 'dict'>
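
As a side note, the "take the first few records" step can also be written with `itertools.islice`, which works on any iterable. The sketch below uses a stand-in generator rather than a real Braintrust dataset, and `take_first` is a hypothetical helper name, not part of the SDK.

```python
from itertools import islice


def take_first(records, n=3):
    """Return at most n items from any iterable, pulling only n items from it."""
    return list(islice(records, n))


# Stand-in for the dataset iterator (placeholder data, not real Braintrust records)
fake_dataset = ({"id": str(i)} for i in range(10))

sample = take_first(fake_dataset, 3)
print(f"Fetched {len(sample)} records for inspection.")
```

With the real dataset object, the same call would presumably be `take_first(dataset, 3)` inside the `with` block, assuming the dataset remains iterable as it is in the cell above.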


## Inspect Record Structure

Let's look at the structure of the first fetched dictionary to understand where the input, expected labels, and metadata (agent outputs) are located.
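
Because the full JSON dump of a record can be long, a compact key-to-type summary may be easier to scan first. The following is a minimal sketch; `summarize_structure` is a hypothetical helper, and the sample record is a placeholder that only mimics the general shape of a dataset record.

```python
def summarize_structure(value, max_depth=2, _depth=0):
    """Map each key to a type name, recursing into nested dicts up to max_depth."""
    if isinstance(value, dict) and _depth < max_depth:
        return {
            key: summarize_structure(child, max_depth, _depth + 1)
            for key, child in value.items()
        }
    if isinstance(value, list):
        return f"list[{len(value)}]"
    return type(value).__name__


# Placeholder record (illustrative shape only, not real data)
sample_record = {
    "created": "2025-01-01T00:00:00.000Z",
    "expected": {"error": None, "final_answer": "...", "messages": [{}, {}]},
    "metadata": {"Some Reason": "..."},
}

for key, summary in summarize_structure(sample_record).items():
    print(f"{key}: {summary}")
```

Calling `summarize_structure(fetched_records[0])` would apply the same summary to the first fetched record.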

In [4]:
# Ensure we have fetched records before trying to access them
if 'fetched_records' in locals() and fetched_records:
    first_record_dict = fetched_records[0] # Get the first record dictionary
    
    # Pretty print the dictionary structure
    print(json.dumps(first_record_dict, indent=2))
         
else:
    print("Variable 'fetched_records' not defined or empty. Please run the fetching cell first.")


{
  "_pagination_key": "p07498881093001281551",
  "_xact_id": "1000195016097763644",
  "created": "2025-04-29T23:28:29.962Z",
  "dataset_id": "0809fe94-66a2-4093-856a-928e51f30412",
  "expected": {
    "error": null,
    "final_answer": "The speaker with the most entries in the transcript is Hugo Bowne-Anderson, with 430 entries.",
    "messages": [
      {
        "content": "Which speaker has the most entries in the transcript?",
        "role": "user"
      },
      {
        "annotations": [],
        "audio": null,
        "content": null,
        "function_call": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"sql_query\":\"SELECT speaker, COUNT(*) as entry_count FROM transcript_segments GROUP BY speaker ORDER BY entry_count DESC LIMIT 1;\"}",
              "name": "query_database"
            },
            "id": "call_Nav6V13hc6X5v4QCiKW7z93v",
            "type": "function"

## Inspect Metadata Field

Let's look specifically inside the `metadata` field of the first record to confirm the keys for the human review Pass/Fail scores and see if the agent outputs are duplicated there.

In [5]:
# Ensure we have fetched records before trying to access them
if 'fetched_records' in locals() and fetched_records:
    first_record_dict = fetched_records[0] # Get the first record dictionary
    
    # Extract the metadata dictionary
    metadata_dict = first_record_dict.get('metadata', {})
    
    # Pretty print the metadata dictionary structure
    print(json.dumps(metadata_dict, indent=2))
         
else:
    print("Variable 'fetched_records' not defined or empty. Please run the fetching cell first.")

{
  "Final Answer Quality Reason": "The final answer correctly states the speaker and count based on the query results.",
  "SQL Correctness Reason": "The SQL query accurately calculates and retrieves the speaker with the highest entry count.",
  "Tool Choice Reason": "Correctly chose the database tool to answer a question requiring data aggregation."
}
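
Based on the parts of the record visible so far, the user query, final answer, and SQL tool call all sit under `expected`, while `metadata` holds the review reasons. A hedged sketch of pulling these out is shown below. `extract_fields` is a hypothetical helper, the sample record is an abbreviated copy of the output above, and reading the query from the first `user` message is an assumption (the record output is truncated, so other fields may exist).

```python
import json


def extract_fields(record):
    """Pull the user query, final answer, SQL queries, and review reasons from a record dict."""
    expected = record.get("expected") or {}
    messages = expected.get("messages") or []

    # Assumption: the original query is the first message with role "user"
    user_query = next(
        (m.get("content") for m in messages if m.get("role") == "user"), None
    )

    sql_queries = []
    for message in messages:
        for call in message.get("tool_calls") or []:
            function = call.get("function") or {}
            try:
                # Tool-call arguments are stored as a JSON-encoded string
                args = json.loads(function.get("arguments") or "{}")
            except json.JSONDecodeError:
                continue
            if isinstance(args, dict) and "sql_query" in args:
                sql_queries.append(args["sql_query"])

    return {
        "user_query": user_query,
        "final_answer": expected.get("final_answer"),
        "sql_queries": sql_queries,
        "reasons": record.get("metadata") or {},
    }


# Abbreviated copy of the record printed above (only the visible fields)
sample_record = {
    "expected": {
        "error": None,
        "final_answer": "The speaker with the most entries in the transcript is Hugo Bowne-Anderson, with 430 entries.",
        "messages": [
            {"content": "Which speaker has the most entries in the transcript?", "role": "user"},
            {
                "content": None,
                "role": "assistant",
                "tool_calls": [
                    {
                        "function": {
                            "arguments": "{\"sql_query\":\"SELECT speaker, COUNT(*) as entry_count FROM transcript_segments GROUP BY speaker ORDER BY entry_count DESC LIMIT 1;\"}",
                            "name": "query_database",
                        },
                        "type": "function",
                    }
                ],
            },
        ],
    },
    "metadata": {
        "Tool Choice Reason": "Correctly chose the database tool to answer a question requiring data aggregation."
    },
}

print(json.dumps(extract_fields(sample_record), indent=2))
```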


## Re-Inspect Full Record and Object Attributes

The `metadata` field contained only the reason strings, not the Pass/Fail scores. Let's print the full record dictionary again and also check the attributes of the fetched record object to find where those scores are stored.
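
Rather than reading the whole dump by eye, one option is to walk the nested record and report every path whose key mentions "score" or whose value is a literal Pass/Fail string. This is only a sketch: `find_paths` and `looks_like_score` are hypothetical helpers, and the key names in the sample are invented for illustration, since the real location of the scores has not been found yet.

```python
def find_paths(value, predicate, path=()):
    """Yield (path, value) for every dict entry, at any depth, that matches predicate."""
    if isinstance(value, dict):
        for key, child in value.items():
            child_path = path + (key,)
            if predicate(key, child):
                yield child_path, child
            yield from find_paths(child, predicate, child_path)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from find_paths(child, predicate, path + (index,))


def looks_like_score(key, value):
    """Heuristic: key mentions 'score', or the value is the string Pass/Fail."""
    key_hit = "score" in str(key).lower()
    value_hit = isinstance(value, str) and value.strip().lower() in {"pass", "fail"}
    return key_hit or value_hit


# Invented sample: these key names are placeholders, not confirmed dataset fields
sample_record = {
    "metadata": {"Tool Choice Reason": "Correctly chose the database tool."},
    "hypothetical_review": {"SQL Correctness": "Pass"},
}

for found_path, found_value in find_paths(sample_record, looks_like_score):
    print(" -> ".join(str(part) for part in found_path), "=", found_value)
```

Running `find_paths(fetched_records[0], looks_like_score)` on a real record would list candidate locations, if any exist in the fetched dictionary.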

In [6]:
# Ensure we have fetched records 
if 'fetched_records' in locals() and fetched_records:
    first_record_object = fetched_records[0] 
    first_record_dict = {}
    if hasattr(first_record_object, 'as_dict'):
        first_record_dict = first_record_object.as_dict()
    elif isinstance(first_record_object, dict):
        first_record_dict = first_record_object
    elif hasattr(first_record_object, '__dict__'):
        first_record_dict = first_record_object.__dict__

    print("--- Full First Record Dictionary ---")
    if first_record_dict:
        print(json.dumps(first_record_dict, indent=2))
    else:
        print("(Could not convert to dictionary)")

    print("\n--- Attributes of Original Record Object ---")
    # Use dir() to list attributes and methods
    print(dir(first_record_object))
    
    # Check if there's a specific 'scores' attribute
    if hasattr(first_record_object, 'scores'):
        print("\nFound a '.scores' attribute. Value:")
        print(getattr(first_record_object, 'scores'))
    else:
        print("\nNo '.scores' attribute found directly on the object.")
         
else:
    print("Variable 'fetched_records' not defined or empty. Please run the fetching cell first.")


--- Full First Record Dictionary ---
{
  "_pagination_key": "p07498881093001281551",
  "_xact_id": "1000195016097763644",
  "created": "2025-04-29T23:28:29.962Z",
  "dataset_id": "0809fe94-66a2-4093-856a-928e51f30412",
  "expected": {
    "error": null,
    "final_answer": "The speaker with the most entries in the transcript is Hugo Bowne-Anderson, with 430 entries.",
    "messages": [
      {
        "content": "Which speaker has the most entries in the transcript?",
        "role": "user"
      },
      {
        "annotations": [],
        "audio": null,
        "content": null,
        "function_call": null,
        "refusal": null,
        "role": "assistant",
        "tool_calls": [
          {
            "function": {
              "arguments": "{\"sql_query\":\"SELECT speaker, COUNT(*) as entry_count FROM transcript_segments GROUP BY speaker ORDER BY entry_count DESC LIMIT 1;\"}",
              "name": "query_database"
            },
            "id": "call_Nav6V13hc6X5v4QCiKW7
