# MMLU-ProX-Lite-Open Criterion and Prompt Template Setup

This notebook creates a criterion set and prompt template for evaluating responses to MMLU-ProX-Lite-Open questions using the Elluminate platform.

## Setup and Imports

In [3]:
import os
from dotenv import load_dotenv
from elluminate import Client
import nest_asyncio

nest_asyncio.apply()

# Load environment variables
load_dotenv()

print("Environment setup complete")

Environment setup complete


## Initialize Elluminate Client

In [4]:
# Initialize Elluminate client
# Make sure ELLUMINATE_API_KEY and ELLUMINATE_BASE_URL are set in your .env file
client = Client()

print("Elluminate client initialized successfully")

Elluminate client initialized successfully
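
The comment in the cell above asks for `ELLUMINATE_API_KEY` and `ELLUMINATE_BASE_URL` to be set. A quick check like the following, run before creating the client, could surface a missing variable early. This is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper and not part of the Elluminate SDK.

```python
import os

# Variable names taken from the comment in the client initialization cell
REQUIRED_VARS = ("ELLUMINATE_API_KEY", "ELLUMINATE_BASE_URL")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set")
```

Passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment.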


## Create MMLU-ProX-Lite-Open Prompt Template

The prompt template consists of:
- **System message**: Instructions for exact answering, chain-of-thought reasoning, and language matching
- **User message**: Contains only the question from the dataset
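
To illustrate what the `{{question}}` placeholder in the user message stands for, the sketch below fills it from a single dataset row. The platform performs this substitution itself, and its template engine may behave differently; `render_template` and the sample row are hypothetical and exist only for illustration.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Replace each {{name}} placeholder with its value.

    Raises KeyError if the template references a name missing from `variables`.
    """
    def substitute(match):
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Hypothetical dataset row; the real dataset supplies the `question` column
row = {"question": "What is the boiling point of water at sea level in degrees Celsius?"}
print(render_template("{{question}}", row))
```

The same placeholder syntax appears later in the criterion text, where `{{answer}}` refers to the ground truth column.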

In [5]:
# Define the system message for the prompt template
system_message = """You are an expert academic assistant. Your task is to answer questions as accurately and precisely as possible.

Instructions:
1. Provide exact and comprehensive answers
2. Use chain-of-thought reasoning before giving your final answer
3. Always answer and think in the same language as the user's question
4. Be concise but complete in your response

Structure your response as:
- First, briefly analyze the question
- Then, provide your reasoning step by step
- Finally, give your definitive answer"""

# Define the user message template (uses variables from the dataset)
user_message_template = "{{question}}"

# Create the prompt template
prompt_template, created = client.prompt_templates.get_or_create(
    name="MMLU-ProX-Lite-Open",
    user_prompt_template=[
        {"role": "system", "content": system_message},
        {"role": "user", "content": user_message_template},
    ],
)

status = "created" if created else "found existing"
print(f"✓ Prompt template '{prompt_template.name}' {status}")
print(f"Template ID: {prompt_template.id}")

✓ Prompt template 'MMLU-ProX-Lite-Open' created
Template ID: 401


## Create Ground Truth Matching Criterion

This criterion checks whether a response matches the ground truth answer. The judge prompt applies the following rules:
- The response must contain at least as much information as the ground truth
- The response may be more specific or include additional correct answers
- Paraphrasing is acceptable
- For numeric answers, the relative error must be below 1%
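
The numeric rule can be made concrete with a small worked example. The sketch below computes the relative error as defined in the criterion text, |response - ground_truth| / mean(response, ground_truth). On the platform the judgment is made by the judge model from the prompt, not by this code. Taking the absolute value of the mean and the handling of a zero mean are assumptions added here, since the criterion does not cover those cases.

```python
def relative_error(response: float, ground_truth: float) -> float:
    """Relative error as |response - ground_truth| / mean(response, ground_truth)."""
    mean = (response + ground_truth) / 2
    if mean == 0:
        # Assumption: two zeros match exactly; any other zero-mean pair never matches
        return 0.0 if response == ground_truth else float("inf")
    # Assumption: use |mean| so that negative values give a non-negative error
    return abs(response - ground_truth) / abs(mean)


def numeric_match(response: float, ground_truth: float, tolerance: float = 0.01) -> bool:
    """True if the relative error is strictly below the tolerance (1% by default)."""
    return relative_error(response, ground_truth) < tolerance


print(numeric_match(101, 100))  # True: 1 / 100.5 is about 0.00995
print(numeric_match(99, 100))   # False: 1 / 99.5 is about 0.01005
print(numeric_match(102, 100))  # False: 2 / 101 is about 0.0198
```

Because the denominator is the mean of both values, the measure is symmetric in its arguments but not around the ground truth: 101 passes against 100 while 99 narrowly fails.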

In [6]:
# Define the criterion for ground truth matching
ground_truth_criterion = """Does the response match the ground truth answer?

<ground_truth_answer>
{{answer}}
</ground_truth_answer>

For a response to match, it must contain at least as much information as the ground truth. The response can:
- Be more specific (e.g., "Labrador" is more specific than "dog")
- Include additional possible correct answers
- Use different words or paraphrasing that conveys the same meaning
- For numeric answers: have a relative error less than 1% (defined as |response - ground_truth| / mean(response, ground_truth))

The response must cover everything mentioned in the ground truth, even if expressed differently. Please answer with Yes if the response matches the ground truth according to the above specifications, no otherwise!"""

# Create criterion set for MMLU-ProX evaluation
criterion_set, created = client.criterion_sets.get_or_create(
    name="MMLU-ProX-Ground-Truth-Matching",
    prompt_template=prompt_template,
)

status = "created" if created else "found existing"
print(f"✓ Criterion set '{criterion_set.name}' {status}")
print(f"Criterion set ID: {criterion_set.id}")

✓ Criterion set 'MMLU-ProX-Ground-Truth-Matching' created
Criterion set ID: 145


## Add Criteria to the Set

In [7]:
# Add the ground truth matching criterion and the language criterion to the set
language_criterion = "Is the whole response in the same language as the user question?"
try:
    client.criteria.add_many(
        [ground_truth_criterion, language_criterion], criterion_set=criterion_set
    )
    print("✓ Criteria added to set")
except Exception as e:
    print(f"Note: Criteria may already exist - {e}")

# List all criteria in the set to verify
criteria_list = client.criteria.list(prompt_template)
print(f"\nCriteria in set '{criterion_set.name}':")
for i, criterion in enumerate(criteria_list, 1):
    print(f"{i}. {criterion}")

✓ Criteria added to set

Criteria in set 'MMLU-ProX-Ground-Truth-Matching':
1. id=2787 criterion_str='Is the whole response in the same language as the user question?' label='Criterion 2' prompt_template=None template_variables=None created_at=datetime.datetime(2025, 7, 24, 7, 48, 17, 669000, tzinfo=TzInfo(UTC))
2. id=2786 criterion_str='Does the response match the ground truth answer?\n\n<ground_truth_answer>\n{{answer}}\n</ground_truth_answer>\n\nFor a response to match, it must contain at least as much information as the ground truth. The response can:\n- Be more specific (e.g., "Labrador" is more specific than "dog")\n- Include additional possible correct answers\n- Use different words or paraphrasing that conveys the same meaning\n- For numeric answers: have a relative error less than 1% (defined as |response - ground_truth| / mean(response, ground_truth))\n\nThe response must cover everything mentioned in the ground truth, even if expressed differently. Please answer with Yes if t