# Search API Relevance Judge

This notebook evaluates and compares two search APIs (EXA and PWS) using AI-powered judging.

## Overview

1. **Parallel Search Execution**: Run both APIs simultaneously for speed
2. **AI-Powered Evaluation**: Use OpenAI agents to judge each result against custom rubrics
3. **Criteria-Based Scoring**: Evaluate multiple criteria (relevance, price fit, product match, etc.)
4. **Excel Reports**: Generate detailed evaluation reports with scores and reasoning

## Setup

First, let's import all required dependencies.


In [None]:

import asyncio
import re
import time
from typing import List, Optional

import pandas as pd
from agents import Agent, AgentOutputSchema
from agents.run import Runner
from pydantic import BaseModel

## Define AI Judge Agent

Configure the LLM agent that will evaluate search results based on custom rubrics.

**Key Components:**
- `Criteria`: Schema for individual criterion scores (name, score, notes)
- `JudgeReport`: Overall evaluation schema (query, criteria list)
- `judge_agent`: The AI agent with instructions for objective evaluation

The agent uses GPT-5 to score each result on a 1-5 scale across multiple criteria defined in the rubric.


In [None]:
import json

def _msg(payload: dict) -> list:
    return [{
        "role": "user",
        "content": [{"type": "input_text", "text": json.dumps(payload)}]
    }]

class Criteria(BaseModel):
    criteria_name: str 
    score: float
    notes: str 

class JudgeReport(BaseModel):
    query: str
    criteria: List[Criteria]

judge_agent = Agent(
    name="ProductSearchQualityJudge",
    model="gpt-5", 
    instructions=(
        "You are an objective evaluator for shopping/search APIs.\n"
        "Do NOT browse the web. Use only the inputs provided.\n\n"

        "Inputs:\n"
        "- query: the user's search query.\n"
        "- apis: array of { name, latency_sec (float), results: [ {title, url, price?} ] } per API.\n"
        "- rubric_json: a JSON rubric defining detailed evaluation criteria for multiple queries.\n\n"

        "Your job:\n"
        "1. Identify the rubric in rubric_json whose 'query' best matches the given query (case-insensitive exact match preferred).\n"
        "2. For that rubric, evaluate each API’s results across all five criteria listed under 'criteria'.\n"
        "3. For each criterion, assign a numeric score in [1,5] using the definitions in 'score_definitions'. Interpret descriptions precisely and grade deterministically. 1 is the lowest and 5 is the highest.\n"
        "4. Add a latency adjustment: latency_score = 1 / (1 + latency_sec/2), clipped to [0,1].\n"
        "5. Round all numeric scores to 2 decimals.\n\n"

        "Scoring Details:\n"
        "- For each result:\n"
        "  • Analyze its results (title, url, price if given) to infer how well each criterion is met.\n"
        "  • If results are missing or malformed, score conservatively (1 or 2).\n"
        "  • Be consistent and deterministic.\n\n"

        "Rules & Edge Cases:\n"
        "1) Use ONLY the rubric_json and the API results. Do NOT infer external data.\n"
        "2) If query not found in rubric_json, return an empty 'criteria' list.\n"
        "3) 'notes' must summarize key strengths or weaknesses from the rubric perspective.\n"
    ),
    tools=[],  # pure rubric-based judging
    output_type=AgentOutputSchema(JudgeReport, strict_json_schema=True)
)
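
The prompt above asks the model to compute the latency adjustment itself. As a cross-check, the same formula can be evaluated deterministically in Python. This is a minimal sketch; `latency_score` is a hypothetical helper that the notebook does not define.

```python
def latency_score(latency_sec: float) -> float:
    """Map latency in seconds to a score in [0, 1]; lower latency scores higher."""
    # Treat negative inputs as zero so the denominator can never reach 0.
    latency_sec = max(0.0, latency_sec)
    raw = 1.0 / (1.0 + latency_sec / 2.0)
    # Clip to [0, 1] as the prompt requests, then round to 2 decimals.
    return round(min(1.0, max(0.0, raw)), 2)

for seconds in (0.0, 2.0, 6.0):
    print(f"{seconds:.1f}s -> {latency_score(seconds)}")
```

With this formula a 2-second search scores 0.5 and a 6-second search scores 0.25, so the penalty grows quickly for the first few seconds and flattens afterwards.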


## Load Evaluation Rubrics

Load the evaluation criteria from `evals/product_search_rubric.json`.

Each rubric defines:
- **Query**: The search query to evaluate
- **Criteria**: Multiple scoring dimensions (color match, price fit, relevance, etc.)
- **Score Definitions**: What each score (1-5) means for each criterion

The rubrics guide the AI judge on how to evaluate search results objectively.
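
To make the rubric shape concrete, here is a hypothetical example together with the case-insensitive lookup that the judge prompt describes. Only the `rubrics`, `query`, and `criteria` keys are read by this notebook, and `score_definitions` is named in the prompt; the sample query, the criterion `name` field, and the `find_rubric` helper are assumptions made for illustration.

```python
import json

# Hypothetical rubric file contents. The inner layout of each criterion
# is an assumption; the real file may differ.
sample_rubric_json = """
{
  "rubrics": [
    {
      "query": "red running shoes under $100",
      "criteria": [
        {
          "name": "Color match",
          "score_definitions": {"1": "Not red", "5": "Clearly red"}
        },
        {
          "name": "Price fit",
          "score_definitions": {"1": "Over $150", "5": "At or under $100"}
        }
      ]
    }
  ]
}
"""

def find_rubric(data: dict, query: str):
    """Return the rubric whose query matches case-insensitively, else None."""
    wanted = query.strip().lower()
    for rubric in data.get("rubrics", []):
        if (rubric.get("query") or "").strip().lower() == wanted:
            return rubric
    return None

sample_data = json.loads(sample_rubric_json)
match = find_rubric(sample_data, "Red Running Shoes Under $100")
print([criterion["name"] for criterion in match["criteria"]])
```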


In [None]:
with open("evals/product_search_rubric.json", "r", encoding="utf-8") as f:
    rubric_data = json.load(f)

evals = []
for item in rubric_data.get("rubrics", []):
    evals.append({
        "query": item.get("query"),
        "criteria": item.get("criteria")
    })


## Execute Evaluations

This section contains the main evaluation logic:

### Helper Functions

**`run_search(script_name, query)`**
- Executes a search script (EXA or PWS) asynchronously  
- Returns: `(results, latency)` tuple
- Enables parallel search execution with `asyncio.gather()`

**`evaluate_result(api_name, query, idx, item, latency, rubric_data)`**
- Sends a single search result to the AI judge for evaluation
- Returns: Tuple of `(api_name, query, idx, item, judge_output)`
- Enables parallel evaluation of all results

### Main Execution Loop

For each query in the rubric:
1. **Parallel Search** - Run both EXA and PWS searches simultaneously
2. **Prepare Tasks** - Create evaluation tasks for top 10 results from each API
3. **Parallel Judging** - Send all results to AI judge concurrently (up to 20 evaluations)
4. **Generate Excel** - Create separate reports for EXA and PWS with scores and reasoning

**Performance:** Within each query, the two searches run concurrently and so do all judge calls. The queries themselves are processed one after another.
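
The parallel search step relies on `asyncio.gather()`, which starts its awaitables together and returns their results in argument order. The sketch below shows that pattern in isolation; `fake_search` and `run_both` are hypothetical stand-ins that sleep instead of calling a real search script.

```python
import asyncio
import time

async def fake_search(api_name: str, query: str, delay: float):
    """Stand-in for run_search: sleeps instead of calling a real API."""
    start = time.perf_counter()
    await asyncio.sleep(delay)
    results = [{"title": f"{api_name} result for {query}", "url": "", "price": None}]
    return results, time.perf_counter() - start

async def run_both(query: str):
    """Start both searches at once; gather returns results in argument order."""
    return await asyncio.gather(
        fake_search("EXA", query, 0.2),
        fake_search("PWS", query, 0.1),
    )

# In a notebook cell, use `await run_both(...)` instead of asyncio.run.
(exa_results, exa_latency), (pws_results, pws_latency) = asyncio.run(run_both("red shoes"))
print(exa_results[0]["title"], "|", pws_results[0]["title"])
```

Because results come back in argument order, the tuple unpacking stays correct even when the second search finishes first.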


In [None]:
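async def run_search(script_name, query):
    """Run a search script and return (results, latency_in_seconds)."""
    # perf_counter is monotonic, so it is safer than time.time() for timing.
    start_time = time.perf_counter()
    script_path = f"/Users/karthikapurushothaman/projects/search/{script_name}"
    process = await asyncio.create_subprocess_exec(
        "/Users/karthikapurushothaman/projects/search/.venv/bin/python",
        script_path, "-q", query, "-f", "json",
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE
    )
    stdout, stderr = await process.communicate()
    latency = time.perf_counter() - start_time
    # Surface script failures instead of a confusing JSON decode error.
    if process.returncode != 0:
        raise RuntimeError(
            f"{script_name} exited with code {process.returncode}: {stderr.decode().strip()}"
        )
    results = json.loads(stdout.decode())
    return results, latency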

async def evaluate_result(api_name, query, idx, item, latency, rubric_data):
    """Evaluate a single result using the judge agent"""
    formatted = [{
        'title': item.get('title', ''),
        'url': item.get('url', ''),
        'price': item.get('price')
    }]
    
    payload = {
        'query': query,
        'apis': [{
            'name': api_name,
            'latency_sec': latency,
            'results': formatted
        }],
        'rubric_json': rubric_data
    }
    
    result = await Runner.run(judge_agent, _msg(payload))
    judge_output = result.final_output_as(JudgeReport)
    return (api_name, query, idx, item, judge_output)

for eval_type in evals:
    query = eval_type["query"]
    evaluation_criteria = eval_type["criteria"]

    print(f"Searching for: '{query}'")
    
    exa_task = run_search("exa_search.py", query)
    pws_task = run_search("pws_search.py", query)
    
    (exa_results, exa_latency), (pws_results, pws_latency) = await asyncio.gather(exa_task, pws_task)
    
    print(f"✓ EXA completed in {exa_latency:.2f}s")
    print(f"✓ PWS completed in {pws_latency:.2f}s")

    print("=" * 80)
    print("EVALUATING RESULTS IN PARALLEL")
    print("=" * 80)
    
    eval_tasks = []
    
    # Judge the top 10 results from each API.
    for idx, item in enumerate(exa_results.get('all_results', [])[:10], 1):
        task = evaluate_result('EXA', query, idx, item, exa_latency, rubric_data)
        eval_tasks.append(task)
    
    for idx, item in enumerate(pws_results.get('output', {}).get('matched_products', [])[:10], 1):
        task = evaluate_result('PWS', query, idx, item, pws_latency, rubric_data)
        eval_tasks.append(task)
    
    print(f"Running {len(eval_tasks)} evaluations in parallel...")
    all_evaluations = await asyncio.gather(*eval_tasks)
    # Pause before the next query; await so the event loop is not blocked.
    await asyncio.sleep(10)
    print(f"✓ Completed {len(all_evaluations)} evaluations")
    
    # Build one report row per judged result, grouped by API.
    rows_by_api = {'EXA': [], 'PWS': []}
    for api, _query, idx, item, judge_output in all_evaluations:
        row = {
            'Result #': idx,
            'Title': item.get('title', 'N/A'),
            'URL': item.get('url', 'N/A'),
            'Price': item.get('price', 'N/A'),
        }
        for criterion in judge_output.criteria:
            # Strip control characters and symbols that make awkward column headers.
            clean_name = re.sub(r'[\x00-\x1F\x7F-\x9F≤≥:()]', '', criterion.criteria_name).strip()
            row[f"{clean_name} - Score"] = criterion.score
            row[f"{clean_name} - Reasoning"] = criterion.notes

        rows_by_api[api].append(row)

    exa_df = pd.DataFrame(rows_by_api['EXA'])
    pws_df = pd.DataFrame(rows_by_api['PWS'])

    exa_filename = f'exa_evaluation_report_{query.replace(" ", "_")}.xlsx'
    pws_filename = f'pws_evaluation_report_{query.replace(" ", "_")}.xlsx'

    exa_df.to_excel(exa_filename, index=False, engine='openpyxl')
    pws_df.to_excel(pws_filename, index=False, engine='openpyxl')

    print(f"Created EXA report: {exa_filename} ({len(exa_df)} results)")
    print(f"Created PWS report: {pws_filename} ({len(pws_df)} results)")
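
The loop above launches every judge call for a query at once. If the judge model's rate limits become a concern, one option is to cap the number of calls in flight with an `asyncio.Semaphore`. This is a sketch of that alternative, not what the notebook does; `fake_judge` and `gather_limited` are hypothetical names.

```python
import asyncio

async def fake_judge(idx: int) -> int:
    """Stand-in for evaluate_result; a real call would hit the judge model."""
    await asyncio.sleep(0.01)
    return idx

async def gather_limited(coros, limit: int):
    """Run coroutines concurrently, with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        # Each coroutine waits for a free slot before it starts running.
        async with semaphore:
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# In a notebook cell, use `await gather_limited(...)` instead of asyncio.run.
scores = asyncio.run(gather_limited([fake_judge(i) for i in range(20)], limit=5))
print(scores)
```

Results still come back in submission order, so the downstream row-building code would not need to change.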

            


## Results

After running the evaluations, you'll find:

### Excel Reports
- `exa_evaluation_report_{query}.xlsx` - EXA API results with detailed scoring
- `pws_evaluation_report_{query}.xlsx` - PWS API results with detailed scoring

### Report Contents
Each Excel file includes:
- **Result metadata**: Title, URL, Price
- **Criterion scores**: Individual scores (1-5) for each evaluation criterion
- **Judge reasoning**: Detailed notes explaining each score
- **Comparison**: For a given query, the EXA and PWS files use the same column layout, so they can be compared side by side
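
Each criterion contributes two columns to the report, named after the criterion with control characters and a few symbols stripped. The sketch below reuses the regular expression from the evaluation loop; `report_columns` is a hypothetical helper and the sample criterion name is made up.

```python
import re

def report_columns(criteria_name: str):
    """Return the (score, reasoning) column names for one criterion."""
    clean_name = re.sub(r'[\x00-\x1F\x7F-\x9F≤≥:()]', '', criteria_name).strip()
    return f"{clean_name} - Score", f"{clean_name} - Reasoning"

print(report_columns("Color match (exact):"))
```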

### Next Steps
1. Open the Excel files to review detailed evaluations
2. Compare scores across different criteria
3. Identify which API performs better for specific query types
4. Adjust rubrics in `evals/product_search_rubric.json` for different evaluation needs
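
One way to approach steps 2 and 3 is to average every score column per API. The sketch below works on plain row dictionaries shaped like the report rows; the sample rows and the `mean_scores` helper are hypothetical, and the numbers are illustrative rather than real results.

```python
from statistics import mean

def mean_scores(rows):
    """Average every '<criterion> - Score' column across report rows."""
    by_column = {}
    for row in rows:
        for column, value in row.items():
            if column.endswith(" - Score") and isinstance(value, (int, float)):
                by_column.setdefault(column, []).append(value)
    return {column: round(mean(values), 2) for column, values in by_column.items()}

# Hypothetical rows in the same shape as the notebook's report rows.
exa_rows = [
    {"Result #": 1, "Relevance - Score": 4.0, "Price fit - Score": 3.0},
    {"Result #": 2, "Relevance - Score": 5.0, "Price fit - Score": 2.0},
]
pws_rows = [
    {"Result #": 1, "Relevance - Score": 3.0, "Price fit - Score": 5.0},
    {"Result #": 2, "Relevance - Score": 4.0, "Price fit - Score": 4.0},
]

exa_means, pws_means = mean_scores(exa_rows), mean_scores(pws_rows)
for column in exa_means:
    print(column, "EXA:", exa_means[column], "PWS:", pws_means.get(column))
```

To run this on real output, the rows could be loaded from a report with `pd.read_excel(exa_filename).to_dict("records")`.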

---

💡 **Tip**: Modify the rubric JSON to add new queries or change scoring criteria!
