## Using an LLM as a judge to assess accuracy

This section follows the methodology described in Anthropic's engineering write-up [here](https://www.anthropic.com/engineering/multi-agent-research-system):
"""
 We used an LLM judge that evaluated each output against criteria in a rubric: factual accuracy (do claims match sources?), citation accuracy (do the cited sources match the claims?), completeness (are all requested aspects covered?), source quality (did it use primary sources over lower-quality secondary sources?), and tool efficiency (did it use the right tools a reasonable number of times?). We experimented with multiple judges to evaluate each component, but found that a single LLM call with a single prompt outputting scores from 0.0-1.0 and a pass-fail grade was the most consistent and aligned with human judgements. This method was especially effective when the eval test cases did have a clear answer, and we could use the LLM judge to simply check if the answer was correct (i.e. did it accurately list the pharma companies with the top 3 largest R&D budgets?).
"""

In [8]:
# Simulated data: each question appears twice, once with a grounded response and once with an ungrounded one
data = [
    {"question": "Who was the first president of the United States?", "context": "The first president of the United States was George Washington, inaugurated in 1789.", "response": "George Washington", "label": "grounded"},
    {"question": "Who was the first president of the United States?", "context": "The first president of the United States was George Washington, inaugurated in 1789.", "response": "Abraham Lincoln, who was inaugurated in 1861.", "label": "ungrounded"},
    {"question": "At what temperature does water boil at sea level?", "context": "Water boils at 100°C at sea level under standard atmospheric pressure.", "response": "100°C", "label": "grounded"},
    {"question": "At what temperature does water boil at sea level?", "context": "Water boils at 100°C at sea level under standard atmospheric pressure.", "response": "Water boils at 150°C at sea level.", "label": "ungrounded"},
    {"question": "What is 2 + 2?", "context": "The sum of 2 and 2 is 4.", "response": "2 plus 2 equals 4.", "label": "grounded"},
    {"question": "What is 2 + 2?", "context": "The sum of 2 and 2 is 4.", "response": "2 plus 2 equals 5.", "label": "ungrounded"},
    {"question": "Who are the main characters in Romeo and Juliet?", "context": "In Shakespeare’s play Romeo and Juliet, the two main characters are Romeo Montague and Juliet Capulet.", "response": "Romeo Montague and Juliet Capulet.", "label": "grounded"},
    {"question": "Who are the main characters in Romeo and Juliet?", "context": "In Shakespeare’s play Romeo and Juliet, the two main characters are Romeo Montague and Juliet Capulet.", "response": "The play’s main characters are Hamlet and Ophelia.", "label": "ungrounded"}
]

for x in data:
    print(x.get("question"))
    print(x.get("context"))
    print(x.get("response"))
    print(x.get("label"))

Who was the first president of the United States?
The first president of the United States was George Washington, inaugurated in 1789.
George Washington
grounded
Who was the first president of the United States?
The first president of the United States was George Washington, inaugurated in 1789.
Abraham Lincoln, who was inaugurated in 1861.
ungrounded
At what temperature does water boil at sea level?
Water boils at 100°C at sea level under standard atmospheric pressure.
100°C
grounded
At what temperature does water boil at sea level?
Water boils at 100°C at sea level under standard atmospheric pressure.
Water boils at 150°C at sea level.
ungrounded
What is 2 + 2?
The sum of 2 and 2 is 4.
2 plus 2 equals 4.
grounded
What is 2 + 2?
The sum of 2 and 2 is 4.
2 plus 2 equals 5.
ungrounded
Who are the main characters in Romeo and Juliet?
In Shakespeare’s play Romeo and Juliet, the two main characters are Romeo Montague and Juliet Capulet.
Romeo Montague and Juliet Capulet.
grounded
Who are t

In [10]:
from openai import OpenAI
import json
from dotenv import load_dotenv
load_dotenv()
client = OpenAI()

system_prompt = """You are an impartial evaluator. 
Judge ONLY factual accuracy of the ANSWER against the provided CONTEXT. 
Use no outside knowledge. Return STRICT JSON per schema. 
No extra text. No chain-of-thought."""

user_prompt_template = """QUESTION:
{question}

ANSWER:
{answer}

CONTEXT:
{context}

SCORING & DECISION:
- score = factual accuracy, 0.0–1.0 (one decimal place).
- pass = score >= {pass_threshold}

OUTPUT JSON SCHEMA:
{{
  "score": 0.0,
  "pass": false
}}

RETURN ONLY JSON.
"""

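def judge_factual(question, answer, sources, pass_threshold=0.8):
    """Ask the judge model to score `answer` against `sources`; return the parsed JSON verdict."""
    # `sources` fills the CONTEXT slot of the prompt template.
    user_prompt = user_prompt_template.format(
        question=question,
        answer=answer,
        context=sources,
        pass_threshold=pass_threshold,
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # keep the judge's output as repeatable as possible
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    out = resp.choices[0].message.content.strip()
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(out)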

for x in data:
    print(judge_factual(question=x.get("question"),
                        answer=x.get("response"),
                        sources=x.get("context")))

{'score': 1.0, 'pass': True}
{'score': 0.0, 'pass': False}
{'score': 1.0, 'pass': True}
{'score': 0.0, 'pass': False}
{'score': 1.0, 'pass': True}
{'score': 0.0, 'pass': False}
{'score': 1.0, 'pass': True}
{'score': 0.0, 'pass': False}
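
To make the comparison with the hand-assigned labels explicit, the following sketch computes how often the judge's `pass` flag agrees with the `label` field, treating `"grounded"` as the expected pass. The helper name `agreement` is hypothetical; the labels and verdicts are copied from the data and output shown above so that the snippet runs without an API call.

```python
def agreement(labels, verdicts):
    """Return the fraction of cases where the judge's pass flag matches the label."""
    if len(labels) != len(verdicts):
        raise ValueError("labels and verdicts must have the same length")
    matches = [
        (label == "grounded") == bool(verdict["pass"])
        for label, verdict in zip(labels, verdicts)
    ]
    return sum(matches) / len(matches)


# Labels from the simulated data: grounded and ungrounded responses alternate.
labels = ["grounded", "ungrounded"] * 4

# Judge verdicts copied from the output printed above.
verdicts = [{"score": 1.0, "pass": True}, {"score": 0.0, "pass": False}] * 4

print(agreement(labels, verdicts))  # prints: 1.0
```

On these eight simulated cases the judge agrees with every label. The examples are deliberately clear-cut, so this says little about how it would behave on borderline answers.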
