In [None]:
# install verdict
!uv pip install verdict --system

# This notebook has been run ahead of time, so you can inspect outputs without making
# any API calls. You can set your API key if you want to run the examples yourself.
# %env OPENAI_API_KEY=*************************

# Judge

Refer to [Eugene Yan's excellent blog post](https://eugeneyan.com/writing/llm-evaluators/#key-considerations-before-adopting-an-llm-evaluator) for background on LLM-as-a-Judge. We implement configurable versions of
* Direct Score Judge (outputs a score for a single sample)
  * can also easily be used for reference-based evaluation, as the prompt is completely customizable
* Pairwise Judge (chooses the better of two samples)
  * we extend this to the Best-of-k case with our `BestOfKJudgeUnit`

## `JudgeUnit` (Direct Score Judge) Usage

In [2]:
from verdict import Pipeline
from verdict.schema import Schema
from verdict.common.judge import JudgeUnit

# default scale is DiscreteScale((1, 5))
pipeline = Pipeline() \
    >> JudgeUnit().prompt("""
        Score this on how funny it is.

        {source.joke}
    """)

response, leaf_node_prefixes = pipeline.run(Schema.of(joke="Why did the chicken cross the road? To get to the other side."))
response

{'Pipeline_root.block.unit[DirectScoreJudge]_score': 3}
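
The response is a plain dict keyed by the unit's full path, so reading a field comes down to matching on the key's suffix. A minimal sketch, using the output above as literal data; `get_field` is a hypothetical helper for illustration and is not part of verdict:

```python
def get_field(response: dict, suffix: str):
    """Return the single value whose key ends with `suffix` (hypothetical helper)."""
    matches = [value for key, value in response.items() if key.endswith(suffix)]
    if len(matches) != 1:
        raise KeyError(f"expected exactly one key ending in {suffix!r}, found {len(matches)}")
    return matches[0]


# Literal copy of the response printed above; no API call is made here.
response = {"Pipeline_root.block.unit[DirectScoreJudge]_score": 3}
score = get_field(response, "_score")
```

Raising on zero or multiple matches keeps the helper from silently picking the wrong unit once a pipeline contains more than one judge.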

### Configuring

#### Scale
You can pass an arbitrary `Scale` object using the `scale` argument, for example `BooleanScale()` or `ContinuousScale(0, 1)`.


#### Explanation
Set `explanation=True` to add a required `explanation: str` field **before** the `score` field.

In [3]:
from verdict.scale import BooleanScale

response, _ = (Pipeline() \
    >> JudgeUnit(BooleanScale(), explanation=True).prompt("""
        Is this joke appropriate for young children?

        {source.joke}
    """)
).run(Schema.of(
    joke="Why did the chicken cross the road? To get to the other side."
))

response

{'Pipeline_root.block.unit[DirectScoreJudge]_explanation': 'The joke is a classic and simple one that plays on the expectation of a punchline. It is light-hearted and does not contain any inappropriate content or themes. Therefore, it is appropriate for young children.',
 'Pipeline_root.block.unit[DirectScoreJudge]_score': True}

## `PairwiseJudgeUnit` (Pairwise Judge) Usage

We return the scale index of the chosen option (i.e., with the default `DiscreteScale(['A', 'B'])`, this is either `'A'` or `'B'`). Be aware of [positional bias](verdict.haizelabs.com/docs/cookbook/positional-bias/).

In [5]:
from verdict.common.judge import PairwiseJudgeUnit

response, _ = (
    Pipeline() \
    # default scale is DiscreteScale(['A', 'B'])
    # we can pass a custom scale using the `response_options` parameter
    >> PairwiseJudgeUnit(explanation=True).prompt("""
        Choose the funnier joke

        A: {source.joke_A}
        B: {source.joke_B}
    """)
).run(Schema.of(
    joke_A="Why did the chicken cross the road? To get to the other side.",
    joke_B="Why did the chicken cross the road? Because the other side had better documentation."
))

response

{'Pipeline_root.block.unit[PairwiseJudge]_explanation': 'Joke B is funnier because it adds a humorous twist related to documentation, which is a more modern and relatable topic for many people, especially in tech or office environments.',
 'Pipeline_root.block.unit[PairwiseJudge]_choice': 'B'}
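
One common way to probe the positional-bias caveat above is to ask the judge twice with the two options swapped and keep the verdict only if both orderings agree. The sketch below is plain Python under stated assumptions: `judge` is a hypothetical callable that takes the texts in positions A and B and returns `'A'` or `'B'`, and the two stub judges stand in for a real pipeline call.

```python
from typing import Callable, Optional


def consistent_choice(
    judge: Callable[[str, str], str], first: str, second: str
) -> Optional[str]:
    """Return the winning text if both orderings agree, otherwise None."""
    forward = judge(first, second)   # 'A' refers to `first`, 'B' to `second`
    backward = judge(second, first)  # 'A' refers to `second`, 'B' to `first`
    forward_winner = first if forward == "A" else second
    backward_winner = second if backward == "A" else first
    return forward_winner if forward_winner == backward_winner else None


# Stub judges (assumptions for illustration, not real model calls).
def always_a(a: str, b: str) -> str:
    return "A"  # purely position-biased: ignores the content


def prefers_longer(a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"  # content-based rule


short_joke = "To get to the other side."
long_joke = "Because the other side had better documentation."

biased_result = consistent_choice(always_a, short_joke, long_joke)
stable_result = consistent_choice(prefers_longer, short_joke, long_joke)
```

The position-biased stub yields `None` because its pick flips with the ordering, while the content-based stub returns the same joke both times.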

If we pass the candidates as an `options` list in the input `Schema`, we can also pass `original=True` to return the original text of the selected option instead of its index.

In [6]:
from verdict.common.judge import PairwiseJudgeUnit

response, _ = (
    Pipeline() \
    >> PairwiseJudgeUnit(explanation=True, original=True).prompt("""
        Choose the funnier joke

        A: {input.options[0]}
        B: {input.options[1]}
    """)
).run(Schema.of(
    options=[
        "Why did the chicken cross the road? To get to the other side.",
        "Why did the chicken cross the road? Because the other side had better documentation."
    ]
))

response

{'Pipeline_root.block.unit[PairwiseJudge]_explanation': 'Joke B is funnier because it incorporates a humorous twist related to documentation, which can be relatable and amusing, especially in a tech or work context.',
 'Pipeline_root.block.unit[PairwiseJudge]_chosen': 'Why did the chicken cross the road? Because the other side had better documentation.'}
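
Conceptually, `original=True` swaps the chosen label for the option at the same position. A rough sketch of that mapping in plain Python, assuming labels are consecutive uppercase letters starting at `'A'`; `label_to_option` is a hypothetical helper and makes no claim about how verdict implements this internally:

```python
def label_to_option(label: str, options: list[str]) -> str:
    """Map a letter label ('A', 'B', ...) to the option at that position."""
    index = ord(label) - ord("A")
    if not 0 <= index < len(options):
        raise ValueError(f"label {label!r} is out of range for {len(options)} options")
    return options[index]


options = [
    "Why did the chicken cross the road? To get to the other side.",
    "Why did the chicken cross the road? Because the other side had better documentation.",
]
chosen = label_to_option("B", options)
```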

## `BestOfKJudgeUnit` (Multi-Option Judge) Usage

Pass an arbitrary number `k` of options. `PairwiseJudge` is the special case of this implementation with `k=2`.

In [12]:
from verdict.common.judge import BestOfKJudgeUnit

response, _ = (
    Pipeline() \
    # default scale is [A, ..., kth letter]
    >> BestOfKJudgeUnit(k=3, explanation=True, original=True).prompt("""
        Choose the funniest joke. Respond with the letter index below of the joke that is funniest.

        A: {input.options[0]}
        B: {input.options[1]}
        C: {input.options[2]}
    """)
).run(Schema.of(
    options=[
        "Why did the chicken cross the road? To get to the other side.",
        "Why did the chicken cross the road? Because the other side had better documentation.",
        "Why did the chicken cross the road? I don't know, ask the chicken."
    ]
))

response

{'Pipeline_root.block.unit[BestOfKJudge]_explanation': 'Joke B is the funniest because it adds a twist related to documentation, which is a humorous take on the classic joke.',
 'Pipeline_root.block.unit[BestOfKJudge]_chosen': 'Why did the chicken cross the road? Because the other side had better documentation.'}

Dropping the `explanation` allows us to use a token probability extractor and select the choice with the highest probability.

In [13]:
from verdict.extractor import ArgmaxScoreExtractor

response, _ = (
    Pipeline() \
    >> BestOfKJudgeUnit(k=3, original=True).prompt("""
        Choose the funniest joke. Respond with the index below of the joke that is funniest.

        1: {input.options[0]}
        2: {input.options[1]}
        3: {input.options[2]}
    """).extract(ArgmaxScoreExtractor())
).run(Schema.of(
    options=[
        "Why did the chicken cross the road? To get to the other side.",
        "Why did the chicken cross the road? Because the other side had better documentation.",
        "Why did the chicken cross the road? I don't know, ask the chicken."
    ]
))

response

{'Pipeline_root.block.unit[BestOfKJudge]_chosen': 'Why did the chicken cross the road? Because the other side had better documentation.'}
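
The argmax step itself is simple once a per-label probability distribution is available. A minimal sketch of the idea in plain Python; the probabilities are made-up illustrative values, and `argmax_choice` is a hypothetical helper rather than the `ArgmaxScoreExtractor` implementation:

```python
def argmax_choice(distribution: dict[str, float]) -> str:
    """Return the label with the highest probability."""
    return max(distribution, key=distribution.get)


# Illustrative probabilities only (not model output).
distribution = {"A": 0.05, "B": 0.80, "C": 0.15}
best_label = argmax_choice(distribution)
```

Unlike sampling, this is deterministic for a given distribution, which is why it pairs naturally with a judge that emits a single label and no explanation.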

# Uncertainty Quantification with Extractors

Refer to the [Extractor documentation]() for more context.

In [16]:
from verdict.extractor import TokenProbabilityExtractor

response, _ = (
    Pipeline() \
    >> BestOfKJudgeUnit(k=3).prompt("""
        Choose the funniest joke. Respond with the index below of the joke that is funniest.

        1: {input.options[0]}
        2: {input.options[1]}
        3: {input.options[2]}
    """).extract(TokenProbabilityExtractor())
).run(Schema.of(
    options=[
        "Why did the chicken cross the road? To get to the other side.",
        "Why did the chicken cross the road? Because the other side had better documentation.",
        "Why did the chicken cross the road? I don't know, ask the chicken.",
    ]
))

response['Pipeline_root.block.unit[BestOfKJudge]_distribution']

{'A': 4.296800258297125e-07, 'B': 0.8519524434689586, 'C': 0.14804712685101548}

So `gpt-4o-mini` puts ~5.8x as much probability on the second joke as on the third (0.852 vs. 0.148).
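
The comparison can be read straight off the distribution printed above. The sketch below recomputes the ratio between the second and third options and adds a normalized entropy as one possible single-number uncertainty summary; the entropy measure is an addition for illustration, not something the notebook or verdict prescribes.

```python
import math

# Literal copy of the distribution printed above.
distribution = {
    "A": 4.296800258297125e-07,
    "B": 0.8519524434689586,
    "C": 0.14804712685101548,
}

# How much more probability mass the second joke gets than the third (about 5.75).
ratio_b_to_c = distribution["B"] / distribution["C"]

# Shannon entropy in nats, divided by log(k) so that 0 means fully certain
# and 1 means uniform over the k options.
entropy = -sum(p * math.log(p) for p in distribution.values() if p > 0)
normalized_entropy = entropy / math.log(len(distribution))
```

A normalized entropy near 0 means the judge is nearly certain, while values near 1 mean the options are close to tied and the verdict deserves less weight.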