# Align a Voice LLM Judge

In this notebook we'll learn how to align a multi-modal LLM judge that can take in audio and guide it towards our preference for what a character from the GTA-VI should sound like.

### Who Validates the Validators
This human-LLM collaboration is inspired by the [Who Validates the Validators paper](https://arxiv.org/pdf/2404.12272), which focussed on the minimum about of labelling from a human to get the maximum out of aligning a LLM.

The repo for this notebook can be found here: https://github.com/morganmcg1/voice-judge

# Setup

In [None]:
# comment out if you already have the repo
!git clone https://github.com/morganmcg1/voice-judge.git && cp -r voice-judge/* .
!uv pip install -r pyproject.toml -q

## Imports

In [1]:
from pprint import pprint
from pydantic import BaseModel, Field
import weave

import ipywidgets as widgets
from IPython.display import display

from audio_utils import AudioRanker, wave_read_to_wav_bytes
from judge import (
    JudgeRanking,
    run_speech_llm,
    update_pairwise_comparison_history,
    # update_pairwise_comparison_history_from_feedback
)
from preference_learner import PreferenceLearner, USER_CONTEXT_PROMPT

# Download data

## Weave login
You'll need a free Weights & Biases account to log in to Weave and run this notebook, sign up here: www.wandb.ai/site

In [2]:
weave_client = weave.init("wandb-voice-ai/voice-judge")

[36m[1mweave[0m: Logged in as Weights & Biases user: morgan.
[36m[1mweave[0m: View Weave data at https://wandb.ai/wandb-voice-ai/voice-judge/weave


In [3]:
# SPEECH_AUDIO_DATASET_URI = "weave:///wandb-voice-ai/voice-judge/object/generated_speech_audio:v1"
SPEECH_AUDIO_DATASET_URI = "weave:///wandb-voice-ai/voice-judge/object/generated_speech_audio_test:v1"
N_SAMPLES_TO_RANK = 16
N_SAMPLES_TO_EVAL = 3
PREFERENCE_LEARNER_MODEL =  "gemini-2.0-flash" #"gemini-2.5-pro-preview-05-06",  # "gemini-2.0-flash"
JUDGE_MODEL =  "gemini-2.0-flash" # "gemini-2.5-pro-preview-05-06"
ANALYZER_MODEL = "gemini-2.0-flash" #"gemini-2.5-pro-preview-05-06"

print("Downloading speech samples from Weave...")
ds_ref = weave.ref(SPEECH_AUDIO_DATASET_URI).get()
speech_samples = list(ds_ref.rows)
print(f"{len(speech_samples)} speech samples downloaded from Weave")

samples_to_rank = {}
for i in range(N_SAMPLES_TO_RANK):
    samples_to_rank[speech_samples[i]["voice_instructions_id"]] = {
        "audio": speech_samples[i]["audio"],
        "audio_bytes": wave_read_to_wav_bytes(speech_samples[i]["audio"]),
        "instructions": speech_samples[i]["voice_instructions"],
        "short_hash": None,
        "pairwise_comparison_history": {},
    }

ids_to_rank = list(samples_to_rank.keys())

samples_to_eval = {}
for sample in speech_samples[-N_SAMPLES_TO_EVAL:]:
    samples_to_eval[sample["voice_instructions_id"]] = {
        "audio": sample["audio"],
        "audio_bytes": wave_read_to_wav_bytes(sample["audio"]),
        "instructions": sample["voice_instructions"],
    }
ids_to_eval = list(samples_to_eval.keys())

Downloading speech samples from Weave...
20 speech samples downloaded from Weave


# Round 1 - Rank and Update Judge
## Pick first samples to rank

In [4]:
# Register our samples for ranking
@weave.op
def rank_audio(sample_1: dict, sample_2: dict) -> None:
    pass

sample_1_id = ids_to_rank[0]
sample_2_id = ids_to_rank[1]

sample_1 = {"id": sample_1_id, **samples_to_rank[sample_1_id]}
sample_2 = {"id": sample_2_id, **samples_to_rank[sample_2_id]}

_, target_call = rank_audio.call(sample_1, sample_2)

In [5]:
# Init Ranker
ranker = AudioRanker(
    [
        {
            "id": sample_1_id,
            "audio": samples_to_rank[sample_1_id]["audio"],
            "original_input_order": 1,
        },
        {
            "id": sample_2_id,
            # "audio": samples_to_rank[sample_2_id]["audio_bytes"],
            "audio": samples_to_rank[sample_2_id]["audio"],
            "original_input_order": 2,
        },
    ],
    weave_client=weave_client,
    target_call=target_call,
    image_path="assets/gta-vi_guy_in_front_of_truck_cropped.jpg",
    mode="gradio"
)

[36m[1mweave[0m: 📦 Published to https://wandb.ai/wandb-voice-ai/voice-judge/weave/objects/AudioRanker/versions/ym5a2rKViLvZtIjKh5VMKYzRxpZOmGhGDTLgSi8hDN8


## Run the ranker
First we'll run our ranker UI and rank the 2 voice samples selected

In [7]:
interface = ranker.create_gradio_interface()
interface.launch()

* Running on local URL:  http://127.0.0.1:7862
* To create a public link, set `share=True` in `launch()`.




✅ BinaryVoiceRank applied successfully to weave call. Result: ApplyScorerSuccess(result={'preferred_sample_id': 'weary_georgia_mutter_20250527_1823', 'sample_one_preferred': False, 'sample_two_preferred': True, 'ranking_timestamp': '2025-05-27_19-35'}, score_call=Call(_op_name=<Future at 0x128454fd0 state=pending>, trace_id='01971307-3659-74f2-938b-58bd012fdc37', project_id='wandb-voice-ai/voice-judge', parent_id=None, inputs={'self': ObjectRef(entity='wandb-voice-ai', project='voice-judge', name='BinaryVoiceRank', _digest=<Future at 0x12806dd10 state=pending>, _extra=()), 'output': None}, id='01971307-3659-74f2-938b-58ccbb75cade', output={'preferred_sample_id': 'weary_georgia_mutter_20250527_1823', 'sample_one_preferred': False, 'sample_two_preferred': True, 'ranking_timestamp': '2025-05-27_19-35'}, exception=None, summary={'status_counts': {<TraceStatus.SUCCESS: 'success'>: 1, <TraceStatus.ERROR: 'error'>: 0}}, _display_name=None, attributes=AttributesDict({'weave': {'python': {'type

### Download feedback from Weave

In [8]:
%%capture
updated_target_call = weave_client.get_call(target_call.id)
updated_target_call.feedback

In [9]:
assert updated_target_call.feedback is not None, "No feedback found for target call"
for f in updated_target_call.feedback.feedbacks.items:
    if "output" in f.payload:
        ranking = f.payload["output"]
        print("Rankings retrieved from Weave")
        break

rank1 = samples_to_rank[ranking["preferred_sample_id"]]

if ranking["preferred_sample_id"] == sample_1_id:
    rank2 = samples_to_rank[sample_2_id]
    rank2_id = sample_2_id
else:
    rank2 = samples_to_rank[sample_1_id]
    rank2_id = sample_1_id

res = [
    {
        "rank" : 1,
        "id" : ranking["preferred_sample_id"],
        "original_input_order" : 1 if ranking["preferred_sample_id"] == sample_1_id else 2
    },
    {
        "rank" : 2,
        "id" : rank2_id,
        "original_input_order" : 1 if rank2_id== sample_1_id else 2
    }
]

final_rankings = {"rankings": res, 
                  "completed_at": ranking["ranking_timestamp"],
                  "preferred_id": ranking["preferred_sample_id"],
                  "rejected_id": rank2_id
                  }
final_rankings

samples_to_rank = update_pairwise_comparison_history(samples_to_rank, final_rankings)

Rankings retrieved from Weave


In [10]:
sample_1_id, samples_to_rank[sample_1_id]["pairwise_comparison_history"]

('surprised_french_veteran_20250527_1823',
 {'2025-05-27_19-35': {'competitor_id': 'weary_georgia_mutter_20250527_1823',
   'sample_rank_in_this_pair': 2}})

## Anlysze Preferences

Based on the ranking selected, a LLM will now try and identify the differences in the preferred vs the rejected voice sample.

In [12]:
learner = PreferenceLearner(model_name=PREFERENCE_LEARNER_MODEL)
learner.patterns

{'strong': [], 'emerging': []}

In [13]:
await learner.update(final_rankings, samples_to_rank)
ranking_patterns = learner.patterns

Updating comparisons...
Running pattern update...


Pattern update result:
('reasoning: The preferred voice in the recent comparison exhibits a slight '
 'variation in speech rate and intonation, making it sound more casual and '
 'engaging. The rejected voice, while clear, sounds somewhat monotonous and '
 'less expressive. This suggests a preference for voices with more dynamic '
 'speech patterns.')
'strong: []'
("emerging: ['Preference for more dynamic and engaging speech patterns.', "
 "'Dispreference for monotonous and less expressive voices.', 'Slightly slower "
 "speaking rate is preferred.', 'Casual tone is preferred']")


## The Judge and The Analyzer
Below is a seed judge prompt as well as some other prompts for the analyzer

#### Basic judge prompts

In [16]:
BASIC_JUDGE_SYSTEM_INSTRUCTION = f"""Assess the generated voices provided. The task is to assess the appropriateness \
of the audio voice sample for the given task. If multiple voice samples are provided, rank them in order of preference.

Here is some context about the task:

{USER_CONTEXT_PROMPT}

"""

seed_judge_prompt = """Based on the following criteria, the task is to assess the appropriateness \
of the audio voice sample for a video game character in his late 50's. He should sound like a man in his late 50's \
and be a little on the wild side.

## Voice characteristics to consider
Consider the following aspects of the voice:
- style
- tone
- accent
- speed
- volume
- pitch
- intonation
and more
"""

judge_prompt_postfix = """## Assessment
- If a single voice sample is provided, assess it according to the above criteria and return a bool of whether \
it is appropriate.
- If multiple voice samples are provided, rank them in order of preference according to the above criteria.

## Voice samples
Below are the voice samples to assess:

"""

#### Analyzer prompts

In [17]:
ANALYZER_SYSTEM_INSTRUCTION = """The task is to help align a LLLM Judge towards user preferences. This will \
be acheived by analysing the current prompt as well as new preference data derived from the user, followed by \
making recommendations for how to update the prompt to align more with the user's preferences."""

judge_prompt_analyser_prompt = """The task is to optimze a given LLM judge prompt to align more with a human \
rater's preferences.

## Current prompt
Below is the current judge prompt: 

<current_judge_prompt>
{current_judge_prompt}
</current_judge_prompt>

## Rater preferences
Below are patterns that have been observed from a human rater's preferences when doing pairwise comparisons of \
voice samples.

<rater_preferences>
{ranking_patterns}
</rater_preferences>

## Analysis
Critially analyse the <current_judge_prompt> and <rater_preferences> and make recommendations for whether or not \
the <current_judge_prompt> needs to be updated. You are not required to make edits to the <current_judge_prompt> \
if you fell it is already aligned with the <rater_preferences>.

## Output
Based on your analysis, please output a full updated judge prompt below. Do not include any placeholder text \
for the input variables. Just focus on describing what makes a good and bad voice sample purely \
based on the <rater_preferences>. Do not include mentions of the outputs needed such as scores or rankings. \
Just describe what makes a good and bad voice sample purely based on the <rater_preferences>.

Output your new judge prompt using markdown formatting.

"""

class JudgePromptAnalysis(BaseModel):
    reasoning: str = Field(description="A detailed explanation of your analysis of the <current_judge_prompt>, \
<rater_preferences> and what edits, if any, are needed to align the <current_judge_prompt> with the rater's preferences.")
    updated_judge_prompt: str = Field(description="The updated judge prompt based on the analysis, focussed just on the \
characteristics of a good and bad voice sample.")

### Run the Analyzer to update the Judge

With these preferences, we'll update a basic "seed" LLM Judge prompt to try and make it less generic and more aligned with our personal preferences.

#### Pass the seed judge prompt plus preference analysis

In [18]:
@weave.op
async def run_judge_prompt_analyser(judge_prompt_analyser_prompt: str, analyzer_model: str) -> JudgePromptAnalysis:
    # Example with file path
    result = await run_speech_llm(
        system_instruction=ANALYZER_SYSTEM_INSTRUCTION,
        prompt=judge_prompt_analyser_prompt,
        model_name=analyzer_model,
        temperature=1.0,
        response_model=JudgePromptAnalysis,
    )
    return result

In [19]:
def format_patterns(patterns):
    strong_str = ""
    for pat in patterns.get('strong', []):
        strong_str += f"  - {pat}\n"
    emerging_str = ""
    for pat in patterns.get('emerging', []):
        emerging_str += f"  - {pat}\n"
    return f"Strong patterns: {strong_str}\n\nEmerging patterns:\n{emerging_str}\n"

In [20]:
format_patterns(ranking_patterns)

'Strong patterns: \n\nEmerging patterns:\n  - Preference for more dynamic and engaging speech patterns.\n  - Dispreference for monotonous and less expressive voices.\n  - Slightly slower speaking rate is preferred.\n  - Casual tone is preferred\n\n'

In [21]:
patterns_str =format_patterns(ranking_patterns)
judge_prompt_analyser_prompt = judge_prompt_analyser_prompt.format(
    current_judge_prompt=seed_judge_prompt, 
    ranking_patterns=patterns_str
)

analyzer_result = await run_judge_prompt_analyser(
    judge_prompt_analyser_prompt=judge_prompt_analyser_prompt,
    analyzer_model=ANALYZER_MODEL
)

print("**Reasoning:**")
pprint(analyzer_result.reasoning)
print()
pprint("**Updated Judge Prompt:**")
pprint(analyzer_result.updated_judge_prompt)

round_1_judge_prompt = analyzer_result.updated_judge_prompt

**Reasoning:**
('The current prompt provides a solid foundation by outlining the task and '
 'listing voice characteristics. However, it lacks specific guidance '
 "reflecting the rater's preferences for dynamic, engaging, and slightly "
 'slower speech with a casual tone. The prompt also does not define what is '
 'meant by a voice sounding "a little on the wild side". To better align with '
 "the rater's preferences, the prompt should be updated to emphasize the "
 'importance of these factors when assessing voice samples. It should '
 'encourage the judge to favor voices with varied intonation, expressive '
 'delivery, and a relaxed, unhurried pace. The term "wild side" should be '
 'better defined.')

'**Updated Judge Prompt:**'
('Based on the following criteria, the task is to assess the appropriateness '
 "of the audio voice sample for a video game character in his late 50's. The "
 "ideal voice should sound like a man in his late 50's, with a casual and "
 'expressive tone. The 

## Run the updated Judge with the Round 1 judge prompt

In [22]:
@weave.op
async def run_speech_judge(new_judge_prompt, speech_samples_to_eval, judge_model):
    # Example with file path
    result = await run_speech_llm(
        system_instruction=BASIC_JUDGE_SYSTEM_INSTRUCTION,
        prompt=new_judge_prompt,
        model_name=judge_model,
        temperature=0.1,
        response_model=JudgeRanking,
        audio_data=[
            speech_samples_to_eval[0]["audio_bytes"],
            speech_samples_to_eval[1]["audio_bytes"],
            speech_samples_to_eval[2]["audio_bytes"],
        ],
        initial_audio_parts_prompt="\n\nvoice_1:\n",
        audio_parts_prompt_divider="\n\nvoice_{input_order}:\n",
    )
    return result

In [23]:
new_judge_prompt = round_1_judge_prompt + judge_prompt_postfix
speech_samples_to_eval = []
for eval_id in ids_to_eval:
    speech_samples_to_eval.append(samples_to_eval[eval_id])

# Run new judge
result = await run_speech_judge(
    new_judge_prompt=new_judge_prompt,
    speech_samples_to_eval=speech_samples_to_eval,
    judge_model=JUDGE_MODEL
)
pprint(result.thinking)
pprint(result.ranking)
print()

('Voice 3 is the most appropriate because it has the most dynamic and engaging '
 'style. The delivery is lively and captivating, demonstrating a range of '
 'emotions. Voice 1 is second most appropriate because it has a casual '
 'delivery and is slightly slower. Voice 2 is the least appropriate because it '
 'is the most monotone.')
['voice_3', 'voice_1', 'voice_2']



# Round 2 - Rank and Update Judge

## Get next samples to rank

In [24]:
# Register our samples for ranking
sample_1_id = ids_to_rank[2]
sample_2_id = ids_to_rank[3]

sample_1 = {"id": sample_1_id, **samples_to_rank[sample_1_id]}
sample_2 = {"id": sample_2_id, **samples_to_rank[sample_2_id]}

_, target_call = rank_audio.call(sample_1, sample_2)

# Init Ranker
ranker = AudioRanker(
    [
        {
            "id": sample_1_id,
            "audio": samples_to_rank[sample_1_id]["audio"],
            "original_input_order": 1,
        },
        {
            "id": sample_2_id,
            # "audio": samples_to_rank[sample_2_id]["audio_bytes"],
            "audio": samples_to_rank[sample_2_id]["audio"],
            "original_input_order": 2,
        },
    ],
    weave_client=weave_client,
    target_call=target_call,
    image_path="assets/gta-vi_guy_in_front_of_truck_cropped.jpg",
    mode="gradio"
)

[36m[1mweave[0m: 📦 Published to https://wandb.ai/wandb-voice-ai/voice-judge/weave/objects/AudioRanker/versions/ym5a2rKViLvZtIjKh5VMKYzRxpZOmGhGDTLgSi8hDN8



## Run the ranker
First we'll run our ranker UI and rank the 2 voice samples selected

In [25]:
interface = ranker.create_gradio_interface()
interface.launch()

* Running on local URL:  http://127.0.0.1:7863
* To create a public link, set `share=True` in `launch()`.




✅ BinaryVoiceRank applied successfully to weave call. Result: ApplyScorerSuccess(result={'preferred_sample_id': 'posh_sarcastic_fast_whisper_20250527_1823', 'sample_one_preferred': False, 'sample_two_preferred': True, 'ranking_timestamp': '2025-05-27_19-37'}, score_call=Call(_op_name=<Future at 0x1285cb210 state=finished returned str>, trace_id='01971309-7522-7ef3-80ba-3424ecae03be', project_id='wandb-voice-ai/voice-judge', parent_id=None, inputs={'self': ObjectRef(entity='wandb-voice-ai', project='voice-judge', name='BinaryVoiceRank', _digest=<Future at 0x1288b9ed0 state=pending>, _extra=()), 'output': None}, id='01971309-7522-7ef3-80ba-343a9d5d55c8', output={'preferred_sample_id': 'posh_sarcastic_fast_whisper_20250527_1823', 'sample_one_preferred': False, 'sample_two_preferred': True, 'ranking_timestamp': '2025-05-27_19-37'}, exception=None, summary={'status_counts': {<TraceStatus.SUCCESS: 'success'>: 1, <TraceStatus.ERROR: 'error'>: 0}}, _display_name=None, attributes=AttributesDict

In [26]:
%%capture
updated_target_call = weave_client.get_call(target_call.id)
updated_target_call.feedback

In [27]:
for f in updated_target_call.feedback.feedbacks.items:
    if "output" in f.payload:
        ranking = f.payload["output"]
        print("Rankings retrieved from Weave")
        break

rank1 = samples_to_rank[ranking["preferred_sample_id"]]

if ranking["preferred_sample_id"] == sample_1_id:
    rank2 = samples_to_rank[sample_2_id]
    rank2_id = sample_2_id
else:
    rank2 = samples_to_rank[sample_1_id]
    rank2_id = sample_1_id

res = [
    {
        "rank" : 1,
        "id" : ranking["preferred_sample_id"],
        "original_input_order" : 1 if ranking["preferred_sample_id"] == sample_1_id else 2
    },
    {
        "rank" : 2,
        "id" : rank2_id,
        "original_input_order" : 1 if rank2_id== sample_1_id else 2
    }
]

final_rankings = {"rankings": res, 
                  "completed_at": ranking["ranking_timestamp"],
                  "preferred_id": ranking["preferred_sample_id"],
                  "rejected_id": rank2_id
                  }
final_rankings

samples_to_rank = update_pairwise_comparison_history(samples_to_rank, final_rankings)

Rankings retrieved from Weave


In [29]:
sample_1_id, samples_to_rank[sample_1_id]["pairwise_comparison_history"]

('hushed_georgia_rush_20250527_1823',
 {'2025-05-27_19-37': {'competitor_id': 'posh_sarcastic_fast_whisper_20250527_1823',
   'sample_rank_in_this_pair': 2}})

## Anlysze Preferences

Based on the ranking selected, a LLM will now try and identify the differences in the preferred vs the rejected voice sample.

In [30]:
learner.patterns

{'strong': [],
 'emerging': ['Preference for more dynamic and engaging speech patterns.',
  'Dispreference for monotonous and less expressive voices.',
  'Slightly slower speaking rate is preferred.',
  'Casual tone is preferred']}

In [31]:
learner.comparisons

[['<ranking_2025-05-27_19-35>\n',
  'Pairwise comparison timestamp: 2025-05-27_19-35\n',
  'Preferred Voice:',
  Part(video_metadata=None, thought=None, inline_data=Blob(display_name=None, data=b'RIFF\x04\xc0\x05\x00WAVEfmt \x10\x00\x00\x00\x01\x00\x01\x00\xc0]\x00\x00\x80\xbb\x00\x00\x02\x00\x10\x00data\xe0\xbf\x05\x00\x07\x00\x0c\x00\x05\x00\x04\x00\x02\x00\t\x00\x08\x00\x06\x00\x04\x00\x00\x00\x01\x00\x01\x00\x01\x00\x05\x00\x07\x00\x08\x00\x04\x00\x07\x00\x00\x00\x08\x00\x07\x00\x07\x00\x08\x00\x07\x00\x08\x00\x03\x00\x07\x00\x03\x00\x04\x00\x03\x00\x02\x00\x02\x00\x01\x00\xfe\xff\xfd\xff\xfc\xff\xfa\xff\xff\xff\xfc\xff\xf8\xff\xf1\xff\xf9\xff\xf2\xff\xfb\xff\xf2\xff\xf2\xff\xec\xff\xf2\xff\xed\xff\xe6\xff\xf0\xff\xe9\xff\xee\xff\xe4\xff\xe8\xff\xe7\xff\xe7\xff\xed\xff\xe7\xff\xe6\xff\xdf\xff\xe1\xff\xde\xff\xde\xff\xe1\xff\xde\xff\xdc\xff\xd7\xff\xd2\xff\xd7\xff\xd6\xff\xd4\xff\xd6\xff\xda\xff\xd7\xff\xd2\xff\xd1\xff\xd4\xff\xd2\xff\xd1\xff\xd3\xff\xd1\xff\xd6\xff\xd3\xff\xd2\xff\

In [32]:
await learner.update(final_rankings, samples_to_rank)
ranking_patterns = learner.patterns

Updating comparisons...
Running pattern update...


Pattern update result:
('reasoning: The recent comparisons reinforce the preference for a more '
 'casual, dynamic, and slightly slower speaking rate. The preferred voices in '
 'both comparisons have a more relaxed and natural tone compared to the '
 'rejected voices, which sound somewhat monotonous. No contradictions were '
 'observed. The slight preference for a more relaxed tone is solidified.')
'strong: []'
("emerging: ['Preference for more dynamic and engaging speech patterns.', "
 "'Dispreference for monotonous and less expressive voices.', 'Slightly slower "
 "speaking rate is preferred.', 'Casual tone is preferred']")


In [34]:
ranking_patterns

{'strong': [],
 'emerging': ['Preference for more dynamic and engaging speech patterns.',
  'Dispreference for monotonous and less expressive voices.',
  'Slightly slower speaking rate is preferred.',
  'Casual tone is preferred']}

## Update the Judge

With these preferences, we'll update a basic "seed" LLM Judge prompt to try and make it less generic and more aligned with our personal preferences.

### Pass Round 1 judge prompt plus Round 2 preference analysis

In [36]:
pprint(round_1_judge_prompt)

('Based on the following criteria, the task is to assess the appropriateness '
 "of the audio voice sample for a video game character in his late 50's. The "
 "ideal voice should sound like a man in his late 50's, with a casual and "
 'expressive tone. The character should sound slightly unhinged and '
 'unpredictable, but not overtly crazy. Think of a character who enjoys '
 'telling tall tales in a saloon. \n'
 '\n'
 '## Voice characteristics to consider\n'
 'Consider the following aspects of the voice:\n'
 '- Dynamic and engaging style: The voice should have varied intonation and '
 'avoid sounding monotonous.\n'
 '- Expressive tone: The delivery should be lively and captivating, '
 'demonstrating a range of emotions.\n'
 '- Accent\n'
 '- Slightly slower speed: A deliberate and unhurried pace is preferred.\n'
 '- Volume\n'
 '- Pitch\n'
 '- Intonation\n'
 '- Casual delivery\n'
 'and more')


In [35]:
format_patterns(ranking_patterns)

'Strong patterns: \n\nEmerging patterns:\n  - Preference for more dynamic and engaging speech patterns.\n  - Dispreference for monotonous and less expressive voices.\n  - Slightly slower speaking rate is preferred.\n  - Casual tone is preferred\n\n'

In [37]:
judge_prompt_analyser_prompt = judge_prompt_analyser_prompt.format(
    current_judge_prompt=round_1_judge_prompt,  # <----- Round 1 judge prompt used this time
    ranking_patterns=format_patterns(ranking_patterns)
)

In [40]:
analyzer_result = await run_judge_prompt_analyser(
    judge_prompt_analyser_prompt=judge_prompt_analyser_prompt,
    analyzer_model=ANALYZER_MODEL
)

print("**Reasoning:**")
pprint(analyzer_result.reasoning)
print()
pprint("**Updated Judge Prompt:**")
pprint(analyzer_result.updated_judge_prompt)

round_2_judge_prompt = analyzer_result.updated_judge_prompt

**Reasoning:**
('The current prompt provides a general overview of the desired voice '
 'characteristics for a video game character but lacks specific details that '
 "align with the rater's preferences for more dynamic and engaging speech "
 'patterns, a slightly slower speaking rate, and a casual tone. The prompt '
 "mentions 'style,' 'tone,' and 'speed,' but doesn't explicitly emphasize the "
 'importance of dynamism, engagement, and casualness. Therefore, the prompt '
 'could be updated to better reflect these preferences and provide more '
 'specific guidance for evaluating voice samples.')

'**Updated Judge Prompt:**'
('The task is to assess the appropriateness of the audio voice sample for a '
 "video game character in his late 50's. He should sound like a man in his "
 "late 50's and be a little on the wild side. The ideal voice should exhibit "
 'dynamic and engaging speech patterns, avoiding monotony and showcasing '
 'expressiveness. A slightly slower speaking rate is prefer

## Run the updated Judge

In [41]:
new_judge_prompt = round_2_judge_prompt + judge_prompt_postfix
speech_samples_to_eval = []
for eval_id in ids_to_eval:
    speech_samples_to_eval.append(samples_to_eval[eval_id])

# Run new judge
result = await run_speech_judge(
    new_judge_prompt=new_judge_prompt,
    speech_samples_to_eval=speech_samples_to_eval,
    judge_model=JUDGE_MODEL
)
pprint(result.thinking)
pprint(result.ranking)
print()

('All three voices sound like they could be appropriate for the character '
 "described. Voice 1 sounds the most 'wild' and has a slightly gruff quality "
 'that fits the description well. Voice 2 is similar but slightly less '
 'dynamic. Voice 3 is also good, but perhaps a little less distinctive than '
 'the other two.')
['voice_1', 'voice_2', 'voice_3']

