Pick 30 questions to run debates on.
30 debates will be run with gemini-3-pro.
30 debates will be run with grok-4-fast.

10 of the 30 questions will be held out for validation and used to:
- Tune debater prompts
- Tune judge prompts
- Set up the human eval script (will collect certainty after every turn)

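Will choose the questions that were hardest across all models (lowest mean QA accuracy) BUT that grok-4-fast answered correctly. The selection code below also restricts the pool to Physics questions.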

Notes:
- Expect some improvement from QA to debate through regression to the mean alone, because the questions are selected for poor QA outcomes. A strong baseline is therefore the judge model's performance with itself as debater.
- Outcomes:
    - If the judges are able to do it, we have a strong result (I doubt it)
    - If the judges are unable but I am able, we have a clear target for improvement (training the judge with process supervision)
    - If I am unable to do it, we've got nothing
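
The regression-to-the-mean concern in the notes can be illustrated with a toy simulation. This is only a sketch under strong assumptions: every question gets the same hypothetical true accuracy (`true_acc = 0.6`), so any gap between the selected and re-measured scores is pure selection noise. The function name and all parameter values are made up for illustration and do not come from the experiment.

```python
import random

def simulate_selection_effect(n_questions=200, n_models=11, n_pick=30, seed=0):
    """Toy simulation: all questions share one true accuracy, so any gap
    between selected and re-measured accuracy is selection noise."""
    rng = random.Random(seed)
    true_acc = 0.6  # hypothetical accuracy, identical for every question

    def measure():
        # fraction of models answering correctly, one value per question
        return [
            sum(rng.random() < true_acc for _ in range(n_models)) / n_models
            for _ in range(n_questions)
        ]

    first = measure()
    second = measure()  # independent re-measurement of the same questions

    # pick the questions that looked hardest in the first measurement
    picked = sorted(range(n_questions), key=lambda i: first[i])[:n_pick]
    mean_first = sum(first[i] for i in picked) / n_pick
    mean_second = sum(second[i] for i in picked) / n_pick
    return mean_first, mean_second

selected_acc, remeasured_acc = simulate_selection_effect()
print(f"selected: {selected_acc:.2f}, re-measured: {remeasured_acc:.2f}")
```

Under these assumptions the re-measured accuracy of the selected questions drifts back toward `true_acc` with no intervention at all. That is why comparing against the judge model debating itself is a more informative baseline than the raw QA score.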

In [23]:
### Find the hardest questions

from analysis_utils import prepare_df
from datasets import load_dataset


In [24]:
qa_df = prepare_df(['qa'])

In [25]:
dataset = load_dataset('Idavidrein/gpqa', 'gpqa_diamond')['train']
dataset_df = dataset.to_pandas()
dataset_df = dataset_df.rename({'Question': 'question', 'High-level domain': 'high_level_domain'}, axis=1)

In [28]:
qa_df = qa_df.merge(dataset_df[['question', 'high_level_domain']], left_on=['question_qa'], right_on=['question'], how='left', suffixes=('', '_dataset'))

In [29]:
qa_df.columns

Index(['run_id_qa', 'datetime_qa', 'question_idx_qa', 'question_qa',
       'options_qa', 'correct_idx_qa', 'raw_model_response_qa',
       'parsed_model_response_qa', 'prompt_qa', 'token_usage_qa',
       'record_id_qa', 'prompt_template_qa', 'internal_model_reasoning_qa',
       'internal_model_reasoning_details_qa', 'success_qa', 'error_message_qa',
       'config_dataset_name_qa', 'config_dataset_subset_qa',
       'config_dataset_split_qa', 'config_model_name_qa',
       'config_temperature_qa', 'config_num_questions_qa',
       'config_random_seed_qa', 'config_num_choices_qa',
       'config_max_threads_qa', 'config_specific_question_idxs_qa',
       'config_rerun_qa', 'config_lenient_parsing_qa', 'config_max_tokens_qa',
       'config_reasoning_effort_qa', 'config_reasoning_max_tokens_qa',
       'options_str_qa', 'parsed_answer_qa', 'is_correct_qa', 'question',
       'high_level_domain'],
      dtype='object')

In [31]:
qa_filters = {
    'config_dataset_name_qa': 'Idavidrein/gpqa',
    'config_num_choices_qa': 2,
    'config_random_seed_qa': 42
}

temp_df = qa_df.copy()
# keep only rows matching every filter (loop names avoid shadowing the built-in `filter`)
for column, value in qa_filters.items():
    temp_df = temp_df[temp_df[column] == value]

# drop some judge models with low sample counts
temp_df = temp_df[~temp_df['config_model_name_qa'].isin(['qwen/qwen3-235b-a22b', 'qwen/qwen3-8b', 'google/gemini-3-pro-preview'])]

temp_df.shape

(2090, 36)

In [32]:
temp_df['config_model_name_qa'].value_counts()

config_model_name_qa
x-ai/grok-4-fast                     198
openai/gpt-4o-mini                   198
openai/gpt-3.5-turbo                 197
google/gemma-3-27b-it                196
qwen/qwen-2.5-72b-instruct           195
qwen/qwen3-14b                       192
qwen/qwen-2.5-7b-instruct            191
meta-llama/llama-3-8b-instruct       189
google/gemma-3-12b-it                182
qwen/qwen3-32b                       177
meta-llama/llama-3.1-70b-instruct    175
Name: count, dtype: int64

In [33]:
# drop any where grok-4-fast got it wrong (less chance of debate working there)
valid_question_idxs = temp_df[(temp_df['config_model_name_qa'] == 'x-ai/grok-4-fast') & (temp_df['is_correct_qa'] == True)]['question_idx_qa']
len(valid_question_idxs)

temp_df = temp_df[temp_df['question_idx_qa'].isin(valid_question_idxs)]
temp_df['config_model_name_qa'].value_counts()

config_model_name_qa
x-ai/grok-4-fast                     186
openai/gpt-4o-mini                   186
openai/gpt-3.5-turbo                 185
qwen/qwen-2.5-72b-instruct           184
google/gemma-3-27b-it                184
qwen/qwen3-14b                       180
qwen/qwen-2.5-7b-instruct            179
meta-llama/llama-3-8b-instruct       178
google/gemma-3-12b-it                171
qwen/qwen3-32b                       165
meta-llama/llama-3.1-70b-instruct    164
Name: count, dtype: int64

In [35]:
# restrict to Physics questions
temp_df = temp_df[temp_df['high_level_domain'] == 'Physics']

# mean accuracy per question across models; lowest first, keep the 30 hardest
hardest_qs = temp_df.groupby('question_idx_qa')['is_correct_qa'].mean().sort_values().head(30)
hardest_qs


question_idx_qa
176.0    0.090909
179.0    0.222222
192.0    0.272727
136.0    0.272727
173.0    0.363636
123.0    0.363636
129.0    0.363636
56.0     0.444444
9.0      0.444444
178.0    0.454545
3.0      0.454545
169.0    0.454545
57.0     0.454545
55.0     0.454545
180.0    0.500000
84.0     0.500000
165.0    0.545455
70.0     0.545455
103.0    0.545455
146.0    0.545455
119.0    0.545455
68.0     0.555556
154.0    0.600000
49.0     0.600000
34.0     0.600000
95.0     0.636364
150.0    0.636364
108.0    0.636364
143.0    0.636364
194.0    0.636364
dtype: float64
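
Values such as 0.090909, 0.444444 and 0.600000 suggest that the per-question means are taken over different numbers of models, which fits the uneven model counts shown earlier. A minimal sketch of reporting the count next to the mean is below. `toy_df` and its values are made up for illustration and only stand in for `temp_df`.

```python
import pandas as pd

# toy stand-in for temp_df; the values are made up for illustration
toy_df = pd.DataFrame({
    "question_idx_qa": [1, 1, 1, 2, 2, 3, 3, 3],
    "is_correct_qa": [True, False, False, True, True, False, False, True],
})

# accuracy per question plus the number of model answers behind it
per_question = (
    toy_df.groupby("question_idx_qa")["is_correct_qa"]
    .agg(accuracy="mean", n_models="count")
    .sort_values("accuracy")
)
print(per_question)
```

Showing `n_models` alongside `accuracy` makes it easier to spot questions whose ranking rests on fewer answers than the rest.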

In [19]:
import random

random.seed(42)
# shuffle the 30 selected question ids reproducibly
shuffled = random.sample(hardest_qs.index.astype(int).tolist(), 30)

print(shuffled)

[173, 45, 176, 118, 179, 110, 56, 92, 164, 129, 149, 32, 75, 125, 142, 63, 33, 196, 9, 192, 182, 76, 139, 97, 105, 189, 57, 136, 22, 123]


In [20]:
validation = shuffled[:10]
test = shuffled[10:]

print(validation)

[173, 45, 176, 118, 179, 110, 56, 92, 164, 129]


In [21]:
print(test)

[149, 32, 75, 125, 142, 63, 33, 196, 9, 192, 182, 76, 139, 97, 105, 189, 57, 136, 22, 123]
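
The split cells carry execution counts In [19] to In [21], lower than the In [35] cell that builds `hardest_qs`. Ids such as 45 appear in the printed split but not in the `hardest_qs` listing above. The split may therefore have been produced from an earlier version of `hardest_qs`. A small consistency check along these lines could confirm that after a clean re-run. `check_split` is a hypothetical helper, and the ids in the example are chosen only for illustration.

```python
def check_split(validation, test, pool):
    """Return the problems found in a validation/test split of `pool`."""
    validation, test, pool = set(validation), set(test), set(pool)
    return {
        "overlap": sorted(validation & test),              # ids in both splits
        "not_in_pool": sorted((validation | test) - pool),  # ids from elsewhere
        "unused": sorted(pool - validation - test),         # pool ids left out
    }

# hypothetical ids, for illustration only
pool = [3, 9, 56, 57, 123]
report = check_split(validation=[9, 45], test=[56, 57, 123], pool=pool)
print(report)
# {'overlap': [], 'not_in_pool': [45], 'unused': [3]}
```

In the notebook, the call would be `check_split(validation, test, hardest_qs.index.astype(int))`, and all three lists should come back empty.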
