### Uniformly sample 200 examples from the original SimpleQA dataset
1. Download the original SimpleQA dataset from this [link](https://huggingface.co/datasets/basicv8vc/SimpleQA/resolve/main/simple_qa_test_set.csv?download=true).
2. Run the following code to sample 200 examples uniformly at random.

In [59]:
import pandas as pd
import numpy as np
from tqdm import tqdm
import os

# set the random seed for reproducibility
np.random.seed(42)

In [60]:
original_df = pd.read_csv('simple_qa_test_set.csv')

# uniformly sample 200 rows from the original DataFrame
sampled_df = original_df.sample(n=200, random_state=42)
sampled_df.head()

Unnamed: 0,metadata,problem,answer
2333,"{'topic': 'Art', 'answer_type': 'Number', 'url...",At what age was Ken Noda invited by President ...,20
803,"{'topic': 'Art', 'answer_type': 'Other', 'urls...",Which art dealership did Peter Arrell Browne W...,Knoedler
3113,"{'topic': 'Music', 'answer_type': 'Person', 'u...",What was composer Sigrid Ingeborg Henriette Wi...,Anna Bruun Tordenskjold
3524,"{'topic': 'Geography', 'answer_type': 'Number'...",What is the forest cover area of Madhya Prades...,77482.49
1675,"{'topic': 'TV shows', 'answer_type': 'Person',...",Who kills Daryl Garrs in Happy Valley?,Alison Garrs
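Because the sample is drawn uniformly, its topic mix should roughly track the full dataset. The sketch below is one way to check that. It assumes, based on the output above, that each `metadata` value is a string holding a Python dict literal with a `topic` key. `topic_counts` and the stand-in frame are illustrative names, not part of the original notebook.

```python
import ast
from collections import Counter

import pandas as pd


def topic_counts(df: pd.DataFrame) -> Counter:
    """Count rows per topic by parsing the stringified metadata dicts."""
    # Assumption: each metadata cell is a dict literal such as "{'topic': 'Art', ...}".
    topics = df["metadata"].map(lambda s: ast.literal_eval(s)["topic"])
    return Counter(topics)


# Made-up stand-in rows; with the real data, pass original_df or sampled_df.
demo_metadata_df = pd.DataFrame({
    "metadata": [
        "{'topic': 'Art', 'answer_type': 'Number', 'urls': []}",
        "{'topic': 'Music', 'answer_type': 'Person', 'urls': []}",
        "{'topic': 'Art', 'answer_type': 'Other', 'urls': []}",
    ],
})
demo_topic_counts = topic_counts(demo_metadata_df)
print(demo_topic_counts)
```

Comparing `topic_counts(original_df)` with `topic_counts(sampled_df)` would show whether any topic is badly under-represented in the 200-row sample.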


### Call LLM APIs to generate uncertain statements using an instruction template

In [61]:
# Instruction template
instruction_template = """
You are given a question and its ground-truth answer. Your task is to generate 50 answer sentences that express the same answer using different levels of confidence:

10 with high confidence
10 with moderate confidence
10 with low confidence
10 with lowest confidence
10 with complete uncertainty, reject to reply

The wording should vary across the levels, but all responses should convey the same core answer. Focus on natural and diverse expressions of confidence.

Question: {}
Answer: {}
"""

# llm api keys
llm_api_keys = {
    'gpt': os.getenv('GPT_API_KEY'),
    'gemini': os.getenv('GEMINI_API_KEY'),
    'claude': os.getenv('CLAUDE_API_KEY'),
    'grok': os.getenv('GROK_API_KEY')
}

### Generate 10,000 uncertain statements per LLM
Each LLM produces 50 statements for each of the 200 sampled questions (10,000 in total). Generation takes around 2 hours per LLM.

In [62]:
from llm_api_handler import gemini_generate, gpt_generate, claude_generate, grok_generate


llm_to_use = 'gemini'  # Change this to 'gpt', 'gemini', 'claude' or 'grok' as needed
llm_generate = {
    'gpt': gpt_generate, # gpt-4.1
    'gemini': gemini_generate, # gemini-2.5-pro
    'claude': claude_generate, # claude-sonnet-4-20250514
    'grok': grok_generate # grok-3
}[llm_to_use]

# # uncomment to test api
# first_row = sampled_df.iloc[0]
# instruction = instruction_template.format(first_row['problem'], first_row['answer'])
# response = llm_generate(instruction, llm_api_keys[llm_to_use])
# print(response)

# new_df = pd.DataFrame(columns=['problem', 'answer', 'raw_response', 'metadata'])

# for index, row in tqdm(sampled_df.iterrows()):
#     problem = row['problem']
#     answer = row['answer']
#     metadata = row['metadata']
#     instruction = instruction_template.format(problem, answer)
#     text = llm_generate(instruction, api_key=llm_api_keys[llm_to_use])
#     # add the generated text to the new df
#     new_row = {
#         'problem': problem,
#         'answer': answer,
#         'raw_response': text,
#         'metadata': metadata
#     }
#     new_df = pd.concat([new_df, pd.DataFrame([new_row])], ignore_index=True)
#     # save the new df to csv file
#     new_df.to_csv(f'generated_text_{llm_to_use}.csv', index=False)

### Extract uncertain statements from raw responses using regex
Run the interactive script to extract the sentences and handle edge cases:
`python extract_sentences.py`
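
`extract_sentences.py` is not shown here, so the following is only a minimal sketch of regex-based extraction. It assumes a response layout in which each confidence level has a header line followed by numbered or bulleted sentences. Real responses vary by model, which is presumably why the actual script is interactive. All names below are illustrative.

```python
import re
from typing import Dict, List, Optional

# Header patterns for the five confidence levels named in the instruction template.
LEVEL_PATTERNS = [
    ("completely uncertain", re.compile(r"complete(?:ly)?\s+uncertain", re.I)),
    ("lowest", re.compile(r"lowest\s+confidence", re.I)),
    ("low", re.compile(r"low\s+confidence", re.I)),
    ("moderate", re.compile(r"moderate\s+confidence", re.I)),
    ("high", re.compile(r"high\s+confidence", re.I)),
]
# A list item: "1. text", "1) text", "- text" or "* text".
ITEM_RE = re.compile(r"^\s*(?:\d+[.)]|[-*])\s+(.*\S)\s*$")


def extract_sentences(raw_response: str) -> Dict[str, List[str]]:
    """Group list items under the most recent confidence-level header."""
    sentences: Dict[str, List[str]] = {label: [] for label, _ in LEVEL_PATTERNS}
    current: Optional[str] = None
    for line in raw_response.splitlines():
        item = ITEM_RE.match(line)
        if item:
            # Items that appear before any header are dropped.
            if current is not None:
                sentences[current].append(item.group(1))
            continue
        for label, pattern in LEVEL_PATTERNS:
            if pattern.search(line):
                current = label
                break
    return sentences


# Made-up response in the assumed layout.
demo_raw = """High confidence
1. It is definitely 20.
2. The answer is 20.

Lowest confidence
1. This is a wild guess, but maybe 20?

Complete uncertainty
1. I don't know the answer.
"""
demo_sentences = extract_sentences(demo_raw)
print(demo_sentences)
```

List items are tested before headers so that a sentence such as "I have low confidence that ..." is not mistaken for a section header.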

### Sample 10,000 uncertain statements from the extracted uncertain statements
Sample 10,000 statements uniformly from the pooled extractions (roughly 2,500 per LLM) into a new DataFrame.


In [63]:
extracted_df = pd.read_csv('all_sentences_by_confidence.csv')
sampled_10000_df = extracted_df.sample(n=10000, random_state=42)
sampled_10000_df['model'].value_counts()

model
grok      2551
claude    2509
gpt       2470
gemini    2470
Name: count, dtype: int64
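
The uniform draw gives roughly, but not exactly, 2,500 statements per model. If exactly balanced counts were wanted, a stratified draw with `DataFrame.groupby(...).sample(...)` is one alternative. This is a different technique from the one used in the notebook, shown only as a sketch. `sample_balanced` and the toy frame are illustrative.

```python
import pandas as pd


def sample_balanced(df: pd.DataFrame, per_model: int, seed: int = 42) -> pd.DataFrame:
    """Draw exactly per_model rows from each model, keeping the original index."""
    return df.groupby("model").sample(n=per_model, random_state=seed)


# Toy frame with unequal group sizes (made-up data).
demo_models_df = pd.DataFrame({
    "model": ["gpt"] * 5 + ["gemini"] * 4 + ["claude"] * 6 + ["grok"] * 3,
    "sentence": [f"s{i}" for i in range(18)],
})
balanced_demo = sample_balanced(demo_models_df, per_model=2)
print(balanced_demo["model"].value_counts())
```

With the real data, `sample_balanced(extracted_df, per_model=2500)` would return 10,000 rows with 2,500 per model, provided every model has at least 2,500 extracted sentences.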

In [64]:
sampled_10000_df

Unnamed: 0,model,problem,answer,metadata,confidence,sentence
32823,claude,Who was the Minister of War as part of Brisson...,Jean-Baptiste Campenon,"{'topic': 'Politics', 'answer_type': 'Person',...",low,"Jean-Baptiste Campenon might be the answer, th..."
16298,gpt,What are the names of the three notable archae...,"Igbo Isaiah, Igbo Richard, and Igbo Jonah","{'topic': 'Art', 'answer_type': 'Place', 'urls...",completely uncertain,I don’t have access to that information at thi...
28505,gemini,What was the surname of the head of Mathematic...,Martin,"{'topic': 'Science and technology', 'answer_ty...",high,It was definitively Martin.
6689,grok,"On what day, month, and year did Oommen Chandy...",18 July 2023,"{'topic': 'Politics', 'answer_type': 'Date', '...",lowest,"I’m very unsure, but it could have been 18 Jul..."
26893,gemini,"What year was the municipality of Pajarito, Bo...",1853,"{'topic': 'Geography', 'answer_type': 'Date', ...",completely uncertain,That is a very specific historical detail that...
...,...,...,...,...,...,...
29415,gemini,What year was the municipality of El Santuario...,1765,"{'topic': 'Geography', 'answer_type': 'Date', ...",moderate,The correct year seems to be 1765.
11359,gpt,Who opened the first gender clinic in Canada a...,Dr. Lorne Warneke,"{'topic': 'Other', 'answer_type': 'Person', 'u...",high,Dr. Lorne Warneke was absolutely the founder o...
575,grok,In what month and year was Nikolai Talyzin dis...,September 1989,"{'topic': 'Politics', 'answer_type': 'Date', '...",low,"I’m not completely certain, but I think it was..."
17398,gpt,When was the Wayback Machine launched privatel...,"May 10, 1996","{'topic': 'Science and technology', 'answer_ty...",completely uncertain,I don't know the exact date of the Wayback Mac...


### Build the MTurk dataset

In [65]:
# randomly sample 50 rows from the sampled_10000_df, 10 from each confidence level
sampled_50_df_for_validation = pd.DataFrame(columns=sampled_10000_df.columns)
for confidence_level in sampled_10000_df.confidence.unique():
    sampled_rows = sampled_10000_df[sampled_10000_df['confidence'] == confidence_level].sample(n=10, random_state=42)
    sampled_50_df_for_validation = pd.concat([sampled_50_df_for_validation, sampled_rows], ignore_index=False)
    
# for each confidence level, we assign a lower bound and an upper bound
confidence_bounds = {
    'completely uncertain': (0, 30),
    'lowest': (10, 50),
    'low': (20, 70),
    'moderate': (40, 90),
    'high': (60, 100)
}
# assign the lower and upper bounds to the sampled_50_df_for_validation
sampled_50_df_for_validation['lower_bound'] = sampled_50_df_for_validation['confidence'].map(lambda x: confidence_bounds[x][0])
sampled_50_df_for_validation['upper_bound'] = sampled_50_df_for_validation['confidence'].map(lambda x: confidence_bounds[x][1])

sampled_50_df_for_validation

Unnamed: 0,model,problem,answer,metadata,confidence,sentence,lower_bound,upper_bound
23628,gemini,In what year was the Cunningham Dance Foundati...,1964,"{'topic': 'Art', 'answer_type': 'Date', 'urls'...",low,It was perhaps established in 1964.,20,70
30777,claude,What is the surname of the individual who won ...,Garfield,"{'topic': 'Science and technology', 'answer_ty...",low,The surname could well be Garfield.,20,70
7028,grok,What was the first and last name of the Presid...,Edward Penﬁeld,"{'topic': 'Art', 'answer_type': 'Person', 'url...",low,I’m leaning toward Edward Penfield being the p...,20,70
7678,grok,In which NYC borough was pharmacist David Morr...,Manhattan,"{'topic': 'Geography', 'answer_type': 'Place',...",low,It could be Manhattan where David Morris Kern ...,20,70
26377,gemini,Name the village in Iran from which Tilla embr...,Zari.,"{'topic': 'Art', 'answer_type': 'Place', 'urls...",low,There's a chance it originated in a village na...,20,70
17021,gpt,What was the first and last name of the Presid...,Edward Penﬁeld,"{'topic': 'Art', 'answer_type': 'Person', 'url...",low,"I'm not completely sure, but I think Edward Pe...",20,70
8570,grok,During what year did Pipilotti Rist receive th...,1997,"{'topic': 'Art', 'answer_type': 'Date', 'urls'...",low,"I’m not entirely sure, but I think Pipilotti R...",20,70
3470,grok,"In which year was the municipality of Sonsón, ...",1800,"{'topic': 'Geography', 'answer_type': 'Date', ...",low,"I’m not entirely sure, but I think Sonsón was ...",20,70
30821,claude,"What year was the municipality of Cómbita, Boy...",1586,"{'topic': 'Geography', 'answer_type': 'Date', ...",low,"If I recall correctly, Cómbita was established...",20,70
579,grok,In what month and year was Nikolai Talyzin dis...,September 1989,"{'topic': 'Politics', 'answer_type': 'Date', '...",low,It’s possible that Nikolai Talyzin and other c...,20,70
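
The notebook does not show how the bounds are applied. One plausible use is as an attention check on annotators: a rating for a validation sentence counts as a pass when it falls inside that sentence's interval. The sketch below assumes ratings on a 0 to 100 scale and inclusive bounds. Both are assumptions, and `count_validation_passes` is a hypothetical helper.

```python
from typing import Sequence, Tuple


def count_validation_passes(
    ratings: Sequence[float],
    bounds: Sequence[Tuple[float, float]],
) -> int:
    """Count ratings that fall inside their (lower, upper) interval, inclusive."""
    if len(ratings) != len(bounds):
        raise ValueError("ratings and bounds must have the same length")
    return sum(lower <= rating <= upper
               for rating, (lower, upper) in zip(ratings, bounds))


# One interval per confidence level, copied from confidence_bounds above.
demo_bounds = [(0, 30), (10, 50), (20, 70), (40, 90), (60, 100)]
# Made-up annotator ratings for five validation sentences.
demo_ratings = [5, 55, 70, 85, 40]
demo_passes = count_validation_passes(demo_ratings, demo_bounds)
print(demo_passes)
```

A survey could then be accepted or rejected by comparing the pass count against a threshold. The notebook does not specify one.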


In [66]:
number_of_sentence_per_survey = 100
mturk_df = pd.DataFrame()
# pack every 100 sentences into one survey row, storing each sentence alongside the index of its source row
for i in range(0, len(sampled_10000_df), number_of_sentence_per_survey):
    row = {}
    for j in range(number_of_sentence_per_survey):
        if i + j < len(sampled_10000_df):
            row[f'sentence_{j + 1}'] = sampled_10000_df.iloc[i + j]['sentence']
            row[f'index_{j + 1}'] = int(sampled_10000_df.index[i + j])
    # randomly sample 5 validation rows and store them as val_sentence_k, val_index_k, val_lower_bound_k, val_upper_bound_k for k = 1..5
    sampled_5_df_for_validation = sampled_50_df_for_validation.sample(n=5)
    for q in range(5):
        row[f'val_sentence_{q + 1}'] = sampled_5_df_for_validation.iloc[q]['sentence']
        row[f'val_index_{q + 1}'] = int(sampled_5_df_for_validation.index[q])
        row[f'val_lower_bound_{q + 1}'] = sampled_5_df_for_validation.iloc[q]['lower_bound']
        row[f'val_upper_bound_{q + 1}'] = sampled_5_df_for_validation.iloc[q]['upper_bound']
    mturk_df = pd.concat([mturk_df, pd.DataFrame([row])], ignore_index=False)


mturk_df.to_csv('mturk_tasks.csv', index=False)
mturk_df

Unnamed: 0,sentence_1,index_1,sentence_2,index_2,sentence_3,index_3,sentence_4,index_4,sentence_5,index_5,...,val_lower_bound_3,val_upper_bound_3,val_sentence_4,val_index_4,val_lower_bound_4,val_upper_bound_4,val_sentence_5,val_index_5,val_lower_bound_5,val_upper_bound_5
0,"Jean-Baptiste Campenon might be the answer, th...",32823,I don’t have access to that information at thi...,16298,It was definitively Martin.,28505,"I’m very unsure, but it could have been 18 Jul...",6689,That is a very specific historical detail that...,26893,...,10,50,I would say the surname is Garfield.,30769,40,90,I have no information about who won that award...,13398,0,30
0,I can’t say with any certainty how old she was...,13143,The generally accepted height for Zerxus is 6 ...,27415,"My memory is very fuzzy on this, but the numbe...",25484,"I’m grasping at straws here, but I’ll suggest ...",1484,The surname is definitively Freed.,35900,...,20,70,I think the dealership was Knoedler.,30066,40,90,I would say the surname is Garfield.,30769,40,90
0,I’m positive that 2018 was the year PayPal bec...,9207,I’ve heard that 1509 could be the founding yea...,19174,"If I’m not mistaken, Massimo Mastrorillo was p...",17176,It was May 2023 when the US Food and Drug Admi...,32704,I’m afraid I don’t have the answer to your que...,13596,...,20,70,My understanding is that Jess Castellote was r...,19367,40,90,"Without a doubt, the name was Martin.",28507,60,100
0,"I have very little confidence, but maybe Septe...",30888,I believe the world record was broken on 26 Oc...,38610,The winner of the International Photographer o...,27150,"I'm sorry, but I am unable to confirm the host...",28342,I don’t have that information right now.,16794,...,60,100,I think Islamia College of Science and Commerc...,6465,40,90,"Unfortunately, I don't know the answer to that...",17298,0,30
0,"I'm pulling this from a vague memory, so it's ...",20937,My understanding is that Belafonte attended Wo...,18963,I can’t say with any certainty how many pieces...,10741,"I believe it might be the Slade School of Art,...",10428,I think Robert Smalls was the person you're re...,11965,...,60,100,I have no reliable information on when Ahmad J...,9499,0,30,"Unfortunately, I don't know the answer to that...",17298,0,30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
0,I'm too uncertain about the details to give a ...,35193,"Perhaps Father Loutermilch sings ""Be Careful L...",35283,"I'm not really sure, but it might possibly be ...",33780,Jean-Baptiste Campenon occupied the War Minist...,32804,I think it was Chris Frazer Smith who won the ...,3379,...,60,100,It’s possible that Nikolai Talyzin and other c...,579,20,70,My understanding is that Jess Castellote was r...,19367,40,90
0,I think Moncef Bey ascended to power on 19 Jun...,8113,I'm fairly certain it was 26 October 2021.,38615,I don't know enough about that topic to give y...,34496,"I’m almost certain I’m wrong, but could it be ...",9786,"I'm extremely uncertain, but perhaps McLennan ...",35338,...,40,90,It’s possible that Nikolai Talyzin and other c...,579,20,70,"I think it’s safe to say that Pajarito, Boyacá...",6865,40,90
0,The specific date of that event is not availab...,22041,"This is a total shot in the dark, but I believ...",736,"Peters is a possibility, though I have very li...",30481,I'm really not sure who laid the foundation st...,32590,"In 2007, the institution inducted four new mem...",13906,...,60,100,I think Islamia College of Science and Commerc...,6465,40,90,Could it be Emmett? I’m not at all sure.,18239,10,50
0,"I could be mistaken, but was it September 2015?",21321,"I could be wrong, but I vaguely recall it bein...",34431,"My understanding is that the date was July 9, ...",12016,"I'm really not sure at all, but if forced to a...",26238,I’m afraid I don’t know when the Wayback Machi...,7395,...,20,70,"Without a doubt, the architectural style of th...",501,60,100,"I have very little confidence in this, but may...",7231,10,50
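
Each task row is wide: one `sentence_j` and `index_j` pair per slot. To join annotations back to `sampled_10000_df` later, the rows can be unpacked into long form again. The sketch below assumes only the column naming used above. `tasks_to_long` and the toy task frame are illustrative.

```python
import pandas as pd


def tasks_to_long(tasks_df: pd.DataFrame, sentences_per_survey: int) -> pd.DataFrame:
    """Unpack sentence_j / index_j columns into one record per sentence."""
    records = []
    for task_id, (_, task) in enumerate(tasks_df.iterrows()):
        for j in range(1, sentences_per_survey + 1):
            sentence = task.get(f"sentence_{j}")
            if pd.isna(sentence):
                continue  # slot is empty, e.g. in a short final task
            records.append({
                "task_id": task_id,
                "slot": j,
                "source_index": int(task[f"index_{j}"]),
                "sentence": sentence,
            })
    return pd.DataFrame(records)


# Made-up tasks with two slots each; the second task fills only one slot.
demo_tasks = pd.DataFrame([
    {"sentence_1": "Example sentence A.", "index_1": 11,
     "sentence_2": "Example sentence B.", "index_2": 22},
    {"sentence_1": "Example sentence C.", "index_1": 33},
])
demo_long = tasks_to_long(demo_tasks, sentences_per_survey=2)
print(demo_long)
```

With the real file, `tasks_to_long(pd.read_csv('mturk_tasks.csv'), 100)` would recover one row per sentence, and `source_index` could be used to look up the model and confidence level in `sampled_10000_df`.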
