# LitQA2 Benchmark for PaperQA2

## 1: Create pipeline functions and Exploration

Get familiar with PaperQA2's functionality by testing it on a single question.

In [1]:
# Import necessary libraries
import random

import pandas as pd
import numpy as np
import nest_asyncio

from paperqa import ask, Settings
# Import specific settings objects
from paperqa.settings import AgentSettings, AnswerSettings

### Import data for exploratory analysis

The data is the LitQA2 test split, which is used to benchmark PaperQA2.

In [2]:
# Import the LitQA2 test data 
litqa2_data = pd.read_parquet("/root/paperQA2_analysis/data/LitQA_data/test-00000-of-00001.parquet")
litqa2_data.head()

Unnamed: 0,id,question,ideal,distractors,canary,tag,version,sources,is_opensource,subtask,key-passage
0,e6ece709-c919-4388-9f64-ab0e0822b03a,Approximately what percentage of topologically...,31%,"[21%, 11%, 41%, 51%]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1038/s41467-024-44782-6],True,litqa-v2-test,Good control in FPR does not necessarily repre...
1,813a9053-3f67-4d58-80af-02153de90ae4,At least how long do SynNotch-MCF10DCIS cells ...,72 h,"[24, 48 h, 0 h, 12 h, 6 h, 96 h]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1073/pnas.2322688121],True,litqa-v2-test,Spatial heterogeneity within tumors due to var...
2,831621de-5e32-4006-af84-a40dba100866,DK015 and DK038 strains of Verticillium dahlia...,95%,"[94%, 96%, 97%, 98%]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1186/s12915-024-01900-6],True,litqa-v2-test,"The strains DK015 and DK038, with opposite MAT..."
3,3e6d7a54-5b8a-4aa0-ac6e-1fce986d1636,Expression of which of the following genes was...,Aldh1l1,"[MAPK, Actin, none of the above]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1073/pnas.2321711121],True,litqa-v2-test,The mitogen-activated protein kinase (MAPK) pa...
4,e4579ca5-c7d4-47a0-88f5-8adc460fc936,For which of the following Trub1 substrates di...,SCP2,"[FBXO5, HECTD1, NKAIN1, CCDC22, IDI1]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1101/2024.03.26.586895],True,litqa-v2-test,"Among the Trub1 substrates, FBXO5 (chr6:152975..."


### Create the prompt generator

This function prepares the answer options for the prompt fed to PaperQA2: it takes the ideal answer combined with the distractors,
shuffles them, and assigns each one a letter so the LLM can respond in multiple-choice format.

In [3]:
# Set random seed for reproducibility
random.seed(81001)

# Define a function that assigns each answer option to a random letter
def randomize_question_letter(answers: list) -> dict:
    # Letters available as option labels (supports up to 11 options)
    letters = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]

    # Use a set to drop duplicate answers (note: set order is arbitrary)
    unique_ans = list(set(answers))

    # Shuffle the list in place
    random.shuffle(unique_ans)

    # Assign letters to the shuffled answers in order
    lettered = {}
    for i in range(len(unique_ans)):
        lettered[letters[i]] = unique_ans[i]
    return lettered


# Test if the prompt generator works
correct_answer = str(litqa2_data["ideal"][0])
possible_answers = [str(i) for i in litqa2_data["distractors"][0]]
possible_answers.append(correct_answer)
answer_options = randomize_question_letter(possible_answers)
print(answer_options)

{'A': '51%', 'B': '11%', 'C': '21%', 'D': '41%', 'E': '31%'}
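
One caveat with the function above: the iteration order of a `set` of strings is not guaranteed to be the same between Python sessions (string hashing is randomised per process unless `PYTHONHASHSEED` is fixed), so `random.seed` alone may not reproduce the same letter assignment. Below is a minimal sketch of a variant that should be reproducible, assuming order-preserving de-duplication is acceptable. The name `assign_letters` and its `seed` parameter are illustrative and not part of the notebook.

```python
import random
import string


def assign_letters(answers: list[str], seed: int = 81001) -> dict[str, str]:
    """Map shuffled, de-duplicated answers to letters A, B, C, ..."""
    # dict.fromkeys drops duplicates while keeping first-seen order,
    # unlike set(), whose order for strings can change between sessions
    unique = list(dict.fromkeys(answers))
    if len(unique) > len(string.ascii_uppercase):
        raise ValueError("more answer options than available letters")
    # A local Random instance keeps the shuffle independent of global state
    random.Random(seed).shuffle(unique)
    return dict(zip(string.ascii_uppercase, unique))


options = assign_letters(["21%", "11%", "41%", "51%", "31%", "31%"])
print(options)
```

Because the seed is passed per call, the same input list gives the same assignment regardless of what other code has done to the global random state. The assignment will generally differ from the one printed above, since the shuffle starts from a different ordering.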


Create a prompt from the answer options.

In [16]:
prompt = f"""
Please answer the following multiple choice question. 
Return a single letter answer denoting your choice, or return 0 if you are unsure about the answer or unable to answer.

Question: {litqa2_data["question"][0]}. 

Available Options:
"""

# Add the answer options
for key, val in answer_options.items():
    prompt += f"\n {key}: {val}"
    
# Add the option for "unsure"
prompt += "\n 0: unsure"

prompt += """
\n
Return your answer in the following format:

"letter".

where the letter denotes your chosen answer from the available options. Only include the letter and nothing else. 
"""
    
print(prompt)


Please answer the following multiple choice question. 
Return a single letter answer denoting your choice, or return 0 if you are unsure about the answer or unable to answer.

Question: Approximately what percentage of topologically associated domains in the GM12878 blood cell line does DiffDomain classify as reorganized in the K562 cell line?. 

Available Options:

 A: 51%
 B: 11%
 C: 21%
 D: 41%
 E: 31%
 0: unsure


Return your answer in the following format:

"letter".

where the letter denotes your chosen answer from the available options. Only include the letter and nothing else. 
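
To run the full benchmark, the prompt assembly above will probably need to be wrapped in a reusable function. The following is one possible sketch; `build_prompt` is a hypothetical name, and the wording is copied from the cell above. It assumes the question text already ends with its own punctuation, so no extra full stop is appended after it.

```python
def build_prompt(question: str, options: dict[str, str]) -> str:
    """Assemble the multiple-choice prompt for one question."""
    lines = [
        "Please answer the following multiple choice question.",
        "Return a single letter answer denoting your choice, "
        "or return 0 if you are unsure about the answer or unable to answer.",
        "",
        f"Question: {question}",
        "",
        "Available Options:",
        "",
    ]
    # One line per lettered option, in the order given by the dict
    lines += [f" {letter}: {text}" for letter, text in options.items()]
    lines += [
        " 0: unsure",
        "",
        "Return your answer in the following format:",
        "",
        '"letter".',
        "",
        "where the letter denotes your chosen answer from the available options. "
        "Only include the letter and nothing else.",
    ]
    return "\n".join(lines)


# Illustrative inputs only, not taken from the dataset
example = build_prompt("Which option is correct?", {"A": "51%", "B": "31%"})
print(example)
```

In the notebook this could be called as `build_prompt(litqa2_data["question"][0], answer_options)`.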



In [17]:
# Set up LLM config (main LLM for reasoning, extract metadata, ...)
llm_config_dict = {
    "model_list": [
        {
            "model_name": "gpt-4o-mini",
            "litellm_params": {
                "model": "gpt-4o-mini",
                "temperature": 0,
                "max_tokens": 4096
            }
        }
    ],
    "rate_limit": {"gpt-4o-mini": "30000 per 1 minute"}
}

# Set up agent (answer search and selecting tools).
# The rate limit is keyed by model name, matching the other LLM configs.
agent_settings = AgentSettings(
    agent_llm="gpt-4o-mini",
    agent_llm_config={
        "rate_limit": {"gpt-4o-mini": "30000 per 1 minute"}
    }
)

# Set up summary LLM config
summary_config_dict = {
    "rate_limit": {"gpt-4o-mini": "30000 per 1 minute"}
}

# Set up answer format
answer_settings = AnswerSettings(
    evidence_k=30,
    evidence_detailed_citations=False,
    evidence_retrieval=False,
    evidence_summary_length="around 100 words",
    evidence_skip_summary=False,
    answer_max_sources=5,
    max_answer_attempts=5,
    answer_length="1 letter"
)

# Set up the final settings object
paperqa_settings = Settings(
    llm="gpt-4o-mini",
    llm_config=llm_config_dict,
    summary_llm="gpt-4o-mini",
    summary_llm_config=summary_config_dict,
    agent=agent_settings,
    # Pass the answer settings defined above; otherwise they are never used
    answer=answer_settings,
    temperature=0,
    batch_size=1,
    verbosity=1,
    paper_directory="/root/paperQA2_analysis/data/LitQA_data/LitQA2_test_pdfs"
)

In [18]:
# Patch the event loop so PaperQA2's async code can run inside the notebook
nest_asyncio.apply()

# Ask the question
test_response = ask(query=prompt, settings=paperqa_settings)

PaperQA version: 5.11.1
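
The prompt asks for a single letter (or `0`), so the answer text still has to be mapped back to an option and compared with the ideal answer. The sketch below works on a plain string, because the attribute that holds the answer text on `test_response` depends on the PaperQA version and is not shown in this notebook. `parse_choice` and `grade` are hypothetical helpers, and treating an unparseable reply as "unsure" is an assumption.

```python
from typing import Optional


def parse_choice(answer_text: str, options: dict[str, str]) -> Optional[str]:
    """Return the chosen letter (or "0"), or None if the reply is not a single valid label."""
    # Remove surrounding whitespace, quotes, brackets and full stops, e.g. '"E".' -> 'E'
    cleaned = answer_text.strip().strip("\"'.() ").upper()
    if cleaned in options or cleaned == "0":
        return cleaned
    return None


def grade(choice: Optional[str], options: dict[str, str], ideal: str) -> str:
    """Classify a parsed choice as 'correct', 'incorrect' or 'unsure'."""
    # Assumption: replies that could not be parsed are counted as unsure
    if choice is None or choice == "0":
        return "unsure"
    return "correct" if options[choice] == ideal else "incorrect"


# Options as printed earlier in the notebook; the reply string is illustrative
answer_options = {"A": "51%", "B": "11%", "C": "21%", "D": "41%", "E": "31%"}
choice = parse_choice('"E".', answer_options)
print(choice, grade(choice, answer_options, ideal="31%"))
```

The parser is deliberately strict: a reply such as "The answer is E" returns `None` rather than guessing, which keeps formatting failures visible in the results.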


Even with every temperature setting at 0, repeated runs of the same prompt give different answers.
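
One way to quantify this run-to-run variation is to repeat the same prompt several times and tally the parsed answers. The sketch below shows only the tallying step; `summarise_runs` is a hypothetical helper, and the list of choices is made-up illustration data rather than results from this notebook.

```python
from collections import Counter


def summarise_runs(choices: list[str]) -> tuple[str, float]:
    """Return the most common choice and the share of runs that agree with it."""
    if not choices:
        raise ValueError("need at least one run")
    counts = Counter(choices)
    top_choice, top_count = counts.most_common(1)[0]
    return top_choice, top_count / len(choices)


# Hypothetical parsed answers from five repeated runs of the same prompt
runs = ["E", "E", "C", "E", "0"]
majority, agreement = summarise_runs(runs)
print(f"majority answer: {majority}, agreement: {agreement:.0%}")
```

An agreement share well below 100% on a question would suggest that a single run is not a reliable estimate of benchmark accuracy for it.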
