# Exploratory Data Analysis and Experimentation for LitQA2 Benchmark

## Tasks: 

- Recreate the figure found within the paper (specifically the figure about the LitQA2 Question Answering Benchmark)


## Specific LLM and PaperQA settings

- ```agent_llm``` fixed to ```gpt-4-turbo-2024-04-09```
- ```consider_sources``` (the top-k settings) set to 30
- ```max_sources``` set to 5

In [1]:
# Import libraries
from os import path
import random

import pandas as pd
import numpy as np
from paperqa import ask, Settings, agent_query
from paperqa.settings import AgentSettings, AnswerSettings, IndexSettings

In [2]:
# Import the data from LitQA2 data 

# Path for the data from the authors found in the parquet file
data_path = path.join("data", "LitQA_data", "LitQA_full_pdfs", "train-00000-of-00001.parquet")

litqa2_data = pd.read_parquet("/root/paperQA2_analysis/data/LitQA_data/test-00000-of-00001.parquet")

litqa2_data.head()

Unnamed: 0,id,question,ideal,distractors,canary,tag,version,sources,is_opensource,subtask,key-passage
0,e6ece709-c919-4388-9f64-ab0e0822b03a,Approximately what percentage of topologically...,31%,"[21%, 11%, 41%, 51%]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1038/s41467-024-44782-6],True,litqa-v2-test,Good control in FPR does not necessarily repre...
1,813a9053-3f67-4d58-80af-02153de90ae4,At least how long do SynNotch-MCF10DCIS cells ...,72 h,"[24, 48 h, 0 h, 12 h, 6 h, 96 h]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1073/pnas.2322688121],True,litqa-v2-test,Spatial heterogeneity within tumors due to var...
2,831621de-5e32-4006-af84-a40dba100866,DK015 and DK038 strains of Verticillium dahlia...,95%,"[94%, 96%, 97%, 98%]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1186/s12915-024-01900-6],True,litqa-v2-test,"The strains DK015 and DK038, with opposite MAT..."
3,3e6d7a54-5b8a-4aa0-ac6e-1fce986d1636,Expression of which of the following genes was...,Aldh1l1,"[MAPK, Actin, none of the above]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1073/pnas.2321711121],True,litqa-v2-test,The mitogen-activated protein kinase (MAPK) pa...
4,e4579ca5-c7d4-47a0-88f5-8adc460fc936,For which of the following Trub1 substrates di...,SCP2,"[FBXO5, HECTD1, NKAIN1, CCDC22, IDI1]",BENCHMARK DATA SHOULD NEVER APPEAR IN TRAINING...,litqa,1.1-dev,[https://doi.org/10.1101/2024.03.26.586895],True,litqa-v2-test,"Among the Trub1 substrates, FBXO5 (chr6:152975..."


Set Up Question Prompt

In [1]:
# Set up the question prompt:

# Set random seed
np.random.seed(81001)

# Define a function to randomize the answers to a letter:
def randomize_question_letter(answers: list):
    # Create an index of letters to use
    letters = ["A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K"]
    
    # Use a set to get unique answers
    unique_ans = list(set(answers))
    
    # Shuffle the list
    random.shuffle(unique_ans)
    
    answers = {}
    
    # Randomly assign letter to question:
    for i in range(len(unique_ans)):
        answers[letters[i]] = unique_ans[i]
    return answers
        
# Check if it works
correct_answer = str(litqa2_data["ideal"][0])
possible_answers = [str(i) for i in litqa2_data["distractors"][0]]
possible_answers.append(correct_answer)
answer_options = randomize_question_letter(possible_answers)
print(answer_options)

NameError: name 'np' is not defined

In [5]:
# Create prompt
prompt = f"""
Please answer the following multiple choice question. 
Return a single letter answer denoting your choice, or return 0 if you are unsure about the answer or unable to answer.

Question: {litqa2_data["question"][0]}. 

Answer Options:
"""

# Add the answer options
for key, val in answer_options.items():
    prompt += f"\n {key}: {val}"
    
# Add the prompt for unsure:#
prompt += f"\n 0: unsure"
    
print(prompt)


Please answer the following multiple choice question. 
Return a single letter answer denoting your choice, or return 0 if you are unsure about the answer or unable to answer.

Question: Approximately what percentage of topologically associated domains in the GM12878 blood cell line does DiffDomain classify as reorganized in the K562 cell line?. 

Answer Options:

 A: 21%
 B: 31%
 C: 51%
 D: 41%
 E: 11%
 0: unsure


Use the Agent Query to call the model and answer the questions using the agent_query call

In [None]:
# To run the ask function, you need to apply:
import nest_asyncio
nest_asyncio.apply()

test_response = ask(
    query=prompt,
    settings=Settings(
        llm="gpt-4o-mini",
        llm_config={
            "model_list": [
                {
                    "model_name": "gpt-4o-mini",
                    "litellm_params": {
                        "model": "gpt-4o-mini",
                        "temperature": 0,
                        "max_tokens": 4096
                    }
                }
            ],
            "rate_limits": {"gpt-4o-mini": "30000 per 1 minute"},
        },
        agent=AgentSettings(
            agent_llm="gpt-4o-mini",
            agent_llm_config={
                "rate_limit": {"gpt-4o-mini": "30000 per 1 minute"}
            }
        ),
        embedding="text-embedding-3-small",
        temperature=0,
        paper_directory="/root/paperQA2_analysis/data/LitQA_data/LitQA2_test_pdfs"
    )
)

Encountered exception during tool call for tool gather_evidence: litellm.NotFoundError: OpenAIException - Error code: 403 - {'error': {'message': 'Project `proj_8WO6cHG6Ics15ycd35qOE2RY` does not have access to model `gpt-4o-2024-11-20`', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
Received Model Group=gpt-4o-2024-11-20
Available Model Group Fallbacks=None


Encountered exception during tool call for tool gather_evidence: litellm.NotFoundError: OpenAIException - Error code: 403 - {'error': {'message': 'Project `proj_8WO6cHG6Ics15ycd35qOE2RY` does not have access to model `gpt-4o-2024-11-20`', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
Received Model Group=gpt-4o-2024-11-20
Available Model Group Fallbacks=None


Encountered exception during tool call for tool gather_evidence: litellm.NotFoundError: OpenAIException - Error code: 403 - {'error': {'message': 'Project `proj_8WO6cHG6Ics15ycd35qOE2RY` does not have access to model `gpt-4o-2024-11-20`', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}
Received Model Group=gpt-4o-2024-11-20
Available Model Group Fallbacks=None


KeyboardInterrupt: 