# Gather data for evaluation

Queries and corresponding documents are gathered from the following datasets:
- NQ with natural questions focused on open-domain QA.
- LitQA2 from LAB-bench with biology questions and corresponding research papers.
- DRUID with fact-checked claims and potential evidence retrieved from the web.

## Environment setup

Change the current working directory to repo root

In [None]:
import os
os.chdir("../.")
os.getcwd() # check that we are in the right directory

In [None]:
from datasets import load_dataset
import pandas as pd
import json
import requests
import asyncio
import aiohttp
import xml.etree.ElementTree as ET
from thefuzz import fuzz
from tqdm import tqdm

tqdm.pandas() # set up tqdm progress bar for pandas

## NQ

Google's Natural Questions focuses on open-domain question answering. 

A sample in NQ contains a question, a corresponding full wikipedia document html and the correct answer passages.

**Helpful** context can be detected using the key passages. **Similar** context can be detected using e.g. Jaccard similarity and BERTScore.

### How to gather the data

1. Get the NQ data.
2. Take the full document and chunk it based on html entries - retain token indeces covered. Ensure that the chunks adequately cover the annotated long answers.

### 1. Get the data

Download all entries in the the NQ dev set from [here](https://console.cloud.google.com/storage/browser/natural_questions/v1.0/dev) to `data/NQ`. This should correspond to ~8,000 samples.

Unzip them and then load the data as follows.


In [None]:
save_path = "data/NQ/v1.0_dev_nq-dev-<num>.jsonl"
nq_data = []

for n in ['00','01','02','03','04']:
    nq_data.append(pd.read_json(save_path.replace('<num>', n), lines=True))
    
nq_data = pd.concat(nq_data)
nq_data.head()

Unnamed: 0,annotations,document_html,document_title,document_tokens,document_url,example_id,long_answer_candidates,question_text,question_tokens
0,"[{'annotation_id': 4323936797498927989, 'long_...","<!DOCTYPE html>\n<HTML class=""client-js ve-not...",Trade winds,"[{'end_byte': 97, 'html_token': False, 'start_...",https://en.wikipedia.org//w/index.php?title=Tr...,5225754983651766092,"[{'end_byte': 44666, 'end_token': 161, 'start_...",what purpose did seasonal monsoon winds have o...,"[what, purpose, did, seasonal, monsoon, winds,..."
1,"[{'annotation_id': 4831085488325731996, 'long_...","<!DOCTYPE html>\n<HTML class=""client-js ve-not...",High School Musical 2,"[{'end_byte': 96, 'html_token': False, 'start_...",https://en.wikipedia.org//w/index.php?title=Hi...,6986236841860957647,"[{'end_byte': 58952, 'end_token': 260, 'start_...",where did they film high school musical two,"[where, did, they, film, high, school, musical..."
2,"[{'annotation_id': 10138100176517733689, 'long...","<!DOCTYPE html>\n<HTML class=""client-js ve-not...",List of Nobel laureates in Physics,"[{'end_byte': 96, 'html_token': False, 'start_...",https://en.wikipedia.org//w/index.php?title=Li...,-3290814144789249484,"[{'end_byte': 36110, 'end_token': 179, 'start_...",who got the first nobel prize in physics,"[who, got, the, first, nobel, prize, in, physics]"
3,"[{'annotation_id': 2559678672569860137, 'long_...","<!DOCTYPE html>\n<HTML class=""client-js ve-not...",Alice in Wonderland (1951 film),"[{'end_byte': 97, 'html_token': False, 'start_...",https://en.wikipedia.org//w/index.php?title=Al...,5745452844331879752,"[{'end_byte': 46964, 'end_token': 178, 'start_...",who has the rights to alice in wonderland,"[who, has, the, rights, to, alice, in, wonderl..."
4,"[{'annotation_id': 927142593907478770, 'long_a...","<!DOCTYPE html>\n<HTML class=""client-js ve-not...",Deadpool 2,"[{'end_byte': 100, 'html_token': False, 'start...",https://en.wikipedia.org//w/index.php?title=De...,8851020722386421469,"[{'end_byte': 60014, 'end_token': 272, 'start_...",when is the next deadpool movie being released,"[when, is, the, next, deadpool, movie, being, ..."


In [14]:
len(nq_data)

7830

### 2. Process the NQ data into chunks for the reranker, retain the gold chunks

Set up the code for processing

In [None]:
from nltk.tokenize.treebank import TreebankWordDetokenizer
d = TreebankWordDetokenizer()

def get_hs_data_entry(row):
    long_answer = row.annotations[0]["long_answer"]
    # only retain answers with helpful passages available
    if long_answer["start_token"] != -1:
        hs_data_entry = {"nq_example_id": row.example_id,
                         "question": row.question_text+"?",
                         "answer_start_token_ix": long_answer["start_token"],
                         "answer_end_token_ix": long_answer["end_token"]}
        
        # get the document chunks
        def start_html_chunk(token):
            is_start_html_tag = token.startswith("<") and token.endswith(">") and not token.startswith("</")
            assert is_start_html_tag, f"Asked to start html chunk but received start token '{token}' (index {ix})."
            chunk = [token]
            wanted_end_tag = token.replace("<", "</")
            html_depth = 0
            return chunk, wanted_end_tag, html_depth
        # check if chunk matches answer
        def is_answer_chunk(chunks_entry):
            # can accept a shift of 1
            if chunks_entry["start_token_ix"] in {long_answer["start_token"]-1, long_answer["start_token"], long_answer["start_token"]+1}:
                # can accept a shift of 1
                if chunks_entry["end_token_ix"] in {long_answer["end_token"]-1, long_answer["end_token"], long_answer["end_token"]+1}:
                    return True, True
                elif chunks_entry["end_token_ix"] > long_answer["end_token"]:
                    return True, False
                else:
                    print(f"Error: Broken chunk. Matched start token in answers but not end token (found {chunks_entry['end_token_ix']}, wanted {long_answer['end_token']})")
                    return False, False
            # also check if gold chunk found within current chunk
            elif chunks_entry["start_token_ix"] < long_answer["start_token"] and chunks_entry["end_token_ix"] > long_answer["end_token"]:
                return True, False
            else:
                return False, False
            
        chunks = []
        wanted_end_tag = None
        start_token_ix = None
        answer_chunk_found = False
        precise_answer_chunk_found = False
        for ix, val in enumerate(row.document_tokens):
            token = val["token"]
            # potentially end chunk processing if currently in processing mode
            if start_token_ix:
                # if processing a chunk in html mode, check if the token is the end tag
                if wanted_end_tag:
                    chunk.append(token)
                    if token == wanted_end_tag:
                        # account for nested html tags
                        html_depth += -1
                        if html_depth == -1:
                            chunks_entry = {"chunk": d.detokenize(chunk),
                                            "start_token_ix": start_token_ix,
                                            "end_token_ix": ix}
                            is_gold_chunk, is_precise = is_answer_chunk(chunks_entry)
                            chunks_entry["is_gold_chunk"] = is_gold_chunk
                            if is_gold_chunk:
                                answer_chunk_found = True
                                hs_data_entry["gold_chunk_ix"] = len(chunks)
                            if is_precise:
                                precise_answer_chunk_found = True
                            chunks.append(chunks_entry)
                            start_token_ix = None
                            chunk = []
                    # if we encounter the start token again, increase depth
                    elif token == wanted_end_tag.replace("/", ""):
                        html_depth += 1
                # if we are not in html mode but detect a html token, consider this chunk done
                elif val["html_token"]:
                    chunks_entry = {"chunk": d.detokenize(chunk),
                                    "start_token_ix": start_token_ix,
                                    "end_token_ix": ix-1}
                    is_gold_chunk, is_precise = is_answer_chunk(chunks_entry)
                    chunks_entry["is_gold_chunk"] = is_gold_chunk
                    if is_gold_chunk:
                        answer_chunk_found = True
                        hs_data_entry["gold_chunk_ix"] = len(chunks)
                    if is_precise:
                        precise_answer_chunk_found = True
                    chunks.append(chunks_entry)
                    chunk, wanted_end_tag, html_depth = start_html_chunk(token)
                    start_token_ix = ix
                # if we are not in html mode and do not detect a html token, continue processing
                else:
                    chunk.append(token)
            # otherwise, start chunk processing        
            else:
                if val['html_token']:
                    chunk, wanted_end_tag, html_depth = start_html_chunk(token)
                    start_token_ix = ix
                else:
                    chunk = [token]
                    wanted_end_tag = None
                    start_token_ix = ix
        assert start_token_ix is None, "Should not be on the processing of a chunk at end of tokens"
        if not answer_chunk_found:
            print("Failed to recover the answer chunk among the processed chunks.")
            return None
        hs_data_entry["chunks"] = chunks
        hs_data_entry["gold_chunk_is_precise"] = precise_answer_chunk_found
        
        return hs_data_entry

Perform the processing

In [22]:
hs_data_entries = []
for ix, row in tqdm(nq_data.iterrows(), total=len(nq_data)):
    res = get_hs_data_entry(row)
    if res:
        hs_data_entries.append(res)
hs_data = pd.DataFrame(hs_data_entries).set_index("nq_example_id")

 10%|█         | 815/7830 [00:25<03:03, 38.17it/s]

Failed to recover the answer chunk among the processed chunks.


 15%|█▍        | 1148/7830 [00:33<02:19, 47.88it/s]

Failed to recover the answer chunk among the processed chunks.


 22%|██▏       | 1725/7830 [00:50<01:42, 59.75it/s]

Failed to recover the answer chunk among the processed chunks.


 30%|███       | 2380/7830 [01:09<02:35, 34.94it/s]

Failed to recover the answer chunk among the processed chunks.


 52%|█████▏    | 4080/7830 [01:51<01:36, 38.83it/s]

Failed to recover the answer chunk among the processed chunks.


 58%|█████▊    | 4535/7830 [02:04<02:04, 26.52it/s]

Failed to recover the answer chunk among the processed chunks.


 70%|███████   | 5490/7830 [02:32<01:21, 28.58it/s]

Failed to recover the answer chunk among the processed chunks.


 75%|███████▌  | 5897/7830 [02:42<00:38, 50.48it/s]

Failed to recover the answer chunk among the processed chunks.


 86%|████████▌ | 6722/7830 [03:04<00:19, 57.86it/s]

Failed to recover the answer chunk among the processed chunks.


 86%|████████▋ | 6761/7830 [03:05<00:18, 59.35it/s]

Failed to recover the answer chunk among the processed chunks.


 92%|█████████▏| 7176/7830 [03:15<00:17, 37.22it/s]

Failed to recover the answer chunk among the processed chunks.


 97%|█████████▋| 7585/7830 [03:27<00:05, 42.66it/s]

Error: Broken chunk. Matched start token in answers but not end token (found 309, wanted 432)


100%|█████████▉| 7803/7830 [03:34<00:00, 33.94it/s]

Failed to recover the answer chunk among the processed chunks.


100%|██████████| 7830/7830 [03:34<00:00, 36.46it/s]


In [23]:
hs_data.gold_chunk_is_precise.value_counts()

gold_chunk_is_precise
True     3595
False     164
Name: count, dtype: int64

In the end, only 3759 samples remain. Most likely due to missing or imperfect chunk annotations.

### Edit the passages to only include up to and including the gold passage

We can only guarantee the dataset quality by including the passages up to and including the gold passage for each entry, since the NQ annotators only were asked to identify the _first_ gold passage in the corresponding Wikipedia page.

In [None]:
def trim_chunks(row):
    return row.chunks[:row.gold_chunk_ix+1]

hs_data['chunks'] = hs_data.apply(trim_chunks, axis=1)

In [33]:
hs_data.chunks.apply(len).min()

np.int64(4)

In [31]:
hs_data.chunks.apply(len).value_counts(sort=True)

chunks
5      734
4      513
6      459
7      203
15     114
      ... 
83       1
147      1
126      1
65       1
117      1
Name: count, Length: 122, dtype: int64

Save the data

In [None]:
hs_data.reset_index().to_json("data/NQ/chunks.jsonl", orient='records', lines=True)

### Get a version on NQ with prepended page titles

Used for the study of the "Prepend titles" alleviation methods described in section 3.5 of the paper.

- To each chunk, prepend the corresponding page title.
- The title is always given by the second chunk (between H1 markers).

In [None]:
data = pd.read_json("data/NQ/chunks.jsonl", lines=True)

In [55]:
data.iloc[0].chunks

[{'chunk': 'of Nobel laureates in Physics - wikipedia',
  'start_token_ix': 1,
  'end_token_ix': 7,
  'is_gold_chunk': False},
 {'chunk': '<H1> List of Nobel laureates in Physics </H1>',
  'start_token_ix': 8,
  'end_token_ix': 15,
  'is_gold_chunk': False},
 {'chunk': 'Jump to: navigation, search Front side (obverse) of the Nobel Prize Medal for Physics presented to Edward Victor Appleton in 1947',
  'start_token_ix': 16,
  'end_token_ix': 40,
  'is_gold_chunk': False},
 {'chunk': "<P> The Nobel Prize in Physics (Swedish: Nobelpriset i fysik) is awarded annually by the Royal Swedish Academy of Sciences to scientists in the various fields of physics . It is one of the five Nobel Prizes established by the 1895 will of Alfred Nobel (who died in 1896), awarded for outstanding contributions in physics . As dictated by Nobel's will, the award is administered by the Nobel Foundation and awarded by a committee that consists of five members elected by the Royal Swedish Academy of Sciences . Th

In [56]:
def get_chunks_with_title(row):
    title = row.chunks[1]["chunk"]
    assert "<H1>" in title, f"Title does not contain <H1> tag: {title}"
    title = title.replace("<H1> ", "").replace(" </H1>", "")+" - Wikipedia: "
    new_chunks = []
    for chunk in row.chunks:
        chunk_entry = chunk.copy()
        chunk_entry["chunk"] = title+chunk["chunk"]
        new_chunks.append(chunk_entry)
    return new_chunks

data['chunks'] = data.apply(get_chunks_with_title, axis=1)

In [58]:
data.chunks.iloc[21]

[{'chunk': 'Lithium - Wikipedia: - wikipedia',
  'start_token_ix': 1,
  'end_token_ix': 2,
  'is_gold_chunk': False},
 {'chunk': 'Lithium - Wikipedia: <H1> Lithium </H1>',
  'start_token_ix': 3,
  'end_token_ix': 5,
  'is_gold_chunk': False},
 {'chunk': 'Lithium - Wikipedia: Jump to: navigation, search This article is about the chemical element . For the use of lithium as a medication, see Lithium (medication). For other uses, see Lithium (disambiguation).',
  'start_token_ix': 6,
  'end_token_ix': 44,
  'is_gold_chunk': False},
 {'chunk': 'Lithium - Wikipedia: <Table> <Tr> <Td> </Td> <Td> It has been suggested that Lithium as an investment be merged into this article . (Discuss) Proposed since June 2017 . </Td> </Tr> </Table>',
  'start_token_ix': 45,
  'end_token_ix': 75,
  'is_gold_chunk': False},
 {'chunk': 'Lithium - Wikipedia: <Table> Lithium, Li <Tr> <Td colspan="2"> Lithium floating in oil </Td> </Tr> <Tr> <Th colspan="2"> General properties </Th> </Tr> <Tr> <Th> Pronunciation 

In [None]:
data.to_json("data/NQ/chunks_w_titles.jsonl", orient='records', lines=True)

### Get a version with prepended chunk contexts (alleviation method)

Used for the study of the "Incorporate context" alleviation methods described in section 3.5 of the paper.

- To each chunk, prepend a chunk contextualization generated by GPT-4o mini. 
- The chunk contextualization is performed in a separate step, for which the results have been saved to "data/NQ/nq_chunks_w_context_part_<num>.jsonl".

In [None]:
# the chunk data has been stored in 5 files
data_file_template = "data/NQ/nq_chunks_w_context_part_<num>.jsonl"
d_list = []
for n in range(1, 6):
    d_list.append(pd.read_json(data_file_template.replace("<num>", str(n)), lines=True))
data = pd.concat(d_list)

In [114]:
data.columns

Index(['nq_example_id', 'question', 'answer_start_token_ix',
       'answer_end_token_ix', 'gold_chunk_ix', 'chunks',
       'gold_chunk_is_precise', 'contextualized_chunk'],
      dtype='object')

Check for None values in contextualized info
- Most values (3,753) correspond to complete contexts.
- Only 6 correspond to missing contexts, of which one corresponds to MANY.

In [154]:
num_none = data.contextualized_chunk.apply(lambda c_list: sum([1 for c in c_list if c is None]))
num_none.value_counts()

contextualized_chunk
0     3753
1        2
8        2
6        1
32       1
Name: count, dtype: int64

In [155]:
print(data[num_none>0].index)

Index([666, 451, 447, 721, 39, 148], dtype='int64')


In [125]:
def get_contextualized_chunk(row):
    new_chunks = []
    for ix, chunk in enumerate(row.chunks):
        chunk_entry = chunk.copy()
        if row.contextualized_chunk[ix]:
            # only update the chunk with context if it exists
            chunk_entry["chunk"] = row.contextualized_chunk[ix] + " " + chunk_entry["chunk"]
        new_chunks.append(chunk_entry)
    return new_chunks

data['chunks'] = data.apply(get_contextualized_chunk, axis=1)

In [132]:
print(data.chunks.iloc[2][11]["chunk"])

The document discusses the geography of Nigeria, highlighting its climate types. This chunk specifically addresses the tropical monsoon climate ('Am') found in southern Nigeria, its characteristics, and the influence of monsoons on rainfall patterns in the region. <P> The tropical monsoon climate, designated by the Köppen climate classification as "Am", is found in the southern part of the country . This climate is influenced by the monsoons originating from the South Atlantic ocean, which is brought into the country by the (maritime tropical) MT airmass, a warm moist sea to land seasonal wind . Its warmth and high humidity gives it a strong tendency to ascend and produce copious rainfall, which is a result of the condensation of water vapour in the rapidly rising air . </P>


In [None]:
data.drop(columns='contextualized_chunk').to_json("data/NQ/chunks_w_contexts.jsonl", orient='records', lines=True)

## LitQA2

The Language Agent Biology Benchmark, or LAB-Bench, is an evaluation [dataset](https://huggingface.co/datasets/futurehouse/lab-bench) for AI systems intended to benchmark capabilities foundational to scientific research in biology. We focus on the LitQA2 task of extracting information from the scientific literature. It is based on 200 samples for which it is not sufficient to simply fetch information from the abstract of a paper.

A sample in LitQA2 contains a question, a key passage containing the correct answer, the link to the source of the key passage and potential distractors.

**Helpful** context can be detected using the key passages. **Similar** context can be detected using e.g. Jaccard similarity and BERTScore.

### Gather the data

Thanks to an improved paper retrieval, we manage to retrieve 125 out of 200 samples (we previously only managed to collect 88). We here struggled with many of the research papers of the LitQA2 samples being paywalled.

Start by downloading the LitQA2 samples from [here](https://github.com/Future-House/LAB-Bench/blob/main/LitQA2/litqa-v2-public.jsonl) to `data/LitQA2/litqa-v2-public.jsonl`.

Load and clean the samples

In [None]:
def clean_jsonl(input_path: str) -> list:
    cleaned_data = []
    with open(input_path, 'r', encoding='utf-8') as f:
        for line in f:
            data = json.loads(line)
            if 'canary' in data:
                del data['canary']
            ideal = data.get('ideal')
            distractors = data.get('distractors', [])
            option_answers = {chr(65 + i): (ideal if i == 0 else distractors[i - 1]) for i in range(len(distractors) + 1)}
            option_answers[chr(65 + len(option_answers.keys()))] = "I don't have enough evidence"
            data['option_answers'] = option_answers
            cleaned_data.append(data)
    return cleaned_data

# Example usage:
jsonl_path = "data/LitQA2/litqa-v2-public.jsonl"
cleaned_data = clean_jsonl(jsonl_path)
cleaned_data

[{'id': 'e3b5a4af-41d9-48db-becf-29a08d0ad28e',
  'tag': 'litqa',
  'version': '1.1-dev',
  'question': 'Acinetobacter lwoffii has been evolved in the lab to be resistant to which of these antibiotics?',
  'ideal': 'ciproflaxin',
  'distractors': ['meropenem', 'gentamicin', 'ampicillin'],
  'sources': ['https://doi.org/10.1128/msphere.00109-24'],
  'key-passage': 'A. baumannii readily evolved resistance to meropenem, ciprofloxacin, and gentamicin, but A. lwoffii only evolved resistance to ciprofloxacin',
  'is_opensource': True,
  'option_answers': {'A': 'ciproflaxin',
   'B': 'meropenem',
   'C': 'gentamicin',
   'D': 'ampicillin',
   'E': "I don't have enough evidence"}},
 {'id': '76184ccf-4bf0-469e-a442-11d04b4ff8b0',
  'tag': 'litqa',
  'version': '1.1-dev',
  'question': 'Active olfactory receptor genes increase their contacts with greek island regions by what factor in mouse olfactory neurons?',
  'ideal': '2.7 fold',
  'distractors': ['1.7 fold', '27 fold', '3.0 fold', '2.0 fold

In [None]:
dois_to_convert = [file["sources"] for file in cleaned_data]
len(dois_to_convert)

199

#### Transform doi to pmid

Easier for fetching from pubmed.

In [None]:
def doi_to_pmid(doi_list):
    """
    Convert a list of DOIs to PMIDs using the NCBI E-utilities API.
    """
    pmid_list = []
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    for doi in doi_list:
        params = {
            'db': 'pubmed',
            'term': doi,
            'retmode': 'json'
        }
        response = requests.get(base_url, params=params)
        if response.status_code == 200:
            data = response.json()
            pmid = data.get('esearchresult', {}).get('idlist', [])
            if pmid:
                pmid_list.append("https://pubmed.ncbi.nlm.nih.gov/" + pmid[0])
            else:
                pmid_list.append(None)
        else:
            pmid_list.append(None)
    return pmid_list


In [None]:
# It takes around 90 secs..
converted_doi = doi_to_pmid([doi[0].replace('https://doi.org/', '') for doi in dois_to_convert])

#### Collect the paper content from BioC

In [None]:
async def fetch_paper(session, paper_id: str) -> str:
    """
        paper_id: Can be either a pmid id (such as 17299597) or PMC id (such as PMC7885820)
    """
    if paper_id is None:
        return None

    url = f"https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/pmcoa.cgi/BioC_xml/{paper_id}/unicode"
    
    try:
        async with session.get(url) as response:
            response.raise_for_status()
            return await response.text()
    except aiohttp.ClientError as http_err:
        print(f"Failed to fetch the content from fetch_paper with {paper_id}. Error: {http_err}")
        return None

async def gather_papers_from_bioc(paper_ids: list) -> list:
    """
    Fetch multiple papers using BioC asynchronously.

    args:
        paper_ids: A list of pmid or PMC ids.
    """
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_paper(session, paper_id) for paper_id in paper_ids]
        return await asyncio.gather(*tasks)



In [None]:
result = await gather_papers_from_bioc([paper.replace('https://pubmed.ncbi.nlm.nih.gov/', '') if paper else None for paper in converted_doi ])

In [None]:
result[:10]

['<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE collection SYSTEM "BioC.dtd"><collection><source>BioC-API</source><date>20250115</date><key>collection.key</key><document><id>PMC11237425</id><infon key="license">CC BY</infon><passage><infon key="article-id_doi">10.1128/msphere.00109-24</infon><infon key="article-id_pmc">PMC11237425</infon><infon key="article-id_pmid">38578105</infon><infon key="article-id_publisher-id">00109-24 msphere.00109-24</infon><infon key="elocation-id">e00109-24</infon><infon key="issue">5</infon><infon key="kwd">antibiotic resistance virulence efflux pumps Gram-negative bacteria drug-resistance evolution</infon><infon key="license">This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International license.</infon><infon key="name_0">surname:Darby;given-names:Elizabeth M.</infon><infon key="name_1">surname:Moran;given-names:Robert A.</infon><infon key="name_10">surname:Blair;given-names:Jessica M. A.</infon><infon 

In [None]:
print(result[198])

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE collection SYSTEM "BioC.dtd"><collection><source>BioC-API</source><date>20250115</date><key>collection.key</key><document><id>8944923</id><infon key="license">author_manuscript</infon><passage><infon key="article-id_doi">10.1126/science.abk2432</infon><infon key="article-id_manuscript">NIHMS1788050</infon><infon key="article-id_pmc">8944923</infon><infon key="article-id_pmid">35239393</infon><infon key="fpage">eabk2432</infon><infon key="issue">6584</infon><infon key="license">
          This file is available for text mining. It may also be used consistent with the principles of fair use under the copyright law.
        </infon><infon key="lpage">eabk2432</infon><infon key="name_0">surname:Li;given-names:Hongjie</infon><infon key="name_1">surname:Janssens;given-names:Jasper</infon><infon key="name_10">surname:Leskovec;given-names:Jure</infon><infon key="name_11">surname:McLaughlin;given-names:Colleen N.</infon><infon key="name_12">surnam

#### Process the retrieved info

In [None]:
def extract_article_text(xml_strings):
    extracted_texts = []
    for xml_string in xml_strings:
       if xml_string is not None and "No record can be found" not in xml_string:
          try:
              root = ET.fromstring(xml_string)
              article_texts = []
              for passage in root.findall('.//passage'):
                  text_element = passage.find('text')
                  if text_element is not None and passage.find('.//infon[@key="type"]') is None:
                      article_texts.append(text_element.text)
                  elif text_element is not None:
                     type_infon = passage.find('.//infon[@key="type"]')
                     if type_infon is not None and type_infon.text not in ['abstract', 'abstract_title_1','front','table','table_caption','table_footnote','fig_caption','fig_title_caption']:
                        article_texts.append(text_element.text)
              extracted_texts.append("\n".join(article_texts))
          except ET.ParseError:
              extracted_texts.append(None)
       else:
           extracted_texts.append(None)
    return extracted_texts

full_texts = extract_article_text(result)

In [None]:
def extract_article_title(xml_strings):
    extracted_titles = []
    for xml_string in xml_strings:
       if xml_string is not None and "No record can be found" not in xml_string:
          try:
              root = ET.fromstring(xml_string)
              # get first passage
              title_passage = root.find('.//passage')
              # check if it is a front section
              type_infon = passage.find('.//infon[@key="type"]')
              assert type_infon is not None and type_infon.text == 'front', "No front section found."
              title = title_passage.find('text')
              if title is not None:
                 extracted_titles.append(title.text)
              else:
                 extracted_titles.append(None)
          except ET.ParseError:
              extracted_titles.append(None)
       else:
           extracted_titles.append(None)
    return extracted_titles

article_titles = extract_article_title(result)

In [None]:
df = pd.DataFrame(cleaned_data)
df = pd.concat([
    df,
    pd.DataFrame(converted_doi, columns=["pubmed"]),
    pd.DataFrame(full_texts, columns=["full texts"]),
    pd.DataFrame(article_titles, columns=["title"])],
    axis = 1)

Lacking full texts for 63 samples out of 199

In [None]:
df[df[['full texts', 'title']].isna().any(axis=1)]

Unnamed: 0,id,tag,version,question,ideal,distractors,sources,key-passage,is_opensource,option_answers,pubmed,full texts,title
3,27234279-f50c-4cfc-86e2-af68364a8f94,litqa,1.1-dev,Approximately how many unique transposable ele...,12300,"[4, 900, 39, 400, 2, 750, 2, 850]",[https://doi.org/10.1101/2022.03.19.484946],"As a result, the SWO, mandarin, and pummelo ac...",False,"{'A': '12,300', 'B': '4', 'C': '900', 'D': '39...",https://pubmed.ncbi.nlm.nih.gov/39911448,,Intranasal Insulin Diminishes Postoperative De...
8,c9bdb9b5-28c1-44da-93b6-b1fc9d8bf369,litqa,1.1-dev,Based on experiments done in cohorts of mice i...,evening,"[morning, dawn, middle of the night]",[https://doi.org/10.1016/j.cell.2024.04.015],We explored whether the number of leukocytes i...,True,"{'A': 'evening', 'B': 'morning', 'C': 'dawn', ...",https://pubmed.ncbi.nlm.nih.gov/38723627,,
22,1e5f5199-84f4-4133-ab87-2372fa6ca722,litqa,1.1-dev,Does the protein encoded by TDRD1 have symmetr...,No dimethylarginine,"[Symmetrical dimethylarginine modifications, A...",[https://doi.org/10.1261/rna.034181.112],"Thus, we conclude that the TD1 domain of TDRD1...",True,"{'A': 'No dimethylarginine', 'B': 'Symmetrical...",https://pubmed.ncbi.nlm.nih.gov/22996915,,
23,d2860d38-dfb9-4b80-905f-812c32573915,litqa,1.1-dev,Engineering of which human skin bacteria can s...,S. epidermis,"[cultibarium acnes, staphylococcus aureus]",[https://doi.org/10.1126/science.abp9563],"Thus, under physiologic conditions of coloniza...",False,"{'A': 'S. epidermis', 'B': 'cultibarium acnes'...",https://pubmed.ncbi.nlm.nih.gov/37053311,,
26,d65103ae-c881-4116-a0a7-1b233eb6275a,litqa,1.1-dev,For which of the following genes does regulati...,gp02,"[rpoS, cyaA, petA]",[https://doi.org/10.1016/j.chom.2024.03.010],"The rpoS, cyaA, and petA genes are located on ...",True,"{'A': 'gp02', 'B': 'rpoS', 'C': 'cyaA', 'D': '...",https://pubmed.ncbi.nlm.nih.gov/38579715,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
185,bca1be77-208b-4d57-ac29-05aa6d58bdbf,litqa,1.1-dev,Which of the following transcriptions factors ...,p50,"[SMAD5, TBX21, SOX15, MYCN, JUN, HIF1A]",[https://doi.org/10.1016/j.ccell.2024.03.013],"Disruption of p50 (encoded by NFKB1), but not ...",True,"{'A': 'p50', 'B': 'SMAD5', 'C': 'TBX21', 'D': ...",https://pubmed.ncbi.nlm.nih.gov/38640932,,
186,2dc20a2f-de54-4bfe-a34f-1ba395f342cf,litqa,1.1-dev,Which of the following viscoelastic properties...,The nucleus is more stiff than the cytoplasm.,[The nucleus is less stiff than the cytoplasm....,[https://doi.org/10.1016/j.bpj.2024.03.034],Our results show that distinct nuclear and cyt...,True,{'A': 'The nucleus is more stiff than the cyto...,https://pubmed.ncbi.nlm.nih.gov/38544410,,
192,4949fc05-5f78-4dd0-8d9c-3dbb6004740a,litqa,1.1-dev,Which pathway(s) is primarily responsible for ...,eIF4E-eIF4G,"[IRES-Dependent, Cap-Independent Translational...",[https://doi.org/10.1038/s41587-024-02174-7],4EGI-1 significantly decreased the translation...,False,"{'A': 'eIF4E-eIF4G', 'B': 'IRES-Dependent', 'C...",https://pubmed.ncbi.nlm.nih.gov/38519719,,
193,c7b36a1c-ea80-4ae5-b2f1-3932cb16c3cf,litqa,1.1-dev,Which poly(A) oligo(s) demonstrates highest re...,modified branch poly(A),"[unmodified linear poly(A), modified linear po...",[https://doi.org/10.1038/s41587-024-02174-7],"Finally, adding chemical stabilization to all ...",False,"{'A': 'modified branch poly(A)', 'B': 'unmodif...",https://pubmed.ncbi.nlm.nih.gov/38519719,,


#### Locate the samples for which we can find the gold passage in the corresponding research paper

Seemingly, some of the research papers have been updated after the LitQA2 data collection, meaning that the previously annotated gold passages (key passages) may not be possible to recover for all samples. Samples for which we cannot recover the gold passage in the document have to be dropped.

We perform a fuzzy search to identify key passage-full text matches.

In [None]:
# A threshold of 80 seems applicable. Values close to 50 are very dissimilar and the key passage cannot be identified in the full text for these.
FUZZY_SIM_THRESHOLD = 80

In [None]:
df['key_passage_match_ratio'] = df.apply(lambda row: fuzz.partial_ratio(row['key-passage'], row['full texts']), axis=1)
df['fuzzy_passage_in'] = df.key_passage_match_ratio.apply(lambda val: val>=FUZZY_SIM_THRESHOLD)
sum(df['fuzzy_passage_in'])

125

125 samples survived

### Format the samples to get chunks that can be re-ranked

In [None]:
# format the data and store it for script processing
df = df.rename(columns={'full texts': 'content'})
df[df['fuzzy_passage_in']].to_json('data/LitQA2/tmp_full_text.jsonl', orient='records', lines=True)

We use a script to format crawled web content into retrievable passages. For this, run the following code:

```bash
SAVE_FOLDER=data/LitQA2
python -m src.get_paragraphs_from_crawl \
        --crawl_results_path "${SAVE_FOLDER}/tmp_full_text.jsonl" \
        --save_folder "${SAVE_FOLDER}" \
        --min_word_limit 4 \
        --max_word_limit 600
```

Load the results from the script

In [None]:
df = pd.read_json('data/LitQA2/paragraph_results.jsonl', lines=True)

Match chunk/passage with key-passage

In [None]:
def get_paragraphs_key_passage_ix(row):
    key_passage = row['key-passage']
    # the key passage length should be at least 80% of the paragraph length
    min_words = int(len(key_passage.split())*0.8)
    for ix, par in enumerate(row.paragraphs):
        # the key passage should be a substring-ish of the paragraph (or the same-ish)
        # and be a fuzzy match
        if (min_words <= len(par.split())) and (fuzz.partial_ratio(key_passage, par) >= FUZZY_SIM_THRESHOLD):
            return ix
    return None

df["gold_chunk_ix"] = df.apply(get_paragraphs_key_passage_ix, axis=1)

We only failed to recover the gold passage for one sample

In [None]:
df[df["gold_chunk_ix"].isna()]

Unnamed: 0,id,tag,version,question,ideal,distractors,sources,key-passage,is_opensource,option_answers,pubmed,content,title,passage_in,key_passage_match_ratio,fuzzy_passage_in,paragraphs,gold_chunk_ix
113,a73b2c2d-a3ac-49e6-8e2e-44b4c1e6de94,litqa,1.1-dev,Which of the following proteins has the greate...,SmD1,"[Cul1, SkpA, HspA]",[https://doi.org/10.1101/2024.05.15.594402],"Altogether, we performed three different mass ...",True,"{'A': 'SmD1', 'B': 'Cul1', 'C': 'SkpA', 'D': '...",https://pubmed.ncbi.nlm.nih.gov/38903116,"Introduction\nCellular stressors, both extrins...",Chaperoning the chaperones: Proteomic analysis...,False,99,True,"[Cellular stressors, both extrinsic and intrin...",


Store a cleaned version of the data for further processing

In [None]:
df = df[~df["gold_chunk_ix"].isna()]
df["gold_chunk_ix"] = df["gold_chunk_ix"].astype(int)
df = df.drop(columns=['pubmed', 'content', 'passage_in', 'key_passage_match_ratio', 'fuzzy_passage_in'])

In [None]:
# format for further processing
df = df.rename(columns={"paragraphs": "chunks"})
df.to_json('data_on_git/LitQA2/chunks.jsonl', orient='records', lines=True)

### Add the options to the queries
We edit the samples to include the answer options in the query. For the analysis of re-ranker performance on LitQA2, we only consider the LitQA2-options samples.

In [None]:
data = pd.read_json("data/LitQA2/chunks.jsonl", lines=True)
data.head()

Unnamed: 0,id,tag,version,question,ideal,distractors,sources,key-passage,is_opensource,option_answers,title,chunks,gold_chunk_ix
0,e3b5a4af-41d9-48db-becf-29a08d0ad28e,litqa,1.1-dev,Acinetobacter lwoffii has been evolved in the ...,ciproflaxin,"[meropenem, gentamicin, ampicillin]",[https://doi.org/10.1128/msphere.00109-24],A. baumannii readily evolved resistance to mer...,True,"{'A': 'ciproflaxin', 'B': 'meropenem', 'C': 'g...",Differential development of antibiotic resista...,"[Acinetobacter spp. are Gram-negative, soil-dw...",39
1,76184ccf-4bf0-469e-a442-11d04b4ff8b0,litqa,1.1-dev,Active olfactory receptor genes increase their...,2.7 fold,"[1.7 fold, 27 fold, 3.0 fold, 2.0 fold]",[https://doi.org/10.1038/s41586-023-06845-4],"Furthermore, total contacts made to GIs showed...",True,"{'A': '2.7 fold', 'B': '1.7 fold', 'C': '27 fo...",RNA-mediated symmetry breaking enables singula...,"[To interact with their environment, cells exp...",11
2,39129e1c-096f-4414-bf4f-37fadbbe364c,litqa,1.1-dev,Among Cas9-disrupted loci in human neural stem...,61% of gene loci,"[61% of enhancer loci, 31% of gene loci, 31% o...",[https://doi.org/10.1016/j.celrep.2024.113693],"Globally, disruptions within enhancers had com...",True,"{'A': '61% of gene loci', 'B': '61% of enhance...",Massively parallel disruption of enhancers act...,[The development of the human cerebral cortex ...,14
3,517e7cf8-c5d2-4391-9e2a-235b79d93050,litqa,1.1-dev,Approximately what percentage of Drosophila wi...,80%,"[19%, 50%, 37%, 6%, 94%]",[https://doi.org/10.1101/2024.05.03.592390],"In contrast with the results for H3.2, we foun...",True,"{'A': '80%', 'B': '19%', 'C': '50%', 'D': '37%...",Set2 and H3K36 regulate the Drosophila male X ...,[The evolution of heterogametic sexes necessit...,13
4,cb710074-73a8-4407-b0c7-7dc868f1bc76,litqa,1.1-dev,"Approximately what percentage of adr-1(-), adr...",10%,"[20%, 30%, 40%, 50%, 60%, 70%, 80%]",[https://doi.org/10.1186/s12915-024-01840-1],It is important to note that previous studies ...,True,"{'A': '10%', 'B': '20%', 'C': '30%', 'D': '40%...",ADARs regulate cuticle collagen expression and...,[Pathogen infection is a major environmental t...,6


In [61]:
question_options_prompt = "<question> <options>?"
def get_options_string(option_answers):
    num_options = len(option_answers)
    string = ""
    processed_options = 0
    for key, val in option_answers.items():
        if processed_options == num_options - 1:
            string += f"or ({key}) {val}"
        else:
            string += f"({key}) {val}, "
        processed_options += 1
    return string.strip(", ").replace("I don't have enough evidence", "not enough info")
data["question"] = data.apply(lambda row: question_options_prompt.replace("<question>", row.question).replace("<options>", get_options_string(row.option_answers)), axis=1)

Manually check some samples

In [62]:
data.question.iloc[30]

'How long do mouse neurons survive following CRISPR inactivation of HSPA5? (A) 14 days, (B) 3 days, (C) 5 days, (D) 10 days, (E) 28 days, or (F) not enough info?'

In [63]:
data[data.id=='a6622141-68d6-418f-8e30-7a5eff3d4fa8'].question.iloc[0]

'Which of the following drugs can NOT be used to inhibit binding of the viral protease dNS3S139A to the antiviral peptide CP5-46A-4D5E? (A) Telaprevir, (B) Grazoprevir, (C) Glecaprevir, (D) Danoprevir, or (E) not enough info?'

Save the data

In [None]:
data.reset_index().to_json("data/LitQA2-options/chunks.jsonl", orient='records', lines=True)

### Get a version of LitQA2 with prepended titles

In [None]:
data = pd.read_json("data/LitQA2-options/chunks.jsonl", lines=True)

In [67]:
def get_chunks_with_title(row):
    title = row.title + " - PubMed: "
    new_chunks = [title+chunk for chunk in row.chunks]
    return new_chunks

data['chunks'] = data.apply(get_chunks_with_title, axis=1)

In [None]:
data.to_json("data/LitQA2-options/chunks_w_titles.jsonl", orient='records', lines=True)

### Get a version of LitQA2 with prepended chunk contexts
- To each chunk, prepend a chunk contextualization generated by GPT-4o mini.

In [None]:
data = pd.read_json("data/LitQA2-options/litqa_chunks_w_context.jsonl", lines=True)

Check for None values in contextualized info
- All values have contexts!

In [136]:
num_none = data.contextualized_chunk.apply(lambda c_list: sum([1 for c in c_list if c is None]))
num_none.value_counts()

contextualized_chunk
0    124
Name: count, dtype: int64

In [137]:
def get_contextualized_chunk(row):
    new_chunks = []
    for ix, chunk in enumerate(row.chunks):
        new_chunk = chunk
        if row.contextualized_chunk[ix]:
            # only update the chunk with context if it exists
            new_chunk = row.contextualized_chunk[ix] + " " + chunk
        new_chunks.append(new_chunk)
    return new_chunks

data['chunks'] = data.apply(get_contextualized_chunk, axis=1)

In [None]:
data.drop(columns='contextualized_chunk').to_json("data/LitQA2-options/chunks_w_contexts.jsonl", orient='records', lines=True)

## DRUID

DRUID contains fact-checked claims with corresponding context (evidence) retrieved from the web.

In [None]:
data = pd.DataFrame(load_dataset("copenlu/druid", "DRUID")["train"])
data

Generating train split: 100%|██████████| 5490/5490 [00:00<00:00, 24209.87 examples/s]


Unnamed: 0,id,claim_id,claim_source,claim,claimant,claim_date,evidence_source,evidence,evidence_date,factcheck_verdict,is_gold,relevant,evidence_stance
0,borderlines-0_ret_b14_gn,borderlines-0,-,Abyei is a territory of South Sudan,Abyei,2024-10-09,https://reliefweb.int/report/south-sudan/dtm-s...,ABYEI OVERVIEW AND DISPLACEMENT DYNAMICS The A...,2017-03-31,South Sudan,False,True,insufficient-contradictory
1,borderlines-0_ret_b5_gn,borderlines-0,-,Abyei is a territory of South Sudan,Abyei,2024-10-09,https://peacerep.org/2024/03/05/crisis-in-abyei/,"On the border between South Sudan and Sudan, A...",2024-03-05,South Sudan,False,True,insufficient-contradictory
2,borderlines-0_ret_bn_g3,borderlines-0,-,Abyei is a territory of South Sudan,Abyei,2024-10-09,https://theconversation.com/crisis-in-abyei-so...,Abyei – a territory roughly the size of Jamaic...,2024-03-05,South Sudan,False,True,insufficient-supports
3,borderlines-1_ret_b10_gn,borderlines-1,-,Abyei is a territory of Sudan,Abyei,2024-10-09,https://peacerep.org/2024/03/05/crisis-in-abyei/,"On the border between South Sudan and Sudan, A...",2024-03-05,Sudan,False,True,insufficient-contradictory
4,borderlines-1_ret_b15_gn,borderlines-1,-,Abyei is a territory of Sudan,Abyei,2024-10-09,https://link.springer.com/chapter/10.1007/978-...,Abyei is a territory that straddles Sudan and ...,2024-03-26,Sudan,False,True,insufficient-contradictory
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5485,srilankafactcrescendo_98_ret_bn_g3,srilankafactcrescendo_98,https://srilanka.factcrescendo.com/english/pur...,Fox News Tweets that 68% of individuals who us...,Social Media Users,2024-02-12,https://www.snopes.com/fact-check/ivermectins-...,"In mid-May 2022, Twitter users shared a suppos...",2022-05-17,False,False,True,refutes
5486,srilankafactcrescendo_99,srilankafactcrescendo_99,https://srilanka.factcrescendo.com/english/sni...,CCTV Footage Of A Burglary In Sri Lanka,Social Media Users,2021-10-17,https://srilanka.factcrescendo.com/english/sni...,We verified that the video viral on social med...,2021-10-17,False,True,True,refutes
5487,srilankafactcrescendo_99_ret_b1_gn,srilankafactcrescendo_99,https://srilanka.factcrescendo.com/english/sni...,CCTV Footage Of A Burglary In Sri Lanka,Social Media Users,2021-10-17,https://www.news.com.au/world/asia/chilling-cc...,Chilling CCTV footage shows moment Sri Lankan ...,2019-05-03,False,False,True,insufficient-contradictory
5488,srilankafactcrescendo_99_ret_b4_gn,srilankafactcrescendo_99,https://srilanka.factcrescendo.com/english/sni...,CCTV Footage Of A Burglary In Sri Lanka,Social Media Users,2021-10-17,https://www.news.com.au/world/asia/cctv-captur...,Chilling CCTV inside Colombo’s Cinnamon Grand ...,2019-04-25,False,False,False,not_applicable


We map the DRUID samples to suit our dataset format.

In [74]:
wanted_claim_ids = data[data.duplicated(subset=["claim_id"], keep=False)][["claim_id"]].drop_duplicates()

sufficient_stances = ["supports", "refutes"]
insufficient_stances = ["insufficient-supports", "insufficient-neutral", "insufficient-refutes"]
eval_data = []
for claim_id in wanted_claim_ids.claim_id:
    sub_data = data[(data.claim_id==claim_id) & (data.evidence_stance.isin(sufficient_stances+insufficient_stances))]
    if any(sub_data.evidence_stance.isin(sufficient_stances)) and any(sub_data.evidence_stance.isin(insufficient_stances)):
        data_entry = {"ids": sub_data.id.tolist(),
                      "claim_id": sub_data.claim_id.iloc[0],
                      "question": sub_data.claim.iloc[0],
                      "chunk_sources": sub_data.evidence_source.tolist(),
                      "is_gold_chunk": sub_data.is_gold.tolist(),
                      "chunk_dates": sub_data.evidence_date.tolist(),
                      "chunk_stances": sub_data.evidence_stance.tolist(),
                      "chunks": sub_data.evidence.tolist(),
                      "is_helpful_chunk": [val in sufficient_stances for val in sub_data.evidence_stance]
                      }
        eval_data.append(data_entry)
eval_data = pd.DataFrame(eval_data)
eval_data

Unnamed: 0,ids,claim_id,question,chunk_sources,is_gold_chunk,chunk_dates,chunk_stances,chunks,is_helpful_chunk
0,"[borderlines-103_ret_b0_gn, borderlines-103_re...",borderlines-103,Kfar Qouq is a territory of Syria,"[https://en.wikipedia.org/wiki/Kafr_Qouq,_Syri...","[False, False]","[2022-05-19, 2011-03-17]","[supports, insufficient-refutes]",[Kafr Qouq or Kafr Qawq (Arabic: كفرقوق) is a ...,"[True, False]"
1,"[borderlines-107_ret_b0_gn, borderlines-107_re...",borderlines-107,Qaa is a territory of Lebanon,"[https://en.wikipedia.org/wiki/Qaa, https://el...","[False, False, False]","[2024-07-17, 2000-01-01, 2021-10-21]","[supports, insufficient-supports, supports]","[Qaa (Arabic: القاع), El Qaa, Al Qaa, Qaa Baal...","[True, False, True]"
2,"[borderlines-108_ret_b10_gn, borderlines-108_r...",borderlines-108,Qaa is a territory of Syria,[https://www.haaretz.com/2015-07-06/ty-article...,"[False, False]","[2015-07-06, 2000-01-01]","[supports, insufficient-neutral]",[Beyond the barren mountains separating the Le...,"[True, False]"
3,"[borderlines-113_ret_b18_gn, borderlines-113_r...",borderlines-113,Shatt al-Arab is a territory of Iraq,[https://www.worldatlas.com/articles/where-doe...,"[False, False]","[2017-04-25, 2024-10-01]","[insufficient-supports, supports]",[The Shatt al-Arab appears to have formed quit...,"[False, True]"
4,"[borderlines-118_ret_b0_g2, borderlines-118_re...",borderlines-118,Greater and Lesser Tunbs is a territory of Iran,[https://en.wikipedia.org/wiki/Greater_and_Les...,"[False, False]","[2024-06-14, 2024-05-18]","[refutes, insufficient-refutes]",[Greater Tunb and Lesser Tunb (Persian: تنب بز...,"[True, False]"
...,...,...,...,...,...,...,...,...,...
870,"[srilankafactcrescendo_91, srilankafactcrescen...",srilankafactcrescendo_91,The CID Arrested The President Of The Kelaniya...,[https://srilanka.factcrescendo.com/english/co...,"[True, False, False, False]","[2024-02-28, 2024-02-28, 2023-01-02, 2024-02-26]","[refutes, refutes, refutes, insufficient-suppo...",[The individual arrested in connection with th...,"[True, True, True, False]"
871,"[srilankafactcrescendo_92, srilankafactcrescen...",srilankafactcrescendo_92,US Ambassador Julie Chung Has Been Recalled To...,[https://srilanka.factcrescendo.com/english/us...,"[True, False, False, False, False]","[2023-07-29, 2022-02-25, 2021-08-10, 2024-04-0...","[refutes, insufficient-supports, insufficient-...",[Our investigation found that the rumors stati...,"[True, False, False, False, True]"
872,"[srilankafactcrescendo_94, srilankafactcrescen...",srilankafactcrescendo_94,New Phone Scam in Which Criminals Can Access Y...,[https://srilanka.factcrescendo.com/english/on...,"[True, False, False, False]","[2023-01-23, 2016-02-07, 2015-08-24, 2023-01-01]","[refutes, insufficient-supports, insufficient-...",[The part that these scams can copy your SIM d...,"[True, False, False, False]"
873,"[srilankafactcrescendo_95, srilankafactcrescen...",srilankafactcrescendo_95,"The Oppenheimer Actor, Cillian Murphy, Correct...",[https://srilanka.factcrescendo.com/english/ci...,"[True, False]","[2023-08-02, 2023-08-03]","[refutes, insufficient-neutral]",[A viral video of Cillian Murphy speaking to P...,"[True, False]"


In [75]:
eval_data.loc[874].question

'Bluetooth headphones emit the EMF wave that increases cancer risk !!'

In [86]:
eval_data.loc[874].chunks[1]

'Some scientists are arguing that these headphones could potentially lead increase cancer risk due to the fact they emit radio frequency (EMF) radiation when they are used. In 2015, a group of scientists around the world signed a petition directed to the United Nations and World Health Organization that expresses their concern about exposure to non-ionizing electromagnetic fields (EMF). The petition, which started circulating around the internet again in early 2019, also says that EMF has the potential to increase the risk of neurological disorders, DNA damage, cellular distress, and changes to the reproductive system. [...] Ultimately, when it comes to wireless headphones, its dangers aren’t all that clear. Despite extremely low frequency EMF being possibly carcinogenic to humans, researchers have still not observed a direct connection. Bluetooth headphones specifically have their power density exposures that is 10 to 400 times lower than those of cell phones, according to Dana-Farber

Save the data

In [None]:
eval_data.to_json("data/DRUID/chunks.jsonl", orient='records', lines=True)

### Get page titles

For the "Prepend titles" alleviation method. DRUID does not contain page titles a priori, so we have to sample them from the original webpages.

In [None]:
data = pd.read_json("data/DRUID/chunks.jsonl", lines=True)

Set up and test the page title retrieval

In [89]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.currentaffairs.org/news/2023/07/how-aoc-went-from-influencer-to-influenced'
# making requests instance
reqs = requests.get(url)

soup = BeautifulSoup(reqs.text, 'html.parser')
# displaying the title
print("Title of the website is : ")
for title in soup.find_all('title'):
    print(title.get_text())

Title of the website is : 
How AOC Went From Influencer to Influenced


Collect the page titles

In [None]:
def get_chunks_with_title(row):
    new_chunks = []
    for ix, chunk in enumerate(row.chunks):
        try:
            soup = BeautifulSoup(requests.get(row.chunk_sources[ix], timeout=30).text, 'html.parser')
            title = soup.find('title')
            if title:
                title = title.get_text()
                new_chunks.append(title + ": " + chunk)
            else:
                print(f"No title found for chunk with url {row.chunk_sources[ix]}.")
                new_chunks.append(chunk)
        except Exception as e:
            print(f"Failed to get title for chunk with url {row.chunk_sources[ix]}.")
            print(f"Error: {e}")
            new_chunks.append(chunk)
    return new_chunks

data['chunks'] = data.progress_apply(get_chunks_with_title, axis=1)

In [96]:
num_chunks = data.chunks.apply(len)
sum(num_chunks)

3454

Lacking titles for ~226 of 3454 chunks.
- srilanka factcrescendo seems to be a main problem
- looking at some of the titles, there also seem to be quality issues at points because of failed scraping etc.

In [None]:
data.to_json("data/DRUID/chunks_w_titles.jsonl", orient='records', lines=True)

### Better prompt investigation

For DRUID, we also investigate the effect of adapting the prompt to the setting (fact-checking in this case). We create a DRUID version (DRUID-q) with a more advanced question formulation that potentially will work better for the rerankers.

In [41]:
from datasets import load_dataset

data = pd.DataFrame(load_dataset("copenlu/druid", "DRUID")["train"])

Generating train split: 100%|██████████| 5490/5490 [00:00<00:00, 14744.43 examples/s]


In [42]:
def get_question(row):
    QUESTION_TEMPLATE = "Is the following claim accurate?\nClaimant: <claimant>\nClaim: <claim>"
    return QUESTION_TEMPLATE.replace('<claimant>', row.claimant).replace('<claim>', row.claim)

wanted_claim_ids = data[data.duplicated(subset=["claim_id"], keep=False)][["claim_id"]].drop_duplicates()

sufficient_stances = ["supports", "refutes"]
insufficient_stances = ["insufficient-supports", "insufficient-neutral", "insufficient-refutes"]
eval_data = []
for claim_id in wanted_claim_ids.claim_id:
    sub_data = data[(data.claim_id==claim_id) & (data.evidence_stance.isin(sufficient_stances+insufficient_stances))]
    if any(sub_data.evidence_stance.isin(sufficient_stances)) and any(sub_data.evidence_stance.isin(insufficient_stances)):
        data_entry = {"ids": sub_data.id.tolist(),
                      "claim_id": sub_data.claim_id.iloc[0],
                      "question": get_question(sub_data.iloc[0]),
                      "chunk_sources": sub_data.evidence_source.tolist(),
                      "is_gold_chunk": sub_data.is_gold.tolist(),
                      "chunk_dates": sub_data.evidence_date.tolist(),
                      "chunk_stances": sub_data.evidence_stance.tolist(),
                      "chunks": sub_data.evidence.tolist(),
                      "is_helpful_chunk": [val in sufficient_stances for val in sub_data.evidence_stance]
                      }
        eval_data.append(data_entry)
eval_data = pd.DataFrame(eval_data)
eval_data

Unnamed: 0,ids,claim_id,question,chunk_sources,is_gold_chunk,chunk_dates,chunk_stances,chunks,is_helpful_chunk
0,"[borderlines-103_ret_b0_gn, borderlines-103_re...",borderlines-103,Is the following claim accurate?\nClaimant: Kf...,"[https://en.wikipedia.org/wiki/Kafr_Qouq,_Syri...","[False, False]","[2022-05-19, 2011-03-17]","[supports, insufficient-refutes]",[Kafr Qouq or Kafr Qawq (Arabic: كفرقوق) is a ...,"[True, False]"
1,"[borderlines-107_ret_b0_gn, borderlines-107_re...",borderlines-107,Is the following claim accurate?\nClaimant: Qa...,"[https://en.wikipedia.org/wiki/Qaa, https://el...","[False, False, False]","[2024-07-17, 2000-01-01, 2021-10-21]","[supports, insufficient-supports, supports]","[Qaa (Arabic: القاع), El Qaa, Al Qaa, Qaa Baal...","[True, False, True]"
2,"[borderlines-108_ret_b10_gn, borderlines-108_r...",borderlines-108,Is the following claim accurate?\nClaimant: Qa...,[https://www.haaretz.com/2015-07-06/ty-article...,"[False, False]","[2015-07-06, 2000-01-01]","[supports, insufficient-neutral]",[Beyond the barren mountains separating the Le...,"[True, False]"
3,"[borderlines-113_ret_b18_gn, borderlines-113_r...",borderlines-113,Is the following claim accurate?\nClaimant: Sh...,[https://www.worldatlas.com/articles/where-doe...,"[False, False]","[2017-04-25, 2024-10-01]","[insufficient-supports, supports]",[The Shatt al-Arab appears to have formed quit...,"[False, True]"
4,"[borderlines-118_ret_b0_g2, borderlines-118_re...",borderlines-118,Is the following claim accurate?\nClaimant: Gr...,[https://en.wikipedia.org/wiki/Greater_and_Les...,"[False, False]","[2024-06-14, 2024-05-18]","[refutes, insufficient-refutes]",[Greater Tunb and Lesser Tunb (Persian: تنب بز...,"[True, False]"
...,...,...,...,...,...,...,...,...,...
870,"[srilankafactcrescendo_91, srilankafactcrescen...",srilankafactcrescendo_91,Is the following claim accurate?\nClaimant: Da...,[https://srilanka.factcrescendo.com/english/co...,"[True, False, False, False]","[2024-02-28, 2024-02-28, 2023-01-02, 2024-02-26]","[refutes, refutes, refutes, insufficient-suppo...",[The individual arrested in connection with th...,"[True, True, True, False]"
871,"[srilankafactcrescendo_92, srilankafactcrescen...",srilankafactcrescendo_92,Is the following claim accurate?\nClaimant: So...,[https://srilanka.factcrescendo.com/english/us...,"[True, False, False, False, False]","[2023-07-29, 2022-02-25, 2021-08-10, 2024-04-0...","[refutes, insufficient-supports, insufficient-...",[Our investigation found that the rumors stati...,"[True, False, False, False, True]"
872,"[srilankafactcrescendo_94, srilankafactcrescen...",srilankafactcrescendo_94,Is the following claim accurate?\nClaimant: So...,[https://srilanka.factcrescendo.com/english/on...,"[True, False, False, False]","[2023-01-23, 2016-02-07, 2015-08-24, 2023-01-01]","[refutes, insufficient-supports, insufficient-...",[The part that these scams can copy your SIM d...,"[True, False, False, False]"
873,"[srilankafactcrescendo_95, srilankafactcrescen...",srilankafactcrescendo_95,Is the following claim accurate?\nClaimant: So...,[https://srilanka.factcrescendo.com/english/ci...,"[True, False]","[2023-08-02, 2023-08-03]","[refutes, insufficient-neutral]",[A viral video of Cillian Murphy speaking to P...,"[True, False]"


In [44]:
print(eval_data.loc[874].question)

Is the following claim accurate?
Claimant: Social Media Users
Claim: Bluetooth headphones emit the EMF wave that increases cancer risk !!


In [45]:
eval_data.loc[874].chunks[1]

'Some scientists are arguing that these headphones could potentially lead increase cancer risk due to the fact they emit radio frequency (EMF) radiation when they are used. In 2015, a group of scientists around the world signed a petition directed to the United Nations and World Health Organization that expresses their concern about exposure to non-ionizing electromagnetic fields (EMF). The petition, which started circulating around the internet again in early 2019, also says that EMF has the potential to increase the risk of neurological disorders, DNA damage, cellular distress, and changes to the reproductive system. [...] Ultimately, when it comes to wireless headphones, its dangers aren’t all that clear. Despite extremely low frequency EMF being possibly carcinogenic to humans, researchers have still not observed a direct connection. Bluetooth headphones specifically have their power density exposures that is 10 to 400 times lower than those of cell phones, according to Dana-Farber

Save the data

In [None]:
eval_data.to_json("data/DRUID-q/chunks.jsonl", orient='records', lines=True)