# **Data processing**

#### **Last update: 11th June, 2025**
<br>

This notebook retrieves the data from a number of systematic reviews on intervention studies by conducting the following steps:

**Step 1: Download the files with PubMed IDs below and place them in the "data/raw/" folder.** <br>
In this notebook, some of the intervention review datasets from the CLEF challenge are processed for usage in semi-automated screening simulations. The datasets (which can also be manually downloaded here: https://github.com/CLEF-TAR/tar/tree/master/2019-TAR/Task2 found under Training -> Intervention -> topics or Testing -> Intervention -> topics) are called:

- CD011768 (corresponds to intervention review 1)
- CD008170 (corresponds to intervention review 2)
- CD010558 (corresponds to intervention review 3)
- CD006468 (corresponds to intervention review 4)
- CD010038 (corresponds to intervention review 5)
- CD005139 (corresponds to intervention review 6)
- CD008201 (corresponds to intervention review A1)

**Step 2: Download the files with title-abstract inclusion labels below and place them in the "data/meta_data/clef_qrels/" folder.** <br>
The corresponding labels for title-abstract as well as full-text level inclusions are stored in separate files (which can also be manually downloaded here: https://github.com/CLEF-TAR/tar/tree/master/2019-TAR/Task2 found under Training -> Intervention -> qrels or Testing -> Intervention -> qrels) are called:

- full.train.int.abs.2019.qrels
- full.test.intervention.abs.2019.qrels

**Step 3: Process the files for compatibility with semi-automated screening simulations.** <br>
Using the downloaded raw files, for each review: <br>

1. the PubMed IDs are retrieved, <br>
2. from the PubMed IDs the titles and abstracts are retrieved, and <br>
3. the title-abstract labels are imported and added to each dataset which is saved as .csv in the "data/processed/" folder.

**Note**: The resulting datasets may slightly vary depending on the date the Pubmed IDs are being retrieved with this notebook, as PubMed studies can be deleted (or added) over time. 

------------------------------------------------------------------------------------------------------------------------------------------


In [1]:
import os
import csv
import time
import requests
import pandas as pd
import xml.etree.ElementTree as ET
from tqdm import tqdm
import glob
import warnings



In [2]:
# Check if the current pathway is in the main folder
os.chdir("..") 
current_path = os.getcwd()
print(current_path)

/Users/ispiero2/TAR-abstracts_testing


#### **Step 1: Download the files below and place them in the "data/raw/" folder.**

In [3]:
# Indicate the GitHub urls where the systematic review data can be downloaded
urls_reviews_dic = {
    'CD011768' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Testing/Intervention/topics/CD011768",
    'CD008170' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Training/Intervention/topics/CD008170",
    'CD010558' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Testing/Intervention/topics/CD010558",
    'CD006468' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Testing/Intervention/topics/CD006468",
    'CD010038' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Testing/Intervention/topics/CD010038",
    'CD005139' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Training/Intervention/topics/CD005139",
    'CD008201' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Training/Intervention/topics/CD008201"
}

In [4]:
# Path to store the raw review data
path_data_raw = "data/raw/"  

# Download the data for each review using the respective url
for key, value in urls_reviews_dic.items():
    
    response = requests.get(value)

    path = path_data_raw + key + ".txt"
    if response.status_code == 200:
        with open(path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded successfully to {path}")
    else:
        print(f"Failed to download. Status code: {response.status_code}")

Downloaded successfully to data/raw/CD011768.txt
Downloaded successfully to data/raw/CD008170.txt
Downloaded successfully to data/raw/CD010558.txt
Downloaded successfully to data/raw/CD006468.txt
Downloaded successfully to data/raw/CD010038.txt
Downloaded successfully to data/raw/CD005139.txt
Downloaded successfully to data/raw/CD008201.txt


#### **Step 2: Download the files below and place them in the "data/meta_data/clef_qrels/" folder.**

In [5]:
# Indicate the GitHub urls where the qrels (title-abstract labels) can be downloaded
urls_qrels_dic = {
    'train' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Training/Intervention/qrels/full.train.int.abs.2019.qrels",
    'test' : "https://raw.githubusercontent.com/CLEF-TAR/tar/master/2019-TAR/Task2/Testing/Intervention/qrels/full.test.intervention.abs.2019.qrels"
}

In [6]:
# Path to store the raw qrels data
path_meta_data = "data/meta_data/clef_qrels/"

# Download the data for each qrels file using the respective url
for key, value in urls_qrels_dic.items():
    
    response = requests.get(value)

    path = path_meta_data + key + ".qrels"
    if response.status_code == 200:
        with open(path, "wb") as f:
            f.write(response.content)
        print(f"Downloaded successfully to {path}")
    else:
        print(f"Failed to download. Status code: {response.status_code}")

Downloaded successfully to data/meta_data/clef_qrels/train.qrels
Downloaded successfully to data/meta_data/clef_qrels/test.qrels


#### **Step 3: Process the files**

##### **1. From .txt extract PMIDs and save as .csv file**

In [7]:
# Path where the raw review data are stored
path_data_raw = "data/raw/" 
# Path to store the converted review data
path_data_converted = "data/converted"

format = "TREC"  # Change to "TOP" if needed

# Retrieve the lines in the .txt that are PMIDs and store as .csv
def process_file(filename, format, path_data_converted):
    """
    Processes a single .txt file, extracts PMIDs, and saves results as a CSV file.

    :param filename: Path to the .txt file
    :param format: Output format ('TREC' or 'TOP')
    :param output_folder: Folder where processed CSV files will be saved
    """
    record = False
    topicid = ""
    output_data = []

    with open(filename, "r", encoding="utf-8") as f:
        while f:
            line = f.readline()
            if not line:
                break

            if record:
                pmid = line.strip()
                output_data.append([topicid, pmid]) 

            if line.startswith("Topic:"):
                topicid = line.split()[1].strip()

            if line.startswith("Pids:"):
                record = True

    # Save results to a CSV file
    if output_data:
        os.makedirs(path_data_converted, exist_ok=True)  
        output_filename = os.path.join(path_data_converted, os.path.basename(filename).replace(".txt", ".csv"))

        with open(output_filename, "w", newline="", encoding="utf-8") as csvfile:
            writer = csv.writer(csvfile)
            writer.writerow(["TopicID", "PMID"])  
            writer.writerows(output_data)  
        
        print(f"Saved: {output_filename}")

def process_folder(path_data_raw, path_data_converted, format):
    """
    Processes all .txt files in the specified folder and saves results as CSV.

    :param input_folder: Folder containing .txt files
    :param output_folder: Folder to save processed CSV files
    :param format: Output format ('TREC' or 'TOP')
    """
    if not os.path.exists(path_data_raw):
        print(f"Error: Folder '{path_data_raw}' does not exist.")
        return

    files = [f for f in os.listdir(path_data_raw) if f.endswith(".txt")]

    if not files:
        print("No .txt files found in the folder.")
        return

    for file in files:
        file_path = os.path.join(path_data_raw, file)
        print(f"\nProcessing: {file_path}")
        process_file(file_path, format, path_data_converted)

process_folder(path_data_raw, path_data_converted, format)


Processing: data/raw/CD005139.txt
Saved: data/converted/CD005139.csv

Processing: data/raw/CD006468.txt
Saved: data/converted/CD006468.csv

Processing: data/raw/CD010038.txt
Saved: data/converted/CD010038.csv

Processing: data/raw/CD010558.txt
Saved: data/converted/CD010558.csv

Processing: data/raw/CD008170.txt
Saved: data/converted/CD008170.csv

Processing: data/raw/CD011768.txt
Saved: data/converted/CD011768.csv

Processing: data/raw/CD008201.txt
Saved: data/converted/CD008201.csv


##### **2. From the PMIDs in the saved .csv files extract the corresponding titles and abstracts from PubMed**

In [8]:
# Path where the PMIDs of the review data are stored
path_data_converted = "data/converted/"  
# Path to store the extracted titles and abstracts
path_data_extracted = "data/extracted/"  
combined_output_file = os.path.join(path_data_extracted, "clef_all.csv")  

files_to_skip = []

def fetch_pubmed_data(list_of_pids):
    """
    Fetches PubMed article details for a given list of PMIDs.
    Skips unavailable PMIDs and returns data for available ones.
    
    :param list_of_pids: List of PubMed IDs (PMIDs)
    :return: DataFrame containing PMID, Title, and Abstract for available PMIDs
    """
    if not list_of_pids:
        return pd.DataFrame(columns=["PMID", "Title", "Abstract"])

    # Convert list of PMIDs to comma-separated string
    list_of_pids_str = ",".join(list_of_pids)

    # Request parameters
    payload = {
        'db': 'pubmed',
        'id': list_of_pids_str,
        'rettype': 'xml',
        'retmode': 'xml'
    }

    try:
        # Fetch data from PubMed
        response = requests.get("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?", params=payload)

        # Check if the response contains valid content
        if not response.content.strip():
            print(f"No data returned for PMIDs: {list_of_pids}")
            return pd.DataFrame(columns=["PMID", "Title", "Abstract"])

        # Parse XML response
        root = ET.fromstring(response.content)

        # Extract article information
        articles = []
        for article in root.findall(".//PubmedArticle"):
            pmid = article.find(".//PMID").text
            title_element = article.find(".//ArticleTitle")
            title = title_element.text if title_element is not None else "No title available"
            
            abstract_element = article.find(".//AbstractText")
            abstract = abstract_element.text if abstract_element is not None else "No abstract available"
            
            articles.append({"PMID": pmid, "Title": title, "Abstract": abstract})

        # Convert to DataFrame
        return pd.DataFrame(articles)

    except requests.exceptions.RequestException as e:
        print(f"Error fetching data for PMIDs {list_of_pids}: {e}")
        return pd.DataFrame(columns=["PMID", "Title", "Abstract"])

def process_csv_files(path_data_converted, path_data_extracted, combined_output_file):
    """
    Reads all CSV files from the input folder, extracts PMIDs, fetches PubMed data, 
    saves new CSV files, and creates a combined output file.

    :param input_folder: Path to the folder containing processed CSV files (PMID lists).
    :param output_folder: Path to the folder where new CSV files will be saved.
    :param combined_output_file: Path to the final combined output CSV file.
    """
    if not os.path.exists(path_data_converted):
        print(f"Error: Folder '{path_data_converted}' does not exist.")
        return

    files = [f for f in os.listdir(path_data_converted) if f.endswith(".csv")]
    
    if not files:
        print("No .csv files found in the folder.")
        return

    os.makedirs(path_data_extracted, exist_ok=True)  
    all_data = []  

    for file in files:
        # Skip the file if it's in the 'files_to_skip' list
        if file in files_to_skip:
            print(f"Skipping file: {file}")
            continue

        file_path = os.path.join(path_data_converted, file)
        print(f"\nProcessing: {file_path}")

        # Read CSV and extract PMIDs
        df = pd.read_csv(file_path)
        if "PMID" not in df.columns:
            print(f"Skipping {file}: No 'PMID' column found.")
            continue

        pmid_list = df["PMID"].dropna().astype(str).tolist() 

        # Fetch PubMed data for batches of PMIDs
        batch_size = 100  
        batched_data = []  
        for i in tqdm(range(0, len(pmid_list), batch_size), desc=f"Processing {file}"):
            batch = pmid_list[i:i+batch_size]
            pubmed_data = fetch_pubmed_data(batch)
            batched_data.append(pubmed_data)

        # Concatenate the batched data for this file
        pubmed_data_combined = pd.concat(batched_data, ignore_index=True)

        # Save the new CSV file in the output folder with the same filename
        output_filename = os.path.join(path_data_extracted, file)
        pubmed_data_combined.to_csv(output_filename, index=False)
        print(f"Saved: {output_filename}")

        # Append to the combined data list
        pubmed_data_combined["Source_File"] = file  
        all_data.append(pubmed_data_combined)

process_csv_files(path_data_converted, path_data_extracted, combined_output_file)


Processing: data/converted/CD010038.csv


Processing CD010038.csv: 100%|██████████| 89/89 [01:48<00:00,  1.21s/it]


Saved: data/extracted/CD010038.csv

Processing: data/converted/CD010558.csv


Processing CD010558.csv: 100%|██████████| 29/29 [00:39<00:00,  1.37s/it]


Saved: data/extracted/CD010558.csv

Processing: data/converted/CD011768.csv


Processing CD011768.csv: 100%|██████████| 92/92 [01:47<00:00,  1.17s/it]


Saved: data/extracted/CD011768.csv

Processing: data/converted/CD008170.csv


Processing CD008170.csv: 100%|██████████| 124/124 [02:14<00:00,  1.09s/it]


Saved: data/extracted/CD008170.csv

Processing: data/converted/CD008201.csv


Processing CD008201.csv: 100%|██████████| 36/36 [00:44<00:00,  1.24s/it]


Saved: data/extracted/CD008201.csv

Processing: data/converted/CD005139.csv


Processing CD005139.csv:  67%|██████▋   | 36/54 [00:42<00:23,  1.33s/it]

Error fetching data for PMIDs ['22232487', '27759581', '23480269', '24610632', '22801846', '22173075', '26411831', '22044337', '20937739', '21268815', '22304024', '22594924', '23430681', '21988318', '21187730', '22595908', '25547525', '20845250', '19410953', '18780651', '17317399', '25905984', '28712657', '22269605', '27465105', '23582991', '19900204', '23242589', '25193672', '19289984', '28366074', '27482641', '26957834', '25006692', '17567661', '20508510', '27766582', '23362796', '17524771', '17097190', '17060799', '25756339', '27409464', '20464787', '24696051', '16603955', '26501397', '25651450', '26148802', '27417506', '19321472', '27000269', '23642783', '18050131', '19584653', '15161830', '21668787', '23746130', '27027523', '21470386', '19556214', '25011025', '20379211', '19668384', '22491395', '20574019', '18617755', '18667951', '23464520', '26090898', '21720153', '27613201', '22495357', '19898177', '19023226', '26529038', '21170652', '23455233', '22408217', '27860530', '20558421

Processing CD005139.csv: 100%|██████████| 54/54 [01:03<00:00,  1.18s/it]


Saved: data/extracted/CD005139.csv

Processing: data/converted/CD006468.csv


Processing CD006468.csv: 100%|██████████| 39/39 [00:43<00:00,  1.12s/it]

Saved: data/extracted/CD006468.csv





##### **3. Retrieve the labels for title-abstract screening from the qurels and add to each dataset**

In [9]:
# Path where the qrels are stored
path_meta_data = "data/meta_data/clef_qrels/"

warnings.filterwarnings('ignore')

# Convert the .qrels files to .txt files
def convert_qrels_to_txt(path_meta_data):
    # Find all .qrels files in the folder
    qrels_files = glob.glob(os.path.join(path_meta_data, "*.qrels"))

    for qrels_file in qrels_files:
        # Extract base filename without extension
        base_name = os.path.splitext(os.path.basename(qrels_file))[0]
        
        # Load the .qrels file (assuming whitespace-separated format)
        qrels_df = pd.read_csv(qrels_file, sep=r"\s+", header=None, names=["query_id", "unused", "doc_id", "relevance"])
        
        # Define output .txt path
        txt_file = os.path.join(path_meta_data, f"{base_name}.txt")
        
        # Save as .txt
        qrels_df.to_csv(txt_file, sep=" ", index=False, header=False)
        print(f"Converted: {qrels_file} -> {txt_file}")

convert_qrels_to_txt(path_meta_data)

# Specify the path to your .txt files (adjust the path accordingly)
txt_files = glob.glob(path_meta_data + "*.txt")

# Initialize an empty list to hold dataframes
dfs = []

# Loop through each file
for file in txt_files:
    # Read the content of the file into a dataframe, assuming whitespace as the delimiter
    df = pd.read_csv(file, delim_whitespace=True)  
    
    # Rename the columns
    df.columns = ['topic', 'unknown', 'PMID', 'label']
    
    # Append the dataframe to the list
    dfs.append(df)

# Concatenate all the dataframes into one
concatenated_df = pd.concat(dfs, ignore_index=True)

Converted: data/meta_data/clef_qrels/test.qrels -> data/meta_data/clef_qrels/test.txt
Converted: data/meta_data/clef_qrels/train.qrels -> data/meta_data/clef_qrels/train.txt


In [10]:
qrels_df = concatenated_df.drop_duplicates()
qrels_df

Unnamed: 0,topic,unknown,PMID,label
0,CD005139,0,22972355,0
1,CD005139,0,17644433,0
2,CD005139,0,26866528,0
3,CD005139,0,17380066,0
4,CD005139,0,26107864,0
...,...,...,...,...
73634,CD012551,0,11272678,0
73635,CD012551,0,12504236,0
73636,CD012551,0,17546832,0
73637,CD012551,0,1872650,0


In [15]:
# Path where the extracted data is stored
path_data_extracted = 'data/extracted/'
# Path to store the cleaned data
path_data_processed = 'data/processed/'

# Get the list of .csv files in the input folder
csv_files = glob.glob(os.path.join(path_data_extracted, '*.csv'))

# Create the output folder if it doesn't exist
os.makedirs(path_data_processed, exist_ok=True)

# Process files 
for file in tqdm(csv_files, desc="Processing files", unit="file"):
    # Extract the topic from the filename (remove the extension .csv)
    topic = os.path.basename(file).replace('.csv', '')
    
    # Filter qrels_df based on the topic (from the filename)
    filtered_qrels_df = qrels_df[qrels_df['topic'] == topic]
    
    # Read the current .csv file into a DataFrame
    df_csv = pd.read_csv(file)
    
    # Merge the label from qrels_df to the .csv file based on 'PMID'
    df_merged = pd.merge(df_csv, filtered_qrels_df[['PMID', 'label']], on='PMID', how='left')

    # Rename the needed columns for consistency
    df_merged = df_merged.rename(columns={'PMID': 'pubmed_id', 
                            'Title': 'title',
                            'Abstract': 'abstract',
                            'label': 'label_included'})

    # Add the needed columns
    df_merged['id'] = range(1, len(df_merged) + 1)
    df_merged['openalex_id'] = None
    df_merged['doi'] = None
    df_merged['keywords'] = None
    df_merged['year'] = None

    # Select the needed columns
    df_merged = df_merged[['id', 'title', 'abstract', 'pubmed_id', 'openalex_id', 'doi',
             'keywords', 'year', 'label_included']]

    # Create a new output file path
    output_file = os.path.join(path_data_processed, os.path.basename(file))

    # Ensure all abstracts have a label that is not NA and as integer
    df_merged = df_merged.dropna(subset=['label_included'])
    df_merged['label_included'] = df_merged['label_included'].astype(int)
    
    # Save the updated .csv file to the output folder
    df_merged.to_csv(output_file, index=False)

Processing files: 100%|██████████| 7/7 [00:00<00:00,  7.69file/s]


##### **End of notebook**