# ArXiv Literature Search Strategy & CSV Generation
This notebook lets you test different keyword strategies and download records using the ArXiv API.

Check the API documentation on: https://info.arxiv.org/help/api/user-manual.html

Edit the `groups` and `logic` in the keyword strategy cell, then run the subsequent cells to see the results.


In [1]:
# === SECTION: INSTALL AND IMPORT DEPENDENCIES ===
# Install feedparser with pip if it is not already present
try:
    import feedparser
    print("'feedparser' is already installed.")
except ModuleNotFoundError:
    import subprocess, sys
    print("'feedparser' not found. Installing now...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "feedparser"])
    import feedparser
    print("'feedparser' has been installed successfully.")

'feedparser' is already installed.


# 1. Setup & Define Your Folders

The cell below reads the `CSV_FOLDER` and `SUMMARY_FOLDER` variables from your `.env` file to define the CSV and summary folders, so set both before running it.

In [2]:
import os
from dotenv import load_dotenv, find_dotenv

# Load .env variables
load_dotenv(find_dotenv())

# Get folder paths from environment variables
csv_folder = os.getenv("CSV_FOLDER")
summary_folder = os.getenv("SUMMARY_FOLDER")

# Create the folders (and parents) if they do not exist; report any that are unset
missing = []
for label, folder in (("CSV folder", csv_folder), ("Summary folder", summary_folder)):
    if not folder:
        missing.append(label)  # variable not defined in .env or the environment
        continue
    os.makedirs(folder, exist_ok=True)  # does nothing if the folder already exists
    if not os.path.isdir(folder):
        missing.append(label)

if missing:
    print(f"⚠️ WARNING: Please check the following: {', '.join(missing)}")
else:
    print("✅ Output folders are set up and ready.")

✅ Output folders are set up and ready.
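
The setup cell assumes that `CSV_FOLDER` and `SUMMARY_FOLDER` are defined. If you would rather fall back to a default folder when a variable is missing, something like the following could work. This is only a sketch: the helper `resolve_folder`, the variable name `DEMO_CSV_FOLDER`, and the default folder name are hypothetical and not part of the notebook.

```python
import os
import tempfile

def resolve_folder(var_name, default):
    """Return the folder named by an environment variable, or `default` if unset or empty."""
    folder = os.getenv(var_name) or default
    os.makedirs(folder, exist_ok=True)  # create it (and parents) if needed
    return folder

# Demonstration in a throwaway directory so nothing is written to your project.
with tempfile.TemporaryDirectory() as tmp:
    os.environ.pop("DEMO_CSV_FOLDER", None)  # hypothetical variable, simulated as missing
    fallback = os.path.join(tmp, "csvs")      # hypothetical default location
    csv_dir = resolve_folder("DEMO_CSV_FOLDER", fallback)
    print(csv_dir == fallback, os.path.isdir(csv_dir))
```

In the notebook itself you would call it with the real variable names, for example `resolve_folder("CSV_FOLDER", "csvs")`.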


# 2. Test and adjust your keyword strategy

The four cells below help you test different keyword groups and their combinations.
- 2.1 Run to define groups of keywords using AND/OR rules, then define a combination logic
- 2.2 Run to see the number of results returned for each keyword group and the combined query
- 2.3 Run to see the first 10 titles for each keyword group
- 2.4 Run to see the first 10 titles for the combined query

In [None]:
# === SECTION: USER INPUT FOR DATE RANGE ===
start_year = 2016  # Example value
end_year = 3000    # Example value

print(f"Date range defined: {start_year} to {end_year}")

# === SECTION: DEFINE KEYWORD GROUPS, LOGIC, AND DATE FILTER ===
# Exclusion keywords are only applied to the combined query at the download stage
groups = {
    'group1': 'keyword OR keyword',
    'group2': 'keyword OR keyword',
    'group3': 'keyword OR keyword',
    'group4': 'keyword OR keyword',
    'excluded': 'exclusion OR criteria' 
}

# Date filter from 1 Jan of start_year up to 1 Jan of end_year (format YYYYMMDDHHMM)
date_filter = f'submittedDate:[{start_year}01010000 TO {end_year}01010000]'
print(f"Date filter string defined: {date_filter}")

# Combine logic and add date filter
logic = "({group1}) AND ({group2}) AND ({group3}) AND ({group4}) AND {date_filter}"
combined_query = logic.format(**groups, date_filter=date_filter)
print("Keyword groups, logic, and date filter defined.")
print("Combined arXiv query:", combined_query)

Date range defined: 2016 to 3000
Date filter string defined: submittedDate:[201601010000 TO 300001010000]
Keyword groups, logic, and date filter defined.
Combined arXiv query: (keyword OR keyword) AND (keyword OR keyword) AND (keyword OR keyword) AND (keyword OR keyword) AND submittedDate:[201601010000 TO 300001010000]
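
The arXiv API user manual describes double quotes for grouping several words into a phrase, and field prefixes such as `all:` or `ti:` that apply to individual terms. The groups above are written as plain strings, so multi-word keywords are sent unquoted. A minimal sketch of a helper that builds a group string with quoted phrases is shown below. The function `build_group` is hypothetical and not part of the notebook.

```python
def build_group(terms, field="all"):
    """Join keywords with OR, quoting multi-word keywords as phrases."""
    parts = []
    for term in terms:
        term = term.strip()
        if not term:
            continue  # skip empty keywords
        # Quote only multi-word keywords; single words are left bare
        quoted = f'"{term}"' if " " in term else term
        parts.append(f"{field}:{quoted}")
    return " OR ".join(parts)

example = build_group(["neuroimaging pipeline", "MRI processing", "neuroinformatics"])
print(example)
# all:"neuroimaging pipeline" OR all:"MRI processing" OR all:neuroinformatics
```

Note that `arxiv_query` in this notebook already prepends `all:` to the whole query string, so using per-term prefixes like these would mean removing that prefix from the URL it builds.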


In [10]:
# === SECTION: RUN ARXIV QUERY AND RETURN TOTAL RESULTS FOR EACH GROUP AND COMBINED QUERY ===
import urllib.parse

def arxiv_query(query, max_results=1, start=0):
    """Query arXiv API and return the parsed feed."""
    import feedparser
    base_url = 'http://export.arxiv.org/api/query?'
    search_query = urllib.parse.quote(query)
    url = f"{base_url}search_query=all:{search_query}&start={start}&max_results={max_results}"
    print(f"Querying arXiv: {url}")
    feed = feedparser.parse(url)
    return feed

# Print total results for each group
print("\n" + "="*50)
print("INDIVIDUAL GROUP RESULTS:")
print("="*50)
for name, query in groups.items():
    print(f"\nGROUP: {name.upper()}")
    feed = arxiv_query(query, max_results=1)  # Only need 1 result to get total count
    total_results = feed.feed.get('opensearch_totalresults', 'unknown')
    print(f"Total results for {name}: {total_results}")

# Print total results for combined query
print("\n" + "="*50)
print("COMBINED LOGIC RESULTS:")
print("="*50)
feed = arxiv_query(combined_query, max_results=1)
total_results = feed.feed.get('opensearch_totalresults', 'unknown')
print(f"Total results for combined query: {total_results}")


INDIVIDUAL GROUP RESULTS:

GROUP: GROUP1
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:neuroimaging%20pipeline%20OR%20MRI%20processing%20OR%20neuroinformatics%20pipeline&start=0&max_results=1
Total results for group1: 458989

GROUP: GROUP2
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:structural%20MRI%20OR%20T1-weighted%20OR%20T2-weighted%20OR%20low-field%20MRI%20OR%20portable%20MRI&start=0&max_results=1
Total results for group2: 453094

GROUP: GROUP3
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:continuous%20integration%20OR%20continuous%20deployment%20OR%20CI/CD%20OR%20containerization%20OR%20containerisation%20OR%20version%20control%20OR%20cloud-based%20OR%20serverless%20OR%20distributed%20storage%20OR%20BIDS%20OR%20flywheel.io%20OR%20github%20OR%20gitlab%20OR%20reproducibility&start=0&max_results=1
Total results for group3: 1102006

GROUP: GROUP4
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:brain%
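
The totals above come from the `opensearch:totalResults` element of the Atom feed, which `feedparser` exposes as `opensearch_totalresults`. To show where that number lives, here is a sketch that reads it with the standard library only. The feed text is a hand-written sample, not real API output, and the helper `total_results` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written sample feed for illustration only (NOT real arXiv output)
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>sample</title>
  <opensearch:totalResults>42</opensearch:totalResults>
  <opensearch:startIndex>0</opensearch:startIndex>
  <opensearch:itemsPerPage>1</opensearch:itemsPerPage>
</feed>"""

NS = {"opensearch": "http://a9.com/-/spec/opensearch/1.1/"}

def total_results(feed_xml):
    """Return the total result count from an Atom feed string, or None if absent."""
    root = ET.fromstring(feed_xml)
    node = root.find("opensearch:totalResults", NS)
    if node is None or not node.text:
        return None
    return int(node.text)

print(total_results(SAMPLE_FEED))  # 42
```

Requesting `max_results=1`, as the cell above does, keeps the response small because only this count is needed.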

The next block will show the first 10 titles for each keyword group (except the excluded keyword group).

Based on this, you can go back and adjust your keyword groups.

In [11]:
# === SECTION: PRINT FIRST 10 TITLES FOR EACH GROUP ===
for name, query in groups.items():
    if name == 'excluded':
        continue  # exclusion keywords are only applied at the download stage
    print("\n" + "="*50)
    print(f"FIRST 10 TITLES FOR GROUP: {name.upper()}")
    print("="*50)
    feed = arxiv_query(query, max_results=10)
    for i, entry in enumerate(feed.entries, 1):
        # Clean the title before using it in the f-string
        clean_title = entry.title.strip().replace('\n', ' ')
        print(f"{i}. {clean_title}")


FIRST 10 TITLES FOR GROUP: GROUP1
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:neuroimaging%20pipeline%20OR%20MRI%20processing%20OR%20neuroinformatics%20pipeline&start=0&max_results=10
1. Machine Learning pipeline for discovering neuroimaging-based biomarkers   in neurology and psychiatry
2. An Analysis of Performance Bottlenecks in MRI Pre-Processing
3. CompressedMediQ: Hybrid Quantum Machine Learning Pipeline for   High-Dimensional Neuroimaging Data
4. Pipeline-Invariant Representation Learning for Neuroimaging
5. The Developing Human Connectome Project: A Fast Deep Learning-based   Pipeline for Neonatal Cortical Surface Reconstruction
6. JUMP: A joint multimodal registration pipeline for neuroimaging with   minimal preprocessing
7. Clinica: an open source software platform for reproducible clinical   neuroscience studies
8. FastSurfer -- A fast and accurate deep learning based neuroimaging   pipeline
9. Mitigating analytical variability in fMRI results with st

The next block will show the first 10 titles for the combined query.

Based on the results you can go back and adjust your groups and logic.

In [12]:
# === SECTION: PRINT FIRST 10 TITLES FOR COMBINED QUERY ===
print("\n" + "="*50)
print("FIRST 10 TITLES FOR COMBINED QUERY:")
print("="*50)
feed = arxiv_query(combined_query, max_results=10)
for i, entry in enumerate(feed.entries, 1):
    clean_title = entry.title.strip().replace('\n', ' ')
    print(f"{i}. {clean_title}")


FIRST 10 TITLES FOR COMBINED QUERY:
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:%28neuroimaging%20pipeline%20OR%20MRI%20processing%20OR%20neuroinformatics%20pipeline%29%20AND%20%28structural%20MRI%20OR%20T1-weighted%20OR%20T2-weighted%20OR%20low-field%20MRI%20OR%20portable%20MRI%29%20AND%20%28continuous%20integration%20OR%20continuous%20deployment%20OR%20CI/CD%20OR%20containerization%20OR%20containerisation%20OR%20version%20control%20OR%20cloud-based%20OR%20serverless%20OR%20distributed%20storage%20OR%20BIDS%20OR%20flywheel.io%20OR%20github%20OR%20gitlab%20OR%20reproducibility%29%20AND%20%28brain%20OR%20neuroimaging%29%20AMD%20submittedDate%3A%5B201601010000%20TO%20300001010000%5D&start=0&max_results=10
1. Clinica: an open source software platform for reproducible clinical   neuroscience studies
2. FastSurferVINN: Building Resolution-Independence into Deep Learning   Segmentation Methods -- A Solution for HighRes Brain MRI
3. The Developing Human Connectome Proj

# 3. Export ArXiv Results to CSV

The script below uses your combined query to download records and save them to a CSV file containing the arXiv ID, title, authors, publication date, abstract, and categories. It also appends a row to the summary table with the source, the number of records found and downloaded, the final query, and a timestamp for record keeping purposes.
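
The download works in pages: each request passes a `start` offset and a `max_results` batch size, and the offset advances by one batch per request. As a simplified sketch of that paging arithmetic, ignoring the exclusion filter and the pause between requests, the offsets could be computed as follows. The helper `batch_starts` is hypothetical and not part of the notebook.

```python
def batch_starts(total_found, batch_size=100, max_total=5000):
    """Return the `start` offsets needed to page through the results."""
    limit = min(total_found, max_total)  # never request beyond the cap
    return list(range(0, limit, batch_size))

print(batch_starts(250, batch_size=100))  # [0, 100, 200]
```

In the notebook's own loop, `max_total` caps the number of records written after exclusion, so the number of requests it makes can be higher than this sketch suggests.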

In [13]:
# === SECTION: EXPORT ARXIV RESULTS TO CSV (ALL RESULTS, WITH EXCLUSION) ===

import csv
import os
import time
from datetime import datetime

def get_next_csv_name(folder, base_name):
    """Return (name, path) for the first unused '<base_name>_v<i>.csv' in folder."""
    i = 1
    while True:
        csv_name = f"{base_name}_v{i}.csv"
        csv_path = os.path.join(folder, csv_name)
        if not os.path.exists(csv_path):
            return f"{base_name}_v{i}", csv_path
        i += 1

def is_excluded(entry, excluded_terms):
    """Return True if any excluded term is found in the title or summary."""
    text = (entry.title + " " + entry.summary).lower()
    return any(term.lower() in text for term in excluded_terms)

def download_arxiv_to_csv_all(query, csv_folder, excluded_terms, max_total=5000, batch_size=100):
    import feedparser
    os.makedirs(csv_folder, exist_ok=True)
    base_name, csv_path = get_next_csv_name(csv_folder, "arxiv")
    print(f"Writing results to CSV: {csv_path}")

    total_found = None
    total_downloaded = 0
    start = 0

    with open(csv_path, 'w', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(['arxiv_id', 'title', 'authors', 'published', 'summary', 'categories'])
        while total_downloaded < max_total:
            feed = arxiv_query(query, max_results=batch_size, start=start)
            if total_found is None:
                # Get total found from the feed metadata
                total_found = int(feed.feed.get('opensearch_totalresults', 0))
                print(f"Total results found: {total_found}")
            entries = feed.entries
            if not entries:
                break
            for entry in entries:
                if is_excluded(entry, excluded_terms):
                    continue
                arxiv_id = entry.id.split('/abs/')[-1]
                title = entry.title.replace('\n', ' ').strip()
                authors = '; '.join(author.name for author in entry.authors)
                published = entry.published
                summary = entry.summary.replace('\n', ' ').strip()
                categories = ', '.join(tag['term'] for tag in entry.tags) if hasattr(entry, 'tags') else ''
                writer.writerow([arxiv_id, title, authors, published, summary, categories])
                total_downloaded += 1
                if total_downloaded >= max_total:
                    break
            start += batch_size
            print(f"Progress: Downloaded {total_downloaded} / {min(total_found, max_total)}")
            time.sleep(3)  # Respect arXiv API rate limit
            if len(entries) < batch_size:
                break  # No more results
    print(f"Downloaded {total_downloaded} records to {csv_path}")
    timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M")
    return total_found, total_downloaded, base_name, timestamp

def append_summary_row(summary_folder, base_name, found, downloaded, query, timestamp):
    """Append one record of this download to the summary CSV, writing the header if needed."""
    summary_csv_path = os.path.join(summary_folder, "summary_csv.csv")
    os.makedirs(summary_folder, exist_ok=True)
    file_exists = os.path.exists(summary_csv_path)
    is_empty = not file_exists or os.path.getsize(summary_csv_path) == 0
    with open(summary_csv_path, 'a', encoding='utf-8', newline='') as f:
        # csv.writer escapes the query correctly if it contains commas or quotes
        writer = csv.writer(f, lineterminator='\n')
        if is_empty:
            writer.writerow(["source", "found", "downloaded", "query", "timestamp"])
        writer.writerow([base_name, found, downloaded, query, timestamp])
    print(f"Summary row added for {base_name}")

# Prepare exclusion terms from your groups dict
# Split on ' OR ' (with spaces) so keywords containing the letters "OR" stay intact
excluded_terms = [term.strip() for term in groups.get('excluded', '').split(' OR ') if term.strip()]

# Run download and summary
found, downloaded, base_name, timestamp = download_arxiv_to_csv_all(
    combined_query, csv_folder, excluded_terms, max_total=5000, batch_size=100
)
append_summary_row(summary_folder, base_name, found, downloaded, combined_query, timestamp)


Writing results to CSV: C:/Users/petra/Documents/UniKCL/Workshop/csvs/arxiv_v1.csv
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:%28neuroimaging%20pipeline%20OR%20MRI%20processing%20OR%20neuroinformatics%20pipeline%29%20AND%20%28structural%20MRI%20OR%20T1-weighted%20OR%20T2-weighted%20OR%20low-field%20MRI%20OR%20portable%20MRI%29%20AND%20%28continuous%20integration%20OR%20continuous%20deployment%20OR%20CI/CD%20OR%20containerization%20OR%20containerisation%20OR%20version%20control%20OR%20cloud-based%20OR%20serverless%20OR%20distributed%20storage%20OR%20BIDS%20OR%20flywheel.io%20OR%20github%20OR%20gitlab%20OR%20reproducibility%29%20AND%20%28brain%20OR%20neuroimaging%29%20AMD%20submittedDate%3A%5B201601010000%20TO%20300001010000%5D&start=0&max_results=100
Total results found: 1830
Progress: Downloaded 87 / 1830
Querying arXiv: http://export.arxiv.org/api/query?search_query=all:%28neuroimaging%20pipeline%20OR%20MRI%20processing%20OR%20neuroinformatics%20pipeline%29%20A
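
`is_excluded` uses a plain, case-insensitive substring test, so a short exclusion keyword can also match inside longer words. The sketch below compares that behaviour with a whole-word match. Both helper names are hypothetical and the sample text is made up for illustration.

```python
import re

def contains_term(text, term):
    """Case-insensitive substring check, the same test `is_excluded` applies."""
    return term.lower() in text.lower()

def contains_whole_term(text, term):
    """Case-insensitive match that requires word boundaries around the term."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

text = "Rational design of imaging protocols"  # made-up title
print(contains_term(text, "rat"))        # True: "rat" occurs inside "Rational"
print(contains_whole_term(text, "rat"))  # False: no standalone word "rat"
```

Whether the stricter match is preferable depends on your exclusion keywords. With substring matching, a record can be dropped because of an unrelated word that happens to contain the keyword.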

# ArXiv Literature Search Strategy Completed

If all cells have run successfully (once or several times), you should have received a confirmation message for each block and have at least one CSV named arxiv_v(n).csv in the folder defined at the start. Note that every download generates an additional version following the naming convention v1, v2, v3, etc. You should also have a summary table updated with a record of each download you made.

Note that your found and downloaded numbers may differ, as the exclusion criteria are applied at the download stage for the arXiv API.