# Search PubMed and store abstracts

This notebook searches PubMed for every query defined in the EFO ontology (i.e., properties of type `efo:pubmedQuery`), cleans up and aggregates the XML results, and stores all search hits in a single compressed CSV file.

**NOTE**: Notebooks should be executed from the project root folder. All paths are relative to the project root.


## Input

- `data/ontologies/efo.owl`: EFO ontology.

## Output

- `data/pubmed/abstracts_2023.csv.gz`: Aggregated abstracts are stored as a single compressed file, containing the following columns:
    - `category` (`str`): either `CognitiveTask` or `CognitiveConstruct`
    - `subcategory` (`str`): task or construct name
    - `pmid` (`long`): PubMed identifier
    - `doi` (`str`): DOI
    - `year` (`int`): publication year in `yyyy` format
    - `title` (`str`): publication title
    - `abstract` (`str`): publication abstract
    - `journal_title` (`str`): full journal title
    - `journal_iso_abbreviation` (`str`): abbreviated journal title
    - `mesh` (`str`, deprecated): list of Medical Subject Headings (MeSH) indicating the field of research and other topics; only major topics are kept
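
As a quick sanity check on the output, the header of the compressed CSV can be compared against the columns listed above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` is hypothetical (it is not part of `cogtext`), and the expected column list is simply copied from this section.

```python
import csv
import gzip

# Columns as listed in the Output section above (assumed to match the file).
EXPECTED_COLUMNS = [
    'category', 'subcategory', 'pmid', 'doi', 'year', 'title',
    'abstract', 'journal_title', 'journal_iso_abbreviation', 'mesh',
]


def missing_columns(path, expected=EXPECTED_COLUMNS):
    """Return the expected columns that are absent from the CSV header."""
    # Read only the first row of the gzip-compressed CSV.
    with gzip.open(path, 'rt', encoding='utf-8', newline='') as f:
        header = next(csv.reader(f), [])
    # Membership check only; column order is not verified.
    return [c for c in expected if c not in header]
```

Calling `missing_columns('data/pubmed/abstracts_2023.csv.gz')` from the project root should return an empty list if the file has all the documented columns.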

## Requirements

First, set the `NCBI_API_KEY` environment variable (see [How do I obtain an API Key through an NCBI account](https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us)). Then activate the `cogtext` conda environment:

```bash
mamba activate cogtext
```
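
Because a missing key typically only surfaces once the search is already running, it can help to fail early. The sketch below checks that `NCBI_API_KEY` is available; the function name `get_ncbi_api_key` is hypothetical and is not part of `cogtext`. When used in this notebook, it would be called after `load_dotenv()` so that a key stored in a `.env` file is picked up.

```python
import os


def get_ncbi_api_key(env=None):
    """Return the NCBI API key, or raise a clear error if it is not set."""
    # Default to the process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get('NCBI_API_KEY', '').strip()
    if not key:
        raise RuntimeError(
            'NCBI_API_KEY is not set; export it or add it to a .env file.'
        )
    return key
```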


In [1]:
%reload_ext autoreload
%autoreload 3

from owlready2 import get_ontology
from pathlib import Path
from dotenv import load_dotenv; load_dotenv()  # to load NCBI_API_KEY env variable
import pandas as pd
import re
from tqdm import tqdm

from src.cogtext.datasets.pubmed import search_and_cache_xml, convert_xml_to_csv


In [2]:
DEBUG = True

# collect data for the following categories
CATEGORIES = ['CognitiveTask', 'CognitiveConstruct']

OUTPUT_PATH = 'data/pubmed/abstracts_2023.csv.gz'

OWL_FILE = 'data/ontologies/efo.owl'
ONTOLOGY = get_ontology(OWL_FILE).load()

In [3]:
for category in CATEGORIES:
  pubmed_queries = {e.name:e.pubmedQuery[0] for e in ONTOLOGY[category].descendants() if len(e.pubmedQuery) > 0}
  print(f'EF ontology contains {len(pubmed_queries)} PubMed queries for {category}s.')

EF ontology contains 126 PubMed queries for CognitiveTasks.
EF ontology contains 72 PubMed queries for CognitiveConstructs.


In [4]:

for category in CATEGORIES:

    # init folder
    Path(f'data/pubmed/{category}').mkdir(parents=True, exist_ok=True)

    # fetch queries from the ontology
    pubmed_queries = {e.name: e.pubmedQuery[0] for e in ONTOLOGY[category].descendants() if len(e.pubmedQuery) > 0}

    search_and_cache_xml(pubmed_queries)
    convert_xml_to_csv(pubmed_queries, category)

print('Done!')

[PubMed] query: ("More-less/Odd-Even task"[TIAB])
[PubMed] no article found.
[PubMed] query: ("Backward Span task"[TIAB])
[PubMed] stored 3 hits on NCBI history server.
[PubMed] stored hits in data/pubmed/.cache/BackwardSpanTask.xml.
[PubMed] query: ("Digit-Shifting task"[TIAB])
[PubMed] no article found.
[PubMed] query: ("Contingency Naming test"[TIAB])
[PubMed] stored 11 hits on NCBI history server.
[PubMed] stored hits in data/pubmed/.cache/ContingencyNamingTask.xml.
[PubMed] query: ("Grass/Snow"[TIAB])
[PubMed] stored 3 hits on NCBI history server.
[PubMed] stored hits in data/pubmed/.cache/GrassSnowTask.xml.
[PubMed] query: ("Psychomotor Vigilance"[TIAB])
[PubMed] stored 1076 hits on NCBI history server.
[PubMed] stored hits in data/pubmed/.cache/PVT_-_Psychomotor_Vigilance_task.xml.
[PubMed] query: ("Dimensional Change Card" Sort*[TIAB])
[PubMed] stored 185 hits on NCBI history server.
[PubMed] stored hits in data/pubmed/.cache/DimensionalChangeCardSortTask.xml.
[PubMed] query: (
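
The internals of `search_and_cache_xml` are not shown in this notebook, but the log lines above suggest that hits are first stored on the NCBI history server. As an illustration only, the sketch below builds a request URL for the E-utilities ESearch endpoint with `usehistory=y`; the function name `build_esearch_url` is hypothetical, no request is sent, and the actual `cogtext` implementation may differ.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'


def build_esearch_url(query, api_key=None):
    """Build an ESearch URL that stores the PubMed hits on the history server."""
    params = {
        'db': 'pubmed',      # search the PubMed database
        'term': query,       # e.g. '("Backward Span task"[TIAB])'
        'usehistory': 'y',   # keep the result set on the NCBI history server
    }
    if api_key:
        params['api_key'] = api_key
    return f'{ESEARCH_URL}?{urlencode(params)}'
```

With `usehistory=y`, ESearch returns a `WebEnv` and `query_key` pair that a later EFetch call can use to download the matching records.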

Now that each subcategory has its own CSV file, we combine them and store the whole PubMed abstracts corpus as a single compressed CSV file:

In [None]:
# aggregation

corpus_files = Path('data/pubmed/').glob('**/*.csv')

dfs = []

for fname in tqdm(corpus_files):

  # find categories from the file name
  cats = re.findall(r'.*/pubmed/(.*)/(.*)\.csv', str(fname))

  # ignore other csv files
  if len(cats) == 0:
    continue

  category = cats[0][0]
  subcategory = cats[0][1]

  df = pd.read_csv(fname)
  df['category'] = category
  df['subcategory'] = subcategory
  dfs.append(df)

# now aggregate all the data and store the compressed csv output (takes ~ 2min).
pd.concat(dfs).to_csv(OUTPUT_PATH, index=False, compression='gzip')
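
The aggregation above derives `category` and `subcategory` from each file path with a regular expression that assumes `/` as the path separator. A possible alternative, sketched here under the assumption that files follow the `data/pubmed/<category>/<subcategory>.csv` layout, is to split the path with `pathlib`; the helper name `parse_corpus_path` is hypothetical.

```python
from pathlib import Path


def parse_corpus_path(fname):
    """Return (category, subcategory) for .../pubmed/<category>/<subcategory>.csv, else None."""
    parts = Path(fname).parts
    # The file must sit exactly one folder below a 'pubmed' directory.
    if len(parts) >= 3 and parts[-3] == 'pubmed' and parts[-1].endswith('.csv'):
        return parts[-2], parts[-1][:-len('.csv')]
    return None
```

For example, `parse_corpus_path('data/pubmed/CognitiveTask/BackwardSpanTask.csv')` yields the pair `('CognitiveTask', 'BackwardSpanTask')`, while CSV files stored directly under `data/pubmed/` yield `None` and can be skipped, as in the loop above.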