# Preprocessing

Preprocess PubMed abstract texts by applying the following steps:

- tokenize,
- remove punctuation marks, numbers, and symbols,
- build a dictionary,
- remove stop words,
- stem,
- lemmatize,
- detect n-gram phrases to identify common phrases,
- concatenate the tokens back into a single string.

All of the above operations are performed separately for each cognitive task/construct corpus. However, the output is a single aggregated CSV file with the same columns as the input file (`data/pubmed/abstracts.csv.gz`); only the `abstract` column is cleaned up. The CSV is stored at `data/pubmed/abstracts_preprocessed.csv.gz`.
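
The steps listed above can be illustrated with a minimal sketch that uses only the standard library. It is not the `preprocess_abstracts` implementation used in this notebook, which relies on spaCy; stemming and lemmatization are left out because they need a language model. The names `STOP_WORDS`, `clean_tokens`, and `join_common_bigrams`, as well as the two example abstracts, are hypothetical and exist only for illustration.

```python
import re
from collections import Counter

# Hypothetical, tiny stop word list (illustration only; spaCy ships a much larger one).
STOP_WORDS = {'the', 'a', 'of', 'in', 'and', 'this', 'was', 'with', 'study'}


def clean_tokens(text):
  """Lowercase, keep alphabetic tokens of two or more letters, drop stop words."""
  tokens = re.findall(r'[a-z]{2,}', text.lower())
  return [t for t in tokens if t not in STOP_WORDS]


def join_common_bigrams(docs, min_count=2):
  """Merge adjacent token pairs that occur at least `min_count` times in the corpus."""
  counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
  common = {pair for pair, n in counts.items() if n >= min_count}
  merged_docs = []
  for doc in docs:
    merged, i = [], 0
    while i < len(doc):
      if i + 1 < len(doc) and (doc[i], doc[i + 1]) in common:
        merged.append(f'{doc[i]}_{doc[i + 1]}')
        i += 2
      else:
        merged.append(doc[i])
        i += 1
    merged_docs.append(merged)
  return merged_docs


# Two made-up abstracts standing in for one task/construct corpus.
abstracts = [
  'This study measured working memory in 42 adults.',
  'Working memory capacity was correlated with attention (r = .31).',
]

docs = join_common_bigrams([clean_tokens(a) for a in abstracts])
cleaned = [' '.join(doc) for doc in docs]  # tokens back to a single string
print(cleaned)
```

Joining frequent bigrams with an underscore keeps a phrase such as "working memory" as one token after the abstract is concatenated back into a string.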

**Note:** Preprocessing a large corpus is time-consuming. It can take about 12 hours (the run logged at the end of this notebook finished in about 2.5 hours). Be mindful when running, and have fun!

In [1]:
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import pandas as pd
from tqdm import tqdm
import spacy  # language models are loaded by name, e.g., en_core_web_sm or en_core_web_trf

from python.cogtext.abstract_utils import preprocess_abstracts
from python.cogtext.utils import select_relevant_journals

tqdm.pandas()

In [16]:
# parameters
DEV_MODE = False
"""Enabling development mode reduces the dataset size and speeds up processing."""

INPUT_FILE = Path('data/pubmed/abstracts.csv.gz')
"""Input csv file."""

OUTPUT_FILE = Path('data/pubmed/abstracts_preprocessed.csv.gz')
"""Path to store the preprocessed abstracts (compressed CSV)."""

CUSTOM_STOP_WORDS = ['study']  #, 'task', 'test']
"""List of custom domain-specific stop words, e.g., study, performance."""

nlp = spacy.load('en_core_web_sm')
"""Language model."""

PUBMED = (
  pd.read_csv(INPUT_FILE, compression='gzip')
  .pipe(select_relevant_journals)
  .dropna(subset=['abstract'])
)
"""The raw PubMed abstracts dataset."""

# additional stop words
for stop_word in CUSTOM_STOP_WORDS:
  lexeme = nlp.vocab[stop_word]
  lexeme.is_stop = True


## Fix "Cognitive Flexibility" Corpus

Analysis of the PubMed corpora for "Cognitive Flexibility" and "Cognitive Flexibility Test" produces highly correlated task-construct co-occurrences. Here, we investigate whether this correlation is an artifact of PubMed querying issues.

The following code removes the papers that seem irrelevant to the task/construct.

In [17]:
# FIXME: this should happen after tokenizing and preprocessing the abstracts to cover more generic cases.

C = PUBMED.query('category.str.contains("Construct") and subcategory.str.contains("Flexibility", case=False)')
M = PUBMED.query('category.str.contains("Task") and subcategory.str.contains("Flexibility", case=False)')

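# a non-capturing group, (?:...), avoids the pandas "match groups" warning raised by str.contains
invalid_C = C.query('abstract.str.contains("flexibility (?:task|test)", case=False, na=False)')
invalid_M = M.query('not abstract.str.contains("flexibility (?:task|test)", case=False, na=False)')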

PUBMED.drop(invalid_C.index, inplace=True)
PUBMED.drop(invalid_M.index, inplace=True)

# DEBUG
print(f'Found {len(C)} construct articles and {len(M)} task articles in the "Cognitive Flexibility" corpus.')
print(f'Removed {len(invalid_C)} construct articles and {len(invalid_M)} task articles from the "Cognitive Flexibility" corpus.')

Found 2622 construct articles and 100 task articles in the "Cognitive Flexibility" corpus.
Removed 0 construct articles and 0 task articles from the "Cognitive Flexibility" corpus.
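
The filtering rule above can be checked on a toy data frame. This is a sketch under stated assumptions: the category labels `CognitiveConstruct` and `CognitiveTask`, the subcategory names, and the four abstracts are made up (the notebook only tests for the substrings "Construct", "Task", and "Flexibility"), and boolean masks replace `DataFrame.query` for readability.

```python
import pandas as pd

# Hypothetical rows; only the substrings matched below are taken from the notebook.
toy = pd.DataFrame({
  'category': ['CognitiveConstruct', 'CognitiveConstruct', 'CognitiveTask', 'CognitiveTask'],
  'subcategory': ['Cognitive Flexibility', 'Cognitive Flexibility',
                  'Cognitive Flexibility Test', 'Cognitive Flexibility Test'],
  'abstract': [
    'Cognitive flexibility declines with age.',
    'We administered a cognitive flexibility task.',
    'Participants switched between sorting rules.',
    'Scores on the cognitive flexibility test improved.',
  ],
})

# Non-capturing group, so pandas does not warn about match groups.
pattern = r'flexibility (?:task|test)'

is_flex = toy['subcategory'].str.contains('Flexibility', case=False)
is_construct = toy['category'].str.contains('Construct')
is_task = toy['category'].str.contains('Task')
mentions_task = toy['abstract'].str.contains(pattern, case=False, na=False)

# Construct articles that mention the task, and task articles that do not, are dropped.
invalid = is_flex & ((is_construct & mentions_task) | (is_task & ~mentions_task))
kept = toy[~invalid]

print(kept[['category', 'abstract']])
```

In this toy data the second and third rows are removed: one is labeled as a construct article but describes the task, and the other is labeled as a task article but never mentions it.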




## Preprocessing the abstracts

In [19]:
if DEV_MODE:
  subcats_cnt = PUBMED['subcategory'].value_counts()
  small_subcats = subcats_cnt[subcats_cnt < 20].index.to_list()
  PUBMED = PUBMED.query('subcategory in @small_subcats').copy()

# lowercase all abstracts (to avoid inconsistent lemmas from spaCy)
PUBMED['abstract'] = PUBMED['abstract'].str.lower()

# preprocess (~2:30h)
PUBMED['abstract'] = PUBMED['abstract'].pipe(preprocess_abstracts, nlp_model=nlp)

# OPTIONAL: detect common phrases per corpus (concat_common_phrases is not imported above)
# PUBMED['abstract'] = PUBMED.groupby(['category','subcategory'])['abstract'].progress_transform(
#   lambda abstracts: concat_common_phrases(abstracts)
# )

# store output
PUBMED.to_csv(OUTPUT_FILE, index=False, compression='gzip')

100%|██████████| 323150/323150 [2:29:51<00:00, 35.94it/s]
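
The `DEV_MODE` branch above shrinks the dataset by keeping only subcategories with fewer than 20 articles. The selection can be sketched on a toy frame as follows; the data and the `MAX_ARTICLES` name are hypothetical, the threshold is lowered to 3 to fit the toy data, and `isin` is used in place of `query` (both select the same rows here).

```python
import pandas as pd

# Hypothetical toy corpus: subcategory A has 3 articles, B has 1, C has 2.
toy = pd.DataFrame({
  'subcategory': ['A', 'A', 'A', 'B', 'C', 'C'],
  'abstract': ['First A', 'Second A', 'Third A', 'Only B', 'First C', 'Second C'],
})

MAX_ARTICLES = 3  # the notebook uses 20; lowered here for the toy data

# Count articles per subcategory and keep the names of the small ones.
counts = toy['subcategory'].value_counts()
small_subcats = counts[counts < MAX_ARTICLES].index.to_list()

# Keep only the small subcategories, then lowercase as the notebook does.
dev_subset = toy[toy['subcategory'].isin(small_subcats)].copy()
dev_subset['abstract'] = dev_subset['abstract'].str.lower()

print(dev_subset)
```

Selecting the smallest subcategories keeps every retained corpus complete, so per-corpus steps still see whole corpora while the total number of abstracts drops.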
