# Preprocessing

Preprocess PubMed abstract texts by applying the following steps:

- tokenize,
- remove punctuation marks, numbers, and symbols,
- build the dictionary,
- remove stop words,
- stem,
- lemmatize,
- detect n-gram phrases to identify common phrases,
- concatenate the tokens back into a single string.
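
To make the steps above concrete, here is a minimal, standard-library-only sketch of such a pipeline. It is not the implementation of `preprocess_abstracts`: the stop-word set, the `naive_lemma` helper, and the bigram counting with a `min_count` threshold are all assumptions made for illustration, and the dictionary-building and stemming steps are left out.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set (assumption; not spaCy's list).
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "was", "study", "task", "test"}


def tokenize(text):
    """Lowercase and keep alphabetic runs only, dropping punctuation, numbers, and symbols."""
    return re.findall(r"[a-z]+", text.lower())


def naive_lemma(token):
    """Crude plural stripping, a hypothetical stand-in for a real lemmatizer."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def detect_bigrams(docs, min_count=2):
    """Return adjacent token pairs that occur at least `min_count` times in the corpus."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}


def merge_phrases(doc, phrases):
    """Join detected bigrams into single tokens such as `working_memory`."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            merged.append(doc[i] + "_" + doc[i + 1])
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged


def preprocess_corpus(abstracts):
    """Tokenize, drop stop words, lemmatize, detect phrases, and re-join each abstract."""
    docs = [
        [naive_lemma(t) for t in tokenize(a) if t not in STOP_WORDS]
        for a in abstracts
    ]
    phrases = detect_bigrams(docs)
    return [" ".join(merge_phrases(d, phrases)) for d in docs]


corpus = [
    "Working memory was assessed in 42 adults.",
    "The N-back task loads working memory.",
]
cleaned = preprocess_corpus(corpus)
print(cleaned)
# → ['working_memory assessed adult', 'n back load working_memory']
```

Phrases are detected over the whole corpus passed to `preprocess_corpus`, so the result for one abstract depends on the other abstracts processed with it.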

All of the above operations are performed separately for each cognitive task/construct corpus. However, the output is a single aggregated CSV file with the same columns as the input file (`data/pubmed_abstracts.csv.gz`); only the `abstract` column is cleaned up. The CSV is stored at `data/pubmed_abstracts_preprocessed.csv.gz`.
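
As a small illustration of why the output keeps the shape of the input, the sketch below runs a per-group transformation on a toy data frame. The `clean_group` function is a hypothetical stand-in that only lowercases text, and the toy rows are made up; the real cleaning is done by `preprocess_abstracts`.

```python
import pandas as pd

# Toy data with the column names used in this notebook (the rows are invented).
df = pd.DataFrame({
    "category": ["task", "task", "construct"],
    "subcategory": ["Stroop", "Stroop", "Inhibition"],
    "abstract": ["First Abstract", "Second Abstract", "Third Abstract"],
})


def clean_group(abstracts):
    # Hypothetical per-corpus cleaning: lowercase every abstract in the group.
    # Returning a Series with the group's index keeps rows aligned with the input.
    return pd.Series([a.lower() for a in abstracts], index=abstracts.index)


out = df.copy()
out["abstract"] = (
    df.groupby(["category", "subcategory"])["abstract"].transform(clean_group)
)
print(out)
```

Because `transform` returns one value per input row, `out` has the same rows and columns as `df`, and only the `abstract` column differs.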

**Note:** Preprocessing the full corpus is time-consuming; it takes about 12 hours. Be mindful when running, and have fun!

In [None]:
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import pandas as pd
from tqdm import tqdm
import spacy
from python.cogtext.preprocess_abstracts import preprocess_abstracts

tqdm.pandas()

In [None]:
# parameters
DEBUG = True

INPUT_FILE = Path('data/pubmed_abstracts.csv.gz')
OUTPUT_FILE = Path('data/pubmed_abstracts_preprocessed.csv.gz')

CUSTOM_STOP_WORDS = ['study', 'task', 'test']

nlp = spacy.load('en_core_web_sm')

# additional stop words
for stop_word in CUSTOM_STOP_WORDS:
  lexeme = nlp.vocab[stop_word]
  lexeme.is_stop = True


In [None]:
# load raw dataset (~30s)
df = pd.read_csv(INPUT_FILE, compression='gzip')
# assign the result back instead of relying on in-place modification of a column
df['abstract'] = df['abstract'].fillna('')

# DEBUG: uncomment to reduce the dataset size and speed up development
# if DEBUG:
#   subcats_cnt = df['subcategory'].value_counts()
#   small_subcats = subcats_cnt[subcats_cnt < 20].index.to_list()
#   df = df.query('subcategory in @small_subcats')

# preprocess (~12h)
df['abstract'] = df.groupby(['category','subcategory'])['abstract'].progress_transform(
  lambda abstracts: preprocess_abstracts(abstracts.to_list(), nlp_model=nlp, extract_phrases=True)
)

# store output
df.to_csv(OUTPUT_FILE, index=False, compression='gzip')