# Preprocessing

Preprocess the PubMed abstracts by applying the following steps:

- tokenize,
- remove punctuation marks, numbers, and symbols,
- build a dictionary,
- remove stop words,
- stem,
- lemmatize,
- detect n-gram phrases to identify common phrases,
- concatenate the tokens back into a single string.

All of the above operations are performed separately for each cognitive task/construct corpus. The output, however, is a single aggregated CSV file with the same columns as the input file; only the `abstract` column is cleaned up. The CSV is stored at `data/pubmed_abstracts_preprocessed.csv.gz`.
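As a rough illustration of the steps listed above, the following is a minimal standard-library sketch. It is not the implementation behind `preprocess_abstracts`, which relies on spaCy; the stop-word list, the toy suffix stripper (standing in for real stemming and lemmatization), and all function names here are assumptions made for illustration, and the dictionary-building step is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the notebook uses spaCy's list plus custom additions.
STOP_WORDS = {"the", "a", "of", "in", "and", "is", "was", "study", "task", "test"}

def tokenize(text):
    """Lowercase and keep alphabetic tokens only (drops punctuation, numbers, symbols)."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer/lemmatizer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def detect_phrases(docs, min_count=2):
    """Return bigrams that occur at least `min_count` times across the corpus."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    return {pair for pair, n in counts.items() if n >= min_count}

def merge_phrases(doc, phrases):
    """Join detected bigrams with an underscore, scanning left to right."""
    merged, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            merged.append(f"{doc[i]}_{doc[i + 1]}")
            i += 2
        else:
            merged.append(doc[i])
            i += 1
    return merged

def preprocess_corpus(texts):
    """Tokenize, drop stop words, stem, detect phrases, and re-join each document."""
    docs = [[naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS] for text in texts]
    phrases = detect_phrases(docs)
    return [" ".join(merge_phrases(doc, phrases)) for doc in docs]

corpus = [
    "Working memory was assessed in 42 adults.",
    "The task measures working memory capacity!",
]
cleaned = preprocess_corpus(corpus)
print(cleaned)
# → ['work_memory assess adult', 'measur work_memory capacity']
```

Phrase detection runs over the whole corpus rather than one document at a time, which is one reason the steps are applied per corpus.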

**Note:** Preprocessing a large corpus is time-consuming; the full run takes about 12 hours. Be mindful when running, and have fun!

In [1]:
%reload_ext autoreload
%autoreload 2

from pathlib import Path
import pandas as pd
from tqdm import tqdm
import spacy
from python.cogtext.preprocess_abstracts import preprocess_abstracts

tqdm.pandas()

In [2]:
# parameters
DEBUG = True

INPUT_FILE = Path('data/pubmed_abstracts.csv.gz')
OUTPUT_FILE = Path('data/pubmed_abstracts_preprocessed.csv.gz')

CUSTOM_STOP_WORDS = ['study', 'task', 'test']

# nlp = spacy.load('en_core_web_sm')  # efficient
nlp = spacy.load('en_core_web_trf')  # accurate

# additional stop words
for stop_word in CUSTOM_STOP_WORDS:
  lexeme = nlp.vocab[stop_word]
  lexeme.is_stop = True

df = pd.read_csv(INPUT_FILE, compression='gzip')

## Fix "Cognitive Flexibility" Corpus

Analysis of the PubMed corpora for "Cognitive Flexibility" and "Cognitive Flexibility Test" produces highly correlated task-construct co-occurrences. Here, we investigate whether this correlation is an artifact of PubMed querying issues.

The following code removes the papers that seem irrelevant to the task/construct.

In [3]:
# FIXME: this should happen after tokenizing and preprocessing the abstracts to cover more generic cases.


C = df.query('category.str.contains("Construct") and subcategory.str.contains("Flexibility", case=False)')
M = df.query('category.str.contains("Task") and subcategory.str.contains("Flexibility", case=False)')

# non-capturing group (?:...) avoids the pandas "pattern has match groups" warning
invalid_C = C.query('abstract.str.contains("flexibility (?:task|test)", case=False, na=False)')
invalid_M = M.query('not abstract.str.contains("flexibility (?:task|test)", case=False, na=False)')

df.drop(invalid_C.index, inplace=True)
df.drop(invalid_M.index, inplace=True)

# DEBUG
print(f'Found {len(C)} construct articles and {len(M)} task articles in the "Cognitive Flexibility" corpus.')
print(f'Removed {len(invalid_C)} construct articles and {len(invalid_M)} task articles from the "Cognitive Flexibility" corpus.')

Found 3566 construct articles and 240 task articles in the "Cognitive Flexibility" corpus.
Removed 0 construct articles and 0 task articles from the "Cognitive Flexibility" corpus.
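
The filtering rule applied above can be illustrated on toy data with the standard library alone. This is only a sketch of the logic, not the notebook's pandas code: the records, the category values `CognitiveConstruct` and `CognitiveTask`, and the helper `is_valid` are made up for illustration.

```python
import re

# Non-capturing group: matches "flexibility task" or "flexibility test", case-insensitively.
PATTERN = re.compile(r"flexibility (?:task|test)", re.IGNORECASE)

# Toy records; field names mirror the notebook's columns, values are made up.
records = [
    {"category": "CognitiveConstruct", "subcategory": "Cognitive Flexibility",
     "abstract": "We relate cognitive flexibility to aging."},
    {"category": "CognitiveConstruct", "subcategory": "Cognitive Flexibility",
     "abstract": "Participants completed the Cognitive Flexibility Test."},
    {"category": "CognitiveTask", "subcategory": "Cognitive Flexibility Test",
     "abstract": "A flexibility task was administered."},
    {"category": "CognitiveTask", "subcategory": "Cognitive Flexibility Test",
     "abstract": None},
]

def is_valid(record):
    """Keep construct articles that do NOT mention the task, and task articles that do."""
    if "flexibility" not in record["subcategory"].lower():
        return True  # other corpora are left untouched
    mentions_task = bool(PATTERN.search(record["abstract"] or ""))  # missing abstract = no mention
    if "Construct" in record["category"]:
        return not mentions_task
    if "Task" in record["category"]:
        return mentions_task
    return True

kept = [r for r in records if is_valid(r)]
print(len(kept))
# → 2
```

A missing abstract counts as "no mention", which mirrors the `na=False` argument in the cell above.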


## Preprocessing the abstracts

In [23]:
# assign back instead of using inplace on a column selection (unreliable in recent pandas)
df['abstract'] = df['abstract'].fillna('')

# DEBUG: uncomment to reduce the dataset size and speed up development
# if DEBUG:
#   subcats_cnt = df['subcategory'].value_counts()
#   small_subcats = subcats_cnt[subcats_cnt < 20].index.to_list()
#   df = df.query('subcategory in @small_subcats').copy()

# preprocess (~12h)
df['abstract'] = df.groupby(['category','subcategory'])['abstract'].progress_transform(
  lambda abstracts: preprocess_abstracts(abstracts.str.lower().to_list(), nlp_model=nlp, extract_phrases=True)
)

# store output
df.to_csv(OUTPUT_FILE, index=False, compression='gzip')

100%|██████████| 32/32 [00:55<00:00,  1.74s/it]


# Task-Construct co-appearance matrix

This notebook generates a matrix of task/construct co-appearance frequencies. Each value is the number of articles in which both the task and the cognitive construct are mentioned.


## Output

The co-appearance matrix is stored in sparse format at `data/pubmed_coappearances.csv.gz` (the path used in the code below) with the following columns:

 - `task`: Name of the cognitive task (a.k.a. subcategory in the pubmed_abstracts dataset).
 - `construct`: Name of the cognitive construct.
 - `task_corpus_size`: Number of articles in the cognitive task corpus.
 - `construct_corpus_size`: Number of articles in the cognitive construct corpus.
 - `union_corpus_size`: Total number of unique articles in either of the two corpora.
 - `intersection_corpus_size`: Number of articles that are shared in the two corpora.

**Note**: Values in the matrix are neither normalized nor scaled.
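
To make the column definitions above concrete, here is a minimal sketch that computes them from sets of article identifiers. It does not reproduce `generate_coappearance_matrix`; the corpora, the integer article IDs, and the helper `coappearance_rows` are hypothetical.

```python
from itertools import product

# Hypothetical corpora: each label maps to the set of article IDs (e.g. PMIDs) it contains.
task_corpora = {"Stroop": {1, 2, 3, 4}, "N-back": {4, 5}}
construct_corpora = {"Inhibition": {2, 3, 6}, "Working Memory": {4, 5, 7}}

def coappearance_rows(tasks, constructs):
    """Build one row per (task, construct) pair with raw, unnormalized counts."""
    rows = []
    for (task, t_ids), (construct, c_ids) in product(tasks.items(), constructs.items()):
        rows.append({
            "task": task,
            "construct": construct,
            "task_corpus_size": len(t_ids),
            "construct_corpus_size": len(c_ids),
            "union_corpus_size": len(t_ids | c_ids),         # unique articles in either corpus
            "intersection_corpus_size": len(t_ids & c_ids),  # articles shared by both corpora
        })
    return rows

rows = coappearance_rows(task_corpora, construct_corpora)
for row in rows:
    print(row)
```

Because the stored counts are raw, a normalized overlap such as the Jaccard index can be derived afterwards as `intersection_corpus_size / union_corpus_size`.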

In [4]:
import pandas as pd
from python.cogtext.coappearance_matrix import generate_coappearance_matrix
from pathlib import Path

# parameters
COAPPEARANCE_OUTPUT_FILE = Path('data/pubmed_coappearances.csv.gz')

# DEBUG: reload the raw data (adds about 15 seconds)
# INPUT_FILE = Path('data/pubmed_abstracts.csv.gz')
# df = pd.read_csv(INPUT_FILE)

coappearances = generate_coappearance_matrix(df)
coappearances.to_csv(COAPPEARANCE_OUTPUT_FILE, index=False, compression='gzip')

100%|██████████| 100/100 [18:53<00:00, 11.33s/it]
