Many analysis steps in the EFO/cogtext project require substantial computational power and time to complete. As a workaround to unblock the remaining steps and proceed with the analysis, here we create a dataset that is 20% of the size of the original PubMed abstracts dataset. The original dataset contains more than 500_000 abstracts, and the preprocessed dataset contains 300_000 articles. Here, we limit each of them to 20% of its size. Sampling is done within each label, and we keep at least 1 abstract per label, so the final size is only approximately 20%.
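The per-label rule described above can be sketched on a toy table. This is a minimal illustration, not the project's code: the helper name `sample_per_label`, the `seed` argument, and the toy data are assumptions made for the example. Each label keeps `floor(fraction * group size)` rows, with a minimum of one.

```python
import pandas as pd


def sample_per_label(df, fraction, label_col="label", seed=0):
    """Sample `fraction` of the rows within each label, keeping at least one row per label.

    Hypothetical helper for illustration; `seed` makes the draw repeatable.
    """
    parts = []
    for _, grp in df.groupby(label_col):
        # floor of the scaled group size, but never fewer than one row
        n = max(int(len(grp) * fraction), 1)
        parts.append(grp.sample(n=n, random_state=seed))
    return pd.concat(parts)


# toy data: 10 rows for label A, 3 for B, 1 for C
toy = pd.DataFrame({
    "label": ["A"] * 10 + ["B"] * 3 + ["C"],
    "abstract": [f"text {i}" for i in range(14)],
})

sampled = sample_per_label(toy, fraction=0.2)
counts = sampled["label"].value_counts().to_dict()
print(counts)
```

With `fraction=0.2`, label A contributes 2 rows, while B and C contribute 1 row each because of the minimum, so small labels end up over-represented relative to an exact 20% sample.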


In [1]:
import pandas as pd
from tqdm import tqdm

In [2]:
# parameters
DATASET_NAMES = ['pubmed_abstracts', 'pubmed_abstracts_preprocessed']
fraction = .2

In [3]:
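# load and prep

for dataset_name in tqdm(DATASET_NAMES):
  PUBMED = pd.read_csv(f'data/{dataset_name}.csv.gz')
  # keep the original row position so sampled rows can be traced back
  PUBMED['original_index'] = PUBMED.index
  # drop rows without an abstract; drop=True avoids a redundant 'index'
  # column, since 'original_index' already holds the same values
  PUBMED = PUBMED.dropna(subset=['abstract']).reset_index(drop=True)
  # the raw dataset calls the label column 'subcategory'
  PUBMED = PUBMED.rename(columns={'subcategory': 'label'}, errors='ignore')

  # sample within each label: floor(fraction * group size), but at least 1 row
  # note: groupby skips rows whose label is NaN, so they are not sampled
  if fraction < 1.0:
    PUBMED = PUBMED.groupby('label', group_keys=False).apply(
        lambda grp: grp.sample(n=max(int(len(grp) * fraction), 1))
    )

  # save, e.g. 'pubmed_abstracts' -> 'pubmed20pct_abstracts'
  new_dataset_name = dataset_name.replace('pubmed_', f'pubmed{int(100*fraction)}pct_')
  PUBMED.to_csv(f'data/{new_dataset_name}.csv.gz')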

print('Done!')

100%|██████████| 2/2 [00:39<00:00, 19.91s/it]

Done!
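As a sanity check on the "at least 1 abstract per label" goal, one could compare the labels before and after sampling. The sketch below assumes both tables have a `label` column; the helper name `missing_labels` and the toy frames are hypothetical and stand in for the real datasets.

```python
import pandas as pd


def missing_labels(original, sampled, label_col="label"):
    """Return the labels present in `original` but absent from `sampled`.

    Hypothetical helper for illustration; NaN labels are ignored.
    """
    return set(original[label_col].dropna()) - set(sampled[label_col].dropna())


# toy stand-ins for the full and the sampled datasets
original = pd.DataFrame({"label": ["A", "A", "B", "C"]})
complete_sample = original.iloc[[0, 2, 3]]   # one row per label
incomplete_sample = original.iloc[[0, 1]]    # only label A survives

ok = missing_labels(original, complete_sample)
lost = missing_labels(original, incomplete_sample)
print(ok, lost)
```

An empty result means every label survived the sampling; a non-empty one lists the labels that were lost.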



