# Annotating cognitive tasks corpus

This notebook implements a graphical interface for annotating the PubMed cognitive tests corpus, marking each article as relevant or irrelevant to the provided context.

First, run all the cells except the last one. The last cell saves the annotated outputs to the `data/pubmed/annotated_cognitive_tests_corpus.csv` file. You don't need to annotate all the articles at once: each time you run the notebook, it shows only the articles that have not been annotated yet.

## Setup

In [1]:
# from pigeon import annotate
import pigeonXT as pixt
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from IPython.display import display

## Load corpus

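The following cell loads all the CSV files from the `data/pubmed/tests/` directory and prepares them for annotation. If the aggregated file `data/pubmed/annotated_cognitive_tests_corpus.csv` already exists, it is loaded instead.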

In [2]:

if not Path('data/pubmed/annotated_cognitive_tests_corpus.csv').exists():
    csv_files = Path('data/pubmed').glob('tests/*.csv')

    corpora = []
    for csv_file in tqdm(csv_files, desc='Reading CSV files'):
        df = pd.read_csv(csv_file)
        df['corpus_name'] = csv_file.stem
        corpora.append(df)

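    # ignore_index=True gives every row a unique label, which the annotation step relies on
    df = pd.concat(corpora, axis=0, ignore_index=True)
    # fall back to the title when the abstract is missing
    df['abstract'] = df['abstract'].fillna(df['title'])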
    df['is_annotated'] = False
    df['annotation'] = None
    df.to_csv('data/pubmed/annotated_cognitive_tests_corpus.csv', index=False, na_rep='n/a')
else:
    # load the aggregated corpus if the aggregated file already exists
    df = pd.read_csv('data/pubmed/annotated_cognitive_tests_corpus.csv')

# workaround to discard unusual terminators in the text
df['abstract'] = df['abstract'].apply(lambda x: x.replace('\u2029', ' ') if isinstance(x, str) else x)
df['title'] = df['title'].apply(lambda x: x.replace('\u2029', ' ') if isinstance(x, str) else x)

Reading CSV files: 91it [00:05, 15.57it/s]
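
The terminator workaround in the cell above can be illustrated in isolation. The sketch below is not part of the notebook: `strip_paragraph_separators` is a hypothetical helper that mirrors the lambda, and the sample string is made up. It shows one plausible reason for the cleanup (an assumption, not something the notebook states): Python's `str.splitlines()` treats U+2029 (PARAGRAPH SEPARATOR) as a line boundary, so text containing it may be broken up unexpectedly downstream.

```python
def strip_paragraph_separators(text):
    """Replace U+2029 with a space; pass non-string values (e.g. NaN) through unchanged."""
    if isinstance(text, str):
        return text.replace('\u2029', ' ')
    return text


# Made-up sample text containing a paragraph separator.
sample = 'Stroop task\u2029A classic test of inhibition.'

# Before cleaning, splitlines() sees two lines; after cleaning, only one.
raw_lines = sample.splitlines()
clean_lines = strip_paragraph_separators(sample).splitlines()

# Missing values are returned as-is, matching the isinstance check in the notebook.
missing = strip_paragraph_separators(float('nan'))
```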


## Annotate PubMed corpus

For each article, you will be asked to annotate it as `relevant` or `irrelevant`. Before marking an article as relevant, check that it satisfies the following constraints:

- `journal_title` refers to a cognitive science journal.
- `title` and `abstract` are related, in both theory and practice, to the context given in `corpus_name`.
- All the other features make sense.

You will have three choices: relevant, irrelevant, and skip.

When you are done annotating, run the next cell to store your work. The next time you run this annotation process, you will see only the articles that have not been annotated yet.
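
As a rough illustration of how already-annotated articles drop out of the queue, the sketch below applies the same condition the notebook uses (flag missing or `False`) to a toy DataFrame. The rows and the names `toy`, `pending`, and `pending_index` are made up for this example and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the aggregated corpus; the rows are invented for illustration.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'is_annotated': [True, False, None, False],
})

# An article is still pending when its flag is missing or explicitly False.
pending = toy[toy['is_annotated'].isna() | toy['is_annotated'].eq(False)]

# These index labels are what would be handed to the annotation widget.
pending_index = pending.index.to_list()
```

Only the index labels are passed on, so the widget can look up each row in the full DataFrame when it displays an article.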

In [3]:
pd.set_option('display.max_colwidth', None)

def store_annotations(annotations, path='data/pubmed/annotated_cognitive_tests_corpus.csv'):
    # write the labels back into the global `df`;
    # .to_numpy() stops pandas from aligning on the annotations' own 0..n index
    rows = list(annotations['row_index'])
    df.loc[rows, 'is_annotated'] = annotations['changed'].to_numpy()
    df.loc[rows, 'annotation'] = annotations['annotation'].to_numpy()
    df.to_csv(path, index=False, na_rep='n/a')


annotations = pixt.annotate(
  df.query('is_annotated.isna() or (is_annotated == False)').index.to_list(),
  options=['relevant','irrelevant'],
  # task_type='multilabel-classification',
  # shuffle=True,
  buttons_in_a_row=5,
  example_column='row_index',
  value_column='annotation',
  # look the row up by index label, the same way store_annotations does
  display_fn=lambda x: display(x, df.loc[[x]].T),
  final_process_fn=store_annotations
)

HTML(value='0 of 91229 Examples annotated, Current Position: 0 ')

HBox(children=(Button(description='relevant', style=ButtonStyle()), Button(description='irrelevant', style=But…

Output()

## Store the newly annotated articles

**Do NOT run this cell unless you are done annotating and want to save the results.**

In [8]:
store_annotations(annotations)
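
To check what was saved, the file can be reloaded and summarized. The sketch below is a hedged illustration rather than part of the notebook: it uses an in-memory buffer and invented rows instead of the real CSV, and the names `toy`, `reloaded`, `n_done`, and `label_counts` are assumptions. It also shows that the `'n/a'` placeholder written by `na_rep='n/a'` is read back as a missing value, because `'n/a'` is among the strings `pd.read_csv` treats as NaN by default.

```python
import io

import pandas as pd

# Invented rows standing in for a partially annotated corpus.
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'is_annotated': [True, True, False],
    'annotation': ['relevant', 'irrelevant', None],
})

# Write with the same options the notebook uses, then read the text back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False, na_rep='n/a')
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Number of annotated articles and a per-label tally (missing labels are excluded).
n_done = int(reloaded['is_annotated'].sum())
label_counts = reloaded['annotation'].value_counts().to_dict()
```

With the real data, replacing the buffer with the path `data/pubmed/annotated_cognitive_tests_corpus.csv` in `pd.read_csv` should give the same kind of summary.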