# Annotating the cognitive tasks corpus

This notebook provides a graphical interface for annotating PubMed articles, marking each article as relevant or irrelevant to the provided context.

Feel free to run all the cells up to the last one, which saves the annotated output to the `data/pubmed/tests_annotated.csv` file.

## Setup

In [None]:
from pigeon import annotate
import pandas as pd
from pathlib import Path
from tqdm import tqdm
from IPython.display import display

## Load corpus

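The following cell loads all the CSV files from the `data/pubmed/tests/` directory and prepares them for annotation. It does nothing if `data/pubmed/tests_annotated.csv` already exists, so existing annotations are not overwritten.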

In [None]:

if not Path('data/pubmed/tests_annotated.csv').exists():
    csv_files = Path('data/pubmed').glob('tests/*.csv')

    corpora = []
    for csv_file in tqdm(csv_files, desc='Reading CSV files'):
        df = pd.read_csv(csv_file)
        df['corpus_name'] = csv_file.stem
        corpora.append(df)

    df = pd.concat(corpora, axis=0)
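    # Fall back to the title when an abstract is missing (assignment avoids chained in-place fillna).
    df['abstract'] = df['abstract'].fillna(df['title'])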
    df['is_relevant'] = None
    df.to_csv('data/pubmed/tests_annotated.csv', index=False)
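
To make the preparation step concrete, here is a minimal sketch on in-memory data. The two toy frames and their names (`corpus_a`, `corpus_b`) are hypothetical stand-ins for the CSV files under `data/pubmed/tests/`; only the column names come from the cell above.

```python
import pandas as pd

# Hypothetical corpora standing in for the CSV files; names and values are made up.
toy_corpora = {
    'corpus_a': pd.DataFrame({'title': ['T1', 'T2'], 'abstract': ['A1', None]}),
    'corpus_b': pd.DataFrame({'title': ['T3'], 'abstract': [None]}),
}

frames = []
for name, frame in toy_corpora.items():
    frame = frame.copy()
    frame['corpus_name'] = name  # remember which corpus each row came from
    frames.append(frame)

toy = pd.concat(frames, axis=0, ignore_index=True)
toy['abstract'] = toy['abstract'].fillna(toy['title'])  # missing abstract -> title
toy['is_relevant'] = None  # nothing is annotated yet
print(toy)
```

Rows without an abstract end up showing their title instead, so every article has some text to judge during annotation.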

## Annotate PubMed corpus

For each article, you will be asked to mark it as `relevant` or `irrelevant`. Before choosing, check that the article meets all of the following criteria:

- `journal_title` refers to a cognitive science journal.
- `title` and `abstract` are related, in both theory and practice, to the context given in `corpus_name`.
- All the other features make sense.

You will have three choices: relevant, irrelevant, and skip.

When you are done annotating, run the next cell to store your work. The next time you run this annotation process, you will only see the articles that have not been annotated yet.

In [None]:
pd.set_option('display.max_colwidth', None)

df = pd.read_csv('data/pubmed/tests_annotated.csv')
annotations = annotate(
    df[df['is_relevant'].isna()].index,  # only rows that are not annotated yet
    options=['relevant', 'irrelevant'],
    shuffle=True,
    display_fn=lambda x: display(df.loc[[x]].T)  # show one article, one field per row
)
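
The "only articles that are not yet annotated" behavior comes down to filtering on missing values in `is_relevant`. A minimal sketch, assuming a toy frame with hypothetical titles in which two rows already carry a label:

```python
import pandas as pd

# Hypothetical data: rows 0 and 2 are already annotated, rows 1 and 3 are not.
toy = pd.DataFrame({
    'title': ['T1', 'T2', 'T3', 'T4'],
    'is_relevant': [True, None, False, None],
})

# Row labels that still need a decision; these are what gets shown for annotation.
pending = toy[toy['is_relevant'].isna()].index
print(list(pending))
```

Skipped articles keep a missing value in `is_relevant`, so they stay in this pending set.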

## Store the newly annotated articles

**Do NOT run this cell until you are done annotating and want to save the results.**

In [None]:
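# Map each annotated row index to a boolean: True for 'relevant', False for 'irrelevant'.
annotations_dict = {idx: label == 'relevant' for idx, label in annotations}
df['is_relevant'] = df['is_relevant'].astype(object)  # let booleans sit next to missing values
df.loc[list(annotations_dict.keys()), 'is_relevant'] = list(annotations_dict.values())
df.to_csv('data/pubmed/tests_annotated.csv', index=False)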
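
The save-and-resume cycle can be sketched end to end without the annotation widget. This example assumes the annotations arrive as `(row index, label)` pairs, as the cell above implies. The toy frame and the `toy_annotations` list are hypothetical, and an in-memory buffer stands in for the CSV file.

```python
import io
import pandas as pd

toy = pd.DataFrame({'title': ['T1', 'T2', 'T3'], 'is_relevant': [None, None, None]})

# Hypothetical annotation result: row 1 was skipped, so it has no entry.
toy_annotations = [(0, 'relevant'), (2, 'irrelevant')]

labels = {idx: label == 'relevant' for idx, label in toy_annotations}
toy.loc[list(labels.keys()), 'is_relevant'] = list(labels.values())

# Round-trip through CSV text, as the notebook does with the file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# After reloading, only the skipped row is still waiting for a label.
pending = reloaded[reloaded['is_relevant'].isna()].index
print(list(pending))
```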