# Parser Evaluation

There is an [earlier version](https://colab.research.google.com/github/petervdonovan/CitationParser/blob/master/ParserEvaluation.ipynb) of this notebook. Much of the work here parallels that version but differs from it. There are a few reasons why it was necessary to rethink the prior work:

1. Tags will not be a component of the testing or training dataset.

    * This does not mean that tags will not continue to be important. However, I have decided that our objective -- use of citations to create edges in a social network -- does not inherently involve tagging.

1. In lieu of tags, metadata is being included in a format that is as independent from its origin as possible.

    * Names in particular do require this. In bibliographic entries, the formatting of names varies widely by style guide. If names are to be considered as semantic information instead of as raw text, they should be presented in a data structure that makes the different parts of a name explicit.

1. Metrics used to evaluate models will be carefully chosen to make them as interpretable as possible for individuals who do not know any low-level details of how our models operate.

    * Changes in this direction reduce the amount of text that is required alongside any reported metrics.

There may be several parts of this notebook where you may cry out, "I have seen all of this before, in another notebook! This is not DRY!" I insist on justifying myself by noting that in writing this notebook I have a slightly different purpose in mind, and that I do not wish to be constrained by any links or dependencies on an implementation that I may not wish to keep.

In [1]:
!pip install pickle5
import pandas as pd
import numpy as np
import pickle5 as pickle
import random
import time
import itertools
import os
import matplotlib.pyplot as plt
from gensim.utils import simple_preprocess

from People.NameList import NameList
#from google.colab import drive
#drive.mount('/content/drive')
#%cd /content/drive/MyDrive/AWCA/Colab_notebooks/CitationTagging/Sp21/CitationParser/



## Datasets

### Note on dataset selection

For the moment, I will be loading the OpenCitations dataset for initial tests. There are a couple of serious limitations to this dataset which we should be aware of:
* It has a replicability issue: Because the OpenCitations dataset is too large to support a SPARQL query that asks for the DOI of _every_ cited work for which we have a raw-text citation, I had to use a LIMIT clause. There are no promises about whether the sample I got was representative of the corpus, so this is a little unfortunate.
* It is probably dissimilar to our dataset in important ways. Specifically, many of the bibliographic entities documented in the particular portion of the dataset that I have downloaded are journal articles on biology and medicine -- a fact that may be reflected in the style guides that are used.

For these reasons, OpenCitations is an additional tool, not a replacement for our other datasets. I am using it for the moment so that I can focus on developing a parser using a clean dataset, without having to worry about whether my results are valid or whether the dataset generation process has bugs.

Another challenge is that the Zotero dataset does not have raw text citations that are reliably matched with the corresponding metadata. A _mostly_ successful attempt to tackle this issue is included in [this notebook](https://colab.research.google.com/drive/1OEFWVWgEzCiPA35Ma20svEY5OhPCuPwI), but it is a beast, and the output has imperfections of a severity that is difficult to quantify definitively.

In any case, the OpenCitations dataset will be used for development. Long-term goals might be:
* To include a more representative (or complete) sample of the OpenCitations corpus, for replicability, and
* To include the Zotero dataset.

### Fields in the Dataset

The dataset has the following fields for the models to use for prediction:
* **raw**: The raw text citation, represented as a string.

The dataset has the following fields for the models to predict:
* **author**: The name(s) of the authors of the work, represented as `NameList` objects.
* **year**: The publication year of the work, represented as an integer.
* **title**: The title of the work, represented as a string.

These three fields are the ones that are likely candidates to be used to construct edges; the others are much more questionable. Each of them is commonly found -- for example, 98.8% of records in the Zotero database have a publication year, 99.6% have an author, and 99.7% have a title.

Every row used in this dataset will have all three fields, i.e., the datasets will look like someone called `dropna` on them to eliminate null values in those columns. I acknowledge that this means that what is left may not be representative of the raw-text citations that may be out there in the wild; however, as mentioned in the above paragraph, those three fields are fairly commonplace.

It is worth noting, however, that other fields are available in case we decide to incorporate them into our analysis later. They are less common and less reliable, so I would only include them as a very late micro-optimization to our model, if at all.
* pages: The pages in which the work appeared in the container (book, anthology, journal issue, etc.) in which it was published, represented as a string containing digits, a hyphen, and then more digits (or just digits, if the work appears on only one page)
* volume: The volume in which the work appears, represented as a string.
* source_title: The name of the journal or other entity responsible for the publication of the work.
* issue: The issue in which the work appeared.

Matching surnames, matching titles, and matching years should be sufficient grounds for declaring a match. `simple_preprocess` from `gensim.utils` with `deacc=True` should be enough to get reliable matches, provided the fields can be extracted correctly.
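
To make the matching rule concrete, here is a minimal sketch. It uses a standard-library stand-in, `normalize`, that only approximates what `simple_preprocess(..., deacc=True)` does (lowercasing, stripping accents, dropping punctuation and digits, and discarding tokens shorter than 2 or longer than 15 characters). Both `normalize` and `citation_key` are hypothetical names introduced for illustration; they are not part of Gensim or of this project, and the example record is made up.

```python
import re
import unicodedata


def normalize(text, min_len=2, max_len=15):
    """Approximate simple_preprocess(text, deacc=True) using only the stdlib."""
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize('NFD', text)
    unaccented = ''.join(
        ch for ch in decomposed if unicodedata.category(ch) != 'Mn')
    # Keep runs of letters/underscores; digits and punctuation split tokens.
    tokens = re.findall(r'[^\W\d]+', unaccented.lower())
    return [tok for tok in tokens if min_len <= len(tok) <= max_len]


def citation_key(surnames, year, title):
    """Build a hashable key; two citations match when their keys are equal."""
    return (
        frozenset(tuple(normalize(surname)) for surname in surnames),
        int(year),
        tuple(normalize(title)),
    )


# A made-up record, once as "reported metadata" and once as "parser output".
reported = citation_key(['Ertaş', 'Gülçür'], 2010, 'The cát, in the hát.')
extracted = citation_key(['gulcur', 'ERTAS'], '2010', 'the cAt  iN tHe haT!!!?')
print(reported == extracted)
```

Using a set of surnames makes the comparison independent of author order, which is the same simplification that a set-based surname metric would make.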



In [2]:
with open('datasets/occ_45K.pickle', 'rb') as dbfile:
  occ_45K = pickle.load(dbfile)
occ = pd.DataFrame()
occ['raw']    = occ_45K.raw
occ['author'] = occ_45K.author
occ['year']   = occ_45K.year
occ['title']  = occ_45K.title
print('I wished to get metadata for {} DOIs downloaded from the OpenCitations\n'
      'SPARQL endpoint, '.format(
    len(occ.index)
), end='')
occ = occ.dropna()
print('but only {} complete rows of metadata were received.'.format(
    len(occ.index)
))
occ.sample(2)

I wished to get metadata for 45756 DOIs downloaded from the OpenCitations
SPARQL endpoint, but only 45280 complete rows of metadata were received.


| | raw | author | year | title |
| --- | --- | --- | --- | --- |
| 6629 | Blennow, G, McNeil, TF. Neurological deviation... | Blennow, G.; Mcneil, T. F. | 1991 | Neurological Deviations In Newborns At Psychia... |
| 33955 | Dandawate, P, Padhye, S, Ahmad, A, Sarkar, FH.... | Dandawate, Prasad; Padhye, Subhash; Ahmad, Aam... | 2012 | Novel Strategies Targeting Cancer Stem Cells T... |


In [3]:
bad_author_idx = [
    idx for idx in occ.index
    if (not occ.author[idx]) or any(
        len(name.split(',')) != 2
        for name in occ.author[idx].split(';')
    )
]
bad_title_idx = [idx for idx in occ.index if not occ.title[idx]]
bad_year_idx = [idx for idx in occ.index if not occ.year[idx]]

occ_drop_idx = list(
    set(bad_author_idx) | set(bad_title_idx) | set(bad_year_idx))
occ = occ.drop(occ_drop_idx, axis=0)

print('...but even the rows of metadata that did not explicitly have\n'
      'null values such as None or np.NaN sometimes had unusable data.\n'
      '{} rows had no author names or author names that were formatted\n'
      'inconsistently or in a way that was difficult to interpret, {}\n'
      'rows had no years provided, and {} rows had no titles provided.\n'
      'In total, {} rows had to be excluded for these reasons, \n'
      'leaving {} rows of truly complete metadata.'.format(
          # Argument order must follow the placeholders: authors, years, titles.
          len(bad_author_idx),
          len(bad_year_idx),
          len(bad_title_idx),
          len(occ_drop_idx),
          len(occ.index)
      ))

...but even the rows of metadata that did not explicitly have
null values such as None or np.NaN sometimes had unusable data.
1192 rows had no author names or author names that were formatted
inconsistently or in a way that was difficult to interpret, 137
rows had no years provided, and 167 rows had no titles provided.
In total, 1448 rows had to be excluded for these reasons, 
leaving 43832 rows of truly complete metadata.


In short, 50000 DOIs were requested and 45756 DOIs were received. Metadata was then requested for all 45756 DOIs, and 43832 complete rows of metadata were received. This means that after the crucial bottleneck where I was only able to request an arbitrarily (and not necessarily randomly) selected sample of 50000 DOIs, I was able to get most of the data I wanted.
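
As a quick sanity check on the counts above, the arithmetic can be reproduced directly. This is only a sketch: the numbers are copied from the printed cell outputs, the request size is taken to be the 50000 mentioned at the end of the paragraph, and the variable names are invented for illustration.

```python
requested_dois = 50000   # size of the sample requested from the endpoint
received_dois = 45756    # DOIs actually returned
after_dropna = 45280     # rows left after dropna()
excluded_rows = 1448     # distinct rows dropped for unusable fields
flagged_counts = [1192, 167, 137]  # per-field counts from the printed report

complete_rows = after_dropna - excluded_rows
# Flags in excess of the number of distinct excluded rows; a nonzero value
# means some rows were flagged for more than one field.
surplus_flags = sum(flagged_counts) - excluded_rows

print(complete_rows)
print('{:.1%} of received DOIs kept'.format(complete_rows / received_dois))
print('{:.1%} of requested DOIs kept'.format(complete_rows / requested_dois))
print(surplus_flags)
```

The subtraction recovers the 43832 complete rows reported above, which is roughly 95.8% of the DOIs received.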

Let us get a sense for how clean and complete the dataset is. Do the raw citations really contain sufficient information to get the data we want? How can this inform the expectations and concerns we might have when developing a parser?

The following cells may be run several times to see many different rows in the dataset.

Already, it appears that most author surnames are available in the raw text. However, even a human processing the dataset by hand might find it difficult or impossible to determine given names. Sometimes, it is even difficult to determine whose given names or initials are whose.

In [4]:
occ.sample(5).drop(['year', 'title'], axis=1)

| | raw | author |
| --- | --- | --- |
| 41538 | Kounoue, E, Izumi, K, Ogawa, S, Kondo, S, Kats... | Kounoue, Etsushi; Izumi, Ken-Ichi; Ogawa, Shui... |
| 37136 | an, N.H, Arunmozhiarasi, A, Ponnudurai, G. A c... | Tan, Nget-Hong; Arunmozhiarasi, Armugam; Ponnu... |
| 6100 | Manikandan S. Are we moving towards a new defi... | Manikandan, S |
| 23099 | van den Brand, JM, Haagmans, BL, van Riel, D, ... | Van Den Brand, J.M.A.; Haagmans, B.L.; Van Rie... |
| 9335 | Ertas, G, Gulcur, HO, Osman, O, Ucan, ON, Tuna... | Ertaş, Gökhan; Gülçür, H.Özcan; Osman, Onur; U... |


Sometimes, titles do not end with a period or closing quotation mark; instead, they terminate with a comma, potentially making it difficult even for a human to know when the title ends and when other article data begins.

At least a few different style guides do seem to be represented, each with different schemes for where to place the year, how to format the title, and so forth.

In [5]:
list(occ.sample(5).raw)

['Zbinden C. Leader neurons in leaky integrate and fire neural network simulations. J Comput Neurosci 31, 285–304 (2011). PMID: 21234795',
 'Davis GE (1939) Ornithodoros parkeri: Distribution and host data; spontaneous infection with relapsing fever spirochetes. Public Health Rep 54: 1345–1349.',
 'Talbot, H.M, Summons, R.E, Jahnke, L.L, Cockell, C.S, Rohmer, M, and Farrimond, P. (2008) Cyanobacterial bacteriohopanepolyol signatures from cultures and natural environmental settings. Org Geochem 39: 232–263. doi: 10.1016/j.orggeochem.2007.08.006.',
 'Kamal M. A. et al. Kinetics of human serum butyrylcholinesterase inhibition by a novel experimental Alzheimer therapeutic, dihydrobenzodioxepine cymserine. Neurochemical research 33, 745–753, DOI: 10.1007/s11064-007-9490-y (2008). PMID: 17985237',
 'Schuler JR, Bockisch CJ, Straumann D, Tarnutzer AA. Precision and accuracy of the subjective haptic vertical in the roll plane. BMC Neurosci. BioMed Central Ltd; 2010; 11: 83 doi: 10.1186/1471-22

## Name Raw Text -> NameList Objects

The purpose of this step is simply to better organize the data according to its semantics. No interesting analysis is going on here.

Although the DataFrame will look the same after this step is applied because of how the `__str__` method is implemented for `NameList` objects, the objects that are actually being stored in the DataFrame will have some methods available to make explicit the distinctions between surnames and given names, and between different authors' names.
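
The actual `NameList` class lives in `People.NameList` and is not reproduced in this notebook. As a rough sketch of the idea only, the hypothetical stand-ins below (`Name`, `SimpleNameList`, and `parse_delimited`, all invented for illustration) show how a string in the `Surname, Given; Surname, Given` format seen in the `author` column could be split into explicit parts while still printing as the original string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Name:
    """One author's name with its parts made explicit."""
    surname: str
    given: str

    def __str__(self):
        return '{}, {}'.format(self.surname, self.given)


class SimpleNameList(list):
    """A list of Name objects that prints in the semicolon-delimited format."""

    def __str__(self):
        return '; '.join(str(name) for name in self)


def parse_delimited(text):
    """Parse 'Surname, Given; Surname, Given' into a SimpleNameList."""
    names = SimpleNameList()
    for chunk in text.split(';'):
        parts = chunk.split(',')
        # Mirrors the filter used earlier: exactly one comma per name.
        if len(parts) != 2:
            raise ValueError('expected "Surname, Given": {!r}'.format(chunk))
        names.append(Name(parts[0].strip(), parts[1].strip()))
    return names


authors = parse_delimited('Blennow, G.; Mcneil, T. F.')
print([name.surname for name in authors])
print(authors)
```

Because `__str__` reproduces the delimited format, a DataFrame holding such objects would display the same text as before, which is the behavior described above.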

In [6]:
occ.author = occ.apply(lambda row: NameList.delimited(row.author), axis=1)
occ.head(2)

| | raw | author | year | title |
| --- | --- | --- | --- | --- |
| 1 | Knechtle, B, Knechtle, P, Schulze, I, Kohler, ... | Knechtle, B.; Schulze, I. | 2008 | Ernährungsverhalten Bei Ultraläufern - Deutsch... |
| 2 | Sousa, M, Fernandes, MJ, Moreira, P, Teixeira,... | Sousa, Mónica; Fernandes, Maria João; Moreira,... | 2013 | Nutritional Supplements Usage By Portuguese At... |


## Train-Test Split

I have decided (somewhat arbitrarily) on the following train-test split:

* 10,000 items will be set aside for testing

* 10,000 items will be set aside for validation

* The remaining (~24,000) items will be set aside for training

In [7]:
# The random state is the hour and minute when I wrote this code.
occ_test = occ.sample(n=10000, replace=False, random_state=11)
occ_remaining = occ.drop(occ_test.index, axis=0)
occ_validation = occ_remaining.sample(n=10000, replace=False, random_state=41)
occ_train = occ_remaining.drop(occ_validation.index, axis=0)
# Let's be extra sure this operation was done correctly!
for a, b in itertools.combinations(
        [occ_test.index, occ_validation.index, occ_train.index],
        2):
    assert not any(element in b for element in a)

The code used to save the datasets is included here for transparency, but the conditional blocks will prevent it from being run again.

In [8]:
assert os.path.exists('./datasets/')
test_path = './datasets/occ_45K_test.pickle'
validation_path = './datasets/occ_45K_validation.pickle'
train_path = './datasets/occ_45K_train.pickle'
if not os.path.exists(test_path):
    with open(test_path, 'ab') as dbfile:
        pickle.dump(occ_test, dbfile, protocol=pickle.HIGHEST_PROTOCOL)
if not os.path.exists(validation_path):
    with open(validation_path, 'ab') as dbfile:
        pickle.dump(occ_validation, dbfile, protocol=pickle.HIGHEST_PROTOCOL)
if not os.path.exists(train_path):
    with open(train_path, 'ab') as dbfile:
        pickle.dump(occ_train, dbfile, protocol=pickle.HIGHEST_PROTOCOL)

## TestRunner

As before, I define a TestRunner class. For now at least, I am choosing to simply stick with percent accuracy in predicting year, author, and title as my performance metrics, because the frequency with which those attributes give successful matches with the reported metadata should be a good predictor of the frequency with which they give successful matches between different texts in a corpus.

Logic for k-fold CVs or LOOCVs is not managed by this class. This class is strictly for computing and reporting metrics on a given dataset.

In [None]:
class TestRunner:
  """Encapsulates logic for calculating model performance statistics."""
  def __init__(self, model, test_set):
    """Initializes the TestRunner with a trained MODEL that is Testable and a
    TEST_SET, which is a DataFrame that has the columns specified in the
    "Datasets" section of this notebook.
    """
    self.model = model
    self.test_set = test_set
    # self.predictions will be a DataFrame of the same format as TEST_SET.
    # However, it is set to None here because it will be computed lazily.
    self.predictions = None
  def get_predictions(self):
    """Returns a DataFrame with the same columns as the DataFrame on which the
    model was trained, as described in the "Datasets" section of this notebook.
    """
    if self.predictions is None:
      # The raw-text column of the dataset is named `raw`.
      self.predictions = self.model.predict(self.test_set.raw)
    return self.predictions
  def year_accuracy(self):
    """Returns the proportion of publication years that the model predicts
    accurately.
    """
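    # Compare by position rather than by index label, as the other metrics do.
    return np.mean(
        np.asarray(self.test_set.year) == np.asarray(self.get_predictions().year))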
  def surname_accuracy(self):
    """Returns the proportion of citations from which the set of all authors'
    surnames is predicted with perfect accuracy (within the simplifications of
    de-accenting and capitalization/punctuation removal).
    """
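    return np.mean([
        (
            # simple_preprocess returns a list, which is unhashable, so each
            # surname's tokens are converted to a tuple before entering the set.
            set(tuple(simple_preprocess(name.surname, deacc=True))
                for name in true_names)
            == set(tuple(simple_preprocess(name.surname, deacc=True))
                   for name in predicted_names)
        ) for true_names, predicted_names
        in zip(self.test_set.author, self.get_predictions().author)
    ])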
  def primary_author_surname_accuracy(self):
    """Returns the proportion of citations from which the primary author's
    surname is predicted correctly. The first author who is listed in the
    dataset will be interpreted as the primary author.
    """
    return np.mean([
        true_names[0].surname == predicted_names[0].surname
        for true_names, predicted_names
        in zip(self.test_set.author, self.get_predictions().author)
    ])
  def title_accuracy(self):
    """Returns the proportion of citations from which the title of the cited work
    is predicted with perfect accuracy (within the simplifications of
    de-accenting and capitalization/punctuation removal).
    """
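    # np.mean needs a sequence, not a generator; deacc=True does the
    # de-accenting promised in the docstring.
    return np.mean([
        simple_preprocess(true_title, deacc=True)
        == simple_preprocess(predicted_title, deacc=True)
        for true_title, predicted_title
        in zip(self.test_set.title, self.get_predictions().title)
    ])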
  def quick_report(self):
    """Prints a report of model performance without confidence intervals.
    Intended for development (debugging, preliminary results, etc.) and not for
    final reporting.
    """
    t0 = time.time()
    self.get_predictions()
    elapsed_time = time.time() - t0
    print('Generated labels for {:,} records in {:.4f} seconds ({:.4f} seconds'
          ' per record)'.format(
              len(self.test_set.index),
              elapsed_time,
              elapsed_time / len(self.test_set.index)))
    print('Surname accuracy (complete set): {:.4f}'.format(
        self.surname_accuracy()))
    print('Year accuracy: {:.4f}'.format(self.year_accuracy()))
    print('Title accuracy (with preproc): {:.4f}'.format(self.title_accuracy()))

The following cell demonstrates the power of `simple_preprocess`, a simple, widely used utility whose function should be reasonably easy to explain to other people.

In [None]:
simple_preprocess('The cát, in the hát.', deacc=True) == simple_preprocess('the cAt  iN tHe haT!!!?', deacc=True)

True

In [None]:
class CrossValidator:
  """Reports cross validation results for a given model and dataset."""
  # This should have a has-a relationship with TestRunner.
  # You know, it's possible sklearn has this. I will take a look.
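
scikit-learn does provide this kind of splitting (for example, `sklearn.model_selection.KFold`), although its higher-level helpers generally expect scikit-learn-style estimators. As a dependency-free sketch of the has-a design noted in the comments above, the hypothetical `k_fold_positions` below only produces the row positions for each fold; a `CrossValidator` could then fit the model on the training positions and hand the held-out rows to a `TestRunner`. The function name, its parameters, and the seed are assumptions made for illustration.

```python
import random


def k_fold_positions(n_rows, k, seed=0):
    """Yield (train_positions, held_out_positions) for each of K folds."""
    positions = list(range(n_rows))
    # Shuffle with a private, seeded generator so the folds are reproducible.
    random.Random(seed).shuffle(positions)
    # Deal the shuffled positions into K folds of nearly equal size.
    folds = [positions[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield train, held_out


for train_positions, held_out_positions in k_fold_positions(n_rows=10, k=3):
    # With a DataFrame `df`, the two parts would be df.iloc[train_positions]
    # and df.iloc[held_out_positions]; the latter would go to a TestRunner.
    print(len(train_positions), len(held_out_positions))
```

Every row position is held out exactly once across the folds, so the per-fold metrics can be averaged without any row being counted twice.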