<a href="https://colab.research.google.com/github/petervdonovan/CitationParser/blob/master/ParserEvaluation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Parser Evaluation

Each citation parser that I test will have its dedicated section in this notebook.

In [None]:
# Here, I simply import a couple of libraries and mount the Drive.
# There is nothing special to see here.
!pip install pickle5

import pandas as pd
import numpy as np
import re
import random
import pickle5 as pickle
import time
from google.colab import drive



In [None]:
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
training_sets_dir = '/content/drive/My Drive/AWCA/Training/'
citation_tagging_dir = '/content/drive/My Drive/AWCA/Colab_notebooks/CitationTagging/'

## Loading Datasets

I load and prepare three datasets:
1. The Zotero dataset, a non-public dataset curated by Dr. Anderson and prepared for our use by Jason Webb. This will be called `zotero`.
1. A preprocessed version of the Zotero dataset, the only difference being the `raw_text` column, which will have extraneous information such as capitalization and punctuation removed. This will be called `zotero_preproc`.
    * The primary purpose of this dataset is to verify that Jason Webb's results are being replicated correctly. For most purposes, I will use the non-preprocessed dataset because the objective is to be able to parse raw citation data, not pre-chewed citation data.
1. A small convenience sample of the OpenCitations dataset, a public, open-source dataset curated by [OpenCitations](https://opencitations.net/about). This will be called `occ`.
    * This dataset is far from ideal because it is not necessarily a diverse sample of the OpenCitations Corpus. For the time being, I had to take what I could get from their SPARQL endpoint, which seems unable to handle queries that are too complex or that return too many results.
    * This dataset is useful because it ensures _replicability_: Because the OpenCitations Corpus is completely public and accessible to anyone in the world, it will allow any interested individual to replicate our results.

All datasets will have the following columns:
* 'raw_text'
* 'tags'
* 'contributors'
* 'title'
* 'year'

All datasets will be split into train and test sets in this section. The train and test sets will be marked with the suffixes `_train` and `_test`.

### Zotero Dataset

In [None]:
zotero = pd.read_csv(training_sets_dir + 'Labelled_Data_with_Dates.csv')
zotero = zotero.rename({'citations':'raw_text'}, axis=1).drop('Unnamed: 0', axis=1)
zotero.head(2)

Unnamed: 0,raw_text,tags
0,"Stieglitz, R. R. (2001b). Ebla and the Gods of...",A A A D T T T T T T O O O O O O O O O O O O O ...
1,"Felsner, P. (2001, May 1). Lecture on Recent E...",A A D O O T T T T T T


Here, I define a function for extracting words from one string that correspond to certain tags in another string. This is a specialized function intended specifically for expanding out the information in the saved CSV file.

In [None]:
def extract_from_raw(raw, tags, desired_tag):
  """Returns the space-separated words of RAW whose corresponding tag in the
  TAGS string equals DESIRED_TAG.
  """
  ret = ''
  for word, tag in zip(raw.split(), tags.split()):
    if tag == desired_tag:
      ret += word + ' '
  return ret[:-1]
extract_from_raw(zotero.raw_text[0], zotero.tags[0], 'A')

'Stieglitz, R. R.'

In [None]:
def expand_raw_and_tags(df):
  """Converts a DataFrame containing just raw text and tags to a new DataFrame
  that also has the columns 'contributors', 'title', and 'year'. (This is a
  mutative operation -- it acts via a side effect.)
  """
  for col_name, tag in [('contributors', 'A'), ('title', 'T'), ('year', 'D')]:
    df[col_name] = df.apply(
        lambda row: extract_from_raw(row.raw_text, row.tags, tag),
        axis=1)
  return df

In [None]:
expand_raw_and_tags(zotero)
zotero.head(2)

Unnamed: 0,raw_text,tags,contributors,title,year
0,"Stieglitz, R. R. (2001b). Ebla and the Gods of...",A A A D T T T T T T O O O O O O O O O O O O O ...,"Stieglitz, R. R.",Ebla and the Gods of Canaan.,(2001b).
1,"Felsner, P. (2001, May 1). Lecture on Recent E...",A A D O O T T T T T T,"Felsner, P.",Lecture on Recent Excavations at Tell-Mishrife...,"(2001,"


As a final bit of housekeeping, I wish to clean up the 'year' column. I observe that a considerable number of entries tagged as 'D' are actually cities or pagination markers (i.e., 'pp.').

In [None]:
zotero.head(2)

Unnamed: 0,raw_text,tags,contributors,title,year
0,"Stieglitz, R. R. (2001b). Ebla and the Gods of...",A A A D T T T T T T O O O O O O O O O O O O O ...,"Stieglitz, R. R.",Ebla and the Gods of Canaan.,(2001b).
1,"Felsner, P. (2001, May 1). Lecture on Recent E...",A A D O O T T T T T T,"Felsner, P.",Lecture on Recent Excavations at Tell-Mishrife...,"(2001,"


In [None]:
errors = []
def get_year(text):
  year = re.search(r'\d\d\d\d', text)
  if year:
    return int(year.group(0))
  else:
    errors.append(text)
    return None
zotero.year = zotero.apply(lambda row: get_year(row.year), axis=1)
print('There were {} substrings incorrectly tagged as dates. Randomly chosen\n'
      'examples include the following:\n{}'.format(
          len(errors), str(random.sample(errors, 10))))

There were 3875 substrings incorrectly tagged as dates. Randomly chosen
examples include the following:
['', 'Chicago.', '(Ugaritica Tel', '', 'Copenhagen.', 'Berlin.', 'Leiden.', 'Ghent.', '', 'Paris.']


In [None]:
print('We now have a dataset called "zotero" with {} rows.'.format(len(zotero.index)))

We now have a dataset called "zotero" with 66985 rows.


I conclude by splitting the dataset into train and test sets. I use a 15:85 train-test split, as is done in the "[Accuracy](https://colab.research.google.com/drive/1abvov19jS3A4r_MpGbTOyV8LvJmiTj4F#scrollTo=S_Y5yeGdWYHG)" notebook.

In [None]:
zotero_train = zotero.sample(frac=0.15)
zotero_test = zotero.drop(zotero_train.index)
zotero_train.head(2)

Unnamed: 0,raw_text,tags,contributors,title,year
58688,"Mayrhofer, Manfred. Nachlese altpersischer Ins...",A O T T T T T T T T T T T T T T T T T T T T T ...,"Mayrhofer,",Nachlese altpersischer Inschriften. Zu ̈uberse...,1978.0
39152,"COUROYER, B. ""LES AAMOU LES AAMOU-HYKSOS ET LE...",A O T T T T T T T D O O,"COUROYER,","""LES AAMOU LES AAMOU-HYKSOS ET LES CANANEO-PHE...",1974.0


### Preprocessed Zotero Dataset

Jason Webb used a preprocessed version of the Zotero dataset with his BiLSTM model. I load the dataset below to verify that his results are being replicated here correctly.

In [None]:
with open(training_sets_dir + 'BiLSTM_train.pickle', 'rb') as dbfile:
  zotero_preproc_train_original = pickle.load(dbfile)
with open(training_sets_dir + 'BiLSTM_test.pickle',  'rb') as dbfile:
  zotero_preproc_test_original  = pickle.load(dbfile)

These loaded sets are formatted as lists of ordered pairs. Each ordered pair contains a list of words from the raw text and a list of tags. This is nice, but it is nonstandard, and so it makes it hard to compare models side-by-side. Below I define two functions to help with converting between the list-of-pairs format and the DataFrame format.

In [None]:
def listpairlist_to_df(lpl):
  raw_texts = []
  tags = []
  for raw, tag in lpl:
    raw_texts.append(' '.join(raw))
    tags.append(' '.join(tag))
  return pd.DataFrame(data={
      'raw_text': raw_texts,
      'tags': tags
  })
def df_to_listpairlist(df):
  lpl = []
  for raw, tag in zip(df.raw_text, df.tags):
    lpl.append((raw.split(), tag.split()))
  return lpl

Here, I verify that the functions defined above are inverse functions of each other.

In [None]:
inverse_inverse = df_to_listpairlist(listpairlist_to_df(zotero_preproc_train_original))
diff = [item for idx, item in enumerate(zotero_preproc_train_original) if 
        inverse_inverse[idx] != item
       ]
print('There are {} elements in a list that are not in the output of a function'
      ' composed\nwith its inverse applied to that list. Furthermore, the '
      'difference in the lengths\nof the two lists is {}.\nThe two lists {} '
      'equal, and the first {} the second.'.format(
          len(diff),
          len(zotero_preproc_train_original) - len(inverse_inverse),
          'are' if zotero_preproc_train_original == inverse_inverse else 'are not',
          'is'  if zotero_preproc_train_original is inverse_inverse else 'is not'
      ))

There are 0 elements in a list that are not in the output of a function composed
with its inverse applied to that list. Furthermore, the difference in the lengths
of the two lists is 0.
The two lists are equal, and the first is not the second.
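
The round trip succeeds on this dataset, but it is not guaranteed in general. The following is a minimal sketch, using hypothetical toy tokens rather than the real data, of the one assumption it depends on: no token may be empty or contain whitespace, because the conversion joins on spaces and splits on whitespace.

```python
def join_tokens(tokens):
  """Mirrors the ' '.join(...) step used by listpairlist_to_df."""
  return ' '.join(tokens)

def split_tokens(text):
  """Mirrors the .split() step used by df_to_listpairlist."""
  return text.split()

# Tokens with no internal whitespace survive the round trip.
clean = ['cooper', 'jerrold', 'fourdigitnum']
clean_round_trip = split_tokens(join_tokens(clean))

# Hypothetical tokens that contain a space or are empty do not.
messy = ['cooper', 'van der', '']
messy_round_trip = split_tokens(join_tokens(messy))

print(clean_round_trip == clean)
print(messy_round_trip == messy)
```

If the second comparison ever came out unequal on real data, the tag lists would no longer line up with the words, so the check above is worth rerunning whenever the preprocessing changes.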


In [None]:
zotero_preproc_train = listpairlist_to_df(zotero_preproc_train_original)
expand_raw_and_tags(zotero_preproc_train)
zotero_preproc_test = listpairlist_to_df(zotero_preproc_test_original)
expand_raw_and_tags(zotero_preproc_test)
zotero_preproc_train.head(2)

Unnamed: 0,raw_text,tags,contributors,title,year
0,cole s & gasche h fourdigitnum second-and firs...,O A O A A D T T T T T T T O O O O O O O O O O ...,s gasche h,second-and first-millennium bc rivers in north...,fourdigitnum
1,cooper jerrold fourdigitnum buddies in babylon...,A A D T T T T T T T T O O O O O O O O O O O O ...,cooper jerrold,buddies in babylonia gilgamesh enkidu and meso...,fourdigitnum


### OpenCitations Dataset

Here, I load a 5000-row dataset consisting of bibliographic entries from the OpenCitations corpus. This dataset was downloaded from OpenCitations using [this notebook](https://github.com/petervdonovan/CitationParser/blob/master/datasets/ccc_dataset.ipynb). It is worth noting that with sufficient time, this dataset could have been made ten times larger; however, as a proof of concept, I kept this one to 5000 rows.

In [None]:
%cd /content/drive/My Drive/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation
!rm -r CitationParser
!git clone https://github.com/petervdonovan/CitationParser.git
%cd CitationParser/datasets
!ls

/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation
Cloning into 'CitationParser'...
remote: Enumerating objects: 23, done.
remote: Counting objects: 100% (23/23), done.
remote: Compressing objects: 100% (17/17), done.
remote: Total 23 (delta 1), reused 20 (delta 1), pack-reused 0
Unpacking objects: 100% (23/23), done.
/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation/CitationParser/datasets
ccc_dataset.ipynb  dataset19183.pickle


In [None]:
with open('dataset19183.pickle', 'rb') as dbfile:
  occ = pickle.load(dbfile)
occ.head(2)

Unnamed: 0,raw_text,tags,contributors,title,year
0,"Holowka, D, Wensel, T, Baird, B. A nanosecond ...",A A A A A A A O O O O O O O O O O O O O D O O ...,"Holowka, David; Wensel, Theodore; Baird, Barbara",A Nanosecond Fluorescence Depolarization Study...,1990.0
1,Arias-Salgado E. G. et al. Src kinase activati...,A A A O O O O O O O O O O O O O O O O O O O O ...,"Arias Salgado, E. G.; Lizano, S.; Sarkar, S.; ...",Src Kinase Activation By Direct Interaction Wi...,2003.0


In [None]:
print('Originally the dataframe "occ" had {} rows'.format(len(occ.index)), end='')
occ = occ.dropna()
print(', but after dropping null values it now has {} rows.'.format(len(occ.index)))

Originally the dataframe "occ" had 5000 rows, but after dropping null values it now has 4783 rows.


# TestRunner

Here, I define a couple of classes that are necessary for running tests. These classes determine what metrics I use to evaluate parser performance.

## TestRunner

Every Testable parser can be given to a TestRunner instance to permit the computation of performance metrics.

By the way, four of the metrics that I borrowed from the notebook [Accuracy](https://colab.research.google.com/drive/1D7i5pLgqEsrLG3PdwiZKcSMOfafWvztl) have a quirk: A tagger that labels everything with a certain tag can have 100% accuracy for that specific tag, even though its output provides zero information. This is why the next re-implementation of this might involve [F-scores](https://en.wikipedia.org/wiki/F-score) instead.
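
To make the quirk concrete, here is a rough sketch of how per-tag precision, recall, and F1 could be computed from tag strings. The function name `tag_prf` and the toy data are my own assumptions for illustration; this is not the metric used in the Accuracy notebook or in the `TestRunner` class.

```python
def tag_prf(predicted, actual, tag):
  """Returns (precision, recall, F1) for TAG over parallel lists of
  space-separated tag strings.
  """
  tp = fp = fn = 0
  for pred_str, act_str in zip(predicted, actual):
    for p, a in zip(pred_str.split(), act_str.split()):
      if p == tag and a == tag:
        tp += 1  # correctly labelled as TAG
      elif p == tag:
        fp += 1  # labelled as TAG, but should not have been
      elif a == tag:
        fn += 1  # should have been labelled as TAG, but was not
  precision = tp / (tp + fp) if tp + fp else 0.0
  recall = tp / (tp + fn) if tp + fn else 0.0
  f1 = (2 * precision * recall / (precision + recall)
        if precision + recall else 0.0)
  return precision, recall, f1

# A degenerate tagger that labels every token 'A'.
actual = ['A A D T T O', 'A D T T T O']
everything_a = ['A A A A A A', 'A A A A A A']
precision, recall, f1 = tag_prf(everything_a, actual, 'A')
print(precision, recall, f1)
```

The degenerate tagger finds every contributor token, so its recall is perfect, but its precision is low because most of the tokens it labels 'A' are not contributors. An F-score penalizes that, whereas a metric that ignores false positives does not.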

In [None]:
class TestRunner:
  def __init__(self, model, test_set):
    """Initializes the TestRunner with a trained MODEL that is Testable and a
    TEST_SET, which is a DataFrame that has the columns specified in the above
    section, "Loading Datasets."
    """
    self.model = model
    self.test_set = test_set
    # self.predictions will be a DataFrame of the same format as TEST_SET.
    # However, it is set to None here because it will be computed lazily.
    self.predictions = None
  def get_predictions(self):
    if self.predictions is None:
      self.predictions = self.model.predict(self.test_set.raw_text)
    return self.predictions
  def get_predict_tags(self):
    return self.get_predictions().tags
  def token_tagging_accuracy(self, boot=False):
    """Returns the proportion of tokens in the test set that are correctly
    tagged.
    """
    tokens_correct = 0
    tokens_total = 0
    sample = zip(self.get_predict_tags(), self.test_set.tags)
    if boot:
      sample = list(sample)
      sample = random.choices(sample, k=len(sample))
    for predict, actual in sample:
      for token_predict, token_actual in zip(predict.split(), actual.split()):
        tokens_total += 1
        if token_predict == token_actual:
          tokens_correct += 1
    return tokens_correct / tokens_total
  def _semantic_unit_tagging_accuracy(self, tag, boot=False):
    """Returns the proportion of citations for which every word corresponding to
    TAG is correctly tagged as such. Note: This method returns 100%  (1.0) if
    every single word is tagged as TAG because it does not account for false
    positives.
    """
    correct = 0
    total = 0
    sample = zip(self.get_predict_tags(), self.test_set.tags)
    if boot:
      sample = list(sample)
      sample = random.choices(sample, k=len(sample))
    for predict, actual in sample:
      total += 1
      if all([
              token_actual != tag or token_predict == token_actual
              for token_predict, token_actual in
              zip(predict.split(), actual.split())]):
        correct += 1
    return correct / total
  def _column_accuracy(self, column, boot=False, comparator=None):
    """Returns the proportion of citations that have a non-null value in the
    COLUMN column for which the value in that column is predicted correctly.
    COMPARATOR (optional): The function that returns a boolean depending on
        whether two values are equal (or approximately equal)
    """
    selected_test = self.test_set[column].dropna()
    selected_pred = self.get_predictions().loc[selected_test.index,column]
    if boot:
      idx = random.choices(selected_test.index, k=len(selected_test.index))
      selected_test = selected_test.loc[idx]
      selected_pred = selected_pred.loc[idx]
    if comparator is None:
      return sum(selected_test == selected_pred) / len(selected_test)
    return sum(
        comparator(test, pred)
        for test, pred in zip(selected_test, selected_pred)
        ) / len(selected_test)
  def report(self, n_boots=0):
    """Side effect: Prints out a report on the performance of the model and uses
    N_BOOTS bootstrap samples to compute confidence intervals. If N_BOOTS==0,
    then no confidence intervals are computed.
    """
    t0 = time.time()
    self.get_predictions()
    print('Time to get predictions: {:.4f} seconds.'.format(time.time() - t0))
    def report_metric(msg, method):
      print('{}: {:.4f}.'.format(
          msg, method(False)
      ) + ('' if not n_boots else ' 95% CI: ({:.4f}, {:.4f})'.format(
          *boot(lambda: method(True), n_boots, 0.95)
      )))
    report_metric('Proportion of tokens correctly tagged',
                  self.token_tagging_accuracy)
    report_metric('Proportion of citations for which every word corresponding\n'
                  'to a contributor\'s name is correctly tagged as such',
                  lambda boot: self._semantic_unit_tagging_accuracy('A', boot))
    report_metric('Proportion of citations for which every word corresponding\n'
                  'to a part of a title is correctly tagged as such',
                  lambda boot: self._semantic_unit_tagging_accuracy('T', boot))
    report_metric('Proportion of citations for which every word corresponding\n'
                  'to the year of publication is correctly tagged as such',
                  lambda boot: self._semantic_unit_tagging_accuracy('D', boot))
    for column in self.test_set.drop('raw_text', axis=1).columns:
      report_metric('Proportion of citations that have a non-null value in the\n'
                    '{} column for which the value in that column is predicted\n'
                    'correctly'.format(column),
                    lambda boot: self._column_accuracy(column, boot))
    report_metric('Proportion of titles predicted approximately correctly',
                  lambda boot: self._column_accuracy(
                      'title', boot, str_approx_equal))
def boot(f, n, confidence):
  """Returns a confidence interval for the result of calling F using N trials.
  """
  results = np.zeros(n)
  for i in range(n):
    results[i] = f()
  alpha = 1 - confidence
  return (np.percentile(results, 100*alpha/2),
          np.percentile(results, 100*(1-alpha/2)))
def str_approx_equal(a, b):
  """Returns whether or not the lowercase of A with only spaces and word
  characters equals the lowercase of B with only spaces and word characters.
  """
  if a is None or b is None:
    return False
  assert isinstance(a, str) and isinstance(b, str), \
      'Can only compare strings but got {} and {}'.format(a, b)
  return re.sub(r'[^\w ]+', '', a).lower() == re.sub(r'[^\w ]+', '', b).lower()
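
The bootstrap procedure used by `report` and `boot` can be illustrated on toy data. The sketch below is a simplified, dependency-free version under my own assumptions: it uses a nearest-rank percentile instead of the interpolation that `np.percentile` performs, and the 80/20 outcome list is made up.

```python
import random

def percentile_interval(values, confidence):
  """Returns (lower, upper) nearest-rank percentile bounds of VALUES."""
  ordered = sorted(values)
  alpha = 1 - confidence
  lo_idx = int((alpha / 2) * (len(ordered) - 1))
  hi_idx = int((1 - alpha / 2) * (len(ordered) - 1))
  return ordered[lo_idx], ordered[hi_idx]

random.seed(0)
# Toy data: 1 means a citation was handled correctly, 0 means it was not.
outcomes = [1] * 80 + [0] * 20
point_estimate = sum(outcomes) / len(outcomes)

# Resample the outcomes with replacement and recompute the metric each time.
resampled = []
for _ in range(1000):
  sample = random.choices(outcomes, k=len(outcomes))
  resampled.append(sum(sample) / len(sample))

lower, upper = percentile_interval(resampled, 0.95)
print(point_estimate, lower, upper)
```

The width of the interval shrinks as the test set grows, which is why the intervals for a dataset with thousands of rows are much narrower than the one this 100-item toy example produces.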

## A Testable Interface

For testing purposes, I want every implementation of a citation parser to have at least some of the same methods. This is necessary for comparison. However, because many other parts of our project are still under development, ideas about the functionality that is truly necessary may continue to evolve. In short, we want flexibility for development but uniform behavior for testing.

To resolve this contradiction, I draw the following conclusion: Citation parsers should not necessarily implement the same abstract class. However, we should have wrapper classes for them that all implement the same abstract class.

In [None]:
from abc import ABC, abstractmethod

class Testable(ABC):
  @abstractmethod
  def predict(self, test):
    """Returns a Pandas DataFrame with columns that directly correspond to the
    columns of TEST, for the purpose of comparison.
    """
    pass

## Testing Parsers

This is the section where I begin to run accuracy tests on different parsers.

### A Regex-Based Parser

Here, I load and test a parser that uses regular expressions. This implementation is inherently messy: It uses hardcoded rules to identify parts of a citation. However, it is tested below because it is convenient, fast, and simple, and because it provides a reasonable baseline performance level that other models should outperform. Any more advanced model is probably not good enough to publicly use unless it can outperform the model below.

In [None]:
%cd /content/drive/My Drive/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation
!rm -r CitationParser0
!git clone https://github.com/petervdonovan/CitationParser0.git
%cd CitationParser0
!ls

/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation
Cloning into 'CitationParser0'...
remote: Enumerating objects: 33, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 33 (delta 9), reused 31 (delta 7), pack-reused 0
Unpacking objects: 100% (33/33), done.
/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/CitationTagging/ParserEvaluation/CitationParser0
Citations  development_set.py  People  README.md  Utils


Because the original `Citation` class that I uploaded above generates titles and author names, not tags, the wrapper class must convert the output of the `Citation` class into tags in order for its performance to be evaluated.

In [None]:
import Citations.Citation as cp0
class RegexParserWrapper(Testable):
  def predict(self, test):
    tags = []
    contributors = []
    title = []
    year = []
    for raw in test:
      citation = cp0.Citation(raw)
      current_contribs = citation.getNameList()
      current_year = citation.getYear()
      current_title = citation.getTitle()
      contributors.append(current_contribs)
      year.append(current_year)
      title.append(current_title)
      # In the next several lines, I use the knowledge that the parser has
      # provided to convert the raw text into a tag.
      tag = ''
      while raw != '':
        next_space = raw.find(' ') % len(raw)
        if current_contribs and raw.find(current_contribs) == 0:
          tag += 'A ' * len(current_contribs.split())
          raw = raw[len(current_contribs):]
        elif current_year and raw.find(str(current_year)) % len(raw) < next_space:
          tag += 'D '
          # Go to the next word, or to the end if no spaces remain.
          raw = raw[raw.find(' ') % len(raw) + 1:]
        elif current_title and raw.find(current_title) == 0:
          tag += 'T ' * len(current_title.split())
          raw = raw[len(current_title):]
        else:
          tag += 'O '
          # Go to the next word, or to the end if no spaces remain.
          raw = raw[next_space + 1:]
        raw = raw.strip()
      tags.append(tag)
    return pd.DataFrame({'raw_text': test,
                         'tags': tags,
                         'contributors': contributors,
                         'title': title,
                         'year': year})

Here I test the regex-based parser on the **Zotero test dataset**.

In [None]:
rpw = RegexParserWrapper()
# Pass a large number of bootstrap samples (e.g., 500) to report() if you want
# confidence intervals. Otherwise, pass 0 or nothing so that it runs quickly.
TestRunner(rpw, zotero_test).report()

Time to get predictions: 7.6095 seconds.
Proportion of tokens correctly tagged: 0.7118.
Proportion of citations for which every word corresponding
to a contributor's name is correctly tagged as such: 0.8672.
Proportion of citations for which every word corresponding
to a part of a title is correctly tagged as such: 0.6246.
Proportion of citations for which every word corresponding
to the year of publication is correctly tagged as such: 0.8327.
Proportion of citations that have a non-null value in the
tags column for which the value in that column is predicted
correctly: 0.0447.
Proportion of citations that have a non-null value in the
contributors column for which the value in that column is predicted
correctly: 0.0550.
Proportion of citations that have a non-null value in the
title column for which the value in that column is predicted
correctly: 0.4351.
Proportion of citations that have a non-null value in the
year column for which the value in that column is predicted
correctly: 0.9

Here I test the regex-based parser on the **OCC dataset**.

In [None]:
regex_test_runner = TestRunner(rpw, occ)
# Pass a large number of bootstrap samples (e.g., 500) to report() if you want
# confidence intervals. Otherwise, pass 0 or nothing so that it runs quickly.
regex_test_runner.report(1000)

Time to get predictions: 0.9780 seconds.
Proportion of tokens correctly tagged: 0.6340. 95% CI: (0.6270, 0.6410)
Proportion of citations for which every word corresponding
to a contributor's name is correctly tagged as such: 0.6939. 95% CI: (0.6814, 0.7067)
Proportion of citations for which every word corresponding
to a part of a title is correctly tagged as such: 0.6715. 95% CI: (0.6584, 0.6851)
Proportion of citations for which every word corresponding
to the year of publication is correctly tagged as such: 0.4010. 95% CI: (0.3872, 0.4144)
Proportion of citations that have a non-null value in the
tags column for which the value in that column is predicted
correctly: 0.0100. 95% CI: (0.0071, 0.0130)
Proportion of citations that have a non-null value in the
contributors column for which the value in that column is predicted
correctly: 0.0000. 95% CI: (0.0000, 0.0000)
Proportion of citations that have a non-null value in the
title column for which the value in that column is predicted
c

### BiLSTM Parser

The parser loaded below is the BiLSTM model that was created, trained, and saved in the drive by Jason Webb.

First, some setup. It is necessary to load a few objects and define a few functions and mappings that are copied from the "Accuracy" notebook.

Note: It is necessary to downgrade PyTorch to an earlier version such as 1.2.0 to get the LSTM model working. If you have not done that in your current Colab session, then it is necessary to run the cell below.

In [None]:
!pip install torch==1.2.0



The following code is necessary to convert between the input/output of the `BiLSTMTagger` class (numbers) and input/output that humans can understand (words). For some reason, such functionality is not built into the original class, which I did not wish to modify.

In [None]:
%cd $citation_tagging_dir
from BiLSTM_CRF import BiLSTM_CRF
import torch
torch.manual_seed(1)

with open(citation_tagging_dir + 'bilstm_word_to_ix.pickle', 'rb') as dbfile:
  word_to_ix = pickle.load(dbfile)
bilstm_tagger = torch.load(citation_tagging_dir + 'BiLSTMfullmodel.pth')
bilstm_tagger.eval()

ix_to_tag = {0: 'A', 1: 'D', 2: 'T', 3: 'O', 4: '<START>', 5: '<STOP>'}
def bilstm_get_tags(raw_text):
  """Uses the BiLSTM tagger to output predicted tags."""
  with torch.no_grad():
    prepared_cit = prepare_sequence(raw_text.split(), word_to_ix)
    predicted_tags = bilstm_tagger(prepared_cit)[1]
    return ' '.join([ix_to_tag[pred] for pred in predicted_tags])

def prepare_sequence(seq, to_ix):
  """Replaces words with numbers in a sequence based on the mapping TO_IX."""
  # This is a potential bug: I used the get method with a default to keep the
  # program from erroring out, but what if a large number of words are not in
  # the dictionary?
  idxs = [to_ix.get(w, to_ix['the']) for w in seq]
  return torch.tensor(idxs, dtype=torch.long)

/content/drive/.shortcut-targets-by-id/1W2EROe2FItlaK99U-WY_qaBOc2UD_LI0/AWCA/Colab_notebooks/CitationTagging
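
The comment in `prepare_sequence` raises the question of how many words fall back to the default index. One way to check would be to measure the out-of-vocabulary rate, as in the rough sketch below. The function name `oov_rate`, the toy vocabulary, and the toy texts are assumptions for illustration; on real data one would pass the preprocessed citations and `word_to_ix`.

```python
def oov_rate(texts, vocab):
  """Returns the proportion of whitespace-separated words in TEXTS that are
  not keys of VOCAB.
  """
  total = 0
  unknown = 0
  for text in texts:
    for word in text.split():
      total += 1
      if word not in vocab:
        unknown += 1
  return unknown / total if total else 0.0

# Hypothetical stand-ins for word_to_ix and the preprocessed citations.
toy_vocab = {'the': 0, 'cooper': 1, 'fourdigitnum': 2}
toy_texts = ['cooper jerrold fourdigitnum', 'the holowka']
rate = oov_rate(toy_texts, toy_vocab)
print(rate)
```

A high rate on a given dataset would mean that many tokens are being fed to the tagger as if they were the word "the", which could plausibly degrade its predictions on that dataset.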


In [None]:
class BiLSTMTaggerWrapper(Testable):
  def predict(self, test):
    # regex=True is passed explicitly because newer versions of pandas treat
    # the pattern as a literal string by default.
    preproc = test.str.replace('[,.;:\'\"]', '', regex=True)
    preproc = preproc.str.replace(r'\(|[a-z]\)|\)', '', regex=True)
    preproc = preproc.str.replace(r'\d\d\d\d(–|-)\d\d\d\d', 'dateRange', regex=True)
    preproc = preproc.str.replace(r'\d\d\d\d', 'fourDigitNum', regex=True)
    preproc = preproc.str.replace(r'[xiv]+(–|-)[xiv]+', 'numerals', regex=True)
    preproc = preproc.str.replace(r'\d+(–|-)\d+', 'pageRange', regex=True)
    preproc = preproc.str.replace(r'\d+', 'otherDigits', regex=True)
    preproc = preproc.str.lower()
    tags = [bilstm_get_tags(text) for text in preproc]
    ret = expand_raw_and_tags(pd.DataFrame({
        'raw_text': test, 'tags': tags
    }))
    # Convert the extracted year strings to numbers, using NaN where no
    # four-digit year is found.
    years = np.zeros(len(ret.year))
    for i, year in enumerate(ret.year):
      match = re.search(r'\d\d\d\d', year)
      if match:
        years[i] = int(match.group(0))
      else:
        years[i] = np.nan
    # Overwrite the string-valued 'year' column with the numeric one.
    ret['year'] = pd.Series(years, index=ret.index)
    return ret

Here, I verify that the results in Accuracy.ipynb are being replicated correctly for the **Zotero datasets**. This is necessary to gain confidence that there are no bugs in this notebook.

In [None]:
btw = BiLSTMTaggerWrapper()
print('Results for Running the BiLSTM Tagger on the Zotero Training Set:')
TestRunner(btw, zotero_preproc_train).report(1000)
print('\nResults for Running the BiLSTM Tagger on the Zotero Test Set:')
TestRunner(btw, zotero_preproc_test).report(1000)

Results for Running the BiLSTM Tagger on the Zotero Training Set:
Time to get predictions: 38.2610 seconds.
Proportion of tokens correctly tagged: 0.8821. 95% CI: (0.8787, 0.8851)
Proportion of citations for which every word corresponding
to a contributor's name is correctly tagged as such: 0.9045. 95% CI: (0.8985, 0.9099)
Proportion of citations for which every word corresponding
to a part of a title is correctly tagged as such: 0.8159. 95% CI: (0.8082, 0.8234)
Proportion of citations for which every word corresponding
to the year of publication is correctly tagged as such: 0.8285. 95% CI: (0.8208, 0.8353)
Proportion of citations that have a non-null value in the
tags column for which the value in that column is predicted
correctly: 0.3846. 95% CI: (0.3749, 0.3931)
Proportion of citations that have a non-null value in the
contributors column for which the value in that column is predicted
correctly: 0.6884. 95% CI: (0.6800, 0.6982)
Proportion of citations that have a non-null value in

I observe that all metrics that appear in [Accuracy.ipynb](https://colab.research.google.com/drive/1D7i5pLgqEsrLG3PdwiZKcSMOfafWvztl) are within $16\times10^{-4}$ of what is reported here. Therefore I am happy, and I claim that Jason Webb's results are being correctly reproduced.

I now run the tagger on the **OCC dataset**:

In [None]:
TestRunner(btw, occ).report(1000)

Time to get predictions: 30.6852 seconds.
Proportion of tokens correctly tagged: 0.5445. 95% CI: (0.5389, 0.5498)
Proportion of citations for which every word corresponding
to a contributor's name is correctly tagged as such: 0.0778. 95% CI: (0.0702, 0.0849)
Proportion of citations for which every word corresponding
to a part of a title is correctly tagged as such: 0.8946. 95% CI: (0.8867, 0.9034)
Proportion of citations for which every word corresponding
to the year of publication is correctly tagged as such: 0.4798. 95% CI: (0.4666, 0.4940)
Proportion of citations that have a non-null value in the
tags column for which the value in that column is predicted
correctly: 0.0000. 95% CI: (0.0000, 0.0000)
Proportion of citations that have a non-null value in the
contributors column for which the value in that column is predicted
correctly: 0.0025. 95% CI: (0.0012, 0.0042)
Proportion of citations that have a non-null value in the
title column for which the value in that column is predicted


These results are rather different from the results from the Zotero dataset, but what does this mean? Does it mean that the tags that I added to the OCC dataset are wrong, or does it instead mean that the tags of the OCC dataset are simply different from the tags in the Zotero dataset? If it is the latter, then that would imply that tagging involves subjectivity, in which case it might not be the ideal means for evaluating a model.

If it is the former, then, well, I ought to revise my (rather messy) tagging function at the bottom of [this notebook](https://github.com/petervdonovan/CitationParser/blob/master/datasets/ccc_dataset.ipynb).
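
One way to start telling these two explanations apart would be to count which tags are confused with which. The sketch below is a minimal example under my own assumptions: the function name `tag_confusion` and the toy tag strings are hypothetical, and the counts say nothing about the real OCC results.

```python
from collections import Counter

def tag_confusion(predicted, actual):
  """Returns a Counter mapping (actual_tag, predicted_tag) pairs to the number
  of tokens with that combination.
  """
  counts = Counter()
  for pred_str, act_str in zip(predicted, actual):
    for p, a in zip(pred_str.split(), act_str.split()):
      counts[(a, p)] += 1
  return counts

# Toy example in which contributor tokens are mostly predicted as 'O'.
actual = ['A A A O D', 'A A T T O']
predicted = ['A O O O D', 'O O T T O']
confusion = tag_confusion(predicted, actual)
for (a, p), n in sorted(confusion.items()):
  print('actual {} predicted {}: {}'.format(a, p, n))
```

If one off-diagonal pair dominated on the real data, inspecting a few of the affected citations by hand would show whether the reference tags or the model's predictions are at fault.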