# Evaluation of ToposText Annotation in Book 4

To evaluate the quality of the ToposText annotation, Precision, Recall and F1 Score were calculated against a manually curated Gold Standard for Book 4. Overall, the ToposText annotation is of high quality (F1 0.850), with very high Precision (0.986). Recall is lower (0.747), indicating that some entities were not annotated: in total, 630 false negatives were found.

For an introduction to these metrics see: https://towardsdatascience.com/a-look-at-precision-recall-and-f1-score-36b5fd0dd3ec
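
For quick reference, the standard definitions of the three metrics can be written as follows, where TP, FP and FN stand for the number of true positives, false positives and false negatives. The symbols are shorthand introduced here; the notebook itself uses the variables `True_Positives`, `False_Positives` and `False_Negatives`.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
```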

In [1]:
import pandas as pd

# 1.3.1 Creation of the Gold Standard of Book 4

The Gold Standard was manually curated. It was created to evaluate the quality of the annotation in ToposText and, in a following step, of an NER system. Consistent and accurate guidelines were followed during the annotation process to avoid inconsistencies. This was particularly relevant because no clear guidelines are provided for the ToposText annotation.

Book 4 is representative of the entity types to annotate, since it is a book on geography.

The entities annotated in the Gold Standard are: places (LOC), such as 'Rome' and 'Athens'; persons (PEO), such as 'Caesar' and 'Jupiter'; and groups of people (GRO), such as 'Atraces' and 'Corinthians'. Ethnics were annotated only when they refer to groups of individuals and not to a single person (e.g., in the phrase 'Perikles the Athenian', 'Athenian' was not annotated). It was observed that some geographic regions are named not by a toponym but by the name of the group of people living there. Adjectives were annotated when they refer to places, such as 'Ionic (sea)'. Finally, only proper names were annotated: for example, in the phrase 'the Mount Pindus' only the word 'Pindus' was annotated. These entity boundaries (mostly single words rather than multi-word entities) make it easier to move from the English to the Latin text (see next notebooks) with the Python library FuzzyWuzzy.

The Gold Standard uses a consistent annotation format consisting of the reference (book, chapter, paragraph), the named entity, the label, and the start and end positions of the word in the paragraph.

The Gold Standard contains 2,491 entities.

In [2]:
## open the Gold Standard of Book 4 (2,491 entries)
GoldStandard_Book4 = pd.read_excel("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/1.3.GoldStandard_Book4.xlsx")

In [3]:
len(GoldStandard_Book4)

2491

In [4]:
## open the file containing the ToposText annotations in Book 4 (1,888 rows)
ToposText_Book4 = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/1.1.ToposText_Annotations_Book_4.csv", delimiter=",")

In [5]:
len(ToposText_Book4)

1888

# 1.3.2 Calculate Precision, Recall, F1 Score

To calculate Precision, Recall and F1 Score, a set of tuples was generated for both the Gold Standard and the ToposText annotations. Each tuple consists of two elements: the reference (book, chapter, paragraph) and the start position of the annotation, for example ('urn:cts:latinLit:phi0978.phi001:4.23.3', 330). The text of the tagged entity was deliberately left out of the tuple (i.e., the tuple is not ('urn:cts:latinLit:phi0978.phi001:4.1.1', 75, 'Acroceraunia')). In some cases the annotation is correct (same start position) but the annotated text differs from the ground truth (e.g., Corinthian / Cortinthian Gulf). Considering only the reference and the start position reduces such miscounting.
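
The effect of leaving the entity text out of the key can be illustrated with a small sketch. The records below are invented for illustration (the positions and the second reference are assumptions, not values from the Gold Standard or ToposText files); only the idea of matching on reference and start position comes from the notebook.

```python
# Hypothetical toy records: (reference, start position, annotated text).
# The values are made up for illustration only.
gold_rows = [
    ("urn:cts:latinLit:phi0978.phi001:4.1.1", 75, "Acroceraunia"),
    ("urn:cts:latinLit:phi0978.phi001:4.4.1", 120, "Corinthian"),
]
topos_rows = [
    ("urn:cts:latinLit:phi0978.phi001:4.1.1", 75, "Acroceraunia"),
    ("urn:cts:latinLit:phi0978.phi001:4.4.1", 120, "Corinthian Gulf"),
]

# Matching on the full triple: the second annotation does not match,
# because the annotated text differs although the position is the same.
matches_with_text = len(set(gold_rows) & set(topos_rows))

# Matching on (reference, start position) only: both annotations match.
gold_keys = {(ref, start) for ref, start, _ in gold_rows}
topos_keys = {(ref, start) for ref, start, _ in topos_rows}
matches_by_position = len(gold_keys & topos_keys)

print(matches_with_text, matches_by_position)
```

With the text included, an annotation at the right position would be counted both as a false positive and as a false negative, which is the miscounting the two-element key is meant to avoid.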

In [6]:
## create a set of tuples for the Gold Standard
GoldStandard_tuples = set(zip(GoldStandard_Book4['Reference'], GoldStandard_Book4['First position']))

## create a set of tuples for the ToposText annotations
ToposText_tuples = set(zip(ToposText_Book4['Reference'], ToposText_Book4['First position']))

The intersection of the two sets contains their common elements, that is, the predicted entities that match the ground-truth entities and are therefore correctly identified.
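
As a minimal sketch of how the set operations map onto the three error categories, consider two tiny sets of (reference, start position) keys. The shortened references and positions are hypothetical and serve only to show the operations.

```python
# Hypothetical keys; the short references stand in for full CTS URNs.
gold = {("4.1.1", 75), ("4.1.1", 120), ("4.2.1", 10)}
predicted = {("4.1.1", 75), ("4.2.1", 10), ("4.5.1", 1538)}

true_positives = gold & predicted    # in both sets: correctly annotated
false_positives = predicted - gold   # annotated, but not in the gold standard
false_negatives = gold - predicted   # in the gold standard, but not annotated

print(len(true_positives), len(false_positives), len(false_negatives))
```

Every predicted key is either a true or a false positive, so the number of false positives can equally be obtained by subtracting the true positives from the size of the predicted set.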

The ToposText annotation contains 1,861 true positives.

In [7]:
## calculate true positives

True_Positives = len(ToposText_tuples.intersection(GoldStandard_tuples))
True_Positives

1861

The ToposText annotation contains 27 false positives. An example of a false positive is 'canal' (urn:cts:latinLit:phi0978.phi001:4.5.1, 1538).

In [8]:
## calculate false positives

False_Positives = len(ToposText_tuples) - True_Positives
False_Positives

27

The ToposText annotation has a precision of 0.986.

In [9]:
## calculate precision

Precision = True_Positives / (True_Positives + False_Positives)
Precision

0.9856991525423728

The ToposText annotation contains 630 false negatives.

In [10]:
## calculate false negatives

False_Negatives = len(GoldStandard_tuples - ToposText_tuples)
False_Negatives

630

The ToposText annotation has a recall of 0.747.

In [12]:
## calculate recall

Recall = True_Positives / (True_Positives + False_Negatives)
Recall

0.7470895222802088

The ToposText annotation has an F1 score of 0.850.

In [13]:
## calculate F1

F1 = 2 * (Precision * Recall) / (Precision + Recall)
F1

0.849965745604019
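
The three calculations above could also be wrapped in a single helper, which makes it easier to reuse them when evaluating another annotation against the same Gold Standard. The function name `precision_recall_f1` and the zero-denominator handling are assumptions of this sketch, not part of the original notebook; the counts passed in are the ones reported above.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from raw counts.

    Hypothetical helper: a metric is set to 0.0 when its denominator is 0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Counts reported for the ToposText annotation of Book 4.
precision, recall, f1 = precision_recall_f1(tp=1861, fp=27, fn=630)
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # prints "0.986 0.747 0.850"
```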