# Enrich ToposText Annotation by Flair NER in Book 4

This notebook contains the code to enrich the ToposText annotations of Book 4 with the named entities extracted by Flair NER.
The steps are:

- open the CSV file containing the ToposText annotations;
- filter out the common nouns annotated in ToposText;
- open the output of Flair NER;
- create a set of tuples for the ToposText and the Flair NER annotations. Each tuple consists of the reference (book, chapter) and the start position of the named entity;
- with the intersection function, extract the annotations that are present both in ToposText and Flair NER (Common Annotations);
- extract annotations detected by NER but not in ToposText (Not Common Annotations);
- some of these annotations are not new, but are partially overlapping with already existing annotations. Exclude these annotations from the new entries;
- filter out annotations from the Flair NER in capital letters and the common nouns;
- add the new annotations to the CSV file of the ToposText annotations;
- reorder the dataframe according to the columns 'Reference' and 'Start position'.

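We found that 1,778 of the 2,539 annotations from Flair NER were already present in ToposText, while 761 annotations detected by Flair NER were not annotated in ToposText. Among these, 88 overlapped with existing ToposText annotations, which leaves 673; a further 42 were removed because they were written entirely in capital letters (40) or were common nouns (2). In total, Flair NER contributed 631 new annotations, which were added to the 1,877 ToposText annotations in the CSV file for a total of 2,508 annotations.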
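The counts in this summary can be cross-checked with simple arithmetic. The sketch below uses only figures reported in the notebook's cell outputs; the variable names are illustrative and do not appear in the notebook.

```python
# Figures taken from the cell outputs of this notebook
flair_total = 2539      # entries in the Flair NER output
common = 1778           # Flair annotations already present in ToposText
overlapping = 88        # Flair annotations nested with a ToposText annotation
topostext_rows = 1877   # ToposText rows after removing common nouns

not_common = flair_total - common             # annotations only in Flair NER
after_overlap = not_common - overlapping      # before the capital/lowercase filters
uppercase_removed = 673 - 633                 # rows dropped by the isupper() filter
lowercase_removed = 633 - 631                 # rows dropped by the islower() filter
new_annotations = after_overlap - uppercase_removed - lowercase_removed
enriched_total = topostext_rows + new_annotations

print(not_common, after_overlap, new_annotations, enriched_total)
```

If any intermediate value disagreed with the corresponding `len(...)` output, it would point to duplicated keys or to a filter applied in a different order.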

In [1]:
import pandas as pd

In [2]:
## open the file containing ToposText annotations in Book 4 (1,888 rows)
ToposText_Book4 = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/1.1.ToposText_Annotations_Book_4.csv", delimiter=",")

In [3]:
len(ToposText_Book4)

1888

In [5]:
## drop the Unnamed column
ToposText_Book4 = ToposText_Book4.drop(['Unnamed: 0'], axis=1)

In [6]:
## filter out the common nouns annotated in ToposText (11 in total)
ToposText_Book4 = ToposText_Book4[~ToposText_Book4['Tagged Entity'].str.islower()]
ToposText_Book4.reset_index(inplace=True) ## reset the index

The resulting dataset contains 1,877 rows.

In [7]:
len(ToposText_Book4) 

1877
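To see what the `str.islower()` filter keeps and drops, here is a minimal sketch on a toy dataframe. The entity strings are invented for illustration; only the column name 'Tagged Entity' comes from the notebook.

```python
import pandas as pd

# Toy data: two all-lowercase strings stand in for common nouns
toy = pd.DataFrame(
    {"Tagged Entity": ["Epirus", "river", "Mount Pindus", "the sea", "ACARNANIA"]}
)

# str.islower() is True only when every cased character is lowercase
is_common_noun = toy["Tagged Entity"].str.islower()

# Keep everything that is not entirely lowercase; drop=True avoids an extra 'index' column
kept = toy[~is_common_noun].reset_index(drop=True)
print(kept["Tagged Entity"].tolist())
```

Note that an entity written entirely in capitals passes this filter, which is why the Flair NER output is filtered with `str.isupper()` as a separate step later on.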

In [8]:
## open the file containing the output of Flair NER (2,539 entries)
NER_Flair_Book4 = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/1.5.NER_Flair_Book_4.csv")

In [9]:
len(NER_Flair_Book4)

2539

In [10]:
## create a set of tuples for the ToposText annotation
ToposText_tuples = set(zip(ToposText_Book4['Reference'], ToposText_Book4['Start position']))

## create a set of tuples for the Flair NER
NER_Flair_tuples = set(zip(NER_Flair_Book4['Reference'], NER_Flair_Book4['First position']))

In [11]:
## check the length of the set
len(ToposText_tuples)

1877

In [12]:
## check the length of the set
len(NER_Flair_tuples)

2539
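Both set lengths equal the row counts of their dataframes (1,877 and 2,539), which suggests that no (reference, start position) pair occurs twice in either source. A set silently collapses duplicates, so this comparison is a cheap uniqueness check. The sketch below illustrates the effect on invented values.

```python
# Toy data: the last two entries share the same (reference, start) pair
references = ["4.1.1", "4.1.1", "4.1.2", "4.1.2"]
starts = [41, 75, 41, 41]

# zip pairs each reference with its start position; set() removes repeated pairs
keys = set(zip(references, starts))

# A set shorter than the input means at least one key was duplicated
has_duplicates = len(keys) < len(references)
print(len(references), len(keys), has_duplicates)
```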

# 1.8.1 Annotations present both in ToposText and Flair NER

We extracted 1,778 annotations that ToposText and Flair NER have in common, that is, annotations with the same reference and start position. These entities detected by Flair NER were already annotated in ToposText.

In [13]:
## extract the annotations in common in ToposText and Flair NER (1,778)

Common_Annotations = NER_Flair_tuples.intersection(ToposText_tuples)
len(Common_Annotations)

1778

# 1.8.2 Annotations in Flair NER and not in ToposText

We extracted 761 annotations detected by Flair NER that were not annotated in ToposText.

In [14]:
## extract annotations in Flair NER but not in ToposText

Not_Common_Annotations = NER_Flair_tuples - ToposText_tuples
len(Not_Common_Annotations)

761

In [15]:
## the sum of annotations in common and not in common is the length of the Flair NER annotations

len(Common_Annotations) + len(Not_Common_Annotations)

2539
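The same split can also be obtained directly in pandas with a left merge and `indicator=True`. This is an alternative to the set operations used in the notebook, not the method it applies. The column names match the notebook, while the rows are toy values.

```python
import pandas as pd

# Toy stand-ins for the two annotation sources
topostext = pd.DataFrame(
    {"Reference": ["4.1.1", "4.1.1", "4.1.2"], "Start position": [75, 105, 363]}
)
flair = pd.DataFrame(
    {"Reference": ["4.1.1", "4.1.1", "4.1.2"], "First position": [41, 75, 357]}
)

# A left merge keeps every Flair row; '_merge' records whether a ToposText match exists
merged = flair.merge(
    topostext,
    left_on=["Reference", "First position"],
    right_on=["Reference", "Start position"],
    how="left",
    indicator=True,
)

common = merged[merged["_merge"] == "both"]          # present in both sources
flair_only = merged[merged["_merge"] == "left_only"]  # detected by Flair NER only
print(len(common), len(flair_only))
```

As with the sets, the two parts add up to the number of Flair rows only when the ToposText keys are unique; a duplicated key would repeat the matching Flair row in the merge.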

# 1.8.3 Partially Overlapping Annotations

Nonetheless, some of these annotations detected by Flair NER are not new, but partially overlap with annotations that already exist in ToposText. For instance, the same entity was annotated as 'Mount Pindus' by Flair NER and as 'Pindus' in ToposText, with different start positions ((357, 369) and (363, 369) respectively). To identify these cases, we checked whether the position range of the entity extracted by Flair NER lies within the position range of an entity annotated in ToposText with the same reference, or vice versa.

In total, we found 88 overlapping annotations.
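The condition described above is a containment test: one span must lie entirely inside the other. The sketch below isolates that test in two small helper functions and, for comparison only, adds a general overlap test that the notebook does not use. The function names are hypothetical, and the half-open reading of the spans (end position exclusive) is inferred from the 'Pindus' example, where 369 - 363 equals the length of the string.

```python
def contains(outer, inner):
    """True if span `inner` lies entirely within span `outer`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


def nested(span_a, span_b):
    """Containment in either direction, as in the check described above."""
    return contains(span_a, span_b) or contains(span_b, span_a)


def overlaps(span_a, span_b):
    """General overlap of half-open spans (shown for comparison only)."""
    return span_a[0] < span_b[1] and span_b[0] < span_a[1]


# 'Mount Pindus' (357, 369) contains 'Pindus' (363, 369)
print(nested((357, 369), (363, 369)))
# Spans that cross without containment: overlapping but not nested
print(nested((10, 20), (15, 25)), overlaps((10, 20), (15, 25)))
```

Two spans that merely cross each other would therefore not be counted among the overlapping annotations by the containment test.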

In [16]:
## the subdataset contains the Flair annotations that are not present in ToposText
filtered_NER_Flair_Book4 = NER_Flair_Book4[NER_Flair_Book4[['Reference', 'First position']].apply(tuple, axis=1).isin(Not_Common_Annotations)]
filtered_NER_Flair_Book4.reset_index(inplace=True) ## reset the index

In [17]:
len(filtered_NER_Flair_Book4)

761
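The row filter above builds one tuple per row and tests it against the set of keys. A minimal sketch of the same idea on toy data follows; it builds the boolean mask with `zip` and a list comprehension instead of `apply(tuple, axis=1)`, which is an equivalent formulation chosen here for readability rather than the notebook's own code.

```python
import pandas as pd

# Toy Flair output and a toy set of keys to keep
flair = pd.DataFrame(
    {
        "Reference": ["4.1.1", "4.1.1", "4.1.2"],
        "First position": [41, 75, 357],
        "Named Entity": ["Europe", "Acroceraunia", "Mount Pindus"],
    }
)
keys_to_keep = {("4.1.1", 41), ("4.1.2", 357)}

# One boolean per row: is the (reference, start) pair among the keys?
mask = [key in keys_to_keep for key in zip(flair["Reference"], flair["First position"])]

subset = flair[mask].reset_index(drop=True)
print(subset["Named Entity"].tolist())
```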

In [18]:
## detect partially overlapping annotations

Overlapping_Annotations = [] ## create a list of overlapping annotations

for i1, Annotation_to_Check in enumerate(filtered_NER_Flair_Book4["Reference"]): ## for each annotation in the subdataset
    Start_End_Pos1 = (filtered_NER_Flair_Book4["First position"][i1], filtered_NER_Flair_Book4["Last position"][i1]) ## create a tuple of the start and end position
    
    for i2, ToposText_Annotation in enumerate(ToposText_Book4["Reference"]): ## for each annotation in ToposText
        Start_End_Pos2 = (ToposText_Book4["Start position"][i2], ToposText_Book4["End position"][i2]) ## create a tuple of the start and end position
        
        if Annotation_to_Check == ToposText_Annotation: ## if they have the same reference
            
            ## check if the ToposText annotation is included in the NER annotation or vice versa
            if (Start_End_Pos1[0] <= Start_End_Pos2[0] and Start_End_Pos1[1] >= Start_End_Pos2[1]) or (Start_End_Pos2[0] <= Start_End_Pos1[0] and Start_End_Pos2[1] >= Start_End_Pos1[1]):
                Overlapping_Annotations.append((Annotation_to_Check, Start_End_Pos1[0]))
                
                print(i1, Annotation_to_Check, filtered_NER_Flair_Book4["Named Entity"][i1], Start_End_Pos1, ToposText_Book4["Tagged Entity"][i2], Start_End_Pos2)

7 urn:cts:latinLit:phi0978.phi001:4.1.2 Mount Pindus (357, 369) Pindus (363, 369)
11 urn:cts:latinLit:phi0978.phi001:4.1.2 Mount Tomarus (539, 552) Tomarus (545, 552)
17 urn:cts:latinLit:phi0978.phi001:4.2.1 Promontory of Leucate (326, 347) Leucate (340, 347)
26 urn:cts:latinLit:phi0978.phi001:4.3.1 Mounts Chalcis (411, 425) Chalcis (418, 425)
31 urn:cts:latinLit:phi0978.phi001:4.4.1 Ozolae (67, 73) Locri surnamed Ozolae (52, 73)
36 urn:cts:latinLit:phi0978.phi001:4.4.1 Mount Parnassus (444, 459) Parnassus (450, 459)
37 urn:cts:latinLit:phi0978.phi001:4.4.1 Fountain too of Castalia (550, 574) Castalia (566, 574)
41 urn:cts:latinLit:phi0978.phi001:4.5.2 Heights of Corinth (317, 335) Corinth (328, 335)
42 urn:cts:latinLit:phi0978.phi001:4.5.2 Fountain of Pirene (358, 376) Pirene (370, 376)
47 urn:cts:latinLit:phi0978.phi001:4.6.1 Fountain of Cymothoe (834, 854) Cymothoe (846, 854)
48 urn:cts:latinLit:phi0978.phi001:4.6.1 Promontory of Araxus (973, 993) Araxus (987, 993)
49 urn:cts:latinL

In [19]:
len(Overlapping_Annotations)

88
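The nested loop compares every remaining Flair annotation with every ToposText annotation. As a hedged alternative, the same containment test can be written as a merge on 'Reference' followed by a vectorised comparison. The sketch uses toy rows with the notebook's column names; it is not the code that produced the count of 88.

```python
import pandas as pd

# Toy rows: two Flair spans in 4.1.2 (one nested, one unrelated) and one in 4.4.1
flair = pd.DataFrame(
    {
        "Reference": ["4.1.2", "4.1.2", "4.4.1"],
        "First position": [357, 700, 67],
        "Last position": [369, 710, 73],
    }
)
topostext = pd.DataFrame(
    {
        "Reference": ["4.1.2", "4.4.1"],
        "Start position": [363, 52],
        "End position": [369, 73],
    }
)

# Pair every Flair span with every ToposText span of the same reference
pairs = flair.merge(topostext, on="Reference")

flair_contains = (pairs["First position"] <= pairs["Start position"]) & (
    pairs["Last position"] >= pairs["End position"]
)
topos_contains = (pairs["Start position"] <= pairs["First position"]) & (
    pairs["End position"] >= pairs["Last position"]
)

nested = pairs[flair_contains | topos_contains]
overlapping = set(zip(nested["Reference"], nested["First position"]))
print(sorted(overlapping))
```

Because the result is a set, a Flair annotation nested with several ToposText annotations is counted once, whereas the list built in the loop would record it once per match.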

# 1.8.4 New Annotations from Flair NER

After removing the overlapping annotations, we also excluded the entities that Flair NER returned entirely in capital letters and the common nouns (entirely in lowercase). This leaves 631 new annotations detected by Flair NER that were not in ToposText.

In [20]:
## the subdataset contains only the new annotations from Flair NER (673)
filtered_NER_Flair_Book4 = filtered_NER_Flair_Book4[~filtered_NER_Flair_Book4[['Reference', 'First position']].apply(tuple, axis=1).isin(Overlapping_Annotations)]

In [21]:
len(filtered_NER_Flair_Book4)

673

In [22]:
## filter out entities in capital letters
filtered_NER_Flair_Book4 = filtered_NER_Flair_Book4[~filtered_NER_Flair_Book4['Named Entity'].str.isupper()]

In [23]:
len(filtered_NER_Flair_Book4)

633

In [24]:
## filter out common nouns
filtered_NER_Flair_Book4 = filtered_NER_Flair_Book4[~filtered_NER_Flair_Book4['Named Entity'].str.islower()]

In [25]:
len(filtered_NER_Flair_Book4)

631

# 1.8.5 Adding the New Annotations to ToposText

Adding the 631 new annotations to the 1,877 entries already in ToposText gives an enriched dataset of 2,508 entries in total.

In [26]:
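## empty the content of column 'Score' (assign returns a new dataframe, which avoids a SettingWithCopyWarning on the filtered subset)
filtered_NER_Flair_Book4 = filtered_NER_Flair_Book4.assign(Score='')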

In [27]:
## rename the columns to match the ToposText CSV file before concatenating; the emptied 'Score' column becomes 'ToposText ID'
filtered_NER_Flair_Book4 = filtered_NER_Flair_Book4.rename(columns={'Named Entity': 'Tagged Entity', 'Type': 'Class', 'First position': 'Start position', 'Last position': 'End position', 'Score': 'ToposText ID'})

In [28]:
## concatenate the two subdatasets
Enriched_ToposText_Book4 = pd.concat([ToposText_Book4, filtered_NER_Flair_Book4], ignore_index=True)
Enriched_ToposText_Book4.drop("index", axis=1, inplace=True)

In [29]:
## reorder the dataset according to the reference and start position (ascending order)
Enriched_ToposText_Book4 = Enriched_ToposText_Book4.sort_values(by=['Reference', 'Start position'])

In [30]:
Enriched_ToposText_Book4.reset_index(inplace=True) ## reset the index
Enriched_ToposText_Book4.drop("index", axis=1, inplace=True) ## drop the 'index' column
Enriched_ToposText_Book4

Unnamed: 0,Reference,Tagged Entity,Class,Start position,End position,ToposText ID
0,urn:cts:latinLit:phi0978.phi001:4.1.1,Europe,LOC,41,47,
1,urn:cts:latinLit:phi0978.phi001:4.1.1,Acroceraunia,['place'],75,87,https://topostext.org/place/404194LKer
2,urn:cts:latinLit:phi0978.phi001:4.1.1,Hellespont,['place'],105,115,https://topostext.org/place/402264WHel
3,urn:cts:latinLit:phi0978.phi001:4.1.1,Epirus,['place'],217,223,https://topostext.org/place/395205REpe
4,urn:cts:latinLit:phi0978.phi001:4.1.1,Acarnania,['place'],225,234,https://topostext.org/place/388210RAka
...,...,...,...,...,...,...
2503,urn:cts:latinLit:phi0978.phi001:4.9.2,Cretan,['demonym'],896,902,https://topostext.org/place/352252IKre
2504,urn:cts:latinLit:phi0978.phi001:4.9.2,Aegean,['place'],924,930,https://topostext.org/place/381253WAeg
2505,urn:cts:latinLit:phi0978.phi001:4.9.2,Myrtoan,['demonym'],955,962,https://topostext.org/place/370240WMyr
2506,urn:cts:latinLit:phi0978.phi001:4.9.2,Megara,['place'],1013,1019,https://topostext.org/place/380233PMeg
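The rename, concatenate and sort steps of this section can be reproduced end to end on a toy example. The column names follow the notebook; the rows, the score value and the placeholder IDs are invented for illustration.

```python
import pandas as pd

# Toy ToposText rows (IDs are placeholders, not real ToposText URLs)
topostext = pd.DataFrame(
    {
        "Reference": ["4.1.1", "4.1.2"],
        "Tagged Entity": ["Acroceraunia", "Pindus"],
        "Class": ["['place']", "['place']"],
        "Start position": [75, 363],
        "End position": [87, 369],
        "ToposText ID": ["id-a", "id-b"],
    }
)

# Toy Flair row (the score is made up)
flair_new = pd.DataFrame(
    {
        "Reference": ["4.1.1"],
        "Named Entity": ["Europe"],
        "Type": ["LOC"],
        "First position": [41],
        "Last position": [47],
        "Score": [0.99],
    }
)

# Blank the score, then align the Flair column names with the ToposText ones
flair_new = flair_new.assign(Score="").rename(
    columns={
        "Named Entity": "Tagged Entity",
        "Type": "Class",
        "First position": "Start position",
        "Last position": "End position",
        "Score": "ToposText ID",
    }
)

# Stack both sources, then order by reference and start position
enriched = (
    pd.concat([topostext, flair_new], ignore_index=True)
    .sort_values(by=["Reference", "Start position"])
    .reset_index(drop=True)
)
print(enriched["Tagged Entity"].tolist())
```

Rows that come from Flair NER remain recognisable afterwards by their empty 'ToposText ID' and by a Flair label such as `LOC` in the 'Class' column, as in the first row of the output above.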


In [31]:
Enriched_ToposText_Book4.to_csv("1.8.Enriched_ToposText_Book4.csv")
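Written this way, the CSV file also stores the dataframe index, which pandas reads back as a column named 'Unnamed: 0', the same kind of column that had to be dropped from the input file at the start of this notebook. The sketch below shows the difference with `index=False` on a toy dataframe, using in-memory buffers instead of files. Whether to change the export is a judgment call, since later steps may expect the extra column.

```python
import io

import pandas as pd

frame = pd.DataFrame({"Reference": ["4.1.1"], "Start position": [41]})

# Default export: the index is written as an unnamed first column
with_index = io.StringIO()
frame.to_csv(with_index)
with_index.seek(0)
columns_with_index = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
frame.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = pd.read_csv(without_index).columns.tolist()

print(columns_with_index)
print(columns_without_index)
```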