# Perform NER on Book 4

We tested different NER systems with the scope of identifying missing entities or incorrectly labeld entities and therefore improving the quality of ToposText annotation.

- Flair
- SpaCy
- BERT

The three systems were tested on the same sentence:
'BOIOTIA: In this country are Anthedon, Onchestus, the free town of Thespiae, Lebadea, and then Thebes, surnamed Boeotian, which does not yield the palm to Athens even in celebrity; the native land, according to the common notion, of the two Divinities Liber and Hercules. The birth-place of the Muses too is pointed out in the grove of Helicon. To this same Thebes also belong the forest of Cithaeron, and the river Ismenus. Besides these, there are in Boeotia the Fountains of Oidipodia, Psamathe, Dirce, Epicrane, Arethusa, Hippocrene, Aganippe, and Gargaphie; and, besides the mountains already mentioned, Mycalesos, Hadylius, and Acontius. The remaining towns between Megara and Thebes are Eleutherae, Haliartus, Plataeae, Pherae, Aspledon, Hyle, Thisbe, Erythrae, Glissas, and Copae; near the river Cephisus, Larymna and Anchoa; as also Medeon, Phlygone, Acraephia, Coronea, and Chaeronea. Again, on the coast and below Thebes, are Ocalea, Heleon, Scolos, Schoenos, Peteon, Hyriae, Mycalesos, Iresion, Pteleon, Olyros, and Tanagra, the people of which are free; and, situate upon the very mouth of the Euripus, a strait formed by the opposite island of Euboea, Aulis, so famous for its capacious harbour. The Boeotians formerly had the name of Hyantes.'

The output of the notebook is a CSV file containing the output of NER Flair (ner-large) on Book 4. The file contains the named entities detected in Book 4, the label assigned by the NER system, the start and end position of the word in the paragraph and the probability score for the type label. 

In [1]:
import pandas as pd
import nltk
from nltk.stem import WordNetLemmatizer
from collections import Counter
import re
import spacy
from bs4 import BeautifulSoup
from spacy import displacy
import csv

# 1.4.1 Flair

Flair models tested:
- ner
- ner-large

See: https://flairnlp.github.io/docs/tutorial-basics/tagging-entities

In [None]:
pip install flair

In [2]:
from flair.data import Sentence
from flair.nn import Classifier

# Test Flair NER on a sentence

In [3]:
## test Flair NER system on a sentence using Classifier

test = "<p work='148' id='4.12.1' wdate='77' edate='-1'>BOIOTIA: In this country are <place id='385235PAnt'>Anthedon</place>, <place id='384232UOnc'>Onchestus</place>, the free town of <place id='383232PThe'>Thespiae</place>, <place id='384229PLeb'>Lebadea</place>, and then <place id='383233PThe'>Thebes</place>, surnamed <demonym id='384233RBoi'>Boeotian</demonym>, which does not yield the palm to <place id='380237PAth'>Athens</place> even in celebrity; the native land, according to the common notion, of the two Divinities <PRN id='Q41680'>Liber</PRN> and <PRN id='Q122248'>Hercules</PRN>. The <place id='383231SMus'>birth-place of the Muses</place> too is pointed out in the grove of <place id='384228LHel'>Helicon</place>. To this same <place id='383233PThe'>Thebes</place> also belong the forest of <place id='382233LCit'>Cithaeron</place>, and the river <place id='383233WIsm'>Ismenus</place>. Besides these, there are in <place id='384233RBoi'>Boeotia</place> the Fountains of Oidipodia, <PRN id='A2196;A2197'>Psamathe</PRN>, <place id='383233WDir'>Dirce</place>, Epicrane, <PRN id='A311'>Arethusa</PRN>, <place id='383230WHip'>Hippocrene</place>, <PRN id='AganippeYY'>Aganippe</PRN>, and <place id='382233WGar'>Gargaphie</place>; and, besides the mountains already mentioned, <place id='384235PMyk'>Mycalesos</place>, <PRN id='HadyliusYY'>Hadylius</PRN>, and <place id='385229LAko'>Acontius</place>. The remaining towns between <place id='380233PMeg'>Megara</place> and <place id='383233PThe'>Thebes</place> are <place id='382234FEle'>Eleutherae</place>, <place id='384231PHal'>Haliartus</place>, <place id='382233PPla'>Plataeae</place>, <place id='384236PPha'>Pherae</place>, <place id='385230UAsp'>Aspledon</place>, <place id='385233UHyl'>Hyle</place>, <place id='383230PThi'>Thisbe</place>, <place id='382234PEry'>Erythrae</place>, <place id='384234UGli'>Glissas</place>, and <place id='385232PKop'>Copae</place>; near the river <place id='388228WKep'>Cephisus</place>, <place id='386233PLar'>Larymna</place> and <place id='385233UAnc'>Anchoa</place>; as also <place id='384232UMed'>Medeon</place>, <place id='385227PPhl'>Phlygone</place>, <place id='385232PAkr'>Acraephia</place>, <place id='384230PKor'>Coronea</place>, and <place id='385228PCha'>Chaeronea</place>. Again, on the coast and below <place id='383233PThe'>Thebes</place>, are <place id='384230UOka'>Ocalea</place>, <place id='384235UHel'>Heleon</place>, <place id='383234PSko'>Scolos</place>, <place id='384234USch'>Schoenos</place>, <place id='384234UPet'>Peteon</place>, <place id='385236UHyr'>Hyriae</place>, <place id='384235PMyk'>Mycalesos</place>, <place id='383235UEil'>Iresion</place>, <place id='381236DPte'>Pteleon</place>, Olyros, and <place id='383236PTan'>Tanagra</place>, the people of which are free; and, situate upon the very mouth of the <place id='385236WEur'>Euripus</place>, a strait formed by the opposite island of <place id='385239IEub'>Euboea</place>, <place id='384236UAul'>Aulis</place>, so famous for its capacious harbour. The <ethnic id='384233RBoi'>Boeotians</ethnic> formerly had the name of Hyantes. <a href='http://latin.packhum.org/cit/PlinSen/Nat/4.12' target='_blank'>SOL</a> </p>"
soup = BeautifulSoup(test, 'html.parser')
text = soup.get_text() ## get the text of the test

## make a sentence from the text using the Flair Sentence function
sentence = Sentence(text)

## load the NER tagger ner-large
tagger = Classifier.load('ner-large')

## run NER over the sentence
tagger.predict(sentence)

## iterate over the predicted entities and print the entity name, label and the start and end position
for entity in sentence.get_spans('ner'):
    print(entity.text, entity.labels[0].value, entity.start_position, entity.end_position)

2023-05-19 09:47:21,366 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>
BOIOTIA LOC 0 7
Anthedon LOC 29 37
Onchestus LOC 39 48
Thespiae LOC 67 75
Lebadea LOC 77 84
Thebes LOC 95 101
Boeotian MISC 112 120
Athens LOC 155 161
Liber PER 252 257
Hercules PER 262 270
Muses PER 295 300
Helicon LOC 336 343
Thebes LOC 358 364
Cithaeron LOC 391 400
Ismenus LOC 416 423
Boeotia LOC 453 460
Fountains of Oidipodia MISC 465 487
Psamathe LOC 489 497
Dirce LOC 499 504
Epicrane LOC 506 514
Arethusa LOC 516 524
Hippocrene MISC 526 536
Aganippe LOC 538 546
Gargaphie LOC 552 561
Mycalesos LOC 609 618
Hadylius LOC 620 628
Acontius LOC 634 642
Megara LOC 672 678
Thebes LOC 683 689
Eleutherae LOC 694 704
Haliartus LOC 706 715
Plataeae LOC 717 725
Pherae LOC 727 733
Aspledon LOC 735 743
Hyle LOC 745 749
Thisbe LOC 751 757
Erythrae LOC 759 767
Glissas LOC 769 776
Copae 

# 1.4.2 SpaCy

NER SpaCy systems tested:
- en_core_web_sm
- en_core_web_md
- en_core_web_lg

In [None]:
!python -m spacy download en_core_web_md

# Test SpaCy NER on a sentence

In [5]:
## test spaCy NER system on a sentence

nlp = spacy.load("en_core_web_md")

test = "<p work='148' id='4.12.1' wdate='77' edate='-1'>BOIOTIA: In this country are <place id='385235PAnt'>Anthedon</place>, <place id='384232UOnc'>Onchestus</place>, the free town of <place id='383232PThe'>Thespiae</place>, <place id='384229PLeb'>Lebadea</place>, and then <place id='383233PThe'>Thebes</place>, surnamed <demonym id='384233RBoi'>Boeotian</demonym>, which does not yield the palm to <place id='380237PAth'>Athens</place> even in celebrity; the native land, according to the common notion, of the two Divinities <PRN id='Q41680'>Liber</PRN> and <PRN id='Q122248'>Hercules</PRN>. The <place id='383231SMus'>birth-place of the Muses</place> too is pointed out in the grove of <place id='384228LHel'>Helicon</place>. To this same <place id='383233PThe'>Thebes</place> also belong the forest of <place id='382233LCit'>Cithaeron</place>, and the river <place id='383233WIsm'>Ismenus</place>. Besides these, there are in <place id='384233RBoi'>Boeotia</place> the Fountains of Oidipodia, <PRN id='A2196;A2197'>Psamathe</PRN>, <place id='383233WDir'>Dirce</place>, Epicrane, <PRN id='A311'>Arethusa</PRN>, <place id='383230WHip'>Hippocrene</place>, <PRN id='AganippeYY'>Aganippe</PRN>, and <place id='382233WGar'>Gargaphie</place>; and, besides the mountains already mentioned, <place id='384235PMyk'>Mycalesos</place>, <PRN id='HadyliusYY'>Hadylius</PRN>, and <place id='385229LAko'>Acontius</place>. The remaining towns between <place id='380233PMeg'>Megara</place> and <place id='383233PThe'>Thebes</place> are <place id='382234FEle'>Eleutherae</place>, <place id='384231PHal'>Haliartus</place>, <place id='382233PPla'>Plataeae</place>, <place id='384236PPha'>Pherae</place>, <place id='385230UAsp'>Aspledon</place>, <place id='385233UHyl'>Hyle</place>, <place id='383230PThi'>Thisbe</place>, <place id='382234PEry'>Erythrae</place>, <place id='384234UGli'>Glissas</place>, and <place id='385232PKop'>Copae</place>; near the river <place id='388228WKep'>Cephisus</place>, <place id='386233PLar'>Larymna</place> and <place id='385233UAnc'>Anchoa</place>; as also <place id='384232UMed'>Medeon</place>, <place id='385227PPhl'>Phlygone</place>, <place id='385232PAkr'>Acraephia</place>, <place id='384230PKor'>Coronea</place>, and <place id='385228PCha'>Chaeronea</place>. Again, on the coast and below <place id='383233PThe'>Thebes</place>, are <place id='384230UOka'>Ocalea</place>, <place id='384235UHel'>Heleon</place>, <place id='383234PSko'>Scolos</place>, <place id='384234USch'>Schoenos</place>, <place id='384234UPet'>Peteon</place>, <place id='385236UHyr'>Hyriae</place>, <place id='384235PMyk'>Mycalesos</place>, <place id='383235UEil'>Iresion</place>, <place id='381236DPte'>Pteleon</place>, Olyros, and <place id='383236PTan'>Tanagra</place>, the people of which are free; and, situate upon the very mouth of the <place id='385236WEur'>Euripus</place>, a strait formed by the opposite island of <place id='385239IEub'>Euboea</place>, <place id='384236UAul'>Aulis</place>, so famous for its capacious harbour. The <ethnic id='384233RBoi'>Boeotians</ethnic> formerly had the name of Hyantes. <a href='http://latin.packhum.org/cit/PlinSen/Nat/4.12' target='_blank'>SOL</a> </p>"
soup = BeautifulSoup(test, 'html.parser')
text = soup.get_text() ## get the text of the test

## process the text with the language model
test = nlp(text)

## visualize the output
displacy.render(test, style="ent", jupyter=True)

print(len(test.ents), "named entities detected")

for ent in test.ents:
    entity_text = ent.text
    entity_start = ent.start_char
    entity_end = ent.end_char
    entity_label = ent.label_

56 named entities detected


# 1.4.3 BERT

In [None]:
pip install transformers

In [6]:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline

In [7]:
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")

Downloading (…)okenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


Downloading (…)lve/main/config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]

Downloading (…)in/added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/433M [00:00<?, ?B/s]

# Test NER BERT on a sentence

In [8]:
## test BERT NER model on a sentence

nlp = pipeline("ner", model=model, tokenizer=tokenizer)

sentence = "<p work='148' id='4.12.1' wdate='77' edate='-1'>BOIOTIA: In this country are <place id='385235PAnt'>Anthedon</place>, <place id='384232UOnc'>Onchestus</place>, the free town of <place id='383232PThe'>Thespiae</place>, <place id='384229PLeb'>Lebadea</place>, and then <place id='383233PThe'>Thebes</place>, surnamed <demonym id='384233RBoi'>Boeotian</demonym>, which does not yield the palm to <place id='380237PAth'>Athens</place> even in celebrity; the native land, according to the common notion, of the two Divinities <PRN id='Q41680'>Liber</PRN> and <PRN id='Q122248'>Hercules</PRN>. The <place id='383231SMus'>birth-place of the Muses</place> too is pointed out in the grove of <place id='384228LHel'>Helicon</place>. To this same <place id='383233PThe'>Thebes</place> also belong the forest of <place id='382233LCit'>Cithaeron</place>, and the river <place id='383233WIsm'>Ismenus</place>. Besides these, there are in <place id='384233RBoi'>Boeotia</place> the Fountains of Oidipodia, <PRN id='A2196;A2197'>Psamathe</PRN>, <place id='383233WDir'>Dirce</place>, Epicrane, <PRN id='A311'>Arethusa</PRN>, <place id='383230WHip'>Hippocrene</place>, <PRN id='AganippeYY'>Aganippe</PRN>, and <place id='382233WGar'>Gargaphie</place>; and, besides the mountains already mentioned, <place id='384235PMyk'>Mycalesos</place>, <PRN id='HadyliusYY'>Hadylius</PRN>, and <place id='385229LAko'>Acontius</place>. The remaining towns between <place id='380233PMeg'>Megara</place> and <place id='383233PThe'>Thebes</place> are <place id='382234FEle'>Eleutherae</place>, <place id='384231PHal'>Haliartus</place>, <place id='382233PPla'>Plataeae</place>, <place id='384236PPha'>Pherae</place>, <place id='385230UAsp'>Aspledon</place>, <place id='385233UHyl'>Hyle</place>, <place id='383230PThi'>Thisbe</place>, <place id='382234PEry'>Erythrae</place>, <place id='384234UGli'>Glissas</place>, and <place id='385232PKop'>Copae</place>; near the river <place id='388228WKep'>Cephisus</place>, <place id='386233PLar'>Larymna</place> and <place id='385233UAnc'>Anchoa</place>; as also <place id='384232UMed'>Medeon</place>, <place id='385227PPhl'>Phlygone</place>, <place id='385232PAkr'>Acraephia</place>, <place id='384230PKor'>Coronea</place>, and <place id='385228PCha'>Chaeronea</place>. Again, on the coast and below <place id='383233PThe'>Thebes</place>, are <place id='384230UOka'>Ocalea</place>, <place id='384235UHel'>Heleon</place>, <place id='383234PSko'>Scolos</place>, <place id='384234USch'>Schoenos</place>, <place id='384234UPet'>Peteon</place>, <place id='385236UHyr'>Hyriae</place>, <place id='384235PMyk'>Mycalesos</place>, <place id='383235UEil'>Iresion</place>, <place id='381236DPte'>Pteleon</place>, Olyros, and <place id='383236PTan'>Tanagra</place>, the people of which are free; and, situate upon the very mouth of the <place id='385236WEur'>Euripus</place>, a strait formed by the opposite island of <place id='385239IEub'>Euboea</place>, <place id='384236UAul'>Aulis</place>, so famous for its capacious harbour. The <ethnic id='384233RBoi'>Boeotians</ethnic> formerly had the name of Hyantes. <a href='http://latin.packhum.org/cit/PlinSen/Nat/4.12' target='_blank'>SOL</a> </p>"
soup = BeautifulSoup(sentence, 'html.parser')
text = soup.get_text() ## get the text of the test

ner_results = nlp(text)

for result in ner_results:
    
    text = result["word"]
    label = result["entity"]
    start = result["start"]
    
    print(f"Entity: {text}, Label: {label}, Start: {start}")

Entity: ##TI, Label: I-ORG, Start: 4
Entity: An, Label: B-LOC, Start: 29
Entity: ##the, Label: I-LOC, Start: 31
Entity: ##don, Label: I-LOC, Start: 34
Entity: On, Label: B-LOC, Start: 39
Entity: ##ches, Label: I-LOC, Start: 41
Entity: ##tus, Label: I-LOC, Start: 45
Entity: The, Label: B-LOC, Start: 67
Entity: ##sp, Label: I-LOC, Start: 70
Entity: ##iae, Label: I-LOC, Start: 72
Entity: Le, Label: B-LOC, Start: 77
Entity: ##bad, Label: I-LOC, Start: 79
Entity: ##ea, Label: I-LOC, Start: 82
Entity: The, Label: B-LOC, Start: 95
Entity: ##bes, Label: I-LOC, Start: 98
Entity: Bo, Label: B-LOC, Start: 112
Entity: ##eo, Label: I-LOC, Start: 114
Entity: ##tian, Label: I-MISC, Start: 116
Entity: Athens, Label: B-LOC, Start: 155
Entity: Di, Label: B-MISC, Start: 241
Entity: Li, Label: B-ORG, Start: 252
Entity: Hercules, Label: B-PER, Start: 262
Entity: Muse, Label: B-ORG, Start: 295
Entity: He, Label: B-PER, Start: 336
Entity: The, Label: B-LOC, Start: 358
Entity: ##bes, Label: I-LOC, Start: 361


BERT is a subword-based model, meaning that it uses subword tokens to represent words. This is done to handle out-of-vocabulary (OOV) words and to improve the model's ability to handle morphologically complex languages like Turkish, Finnish, or German.

BERT uses WordPiece tokenization, which means that words can be split into subwords based on their frequency in the training corpus. Subwords are created by greedily selecting the most frequent character n-grams, where n can be any integer up to a pre-specified maximum.

When BERT predicts named entities, it operates at the subword level. As a result, it can output named entities that span across multiple subwords, which is why you see subwords in the output.

# 1.4.4 Perform Flair NER on Book 4

The test on the sentence and the manual check of the output of the different NER systems showed that NER Flair had a better performance. Flair was then performed on the entire Book 4.

In [None]:
## perform Flair NER on book 4

## open the source page as soup
soup=BeautifulSoup(open("/Users/u0154817/OneDrive - KU Leuven\Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Sources/NH_Eng_ToposText/NH_Eng_1-11.html", encoding='utf-8'), features="lxml")

## write the new csv table
f=csv.writer(open("1.5.NER_Flair_Book_4.csv", "w", newline=''))

f.writerow(["Reference", "Named Entity", "Type", "Start position", "End position", "Score"]) ## write column headers

## load the NER tagger
tagger = Classifier.load('ner-large')

## find all the paragraphs in book 4
Book_4 = soup.find_all("p", id=lambda x: x and x.startswith("urn:cts:latinLit:phi0978.phi001:4."))

for paragraph in Book_4: ## for each paragraph in book 4
    Reference=paragraph.get("id") ## position (book, chapter)
    paragraph=paragraph.get_text() ## get the text of the paragraph
    paragraph=Sentence(paragraph) ## perform Flair_Sentence on the text
    print(paragraph) ## print the paragraph
    
    tagger.predict(paragraph) ## run NER over the paragraph
    
    for entity in paragraph.get_spans('ner'): ## for each entity detected in the paragraph
        Named_Entity=entity.text ## get the named entity
        Type=entity.labels[0].value ## get the type label
        Start_Position=entity.start_position ## get the start position
        End_Position=entity.end_position ## get the end position
        Score=entity.labels[0].score ## get the probability score for the type label
        f.writerow([Reference, Named_Entity, Type, Start_Position, End_Position, Score])