# Extract ToposText Annotations in Book 4

To evaluate the quality of the ToposText annotation, we extracted all the annotations in Book 4, including places, persons, and ethnics. For each annotation, we extracted its position (book, chapter, paragraph), the textual content of the tag (e.g., Rome), the class label (if present, e.g., 'place'), the start and end positions of the tagged word in the paragraph, and the corresponding ToposText ID (if present).

The position of a word is counted in characters from the very beginning of the paragraph. Note that in ToposText each paragraph begins with its ID (for instance, § 4.1.1 EPIRUS: ...), so the ID is included in the count. Punctuation was not removed.

To extract the start position of each tagged word, we started from the HTML version of the text containing the tags. Each paragraph was cleaned by removing the p tags (paragraph), the b tags (paragraph ID), and the closing a tags (places, people, ethnics), while keeping their textual content. We replaced each opening < a > tag with a special character (+) to mark the following word as a 'tagged word'. The start position of a tagged word is the position of its special character (+) minus the number of special characters that precede it in the paragraph. At the end of the process, we obtained a list of the start positions of all tagged words.
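The marker technique described above can be tried on a single made-up paragraph. The following is a minimal sketch, not the notebook's own cell: the function name `tagged_start_positions`, the sample paragraph, its ID `4.0.0`, and the `href` values are all invented for illustration.

```python
import re

def tagged_start_positions(paragraph_html):
    """Return (clean_text, starts) for one paragraph of tagged HTML."""
    # Drop the <p> and <b> tags but keep the text they enclose
    text = re.sub(r'</?(?:p|b)\b[^>]*>', '', paragraph_html)
    # Drop the closing </a> tags, then mark every opening <a ...> tag with '+'
    text = re.sub(r'</a>', '', text)
    marked = re.sub(r'<a\b[^>]*>', '+', text)

    starts = []
    seen = 0  # number of '+' markers already passed
    for i, char in enumerate(marked):
        if char == '+':
            # Each earlier marker shifted this one by one character
            starts.append(i - seen)
            seen += 1
    return marked.replace('+', ''), starts

# Hypothetical paragraph: the ID and the href values are invented
sample = ('<p id="demo"><b>§ 4.0.0 </b>From <a href="/place/1">Italy</a> '
          'to the <a href="/place/2">Alps</a>.</p>')
clean, starts = tagged_start_positions(sample)
print(clean)
print(starts)
```

This approach assumes that the character + never occurs in the source text itself. A literal + would be counted as a marker and would shift every later position in that paragraph.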

Splitting the process into (1) start position extraction and (2) tagged word extraction was necessary because some tagged words occur more than once in the text, so searching the paragraph for the word itself does not reliably identify which occurrence was tagged.
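A small illustration of the difficulty, using an invented sentence in which the same word is tagged twice. Searching for each tagged word with `str.find` returns the first occurrence every time, whereas positions derived from the + markers keep the occurrences apart. The sentence and variable names are assumptions made for this sketch.

```python
# Invented sentence; '+' marks the words that would carry an <a> tag
text = "From Italy to the ocean; the distance of Italy to the Alps."
marked = "From +Italy to the ocean; the distance of +Italy to the +Alps."
tagged_words = ["Italy", "Italy", "Alps"]

# Naive approach: search the clean text for each tagged word
naive_starts = [text.find(word) for word in tagged_words]

# Marker approach: position of each '+' minus the markers already seen
marker_starts = []
for i, char in enumerate(marked):
    if char == "+":
        marker_starts.append(i - len(marker_starts))

print(naive_starts)   # both occurrences of "Italy" collapse onto the first one
print(marker_starts)

# Sanity check: every marker-derived span contains the tagged word
for word, start in zip(tagged_words, marker_starts):
    assert text[start:start + len(word)] == word
```

The final loop is a check that could also be run on the real data: slicing the cleaned paragraph from the start position to the end position should give back the tagged entity.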

The ToposText IDs are annotated in two different ways: in some tags the ID is in the "about" attribute, in others it is in the "href" attribute.
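The sketch below shows one way to read the ID from either attribute, preferring "about" and falling back to "href". It uses the standard-library `html.parser` instead of BeautifulSoup only to stay dependency-free. The class name `AnchorIDs` and the "about" value in the sample are invented; the `https://topostext.org` prefix for relative "href" values follows the extraction code in this notebook.

```python
from html.parser import HTMLParser

BASE_URL = "https://topostext.org"

class AnchorIDs(HTMLParser):
    """Collect one ID per <a> tag, preferring 'about' over 'href'."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        about = attributes.get("about")
        href = attributes.get("href")
        if about:
            self.ids.append(about)
        elif href:
            # href values are relative, so the site prefix is added
            self.ids.append(BASE_URL + href)
        else:
            self.ids.append(None)

# Hypothetical tags: the 'about' value is invented for illustration
extractor = AnchorIDs()
extractor.feed(
    '<a about="https://topostext.org/place/000AAA" class="place">Rome</a> and '
    '<a href="/people/15213">Asia</a>'
)
print(extractor.ids)
```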

In total, 1,888 annotations were present in Book 4.

In [1]:
import pandas as pd
import re
import csv
from bs4 import BeautifulSoup

# 1.1.1 Extraction of Start Positions of all the Tagged Words

The next cell contains the code to extract the start position of each word enclosed in a tags in the HTML source page.

In [2]:
## open the source HTML page as soup by BeautifulSoup
soup = BeautifulSoup(open("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Sources/NH_Eng_ToposText/NH_Eng_1-11.html", encoding='utf-8'), features="lxml")

## get all the paragraphs in Book 4
Book_4 = soup.find_all("p", id=lambda x: x and x.startswith("urn:cts:latinLit:phi0978.phi001:4.")) ## keep the p tags whose id starts with urn:cts:latinLit:phi0978.phi001:4.

## create a list of the start positions of all the tagged words in Book 4
Start_Positions_Annotations_Book4 = []

for Paragraph in Book_4: ## for each paragraph in Book 4
    
    Paragraph=str(Paragraph) ## convert the paragraph to a string
    
    ## clean the text
    Clean_text = re.sub('<p[^>]*>', '', Paragraph) ## remove the p tag
    Clean_text = re.sub('</p>', '', Clean_text)

    Clean_text = re.sub('<b>', '', Clean_text) ## remove the b tag
    Clean_text = re.sub('</b>', '', Clean_text)
    
    Clean_text = re.sub('</a>', '', Clean_text) ## remove the </a> tag
    Clean_text = re.sub('<a[^>]*>', '+', Clean_text) ## substitute the head of the <a> tag with the special character +
    
    List_of_Start_Positions = [] ## list of the start positions of tagged words in the paragraph
    Num_Special_Char_Seen = 0 ## counter of the special characters (+) already seen in the paragraph
    
    for i, Char in enumerate(Clean_text): ## for each character in the paragraph
        if Char == "+" : ## if the character is the special character +
            List_of_Start_Positions.append(i - Num_Special_Char_Seen) ## the start position of the following tagged word is the position of the special character minus the number of special characters already seen
            Num_Special_Char_Seen += 1 ## add 1 to the count of the special characters already seen
            
    Start_Positions_Annotations_Book4.extend(List_of_Start_Positions) ## add the start positions of this paragraph to the list for the whole book
    
print(Clean_text) ## show the cleansed text of the last paragraph
print(List_of_Start_Positions)

§ 4.37.1  THE GENERAL MEASUREMENT OF EUROPE: Having thus made the circuit of Europe, we must now give the complete measurement of it, in order that those who wish to be acquainted with this subject may not feel themselves at a loss. +Artemidorus and +Isidorus have given its length, from the +Tanais to +Gades, as 8214 miles. +Polybius in his writings has stated the breadth of Europe, in a line from +Italy to the ocean, to be 1150 miles. But, even in his day, its magnitude was but little known. The distance of +Italy, as we have previously stated, as far as the +Alps, is 1120 miles, from which, through +Lugdunum to the +British port of the +Morini, the direction which +Polybius seems to follow, is 1168 miles. But the better ascertained, though greater length, is that taken from the +Alps through the Camp of the Legions in +Germany, in a north-westerly direction, to the mouth of the +Rhine, being 1543 miles. We shall now have to speak of Africa and +Asia. 
[233, 249, 290, 300, 322, 396, 5

In total, 1,888 start positions of tagged words were extracted from Book 4.

In [3]:
len(Start_Positions_Annotations_Book4)

1888

# 1.1.2 Extract all the Annotations

The next cell contains the code to extract tagged words from a ToposText source page and create a CSV file with the extracted information.

In [4]:
## open the source HTML page as soup by BeautifulSoup
soup = BeautifulSoup(open("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Sources/NH_Eng_ToposText/NH_Eng_1-11.html", encoding='utf-8'), features="lxml")

## write the new csv file; the with block closes the file, so all the rows are on disk before the file is read again
with open("1.1.ToposText_Annotations_Book_4_TEMP.csv", "w", newline='', encoding='utf-8') as csv_file:
    f = csv.writer(csv_file)
    ## define column headers in the csv file
    f.writerow(["Reference", "Tagged Entity", "Class", "Start position", "End position", "ToposText ID", "Temporary_ToposTextID_href"])

    ## get all the paragraphs in Book 4
    Book_4 = soup.find_all("p", id=lambda x: x and x.startswith("urn:cts:latinLit:phi0978.phi001:4."))

    Count_Annotations = 0 ## count the number of annotations detected

    for Paragraph in Book_4: ## for each paragraph in Book 4

        Reference = Paragraph.get("id") ## get the ID of the paragraph (book, chapter, paragraph)
        a_tags = Paragraph.find_all('a') ## get all a tags in the paragraph

        for a_tag in a_tags: ## for each a tag

            Tagged_Entity = a_tag.get_text() ## get the word content in the tag
            Class = a_tag.get('class') ## get the class (BeautifulSoup returns a list of class names, or None if the attribute is absent)
            Start_Position = Start_Positions_Annotations_Book4[Count_Annotations] ## get the start position from the list created above; the index is equal to the number of annotations already detected
            End_Position = Start_Position+len(Tagged_Entity) ## the end position is the start position plus the length of the string

            ToposText_ID_about = a_tag.get('about') ## extract the content of "about"
            ToposText_ID_href = a_tag.get('href') ## extract the content of "href"

            if ToposText_ID_href : ## if the tag contains "href"
                ToposText_ID_href = "https://topostext.org"+ToposText_ID_href ## create the complete link of href

            f.writerow([Reference, Tagged_Entity, Class, Start_Position, End_Position, ToposText_ID_about, ToposText_ID_href])
            Count_Annotations += 1

print(Reference, Tagged_Entity, Class, Start_Position, End_Position, ToposText_ID_about, ToposText_ID_href) ## print the last annotation as an example

urn:cts:latinLit:phi0978.phi001:4.37.1 Asia None 945 949 None https://topostext.org/people/15213


In [5]:
Count_Annotations

1888

In [None]:
## read the csv file
ToposText_Book4 = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Python Scripts/1.1.ToposText_Annotations_Book_4_TEMP.csv", delimiter=",")
len(ToposText_Book4)

In [None]:
## replace missing values in ToposText ID with values from Temporary_ToposTextID_href
ToposText_Book4['ToposText ID'] = ToposText_Book4['ToposText ID'].fillna(ToposText_Book4['Temporary_ToposTextID_href'])

In [None]:
len(ToposText_Book4)

In [None]:
ToposText_Book4 = ToposText_Book4.drop(['Temporary_ToposTextID_href'], axis=1)

In [None]:
ToposText_Book4.to_csv("1.1.ToposText_Annotations_Book_4.csv")