# Find the Latin passage Book 4

This notebook contains the code to find the attestation of a place name in the Latin text of Pliny the Elder's Naturalis Historia (NH). It makes it possible to reach the Latin text of the NH starting from the English translation on ToposText. The Latin edition of the NH used here is the one available on LacusCurtius (https://penelope.uchicago.edu/Thayer/E/Roman/Texts/Pliny_the_Elder/home.html).

Starting from a place name attestation in the CSV file generated from ToposText (e.g., '4.24.2 Apollonia'), the corresponding Latin book and chapter are extracted from LacusCurtius. Each paragraph (p_tag) in the Latin chapter (a_tag) is split into sentences (using NLTK) and each sentence into words. The place name (e.g., 'Apollonia') is then compared with the words starting with a capital letter in the text of the paragraph. The first three letters of the place name and of the target word, that is, the Latin equivalent of the place name, must match exactly. The similarity between the two strings (a score from 0 to 100 based on the Levenshtein distance) is calculated with the token_set_ratio function of the Python library FuzzyWuzzy. A match is accepted when the score is at least 60, a threshold that was determined empirically. The code prints the detected target word (e.g., 'Apolloniam') and the score of the match.
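
The matching step described above can be sketched in isolation. This is a minimal sketch, not the notebook's own code: the function name find_latin_candidates is hypothetical, and difflib.SequenceMatcher from the standard library is used as a stand-in for FuzzyWuzzy so that the example runs without extra packages. Its scores are not guaranteed to equal those of fuzz.token_set_ratio.

```python
import string
from difflib import SequenceMatcher


def find_latin_candidates(place_name, sentence, threshold=60):
    """Return (word, score) pairs for words that may be the Latin form of place_name.

    Hypothetical helper: difflib stands in for FuzzyWuzzy here.
    """
    # Remove punctuation, as the notebook does before splitting into words
    cleaned = sentence.translate(str.maketrans('', '', string.punctuation))
    matches = []
    for word in cleaned.split():
        # The first three letters must match exactly (case-sensitive)
        if word[:3] != place_name[:3]:
            continue
        # Similarity on a 0-100 scale, computed on lowercased strings
        ratio = SequenceMatcher(None, place_name.lower(), word.lower()).ratio()
        score = round(100 * ratio)
        if score >= threshold:
            matches.append((word, score))
    return matches


print(find_latin_candidates("Apollonia", "ab ostio Ponti Apolloniam CLXXXVII, Callatim tantundem"))
# [('Apolloniam', 95)]
```

The prefix check is a cheap filter that discards most words before any similarity score is computed; the threshold then removes words that merely share the same beginning.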

Note that NLTK is used to split the paragraphs into sentences because the plain split function caused problems (e.g., it broke the name 'M. Paulus' in two at the abbreviation).
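
The problem with a plain split can be reproduced on a short sample. The text below is a shortened, illustrative sample rather than a quotation from the edition, and the regular expression is only a simplified stand-in that shows why abbreviations need special handling. It is not how NLTK's tokenizer works internally.

```python
import re

# Illustrative sample containing a praenomen abbreviation ("M.")
text = "M. Varro ad hunc modum metitur. Ab ostio Ponti Apolloniam."

# Naive approach: every ". " is treated as a sentence boundary,
# so the abbreviation "M." is cut off from the name that follows it
naive = text.split(". ")
print(naive)
# ['M', 'Varro ad hunc modum metitur', 'Ab ostio Ponti Apolloniam.']

# Simplified guard (assumption): split on whitespace only when it follows
# a lowercase letter and a period, so single capital initials are kept
guarded = re.split(r"(?<=[a-z]\.)\s+", text)
print(guarded)
# ['M. Varro ad hunc modum metitur.', 'Ab ostio Ponti Apolloniam.']
```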

Note also that critical editions of Pliny use two different levels of chapter subdivision. To switch from the English text (ToposText) to the Latin text (LacusCurtius), it was necessary to create a table of correspondences manually (3.1.Table_(Chapter)_Chapter_Paragraph). For each book, the table lists the first level of chapter subdivision (given in round brackets), the second level of chapter subdivision (not in round brackets) and the paragraphs.
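
A lookup in such a table can be sketched as follows. The column names and the semicolon delimiter are taken from the code in this notebook, but the rows in SAMPLE are made up for illustration and the function name paragraphs_for is hypothetical. The sketch uses the csv module from the standard library instead of pandas so that it runs on its own.

```python
import csv
import io

# Illustrative rows only (assumption): they are NOT taken from the real table
SAMPLE = """Book;(Chapter);Chapter;Paragraph
4;24;12;78
4;24;12;79
4;25;12;80
"""


def paragraphs_for(book, chapter_in_brackets, table_text=SAMPLE):
    """Return the sorted, de-duplicated paragraph numbers for Book.(Chapter)."""
    rows = csv.DictReader(io.StringIO(table_text), delimiter=";")
    return sorted({
        int(row["Paragraph"])
        for row in rows
        if int(row["Book"]) == book and int(row["(Chapter)"]) == chapter_in_brackets
    })


print(paragraphs_for(4, 24))
# [78, 79]
```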

In [1]:
import webbrowser
import requests
from bs4 import BeautifulSoup
import pandas as pd
from roman_arabic_numerals import conv
from fuzzywuzzy import fuzz
import re
import numpy as np
import string
import nltk
from nltk.tokenize import sent_tokenize



In [2]:
Index_LacusCurtius=requests.get("http://penelope.uchicago.edu/Thayer/e/roman/texts/pliny_the_elder/home.html") ## get the index from LacusCurtius
Index_LacusCurtius_soup=BeautifulSoup(Index_LacusCurtius.content, 'html.parser') ## get the soup of the index
base_url="https://penelope.uchicago.edu/Thayer/" ## create a base url

Book=input() ## select the book (book number i.e., '4')

a_tag_Book = Index_LacusCurtius_soup.find('a', {'name': Book}) ## find the book in the index
href=a_tag_Book.get("href") ## get the href of the book in the index
full_url=base_url+href.strip() ## get the full link of the book

LacusCurtius_book=requests.get(full_url) ## open the book in LacusCurtius
LacusCurtius_book_soup=BeautifulSoup(LacusCurtius_book.content, 'html.parser') ## get the soup of the book (explicit parser, as for the index)

Chapter=input() ## select the chapter
Chapter=int(Chapter) ## convert to integer

## get the table of correspondences
Table_of_Correspondences=pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/KU Leuven/PhD project 'Greek Spaces in Roman Times'/Data_Extraction/Outputs/3.1.Table_(Chapter)_Chapter_Paragraph.csv", delimiter=";")

## filter the rows with the book and chapter numbers of the inputs
Select_Rows=Table_of_Correspondences[(Table_of_Correspondences["Book"]==int(Book)) & (Table_of_Correspondences["(Chapter)"]==(Chapter))]

List_of_Paragraphs=[] ## create a list
Paragraphs=Select_Rows['Paragraph'].tolist() ## append all the paragraphs to the list
Paragraphs=np.unique(Paragraphs) ## drop duplicates
for Paragraph in Paragraphs:
    Paragraph=int(Paragraph) ## convert to integer
    List_of_Paragraphs.append(Paragraph)

List_of_Chapters_no_brack=[] ## create a list
Chapters_no_brack=Select_Rows['Chapter'].tolist() ## append all the chapters not in round brackets
Chapters_no_brack=np.unique(Chapters_no_brack) ## drop duplicates
for Chapter_no_brack in Chapters_no_brack:
    Book_Chapter_no_brack=Book+"."+str(Chapter_no_brack) ## create the string book.chapter_no_brackets (str() guards against numeric chapter values)
    List_of_Chapters_no_brack.append(Book_Chapter_no_brack)

    
## the books of the NH can be split into two groups according to the different chapter subdivisions in the LacusCurtius version
group_1=["2", "5", "6", "8", "9", "10", "11"]
group_2=["3", "4", "7"]

Place_Name = "Apollonia" ## English place name

if Book in group_1:

    Chapter_Roman=str(conv.arab_rom(Chapter).lower()) ## convert the chapter number into a Roman number
    a_tag_Chapter = LacusCurtius_book_soup.find('a', {'id': Chapter_Roman}) ## find the chapter in the book
    
    if a_tag_Chapter: 
        
        Succ_Chapter=Chapter+1 ## determine the number of the next chapter
        Succ_Chapter_Roman=str(conv.arab_rom(Succ_Chapter).lower()) ## convert the number into a Roman number
        end_tag=LacusCurtius_book_soup.find('a', {'id': Succ_Chapter_Roman}) ## find the next chapter
        end_of_Book=LacusCurtius_book_soup.find('a', {'id': "thanks"}) ## find the end of the book

        tag_p_count=-1 ## index of the current p_tag within the chapter (-1 means none found yet)

        for tag in a_tag_Chapter.find_all_next(): ## for all the next tags
            if tag==end_tag: ## if the next tag is the next chapter
                break ## break
            if tag==end_of_Book: ## if the next tag is the end of the book
                break ## break
            if tag.name=="p": ## if the next tag is a p_tag (paragraph)
                tag_p_count=tag_p_count+1 ## add +1 to the count of the p_tags
                Text=tag.text ## get the text of the p_tag
        
                Sentences = sent_tokenize(Text) ## split the text into sentences using NLTK
                for i,Sentence in enumerate(Sentences): ## for each sentence
                    Sentence=Sentence.translate(str.maketrans('', '', string.punctuation)) #remove the punctuation from the sentence
                    Words=Sentence.split() ## split the sentence into words
                    for Word in Words: ## for each word
                        if Word[0:3]==Place_Name[0:3]: #compare the beginning of the place name
                            Score=fuzz.token_set_ratio(Place_Name,Word) ## get the fuzzy score of the distance between the place name and the word
                            if Score>=60: ## if the matching score is at least 60
                                Target_Word=Word ## get the target word
                                print(Place_Name, 'matched', Target_Word, 'with score', Score)
                                
                                ## determine the number of the paragraph and the number of the sentence
                                if tag_p_count in range(len(List_of_Paragraphs)):
                                    print(str(Book)+".("+str(Chapter)+")."+str(List_of_Paragraphs[tag_p_count])+"."+str(i)+":"+Sentence.strip())
                                else: print(str(Book)+".("+str(Chapter)+").CHECKTHEPARAGRAPH."+str(i)+":"+Sentence.strip())
                                    
        ## if no p_tag was detected
        if tag_p_count==-1:
    
            ## check for a_tag containing the text of the chapter
            for Paragraph in List_of_Paragraphs:
                a_tag_Paragraph=LacusCurtius_book_soup.find("a", {"class": "chapter", "name": Paragraph})
                if a_tag_Paragraph is None: ## skip paragraphs whose anchor is missing from the page
                    continue
        
                all_a_tags=LacusCurtius_book_soup.find_all("a")
                index1=all_a_tags.index(a_tag_Chapter)
                index2=all_a_tags.index(a_tag_Paragraph)
        
                #the paragraph follows the chapter number
                if index1<index2:
            
                    Text=a_tag_Paragraph.find_next_sibling(text=True).strip()
                    Sentences = sent_tokenize(Text)
                    for i,Sentence in enumerate(Sentences):
                        #remove the punctuation from the phrase
                        Sentence=Sentence.translate(str.maketrans('', '', string.punctuation))
                        #split the phrase in words
                        Words=Sentence.split()
                        for Word in Words:
                            #compare the beginning of the target word with each word
                            if Word[0:3]==Place_Name[0:3]:
                                Score=fuzz.token_set_ratio(Place_Name,Word)
                                if Score>=60:
                                    Target_Word=Word
                                    print(Place_Name, 'matched', Target_Word, 'with score', Score)
                                    print(str(Book)+".("+str(Chapter)+")."+str(Paragraph)+"."+str(i)+":"+Sentence.strip())
    
                #the paragraph begins before the chapter number
                elif index1>index2:
            
                    Text=a_tag_Chapter.find_next_sibling(text=True).strip()
                
                    Sentences = sent_tokenize(Text)
                    for i,Sentence in enumerate(Sentences):
                        #remove the punctuation from the phrase
                        Sentence=Sentence.translate(str.maketrans('', '', string.punctuation))
                        #split the phrase in words
                        Words=Sentence.split()
                        for Word in Words:
                            #compare the beginning of the target word with each word
                            if Word[0:3]==Place_Name[0:3]:
                                Score=fuzz.token_set_ratio(Place_Name,Word)
                                if Score>=60:
                                    Target_Word=Word
                                    print(Place_Name, 'matched', Target_Word, 'with score', Score)
                                    print(str(Book)+".("+str(Chapter)+")."+str(Paragraph)+"."+str(i)+":"+Sentence.strip())

       
    else: print("No Chapter")
    
elif Book in group_2:
    
    for Paragraph in List_of_Paragraphs: ## for all the paragraphs in the chapter
        a_tag_Paragraph=LacusCurtius_book_soup.find("a", {"name": Paragraph}) ## get the tags of the paragraphs
        if a_tag_Paragraph is None: ## skip paragraphs whose anchor is missing from the page
            continue
        
        ## extract the text from the next p_tag by td_parents
        td_Paragraph_parent_tag=a_tag_Paragraph.parent
        td_p_parent_tag=td_Paragraph_parent_tag.find_next_sibling("td")
        for p_tag in td_p_parent_tag.find_all("p"):
            Text=p_tag.text.strip()
        
            Sentences = sent_tokenize(Text) 
            for i,Sentence in enumerate(Sentences):
                Sentence=Sentence.translate(str.maketrans('', '', string.punctuation))
                Words=Sentence.split()
                for Word in Words:
                    if Word[0:3]==Place_Name[0:3]:
                        Score=fuzz.token_set_ratio(Place_Name,Word)
                        if Score>=60:
                            Target_Word=Word
                            print(Place_Name, 'matched', Target_Word, 'with score', Score)
                            print(str(Book)+".("+str(Chapter)+")."+str(Paragraph)+"."+str(i)+": "+Sentence.strip())        

4
24
Apollonia matched Apolloniam with score 95
4.(24).78.0: M Varro ad hunc modum metitur ab ostio Ponti Apolloniam CLXXXVII·D p Callatim tantundem ad ostium Histri CXXV ad Borysthenem CCL Cherronesum Heracleotarum oppidum CCCLXXV p ad Panticapaeum quod aliqui Bosporum vocant extremum in Europae ora CCXII·D quae summa efficit XIII·XXXVII·D
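
The reference printed in the output has the form Book.(Chapter).Paragraph.Sentence, where the sentence number is the zero-based index of the sentence within its paragraph. The helper below is a hypothetical sketch (the name format_reference does not appear in the notebook) of how such a reference can be built in one place. It includes the CHECKTHEPARAGRAPH fallback that the code uses when a p_tag has no corresponding entry in the list of paragraphs.

```python
def format_reference(book, chapter, paragraphs, p_index, sentence_index, sentence):
    """Build 'Book.(Chapter).Paragraph.Sentence:text' for a matched sentence.

    Hypothetical helper: paragraphs is the list of paragraph numbers of the
    chapter, p_index the position of the current p_tag within the chapter.
    """
    if 0 <= p_index < len(paragraphs):
        paragraph = str(paragraphs[p_index])
    else:
        # No paragraph number is known for this p_tag: flag it for a manual check
        paragraph = "CHECKTHEPARAGRAPH"
    return f"{book}.({chapter}).{paragraph}.{sentence_index}:{sentence.strip()}"


print(format_reference("4", 24, [78], 0, 0, " M Varro ad hunc modum metitur "))
# 4.(24).78.0:M Varro ad hunc modum metitur
```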
