# Preparation of the Corpus of Latin Funerary Inscriptions

The steps below clean the dataset produced by the scraping, so that the data are representative enough for the linguistic analysis. The cleaned dataset contains 172,958 inscriptions (11,139 rows fewer than the scraped one).

**Duplicates**

Inspecting the scraped dataset (184,097 rows) showed that 419 EDCS-IDs occur twice. All rows sharing an EDCS-ID also share the same text, so the duplicates were removed with pandas' `drop_duplicates`, keeping one row per EDCS-ID.
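The check-then-drop logic described above can be tried out on a toy DataFrame. This is only a minimal sketch: the three rows and their IDs are invented for the example, and only the column names 'EDCS-ID' and 'inscription' come from the dataset.

```python
import pandas as pd

# Toy data: invented rows, real column names.
toy = pd.DataFrame({
    "EDCS-ID": ["EDCS-00000001", "EDCS-00000001", "EDCS-00000002"],
    "inscription": ["D(is) M(anibus)", "D(is) M(anibus)", "Hic"],
})

# Show every row whose EDCS-ID occurs more than once (all copies, not just the later ones).
repeated = toy[toy.duplicated(subset="EDCS-ID", keep=False)]

# Number of EDCS-IDs whose copies disagree on the text.
conflicting = int(toy.groupby("EDCS-ID")["inscription"].nunique().gt(1).sum())

# Keep the first row for each EDCS-ID.
deduplicated = toy.drop_duplicates(subset="EDCS-ID")

print(len(repeated), conflicting, len(deduplicated))
```

Using `keep=False` makes it possible to look at both copies of a repeated ID side by side before deciding that dropping one of them is safe.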

**No text or text with fewer than three characters**

Next, the content of the 'inscription' column was inspected. 385 rows contain no text, and 4,660 rows have '?' as their text. Some inscriptions are so badly damaged that the interpretative reading of the text contains fewer than 3 characters (e.g., 'A'); 117 rows fall into this category. All these rows were removed from the dataset. Inscriptions containing 3 or more characters (e.g., 'Hic') are kept.
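The length filter in this notebook uses the string length, which counts every character, editorial signs included. As a hedged illustration of the difference (not the method applied here), the sketch below compares the string length with a count of alphabetic characters only. The helper `count_letters` and the sample strings are hypothetical.

```python
def count_letters(text):
    """Count alphabetic characters, ignoring brackets, slashes, spaces and similar signs."""
    return sum(1 for ch in text if ch.isalpha())

# Invented samples; '[A]' stands for a reading made up of one restored letter.
samples = ["A", "Hic", "[A]", "?"]

for text in samples:
    # Columns: the text, its string length, its number of letters.
    print(repr(text), len(text), count_letters(text))
```

With a length-based threshold of 3, a text such as '[A]' would be kept, while a letter-based threshold would drop it. Which behaviour is preferable depends on how the editorial signs are handled later in the analysis.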

**Language**

The EDCS also incorporates some inscriptions in other languages (Greek, Iberian, Punic, ...), e.g., EDCS-ID79500060. Since LatEpig does not include a language parameter, the dataset also contains bilingual texts and texts in languages other than Latin.

LatEpig does include a 'language' metadata field, which is described as: "Language of an inscription other than Latin, abbreviation for languages other than Latin, e.g. 'GR' for Greek, extracted from the inscription attribute. Latin is default value (empty means Latin), as provided by the EDCS".

Inspecting the output of the scraping shows that the language labels are used inconsistently. For instance, EDCS-ID62900013 has the metadata 'GR', while the Greek inscription EDCS-ID79500060 does not. This is because the scraping assigns a language label only when the text contains a language marker (e.g., 'GR', 'KELT'), which the EDCS editors use to indicate the presence of a section in another language. In the EDCS edition this label is part of the text (e.g., P(ubli) Antisti / Venusti // "GR"), while in the output of LatEpig it is split off and placed in the 'language' column.
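To make the splitting of text and marker concrete, here is a minimal sketch of how such a marker could be separated from the text. It is not LatEpig's actual implementation: the regular expression assumes that a marker is a short upper-case code in double quotes, and the function name `split_language_marker` is hypothetical.

```python
import re

# Assumed marker shape: 2 to 5 upper-case letters in double quotes, e.g. "GR" or "KELT".
MARKER = re.compile(r'"([A-Z]{2,5})"')

def split_language_marker(text):
    """Return the text without markers and the list of marker codes found in it."""
    codes = MARKER.findall(text)
    cleaned = MARKER.sub("", text).strip()
    return cleaned, codes

cleaned, codes = split_language_marker('P(ubli) Antisti / Venusti // "GR"')
print(cleaned)
print(codes)
```

An inscription that is entirely in another language but carries no such marker yields an empty list, which is why the 'language' column alone cannot be used to filter out non-Latin texts.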

To remove inscriptions in other languages, a list was created containing the EDCS-IDs of all funerary inscriptions in Greek (gr), Punic (pu), Iberian (ib), Hebrew (he), Etruscan (et), Lepontic (le), Oscan (os), Palmyrene (pa), Safaitic (sa), and Venetic (ve). The list was created by manually searching the EDCS database and parsing the saved HTML result pages with BeautifulSoup. The IDs were then checked against the dataset and the corresponding rows were filtered out.
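The extraction of the IDs from a result page can be sketched on a small inline HTML fragment. The fragment is invented: its `<b>EDCS-ID:</b>` layout is assumed from the extraction cell further down in this notebook, not copied from a real EDCS page, and the built-in `html.parser` is used here instead of `lxml` only to keep the example free of extra dependencies.

```python
from bs4 import BeautifulSoup

# Invented HTML fragment with two records; the 'province:' label is a placeholder.
html = """
<p><b>EDCS-ID:</b> EDCS-00000001<br><b>province:</b> Roma</p>
<p><b>EDCS-ID:</b> EDCS-00000002<br><b>province:</b> Roma</p>
"""

soup = BeautifulSoup(html, "html.parser")

ids = []
for b_tag in soup.find_all("b"):
    if b_tag.get_text() == "EDCS-ID:":
        # The text node right after the label holds the ID, preceded by a space.
        ids.append(str(b_tag.next_sibling).strip())

print(ids)
```

Calling `strip()` instead of slicing off the first character is slightly more tolerant of pages where the whitespace around the ID varies.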

In [1]:
import pandas as pd
from bs4 import BeautifulSoup

In [2]:
##open the file containing the dataset resulting from the scraping (184,097)
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales.csv")

In [3]:
##calculate the length of the dataset
Len_from_LatEpig = len(Inscriptions)
Len_from_LatEpig

184097

In [4]:
Inscriptions['EDCS-ID'].value_counts()

EDCS-47300081    2
EDCS-10801210    2
EDCS-10801207    2
EDCS-10801206    2
EDCS-10801205    2
                ..
EDCS-10501903    1
EDCS-10501904    1
EDCS-10501905    1
EDCS-10501906    1
EDCS-10700051    1
Name: EDCS-ID, Length: 183678, dtype: int64

In [5]:
##count the number of times the same EDCS-ID appears twice in the dataset
Value_Counts = Inscriptions['EDCS-ID'].value_counts()
Count_Twice = Value_Counts[Value_Counts == 2].count()
print(Count_Twice)

419


In [6]:
##group the rows by the "EDCS-ID" column
##calculate the number of unique values in the "inscription" column for each group
##compare the number of unique values to 1
##return a boolean series where True indicates groups with different inscription values
##count the EDCS-IDs whose rows have different inscription values
Inscriptions.groupby("EDCS-ID")["inscription"].nunique().gt(1).sum()

0

In [7]:
##drop the duplicates
Inscriptions.drop_duplicates(subset='EDCS-ID', inplace=True)

In [8]:
No_Duplicate_Len = len(Inscriptions)

In [9]:
##check the number of rows removed
No_Duplicate_Len == Len_from_LatEpig - Count_Twice

True

In [10]:
##count the number of rows that do not contain a text
NaN_count = Inscriptions['inscription'].isna().sum()
NaN_count

385

In [11]:
##drop rows with NaN values in 'inscription' column
Inscriptions = Inscriptions.dropna(subset=['inscription'])

In [12]:
No_NaN_len = len(Inscriptions)

In [13]:
##check the number of rows removed
No_NaN_len == No_Duplicate_Len - NaN_count

True

In [14]:
##inspect the inscription column
Inscriptions['inscription'].value_counts().head(5)

?                           4660
D(is) M(anibus)              208
D(is) M(anibus) / [          200
[D(is)] M(anibus) / [         74
D(is) M(anibus) s(acrum)      68
Name: inscription, dtype: int64

In [15]:
##drop rows with '?' value in 'inscription' column
Inscriptions = Inscriptions[Inscriptions['inscription'] != '?']

In [16]:
No_QuestionMark_len = len(Inscriptions)

In [17]:
##check the number of rows removed
No_QuestionMark_len == No_NaN_len - 4660

True

In [18]:
##count rows where length of 'inscription' is lower than 3
Smaller_than_Three = Inscriptions[Inscriptions['inscription'].str.len() < 3]
len(Smaller_than_Three)

117

In [19]:
##filter out rows where length of 'inscription' is lower than 3
Inscriptions = Inscriptions[~(Inscriptions['inscription'].str.len() < 3)]

In [20]:
No_less_than_three = len(Inscriptions)

In [21]:
##check the number of rows removed
No_less_than_three == No_QuestionMark_len - len(Smaller_than_Three)

True

In [22]:
##create a list containing the EDCS-IDs of the texts in other languages than Latin (6,045)
EDCS_ID_to_Remove = [] ##create a list

def extract_EDCS_IDs(filename): ##define the function
    path = '/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/EDCS_Latin_Funerary_Inscriptions/'+filename+".html"
    with open(path, encoding='utf-8') as html_file: ##open the page and close the file afterwards
        soup = BeautifulSoup(html_file, features="lxml") ##parse the page as soup
    b_tags = soup.find_all("b") ##find all the b_tags
    for b_tag in b_tags:
        if b_tag.get_text() == 'EDCS-ID:': ##get the EDCS-ID tag
            EDCS_ID = b_tag.next_sibling[1:]  ##get the EDCS-ID
            EDCS_ID_to_Remove.append(EDCS_ID)
            
filenames = ["EDCS_ve_titulisepulcrales_2", 'EDCS_et_titulisepulcrales_49', "EDCS_gr_titulisepulcrales_5944", "EDCS_he_titulisepulcrales_4", "EDCS_ib_titulisepulcrales_7", "EDCS_le_titulisepulcrales_4", "EDCS_os_titulisepulcrales_4", "EDCS_pu_titulisepulcrales_30"]
for filename in filenames:
    extract_EDCS_IDs(filename)
    
len(EDCS_ID_to_Remove)

6045

In [23]:
##remove the inscriptions in other languages
Inscriptions = Inscriptions[~Inscriptions["EDCS-ID"].isin(EDCS_ID_to_Remove)]
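Before filtering, it can be useful to check how many IDs of the removal list actually occur in the dataset, since an ID may be missing from the scrape or may already have been dropped in an earlier step. The sketch below shows one way to do this on invented toy data; the variable names are hypothetical and only the column name 'EDCS-ID' comes from the dataset.

```python
import pandas as pd

# Invented toy data.
toy = pd.DataFrame({"EDCS-ID": ["EDCS-00000001", "EDCS-00000002", "EDCS-00000003"]})
ids_to_remove = ["EDCS-00000002", "EDCS-00000009", "EDCS-00000002"]

# True for every row whose EDCS-ID is in the removal list.
mask = toy["EDCS-ID"].isin(ids_to_remove)
rows_removed = int(mask.sum())

# IDs in the removal list that never occur in the dataset.
unmatched = sorted(set(ids_to_remove) - set(toy["EDCS-ID"]))

filtered = toy[~mask]
print(rows_removed, unmatched, len(filtered))
```

In this toy case the list has three entries but only one row is removed: one ID is listed twice and one does not occur in the data.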

In [24]:
##reset the index (drop=True avoids storing the old index as an extra column)
Inscriptions.reset_index(drop=True, inplace=True)

In [25]:
##calculate the length of the resulting dataset (172,958)
len(Inscriptions)

172958

In [26]:
##total number of rows removed during the cleaning (11,139)
Len_from_LatEpig - len(Inscriptions)

11139
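The total can be broken down per cleaning step using the counts printed in the cells above. The number of rows removed by the language filter is not printed anywhere in the notebook; in the sketch below it is derived as the remainder, so it should be read as a consistency check rather than as a separately measured figure.

```python
# Row counts printed in the cells above.
start = 184_097
removed = {
    "duplicate EDCS-IDs": 419,
    "no text (NaN)": 385,
    "text is '?'": 4_660,
    "fewer than 3 characters": 117,
}
final = 172_958

# Rows left just before the language filter, and rows removed by that filter.
before_language_filter = start - sum(removed.values())
removed["other languages"] = before_language_filter - final

for step, n in removed.items():
    print(f"{step:<26}{n:>7,}")
print(f"{'total removed':<26}{sum(removed.values()):>7,}")
```

The derived figure for the language filter (5,558 rows) is lower than the 6,045 IDs in the removal list, which suggests that some listed IDs were not in the dataset at that point, for example because they had been removed in an earlier step or were never scraped.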

In [27]:
Inscriptions.to_csv('Tituli_Sepulcrales_new.csv', index=False)