# Preparation of the Corpus of Latin Funerary Inscriptions

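Inspecting the resulting dataset (184,097 rows) showed that 419 EDCS-IDs occur twice. Rows sharing an EDCS-ID were assumed to carry the same text. The check in cell [5] prints False, but this alone does not disprove the assumption: `nunique()` ignores missing values, so it returns 0 for any EDCS-ID whose text is empty. Duplicates were removed with `drop_duplicates`, keeping the first row for each EDCS-ID.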

Next, the content of the 'inscription' column was inspected. 385 rows contain no text at all, and 4,660 rows contain only a '?' as text. In addition, some inscriptions are so badly damaged that the interpretative reading of the text is shorter than 3 characters (e.g., 'A'); this applies to 117 rows. All these rows were removed from the dataset.

The new dataset contains 178,516 inscriptions.

In [1]:
import pandas as pd

In [2]:
##open the file containing the dataset resulting from the scraping
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales.csv")

In [3]:
##calculate the length of the dataset
Len_from_LatEpig = len(Inscriptions)
Len_from_LatEpig

184097

In [4]:
##count the EDCS-IDs that appear exactly twice in the dataset
Value_Counts = Inscriptions['EDCS-ID'].value_counts()
Count_Twice = Value_Counts[Value_Counts == 2].count()
print(Count_Twice)

419
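
The count above covers EDCS-IDs that occur exactly twice. It equals the number of rows that `drop_duplicates` removes only if no ID occurs three or more times, which the check in cell [8] confirms for this dataset. A minimal sketch on hypothetical toy data (the values below are invented for illustration) shows how the two quantities can differ and how `duplicated` counts the surplus rows directly:

```python
import pandas as pd

# Toy data standing in for the scraped table (hypothetical values).
toy = pd.DataFrame({
    "EDCS-ID": ["EDCS-1", "EDCS-1", "EDCS-2", "EDCS-3", "EDCS-3", "EDCS-3"],
    "inscription": ["a", "a", "b", "c", "c", "c"],
})

# IDs that occur exactly twice (the quantity counted in the cell above).
counts = toy["EDCS-ID"].value_counts()
ids_exactly_twice = int((counts == 2).sum())

# Rows that drop_duplicates(subset='EDCS-ID') would remove:
# every occurrence of an ID after its first one.
surplus_rows = int(toy["EDCS-ID"].duplicated(keep="first").sum())

print(ids_exactly_twice, surplus_rows)
```

In the toy table one ID occurs twice and another three times, so the two numbers diverge; on the real dataset they coincide.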


In [5]:
##check if rows with the same EDCS-ID also have the same value under the column "inscription"
##group the rows by 'EDCS-ID' and count unique values in the 'inscription' column
##(nunique() ignores NaN, so an EDCS-ID whose text is empty yields 0)
grouped = Inscriptions.groupby('EDCS-ID')['inscription'].nunique()
##check if all groups have a count of 1
all_same_inscription = (grouped == 1).all()
print("All rows with the same EDCS-ID have the same value under 'inscription':", all_same_inscription)

All rows with the same EDCS-ID have the same value under 'inscription': False
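
The `False` above is not necessarily caused by conflicting texts: `nunique()` skips missing values by default, so any EDCS-ID with an empty 'inscription' gives 0 rather than 1. The following sketch, on hypothetical toy data, shows one possible way to restrict the check to IDs that actually occur more than once and to count NaN as a value of its own. The name `conflicting_ids` is introduced here for illustration and is not part of the original notebook.

```python
import pandas as pd

# Toy data (hypothetical): EDCS-1 is a clean duplicate, EDCS-2 has no text,
# EDCS-3 is a duplicate with two different readings.
toy = pd.DataFrame({
    "EDCS-ID": ["EDCS-1", "EDCS-1", "EDCS-2", "EDCS-3", "EDCS-3"],
    "inscription": ["D(is) M(anibus)", "D(is) M(anibus)", None, "A", "B"],
})

# Default behaviour: missing values are ignored, so EDCS-2 yields 0.
per_id_default = toy.groupby("EDCS-ID")["inscription"].nunique()

# Keep only rows whose ID occurs more than once, then count distinct
# texts per ID, treating NaN as a distinct value (dropna=False).
dup_rows = toy[toy["EDCS-ID"].duplicated(keep=False)]
per_id = dup_rows.groupby("EDCS-ID")["inscription"].nunique(dropna=False)

# IDs whose duplicated rows disagree on the text.
conflicting_ids = per_id[per_id > 1].index.tolist()
print(conflicting_ids)
```

An empty `conflicting_ids` on the real dataset would support the assumption that duplicated rows carry the same text; a non-empty list would point to the IDs worth inspecting by hand.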


In [6]:
##drop the duplicates
Inscriptions.drop_duplicates(subset='EDCS-ID', inplace=True)
Inscriptions.reset_index(drop=True, inplace=True)

In [7]:
No_Duplicate_Len = len(Inscriptions)

In [8]:
##check the number of rows removed
No_Duplicate_Len == Len_from_LatEpig - Count_Twice

True

In [9]:
##count the number of rows that do not contain a text
NaN_count = Inscriptions['inscription'].isna().sum()
NaN_count

385

In [10]:
##drop rows with NaN values in 'inscription' column
Inscriptions = Inscriptions.dropna(subset=['inscription'])

In [11]:
No_NaN_len = len(Inscriptions)

In [12]:
##check the number of rows removed
No_NaN_len == No_Duplicate_Len - NaN_count

True

In [13]:
##inspect the inscription column
Inscriptions['inscription'].value_counts().head(5)

?                           4660
D(is) M(anibus)              208
D(is) M(anibus) / [          200
[D(is)] M(anibus) / [         74
D(is) M(anibus) s(acrum)      68
Name: inscription, dtype: int64

In [14]:
##drop rows with '?' value in 'inscription' column
Inscriptions = Inscriptions[Inscriptions['inscription'] != '?']

In [15]:
No_QuestionMark_len = len(Inscriptions)

In [16]:
##check the number of rows removed
No_QuestionMark_len == No_NaN_len - 4660

True

In [17]:
##count rows where the length of 'inscription' is lower than 3
Smaller_than_Three = Inscriptions[Inscriptions['inscription'].str.len() < 3]
len(Smaller_than_Three)

117

In [18]:
##filter out rows where the length of 'inscription' is lower than 3
Inscriptions = Inscriptions[~(Inscriptions['inscription'].str.len() < 3)]

In [19]:
No_less_than_three = len(Inscriptions)

In [20]:
##check the number of rows removed
No_less_than_three == No_QuestionMark_len - len(Smaller_than_Three)

True

In [21]:
No_less_than_three

178516

In [22]:
##total number of rows removed in the cleaning
Len_from_LatEpig - No_less_than_three

5581
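
The four filters above can also be gathered into a single function that reports how many rows each step removes. This is only a sketch: the function `clean_inscriptions`, its parameters and the toy data are hypothetical and do not appear in the original notebook.

```python
import pandas as pd

def clean_inscriptions(df, id_col="EDCS-ID", text_col="inscription", min_len=3):
    """Apply the four filters in order and record the rows removed by each."""
    steps = [
        # keep the first row for each ID
        ("duplicates", lambda d: d.drop_duplicates(subset=id_col)),
        # drop rows without any text
        ("empty", lambda d: d.dropna(subset=[text_col])),
        # drop rows whose text is just '?'
        ("question_mark", lambda d: d[d[text_col] != "?"]),
        # drop rows whose text is shorter than min_len characters
        ("too_short", lambda d: d[d[text_col].str.len() >= min_len]),
    ]
    removed = {}
    for name, step in steps:
        before = len(df)
        df = step(df)
        removed[name] = before - len(df)
    return df.reset_index(drop=True), removed

# Toy data (hypothetical) with one row hit by each filter.
toy = pd.DataFrame({
    "EDCS-ID": ["EDCS-1", "EDCS-1", "EDCS-2", "EDCS-3", "EDCS-4", "EDCS-5"],
    "inscription": ["D(is) M(anibus)", "D(is) M(anibus)", None, "?", "A",
                    "D(is) M(anibus) s(acrum)"],
})
cleaned, removed = clean_inscriptions(toy)
print(removed, len(cleaned))
```

For the figures reported in this notebook, the per-step counts would be 419, 385, 4,660 and 117, which sum to the 5,581 rows removed in total.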

In [23]:
Inscriptions.to_csv('Tituli_Sepulcrales_new.csv', index=False)