# Tokenization in the Conservative and Interpretive Texts

Three different text versions are provided by the LatEpig scraper:
- 'inscription': D(is) M(anibus) [s(acrum)] / Dan[
- 'inscription_conservative_cleaning': D M Dan
- 'inscription_interpretive_cleaning': Dis Manibus sacrum Dan

The 'inscription' field contains modern integrations (e.g., [s(acrum)]), resolved abbreviations (e.g., D(is)), line divisions (/), and blanks within a line (e.g., [3]). For the complete list of special characters used by EDCS, see: https://db.edcs.eu/epigr/hinweise/hinweis-en.html. The interpretive cleaning keeps the integrations, the resolved abbreviations, and the inserted missing letters, but without the special characters. The conservative cleaning, by contrast, contains no modern integrations and leaves the abbreviations unresolved. Neither the conservative nor the interpretive cleaning indicates the presence of blanks within a line.
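The relationship between the three versions can be illustrated with a few regular expressions. This is only a minimal sketch under stated assumptions: the function names `conservative_clean` and `interpretive_clean` are hypothetical, the rules cover only the four kinds of markup mentioned above, and this is not the code the LatEpig scraper actually runs.

```python
import re

def conservative_clean(raw: str) -> str:
    """Keep only what is on the stone (hypothetical helper, not LatEpig's code)."""
    text = re.sub(r"\[[^\]]*\]", " ", raw)   # drop closed [...] spans: integrations and [3]-style blanks
    text = re.sub(r"\([^)]*\)", "", text)    # drop resolved abbreviations such as (is)
    text = re.sub(r"[\[\]/]", " ", text)     # drop line divisions and unclosed brackets
    return " ".join(text.split())            # collapse runs of whitespace

def interpretive_clean(raw: str) -> str:
    """Keep integrations and resolutions, drop the markup (hypothetical helper)."""
    text = re.sub(r"\[\d+\]", " ", raw)      # drop [3]-style blank markers
    text = re.sub(r"[\[\]()]", "", text)     # remove brackets but keep their content
    text = text.replace("/", " ")            # drop line divisions
    return " ".join(text.split())            # collapse runs of whitespace

raw_example = "D(is) M(anibus) [s(acrum)] / Dan["
conservative_example = conservative_clean(raw_example)
interpretive_example = interpretive_clean(raw_example)
print(conservative_example)
print(interpretive_example)
```

Applied to the sample inscription, the two helpers reproduce the conservative and interpretive strings listed above; other EDCS special characters would need additional rules.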

Inscriptions do not contain punctuation.

The dataset contains 172,958 Latin inscriptions. The **conservative cleaning** texts (that is, the tokens actually present on the stone) contain **1,943,890 tokens**, of which 137,341 are unique, with a mean of 11.2 tokens per inscription. The **interpretive texts** contain **2,007,668 tokens** and 115,697 unique words, with a mean of 11.6 words per inscription.
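As a quick cross-check, the two means follow directly from the totals quoted above, since each mean is the total token count divided by the number of inscriptions. The variable names below are illustrative only.

```python
# Figures quoted in the paragraph above
n_inscriptions = 172_958
conservative_tokens = 1_943_890
interpretive_tokens = 2_007_668

# Mean tokens per inscription = total tokens / number of inscriptions
conservative_mean = conservative_tokens / n_inscriptions
interpretive_mean = interpretive_tokens / n_inscriptions

# Tokens that appear only after abbreviations are resolved and integrations added
extra_tokens = interpretive_tokens - conservative_tokens

print(round(conservative_mean, 1), round(interpretive_mean, 1), extra_tokens)
```

The difference between the two totals gives a rough sense of how much text the interpretive version adds on top of what is on the stone.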

Note that the frequency of the letter 'M', which abbreviates several different words (_Manibus, Marcus, merenti_), sums its occurrences across all of these contexts. The letter can also be a stray letter, preceded and/or followed by blank space on the stone.
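One way to see how a single count mixes several uses is to tally the token that precedes each 'M'. The sketch below uses three invented toy strings (they are not taken from the dataset) and plain whitespace splitting instead of NLTK; the names `toy_inscriptions` and `previous_token` are hypothetical.

```python
from collections import Counter

# Invented toy examples, NOT rows from the dataset
toy_inscriptions = [
    "D M IULIA",          # M after D, as in D(is) M(anibus)
    "M IULIUS M F",       # M as a praenomen and in a filiation
    "CONIUGI B M FECIT",  # M after B, as in b(ene) m(erenti)
]

previous_token = Counter()
for text in toy_inscriptions:
    tokens = text.upper().split()
    for position, token in enumerate(tokens):
        if token == "M":
            # Use a placeholder when M is the first token of the inscription
            before = tokens[position - 1] if position > 0 else "<START>"
            previous_token[before] += 1

total_m = sum(previous_token.values())
print(total_m, previous_token)
```

All four occurrences fall under the same 'M' entry in a plain frequency list, although each follows a different token.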

The 10 most common tokens in the conservative corpus are single letters and 'et' (M: 97,537, D: 67,735, S: 53,357, ET: 50,166, ...). This is because funerary texts consist largely of abbreviations.

In [1]:
import pandas as pd
from collections import Counter
from nltk.tokenize import word_tokenize ##word_tokenize needs NLTK's Punkt tokenizer data to be downloaded

In [2]:
##open the dataset of funerary inscriptions (172,958 rows)
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales_new.csv")

In [3]:
len(Inscriptions)

172958

# Conservative cleaning: what's on the stone

In [4]:
##create a list of all the tokens (uppercased)
list_of_tokens_upper = []

for i,inscription in enumerate(Inscriptions['inscription_conservative_cleaning']):
    inscription = str(inscription)
    tokenized_inscription = word_tokenize(inscription) ##tokenize the inscription with NLTK
    
    for token in tokenized_inscription:
        token = token.upper()
        list_of_tokens_upper.append(token)

In [5]:
##number of tokens in the conservative texts (1,943,890)
len(list_of_tokens_upper)

1943890

In [6]:
counter_upper = Counter(list_of_tokens_upper)

##number of unique upper tokens (137,341)
len(counter_upper)

137341

In [7]:
##get the 100 most common tokens
most_frequent_tokens_upper = counter_upper.most_common(100)
most_frequent_tokens_upper

[('M', 97537),
 ('D', 67735),
 ('S', 53357),
 ('ET', 50166),
 ('L', 43968),
 ('F', 35649),
 ('P', 32913),
 ('V', 31920),
 ('C', 30192),
 ('A', 28655),
 ('IN', 27158),
 ('VIXIT', 24928),
 ('H', 23359),
 ('E', 21697),
 ('T', 20129),
 ('AN', 18823),
 ('ANN', 15706),
 ('FECIT', 14976),
 ('Q', 14566),
 ('VIX', 13457),
 ('ANNIS', 12792),
 ('B', 11175),
 ('SIBI', 11090),
 ('BENE', 10215),
 ('PACE', 10201),
 ('QUI', 9744),
 ('I', 9222),
 ('QUE', 8997),
 ('CONIUGI', 8459),
 ('MERENTI', 8195),
 ('VI', 7193),
 ('X', 7162),
 ('N', 7054),
 ('III', 6966),
 ('II', 6211),
 ('FILIO', 5740),
 ('DIS', 5475),
 ('O', 5329),
 ('XX', 5253),
 ('SUIS', 5244),
 ('LIB', 5048),
 ('IIII', 5033),
 ('AUG', 5005),
 ('FIL', 4836),
 ('HIC', 4834),
 ('VII', 4795),
 ('XII', 4245),
 ('VIII', 4192),
 ('ANNOS', 4171),
 ('XXX', 4099),
 ('SUO', 3976),
 ('XI', 3825),
 ('IULIUS', 3788),
 ('XXV', 3779),
 ('US', 3668),
 ('MANIBUS', 3633),
 ('QUAE', 3609),
 ('EST', 3601),
 ('XV', 3596),
 ('LEG', 3501),
 ('CUM', 3424),
 ('KAL', 341

In [8]:
##number of tokens per inscription in conservative texts (mean): 11.2
import statistics

tokens_per_sentence = []

for i,inscription in enumerate(Inscriptions['inscription_conservative_cleaning']):    
    inscription = str(inscription) ##convert to string
    tokenized_inscription = word_tokenize(inscription) ##tokenize the inscription with NLTK
    
    sum_of_tokens = len(tokenized_inscription) ##calculate the length of the list of tokens
    tokens_per_sentence.append(sum_of_tokens) ##append the length to tokens_per_sentence
    
mean = statistics.mean(tokens_per_sentence)
mean

11.23908694596376

# Interpretive cleaning: get some meaning out of it

In [9]:
##create a list of all the words in the interpretive texts
list_of_words = []

for i,inscription in enumerate(Inscriptions['inscription_interpretive_cleaning']):
    inscription = str(inscription)
    tokenized_inscription = word_tokenize(inscription) ##tokenize the inscription with NLTK
    for word in tokenized_inscription:
        word = word.lower()
        list_of_words.append(word)

In [10]:
##word count in the interpretive texts (2,007,668)
len(list_of_words)

2007668

In [11]:
counter_words = Counter(list_of_words)

##number of unique words in interpretive texts (115,697)
len(counter_words)

115697

In [12]:
##number of words per inscription in interpretive texts (mean): 11.6
import statistics

tokens_per_sentence = []

for i,inscription in enumerate(Inscriptions['inscription_interpretive_cleaning']):    
    inscription = str(inscription) ##convert to string
    tokenized_inscription = word_tokenize(inscription) ##tokenize the inscription with NLTK
    
    sum_of_tokens = len(tokenized_inscription) ##calculate the length of the list of tokens
    tokens_per_sentence.append(sum_of_tokens) ##append the length to tokens_per_sentence
    
mean = statistics.mean(tokens_per_sentence)
mean

11.607835428254258
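The cells above tokenize each column twice, once for the token list and once for the per-inscription lengths. A single pass can collect both. The sketch below is an assumption-laden alternative, not the notebook's method: `token_stats` is a hypothetical helper, it splits on whitespace instead of calling NLTK's `word_tokenize` (which should be close for text without punctuation, but may not give identical counts), and it skips missing values, which `str()` would otherwise turn into the token 'nan'.

```python
import math
import statistics
from collections import Counter

def token_stats(texts, normalize=str.upper):
    """Return (token counts, tokens per text) in one pass (hypothetical helper)."""
    counts = Counter()
    lengths = []
    for text in texts:
        # Treat None and NaN as empty instead of tokenizing the string 'nan'
        if text is None or (isinstance(text, float) and math.isnan(text)):
            tokens = []
        else:
            tokens = [normalize(token) for token in str(text).split()]
        counts.update(tokens)
        lengths.append(len(tokens))
    return counts, lengths

# Small illustrative input; with the real data one would pass a column such as
# Inscriptions['inscription_conservative_cleaning']
counts, lengths = token_stats(["D M Dan", "D M", float("nan")])
total_tokens = sum(lengths)             # total number of tokens
unique_tokens = len(counts)             # number of distinct tokens
mean_tokens = statistics.mean(lengths)  # mean tokens per inscription
print(total_tokens, unique_tokens, mean_tokens)
```

Passing `normalize=str.lower` would mirror the lowercasing used for the interpretive texts. Note that the empty entry still counts towards the mean here, so the figures would differ from the ones above if the dataset contains missing values.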