# Extraction of Sequences Repeated at least Twice in the Corpus

In this notebook, sequences of words occurring at least twice in the corpus are extracted.

The extraction starts by finding the length of the longest repeated sequence of tokens. To bound the search, the length of the longest inscription is computed first. It is 1,345 tokens, which is therefore the upper limit on the length of any repeated sequence in the corpus.

Then, a loop iterates over values of i, starting from 1,345 and going down to 1. For each i, the n-grams of length i are generated for each inscription using the NLTK ngrams function.
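As an illustration of what the n-gram step produces, here is a minimal sketch on a toy inscription. The helper `sliding_ngrams` is a hypothetical pure-Python stand-in, written so the example runs without NLTK; the notebook itself uses `nltk.util.ngrams`, which yields the same tuples lazily.

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple.

    Hypothetical stand-in for nltk.util.ngrams, for illustration only.
    """
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]


# Toy inscription (not taken from the dataset)
tokens = ["dis", "manibus", "sacrum", "vixit", "annos"]

trigrams = sliding_ngrams(tokens, 3)
print(trigrams)
# [('dis', 'manibus', 'sacrum'), ('manibus', 'sacrum', 'vixit'), ('sacrum', 'vixit', 'annos')]

# An inscription shorter than n yields no n-grams at all
print(sliding_ngrams(tokens, 6))
# []
```

An inscription of m tokens yields m - n + 1 n-grams, so for large values of i only the few longest inscriptions contribute anything.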

The per-inscription lists of n-grams are flattened into a single list (flat_list), which is then scanned for duplicates. If a duplicate is found, the loop is terminated. Because i decreases from the longest possible length, the first value of i for which a duplicate is found is the length of the longest repeated sequence: 61 tokens.
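The descending search can be sketched on toy data as follows. This is not the notebook's exact code: the function name `longest_repeated_length` is hypothetical, and a set is used for the duplicate check as an assumption about what would be faster on a large corpus.

```python
def longest_repeated_length(token_lists):
    """Return the largest n such that some n-gram occurs at least twice, or 0 if none does."""
    longest = max((len(tokens) for tokens in token_lists), default=0)
    for n in range(longest, 0, -1):  # try the longest candidate length first
        seen = set()
        for tokens in token_lists:
            for k in range(len(tokens) - n + 1):
                gram = tuple(tokens[k:k + n])
                if gram in seen:  # first duplicate found: n is the answer
                    return n
                seen.add(gram)
    return 0


# Toy corpus (not taken from the dataset)
toy = [
    ["dis", "manibus", "sacrum", "hic", "situs", "est"],
    ["hic", "situs", "est", "sit", "tibi", "terra", "levis"],
    ["vixit", "annos"],
]
print(longest_repeated_length(toy))
# 3  (the sequence 'hic situs est' occurs twice)
```

As in the notebook, a sequence repeated inside a single inscription counts as a duplicate, because all n-grams are pooled before the check.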

A function (repeated_seq) is then defined that takes the list of tokenized inscriptions and an integer n as input. It is called for every n from 61 down to 2. The function generates all the n-grams of length n in each inscription and stores them in a list. The list is flattened and a Counter is used to count the occurrences of each n-gram in the flattened list. Every n-gram that occurs more than once is added, together with its count, to a new counter, which is returned.
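The counting step can be illustrated on a toy corpus. The sketch below follows the same logic as repeated_seq but is a simplified, self-contained variant: the name `count_repeated` and the toy data are hypothetical, and the n-grams are built with slicing rather than NLTK.

```python
from collections import Counter


def count_repeated(token_lists, n):
    """Count all n-grams of length n and keep only those occurring more than once."""
    counts = Counter(
        tuple(tokens[k:k + n])
        for tokens in token_lists
        for k in range(len(tokens) - n + 1)
    )
    return Counter({gram: c for gram, c in counts.items() if c > 1})


# Toy corpus (not taken from the dataset)
toy = [
    ["dis", "manibus", "sacrum"],
    ["dis", "manibus", "vixit", "annos"],
    ["vixit", "annos", "dis", "manibus"],
]

repeated = count_repeated(toy, 2)
print(repeated.most_common())
# [(('dis', 'manibus'), 3), (('vixit', 'annos'), 2)]
```

Bigrams that occur only once, such as ('manibus', 'sacrum'), are dropped, which keeps the stored result small.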

For the extraction, we used the version of the texts that marks blanks within the line. In this way, words that are not actually consecutive on the stone are kept apart in the counting.
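The effect of keeping the blank markers can be sketched as follows. The example assumes that a gap is represented by a placeholder token such as '3'. That is an assumption made for illustration, and the actual encoding of blanks in the dataset may differ.

```python
GAP = "3"  # assumed placeholder token for a blank; hypothetical encoding


def bigrams(tokens):
    """Return all pairs of adjacent tokens."""
    return list(zip(tokens, tokens[1:]))


with_marker = ["vixit", GAP, "annos"]  # the blank is kept as a token
without_marker = ["vixit", "annos"]    # the blank was silently dropped

print(bigrams(with_marker))
# [('vixit', '3'), ('3', 'annos')]
print(("vixit", "annos") in bigrams(with_marker))
# False
print(("vixit", "annos") in bigrams(without_marker))
# True
```

With the marker in place, 'vixit annos' is not counted for an inscription in which the two words are separated by a lost stretch of text.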

In [1]:
import pandas as pd
import os
import xml.etree.ElementTree as ET
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from nltk.util import ngrams
import collections
from collections import Counter
import pickle

In [2]:
##open the dataset of funerary inscriptions (172,958 rows)
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales_new.csv")

In [3]:
##find the longest inscription

len_tokenized_inscriptions = []
list_tokenized_inscriptions = []

for i,inscription in enumerate(Inscriptions['inscription']):
    inscription = str(inscription)
    cleaned_inscription = re.sub(r"[\(\){}\[\]/]", "", inscription) ##remove parentheses, braces, square brackets and slashes; all other characters, including '3' and '6', are kept
    lower_inscription = cleaned_inscription.lower() ##lowercase all the characters in the inscription
    tokenized_inscription = word_tokenize(lower_inscription)
    len_tokenized_inscriptions.append(len(tokenized_inscription))
    list_tokenized_inscriptions.append(tokenized_inscription)
    
max_len = max(len_tokenized_inscriptions)
print(max_len)

1345


In [4]:
##get the index of the longest inscription
index_longest_inscription = len_tokenized_inscriptions.index(max_len)
index_longest_inscription

118843

In [5]:
##get the text of the longest inscription
Inscriptions['inscription_interpretive_cleaning'][index_longest_inscription]

'uxoris morum probitate rum permansisti proba orbata es repente ante nuptiarum diem utro que parente in deserta solitudine una occisis per te maxime cum ego in Macedoniam provinciam issem vir sororis tuae Caius Cluvius in Africam provinciam inulta non est relicta mors parentum tanta cum industria munere es pietatis perfuncta efflagitando at que vindicando ut si praesto fuissemus non amplius potuissemus sed haec habes communia cum sanctissima femina sorore tua quae dum agitabas ex patria domo propter custodiam non cedisti sumpto de nocentibus supplicio evestigio te in domum matris meae tulisti ubi adventum meum expectasti temptatae deinde estis ut testamentum quo nos eramus heredes ruptum diceretur coemptione facta cum uxore ita necessario te cum universis patris bonis in tutelam eorum qui rem agitabant reccidisse sororem omnino eorum bonorum fore expertem quod emancupata esset Cluvio qua mente ista acceperis qua praesentia animi restiteris etsi afui conpertum habeo veritate causam comm

In [6]:
##find the longest repeated sequence (61 tokens)

i=max_len ##=1345 tokens, the value printed above
dupl=False
while i>0 and not dupl: ##iterate over values of i from max_len down to 1
    list_n_grams=[]
    for tokenized_inscription in list_tokenized_inscriptions: ##for each tokenized inscription
        if len(tokenized_inscription)>=i:
            n_grams=list(ngrams(tokenized_inscription, i)) ##create a list of n-grams, the n-gram length is equal to i
            if (len(n_grams)>0):
                list_n_grams.append(n_grams)
    flat_list = [item for sublist in list_n_grams for item in sublist] ##flatten the list
    #print(flat_list)
    #print(i)
    seen=set() ##a set keeps the duplicate check fast
    for gram in flat_list:
        if gram not in seen: ##check for duplicates
            seen.add(gram)
        else:
            dupl=True ##if a duplicate is found, the loop is terminated
            print('When the loop is terminated, the i value is :', i)
            break ##stop at the first duplicate so the message is printed only once
    i-=1

When the loop is terminated, the i value is : 61


In [7]:
longest_repeated_sequence= 61

In [8]:
def repeated_seq (lst, n): ##take a list of sentences and an integer as input
    list_n_grams=[]
    for sent in lst:
        if len(sent)>=n:
            n_grams=list(ngrams(sent, n)) ##find all the n-grams of length n in each sentence
            list_n_grams.append(n_grams) ##append all the found n-grams in a list
    flat_list = [item for sublist in list_n_grams for item in sublist] ##flatten the list
    c=Counter(flat_list) ##count the occurrences of each n-gram in the flattened list
    ##c contains the n-grams and their counts
    new_counter = Counter() ##create an empty counter object
    for element, count in c.items():
        if count > 1: ##if the count is greater than 1
            new_counter[element] = count ##add the count to new_counter
    return new_counter ##the new_counter contains the repeated n-grams and their counts

In [None]:
#for i in range(0, 13):
    #grams=repeated_seq(tok_sent, 14-i)
    #print(i)
    #name='final_grams/grams_'+str(14-i)
    #file = open(name, "wb")
    #pickle.dump(grams, file)
    #file.close()

In [9]:
for i in range(0, 60): ##i ranges from 0 to 59, so the n-gram length ranges from 61 down to 2
    grams=repeated_seq(list_tokenized_inscriptions, longest_repeated_sequence-i) ##find all the repeated n-grams of length 61-i
    print(i)
    name='grams_'+str(longest_repeated_sequence-i)
    with open(name, "wb") as file: ##the file is closed automatically on exit
        pickle.dump(grams, file)

0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59


In [16]:
with open('grams_2', 'rb') as f:
    x = pickle.load(f)

In [17]:
x.most_common()

[(('dis', 'manibus'), 64389),
 (('vixit', 'annos'), 41692),
 (('manibus', 'sacrum'), 20845),
 (('bene', 'merenti'), 18850),
 (('vixit', 'annis'), 15365),
 (('in', 'pace'), 13574),
 (('hic', 'situs'), 13430),
 (('situs', 'est'), 11563),
 (('3', '3'), 8786),
 (('sibi', 'et'), 8271),
 (('qui', 'vixit'), 8249),
 (('hic', 'sita'), 8030),
 (('sita', 'est'), 7119),
 (('quae', 'vixit'), 5929),
 (('merenti', 'fecit'), 5786),
 (('in', 'fronte'), 5744),
 (('fronte', 'pedes'), 5549),
 (('in', 'agro'), 5067),
 (('agro', 'pedes'), 4873),
 (('terra', 'levis'), 4308),
 (('<', 'v=b'), 4241),
 (('v=b', '>'), 4241),
 (('tibi', 'terra'), 3991),
 (('sit', 'tibi'), 3867),
 (('?', '3'), 3826),
 (('coniugi', 'bene'), 3818),
 (('et', 'sibi'), 3736),
 (('pius', 'vixit'), 3602),
 (('posterisque', 'eorum'), 3592),
 (('annos', '3'), 3334),
 (('et', 'suis'), 3079),
 (('>', 's'), 2895),
 (('libertis', 'libertabusque'), 2881),
 (('3', 'vixit'), 2772),
 (('est', 'sit'), 2757),
 (('pia', 'vixit'), 2666),
 (('fecit', 's