# Collocations Extraction and Measurement

The analysis was performed on texts that retain the indications of blank spaces ([3], [6]) but not the line-division marker (/). In this way, words that are not consecutive are excluded from the bigram extraction. At this stage, the extracted bigrams also include the supplements and abbreviation expansions supplied by modern editors.

The notebook covers three main steps in the extraction and measurement of bigrams in the inscriptions:

1. Extraction of the most frequent bigrams based on the raw frequency.
2. Extraction of the bigrams with the highest PMI score.
3. Extraction of the most significant bigrams in comparison to the reference corpus.

The three steps correspond to three key aspects that emerged from the linguistic analysis of formulaic language:

1. Frequency.
2. Strong association.
3. Specificity.

For a discussion see: ...

Note that bigram extraction also allows a preliminary analysis of the layout. For example, the second most frequent bigram is ('.', 'dis') (61,617 occurrences), because 'Dis' is usually the first word of an inscription and therefore directly follows the full stop appended to the preceding one.
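As an illustration of this layout reading, the words that open an inscription can be counted directly from the pairs whose first element is the boundary marker. The following is a minimal sketch on a toy token list, not part of the original notebook: `opening_words` and `toy_tokens` are hypothetical names, and the sketch assumes that a '.' has been appended after each inscription, as in the tokenization step of this notebook.

```python
from collections import Counter


def opening_words(tokens, boundary="."):
    """Count the words that directly follow an inscription boundary marker."""
    counts = Counter()
    for previous, current in zip(tokens, tokens[1:]):
        # A word opens an inscription when the preceding token is the boundary.
        if previous == boundary and current != boundary:
            counts[current] += 1
    return counts


# Toy data (hypothetical): three short inscriptions, each closed by '.'.
toy_tokens = ["dis", "manibus", "sacrum", ".",
              "dis", "manibus", ".",
              "hic", "situs", "est", "."]

toy_openers = opening_words(toy_tokens)
print(toy_openers.most_common())
```

As in the notebook, the first word of the very first inscription is not counted, because no boundary marker precedes it.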

In [1]:
import pandas as pd
import numpy as np
import re
import pickle
import collections
from collections import Counter
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.collocations import *
from nltk.util import ngrams
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.proportion import proportions_chisquare

bigram_measures = nltk.collocations.BigramAssocMeasures()

In [2]:
##open the dataset of funerary inscriptions (172,958 rows)
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales_new.csv")

In [3]:
##test the text cleaning: remove brackets, braces and '/' while keeping the markers '3' and '6'
test = '[D(is) M(anibus) s(acrum)] / mo[r]ibus / bonis [3] m/[atro]na / [pud]ica / [prob]ata'
cleaned_test = re.sub(r"[\(\){}\[\]/]", "", test)
cleaned_test

'Dis Manibus sacrum  moribus  bonis 3 matrona  pudica  probata'

In [4]:
##create a list of all the words in the inscriptions
list_of_words = []

for inscription in Inscriptions['inscription']:
    inscription = str(inscription)
    cleaned_inscription = re.sub(r"[\(\){}\[\]/]", "", inscription) ##remove brackets, braces and '/'; the markers '3' and '6' are kept
    tokenized_inscription = word_tokenize(cleaned_inscription) ##tokenize the inscription with NLTK
    for word in tokenized_inscription:
        list_of_words.append(word.lower()) ##lowercase every token
    list_of_words.append('.') ##add a full stop to mark the end of the inscription

In [5]:
len(list_of_words)

2380947

# a. Frequency: Most Frequent Bigrams

In [6]:
finder = BigramCollocationFinder.from_words(list_of_words) ##create a bigram finder

##get the 10 most frequent bigrams
finder.ngram_fd.most_common(10)

[(('dis', 'manibus'), 64389),
 (('.', 'dis'), 61617),
 (('vixit', 'annos'), 41692),
 (('manibus', 'sacrum'), 20845),
 (('bene', 'merenti'), 18850),
 (('vixit', 'annis'), 15365),
 (('in', 'pace'), 13574),
 (('hic', 'situs'), 13430),
 (('est', '.'), 12064),
 (('situs', 'est'), 11563)]

# b. Strong Association: Pointwise Mutual Information (PMI) Score

The formula for calculating the PMI score is as follows:

$$\mathrm{PMI} = \log_2 \frac{P(\text{word1}, \text{word2})}{P(\text{word1}) \cdot P(\text{word2})}$$

where P(word1, word2) is the observed probability of the bigram, and P(word1) and P(word2) are the observed probabilities of the individual words. The observed probability of a bigram is its frequency in the corpus divided by the total number of bigrams in the corpus.
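To make the formula concrete, the PMI can be computed by hand on a small token list. The following is a minimal sketch under stated assumptions, not the notebook's method: `pmi_scores` and `toy_tokens` are hypothetical names, bigram probabilities are divided by the number of bigrams and word probabilities by the number of tokens, and NLTK's own `BigramAssocMeasures.pmi` may normalize slightly differently, so the values need not match it exactly.

```python
import math
from collections import Counter


def pmi_scores(tokens):
    """Return {bigram: PMI in bits} for all adjacent token pairs."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    n_words = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (word1, word2), count in bigram_counts.items():
        p_bigram = count / n_bigrams              # P(word1, word2)
        p_word1 = unigram_counts[word1] / n_words  # P(word1)
        p_word2 = unigram_counts[word2] / n_words  # P(word2)
        scores[(word1, word2)] = math.log2(p_bigram / (p_word1 * p_word2))
    return scores


# Toy data (hypothetical): 11 tokens, 10 bigrams.
toy_tokens = ["dis", "manibus", "sacrum", ".",
              "dis", "manibus", ".",
              "hic", "situs", "est", "."]

toy_scores = pmi_scores(toy_tokens)
for bigram in [("dis", "manibus"), ("hic", "situs")]:
    print(bigram, round(toy_scores[bigram], 3))
```

In this toy list ('hic', 'situs') occurs once and ('dis', 'manibus') twice, yet the rarer pair receives the higher score, because its two words never occur outside the pair. The same effect explains why an unfiltered PMI ranking is dominated by very rare tokens and why a frequency filter is useful.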

In [7]:
##get the 10 bigrams with the highest PMI score
finder.nbest(bigram_measures.pmi, 10)

[('++a3', '3+p++a+3'),
 ('++i+ei', 'iotae'),
 ('++nicios', '+++'),
 ('+ceech+', 'aano+o'),
 ('+crknωm', 'mnodo'),
 ('+ecidi', 'o+pate'),
 ('+valterni', 'bimet3ventur'),
 ('.אגוסתה.אמן', 'שלוםעלמשכה'),
 ('1abi3e', 'p3adi'),
 ('1aconanicvs', 'frattri')]

In [8]:
##get the 10 bigrams with the highest PMI score which occur at least twice in the corpus
finder.apply_freq_filter(2) ##apply a filter to the finder (the finder is modified in place, so the filter also holds in later cells)

for bigram in (finder.nbest(bigram_measures.pmi, 10)):
    pmi_score = finder.score_ngram(bigram_measures.pmi, bigram[0], bigram[1])
    absolute_frequency = finder.ngram_fd[bigram]
    print(f"{bigram}\tRaw_frequency: {absolute_frequency}") ##show the raw frequency

('3iei3rcs', '3alimappa')	Raw_frequency: 2
('3ln3', '3bt3')	Raw_frequency: 2
('3niu3', '3isse3')	Raw_frequency: 2
('3ussa', '3e3al')	Raw_frequency: 2
('accensorum', 'velatorum')	Raw_frequency: 2
('aeflaniae', 'debeiae')	Raw_frequency: 2
('ancialus', 'fulviniai')	Raw_frequency: 2
('asonilo', 'euporius')	Raw_frequency: 2
('astello', 'starvae')	Raw_frequency: 2
('auruncina', 'calaviana')	Raw_frequency: 2


In [9]:
##get the 10 bigrams with the highest PMI score which occur at least 122 times in the corpus
finder.apply_freq_filter(122) ##apply a filter to the finder

for bigram in (finder.nbest(bigram_measures.pmi, 10)):
    pmi_score = finder.score_ngram(bigram_measures.pmi, bigram[0], bigram[1])
    absolute_frequency = finder.ngram_fd[bigram]
    print(f"{bigram}\tRaw_frequency: {absolute_frequency}") ##show the raw frequency

('dolus', 'malus')	Raw_frequency: 155
('malus', 'abesto')	Raw_frequency: 147
('rei', 'publicae')	Raw_frequency: 136
('ddominis', 'nnostris')	Raw_frequency: 184
('vviris', 'cclarissimis')	Raw_frequency: 188
('monumento', 'dolus')	Raw_frequency: 122
('contra', 'votum')	Raw_frequency: 190
('decreto', 'decurionum')	Raw_frequency: 250
('iure', 'dicundo')	Raw_frequency: 376
('equiti', 'singulari')	Raw_frequency: 190


# c. Specificity: p-value against the LASLA Corpus

In [10]:
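def calculate_z_score(item1, item2, corpus_counter, reference_counter):
    ##observed counts of the bigram in the inscriptions (item1) and in the reference corpus (item2)
    count = np.array([corpus_counter[item1], reference_counter[item2]])
    ##hard-coded totals used as denominators: inscriptions first, reference corpus second
    n = np.array([2007668, 1809855])
    return proportions_ztest(count, n) ##returns (z statistic, p-value)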

In [11]:
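def calculate_chi_square(item1, item2, corpus_counter, reference_counter):
    ##observed counts of the bigram in the inscriptions (item1) and in the reference corpus (item2)
    count = np.array([corpus_counter[item1], reference_counter[item2]])
    ##hard-coded totals used as denominators: inscriptions first, reference corpus second
    n = np.array([2007668, 1809855])
    return proportions_chisquare(count, n) ##returns (chi-square statistic, p-value, (observed table, expected table))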

In [12]:
##open the file containing the bigrams in the inscriptions
with open(r'C:\Users\u0154817\OneDrive - KU Leuven\Documents\ICLL Prague June 2023\Output\Ngrams_notcleaned\grams_2', 'rb') as f:
    bigrams_inscriptions = pickle.load(f)

In [13]:
bigrams_inscriptions.most_common(10) 

[(('dis', 'manibus'), 64389),
 (('vixit', 'annos'), 41692),
 (('manibus', 'sacrum'), 20845),
 (('bene', 'merenti'), 18850),
 (('vixit', 'annis'), 15365),
 (('in', 'pace'), 13574),
 (('hic', 'situs'), 13430),
 (('situs', 'est'), 11563),
 (('3', '3'), 8786),
 (('sibi', 'et'), 8271)]

In [14]:
type(bigrams_inscriptions) ##bigrams_inscriptions is a Counter object

collections.Counter

In [15]:
##open the file containing the bigrams in the LASLA corpus
with open(r'C:\Users\u0154817\OneDrive - KU Leuven\Documents\ICLL Prague June 2023\Sources\final_grams_lasla\grams_2', 'rb') as f:
    bigrams_LASLA = pickle.load(f)

In [16]:
bigrams_LASLA.most_common(10) 

[(('et', 'in'), 1168),
 (('rei', 'publicae'), 948),
 (('non', 'est'), 906),
 (('que', 'et'), 754),
 (('sed', 'etiam'), 677),
 (('non', 'modo'), 665),
 (('populi', 'romani'), 653),
 (('que', 'in'), 646),
 (('est', 'quod'), 618),
 (('est', 'et'), 588)]

In [17]:
type(bigrams_LASLA) ##bigrams_LASLA is a Counter object

collections.Counter

In [18]:
for i in bigrams_inscriptions.most_common(10):
    print(i)
    i1 = re.sub(r'v', 'u', i[0][0]) ##the LASLA corpus writes 'u' for 'v' (e.g. 'uixit' for 'vixit')
    i2 = re.sub(r'v', 'u', i[0][1])
    cleaned_tuple = (i1,i2)
    #print(cleaned_tuple)
    z_stat, p_value = calculate_z_score(i[0], cleaned_tuple, bigrams_inscriptions, bigrams_LASLA)
    print(z_stat, f'{p_value:.3g}') ##z statistic and p-value (the p-value underflows to 0 for z statistics this large)
    #print(calculate_chi_square(i[0], cleaned_tuple, bigrams_inscriptions, bigrams_LASLA))
    #if calculate_chi_square(i[0], cleaned_tuple, bigrams_inscriptions, bigrams_LASLA)[1]<0.05:
        #print('significant')

(('dis', 'manibus'), 64389)
242.9707449070468 0
(('vixit', 'annos'), 41692)
194.93367226751113 0
(('manibus', 'sacrum'), 20845)
137.45667695168333 0
(('bene', 'merenti'), 18850)
130.62352023914664 0
(('vixit', 'annis'), 15365)
117.9282831355557 0
(('in', 'pace'), 13574)
109.97859520824204 0
(('hic', 'situs'), 13430)
110.14563287029506 0
(('situs', 'est'), 11563)
102.1804860936101 0
(('3', '3'), 8786)
89.09880528833129 0
(('sibi', 'et'), 8271)
85.45462177652446 0
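
As a cross-check on the values above, the z statistic can be recomputed by hand. The following is a minimal sketch, assuming the pooled-variance form of the two-proportion z-test; `pooled_two_proportion_z` is a hypothetical helper, not part of the notebook. The counts for ('sibi', 'et') (8,271 in the inscriptions, 59 in LASLA) are taken from the contingency table in the `list_specificity` output of this notebook, and the totals are the hard-coded values from `calculate_z_score`.

```python
import math


def pooled_two_proportion_z(count1, n1, count2, n2):
    """Two-proportion z statistic with a pooled variance estimate."""
    p1 = count1 / n1
    p2 = count2 / n2
    # Pooled proportion under the null hypothesis of equal proportions.
    pooled = (count1 + count2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


# ('sibi', 'et'): counts and totals as they appear in this notebook.
z = pooled_two_proportion_z(8271, 2007668, 59, 1809855)
print(z)       # z statistic
print(z ** 2)  # for a 2x2 table, the square of z is the chi-square statistic
```

Under these assumptions the result should be close to the 85.45 printed for ('sibi', 'et'), and its square close to the chi-square statistic reported for the same bigram, which shows that the two helper functions measure the same contrast.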


In [19]:
list_specificity=[]

for i in bigrams_inscriptions:
    i1 = re.sub(r'v', 'u', i[0]) ##the LASLA corpus writes 'u' for 'v' (e.g. 'uixit' for 'vixit')
    i2 = re.sub(r'v', 'u', i[1])
    cleaned_tuple = (i1,i2)
    list_specificity.append((i, calculate_chi_square(i, cleaned_tuple, bigrams_inscriptions, bigrams_LASLA)))
list_specificity.sort(key=lambda a: a[1][1]) ##sort by p-value, smallest first

In [20]:
list_specificity[:15]

[(('3', '3'),
  (7938.597103807971, 0.0, (array([[   8786., 1998882.],
           [      0., 1809855.]]),
    array([[   4620.63255362, 2003047.36744638],
           [   4165.36744638, 1805689.63255362]])))),
 (('sibi', 'et'),
  (7302.492382968851,
   0.0,
   (array([[8.271000e+03, 1.999397e+06],
           [5.900000e+01, 1.809796e+06]]),
    array([[   4380.81825309, 2003287.18174691],
           [   3949.18174691, 1805905.81825309]])))),
 (('in', 'fronte'),
  (5128.042546647204,
   0.0,
   (array([[5.744000e+03, 2.001924e+06],
           [2.000000e+01, 1.809835e+06]]),
    array([[   3031.33690406, 2004636.66309594],
           [   2732.66309594, 1807122.33690406]])))),
 (('in', 'agro'),
  (4413.650622411032,
   0.0,
   (array([[5.067000e+03, 2.002601e+06],
           [5.600000e+01, 1.809799e+06]]),
    array([[   2694.22952108, 2004973.77047892],
           [   2428.77047892, 1807426.22952108]])))),
 (('agro', 'pedes'),
  (4398.484032436684, 0.0, (array([[   4873., 2002795.],
      