# 1.6 Extract and Measure Collocations

Several tools can extract and measure collocations (here, bigrams) in a text. The analysis was performed on the interpretive cleaning texts. Conservative texts were excluded from bigram extraction because blank spaces are not marked there, so words that do not follow one another end up adjacent, which produces spurious bigrams.

The most frequent collocations are 'dis manibus', 'vixit annos', and 'manibus sacrum'. The raw frequency of a collocation is difficult to interpret on its own. For instance, the bigram 'dis manibus' occurs 65,433 times in the interpretive texts. But is this number relevant, and in comparison to what? To assess the meaning of these values, other metrics (available in NLTK) can be used to explore other aspects of the collocational bond.

NLTK measures include:

- Pointwise Mutual Information (PMI)
- Likelihood Ratio

The different measures are compared.

In [1]:
import pandas as pd
from collections import Counter

In [2]:
##open the dataset of funerary inscriptions (172,958 rows)
Inscriptions = pd.read_csv("/Users/u0154817/OneDrive - KU Leuven/Documents/ICLL Prague June 2023/Output/Tituli_Sepulcrales_new.csv")

# 1.4.1 Bigram collocations: Frequency

In [3]:
##create a list of all the words in the interpretive texts
List_of_Words = []

for Inscription in Inscriptions['inscription_interpretive_cleaning']:
    Inscription = str(Inscription) ##note: a missing value (NaN) becomes the token 'nan'
    Inscription = Inscription.split()
    for Word in Inscription:
        Word = Word.lower()
        List_of_Words.append(Word)
    List_of_Words.append('.') ##add a '.' token to mark the end of each inscription

In [4]:
import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(List_of_Words) ##create a bigram finder

##get the 10 most frequent bigrams in the interpretive texts
finder.ngram_fd.most_common(10)

[(('dis', 'manibus'), 65433),
 (('.', 'dis'), 62490),
 (('vixit', 'annos'), 44016),
 (('manibus', 'sacrum'), 20886),
 (('bene', 'merenti'), 19361),
 (('vixit', 'annis'), 16453),
 (('in', 'pace'), 14158),
 (('hic', 'situs'), 13493),
 (('situs', 'est'), 12584),
 (('est', '.'), 12236)]
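
The ranking above includes pairs built on the '.' boundary token, such as ('.', 'dis') and ('est', '.'). A minimal sketch of one way to avoid them is to count bigrams inside each inscription separately with `Counter`, so that no pair spans two inscriptions. The three sample inscriptions below are invented for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical sample inscriptions (illustrative only, not from the dataset)
sample_inscriptions = [
    "Dis Manibus sacrum",
    "Dis Manibus bene merenti",
    "hic situs est",
]

bigram_counts = Counter()
for text in sample_inscriptions:
    words = text.lower().split()
    # zip pairs each word with its successor inside the same inscription,
    # so no bigram crosses an inscription boundary
    bigram_counts.update(zip(words, words[1:]))

print(bigram_counts.most_common(1))
```

Counting per inscription yields raw frequencies only; the association measures discussed below would still need the individual word counts.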

# 1.4.2 NLTK Metrics

**1.4.2.a Likelihood Ratio Measure**

The likelihood ratio measure is a statistical metric that evaluates the strength of association between two words in a bigram. It compares the **observed frequency** of a bigram with the **expected frequency** under the assumption of independence.


                    LR = 2 * (log-likelihood of observed frequency - log-likelihood of expected frequency)


The assumption of independence means that the occurrence of one word in a bigram does not depend on the occurrence of the other. Assuming independence gives a baseline expectation for the frequency of a bigram. It is a simplifying assumption used to quantify the strength of association between words: in reality, the occurrences of words in a corpus are influenced by various contextual factors, and true independence may not hold.

The likelihood ratio score measures the deviation from this expected frequency. If the observed frequency is significantly higher than the expected frequency, the likelihood ratio score will be larger, indicating a stronger association between the two words in the bigram.

In this notebook the expected frequency is estimated from the corpus itself: the finder created with from_words() counts the individual words and the bigrams in the list of words, and the expected frequency of a bigram is derived from the frequencies of its two words. No separate reference corpus is used.
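
The comparison between observed and expected frequency can be sketched in plain Python. The function below computes the expected counts under independence and the log-likelihood ratio statistic in its common G² form, that is, 2 · Σ observed · ln(observed / expected) over the four cells of the 2×2 contingency table of a bigram. The function name and the counts are illustrative assumptions; NLTK's own implementation may differ in details such as smoothing.

```python
import math

def likelihood_ratio_sketch(n_ii, n_ix, n_xi, n_xx):
    """G^2 statistic for a bigram (w1, w2); hypothetical helper, not an NLTK function.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of bigrams whose first word is w1
    n_xi: count of bigrams whose second word is w2
    n_xx: total number of bigrams
    """
    # Observed 2x2 contingency table
    observed = [
        n_ii,                       # w1 followed by w2
        n_ix - n_ii,                # w1 followed by another word
        n_xi - n_ii,                # another word followed by w2
        n_xx - n_ix - n_xi + n_ii,  # neither w1 first nor w2 second
    ]
    # Expected counts under independence: row total * column total / grand total
    expected = [
        n_ix * n_xi / n_xx,
        n_ix * (n_xx - n_xi) / n_xx,
        (n_xx - n_ix) * n_xi / n_xx,
        (n_xx - n_ix) * (n_xx - n_xi) / n_xx,
    ]
    # Cells with an observed count of 0 contribute nothing to the sum
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

# Invented counts: both bigrams have an expected frequency of 20 * 20 / 100 = 4
associated = likelihood_ratio_sketch(10, 20, 20, 100)   # observed 10, expected 4
independent = likelihood_ratio_sketch(4, 20, 20, 100)   # observed 4, expected 4
print(associated > independent)
```

When the observed count equals the expected count the score is zero, and it grows as the observed table moves away from the independence baseline.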

In [5]:
##get the 10 most significant bigrams based on the likelihood ratio measure in the interpretive texts
finder.nbest(bigram_measures.likelihood_ratio, 10)

[('dis', 'manibus'),
 ('.', 'dis'),
 ('vixit', 'annos'),
 ('bene', 'merenti'),
 ('manibus', 'sacrum'),
 ('in', 'pace'),
 ('hic', 'situs'),
 ('situs', 'est'),
 ('vixit', 'annis'),
 ('hic', 'sita')]

**1.4.2.b Pointwise Mutual Information (PMI) Score**

The Pointwise Mutual Information (PMI) score measures the extent to which the joint probability of the bigram deviates from the expected probability under the assumption of independence.

The formula for calculating the PMI score is as follows:

                                       PMI = log2(P(word1, word2) / (P(word1) * P(word2)))

In this formula, P(word1, word2) represents the observed probability of the bigram, and P(word1) and P(word2) represent the observed probabilities of the individual words. The observed probability of a bigram refers to the frequency of occurrence of that specific bigram in the corpus relative to the total number of bigrams present. It is calculated by dividing the frequency of the bigram by the total number of bigrams in the corpus.

                        Observed probability of a bigram = Frequency of the bigram / Total number of bigrams

Because the denominator of the formula contains the individual word probabilities, a bigram in which one or both words have very low probabilities gets a small denominator and therefore a larger PMI score.

This characteristic of PMI helps identify rare or unique word combinations, but it can also overestimate the significance of associations that are attested only a handful of times. It is therefore useful to apply a frequency filter, such as ignoring bigrams that occur fewer than three times in the corpus.
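
The effect of rarity on PMI can be checked with a few lines of arithmetic. The sketch below applies the PMI formula to two invented bigrams: one that occurs once and whose words occur nowhere else, and one that occurs 100 times. The function name and all counts are assumptions chosen for illustration, and for simplicity the same total is used to estimate word and bigram probabilities.

```python
import math

def pmi_sketch(pair_count, w1_count, w2_count, total):
    """PMI = log2(P(w1, w2) / (P(w1) * P(w2))); hypothetical helper.

    Probabilities are estimated as count / total (same total for words
    and bigrams, a simplifying assumption).
    """
    p_pair = pair_count / total
    p_w1 = w1_count / total
    p_w2 = w2_count / total
    return math.log2(p_pair / (p_w1 * p_w2))

total = 1000  # invented corpus size

# A pair seen once, made of two words that each occur only once
rare_pmi = pmi_sketch(1, 1, 1, total)

# A pair seen 100 times, made of two frequent words
frequent_pmi = pmi_sketch(100, 120, 110, total)

print(rare_pmi > frequent_pmi)  # True
```

The pair that occurs only once receives the higher score, which is why a ranking without a frequency filter tends to be dominated by one-off tokens.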

Because PMI rewards pairs whose words rarely occur apart, regardless of how often the pair itself occurs, it can surface associations that the likelihood ratio ranks low. As a result, the list of bigrams ranked by PMI may differ from the list ranked by likelihood ratio.

In [6]:
##get the 10 most significant bigrams based on the PMI score in the interpretive texts
finder.nbest(bigram_measures.pmi, 10)

[('aaascia', 'dddedicaverunt'),
 ('aabsmun', 'eosa'),
 ('aai', 'aciihi'),
 ('aaulis', 'considiis'),
 ('abacio', 'gavernis'),
 ('abanv', 'igavsie'),
 ('abascantis', 'lathymo'),
 ('abbulae', 'pervincia'),
 ('abcdefgh', 'savg'),
 ('abcdefghiklm', 'nopqrstvxyz')]

In [7]:
finder.apply_freq_filter(100) ##keep only bigrams that occur at least 100 times
finder.nbest(bigram_measures.pmi, 100)

[('dolus', 'malus'),
 ('malus', 'abesto'),
 ('rei', 'publicae'),
 ('equo', 'publico'),
 ('ddominis', 'nnostris'),
 ('vviris', 'cclarissimis'),
 ('somno', 'pacis'),
 ('contra', 'votum'),
 ('decreto', 'decurionum'),
 ('monumento', 'dolus'),
 ('iure', 'dicundo'),
 ('milia', 'nummum'),
 ('ulla', 'querella'),
 ('tribunus', 'militum'),
 ('equiti', 'singulari'),
 ('ascia', 'dedicaverunt'),
 ('domino', 'nostro'),
 ('consulatum', 'basili'),
 ('ob', 'merita'),
 ('huic', 'monumento'),
 ('sine', 'ulla'),
 ('famulus', 'dei'),
 ('datus', 'decreto'),
 ('eques', 'singularis'),
 ('ascia', 'dedicavit'),
 ('equiti', 'romano'),
 ('viri', 'clarissimi'),
 ('fieri', 'iussit'),
 ('post', 'consulatum'),
 ('clarissimo', 'consule'),
 ('praetoriae', 'misenensis'),
 ('tiberio', 'claudio'),
 ('poni', 'iussit'),
 ('siti', 'sunt'),
 ('post', 'obitum'),
 ('classis', 'praetoriae'),
 ('basili', 'viri'),
 ('sua', 'pecunia'),
 ('pro', 'pietate'),
 ('tiberi', 'claudi'),
 ('non', 'sequetur'),
 ('sub', 'ascia'),
 ('si', 'qui

# 1.4.3 Trigrams, fourgrams, n-grams

In [None]:
trigram_measures = nltk.collocations.TrigramAssocMeasures()
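
A trigram finder follows the same pattern as the bigram finder. The sketch below uses NLTK's `TrigramCollocationFinder` on a short invented word list (not the dataset), drops every trigram that contains the '.' boundary token, and ranks the rest by raw frequency. The variable names are assumptions made for this example.

```python
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

trigram_measures = TrigramAssocMeasures()

# Hypothetical word list with '.' marking the end of each inscription
sample_words = "dis manibus sacrum . dis manibus sacrum . hic situs est .".split()

trigram_finder = TrigramCollocationFinder.from_words(sample_words)

# Remove every trigram in which any word is the boundary token
trigram_finder.apply_word_filter(lambda w: w == '.')

# Rank the remaining trigrams by raw frequency and keep the top two
top_trigrams = trigram_finder.nbest(trigram_measures.raw_freq, 2)
print(top_trigrams)
```

On the full word list, the same calls would apply to `List_of_Words`, and other measures such as `trigram_measures.pmi` or `trigram_measures.likelihood_ratio` could replace `raw_freq`.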

# 1.4.4 Spanning intervening words

In [None]:
finder = BigramCollocationFinder.from_words(List_of_Words, window_size=3) ##create a bigram finder that pairs each word with the next two words
finder.apply_freq_filter(10) ##keep only bigrams that occur at least 10 times
#ignored_words =
#finder.apply_word_filter(lambda w: len(w) < 3 or w.lower() in ignored_words)
finder.nbest(bigram_measures.likelihood_ratio, 100)
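
The effect of `window_size` can be seen on a three-word example. With `window_size=3`, the finder pairs each word with either of the two words that follow it, so 'hic' and 'est' are counted as a bigram even though 'situs' stands between them. The word list and variable names are invented for illustration.

```python
from nltk.collocations import BigramCollocationFinder

# Hypothetical three-word list (illustrative only)
sample_words = ["hic", "situs", "est"]

# Default finder: only adjacent words form a bigram
adjacent = BigramCollocationFinder.from_words(sample_words)

# Windowed finder: each word is paired with the next two words
windowed = BigramCollocationFinder.from_words(sample_words, window_size=3)

adjacent_pairs = sorted(adjacent.ngram_fd)
windowed_pairs = sorted(windowed.ngram_fd)
print(adjacent_pairs)
print(windowed_pairs)
```

Because a wider window also pairs words across the '.' boundary token, a word filter on that token may be worth applying before ranking.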