## Multiword expressions identification and extraction
- Use the spaCy tokenizer API to tokenize the text from the FIQA corpus (from lab 1).
- Compute bigram counts of downcased tokens.
- Discard bigrams containing characters other than letters. Make sure that you discard the invalid entries after computing the bigram counts.
- Use pointwise mutual information to compute the measure for all pairs of words.
- Sort the word pairs according to that measure in the descending order and determine top 10 entries.
- Filter bigrams with number of occurrences lower than 5. Determine top 10 entries for the remaining dataset (>=5 occurrences).
- Use SpaCy to lemmatize and tag the sentences in the corpus.
- Using the tagged corpus, compute bigram statistics for tokens consisting of:
  a. the lemmatized, downcased word
  b. the morphosyntactic category of the word (subst, fin, adj, etc.)
- Compute the same statistics as for the non-lemmatized words (i.e. PMI) and print top-10 entries with at least 5 occurrences.
- Group the bigrams by morphosyntactic tag, i.e. two bigrams belong to the same group if they share the syntactic category of the first word and of the second word. E.g. one group would contain all pairs with subst as the first word and adj as the second word.
- Print top-10 categories (sort them by total count of bigrams) and print top-5 pairs for each category.
- Create a table comparing the results for corpora without and with tagging and lemmatization.

In [1]:
import pandas as pd
from datasets import load_dataset
import spacy
from nltk import bigrams
from nltk.tokenize import word_tokenize
from nltk import FreqDist
import math
from collections import Counter
from collections import defaultdict



## Use SpaCy tokenizer API to tokenize the text from the FIQA corpus.

In [2]:
# loading spacy model
nlp = spacy.load("pl_core_news_sm")

In [3]:
# loading dataset
dataset = load_dataset("clarin-knext/fiqa-pl", "corpus")
df = pd.DataFrame(dataset['corpus'])
df_text = df['text']

In [4]:
df['text'][0]

'Nie mówię, że nie podoba mi się też pomysł szkolenia w miejscu pracy, ale nie możesz oczekiwać, że firma to zrobi. Szkolenie pracowników to nie ich praca – oni tworzą oprogramowanie. Być może systemy edukacyjne w Stanach Zjednoczonych (lub ich studenci) powinny trochę martwić się o zdobycie umiejętności rynkowych w zamian za ich ogromne inwestycje w edukację, zamiast wychodzić z tysiącami zadłużonych studentów i narzekać, że nie są do niczego wykwalifikowani.'

In [5]:
# # Spacy tokenization
# tokenized_texts = []

# for text in df_text:
#     doc = nlp(text)
#     tokens = [token.text for token in doc]
#     tokenized_texts.append(tokens)

In [6]:
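# loading tokenized corpus saved in previous report
# ast.literal_eval parses the stored list literal without executing arbitrary code
import ast
corpus_tokenized = pd.read_csv("tokenized_texts.csv", sep=';', converters={'tokens': ast.literal_eval})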

In [7]:
corpus_tokens = []

In [8]:
# make one list of tokens from many lists
for row in corpus_tokenized['tokens']:
    for token in row:
        corpus_tokens.append(token.lower())

In [9]:
corpus_tokens[:10]

['nie', 'mówię', ',', 'że', 'nie', 'podoba', 'mi', 'się', 'też', 'pomysł']

## Compute bigram counts of downcased tokens.

In [10]:
# computing bigrams
bi_grams = list(bigrams(corpus_tokens))
bi_grams[:10]

[('nie', 'mówię'),
 ('mówię', ','),
 (',', 'że'),
 ('że', 'nie'),
 ('nie', 'podoba'),
 ('podoba', 'mi'),
 ('mi', 'się'),
 ('się', 'też'),
 ('też', 'pomysł'),
 ('pomysł', 'szkolenia')]

In [11]:
# calculating bigrams frequency
bigram_freq = FreqDist(bi_grams)

In [71]:
# see results
for idx, (bigram, count) in enumerate(bigram_freq.items()):
    print(f'"{bigram[0]} {bigram[1]}": {count}')
    if idx == 5:
        break


"nie mówię": 275
"że nie": 5013
"nie podoba": 164
"podoba mi": 234
"mi się": 1398
"się też": 98


## Discard bigrams containing characters other than letters. 

In [13]:
# ignore bigrams containing characters other than letters
bigram_freq = {bigram: count for bigram, count in bigram_freq.items() if all(word.isalpha() for word in bigram)}
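
The task asks for invalid entries to be discarded only after the bigram counts are computed. A minimal sketch of why the order matters, using a toy token list taken from the first sentence of the corpus (the helper `count_bigrams` is a hypothetical name, not part of the notebook):

```python
from collections import Counter

tokens = ["nie", "mówię", ",", "że", "nie", "podoba", "mi", "się"]

def count_bigrams(seq):
    """Count adjacent pairs in a token sequence."""
    return Counter(zip(seq, seq[1:]))

# Variant 1 (the order used in this notebook): count first, then drop
# bigrams that contain a non-alphabetic token.
counted_then_filtered = {
    bigram: count
    for bigram, count in count_bigrams(tokens).items()
    if all(word.isalpha() for word in bigram)
}

# Variant 2: drop non-alphabetic tokens first, then count.
filtered_then_counted = count_bigrams([t for t in tokens if t.isalpha()])

# ("mówię", "że") is never adjacent in the text, yet variant 2 reports it
# because the comma that separated the two words was removed beforehand.
print(("mówię", "że") in counted_then_filtered)
print(("mówię", "że") in filtered_then_counted)
```

Filtering after counting therefore keeps only pairs of words that were really next to each other.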

In [72]:
# see results
for idx, (bigram, count) in enumerate(bigram_freq.items()):
    print(f'"{bigram[0]} {bigram[1]}": {count}')
    if idx == 5:
        break

"nie mówię": 275
"że nie": 5013
"nie podoba": 164
"podoba mi": 234
"mi się": 1398
"się też": 98


## Use pointwise mutual information to compute the measure for all pairs of words.

In [75]:
# calculate unigram frequency for PMI
unigram_freq = FreqDist(corpus_tokens)
# keep only tokens made up entirely of letters
unigram_freq = {unigram: count for unigram, count in unigram_freq.items() if unigram.isalpha()}

In [78]:
# see results
for idx, (unigram, count) in enumerate(unigram_freq.items()):
    print(f'"{unigram}": {count}')
    if idx == 5:
        break

"nie": 131454
"mówię": 978
"że": 90019
"podoba": 503
"mi": 5497
"się": 85840


In [20]:
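# function calculating PMI for bigrams
def calculate_pmi_for_corpus(bigrams, unigram_freq, bigram_freq):
    """Return {bigram: PMI} for every bigram in `bigrams` that is present in `bigram_freq`."""
    # NOTE: unigram and bigram probabilities share one denominator (the total
    # count of retained bigrams). Compared with separately normalised PMI this
    # shifts every score by the same constant, so the ranking is unaffected.
    total_bigrams = sum(bigram_freq.values())
    pmi_scores = {}

    for bigram in bigrams:
        word_a, word_b = bigram
        p_a = unigram_freq.get(word_a, 0) / total_bigrams
        p_b = unigram_freq.get(word_b, 0) / total_bigrams
        p_a_b = bigram_freq.get(bigram, 0) / total_bigrams

        # skip bigrams that were filtered out or that contain a filtered-out word
        if p_a * p_b != 0 and p_a_b != 0:
            pmi = math.log2(p_a_b / (p_a * p_b))
            pmi_scores[bigram] = pmi

    return pmi_scores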
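
For comparison, PMI is usually written as log2(P(a, b) / (P(a) · P(b))), with P(a, b) estimated from the bigram total and P(a), P(b) from the token total. The sketch below puts that variant next to the single-denominator variant used in this notebook, on a made-up toy corpus. The function names `pmi_standard` and `pmi_single_total` are hypothetical and serve only to illustrate the difference.

```python
import math
from collections import Counter

def pmi_standard(unigram_counts, bigram_counts):
    """PMI with unigram and bigram probabilities normalised by their own totals."""
    n_uni = sum(unigram_counts.values())
    n_bi = sum(bigram_counts.values())
    scores = {}
    for (a, b), c_ab in bigram_counts.items():
        p_ab = c_ab / n_bi
        p_a = unigram_counts[a] / n_uni
        p_b = unigram_counts[b] / n_uni
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores

def pmi_single_total(unigram_counts, bigram_counts):
    """PMI with one shared denominator (the bigram total), as in this notebook."""
    total = sum(bigram_counts.values())
    return {
        (a, b): math.log2(c_ab * total / (unigram_counts[a] * unigram_counts[b]))
        for (a, b), c_ab in bigram_counts.items()
    }

# toy corpus, for illustration only
tokens = "a b a b c a b d c a".split()
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))

std = pmi_standard(uni, bi)
single = pmi_single_total(uni, bi)

# Every bigram differs by the same offset between the two variants,
# so both produce the same ranking.
offsets = {round(std[bigram] - single[bigram], 9) for bigram in bi}
print(offsets)
```

The offset equals 2 · log2(N_tokens / N_bigrams), which is close to zero for a large corpus, so the absolute values also stay comparable as long as the bigram total is not reduced much by filtering.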

In [21]:
# calculate pmi on our corpus bigrams
pmi_scores = calculate_pmi_for_corpus(bi_grams, unigram_freq, bigram_freq)

In [22]:
# see results
for idx, (bigram, pmi) in enumerate(pmi_scores.items()):
    print(f'"{bigram[0]} {bigram[1]}": {pmi}')
    if idx == 9:
        break

"nie mówię": 3.4418706758592106
"że nie": 1.1057901233813492
"nie podoba": 3.655410937211979
"podoba mi": 8.74799340481655
"mi się": 3.911829464770224
"się też": 0.5396463964380495
"też pomysł": 1.4408548954513487
"pomysł szkolenia": 4.730602584589561
"szkolenia w": 0.5798881871130837
"w miejscu": 2.3930893671063753


## Sort the word pairs according to that measure in the descending order and determine top 10 entries.

In [23]:
# sorting pmi scores based on pmi value
sorted_pmi_scores = sorted(pmi_scores.items(), key=lambda x: x[1], reverse= True)

In [24]:
# see top 10 pmi scores
for idx, (bigram, pmi) in enumerate(sorted_pmi_scores):
    print((bigram, pmi))
    if idx == 9:
        break

(('jankesem', 'skrzeczącym'), 22.276472039076435)
(('wizach', 'panamskich'), 22.276472039076435)
(('devrait', 'éviter'), 22.276472039076435)
(('facturer', 'fortement'), 22.276472039076435)
(('peletki', 'chmielowe'), 22.276472039076435)
(('opalaniem', 'wzmiankowane'), 22.276472039076435)
(('wielowęzłowy', 'sekwencer'), 22.276472039076435)
(('sonarowe', 'zamontowane'), 22.276472039076435)
(('napompowanymi', 'gpas'), 22.276472039076435)
(('непрекъсната', 'памет'), 22.276472039076435)


The top PMI scores are very high and all identical (about 22.28). This is consistent with the formula used above: a bigram reaches the maximum score when both of its words occur exactly once in the corpus and only together. The top of the ranking is therefore dominated by one-off pairs rather than by established expressions. Several of the highest-scoring bigrams are not Polish at all (for example French and Cyrillic-script phrases), and the Polish ones are very specific two-word combinations that are not in common use.
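
The ceiling on the score can be checked with a little arithmetic. With a single shared denominator T, the score reduces to log2(c_ab · T / (c_a · c_b)), and since a bigram cannot occur more often than either of its words, the largest possible value is log2(T). The sketch below uses a hypothetical T of one million and made-up counts, not the values from the corpus.

```python
import math

def pmi_from_counts(c_ab, c_a, c_b, total):
    """PMI when all three probabilities are normalised by the same total."""
    return math.log2(c_ab * total / (c_a * c_b))

TOTAL = 1_000_000  # hypothetical total, for illustration only

# both words occur once, and only together
hapax_pair = pmi_from_counts(1, 1, 1, TOTAL)
# a frequent collocation built from frequent words (made-up counts)
frequent_pair = pmi_from_counts(500, 2_000, 1_000, TOTAL)

print(math.isclose(hapax_pair, math.log2(TOTAL)))  # the one-off pair sits at the ceiling
print(hapax_pair > frequent_pair)
```

This is why a minimum-frequency threshold is commonly applied before ranking by PMI.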

## Filter bigrams with number of occurrences lower than 5. Determine top 10 entries for the remaining dataset (>=5 occurrences).

In [25]:
# keep only bigrams that occur at least 5 times
bigrams_freq_more_than_5 = {bigram: count for bigram, count in bigram_freq.items() if count >= 5}


In [26]:
# calculate pmi on our corpus bigrams that occur at least 5 times
pmi_scores_more_than_5 = calculate_pmi_for_corpus(bi_grams, unigram_freq, bigrams_freq_more_than_5)

In [27]:
# sorting pmi scores based on pmi value
sorted_pmi_scores_more_than_5 = sorted(pmi_scores_more_than_5.items(), key=lambda x: x[1], reverse= True)

In [28]:
# see top 10 pmi scores
for idx, (bigram, pmi) in enumerate(sorted_pmi_scores_more_than_5):
    print((bigram, pmi))
    if idx == 9:
        break

(('klęska', 'żywiołowa'), 19.279373896686394)
(('bert', 'hellinger'), 19.279373896686394)
(('królicza', 'nora'), 19.279373896686394)
(('инарные', 'опционы'), 19.279373896686394)
(('опционы', 'олимп'), 19.279373896686394)
(('олимп', 'трейд'), 19.279373896686394)
(('мою', 'команду'), 19.279373896686394)
(('моя', 'группа'), 19.279373896686394)
(('stucco', 'veneziano'), 19.279373896686394)
(('остались', 'вопросы'), 19.279373896686394)


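After filtering, the top 10 PMI scores are still very high and again identical, which is consistent with pairs that occur exactly at the threshold and whose words appear only together. Most of these 10 bigrams are still not Polish (mainly Russian). The Polish entries, such as "klęska żywiołowa" (natural disaster) and "królicza nora" (rabbit hole), are genuine fixed expressions.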
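
One possible way to reduce the share of foreign-language entries, not part of the original task, is to keep only bigrams whose words consist of letters of the Polish alphabet. The sketch below is a rough heuristic under that assumption: it also drops loanwords containing q, v or x, and it cannot remove foreign words that happen to use only Polish letters. The names `POLISH_LETTERS`, `looks_polish` and `keep_polish_bigrams` are hypothetical, and the counts in `sample` are made up.

```python
POLISH_LETTERS = frozenset("aąbcćdeęfghijklłmnńoóprsśtuwyzźż")

def looks_polish(word):
    """True if the word is non-empty and uses only lowercase Polish letters."""
    return bool(word) and all(ch in POLISH_LETTERS for ch in word)

def keep_polish_bigrams(bigram_counts):
    """Drop bigrams in which either word contains a non-Polish character."""
    return {
        bigram: count
        for bigram, count in bigram_counts.items()
        if all(looks_polish(word) for word in bigram)
    }

# bigrams taken from the output above, with made-up counts
sample = {
    ("klęska", "żywiołowa"): 5,
    ("опционы", "олимп"): 5,
    ("stucco", "veneziano"): 5,
}
print(keep_polish_bigrams(sample))
```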

## Use SpaCy to lemmatize and tag the sentences in the corpus.

In [29]:
# Function to lemmatize and tag sentences
def lemmatize_and_tag(sentence):
    doc = nlp(sentence)
    lemmatized_and_tagged = [(token.lemma_.lower(), token.tag_) for token in doc]
    return lemmatized_and_tagged

In [30]:
lemmatize_and_tag("Ala ma kota")

[('ala', 'SUBST'), ('mieć', 'FIN'), ('kot', 'SUBST')]

In [31]:
# df['lemmatized_and_tagged'] = df['text'].apply(lemmatize_and_tag)

In [32]:
# df.head(5)

In [33]:
# df.to_csv("df_lemmatized_tagged.csv", index = False)

In [34]:
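# ast.literal_eval parses the stored list of tuples without executing arbitrary code
import ast
df = pd.read_csv("df_lemmatized_tagged.csv", converters={'lemmatized_and_tagged': ast.literal_eval})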

In [35]:
df.head()

| Unnamed: 0 | _id | title | text | lemmatized_and_tagged |
|---|---|---|---|---|
| 0 | 3 | | Nie mówię, że nie podoba mi się też pomysł szk... | [(nie, QUB), (mówić, FIN), (,, INTERP), (że, C... |
| 1 | 31 | | Tak więc nic nie zapobiega fałszywym ocenom po... | [(tak, ADV), (więc, CONJ), (nic, SUBST), (nie,... |
| 2 | 56 | | Nigdy nie możesz korzystać z FSA dla indywidua... | [(nigdy, ADV), (nie, QUB), (móc, FIN), (korzys... |
| 3 | 59 | | Samsung stworzył LCD i inne technologie płaski... | [(samsung, SUBST), (stworzyć, PRAET), (lcd, SU... |
| 4 | 63 | | Oto wymagania SEC: Federalne przepisy dotycząc... | [(oto, QUB), (wymaganie, SUBST), (sec, SUBST),... |


## Using the tagged corpus compute bigram statistic for the tokens containing: 

a. lemmatized, downcased word 

b. morphosyntactic category of the word (subst, fin, adj, etc.)

In [37]:
# Function to compute bigram statistics based on lemmatized, lowercase words
def compute_lemmatized_bigram_statistics(lemmatized_paragraphs):
    lemmatized_bigram_statistics = Counter()

    for lemmatized_paragraph in lemmatized_paragraphs:
        # Extract lemmatized bigrams
        for i in range(len(lemmatized_paragraph) - 1):
            lemmatized_word_a, _ = lemmatized_paragraph[i]
            lemmatized_word_b, _ = lemmatized_paragraph[i + 1]
            
            lemmatized_bigram = (lemmatized_word_a, lemmatized_word_b)
            lemmatized_bigram_statistics[lemmatized_bigram] += 1

    return lemmatized_bigram_statistics

In [38]:
# Function to compute bigram statistics based on morphosyntactic categories
def compute_category_bigram_statistics(tagged_paragraphs):
    category_bigram_statistics = Counter()

    for tagged_paragraph in tagged_paragraphs:
        # Extract category bigrams
        for i in range(len(tagged_paragraph) - 1):
            _, category_a = tagged_paragraph[i]
            _, category_b = tagged_paragraph[i + 1]
            
            category_bigram = (category_a, category_b)
            category_bigram_statistics[category_bigram] += 1

    return category_bigram_statistics
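
The two functions above differ only in which element of the `(lemma, tag)` tuple they pair up. A minimal sketch of a single helper that takes the selection as a parameter is shown below. The name `count_bigrams_by` and the second toy paragraph are hypothetical; the first paragraph reuses the "Ala ma kota" output shown earlier.

```python
from collections import Counter

def count_bigrams_by(paragraphs, key):
    """Count adjacent pairs after mapping every (lemma, tag) item through `key`."""
    counts = Counter()
    for paragraph in paragraphs:
        projected = [key(item) for item in paragraph]
        counts.update(zip(projected, projected[1:]))
    return counts

paragraphs = [
    [("ala", "SUBST"), ("mieć", "FIN"), ("kot", "SUBST")],
    [("kot", "SUBST"), ("mieć", "FIN"), ("ala", "SUBST")],
]

# pair up lemmas, then pair up tags, with the same helper
lemma_bigrams = count_bigrams_by(paragraphs, key=lambda item: item[0])
tag_bigrams = count_bigrams_by(paragraphs, key=lambda item: item[1])

print(lemma_bigrams.most_common(2))
print(tag_bigrams.most_common(2))
```

Bigrams are formed within each paragraph only, so no pair spans a document boundary, which matches the behaviour of the two functions above.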

In [39]:
# Example usage with DataFrame
lemmatized_bigram_stats = compute_lemmatized_bigram_statistics(df['lemmatized_and_tagged'])
category_bigram_stats = compute_category_bigram_statistics(df['lemmatized_and_tagged'])

In [40]:
# see top 10 lemmatized bigrams
print("Top 10 Lemmatized Bigrams:")
for lemmatized_bigram, count in lemmatized_bigram_stats.most_common(10):
    print(f"{lemmatized_bigram[0]} {lemmatized_bigram[1]} - Count: {count}")

# see top 10 category bigrams
print("\nTop 10 Category Bigrams:")
for category_bigram, count in category_bigram_stats.most_common(10):
    print(f"{category_bigram[0]} {category_bigram[1]} - Count: {count}")

Top 10 Lemmatized Bigrams:
, że - Count: 86095
, który - Count: 55544
, a - Count: 32643
, ale - Count: 32549
to , - Count: 28987
, aby - Count: 28290
nie być - Count: 25266
, co - Count: 20579
być to - Count: 19794
. jeśli - Count: 19470

Top 10 Category Bigrams:
SUBST SUBST - Count: 913780
SUBST INTERP - Count: 646717
ADJ SUBST - Count: 496638
PREP SUBST - Count: 423126
SUBST PREP - Count: 284892
SUBST ADJ - Count: 242427
SUBST FIN - Count: 177841
INTERP COMP - Count: 176484
PREP ADJ - Count: 171353
INTERP SUBST - Count: 157059


Out of curiosity, the same statistics are computed below with punctuation tokens removed before the bigrams are formed.

In [41]:
import string

# function to remove punctuation
def remove_punctuation(lemmatized_paragraph):
    return [(word, tag) for word, tag in lemmatized_paragraph if not any(char in string.punctuation for char in word)]

# Function to compute bigram statistics based on lemmatized, lowercase words, no punctuation
def compute_lemmatized_bigram_statistics_no_punct(lemmatized_paragraphs):
    lemmatized_bigram_statistics = Counter()

    for lemmatized_paragraph in lemmatized_paragraphs:
        lemmatized_paragraph_no_punct = remove_punctuation(lemmatized_paragraph)

        for i in range(len(lemmatized_paragraph_no_punct) - 1):
            lemmatized_word_a, _ = lemmatized_paragraph_no_punct[i]
            lemmatized_word_b, _ = lemmatized_paragraph_no_punct[i + 1]

            lemmatized_bigram = (lemmatized_word_a, lemmatized_word_b)
            lemmatized_bigram_statistics[lemmatized_bigram] += 1

    return lemmatized_bigram_statistics

# Function to compute bigram statistics based on morphosyntactic categories, no punctuation 
def compute_category_bigram_statistics_no_punct(tagged_paragraphs):
    category_bigram_statistics = Counter()

    for tagged_paragraph in tagged_paragraphs:
        tagged_paragraph_no_punct = remove_punctuation(tagged_paragraph)

        for i in range(len(tagged_paragraph_no_punct) - 1):
            _, category_a = tagged_paragraph_no_punct[i]
            _, category_b = tagged_paragraph_no_punct[i + 1]

            category_bigram = (category_a, category_b)
            category_bigram_statistics[category_bigram] += 1

    return category_bigram_statistics

lemmatized_bigram_stats_no_punct = compute_lemmatized_bigram_statistics_no_punct(df['lemmatized_and_tagged'])
category_bigram_stats_no_punct = compute_category_bigram_statistics_no_punct(df['lemmatized_and_tagged'])

print("Top 10 Lemmatized Bigrams (Without Punctuation):")
for lemmatized_bigram, count in lemmatized_bigram_stats_no_punct.most_common(10):
    print(f"{lemmatized_bigram[0]} {lemmatized_bigram[1]} - Count: {count}")

print("\nTop 10 Category Bigrams (Without Punctuation):")
for category_bigram, count in category_bigram_stats_no_punct.most_common(10):
    print(f"{category_bigram[0]} {category_bigram[1]} - Count: {count}")

Top 10 Lemmatized Bigrams (Without Punctuation):
nie być - Count: 25427
być to - Count: 19924
nie mieć - Count: 14496
to że - Count: 11124
w ten - Count: 10288
po prosty - Count: 9764
być w - Count: 8959
to być - Count: 7715
móc być - Count: 7458
to co - Count: 7191

Top 10 Category Bigrams (Without Punctuation):
SUBST SUBST - Count: 592574
ADJ SUBST - Count: 498747
PREP SUBST - Count: 422059
SUBST PREP - Count: 321584
SUBST ADJ - Count: 307722
SUBST FIN - Count: 225846
SUBST CONJ - Count: 195645
PREP ADJ - Count: 171523
FIN SUBST - Count: 141214
SUBST COMP - Count: 122852


## Compute the same statistics as for the non-lemmatized words (i.e. PMI) and print top-10 entries with at least 5 occurrences.

In [42]:
lemmatized_texts = df['lemmatized_and_tagged']

In [43]:
lem_bigrams = []
for lemmatized_paragraph in lemmatized_texts:
    lemmatized_bigrams = [(word_a, word_b) for (word_a, _), (word_b, _) in zip(lemmatized_paragraph, lemmatized_paragraph[1:])]
    lem_bigrams.append(lemmatized_bigrams)

In [44]:
lem_bigrams = [bigram for bigram_list in lem_bigrams for bigram in bigram_list]

In [45]:
lem_bigrams[:5]

[('nie', 'mówić'),
 ('mówić', ','),
 (',', 'że'),
 ('że', 'nie'),
 ('nie', 'podobać')]

In [46]:

# create unigram_freq
lem_unigram_freq = Counter()
for lemmatized_paragraph in lemmatized_texts:
    for word, _ in lemmatized_paragraph:
        lem_unigram_freq[word] += 1

In [47]:
for idx, (unigram, count) in enumerate(lem_unigram_freq.items()):
    print((unigram, count))
    if idx ==5:
        break

('nie', 130916)
('mówić', 7895)
(',', 611388)
('że', 90022)
('podobać', 667)
('ja', 12144)


In [48]:
# create bigram_freq
lem_bigram_freq = Counter()
for lemmatized_paragraph in lemmatized_texts:
    lemmatized_bigrams = [(word_a, word_b) for (word_a, _), (word_b, _) in zip(lemmatized_paragraph, lemmatized_paragraph[1:])]
    lem_bigram_freq.update(lemmatized_bigrams)


In [49]:
for idx, (bigram, count) in enumerate(lem_bigram_freq.items()):
    print((bigram, count))
    if idx ==5:
        break

(('nie', 'mówić'), 825)
(('mówić', ','), 3316)
((',', 'że'), 86095)
(('że', 'nie'), 5014)
(('nie', 'podobać'), 214)
(('podobać', 'ja'), 301)


In [50]:
# ignore bigrams containing characters other than letters
lem_bigram_freq = {bigram: count for bigram, count in lem_bigram_freq.items() if all(word.isalpha() for word in bigram)}

In [51]:
# keep only bigrams that occur at least 5 times
lem_bigrams_freq_more_than_5 = {bigram: count for bigram, count in lem_bigram_freq.items() if count >= 5}

In [52]:
lem_pmi_scores = calculate_pmi_for_corpus(lem_bigrams, lem_unigram_freq, lem_bigrams_freq_more_than_5)

In [53]:
# sorting pmi scores based on pmi value
sorted_lem_pmi_scores = sorted(lem_pmi_scores.items(), key=lambda x: x[1], reverse= True)

In [54]:
# see top 10 pmi scores
for idx, (bigram, pmi) in enumerate(sorted_lem_pmi_scores):
    print((bigram, pmi))
    if idx == 9:
        break

(('emiratów', 'arabskich'), 19.476754113430072)
(('bert', 'hellinger'), 19.476754113430072)
(('инарные', 'опционы'), 19.476754113430072)
(('опционы', 'олимп'), 19.476754113430072)
(('олимп', 'трейд'), 19.476754113430072)
(('мою', 'команду'), 19.476754113430072)
(('моя', 'группа'), 19.476754113430072)
(('stucco', 'veneziano'), 19.476754113430072)
(('остались', 'вопросы'), 19.476754113430072)
(('экономическая', 'игра'), 19.213719707596276)
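
The two top-10 lists (surface forms with at least 5 occurrences, and lemmas with at least 5 occurrences) can be put side by side to support the requested comparison table. The sketch below copies the bigrams from the outputs printed in this notebook and prints them as a table; the variable names are hypothetical.

```python
# top-10 bigrams copied from the outputs above
surface_top10 = [
    ("klęska", "żywiołowa"), ("bert", "hellinger"), ("królicza", "nora"),
    ("инарные", "опционы"), ("опционы", "олимп"), ("олимп", "трейд"),
    ("мою", "команду"), ("моя", "группа"), ("stucco", "veneziano"),
    ("остались", "вопросы"),
]
lemma_top10 = [
    ("emiratów", "arabskich"), ("bert", "hellinger"), ("инарные", "опционы"),
    ("опционы", "олимп"), ("олимп", "трейд"), ("мою", "команду"),
    ("моя", "группа"), ("stucco", "veneziano"), ("остались", "вопросы"),
    ("экономическая", "игра"),
]

# one table row per rank
print("| Rank | Surface forms | Lemmas |")
print("|---|---|---|")
for rank, (surface, lemma) in enumerate(zip(surface_top10, lemma_top10), start=1):
    print(f"| {rank} | {' '.join(surface)} | {' '.join(lemma)} |")

# bigrams that appear in both rankings
shared = set(surface_top10) & set(lemma_top10)
print(f"Shared entries: {len(shared)} of 10")
```

The large overlap suggests that lemmatization changes little for these entries, which is plausible because the foreign-language words are mostly left unchanged by the Polish lemmatizer.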


## Group the bigrams by morphosyntactic tag, i.e. two bigrams belong to the same group if they share the syntactic category of the first word and of the second word.

In [55]:
# create list of all bigrams with tags
all_bigrams_with_tags = []

for lemmatized_paragraph in lemmatized_texts:
    lemmatized_bigrams_with_tags = [((word_a, tag_a), (word_b, tag_b)) for (word_a, tag_a), (word_b, tag_b) in zip(lemmatized_paragraph, lemmatized_paragraph[1:])]
    all_bigrams_with_tags.extend(lemmatized_bigrams_with_tags)

In [56]:
all_bigrams_with_tags[:10]

[(('nie', 'QUB'), ('mówić', 'FIN')),
 (('mówić', 'FIN'), (',', 'INTERP')),
 ((',', 'INTERP'), ('że', 'COMP')),
 (('że', 'COMP'), ('nie', 'QUB')),
 (('nie', 'QUB'), ('podobać', 'FIN')),
 (('podobać', 'FIN'), ('ja', 'PPRON12')),
 (('ja', 'PPRON12'), ('się', 'QUB')),
 (('się', 'QUB'), ('też', 'QUB')),
 (('też', 'QUB'), ('pomysł', 'SUBST')),
 (('pomysł', 'SUBST'), ('szkolenie', 'GER'))]

In [57]:
# Dictionary for storing the grouped bigrams
grouped_bigrams = defaultdict(list)

# Group the bigrams
for bigram in all_bigrams_with_tags:
    first_word_tag = bigram[0][1]
    second_word_tag = bigram[1][1]
    
    # The group key is a tuple with the tags of both words
    group_key = (first_word_tag, second_word_tag)
    
    # Add the bigram to the matching group
    grouped_bigrams[group_key].append(bigram)


In [59]:
# Count the total number of bigrams in each group
group_counts = Counter()
for group_key, group in grouped_bigrams.items():
    group_counts[group_key] = len(group)

In [63]:
# Print the top-10 categories
top_categories = group_counts.most_common(10)
for category, count in top_categories:
    print(f"\nCategory {category}, Total Count: {count}")
    
    # Print 5 pairs for the category (the first 5 encountered, not ranked by frequency)
    top_pairs = grouped_bigrams[category][:5]
    for pair in top_pairs:
        print(pair)


Category ('SUBST', 'SUBST'), Total Count: 913780
(('miejsce', 'SUBST'), ('praca', 'SUBST'))
(('student', 'SUBST'), (')', 'SUBST'))
(('nic', 'SUBST'), ('wykwalifikować', 'SUBST'))
(('wykwalifikować', 'SUBST'), ('.', 'SUBST'))
(('strona', 'SUBST'), ('rynek', 'SUBST'))

Category ('SUBST', 'INTERP'), Total Count: 646717
(('praca', 'SUBST'), (',', 'INTERP'))
(('zrobić', 'SUBST'), ('.', 'INTERP'))
(('praca', 'SUBST'), ('–', 'INTERP'))
(('oprogramowanie', 'SUBST'), ('.', 'INTERP'))
(('edukacja', 'SUBST'), (',', 'INTERP'))

Category ('ADJ', 'SUBST'), Total Count: 496638
(('ogromny', 'ADJ'), ('inwestycja', 'SUBST'))
(('fałszywy', 'ADJ'), ('ocen', 'SUBST'))
(('dodatkowy', 'ADJ'), ('kontrola', 'SUBST'))
(('nowoki', 'ADJ'), ('kontrola', 'SUBST'))
(('należyty', 'ADJ'), ('staranność', 'SUBST'))

Category ('PREP', 'SUBST'), Total Count: 423126
(('w', 'PREP'), ('miejsce', 'SUBST'))
(('w', 'PREP'), ('stany', 'SUBST'))
(('w', 'PREP'), ('edukacja', 'SUBST'))
(('z', 'PREP'), ('tysiąc', 'SUBST'))
(('do', 
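
The listing above shows the first five pairs encountered in each category. A minimal sketch of ranking the pairs inside each category by how often they occur is given below. Raw counts are used here as an assumption; a PMI score could be substituted. The function name `top_pairs_per_category` and the toy input are hypothetical.

```python
from collections import Counter, defaultdict

def top_pairs_per_category(tagged_bigrams, n_categories=10, n_pairs=5):
    """Return [(category, total, [(pair, count), ...])], both levels ranked by count."""
    per_category = defaultdict(Counter)
    for (word_a, tag_a), (word_b, tag_b) in tagged_bigrams:
        per_category[(tag_a, tag_b)][(word_a, word_b)] += 1

    # total number of bigram occurrences in each category
    totals = Counter({cat: sum(pairs.values()) for cat, pairs in per_category.items()})
    return [
        (cat, total, per_category[cat].most_common(n_pairs))
        for cat, total in totals.most_common(n_categories)
    ]

# toy input in the same shape as all_bigrams_with_tags
toy = [
    (("w", "PREP"), ("miejsce", "SUBST")),
    (("w", "PREP"), ("miejsce", "SUBST")),
    (("w", "PREP"), ("edukacja", "SUBST")),
    (("ogromny", "ADJ"), ("inwestycja", "SUBST")),
]

result = top_pairs_per_category(toy)
for category, total, pairs in result:
    print(category, total, pairs)
```

Storing a `Counter` per category instead of a list of every occurrence also keeps memory use proportional to the number of distinct pairs.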

Some thoughts based on the obtained results:

- ('ADJ', 'SUBST'): This category shows frequent juxtapositions of adjectives with nouns, which can indicate descriptions or characterizations of specific things or situations.
- ('PREP', 'SUBST') and ('SUBST', 'PREP'): Combinations of prepositions with nouns suggest some spatial or logical relationship between objects.
- ('SUBST', 'SUBST'): In this category, we observe frequent combinations of nouns, suggesting that there are specific phrases or terms in the analyzed text.
- ('SUBST', 'FIN'): Juxtapositions of nouns with finite verbs indicate descriptions of actions or processes.

(nothing very revealing to be honest)