# Best Friends

In this task we will look for word association pairs using the pointwise mutual information method. We will compare word pairs that occur right after each other and then word pairs that occur in the same context.

In this notebook we include the code for running the PMI method and then we run four experiments.

The full outputs of the algorithms can be found in the `output_best_friends/` directory.
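
For reference, the score computed by `pmi_bigrams` in this notebook can be written in terms of counts. The symbols $c$ and $N$ are notation introduced here for convenience, not names used in the code:

$$
\mathrm{PMI}(w_1, w_2) = \log_2 \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)} = \log_2 \frac{c(w_1, w_2) \cdot N}{c(w_1)\, c(w_2)}
$$

Here $c(w_1, w_2)$ is the co-occurrence count of the pair, $c(w_1)$ and $c(w_2)$ are the row and column sums of the co-occurrence matrix, and $N$ is the sum of all its entries.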

In [30]:
import numpy as np
import pandas as pd
from itertools import chain
from pathlib import Path
from IPython.display import HTML, display
# create output directory
Path("output_best_friends/").mkdir(parents=True, exist_ok=True)

from dataset import read_word_list, pairs, find_common_words, filter_pairs_by_vocabulary

In [31]:
def pmi_bigrams(cooccurence_matrix, vocabulary):
    """Creates a generator of bigrams with their PMI value

    Args:
        cooccurence_matrix (np.ndarray): Co-occurrence counts; entry [i, j] counts the pairs (word i, context word j)
        vocabulary (dict): Word - id mapping

    Yields:
        dict: word, context word and the PMI value
    """
    sum_occurences = np.sum(cooccurence_matrix)
    # todo: fix beginning/ending words
    cnts_word1 = np.sum(cooccurence_matrix, axis=1)
    cnts_word2 = np.sum(cooccurence_matrix, axis=0)
    
    # sanity check of the axis argument: cnts_word1[i] must be the sum of row i
    # (checked on row 0, which exists for any non-empty vocabulary)
    assert np.sum(cooccurence_matrix[0, :]) == cnts_word1[0]

    for word1, idx1 in vocabulary.items():
        for word2, idx2 in vocabulary.items():
            cnt_coocurence = cooccurence_matrix[idx1, idx2]
            cnt_word1 = cnts_word1[idx1]
            cnt_word2 = cnts_word2[idx2]
            if cnt_coocurence == 0 or cnt_word1 == 0 or cnt_word2 == 0:
                continue
            pmi = (cnt_coocurence * sum_occurences) / (cnt_word1 * cnt_word2)
            pmi = np.log2(pmi)
            yield {
                "word": word1,
                "context": word2,
                "word occurence": cnt_word1,
                "context occurence": cnt_word2,
                "cooccurence": cnt_coocurence,
                "pmi": pmi,
            }
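
The double loop above visits every cell of the matrix in Python. As a possible alternative, the same quantity can be computed for all cells at once with NumPy broadcasting. This is only a sketch: `pmi_matrix` and the toy counts are hypothetical names and values used for illustration, not part of the notebook.

```python
import numpy as np

def pmi_matrix(counts):
    """Return PMI for every cell; cells with zero co-occurrence are NaN."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    row_sums = counts.sum(axis=1, keepdims=True)  # shape (V, 1): word counts
    col_sums = counts.sum(axis=0, keepdims=True)  # shape (1, V): context counts
    expected = row_sums * col_sums                # outer product via broadcasting
    pmi = np.full(counts.shape, np.nan)
    seen = counts > 0                             # skip pairs that never co-occur
    pmi[seen] = np.log2(counts[seen] * total / expected[seen])
    return pmi

# Toy 3-word co-occurrence matrix (made-up numbers)
toy_counts = np.array([[0, 2, 0],
                       [1, 0, 1],
                       [0, 0, 4]])
toy_pmi = pmi_matrix(toy_counts)
print(toy_pmi)
```

Pairs that never co-occur stay `NaN`, which mirrors the `continue` in the generator version.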

In [34]:
def conduct_experiment(name, data_path, distances):
    # Load the list of words
    word_list = read_word_list(data_path)

    # Find the words that occur 10 or more times
    vocabulary = find_common_words(word_list, occurence_threshold=10)

    # Add indices to words for lookup
    vocabulary = {word: index for index, word in enumerate(vocabulary)}
    # print(f"Size of the {language} vocabulary is", len(vocabulary))

    # Create the pairs for counting
    pair_iterables = []
    for distance in distances:
        # print(distance)
        all_pairs = pairs(word_list, distance=distance)
        pairs_without_rare_words = filter_pairs_by_vocabulary(all_pairs, vocabulary)
        
        pair_iterables.append(pairs_without_rare_words)
    
    final_pairs = chain.from_iterable(pair_iterables)

    # Create the co-occurrence matrix
    cooccurence_matrix = np.zeros((len(vocabulary), len(vocabulary)))
    for word1, word2 in final_pairs:
        idx1 = vocabulary[word1]
        idx2 = vocabulary[word2]
        cooccurence_matrix[idx1, idx2] += 1

    bigrams_pmi = list(pmi_bigrams(cooccurence_matrix, vocabulary))
    df = pd.DataFrame(bigrams_pmi).sort_values("pmi", ascending=False)
    print(f"{name} with the highest Pointwise Mutual Information:")
    display(HTML(df.head(20).to_html(index=False)))
    #display(df.head(20))
    print(f"{name} with the lowest PMI:")
    display(HTML(df.tail(5).to_html(index=False)))
    #display(df.tail(5))
    
    filename = name.replace(" ", "_")
    df.to_csv(f"output_best_friends/{filename}.csv", columns=["word", "context", "pmi"], index=False)
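
The helpers imported from `dataset` are not shown in this notebook. The sketch below is one guess at what they might look like, assuming that `pairs` yields forward pairs at an exact distance and that the filter keeps a pair only when both words are in the vocabulary. All `*_sketch` names are hypothetical.

```python
from collections import Counter

def pairs_sketch(words, distance):
    """Yield (word, context) for tokens exactly `distance` positions apart."""
    for i in range(len(words) - distance):
        yield words[i], words[i + distance]

def find_common_words_sketch(words, occurence_threshold):
    """Return the words that occur at least `occurence_threshold` times."""
    counts = Counter(words)
    return [w for w, c in counts.items() if c >= occurence_threshold]

def filter_pairs_sketch(word_pairs, vocabulary):
    """Keep only the pairs in which both words belong to the vocabulary."""
    return ((a, b) for a, b in word_pairs if a in vocabulary and b in vocabulary)

tokens = "a b a c a b".split()
vocab = set(find_common_words_sketch(tokens, occurence_threshold=2))
kept = list(filter_pairs_sketch(pairs_sketch(tokens, 1), vocab))
print(kept)
```

Under these assumptions, the word `a` occurs three times in the toy text but starts only two of the kept pairs, because one of its neighbours is below the threshold. The row sums of the co-occurrence matrix can therefore be smaller than the raw word frequencies.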

## Experiment: Consecutive words

We compute the pointwise mutual information for all word pairs that appear consecutively in the data. Both words must occur at least 10 times in the original corpus to be included. We report the 20 word pairs with the highest PMI and the 5 word pairs with the lowest PMI. To help interpret the PMI values, the result tables also list the word, context and pair counts. The word and context counts are the row and column sums of the co-occurrence matrix, so they cover only pairs in which both words pass the frequency threshold.

In [35]:
experiments_near = [
    ("English pairs with distance 1", "data/TEXTEN1.txt", [1]),
    ("Czech pairs with distance 1", "data/TEXTCZ1.txt", [1]),
]
for name, datapath, distances in experiments_near:
    conduct_experiment(name, datapath, distances)

English pairs with distance 1 with the highest Pointwise Mutual Information:


word,context,word occurence,context occurence,cooccurence,pmi
La,Plata,10.0,12.0,10.0,13.917209
Asa,Gray,10.0,12.0,10.0,13.917209
competent,observers,1.0,12.0,1.0,13.917209
de,Candolle,14.0,13.0,13.0,13.694816
worth,while,12.0,10.0,7.0,13.402636
faced,tumbler,9.0,16.0,8.0,13.332246
Fritz,Muller,14.0,20.0,14.0,13.180243
lowly,organised,11.0,18.0,9.0,13.04274
Malay,Archipelago,12.0,22.0,12.0,13.04274
shoulder,stripe,14.0,11.0,7.0,13.04274


English pairs with distance 1 with the lowest PMI:


word,context,word occurence,context occurence,cooccurence,pmi
in,of,4395.0,7801.0,1.0,-7.528919
.,of,4498.0,7801.0,1.0,-7.56234
of,.,8482.0,4581.0,1.0,-7.709464
.,the,4498.0,12731.0,1.0,-8.268955
the,",",10880.0,11706.0,2.0,-8.42218


Czech pairs with distance 1 with the highest Pointwise Mutual Information:


word,context,word occurence,context occurence,cooccurence,pmi
zavedení,příspěvku,2.0,2.0,1.0,14.599157
Peter,Carrington,2.0,5.0,2.0,14.277229
pražském,hotelu,1.0,7.0,1.0,13.791802
starých,struktur,2.0,8.0,2.0,13.599157
SE,,2.0,4.0,1.0,13.599157
deník,The,8.0,3.0,3.0,13.599157
vojenského,materiálu,4.0,2.0,1.0,13.599157
teplota,minus,5.0,7.0,4.0,13.469874
platební,bilance,5.0,7.0,4.0,13.469874
Hamburger,SV,9.0,9.0,9.0,13.429232


Czech pairs with distance 1 with the lowest PMI:


word,context,word occurence,context occurence,cooccurence,pmi
(,.,989.0,7080.0,1.0,-6.140203
na,.,1417.0,7080.0,1.0,-6.659
.,že,6914.0,1691.0,1.0,-6.879808
.,se,6914.0,2103.0,1.0,-7.194381
",",.,10310.0,7080.0,3.0,-7.93717


## Discussion: Consecutive words

At first glance the tables contain plausible word pairs. We see collocations such as the proper names "La Plata" or "Č. Budějovice", as well as phrasal verbs ("branched off") and idioms ("worth while"). These collocations look usable in downstream NLP applications.

Closer inspection reveals some word pairs that, we hypothesize, result from the limited size of the corpus and/or its limited topic variety.

For some word pairs, one of the words appears in the table exclusively together with the other word. This is to be expected for proper names or very specific idioms. For word pairs such as "competent observers" or "pražském hotelu", however, this is not the case. Rather, it is probably a result of the limited corpus: the counts in the tables cover only pairs in which both words pass the frequency threshold, so a word like "competent" or "pražském" is counted just once even though it occurs at least 10 times, presumably because its other neighbours are rare words. We could test this hypothesis by running the experiment on a larger corpus and comparing the new PMI values for these word pairs.
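
The scores of such exclusive pairs can be checked with a little arithmetic. When the pair count equals the word count, the expression used in `pmi_bigrams` reduces to log2(N / context count), so the score does not depend on how often the pair was actually seen. The sketch below uses only rows copied from the English table above; the helper `pmi_without_total` is a hypothetical name introduced for this check.

```python
import math

def pmi_without_total(word_count, context_count, cooccurrence):
    """PMI minus the corpus-wide constant log2(N)."""
    return math.log2(cooccurrence / (word_count * context_count))

# (word count, context count, co-occurrence, reported PMI), copied from the table
rows = {
    "La Plata": (10.0, 12.0, 10.0, 13.917209),
    "competent observers": (1.0, 12.0, 1.0, 13.917209),
    "de Candolle": (14.0, 13.0, 13.0, 13.694816),
}

for name, (w, c, joint, reported) in rows.items():
    partial = pmi_without_total(w, c, joint)
    # reported - partial should be the same constant, log2(N), for every row
    print(f"{name}: partial={partial:.6f}, implied log2(N)={reported - partial:.6f}")
```

"La Plata" (seen 10 times) and "competent observers" (seen once) get the same partial score, which is why a single observation can tie with a well-attested proper name at the top of the ranking.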

Looking at all the word pairs we also observe some bias towards the topics included in the datasets. For the English dataset we see a lot of "biological" collocations such as "competent observers", "faced tumbler" (pigeon breed), "lowly organised", "shoulder stripe", "self fertilisation". The highly-scored proper names also refer to this topic.
In the Czech dataset the collocations relate to news articles or reports, especially on economy, (international) politics, and the military.

We are not certain whether this bias in the word pairs we found is caused only by the characteristics of the corpus or is also a result of some properties of the algorithm. Some topics or text forms may contain fewer collocations that can be found by measuring the PMI of successive words, and so are not represented among the top 20 word pairs.

We also briefly discuss the low-PMI cases. The word pairs found consist of very common words that almost never follow each other for grammatical reasons.

## Experiment: Words in the same context

We compute the pointwise mutual information for all word pairs that appear in the same context in the data. More specifically, two words count as being in the same context when they are at a distance of 2 to 50 tokens from each other (`range(2, 51)` in the code below), that is, separated by 1 to 49 other words; directly adjacent pairs are excluded. Both words must occur at least 10 times in the original corpus to be included. We report the 20 word pairs with the highest PMI and the 5 word pairs with the lowest PMI. To help interpret the PMI values, the result tables also list the word, context and pair counts taken from the co-occurrence matrix.

In [36]:
experiments_far = [
    ("English pairs with distance [2,50]", "data/TEXTEN1.txt", range(2, 51)),
    ("Czech pairs with distance [2,50]", "data/TEXTCZ1.txt", range(2, 51)),
]
for name, datapath, distances in experiments_far:
    conduct_experiment(name, datapath, distances)

English pairs with distance [2,50] with the highest Pointwise Mutual Information:


word,context,word occurence,context occurence,cooccurence,pmi
dried,floated,416.0,763.0,17.0,8.929476
floated,dried,744.0,416.0,16.0,8.878393
dried,germinated,416.0,445.0,8.0,8.619891
dried,dried,416.0,416.0,7.0,8.524467
floated,germinated,744.0,445.0,13.0,8.481611
avicularia,vibracula,563.0,522.0,11.0,8.412526
floated,floated,744.0,763.0,21.0,8.395611
stripe,shoulder,540.0,607.0,12.0,8.380586
layer,hexagonal,452.0,430.0,7.0,8.356975
eastern,Pacific,579.0,500.0,10.0,8.296716


English pairs with distance [2,50] with the lowest PMI:


word,context,word occurence,context occurence,cooccurence,pmi
selection,islands,20699.0,6073.0,1.0,-3.787475
genera,conditions,9642.0,13540.0,1.0,-3.842065
species,wax,81677.0,1791.0,1.0,-4.0062
wax,species,1809.0,81693.0,1.0,-4.02091
varieties,organs,19109.0,8450.0,1.0,-4.148708


Czech pairs with distance [2,50] with the highest Pointwise Mutual Information:


word,context,word occurence,context occurence,cooccurence,pmi
výher,výher,524.0,539.0,40.0,9.454761
žel,žel,380.0,377.0,15.0,9.019011
Sandžaku,Sandžaku,332.0,319.0,10.0,8.869873
h,teplota,609.0,351.0,20.0,8.8567
CIA,CIA,318.0,307.0,8.0,8.665419
ODÚ,VPN,357.0,343.0,10.0,8.66048
Petrof,Petrof,437.0,422.0,14.0,8.555163
IFS,IFS,383.0,379.0,11.0,8.552574
silniční,doprava,306.0,348.0,8.0,8.540066
Bělehrad,Benfica,391.0,417.0,12.0,8.510431


Czech pairs with distance [2,50] with the lowest PMI:


word,context,word occurence,context occurence,cooccurence,pmi
1,kteří,33958.0,6469.0,1.0,-5.470394
!,jsou,20355.0,11181.0,1.0,-5.52146
6,jsem,19351.0,11768.0,1.0,-5.522305
1,jednání,33958.0,6821.0,1.0,-5.546834
2,jsem,26772.0,11768.0,1.0,-5.990622


## Discussion: Words in the same context

In this experiment, we again see plausible word pairs in the English table. As discussed for the first experiment, the topic selection of the English corpus is clearly visible in the word pairs found by the algorithm. The word pairs are connected semantically: pairs such as "dried floated", "floated germinated" or "dried days" could be found together in texts describing germination processes. We also see one example of a word pair that occurs in a specific 4-word collocation: "survival (of the) fittest".

In the Czech table we see quite a different output compared to the English table. Many word pairs consist of two identical words, and many of the words are abbreviations or proper names. We are not sure why the two languages differ in this way, but we have two hypotheses.

The first is that the Czech dataset consists of news articles, which contain more abbreviations and proper nouns in general, and that these in turn are more likely to be detected by the algorithm. When one article contains the abbreviation "CIA", it is very likely to contain it multiple times, whereas the abbreviation will not occur in most other articles. In this way we observe high PMI values for some word pairs caused by clusters of thematically related articles.
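
This clustering effect can be illustrated on synthetic data. The sketch below builds two artificial token streams that contain "CIA" four times each, once bunched together and once spread out, and computes the PMI of the pair ("CIA", "CIA") over forward pairs at distances 2 to 50. The corpus, the positions and the function names are made up for illustration, and the frequency threshold is omitted for brevity.

```python
import math
from collections import Counter

def self_pmi(tokens, word, distances=range(2, 51)):
    """PMI of (word, word) over forward pairs, or None if the pair never occurs."""
    pair_counts = Counter()
    for d in distances:
        for i in range(len(tokens) - d):
            pair_counts[(tokens[i], tokens[i + d])] += 1
    total = sum(pair_counts.values())
    as_word = sum(c for (a, _), c in pair_counts.items() if a == word)
    as_context = sum(c for (_, b), c in pair_counts.items() if b == word)
    joint = pair_counts[(word, word)]
    if joint == 0:
        return None
    return math.log2(joint * total / (as_word * as_context))

def make_corpus(positions, length=1000):
    """Filler tokens w0..w9 with "CIA" placed at the given positions."""
    tokens = [f"w{i % 10}" for i in range(length)]
    for p in positions:
        tokens[p] = "CIA"
    return tokens

clustered = self_pmi(make_corpus([500, 505, 510, 515]), "CIA")
spread = self_pmi(make_corpus([100, 350, 600, 850]), "CIA")
print("clustered:", clustered, "spread:", spread)
```

With the same word frequency, the bunched occurrences yield a clearly positive self-PMI, while the spread-out occurrences never fall into a common window and the pair does not appear at all.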

The second is that Czech is a morphologically richer language, so proper names might behave more regularly than other words, which have many inflected forms. If this is true, it might be interesting to lemmatize the Czech corpus and run the algorithm again.
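
A minimal sketch of what such a preprocessing step could look like is given below. It assumes a token-to-lemma mapping is available; the hand-written `toy_lemmas` table is purely hypothetical, and a real run would need a Czech morphological analyser to produce the lemmas.

```python
from collections import Counter

# Hypothetical hand-written lemma table for three inflected forms of "hotel"
toy_lemmas = {"hotelu": "hotel", "hotelem": "hotel", "hotely": "hotel"}

def lemmatize(tokens, lemmas):
    """Replace each token by its lemma when one is known, else keep the token."""
    return [lemmas.get(token, token) for token in tokens]

tokens = ["v", "hotelu", "za", "hotelem", "a", "hotely"]
surface_counts = Counter(tokens)
lemma_counts = Counter(lemmatize(tokens, toy_lemmas))
print(surface_counts["hotelu"], lemma_counts["hotel"])
```

After lemmatization the counts of the inflected forms are pooled under one entry, so ordinary words would compete with abbreviations and proper names on more equal terms. The lemmatized token list could then be passed through the same pipeline as the raw word list.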