<center>
<h1>Positive Pointwise Mutual Information</h1>
<h2>Corpus Linguistics with Python</h2>
<h3>Iro-Georgia Malta</h3>
</center>

# Step 1: Corpus Pre-Processing

To lemmatize the corpus, I run the Stanford CoreNLP pipeline with the lemma annotator. The command for the lemmatizer is the following:

In [None]:
! java -cp "/compLing/students/courses/corpusLinguistics/stanford-corenlp-4.4.0/*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos,lemma -file "/compLing/students/courses/corpusLinguistics/finalProject/text_fic.txt" -outputFormat conll

1. I process the output of the lemmatizer, **text_fic.txt.conll**.
2. I initialize an empty list **lemmatized_text**, in which the lemmas of the current sentence are collected.
3. In a *for loop*, I apply the following preprocessing steps to each line:
- I remove whitespace at the beginning and at the end of the string with *strip()*.
- I lowercase the string with *lower()*.
- I split the string at the tabs with *split()*.

4. The number of tab-separated fields tells token lines apart from sentence boundaries. If a line has 7 fields, it describes a token, and only the **third element**, the lemma, is appended to the list **lemmatized_text**. If a line does not have 7 fields (an empty line), it marks the end of the sentence.
5. In the else statement, I write the sentence to the new file **coca.preprocessed** (one sentence per line) as space-separated lemmas.
6. The list **lemmatized_text** is emptied again inside the else statement, so that the lemmas of the next sentence can be appended.

In [None]:
corpus_name = "text_fic.txt.conll"
new_corpus = "coca.preprocessed"

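with open(corpus_name, "r") as f, open(new_corpus, "w") as fout:  # "w" avoids duplicates on re-run
    lemmatized_text = []  # Lemmas of the current sentence
    for line in f:
        elements = line.strip().lower().split("\t")
        if len(elements) == 7:  # [4, GATHERED, gather, VBD, _, _, _]
            lemmatized_text.append(elements[2])  # Append the lemma (gather) to the list
        else:  # Empty line: end of the sentence
            if lemmatized_text:  # Skip empty sentences (e.g. consecutive empty lines)
                print(" ".join(lemmatized_text), file=fout)
            lemmatized_text = []  # Initialize empty list again
    if lemmatized_text:  # Write the last sentence if the file does not end with an empty line
        print(" ".join(lemmatized_text), file=fout)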
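
The sentence-splitting logic can be tried on a few in-memory lines without running CoreNLP. The sketch below is only an illustration: the sample lines are made up, and `lemma_sentences` is a hypothetical helper that is not part of the notebook. It also keeps the last sentence when the input does not end with an empty line, a case worth checking in real CoNLL output.

```python
# Made-up CoNLL-style lines: 7 tab-separated fields per token,
# and an empty line between sentences.
sample_lines = [
    "1\tThey\tthey\tPRP\t_\t_\t_\n",
    "2\tGATHERED\tgather\tVBD\t_\t_\t_\n",
    "\n",
    "1\tDogs\tdog\tNNS\t_\t_\t_\n",
    "2\tbarked\tbark\tVBD\t_\t_\t_\n",
]

def lemma_sentences(lines):
    """Collect the lemma column sentence by sentence (hypothetical helper)."""
    sentences, current = [], []
    for line in lines:
        fields = line.strip().lower().split("\t")
        if len(fields) == 7:       # token line: keep the third field, the lemma
            current.append(fields[2])
        elif current:              # empty line: close the current sentence
            sentences.append(" ".join(current))
            current = []
    if current:                    # input did not end with an empty line
        sentences.append(" ".join(current))
    return sentences

result = lemma_sentences(sample_lines)
print(result)  # ['they gather', 'dog bark']
```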


# Step 2: Counting Unigrams and Bigrams

To tokenize the lemmatized corpus, I import the *word_tokenize* function from *nltk*. Additionally, I import the *bigrams* function from *nltk* to extract bigrams from the corpus:

In [None]:
import nltk
from nltk import word_tokenize
from nltk import bigrams

1. First, I load the lemmatized text file **coca.preprocessed** and store its contents in a new variable **text2**.
2. Then, I tokenize it with the *word_tokenize* function and store the result in the variable **tokens**.
3. Next, I count all the unigrams of the text. If a lemma is not yet in **unigramsDict**, I add it as a new key with the value (frequency count) 1. If the lemma is already in the dictionary, its frequency count is increased by one. Afterwards, I set the frequency count of every unigram that occurs 5 times or fewer to 0.
4. To extract bigrams from the lemmatized corpus, I apply the *bigrams* function to the variable **tokens**. I convert its output to a list and assign it to the variable **bigrams_list**.
5. I count all the bigrams from **bigrams_list** in the same way: a bigram that is not yet in **bigramDict** is added as a new key (a tuple of two lemmas) with the frequency count 1; otherwise its frequency count is increased by one. Again, I set the frequency count of every bigram that occurs 5 times or fewer to 0.

In [None]:
corpus_name2 = "coca.preprocessed"
new_corpus2 = "coca.counts"

with open(corpus_name2, "r") as f:
    text2 = f.read()
    unigramsDict = {}  # Initialize unigramsDict
    tokens = word_tokenize(text2)
    for unigram in tokens:  # Count unigrams
        if unigram not in unigramsDict:
            unigramsDict[unigram] = 1
        else:
            unigramsDict[unigram] += 1
    for unigram, freq in unigramsDict.items():
        if freq <= 5:
            unigramsDict[unigram] = 0       
    bigrams_list = list(bigrams(tokens))
    bigramDict = {}  # Initialize bigramDict
    for bigram in bigrams_list:  # Count bigrams
        if bigram not in bigramDict:
            bigramDict[bigram] = 1
        else:
            bigramDict[bigram] += 1
    for bigram, freq in bigramDict.items():
        if freq <= 5:
            bigramDict[bigram] = 0
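
The same counting and thresholding can be written more compactly with `collections.Counter` from the standard library. This is a minimal sketch of an alternative, not the approach used in this notebook: `count_ngrams`, `min_count` and the toy tokens are hypothetical names and data, and `zip(tokens, tokens[1:])` is used here to build the adjacent pairs so that the example runs without *nltk*.

```python
from collections import Counter

def count_ngrams(tokens, min_count=6):
    """Count unigrams and bigrams; set counts below min_count to 0 (hypothetical helper)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))  # adjacent token pairs as tuples
    for counts in (unigram_counts, bigram_counts):
        for key, freq in counts.items():
            if freq < min_count:   # same threshold as above: freq <= 5 becomes 0
                counts[key] = 0
    return unigram_counts, bigram_counts

# Made-up token sequence: "the dog" six times, followed by "bark"
toy_tokens = ["the", "dog"] * 6 + ["bark"]
toy_unigrams, toy_bigrams = count_ngrams(toy_tokens)

print(toy_unigrams["the"], toy_unigrams["bark"])                  # 6 0
print(toy_bigrams[("the", "dog")], toy_bigrams[("dog", "the")])  # 6 0
```

A `Counter` returns 0 for keys it has never seen, so a later lookup of an unseen lemma or bigram does not raise a `KeyError`.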

6. I create a new text file **coca.counts**, in which I store the unigrams and bigrams together with their frequency counts from the **unigramsDict** and **bigramDict** dictionaries, one tab-separated key-value pair per line.
7. I compute the total number of tokens from **unigramsDict** by summing the frequency counts (values) of the unigrams (keys). The sum is stored in the variable **totalTokens**. Because frequency counts of 5 or fewer were set to 0 in the previous step, this total only covers the unigrams that occur more than 5 times.

In [None]:
with open(new_corpus2, "w") as fout:  # "w" overwrites the file instead of appending on re-run
    totalTokens = 0  # Initialize variable totalTokens
    for k, v in unigramsDict.items():
        print(k, v, sep = '\t', file = fout)
        totalTokens += v  # Count total number of tokens from unigramsDict
    print(totalTokens)  # Print total number of tokens
    for k, v in bigramDict.items():
        print(k, v, sep = '\t', file = fout)  # The key k is a tuple of two lemmas


# Step 3: PPMI

To compute the PPMI scores of the bigrams in the corpus, I import the *math* module and use its *log* function:

In [None]:
import math
from math import log

1. I define the **PPMI()** function, which takes the arguments **word1, word2, unigramsDict, bigramDict, Totaltokens**.
2. The **PPMI()** function returns:
    - A score equal to 0.0:
        - if **word1** or **word2** (lemmas) is not in **unigramsDict**,
        - or the frequency count of **word1** or **word2** (separately) is equal to 0,
        - or the bigram (**word1**, **word2**) is not in **bigramDict**,
        - or the frequency count of the bigram is equal to 0,
        - or the PMI score of the bigram is negative (since PPMI replaces negative PMI values with zeros).
    - The PPMI score otherwise:
        - the *probabilities* of **word1** and **word2** (separately) are their frequency counts divided by the total number of tokens (**Totaltokens**),
        - the *joint probability* of **word1** and **word2** is the frequency count of the bigram divided by the total number of tokens,
        - the final score is the base-2 logarithm of the joint probability divided by the product of the two separate probabilities; it is calculated with *math.log()* inside the *round()* function and rounded to three decimal places.
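
Written out as a formula, the computation described above looks as follows. The symbols are notation introduced here for illustration and are not names from the code: $c(\cdot)$ stands for a frequency count, $N$ for the total number of tokens, and $x$, $y$ for the two lemmas of a bigram.

```latex
P(x) = \frac{c(x)}{N}, \qquad
P(y) = \frac{c(y)}{N}, \qquad
P(x, y) = \frac{c(x, y)}{N}

\mathrm{PPMI}(x, y) = \max\!\left(0,\; \log_2 \frac{P(x, y)}{P(x)\,P(y)}\right)
```

The function additionally returns 0 whenever one of the counts is missing or equal to 0, because the logarithm is not defined in that case.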

In [None]:
def PPMI(word1, word2, unigramsDict, bigramDict, Totaltokens):
    # Return 0.0 if a lemma or the bigram is unseen or has a count of 0
    if (word1 not in unigramsDict  # unigramsDict
            or word2 not in unigramsDict
            or unigramsDict[word1] == 0
            or unigramsDict[word2] == 0
            or (word1, word2) not in bigramDict  # bigramDict
            or bigramDict[word1, word2] == 0):
        return 0.0
    word1_prob = unigramsDict[word1] / Totaltokens  # P(x)
    word2_prob = unigramsDict[word2] / Totaltokens  # P(y)
    prob_word1_word2 = bigramDict[word1, word2] / Totaltokens  # P(x,y)
    # PMI with log base 2, rounded to three decimal places
    PMI_score = round(math.log(prob_word1_word2
                               / (word1_prob * word2_prob), 2), 3)
    return max(PMI_score, 0.0)  # Negative PMI scores become 0.0


At this point, I call the function to check if it works properly:

In [None]:
PPMI(',', 'have', unigramsDict, bigramDict, totalTokens)  # negative score

In [None]:
PPMI('croatia', 'crete', unigramsDict, bigramDict, totalTokens)  # new lemmas

In [None]:
PPMI('who', 'leave', unigramsDict, bigramDict, totalTokens)  # existing lemmas
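
A check with numbers that can be verified by hand is a useful complement to the calls on corpus data. The sketch below uses made-up counts that are powers of two, so the logarithm comes out as a whole number. `ppmi_from_counts` is a hypothetical helper written for this illustration, not the notebook's **PPMI()** function, and it uses `math.log2` instead of `math.log(..., 2)`.

```python
import math

def ppmi_from_counts(count_x, count_y, count_xy, total):
    """PPMI from raw counts (hypothetical helper for a hand-checkable example)."""
    if 0 in (count_x, count_y, count_xy):
        return 0.0
    p_x = count_x / total      # P(x)
    p_y = count_y / total      # P(y)
    p_xy = count_xy / total    # P(x,y)
    pmi = math.log2(p_xy / (p_x * p_y))
    return max(round(pmi, 3), 0.0)  # clip negative PMI to 0.0

# Made-up counts. P(x,y) = 8/1024 = 1/128 and P(x)*P(y) = (1/32)*(1/64) = 1/2048,
# so the ratio is 16 and log2(16) = 4.
rare_pair = ppmi_from_counts(32, 16, 8, 1024)

# Two very frequent words: P(x)*P(y) = (1/2)*(1/2) = 1/4, the ratio is 1/32,
# PMI = -5, which PPMI clips to 0.
frequent_pair = ppmi_from_counts(512, 512, 8, 1024)

print(rare_pair, frequent_pair)  # 4.0 0.0
```

Both pairs occur together 8 times, yet only the pair of rarer words receives a positive score.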

# Step 4: Computing PPMI Scores

1. I choose a **dictionary** as the data structure in which I store all the bigrams and their PPMI scores.
2. I use a *for loop* to go through the bigrams (keys) of **bigramDict**. Inside the loop, I call the *PPMI()* function with the arguments **bigram[0], bigram[1], unigramsDict, bigramDict** and **totalTokens**. To access the two lemmas of a bigram, I use *the indices [0] and [1]* because the bigrams are stored as tuples in the dictionary.
3. Once the PPMI score is computed, each bigram (key) and its PPMI score (value) are stored in the new dictionary **PPMI_bigrams_score**.

In [None]:
PPMI_bigrams_score = {}  # Initialize empty dictionary

for bigram in bigramDict:
    score = PPMI(bigram[0], bigram[1], unigramsDict, bigramDict, totalTokens)
    PPMI_bigrams_score[bigram] = score


As a next step, I create a new text file **coca.ppmi**, in which I print each bigram (key) and its PPMI score (value) from the dictionary **PPMI_bigrams_score**.

In [None]:
new_file4 = "coca.ppmi"

with open(new_file4, "w") as fout:  # "w" overwrites the file instead of appending on re-run
    for bigram, score in PPMI_bigrams_score.items():
        print(bigram[0], bigram[1], score, sep = '\t', file = fout)


At this point, I define the new function **topN()**. The function expects a dictionary (**PPMI_bigramScores**) and has the argument **n=20**, which sets the default number of bigrams to print. The function works as follows:
1. I sort by PPMI score in descending order using the built-in function **sorted()** with its *reverse* argument, and I store the result in the variable **bigrams_sorted**.
2. Since the list is sorted in descending order, its first **n** elements (20 by default) hold the top scores.
3. I store the bigrams (keys) that belong to these top scores, together with their scores (values), in the new dictionary **sorted_dict**.
4. Lastly, the function prints the first **[0]** and the second **[1]** lemma of each bigram as well as its score in the cell, *tab-separated* from each other.

In [None]:
def topN(PPMI_bigramScores, n = 20):
    # Sort the (bigram, score) pairs by score, highest first
    bigrams_sorted = sorted(PPMI_bigramScores.items(),
                            key=lambda item: item[1], reverse=True)
    # Keep exactly the n best pairs, even if several bigrams share a score
    sorted_dict = dict(bigrams_sorted[:n])
    for k, v in sorted_dict.items():
        print(k[0], k[1], v, sep = '\t')


Now, I call the **topN()** function and I pass to it the **PPMI_bigrams_score** dictionary:

In [None]:
topN(PPMI_bigrams_score)

# Step 5: Comparing PPMI and Frequency Counts

Comparing differences between PPMI scores and frequency counts:

**PPMI scores**:
- PPMI scores show how much more often the two words of a bigram occur together in the corpus than would be expected if they occurred independently of each other.
- The calculation of PPMI scores involves several computations (three probabilities, a ratio and a logarithm).
- The bigrams with the highest PPMI scores are **named entities**.
- **Named entities** are useful information.

**Frequency Counts**:
- Frequency counts show how many times a bigram occurs in the corpus.
- The calculation of frequency counts involves only counting the occurrences of each bigram.
- The bigrams with the highest frequency counts are mainly **punctuation** and **function words** (e.g., conjunctions, pronouns, articles, prepositions and auxiliary verbs). **Function words** do not provide any useful information about the context of a text.
- However, the bigram *'do'* *'not'* contains the **content word** *'not'*, which expresses **negation**. **Content words** such as negation words provide important information about the context of a text.

**<u>Conclusion</u>**:
The bigrams with the highest PPMI scores are **named entities**, while the bigrams with the highest frequency counts are mainly **punctuation** and **function words**. **Named entities** are considered useful information, while **punctuation** and **function words** are not, because they do not provide important information about the context of a text. The frequency counts of **punctuation** and **function words** are the highest because they occur more frequently in a corpus than content words.

In [None]:
topN(bigramDict)