# 3. The same in Python

*Based on notebook by Lucas Champolion with some alterations from Mia Jacobsen*

Now, let's do the same calculations again in Python. This is just a warm-up for the next exercise, where we will do the same calculations using millions of words!

We start by counting the bigrams in our dataset. Don't worry if you don't fully understand every single part of the code we are using in this assignment (it uses more than just the basic Python we've introduced you to). 

The point is that this code does the same as what you have just done yourself by hand: it counts every bigram and stores them in a way that makes it convenient for the computer to look up conditional probabilities.  

The code in this question and the next is modified from https://nlpforhackers.io/language-models/.

First, we import the Natural Language Toolkit (NLTK), which describes itself as a "leading platform for building Python programs to work with human language data". While we're at it, we'll also import a few other packages we'll need.


In [4]:
%pip install nltk
import nltk # (1)
from nltk import bigrams, trigrams #(2)
import random # (3)
import re

Note: you may need to restart the kernel to use updated packages.


Next, let's import a special Counter device that will make our code easier to write. A Counter is like a Python dictionary except that the value of each entry is an integer, initially zero, that can be incremented to count things. We will use it to count occurrences of bigrams and (later) trigrams.

In [5]:
from collections import Counter
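
As a quick illustration of how a Counter behaves, here is a minimal sketch on a made-up word list (the names `words` and `word_counts` and the toy data are only for illustration, not part of the assignment). It shows that a missing entry reads as zero, so we can increment a count without creating the entry first.

```python
from collections import Counter

# Toy data for illustration only
words = ["I", "am", "Sam", "Sam", "I", "am"]

word_counts = Counter()
for word in words:
    word_counts[word] += 1  # missing entries start at zero

print(word_counts["Sam"])  # 2
print(word_counts["ham"])  # 0: reading a missing entry does not raise an error
```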

#### Code Explanation
_The following explanation refers to the above code blocks, referencing parts in parentheses, e.g. `(1)`._

(1) When a word becomes highlighted in Jupyter notebooks, we know it has some special meaning in Python. In this line, the `import` keyword comes into play. With plain Python we can write our own functions, do basic math, etc. However, not every program needs more specialized functionality (like working with human language data), so this is left to external libraries like NLTK! Using the `import` statement, we're saying, "Hey Python, we need a hand here. Can you get the nltk library for us?"

(2) Here we see another special keyword, `from`. If we tried to say `import bigrams`, Python might not know what we're talking about. However, when we say `from`, we're saying to look specifically at some library for things WITHIN that library. For example, "Hey Python! In the nltk library you just imported, please find the `bigrams` and `trigrams` information within it."

(3) Similarly to (1), here we import the `random` module, which lets us do things like generating random numbers. The `random` module can be thought of as an external file called `random.py`. When we import it and want to access any code within it, we have to use `random.thing`. For example, `random` has a function to generate random integers. Instead of saying `randint(0, 10)`, we need to remind Python that we want the function from the `random` module. This is done with the `.` operator, such as writing `random.randint(0, 10)`. In the same way, without (2) we'd have to write `nltk.bigrams(sentence)` everywhere we wanted to use the `bigrams` function, but thanks to (2) we can simply write `bigrams(sentence)`!

As an added note, Python also has a special keyword `as` that can be used for imports. If I had a super long library name, like `florgleborglekorgle`, and I didn't want to write that everywhere in my code, I could do `import florgleborglekorgle as fbk`. That way, I can write `fbk.function()` wherever I need it.
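
The three import styles described above can be put side by side in a minimal sketch. It uses only modules that ship with Python; the alias `col` and the variable names are chosen here for illustration.

```python
import random                 # access names as random.<name>
from random import randint    # access randint directly, without the prefix
import collections as col     # access names through the shorter alias col

random.seed(0)  # fixed seed so repeated runs give the same numbers

a = random.randint(0, 10)              # module name, then the "." operator
b = randint(0, 10)                     # the same function, imported by name
letter_counts = col.Counter("banana")  # Counter reached through the alias

print(a, b, letter_counts["a"])
```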

We'll represent our dataset as a list of lists of strings. Each string is a word and each list is a sentence.

In [6]:
seuss_dataset = [["I", "am", "Sam"], \
                 ["Sam", "I", "am"], \
                 ["I", "do", "not", "like", "green", "eggs", "and", "ham"]]

To create our language model, we'll first count all the bigrams that occur in this dataset, and then convert our counts to probabilities. The Counter device will help us do this. We start by creating an empty Counter.

The code in the next cell goes through the three sentences. For each sentence, it uses the `bigrams` function from NLTK to access the bigrams in the sentence, and it uses the Counter device to count them. Note that "i += 1" is shorthand for "i = i + 1". Similarly for "-=", "/=", etc. We use w1 for the first word in a bigram and w2 for the second. The last line just displays the counter so you can inspect it; it doesn't change the functionality of the code.

In [7]:
# this is similar to last week when we made an empty dictionary to store word counts!
seuss_model = Counter()

for sentence in seuss_dataset: # (1)
    for w1, w2 in bigrams(sentence, pad_right=True, pad_left=True): # (2)
        if not w1 in seuss_model: # (3)
            seuss_model[w1] = Counter()
        seuss_model[w1][w2] += 1
        
seuss_model

Counter({None: Counter({'I': 2, 'Sam': 1}),
         'I': Counter({'am': 2, 'do': 1}),
         'am': Counter({'Sam': 1, None: 1}),
         'Sam': Counter({None: 1, 'I': 1}),
         'do': Counter({'not': 1}),
         'not': Counter({'like': 1}),
         'like': Counter({'green': 1}),
         'green': Counter({'eggs': 1}),
         'eggs': Counter({'and': 1}),
         'and': Counter({'ham': 1}),
         'ham': Counter({None: 1})})

#### Code Explanation
_The following explanation refers to the above code block, referencing parts in parentheses, e.g. `(1)`._

(1) Here we see the presence of a nested for loop, so a for loop within a for loop. This first for loop here simply says, "for every sentence within `seuss_dataset`, perform all the following code".

(2) This is where the trickiness of nested for loops comes in. This for loop is saying, "for word1 and word2 in the bigrams of `sentence`, perform the following code". Let's break this down.
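- `bigrams(sentence)` is saying to find all the bigrams in the list of words `sentence`. This variable `sentence` comes from the outermost, or the first, for loop. The arguments `pad_left=True` and `pad_right=True` add a `None` before the first word and after the last word, which marks where the sentence starts and ends.
- The function `bigrams` produces a sequence of pairs, one pair per bigram (versus, say, a `sum` function that returns a single value). Because each item is a pair, we write `w1, w2`. On every pass through the loop, Python stores the first word of the current bigram in `w1` and the second in `w2`. Neat!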

(3) Lastly, we're now checking if `w1` already exists in our variable `seuss_model`. If it doesn't, we go ahead and add it so that it exists. If it already exists, we don't want to add it twice! Then we increase the count of the two words occurring next to each other with that second-to-last line.

An important distinction with nested for loops is that the innermost loop, `for w1, w2...` in this case, will always run through ALL iterations first. If we have 1 sentence, then the inner-most for loop runs once for every bigram. If we have two sentences, it runs once for every bigram in the first sentence, then once for every bigram in the second sentence... and so on and so forth. For `n` sentences, if each sentence has `x` bigrams, then we run the loop `n*x` times. Because the `seuss_model` is declared outside both for loops, it stores all of this data and never resets within the loops.
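
To make the inner loop concrete, the sketch below imitates what `bigrams(sentence, pad_left=True, pad_right=True)` hands to the loop for one sentence. `padded_bigrams` is a hypothetical stand-in written in plain Python so that it runs without NLTK; it is not NLTK's implementation.

```python
def padded_bigrams(sentence):
    """Hypothetical stand-in for NLTK's padded bigrams: one pair per pair of neighbours."""
    padded = [None] + list(sentence) + [None]  # None marks the sentence start and end
    return list(zip(padded, padded[1:]))

for w1, w2 in padded_bigrams(["I", "am", "Sam"]):
    print(w1, "->", w2)  # each pair is unpacked into w1 and w2
```

A three-word sentence gives four pairs, so the inner loop runs four times for that sentence.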

Now, let's transform the counts to probabilities. We consider each word in turn and imagine it to be w1. Holding w1 fixed, we go through all the possible values for w2 and sum up their counts. Then we divide each count by the sum. This way, the probabilities of all the w2 for a given w1 add up to 1. To do this, we use a for loop nested in another for loop.

In [8]:
for w1 in seuss_model:
    total_count = float(sum(seuss_model[w1].values()))
    for w2 in seuss_model[w1]:
        seuss_model[w1][w2] /= total_count

seuss_model

Counter({None: Counter({'I': 0.6666666666666666, 'Sam': 0.3333333333333333}),
         'I': Counter({'am': 0.6666666666666666, 'do': 0.3333333333333333}),
         'am': Counter({'Sam': 0.5, None: 0.5}),
         'Sam': Counter({None: 0.5, 'I': 0.5}),
         'do': Counter({'not': 1.0}),
         'not': Counter({'like': 1.0}),
         'like': Counter({'green': 1.0}),
         'green': Counter({'eggs': 1.0}),
         'eggs': Counter({'and': 1.0}),
         'and': Counter({'ham': 1.0}),
         'ham': Counter({None: 1.0})})
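
To see the normalization on a single row, here is a minimal sketch that repeats the step for the words following "I". The variable names `row` and `probabilities` are chosen for illustration; the counts are the ones printed for `'I'` earlier.

```python
from collections import Counter

row = Counter({"am": 2, "do": 1})  # counts of the words seen after "I"

total_count = sum(row.values())    # 3 bigrams start with "I"
probabilities = {w2: count / total_count for w2, count in row.items()}

print(probabilities)
print(sum(probabilities.values()))  # the row now adds up to 1 (up to rounding)
```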

We can now use this model to estimate the probability that a word will occur next, given the previous word. As we said, this is called the "maximum likelihood estimate". 

Let’s make simple predictions with this language model. As before, we will start with one randomly selected word – “I”. We want our model to tell us what the next word might be. Remember that there are three bigrams whose first word is "I":

- `I am`
- `I am`
- `I do`

Now, we will estimate the probabilities of the next word given that a randomly selected word is "I". When we did this ourselves, we noticed that the next word is "am" in two of these three bigrams and "do" in the remaining one, so we estimated:

$P(am | I)=2/3$ 

$P(do | I)=1/3$ 

Here, $w_1$ is "I" and $w_2$ ranges over "am" and "do".

Run the following command to ask the computer to do the same calculation:

In [9]:
words_after_I = dict(seuss_model["I"])

print(words_after_I)

{'am': 0.6666666666666666, 'do': 0.3333333333333333}


The computer could also have found these bigrams just by going through the text and looking for all the instances of "I". But arranging the dataset in a table of bigrams makes this lookup simpler and more efficient. Humans work the same way: it is often useful to arrange data in a format that makes looking up things simpler and more efficient.

**Question 3.1** 
Which command would ask the computer to look up the probabilities of the next word given that a randomly selected word is "am"? (Hint: This is an easy question and its answer will look similar to the previous code box.)


In [10]:
solution_q3_1 = dict(seuss_model["am"])

print(solution_q3_1)

{'Sam': 0.5, None: 0.5}


You will notice that the end of a sentence is represented by the special value `None`. This comes from the padding we asked for with `pad_right=True` when counting the bigrams.

**Question 3.2** 
Which command would ask the computer to look up the probabilities of the next word given that a randomly selected word is "green"? (Hint: This will also look similar to the previous code boxes.)


In [11]:
solution_q3_2 = dict(seuss_model["green"])

print(solution_q3_2)

{'eggs': 1.0}


# 4. Trigrams using the Reuters Corpus
(30 points -- autograded)

In this problem, we will build a basic language model using trigrams of the Reuters corpus. (A "corpus" is a large and structured collection of texts. The plural of "corpus" is "corpora"). The Reuters corpus is a collection of 10,788 news documents totaling 1.3 million words. We can build a language model in a few lines of code using the NLTK package. Essentially, this code does the same as before, except with more words and with trigrams instead of bigrams.

In probability notation, we write $P(w_n|w_{n-2},w_{n-1})$ for the probability that the next word will be $w_n$, given that the last two words were $w_{n-2}$ and $w_{n-1}$.
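
Written out as a formula, the estimate used in this problem divides the count of a trigram by the total count of all trigrams that share its first two words. Here $c(\cdot)$ denotes a count in the corpus and $w$ ranges over all words seen after $w_{n-2}, w_{n-1}$; this notation is introduced here for illustration and is an assumption about how to write down what the code computes.

$$P(w_n \mid w_{n-2}, w_{n-1}) = \frac{c(w_{n-2}, w_{n-1}, w_n)}{\sum_{w} c(w_{n-2}, w_{n-1}, w)}$$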

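The following box downloads corpus data so that your notebook can find it. The run shown here downloads the Gutenberg corpus; other corpora, such as Reuters, are downloaded the same way under their own name.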

In [65]:
# code in this question is based on https://nlpforhackers.io/language-models/

# first we need to download any corpora we might be interested in - so if you want gutenberg later, remember to download it like this

nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /home/ucloud/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!


True

The following box specifies which corpus to use. We will start with the Reuters corpus.

A later question will ask you to change the name of the corpus in this box. At this point, you don't need to change it. Just run the box.

In [73]:
# Available corpora include:
# - reuters
# - gutenberg
# - webtext
# - brown
# - inaugural
# - state_union

# TO CHANGE THE CORPUS NAME, YOU WOULD EDIT THE FOLLOWING LINE
from nltk.corpus import gutenberg as the_corpus 

# e.g. to use reuters, you would change the previous line to: 
# from nltk.corpus import reuters as the_corpus

#### Code Explanation
_The following explanation refers to the above code block._

Revisit the code explanation for the import statements at the start of this assignment, and keep the explanation of the `as` keyword in mind as you look at the code below. Because the corpus is imported under the name `the_corpus`, changing the corpus only requires editing the import line. We would have to change every occurrence if we used something like `reuters.words()` instead of `the_corpus.words()`!

By default, run this box without changes to load the Reuters corpus. Later on we will ask you to use other corpora. To download these corpora, you will then change the code as indicated and run it again.

Now let's print the first few words of the corpus just to make sure everything is working as expected. 

In [74]:
print("Beginning of corpus:", the_corpus.words())

Beginning of corpus: ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']', ...]



If you have loaded the Reuters corpus, you should see something like this: `Beginning of corpus: ['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', ...]`

The following code is very similar to the one in the previous question, and does essentially the same thing, but with trigrams instead of bigrams: construct a big table where each row is the previous two words, and each column is the next word. 

Let's create an empty language model. 

Next, we go through the corpus and count the trigrams we see. Depending on how large the corpus is, this might need a lot of computer memory (about 700 MB for the Reuters corpus). You can see in the top right corner how much memory you are using. Normally, this shouldn't be a problem, but if you run into any issues, let us know via the Padlet.

In [75]:
corpus_model = Counter()

for sentence in the_corpus.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        # this means we now use two words at a time to predict the next
        if not (w1,w2) in corpus_model:
            corpus_model[w1,w2] = Counter()
        corpus_model[w1,w2][w3] += 1
        


#### Code Explanation
_The following explanation refers to the above code block._

Once again we have a nested for loop performing an operation on every sentence. The structure is the same as the use of bigrams in our `seuss_model` previously in the assignment, but now we're using trigrams!
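
The only structural change from the bigram model is that each key is now a pair of words. The sketch below shows this on one toy sentence. `padded_trigrams` is a hypothetical stand-in written in plain Python so that it runs without NLTK (it is not NLTK's implementation), and `toy_model` is a plain dictionary used only for illustration.

```python
from collections import Counter

def padded_trigrams(sentence):
    """Hypothetical stand-in for NLTK's padded trigrams, written without NLTK."""
    padded = [None, None] + list(sentence) + [None, None]
    return list(zip(padded, padded[1:], padded[2:]))

toy_model = {}
for w1, w2, w3 in padded_trigrams(["I", "am", "Sam"]):
    if (w1, w2) not in toy_model:
        toy_model[w1, w2] = Counter()  # the key is the pair of previous words
    toy_model[w1, w2][w3] += 1

print(toy_model["I", "am"])  # Counter({'Sam': 1})
```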

Now, let's transform the counts to probabilities. As before, we use a for loop nested in another for loop. The only difference is we have trigrams rather than bigrams, so we have w1, w2, and w3 instead of just w1 and w2.

In [76]:
for (w1,w2) in corpus_model:
    total_count = float(sum(corpus_model[w1,w2].values()))
    for w3 in corpus_model[w1,w2]:
        corpus_model[w1,w2][w3] /= total_count
        


Let's make simple predictions with this language model. We will start with two simple words, "today" and "the", and ask the model what it thinks the next word will be. See what happens when you replace one or both of these words, e.g. replacing "the" with "a" or replacing "today" with "on" or "yesterday". 

In [77]:
corpus_model['yesterday','the']

Counter({'blood': 1.0})

In [78]:
word1 = "today"
word2 = "the"

# The following code looks up the probabilities of all the words that follow these two words, and prints them 
# in a human-readable format.
if corpus_model is None: # (1)
    pass
elif (word1,word2) not in corpus_model: # (2)
    print("There are no occurrences of '"+word1+" "+word2+"' in the corpus!")
#    random_text = random.sample(list(corpus_model), 1)[0]
#    print("There are no occurrences of '"+word1+" "+word2+"' in the corpus! Using random words '"+random_text[0]+" "+random_text[1]+"' instead.")
#    text[0]=random_text[0]
#    text[1]=random_text[1]
else: # (3)
    values = corpus_model[word1, word2]
    sorted_result = sorted(values.items(), key=lambda x: (-x[1], x[0].lower())) # (4)
    for entry in sorted_result: # (5)
        word=entry[0]
        prob=entry[1]
        prob_rounded=round(entry[1],3)
        if prob == prob_rounded:
            print("P('"+entry[0]+"'|'"+word1+"','"+word2+"') = "+str(entry[1]))
        else:
            print("P('"+entry[0]+"'|'"+word1+"','"+word2+"') = "+str(entry[1])+" (rounded: "+str(round(entry[1],3))+")")

P('conquering'|'today','the') = 0.5
P('whole'|'today','the') = 0.5


#### Code Explanation
_The following explanation refers to the above code block, referencing parts in parentheses, e.g. `(1)`._

(1) This is a simple `if` statement that says "if `corpus_model` is `None` (that is, it holds no model), do nothing". Since Python does not allow an empty `if` block, we use the keyword `pass` as a placeholder.

(2) This `elif` statement is checked only if the condition of the above `if` statement is false. It checks whether `word1` and `word2` exist as a pair inside the previously generated `corpus_model` and prints a message if they don't. Note that `word1` and `word2` are values declared at the top of the code cell. Change them to non-words and see what prints!

(3) If neither the above `if` nor the `elif` block executes, then this `else` block will execute. For the two given words, we store the matching entry of `corpus_model` in the variable called `values`.

(4) Here, we are storing something inside the variable `sorted_result`. All in all, we are sorting the list `values.items()` (feel free to write `print(values.items())` above this and see what it shows) according to the formula after `key=`. The key `(-x[1], x[0].lower())` puts the entries in descending order of probability and breaks ties alphabetically, ignoring case. Don't stress too much about this though; if you ever need something like this again, search online or copy/paste this line.

(5) And that all brings us to our for loop! For every item in our `sorted_result` list, which we call `entry` here, `entry[0]` is the word that follows and `entry[1]` is the probability of that word. We then print this result, showing the word, our `word1` and `word2` from above, as well as the probability. See if you can see where each of these variables prints out!
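
To see what the sort key in (4) does, here is a minimal sketch on a toy dictionary (the words and probabilities are made up for illustration). The first element of the key, `-x[1]`, sorts by descending probability; the second, `x[0].lower()`, breaks ties alphabetically without regard to case.

```python
# Toy probabilities for illustration only
values = {"the": 0.5, "Bank": 0.25, "apple": 0.25}

sorted_result = sorted(values.items(), key=lambda x: (-x[1], x[0].lower()))

print(sorted_result)  # [('the', 0.5), ('apple', 0.25), ('Bank', 0.25)]
```

Without `.lower()`, "Bank" would come before "apple", because uppercase letters sort before lowercase ones.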

In [79]:
# For ease of use, I've made it a function
# notice that I could just copy/paste almost all of the code, because my function inputs are the same as how we defined the words in the previous cell
# meaning that functions can usually be built from code you already have with minor alterations!

def trigram_probs(word1, word2):
    # for function purposes, print statements become saved as an output which I then return
    output = []
    if corpus_model == None:
        pass
    elif (word1,word2) not in corpus_model: 
         output = ("There are no occurrences of '"+word1+" "+word2+"' in the corpus!")
    else: 
        values = corpus_model[word1, word2]
        sorted_result = sorted(values.items(), key=lambda x: (-x[1], x[0].lower())) 
        for entry in sorted_result: # (5)
            word=entry[0]
            prob=entry[1]
            prob_rounded=round(entry[1],3)
            if prob == prob_rounded:
                output.append("P('"+entry[0]+"'|'"+word1+"','"+word2+"') = "+str(entry[1]))
            else:
                output.append("P('"+entry[0]+"'|'"+word1+"','"+word2+"') = "+str(entry[1])+" (rounded: "+str(round(entry[1],3))+")")
    
    return output

**Bonus Q.** Take the function from above, which returns a list of the previous print statements, and instead make it return the probability of a specific word given word1 and word2 (i.e., make a function that gives you the answer to the following 3 questions).

In [80]:
def simple_bonus(word1,word2,word3):
    if corpus_model == None:
        pass
    elif (word1,word2) not in corpus_model: 
         output = (f"There are no occurrences of {word1} and {word2} in the corpus!")
    else:
        values = corpus_model[word1, word2]
        sorted_result = sorted(values.items(), key=lambda x: (-x[1], x[0].lower())) 
        
        for entry in sorted_result: # (5)
            word=entry[0]
            if word == word3:
                prob=entry[1]
                output=round(prob,3)
    return output

def super_duper_fun_bonus_exercise(word1,word2,word3):
    if corpus_model == None:
        pass
    elif (word1,word2) not in corpus_model: 
         output = (f"There are no occurrences of {word1} and {word2} in the corpus!")
            
    else:
        results = dict(corpus_model[word1,word2])
        if not word3 in results:
            output = f"{word3} never follows {word1} {word2}"
        else:
            output = round(results[word3],3)
    return output

super_duper_fun_bonus_exercise('today', 'the', 'ice')

'ice never follows today the'

In [82]:
def bonus_bonus_func(word1, word2):
    if corpus_model == None:
        pass
    elif (word1,word2) not in corpus_model: 
         output = (f"There are no occurrences of {word1} and {word2} in the corpus!")
    else:
        values = corpus_model[word1, word2]
        sorted_result = sorted(values.items(), key=lambda x: (-x[1], x[0].lower())) 
        output = sorted_result[0][0]
    return output

bonus_bonus_func('today','we')

'There are no occurrences of today and we in the corpus!'

In [83]:
simple_bonus('today', 'the', 'Bank')

UnboundLocalError: local variable 'output' referenced before assignment
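
The error above appears because `simple_bonus` only assigns `output` inside branches that may never run: when the pair of words is in the model but `word3` never follows it, nothing sets `output` before `return output`. The sketch below reproduces the pattern on a toy dictionary. `lookup_unsafe` and `lookup_safe` are hypothetical names, and returning `None` for a missing word is one possible choice, not the assignment's required behaviour.

```python
def lookup_unsafe(table, key):
    if key in table:
        output = table[key]
    return output  # raises UnboundLocalError when key is missing

def lookup_safe(table, key):
    output = None  # give output a value before any branch runs
    if key in table:
        output = table[key]
    return output

toy_table = {"conquering": 0.5, "whole": 0.5}

print(lookup_safe(toy_table, "Bank"))  # None
try:
    lookup_unsafe(toy_table, "Bank")
except UnboundLocalError as error:
    print("Error:", error)
```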

In [None]:

def bonus_solution(word1, word2, word3):
    
    return output

**Question 4.1.** What is the probability $P(Bank|today,the)$? You can give the solution rounded to three decimals. This is an easy question.


In [None]:
solution_q4_1 = trigram_probs('today', 'the')

solution_q4_1


**Question 4.2.** Now, what is the probability $P(company|today,the)$? You can give the solution rounded to three decimals. This is an easy question.


In [85]:
solution_q4_2 = super_duper_fun_bonus_exercise('today', 'the', 'company')

solution_q4_2

'company never follows today the'

**Question 4.3.** Now, what is the probability $P(reported|the,company)$? This will require you to change the words "today" and "the" in the cell above, running it again, and looking up the answer in the output. You can give the solution rounded to three decimals.


In [86]:
solution_q4_3 = trigram_probs('the', 'company')


In [138]:
# code courtesy of https://nlpforhackers.io/language-models/

# first let's reload the corpora and tokenizer in case user has reset the kernel after the previous question

#nltk.data.path=[str(Path.home())+'/groupshare/nltk_data']

# starting words
text = ["she", "loves"]
sentence_finished = False

if (text[0],text[1]) not in corpus_model: #1
    print("There are no occurrences of '"+text[0]+" "+text[1]+"' in the corpus!")
else: 
    while not sentence_finished: #2
        # select a random probability threshold  
        r = random.random()
        accumulator = .0

        for word in corpus_model[tuple(text[-2:])].keys(): #3
            accumulator += corpus_model[tuple(text[-2:])][word]
            # pick the first word whose running total reaches the threshold
            if accumulator >= r: #4
                text.append(word)
                break

        if text[-2:] == [None, None]: #5
            sentence_finished = True

    text = ' '.join([t for t in text if t]) #6
    text = re.sub(r'\s([?.!",\';:‘](?:\s|$))', r'\1', text) # removes whitespaces before punctuation
    print(text)

she loves the Admiral was putting a stone in a cottage for Susan and her mother, and, behold, one ram, and to pour upon the name of the family property ,) she was going off into the way: 18 And Rehoboam went to the city whither I have heard a little too hasty decision, and put it upon many waters, ( What were you, and a thick cloud, a face looking in that time.
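
The inner `for` loop in the cell above picks the next word by walking through the candidate words and adding up their probabilities until the running total reaches the random threshold `r`. A minimal sketch of that single step, on the toy bigram row for "I", follows. `sample_next` is a hypothetical helper name, and passing `r` in as an argument (instead of drawing it inside) is a choice made here so that the result is reproducible.

```python
def sample_next(distribution, r):
    """Return the first word whose cumulative probability reaches r."""
    accumulator = 0.0
    for word, probability in distribution.items():
        accumulator += probability
        if accumulator >= r:
            return word
    return None  # r exceeded the total, e.g. because of rounding

words_after_I = {"am": 2 / 3, "do": 1 / 3}

print(sample_next(words_after_I, 0.5))  # am
print(sample_next(words_after_I, 0.9))  # do
```

Words with a larger probability cover a larger share of the interval from 0 to 1, so a random `r` lands on them more often.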


In [139]:
from nltk.tokenize import word_tokenize
with open ('mycorpus.txt', 'r') as file_in:
    corpus_model = Counter()
    for line in file_in:
        sentence = word_tokenize(line)
        for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
            if not (w1,w2) in corpus_model:
                corpus_model[w1,w2] = Counter()
            corpus_model[w1,w2][w3] += 1

    for (w1,w2) in corpus_model:
        total_count = float(sum(corpus_model[w1,w2].values()))
        for w3 in corpus_model[w1,w2]:
            corpus_model[w1,w2][w3] /= total_count

In [158]:
text = ["I", "eat"]
sentence_finished = False

if (text[0],text[1]) not in corpus_model:
    print("There are no occurrences of '"+text[0]+" "+text[1]+"' in the corpus!")
else: 
    while not sentence_finished:
        # select a random probability threshold  
        r = random.random()
        accumulator = .0

        for word in corpus_model[tuple(text[-2:])].keys():
            accumulator += corpus_model[tuple(text[-2:])][word]
            # pick the first word whose running total reaches the threshold
            if accumulator >= r:
                text.append(word)
                break

        if text[-2:] == [None, None]:
            sentence_finished = True
    text = ' '.join([t for t in text if t])
    text = re.sub(r'\s([?.!",\';:‘](?:\s|$))', r'\1', text) # removes whitespaces before punctuation
    print(text)

I eat it?
