# Bigram Language Model

In this lab we will implement a bigram language model and use it to compute the probability of some sample sentences. 

### Outcomes
* Know how to count word frequencies in a corpus using Python libraries.
* Understand how to compute conditional probabilities.
* Be able to apply the chain rule to compute the probability of a sentence.

### Overview

The first part of the notebook loads the doc2dial dataset, the same one we used last week.
The next part tokenises the utterances, adds start and end tokens, and splits the data into training and test sets.
After that, there are some tasks for you to complete to implement and test the language model.

# 1. Preparing the Data 

In [1]:
from datasets import load_dataset

split = "train"
cache_dir = "./data_cache"

dataset = load_dataset(
    "doc2dial",
    name="dialogue_domain",  # this is the name of the dataset for the second subtask, dialog generation
    split=split,
    ignore_verifications=True,
    cache_dir=cache_dir,
)

Reusing dataset doc2dial (./data_cache/doc2dial/dialogue_domain/1.0.1/c15afdf53780a8d6ebea7aec05384432195b356f879aa53a4ee39b740d520642)


In [2]:
# Collect all the user utterances into a list.
# For this task, we don't care about the order of the utterances in the conversation;
# we will just be using the utterances as examples of the language we want to model.

utterances = []
for sample in dataset:
    turns = sample['turns']
    for turn in turns:
        if turn['role'] == 'user':
            utterances.append(turn['utterance'])
            
print(f'Number of utterances: {len(utterances)}')

Number of utterances: 22151


In [3]:
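# Tokenise the samples. You can replace NLTK with another tokenizer if you prefer.
# word_tokenize needs NLTK's Punkt tokenizer data; if it is missing, NLTK raises a
# LookupError that names the resource to fetch with nltk.download().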
import nltk

for i in range(len(utterances)):
    utterances[i] = nltk.word_tokenize(utterances[i])
    
print(utterances[2])

['Thanks', ',', 'and', 'in', 'case', 'I', 'forget', 'to', 'bring', 'all', 'of', 'the', 'documentation', 'needed', 'to', 'the', 'DMV', 'office', ',', 'what', 'can', 'I', 'do', '?']


In [None]:
# We need to put in some artificial start <s> and end <e> tokens. 
# These will be used to model which words are most likely to start or end a sentence. 

for i in range(len(utterances)):
    utterances[i] = ['<s>'] + utterances[i] + ['<e>']

In [None]:
# Split the data into training and test using scikit-learn.
from sklearn.model_selection import train_test_split

train_size = 0.8
test_size = 0.2

# Split the train data from the test data
train_data, test_data = train_test_split(utterances, train_size=train_size, test_size=test_size)


print(f'The training set has {len(train_data)} samples and the test set has {len(test_data)} samples.')

# 2. Counting Tokens

The bigram language model needs two sets of counts from the training data:
1. How many times each bigram occurs.
2. How many times each token occurs as the first token of a bigram (the conditioning token in the conditional probability).

Let's start by finding the vocabulary of unique token 'types': 

In [None]:
import numpy as np

vocab = np.unique(np.concatenate(train_data))
V = len(vocab)

print(vocab)
print(f'There are {V} types in our vocabulary.')

Now we create an object to store the bigram counts in:

In [None]:
# A matrix where row indexes will correspond to the first token in a bigram, 
# and column indexes to the second token. The indexes must map to the index
# of the token in the vocabulary. The values in the matrix will be the counts.
counts = np.zeros((V, V))

In [None]:
# Here is an example of how to find the index of a given word:

word = '<s>'  # example word
index = np.argwhere(vocab == word)[0][0]
print(index)
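
Calling `np.argwhere` scans the whole vocabulary on every lookup, which can get slow when you do it once per token in the training set. One possible alternative, sketched here on a made-up toy vocabulary, is to build a dictionary once and reuse it. The names `toy_vocab` and `token_to_index` are illustrative only and are not part of the lab code.

```python
import numpy as np

# Hypothetical toy vocabulary, standing in for the real `vocab` array.
toy_vocab = np.unique(['<s>', 'the', 'cat', 'sat', '<e>'])

# Map each token to its position in the (sorted) vocabulary array.
token_to_index = {str(token): i for i, token in enumerate(toy_vocab)}

# The dictionary lookup agrees with the argwhere lookup shown above.
print(token_to_index['<s>'], np.argwhere(toy_vocab == '<s>')[0][0])
```

Either approach gives the same index; the dictionary just avoids repeating the search.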

TODO 1: count the bigrams that occur in the training set.
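
If you are unsure where to start, here is a minimal sketch of bigram counting on a two-sentence toy corpus. It is only one way to do it, and every name in it (`toy_corpus`, `toy_vocab`, `toy_counts`, `token_to_index`) is made up for illustration; you will need to adapt the idea to `train_data`, `vocab` and `counts`.

```python
import numpy as np

# Hypothetical toy corpus, already tokenised with start and end tokens.
toy_corpus = [
    ['<s>', 'the', 'cat', 'sat', '<e>'],
    ['<s>', 'the', 'dog', 'sat', '<e>'],
]

toy_vocab = np.unique(np.concatenate(toy_corpus))
token_to_index = {str(token): i for i, token in enumerate(toy_vocab)}

# Rows index the first token of a bigram, columns index the second token.
toy_counts = np.zeros((len(toy_vocab), len(toy_vocab)))

for sentence in toy_corpus:
    # Pair each token with the token that follows it.
    for first, second in zip(sentence[:-1], sentence[1:]):
        toy_counts[token_to_index[first], token_to_index[second]] += 1

print(toy_counts[token_to_index['<s>'], token_to_index['the']])
```

Each sentence of length n contributes n - 1 bigrams, so the matrix entries here sum to 8.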

TODO 2: use numpy's sum() function to compute the first token counts.

TODO 3: compute a matrix (numpy array) of conditional probabilities using the counts.
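
As a reminder, the maximum-likelihood estimate of a bigram probability is P(second | first) = count(first, second) / count(first). The sketch below shows one way to compute this for a whole matrix at once, using a small hand-written count matrix (`toy_counts` is made up, not derived from the dataset). It assumes that rows with no counts, such as the row for `<e>`, should simply be left as zeros.

```python
import numpy as np

# Hypothetical 3x3 bigram counts: rows are first tokens, columns are second tokens.
toy_counts = np.array([
    [0., 2., 1.],
    [3., 0., 1.],
    [0., 0., 0.],  # a token that never appears as a first token
])

# Summing across each row gives the count of each first token.
# keepdims=True keeps the shape (3, 1) so the division broadcasts over columns.
first_counts = toy_counts.sum(axis=1, keepdims=True)

# Divide each row by its total, skipping rows whose total is zero.
toy_probs = np.divide(
    toy_counts,
    first_counts,
    out=np.zeros_like(toy_counts),
    where=first_counts > 0,
)

print(toy_probs)
```

Every row that has at least one count now sums to 1, which is a useful sanity check for your own matrix.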

TODO 4: write a function that computes the probability of a given tokenised sentence, such as the example below.

In [None]:
# example tokenised sentence
sen = ['<s>', 'If', 'you', 'give', 'me', 'the', 'help', ',', 'what', 'is', 'the', 'payment', 'system', '?', '<e>']
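
Under the bigram assumption, the chain rule reduces the probability of a sentence to a product of conditional probabilities, one per adjacent pair of tokens. Multiplying many small numbers can underflow, so a common choice is to sum log probabilities instead. Here is a minimal sketch on a made-up four-token model; `toy_vocab`, `toy_probs` and `sentence_log_prob` are illustrative names, and the sketch assumes every token in the sentence is in the vocabulary (an unknown token would raise a `KeyError`).

```python
import numpy as np

# Hypothetical toy model: rows are first tokens, columns are second tokens.
toy_vocab = np.array(['<e>', '<s>', 'a', 'b'])
token_to_index = {str(token): i for i, token in enumerate(toy_vocab)}
toy_probs = np.array([
    [0.0, 0.0, 0.0, 0.0],  # nothing follows <e>
    [0.0, 0.0, 0.5, 0.5],  # after <s>
    [0.5, 0.0, 0.0, 0.5],  # after a
    [0.5, 0.0, 0.5, 0.0],  # after b
])


def sentence_log_prob(sentence, probs, index):
    """Sum log P(second | first) over all adjacent token pairs."""
    log_prob = 0.0
    for first, second in zip(sentence[:-1], sentence[1:]):
        p = probs[index[first], index[second]]
        if p == 0:
            # An unseen bigram makes the whole sentence impossible.
            return -np.inf
        log_prob += np.log(p)
    return log_prob


toy_sentence = ['<s>', 'a', 'b', '<e>']
toy_log_prob = sentence_log_prob(toy_sentence, toy_probs, token_to_index)
print(np.exp(toy_log_prob))
```

The toy sentence has three bigrams, each with probability 0.5, so its probability is 0.5 × 0.5 × 0.5 = 0.125. On the real data you will probably find that many sentences contain an unseen bigram and therefore get probability zero.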

TODO 5: compute the perplexity over the whole test set.
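
Perplexity is the exponential of the negative average log probability per predicted token. The sketch below shows that calculation on a made-up toy model. Conventions differ on exactly which tokens to count; this version assumes one prediction per bigram, so `<e>` is counted but `<s>` is not. All names (`toy_probs`, `toy_test`, `perplexity`) are illustrative.

```python
import numpy as np

# Hypothetical toy model: rows are first tokens, columns are second tokens.
toy_vocab = np.array(['<e>', '<s>', 'a', 'b'])
token_to_index = {str(token): i for i, token in enumerate(toy_vocab)}
toy_probs = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])


def perplexity(sentences, probs, index):
    """exp of the negative mean log probability over all bigrams."""
    total_log_prob = 0.0
    num_predictions = 0
    for sentence in sentences:
        for first, second in zip(sentence[:-1], sentence[1:]):
            total_log_prob += np.log(probs[index[first], index[second]])
            num_predictions += 1
    return np.exp(-total_log_prob / num_predictions)


toy_test = [['<s>', 'a', 'b', '<e>'], ['<s>', 'b', '<e>']]
toy_perplexity = perplexity(toy_test, toy_probs, token_to_index)
print(toy_perplexity)
```

Every bigram in the toy test set has probability 0.5, so the perplexity is 2: the model is as uncertain as a fair choice between two options. Note that a single unseen bigram has probability zero and makes the perplexity infinite, which is one motivation for smoothing.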

EXTENSION 1: use the language model to generate new sentences by sampling. 
You can follow the example below to sample using scipy's multinomial class.

In [None]:
from scipy.stats import multinomial

example_vocab = np.array(['a', 'b', 'c', 'd'])

sample = multinomial.rvs(1, [0.3, 0.2, 0.1, 0.4])
sample_bool = sample.astype(bool)  # convert the sample from integer to boolean
generated_word = example_vocab[sample_bool]  # use the one-hot boolean vector to look up the word

print(generated_word)
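
To turn single-word sampling into sentence generation, one option is to start at `<s>` and repeatedly sample the next token from the row of the probability matrix for the current token, stopping at `<e>`. The following is a sketch of that loop on a made-up toy model, not a reference solution. `generate_sentence`, `max_len` and `seed` are hypothetical names, and the sketch assumes each row it samples from sums to 1.

```python
import numpy as np
from scipy.stats import multinomial

# Hypothetical toy model: rows are first tokens, columns are second tokens.
toy_vocab = np.array(['<e>', '<s>', 'a', 'b'])
toy_probs = np.array([
    [0.0, 0.0, 0.0, 0.0],  # <e> ends the sentence, so this row is never sampled
    [0.0, 0.0, 0.5, 0.5],
    [0.5, 0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5, 0.0],
])


def generate_sentence(probs, vocab, max_len=20, seed=None):
    """Sample tokens one at a time until <e> or max_len tokens are generated."""
    rng = np.random.RandomState(seed)
    index = {str(token): i for i, token in enumerate(vocab)}
    current = index['<s>']
    sentence = ['<s>']
    for _ in range(max_len):
        # Draw a one-hot sample from the distribution over next tokens.
        one_hot = multinomial.rvs(1, probs[current], random_state=rng)
        current = int(np.argmax(one_hot))
        sentence.append(str(vocab[current]))
        if sentence[-1] == '<e>':
            break
    return sentence


print(generate_sentence(toy_probs, toy_vocab, seed=0))
```

The `max_len` cap is there because nothing guarantees that `<e>` will be sampled in a reasonable number of steps.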

MORE EXTENSIONS: 
* Add some smoothing to the counts and see how it affects the results.
* Use trigrams instead of bigrams. Does it improve perplexity?
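
For the smoothing extension, one simple option is add-k smoothing (add-one, or Laplace, smoothing when k = 1): add k to every bigram count and k × V to every first-token count, where V is the vocabulary size. Here is a minimal sketch on a made-up count matrix; `toy_counts`, `k` and `toy_smoothed` are illustrative names, and other smoothing methods are equally valid choices.

```python
import numpy as np

# Hypothetical 3x3 bigram counts: rows are first tokens, columns are second tokens.
toy_counts = np.array([
    [0., 2., 1.],
    [3., 0., 1.],
    [0., 0., 0.],
])

k = 1.0                         # smoothing strength (k = 1 is add-one smoothing)
toy_V = toy_counts.shape[0]     # vocabulary size

# Add k to every bigram count, and k * V to every row total so rows still sum to 1.
row_totals = toy_counts.sum(axis=1, keepdims=True)
toy_smoothed = (toy_counts + k) / (row_totals + k * toy_V)

print(toy_smoothed)
```

After smoothing, no bigram has probability zero, so sentence probabilities and perplexity stay finite. Rows with no counts at all become uniform distributions.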