In [1]:
import random
import nltk
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

# Example corpus
tokens = brown.words() # get the words from the brown corpus

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## 1. Unigram

A unigram model is the simplest probabilistic language model: it treats every word as a single unit and assumes that the probability of a word occurring in a text is independent of the words around it. The probability of a sequence is therefore just the product of the probabilities of its individual words.

We can estimate these probabilities from a corpus and use them to generate sentences, as the next exercises show.

In [2]:
# @title 🧑🏿‍💻 Your code here

probabilities = {} # dictionary to store the probabilities in the format word:probability

In [3]:
# @title 👀 Solution

from collections import Counter

# Calculating word frequencies
frequency = Counter(tokens)
total_words = sum(frequency.values())

# Converting frequencies to probabilities
probabilities = {word: freq / total_words for word, freq in frequency.items()}

In [4]:
# let's get the probability of a sentence
sentence = tokens[0:10] #"The Fulton County Grand Jury said Friday an investigation of"

sentence_prob = 1

for word in sentence:
    sentence_prob *= probabilities[word]

print(sentence)
print(sentence_prob)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
1.0853868324836725e-37
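The product above is already around 1e-37 for ten words, so for longer texts it would eventually underflow to 0.0 in floating point. A common workaround is to sum log-probabilities instead. The sketch below uses a small toy corpus (an assumption for illustration, not the Brown corpus) and a hypothetical helper `sentence_log_prob`.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration; the notebook uses the Brown corpus)
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()

counts = Counter(toy_tokens)
total = sum(counts.values())
toy_probabilities = {word: count / total for word, count in counts.items()}

def sentence_log_prob(sentence, probabilities):
    """Sum of log-probabilities; this equals the log of the product of probabilities."""
    return sum(math.log(probabilities[word]) for word in sentence)

toy_sentence = ["the", "dog", "sat", "on", "the", "mat"]
log_prob = sentence_log_prob(toy_sentence, toy_probabilities)

# Direct product, for comparison on this short sentence
direct_prob = math.prod(toy_probabilities[word] for word in toy_sentence)

print(log_prob)
print(math.exp(log_prob), direct_prob)
```

Like the cell above, this raises a `KeyError` for a word that never occurs in the corpus; handling unseen words would need some form of smoothing.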


In [5]:
# @title 🧑🏿‍💻 Your code here

def generate_sentence(length, probabilities):
    pass

In [6]:
# @title 👀 Solution

random.seed(42)

def generate_sentence(length, probabilities):
    sentence = []
    words = list(probabilities.keys())
    word_probabilities = list(probabilities.values())
    for _ in range(length):
        word = random.choices(words, weights=word_probabilities)[0]
        sentence.append(word)
    return ' '.join(sentence)
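Because unigram draws are independent, the loop can also be replaced by a single `random.choices` call with `k=length`. This avoids recomputing the cumulative weights over the whole vocabulary for every word. The sketch below is one possible variant; `generate_sentence_batched` and the toy distribution are hypothetical names introduced for illustration.

```python
import random

# Hypothetical toy distribution; in the notebook this would be `probabilities`
toy_probabilities = {"the": 0.5, "cat": 0.2, "sat": 0.2, "mat": 0.1}

def generate_sentence_batched(length, probabilities, rng=random):
    """Draw `length` independent words in a single random.choices call."""
    words = list(probabilities.keys())
    weights = list(probabilities.values())
    return ' '.join(rng.choices(words, weights=weights, k=length))

# A dedicated Random instance keeps the example reproducible
rng = random.Random(0)
sample = generate_sentence_batched(10, toy_probabilities, rng)
print(sample)
```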

In [7]:
print(generate_sentence(10, probabilities))

used of was , Thomas economic editors . with of


### Discussion

- What are the limitations of generating text using a unigram model?
- How might the sentences differ if a bigram or trigram model were used instead?
- What improvements might make text generation more realistic while still using only a unigram model?

## 2. Bigram LM

A bigram model predicts the next word in a sequence from the previous word alone. It is the n-gram model with n = 2. It rests on the Markov assumption that the probability of a word depends only on a finite history of preceding words; for a bigram model, that history is just the immediately preceding word.

In [8]:
# @title 🧑🏿‍💻 Your code here

bigram_probabilities = {} # nested dictionary in the format {word_i: {word_{i+1}: probability}}

In [9]:
# @title 👀 Solution

from nltk import bigrams
from collections import Counter, defaultdict

# Calculating bigram frequencies
bigram_freq = Counter(bigrams(tokens))
total_bigrams = sum(bigram_freq.values())

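# Building bigram probabilities.
# Note: dividing by total_bigrams gives the joint probability P(w1, w2),
# not the conditional P(w2 | w1). For sampling the next word this is
# equivalent, because random.choices normalizes the weights it is given.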
bigram_probabilities = defaultdict(dict)
for (w1, w2), freq in bigram_freq.items():
    bigram_probabilities[w1][w2] = freq / total_bigrams

In [10]:
bigram_probabilities

defaultdict(dict,
            {'The': {'Fulton': 8.611847663304314e-07,
              'jury': 7.750662896973882e-06,
              'September-October': 8.611847663304314e-07,
              'grand': 1.7223695326608628e-06,
              'City': 4.305923831652157e-06,
              'jurors': 8.611847663304314e-07,
              'couple': 1.7223695326608628e-06,
              'petition': 1.7223695326608628e-06,
              'Hartsfield': 8.611847663304314e-07,
              "mayor's": 8.611847663304314e-07,
              'largest': 1.7223695326608628e-06,
              'Republicans': 1.7223695326608628e-06,
              'Georgia': 2.5835542989912942e-06,
              'bond': 1.7223695326608628e-06,
              'department': 3.4447390653217255e-06,
              'Highway': 8.611847663304314e-07,
              'Constitution': 1.7223695326608628e-06,
              'resolution': 5.1671085979825885e-06,
              'new': 3.530857541954769e-05,
              'campaign': 8.61184766330431
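The values printed above are tiny because each count is divided by the total number of bigrams, so they are joint probabilities P(w1, w2). To read the table as "probability of the next word given the previous word", each count can instead be divided by the number of bigrams that start with `w1`. A minimal sketch on a toy corpus (an assumption for illustration), with the hypothetical names `pair_counts`, `first_counts` and `conditional`:

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption for illustration; the notebook uses the Brown corpus)
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Count adjacent pairs; zip over the shifted list yields the same pairs as nltk.bigrams
pair_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# Number of bigrams that start with each word
first_counts = Counter()
for (w1, _), count in pair_counts.items():
    first_counts[w1] += count

# Conditional probabilities P(w2 | w1): each row sums to 1
conditional = defaultdict(dict)
for (w1, w2), count in pair_counts.items():
    conditional[w1][w2] = count / first_counts[w1]

print(conditional["the"])  # {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
print(conditional["sat"])  # {'on': 1.0}
```

Sampling with either table gives the same distribution over next words; the conditional form is simply easier to interpret and to multiply into a sentence probability.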

In [11]:
# @title 🧑🏿‍💻 Your code here

def generate_bigram_sentence(start_word, length, bigram_probabilities):
    pass

In [12]:
# @title 👀 Solution

def generate_bigram_sentence(start_word, length, bigram_probabilities):
    if start_word not in bigram_probabilities:
        raise ValueError("Start word not in bigram probabilities")
    sentence = [start_word]
    current_word = start_word
    for _ in range(length - 1):
        # Use .get so that a lookup never adds an empty entry to the defaultdict
        followers = bigram_probabilities.get(current_word)
        if not followers:
            break  # dead end: this word was never observed with a successor
        next_word = random.choices(list(followers.keys()),
                                   weights=list(followers.values()))[0]
        sentence.append(next_word)
        current_word = next_word
    return ' '.join(sentence)

In [13]:
# Generate a 20-word sentence starting with a randomly chosen word
start_word = random.choice(list(bigram_probabilities.keys()))
print(generate_bigram_sentence(start_word, 20, bigram_probabilities))


billing . That Prokofieff's own complex of what the expressions in the Sherman Act , get a family affair .


### Discussion

- Compare the sentences generated by the unigram and bigram models. Which model produces more coherent sentences?
- What are the limitations of a bigram model in text generation?
- How do the results change if starting words are varied?

## 3. Word2Vec

We have already seen Word2Vec. Unlike the unigram and bigram models above, which rely on word frequencies and word sequences, Word2Vec learns dense vectors in which words that occur in similar contexts end up close together, so it captures semantic relationships between words. Can we use these relationships to generate sentences?

In [14]:
from gensim.models import Word2Vec

WINDOW_SIZE = 3

tokenized_sentences = brown.sents()
model = Word2Vec(tokenized_sentences, vector_size=100, window=WINDOW_SIZE, min_count=1, sg=1) # sg=1 -> skip-gram

Word2Vec itself is not a language model in the traditional sense, as it doesn't model the probability distribution of a sequence of words in a language. As we already discussed, it is primarily used to learn vector representations of words that capture semantic similarities and relationships between them. Still, we can use this model to predict context words, which may appear in the same window as target words.
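To make "semantic similarity" concrete: `model.wv.most_similar` ranks words by the cosine similarity between their vectors. The sketch below reproduces that ranking in plain Python on hand-picked 3-dimensional vectors. The vectors and the helper names `cosine` and `most_similar_toy` are assumptions for illustration, not values from the trained model.

```python
import math

# Hypothetical 3-dimensional vectors, chosen by hand for illustration only
toy_vectors = {
    "ball": [0.9, 0.1, 0.0],
    "bat": [0.8, 0.2, 0.1],
    "pitcher": [0.7, 0.3, 0.0],
    "desk": [0.0, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar_toy(word, vectors, topn=2):
    """Rank every other word by cosine similarity to `word`."""
    scores = [(other, cosine(vectors[word], vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbors = most_similar_toy("ball", toy_vectors)
print(neighbors)
```

With these vectors, "bat" and "pitcher" rank above "desk" because they point in nearly the same direction as "ball". Nothing in this score says which word should come next in a sentence, which is why Word2Vec is not a language model on its own.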

In [15]:
# Function to predict context words
def predict_context(target_word, window_size=2):
    if target_word not in model.wv:
        return "Word not in vocabulary!"

    # Generate all word-context pairs
    word_context_pairs = []
    for sentence in tokenized_sentences:
        for i, word in enumerate(sentence):
            if word == target_word:
                start_index = max(0, i - window_size)
                end_index = min(len(sentence), i + window_size + 1)
                context_words = sentence[start_index:i] + sentence[i + 1:end_index]
                word_context_pairs.extend(context_words)

    # Rank vocabulary words by similarity to the target word and keep only
    # those that were actually observed in its context windows
    if word_context_pairs:
        predicted_words = model.wv.most_similar(positive=[model.wv[target_word]], topn=len(set(word_context_pairs)))
        return [word for word, similarity in predicted_words if word in word_context_pairs]
    else:
        return "No context available."

# Testing the function
target_word = 'ball'
predicted_context = predict_context(target_word, window_size=3)
print(f"Context words predicted for '{target_word}': {predicted_context}")


Context words predicted for 'ball': ['ball', 'fight', 'rifle', 'house', 'desk', 'Mike', 'thrown', 'hit', 'road', 'Deegan', 'loose', 'belly', 'fly', 'dropped', 'safely', 'bat', 'caught', 'dirt', 'start', 'visitors', 'handed', 'squad', 'pitcher']


In [16]:
# @title 🧑🏿‍💻 Your code here

def generate_sentence(start_word, max_length=10):
    pass

In [17]:
# @title 👀 Solution

def generate_sentence(start_word, max_length=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    current_words = [start_word]
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Find most similar word
        similar_words = model.wv.most_similar(current_words, topn=10)
        next_word = similar_words[0][0]  # pick the most similar word
        sentence.append(next_word)

        # Update current words
        if len(current_words) == WINDOW_SIZE:
            current_words.pop(0)
        current_words.append(next_word)

    return ' '.join(sentence)

In [18]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence(random_start_word, 10)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'Rio': Rio Stamford McGeorge Peck Turner Stamford Palm Peck Turner Stamford


### Discussion

- Why might sentences generated by repeatedly picking the most similar word lack coherence over longer sequences? How could the approach be adjusted to produce more grammatical output? With this in mind, how could we modify the model so that it also predicts the position of the generated context word?
- What are some methods to prevent the generated sentences from looping or becoming repetitive? How might introducing randomness or diversity in word selection impact the quality of the generated sentences?
- How do different parameters of the Word2Vec model (like vector size, window size, and training algorithm) affect the outcome of the generated sentences? What happens when you alter these parameters?
- Have you noticed anything when generating sentences that start with an arbitrary word (e.g. "gensim")? Describe the issue and potential solutions.
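As one possible starting point for the question about looping, the sketch below samples the next word from the top few neighbours instead of always taking the first, and never reuses a word. It is a simplified illustration, not the notebook's method: it looks only at the current word rather than a window, and `toy_neighbors` is a hypothetical table standing in for the output of `model.wv.most_similar`.

```python
import random

# Hypothetical neighbour lists standing in for model.wv.most_similar output
toy_neighbors = {
    "ball": ["bat", "pitcher", "hit", "caught"],
    "bat": ["ball", "hit", "pitcher", "fly"],
    "pitcher": ["ball", "bat", "squad", "hit"],
    "hit": ["ball", "bat", "caught", "fly"],
    "caught": ["ball", "hit", "fly", "dropped"],
    "fly": ["ball", "caught", "hit", "dropped"],
    "squad": ["pitcher", "visitors", "ball", "bat"],
    "dropped": ["caught", "fly", "ball", "hit"],
    "visitors": ["squad", "pitcher", "ball", "bat"],
}

def generate_without_repeats(start_word, neighbors, max_length=6, top_k=3, rng=random):
    """Sample each next word from the top_k unused neighbours of the current word."""
    sentence = [start_word]
    used = {start_word}
    current = start_word
    for _ in range(max_length - 1):
        # Drop words already in the sentence, then keep the top_k that remain
        candidates = [w for w in neighbors.get(current, []) if w not in used][:top_k]
        if not candidates:
            break  # every neighbour has been used already
        current = rng.choice(candidates)
        sentence.append(current)
        used.add(current)
    return sentence

result = generate_without_repeats("ball", toy_neighbors, rng=random.Random(0))
print(' '.join(result))
```

Blocking repeats removes loops such as "Stamford ... Peck Turner Stamford" in the output above, but it does not make the result grammatical, since similarity still carries no word-order information.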