In [1]:
import random
import nltk
from nltk.corpus import brown

nltk.download("brown")
nltk.download("punkt")

# Example corpus
tokens = brown.words()  # get the words from the brown corpus

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


## 1. Unigram

A unigram model is a probabilistic language model that treats each word as a single unit, independent of the words that precede or follow it. Under this independence assumption, the probability of a sentence is simply the product of the probabilities of its individual words.

We can find such probabilities and use them to generate sentences, as you'll find out in the next exercises.

In [2]:
# @title 🧑🏿‍💻 Your code here

probabilities = (
    {}
)  # dictionary to store the probabilities in the format word:probability

In [3]:
# @title 👀 Solution

from collections import Counter

# Calculating word frequencies
frequency = Counter(tokens)
total_words = sum(frequency.values())

# Converting frequencies to probabilities
probabilities = {word: freq / total_words for word, freq in frequency.items()}

In [4]:
# let's get the probability of a sentence
sentence = tokens[
    0:10
]  # "The Fulton County Grand Jury said Friday an investigation of"

sentence_prob = 1

for word in sentence:
    sentence_prob *= probabilities[word]

print(sentence)
print(sentence_prob)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
1.0853868324836725e-37
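
Multiplying many small probabilities quickly approaches the limits of floating-point precision, as the tiny value above suggests. A common workaround is to sum log-probabilities instead. The snippet below is a minimal sketch on a toy corpus; `toy_tokens`, `toy_probabilities` and `sentence_log_prob` are illustrative names, not part of the exercise.

```python
import math
from collections import Counter

# Toy corpus standing in for the Brown tokens (an assumption for illustration)
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()

toy_counts = Counter(toy_tokens)
toy_total = sum(toy_counts.values())
toy_probabilities = {w: c / toy_total for w, c in toy_counts.items()}


def sentence_log_prob(sentence, probabilities):
    """Sum of log unigram probabilities; -inf if any word is unseen."""
    log_prob = 0.0
    for word in sentence:
        p = probabilities.get(word, 0.0)
        if p == 0.0:
            return float("-inf")
        log_prob += math.log(p)
    return log_prob


toy_sentence = ["the", "cat", "sat"]
log_p = sentence_log_prob(toy_sentence, toy_probabilities)
print(log_p, math.exp(log_p))
```

Exponentiating the result recovers the plain product, here (4/13) * (1/13) * (2/13), while the log form stays numerically stable for much longer sentences.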


In [5]:
# @title 🧑🏿‍💻 Your code here


def generate_sentence(length, probabilities):
    pass

In [6]:
# @title 👀 Solution

random.seed(42)


def generate_sentence(length, probabilities):
    sentence = []
    words = list(probabilities.keys())
    word_probabilities = list(probabilities.values())
    for _ in range(length):
        word = random.choices(words, weights=word_probabilities)[0]
        sentence.append(word)
    return " ".join(sentence)

In [7]:
print(generate_sentence(10, probabilities))

used of was , Thomas economic editors . with of


### Discussion

- What are the limitations of generating text using a unigram model?
- How might the sentences differ if a bigram or trigram model were used instead?
- What improvements might be considered for a more realistic text generation given the constraint of using a unigram model?

## 2. Bigram LM

A bigram model predicts the next word in a sequence based on the previous word. It is a simple form of n-gram model where n is equal to 2. The bigram model is rooted in the Markov assumption that the probability of a word depends only on a finite history of previous words; in the case of a bigram, that history is just the immediately preceding word.

In [8]:
# @title 🧑🏿‍💻 Your code here

bigram_probabilities = (
    {}
)  # dictionary to store the probabilities in the format word_i:word_{i+1}:probability

In [9]:
# @title 👀 Solution

from nltk import bigrams
from collections import Counter, defaultdict

# Calculating bigram frequencies
bigram_freq = Counter(bigrams(tokens))
total_bigrams = sum(bigram_freq.values())

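# Building bigram probabilities
# Note: dividing by total_bigrams gives the joint probability P(w1, w2),
# not the conditional P(w2 | w1). Sampling with random.choices is unaffected,
# because it only uses the relative weights within bigram_probabilities[w1].
bigram_probabilities = defaultdict(dict)
for (w1, w2), freq in bigram_freq.items():
    bigram_probabilities[w1][w2] = freq / total_bigrams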

In [50]:
for k1 in list(bigram_probabilities.keys())[:2]:
    for k2 in list(bigram_probabilities[k1].keys())[:10]:
        print(k1, k2, bigram_probabilities[k1][k2])

The Fulton 8.611847663304314e-07
The jury 7.750662896973882e-06
The September-October 8.611847663304314e-07
The grand 1.7223695326608628e-06
The City 4.305923831652157e-06
The jurors 8.611847663304314e-07
The couple 1.7223695326608628e-06
The petition 1.7223695326608628e-06
The Hartsfield 8.611847663304314e-07
The mayor's 8.611847663304314e-07
Fulton County 5.1671085979825885e-06
Fulton Superior 1.7223695326608628e-06
Fulton legislators 1.7223695326608628e-06
Fulton taxpayers 8.611847663304314e-07
Fulton ordinary's 8.611847663304314e-07
Fulton Tax 8.611847663304314e-07
Fulton Health 8.611847663304314e-07
Fulton to 8.611847663304314e-07
Fulton was 8.611847663304314e-07
Fulton , 8.611847663304314e-07
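
The values printed above are relative to the total number of bigrams in the corpus, so they do not sum to 1 for a given first word. If conditional probabilities P(w2 | w1) are wanted, one option is to divide each bigram count by the count of its first word. The following is a minimal sketch on a toy corpus; the names `toy_tokens`, `pair_counts`, `first_counts` and `conditional` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the Brown tokens (an assumption for illustration)
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()

# Count adjacent word pairs
pair_counts = Counter(zip(toy_tokens, toy_tokens[1:]))

# Count how often each word appears as the first element of a pair
first_counts = Counter()
for (w1, _), count in pair_counts.items():
    first_counts[w1] += count

# Conditional estimate: count(w1, w2) / count(w1 as a first element)
conditional = defaultdict(dict)
for (w1, w2), count in pair_counts.items():
    conditional[w1][w2] = count / first_counts[w1]

print(conditional["the"])
# {'cat': 0.25, 'mat': 0.25, 'dog': 0.25, 'rug': 0.25}
```

With this normalization the probabilities of all continuations of a given word sum to 1, which makes values for different first words directly comparable.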


In [11]:
# @title 🧑🏿‍💻 Your code here


def generate_bigram_sentence(start_word, length, bigram_probabilities):
    pass

In [12]:
# @title 👀 Solution


def generate_bigram_sentence(start_word, length, bigram_probabilities):
    if start_word not in bigram_probabilities:
        raise ValueError("Start word not in bigram probabilities")
    sentence = [start_word]
    current_word = start_word
    for _ in range(length - 1):
        next_word = random.choices(
            list(bigram_probabilities[current_word].keys()),
            weights=bigram_probabilities[current_word].values(),
        )[0]
        sentence.append(next_word)
        current_word = next_word
    return " ".join(sentence)

In [13]:
# Generate a 20-word sentence starting with a randomly chosen word
start_word = random.choice(list(bigram_probabilities.keys()))
print(generate_bigram_sentence(start_word, 20, bigram_probabilities))

billing . That Prokofieff's own complex of what the expressions in the Sherman Act , get a family affair .


### Discussion

- Compare the sentences generated by the unigram and bigram models. Which model produces more coherent sentences?
- What are the limitations of a bigram model in text generation?
- How do the results change if starting words are varied?

## 3. Word2Vec

We have already seen Word2Vec. Unlike the previously discussed unigram and bigram models, which focus on word frequencies or word sequences, Word2Vec captures the semantic relationships between words in a way that preserves context and similarity. Can we make use of these relationships in order to generate sentences?

In [14]:
from gensim.models import Word2Vec

WINDOW_SIZE = 3

tokenized_sentences = brown.sents()
model = Word2Vec(
    tokenized_sentences, vector_size=100, window=WINDOW_SIZE, min_count=1, sg=1
)  # sg=1 -> skip-gram

Word2Vec itself is not a language model in the traditional sense, as it doesn't model the probability distribution of a sequence of words in a language. As we already discussed, it is primarily used to learn vector representations of words that capture semantic similarities and relationships between them. Still, we can use this model to predict context words, which may appear in the same window as target words.

In [15]:
# @title 🧑🏿‍💻 Your code here


def generate_sentence(start_word, max_length=10):
    pass

In [16]:
# @title 👀 Solution


def generate_sentence(start_word, max_length=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    current_words = [start_word]
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Find most similar words
        similar_words = model.wv.most_similar(current_words, topn=10)
        # print(similar_words)
        next_word = similar_words[0][0]  # pick the most similar word
        sentence.append(next_word)

        # Update current words
        if len(current_words) == WINDOW_SIZE:
            current_words.pop(0)
        current_words.append(next_word)

    return " ".join(sentence)

In [17]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence(random_start_word, 10)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'Rio': Rio Stamford Peck Palm Turner Stamford Peck Palm Turner Stamford
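
The repeating pattern in the output above can be reproduced on a tiny example. To our understanding, gensim's `most_similar` ranks words by cosine similarity to the mean of the normalized query vectors and never returns the query words themselves, so a greedy loop can only revisit a word once it has dropped out of the window. The sketch below imitates that behaviour with hand-made 2-D vectors; `toy_vectors`, `toy_most_similar` and `toy_greedy` are hypothetical helpers, not gensim functions.

```python
import math

# Hand-made 2-D "embeddings" (an assumption for illustration)
toy_vectors = {"a": (1.0, 0.0), "b": (0.9, 0.1), "c": (0.0, 1.0)}


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)


def toy_most_similar(query_words, vectors, topn=1):
    # Average the unit-normalized query vectors, then rank every other word
    # by cosine similarity to that mean
    units = [unit(vectors[w]) for w in query_words]
    mean = unit(tuple(sum(c) / len(units) for c in zip(*units)))
    scored = [
        (word, sum(x * y for x, y in zip(unit(vec), mean)))
        for word, vec in vectors.items()
        if word not in query_words  # query words are never returned
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


def toy_greedy(start_word, vectors, length, window=1):
    context, sentence = [start_word], [start_word]
    for _ in range(length - 1):
        next_word = toy_most_similar(context, vectors)[0][0]
        sentence.append(next_word)
        context = (context + [next_word])[-window:]  # keep the last `window` words
    return sentence


print(toy_greedy("c", toy_vectors, 6))
# ['c', 'b', 'a', 'b', 'a', 'b']
```

Because `a` and `b` are each other's nearest neighbours, the greedy choice bounces between them forever, which mirrors the cycle seen with the real model.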


### 3.1 Sampling

#### Uniform Top-K

As you can see in the solution of the previous exercise, one way of picking the next word is to take the most similar one. Such a method is called Argmax Sampling. For the next exercise, write the code and see what happens when we instead sample uniformly from the 10 most similar words.

In [18]:
def generate_sentence_with_sampling(start_word, sample_func, max_length=10, topk=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    current_words = [start_word]
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Find most similar words
        similar_words = model.wv.most_similar(current_words, topn=topk)
        next_word = sample_func(similar_words)
        sentence.append(next_word)

        # Update current words
        if len(current_words) == WINDOW_SIZE:
            current_words.pop(0)
        current_words.append(next_word)

    return " ".join(sentence)

In [19]:
# @title 🧑🏿‍💻 Your code here


def sample_unif_topk(similar_words):
    pass

In [20]:
# @title 👀 Solution


def sample_unif_topk(similar_words):
    return random.choices([word for word, _ in similar_words], k=1)[0]

In [21]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence_with_sampling(
    random_start_word, sample_unif_topk, 10
)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'Rondo': Rondo restoring Georgia's well-established Dusseldorf Missouri's Sherlock Panza dip piety


#### Top-K Sampling

Sampling uniformly from the top-k similar words introduces significant variety into the text. However, expanding the selection to a larger pool, such as 100 or 1000 similar words, could introduce excessive noise if we continue to sample uniformly. To refine this approach, you can weight the probability of each word according to its similarity score. To do so, we may rely on Python's `random.choices()` (which accepts a `weights` argument) or implement inverse transform sampling directly.

The next exercise involves coding this weighted sampling technique.

In [22]:
# @title 🧑🏿‍💻 Your code here


def sample_topk(similar_words):
    pass


def sample_topk_inverse_transform(similar_words):
    pass

In [23]:
# @title 👀 Solution


# relying on python's random.choices to sample based on the weights
def sample_topk(similar_words):
    words = []
    weights = []
    for word, similarity in similar_words:
        words.append(word)
        weights.append(similarity)
    return random.choices(words, weights=weights, k=1)[0]


# implement inverse transform sampling
def sample_topk_inverse_transform(similar_words):
    words = []
    weights = []
    for word, similarity in similar_words:
        words.append(word)
        weights.append(similarity)

    # Normalize the weights to form a probability distribution
    total_weight = sum(weights)
    probabilities = [weight / total_weight for weight in weights]

    # Create the cumulative distribution function (CDF)
    cumulative_probabilities = []
    cumulative_sum = 0
    for p in probabilities:
        cumulative_sum += p
        cumulative_probabilities.append(cumulative_sum)

    # Generate a random float in the half-open range [0, 1)
    u = random.random()

    # Find the first index where the cumulative probability exceeds the random number
    for index, cumulative_probability in enumerate(cumulative_probabilities):
        if cumulative_probability > u:
            return words[index]

    # Fallback for floating-point rounding: the CDF may end just below 1,
    # in which case the draw belongs to the last word
    return words[-1]

In [24]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence_with_sampling(
    random_start_word, sample_topk, 10, topk=1000
)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

generated_sentence = generate_sentence_with_sampling(
    random_start_word, sample_topk_inverse_transform, 10, topk=1000
)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'subjectively': subjectively assented disarmed minded Koehler reassemble housekeeping Nehru proffered Palmer's
Generated sentence from 'subjectively': subjectively Sidney Self Travel Late leering Contest Paula's Lamar Horne
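
One way to convince yourself that inverse transform sampling respects the weights is to draw many samples and compare the empirical frequencies with the normalized weights. The sketch below is a compact variant that uses `itertools.accumulate` and `bisect` from the standard library; the word list and scores in `toy_similar` are invented for illustration and do not come from the trained model.

```python
import bisect
import itertools
import random
from collections import Counter


def inverse_transform_sample(words, weights, rng):
    # Running totals of the (unnormalized) weights act as the CDF
    cumulative = list(itertools.accumulate(weights))
    u = rng.random() * cumulative[-1]  # uniform draw scaled to the total weight
    index = bisect.bisect_right(cumulative, u)  # first entry strictly greater than u
    return words[min(index, len(words) - 1)]  # guard against rounding at the top end


# Invented (word, similarity) pairs, shaped like the output of most_similar
toy_similar = [("river", 0.6), ("lake", 0.3), ("desk", 0.1)]
words = [w for w, _ in toy_similar]
weights = [s for _, s in toy_similar]

rng = random.Random(0)
n_draws = 20000
draws = Counter(inverse_transform_sample(words, weights, rng) for _ in range(n_draws))
frequencies = {w: draws[w] / n_draws for w in words}
print(frequencies)
```

The printed frequencies should land close to 0.6, 0.3 and 0.1. Scaling the uniform draw by the total weight avoids a separate normalization pass.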


### Discussion

- Why might sentences generated using the most similar word method lack coherence over longer sequences? How could the model be adjusted to produce more grammatically coherent outputs? With this in mind, how could we modify the model so that it also predicts the position of the generated context word?
- What are some methods to prevent the generated sentences from looping or becoming repetitive? How might introducing randomness or diversity in word selection impact the quality of the generated sentences?
- How do different parameters of the Word2Vec model (like vector size, window size, and training algorithm) affect the outcome of the generated sentences? What happens when you alter these parameters?
- Have you noticed anything when generating sentences that start with an arbitrary word (e.g. "gensim")? Describe the issue and potential solutions.

### 3.2 Beam Search (Extra)

In this exercise, you will implement the beam search algorithm using Word2Vec embeddings to generate text. Beam search, a heuristic search algorithm, balances exploration and efficiency by keeping only the top-scoring word sequences at each step. You'll integrate this with Word2Vec, which provides word embeddings that capture semantic similarities, to guide the generation process.

Through this task, explore how different settings for the beam width affect the coherence and diversity of the generated sentences, enhancing your understanding of both beam search mechanics and the practical use of word embeddings in NLP tasks.

In [25]:
# @title 🧑🏿‍💻 Your code here


def beam_search(start_word, beam_width=3, max_length=10):
    pass

In [26]:
# @title 👀 Solution
def beam_search(start_word, beam_width=3, max_length=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    # Initialize the beam with the start word
    beam = [(start_word, 0)]  # List of tuples (sentence, cumulative_score)

    for _ in range(max_length - 1):
        candidates = []
        # Explore each word in the beam
        for sentence, cum_score in beam:
            current_word = sentence.split()[-1]
            # Find potential next words
            try:
                next_words = model.wv.most_similar(current_word, topn=beam_width)
            except KeyError:
                continue  # Skip if current_word has no similar words

            # Add new sentences with updated scores
            for next_word, sim in next_words:
                new_sentence = sentence + " " + next_word
                new_score = cum_score + sim  # Update cumulative score
                candidates.append((new_sentence, new_score))

        # Prune to keep only the top beam_width entries
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

    # Choose the best sequence
    best_sequence, _ = sorted(beam, key=lambda x: x[1], reverse=True)[0]
    return best_sequence

In [27]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = beam_search(random_start_word, beam_width=3, max_length=10)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'Body': Body supersonic geese smelling geese smelling geese smelling geese smelling
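
To see the effect of the beam width in isolation, it can help to run beam search on a tiny hand-made model where the best sequence is known. The sketch below scores hypotheses with summed log-probabilities rather than the summed similarities used in the solution above, and the transition table `toy_transitions` is entirely hypothetical.

```python
import math

# Hypothetical next-word probabilities for a tiny vocabulary
toy_transitions = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.9, "dog": 0.1},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}


def toy_beam_search(transitions, beam_width, max_steps):
    beam = [(["<s>"], 0.0)]  # (tokens, cumulative log-probability)
    for _ in range(max_steps):
        candidates = []
        for tokens, score in beam:
            successors = transitions.get(tokens[-1])
            if not successors:  # finished hypothesis: carry it over unchanged
                candidates.append((tokens, score))
                continue
            for word, p in successors.items():
                candidates.append((tokens + [word], score + math.log(p)))
        # Prune to the beam_width highest-scoring hypotheses
        beam = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return beam[0]


greedy_tokens, greedy_score = toy_beam_search(toy_transitions, beam_width=1, max_steps=3)
wide_tokens, wide_score = toy_beam_search(toy_transitions, beam_width=2, max_steps=3)
print(greedy_tokens, math.exp(greedy_score))
print(wide_tokens, math.exp(wide_score))
```

With a beam width of 1 the search commits to "the" because it is the locally best first word, and ends with a sequence of probability 0.3. With a width of 2 it keeps "a" alive long enough to find the sequence starting with "a cat", whose probability is 0.36.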



To enhance the beam search function and address issues such as repetition or lack of diversity in the generated sentences, one effective strategy is to penalize words that were generated recently. This discourages the search from selecting the same word too frequently within a short segment of the generated text.

In [None]:
# @title 🧑🏿‍💻 Your code here
def beam_search_with_penalty(start_word, beam_width=3, max_length=10, penalty=0.1):
    pass

In [28]:
# @title 👀 Solution
def beam_search_with_penalty(start_word, beam_width=3, max_length=10, penalty=0.1):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    # Initialize the beam with the start word
    beam = [
        (start_word, 0, [start_word])
    ]  # Each entry is (sentence, cumulative_score, recent_words)

    for _ in range(max_length - 1):
        candidates = []
        # Explore each word in the beam
        for sentence, cum_score, recent_words in beam:
            current_word = sentence.split()[-1]
            # Find potential next words
            try:
                next_words = model.wv.most_similar(
                    current_word, topn=beam_width + len(recent_words)
                )
            except KeyError:
                continue  # Skip if current_word has no similar words

            # Apply penalty to similar words based on recent usage
            filtered_words = []
            for next_word, sim in next_words:
                if next_word not in recent_words:
                    filtered_words.append((next_word, sim))
                else:
                    # Apply penalty
                    filtered_words.append((next_word, sim * (1 - penalty)))

            # Keep only the top beam_width entries after applying penalty
            filtered_words = sorted(filtered_words, key=lambda x: x[1], reverse=True)[
                :beam_width
            ]

            # Add new sentences with updated scores and update recent words
            for next_word, sim in filtered_words:
                new_sentence = sentence + " " + next_word
                new_score = cum_score + sim  # Update cumulative score
                new_recent_words = recent_words[-(beam_width - 1) :] + [
                    next_word
                ]  # Update recent words
                candidates.append((new_sentence, new_score, new_recent_words))

        # Prune to keep only the top beam_width entries
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

    # Choose the best sequence
    best_sequence, _, _ = sorted(beam, key=lambda x: x[1], reverse=True)[0]
    return best_sequence

In [34]:
# Generate a sentence from a random starting word
random_start_word = random.choice(list(model.wv.index_to_key))
generated_sentence = beam_search_with_penalty(
    random_start_word, beam_width=100, max_length=100, penalty=1
)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'unfastened': unfastened triangular boarding supersonic geese Cemetery Palm Turner Snow's Janssen Campbell Clifford Moody Blumberg Spurdle celery Von Worthy Faber Freight Minnett '48 Gene-Princess Pembina Staten Bench Compson Peck Motor Dog Mail Musical France's dictators excitatory fermentation Erich chisel sediments pickoff electrode 5000 tomato biscuits oxalate centering olive idioms Projects bulbs engraved fermented jejunum butts hikes Cr filtering nitrogen tumble Turk summertime aft excesses Carnival canvases backwoods Leopoldville suppression tribunals Chesapeake Tudor negotiated rubbish Investment Savings Marin Simpson Trophy Vocational scours ketosis Rall Pride-Starlette '58 Mon-Columbia Heel-Miracle Plate Area teaspoon adjectives walnuts maple shrinkage diarrhea Loan elution Des Electrical expandable vapor-pressure


## 4. Designing an Autoregressive Word2Vec-Inspired Model for Position-Aware Language Generation (Extra)

Word2Vec is a powerful tool for generating vector representations of words, capturing their semantic relationships based on their contexts. However, traditional Word2Vec models do not account for the order or position of words in context; they merely predict context words regardless of their specific positions relative to the target word. This limitation can be addressed by integrating positional awareness into a model inspired by Word2Vec but designed for language generation.

**Task:**

Propose a detailed architecture for an autoregressive model that not only predicts context words but also their positions relative to the current word, similar to how Word2Vec captures contextual word relationships. Your proposed model should be capable of generating coherent language sequences by considering both the semantic similarity of words and their positional dynamics in sentences.

Points to consider in your proposal:

1. **Model Foundation**: Describe how you would modify the foundational architecture of Word2Vec (either CBOW or Skip-gram) to include positional information. What changes would be necessary to incorporate the sequential nature of language?

2. **Position Encoding**: How would you encode positional information within the model? Consider techniques used in other models like transformers (e.g., positional encodings) and describe how they could be adapted or improved for this use case.

3. **Autoregressive Mechanism**: Explain how the model would generate text in an autoregressive manner, particularly how it would use the positionally-aware embeddings to predict the next word in a sequence. What kind of neural network architecture would be suitable for this?

4. **Training Strategy**: Discuss potential strategies for training this model. What kind of corpus and preprocessing would be needed? How would you define the loss function to include accuracy not just for predicting the right word, but also placing it correctly?

5. **Potential Applications and Benefits**: Highlight how this position-aware model could outperform traditional Word2Vec models in specific NLP tasks. Consider applications in machine translation, summarization, or interactive dialog systems.

Your response should outline the theoretical framework, practical considerations for implementation, and possible challenges you might face with this architecture. This exercise aims to deepen your understanding of how word embeddings can be extended beyond traditional models to enhance language generation capabilities in NLP systems.
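
As a possible starting point for the training-data side of such a proposal, the skip-gram pairs can be extended with the relative offset of each context word. The sketch below is only one way to prepare such data, not a reference design; `positional_pairs` is a hypothetical helper and the three-word sentence is a toy example.

```python
def positional_pairs(tokens, window):
    """Yield (center, offset, context) triples for every in-window neighbour."""
    for i, center in enumerate(tokens):
        for offset in range(-window, window + 1):
            j = i + offset
            # Skip the center word itself and positions outside the sentence
            if offset != 0 and 0 <= j < len(tokens):
                yield center, offset, tokens[j]


toy_sentence = ["the", "jury", "said"]
triples = list(positional_pairs(toy_sentence, window=1))
print(triples)
# [('the', 1, 'jury'), ('jury', -1, 'the'), ('jury', 1, 'said'), ('said', -1, 'jury')]

# For left-to-right generation, only the positive offsets would be needed
forward_only = [t for t in triples if t[1] > 0]
print(forward_only)
# [('the', 1, 'jury'), ('jury', 1, 'said')]
```

A model trained on such triples could condition its prediction on the offset, for example through a separate output matrix or an added position embedding per offset, which is one of the design choices the proposal should discuss.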