# Natural Language Processing

### 2.2 Language Modeling (LM) and Sampling from a LM

In this tutorial, we will cover:

- Language Modeling (LM) with bigrams
- How to sample from a LM
- Different sampling strategies

Prerequisites:

- Python
- numpy

#### Authors

<br>Prof. Iacopo Masi and Prof. Stefano Faralli

TA: Robert Adrian Minut

In [None]:
# @title Requirements
import random
import nltk
from nltk.corpus import brown

# Natural Language Toolkit (NLTK)

![NLTK](https://miro.medium.com/v2/resize:fit:1400/format:webp/1*-dNH8WI8Oy3etClaRvRCgw.png)

https://www.nltk.org/howto.html

NLTK, or the Natural Language Toolkit, is a powerful open-source library in Python that provides tools for working with human language data. It's widely used in Natural Language Processing (NLP) for tasks such as tokenization, stemming, tagging, parsing, and more.

## Key Features

- **Text Processing:** NLTK allows for tokenizing, part-of-speech tagging, and named entity recognition.
- **Corpora Support:** The library comes with a wide range of linguistic data, including corpora like the Brown Corpus and the WordNet lexical database.
- **Machine Learning:** It supports classification, clustering, and other machine learning techniques specifically designed for language processing.

## Installation

You can install NLTK using `pip`:

```bash
pip install nltk
```

Let us download the Brown corpus and the `punkt` tokenizer models:

In [None]:
nltk.download('brown')
nltk.download('punkt')

# Example corpus
tokens = brown.words() # get the words from the brown corpus

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1. Unigram

A unigram model is a probabilistic language model that treats every word as a single unit (a 1-gram) and assumes that the probability of each word is independent of the words that precede or follow it. The probability of a sentence is therefore just the product of the probabilities of its words.


Fully modeling the probability of a sentence

![probs](https://miro.medium.com/v2/resize:fit:968/format:webp/1*vkvHegfkMxfjZ8hrqa1aAQ.png)

### Unigram assumes all words independent

![unigram](https://miro.medium.com/v2/resize:fit:1252/format:webp/1*qF3ZRwKDgmAG4cUBw3XeGA.png)

### **EXERCISE 1** 💻

Estimate the probability of each word in the corpus from its relative frequency. You will use these probabilities to generate sentences in the next exercises.

In [None]:
# @title 🧑🏿‍💻 Your code here

probabilities = {} # dictionary to store the probabilities in the format word:probability

In [None]:
# @title 👀 Solution

from collections import Counter

# Calculating word frequencies
frequency = Counter(tokens)
total_words = sum(frequency.values())

# Converting frequencies to probabilities
probabilities = {word: freq / total_words for word, freq in frequency.items()}

### **EXERCISE 2** 💻

Compute the probability of the sentence


> "The Fulton County Grand Jury said Friday an investigation of"

In [None]:
# @title 🧑🏿‍💻 Your code here
sentence = tokens[0:10] #"The Fulton County Grand Jury said Friday an investigation of"

In [None]:
# @title 👀 Solution
# let's get the probability of a sentence
sentence = tokens[0:10] #"The Fulton County Grand Jury said Friday an investigation of"

sentence_prob = 1

for word in sentence:
    sentence_prob *= probabilities[word]

print(sentence)
print(sentence_prob)

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
1.0853868324836725e-37
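
The product above is already tiny for ten words. A common workaround is to sum log-probabilities instead of multiplying probabilities. The following is a minimal sketch on a toy distribution: `toy_probabilities` and `sentence_log_prob` are hypothetical names, and the values are made up rather than estimated from the Brown corpus.

```python
import math

# Toy unigram probabilities (hypothetical values, NOT estimated from Brown)
toy_probabilities = {"the": 0.05, "jury": 0.0005, "said": 0.003}

def sentence_log_prob(sentence, probabilities):
    """Sum of log-probabilities under the unigram independence assumption."""
    return sum(math.log(probabilities[word]) for word in sentence)

toy_sentence = ["the", "jury", "said"]
log_prob = sentence_log_prob(toy_sentence, toy_probabilities)

# For a short sentence, exponentiating recovers the plain product
direct = 1.0
for word in toy_sentence:
    direct *= toy_probabilities[word]
print(log_prob, math.exp(log_prob), direct)

# For a long sequence the product underflows to 0.0,
# while the log-probability stays finite
long_sentence = ["jury"] * 200
product = 1.0
for word in long_sentence:
    product *= toy_probabilities[word]
print(product, sentence_log_prob(long_sentence, toy_probabilities))
```

Working in log space also makes it easy to compare sentences of the same length, since a larger (less negative) log-probability corresponds to a more likely sentence.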


### **EXERCISE 3** 💻

Given the unigram model that we developed so far, write a function that samples possible sentences of length `10` from the unigram model.

Then run this function 10 times and check the result.

In [None]:
# @title 🧑🏿‍💻 Your code here

random.seed(42)

def generate_sentence(length, probabilities):
    pass

In [None]:
# @title 👀 Solution

random.seed(42)

def generate_sentence(length, probabilities):
    sentence = []
    words = list(probabilities.keys())
    word_probabilities = list(probabilities.values())
    for _ in range(length):
        word = random.choices(words, weights=word_probabilities)[0]
        sentence.append(word)
    return ' '.join(sentence)

In [None]:
for i in range(10):
  print(generate_sentence(10, probabilities))

used of was , Thomas economic editors . with of
, -- of , afford Washington , estimated probate said
London know it the bucking it . . B decided
streets presentation House Chadwick is long plot case drunk their
small `` , to . , . was say on
two , for cared absolutely could the whose the is
Committeemen room I ; remain clubs , of a for
, goodbye backgrounds a ignored so manage ) and and
violence and only lock at , Blackwell before . ``
. ? surely with that is SX-21 A pallid character
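
Since numpy is listed among the prerequisites, the same unigram sampling can also be sketched with numpy, drawing all word indices in one call. This is only an illustrative alternative: `generate_sentence_np` and the toy distribution are hypothetical, and the renormalization step is an assumption added to guard against rounding in the probabilities.

```python
import numpy as np

def generate_sentence_np(length, probabilities, rng):
    """Draw `length` independent words from a word -> probability dictionary."""
    words = list(probabilities.keys())
    probs = np.array(list(probabilities.values()), dtype=float)
    probs = probs / probs.sum()  # renormalize so the values sum to 1
    indices = rng.choice(len(words), size=length, p=probs)
    return ' '.join(words[i] for i in indices)

# Toy distribution (hypothetical values)
toy_probabilities = {"the": 0.5, "cat": 0.3, "sat": 0.2}
rng = np.random.default_rng(42)
sampled = generate_sentence_np(10, toy_probabilities, rng)
print(sampled)
```

Passing the `probabilities` dictionary from Exercise 1 instead of the toy one should behave like the solution above, although the exact words will differ because the random number generators are different.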


### Discussion

- What are the limitations of generating text using a unigram model?
- How might the sentences differ if a bigram or trigram model were used instead?
- What improvements might be considered for a more realistic text generation given the constraint of using a unigram model?

## 2. Bigram LM and beyond

A bigram model predicts the next word in a sequence based on the previous word. It is a simple n-gram model with n equal to 2. The bigram model is rooted in the **Markov assumption** that the probability of a word depends only on a finite history of previous words; in the case of a bigram, just the immediately preceding word.

$$
P(w_1,\ldots,w_m) = \prod^m_{i=1} P(w_i\mid w_1,\ldots,w_{i-1})\approx \prod^m_{i=1} P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1})
$$

### We estimate bigram probabilities via MLE (frequency-based estimator)

$$
P(w_i\mid w_{i-(n-1)},\ldots,w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1},w_i)}{\mathrm{count}(w_{i-(n-1)},\ldots,w_{i-1})}
$$

## Let us make an example

> I saw the red house

With a bigram model it becomes:

$$
P(\text{I, saw, the, red, house}) \approx P(\text{I}\mid\langle s\rangle) P(\text{saw}\mid \text{I}) P(\text{the}\mid\text{saw}) P(\text{red}\mid\text{the}) P(\text{house}\mid\text{red}) P(\langle /s\rangle\mid \text{house})
$$

`<s>` and `</s>` indicate the start and the end of the sentence.
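
To make the factorization concrete, here is a minimal sketch that estimates bigram probabilities by MLE on a three-sentence toy corpus and scores a sentence with start and end markers. The toy corpus and the helper names (`pad`, `bigram_prob`, `sentence_prob`) are assumptions made for illustration; they are not part of the exercises.

```python
from collections import Counter

# Toy corpus (hypothetical), one list of tokens per sentence
toy_sentences = [
    ["I", "saw", "the", "red", "house"],
    ["I", "saw", "the", "dog"],
    ["the", "red", "house", "is", "big"],
]

def pad(sentence):
    """Add the start and end markers to a tokenized sentence."""
    return ["<s>"] + sentence + ["</s>"]

# Count bigrams and how often each word appears as a context (first element)
bigram_counts = Counter()
context_counts = Counter()
for sentence in toy_sentences:
    padded = pad(sentence)
    for w1, w2 in zip(padded, padded[1:]):
        bigram_counts[(w1, w2)] += 1
        context_counts[w1] += 1

def bigram_prob(w1, w2):
    """MLE estimate count(w1, w2) / count(w1); 0.0 for unseen contexts."""
    if context_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / context_counts[w1]

def sentence_prob(sentence):
    """Product of the bigram probabilities over the padded sentence."""
    padded = pad(sentence)
    prob = 1.0
    for w1, w2 in zip(padded, padded[1:]):
        prob *= bigram_prob(w1, w2)
    return prob

print(sentence_prob(["I", "saw", "the", "red", "house"]))
print(sentence_prob(["the", "dog", "is"]))  # contains an unseen bigram
```

On this toy corpus the first sentence gets 2/3 · 1 · 1 · 2/3 · 1 · 1/2 = 2/9, while the second gets probability 0 because the bigram ("dog", "is") was never observed. This zero-probability problem is one of the limitations of plain MLE.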

### **EXERCISE 4** 💻

Compute a bigram model over the tokens of the previous corpus.

Use the `bigrams` helper from NLTK:
```python
from nltk import bigrams
```

Store the result in `bigram_probabilities`, a dictionary of dictionaries in which `bigram_probabilities[w1][w2]` holds the probability of `w2` given `w1`.

In [None]:
# @title 🧑🏿‍💻 Your code here

bigram_probabilities = {} # dictionary to store the probabilities in the
# format word_i:word_{i+1} -> probability

In [None]:
# @title 👀 Solution

from nltk import bigrams
from collections import Counter, defaultdict

# Calculating bigram frequencies
bigram_freq = Counter(bigrams(tokens))

# Building bigram probabilities via MLE: P(w2 | w1) = count(w1, w2) / count(w1)
# (`frequency` is the unigram Counter computed in Exercise 1)
bigram_probabilities = defaultdict(dict)
for (w1, w2), freq in bigram_freq.items():
    bigram_probabilities[w1][w2] = freq / frequency[w1]

#### Let us print a few of them...

In [None]:
for k1 in list(bigram_probabilities.keys())[:2]:
  for k2 in list(bigram_probabilities[k1].keys())[:10]:
    print(k1, k2, bigram_probabilities[k1][k2])

The Fulton 0.00013777900248002206
The jury 0.0012400110223201985
The September-October 0.00013777900248002206
The grand 0.0002755580049600441
The City 0.0006888950124001103
The jurors 0.00013777900248002206
The couple 0.0002755580049600441
The petition 0.0002755580049600441
The Hartsfield 0.00013777900248002206
The mayor's 0.00013777900248002206
Fulton County 0.35294117647058826
Fulton Superior 0.11764705882352941
Fulton legislators 0.11764705882352941
Fulton taxpayers 0.058823529411764705
Fulton ordinary's 0.058823529411764705
Fulton Tax 0.058823529411764705
Fulton Health 0.058823529411764705
Fulton to 0.058823529411764705
Fulton was 0.058823529411764705
Fulton , 0.058823529411764705


### **EXERCISE 5** 💻

Given the bigram model that we developed so far, write a function that samples possible sentences of length `10` from the bigram model, given a starting word.

Then run this function 10 times and check the result.

In [None]:
# @title 🧑🏿‍💻 Your code here

def generate_bigram_sentence(start_word, length, bigram_probabilities):
    pass

In [None]:
# @title 👀 Solution

def generate_bigram_sentence(start_word, length, bigram_probabilities):
    if start_word not in bigram_probabilities:
        raise ValueError("Start word not in bigram probabilities")
    sentence = [start_word]
    current_word = start_word
    for _ in range(length - 1):
        next_word = random.choices(list(bigram_probabilities[current_word].keys()),
                                   weights=bigram_probabilities[current_word].values())[0]
        sentence.append(next_word)
        current_word = next_word
    return ' '.join(sentence)

In [None]:
# Generate ten 20-word sentences, all starting from the same randomly chosen word
start_word = random.choice(list(bigram_probabilities.keys()))
for i in range(10):
  print(generate_bigram_sentence(start_word, 20, bigram_probabilities))


Colquitt Policeman Tom Horn had much for a man and certain tissues she couldn't lift the reader will give the
Colquitt -- or lover . Complementing the nature of it across enough to be confused over these differences which begins
Colquitt -- although Brown University , following Monday morning the 2 . But at least an ideal . The other
Colquitt Policeman Tom Williams wrote the right of a demurrer to an equal ) 3 ) . One thing as
Colquitt Policeman Tom attended Midwood High blood '' , into three factors as it has been confined almost a class
Colquitt Policeman Tom '' . The president made a young conductor also failed to have it seldom a team ,
Colquitt Policeman Tom '' . `` I was meant to a public ventures he ( 2 . America . The
Colquitt Policeman Tom Swift . ) and in what I had a Russian tests . `` Thou art relates human
Colquitt -- long way to share one of one of the children , far as nice of many universities ,
Colquitt Policeman Tom Horn , technological factors that it , however , rhy

### Discussion

- Compare the sentences generated by the unigram and bigram models. Which model produces more coherent sentences?
- What are the limitations of a bigram model in text generation?
- How do the results change if starting words are varied?
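
To experiment with longer histories, the bigram sampler can be generalized to an n-gram sampler whose context is the previous n-1 words. The sketch below is a hedged illustration on a toy token list: `build_ngram_counts` and `generate_ngram_sentence` are hypothetical helpers, and sampling is weighted by raw counts, which is equivalent to sampling from the MLE probabilities.

```python
import random
from collections import Counter, defaultdict

def build_ngram_counts(tokens, n):
    """Map each (n-1)-word context to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def generate_ngram_sentence(start_context, length, counts, rng):
    """Sample words one at a time, conditioning on the last len(start_context) words."""
    k = len(start_context)
    sentence = list(start_context)
    context = tuple(start_context)
    while len(sentence) < length:
        followers = counts.get(context)
        if not followers:
            break  # unseen context: stop early
        words = list(followers.keys())
        weights = list(followers.values())
        next_word = rng.choices(words, weights=weights)[0]
        sentence.append(next_word)
        context = tuple(sentence[len(sentence) - k:])
    return ' '.join(sentence)

# Toy token list (hypothetical); replace it with `tokens` to use the Brown corpus
toy_tokens = ("the cat sat on the mat . the cat ate the fish . "
              "the dog sat on the rug .").split()
trigram_counts = build_ngram_counts(toy_tokens, 3)
generated = generate_ngram_sentence(("the", "cat"), 8, trigram_counts, random.Random(0))
print(generated)
```

Longer contexts tend to produce locally more fluent text, but each context is observed less often, so the model increasingly copies fragments of the training corpus.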

## 3. Word2Vec

We have already seen Word2Vec. Unlike the previously discussed unigram and bigram models, which focus on word frequencies or word sequences, Word2Vec captures the semantic relationships between words in a way that preserves context and similarity. Can we make use of these relationships in order to generate sentences?

In [None]:
import gensim.downloader
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


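Let us now load the pretrained `glove-wiki-gigaword-100` vectors. These are GloVe embeddings rather than Word2Vec ones, but gensim exposes them through the same interface.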

In [None]:
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-100')



A small hack so that the vectors are accessible as `model.wv`, as they would be for a trained gensim `Word2Vec` model:

In [None]:
class MyModel:
  def __init__(self, model):
    self.wv = model
model = MyModel(glove_vectors)

In [None]:
# from gensim.models import Word2Vec

# WINDOW_SIZE = 3

#tokenized_sentences = brown.sents()
#model = Word2Vec(tokenized_sentences, vector_size=100, window=WINDOW_SIZE, min_count=1, sg=1) # sg=1 -> skip-gram

Word2Vec itself is not a language model in the traditional sense, as it doesn't model the probability distribution of a sequence of words in a language. As we already discussed, it is primarily used to learn vector representations of words that capture semantic similarities and relationships between them. Still, we can use this model to predict context words, which may appear in the same window as target words.
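
The exercises below rely on `most_similar` with a list of words. As a rough mental model, and assuming that the ranking is by cosine similarity to the normalized mean of the input vectors with the input words excluded, the computation can be sketched with numpy on toy vectors. `toy_vectors` and `most_similar_sketch` are hypothetical and only approximate what the library does.

```python
import numpy as np

# Toy 3-dimensional embeddings (hypothetical values)
toy_vectors = {
    "cat": np.array([0.9, 0.1, 0.0]),
    "dog": np.array([0.8, 0.2, 0.1]),
    "kitten": np.array([0.85, 0.15, 0.05]),
    "car": np.array([0.0, 0.9, 0.4]),
    "truck": np.array([0.1, 0.8, 0.5]),
}

def unit(vector):
    """Scale a vector to unit length."""
    return vector / np.linalg.norm(vector)

def most_similar_sketch(positive, vectors, topn=2):
    """Rank the other words by cosine similarity to the mean of `positive`."""
    query = unit(np.mean([unit(vectors[word]) for word in positive], axis=0))
    scores = [
        (word, float(np.dot(query, unit(vector))))
        for word, vector in vectors.items()
        if word not in positive  # the input words are never returned
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

neighbours = most_similar_sketch(["cat", "dog"], toy_vectors)
print(neighbours)
```

Note that this score measures how close a word is to the current context in embedding space. It says nothing about word order, which is worth keeping in mind when reading the generated sentences.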

### **EXERCISE 6** 💻

Generate a sentence from a word2vec model treating it as an autoregressive model.

The pseudo-algorithm can be:

1. Start from an initial word called `start_word`.
2. Word2Vec is trained on a window of context words, so keep the `WINDOW_SIZE` most recently generated words in a list `current_words` and use them to "sample" a new word with `most_similar(current_words, topn=10)`.
3. Store the newly "sampled" word in:
  - the sentence being generated;
  - the list `current_words`, which must never contain more than `WINDOW_SIZE` words (drop the oldest word when it is full).

In [None]:
# @title 🧑🏿‍💻 Your code here

def generate_sentence(start_word, WINDOW_SIZE, max_length=100):
    pass

In [None]:
# @title 👀 Solution

def generate_sentence(start_word, WINDOW_SIZE, max_length=100):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    current_words = [start_word]
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Find most similar words
        similar_words = model.wv.most_similar(current_words, topn=10)
        #print(similar_words)
        next_word = similar_words[0][0]  # pick the most similar word
        sentence.append(next_word)

        # Update current words
        if len(current_words) == WINDOW_SIZE:
            current_words.pop(0)
        current_words.append(next_word)

    return ' '.join(sentence)

In [None]:
# Generate a sentence from a random starting word
random.seed(42)
random_start_word = "the"  # or use random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence(random_start_word, 10, 100)
print(f"Generated sentence from '{random_start_word}':\n {generated_sentence}")

Generated sentence from 'the':
 the this same one only it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not this that if what it but so though even because not


### 3.1 Sampling

#### Uniform Top-K

As you can see in the solution of the previous exercise, one way of picking the next word is taking the most similar.

**This method is called argmax sampling, also known as greedy decoding.**

### **EXERCISE 7** 💻

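For the next exercise, **code and see what happens when we sample uniformly from the 10 most similar words.**


`generate_sentence_with_sampling` is already implemented; you only need to complete `sample_unif_topk`, which receives the list of `(word, similarity)` pairs returned by `most_similar` and returns one word.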



In [None]:
def generate_sentence_with_sampling(start_word, sample_func, WINDOW_SIZE,
                                    max_length=100, topk=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    current_words = [start_word]
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Find most similar words
        similar_words = model.wv.most_similar(current_words, topn=topk)
        next_word = sample_func(similar_words)
        sentence.append(next_word)

        # Update current words
        if len(current_words) == WINDOW_SIZE:
            current_words.pop(0)
        current_words.append(next_word)

    return ' '.join(sentence)

In [None]:
# @title 🧑🏿‍💻 Your code here

def sample_unif_topk(similar_words):
    pass

In [None]:
# @title 👀 Solution

def sample_unif_topk(similar_words):
    return random.choices([word for word, _ in similar_words], k=1)[0]

In [None]:
# Generate a sentence from a random starting word
random.seed(42)
random_start_word = "the"  # or use random.choice(list(model.wv.index_to_key))
generated_sentence = generate_sentence_with_sampling(random_start_word,
                                                     sample_unif_topk,
                                                     10, #WINDOW SIZE
                                                     100,# How much to generate
                                                     10) # topk
print(f"Generated sentence from '{random_start_word}':\n {generated_sentence}")

Generated sentence from 'the':
 the on this it same part but so one though only because well this even yet there not they could but n't would that because should did might not say we what be even do would if that should did but might not they even could because n't if should be did might but it we that could yet would even what be n't might come did so not if they we that what but yet it way could even because n't if we did they but yet that not even because what be n't so they this would might


#### We start now from "The cat is"
Now let us change the function so that it starts from a list of words instead of a single word. We will always start with

> ['the','cat', 'is']

In [None]:
def generate_sentence_with_sampling(current_words, sample_func, WINDOW_SIZE,
                                    max_length=100, topk=10):

  sentence = current_words[:]
  for _ in range(max_length - 1):
      # Find most similar words
      similar_words = model.wv.most_similar(current_words, topn=topk)
      next_word = sample_func(similar_words)
      sentence.append(next_word)

      # Update current words
      if len(current_words) == WINDOW_SIZE:
          current_words.pop(0)
      current_words.append(next_word)

  return ' '.join(sentence)

In [None]:
# Generate a sentence from a fixed three-word prompt
random.seed(42)
current_words = ['the','cat', 'is']
generated_sentence = generate_sentence_with_sampling(current_words[:],
                                                     sample_unif_topk,
                                                     10, #WINDOW SIZE
                                                     100,# How much to generate
                                                     10,# topk
                                                     )
print(f"Generated sentence from '{current_words}':\n {generated_sentence}")

Generated sentence from '['the', 'cat', 'is']':
 the cat is now one only as though . although but because it this well even however as same . that fact though there it this although what so but because much . however well only they not that yet this there because . but so it though although however that well only not this even they be . though there so that would yet this could now they even not it be that might still because n't would now they but even be so that did it because still n't not but they yet what if even could it did this


#### Top-K Sampling

Sampling uniformly from the top-k similar words introduces significant variety into the text. However, expanding the selection to a larger pool, such as 100 or 1000 similar words, could introduce excessive noise if we continue to sample uniformly. To refine this approach, we can weight the probability of each word by its similarity score. To do so, we may rely on Python's `random.choices()`, which accepts a `weights` argument, or implement inverse transform sampling directly.

### **EXERCISE 8** 💻

The next exercise involves coding this weighted sampling technique.

In [None]:
# @title 🧑🏿‍💻 Your code here

def sample_topk(similar_words):
    # use random.choices() with the similarity scores as weights
    pass


# OPTIONAL: this one takes longer to implement
def sample_topk_inverse_transform(similar_words):
    # do not use random.choices(); reimplement inverse transform sampling by hand
    pass

In [None]:
# @title 👀 Solution

# relying on python's random.choices to sample based on the weights
def sample_topk(similar_words):
    words = []
    weights = []
    for word, similarity in similar_words:
        words.append(word)
        weights.append(similarity)
    return random.choices(words, weights=weights, k=1)[0]

# implement inverse transform sampling
def sample_topk_inverse_transform(similar_words):
    words = []
    weights = []
    for word, similarity in similar_words:
        words.append(word)
        weights.append(similarity)

    # Normalize the weights to form a probability distribution
    total_weight = sum(weights)
    probabilities = [weight / total_weight for weight in weights]

    # Create the cumulative distribution function (CDF)
    cumulative_probabilities = []
    cumulative_sum = 0
    for p in probabilities:
        cumulative_sum += p
        cumulative_probabilities.append(cumulative_sum)

    # Generate a random float in the range [0, 1)
    u = random.random()

    # Find the first index where the cumulative probability exceeds the random number
    for index, cumulative_probability in enumerate(cumulative_probabilities):
        if cumulative_probability > u:
            return words[index]

    # Fallback: floating-point rounding can leave the last cumulative value
    # slightly below 1, in which case u belongs to the last word
    return words[-1]
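
The linear scan over the cumulative distribution can also be written with the standard library. The following is a compact sketch of the same inverse transform idea using `itertools.accumulate` and `bisect`. The function name and the optional `rng` parameter are assumptions made for illustration, and the random number is scaled by the total weight instead of normalizing the weights first.

```python
import bisect
import itertools
import random
from collections import Counter

def sample_inverse_transform_bisect(similar_words, rng=random):
    """Inverse transform sampling over (word, weight) pairs with positive weights."""
    words = [word for word, _ in similar_words]
    weights = [similarity for _, similarity in similar_words]

    # Unnormalized cumulative distribution
    cumulative = list(itertools.accumulate(weights))

    # Uniform draw in [0, total weight)
    u = rng.random() * cumulative[-1]

    # Index of the first cumulative value strictly greater than u
    index = bisect.bisect_right(cumulative, u)
    return words[min(index, len(words) - 1)]  # guard against rounding at the top end

# Quick empirical check on toy weights (hypothetical values)
toy_similar = [("a", 0.7), ("b", 0.2), ("c", 0.1)]
rng = random.Random(0)
draws = Counter(sample_inverse_transform_bisect(toy_similar, rng) for _ in range(10000))
print(draws)
```

With many draws, the empirical frequencies should approach the normalized weights, which is a simple way to check any hand-written sampler.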

In [None]:
# Generate a sentence from a random starting word
random.seed(42)
current_words = ['the','cat', 'is']
generated_sentence = generate_sentence_with_sampling(current_words[:],
                                                     sample_topk,
                                                     10, #WINDOW SIZE
                                                     100,# How much to generate
                                                     10, # topk
                                                     )
print(f"Generated sentence from '{current_words}':\n {generated_sentence}")

Generated sentence from '['the', 'cat', 'is']':
 the cat is now one only as though . although but because it this not even however well they that yet if but would only because though still it not even although could now however they . so this well that there because only but it not even . so if although however that this because there what fact but only even not one although yet however if they so but only because not did n't it we though be could so they that even yet only would because have might not could they did though n't even that if should say


### Not much of a difference from before, but...
now we can try to increase top-k: the sampling still injects randomness (noise), but each candidate is now drawn with probability proportional to its similarity to the current words.

In [None]:
# Generate a sentence from a random starting word
random.seed(42)
current_words = ['the','cat', 'is']
generated_sentence = generate_sentence_with_sampling(current_words[:],
                                                     sample_topk,
                                                     10, #WINDOW SIZE
                                                     100,# How much to generate
                                                     25, # topk
                                                     )
print(f"Generated sentence from '{current_words}':\n {generated_sentence}")

Generated sentence from '['the', 'cat', 'is']':
 the cat is so it even only what way time one same this well as so because although yet . they when but way could only not as would because that time however same no if should could n't does might now only do would we so even did could come way because going might make this that take n't does could they you want but way can it do n't make get if 'll what want did going really even so think nothing something certainly thought you why ? i really n't sure want everyone what think something everybody 've thought
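
If the top-k similarity scores are close to each other, weights proportional to the similarity are almost uniform. One possible way to control how peaked the sampling distribution is, not used elsewhere in this notebook, is to pass the scores through a softmax with a temperature. The sketch below assumes the same `(word, similarity)` input as `sample_topk`; `sample_topk_temperature`, its `temperature` default and the `rng` parameter are hypothetical.

```python
import math
import random
from collections import Counter

def sample_topk_temperature(similar_words, temperature=0.1, rng=random):
    """Sample a word with probability proportional to exp(similarity / temperature)."""
    words = [word for word, _ in similar_words]
    similarities = [similarity for _, similarity in similar_words]

    # Subtract the maximum before exponentiating for numerical stability
    max_similarity = max(similarities)
    weights = [math.exp((s - max_similarity) / temperature) for s in similarities]
    return rng.choices(words, weights=weights, k=1)[0]

# Toy scores (hypothetical values)
toy_similar = [("a", 0.9), ("b", 0.8), ("c", 0.7)]
rng = random.Random(0)
cold = Counter(sample_topk_temperature(toy_similar, 0.01, rng) for _ in range(1000))
hot = Counter(sample_topk_temperature(toy_similar, 100.0, rng) for _ in range(1000))
print(cold, hot)
```

A low temperature approaches argmax sampling, while a high temperature approaches uniform top-k sampling. To try it in the generation loop, `functools.partial(sample_topk_temperature, temperature=0.05)` can be passed as `sample_func`.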


### Discussion

- Why might sentences generated using the most similar word method lack coherence over longer sequences? How could the model be adjusted to produce more grammatically coherent outputs? With this perspective in mind, how could we modify the model, in order to also predict the position of the generated context word?
- What are some methods to prevent the generated sentences from looping or becoming repetitive? How might introducing randomness or diversity in word selection impact the quality of the generated sentences?
- How do different parameters of the Word2Vec model (like vector size, window size, and training algorithm) affect the outcome of the generated sentences? What happens when you alter these parameters?
- Have you noticed anything when generating sentences starting with an arbitrary word (e.g., "gensim")? Describe the issue and potential solutions.

### 3.2 Beam Search (Extra)

### **EXERCISE 9** 💻


In this exercise, you will implement the beam search algorithm using Word2Vec embeddings to generate text. Beam search, a heuristic search algorithm, balances exploration and efficiency by keeping only the top-scoring word sequences at each step. You'll integrate this with Word2Vec, which provides word embeddings that capture semantic similarities, to guide the generation process.

Through this task, explore how different settings for the beam width affect the coherence and diversity of the generated sentences, enhancing your understanding of both beam search mechanics and the practical use of word embeddings in NLP tasks.

In [None]:
# @title 🧑🏿‍💻 Your code here

def beam_search(start_word, beam_width=3, max_length=10):
  pass

In [None]:
# @title 👀 Solution
def beam_search(start_word, beam_width=3, max_length=10):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    # Initialize the beam with the start word
    beam = [(start_word, 0)]  # List of tuples (sentence, cumulative_score)

    for _ in range(max_length - 1):
        candidates = []
        # Explore each word in the beam
        for sentence, cum_score in beam:
            current_word = sentence.split()[-1]
            # Find potential next words
            try:
                next_words = model.wv.most_similar(current_word, topn=beam_width)
            except KeyError:
                continue  # Skip if current_word has no similar words

            # Add new sentences with updated scores
            for next_word, sim in next_words:
                new_sentence = sentence + ' ' + next_word
                new_score = cum_score + sim  # Update cumulative score
                candidates.append((new_sentence, new_score))

        # Prune to keep only the top beam_width entries
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

    # Choose the best sequence
    best_sequence, _ = sorted(beam, key=lambda x: x[1], reverse=True)[0]
    return best_sequence


In [None]:
# Generate a sentence from a random starting word
random_start_word = 'the' #random.choice(list(model.wv.index_to_key))
generated_sentence = beam_search(random_start_word, beam_width=3, max_length=10)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'the': the one another a another a another a another a
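
The solution above scores a sequence by summing similarities. To see the beam search mechanics in isolation, here is a hedged sketch that swaps in a different scoring: summed log-probabilities from a small hand-written bigram table. The table `toy_bigram_probabilities` and the function `beam_search_bigram` are hypothetical and only meant to show how a wider beam can recover a sequence that greedy search misses.

```python
import math

# Toy bigram table (hypothetical values): word -> {next_word: probability}
toy_bigram_probabilities = {
    "the": {"cat": 0.5, "dog": 0.4, "end": 0.1},
    "cat": {"sat": 0.5, "ran": 0.5},
    "dog": {"ran": 1.0},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def beam_search_bigram(start_word, bigram_probabilities, beam_width=2, max_length=4):
    """Keep the `beam_width` sequences with the highest summed log-probability."""
    beam = [([start_word], 0.0)]  # list of (words, cumulative log-probability)
    for _ in range(max_length - 1):
        candidates = []
        for words, score in beam:
            followers = bigram_probabilities.get(words[-1], {})
            if not followers:
                candidates.append((words, score))  # finished hypothesis, keep as is
                continue
            for next_word, prob in followers.items():
                candidates.append((words + [next_word], score + math.log(prob)))
        # Prune to the best `beam_width` candidates
        beam = sorted(candidates, key=lambda item: item[1], reverse=True)[:beam_width]
    return [(' '.join(words), score) for words, score in beam]

greedy = beam_search_bigram("the", toy_bigram_probabilities, beam_width=1)
wider = beam_search_bigram("the", toy_bigram_probabilities, beam_width=2)
print(greedy[0])
print(wider[0])
```

With `beam_width=1` the search commits to "cat" (probability 0.5) and ends with a sequence of probability 0.25, whereas `beam_width=2` keeps "dog" alive and finds "the dog ran away" with probability 0.4.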



To enhance the beam search function and address potential issues such as repetition or lack of diversity in the generated sentences, one effective strategy is to introduce a penalty for sampled words. This approach helps to discourage the selection of the same word or similar words too frequently within a short segment of the generated text.

In [None]:
# @title 🧑🏿‍💻 Your code here
def beam_search_with_penalty(start_word, beam_width=3, max_length=10, penalty=0.1):
  pass

In [None]:
# @title 👀 Solution
def beam_search_with_penalty(start_word, beam_width=3, max_length=10, penalty=0.1):
    if start_word not in model.wv:
        return "Word not in vocabulary!"

    # Initialize the beam with the start word
    beam = [(start_word, 0, [start_word])]  # Each entry is (sentence, cumulative_score, recent_words)

    for _ in range(max_length - 1):
        candidates = []
        # Explore each word in the beam
        for sentence, cum_score, recent_words in beam:
            current_word = sentence.split()[-1]
            # Find potential next words
            try:
                next_words = model.wv.most_similar(current_word, topn=beam_width + len(recent_words))
            except KeyError:
                continue  # Skip if current_word has no similar words

            # Apply penalty to similar words based on recent usage
            filtered_words = []
            for next_word, sim in next_words:
                if next_word not in recent_words:
                    filtered_words.append((next_word, sim))
                else:
                    # Apply penalty
                    filtered_words.append((next_word, sim * (1 - penalty)))

            # Keep only the top beam_width entries after applying penalty
            filtered_words = sorted(filtered_words, key=lambda x: x[1], reverse=True)[:beam_width]

            # Add new sentences with updated scores and update recent words
            for next_word, sim in filtered_words:
                new_sentence = sentence + ' ' + next_word
                new_score = cum_score + sim  # Update cumulative score
                new_recent_words = recent_words[-(beam_width-1):] + [next_word]  # Update recent words
                candidates.append((new_sentence, new_score, new_recent_words))

        # Prune to keep only the top beam_width entries
        beam = sorted(candidates, key=lambda x: x[1], reverse=True)[:beam_width]

    # Choose the best sequence
    best_sequence, _, _ = sorted(beam, key=lambda x: x[1], reverse=True)[0]
    return best_sequence

In [None]:
# Generate a sentence from a random starting word
random_start_word = 'the' #random.choice(list(model.wv.index_to_key))
generated_sentence = beam_search_with_penalty(random_start_word, beam_width=100, max_length=100, penalty=1)
print(f"Generated sentence from '{random_start_word}': {generated_sentence}")

Generated sentence from 'the': None


## 4. Designing an Autoregressive Word2Vec-Inspired Model for Position-Aware Language Generation (Extra)

Word2Vec is a powerful tool for generating vector representations of words, capturing their semantic relationships based on their contexts. However, traditional Word2Vec models do not account for the order or position of words in context; they merely predict context words regardless of their specific positions relative to the target word. This limitation can be addressed by integrating positional awareness into a model inspired by Word2Vec but designed for language generation (i.e. Language Modeling).

**Task:**

Propose a detailed architecture for an autoregressive model that not only predicts context words but also their positions relative to the current word, similar to how Word2Vec captures contextual word relationships. Your proposed model should be capable of generating coherent language sequences by considering both the semantic similarity of words and their positional dynamics in sentences.

Points to consider in your proposal:

1. **Model Foundation**: Describe how you would modify the foundational architecture of Word2Vec (either CBOW or Skip-gram) to include positional information. What changes would be necessary to incorporate the sequential nature of language?

2. **Position Encoding**: How would you encode positional information within the model? Consider techniques used in other models like transformers (e.g., positional encodings) and describe how they could be adapted or improved for this use case.

3. **Autoregressive Mechanism**: Explain how the model would generate text in an autoregressive manner, particularly how it would use the positionally-aware embeddings to predict the next word in a sequence. What kind of neural network architecture would be suitable for this?

4. **Training Strategy**: Discuss potential strategies for training this model. What kind of corpus and preprocessing would be needed? How would you define the loss function to include accuracy not just for predicting the right word, but also placing it correctly?

5. **Potential Applications and Benefits**: Highlight how this position-aware model could outperform traditional Word2Vec models in specific NLP tasks. Consider applications in machine translation, summarization, or interactive dialog systems.
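
As a starting point for the position encoding item above, the sinusoidal scheme used in the original Transformer can be sketched in a few lines of numpy. This is only one possible option, not a prescribed answer: `sinusoidal_positional_encoding` is a hypothetical helper, and the random `window_vectors` stand in for the embeddings of three context words.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, dim):
    """Return a (num_positions, dim) matrix of sine/cosine position encodings."""
    positions = np.arange(num_positions)[:, None]   # shape (num_positions, 1)
    even_dims = np.arange(0, dim, 2)[None, :]       # shape (1, ceil(dim / 2))
    angles = positions / np.power(10000.0, even_dims / dim)

    encoding = np.zeros((num_positions, dim))
    encoding[:, 0::2] = np.sin(angles)                  # even dimensions
    encoding[:, 1::2] = np.cos(angles[:, : dim // 2])   # odd dimensions
    return encoding

# Stand-in embeddings for a window of 3 words with 8 dimensions (random values)
rng = np.random.default_rng(0)
window_vectors = rng.normal(size=(3, 8))

encoding = sinusoidal_positional_encoding(3, 8)
position_aware = window_vectors + encoding  # same word, different position -> different input
print(position_aware.shape)
```

Adding the encoding makes the input for a word depend on where it appears in the window, which is the information that plain Word2Vec discards. Whether to add, concatenate, or learn the position vectors is part of the design space the task asks you to discuss.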

Your response should outline the theoretical framework, practical considerations for implementation, and possible challenges you might face with this architecture. This exercise aims to deepen your understanding of how word embeddings can be extended beyond traditional models to enhance language generation capabilities in NLP systems.