How does a bigram model work?

The bigram model assumes that the probability of a word depends only on the word immediately preceding it. This is an application of the Markov assumption, which simplifies complex sequential data by reducing the dependency horizon to a fixed size (one in the case of a bigram model).

Mathematically, for a sequence of words $w_1, w_2, \ldots, w_n$, the probability of the sequence is approximated as:
$$
P(w_1,w_2,\ldots,w_n) \approx P(w_1) \cdot P(w_2 \mid w_1) \cdot P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})
$$

Here, $P(w_i \mid w_{i-1})$ is the probability of word $w_i$ given the preceding word $w_{i-1}$.
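For context, the approximation comes from the chain rule of probability, which is exact. The Markov assumption then shortens each conditioning history to a single word. A sketch of that step, using the same symbols as above:

$$
P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1})
$$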

Example

Suppose we have a simple sentence:
"I love programming languages."

The bigrams extracted from this sentence are:

    ("I", "love")

    ("love", "programming")

    ("programming", "languages")
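The extraction above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used later in this notebook: it splits on whitespace only, and the helper name `extract_bigrams` is hypothetical.

```python
def extract_bigrams(words):
    """Return the adjacent word pairs of a sequence, in order."""
    # zip pairs each word with its successor; the last word has none
    return list(zip(words, words[1:]))

words = "I love programming languages".split()
bigrams = extract_bigrams(words)
print(bigrams)
# [('I', 'love'), ('love', 'programming'), ('programming', 'languages')]
```

A sequence of n words always yields n - 1 bigrams, which is why four words give three pairs here.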

Using a bigram model, the probability of the sentence is computed as:

$$
P(\text{"I love programming languages"}) = P(\text{"I"}) \cdot P(\text{"love"} \mid \text{"I"}) \cdot P(\text{"programming"} \mid \text{"love"}) \cdot P(\text{"languages"} \mid \text{"programming"})
$$
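To make the product concrete, here is a small sketch that multiplies the factors in the same order. The probability values are made up for illustration (they are not estimated from any corpus), and `sentence_probability`, `unigram_probs`, and `bigram_probs` are hypothetical names.

```python
# Made-up probabilities, for illustration only
unigram_probs = {"I": 0.2}
bigram_probs = {
    ("I", "love"): 0.5,
    ("love", "programming"): 0.4,
    ("programming", "languages"): 0.25,
}

def sentence_probability(words, unigram_probs, bigram_probs):
    """Return P(w1) times the product of P(w_i | w_{i-1}); unseen events get 0."""
    prob = unigram_probs.get(words[0], 0.0)
    for w1, w2 in zip(words, words[1:]):
        prob *= bigram_probs.get((w1, w2), 0.0)
    return prob

p = sentence_probability("I love programming languages".split(),
                         unigram_probs, bigram_probs)
print(p)  # roughly 0.01, i.e. 0.2 * 0.5 * 0.4 * 0.25
```

Note that a single unseen bigram makes the whole product zero in this sketch.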

In [1]:
text_data = """
Welcome. This is a simple test text for model training.
Hello my dear friend. This text is also a test and hope you like my example.
We will test a simple bigram language model on this text.
We will test and see how simple bigram langauge model behave.
I am doing my best to learn through practice.
Thank you for taking the time to read this.
"""


In [40]:
import re

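def tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    text = text.lower()
    # \b\w+\b matches whole words; [^\w\s] matches single punctuation characters
    tokens = re.findall(r'\b\w+\b|[^\w\s]', text, re.UNICODE)
    return tokens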

In [41]:
tokens = tokenize(text_data)
print("Tokens: ", tokens)

Tokens:  ['welcome', '.', 'this', 'is', 'a', 'simple', 'test', 'text', 'for', 'model', 'training', '.', 'hello', 'my', 'dear', 'friend', '.', 'this', 'text', 'is', 'also', 'a', 'test', 'and', 'hope', 'you', 'like', 'my', 'example', '.', 'we', 'will', 'test', 'a', 'simple', 'bigram', 'language', 'model', 'on', 'this', 'text', '.', 'we', 'will', 'test', 'and', 'see', 'how', 'simple', 'bigram', 'langauge', 'model', 'behave', '.', 'i', 'am', 'doing', 'my', 'best', 'to', 'learn', 'through', 'practice', '.', 'thank', 'you', 'for', 'taking', 'the', 'time', 'to', 'read', 'this', '.']


In [42]:
length = len(tokens)
print("Number of tokens: ", length)

Number of tokens:  74


In [43]:
from collections import defaultdict

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)

# Count each adjacent pair (w1, w2) and each first word w1
for i in range(length - 1):
    w1 = tokens[i]
    w2 = tokens[i + 1]
    bigram_counts[(w1, w2)] += 1
    unigram_counts[w1] += 1
# The last token never appears as w1 in the loop, so count it separately
unigram_counts[tokens[-1]] += 1

# Show the first 10 entries of each table
for bigram, count in list(bigram_counts.items())[:10]:
    print(f"Bigram: {bigram}, Count: {count}")
for unigram, count in list(unigram_counts.items())[:10]:
    print(f"Unigram: {unigram}, Count: {count}")

Bigram: ('welcome', '.'), Count: 1
Bigram: ('.', 'this'), Count: 2
Bigram: ('this', 'is'), Count: 1
Bigram: ('is', 'a'), Count: 1
Bigram: ('a', 'simple'), Count: 2
Bigram: ('simple', 'test'), Count: 1
Bigram: ('test', 'text'), Count: 1
Bigram: ('text', 'for'), Count: 1
Bigram: ('for', 'model'), Count: 1
Bigram: ('model', 'training'), Count: 1
Unigram: welcome, Count: 1
Unigram: ., Count: 8
Unigram: this, Count: 4
Unigram: is, Count: 2
Unigram: a, Count: 3
Unigram: simple, Count: 3
Unigram: test, Count: 4
Unigram: text, Count: 3
Unigram: for, Count: 2
Unigram: model, Count: 3



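Bigram probability calculation

`bigram_probabilities` maps each bigram to the conditional probability of its second word given its first. This is what lets us generate text: at each step, the next word is sampled from the distribution conditioned on the current word. Each probability is the count of the bigram divided by the count of its first word. The calculations below are a simplified illustration with made-up counts; they are not taken from `text_data` (for example, "cat" does not appear in it).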
### Concrete Calculations

$$
P(\text{"my"} \mid \text{"hello"}) 
= \frac{\text{count}(\text{"hello"}, \text{"my"})}{\text{count}(\text{"hello"})}
= \frac{2}{2}
= 1.0
$$

$$
P(\text{"friend"} \mid \text{"my"})
= \frac{\text{count}(\text{"my"}, \text{"friend"})}{\text{count}(\text{"my"})}
= \frac{2}{3}
\approx 0.6667
$$

$$
P(\text{"cat"} \mid \text{"my"})
= \frac{\text{count}(\text{"my"}, \text{"cat"})}{\text{count}(\text{"my"})}
= \frac{1}{3}
\approx 0.3333
$$
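The same numbers can be reproduced in code. The sketch below assumes a hypothetical toy token sequence chosen so that its counts match the fractions above; it is not the notebook's `text_data`, and `cond_prob` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical toy tokens chosen to match the counts used above
toy_tokens = "hello my friend hello my friend my cat".split()

unigram = Counter(toy_tokens)
bigram = Counter(zip(toy_tokens, toy_tokens[1:]))

def cond_prob(w2, w1):
    """Estimate P(w2 | w1) as count(w1, w2) / count(w1)."""
    return bigram[(w1, w2)] / unigram[w1]

print(cond_prob("my", "hello"))   # count 2 / count 2
print(cond_prob("friend", "my"))  # count 2 / count 3
print(cond_prob("cat", "my"))     # count 1 / count 3
```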

---




In [44]:
bigram_probabilities = defaultdict(float)
# P(w2 | w1) = count(w1, w2) / count(w1)
for (w1, w2), count in bigram_counts.items():
    bigram_probabilities[(w1, w2)] = count / unigram_counts[w1]
# Show the first 10 probabilities
for bigram, prob in list(bigram_probabilities.items())[:10]:
    print(f"Bigram: {bigram}, Probability: {prob:.4f}")


Bigram: ('welcome', '.'), Probability: 1.0000
Bigram: ('.', 'this'), Probability: 0.2500
Bigram: ('this', 'is'), Probability: 0.2500
Bigram: ('is', 'a'), Probability: 0.5000
Bigram: ('a', 'simple'), Probability: 0.6667
Bigram: ('simple', 'test'), Probability: 0.3333
Bigram: ('test', 'text'), Probability: 0.2500
Bigram: ('text', 'for'), Probability: 0.3333
Bigram: ('for', 'model'), Probability: 0.5000
Bigram: ('model', 'training'), Probability: 0.3333
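One property worth checking is whether the probabilities that share the same first word sum to 1. With the count-based estimate used here they do, except for the word that ends the corpus: its unigram count includes a final occurrence that has no successor. The sketch below shows this on a hypothetical toy token sequence (not the notebook's `text_data`); since the notebook's corpus also ends with ".", the same effect should apply there.

```python
from collections import Counter, defaultdict

# Hypothetical toy tokens; the last token is "."
toy_tokens = "we will test a simple model . we will see .".split()

unigram_counts = Counter(toy_tokens)
bigram_counts = Counter(zip(toy_tokens, toy_tokens[1:]))
probs = {(w1, w2): c / unigram_counts[w1]
         for (w1, w2), c in bigram_counts.items()}

# Sum the outgoing probabilities for each first word
totals = defaultdict(float)
for (w1, _), p in probs.items():
    totals[w1] += p

for word, total in totals.items():
    print(f"{word!r}: {total:.2f}")
```

Here every word sums to 1.00 except ".", which sums to 0.50 because one of its two occurrences is the final token. This does not affect sampling with `random.choices`, which treats its weights as relative.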


In [None]:
import random

def generate_sentence(bigram_probabilities, start_word, max_length=8):
    """Generate up to max_length tokens by sampling each next word given the current one."""
    sentence = [start_word]
    for _ in range(max_length - 1):
        # Collect every word seen after the current word, with its probability
        next_word_candidates = [(w2, prob) for (w1, w2), prob in bigram_probabilities.items() if w1 == sentence[-1]]
        if not next_word_candidates:
            break  # no bigram starts with the current word
        next_words, probs = zip(*next_word_candidates)
        # Sample the next word in proportion to its bigram probability
        next_word = random.choices(next_words, weights=probs)[0]
        sentence.append(next_word)
    return ' '.join(sentence)

# Pick a random start word from the tokens; any word that occurs in the text works too
start_word = random.choice(tokens)
# Optional: re-pick the start word if, at its first occurrence, it is the last token or is not followed by a word
# if tokens.index(start_word) == len(tokens) - 1 or not tokens[tokens.index(start_word) + 1].isalpha():
#     start_word = random.choice(tokens[:-1])

print("Start word: ", start_word)
generated_sentence = generate_sentence(bigram_probabilities, start_word, max_length=10)
print("bigram_probabilities:", bigram_probabilities)
print("Generated sentence: ", generated_sentence)

Start word:  for
bigram_probabilities: defaultdict(<class 'float'>, {('welcome', '.'): 1.0, ('.', 'this'): 0.25, ('this', 'is'): 0.25, ('is', 'a'): 0.5, ('a', 'simple'): 0.6666666666666666, ('simple', 'test'): 0.3333333333333333, ('test', 'text'): 0.25, ('text', 'for'): 0.3333333333333333, ('for', 'model'): 0.5, ('model', 'training'): 0.3333333333333333, ('training', '.'): 1.0, ('.', 'hello'): 0.125, ('hello', 'my'): 1.0, ('my', 'dear'): 0.3333333333333333, ('dear', 'friend'): 1.0, ('friend', '.'): 1.0, ('this', 'text'): 0.5, ('text', 'is'): 0.3333333333333333, ('is', 'also'): 0.5, ('also', 'a'): 1.0, ('a', 'test'): 0.3333333333333333, ('test', 'and'): 0.5, ('and', 'hope'): 0.5, ('hope', 'you'): 1.0, ('you', 'like'): 0.5, ('like', 'my'): 1.0, ('my', 'example'): 0.3333333333333333, ('example', '.'): 1.0, ('.', 'we'): 0.25, ('we', 'will'): 1.0, ('will', 'test'): 1.0, ('test', 'a'): 0.25, ('simple', 'bigram'): 0.6666666666666666, ('bigram', 'language'): 0.5, ('language', 'model'): 1.0, ('