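Tokenization: split() splits on whitespace only, so punctuation stays attached to words (for example 'times,' and 'wicked;' in this corpus). As a result, 'wicked', 'wicked,' and 'wicked;' are counted as three different tokens. For real tasks you will likely want a better tokenizer.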
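
To make the punctuation issue concrete, here is a minimal sketch of a regex-based tokenizer. The helper name tokenize and the pattern are illustrative assumptions, not part of this notebook, and the pattern only handles plain ASCII letters and apostrophes.

```python
import re

def tokenize(sentence):
    # Hypothetical helper: lowercase the text, then keep runs of
    # letters/apostrophes. Punctuation such as ';' and '.' is dropped.
    return re.findall(r"[a-z']+", sentence.lower())

text = "Do not envy the wicked; do not desire their company."

# Whitespace splitting keeps 'wicked;' and 'company.' as tokens.
print(text.lower().split())

# The regex version yields 'wicked' and 'company' instead.
print(tokenize(text))
```

With a tokenizer like this, the variants of 'wicked' would collapse into a single token, which changes the bigram counts shown later.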

Defaultdict: saves you from writing if bigram in dict: ... else: ... by hand. With defaultdict(int), a missing key starts at 0 the first time it is accessed.
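
The difference can be sketched side by side. The small pairs list below is made up for illustration; both loops are expected to produce the same counts.

```python
from collections import defaultdict

# Illustrative input, not the notebook's corpus.
pairs = [("do", "not"), ("not", "envy"), ("do", "not")]

# Plain dict: the missing-key case must be handled explicitly.
plain_counts = {}
for bigram in pairs:
    if bigram in plain_counts:
        plain_counts[bigram] += 1
    else:
        plain_counts[bigram] = 1

# defaultdict(int): a missing key is created with int() == 0 on first access.
default_counts = defaultdict(int)
for bigram in pairs:
    default_counts[bigram] += 1

print(plain_counts == dict(default_counts))
```

One thing to keep in mind: merely reading default_counts[key] for an unseen key also inserts that key with value 0, so use .get(key, 0) when you only want to look a count up.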

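Ties: When multiple next-words have the same frequency (e.g., after 'through', which is followed once each by 'understanding' and 'knowledge'), the earliest one encountered during counting is returned. This works because dictionaries preserve insertion order and max() returns the first maximal item it sees.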
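
A short sketch of the tie behaviour, using the two words that follow 'through' in the corpus. The alphabetical tie-break at the end is only one possible alternative, shown here as an assumption about what you might prefer.

```python
# Same counts, different insertion order.
candidates = {"understanding": 1, "knowledge": 1}
reversed_candidates = {"knowledge": 1, "understanding": 1}

# max() keeps the first maximal key it meets, so insertion order decides.
first = max(candidates, key=candidates.get)
second = max(reversed_candidates, key=reversed_candidates.get)
print(first, second)

# A deterministic alternative: highest count first, then alphabetical order.
alphabetical = min(candidates, key=lambda w: (-candidates[w], w))
print(alphabetical)
```

If reproducibility across differently ordered corpora matters, an explicit tie-break like the last line avoids depending on the order in which sentences were counted.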

Out-of-vocabulary: If you ask for a word that never appears as the first element of any bigram, the function returns None.
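
The None case is easy to hit in practice. The sketch below rebuilds a one-sentence version of the model so that it runs on its own; predict_or_default and the "<unk>" placeholder are hypothetical additions, not part of this notebook.

```python
from collections import defaultdict

# One-sentence stand-in for the notebook's tokenized corpus.
bigram_counts = defaultdict(int)
for sentence in [["do", "not", "envy", "the", "wicked;"]]:
    for first, second in zip(sentence, sentence[1:]):
        bigram_counts[(first, second)] += 1

def predict_next_word(word):
    candidates = {k[1]: v for k, v in bigram_counts.items() if k[0] == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

def predict_or_default(word, default="<unk>"):
    # Hypothetical wrapper: substitute a placeholder when nothing is known.
    prediction = predict_next_word(word)
    return default if prediction is None else prediction

print(predict_next_word("do"))       # seen as a first word
print(predict_next_word("wicked;"))  # only ever seen at the end of a sentence
print(predict_next_word("Do"))       # the corpus was lowercased, the input was not
print(predict_or_default("zebra"))   # never seen at all
```

Note that sentence-final tokens and inputs that were not lowercased behave exactly like unseen words, so it can help to lowercase the query before calling the function.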

In [1]:
# Imports defaultdict which supplies a default value (like 0) for missing keys.

from collections import defaultdict

1. Dataset

In [2]:
# A tiny corpus: a list of sentences (strings). We'll learn bigrams from this.

dataset = [
    "Though the righteous fall seven times, they rise again, but the wicked stumble when calamity strikes.",
    "Do not gloat when your enemy falls; when they stumble, do not let your heart rejoice.",
    "Do not envy the wicked; do not desire their company.",
    "Do not fret because of evildoers or be envious of the wicked, for the evildoer has no future hope, and the lamp of the wicked will be snuffed out.",
    "By wisdom a house is built, and through understanding it is established; through knowledge its rooms are filled with rare and beautiful treasures."
]

In [3]:
dataset

['Though the righteous fall seven times, they rise again, but the wicked stumble when calamity strikes.',
 'Do not gloat when your enemy falls; when they stumble, do not let your heart rejoice.',
 'Do not envy the wicked; do not desire their company.',
 'Do not fret because of evildoers or be envious of the wicked, for the evildoer has no future hope, and the lamp of the wicked will be snuffed out.',
 'By wisdom a house is built, and through understanding it is established; through knowledge its rooms are filled with rare and beautiful treasures.']

2. Tokenize sentences

In [4]:
# List comprehension:
# - sentence.lower(): make everything lowercase so 'Do' and 'do' are treated the same.
# - .split(): split on whitespace into tokens (words).
# Result: list of lists, e.g. [["do", "not", "envy", "the", "wicked;", ...], ...]
# Note that punctuation stays attached to the token ("wicked;").

tokenized_data = [sentence.lower().split() for sentence in dataset]

In [5]:
tokenized_data

[['though',
  'the',
  'righteous',
  'fall',
  'seven',
  'times,',
  'they',
  'rise',
  'again,',
  'but',
  'the',
  'wicked',
  'stumble',
  'when',
  'calamity',
  'strikes.'],
 ['do',
  'not',
  'gloat',
  'when',
  'your',
  'enemy',
  'falls;',
  'when',
  'they',
  'stumble,',
  'do',
  'not',
  'let',
  'your',
  'heart',
  'rejoice.'],
 ['do',
  'not',
  'envy',
  'the',
  'wicked;',
  'do',
  'not',
  'desire',
  'their',
  'company.'],
 ['do',
  'not',
  'fret',
  'because',
  'of',
  'evildoers',
  'or',
  'be',
  'envious',
  'of',
  'the',
  'wicked,',
  'for',
  'the',
  'evildoer',
  'has',
  'no',
  'future',
  'hope,',
  'and',
  'the',
  'lamp',
  'of',
  'the',
  'wicked',
  'will',
  'be',
  'snuffed',
  'out.'],
 ['by',
  'wisdom',
  'a',
  'house',
  'is',
  'built,',
  'and',
  'through',
  'understanding',
  'it',
  'is',
  'established;',
  'through',
  'knowledge',
  'its',
  'rooms',
  'are',
  'filled',
  'with',
  'rare',
  'and',
  'beautiful',
  'trea

3. Build bigram counts

In [6]:
# Dictionary mapping (word1, word2) -> count.
# defaultdict(int) makes unseen keys start at 0 automatically.
bigram_counts = defaultdict(int)

# Loop over each tokenized sentence (a list of words).
for sentence in tokenized_data:
    # A sentence with N words has N-1 bigrams (pairs of consecutive words),
    # so i runs over the indices 0 .. N-2.
    for i in range(len(sentence) - 1):
        bigram = (sentence[i], sentence[i + 1])  # current word and the next word
        bigram_counts[bigram] += 1               # increment the count for this bigram

print("Bigram frequencies:\n")
# Iterate through all learned bigrams and print each tuple with its frequency.
for bigram, count in bigram_counts.items():
    print(bigram, ":", count)

Bigram frequencies:

('though', 'the') : 1
('the', 'righteous') : 1
('righteous', 'fall') : 1
('fall', 'seven') : 1
('seven', 'times,') : 1
('times,', 'they') : 1
('they', 'rise') : 1
('rise', 'again,') : 1
('again,', 'but') : 1
('but', 'the') : 1
('the', 'wicked') : 2
('wicked', 'stumble') : 1
('stumble', 'when') : 1
('when', 'calamity') : 1
('calamity', 'strikes.') : 1
('do', 'not') : 5
('not', 'gloat') : 1
('gloat', 'when') : 1
('when', 'your') : 1
('your', 'enemy') : 1
('enemy', 'falls;') : 1
('falls;', 'when') : 1
('when', 'they') : 1
('they', 'stumble,') : 1
('stumble,', 'do') : 1
('not', 'let') : 1
('let', 'your') : 1
('your', 'heart') : 1
('heart', 'rejoice.') : 1
('not', 'envy') : 1
('envy', 'the') : 1
('the', 'wicked;') : 1
('wicked;', 'do') : 1
('not', 'desire') : 1
('desire', 'their') : 1
('their', 'company.') : 1
('not', 'fret') : 1
('fret', 'because') : 1
('because', 'of') : 1
('of', 'evildoers') : 1
('evildoers', 'or') : 1
('or', 'be') : 1
('be', 'envious') : 1
('envious
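
The same counting step can also be written with collections.Counter and zip. This is an alternative sketch, not the approach used in the cell above, and the single tokenized sentence is included only so the block runs by itself.

```python
from collections import Counter

# Stand-in for tokenized_data: one sentence from the corpus.
tokenized = [
    ["do", "not", "envy", "the", "wicked;",
     "do", "not", "desire", "their", "company."],
]

# zip(sentence, sentence[1:]) pairs each word with the word that follows it,
# which yields exactly the N-1 bigrams of an N-word sentence.
bigram_counter = Counter(
    pair
    for sentence in tokenized
    for pair in zip(sentence, sentence[1:])
)

# most_common(n) returns the n most frequent bigrams with their counts.
print(bigram_counter.most_common(1))
```

Counter behaves like a dictionary for lookups, so the prediction function in the next section would work with it unchanged.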

4. Predict next word given a word

In [7]:
# Dictionary comprehension:
# - Look at every bigram (k is a tuple like ('do','not')), v is its count.
# - Keep only those whose first word (k[0]) matches the input 'word'.
# - Map each candidate next word (k[1]) to its count.
# Example: if word == 'do', candidates is {'not': 5}.
# max(candidates, key=candidates.get) then returns the next word with the highest count.

def predict_next_word(word):
    candidates = {k[1]: v for k, v in bigram_counts.items() if k[0] == word}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Test prediction
print("\nPrediction examples:")
print("After 'do' ->", predict_next_word("do"))
print("After 'calamity' ->", predict_next_word("calamity"))
print("After 'through' ->", predict_next_word("through"))


Prediction examples:
After 'do' -> not
After 'calamity' -> strikes.
After 'through' -> understanding
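
predict_next_word scans every bigram on each call and returns only the single best word. As a possible extension, the sketch below indexes followers by first word and turns counts into relative frequencies, count(w1, w2) / count(w1, anything). The names next_words and next_word_probabilities are assumptions for this example, and the two sentences are a subset of the corpus so the block runs standalone.

```python
from collections import Counter, defaultdict

tokenized = [
    "do not envy the wicked; do not desire their company.".split(),
    ("by wisdom a house is built, and through understanding it is "
     "established; through knowledge its rooms are filled with rare "
     "and beautiful treasures.").split(),
]

# Nested index: first word -> Counter of the words that follow it.
next_words = defaultdict(Counter)
for sentence in tokenized:
    for first, second in zip(sentence, sentence[1:]):
        next_words[first][second] += 1

def next_word_probabilities(word):
    # .get() avoids inserting an empty Counter for unseen words.
    followers = next_words.get(word)
    if not followers:
        return {}
    total = sum(followers.values())
    return {w: count / total for w, count in followers.items()}

print(next_word_probabilities("do"))
print(next_word_probabilities("through"))
print(next_word_probabilities("zebra"))
```

Here 'through' splits its probability evenly between 'understanding' and 'knowledge', which makes the tie that predict_next_word resolves silently visible to the caller.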
