<a href="https://colab.research.google.com/github/mishba-ai/Learning-ML/blob/main/bigram_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Bigram language model


- Load the Dataset: Use the Brown corpus from NLTK.

- Tokenize the Text: Split the text into words or characters.

- Count Bigrams: Create a dictionary to count the frequency of each bigram.

- Calculate Probabilities: Compute the probability of each bigram.

- Generate Text: Use the bigram probabilities to generate new text.




In [18]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
from collections import defaultdict, Counter
import re
import random


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


## Load the dataset

In [2]:
# Access the corpus
text = " ".join(brown.words())
print(text[:500])  # Print the first 500 characters

The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place . The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted . The September-October term jury had been charged by Fulton Superior Court Judge Durwood Pye to invest


In [3]:
len(text)

6127073

## Tokenize the dataset

In [4]:
text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation: drop every character that is not a word character or whitespace

In [5]:
words = text.split()

In [6]:
print(words[:20])

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlantas', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities']
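To make the effect of this cleaning step concrete, here is a minimal sketch that applies the same regular expression to a short made-up string. The `sample` string is an assumption assembled from fragments visible in the corpus preview above; it is not a real line of the corpus. Note that apostrophes and hyphens are deleted rather than replaced by spaces, so the pieces on either side merge into one token.

```python
import re

# Hypothetical sample built from fragments seen in the corpus preview
sample = "Atlanta's term-end `` no evidence ''"

# Same pattern as the notebook: keep only word characters and whitespace
cleaned = re.sub(r"[^\w\s]", "", sample)

# split() with no arguments also collapses the extra spaces left behind
tokens = cleaned.split()
print(tokens)
```

This explains why `'Atlantas'` appears in the token list above. Case is preserved, so `It` and `it` remain distinct tokens.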


## Count bigrams in a Python dictionary

In [10]:
bigrams = list(zip(words[:-1], words[1:]))
bigram_counts = Counter(bigrams)

In [11]:
# Sort the (bigram, count) pairs by descending frequency.
# Negating the count in the key makes sorted() put the largest counts first.
sorted(bigram_counts.items(), key=lambda kv: -kv[1])

[(('of', 'the'), 9638),
 (('in', 'the'), 5552),
 (('to', 'the'), 3437),
 (('on', 'the'), 2303),
 (('and', 'the'), 2141),
 (('for', 'the'), 1761),
 (('to', 'be'), 1697),
 (('at', 'the'), 1509),
 (('with', 'the'), 1479),
 (('of', 'a'), 1466),
 (('that', 'the'), 1388),
 (('from', 'the'), 1353),
 (('in', 'a'), 1318),
 (('by', 'the'), 1313),
 (('as', 'a'), 908),
 (('it', 'is'), 886),
 (('with', 'a'), 882),
 (('is', 'a'), 871),
 (('of', 'his'), 806),
 (('is', 'the'), 791),
 (('was', 'a'), 784),
 (('had', 'been'), 761),
 (('it', 'was'), 746),
 (('for', 'a'), 734),
 (('he', 'was'), 728),
 (('as', 'the'), 691),
 (('into', 'the'), 674),
 (('he', 'had'), 669),
 (('to', 'a'), 655),
 (('have', 'been'), 649),
 (('and', 'a'), 614),
 (('would', 'be'), 608),
 (('the', 'same'), 598),
 (('one', 'of'), 594),
 (('will', 'be'), 593),
 (('It', 'is'), 587),
 (('has', 'been'), 567),
 (('in', 'his'), 566),
 (('that', 'he'), 563),
 (('It', 'was'), 553),
 (('of', 'this'), 547),
 (('the', 'first'), 541),
 (('was',
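As a small illustration of the counting step, the sketch below applies the same `zip` and `Counter` pattern to a made-up sentence. The names `toy_words`, `toy_bigrams`, and `toy_counts` are hypothetical and the sentence is not from the Brown corpus. It also uses `Counter.most_common`, a standard-library alternative to sorting by negated count.

```python
from collections import Counter

# Made-up toy corpus, not taken from the Brown corpus
toy_words = "the cat sat on the mat the cat ran".split()

# Pair each word with its successor: n words give n - 1 bigrams
toy_bigrams = list(zip(toy_words[:-1], toy_words[1:]))
toy_counts = Counter(toy_bigrams)

# most_common(k) returns the k most frequent items, highest count first
top = toy_counts.most_common(1)
print(top)
```

On the full corpus, `bigram_counts.most_common(10)` would return only the ten most frequent bigrams instead of the whole sorted list.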

## Calculate Probabilities
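
The estimate computed in this section is the relative frequency of each bigram among all bigrams that share the same first word. Written out, with $C(w_1, w_2)$ as an assumed notation for the count of the bigram $(w_1, w_2)$:

$$
P(w_2 \mid w_1) = \frac{C(w_1, w_2)}{\sum_{w} C(w_1, w)}
$$

The denominator sums over every word $w$ observed after $w_1$, so the probabilities for a fixed $w_1$ add up to 1.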




In [12]:
# Precompute the total count of bigrams starting with each word (w1)
unigram_counts = defaultdict(int)
for (w1, w2), count in bigram_counts.items():
    unigram_counts[w1] += count  # Total occurrences of w1 as the starting word

In [13]:
# Compute P(w2 | w1) for every observed bigram in a single pass over the distinct bigrams
bigram_probabilities = defaultdict(dict)
for (w1, w2), count in bigram_counts.items():
    total = unigram_counts[w1]  # Precomputed total for w1
    bigram_probabilities[w1][w2] = count / total  # Relative frequency of w2 after w1
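
A quick sanity check for this step is that the probabilities of all successors of a given word sum to 1. The sketch below repeats the two passes on a made-up toy corpus and verifies that property. All names here (`toy_words`, `context_totals`, `toy_probs`) are hypothetical stand-ins for the notebook's variables.

```python
from collections import Counter, defaultdict
import math

# Made-up toy corpus, not taken from the Brown corpus
toy_words = "the cat sat on the mat the cat ran".split()
toy_bigram_counts = Counter(zip(toy_words[:-1], toy_words[1:]))

# Pass 1: total count of bigrams that start with each word
context_totals = defaultdict(int)
for (w1, _), count in toy_bigram_counts.items():
    context_totals[w1] += count

# Pass 2: relative frequency of each successor given the first word
toy_probs = defaultdict(dict)
for (w1, w2), count in toy_bigram_counts.items():
    toy_probs[w1][w2] = count / context_totals[w1]

# Each conditional distribution should sum to 1 (up to float rounding)
for w1, dist in toy_probs.items():
    assert math.isclose(sum(dist.values()), 1.0)

print(toy_probs["the"])
```

In this toy corpus "the" is followed by "cat" twice and "mat" once, so the two probabilities are 2/3 and 1/3.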

## Generate text

In [21]:
# Generate text with weighted random choice

current_word = "peace"
generated_text = [current_word]
for _ in range(30):  # Generate up to 30 more words
    # Stop at a dead end: a word that never starts a bigram has no successors
    if current_word not in bigram_probabilities:
        break
    next_word_probs = bigram_probabilities[current_word].items()

    # Convert to lists for random.choices
    next_words = [word for word, _ in next_word_probs]
    probabilities = [prob for _, prob in next_word_probs]

    # Choose next word based on probabilities
    if next_words:
        next_word = random.choices(next_words, weights=probabilities, k=1)[0]
        generated_text.append(next_word)
        current_word = next_word
    else:
        break

print(" ".join(generated_text))

peace The maximum of protein identification One family in 1959 At the welfare state of aqueous phase and invention is a quiet Of all this year she wrote I was and
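
Because the sampling is random, each run of the cell above produces different text. One possible way to make runs repeatable is to wrap the loop in a function that takes a seed. The sketch below is a hedged illustration, not the notebook's own code: the function name `generate`, its parameters, and the tiny `toy_probs` table are all assumptions made for the example.

```python
import random

def generate(probs, start_word, max_new_words=30, seed=None):
    """Sample a word sequence from a nested dict {w1: {w2: P(w2 | w1)}}."""
    rng = random.Random(seed)  # Private generator, so the seed affects only this call
    out = [start_word]
    current = start_word
    for _ in range(max_new_words):
        followers = probs.get(current)  # .get avoids creating keys in a defaultdict
        if not followers:
            break  # Dead end: no observed successor for this word
        candidates = list(followers)
        weights = list(followers.values())
        current = rng.choices(candidates, weights=weights, k=1)[0]
        out.append(current)
    return out

# Hypothetical probability table: "c" has no successors, so it ends generation
toy_probs = {"a": {"b": 1.0}, "b": {"c": 0.5, "a": 0.5}}
sample = generate(toy_probs, "a", max_new_words=10, seed=0)
print(" ".join(sample))
```

With the notebook's data, the equivalent call would be `generate(bigram_probabilities, "peace", seed=0)`, and the same seed would return the same sequence on every run.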




---



## Counting Bigrams in a 2D PyTorch Tensor

In [None]:
import torch