<a href="https://colab.research.google.com/github/pushan9/Colab-notebook/blob/main/Text_Generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Demo: Text Generation**

This demo uses the Natural Language Toolkit (NLTK) and the Brown corpus to show text generation with a trigram-based Markov chain model: each new word is chosen from the words that followed the previous two words in the corpus.

## **Steps to Perform:**

Step 1: Import the Necessary Libraries

Step 2: Define Stopwords and Punctuation

Step 3: Load Sentences and Generate N-grams

Step 4: Remove Stopwords from N-grams

Step 5: Calculate Frequency Distributions

Step 6: Create a Dictionary of Trigram Frequencies

Step 7: Define the Text Generation Function

Step 8: Execute the Text Generation Function

### **Step 1: Import the Necessary Libraries**





*   Import the necessary libraries.
*   Download the necessary NLTK packages and corpus.



In [None]:
# Import necessary libraries
import string # string.punctuation → a string of ASCII punctuation characters (.,?!;: and so on). Useful for cleaning text before generating (removing unwanted symbols).

import random # For random sampling — selecting a random word or next token based on probability.

import nltk # one of the oldest and most widely used NLP libraries in Python.

from nltk import FreqDist # a class for counting word frequencies.

from nltk.corpus import brown # Brown Corpus, one of the first large, balanced English text corpora.
# The Brown corpus contains around a million words from diverse sources — news, fiction, essays, etc.

from collections import defaultdict, Counter # defaultdict → dictionary with a default value for missing keys (avoids key errors). Counter → dictionary subclass that counts occurrences.

from nltk.util import ngrams
# Automatically creates n-grams (groups of consecutive words):
# n=2 → bigrams (“machine learning”)
# n=3 → trigrams (“deep learning model”)
# eg: I am a good boy → bigrams: (I am), (am a), (a good), (good boy)
# eg: I am a good boy → trigrams: (I am a), (am a good), (a good boy)

# Download necessary NLTK packages and corpus
nltk.download('punkt') # Tokenizer models for splitting text into words/sentences.
nltk.download('stopwords') # Common words (the, is, in) to ignore during text processing.
nltk.download('brown') # Brown Corpus for training language models.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

### **Step 2: Define Stopwords and Punctuation**

*   Stopwords are common words in a language that are often considered to be of little value in text analysis.
*   Punctuation refers to characters used to separate sentences, clauses, phrases, or words in writing.





In [None]:
# Define stopwords and punctuation
stop_words = set(nltk.corpus.stopwords.words('english'))
string.punctuation += '"\'-—' # Appends ", ', - and — to Python's built-in punctuation string. Only the em dash (—) is actually new; ", ' and - are already included. This rebinds the module attribute for the whole session, so rerunning the cell appends the characters again.

removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']
# 'lt' and 'rt' are leftovers typical of scraped web or Twitter text:
# lt = residue of the HTML entity &lt; ("less than", <)
# rt = "retweet"
# The Brown corpus was compiled from texts published in 1961, so it is unlikely to contain them;
# keeping them in the list is a harmless precaution.
# The goal is to keep content-bearing tokens such as:
# ["technology", "advances", "rapidly"]
# and to drop tokens such as "the", "—", "lt", and "rt" from messy input like:
# ["the", "technology", "advances", "—", "lt", "rt"]
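The filtering idea described in Step 2 can be tried on a single token list. This is a minimal sketch, assuming a tiny hand-written stopword set (`toy_stop_words`) in place of NLTK's English list, and a set (`toy_removal`) in place of the notebook's `removal_list`; both names are made up for illustration.

```python
import string

# Assumption: a tiny hand-written stopword set stands in for NLTK's English list.
toy_stop_words = {"the", "is", "in", "a"}

# A set gives constant-time membership tests, unlike a list.
toy_removal = toy_stop_words | set(string.punctuation) | {"—", "lt", "rt"}

tokens = ["the", "technology", "advances", "—", "lt", "rt"]

# Keep only tokens that are not in the removal set.
cleaned = [tok for tok in tokens if tok.lower() not in toy_removal]
print(cleaned)
```

Converting the punctuation string to a set makes the membership test match whole tokens only, whereas testing `tok in string.punctuation` directly would be a substring check.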

### **Step 3: Load Sentences and Generate N-grams**

*   Load sentences from the Brown corpus and generate N-grams.
*   By the end of this process, **unigram**, **bigram**, and **trigram** lists will contain the respective N-grams for the sentences in the Brown corpus.





In [None]:
# Load sentences from the Brown corpus
sents = brown.sents()
# Loads all sentences from the Brown corpus.
# Each sentence is returned as a list of words, e.g.
# ["The", "jury", "said", "it", "found", "this", "decision", "unfair"].

# Let's print the first 2 sentences to see how they look
print("First 2 sentences from Brown Corpus:")
for i in range(2):
    print(' '.join(sents[i]))

# Print the total number of sentences in the Brown corpus
print(f"\nTotal sentences in Brown Corpus: {len(sents)}")

First 2 sentences from Brown Corpus:
The Fulton County Grand Jury said Friday an investigation of Atlanta's recent primary election produced `` no evidence '' that any irregularities took place .
The jury further said in term-end presentments that the City Executive Committee , which had over-all charge of the election , `` deserves the praise and thanks of the City of Atlanta '' for the manner in which the election was conducted .

Total sentences in Brown Corpus: 57340


In [None]:
# Initialize lists for storing n-grams
# Creates three empty lists to hold:
unigram = [] # single words
bigram = [] # pairs of consecutive words
trigram = [] # triplets of consecutive words

# Generate n-grams
for sentence in sents:
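    sentence = [word.lower() for word in sentence if word not in string.punctuation] # Lowercases each word and drops single-character punctuation tokens. Because string.punctuation is a string, this is a substring test, so multi-character tokens such as `` and '' are kept.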
    unigram.extend(sentence) # Adds single words to the unigram list.
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True))) # pad_left=True and pad_right=True add special None tokens at sentence boundaries so you can model sentence beginnings and endings.
    # Example:
    # Sentence = ["the","cat","sat"] →
    # Bigrams = [(None,"the"), ("the","cat"), ("cat","sat"), ("sat",None)]

    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))
    # Example:
    # Sentence = ["the","cat","sat"] →
    # Trigrams = [(None,None,"the"), (None,"the","cat"), ("the","cat","sat"), ("cat","sat",None), ("sat",None,None)].

In [None]:
# Let's print a few samples of each n-gram type
print("\nSample Unigrams:", random.sample(unigram, 10))
print("\nSample Bigrams:", random.sample(bigram, 10))
print("\nSample Trigrams:", random.sample(trigram, 10))


Sample Unigrams: ['thunder', 'drag', 'persisted', 'carried', 'pope', 'with', 'bill', 'night', 'in', 'was']

Sample Bigrams: [('also', 'will'), ('not', 'exceed'), ('new', 'as'), ('all', 'questions'), ('her', 'unusual'), ('and', 'all'), ('differences', 'the'), ("ingleside's", 'drunk-and-disorderlies'), ('and', '``'), ('it', 'by')]

Sample Trigrams: [('eyes', 'of', 'all'), ('as', 'a', 'result'), ('as', 'an', 'air'), ('of', 'these', 'benefits'), ('few', 'communities', 'that'), (None, None, 'the'), ('new', 'york', None), ('other', 'group', 'of'), ('robert', 'zeising', 'mrs.'), ('at', 'about', '600-degrees')]


In [None]:
# Print the first 10 bigrams to verify padding
print("\nFirst 10 Bigrams with padding:", bigram[:10])
# Print the last 10 bigrams to verify padding
print("\nLast 10 Bigrams with padding:", bigram[-10:])


First 10 Bigrams with padding: [(None, 'the'), ('the', 'fulton'), ('fulton', 'county'), ('county', 'grand'), ('grand', 'jury'), ('jury', 'said'), ('said', 'friday'), ('friday', 'an'), ('an', 'investigation'), ('investigation', 'of')]

Last 10 Bigrams with padding: [('glance', 'the'), ('the', 'figure'), ('figure', 'inside'), ('inside', 'the'), ('the', 'coral-colored'), ('coral-colored', 'boucle'), ('boucle', 'dress'), ('dress', 'was'), ('was', 'stupefying'), ('stupefying', None)]


In [None]:
# Print the first 10 trigrams to verify padding
print("\nFirst 10 Trigrams with padding:", trigram[:10])
# Print the last 10 trigrams to verify padding
print("\nLast 10 Trigrams with padding:", trigram[-10:])


First 10 Trigrams with padding: [(None, None, 'the'), (None, 'the', 'fulton'), ('the', 'fulton', 'county'), ('fulton', 'county', 'grand'), ('county', 'grand', 'jury'), ('grand', 'jury', 'said'), ('jury', 'said', 'friday'), ('said', 'friday', 'an'), ('friday', 'an', 'investigation'), ('an', 'investigation', 'of')]

Last 10 Trigrams with padding: [('glance', 'the', 'figure'), ('the', 'figure', 'inside'), ('figure', 'inside', 'the'), ('inside', 'the', 'coral-colored'), ('the', 'coral-colored', 'boucle'), ('coral-colored', 'boucle', 'dress'), ('boucle', 'dress', 'was'), ('dress', 'was', 'stupefying'), ('was', 'stupefying', None), ('stupefying', None, None)]
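To make the padding behaviour shown above concrete, here is a small standard-library sketch that reproduces the two examples from the code comments. `padded_ngrams` is a hypothetical helper written for illustration; it is not part of NLTK and only mimics what `ngrams(..., pad_left=True, pad_right=True)` produced in the outputs above.

```python
def padded_ngrams(tokens, n, pad=None):
    """Hypothetical helper: pad with n-1 symbols on each side, then slide a window of size n."""
    padded = [pad] * (n - 1) + list(tokens) + [pad] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


sentence = ["the", "cat", "sat"]
bigrams = padded_ngrams(sentence, 2)
trigrams = padded_ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

With padding, a sentence of `k` words yields `k + n - 1` n-grams, so the three-word sentence gives four bigrams and five trigrams.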


### **Step 4: Remove Stopwords from N-grams**

*   Define a function that drops every N-gram containing a stopword, punctuation mark, or other token from the removal list.
*   Use it to clean the bigrams and trigrams.





In [None]:
# Print the first 10 items in removal_list
print("\nRemoval List Sample:", removal_list[:10])


Removal List Sample: ['theirs', 'than', 'herself', 'where', 'all', 'at', 'those', 'does', 'ain', 'there']


In [None]:
# Function to drop n-grams that contain any token from removal_list
def remove_stopwords(grams, n):
    # n is kept so the calls below stay unchanged; the filter itself works for any n-gram size.
    # The parameter is named grams to avoid shadowing the imported ngrams function.
    removal_set = set(removal_list) # Set membership is much faster than scanning a list.
    return [gram for gram in grams if not any(token in removal_set for token in gram)]

# Remove stopwords from n-grams
bigram = remove_stopwords(bigram, 2)
trigram = remove_stopwords(trigram, 3)


In [None]:
# Let's print a few samples of cleaned n-grams
print("\nSample Cleaned Bigrams:", random.sample(bigram, 10))
print("\nSample Cleaned Trigrams:", random.sample(trigram, 10))


Sample Cleaned Bigrams: [('utopia', None), ('vandals', 'naught'), ('guided', 'exposure'), ('small', 'matter'), ('quiet', 'moment'), ('plastics', 'units'), ("watson-watt's", 'remarks'), ('quickly', 'away'), ('md.', 'march'), ('gratitude', None)]

Sample Cleaned Trigrams: [(None, None, 'among'), (None, None, '``'), ('three', 'oranges', 'gay'), ('screeching', 'war', 'whoop'), (None, None, 'fresh'), ('instead', 'met', 'violent'), ('find', "kayabashi's", 'secretary'), ('ap', None, None), ('sizzling', 'day', None), (None, None, 'issue')]
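One detail of Step 4 is easy to miss: the filter discards a whole n-gram as soon as one of its words is in the removal list, and it leaves the `None` padding alone. The sketch below illustrates this on toy data; `toy_removal` and `toy_trigrams` are assumptions made up for the example.

```python
# Assumption: a small removal set and a few hand-written trigrams.
toy_removal = {"the", "of", "on"}
toy_trigrams = [
    (None, None, "the"),        # dropped: contains "the"
    ("king", "of", "spain"),    # dropped entirely, not shortened to ("king", "spain")
    ("heavy", "rain", "fell"),  # kept
    ("rain", "fell", None),     # kept: None padding is not in the removal set
]

# Keep an n-gram only if none of its tokens is in the removal set.
kept = [gram for gram in toy_trigrams if not any(tok in toy_removal for tok in gram)]
print(kept)
```

Because n-grams are removed rather than rebuilt, the surviving trigrams are always three words that were adjacent in the original sentence.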


### **Step 5: Calculate Frequency Distributions**

*   Calculate the frequency distributions of the bigrams and trigrams.



In [None]:
# Calculate frequency distributions
# FreqDist() is a built-in NLTK class that counts how many times each unique item appears in a list.
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)


In [None]:
# Let's print the most common bigrams and trigrams
print("\nMost Common Bigrams:", freq_bi.most_common(10))
print("\nMost Common Trigrams:", freq_tri.most_common(10))
print()
# Let's also print the least common n-grams to see rare combinations
print("\nLeast Common Bigrams:", freq_bi.most_common()[-10:])
print("\nLeast Common Trigrams:", freq_tri.most_common()[-10:])


Most Common Bigrams: [(("''", None), 4747), ((None, '``'), 4177), (('said', None), 445), ((None, 'one'), 401), (('united', 'states'), 392), (('said', '``'), 323), (('new', 'york'), 296), ((None, 'mr.'), 241), (('af', None), 236), ((None, '--'), 219)]

Most Common Trigrams: [(("''", None, None), 4747), ((None, None, '``'), 4177), (('said', None, None), 445), ((None, None, 'one'), 401), ((None, None, None), 242), ((None, None, 'mr.'), 241), (('af', None, None), 236), ((None, None, '--'), 219), (('time', None, None), 205), ((None, None, 'even'), 190)]


Least Common Bigrams: [(("novelist's", 'carping'), 1), (('carping', 'phrase'), 1), (('lower', 'lip'), 1), (('voluptuous', None), 1), (('swift', 'greedy'), 1), (('greedy', 'glance'), 1), (('figure', 'inside'), 1), (('coral-colored', 'boucle'), 1), (('boucle', 'dress'), 1), (('stupefying', None), 1)]

Least Common Trigrams: [(('j.', 'perelman', None), 1), (('perelman', None, None), 1), ((None, None, 'revulsion'), 1), (('train', 'slid', 'shu
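`FreqDist` behaves like Python's `collections.Counter` (it is implemented as a subclass of it), so the counting step can be sketched without NLTK. The toy bigram list below is an assumption for illustration only.

```python
from collections import Counter

# Assumption: a short hand-written list of bigrams.
toy_bigrams = [
    ("new", "york"),
    ("united", "states"),
    ("new", "york"),
    ("new", "york"),
    ("united", "states"),
    ("quiet", "moment"),
]

freq = Counter(toy_bigrams)

# most_common() returns (item, count) pairs sorted from most to least frequent,
# so slicing from the end gives the rarest items.
ranked = freq.most_common()
most_frequent = ranked[0]
least_frequent = ranked[-1]
print(most_frequent, least_frequent)
```

The same slicing trick is what `freq_bi.most_common()[-10:]` relies on in the cell above.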

### **Step 6: Create a Dictionary of Trigram Frequencies**

*   Create a dictionary that maps each two-word context to the counts of the words that follow it, for use in the text generation function.



In [None]:
print("\nMost Common Trigrams:", freq_tri.most_common(10))


Most Common Trigrams: [(("''", None, None), 4747), ((None, None, '``'), 4177), (('said', None, None), 445), ((None, None, 'one'), 401), ((None, None, None), 242), ((None, None, 'mr.'), 241), (('af', None, None), 236), ((None, None, '--'), 219), (('time', None, None), 205), ((None, None, 'even'), 190)]


In [None]:
# Create a dictionary of trigram frequencies
d = defaultdict(Counter) # defaultdict with Counter as default factory. This allows us to create a nested dictionary where each key maps to another dictionary that counts occurrences.

for ngram in freq_tri:
    if None not in ngram: # Skip trigrams that contain sentence-boundary padding (None).
        d[ngram[:-1]][ngram[-1]] += freq_tri[ngram]
        # ngram[:-1] → All words except the last one → the context (first two words)
        # ngram[-1] → The last word → the predicted next word
        # freq_tri[ngram] → Frequency count of the trigram i.e. How often that full trigram appeared

# Print sample entries from the trigram frequency dictionary
print("\nSample Trigram Frequency Dictionary Entries:")
sample_keys = random.sample(list(d.keys()), 5) # Randomly selects 5 contexts (first two words).
for key in sample_keys:
    print(f"Context: {key} -> Next Word Counts: {d[key]}")


Sample Trigram Frequency Dictionary Entries:
Context: ('last', 'six') -> Next Word Counts: Counter({'months': 3})
Context: ("hogan's", 'patience') -> Next Word Counts: Counter({'ran': 1})
Context: ('vaughn', 'knows') -> Next Word Counts: Counter({'better': 1})
Context: ('offering', 'tuesday') -> Next Word Counts: Counter({'consisted': 1})
Context: ('approach', 'amply') -> Next Word Counts: Counter({'confirm': 1})
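The nested dictionary stores raw counts per context. Dividing each count by the context total gives the conditional probability of the next word, which is what the generator samples from implicitly. The following is a minimal sketch with made-up counts; `toy_counts`, `model`, and `next_word_probs` are hypothetical names, not part of the notebook.

```python
from collections import Counter, defaultdict

# Assumption: hand-written trigram counts standing in for freq_tri.
toy_counts = {
    ("last", "six", "months"): 3,
    ("last", "six", "weeks"): 1,
    ("six", "months", "ago"): 2,
}

# Same structure as d: (word1, word2) -> Counter of next words.
model = defaultdict(Counter)
for (w1, w2, w3), count in toy_counts.items():
    model[(w1, w2)][w3] += count


def next_word_probs(context):
    """Return P(next word | context) as a dict; empty if the context is unknown."""
    counts = model.get(context, Counter())
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()} if total else {}


probs = next_word_probs(("last", "six"))
print(probs)
```

Using `model.get(...)` rather than `model[...]` avoids inserting empty entries into the `defaultdict` when an unknown context is looked up.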


### **Step 7: Define the Text Generation Function**

*   Define the **generate_text** function to generate text based on the trigram frequencies.



In [None]:
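# Function to generate text
# prefix: the starting two words as a tuple (the initial context), e.g. ("machine", "learning")
# n: how many words to generate (default = 20)
def generate_text(prefix, n=20):
    for _ in range(n):
        # Expand the Counter so that each next word appears once per observed occurrence;
        # random.choice over this list is then frequency-weighted.
        suffix_candidates = list(d.get(prefix, Counter()).elements())
        if not suffix_candidates:
            # Unknown context: restart from two random corpus words.
            # unigram still contains stopwords, so the new prefix is often missing from d as well.
            new_prefix = random.choice(unigram), random.choice(unigram)
            yield new_prefix[0]  # Yield the first word of the new prefix
            prefix = new_prefix
        else:
            suffix = random.choice(suffix_candidates)
            yield suffix
            prefix = (*prefix[1:], suffix) # Slide the context window forward by one word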
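As a possible variation on the function above (not the notebook's method), the sketch below samples the next word with `random.choices` and explicit weights instead of expanding the counter, and restarts from a context the model has actually seen when the current prefix is unknown. `toy_model` and `generate_weighted` are hypothetical names, and the toy counts are invented for the example.

```python
import random
from collections import Counter

# Assumption: a toy model with the same shape as d (context -> Counter of next words).
toy_model = {
    ("heavy", "rain"): Counter({"fell": 3, "came": 1}),
    ("rain", "fell"): Counter({"overnight": 2}),
    ("rain", "came"): Counter({"down": 1}),
}


def generate_weighted(model, prefix, n=10, rng=random):
    """Hypothetical variant: weighted sampling with a fallback to known contexts."""
    contexts = list(model)
    words = []
    while len(words) < n:
        counts = model.get(prefix)
        if not counts:
            # Unknown context: jump to a context that exists in the model
            # and emit both of its words.
            prefix = rng.choice(contexts)
            words.extend(prefix)
            continue
        candidates, weights = zip(*counts.items())
        suffix = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(suffix)
        prefix = (prefix[1], suffix)
    return words[:n]


rng = random.Random(0)  # Seeded generator so the run can be repeated.
output = generate_weighted(toy_model, ("he", "said"), n=8, rng=rng)
print(" ".join(output))
```

Restarting from a known context means every fallback is followed by at least one word that was really observed after that context, at the cost of never emitting words that only occur next to stopwords.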


### **Step 8: Execute the Text Generation Function**

*   Call the **generate_text** function and print the generated text.



In [None]:
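# Generate text
# Note: "he" is a stopword, so ("he", "said") is not a context in d and generation starts from the random fallback.
prefix = ("he", "said")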
generated_text = list(generate_text(prefix))
if generated_text:
    print(" ".join(generated_text))
else:
    print("No text generated.")


road that him her to could it's five of properly `` this the can't always simple on be of declarative


In [None]:
# One more example
# Generate text
# Note: "the" is a stopword, so ("the", "market") is not a context in d either.
prefix = ("the", "market")
generated_text = list(generate_text(prefix))
if generated_text:
    print(" ".join(generated_text))
else:
    print("No text generated.")

two social a was adored if this return could of chow integrated everybody of sinai any now criticism suffering well


## **Conclusion**

This demo showed how to build a trigram-based Markov chain text generator with NLTK and the Brown corpus. Because each next word is sampled at random, every run produces different output.
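If a repeatable run is needed, for example to compare two versions of the model, the random number generator can be seeded. This is a small sketch of the idea on a toy word list; `sample_words` is a hypothetical helper, and in the notebook itself calling `random.seed(...)` before `generate_text` should have a comparable effect.

```python
import random

# Assumption: a short word list standing in for the unigram list.
words = ["road", "that", "him", "her", "to", "could", "five"]


def sample_words(seed, k=5):
    """Draw k words using a private generator seeded with the given value."""
    rng = random.Random(seed)
    return [rng.choice(words) for _ in range(k)]


first_run = sample_words(seed=42)
second_run = sample_words(seed=42)
print(first_run == second_run)
```

A private `random.Random` instance keeps the seeding local, so other code that uses the global `random` module is not affected.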