#**Demo: Text Generation**
This demo uses the Natural Language Toolkit (NLTK) and the Brown corpus to demonstrate text generation with a Markov chain model built from trigrams.

##**Steps to Perform:**

1. Import the Necessary Libraries
Load NLTK and other Python libraries required for text processing and handling n-grams.

2. Define Stopwords and Punctuation
Specify a stopword list (using NLTK’s corpus or a custom list) and punctuation marks to be filtered out.

3. Load Sentences and Generate N-grams
Retrieve the pre-tokenized sentences from the Brown corpus (or another dataset) and build unigrams, bigrams, and trigrams from them.

4. Remove Stopwords from N-grams
Filter out unwanted stopwords and punctuation from the generated n-grams for cleaner processing.

5. Calculate Frequency Distributions
Use NLTK’s FreqDist to compute how frequently each n-gram appears in the corpus.

6. Create a Dictionary of Trigram Frequencies
Build a mapping structure that links word pairs (context) to their possible next words based on frequency.

7. Define the Text Generation Function
Write a function that leverages the trigram model to generate text step by step.

8. Execute the Text Generation Function
Run the function to produce and display a sample passage of generated text.

###**Step 1: Import the Necessary Libraries**





*   Import the necessary libraries.
*   Download the necessary NLTK packages and corpus.



In [4]:
#!pip install nltk

In [51]:
# Import necessary libraries
import string
import nltk
from nltk.corpus import brown
from nltk.util import ngrams

# Download necessary NLTK packages and corpus
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('brown')


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Lenovo\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

###**Step 2: Define Stopwords and Punctuation**

*   Stopwords are common words in a language that are often considered to be of little value in text analysis.
*   Punctuation refers to characters used to separate sentences, clauses, phrases, or words in writing.





In [52]:
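# Define stopwords and punctuation
stop_words = set(nltk.corpus.stopwords.words('english'))
# string.punctuation already contains '"', "'" and '-', so only the em dash needs adding.
# The guard keeps this cell idempotent when it is re-run.
if '—' not in string.punctuation:
    string.punctuation += '—'
removal_list = list(stop_words) + list(string.punctuation) + ['lt', 'rt']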
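
One detail worth checking when building such a list: `in` on a string is a substring test, while `in` on a set is an exact-match test. The sketch below uses only the standard library and a small hypothetical stopword set (`toy_stop_words`, not NLTK's list) to show the difference. The helpers `is_punct_token` and `keep_token` are illustrative names, not part of NLTK.

```python
import string

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
toy_stop_words = {"the", "a", "is", "he", "of"}
punct_chars = set(string.punctuation)

def is_punct_token(token):
    """True if the token is non-empty and consists only of punctuation characters."""
    return bool(token) and all(ch in punct_chars for ch in token)

def keep_token(token):
    """Keep tokens that are neither stopwords nor made up solely of punctuation."""
    return token.lower() not in toy_stop_words and not is_punct_token(token)

tokens = ["He", "said", "``", "the", "jury", "--", "met", "''", "."]
kept = [t.lower() for t in tokens if keep_token(t)]
print(kept)

# A substring test against the punctuation string misses multi-character tokens:
print("``" in string.punctuation)   # False: the string has no two consecutive backticks
print(is_punct_token("``"))         # True: every character is punctuation
```

Checking character by character is one possible way to catch tokens such as `` and `--`, which the Brown corpus uses for quotation marks and dashes.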


In [53]:
# removal_list
# stop_words

###**Step 3: Load Sentences and Generate N-grams**

*   Load sentences from the Brown corpus and generate N-grams.
*   By the end of this process, **unigram**, **bigram**, and **trigram** lists will contain the respective N-grams for the sentences in the Brown corpus.





In [54]:
# Load sentences from the Brown corpus
sents = brown.sents()

# Initialize lists for storing n-grams
unigram = []
bigram = []
trigram = []

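# Generate n-grams
for sentence in sents:
    # Note: `in string.punctuation` is a substring test, so multi-character
    # tokens such as "``", "''" and "--" are not removed here.
    sentence = [word.lower() for word in sentence if word not in string.punctuation]
    unigram.extend(sentence)
    # No pad symbol is given, so sentence boundaries are padded with None.
    bigram.extend(list(ngrams(sentence, 2, pad_left=True, pad_right=True)))
    trigram.extend(list(ngrams(sentence, 3, pad_left=True, pad_right=True)))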
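
To see what padding does, here is a pure-Python sketch that mimics padded n-gram extraction on one toy sentence. `padded_ngrams` is a hypothetical helper written for illustration, not NLTK's implementation. It assumes that n-1 pad symbols (`None` by default) are added on each side, which matches the `None`-padded tuples that appear in the frequency output in Step 5.

```python
from itertools import chain

def padded_ngrams(tokens, n, pad=None):
    """Return the n-grams over tokens after adding n-1 pad symbols on each side."""
    padding = [pad] * (n - 1)
    padded = list(chain(padding, tokens, padding))
    # Slide a window of width n across the padded sequence.
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

sentence = ["the", "jury", "said"]
print(padded_ngrams(sentence, 2))
print(padded_ngrams(sentence, 3))
```

With three tokens, this yields four bigrams and five trigrams, and every sentence contributes n-grams that start or end with `None`.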


In [55]:
# unigram

###**Step 4: Remove Stopwords from N-grams**

*   Define a function to remove stopwords from the N-grams.
*   Use it to clean the bigrams and trigrams.





In [56]:
# Function to remove stopwords from n-grams
def remove_stopwords(ngram_list, removal_list):
    """Keep only the n-grams in which no token appears in removal_list."""
    removal_set = set(removal_list)  # faster lookups
    return [ng for ng in ngram_list if all(token not in removal_set for token in ng)]

# Remove stopwords from n-grams
bigram = remove_stopwords(bigram, removal_list)
trigram = remove_stopwords(trigram, removal_list)
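
Note that this filter drops whole n-grams rather than individual tokens. The toy sketch below (hypothetical data and a two-word removal set, for illustration only) shows one consequence: any two-word context that contains a removed word is no longer available to a trigram model.

```python
# Toy trigrams and a small hypothetical removal set (illustration only).
toy_trigrams = [
    ("he", "said", "friday"),
    ("the", "grand", "jury"),
    ("grand", "jury", "said"),
    ("jury", "said", "friday"),
]
toy_removal = {"he", "the"}

# An n-gram is kept only if none of its tokens is in the removal set.
kept = [ng for ng in toy_trigrams if not toy_removal.intersection(ng)]
print(kept)

# Contexts (first two words) that remain available to a trigram model:
contexts = {ng[:2] for ng in kept}
print(("he", "said") in contexts)    # False: "he" was filtered out
print(("jury", "said") in contexts)  # True
```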

In [57]:
bigram[:5]

[('fulton', 'county'),
 ('county', 'grand'),
 ('grand', 'jury'),
 ('jury', 'said'),
 ('said', 'friday')]

In [58]:
trigram[:5]

[('fulton', 'county', 'grand'),
 ('county', 'grand', 'jury'),
 ('grand', 'jury', 'said'),
 ('jury', 'said', 'friday'),
 ("atlanta's", 'recent', 'primary')]

###**Step 5: Calculate Frequency Distributions**

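*   Calculate the frequency distributions of the bigrams and trigrams.
*   The `None` entries in the output are sentence-boundary padding: the n-grams were generated with `pad_left=True` and `pad_right=True` but no pad symbol, and `None` is not in the removal list.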



In [59]:
# Calculate frequency distributions
from nltk import FreqDist

freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)


In [60]:
freq_bi.most_common(10)

[(("''", None), 4747),
 ((None, '``'), 4177),
 (('said', None), 445),
 ((None, 'one'), 401),
 (('united', 'states'), 392),
 (('said', '``'), 323),
 (('new', 'york'), 296),
 ((None, 'mr.'), 241),
 (('af', None), 236),
 ((None, '--'), 219)]

In [61]:
freq_tri.most_common(10)

[(("''", None, None), 4747),
 ((None, None, '``'), 4177),
 (('said', None, None), 445),
 ((None, None, 'one'), 401),
 ((None, None, None), 242),
 ((None, None, 'mr.'), 241),
 (('af', None, None), 236),
 ((None, None, '--'), 219),
 (('time', None, None), 205),
 ((None, None, 'even'), 190)]
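
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the same counting can be sketched with the standard library alone. The example below uses a handful of made-up bigrams (not real corpus counts) and shows one possible way to set the padded entries aside when inspecting the most frequent pairs.

```python
from collections import Counter

# Made-up bigrams for illustration; None marks sentence-boundary padding.
toy_bigrams = [
    ("new", "york"), ("united", "states"), ("new", "york"),
    ("said", None), (None, "one"), ("said", None), ("said", None),
]

counts = Counter(toy_bigrams)
print(counts.most_common(2))

# Ignore padded n-grams to see the most frequent real word pairs.
real_counts = Counter({bg: c for bg, c in counts.items() if None not in bg})
print(real_counts.most_common(2))

# Relative frequency of one bigram among the unpadded ones.
total = sum(real_counts.values())
print(real_counts[("new", "york")] / total)
```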

###**Step 6: Create a Dictionary of Trigram Frequencies**

*   Create a dictionary that maps each two-word context to a `Counter` of the words that follow it, for use in the text generation function. Trigrams containing `None` padding are skipped.



In [62]:
# Create a dictionary of trigram frequencies
from collections import defaultdict, Counter

d = defaultdict(Counter)
for ngram in freq_tri:
    if None not in ngram:
        d[ngram[:-1]][ngram[-1]] += freq_tri[ngram]
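
The same mapping can be illustrated on a few made-up trigram counts. In this sketch, `model` and `next_word_probs` are hypothetical names; the helper turns the raw counts for a context into relative frequencies, which is the distribution a Markov chain would sample from.

```python
from collections import Counter, defaultdict

# Made-up trigram counts for illustration only.
toy_trigram_counts = Counter({
    ("grand", "jury", "said"): 3,
    ("grand", "jury", "met"): 1,
    ("jury", "said", "friday"): 2,
    ("said", None, None): 5,   # padded n-gram, skipped below
})

# Map each two-word context to a Counter of the words that follow it.
model = defaultdict(Counter)
for (w1, w2, w3), count in toy_trigram_counts.items():
    if None in (w1, w2, w3):
        continue
    model[(w1, w2)][w3] += count

def next_word_probs(context):
    """Relative frequency of each word observed after the two-word context."""
    followers = model.get(context, Counter())
    total = sum(followers.values())
    return {word: c / total for word, c in followers.items()}

print(dict(model[("grand", "jury")]))
print(next_word_probs(("grand", "jury")))
```

An unseen context returns an empty dictionary here, because `model.get` does not create a new entry the way indexing a `defaultdict` would.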


In [63]:
d.items()



###**Step 7: Define the Text Generation Function**

*   Define the **generate_text** function to generate text based on the trigram frequencies.



In [64]:
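# Function to generate text
import random

def generate_text(prefix, n=20):
    """Yield n words, each sampled from the trigram counts for the current two-word prefix."""
    for _ in range(n):
        # elements() repeats each candidate by its count, so random.choice is frequency-weighted.
        suffix_candidates = list(d.get(prefix, Counter()).elements())
        if not suffix_candidates:
            # Unseen prefix: restart from two random corpus words.
            # Only the first is emitted; the second is used as context but never appears in the output.
            new_prefix = random.choice(unigram), random.choice(unigram)
            yield new_prefix[0]
            prefix = new_prefix
        else:
            suffix = random.choice(suffix_candidates)
            yield suffix
            prefix = (*prefix[1:], suffix)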
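
As a point of comparison, here is a minimal variant on a toy model. It is a sketch under different assumptions from the function above, not a replacement for it: it samples with `random.choices` and explicit weights instead of expanding `Counter.elements()`, and it stops at an unseen context instead of falling back to random words. The names `toy_model`, `sample_next`, and `generate` are hypothetical.

```python
import random
from collections import Counter

# Toy model: two-word context -> counts of observed next words (illustration only).
toy_model = {
    ("grand", "jury"): Counter({"said": 3, "met": 1}),
    ("jury", "said"): Counter({"friday": 2}),
    ("jury", "met"): Counter({"friday": 1}),
}

def sample_next(context, rng):
    """Draw the next word in proportion to its count, or None if the context is unseen."""
    followers = toy_model.get(context)
    if not followers:
        return None
    words = list(followers)
    weights = [followers[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

def generate(context, max_words, rng):
    """Extend the context word by word, stopping early at an unseen context."""
    out = list(context)
    for _ in range(max_words):
        word = sample_next(tuple(out[-2:]), rng)
        if word is None:
            break
        out.append(word)
    return out

rng = random.Random(0)  # fixed seed so repeated runs give the same text
print(" ".join(generate(("grand", "jury"), 5, rng)))
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing changes to the model.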


###**Step 8: Execute the Text Generation Function**

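*   Call the **generate_text** function and print the generated text.
*   Because `he` is an NLTK English stopword and every trigram containing a stopword was removed in Step 4, the starting prefix `("he", "said")` is not a key in `d`, so generation begins with the random fallback.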



In [66]:
# Generate text
prefix = ("he", "said")
generated_text = list(generate_text(prefix))
if generated_text:
    print(" ".join(generated_text))
else:
    print("No text generated.")


and producer referred the there 8 great is most numbers part the public ignore from such of the what had


##**Conclusion**

This demo shows how to use NLTK and the Brown corpus for trigram-based Markov chain text generation. Run it several times to see how the generated output varies.