#**Demo: Text Generation**

This demo uses the Natural Language Toolkit (NLTK) and the Brown corpus to generate text with a Markov chain model built from trigrams.

##**Steps to Perform:**

Step 1: Import the Necessary Libraries

Step 2: Define Stopwords and Punctuation

Step 3: Load Sentences and Generate N-grams

Step 4: Remove Stopwords from N-grams

Step 5: Calculate Frequency Distributions

Step 6: Create a Dictionary of Trigram Frequencies

Step 7: Define the Text Generation Function

Step 8: Execute the Text Generation Function

###**Step 1: Import the Necessary Libraries**





*   Import the necessary libraries.
*   Download the necessary NLTK packages and corpus.



In [1]:
# Import necessary libraries
import string
import random
import nltk
from nltk import FreqDist
from nltk.corpus import brown
from collections import defaultdict, Counter
from nltk.util import ngrams

# Download necessary NLTK packages and corpus
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('brown')


[nltk_data] Downloading package punkt to /voc/work/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package brown to /voc/work/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

###**Step 2: Define Stopwords and Punctuation**

*   Stopwords are common words in a language that are often considered to be of little value in text analysis.
*   Punctuation refers to characters used to separate sentences, clauses, phrases, or words in writing.
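
The Brown corpus represents some punctuation as multi-character tokens such as `` , '' and --, and a character-by-character punctuation list does not match them. The following is a minimal sketch of one way to catch such tokens. It assumes a tiny hypothetical stopword set (`toy_stop_words`) in place of NLTK's English list, and `is_punct_token` is an illustrative helper, not part of NLTK.

```python
import string

# Hypothetical stand-in for nltk.corpus.stopwords.words('english')
toy_stop_words = {"the", "a", "of", "and"}

def is_punct_token(token):
    """Return True if the token is non-empty and made only of punctuation characters."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

tokens = ["``", "The", "price", "of", "coal", "--", "rose", "''", "."]

# Keep lowercased tokens that are neither punctuation nor stopwords
kept = [
    t.lower()
    for t in tokens
    if not is_punct_token(t) and t.lower() not in toy_stop_words
]
print(kept)  # ['price', 'coal', 'rose']
```

Note that `"--" in string.punctuation` is `False`, because `in` applied to a string is a substring test; checking each character separately avoids that pitfall.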





In [9]:
# Define stopwords and punctuation
stop_words = set(nltk.corpus.stopwords.words('english'))
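# Build the removal set without mutating the shared string.punctuation constant
punctuation = string.punctuation + '—'
removal_list = stop_words | set(punctuation) | {'lt', 'rt'}  # a set, for fast membership tests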

###**Step 3: Load Sentences and Generate N-grams**

*   Load sentences from the Brown corpus and generate N-grams.
*   By the end of this process, **unigram**, **bigram**, and **trigram** lists will contain the respective N-grams for the sentences in the Brown corpus.
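
To make the idea of an N-gram concrete, here is a small pure-Python sketch that produces the same kind of unpadded tuples as `ngrams(sentence, n)`. The helper name `make_ngrams` and the toy sentence are illustrative assumptions, not part of the demo.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, with no padding."""
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))

sentence = ["the", "jury", "praised", "the", "election"]
bigrams = make_ngrams(sentence, 2)
trigrams = make_ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

A sentence of five tokens yields four bigrams and three trigrams; a sentence shorter than `n` yields none.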





In [10]:
# Load sentences from the Brown corpus
sents = brown.sents()

# Initialize lists for storing n-grams
unigram = []
bigram = []
trigram = []

# Generate n-grams
for sentence in sents:
    # Lowercase the words and drop punctuation tokens. Note: `in` on a string is a
    # substring test, so multi-character tokens such as `` '' and -- are kept.
    sentence = [word.lower() for word in sentence if word not in string.punctuation]
    unigram.extend(sentence)
    # ngrams() applies no padding by default
    bigram.extend(ngrams(sentence, 2))
    trigram.extend(ngrams(sentence, 3))

###**Step 4: Remove Stopwords from N-grams**

*   Define a function that drops every N-gram containing a stopword or punctuation token.
*   Use it to filter the bigrams and trigrams.
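
The filtering step can be sketched on toy data as follows. The removal set, the trigrams, and the helper name `drop_if_any_removed` are assumptions made for illustration.

```python
# Hypothetical removal set: a few stopwords plus one punctuation token
removal = {"the", "of", "."}

toy_trigrams = [
    ("the", "price", "of"),
    ("freight", "rates", "rose"),
    ("rates", "rose", "."),
    ("coal", "prices", "fell"),
]

def drop_if_any_removed(grams, removal):
    """Keep only the n-grams in which no word belongs to the removal set."""
    return [g for g in grams if not any(word in removal for word in g)]

clean = drop_if_any_removed(toy_trigrams, removal)
print(clean)  # [('freight', 'rates', 'rose'), ('coal', 'prices', 'fell')]
```

Dropping the whole N-gram, rather than deleting the stopword from inside it, means every surviving N-gram still consists of words that were adjacent in the original text.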





In [11]:
# Function to remove n-grams that contain a stopword or punctuation token
def remove_stopwords(grams, n):
    # The parameter is named grams so that it does not shadow nltk.util.ngrams.
    # n is kept so the calls below stay unchanged; tuples of any length are handled.
    return [gram for gram in grams if all(word not in removal_list for word in gram)]

# Remove stopwords from n-grams
bigram = remove_stopwords(bigram, 2)
trigram = remove_stopwords(trigram, 3)

###**Step 5: Calculate Frequency Distributions**

*   Calculate the frequency distributions of the bigrams and trigrams.
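
NLTK's `FreqDist` is built on Python's `collections.Counter`, so the counting step can be sketched with `Counter` alone. The toy trigrams below are illustrative assumptions.

```python
from collections import Counter

toy_trigrams = [
    ("united", "states", "government"),
    ("united", "states", "government"),
    ("united", "states", "army"),
    ("freight", "rates", "rose"),
]

# Count how many times each distinct trigram occurs
freq = Counter(toy_trigrams)

print(freq.most_common(1))  # [(('united', 'states', 'government'), 2)]
```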



In [12]:
# Calculate frequency distributions
freq_bi = FreqDist(bigram)
freq_tri = FreqDist(trigram)

###**Step 6: Create a Dictionary of Trigram Frequencies**

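*   Create a dictionary that maps each two-word prefix to a counter of the words that follow it, keeping only trigrams that occur at least **threshold** times. The text generation function samples from this dictionary.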
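
As a rough illustration of the resulting structure, the sketch below builds the same kind of prefix-to-counter table from a few made-up trigram counts. The counts and the name `table` are assumptions for the example.

```python
from collections import Counter, defaultdict

# Made-up trigram counts standing in for the real frequency distribution
toy_freq = Counter({
    ("united", "states", "government"): 3,
    ("united", "states", "army"): 2,
    ("freight", "rates", "rose"): 1,
})

threshold = 2  # trigrams seen fewer times than this are ignored
table = defaultdict(Counter)
for gram, count in toy_freq.items():
    if count >= threshold:
        # Key: the first two words. Value: counts of each possible third word.
        table[gram[:-1]][gram[-1]] += count

print(dict(table))
# {('united', 'states'): Counter({'government': 3, 'army': 2})}
```

The prefix `('freight', 'rates')` is absent because its only trigram falls below the threshold.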



In [13]:
# Create a dictionary of trigram frequencies with a threshold for filtering
threshold = 2  # Minimum frequency for trigrams to be included
d = defaultdict(Counter)
for ngram, freq in freq_tri.items():
    if freq >= threshold:
        d[ngram[:-1]][ngram[-1]] += freq

###**Step 7: Define the Text Generation Function**

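*   Define the **generate_text** function, which repeatedly samples the next word in proportion to its trigram frequency and restarts from a random known prefix when the current prefix has no continuation.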
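
Frequency-weighted sampling can also be written with `random.choices` and explicit weights. The sketch below is an alternative formulation on a toy table, not the demo's own function: `sample_next`, `table`, and the seed are illustrative assumptions, and unlike the demo it stops at a dead end instead of restarting.

```python
import random
from collections import Counter

# Toy prefix-to-counter table (hypothetical values)
table = {
    ("united", "states"): Counter({"government": 3, "army": 1}),
    ("states", "government"): Counter({"said": 2}),
}

def sample_next(prefix, table, rng):
    """Pick a next word with probability proportional to its count, or None."""
    counts = table.get(prefix)
    if not counts:
        return None
    candidates = list(counts.keys())
    weights = list(counts.values())
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random(0)  # local generator so the sketch does not touch global state
prefix = ("united", "states")
words = list(prefix)
for _ in range(5):
    nxt = sample_next(prefix, table, rng)
    if nxt is None:
        break  # dead end: no known continuation for this prefix
    words.append(nxt)
    prefix = (prefix[1], nxt)  # slide the two-word window forward

print(" ".join(words))
```

Passing weights avoids materializing a list with one entry per occurrence, which matters when counts are large.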



In [14]:
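# Function to generate text from the trigram dictionary
def generate_text(prefix, n=20):
    """Start from a two-word prefix and run n sampling steps."""
    text = list(prefix)
    for _ in range(n):
        # elements() repeats each word by its count, so choice() is frequency-weighted
        suffix_candidates = list(d.get(prefix, Counter()).elements())
        if not suffix_candidates:
            # Dead end: jump to a random known prefix. No word is added on this step,
            # so the output can contain fewer than n generated words.
            prefix = random.choice(list(d.keys()))
        else:
            # Choose a suffix and slide the prefix window forward by one word
            suffix = random.choice(suffix_candidates)
            text.append(suffix)
            prefix = (*prefix[1:], suffix)
    return " ".join(text)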

###**Step 8: Execute the Text Generation Function**

*   Call the **generate_text** function and print the generated text.



In [16]:
# Generate text with a random valid prefix
prefix = random.choice(list(d.keys()))  # Randomly select a valid prefix
generated_text = generate_text(prefix, n=50)  # Run 50 generation steps (at most 50 new words)
print("Generated Text:", generated_text)

Generated Text: set freight rates `` areas property excluding coal said cent -- century four us robert f. kennedy scores '' listing '' '' '' di ferro states supra institutions united states pursuant


##**Conclusion**

This demo uses NLTK and the Brown corpus to build a trigram-based Markov chain text generator. Because the starting prefix and each next word are chosen at random, every run produces different output.
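
When a repeatable run is preferred, for example while debugging, seeding Python's `random` module before choosing the prefix fixes the sequence of random choices. The seed value and the `keys` list below are illustrative assumptions.

```python
import random

# Hypothetical prefixes standing in for list(d.keys())
keys = [("united", "states"), ("freight", "rates"), ("coal", "prices")]

random.seed(42)
first = random.choice(keys)

random.seed(42)  # re-seeding restarts the same pseudo-random sequence
second = random.choice(keys)

print(first == second)  # True
```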