## N-gram Text Generation

In this assignment we'll generate text with several n-gram models. See the README for full instructions. Throughout the assignment, tokenize by folding every token to lowercase and dropping any token for which `isalpha` is False.
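As a quick illustration of that tokenization rule, here is a minimal sketch. The helper name `normalize` and the sample tokens are made up for this example and are not part of the assignment.

```python
def normalize(tokens):
    """Lowercase each token and drop any token that is not purely alphabetic."""
    # `normalize` is a hypothetical helper name used only for illustration.
    return [t.lower() for t in tokens if t.isalpha()]

# Punctuation, numbers, and hyphenated tokens all fail isalpha() and are dropped.
sample = ["Call", "me", "Ishmael", ".", "In", "1851", ",", "whale-ship"]
print(normalize(sample))  # ['call', 'me', 'ishmael', 'in']
```

Note that a hyphenated token such as `whale-ship` is removed entirely rather than split, because `isalpha` is False for the whole token.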

In [1]:
import nltk
import random

from nltk.book import *
from collections import Counter

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [2]:
# For this assignment, I recommend using the random.choices function.
# Here are some examples of its use.

values = "a b c d e f g".split()
weights = [1,1,5,5,10,10,20]

In [3]:
random.choices(population=values,weights=weights,k=5)

['g', 'g', 'g', 'f', 'e']

In [4]:
from collections import Counter

Counter(random.choices(population=values,weights=weights,k=1000)).most_common(10)

[('g', 373),
 ('e', 199),
 ('f', 197),
 ('d', 107),
 ('c', 91),
 ('b', 17),
 ('a', 16)]
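To see why the counts above come out this way, it may help to compute the expected number of draws per value. This is a small sketch that assumes, as in the cells above, 1000 draws and that `random.choices` picks each value with probability proportional to its weight.

```python
values = "a b c d e f g".split()
weights = [1, 1, 5, 5, 10, 10, 20]

total = sum(weights)  # 52
draws = 1000

# Expected count for each value is draws * weight / total weight.
expected = {v: draws * w / total for v, w in zip(values, weights)}
for v, e in expected.items():
    print(f"{v}: {e:.1f}")
# a: 19.2
# b: 19.2
# c: 96.2
# d: 96.2
# e: 192.3
# f: 192.3
# g: 384.6
```

The observed counts (for example 373 for `g` and 16 for `a`) sit close to these expectations, with the differences coming from random variation.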

Now, write a function that generates text of a given length, using the probabilistic approach to glue one word to another. It should take a text and the desired output length as parameters.

In [5]:
def generate_unigram(text,length=10) : # Generate `length` words (default 10) from unigram frequencies

    # Normalize the tokens: keep only alphabetic ones and fold them to lowercase
    clean_tokens = [w.lower() for w in text if w.isalpha()]

    # Use random.choices to sample `length` tokens from clean_tokens.
    # Repeated tokens stay in the list, so frequent words are proportionally more likely.
    results = random.choices(population = clean_tokens, k=length)

    return(" ".join(results)) # Return the sampled words separated by spaces
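Sampling uniformly from the full token list, repeats included, gives the same distribution as sampling distinct words weighted by their counts. The sketch below shows that equivalent formulation using `Counter` and the `weights` argument of `random.choices`. The function name `generate_unigram_counted` is hypothetical, and it takes a plain list of tokens rather than an NLTK text so that it runs on its own.

```python
import random
from collections import Counter

def generate_unigram_counted(tokens, length=10):
    """Sample `length` words, weighting each distinct word by its frequency."""
    # Hypothetical variant for illustration; not the assignment's required solution.
    words = [t.lower() for t in tokens if t.isalpha()]
    counts = Counter(words)
    vocab = list(counts)                      # distinct words
    weights = [counts[w] for w in vocab]      # how often each word occurs
    return " ".join(random.choices(population=vocab, weights=weights, k=length))

print(generate_unigram_counted(["The", "cat", ".", "the", "dog"], length=5))
```

This form only pays off when the counts are reused across many calls; for a single call, sampling from the raw token list is simpler.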


Now play around with the various texts, generating nonsense sentences from them. 

In [6]:
print(generate_unigram(text1))

person for the satisfaction murder of through oil the pip


In [7]:
print(generate_unigram(text2))

of dare which mind groom ever eagerly sight time thence


In [8]:
print(generate_unigram(text5))

there dam pouting is have join hi join i gay


Now do the same thing, but have it work with bigrams. This is harder, since you have a "current word" you want to glue text onto. The parameter "start" will give you a word to start with. 

In [9]:
def generate_bigram(text,length=10,start=None) :
    # Lowercase the text and keep only alphabetic tokens
    lc_text = [token.lower() for token in text if token.isalpha()]
    results = []

    if not start : # No start word given: pick one at random from lc_text
        results.append(random.choices(population = lc_text, k=1)[0])
    else :
        start = start.lower()
        if start not in lc_text : # The lowercased start word isn't in the text
            print(f"The starting word, {start}, isn't in the text!")
            return("")
        results.append(start) # Otherwise use it as the first word of the output

    lc_fd = FreqDist(nltk.bigrams(lc_text)) # Frequency distribution of bigrams in lc_text

    while len(results) < length : # Keep going until we have `length` words
        # Collect every bigram whose first word is the most recently added word
        bigram_candidates = [pair for pair in lc_fd if pair[0] == results[-1]]
        if not bigram_candidates : # Can happen if the word occurs only as the text's final token
            break
        # Weight each candidate by its count so frequent continuations are chosen more often
        weights = [lc_fd[pair] for pair in bigram_candidates]
        next_pair = random.choices(population = bigram_candidates, weights = weights, k=1)[0]
        results.append(next_pair[1]) # Append the second word of the chosen bigram

    return(" ".join(results))
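The function above scans every bigram in the frequency distribution for each word it generates. One possible alternative, sketched here under the assumption that the tokens are already lowercased and filtered, is to build a table of successor counts once and look up each word directly. The names `build_successors` and `generate_from_table` are hypothetical.

```python
import random
from collections import Counter, defaultdict

def build_successors(words):
    """Map each word to a Counter of the words that follow it."""
    successors = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        successors[first][second] += 1
    return successors

def generate_from_table(successors, start, length=10):
    """Generate up to `length` words, choosing each next word by bigram frequency."""
    results = [start]
    while len(results) < length:
        followers = successors.get(results[-1])
        if not followers:  # no known successor, so stop early
            break
        nxt = random.choices(list(followers), weights=list(followers.values()), k=1)[0]
        results.append(nxt)
    return " ".join(results)

words = "a b a b a c".split()
table = build_successors(words)
print(table["a"])  # Counter({'b': 2, 'c': 1})
print(generate_from_table(table, "a", length=4))
```

Building the table takes one pass over the text, after which each generation step is a single dictionary lookup instead of a full scan.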



In [10]:
generate_bigram(text1,10,"the")


'the articles word till a bull poor stubb go hand'

In [11]:
generate_bigram(text2,10,"the")

'the tallest politely decided regard inevitable delay and parties stood'

In [12]:
generate_bigram(text5,10,"the")

'the missing part dum de dum du dummmm pm alohas'