# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the current word, which is the defining assumption of a (first-order) Markov chain.
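
Written out for a sequence of words $w_1, \dots, w_t$ (notation introduced here only for illustration), the assumption says that conditioning on the whole history gives the same distribution as conditioning on the most recent word alone:

$$
P(w_{t+1} \mid w_1, \dots, w_t) = P(w_{t+1} \mid w_t)
$$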

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) place to start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript
ali,\n \n\n\n\n\n\nALI WONG: BABY COBRA (2016) - F...
anthony,\n \n\n\n\n\n\nAnthony Jeselnik: Thoughts And ...
bert,\n \n\n\n\n\n\nBert Kreischer: Hey Big Boy (20...
bill,\n \n\n\n\n\n\nBILL BURR: I'M SORRY YOU FEEL T...
bo,\n \n\n\n\n\n\nPage Not Found - Scraps from th...
catherine,\n \n\n\n\n\n\nCatherine Cohen: The Twist...? ...
chris,\n \n\n\n\n\n\nChris Rock: Bigger & Blacker (1...
dave,\n \n\n\n\n\n\nDave Chappelle: The Age of Spin...
george,\n \n\n\n\n\n\nGeorge Carlin: Doin' It Again (...
hasan,\n \n\n\n\n\n\nPage Not Found - Scraps from th...


In [18]:
# Apply a first round of text cleaning techniques
import re
import string

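def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Remove text in square brackets
    text = re.sub(r'\[.*?\]', '', text)
    # Remove ASCII punctuation, as the docstring describes
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)
    # Remove words containing digits
    text = re.sub(r'\w*\d\w*', '', text)
    return text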

round1 = lambda x: clean_text_round1(x)
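
To see what each step named in the docstring does, here is a small sketch that applies the patterns one at a time to a made-up line (the sample string is invented, and the punctuation pattern is an assumption about how that step would be written):

```python
import re
import string

# Made-up sample line, not taken from the corpus
sample = "ALI WONG: Baby Cobra (2016) [audience laughs] I'm 33!"

step1 = sample.lower()
step2 = re.sub(r'\[.*?\]', '', step1)                               # drops "[audience laughs]"
step3 = re.sub('[%s]' % re.escape(string.punctuation), '', step2)  # drops : ( ) ' !
step4 = re.sub(r'\w*\d\w*', '', step3)                              # drops "2016" and "33"

for label, value in [('lowercase', step1), ('brackets', step2),
                     ('punctuation', step3), ('digits', step4)]:
    print(f'{label:>12}: {value!r}')

# Splitting on whitespace hides the leftover runs of spaces
tokens = step4.split()
print(tokens)
# ['ali', 'wong', 'baby', 'cobra', 'im']
```

Note that the removals leave extra spaces behind, which is why a later pass that collapses repeated spaces is useful.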

In [19]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data.transcript.apply(round1))
data_clean

Unnamed: 0,transcript
ali,ali wong baby cobra full transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv ...
anthony,anthony jeselnik thoughts and prayers full transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubr...
bert,bert kreischer hey big boy transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaelt...
bill,bill burr im sorry you feel that way full transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubri...
bo,page not found scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcr...
catherine,catherine cohen the twist shes gorgeous transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrick...
chris,chris rock bigger blacker transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv...
dave,dave chappelle the age of spin transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline k...
george,george carlin doin it again transcript scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kael...
hasan,page not found scraps from the loft \r\t\tskip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcr...


In [20]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    # Replace line breaks and tabs with a space so neighbouring words are not glued together
    text = re.sub(r'[\n\t\r]', ' ', text)
    text = re.sub('♪', '', text)
    text = re.sub(r'–\b|\b–', '', text)
    # Collapse runs of spaces last, after the removals above
    text = re.sub(' +', ' ', text)
    return text

round2 = lambda x: clean_text_round2(x)

In [21]:
data_clean = pd.DataFrame(data_clean.transcript.apply(round2))
data_clean

Unnamed: 0,transcript
ali,ali wong baby cobra full transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv series...
anthony,anthony jeselnik thoughts and prayers full transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpau...
bert,bert kreischer hey big boy transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seri...
bill,bill burr im sorry you feel that way full transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpaul...
bo,page not found scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptsco...
catherine,catherine cohen the twist shes gorgeous transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpaulin...
chris,chris rock bigger blacker transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv serie...
dave,dave chappelle the age of spin transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv ...
george,george carlin doin it again transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv ser...
hasan,page not found scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptsco...


In [22]:
data = data_clean

In [39]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['ali']
ali_text[:2000]

' ali wong baby cobra full transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks menumoviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks search moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks menumoviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks searchcomedy ali wong baby cobra – full transcript september ali wongs stand up special delves into her sexual adventures hoar

In [38]:
# Extract only Joe Rogan's text
joe_text = data.transcript.loc['joe']
joe_text[:2000]

' joe rogan triggered transcript scraps from the loft skip to content moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks menumoviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks search moviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks menumoviesmovie reviewsmovie transcriptsstanley kubrickpauline kaeltv seriestv show transcriptscomedystandup comedy transcriptsgeorge carlindave chappelleinterviewsplayboy interviewsmusichistorybooks searchcomedy joe rogan triggered – transcript august unleashing his inquisitive intense comedic style rogan explores everything fro

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
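
As a quick illustration of that structure, here is a minimal sketch on a toy sentence (the sentence and the `toy_` names are invented for this example):

```python
from collections import defaultdict

# Toy corpus invented for illustration
toy_text = 'the cat sat on the mat the cat ran'
toy_words = toy_text.split(' ')

# Pair every word with the word that follows it
toy_chain = defaultdict(list)
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_chain[current_word].append(next_word)

print(dict(toy_chain))
# {'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'ran'], 'sat': ['on'], 'on': ['the'], 'mat': ['the']}
```

Two things are worth noticing: repeated followers are kept (`'cat'` appears twice after `'the'`), and the final word `'ran'` never becomes a key because nothing follows it.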

In [34]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [40]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'': ['ali'],
 'ali': ['wong', 'wong', 'wongs', 'wong', 'you', 'we', 'why', 'did', 'wong'],
 'wong': ['baby', 'baby', 'hi', 'have', 'standup'],
 'baby': ['cobra',
  'cobra',
  'wipe',
  'theyll',
  'that',
  'and',
  'comes',
  'hasnt',
  'can',
  'can',
  'from'],
 'cobra': ['full', '–'],
 'full': ['transcript', 'transcript', 'blown'],
 'transcript': ['scraps', 'september'],
 'scraps': ['from', 'from'],
 'from': ['the',
  'a',
  'the',
  'the',
  'the',
  'the',
  'pier',
  'harvard',
  'the',
  'the',
  'having',
  'having',
  'the',
  'this',
  'pedestrians',
  'their',
  'a',
  'thats',
  'holding',
  'bacteria',
  'harvard',
  'tight',
  'the'],
 'the': ['loft',
  'rocky',
  'stage',
  'light',
  'chatter',
  'kind',
  'kind',
  'presence',
  'lifechanging',
  'center',
  'hoarding',
  'hpv',
  'country',
  'third',
  'communiststhe',
  'worst',
  'manual',
  'calculator',
  'future',
  'tesla',
  'manual',
  'way',
  'lucky',
  'outside',
  'inside',
  'first',
  'time',
  'fifth

In [41]:
joe_dict = markov_chain(joe_text)
joe_dict

{'': ['joe'],
 'joe': ['rogan', 'rogan', 'roganwhat'],
 'rogan': ['triggered', 'triggered', 'explores', 'standup'],
 'triggered': ['transcript', '–'],
 'transcript': ['scraps', 'august'],
 'scraps': ['from', 'from'],
 'from': ['the',
  'why',
  'bruce',
  'the',
  'the',
  'her',
  'her',
  'the',
  'around',
  'them',
  'but',
  'today',
  'i',
  'the',
  'crabs',
  'you',
  'it',
  'tight',
  'the'],
 'the': ['loft',
  'fck',
  'move',
  'people',
  'fck',
  'other',
  'fck',
  'problem',
  'same',
  'pot',
  'gummy',
  'leg',
  'leg',
  'fck',
  'scary',
  'thing',
  'problems',
  'boat',
  'earth',
  'boats',
  'suns',
  'water',
  'water',
  'water',
  'water',
  'clouds',
  'bottom',
  'time',
  'words',
  'foods',
  'waters',
  'same',
  'other',
  'same',
  'same',
  'way',
  'word',
  'case',
  'same',
  'secret',
  'middle',
  'ocean',
  'problem',
  'people',
  'weirdest',
  'most',
  'weirdest',
  'shit',
  'type',
  'world',
  'best',
  'fck',
  'white',
  'fcking',
  'we'
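
The repeated entries in the lists above are what make common transitions more likely: picking uniformly from a list with duplicates is the same as sampling by frequency. As a rough sketch, a follower list can be converted into explicit transition probabilities (the helper name `transition_probabilities` is made up for this example):

```python
from collections import Counter

def transition_probabilities(followers):
    '''Turn a list of follower words into a {word: probability} dictionary.'''
    counts = Counter(followers)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Follower list copied from the 'wong' entry in Ali's dictionary
wong_followers = ['baby', 'baby', 'hi', 'have', 'standup']
probs = transition_probabilities(wong_followers)
print(probs)
# {'baby': 0.4, 'hi': 0.2, 'have': 0.2, 'standup': 0.2}
```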

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [42]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # The last word of the corpus may have no followers; stop early instead of raising a KeyError
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

### Assignment:
1. Generate sentences for other comedians as well.
2. Try improving the `generate_sentence` function. For example, let it end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.

In [43]:
generate_sentence(ali_dict)

'Atlanta and he was a box spring iliza remains the outside of options when they.'

In [44]:
generate_sentence(joe_dict)

'Bet if adam and some real every corner they go to make this fcking dudes.'

In [45]:
import random
import string

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''
    
    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # Stop early if the current word has no recorded followers
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        sentence += ' ' + word2

        # Stop once a word ends with a punctuation mark (empty tokens are skipped safely)
        if word2 and word2[-1] in string.punctuation:
            break

        word1 = word2

    # End it with a random punctuation mark, unless the sentence already ends with one
    if not sentence.endswith(tuple(string.punctuation)):
        sentence += random.choice(string.punctuation)
    return sentence

In [46]:
generate_sentence(ali_dict)

'Unspoken understanding uh between each other homeless dude believe he doesnt have to get huge+'

In [47]:
generate_sentence(joe_dict)

'Pop out a dog shit they have at anybody would be me i asked a='

In [None]:
# In this updated function, we use the string.punctuation constant to check whether
# a generated word ends with a punctuation mark. If it does, we stop generating the sentence.
# We also use random.choice to select a random punctuation mark to end the sentence with.