# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the previous word, which is the basic assumption of a Markov chain.
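
Written as a formula, the assumption looks roughly like this. The symbol $w_t$ (the word at position $t$) is notation introduced here for illustration and is not used elsewhere in the notebook:

$$P(w_t \mid w_1, \dots, w_{t-1}) = P(w_t \mid w_{t-1})$$

In other words, everything before the previous word is ignored when choosing the next word.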

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) start.

## Select Text to Imitate

In this notebook, we're specifically going to generate text in the style of Ali Wong, so as a first step, let's extract the text from her comedy routine.

In [None]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('corpus.pkl')
data

Unnamed: 0,transcript,full_name
adam,\n\n\n\n\n\n\n\nAdam Sandler: Love You (2024)...,Adam Sandler
ali,\n\n\n\n\n\n\n\nAli Wong: Baby Cobra (2016) |...,Ali Wong
anthony,\n\n\n\n\n\n\n\nAnthony Jeselnik: Thoughts An...,Anthony Jeselnik
bill,\n\n\n\n\n\n\n\nBILL BURR: I'M SORRY YOU FEEL...,Bill Burr
bo,\n\n\n\n\n\n\n\nScraps from the loft\n\n\n\n\...,Bo Burnham
chad,\n\n\n\n\n\n\n\nChad Daniels: Dad Chaniels (2...,Chad Daniels
dave,\n\n\n\n\n\n\n\nDave Chappelle: The Age of Sp...,Dave Chappelle
ellen,\n\n\n\n\n\n\n\nEllen DeGeneres: For Your App...,Ellen DeGeneres
gabriel,\n\n\n\n\n\n\n\nGabriel Iglesias: Legend of F...,Gabriel Iglesias
hasan,\n\n\n\n\n\n\n\nScraps from the loft\n\n\n\n\...,Hasan Minhaj


In [None]:
# Extract only Ali Wong's text
ali_text = data.transcript.loc['ali']
ali_text[:200]

' \n\n\n\n\n\n\n\nAli Wong: Baby Cobra (2016) | Transcript - Scraps from the loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\tSkip to content\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\t'
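
The preview above is mostly newlines and tabs left over from scraping. As an optional step, the whitespace could be collapsed before building the chain. This is a minimal sketch, not part of the original notebook; `raw_text` is a hypothetical sample that imitates the transcript's leading whitespace.

```python
# Hypothetical sample imitating the scraped transcript's stray whitespace.
raw_text = ' \n\n\nAli Wong: Baby Cobra (2016) | Transcript\n\n\t\tSkip to content\n\n'

# str.split() with no arguments splits on any run of whitespace
# (spaces, tabs, newlines) and drops empty strings.
tokens = raw_text.split()

# Re-join with single spaces to get a normalized string.
clean_text = ' '.join(tokens)
print(clean_text)
```

The rest of the notebook keeps working on the raw text, which is why newline characters show up inside some of the dictionary keys below.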

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys

In [None]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''

    # Tokenize on single spaces; punctuation (and any newlines or tabs) stays attached to the words
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [None]:
# Create the dictionary for Ali's routine, take a look at it
ali_dict = markov_chain(ali_text)
ali_dict

{'': ['\n\n\n\n\n\n\n\nAli'],
 '\n\n\n\n\n\n\n\nAli': ['Wong:', 'Wong:'],
 'Wong:': ['Baby', 'Baby'],
 'Baby': ['Cobra', 'Cobra'],
 'Cobra': ['(2016)', '(2016)'],
 '(2016)': ['|', '|'],
 '|': ['Transcript',
  'Transcript',
  'Transcript\t\t\t\n\n\nComedian',
  'Transcript\t\t\t\n\n\nGabriel',
  'Transcript\t\t\t\n\n\nMichelle'],
 'Transcript': ['-', '\n\n\n\n\n\n\n\nSeptember'],
 '-': ['Scraps'],
 'Scraps': ['from', 'from'],
 'from': ['the',
  'a',
  'the',
  'the',
  'the',
  'the',
  'Pier',
  'Harvard',
  'the',
  'the',
  'having',
  'having',
  'the',
  'this',
  'pedestrians',
  'their',
  'a',
  'holding',
  'bacteria',
  'Harvard',
  'the'],
 'the': ['loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n',
  'rocky',
  'stage:',
  'light',
  'chatter',
  'kind',
  'kind',
  'presence',
  'center',
  'hoarding',
  'HPV',
  'country',
  'third',
  'Communists.\nThe',
  'worst',
  'calculator',
  'future.',
  'Tesla',
  'man
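
Because the value lists keep duplicates, a word that follows a key more often appears more often in that key's list, so a uniform random pick from the list is already frequency-weighted. The sketch below makes those implied transition probabilities explicit. `toy_chain` and `transition_probabilities` are hypothetical names used only for illustration.

```python
from collections import Counter

# Hypothetical toy chain in the same {word: [next words]} format as ali_dict.
toy_chain = {'from': ['the', 'a', 'the', 'the', 'Harvard']}

def transition_probabilities(chain, word):
    """Return {next_word: probability} implied by the duplicates in chain[word]."""
    counts = Counter(chain[word])
    total = sum(counts.values())
    return {next_word: n / total for next_word, n in counts.items()}

probs = transition_probabilities(toy_chain, 'from')
print(probs)
```

Here 'the' makes up three of the five entries, so it would be chosen about 60% of the time after 'from'.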

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [None]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for _ in range(count - 1):
        # The last word of the corpus may have no followers, so stop if there is nothing to pick
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [None]:
generate_sentence(ali_dict)

'Sparkle. The Boat, I learned about to go to date white man! I do all.'
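
Every run produces a different sentence, because the words are drawn at random. If repeatable output is wanted (for example, to compare changes to the generator), one option is to draw from a seeded `random.Random` instance. This is a sketch under that assumption; `toy_chain` and `sample_words` are hypothetical and stand in for the notebook's dictionary and generator.

```python
import random

# Hypothetical toy chain; every next word is also a key, so lookups never fail.
toy_chain = {
    'i': ['like', 'love'],
    'like': ['tea', 'i'],
    'love': ['tea', 'i'],
    'tea': ['i'],
}

def sample_words(chain, start, count, seed):
    """Walk the chain for `count` words using a private, seeded generator."""
    rng = random.Random(seed)  # separate generator; global random state is untouched
    words = [start]
    for _ in range(count - 1):
        words.append(rng.choice(chain[words[-1]]))
    return words

first = sample_words(toy_chain, 'i', 6, seed=42)
second = sample_words(toy_chain, 'i', 6, seed=42)
print(' '.join(first))
print(first == second)  # same seed, same walk
```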

### Assignment:
1. Generate sentences for other comedians as well.
2. Try improving the generate_sentence function. For example, let it end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.
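
For the first task, one possible approach is to build all the dictionaries in a single pass instead of one cell per comedian. The sketch below is an illustration only: `build_chain` is a compact restatement of the markov_chain function above, and `transcripts` is a hypothetical stand-in for the real transcript column.

```python
from collections import defaultdict

def build_chain(text):
    """Compact restatement of markov_chain, repeated so this block runs on its own."""
    words = text.split(' ')
    chain = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        chain[current_word].append(next_word)
    return dict(chain)

# Hypothetical toy transcripts keyed by comedian, standing in for data.transcript.
transcripts = {
    'ali': 'i do all the work',
    'adam': 'you know you guys',
}

# One chain per comedian, keyed the same way as the transcripts.
chains = {name: build_chain(text) for name, text in transcripts.items()}
print(chains['adam'])
```

With the real corpus, iterating over `data.transcript.items()` should give the same (name, text) pairs, assuming the DataFrame is indexed by comedian as shown earlier.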

In [None]:
# Extract only Adam Sandler's text
adam_text = data.transcript.loc['adam']
adam_text[:200]

' \n\n\n\n\n\n\n\nAdam Sandler: Love You (2024) | Transcript - Scraps from the loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\tSkip to content\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n'

In [None]:
# Create the dictionary for Adam's routine, take a look at it
adam_dict = markov_chain(adam_text)
adam_dict

{'': ['\n\n\n\n\n\n\n\nAdam'],
 '\n\n\n\n\n\n\n\nAdam': ['Sandler:', 'Sandler:'],
 'Sandler:': ['Love', 'Love'],
 'Love': ['You', 'You', 'you,', 'you', 'ya!\n[cheers', 'you,'],
 'You': ['(2024)',
  '(2024)',
  'know',
  'know',
  'want',
  'gotta',
  'do',
  'landed',
  'want',
  'fucked',
  'think',
  'know',
  'guys',
  'guys',
  'were',
  'know?',
  'guys',
  'can',
  'want',
  'know',
  'can’t',
  'understand',
  'know,',
  'don’t',
  'go,',
  'paint',
  'go,',
  'got…”',
  'start',
  'start',
  'go,',
  'go,',
  'go,',
  'got',
  'just',
  'fucked',
  'gotta',
  'gotta',
  'guys',
  'alright,',
  'ever',
  'want',
  'know?\nGary.',
  'have',
  'guys',
  'two…',
  'okay',
  'sound',
  'fucking',
  'know,',
  'know',
  'can',
  'know,',
  'guys…',
  'guys',
  'promise?',
  'know,',
  'said',
  'swear',
  'guys',
  'look',
  'need',
  'know,',
  'can',
  'were',
  'made',
  'swear?',
  'swear?',
  'know'],
 '(2024)': ['|', '|', '|'],
 '|': ['Transcript',
  'Transcript',
  'Transcript

In [None]:
generate_sentence(adam_dict)

'Good for world peace.” He said, “Get me and the fuck on the second that.'

In [None]:
generate_sentence(adam_dict, 20)

'Chip in yoga pants ♪ [chuckles] ♪ Pony kegs and Whoopi For the leaves ♪ With his load. He goes,.'

In [None]:
# Extract only Dave Chappelle's text
dave_text = data.transcript.loc['dave']
dave_text[:200]

' \n\n\n\n\n\n\n\nDave Chappelle: The Age of Spin (2017) - Transcript - Scraps from the loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\tSkip to content\n\n\n\n\n\n\n\n\n\n \n'

In [None]:
# Create the dictionary for Dave's routine, take a look at it
dave_dict = markov_chain(dave_text)
dave_dict

{'': ['\n\n\n\n\n\n\n\nDave'],
 '\n\n\n\n\n\n\n\nDave': ['Chappelle:', 'Chappelle:'],
 'Chappelle:': ['The', 'The'],
 'The': ['Age',
  'Age',
  'article',
  'whole',
  'old',
  'cops',
  'body',
  'City',
  'Steven',
  'motherfucker',
  'first',
  'Daily',
  'Daily',
  'other',
  'truth',
  'only',
  'Texan',
  'Texan’s',
  'restaurant',
  'glove',
  'glove',
  'men,',
  'Nike',
  'last',
  'third',
  'Improv.',
  'Juice.',
  'show',
  'conversation',
  'Russians',
  'age',
  'ladies',
  'only',
  'lights',
  'crowd',
  'longer',
  'crowd',
  'point',
  'fourth',
  'fourth',
  'Juice.',
  'material'],
 'Age': ['of', 'of', 'of'],
 'of': ['Spin',
  'Spin',
  'Spin:',
  'race,',
  'his',
  'thought,',
  'it',
  'weed,',
  'weed,',
  'racial',
  'it.',
  'Detroit,',
  'the',
  'dollars–',
  'bubble',
  'black',
  'one',
  'those',
  'mine.',
  'my',
  'this',
  'the',
  'the',
  'the',
  'the',
  'the',
  'the',
  'that',
  'L.A.',
  'money',
  'shots.\nEverybody’s',
  'the',
  'that',
  '

In [None]:
generate_sentence(dave_dict)

'Cowlings, the first place.” “Not yet. Not like teddy bears, but then, on board, dead..'

In [None]:
# Extract only Ronny Chieng's text
ronny_text = data.transcript.loc['ronny']
ronny_text[:200]

' \n\n\n\n\n\n\n\nRonny Chieng: Love to Hate It (2024) | Transcript - Scraps from the loft\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\t\tSkip to content\n\n\n\n\n\n\n\n\n\n \n\n\n\n'

In [None]:
# Create the dictionary for Ronny's routine, take a look at it
ronny_dict = markov_chain(ronny_text)
ronny_dict

{'': ['\n\n\n\n\n\n\n\nRonny'],
 '\n\n\n\n\n\n\n\nRonny': ['Chieng:', 'Chieng:'],
 'Chieng:': ['Love', 'Love'],
 'Love': ['to', 'to', 'farming.'],
 'to': ['Hate',
  'content\n\n\n\n\n\n\n\n\n\n',
  'Hate',
  'have',
  'the',
  'high',
  'the',
  'my',
  'high',
  'use',
  'push',
  'inject',
  'inject',
  'do',
  'you.”',
  'your',
  'make',
  'put',
  'flick',
  'put',
  'watch',
  'watch',
  'inject',
  'do',
  'make',
  'contribute',
  'know.”',
  'lie',
  'jerk',
  'finish…\n[crowd',
  'death',
  'medically',
  'medically',
  'test',
  'the',
  'my',
  'do',
  'Langone',
  'the…',
  'the',
  'tag',
  'Vegas,',
  'overcome',
  'think',
  'do',
  'be',
  'be',
  'be',
  'strangers',
  'overcome',
  'even',
  'make',
  'become',
  'law',
  'me.\n[crowd',
  'sincerely',
  'straight',
  'seek',
  'be',
  'be',
  'start',
  'lift',
  'lock',
  'know',
  'control',
  'the',
  'be',
  'be',
  'lift',
  'cause',
  'do',
  'do',
  'be',
  'straight',
  'seek',
  'resist.',
  'MMA',
  'MMA-fi

In [None]:
generate_sentence(ronny_dict)

'\n\n\n\ninstagram\n \n\n\n\n\n\n\n\n\n\n\n© 2024 Scraps from outside the egg, sometimes, like, “Yeah, that every basic bitch.'
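
The sentence above starts with page-footer residue because the starting word is drawn from all keys, including ones that carry newlines. One way to avoid this is to restrict the starting word to keys that look like ordinary words. This is a sketch, not the notebook's method; `toy_chain` and `clean_start_words` are hypothetical.

```python
import random

# Hypothetical toy chain with the kinds of messy keys seen in the real output.
toy_chain = {
    '': ['\n\n\nRonny'],
    '\n\n\ninstagram\n': ['©'],
    'Love': ['to'],
    'to': ['Hate'],
}

def clean_start_words(chain):
    """Keep keys that are non-empty, start with a letter, and contain no whitespace."""
    return [
        word for word in chain
        if word and word[0].isalpha() and not any(ch.isspace() for ch in word)
    ]

starts = clean_start_words(toy_chain)
start = random.choice(starts)
print(starts)
```

Only the starting word is filtered here; later words can still carry stray whitespace unless the text is cleaned before the chain is built.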

In [None]:
import random

def generate_sentence2(chain, count=15):
    '''Generate a sentence from a Markov chain.
       Ends early once a word ending in ., ! or ? is reached.
       Otherwise, ends with random punctuation (., !, ?).'''

    punctuation_marks = ('.', '!', '?')
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    for _ in range(count - 1):
        # endswith() is safe for the empty-string key, unlike word1[-1];
        # also stop if the word has no recorded followers
        if word1.endswith(punctuation_marks) or word1 not in chain:
            break

        word2 = random.choice(chain[word1])
        sentence += ' ' + word2
        word1 = word2

    if not sentence.endswith(punctuation_marks):
        sentence += random.choice(punctuation_marks)

    return sentence


In [None]:
generate_sentence2(ali_dict)

'Beard for us, and then resist the other homeless friends, dead.'

In [None]:
generate_sentence2(adam_dict)

'That!'

In [None]:
generate_sentence2(dave_dict)

'A ticket for like, ‘Same Hero, New Boots!’ And at other like, “Hooray!” And he!'

In [None]:
generate_sentence2(ronny_dict)

'Kids’ accolades, through this year.'