# Text Generation

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. We make the simplifying assumption that the next word depends only on the word immediately before it, which is the defining assumption of a (first-order) Markov chain.
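
Written out, with $w_t$ denoting the word at position $t$ (notation introduced here for illustration only), the assumption is:

$$
P(w_t \mid w_1, w_2, \dots, w_{t-1}) = P(w_t \mid w_{t-1})
$$

In other words, everything before the previous word is ignored when choosing the next one.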

Markov chains don't generate text as well as deep learning models, but they're a good (and fun!) place to start.

## Select Text to Imitate

In [1]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle('./wednesday_data/corpus.pkl')
data

| | transcript |
|---|---|
| E01 | Original release date: November 23, 2022 Wedne... |
| E02 | Original release date: November 23, 2022 Wedne... |
| E03 | Original release date: November 23, 2022 Wedne... |
| E04 | Original release date: November 23, 2022 Wedne... |
| E05 | Original release date: November 23, 2022 32 ye... |
| E06 | Original release date: November 23, 2022 Wedne... |
| E07 | Original release date: November 23, 2022 At Ma... |
| E08 | Original release date: November 23, 2022 Wedne... |


In [2]:
# Extract only E01 text
ep01_text = data.transcript.loc['E01']
ep01_text[:200]

'Original release date: November\xa023,\xa02022 Wednesday Addams, a high-school student, finds her brother\xa0Pugsley\xa0tied up in a locker. She sees a psychic vision of his bullies whom she attempts to kill in r'

## Build a Markov Chain Function

We are going to build a simple Markov chain function that creates a dictionary:
* The keys should be all of the words in the corpus
* The values should be a list of the words that follow the keys
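
As a rough illustration of that structure, here is a minimal sketch on a made-up sentence. The sentence and the `toy_*` names are hypothetical and not part of the corpus.

```python
# Toy sentence (hypothetical, not taken from the corpus)
toy_text = "the cat sat on the mat and the cat slept"
toy_words = toy_text.split(' ')

# Map each word to the list of words that directly follow it
toy_dict = {}
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    toy_dict.setdefault(current_word, []).append(next_word)

print(toy_dict['the'])      # ['cat', 'mat', 'cat']
print('slept' in toy_dict)  # False: the final word has no follower
```

Note that a word that only ever appears at the very end of the text never becomes a key, which matters later when walking the chain.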

In [3]:
from collections import defaultdict

def markov_chain(text):
    '''The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text.'''
    
    # Tokenize the text by word, though including punctuation
    words = text.split(' ')
    
    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)
    
    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict
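
Because `text.split(' ')` splits only on the ordinary space character, words joined by non-breaking spaces (`\xa0`) stay glued together, and double spaces produce empty-string tokens. A minimal sketch of an alternative, assuming you want those handled: `str.split()` with no argument splits on any run of whitespace, including `\xa0`. The name `markov_chain_ws` is hypothetical, and the outputs shown in this notebook were not produced with this variant.

```python
from collections import defaultdict

def markov_chain_ws(text):
    '''Hypothetical variant of markov_chain that splits on any whitespace.'''
    # str.split() with no argument treats runs of whitespace (including \xa0) as one separator
    words = text.split()

    m_dict = defaultdict(list)
    for current_word, next_word in zip(words[:-1], words[1:]):
        m_dict[current_word].append(next_word)
    return dict(m_dict)

sample = 'Original release date: November\xa023,\xa02022 Wednesday  Addams'
chain = markov_chain_ws(sample)
print(chain['date:'])     # ['November']
print(chain['November'])  # ['23,']
```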

In [4]:
# Create the dictionary for E01, take a look at it
ep01_dict = markov_chain(ep01_text)
ep01_dict

{'Original': ['release'],
 'release': ['date:'],
 'date:': ['November\xa023,\xa02022'],
 'November\xa023,\xa02022': ['Wednesday'],
 'Wednesday': ['Addams,',
  'is',
  'meets',
  'meet',
  'has',
  'is',
  'always',
  'to',
  'and',
  'was'],
 'Addams,': ['a', 'time'],
 'a': ['high-school',
  'locker.',
  'psychic',
  'school',
  'hiker',
  'sentient',
  'falling',
  'vision',
  'closed',
  'magical',
  'good',
  'polite',
  'problem',
  'werewolf.',
  'beautiful',
  'unique',
  'line',
  'unique',
  'very',
  'relationship',
  'therapist',
  'week.',
  'little',
  'hugger.',
  'uniform.',
  'tour',
  'part',
  'version',
  'kid',
  'pentagon.',
  'wiki',
  'living',
  'small',
  'soul-sucking',
  'moment.',
  'brilliant',
  'little',
  'symbol',
  'housewife,',
  'family.',
  'kind',
  'leg',
  'goddamn',
  'rainbow',
  'day',
  'foreign',
  'bad',
  'little',
  'few',
  'privilege,',
  'right.',
  'brisk',
  'shuttle',
  'tad',
  'clean',
  'military',
  'school',
  'run',
  'concussi
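
Repeated entries in a value list matter: in the output above, `'little'` appears three times after `'a'`, so a uniform random draw from the list picks it more often than a word listed once. A small sketch of turning such a list into estimated transition probabilities, using a shortened, hand-picked sample of followers rather than the full list (the `followers` variable is illustrative):

```python
from collections import Counter

# A shortened sample of the followers of 'a' (illustrative, not the full list)
followers = ['high-school', 'little', 'school', 'little', 'unique', 'little', 'school', 'unique']

counts = Counter(followers)
total = len(followers)

# Relative frequency of each follower = estimated probability of that transition
probs = {word: n / total for word, n in counts.items()}

for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f'{word}: {p:.3f}')
```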

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

In [5]:
import random

def generate_sentence(chain, count=15):
    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence.'''

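    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # The last word of the corpus may have no followers, so stop early instead of raising a KeyError
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2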

    # End it with a period
    sentence += '.'
    return(sentence)
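
Because the function draws from the module-level `random` generator, each call returns a different sentence. A sketch of a reproducible variant, assuming you are happy to give the function its own seeded `random.Random` instance; `generate_sentence_seeded` and the toy chain are hypothetical:

```python
import random

def generate_sentence_seeded(chain, count=15, seed=None):
    '''Hypothetical variant: same random walk, but with its own seeded generator.'''
    rng = random.Random(seed)
    # Sort the keys so the starting word does not depend on dictionary ordering
    word1 = rng.choice(sorted(chain))
    sentence = word1.capitalize()
    for _ in range(count - 1):
        if word1 not in chain:  # dead end: this word never has a follower
            break
        word1 = rng.choice(chain[word1])
        sentence += ' ' + word1
    return sentence + '.'

toy_chain = {'the': ['cat', 'mat', 'cat'], 'cat': ['sat', 'slept'], 'sat': ['on'], 'on': ['the']}
first = generate_sentence_seeded(toy_chain, count=6, seed=42)
second = generate_sentence_seeded(toy_chain, count=6, seed=42)
print(first == second)  # True: same seed, same sentence
```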

In [6]:
generate_sentence(ep01_dict)

'Look pretty, but clueless. It’s made an obsession with your footsteps. Becoming captain of advice..'

### Assignment:
1. Generate sentences for other episodes as well.
2. Try making the generate_sentence function better. Maybe allow it to end with a random punctuation mark or end whenever it gets to a word that already ends with a punctuation mark.

In [7]:
ep08_text = data.transcript.loc['E08']
ep08_text[:200]

'Original release date: November\xa023,\xa02022 Wednesday and her classmates lure Tyler into the forest, where they kidnap him. Seeking a confession, Wednesday starts torturing Tyler. Disagreeing with her me'

In [8]:
ep08_dict = markov_chain(ep08_text)
ep08_dict

{'Original': ['release'],
 'release': ['date:'],
 'date:': ['November\xa023,\xa02022'],
 'November\xa023,\xa02022': ['Wednesday'],
 'Wednesday': ['and',
  'starts',
  'is',
  'visits',
  'get',
  'to',
  'destroys',
  'departs',
  'to'],
 'and': ['her',
  'Wednesday',
  'Wednesday',
  'subdue',
  'leaves',
  'Eugene',
  'Wednesday',
  'suddenly',
  'me',
  'you',
  'married',
  'let',
  'detached.',
  'a',
  'torture?',
  'cooperation',
  'trust',
  'impulsive.',
  'the',
  'I',
  'there',
  'is',
  'never',
  'you',
  'werewolves,',
  'waiting.',
  'I',
  'renewal.',
  'those',
  'just',
  'the',
  'pin',
  'shut',
  'her',
  'confronted',
  'I',
  'Thornhill',
  'Wednesday’s',
  'if',
  'whine',
  'be',
  'I',
  'for',
  'true.',
  'you',
  'Kent?',
  'then',
  'alert',
  'I',
  'forever.',
  'heal',
  'calmly,',
  'drizzle',
  'Tyler',
  'Ellie',
  'Frank’s',
  'Ellie',
  'Ellie',
  'flooded',
  'Ellie'],
 'her': ['classmates',
  'methods,',
  'classmates',
  'from',
  'that',
  'sh

In [13]:
def generate_sentence2(chain, count=15):
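    '''Input a dictionary in the format of key = current word, value = list of next words
       along with the maximum number of words you would like to see in your generated sentence.
       The sentence ends early at the first generated word that already ends with punctuation.'''

    punc = ["'", '?','.',',',':',';','!','*','%','"','/']
    # Capitalize the first word
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for i in range(count-1):
        # The last word of the corpus may have no followers, so stop early instead of raising a KeyError
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2
        # Stop at a word that already ends with punctuation; empty tokens (from double spaces) are skipped
        if word2 and word2[-1] in punc:
            return(sentence)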

    # End it with a period
    sentence += '.'
    return(sentence)
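
The assignment also suggests ending with a random punctuation mark. A sketch of that option, under the assumption that only `.`, `?` and `!` count as sentence endings (so a comma no longer stops the sentence); `generate_sentence3`, `TERMINAL` and the toy chain are hypothetical names:

```python
import random

TERMINAL = ('.', '?', '!')

def generate_sentence3(chain, count=15, rng=random):
    '''Hypothetical variant: stop at sentence-ending punctuation, else append a random mark.'''
    word = rng.choice(list(chain.keys()))
    sentence = word.capitalize()
    for _ in range(count - 1):
        # Stop once the current word already closes a sentence, or has no followers
        if word.endswith(TERMINAL) or word not in chain:
            break
        word = rng.choice(chain[word])
        sentence += ' ' + word
    if not sentence.endswith(TERMINAL):
        # Drop trailing commas, semicolons and colons before adding the closing mark
        sentence = sentence.rstrip(',;:') + rng.choice(TERMINAL)
    return sentence

toy_chain = {'she': ['sees', 'smiles.'], 'sees': ['a'], 'a': ['vision,', 'locker.'], 'vision,': ['she']}
print(generate_sentence3(toy_chain, count=8))
```

This also avoids the doubled period that appears when the last generated word already ends with one.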

In [16]:
generate_sentence(ep08_dict)

'Defeats Tyler with her methods, her true identity—Laurel Gates. However, Gates resurrects Crackstone must remember.'

In [14]:
generate_sentence2(ep08_dict)

'Oleander, one can pretty much better now.'