# Text Generation

This is a fork of https://github.com/adashofdata/nlp-in-python-tutorial.  Rather than using the transcripts of stand-up comedians, it uses the text of New Testament books of the Bible.

- Created Nov 2022
- Revised Nov 2024

In [1]:
# # Load the Drive helper and mount
# from google.colab import drive
# drive.mount('/content/drive')

import os
# base_dir = '/content/drive/MyDrive/ColabData/NLP_Demo/'
base_dir = '.'

# Create the base directory if it does not exist yet
os.makedirs(base_dir, exist_ok=True)

## Introduction

Markov chains can be used for very basic text generation. Think of every word in a corpus as a state. If we assume that the next word depends only on the current word, we have the defining assumption of a (first-order) Markov chain.

Markov chains don't generate text as well as deep learning models do, but they're a good (and fun!) start.
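
To make the "every word is a state" idea concrete, here is a minimal sketch on a made-up toy sentence. The sentence, the variable names, and the helper `transition_probs` are illustrative assumptions and are not part of the notebook's corpus or code. It counts which words follow each word and turns those counts into transition probabilities.

```python
from collections import Counter, defaultdict

# Toy text (illustrative only, not from the corpus)
toy_text = "the cat sat on the mat and the cat ran"
toy_words = toy_text.split()

# Count how often each word is followed by each other word
follower_counts = defaultdict(Counter)
for current_word, next_word in zip(toy_words[:-1], toy_words[1:]):
    follower_counts[current_word][next_word] += 1

def transition_probs(word):
    """Return {next_word: probability} for the given state (hypothetical helper)."""
    counts = follower_counts[word]
    total = sum(counts.values())
    return {nxt: n / total for nxt, n in counts.items()}

the_probs = transition_probs("the")
print(the_probs)
```

In this toy text, "the" is followed by "cat" twice and "mat" once, so a chain sitting in the state "the" should move to "cat" about two thirds of the time.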

## Select Text to Imitate

In this notebook, we're going to generate text in the style of Paul, so as a first step, let's extract the text from his letters. For comparison, we'll also extract the Gospel of John and the Synoptic Gospels (Matthew, Mark, and Luke).

In [2]:
# Read in the corpus, including punctuation!
import pandas as pd

data = pd.read_pickle(os.path.join(base_dir, 'corpus.pkl'))
data

| | book_text | num_chapters |
|---|---|---|
| Matthew | 1:1 This is the record of the genealogy of Jes... | 28 |
| Mark | 1:1 The beginning of the gospel of Jesus Chris... | 16 |
| Luke | 1:1 Now many have undertaken to compile an acc... | 24 |
| John | 1:1 In the beginning was the Word, and the Wor... | 21 |
| Acts | 1:1 I wrote the former account, Theophilus, ab... | 28 |
| Romans | 1:1 From Paul, a slave of Christ Jesus, called... | 16 |
| 1 Corinthians | 1:1 From Paul, called to be an apostle of Chri... | 16 |
| 2 Corinthians | 1:1 From Paul, an apostle of Christ Jesus by t... | 13 |
| Galatians | 1:1 From Paul, an apostle (not from men, nor b... | 6 |
| Ephesians | 1:1 From Paul, an apostle of Christ Jesus by t... | 6 |


In [3]:
# Remove chapter & verse numbering and newlines
import re

def clean_text_round1(text):
    """Remove chapter and verse numbering (and all other numbers) and newlines.

    Case and punctuation are left unchanged."""
    text = re.sub(r'[0-9]+:[0-9]+', '', text)  # remove all chapter:verse references
    text = re.sub(r'[0-9]+', '', text)         # remove all remaining numbers
    text = re.sub(r'\n', '', text)             # remove newlines
    return text

data_clean = data.book_text.apply(clean_text_round1)
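
As a quick check of what the cleaning step does, the sketch below applies the same three substitutions as `clean_text_round1` to a short made-up sample. The sample string and the function name `strip_references` are assumptions for illustration only. Note that deleting a verse reference leaves its surrounding spaces behind.

```python
import re

def strip_references(text):
    """Apply the same three substitutions as clean_text_round1 (hypothetical stand-in)."""
    text = re.sub(r'[0-9]+:[0-9]+', '', text)  # chapter:verse references
    text = re.sub(r'[0-9]+', '', text)         # any remaining numbers
    text = re.sub(r'\n', '', text)             # newlines
    return text

# Made-up sample in the same "chapter:verse text" layout as the corpus
sample = "1:1 From Paul, an apostle. 1:2 Grace and peace to you.\n"
cleaned = strip_references(sample)
print(repr(cleaned))
```

The result keeps capitalization and punctuation, starts with a leading space, and has a double space where `1:2` used to be.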

In [4]:
# get text from several of Paul's letters
books_by_paul = [
    'Romans',
    '1 Corinthians',
    '2 Corinthians',
    'Galatians',
    'Ephesians',
    'Philippians',
    'Colossians',
    '1 Thessalonians',
    '2 Thessalonians',
    '1 Timothy',
    '2 Timothy',
    'Titus',
    'Philemon',
]
paul_text = ''.join([data_clean.loc[b] for b in books_by_paul])
john_text = data_clean.loc['John']
synoptic_text = ''.join([data_clean.loc[b] for b in ['Matthew', 'Mark', 'Luke']])
#paul_text

## Build a Markov Chain Function

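We are going to build a simple Markov chain function that creates a dictionary:
* The keys are the words in the corpus (every word that is followed by at least one other word)
* Each value is the list of words that follow that key, with repeats kept so that frequent followers are chosen more often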

In [5]:
from collections import defaultdict

def markov_chain(text):
    """The input is a string of text and the output will be a dictionary with each word as
       a key and each value as the list of words that come after the key in the text."""

    # Tokenize the text by word, though including punctuation
    words = text.split(' ')

    # Initialize a default dictionary to hold all of the words and next words
    m_dict = defaultdict(list)

    # Create a zipped list of all of the word pairs and put them in word: list of next words format
    for current_word, next_word in zip(words[0:-1], words[1:]):
        m_dict[current_word].append(next_word)

    # Convert the default dict back into a dictionary
    m_dict = dict(m_dict)
    return m_dict

In [6]:
# Create the dictionary from the text
paul_dict = markov_chain(paul_text)
john_dict = markov_chain(john_text)
synoptic_dict = markov_chain(synoptic_text)
#paul_dict
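
One thing worth checking in these dictionaries: the cleaning step leaves runs of spaces where verse numbers were removed, and `text.split(' ')` turns each extra space into an empty-string token. Those empty strings then become states in the chain, which is likely why the generated sentences further down contain runs of spaces. The sketch below shows the effect on a made-up string and contrasts it with `str.split()` called with no argument, which collapses runs of whitespace. Switching to `split()` is a possible variation, not what the notebook does.

```python
# Made-up string with a double space, as left behind by a removed verse number
sample = "grace and peace.  I thank my God"

tokens_single_space = sample.split(' ')  # splits on every single space
tokens_whitespace = sample.split()       # collapses runs of whitespace

print(tokens_single_space)
print(tokens_whitespace)
```

The first list contains an empty string between `'peace.'` and `'I'`, while the second does not.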

## Create a Text Generator

We're going to create a function that generates sentences. It will take two things as inputs:
* The dictionary you just created
* The number of words you want generated

Here are some examples of generated sentences from the original tutorial, which used stand-up comedy transcripts:

>'Shape right turn– I also takes so that she’s got women all know that snail-trail.'

>'Optimum level of early retirement, and be sure all the following Tuesday… because it’s too.'

In [7]:
import random

def generate_sentence(chain, count=25):
    """Input a dictionary in the format of key = current word, value = list of next words
       along with the number of words you would like to see in your generated sentence."""

    # Pick a random starting word and capitalize it
    word1 = random.choice(list(chain.keys()))
    sentence = word1.capitalize()

    # Generate the second word from the value list. Set the new word as the first word. Repeat.
    for _ in range(count-1):
        # A word that only occurs at the very end of the text has no followers, so stop early
        if word1 not in chain:
            break
        word2 = random.choice(chain[word1])
        word1 = word2
        sentence += ' ' + word2

    # End it with a period
    sentence += '.'
    return sentence

In [8]:
generate_sentence(paul_dict)

'Slanders, evil one.     but we endure,   I passed on those who rejoice,   So then, brothers and blameless.'

In [9]:
generate_sentence(john_dict)

'Secret.      Nathanael (who was Jesus.    Eight days already.   Andrew, Simon Peter said to the man.'

In [10]:
generate_sentence(synoptic_dict)

'Mother named Jairus, came to Jesus replied as we have found one hair with it.   For this way.  I will worship me.'

## Additional Exercises

1. Try improving the `generate_sentence` function. For example, let it end with a random punctuation mark, or stop as soon as it reaches a word that already ends with a punctuation mark.
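
One possible starting point for the second idea is sketched below. It is a hedged sketch rather than a reference solution: the function name `generate_until_punctuation`, its `max_words` parameter, and the tiny hand-made chain are all assumptions introduced for illustration. It stops as soon as the current word ends with `.`, `!`, or `?`, and also stops if it reaches a word with no recorded followers.

```python
import random

def generate_until_punctuation(chain, max_words=25):
    """Generate words until one ends in '.', '!' or '?', or max_words is reached."""
    end_marks = ('.', '!', '?')
    word = random.choice(list(chain.keys()))
    words = [word.capitalize()]
    while len(words) < max_words and not word.endswith(end_marks):
        followers = chain.get(word)
        if not followers:  # dead end: this word never appears as a key
            break
        word = random.choice(followers)
        words.append(word)
    sentence = ' '.join(words)
    # Add a period only if the sentence does not already end with punctuation
    if not sentence.endswith(end_marks):
        sentence += '.'
    return sentence

# Tiny hand-made chain (illustrative only) in which every path ends at 'peace.'
toy_chain = {'grace': ['and'], 'and': ['peace.']}
result = generate_until_punctuation(toy_chain)
print(result)
```

With a real dictionary such as `paul_dict`, the same function could be called as `generate_until_punctuation(paul_dict)`. The sentence length then varies from run to run, up to `max_words`.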