<a href="https://colab.research.google.com/github/hlab-repo/learning/blob/main/Build_a_Text_Bot_by_Timothy_Beal_and_Michael_Hemenway.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **BUILD A TEXT BOT**

**by Timothy Beal and Michael Hemenway, [h.lab](https://case.edu/artsci/hlab/)**

Build a fully functional text bot in the programming language of Python. No coding experience necessary, but each step is annotated for those who want to understand more about what's going on from line to line.

This is a slightly simplified version of the program behind [KJVBot](https://twitter.com/kjvbot), which autogenerates and then tweets its own verses based on the King James Version Bible. Many thanks to Justin Barber for extensive help in designing and building the original Markov bot.

Once you work your way through this notebook, you can easily adapt it to work from other texts (see the simple instructions at the end).




##**About Markov Processes**

This bot uses something called a Markov chain. Named after the Russian mathematician Andrey Andreyevich Markov (1856–1922), a Markov chain is a simple yet powerful process of prediction. It begins with a *current state* and predicts (based on training data) what the *next state* to follow that state will be.

Let's say, for example, we are trying to predict the weather. Today, which is our current state, is rain. First, we build a list of every next state, or next day's weather, that has followed rain. Imagine that list, which is the Markov chain's training data, looks like this: `(cloudy, rain, cloudy, sunny, rain, rain, rain, rain, rain, rain)`. The Markov process then randomly chooses from that list, and whatever it chooses becomes the new current state. Given that `rain` is followed by more `rain` seven out of ten times (70 percent) in our list, more `rain` is quite probable. Still, `cloudy` has a 20 percent chance and `sunny` has a 10 percent chance. And so on.
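If you would like to see the weather example as code, here is a minimal sketch. The `next_states_after_rain` list is just the made-up training data from the paragraph above, not real weather records, and the variable names are our own.

```python
from collections import Counter
from random import choice

# Every "next state" that has followed rain in our imaginary training data.
next_states_after_rain = ["cloudy", "rain", "cloudy", "sunny", "rain",
                          "rain", "rain", "rain", "rain", "rain"]

# Count how often each next state appears to see the probabilities.
counts = Counter(next_states_after_rain)
probabilities = {state: n / len(next_states_after_rain)
                 for state, n in counts.items()}
print(probabilities)  # rain: 0.7, cloudy: 0.2, sunny: 0.1

# The Markov step itself: pick one item at random from the list.
# Because "rain" appears most often, it is the most likely pick.
tomorrow = choice(next_states_after_rain)
print(tomorrow)
```

Notice that the program never calculates the percentages in order to choose. Picking at random from a list in which `rain` appears seven times out of ten does that work for us.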

But how does this process work for a text bot? Imagine that our database is not weather history but *The New York Times*, and that our current state is the word "Barack". Our program would go through the entire database of news text (very quickly!), making a list of every next state, that is, next word or punctuation mark, that has ever followed "Barack" in the *Times*. Then it would randomly select one from that list. Let's say that, of the millions of occurrences of "Barack," the word "Obama" is the next state that follows "Barack" 90 percent of the time. Roughly nine times out of ten, then, the program will end up randomly selecting that word as its next state. "Obama" then becomes the current state, and the process begins again, compiling and selecting from a list of all its next states. And so on, and so on, adding each next state to the utterance until it reaches a predetermined stopping point (e.g., a period or a maximum number of words).
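The same idea can be sketched with a toy text. Everything here is an assumption made for illustration: the tiny `corpus` stands in for a huge body of news text, the current state is a single word (the bot below uses three), and `generate()` is a name we made up.

```python
from collections import defaultdict
from random import choice

# A tiny made-up "database" standing in for a large body of news text.
corpus = "the cat sat on the mat . the dog sat on the rug ."
tokens = corpus.split()

# For every token, list every token that has ever followed it.
next_states = defaultdict(list)
for current, following in zip(tokens, tokens[1:]):
    next_states[current].append(following)

def generate(start, max_tokens=20):
    """Follow the chain from `start` until a period or the token limit."""
    utterance = [start]
    while utterance[-1] != "." and len(utterance) < max_tokens:
        utterance.append(choice(next_states[utterance[-1]]))
    return " ".join(utterance)

print(next_states["the"])  # ['cat', 'mat', 'dog', 'rug']
print(generate("the"))     # a different utterance each time you run it
```

With so little text the results are not very surprising, but the steps are the ones described above: look up the current state, choose a next state at random, and repeat.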

This bot works in a similar way, continuing the process until it lands upon a period, exclamation mark, or question mark, at which point it stops. If the resulting utterance is no longer than 130 characters (or whatever you set the limit to be), it prints (i.e., displays) that utterance. If it's over the set limit, it starts again from scratch. And so on.

Don't worry if the process is not entirely clear to you yet. Working through the following steps should help. So let's dive in.

##**1. Save a Copy of This Notebook**

First, you need to make a copy of this Colab notebook that you can edit and run yourself. Go to “File” / “Save a copy in Drive” and save it to your own Google Drive space (or use the GitHub option if you prefer).

##**2. Download the KJV Bible**

Next, open this KJV Bible text file and save it to your own computer: https://drive.google.com/file/d/1A9aiYV3XvsfoC81M7jgPimu1m4wFA7Ww/view. Important: remember where you saved it.
  

##**3. Import the Needed Libraries**

Libraries are collections of pre-written code that you can import in order to carry out certain actions more efficiently. In case you are interested, these are the libraries we will import and use:

> **`files`** (part of Google's larger **colab** library, thus imported "`from google.colab`"), which allows us to upload the text file from Drive or a local computer into this program;

> **`nltk`** ("Natural Language Tool Kit" library), which we use to convert our text from a single string of words and punctuation into a list of sentences;

> **`tee`** (part of the **`itertools`** library), which makes several independent copies of an iterator over our list, so that each copy can be moved one step further ahead than the one before it (see below in the `list_crawler` function);

> **`defaultdict`** (part of the **`collections`** library), which works with **`tee`** to make a dictionary (a collection of key:value pairs) that will give us every next word (*next state*) that follows every three-word string (*current state*) in our text;

> **`re`** (the **regular expressions** library), which we use to clean up and prepare the text; and

> **`choice`** (part of the **`random`** library), which randomly chooses an item from a list (we use it to choose each next token, and to randomly choose the bot's starting point from a list of possible starting phrases).

The next cell is our first block of code, which will import all of the above into our program. *To run it, click the play button in the upper left corner.*






In [None]:
from google.colab import files
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from itertools import tee
from collections import defaultdict
import re
from random import choice


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


##**4. Upload your Text File**

In Colab, you need to upload any text or other files your program uses every time you start a new session. Here we will upload a plain text version of the King James Version Bible.

Run the next cell to connect (or "mount") your Google Drive to this Colab session. Colab will ask you to authorize access.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now run this cell and a pop-up menu will appear. Find your `kjv.txt` file wherever you saved it. Colab will upload it for use during this session.


In [None]:
uploaded = files.upload()

Saving kjv.txt to kjv.txt


##**5. Define the Three Functions Needed to Run the Bot**

A function is a block of code with a name that runs when it is "called" (i.e., named) later in the program. Often, functions are called within other functions. Here we define the three functions that this bot will use: `list_crawler()`, `build_sentence()`, and `markovize()`. Later in the program, you'll see that the first two functions, `list_crawler()` and `build_sentence()`, will actually be called *within* the third function, `markovize()`.



**A. Define the `list_crawler()` function.**

These lines build a function that crawls through a list -- in this case, the KJV text as a list of words and punctuation marks -- moving forward one step each time. So, if our text were Genesis 1, it would result in a list like this: `[("in", "the", "beginning"), ("the", "beginning", "god"), ("beginning", "god", "created"), ("god", "created", "the")]`. This function will be called within the `markovize()` function to find all *next states* for all *current states* (see below).



In [None]:
def list_crawler(iterable, n=2):
    if len(iterable) < n:
        return
    iterables = tee(iterable, n)
    for i, iter_ in enumerate(iterables):
        for num in range(i):
            next(iter_)
    return zip(*iterables)
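To see what the function produces, you can try it on a short list. This is only a small demonstration: the `tokens` list is typed in by hand rather than read from the KJV file, and the function definition is repeated so that the example runs on its own.

```python
from itertools import tee

# Same definition as in the cell above, repeated so this example is self-contained.
def list_crawler(iterable, n=2):
    if len(iterable) < n:
        return
    iterables = tee(iterable, n)
    for i, iter_ in enumerate(iterables):
        for num in range(i):
            next(iter_)
    return zip(*iterables)

# A hand-typed list of tokens standing in for the full text.
tokens = ["In", "the", "beginning", "God", "created", "the", "heaven"]

# Each window starts one step further along than the one before it.
windows = list(list_crawler(tokens, n=3))
for window in windows:
    print(window)

# A list shorter than n produces nothing: the function returns None.
print(list_crawler(["In", "the"], n=3))
```

The first window printed is `('In', 'the', 'beginning')` and the last is `('created', 'the', 'heaven')`. In `markovize()` the function is called with `n=4`, so that each window holds a three-token *current state* plus the *next state* that follows it.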


**B. Define the `build_sentence()` function.**

These lines will be used along with `list_crawler()` inside the `markovize()` function, below, to create the new utterance.

In [None]:
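def build_sentence(seed, sent_tokens):
    token = ''
    # Keep adding tokens until we reach sentence-ending punctuation.
    while token not in set('׃.?!\n'):
        # The current state is the last three tokens of the utterance so far.
        last_tokens = tuple(seed[-3:])
        # Randomly pick one of the tokens that followed this state in the text.
        new_token = choice(sent_tokens[last_tokens])
        seed.append(new_token)
        token = new_token
    sentence = ' '.join(seed)
    # Remove the space that join() put before each punctuation mark.
    sentence = re.sub(r'\s+([׃.,?!:;\n])', r'\1', sentence)
    return sentence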
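Two small steps in this function are easier to see in isolation: how the *current state* is read off the end of the growing list, and how the finished list is tidied into a sentence. The sketch below uses a hand-typed `seed` list (an assumption for illustration, not real bot output) and the same substitution pattern as the function.

```python
import re

# Tokens as build_sentence() might have collected them (a hypothetical example).
seed = ["And", "God", "said", ",", "Let", "there", "be", "light", "."]

# The current state is always the last three tokens collected so far.
current_state = tuple(seed[-3:])
print(current_state)  # ('be', 'light', '.')

# Joining with spaces leaves a gap before each punctuation mark...
joined = " ".join(seed)
print(joined)   # And God said , Let there be light .

# ...so the substitution removes the whitespace in front of punctuation.
cleaned = re.sub(r'\s+([׃.,?!:;\n])', r'\1', joined)
print(cleaned)  # And God said, Let there be light.
```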

**C. Define the `markovize()` function.**

This function does a lot. First, it opens the text we uploaded (lines 2-3). Then it turns that text into a list of separated sentences (line 4) and tokens (i.e., words and punctuation marks; lines 6-7). Then, using the `list_crawler()` function, defined above, and the `defaultdict` that we imported, it turns that list of tokens into a huge dictionary of key:value pairs, with each "key" being a three-token string and each "value" being a list of every token that follows that string somewhere in the text. What this gives us, then, is every next word, or *next state*, that comes after every three-word phrase, or *current state*, in the text. The result is a very, very long dictionary that looks like this (with the three tokens in parentheses as the key and the bracketed list of *next states* as the value for that key):

>`{('the', 'revelation', 'of'): ['jesus'], ('revelation', 'of', 'jesus'): ['christ'], ('of', 'jesus', 'christ'): [','], ('jesus', 'christ', ','): ['which'], ('christ', ',', 'which'): ['god'], (',', 'which', 'god'): ['gave'] ...}`
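To see how such a dictionary comes about on a small scale, here is a sketch that builds one from a single hand-typed sentence. It uses the same token pattern as `markovize()`, but as a simplification it skips the sentence-splitting step and uses staggered slices with `zip` in place of `list_crawler()`. Note that the code keeps capital letters as they appear in the text.

```python
import re
from collections import defaultdict

# One sentence standing in for the whole text file.
sentence = "The revelation of Jesus Christ, which God gave unto him."

# Split into words and punctuation marks with the pattern markovize() uses.
tokens = re.findall(r"[\w']+|[׃.,?!:;\n]", sentence)
print(tokens)

# Map each run of three tokens to the list of tokens that follow it.
# The four staggered slices play the role of list_crawler(tokens, n=4).
sent_tokens = defaultdict(list)
for t1, t2, t3, t4 in zip(tokens, tokens[1:], tokens[2:], tokens[3:]):
    sent_tokens[t1, t2, t3].append(t4)

print(sent_tokens["of", "Jesus", "Christ"])  # [',']
print(len(sent_tokens))                      # 9 different current states
```

In a full text, many keys would have lists with several entries, and a token that follows a phrase often would appear in that phrase's list many times, which is what makes it more likely to be chosen.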

Once it has built that dictionary, the `build_sentence()` function works within an iterating loop to build the actual verse or "utterance." It does so using a Markov process: beginning with a three-token start phrase as its *current state*, it looks up the list of all possible *next states* (tokens that follow that phrase) in the text, and then randomly selects one from that list. That new token then becomes the third of the three tokens in the *current state* (the former first token drops off), and the process begins again. The process continues until it randomly selects a period, exclamation point, or question mark, at which point it stops. If the resulting utterance is no longer than 130 characters (or whatever you set the limit to be), it prints it; if not, it starts all over again from the same three-token start phrase.

In [None]:
def markovize(word1, word2, word3, fileid, char_limit=None):
    with open(fileid, encoding='utf-8') as f:
        text = f.read()
    sentences = sent_tokenize(text)
    sent_tokens = defaultdict(list)
    for sentence in sentences:
        tokens = re.findall(r"[\w']+|[׃.,?!:;\n]", sentence)
        crawled_list = list_crawler(tokens, n=4)
        if crawled_list:
            for token1, token2, token3, token4 in crawled_list:
                sent_tokens[token1, token2, token3].append(token4)
    too_long = True
    while too_long:
        sentence = [word1, word2, word3]
        utterance = build_sentence(sentence, sent_tokens)
        len_utterance = len(utterance)
        if char_limit is not None and len_utterance > char_limit:
            too_long = True
        else:
            too_long = False
    print(utterance)

##**6. Create Different Start Phrases for the Bot**

Here we simply create a list of possible three-token starting points for the bot. Note that they all need to show up someplace in the KJV text or the bot will fail before it begins. We use the **`choice`** function that we imported to randomly choose the starting point for the bot each time it runs.





In [None]:
start_phrases = [["Woe", "unto", "the"],
     ["And", "when", "he"],
     ["And", "I", "saw"],
     ["And", "he", "answered"],
     ["And", "the", "priest"],
     ["In", "the", "beginning"]]

[word1, word2, word3] = choice(start_phrases)
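If you add your own start phrases, it can help to check that each one really occurs in the text before running the bot. The helper below is a hypothetical addition, not part of the original program, and `sample_text` is a short stand-in for the contents of `kjv.txt`. It is also only an approximate check: the bot splits the text into sentences first, whereas this sketch scans the text as a whole.

```python
import re

def phrase_has_next_state(phrase, text):
    """Return True if the tokens in `phrase` appear in a row in `text`
    with at least one more token after them."""
    tokens = re.findall(r"[\w']+|[׃.,?!:;\n]", text)
    target = tuple(phrase)
    n = len(target)
    # Stop one window early so that a following token always exists.
    return any(tuple(tokens[i:i + n]) == target
               for i in range(len(tokens) - n))

# A short stand-in for the contents of kjv.txt (an assumption for this demo).
sample_text = "In the beginning God created the heaven and the earth."

print(phrase_has_next_state(["In", "the", "beginning"], sample_text))  # True
print(phrase_has_next_state(["Woe", "unto", "the"], sample_text))      # False
```

The match is case-sensitive, just as it is in the bot, so `"in"` and `"In"` count as different tokens.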

##**7. Run the Bot!**

We run the bot by calling the function `markovize()`, which, as we saw, incorporates the two previous functions within it. We do that simply by typing its name along with the "arguments" (placed inside the parentheses) it needs to run. These arguments are: the first three tokens (`word1, word2, word3`, which have been randomly selected from the six choices in `start_phrases`, above), the text file it will process (`kjv.txt`), and the character limit for the utterance (`130`).

After you've run the whole program once during a session, you can simply run the last two cells to produce new utterances with new start phrases (or run only the last cell to rerun the bot with the same start phrase).


In [None]:
markovize(word1, word2, word3, "kjv.txt", 130)

In the beginning of our confidence stedfast unto the end of heaven to the other.


##**Things to Try Next**

1. Experiment with longer or shorter length limits by changing `130` to a different number.
2. Change some or all of the start phrases in the previous cell to ones you know are in the KJV text.
3. Adapt this program to work with a different text by (a) uploading a different plain text file (include a space before and after each punctuation mark so the program recognizes it as a separate token), (b) replacing "kjv.txt" with your file name, and (c) changing the start phrases to ones you know are in that text.
4. See if you can revise the program so that it uses a shorter or longer start phrase. How does that change the kinds of utterances it produces?

The last two will take a little more time, but they could be fun ways to learn some coding by diving in!
