# **BUILD A BOT!**

This notebook was created for a "make" session of THATCamp at the annual meeting of the American Theological Library Association. It is a fully functional bot, written in Python, that uses Markov processes to autogenerate its own verses based on the King James Version Bible. This is a slightly simplified version of the program behind [KJVBot](https://twitter.com/kjvbot), which tweets its auto-generated verses based on all or part of the KJV Bible.
<br>
<br>
This program can easily be adapted to work from other texts (just upload a different .txt file and feed it start phrases that you know are in that text). No coding experience is necessary here, but each step is annotated for those who want to understand more about what's going on from line to line.
<br>
<br>
As with all programs in Colab, you can either run each cell of code one at a time by clicking each play button, or you can run them all in order by selecting "Runtime" / "Run all" from the menu at the top.

##**About Markov Processes**

The Markov process is a simple yet powerful means of prediction. It begins with a *current state* and, based on probability, predicts what the *next state* will be. So, for example, if we were talking about predicting the weather, and the *current state* were raining, the Markov process would make a list of all the states in its data that have come after raining (partly cloudy, sunny, sunny, raining, raining, raining, partly cloudy). Then it would randomly choose from that list, and whatever it chooses would become the new *current state*. Given that certain next states are more common in the list than others (e.g., continuing rain), its prediction is probabilistic. Now, imagine that our database is not weather history but *The Washington Post*, and that our *current state* is the word "Barack." The Markov process would go through the entire database of news text (very quickly!), make a list of every *next state* word or punctuation mark that has followed "Barack," and randomly select one from that list. Let's say 96% of the time "Obama" is the *next state* word after "Barack." So most of the time the process will end up selecting that word as its *next state* prediction. That word then becomes the *current state*, and the process begins again, compiling and selecting from a list of all the *next states* to "Obama". And so on.
<br>
<br>
With this bot, the process continues until it lands upon a period, exclamation mark, or question mark, at which point it stops and, if the length of the utterance is no more than 130 characters (or whatever you set the limit to be), it prints that utterance. If it's over the set limit, it starts again from scratch. And all that in a second or two!
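
The weather example above can be sketched in a few lines of Python. This is only an illustration: the `history` list is made-up data, arranged so that the states following "raining" match the list in the text, and `forecast` is a hypothetical helper name, not part of the bot.

```python
from collections import defaultdict
from random import choice

# Made-up weather history, one observation per day (illustrative data only)
history = ["raining", "partly cloudy", "raining", "sunny", "raining", "sunny",
           "raining", "raining", "raining", "raining", "partly cloudy"]

# For every state, collect each state that came right after it
next_states = defaultdict(list)
for current, following in zip(history, history[1:]):
    next_states[current].append(following)

def forecast(start, days):
    """Hypothetical helper: walk the chain for a number of days."""
    state = start
    path = [state]
    for _ in range(days):
        # Repeated entries make common transitions more likely to be picked
        state = choice(next_states[state])
        path.append(state)
    return path

print(next_states["raining"])
print(forecast("raining", 5))
```

Because "raining" appears three times in the list of states that follow "raining," continued rain is the single most likely pick, though any of the three states can come up.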

##**Before Diving In ...**

You need to do two things. First, make a copy of this Colab notebook that you can edit and run by going to “File” / “Save a copy in Drive” and saving it to your own Google Drive space. Second, open this KJV Bible text file and save it to your own computer: https://drive.google.com/file/d/1A9aiYV3XvsfoC81M7jgPimu1m4wFA7Ww/view (remember where you saved it and what you called it).


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##**1. Import the needed libraries.**

Libraries are collections of pre-written code that a program can import and use to perform certain functions. The ones we will use are:

> **files** (part of Google's **colab** library), which allows us to upload the text file from Drive or a local computer into this program;
<br>
<br>
> **nltk** (the "Natural Language Toolkit" library), which we use to split our text from a single string of words and punctuation into a list of sentences; using **regular expressions** (see below), we then turn each sentence into a list of tokens, each of which is a single word or punctuation mark (e.g., ["and", ",", "behold"]);
<br>
<br>
> **tee** (part of the **itertools** library), which makes several independent copies of an iterator over our token list; staggering each copy one step ahead of the last gives us overlapping sequences of tokens (see below in the list_crawler function);
<br>
<br>
> **defaultdict** (part of the **collections** library), which works with **tee** to make a dictionary (a collection of key:value pairs) that will give us every next word (*next state*) that follows every three-word string (*current state*) in our text;
<br>
<br>
> **re** is the **regular expressions** library, which we use to tokenize the text and to find and replace certain characters and line breaks in the text; and 
<br>
<br>
> **choice** (part of the **random** library), which randomly chooses an item from a list (we use it to randomly choose the bot's starting point from a list of possibilities).
<br>
<br>
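
To get a feel for three of these tools before they appear in the bot, here is a small sketch on a three-token list. The variable names (`words`, `pairs`, `followers`, `pick`) are made up for this illustration.

```python
from itertools import tee
from collections import defaultdict
from random import choice

words = ["and", ",", "behold"]

# tee: two independent iterators over the same tokens
first, second = tee(words, 2)
next(second)                      # move the second copy one step ahead
pairs = list(zip(first, second))  # each token paired with the one after it

# defaultdict(list): a key that has not been seen yet starts as an empty list
followers = defaultdict(list)
for current, following in pairs:
    followers[current].append(following)

# choice: pick one item at random from a list
pick = choice(["Woe", "And", "In"])

print(pairs)
print(dict(followers))
print(pick)
```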






In [2]:
from google.colab import files
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
from itertools import tee
from collections import defaultdict
import re
from random import choice


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


##**2. Upload a text for the bot to work from.**

In Colab, you need to upload whatever files you need to run your program every time you start a new session. For the purposes of this exercise, we will upload a plain text version of the King James Version Bible. When you run this line, a pop up menu will appear. Find your kjv.txt file in Drive or wherever you have it saved. Colab will then upload it for use during this session.

In [3]:
uploaded = files.upload()

Saving kjv.txt to kjv.txt


##**3. Define the three functions needed to run the bot.**

Functions work like programs within a program to perform certain actions. Most often, certain functions are run, or "called," within other functions. So, in what follows, the first two functions, list_crawler() and build_sentence(), are called inside the third, markovize().



**A. Define the list_crawler() function.**

These lines build a function that crawls through a list (in this case, a list of words and punctuation marks from the KJV text), moving forward one step each time and returning each run of n consecutive tokens. So, if our text were Genesis 1 and n were 3, it would result in a sequence like this: [("In", "the", "beginning"), ("the", "beginning", "God"), ("beginning", "God", "created"), ("God", "created", "the")]. This function will be used ("called") in the markovize() function with n=4, so that each sequence holds a three-token *current state* plus the *next state* that follows it (see below).



In [None]:
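def list_crawler(iterable, n=2):
    # Return overlapping n-token windows; None if the list is too short
    if len(iterable) < n:
        return
    # Make n independent iterators over the same tokens
    iterables = tee(iterable, n)
    # Advance the i-th iterator by i steps so each starts one token later
    for i, iter_ in enumerate(iterables):
        for num in range(i):
            next(iter_)
    # Zip the staggered iterators into tuples of n consecutive tokens
    return zip(*iterables)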
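
To see what kind of output this produces, here is a self-contained sketch that builds the same overlapping windows with list slices instead of **tee**. The helper name `sliding_windows` is hypothetical and is not part of the bot; for a plain list it should give the same tuples that list_crawler() yields.

```python
tokens = ["In", "the", "beginning", "God", "created", "the"]

def sliding_windows(seq, n):
    """Hypothetical helper: every run of n consecutive items in a list."""
    # seq[0:], seq[1:], ... are the same list started one step later each time;
    # zip stops as soon as the shortest of them runs out
    return list(zip(*(seq[i:] for i in range(n))))

windows = sliding_windows(tokens, 3)
for window in windows:
    print(window)
```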


**B. Define the build_sentence() function.**

These lines will be used along with list_crawler() inside the markovize() function, below, to create the new utterance.

In [None]:
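def build_sentence(seed, sent_tokens):
    token = ''
    # Keep adding tokens until one of them ends the sentence
    while token not in set('׃.?!\n'):
        # Current state: the last three tokens of the utterance so far
        last_tokens = tuple(seed[-3:])
        # Next state: a random pick from the tokens that followed them
        new_token = choice(sent_tokens[last_tokens])
        seed.append(new_token)
        token = new_token
    sentence = ' '.join(seed)
    # Remove the space that join() put in front of each punctuation mark
    sentence = re.sub(r'\s+([׃.,?!:;\n])', r'\1', sentence)
    return sentence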
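
The last two steps of build_sentence() (joining the tokens and then tidying the punctuation) can be tried on their own. This sketch uses a hand-written token list and, as a simplifying assumption, a pattern that covers only the ASCII punctuation marks.

```python
import re

tokens = ["And", "he", "answered", ",", "No", "."]

# Joining with spaces leaves a stray space before each punctuation mark
joined = " ".join(tokens)

# Replace "whitespace + punctuation" with just the punctuation
cleaned = re.sub(r"\s+([.,?!:;\n])", r"\1", joined)

print(joined)
print(cleaned)
```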

**C. Define the markovize() function.**

This function does a lot. First, it opens the text we uploaded (lines 2-3). Then it turns that text into a list of separated sentences (line 4) and turns each sentence into tokens (lines 6-7). Then, using the list_crawler() function, defined above, and the **defaultdict** class that we imported, it turns those tokens into a huge dictionary of key:value pairs, with each "key" being a three-token sequence and each "value" being a list of every token that follows that sequence somewhere in the text. What this gives us, then, is every next token, or *next state*, that comes after every three-token phrase, or *current state*, in the text. The result is a very, very long dictionary that looks like this (the three tokens in parentheses are the key, and the bracketed list of *next state* tokens is the value for that key):

>{('The', 'Revelation', 'of'): ['Jesus'], ('Revelation', 'of', 'Jesus'): ['Christ'], ('of', 'Jesus', 'Christ'): [','], ('Jesus', 'Christ', ','): ['which'], ('Christ', ',', 'which'): ['God'], (',', 'which', 'God'): ['gave'] ...}

Once it has built that dictionary, markovize() calls the build_sentence() function inside a loop to build the actual verse or "utterance." It does so using a Markov process: beginning with a three-token start phrase as its *current state*, it looks up the list of all possible *next states* (tokens that follow that phrase) in the text, and then randomly selects one from that list. That new token then becomes the third of the three tokens in the *current state* (the former first token drops off), and the process begins again. The process continues until it randomly selects a period, exclamation point, or question mark, at which point it stops. If the resulting utterance is no longer than 130 characters (or whatever you set the limit to be), it prints it; if not, it starts all over again from the same three-token start phrase.

In [None]:
def markovize(word1, word2, word3, fileid, char_limit=None):
    with open(fileid, encoding='utf-8') as f:
        text = f.read()
    sentences = sent_tokenize(text)
    sent_tokens = defaultdict(list)
    for sentence in sentences:
        tokens = re.findall(r"[\w']+|[׃.,?!:;\n]", sentence)
        crawled_list = list_crawler(tokens, n=4)
        if crawled_list:
            for token1, token2, token3, token4 in crawled_list:
                sent_tokens[token1, token2, token3].append(token4)
    too_long = True
    while too_long:
        sentence = [word1, word2, word3]
        utterance = build_sentence(sentence, sent_tokens)
        len_utterance = len(utterance)
        if char_limit is not None and len_utterance > char_limit:
            too_long = True
        else:
            too_long = False
    print(utterance)
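
To watch the dictionary take shape on a small scale, here is a self-contained sketch that tokenizes one short phrase and records what follows every three-token sequence. It is a simplified stand-in for the middle of markovize(), under two assumptions: it uses list slices rather than list_crawler(), and its pattern covers only the ASCII punctuation marks. The name `table` is made up for this illustration.

```python
import re
from collections import defaultdict

text = "The Revelation of Jesus Christ, which God gave"

# Words (letters, digits, apostrophes) or single punctuation marks
tokens = re.findall(r"[\w']+|[.,?!:;\n]", text)

# Key: three consecutive tokens. Value: list of tokens seen right after them.
table = defaultdict(list)
for i in range(len(tokens) - 3):
    key = tuple(tokens[i:i + 3])
    table[key].append(tokens[i + 3])

for key, followers in table.items():
    print(key, followers)
```

Note that capitalization is kept as it appears in the text, which is why the start phrases in the next step are capitalized too.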

##**4. Create several different start phrases for the bot.**

Here we simply create a list of possible three-token starting points for the bot. Note that they all need to show up someplace in the KJV text, with the same capitalization, or the bot will fail before it begins. We use the **choice** function that we imported to randomly choose the starting point for the bot each time it runs.





In [None]:
start_phrases = [["Woe", "unto", "the"],
     ["And", "when", "he"],
     ["And", "I", "saw"],
     ["And", "he", "answered"],
     ["And", "the", "priest"],
     ["In", "the", "beginning"]]

[word1, word2, word3] = choice(start_phrases)
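
If you adapt the bot to another text, one way to avoid a failed start is to check each candidate phrase against the dictionary of three-token keys first. The sketch below is only an illustration: the small `sent_tokens` dictionary is hand-built, standing in for the one that markovize() builds internally, and `usable_phrases` is a hypothetical helper name.

```python
from collections import defaultdict

# Hand-built miniature dictionary (illustrative data only)
sent_tokens = defaultdict(list)
sent_tokens["In", "the", "beginning"].append("God")
sent_tokens["And", "I", "saw"].append("another")

candidates = [["In", "the", "beginning"],
              ["And", "I", "saw"],
              ["Once", "upon", "a"]]

def usable_phrases(phrases, table):
    """Hypothetical helper: keep only phrases that are keys in the table."""
    # "in" checks for a key without adding an empty entry to the defaultdict
    return [phrase for phrase in phrases if tuple(phrase) in table]

usable = usable_phrases(candidates, sent_tokens)
print(usable)
```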

##**5. Run the bot!**

We run the bot by calling the function markovize(), which, as we saw, incorporates the two previous functions within it. We do that simply by typing its name plus the details ("arguments") it needs to run, namely: the first three tokens (word1, word2, word3, which have been randomly selected from the six choices in "start_phrases", above), the text file it will process (kjv.txt), and the character limit for the utterance (130).
<br>
<br>
After you've run the whole program once during a session, you can simply run the last two cells to produce new utterances with new start phrases (or run only the last cell to rerun the bot with the same start phrase).


In [None]:
markovize(word1, word2, word3, "kjv.txt", 130)

And he answered, No.
