# Predictive text and text generation

By [Allison Parrish](http://www.decontextualize.com/)

This notebook is a whirlwind tour of how certain kinds of predictive text generation work! By "predictive text generation" what I mean is any text generation method that is based around a statistical model that, given a certain stretch of text, "predicts" which bit of text should come next, based on probabilities learned from an existing corpus of text.

The code is written in Python, but you don't really need to know Python in order to use the notebook. Everything's pre-written for you, so you can just execute the cells, making small changes to the code as needed.

## Working with text files

Before we get started, we'll first need some text! Grab two [plain text files from Project Gutenberg](http://www.gutenberg.org/) (or from another source of your choice) and save them to the same directory as this notebook. (I suggest working with two files because we'll be running some code explicitly to "compare" two texts. Also, I think seeing two different outputs from the text generation methods discussed in this notebook will help you better understand how those methods work.) The code in the following cell loads into Python variables the contents of *two plain text files*, assigned to variables `text_a` and `text_b`. You'll need to replace the filenames with the names of the files that you downloaded, keeping the quotation marks (`"`) intact.

In [2]:
text_a = open("coprus_cleaned.txt").read()
text_b = open("frost.txt").read()

These variables are *strings*, which are essentially just long lists of the characters that occur in the text, in the order that they occur. The code in the following cell shows the first two hundred characters of text A:

In [3]:
print(text_a[:200])

Dog Woman
BY CHRIS ABANI
It’s like flying in your dreams, she said. You empty
Yourself out and just lift off. Soar. It’s like that.
 	*
Red. 	Red.	Red.
	Just that word. Sometimes.
 	*
Yang & Yin. Like


You can change `text_a` to `text_b` to see the output from your second text, or change `200` to a number of your choosing.

The `random.sample()` function gives us a random sampling of the contents of a variable (as long as that variable is a sequence of things, like a string or a list). So, for example, to see twenty random characters from text B:

In [4]:
import random
random.sample(text_b, 20)

['n',
 'I',
 'o',
 't',
 'l',
 'n',
 'n',
 '\n',
 'o',
 'g',
 'o',
 'a',
 'e',
 'i',
 'c',
 'b',
 't',
 'l',
 't',
 'e']

This isn't incredibly helpful on its own, but you'll notice that the characters it drew (probably) more or less follow the expected letter distribution for English (i.e., lots of `e`s and `n`s and `t`s).

Perhaps more interesting would be to see a randomly-sampled list of *words*. To do this, we'll make separate variables for the words in the text, using a Python function called `.split()`, which takes a string and turns it into a list of words contained in that string. The following cell makes two new variables that contain the words from both texts respectively:

In [5]:
a_words = text_a.split()
b_words = text_b.split()

Now, ten random words from both text A and text B:

In [6]:
random.sample(a_words, 10)

['salty',
 'on',
 'posthumous',
 'tree',
 'can',
 'black',
 'over',
 'it!',
 'is',
 'The']

In [7]:
random.sample(b_words, 10)

['as', 'in', 'In', 'undergrowth;', 'just', 'one', 'and', 'Oh,', 'far', 'lay']

The code in the following cell uses Python's `Counter` object to count the *most common* letters in the first of these texts:

In [8]:
from collections import Counter
Counter(text_a).most_common(12)

[(' ', 147012),
 ('e', 83285),
 ('t', 57127),
 ('o', 51676),
 ('a', 51252),
 ('n', 46542),
 ('i', 43179),
 ('s', 42741),
 ('h', 39228),
 ('r', 38385),
 ('l', 29623),
 ('d', 26958)]

Specifying the `a_words` variable gives the most frequent *words* instead:

In [9]:
Counter(a_words).most_common(12)

[('the', 8766),
 ('a', 4257),
 ('and', 4015),
 ('of', 3772),
 ('I', 3348),
 ('to', 3043),
 ('in', 2579),
 ('is', 1785),
 ('you', 1414),
 ('my', 1297),
 ('that', 1222),
 ('it', 1202)]

Compare these to the most common words in text B:

In [10]:
Counter(b_words).most_common(12)

[('I', 8),
 ('the', 8),
 ('And', 6),
 ('as', 5),
 ('in', 3),
 ('a', 3),
 ('one', 3),
 ('and', 3),
 ('that', 3),
 ('Two', 2),
 ('roads', 2),
 ('diverged', 2)]

### Unigram language model

In [11]:
' '.join(random.sample(b_words,50))

'this as telling long In bent equally for Had wear; leaves To less undergrowth; fair, has lay as hence: in way be And far sigh trodden sorry traveled for difference. no the took on looked Though ages the took passing I just Yet them Oh, And wood, the as day!'
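Joining randomly sampled words like this is about the simplest language model there is: each word is picked with no regard for its neighbors. Because `random.sample()` draws *without* replacement from the full word list, frequent words already turn up more often. The sketch below makes that frequency weighting explicit and samples *with* replacement instead. (The stand-in corpus and the function name `unigram_sentence` are made up for illustration; swap in `a_words` or `b_words` to try it on your own text.)

```python
import random
from collections import Counter

# a tiny stand-in corpus; swap in a_words or b_words from the cells above
words = "the road not taken and the road less traveled by".split()

def unigram_sentence(words, length=8, seed=None):
    """Draw `length` words independently, weighted by how often each occurs."""
    rng = random.Random(seed)
    counts = Counter(words)
    vocab = list(counts.keys())
    weights = list(counts.values())
    return " ".join(rng.choices(vocab, weights=weights, k=length))

print(unigram_sentence(words, length=8))
```

Passing a `seed` makes the output repeatable, which can be handy when comparing two texts.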

In [16]:
# stop one word early so the final item is still a full pair
pairs = [' '.join(a_words[i:i+2]) for i in range(len(a_words) - 1)]

In [17]:
print(text_a)

Dog Woman
BY CHRIS ABANI
It’s like flying in your dreams, she said. You empty
Yourself out and just lift off. Soar. It’s like that.
 	*
Red. 	Red.	Red.
	Just that word. Sometimes.
 	*
Yang & Yin. Like twins tumbling through summer.
	He, the rooster crowing sun; desperate—afraid—
 	As only men can be.
And Yin? Let’s say she has long hair—
	No, that won’t work. If we are to believe
the ancient Chinese, she was a dog
	howling moon.
 	*
When I counted out the pills, it was a slowing down.
	Like the delay between when the car goes through
the dip and your stomach falls away—
	And won’t stop.
 	*
Of course it was because she didn’t fit my mold.
So I punished her. And why? And why? And why?
	You did it, I said. You did it.
Wouldn’t fill my world.
 	*
And eventually we all kill our mothers.
Their eyes a tenderness that doesn’t flinch
	from it. Knowing. Eventually.
 	*
What else is there?
 	*
Paula’s paintings are real. The women thick, visceral,
like stubborn cliffs the sea cannot contain—or d

In [19]:
Counter(pairs).most_common(12)

[('in the', 810),
 ('of the', 638),
 ('on the', 432),
 ('and the', 339),
 ('. .', 320),
 ('to the', 314),
 ('like a', 256),
 ('in a', 255),
 ('of a', 245),
 ('at the', 244),
 ('I am', 201),
 ('to be', 199)]

## Markov models: what comes next?

Now that we have the ability to find and record the n-grams in a text, it’s time to take our analysis one step further. The next question we’re going to try to answer is this: Given a particular n-gram in a text, what is most likely to come next?

We can imagine the kind of algorithm we’ll need to extract this information from the text. It will look very similar to the code to find n-grams above, but it will need to keep track not just of the n-grams but also a list of all units (word, character, whatever) that *follow* those n-grams.

Let’s do a quick example by hand. This is a character-level order-2 n-gram analysis of the (very brief) text “condescendences,” keeping track of all characters that follow each n-gram:

| n-grams |	next? |
| ------- | ----- |
|co| n|
|on| d|
|nd| e, e|
|de| s, n|
|es| c, (end of text)|
|sc| e|
|ce| n, s|
|en| d, c|
|nc| e|

From this table, we can determine that while the n-gram `co` is followed by `n` 100% of the time, and while the n-gram `on` is followed by `d` 100% of the time, the n-gram `de` is followed by `s` 50% of the time, and `n` the rest of the time. Likewise, the n-gram `es` is followed by `c` 50% of the time, and followed by the end of the text the other 50% of the time.
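Here is one way the table above might be built in code. This is a minimal sketch, not the code of any particular library: the function name `build_model` and the `END` marker are made up for illustration. It works on a string (for characters) or on a list of words.

```python
from collections import defaultdict

END = "(end of text)"  # marker used here for the end of the source text

def build_model(units, n):
    """Map each n-gram (as a tuple) to the list of units that follow it."""
    model = defaultdict(list)
    for i in range(len(units) - n + 1):
        ngram = tuple(units[i:i + n])
        # the unit right after the n-gram, or END if the n-gram is the last one
        following = units[i + n] if i + n < len(units) else END
        model[ngram].append(following)
    return dict(model)

model = build_model("condescendences", 2)
for ngram, followers in model.items():
    print("".join(ngram), "->", ", ".join(followers))
```

Keeping duplicates in each list (as with `nd`, which is followed by `e` twice) means that picking a random item from the list automatically respects the observed frequencies.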

Exercise: Imagine (or even better, write out) what this table might look like if you were analyzing words instead of characters, with a source text of your choice.

### Markov chains: Generating text from a Markov model

The Markov model we created above doesn't just give us interesting statistical probabilities. It also allows us to generate a *new* text with those probabilities by *chaining together predictions*. Here’s how we’ll do it, starting with the order 2 character-level Markov model of `condescendences`: (1) Start with the initial n-gram (`co`); those are the first two characters of our output. (2) Now, look at the last *n* characters of output, where *n* is the order of the n-grams in our table, and find those characters in the “n-grams” column. (3) Choose randomly among the possibilities in the corresponding “next” column, and append that letter to the output. (Sometimes, as with `co`, there’s only one possibility.) (4) If you chose “end of text,” then the algorithm is over. Otherwise, repeat the process starting with (2). Here’s a record of the algorithm in action:

    co
    con
    cond
    conde
    conden
    condend
    condende
    condendes
    condendesc
    condendesce
    condendesces
    
As you can see, we’ve come up with a word that looks like the original word, and could even be passed off as a genuine English word (if you squint at it). From a statistical standpoint, the output of our algorithm is nearly indistinguishable from the input. This kind of algorithm—moving from one state to the next, according to a list of probabilities—is known as a Markov chain generator.
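The four steps above can be sketched in a few lines of Python. This is an illustration under some assumptions rather than a definitive implementation: the names `build_model` and `generate` are invented here, `None` stands in for "end of text," and a `max_length` cap is added because the chain can loop for a while before it happens to reach the end.

```python
import random
from collections import defaultdict

END = None  # stands in for "end of text"

def build_model(text, n):
    """Map each n-gram in the text to the list of characters that follow it."""
    model = defaultdict(list)
    for i in range(len(text) - n + 1):
        following = text[i + n] if i + n < len(text) else END
        model[text[i:i + n]].append(following)
    return model

def generate(model, start, n, max_length=100, seed=None):
    rng = random.Random(seed)
    output = start                          # (1) start with the initial n-gram
    while len(output) < max_length:
        choices = model[output[-n:]]        # (2) look up the last n characters
        following = rng.choice(choices)     # (3) choose randomly among followers
        if following is END:                # (4) stop at "end of text"
            break
        output += following
    return output

model = build_model("condescendences", 2)
print(generate(model, "co", 2))
```

Run the last line a few times: every result starts with `con`, and every three-letter stretch of it also appears somewhere in the original word.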

### Generating with Markovify

Fortunately, with the invention of digital computers, you don't have to perform this algorithm by hand! In fact, Markov chain text generation has been a pastime of poets and programmers going back [all the way to 1983](https://www.jstor.org/stable/24969024), so it should be no surprise that there are many implementations of the idea in Python that you can download and install. The one we're going to use is [Markovify](https://github.com/jsvine/markovify), a Markov chain text generation library originally developed for BuzzFeed, apparently. It comes with a lot of extra niceties that will make our lives easier, but underneath the hood, it implements an algorithm very similar to the one we just did by hand above.

To install Markovify on your computer, run the cell below:

In [20]:
!pip install markovify

Collecting markovify
  Downloading https://files.pythonhosted.org/packages/99/b7/a5cf39283f08c8013623dbcf67063b0215942ae464fc864eca1434d050e1/markovify-0.7.2.tar.gz
Collecting unidecode (from markovify)
  Downloading https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl (238kB)
Building wheels for collected packages: markovify
  Building wheel for markovify (setup.py) ... done
  Created wheel for markovify: filename=markovify-0.7.2-cp36-none-any.whl size=9478 sha256=ab879adf2582a6d42eb0b11e970b0ecb7a8f58edb8e9b47e8ee16e22f98bec69
  Stored in directory: /Users/imac/Library/Caches/pip/wheels/0c/19/38/b901adb8ab0721a6c8c86f468e48b22f3ecf08560e6aeb99fa
Successfully built markovify
Installing collected packages: unidecode, markovify
Successfully installed markovify-0.7.2 unidecode-1.1.1


And then run this cell to make the library available in your notebook:

In [21]:
import markovify

The code in the following cell creates a new text generator, using the text in the variable specified to build the Markov model, which is then assigned to the variable `generator_a`.

In [47]:
generator_a = markovify.Text(text_a)

In [37]:
generator_a = markovify.Text(text_a, state_size=1)

In [45]:
generator_a = markovify.Text(text_a.split("\n"), state_size=2)

You can then call the `.make_sentence()` method to generate a sentence from the model:

In [49]:
print(generator_a.make_sentence())

A good wedding Starts in the garden no one else knows that he’s sincere until, my eyes wide and intelligent.


In [51]:
print(generator_a.make_sentence(tries=1000))

His dog follows after him And I suppose You could have been grateful, instead I felt the heat of the living room floor.


The `.make_short_sentence()` method allows you to specify a maximum length for the generated sentence:

In [24]:
print(generator_a.make_short_sentence(50))

Girls tend to see her face on the scrimmage line.


By default, Markovify tries to generate a sentence that is significantly different from any existing sentence in the input text. As a consequence, sometimes the `.make_sentence()` or `.make_short_sentence()` methods will return `None`, which means that in ten tries it wasn't able to generate such a sentence. You can work around this by increasing the number of times it tries to generate a sufficiently unique sentence using the `tries` parameter:

In [14]:
print(generator_a.make_short_sentence(40, tries=100))

There were few people of fashion.”


Or by disabling the check altogether with `test_output=False`:

In [15]:
print(generator_a.make_short_sentence(40, test_output=False))
# test_output=False tells Markovify not to check whether the generated sentence overlaps too heavily with the original text

On Mr. and Mrs. Long.


### Changing the order

When you create the model, you can specify the order of the model using the `state_size` parameter. It defaults to 2. Let's make two models with different orders and compare:

In [54]:
gen_a_1 = markovify.Text(text_a, state_size=1)
gen_a_4 = markovify.Text(text_a, state_size=4)

In [56]:
print("order 1")
print(gen_a_1.make_sentence(test_output=False))
print()
print("order 4")
print(gen_a_4.make_sentence(test_output=False))

order 1
See BY LOUIS SIMPSON It’s just waited with him get off the only heard.

order 4
Skull and crossbones, they crunch like candy.


In general, the higher the order, the more the sentences will seem "coherent" (i.e., more closely resembling the source text). Lower order models will produce more variation. Deciding on the order is usually a matter of taste and trial-and-error.
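One way to see why higher orders stick closer to the source: the longer the n-gram, the fewer different words have ever followed it, so the generator has fewer chances to branch off. The sketch below measures that on a made-up sample sentence (the function name `average_choices` and the sample text are just for illustration; try it with `a_words` instead).

```python
from collections import defaultdict

def average_choices(words, order):
    """Average number of distinct next words per n-gram of the given order."""
    followers = defaultdict(set)
    for i in range(len(words) - order):
        followers[tuple(words[i:i + order])].add(words[i + order])
    return sum(len(f) for f in followers.values()) / len(followers)

sample = ("the cat sat on the mat and the dog sat on the log "
          "and the cat saw the dog on the mat").split()

for order in (1, 2, 3):
    print(order, round(average_choices(sample, order), 2))
```

As the average approaches 1, almost every n-gram has only one possible continuation, and the output will mostly reproduce the source text.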

### Changing the level

Markovify, by default, works with *words* as the individual unit. It doesn't come out-of-the-box with support for character-level models. The following code defines a new kind of Markovify generator that implements character-level models. Execute it before continuing:

In [57]:
class SentencesByChar(markovify.Text):
    def word_split(self, sentence):
        return list(sentence)
    def word_join(self, words):
        return "".join(words)

Any of the parameters you passed to `markovify.Text` you can also pass to `SentencesByChar`. The `state_size` parameter still controls the order of the model, but now the n-grams are characters, not words.

The following cell implements a character-level Markov text generator for the word "condescendences":

In [58]:
con_model = SentencesByChar("condescendences", state_size=2)

Execute the cell below to see the output—it'll be a lot like what we implemented by hand earlier!

In [59]:
con_model.make_sentence()

'condencencencencences'

Of course, you can use a character-level model on any text of your choice. So, for example, the following cell creates a character-level order-7 Markov chain text generator from text A:

In [60]:
gen_a_char = SentencesByChar(text_a, state_size=7)

And the cell below prints out a random sentence from this generator. (The `.replace()` is to get rid of any newline characters in the output.)

In [61]:
print(gen_a_char.make_sentence(test_output=False).replace("\n", " "))

Likewise too the laughing stumbled him  up for you?”


### Combining models

Markovify has a handy feature that allows you to *combine* models, creating a new model that draws on probabilities from both of the source models. You can use this to create hybrid output that mixes the style and content of two (or more!) different source texts. To do this, you need to create the models independently, and then pass them to the `markovify.combine()` function.

In [62]:
generator_a = markovify.Text(text_a)
generator_b = markovify.Text(text_b)
combo = markovify.combine([generator_a, generator_b], [0.5, 0.5])

The bit of code `[0.5, 0.5]` controls the "weights" of the models, i.e., how much to emphasize the probabilities of any model. You can change this to suit your tastes. (E.g., if you want mostly text A with but a *soupçon* of text B, you would write `[0.9, 0.1]`. Try it!) 
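To get a feel for what the weights do, here is a rough sketch of the idea using plain dictionaries: each model's follower counts are scaled by its weight and then added together. This illustrates the concept only and is not Markovify's own code; the function name `combine_counts` and the counts in the two tables are hypothetical.

```python
from collections import Counter

def combine_counts(tables, weights):
    """Merge follower-count tables, scaling each table's counts by its weight."""
    combined = {}
    for table, weight in zip(tables, weights):
        for state, followers in table.items():
            merged = combined.setdefault(state, Counter())
            for follower, count in followers.items():
                merged[follower] += count * weight
    return combined

# hypothetical counts of what follows the word "the" in two texts
table_a = {"the": Counter({"sea": 3, "moon": 1})}
table_b = {"the": Counter({"road": 2})}

combo_counts = combine_counts([table_a, table_b], [0.9, 0.1])
print(combo_counts["the"])
```

With weights of `[0.9, 0.1]`, "road" is still a possible continuation after "the", but a much less likely one than "sea".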

Then you can create sentences using the combined model:

In [66]:
print(combo.make_sentence())

Till the drop of ripeness exudes, And the serpent there, In another nest, the master of the girl would not be a bad one, what a spoon is the public library, painted several shades of white, I think.


### Bringing it all together

I've pre-written some code below to make it easy for you to experiment and produce output from Markovify. Just make adjustments to the values assigned to the variables in the cell below:

In [72]:
# change to "word" for a word-level model
level = "char"
# controls the length of the n-gram
order = 7
# controls the number of lines to output
output_n = 24
# weights between the models; text A first, text B second.
# if you want to completely exclude one model, set its corresponding value to 0
weights = [0.5, 0.5]
# limit sentence output to this number of characters
length_limit = 280

(The lines beginning with `#` are "comments"—they don't do anything, they're just there to explain what's happening in the code.)

After making your changes above, run the cell below to generate text according to your parameters. Repeat as necessary until you get something you really like!

In [75]:
model_cls = markovify.Text if level == "word" else SentencesByChar
gen_a = model_cls(text_a, state_size=order)
gen_b = model_cls(text_b.split("\n"), state_size=order)
gen_combo = markovify.combine([gen_a, gen_b], weights)
for i in range(output_n):
    out = gen_combo.make_short_sentence(length_limit, test_output=False)
    # make_short_sentence() returns None when no sentence fits the limit
    if out is None:
        continue
    out = out.replace("\n", " ")
    print(out)
    print()

Reading All my length and lover, rector, docent!

They write this except for your Dodge Charger.

Bluey Gibbons.

In her moonlight.

Don’t miss me says DJ.

I turn around in my breast, last goes.

Architecture in the deck.

This rape joke is that is not lies More than they are  useful.

*  For each grave I speaking & they are, but I pour myself but always have made the particle.

I wonders is she sitting in his universe before you shivering residue, chilled and blear she rest, or wheat are introduced.

A Week After Your Death.

It is usual soprano’s  athleticism,—   	the broth to chill.

•	1940: First DQ® store cannot keep it really well.

In sentence fifty-four the same clothes, feet,  	Nor what I've come down slow 	Effacement has to keep the latest row, brandishing pole jigging—I like New York, Tokyo, Hong Kong, Paris the burns & turns to dust.

The waste remains and kicks off.

Never mind the almond, and silver Cadillac for the complications—;  	and so I 	thought of homemade bombs i

What we're comparing here is called *unigram frequency*. ("Unigram" means a sequence of length one; more on this below.) For most English texts, the most frequent words in any given text will correspond closely to the most common words in the language as a whole.

Neural networks are better than Markov models at modeling "long-distance dependencies": relationships between words or characters that are far apart in a sequence, such as an opening quotation mark and the closing one that matches it. The trade-off is that neural networks need more data to train successfully.

## Neural network text prediction with `textgenrnn`

Like a [Markov chain](ngrams-and-markov-chains.ipynb), a recurrent neural network (RNN) is a way to make predictions about what will come next in a sequence. For our purposes, the sequence in question is a sequence of characters, and the prediction we want to make is *which character will come next*. Both Markov models and recurrent neural networks do this by using statistical properties of text to make a *probability distribution* for what character will come next, given some information about what comes before. The two procedures work very differently internally, and we're not going to go into the gory details about implementation here. (But if you're interested in the gory details, [here's a good place to start](https://karpathy.github.io/2015/05/21/rnn-effectiveness/).) For our purposes, the main *functional* difference between a Markov chain and a recurrent neural network is the *portion* of the sequence used to make the prediction. A Markov model uses a fixed window of history from the sequence, while an RNN (theoretically) uses the *entire history* of the sequence.

## Start with Markov

To illustrate, here's that Markov model of the word "condescendences." In a Markov model based on bigrams from this string of characters, you'd make a list of bigrams and the characters that follow those bigrams, like so:

| n-grams |	next? |
| ------- | ----- |
| co      | n |
| on      | d |
| nd      | e, e |
| de      | s, n |
| es      | c, (end of text) |
| sc      | e |
| ce      | n, s |
| en      | d, c |
| nc      | e |

You could also write this as a table of probabilities, with one row for each bigram and one column for each character that might come next. The value in each cell indicates the probability that the bigram in a given row will be followed by the character in a given column:

| n-grams | c | o | n | d | e | s | END |
| ------- | - | - | - | - | - | - | --- |
| co      | 0 | 0 | 1.0 | 0 | 0 | 0 | 0 | 
| on      | 0 | 0 | 0 | 1.0 | 0 | 0 | 0 | 
| nd      | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 | 
| de      | 0 | 0 | 0.5 | 0 | 0 | 0.5 | 0 |
| es      | 0.5 | 0 | 0 | 0 | 0 | 0 | 0.5 |
| sc      | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 |
| ce      | 0 | 0 | 0.5 | 0 | 0 | 0.5 | 0 |
| en      | 0.5 | 0 | 0 | 0.5 | 0 | 0 | 0 |
| nc      | 0 | 0 | 0 | 0 | 1.0 | 0 | 0 |

Each row of this table is a *probability distribution*, meaning that it shows how probable a given letter is to follow the n-gram in the original text. In a probability distribution, all of the values add up to 1.

Fitting a Markov model to the data is a matter of looking at each sequence of characters in a given text, and updating the table of probability distributions accordingly. To make a prediction from this table, you can "sample" from the probability distribution for a given n-gram (i.e., sampling from the distribution for the bigram `de`, you'd have a 50% chance of picking `n` and a 50% chance of picking `s`).
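"Fitting" in this sense can be sketched as counting followed by dividing. The code below is a minimal illustration (the function name `probability_table` is made up, and it stores only the non-zero entries of each row rather than the full grid shown above).

```python
from collections import Counter, defaultdict

END = "END"

def probability_table(text, n):
    """Return {n-gram: {next character: probability}} for the given text."""
    counts = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        following = text[i + n] if i + n < len(text) else END
        counts[text[i:i + n]][following] += 1
    table = {}
    for ngram, followers in counts.items():
        total = sum(followers.values())
        # dividing each count by the row total makes the row sum to 1
        table[ngram] = {ch: c / total for ch, c in followers.items()}
    return table

table = probability_table("condescendences", 2)
print(table["de"])  # {'s': 0.5, 'n': 0.5}
```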

Another way of thinking about this Markov model is as a (hypothetical!) function *f* that takes a bigram as a parameter and returns a probability distribution for that bigram:

    f("ce") → [0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]
    
(Note that the values at each index in this distribution line up with the columns in the table above.)
    
The items in the list returned from this function correspond to the probability for the corresponding next character, as given in the table. To sample from this list, you'd pick randomly among the indices according to their probabilities, and then look up the corresponding character by its position in the table.
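Sampling from such a list might look like the sketch below. The function `f` here is a stand-in with only the row for `ce` filled in, and the names `ALPHABET` and `sample_next` are invented for the example.

```python
import random

ALPHABET = ["c", "o", "n", "d", "e", "s", "END"]  # column order of the table

def f(bigram):
    """Hypothetical lookup; only the row for "ce" is filled in here."""
    rows = {"ce": [0.0, 0.0, 0.5, 0.0, 0.0, 0.5, 0.0]}
    return rows[bigram]

def sample_next(bigram, rng=random):
    probabilities = f(bigram)
    # pick an index according to the probabilities, then look up its character
    index = rng.choices(range(len(ALPHABET)), weights=probabilities, k=1)[0]
    return ALPHABET[index]

print(sample_next("ce"))
```

Positions with a probability of zero are never picked, so the result here is always either `n` or `s`.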

To generate new text from this model:

1. Set your output string to a randomly selected n-gram
2. Sample a letter from the probability distribution associated with the n-gram at the end of the output string
3. Append the sampled letter to the end of the string
4. Repeat from (2) until the END token is reached

Of course, you don't write this function by hand! When you're creating a Markov model from your data (or "training" the model), you're essentially asking the computer to write this function *for you*. In this sense, a Markov model is a very simple kind of machine learning, since the computer "learns" the probability distribution from the data that you feed it.

## A (very) simplified explanation of RNNs

The mechanism by which a recurrent neural network "learns" probability distributions from sequences is much more sophisticated than the mechanism used in a Markov model, but functionally they're very similar: you give the computer some data to "train" on, and then ask it to automatically create a function that will return a probability distribution of what comes next, given some input. An RNN differs from a Markov chain in that to predict the next item in the sequence, you pass in *the entire sequence* instead of just the most recent n-gram.

In other words, you can (again, hypothetically) think of an RNN as a way of automatically creating a function *f* that takes a sequence of characters of arbitrary length and returns a probability distribution for which character comes next in the sequence. Unlike with a Markov chain, it's possible to *improve* the accuracy of the probability distribution returned from this function by training on the same data multiple times.

Let's say that we want to train the RNN on the string "condescendences" to learn this function, and we want to make a prediction about what character is most likely to follow the sequence of characters "condescendence". When training a neural network, the process of learning a function like this works iteratively: you start off with a function that essentially gives a uniform probability distribution for each outcome (i.e., no one outcome is considered more likely than any other):

    f("condescendence") → [0.14, 0.14, 0.14, 0.14, 0.14, 0.14, 0.14] (after zero passes through the data)

... and as you iterate over the training data (in this case, the word "condescendences"), the probability distribution gradually improves, ideally until it comes to accurately reflect the actual observed distribution (in the parlance, until it "converges"). After some number of passes through the data, you might expect the automatically-learned function to return a distribution like this, with most of the probability on `s` (the sixth position):

    f("condescendence") → [0.01, 0.02, 0.01, 0.03, 0.01, 0.9, 0.02] (after n passes through the data)

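A single pass through the training data is called an "epoch." For neural networks in general, and RNNs in particular, more epochs usually mean a closer fit to the training data, although with enough epochs on a small text the model can start reproducing that text verbatim.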
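To make "gradually improves over epochs" a little more concrete, here is a toy sketch. To be clear, this is *not* an RNN: it adjusts a single probability distribution toward one observed outcome, using the same general recipe that neural networks use (turn scores into probabilities, compare with what was observed, nudge the scores). The names `softmax` and `train` and the learning rate of 0.5 are choices made for this example.

```python
import math

ALPHABET = ["c", "o", "n", "d", "e", "s", "END"]

def softmax(scores):
    """Turn a list of scores into probabilities that add up to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def train(target_index, epochs, learning_rate=0.5):
    """Nudge one distribution toward the observed next character."""
    scores = [0.0] * len(ALPHABET)   # equal scores give a uniform distribution
    for _ in range(epochs):
        probs = softmax(scores)
        for i, p in enumerate(probs):
            observed = 1.0 if i == target_index else 0.0
            # move each score against the gap between prediction and observation
            scores[i] -= learning_rate * (p - observed)
    return softmax(scores)

s_index = ALPHABET.index("s")
print([round(p, 2) for p in train(s_index, epochs=0)])
print([round(p, 2) for p in train(s_index, epochs=200)])
```

With zero epochs every character gets the same probability; after many epochs nearly all of the probability sits on `s`.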

To generate text from this model:

1. Initialize your output string to an empty string, or a random character, or a starting "prefix" that you specify;
2. Sample the next letter from the distribution returned for the current output string;
3. Append that character to the end of the output string;
4. Repeat from (2)

Of course, in a real life application of both a Markov model and an RNN, you'd normally have more than seven items in the probability distribution! In fact, you'd have one element in the probability distribution for every possible character that occurs in the text. (Meaning that if there were 100 unique characters found in the text, the probability distribution would have 100 items in it.)
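As a quick illustration, the size of the distribution can be worked out by collecting the distinct characters in the text. (The `"END"` marker is an assumption carried over from the tables above; a real library may handle the end of a text differently.)

```python
text = "condescendences"

# every distinct character in the text, plus a marker for the end of the text
vocabulary = sorted(set(text)) + ["END"]
print(vocabulary)
print(len(vocabulary))
```

Replace `text` with `text_a` to see how many items a character-level distribution for your own corpus would need.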

## Markov chains vs RNNs    

The primary benefit of an RNN over a Markov model for text generation is that an RNN takes into account *the entire history* of a sequence when generating the next character. This means that, for example, an RNN can theoretically learn how to close quotes and parentheses, which a Markov chain will never be able to reliably do (at least for pairs of quotes and parentheses longer than the n-gram of the Markov chain).
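The quotation mark problem is easy to demonstrate. In the sketch below (with made-up example sentences and a hypothetical helper called `markov_context`), an order-7 character model sees exactly the same context for both histories, so it has no way of knowing that one of them still has a quotation mark waiting to be closed.

```python
def markov_context(history, n):
    """A Markov model of order n only sees the last n units of the history."""
    return history[-n:]

inside_quote = 'She said, "the road was long and'
no_quote = 'She said the road was long and'

order = 7
print(markov_context(inside_quote, order) == markov_context(no_quote, order))
```

Both contexts are `ong and`, so the two histories lead to identical predictions.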

The drawback of RNNs is that they are *computationally expensive*, from both a processing and memory perspective. This is (again) a simplification, but internally, RNNs work by "squishing" information about the training data down into large matrices, and make predictions by performing calculations on these large matrices. That means that you need a lot of CPU and RAM to train an RNN, and the resulting models (when stored to disk) can be very large. Training an RNN also (usually) takes a lot of time.

Another consideration is the size of your corpus. Markov models will give interesting and useful results even for very small datasets, but RNNs require large amounts of data to train—the more data the better.

So what do you do if you *don't* have a very large corpus? Or if you don't have a lot of time to train on your corpus?

## RNN generation from pre-trained models

Fortunately for us, developer and data scientist [Max Woolf](https://github.com/minimaxir) has made a Python library called [textgenrnn](https://github.com/minimaxir/textgenrnn) that makes it really easy to experiment with RNN text generation. This library includes a model (according to the documentation) "trained on hundreds of thousands of text documents, from Reddit submissions (via BigQuery) and Facebook Pages (via my Facebook Page Post Scraper), from a very diverse variety of subreddits/Pages," and allows you to use this model as a starting point for your own training.

First install textgenrnn with `pip`. (You don't need to do this if you're running the notebook on Binder.)

In [76]:
!pip install tensorflow



In [77]:
!pip install --upgrade textgenrnn

Collecting textgenrnn
  Downloading https://files.pythonhosted.org/packages/1f/66/042499854474fdfca20403729ca88c9ceac8b5fd7374ed13be4a9ade6e7d/textgenrnn-1.5.0.tar.gz (1.7MB)
Building wheels for collected packages: textgenrnn
  Building wheel for textgenrnn (setup.py) ... done
  Created wheel for textgenrnn: filename=textgenrnn-1.5.0-cp36-none-any.whl size=1733159 sha256=f16386ca04152b2e637fa04fd2ee1b853f0f5527cc6bbf3451de406c1815c7d8
  Stored in directory: /Users/imac/Library/Caches/pip/wheels/23/46/b0/4444949d8310e43e273e931cfa9e175c34b7a2349c8114a6f7
Successfully built textgenrnn
Installing collected packages: textgenrnn
  Found existing installation: textgenrnn 1.4.1
    Uninstalling textgenrnn-1.4.1:
      Successfully uninstalled textgenrnn-1.4.1
Successfully installed textgenrnn-1.5.0


Once it's installed, import the `textgenrnn` class from the package:

In [82]:
from textgenrnn import textgenrnn

And create a new `textgenrnn` object like so. (The `name` parameter controls the filename used when automatically saving the model to disk, so pick something descriptive!)

In [83]:
textgen = textgenrnn(name="text_a")

This object has a `.generate()` method which will, by default, generate text from the pre-trained model only.

In [85]:
textgen.generate()

Anyone else broke a "deadlink" and have made it to my life album: “I finally want to buy the people ended up, any other than!



The `textgenrnn` library needs a data structure called a "list of strings" as its source text for training. We'll use Markovify's `split_into_sentences` function to turn our plain-text input files into lists of sentences like so:

In [108]:
from markovify.splitters import split_into_sentences
text_a_sentences = split_into_sentences(text_a)
text_b_sentences = split_into_sentences(text_b)

#text_a_lines = (text_a.split("\n"), 5)
#text_b_lines = (text_b.split("\n"), 5)

Here are five random sentences from both texts:

In [109]:
random.sample(text_a_sentences, 5)

['Some \nof the ears on the floor caught this scrap of his voice.',
 'My yellow brick road of intelligent design.',
 'On My 60th Birthday I Will Dig Up a Dead Body\nBY SARAH GALVIN\nI know I will be 30 in two months\nbecause the Parks Concierge is jaywalking.',
 'We dream the dream of extirpation.',
 'As for the farmers, they are, for the most part, indistinguishable: here the tractor is red, there yellow; here a pair of dirty hands, there a pair of dirty hands.']

In [110]:
# text B is short, so it may contain fewer than five sentences; cap the sample size
random.sample(text_b_sentences, min(5, len(text_b_sentences)))

To train a text generator on your own text, use the `.train_on_texts()` method, passing in a list of strings. The `num_epochs` parameter allows you to indicate how many epochs (i.e., passes over the data) should be performed. The more epochs the better, especially for shorter texts, but you'll get okay results even with just a few.

Training a neural network usually takes a really long time! So it makes sense to "try out" a text before committing to the many hours it might take to train the network on the full text. The following example trains the neural network on just the first 100 sentences from text A, which lets you get an idea of what the output will look like when training on its entire contents. You'll notice that the `.train_on_texts()` method prints output as it goes, showing what the generated text is likely to look like.

In [95]:
textgen.train_on_texts(text_a_sentences[:100], num_epochs=3)

Training on 4,848 character sequences.
Epoch 1/3
####################
Temperature: 0.2
####################
The DQ® Queent than the shootog store on the store the store on the consisses in the store of the shooting the shooting the stomp in the store on the store on the store on the things in the store on the DQ® men in the store of the sond.

The DQ® Queent than the store on the stomp store that we won’t she said.

•1965: The DQ® Queent singe in the DQ® Queent is a shot of the shootog store.

####################
Temperature: 0.5
####################
Some shot and so say the sea city is beetter in the come as a shooting the pass and she said.

Soing sean the shoots forcinger steakents than their female.

•1999: The DQ® Queetting Bay be commed.

####################
Temperature: 1.0
####################
Itetteral won’t commeminiant acry on bests like though.

Live in the birth has end.

Mild, cannots.

Epoch 2/3
####################
Temperature: 0.2
####################
•1965: The DQ® 

After training, you can generate new text using the `.generate()` method again:

In [99]:
textgen.generate()

hmmmm it is another life that wose the pussy of Middle as the sun in the word.



The results aren't very interesting because by default the generator is very conservative in how it samples from the probability distribution. You can use the `temperature` parameter to make the sampling a bit more likely to pick improbable outcomes. The higher the value, the weirder the results. The default is 0.2, and going above 1.0 is likely to produce unacceptably strange results:

In [100]:
textgen.generate(temperature=0.05)

*It was a women the worldand opens in the worldand of the worldand opens in the worldand of the worldand opens in the worldand of the worldand opens in the worldand of the worldand contained.



In [103]:
textgen.generate(temperature=0.9)

We are ever tesh the roostical donate if these DQ CHRIS puth shating my dients in 7: sain Heirlesy, she wanted a lottom stoomarter history: “I smelled with their poer and was the shirks.



In [104]:
textgen.generate(temperature=1.5)

•15695?
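
To get a feel for what `temperature` is doing, here's a minimal sketch of one common way to apply it: divide the log-probabilities by the temperature, then renormalize. The probabilities are made up for illustration, and I'm not claiming this is exactly what `textgenrnn` does internally. The names `apply_temperature` and `sample_char` are hypothetical.

```python
import math
import random

def apply_temperature(probs, temperature):
    """Rescale a probability distribution (all values > 0) by a temperature."""
    # Divide each log-probability by the temperature...
    scaled = [math.log(p) / temperature for p in probs]
    # ...then exponentiate and renormalize (subtracting the max for stability).
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_char(chars, probs, temperature):
    """Pick one character at random after reshaping the distribution."""
    adjusted = apply_temperature(probs, temperature)
    return random.choices(chars, weights=adjusted, k=1)[0]

# Made-up next-character probabilities, for illustration only
chars = ["e", "t", "q"]
probs = [0.6, 0.3, 0.1]

for temp in (0.2, 1.0, 1.5):
    print(temp, [round(p, 3) for p in apply_temperature(probs, temp)])
print(sample_char(chars, probs, 0.5))
```

Low temperatures pile almost all of the probability onto the likeliest character (hence the repetitive loops), a temperature of 1.0 leaves the distribution unchanged, and higher temperatures flatten it so that unlikely characters get picked more often.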



If you pass a number `n` to the `.generate()` method as its first parameter, `.generate()` will print out `n` instances of text generation from the model. The code in the following cell prints out ten examples at the specified temperature:

In [111]:
textgen.generate(10, temperature=0.35)

 10%|█         | 1/10 [00:00<00:04,  2.13it/s]

The DQ® should is eats and she said.



 20%|██        | 2/10 [00:00<00:02,  2.72it/s]

*It was sayind a smile.



 30%|███       | 3/10 [00:01<00:03,  2.07it/s]

•1968: First DQ® store opens in Canada Crystandom Base is a smile.



 40%|████      | 4/10 [00:03<00:04,  1.29it/s]

And the world’s life is the one of the month of the worlded the store for the stomcheated the shortest of the word opens in one of the dogs and so I said.



 50%|█████     | 5/10 [00:04<00:04,  1.06it/s]

•1965: The DQ® store opens in Yammorambells in the DQ® Streak in the DQ® Streaks of the contained one of the worlds in the painting in the modern ward.



 70%|███████   | 7/10 [00:05<00:02,  1.31it/s]

•1971: The DQ® mental is introduced.

*When I think.



 80%|████████  | 8/10 [00:06<00:01,  1.32it/s]

*What are the to open and she was one weighth and sun introduced.



 90%|█████████ | 9/10 [00:06<00:00,  1.40it/s]

•1965: The Dairy Queen® she said.



100%|██████████| 10/10 [00:07<00:00,  1.33it/s]

The Darker Basher is suckin with the paintilly the contained one of the warmest and the worldander dogs in the word.






(This may take a little while.)

If you specify `return_as_list=True`, the `.generate()` method returns the results as a list instead of printing them out:

In [60]:
generated_strs = textgen.generate(10, temperature=0.5, return_as_list=True)
print(generated_strs)

100%|██████████| 10/10 [00:16<00:00,  1.59s/it]

['“When the such in one was not must sure it.”', 'I was so the streament in a settling it intended; that you must was that he wished; that he can he was a news of Mr. Bennet may so you will be any cannabu when can north them, and not make that you must know what we decimie and it is see it and it is any keve_ it any know themselves you amand in th', '“Does not that he was married them, and advice it in the her song.', '“I have a single tim for him to her and it taken his dear do you must think of the considers.”', '“But you have anything .', '“I hope a man of so finte them all of my dear, what you think of the said that I have to know what you the house of introduce of them so you you must make an alway and her wife.”', '“I have a balture of the his not depended with that I am sure if searched and therefore is a father.', 'A was theother and it is addrick them and anyone any keve_ her more of the married than the hand of herself so went in the hand to make to get ney with himmyyones ar
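
Once you have the results as a list, you can treat them like any other Python list. Here's a minimal sketch that tidies up a batch of generated strings by dropping blanks, repeats and anything too long. The `tidy_generated` helper is hypothetical, and `sample_strs` is a stand-in for whatever `.generate()` returned:

```python
def tidy_generated(strs, max_len=120):
    """Strip whitespace, then drop empty, duplicate and overlong strings."""
    seen = set()
    kept = []
    for s in strs:
        s = s.strip()
        if s and len(s) <= max_len and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept

# Stand-in for a list returned with return_as_list=True
sample_strs = [
    "The DQ® should is eats and she said.",
    "*It was sayind a smile.",
    "*It was sayind a smile.",
    "   ",
]

for line in tidy_generated(sample_strs):
    print(line)
```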




When you're satisfied with the results and you're ready to train on all of the sentences, just remove the `[:100]` from the call to `.train_on_texts()`:

In [None]:
textgen.train_on_texts(text_a_sentences, num_epochs=5)

The textgenrnn library automatically saves the model to disk after each epoch in the same directory as this notebook. You can load a model you've previously trained by passing its filename to the `textgenrnn` function:

In [53]:
textgen = textgenrnn("text_a_weights.hdf5")

And then you can call the `.generate()` method as normal:

In [54]:
textgen.generate(10, temperature=0.5)

 10%|█         | 1/10 [00:03<00:27,  3.03s/it]

How make that see the such a said my best of the sister will be in a single will never to be the thing than them as you have a good term of they will be caperad in the for them, sick and her she been with Mrs.



 20%|██        | 2/10 [00:03<00:18,  2.30s/it]

“I have a wife are consider you may be abluse?”



 30%|███       | 3/10 [00:07<00:19,  2.72s/it]

“I was a man abused to be anything who are all them, little assure it wouldned; “the only is to heard to see the surrouse in the such a way of Mr. Bennet,” and the she had; it is introducedolic and anyone the married of the such as a not must any changeland it insurention them.



 40%|████      | 4/10 [00:08<00:13,  2.21s/it]

You cannot she should sending to sendenberg used to him to take nervous.



 50%|█████     | 5/10 [00:12<00:13,  2.79s/it]

“I do not want to see the grandic will tell you the his own what that I have an are a gight be on the mother has a dear, be muchdely that he have an not so have some of not the fortune of the beauty to start when like that I am heard; that it is a have served to actually will deall of entither.



 60%|██████    | 6/10 [00:13<00:08,  2.15s/it]

“I have a nerve of the onlymores are wife.



 70%|███████   | 7/10 [00:14<00:06,  2.01s/it]

“But he she will not water _unter to see how much themsight to watche them said to you should be not won them.



 80%|████████  | 8/10 [00:19<00:05,  2.84s/it]

“I have no used on the first of the sued of the sisters the wife had no also be replied that he is not must have a gue of the heaven to take the prepast on the amuse of the serios to find that it is to the heavy includedwiffling to she she designg the such she is no married; I was prepeyd and is a 



 90%|█████████ | 9/10 [00:20<00:02,  2.22s/it]

“Are you may was tour to be so muchdeliess them.”



100%|██████████| 10/10 [00:21<00:00,  1.94s/it]

“I said that I was so devereed to heard or so, that he can be one of them.”






(*Note*: If you're running this on Binder, make sure to download the weights from the notebook server's Home page! You can upload them again when you start a new session. Binder will automatically delete the data associated with inactive notebooks.)

### Generating with shorter texts

I've found that `textgenrnn` works especially well with very short, word-length texts. For example, download [this file of human moods](https://github.com/dariusk/corpora/blob/master/data/humans/moods.json) from the Corpora Project, and put it in the same directory as this notebook.

Then load the JSON file and grab just the list of words naming moods:

In [112]:
import json
# Open the file in a `with` block so it gets closed after reading
with open("./moods.json") as f:
    mood_data = json.load(f)
moods = mood_data['moods']

FileNotFoundError: [Errno 2] No such file or directory: './moods.json'
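
A `FileNotFoundError` like the one above means Python couldn't find `moods.json` in the notebook's directory, so make sure the file is saved there before running the cell. Here's a minimal sketch of a loader that checks for the file first and gives a friendlier message. The `load_moods` helper is hypothetical, and the demo writes a tiny stand-in file with made-up contents rather than the real data:

```python
import json
import os
import tempfile
from pathlib import Path

def load_moods(path):
    """Return the list stored under the 'moods' key of a JSON file."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; save moods.json next to this notebook first"
        )
    with path.open(encoding="utf-8") as f:
        return json.load(f)["moods"]

# Demonstrate with a stand-in file of the same shape (made-up contents)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "moods.json")
    with open(demo_path, "w", encoding="utf-8") as f:
        json.dump({"moods": ["calm", "cranky"]}, f)
    print(load_moods(demo_path))
```

If the file is there but `json.load` complains that it can't parse it, check that you saved the raw JSON file and not the GitHub web page that displays it.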

And create another textgenrnn object:

In [57]:
mood_gen = textgenrnn(name="moods")

Now, train the RNN on these moods. One epoch will do the trick:

In [58]:
mood_gen.train_on_texts(moods, num_epochs=1)

Training on 6,651 character sequences.
Epoch 1/1
####################
Temperature: 0.2
####################
art

derressed

instanted

####################
Temperature: 0.5
####################
inactly

alliee

tarreated

####################
Temperature: 1.0
####################
deed

tymed

joygucture



Now generate a list of new moods:

In [59]:
mood_gen.generate(25, temperature=0.5)

  8%|▊         | 2/25 [00:00<00:02,  9.93it/s]

owned

innepsess



 16%|█▌        | 4/25 [00:00<00:02,  8.62it/s]

insterette

resepted



 24%|██▍       | 6/25 [00:00<00:02,  7.49it/s]

joyals

indderful



 32%|███▏      | 8/25 [00:00<00:02,  8.29it/s]

cored

jusital



 44%|████▍     | 11/25 [00:01<00:01,  9.34it/s]

selful

aimed

friend



 52%|█████▏    | 13/25 [00:01<00:01,  8.08it/s]

suppleased

exielshed



 60%|██████    | 15/25 [00:01<00:01,  7.23it/s]

sitterful

indenden



 68%|██████▊   | 17/25 [00:02<00:01,  6.88it/s]

deriss

erlssepredy



 76%|███████▌  | 19/25 [00:02<00:00,  7.50it/s]

depire

contime



 84%|████████▍ | 21/25 [00:02<00:00,  6.91it/s]

distanted

exploured



 92%|█████████▏| 23/25 [00:02<00:00,  6.53it/s]

feated

belsiffful



100%|██████████| 25/25 [00:03<00:00,  7.31it/s]

bad

volutic
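
Some of the generated words (like "bad" or "friend") are ordinary English words, and some may even be in the training list already. If you only want the invented ones, you can pass `return_as_list=True` to `.generate()` and filter the results. Here's a minimal sketch; the `only_new` helper is hypothetical, and both lists are short stand-ins rather than the real mood data:

```python
def only_new(generated, training):
    """Keep generated strings not found in the training list, without repeats.

    The comparison ignores case and surrounding whitespace.
    """
    known = {t.strip().lower() for t in training}
    fresh = []
    for g in generated:
        word = g.strip()
        key = word.lower()
        if key and key not in known:
            known.add(key)  # so later repeats are skipped too
            fresh.append(word)
    return fresh

# Stand-ins: in the notebook these would be `moods` and the generated list
training_stand_in = ["bad", "friendly", "calm"]
generated_stand_in = ["owned", "bad", "joyals", "owned"]

print(only_new(generated_stand_in, training_stand_in))
```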






## Further reading

* [This notebook from the creator of textgenrnn](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) covers everything about the library that I covered in this tutorial—and much more, including how to start generation from a particular "seed" and how to save and load models (useful if you spent an afternoon training a model on your own corpus and don't want to have to do it again!)
* Take a look at [Janelle Shane's wonderful overview of how she uses RNNs in her process](http://aiweirdness.com/faq). And then take a look at her [wonderful creative work with RNNs](http://aiweirdness.com/).
* Hayes, Brian. “Computer recreations.” Scientific American, vol. 249, no. 5, 1983, pp. 18–31. JSTOR, http://www.jstor.org/stable/24969024. (Original column from Scientific American that described how Markov chain text generation works—very readable! I can send a PDF, hit me up.)
* [A Travesty Generator for Micros](https://elmcip.net/critical-writing/travesty-generator-micros) is a follow-up to Hayes' article that has some more theory and an actual Pascal listing (which is now mostly of historical interest).
* [This notebook](https://github.com/aparrish/rwet/blob/master/ngrams-and-markov-chains.ipynb) shows how to implement a Markov chain generator from scratch in Python, if you're interested in such things!