# textgenrnn 1.0 Demo

by [Max Woolf](http://minimaxir.com)

*Max's open-source projects are supported by his [Patreon](https://www.patreon.com/minimaxir). If you found this project helpful, any monetary contributions to the Patreon are appreciated and will be put to good creative use.*

## Intro

textgenrnn is a Python module on top of Keras/TensorFlow which can easily generate text using a pretrained recurrent neural network:

In [1]:
from textgenrnn import textgenrnn

textgen = textgenrnn()

Using TensorFlow backend.


## Generate Text

The `generate` function generates `n` text documents:

In [2]:
textgen.generate(5)

[Idea] The Bell Art Series Took New Performance

[Humor] I want to see a friend that the main doesn't have to watch them in the season.

What is the bad show and thinks he wants to stay me to do anything.

What is the doom when people show me that they take a community of the problems?

How to get still a boy and it's not supposed to stay to share the show of my characters



In addition, you can set the `temperature` to control the amount of creativity (default 0.5; I do not recommend setting it above 1.0), set a `prefix` to force each document to start with certain characters and continue generating from them, and set the `return_as_list` flag (default `False`) to use the generated texts elsewhere in your application (e.g. as an API).

In [3]:
generated_texts = textgen.generate(n=5, prefix="Trump", temperature=0.2, return_as_list=True)
print(generated_texts)

['Trump is a big decision of the state of the state of the story of the world...', 'Trump companies to the series of the start in the same story of the most poster in the most poster in the background and then see if you have a stranger and then the other party shows the end of the world in the same state with a community of the state of the shop when you have a good design to be', "Trump is a programming back in a base to the first time and the first time that I don't know what to do it?", 'Trump confirms a complete shot of the state of the state of the state of the state of the state of the state of the season to the state of the state of my parents for the state of the man who was a good day.', 'Trump is the best second of the state of the state of the state of the shape of the world.']
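
Because `return_as_list=True` yields a plain Python list of strings, the results can be handed to other code directly. Here is a minimal sketch of one way to do that, assuming a hypothetical `to_api_payload` helper (not part of textgenrnn) that trims whitespace, drops empty strings and exact duplicates, and serializes the rest as JSON. The sample list stands in for real model output.

```python
import json


def to_api_payload(generated_texts):
    """Hypothetical helper (not part of textgenrnn): clean a list of
    generated strings and serialize it as a JSON document."""
    seen = set()
    cleaned = []
    for text in generated_texts:
        text = text.strip()
        # Skip empty strings and exact duplicates, keeping first-seen order.
        if text and text not in seen:
            seen.add(text)
            cleaned.append(text)
    return json.dumps({"count": len(cleaned), "texts": cleaned})


# Stand-in for: textgen.generate(n=3, return_as_list=True)
sample_texts = ["Trump is the best second of the state. ",
                "Trump is the best second of the state.",
                ""]
print(to_api_payload(sample_texts))
```

Low temperatures tend to repeat themselves, so deduplicating before returning results may be worth the extra few lines.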


Using `generate_samples()` is a great way to test the model at different temperatures.

In [4]:
textgen.generate_samples()

####################
Temperature: 0.2
####################
I was an accident and I was a subreddit at the same person in the world is a lot of the most second of the state of the state of the state of the state of the state of the same card starting a community on the same state of the state of the state of the party in the state of the starter of the sta

The Mario on Twitter: "I just got my first time in the most poster and a bit of the party in the most book and have been released and want to see the strange to the state of the state of the state of the most and then going to make a program and I can be able to say they are a bot to make a strang

I want to see the top of the same state of the same story of the state of the same time in the first time in the same time to the state of the startup of the computer is the only one to get the way to start started the state of the state of the background of the world is a post in the most and the

####################
Temperature: 0.5
###
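
To see why low temperatures produce such repetitive text, it helps to look at what temperature does to a probability distribution. The sketch below shows the general temperature-scaling technique in plain Python; it is an illustration, not code from textgenrnn, and the function names and the three made-up probabilities are assumptions for the example.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution by a sampling temperature.

    Illustrative only; assumes every probability is strictly positive.
    """
    # Divide log-probabilities by the temperature, then renormalize (softmax).
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(probs, temperature, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    adjusted = apply_temperature(probs, temperature)
    return rng.choices(range(len(adjusted)), weights=adjusted, k=1)[0]


next_char_probs = [0.6, 0.3, 0.1]  # made-up probabilities for three characters
for t in (0.2, 0.5, 1.0):
    print(t, [round(p, 3) for p in apply_temperature(next_char_probs, t)])
```

At a temperature of 1.0 the distribution is unchanged. As the temperature approaches 0, nearly all of the probability mass moves to the most likely character, which is why the 0.2 samples loop on "the state of the state".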

You may also use `generate_to_file()` to make the generated texts easier to copy/paste to other sources (e.g. blog/social media):

In [5]:
textgen.generate_to_file('textgenrnn_texts.txt', n=5)

## Train on New Text

As shown above, the results from the pretrained model vary greatly, since it's a lot of data compressed into a small form. The `train_on_texts` function fine-tunes the model on a new dataset.

In [6]:
texts = ['Never gonna give you up, never gonna let you down',
            'Never gonna run around and desert you',
            'Never gonna make you cry, never gonna say goodbye',
            'Never gonna tell a lie and hurt you']

textgen.train_on_texts(texts, num_epochs=1)

Epoch 1/1
####################
Temperature: 0.2
####################
Never gonna gight you down

Never gonna give a gay agair against aro=ggy

Never gonna gigg gonna gir againt aroas

####################
Temperature: 0.5
####################
Never gay agr art at gay

Never gonna gigg aurount to you

Never gonna give up

####################
Temperature: 1.0
####################
Stirt agant as try add agf for great gonna game

Never gongaa grome never never art 2 yea=?

Never guy



Although the network was trained on only 4 texts, the original network still transfers the latent knowledge of all modern grammar and can incorporate that knowledge into generated texts, which becomes evident at higher temperatures or when using a prefix containing a character not present in the original dataset.

You can reset a trained model back to the original state by calling `reset()`.

In [7]:
textgen.reset()

Included in the repository is a `hacker_news_2000.txt` file containing a list of the Top 2000 [Hacker News](https://news.ycombinator.com/news) submissions by score. Let's retrain the model using that dataset.

For this example, I will use only a single epoch to demonstrate how easily the model learns with just one pass over the data; I recommend leaving the default of 50 epochs, or setting it even higher for complex datasets. On my 2016 15" MacBook Pro (quad-core Skylake CPU), the dataset trains at about 1.5 minutes per epoch.

In [8]:
textgen.train_from_file('../datasets/hacker_news_2000.txt', num_epochs=1)

2000 texts collected.
Epoch 1/1
####################
Temperature: 0.2
####################
Why I hade a service of my side in a server

Why I hade a company in my side in its programmer

Startup Startup Startup Introducing Software For Frame

####################
Temperature: 0.5
####################
Who is haded a commerce completional software experiment for a learnarm

The Support Use Introducing Book From Drugs

I made me I founn the code in the US company is introducing in it

####################
Temperature: 1.0
####################
Imple on Star UnSQL Dropbow

Firefit for Ya comments down

Encords the up tickeu deveromerities last services intervert



Now we can create distinctly HN-style titles, even with very little training, thanks to the pretrained nature of textgenrnn:

In [9]:
textgen.generate(5, prefix="Apple")

Apple Startups in Global

Apple HN: A Specifim to A Future of Facebook

Apple Startups Are Murdered From Backship For Selling For Company

Apple experience bank hacking the Housing Ask How I Found By Fast

Apple Completion Structures Chart



Other runtime parameters for `train_on_texts` and `train_from_file` are:

* `num_epochs`: Number of epochs to train for (default: 50)
* `gen_epochs`: Number of epochs to run between generating sample outputs; good for measuring model progress (default: 1)
* `batch_size`: Batch size for training; may want to increase if running on a GPU for faster training (default: 128)
* `prop_keep`: Random proportion of sequence samples to keep; good for controlling overfitting and reducing the memory required to process the data (default: 1.0/all)
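
As a rough illustration of what `gen_epochs` and `prop_keep` control, here is a plain-Python sketch. Both helper functions are hypothetical and written only for this example; they mimic the behavior described in the list above and are not textgenrnn's implementation.

```python
import random


def sample_epochs(num_epochs, gen_epochs):
    """Epochs (1-based) after which sample output would be printed."""
    return [e for e in range(1, num_epochs + 1) if e % gen_epochs == 0]


def keep_proportion(sequences, prop_keep, seed=None):
    """Randomly keep roughly `prop_keep` of the training sequences."""
    rng = random.Random(seed)
    n_keep = int(len(sequences) * prop_keep)
    return rng.sample(sequences, n_keep)


# With 10 epochs and gen_epochs=5, samples appear twice during training.
print(sample_epochs(num_epochs=10, gen_epochs=5))

sequences = [f"sequence {i}" for i in range(1000)]  # placeholder data
subset = keep_proportion(sequences, prop_keep=0.25, seed=42)
print(len(subset))
```

A lower `prop_keep` means each epoch sees fewer sequences, which trades some accuracy for faster epochs and lower memory use.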

## Save and Load the Model

The model saves the weights automatically after each epoch, or you can call `save()` and give an HDF5 filename. Those weights can then be loaded into a new textgenrnn model by specifying a path to the weights on construction (or use `load()` for an existing textgenrnn object).

In [10]:
textgen_2 = textgenrnn('textgenrnn_weights.hdf5')
textgen_2.generate_samples()

####################
Temperature: 0.2
####################
Why I hade my side programmers

Why I founn the Found Startup

Why I am helped me and I can't be software the service of my programmers

####################
Temperature: 0.5
####################
A day of More India

Braund Programmers in Facebook

Amazing Source Brand Employees

####################
Temperature: 1.0
####################
I hade etclytelin Gamebraby streamer

Servicer Cosment is cull AI

App Iska by I thought



In [11]:
textgen.model.get_layer('rnn_1').get_weights()[0] == textgen_2.model.get_layer('rnn_1').get_weights()[0]

array([[ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True]])

Indeed, the weights between the original model and the new model are equivalent.

You can use this functionality to load models from others that have been trained on larger datasets for many more epochs (and the model weights are small enough to fit in an email!).

In [12]:
textgen = textgenrnn('../weights/hacker_news.hdf5')
textgen.generate_samples(temperatures=[0.2, 0.5, 1.0, 1.2, 1.5])

####################
Temperature: 0.2
####################
A lesson on the importance of encouraging your children with their projects

A letter to our daughter

Show HN: Make a programmable mirror

####################
Temperature: 0.5
####################
Ask HN: What was your “why didn't I start doing this sooner” moment?

Show HN: This up votes itself

The Star Wars Route: Do a traceroute to 216.81.59.173

####################
Temperature: 1.0
####################
Encryption, Practice Roman, acquired Server

An off-grid social network

Farm Has Me Maybeat Armoill

####################
Temperature: 1.2
####################
A Gmai Now Think It Costs Got Upbot

Amit Gupta has passwording aliswry that  tool to take

Linus Torvalds: Successful projects are 99% perspiration and 1% innovation

####################
Temperature: 1.5
####################
Panamapov Airplisswackｐa, MK, and More Free?

Oracle v. Google - Diviscons The Guy used a Kaby Developer in the Armuloz

Ask HN: How to be 

## Training a New Model

You can train a new model using any modern RNN architecture you want by calling `train_new_model` if supplying texts, or by adding a `new_model=True` parameter if training from a file. If you do, the model will save a `config` file and a `vocab` file in addition to the weights, and those must also be loaded into a `textgenrnn` instance.

The config parameters available are:

* `word_level`: Whether to train the model at the word level (default: False)
* `rnn_layers`: Number of recurrent LSTM layers in the model (default: 2)
* `rnn_size`: Number of cells in each LSTM layer (default: 128)
* `rnn_bidirectional`: Whether to use Bidirectional LSTMs, which account for sequences both forwards and backwards, and perform especially well with the Attention layer (default: False)
* `max_length`: Maximum number of previous characters/words to use before predicting the next token. This value should be reduced for word-level models (default: 40)
* `max_words`: Maximum number of words (by frequency) to consider for training (default: 10000)
* `dim_embeddings`: Dimensionality of the character/word embeddings (default: 100)
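
To make `max_length` concrete, the sketch below pairs each token with the (at most) `max_length` tokens that precede it, first for characters and then for words. The `context_windows` function is a hypothetical helper written for this illustration, not part of textgenrnn.

```python
def context_windows(tokens, max_length):
    """Pair each token with up to `max_length` tokens that precede it."""
    pairs = []
    for i in range(1, len(tokens)):
        start = max(0, i - max_length)
        pairs.append((tokens[start:i], tokens[i]))
    return pairs


# Character-level: each token is a single character.
for context, target in context_windows(list("Never"), max_length=3):
    print(context, "->", target)

# Word-level: each token is a word, so a smaller max_length covers more text.
words = "never gonna give you up".split()
for context, target in context_windows(words, max_length=2):
    print(context, "->", target)
```

A window of 40 characters spans only a handful of words, while a window of 40 words can span several sentences, which is why the value should be reduced for word-level models.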

You can also specify a `name` when creating a textgenrnn instance which will help name the output weights/config/vocab appropriately.

In [13]:
textgen = textgenrnn(name="new_model")

In [14]:
texts = ['Never gonna give you up, never gonna let you down',
            'Never gonna run around and desert you',
            'Never gonna make you cry, never gonna say goodbye',
            'Never gonna tell a lie and hurt you']

textgen.train_new_model(texts,
                        word_level=True,
                        rnn_layers=1,
                        max_length=2,
                        num_epochs=10,
                        gen_epochs=5)

print(textgen.model.summary())

Training new model w/ 1-layer, 128-cell LSTMs
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
####################
Temperature: 0.2
####################
never gonna tell you lie , never gonna tell you lie , and gonna you you , , gonna gonna you you , , gonna gonna you you , , gonna gonna you you , , gonna gonna you you

never gonna let you down

never gonna run you and

####################
Temperature: 0.5
####################
never gonna give you cry , never gonna make you cry , never gonna tell a lie and hurt you

never gonna give you up , never gonna tell a lie and hurt you , , gonna make you cry , never gonna say goodbye

never gonna say you

####################
Temperature: 1.0
####################
never gonna lie around and and you you

never gonna make let you let , around gonna gonna you you down , , gonna tell lie hurt hurt you

never gonna run around desert never

Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
####################
Temperature: 0.2
##########
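
The word-level samples above are lowercased and treat punctuation such as `,` as its own token. A minimal sketch of that style of tokenization, together with a vocabulary capped at the `max_words` most frequent tokens, might look like the following. The regular expression and both helper functions are assumptions for illustration and are not textgenrnn's actual tokenizer.

```python
import re
from collections import Counter


def tokenize_words(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"[\w']+|[^\w\s]", text.lower())


def build_vocab(texts, max_words):
    """Map the `max_words` most frequent tokens to integer ids (from 1)."""
    counts = Counter(tok for text in texts for tok in tokenize_words(text))
    most_common = counts.most_common(max_words)
    return {tok: i for i, (tok, _) in enumerate(most_common, start=1)}


texts = ['Never gonna give you up, never gonna let you down',
         'Never gonna run around and desert you']
print(tokenize_words(texts[0]))
print(build_vocab(texts, max_words=4))
```

Tokens that fall outside the cap are simply unknown to the model, so a small `max_words` keeps the model compact at the cost of rarer words.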

In [15]:
textgen.reset()
textgen.train_from_file('../datasets/hacker_news_2000.txt',
                        new_model=True,
                        rnn_bidirectional=True,
                        rnn_size=64,
                        dim_embeddings=300,
                        num_epochs=1)

print(textgen.model.summary())

2000 texts collected.
Training new model w/ 2-layer, 64-cell Bidirectional LSTMs
Epoch 1/1
####################
Temperature: 0.2
####################
A stere to searne stere with with witte with with wat low to ling source with with source lowser of how with the mane and the sears dears in the stere to with with with with with will and in with with the stere stere pase lowne lowser with with wayte with with with with worke to in indine to lis w

The Inter Google Internete Internete Inter Inter Interneters

A now with with ster of with with in the ster the with we dith with watter with with weble dite lister and source with the webs stere with sources and with with weble downer with dith with with we the dinters with we with watter and with work source of with lowse of the will sears of dow the dith 

####################
Temperature: 0.5
####################
The Linter and the some and compers downite lome think geans

Intenver Storte Inster Interrester Interseres

Ande Hantres in Ante

In [16]:
textgen_2 = textgenrnn(weights_path='new_model_weights.hdf5',
                       vocab_path='new_model_vocab.json',
                       config_path='new_model_config.json')

textgen_2.generate_samples()

####################
Temperature: 0.2
####################
A searce searn searnte source with wind worke source lose source with with with inter watter and with watter and the ster and will downer with the we watter and with source sill and with dith with watte to dite and with and with we dind will source with with with searce source with sources warking

A and and with wat and source and with and dite of the and chese searn with watte lowne source to lessers

A now the ster source pase to low with with with with with the web and to dite with make lowner searce worker with with with with stare and ditters with with with with with with with with with with with with pace lowne to ster and dite lose side of steral dith fill lide the dite to line dinters

####################
Temperature: 0.5
####################
Firs Humis A cesessers

Shew Inter List Learte Firte At Bowar Interrete Inter Show HN: Intromer Inter seve wister worke bary dite of source the your we source of mystter Inter so

## Train on Single Large Text

Although textgenrnn is intended to be trained on text documents, you can train it on a large text block using `train_from_largetext_file` (which loads the entire file and processes it as if it were a single document) and it should work fine. This is akin to more traditional char-rnn tutorials.

Training a new model is recommended (and is the default). When calling `generate`, you may want to increase the value of `max_gen_length`.

In [17]:
from keras.utils import get_file  # also importable from keras.utils.data_utils in older Keras releases

fulltext_path = get_file('nietzsche.txt', origin='https://s3.amazonaws.com/text-datasets/nietzsche.txt')

textgen.reset()
textgen.train_from_largetext_file(fulltext_path, new_model=True, num_epochs=1)

Training new model w/ 2-layer, 128-cell LSTMs
Epoch 1/1
####################
Temperature: 0.2
####################
or in the same self-perhaps the relative the morality the sense of the relative the since of the same relative the morality the seems of the seems of the relatification of the sense of the relatification of the seems the destint the relative the sense of the sentiment of the seems the relatificati

or the sense of the sense of the sentiment of the perhaps a morality the relative the seems of the sense of the seems the relative the sense of the relative the sense of the sense of the sense of the sense of the sense of the relatified the present the contempt of the same self-desting the moralit

or and the same so the relatified the person and the perhaps in the same and desting the sentiment of the relation of a desting the relative the relative the man in the sense of interto the relative the relative the sense of the relative the seems the sense of seems of the seems of th

In [18]:
textgen.generate(3, max_gen_length=1000)

I the sense of the morality for which disconders of the relation to the morals
its ourselves of the sense of the sill of the whole as the subter of the into short the view to delution of the unterportant; and not who is not dealthy who seems the decidest of the worsting in the former at the seemed the meant of rangerous partunation of the relative and not only the man of the sined
and a not one of a mind of a greation of value saint of the substicate in the recommens as the relatification of the seems of morality in the same relative the mort a stronger former of evil will one sense of men the others who, the decided for the the man not the sentiment of like as the redution of the perhaps it of the more of the relative to which has not to greated regard of the same believe therefore will for the regarded of the into the same mut to the same in the might in the man of the matter free dance to not himself, one hertically so the truth and pertain and allow of moral, of the world of a lo
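
The effect of `max_gen_length` is easiest to see in a stripped-down generation loop: keep asking for the next token until either an end marker appears or the length cap is reached. The sketch below uses a toy stand-in for the trained model; `generate_text` and `toy_next_token` are hypothetical names for this example and do not reflect textgenrnn's internals.

```python
def generate_text(next_token, max_gen_length, prefix="", end_token=None):
    """Append tokens until `end_token` appears or the length cap is hit."""
    tokens = list(prefix)
    while len(tokens) < max_gen_length:
        token = next_token(tokens)
        if token == end_token:
            break
        tokens.append(token)
    return "".join(tokens)


def toy_next_token(tokens):
    """Stand-in for a trained model: cycles through a fixed phrase forever."""
    phrase = "the sense of "
    return phrase[len(tokens) % len(phrase)]


print(generate_text(toy_next_token, max_gen_length=40))
print(len(generate_text(toy_next_token, max_gen_length=1000)))
```

A model trained on one large text rarely predicts an end marker, so the length cap is usually what stops generation, and raising it yields longer passages.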



# LICENSE

MIT License

Copyright (c) 2018 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.