# pf_generator

### Using textgenrnn (and a GPU) to generate text based on scraped poetry.

*Based on work by [Max Woolf](http://minimaxir.com). For more about textgenrnn, you can visit [this GitHub repository](https://github.com/minimaxir/textgenrnn).*

**NOTE: This notebook is meant to be run in Google Colab, both for compatibility and for much faster computation on a GPU. The first cell, for example, will not run in a local Jupyter Notebook.**

## Import packages

In [None]:
%tensorflow_version 1.x

TensorFlow 1.x selected.


In [None]:
!pip install -q textgenrnn
from google.colab import files
from textgenrnn import textgenrnn
from datetime import datetime

Using TensorFlow backend.


Instructions for updating:
If using Keras pass *_constraint arguments to layers.


## Upload corpus

In the Colaboratory Notebook sidebar on the left of the screen, select *Files*. From there you can upload files:

![Files tab in the Colaboratory sidebar](https://i.imgur.com/TGcZT4h.png)

Upload **any text file** and update the file name in the cell below, then run the cell.

In [None]:
file_name = 'corpus_lined.txt'
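
Before training, it can help to confirm that the upload succeeded and to see how much text there is. A minimal sketch, assuming a UTF-8 text file; `describe_corpus` is a hypothetical helper, not part of textgenrnn:

```python
from pathlib import Path


def describe_corpus(path, preview_lines=3):
    """Return basic statistics and a short preview of a UTF-8 text file.

    Hypothetical helper for illustration; not part of textgenrnn.
    """
    text = Path(path).read_text(encoding='utf-8')
    lines = text.splitlines()
    return {
        'characters': len(text),
        'lines': len(lines),
        'non_empty_lines': sum(1 for line in lines if line.strip()),
        'preview': lines[:preview_lines],
    }


# Example call once the corpus has been uploaded:
# print(describe_corpus('corpus_lined.txt'))
```

A missing or misnamed file raises `FileNotFoundError` here, which is easier to diagnose than a failure partway into training.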

## Character-based text generator

Set the textgenrnn model configuration here.

(see the [demo notebook](https://github.com/minimaxir/textgenrnn/blob/master/docs/textgenrnn-demo.ipynb) for more information about these parameters)

In [None]:
model_cfg_linechars = {
    'word_level': False,   # set to True to train a word-level model (requires more data and a smaller max_length)
    'rnn_size': 128,   # number of LSTM cells of each layer (128/256 recommended)
    'rnn_layers': 4,   # number of LSTM layers (>=2 recommended)
    'rnn_bidirectional': False,   # consider text both forwards and backwards; can give a training boost
    'max_length': 40,   # number of tokens to consider before predicting the next (20-40 for characters, 5-10 for words recommended)
    'max_words': 100000,   # maximum number of words to model; the rest will be ignored (word-level model only)
}

train_cfg_linechars = {
    'line_delimited': False,   # set to True if each text has its own line in the source file
    'num_epochs': 20,   # set higher to train the model for longer
    'gen_epochs': 5,   # generates sample text from model after given number of epochs
    'train_size': 0.8,   # proportion of input data to train on: setting < 1.0 limits model from learning perfectly
    'dropout': 0.0,   # ignore a random proportion of source tokens each epoch, allowing model to generalize better
    'validation': False,   # if train_size < 1.0, test on the holdout dataset; makes overall training slower
    'is_csv': False   # set to True if file is a CSV exported from Excel/BigQuery/pandas
}
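
To make `max_length` and `train_size` concrete: a character-level model sees a window of up to `max_length` preceding characters and learns to predict the next one, and `train_size` decides what share of those examples is used for training. The sketch below illustrates the idea only; `make_sequences` and `split_examples` are hypothetical helpers, and textgenrnn's own preprocessing may differ in detail (for example in padding and shuffling).

```python
import random


def make_sequences(tokens, max_length):
    """Pair each token with the (up to) max_length tokens that precede it."""
    examples = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - max_length):i]
        examples.append((context, tokens[i]))
    return examples


def split_examples(examples, train_size, seed=0):
    """Shuffle the examples and split them into training and holdout parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


# A string is a sequence of characters, so this is the character-level case.
examples = make_sequences('the street was a stream', max_length=5)
train, holdout = split_examples(examples, train_size=0.8)

print(examples[:3])              # the first few (context, next character) pairs
print(len(train), len(holdout))  # sizes of the two parts
```

Passing a list of words instead of a string gives the word-level equivalent, which is why a smaller `max_length` is recommended there: each token already carries more context.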

In [None]:
model_name = 'pf_char_gen_lined'   # change to set file name of resulting trained models/texts

The next cell starts the actual training. Thanks to Keras's CuDNN layers, training on Colab's GPU is much faster than CPU training on a local machine.

Ideally, you want a training loss less than `1.0` in order for the model to create sensible text consistently.
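
For intuition about that threshold: assuming the reported loss is the usual categorical cross-entropy measured in nats (an assumption about the training setup, not something stated above), a loss of `L` means the model gives the correct next token a probability of about `exp(-L)` in the geometric-mean sense. A small sketch with hypothetical helper names:

```python
import math


def cross_entropy(correct_token_probs):
    """Mean negative log-probability (in nats) given to the correct tokens."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def typical_probability(loss):
    """Geometric-mean probability of the correct token implied by a loss."""
    return math.exp(-loss)


for loss in (2.0, 1.5, 1.0, 0.5):
    print(f'loss {loss:.1f} -> typical probability {typical_probability(loss):.2f}')
```

Under that assumption, a loss of `1.0` corresponds to a typical probability of roughly 0.37 for the correct token.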

In [None]:
%%time

textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg_linechars['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file_name,
    new_model=True,
    num_epochs=train_cfg_linechars['num_epochs'],
    gen_epochs=train_cfg_linechars['gen_epochs'],
    batch_size=1024,
    train_size=train_cfg_linechars['train_size'],
    dropout=train_cfg_linechars['dropout'],
    validation=train_cfg_linechars['validation'],
    is_csv=train_cfg_linechars['is_csv'],
    rnn_layers=model_cfg_linechars['rnn_layers'],
    rnn_size=model_cfg_linechars['rnn_size'],
    rnn_bidirectional=model_cfg_linechars['rnn_bidirectional'],
    max_length=model_cfg_linechars['max_length'],
    dim_embeddings=100,
    word_level=model_cfg_linechars['word_level'])

Training new model w/ 4-layer, 128-cell LSTMs
Training on 5,121,185 character sequences.

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
####################
Temperature: 0.2
####################
 pale stream,
And the street was the stream of the stream of the sun.
And the cold candle is closed to the shadow
of the stone with the statues of the black of the street,
And she was a bright star that should be some one of the street,
And the sun was a brown stream,
And the ship of the street with

wind
And the stream is free,
And the world of the wind is bright,
And the street steam of the street with the stone.
The street was a man who should have the world with the sun.
And the world was not the little thing that we seem
And the little wind is blowing the street
And the street was the stree

 is bright and strange and shadow
And the streets of the street of the world.
The street was a fire and the silence of the street
And the world is come to the state of the countenance of the w

You can download a large amount of generated text from your model with the cell below! Rerun the cell as many times as you want for even more text!

In [None]:
# this temperature schedule cycles between 1 very unexpected token, 1 unexpected token, 2 expected tokens, repeat.
# changing the temperature schedule can result in wildly different output!
temperature = [1.0, 0.5, 0.2, 0.2]   
prefix = None   # if you want each generated text to start with a given seed text

if train_cfg_linechars['line_delimited']:
  n = 1000
  max_gen_length = 60 if model_cfg_linechars['word_level'] else 300
else:
  n = 1
  max_gen_length = 2000 if model_cfg_linechars['word_level'] else 10000
  
timestring = datetime.now().strftime('%Y%m%d_%H%M%S')
gen_file = f'{model_name}_gentext_{timestring}.txt'

textgen.generate_to_file(gen_file,
                         temperature=temperature,
                         prefix=prefix,
                         n=n,
                         max_gen_length=max_gen_length)
files.download(gen_file)
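
The temperature schedule is easier to reason about with a small example. The sketch below shows one common way to apply a temperature to a probability distribution (raise each probability to the power `1/T`, then renormalize) and how a schedule can be cycled token by token. `apply_temperature` is a hypothetical helper written for illustration; textgenrnn's internal sampling code may differ.

```python
import math
from itertools import cycle, islice


def apply_temperature(probs, temperature):
    """Rescale strictly positive probabilities by a temperature.

    Low temperatures sharpen the distribution (expected tokens dominate);
    a temperature of 1.0 leaves it unchanged.
    """
    scaled = [math.log(p) / temperature for p in probs]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


schedule = [1.0, 0.5, 0.2, 0.2]
next_token_probs = [0.6, 0.3, 0.1]  # made-up distribution over three tokens

# One temperature per generated token, repeating the schedule as needed.
for step, temp in enumerate(islice(cycle(schedule), 8)):
    adjusted = apply_temperature(next_token_probs, temp)
    print(step, temp, [round(p, 3) for p in adjusted])
```

At `0.2` nearly all of the probability mass moves to the most likely token, while at `1.0` the less likely tokens keep their original chance of being sampled.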

You can download the weights and configuration files in the cell below, allowing you to recreate the model on your own computer!

In [None]:
files.download(f'{model_name}_weights.hdf5')
files.download(f'{model_name}_vocab.json')
files.download(f'{model_name}_config.json')

## Word-based text generator

In [None]:
model_cfg_linedocs = {
    'word_level': True,   # set to True to train a word-level model (requires more data and a smaller max_length)
    'rnn_size': 128,   # number of LSTM cells of each layer (128/256 recommended)
    'rnn_layers': 4,   # number of LSTM layers (>=2 recommended)
    'rnn_bidirectional': False,   # consider text both forwards and backwards; can give a training boost
    'max_length': 10,   # number of tokens to consider before predicting the next (20-40 for characters, 5-10 for words recommended)
    'max_words': 100000,   # maximum number of words to model; the rest will be ignored (word-level model only)
}

train_cfg_linedocs = {
    'line_delimited': False,   # set to True if each text has its own line in the source file
    'num_epochs': 20,   # set higher to train the model for longer
    'gen_epochs': 5,   # generates sample text from model after given number of epochs
    'train_size': 0.8,   # proportion of input data to train on: setting < 1.0 limits model from learning perfectly
    'dropout': 0.0,   # ignore a random proportion of source tokens each epoch, allowing model to generalize better
    'validation': False,   # if train_size < 1.0, test on the holdout dataset; makes overall training slower
    'is_csv': False   # set to True if file is a CSV exported from Excel/BigQuery/pandas
}
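
With `word_level` set to `True`, the model predicts whole tokens instead of characters, and `max_words` caps the vocabulary at the most frequent ones. The sketch below illustrates that idea with a simple regex tokenizer that splits punctuation into separate tokens, similar in spirit to the `red - faced` spacing in the sample output. `tokenize` and `build_vocab` are hypothetical helpers; textgenrnn's own tokenizer may behave differently.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocab(tokens, max_words):
    """Keep only the max_words most frequent tokens."""
    counts = Counter(tokens)
    return [word for word, _ in counts.most_common(max_words)]


tokens = tokenize("The red-faced hemlock, and the sea-light.")
print(tokens)
print(build_vocab(tokens, max_words=5))
```

Tokens outside the kept vocabulary are ignored by a word-level model, so a very small `max_words` on a varied corpus would discard most rare words.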

In [None]:
model_name = 'pf_word_gen_lined'   # change to set file name of resulting trained models/texts

The next cell starts the actual training. Thanks to Keras's CuDNN layers, training on Colab's GPU is much faster than CPU training on a local machine.

Ideally, you want a training loss less than `1.0` in order for the model to create sensible text consistently.

In [None]:
%%time

textgen = textgenrnn(name=model_name)

train_function = textgen.train_from_file if train_cfg_linedocs['line_delimited'] else textgen.train_from_largetext_file

train_function(
    file_path=file_name,
    new_model=True,
    num_epochs=train_cfg_linedocs['num_epochs'],
    gen_epochs=train_cfg_linedocs['gen_epochs'],
    batch_size=1024,
    train_size=train_cfg_linedocs['train_size'],
    dropout=train_cfg_linedocs['dropout'],
    validation=train_cfg_linedocs['validation'],
    is_csv=train_cfg_linedocs['is_csv'],
    rnn_layers=model_cfg_linedocs['rnn_layers'],
    rnn_size=model_cfg_linedocs['rnn_size'],
    rnn_bidirectional=model_cfg_linedocs['rnn_bidirectional'],
    max_length=model_cfg_linedocs['max_length'],
    dim_embeddings=100,
    word_level=model_cfg_linedocs['word_level'])

Training new model w/ 4-layer, 128-cell LSTMs
Training on 1,296,679 word sequences.

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
####################
Temperature: 0.2
####################
sea is a tough old man



 the

 the old woman

 the red - faced
the red - faced hemlock ,
and the blue - apple trees
and the sea - tops ;
and the sea - star clusters fall ' d ;
and the dark winds were as the sea - mist
and the sea - light ,
and the wind that shines on the wind .
and the wind is a song of the night ,
and the stars are still in the sky
and the sea - light
that is a fire that is not a sight
of a light and a groan .



 the

 i have been
in the night
i am the present
to the last night .



 the house of the

 the river is a part of the sky ,
and the of the world ,
the of the world ,
the of the world , and the of the world .



 the

 the old familiar faces
are the lands
of the dead ,
the of their birth
is not the of the less .
the very best of the world is still .



 the

 i 

You can download a large amount of generated text from your model with the cell below! Rerun the cell as many times as you want for even more text!

In [None]:
# this temperature schedule cycles between 1 very unexpected token, 1 unexpected token, 2 expected tokens, repeat.
# changing the temperature schedule can result in wildly different output!
temperature = [1.0, 0.5, 0.2, 0.2]   
prefix = None   # if you want each generated text to start with a given seed text

if train_cfg_linedocs['line_delimited']:
  n = 1000
  max_gen_length = 60 if model_cfg_linedocs['word_level'] else 300
else:
  n = 1
  max_gen_length = 2000 if model_cfg_linedocs['word_level'] else 10000
  
timestring = datetime.now().strftime('%Y%m%d_%H%M%S')
gen_file = f'{model_name}_gentext_{timestring}.txt'

textgen.generate_to_file(gen_file,
                         temperature=temperature,
                         prefix=prefix,
                         n=n,
                         max_gen_length=max_gen_length)
files.download(gen_file)

You can download the weights and configuration files in the cell below, allowing you to recreate the model on your own computer!

In [None]:
files.download(f'{model_name}_weights.hdf5')
files.download(f'{model_name}_vocab.json')
files.download(f'{model_name}_config.json')

## Etcetera

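To recreate the model on your own computer, install textgenrnn and TensorFlow, then create a Python script like the one below. Replace the file names with those of the files you downloaded (for this notebook, `pf_char_gen_lined_*` or `pf_word_gen_lined_*`):

```
from textgenrnn import textgenrnn

# Load the trained model from the three files downloaded from Colab.
textgen = textgenrnn(weights_path='pf_word_gen_lined_weights.hdf5',
                     vocab_path='pf_word_gen_lined_vocab.json',
                     config_path='pf_word_gen_lined_config.json')

# Print samples to the console, then write generated text to a file.
textgen.generate_samples(max_gen_length=1000)
textgen.generate_to_file('textgenrnn_texts.txt', max_gen_length=1000)
```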

Have fun with your new model! :)

Note:
If the model fails to load on a local machine because of a model size mismatch (common with weights larger than 30 MB), the cause is a file export bug in Colaboratory. To work around it, save the weights to Google Drive and download them from there.
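
A minimal sketch of that Drive workaround, assuming the standard `google.colab.drive.mount` helper is available in the Colab runtime. `copy_model_files` is a hypothetical helper, and the destination folder shown in the comments is only an example path; adjust it to your own Drive layout.

```python
import shutil
from pathlib import Path


def copy_model_files(model_name, dest_dir, src_dir='.'):
    """Copy the weights, vocab, and config files for model_name into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for suffix in ('weights.hdf5', 'vocab.json', 'config.json'):
        src = Path(src_dir) / f'{model_name}_{suffix}'
        shutil.copy(src, dest / src.name)
        copied.append(src.name)
    return copied


# In Colab, mount Drive first, then copy (the paths are assumptions):
# from google.colab import drive
# drive.mount('/content/drive')
# copy_model_files('pf_word_gen_lined', '/content/drive/MyDrive/pf_generator')
```

Once the files are in Drive, they can be downloaded from the Drive web interface instead of through `files.download`.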