# Training an LSTM to generate tweets

We're going to train an LSTM to generate 'tweets' (160-character snippets), using a training dataset of nearly a million tweets.

We've gathered the tweet data using the Twitter API to suck in all English language tweets that contain exactly one emoji.

In this notebook, we're going to forget about emojis, and just focus on training a model to generate text in the style of Twitter.

In [1]:
import numpy as np
import pandas as pd
import data_load_utils as util
from math import ceil

from importlib import reload
util = reload (util)

# for cpu and memory profiling
%load_ext line_profiler
%load_ext memory_profiler




In [2]:
%memit tweets = util.filter_tweets_min_count(util.read_tweet_data('data/emojis_homemade.csv'), min_count=1000)

%memit tweets['text'] = util.filter_text_for_handles(tweets['text'])

  returned = f(*args, **kw)


peak memory: 239.46 MiB, increment: 152.75 MiB
peak memory: 277.55 MiB, increment: 48.16 MiB


Just reading the tweets from a CSV file and holding them in memory as a pandas DataFrame brings peak memory to nearly 280 MiB, which isn't awful. To scale this up, the next thing to try will be storing the data on disk as an HDF5 file and reading it in one batch at a time.

Some tweet examples:

In [3]:
tweets.iloc[0,:]

text     RT [VID] 181023 - Foi adicionada a letra D no ...
emoji                                                    ©
Name: 0, dtype: object

In [4]:
tweets.iloc[1]

text     RT 181023 Kris Wu Studio update (3/3)Legendary...
emoji                                                    💫
Name: 1, dtype: object

In [5]:
tweets.shape

(461544, 2)

Whoa, that's a dataset of nearly half a million tweets, looking only at emojis that have at least 1,000 examples.

The naive way of loading the data was just to split each tweet into 'windows' of a certain number of characters, and just one-hot encode the whole DataFrame. Unfortunately it turns out if we use that approach we probably can't fit a very big dataset in the computer's RAM (and going out and buying more RAM, or using a bigger computer in the cloud, will only allow us to scale up so far).
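A rough back-of-the-envelope estimate shows why. The sketch below assumes one byte per one-hot entry and a hypothetical vocabulary of 100 characters (the real size is `len(chars_univ)`, which isn't shown here). The 64-character windows and 32 windows per tweet are the values used later in this notebook, and `one_hot_bytes` is an illustrative helper, not part of `data_load_utils`.

```python
def one_hot_bytes(n_tweets, windows_per_tweet, window_size, vocab_size,
                  bytes_per_value=1):
    """Bytes needed to hold every one-hot encoded window in memory at once."""
    return n_tweets * windows_per_tweet * window_size * vocab_size * bytes_per_value

# 461,544 tweets, 32 windows of 64 characters each,
# and an ASSUMED vocabulary of 100 characters
total = one_hot_bytes(461544, 32, 64, vocab_size=100)
print('%.1f GiB' % (total / 2**30))
```

Under those assumptions the inputs alone come to roughly 88 GiB, before counting the targets, so encoding everything up front is not realistic on an ordinary machine.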

So instead, we're going to use a more sophisticated approach and code up a generator function that only converts data one batch at a time.

Since we're dealing in batches, we're going to use a slightly different `WINDOW_SIZE` of 64, because with a `step` of 3 that conveniently makes (160 - 64) / 3 = 32 training examples for each tweet. Since 32 is a power of two, we can make batch sizes that are also powers of two, which will fit nicely on the GPU of whatever computational behemoth we train this thing on.
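To make the windowing concrete, here is a minimal sketch of a batch generator. It is not the notebook's `util.convert_tweet_to_xy_generator`, whose implementation isn't shown. `tweet_windows` and `window_batches` are hypothetical names, padding short tweets with spaces is an assumption, and the one-hot encoding step is left out to keep it short.

```python
def tweet_windows(text, length=160, window_size=64, step=3, pad=' '):
    """Split one tweet into (window, next_char) training pairs."""
    # pad or truncate so every tweet yields the same number of windows
    # (padding with spaces is an assumption made for this sketch)
    padded = text[:length].ljust(length, pad)
    return [(padded[i:i + window_size], padded[i + window_size])
            for i in range(0, length - window_size, step)]


def window_batches(texts, tweets_per_batch=64, **kwargs):
    """Yield one batch of (window, next_char) pairs at a time, forever."""
    while True:  # Keras-style generators are expected to loop indefinitely
        for start in range(0, len(texts), tweets_per_batch):
            batch = []
            for text in texts[start:start + tweets_per_batch]:
                batch.extend(tweet_windows(text, **kwargs))
            yield batch
```

With the default settings each tweet produces 32 pairs, so a batch of 64 tweets holds 2048 training examples, and only one batch needs to exist in memory at any moment.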

In [6]:
MAX_TWEET_LENGTH = 160
WINDOW_SIZE = 64
STEP = 3

samples_per_tweet = int(ceil((MAX_TWEET_LENGTH - WINDOW_SIZE) / STEP)) # 32
tweets_per_batch = 64
samples_per_batch = samples_per_tweet * tweets_per_batch # 2048

chars_univ, chars_univ_idx = util.get_universal_chars_list()

In [7]:
TRAIN_SIZE = 2**15 # 32,768; try 2**17 = 131,072 for production
DEV_SIZE = 2**13   # 8,192 (same for production)

# integer division, since the number of steps per epoch must be a whole number
n_train_batches = TRAIN_SIZE // tweets_per_batch # 512
n_dev_batches = DEV_SIZE // tweets_per_batch     # 128

%memit tweets_train = tweets.iloc[0:TRAIN_SIZE] # first 32,768 tweets
%memit tweets_dev = tweets.iloc[TRAIN_SIZE:TRAIN_SIZE+DEV_SIZE] # next 8,192 tweets

peak memory: 277.82 MiB, increment: 0.03 MiB
peak memory: 277.83 MiB, increment: 0.00 MiB


In [10]:
# 64 tweets x 32 samples per tweet = 2048 training examples per batch
%memit train_generator = util.convert_tweet_to_xy_generator(tweets_train, length=MAX_TWEET_LENGTH, \
                                                            window_size=WINDOW_SIZE,step=STEP, \
                                                            batch_size=tweets_per_batch)

%memit dev_generator = util.convert_tweet_to_xy_generator(tweets_dev, length=MAX_TWEET_LENGTH, \
                                                          window_size=WINDOW_SIZE,step=STEP, \
                                                          batch_size=tweets_per_batch)

peak memory: 278.01 MiB, increment: 0.01 MiB
peak memory: 278.01 MiB, increment: 0.00 MiB


Now we can feed those generators directly into the model using `fit_generator()`.

# Building a network
Initially, let's try generating tweets by training a network on just the tweet text. Once we have an idea of how well a network can generate tweets (remember, character by character), we'll compare it to a network that learns to predict the next character jointly from the preceding text and an overall emoji (remember, every tweet in this dataset contains exactly one emoji).

## Simple network - a single LSTM into a Dense softmax classifier.

In [11]:
import keras
from keras import layers
from keras import callbacks

# a single LSTM layer feeding a softmax over the character vocabulary
model = keras.models.Sequential()
model.add(layers.LSTM(128, input_shape=(WINDOW_SIZE, len(chars_univ))))
model.add(layers.Dense(len(chars_univ), activation='softmax'))

# targets are one-hot encoded, so use categorical cross-entropy as the loss
optimizer = keras.optimizers.RMSprop(lr=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Using TensorFlow backend.


## Training the model and sampling from it using a standard character-by-character method
1. Predict a probability distribution over the next character
2. Reweight the distribution using a temperature parameter
3. Sample the next character at random from the reweighted distribution
4. Append the new character to the end of the text generated so far

In [12]:
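def sample(preds, temperature=1.0):
    # reweight the predicted distribution by the temperature, then renormalize
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    # draw one sample and return the index of the chosen character
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)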
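To get a feel for what the temperature does, the small sketch below mirrors the reweighting step inside `sample()` using only the standard library. `reweight` and the three-value distribution are illustrative assumptions, not part of the notebook.

```python
import math

def reweight(probs, temperature=1.0):
    """Rescale a probability distribution by a temperature and renormalize."""
    scaled = [math.log(p) / temperature for p in probs]
    exps = [math.exp(s) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = [0.6, 0.3, 0.1]  # made-up next-character probabilities
for t in [0.3, 1.0, 2.0]:
    print(t, [round(p, 3) for p in reweight(probs, t)])
```

Temperatures below 1 sharpen the distribution, so the most likely character dominates and the text is more conservative. Temperatures above 1 flatten it, which gives more surprising (and more error-prone) text. A temperature of exactly 1 leaves the distribution unchanged.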

## Train the model, generate text
Sample with a range of temperatures after every epoch.

In [13]:
tweets.iloc[0]['text'][0:10]

'RT [VID] 1'

In [15]:
import random
import sys

n_seed_chars = 15 # number of characters to use as a seed for text generation

keras.backend.set_value(model.optimizer.lr, 0.001) # to reset the learning rate if running additional training

checkpoint = callbacks.ModelCheckpoint(filepath='tweet_gen_model-{loss:.3f}.hdf5', 
                                       verbose=1, 
                                       save_best_only=True)

# train for 60 epochs
for epoch in range (1, 61):
    print ('epoch', epoch)

    # fit the model for one epoch
    model.fit_generator (train_generator,
                         steps_per_epoch=n_train_batches, # each batch is 64 x 32 = 2048 samples
                         epochs=1,
                         validation_data=dev_generator, 
                         validation_steps=n_dev_batches,
                         callbacks=[checkpoint],
                         verbose=1,
                         use_multiprocessing=True, # run the generator in a separate process
                         )

    # select a text seed at random (randrange excludes the upper bound, so the index is always valid)
    seed_tweet = tweets.iloc[random.randrange(len(tweets))]
    seed_text = seed_tweet['text'][0:n_seed_chars]
    print ('--- Generating with seed: "' + seed_text + '"')

    # try a range of sampling temperatures
    for temperature in [0.3, 0.5, 0.8, 1.0]:
        print ('--------- temperature:', temperature)
        generated_text = seed_text # restart from the same seed for each temperature
        sys.stdout.write(generated_text)

        for i in range (MAX_TWEET_LENGTH - n_seed_chars):
            # one-hot encode the characters generated so far
            sampled = np.zeros((1, WINDOW_SIZE, len(chars_univ)))
            for t, char in enumerate (generated_text):
                sampled[0, t, chars_univ_idx[char]] = 1

            # sample the next character
            preds = model.predict(sampled, verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars_univ[next_index]

            # append the new character, keeping at most the last WINDOW_SIZE characters
            generated_text += next_char
            generated_text = generated_text[-WINDOW_SIZE:]

            sys.stdout.write(next_char)

        print ("\n")    

epoch 1
Epoch 1/1
  6/512 [..............................] - ETA: 31:56 - loss: 2.7788

Process ForkPoolWorker-35:
Traceback (most recent call last):
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/pool.py", line 108, in worker
    task = get()
Process ForkPoolWorker-36:
Traceback (most recent call last):
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/queues.py", line 335, in get
    res = self._reader.recv_bytes()
  File "/home/nickdbn/anaconda3/envs/deeplearning/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/home/nickdbn/anaconda

KeyboardInterrupt: 

In [None]:
chars_univ_idx