# Part 2 - Training the model to generate text jointly from emojis and text

Following the success in getting the model to generate text after training it on Twitter data, we're going to modify the model to generate text based on two inputs: the emoji data and the text data.

While the text will still be fed character-by-character into the LSTM, we're going to add an additional Dense input, which is just an `n_emoji`-dimensional vector, where `n_emoji` is the number of possible emojis that we have sufficient data for (i.e., more than 1000 training examples). The emojis will be one-hot encoded.
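As a minimal sketch of what one-hot encoding an emoji means here: the function name `one_hot_emoji` and the three-emoji vocabulary are hypothetical, chosen only for illustration, and are not taken from `data_load_utils`.

```python
import numpy as np

def one_hot_emoji(emoji, emoji_to_index):
    """Return a 1 x n_emoji vector with a single 1 at the emoji's index."""
    vec = np.zeros((1, len(emoji_to_index)), dtype=np.float32)
    vec[0, emoji_to_index[emoji]] = 1.0
    return vec

# hypothetical three-emoji vocabulary, for illustration only
emoji_to_index = {"😂": 0, "❤": 1, "🔥": 2}
encoded = one_hot_emoji("❤", emoji_to_index)
print(encoded.shape, encoded)
```

In the real data the vector would have one slot per emoji that passes the 1000-example threshold, rather than three.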

First, we need to modify the `data_load_utils` module so that `convert_to_xy` gives us both one-hot encodings (text and emoji).

In [1]:
import numpy as np
import pandas as pd
import data_load_utils as util
from math import ceil

from importlib import reload
util = reload (util)

# for cpu and memory profiling
#%load_ext line_profiler
#%load_ext memory_profiler




Let's filter the data down to only emojis with more than 1000 training examples, and clean up the text by filtering out Twitter handles, as we did before.

In [2]:
tweets = util.filter_tweets_min_count(util.read_tweet_data('data/emojis_homemade.csv'), min_count=1000)

tweets['text'] = util.filter_text_for_handles(tweets['text'])



In [3]:
tweets.shape

(461544, 2)
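The filtering step above can be sketched roughly as follows. This is not the code in `data_load_utils`: the helper names `filter_min_count` and `strip_handles`, the `@\w+` handle pattern, and the strict "more than `min_count`" comparison are all assumptions made for illustration.

```python
import pandas as pd

def filter_min_count(df, min_count, label_col="emoji"):
    """Keep only rows whose emoji appears more than min_count times (assumed rule)."""
    counts = df[label_col].value_counts()
    keep = counts[counts > min_count].index
    return df[df[label_col].isin(keep)]

def strip_handles(text_series):
    """Remove @handles (assumed pattern) and trim surrounding whitespace."""
    return text_series.str.replace(r"@\w+", "", regex=True).str.strip()

# tiny made-up frame: emoji "a" appears twice, emoji "b" once
toy = pd.DataFrame({"text": ["@bob hello", "hi @amy", "rare one"],
                    "emoji": ["a", "a", "b"]})
filtered = filter_min_count(toy, min_count=1)
cleaned = strip_handles(filtered["text"]).tolist()
print(cleaned)
```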

Next, let's get a list of all the emojis present in the `tweets['emoji']` data series.

In [23]:
MAX_TWEET_LENGTH = 160
WINDOW_SIZE = 64
STEP = 3

samples_per_tweet = int(ceil((MAX_TWEET_LENGTH - WINDOW_SIZE) / STEP)) # 32
tweets_per_batch = 2 # small for prototyping; 64 for full training
samples_per_batch = samples_per_tweet * tweets_per_batch # 64 here; 2048 with 64 tweets per batch

chars_univ, chars_univ_idx = util.get_universal_chars_list()

emoji_series = tweets['emoji']
emojis, emoji_idx = util.get_emojis_list(emoji_series)
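To see where the 32 samples per tweet come from, here is a minimal sketch of the sliding-window split. The function `sliding_windows` is hypothetical, and padding short tweets with spaces is an assumption; the actual padding used by `data_load_utils` may differ.

```python
from math import ceil

def sliding_windows(text, length=160, window_size=64, step=3, pad_char=" "):
    """Split a tweet into (window, next character) pairs."""
    # truncate or pad to a fixed length (space padding is an assumption)
    padded = text[:length].ljust(length, pad_char)
    windows, targets = [], []
    for start in range(0, length - window_size, step):
        windows.append(padded[start:start + window_size])  # model input
        targets.append(padded[start + window_size])         # character to predict
    return windows, targets

windows, targets = sliding_windows("RT this is an example tweet")
print(len(windows), ceil((160 - 64) / 3))
```

Each window is then one-hot encoded per character, which gives the `window_size x len(chars_univ)` input the LSTM expects.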

In [24]:
# for prototyping
TRAIN_SIZE = 2**12
DEV_SIZE = 2**10

#TRAIN_SIZE = 2**18 # 262,144; try 131,072 = 2**17 for production
#DEV_SIZE = 2**12   # 4,096; try 8,192 = 2**13 for production

# integer division, so the batch counts can be used directly as step counts
n_train_batches = TRAIN_SIZE // tweets_per_batch
n_dev_batches = DEV_SIZE // tweets_per_batch

tweets_train = tweets.iloc[0:TRAIN_SIZE] # 4096 = 2**12 while prototyping
tweets_dev = tweets.iloc[TRAIN_SIZE:TRAIN_SIZE+DEV_SIZE] # 1024 = 2**10 while prototyping

In [25]:
# 2 tweets x 32 samples per tweet = 64 training examples per batch
# (64 tweets per batch would give 2048)
train_generator = util.convert_tweet_to_xy_generator(tweets_train, length=MAX_TWEET_LENGTH,
                                                     window_size=WINDOW_SIZE, step=STEP,
                                                     batch_size=tweets_per_batch, emoji_set=emojis)

dev_generator = util.convert_tweet_to_xy_generator(tweets_dev, length=MAX_TWEET_LENGTH,
                                                   window_size=WINDOW_SIZE, step=STEP,
                                                   batch_size=tweets_per_batch, emoji_set=emojis)

# Building a network
Previously, we generated tweets by training a network on just the tweet text. Now that we have an idea of how well a network can generate tweets (remember, character by character), we'll compare it to a network that learns to generate tweets by predicting the next character jointly from the preceding text and an overall emoji. (Remember, every tweet in this dataset contains exactly one emoji.)

## Simple network - a single LSTM plus a Dense emoji branch, feeding a Dense softmax classifier

In [26]:
# Establish the shapes of the inputs/outputs
([x_text, x_emoji], y) = next(train_generator)
print ("x_text shape:", x_text.shape)
print ("x_emoji shape:", x_emoji.shape)
print ("y shape:", y.shape)

x_text shape: (64, 64, 93)
x_emoji shape: (64, 112)
y shape: (64, 93)


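With `tweets_per_batch = 2`, each batch holds 2 x 32 = 64 samples (it would be 2048 with 64 tweets per batch):

- x_text: 64 x 64 x 93 (samples_per_batch x window_size x len(chars_univ))
- x_emoji: 64 x 112 (samples_per_batch x len(emojis))
- y: 64 x 93 (samples_per_batch x len(chars_univ))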

In [27]:
import keras
from keras import layers, Input
from keras.models import Model
from keras import callbacks

In [28]:
text_input = Input(shape=(WINDOW_SIZE, len(chars_univ)), dtype='float32', name='text_input')
lstm = layers.LSTM(256, name='text_lstm')(text_input)

emoji_input = Input(shape=(len(emojis),),
                    dtype='float32',
                    name='emoji_input')
emoji_dense = layers.Dense (50, activation='relu')(emoji_input)

concatenate = layers.concatenate([lstm, emoji_dense], name='concatenate')

output = layers.Dense(len(chars_univ), name='output',
                      activation='softmax')(concatenate)

model = Model([text_input, emoji_input], output)
model.compile(optimizer = keras.optimizers.RMSprop(lr=0.001),
              loss='categorical_crossentropy',
              metrics=['acc'])


model.summary()


# can use this saved file for transfer learning
# model = keras.models.load_model("models/tweet_gen_model-0.776.hdf5") # 256 LSTM units, ~30 epochs training

# text-only alternative (a single LSTM into a softmax), kept for reference
#model = keras.models.Sequential()
#model.add(layers.LSTM(256, input_shape=(WINDOW_SIZE, len(chars_univ)))) # was 128 units
#model.add(layers.Dense(len(chars_univ), activation='softmax'))

# loss function: categorical_crossentropy, since the targets are one-hot encoded



__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
text_input (InputLayer)         (None, 64, 93)       0                                            
__________________________________________________________________________________________________
emoji_input (InputLayer)        (None, 112)          0                                            
__________________________________________________________________________________________________
text_lstm (LSTM)                (None, 256)          358400      text_input[0][0]                 
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 50)           5650        emoji_input[0][0]                
__________________________________________________________________________________________________
concatenat
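The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a Dense layer; the helper names are hypothetical. The summary above is cut off before the output layer, so the last figure is computed from the architecture rather than read from the printout.

```python
def lstm_params(input_dim, units):
    # four gates, each with input weights, recurrent weights and a bias vector
    return 4 * units * (input_dim + units + 1)

def dense_params(input_dim, units):
    # one weight per input/unit pair, plus one bias per unit
    return input_dim * units + units

n_chars, n_emoji = 93, 112

text_lstm = lstm_params(n_chars, 256)         # matches text_lstm: 358400
emoji_dense = dense_params(n_emoji, 50)       # matches the emoji Dense layer: 5650
output_dense = dense_params(256 + 50, n_chars)  # softmax layer on the concatenation

print(text_lstm, emoji_dense, output_dense)
```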

## Training the model and sampling from it using a standard character-by-character method
1. Draw a probability distribution for the next character
2. Reweight the distribution using a temperature parameter
3. Sample the next character at random using the reweighted distribution
4. Append the new character to the end of the text generated so far
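To make step 2 concrete, the following sketch shows how the temperature changes a small, made-up distribution. The helper `reweight` is hypothetical; it applies the usual log, divide, exponentiate, renormalize recipe, with the maximum subtracted before exponentiating for numerical stability.

```python
import numpy as np

def reweight(probs, temperature):
    """Rescale a probability distribution by a sampling temperature."""
    logits = np.log(np.asarray(probs, dtype="float64")) / temperature
    exp = np.exp(logits - logits.max())  # subtracting the max does not change the result
    return exp / exp.sum()

probs = [0.5, 0.3, 0.2]  # made-up next-character probabilities
for t in (0.3, 1.0, 2.0):
    print(t, np.round(reweight(probs, t), 3))
```

A temperature below 1 concentrates probability on the most likely character, a temperature of 1 leaves the distribution unchanged, and a temperature above 1 flattens it toward uniform.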

In [29]:
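def sample(preds, temperature=1.0):
    # reweight the predicted distribution by the temperature, then draw one index from it
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature    # lower temperature sharpens the distribution
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)  # renormalize so the probabilities sum to 1
    probas = np.random.multinomial(1, preds, 1)  # a single draw, returned as a one-hot row
    return np.argmax(probas)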

## Train the model, generate text
Generate text with a range of temperatures after every epoch.

In [30]:
tweets.iloc[0]['text'][0:10]

'RT [VID] 1'

In [None]:
import random
import sys

# temp! for troubleshooting only
n_train_batches = 64
n_dev_batches = 16

n_seed_chars = 64 # number of characters to use as a seed for text generation

# reset the learning rate if running additional training
# (set_value updates the variable immediately, whichever backend mode is in use)
keras.backend.set_value(model.optimizer.lr, 0.001)

checkpoint = callbacks.ModelCheckpoint(filepath='tweet_gen_model-{loss:.3f}.hdf5', 
                                       verbose=1, 
                                       save_best_only=True)

# train for 60 epochs
for epoch in range (1, 61):
    print ('epoch', epoch)

    # fit the model for one epoch
    model.fit_generator (train_generator,
                         steps_per_epoch=n_train_batches, # number of generator batches per epoch
                         epochs=1,
                         validation_data=dev_generator, 
                         validation_steps=n_dev_batches,
                         callbacks=[checkpoint],
                         verbose=1,
                         use_multiprocessing=True, # run the generator in a separate process
                         )

    # select a text seed at random (randrange excludes the upper bound, so the index is always valid)
    seed_tweet = tweets.iloc[random.randrange(len(tweets))]
    seed_text = util.pad_text(seed_tweet['text'][0:n_seed_chars], n_seed_chars)
    generated_text = seed_text
    #one-hot encode the emoji
    emoji_one_hot = util.get_emoji_bool_array(seed_tweet['emoji'], emoji_idx)
    print ('--- Generating with seed: "' + generated_text + '" | ' + seed_tweet['emoji'])

    
    # try a range of sampling temperatures
    for temperature in [0.3, 0.5, 0.8, 1.0]:
        generated_text = seed_text
        print ('--------- temperature:', temperature)
        sys.stdout.write(generated_text)

        for i in range (MAX_TWEET_LENGTH - n_seed_chars):
            # one-hot encode the characters generated so far
            sampled = np.zeros((1, WINDOW_SIZE, len(chars_univ)))
            for t, char in enumerate (generated_text):
                sampled[0, t, chars_univ_idx[char]] = 1

            # sample the next character
            preds = model.predict([sampled, emoji_one_hot], verbose=0)[0]
            next_index = sample(preds, temperature)
            next_char = chars_univ[next_index]

            generated_text += next_char
            generated_text = generated_text[1:]

            sys.stdout.write(next_char)

        print ("\n")    

epoch 1
Epoch 1/1
 1/64 [..............................] - ETA: 46s - loss: 2.1830 - acc: 0.4531
 2/64 [..............................] - ETA: 38s - loss: 1.5850 - acc: 0.6016
 3/64 [>.............................] - ETA: 36s - loss: 1.9209 - acc: 0.5260
 4/64 [>.............................] - ETA: 34s - loss: 2.1984 - acc: 0.4453
 5/64 [=>............................] - ETA: 32s - loss: 2.2410 - acc: 0.4625
 6/64 [=>............................] - ETA: 31s - loss: 2.3663 - acc: 0.4375
 7/64 [==>...........................] - ETA: 31s - loss: 2.1888 - acc: 0.4754
 8/64 [==>...........................] - ETA: 29s - loss: 2.1123 - acc: 0.4922

In [13]:
char_univ_idx

NameError: name 'char_univ_idx' is not defined

The name is misspelled in this cell: the index defined earlier is `chars_univ_idx`.