# Generating Conversation

## Setup

We're going to use the output of the [fb-chat-rnn message parser](https://github.com/iconix/fb-chat-rnn) as our data.

Message parser format:
   >&lt;Participant 1&gt;Hello&lt;/Participant 1&gt;
   >
   >&lt;Participant 2&gt;Hi&lt;/Participant 2&gt;
   >
   >&lt;Participant 1&gt;Yup.&lt;/Participant 1&gt;
   >
   >&lt;Participant 1&gt;Cool&lt;/Participant 1&gt;
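
As a rough illustration of this format (not the parser's actual code), tagged messages could be read back into `(speaker, text)` pairs with a regular expression. The pattern and the `parse_messages` helper are assumptions made for this sketch, which is written to run under both Python 2 and 3.

```python
import re

# Hypothetical pattern: an opening tag, the message body, and a closing tag
# that repeats the same speaker name (via the \1 backreference).
MESSAGE_RE = re.compile(r"<([^<>/]+)>(.*?)</\1>", re.DOTALL)

def parse_messages(text):
    """Return a list of (speaker, message) tuples found in tagged text."""
    return [(m.group(1), m.group(2)) for m in MESSAGE_RE.finditer(text)]

sample = (
    "<Participant 1>Hello</Participant 1>\n"
    "<Participant 2>Hi</Participant 2>\n"
    "<Participant 1>Yup.</Participant 1>\n"
    "<Participant 1>Cool</Participant 1>"
)
messages = parse_messages(sample)
```

Here `messages` holds four pairs, starting with `('Participant 1', 'Hello')`.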

In [1]:
import numpy as np

In [2]:
import os

BASE_DIR = os.getcwd()
DATA_DIR = BASE_DIR + '/data/'

In [3]:
model_path = DATA_DIR + 'models/'
if not os.path.exists(model_path): os.mkdir(model_path)

In [4]:
data = DATA_DIR + 'binary_tagged_dialogue.txt' # preprocessed

with open(data, 'r') as f:
    text = f.read()
print('corpus length:', len(text))

('corpus length:', 4815849)


Note that the Shakespeare corpus from [lesson6-hmwk.ipynb](https://github.com/iconix/fast.ai/blob/master/nbs/lesson6-hmwk.ipynb) had a corpus length of 5,291,227!

In [5]:
chars = sorted(list(set(text)))
vocab_size = len(chars)+1
print('total chars:', vocab_size)

('total chars:', 84)


Sometimes it's useful to reserve a zero value in the vocabulary, e.g. for padding, so we insert a null character at index 0.

In [6]:
chars.insert(0, "\0")

In [7]:
''.join(chars)

'\x00\n\r !"&\'(),-./123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]`abcdefghijklmnopqrstuvwxyz}'

Map characters to indices and vice versa.

In [8]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

In [9]:
print(char_indices)

{'\x00': 0, '\n': 1, '\r': 2, '!': 4, ' ': 3, '"': 5, "'": 7, '&': 6, ')': 9, '(': 8, '-': 11, ',': 10, '/': 13, '.': 12, '1': 14, '3': 16, '2': 15, '5': 18, '4': 17, '7': 20, '6': 19, '9': 22, '8': 21, ';': 24, ':': 23, '<': 25, '?': 27, '>': 26, 'A': 28, 'C': 30, 'B': 29, 'E': 32, 'D': 31, 'G': 34, 'F': 33, 'I': 36, 'H': 35, 'K': 38, 'J': 37, 'M': 40, 'L': 39, 'O': 42, 'N': 41, 'Q': 44, 'P': 43, 'S': 46, 'R': 45, 'U': 48, 'T': 47, 'W': 50, 'V': 49, 'Y': 52, 'X': 51, '[': 54, 'Z': 53, ']': 55, 'a': 57, '`': 56, 'c': 59, 'b': 58, 'e': 61, 'd': 60, 'g': 63, 'f': 62, 'i': 65, 'h': 64, 'k': 67, 'j': 66, 'm': 69, 'l': 68, 'o': 71, 'n': 70, 'q': 73, 'p': 72, 's': 75, 'r': 74, 'u': 77, 't': 76, 'w': 79, 'v': 78, 'y': 81, 'x': 80, 'z': 82, '}': 83}


*idx* holds the conversation converted to character indices (using the *char_indices* mapping above).

In [10]:
idx = [char_indices[c] for c in text]

In [11]:
print(idx[:70])

[25, 29, 74, 71, 59, 67, 26, 36, 70, 3, 60, 61, 68, 65, 78, 61, 74, 65, 70, 63, 3, 69, 81, 3, 75, 71, 70, 3, 62, 74, 71, 69, 3, 69, 61, 10, 3, 36, 3, 58, 77, 74, 81, 3, 57, 3, 75, 61, 59, 71, 70, 60, 3, 64, 77, 75, 58, 57, 70, 60, 12, 2, 25, 13, 29, 74, 71, 59, 67, 26]


In [12]:
''.join(indices_char[i] for i in idx[:70])

'<Brock>In delivering my son from me, I bury a second husband.\r</Brock>'

### GLOBALS

In [13]:
from keras.layers import Input, Embedding, LSTM, merge, SimpleRNN, TimeDistributed
from keras.layers.core import Dense, Dropout, Flatten
from keras.models import Model, Sequential
from keras.optimizers import Adam
from keras.layers.normalization import BatchNormalization

Using Theano backend.
Using gpu device 0: Tesla K80 (CNMeM is disabled, cuDNN 5103)


In [14]:
n_fac = 42 # number of latent factors (width of each character's embedding vector)
n_hidden = 256 # hyperparameter: size of the LSTM hidden state

## Stateful model with keras

In [15]:
bs = 64 # batch size (fixed, because the model is stateful)
nc = 40 # number of characters per input sequence

In [16]:
c_in_dat = [[idx[i+n] for i in xrange(0, len(idx)-1-nc, nc)]
           for n in range(nc)]
c_out_dat = [[idx[i+n] for i in xrange(1, len(idx)-nc, nc)]
            for n in range(nc)]

In [17]:
xs = [np.stack(c) for c in c_in_dat]
xs = np.concatenate([[np.array(o)] for o in xs])

In [18]:
ys = [np.stack(c) for c in c_out_dat]
ys = np.concatenate([[np.array(o)] for o in ys])

In [19]:
xs.shape, ys.shape

((40, 120396), (40, 120396))

In [20]:
x_rnn = np.stack(np.squeeze(xs), axis=1)
y_rnn = np.atleast_3d(np.stack(ys, axis=1))

In [21]:
x_rnn.shape, y_rnn.shape

((120396, 40), (120396, 40, 1))
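
The cells above cut the index list into consecutive, non-overlapping windows of `nc` characters, with each target window shifted one character ahead of its input. The sketch below reproduces that construction on a toy sequence and checks it against an equivalent slice-and-reshape. The function names are made up for this illustration, and it uses `range` so that it runs under both Python 2 and 3.

```python
import numpy as np

def make_chunks_loop(idx, nc):
    """Same list-comprehension construction as the notebook cells."""
    c_in = [[idx[i + n] for i in range(0, len(idx) - 1 - nc, nc)]
            for n in range(nc)]
    c_out = [[idx[i + n] for i in range(1, len(idx) - nc, nc)]
             for n in range(nc)]
    x = np.array(c_in).T                     # (n_seq, nc)
    y = np.atleast_3d(np.array(c_out).T)     # (n_seq, nc, 1)
    return x, y

def make_chunks_reshape(idx, nc):
    """Equivalent result via slicing and reshaping (an assumed alternative)."""
    idx = np.asarray(idx)
    n_seq = len(range(0, len(idx) - 1 - nc, nc))
    x = idx[:n_seq * nc].reshape(n_seq, nc)
    y = idx[1:n_seq * nc + 1].reshape(n_seq, nc, 1)  # shifted by one character
    return x, y

toy_idx = list(range(23))
x_loop, y_loop = make_chunks_loop(toy_idx, nc=4)
x_fast, y_fast = make_chunks_reshape(toy_idx, nc=4)
```

On the toy input, both versions give `x` of shape `(5, 4)` whose first row is `[0, 1, 2, 3]`, with the matching targets `[1, 2, 3, 4]`.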

In [22]:
def make_model(batch_size_override=None):
    if batch_size_override is None:
        batch_size_override = bs
    model = Sequential([
        Embedding(vocab_size, n_fac, input_length=nc, batch_input_shape=(batch_size_override,nc)),
        BatchNormalization(),
        LSTM(n_hidden, input_dim=n_fac, return_sequences=True, stateful=True, dropout_U=0.2, dropout_W=0.2,
             consume_less='gpu'),
        LSTM(n_hidden, input_dim=n_fac, return_sequences=True, stateful=True, dropout_U=0.2, dropout_W=0.2,
             consume_less='gpu'),
        TimeDistributed(Dense(n_hidden, activation='relu')),
        Dropout(0.2),
        TimeDistributed(Dense(vocab_size, activation='softmax'))
    ])
    model.compile(loss="sparse_categorical_crossentropy", optimizer=Adam())
    return model

In [23]:
def print_example(m, seed, num_lines=5):
    pred_m = make_model(batch_size_override=1) # batch_size_override is the important bit
    for layer, pred_layer in zip(m.layers, pred_m.layers):
        pred_layer.set_weights(layer.get_weights())
    
    output = seed
    i = 0
    while i <= (num_lines*2):
        x = np.array([char_indices[c] for c in output[-nc:]])[np.newaxis,:]
        preds = pred_m.predict(x, verbose=0, batch_size=1)[0][-1]
        preds = preds / np.sum(preds)
        new_char = np.random.choice(chars, p=preds)
        output += new_char
        if new_char == '>':
            i += 1
    print(output)
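
The sampling step inside `print_example` can be looked at in isolation. The sketch below takes a probability vector (as a softmax layer would produce), renormalizes it, and draws one character from it. The renormalization presumably guards against `float32` rounding that leaves the probabilities summing to slightly more or less than 1; that reading is an assumption, as are the `sample_char` helper and the toy vocabulary. The stopping rule in `print_example` counts `>` characters because each complete message contributes two of them, one per tag.

```python
import numpy as np

def sample_char(preds, vocab, rng=np.random):
    """Draw one character from vocab according to the probabilities in preds."""
    preds = np.asarray(preds, dtype=np.float64)
    preds = preds / preds.sum()          # make the probabilities sum to 1
    return rng.choice(vocab, p=preds)

toy_vocab = list("abcd")                 # stand-in for the notebook's chars
rng = np.random.RandomState(0)           # seeded only to make the demo repeatable

# A one-hot distribution always yields the same character.
certain = sample_char(np.array([0, 0, 1, 0], dtype=np.float32), toy_vocab, rng)

# A spread-out distribution yields different characters from draw to draw.
draws = [sample_char(np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32),
                     toy_vocab, rng) for _ in range(20)]
```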

**TODO:** print num_**valid**_lines
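
One possible starting point for this TODO, sketched under the assumption that a "valid" line is an opening tag, a body containing no `<`, and a closing tag naming the same speaker. The `count_valid_lines` helper is hypothetical and is not wired into `print_example`.

```python
import re

# Assumed definition of a valid line: <Name>body</Name>, where the body has
# no '<' so that a match can never swallow a stray tag.
VALID_LINE_RE = re.compile(r"<([^<>/]+)>([^<]*)</\1>")

def count_valid_lines(generated):
    """Count well-formed tagged messages in generated text."""
    return len(VALID_LINE_RE.findall(generated))

generated = (
    "<Brock>Yet, glid you are out\n</Brock>\n"
    "</Ash>\n"
    "</Ash>\n"
    "<Ash>Here, sir,\n</Ash>\n"
    "<Brock>"
)
num_valid = count_valid_lines(generated)
```

In this sample the two stray `</Ash>` tags and the unfinished `<Brock>` are ignored, leaving two valid lines.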

In [24]:
def run_epochs(m, num_epochs=12, seed='<Brock>Hark! The herald angels sing, Glo'):
    for i in range(num_epochs):
        print 'Cycle', i+1
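        m.reset_states() # clear the LSTM state carried over from the previous pass
        m.fit(x_rnn[:mx], y_rnn[:mx], batch_size=bs, nb_epoch=1, shuffle=False) # mx is set below, before the first call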
        print_example(m, seed)
        print

In [25]:
model = Sequential([
        Embedding(vocab_size, n_fac, input_length=nc, batch_input_shape=(bs,nc)),
        BatchNormalization(),
        LSTM(n_hidden, input_dim=n_fac, return_sequences=True, stateful=True, dropout_U=0.2, dropout_W=0.2,
             consume_less='gpu'),
        LSTM(n_hidden, input_dim=n_fac, return_sequences=True, stateful=True, dropout_U=0.2, dropout_W=0.2,
             consume_less='gpu'),
        TimeDistributed(Dense(n_hidden, activation='relu')),
        Dropout(0.2),
        TimeDistributed(Dense(vocab_size, activation='softmax'))
    ])

In [26]:
model.compile(loss='sparse_categorical_crossentropy', optimizer=Adam())
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (64, 40, 42)          3528        embedding_input_1[0][0]          
____________________________________________________________________________________________________
batchnormalization_1 (BatchNormal(64, 40, 42)          84          embedding_1[0][0]                
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (64, 40, 256)         306176      batchnormalization_1[0][0]       
____________________________________________________________________________________________________
lstm_2 (LSTM)                    (64, 40, 256)         525312      lstm_1[0][0]                     
___________________________________________________________________________________________

Since we're using a fixed batch shape, we have to ensure that the number of input and output samples is an exact multiple of the batch size.

In [27]:
mx = len(x_rnn)//bs*bs
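
To make the arithmetic concrete: with the 120,396 sequences and batch size of 64 used here, integer division keeps 120,384 samples and drops the last 12. The sketch below shows the same trimming on toy arrays. The `trim_to_batch_multiple` helper is a name invented for this illustration.

```python
import numpy as np

def trim_to_batch_multiple(x, y, batch_size):
    """Drop trailing samples so len(x) is an exact multiple of batch_size."""
    usable = len(x) // batch_size * batch_size
    return x[:usable], y[:usable]

# The notebook's numbers: 120396 sequences, batch size 64.
n_samples, batch_size = 120396, 64
usable = n_samples // batch_size * batch_size   # 120384
dropped = n_samples - usable                    # 12

# Toy arrays: 10 samples with a batch size of 4 leaves 8 usable samples.
x_toy = np.arange(10 * 3).reshape(10, 3)
y_toy = np.arange(10)
x_trim, y_trim = trim_to_batch_multiple(x_toy, y_toy, batch_size=4)
```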

In [28]:
run_epochs(model, 25)

Cycle 1
Epoch 1/1
<Brock>Hark! The herald angels sing, Glout, her weaty,
</Brock>
<Ash>Means still so' discoze poss'd adver your but
    In deserves but    Of repulble. Her make now; thought upon
</Ash>
<Brock>Yet, glid you are out theg and till the leer have
</Brock>
</Ash>
</Ash>
<Brock>

Cycle 2
Epoch 1/1
<Brock>Hark! The herald angels sing, Glouble him
    to might away
    has you are backing thee any one
    Whose bedow'd now to the most is beguilain; you lay anything;
    For my Tornable consuse.
    Our son us,
</Brock>
<Ash>O sheep for she die oppear;
    You not been far,
    The never e Lreigniex set some
    the heart better cannot so child herself; and rock>Had heard enther stand, d soff, at forward, though a month and
    her fight lead it, gave me, when I have put offection.
  SIR TOBY. Know, Pompose's leave, as to-morrow, somether goodness! I warry daughter
    Or to the wance,
    Of the are propity entures
    Bahe gentlementage from have his sex>I think you this stro

**IDEAS**:
- use training corpus as dictionary to correct 'high confidence' spelling mistakes
- gen_length -> num_lines
- more emojis?
- always throw out the first line of dialogue?

In [29]:
save5_path = model_path + 'shakespeare.h5'
# Save once; if the file already exists, the load below replaces the weights just trained.
if not os.path.exists(save5_path):
    model.save_weights(save5_path)
model.load_weights(save5_path)



In [30]:
run_epochs(model, 25)

Cycle 1
Epoch 1/1
</Ash>
</Ash>
<Brock>O, how may I, for her king, 'tis bear to pronounce of your poot
    Richard. Therefore 'call the old feed
    her be, let's servant fortrock>I have seen to the matter;
    Pass! I know their eye of Tybalt to be all the the full of privy, I
    right upon my lord? I'll have honour of by the To your love; I waken a course of measures that cross unspeak their modesty,
    The sake yet more great wilding equiral wrought
    To the waist mielb, iron-farewell, for perform'd,
</Brock>
<Ash>The dearest that hath there a hApe the stay-loss, from this father in smiles of
    sickness in angel
    Confess'd. She be your mind courtesy,
    So cot for things into them,
</Brock>
<Ash>Here, sir,
</Ash>

Cycle 2
Epoch 1/1
<Brock>Hark! The herald angels sing, Gloucester, stay with myself.'
    Nothing a   Forth answer appear.
    I am stinz'd ajoin again.
    I have were the uncle, speaks with men; so then onurance
</Ash>
</Ash>
</Ash>
<Brock> [Aside]  What's Grea

In [31]:
save5_2_path = model_path + 'shakespeare_2.h5'
if not os.path.exists(save5_2_path):
    model.save_weights(save5_2_path)
model.load_weights(save5_2_path)