# (Force and) Honor Project

## Short introduction

Well, you know why we are here. In this notebook, I detail how I create a chatbot using first a multilayer bi-LSTM connected to a simple [Glove embeddings](https://nlp.stanford.edu/projects/glove/), and then maybe, if I have time, using a more complexe approach.

First things first, let's start with the data.

## Data Gathering and Preprocessing

The first model will use Cornell dataset provided by the organiser. It's a small dataset so will give me probably a lesser quality bot, but it will be trained faster and will help for prototyping.
Rather than making it download again on your machine, I will suppose it is on your machine under `./data/cornell`.

In [1]:
import datasets
import chatterbot_interface
from importlib import reload
chatterbot_interface = reload(chatterbot_interface)
from chatterbot_interface import load_chatterbot_data
import os

#dataset_path = "./data/cornell"
# Be mindful that I had to change the code in datasets.py for fast_preprocessing to 
# be actually taken into account
#data = datasets.readCornellData(dataset_path, max_len=20, fast_preprocessing=True)

dataset_path = "./data/bot-txt"
data = load_chatterbot_data(dataset_path,8)

In [2]:
print(len(data))
print(data[:10])

745
[('each year in pro baseball the', 'the gold glove'), ('if you are riding fakie inside', 'snowboarding'), ('what is basketball', 'a game with tall players'), ('what soccer', 'i was born without the sports gene'), ('what is baseball', 'a game played with a hard rawhide covered'), ('teams of nine or ten players each it', 'a diamond shaped circuit'), ('what is soccer', 'a game played with a round ball by'), ('a goal at either end the ball is', 'of the body except the hands and arms'), ('i love baseball', 'i am not into sports that much'), ('i play soccer', 'you have to run very fast to be')]


As written above, we are going to use Glove embeddings. We'll start by using the smallest version which is glove.6B.50d.txt, an 164M file of 400k vocabulary, 6B tokens projected into a 50 dimensions space (bigger versions go up to 840B tokens and 300 dimensions).

Please note the treatment of special words: `<PAD>`, `<UNK>` and `<S>`. `<PAD>` is used as a padding word to put all the sentence at equal size (because tensor). `<UNK>` is a token replacing words which are not in the vocabulary. I deal here with out of vocabulary word by generating random vectors, which kind of treat them like noise. `<S>` is the starting token of the deoder. It doesn't need an embedding as the decoder output is a softmax on the vocabulary.

In [3]:
import numpy as np

# Constant values
PAD = "<PAD>"
UNK = "<UNK>"
START = "<START>"
END = "<EOS>"

class gloveEmbeddings:
    def __init__(self):
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
    def load(self,filename,voc_size=0):
        """
        Load the first voc_size words of a given glove size
        if voc_size == 0, loads the whole file.
        vocab is a set including the desired vocabulary. If not specified, we take everything
        Returns:
            embeddings: a dictionary word:embedding
            word_order: a list containing the words in the same order as in the document
            embeddings_dim: dimensionality of the embedding
        """
        
        # In case of multiple call
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(UNK)
        self._word2id[UNK] = 1
        self._id2word.append(START)
        self._word2id[START] = 2
        self._id2word.append(END)
        self._word2id[END] = 3
        count = 4
        with open(filename,"rt") as f:
            for line in f:
                word,*proj = line.split()
                self._embeddings[word] = np.array(proj,dtype=np.float32)
                self._id2word.append(word)
                self._word2id[word] = count
                count += 1
                if voc_size > 0 and count >= voc_size: break

        self._embeddings_dim = len(next(iter(self._embeddings.values())))
        self._embeddings[PAD] = np.zeros(self._embeddings_dim) # needed the dim
        self._embeddings[START] = self._embeddings[PAD] + 1 # totally arbitrary
        self._embeddings[END] = self._embeddings[PAD] - 1 # totally arbitrary
        # note that UNK doesn't have an embedding, as it's a random vector generated at execution

    def get(self,word):
        """
        We'll deal with unknown words by returning a random vector
        """
        if word not in self._embeddings or word == UNK:
            return np.random.rand(self._embeddings_dim) * 2 - 1
        return self._embeddings[word]
    
    def is_in(self,word):
        return word in self._embeddings
    
    def word2id(self,word):
        if word not in self._word2id:
            return self._word2id[UNK]
        return self._word2id[word]
    
    def id2word(self,index):
        return self._id2word[index] # might trigger out of range
    
    def size(self):
        return len(self._id2word)
    
    def get_dim(self):
        return self._embeddings_dim



In [4]:
glove = gloveEmbeddings()
glove.load("data/glove.6B.50d.txt")

Glove will be used in the input. As the output is only compared to the target sentences, we need to limit our output vocabulary to the set of words in the target vocabulary. We will also filter out words with less than a certain amount of occurences (most likely typos).

In [5]:
import sys


class outputVoc:
    def __init__(self):
        self._id2word = []
        self._word2id = {}
        self._word_freq = {}
    
    def learn_from_target(self,target_text,typo_limit=3):
        self._word_freq = {}
        for sen in target_text:
            for word in sen.split():
                if word not in self._word_freq:
                    self._word_freq[word] = 1
                else:
                    self._word_freq[word] += 1
        # Sanity check
        max_word = ""
        max_freq = 0
        min_word = ""
        min_freq = sys.maxsize
        for word,freq in self._word_freq.items():
            if freq > max_freq: 
                max_freq = freq
                max_word = word
            elif freq < min_freq:
                min_freq = freq
                min_word = word

        print("Max freq word = \"{}\" : {}".format(max_word,max_freq))
        print("Min freq word = \"{}\" {}".format(min_word,min_freq))
        
        self._id2word = []
        self._word2id = {}
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(UNK)
        self._word2id[UNK] = 1
        self._id2word.append(START)
        self._word2id[START] = 2
        self._id2word.append(END)
        self._word2id[END] = 3
        for word,freq in self._word_freq.items():
            if freq >= typo_limit:
                self._id2word.append(word)
                self._word2id[word] = len(self._id2word) - 1
        
    def word2id(self,word):
        if word not in self._word2id:
            return self._word2id[UNK]
        return self._word2id[word]
    
    def id2word(self,index):
        return self._id2word[index] # can cause out of range exception
    
    def size(self):
        return len(self._id2word)
    
    def words_to_one_hot(self,l_sen):
        """
        I am keeping the tokenizer responsibility out, so only accept already tokenized sentences
        """
        out = []
        for word in l_sen:
            one_hot = np.zeros(self.size())
            indice = self.word2id(word)
            one_hot[indice] = 1
            out.append(one_hot)
        return out
    
    def ids_to_one_hot(self,l_sen):
        """
        I am keeping the tokenizer responsibility out, so only accept already tokenized sentences
        """
        out = []
        for indice in l_sen:
            one_hot = np.zeros(self.size())
            one_hot[indice] = 1
            out.append(one_hot)
        return out
        
        

In [6]:
out_voc = outputVoc()
input_text,target_text = zip(*data)
out_voc.learn_from_target(target_text)
print("Output vocabulary size {}".format(out_voc.size()))

for x in out_voc.words_to_one_hot(["<START>","you","are","cute",",","reviewer","<EOS>","<PAD>"]):
    argmax = np.argmax(x)
    print("{} : {}".format(argmax,out_voc.id2word(argmax)))

Max freq word = "i" : 262
Min freq word = "shame" 1
Output vocabulary size 247
2 : <START>
12 : you
56 : are
1 : <UNK>
1 : <UNK>
1 : <UNK>
3 : <EOS>
0 : <PAD>


Let's put that in tensor form and split training / testing (90 / 10).

## Bi-LSTM building

Here come the big thing. Coding a bi-LSTM connected to the embeddings.
First we define the tensorflow dataset as input with mini-batches.

In [7]:
import tensorflow as tf

embedding_matrix = []
for i in range(glove.size()):
    embedding_matrix.append(glove.get(glove.id2word(i)))
embedding_matrix = np.matrix(embedding_matrix)

#dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
#dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [8]:
from sklearn.model_selection import train_test_split
from keras.preprocessing.sequence import pad_sequences        

# replace words by their id in the tensor, higher in the network they will be replaced by their embeddings
input_tensor,target_tensor = zip(*data)
input_tensor = [[glove.word2id(word) for word in sentence.split()+[END]] for sentence in input_tensor]
target_tensor = [[out_voc.word2id(word) for word in sentence.split()+[END]] for sentence in target_tensor]
target_input_tensor = [[out_voc.word2id(START)] + line[:-1] for line in target_tensor]

input_tensor = pad_sequences(input_tensor, maxlen=None,
                             dtype='int32', padding='post', value=glove.word2id(PAD))
target_tensor = pad_sequences(target_tensor, maxlen=None,
                             dtype='int32', padding='post', value=out_voc.word2id(PAD))

target_input_tensor = pad_sequences(target_input_tensor, maxlen=None,
                             dtype='int32', padding='post', value=out_voc.word2id(PAD))

# Remember that target_tensor must be a one-hot encoding


# And now target tensor to one hot

# train - eval split at 90-10
#input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val, target_input_tensor_train, target_input_tensor_val = train_test_split(input_tensor, target_tensor, target_input_tensor, test_size=0.1)


Using TensorFlow backend.


In [9]:
iinput_tensor = np.zeros(input_tensor.shape,dtype=int)
for i in range(input_tensor.shape[0]):
    for j in range(input_tensor.shape[1]):
        iinput_tensor[i,j] = input_tensor[i,input_tensor.shape[1]-j-1]
iinput_tensor

array([[    0,     0,     3, ...,    10,    66,   240],
       [    0,     0,     3, ...,    36,    85,    87],
       [    0,     0,     0, ...,  1792,    18,   106],
       ...,
       [    0,     0,     0, ...,     4,   840,    42],
       [    0,     0,     0, ...,     4,   840,    42],
       [    0,     0,     0, ..., 23560,   840,    42]])

In [10]:
print(target_tensor[:3])
target_tensor = np.array([out_voc.ids_to_one_hot(x) for x in target_tensor])
print(target_tensor[:3])
for i in range(3):
    for w in target_tensor[i]:
        print(out_voc.id2word(np.argmax(w)))
    print("="*20)

[[226   1   1   3   0   0   0   0   0]
 [  1   3   0   0   0   0   0   0   0]
 [139 160 240   1   1   3   0   0   0]]
[[[0. 0. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  [0. 1. 0. ... 0. 0. 0.]
  ...
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]]

 [[0. 1. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  ...
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]]

 [[0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  [0. 0. 0. ... 0. 0. 0.]
  ...
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]
  [1. 0. 0. ... 0. 0. 0.]]]
the
<UNK>
<UNK>
<EOS>
<PAD>
<PAD>
<PAD>
<PAD>
<PAD>
<UNK>
<EOS>
<PAD>
<PAD>
<PAD>
<PAD>
<PAD>
<PAD>
<PAD>
a
game
with
<UNK>
<UNK>
<EOS>
<PAD>
<PAD>
<PAD>


In [11]:
import tensorflow as tf
from keras import Model
from keras.layers import Input, LSTM, CuDNNLSTM, Embedding, Dropout, Dense, Masking
from utils import BeamList
import math

class Seq2Seq():
    
    def __init__(self,glove,out_voc,depth=1):
        self.glove = glove
        self.out_voc = out_voc
        self.depth=depth

    def pretrained_embedding_layer(self):
        """
        Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

        Arguments:
        word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
        word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

        Returns:
        embedding_layer -- pretrained layer Keras instance
        """

        vocab_len = self.glove.size() + 1 # adding 1 to fit Keras embedding (requirement)
        emb_dim = self.glove.get_dim()

        emb_matrix = np.zeros((vocab_len,emb_dim))

        for word, index in self.glove._word2id.items():
            emb_matrix[index, :] = self.glove.get(word)

        self.embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
        self.embedding_layer.build((None,))
        self.embedding_layer.set_weights([emb_matrix])
        
    def maybe_CuDNNLSTM(self,units,return_sequences=False,return_state=False):
        if tf.test.is_gpu_available():
            print("GPU acceleration available. LSTM cells will run faster")
            return CuDNNLSTM(units,return_sequences=return_sequences,return_state=return_state)
        else:
            print("GPU acceleration not available. LSTM cells will be slow")
            return LSTM(units,return_sequences=return_sequences,return_state=return_state)

    def build_encoder(self,sentence_length=8,dropout=0.2):
        self.encoder_inputs = Input(shape=(None,)) # it's an int

        self.pretrained_embedding_layer()
        self.input_embeddings = self.embedding_layer(self.encoder_inputs)
        
        current_input = self.input_embeddings
        self.encoder_states = []
        for i in range(self.depth):
            output,state_h,state_c = self.maybe_CuDNNLSTM(self.hidden_size,return_sequences=True,
                                                          return_state=True)(current_input)
            state_h = Dropout(dropout)(state_h)
            state_c = Dropout(dropout)(state_c)
            current_input = Dropout(dropout)(output)
            self.encoder_states.append(state_h)
            self.encoder_states.append(state_c)

    
    def build_training_model(self,hidden_size,dropout=0.2):
        
        units_out = self.out_voc.size()
        
        self.hidden_size = hidden_size
        self.build_encoder(dropout=dropout)

        self.decoder_inputs = Input(shape=(None,))
        #self.decoder_masked_inputs = Masking(mask_value=self.glove.word2id(PAD), input_shape=(None))(self.decoder_inputs)
        self.decoder_input_embeddings = self.embedding_layer(self.decoder_inputs)
        
        # Decoder at training time
        
        self.decoder_lstm = []
        current_input = self.decoder_input_embeddings
        X = None
        for i in range(self.depth):
            self.decoder_lstm.append(self.maybe_CuDNNLSTM(hidden_size,return_sequences=True,return_state=True))
            X,_,_ = self.decoder_lstm[i](current_input,
                                         initial_state=[self.encoder_states[2*i],self.encoder_states[2*i+1]])
            X = Dropout(dropout)(X)
            current_input = X
        
        self.decoder_dense = Dense(units_out, activation='softmax')
        final_output = self.decoder_dense(current_input)

        model = Model([self.encoder_inputs,self.decoder_inputs],final_output)
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        return model
    
    def build_decoder(self):
        """
        Decoder at evaluation time
        """
        
        self.encoder_model = Model(self.encoder_inputs, self.encoder_states)
        
        # state inputs
        decoder_states_inputs = []
        for i in range(self.depth):
            state_input_h = Input(shape=(self.hidden_size,))
            state_input_c = Input(shape=(self.hidden_size,))
            decoder_states_inputs.append(state_input_h)
            decoder_states_inputs.append(state_input_c)
        
        # going through the layers
        current_input = self.decoder_input_embeddings
        decoder_states = []
        for i in range(self.depth):
            states_input = [decoder_states_inputs[2*i],decoder_states_inputs[2*i+1]]
            decoder_outputs, state_h, state_c = self.decoder_lstm[i](
                current_input, initial_state=states_input)
            current_input = decoder_outputs
            decoder_states.append(state_h)
            decoder_states.append(state_c)
        decoder_outputs = self.decoder_dense(decoder_outputs)
            
        self.decoder_model = Model(
            [self.decoder_inputs] + decoder_states_inputs,
            [decoder_outputs] + decoder_states)
        
    def eval_seq(self,input_seq,max_length,beam=10):
        
        seq = []
        for w in input_seq.split():
            seq.append(self.glove.word2id(w))
        seq = [[seq]]
        print(seq)
        states_values = self.encoder_model.predict(seq)
        print(len(states_values))
        print(states_values[0].shape)
        print(states_values[1].shape)
        print(states_values[0][:,0])
        print(states_values[1][:,0])
        print("==============")
        target_seq = np.zeros((1, 1)) # we are using embedding indices as inputs
        #states_values = sv
        
        # First run (with START token)
        target_seq[0, 0] = self.glove.word2id(START)
        output_tokens, *states_output = self.decoder_model.predict([target_seq] + states_values)
        sorted_indices = np.argsort(output_tokens[0, 0, :])
        beamer = BeamList(beam)
        for b in range(beam): # beam update
            indice = sorted_indices[-(b+1)]
            new_id = self.glove.word2id(self.out_voc.id2word(indice))
            beamer.insert((math.log(output_tokens[0,0,indice]),states_output,[new_id])) 
        nb_words = 1
        
        while nb_words < max_length:
            # beam update
            new_beamer = BeamList(beam)
            for i in range(beam):
                log_prob, states_values, sequence = beamer.get(i)
                print(log_prob,end=" ")
                for s in sequence:
                    print(self.glove.id2word(s), end=" ")
                print()
                target_seq[0,0] = sequence[-1]
                output_tokens, *states_output = self.decoder_model.predict(
                [target_seq] + states_values)
                # beam update
                sorted_indices = np.argsort(output_tokens[0, 0, :])
                for j in range(beam):
                    indice = sorted_indices[-(j+1)]
                    new_id = self.glove.word2id(self.out_voc.id2word(indice))
                    new_beamer.insert((log_prob+math.log(output_tokens[0,0,indice]),states_output,sequence+[new_id]))
            print("\n===========")
            beamer = new_beamer
            
            nb_words += 1
        
        # decode best sequence
        # @ADD stop computing sequence with EOS (dead seq with fixed prob at end)
        # need to check if all sentences reached this point
        
        for i in range(beam):
            decoded_sentence = ""
            log_prob,_,sentence = beamer.get(i)
            for s in sentence:
                decoded_sentence += " {}".format(self.glove.id2word(s))
            print("{} {}".format(log_prob,decoded_sentence))

        return ""
            
    

In [12]:
samples,height,width = target_tensor.shape
iinput_tmpsor = np.zeros((2000,height))
target_tmpsor = np.zeros((2000,height,width))
target_input_tmpsor = np.zeros((2000,height))
for i in range(400):
    for j in range(5):
        iinput_tmpsor[i*5+j,:] = iinput_tensor[j,:]
        target_tmpsor[i*5+j,:,:] = target_tensor[j,:,:]
        target_input_tmpsor[i*5+j,:] = target_input_tensor[j,:]
print(samples,height,width)

745 9 247


In [98]:
# I want to penalise <PAD>, <UNK> and <EOS> (for longer sentences and no UNK output)
import keras
from keras.utils import plot_model

#BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 16
#N_BATCH = BUFFER_SIZE//BATCH_SIZE
lstm_dim = 512
#units = 1024
#vocab_inp_size = len(inp_lang.word2idx)
#vocab_tar_size = len(targ_lang.word2idx)


s2s = Seq2Seq(glove,out_voc,depth=2)
model = s2s.build_training_model(lstm_dim,dropout=0.3)
plot_model(model, to_file='model.png')
#model.summary()

# class_weight doesn't work anyway
class_weight = np.ones(out_voc.size())
class_weight[out_voc.word2id(PAD)] = 0.001
class_weight[out_voc.word2id(UNK)] = 0.001
#class_weight[out_voc.word2id(END)] = 1

callbacks = keras.callbacks.TensorBoard(log_dir='./Graph', histogram_freq=0,write_graph=True, write_images=True)

model.fit(x=[iinput_tensor,target_input_tensor], y=target_tensor,
          batch_size=BATCH_SIZE,epochs=50,
          validation_split=0.1,class_weight=class_weight,callbacks=[callbacks])



GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster
Train on 670 samples, validate on 75 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x7f88f4686240>

OK, let's try talking to it.

In [99]:
s2s.build_decoder()
plot_model(s2s.encoder_model, to_file="encoder_model.png")
plot_model(s2s.decoder_model, to_file='eval_model.png')
#s2s.decoder_model.summary()

In [102]:
def prep_input(text,max_len=8):
    out_text = ""
    tab = []
    for word in text.split():
        tab.insert(0,word)
    tab.insert(0,"<EOS>")
    while len(tab) < max_len:
        tab.insert(0,"<PAD>")
    return " ".join([word for word in tab])

input_text = prep_input("what is baseketball",8)
sentence = s2s.eval_seq(input_text,8,30)
print(sentence)

[[[0, 0, 0, 0, 3, 168476, 18, 106]]]
6
(1, 1024)
(1, 1024)
[0.04686449]
[0.22465827]
-1.120331726323114 a 
-1.2656580856106934 <UNK> 
-2.558579185618506 is 
-2.622851836881009 the 
-2.93099181457561 i 
-3.02710075883412 an 
-3.0611058466958934 he 
-4.228704329837161 not 
-4.229469685713195 thomas 
-4.451262926030143 hal 
-5.168568186037519 my 
-5.215365782084313 of 
-5.273697326699788 what 
-5.282078148975742 you 
-5.7858771845099515 do 
-6.077127437388312 it 
-6.396555309402902 to 
-6.444980062060247 no 
-6.611964126378443 by 
-6.687531420364789 in 
-6.88809764308106 history 
-7.001948332002698 how 
-7.030136181457714 have 
-7.0366925515107885 that 
-7.2414078516739115 or 
-7.3833727455827916 should 
-7.596174223623606 one 
-7.653330326958242 software 
-7.699461456345004 <EOS> 
-7.726842824456796 does 

-1.3664890806608472 <UNK> is 
-1.443146314759205 a <UNK> 
-2.9431826714503764 i <UNK> 
-2.9745844378502313 the <UNK> 
-3.231045927650467 he <UNK> 
-3.6798788065850516 is was 
-3.716161

# needed to [seq] the whole input