# (Force and) Honor Project

## Short introduction

Well, you know why we are here. In this notebook, I detail how I build a chatbot, first using a multilayer bi-LSTM fed with simple [GloVe embeddings](https://nlp.stanford.edu/projects/glove/), and then maybe, if I have time, using a more complex approach.

First things first, let's start with the data.

## Data Gathering and Preprocessing

The first model will use the Cornell dataset provided by the organiser. It is a small dataset, so it will probably give me a lower-quality bot, but it trains faster and helps with prototyping.
Rather than downloading it again on your machine, I will assume it is already there under `./data/cornell`.

In [1]:
import datasets
import os

dataset_path = os.path.join("./data/cornell")
# Be mindful that I had to change the code in datasets.py for fast_preprocessing to 
# be actually taken into account
data = datasets.readCornellData(dataset_path, max_len=20, fast_preprocessing=True)

100%|██████████| 83097/83097 [00:02<00:00, 30267.65it/s]


In [2]:
print(len(data))
print(data[:10])

24792
[('there', 'where'), ('have fun tonight', 'tons'), ('what good stuff', 'the real you'), ('wow', 'lets go'), ('she okay', 'i hope so'), ('they do to', 'they do not'), ('who', 'joey'), ('its more', 'expensive'), ('hey sweet cheeks', 'hi joey'), ('whereve you been', 'nowhere hi daddy')]


As written above, we are going to use GloVe embeddings. We'll start with the smallest version, glove.6B.50d.txt: a 164 MB file with a 400k-word vocabulary, trained on 6B tokens and projected into a 50-dimensional space (bigger versions go up to 840B tokens and 300 dimensions).

Please note the treatment of the special words `<PAD>`, `<UNK>`, `<START>` and `<EOS>`. `<PAD>` is a padding word used to bring all the sentences to the same length (because tensors). `<UNK>` is a token replacing words that are not in the vocabulary. I deal here with out-of-vocabulary words by generating random vectors, which more or less treats them as noise. `<START>` is the start token of the decoder and `<EOS>` marks the end of a sentence. Their embeddings do not matter much (the code below sets them to arbitrary constants), as the decoder output is a softmax over the vocabulary.
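
To make the role of these tokens concrete, here is a minimal sketch on a toy vocabulary. The `word2id` dictionary and the `encode` helper are made up for illustration and are not part of the notebook's classes; the sketch also assumes the sentence fits in `max_len`.

```python
PAD, UNK, START, END = "<PAD>", "<UNK>", "<START>", "<EOS>"

# Hypothetical toy vocabulary (the real ones come from GloVe and the target text)
word2id = {PAD: 0, UNK: 1, START: 2, END: 3, "hi": 4, "joey": 5}

def encode(sentence, max_len):
    """Map words to ids (UNK for unknown words), append END, then pad with PAD."""
    ids = [word2id.get(word, word2id[UNK]) for word in sentence.split()]
    ids.append(word2id[END])
    return ids + [word2id[PAD]] * (max_len - len(ids))

target = encode("hi joey", max_len=5)        # known words only
unknown = encode("hi reviewer", max_len=5)   # "reviewer" falls back to UNK

# The decoder input is the target shifted right by one step, starting with START
decoder_input = [word2id[START]] + target[:-1]

print(target, unknown, decoder_input)
```

The shift by one step is what lets the decoder learn to predict the next word from the previous one during training.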

In [3]:
import numpy as np

# Constant values
PAD = "<PAD>"
UNK = "<UNK>"
START = "<START>"
END = "<EOS>"

class gloveEmbeddings:
    def __init__(self):
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
    def load(self,filename,voc_size=0):
        """
        Load the first voc_size words of a given GloVe file
        (the 4 special tokens count towards voc_size).
        If voc_size == 0, load the whole file.
        Fills in:
            _embeddings: a dictionary word:embedding
            _id2word: a list containing the words in the same order as in the file
            _word2id: a dictionary word:index
            _embeddings_dim: dimensionality of the embeddings
        """
        
        # In case of multiple call
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(UNK)
        self._word2id[UNK] = 1
        self._id2word.append(START)
        self._word2id[START] = 2
        self._id2word.append(END)
        self._word2id[END] = 3
        count = 4
        with open(filename,"rt",encoding="utf-8") as f:
            for line in f:
                word,*proj = line.split()
                self._embeddings[word] = np.array(proj,dtype=np.float32)
                self._id2word.append(word)
                self._word2id[word] = count
                count += 1
                if voc_size > 0 and count >= voc_size: break

        self._embeddings_dim = len(next(iter(self._embeddings.values())))
        self._embeddings[PAD] = np.zeros(self._embeddings_dim) # needed the dim
        self._embeddings[START] = self._embeddings[PAD] + 1 # totally arbitrary
        self._embeddings[END] = self._embeddings[PAD] - 1 # totally arbitrary
        # note that UNK doesn't have an embedding, as it's a random vector generated at execution

    def get(self,word):
        """
        We'll deal with unknown words by returning a random vector
        """
        if word not in self._embeddings:
            return np.random.rand(self._embeddings_dim) * 2 - 1
        return self._embeddings[word]
    
    def word2id(self,word):
        if word not in self._word2id:
            return -1
        return self._word2id[word]
    
    def id2word(self,index):
        return self._id2word[index] # might trigger out of range
    
    def size(self):
        return len(self._id2word)
    
    def get_dim(self):
        return self._embeddings_dim



In [4]:
glove = gloveEmbeddings()
glove.load("data/glove.6B.50d.txt")

GloVe will be used for the input. As the output is only compared to the target sentences, we need to limit our output vocabulary to the set of words in the target vocabulary. We will also filter out words with fewer than a certain number of occurrences (most likely typos).
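
As a rough sketch of that frequency filter, here is a compact version built on `collections.Counter`. The `build_vocab` helper and the toy sentences are illustrative assumptions, not the class used in the next cell.

```python
from collections import Counter

SPECIALS = ("<PAD>", "<UNK>", "<START>", "<EOS>")

def build_vocab(sentences, min_count=3):
    """Keep the special tokens plus every word seen at least min_count times."""
    counts = Counter(word for sen in sentences for word in sen.split())
    id2word = list(SPECIALS) + [w for w, c in counts.items() if c >= min_count]
    word2id = {word: index for index, word in enumerate(id2word)}
    return id2word, word2id

# Toy target sentences: "i" and "hope" appear 3 times, the other words less often
toy_targets = ["i hope so", "i hope not", "i do", "hope so"]
id2word, word2id = build_vocab(toy_targets, min_count=3)
print(id2word)
```

Words below the threshold are simply absent from `word2id`, so they end up mapped to `<UNK>` when sentences are encoded.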

In [5]:
import sys


class outputVoc:
    def __init__(self):
        self._id2word = []
        self._word2id = {}
        self._word_freq = {}
    
    def learn_from_target(self,target_text,typo_limit=3):
        self._word_freq = {}
        for sen in target_text:
            for word in sen.split():
                if word not in self._word_freq:
                    self._word_freq[word] = 1
                else:
                    self._word_freq[word] += 1
        # Sanity check
        max_word = ""
        max_freq = 0
        min_word = ""
        min_freq = sys.maxsize
        for word,freq in self._word_freq.items():
            if freq > max_freq: 
                max_freq = freq
                max_word = word
            elif freq < min_freq:
                min_freq = freq
                min_word = word

        print("Max freq word = \"{}\" : {}".format(max_word,max_freq))
        print("Min freq word = \"{}\" {}".format(min_word,min_freq))
        
        self._id2word = []
        self._word2id = {}
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(UNK)
        self._word2id[UNK] = 1
        self._id2word.append(START)
        self._word2id[START] = 2
        self._id2word.append(END)
        self._word2id[END] = 3
        for word,freq in self._word_freq.items():
            if freq >= typo_limit:
                self._id2word.append(word)
                self._word2id[word] = len(self._id2word) - 1
        
    def word2id(self,word):
        if word not in self._word2id:
            return -1
        return self._word2id[word]
    
    def id2word(self,index):
        return self._id2word[index] # can cause out of range exception
    
    def size(self):
        return len(self._id2word)
    
    def list_to_one_hot(self,l_sen):
        """
        I am keeping the tokenizer responsibility out, so only accept already tokenized sentences
        """
        out = []
        for word in l_sen:
            one_hot = np.zeros(self.size())
            index = self._word2id[word] if word in self._word2id else self._word2id[UNK]
            one_hot[index] = 1
            out.append(one_hot)
        return out
        
        

In [6]:
out_voc = outputVoc()
input_text,target_text = zip(*data)
out_voc.learn_from_target(target_text)
print("Output vocabulary size {}".format(out_voc.size()))

for x in out_voc.list_to_one_hot(["<START>","you","are","cute",",","reviewer","<EOS>","<PAD>"]):
    argmax = np.argmax(x)
    print("{} : {}".format(argmax,out_voc.id2word(argmax)))

Max freq word = "you" : 2850
Min freq word = "net" 1
Output vocabulary size 1726
2 : <START>
1430 : you
1441 : are
215 : cute
1 : <UNK>
1 : <UNK>
3 : <EOS>
0 : <PAD>


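Let's put that in tensor form and split training / validation. The 90 / 10 `train_test_split` is commented out for now; the `fit` call at the end uses `validation_split=0.2` instead.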
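
For reference, a 90 / 10 split boils down to shuffling the example indices once and cutting them in two, so that the input, target and decoder-input tensors stay aligned. The `split_indices` helper below is a hypothetical standard-library sketch; scikit-learn's `train_test_split` does the same job on several arrays at once.

```python
import random

def split_indices(n, val_fraction=0.1, seed=0):
    """Return (train_idx, val_idx): a shuffled partition of range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(round(n * val_fraction))
    return idx[n_val:], idx[:n_val]

# 24792 is the number of sentence pairs printed above
train_idx, val_idx = split_indices(24792, val_fraction=0.1)
print(len(train_idx), len(val_idx))
```

With NumPy arrays, the same index lists can then be applied to every tensor, for example `input_tensor[train_idx]` and `target_tensor[train_idx]`.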

## Bi-LSTM building

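Here comes the big thing: coding a bi-LSTM connected to the embeddings. (As written, the code below uses plain unidirectional LSTM layers in an encoder-decoder setup.)
First we define the batch size and the LSTM dimension and build the embedding matrix; the TensorFlow dataset with mini-batches is left commented out for now.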

In [7]:
import tensorflow as tf

#BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
#N_BATCH = BUFFER_SIZE//BATCH_SIZE
lstm_dim = 128
#units = 1024
#vocab_inp_size = len(inp_lang.word2idx)
#vocab_tar_size = len(targ_lang.word2idx)

embedding_matrix = []
for i in range(glove.size()):
    embedding_matrix.append(glove.get(glove.id2word(i)))
embedding_matrix = np.array(embedding_matrix)  # plain 2-D array; np.matrix is discouraged by NumPy

#dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
#dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [20]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences        

# Replace words by their ids; higher in the network they will be replaced by their embeddings.
# word2id returns -1 for unknown words, which is not a valid index, so map those to UNK.
def to_ids(voc, words):
    return [voc.word2id(w) if voc.word2id(w) >= 0 else voc.word2id(UNK) for w in words]

input_tensor,target_tensor = zip(*data)
input_tensor = [to_ids(glove, sentence.split()) for sentence in input_tensor]
target_tensor = [to_ids(out_voc, sentence.split()+[END]) for sentence in target_tensor]
# Decoder input: the target shifted right by one step, starting with START.
# Note that these are out_voc ids, although the decoder input goes through the GloVe embedding layer.
target_input_tensor = [[out_voc.word2id(START)] + line[:-1] for line in target_tensor]

input_tensor = pad_sequences(input_tensor, maxlen=None,
                             dtype='int32', padding='post', value=glove.word2id(PAD))
target_tensor = pad_sequences(target_tensor, maxlen=None,
                             dtype='int32', padding='post', value=out_voc.word2id(PAD))

target_input_tensor = pad_sequences(target_input_tensor, maxlen=None,
                             dtype='int32', padding='post', value=out_voc.word2id(PAD))

# Remember that target_tensor must be a one-hot encoding. It already holds ids, so we index
# an identity matrix (list_to_one_hot expects words and would map every id to UNK).
target_tensor = np.eye(out_voc.size(), dtype=np.float32)[target_tensor]

# train - eval split at 90-10
#input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val, target_input_tensor_train, target_input_tensor_val = train_test_split(input_tensor, target_tensor, target_input_tensor, test_size=0.1)


In [45]:
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, LSTM, CuDNNLSTM, Embedding, Dropout, Dense

class seq2Seq():
    
    def __init__(self,glove):
        self.glove = glove

    def pretrained_embedding_layer(self):
        """
        Creates a Keras Embedding() layer and loads in the pre-trained GloVe vectors
        held by self.glove. The layer is stored in self.embedding_layer.
        """

        vocab_len = self.glove.size() + 1   # one spare row on top of the vocabulary
        emb_dim = self.glove.get_dim()      # dimensionality of the GloVe word vectors (= 50 here)

        # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, emb_dim)
        emb_matrix = np.zeros((vocab_len,emb_dim))

        # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
        for word, index in self.glove._word2id.items():
            emb_matrix[index, :] = self.glove.get(word)

        # Define the Keras embedding layer; trainable=False keeps the GloVe vectors frozen
        self.embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

        # Build the embedding layer; this is required before setting its weights
        self.embedding_layer.build((None,))

        # Set the weights of the embedding layer to the embedding matrix. The layer is now pretrained.
        self.embedding_layer.set_weights([emb_matrix])
        
    def maybe_LSTM(self,units,return_sequences=False,return_state=False):
        if tf.test.is_gpu_available():
            print("GPU acceleration available. LSTM cells will run faster")
            return CuDNNLSTM(units,return_sequences=return_sequences,return_state=return_state)
        else:
            print("GPU acceleration not available. LSTM cells will be slow")
            return LSTM(units,return_sequences=return_sequences,return_state=return_state)

    def _build_encoder(self,training=True,dropout=0.2):
        self.encoder_input = Input(shape=(None,)) # it's an int

        self.pretrained_embedding_layer()
        #embedding_layer = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],input_length=7,trainable=False)
        #embedding_layer.build((None,))
        #embedding_layer.set_weights([embedding_matrix])
        #embeddings = embedding_layer(sentence_indices)
        self.input_embeddings = self.embedding_layer(self.encoder_input)

        _,state_h,state_c = self.maybe_LSTM(self.hidden_size,return_state=True)(self.input_embeddings)
        if training:
            state_h = Dropout(dropout)(state_h)
            state_c = Dropout(dropout)(state_c)
        self.encoder_state = [state_h,state_c]

    
    def build_training_model(self,units_out,hidden_size,dropout=0.2):
        
        self.hidden_size = hidden_size
        self._build_encoder(training=True,dropout=dropout)

        self.decoder_input = Input(shape=(None,)) # also here it's an int
        # IS IT NECESSARY?
        #embedding_layer2 = pretrained_embedding_layer(glove)
        #embedding_layer2 = Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],trainable=False)
        #embedding_layer2.build((None,))
        #embedding_layer2.set_weights([embedding_matrix])
        self.decoder_input_embeddings = self.embedding_layer(self.decoder_input)
        
        self._decoder_lstm = self.maybe_LSTM(hidden_size,return_sequences=True,return_state=True)
        X,_,_ = self._decoder_lstm(self.decoder_input_embeddings,initial_state=self.encoder_state)
        X = Dropout(dropout)(X)
        self.decoder_dense = Dense(units_out, activation='softmax')
        X = self.decoder_dense(X)

        # Create the Model instance mapping [encoder input, decoder input] to the softmax output X.
        model = Model([self.encoder_input,self.decoder_input],X)
        model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

        return model
    
    def build_decoder(self):
        #self.build_encoder(glove,hidden_size,training=True,dropout=dropout)
        #encoder_model = Model(self.sentence_indices,self.encoded_state)
        #states = encoder_model.predict(seq)
        # NO, will be called in a separate function
        
        # Inference-time decoder: reuses the trained LSTM and dense layers,
        # but takes its initial state as explicit inputs
        state_input_h = Input(shape=(self.hidden_size,))
        state_input_c = Input(shape=(self.hidden_size,))
        self.decoder_states_inputs = [state_input_h, state_input_c]
        decoder_outputs, state_h, state_c = self._decoder_lstm(
            self.decoder_input_embeddings, initial_state=self.decoder_states_inputs)
        # Return the updated states so they can be fed back in at the next step
        decoder_states = [state_h, state_c]
        decoder_outputs = self.decoder_dense(decoder_outputs)
        return Model(
            [self.decoder_input] + self.decoder_states_inputs,
            [decoder_outputs] + decoder_states) # HERE
        
    def eval_seq(self,seq):
        # AND HERE
        pass
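
One possible way to fill in `eval_seq` later is a greedy decoding loop: feed `<START>`, pick the most probable word, feed it back in, and stop at `<EOS>` or after a maximum length. The sketch below is framework-free and relies on a hypothetical `step_fn(token_id, state)` that returns a probability list and the next state; in the notebook that role would be played by a wrapper around the inference decoder model, which is an assumption and not something implemented here.

```python
def greedy_decode(step_fn, initial_state, start_id, end_id, max_len=20):
    """Greedy decoding: always follow the most probable next token."""
    token, state, output = start_id, initial_state, []
    for _ in range(max_len):
        probs, state = step_fn(token, state)
        # argmax over the vocabulary
        token = max(range(len(probs)), key=probs.__getitem__)
        if token == end_id:
            break
        output.append(token)
    return output

# Toy step function for illustration only: the "state" is the list of ids
# still to emit, and the end token (id 3) is emitted once it is empty
def toy_step(token, state):
    next_id = state[0] if state else 3
    probs = [0.0] * 6
    probs[next_id] = 1.0
    return probs, state[1:]

decoded = greedy_decode(toy_step, [4, 5], start_id=2, end_id=3)
print(decoded)
```

The `max_len` cap guards against a model that never emits the end token; 20 mirrors the `max_len=20` used when reading the dataset.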
    

In [46]:
s2s = seq2Seq(glove)
model = s2s.build_training_model(out_voc.size(),lstm_dim)

GPU acceleration available. LSTM cells will run faster
GPU acceleration available. LSTM cells will run faster


In [47]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_21 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
input_20 (InputLayer)           (None, None)         0                                            
__________________________________________________________________________________________________
embedding_10 (Embedding)        (None, None, 50)     20000250    input_20[0][0]                   
                                                                 input_21[0][0]                   
__________________________________________________________________________________________________
cu_dnnlstm_14 (CuDNNLSTM)       [(None, 128), (None, 92160       embedding_10[0][0]               
__________

In [None]:
print(target_tensor.shape)
print(input_tensor.shape)
print(target_input_tensor.shape)

In [48]:
model.fit(x=[input_tensor,target_input_tensor], y=target_tensor,
          batch_size=BATCH_SIZE,epochs=1,validation_split=0.2)



Train on 19833 samples, validate on 4959 samples
Epoch 1/1


<tensorflow.python.keras.callbacks.History at 0x7f7ebc7bcb00>

# Might want to re-encode decoder_input: it currently holds out_voc ids but is fed to the GloVe embedding layer