# (Force and) Honor Project

## Short introduction

Well, you know why we are here. In this notebook, I detail how I build a chatbot, first with a multilayer bi-LSTM fed by simple [GloVe embeddings](https://nlp.stanford.edu/projects/glove/), and then maybe, if I have time, with a more complex approach.

First things first, let's start with the data.

## Data Gathering and Preprocessing

The first model will use the Cornell dataset provided by the organiser. It's a small dataset, so it will probably give me a lower-quality bot, but it trains faster and helps with prototyping.
Rather than downloading it again, I will assume it is already on your machine under `./data/cornell`.

In [3]:
import datasets
import os

dataset_path = os.path.join("./data/cornell")
# Be mindful that I had to change the code in datasets.py for fast_preprocessing to 
# be actually taken into account
data = datasets.readCornellData(dataset_path, max_len=20, fast_preprocessing=True)

100%|██████████| 83097/83097 [00:02<00:00, 30497.50it/s]


In [4]:
print(len(data))
print(data[:10])

24792
[('there', 'where'), ('have fun tonight', 'tons'), ('what good stuff', 'the real you'), ('wow', 'lets go'), ('she okay', 'i hope so'), ('they do to', 'they do not'), ('who', 'joey'), ('its more', 'expensive'), ('hey sweet cheeks', 'hi joey'), ('whereve you been', 'nowhere hi daddy')]


As written above, we are going to use GloVe embeddings. We'll start with the smallest version, glove.6B.50d.txt: a 164 MB file with a 400k-word vocabulary, trained on 6B tokens and projected into a 50-dimensional space (bigger versions go up to 840B tokens and 300 dimensions).
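
For reference, each line of a GloVe text file is a word followed by its space-separated vector components. Below is a minimal sketch of parsing such lines, using made-up 4-dimensional samples rather than real GloVe values. `parse_glove_line` is a hypothetical helper name, not part of any library.

```python
def parse_glove_line(line):
    """Split one GloVe-format text line into (word, vector)."""
    word, *values = line.split()
    return word, [float(v) for v in values]

# Made-up sample lines; real glove.6B.50d.txt lines carry 50 components.
sample = [
    "the 0.1 -0.2 0.3 0.4",
    "cat 0.5 0.6 -0.7 0.8",
]

embeddings = dict(parse_glove_line(line) for line in sample)
dim = len(next(iter(embeddings.values())))
print(dim)                  # 4
print(embeddings["cat"])    # [0.5, 0.6, -0.7, 0.8]
```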

Please note the treatment of the special words `<PAD>`, `<UNK>` and `<S>`. `<PAD>` is a padding word used to bring all sentences to the same length (because tensors need a fixed shape). `<UNK>` is a token replacing words that are not in the vocabulary. Here I deal with out-of-vocabulary words by generating random vectors, which more or less treats them as noise. `<S>` is the start token of the decoder. It doesn't need an embedding, as the decoder output is a softmax over the vocabulary.
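
To make the role of `<PAD>` and `<UNK>` concrete, here is a small sketch on a toy vocabulary. The `encode` helper and the vocabulary are assumptions made for illustration only; they are not the classes defined in this notebook.

```python
PAD, UNK = "<PAD>", "<UNK>"

def encode(sentence, word2id, max_len):
    """Map words to ids, replace unknown words by UNK, then pad to max_len."""
    ids = [word2id.get(word, word2id[UNK]) for word in sentence.split()]
    ids = ids[:max_len]
    return ids + [word2id[PAD]] * (max_len - len(ids))

# Toy vocabulary, for illustration only.
word2id = {PAD: 0, UNK: 1, "i": 2, "know": 3, "a": 4, "guy": 5}

print(encode("i know a guy", word2id, max_len=6))      # [2, 3, 4, 5, 0, 0]
print(encode("i know a wookiee", word2id, max_len=6))  # [2, 3, 4, 1, 0, 0]
```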

In [20]:
import numpy as np

# Padding value... I am not sure if I'll use it...
PAD = "<PAD>"
UNK = "<UNK>"
START = "<S>"

class gloveEmbeddings:
    def __init__(self):
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
    def load(self,filename,voc_size=0):
        """
        Load a vocabulary of voc_size entries (PAD and UNK included) from a GloVe file.
        If voc_size == 0, loads the whole file.
        Fills in:
            _embeddings: a dictionary word:embedding
            _id2word / _word2id: the words in the same order as in the file,
                                 after the PAD and UNK special tokens
            _embeddings_dim: dimensionality of the embedding
        """
        
        # In case of multiple call
        self._embeddings = {}
        self._id2word = []
        self._word2id = {}
        self._embeddings_dim = 0
        
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(UNK)
        self._word2id[UNK] = 1
        count = 2
        # explicit encoding: the GloVe vocabulary contains non-ASCII words
        with open(filename,"rt",encoding="utf-8") as f:
            for line in f:
                word,*proj = line.split()
                self._embeddings[word] = np.array(proj,dtype=np.float32)
                self._id2word.append(word)
                self._word2id[word] = count
                count += 1
                if voc_size > 0 and count >= voc_size: break

        self._embeddings_dim = len(next(iter(self._embeddings.values())))
        self._embeddings[PAD] = np.zeros(self._embeddings_dim,dtype=np.float32) # needed the dim
        # note that UNK doesn't have an embedding, as it's a random vector generated at execution

    def get(self,word):
        """
        We'll deal with unknown words by returning a random vector
        """
        if word not in self._embeddings:
            return np.random.rand(self._embeddings_dim) * 2 - 1
        return self._embeddings[word]
    
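    def word2id(self,word):
        # Out-of-vocabulary words map to the UNK id rather than -1,
        # which id2word would silently read as "last word of the list"
        if word not in self._word2id:
            return self._word2id[UNK]
        return self._word2id[word]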
    
    def id2word(self,index):
        return self._id2word[index] # might trigger out of range



In [21]:
glove = gloveEmbeddings()
glove.load("data/glove.6B.50d.txt")

GloVe will be used for the input. As the output is only compared to the target sentences, we can limit the output vocabulary to the set of words appearing in the targets. We will also filter out words with fewer than a certain number of occurrences (most likely typos).
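
The frequency filter can be sketched in a few lines with `collections.Counter`. This is only an illustration on toy sentences, with a hypothetical `build_vocab` helper; the class in the next cell does the same job with an explicit dictionary.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    """Keep words seen at least min_count times, in first-seen order."""
    freq = Counter(word for sen in sentences for word in sen.split())
    return [word for word, count in freq.items() if count >= min_count]

# Toy target sentences, for illustration only.
targets = ["i hope so", "i hope not", "so do i", "hi joey"]

print(build_vocab(targets, min_count=2))   # ['i', 'hope', 'so']
```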

In [22]:
import sys


class outputVoc:
    def __init__(self):
        self._id2word = []
        self._word2id = {}
        self._word_freq = {}
    
    def learn_from_target(self,target_text,typo_limit=3):
        self._word_freq = {}
        for sen in target_text:
            for word in sen.split():
                if word not in self._word_freq:
                    self._word_freq[word] = 1
                else:
                    self._word_freq[word] += 1
        # Sanity check
        max_word = ""
        max_freq = 0
        min_word = ""
        min_freq = sys.maxsize
        for word,freq in self._word_freq.items():
            if freq > max_freq:
                max_freq = freq
                max_word = word
            # plain "if": a word can be both the most and the least frequent so far
            if freq < min_freq:
                min_freq = freq
                min_word = word

        print("Max freq word = \"{}\" : {}".format(max_word,max_freq))
        print("Min freq word = \"{}\" {}".format(min_word,min_freq))
        
        self._id2word = []
        self._word2id = {}
        self._id2word.append(PAD)
        self._word2id[PAD] = 0
        self._id2word.append(START)
        self._word2id[START] = 1
        for word,freq in self._word_freq.items():
            if freq >= typo_limit:
                self._id2word.append(word)
                self._word2id[word] = len(self._id2word) - 1
        
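    def word2id(self,word):
        # -1 flags a word filtered out of the output vocabulary; beware that
        # id2word(-1) will not fail but return the last word of the list
        if word not in self._word2id:
            return -1
        return self._word2id[word]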
    
    def id2word(self,index):
        return self._id2word[index] # can cause out of range exception
    
    def size(self):
        return len(self._id2word)

In [23]:
out_voc = outputVoc()
input_text,target_text = zip(*data)
out_voc.learn_from_target(target_text)
print("Output vocabulary size {}".format(out_voc.size()))

Max freq word = "you" : 2850
Min freq word = "megan" 1
Output vocabulary size 1724


Let's put that in tensor form and split training / testing (90 / 10).
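
As a dependency-free sketch of what post-padding and a 90 / 10 split amount to, consider the toy example below. `pad_post` and `split_pairs` are hypothetical helpers written for illustration; the notebook itself relies on `pad_sequences` and `train_test_split`, whose exact rounding and shuffling may differ from this sketch.

```python
import random

def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [value] * (max_len - len(seq)) for seq in sequences]

def split_pairs(inputs, targets, test_size=0.1, seed=0):
    """Shuffle (input, target) pairs together, then cut off a validation slice."""
    pairs = list(zip(inputs, targets))
    random.Random(seed).shuffle(pairs)
    n_val = max(1, int(len(pairs) * test_size))
    return pairs[n_val:], pairs[:n_val]

# Toy id sequences, for illustration only.
inputs = pad_post([[104, 7193], [43, 348, 9, 1858], [810]] * 10)
targets = pad_post([[5], [6, 7, 8], [9, 10]] * 10)

train, val = split_pairs(inputs, targets)
print(len(train), len(val))   # 27 3
print(inputs[0])              # [104, 7193, 0, 0]
```

Shuffling the pairs together, rather than the two lists separately, is what keeps each question aligned with its answer.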

In [26]:
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.sequence import pad_sequences

# replace words by their id in the tensor, higher in the network they will be replaced by their embeddings
input_tensor,target_tensor = zip(*data)
input_tensor = [[glove.word2id(word) for word in sentence.split()] for sentence in input_tensor]
target_tensor = [[out_voc.word2id(word) for word in sentence.split()] for sentence in target_tensor]

input_tensor = pad_sequences(input_tensor, maxlen=None,
                             dtype='int32', padding='post', value=glove.word2id(PAD))
target_tensor = pad_sequences(target_tensor, maxlen=None,
                             dtype='int32', padding='post', value=out_voc.word2id(PAD))

# train - eval split at 90-10
input_tensor_train, input_tensor_val, target_tensor_train, target_tensor_val = train_test_split(input_tensor, target_tensor, test_size=0.1)


In [27]:
# Sanity check
for i in range(5):
    sentence = [glove.id2word(x) for x in input_tensor_train[i]]
    print("{} <-> {}".format(input_tensor_train[i],sentence))
# Sanity check: questions and answers are still aligned
print("="*50)
for i in range(5):
    q = [glove.id2word(x) for x in input_tensor_train[i]]
    a = [out_voc.id2word(x) for x in target_tensor_train[i]]
    print("{} :: {}".format(q,a))

[ 104 7193    0    0    0    0    0] <-> ['what', 'dame', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
[  43  348    9 1858    0    0    0] <-> ['i', 'know', 'a', 'guy', '<PAD>', '<PAD>', '<PAD>']
[810   0   0   0   0   0   0] <-> ['mother', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
[ 37   8 837   0   0   0   0] <-> ['were', 'in', 'love', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
[86  0  0  0  0  0  0] <-> ['no', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
['what', 'dame', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>'] :: ['paine', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
['i', 'know', 'a', 'guy', '<PAD>', '<PAD>', '<PAD>'] :: ['lead', 'the', 'way', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
['mother', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>'] :: ['hang', 'shit', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
['were', 'in', 'love', '<PAD>', '<PAD>', '<PAD>', '<PAD>'] :: ['sure', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>', '<PAD>']
['no', '<PAD>', '<PAD>',

## Bi-LSTM building

Here comes the big thing: coding a bi-LSTM connected to the embeddings.
First we define the TensorFlow dataset used as input, split into mini-batches.
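
Conceptually, the dataset pipeline shuffles the examples and cuts them into fixed-size batches, dropping the incomplete last one. Here is a plain-Python sketch of that idea; `make_batches` is a hypothetical helper, not the `tf.data` API used in the next cell.

```python
import random

def make_batches(examples, batch_size, seed=0):
    """Shuffle the examples, then cut them into full batches only."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_batch = len(shuffled) // batch_size   # incomplete last batch is dropped
    return [shuffled[i * batch_size:(i + 1) * batch_size] for i in range(n_batch)]

examples = list(range(150))          # stand-ins for (input, target) pairs
batches = make_batches(examples, batch_size=64)
print(len(batches))                  # 2
print([len(b) for b in batches])     # [64, 64]
```

With 150 examples and a batch size of 64, the 22 leftover examples are simply not seen in that epoch.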

In [98]:
import tensorflow as tf

BUFFER_SIZE = len(input_tensor_train)
BATCH_SIZE = 64
N_BATCH = BUFFER_SIZE//BATCH_SIZE
lstm_dim = 256
#units = 1024
#vocab_inp_size = len(inp_lang.word2idx)
#vocab_tar_size = len(targ_lang.word2idx)

dataset = tf.data.Dataset.from_tensor_slices((input_tensor_train, target_tensor_train)).shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

In [None]:
from tensorflow.keras import Sequential

# Named make_lstm so that it does not shadow tensorflow.keras.layers.LSTM
def make_lstm(dim):
    # If you have a GPU, CuDNNLSTM (TF 1.x API) is the cuDNN-backed,
    # faster implementation of LSTM; the code picks it automatically.
    if tf.test.is_gpu_available():
        print("GPU acceleration available. LSTM cells will run faster")
        return tf.keras.layers.CuDNNLSTM(dim,
                                         return_sequences=True,
                                         return_state=True)
    else:
        print("GPU acceleration not available. LSTM cells will be slow")
        return tf.keras.layers.LSTM(dim,
                                    return_sequences=True,
                                    return_state=True)

model = Sequential()

In [12]:
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Bidirectional
from tensorflow.keras.optimizers import Adam

class Seq2SeqChat(object):
    def __init__(self):
        # Nothing to initialise yet: the graph is built by the methods below
        pass

    # You might recognize the same placeholder as in our seq2seq model in week 4
    def declare_placeholders(self):
        """Specifies placeholders for the model."""

        # Placeholders for input and its actual lengths.
        self.input_batch = tf.placeholder(shape=(None, None), dtype=tf.int32, name='input_batch')
        self.input_batch_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='input_batch_lengths')

        # Placeholders for groundtruth and its actual lengths.
        self.ground_truth = tf.placeholder(shape=(None, None), dtype=tf.int32, name='ground_truth')
        self.ground_truth_lengths = tf.placeholder(shape=(None, ), dtype=tf.int32, name='ground_truth_lengths')

        self.dropout_ph = tf.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])
        self.learning_rate_ph = tf.placeholder_with_default(tf.cast(0.001, tf.float32), shape=[])

    def create_embeddings(self, vocab_size, embeddings_size):
        """Specifies embeddings layer and embeds an input batch."""

        random_initializer = tf.random_uniform((vocab_size, embeddings_size), -1.0, 1.0)
        self.embeddings = tf.Variable(random_initializer, dtype=tf.float32, name="embedding_matrix")

        # Perform embeddings lookup for self.input_batch.
        self.input_batch_embedded = tf.nn.embedding_lookup(self.embeddings,
                                                           self.input_batch,
                                                           name="batch_embedded")

    def build_encoder(self, hidden_size, input_size, output_size, dropout=0.0):
        """Builds a stack of three bi-LSTM layers as a Keras model."""
        # NOTE: this Keras model is not yet wired to the placeholders above
        model = Sequential()
        model.add(Bidirectional(
                            LSTM(hidden_size, activation='relu', return_sequences=True, dropout=dropout),
                            merge_mode='concat',
                            input_shape=(None, input_size)))
        model.add(Bidirectional(LSTM(output_size, activation='relu', return_sequences=True,
                                 dropout=dropout), merge_mode='sum'))
        model.add(Bidirectional(LSTM(output_size, activation='relu', return_sequences=True,
                                 dropout=dropout), merge_mode='sum'))
        model.compile(loss='mse', optimizer=Adam(lr=0.001, clipnorm=1), metrics=['mse'])
        self.encoder = model
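
The encoder stacks `Bidirectional` layers with `merge_mode='concat'` and then `merge_mode='sum'`. As a rough illustration of what these two modes do to the per-time-step outputs, here is a dependency-free sketch. It uses a toy scalar recurrence instead of a real LSTM, and shares the same weights in both directions for brevity (Keras keeps separate weights per direction). `run_rnn` and `bidirectional` are hypothetical names.

```python
import math

def run_rnn(xs, w=0.5, u=0.5):
    """Toy scalar recurrence h_t = tanh(w * x_t + u * h_{t-1}); returns all h_t."""
    h, outputs = 0.0, []
    for x in xs:
        h = math.tanh(w * x + u * h)
        outputs.append(h)
    return outputs

def bidirectional(xs, merge_mode):
    """Run the toy cell in both directions and merge outputs per time step."""
    forward = run_rnn(xs)
    backward = run_rnn(xs[::-1])[::-1]   # re-align backward outputs with time
    if merge_mode == "concat":
        return [(f, b) for f, b in zip(forward, backward)]
    if merge_mode == "sum":
        return [f + b for f, b in zip(forward, backward)]
    raise ValueError("unsupported merge_mode: {}".format(merge_mode))

xs = [1.0, -1.0, 0.5]
concat = bidirectional(xs, "concat")
summed = bidirectional(xs, "sum")
print(len(concat), len(concat[0]))   # 3 2
print(len(summed))                   # 3
```

With `concat` the feature size doubles at each time step, while `sum` keeps it unchanged, which is why the two choices lead to different output shapes for the following layer.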