# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
!pip install torchdata==0.3.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
[K     |████████████████████████████████| 47 kB 2.9 MB/s eta 0:00:011
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0


In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from torchtext import data
from torchtext import datasets
from tqdm import tqdm  
from torchtext.datasets import SQuAD1
import torch.utils.data as data
import random
from torch.utils.data import DataLoader

In [None]:
nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

In [3]:
from nltk.stem.porter import *
from nltk.stem import *
from nltk.tokenize import RegexpTokenizer


# Define vocabulary class
class Vocab:
    def __init__(self):
        self.word_to_idx = {'<sos>': 0, '<eos>': 1, '<unk>': 2, '<pad>': 3}
        self.vocab_size = 4
        self.idx_to_word = {0: '<sos>', 1: '<eos>', 2: '<unk>', 3: '<pad>'}
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.porter_stem = PorterStemmer()
        self.word_count = {}
        self.default_token = ('<unk>', 2)
              
    def __len__(self) -> int:
        return len(self.word_to_idx)
    
    def clean_text(self, sentence: str) -> list:
        
        assert type(sentence) == str, 'The word arg must be a string'
        
        return [self.porter_stem.stem(word) for word in self.tokenizer.tokenize(sentence)]

    def _load_word(self, word: str) -> None:
        
        assert type(word) == str, 'The word arg must be a string'
        
        if word not in self.word_to_idx:
            self.word_to_idx[word] = self.vocab_size
            self.idx_to_word[self.vocab_size] = word
            self.word_count[word] = 1
            self.vocab_size += 1
        else:
            self.word_count[word] += 1 
        
    def load(self, dataset) -> None:
        
        for question, answer in tqdm(dataset):

            self._load_sentence(question)

            self._load_sentence(answer)
            
    def _load_sentence(self, sentence: str) -> None:
        
        assert type(sentence) == str, 'The sentence arg must be a string'
        
        sentence = self.clean_text(sentence)
        
        for word in sentence:
            self._load_word(word)
    
    def filter_min_freq(self, min_freq: int = 0) -> None:
        
        word_count = {}
        word_to_idx = {'<sos>': 0, '<eos>': 1, '<unk>': 2, '<pad>': 3}
        idx_to_word = {0: '<sos>', 1: '<eos>', 2: '<unk>', 3: '<pad>'}
        vocab_size = 4
                
        words = [key for key, val in self.word_count.items() if val >= min_freq] 

        for word in words:
            word_to_idx[word] = vocab_size
            idx_to_word[vocab_size] = word
            vocab_size += 1
            
            word_count[word] = self.word_count[word]
        
        # Update state         
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_count = word_count

    def stoi(self, sentence: str) -> list:
        
        assert type(sentence) == str, 'The word arg must be a string'
        
        sentence = self.clean_text(sentence)
                    
        return [self.word_to_idx[word] if word in self.word_to_idx else self.word_to_idx[self.default_token[0]] for word in sentence ]

    def itos(self, sentence: list) -> list:
        
        assert type(sentence) == list, 'The word arg must be a list of ints'
                    
        return [self.idx_to_word[word_idx] if word_idx in self.idx_to_word else self.idx_to_word[self.default_token[1]] for word_idx in sentence ]
    
    
    def _batch_wrap_and_pad(self, sentence: list, max_length: int, reverse=False) -> torch.LongTensor:

        # Reverse order of the batched data to remove short memory connections
        if reverse is True:
            sentence.reverse()
            
        # define two functions: one for wrapping the sentence with the sos and eos tokens, 
        # and one for padding the sentence when necesary

        wrapper = lambda x: [self.word_to_idx['<sos>']] + x[:min(len(x), max_length - 2)] + [self.word_to_idx['<eos>']]
        padding = lambda x: x[:min(len(x), max_length )] + [self.word_to_idx['<pad>']] * max(max_length -  min(len(x), max_length + 2), 0)

        return padding(wrapper(sentence))

    def collate_fcn(self, batch):
        
        # Compute max lengths for the padding             
        max_length = 10
            
        total_lengths = []
        
        for (question, answer) in batch:
            
            # Convert to index
            question = self.stoi(question)
            answer = self.stoi(answer)
            
            total_lengths.append(len(answer))
            total_lengths.append(len(question))
                
        batch_max_length = min(max(total_lengths), max_length)

        src = torch.zeros(size=(len(batch), batch_max_length), dtype=torch.int64)
        trg = torch.zeros(size=(len(batch), batch_max_length), dtype=torch.int64)
        
        for idx_batch, (question, answer) in enumerate(batch):
            
            # Convert to index
            question = self.stoi(question)
            answer = self.stoi(answer)
            
            src[idx_batch, :] = torch.LongTensor(self._batch_wrap_and_pad(question, max_length=batch_max_length, reverse=True))
            trg[idx_batch, :] = torch.LongTensor(self._batch_wrap_and_pad(answer, max_length=batch_max_length))


        return (src, trg)
        



In [4]:
# Cell to test the vocab
def test_vocab(w2v=None, use_w2v=False):
    if use_w2v:
        pass
    else:
        vocab = Vocab()
        sentence = "this is a test sentence with some words. Some words are duplicated for testing"
        vocab._load_sentence(sentence)
                
        assert vocab.word_count['some'] == 2, "the word 'some' should appear 2 times"
        assert vocab.word_count['word'] == 2, "the word 'words' should appear 2 times"
        
        assert len(vocab.word_to_idx) == 15, "the total number of entries in word_to_idx must be 15"
        assert len(vocab.idx_to_word) == 15, "the total number of entries in idx_to_word must be 15"

        print("Pre-filtering\n")
        sentence_idx = vocab.stoi(sentence)
        print(sentence_idx)
        sentence = ' '.join(vocab.itos(sentence_idx))
        print(sentence)
        
        vocab.filter_min_freq(min_freq=0)
        assert len(vocab.word_to_idx) == 15, "the total number of entries in word_to_idx must be 16"

        vocab.filter_min_freq(min_freq=2)
        assert len(vocab.word_to_idx) == 7, "the total number of entries in word_to_idx must be 6"
        
        print("\nPost-filtering min_freq == 2\n")
        sentence_idx = vocab.stoi(sentence)
        print(sentence_idx)
        sentence = ' '.join(vocab.itos(sentence_idx))
        print(sentence)
            
            
        batch = [("question 1, really", "answ 1, nope nope nope nope nope v v nope nope"), ("question 2, GM", "answ 2, nope sure yes"), ("question 3, heila", "answ 3, nope")]
            
        (src, trg) = vocab.collate_fcn(batch)
        
        assert src.shape[0] == 3, "The src(x,) must have the same dimension of the batch"
        assert src.shape[1] == 10, "The src(,x) must have the same dimension of the maximum allowed sentence = 10"
        assert trg.shape[0] == 3, "The trg(x,) must have the same dimension of the batch"
        assert trg.shape[1] == 10, "The trg(,x) must have the same dimension of the maximum allowed sentence = 10"
            
        print("\nEverything seems correct!")

test_vocab()    

Pre-filtering

[4, 5, 6, 7, 8, 9, 10, 11, 10, 11, 12, 13, 14, 7]
thi is a test sentenc with some word some word are duplic for test

Post-filtering min_freq == 2

[2, 2, 2, 4, 2, 2, 5, 6, 5, 6, 2, 2, 2, 4]
<unk> <unk> <unk> test <unk> <unk> some word some word <unk> <unk> <unk> test

Everything seems correct!


In [6]:
def reduce_dataset(dataset_in) -> list:
    
    dataset_out = []
    
    print('Reduce dataset to consider only Q/A')
    for context, question, answers, idx_context in tqdm(dataset_in):

        # Select only the first answer         
        dataset_out.append((question, answers[0]))

    return dataset_out
    
    
def create_dataloader(batch_size: int, train_valid_ratio: float = 0.8) -> (Vocab, DataLoader):
    
    # Create dataloaders      
    dataloders = {'train': None, 'valid': None, 'test': None}
    
    train_dataset, test_dataset = SQuAD1()
      
    # Batching the testing dataset
    train_dataset = reduce_dataset(train_dataset)
    
    train_batch_available = len(train_dataset) // batch_size

    # Discard the entries not fitting with the batch_size     
    train_dataset = train_dataset[:train_batch_available * batch_size]
    
    # Split the train dataset into training and validation
    train_batch_num = int((len(train_dataset) / batch_size) * train_valid_ratio)
    train_num = train_batch_num * batch_size

    valid_num = int((len(train_dataset) - train_num))

    train_dataset, valid_dataset = data.random_split(train_dataset, [train_num, valid_num])
    
    # Batching the testing dataset     
  
    test_dataset = reduce_dataset(test_dataset)
    test_batch_available = len(test_dataset) // batch_size

    # Discard the entries not fitting with the batch_size     
    test_dataset = test_dataset[:test_batch_available * batch_size]
    
    # Create vocabulary
    vocab = Vocab()
    vocab.load(train_dataset)
    vocab.filter_min_freq(min_freq=10)
    print(f"Final vocabulary size: {len(vocab)}")
    
    # Populate the dataloaders

    dataloders['train'] = DataLoader(train_dataset, batch_size=batch_size, shuffle=False, collate_fn=vocab.collate_fcn) 
    dataloders['valid'] = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False, collate_fn=vocab.collate_fcn ) 
    dataloders['test'] = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, collate_fn=vocab.collate_fcn) 
    
    return vocab, dataloders

In [8]:
BATCH_SIZE = 128
EPOCH_NUM = 10

In [9]:
vocab, dl = create_dataloader(BATCH_SIZE, train_valid_ratio=0.8)

0it [00:00, ?it/s]

Reduce dataset to consider only Q/A


87599it [00:02, 29200.44it/s]
0it [00:00, ?it/s]

Reduce dataset to consider only Q/A


10570it [00:00, 20128.97it/s]
100%|██████████| 70016/70016 [00:34<00:00, 2012.37it/s]

Final vocabulary size: 6226





In [11]:
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size = 100, dropout=0.5):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size 
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers=2, dropout=dropout, batch_first=True)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        '''
        Inputs: src, the src vector
        Outputs: output, the encoder outputs
                 self.hidden, the hidden state
                 self.cell_state, the cell state
        '''
        
        embedded = self.dropout(self.embedding(src))

        output, (hidden, cell_state) = self.lstm(embedded)
            
        return output, hidden, cell_state
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size=100, dropout=0.5):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
                
        # self.embedding provides a vector representation of the target to our model
        
        self.embedding = nn.Embedding(output_size, self.embedding_size)

        # self.lstm, accepts the embeddings and outputs a hidden state
        
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers=2, dropout=dropout, batch_first=True)

        # self.ouput, predicts on the hidden state via a linear output layer
        self.fcl = nn.Linear(self.hidden_size, self.output_size)
        
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, trg, hidden, cell_state):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''        
        
        embedded = self.dropout(self.embedding(trg))

        output, (hidden, cell_state) = self.lstm(embedded, (hidden, cell_state))

        output = self.fcl(output.view(-1, self.hidden_size))

        return output, (hidden, cell_state)

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size, device):
        
        super(Seq2Seq, self).__init__()
        
        self.device = device
        self.encoder = Encoder(encoder_input_size, encoder_hidden_size).to(device)
        self.decoder = Decoder(decoder_hidden_size, decoder_output_size).to(device)
        
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
               
        # Encoder
        output, hidden, cell = self.encoder(src)
        
        # Define tensor to store decoder outputs
        decoder_outputs = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size).to(self.device)
        
        best_output = trg[:, 0]

        for trg_col in range(trg.shape[1]):

            # Teaching modification
            trg_input =  trg[:, trg_col] if random.random() < teacher_forcing_ratio else best_output
            trg_input = trg_input.unsqueeze(1)
                                        
            output, (hidden, cell) = self.decoder(trg_input, hidden, cell)

            best_output = output.argmax(1)
            
            decoder_outputs[:, trg_col] = output

                
        return decoder_outputs


In [14]:
def train(epoch_num, dl, model, optimiser, criterion):

    # Define epoch losses
    train_loss = 0
    valid_loss = 0
    
    # Gradient clipping
    clip = 1.0
    
    for epoch in range(1, epoch_num + 1):
        
        # Enable training
        model.train()
        
        # Training
        for src, trg in tqdm(dl['train']):
            
            optimiser.zero_grad()
        
            # Move to cuda if available             
            if torch.cuda.is_available():
                src, trg = src.cuda(), trg.cuda()
            
            # Get seq2seq output             
            output = model(src, trg)

            output_dim = output.shape[-1]
            
            output = output[:, 1:].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)
            
            # Compute loss     
            loss = criterion(output, trg)

            loss.backward()

            # Clip gradient     
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

            optimiser.step()

            train_loss += loss.item()
            
        # Disable training
        model.eval()
        
        # Validation              
        for src, trg in tqdm(dl['valid']):
        
            # Move to cuda if available             
            if torch.cuda.is_available():
                src, trg = src.cuda(), trg.cuda()
            
            # Get seq2seq output             
            output = model(src, trg)

            output_dim = output.shape[-1]

            output = output[:, 1:].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)

            # Compute loss     
            loss = criterion(output, trg)

            valid_loss += loss.item()


        size_train = len(dl['train'])
        size_valid = len(dl['valid'])
#         if epoch % 1:
        print(f"Epoch: {epoch}, Training loss: {train_loss / size_train}, Valid loss: {valid_loss / size_valid}")
            
            
def testing(epoch_numb, dl, vocab, model):
    
    # Enable training
    model.eval()
    
    # Define epoch loss
    epoch_loss = 0

    # Define optimiser     
    criterion = nn.CrossEntropyLoss()
    
    
    with torch.no_grad():

        for epoch in range(1, epoch_numb + 1):

            for src, trg in tqdm(dl['test']):

                src = batch_wrap_and_pad(src, vocab)
                trg = batch_wrap_and_pad(trg, vocab)

                if torch.cuda.is_available():
                    src, trg = src.to(device), trg.to(device)

                output = model(src, trg)

                output_dim = output.shape[-1]

                output = output[:, 1:].reshape(-1, output_dim)
                trg = trg[:, 1:].reshape(-1)

                # Compute loss     
                loss = criterion(output, trg)

                epoch_loss += loss.item()


            print(f'Epoch loss {epoch_loss}')
            


In [15]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [16]:
encoder_input_size = len(vocab)
encoder_hidden_size = 256

decoder_hidden_size = 256
decoder_output_size = len(vocab)

# Check cuda availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Define model
model = Seq2Seq(encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size, device)
model.apply(init_weights)

# Define optimiser and criterion
optimiser = torch.optim.SGD(model.parameters(), lr=0.7)
criterion = nn.CrossEntropyLoss(ignore_index = vocab.stoi('<pad>')[0])


train(EPOCH_NUM, dl, model, optimiser, criterion)


100%|██████████| 547/547 [01:57<00:00,  4.67it/s]
100%|██████████| 137/137 [00:21<00:00,  6.38it/s]
  0%|          | 0/547 [00:00<?, ?it/s]

Epoch: 1, Training loss: 2.5142471831720967, Valid loss: 2.2129025520199406


100%|██████████| 547/547 [01:57<00:00,  4.68it/s]
100%|██████████| 137/137 [00:21<00:00,  6.33it/s]
  0%|          | 0/547 [00:00<?, ?it/s]

Epoch: 2, Training loss: 4.703420105538377, Valid loss: 4.367922130292349


 58%|█████▊    | 317/547 [01:07<00:48,  4.73it/s]

KeyboardInterrupt: 

In [None]:
!nvidia-smi