# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the model's performance. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. The starter code below uses Gensim to train Word2Vec embeddings on the Brown corpus and load them. You can compare the performance of your model with pretrained embeddings against a model without them.
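One common way to use pretrained vectors is to copy them into the rows of the embedding layer's weight matrix, so that row `i` holds the vector for the word with index `i`. The sketch below shows only that alignment step in plain Python. Here `pretrained` is a hypothetical stand-in for a trained word-vector lookup, and `build_embedding_matrix` is an illustrative helper that is not part of the starter code.

```python
import random

def build_embedding_matrix(word_to_idx, pretrained, dim, seed=0):
    """Return one vector per vocabulary index.

    Words found in `pretrained` reuse their vector; all other words
    (including special tokens) get small random values.
    """
    rng = random.Random(seed)
    matrix = [None] * len(word_to_idx)
    for word, idx in word_to_idx.items():
        if word in pretrained:
            matrix[idx] = list(pretrained[word])
        else:
            matrix[idx] = [rng.uniform(-0.08, 0.08) for _ in range(dim)]
    return matrix

# Hypothetical toy data: a 4-word vocabulary and 3-dimensional vectors
word_to_idx = {'<unk>': 0, 'hello': 1, 'world': 2, 'zebra': 3}
pretrained = {'hello': [0.1, 0.2, 0.3], 'world': [0.4, 0.5, 0.6]}

matrix = build_embedding_matrix(word_to_idx, pretrained, dim=3)
```

In PyTorch, a matrix like this could be converted to a tensor and passed to `nn.Embedding.from_pretrained`, which freezes the weights by default.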



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output layer.

The Seq2Seq model works by feeding an input sequence into the Encoder and passing the Encoder's final hidden and cell states to the Decoder, which uses them to output a series of token predictions, one token per step.
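The control flow described above can be sketched without any neural network. In this toy version, `encode`, `decode_step`, and `generate` are hypothetical names, the "state" is just a tuple of tokens, and the decoder simply upper-cases what the encoder stored. It is meant only to show how the state moves from encoder to decoder and how decoding stops, not how an LSTM computes.

```python
def encode(tokens):
    """Toy encoder: fold the input tokens into a single state object."""
    state = ()
    for token in tokens:
        # A real LSTM would update its hidden and cell tensors here
        state = state + (token,)
    return state

def decode_step(prev_token, state):
    """Toy decoder step: return the next token and the updated state.

    A real decoder would also embed `prev_token`; this toy ignores it.
    """
    if not state:
        return '<eos>', state
    return state[0].upper(), state[1:]

def generate(tokens, max_len=10):
    state = encode(tokens)            # 1. encode the whole input
    prev, output = '<sos>', []        # 2. decoding starts from <sos>
    for _ in range(max_len):
        prev, state = decode_step(prev, state)  # 3. one token per step
        if prev == '<eos>':           # 4. stop at <eos> or at max_len
            break
        output.append(prev)
    return output

print(generate(['how', 'are', 'you']))  # ['HOW', 'ARE', 'YOU']
```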

## Dependencies

- PyTorch
- Torchtext (with torchdata)
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim
- tqdm


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
!pip install torchdata==0.3.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0


In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from tqdm import tqdm
from torchtext.datasets import SQuAD1
# `data` refers to torch.utils.data (used below for random_split)
import torch.utils.data as data
import random
from torch.utils.data import DataLoader

In [2]:
nltk.download('brown')
nltk.download('punkt')

# Train, save, and load Word2Vec embeddings on the Brown corpus

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
from nltk.tokenize import RegexpTokenizer


# Define vocabulary class
class Vocab:
    def __init__(self):
        self.word_to_idx = {'<sos>': 0, '<eos>': 1, '<unk>': 2, '<pad>': 3}
        self.vocab_size = 4
        self.idx_to_word = {0: '<sos>', 1: '<eos>', 2: '<unk>', 3: '<pad>'}
        self.tokenizer = RegexpTokenizer(r'\w+')
        self.word_count = {}
  
    def load_word(self, word):
        
        if type(word) != str:
            raise ValueError('The word arg must be a string')
        
        # Lower case words 
        word = word.lower()
        
        if word not in self.word_to_idx:
            self.word_to_idx[word] = self.vocab_size
            self.idx_to_word[self.vocab_size] = word
            self.word_count[word] = 1
            self.vocab_size += 1
        else:
            self.word_count[word] += 1 
        
    def load_sentence(self, sent):
        
        if type(sent) != str and type(sent) != list:
            raise ValueError('The sent arg must be either a string or a list of strings.')
        
        if type(sent) == str:
            sent = self.tokenizer.tokenize(sent)
                    
        for word in sent:
            self.load_word(word)
            
    def __len__(self):
        return len(self.word_to_idx)
    
    def convert_sentence(self, sent):
        
        if type(sent) != str and type(sent) != list:
            raise ValueError('The sent arg must be either a string or a list of strings.')
        
        if type(sent) == str:
            sent = sent.lower()
            sent = self.tokenizer.tokenize(sent)
                    
        sent_idx = [self.word_to_idx[word] if word in self.word_to_idx else self.word_to_idx['<unk>'] for word in sent ]
        
        return sent_idx
    
    def filter_by_freq(self, min_freq=0):
        
        word_count = {}
        word_to_idx = {'<sos>': 0, '<eos>': 1, '<unk>': 2, '<pad>': 3}
        idx_to_word = {0: '<sos>', 1: '<eos>', 2: '<unk>', 3: '<pad>'}
        vocab_size = 4
                
        words = [key for key, val in self.word_count.items() if val >= min_freq] 

        for word in words:
            word_to_idx[word] = vocab_size
            idx_to_word[vocab_size] = word
            vocab_size += 1
            
            word_count[word] = self.word_count[word]
        
        # Update state         
        self.word_to_idx = word_to_idx
        self.idx_to_word = idx_to_word
        self.word_count = word_count
        self.vocab_size = vocab_size



In [4]:
# Cell to test the vocab
def test_vocab(w2v=None, use_w2v=False):
    if use_w2v:
        pass
    else:
        vocab = Vocab()
        sentence = "this is a test sentence with some words. Some words are duplicated for testing"
        vocab.load_sentence(sentence)
                
        assert vocab.word_count['some'] == 2, "the word 'some' should appear 2 times"
        assert vocab.word_count['words'] == 2, "the word 'words' should appear 2 times"
        
        assert len(vocab.word_to_idx) == 16, "the total number of entries in word_to_idx must be 16"
        assert len(vocab.idx_to_word) == 16, "the total number of entries in idx_to_word must be 16"
        
        vocab.filter_by_freq(min_freq=0)
        assert len(vocab.word_to_idx) == 16, "the total number of entries in word_to_idx must be 16"

        vocab.filter_by_freq(min_freq=2)
        assert len(vocab.word_to_idx) == 6, "the total number of entries in word_to_idx must be 6"
            
        print("Everything seems correct!")
test_vocab()
    

Everything seems correct!


In [5]:
# Prepare loaders

def load_vocabulary(train_dataset) -> Vocab:
    vocab = Vocab()
    
    print('Loading vocabulary from the train dataset')
    
    for context, question, answers, idx_context in tqdm(train_dataset):

        vocab.load_sentence(question)

        for answer in answers:
            vocab.load_sentence(answer)
    
    print(f"Original size: {len(vocab)}")
    # min_freq is an absolute count, so a value this small keeps every word
    vocab.filter_by_freq(min_freq=0.0001)
    print(f"After filtering size: {len(vocab)}")
    
    return vocab
    

def reduce_dataset(dataset_in) -> list:
    
    dataset_out = []
    
    print('Reduce dataset to consider only Q/A')
    for context, question, answers, idx_context in tqdm(dataset_in):

        # Select only the first answer         
        dataset_out.append((question, answers[0]))

    return dataset_out
    
    
def create_dataloader(batch_size: int, train_valid_ratio: float = 0.8) -> tuple:
    
    # Create dataloaders      
    dataloders = {'train': None, 'valid': None, 'test': None}
    
    train_dataset, test_dataset = SQuAD1()

    # load vocabulary
    vocab = load_vocabulary(train_dataset)
      
    # Batching the training dataset
    train_dataset = reduce_dataset(train_dataset)
    
    train_batch_available = len(train_dataset) // batch_size

    # Discard the entries not fitting with the batch_size     
    train_dataset = train_dataset[:train_batch_available * batch_size]
    
    # Split the train dataset into training and validation
    train_batch_num = int((len(train_dataset) / batch_size) * train_valid_ratio)
    train_num = train_batch_num * batch_size

    valid_num = int((len(train_dataset) - train_num))

    train_dataset, valid_dataset = data.random_split(train_dataset, [train_num, valid_num])
    
    # Batching the testing dataset     
  
    test_dataset = reduce_dataset(test_dataset)
    test_batch_available = len(test_dataset) // batch_size

    # Discard the entries not fitting with the batch_size     
    test_dataset = test_dataset[:test_batch_available * batch_size]
    
    # Populate the dataloaders

    dataloders['train'] = DataLoader(train_dataset, batch_size=batch_size, shuffle=False) 
    dataloders['valid'] = DataLoader(valid_dataset, batch_size=batch_size, shuffle=False) 
    dataloders['test'] = DataLoader(test_dataset, batch_size=batch_size, shuffle=False) 

    
    return vocab, dataloders

In [6]:
BATCH_SIZE = 128
EPOCH_NUM = 10

In [7]:
vocab, dl = create_dataloader(BATCH_SIZE, train_valid_ratio=0.8)

0it [00:00, ?it/s]

Loading vocabulary from the train dataset


87599it [00:08, 10567.09it/s]
0it [00:00, ?it/s]

Original size: 52602
After filtering size: 52602
Reduce dataset to consider only Q/A


87599it [00:03, 25056.94it/s]
0it [00:00, ?it/s]

Reduce dataset to consider only Q/A


10570it [00:00, 13500.92it/s]
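The batch counts can be checked by hand from the sizes reported above. This sketch repeats the arithmetic of `create_dataloader` for the 87,599 training examples with `BATCH_SIZE = 128` and a 0.8 split. The helper name `split_sizes` is illustrative and does not appear in the notebook.

```python
def split_sizes(num_examples, batch_size, train_valid_ratio):
    """Return (train_examples, valid_examples) after dropping the partial batch."""
    full_batches = num_examples // batch_size
    usable = full_batches * batch_size
    train_batches = int(full_batches * train_valid_ratio)
    train_num = train_batches * batch_size
    return train_num, usable - train_num

train_num, valid_num = split_sizes(87599, 128, 0.8)
print(train_num // 128, valid_num // 128)  # 547 137
```

The 547 training batches agree with the total shown by the progress bar during training.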


In [8]:
def loadDF(path):
    '''

    You will use this function to load the dataset into a Pandas Dataframe for processing.

    '''

    df = pd.DataFrame()
    
    return df


def prepare_text(sentence):
    
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''
    
    tokens = nltk.tokenize.word_tokenize(sentence)
    
    return tokens



def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset

In [25]:
class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size = 100, dropout=0.5):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size 
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers=2, dropout=dropout, batch_first=True)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        '''
        Inputs: src, the src vector
        Outputs: output, the encoder outputs
                 self.hidden, the hidden state
                 self.cell_state, the cell state
        '''
        
        embedded = self.dropout(self.embedding(src))

        output, (hidden, cell_state) = self.lstm(embedded)
            
        return output, hidden, cell_state
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size=100, dropout=0.5):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
                
        # self.embedding provides a vector representation of the target to our model
        
        self.embedding = nn.Embedding(output_size, self.embedding_size)

        # self.lstm, accepts the embeddings and outputs a hidden state
        
        self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers=2, dropout=dropout, batch_first=True)

        # self.fcl, predicts on the LSTM output via a linear output layer
        self.fcl = nn.Linear(self.hidden_size, self.output_size)
        
        self.activation = nn.ReLU()
        self.dropout = nn.Dropout(dropout)
        
        
    def forward(self, trg, hidden, cell_state):
        
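        '''
        Inputs: trg, the target token indices for the current step
                hidden, the hidden state
                cell_state, the cell state
        Outputs: output, the prediction scores over the vocabulary
                 (hidden, cell_state), the updated LSTM states
        '''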
        
        embedded = self.dropout(self.embedding(trg))

        output, (hidden, cell_state) = self.lstm(embedded, (hidden, cell_state))

        output = self.fcl(output.view(-1, self.hidden_size))

        return output, (hidden, cell_state)

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size, device):
        
        super(Seq2Seq, self).__init__()
        
        self.device = device
        self.encoder = Encoder(encoder_input_size, encoder_hidden_size).to(device)
        self.decoder = Decoder(decoder_hidden_size, decoder_output_size).to(device)
        
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      
               
        # Encoder
        output, hidden, cell = self.encoder(src)
        
        # Define tensor to store decoder outputs
        decoder_outputs = torch.zeros(trg.shape[0], trg.shape[1], self.decoder.output_size).to(self.device)
        
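        # The first decoder input is the <sos> token
        trg_input = trg[:, 0]

        # Position 0 (<sos>) is never predicted, so decoding starts at 1
        for trg_col in range(1, trg.shape[1]):

            output, (hidden, cell) = self.decoder(trg_input.unsqueeze(1), hidden, cell)

            # Store the scores that predict the token at position trg_col
            decoder_outputs[:, trg_col] = output

            best_output = output.argmax(1)

            # Teacher forcing: feed the ground-truth token, otherwise the model's own prediction
            trg_input = trg[:, trg_col] if random.random() < teacher_forcing_ratio else best_output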

                
        return decoder_outputs
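
Teacher forcing decides, step by step, whether the decoder is fed the ground-truth previous token or its own previous prediction. The plain-Python sketch below isolates that choice. `decoder_inputs` is a hypothetical helper, and `predicted` is assumed to be aligned with `target` so that `predicted[t]` is the model's guess for position `t`.

```python
import random

def decoder_inputs(target, predicted, teacher_forcing_ratio, rng):
    """Return the token fed to the decoder at each step t = 1 .. len(target) - 1.

    target[t - 1] is the ground truth for the previous position and
    predicted[t - 1] is what the model produced there.
    """
    inputs = [target[0]]  # step 1 always starts from <sos>
    for t in range(2, len(target)):
        forced = rng.random() < teacher_forcing_ratio
        inputs.append(target[t - 1] if forced else predicted[t - 1])
    return inputs

rng = random.Random(0)
target = ['<sos>', 'a', 'b', 'c', '<eos>']
predicted = [None, 'x', 'y', 'z', '<eos>']

print(decoder_inputs(target, predicted, 1.0, rng))  # ['<sos>', 'a', 'b', 'c']
print(decoder_inputs(target, predicted, 0.0, rng))  # ['<sos>', 'x', 'y', 'z']
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0.0 always feeds the model's own output, which is the situation the chatbot faces at inference time.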


In [56]:
def batch_wrap_and_pad(batch_data, vocab: Vocab, max_length: int = 10, reverse=False) -> torch.LongTensor:

    # Convert to index
    batch_data = [vocab.convert_sentence(sentence) for sentence in batch_data]
    
    # Reverse the word order of each sentence; reversing the source puts its first words close to the first target words
    if reverse is True:
        for sentence in batch_data:
            sentence.reverse()
            
    # Compute the padded length: longest sentence plus <sos>/<eos>, capped at max_length
    batch_max_length = min(max(len(sentence) for sentence in batch_data) + 2, max_length)

    sos, eos, pad = (vocab.word_to_idx[token] for token in ('<sos>', '<eos>', '<pad>'))

    # Wrap a sentence with <sos>/<eos>, truncating it so the result fits in batch_max_length
    def wrapper(x):
        return [sos] + x[:batch_max_length - 2] + [eos]

    # Right-pad a wrapped sentence with <pad> up to batch_max_length
    def padding(x):
        return x + [pad] * (batch_max_length - len(x))

    return torch.LongTensor([padding(wrapper(sentence)) for sentence in batch_data])

def convert_to_text(data, vocab: Vocab) -> str:

    text = ''
    for element in data:
        text += vocab.idx_to_word[int(element)] + " "
    
    return text 


In [57]:
# Test wrapping and padding
def test_wrap_and_pad(vocab):
    batch_data = ['this is a sentence', 'this is a very very very long sentence', 'this is a very very super uber mega ultra useless long sentence']
    
    batch_wp = batch_wrap_and_pad(batch_data, vocab)
        
    assert batch_wp[0][0] == 0 and batch_wp[1][0] == 0, "No <sos> token" 
    assert batch_wp[0][5] == 1, "No <eos> token"
    assert batch_wp[1][9] == 1, "No <eos> token"
    assert batch_wp[0][6] == 3 and batch_wp[0][7] == 3 and batch_wp[0][8] == 3 and batch_wp[0][9] == 3, "No <pad> token"
    
    batch_wp_rev = batch_wrap_and_pad(batch_data, vocab, reverse=True)
    
    assert all(torch.eq(batch_wp[0][1:5], batch_wp_rev[0][1:5].flip(dims=(0,)))), "The first sentence must be equal but reversed"
    assert all(torch.eq(batch_wp[1][1:9], batch_wp_rev[1][1:9].flip(dims=(0,)))), "The second sentence must be equal but reversed"
    
    print("Everything seems ok!")
    
test_wrap_and_pad(vocab)


Everything seems ok!


In [35]:
def train(epoch_num, dl, vocab, model):

    # Define epoch losses
    train_loss = 0
    valid_loss = 0

    # Define optimiser     
    optimizer = torch.optim.Adam(model.parameters())
    criterion = nn.CrossEntropyLoss(ignore_index = vocab.word_to_idx['<pad>'])
    
    # Gradient clipping
    clip = 1.0
    
    for epoch in range(1, epoch_num + 1):

        # Reset the running losses so each epoch is reported separately
        train_loss = 0
        valid_loss = 0

        # Enable training
        model.train()
        
        # Training
        for src, trg in tqdm(dl['train']):
            
            optimizer.zero_grad()

            # Wrap and pad the batch             
            src = batch_wrap_and_pad(src, vocab, reverse=True)
            trg = batch_wrap_and_pad(trg, vocab)
        
            # Move to cuda if available             
            if torch.cuda.is_available():
                src, trg = src.cuda(), trg.cuda()
            
            # Get seq2seq output             
            output = model(src, trg)

            output_dim = output.shape[-1]
            
            output = output[:, 1:].reshape(-1, output_dim)
            trg = trg[:, 1:].reshape(-1)
            
            # Compute loss     
            loss = criterion(output, trg)

            loss.backward()

            # Clip gradient     
            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

            optimizer.step()

            train_loss += loss.item()
            
        # Switch to evaluation mode (disables dropout)
        model.eval()

        # Validation: no gradients are needed
        with torch.no_grad():
            for src, trg in tqdm(dl['valid']):

                # Wrap and pad the batch
                src = batch_wrap_and_pad(src, vocab, reverse=True)
                trg = batch_wrap_and_pad(trg, vocab)

                # Move to cuda if available
                if torch.cuda.is_available():
                    src, trg = src.cuda(), trg.cuda()

                # Get seq2seq output
                output = model(src, trg)

                output_dim = output.shape[-1]

                output = output[:, 1:].reshape(-1, output_dim)
                trg = trg[:, 1:].reshape(-1)

                # Compute loss
                loss = criterion(output, trg)

                valid_loss += loss.item()

        print(f'Epoch: {epoch}, Training loss: {train_loss}, Valid loss: {valid_loss}')
            
            
def testing(epoch_numb, dl, vocab, model):

    # Switch to evaluation mode (disables dropout)
    model.eval()

    # Define loss function; padding is ignored, as in training
    criterion = nn.CrossEntropyLoss(ignore_index=vocab.word_to_idx['<pad>'])

    with torch.no_grad():

        for epoch in range(1, epoch_numb + 1):

            # Reset the running loss for each pass over the test set
            epoch_loss = 0

            for src, trg in tqdm(dl['test']):

                # Reverse the source, as done during training
                src = batch_wrap_and_pad(src, vocab, reverse=True)
                trg = batch_wrap_and_pad(trg, vocab)

                # Move the batch to the same device as the model
                src, trg = src.to(model.device), trg.to(model.device)

                output = model(src, trg)

                output_dim = output.shape[-1]

                output = output[:, 1:].reshape(-1, output_dim)
                trg = trg[:, 1:].reshape(-1)

                # Compute loss
                loss = criterion(output, trg)

                epoch_loss += loss.item()

            print(f'Epoch loss {epoch_loss}')
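
The training loop passes `ignore_index` to `nn.CrossEntropyLoss` so that `<pad>` positions do not contribute to the loss. The effect can be sketched in plain Python. `masked_cross_entropy` is an illustrative helper, not PyTorch's implementation, and returning 0.0 when every target is ignored is a choice made for this sketch only.

```python
import math

def masked_cross_entropy(logits, targets, ignore_index):
    """Mean negative log-likelihood over targets that are not ignore_index."""
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing
        # log(sum(exp(row))) computed stably by subtracting the maximum
        peak = max(row)
        log_sum = peak + math.log(sum(math.exp(v - peak) for v in row))
        total += log_sum - row[target]
        count += 1
    return total / count if count else 0.0

PAD = 3  # index of <pad> in the Vocab class
logits = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 5.0]]
targets = [0, 1, PAD]

loss = masked_cross_entropy(logits, targets, ignore_index=PAD)
```

Only the first two rows are averaged here; the third row is skipped because its target is `<pad>`, however confident the scores are.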
            


In [29]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

In [58]:
encoder_input_size = len(vocab)
encoder_hidden_size = 512

decoder_hidden_size = 512
decoder_output_size = len(vocab)

# Check cuda availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = Seq2Seq(encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size, device)
model.apply(init_weights)

train(EPOCH_NUM, dl, vocab, model)






  0%|          | 0/547 [00:00<?, ?it/s]

What is the European equivalent of "Qi"?
<sos> qi of equivalent european the is what <eos> <pad> 
<sos> pneuma <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

  0%|          | 1/547 [00:30<4:38:00, 30.55s/it]

How is torque determined?
<sos> determined torque is how <eos> <pad> <pad> <pad> <pad> 
<sos> from the vector product of the interacting fields <eos> 

  0%|          | 2/547 [00:52<4:13:48, 27.94s/it]

Who was criticized for not handling the Forsyth case independently?
<sos> independently case forsyth the handling not for criticized <eos> 
<sos> the head master <eos> <pad> <pad> <pad> <pad> <pad> 

  1%|          | 3/547 [01:12<3:50:59, 25.48s/it]

The first private company to offer Hyderabad internet service began offering it in what year?
<sos> year what in it offering began service internet <eos> 
<sos> 1998 <eos> <pad> <pad> <pad> <pad> <pad> <pad> <pad> 

  1%|          | 4/547 [01:31<3:33:23, 23.58s/it]

Who benefits from political corruption?
<sos> corruption political from benefits who <eos> <pad> <pad> <pad> 
<sos> government officials <eos> <pad> <pad> <pad> <pad> <pad> <pad> 



KeyboardInterrupt



In [None]:
!nvidia-smi