# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings is optional. The starter code below includes commented-out Gensim code that trains Word2Vec embeddings on the NLTK Brown corpus, saves them, and loads them again. You can compare the performance of a model that uses these pretrained embeddings against one that learns its embeddings from scratch.



---



A sequence-to-sequence (Seq2Seq) model has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output layer.

The Seq2Seq model works by feeding an input sequence into the Encoder and passing the Encoder's final hidden and cell states to the Decoder, which uses them as its starting state to output a series of token predictions.
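
The hand-off between the two components can be sketched with bare PyTorch layers. This is a minimal illustration, not the model built later in this notebook: the toy sizes, the `SOS_IDX` value, and the greedy decoding loop are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy sizes, chosen only for illustration.
SRC_VOCAB, TRG_VOCAB, EMB_DIM, HID_DIM = 20, 15, 8, 16
SRC_LEN, MAX_STEPS, BATCH = 5, 4, 2
SOS_IDX = 1  # assumed index of the start-of-sequence token

# Encoder: embedding layer followed by an LSTM.
enc_embedding = nn.Embedding(SRC_VOCAB, EMB_DIM)
enc_lstm = nn.LSTM(EMB_DIM, HID_DIM)

# Decoder: embedding layer, LSTM, and a linear output layer.
dec_embedding = nn.Embedding(TRG_VOCAB, EMB_DIM)
dec_lstm = nn.LSTM(EMB_DIM, HID_DIM)
dec_out = nn.Linear(HID_DIM, TRG_VOCAB)

src = torch.randint(0, SRC_VOCAB, (SRC_LEN, BATCH))  # [src len, batch size]

with torch.no_grad():
    # Encode the whole source sequence and keep only the final states.
    _, (hidden, cell) = enc_lstm(enc_embedding(src))

    # Decode one token at a time, feeding each prediction back in.
    token = torch.full((BATCH,), SOS_IDX, dtype=torch.long)
    predictions = []
    for _ in range(MAX_STEPS):
        embedded = dec_embedding(token.unsqueeze(0))  # [1, batch size, emb dim]
        output, (hidden, cell) = dec_lstm(embedded, (hidden, cell))
        logits = dec_out(output.squeeze(0))           # [batch size, trg vocab]
        token = logits.argmax(dim=1)                  # greedy choice per example
        predictions.append(token)

predicted_tokens = torch.stack(predictions)  # [max steps, batch size]
print(predicted_tokens.shape)
```

The only information the Decoder receives about the input is the `(hidden, cell)` pair, which is why the Encoder and Decoder must agree on hidden size and number of layers.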

## Dependencies

- PyTorch
- torchtext
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first (this notebook uses SQuAD2). Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [None]:
!pip install torchdata==0.3.0

In [None]:
!pip install torchtext==0.9.0

In [1]:
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torchtext
from nltk.corpus import brown
from torchtext.datasets import SQuAD2
from torchtext.legacy.data import Field, Dataset, Example, BucketIterator
import torch.nn as nn
import torch.optim as optim
import src.assits as ast
import random
import time
import math

nltk.download('brown')
nltk.download('punkt')


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [3]:
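def getDict(dataPipe):
    """Collect each question and its first listed answer from a SQuAD2 data pipe."""
    data_dict = {
        'Question': [],
        'Answer': []
    }

    # Each item is a 4-tuple; only the question and the answer list are used
    for _, question, answers, _ in dataPipe:
        data_dict['Question'].append(question)
        data_dict['Answer'].append(answers[0])

    return data_dict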


def loadDF(path):
    # load data
    train_data, val_data = torchtext.datasets.SQuAD2(path)
    
    # convert dataPipe to dictionary 
    train_dict, val_dict = getDict(train_data), getDict(val_data)
    
    # convert Dictionaries to Pandas DataFrame
    train_df = pd.DataFrame(train_dict)    
    validation_df = pd.DataFrame(val_dict)    
    
    return train_df, validation_df




In [4]:
# Train, save, and load Brown embeddings (optional; uncomment to use)

#import gensim
#model = gensim.models.Word2Vec(brown.sents())
#model.save('brown.embedding')

#w2v = gensim.models.Word2Vec.load('brown.embedding')
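
Once pretrained vectors are available, they need to be lined up with the model's vocabulary before they can replace a randomly initialized embedding layer. The sketch below shows one way to do that with NumPy. The `itos` list and the `pretrained` dictionary are hypothetical stand-ins; with Gensim, the lookup would typically come from the trained model's `wv` attribute, and `emb_dim` would have to match the size of those vectors.

```python
import numpy as np

def build_embedding_matrix(itos, pretrained, emb_dim, seed=1234):
    """Return a [vocab size, emb dim] matrix whose rows follow the vocabulary order."""
    rng = np.random.default_rng(seed)
    # Start from small random vectors so tokens without a pretrained vector still get a row.
    matrix = rng.uniform(-0.08, 0.08, size=(len(itos), emb_dim)).astype(np.float32)
    found = 0
    for index, token in enumerate(itos):
        vector = pretrained.get(token)
        if vector is not None:
            matrix[index] = vector
            found += 1
    return matrix, found

# Hypothetical stand-ins: a tiny vocabulary and a dict playing the role of the word vectors.
itos = ["<unk>", "<pad>", "when", "did", "beyonce"]
pretrained = {
    "when": np.ones(4, dtype=np.float32),
    "did": np.full(4, 0.5, dtype=np.float32),
}

matrix, found = build_embedding_matrix(itos, pretrained, emb_dim=4)
print(matrix.shape, found)
```

A matrix built this way could then be loaded into an embedding layer, for example with `nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)`.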

In [5]:
train_df, vlad_df = loadDF('data')
train_df.head(10)

|   | Question | Answer |
|---|----------|--------|
| 0 | When did Beyonce start becoming popular? | in the late 1990s |
| 1 | What areas did Beyonce compete in when she was... | singing and dancing |
| 2 | When did Beyonce leave Destiny's Child and bec... | 2003 |
| 3 | In what city and state did Beyonce grow up? | Houston, Texas |
| 4 | In which decade did Beyonce become famous? | late 1990s |
| 5 | In what R&B group was she the lead singer? | Destiny's Child |
| 6 | What album made her a worldwide known artist? | Dangerously in Love |
| 7 | Who managed the Destiny's Child group? | Mathew Knowles |
| 8 | When did Beyoncé rise to fame? | late 1990s |
| 9 | What role did Beyoncé have in Destiny's Child? | lead singer |


In [6]:
train_df.info() , vlad_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130319 entries, 0 to 130318
Data columns (total 2 columns):
 #   Column    Non-Null Count   Dtype 
---  ------    --------------   ----- 
 0   Question  130319 non-null  object
 1   Answer    130319 non-null  object
dtypes: object(2)
memory usage: 2.0+ MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11873 entries, 0 to 11872
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Question  11873 non-null  object
 1   Answer    11873 non-null  object
dtypes: object(2)
memory usage: 185.6+ KB


(None, None)

In [7]:
train_df = train_df.iloc[:11000, :]

In [8]:
vlad_df = vlad_df.iloc[:5000, :]

In [9]:
SRC = Field(tokenize = ast.prepare_text, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

TRG = Field(tokenize = ast.prepare_text, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

# Define torchtext.legacy Example class for creating examples


In [10]:
class DataFrameDataset(Dataset):
    def __init__(self, examples, fields, filter_pred=None):
        
        self.examples = examples.apply(SQuADExample.fromSeries, args=(fields,), axis=1).tolist()
        if filter_pred is not None:
            # filter() returns a lazy iterator in Python 3; materialize it so len() and indexing work
            self.examples = list(filter(filter_pred, self.examples))
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]

In [11]:
class SQuADExample(Example):
        
    @classmethod
    def fromSeries(cls, data, fields):
        return cls.fromdict(data.to_dict(), fields)

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
             
        for key, field in fields.items():
            if key not in data:
                raise ValueError("Specified key {} was not found in "
                "the input data".format(key))
            if field is not None:
                setattr(ex, key, field.preprocess(data[key]))
            else:
                setattr(ex, key, data[key])
        return ex

In [12]:
fields = {'Question': SRC, 'Answer': TRG}

In [13]:
train_data = DataFrameDataset(train_df, fields)
valid_data = DataFrameDataset(vlad_df, fields)

In [14]:
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)
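
The `min_freq = 2` argument drops tokens that appear only once in the training data, so they are later mapped to the unknown token. The following is a rough pure-Python sketch of that idea. It is not torchtext's implementation; the `build_vocab` and `numericalize` helpers, the ordering rule, and the sample questions are assumptions made for illustration.

```python
from collections import Counter

SPECIALS = ("<unk>", "<pad>", "<sos>", "<eos>")

def build_vocab(token_lists, min_freq=2, specials=SPECIALS):
    """Build index-to-string and string-to-index maps, keeping tokens seen at least min_freq times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials)
    # Most frequent tokens first; ties broken alphabetically.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices, sending anything outside the vocabulary to <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

# Hypothetical tokenized questions.
questions = [
    ["when", "did", "beyonce", "start"],
    ["when", "did", "she", "leave"],
]

itos, stoi = build_vocab(questions, min_freq=2)
ids = numericalize(["when", "did", "beyonce", "sing"], stoi)
print(itos)
print(ids)
```

Only `when` and `did` occur twice, so every other word, whether seen once in training or never seen at all, ends up with the `<unk>` index.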

In [15]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")

Number of training examples: 11000
Number of validation examples: 5000


In [16]:
print(f"Unique tokens in source (Question) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (Answer) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (Question) vocabulary: 5511
Unique tokens in target (Answer) vocabulary: 2959


In [17]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [18]:
BATCH_SIZE = 128

train_iterator, valid_iterator = BucketIterator.splits(
    (train_data, valid_data), 
    batch_size = BATCH_SIZE,
    sort_key = lambda x: len(x.Question),
    sort_within_batch=True,
    device = device)
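
Sorting by `len(x.Question)` lets the iterator group questions of similar length, which reduces the padding needed to make each batch rectangular. The sketch below illustrates only that effect in pure Python. It is not how `BucketIterator` is implemented, and the sequence lengths are made up.

```python
import random

def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def padding_needed(batches):
    """Count the pad tokens required to bring every sequence up to its batch's longest length."""
    total = 0
    for batch in batches:
        longest = max(batch)
        total += sum(longest - length for length in batch)
    return total

random.seed(0)
lengths = [random.randint(3, 30) for _ in range(64)]  # hypothetical token counts per question

unsorted_pad = padding_needed(chunk(lengths, 8))
bucketed_pad = padding_needed(chunk(sorted(lengths), 8))
print(unsorted_pad, bucketed_pad)
```

Batching the length-sorted list never needs more padding than batching the original order, and it usually needs far less.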

In [19]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        
        #src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.lstm(embedded)
        
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #outputs are always from the top hidden layer
        
        return hidden, cell

In [20]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.lstm = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.softmax = nn.LogSoftmax(dim= 1)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.softmax(self.fc_out(output.squeeze(0)))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

In [21]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            #decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio
            
            #get the highest predicted token from our predictions
            top1 = output.argmax(1) 
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else top1
        
        return outputs
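
The effect of `teacher_forcing_ratio` can be checked in isolation. The sketch below is a simplified pure-Python stand-in for the decision made inside the loop above: the `choose_next_inputs` helper and the placeholder token strings are hypothetical, and it flips one coin per step rather than working on batched tensors.

```python
import random

def choose_next_inputs(targets, predictions, teacher_forcing_ratio, rng):
    """Pick the ground-truth token with probability teacher_forcing_ratio, else the model's prediction."""
    chosen = []
    forced = 0
    for truth, predicted in zip(targets, predictions):
        use_truth = rng.random() < teacher_forcing_ratio
        forced += int(use_truth)
        chosen.append(truth if use_truth else predicted)
    return chosen, forced

rng = random.Random(1234)
targets = ["truth"] * 10_000      # placeholder ground-truth tokens
predictions = ["model"] * 10_000  # placeholder model predictions

chosen, forced = choose_next_inputs(targets, predictions, 0.75, rng)
forced_fraction = forced / len(targets)
print(f"{forced_fraction:.2f}")
```

With a ratio of 0 the decoder always consumes its own predictions, which is what `evaluate` relies on later in the notebook; with a ratio of 1 it always sees the ground truth.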

In [22]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 3
ENC_DROPOUT = 0.0
DEC_DROPOUT = 0.0

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

In [23]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)
        
model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(5511, 256)
    (lstm): LSTM(256, 512, num_layers=3)
    (dropout): Dropout(p=0.0, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(2959, 256)
    (lstm): LSTM(256, 512, num_layers=3)
    (fc_out): Linear(in_features=512, out_features=2959, bias=True)
    (softmax): LogSoftmax(dim=1)
    (dropout): Dropout(p=0.0, inplace=False)
  )
)

In [29]:
optimizer = optim.Adam(model.parameters())  # uses Adam's default learning rate

TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]

criterion = nn.NLLLoss(ignore_index = TRG_PAD_IDX)
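
Because the decoder ends in `LogSoftmax`, the criterion is a negative log-likelihood loss, and `ignore_index` keeps padded positions from contributing to it. The arithmetic can be sketched in pure Python as follows. The `masked_nll` helper, the three-token vocabulary, and the `PAD_IDX` value are assumptions for illustration, not part of the notebook.

```python
import math

PAD_IDX = 1  # assumed padding index for this toy example

def masked_nll(log_probs, targets, ignore_index):
    """Average negative log-likelihood over the targets that are not ignore_index."""
    losses = [-row[target] for row, target in zip(log_probs, targets) if target != ignore_index]
    return sum(losses) / len(losses)

# Log-probabilities for three flattened positions over a three-token vocabulary.
log_probs = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
    [math.log(0.3), math.log(0.4), math.log(0.3)],
]
targets = [0, 2, PAD_IDX]  # the last position is padding and is skipped

loss = masked_nll(log_probs, targets, PAD_IDX)
print(round(loss, 4))
```

Only the first two positions count toward the average, so a batch with long padded tails is not rewarded or penalized for what the model predicts at padded positions.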

In [30]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.Question
        trg = batch.Answer
        
        optimizer.zero_grad()
        
        output = model(src, trg)
        
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)


In [31]:
def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):
            src = batch.Question
            trg = batch.Answer
            

            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [32]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [33]:
N_EPOCHS = 40
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f}')


Epoch: 01 | Time: 0m 16s
	Train Loss: 4.990
Epoch: 02 | Time: 0m 16s
	Train Loss: 4.611
Epoch: 03 | Time: 0m 16s
	Train Loss: 4.456
Epoch: 04 | Time: 0m 16s
	Train Loss: 4.289
Epoch: 05 | Time: 0m 16s
	Train Loss: 4.180
Epoch: 06 | Time: 0m 16s
	Train Loss: 4.054
Epoch: 07 | Time: 0m 16s
	Train Loss: 3.932
Epoch: 08 | Time: 0m 16s
	Train Loss: 3.826
Epoch: 09 | Time: 0m 16s
	Train Loss: 3.699
Epoch: 10 | Time: 0m 16s
	Train Loss: 3.581
Epoch: 11 | Time: 0m 16s
	Train Loss: 3.465
Epoch: 12 | Time: 0m 16s
	Train Loss: 3.343
Epoch: 13 | Time: 0m 16s
	Train Loss: 3.226
Epoch: 14 | Time: 0m 16s
	Train Loss: 3.096
Epoch: 15 | Time: 0m 16s
	Train Loss: 2.989
Epoch: 16 | Time: 0m 16s
	Train Loss: 2.873
Epoch: 17 | Time: 0m 16s
	Train Loss: 2.738
Epoch: 18 | Time: 0m 16s
	Train Loss: 2.603
Epoch: 19 | Time: 0m 16s
	Train Loss: 2.442
Epoch: 20 | Time: 0m 16s
	Train Loss: 2.336
Epoch: 21 | Time: 0m 16s
	Train Loss: 2.191
Epoch: 22 | Time: 0m 16s
	Train Loss: 2.038
Epoch: 23 | Time: 0m 16s
	Train 