# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. The starter code below uses Gensim to train Word2Vec embeddings on the Brown corpus, so you can compare the performance of a model that uses these embeddings against a model that does not.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden and cell states to the Decoder. The Decoder then uses those states to output a series of token predictions, one token at a time.
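The hand-off between the two components can be sketched without any deep-learning library. In the toy below, `toy_encode` and `toy_decode_step` are hypothetical stand-ins for the LSTM-based Encoder and Decoder (they are not part of the project code, and nothing is learned). The sketch only shows the control flow: the encoder runs once over the whole input, and the decoder is then called once per output token, carrying a state forward.

```python
# Toy illustration of the Seq2Seq control flow; no learning happens here.
SOS, EOS = "<sos>", "<eos>"


def toy_encode(tokens):
    """Fold the whole input into one 'state' (here: the tokens still to emit)."""
    return list(tokens)


def toy_decode_step(prev_token, state):
    """Return (next token, new state).

    prev_token is ignored by this toy; a real decoder would embed it
    and feed it to the LSTM together with the state.
    """
    if not state:
        return EOS, state
    return state[0].upper(), state[1:]


def toy_seq2seq(tokens, max_len=10):
    state = toy_encode(tokens)  # the encoder runs once over the input
    token, output = SOS, []
    for _ in range(max_len):  # the decoder runs once per output token
        token, state = toy_decode_step(token, state)
        if token == EOS:
            break
        output.append(token)
    return output


result = toy_seq2seq(["a", "man", "walks"])
print(result)
```

The `max_len` cap plays the same role as a length limit at inference time: it stops generation if the decoder never emits `<eos>`.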

## Dependencies

- PyTorch
- torchtext (the `torchtext.legacy` API)
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim
- spaCy


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. The code below uses the Multi30k German-English dataset. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





### Libraries

In [28]:
# Black code formatter (Optional)
%load_ext lab_black

The lab_black extension is already loaded. To reload it, use:
  %reload_ext lab_black


In [29]:
import pandas as pd
import numpy as np
import gzip
from typing import List
import random
import time
import math


import torch
import torch.nn as nn
import torch.optim as optim

import torchtext

# torchtext.legacy is only available in torchtext 0.9 to 0.11
from torchtext.legacy.data import Field, BucketIterator
from torchtext.legacy.datasets import Multi30k

import gensim
import spacy

import nltk
from nltk.corpus import brown

nltk.download("brown")  # Brown corpus, used below to train the Word2Vec embeddings

[nltk_data] Downloading package brown to /Users/ipinmi/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [30]:
SEED = 47  # for reproducibility

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Data Preprocessing 

In [31]:
# Train Word2Vec embeddings on the Brown corpus, then save and reload them

model = gensim.models.Word2Vec(brown.sents())
model.save("brown.embedding")

w2v = gensim.models.Word2Vec.load("brown.embedding")

In [32]:
# Tokenization using spaCy
spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")


def tokenize_de(text):
    """Tokenize German text and reverse the token order."""
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]


def tokenize_en(text):
    """Tokenize English text, keeping the original token order."""
    return [tok.text for tok in spacy_en.tokenizer(text)]
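Note that `tokenize_de` reverses the source tokens, which is why printed source sentences read backwards. Reversing the source is a trick reported to help in early Seq2Seq work; whether it helps here is something to verify by experiment. A minimal sketch of the effect, assuming plain whitespace splitting as a stand-in for the spaCy tokenizer (`tokenize_reversed` is a hypothetical helper, not part of the notebook):

```python
def tokenize_reversed(text):
    """Whitespace-tokenize and reverse; str.split stands in for the spaCy tokenizer."""
    return text.split()[::-1]


def tokenize_forward(text):
    """Whitespace-tokenize, keeping the original order."""
    return text.split()


sentence = "du weißt , dass ich aussehe wie justin bieber ."
src_tokens = tokenize_reversed(sentence)
print(" ".join(src_tokens))

# Reversing twice recovers the original order.
assert src_tokens[::-1] == tokenize_forward(sentence)
```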

In [33]:
# define English preprocessing pipeline (after tokenization)
def en_prepareText(tokens):
    STOP_WORDS = spacy_en.Defaults.stop_words

    # remove stopwords
    tokens = [token for token in tokens if token not in STOP_WORDS]

    # lemmatize the tokens
    doc = spacy_en(" ".join(tokens))
    tokens = [token.lemma_ for token in doc]
    return tokens


# define German preprocessing pipeline (after tokenization)
def ger_prepareText(tokens):
    GER_STOP_WORDS = spacy_de.Defaults.stop_words

    # remove stopwords
    tokens = [token for token in tokens if token not in GER_STOP_WORDS]

    # lemmatize the tokens
    doc = spacy_de(" ".join(tokens))
    tokens = [token.lemma_ for token in doc]

    return tokens

In [34]:
SRC = Field(
    tokenize=tokenize_de,
    init_token="<sos>",
    eos_token="<eos>",
    pad_token="<pad>",
    unk_token="<unk>",
    lower=True,
    # preprocessing=ger_prepareText,
)
TRG = Field(
    tokenize=tokenize_en,
    init_token="<sos>",
    eos_token="<eos>",
    pad_token="<pad>",
    unk_token="<unk>",
    lower=True,
    # preprocessing=en_prepareText,
)


def loadDF(SRC, TRG):
    """

    Load the Multi30k train, validation, and test splits.

    Args:
        SRC: the Field used to process the German (source) sentences
        TRG: the Field used to process the English (target) sentences

    Returns:
        train_data, valid_data, test_data: the three dataset splits
    """

    train_data, valid_data, test_data = Multi30k.splits(
        exts=(".de", ".en"), fields=(SRC, TRG)
    )

    return train_data, valid_data, test_data

In [35]:
def buildVocab(SRC, TRG, train_dataset):
    """
    Input: SRC, the Field for the German (source) texts
            TRG, the Field for the English (target) texts
            train_dataset, the training split used to build both vocabularies

    Output: SRC and TRG vocabularies

    """

    # Build the vocabulary for the source and target languages
    SRC.build_vocab(train_dataset, min_freq=2)
    TRG.build_vocab(train_dataset, min_freq=2)

    # Print the number of unique tokens in the source and target vocabularies
    print("Source vocabulary size:", len(SRC.vocab))
    print("Target vocabulary size:", len(TRG.vocab))

    # Print the 10 most common tokens in the source vocabulary
    print(SRC.vocab.freqs.most_common(10))

    # Print the 10 most common tokens in the target vocabulary
    print(TRG.vocab.freqs.most_common(10))

    return SRC.vocab, TRG.vocab
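To make the effect of `min_freq=2` concrete, here is a minimal pure-Python sketch of vocabulary building. It is an illustration only, not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the ordering of the special tokens is an assumption of this sketch. Tokens seen fewer than `min_freq` times are left out and later map to `<unk>`.

```python
from collections import Counter

SPECIALS = ["<unk>", "<pad>", "<sos>", "<eos>"]  # assumed ordering for this sketch


def build_vocab(sentences, min_freq=2):
    """Map tokens to ids: specials first, then tokens seen at least min_freq times."""
    freqs = Counter(tok for sent in sentences for tok in sent)
    # Sort by descending frequency, breaking ties alphabetically for determinism.
    kept = sorted(
        (tok for tok, count in freqs.items() if count >= min_freq),
        key=lambda tok: (-freqs[tok], tok),
    )
    itos = SPECIALS + kept
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos


def numericalize(tokens, stoi):
    """Replace each token with its id, falling back to <unk> for unseen tokens."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


corpus = [["a", "man", "walks"], ["a", "man", "runs"], ["a", "dog", "runs"]]
stoi, itos = build_vocab(corpus, min_freq=2)
ids = numericalize(["a", "dog", "walks"], stoi)
print(itos)
print(ids)
```

In this toy corpus, "dog" and "walks" each appear once, so both fall below the threshold and are encoded as `<unk>`.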

In [36]:
def split_into_batches(dataset, BATCH_SIZE):
    """
    Create batch iterators for the train, validation, and test splits.
    The BucketIterator ensures that sentences of similar length are batched together.

    Input: dataset (Tuple), the (train, valid, test) splits to batch
            BATCH_SIZE, the size of each batch

    Output: train, validation, and test iterators; each batch they yield
            has a src and trg attribute
    """
    train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
        dataset, batch_size=BATCH_SIZE, device=device
    )

    return train_iterator, valid_iterator, test_iterator
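The benefit of batching sentences of similar length is that less padding is needed. The sketch below illustrates the idea in pure Python; `make_batches` and `count_pads` are hypothetical helpers, and the sketch sorts the whole dataset and omits the shuffling a real iterator performs.

```python
def make_batches(sentences, batch_size, bucket=True, pad="<pad>"):
    """Cut sentences into batches, padding each batch to its longest sentence."""
    ordered = sorted(sentences, key=len) if bucket else list(sentences)
    batches = []
    for start in range(0, len(ordered), batch_size):
        chunk = ordered[start : start + batch_size]
        width = max(len(sent) for sent in chunk)
        batches.append([sent + [pad] * (width - len(sent)) for sent in chunk])
    return batches


def count_pads(batches, pad="<pad>"):
    """Count how many padding tokens were added across all batches."""
    return sum(tok == pad for batch in batches for sent in batch for tok in sent)


# Six dummy sentences with lengths 2, 9, 3, 10, 2, 8.
sentences = [["w"] * n for n in (2, 9, 3, 10, 2, 8)]
bucketed_pads = count_pads(make_batches(sentences, batch_size=2, bucket=True))
naive_pads = count_pads(make_batches(sentences, batch_size=2, bucket=False))
print(bucketed_pads, naive_pads)  # 6 20
```

With length-sorted batches, only 6 padding tokens are needed instead of 20, so fewer LSTM steps are spent on `<pad>` positions.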

In [37]:
train_data, valid_data, test_data = loadDF(SRC, TRG)

In [38]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [39]:
DATASET = (train_data, valid_data, test_data)
BATCH_SIZE = 128

train_batch, valid_batch, test_batch = split_into_batches(DATASET, BATCH_SIZE)

### Model Architecture

#### Encoder

In [40]:
class Encoder(nn.Module):
    """
    Input :
        - source batch
    Layer :
        source batch -> Embedding -> LSTM
    Output :
        - LSTM hidden state: the final hidden state for each layer, stacked on top of each other
        - LSTM cell state: the final cell state for each layer, stacked on top of each other
        (the per-time-step outputs of the top layer are computed but not returned)

    Parameters
    ----------
    input_size : int
        Input dimension, should be equal to the source vocab size.

    embd_size : int
        Embedding layer's dimension.

    hidden_size : int
        LSTM Hidden/Cell state's dimension.

    n_layers : int
        Number of LSTM layers.

    dropout : float
        Dropout for the LSTM layer.
    """

    def __init__(self, input_size, embd_size, hidden_size, n_layers, dropout):
        super(Encoder, self).__init__()

        self.input_size = input_size
        self.embd_size = embd_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.dropout = dropout

        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(input_size, embd_size)

        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(embd_size, hidden_size, n_layers, dropout=dropout)

    def forward(self, src):
        """
        Parameters
        ------
        src : the source tensor [src length, batch size]
        embedded: [src length, batch size, embedding size]

        Outputs
        ------
        hidden: the hidden state, [n_layers * n_directions, batch size, hidden size]
        cell: the cell state, [n_layers * n_directions, batch size, hidden size]

        The top-layer outputs, [src length, batch size, hidden size * n_directions],
        are computed by the LSTM but not returned.
        """
        embedded = self.embedding(src)

        outputs, (hidden, cell) = self.lstm(embedded)

        return hidden, cell

#### Decoder

In [41]:
class Decoder(nn.Module):
    def __init__(self, output_size, embd_size, hidden_size, n_layers, dropout):
        """
        Input :
            - first token in the target batch
            - LSTM hidden state from the encoder
            - LSTM cell state from the encoder
        Layer :
            target batch -> Embedding --
                                        |
            encoder hidden state ------ |--> LSTM -> Linear
                                        |
            encoder cell state   -------

        Output :
            - prediction
            - LSTM hidden state
            - LSTM cell state

        Parameters
        ----------
        output_size : int
            Output dimension, should be equal to the target vocab size.

        embd_size : int
            Embedding layer's dimension.

        hidden_size : int
            LSTM Hidden/Cell state's dimension.

        n_layers : int
            Number of LSTM layers.

        dropout : float
            Dropout for the LSTM layer.
        """

        super(Decoder, self).__init__()

        self.output_size = output_size
        self.embd_size = embd_size
        self.hidden_size = hidden_size
        self.n_layers = n_layers
        self.dropout = dropout

        # self.embedding provides a vector representation of the target to our model
        self.embedding = nn.Embedding(output_size, embd_size)

        # self.lstm, accepts the embeddings and outputs a hidden state
        self.lstm = nn.LSTM(embd_size, hidden_size, n_layers, dropout=dropout)

        # self.fcLayer predicts on the LSTM output via a linear output layer
        self.fcLayer = nn.Linear(hidden_size, output_size)

    def forward(self, target, hidden, cell):
        """

        Parameters
        ----------
        target : 1D torch.LongTensor
            Batch of target token indices for the current time step, of shape [batch size].

        hidden, cell : 3D torch.FloatTensor
            Hidden and cell state of the LSTM layer. Each state's shape
            [n layers * n directions, batch size, hidden dim]

        Returns
        -------
        prediction : 2D torch.FloatTensor
            For each sequence in the batch, the unnormalized scores (logits)
            over the target vocabulary. [batch size, output dim]

        hidden, cell : 3D torch.FloatTensor
            Hidden and cell state of the LSTM layer. Each state's shape
            [n layers * n directions, batch size, hidden dim]
        """

        # [1, batch size, emb dim], the 1 serves as sent len
        embedded = self.embedding(target.unsqueeze(0))

        outputs, (hidden, cell) = self.lstm(embedded, (hidden, cell))

        prediction = self.fcLayer(outputs.squeeze(0))

        return prediction, hidden, cell

#### Seq2Seq

In [42]:
class Seq2Seq(nn.Module):
    """Ties the encoder and decoder together and decodes one target token at a time."""

    def __init__(self, encoder, decoder, device):
        super(Seq2Seq, self).__init__()

        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert (
            encoder.hidden_size == decoder.hidden_size
        ), "Hidden dimensions of encoder and decoder must be equal!"
        assert (
            encoder.n_layers == decoder.n_layers
        ), "Encoder and decoder must have equal number of layers!"

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        trg_len, batch_size = trg.shape
        trg_vocab_size = self.decoder.output_size

        # 3D tensor to store the decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # decoder initial hidden and cell state = last encoder's hidden and cell state
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> token
        input = trg[0, :]

        for t in range(1, trg_len):
            # inputs: input token embedding, previous hidden and previous cell states
            # outputs: prediction, hidden state, cell state
            prediction, hidden, cell = self.decoder(input, hidden, cell)

            # store the decoder result in the outputs tensor
            outputs[t] = prediction

            # applying the teacher force method based on the teacher_forcing_ratio
            teacher_force = np.random.random() < teacher_forcing_ratio

            if teacher_force:
                # use the actual next token as the next input
                input = trg[t]
            else:
                # select only the highest predicted token from the predictions
                top1 = prediction.argmax(1)
                input = top1

        return outputs
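The teacher-forcing decision inside the loop above can be isolated into a few lines. The sketch below is a pure-Python illustration, with `choose_next_input` as a hypothetical helper: with probability `teacher_forcing_ratio` the next decoder input is the ground-truth token, otherwise it is the model's own prediction.

```python
import random


def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng):
    """Return the ground-truth token with probability teacher_forcing_ratio."""
    teacher_force = rng.random() < teacher_forcing_ratio
    return target_token if teacher_force else predicted_token


rng = random.Random(0)  # seeded so the run is repeatable
always = [choose_next_input("gold", "pred", 1.0, rng) for _ in range(5)]
never = [choose_next_input("gold", "pred", 0.0, rng) for _ in range(5)]
mixed = [choose_next_input("gold", "pred", 0.5, rng) for _ in range(1000)]
gold_fraction = mixed.count("gold") / len(mixed)
print(always, never, gold_fraction)
```

A ratio of 1.0 always feeds the ground truth, and a ratio of 0.0 always feeds the model's own prediction, which is the setting the evaluation code uses.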

#### Model Parameters

In [43]:
source_vocab, target_vocab = buildVocab(SRC, TRG, train_data)

Source vocabulary size: 7853
Target vocabulary size: 5893
[('.', 28809), ('ein', 18851), ('einem', 13711), ('in', 11895), ('eine', 9909), (',', 8938), ('und', 8925), ('mit', 8843), ('auf', 8745), ('mann', 7805)]
[('a', 49165), ('.', 27623), ('in', 14886), ('the', 10955), ('on', 8035), ('man', 7781), ('is', 7525), ('and', 7379), ('of', 6871), ('with', 6179)]


In [44]:
# adjustable parameters
INPUT_DIM = len(source_vocab)
OUTPUT_DIM = len(target_vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HIDDEN_SIZE = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

NUM_EPOCHS = 10
CLIP = 1

In [45]:
encoder = Encoder(INPUT_DIM, ENC_EMB_DIM, HIDDEN_SIZE, N_LAYERS, ENC_DROPOUT)
decoder = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HIDDEN_SIZE, N_LAYERS, DEC_DROPOUT)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)

In [46]:
seq2seq

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.5)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fcLayer): Linear(in_features=512, out_features=5893, bias=True)
  )
)

In [47]:
def model_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


print(f"The model has {model_params(seq2seq):,} trainable parameters")

The model has 13,898,501 trainable parameters
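The parameter count can be checked by hand from the layer sizes printed above. The sketch below assumes the standard PyTorch LSTM layout, in which each of the four gates has input weights, recurrent weights, and two bias vectors; the helper functions are hypothetical and exist only for this arithmetic.

```python
def embedding_params(vocab_size, emb_dim):
    """One emb_dim-sized vector per vocabulary entry."""
    return vocab_size * emb_dim


def lstm_params(input_size, hidden_size, n_layers):
    """4 gates per layer, each with input weights, recurrent weights, and 2 biases."""
    total = 0
    for layer in range(n_layers):
        # Layers after the first take the previous layer's hidden state as input.
        layer_input = input_size if layer == 0 else hidden_size
        total += 4 * (
            layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size
        )
    return total


def linear_params(in_features, out_features):
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features


encoder_total = embedding_params(7853, 256) + lstm_params(256, 512, 2)
decoder_total = (
    embedding_params(5893, 256) + lstm_params(256, 512, 2) + linear_params(512, 5893)
)
total = encoder_total + decoder_total
print(f"{total:,}")  # 13,898,501
```

The linear output layer and the two embedding tables account for roughly 6.5 million of the parameters, so the vocabulary sizes have a large influence on model size.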


#### Training

In [48]:
# Optimizer
optimizer = optim.Adam(seq2seq.parameters(), lr=0.001)

# Loss function
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)
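To illustrate what `ignore_index` does, here is a pure-Python sketch of a mean cross-entropy that skips padding positions. It mirrors the idea rather than PyTorch's implementation; `cross_entropy_ignore` is a hypothetical helper, and the padding index of 1 is an assumption made for the example.

```python
import math


def cross_entropy_ignore(logits, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not ignore_index."""
    losses = []
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding positions contribute nothing to the loss
        log_norm = math.log(sum(math.exp(x) for x in row))
        losses.append(log_norm - row[target])
    return sum(losses) / len(losses)  # assumes at least one non-ignored target


PAD = 1  # assumed padding index for this example
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [5.0, 5.0, 5.0]]
targets = [0, 2, PAD]
loss = cross_entropy_ignore(logits, targets, ignore_index=PAD)

# Changing the logits at the padded position leaves the loss unchanged.
other_logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [9.0, -9.0, 0.0]]
loss_other = cross_entropy_ignore(other_logits, targets, ignore_index=PAD)
print(loss, loss_other)
```

Because padded positions are skipped, the model is neither rewarded nor penalized for what it predicts after a sentence has ended.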

In [49]:
def train(model, batch_iterator, optimizer, criterion, clip):
    """Run one training epoch and return the mean loss per batch."""
    model.train()

    epoch_loss = 0

    for batch in batch_iterator:
        # getting the source and target sentences from the batch
        src = batch.src
        trg = batch.trg  # [trg len, batch size]

        optimizer.zero_grad()

        output = model(src, trg)  # [trg len, batch size, output size]

        # drop the first time step (the <sos> position, never predicted) and flatten
        flatten_output = output[1:].view(
            -1, output.shape[-1]
        )  # [(trg len - 1) * batch size, output size]
        flatten_trg = trg[1:].view(-1)  # [(trg len - 1) * batch size]

        # calculate the loss
        loss = criterion(flatten_output, flatten_trg)

        # backward pass
        loss.backward()

        # clip the gradient to prevent exploding gradient problem
        nn.utils.clip_grad_norm_(model.parameters(), clip)

        # update the parameters
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(batch_iterator)
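Gradient clipping by norm, as applied in `train` with `CLIP = 1`, rescales all gradients together when their combined L2 norm exceeds the threshold. A minimal pure-Python sketch of that rule on a flat list of gradient values (`clip_by_norm` is a hypothetical helper; the small epsilon guards against division by zero):

```python
import math


def clip_by_norm(grads, max_norm, eps=1e-6):
    """Scale grads so their L2 norm is at most max_norm; return (clipped, original norm)."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


large, large_norm = clip_by_norm([3.0, 4.0], max_norm=1.0)
small, small_norm = clip_by_norm([0.3, 0.4], max_norm=1.0)
print(large, large_norm)
print(small, small_norm)
```

The direction of the gradient is preserved; only its length is capped, so gradients already within the threshold pass through unchanged.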

#### Evaluation

In [50]:
def evaluate(model, batch_iterator, criterion):
    model.eval()

    val_epoch_loss = 0

    with torch.no_grad():
        for batch in batch_iterator:
            src = batch.src
            trg = batch.trg

            output = model(src, trg, 0)  # turn off teacher forcing

            # drop the first time step (the <sos> position, never predicted) and flatten
            flatten_output = output[1:].view(
                -1, output.shape[-1]
            )  # [(trg len - 1) * batch size, output size]
            flatten_trg = trg[1:].view(-1)  # [(trg len - 1) * batch size]

            # calculate the loss
            loss = criterion(flatten_output, flatten_trg)

            val_epoch_loss += loss.item()

    return val_epoch_loss / len(batch_iterator)

In [51]:
best_valid_loss = float("inf")  # initialize a best validation loss to beat

In [52]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [53]:
for epoch in range(NUM_EPOCHS):
    start_time = time.time()

    train_loss = train(seq2seq, train_batch, optimizer, criterion, CLIP)

    valid_loss = evaluate(seq2seq, valid_batch, criterion)

    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(seq2seq.state_dict(), "seq2seq_model.pt")

    print(f"Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s")
    print(f"\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}")
    print(f"\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}")

Epoch: 01 | Time: 13m 43s
	Train Loss: 4.998 | Train PPL: 148.117
	 Val. Loss: 4.853 |  Val. PPL: 128.167
Epoch: 02 | Time: 12m 58s
	Train Loss: 4.407 | Train PPL:  82.017
	 Val. Loss: 4.624 |  Val. PPL: 101.910
Epoch: 03 | Time: 12m 58s
	Train Loss: 4.078 | Train PPL:  59.005
	 Val. Loss: 4.491 |  Val. PPL:  89.175
Epoch: 04 | Time: 12m 54s
	Train Loss: 3.808 | Train PPL:  45.065
	 Val. Loss: 4.276 |  Val. PPL:  71.943
Epoch: 05 | Time: 13m 5s
	Train Loss: 3.572 | Train PPL:  35.577
	 Val. Loss: 3.999 |  Val. PPL:  54.569
Epoch: 06 | Time: 12m 51s
	Train Loss: 3.363 | Train PPL:  28.877
	 Val. Loss: 3.883 |  Val. PPL:  48.588
Epoch: 07 | Time: 12m 40s
	Train Loss: 3.177 | Train PPL:  23.969
	 Val. Loss: 3.884 |  Val. PPL:  48.636
Epoch: 08 | Time: 12m 39s
	Train Loss: 3.037 | Train PPL:  20.834
	 Val. Loss: 3.816 |  Val. PPL:  45.411
Epoch: 09 | Time: 12m 45s
	Train Loss: 2.900 | Train PPL:  18.182
	 Val. Loss: 3.709 |  Val. PPL:  40.810
Epoch: 10 | Time: 13m 8s
	Train Loss: 2.779 | T

In [54]:
# Evaluating the trained model
seq2seq.load_state_dict(torch.load("seq2seq_model.pt"))

test_loss = evaluate(seq2seq, test_batch, criterion)

print(f"| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |")

| Test Loss: 3.712 | Test PPL:  40.931 |
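The perplexity (PPL) reported next to each loss is the exponential of the mean cross-entropy. One way to read it is as the number of equally likely tokens the model is effectively choosing between. The sketch below compares the reported test loss with the baseline of guessing uniformly over the 5,893-token target vocabulary; `perplexity` is a hypothetical helper.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is exp of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


test_ppl = perplexity(3.712)  # the rounded test loss reported above
uniform_ppl = perplexity(math.log(5893))  # uniform guess over the target vocabulary
print(round(test_ppl, 1), round(uniform_ppl))
```

A uniform guess yields a perplexity equal to the vocabulary size, so a test perplexity of about 41 is far better than chance even though the sample translations are still poor.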


### Results using the trained model

In [55]:
example_index = 28
example = train_data.examples[example_index]


print("source sentence: ", " ".join(example.src))
print("target sentence: ", " ".join(example.trg))

source sentence:  . bieber justin wie aussehe ich dass , weißt du
target sentence:  you know i am looking like justin bieber .


In [56]:
src_tensor = SRC.process([example.src]).to(device)
trg_tensor = TRG.process([example.trg]).to(device)
print(trg_tensor.shape)

seq2seq.eval()
with torch.no_grad():
    outputs = seq2seq(src_tensor, trg_tensor, teacher_forcing_ratio=0)

outputs.shape

torch.Size([11, 1])


torch.Size([11, 1, 5893])

In [57]:
output_idx = outputs[1:].squeeze(1).argmax(1)
" ".join([TRG.vocab.itos[idx] for idx in output_idx])

'<unk> <unk> <unk> <unk> <unk> <unk> <unk> . <eos> <eos>'

In [None]:
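TEST_EXM = "ich bin keinen hunger ."
MAX_LEN = 20  # assumed cap on the generated length

# tokenize_de already reverses the tokens; lowercase and add <sos>/<eos> as the SRC field does
tokens = [SRC.init_token] + [t.lower() for t in tokenize_de(TEST_EXM)] + [SRC.eos_token]
src_tensor = (
    torch.LongTensor([SRC.vocab.stoi[tok] for tok in tokens]).unsqueeze(1).to(device)
)

seq2seq.eval()
with torch.no_grad():
    hidden, cell = seq2seq.encoder(src_tensor)

    # greedy decoding: start from <sos> and feed each prediction back in
    token = torch.LongTensor([TRG.vocab.stoi[TRG.init_token]]).to(device)
    output_idx = []
    for _ in range(MAX_LEN):
        prediction, hidden, cell = seq2seq.decoder(token, hidden, cell)
        token = prediction.argmax(1)
        if token.item() == TRG.vocab.stoi[TRG.eos_token]:
            break
        output_idx.append(token.item())

print(src_tensor.shape)

full_sentence = " ".join([TRG.vocab.itos[idx] for idx in output_idx])

print(full_sentence)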