# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot uses a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the model's performance. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. The starter code below uses Gensim to train Word2Vec embeddings on the Brown corpus from NLTK. You can compare the performance of a model that uses these embeddings against one that does not.
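As a rough illustration of how pretrained vectors could be lined up with a model vocabulary, the sketch below builds an embedding matrix whose rows follow the vocabulary order. The names `build_embedding_matrix`, `toy_vectors`, and `toy_itos` are hypothetical; the dictionary stands in for a word-vector lookup such as the Word2Vec model's vectors, and tokens without a pretrained vector keep a small random initialization.

```python
import numpy as np


def build_embedding_matrix(itos, vectors, dim, seed=0):
    """Return a (len(itos), dim) float32 matrix aligned with the vocabulary order.

    itos    : list mapping index -> token (hypothetical stand-in for a vocabulary)
    vectors : mapping token -> 1-D array of length dim (stand-in for pretrained vectors)
    """
    rng = np.random.default_rng(seed)
    # Small random rows so tokens without a pretrained vector remain trainable.
    matrix = rng.normal(0.0, 0.1, size=(len(itos), dim)).astype(np.float32)
    hits = 0
    for idx, token in enumerate(itos):
        if token in vectors:
            matrix[idx] = vectors[token]
            hits += 1
    return matrix, hits


# Toy data, invented for this example only.
toy_vectors = {
    "dog": np.ones(4, dtype=np.float32),
    "cat": np.full(4, 2.0, dtype=np.float32),
}
toy_itos = ["<unk>", "<pad>", "dog", "cat", "zebra"]

matrix, hits = build_embedding_matrix(toy_itos, toy_vectors, dim=4)
print(matrix.shape, hits)
```

A matrix like this could then be handed to `torch.nn.Embedding.from_pretrained`, provided the vector size matches the model's embedding dimension. Because the Brown corpus is English, such vectors would apply to an English-side embedding layer.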



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden and cell states to the Decoder. The Decoder uses those states to output a series of token predictions, one token at a time.
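The data flow described above can be sketched with toy sizes. This is a minimal illustration, not the model built later in the notebook: the dimensions, the start-token index `0`, and the four decoding steps are assumptions chosen only to keep the example small.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes, chosen only for illustration.
VOCAB, EMB, HID, LAYERS = 20, 8, 16, 2
SRC_LEN, BATCH, STEPS = 5, 3, 4

enc_embedding = nn.Embedding(VOCAB, EMB)
dec_embedding = nn.Embedding(VOCAB, EMB)
encoder_rnn = nn.LSTM(EMB, HID, LAYERS)
decoder_rnn = nn.LSTM(EMB, HID, LAYERS)
fc_out = nn.Linear(HID, VOCAB)

src = torch.randint(0, VOCAB, (SRC_LEN, BATCH))  # [src len, batch size]

# Encoder: read the whole source sequence and keep only the final states.
_, (hidden, cell) = encoder_rnn(enc_embedding(src))

# Decoder: one step at a time, starting from an assumed start-token index 0.
token = torch.zeros(BATCH, dtype=torch.long)
predicted = []
for _ in range(STEPS):
    step_input = dec_embedding(token).unsqueeze(0)  # [1, batch size, emb dim]
    out, (hidden, cell) = decoder_rnn(step_input, (hidden, cell))
    logits = fc_out(out.squeeze(0))                 # [batch size, vocab]
    token = logits.argmax(1)                        # greedy choice feeds the next step
    predicted.append(token)

predicted = torch.stack(predicted)                  # [steps, batch size]
print(hidden.shape, predicted.shape)
```

The only information the decoder receives about the source sentence is the pair `(hidden, cell)`, which is why the encoder and decoder need matching hidden sizes and layer counts.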

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





# Install the libraries needed to build the vocabulary

In [1]:
!pip install torchdata    #dataset processing library

Defaulting to user installation because normal site-packages is not writeable
Collecting torch==1.12.1
  Using cached torch-1.12.1-cp37-cp37m-manylinux1_x86_64.whl (776.3 MB)
ERROR: torchtext 0.10.0 has requirement torch==1.9.0, but you'll have torch 1.12.1 which is incompatible.
ERROR: torchvision 0.10.0 has requirement torch==1.9.0, but you'll have torch 1.12.1 which is incompatible.
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.9.0
    Uninstalling torch-1.9.0:
      Successfully uninstalled torch-1.9.0
Successfully installed torch-1.12.1


In [2]:
!pip install torchtext==0.10.0    #dataset repository

Defaulting to user installation because normal site-packages is not writeable
Collecting torch==1.9.0
  Using cached torch-1.9.0-cp37-cp37m-manylinux1_x86_64.whl (831.4 MB)
ERROR: torchdata 0.4.1 has requirement torch==1.12.1, but you'll have torch 1.9.0 which is incompatible.
Installing collected packages: torch
  Attempting uninstall: torch
    Found existing installation: torch 1.12.1
    Uninstalling torch-1.12.1:
      Successfully uninstalled torch-1.12.1
Successfully installed torch-1.9.0


In [3]:
!pip install spacy    # NLP pipeline
!python3 -m spacy download en_core_web_sm     # English pipeline optimized for CPU (trained on web text)
!python3 -m spacy download de_core_news_sm    # German pipeline optimized for CPU (trained on news text)

Defaulting to user installation because normal site-packages is not writeable
Collecting spacy
  Downloading spacy-3.4.1-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (6.3 MB)
Collecting langcodes<4.0.0,>=3.2.0
  Downloading langcodes-3.3.0-py3-none-any.whl (181 kB)
Collecting preshed<3.1.0,>=3.0.2
  Downloading preshed-3.0.7-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (126 kB)
Collecting spacy-loggers<2.0.0,>=1.0.0
  Downloading spacy_loggers-1.0.3-py3-none-any.whl (9.3 kB)
Collecting wasabi<1.1.0,>=0.9.1
  Downloading wasabi-0.10.1-py3-none-any.whl (26 kB)
Collecting typer<0.5.0,>=0.3.0
  Downloading typer-0.4.2-py3-none-any.whl (27 kB)
Collecting spacy-legacy<3.1.0,>=3.0.9
  Downloading spacy_legacy-3.

Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
Defaulting to user installation because normal site-packages is not writeable
Collecting de-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/de_core_news_sm-3.4.0/de_core_news_sm-3.4.0-py3-none-any.whl (14.6 MB)


Installing collected packages: de-core-news-sm
Successfully installed de-core-news-sm-3.4.0
✔ Download and installation successful
You can now load the package via spacy.load('de_core_news_sm')


In [3]:
import os, math, time, spacy, torch, random
import numpy as np
import torch.nn as nn
import torch.optim as optim
from typing import List
from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

# Load the spaCy pipelines

In [4]:
de_spacy = spacy.load('de_core_news_sm')
en_spacy = spacy.load('en_core_web_sm')

# Define the tokenizers
def tokenize_de(text: str) -> List[str]:
    # Reverse the source tokens; the demonstration cell reverses user input the same way
    return [tok.text for tok in de_spacy.tokenizer(text)][::-1]

def tokenize_en(text: str) -> List[str]:
    return [tok.text for tok in en_spacy.tokenizer(text)]
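Note that the German tokenizer returns the source tokens in reverse order, which is why the first training example printed further down starts with `'.'`. The sketch below shows the same idea with a plain whitespace split standing in for spaCy; `whitespace_tokenize` and `tokenize_reversed` are hypothetical helper names used only here.

```python
from typing import List


def whitespace_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in for a spaCy tokenizer: split on whitespace only.
    return text.split()


def tokenize_reversed(text: str) -> List[str]:
    # Same pattern as the source-side tokenizer: tokenize, then reverse.
    return whitespace_tokenize(text)[::-1]


sentence = "zwei junge männer sind im freien ."
reversed_tokens = tokenize_reversed(sentence)
print(reversed_tokens)

# Reversing a copy restores reading order without touching the original list.
restored = reversed_tokens[::-1]
print(" ".join(restored))
```

Feeding the source sentence backwards places its first words closest to the point where the decoder starts generating, a trick commonly used with LSTM encoder-decoder models.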

# Create the torchtext fields

In [5]:
source = Field(tokenize=tokenize_de, init_token='<sos>', eos_token='<eos>', lower=True)
target = Field(tokenize=tokenize_en, init_token='<sos>', eos_token='<eos>', lower=True)

# Data split into training, validation and test

In [6]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), fields=(source, target))

In [7]:
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


In [8]:
print(vars(train_data.examples[0]))

{'src': ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei'], 'trg': ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']}


# Building vocabulary for source and target

In [9]:
source.build_vocab(train_data, min_freq = 2)
target.build_vocab(train_data, min_freq = 2)

In [10]:
print(f"Unique tokens in source (de) vocabulary: {len(source.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(target.vocab)}")

Unique tokens in source (de) vocabulary: 7853
Unique tokens in target (en) vocabulary: 5893
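To make the effect of `min_freq = 2` concrete, here is a small pure-Python sketch of frequency-thresholded vocabulary building. It is not the torchtext implementation; `build_vocab`, `numericalize`, and the ordering of the special tokens are assumptions made for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Keep special tokens plus every token seen at least min_freq times."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # most_common() orders by descending frequency.
    itos += [tok for tok, freq in counts.most_common() if freq >= min_freq]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi):
    # Tokens below the frequency threshold fall back to the <unk> index.
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]


corpus = [["a", "dog", "runs"], ["a", "cat", "runs"], ["a", "bird", "sings"]]
itos, stoi = build_vocab(corpus)
print(itos)                                          # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'runs']
print(numericalize(["a", "zebra", "runs"], stoi))    # [4, 0, 5]
```

Words that appear only once in the training data are mapped to `<unk>`, which keeps the embedding and output layers smaller at the cost of losing rare words.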


# Create batch iterators on the GPU (or the CPU if no GPU is available)

In [11]:
BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=BATCH_SIZE, device=device)
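`BucketIterator` batches examples of similar length together so that less padding is needed. The sketch below illustrates that idea in a simplified form (sort by length, then cut into batches); the real iterator also shuffles, and `padding_cost` and `chunk` are hypothetical helpers written for this example.

```python
def chunk(sequences, batch_size):
    """Cut a list of sequences into consecutive batches."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def padding_cost(batches):
    """Count the pad tokens needed to bring every batch to its longest sequence."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total


# Toy sentences with invented lengths.
lengths = [3, 12, 4, 11, 5, 13, 2, 10]
sentences = [["tok"] * n for n in lengths]

naive = chunk(sentences, 4)                       # batches in arrival order
bucketed = chunk(sorted(sentences, key=len), 4)   # similar lengths grouped together

print(padding_cost(naive), padding_cost(bucketed))   # 40 12
```

In this toy case, grouping by length cuts the number of pad tokens from 40 to 12, which means fewer wasted LSTM steps per batch.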

In [12]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown

nltk.download('brown')
nltk.download('punkt')

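# Train Word2Vec embeddings on the Brown corpus, then save and reload them

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')


# Starter-code stubs: left unimplemented and not used by the cells below

def loadDF(path):
    """Load the dataset at `path` into a DataFrame."""
    raise NotImplementedError


def prepare_text(sentence):
    """Clean and tokenize `sentence`."""
    raise NotImplementedError


def train_test_split(SRC, TRG):
    """Split the source and target sequences into training and test sets."""
    raise NotImplementedError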


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# Create Encoder

In [13]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        #src = [src len, batch size]
        embedded = self.dropout(self.embedding(src))
        #embedded = [src len, batch size, emb dim]
        outputs, (hidden, cell) = self.rnn(embedded)
        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        #outputs are always from the top hidden layer
        
        return hidden, cell

# Create Decoder

In [14]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
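    def forward(self, input, hidden, cell):
        #input = [batch size]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]

        # add a sequence-length dimension of 1: the decoder runs one token at a time
        input = input.unsqueeze(0)
        #input = [1, batch size]

        embedded = self.dropout(self.embedding(input))
        #embedded = [1, batch size, emb dim]

        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        #output = [1, batch size, hid dim]

        prediction = self.fc_out(output.squeeze(0))
        #prediction = [batch size, output dim]

        return prediction, hidden, cell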


# Create Seq2seq

In [15]:
class Seq2Seq(nn.Module):
    
    def __init__(self, encoder: Encoder, decoder: Decoder, device: torch.device):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

        assert encoder.hid_dim == decoder.hid_dim, \
            'Hidden dimensions of encoder and decoder must be equal!'
        assert encoder.n_layers == decoder.n_layers, \
            'Encoder and decoder must have equal number of layers!'

    def forward(self, src_batch: torch.LongTensor, trg_batch: torch.LongTensor,
                teacher_forcing_ratio: float=0.5):

        max_len, batch_size = trg_batch.shape
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder's output
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)

        # last hidden & cell state of the encoder is used as the decoder's initial hidden state
        hidden, cell = self.encoder(src_batch)

        trg = trg_batch[0]
        for i in range(1, max_len):
            prediction, hidden, cell = self.decoder(trg, hidden, cell)
            outputs[i] = prediction

            if random.random() < teacher_forcing_ratio:
                trg = trg_batch[i]
            else:
                trg = prediction.argmax(1)

        return outputs
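The `teacher_forcing_ratio` branch in `forward` decides, at every step, whether the decoder's next input is the ground-truth token or its own previous prediction. The following sketch isolates that decision in plain Python; `decode_with_teacher_forcing` and the toy `repeat_input` "decoder" are hypothetical and exist only to make the behavior visible.

```python
import random


def decode_with_teacher_forcing(step_fn, targets, teacher_forcing_ratio, rng):
    """Run a toy decoder one step at a time.

    step_fn : stand-in for the decoder; maps the previous token to a prediction.
    targets : ground-truth tokens, with targets[0] as the start token.
    Returns the predictions and the inputs that were actually fed in.
    """
    token = targets[0]
    predictions, inputs_used = [], []
    for i in range(1, len(targets)):
        inputs_used.append(token)
        prediction = step_fn(token)
        predictions.append(prediction)
        # With probability teacher_forcing_ratio feed the ground truth next,
        # otherwise feed the model's own prediction.
        if rng.random() < teacher_forcing_ratio:
            token = targets[i]
        else:
            token = prediction
    return predictions, inputs_used


def repeat_input(token):
    # Deliberately bad "decoder": it just echoes its input.
    return token


targets = ["<sos>", "a", "dog", "barks"]

_, forced = decode_with_teacher_forcing(repeat_input, targets, 1.0, random.Random(0))
_, free = decode_with_teacher_forcing(repeat_input, targets, 0.0, random.Random(0))
print(forced)   # ['<sos>', 'a', 'dog']
print(free)     # ['<sos>', '<sos>', '<sos>']
```

With a ratio of 1.0 the decoder always sees the correct previous token, while with 0.0 its own mistakes feed forward, which is the setting used for evaluation and for the interactive demonstration.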

In [16]:
# Parameters
INPUT = len(source.vocab)
OUTPUT = len(target.vocab)
ENC_EMB = 128
DEC_EMB = 128
HID = 256
N_LAYERS = 3
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5


decoder = Decoder(OUTPUT, DEC_EMB, HID, N_LAYERS, DEC_DROPOUT).to(device)

# Build the encoder and decoder and move the model to the device

In [17]:
encoder = Encoder(INPUT, ENC_EMB, HID, N_LAYERS, ENC_DROPOUT)
decoder = Decoder(OUTPUT, DEC_EMB, HID, N_LAYERS, DEC_DROPOUT)
seq2seq = Seq2Seq(encoder, decoder, device).to(device)
seq2seq

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 128)
    (rnn): LSTM(128, 256, num_layers=3, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 128)
    (rnn): LSTM(128, 256, num_layers=3, dropout=0.5)
    (fc_out): Linear(in_features=256, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

# Count parameters and optimize using Adam

In [18]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(seq2seq):,} trainable parameters')

The model has 6,169,861 trainable parameters
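The reported total can be checked by hand from the layer sizes in the model summary. The sketch below assumes the standard PyTorch LSTM layout (four gates, each with input weights, recurrent weights, and two bias vectors); `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional multi-layer LSTM with biases."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        weights = 4 * hidden_size * (in_size + hidden_size)  # input + recurrent weights
        biases = 2 * 4 * hidden_size                         # two bias vectors per gate
        total += weights + biases
    return total


INPUT, OUTPUT = 7853, 5893          # vocabulary sizes from the summary above
EMB, HID, N_LAYERS = 128, 256, 3

encoder_params = INPUT * EMB + lstm_params(EMB, HID, N_LAYERS)
decoder_params = (OUTPUT * EMB                      # embedding
                  + lstm_params(EMB, HID, N_LAYERS)  # LSTM stack
                  + HID * OUTPUT + OUTPUT)           # linear layer weights + bias
total = encoder_params + decoder_params

print(f"{encoder_params:,} + {decoder_params:,} = {total:,}")
# 2,453,120 + 3,716,741 = 6,169,861
```

Roughly a quarter of the parameters sit in the decoder's output layer, so the target vocabulary size has a large influence on model size.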


In [19]:
optimizer = optim.Adam(seq2seq.parameters())

In [20]:
PAD_IDX = target.vocab.stoi['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
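Passing `ignore_index` means padded target positions contribute nothing to the loss or its average. The toy check below compares the masked loss with one computed by dropping the padded positions manually; the pad index `1` and the tensor sizes are assumptions for this example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

PAD_IDX = 1                                 # assumed pad index for this toy example
logits = torch.randn(5, 4)                  # 5 flattened positions, vocabulary of 4
targets = torch.tensor([2, 3, PAD_IDX, 0, PAD_IDX])

# Loss that skips padded positions, as in the notebook's criterion.
masked_loss = nn.CrossEntropyLoss(ignore_index=PAD_IDX)(logits, targets)

# Same quantity by hand: remove padded positions, then average over the rest.
keep = targets != PAD_IDX
manual_loss = nn.CrossEntropyLoss()(logits[keep], targets[keep])

print(torch.allclose(masked_loss, manual_loss))   # True
```

Without the mask, the model would be rewarded for predicting `<pad>` at the end of short sentences, which would make the reported loss look better than it is.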

# Define the training loop and training loss

In [21]:
def train(seq2seq, iterator, optimizer, criterion):
    seq2seq.train()

    epoch_loss = 0
    for batch in iterator:
        optimizer.zero_grad()
        outputs = seq2seq(batch.src, batch.trg)
        outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
        trg_flatten = batch.trg[1:].view(-1)
        loss = criterion(outputs_flatten, trg_flatten)

        loss.backward()
        optimizer.step()

        epoch_loss += loss.item()

    return epoch_loss / len(iterator)

# Define the evaluation loop and validation loss

In [22]:
def evaluate(seq2seq, iterator, criterion):
    seq2seq.eval()

    epoch_loss = 0
    with torch.no_grad():
        for batch in iterator:
            # turn off teacher forcing
            outputs = seq2seq(batch.src, batch.trg, teacher_forcing_ratio=0) 

            # trg = [trg sent len, batch size]
            # output = [trg sent len, batch size, output dim]
            outputs_flatten = outputs[1:].view(-1, outputs.shape[-1])
            trg_flatten = batch.trg[1:].view(-1)
            loss = criterion(outputs_flatten, trg_flatten)
            epoch_loss += loss.item()

    return epoch_loss / len(iterator)

In [23]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

# Train the model and report training and validation loss

In [28]:
EPOCHS = 10
best_valid_loss = float('inf')

for epoch in range(EPOCHS):    
    start_time = time.time()
    train_loss = train(seq2seq, train_iterator, optimizer, criterion)
    valid_loss = evaluate(seq2seq, valid_iterator, criterion)
    end_time = time.time()

    epoch_mins, epoch_secs = epoch_time(start_time, end_time)

    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(seq2seq.state_dict(), 'chatbot-model.pt')

    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 40s
	Train Loss: 5.097 | Train PPL: 163.527
	 Val. Loss: 4.805 |  Val. PPL: 122.101
Epoch: 02 | Time: 1m 41s
	Train Loss: 4.644 | Train PPL: 103.973
	 Val. Loss: 4.690 |  Val. PPL: 108.817
Epoch: 03 | Time: 1m 42s
	Train Loss: 4.361 | Train PPL:  78.343
	 Val. Loss: 4.528 |  Val. PPL:  92.550
Epoch: 04 | Time: 1m 41s
	Train Loss: 4.195 | Train PPL:  66.338
	 Val. Loss: 4.436 |  Val. PPL:  84.442
Epoch: 05 | Time: 1m 40s
	Train Loss: 4.078 | Train PPL:  59.034
	 Val. Loss: 4.380 |  Val. PPL:  79.866
Epoch: 06 | Time: 1m 41s
	Train Loss: 3.986 | Train PPL:  53.849
	 Val. Loss: 4.250 |  Val. PPL:  70.081
Epoch: 07 | Time: 1m 41s
	Train Loss: 3.879 | Train PPL:  48.354
	 Val. Loss: 4.241 |  Val. PPL:  69.492
Epoch: 08 | Time: 1m 41s
	Train Loss: 3.792 | Train PPL:  44.364
	 Val. Loss: 4.193 |  Val. PPL:  66.238
Epoch: 09 | Time: 1m 41s
	Train Loss: 3.724 | Train PPL:  41.424
	 Val. Loss: 4.069 |  Val. PPL:  58.479
Epoch: 10 | Time: 1m 40s
	Train Loss: 3.644 | Train PPL

# Load the best checkpoint and calculate the test loss

In [24]:
seq2seq.load_state_dict(torch.load('chatbot-model.pt')) 

test_loss = evaluate(seq2seq, test_iterator, criterion)
print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 4.054 | Test PPL:  57.604 |
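The PPL column is perplexity, the exponential of the mean cross-entropy loss. A short sketch of the relationship follows, using the rounded test loss printed above; the `perplexity` helper is a hypothetical name.

```python
import math


def perplexity(mean_cross_entropy: float) -> float:
    """Perplexity is exp of the mean per-token cross-entropy (natural log)."""
    return math.exp(mean_cross_entropy)


test_loss = 4.054                      # rounded value printed above
print(f"{perplexity(test_loss):.1f}")  # 57.6

# Reference point: guessing uniformly over the 5,893-word target vocabulary
# has a loss of ln(5893), so its perplexity equals the vocabulary size.
uniform_loss = math.log(5893)
print(round(perplexity(uniform_loss)))  # 5893
```

The small difference from the printed 57.604 comes from exponentiating the rounded loss rather than the full-precision value. A perplexity near 58 means the model is, on average, about as uncertain as if it were choosing among 58 equally likely words.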


# Check the model performance

In [27]:
example_idx = 18
example = train_data.examples[example_idx]
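# The source tokens are stored reversed; copy them into reading order
# instead of reversing the dataset example in place
sr = example.src[::-1]
print('source sentence: ', ' '.join(sr))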
print('target sentence: ', ' '.join(example.trg))

source sentence:  fünf personen sitzen mit instrumenten im kreis .
target sentence:  five people are sitting in a circle with instruments .


# Demonstration

In [None]:
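while True:
    # The vocabulary is lowercased, so lowercase the input as well
    inp = input('enter sentence   ').lower().split()
    if not inp:
        break  # an empty line ends the demo
    inp.reverse()  # the model was trained on reversed source sentences

    src_tensor = source.process([inp]).to(device)
    # Placeholder target: with teacher forcing off, only its length (the number
    # of decoding steps) and its leading <sos> token are used
    trg_tensor = target.process([inp]).to(device)

    seq2seq.eval()
    with torch.no_grad():
        outputs = seq2seq(src_tensor, trg_tensor,
                          teacher_forcing_ratio=0)

    # Skip position 0 (never filled), then take the most likely token per step
    output_idx = outputs[1:].squeeze(1).argmax(1)
    print(' '.join([target.vocab.itos[idx] for idx in output_idx]))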

enter sentence   ein schwarzer hund und ein gefleckter hund kämpfen .
a brown dog and a dog dog . <eos> <eos>
enter sentence   vier typen , von denen drei hüte tragen und einer nicht , springen oben in einem treppenhaus
four boys and a and and and , , , , , , are in a a .
enter sentence   fünf personen sitzen mit instrumenten im kreis .
four people are sitting in a of a .
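The first reply above ends in repeated `<eos>` tokens because the loop always decodes a fixed number of steps. One possible cleanup, sketched here with a hypothetical `trim_at_eos` helper, is to cut the output at the first end-of-sequence marker before printing.

```python
def trim_at_eos(tokens, eos_token="<eos>"):
    """Return the tokens that come before the first end-of-sequence marker."""
    if eos_token in tokens:
        return tokens[:tokens.index(eos_token)]
    return tokens


raw = "a brown dog and a dog dog . <eos> <eos>".split()
print(" ".join(trim_at_eos(raw)))   # a brown dog and a dog dog .

# Output without an <eos> token is returned unchanged.
no_eos = "four people are sitting in a of a .".split()
print(" ".join(trim_at_eos(no_eos)))
```

This only tidies the printed text; it does not change how many steps the decoder runs.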
