# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [7]:
# Project Steps Overview and Estimated Duration
# Below you will find each of the components of the project, and estimated times to complete each portion. 
# These are estimates and not exact timings to help you expect the amount of time necessary to put aside to work on your project.

# Prepare data (~2 hours)
# Build your vocabulary from a corpus of language data. The Vocabulary object is described in Lesson Six: Seq2Seq.

# Build Model (~4 hours)
# Build your Encoder, Decoder, and larger Sequence to Sequence pattern in PyTorch. This pattern is described in Lesson Six: Seq2Seq.

# Train Model (~3 hours)
# Write your training procedure and divide your dataset into train/test/validation splits. Then, train your network and plot your evaluation metrics. Save your model after it reaches a satisfactory level of accuracy.

# Evaluate & Interact w/ Model (~1 hour)
# Write a script to interact with your network at the command line.

In [8]:
# Instructions Summary
# The LSTM Chatbot will help you show off your skills as a deep learning practitioner. You will develop the chatbot using a new architecture called a Seq2Seq. 
# Additionally, you can use pre-trained word embeddings to improve the performance of your model. Let's get started by following the steps below:

# Step 1: Build your Vocabulary & create the Word Embeddings
# The most important part of this step is to create your Vocabulary object using a corpus of data drawn from TorchText.

# (Extra Credit)
# Use Gensim to extract the word embeddings from one of its corpus'.
# Use NLTK and Gensim to create a function to clean your text and look up the index of a word's embeddings.

# Step 2: Create the Encoder
# A Seq2Seq architecture consists of an encoder and a decoder unit. You will use Pytorch to build a full Seq2Seq model.
# The first step of the architecture is to create an encoder with an LSTM unit.

# (Extra Credit)
# Load your pretrained embeddings into the LSTM unit.

# Step 3: Create the Decoder
# The second step of the architecture is to create a decoder using a second LSTM unit.

# Step 4: Combine them into a Seq2Seq Architecture
# To finalize your model, you will combine the encoder and decoder units into a working model.
# The Seq2Seq2 model must be able to instantiate the encoder and decoder. Then, it will accept the inputs for these units and manage their interaction to get an output using the forward pass function.

# Step 5: Train & evaluate your model
# Finally you will train and evaluate your model using a Pytorch training loop.

# Step 6: Interact with the Chatbot
# Demonstrate your chatbot by converting the outputs of the model to text and displaying it's responses at the command line.

In [9]:
# Pre-requisite: Select PyTorch 2.00 kernel

# Install requirements
# !pip install gensim==4.3.1 nltk==3.8.1 torchtext torchdata portalocker | grep -v "already satisfied"

# !pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.15.1 portalocker>=2.0.0 | grep -v "already satisfied"
!pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.6.0   | grep -v "already satisfied"

#  torchtext==0.12.0 --> torch==1.11.0
#  torchtext==0.13.0 --> torch==1.12.0
#  torchtext==0.14.0 --> torch==1.12.0
# torchtext==0.15.1 --> torch==2.0.0
# torchtext==0.15.2 --> torch==2.0.1

# !pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.10.0 | grep -v "already satisfied"

# !pip install gensim==4.2.0 nltk torchtext  | grep -v "already satisfied"

[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


In [10]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
# import sklearn.model_selection 
from torchtext.utils import download_from_url
from torchtext.data import Field, BucketIterator, TabularDataset
import random
import json

nltk.download('brown')
nltk.download('punkt')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Output, save, and load brown embeddings

# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

# w2v = gensim.models.Word2Vec.load('brown.embedding')

question_context_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)
answer_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)

def loadDF(path):    
    data_file_path = download_from_url(path, root="data")
    
    with open(data_file_path, 'r') as f:
        squad_data = json.load(f)
        
    data = []
    examples = []
    for item in squad_data['data']:
        for paragraph in item['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']                
                # question_context = f"{question} {separator_token} {context}"
                question_context = f"{question} {context}"
                answer = qa['answers'][0]['text']
                
                data.append((question_context, answer))                                

            #TODO: remove line below
            # break    
                
    df = pd.DataFrame.from_records(data, columns=['question_context', 'answer'])        
    df['question_context'] = df['question_context'].apply(prepare_text)
    df['answer'] = df['answer'].apply(prepare_text)

    return df


def prepare_text(sentence):
    return sentence

# def prepare_dataset(df):
#     df['context_tokens'] = df['context'].apply(prepare_text)
#     df['question_tokens'] = df['question'].apply(prepare_text)
#     df['answer_tokens'] = df['answer'].apply(prepare_text)
    
#     df['context_ids'] = df['context_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['question_ids'] = df['question_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['answer_ids'] = df['answer_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])

#     df['full_question_ids'] = df.apply(lambda row: [CONTEXT_index] + row['context_ids'] + [QUESTION_index] + row['question_ids'], axis=1)    

# def train_test_split(SRC, TRG, test_size=0.2, random_seed=42):
#     from sklearn.model_selection import train_test_split
    
#     SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset= train_test_split(SRC, TRG, test_size=test_size, random_state=random_seed)
    
#     # Return the training and test datasets
#     return SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset


def load_datasets(df, random_seed=42):
    df.to_csv("data/data.csv", index=False)
        
    fields = [('question_context', question_context_field), ('answer', answer_field)]
    dataset = TabularDataset("data/data.csv", format='csv', fields=fields)
    
    question_context_field.build_vocab(dataset, min_freq=1)
    answer_field.build_vocab(dataset, min_freq=1)

    train_data, valid_data = dataset.split(split_ratio=0.8, random_state=random.seed(random_seed))
    
    return train_data, valid_data

# def train_test_split(SRC, TRG,  test_size=0.2, random_seed=42):



[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [11]:
df = loadDF("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json")

train_data, valid_data = load_datasets(df)

print(len(question_context_field.vocab))

print(len(answer_field.vocab))

102396
42794


In [12]:
df

Unnamed: 0,question_context,answer
0,To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous
1,What is in front of the Notre Dame Main Buildi...,a copper statue of Christ
2,The Basilica of the Sacred heart at Notre Dame...,the Main Building
3,What is the Grotto at Notre Dame? Architectura...,a Marian place of prayer and reflection
4,What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary
...,...,...
87594,In what US state did Kathmandu first establish...,Oregon
87595,What was Yangon previously known as? Kathmandu...,Rangoon
87596,With what Belorussian city does Kathmandu have...,Minsk
87597,In what year did Kathmandu create its initial ...,1975


In [13]:
print(df.iloc[-30]['question_context'])

print(df.iloc[-30]['answer'])

Where can a temple of the Jain faith be found? Sikhism is practiced primarily in Gurudwara at Kupundole. An earlier temple of Sikhism is also present in Kathmandu which is now defunct. Jainism is practiced by a small community. A Jain temple is present in Gyaneshwar, where Jains practice their faith. According to the records of the Spiritual Assembly of the Baha'is of Nepal, there are approximately 300 Baha'is in Kathmandu valley. They have a National Office located in Shantinagar, Baneshwor. The Baha'is also have classes for children at the National Centre and other localities in Kathmandu. Islam is practised in Kathmandu but Muslims are a minority, accounting for about 4.2% of the population of Nepal.[citation needed] It is said that in Kathmandu alone there are 170 Christian churches. Christian missionary hospitals, welfare organizations, and schools are also operating. Nepali citizens who served as soldiers in Indian and British armies, who had converted to Christianity while in se

In [14]:
import random

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, num_layers=1, dropout=0):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.hidden = torch.zeros(1, 1, hidden_size)        # How to use it ????
        # self.embedding_dim = embedding_size
        
        # self.embedding provides a vector representation of the inputs to our model
        # self.embedding = nn.Embedding(self.input_size, self.embedding_dim)
        self.embedding = nn.Embedding(input_size, embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        # self.lstm = nn.LSTM(self.embedding_dim, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, i):
        embedded = self.embedding(i)
        
        embedded = self.dropout(embedded)

        output, (hidden, cell) = self.lstm(embedded)
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        return output, hidden, cell
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size, num_layers=1, dropout=0):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
        
        # self.embedding provides a vector representation of the target to our model
        # self.embedding = nn.Embedding(self.hidden_size, self.hidden_size)  # From lesson
        
        self.embedding = nn.Embedding(output_size, embedding_size)
        # self.embedding = nn.Embedding(hidden_size, embedding_size) # Why ?
        
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        # self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))

        # self.ouput, predicts on the hidden state via a linear output layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        # print('[Decoder] input shape:', input.shape)
        
        input = input.unsqueeze(0)
        
        embedded = self.embedding(input)
        
        embedded = self.dropout(embedded)
            
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell)) 
        
        
        prediction = self.fc(output.squeeze(0))
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        
        return prediction, hidden, cell
        
        

class Seq2Seq(nn.Module):
        
    def __init__(self, encoder, decoder):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder        

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):              
        #src = [src len, batch size]
        #trg = [trg len, batch size]
                
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_size
                       
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)
        
        _, hidden, cell = self.encoder(src)
        
        # Start with <sos> tokens
        input = trg[0, :]  
                
        for t in range(1, trg_len):

            output, hidden, cell = self.decoder(input, hidden, cell)

            outputs[t] = output
            
            # get highest predicted token
            top1 = output.argmax(1)
            
            use_teacher_forcing = random.random() < teacher_forcing_ratio

            input = trg[t] if use_teacher_forcing else top1
    
        return outputs

    



In [15]:
from torch.utils.data import DataLoader, TensorDataset

# Define the model and other parameters

encoder_input_size = len(question_context_field.vocab) # vocabulary.get_size()
encoder_embedding_size= 300
# encoder_num_layers = 1
encoder_dropout = 0.5

decoder_output_size = len(answer_field.vocab) # vocabulary.get_size()
# decoder_hidden_size = encoder_hidden_size
decoder_embedding_size = 300
# decoder_num_layers = 1
decoder_dropout = 0.5

hidden_size = 512
num_layers = 2
batch_size = 64

#     def __init__(self, input_size, hidden_size, embedding_size, num_layers=1, dropout=0):
encoder = Encoder(encoder_input_size, hidden_size, encoder_embedding_size, num_layers, encoder_dropout)

#     def __init__(self, hidden_size, output_size, embedding_size, num_layers=1, dropout=0):
decoder = Decoder(hidden_size, decoder_output_size, decoder_embedding_size, num_layers, decoder_dropout)
    
model = Seq2Seq(encoder, decoder).to(device)

train_iterator, valid_iterator = BucketIterator.splits(
    (train_data, valid_data),
   batch_size=batch_size,
   sort_within_batch=True,
    sort_key = lambda x: len(x.question_context),
    device=device)
                             
# train_dataset = TensorDataset(src_train_tensor, trg_train_tensor)
# val_dataset  = TensorDataset(src_val_tensor, trg_val_tensor)

# train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
# val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time


# Define loss function and optimizer
learning_rate = 0.001
pad_idx = answer_field.vocab.stoi[answer_field.pad_token] 
criterion = nn.CrossEntropyLoss(ignore_index = pad_idx)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
num_epochs =  20
clip = 1
valid_loss_min = None

for epoch in range(num_epochs):
    start_time = time.time()
    model.train()
    train_loss = 0.0
    avg_train_loss = 0.0
    # epoch_loss = 0.0

    # Training loop
    for batch_idx, batch in enumerate(train_iterator):
        src = batch.question_context.to(device)
        trg = batch.answer.to(device)
    
#         print('src:', src.shape)
#         print('trg:', trg.shape)        
        
        optimizer.zero_grad()
        
        # src = src.unsqueeze(1)
        # trg = trg.unsqueeze(1)
        
        # print (src.shape)
        # print (trg.shape)
        
        # Pass the source sequences through the encoder
        output = model(src, trg) # output: [trg length, batch size, output dim]
        # print('output:', output.shape)
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
#         print('output 2:', output.shape)
#         print('trg 2:', trg.shape)
        
#         print('output content:', output)
#         print('trg content:', trg)
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        # print(train_loss)
        
        train_loss += loss.item()
        
        # break
        
    average_train_loss = train_loss / len(train_iterator)
    
    end_time = time.time()    
    train_elapsed_time = end_time - start_time
    
    print(f'Epoch: {epoch+1:02} | Train Time: {train_elapsed_time}s')
        
          
    # Evaluation on the test dataset
    start_time = time.time()
    model.eval()
    with torch.no_grad():
        valid_loss = 0.0

        for batch_idx, batch in enumerate(valid_iterator):
            src = batch.question_context.to(device)
            trg = batch.answer.to(device)                    
            
            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]

            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            valid_loss += loss.item()
            
        average_val_loss = valid_loss / len(valid_iterator)
                    
    end_time = time.time()    
    val_elapsed_time = end_time - start_time
    
    # print(f'Eval Time: {elapsed_time}')
    print(f'Epoch: {epoch+1:02} | Eval Time: {val_elapsed_time}s')
    
    if valid_loss_min is None or (
            (valid_loss_min - average_val_loss) / valid_loss_min > 0.01
    ):
        print(f"New minimum validation loss: {average_val_loss:.6f}. Saving model ...")
        
        torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': train_loss,
            }, 'checkpoints/best_val_loss.pt')

        valid_loss_min = average_val_loss
        
    print(f"Epoch: {epoch+1:02}, Train Loss: {average_train_loss:.3f}, Val Loss: {average_val_loss:.3f}")



Epoch: 01 | Train Time: 177.93499732017517s
Epoch: 01 | Eval Time: 22.525410413742065s
New minimum validation loss: 6.887178. Saving model ...
Epoch: 01, Train Loss: 6.981, Val Loss: 6.887
Epoch: 02 | Train Time: 173.65457701683044s
Epoch: 02 | Eval Time: 21.290350198745728s
New minimum validation loss: 6.986544. Saving model ...
Epoch: 02, Train Loss: 6.581, Val Loss: 6.987
Epoch: 03 | Train Time: 172.9591293334961s
Epoch: 03 | Eval Time: 21.31435537338257s
New minimum validation loss: 7.041733. Saving model ...
Epoch: 03, Train Loss: 6.466, Val Loss: 7.042
Epoch: 04 | Train Time: 171.75205731391907s
Epoch: 04 | Eval Time: 21.43007493019104s
New minimum validation loss: 7.077194. Saving model ...
Epoch: 04, Train Loss: 6.376, Val Loss: 7.077


In [99]:
# print(batch.question_context)

print(batch.answer)

print(answer_field.vocab.itos[980])

tensor([[   2],
        [1818],
        [ 124],
        [   6],
        [1289],
        [   3]], device='cuda:0')
mexico
