# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.
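One way to picture how pretrained embeddings line up with a vocabulary is the minimal sketch below. It is an illustration under stated assumptions, not the starter code's method: `build_embedding_matrix` is a hypothetical helper, and the `pretrained` dictionary is a toy stand-in for vectors that would normally be looked up from a trained Gensim model.

```python
import random

def build_embedding_matrix(index2word, pretrained, dim, seed=0):
    """Return one vector per vocabulary index, copying pretrained vectors where available."""
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in index2word:
        if word in pretrained:
            matrix.append(list(pretrained[word]))
            hits += 1
        else:
            # No pretrained vector for this word: fall back to small random values
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix, hits

# Hypothetical stand-in for vectors taken from a trained Word2Vec model
pretrained = {"the": [0.1, 0.2, 0.3], "school": [0.4, 0.5, 0.6]}
index2word = ["<pad>", "<sos>", "<eos>", "the", "school", "Grotto"]

matrix, hits = build_embedding_matrix(index2word, pretrained, dim=3)
print(len(matrix), hits)
```

The resulting matrix has one row per vocabulary index, so it could be converted to a float tensor and handed to `nn.Embedding.from_pretrained`. Comparing a model initialized this way against one with randomly initialized embeddings is the experiment suggested above.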



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden state to the Decoder. The Decoder then uses that state to output a series of token predictions.
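The data flow described above can be sketched with a few PyTorch layers. This is a minimal illustration with toy sizes chosen only for readability; the variable names and the start-of-sentence id of 1 are assumptions, and the project's actual Encoder and Decoder classes wrap the same pieces in `nn.Module` subclasses.

```python
import torch
import torch.nn as nn

# Toy sizes chosen only for illustration
vocab_size, embedding_size, hidden_size = 20, 8, 16
src_len, batch_size = 5, 2

enc_embedding = nn.Embedding(vocab_size, embedding_size)
enc_lstm = nn.LSTM(embedding_size, hidden_size)
dec_embedding = nn.Embedding(vocab_size, embedding_size)
dec_lstm = nn.LSTM(embedding_size, hidden_size)
dec_out = nn.Linear(hidden_size, vocab_size)

# Encoder: source token ids [src len, batch size] -> final (hidden, cell) state
src = torch.randint(0, vocab_size, (src_len, batch_size))
_, (hidden, cell) = enc_lstm(enc_embedding(src))

# Decoder: a single step, starting from a start-of-sentence id (assumed to be 1)
step_input = torch.full((1, batch_size), 1, dtype=torch.long)
dec_output, (hidden, cell) = dec_lstm(dec_embedding(step_input), (hidden, cell))
logits = dec_out(dec_output.squeeze(0))  # [batch size, vocab size]
next_token = logits.argmax(dim=1)        # greedy choice for each batch element

print(hidden.shape, logits.shape, next_token.shape)
```

Repeating the decoder step, feeding either `next_token` or the ground-truth token back in, produces the full output sequence.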

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the page where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [18]:
# Project Steps Overview and Estimated Duration
# Below are the components of the project, with an estimated time to complete each one.
# These are rough estimates, not exact timings, to help you plan how much time to set aside for the project.

# Prepare data (~2 hours)
# Build your vocabulary from a corpus of language data. The Vocabulary object is described in Lesson Six: Seq2Seq.

# Build Model (~4 hours)
# Build your Encoder, Decoder, and larger Sequence to Sequence pattern in PyTorch. This pattern is described in Lesson Six: Seq2Seq.

# Train Model (~3 hours)
# Write your training procedure and divide your dataset into train/test/validation splits. Then, train your network and plot your evaluation metrics. Save your model after it reaches a satisfactory level of accuracy.

# Evaluate & Interact w/ Model (~1 hour)
# Write a script to interact with your network at the command line.
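The train/test/validation division mentioned in the training step can be sketched in plain Python as follows. This is only an illustration: `three_way_split`, its default fractions, and the toy question/answer pairs are assumptions rather than part of the starter code.

```python
import random

def three_way_split(pairs, val_fraction=0.1, test_fraction=0.1, seed=42):
    """Shuffle (source, target) pairs and split them into train, validation, and test lists."""
    items = list(pairs)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_fraction)
    n_val = int(len(items) * val_fraction)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Toy data standing in for question/answer pairs
pairs = [(f"question {i}", f"answer {i}") for i in range(100)]
train, val, test = three_way_split(pairs)
print(len(train), len(val), len(test))
```

Keeping source and target together as pairs while shuffling guarantees that each question stays aligned with its answer in every split.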

In [19]:
# Instructions Summary
# The LSTM Chatbot will help you show off your skills as a deep learning practitioner. You will develop the chatbot using a new architecture called Seq2Seq. 
# Additionally, you can use pre-trained word embeddings to improve the performance of your model. Let's get started by following the steps below:

# Step 1: Build your Vocabulary & create the Word Embeddings
# The most important part of this step is to create your Vocabulary object using a corpus of data drawn from TorchText.

# (Extra Credit)
# Use Gensim to extract the word embeddings from one of its corpora.
# Use NLTK and Gensim to create a function to clean your text and look up the index of a word's embeddings.

# Step 2: Create the Encoder
# A Seq2Seq architecture consists of an encoder and a decoder unit. You will use PyTorch to build a full Seq2Seq model.
# The first step of the architecture is to create an encoder with an LSTM unit.

# (Extra Credit)
# Load your pretrained embeddings into the encoder's embedding layer, which feeds the LSTM unit.

# Step 3: Create the Decoder
# The second step of the architecture is to create a decoder using a second LSTM unit.

# Step 4: Combine them into a Seq2Seq Architecture
# To finalize your model, you will combine the encoder and decoder units into a working model.
# The Seq2Seq model must be able to instantiate the encoder and decoder. Then, it will accept the inputs for these units and manage their interaction to get an output using the forward pass function.

# Step 5: Train & evaluate your model
# Finally, you will train and evaluate your model using a PyTorch training loop.

# Step 6: Interact with the Chatbot
# Demonstrate your chatbot by converting the outputs of the model to text and displaying its responses at the command line.
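Step 6 could look roughly like the sketch below. It is a hedged outline, not the project's required interface: `ids_to_text` and `chat` are hypothetical helpers, the end-of-sentence id of 2 is an assumption, and `respond` stands in for whatever function runs the trained model on a line of input.

```python
EOS_INDEX = 2  # assumed end-of-sentence id

def ids_to_text(ids, index2word, special=("<pad>", "<sos>", "<eos>")):
    """Convert predicted token ids to a string, stopping at the first end-of-sentence id."""
    words = []
    for i in ids:
        if i == EOS_INDEX:
            break
        word = index2word[i]
        if word not in special:
            words.append(word)
    return " ".join(words)

def chat(respond, read=input, write=print):
    """Read lines until the user types 'quit', writing respond(line) for each one."""
    while True:
        line = read("> ")
        if line.strip().lower() == "quit":
            break
        write(respond(line))

index2word = ["<pad>", "<sos>", "<eos>", "Saint", "Bernadette", "Soubirous"]
print(ids_to_text([1, 3, 4, 5, 2, 0, 0], index2word))
```

Passing `read` and `write` as parameters keeps the loop testable without a terminal; calling `chat(respond)` with the defaults gives the command-line behaviour.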

In [20]:
# Pre-requisite: Select PyTorch 2.00 kernel

# Install requirements
# !pip install gensim==4.3.1 nltk==3.8.1 torchtext torchdata portalocker | grep -v "already satisfied"

# Quote the portalocker requirement so the shell does not treat ">=" as an output redirect
!pip install gensim==4.3.1 nltk==3.8.1 torchtext==0.15.1 "portalocker>=2.0.0" | grep -v "already satisfied"

# torchtext==0.12.0 --> torch==1.11.0
# torchtext==0.13.0 --> torch==1.12.0
# torchtext==0.14.0 --> torch==1.13.0
# torchtext==0.15.1 --> torch==2.0.0
# torchtext==0.15.2 --> torch==2.0.1

# !pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.10.0 | grep -v "already satisfied"

# !pip install gensim==4.2.0 nltk torchtext  | grep -v "already satisfied"


In [21]:
# from torchtext.legacy import data

import torchtext

print(torchtext.__version__)

from torchtext.datasets import Multi30k
# from torchtext.data import Field

0.15.1


In [22]:
from nltk.stem.porter import *
from nltk.stem import *
from nltk.tokenize import RegexpTokenizer
from torch.utils.data import Dataset

PAD_index = 0  # Used for padding short sentences
SOS_index = 1  # Start-of-sentence token
EOS_index = 2  # End-of-sentence token
CONTEXT_index = 3  # Marks the start of the context passage
QUESTION_index = 4  # Marks the start of the question

class Vocabulary:
    def __init__(self):        
        self.word2index = {}
        # self.index2word = []
        self.index2word =  [ "<pad>", "<sos>",  "<eos>",  "<context>", "<question>"]

    def load(self, words):
        for word in words:
            self.add_word(word)

    def clean_text(self, text):
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)
        return text

    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = len(self.index2word)
            self.index2word.append(word)

    def get_index(self, word):
        return self.word2index.get(word, None)

    def get_word(self, index):
        if index >= 0 and index < len(self.index2word):
            return self.index2word[index]
        return None
    
    def get_size(self):
        return len(self.index2word)
    
    
class Seq2SeqDataset(Dataset):
    def __init__(self, dataframe):
        self.data = dataframe
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, index):
        question_ids = self.data.loc[index, 'question_ids']
        answer_ids = self.data.loc[index, 'answer_ids']
        
        return torch.tensor(question_ids), torch.tensor(answer_ids)
    
    
class VocabularyNew:
    def __init__(self, name): 
        self.name = name
        self.index = {}   # Index to word     
        self.words = {}  # Word to index
        self.count = 0

    def clean_text(self, text):
        tokenizer = RegexpTokenizer(r'\w+')
        text = tokenizer.tokenize(text)
        return text

    def index_word(self, word):
        # Add the word if it is new; return True only when it was added
        if word not in self.words:
            self.words[word] = self.count
            self.index[str(self.count)] = word
            self.count += 1
            return True
        else:
            return False
    
vocabulary = Vocabulary()
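Because `get_index` returns `None` for a word that was never added, unseen words at chat time cannot be placed in a tensor. One common way around this is an unknown-word token. The sketch below assumes a hypothetical `<unk>` entry and a simplified `TinyVocabulary` class; neither is part of the notebook's `Vocabulary`.

```python
import re

class TinyVocabulary:
    """Simplified vocabulary with an assumed <unk> token for unseen words."""

    def __init__(self):
        self.index2word = ["<pad>", "<sos>", "<eos>", "<unk>"]
        self.word2index = {w: i for i, w in enumerate(self.index2word)}

    def add_sentence(self, sentence):
        for word in re.findall(r"\w+", sentence):
            if word not in self.word2index:
                self.word2index[word] = len(self.index2word)
                self.index2word.append(word)

    def encode(self, sentence):
        # Unseen words map to <unk>; the sequence is wrapped in <sos> ... <eos>
        unk = self.word2index["<unk>"]
        ids = [self.word2index.get(w, unk) for w in re.findall(r"\w+", sentence)]
        return [self.word2index["<sos>"]] + ids + [self.word2index["<eos>"]]

    def decode(self, ids):
        return [self.index2word[i] for i in ids]

vocab = TinyVocabulary()
vocab.add_sentence("What is the Grotto at Notre Dame?")
ids = vocab.encode("What is the Basilica?")
print(ids)
print(vocab.decode(ids))
```

Registering the special tokens in both `index2word` and `word2index` also keeps the two lookups consistent, so a special token resolves to the same id in either direction.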

In [23]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from torchtext.datasets import SQuAD1
import sklearn.model_selection 

nltk.download('brown')
nltk.download('punkt')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Output, save, and load brown embeddings

# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

# w2v = gensim.models.Word2Vec.load('brown.embedding')

SQUAD1_CONTEXT_INDEX = 0
SQUAD1_QUESTION_INDEX = 1
SQUAD1_ANSWER_INDEX = 2

def loadDF(path):    
    torchtext.utils.download_from_url(path, root="data")
    dataset = SQuAD1(root="data", split="train")
    
    data = [
        (row[SQUAD1_CONTEXT_INDEX], row[SQUAD1_QUESTION_INDEX], row[SQUAD1_ANSWER_INDEX][0]) for row in dataset
    ]

    df = pd.DataFrame.from_records(data, columns=['context', 'question', 'answer'])    
    
    df['context_tokens'] = df['context'].apply(prepare_text)
    df['question_tokens'] = df['question'].apply(prepare_text)
    df['answer_tokens'] = df['answer'].apply(prepare_text)        
        
    df['context_ids'] = df['context_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
    df['question_ids'] = df['question_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
    df['answer_ids'] = df['answer_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])

    df['full_question_ids'] = df.apply(lambda row: [CONTEXT_index] + row['context_ids'] + [QUESTION_index] + row['question_ids'], axis=1)    
    
    #  Should append  <sos> and <eos>  ?
    df['full_question_ids'] = df['full_question_ids'].apply(lambda x: [SOS_index] + x + [EOS_index])
    df['answer_ids'] = df['answer_ids'].apply(lambda x: [SOS_index] + x + [EOS_index])

    return df


def prepare_text(sentence):    
    # Note: NLTK's word_tokenize preserves case, so "What" and "what" become separate vocabulary entries
    tokens = word_tokenize(sentence)
    
    vocabulary.load(tokens)    
    
    return tokens

# def prepare_dataset(df):
#     df['context_tokens'] = df['context'].apply(prepare_text)
#     df['question_tokens'] = df['question'].apply(prepare_text)
#     df['answer_tokens'] = df['answer'].apply(prepare_text)
    
#     df['context_ids'] = df['context_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['question_ids'] = df['question_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['answer_ids'] = df['answer_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])

#     df['full_question_ids'] = df.apply(lambda row: [CONTEXT_index] + row['context_ids'] + [QUESTION_index] + row['question_ids'], axis=1)    

def train_test_split(SRC, TRG, test_size=0.2, random_seed=42):
    from sklearn.model_selection import train_test_split
    
    SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset= train_test_split(SRC, TRG, test_size=test_size, random_state=random_seed)
    
    # Return the training and validation datasets
    return SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset

# def train_test_split(SRC, TRG,  test_size=0.2, random_seed=42):
    
#     '''
#     Input: SRC, our list of questions from the dataset
#             TRG, our list of responses from the dataset

#     Output: Training and test datasets for SRC & TRG

#     '''
# #     np.random.seed(random_seed)
# #     indices = np.arange(len(SRC))
# #     np.random.shuffle(indices)
# #     split_idx = int(len(indices) * (1 - test_size))
    
# #     SRC_train_dataset = [SRC[i] for i in indices[:split_idx]]
# #     SRC_test_dataset = [SRC[i] for i in indices[split_idx:]]
# #     TRG_train_dataset = [TRG[i] for i in indices[:split_idx]]
# #     TRG_test_dataset = [TRG[i] for i in indices[split_idx:]]
    
#     SRC_train, SRC_test, TRG_train, TRG_test = sklearn.model_selection.train_test_split(SRC, TRG, test_size=test_size, random_state=random_seed)
    
#     # Convert the training and test datasets into PyTorch tensors
#     SRC_train_dataset = Seq2SeqDataset(pd.DataFrame({'question_ids': SRC_train, 'answer_ids': TRG_train}))
#     SRC_test_dataset = Seq2SeqDataset(pd.DataFrame({'question_ids': SRC_test, 'answer_ids': TRG_test}))
    
    
#     return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [24]:
df = loadDF("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json")

In [25]:
df

Unnamed: 0,context,question,answer,context_tokens,question_tokens,answer_tokens,context_ids,question_ids,answer_ids,full_question_ids
0,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous,"[Architecturally, ,, the, school, has, a, Cath...","[To, whom, did, the, Virgin, Mary, allegedly, ...","[Saint, Bernadette, Soubirous]","[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 15, 16,...","[2918, 2814, 3023, 7, 24, 25, 9089, 3334, 27, ...","[1, 65, 66, 67, 2]","[1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 1..."
1,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,a copper statue of Christ,"[Architecturally, ,, the, school, has, a, Cath...","[What, is, in, front, of, the, Notre, Dame, Ma...","[a, copper, statue, of, Christ]","[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 15, 16,...","[2289, 20, 27, 28, 23, 7, 91, 92, 15, 16, 1505]","[1, 10, 32, 22, 23, 33, 2]","[1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 1..."
2,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,the Main Building,"[Architecturally, ,, the, school, has, a, Cath...","[The, Basilica, of, the, Sacred, heart, at, No...","[the, Main, Building]","[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 15, 16,...","[99, 46, 23, 7, 47, 7692, 59, 91, 92, 20, 4475...","[1, 7, 15, 16, 2]","[1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 1..."
3,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,a Marian place of prayer and reflection,"[Architecturally, ,, the, school, has, a, Cath...","[What, is, the, Grotto, at, Notre, Dame, ?]","[a, Marian, place, of, prayer, and, reflection]","[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 15, 16,...","[2289, 20, 7, 51, 59, 91, 92, 1505]","[1, 10, 52, 53, 23, 54, 29, 55, 2]","[1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 1..."
4,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary,"[Architecturally, ,, the, school, has, a, Cath...","[What, sits, on, top, of, the, Main, Building,...","[a, golden, statue, of, the, Virgin, Mary]","[5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 15, 16,...","[2289, 23850, 135, 500, 23, 7, 15, 16, 59, 91,...","[1, 10, 21, 22, 23, 7, 24, 25, 2]","[1, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 7, 1..."
...,...,...,...,...,...,...,...,...,...,...
87594,"Kathmandu Metropolitan City (KMC), in order to...",In what US state did Kathmandu first establish...,Oregon,"[Kathmandu, Metropolitan, City, (, KMC, ), ,, ...","[In, what, US, state, did, Kathmandu, first, e...",[Oregon],"[109309, 4411, 1214, 73, 109676, 83, 6, 27, 32...","[166, 3581, 2799, 2325, 3023, 109309, 322, 340...","[1, 4434, 2]","[1, 3, 109309, 4411, 1214, 73, 109676, 83, 6, ..."
87595,"Kathmandu Metropolitan City (KMC), in order to...",What was Yangon previously known as?,Rangoon,"[Kathmandu, Metropolitan, City, (, KMC, ), ,, ...","[What, was, Yangon, previously, known, as, ?]",[Rangoon],"[109309, 4411, 1214, 73, 109676, 83, 6, 27, 32...","[2289, 179, 71358, 351, 472, 113, 1505]","[1, 71359, 2]","[1, 3, 109309, 4411, 1214, 73, 109676, 83, 6, ..."
87596,"Kathmandu Metropolitan City (KMC), in order to...",With what Belorussian city does Kathmandu have...,Minsk,"[Kathmandu, Metropolitan, City, (, KMC, ), ,, ...","[With, what, Belorussian, city, does, Kathmand...",[Minsk],"[109309, 4411, 1214, 73, 109676, 83, 6, 27, 32...","[916, 3581, 81077, 4133, 157, 109309, 142, 10,...","[1, 78279, 2]","[1, 3, 109309, 4411, 1214, 73, 109676, 83, 6, ..."
87597,"Kathmandu Metropolitan City (KMC), in order to...",In what year did Kathmandu create its initial ...,1975,"[Kathmandu, Metropolitan, City, (, KMC, ), ,, ...","[In, what, year, did, Kathmandu, create, its, ...",[1975],"[109309, 4411, 1214, 73, 109676, 83, 6, 27, 32...","[166, 3581, 133, 3023, 109309, 694, 207, 4129,...","[1, 11592, 2]","[1, 3, 109309, 4411, 1214, 73, 109676, 83, 6, ..."


In [26]:
df['full_question_ids'].dtype

dtype('O')

In [27]:
SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset = train_test_split(df['full_question_ids'], df['answer_ids'])

In [28]:
SRC_train_dataset

41171    [1, 3, 1431, 936, 10, 5160, 521, 64806, 1182, ...
11421    [1, 3, 18568, 17, 3429, 27, 7, 4557, 5269, 23,...
45787    [1, 3, 56, 9, 349, 6467, 76, 69760, 48509, 754...
38917    [1, 3, 1431, 10844, 152, 5572, 6, 6813, 12142,...
72419    [1, 3, 1116, 12744, 6, 12611, 17, 4246, 726, 1...
                               ...                        
6265     [1, 3, 18553, 1083, 12347, 45, 17833, 2980, 17...
54886    [1, 3, 79525, 9, 349, 4572, 27, 6174, 268, 210...
76820    [1, 3, 97316, 20, 472, 200, 1025, 10, 100929, ...
860      [1, 3, 2381, 17, 691, 9, 2458, 1311, 2516, 377...
15795    [1, 3, 5841, 7, 1517, 23, 7, 215, 4665, 6, 155...
Name: full_question_ids, Length: 70079, dtype: object

In [29]:
len(SRC_train_dataset)

70079

In [36]:
from torch.nn.utils.rnn import pad_sequence

def pad_series_to_tensor(series):
    # Convert Pandas Series to list of tensors
    series_list = [torch.tensor(lst) for lst in series]

    # Pad sequences with PAD_index (0) up to the longest sequence in the series
    padded = pad_sequence(series_list, batch_first=True, padding_value=PAD_index)

    # pad_sequence already returns a tensor; wrapping it in torch.tensor() again
    # only copies the data and triggers a UserWarning
    return padded

src_train_tensor = pad_series_to_tensor(SRC_train_dataset).to(device)
trg_train_tensor = pad_series_to_tensor(TRG_train_dataset).to(device)

src_val_tensor = pad_series_to_tensor(SRC_val_dataset).to(device)
trg_val_tensor = pad_series_to_tensor(TRG_val_dataset).to(device)

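# Right-pad the training targets with PAD_index so they match the source length (784 here)
trg_train_tensor = torch.cat([trg_train_tensor, torch.zeros(trg_train_tensor.size(0), src_train_tensor.size(1) - trg_train_tensor.size(1), dtype=torch.long).to(device)], dim=1)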






In [38]:
trg_train_tensor.shape

torch.Size([70079, 784])

In [39]:
src_train_tensor.shape

torch.Size([70079, 784])
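Padding every example to the global maximum of 784 tokens means short questions carry hundreds of padding positions. An alternative is to pad each batch only to its own longest sequence. The sketch below is one possible approach, assuming a hypothetical `collate_batch` function and a padding id of 0; the id lists are toy examples.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0  # assumed to match the padding id used in the notebook

def collate_batch(batch):
    """Pad a list of (src_ids, trg_ids) pairs to the longest sequence in this batch."""
    src_seqs = [torch.tensor(src, dtype=torch.long) for src, _ in batch]
    trg_seqs = [torch.tensor(trg, dtype=torch.long) for _, trg in batch]
    # batch_first=False yields [seq len, batch size], the default layout for nn.LSTM
    src = pad_sequence(src_seqs, batch_first=False, padding_value=PAD_INDEX)
    trg = pad_sequence(trg_seqs, batch_first=False, padding_value=PAD_INDEX)
    return src, trg

# Toy id lists standing in for rows of the DataFrame
pairs = [([1, 3, 5, 6, 2], [1, 65, 2]), ([1, 3, 7, 2], [1, 10, 32, 22, 2])]
src, trg = collate_batch(pairs)
print(src.shape, trg.shape)
```

A function like this could be passed to `DataLoader` through its `collate_fn` argument, with the dataset yielding unpadded id lists.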

In [40]:
src_train_tensor

tensor([[    1,     3,  1431,  ...,     0,     0,     0],
        [    1,     3, 18568,  ...,     0,     0,     0],
        [    1,     3,    56,  ...,     0,     0,     0],
        ...,
        [    1,     3, 97316,  ...,     0,     0,     0],
        [    1,     3,  2381,  ...,     0,     0,     0],
        [    1,     3,  5841,  ...,     0,     0,     0]], device='cuda:0')

In [41]:
for src, trg in zip(src_train_tensor, trg_train_tensor):
    display(src.shape)
    # display(src)
    
    src = src.unsqueeze(1)
    
    display(src.shape)
    
    break

torch.Size([784])

torch.Size([784, 1])

In [42]:
vocabulary.get_size()

115678
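A vocabulary of this size is partly a consequence of keeping case and rare words. As a hedged illustration of how the count could be reduced, the sketch below lowercases tokens and keeps only those seen at least `min_count` times. `build_word_list` and its threshold are assumptions, not part of the notebook.

```python
from collections import Counter

def build_word_list(token_lists, min_count=2, lowercase=True):
    """Count tokens across all sentences and keep those seen at least min_count times."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(t.lower() if lowercase else t for t in tokens)
    return sorted(word for word, count in counts.items() if count >= min_count)

# Toy token lists in the style of the tokenized questions
token_lists = [
    ["What", "is", "the", "Grotto", "?"],
    ["what", "is", "in", "front", "?"],
]
kept = build_word_list(token_lists)
print(kept)
```

Words that fall below the threshold would then need a fallback id, such as an unknown-word token, when sentences are converted to ids.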

In [43]:
import random

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, num_layers=1, dropout=0):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.hidden = torch.zeros(1, 1, hidden_size)        # Unused: nn.LSTM starts from zero hidden and cell states when none are passed
        # self.embedding_dim = embedding_size
        
        # self.embedding provides a vector representation of the inputs to our model
        # self.embedding = nn.Embedding(self.input_size, self.embedding_dim)
        self.embedding = nn.Embedding(input_size, embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        # self.lstm = nn.LSTM(self.embedding_dim, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, i):
        embedded = self.embedding(i)
        
        embedded = self.dropout(embedded)

        output, (h, c) = self.lstm(embedded)
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        return output, h, c
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size, num_layers=1, dropout=0):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
        
        # self.embedding provides a vector representation of the target to our model
        # self.embedding = nn.Embedding(self.hidden_size, self.hidden_size)  # From lesson
        
        self.embedding = nn.Embedding(output_size, embedding_size) 
        # self.embedding = nn.Embedding(hidden_size, embedding_size) # Why ?
        
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        # self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))

        # self.fc, predicts on the hidden state via a linear output layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        # print('[Decoder] input shape:', input.shape)
        
        input = input.unsqueeze(0)
        
        embedded = self.embedding(input)
        
        embedded = self.dropout(embedded)
            
        # embedded = [1, batch size, embedding size]; nn.LSTM accepts only 2-D or 3-D input
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        
        
        prediction = self.fc(output.squeeze(0))
        
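        '''
        Inputs: input, the current target token ids
                hidden and cell, the previous LSTM state
        Outputs: prediction, the scores over the vocabulary
                hidden, the hidden state
                cell, the cell state
        '''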
        
        return prediction, hidden, cell
        
        

class Seq2Seq(nn.Module):
        
    def __init__(self, encoder, decoder):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder        

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):              
        #src = [src len, batch size]
        #trg = [trg len, batch size]
                
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_size
                       
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)
        
        _, hidden, cell = self.encoder(src)
        
        # Start with <sos> tokens
        input = trg[0, :]  
                
        for t in range(1, trg_len):

            output, hidden, cell = self.decoder(input, hidden, cell)

            outputs[t] = output
            
            # get highest predicted token
            top1 = output.argmax(1)            
            
            use_teacher_forcing = random.random() < teacher_forcing_ratio

            input = trg[t] if use_teacher_forcing else top1
    
        return outputs
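At chat time there is no target sequence, so the decoder has to feed its own predictions back in until it emits an end-of-sentence token. The sketch below shows that greedy loop in plain Python. It assumes start and end ids of 1 and 2, and `step` is a hypothetical stand-in for one decoder call that returns scores over the vocabulary plus the updated state.

```python
SOS_INDEX, EOS_INDEX = 1, 2  # assumed special-token ids

def greedy_decode(step, state, max_len=20):
    """Repeatedly pick the highest-scoring token until <eos> or max_len is reached."""
    token, result = SOS_INDEX, []
    for _ in range(max_len):
        scores, state = step(token, state)
        token = max(range(len(scores)), key=scores.__getitem__)
        if token == EOS_INDEX:
            break
        result.append(token)
    return result

def toy_step(token, state):
    """Hypothetical decoder step: emits a fixed script of token ids, then <eos>."""
    script = [3, 4, 5, EOS_INDEX]
    scores = [0.0] * 6
    scores[script[state]] = 1.0
    return scores, state + 1

print(greedy_decode(toy_step, 0))
```

With the real model, `step` would wrap a call to the decoder followed by `argmax`, and `state` would be the `(hidden, cell)` pair returned by the encoder. The `max_len` cap guards against a model that never predicts the end-of-sentence token.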

    



In [46]:
from torch.utils.data import DataLoader, TensorDataset

# Define the model and other parameters
encoder_input_size = vocabulary.get_size()
encoder_embedding_size= 300
# encoder_num_layers = 1
encoder_dropout = 0.5

decoder_output_size = vocabulary.get_size()
# decoder_hidden_size = encoder_hidden_size
decoder_embedding_size = 300
# decoder_num_layers = 1
decoder_dropout = 0.5

hidden_size = 512
num_layers = 2
batch_size = 4

#     def __init__(self, input_size, hidden_size, embedding_size, num_layers=1, dropout=0):
encoder = Encoder(encoder_input_size, hidden_size, encoder_embedding_size, num_layers, encoder_dropout)

#     def __init__(self, hidden_size, output_size, embedding_size, num_layers=1, dropout=0):
decoder = Decoder(hidden_size, decoder_output_size, decoder_embedding_size, num_layers, decoder_dropout)
    
model = Seq2Seq(encoder, decoder)

train_dataset = TensorDataset(src_train_tensor, trg_train_tensor)
val_dataset  = TensorDataset(src_val_tensor, trg_val_tensor)

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False)
val_dataloader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
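To get a feel for how large this configuration is, the parameter count can be estimated by hand. The sketch below is a back-of-the-envelope calculation that assumes standard unidirectional `nn.LSTM` layers (four gates, two bias vectors per layer) and the sizes defined in this cell; `lstm_params` is a hypothetical helper.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM stack with two bias vectors per layer."""
    total = 0
    for layer in range(num_layers):
        layer_input = input_size if layer == 0 else hidden_size
        # Four gates, each with input weights, recurrent weights, and two biases
        total += 4 * (layer_input * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total

vocab_size = 115678   # value printed by vocabulary.get_size()
embedding_size = 300
hidden_size = 512
num_layers = 2

embedding = vocab_size * embedding_size           # one table each for encoder and decoder
lstm = lstm_params(embedding_size, hidden_size, num_layers)
linear = hidden_size * vocab_size + vocab_size    # decoder output layer weights and bias

total = 2 * embedding + 2 * lstm + linear
print(f"embeddings: {2 * embedding:,}  lstm: {2 * lstm:,}  linear: {linear:,}  total: {total:,}")
```

Under these assumptions the two embedding tables and the output layer dominate the total, which is one reason a smaller vocabulary speeds up training considerably.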

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time

# Create an instance of your Seq2Seq model
model = model.to(device)


# Define loss function and optimizer
learning_rate = 0.001
criterion = nn.CrossEntropyLoss(ignore_index = PAD_index)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
num_epochs =  20
clip = 1

# Training loop
for epoch in range(num_epochs):
    start_time = time.time()
    model.train()
    train_loss = 0.0
    epoch_loss = 0.0
    
    for src, trg in train_dataloader:
        src = src.to(device)
        trg = trg.to(device)

        # The DataLoader yields [batch size, seq len]; the model expects [seq len, batch size]
        src = src.transpose(0, 1)
        trg = trg.transpose(0, 1)

        optimizer.zero_grad()

        # Run the full Seq2Seq forward pass (encoder, then step-by-step decoder)
        output = model(src, trg)

        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]

        output_dim = output.shape[-1]

        # Skip the <sos> position; use reshape (not view) because the transposed trg is non-contiguous
        output = output[1:].reshape(-1, output_dim)
        trg = trg[1:].reshape(-1)
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        train_loss = loss.item()       
        # print(train_loss)
        
        epoch_loss += train_loss
        
    # Average the per-batch mean losses over the number of batches
    average_train_loss = epoch_loss / len(train_dataloader)
    
    end_time = time.time()    
    elapsed_time = end_time - start_time
    
    print(f'Train Time: {elapsed_time}')
          
          
    # Evaluation on the validation dataset
    start_time = time.time()
    model.eval()
    with torch.no_grad():
        val_loss = 0.0
        
        for src, trg in val_dataloader:
        # for src, trg in zip(src_val_tensor, trg_val_tensor):
            src = src.to(device)
            trg = trg.to(device)
            
            src = src.transpose(0, 1)
            trg = trg.transpose(0, 1)
            # src = src.unsqueeze(1)
            # trg = trg.unsqueeze(1)
        
            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]

            # Use reshape (not view) because the transposed trg is non-contiguous
            output = output[1:].reshape(-1, output_dim)
            trg = trg[1:].reshape(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            val_loss += loss.item()
            
        # Average the per-batch mean losses over the number of batches
        average_val_loss = val_loss / len(val_dataloader)
        
    end_time = time.time()    
    elapsed_time = end_time - start_time
    print(f'Eval Time: {elapsed_time}')
    
    # Print or log the epoch loss and evaluation metric
    print(f"Epoch: {epoch+1}, Train Loss: {average_train_loss:.3f}, Val Loss: {average_val_loss:.3f}")
    # print(f"Epoch: {epoch+1}, Train Loss: {average_loss}, Metric: {evaluation_metric}")

# for epoch in range(num_epochs):
#     model.train()
#     epoch_loss = 0.0
    
#     for src, trg in zip(SRC_train_dataset, TRG_train_dataset):
#         optimizer.zero_grad()
        
#         # Pass the source sequences through the encoder
#         encoder_outputs, hidden = model(src, trg)
#         decoder_hidden = hidden
        
#         # Initialize the decoder input with SOS token
#         decoder_input = torch.tensor([[SOS_token]])
        
#         # Iterate over target sequences one timestep at a time
#         for t in range(trg.size(0)):
#             decoder_output, decoder_hidden = model.decoder(decoder_input, decoder_hidden)
            
#             # Calculate loss between decoder output and target
#             loss = criterion(decoder_output.squeeze(0), trg[t])
            
#             epoch_loss += loss.item()
            
#             # Backpropagation and parameter update
#             loss.backward()
#             optimizer.step()
            
#             # Update decoder input with the current target token
#             decoder_input = trg[t].unsqueeze(0)
    
#     # Calculate average epoch loss
#     average_loss = epoch_loss / len(SRC_train_dataset)
    
#     # Evaluation on the test dataset
#     model.eval()
#     with torch.no_grad():
#         # Perform inference on the test dataset
#         # Calculate evaluation metric (e.g., accuracy, BLEU score)
#         evaluation_metric = calculate_evaluation_metric(model, SRC_test_dataset, TRG_test_dataset)
    
#     # Print or log the epoch loss and evaluation metric
#     print(f"Epoch: {epoch+1}, Loss: {average_loss}, Metric: {evaluation_metric}")
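The project steps ask for the model to be saved once it performs well enough. One simple policy is to save whenever the validation loss improves and to stop after a few epochs without improvement. The sketch below illustrates that bookkeeping only; `BestLossTracker`, its `patience` parameter, and the loss values are made up for illustration.

```python
class BestLossTracker:
    """Track the best validation loss and signal when to save or stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        """Return True if val_loss is a new best (a natural point to save a checkpoint)."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
            return True
        self.bad_epochs += 1
        return False

    @property
    def should_stop(self):
        return self.bad_epochs >= self.patience

# Made-up validation losses, for illustration only
tracker = BestLossTracker(patience=2)
for epoch, loss in enumerate([5.1, 4.7, 4.8, 4.9, 4.6], start=1):
    improved = tracker.update(loss)
    print(epoch, loss, "save" if improved else "skip")
    if tracker.should_stop:
        break
```

In the training loop, the branch where `update` returns `True` is where a call such as `torch.save(model.state_dict(), path)` would go, with `path` chosen by you.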


In [33]:
learning_rate = 0.001
criterion = nn.CrossEntropyLoss(ignore_index=PAD_index)  # ignore padding positions in the loss
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
num_epochs =  40

# Single optimization step on one batch
def train_model(src_inputs, trg_inputs):
    optimizer.zero_grad()
    
    # The model expects [seq len, batch size]; the DataLoader yields [batch size, seq len]
    src_inputs = src_inputs.transpose(0, 1)
    trg_inputs = trg_inputs.transpose(0, 1)
    
    # Forward pass: outputs = [trg len, batch size, vocab size]
    outputs = model(src_inputs, trg_inputs)
    
    # Drop the <sos> position and flatten so the loss sees [N, vocab size] against [N]
    output_dim = outputs.shape[-1]
    loss = criterion(outputs[1:].reshape(-1, output_dim), trg_inputs[1:].reshape(-1))
    
    # Backward pass and optimization
    loss.backward()
    optimizer.step()
    
    return loss.item()


for epoch in range(num_epochs):
    total_loss = 0.0
    
    # Iterate over the training batches
    for batch_inputs, batch_targets in train_dataloader:
        loss = train_model(batch_inputs, batch_targets)
        total_loss += loss
    
    average_loss = total_loss / len(train_dataloader)
    print(f"Epoch: {epoch+1}, Loss: {average_loss:.4f}")

# After training, you can use the trained model for inference

In [27]:
import torchtext
from torchtext.datasets import SQuAD1
import nltk
from nltk.tokenize import word_tokenize

# Download and initialize Squad1 dataset
torchtext.utils.download_from_url("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json", root="data")
train_dataset = SQuAD1(root="data", split="train")

In [27]:
SQUAD1_QUESTION_INDEX = 1
SQUAD1_ANSWER_INDEX = 2

# 0: context (string)
# 1: question
# 2: answer
# 3: answer start

# Tokenize the text
tokenized_data = []
# new_vocabulary = Vocabulary()


for example in train_dataset:
    print("Context:", example[0])
    print("Question:", example[1])
    print("Answer:", example[2])
    print("Answer start:", example[3])
    
    print(word_tokenize(example[1]))
    print(word_tokenize(example[2][0]))
    
    source = word_tokenize(example[1])
    target = word_tokenize(example[2][0])
    
    print(type(source))
    print(type(target))
    tokenized_data.append({"source": source, "target": target})
    break
    
for example in train_dataset:
    source = word_tokenize(example[1])
    target = word_tokenize(example[2][0])
    
    # new_vocabulary.load(source)
    # new_vocabulary.load(target)
    
    tokenized_data.append({"source": source, "target": target})
    # break    

Context: Architecturally, the school has a Catholic character. Atop the Main Building's gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.
Question: To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?
Answer: ['Saint Bernadette Soubirous']
Answer start: [515]
['To', 'whom', 'did', 'the', 'Virgin', 'Mary', 'allegedly', 'appear', 'in', '1858', 'in', 'Lourdes', 'France', '?']
['Saint', 'Bernadette', 'Soubirous']
<class 'list'>
<class 'list'>

In [12]:
import torchtext
from torchtext.datasets import SQuAD1
# from torchtext.datasets import squad1

# Download and initialize Squad1 dataset
torchtext.utils.download_from_url("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json", root="data")
# train_dataset = SQuAD1(root="data", split="train")
train_dataset = SQuAD1(root="data", split="train")
# train_dataset = squad1.SQuAD1(root="data", split="train")

def tokenize(label, line):
    return line.split()

# tokens = []
# for label, line in train_dataset:
#     tokens += tokenize(label, line)

for a in train_dataset:
    print(a)
    print(type(a))
    break

# Initialize training_data
# training_data = []

# for example in train_dataset:
#     source = example.context
#     target = example.question
#     training_data.append({"source": source, "target": target})


('Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.', 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?', ['Saint Bernadette Soubirous'], [515])
<class 'tuple'>


In [10]:
tokenized_data

[{'source': ['To',
   'whom',
   'did',
   'the',
   'Virgin',
   'Mary',
   'allegedly',
   'appear',
   'in',
   '1858',
   'in',
   'Lourdes',
   'France',
   '?'],
  'target': ['Saint', 'Bernadette', 'Soubirous']}]