# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
# Project Steps Overview and Estimated Duration
# Below you will find each of the components of the project, and estimated times to complete each portion. 
# These are estimates and not exact timings to help you expect the amount of time necessary to put aside to work on your project.

# Prepare data (~2 hours)
# Build your vocabulary from a corpus of language data. The Vocabulary object is described in Lesson Six: Seq2Seq.

# Build Model (~4 hours)
# Build your Encoder, Decoder, and larger Sequence to Sequence pattern in PyTorch. This pattern is described in Lesson Six: Seq2Seq.

# Train Model (~3 hours)
# Write your training procedure and divide your dataset into train/test/validation splits. Then, train your network and plot your evaluation metrics. Save your model after it reaches a satisfactory level of accuracy.

# Evaluate & Interact w/ Model (~1 hour)
# Write a script to interact with your network at the command line.

In [3]:
# Instructions Summary
# The LSTM Chatbot will help you show off your skills as a deep learning practitioner. You will develop the chatbot using a new architecture called a Seq2Seq. 
# Additionally, you can use pre-trained word embeddings to improve the performance of your model. Let's get started by following the steps below:

# Step 1: Build your Vocabulary & create the Word Embeddings
# The most important part of this step is to create your Vocabulary object using a corpus of data drawn from TorchText.

# (Extra Credit)
# Use Gensim to extract the word embeddings from one of its corpus'.
# Use NLTK and Gensim to create a function to clean your text and look up the index of a word's embeddings.

# Step 2: Create the Encoder
# A Seq2Seq architecture consists of an encoder and a decoder unit. You will use Pytorch to build a full Seq2Seq model.
# The first step of the architecture is to create an encoder with an LSTM unit.

# (Extra Credit)
# Load your pretrained embeddings into the LSTM unit.

# Step 3: Create the Decoder
# The second step of the architecture is to create a decoder using a second LSTM unit.

# Step 4: Combine them into a Seq2Seq Architecture
# To finalize your model, you will combine the encoder and decoder units into a working model.
# The Seq2Seq2 model must be able to instantiate the encoder and decoder. Then, it will accept the inputs for these units and manage their interaction to get an output using the forward pass function.

# Step 5: Train & evaluate your model
# Finally you will train and evaluate your model using a Pytorch training loop.

# Step 6: Interact with the Chatbot
# Demonstrate your chatbot by converting the outputs of the model to text and displaying it's responses at the command line.

In [2]:
# Pre-requisites: 
# - PyTorch 2.00 kernel
# - ml.g5.xlarge instance

# Install requirements
# !pip install gensim==4.3.1 nltk==3.8.1 torchtext torchdata portalocker | grep -v "already satisfied"

# !pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.15.1 portalocker>=2.0.0 | grep -v "already satisfied"
!pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.6.0   | grep -v "already satisfied"

#  torchtext==0.12.0 --> torch==1.11.0
#  torchtext==0.13.0 --> torch==1.12.0
#  torchtext==0.14.0 --> torch==1.12.0
# torchtext==0.15.1 --> torch==2.0.0
# torchtext==0.15.2 --> torch==2.0.1

# !pip install gensim==4.3.1 nltk==3.8.1  torchtext==0.10.0 | grep -v "already satisfied"

# !pip install gensim==4.2.0 nltk torchtext  | grep -v "already satisfied"

Collecting gensim==4.3.1
  Using cached gensim-4.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (26.4 MB)
Collecting nltk==3.8.1
  Using cached nltk-3.8.1-py3-none-any.whl (1.5 MB)
Collecting torchtext==0.6.0
  Using cached torchtext-0.6.0-py3-none-any.whl (64 kB)
Collecting regex>=2021.8.3 (from nltk==3.8.1)
  Using cached regex-2023.10.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (773 kB)
Collecting sentencepiece (from torchtext==0.6.0)
  Using cached sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Installing collected packages: sentencepiece, regex, nltk, gensim, torchtext
  Attempting uninstall: torchtext
    Found existing installation: torchtext 0.15.1
    Uninstalling torchtext-0.15.1:
      Successfully uninstalled torchtext-0.15.1
Successfully installed gensim-4.3.1 nltk-3.8.1 regex-2023.10.3 sentencepiece-0.1.99 torchtext-0.6.0
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip 

## Dataset load

In [257]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
# import sklearn.model_selection 
from torchtext.utils import download_from_url
from torchtext.data import Field, BucketIterator, TabularDataset
import random
import json

nltk.download('brown')
nltk.download('punkt')

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Output, save, and load brown embeddings

# model = gensim.models.Word2Vec(brown.sents())
# model.save('brown.embedding')

# w2v = gensim.models.Word2Vec.load('brown.embedding')

question_context_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)
answer_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)

def prepare_text(sentence):
    return sentence

# def prepare_dataset(df):
#     df['context_tokens'] = df['context'].apply(prepare_text)
#     df['question_tokens'] = df['question'].apply(prepare_text)
#     df['answer_tokens'] = df['answer'].apply(prepare_text)
    
#     df['context_ids'] = df['context_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['question_ids'] = df['question_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])
#     df['answer_ids'] = df['answer_tokens'].apply(lambda x: [vocabulary.get_index(token) for token in x])

#     df['full_question_ids'] = df.apply(lambda row: [CONTEXT_index] + row['context_ids'] + [QUESTION_index] + row['question_ids'], axis=1)    

# def train_test_split(SRC, TRG, test_size=0.2, random_seed=42):
#     from sklearn.model_selection import train_test_split
    
#     SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset= train_test_split(SRC, TRG, test_size=test_size, random_state=random_seed)
    
#     # Return the training and test datasets
#     return SRC_train_dataset, SRC_val_dataset, TRG_train_dataset, TRG_val_dataset


def loadDF(path):    
    data_file_path = download_from_url(path, root="data")
    
    with open(data_file_path, 'r') as f:
        squad_data = json.load(f)
        
    data = []
    examples = []
    for item in squad_data['data']:
        for paragraph in item['paragraphs']:
            context = paragraph['context']
            for qa in paragraph['qas']:
                question = qa['question']                
                # question_context = f"{question} {separator_token} {context}"
                question_context = f"{question} {context}"
                answer = qa['answers'][0]['text']
                
                data.append((question_context, answer))                                

            #TODO: remove line below
            # break    
                
    df = pd.DataFrame.from_records(data, columns=['question_context', 'answer'])        
    df['question_context'] = df['question_context'].apply(prepare_text)
    df['answer'] = df['answer'].apply(prepare_text)

    return df


def load_datasets(df, random_seed=42):
    df.to_csv("data/data.csv", index=False)
        
    fields = [('question_context', question_context_field), ('answer', answer_field)]
    dataset = TabularDataset("data/data.csv", format='csv', fields=fields)
    
    question_context_field.build_vocab(dataset, min_freq=1)
    answer_field.build_vocab(dataset, min_freq=1)

    train_data, valid_data = dataset.split(split_ratio=0.8, random_state=random.seed(random_seed))
    
    return train_data, valid_data

# def train_test_split(SRC, TRG,  test_size=0.2, random_seed=42):
    
def tokens_to_string(tokens):
    return " ".join([answer_field.vocab.itos[token] for token in tokens if token != eos_idx])



[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [259]:
df = loadDF("https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json")

train_data, valid_data = load_datasets(df)

In [260]:
print(len(question_context_field.vocab))

print(len(answer_field.vocab))

102396
42794


In [6]:
df

Unnamed: 0,question_context,answer
0,To whom did the Virgin Mary allegedly appear i...,Saint Bernadette Soubirous
1,What is in front of the Notre Dame Main Buildi...,a copper statue of Christ
2,The Basilica of the Sacred heart at Notre Dame...,the Main Building
3,What is the Grotto at Notre Dame? Architectura...,a Marian place of prayer and reflection
4,What sits on top of the Main Building at Notre...,a golden statue of the Virgin Mary
...,...,...
87594,In what US state did Kathmandu first establish...,Oregon
87595,What was Yangon previously known as? Kathmandu...,Rangoon
87596,With what Belorussian city does Kathmandu have...,Minsk
87597,In what year did Kathmandu create its initial ...,1975


In [7]:
print(df.iloc[20]['question_context'])

print(df.iloc[20]['answer'])

What entity provides help with the management of time for new students at Notre Dame? All of Notre Dame's undergraduate students are a part of one of the five undergraduate colleges at the school or are in the First Year of Studies program. The First Year of Studies program was established in 1962 to guide incoming freshmen in their first year at the school before they have declared a major. Each student is given an academic advisor from the program who helps them to choose classes that give them exposure to any major in which they are interested. The program also includes a Learning Resource Center which provides time management, collaborative learning, and subject tutoring. This program has been recognized previously, by U.S. News & World Report, as outstanding.
Learning Resource Center


## Neural Network

In [221]:
import random

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size, embedding_size, num_layers=1, dropout=0):
        
        super(Encoder, self).__init__()
        
        self.input_size = input_size
        self.hidden_size = hidden_size
        
        self.hidden = torch.zeros(1, 1, hidden_size)        # How to use it ????
        # self.embedding_dim = embedding_size
        
        # self.embedding provides a vector representation of the inputs to our model
        # self.embedding = nn.Embedding(self.input_size, self.embedding_dim)
        self.embedding = nn.Embedding(input_size, embedding_size)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        # self.lstm = nn.LSTM(self.embedding_dim, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=dropout)

        self.dropout = nn.Dropout(dropout)
    
    def forward(self, i):
        embedded = self.embedding(i)
        
        embedded = self.dropout(embedded)

        output, (hidden, cell) = self.lstm(embedded)
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        return output, hidden, cell
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size, embedding_size, num_layers=1, dropout=0):
        
        super(Decoder, self).__init__()
        
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embedding_size = embedding_size
        
        # self.embedding provides a vector representation of the target to our model
        # self.embedding = nn.Embedding(self.hidden_size, self.hidden_size)  # From lesson
        
        self.embedding = nn.Embedding(output_size, embedding_size)
        # self.embedding = nn.Embedding(hidden_size, embedding_size) # Why ?
        
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        # self.lstm = nn.LSTM(self.embedding_size, self.hidden_size, num_layers, dropout=(0 if num_layers == 1 else dropout))
        self.lstm = nn.LSTM(embedding_size, hidden_size, num_layers, dropout=dropout)

        # self.ouput, predicts on the hidden state via a linear output layer
        self.fc = nn.Linear(hidden_size, output_size)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        
        # print('[Decoder] input shape:', input.shape)
        
        input = input.unsqueeze(0)
        
        embedded = self.embedding(input)
        
        embedded = self.dropout(embedded)
            
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell)) 
        
        
        prediction = self.fc(output.squeeze(0))
        
        print("prediction", prediction)
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        
        return prediction, hidden, cell
        
        

class Seq2Seq(nn.Module):
        
    def __init__(self, encoder, decoder):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder        

    def forward(self, src, trg, teacher_forcing_ratio = 0.5):              
        #src = [src len, batch size]
        #trg = [trg len, batch size]
                
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_size
                       
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(device)
        
        _, hidden, cell = self.encoder(src)
        
        # Start with <sos> tokens
        input = trg[0, :]
                
        for t in range(1, trg_len):

            output, hidden, cell = self.decoder(input, hidden, cell)

            outputs[t] = output
            
            # get highest predicted token
            top1 = output.argmax(1)
            
            print("Most likely token: ", top1)
            
            use_teacher_forcing = random.random() < teacher_forcing_ratio

            input = trg[t] if use_teacher_forcing else top1
    
        return outputs


class Seq2SeqInference(nn.Module):
        
    def __init__(self, encoder, decoder, answer_field):
        
        super(Seq2SeqInference, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder        

    def forward(self, src, max_length=20):
        #src = [src len, 1]
        #trg = [trg len, 1]
        # expected output shape = [max_length, 1]
        
        if src.shape[1] != 1:
            raise ValueError(f"src.shape[1] != 1: {src.shape[1]}")
            
        batch_size = 1 # src.shape[1]

        trg_vocab_size = self.decoder.output_size
                                   
        logits = torch.zeros(max_length + 1, batch_size, trg_vocab_size).to(device)
        # outputs = torch.zeros(max_length + 1, batch_size).to(device)
        
        _, hidden, cell = self.encoder(src)
        
        # Start with <sos> tokens
        # TODO: create array of <sos> tokens      
        sos_idx = answer_field.vocab.stoi[answer_field.init_token] 
        
        # input = trg[0, :]
        
        # print("input shape", trg.shape)
        input = torch.tensor([sos_idx]).to(device)        
        print("Start token: ", input)
        
        inferred_tokens = []
                
        # for t in range(1, trg_len):
        for t in range(1, max_length + 1):

            output, hidden, cell = self.decoder(input, hidden, cell)

            logits[t] = output
            
            # get highest predicted token
            top1 = output.argmax(1)
            
            # print(f"Top1: {top1} {top1.item()}")
            print(f"Most likely token: {top1.item()} ({answer_field.vocab.itos[top1.item()]})")
            inferred_tokens.append(top1)
                        
            input = top1                        
        
        inferred_tokens_tensor = torch.tensor(inferred_tokens).view(-1, 1)
        
        logits_dim = logits.shape[-1]
        logits = logits[1:].view(-1, logits_dim)
        
        return inferred_tokens_tensor, logits


## Model training

In [9]:
from torch.utils.data import DataLoader, TensorDataset

# Best hyperparameters: {'encoder_dropout': 0.5446679996568586, 'decoder_dropout': 0.24195810605904913, 'embedding_size': 362, 'hidden_size': 1602, 'num_layers': 4, 'learning_rate': 0.0002977109963568765, 'batch_size': 85}

# Define the model and other parameters
encoder_input_size = len(question_context_field.vocab) 
encoder_embedding_size= 362 #300
encoder_dropout = 0.5446679996568586 #0.5

decoder_output_size = len(answer_field.vocab)
decoder_embedding_size = 362 #300
decoder_dropout = 0.24195810605904913 #0.5

hidden_size = 1602 #512
num_layers = 4 #2
batch_size = 85 #128

learning_rate = 0.0002977109963568765 #0.001
num_epochs =  20

encoder = Encoder(encoder_input_size, hidden_size, encoder_embedding_size, num_layers, encoder_dropout)
decoder = Decoder(hidden_size, decoder_output_size, decoder_embedding_size, num_layers, decoder_dropout)
    
model = Seq2Seq(encoder, decoder).to(device)

train_iterator, valid_iterator = BucketIterator.splits(
    (train_data, valid_data),
   batch_size=batch_size,
   sort_within_batch=True,
    sort_key = lambda x: len(x.question_context),
    device=device)
                             

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import time


# Define loss function and optimizer
# learning_rate = 0.001
pad_idx = answer_field.vocab.stoi[answer_field.pad_token] 
criterion = nn.CrossEntropyLoss(ignore_index = pad_idx)

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
# num_epochs =  20
clip = 1
valid_loss_min = None

for epoch in range(num_epochs):
    start_time = time.time()
    model.train()
    train_loss = 0.0
    avg_train_loss = 0.0
    # epoch_loss = 0.0

    # Training loop
    for batch_idx, batch in enumerate(train_iterator):
        src = batch.question_context.to(device)
        trg = batch.answer.to(device)
        
#         print('src:', src.shape)
#         print('trg:', trg.shape)        
        
        optimizer.zero_grad()
        
        # print (src.shape)
        # print (trg.shape)
        
        # Pass the source sequences through the encoder
        output = model(src, trg)        #  [trg length, batch size, output dim]
        # print('output:', output.shape)
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
#         print('output 2:', output.shape)
#         print('trg 2:', trg.shape)
        
#         print('output content:', output)
#         print('trg content:', trg)
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        # print(train_loss)
        
        train_loss += loss.item()
        
        # break
        
    average_train_loss = train_loss / len(train_iterator)
    
    end_time = time.time()    
    train_elapsed_time = end_time - start_time
    
    # print(f'Epoch: {epoch+1:02} | Train Time: {train_elapsed_time}s')
        
    # Evaluation on the test dataset
    start_time = time.time()
    model.eval()
    with torch.no_grad():
        valid_loss = 0.0

        for batch_idx, batch in enumerate(valid_iterator):
            src = batch.question_context.to(device)
            trg = batch.answer.to(device)                    
            
            output = model(src, trg, 0) #turn off teacher forcing

            #trg = [trg len, batch size]
            #output = [trg len, batch size, output dim]

            output_dim = output.shape[-1]

            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

            #trg = [(trg len - 1) * batch size]
            #output = [(trg len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            
            valid_loss += loss.item()
            
        average_val_loss = valid_loss / len(valid_iterator)
                    
    end_time = time.time()    
    val_elapsed_time = end_time - start_time
    
    # print(f'Eval Time: {elapsed_time}')
    # print(f'Epoch: {epoch+1:02} | Eval Time: {val_elapsed_time}s')
    
    torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': train_loss,
            }, f'checkpoints/train-{epoch}.pt')
    
    if valid_loss_min is None or (
            (valid_loss_min - average_val_loss) / valid_loss_min > 0.01
    ):
        print(f"New minimum validation loss: {average_val_loss:.6f}. Saving model ...")
        
        torch.save({
                'epoch': epoch,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'loss': train_loss,
            }, 'checkpoints/best_val_loss.pt')

        valid_loss_min = average_val_loss
        
    print(f"Epoch: {epoch+1:02}, Train Loss: {average_train_loss:.3f}, Val Loss: {average_val_loss:.3f} Train Time: {train_elapsed_time}s Eval Time: {val_elapsed_time}s")



In [25]:
batch = next(iter(valid_iterator))

text_batch = batch.question_context
label_batch = batch.answer

print(text_batch)

tensor([[     2,      2,      2,  ...,      2,      2,      2],
        [    27,     27,      4,  ...,     27,     27, 101321],
        [    22,     12,   3956,  ...,     12,     13,      3],
        ...,
        [   313,    133,    316,  ...,      1,      1,      1],
        [     7,      7,      7,  ...,      1,      1,      1],
        [     3,      3,      3,  ...,      1,      1,      1]],
       device='cuda:0')


## Running Inference

In [222]:
test_encoder = Encoder(encoder_input_size, hidden_size, encoder_embedding_size, num_layers, encoder_dropout)
test_decoder = Decoder(hidden_size, decoder_output_size, decoder_embedding_size, num_layers, decoder_dropout)
    
test_model = Seq2SeqInference(test_encoder, test_decoder, answer_field)
test_model.to(device)

pad_idx = answer_field.vocab.stoi[answer_field.pad_token] 
test_criterion = nn.CrossEntropyLoss(ignore_index = pad_idx) # https://pytorch.org/docs/2.0/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss
test_criterion.to(device)

checkpoint = torch.load('checkpoints/best_val_loss.pt')
test_model.load_state_dict(checkpoint['model_state_dict'])


test_iterator = BucketIterator(
        # dataset=valid_data,
        dataset=train_data,
       batch_size=1,
       sort_within_batch=True,
        sort_key = lambda x: len(x.question_context),
        device=device)


In [223]:
batch = next(iter(test_iterator))

text_batch = batch.question_context
label_batch = batch.answer

print(label_batch.shape)
print(label_batch)

label_batch = label_batch.squeeze(dim=1)

print(label_batch.shape)
print(label_batch)

torch.Size([3, 1])
tensor([[  2],
        [280],
        [  3]], device='cuda:0')
torch.Size([3])
tensor([  2, 280,   3], device='cuda:0')


In [224]:
# print(answer_field.vocab.itos[3])

In [None]:
with torch.no_grad():
    valid_loss = 0.0

    for batch_idx, batch in enumerate(test_iterator):
        src = batch.question_context.to(device)
        trg = batch.answer.to(device)
        
        print("src shape", src.shape)
        # print("src", src)
        # print("trg", trg)
        
        tokens, logits = test_model(src, 5)

        #trg = [trg len, batch size]
        #logits = [max length, batch, output dim]

#         print("logits shape", logits.shape)
#         print("seq2seq-logits-raw", logits)
        
#         logits_dim = logits.shape[-1]
#         logits = logits[1:].view(-1, logits_dim)
        
        trg = trg[1:].view(-1)

        print("seq2seq-logits", logits)
        print("seq2seq-trg", trg)
        print("tokens shape", tokens.shape)
        print("tokens", tokens)
                
        response_text = tokens_to_string(tokens.squeeze().tolist())
        
        print(response_text)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]

        # TODO: question: should 'output' really be the MLP result tensor ?
#         loss = test_criterion(output, trg)

#         valid_loss += loss.item()
        
        break # TODO: remove line





'the'

## Testing against the Validation (or Test ?) Set

In [62]:
test_encoder = Encoder(encoder_input_size, hidden_size, encoder_embedding_size, num_layers, encoder_dropout)
test_decoder = Decoder(hidden_size, decoder_output_size, decoder_embedding_size, num_layers, decoder_dropout)
    
test_model = Seq2Seq(test_encoder, test_decoder)
test_model.to(device)

pad_idx = answer_field.vocab.stoi[answer_field.pad_token] 
test_criterion = nn.CrossEntropyLoss(ignore_index = pad_idx) # https://pytorch.org/docs/2.0/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss
test_criterion.to(device)

checkpoint = torch.load('checkpoints/best_val_loss.pt')
test_model.load_state_dict(checkpoint['model_state_dict'])


test_iterator = BucketIterator(
        # dataset=valid_data,
        dataset=train_data,
       batch_size=1,
       sort_within_batch=True,
        sort_key = lambda x: len(x.question_context),
        device=device)


In [65]:
batch = next(iter(test_iterator))

question_context_item = batch.question_context
answer_item = batch.answer

# print(question_context_item.shape)
# print(question_context_item)
print(answer_item.shape)
print(answer_item)

# print(answer_field.vocab.itos[2])
# print(answer_field.vocab.itos[1324])
# print(answer_field.vocab.itos[2015])
# print(answer_field.vocab.itos[238])
# print(answer_field.vocab.itos[3])

print(answer_field.vocab.itos[2])
print(answer_field.vocab.itos[1568])
print(answer_field.vocab.itos[5])
print(answer_field.vocab.itos[24849])
print(answer_field.vocab.itos[3])

torch.Size([4, 1])
tensor([[   2],
        [2987],
        [  16],
        [   3]], device='cuda:0')
<sos>
consumption
of
carcinogenic
<eos>


In [28]:
# TEXT.postprocess(tokenized_text)

# question_context_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)
# answer_field = Field(tokenize=word_tokenize, init_token='<sos>', eos_token='<eos>', lower=True)

input_text = "Notre Dame"

# tokenized_text = question_context_field.preprocess("Notre Dame")

tokenized_text = question_context_field.tokenize(input_text)  # Tokenize the input text
input_ids = [question_context_field.vocab.stoi[token] for token in tokenized_text]  # Convert tokens to IDs

# Output the tokenized text (list of tokens)
print(tokenized_text)
print(input_ids)

# Convert tokens back to a string using the Field's postprocessing function
# reconstructed_text = question_context_field.postprocess(tokenized_text)
# print(reconstructed_text)

['Notre', 'Dame']
[0, 0]


In [36]:
import time

start_time = time.time()
model.eval()
with torch.no_grad():
    valid_loss = 0.0

    for batch_idx, batch in enumerate(test_iterator):
        src = batch.question_context.to(device)
        trg = batch.answer.to(device)

        print("src shape", src.shape)
        # print("src", src)
        print("trg", trg)
        
        output = test_model(src, trg, teacher_forcing_ratio=0) #turn off teacher forcing

        # print(output)
        # print ("output", output)
        #trg = [trg len, batch size]
        #output = [trg len, batch size, output dim]

        print("output shape", output.shape)
        print("seq2seq-output-raw", output)
        
        output_dim = output.shape[-1]

        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)

        print("seq2seq-output", output)
        print("seq2seq-trg", trg)
        
        #trg = [(trg len - 1) * batch size]
        #output = [(trg len - 1) * batch size, output dim]

        # TODO: question: should 'output' really be the MLP result tensor ?
        loss = test_criterion(output, trg)

        valid_loss += loss.item()
        
        break # TODO: remove line

    average_val_loss = valid_loss / len(valid_iterator)

end_time = time.time()    
val_elapsed_time = end_time - start_time

print(f"Val Loss: {average_val_loss:.3f} Eval Time: {val_elapsed_time}s")    

src shape torch.Size([72, 1])
trg tensor([[  2],
        [652],
        [ 11],
        [  3]], device='cuda:0')
prediction tensor([[-4.8642, -4.7556, -4.7049,  ..., -3.7168, -4.2894, -4.2225]],
       device='cuda:0')
Most likely token:  tensor([4], device='cuda:0')
prediction tensor([[-5.8746, -5.8165, -5.7985,  ..., -4.1126, -5.4001, -4.5968]],
       device='cuda:0')
Most likely token:  tensor([3], device='cuda:0')
prediction tensor([[-6.4086, -6.2010, -6.2469,  ..., -4.8947, -5.8284, -4.6923]],
       device='cuda:0')
Most likely token:  tensor([3], device='cuda:0')
output shape torch.Size([4, 1, 42794])
seq2seq-output-raw tensor([[[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-4.8642, -4.7556, -4.7049,  ..., -3.7168, -4.2894, -4.2225]],

        [[-5.8746, -5.8165, -5.7985,  ..., -4.1126, -5.4001, -4.5968]],

        [[-6.4086, -6.2010, -6.2469,  ..., -4.8947, -5.8284, -4.6923]]],
       device='cuda:0')
seq2seq-output tensor([[-4.8642, -4.7556, -4.7049

In [14]:
test_model

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(102396, 362)
    (lstm): LSTM(362, 1602, num_layers=4, dropout=0.5446679996568586)
    (dropout): Dropout(p=0.5446679996568586, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(42794, 362)
    (lstm): LSTM(362, 1602, num_layers=4, dropout=0.24195810605904913)
    (fc): Linear(in_features=1602, out_features=42794, bias=True)
    (dropout): Dropout(p=0.24195810605904913, inplace=False)
  )
)

### HyperParameter Optimization

In [15]:
!pip install  sqlalchemy==1.4.8  optuna | grep -v "already satisfied"

Collecting sqlalchemy==1.4.8
  Using cached SQLAlchemy-1.4.8-cp310-cp310-linux_x86_64.whl
Collecting optuna
  Downloading optuna-3.4.0-py3-none-any.whl (409 kB)
[2K     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 409.6/409.6 kB 6.8 MB/s eta 0:00:00
Collecting alembic>=1.5.0 (from optuna)
  Downloading alembic-1.12.1-py3-none-any.whl (226 kB)
[2K     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 226.8/226.8 kB 4.5 MB/s eta 0:00:00
[?25hCollecting colorlog (from optuna)
  Using cached colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Collecting Mako (from alembic>=1.5.0->optuna)
  Using cached Mako-1.2.4-py3-none-any.whl (78 kB)
Installing collected packages: sqlalchemy, Mako, colorlog, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.12.1 colorlog-6.7.0 optuna-3.4.0 sqlalchemy-1.4.8
[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.1.2[0m[39;49m -> [0m[32;49m23.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update,

In [None]:
from sqlalchemy import orm
import optuna
import time

encoder_input_size = len(question_context_field.vocab) 
decoder_output_size = len(answer_field.vocab)
pad_idx = answer_field.vocab.stoi[answer_field.pad_token] 

def objective(trial):
    torch.cuda.empty_cache()
    
    encoder_dropout = trial.suggest_float('encoder_dropout', 0.1, 0.9)
    decoder_dropout = trial.suggest_float('decoder_dropout', 0.1, 0.9)
    embedding_size = trial.suggest_int('embedding_size', 200, 512)
    hidden_size = trial.suggest_int('hidden_size', 64, 2048)
    num_layers = trial.suggest_int('num_layers', 1, 4)
    
    
    learning_rate = trial.suggest_float('learning_rate', 1e-5, 1e-1, log=True)
    # weight_decay = trial.suggest_loguniform('weight_decay', 1e-6, 1e-3)
    # opt = trial.suggest_categorical('opt', ['sgd', 'adam'])
    # momentum = trial.suggest_uniform('momentum', 0.1, 0.9)
    
    batch_size = trial.suggest_int('batch_size', 32, 128)
    
    print(f'Starting trial {trial.number}:\tHyperparameters={trial.params}') 
    
    train_iterator, valid_iterator = BucketIterator.splits(
        (train_data, valid_data),
       batch_size=batch_size,
       sort_within_batch=True,
        sort_key = lambda x: len(x.question_context),
        device=device)
    
    encoder = Encoder(encoder_input_size, hidden_size, embedding_size, num_layers, encoder_dropout)
    decoder = Decoder(hidden_size, decoder_output_size, embedding_size, num_layers, decoder_dropout)
    model = Seq2Seq(encoder, decoder).to(device)    
    
    criterion = nn.CrossEntropyLoss(ignore_index = pad_idx) # https://pytorch.org/docs/2.0/generated/torch.nn.CrossEntropyLoss.html?highlight=crossentropyloss#torch.nn.CrossEntropyLoss
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    num_epochs = 5
    clip = 1
    valid_loss_min = None

    for epoch in range(num_epochs):
        start_time = time.time()
        model.train()
        train_loss = 0.0
        avg_train_loss = 0.0

        # Training loop
        for batch_idx, batch in enumerate(train_iterator):
            src = batch.question_context.to(device)
            trg = batch.answer.to(device)

    #         print('src:', src.shape)
    #         print('trg:', trg.shape)        

            optimizer.zero_grad()


            # print (src.shape)
            # print (trg.shape)

            # Pass the source sequences through the encoder
            output = model(src, trg)        #  [trg length, batch size, output dim]
            # print('output:', output.shape)

            output_dim = output.shape[-1]

            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)

    #         print('output 2:', output.shape)
    #         print('trg 2:', trg.shape)

    #         print('output content:', output)
    #         print('trg content:', trg)

            loss = criterion(output, trg)

            loss.backward()

            torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

            optimizer.step()

            # print(train_loss)

            train_loss += loss.item()


        average_train_loss = train_loss / len(train_iterator)

        end_time = time.time()    
        train_elapsed_time = end_time - start_time

        # print(f'Epoch: {epoch+1:02} | Train Time: {train_elapsed_time}s')

        # Evaluation on the test dataset
        start_time = time.time()
        model.eval()
        with torch.no_grad():
            valid_loss = 0.0

            for batch_idx, batch in enumerate(valid_iterator):
                src = batch.question_context.to(device)
                trg = batch.answer.to(device)                    

                output = model(src, trg, 0) #turn off teacher forcing

                #trg = [trg len, batch size]
                #output = [trg len, batch size, output dim]

                output_dim = output.shape[-1]

                output = output[1:].view(-1, output_dim)
                trg = trg[1:].view(-1)

                #trg = [(trg len - 1) * batch size]
                #output = [(trg len - 1) * batch size, output dim]

                loss = criterion(output, trg)

                valid_loss += loss.item()

            average_val_loss = valid_loss / len(valid_iterator)

        end_time = time.time()    
        val_elapsed_time = end_time - start_time

        # print(f'Eval Time: {elapsed_time}')
        # print(f'Epoch: {epoch+1:02} | Eval Time: {val_elapsed_time}s')

        if valid_loss_min is None or (
                (valid_loss_min - average_val_loss) / valid_loss_min > 0.01
        ):
            print(f"New minimum validation loss: {average_val_loss:.6f}. Saving model ...")

            torch.save({
                    'epoch': epoch,
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'loss': train_loss,
                }, f'checkpoints/hpo_{trial.number}_model.pt",')

            valid_loss_min = average_val_loss

        print(f"Epoch: {epoch+1:02}, Train Loss: {average_train_loss:.3f}, Val Loss: {average_val_loss:.3f} Train Time: {train_elapsed_time}s Eval Time: {val_elapsed_time}s")    
        
        return valid_loss_min
    
# Define study and optimize hyperparameters
study = optuna.create_study(direction='minimize')

study.optimize(objective, n_trials=100, catch=(torch.cuda.OutOfMemoryError,))

In [13]:
print('Best trial:', study.best_trial)
print('Best hyperparameters:', study.best_params)
print('Validation loss:', study.best_value)

Best trial: FrozenTrial(number=38, state=TrialState.COMPLETE, values=[6.859984294227932], datetime_start=datetime.datetime(2023, 7, 17, 13, 41, 28, 407974), datetime_complete=datetime.datetime(2023, 7, 17, 13, 54, 17, 155683), params={'encoder_dropout': 0.5446679996568586, 'decoder_dropout': 0.24195810605904913, 'embedding_size': 362, 'hidden_size': 1602, 'num_layers': 4, 'learning_rate': 0.0002977109963568765, 'batch_size': 85}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'encoder_dropout': FloatDistribution(high=0.9, log=False, low=0.1, step=None), 'decoder_dropout': FloatDistribution(high=0.9, log=False, low=0.1, step=None), 'embedding_size': IntDistribution(high=512, log=False, low=200, step=1), 'hidden_size': IntDistribution(high=2048, log=False, low=64, step=1), 'num_layers': IntDistribution(high=4, log=False, low=1, step=1), 'learning_rate': FloatDistribution(high=0.1, log=True, low=1e-05, step=None), 'batch_size': IntDistribution(high=128, log=False, 

In [None]:
# Baseline
# Epoch: 01, Train Loss: 6.988, Val Loss: 6.898

#  Round 1
# Best hyperparameters: {'encoder_dropout': 0.6030214714084597, 'decoder_dropout': 0.2892322144702524, 'embedding_size': 203, 'hidden_size': 1662, 'num_layers': 2, 'learning_rate': 2.4979480026780982e-05, 'batch_size': 82}
# Validation accuracy: 6.920195218558623

#  Round 2
# Best hyperparameters: {'encoder_dropout': 0.5446679996568586, 'decoder_dropout': 0.24195810605904913, 'embedding_size': 362, 'hidden_size': 1602, 'num_layers': 4, 'learning_rate': 0.0002977109963568765, 'batch_size': 85}
# Validation loss: 6.859984294227932


In [99]:
# print(batch.question_context)

print(batch.answer)

print(answer_field.vocab.itos[980])

tensor([[   2],
        [1818],
        [ 124],
        [   6],
        [1289],
        [   3]], device='cuda:0')
mexico
