# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pretrained embeddings against a model without them.
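
To make the comparison concrete, here is a minimal sketch of how pretrained vectors could be lined up with the model vocabulary. It is an illustration under assumptions, not the starter code: `build_embedding_matrix` is a hypothetical helper, the toy `pretrained` dictionary stands in for real word vectors, and plain Python lists are used to keep the example dependency-free. Words without a pretrained vector get a small random vector, and the vector length must match the model's embedding size.

```python
import random


def build_embedding_matrix(vocab_words, pretrained, embed_size, seed=0):
    """Return one row per vocabulary word (hypothetical helper).

    A word keeps its pretrained vector when one of the right length exists;
    otherwise it gets a small random vector. Also returns the number of
    vocabulary words that were covered by the pretrained vectors.
    """
    rng = random.Random(seed)
    matrix, hits = [], 0
    for word in vocab_words:
        vector = pretrained.get(word)
        if vector is not None and len(vector) == embed_size:
            matrix.append(list(vector))
            hits += 1
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(embed_size)])
    return matrix, hits


# Toy stand-ins for real pretrained vectors (made-up values, 3 dimensions).
pretrained = {"how": [0.1, 0.2, 0.3], "many": [0.4, 0.5, 0.6]}
vocab_words = ["PAD", "SOS", "EOS", "how", "many", "emmy"]

matrix, hits = build_embedding_matrix(vocab_words, pretrained, embed_size=3)
print(len(matrix), len(matrix[0]), hits)  # 6 3 2
```

In PyTorch, a matrix like this can be turned into an embedding layer with `nn.Embedding.from_pretrained(torch.tensor(matrix), freeze=False)`, which would take the place of a randomly initialized `nn.Embedding`. Printing the coverage count is a quick way to see how much of the vocabulary actually benefits from the pretrained vectors.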



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and an LSTM unit.
- A Decoder consisting of an embedding layer, an LSTM unit, and a linear output unit.

The Seq2Seq model works as follows: the Encoder reads the input sequence and passes its final hidden state to the Decoder, which uses that state to generate a series of token predictions, one token at a time.
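
The data flow described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project's `model.py`: the class names `TinyEncoder` and `TinyDecoder`, the `greedy_decode` helper, and the small sizes are all assumptions chosen to keep the example short. For simplicity each class owns its embedding layer, and decoding always picks the most likely token.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Embedding layer followed by an LSTM; returns the final (hidden, cell) state."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)

    def forward(self, tokens):
        # tokens: (batch, src_len) integer ids
        _, state = self.lstm(self.embedding(tokens))
        return state  # tuple (h_n, c_n), each of shape (1, batch, hidden_size)


class TinyDecoder(nn.Module):
    """Embedding layer, LSTM, and a linear layer that scores every vocabulary token."""

    def __init__(self, vocab_size, embed_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, state):
        # token: (batch, 1), a single decoding step
        output, state = self.lstm(self.embedding(token), state)
        return self.out(output.squeeze(1)), state  # logits: (batch, vocab_size)


def greedy_decode(encoder, decoder, src, sos_id, max_len):
    """Encode src, then feed the decoder its own argmax prediction at each step."""
    state = encoder(src)
    token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
    predictions = []
    for _ in range(max_len):
        logits, state = decoder(token, state)
        token = logits.argmax(dim=1, keepdim=True)
        predictions.append(token)
    return torch.cat(predictions, dim=1)  # (batch, max_len)


vocab_size, embed_size, hidden_size = 20, 8, 16
encoder = TinyEncoder(vocab_size, embed_size, hidden_size)
decoder = TinyDecoder(vocab_size, embed_size, hidden_size)
src = torch.randint(0, vocab_size, (2, 5))  # batch of 2 sequences, 5 tokens each
with torch.no_grad():
    predicted = greedy_decode(encoder, decoder, src, sos_id=1, max_len=4)
print(predicted.shape)  # torch.Size([2, 4])
```

The models here are untrained, so the predicted token ids are meaningless; the point is only the shape of the hand-off, where the Encoder's final state becomes the Decoder's initial state.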

## Dependencies

- PyTorch
- NumPy
- pandas
- NLTK
- gzip (Python standard library)
- Gensim


Please choose a dataset from the torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
!pip install numpy



In [3]:
!pip install torch

Collecting torch
  Downloading torch-1.13.0-cp39-cp39-win_amd64.whl (167.2 MB)
Installing collected packages: torch
Successfully installed torch-1.13.0


In [6]:
import random

import torch
from torch import nn, optim

from evaluate import evaluate_input
from model import Encoder, Decoder
from prepare_data import get_vocab_and_sentence_pairs
from train import train_iterations, build_models


dataset = 'squad1'

voc, pairs_train, pairs_valid = get_vocab_and_sentence_pairs(dataset)

randomize = random.choice(pairs_train)
print('random sentence {}'.format(randomize))

# The vocabulary size sets both the encoder input size and the decoder output size
input_size = voc.num_words
output_size = voc.num_words
# print('Input : {} Output : {}'.format(input_size, output_size))

############################################
#
# Note: I did not have the ability to test
# this on a GPU-enabled PC, so this might not
# work on GPU
#
############################################

batch_size = 128
embed_size = 256
hidden_size = 512
num_layers = 1
num_iteration = 50  # 0 means use the whole dataset, otherwise n=50 would take the first 50 of the randomized pairs
num_epochs = 50
learning_rate = 0.0001
teacher_forcing_ratio = 0.5

embedding = nn.Embedding(voc.num_words, embed_size)

model_name = 'Simple_model'


encoder = Encoder(input_size, hidden_size, embedding, num_layers)
decoder = Decoder(hidden_size, voc.num_words, embedding, num_layers)
encoder_optimizer = optim.Adam(encoder.parameters(), lr=learning_rate)
decoder_optimizer = optim.Adam(decoder.parameters(), lr=learning_rate * 5.0)
scheduler_encoder = torch.optim.lr_scheduler.ReduceLROnPlateau(encoder_optimizer, mode='min', patience=5,
                                                               factor=0.5, min_lr=0.0000001, verbose=True)
scheduler_decoder = torch.optim.lr_scheduler.ReduceLROnPlateau(decoder_optimizer, mode='min', patience=5,
                                                               factor=0.5, min_lr=0.0000001, verbose=True)

for epoch in range(num_epochs):
    print('Beginning training for epoch {}'.format(epoch + 1))
    train_iterations(epoch, model_name, voc, pairs_train, pairs_valid, encoder, decoder, encoder_optimizer,
                     decoder_optimizer, scheduler_encoder, scheduler_decoder, embedding, num_iteration, batch_size,
                     10, 5, dataset, teacher_forcing_ratio, num_epochs)


Reading the json file
processing...


  main = pd.concat([ m[['id','question','context']].set_index('id'),js.set_index('q_idx')],1,sort=False).reset_index()


shape of the dataframe is (87599, 6)
Done
Reading the json file
processing...
shape of the dataframe is (10570, 5)
Done
Reading lines...
Read 87599 sentence pairs (train)
Trimmed to 87063 sentence pairs (train)
Read 10570 sentence pairs (valid)
Trimmed to 10514 sentence pairs (valid)
Counting words...
Counted words: 21685
keep_words 12800 / 21682 = 0.5904
Trimmed from 87063 pairs to 22987, 0.2640 of total
Trimmed from 10514 pairs to 6919, 0.6581 of total
random sentence ['how many times was american idol nominated for an emmy ?', 'nine']
Beginning training for epoch 1
Initializing ...
Training...
Epoch: 1; Iteration: 10; Percent complete: 0.4%; Average loss: 9.1866
Epoch: 1; Iteration: 20; Percent complete: 0.8%; Average loss: 7.9834
Epoch: 1; Iteration: 30; Percent complete: 1.2%; Average loss: 6.7597
Epoch: 1; Iteration: 40; Percent complete: 1.6%; Average loss: 6.5931
Epoch: 1; Iteration: 50; Percent complete: 2.0%; Average loss: 6.4788
Beginning epoch validation...
Epoch 0 validati

KeyboardInterrupt: 

In [5]:
from train import build_models

batch_size = 128
embed_size = 256
hidden_size = 512
num_layers = 1
num_iteration = 50  # 0 means use the whole dataset, otherwise n=50 would take the first 50 of the randomized pairs
num_epochs = 5
learning_rate = 0.0001
teacher_forcing_ratio = 0.5
dataset='squad1'
model_name = 'Simple_model'


# Reload the saved checkpoint.
iteration = 20  # when testing, change this to whichever iteration showed up last
epoch = 35      # likewise for the epoch
encoder, decoder, _, _ = build_models(load_filename=True,
                                      hidden_size=hidden_size,
                                      encoder_n_layers=num_layers,
                                      decoder_n_layers=num_layers,
                                      batch_size=batch_size,
                                      dataset_name=dataset,
                                      embedding_size=embed_size,
                                      model_name=model_name,
                                      iteration=iteration,
                                      epoch=epoch)

# `voc` and `evaluate_input` come from the training cell above, so run that cell first.
evaluate_input(encoder, decoder, voc)


Reading the json file
processing...
shape of the dataframe is (87599, 6)
Done
Reading the json file
processing...
shape of the dataframe is (10570, 5)
Done
Reading lines...
Read 87599 sentence pairs (train)
Trimmed to 87063 sentence pairs (train)
Read 10570 sentence pairs (valid)
Trimmed to 10514 sentence pairs (valid)
Counting words...
Counted words: 21685
keep_words 12800 / 21682 = 0.5904
Trimmed from 87063 pairs to 22987, 0.2640 of total
Trimmed from 10514 pairs to 6919, 0.6581 of total


RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
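
The `failed finding central directory` message generally indicates that the file being loaded is not a complete zip archive. Since PyTorch 1.6, `torch.save` writes checkpoints as zip archives by default, so one plausible cause (an assumption here, since the notebook does not show which file was loaded) is a checkpoint whose save was interrupted. The sketch below uses only the standard library to check a file before loading it; `checkpoint_looks_complete` and the file names are hypothetical.

```python
import os
import tempfile
import zipfile


def checkpoint_looks_complete(path):
    """Cheap sanity check: the file exists, is non-empty, and is a readable zip archive."""
    return os.path.isfile(path) and os.path.getsize(path) > 0 and zipfile.is_zipfile(path)


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good_checkpoint.tar")  # hypothetical file names
    truncated = os.path.join(tmp, "truncated_checkpoint.tar")

    # Stand-in for a finished checkpoint: any valid zip archive.
    with zipfile.ZipFile(good, "w") as archive:
        archive.writestr("data.pkl", b"placeholder")

    # Simulate an interrupted save by keeping only the first half of the bytes.
    with open(good, "rb") as src, open(truncated, "wb") as dst:
        payload = src.read()
        dst.write(payload[: len(payload) // 2])

    good_ok = checkpoint_looks_complete(good)
    truncated_ok = checkpoint_looks_complete(truncated)

print(good_ok, truncated_ok)  # True False
```

If the check fails for the checkpoint you are trying to load, retraining until a checkpoint is saved without interruption, or pointing `iteration` and `epoch` at an earlier intact checkpoint, is the likely way forward.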