# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a sequence-to-sequence (Seq2Seq) text generation architecture with an LSTM as its memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Using pretrained word embeddings in your model is optional. The starter code below uses Gensim to train, save, and load Word2Vec embeddings built from the Brown corpus. You can compare the performance of a model that uses these embeddings against a model that does not.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by feeding an input sequence into the Encoder, which passes its final hidden state to the Decoder. The Decoder then uses that hidden state to output a series of token predictions.
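The handoff described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the project solution: the sizes, the use of token id 0 as a start token, and the greedy `argmax` decoding are all assumptions made for the example.

```python
import torch
import torch.nn as nn

# Hypothetical sizes chosen only for illustration.
VOCAB_SIZE, EMBED_DIM, HIDDEN_SIZE = 20, 8, 16

# Encoder: an embedding layer followed by an LSTM.
enc_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
enc_lstm = nn.LSTM(EMBED_DIM, HIDDEN_SIZE, batch_first=True)

# Decoder: an embedding layer, an LSTM, and a linear output layer.
dec_embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
dec_lstm = nn.LSTM(EMBED_DIM, HIDDEN_SIZE, batch_first=True)
dec_output = nn.Linear(HIDDEN_SIZE, VOCAB_SIZE)

# One source sequence of 5 token ids, shape (batch=1, seq_len=5).
src = torch.randint(0, VOCAB_SIZE, (1, 5))

# The encoder's final hidden and cell states summarize the input.
_, (h, c) = enc_lstm(enc_embedding(src))

# Assumption: token id 0 acts as the start-of-sequence token.
token = torch.zeros(1, 1, dtype=torch.long)
predictions = []
for _ in range(4):
    # The decoder starts from the encoder state and updates it each step.
    out, (h, c) = dec_lstm(dec_embedding(token), (h, c))
    logits = dec_output(out[:, -1, :])            # shape (1, VOCAB_SIZE)
    token = logits.argmax(dim=-1, keepdim=True)   # greedy choice, shape (1, 1)
    predictions.append(token.item())

print(predictions)  # four predicted token ids (untrained, so arbitrary)
```

Because the layers are untrained, the predicted ids are meaningless; the point is only the flow of `(h, c)` from the Encoder into the Decoder loop.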

## Dependencies

- PyTorch
- Torchtext
- NumPy
- Pandas
- NLTK
- spaCy
- gzip (Python standard library)
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the SQuAD dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import pandas as pd
import numpy as np
import gzip
from typing import List
import random

import torch
import torch.nn as nn  # used by the Encoder, Decoder, and Seq2Seq classes below
import torchtext
# Note: the torchtext.legacy module is only available in torchtext 0.9 through 0.11
from torchtext.legacy.data import Field, BucketIterator
from torchtext.legacy.datasets import Multi30k

import gensim
import spacy

import nltk
from nltk.corpus import stopwords, brown
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

nltk.download('punkt') #data files for tokenization
nltk.download('wordnet') #data files for lemmatization
nltk.download('brown') #Brown corpus, used to train the word embeddings

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ipinmi/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/ipinmi/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /Users/ipinmi/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package brown to /Users/ipinmi/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [None]:
SEED = 47  # for reproducibility

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [None]:
# Train, save, and load Word2Vec embeddings on the Brown corpus

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')
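One way to use these vectors is to copy them into a weight matrix for the model's embedding layer. The sketch below assumes you already have a vocabulary list; `build_embedding_matrix` is a hypothetical helper, and a plain dictionary stands in for `w2v.wv` so the example runs without Gensim. With the real embeddings you would pass `w2v.wv`, which supports the same `in` and indexing operations, together with its vector size (100 by default).

```python
import numpy as np


def build_embedding_matrix(vocab, vectors, dim, seed=47):
    """Return a (len(vocab), dim) float32 matrix of embedding weights.

    Words found in `vectors` get their pretrained vector; all other words
    keep a small random initialization.
    """
    rng = np.random.default_rng(seed)
    matrix = rng.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for idx, word in enumerate(vocab):
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix


# Toy stand-in for w2v.wv (assumption: 4-dimensional vectors).
toy_vectors = {
    "hello": np.ones(4, dtype=np.float32),
    "world": np.full(4, 2.0, dtype=np.float32),
}
vocab = ["<pad>", "hello", "world", "chatbot"]

matrix = build_embedding_matrix(vocab, toy_vectors, dim=4)
print(matrix.shape)  # (4, 4)
```

The resulting array can be handed to `torch.nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=False)` if you want the embeddings to keep training with the rest of the model.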

In [None]:
## Tokenization using spaCy
spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")


def tokenize_de(text: str) -> List[str]:
    # Tokenize German text and reverse the token order
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]


def tokenize_en(text: str) -> List[str]:
    # Tokenize English text, keeping the original token order
    return [tok.text for tok in spacy_en.tokenizer(text)]
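Token lists such as those returned by `tokenize_en` still have to be mapped to integer ids before an embedding layer can consume them. The following is a rough sketch of that step; `build_vocab`, `encode`, and the four special tokens are hypothetical names chosen for illustration, not part of the starter code.

```python
from collections import Counter

# Hypothetical special tokens: padding, start, end, and unknown word.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]


def build_vocab(token_lists, min_freq=1):
    """Map every token seen at least `min_freq` times to an integer id."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = SPECIALS + sorted(tok for tok, n in counts.items() if n >= min_freq)
    return {tok: idx for idx, tok in enumerate(itos)}


def encode(tokens, stoi):
    """Convert tokens to ids, wrapping the sequence in <sos> and <eos>."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(t, unk) for t in tokens] + [stoi["<eos>"]]


stoi = build_vocab([["hello", "there"], ["hello", "bot"]])
ids = encode(["hello", "stranger"], stoi)
print(ids)  # [1, 5, 3, 2]
```

The size of this mapping, `len(stoi)`, is what you would typically pass as the vocabulary size of the embedding layer.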

In [1]:
def loadDF(split_set):
    '''
    Load a dataset split into a Pandas DataFrame for processing.

    Args:
        split_set: the dataset split you want to load into a Pandas DataFrame

    Returns:
        A DataFrame with the columns question, context, answer, and answer_start
    '''
    # Convert the dataset into a Pandas DataFrame
    data_examples = []
    for example in split_set:
        data = {
            "question": example[1],
            "context": example[0],
            "answer": example[2][0],
            "answer_start": example[3],
        }
        data_examples.append(data)
    df = pd.DataFrame(data_examples)

    return df


def prepare_text(sentence):
    
    '''

    Our text needs to be cleaned with a tokenizer. This function will perform that task.
    https://www.nltk.org/api/nltk.tokenize.html

    '''

    # Convert text to lowercase
    text = sentence.lower()
    
    # Remove punctuation and digits
    text = ''.join(c for c in text if c.isalpha() or c.isspace())
    
    # Tokenize the text
    tokens = nltk.word_tokenize(text)
    
    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]

    # Lemmatize the tokens
    lemmatizer = nltk.WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    
    # Return the cleaned text as a list of tokens
    return tokens



def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset
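As a starting point for `train_test_split`, the sketch below shuffles question and response pairs together and holds out a fraction for testing. The helper name `split_pairs`, the 80/20 ratio, and the fixed seed are assumptions for illustration; adapt them to your dataset.

```python
import random


def split_pairs(src, trg, test_ratio=0.2, seed=47):
    """Shuffle aligned (src, trg) pairs and split them into train and test."""
    if len(src) != len(trg):
        raise ValueError("src and trg must contain the same number of items")

    # Shuffle indices rather than the lists so each pair stays aligned.
    indices = list(range(len(src)))
    random.Random(seed).shuffle(indices)

    n_test = int(len(indices) * test_ratio)
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    src_train = [src[i] for i in train_idx]
    src_test = [src[i] for i in test_idx]
    trg_train = [trg[i] for i in train_idx]
    trg_test = [trg[i] for i in test_idx]
    return src_train, src_test, trg_train, trg_test


questions = [f"q{i}" for i in range(10)]
answers = [f"a{i}" for i in range(10)]
src_train, src_test, trg_train, trg_test = split_pairs(questions, answers)
print(len(src_train), len(src_test))  # 8 2
```

Shuffling a shared index list keeps every question matched with its response, which matters because the Encoder and Decoder are trained on aligned pairs.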



In [None]:

class Encoder(nn.Module):
    
    def __init__(self, input_size, hidden_size):
        
        super(Encoder, self).__init__()
        
        # self.embedding provides a vector representation of the inputs to our model
        
        # self.lstm accepts the embedded input and outputs a hidden state
        
    
    def forward(self, i):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        return o, h, c
    

class Decoder(nn.Module):
      
    def __init__(self, hidden_size, output_size):
        
        super(Decoder, self).__init__()
        
        # self.embedding provides a vector representation of the target to our model
        
        # self.lstm accepts the embeddings and outputs a hidden state

        # self.output predicts on the hidden state via a linear output layer
        
    def forward(self, i, h):
        
        '''
        Inputs: i, the target vector
                h, the hidden state from the previous step
        Outputs: o, the prediction
                h, the hidden state
        '''
        
        return o, h
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder_input_size, encoder_hidden_size, decoder_hidden_size, decoder_output_size):
        
        super(Seq2Seq, self).__init__()
        
    
    
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        
        return o
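The `teacher_forcing_ratio` argument controls how often the Decoder is fed the ground-truth token instead of its own previous prediction. The sketch below shows one possible decoding loop; `TinyDecoder` and `decode_with_teacher_forcing` are hypothetical stand-ins written for this example, and the zero initial state takes the place of a real Encoder's output.

```python
import random

import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in decoder used only for this illustration."""

    def __init__(self, vocab_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.output = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, state):
        out, state = self.lstm(self.embedding(token), state)
        return self.output(out[:, -1, :]), state


def decode_with_teacher_forcing(decoder, trg, state, teacher_forcing_ratio=0.5):
    """Decode step by step; trg[:, 0] is assumed to be the start token."""
    _, trg_len = trg.shape
    token = trg[:, :1]
    step_logits = []
    for t in range(1, trg_len):
        logits, state = decoder(token, state)
        step_logits.append(logits)
        # With probability teacher_forcing_ratio feed the true next token,
        # otherwise feed the model's own greedy prediction.
        if random.random() < teacher_forcing_ratio:
            token = trg[:, t:t + 1]
        else:
            token = logits.argmax(dim=-1, keepdim=True)
    return torch.stack(step_logits, dim=1)  # (batch, trg_len - 1, vocab_size)


vocab_size, hidden_size = 12, 6
decoder = TinyDecoder(vocab_size, hidden_size)
trg = torch.randint(0, vocab_size, (2, 5))
# Zero state standing in for the encoder's final (h, c).
state = (torch.zeros(1, 2, hidden_size), torch.zeros(1, 2, hidden_size))
outputs = decode_with_teacher_forcing(decoder, trg, state)
print(outputs.shape)  # torch.Size([2, 4, 12])
```

A ratio of 1.0 always feeds the ground truth, which tends to speed up early training, while 0.0 matches inference, where no ground truth is available.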

    

