# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [1]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
import torch.nn as nn
from nltk.corpus import brown
from nltk.tokenize import RegexpTokenizer
import torch.optim as optim
import random

In [2]:
# this cell for testing
!pip install torchdata==0.3.0

Defaulting to user installation because normal site-packages is not writeable
Collecting torchdata==0.3.0
  Downloading torchdata-0.3.0-py3-none-any.whl (47 kB)
[K     |████████████████████████████████| 47 kB 3.4 MB/s eta 0:00:011
Installing collected packages: torchdata
Successfully installed torchdata-0.3.0


In [2]:
from torchtext.datasets import SQuAD2
train , dev = SQuAD2(split=('train','dev'))
w,q,z = [] , [] ,[]
for i  in train :
    w.append(i[0])
    q.append(i[1])
    z.append(*i[2])
squad_dict_train={'text':w , 'SRC':q , 'TRG':z}
u,y,t = [] , [] ,[]
for l  in dev :
    u.append(l[0])
    y.append(l[1])
    t.append(l[2])
squad_dict_dev={'text':u , 'SRC':y , 'TRG':t}

In [3]:
df_train=pd.DataFrame.from_dict(squad_dict_train)
df_dev=pd.DataFrame.from_dict(squad_dict_dev)
df_dev
df_train["text_src"] = " SOS " + df_train["text"] +" SOS " +df_train["SRC"]+" EOS "
df_dev["text_src"] = " SOS " + df_dev["text"] +" SOS " +df_dev["SRC"]+ " EOS " 

In [4]:
df_train

Unnamed: 0,text,SRC,TRG,text_src
0,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce start becoming popular?,in the late 1990s,SOS Beyoncé Giselle Knowles-Carter (/biːˈjɒns...
1,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,What areas did Beyonce compete in when she was...,singing and dancing,SOS Beyoncé Giselle Knowles-Carter (/biːˈjɒns...
2,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,When did Beyonce leave Destiny's Child and bec...,2003,SOS Beyoncé Giselle Knowles-Carter (/biːˈjɒns...
3,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,In what city and state did Beyonce grow up?,"Houston, Texas",SOS Beyoncé Giselle Knowles-Carter (/biːˈjɒns...
4,Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b...,In which decade did Beyonce become famous?,late 1990s,SOS Beyoncé Giselle Knowles-Carter (/biːˈjɒns...
...,...,...,...,...
130314,"The term ""matter"" is used throughout physics i...",Physics has broadly agreed on the definition o...,,"SOS The term ""matter"" is used throughout phys..."
130315,"The term ""matter"" is used throughout physics i...",Who coined the term partonic matter?,,"SOS The term ""matter"" is used throughout phys..."
130316,"The term ""matter"" is used throughout physics i...",What is another name for anti-matter?,,"SOS The term ""matter"" is used throughout phys..."
130317,"The term ""matter"" is used throughout physics i...",Matter usually does not need to be used in con...,,"SOS The term ""matter"" is used throughout phys..."


In [27]:
df_train.to_csv('train.csv',index=False)
df_dev.to_csv('dev.csv',index=False)

In [6]:
nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [7]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [8]:
SOS_token = 0
EOS_token = 1
class chatbot:
    def __init__(self):
       #initialize containers to hold the words and corresponding index
        self.word2index = {}
        self.word2count = {}
        self.index2word = {0: "SOS", 1: "EOS"}
        self.n_words = 2  # Count SOS and EOS
        
       #split a sentence into words and add it to the container
    def addSentence(self, sentence):
        tokenizer = RegexpTokenizer(r'\w+')
        tokens = tokenizer.tokenize(str(sentence))
        for word in tokens :
            self.addword(word)
      
    #If the word is not in the container, the word will be added to it, 
    #else, update the word counter
    
    def addword(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1
        else:
            self.word2count[word] += 1


In [9]:
def process_data(path):
    
    #Load the dataset into a Pandas Dataframe for processing.
    
    df = pd.read_csv(path)
    
    TEXT_SRC = chatbot()
    TRG  = chatbot()
    pairs = []
    
    for i in range(len(df)):
        
        full=[df['text_src'][i],df['TRG'][i]]
        TEXT_SRC.addSentence(df['text_src'][i])
        TRG.addSentence(str(df['TRG'][i]))
        pairs.append(full)
        
    return TEXT_SRC , TRG , pairs

In [10]:

class Encoder(nn.Module):
    
    def __init__(self,input_dim,emb_dim,hid_dim,n_layers,dropout ):
        
        super(Encoder, self).__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        
        # self.embedding provides a vector representation of the inputs to our model
        self.embedding = nn.Embedding(input_dim, self.emb_dim)
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.lstm = nn.LSTM(self.emb_dim, self.hid_dim, num_layers=self.n_layers,dropout=dropout)
        self.dropout = nn.Dropout(dropout)
    def forward(self, src):
        
        
        '''
        Inputs: src , the src vector
        Outputs: outputs, the encoder outputs
                hidden, the hidden state
                cell, the cell state
        '''
        # src : [sen_len, batch_size]
        # embedded : [sen_len, batch_size, emb_dim]
        # outputs = [sen_len, batch_size, hid_dim * n_directions]
        # hidden = [n_layers * n_direction, batch_size, hid_dim]
        # cell = [n_layers * n_direction, batch_size, hid_dim]
        embedded = self.dropout(self.embedding(src).view(1,1,-1))
        outputs, (hidden, cell) = self.lstm(embedded)
        return outputs , hidden, cell
    

class Decoder(nn.Module):
      
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        
        super(Decoder, self).__init__()
        
        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        # self.emb_dim provides a vector representation of the target to our model
        
        self.embedding = nn.Embedding(output_dim, self.emb_dim)
        
        # self.lstm, accepts the embeddings and outputs a hidden state
        
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=self.n_layers, dropout=dropout)

        # self.ouput, predicts on the hidden state via a linear output layer
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        self.softmax = nn.LogSoftmax(dim=1)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, i , hidden, cell):
                
        # input = [batch_size]
        # hidden = [n_layers * n_dir, batch_size, hid_dim]
        # cell = [n_layers * n_dir, batch_size, hid_dim]
        
        embedded = self.dropout(self.embedding(input))
        
        # embedded = [1, batch_size, emb_dim]
        i =  i.view(1, -1)
        '''
        Inputs: i, the target vector
        Outputs: prediction, the prediction
                hidden, the hidden state
        '''
        embedded = self.dropout(F.relu(self.embedding(i)))
        # embedded = [1, batch_size, emb_dim]
        
        output, (hidden, cell) = self.lstm(embedded, (hidden, cell))
        # output = [seq_len, batch_size, hid_dim * n_dir]
        # hidden = [n_layers * n_dir, batch_size, hid_dim]
        # cell = [n_layers * n_dir, batch_size, hid_dim]
        
        # seq_len and n_dir will always be 1 in the decoder
        prediction = self.softmax(self.fc_out(output[0]))
        # prediction = [batch_size, output_dim]
        
        return prediction, hidden, cell
        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder, decoder, device):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            'hidden dimensions of encoder and decoder must be equal.'
        assert encoder.n_layers == decoder.n_layers, \
            'n_layers of encoder and decoder must be equal.'
    
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):      

        # src = [sen_len, batch_size]
        # trg = [sen_len, batch_size]
        # teacher_forcing_ratio : the probability to use the teacher forcing.
        
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        # first input to the decoder is the <sos> token.
        i = trg[0, :]
        for t in range(1, trg_len):
            # insert input token embedding, previous hidden and previous cell states 
            # receive output tensor (predictions) and new hidden and cell states.
            output, hidden, cell = self.decoder(i, hidden, cell)
            
            # replace predictions in a tensor holding predictions for each token
            outputs[t] = output
            
            # decide if we are going to use teacher forcing or not.
            teacher_force = random.random() < teacher_forcing_ratio
            
            # get the highest predicted token from our predictions.
            top1 = output.argmax(1)
            # update input : use ground_truth when teacher_force 
            i = trg[t] if teacher_force else top1
        
        return outputs

    



In [19]:
TEXT_SRC , TRG , pairs = process_data("train.csv")
print(TEXT_SRC.size())

AttributeError: 'chatbot' object has no attribute 'size'

In [20]:
input_dim=TEXT_SRC.n_words
output_dim=TRG.n_words
emb_dim = 256
hid_dim = 512
n_layers = 2
num_iteration = 100000
dropout=0.2
encoder=Encoder(input_dim,emb_dim,hid_dim,n_layers,dropout)
decoder=Decoder(output_dim, emb_dim, hid_dim, n_layers, dropout)
model = Seq2Seq(encoder, decoder, device).to(device)


In [39]:
print(input_dim)

98438


In [21]:
print(model)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(98439, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.2)
    (dropout): Dropout(p=0.2, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(42761, 256)
    (lstm): LSTM(256, 512, num_layers=2, dropout=0.2)
    (fc_out): Linear(in_features=512, out_features=42761, bias=True)
    (softmax): LogSoftmax(dim=1)
    (dropout): Dropout(p=0.2, inplace=False)
  )
)


In [22]:
def indexesFromSentence(lang, sentence):
    tokenizer = RegexpTokenizer(r'\w+')
    tokens = tokenizer.tokenize(str(sentence))
    return [lang.word2index[word] for word in tokens]

def tensorFromSentence(lang, sentence):
    indexes = indexesFromSentence(lang, sentence)
    indexes.append(EOS_token)
    return torch.tensor(indexes, dtype=torch.long, device=device).view(-1, 1)

def tensorsFromPair(input_text_srg, output_trg, pair):
    text_srg_tensor = tensorFromSentence(input_text_srg, pair[0])
    trg_tensor = tensorFromSentence(output_trg, pair[1])
    return (text_srg_tensor, trg_tensor)

In [25]:


def trainModel(model, TEXT_SRG,TRG, pairs, num_iteration=20000):
    model.train()
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    criterion = nn.NLLLoss()
    total_loss_iterations = 0
    training_pairs = [tensorsFromPair(TEXT_SRG,TRG, random.choice(pairs)) for i in range(num_iteration)]
    for iter in range(1, num_iteration+1):
        training_pair = training_pairs[iter - 1]
        input_tensor = training_pair[0]
        trg_tensor = training_pair[1]
        
        input_length = input_tensor.size(0)
        loss = 0
        epoch_loss = 0
        print(input_tensor.size())

        output = model(input_tensor, trg_tensor)

        num_iter = output.size(0)
        print(num_iter)

        #calculate the loss from a predicted sentence with the expected result
        for ot in range(num_iter):
            loss += criterion(output[ot], target_tensor[ot])

        loss.backward()
        model_optimizer.step()
        epoch_loss = loss.item() / num_iter

        total_loss_iterations += epoch_loss
        if iter % 5000 == 0:
            avarage_loss= total_loss_iterations / 5000
            total_loss_iterations = 0
            print('%d %.4f' % (iter, avarage_loss))
            
    torch.save(model.state_dict(), 'mytraining.pt')
    return model

In [26]:
model = trainModel(model, TEXT_SRC, TRG , pairs, num_iteration)

torch.Size([106, 1])


RuntimeError: input.size(-1) must be equal to input_size. Expected 256, got 27136

In [42]:
print(encoder)

Encoder(
  (embedding): Embedding(98438, 256)
  (lstm): LSTM(256, 512, num_layers=2, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
)
