# LSTM Bot

## Project Overview

In this project, you will build a chatbot that can converse with you at the command line. The chatbot will use a Sequence to Sequence text generation architecture with an LSTM as it's memory unit. You will also learn to use pretrained word embeddings to improve the performance of the model. At the conclusion of the project, you will be able to show your chatbot to potential employers.

Additionally, you have the option to use pretrained word embeddings in your model. We have loaded Brown Embeddings from Gensim in the starter code below. You can compare the performance of your model with pre-trained embeddings against a model without the embeddings.



---



A sequence to sequence model (Seq2Seq) has two components:
- An Encoder consisting of an embedding layer and LSTM unit.
- A Decoder consisting of an embedding layer, LSTM unit, and linear output unit.

The Seq2Seq model works by accepting an input into the Encoder, passing the hidden state from the Encoder to the Decoder, which the Decoder uses to output a series of token predictions.

## Dependencies

- Pytorch
- Numpy
- Pandas
- NLTK
- Gzip
- Gensim


Please choose a dataset from the Torchtext website. We recommend looking at the Squad dataset first. Here is a link to the website where you can view your options:

- https://pytorch.org/text/stable/datasets.html





In [2]:
#!pip install dataloader
#from torch.utils.data import Dataset, DataLoader

In [3]:
import pandas as pd
import json
import string

In [4]:
import gensim
import nltk
import numpy as np
import pandas as pd
import gzip
import torch
from nltk.corpus import brown
from nltk.tokenize import word_tokenize
from pandas.io.json import json_normalize 

nltk.download('brown')
nltk.download('punkt')

# Output, save, and load brown embeddings

model = gensim.models.Word2Vec(brown.sents())
model.save('brown.embedding')

w2v = gensim.models.Word2Vec.load('brown.embedding')

def clean_text(x):
    if len(x) > 0:
        return x[0]["text"]

def loadDF(path):
    
    df = pd.read_json(path) 
    data = pd.json_normalize(data = df['data'],
                            record_path =['paragraphs', 'qas'])
    data = data.drop('id', axis=1)
    data = data.drop('is_impossible', axis=1)
    data = data.drop('plausible_answers', axis=1)
    data['answers'] = data['answers'].apply(clean_text)
    
    return data

def tokenize_en(text):
    if text != None:
        text = text.translate(str.maketrans('', '', string.punctuation))
        return [tok.lower().strip() for tok in word_tokenize(text)]
    return;

def prepare_text(df_train, dev_test):
    
    for i,r, in df_train.iterrows():
        answerlist = r['answers']
        if answerlist != None:
            answerlist = [i for i in answerlist if i]
            if len(answerlist) == 0:
                r['answers'] =  np.NaN
                     
    df_train = df_train.dropna(axis=0, subset=['answers'])
    df_train.reset_index(drop=True)
   
    
    df_train['question_token'] = df_train['question'].apply(tokenize_en)
    df_train['answer_token'] = df_train['answers'].apply(tokenize_en) 
    
    dev_test['question_token'] = dev_test['question'].apply(tokenize_en)
    dev_test['answer_token'] = dev_test['answers'].apply(tokenize_en) 
    
    return df_train, dev_test



def train_test_split(SRC, TRG):
    
    '''
    Input: SRC, our list of questions from the dataset
            TRG, our list of responses from the dataset

    Output: Training and test datasets for SRC & TRG

    '''
    
    return SRC_train_dataset, SRC_test_dataset, TRG_train_dataset, TRG_test_dataset


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [5]:
df_train = loadDF('./train-v2.0.json')

In [6]:
dev_test = loadDF('./dev-v2.0.json')

In [7]:
pd.set_option('display.max_rows', None)

In [8]:
df_train.head()

Unnamed: 0,question,answers
0,When did Beyonce start becoming popular?,in the late 1990s
1,What areas did Beyonce compete in when she was...,singing and dancing
2,When did Beyonce leave Destiny's Child and bec...,2003
3,In what city and state did Beyonce grow up?,"Houston, Texas"
4,In which decade did Beyonce become famous?,late 1990s


In [9]:
dev_test.head()

Unnamed: 0,question,answers
0,In what country is Normandy located?,France
1,When were the Normans in Normandy?,10th and 11th centuries
2,From which countries did the Norse originate?,"Denmark, Iceland and Norway"
3,Who was the Norse leader?,Rollo
4,What century did the Normans first gain their ...,10th century


In [10]:
df_train,  dev_test = prepare_text(df_train,  dev_test)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [11]:
df_train.head()

Unnamed: 0,question,answers,question_token,answer_token
0,When did Beyonce start becoming popular?,in the late 1990s,"[when, did, beyonce, start, becoming, popular]","[in, the, late, 1990s]"
1,What areas did Beyonce compete in when she was...,singing and dancing,"[what, areas, did, beyonce, compete, in, when,...","[singing, and, dancing]"
2,When did Beyonce leave Destiny's Child and bec...,2003,"[when, did, beyonce, leave, destinys, child, a...",[2003]
3,In what city and state did Beyonce grow up?,"Houston, Texas","[in, what, city, and, state, did, beyonce, gro...","[houston, texas]"
4,In which decade did Beyonce become famous?,late 1990s,"[in, which, decade, did, beyonce, become, famous]","[late, 1990s]"


In [12]:
dev_test.head()

Unnamed: 0,question,answers,question_token,answer_token
0,In what country is Normandy located?,France,"[in, what, country, is, normandy, located]",[france]
1,When were the Normans in Normandy?,10th and 11th centuries,"[when, were, the, normans, in, normandy]","[10th, and, 11th, centuries]"
2,From which countries did the Norse originate?,"Denmark, Iceland and Norway","[from, which, countries, did, the, norse, orig...","[denmark, iceland, and, norway]"
3,Who was the Norse leader?,Rollo,"[who, was, the, norse, leader]",[rollo]
4,What century did the Normans first gain their ...,10th century,"[what, century, did, the, normans, first, gain...","[10th, century]"


In [13]:
word2VectorDict = w2v.wv.key_to_index 

In [14]:
SOS_token = 55000
EOS_token = 55001

In [15]:
class Lang:
    def __init__(self, name):
        self.name = name
        self.word2index = {"SOS": 55000, "EOS": 55001}
        self.word2count = {}
        self.index2word = {55000: "SOS", 55001: "EOS"}
        self.n_words = 2  

    def addSentence(self, sentence):
        for word in sentence.split(' '):
            self.addWord(word)
            

    def addWord(self, word):
        if word not in self.word2index:
                if word in word2VectorDict:
                    indexOfWord = word2VectorDict[word]
                    self.word2index[word] = indexOfWord
                    self.index2word[indexOfWord] = word
                    self.n_words += 1
                else:
                    self.word2index[word] = self.n_words
                    self.index2word[self.n_words] = word
                    self.n_words += 1
                self.word2count[word] = 1
        else:
            self.word2count[word] += 1

In [16]:
lang = Lang(name='alltokens')
Questionlang = Lang(name='allQuestiontokens')
Answerlang = Lang(name='allAnswertokens')

In [17]:
print(df_train.size)
df_train = df_train.head(5000)
print(df_train.size)

347284
20000


In [18]:
len(df_train.index)

5000

In [19]:
for i,r, in df_train.iterrows():
    text = r['question_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)
            Questionlang.addWord(t)

for i,r, in df_train.iterrows():
    text = r['answer_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)
            Answerlang.addWord(t)

In [20]:
lang.word2index

{'SOS': 55000,
 'EOS': 55001,
 'when': 71,
 'did': 105,
 'beyonce': 4,
 'start': 728,
 'becoming': 2011,
 'popular': 1141,
 'what': 79,
 'areas': 420,
 'compete': 4477,
 'in': 7,
 'she': 58,
 'was': 10,
 'growing': 1025,
 'up': 61,
 'leave': 485,
 'destinys': 17,
 'child': 461,
 'and': 4,
 'become': 270,
 'a': 6,
 'solo': 10641,
 'singer': 9817,
 'city': 375,
 'state': 182,
 'grow': 1774,
 'which': 35,
 'decade': 2422,
 'famous': 1232,
 'rb': 30,
 'group': 248,
 'the': 0,
 'lead': 825,
 'album': 11822,
 'made': 97,
 'her': 40,
 'worldwide': 37,
 'known': 400,
 'artist': 2053,
 'who': 53,
 'managed': 3109,
 'beyoncé': 42,
 'rise': 1077,
 'to': 5,
 'fame': 5979,
 'role': 1044,
 'have': 33,
 'first': 87,
 'released': 4046,
 'as': 17,
 'release': 3132,
 'dangerously': 52,
 'love': 453,
 'how': 159,
 'many': 112,
 'grammy': 56,
 'awards': 5808,
 'win': 2057,
 'for': 11,
 'beyoncés': 60,
 'name': 335,
 'of': 3,
 'after': 123,
 'second': 292,
 'other': 74,
 'entertainment': 3716,
 'venture': 

In [21]:
max(lang.word2count, key=lang.word2count.get)

'the'

In [22]:
for i,r, in dev_test.iterrows():
    text = r['question_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)

for i,r, in dev_test.iterrows():
    text = r['answer_token']
    if text != None and text != '':
        text = [i for i in text if i]
        for t in text:
            lang.addWord(t)

In [23]:
import torch
device = torch.device(
  'cuda' if torch.cuda.is_available() else 'cpu')
print(device)

cpu


In [24]:
def indexesFromSentence(lang, sentencelist):
    if sentencelist != None and sentencelist != '':
        sentencelist = [i for i in sentencelist if i]
        if len(sentencelist) > 0:
            return [lang.word2index[word] for word in sentencelist]
    return [];

def tensorFromSentence(lang, sentence):
        indexes = indexesFromSentence(lang, sentence)
        indexes.append(EOS_token)
        #indexes.insert(0, SOS_token)
        return torch.tensor(indexes, dtype=torch.long, device= device).view(-1, 1)

In [25]:
source_data = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['question_token']]

In [26]:
target_data = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['answer_token']]  

In [27]:
dev_source_data = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['question_token']]

In [28]:
dev_target_data = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['answer_token']]  

In [29]:
source_list = source_data[0:5000]
target_list = target_data[0:5000]

In [30]:
def split(list_a, chunk_size):
    for i in range(0, len(list_a), chunk_size):
        yield list_a[i:i + chunk_size]

In [31]:
valid_target_data = dev_target_data[0:5000]

In [32]:
valid_source_data = dev_source_data[0:5000]

In [33]:
testabc = target_list

In [34]:
testabc[0][0,:]

tensor([7])

In [35]:
src = testabc[0]

In [36]:
x = src.view(  1, -1)

In [37]:
x

tensor([[    7,     0,   564,  4329, 55001]])

In [38]:
#df_train['source_data'] = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['question_token']]

In [39]:
#df_train['target_data'] = [tensorFromSentence(lang, sentencelist) for sentencelist in df_train['answer_token']]  

In [40]:
#dev_test['source_data'] = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['question_token']]

In [41]:
#dev_test['target_data'] = [tensorFromSentence(lang, sentencelist) for sentencelist in dev_test['answer_token']]  

In [42]:
#dev_test.head()

In [43]:
#train_data = df_train
#columns = ['question', 'answers']
#train_data.drop(columns, inplace=True, axis=1)
#pd.set_option('display.max_columns', None)
#pd.reset_option('max_columns')
#train_data.head()

In [44]:
#valid_data = dev_test
#columns = ['question', 'answers']
#valid_data.drop(columns, inplace=True, axis=1)
#valid_data.head()
#valid_data.shape

In [45]:
#def chunker(seq, size):
#    return (seq[pos:pos + size] for pos in range(0, len(seq), size))

In [46]:
#for i in chunker(train_data[['source_data', 'target_data']].head(10),2):
#    print(i['target_data'])
#a = train_data['target_data'].head(1)
#a[0].shape[0]
#a[0][0,:]

In [47]:
import torch.nn as nn
import torch.optim as optim

class Encoder(nn.Module):
    
    def __init__(self, input_dim, hid_dim, n_layers, dropout):
        
        super(Encoder, self).__init__()
        
        # self.embedding provides a vector representation of the inputs to our model
        
        # self.lstm, accepts the vectorized input and passes a hidden state
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        self.input_dim = input_dim
        
        self.embedding = nn.Embedding(self.input_dim, self.hid_dim)
        
        self.rnn = nn.LSTM(self.hid_dim, self.hid_dim, n_layers) #, dropout = dropout
        
        #self.dropout = nn.Dropout(dropout)

    #,  
    def forward(self, src, hidden, cell_state):
        
        '''
        Inputs: i, the src vector
        Outputs: o, the encoder outputs
                h, the hidden state
                c, the cell state
        '''
        
        #embedded = self.dropout(self.embedding(src))
        embedded = self.embedding(src)
        embedded = embedded.view(1, 1, -1)
        outputs, (hidden, cell) = self.rnn(embedded, (hidden, cell_state))
        
        #outputs are always from the top hidden layer
        
        return outputs, hidden, cell
    

class Decoder(nn.Module):
      
    def __init__(self,  output_dim, hid_dim, n_layers, dropout):
        
        super(Decoder, self).__init__()
        
        # self.embedding provides a vector representation of the target to our model
        
        # self.lstm, accepts the embeddings and outputs a hidden state

        # self.ouput, predicts on the hidden state via a linear output layer
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(self.output_dim, self.hid_dim)
        
        self.rnn = nn.LSTM(self.hid_dim, self.hid_dim, n_layers) #, dropout = dropout
        
        self.fc_out = nn.Linear(self.hid_dim, self.output_dim)
        
        self.softmax = nn.LogSoftmax(dim=1)
        
        #self.dropout = nn.Dropout(dropout)

        
    def forward(self, src, hidden, cell):
        
        '''
        Inputs: i, the target vector
        Outputs: o, the prediction
                h, the hidden state
        '''
        
              
        #embedded = self.dropout(self.embedding(input))     
        embedded = self.embedding(src) 
        embedded = embedded.view(1, 1, -1)       
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
                
        prediction = self.fc_out(output[0])
        
        prediction = self.softmax(prediction)
   
        return prediction, hidden, cell

        
        

class Seq2Seq(nn.Module):
    
    def __init__(self, encoder, decoder, device):
        
        super(Seq2Seq, self).__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"

        
    
    
    def forward(self, src, trg, src_len, trg_len, teacher_forcing_ratio = 0.5):
        
        output = {
            'decoder_output':[]
        }

        encoder_hidden = torch.zeros([1, 1, self.encoder.hid_dim]).to(device) # 1 = number of LSTM layers
        cell_state = torch.zeros([1, 1, self.encoder.hid_dim]).to(device)
        
        for i in range(src_len):
            encoder_output, encoder_hidden, cell_state = self.encoder(src[i], encoder_hidden, cell_state)
            #hidden, cell = self.encoder(src)
        
       
        #input = trg[0,:]
        decoder_input = torch.Tensor([[55000]]).long().to(device) # 0 = SOS_token
        decoder_hidden = encoder_hidden
        
        for t in range(trg_len):
            decoder_output, decoder_hidden, cell_state  = self.decoder(decoder_input, decoder_hidden, cell_state)
            output['decoder_output'].append(decoder_output)
            
            if self.training: # Model not in eval mode
                decoder_input = target_tensor[i] if random.random() > teacher_forcing_ratio else decoder_output.argmax(1) # teacher forcing
            else:
                _, top_index = decoder_output.data.topk(1)
                decoder_input = top_index.squeeze().detach()
            
        return outputs

In [48]:
INPUT_DIM =  len(Questionlang.word2index) #len(SRC.vocab)
print(INPUT_DIM)
print(Questionlang.n_words)
OUTPUT_DIM = len(Answerlang.word2index) #len(TRG.vocab)
print(OUTPUT_DIM)
print(Answerlang.n_words)

5855
5855
4432
4432


In [49]:
INPUT_DIM =  len(Questionlang.word2index) #len(SRC.vocab)
OUTPUT_DIM = len(Answerlang.word2index) #len(TRG.vocab)

HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

modelSeq2Seq = Seq2Seq(enc, dec, device).to(device)

In [50]:
#def init_weights(m):
#    for name, param in m.named_parameters():
#        nn.init.uniform_(param.data, -0.08, 0.08)
        
#modelSeq2Seq.apply(init_weights)

In [51]:
#def count_parameters(model):
#    return sum(p.numel() for p in model.parameters() if p.requires_grad)

#print(f'The model has {count_parameters(modelSeq2Seq):,} trainable parameters')

In [52]:
optimizer = optim.Adam(modelSeq2Seq.parameters())
criterion = nn.CrossEntropyLoss()

In [53]:
#TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
 #ignore_index = TRG_PAD_IDX

In [54]:
from sklearn.model_selection import KFold

def train(model, source_data, target_data, optimizer, criterion, epochs, print_every, batch_size):
    
    model.to(device)
    best_valid_loss = float('inf')
    total_training_loss = 0
    total_valid_loss = 0
    loss = 0
    
    kf = KFold(n_splits=epochs, shuffle=True)
    for e, (train_index, test_index) in enumerate(kf.split(source_data), 1):
        model.train()
        
        for i in range(0, len(train_index)):
            src = source_data[i]
            trg = target_data[i]
            
            output = model(src, trg, src.size(0), trg.size(0))
        
        current_loss = 0
        for (s, t) in zip(output["decoder_output"], trg): 
                current_loss += criterion(s, t)

        loss += current_loss
        total_training_loss += (current_loss.item() / trg.size(0)) # add the iteration loss
        
        if i % batch_size == 0 or i == (len(train_index)-1):
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            loss = 0
    
    model.eval()
    for i in range(0, len(test_index)):
        src = source_data[i]
        trg = target_data[i]

        output = model(src, trg, src.size(0), trg.size(0))

        current_loss = 0
        for (s, t) in zip(output["decoder_output"], trg): 
            current_loss += criterion(s, t)
            total_valid_loss += (current_loss.item() / trg.size(0)) # add the iteration loss
    
    if e % print_every == 0:
            training_loss_average = total_training_loss / (len(train_index)*print_every)
            validation_loss_average = total_valid_loss / (len(test_index)*print_every)
            print("{}/{} Epoch  -  Training Loss = {:.4f}  -  Validation Loss = {:.4f}".format(e, epochs, training_loss_average, validation_loss_average))
            total_training_loss = 0
            total_valid_loss = 0 
            
    if validation_loss_average < best_valid_loss:
        best_valid_loss = validation_loss_average
        torch.save(model.state_dict(), 'chatbot-model.pt')

        


In [55]:
train(model = modelSeq2Seq,
      source_data = source_list,
      target_data = target_list,
      optimizer = optimizer,
      criterion = criterion,
      epochs = 65,
      print_every = 5,
      batch_size = 128)

RuntimeError: Expected hidden[0] size (2, 1, 512), got [1, 1, 512]