# NEW ATTEMPT

## RNN LSTM Chatbot project
In this project I'm building a chatbot that answers questions from the Stanford Question Answering Dataset (SQuAD1), using a sequence-to-sequence encoder-decoder recurrent neural network architecture in PyTorch.

To make the notebook more readable and the code more modular, all helper functions (data ingestion and preparation, data analysis, vocabulary creation) were moved to modules.

The model is kept in the main notebook for now, for easier debugging.

## STRATEGY

To follow the example provided by the mentor and get to a correctly working dataloader that processes batches of data, I will make the following modifications to my previous approach:

- Add words from both questions and answers to the same vocabulary, in other words use only one vocabulary instead of two separate ones
    - the downside is that the chatbot's answers will consist of "chopped", stemmed words, whereas ideally the answers would have their own vocabulary of unstemmed words
- Instead of creating a list of pairs with questions and answers converted to tensors, I will turn the sequences of tokens into simple lists of integers (indexes) and store them in the dataframe, to be fed to the dataset/dataloader later
- Everything will be processed together, and only then will the complete dataframe be split into train and test (and validation) sets
- Both questions and answers will be padded to the same length
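
The plan above (one shared vocabulary, token lists turned into lists of integers, both sides padded to the same length) can be sketched on toy data as follows. This is only an illustration, not the notebook's actual code: the `encode` helper, the `MAX_LEN` value, and the choice to append `<EOS>` before padding are assumptions. The special-token indices match the ones used in the `Vocab` class.

```python
PAD, SOS, EOS, UNK = 0, 1, 2, 3  # same special-token indices as in Vocab


def encode(tokens, word2index, max_len):
    """Map tokens to indices, append <EOS>, then truncate or pad to max_len (hypothetical helper)."""
    ids = [word2index.get(tok, UNK) for tok in tokens][: max_len - 1]
    ids.append(EOS)
    return ids + [PAD] * (max_len - len(ids))


# toy shared vocabulary built from both questions and answers
word2index = {"<PAD>": PAD, "<SOS>": SOS, "<EOS>": EOS, "<UNK>": UNK}
pairs = [
    (["what", "front", "notr", "dame", "main", "build"], ["copper", "statu", "christ"]),
    (["basilica", "sacr", "heart"], ["main", "build"]),
]
for question, answer in pairs:
    for tok in question + answer:
        word2index.setdefault(tok, len(word2index))

MAX_LEN = 8  # assumed common length for questions and answers
encoded = [
    (encode(q, word2index, MAX_LEN), encode(a, word2index, MAX_LEN))
    for q, a in pairs
]
print(encoded[1][1])  # the second answer becomes [8, 9, 2, 0, 0, 0, 0, 0]
```

Because every sequence has the same length, the resulting lists can be stacked into a batch without a custom collate function.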

In [1]:
import torch

In [2]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [3]:
device

'cuda:0'

In [4]:
from torchtext.datasets import SQuAD1

In [5]:
train, test = SQuAD1("root")

In [6]:
from modules.data import get_dataframe, tokenize_sentence, sample_df_num, sample_df_perc, get_outliers

[nltk_data] Downloading package wordnet to
[nltk_data]     /shared/home/u076079/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /shared/home/u076079/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /shared/home/u076079/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


### Using just the train_df for now (big enough to use for training, testing and validation)

In [7]:
train_df = get_dataframe(train)

In [8]:
train_df.shape

(87599, 2)

In [9]:
train_df.head(3)

| | Question | Answer |
|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous |
| 1 | What is in front of the Notre Dame Main Building? | a copper statue of Christ |
| 2 | The Basilica of the Sacred heart at Notre Dame... | the Main Building |


In [10]:
from modules.vocab import Vocab

In [11]:
for col in ['Question', 'Answer']:
    train_df[col + '_tokens'] = train_df[col].apply(lambda s: tokenize_sentence(s, normalization='stem'))

In [12]:
train_df.head(3)

| | Question | Answer | Question_tokens | Answer_tokens |
|---|---|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous | [whom, virgin, mari, alleg, appear, 1858, lour... | [saint, bernadett, soubir] |
| 1 | What is in front of the Notre Dame Main Building? | a copper statue of Christ | [what, front, notr, dame, main, build] | [copper, statu, christ] |
| 2 | The Basilica of the Sacred heart at Notre Dame... | the Main Building | [basilica, sacr, heart, notr, dame, besid, whi... | [main, build] |


# NOTE:

I could remove the short sentences here, but since the vocabulary also needs a cleanup to get rid of rare words, it is probably better to remove those words first and then drop the rows that contain them, along with the very short and very long sequences.

### Single vocabulary for both questions and answers

In [13]:
commonVocab = Vocab()

In [14]:
for col in ['Question_tokens', 'Answer_tokens']:
    for idx, row in train_df.iterrows():
        commonVocab.add_sentence(row[col])
        

In [15]:
commonVocab.n_words


44534

### Remove the least common words from the vocabulary

In [16]:
# maximum number of occurrences for a word to still be considered an outlier
outlier_threshold = 3

In [17]:
vocab_outliers = get_outliers(commonVocab, outlier_threshold + 1)

In [18]:
vocab_outliers[:10]

['lourd',
 'grotto',
 'businessweek',
 'professorship',
 'gurian',
 'publican',
 'kellogg',
 'bout',
 'showdown',
 'anticathol']

In [19]:
len(vocab_outliers)

31068
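
`get_outliers` lives in `modules.data` and is not shown here, so the snippet below is only a guess at the kind of filter it applies: pick every word whose count does not exceed a limit. The name `find_rare_words` and the inclusive `<=` comparison are assumptions. The `outlier_threshold + 1` argument above suggests the real helper may use a strict `<` instead.

```python
from collections import Counter


def find_rare_words(word2count, max_count):
    """Return the words that occur at most `max_count` times (hypothetical helper)."""
    return [word for word, count in word2count.items() if count <= max_count]


# toy counts: main=3, build=2, dame=4, grotto=1
counts = Counter(
    ["main", "build", "main", "dame", "main", "build", "grotto", "dame", "dame", "dame"]
)
rare = find_rare_words(counts, max_count=2)
print(rare)  # ['build', 'grotto']
```

With a filter like this, the special tokens are never selected, because `Vocab` initialises their counts to 9999999.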

In [20]:
for word in vocab_outliers:
    commonVocab.remove_word(word)

In [21]:
commonVocab.n_words

13466

### Remove rows containing the words not present in the cleaned vocabulary

In [22]:
test_outlier = 'kellogg'

In [23]:
for idx, row in train_df.iterrows():
    for col in ['Question_tokens', 'Answer_tokens']:
        if test_outlier in row[col]:
            print(row[col], idx)

['kellogg', 'institut', 'intern', 'studi', 'part', 'which', 'univers'] 69
['kellogg', 'poptart'] 7711
['which', 'campus', 'hold', 'undergradu', 'school', 'graduat', 'school', 'kellogg', 'school', 'manag'] 39477


In [32]:
train_df['all_tokens'] = train_df['Question_tokens'] + train_df['Answer_tokens']

In [35]:
train_df['all_tokens'] = train_df['all_tokens'].apply(lambda x: list(set(x)))

In [36]:
train_df.head(3)

| | Question | Answer | Question_tokens | Answer_tokens | all_tokens |
|---|---|---|---|---|---|
| 0 | To whom did the Virgin Mary allegedly appear i... | Saint Bernadette Soubirous | [whom, virgin, mari, alleg, appear, 1858, lour... | [saint, bernadett, soubir] | [1858, mari, appear, alleg, soubir, saint, who... |
| 1 | What is in front of the Notre Dame Main Building? | a copper statue of Christ | [what, front, notr, dame, main, build] | [copper, statu, christ] | [build, front, statu, dame, main, copper, what... |
| 2 | The Basilica of the Sacred heart at Notre Dame... | the Main Building | [basilica, sacr, heart, notr, dame, besid, whi... | [main, build] | [build, heart, basilica, dame, which, main, sa... |


In [57]:
num_rows = train_df.shape[0]

outliers_set = set(vocab_outliers)

outlier_idxs = []

# scan row by row and intersect the row's 'all_tokens' with 'outliers_set';
# a non-empty intersection means the row contains an outlier and has to be removed

for idx, row in train_df.iterrows():
    intersection = outliers_set.intersection(row['all_tokens'])
    if len(intersection) > 0:
        #print(f'row {idx} contains an outlier: {intersection}')
        outlier_idxs.append(idx)

In [58]:
len(outlier_idxs)

32636
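
The row scan above only needs to know whether a row shares any token with the outlier set, not which tokens they are. A minimal sketch of that test on plain lists, using `set.isdisjoint` (the `has_no_outliers` name and the toy rows are made up for illustration):

```python
def has_no_outliers(tokens, outliers):
    """True when none of the row's tokens is in the outlier set."""
    return outliers.isdisjoint(tokens)


rows = [
    ["kellogg", "institut", "intern", "studi"],
    ["what", "front", "notr", "dame"],
    ["kellogg", "poptart"],
]
outliers = {"kellogg", "grotto"}

keep_mask = [has_no_outliers(tokens, outliers) for tokens in rows]
kept_rows = [tokens for tokens, keep in zip(rows, keep_mask) if keep]
print(keep_mask)  # [False, True, False]
```

In the notebook the same predicate could presumably be passed to `train_df['all_tokens'].apply(...)` to build a boolean mask instead of collecting indices in a loop; that variant has not been run against the real data.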

In [60]:
train_df.drop?

Signature:
train_df.drop(
    labels=None,
    axis: 'Axis' = 0,
    index=None,
    columns=None,
    level: 'Level | None' = None,
    inplace: 'bool' = False,
    errors: 'str' = 'raise',
)
Docstring:
Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding
axis, or by specifying directly inde

In [61]:
train_df.shape

(87599, 5)

In [62]:
train_df.drop(outlier_idxs).shape

(54963, 5)

In [42]:
outliers_set.intersection('witcher')

set()
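
One caveat about the check above: `set.intersection` accepts any iterable, and a bare string is iterated character by character. The empty result therefore only says that no single character of `'witcher'` is an outlier, not that the word itself is absent. A small demonstration on a made-up set:

```python
outliers = {"kellogg", "grotto", "k"}

# a string is iterated character by character, so only single characters can match
by_chars = outliers.intersection("kellogg")
# wrapping the word in a list tests the whole word
by_word = outliers.intersection(["kellogg"])
# a plain membership test is the simplest way to check one word
is_outlier = "kellogg" in outliers

print(by_chars, by_word, is_outlier)  # {'k'} {'kellogg'} True
```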

# NOTE:

I need a function that removes the outliers from the vocabulary. Once those are removed, I need another one that drops the dataframe rows containing any of the removed words.

In [34]:
type(commonVocab)

modules.vocab.Vocab

In [53]:
class Vocab:
    def __init__(self):        
        self.word2index = {"<PAD>":0, "<SOS>":1, "<EOS>":2, "<UNK>":3}        
        self.index2word = {0: "<PAD>", 1: "<SOS>", 2: "<EOS>", 3: "<UNK>"}
        # make sure that the special tokens don't get removed as too rare!
        self.word2count = {"<PAD>":9999999, "<SOS>":9999999, "<EOS>":9999999, "<UNK>":9999999, }
        self.n_words = len(self.word2index) # count PAD, SOS, EOS and UNK tokens
        
    def add_sentence(self, sentence):
        for word in sentence:
            self.add_word(word) # using lists of tokens so no need to split
            
    def add_word(self, word):
        if word not in self.word2index:
            self.word2index[word] = self.n_words
            self.word2count[word] = 1
            self.index2word[self.n_words] = word
            self.n_words += 1     
        else:
            self.word2count[word] += 1

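    def remove_word(self, word):
        # note: the remaining indices are not renumbered, so after a removal
        # they are no longer contiguous and a later add_word (which uses
        # n_words as the next index) can reuse an index that is still taken
        idx = self.word2index[word]

        self.word2index.pop(word)
        self.index2word.pop(idx)
        self.word2count.pop(word)
        self.n_words -= 1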


In [54]:
testVocab = Vocab()

In [55]:
testVocab

<__main__.Vocab at 0x7fc091839df0>

In [56]:
testVocab.word2count

{'<PAD>': 9999999, '<SOS>': 9999999, '<EOS>': 9999999, '<UNK>': 9999999}

In [57]:
testVocab.index2word

{0: '<PAD>', 1: '<SOS>', 2: '<EOS>', 3: '<UNK>'}

In [58]:
testVocab.add_word('witcher')

In [59]:
testVocab.index2word

{0: '<PAD>', 1: '<SOS>', 2: '<EOS>', 3: '<UNK>', 4: 'witcher'}

In [60]:
testVocab.word2count

{'<PAD>': 9999999,
 '<SOS>': 9999999,
 '<EOS>': 9999999,
 '<UNK>': 9999999,
 'witcher': 1}

In [61]:
testVocab.remove_word('witcher')

In [63]:
testVocab.word2count

{'<PAD>': 9999999, '<SOS>': 9999999, '<EOS>': 9999999, '<UNK>': 9999999}

In [64]:
testVocab.word2index

{'<PAD>': 0, '<SOS>': 1, '<EOS>': 2, '<UNK>': 3}

In [65]:
testVocab.n_words

4
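
The test above passes because the removed word was the last one added. In general, `remove_word` leaves gaps: the surviving words keep their old indices while `n_words` shrinks, so the largest index can exceed `n_words`, and the next `add_word` may hand out an index that is still in use. One possible way around this is to remove words in a batch and renumber afterwards. The class below is a sketch of that idea, not the project's `Vocab`: the name `CompactVocab`, the `remove_words` method, and the decision not to count special tokens are all assumptions.

```python
class CompactVocab:
    """Sketch of a vocabulary whose indices stay contiguous after removals."""

    SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]

    def __init__(self):
        self.word2count = {}  # counts for regular words only
        self.word2index = {}
        self.index2word = {}
        self._reindex()

    def add_word(self, word):
        if word in self.SPECIALS:
            return  # special tokens are always present and never counted
        if word not in self.word2index:
            index = len(self.word2index)  # next free index, since there are no gaps
            self.word2index[word] = index
            self.index2word[index] = word
        self.word2count[word] = self.word2count.get(word, 0) + 1

    def remove_words(self, words):
        """Drop several words at once, then renumber the remaining ones."""
        for word in words:
            self.word2count.pop(word, None)
        self._reindex()

    def _reindex(self):
        ordered = self.SPECIALS + list(self.word2count)
        self.word2index = {word: i for i, word in enumerate(ordered)}
        self.index2word = {i: word for i, word in enumerate(ordered)}

    @property
    def n_words(self):
        return len(self.word2index)


vocab = CompactVocab()
for token in ["a", "b", "c"]:
    vocab.add_word(token)
vocab.remove_words(["a"])
vocab.add_word("d")
print(vocab.word2index)
```

Renumbering once per batch keeps the cost linear in the vocabulary size, and every index stays below `n_words`, which matters if `n_words` is later used to size an embedding layer.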