Replacement of GLOVE embeddings with BERT embeddings.

We will be experimenting with a batch of 64 samples from SQuAD 2.0. 

In [78]:
# !pip install torch==1.6.0+cu101 torchvision==0.7.0+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# import torch
# print(torch.__version__)

In [79]:
import torch 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

Cloning Huggingface repo.
The Huggingface transformers repository has a significant number of pending issues, so we clone it and check out a specific tested commit for all our experiments.


In [80]:
!git clone https://github.com/huggingface/transformers \
&& cd transformers \
&& git checkout a3085020ed0d81d4903c50967687192e3101e770 

fatal: destination path 'transformers' already exists and is not an empty directory.


In [None]:
!pip install ./transformers
!pip install tensorboardX

In [82]:
PRE_TRAINED_MODEL_NAME = 'bert-base-cased'
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

import numpy as np
import torch 
from transformers import BertModel

The file 'word2idx.json' maps every word in the GLOVE vocabulary to its index.

In [83]:
import json
# Load the word -> GLOVE index mapping; the context manager closes the file.
with open('/content/word2idx.json') as f:
    data = json.load(f)

In [84]:
idx2word = {}
for key in data.keys():
    idx2word[data[key]] = key
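
The inversion above only works if every word maps to a distinct index. The following is a minimal sketch of the same step with that assumption checked explicitly. The toy `word2idx` dictionary and the helper name `invert_mapping` are made up for illustration and are not part of the real vocabulary file or the original notebook.

```python
# Toy stand-in for the contents of word2idx.json (illustrative values only).
word2idx = {"--NULL--": 0, "--OOV--": 1, "the": 2, "calligraphy": 3}

def invert_mapping(mapping):
    """Return an index -> word dictionary, failing loudly on duplicate indices."""
    inverted = {index: word for word, index in mapping.items()}
    if len(inverted) != len(mapping):
        raise ValueError("two words share the same index; mapping is not invertible")
    return inverted

idx2word_demo = invert_mapping(word2idx)
print(idx2word_demo[3])
```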

Tensor ops

These tensors are obtained from the pre-processing steps of the baseline BiDAF from Stanford.
They will now be transformed into equivalent BERT embeddings.

cw_idxs - pickle file that contains all the contexts tokenized using the GLOVE tokenizer.
qw_idxs - pickle file that contains all the questions tokenized using the GLOVE tokenizer.

y1 is a vector of answer start indices and y2 is a vector of answer end indices.

In [85]:
# Context file for one batch
import pickle
with open('cw_idxs.pickle', 'rb') as handle:
    cw_idxs = pickle.load(handle)

# Question file for one batch
with open('qw_idxs.pickle', 'rb') as handle:
    qw_idxs = pickle.load(handle)

# Answer starts
with open('y1.pickle', 'rb') as handle:
    y1 = pickle.load(handle)

# Answer ends
with open('y2.pickle', 'rb') as handle:
    y2 = pickle.load(handle)  

Shapes of the pre-processed tensors

In [86]:
print(cw_idxs.shape)
print(qw_idxs.shape)
print(y1.shape)
print(y2.shape)

torch.Size([64, 376])
torch.Size([64, 23])
torch.Size([64])
torch.Size([64])


Out of 64 samples, use a small batch of 16 samples for quick testing.

In [87]:
b = 16
cw_idxs = cw_idxs[:b]
qw_idxs = qw_idxs[:b]
y1 = y1[:b]
y2 = y2[:b]

print(cw_idxs.shape)
print(qw_idxs.shape)
print(y1.shape)
print(y2.shape)

torch.Size([16, 376])
torch.Size([16, 23])
torch.Size([16])
torch.Size([16])


The function 'swap_tokens' reverses the GLOVE tokenization to recover the original context text: each index is mapped back to its word, and the '--OOV--' and '--NULL--' placeholders are dropped. The recovered sentences are then tokenized with BERT's tokenizer, which is a WordPiece tokenizer. All token-id sequences are padded with zeros to the same length.
This function returns a tuple of 3 values. The first is the padded tensor of BERT token ids for the context sentences. The second is a binary mask that can be used to separate the padding from actual token information. The third is the list of word pieces for each sentence.



In [95]:
# NEW SWAP_TOKENS FUNCTION
def swap_tokens(cw_idxs):
    cw_idxs_words = []
    for c in cw_idxs:
        new_list = []
        for i in c:
            new_list.append(idx2word[i.item()])
        cw_idxs_words.append(new_list) 

    sentences = []
    for l in cw_idxs_words:
        sent = []
        for i in l:
            if i=='--OOV--' or i =='--NULL--':
                continue
            else:
                sent.append(i)

        sent = ' '.join(sent)   
        sentences.append(sent)

    sentences_tokenized = []

    bert_words = []
    for s in sentences:
        tt = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(s))
        bert_words.append(tokenizer.convert_ids_to_tokens(tt))
        tt = torch.Tensor(tt).type(torch.LongTensor)
        sentences_tokenized.append(tt)

    max_len = 0
    for s in sentences_tokenized:
        max_len = max(len(s),max_len)

    sentences_tokenized_tensors = [] 
    for s in sentences_tokenized:
        tt = torch.nn.ConstantPad1d((0, max_len - s.shape[0]), 0)(s)
        sentences_tokenized_tensors.append(tt)

    CT_new = torch.Tensor([])

    for l in sentences_tokenized_tensors:
        l = l.reshape((1,l.shape[0]))
        CT_new = torch.cat((CT_new, l), 0)   

    c_mask = torch.zeros_like(CT_new) != CT_new 

    return (CT_new, c_mask, bert_words)    
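
The padding and masking step at the end of 'swap_tokens' can be illustrated without any tensor library. This is a minimal sketch, not the notebook's implementation: the helper name `pad_and_mask` is hypothetical, and the mask here is derived from sequence lengths rather than by comparing ids against zero as the function above does.

```python
def pad_and_mask(sequences, pad_id=0):
    """Right-pad lists of token ids to a common length and build a boolean mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # True marks a real token, False marks padding.
    mask = [[i < len(seq) for i in range(max_len)] for seq in sequences]
    return padded, mask

ids = [[1130, 1103, 1975], [1650, 27241]]
padded, mask = pad_and_mask(ids)
print(padded)  # [[1130, 1103, 1975], [1650, 27241, 0]]
print(mask)    # [[True, True, True], [True, True, False]]
```

Building the mask from lengths avoids treating a genuine token whose id happens to equal the padding id as padding.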

A major problem that we faced during the implementation of this portion of the project was performing suitable tokenization pre-processing. GLOVE uses a word-based tokenizer to convert words to numbers, whereas BERT uses a WordPiece tokenizer. A WordPiece tokenizer assigns a token to each snippet of a word. For example, the word 'calligraphy' gets split into three tokens: 'call', '##ig', '##raphy'. A snippet with the '##' prefix gets attached to the snippet before it.
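
To make the '##' convention concrete, here is a small sketch that joins word pieces back into whole words. The helper `merge_wordpieces` is a hypothetical name introduced for illustration; it operates on plain token strings and does not call the BERT tokenizer.

```python
def merge_wordpieces(tokens):
    """Join WordPiece tokens back into whole words by attaching '##' pieces."""
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: strip the prefix and glue it to the previous word.
            words[-1] += token[2:]
        else:
            words.append(token)
    return words

print(merge_wordpieces(["painting", ",", "call", "##ig", "##raphy", ","]))
# ['painting', ',', 'calligraphy', ',']
```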

The function 'collect_hash_words' is used to collect the indices of the words that are split by the word piece tokenizer. The output is a list of lists. Each inner list contains the indices for words that were split in that sentence. 

In [96]:
def collect_hash_words(bert_words):
    import more_itertools as mit
    hash_words_list = []

    for sample in range(len(bert_words)):
        test_mask = []
        for i in range(len(bert_words[sample])):
            # Continuation pieces start with '##'; a literal '#' token is not one.
            if bert_words[sample][i].startswith('##'):
                test_mask.append(1)
            else:
                test_mask.append(0)

        ones = []
        for i in range(len(test_mask)):
            if test_mask[i]==1:
                ones.append(i)

        start_ones = []
        for i in ones:
            start_ones.append(i-1)
        full_ones = sorted(list(set(sorted(start_ones + ones))))

        ll = [list(group) for group in mit.consecutive_groups(full_ones)] 
        hash_words_list.append(ll)

    return hash_words_list
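
The same grouping can be sketched in a single pass without 'more_itertools'. This is an alternative under stated assumptions, not the notebook's method: the name `group_split_words` is hypothetical, and unlike grouping by runs of consecutive indices, this version keeps two split words apart when they are directly adjacent. The function above appears to merge such neighbours into one group.

```python
def group_split_words(tokens):
    """Return index groups for words split into several WordPiece tokens."""
    groups = []
    current = []
    for i, token in enumerate(tokens):
        if token.startswith("##") and i > 0:
            if not current:
                # The piece before the first '##' piece starts the word.
                current = [i - 1]
            current.append(i)
        else:
            if current:
                groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups

tokens = ["painting", ",", "call", "##ig", "##raphy", ",", "theater"]
print(group_split_words(tokens))  # [[2, 3, 4]]
```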

In [97]:
context, context_m, bert_words_C = swap_tokens(cw_idxs)
print(context.shape)
print(context_m.shape)
print(len(bert_words_C))
print(len(bert_words_C[0]))

torch.Size([16, 388])
torch.Size([16, 388])
16
388


In [18]:
print(bert_words_C[0])

['In', 'the', 'China', 'of', 'the', 'Yuan', ',', 'or', 'Mongol', 'era', ',', 'various', 'important', 'developments', 'in', 'the', 'arts', 'occurred', 'or', 'continued', 'in', 'their', 'development', ',', 'including', 'the', 'areas', 'of', 'painting', ',', 'mathematics', ',', 'call', '##ig', '##raphy', ',', 'poetry', ',', 'and', 'theater', ',', 'with', 'many', 'great', 'artists', 'and', 'writers', 'being', 'famous', 'today', '.', 'Due', 'to', 'the', 'coming', 'together', 'of', 'painting', ',', 'poetry', ',', 'and', 'call', '##ig', '##raphy', 'at', 'this', 'time', 'many', 'of', 'the', 'artists', 'practicing', 'these', 'different', 'pursuits', 'were', 'the', 'same', 'individuals', ',', 'though', 'perhaps', 'more', 'famed', 'for', 'one', 'area', 'of', 'their', 'achievements', 'than', 'others', '.', 'Often', 'in', 'terms', 'of', 'the', 'further', 'development', 'of', 'landscape', 'painting', 'as', 'well', 'as', 'the', 'classical', 'joining', 'together', 'of', 'the', 'arts', 'of', 'painting'

In [19]:
question, question_m, bert_words_Q = swap_tokens(qw_idxs)
print(question.shape)
print(question_m.shape)

torch.Size([16, 25])
torch.Size([16, 25])


In [20]:
hash_words_list_C =  collect_hash_words(bert_words_C)
hash_words_list_Q =  collect_hash_words(bert_words_Q)
print(len(hash_words_list_C))
print(len(hash_words_list_Q))

16
16


Identifying words with a hash-connection (Ex:'call', '##ig', '##raphy')

Each list in 'hash_words_list_C' holds the groups of token indices that need to be combined, such as the indices of 'call', '##ig', '##raphy'.
We take the average of the embeddings of these components ('call', '##ig', '##raphy') and use it as the final embedding for the word 'calligraphy'. Every context and question has a variable number of such words.

For the first of the 16 context paragraphs, the example below shows that there are 10 groups of tokens that were split by the WordPiece tokenizer.


In [21]:
hash_words_list_C[0]

[[32, 33, 34],
 [62, 63, 64],
 [120, 121, 122],
 [155, 156, 157],
 [162, 163, 164, 165],
 [182, 183],
 [239, 240],
 [265, 266],
 [288, 289, 290],
 [332, 333]]

In [22]:
context.type()

'torch.FloatTensor'

In [23]:
# Converting float tensor to long tensor
context = context.long()
question = question.long()

In [24]:
#  16 context paragraphs, each paragraph is of size 388. 
context.shape

torch.Size([16, 388])

In [25]:
# Output of BERT tokenization, for the first 20 tokens
context[0][:20]

tensor([ 1130,  1103,  1975,  1104,  1103, 13049,   117,  1137, 18739,  3386,
          117,  1672,  1696,  9093,  1107,  1103,  3959,  3296,  1137,  1598])

Now that we have our tokenized context we can use the BERT model to generate embeddings for all the context sentences.

In [26]:
# model_name_or_path = 'bert-base-uncased'
import torch
import torch.nn as nn
from transformers import BertModel
from tqdm import tqdm
import torch.nn.functional as F
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
hidden_size = 100

class Bert_Embeddings(nn.Module):

  def __init__(self, hidden_size):
    super(Bert_Embeddings, self).__init__()
    self.bert = BertModel.from_pretrained(PRE_TRAINED_MODEL_NAME)
  
  def forward(self, input_ids, attention_mask):
    last_hidden_state ,_ = self.bert(input_ids=input_ids,attention_mask=attention_mask)
    output = last_hidden_state
    return output

In [27]:
model = Bert_Embeddings(hidden_size)
with torch.no_grad():
    c_hs = model(input_ids=context[:b].reshape((b,context.shape[1])), 
                attention_mask=context_m[:b].reshape((b,context.shape[1])))

    q_hs = model(input_ids=question[:b].reshape((b,question.shape[1])), 
                attention_mask=question_m[:b].reshape((b,question.shape[1])))

In [28]:
# Embeddings of the first context paragraph: (tokens, hidden size)
c_hs[0].shape

torch.Size([388, 768])

In [None]:
q_hs.shape

Once the embeddings are generated, we use the output of the 'collect_hash_words' function to find which words have been split by BERT's tokenizer. For every word that is split, we compute the average of its word-snippet embeddings to form a single embedding for the word. This is carried out by the 'remove_hash' function.

In [29]:
def remove_hash(f, hash_words_list, hs):
    sub = []
    for l in hash_words_list[f]:
        arr = []
        for i in l:
            c = hs[f][i].detach().numpy()
            arr.append(c)

        arr = np.array(arr)
        arr = np.mean(arr, axis=0)
        sub.append((arr, l[0]))

    # sub --> [([],__),  ([],__),  ([],__)....]    

    #  Replace all means
    for s,i in sub:
        hs[f][i] = torch.Tensor(s)     

    # Remove unnecessary values
    remove = []
    for l in hash_words_list[f]:
        remove.append(l[1:])
    flat_list = [item for sublist in remove for item in sublist]  


    hs_new = torch.Tensor([])
    for i in range(len(hs[f])):
        if i in flat_list:
            continue
        else:    
            p = hs[f][i].reshape((1,-1))
            hs_new = torch.cat((hs_new, p), 0)

    return hs_new, flat_list        
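
The averaging logic can be checked on a tiny example with plain Python lists. This is a sketch only: `average_groups` is a hypothetical helper, the vectors are made-up two-dimensional stand-ins for the 768-dimensional BERT embeddings, and unlike 'remove_hash' it does not modify its input in place.

```python
def average_groups(vectors, groups):
    """Replace each group of rows by its element-wise mean, keeping row order."""
    first_to_group = {group[0]: group for group in groups}
    skipped = {i for group in groups for i in group[1:]}
    merged = []
    for i, vector in enumerate(vectors):
        if i in skipped:
            # Non-leading pieces of a split word are dropped.
            continue
        if i in first_to_group:
            rows = [vectors[j] for j in first_to_group[i]]
            vector = [sum(column) / len(rows) for column in zip(*rows)]
        merged.append(vector)
    return merged

vectors = [[1.0, 1.0], [2.0, 4.0], [4.0, 8.0], [5.0, 5.0]]
print(average_groups(vectors, [[1, 2]]))
# [[1.0, 1.0], [3.0, 6.0], [5.0, 5.0]]
```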

In [30]:
all_mods = []
all_falt_lists_C = []

for i in range(16):
    hs_new, flat_list = remove_hash(i, hash_words_list_C, c_hs)
    all_falt_lists_C.append(flat_list)
    all_mods.append(hs_new)

Once the embeddings for the word snippets are replaced by a single embedding, we need to pad the embeddings once again to get a tensor of uniform size.

In [31]:
for l in all_mods:
    print(l.shape)

torch.Size([371, 768])
torch.Size([361, 768])
torch.Size([379, 768])
torch.Size([372, 768])
torch.Size([381, 768])
torch.Size([365, 768])
torch.Size([381, 768])
torch.Size([381, 768])
torch.Size([379, 768])
torch.Size([384, 768])
torch.Size([355, 768])
torch.Size([352, 768])
torch.Size([383, 768])
torch.Size([379, 768])
torch.Size([381, 768])
torch.Size([368, 768])


In [32]:
max_len = 0
for a in all_mods:
    max_len = max(max_len, a.shape[0])
print(max_len)      

384


In [33]:
all_mods_tensors = [] 
for s in all_mods:
    tt = torch.transpose(torch.nn.ConstantPad2d((0, max_len - s.shape[0]), 0)(torch.transpose(s, 0, 1)), 0, 1)
    all_mods_tensors.append(tt)  
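
In the cell above, the two-element padding tuple applies to the last dimension, which is why each matrix is transposed before padding and transposed back afterwards. The effect is simply to append all-zero rows. A minimal sketch of that effect with nested lists, using a hypothetical helper `pad_rows` and made-up values:

```python
def pad_rows(matrix, target_len):
    """Append all-zero rows until the matrix has target_len rows."""
    width = len(matrix[0])
    missing = target_len - len(matrix)
    return matrix + [[0.0] * width for _ in range(missing)]

# Two "sentences" with 3 and 1 token embeddings of width 2.
batch = [[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [[7.0, 8.0]]]
target_len = max(len(m) for m in batch)
padded_batch = [pad_rows(m, target_len) for m in batch]
print([len(m) for m in padded_batch])  # [3, 3]
```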

Successful padding

In [34]:
for l in all_mods_tensors:
    print(l.shape)

torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])
torch.Size([384, 768])


In [35]:
rect_c_hs = torch.Tensor([])

for l in all_mods_tensors:
    l = l.reshape((1,l.shape[0], l.shape[1]))
    rect_c_hs = torch.cat((rect_c_hs, l), 0)

Final BERT embedding that is ready to be fed into the BiDAF network.


In [36]:
print(rect_c_hs.shape)

torch.Size([16, 384, 768])


Perform the same steps for the questions tensor.

In [37]:
# Similarly for questions
all_mods = []
all_falt_lists_Q = []

for i in range(16):
    hs_new, flat_list = remove_hash(i, hash_words_list_Q, q_hs)
    all_falt_lists_Q.append(flat_list)
    all_mods.append(hs_new)

max_len = 0
for a in all_mods:
    max_len = max(max_len, a.shape[0])   

all_mods_tensors = [] 
for s in all_mods:
    tt = torch.transpose(torch.nn.ConstantPad2d((0, max_len - s.shape[0]), 0)(torch.transpose(s, 0, 1)), 0, 1)
    all_mods_tensors.append(tt)      

rect_q_hs = torch.Tensor([])

for l in all_mods_tensors:
    l = l.reshape((1,l.shape[0], l.shape[1]))
    rect_q_hs = torch.cat((rect_q_hs, l), 0)    

print(rect_q_hs.shape)

torch.Size([16, 25, 768])


Our BiDAF model needs a mask that can be used to identify the actual embeddings and the padding embeddings.

In [38]:
context.shape

torch.Size([16, 388])

In [39]:
context_m.shape

torch.Size([16, 388])

In [40]:
context[1]

tensor([ 1650, 27241,  1103, 20164,  7222, 12512,  1116,  1127,  1113,  1103,
         5341,   117,  1105,  1103,  1433,  5672,  3666,  2997,   119,   138,
         1326,  1104,  1210,  1353,  2987,  8755,  1227,  1112,  1103, 20164,
         7222, 12512,  8833,  1116,  2795,  1149,   117,  2871,  1107, 10231,
         1699,   117,  1206, 19163,  1475,  1105, 19163,  1580,   119, 11733,
         1174,  1222,  4276,  3748,   119,  1109, 13034,  3296,   170,  4967,
         1378,  1103,  1473,  1104,  1985,  4191,   117,   170, 20164,  7222,
        12512,  1196, 17821,  1106, 17164,   117,  1150,  1125,  4921, 21516,
         1194,  1103,  5316, 17882,  1104, 20689,  3052,   119,  1230,  5714,
         2535, 16214,   117,  1223,  1103,  1231,  4915,  3457,  1104,  1117,
         2169,  2336,  1534,  4238,  1260,   112, 25650,   117,  1245,  1167,
         1154,  2879,  2861,  1104,  7999,  1863,   119,  1109, 20164,  7222,
        12512,  1116,  6297,  1118,  7046,  2457,  1741,  1105, 

In [41]:
context_m[1]

tensor([ True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True,  True,  True,
         True,  True,  True,  True,  True,  True,  True,  True, 

In [42]:
for l in all_falt_lists_C:
    print(l)

[33, 34, 63, 64, 121, 122, 156, 157, 163, 164, 165, 183, 240, 266, 289, 290, 333]
[4, 5, 6, 30, 31, 32, 33, 44, 47, 50, 69, 70, 83, 86, 96, 97, 111, 112, 115, 119, 120, 121, 141, 148, 151, 152, 153]
[19, 33, 36, 40, 46, 72, 122, 128, 131]
[21, 34, 35, 36, 56, 57, 66, 69, 70, 71, 75, 96, 97, 109, 110, 111]
[1, 2, 3, 53, 73, 112, 153]
[7, 8, 15, 16, 60, 68, 69, 77, 82, 89, 90, 91, 113, 129, 132, 141, 174, 175, 176, 184, 185, 189, 190]
[55, 66, 84, 85, 108, 109, 110]
[8, 18, 19, 30, 75, 80, 96]
[31, 32, 59, 96, 103, 104, 176, 209, 210]
[3, 18, 92, 101]
[5, 8, 14, 47, 58, 59, 66, 67, 81, 82, 88, 89, 107, 108, 111, 130, 139, 140, 141, 149, 152, 155, 156, 157, 182, 183, 198, 199, 200, 204, 224, 225, 226]
[10, 11, 12, 28, 29, 30, 37, 40, 41, 59, 63, 64, 65, 75, 76, 104, 105, 106, 107, 108, 118, 119, 139, 148, 149, 168, 169, 170, 177, 196, 197, 198, 202, 203, 214, 258]
[30, 36, 47, 80, 95]
[31, 32, 59, 96, 103, 104, 176, 209, 210]
[52, 53, 54, 55, 75, 107, 136]
[9, 10, 24, 29, 30, 40, 41, 50, 

In [42]:
context_m.shape

torch.Size([16, 388])

In [43]:
context_np = context.numpy()
context_np.shape

(16, 388)

In [56]:
all_mod_mask_C = []
for i in range(len(context_np)):
    arr = []
    for j in range(len(context_np[i])):
        if j in all_falt_lists_C[i]:
            continue
        else:
            arr.append(context_np[i][j])

    for z in range(len(all_falt_lists_C[i])):
        arr.append(0)
    all_mod_mask_C.append(arr)                
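
The loop above drops the token ids at the merged positions and appends one zero per dropped id, so each row keeps its original width. A compact sketch of the same idea for a single row follows; `realign_ids` is a hypothetical name and the ids are illustrative values, not real tokenizer output.

```python
def realign_ids(token_ids, removed_positions, pad_id=0):
    """Drop the ids at removed_positions and right-pad to the original length."""
    removed = set(removed_positions)
    kept = [tok for pos, tok in enumerate(token_ids) if pos not in removed]
    return kept + [pad_id] * (len(token_ids) - len(kept))

ids = [3504, 117, 1840, 6512, 10602, 117, 0, 0]
new_ids = realign_ids(ids, removed_positions=[3, 4])
new_mask = [tok != 0 for tok in new_ids]
print(new_ids)   # [3504, 117, 1840, 117, 0, 0, 0, 0]
print(new_mask)  # [True, True, True, True, False, False, False, False]
```

Because this pads back to the original width, the resulting rows here are 388 wide while 'rect_c_hs' is 384 wide. Slicing the mask to the embedding length would be one way to align the two; that is an assumption, not something the notebook does.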

In [59]:
for l in all_mod_mask_C:
    print(len(l))

[1130, 1103, 1975, 1104, 1103, 13049, 117, 1137, 18739, 3386, 117, 1672, 1696, 9093, 1107, 1103, 3959, 3296, 1137, 1598, 1107, 1147, 1718, 117, 1259, 1103, 1877, 1104, 3504, 117, 6686, 117, 1840, 117, 4678, 117, 1105, 5184, 117, 1114, 1242, 1632, 2719, 1105, 5094, 1217, 2505, 2052, 119, 4187, 1106, 1103, 1909, 1487, 1104, 3504, 117, 4678, 117, 1105, 1840, 1120, 1142, 1159, 1242, 1104, 1103, 2719, 13029, 1292, 1472, 27305, 1127, 1103, 1269, 2833, 117, 1463, 3229, 1167, 16916, 1111, 1141, 1298, 1104, 1147, 10227, 1190, 1639, 119, 12812, 1107, 2538, 1104, 1103, 1748, 1718, 1104, 5882, 3504, 1112, 1218, 1112, 1103, 4521, 4577, 1487, 1104, 1103, 3959, 1104, 3504, 117, 4678, 117, 1105, 1840, 117, 1103, 3765, 6107, 1105, 1103, 13049, 6107, 1132, 5128, 1487, 119, 1130, 1103, 1298, 1104, 1922, 3504, 1219, 1103, 13049, 6107, 1175, 1127, 1242, 2505, 15233, 119, 1130, 1103, 1298, 1104, 1840, 1242, 1104, 1103, 1632, 1840, 1127, 1121, 1103, 13049, 6107, 3386, 119, 1130, 13049, 4678, 117, 1103, 1514,

In [66]:
all_mod_mask_C_pt = []

for l in all_mod_mask_C:
    all_mod_mask_C_pt.append(torch.Tensor(l))

In [71]:
context_new = torch.Tensor([])

for l in all_mod_mask_C_pt:
    l = l.reshape((1,l.shape[0]))
    context_new = torch.cat((context_new, l), 0)

print(context_new.shape)

torch.Size([16, 388])


In [None]:
context_new[0]

In [72]:
c_mask_new = torch.zeros_like(context_new) != context_new 
print(c_mask_new.shape)

torch.Size([16, 388])


In [76]:
# Similarly generate a new mask for questions

question_np = question.numpy()
question_np.shape

all_mod_mask_Q = []
for i in range(len(question_np)):
    arr = []
    for j in range(len(question_np[i])):
        if j in all_falt_lists_Q[i]:
            continue
        else:
            arr.append(question_np[i][j])

    for z in range(len(all_falt_lists_Q[i])):
        arr.append(0)
    all_mod_mask_Q.append(arr) 


all_mod_mask_Q_pt = []
for l in all_mod_mask_Q:
    all_mod_mask_Q_pt.append(torch.Tensor(l))   


question_new = torch.Tensor([])
for l in all_mod_mask_Q_pt:
    l = l.reshape((1,l.shape[0]))
    question_new = torch.cat((question_new, l), 0)
print(question_new.shape)   

q_mask_new = torch.zeros_like(question_new) != question_new 
print(q_mask_new.shape)

torch.Size([16, 25])
torch.Size([16, 25])
