<a href="https://colab.research.google.com/github/krishnarevi/Capstone-Part-1/blob/main/Capstone_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, we show how we can take advantage of recent advances to train a long form question answering system which takes in a question, fetches relevant passages from a document corpus, and writes a multi-sentence answer based on the question and the retrieved passages. In particular, training embedding-based retrieval models to gather supporting evidence for open-domain questions is a relatively new research area: the last few months have seen significant progress in cases where direct supervision is available, or with extensive task-specific pretraining. Here, we show how our custom dataset allows us to train a dense retrieval system without access to either, making dense retrieval models more accessible.

## 1.a - Preliminaries
The implementation presented here relies on the Hugging Face 🤗 transformers and 🤗 nlp libraries. Wikipedia indexing relies on faiss for the dense version. You can get all of these by running:

<!-- pip install elasticsearch -->
pip install faiss_gpu
pip install nlp
pip install transformers
<!-- 
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.7.1-linux-x86_64.tar.gz
tar -xzvf elasticsearch-7.7.1-linux-x86_64.tar.gz -->

In [1]:
!pip install faiss_gpu nlp transformers

Collecting faiss_gpu
  Downloading faiss_gpu-1.7.1.post2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
Collecting nlp
  Downloading nlp-0.4.0-py3-none-any.whl (1.7 MB)
Collecting transformers
  Downloading transformers-4.10.0-py3-none-any.whl (2.8 MB)
Collecting xxhash
  Downloading xxhash-2.0.2-cp37-cp37m-manylinux2010_x86_64.whl (243 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-5.4.1-cp37-cp37m-manylinux1_x86_64.whl (636 kB)

In [2]:
import functools
import math
import os  # noqa: F401
from random import choice, randint
from time import time

import numpy as np
import torch
import torch.utils.checkpoint as checkpoint
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm import tqdm

import faiss  # noqa: F401
import nlp  # noqa: F401
import pandas as pd
from transformers import AdamW, AutoModel, AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
import json

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
import sys
sys.path.append('/content/drive/MyDrive/TSAI/Capstone_1')

In [5]:
# from lfqa_utils import *

In [6]:
import os
print(os.getcwd())
path = '/content/drive/MyDrive/TSAI/Capstone_1'

/content


In [7]:
folder = "retriever_models"
# os.chdir(path)
print("current dir is: %s" % (os.getcwd()))

if os.path.isdir(folder):
    print("Report directory exists")
else:
    print("Report directory Doesn't exists, creating one")
    os.mkdir(folder)

current dir is: /content
Report directory Doesn't exists, creating one


In [8]:
with open(path+'/train_data.json') as f:
        train = json.load(f)
with open(path+'/test_data.json') as f:
        test = json.load(f)
with open(path+'/final_data.json') as f:
        passage_snippets = json.load(f)

In [9]:
train[0]

{'id': 1,
 'x': 'What does delete_submodule do to the given submodule from self?',
 'y': 'Deletes',
 'z': 'This installs empty Modules where none exist yet if they are subpaths of target. target – The fully-qualified string name of the new submodule (See example in nn.Module.get_submodule for how to specify a fully-qualified string.) m – The submodule itself; the actual object we want to install in the current Module  this method to return True, each object in the chain denoted by target must either a) not exist yet, or b) reference an nn.Module (not a parameter or other attribute) bool Return the Python code generated from the Graph underlying this GraphModule. Deletes all unused submodules from self. A Module is considered “used” if any one of the following is true: 1. It has children that are used 2. Its forward is called directly via a call_module node 3. It has a non-Module attribute that is used from a get_attr node This method can be called to clean up an nn.Module witho

In [10]:
len(train)

25130

In [11]:
test[100]

{'id': 101,
 'x': 'What kind of function does nn.ReLU6 apply?',
 'y': 'element-wise',
 'z': 'nn.ReLU6 Applies the element-wise function:'}

In [12]:
len(test)

6283

### Retrieving Support Documents with an ELI5-Trained Dense Model

A sparse retriever works by finding passages that contain the words of the query. However, it has no way of knowing a priori which of these words matter most in context, and it tends to struggle with identifying the central theme of the query.

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as DPR or REALM, for example, learn to compute a vector representation of the query, as well as vector representations of Wikipedia passages, in such a way that the passages that best answer a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like FAISS.
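To make the retrieval step concrete, here is a minimal brute-force sketch of Maximum Inner Product Search in plain Python. The helper names (`dot`, `max_inner_product_search`) and the toy 3-dimensional vectors are made up for illustration; a library such as FAISS computes the same ranking far more efficiently.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def max_inner_product_search(
    query: Sequence[float], passages: List[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the k passages with the largest dot product."""
    scored = [(idx, dot(query, p)) for idx, p in enumerate(passages)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made-up values, for illustration only).
passage_vectors = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.0, 1.0, 1.0],
]
query_vector = [0.0, 2.0, 1.0]

top_hits = max_inner_product_search(query_vector, passage_vectors, k=2)
print(top_hits)  # [(2, 3.0), (1, 1.0)]
```

The brute-force version scores every passage for every query, which is exactly the cost that a dedicated index is meant to reduce.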

These successes are very encouraging for our Open-Domain Long Form QA application. However, our task and setup do not quite meet the requirements of either of these approaches. On the one hand, the DPR system is trained using gold passage annotations: most major QA datasets tell the system which Wikipedia passage contains the answer. Unfortunately, we do not have such annotations for the ELI5 data. On the other hand, while REALM is trained without passage supervision, it requires a fairly expensive pre-training step with an Inverse Cloze Task (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all Wikipedia passages regularly during training.

In order to train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to take advantage of another unique feature of our dataset, namely the fact that the long form answers are quite similar in style to the Wikipedia passages we want to index. Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on Wikipedia passages should allow us to similarly match questions to supporting evidence from Wikipedia.

### 4.a - Contrastive Training with ELI5 In-Batch Negatives
As mentioned above, we want to train a system to produce question and answer embeddings, such that the dot product between the representation of a question and any of its answers is greater than between it and answers of all of the other questions in the dataset.

Unfortunately, actually comparing all questions to all answers before taking every single gradient step is computationally prohibitive: instead, we follow previous work in simply processing medium to large batches of question-answer pairs, and making sure that the dot product of a question with its answer is larger than with all other answers in the batch, and vice versa.

We use a cross-entropy loss for the multinomial distribution over all of the answers (or questions) in a batch, and make use of PyTorch gradient checkpointing to be able to use large batches with limited GPU memory: you can find all implementation details in the RetrievalQAEmbedder class in eli5_utils.py.
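As a rough illustration of this objective, the sketch below computes the symmetric in-batch cross-entropy loss in plain Python: each question is scored against every answer in the batch, the matching answer sits on the diagonal, and the loss is averaged over the question-to-answer and answer-to-question directions. The function names and toy embeddings are hypothetical, and the sketch leaves out gradient checkpointing entirely.

```python
import math
from typing import List


def cross_entropy(logits: List[float], target: int) -> float:
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def in_batch_loss(q_reps: List[List[float]], a_reps: List[List[float]]) -> float:
    """Average of question->answer and answer->question cross-entropy over a batch."""
    n = len(q_reps)
    # scores[i][j] is the dot product between question i and answer j
    scores = [[sum(x * y for x, y in zip(q, a)) for a in a_reps] for q in q_reps]
    loss_qa = sum(cross_entropy(scores[i], i) for i in range(n)) / n
    loss_aq = sum(cross_entropy([scores[j][i] for j in range(n)], i) for i in range(n)) / n
    return (loss_qa + loss_aq) / 2


# Toy 2-dimensional embeddings (made-up values).
questions = [[4.0, 0.0], [0.0, 4.0]]
matched_answers = [[4.0, 0.0], [0.0, 4.0]]     # answer i aligns with question i
mismatched_answers = [[0.0, 4.0], [4.0, 0.0]]  # answers swapped

loss_matched = in_batch_loss(questions, matched_answers)
loss_mismatched = in_batch_loss(questions, mismatched_answers)
print(loss_matched < loss_mismatched)  # True
```

When all scores are equal, the loss reduces to the log of the batch size, which gives a useful reference point for the loss of an untrained model.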

We use a single BERT-style pre-trained model to embed the questions and answers, and learn different projection matrices to bring both representations down to dimension 128: the projection matrices are trained from scratch as the sentence embedding model is fine-tuned. We found that the 8-layer distilled version of BERT from the Well-Read Students Learn Better paper performed as well as or better than full BERT, with a notable gain in computation speed: if you want an even faster model, that work provides pre-trained models spanning the full range of computation/accuracy trade-offs.

The model can then be trained with the following code: with batch size 32/512 (a checkpoint batch size of 32 within a full batch of 512) on a single 16GB GPU, you can run 10 training epochs in under 6 hours.

In [13]:
###############
# retriever training
###############
class ELI5DatasetQARetriver(Dataset):
    def __init__(self, examples_array, num_rows, extra_answer_threshold=3, min_answer_length=2, training=True, n_samples=None):
        self.data = examples_array
        self.answer_thres = extra_answer_threshold
        self.min_length = min_answer_length
        self.training = training
        self.n_samples = num_rows if n_samples is None else n_samples
        self.num_rows = num_rows

    def __len__(self):
        return self.n_samples

    def make_example(self, idx):
        example = self.data[idx]
        question = example["x"]
        # if self.training:
        #     answers = [a for i, (a, sc) in enumerate(zip(example["answers"]["text"], example["answers"]["score"]))]
        #     answer_tab = choice(answers).split(" ")
        #     start_idx = randint(0, max(0, len(answer_tab) - self.min_length))
        #     answer_span = " ".join(answer_tab[start_idx:])
        # else:
            # answer_span = example["answers"]["text"][0]
        answer = example["y"]
        return (question, answer)

    def __getitem__(self, idx):
        return self.make_example(idx % self.num_rows)


class RetrievalQAEmbedder(torch.nn.Module):
    def __init__(self, sent_encoder, dim):
        super(RetrievalQAEmbedder, self).__init__()
        self.sent_encoder = sent_encoder
        self.output_dim = 128
        self.project_q = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.project_a = torch.nn.Linear(dim, self.output_dim, bias=False)
        self.ce_loss = torch.nn.CrossEntropyLoss(reduction="mean")

    def embed_sentences_checkpointed(self, input_ids, attention_mask, checkpoint_batch_size=-1):
        # reproduces BERT forward pass with checkpointing
        if checkpoint_batch_size < 0 or input_ids.shape[0] < checkpoint_batch_size:
            return self.sent_encoder(input_ids, attention_mask=attention_mask)[1]
        else:
            # prepare implicit variables
            device = input_ids.device
            input_shape = input_ids.size()
            token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)
            head_mask = [None] * self.sent_encoder.config.num_hidden_layers
            extended_attention_mask: torch.Tensor = self.sent_encoder.get_extended_attention_mask(
                attention_mask, input_shape, device
            )

            # define function for checkpointing
            def partial_encode(*inputs):
                encoder_outputs = self.sent_encoder.encoder(inputs[0], attention_mask=inputs[1], head_mask=head_mask,)
                sequence_output = encoder_outputs[0]
                pooled_output = self.sent_encoder.pooler(sequence_output)
                return pooled_output

            # run embedding layer on everything at once
            embedding_output = self.sent_encoder.embeddings(
                input_ids=input_ids, position_ids=None, token_type_ids=token_type_ids, inputs_embeds=None
            )
            # run encoding and pooling on one mini-batch at a time
            pooled_output_list = []
            for b in range(math.ceil(input_ids.shape[0] / checkpoint_batch_size)):
                b_embedding_output = embedding_output[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                b_attention_mask = extended_attention_mask[b * checkpoint_batch_size : (b + 1) * checkpoint_batch_size]
                pooled_output = checkpoint.checkpoint(partial_encode, b_embedding_output, b_attention_mask)
                pooled_output_list.append(pooled_output)
            return torch.cat(pooled_output_list, dim=0)

    def embed_questions(self, q_ids, q_mask, checkpoint_batch_size=-1):
        q_reps = self.embed_sentences_checkpointed(q_ids, q_mask, checkpoint_batch_size)
        return self.project_q(q_reps)

    def embed_answers(self, a_ids, a_mask, checkpoint_batch_size=-1):
        a_reps = self.embed_sentences_checkpointed(a_ids, a_mask, checkpoint_batch_size)
        return self.project_a(a_reps)

    def forward(self, q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=-1):
        device = q_ids.device
        q_reps = self.embed_questions(q_ids, q_mask, checkpoint_batch_size)
        a_reps = self.embed_answers(a_ids, a_mask, checkpoint_batch_size)
        compare_scores = torch.mm(q_reps, a_reps.t())
        loss_qa = self.ce_loss(compare_scores, torch.arange(compare_scores.shape[1]).to(device))
        loss_aq = self.ce_loss(compare_scores.t(), torch.arange(compare_scores.shape[0]).to(device))
        loss = (loss_qa + loss_aq) / 2
        return loss


def make_qa_retriever_model(model_name="google/bert_uncased_L-8_H-512_A-8", from_file=None, device="cuda"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    bert_model = AutoModel.from_pretrained(model_name).to(device)
    # run bert_model on a dummy batch to get output dimension
    d_ids = torch.LongTensor(
        [[bert_model.config.bos_token_id if bert_model.config.bos_token_id is not None else 1]]
    ).to(device)
    d_mask = torch.LongTensor([[1]]).to(device)
    sent_dim = bert_model(d_ids, attention_mask=d_mask)[1].shape[-1]
    qa_embedder = RetrievalQAEmbedder(bert_model, sent_dim).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        qa_embedder.load_state_dict(param_dict["model"])
    return tokenizer, qa_embedder


def make_qa_retriever_batch(qa_list, tokenizer, max_len=128, device="cuda"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]

    # tokenize questions, padding and truncating every sequence to max_len tokens
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )

    # tokenize answers the same way
    a_toks = tokenizer.batch_encode_plus(a_ls, max_length=max_len, padding="max_length", truncation=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )

    return (q_ids, q_mask, a_ids, a_mask)


def train_qa_retriever_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0):
    model.train()
    # make iterator
    train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # print(next(iter(data_loader)).shape)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch in enumerate(epoch_iterator):
        # print("q_ids",q_ids.shape)
        # print(" q_mask,", q_mask.shape)
        # print("A_id", a_ids.shape)
        q_ids, q_mask, a_ids, a_mask = batch
        pre_loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
        loss = pre_loss.sum()
        # optimizer
        loss.backward()
        optimizer.step()
        scheduler.step()
        model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def train_qa_retriever_joint_epoch(model, dataset_list, tokenizer, optimizer, scheduler, args, e=0):
    model.train()
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    # make iterator
    train_samplers = [RandomSampler(dataset) for dataset in dataset_list]
    data_loaders = [
        DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
        for dataset, train_sampler in zip(dataset_list, train_samplers)
    ]
    iterators = [iter(dloader) for dloader in data_loaders]
    joint_iter = zip(*iterators)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, (batches,) in enumerate(zip(joint_iter)):
        for batch in batches:
            q_ids, q_mask, a_ids, a_mask = batch

            loss = model(q_ids, q_mask, a_ids, a_mask, checkpoint_batch_size=args.checkpoint_batch_size)
            # optimizer
            loss.backward()
            optimizer.step()
            scheduler.step()
            model.zero_grad()
            # some printing within the epoch
            loc_loss += loss.item()
            loc_steps += 1
        if step % args.print_freq == 0:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset_list[0]) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def evaluate_qa_retriever(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_retriever_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    tot_loss = 0.0
    with torch.no_grad():
        for step, batch in enumerate(epoch_iterator):
            q_ids, q_mask, a_ids, a_mask = batch
            loss = model(q_ids, q_mask, a_ids, a_mask)
            tot_loss += loss.item()
        return tot_loss / (step + 1)


def train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args):
    qar_optimizer = AdamW(qar_model.parameters(), lr=qar_args.learning_rate, eps=1e-8)
    qar_scheduler = get_linear_schedule_with_warmup(
        qar_optimizer,
        num_warmup_steps=100,
        num_training_steps=(qar_args.num_epochs + 1) * math.ceil(len(qar_train_dset) / qar_args.batch_size),
    )
    for e in range(qar_args.num_epochs):
        train_qa_retriever_epoch(qar_model, qar_train_dset, qar_tokenizer, qar_optimizer, qar_scheduler, qar_args, e)
        m_save_dict = {
            "model": qar_model.state_dict(),
            "optimizer": qar_optimizer.state_dict(),
            "scheduler": qar_scheduler.state_dict(),
        }
        print("Saving model {}".format(qar_args.model_save_name))
        torch.save(m_save_dict, "{}_{}.pth".format(qar_args.model_save_name, e))
        eval_loss = evaluate_qa_retriever(qar_model, qar_valid_dset, qar_tokenizer, qar_args)
        print("Evaluation loss epoch {:4d}: {:.3f}".format(e, eval_loss))

In [14]:
# training arguments
class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "/content/retriever_models/eli5_retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs = 1

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = ELI5DatasetQARetriver(train,num_rows=len(train), training=True)
qar_valid_dset = ELI5DatasetQARetriver(test,num_rows=len(test), training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        device="cuda"
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

Downloading:   0%|          | 0.00/384 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/327M [00:00<?, ?B/s]

Some weights of the model checkpoint at google/bert_uncased_L-8_H-768_A-12 were not used when initializing BertModel: ['cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.decoder.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncat

 0     0 of    49 	 L: 6.455 	 -- 17.069
 0     1 of    49 	 L: 6.449 	 -- 34.344
Saving model /content/retriever_models/eli5_retriever_model_l-8_h-768_b-512-512
Evaluation loss epoch    0: 5.232
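
To make the numbers in this log concrete, here is a small sketch of the batching and learning-rate arithmetic for the run. The sizes come from the cells above (25130 training examples, batch size 512, checkpoint batch size 32, 1 epoch, learning rate 2e-4, 100 warm-up steps). `linear_warmup_lr` is a hypothetical helper that mirrors our understanding of a linear warm-up followed by linear decay; it is not the library's own code.

```python
import math


def linear_warmup_lr(step, base_lr, num_warmup_steps, num_training_steps):
    """Linear warm-up from 0 to base_lr, then linear decay to 0 (hypothetical helper)."""
    if step < num_warmup_steps:
        return base_lr * step / max(1, num_warmup_steps)
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / max(1, num_training_steps - num_warmup_steps)


n_train, batch_size, checkpoint_batch_size = 25130, 512, 32
num_epochs, base_lr, num_warmup_steps = 1, 2e-4, 100

# optimizer steps actually run per epoch (the last batch is smaller)
steps_per_epoch = math.ceil(n_train / batch_size)
# the training loop prints len(dataset) // batch_size, hence "of 49" in the log
steps_shown_in_log = n_train // batch_size
last_batch_size = n_train - (steps_per_epoch - 1) * batch_size
# each full batch is pushed through the encoder in this many checkpointed chunks
chunks_per_full_batch = math.ceil(batch_size / checkpoint_batch_size)
# total number of steps handed to the scheduler in train_qa_retriever
scheduler_total_steps = (num_epochs + 1) * steps_per_epoch
lr_after_training = linear_warmup_lr(
    num_epochs * steps_per_epoch, base_lr, num_warmup_steps, scheduler_total_steps
)

print(steps_per_epoch, steps_shown_in_log, last_batch_size, chunks_per_full_batch)
print(scheduler_total_steps, lr_after_training)
```

Under these assumptions the single epoch ends while the schedule is still warming up, so the learning rate only reaches about half of the configured 2e-4. Training for more epochs or lowering `num_warmup_steps` would change that.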


Once the model is trained, it can be used to compute passage embeddings for the entire document corpus. The make_qa_dense_index function takes advantage of numpy memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about 18 hours.

In [15]:
type(passage_snippets[0 * 512 : (0 + 1) * 512][0]['z'])

str

In [16]:
# passages = [p for p in passage_snippets[0 * 512 : (0 + 1) * 512]["z"]]
# passages

In [17]:
def embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length=128, device="cuda"):
    a_toks = tokenizer.batch_encode_plus(passages, max_length=max_length, padding="max_length", truncation=True)
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        a_reps = qa_embedder.embed_answers(a_ids, a_mask).cpu().type(torch.float)
    return a_reps.numpy()

In [18]:

def embed_questions_for_retrieval(q_ls, tokenizer, qa_embedder, device="cuda"):
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=128, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    with torch.no_grad():
        q_reps = qa_embedder.embed_questions(q_ids, q_mask).cpu().type(torch.float)
    return q_reps.numpy()

In [19]:
def make_qa_dense_index(
    qa_embedder,
    tokenizer,
    passages_dset,
    batch_size=512,
    max_length=128,
    index_name="kilt_passages_reps.dat",
    dtype="float32",
    device="cuda",
):
    st_time = time()
    # memory-mapped array on disk: one 128-dimensional row per passage
    fp = np.memmap(index_name, dtype=dtype, mode="w+", shape=(len(passages_dset), 128))
    n_batches = math.ceil(len(passages_dset) / batch_size)
    for i in range(n_batches):
        passages = [p["z"] for p in passages_dset[i * batch_size : (i + 1) * batch_size]]
        reps = embed_passages_for_retrieval(passages, tokenizer, qa_embedder, max_length, device)
        fp[i * batch_size : (i + 1) * batch_size] = reps
        if i % 50 == 0:
            print(i, time() - st_time)
    # make sure all embeddings are written to disk before the file is re-opened for reading
    fp.flush()


In [20]:
if not os.path.isfile('wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'):

  make_qa_dense_index(
          qar_model, qar_tokenizer, passage_snippets, device='cuda',
          index_name='wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat' )



0 3.3619418144226074
50 169.1546573638916


### 4.b - Using the Trained Dense Retriever and Wikipedia Index
Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our Wikipedia snippets, let's see whether it can actually find supporting evidence for a new question. Recall the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search with the pre-computed passage representations.

The MIPS part can be executed efficiently with the faiss library. Additionally, since we computed 128-dimensional passage embeddings, the whole of the representations fits on a GPU, making retrieval even faster. We can create the faiss_gpu index with the following code:

In [21]:
faiss_res = faiss.StandardGpuResources()
wiki40b_passage_reps = np.memmap(
            'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r',
            # shape=(wiki40b_snippets.num_rows, 128)
            # wiki40b_snippets.num_rows = 11378343,english sections from wiki40B dataset
            shape=(len(passage_snippets), 128)
)

wiki40b_index_flat = faiss.IndexFlatIP(128)
wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, wiki40b_index_flat)
wiki40b_gpu_index.add(wiki40b_passage_reps)
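
As a back-of-the-envelope check on why 128-dimensional embeddings fit on a GPU, the sketch below estimates the raw size of a flat float32 index. The row count is the wiki40b figure quoted in the code comment above (11378343); `index_size_bytes` is a hypothetical helper, and the estimate ignores any overhead the index library adds on top of the raw vectors.

```python
def index_size_bytes(num_passages: int, dim: int = 128, bytes_per_value: int = 4) -> int:
    """Raw storage for num_passages float32 vectors of the given dimension."""
    return num_passages * dim * bytes_per_value


wiki40b_rows = 11_378_343  # figure quoted in the code comment above
size_bytes = index_size_bytes(wiki40b_rows)
size_gib = size_bytes / 2**30

print(size_bytes)          # 5825711616
print(round(size_gib, 2))  # 5.43
```

Each passage costs 512 bytes at this dimension, so the same helper can be applied to `len(passage_snippets)` to estimate the footprint of a custom corpus.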

In [22]:

# build a support document for the question out of Wikipedia snippets
def query_qa_dense_index(
    question, qa_embedder, tokenizer, wiki_passages, wiki_index, n_results=10, min_length=2, device="cuda"
):
    # embed the question, then retrieve twice as many candidates as needed
    # so that very short passages can be filtered out afterwards
    q_rep = embed_questions_for_retrieval([question], tokenizer, qa_embedder, device=device)
    D, I = wiki_index.search(q_rep, 2 * n_results)
    res_passages = [wiki_passages[int(i)] for i in I[0]]
    support_doc = "<P> " + " <P> ".join([p["z"] for p in res_passages])
    # attach each score to its passage before filtering so the two stay aligned
    res_list = [{"id": p["id"], "z": p["z"], "score": float(sc)} for p, sc in zip(res_passages, D[0])]
    res_list = [res for res in res_list if len(res["z"].split()) > min_length][:n_results]
    return support_doc, res_list
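
The over-fetch-then-filter pattern can be illustrated on its own with toy data: retrieve more candidates than needed, drop passages that are too short, and keep the first `n_results`. The sketch below assumes each hit already carries its own score, which keeps scores and passages paired when short passages are removed. The `filter_hits` helper and the candidate passages are hypothetical.

```python
from typing import Dict, List


def filter_hits(hits: List[Dict], n_results: int, min_length: int) -> List[Dict]:
    """Keep the first n_results hits whose text has more than min_length words."""
    kept = [h for h in hits if len(h["z"].split()) > min_length]
    return kept[:n_results]


# Hypothetical candidates, sorted by decreasing score as a search would return them.
candidates = [
    {"id": 7, "z": "ok", "score": 9.1},
    {"id": 3, "z": "torch.cat joins tensors along a dimension", "score": 8.4},
    {"id": 5, "z": "use unsqueeze to add an axis", "score": 7.9},
    {"id": 2, "z": "broadcasting aligns trailing dimensions first", "score": 7.5},
]

top = filter_hits(candidates, n_results=2, min_length=2)
print([(h["id"], h["score"]) for h in top])  # [(3, 8.4), (5, 7.9)]
```

The one-word passage with the highest score is dropped, and the remaining hits keep the scores they were retrieved with.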

Now we can use the query_qa_dense_index function to query the dense index for our running example question:

In [27]:
question = test[6]['x']
question

'Bubble up non null rows of a 3-D Tensor PyTorch'

In [28]:
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, passage_snippets, wiki40b_gpu_index, device='cuda')
print(res_list)
df = pd.DataFrame({
    
    'Text': ['--- ' + question] + [res['z'] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})

[{'id': 4562, 'z': '<p>I consider this an ugly solution, but perhaps this is what you want to achieve:</p>\n\n<pre><code>a = torch.Tensor([[1,2],[3,4]])\nb = torch.ones((2,2,2))\n\nA = torch.mm(a[0].view(-1, 2), b[0])\nB = torch.mm(a[1].view(-1, 2), b[1])\nres = torch.cat([A, B], dim=0)\nprint(res)\n</code></pre>\n\n<p><strong>output:</strong></p>\n\n<pre><code>tensor([[3., 3.],\n        [7., 7.]])\n</code></pre>\n', 'score': 6.735563278198242}, {'id': 473, 'z': '<p>You can try this one.</p>\n<pre class="lang-py prettyprint-override"><code>import torch\nc, m, n = input_size[0], input_size[1], input_size[2]\n\nt = torch.zeros(c, m, n, c, m, n)\ni, j, k = torch.meshgrid(torch.arange(c), torch.arange(m), torch.arange(n))\ni = i.flatten()\nj = j.flatten()\nk = k.flatten()\n\nt[i, j, k, i, j, k] = 1\n</code></pre>\n<p>Here is how <a href="https://pytorch.org/docs/master/generated/torch.meshgrid.html" rel="nofollow noreferrer">meshgrid works in case you need reference.</p>\n', 'score': 6.651



Unnamed: 0,Text
0,--- Bubble up non null rows of a 3-D Tensor PyTorch
1,"I consider this an ugly solution, but perhaps this is what you want to achieve: a = torch.Tensor([[1,2],[3,4]]) b = torch.ones((2,2,2)) A = torch.mm(a[0].view(-1, 2), b[0]) B = torch.mm(a[1].view(-1, 2), b[1]) res = torch.cat([A, B], dim=0) print(res) output: tensor([[3., 3.],  [7., 7.]])"
2,"You can try this one. import torch c, m, n = input_size[0], input_size[1], input_size[2] t = torch.zeros(c, m, n, c, m, n) i, j, k = torch.meshgrid(torch.arange(c), torch.arange(m), torch.arange(n)) i = i.flatten() j = j.flatten() k = k.flatten() t[i, j, k, i, j, k] = 1 Here is how meshgrid works in case you need reference."
3,"You have to slightly modify tensor b: a = torch.tensor([[1, 2, 3],  [1, 2, 3],  [1, 2, 3],  [1, 2, 3]]) b = torch.tensor([4,4,4,4]) b = b.reshape(1, 4) Then you get your ""joined"" tensor: c = torch.cat((a, torch.t(b)), 1) And backward: a1 = c[:,:-1] b1 = torch.t(c[:,-1:])"
4,"You can do something like this exploiting broadcasting: import torch a = torch.tensor([[1, 0], [0, 1], [2, 0], [3, 2]]) b = torch.tensor([[0, 1], [2, 0]]) indices = ((a == b[:, None]).sum(axis = 2) != a.shape[1]).all(axis = 0) print(indices) print(a[indices]) indices = tensor([ True, False, False, True]) a[indices] = tensor([[1, 0],  [3, 2]]) Works for all tensors a and b of shapes m x n and p x n respectively i.e, the number of columns (a.shape[1]) must be same and you can compare among any no. of rows."
5,"The most efficient way of doing this would be to add an extra second dimension to b and use broadcasting to add: a = torch.Tensor([[[1,2],[1,2],[1,2]],[[3,4],[3,4],[3,4]]]) b = torch.Tensor([[1,2],[3,4]]) a += b.unsqueeze(1)"
6,"Assuming that the first dimension of a is 1 as in your example, you could do the following: a = torch.Tensor([[1, 2, 3, 4]]) b_abbreviated = torch.Tensor([[1, 2, 3], [4, 5, 6]]) torch.mm(a.reshape(-1, 2), b_abbreviated).sum(axis=0, keepdim=True) Here, instead of repeating the rows, you multiply a in chunks, then add them up column-wise to get the same result. If the first dimension of a is not necessarily 1, you could try the following: torch.cat(torch.split(torch.mm(a.reshape(-1,2),b_abbreviated), a.shape[0]), dim=1).sum( dim=0, keepdim=True).reshape(a.shape[0], -1) Here, you do the following: With torch.mm(a.reshape(-1,2),b_abbreviated, you again split each row of a into chunks of size 2 and stack them one over the other, and then stack each row over the other. With torch.split(torch.mm(a.reshape(-1,2),b_abbreviated), a.shape[0]), these stacks are then separated row-wise, so that each resultant component of the split corresponds to chunks of a single row. With torch.cat(torch.split(torch.mm(a.reshape(-1,2),b_abbreviated), a.shape[0]), dim=1) these stacks are then concatenated column-wise. With .sum(dim=0, keepdim=True), results corresponding to separate chunks of individual rows in a are added up. With .reshape(a.shape[0], -1), rows of a that were concatenated column-wise are again stacked row-wise. It seems quite slow compared to direct matrix multiplication, which is not surprising, but I have not yet checked in comparison to explicit iteration. There are likely better ways of doing this, will edit if I think of any."
7,"Suppose we want to define a sparse tensor with the entry 3 at location (0, 2), entry 4 at location (1, 0), and entry 5 at location (1, 2). Unspecified elements are assumed to have the same value, fill value, which is zero by default. We would then write:Note that the input i is NOT a list of index tuples. If you want to write your indices this way, you should transpose before passing them to the sparse constructor: >>> i = [[0, 2], [1, 0], [1, 2]] >>> v = [3, 4, 5 ] >>> s = torch.sparse_coo_tensor(list(zip(*i)), v, (2, 3)) >>> # Or another equivalent formulation to get s >>> s = torch.sparse_coo_tensor(torch.tensor(i).t(), v, (2, 3)) >>> torch.sparse_coo_tensor(i.t(), v, torch.Size([2,3])).to_dense() tensor([[0, 0, 3],  [4, 0, 5]])"
8,"You can use vector indexing to extract the diagonal, process it, and then put it back into your original matrix: N=10 a = torch.randint(0,N,[N,N]) #tensor([[0, 9, 6, 6, 9, 9, 3, 1, 8, 4], # [8, 1, 6, 8, 5, 8, 7, 8, 1, 4], # [1, 9, 8, 4, 7, 0, 2, 9, 6, 2], # [9, 5, 9, 6, 7, 1, 4, 0, 2, 6], # [1, 2, 8, 0, 9, 0, 4, 3, 9, 9], # [1, 4, 6, 9, 6, 5, 1, 2, 0, 7], # [4, 8, 1, 3, 1, 6, 1, 3, 5, 6], # [3, 8, 9, 9, 1, 3, 0, 9, 6, 6], # [7, 4, 3, 0, 3, 5, 6, 6, 9, 2], # [3, 1, 0, 8, 3, 5, 6, 6, 5, 5]]) diag = a[range(N),range(N)] #index (1,1), (2,2), ... etc diag[diag==0] = 1 # set according to your condition a[range(N),range(N)] = diag #return the diagonal to its place #tensor([[1, 9, 6, 6, 9, 9, 3, 1, 8, 4], # [8, 1, 6, 8, 5, 8, 7, 8, 1, 4], # [1, 9, 8, 4, 7, 0, 2, 9, 6, 2], # [9, 5, 9, 6, 7, 1, 4, 0, 2, 6], # [1, 2, 8, 0, 9, 0, 4, 3, 9, 9], # [1, 4, 6, 9, 6, 5, 1, 2, 0, 7], # [4, 8, 1, 3, 1, 6, 1, 3, 5, 6], # [3, 8, 9, 9, 1, 3, 0, 9, 6, 6], # [7, 4, 3, 0, 3, 5, 6, 6, 9, 2], # [3, 1, 0, 8, 3, 5, 6, 6, 5, 5]])"
9,"You can just iterate over all tensors and for each of them calculate distance to four corners and take the corner with minimum distance. import torch a = torch.tensor([[3,2,2,3], [1,1,2,2]]) c = torch.zeros(a.shape[0], 2) for idx, x in enumerate(a):  d1 = x[0] ** 2 + x[1] ** 2  d2 = x[2] ** 2 + x[3] ** 2  d3 = x[0] ** 2 + x[3] ** 2  d4 = x[2] ** 2 + x[1] ** 2  dmin = min(d1, d2, d3, d4)  if d1 == dmin:  c[idx] = torch.tensor([x[0], x[1]])  elif d2 == dmin:  c[idx] = torch.tensor([x[2], x[3]])  elif d3 == dmin:  c[idx] = torch.tensor([x[0], x[3]])  elif d4 == dmin:  c[idx] = torch.tensor([x[2], x[1]]) print(c) # tensor([[2., 2.], [1., 1.]])"


### 4.c - Retriever Model Evaluation
We have trained a retrieval model that seems to work a little better than the traditional word-matching approach, at least on our running example. Before we use it to actually answer questions, however, we would like a quantitative evaluation of the performance of both approaches.

For the retriever, we want to favor recall over precision: our first priority is to make sure that all of the information needed to write the answers is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the high-scoring answers which are present in the retrieved support document. To focus on important words, we also weight answer words by their Inverse Document Frequency. This gives us an IDF-recall score.

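A minimal sketch of an IDF-weighted recall score as described above. It is not the notebook's original implementation: the function names `build_idf` and `idf_recall`, the whitespace tokenization, and the `log(N / df)` form of IDF are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_idf(documents):
    """Map each word to log(N / document_frequency) over a list of texts."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        # count each word once per document (assumes whitespace tokenization)
        doc_freq.update(set(doc.lower().split()))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def idf_recall(answer, support_doc, idf, default_idf=0.0):
    """IDF-weighted share of the answer's words that appear in the support document."""
    answer_words = set(answer.lower().split())
    doc_words = set(support_doc.lower().split())
    total = sum(idf.get(w, default_idf) for w in answer_words)
    if total == 0:
        return 0.0
    found = sum(idf.get(w, default_idf) for w in answer_words if w in doc_words)
    return found / total


# toy corpus, made up for illustration
corpus = ["the cat sat", "the dog ran", "the bird flew"]
idf = build_idf(corpus)
score = idf_recall("the cat ran", "the cat sat", idf)
print(score)  # 0.5: "cat" is found, "ran" is missing, "the" carries zero weight
```

Words that occur in every document (such as "the" here) get an IDF of zero, so they do not affect the score in either direction.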

## 5. Generating Answers with a Sequence-to-Sequence Model

In [29]:
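# ELI5 seq2seq model training
###############
# imports used by this cell (harmless if already imported earlier in the notebook)
import functools
import math
from time import time

import torch
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler
from tqdm import tqdm
from transformers import AdamW, AutoModelForSeq2SeqLM, AutoTokenizer, get_linear_schedule_with_warmup
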
class ELI5DatasetS2S(Dataset):
    def __init__(
        self, examples_array,num_rows, make_doc_fun=None, extra_answer_threshold=3, document_cache=None, training=True
    ):
        self.training = training
        self.data = examples_array
        self.make_doc_function = make_doc_fun
        self.document_cache = {} if document_cache is None else document_cache
        self.num_rows = num_rows
        assert not (make_doc_fun is None and document_cache is None)
        # make index of specific question-answer pairs from multi-answers
        if self.training:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

            #  [
            #     (i, j)
            #     for i, qa in enumerate(self.data)
            #     for j, a in enumerate(qa["y"]))
            #     if j == 0 
            # ]
        else:
            self.qa_id_list = [(i, 0) for i in range(self.num_rows)]

    def __len__(self):
        return len(self.qa_id_list)

    def make_example(self, idx):
        i, j = self.qa_id_list[idx]
        example = self.data[i]
        question = example["x"] 
        answer = example["y"]
        q_id = example["q_id"]
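        if self.make_doc_function is not None and q_id not in self.document_cache:
            # build the support document only on a cache miss; dict.get(key, default)
            # would evaluate make_doc_function on every call, even for cached ids
            self.document_cache[q_id] = self.make_doc_function(example["title"])
        document = self.document_cache[q_id]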
        in_st = "question: {} context: {}".format(
            question.lower().strip(), document.lower().strip(),
        )
        out_st = answer
        return (in_st, out_st)

    def __getitem__(self, idx):
        return self.make_example(idx)


def make_qa_s2s_model(model_name="facebook/bart-large", from_file=None, device="cuda:0"):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    if from_file is not None:
        param_dict = torch.load(from_file)  # has model weights, optimizer, and scheduler states
        model.load_state_dict(param_dict["model"])
    return tokenizer, model


def make_qa_s2s_batch(qa_list, tokenizer, max_len=64, max_a_len=360, device="cuda:0"):
    q_ls = [q for q, a in qa_list]
    a_ls = [a for q, a in qa_list]
    # pad_to_max_length is deprecated; padding="max_length" with explicit truncation is the current form
    q_toks = tokenizer.batch_encode_plus(q_ls, max_length=max_len, padding="max_length", truncation=True)
    q_ids, q_mask = (
        torch.LongTensor(q_toks["input_ids"]).to(device),
        torch.LongTensor(q_toks["attention_mask"]).to(device),
    )
    a_toks = tokenizer.batch_encode_plus(
        a_ls, max_length=min(max_len, max_a_len), padding="max_length", truncation=True
    )
    a_ids, a_mask = (
        torch.LongTensor(a_toks["input_ids"]).to(device),
        torch.LongTensor(a_toks["attention_mask"]).to(device),
    )
    lm_labels = a_ids[:, 1:].contiguous().clone()
    lm_labels[a_mask[:, 1:].contiguous() == 0] = -100
    model_inputs = {
        "input_ids": q_ids,
        "attention_mask": q_mask,
        "decoder_input_ids": a_ids[:, :-1].contiguous(),
        # transformers 4.x expects "labels"; the older "lm_labels" keyword is no longer accepted
        "labels": lm_labels,
    }
    return model_inputs


def train_qa_s2s_epoch(model, dataset, tokenizer, optimizer, scheduler, args, e=0, curriculum=False):
    model.train()
    # make iterator
    if curriculum:
        train_sampler = SequentialSampler(dataset)
    else:
        train_sampler = RandomSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda:0"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=train_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    for step, batch_inputs in enumerate(epoch_iterator):
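        pre_loss = model(**batch_inputs)[0]
        # mean() equals sum() / shape[0] for the per-GPU loss vector returned by DataParallel,
        # and also works when the loss is a 0-dim scalar (single GPU)
        loss = pre_loss.mean()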
        loss.backward()
        # optimizer
        if step % args.backward_freq == 0:
            optimizer.step()
            scheduler.step()
            model.zero_grad()
        # some printing within the epoch
        loc_loss += loss.item()
        loc_steps += 1
        if step % args.print_freq == 0 or step == 1:
            print(
                "{:2d} {:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                    e, step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                )
            )
            loc_loss = 0
            loc_steps = 0


def eval_qa_s2s_epoch(model, dataset, tokenizer, args):
    model.eval()
    # make iterator
    eval_sampler = SequentialSampler(dataset)
    model_collate_fn = functools.partial(
        make_qa_s2s_batch, tokenizer=tokenizer, max_len=args.max_length, device="cuda:0"
    )
    data_loader = DataLoader(dataset, batch_size=args.batch_size, sampler=eval_sampler, collate_fn=model_collate_fn)
    epoch_iterator = tqdm(data_loader, desc="Iteration", disable=True)
    # accumulate loss since last print
    loc_steps = 0
    loc_loss = 0.0
    st_time = time()
    with torch.no_grad():
        for step, batch_inputs in enumerate(epoch_iterator):
            pre_loss = model(**batch_inputs)[0]
            # mean() also handles a 0-dim scalar loss (single GPU)
            loss = pre_loss.mean()
            loc_loss += loss.item()
            loc_steps += 1
            if step % args.print_freq == 0:
                print(
                    "{:5d} of {:5d} \t L: {:.3f} \t -- {:.3f}".format(
                        step, len(dataset) // args.batch_size, loc_loss / loc_steps, time() - st_time,
                    )
                )
    print("Total \t L: {:.3f} \t -- {:.3f}".format(loc_loss / loc_steps, time() - st_time,))


def train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args):
    s2s_optimizer = AdamW(qa_s2s_model.parameters(), lr=s2s_args.learning_rate, eps=1e-8)
    s2s_scheduler = get_linear_schedule_with_warmup(
        s2s_optimizer,
        num_warmup_steps=400,
        num_training_steps=(s2s_args.num_epochs + 1) * math.ceil(len(s2s_train_dset) / s2s_args.batch_size),
    )
    for e in range(s2s_args.num_epochs):
        train_qa_s2s_epoch(
            qa_s2s_model,
            s2s_train_dset,
            qa_s2s_tokenizer,
            s2s_optimizer,
            s2s_scheduler,
            s2s_args,
            e,
            curriculum=(e == 0),
        )
        m_save_dict = {
            "model": qa_s2s_model.state_dict(),
            "optimizer": s2s_optimizer.state_dict(),
            "scheduler": s2s_scheduler.state_dict(),
        }
        print("Saving model {}".format(s2s_args.model_save_name))
        eval_qa_s2s_epoch(qa_s2s_model, s2s_valid_dset, qa_s2s_tokenizer, s2s_args)
        torch.save(m_save_dict, "{}_{}.pth".format(s2s_args.model_save_name, e))


# generate answer from input "question: ... context: <p> ..."
def qa_s2s_generate(
    question_doc,
    qa_s2s_model,
    qa_s2s_tokenizer,
    num_answers=1,
    num_beams=None,
    min_len=64,
    max_len=256,
    do_sample=False,
    temp=1.0,
    top_p=None,
    top_k=None,
    max_input_length=512,
    device="cuda:0",
):
    model_inputs = make_qa_s2s_batch([(question_doc, "A")], qa_s2s_tokenizer, max_input_length, device=device,)
    n_beams = num_answers if num_beams is None else max(num_beams, num_answers)
    generated_ids = qa_s2s_model.generate(
        input_ids=model_inputs["input_ids"],
        attention_mask=model_inputs["attention_mask"],
        min_length=min_len,
        max_length=max_len,
        do_sample=do_sample,
        early_stopping=True,
        num_beams=1 if do_sample else n_beams,
        temperature=temp,
        top_k=top_k,
        top_p=top_p,
        eos_token_id=qa_s2s_tokenizer.eos_token_id,
        no_repeat_ngram_size=3,
        num_return_sequences=num_answers,
        decoder_start_token_id=qa_s2s_tokenizer.bos_token_id,
    )
    return [qa_s2s_tokenizer.decode(ans_ids, skip_special_tokens=True).strip() for ans_ids in generated_ids]
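The label construction in `make_qa_s2s_batch` can be checked in isolation. The sketch below repeats the same slicing on a toy tensor; the helper name `shift_answer_tokens` and the token ids are made up for illustration, and `-100` is the index that PyTorch's cross-entropy loss ignores by default.

```python
import torch


def shift_answer_tokens(a_ids, a_mask, ignore_index=-100):
    """Split answer ids into decoder inputs and next-token labels (teacher forcing)."""
    # the decoder sees every token except the last one
    decoder_input_ids = a_ids[:, :-1].contiguous()
    # the label at position t is the token at position t + 1
    labels = a_ids[:, 1:].contiguous().clone()
    # padding positions must not contribute to the loss
    labels[a_mask[:, 1:] == 0] = ignore_index
    return decoder_input_ids, labels


# toy batch with illustrative ids: 0 = start, 2 = end, 1 = padding
a_ids = torch.tensor([[0, 11, 12, 2, 1, 1]])
a_mask = torch.tensor([[1, 1, 1, 1, 0, 0]])
decoder_input_ids, labels = shift_answer_tokens(a_ids, a_mask)
print(decoder_input_ids.tolist())  # [[0, 11, 12, 2, 1]]
print(labels.tolist())  # [[11, 12, 2, -100, -100]]
```

Each label lines up with the decoder input one position earlier, so the model is trained to predict the next answer token at every non-padding step.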


In [None]:
n_ret = 10

In [None]:
# pre-computing support documents
eli5_train_docs = []
for example in train:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer,passage_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_train_docs += [(example['id'], support_doc, dense_res_list)]

eli5_valid_docs = []
for example in test:
    support_doc, dense_res_list = query_qa_dense_index(
        example['x'], qar_model, qar_tokenizer, passage_snippets, wiki40b_gpu_index, n_results=n_ret
    )
    eli5_valid_docs += [(example['id'], support_doc, dense_res_list)]

# training loop proper
class ArgumentsS2S():
    def __init__(self):
        self.batch_size = 8
        self.backward_freq = 16
        self.max_length = 1024
        self.print_freq = 100
        self.model_save_name = "seq2seq_models/eli5_bart_model"
        self.learning_rate = 2e-4
        self.num_epochs = 3

s2s_args = ArgumentsS2S()

# eli5_train_docs = json.load(open('precomputed/eli5_train_precomputed_dense_docs.json'))
# eli5_valid_docs = json.load(open('precomputed/eli5_valid_precomputed_dense_docs.json'))
# ELI5DatasetS2S requires num_rows as its second argument; this assumes train and test support len()
s2s_train_dset = ELI5DatasetS2S(train, len(train), document_cache={k: d for k, d, src_ls in eli5_train_docs})
s2s_valid_dset = ELI5DatasetS2S(test, len(test), document_cache={k: d for k, d, src_ls in eli5_valid_docs}, training=False)

qa_s2s_tokenizer, pre_model = make_qa_s2s_model(
    model_name="facebook/bart-large",
    from_file=None,
    device="cuda"
)
qa_s2s_model = torch.nn.DataParallel(pre_model)

train_qa_s2s(qa_s2s_model, qa_s2s_tokenizer, s2s_train_dset, s2s_valid_dset, s2s_args)
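
The interaction between `batch_size`, `backward_freq`, and the learning-rate schedule is easier to see with a little arithmetic. The sketch below mirrors the stepping logic of `train_qa_s2s_epoch` and `train_qa_s2s`; the helper `accumulation_summary` and the dataset size of 10,000 examples are hypothetical, chosen only to make the numbers concrete.

```python
import math


def accumulation_summary(n_examples, batch_size, backward_freq, num_epochs):
    """Summarise how often the optimizer and scheduler step in the training loop."""
    batches_per_epoch = math.ceil(n_examples / batch_size)
    # the optimizer steps whenever step % backward_freq == 0, counting from step 0
    optimizer_steps_per_epoch = math.ceil(batches_per_epoch / backward_freq)
    return {
        # gradients are accumulated over up to backward_freq batches per update
        "effective_batch_size": batch_size * backward_freq,
        "optimizer_steps_per_epoch": optimizer_steps_per_epoch,
        # the scheduler is stepped together with the optimizer
        "scheduler_steps_taken": num_epochs * optimizer_steps_per_epoch,
        # num_training_steps as passed to get_linear_schedule_with_warmup
        "scheduler_steps_planned": (num_epochs + 1) * batches_per_epoch,
    }


# n_examples=10_000 is a made-up dataset size; the other values come from ArgumentsS2S
summary = accumulation_summary(n_examples=10_000, batch_size=8, backward_freq=16, num_epochs=3)
print(summary)
```

Under these assumed numbers the scheduler advances far fewer steps than it was configured for, which may be worth keeping in mind when choosing `num_warmup_steps`.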