In [None]:
!pip install nlp
!pip install transformers
!pip install rouge_score
!pip install faiss_gpu

In [None]:
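import json
import math
import os
import faiss
import nlp
import numpy as np
import pandas as pd
from lfqa_utils import *  # project helpers (retriever and generation utilities)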

<a id='reddit_biases'></a>
### 1.b - Note on Data and Biases

Before we go any further, let us take a moment to talk about the provenance of our training data. While Reddit hosts a number of thriving communities with high quality discussions, it is also widely known to have corners where sexism, hate, and harassment are significant issues. See for example the [recent post from Reddit founder u/spez](https://www.reddit.com/r/announcements/comments/gxas21/upcoming_changes_to_our_content_policy_our_board/) outlining some of the ways he thinks the website's historical policies have been responsible for this problem, [Adrienne Massanari's 2015 article on GamerGate](https://www.researchgate.net/publication/283848479_Gamergate_and_The_Fappening_How_Reddit's_algorithm_governance_and_culture_support_toxic_technocultures) and follow-up works, or a [2019 Wired article on misogyny on Reddit](https://www.wired.com/story/misogyny-reddit-research/).

While there has been some recent work in the NLP community on *de-biasing* models (e.g. [Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings](https://arxiv.org/abs/1904.04047) for word embeddings trained specifically on Reddit data), this problem is far from solved, and the likelihood that a trained model might learn the biases present in the data remains a significant concern.

As mentioned above, the magnitude of the problem depends on the specific communities/subreddits. This work uses data from [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/), and the `nlp` library also gives access to examples from [r/askscience](https://www.reddit.com/r/askscience/), and [r/AskHistorians](https://www.reddit.com/r/AskHistorians/). There are some encouraging signs for all of these communities: [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/) and [r/askscience](https://www.reddit.com/r/askscience/) have similar structures and purposes, and [r/askscience](https://www.reddit.com/r/askscience/) was found in 2015 to show medium supportiveness and very low toxicity when compared to other subreddits (see a [hackerfall post](https://hackerfall.com/story/study-and-interactive-visualization-of-toxicity-in), [thecut.com write-up](https://www.thecut.com/2015/03/interactive-chart-of-reddits-toxicity.html) and supporting [data](https://chart-studio.plotly.com/~bsbell21/210/toxicity-vs-supportiveness-by-subreddit/#data)). Meanwhile, the [r/AskHistorians rules](https://www.reddit.com/r/AskHistorians/wiki/rules) mention that the admins will not tolerate "*racism, sexism, or any other forms of bigotry*".

This is obviously not enough to exonerate the model (the pre-training step, for example, raises its own questions on that topic), and there is still a lot of interesting work to do to be able to quantify the biases in a conditional text generation model. One thing you can do to help: if you find any particularly egregious answers provided by the model when using the demo, or want to work on this research question, please send a DM to [@YJernite on Twitter](https://twitter.com/YJernite)!

<a id='task_description'></a>
# 2. Task and Data Description

Let's recap: we are interested in the task of Long Form Question Answering. As in other Question Answering tasks, the model is presented with a question, and is required to generate a natural language answer. Whereas a majority of QA datasets contain mostly **factoid** questions, where the answer, such as a date or the name of a single entity, can be expressed in a few words or a single sentence, Long Form QA focuses on questions which call for an **explanation** consisting of a few sentences or a few paragraphs.

In order to teach a model to answer such questions, we use questions and answers written by Reddit users. Note that the `nlp.load_dataset` command above actually downloaded questions and their associated answers from the [r/explainlikeimfive](https://www.reddit.com/r/explainlikeimfive/), [r/askscience](https://www.reddit.com/r/askscience/), and [r/AskHistorians](https://www.reddit.com/r/AskHistorians/) subreddits. We focus here on the **ELI5/explainlikeimfive** part to train the system, as these examples tend to be a little simpler.  

Let's look at one item from the test set:

In [21]:
parrot = nlp.load_dataset('json',data_files='parrot-qa-train.json', field='data')['train']
parrot_dev = nlp.load_dataset('json',data_files='parrot-qa-dev.json', field='data')['train']
parrot_test = nlp.load_dataset('json',data_files='parrot-qa2.json', field='qa_pairs')['train']
chunks = nlp.load_dataset('json',data_files='parrot-qa.json', field="documents")['train']

Using custom data configuration default
Using custom data configuration default


So here we have the question:
> Why does water heated to room temperature feel colder than the air around it?  


This definitely requires a multi-step explanation: no single phrase can sum up all of the information we are looking for. Here are the answers that were given on ELI5, and were given scores of +5 and +2 respectively by Reddit users:
> 1. Water transfers heat more efficiently than air. When something feels cold it's because heat is being transferred from your skin to whatever you're touching. Since water absorbs the heat more readily than air, it feels colder.  

> 2. Air isn't as good at transferring heat compared to something like water or steel (sit on a room temperature steel bench vs. a room temperature wooden bench, and the steel one will feel more cold). When you feel cold, what you're feeling is heat being transferred out of you. If there is no breeze, you feel a certain way.  If there's a breeze, you will get colder faster (because the moving air is pulling the heat away from you), and if you get into water, its quite good at pulling heat from you. Get out of the water and have a breeze blow on you while you're wet, all of the water starts evaporating, pulling even more heat from you.  

First, note that in this case **we have two answers** which broadly describe the same phenomenon: the first one is scored higher because it is more succinct and to the point. This example already illustrates one important feature of the LFQA task: **there are usually several valid ways to answer a given question.** Of the 272K examples in the ELI5 training set, nearly two thirds (167K) have at least two answers. We'll need to keep this in mind when training and evaluating the model.  

Secondly, we need to give our model access to the information that is expressed in both these answers. Recently released models have been shown to include a significant amount of world knowledge in their parameters without the need of any external knowledge at all (see e.g. the [Closed-book QA performance of the T5 model](https://arxiv.org/abs/2002.08910)). However, there are several advantages to giving the model explicit access to information in text form. First, a larger number of parameters in a model implies a larger computational cost. Secondly, getting information from a text database allows us to easily update the model's knowledge without having to re-train its parameters.

--- 

<img src="images/ELI5animation.gif" width="750" align="center"/>  

---

<center> Overview of the full question answering process.</center>
<center> First, the Document Retriever selects a set of passages from Wikipedia that have information relevant to the question.</center>
<center> Then, the Answer Generation Model reads the concatenation of the question and retrieved passages, and writes out the answer.</center>

---  
Here, we choose to give the model access to Wikipedia text. Full Wikipedia articles are typically too long for most current models to handle, and notable exceptions like the [Reformer](https://arxiv.org/abs/2001.04451) or [Longformer](https://arxiv.org/abs/2004.05150) architectures unfortunately do not yet have pre-trained sequence-to-sequence variants. Thus, we follow previous work in splitting Wikipedia articles into disjoint snippets of 100 words, and keep track of the title of the article and sections a snippet came from. Here's how you can get a pre-processed Wiki40b version split into 100-word passages with the `nlp` library, and an example snippet which has some of the information we're looking for ("*little conduction would occur since air is a poor conductor of heat*"):

In the next two sections, we show how we can use either a [sparse retriever](#elasticsearch) or a [trained dense retriever](#dense_retrieval) to automatically find relevant snippets for a question.

<a id='dense_retrieval'></a>
# 4. Retrieving Support Documents with an ELI5-Trained Dense Model

The sparse retriever works by finding passages which feature the words from the query. However, it has no way to know *a priori* which of these words are more important in context, and seems to struggle with understanding the central theme of the query (human-perceived temperature).

Thankfully, some recent works have taken advantage of advances in pre-trained contextual word representations to solve this problem. Models such as [DPR](https://arxiv.org/abs/2004.04906) or [REALM](https://arxiv.org/abs/2002.08909) for example learn to compute a vector representation of the query, as well as vector representations of Wikipedia passages in such a way that the passages that best answer a question maximize the dot product between the two representations. Retrieval is then reduced to a Maximum Inner Product Search, which can be executed efficiently using systems like [FAISS](https://github.com/facebookresearch/faiss).
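
To make the Maximum Inner Product Search idea concrete, here is a minimal pure-Python sketch. It is only an illustration: the 3-dimensional vectors are made up rather than produced by a model, and the helper names `dot` and `top_k_inner_product` are hypothetical. A library such as FAISS performs the same ranking far more efficiently over millions of vectors.

```python
# Toy maximum inner product search (illustrative only).
# The vectors below are made-up embeddings, not model outputs.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def top_k_inner_product(query, passages, k):
    """Return the indices of the k passages with the largest inner product."""
    scores = [dot(query, p) for p in passages]
    ranked = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

passage_vecs = [
    [0.9, 0.1, 0.0],  # passage 0
    [0.0, 1.0, 0.2],  # passage 1
    [0.5, 0.5, 0.5],  # passage 2
]
query_vec = [1.0, 0.0, 0.1]

best = top_k_inner_product(query_vec, passage_vecs, k=2)
print(best)
```

Passage 0 scores highest against this query, followed by passage 2, so those two indices are returned in that order.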

These successes are very encouraging for our Open-Domain Long Form QA application. However, our task and setup do not quite meet the requirements of either of these approaches. On the one hand, the [DPR](https://arxiv.org/abs/2004.04906) system is trained using gold passage annotations: most major QA datasets tell the system which Wikipedia passage contains the answer. Unfortunately, we do not have such annotations for the ELI5 data. On the other hand, while [REALM](https://arxiv.org/abs/2002.08909) is trained without passage supervision, it requires a pretty expensive pre-training step with an [Inverse Cloze Task](https://arxiv.org/abs/1906.00300) (100,000 steps with batch size 4096), and the ability to re-compute the embeddings of all Wikipedia passages regularly during training.

In order to train a similar dense retrieval system at reduced cost without having access to gold passage annotation, we will have to **take advantage of another unique feature of our dataset**, namely the fact that the long form answers are quite similar in style to the Wikipedia passages we want to index. Our hypothesis then is that if we train a system to embed the questions and answers in our dataset in a way that allows us to easily match questions to answers, then using the answer embedder on Wikipedia passages should allow us to similarly match questions to supporting evidence from Wikipedia.

<a id='dense_train'></a>
### 4.a - Contrastive Training with ELI5 In-Batch Negatives

As mentioned above, we want to train a system to produce question and answer embeddings, such that the dot product between the representation of a question and any of its answers is greater than between it and answers of all of the other questions in the dataset.  

Unfortunately, actually comparing all questions to all answers before taking every single gradient step is computationally prohibitive: instead, we follow previous work in simply processing medium to large batches of question-answer pairs, and making sure that the dot product of a question with its answer is larger than with all other answers in the batch, and *vice versa*.  

We use a cross-entropy loss for the multinomial distribution over all of the answers (or questions) in a batch, and make use of [PyTorch gradient checkpointing](https://pytorch.org/docs/stable/checkpoint.html) to be able to use large batches with limited GPU memory: you can find all implementation details in the `RetrievalQAEmbedder` class in `eli5_utils.py`.
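
The loss described above can be sketched in a few lines of pure Python. This is a simplified illustration rather than the notebook's implementation: it assumes the question-to-answer and answer-to-question cross-entropy terms are simply averaged, uses made-up 2-dimensional embeddings, and the function names are hypothetical. The actual model works on PyTorch tensors with gradient checkpointing.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def in_batch_contrastive_loss(q_embs, a_embs):
    """Symmetric cross-entropy over the question-answer score matrix.

    scores[i][j] is the dot product of question i with answer j; the
    correct answer for question i is answer i (the diagonal).
    """
    n = len(q_embs)
    scores = [[sum(x * y for x, y in zip(q, a)) for a in a_embs] for q in q_embs]
    # question -> answer: each row is a distribution over the answers in the batch
    loss_qa = -sum(log_softmax(scores[i])[i] for i in range(n)) / n
    # answer -> question: each column is a distribution over the questions in the batch
    cols = [[scores[i][j] for i in range(n)] for j in range(n)]
    loss_aq = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return (loss_qa + loss_aq) / 2

# Made-up embeddings: in `aligned`, answer i points the same way as question i.
questions = [[2.0, 0.0], [0.0, 2.0]]
aligned = [[2.0, 0.0], [0.0, 2.0]]
swapped = [[0.0, 2.0], [2.0, 0.0]]

loss_aligned = in_batch_contrastive_loss(questions, aligned)
loss_swapped = in_batch_contrastive_loss(questions, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each question scores highest against its own answer and large when the pairs are swapped, which is the behavior the training objective rewards.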

---  

<img src="images/ELI5contrastive.svg" width="700" align="center"/>  

---  

<center> To train the retriever, we show the model batches of 512 question-answer pairs.</center>
<center> The model needs to ensure that the embedding of each question in the batch is closer to the embedding</center>
<center> of its corresponding answer than to the embedding of any other answer in the batch.</center>

---  


We use a single BERT-style pre-trained model to embed the questions and answers, and learn different projection matrices to bring both representations down to dimension 128: the projection matrices are trained from scratch as the sentence embedding model is fine-tuned. We found that the 8-layer distilled version of BERT from the [Well-Read Students Learn Better paper](https://arxiv.org/abs/1908.08962) performed as well as or better than full BERT for a notable gain in computation speed: if you want an even faster model, that work provides pre-trained models spanning the full range of computation/accuracy trade-offs.

The model can then be trained with the following code: with a batch size of 512 and a checkpoint batch size of 32 on a single 16GB GPU, you can run 10 training epochs in under 6 hours.

In [None]:
# training arguments

class ArgumentsQAR():
    def __init__(self):
        self.batch_size = 512
        self.max_length = 128
        self.checkpoint_batch_size = 32
        self.print_freq = 100
        self.pretrained_model_name = "google/bert_uncased_L-8_H-768_A-12"
        self.model_save_name = "retriever_models/eli5_retriever_model_l-8_h-768_b-512-512"
        self.learning_rate = 2e-4
        self.num_epochs = 10

qar_args = ArgumentsQAR()

# prepare torch Dataset objects
qar_train_dset = ELI5DatasetQARetriver(parrot, training=True)
qar_valid_dset = ELI5DatasetQARetriver(parrot_dev, training=False)

# load pre-trained BERT and make model
qar_tokenizer, qar_model = make_qa_retriever_model(
        model_name=qar_args.pretrained_model_name,
        from_file=None,
        # device="cuda:0",
)

# train the model
train_qa_retriever(qar_model, qar_tokenizer, qar_train_dset, qar_valid_dset, qar_args)

Once the model is trained, it can be used to compute passage embeddings for all Wikipedia snippets. The `make_qa_dense_index` method takes advantage of `numpy` memory-mapping, so embeddings are written directly to disk. Again with a single GPU, computing the full set of passage embeddings should take about 18 hours.

In [49]:
if not os.path.isfile('wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'):
    make_qa_dense_index(
        qar_model, qar_tokenizer, chunks, #device='cuda:0',
        index_name='wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat'
    )

<a id='dense_use'></a>
### 4.b -  Using the Trained Dense Retriever and Wikipedia Index

Now that we have trained our model to compute query and answer embeddings and used it to compute passage embeddings for all our Wikipedia snippets, let's see whether it can actually find supporting evidence for a new question. Recall the two steps to using the dense retriever: we first compute an embedding for a new question, then do Max Inner Product Search with the pre-computed passage representations.

---  

<img src="images/ELI5wiki_index.svg" width="600" align="center"/>  

---  

<center> At test time, the Retriever Model encodes the question, and compares its embedding to the pre-computed representation of</center>
<center>  all the Wikipedia passages. The ten passages with the closest embedding are returned to create the support document.</center>

---  

The MIPS part can be executed efficiently with the `faiss` library. Additionally, since we computed 128-dimensional passage embeddings, the full set of representations fits on a GPU, making retrieval even faster. We can create the `faiss_gpu` index with the following code:

In [50]:
faiss_res = faiss.StandardGpuResources()
wiki40b_passage_reps = np.memmap(
            'wiki40b_passages_reps_32_l-8_h-768_b-512-512.dat',
            dtype='float32', mode='r+',
            shape=(chunks.num_rows, 128)
)

wiki40b_index_flat = faiss.IndexFlatIP(128)
wiki40b_gpu_index = faiss.index_cpu_to_gpu(faiss_res, 0, wiki40b_index_flat)
wiki40b_gpu_index.add(wiki40b_passage_reps)
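
As a rough check on why 128-dimensional `float32` embeddings fit in GPU memory, the index size is simply the number of passages times 128 values times 4 bytes. The sketch below uses a hypothetical passage count chosen only for illustration (it is not the size of the index built above), and `embedding_index_bytes` is a helper name introduced here.

```python
def embedding_index_bytes(num_passages, dim=128, bytes_per_value=4):
    """Size in bytes of a dense float32 index with one `dim`-vector per passage."""
    return num_passages * dim * bytes_per_value

# Hypothetical passage count, chosen only for illustration.
n_passages = 1_000_000
size_gib = embedding_index_bytes(n_passages) / 2**30
print(f"{size_gib:.2f} GiB")
```

Each passage costs 512 bytes, so the memory footprint grows linearly with the number of passages and can be compared directly with the memory available on the target GPU.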

Now we can use the `query_qa_dense_index` function to query the dense index for our running example question about perceived temperature:

In [52]:
question = parrot_test[3]['title']
doc, res_list = query_qa_dense_index(question, qar_model, qar_tokenizer, chunks, wiki40b_gpu_index, device='cuda:0')

df = pd.DataFrame({
    'Article': ['---'] + [res['article_title'] for res in res_list],
    'Sections': ['---'] + [res['section_title'] if res['section_title'].strip() != '' else res['article_title']
                 for res in res_list],
    'Text': ['--- ' + question] + [res["passage_text"] for res in res_list],
})
df.style.set_properties(**{'text-align': 'left'})



Unnamed: 0,Article,Sections,Text
0,---,---,"--- Error: Could not import scheme HI there, I keep getting this error. I have quit and re-opened VS code as well as re-downloaded scheme completely. Do you know how to fix it?"
1,Syllabus & Course Policies: Assignments,Syllabus & Course Policies: Assignments,"Each week, there will be problems assigned for you to work on, most of which will involve writing and analyzing programs. These assignments come in three categories: lab exercises, homework assignments, and projects."
2,Syllabus & Course Policies: Course Format: Exam Prep,Syllabus & Course Policies: Course Format: Exam Prep,"Exam prep sessions will be held every Friday from 9:30am to 11am starting 1/28. The goal will be to recap the past week's material by way of going over past exam problems related to that material, as well as exploring some test-taking strategies. The problem walkthroughs will be recorded, so if you can't attend live, you'll still be able to watch the walkthroughs on your own."
3,Syllabus & Course Policies: Overview: Preparatory Classes: CS 10,Syllabus & Course Policies: Overview: Preparatory Classes: CS 10,"CS 10: The Beauty and Joy of Computing is an introductory computer science course which is similar to CS 61A but moves at a friendlier pace. CS 10 covers variables, functions, recursion, algorithmic complexity, object-oriented programming, and many other relevant CS 61A topics, with the overall content overlap being about 50%. CS 10 starts the semester in Snap!, a block-based programming language which allows students to focus on conceptual understanding without worrying about unfamiliar syntax. After the midterm, the course transitions into Python (the primary language 61A uses), applying the same concepts you already learned to the new language, as well as introducing new concepts more relevant to Python. CS 10 also covers big ideas and social implications that go beyond programming, showing you the beauty and joy of computing."
4,Syllabus & Course Policies: Grading,Syllabus & Course Policies: Grading,"Your course grade is computed using a point system with a total of 300 points, broken down as follows: • Midterm 1, worth 40 points • Midterm 2, worth 50 points • Final Exam, worth 75 points • Projects, worth 99 points • Homework, worth 16 points • Lab, worth 10 points • Lab Participation, worth 5 points • Discussion Participation, worth 5 points There are a handful extra credit points throughout the semester, perhaps around 10, that are available to everyone. Each letter grade for the course corresponds to a range of scores: Your final score will be rounded to the nearest integer before being converted to a letter grade. 0.5 rounds up to 1, but 0.49 rounds down to 0. There is no curve; your grade will depend only on how well you do, and not on how well everyone else does. Score thresholds are based on how students performed in previous semesters. Unlike some previous semesters you may have heard about, these thresholds will not be adjusted based on student performance. You could all get A's. You could all get D's. These are the exact thresholds that will be used at the end of the course to assign grades. In a typical semester, about 60% of students taking the course for a letter grade will receive a B+ or higher. Incomplete grades will be granted only for medical or personal emergencies that cause you to miss the final or last part of the course, only for students who have completed the majority of the coursework, and only if work up to the point of the emergency has been satisfactory. Your lowest homework score will be dropped. Each lab that you complete is worth 1 point, and you can receive a maximum of 10 lab points. There are going to be at least 12 lab assignments, so you can skip some and still get full credit."
5,Syllabus & Course Policies: Grading: Lab Participation,Syllabus & Course Policies: Grading: Lab Participation,"The lab participation score is designed to make sure that all students attend at least the first few weeks of lab sections to try them out. Attending a lab will earn you one lab participation credit. There will be about 12 possible credits available. To earn a perfect lab participation score in the course, you need to earn at least 5 credits. Your course lab participation score is the number of lab participation credits you earn over the semester, up to 5. These are separate from the lab score component, which is graded based on lab completion and correctness."
6,Syllabus & Course Policies: Overview: Alternative Classes: CS 88,Syllabus & Course Policies: Overview: Alternative Classes: CS 88,"CS 88: Computational Structures in Data Science is an introduction to programming and computing that has more than 50% concept overlap with CS 61A. It is designed for students interested in data science who want to expand their knowledge of programming and program structures beyond what is covered in Data 8. Students who complete CS 88 can either proceed directly to CS 61B or subsequently take CS 61A, a path that offers a substantial amount of review because of the high topic overlap between the courses."
7,Syllabus & Course Policies: Overview: Preparatory Classes: Data 8,Syllabus & Course Policies: Overview: Preparatory Classes: Data 8,"Data 8: The Foundations of Data Science is an introduction to data science designed to be accessible and useful for all Berkeley students. This course was built for students without prior programming experience. It teaches students to program in Python 3, but covers a much smaller subset of the language than CS 61A. Most of the course focuses on data processing and statistical techniques that are central to using computers to answer questions about the world. Taking Data 8 before 61A is a good way to gain prior programming experience, but taking CS 10 is a better way."
8,Syllabus & Course Policies: Accommodations (DSP and Otherwise): Accommodations Appointments,Syllabus & Course Policies: Accommodations (DSP and Otherwise): Accommodations Appointments,"If you're not enrolled in DSP, or are in the process of being onboarded by DSP, you may still be eligible for accommodations. You may also be eligible for accommodations if serious extenuating circumstances should come up during the semester. If you believe you may require accommodations, please visit this calendar to book a short (20-minute) appointment with our Student Support TA, Cooper Bedin. You can also reach them via email at at cooper.bedin@berkeley.edu."
9,Syllabus & Course Policies: Grading: Citizenship,Syllabus & Course Policies: Grading: Citizenship,"For exceptionally rude or disrespectful behavior toward the course staff or other students, your final grade will be lowered by up to a full letter grade (e.g., from an A- to a B-) at the discretion of the course instructors. You don't need to be concerned about this policy if you treat other human beings with even a bare minimum of respect and consideration and do not engage in behavior that is actively harmful to others."


The retrieved documents are quite different from the ones returned by the sparse retrieval, with a greater focus on how water helps draw heat from a body, either through evaporation or through better conduction, which is information the model needs to answer this question.

The retriever still misses out on one aspect of the query: the way the question is formulated implies that in the considered scenario the person is immersed in water rather than just wet, which makes the "latent heat" and evaporation arguments a little less relevant, but that's a really subtle distinction!

<a id='dense_eval'></a>
### 4.c -  Retriever Model Evaluation

We have trained a retrieval model that *seems* to be working a little better than the traditional word-matching based approach, at least on our running example. Before we use it to actually answer questions, however, we would like to be able to get some **quantitative evaluation** of the performance of both approaches.

For the retriever, we want to favor **recall** over precision: our first priority is to make sure that all of the information needed to write the answers is present in the support document. If there is unrelated information, the generation model can learn to sort it out. We measure this by computing the proportion of words in the high-scoring answers which are present in the retrieved support document. To focus on important words, we also weight answer words by their *Inverse Document Frequency*. This gives us the following **IDF-recall** scoring function:

In [53]:
# We first select high-scoring answers (answers beyond the first must have a score of at least 3)
test_qa_list = [(exple['title'],
                ' '.join([a 
                          for i, (a, sc) in enumerate(zip(exple['answers']['text'], exple['answers']['score'])) \
                          if i == 0 or sc >= 3
                         ]))
                for exple in parrot_test]

# We then compute word frequencies in answer text
answer_doc_freq = {}
for q, a in test_qa_list:
    for w in a.lower().split():
        answer_doc_freq[w] = answer_doc_freq.get(w, 0) + 1

# The IDF-recall function is then:
def da_idf_recall(doc, answer):
    d_words = dict([(w, True) for w in doc.lower().split()])
    a_words = answer.lower().split()   
    recall = sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words if w in d_words]) / \
                sum([1. / math.log(1 + answer_doc_freq.get(w, 1)) for w in a_words])
    return recall
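
To see how the IDF weighting behaves, here is a small self-contained variant of the same formula applied to toy data. The word frequencies and sentences are made up, and `idf_recall` is a stand-alone helper written for this illustration (it takes the frequency table as an argument instead of reading the global `answer_doc_freq`).

```python
import math

def idf_recall(doc, answer, doc_freq):
    """Share of the IDF-weighted answer words that also occur in `doc`."""
    doc_words = set(doc.lower().split())
    # Each answer word is weighted by 1 / log(1 + frequency), as in the cell above
    weights = [(w, 1.0 / math.log(1 + doc_freq.get(w, 1))) for w in answer.lower().split()]
    total = sum(wt for _, wt in weights)
    found = sum(wt for w, wt in weights if w in doc_words)
    return found / total

# Made-up frequencies: "the" is common, "conduction" is rare.
toy_doc_freq = {"the": 100, "heat": 5, "conduction": 1}
answer = "the heat conduction"

recall_rare = idf_recall("conduction of heat", answer, toy_doc_freq)
recall_common = idf_recall("the", answer, toy_doc_freq)
print(recall_rare > recall_common)
```

A document that covers the rare words "heat" and "conduction" scores much higher than one that only matches the frequent word "the", which is the effect the weighting is meant to produce.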

The `evaluate_retriever` function in `eli5_utils.py` takes a retrieval and scoring function and computes both the average retrieval time and the score of the document relative to the provided answer. Let's write some short-hand functions for the dense and sparse retrievers with our currently loaded indexes, and evaluate them on the ELI5 test set (be advised that evaluating the retriever on the full test set takes up to two hours):

In [54]:
def dense_ret_for_eval(question, n_ret):
    _, dense_res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer, chunks, wiki40b_gpu_index, n_results=n_ret, device='cuda:0'
    )
    dense_doc = ' '.join([res['passage_text'] for res in dense_res_list])
    # Dump the latest results for inspection (the file is overwritten on every call)
    with open('dense_docs.json', 'w', encoding='utf-8') as f:
        json.dump(dense_res_list, f, ensure_ascii=False, indent=4)
    return dense_doc

def sparse_ret_for_eval(question, n_ret):
    _, sparse_res_list = query_es_index(
        question, es_client, index_name='wiki40b_snippets_100w', n_results=n_ret
    )
    sparse_doc = ' '.join([res['passage_text'] for res in sparse_res_list])
    return sparse_doc

dense_score = evaluate_retriever(test_qa_list, dense_ret_for_eval, da_idf_recall)
#sparse_score = evaluate_retriever(test_qa_list, sparse_ret_for_eval, da_idf_recall)

df = pd.DataFrame({
    'IDF-Recall': [dense_score['idf_recall']],
    'Time/Query': [dense_score['retrieval_time']],
}, index=['Dense'])  # only the dense retriever is evaluated; the sparse call above is commented out
df.style.format({'IDF-Recall': "{:.4f}", 'Time/Query': "{:.4f}"})



Unnamed: 0,IDF-Recall,Time/Query
Dense,0.2394,0.0098


This metric obviously has limitations. Since it only looks at individual word matches, it is oblivious to *word order* and *paraphrases*, among other things. Note also that the sparse evaluation is commented out in the cell above, so the scores reported here come from the dense retriever only and do not provide a sparse-versus-dense comparison on this data. With that caveat, we use the dense retriever for the next part: training the sequence-to-sequence answer generation system.

<a id='generation'></a>
# 5. Generating Answers with a Sequence-to-Sequence Model

Now that we know how to create an evidence document with supporting information for a given question, let's look into training the second component of our system: the **answer generation module**. We will instantiate it as a sequence-to-sequence model which uses the [BART](https://arxiv.org/abs/1910.13461) architecture, and initialize it with the [bart-large pretrained weights](https://huggingface.co/facebook/bart-large).  

In short, the [BART paper](https://arxiv.org/abs/1910.13461) uses a denoising auto-encoder style objective to pre-train an encoder-decoder model (similarly to how masked language modeling is used to pre-train BERT-style encoders). Among other applications, they show that large-scale pre-training with their objective followed by fine-tuning on ELI5 data yields the state-of-the-art ROUGE performance for the original version of the dataset (which uses pre-computed support documents made from CommonCrawl pages).

We provide the concatenation of the question and support document as input to the model, and train the decoder to minimize the perplexity of the gold answer. One notable choice is that **we train the model using all high-scoring answers in the training set**, so the model will see several instances of the same question-document input with different outputs. The supporting passages are separated by a special token `<P>`, so the input for our running example will look like:

> question: Why does water heated to room temperature feel colder than the air around it? context: \\<P\> when the skin is completely wet. The body continuously loses ... this heat comes from the liquid itself and the surrounding gas and surfaces. \\<P\> protected by a glass panel. Consequently, these types of collectors... Since heat loss due to convection cannot cross a vacuum, it forms an efficient isolation mechanism to keep heat inside the collector pipes. Since two flat \\<P\> ... \\<P\> changes. Conduction On... Fluids—especially gases—are less conductive. Thermal contact conductance is the study of heat conduction between solid bodies in contact. The process of heat transfer
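
The input format shown above can be assembled with a small helper. This is a sketch under stated assumptions: `build_s2s_input` is a hypothetical name, the passages are placeholders rather than real retrieval output, and the exact whitespace around the `<P>` separator is a guess based on the example.

```python
def build_s2s_input(question, passages, sep="<P>"):
    """Prefix each passage with the separator token and prepend the question."""
    context = " ".join(f"{sep} {p}" for p in passages)
    return f"question: {question} context: {context}"

# Placeholder passages, not real retrieval output.
example = build_s2s_input(
    "Why is the sky blue?",
    ["first retrieved passage", "second retrieved passage"],
)
print(example)
```

Because every high-scoring answer is used as a separate target, the same input string built this way is paired with each of those answers during training.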

The first thing we do is pre-compute the support documents for the training and validation sets so we can use all available GPUs to train the sequence-to-sequence model. The model is then trained with the `train_qa_s2s` function in `eli5_utils.py`. A 16GB GPU accommodates up to two examples at a time, so we train the model using 4 GPUs with `torch.nn.DataParallel`. One epoch should take about 18 hours.

Again, if you don't want to train the model yourself, we made trained weights available on the [Hugging Face model repository](https://huggingface.co/models), which you can download with:

In [57]:
qa_s2s_tokenizer = AutoTokenizer.from_pretrained('yjernite/bart_eli5')
qa_s2s_model = AutoModelForSeq2SeqLM.from_pretrained('yjernite/bart_eli5').to('cuda:0')
_ = qa_s2s_model.eval()

We now have everything we need to answer any question! Now let's try the full system on our running example along with the first four questions of the test set:

In [58]:
questions = []
answers = []

for i in [25] + list(range(4)):
    # create support document with the dense index
    question = parrot_test[i]['title']
    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        chunks, wiki40b_gpu_index, device='cuda:0'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=64,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    questions.append(question)
    answers.append(answer)

df = pd.DataFrame({
    'Question': questions,
    'Answer': answers,
})
df.style.set_properties(**{'text-align': 'left'})

Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


| | Question | Answer |
|---|---|---|
| 0 | Midterm 2 Corrections Hello, I have some questions in regards to 2c. --> How do we know reversed is an iterable? For the seventh line in the doctest, why isn't None added after lst3 when extending? Correct if me if Im wrong, but doesnt extend return None? | I'm not sure what you mean by "how do we know"? It's not that we know, it's that we can prove it's true. If it's not true, then it can't be an iterable. An iterable is an operation that can be repeated over and over again. It can't just be repeated once, it has to be repeated multiple times. |
| 1 | Attendance Is there a way to check how many labs and discussions we have attended? | There is no such thing as a way to check how many labs and discussions you have attended. There is a way, however, to track how many classes you have taken and how many lectures you've attended. This is called the [ Attendance-to-Class Ratio]( URL_0 ), and it's used by colleges and universities to track the number of students who have taken a course. |
| 2 | Submission Is it okay that if my partner submitted both checkpoint1 and checkpoint2, can I submit extra credits this time? | No, you can't submit both at the same time. You can only submit one at a time, and if you submit both simultaneously, you'll get the same amount of credits as if you had only submitted checkpoint1 and checkpoint2. If you submit checkpoint1, checkpoint2, and checkpoint1 again, you will get the extra credits you submitted the first time. |
| 3 | Error on another device I have passed all tests on my MacBook, but I got a lot of errors when running the same code on a new device with windows system. Why could this happen? | It's possible that you're running the same code on a different device with a different operating system. It's also possible that the operating system on your new device is different than the one on your old device, or that the OS is different from the OS on the old device. It could also be that your operating system is different, or the OS in the new device isn't the same as the one in the old one. |
| 4 | Error: Could not import scheme HI there, I keep getting this error. I have quit and re-opened VS code as well as re-downloaded scheme completely. Do you know how to fix it? | I'm not sure what you mean by "could not import scheme", but it sounds like an error in the source code. If you want to know how to fix it, you'll have to go back to the original source code and look for the error. If it's not there, you can't fix it. |


<img src="images/fireworks.gif" width="1000" align="center"/>  

We made it, and a lot of these answers actually make sense! The model seems to sometimes struggle with coherence and with starting some of the answers, but we're getting some pretty good information overall.

The last thing we'll do is see how we can get a quantitative evaluation of the model performance. Here, we'll use the ROUGE implementation provided in the `nlp` library.  

Note that it is a different implementation than the one used in the [BART](https://arxiv.org/abs/1910.13461) and [ELI5](https://arxiv.org/abs/1907.09190) papers: the [rouge](https://pypi.org/project/rouge/) Python package they use normalises all numerical values, among other pre-processing choices, leading to higher numbers. We reproduce their evaluation in the Appendix section, but recommend using the more sensitive metric provided by the `nlp` package, which can be computed with:

In [None]:
predicted = []
reference = []

# Generate answers for the full test set
for i in range(parrot_test.num_rows):
    # create support document with the dense index
    question = parrot_test[i]['title']
    doc, res_list = query_qa_dense_index(
        question, qar_model, qar_tokenizer,
        chunks, wiki40b_gpu_index, device='cuda:0'
    )
    # concatenate question and support document into BART input
    question_doc = "question: {} context: {}".format(question, doc)
    # generate an answer with beam search
    answer = qa_s2s_generate(
            question_doc, qa_s2s_model, qa_s2s_tokenizer,
            num_answers=1,
            num_beams=8,
            min_len=96,
            max_len=256,
            max_input_length=1024,
            device="cuda:0"
    )[0]
    predicted += [answer]
    reference += [parrot_test[i]['answers']['text'][0]]

In [61]:
# Compare each generation to the first answer from the dataset
nlp_rouge = nlp.load_metric('rouge')

scores = nlp_rouge.compute(
    predicted, reference,
    rouge_types=['rouge1', 'rouge2', 'rougeL', 'rougeLsum'],
    use_agregator=True, use_stemmer=False
)
df = pd.DataFrame({
    'rouge1': [scores['rouge1'].mid.precision, scores['rouge1'].mid.recall, scores['rouge1'].mid.fmeasure],
    'rouge2': [scores['rouge2'].mid.precision, scores['rouge2'].mid.recall, scores['rouge2'].mid.fmeasure],
    'rougeL': [scores['rougeL'].mid.precision, scores['rougeL'].mid.recall, scores['rougeL'].mid.fmeasure],
}, index=[ 'P', 'R', 'F'])
df.style.format({'rouge1': "{:.4f}", 'rouge2': "{:.4f}", 'rougeL': "{:.4f}"})

| | rouge1 | rouge2 | rougeL |
|---|---|---|---|
| P | 0.0522 | 0.0031 | 0.0403 |
| R | 0.2308 | 0.0100 | 0.1841 |
| F | 0.0832 | 0.0048 | 0.0647 |
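
To make the P, R and F rows concrete, here is a minimal sketch of an n-gram overlap score in the spirit of ROUGE-N. It is a simplification written for illustration (lowercasing plus whitespace tokenization, no stemming, a single reference), not the implementation used by the `nlp` metric; the function name `rouge_n` is our own.

```python
from collections import Counter


def rouge_n(prediction, reference, n=1):
    """Return (precision, recall, f-measure) of n-gram overlap."""
    def ngrams(text):
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    pred, ref = ngrams(prediction), ngrams(reference)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so a repeated word is only credited as often as the reference has it.
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f


p, r, f = rouge_n("the cat sat on the mat today", "the cat sat")
print(round(p, 3), round(r, 3), round(f, 3))  # 0.429 1.0 0.6
```

In this toy case the prediction is much longer than the reference, so recall is high while precision is low. A similar length mismatch is one plausible reading of the gap between the P and R rows above, although we have not verified it on the test set.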


That's it for today! And once again, if you want to play with the model a bit more and ask it whatever question comes to mind, please feel free to head over to:
# [**Our Live Demo!**](https://huggingface.co/qa/)  

Thank you for reading!

# Appendix:

Here we reproduce the ROUGE evaluation from the original [ELI5 paper](https://arxiv.org/abs/1907.09190) so that we can compare our performance to theirs. Our generation setting leads to lower ROUGE-1 and ROUGE-2 than the state of the art reported in [BART](https://arxiv.org/abs/1910.13461) (30.6 and 6.2 respectively), and to a higher ROUGE-L (24.3 for BART).

In [None]:
from nltk import PorterStemmer
from rouge import Rouge
from spacy.lang.en import English

stemmer = PorterStemmer()
rouge = Rouge()
# The blank English pipeline exposes its tokenizer directly; this works
# without depending on the older `Defaults.create_tokenizer()` helper.
tokenizer = English().tokenizer

def compute_rouge_eli5(compare_list):
    """Tokenize and stem each (gold, pred) pair, then average ROUGE scores."""
    preds = [" ".join([stemmer.stem(str(w))
                       for w in tokenizer(pred)])
             for gold, pred in compare_list]
    golds = [" ".join([stemmer.stem(str(w))
                       for w in tokenizer(gold)])
             for gold, pred in compare_list]
    scores = rouge.get_scores(preds, golds, avg=True)
    return scores


compare_list = [(g, p) for p, g in zip(predicted, reference)]
scores = compute_rouge_eli5(compare_list)
df = pd.DataFrame({
    'rouge1': [scores['rouge-1']['p'], scores['rouge-1']['r'], scores['rouge-1']['f']],
    'rouge2': [scores['rouge-2']['p'], scores['rouge-2']['r'], scores['rouge-2']['f']],
    'rougeL': [scores['rouge-l']['p'], scores['rouge-l']['r'], scores['rouge-l']['f']],
}, index=[ 'P', 'R', 'F'])
df.style.format({'rouge1': "{:.4f}", 'rouge2': "{:.4f}", 'rougeL': "{:.4f}"})

| | rouge1 | rouge2 | rougeL |
|---|---|---|---|
| P | 0.3254 | 0.0680 | 0.3251 |
| R | 0.3118 | 0.0631 | 0.2560 |
| F | 0.2729 | 0.0551 | 0.2583 |
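
One reason pre-processing choices such as stemming tend to raise overlap scores is that different surface forms of a word collapse to the same token before matching. The sketch below illustrates the effect with a deliberately crude suffix stripper. It is a toy stand-in and not the Porter algorithm used above, and both function names are hypothetical.

```python
def crude_stem(token):
    """Toy suffix stripper (NOT the Porter algorithm), for illustration only."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def unigram_recall(prediction, reference, stem=False):
    """Fraction of distinct reference tokens that also appear in the prediction."""
    norm = crude_stem if stem else (lambda t: t)
    pred = {norm(t) for t in prediction.lower().split()}
    ref = {norm(t) for t in reference.lower().split()}
    return len(pred & ref) / len(ref)


prediction = "heated water cools the skin"
reference = "heating water cooled skin"
print(unigram_recall(prediction, reference))             # 0.5
print(unigram_recall(prediction, reference, stem=True))  # 1.0
```

Without stemming, only "water" and "skin" match. With it, "heated"/"heating" and "cools"/"cooled" also count as matches, so the same pair of texts scores higher.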
