# Abstractive Question Answering

Abstractive question answering generates multi-sentence answers to open-ended questions. It typically works by searching a large document store for relevant passages and then using those passages as context to generate an answer. This notebook demonstrates how to build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

## Vector Index

A vector index is a data structure used in information retrieval to search large collections of documents quickly. Each document is represented as a vector of numerical values. In a sparse, term-based index each value reflects the importance of a particular term in the document; in the dense embeddings used in this notebook, the values jointly encode the meaning of the passage.

## Semantic Search

Semantic search is a search technique that aims to understand the meaning of a query and the context in which it is used, in order to return more relevant results. Unlike traditional keyword-based search, which relies on exact matches, semantic search takes into account synonyms, related concepts, and other factors to identify the documents most relevant to the query.

To use a vector index for semantic search, each document vector should be constructed in a way that captures not only the presence or absence of specific keywords, but also the overall meaning and context of the document. 

Once the vector index is constructed, the semantic search process involves comparing the query vector to the document vectors in the index to identify the documents that are most similar in meaning and context to the query. This can be done using similarity measures such as cosine similarity or Euclidean distance.
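The two similarity measures mentioned above can be sketched in a few lines. This is a minimal pure-Python illustration, not the notebook's implementation; the function names and the 3-dimensional vectors are invented for the example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 3-dimensional embeddings, invented for illustration.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query, larger magnitude
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
}

for name, vec in doc_vecs.items():
    cos = cosine_similarity(query_vec, vec)
    dist = euclidean_distance(query_vec, vec)
    print(f"{name}: cosine={cos:.3f} euclidean={dist:.3f}")
```

Note that `doc_a` is a perfect match under cosine similarity even though it is not the same point as the query: cosine similarity ignores vector length, while Euclidean distance does not.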

## Retriever Model

A retriever model for embedding context passages is a natural language processing (NLP) model that maps passages of text from a larger corpus to vectors in a high-dimensional space, so that passages relevant to a query can be found by comparing vectors.

Retriever models are often used in NLP tasks such as question answering and information retrieval, where the goal is to identify passages of text that contain answers to specific queries.

Once passages have been embedded, the embeddings can also be used for other downstream tasks such as classification, clustering, or summarization. For example, they could be used to identify similar passages of text, cluster related documents, or summarize the main ideas of a collection of documents.




# Install Dependencies

In [None]:
!pip install -qU datasets pinecone-client sentence-transformers torch


In [None]:
!pip install faiss-gpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [None]:
import numpy as np
import pandas as pd
from datasets import load_dataset
from tqdm.auto import tqdm  # progress bar
import torch
from sentence_transformers import SentenceTransformer
import faiss
from pprint import pprint

# Load and Prepare Dataset

Our source data is taken from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset would take some time, this demo uses only 5,000 passages whose "section title" column starts with "History". If you want, you can use the complete dataset; a managed vector database such as Pinecone can handle millions of documents.

In [None]:
# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True
).shuffle(seed=960)


We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9 GB) to download. Instead, records are downloaded iteratively, one at a time, as we consume them.
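The idea of consuming a stream lazily can be sketched without any network access. In the sketch below, `fake_stream` is a hypothetical stand-in for a streamed dataset (it is not part of the `datasets` library), and `produced` only exists to show how many records were actually generated.

```python
from itertools import islice

produced = []  # records the ids that the stream actually generated

def fake_stream():
    """Hypothetical stand-in for a streamed dataset: yields records one at a time."""
    n = 0
    while True:  # conceptually unbounded, like a very large dataset
        produced.append(n)
        yield {"id": n}
        n += 1

# Take only the first three records; nothing beyond them is ever generated.
first_three = list(islice(fake_stream(), 3))
print(first_three)
print(len(produced))
```

Only as many records as are requested get produced, which is why a single `next(iter(...))` on a streamed dataset returns quickly.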

In [None]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [None]:
# filter only documents with History as section_title
history = wiki_data.filter(
    lambda d: d['section_title'].startswith('History')
)

Let's iterate through the dataset and apply our filter to select 5,000 historical passages. We will extract `article_title`, `section_title` and `passage_text` from each document.

In [None]:
total_doc_count = 5000

docs = []
# iterate through the filtered stream and collect the fields we need
for d in tqdm(history, total=total_doc_count):
    # keep only the fields we need and add them to the docs list
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"]
    })

    # stop iteration once we have collected total_doc_count passages
    if len(docs) >= total_doc_count:
        break


In [None]:
# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

|   | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | Taupo District | History | was not until the 1950s that the region starte... |
| 1 | Sutarfeni | History & Western asian analogues | Sutarfeni History strand-like pheni were Phena... |
| 2 | The Bishop Wand Church of England School | History | The Bishop Wand Church of England School Histo... |
| 3 | Teufelsmoor | History & Situation today | made to preserve the original landscape, altho... |
| 4 | Surface Hill Uniting Church | History | in perpetual reminder that work and worship go... |


# Generating embeddings and using FAISS for similarity search

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever creates embeddings such that questions and the passages that answer them lie close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs well at comparing queries with documents. Because the model ends with a normalization layer, its embeddings have unit length, so the dot product (inner product) metric in FAISS ranks passages the same way cosine similarity would.
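To see why the dot product is a reasonable metric here, note that the retriever loaded below ends with a `Normalize()` step, so every embedding has unit length. For unit vectors the dot product equals the cosine similarity, and the squared Euclidean distance is $2 - 2\,(a \cdot b)$, so all three measures rank passages identically. A minimal pure-Python check, using invented 2-dimensional vectors and hypothetical helper names:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical vectors, normalized the way the retriever normalizes its output.
a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

inner = dot(a, b)                                 # equals cosine similarity for unit vectors
sq_l2 = sum((x - y) ** 2 for x, y in zip(a, b))   # squared Euclidean distance

print(inner, sq_l2, 2 - 2 * inner)
```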

In [None]:
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base", device=device)
retriever


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
  (2): Normalize()
)

Next, we need to generate embeddings for the context passages. We do this in batches, which is faster than encoding passages one at a time. For each passage we zip together an id (a unique value), the context embedding, and its metadata. The metadata is a dictionary of the fields that accompany each embedding: the article title, section title, and passage text. These tuples are later split into the embedding matrix that is added to the index and a metadata list used to look up matched passages.
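The batching and zipping pattern described above can be sketched on its own, independent of the model. `batched_ranges` is a hypothetical helper written for this illustration, and the embeddings and metadata are made-up placeholder values.

```python
def batched_ranges(n_items, batch_size):
    """Yield (start, end) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)

# 150 items in batches of 64: the last batch is shorter than the others.
batches = list(batched_ranges(150, 64))
print(batches)

# Zip ids, embeddings and metadata into one tuple per passage.
ids = [str(i) for i in range(3)]
embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # hypothetical 2-d vectors
metadata = [{"passage_text": f"passage {i}"} for i in range(3)]
records = list(zip(ids, embeddings, metadata))
print(records[1])
```

Clamping the end index with `min` is what keeps the final, shorter batch from running past the end of the data.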

In [None]:
# we will use batches of 64
batch_size = 64

data_embed_list = []

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for batch
    emb = retriever.encode(batch["passage_text"].tolist()).tolist()
    # get metadata
    meta = batch.to_dict(orient="records")
    # create unique IDs
    ids = [f"{idx}" for idx in range(i, i_end)]
    # add all to upsert list
    data_with_embed = list(zip(ids, emb, meta))
    data_embed_list.extend(data_with_embed)
  


In [None]:
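# split the (id, embedding, metadata) tuples into parallel structures;
# the row of each embedding in data_emb matches its position in metadata_list
data_emb = []
metadata_list = []
for _id, emb, meta in data_embed_list:
    data_emb.append(emb)
    metadata_list.append(meta)

# FAISS expects a float32 array of shape (n_vectors, dimension)
data_emb = np.array(data_emb, dtype=np.float32)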

In [None]:
# check the embedding dimension (the second axis of data_emb)

In [None]:
data_emb.shape[1]

768

In [None]:
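# build a FAISS index (used here as an alternative to Pinecone)

# create an inverted-file (IVF) index that scores vectors by inner product
dimension = data_emb.shape[1]
nlist = 100  # number of clusters the vectors are partitioned into
quantizer = faiss.IndexFlatIP(dimension)  # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, dimension, nlist, faiss.METRIC_INNER_PRODUCT)

# train index (learns the cluster centroids; required before adding vectors)
index.train(data_emb)

# add data to index
index.add(data_emb)

# save index to disk (optional)
faiss.write_index(index, 'abstractive-question-answering.index')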
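An `IndexIVFFlat` partitions the vectors into `nlist` clusters and, at query time, scans only the `nprobe` clusters closest to the query (FAISS defaults to `nprobe = 1`, which can be raised by setting `index.nprobe`). The toy sketch below imitates that idea in pure Python; it is not how FAISS is implemented. The centroids are fixed by hand rather than learned with k-means, and all names and vectors are invented for illustration.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical fixed centroids; FAISS learns these during index.train().
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
vectors = [[0.9, 0.1], [0.8, 0.3], [0.1, 0.95], [-0.9, 0.2], [0.6, 0.7]]

# "add": store each vector id in the inverted list of its best-matching centroid.
inverted_lists = {c: [] for c in range(len(centroids))}
for vec_id, vec in enumerate(vectors):
    best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
    inverted_lists[best].append(vec_id)

def ivf_search(query, top_k, nprobe):
    """Return the ids of the top_k vectors found in the nprobe closest clusters."""
    # rank clusters by the inner product between the query and their centroid
    probe = sorted(range(len(centroids)),
                   key=lambda c: dot(query, centroids[c]), reverse=True)[:nprobe]
    # only vectors stored in the probed clusters are candidates
    candidates = [vec_id for c in probe for vec_id in inverted_lists[c]]
    ranked = sorted(candidates, key=lambda i: dot(query, vectors[i]), reverse=True)
    return ranked[:top_k]

query = [0.7, 0.6]
print(ivf_search(query, top_k=2, nprobe=1))
print(ivf_search(query, top_k=2, nprobe=2))
```

In this toy data the best match for the query sits in the second-closest cluster, so it is only found once `nprobe` is raised to 2. The same trade-off between speed and recall applies to the real index.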

In [None]:
# helper functions for querying the index and formatting the results

def query_faiss(query, top_k):
    # embed the question as a float32 array of shape (1, dimension)
    xq = np.asarray(retriever.encode([query]), dtype=np.float32)
    # load the index saved above and search it
    index = faiss.read_index('abstractive-question-answering.index')
    D, I = index.search(xq, top_k)
    return D, I

def format_query(query, context):
    # extract passage_text from each match and prefix it with the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the query and context passages
    query = f"question: {query} context: {context}"
    return query

def format_match_list(D, I):
    match_list = []
    for score, idx in zip(D[0], I[0]):
        # FAISS pads missing results with -1; skip those slots
        if idx < 0:
            continue
        # build a new dict per match so that entries do not overwrite each other
        match_list.append({
            'id': str(idx),
            'metadata': metadata_list[idx],
            'score': float(score)
        })
    return match_list

In [None]:
query = "when was the first electric power system built?"
D,I = query_faiss(query, top_k=5)
match_list = format_match_list(D,I)

# I holds the positions of the matched passages and D their similarity scores

In [None]:
query = format_query(query, match_list)
pprint(query)

('question: when was the first electric power system built? context: <P> 100 '
 'horsepower (75\xa0kW) synchronous electric motor, not just provide electric '
 'lighting, at Telluride, Colorado. On the other side of the Atlantic, Mikhail '
 'Dolivo-Dobrovolsky of AEG and Charles Eugene Lancelot Brown of '
 'Maschinenfabrik Oerlikon, built the very first long-distance (175 km, a '
 'distance never tried before) high-voltage (15 kV, then a record) three-phase '
 'transmission line from Lauffen am Neckar to Frankfurt am Main for the '
 'Electrical Engineering Exhibition in Frankfurt, where power was used light '
 'lamps and move a water pump. In the US the AC/DC competition came to an end '
 'when Edison General Electric was taken over by their chief <P> 100 '
 'horsepower (75\xa0kW) synchronous electric motor, not just provide electric '
 'lighting, at Telluride, Colorado. On the other side of the Atlantic, Mikhail '
 'Dolivo-Dobrovolsky of AEG and Charles Eugene Lancelot Brown of '
 'Ma

# Initialize Generator

For the generator we will use ELI5 BART, a sequence-to-sequence model trained on the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string will look as follows:

>question: What is a sonic boom? context: `<P>` A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. `<P>` Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. `<P>` Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and a description of how the ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)


In [None]:
def generate_answer(query):
    # tokenize the query, explicitly truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # print the answer; max_length=40 caps the output, so it may end mid-sentence
    pprint(answer)

In [None]:
generate_answer(query)


("I'm not sure if this qualifies as an answer to your question, but I think "
 "it's worth noting that the first electric power system was built in the late "
 '19th century. The first')


In [None]:
query = "How was the first wireless message sent?"
D,I = query_faiss(query, top_k=5)
match_list = format_match_list(D,I)
query = format_query(query, match_list)
generate_answer(query)

("I'm not sure if this is what you're looking for, but I can tell you that the "
 'first wireless message was sent in the early 1900s. The first wireless '
 'message was sent by')


In [None]:
query = "where did COVID-19 originate?"
D,I = query_faiss(query, top_k=3)
match_list = format_match_list(D,I)
query = format_query(query, match_list)
generate_answer(query)

('COVID-19 is a virus that causes the spread of HIV. The virus was first '
 'discovered in the United States in 1998. It was first used to treat the AIDS '
 'epidemic in the United')


In [None]:
query = "what was the war of currents?"
D,I = query_faiss(query, top_k=5)
match_list = format_match_list(D,I)
query = format_query(query, match_list)
generate_answer(query)

('The war of currents is a term that has been used to describe a number of '
 'different phenomena. The most famous of these is the Great Divergence, which '
 'occurred in the late 19th century')


In [None]:
query = "who was the first person on the moon?"
D,I = query_faiss(query, top_k=10)
match_list = format_match_list(D,I)
query = format_query(query, match_list)
generate_answer(query)

('The first person to walk on the moon was Neil Armstrong, who walked on the '
 'moon in 1969. He was the first person to walk on the moon, and he was the '
 'first person to')


In [None]:
query = "what was NASAs most expensive project?"
D,I = query_faiss(query, top_k=5)
match_list = format_match_list(D,I)
query = format_query(query, match_list)
generate_answer(query)

("I don't know if this counts as a project, but the Apollo missions were the "
 'most expensive in terms of total cost. The Apollo missions cost about $2.5 '
 'billion. The Apollo')


--------