# LAB | Abstractive Question Answering

Abstractive question answering focuses on generating multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using that information to synthesize an answer. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers
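As a rough illustration of how these three components fit together, here is a minimal sketch with toy stand-ins. The names `answer_question`, `toy_retrieve` and `toy_generate` are hypothetical and not part of this lab; the prompt layout follows the `question: ... context: <P> ...` format described later in the notebook.

```python
from typing import Callable, List


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    generate: Callable[[str], str],
    top_k: int = 3,
) -> str:
    """Retrieve passages, build the generator prompt, and generate an answer."""
    passages = retrieve(question, top_k)
    context = " ".join(f"<P> {p}" for p in passages)
    prompt = f"question: {question} context: {context}"
    return generate(prompt)


# toy stand-ins (assumptions) so the sketch runs without any external service
def toy_retrieve(question: str, top_k: int) -> List[str]:
    corpus = ["Edison switched on an electrical supply network in 1882."]
    return corpus[:top_k]


def toy_generate(prompt: str) -> str:
    # echo the first passage instead of running a real model
    return prompt.split("<P> ", 1)[1]


print(answer_question("when was the first electric power system built?",
                      toy_retrieve, toy_generate))
```

In the real notebook, the retrieval step is backed by the vector index and retriever model, and the generation step by the sequence-to-sequence model.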

# Install Dependencies

In [22]:
!pip install -qU datasets==2.16.1 pinecone-client==3.1.0 sentence-transformers torch

# Load and Prepare Dataset

Our source data comes from the SQuAD dataset, whose records pair Wikipedia context passages with questions. Since we only need a small corpus for this demo, we keep only records whose article title contains "history", capped at 50,000 passages. If you want, you may use a much larger corpus such as Wiki Snippets, which contains over 17 million passages from Wikipedia. The Pinecone vector database can manage millions of documents for you.

In [23]:
from datasets import load_dataset
# load the SQuAD dataset which contains Wikipedia contexts
# We will use streaming mode and shuffle it
wiki_data = load_dataset(
  'squad',
  split='train',
  streaming=True
).shuffle(seed=960)

We load the dataset in streaming mode so that we don't have to wait for the whole dataset to download. Instead, records are downloaded iteratively as we consume them.
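The idea behind streaming can be sketched with a plain Python generator: records are produced only when the consumer asks for them. `fake_stream` below is a hypothetical stand-in for a streamed dataset, not the `datasets` API itself.

```python
from itertools import islice
from typing import Dict, Iterator


def fake_stream() -> Iterator[Dict[str, str]]:
    """Hypothetical stand-in for a streamed dataset: yields records one by one."""
    n = 0
    while True:  # conceptually unbounded, like a very large remote dataset
        yield {"id": str(n), "title": f"Article_{n}"}
        n += 1


# take only the first three records; nothing beyond them is ever produced
first_three = list(islice(fake_stream(), 3))
print([r["id"] for r in first_three])
```

The same pattern is why `next(iter(wiki_data))` can return a record without the full dataset being present locally.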

In [24]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'id': '56bf8c8aa10cfb140055116f',
 'title': 'Beyoncé',
 'context': 'In July 2002, Beyoncé continued her acting career playing Foxxy Cleopatra alongside Mike Myers in the comedy film, Austin Powers in Goldmember, which spent its first weekend atop the US box office and grossed $73 million. Beyoncé released "Work It Out" as the lead single from its soundtrack album which entered the top ten in the UK, Norway, and Belgium. In 2003, Beyoncé starred opposite Cuba Gooding, Jr., in the musical comedy The Fighting Temptations as Lilly, a single mother whom Gooding\'s character falls in love with. The film received mixed reviews from critics but grossed $30 million in the U.S. Beyoncé released "Fighting Temptation" as the lead single from the film\'s soundtrack album, with Missy Elliott, MC Lyte, and Free which was also used to promote the film. Another of Beyoncé\'s contributions to the soundtrack, "Summertime", fared better on the US charts.',
 'question': "How did the critics view the movie

In [25]:
# keep only documents whose title contains "history" (case-insensitive)
history = wiki_data.filter(lambda x: "history" in x["title"].lower())

Let's iterate through the dataset and apply our filter to select up to 50,000 historical passages. From each document we keep the title as `article_title` and the context as `passage_text`, and we set `section_title` to "History".

In [26]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 50000

counter = 0
docs = []
# iterate through the dataset and apply our filter
for d in tqdm(history, total=total_doc_count):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article_title": d["title"],
        "section_title": "History",
        "passage_text": d["context"],
    })
    counter += 1
    # stop once we have collected the requested number of documents
    if counter == total_doc_count:
        break

  0%|          | 0/50000 [00:00<?, ?it/s]

In [27]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

|   | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | Military_history_of_the_United_States | History | Secretary of War Elihu Root (1899–1904) led th... |
| 1 | Military_history_of_the_United_States | History | By far the largest military action in which th... |
| 2 | Military_history_of_the_United_States | History | The loss of eight battleships and 2,403 Americ... |
| 3 | Military_history_of_the_United_States | History | The war started badly for the US and UN. North... |
| 4 | Military_history_of_the_United_States | History | By far the largest military action in which th... |
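SQuAD pairs several questions with the same context, so the streamed records can repeat a passage (rows 1 and 4 above appear to share the same text). If duplicate passages in the index are not wanted, they can be dropped before embedding. The sketch below uses plain Python on a small hypothetical sample; with pandas, `df.drop_duplicates(subset="passage_text")` should achieve the same result.

```python
from typing import Dict, List


def dedupe_passages(docs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first document for each distinct passage_text, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        text = doc["passage_text"]
        if text not in seen:
            seen.add(text)
            unique.append(doc)
    return unique


# hypothetical sample with one repeated passage
sample = [
    {"article_title": "A", "section_title": "History", "passage_text": "first passage"},
    {"article_title": "A", "section_title": "History", "passage_text": "second passage"},
    {"article_title": "A", "section_title": "History", "passage_text": "first passage"},
]
print(len(dedupe_passages(sample)))
```

Deduplicating also avoids spending `top_k` slots on identical passages at query time.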


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need a Pinecone API key. You can get one for free [here](https://app.pinecone.io/). We then initialize the connection as follows:

In [41]:
import os
from pinecone import Pinecone

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)

Now we set up our index specification, which lets us define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [29]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimension vectors.

In [30]:
index_name = "abstractive-question-answering" #give your index a meaningful name

In [51]:
import os
from google.colab import userdata

os.environ["PINECONE_API_KEY"] = userdata.get("PINECONE_API_KEY")

In [53]:
from pinecone import Pinecone

api_key = os.environ["PINECONE_API_KEY"]
pc = Pinecone(api_key=api_key)

In [55]:
from pinecone import Pinecone, ServerlessSpec

index_name = "abstractive-question-answering"

# create index if not exists
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,
        metric="cosine",
        spec=ServerlessSpec(
            cloud="aws",
            region="us-east-1"   # or use your own region
        )
    )
index = pc.Index(index_name)
print(index.describe_index_stats())


{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).

In [32]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    device=device
)

retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
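The final `Normalize()` module in the printout above means the retriever emits unit-length vectors. For unit vectors, cosine similarity is simply the dot product. The following sketch checks this on two small hypothetical vectors using only the standard library.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# hypothetical 2-d vectors standing in for 768-d embeddings
query_vec = [3.0, 4.0]
passage_vec = [4.0, 3.0]

cos = cosine_similarity(query_vec, passage_vec)
dot_unit = dot(normalize(query_vec), normalize(passage_vec))
print(round(cos, 6), round(dot_unit, 6))
```

Both values agree, which is why a cosine index and normalized embeddings fit together naturally.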

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.

In [56]:
# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = min(i + batch_size, len(df))

    # extract batch
    batch = df.iloc[i:end]

    # generate embeddings for batch (NumPy array, one 768-dim row per passage)
    embeds = retriever.encode(
        batch["passage_text"].tolist(),
        convert_to_numpy=True
    )

    # prepare Pinecone upsert format
    to_upsert = []
    for j, (idx, row) in enumerate(batch.iterrows()):
        vector_id = str(idx)
        metadata = {
            "article_title": row["article_title"],
            "section_title": row["section_title"],
            "passage_text": row["passage_text"]
        }
        to_upsert.append((vector_id, embeds[j].tolist(), metadata))

    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()


  0%|          | 0/24 [00:00<?, ?it/s]

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 1483}},
 'total_vector_count': 1483}
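The batching logic in the loop above can be isolated into two small helpers. This is only a sketch: `batched` and `to_records` are hypothetical names, and the `(id, values, metadata)` tuple layout mirrors what the upsert cell builds.

```python
from typing import Dict, Iterable, Iterator, List, Sequence, Tuple

Record = Tuple[str, List[float], Dict[str, str]]


def batched(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def to_records(
    ids: Iterable,
    vectors: Iterable[Sequence[float]],
    metadatas: Iterable[Dict[str, str]],
) -> List[Record]:
    """Zip ids, vectors and metadata into (id, values, metadata) tuples."""
    return [(str(i), list(v), dict(m)) for i, v, m in zip(ids, vectors, metadatas)]


# 1,483 passages in batches of 64 give 24 batches, the last one partial
batches = list(batched(range(1483), 64))
print(len(batches), len(batches[-1]))

records = to_records([0, 1], [[0.1, 0.2], [0.3, 0.4]],
                     [{"passage_text": "a"}, {"passage_text": "b"}])
print(records[0])
```

The batch count matches the `0/24` total shown by the progress bar for the 1,483 vectors in the index.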

# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token `<P>`, so the input string will look as follows:

>question: What is a sonic boom? context: `<P>` A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. `<P>` Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. `<P>` Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and details of how the ELI5 BART model was trained are available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [66]:
def query_pinecone(query, top_k):
    xq = retriever.encode(query, convert_to_numpy=True).tolist()
    res = index.query(
        vector=xq,
        top_k=top_k,
        include_metadata=True
    )
    return res  # full response dict
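Conceptually, a top-k vector query scores the query vector against stored vectors and returns the best matches. The brute-force sketch below illustrates that idea on a tiny in-memory dictionary. It is not how Pinecone implements search, and `brute_force_query` and `toy_index` are hypothetical names.

```python
import heapq
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


def brute_force_query(
    query_vec: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    top_k: int,
) -> List[Dict]:
    """Score every stored vector and return the top_k matches, best first."""
    scored = (
        {"id": vid, "score": cosine(query_vec, vec)}
        for vid, vec in vectors.items()
    )
    return heapq.nlargest(top_k, scored, key=lambda m: m["score"])


# hypothetical 2-d "index" standing in for the 768-d passage vectors
toy_index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
matches = brute_force_query([1.0, 0.1], toy_index, top_k=2)
print([m["id"] for m in matches])
```

The `id` and `score` keys mirror the fields found in each entry of the `matches` list returned by `index.query`.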



In [62]:
def format_query(query, context):
    # extract passage_text from Pinecone search result and add the <P> tag
    context = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the query and context passages
    query = f"question: {query} context: {context}"
    return query

Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages, and then pass them to the `format_query` function.

In [68]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '963',
              'metadata': {'article_title': 'Modern_history',
                           'passage_text': 'In the latter part of the second '
                                           'revolution, Thomas Alva Edison '
                                           'developed many devices that '
                                           'greatly influenced life around the '
                                           'world and is often credited with '
                                           'the creation of the first '
                                           'industrial research laboratory. In '
                                           '1882, Edison switched on the '
                                           "world's first large-scale "
                                           'electrical supply network that '
                                           'provided 110 volts direct current '
                                           'to fifty-nine custom

In [69]:
from pprint import pprint

In [70]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('question: when was the first electric power system built? context: <P> In '
 'the latter part of the second revolution, Thomas Alva Edison developed many '
 'devices that greatly influenced life around the world and is often credited '
 'with the creation of the first industrial research laboratory. In 1882, '
 "Edison switched on the world's first large-scale electrical supply network "
 'that provided 110 volts direct current to fifty-nine customers in lower '
 'Manhattan. Also toward the end of the second industrial revolution, Nikola '
 'Tesla made many contributions in the field of electricity and magnetism in '
 'the late 19th and early 20th centuries.')


The output looks great. Now let's write a function to generate answers.

In [71]:
from pprint import pprint

def generate_answer(query):
    # tokenize the query to get input_ids
    inputs = tokenizer(
        [query],
        max_length=1024,
        return_tensors="pt",
        truncation=True
    ).to(device)

    # use generator to predict output ids
    ids = generator.generate(
        inputs["input_ids"],
        num_beams=2,
        min_length=20,
        max_length=40
    )

    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(
        ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0]

    pprint(answer)      # show it
    return answer       # also return the string
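Because `generate_answer` truncates its input at `max_length=1024` tokens, a long list of passages may be cut off partway through. One rough way to keep whole passages is to apply a budget before formatting the query. The sketch below uses a word count as a crude proxy for tokens; `fit_passages` and the budget value are hypothetical.

```python
from typing import List


def fit_passages(passages: List[str], max_words: int) -> List[str]:
    """Keep whole passages, in ranked order, until the word budget is used up."""
    kept: List[str] = []
    used = 0
    for passage in passages:
        n_words = len(passage.split())
        if used + n_words > max_words:
            break
        kept.append(passage)
        used += n_words
    return kept


# hypothetical ranked passages, best match first
ranked = ["one two three", "four five", "six seven eight nine"]
print(fit_passages(ranked, max_words=5))
```

A real implementation would count tokens with the generator's tokenizer rather than words, since the two can differ noticeably.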


In [74]:
from transformers import T5Tokenizer, T5ForConditionalGeneration
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# load a FLAN-T5 generator; note that this overwrites the BART `tokenizer` and `generator` loaded earlier
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
generator = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base").to(device)


In [75]:
generate_answer(query)

("1882, Edison switched on the world's first large-scale electrical supply "
 'network that provided 110 volts direct current to fifty-nine customers in '
 'lower Manhattan')

As we can see, the generator used the provided context to answer our question. Let's run some more queries.

In [76]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('P> In the latter part of the second revolution, Thomas Alva Edison developed '
 'many devices that greatly influenced life around the world and is often '
 'credited with the creation of the first')



To check whether this answer is supported, we can inspect the contexts used to generate it.

In [77]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

In the latter part of the second revolution, Thomas Alva Edison developed many devices that greatly influenced life around the world and is often credited with the creation of the first industrial research laboratory. In 1882, Edison switched on the world's first large-scale electrical supply network that provided 110 volts direct current to fifty-nine customers in lower Manhattan. Also toward the end of the second industrial revolution, Nikola Tesla made many contributions in the field of electricity and magnetism in the late 19th and early 20th centuries.
---
In the latter part of the second revolution, Thomas Alva Edison developed many devices that greatly influenced life around the world and is often credited with the creation of the first industrial research laboratory. In 1882, Edison switched on the world's first large-scale electrical supply network that provided 110 volts direct current to fifty-nine customers in lower Manhattan. Also toward the end of the second industrial re

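In this case, the passages shown concern Edison's electrical supply network rather than wireless communication, and the generated text simply echoes the first passage instead of answering the question. When no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19: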

In [78]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('France and then, further propelled by the French Revolution of 1848, soon '
 'spread to the rest of Europe')



In [79]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

The European Revolutions of 1848, known in some countries as the Spring of Nations or the Year of Revolution, were a series of political upheavals throughout the European continent. Described as a revolutionary wave, the period of unrest began in France and then, further propelled by the French Revolution of 1848, soon spread to the rest of Europe. Although most of the revolutions were quickly put down, there was a significant amount of violence in many areas, with tens of thousands of people tortured and killed. While the immediate political effects of the revolutions were reversed, the long-term reverberations of the events were far-reaching.
---
The European Revolutions of 1848, known in some countries as the Spring of Nations or the Year of Revolution, were a series of political upheavals throughout the European continent. Described as a revolutionary wave, the period of unrest began in France and then, further propelled by the French Revolution of 1848, soon spread to the rest of 
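One way to reduce answers built on irrelevant context is to look at the similarity `score` that accompanies each match and decline to answer when nothing clears a threshold. This is a minimal sketch on hand-written match dictionaries; `select_relevant` and the `min_score` value of 0.5 are assumptions, and a suitable threshold would have to be tuned on real queries.

```python
from typing import Dict, List, Optional


def select_relevant(matches: List[Dict], min_score: float = 0.5) -> Optional[List[Dict]]:
    """Return matches scoring at least min_score, or None if there are none."""
    relevant = [m for m in matches if m["score"] >= min_score]
    return relevant or None


# hypothetical matches shaped like the entries of a query response
weak_matches = [
    {"id": "1", "score": 0.12, "metadata": {"passage_text": "unrelated text"}},
    {"id": "2", "score": 0.09, "metadata": {"passage_text": "more unrelated text"}},
]
strong_matches = [
    {"id": "3", "score": 0.81, "metadata": {"passage_text": "on-topic text"}},
    {"id": "4", "score": 0.20, "metadata": {"passage_text": "off-topic text"}},
]

print(select_relevant(weak_matches))
print([m["id"] for m in select_relevant(strong_matches)])
```

When the helper returns `None`, the caller can respond that no relevant passage was found instead of passing unrelated context to the generator.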

Let’s finish with a final few questions.

In [80]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The Balkan Wars were two wars in South-eastern Europe in 1912–1913 in the '
 'course of which the Balkan League (Bulgaria, Montene')



In [81]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('the oceans. P> Earth was initially molten due to extreme volcanism and '
 'frequent collisions with other bodies. Eventually, the outer layer of the '
 'planet cooled')



In [82]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('Sputnik 1 by the Soviet Union. P> The Space Age is a period encompassing the '
 'activities related to the Space Race, space exploration, space technology, '
 'and the cultural')



As we can see, the quality of the answers depends heavily on whether the index contains passages relevant to the question.

#### Add a few more questions

In [83]:
query = "Who invented the microwave?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('Otto Hahn and Fritz Strassmann discovered nuclear fission with radiochemical '
 'methods, and in 1939 Lise Meitner and Otto Robert Frisch wrote the first')



In [84]:
query = "Who is the smartest person in the world?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('Democritus. P> The earliest Greek philosophers, known as the pre-Socratics, '
 'provided competing answers to the question found in the myths of their '
 'neighbors:')



In [86]:
query = "How many kids do people get on average in South Africa?"
context = query_pinecone(query, top_k=4)
query = format_query(query, context["matches"])
generate_answer(query)

'6.6 million people in 1750, had reached 389 million by 1941. P>'

