# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading the API keys the notebook needs from a local `.env` file:

In [61]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [62]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas DataFrame, keep only the title, question, and context columns, and drop any duplicate context passages.

In [64]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

# select only the 'title', 'question', and 'context' columns
df = df[['title', 'question', 'context']]

# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=["context"])

df.head()  # Display the first few rows


Unnamed: 0,title,question,context
0,University_of_Notre_Dame,To whom did the Virgin Mary allegedly appear i...,"Architecturally, the school has a Catholic cha..."
5,University_of_Notre_Dame,When did the Scholastic Magazine of Notre dame...,"As at most other universities, Notre Dame's st..."
10,University_of_Notre_Dame,Where is the headquarters of the Congregation ...,The university is the major seat of the Congre...
15,University_of_Notre_Dame,How many BS level degrees are offered in the C...,The College of Engineering was established in ...
20,University_of_Notre_Dame,What entity provides help with the management ...,All of Notre Dame's undergraduate students are...


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve by similarity to another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [65]:
!pip install -qU langchain-pinecone pinecone-notebooks

In [66]:
from pinecone import Pinecone, ServerlessSpec

# Setup the spec (choose your region and cloud)
spec = ServerlessSpec(
    cloud="aws",
    region="us-east-1"
)

# Initialize Pinecone
pc = Pinecone(api_key=PINECONE_API_KEY)  # region and cloud are set in the ServerlessSpec

# Name your index
index_name = "question-answering"

# Check if index exists
if index_name not in pc.list_indexes().names():
    # Create the index with the spec
    pc.create_index(
        name=index_name,
        dimension=384,          # Dimension must match your embeddings
        metric="cosine",        # Matching your model (cosine similarity)
        spec=spec               # ✅ Required spec argument
    )


We can name the index anything we want: the cell above uses "question-answering", and the cell below uses "extractive-question-answering", which is the index the rest of the notebook connects to. We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.
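Because the index dimension has to match the embedding size, a small guard before upserting can catch a mismatch early. The following is a minimal sketch, not part of the Pinecone client: `check_dimension` and `INDEX_DIMENSION` are hypothetical names introduced here for illustration.

```python
# Hypothetical constant mirroring the dimension passed to create_index.
INDEX_DIMENSION = 384


def check_dimension(vectors, expected_dim=INDEX_DIMENSION):
    """Raise ValueError if any vector's length differs from the index dimension.

    `vectors` can be a list of lists or a 2-D array; returns the number of vectors.
    """
    for pos, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {pos} has dimension {len(vec)}, expected {expected_dim}"
            )
    return len(vectors)


# Two all-zero vectors of the expected size pass the check.
n_ok = check_dimension([[0.0] * 384, [0.0] * 384])
```

Calling this on each batch of embeddings before `index.upsert` would surface a wrong model or wrong index configuration as a clear error message.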

In [67]:
from pinecone import Pinecone, ServerlessSpec

# Define index name and spec
index_name = "extractive-question-answering"
spec = ServerlessSpec(cloud="aws", region="us-east-1")

# Initialize Pinecone client
pc = Pinecone(api_key=PINECONE_API_KEY)  # region and cloud are set in the ServerlessSpec

# Create the index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=384,  # Embedding size from 'multi-qa-MiniLM-L6-cos-v1'
        metric="cosine",
        spec=spec
    )

# Connect to the index
index = pc.Index(index_name)


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
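To make the idea of "nearby in the vector space" concrete, here is a minimal sketch of cosine similarity and top-k ranking in plain Python. The 3-dimensional vectors are made-up toy values, not real retriever output, and `cosine_similarity` and `top_k` are illustrative helpers rather than functions from any library used in this notebook.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k=1):
    """Return (passage_index, score) pairs for the k most similar passages."""
    scores = [cosine_similarity(query, p) for p in passages]
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order[:k]]


# Toy vectors: passage 0 points almost the same way as the query,
# passage 1 is orthogonal to it, and passage 2 points the opposite way.
query = [1.0, 0.0, 0.0]
passages = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]]
ranked = top_k(query, passages, k=2)
print(ranked)
```

The vector index performs the same kind of ranking, but over many vectors and with data structures built for fast search.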

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [68]:
import torch
from sentence_transformers import SentenceTransformer

# Set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# Load the retriever model from HuggingFace
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1')
retriever = retriever.to(device)

retriever


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.
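The shape of each upserted record can be sketched without any model or network access. In the example below, `fake_embed` is a hypothetical stand-in for the retriever and `build_upsert_batches` is an illustrative helper; only the (id, embedding, metadata) layout mirrors what this notebook sends to Pinecone.

```python
def build_upsert_batches(passages, titles, embed_fn, batch_size=64):
    """Yield lists of (id, vector, metadata) tuples, one list per batch."""
    for start in range(0, len(passages), batch_size):
        batch_passages = passages[start:start + batch_size]
        batch_titles = titles[start:start + batch_size]
        vectors = embed_fn(batch_passages)
        yield [
            # The id is built from the global position, so it stays unique across batches.
            (f"id-{start + j}", list(vec), {"title": title, "context": passage})
            for j, (vec, title, passage) in enumerate(
                zip(vectors, batch_titles, batch_passages)
            )
        ]


def fake_embed(texts):
    """Hypothetical embedder: returns a 2-d vector per text (not a real model)."""
    return [[float(len(t)), 0.0] for t in texts]


batches = list(
    build_upsert_batches(["a", "bb", "ccc"], ["t1", "t2", "t3"], fake_embed, batch_size=2)
)
print(batches)
```

With three passages and a batch size of 2, this yields one batch of two records and one batch of a single record.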

In [69]:
from tqdm.auto import tqdm

# We will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # Find end of batch
    end = i + batch_size

    # Extract batch of context passages
    batch = df.iloc[i:end]

    # Generate embeddings for the context column
    emb = retriever.encode(batch['context'].tolist(), device=device, show_progress_bar=False)

    # Prepare metadata (e.g., article title + full context)
    meta = [{'title': title, 'context': context} for title, context in zip(batch['title'], batch['context'])]

    # Create unique IDs for each item
    ids = [f"id-{i+j}" for j in range(len(batch))]

    # Combine into tuples for upserting
    to_upsert = list(zip(ids, emb, meta))

    # Upsert the batch to Pinecone
    index.upsert(vectors=to_upsert)

# Verify index status
index.describe_index_stats()


{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891,
 'vector_type': 'dense'}

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [70]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7f669c2b9590>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves from the Pinecone index the context passages most likely to contain the answer to our question, and the `extract_answer` function extracts the answers from these context passages.

In [71]:
def get_context(question, top_k=1):
    # Generate embedding for the question and convert to list
    xq = retriever.encode([question], device=device)[0].tolist()

    # Search Pinecone index for top_k most similar contexts
    xc = index.query(vector=xq, top_k=top_k, include_metadata=True)

    # Extract context passages from the metadata
    c = [match['metadata']['context'] for match in xc['matches']]

    return c



In [72]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # Feed the reader the question and context to extract answers
        answer = reader(question=question, context=c)
        # Add the context to the answer dict
        answer["context"] = c
        results.append(answer)

    # Sort results by confidence score in descending order
    sorted_results = sorted(results, key=lambda x: x['score'], reverse=True)

    # Print and return the sorted results
    pprint(sorted_results)
    return sorted_results


In [73]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.']

As we can see, the retriever is working fine and returns the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [74]:
extract_answer(question, context)

[{'answer': '691,000 bbl/d',
  'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of '
             'natural gas (in 2013), which makes Egypt as the largest oil '
             'producer not member of the Organization of the Petroleum '
             'Exporting Countries (OPEC) and the second-largest dry natural '
             'gas producer in Africa. In 2013, Egypt was the largest consumer '
             'of oil and natural gas in Africa, as more than 20% of total oil '
             'consumption and more than 40% of total dry natural gas '
             'consumption in Africa. Also, Egypt possesses the largest oil '
             'refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is '
             'currently planning to build its first nuclear power plant in El '
             'Dabaa city, northern Egypt.',
  'end': 33,
  'score': 0.9999852180480957,
  'start': 20}]


[{'score': 0.9999852180480957,
  'start': 20,
  'end': 33,
  'answer': '691,000 bbl/d',
  'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.'}]

The reader model extracted the correct answer, *691,000 bbl/d*, from the context passage with a confidence score above 0.99. Let's run a few more queries.

In [75]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Hurley and Chen',
  'context': 'According to a story that has often been repeated in the media, '
             'Hurley and Chen developed the idea for YouTube during the early '
             'months of 2005, after they had experienced difficulty sharing '
             "videos that had been shot at a dinner party at Chen's apartment "
             'in San Francisco. Karim did not attend the party and denied that '
             'it had occurred, but Chen commented that the idea that YouTube '
             'was founded after a dinner party "was probably very strengthened '
             'by marketing ideas around creating a story that was very '
             'digestible".',
  'end': 79,
  'score': 0.9999276399612427,
  'start': 64}]


[{'score': 0.9999276399612427,
  'start': 64,
  'end': 79,
  'answer': 'Hurley and Chen',
  'context': 'According to a story that has often been repeated in the media, Hurley and Chen developed the idea for YouTube during the early months of 2005, after they had experienced difficulty sharing videos that had been shot at a dinner party at Chen\'s apartment in San Francisco. Karim did not attend the party and denied that it had occurred, but Chen commented that the idea that YouTube was founded after a dinner party "was probably very strengthened by marketing ideas around creating a story that was very digestible".'}]

In [76]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 86,
  'score': 0.9500371217727661,
  'start': 29}]


[{'score': 0.9500371217727661,
  'start': 29,
  'end': 86,
  'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity and general relativity. He also made important contributions to statistical mechanics, especially his mathematical treatment of Brownian motion, his resolution of the paradox of specific heats, and his connection of fluctuations and dissipation. Despite his reservations about its interpretation, Einstein also made contributions to quantum mechanics and, indirectly, quantum field theory, primarily through his theoretical studies of the photon.'}]

Let's run another question, this time retrieving the top 3 context passages from the retriever.

In [77]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

[{'score': 0.9998037815093994,
  'start': 71,
  'end': 80,
  'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving orbit, Armstrong and Aldrin transferred into the Lunar Module, named Eagle, and after a landing gear inspection by Collins remaining in the Command/Service Module Columbia, began their descent. After overcoming several computer overload alarms caused by an antenna switch left in the wrong position, and a slight downrange error, Armstrong took over manual flight control at about 180 meters (590 ft), and guided the Lunar Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 (3:17:04 pm CDT). The first humans on the Moon would wait another six hours before they ventured out of their craft. At 02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the first human to set foot on the Moon.'},
 {'score': 0.695867121219635,
  'start': 240,
  'end': 246,
  'answer': 'Aldrin',
  'context': 'The first step was witnessed by at 

The result looks pretty good.

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?

In [78]:
question = "Who developed the theory of general relativity?"
context = get_context(question, top_k=1)
extract_answer(question, context)


[{'answer': 'Einstein',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 15,
  'score': 3.506648443840632e-11,
  'start': 7}]


[{'score': 3.506648443840632e-11,
  'start': 7,
  'end': 15,
  'answer': 'Einstein',
  'context': 'Albert Einstein is known for his theories of special relativity and general relativity. He also made important contributions to statistical mechanics, especially his mathematical treatment of Brownian motion, his resolution of the paradox of specific heats, and his connection of fluctuations and dissipation. Despite his reservations about its interpretation, Einstein also made contributions to quantum mechanics and, indirectly, quantum field theory, primarily through his theoretical studies of the photon.'}]

In [79]:
question = "What is the capital of France?"
context = get_context(question, top_k=1)
extract_answer(question, context)


[{'answer': 'Paris',
  'context': 'Most French rulers since the Middle Ages made a point of leaving '
             "their mark on a city that, contrary to many other of the world's "
             'capitals, has never been destroyed by catastrophe or war. In '
             'modernising its infrastructure through the centuries, Paris has '
             'preserved even its earliest history in its street map.[citation '
             'needed] At its origin, before the Middle Ages, the city was '
             'composed around several islands and sandbanks in a bend of the '
             'Seine; of those, two remain today: the île Saint-Louis, the île '
             'de la Cité; a third one is the 1827 artificially created île aux '
             'Cygnes. Modern Paris owes much to its late 19th century Second '
             "Empire remodelling by the Baron Haussmann: many of modern Paris' "
             'busiest streets, avenues and boulevards today are a result of '
             'that cit

[{'score': 3.590054087343475e-10,
  'start': 245,
  'end': 250,
  'answer': 'Paris',
  'context': 'Most French rulers since the Middle Ages made a point of leaving their mark on a city that, contrary to many other of the world\'s capitals, has never been destroyed by catastrophe or war. In modernising its infrastructure through the centuries, Paris has preserved even its earliest history in its street map.[citation needed] At its origin, before the Middle Ages, the city was composed around several islands and sandbanks in a bend of the Seine; of those, two remain today: the île Saint-Louis, the île de la Cité; a third one is the 1827 artificially created île aux Cygnes. Modern Paris owes much to its late 19th century Second Empire remodelling by the Baron Haussmann: many of modern Paris\' busiest streets, avenues and boulevards today are a result of that city renovation. Paris also owes its style to its aligned street-fronts, distinctive cream-grey "Paris stone" building ornamentat

In [80]:
question = "When did World War II end?"
context = get_context(question, top_k=1)
extract_answer(question, context)


[{'answer': 'Cold War',
  'context': 'The end of World War II set the stage for the East–West '
             'confrontation known as the Cold War. With the outbreak of the '
             'Korean War, concerns over the defense of Western Europe rose. '
             'Two corps, V and VII, were reactivated under Seventh United '
             'States Army in 1950 and American strength in Europe rose from '
             'one division to four. Hundreds of thousands of U.S. troops '
             'remained stationed in West Germany, with others in Belgium, the '
             'Netherlands and the United Kingdom, until the 1990s in '
             'anticipation of a possible Soviet attack.',
  'end': 91,
  'score': 0.0003886170161422342,
  'start': 83}]


[{'score': 0.0003886170161422342,
  'start': 83,
  'end': 91,
  'answer': 'Cold War',
  'context': 'The end of World War II set the stage for the East–West confrontation known as the Cold War. With the outbreak of the Korean War, concerns over the defense of Western Europe rose. Two corps, V and VII, were reactivated under Seventh United States Army in 1950 and American strength in Europe rose from one division to four. Hundreds of thousands of U.S. troops remained stationed in West Germany, with others in Belgium, the Netherlands and the United Kingdom, until the 1990s in anticipation of a possible Soviet attack.'}]

In [81]:
context = get_context(question, top_k=3)
extract_answer(question, context)


You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'answer': 'Cold War',
  'context': 'The end of World War II set the stage for the East–West '
             'confrontation known as the Cold War. With the outbreak of the '
             'Korean War, concerns over the defense of Western Europe rose. '
             'Two corps, V and VII, were reactivated under Seventh United '
             'States Army in 1950 and American strength in Europe rose from '
             'one division to four. Hundreds of thousands of U.S. troops '
             'remained stationed in West Germany, with others in Belgium, the '
             'Netherlands and the United Kingdom, until the 1990s in '
             'anticipation of a possible Soviet attack.',
  'end': 91,
  'score': 0.0003886170161422342,
  'start': 83},
 {'answer': '1918',
  'context': 'The First World War began in 1914 and lasted to the final '
             'Armistice in 1918. The Allied Powers, led by the British Empire, '
             'France, Russia until March 1918, Japan and the United St

[{'score': 0.0003886170161422342,
  'start': 83,
  'end': 91,
  'answer': 'Cold War',
  'context': 'The end of World War II set the stage for the East–West confrontation known as the Cold War. With the outbreak of the Korean War, concerns over the defense of Western Europe rose. Two corps, V and VII, were reactivated under Seventh United States Army in 1950 and American strength in Europe rose from one division to four. Hundreds of thousands of U.S. troops remained stationed in West Germany, with others in Belgium, the Netherlands and the United Kingdom, until the 1990s in anticipation of a possible Soviet attack.'},
 {'score': 6.4616085919400046e-12,
  'start': 71,
  'end': 75,
  'answer': '1918',
  'context': 'The First World War began in 1914 and lasted to the final Armistice in 1918. The Allied Powers, led by the British Empire, France, Russia until March 1918, Japan and the United States after 1917, defeated the Central Powers, led by the German Empire, Austro-Hungarian Empire a

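Observation ("Who developed the theory of general relativity?"):
The expected answer is "Albert Einstein". The retriever returned the right passage and the reader extracted "Einstein", but the score was close to zero (about 3.5e-11), so the reader was not confident in this span.


Observation ("What is the capital of France?"):
The expected answer is "Paris", and the reader did extract it, but with a score of about 3.6e-10. The retrieved passage describes the history of Paris, and the portion shown in the output does not state directly that Paris is the capital of France.


Observation ("When did World War II end?", top_k=1):
The expected answer is "1945" or "September 2, 1945". The reader returned "Cold War" with a score of about 0.0004. The retrieved passage mentions the end of World War II but gives no date, so there was no correct span to extract.


Observation ("When did World War II end?", top_k=3):
Widening the search did not help in this run. The top answer is still "Cold War", and the next result shown is "1918" from a passage about the First World War, with a score of about 6.5e-12. A larger top_k helps only when one of the additional passages actually contains the answer.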
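Since several of the outputs above carry scores close to zero, one option is to discard low-confidence predictions instead of reporting them. The sketch below operates on result dictionaries shaped like those returned by `extract_answer`. The helper names and the 0.5 threshold are assumptions made for illustration, not values recommended by the reader model or by the `transformers` library, and the scores are rounded from the outputs shown in this notebook.

```python
def filter_confident(results, min_score=0.5):
    """Keep only results whose reader score reaches the (assumed) threshold."""
    return [r for r in results if r["score"] >= min_score]


def best_answer(results, min_score=0.5):
    """Return the highest-scoring confident answer, or None if there is none."""
    confident = filter_confident(results, min_score)
    if not confident:
        return None
    return max(confident, key=lambda r: r["score"])["answer"]


# Scores rounded from the outputs shown in this notebook.
moon_results = [
    {"answer": "Armstrong", "score": 0.9998},
    {"answer": "Aldrin", "score": 0.6959},
]
ww2_results = [
    {"answer": "Cold War", "score": 0.00039},
    {"answer": "1918", "score": 6.5e-12},
]

print(best_answer(moon_results))
print(best_answer(ww2_results))
```

With this threshold the Moon question still returns an answer, while the World War II question returns `None`, which an application could report as "no answer found".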


