# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

In [1]:
!pip install python-dotenv

Collecting python-dotenv
  Using cached python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Using cached python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1


Let's get started by installing the packages needed for notebook to run:

In [8]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [3]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQUAD dataset from the HuggingFace Model Hub. We load the dataset into a pandas dataframe and filter the title, question, and context columns, and we drop any duplicate context passages.

In [4]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

  from .autonotebook import tqdm as notebook_tqdm
Generating train split: 100%|██████████| 87599/87599 [00:00<00:00, 1433368.72 examples/s]
Generating validation split: 100%|██████████| 10570/10570 [00:00<00:00, 1538566.49 examples/s]


In [7]:
df.head()
PINECONE_API_KEY

In [5]:
# select only title and context column
df = df[['title','context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=['context'])
df = df.head(200)

# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages which we can retrieve using another vector (query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key]("https://app.pinecone.io/"), and then we initialize the connection like so:

Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [10]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to pinecone environment
pc = Pinecone(
    api_key = PINECONE_API_KEY,
    environment='us-east-1'  # find next to API key in console
)

index_name = "extractive-question-answering"

# check if the index_name exists
if index_name not in pc.list_indexes():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,  # dimensionality of text-embedding-ada-002
        metric='cosine',
        spec=spec
    )

# connect to index we created
index = pc.Index(index_name)

PineconeApiException: (409)
Reason: Conflict
HTTP response headers: HTTPHeaderDict({'content-type': 'text/plain; charset=utf-8', 'access-control-allow-origin': '*', 'vary': 'origin,access-control-request-method,access-control-request-headers', 'access-control-expose-headers': '*', 'x-pinecone-api-version': '2024-07', 'X-Cloud-Trace-Context': '25cf94ade47d10dc1db4a33e28756812', 'Date': 'Tue, 22 Oct 2024 13:31:08 GMT', 'Server': 'Google Frontend', 'Content-Length': '85', 'Via': '1.1 google', 'Alt-Svc': 'h3=":443"; ma=2592000,h3-29=":443"; ma=2592000'})
HTTP response body: {"error":{"code":"ALREADY_EXISTS","message":"Resource  already exists"},"status":409}


In [17]:
index = pc.Index(index_name)

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [15]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
if torch.backends.mps.is_available():
    device = 'mps'  # Use Metal Performance Shaders (MPS) for Apple Silicon
elif torch.cuda.is_available():
    device = 'cuda'  # Use CUDA for NVIDIA GPUs (not applicable on Mac)
else:
    device = 'cpu'   # Use CPU
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [12]:
device

'mps'

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.

In [20]:
from tqdm.auto import tqdm
import unicodedata

# we will use batches of 64
batch_size = 64

# Function to remove or replace non-ASCII characters
def normalize_string(text):
    return ''.join(c for c in unicodedata.normalize('NFKD', text) if ord(c) < 128)


for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end_idx = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:end_idx]
    # generate embeddings for batch
    texts = batch['context'].tolist()
    emb = retriever.encode(texts)
    # get metadata
    meta = [{'title': row['title'], 'context': row['context']} for _, row in batch.iterrows()]
    # create unique IDs
    ids = [normalize_string(str(row['title'])) for _, row in batch.iterrows()]
    # add all to upsert list
    to_upsert = [(ids[j], emb[j], meta[j]) for j in range(len(emb))]
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

100%|██████████| 4/4 [00:04<00:00,  1.20s/it]


{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [21]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x32cf25520>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context embeddings containing answers to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [65]:
# gets context passages from the pinecone index
def get_context(question, top_k=5):
    # generate embeddings for the question
    xq = retriever.encode([question]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(  # Perform the query on the Pinecone index
        vector=xq[0],  # The question's embedding
        top_k=top_k,   # How many top results to return
        include_metadata=True  # Include the metadata (e.g., context)
    )
    #print(xc)
    context = [match['metadata']['context'] for match in xc['matches']]
    return context

In [73]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the result based on the score from reader model
    print('RESULTS:')
    sorted_result = pprint(sorted(results, key=lambda x: x['score'], reverse=True))
    #pprint(sorted_result)
    return sorted_result

In [70]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['When the U.S. entered World War II on December 8, 1941, many Montanans already had enlisted in the military to escape the poor national economy of the previous decade. Another 40,000-plus Montanans entered the armed forces in the first year following the declaration of war, and over 57,000 joined up before the war ended. These numbers constituted about 10 percent of the state\'s total population, and Montana again contributed one of the highest numbers of soldiers per capita of any state. Many Native Americans were among those who served, including soldiers from the Crow Nation who became Code Talkers. At least 1500 Montanans died in the war. Montana also was the training ground for the First Special Service Force or "Devil\'s Brigade," a joint U.S-Canadian commando-style force that trained at Fort William Henry Harrison for experience in mountainous and winter conditions before deployment. Air bases were built in Great Falls, Lewistown, Cut Bank and Glasgow, some of which were used 

As we can see, the retiever is working fine and gets us the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [71]:
extract_answer(question, context)

c When the U.S. entered World War II on December 8, 1941, many Montanans already had enlisted in the military to escape the poor national economy of the previous decade. Another 40,000-plus Montanans entered the armed forces in the first year following the declaration of war, and over 57,000 joined up before the war ended. These numbers constituted about 10 percent of the state's total population, and Montana again contributed one of the highest numbers of soldiers per capita of any state. Many Native Americans were among those who served, including soldiers from the Crow Nation who became Code Talkers. At least 1500 Montanans died in the war. Montana also was the training ground for the First Special Service Force or "Devil's Brigade," a joint U.S-Canadian commando-style force that trained at Fort William Henry Harrison for experience in mountainous and winter conditions before deployment. Air bases were built in Great Falls, Lewistown, Cut Bank and Glasgow, some of which were used as

The reader model predicted with 99% accuracy the correct answer *691,000 bbl/d* as seen from the context passage. Let's run few more queries.

In [72]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

c In December, Beyoncé along with a variety of other celebrities teamed up and produced a video campaign for "Demand A Plan", a bipartisan effort by a group of 950 US mayors and others designed to influence the federal government into rethinking its gun control laws, following the Sandy Hook Elementary School shooting. Beyoncé became an ambassador for the 2012 World Humanitarian Day campaign donating her song "I Was Here" and its music video, shot in the UN, to the campaign. In 2013, it was announced that Beyoncé would work with Salma Hayek and Frida Giannini on a Gucci "Chime for Change" campaign that aims to spread female empowerment. The campaign, which aired on February 28, was set to her new music. A concert for the cause took place on June 1, 2013 in London and included other acts like Ellie Goulding, Florence and the Machine, and Rita Ora. In advance of the concert, she appeared in a campaign video released on 15 May 2013, where she, along with Cameron Diaz, John Legend and Kyli

In [37]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'id': 'University_of_Notre_Dame',
 'metadata': {'context': 'Notre Dame alumni work in various fields. Alumni '
                         'working in political fields include state governors, '
                         'members of the United States Congress, and former '
                         'United States Secretary of State Condoleezza Rice. A '
                         'notable alumnus of the College of Science is '
                         'Medicine Nobel Prize winner Eric F. Wieschaus. A '
                         'number of university heads are alumni, including '
                         "Notre Dame's current president, the Rev. John "
                         'Jenkins. Additionally, many alumni are in the media, '
                         'including talk show hosts Regis Philbin and Phil '
                         'Donahue, and television and radio personalities such '
                         'as Mike Golic and Hannah Storm. With the university '
                         'h

Let's run another question. This time for top 3 context passages from the retriever.

In [41]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'id': 'Montana',
 'metadata': {'context': 'When the U.S. entered World War II on December 8, '
                         '1941, many Montanans already had enlisted in the '
                         'military to escape the poor national economy of the '
                         'previous decade. Another 40,000-plus Montanans '
                         'entered the armed forces in the first year following '
                         'the declaration of war, and over 57,000 joined up '
                         'before the war ended. These numbers constituted '
                         "about 10 percent of the state's total population, "
                         'and Montana again contributed one of the highest '
                         'numbers of soldiers per capita of any state. Many '
                         'Native Americans were among those who served, '
                         'including soldiers from the Crow Nation who became '
                         'Code Talkers. At least 1

The result looks pretty good.

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?

In [50]:
question = "How many Montanans passed away?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'id': 'Montana',
 'metadata': {'context': 'When the U.S. entered World War II on December 8, '
                         '1941, many Montanans already had enlisted in the '
                         'military to escape the poor national economy of the '
                         'previous decade. Another 40,000-plus Montanans '
                         'entered the armed forces in the first year following '
                         'the declaration of war, and over 57,000 joined up '
                         'before the war ended. These numbers constituted '
                         "about 10 percent of the state's total population, "
                         'and Montana again contributed one of the highest '
                         'numbers of soldiers per capita of any state. Many '
                         'Native Americans were among those who served, '
                         'including soldiers from the Crow Nation who became '
                         'Code Talkers. At least 1

In [61]:
question = "Whom did Beyonce work with in 2013?"
context = get_context(question, top_k=3)
#extract_answer(question, context)
print('CONTEXT:',context)

[{'id': 'Beyonce',
 'metadata': {'context': 'In December, Beyoncé along with a variety of other '
                         'celebrities teamed up and produced a video campaign '
                         'for "Demand A Plan", a bipartisan effort by a group '
                         'of 950 US mayors and others designed to influence '
                         'the federal government into rethinking its gun '
                         'control laws, following the Sandy Hook Elementary '
                         'School shooting. Beyoncé became an ambassador for '
                         'the 2012 World Humanitarian Day campaign donating '
                         'her song "I Was Here" and its music video, shot in '
                         'the UN, to the campaign. In 2013, it was announced '
                         'that Beyoncé would work with Salma Hayek and Frida '
                         'Giannini on a Gucci "Chime for Change" campaign that '
                         'aims to spr

In [77]:
question = "Who is Tina Knowles"
context = get_context(question, top_k=3)
extract_answer(question, context)
#print('CONTEXT:',context)

RESULTS:
[{'answer': 'her mother',
  'context': 'In December, Beyoncé along with a variety of other celebrities '
             'teamed up and produced a video campaign for "Demand A Plan", a '
             'bipartisan effort by a group of 950 US mayors and others '
             'designed to influence the federal government into rethinking its '
             'gun control laws, following the Sandy Hook Elementary School '
             'shooting. Beyoncé became an ambassador for the 2012 World '
             'Humanitarian Day campaign donating her song "I Was Here" and its '
             'music video, shot in the UN, to the campaign. In 2013, it was '
             'announced that Beyoncé would work with Salma Hayek and Frida '
             'Giannini on a Gucci "Chime for Change" campaign that aims to '
             'spread female empowerment. The campaign, which aired on February '
             '28, was set to her new music. A concert for the cause took place '
             'on June 1, 20

In [79]:
question = "Who is Beyonce?"
context = get_context(question, top_k=3)
extract_answer(question, context)
#print('CONTEXT:',context)

RESULTS:
[{'answer': '2012 World Humanitarian Day campaign',
  'context': 'In December, Beyoncé along with a variety of other celebrities '
             'teamed up and produced a video campaign for "Demand A Plan", a '
             'bipartisan effort by a group of 950 US mayors and others '
             'designed to influence the federal government into rethinking its '
             'gun control laws, following the Sandy Hook Elementary School '
             'shooting. Beyoncé became an ambassador for the 2012 World '
             'Humanitarian Day campaign donating her song "I Was Here" and its '
             'music video, shot in the UN, to the campaign. In 2013, it was '
             'announced that Beyoncé would work with Salma Hayek and Frida '
             'Giannini on a Gucci "Chime for Change" campaign that aims to '
             'spread female empowerment. The campaign, which aired on February '
             '28, was set to her new music. A concert for the cause took place '
 