# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for the notebook to run:

In [3]:
#!pip install python-dotenv
#!pip install torch

In [4]:
import torch
import os
print(torch.cuda.is_available())

#nvcc --version

False


In [5]:
# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [6]:

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [7]:
# This combined install led to errors due to incompatible package versions.
#!pip install -qU datasets pinecone-client sentence-transformers 


# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas DataFrame, take a small random sample, keep only the title and context columns, and drop any duplicate context passages.

In [8]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

# Keep the full row count so we can report it after sampling
total_rows = len(df)

# Randomly sample 1% of the data (frac=0.01); random_state makes the sample reproducible
df = df.sample(frac=0.01, random_state=42)

# Print the size of the sample
print(f"Sampled {len(df)} rows out of {total_rows} total rows.")

Sampled 876 rows out of 87599 total rows.


In [9]:
df.columns

Index(['id', 'title', 'context', 'question', 'answers'], dtype='object')

In [10]:
# select only title and context column
df = df[['title', 'context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset='context')
df

Unnamed: 0,title,context
9983,Institute_of_technology,The world's first institution of technology or...
43267,Film_speed,The standard specifies how speed ratings shoul...
81021,Sumer,The most impressive and famous of Sumerian bui...
49374,"Ann_Arbor,_Michigan",Ann Arbor has a council-manager form of govern...
53414,John_von_Neumann,"Shortly before his death, when he was already ..."
...,...,...
28392,FC_Barcelona,"On 4 January 2016, Barcelona's transfer ban en..."
86329,Humanism,The humanists' close study of Latin literary t...
60397,Exhibition_game,Several MLB teams used to play regular exhibit...
50048,Edmund_Burke,"In 1744, Burke started at Trinity College Dubl..."


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [12]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone; the cloud and region are set through the ServerlessSpec above
pc = Pinecone(api_key=PINECONE_API_KEY)
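
The connection only works if the API key was actually loaded from the environment. As a minimal sketch, a small guard can fail early with a readable message when the key is missing. The helper name `require_env` is hypothetical and not part of Pinecone or python-dotenv.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise a clear error.

    `require_env` is a hypothetical helper for this notebook, not a library API.
    """
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or environment."
        )
    return value
```

It could be used in place of the plain lookup, for example `PINECONE_API_KEY = require_env("PINECONE_API_KEY")`.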

Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [13]:
#!pip install pinecone-client

In [15]:
index_name = "question-answering"

# check if the question-answering index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(name=index_name, dimension=384, metric="cosine", spec=spec)
# connect to the question-answering index
index = pc.Index(index_name)

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
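
To make the idea of "nearby in the vector space" concrete, here is a minimal sketch of cosine-similarity ranking in plain Python. The three-dimensional vectors and passage ids are toy values invented for illustration; the real retriever produces 384-dimensional embeddings, and Pinecone performs this search for us.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k=1):
    """Return the k (passage_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in passages.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (assumed values, not model output)
passages = {
    "moon": [0.9, 0.1, 0.0],
    "oil": [0.0, 0.2, 0.9],
    "film": [0.1, 0.9, 0.1],
}
query = [1.0, 0.2, 0.0]

for pid, score in top_k(query, passages, k=2):
    print(pid, round(score, 3))
```

The passage whose vector points in nearly the same direction as the query ranks first, regardless of vector length.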

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [16]:
#!pip install sentence-transformers
#print(torch.__version__)

In [17]:
from sentence_transformers import SentenceTransformer

# load the retriever model from huggingface model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1') #use the 'multi-qa-MiniLM-L6-cos-v1' model from HuggingFace to build the retriever
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.
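
The record layout described above can be sketched without any external service. This is an illustration under assumptions: `iter_batches`, `build_records`, and `fake_embed` are hypothetical names, and `fake_embed` merely stands in for the retriever so the example runs on its own.

```python
from uuid import uuid4


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def build_records(rows, embed):
    """Build (id, embedding, metadata) tuples for a batch of rows.

    `rows` is a list of dicts with 'title' and 'context' keys; `embed` is any
    callable that returns one vector (a list of floats) per input text.
    """
    vectors = embed([row["context"] for row in rows])
    return [
        (str(uuid4()), vector, {"title": row["title"], "context": row["context"]})
        for row, vector in zip(rows, vectors)
    ]


def fake_embed(texts):
    """Stand-in for the retriever: one-dimensional 'embedding' from text length."""
    return [[float(len(text))] for text in texts]


rows = [{"title": f"T{i}", "context": "x" * (i + 1)} for i in range(5)]
for batch in iter_batches(rows, batch_size=2):
    records = build_records(batch, fake_embed)
    print(len(records), "records ready to upsert")
```

Each tuple has the same shape as the records passed to `index.upsert` in the cell below.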

In [18]:
from tqdm.auto import tqdm
from uuid import uuid4

# Number of passages to embed and upsert per batch
batch_size = 64

# Iterate through the dataframe in batches
for i in tqdm(range(0, len(df), batch_size)):
    # Find the end of the batch
    end = min(i + batch_size, len(df))
    
    # Extract batch
    batch = df.iloc[i:end]
    
    # Generate embeddings for the context passages in the batch
    batch_contexts = batch['context'].tolist()
    embeddings = retriever.encode(batch_contexts, convert_to_tensor=False)  # returns a NumPy array

    # Get metadata for each document in the batch (e.g., title, context, etc.)
    metadata = [
        {
            'title': row['title'],
            'context': row['context']
        }
        for _, row in batch.iterrows()
    ]
    
    # Create unique IDs for each document
    ids = [str(uuid4()) for _ in range(len(batch))]
    
    # Prepare data to upsert into Pinecone (id, embedding, metadata); convert embeddings to plain lists
    to_upsert = list(zip(ids, embeddings.tolist(), metadata))
    
    # Upsert/insert these records into the Pinecone index
    _ = index.upsert(vectors=to_upsert)

# Check that all vectors are in the index
index.describe_index_stats()


  0%|          | 0/14 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 857}},
 'total_vector_count': 857}


# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.
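
Each prediction is a dictionary with `score`, `start`, `end`, and `answer` keys, as the outputs later in this notebook show. The sketch below illustrates how `start` and `end` index into the context string. The dictionary is built by hand for illustration (the score is a made-up value), not produced by the model.

```python
def span_text(context, result):
    """Return the slice of `context` that a reader prediction points to."""
    return context[result["start"]:result["end"]]


context = "At 02:56 UTC, July 21, Armstrong became the first human to set foot on the Moon."
answer = "Armstrong"
start = context.find(answer)

# Hand-built prediction in the reader's output shape (score is hypothetical)
result = {
    "score": 0.5,
    "start": start,
    "end": start + len(answer),
    "answer": answer,
}

print(span_text(context, result))
```

Because the reader is extractive, the answer is always a span copied from the context rather than newly generated text.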

In [35]:
from transformers import pipeline

# Define model for Reader
model_name = 'deepset/electra-base-squad2'

# Define device (-1 for CPU, 0 for the first GPU)
device = 0 if torch.cuda.is_available() else -1

# Load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)

reader

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x1dad8c4de80>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most similar to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [40]:
# Define the get_context function 
def get_context(question, top_k=1):
    # Generate embeddings for the question
    xq = retriever.encode(question, convert_to_tensor=True).tolist()

    # Ensure that 'xq' is a list of floats before querying Pinecone
    if isinstance(xq, list) and all(isinstance(i, float) for i in xq):
        # Query Pinecone for the context passages
        xc = index.query(vector=xq, top_k=top_k, include_metadata=True)
        
        # Extract the context passages from the search results
        contexts = [match["metadata"]["context"] for match in xc["matches"]]
        
        return contexts
    else:
        print("Error: The vector format is incorrect.")
        return None


In [None]:
from pprint import pprint

# Define the extract_answer function
def extract_answer(question, contexts):
    results = []
    
    # Loop through each context and get the answer using the reader model
    for context in contexts:
        # Feed the reader the question and context to extract the answer
        answer = reader(question=question, context=context)
        
        # Add the context to the answer dictionary for printing both together
        answer["context"] = context
        
        # Append the result to the list of answers
        results.append(answer)
    
    # Sort the results based on the score (confidence) from the reader model
    sorted_results = sorted(results, key=lambda x: x['score'], reverse=True)
    
    # Pretty print the sorted results
    pprint(sorted_results)
    
    return sorted_results

In [44]:
# Example query for the get_context function
question = "Why is Gandhi a bad person?"
context = get_context(question, top_k=5)

# Run the reader only if the retriever returned at least one context
if context:
    print("Context:", context)
    answer = extract_answer(question, context)
else:
    print("No context found.")

Context: ['\n India: Due to concerns about pro-Tibet protests, the relay through New Delhi on April 17 was cut to just 2.3 km (less than 1.5 miles), which was shared amongst 70 runners. It concluded at the India Gate. The event was peaceful due to the public not being allowed at the relay. A total of five intended torchbearers -Kiran Bedi, Soha Ali Khan, Sachin Tendulkar, Bhaichung Bhutia and Sunil Gavaskar- withdrew from the event, citing "personal reasons", or, in Bhutia\'s case, explicitly wishing to "stand by the people of Tibet and their struggle" and protest against the PRC "crackdown" in Tibet. Indian national football captain, Baichung Bhutia refused to take part in the Indian leg of the torch relay, citing concerns over Tibet. Bhutia, who is Sikkimese, is the first athlete to refuse to run with the torch. Indian film star Aamir Khan states on his personal blog that the "Olympic Games do not belong to China" and confirms taking part in the torch relay "with a prayer in his hear

In [36]:
# Retrieve the single closest context passage for a question
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k=1)
print("Context:", context)

Context: ['Shell was vertically integrated and is active in every area of the oil and gas industry, including exploration and production, refining, distribution and marketing, petrochemicals, power generation and trading. It has minor renewable energy activities in the form of biofuels and wind. It has operations in over 90 countries, produces around 3.1 million barrels of oil equivalent per day and has 44,000 service stations worldwide. Shell Oil Company, its subsidiary in the United States, is one of its largest businesses.']



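As we can see, the retriever returns the passage it judges closest to the question. With only a 1% sample of SQuAD indexed, that passage may not contain the answer, as with the Shell passage returned for the Egypt question. Now let's use the reader to extract answers from the retrieved passages.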

In [45]:
# Example query
question = "What are the first names of the men that invented YouTube?"
context = get_context(question, top_k=5)
extract_answer(question, context)

[{'answer': 'Carl',
  'context': 'In the late 1980s, according to "Richard Feynman and the '
             'Connection Machine", Feynman played a crucial role in developing '
             'the first massively parallel computer, and in finding innovative '
             'uses for it in numerical computations, in building neural '
             'networks, as well as physical simulations using cellular '
             'automata (such as turbulent fluid flow), working with Stephen '
             'Wolfram at Caltech. His son Carl also played a role in the '
             'development of the original Connection Machine engineering; '
             'Feynman influencing the interconnects while his son worked on '
             'the software.',
  'end': 396,
  'score': 3.1657223509284904e-09,
  'start': 392},
 {'answer': 'William Sawyer',
  'context': 'Albon Man, a New York lawyer, started Electro-Dynamic Light '
             'Company in 1878 to exploit his patents and those of William '
             

[{'score': 3.1657223509284904e-09,
  'start': 392,
  'end': 396,
  'answer': 'Carl',
  'context': 'In the late 1980s, according to "Richard Feynman and the Connection Machine", Feynman played a crucial role in developing the first massively parallel computer, and in finding innovative uses for it in numerical computations, in building neural networks, as well as physical simulations using cellular automata (such as turbulent fluid flow), working with Stephen Wolfram at Caltech. His son Carl also played a role in the development of the original Connection Machine engineering; Feynman influencing the interconnects while his son worked on the software.'},
 {'score': 1.0191442828544339e-10,
  'start': 112,
  'end': 126,
  'answer': 'William Sawyer',
  'context': "Albon Man, a New York lawyer, started Electro-Dynamic Light Company in 1878 to exploit his patents and those of William Sawyer. Weeks later the United States Electric Lighting Company was organized. This company didn't made their 

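Here the reader did not find a real answer: the top prediction, *Carl*, has a score of about 3.2e-09, and the passages shown are unrelated to YouTube. This is likely because only a 1% sample of SQuAD is indexed, so the passage containing the answer may be missing. Let's run a few more queries.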
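
Since the reader attaches a score to every prediction, one simple option is to discard predictions below a cutoff. The sketch below uses the two scores from the output above; the helper name `confident_answers` and the default cutoff of 0.5 are assumptions chosen for illustration, not recommended values.

```python
def confident_answers(results, min_score=0.5):
    """Keep only reader predictions whose score is at least `min_score`.

    The 0.5 default is an arbitrary illustrative cutoff.
    """
    return [r for r in results if r["score"] >= min_score]


# Scores copied from the reader output above
results = [
    {"answer": "Carl", "score": 3.1657223509284904e-09},
    {"answer": "William Sawyer", "score": 1.0191442828544339e-10},
]

kept = confident_answers(results)
if kept:
    print("Best answer:", kept[0]["answer"])
else:
    print("No confident answer found.")
```

With these scores nothing passes the cutoff, which matches the observation that neither prediction answers the question.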

In [46]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'reformer of Estonian',
  'context': 'The most famous reformer of Estonian, Johannes Aavik '
             '(1880–1973), used creations ex nihilo (cf. ‘free constructions’, '
             'Tauli 1977), along with other sources of lexical enrichment such '
             'as derivations, compositions and loanwords (often from Finnish; '
             'cf. Saareste and Raun 1965: 76). In Aavik’s dictionary (1921), '
             'which lists approximately 4000 words, there are many words which '
             'were (allegedly) created ex nihilo, many of which are in common '
             'use today. Examples are',
  'end': 36,
  'score': 0.004661196377128363,
  'start': 16}]


[{'score': 0.004661196377128363,
  'start': 16,
  'end': 36,
  'answer': 'reformer of Estonian',
  'context': 'The most famous reformer of Estonian, Johannes Aavik (1880–1973), used creations ex nihilo (cf. ‘free constructions’, Tauli 1977), along with other sources of lexical enrichment such as derivations, compositions and loanwords (often from Finnish; cf. Saareste and Raun 1965: 76). In Aavik’s dictionary (1921), which lists approximately 4000 words, there are many words which were (allegedly) created ex nihilo, many of which are in common use today. Examples are'}]

Let's try another question, this time retrieving the top 3 context passages from the retriever.

In [15]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

The result looks pretty good.

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?