# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by loading the API keys and installing the packages the notebook needs:

In [3]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')
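
If a key is missing from the environment, `os.getenv` returns `None`, and the problem may only surface later when a client is created. A minimal sketch of an early check follows; `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration and are not part of the notebook.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value


# Placeholder value set only so that this demonstration runs on its own
os.environ["DEMO_API_KEY"] = "demo-value"
print(require_env("DEMO_API_KEY"))
```

In the notebook, the same helper could wrap the lookups above, for example `require_env('PINECONE_API_KEY')`.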

# Install Dependencies

In [22]:
#!pip uninstall -y fsspec gcsfs

Found existing installation: fsspec 2024.10.0
Uninstalling fsspec-2024.10.0:
  Successfully uninstalled fsspec-2024.10.0
Found existing installation: gcsfs 2024.10.0
Uninstalling gcsfs-2024.10.0:
  Successfully uninstalled gcsfs-2024.10.0


In [25]:
#!pip install -qU gcsfs==2024.10.0

In [24]:
#!pip install -qU fsspec==2024.10.0

In [18]:
#!pip install -qU datasets pinecone-client sentence-transformers torch --use-deprecated=legacy-resolver

In [17]:
#!pip install -qU datasets pinecone-client sentence-transformers torch

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.

In [26]:
#!pip show fsspec gcsfs datasets pinecone-client sentence-transformers torch

Name: fsspec
Version: 2024.10.0
Summary: File-system specification
Home-page: https://github.com/fsspec/filesystem_spec
Author: 
Author-email: 
License: BSD 3-Clause License

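
The conflict reported by pip arises because `gcsfs 2024.10.0` requires `fsspec==2024.10.0`. As a lighter alternative to `pip show`, the installed versions can be read from Python with the standard-library `importlib.metadata` module. This is a sketch; `installed_version` is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions of the two packages involved in the conflict
for name in ("fsspec", "gcsfs"):
    print(name, installed_version(name))
```

If the two versions printed differ, reinstalling the matching pair resolves the conflict pip reported.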

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas DataFrame, keep only the title and context columns, and drop any duplicate context passages.

In [27]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

README.md:   0%|          | 0.00/7.62k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/14.5M [00:00<?, ?B/s]

validation-00000-of-00001.parquet:   0%|          | 0.00/1.82M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/87599 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/10570 [00:00<?, ? examples/s]

In [28]:
# select only the title and context columns
df = df[['title', 'context']]

# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=['context'])

# Display the updated DataFrame
df

Unnamed: 0,title,context
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
5,University_of_Notre_Dame,"As at most other universities, Notre Dame's st..."
10,University_of_Notre_Dame,The university is the major seat of the Congre...
15,University_of_Notre_Dame,The College of Engineering was established in ...
20,University_of_Notre_Dame,All of Notre Dame's undergraduate students are...
...,...,...
87574,Kathmandu,"Institute of Medicine, the central college of ..."
87579,Kathmandu,Football and Cricket are the most popular spor...
87584,Kathmandu,The total length of roads in Nepal is recorded...
87589,Kathmandu,The main international airport serving Kathman...
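
Note that the row labels in the output above (0, 5, 10, ...) are no longer contiguous: `drop_duplicates` keeps the first row for each distinct context and preserves that row's original label. The following plain-Python sketch mimics this keep-first behaviour on made-up rows (not the real dataset); `drop_duplicate_contexts` is a hypothetical stand-in for the pandas call.

```python
# Made-up rows: several questions can share the same context passage
rows = [
    {"title": "A", "context": "passage one"},
    {"title": "A", "context": "passage one"},
    {"title": "A", "context": "passage two"},
    {"title": "B", "context": "passage two"},
    {"title": "B", "context": "passage three"},
]


def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct context, with its original position."""
    seen = set()
    kept = []
    for label, row in enumerate(rows):
        if row["context"] not in seen:
            seen.add(row["context"])
            kept.append((label, row))
    return kept


kept = drop_duplicate_contexts(rows)
print([label for label, _ in kept])  # [0, 2, 4]
```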


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [2]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone using the API key loaded from the environment
# (the cloud and region are already set in the ServerlessSpec above)
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we create a new index called "question-answering"; we can name the index anything we want. We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors.

In [37]:
index_name = "question-answering"

# check if the question-answering index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=384,    # must match the retriever's embedding size
        metric="cosine",  # the retriever is optimized for cosine similarity
        spec=spec
    )

# connect to the question-answering index we created
index = pc.Index(index_name)
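
A newly created index may take a moment before it accepts requests. Below is a minimal sketch of a polling helper, assuming the client exposes `describe_index(name).status['ready']` as in recent versions of the Pinecone Python client. `wait_until_ready` and its `timeout_s` and `poll_s` parameters are hypothetical names introduced here.

```python
import time


def wait_until_ready(pc, index_name, timeout_s=60.0, poll_s=1.0):
    """Poll the index status until it reports ready or the timeout expires.

    `pc` is assumed to be a client object offering describe_index(name).status['ready'].
    Returns True if the index became ready, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if pc.describe_index(index_name).status["ready"]:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

In the notebook this could be called as `wait_until_ready(pc, index_name)` right after `pc.create_index(...)`.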

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
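
To make the similarity measure concrete, here is a small plain-Python sketch of cosine similarity on toy 3-dimensional vectors (the real embeddings have 384 dimensions). The vectors and helper names are illustrative only. For unit-length vectors, cosine similarity reduces to a dot product.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


query = [1.0, 2.0, 0.0]
passage_a = [2.0, 4.0, 0.0]  # points in the same direction as the query
passage_b = [0.0, 0.0, 3.0]  # orthogonal to the query

print(cosine_similarity(query, passage_a))  # close to 1.0
print(cosine_similarity(query, passage_b))  # 0.0

# After normalization, the plain dot product gives the same value
dot_unit = sum(x * y for x, y in zip(normalize(query), normalize(passage_a)))
print(dot_unit)  # close to 1.0
```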

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [38]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the retriever model from huggingface model hub
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device) #use the 'multi-qa-MiniLM-L6-cos-v1' model from HuggingFace to build the retriever
retriever

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.


0it [00:00, ?it/s]

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/11.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/612 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/90.9M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches, which speeds up both embedding generation and the upload to the Pinecone index. For each context passage, we pass Pinecone an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to our embeddings, such as the article title and the context passage.

In [39]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:end]
    # generate embeddings for batch
    emb = retriever.encode(batch['context'].tolist(), convert_to_tensor=True).cpu().numpy()
    # get metadata
    meta = [{'title': title, 'context': context} for title, context in zip(batch['title'], batch['context'])]
    # create unique IDs
    ids = [f"{i + idx}" for idx in range(len(batch))]
    # add all to upsert list
    to_upsert = [(ids[j], emb[j], meta[j]) for j in range(len(batch))]
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index_stats = index.describe_index_stats()

print(index_stats)


  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 19584}},
 'total_vector_count': 19584}
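
The batching logic in the loop above can be tried without a model or an index. The sketch below reproduces the slicing and the `(id, vector, metadata)` record layout on toy data. `build_batches` and `fake_embed` are hypothetical stand-ins, and `fake_embed` does not produce real embeddings.

```python
def build_batches(titles, contexts, embed, batch_size=64):
    """Yield lists of (id, vector, metadata) tuples, one list per batch."""
    for start in range(0, len(contexts), batch_size):
        end = min(start + batch_size, len(contexts))
        batch_contexts = contexts[start:end]
        vectors = embed(batch_contexts)
        yield [
            (
                str(start + offset),  # positional id, as in the loop above
                vectors[offset],
                {"title": titles[start + offset], "context": batch_contexts[offset]},
            )
            for offset in range(len(batch_contexts))
        ]


def fake_embed(texts):
    """Stand-in for retriever.encode: returns a tiny 2-d vector per text."""
    return [[float(len(t)), 1.0] for t in texts]


titles = ["T"] * 5
contexts = [f"passage {n}" for n in range(5)]
batches = list(build_batches(titles, contexts, fake_embed, batch_size=2))

print([len(b) for b in batches])               # [2, 2, 1]
print([rec[0] for b in batches for rec in b])  # ['0', '1', '2', '3', '4']
```

Because the ids are positions within the deduplicated DataFrame, re-running the loop overwrites the same records instead of creating new ones.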


# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [40]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2' # Define the model name for the reader

# load the reader model into a question-answering pipeline
reader = pipeline(
    tokenizer=model_name,
    model=model_name,
    task='question-answering',
    device=device)

reader

config.json:   0%|          | 0.00/635 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/436M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/200 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7e75b7a3f5d0>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves from the Pinecone index the context passages most likely to contain answers to our question, and the `extract_answer` function extracts the answers from these context passages.

In [47]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode([question], convert_to_tensor=True).cpu().numpy().tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(vector=xq[0], top_k=top_k, include_metadata=True)
    # extract the context passage from pinecone search result
    c = [match['metadata']['context'] for match in xc['matches']]
    return c
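
Each match in the query response also carries a similarity `score`, which `get_context` discards. Below is a sketch of a variant that keeps the score and optionally filters on it. `contexts_with_scores`, its `min_score` parameter, and `sample_response` are hypothetical, and the sample is a hand-written stand-in for a real Pinecone response.

```python
def contexts_with_scores(response, min_score=0.0):
    """Return (score, context) pairs for matches at or above min_score."""
    return [
        (match["score"], match["metadata"]["context"])
        for match in response["matches"]
        if match["score"] >= min_score
    ]


# Hand-written stand-in with the same shape as the fields used in get_context
sample_response = {
    "matches": [
        {"id": "0", "score": 0.82, "metadata": {"title": "T", "context": "first passage"}},
        {"id": "1", "score": 0.41, "metadata": {"title": "T", "context": "second passage"}},
    ]
}

print(contexts_with_scores(sample_response, min_score=0.5))
# [(0.82, 'first passage')]
```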

In [48]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader's confidence score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # pprint returns None, so print first and return the list separately
    pprint(sorted_result)
    return sorted_result
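
When several contexts are passed to the reader, it can be useful to keep only the highest-scoring prediction and to discard it if the reader is not confident. Here is a minimal sketch on hand-written predictions; `best_answer` and the `min_score` threshold of 0.5 are assumptions for illustration, not values from the notebook.

```python
def best_answer(results, min_score=0.5):
    """Return the highest-scoring prediction, or None if it falls below min_score."""
    if not results:
        return None
    top = max(results, key=lambda r: r["score"])
    return top if top["score"] >= min_score else None


# Hand-written predictions with the same keys the reader pipeline returns
predictions = [
    {"answer": "first candidate", "score": 0.31, "start": 0, "end": 15},
    {"answer": "second candidate", "score": 0.77, "start": 4, "end": 20},
]

print(best_answer(predictions)["answer"])      # second candidate
print(best_answer(predictions, min_score=0.9)) # None
```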

In [49]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.']

As we can see, the retriever is working fine and gets us the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [50]:
extract_answer(question, context)

[{'answer': '691,000 bbl/d',
  'context': 'Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of '
             'natural gas (in 2013), which makes Egypt as the largest oil '
             'producer not member of the Organization of the Petroleum '
             'Exporting Countries (OPEC) and the second-largest dry natural '
             'gas producer in Africa. In 2013, Egypt was the largest consumer '
             'of oil and natural gas in Africa, as more than 20% of total oil '
             'consumption and more than 40% of total dry natural gas '
             'consumption in Africa. Also, Egypt possesses the largest oil '
             'refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is '
             'currently planning to build its first nuclear power plant in El '
             'Dabaa city, northern Egypt.',
  'end': 33,
  'score': 0.9999852180480957,
  'start': 20}]


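The reader model extracted the correct answer, *691,000 bbl/d*, from the context passage with a confidence score above 0.99 (the score is the model's confidence in the extracted span, not an accuracy measurement). Let's run a few more queries.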
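
The `start` and `end` fields in the reader output are character offsets into the context passage, so the answer can be recovered by slicing the context. A small check using the values from the output above, with the context shortened to its opening words:

```python
# Opening of the context passage returned above (shortened for brevity)
context = "Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013)"

# Fields copied from the reader output above
prediction = {"answer": "691,000 bbl/d", "start": 20, "end": 33}

# Slice the context with the character offsets
span = context[prediction["start"]:prediction["end"]]
print(span)  # 691,000 bbl/d
```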

In [51]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Hurley and Chen',
  'context': 'According to a story that has often been repeated in the media, '
             'Hurley and Chen developed the idea for YouTube during the early '
             'months of 2005, after they had experienced difficulty sharing '
             "videos that had been shot at a dinner party at Chen's apartment "
             'in San Francisco. Karim did not attend the party and denied that '
             'it had occurred, but Chen commented that the idea that YouTube '
             'was founded after a dinner party "was probably very strengthened '
             'by marketing ideas around creating a story that was very '
             'digestible".',
  'end': 79,
  'score': 0.9999276399612427,
  'start': 64}]


In [52]:
question = "What is Albert Eistein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'his theories of special relativity and general relativity',
  'context': 'Albert Einstein is known for his theories of special relativity '
             'and general relativity. He also made important contributions to '
             'statistical mechanics, especially his mathematical treatment of '
             'Brownian motion, his resolution of the paradox of specific '
             'heats, and his connection of fluctuations and dissipation. '
             'Despite his reservations about its interpretation, Einstein also '
             'made contributions to quantum mechanics and, indirectly, quantum '
             'field theory, primarily through his theoretical studies of the '
             'photon.',
  'end': 86,
  'score': 0.9500371217727661,
  'start': 29}]


Let's run another question, this time retrieving the top 3 context passages from the retriever.

In [53]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Armstrong',
  'context': 'The trip to the Moon took just over three days. After achieving '
             'orbit, Armstrong and Aldrin transferred into the Lunar Module, '
             'named Eagle, and after a landing gear inspection by Collins '
             'remaining in the Command/Service Module Columbia, began their '
             'descent. After overcoming several computer overload alarms '
             'caused by an antenna switch left in the wrong position, and a '
             'slight downrange error, Armstrong took over manual flight '
             'control at about 180 meters (590 ft), and guided the Lunar '
             'Module to a safe landing spot at 20:18:04 UTC, July 20, 1969 '
             '(3:17:04 pm CDT). The first humans on the Moon would wait '
             'another six hours before they ventured out of their craft. At '
             '02:56 UTC, July 21 (9:56 pm CDT July 20), Armstrong became the '
             'first human to set foot on the Moon.',

The result looks pretty good.

### Add a few more questions. What did you observe?

In [54]:
question = "What were the main contributions of the University of Notre Dame to science?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'increased the faculty by more than 500 professors',
  'context': 'In the 18 years under the presidency of Edward Malloy, C.S.C., '
             "(1987–2005), there was a rapid growth in the school's "
             'reputation, faculty, and resources. He increased the faculty by '
             'more than 500 professors; the academic quality of the student '
             'body has improved dramatically, with the average SAT score '
             'rising from 1240 to 1360; the number of minority students more '
             'than doubled; the endowment grew from $350 million to more than '
             '$3 billion; the annual operating budget rose from $177 million '
             'to more than $650 million; and annual research funding improved '
             "from $15 million to more than $70 million. Notre Dame's most "
             'recent[when?] capital campaign raised $1.1 billion, far '
             'exceeding its goal of $767 million, and is the largest in the '
        

In [55]:
question = "What is the cultural significance of Kathmandu?"
context = get_context(question, top_k=2)
extract_answer(question, context)

You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset


[{'answer': 'festivities',
  'context': 'The city has a rich history, spanning nearly 2000 years, as '
             'inferred from inscriptions found in the valley. Religious and '
             'cultural festivities form a major part of the lives of people '
             "residing in Kathmandu. Most of Kathmandu's people follow "
             'Hinduism and many others follow Buddhism. There are people of '
             'other religious beliefs as well, giving Kathmandu a cosmopolitan '
             'culture. Nepali is the most commonly spoken language in the '
             "city. English is understood by Kathmandu's educated residents. "
             'Historic areas of Kathmandu were devastated by a 7.8 magnitude '
             'earthquake on 25 April 2015.',
  'end': 142,
  'score': 0.7669987082481384,
  'start': 131},
 {'answer': 'Buddhism',
  'context': 'The location and terrain of Kathmandu have played a significant '
             'role in the development of a stable economy which 

In [56]:
question = "Which industries contribute the most to Nepal's economy?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'tourism',
  'context': "Since then, tourism in Nepal has thrived; it is the country's "
             'most important industry.[citation needed] Tourism is a major '
             'source of income for most of the people in the city, with '
             'several hundred thousand visitors annually. Hindu and Buddhist '
             "pilgrims from all over the world visit Kathmandu's religious "
             'sites such as Pashupatinath, Swayambhunath, Boudhanath and '
             'Budhanilkantha. From a mere 6,179 tourists in 1961/62, the '
             'number jumped to 491,504 in 1999/2000. Following the end of the '
             'Maoist insurgency, there was a significant rise of 509,956 '
             'tourist arrivals in 2009. Since then, tourism has improved as '
             'the country turned into a Democratic Republic. In economic '
             'terms, the foreign exchange registered 3.8% of the GDP in '
             '1995/96 but then started declining[why?]. The 

In [57]:
question = "Describe the advancements associated with the Apollo 8 mission."
context = get_context(question, top_k=5)
extract_answer(question, context)

[{'answer': 'first to leave low-Earth orbit and go to another celestial body',
  'context': 'On December 21, 1968, Frank Borman, James Lovell, and William '
             'Anders became the first humans to ride the Saturn V rocket into '
             'space on Apollo 8. They also became the first to leave low-Earth '
             'orbit and go to another celestial body, and entered lunar orbit '
             'on December 24. They made ten orbits in twenty hours, and '
             'transmitted one of the most watched TV broadcasts in history, '
             'with their Christmas Eve program from lunar orbit, that '
             'concluded with a reading from the biblical Book of Genesis. Two '
             'and a half hours after the broadcast, they fired their engine to '
             'perform the first trans-Earth injection to leave lunar orbit and '
             'return to the Earth. Apollo 8 safely landed in the Pacific ocean '
             "on December 27, in NASA's first dawn spla

In [58]:
question = "What are the technological breakthroughs of the Saturn V rocket?"
context = get_context(question, top_k=4)
extract_answer(question, context)

[{'answer': 'staged design, a completely new control system, and a new fuel',
  'context': 'In 1953, Korolev was given the go-ahead to develop the R-7 '
             'Semyorka rocket, which represented a major advance from the '
             'German design. Although some of its components (notably '
             'boosters) still resembled the German G-4, the new rocket '
             'incorporated staged design, a completely new control system, and '
             'a new fuel. It was successfully tested on August 21, 1957 and '
             "became the world's first fully operational ICBM the following "
             'month. It would later be used to launch the first satellite into '
             'space, and derivatives would launch all piloted Soviet '
             'spacecraft.',
  'end': 307,
  'score': 0.978406548500061,
  'start': 245},
 {'answer': 'The first step was unguided missile systems',
  'context': 'Some nations started rocket research before World War II, '
             'i

# Index Deletion

In [59]:
pc.delete_index(index_name)
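
Calling `delete_index` on an index that no longer exists may raise an error, for example when this cell is run twice. Below is a sketch of a guarded variant that reuses the `list_indexes().names()` check from the index-creation cell; `delete_index_if_exists` is a hypothetical helper name.

```python
def delete_index_if_exists(pc, index_name):
    """Delete the index only if the client currently lists it.

    `pc` is assumed to offer list_indexes().names() and delete_index(name),
    the same calls used elsewhere in this notebook.
    Returns True if a deletion was requested, False otherwise.
    """
    if index_name in pc.list_indexes().names():
        pc.delete_index(index_name)
        return True
    return False
```

In the notebook this would be called as `delete_index_if_exists(pc, index_name)`.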