<a href="https://colab.research.google.com/github/TiagoFerreira-lab/lab-extractive-question-answering/blob/main/lab_extractive_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for the notebook to run:

In [None]:
!pip install python-dotenv


In [None]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')
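`os.getenv` returns `None` when a variable is missing, so a small guard can fail early with a clear message instead of failing later inside a client library. The helper below is only a sketch: `require_env` and `DEMO_API_KEY` are hypothetical names used for illustration, not part of `python-dotenv`.

```python
import os

def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value

# Demonstration with a placeholder variable (hypothetical name and value)
os.environ["DEMO_API_KEY"] = "placeholder-value"
demo_key = require_env("DEMO_API_KEY")
```

In this notebook the same check could be applied to `PINECONE_API_KEY` right after `load_dotenv` runs.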

# Install Dependencies

In [None]:
!pip install -qU datasets pinecone sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas DataFrame, keep only the title and context columns, and drop any duplicate context passages.

In [None]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()


In [None]:
# select only the title and context columns
df = df[['title', 'context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset='context')
df

| | title | context |
|---|---|---|
| 0 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... |
| 5 | University_of_Notre_Dame | As at most other universities, Notre Dame's st... |
| 10 | University_of_Notre_Dame | The university is the major seat of the Congre... |
| 15 | University_of_Notre_Dame | The College of Engineering was established in ... |
| 20 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... |
| ... | ... | ... |
| 87574 | Kathmandu | Institute of Medicine, the central college of ... |
| 87579 | Kathmandu | Football and Cricket are the most popular spor... |
| 87584 | Kathmandu | The total length of roads in Nepal is recorded... |
| 87589 | Kathmandu | The main international airport serving Kathman... |


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve by querying with another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [None]:
!pip install pinecone



In [None]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone; the cloud and region for a serverless index come from `spec`
pc = Pinecone(api_key=PINECONE_API_KEY)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [None]:
import time

index_name = "question-answering"

# check if the question-answering index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        index_name,
        dimension=384,  # dimensionality of the retriever's embeddings
        metric="cosine",
        spec=spec,
    )
    # wait for the index to be initialized, checking its status every 5 seconds
    while not pc.describe_index(index_name).status['ready']:
        print(f"Waiting for index {index_name} to be ready...")
        time.sleep(5)

# connect to the question-answering index we created
index = pc.Index(index_name)
# view index stats
index.describe_index_stats()

{'dimension': 384,
 'index_fullness': 0.0,
 'metric': 'cosine',
 'namespaces': {'': {'vector_count': 768}},
 'total_vector_count': 768,
 'vector_type': 'dense'}
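
Because the index is created with a fixed dimension, every vector sent to it has to have exactly that length. A minimal sketch of a guard that could run before an upsert is shown below; `check_dimensions` and `EXPECTED_DIM` are hypothetical names introduced here, and the vectors are dummy values rather than real embeddings.

```python
EXPECTED_DIM = 384  # assumed to match the `dimension` used when creating the index

def check_dimensions(vectors, expected_dim=EXPECTED_DIM):
    """Raise ValueError if any vector's length differs from expected_dim."""
    for pos, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {pos} has {len(vec)} dimensions, expected {expected_dim}"
            )
    return True

# Dummy vectors of the right length pass the check
ok = check_dimensions([[0.0] * 384, [0.1] * 384])
```

A mismatch usually means the retriever model was swapped for one with a different output size, in which case the index would need to be recreated with the new dimension.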

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
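
To make the ranking step concrete, here is a small, self-contained sketch of cosine-similarity search over toy 3-dimensional vectors. It only illustrates the idea: the vectors are made up, `cosine_similarity` and `rank_contexts` are hypothetical helper names, and in the notebook this work is done by the Pinecone index rather than in Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_contexts(query_vec, context_vecs, top_k=1):
    """Return (context_index, score) pairs for the top_k most similar contexts."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(context_vecs)]
    scored.sort(reverse=True)
    return [(i, score) for score, i in scored[:top_k]]

query = [1.0, 0.0, 1.0]
contexts = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_contexts(query, contexts, top_k=2)
```

The second context ranks first even though it is longer than the query vector, because cosine similarity depends only on direction, not on magnitude.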

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [None]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# Define the retriever model from Hugging Face Model Hub
retriever_model_name = "multi-qa-MiniLM-L6-cos-v1"

# Load the model and move it to the selected device
retriever = SentenceTransformer(retriever_model_name, device=device)


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches, so that no single encode or upsert call has to handle the whole dataset at once. For each context passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to our embeddings, such as the article title and the context passage itself.

In [None]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch (the last batch may be shorter)
    end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:end]
    # generate embeddings for this batch only, so they line up with its ids and metadata
    emb = retriever.encode(batch['context'].tolist()).tolist()
    # get metadata
    meta = batch[['title', 'context']].to_dict(orient='records')
    # create unique IDs
    ids = [f"id-{j}" for j in range(i, end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

KeyboardInterrupt: 
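
The batching scheme described in this section can be sketched without Pinecone or a real model. In the example below, `iter_batches`, `build_records`, and `fake_embed` are hypothetical names, and `fake_embed` is a stand-in that returns made-up 2-dimensional vectors instead of real embeddings. The point is only that each batch is embedded on its own, so every id, vector, and metadata entry refers to the same passage.

```python
def iter_batches(items, batch_size):
    """Yield (start_offset, batch) pairs covering `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def build_records(passages, embed_batch, batch_size=64):
    """Build (id, vector, metadata) tuples, embedding one batch at a time."""
    records = []
    for start, batch in iter_batches(passages, batch_size):
        vectors = embed_batch([p["context"] for p in batch])
        for offset, (passage, vector) in enumerate(zip(batch, vectors)):
            records.append((f"id-{start + offset}", vector, passage))
    return records

# Stand-in embedder (assumption): character count and space count per text
def fake_embed(texts):
    return [[float(len(t)), float(t.count(" "))] for t in texts]

passages = [{"title": "T", "context": f"passage number {n}"} for n in range(5)]
records = build_records(passages, fake_embed, batch_size=2)
```

With five passages and a batch size of two, the last batch holds a single passage, and the ids still run from `id-0` to `id-4` without gaps.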

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the Hugging Face Hub as our reader model. We load this model into a "question-answering" pipeline from Hugging Face Transformers and feed it our question together with each context passage, one passage at a time. The model returns one prediction for each context we pass through the pipeline.

In [None]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Device set to use cuda


<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7e6cc2bb6ad0>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most similar to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [None]:
# gets context passages from the pinecone index
def get_context(question, top_k):
    # generate embeddings for the question
    xq = retriever.encode(question).tolist()
    # search pinecone index for context passage with the answer
    xc =  index.query(vector=xq, top_k=top_k, include_metadata=True)
    # extract the context passage from pinecone search result
    c =  [x['metadata']['context'] for x in xc['matches']]
    return c

In [None]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader's score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # print the sorted results (pprint itself returns None, so it must not be assigned)
    pprint(sorted_result)
    # return the list; end the calling cell with a semicolon to avoid echoing it again
    return sorted_result
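
A reader always returns some span, even when the context does not contain the answer, so it can help to discard low-scoring predictions. The sketch below shows one way to do that, under clearly stated assumptions: `best_answer` and `fake_reader` are hypothetical names, `fake_reader` is a stand-in for the real pipeline that returns hard-coded scores, and the `min_score=0.1` threshold is an arbitrary illustrative value, not a recommendation.

```python
def best_answer(question, contexts, reader_fn, min_score=0.1):
    """Return the highest-scoring answer dict, or None if no score reaches min_score."""
    candidates = []
    for c in contexts:
        result = dict(reader_fn(question=question, context=c))
        result["context"] = c
        candidates.append(result)
    if not candidates:
        return None
    top = max(candidates, key=lambda r: r["score"])
    return top if top["score"] >= min_score else None

# Stand-in reader (assumption): confident only when the context names the target
def fake_reader(question, context):
    target = "Neil Armstrong"
    if target in context:
        start = context.index(target)
        return {"answer": target, "score": 0.9, "start": start, "end": start + len(target)}
    first_word = context.split()[0]
    return {"answer": first_word, "score": 1e-13, "start": 0, "end": len(first_word)}

question = "Who was the first person to step foot on the moon?"
contexts = [
    "Neil Armstrong walked on the Moon in 1969.",
    "The bulb emits little light.",
]
hit = best_answer(question, contexts, fake_reader)
miss = best_answer(question, contexts[1:], fake_reader)
```

Here `hit` is the answer from the first passage, while `miss` is `None` because the only remaining passage yields a near-zero score.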

In [None]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['In 550 BC, Cyrus the Great, son of Mandane and Cambyses I, took over the Median Empire, and founded the Achaemenid Empire by unifying other city states. The conquest of Media was a result of what is called the Persian Revolt. The brouhaha was initially triggered by the actions of the Median ruler Astyages, and was quickly spread to other provinces, as they allied with the Persians. Later conquests under Cyrus and his successors expanded the empire to include Lydia, Babylon, Egypt, parts of the Balkans and Eastern Europe proper, as well as the lands to the west of the Indus and Oxus rivers.']

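The passage returned here is about the Achaemenid Empire and says nothing about Egypt's oil production, so in this run the retriever did not find a context that contains the answer. A likely cause is a partially populated index: the upsert cell above ended with a `KeyboardInterrupt`, and the index statistics shown earlier report only 768 vectors. Let's still pass the passage to the reader to see what it extracts.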

In [None]:
extract_answer(question, context)

[{'answer': 'conquests under Cyrus',
  'context': 'In 550 BC, Cyrus the Great, son of Mandane and Cambyses I, took '
             'over the Median Empire, and founded the Achaemenid Empire by '
             'unifying other city states. The conquest of Media was a result '
             'of what is called the Persian Revolt. The brouhaha was initially '
             'triggered by the actions of the Median ruler Astyages, and was '
             'quickly spread to other provinces, as they allied with the '
             'Persians. Later conquests under Cyrus and his successors '
             'expanded the empire to include Lydia, Babylon, Egypt, parts of '
             'the Balkans and Eastern Europe proper, as well as the lands to '
             'the west of the Indus and Oxus rivers.',
  'end': 412,
  'score': 1.3741748895474554e-13,
  'start': 391}]


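The reader returned *conquests under Cyrus* with a score of about 1.4e-13, which means it has essentially no confidence in this span. That is expected, because the retrieved passage does not contain the answer. When the retriever returns the relevant passage, the reader should extract the correct answer, *691,000 bbl/d*, with a score close to 1. Note that this score is the model's confidence in the span, not an accuracy figure. Let's run a few more queries.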

In [None]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'Native corporations created under the Alaska Native Claims '
            'Settlement Act',
  'context': 'Another 44 million acres (18 million hectares) are owned by 12 '
             'regional, and scores of local, Native corporations created under '
             'the Alaska Native Claims Settlement Act (ANCSA) of 1971. '
             'Regional Native corporation Doyon, Limited often promotes itself '
             'as the largest private landowner in Alaska in advertisements and '
             'other communications. Provisions of ANCSA allowing the '
             "corporations' land holdings to be sold on the open market "
             'starting in 1991 were repealed before they could take effect. '
             'Effectively, the corporations hold title (including subsurface '
             'title in many cases, a privilege denied to individual Alaskans) '
             'but cannot sell the land. Individual Native allotments can be '
             'and are sold on the open ma

In [None]:
question = "What is Albert Einstein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'answer': 'achievement of grid parity for PV',
  'context': 'The PV industry has seen drops in module prices since 2008. In '
             'late 2011, factory-gate prices for crystalline-silicon '
             'photovoltaic modules dropped below the $1.00/W mark. The $1.00/W '
             'installed cost, is often regarded in the PV industry as marking '
             'the achievement of grid parity for PV. These reductions have '
             'taken many stakeholders, including industry analysts, by '
             'surprise, and perceptions of current solar power economics often '
             'lags behind reality. Some stakeholders still have the '
             'perspective that solar PV remains too costly on an unsubsidized '
             'basis to compete with conventional generation options. Yet '
             'technological advancements, manufacturing process improvements, '
             'and industry re-structuring, mean that further price reductions '
             'are likely

Let's run another question, this time retrieving the top 3 context passages from the retriever.

In [None]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'emits the equivalent light of a four watt bulb',
  'context': 'The relationships above are valid for only a few percent change '
             'of voltage around rated conditions, but they do indicate that a '
             'lamp operated at much lower than rated voltage could last for '
             'hundreds of times longer than at rated conditions, albeit with '
             'greatly reduced light output. The "Centennial Light" is a light '
             'bulb that is accepted by the Guinness Book of World Records as '
             'having been burning almost continuously at a fire station in '
             'Livermore, California, since 1901. However, the bulb emits the '
             'equivalent light of a four watt bulb. A similar story can be '
             'told of a 40-watt bulb in Texas that has been illuminated since '
             '21 September 1908. It once resided in an opera house where '
             'notable celebrities stopped to take in its glow, and was mov

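The top-scored passage shown here is about long-burning light bulbs, not the moon landing, so the extracted answer is not meaningful. As with the earlier queries, the retrieved contexts are unrelated to the question, which points to the index contents rather than to the reader.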

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?