<a href="https://colab.research.google.com/github/solvedbrunus/lab-extractive-question-answering/blob/main/lab_extractive_question_answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Extractive Question Answering

This notebook demonstrates how Pinecone helps you build an extractive question-answering application. To build an extractive question-answering system, we need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A reader model to extract answers

We will use the SQuAD dataset, which consists of **questions** and **context** paragraphs containing question **answers**. We generate embeddings for the context passages using the retriever, index them in the vector database, and query with semantic search to retrieve the top k most relevant contexts containing potential answers to our question. We then use the reader model to extract the answers from the returned contexts.

Let's get started by installing the packages needed for this notebook to run:

In [None]:
%pip install python-dotenv

Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Downloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.1
Note: you may need to restart the kernel to use updated packages.


In [None]:

# Import required packages to load environment variables
import os
from dotenv import load_dotenv, find_dotenv

# Load environment variables from .env file
_ = load_dotenv(find_dotenv())

# Get API keys from environment variables
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY')

In [None]:
#from google.colab import userdata

#OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')

#PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')

# Install Dependencies

In [None]:
!pip install -qU datasets pinecone-client sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep only the title and context columns, and drop any duplicate context passages.

In [None]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [None]:
df.head()

Unnamed: 0,id,title,context,question,answers
0,5733be284776f41900661182,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",To whom did the Virgin Mary allegedly appear i...,"{'text': ['Saint Bernadette Soubirous'], 'answ..."
1,5733be284776f4190066117f,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is in front of the Notre Dame Main Building?,"{'text': ['a copper statue of Christ'], 'answe..."
2,5733be284776f41900661180,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",The Basilica of the Sacred heart at Notre Dame...,"{'text': ['the Main Building'], 'answer_start'..."
3,5733be284776f41900661181,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What is the Grotto at Notre Dame?,{'text': ['a Marian place of prayer and reflec...
4,5733be284776f4190066117e,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha...",What sits on top of the Main Building at Notre...,{'text': ['a golden statue of the Virgin Mary'...


In [None]:
# select only title and context column
df = df[["title", "context"]]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset=["context"])

df

Unnamed: 0,title,context
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
5,University_of_Notre_Dame,"As at most other universities, Notre Dame's st..."
10,University_of_Notre_Dame,The university is the major seat of the Congre...
15,University_of_Notre_Dame,The College of Engineering was established in ...
20,University_of_Notre_Dame,All of Notre Dame's undergraduate students are...
...,...,...
87574,Kathmandu,"Institute of Medicine, the central college of ..."
87579,Kathmandu,Football and Cricket are the most popular spor...
87584,Kathmandu,The total length of roads in Nepal is recorded...
87589,Kathmandu,The main international airport serving Kathman...


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [None]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone; the cloud and region are set in the ServerlessSpec above
pc = Pinecone(
    api_key=PINECONE_API_KEY
)

Now we create a new index called "question-answering" — we can name the index anything we want. We specify the metric type as "cosine" and dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimension vectors.

In [None]:
index_name = "question-answering"

# check if the index already exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        metric="cosine",
        dimension=384,
        spec=spec
    )
# connect to the index (created above if it did not exist)
index = pc.Index(index_name)
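
Because the index is created with a fixed dimension, every vector sent to it has to have exactly that length. The sketch below is a hypothetical pre-upsert check (the name `check_dimension` is an assumption, not part of the Pinecone client) that catches a mismatch locally before any request is made:

```python
EXPECTED_DIM = 384  # dimension used when creating the index


def check_dimension(vectors, expected_dim=EXPECTED_DIM):
    """Raise ValueError if any vector does not have the expected length."""
    for pos, vec in enumerate(vectors):
        if len(vec) != expected_dim:
            raise ValueError(
                f"vector {pos} has dimension {len(vec)}, expected {expected_dim}"
            )
    return True


# Toy vectors: one with the right length, checked without error.
print(check_dimension([[0.0] * 384]))
```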

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
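
To make the ranking step concrete, here is a minimal sketch of cosine-similarity retrieval in plain Python. The 3-dimensional vectors and passage ids are made up for illustration (real embeddings from the retriever have 384 dimensions), and `top_k` is a hypothetical helper, not the Pinecone query API:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, passages, k=2):
    """Return the k passage ids most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), pid) for pid, vec in passages.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:k]]


# Hypothetical toy embeddings, not produced by any model.
passages = {
    "egypt_oil": [0.9, 0.1, 0.0],
    "nepal_roads": [0.0, 0.2, 0.9],
    "egypt_gas": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]
print(top_k(query, passages, k=2))  # ['egypt_oil', 'egypt_gas']
```

A vector database performs the same kind of ranking, but over the whole index and with data structures built for large collections.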

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [None]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
#use the 'multi-qa-MiniLM-L6-cos-v1' model from HuggingFace to build the retriever
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1').to(device)
retriever


SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We do this in batches, which speeds up both embedding generation and the upload to the Pinecone index. For each context passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to the embedding, such as the article title and the context passage itself.

In [None]:
from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:end]
    titles = batch["title"].tolist()
    contexts = batch["context"].tolist()
    # generate embeddings for batch
    emb = retriever.encode(contexts, convert_to_tensor=True, show_progress_bar=False)
    # create unique IDs from the row positions
    ids = [f"{index_name}_{n}" for n in range(i, end)]
    # build one record per passage: id, embedding values, and metadata
    to_upsert = [{
        "id": id_,
        "values": v.cpu().numpy().tolist(),
        "metadata": {
            "title": title,
            "context": context
        }
    } for id_, v, title, context in zip(ids, emb, titles, contexts)]
    # upsert/insert this batch into pinecone (inside the loop, once per batch)
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()

  0%|          | 0/296 [00:00<?, ?it/s]

{'dimension': 384,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
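
The batching pattern can be checked in isolation, without a model or an index. The sketch below uses toy data and a hypothetical generator `batched_records`; it shows that records must be built and handed off once per batch, and that the number of batches is `ceil(n / batch_size)`, which is what the progress bar total (296 above) reports:

```python
import math


def batched_records(titles, contexts, vectors, prefix, batch_size=64):
    """Yield lists of records (id, values, metadata), one list per batch."""
    n = len(contexts)
    for start in range(0, n, batch_size):
        end = min(start + batch_size, n)
        yield [
            {
                "id": f"{prefix}_{row}",
                "values": vectors[row],
                "metadata": {"title": titles[row], "context": contexts[row]},
            }
            for row in range(start, end)
        ]


# Toy data: 5 rows with batch size 2 gives 3 batches.
titles = [f"title_{n}" for n in range(5)]
contexts = [f"passage {n}" for n in range(5)]
vectors = [[float(n), 0.0] for n in range(5)]

batches = list(batched_records(titles, contexts, vectors, "demo", batch_size=2))
print(len(batches), [len(b) for b in batches])  # 3 [2, 2, 1]
assert len(batches) == math.ceil(5 / 2)
```

If the vector count reported by the index stays far below the number of passages, a likely cause is that the upsert ran for only one batch instead of every batch.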

# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [None]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

<transformers.pipelines.question_answering.QuestionAnsweringPipeline at 0x7f7e223a5970>

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves the context passages most likely to contain the answer to our question from the Pinecone index, and the `extract_answer` function extracts the answers from these context passages.

In [None]:
def get_context(question, top_k=3):
    """Retrieve relevant contexts from Pinecone index"""
    try:
        # Encode question
        question_embedding = retriever.encode(question).tolist()

        # Query Pinecone
        results = index.query(
            vector=question_embedding,
            top_k=top_k,
            include_metadata=True
        )

        # Extract contexts; the passages were stored under the 'context' metadata key
        contexts = [match.metadata['context'] for match in results.matches]
        return contexts
    except Exception as e:
        print(f"Error retrieving context: {e}")
        return []


In [None]:
def get_context(question, top_k=3):
    """Retrieve relevant contexts from Pinecone index"""
    try:
        # Encode question
        question_embedding = retriever.encode(question).tolist()

        # Query Pinecone
        results = index.query(
            vector=question_embedding,
            top_k=top_k,
            include_metadata=True
        )

        # Extract contexts with proper error handling
        contexts = []
        for match in results.matches:
            try:
                # Check different possible metadata keys
                if 'text' in match.metadata:
                    contexts.append(match.metadata['text'])
                elif 'context' in match.metadata:
                    contexts.append(match.metadata['context'])
                elif 'content' in match.metadata:
                    contexts.append(match.metadata['content'])
            except AttributeError:
                # Handle dict-style access
                if isinstance(match, dict) and 'metadata' in match:
                    meta = match['metadata']
                    if 'text' in meta:
                        contexts.append(meta['text'])
                    elif 'context' in meta:
                        contexts.append(meta['context'])
                    elif 'content' in meta:
                        contexts.append(meta['content'])

        return contexts if contexts else []

    except Exception as e:
        print(f"Error retrieving context: {str(e)}")
        return []

In [None]:
def extract_answer(question, contexts):
    """Extract answers from contexts using the reader model"""
    try:
        results = []
        for context in contexts:
            prediction = reader(
                question=question,
                context=context
            )
            prediction['context'] = context
            results.append(prediction)

        # Sort by confidence score
        return sorted(results, key=lambda x: x['score'], reverse=True)
    except Exception as e:
        print(f"Error extracting answer: {e}")
        return []

In [None]:

def get_answer(question):
    """Complete QA pipeline"""
    contexts = get_context(question)
    if not contexts:
        return "No relevant context found"

    answers = extract_answer(question, contexts)
    if not answers:
        return "No answer found"

    best_answer = answers[0]
    return {
        'answer': best_answer['answer'],
        'confidence': best_answer['score'],
        'context': best_answer['context']
    }


In [None]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['The total length of roads in Nepal is recorded to be (17,182 km (10,676 mi)), as of 2003–04. This fairly large network has helped the economic development of the country, particularly in the fields of agriculture, horticulture, vegetable farming, industry and also tourism. In view of the hilly terrain, transportation takes place in Kathmandu are mainly by road and air. Kathmandu is connected by the Tribhuvan Highway to the south, Prithvi Highway to the west and Araniko Highway to the north. The BP Highway, connecting Kathmandu to the eastern part of Nepal is under construction.']

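The retriever returns the context passage it considers closest to our question. In the run shown above, however, the returned passage is about roads in Nepal rather than oil production in Egypt, so it does not contain the answer. Now let's use the reader to extract an answer from the context passage.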

In [None]:
extract_answer(question, context)

[{'score': 1.317318243420143e-13,
  'start': 53,
  'end': 75,
  'answer': '(17,182 km (10,676 mi)',
  'context': 'The total length of roads in Nepal is recorded to be (17,182 km (10,676 mi)), as of 2003–04. This fairly large network has helped the economic development of the country, particularly in the fields of agriculture, horticulture, vegetable farming, industry and also tourism. In view of the hilly terrain, transportation takes place in Kathmandu are mainly by road and air. Kathmandu is connected by the Tribhuvan Highway to the south, Prithvi Highway to the west and Araniko Highway to the north. The BP Highway, connecting Kathmandu to the eastern part of Nepal is under construction.'}]
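
The `start` and `end` fields in the output above are character offsets into `context`, so slicing the passage with them reproduces `answer`. A short check, using only the first sentence of the passage (truncated here for brevity):

```python
# First sentence of the returned passage, truncated after the answer span.
context = "The total length of roads in Nepal is recorded to be (17,182 km (10,676 mi))"
prediction = {"start": 53, "end": 75, "answer": "(17,182 km (10,676 mi)"}

# Slice the context with the offsets reported by the reader.
span = context[prediction["start"]:prediction["end"]]
print(span)
assert span == prediction["answer"]
```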

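The expected answer is *691,000 bbl/d*, but in the run shown above the reader returns a span about road length with a score of about 1.3e-13, which is essentially zero confidence. This happens because the retrieved passage does not contain the answer. Let's run a few more queries.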

In [None]:
question = "What are the first names of the men that invented youtube?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'score': 6.61898383325088e-11,
  'start': 53,
  'end': 60,
  'answer': 'Araniko',
  'context': 'Legendary Princess Bhrikuti (7th-century) and artist Araniko (1245 - 1306 AD) from that tradition of Kathmandu valley played a significant role in spreading Buddhism in Tibet and China. There are over 108 traditional monasteries (Bahals and Bahis) in Kathmandu based on Newar Buddhism. Since the 1960s, the permanent Tibetan Buddhist population of Kathmandu has risen significantly so that there are now over fifty Tibetan Buddhist monasteries in the area. Also, with the modernization of Newar Buddhism, various Theravada Bihars have been established.'}]

In [None]:
question = "What is Albert Einstein famous for?"
context = get_context(question, top_k=1)
extract_answer(question, context)

[{'score': 1.5064460096025911e-12,
  'start': 709,
  'end': 725,
  'answer': 'urban management',
  'context': "Kathmandu Metropolitan City (KMC), in order to promote international relations has established an International Relations Secretariat (IRC). KMC's first international relationship was established in 1975 with the city of Eugene, Oregon, United States. This activity has been further enhanced by establishing formal relationships with 8 other cities: Motsumoto City of Japan, Rochester of the USA, Yangon (formerly Rangoon) of Myanmar, Xi'an of the People's Republic of China, Minsk of Belarus, and Pyongyang of the Democratic Republic of Korea. KMC's constant endeavor is to enhance its interaction with SAARC countries, other International agencies and many other major cities of the world to achieve better urban management and developmental programs for Kathmandu."}]

Let's run another question, this time retrieving the top 3 context passages from the retriever.

In [None]:
question = "Who was the first person to step foot on the moon?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'score': 2.488356742880171e-10,
  'start': 53,
  'end': 60,
  'answer': 'Araniko',
  'context': 'Legendary Princess Bhrikuti (7th-century) and artist Araniko (1245 - 1306 AD) from that tradition of Kathmandu valley played a significant role in spreading Buddhism in Tibet and China. There are over 108 traditional monasteries (Bahals and Bahis) in Kathmandu based on Newar Buddhism. Since the 1960s, the permanent Tibetan Buddhist population of Kathmandu has risen significantly so that there are now over fifty Tibetan Buddhist monasteries in the area. Also, with the modernization of Newar Buddhism, various Theravada Bihars have been established.'},
 {'score': 4.5121467326381115e-12,
  'start': 456,
  'end': 465,
  'answer': 'first son',
  'context': 'The Bagmati River which flows through Kathmandu is considered a holy river both by Hindus and Buddhists, and many Hindu temples are located on the banks of this river. The importance of the Bagmati also lies in the fact that Hindus are crema

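Here, too, the scores are close to zero and the extracted answers are unrelated to the question, so the result should not be trusted.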
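
One way to avoid reporting such answers is to drop predictions whose score falls below a cutoff. The sketch below is a hypothetical post-processing step, not part of the reader pipeline; the threshold of 0.5 is an arbitrary assumption that would need tuning, and the two scores are copied from the output above:

```python
def confident_answers(predictions, min_score=0.5):
    """Keep predictions whose score is at least min_score, best first."""
    kept = [p for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


predictions = [
    {"answer": "Araniko", "score": 2.488356742880171e-10},
    {"answer": "first son", "score": 4.5121467326381115e-12},
]
print(confident_answers(predictions))  # []
```

An empty list here is a useful signal: it suggests looking at the retrieval step or the index contents rather than at the reader.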

In [None]:
# delete the index only when you are finished; queries against it will fail afterwards
pc.delete_index(index_name)

### Add a few more questions. What did you observe?

In [None]:
question = "who is katmandu?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'score': 0.002149143721908331,
  'start': 434,
  'end': 440,
  'answer': 'Kirats',
  'context': 'Kirant Mundhum is one of the indigenous animistic practices of Nepal. It is practiced by Kirat people. Some animistic aspects of Kirant beliefs, such as ancestor worship (worship of Ajima) are also found in Newars of Kirant origin. Ancient religious sites believed to be worshipped by ancient Kirats, such as Pashupatinath, Wanga Akash Bhairabh (Yalambar) and Ajima are now worshipped by people of all Dharmic religions in Kathmandu. Kirats who have migrated from other parts of Nepal to Kathmandu practice Mundhum in the city.'},
 {'score': 0.0004798909940291196,
  'start': 317,
  'end': 344,
  'answer': 'Tibetan Buddhist population',
  'context': 'Legendary Princess Bhrikuti (7th-century) and artist Araniko (1245 - 1306 AD) from that tradition of Kathmandu valley played a significant role in spreading Buddhism in Tibet and China. There are over 108 traditional monasteries (Bahals and Bahis) i

## This was challenging: the questions were not clear and the answers were not in the context. I tried to fix it, but the problem was the index. I understood the objective but had a hard time with the implementation.

## I will try to fix it and submit again.
