# LAB | Extractive Question Answering

Let's get started by installing the packages needed for the notebook to run:

In [1]:
!pip install python-dotenv


Collecting python-dotenv
  Downloading python_dotenv-1.1.0-py3-none-any.whl.metadata (24 kB)
Downloading python_dotenv-1.1.0-py3-none-any.whl (20 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.1.0


In [10]:
import os
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv())

OPENAI_API_KEY  = os.getenv('OPENAI_API_KEY')
PINECONE_API_KEY= os.getenv('PINECONE_API_KEY')

# Install Dependencies

In [None]:
!pip install -qU datasets pinecone sentence-transformers torch

# Load Dataset

Now let's load the SQuAD dataset from the Hugging Face Hub. We load the dataset into a pandas dataframe, keep only the title and context columns, and drop any duplicate context passages.

In [11]:
from datasets import load_dataset

# load the squad dataset into a pandas dataframe
df = load_dataset("squad", split="train").to_pandas()

In [5]:
# select only the title and context columns
df = df[['title', 'context']]
# drop rows containing duplicate context passages
df = df.drop_duplicates(subset='context')
df

index,title,context
0,University_of_Notre_Dame,"Architecturally, the school has a Catholic cha..."
5,University_of_Notre_Dame,"As at most other universities, Notre Dame's st..."
10,University_of_Notre_Dame,The university is the major seat of the Congre...
15,University_of_Notre_Dame,The College of Engineering was established in ...
20,University_of_Notre_Dame,All of Notre Dame's undergraduate students are...
...,...,...
87574,Kathmandu,"Institute of Medicine, the central college of ..."
87579,Kathmandu,Football and Cricket are the most popular spor...
87584,Kathmandu,The total length of roads in Nepal is recorded...
87589,Kathmandu,The main international airport serving Kathman...


# Initialize Pinecone Index

The Pinecone index stores vector representations of our context passages, which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone to create our vector index. For this, we need a free [API key](https://app.pinecone.io/), and then we initialize the connection like so:

In [12]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

# connect to Pinecone; the cloud and region are already set in `spec` above
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we create a new index called "question-answering" (we can name the index anything we want). We specify the metric type as "cosine" and the dimension as 384 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 384-dimensional vectors.

In [13]:
index_name = "question-answering"


if index_name not in pc.list_indexes().names():

    pc.create_index(
        name=index_name,
        dimension=384,
        metric="cosine",
        spec=spec
    )

index = pc.Index(index_name)

# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all context passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will generate embeddings in a way that the questions and context passages containing answers to our questions are nearby in the vector space. We can use cosine similarity to calculate the similarity between the query and context embeddings to find the context passages that contain potential answers to our question.
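
To make the idea of "nearby in the vector space" concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors and the names `query` and `passages` are made up for illustration; the real embeddings have 384 dimensions and Pinecone performs this comparison for us.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (assumed values, only for illustration)
query = [1.0, 0.0, 1.0]
passages = {
    "relevant": [2.0, 0.0, 2.0],   # points in the same direction as the query
    "unrelated": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank passage names by similarity to the query, most similar first
ranked = sorted(
    passages,
    key=lambda name: cosine_similarity(query, passages[name]),
    reverse=True,
)
print(ranked)
```

Note that the "relevant" vector is twice as long as the query but still scores highest: cosine similarity only looks at direction, not magnitude.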

We will use a SentenceTransformer model named ``multi-qa-MiniLM-L6-cos-v1`` designed for semantic search and trained on 215M (question, answer) pairs from diverse sources as our retriever.

In [14]:
from sentence_transformers import SentenceTransformer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# This is the model that returns 384-dim embeddings (matches your Pinecone index)
retriever = SentenceTransformer('multi-qa-MiniLM-L6-cos-v1', device=device)


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, context passage, etc.

In [None]:
from tqdm.auto import tqdm

batch_size = 64

for i in tqdm(range(0, len(df), batch_size)):
    # extract the batch
    batch_df = df.iloc[i:i+batch_size]

    # generate embeddings
    texts = batch_df['context'].tolist()
    embeddings = retriever.encode(texts).tolist()

    # generate IDs
    ids = [f"id-{i+j}" for j in range(len(batch_df))]

    # create metadata (optional)
    metadata = [{"title": row["title"], "context": row["context"]} for _, row in batch_df.iterrows()]

    # bundle for upsert
    to_upsert = list(zip(ids, embeddings, metadata))

    # upsert to Pinecone
    index.upsert(vectors=to_upsert)

# sanity check
index.describe_index_stats()
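
To see the shape of the records without touching Pinecone, the sketch below rebuilds the same `(id, embedding, metadata)` tuples from plain lists. `build_records` and `fake_embed` are hypothetical helpers invented for this illustration; `fake_embed` only stands in for `retriever.encode(...)` and returns meaningless 2-dimensional vectors.

```python
def build_records(titles, contexts, embeddings, start):
    """Bundle parallel lists into (id, vector, metadata) tuples for one batch."""
    records = []
    for offset, (title, context, vector) in enumerate(zip(titles, contexts, embeddings)):
        metadata = {"title": title, "context": context}
        records.append((f"id-{start + offset}", vector, metadata))
    return records

def fake_embed(texts):
    # Hypothetical stand-in for retriever.encode(texts).tolist()
    return [[float(len(text)), 0.0] for text in texts]

# Tiny made-up dataset
titles = ["A", "A", "B"]
contexts = ["first passage", "second passage", "third"]
batch_size = 2

all_records = []
for i in range(0, len(contexts), batch_size):
    batch_titles = titles[i:i + batch_size]
    batch_contexts = contexts[i:i + batch_size]
    vectors = fake_embed(batch_contexts)
    all_records.extend(build_records(batch_titles, batch_contexts, vectors, start=i))

print([record[0] for record in all_records])
```

Because the ids are derived from the row position, re-running the upsert loop overwrites the existing records instead of adding duplicates.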


# Initialize Reader

We use the `deepset/electra-base-squad2` model from the HuggingFace model hub as our reader model. We load this model into a "question-answering" pipeline from HuggingFace transformers and feed it our questions and context passages individually. The model gives a prediction for each context we pass through the pipeline.

In [None]:
from transformers import pipeline

model_name = 'deepset/electra-base-squad2'
# load the reader model into a question-answering pipeline
reader = pipeline(tokenizer=model_name, model=model_name, task='question-answering', device=device)
reader

Now all the components we need are ready. Let's write some helper functions to execute our queries. The `get_context` function retrieves from the Pinecone index the context passages most likely to contain answers to our question, and the `extract_answer` function extracts the answers from these context passages.

In [17]:
def get_context(question, top_k=5):
    # encode the question into an embedding
    q_emb = retriever.encode([question]).tolist()

    # query Pinecone with the question embedding
    search_results = index.query(vector=q_emb[0], top_k=top_k, include_metadata=True)

    # extract the context texts from the results
    context_passages = [match['metadata']['context'] for match in search_results['matches']]

    return context_passages
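
Each match returned by the query also carries a similarity score, which `get_context` discards. As a minimal sketch, assuming matches shaped like the entries of `search_results['matches']`, the hypothetical helper below keeps the score next to each passage. The match values are made up for illustration.

```python
def passages_with_scores(matches):
    """Turn Pinecone-style match dicts into (score, context) pairs, highest score first."""
    pairs = [(match["score"], match["metadata"]["context"]) for match in matches]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)

# Made-up matches (ids, scores and texts are illustrative only)
fake_matches = [
    {"id": "id-1", "score": 0.31, "metadata": {"context": "passage one"}},
    {"id": "id-7", "score": 0.78, "metadata": {"context": "passage two"}},
]

print(passages_with_scores(fake_matches))
```

Keeping the score makes it easier to notice when even the best match is only weakly related to the question.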

In [18]:
from pprint import pprint

# extracts answer from the context passage
def extract_answer(question, context):
    results = []
    for c in context:
        # feed the reader the question and contexts to extract answers
        answer = reader(question=question, context=c)
        # add the context to answer dict for printing both together
        answer["context"] = c
        results.append(answer)
    # sort the results by the reader model's score, highest first
    sorted_result = sorted(results, key=lambda x: x['score'], reverse=True)
    # pprint returns None, so print first and return the sorted list itself
    pprint(sorted_result)
    return sorted_result

In [19]:
question = "How much oil is Egypt producing in a day?"
context = get_context(question, top_k = 1)
context

['Egypt was producing 691,000 bbl/d of oil and 2,141.05 Tcf of natural gas (in 2013), which makes Egypt as the largest oil producer not member of the Organization of the Petroleum Exporting Countries (OPEC) and the second-largest dry natural gas producer in Africa. In 2013, Egypt was the largest consumer of oil and natural gas in Africa, as more than 20% of total oil consumption and more than 40% of total dry natural gas consumption in Africa. Also, Egypt possesses the largest oil refinery capacity in Africa 726,000 bbl/d (in 2012). Egypt is currently planning to build its first nuclear power plant in El Dabaa city, northern Egypt.']

As we can see, the retriever is working fine and gets us the context passage that contains the answer to our question. Now let's use the reader to extract the exact answer from the context passage.

In [None]:
extract_answer(question, context)

The reader model extracted the correct answer, *691,000 bbl/d*, from the context passage with a score of about 99% (this is the model's confidence in the span, not an accuracy). Let's run a few more queries.

Let's run another question, this time with the top 3 context passages from the retriever.

**Here it worked less well.**

In [None]:
question = "where is kathmandu"
context = get_context(question, top_k=3)
extract_answer(question, context)

Here it answers very well. I looked into the Stanford SQuAD dataset and found different topics it may be able to answer well.

In [None]:
question = "where is the himalayas"
context = get_context(question, top_k=3)
extract_answer(question, context)

In [30]:
question = "how do i bake a choccolate cake?"
context = get_context(question, top_k=3)
extract_answer(question, context)

[{'answer': 'Alwar ka Mawa',
  'context': 'Rajasthani cooking was influenced by both the war-like '
             'lifestyles of its inhabitants and the availability of '
             'ingredients in this arid region. Food that could last for '
             'several days and could be eaten without heating was preferred. '
             'The scarcity of water and fresh green vegetables have all had '
             'their effect on the cooking. It is known for its snacks like '
             'Bikaneri Bhujia. Other famous dishes include bajre ki roti '
             '(millet bread) and lashun ki chutney (hot garlic paste), mawa '
             'kachori Mirchi Bada, Pyaaj Kachori and ghevar from Jodhpur, '
             'Alwar ka Mawa(Milk Cake), malpauas from Pushkar and rassgollas '
             'from Bikaner. Originating from the Marwar region of the state is '
             'the concept Marwari Bhojnalaya, or vegetarian restaurants, today '
             'found in many parts of India, which of

Here you can see it returns a passage about Rajasthani cooking, the closest match it could find in the source material.

The result looks pretty good.

In [None]:
pc.delete_index(index_name)

### Add a few more questions. What did you observe?

I observed that the answers were academic in tone, well formatted, and detailed. However, when a question fell outside what the dataset covers, the system started to hallucinate and answered something merely similar to what was asked. For example, for "how do i bake a choccolate cake?" it answered with the history of Rajasthani cooking, which was amusing and informative, but not what I was looking for.
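
One possible way to deal with such off-topic answers is to drop predictions whose reader score is low. The sketch below is only an illustration: `confident_answers` and the `min_score=0.5` cutoff are my own assumptions, and the sample predictions are made up rather than taken from a real run.

```python
def confident_answers(results, min_score=0.5):
    """Keep only predictions whose score reaches min_score, best first."""
    kept = [result for result in results if result["score"] >= min_score]
    return sorted(kept, key=lambda result: result["score"], reverse=True)

# Made-up predictions in the same shape as the reader output (illustrative scores)
sample = [
    {"answer": "answer A", "score": 0.02},
    {"answer": "answer B", "score": 0.99},
    {"answer": "answer C", "score": 0.60},
]

print(confident_answers(sample))
```

If nothing survives the cutoff, the function returns an empty list, which can be treated as "no answer found in the dataset". A sensible cutoff would have to be tuned on real questions.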