# Abstractive Question Answering

Question answering systems fall into two types:
- **Extractive QA** selects the portion of the source text that directly contains the answer to a given question, locating specific phrases, sentences, or paragraphs within the source document.
- **Abstractive QA** generates a concise and coherent answer from an understanding of the question and the source text. Instead of copying a span from the source, it synthesizes information from various parts of the text and may use words or phrases that do not appear in the original.

In this tutorial we will use Qdrant to build an abstractive question-answering system. It works like a chatbot: you ask a question and it returns an answer. To do so, we convert documents with relevant information to vectors and store them in a Qdrant collection. We can then search the collection with a query vector and generate an answer from the passages we retrieve. We will need three things:
- **Qdrant**: stores our documents and searches for the passages most relevant to our query.
- **Retriever model**: it will be used for embedding context passages.
- **Generator model**: to generate answers.
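
The three components fit together as a retrieve-then-generate loop. The sketch below shows only that control flow, under loud assumptions: `embed`, `retrieve`, and `generate` are hypothetical toy stand-ins written for this illustration, not the Cohere, Qdrant, or T5 calls used later in the notebook.

```python
def embed(text):
    # Toy stand-in for the retriever model: the set of lowercased words.
    return set(text.lower().split())


def retrieve(query, store, top_k=1):
    # Toy stand-in for Qdrant: rank stored passages by word overlap with the query.
    query_words = embed(query)
    ranked = sorted(
        store,
        key=lambda item: len(query_words & item["vector"]),
        reverse=True,
    )
    return [item["text"] for item in ranked[:top_k]]


def generate(query, contexts):
    # Toy stand-in for the generator model: it only assembles its inputs.
    return "question: %s | context: %s" % (query, " ".join(contexts))


# Hypothetical passages, for illustration only.
passages = [
    "Neptune is the eighth planet from the Sun.",
    "The Sahara is a desert in Africa.",
]
store = [{"text": p, "vector": embed(p)} for p in passages]

query = "Which planet is the eighth from the Sun?"
contexts = retrieve(query, store)
print(generate(query, contexts))
```

In the real pipeline, each stand-in is replaced by a model or service, but the order of the steps stays the same: embed the query, retrieve passages, then generate.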

Don't worry if you do not understand everything yet, we will go through each step in the notebook.
Let's start:

## Install Dependencies
Let's get started by installing the packages needed for the notebook to run:

In [2]:
!pip install -qU datasets==2.12.0 qdrant-client==1.2.0 sentence_transformers==2.2.2 \
    torch==2.0.1 tqdm==4.65.0 cohere==4.11.2 transformers[sentencepiece]

## Import libraries

In [3]:
from datasets import load_dataset
from tqdm.auto import tqdm
import pandas as pd
from qdrant_client import QdrantClient
from qdrant_client.http import models
from pathlib import Path
import torch
import cohere
import os
from sentence_transformers import SentenceTransformer
from transformers import AutoModelWithLMHead, AutoTokenizer

## Load Dataset

We will use the SQuAD dataset, a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles. Its training split contains 87,599 records.

In [4]:
# load the dataset from huggingface
dataset = load_dataset("squad", split="train")
dataset

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

## Prepare dataset
We only need the context column from the dataset, so we will load it into a pandas DataFrame and drop duplicate contexts.

In [5]:
# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(dataset["context"])
df = df.drop_duplicates()
print(len(df))
df.head()

18891


|    | 0 |
|----|---|
| 0  | Architecturally, the school has a Catholic cha... |
| 5  | As at most other universities, Notre Dame's st... |
| 10 | The university is the major seat of the Congre... |
| 15 | The College of Engineering was established in ... |
| 20 | All of Notre Dame's undergraduate students are... |


After removing duplicates we are left with 18,891 unique contexts.
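
In SQuAD, several questions share the same context paragraph, which is why the row labels in the output above jump (0, 5, 10, ...): `drop_duplicates` keeps the first occurrence of each context by default and preserves its original index label. The following is a minimal pure-Python sketch of that keep-first behavior; `drop_duplicates_keep_first` is a hypothetical helper written for this illustration, not a pandas function.

```python
def drop_duplicates_keep_first(items):
    # Keep the first occurrence of each item together with its original position.
    seen = set()
    kept = []
    for index, item in enumerate(items):
        if item not in seen:
            seen.add(item)
            kept.append((index, item))
    return kept


# Hypothetical contexts: "ctx A" is shared by three questions, "ctx C" by two.
contexts = ["ctx A", "ctx A", "ctx B", "ctx A", "ctx C", "ctx C"]
unique = drop_duplicates_keep_first(contexts)
print(unique)
```

Note that the upsert loop later in the notebook assigns point ids by position (`range(i, i_end)`), not by these original index labels.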

## Initialize Qdrant client

In [6]:
# Initialize Qdrant client

current_folder = Path.cwd()  # Get the current folder
qdrant_folder = current_folder / "qdrant"
qdrant_folder.mkdir(exist_ok=True)  # Create the qdrant folder if it does not exist yet

client = QdrantClient(path=str(qdrant_folder.resolve()))  # local on-disk storage in the qdrant folder

## Create new collection
Now that the data is ready, we can set up our Qdrant collection to store it.

In [7]:
context_collection = "abstractive-question-answering"

collections = client.get_collections()
print(collections)

# only create the collection if it doesn't exist yet;
# get_collections() returns a response object, so compare against the collection names
existing_names = [collection.name for collection in collections.collections]
if context_collection not in existing_names:
    client.create_collection(
        collection_name=context_collection,
        vectors_config=models.VectorParams(
            size=1024,  # must match the dimensionality of the vectors output by the retriever model
            distance=models.Distance.COSINE,  # metric used to measure similarity
        ),
    )
collections = client.get_collections()
print(collections)

collections=[]
collections=[CollectionDescription(name='abstractive-question-answering')]


## Initialize Retriever
We will use Cohere embeddings to create vector representations of both our records and our search queries. These embeddings capture the semantic meaning of the text. During retrieval, a similarity measure (here, cosine similarity) is applied in vector space to find the records most similar to a given query.
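
To make the similarity step concrete, here is a small worked sketch of cosine similarity on hand-written 3-dimensional vectors. The vectors and document names are made up for illustration; real embeddings from the retriever have 1024 dimensions, and Qdrant performs this ranking internally.

```python
import math


def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors standing in for embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Cosine similarity depends only on direction, not on vector length, which is why `doc_a` scores highest even though it is twice as long as the query vector.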

In [9]:
# set device to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Initialize cohere client using your api key, you can get your api key from cohere after signing up
COHERE_API_KEY = os.getenv("COHERE_API_KEY")
cohere_client = cohere.Client(COHERE_API_KEY)
cohere_client

<cohere.client.Client at 0x7ff2ceca2d10>

## Generate Embeddings -> Store in Qdrant

Next, we need to generate embeddings for the context passages.

When passing the documents to Qdrant, each point needs:

1. an id (a unique integer value),
2. a context embedding, and
3. a payload. The payload is a dictionary containing data relevant to the embedding; here we store the original context passage as the payload.
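
The batching logic can be sketched without any external service. The example below only shows how ids and payloads line up with each slice of texts; `make_batches` and the `"context"` payload key are hypothetical names chosen for this sketch (the notebook itself keeps the DataFrame column name `0` as the payload key), and the embedding call is left out.

```python
import math


def make_batches(texts, batch_size):
    # Yield consecutive slices of texts with matching ids and payloads.
    for start in range(0, len(texts), batch_size):
        end = min(start + batch_size, len(texts))
        chunk = texts[start:end]
        yield {
            "ids": list(range(start, end)),  # positional, unique integer ids
            "texts": chunk,  # what would be sent to the embedding model
            "payloads": [{"context": text} for text in chunk],  # hypothetical key
        }


# Hypothetical passages, for illustration only.
texts = ["passage %d" % n for n in range(5)]
batches = list(make_batches(texts, batch_size=2))
for batch in batches:
    print(batch["ids"], batch["texts"])

# With 18,891 contexts and a batch size of 512, the loop runs this many times:
num_batches = math.ceil(18891 / 512)
print(num_batches)  # 37
```

The last batch is simply shorter than the others, which is why the end index is clamped with `min`.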

In [61]:
%%time

batch_size = 512  # specify batch size according to your RAM and compute, higher batch size = more RAM usage

for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:i_end]

    # Generate embeddings using Cohere
    emb = cohere_client.embed(model="small", texts=batch[0].tolist()).embeddings
    # make sure every component is a plain Python float
    emb = [[float(value) for value in vector] for vector in emb]

    # get metadata
    meta = batch.to_dict(orient="records")
    # create unique IDs
    ids = list(range(i, i_end))
    # upsert to qdrant
    client.upsert(
        collection_name=context_collection,
        points=models.Batch(ids=ids, vectors=emb, payloads=meta),
    )
print(
    "vector count in collection- ",
    client.get_collection(collection_name=context_collection).vectors_count,
)

  0%|          | 0/37 [00:00<?, ?it/s]

vector count in collection-  18891
CPU times: user 1min 44s, sys: 1.35 s, total: 1min 46s
Wall time: 3min 8s


## Initialize Generator model
We will use **tuner007/t5_abs_qa** from Hugging Face to generate answers to our questions. It is a T5-base model fine-tuned for abstractive QA using a text-to-text approach. First we search Qdrant for the context relevant to our question, then we pass both the question and the context to the generator model, which produces the answer.

In [13]:
# load tokenizer and model from huggingface

tokenizer = AutoTokenizer.from_pretrained("tuner007/t5_abs_qa")
model = AutoModelWithLMHead.from_pretrained("tuner007/t5_abs_qa")
model = model.to(device)

The `xla_device` argument has been deprecated in v4.4.0 of Transformers. It is ignored and you can safely remove it from your `config.json` file.

Now we will write two helper functions: **query_qdrant**, which searches the collection for context relevant to our query, and **generate_answer**, which generates an answer from the query and the context returned by query_qdrant.

In [18]:
def query_qdrant(query, top_k):
    # generate embeddings for the query
    encoded_query = cohere_client.embed(model="small", texts=[query]).embeddings[0]
    result = client.search(
        collection_name=context_collection,
        query_vector=encoded_query,
        limit=top_k,
    )  # search qdrant collection for context passage with the answer
    return result

In [53]:
def generate_answer(query, results):
    # concatenate the retrieved passages into a single context string
    context = "".join(result.payload[0] for result in results)
    # prompt format used with this model
    input_text = "context: %s <question for context: %s </s>" % (context, query)
    # truncation=True is needed for max_length to take effect
    features = tokenizer(
        [input_text], return_tensors="pt", max_length=1024, truncation=True
    )
    out = model.generate(
        input_ids=features["input_ids"].to(device),
        attention_mask=features["attention_mask"].to(device),
        num_beams=2,
        min_length=10,
        max_length=40,
    )
    # skip_special_tokens drops the <pad> and </s> markers from the decoded text
    return tokenizer.decode(out[0], skip_special_tokens=True)
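
To see exactly what text the generator receives, the prompt-building step of generate_answer can be run on its own. This is a small sketch: `build_input_text` is a hypothetical helper that mirrors the template in generate_answer, and the passages are made up for illustration.

```python
def build_input_text(contexts, question):
    # Join the retrieved passages, then wrap them in the prompt template
    # that generate_answer passes to the tokenizer.
    context = "".join(contexts)
    return "context: %s <question for context: %s </s>" % (context, question)


# Hypothetical passages, for illustration only.
contexts = ["Neptune is the eighth planet. ", "It orbits the Sun."]
input_text = build_input_text(contexts, "Which planet is the eighth?")
print(input_text)
```

Printing the assembled string is a quick way to check what is sent to the model when an answer looks wrong, for example whether the relevant passage was retrieved at all.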

Let's test the query_qdrant function and see which contexts it retrieves.

In [68]:
query = "How many planets are there in the solar system?"
results = query_qdrant(query, top_k=2)
results

[ScoredPoint(id=15607, version=0, score=0.44769913243169046, payload={0: "Neptune is the eighth and farthest known planet from the Sun in the Solar System. It is the fourth-largest planet by diameter and the third-largest by mass. Among the giant planets in the Solar System, Neptune is the most dense. Neptune is 17 times the mass of Earth and is slightly more massive than its near-twin Uranus, which is 15 times the mass of Earth and slightly larger than Neptune.[c] Neptune orbits the Sun once every 164.8 years at an average distance of 30.1 astronomical units (4.50×109 km). Named after the Roman god of the sea, its astronomical symbol is ♆, a stylised version of the god Neptune's trident."}, vector=None),
 ScoredPoint(id=15617, version=0, score=0.4362614019194972, payload={0: 'From its discovery in 1846 until the subsequent discovery of Pluto in 1930, Neptune was the farthest known planet. When Pluto was discovered it was considered a planet, and Neptune thus became the penultimate kno

Now we pass the query and the results to the generate_answer function.

In [69]:
answer = generate_answer(query, results)
answer

'There are eight planets in the solar system.'

As we can see, neither of the two retrieved contexts contains the model's answer word for word. Let's run some more queries.

In [70]:
query = "Who invented the telephone?"
results = query_qdrant(query, top_k=2)
answer = generate_answer(query, results)
answer

'Alexander Graham Bell was a scientist and inventor who invented the telephone in 1847.'

When the retrieved context does not contain the answer, the model tells us so instead of producing an irrelevant or vague answer.

In [73]:
query = "where did the COVID-19 pandemic originate?"
results = query_qdrant(query, top_k=3)
answer = generate_answer(query, results)
answer

'No answer available in context - no answer available in context'
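
Because the model signals a missing answer with a recognizable phrase, downstream code can check for it before showing a result. The sketch below assumes the phrase always appears as in the output above; `has_answer` and `NO_ANSWER_MARKER` are hypothetical names introduced here, and the exact wording may differ for other inputs.

```python
# Assumption: the marker matches the phrase observed in the output above.
NO_ANSWER_MARKER = "no answer available in context"


def has_answer(answer):
    # Case-insensitive check for the model's "no answer" phrase.
    return NO_ANSWER_MARKER not in answer.lower()


answers = [
    "The Sahara is the largest desert in the world.",
    "No answer available in context - no answer available in context",
]
flags = [has_answer(answer) for answer in answers]
print(flags)
```

A caller could use this flag to retry with a larger `top_k` or to fall back to a default message.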

In [76]:
query = "What is the largest desert in the world?"
results = query_qdrant(query, top_k=3)
answer = generate_answer(query, results)
answer

'The Sahara is the largest desert in the world.'

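The model gives reasonable answers to most of our questions, but generated details should still be checked: the telephone answer above gives 1847, which is Bell's birth year, not the year the telephone was invented (the patent dates to 1876). You can try more queries if you want.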

In [None]:
client.delete_collection(collection_name=context_collection)

True