<a href="https://colab.research.google.com/github/solvedbrunus/lab-abstractive-question-answering/blob/main/lab-abstractive-question-answering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m211.0/211.0 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m480.6/480.6 kB[0m [31m23.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m9.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m179.3/179.3 kB[0m [31m7.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m7.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m12.3 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatib

# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Because indexing the entire dataset takes some time, we only use passages whose `section_title` is "History". The demo is designed for 50,000 such passages; the run recorded in this notebook keeps only 100 (see `total_doc_count` below) so that it finishes quickly. You can use the complete dataset if you want: the Pinecone vector database is built to manage millions of documents.

In [2]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True,
    trust_remote_code=True
).shuffle(seed=960)


We are loading the dataset in the streaming mode so that we don't have to wait for the whole dataset to download (which is over 9GB). Instead, we iteratively download records one at a time.

In [3]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [4]:
# keep only documents whose section_title is 'History'
history = wiki_data.filter(lambda x: x['section_title'] == 'History')

In [6]:
first_history = next(iter(history))



In [7]:
first_history

{'wiki_id': 'Q2644349',
 'start_paragraph': 10,
 'start_character': 397,
 'end_paragraph': 10,
 'end_character': 534,
 'article_title': 'Taupo District',
 'section_title': 'History',
 'passage_text': 'was not until the 1950s that the region started to develop, with forestry and the construction of the Wairakei geothermal power station.'}

Let's iterate through the dataset and apply our filter to select the historical passages, stopping at `total_doc_count` (100 in this run rather than the full 50,000). We will extract `article_title`, `section_title` and `passage_text` from each document.

In [8]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 100

counter = 0
docs = []
# iterate through the dataset and apply our filter
for d in tqdm(history, total=total_doc_count):

    # extract the fields we need - article, section, and passage
    doc = {
        'article': d['article_title'],
        'section': d['section_title'],
        'passage': d['passage_text']
    }
    docs.append(doc)

    # increase the counter on every iteration
    counter += 1
    # break the loop if the counter reaches the total_doc_count
    if counter == total_doc_count:
        break
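
The counter-and-break loop above can also be written with `itertools.islice`, which stops pulling records from a stream once enough have been collected. The following is a minimal sketch, not part of the original lab: `take_history` and `fake_stream` are hypothetical names, and the small in-memory iterator stands in for the streamed dataset so the example runs without a download.

```python
from itertools import islice

def take_history(stream, limit):
    """Return up to `limit` records whose section title is 'History'."""
    # lazily filter the stream; nothing is consumed until islice asks for it
    matching = (d for d in stream if d['section_title'] == 'History')
    return [
        {
            'article': d['article_title'],
            'section': d['section_title'],
            'passage': d['passage_text'],
        }
        for d in islice(matching, limit)
    ]

# tiny stand-in for the streamed dataset (hypothetical records)
fake_stream = iter([
    {'article_title': 'A', 'section_title': 'History', 'passage_text': 'p1'},
    {'article_title': 'B', 'section_title': 'Geography', 'passage_text': 'p2'},
    {'article_title': 'C', 'section_title': 'History', 'passage_text': 'p3'},
    {'article_title': 'D', 'section_title': 'History', 'passage_text': 'p4'},
])

sample_docs = take_history(fake_stream, limit=2)
print([d['article'] for d in sample_docs])
```

Because both the filter and `islice` are lazy, the fourth record is never read, which mirrors how streaming mode avoids downloading more of the dataset than needed.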



In [9]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)

df.head()

| | article | section | passage |
|---|---|---|---|
| 0 | Taupo District | History | was not until the 1950s that the region starte... |
| 1 | The Bishop Wand Church of England School | History | The Bishop Wand Church of England School Histo... |
| 2 | Surface Hill Uniting Church | History | in perpetual reminder that work and worship go... |
| 3 | The Electras (band) | History | as its B-side. However, copies of the single, ... |
| 4 | Swanton House | History | it. Lane provided funds for restoration by the... |


In [10]:
df.isna().sum()

article    0
section    0
passage    0


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free at [app.pinecone.io](https://app.pinecone.io/). We then initialize the connection as follows:

In [None]:
#import os
#from pinecone import Pinecone

#from dotenv import load_dotenv, find_dotenv
#_ = load_dotenv(find_dotenv())

# initialize connection to pinecone (get API key at app.pinecone.io)
#api_key = os.environ.get('PINECONE_API_KEY')


# configure client
#pc = Pinecone(api_key=api_key)

In [11]:
import os
from pinecone import Pinecone
from google.colab import userdata

# read the Pinecone API key from the Colab secrets store
PINECONE_API_KEY = userdata.get('PINECONE_API_KEY')


In [12]:
pc = Pinecone(api_key=PINECONE_API_KEY)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [13]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)



Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [23]:
index_name = "abstractive-question-answering" #give your index a meaningful name

In [22]:
# pc.delete_index(index_name)  # uncomment to delete the index and start over if you run into problems

In [16]:
!pip install --upgrade pinecone-client


In [24]:
import time

# check if index already exists
# list_indexes() returns index descriptions, so compare against their names
if index_name not in pc.list_indexes().names():
    # create the index
    pc.create_index(
        name=index_name,
        metric="cosine",
        dimension=768,
        spec=spec
    )
    # wait for the index to be created
    while not pc.describe_index(index_name).status['ready']:
        time.sleep(1)
    print(f"Index {index_name} created successfully")
else:
    print(f"Index {index_name} already exists")

# initialize the index
index = pc.Index(index_name)

# check index stats
stats = index.describe_index_stats()
print(f"Index {stats}")


Index abstractive-question-answering created successfully
Index {'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}
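
The `while not ... status['ready']` loop above waits indefinitely if the index never becomes ready. A minimal sketch of the same polling pattern with a timeout follows. `wait_until_ready` is a hypothetical helper, not part of the Pinecone client, and `fake_ready` is a stand-in readiness check so the example runs without a network connection.

```python
import time

def wait_until_ready(is_ready, timeout=60.0, poll_interval=1.0):
    """Poll `is_ready()` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while not is_ready():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"not ready after {timeout} seconds")
        time.sleep(poll_interval)

# stand-in readiness check that succeeds on the third poll
calls = {'n': 0}

def fake_ready():
    calls['n'] += 1
    return calls['n'] >= 3

wait_until_ready(fake_ready, timeout=5.0, poll_interval=0.01)
print(calls['n'])
```

In the notebook, the readiness check would be the expression already used in the loop above, wrapped in a function: `lambda: pc.describe_index(index_name).status['ready']`.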


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).

In [25]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from huggingface model hub
retriever = SentenceTransformer('flax-sentence-embeddings/all_datasets_v3_mpnet-base')
retriever = retriever.to(device)
retriever


SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
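
The `Normalize()` module at the end of the model scales every embedding to unit length. For unit-length vectors, cosine similarity is the same as the dot product, which is why a cosine index suits this retriever. The sketch below illustrates this with small hand-made vectors; the helper names and the three-dimensional values are illustrative assumptions, not real 768-dimensional embeddings.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# toy stand-ins for a query embedding and a passage embedding
query_vec = [3.0, 4.0, 0.0]
passage_vec = [4.0, 3.0, 0.0]

cos = cosine(query_vec, passage_vec)
dot_unit = dot(normalize(query_vec), normalize(passage_vec))
print(round(cos, 6), round(dot_unit, 6))
```

Both values agree, so once vectors are normalized the cheaper dot product gives the same ranking as cosine similarity.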

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.

In [28]:
# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(docs), batch_size)):
    # find end of batch
    i_end = min(i + batch_size, len(docs))
    # extract batch
    batch = df.iloc[i:i_end]
    # generate embeddings for the passages in this batch
    emb = retriever.encode(batch['passage'].tolist()).tolist()

    # create metadata and id for each vector
    metadata = [
        {
            'article': doc['article'],
            'section': doc['section'],
            'passage_text': doc['passage']
        } for doc in batch.to_dict('records') # Convert DataFrame to a list of dictionaries
    ]

    # create unique IDs for each vector
    ids = [f"doc_{i + j}" for j in range(len(batch))]

    #create upsert list
    to_upsert = list(zip(ids, emb, metadata))

    # upsert batch of vectors to pinecone
    index.upsert(vectors=to_upsert)

# check that we have all vectors in index
stats= index.describe_index_stats()



In [30]:
print(index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 100}},
 'total_vector_count': 100}
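
The batching logic used for the upsert can be sketched on its own, without the retriever or Pinecone. In this illustrative example, `batched` and `build_upsert_batch` are hypothetical helpers, and the zero vectors are placeholders for the output of `retriever.encode(...)`. The point is that IDs derived from the global offset stay unique across batches.

```python
def batched(items, batch_size):
    """Yield (start_offset, batch) pairs covering `items` in order."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

def build_upsert_batch(start, batch, embeddings):
    """Pair each document with an id, its embedding, and its metadata."""
    return [
        (
            f"doc_{start + j}",
            emb,
            {
                'article': doc['article'],
                'section': doc['section'],
                'passage_text': doc['passage'],
            },
        )
        for j, (doc, emb) in enumerate(zip(batch, embeddings))
    ]

# five toy documents, processed in batches of two
toy_docs = [
    {'article': f'A{i}', 'section': 'History', 'passage': f'text {i}'}
    for i in range(5)
]

all_ids = []
for start, batch in batched(toy_docs, batch_size=2):
    # placeholder embeddings; the notebook would call retriever.encode here
    fake_embeddings = [[0.0, 0.0, 0.0] for _ in batch]
    to_upsert = build_upsert_batch(start, batch, fake_embeddings)
    all_ids.extend(vec_id for vec_id, _, _ in to_upsert)

print(all_ids)
```

The last batch holds a single document, and slicing handles that without the explicit `min(...)` bound used in the notebook loop.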


# Initialize Generator

We will use ELI5 BART for the generator which is a Sequence-To-Sequence model trained using the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-To-Sequence models can take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190) and how ELI5 BART model was trained is available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [31]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)


All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [51]:
def query_pinecone(query, top_k, format_results=True):
    # generate embeddings for the query
    xq = retriever.encode(query).tolist()

    # search pinecone index for context passage with the answer
    xc = index.query(
        vector=xq,
        top_k=top_k,
        include_metadata=True
    )

    if format_results:
        formatted_results = []
        for match in xc['matches']:
            formatted_results.append({
                'score': round(match['score'], 3),
                'article': match['metadata']['article'],
                'text': match['metadata']['passage_text']  # passage text stored in the metadata at upsert time
            })
        return formatted_results

    return xc

In [52]:
def format_query(query, context):
    # format the query for the abstractive answer generator
    # accept either the raw Pinecone response or its list of matches
    matches = context if isinstance(context, list) else context['matches']

    # prefix each context passage with the <P> separator the generator expects
    passages = [f"<P> {m['metadata']['passage_text']}" for m in matches]

    # concatenate all context passages
    context = " ".join(passages)

    # return the formatted query
    return f"question: {query} context: {context}"


Let's test the helper functions. We will query the Pinecone index with the `query_pinecone` function we created earlier to get context passages and pass them to the `format_query` function.

In [53]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1, format_results=False) # Set format_results to False
result

{'matches': [{'id': 'doc_0',
              'metadata': {'article': 'Taupo District',
                           'passage_text': 'was not until the 1950s that the '
                                           'region started to develop, with '
                                           'forestry and the construction of '
                                           'the Wairakei geothermal power '
                                           'station.',
                           'section': 'History'},
              'score': 0.362797797,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}

In [54]:
from pprint import pprint

In [55]:
# format the query in the form generator expects the input
query = format_query(query, result)  # accepts the raw response or its list of matches
pprint(query)

('question: when was the first electric power system built? context: <P> was '
 'not until the 1950s that the region started to develop, with forestry and '
 'the construction of the Wairakei geothermal power station.')


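The formatted input has the structure the generator expects. Note that the retrieved passage is only loosely related to the question (similarity score of about 0.36), which is unsurprising with just 100 passages in the index. Now let's write a function to generate answers.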

In [60]:
def generate_answer(query):
    # tokenize the query, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], attention_mask=inputs["attention_mask"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # print the answer (the function returns None, as pprint does)
    pprint(answer)

In [61]:
generate_answer(query)

('The first wireless message was sent by a telegraph. It was sent by a '
 'telegraph operator.')


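The generator returns a fluent answer, but the answer shown above does not address the question that was asked. This is what to expect when the single retrieved passage is only loosely related to the question. Let's run some more queries.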

In [62]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5, format_results=False) #Set format_results to False to get the raw results
query = format_query(query, context['matches']) # Access the 'matches' key
generate_answer(query)

('The first wireless message was sent in the early 1900s. The first wireless '
 'message was sent by a telegraph cable from London to New York. The telegraph '
 'cable was connected to a te')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [63]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

PBX or key system and replacing it with a solution based on IP.  This IP solution is software driven only, and thereby does away with the need for "switching" equipment at a customer site (save the equipment necessary to connect to the outside world).  This created a new technology, now called IP telephony.  A system that uses IP-based telephony services only, rather than a legacy PBX or key system, is called an IP telephony solution.
With the advent of IP telephony the handset was no longer a digital device hanging off a copper loop from a PBX. 
---
Felice.
In the 1960s the village began to industrialize more heavily.
---
Roads, then headed via New York to Boston for repairs at the navy yard which she completed early in November. On the 16th, she loaded torpedoes at Newport and headed south to Charleston, where she arrived on the 18th. She remained there until the spring of 1922. On 29 May of that year, she got underway for a voyage which took her up the coast to Philadelphia; thence 

In this case, none of the retrieved passages mention the first wireless message, so the answer is not grounded in the context. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19:

In [65]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3, format_results=False) #Set format_results to False to get the raw results
query = format_query(query, context['matches']) # Access the 'matches' key
generate_answer(query)

('COVID-19 is a name for the COVID-19 program, which was a program that was '
 "developed by the United States Army in response to the Soviet Union's "
 'invasion of Afghanistan.')


In [66]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

the transfer of the Italian Consulate and installation of the Italian Institute of Culture in the building the theater.
---
Felice.
In the 1960s the village began to industrialize more heavily.
---
Major 'melioration' measures such as draining, deep ploughing (Tiefumbruch) and river regulation were supposed to increase the productivity of agriculture and even enabled arable farming. Intensive farming methods were used to grow maize as an animal feedstuff. These measures had been supported since the middle of the 20th century by various national and European subsidy programmes. This went so far that ditches dried out in summer, heath fires broke out and, during sustained periods of drought, the land was artificially watered.
In the 1990s a major rethink began. By leaving the land to regenerate and by reflooding it, attempts have been
---
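
One way to reduce answers built on irrelevant context is to drop low-scoring matches and decline to answer when none remain. The sketch below is an illustration rather than part of the original lab: `filter_matches` and `build_context_or_none` are hypothetical helpers, the `0.5` threshold is an arbitrary assumption that would need tuning on real queries, and `toy_matches` mimics the shape of the `matches` list returned by `index.query`.

```python
def filter_matches(matches, min_score):
    """Keep only matches whose similarity score reaches `min_score`."""
    return [m for m in matches if m['score'] >= min_score]

def build_context_or_none(matches, min_score=0.5):
    """Return the <P>-separated context string, or None if nothing is relevant enough."""
    kept = filter_matches(matches, min_score)
    if not kept:
        # signal to the caller that the generator should not be run
        return None
    return " ".join(f"<P> {m['metadata']['passage_text']}" for m in kept)

# toy matches shaped like the Pinecone query response shown earlier
toy_matches = [
    {'score': 0.72, 'metadata': {'passage_text': 'relevant passage'}},
    {'score': 0.21, 'metadata': {'passage_text': 'unrelated passage'}},
]

print(build_context_or_none(toy_matches))
print(build_context_or_none(toy_matches, min_score=0.9))
```

When the helper returns `None`, the caller can respond with a message such as "no relevant context found" instead of passing unrelated passages to the generator.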


Let’s finish with a final few questions.

In [67]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5, format_results=False)
query = format_query(query, context["matches"])
generate_answer(query)

('The War of Currents was a series of naval battles between the British and '
 'French navies during the Napoleonic Wars. The British and French navies '
 'fought each other in the Battle of')


In [69]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10,format_results=False)
query = format_query(query, context["matches"])
generate_answer(query)

('The first person to walk on the moon was Neil Armstrong, who walked on the '
 'moon in 1969.')


In [70]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3, format_results=False)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this counts as a project, but I'm pretty sure that the US "
 'Air Force was the most expensive organization in the world in the early 20th '
 'century. The US')


In [71]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')


Felice.
In the 1960s the village began to industrialize more heavily.
---
Tamás Nádas History In 1998 at the age of 29 Nádas got into connection with flying thanks to a pleasure flight. He liked it so much that he started his pilot course that day. After a few months he got his license.
He was not satisfied with all this so he got into a Z-142 and continued his aviation career with aerobatics. His complete aerobatics training ended in 2001. He flew YAK-18, YAK-52 and Z-726, then single-seat machines: YAK-55M, ACRO-230, CAP-231, Z-50LS and he made the audience of several Hungarian events happy.
A milestone in his aviation career was the year of
---
was not until the 1950s that the region started to develop, with forestry and the construction of the Wairakei geothermal power station.
---


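As we can see, the results are mixed. The answer about the first person on the Moon is correct, but the answers about the War of the Currents and NASA are wrong, and the contexts printed above for the NASA question contain nothing relevant to it.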

#### Add a few more questions

In [72]:
query = "Is there life on Mars?"
context = query_pinecone(query, top_k=3, format_results=False)
query = format_query(query, context["matches"])
generate_answer(query)

('Yes, there is. The Mars Reconnaissance Orbiter (MRO) has a radio telescope '
 "that can pick up radio signals from Mars. It's not very good, but it can "
 'pick')


### Conclusion

In this lab on abstractive question answering, I encountered some difficulties running the pre-populated code and had to change several things to get it working with Pinecone and the machine learning models. The system did not always produce the right answers, which shows that there is still room for improvement.