# LAB | Abstractive Question Answering

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Abstractive question answering generates multi-sentence answers to open-ended questions. It typically works by searching a large document store for relevant passages and then using those passages to generate an answer. This notebook demonstrates how to build an abstractive question-answering system with Pinecone. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [5]:
!pip uninstall -y fsspec

Found existing installation: fsspec 2024.9.0
Uninstalling fsspec-2024.9.0:
  Successfully uninstalled fsspec-2024.9.0


In [None]:
!pip uninstall -y gcsfs

Found existing installation: gcsfs 2024.10.0
Uninstalling gcsfs-2024.10.0:
  Successfully uninstalled gcsfs-2024.10.0


In [None]:
!pip install -qU gcsfs==2024.10.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.2.0 requires fsspec[http]<=2024.9.0,>=2023.1.0, but you have fsspec 2024.10.0 which is incompatible.

In [6]:
!pip install -qU fsspec==2024.10.0

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.2.0 requires fsspec[http]<=2024.9.0,>=2023.1.0, but you have fsspec 2024.10.0 which is incompatible.

In [3]:
!pip install -qU datasets pinecone-client sentence-transformers torch --use-deprecated=legacy-resolver

[?25h[31mERROR: pi

In [7]:
!pip show fsspec gcsfs datasets pinecone-client sentence-transformers torch

Name: fsspec
Version: 2024.10.0
Summary: File-system specification
Home-page: https://github.com/fsspec/filesystem_spec
Author: 
Author-email: 
License: BSD 3-Clause License

Copyright (c) 2018, Martin Durant
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this
  list of conditions and the following disclaimer.

* Redistributions in binary form must reproduce the above copyright notice,
  this list of conditions and the following disclaimer in the documentation
  and/or other materials provided with the distribution.

* Neither the name of the copyright holder nor the names of its
  contributors may be used to endorse or promote products derived from
  this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRES

In [None]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 1.33.0 requires gcsfs>=2023.3.0, which is not installed.

# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset would take some time, this demo uses only 50,000 passages whose `section_title` is "History". You can use the complete dataset if you prefer; Pinecone can handle millions of documents.

In [None]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'vblagoje/wikipedia_snippets_streamed',
    split='train',
    streaming=True
).shuffle(seed=960)

wikipedia_snippets_streamed.py:   0%|          | 0.00/4.58k [00:00<?, ?B/s]

The repository for vblagoje/wikipedia_snippets_streamed contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at https://hf.co/datasets/vblagoje/wikipedia_snippets_streamed.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N] y


We load the dataset in streaming mode so that we don't have to wait for the whole dataset (over 9 GB) to download. Instead, records are downloaded iteratively as we consume them.

In [None]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'wiki_id': 'Q7649565',
 'start_paragraph': 20,
 'start_character': 272,
 'end_paragraph': 24,
 'end_character': 380,
 'article_title': 'Sustainable Agriculture Research and Education',
 'section_title': "2000s & Evaluation of the program's effectiveness",
 'passage_text': "preserving the surrounding prairies. It ran until March 31, 2001.\nIn 2008, SARE celebrated its 20th anniversary. To that date, the program had funded 3,700 projects and was operating with an annual budget of approximately $19 million. Evaluation of the program's effectiveness As of 2008, 64% of farmers who had received SARE grants stated that they had been able to earn increased profits as a result of the funding they received and utilization of sustainable agriculture methods. Additionally, 79% of grantees said that they had experienced a significant improvement in soil quality though the environmentally friendly, sustainable methods that they were"}

In [None]:
# keep only documents whose section_title is exactly "History" (lazy generator)
history = (doc for doc in wiki_data if doc['section_title'] == "History")

Let's iterate through the dataset and apply our filter to select the 50,000 historical passages. We will extract `article_title`, `section_title` and `passage_text` from each document.

In [None]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 50000

counter = 0
docs = []
# iterate through the dataset and apply our filter
for d in tqdm(history, total=total_doc_count):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article_title": d["article_title"],
        "section_title": d["section_title"],
        "passage_text": d["passage_text"]
    })
    # increase the counter on every iteration
    counter += 1

    # Break the loop after collecting 50,000 documents
    if counter >= total_doc_count:
        break

  0%|          | 0/50000 [00:00<?, ?it/s]
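The filter-and-collect loop above can also be expressed with `itertools.islice`, which stops pulling from the stream as soon as enough matches have been collected. The sketch below is illustrative only: `take_matching` and `fake_stream` are hypothetical names, and `fake_stream` is a stand-in for the streamed dataset so the example runs without a download.

```python
from itertools import islice


def take_matching(stream, predicate, limit):
    """Lazily filter `stream` and return at most `limit` matching items."""
    matching = (item for item in stream if predicate(item))
    return list(islice(matching, limit))


def fake_stream():
    """Hypothetical stand-in for the streamed dataset: an endless record generator."""
    n = 0
    while True:
        title = "History" if n % 3 == 0 else "Overview"
        yield {"section_title": title, "passage_text": f"passage {n}"}
        n += 1


docs_sample = take_matching(
    fake_stream(), lambda d: d["section_title"] == "History", limit=5
)
print([d["passage_text"] for d in docs_sample])
```

Because both the generator expression and `islice` are lazy, the endless stream is never exhausted; only as many records as needed are read.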

In [None]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

| | article_title | section_title | passage_text |
|---|---|---|---|
| 0 | Reptile Database | History | Database when the founder, Peter Uetz, was a g... |
| 1 | Propaganda for Japanese-American internment | History | citizen-influenced farming conflicts with the ... |
| 2 | Richvale, California | History | Richvale, California History Legend says that ... |
| 3 | ROH Survival of the Fittest | History | Championship Wrestling's Shane Shamrock Memori... |
| 4 | Old Hall Hotel, Sandbach | History | lord of the manor of Sandbach. The first phas... |


In [None]:
df.shape

(50000, 3)

# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can later retrieve using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need a Pinecone API key. You can get one for free [here](https://app.pinecone.io/). We then initialize the connection as follows:

In [8]:
import os
from pinecone import Pinecone
from google.colab import userdata

# initialize connection to pinecone (get API key at app.pinecone.io)
api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'

# configure client
pc = Pinecone(api_key=api_key)
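Note that the cell above falls back to the literal placeholder string `'PINECONE_API_KEY'` when the environment variable is unset, so a missing key would likely surface only later as an authentication error. A minimal sketch of failing early instead, using only the standard library; `get_api_key` is a hypothetical helper, not part of the Pinecone client:

```python
import os


def get_api_key(name="PINECONE_API_KEY", env=None):
    """Return the key stored under `name`, or raise if it is missing or empty."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before connecting.")
    return key


# Usage with an explicit mapping (a made-up value, for illustration only).
print(get_api_key(env={"PINECONE_API_KEY": "example-key"}))
```

The returned value could then be passed to `Pinecone(api_key=...)` in place of the fallback expression.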

Now we set up our index specification, which defines the cloud provider and region where the index will be deployed. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [None]:
'''from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)'''

In [None]:
from pinecone import Pinecone, ServerlessSpec

spec = ServerlessSpec(
    cloud="aws", region="us-east-1"
)

Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We set the metric to "cosine" and the dimension to 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimensional vectors.

In [18]:
index_name = "abstractive-question-answering"  # give your index a meaningful name

# check if the abstractive-question-answering index exists
if index_name not in pc.list_indexes().names():
    # create the index if it does not exist
    pc.create_index(
        name=index_name,
        dimension=768,  # must match the retriever's embedding size
        metric="cosine",  # must match the retriever's similarity metric
        spec=spec
    )

# connect to the abstractive-question-answering index we created
index = pc.Index(index_name)

In [None]:
import time

# a freshly created index should report zero vectors
index_stats = index.describe_index_stats()

print(index_stats)

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}


# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that the questions and passages that hold the answers to our queries are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use Cosine Similarity to compute the similarity between query and context vectors generated by this model (Pinecone automatically does this for us).

In [16]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load the retriever model from huggingface model hub
retriever = SentenceTransformer("flax-sentence-embeddings/all_datasets_v3_mpnet-base").to(device)

# Output the retriever to verify it was loaded successfully
retriever

SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
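The model summary above ends with a `Normalize()` layer, which suggests the embeddings are scaled to unit length. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below checks this with made-up 3-dimensional vectors (real embeddings from this model have 768 dimensions); the helper names are illustrative.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [1.0, 2.0, 2.0]
passage_vec = [2.0, 1.0, 2.0]

# Both lines print about 0.889 (that is, 8/9).
print(cosine(query_vec, passage_vec))
print(dot(normalize(query_vec), normalize(passage_vec)))
```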

# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches to help us more quickly generate embeddings and upload them to the Pinecone index. When passing the documents to Pinecone, we need an id (a unique value), context embedding, and metadata for each document representing context passages in the dataset. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, passage text, etc.
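As a quick check of the batching arithmetic, the sketch below computes the slice boundaries for 50,000 passages in batches of 64. `batch_ranges` is a hypothetical helper written for this illustration.

```python
def batch_ranges(n_items, batch_size):
    """Yield (start, end) index pairs covering range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


ranges = list(batch_ranges(50_000, 64))
print(len(ranges))   # 782 batches
print(ranges[0])     # (0, 64)
print(ranges[-1])    # (49984, 50000): the last batch holds only 16 passages
```

The `min(...)` call is what keeps the final, shorter batch from running past the end of the data.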

In [None]:
# we will use batches of 64
batch_size = 64

# create embeddings for passage_text and include the metadata in each batch
for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    end = min(i + batch_size, len(df))
    # extract batch
    batch = df.iloc[i:end]
    # generate embeddings for batch
    embeddings = retriever.encode(batch['passage_text'].tolist()).tolist()

    # prepare the id, vector, and metadata for each document in the batch
    to_upsert = [
        {
            # NOTE: idx restarts at 0 in every batch, so each upsert overwrites
            # the previous batch's vectors; the next cell offsets the id by i
            "id": f"{idx}",
            "values": embedding,  # The embedding vector
            "metadata": {
                "article_title": row['article_title'],
                "section_title": row['section_title'],
                "passage_text": row['passage_text']
            }
        }
        for idx, (embedding, row) in enumerate(zip(embeddings, batch.to_dict(orient="records")))
    ]

    # Upsert the batch to Pinecone
    _= index.upsert(vectors=to_upsert)

# Check the statistics of the index to verify all vectors are added
index_stats = index.describe_index_stats()
print("Index stats:", index_stats)

  0%|          | 0/782 [00:00<?, ?it/s]

Index stats: {'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 64}},
 'total_vector_count': 64}
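The count of 64 reported above is consistent with IDs that restart at 0 in every batch: an upsert replaces any existing vector that has the same ID. The sketch below reproduces the effect with a plain dict standing in for the index. This is an assumption made for illustration; `upsert_all` is hypothetical and does not call the Pinecone client.

```python
def upsert_all(n_docs, batch_size, offset_ids):
    """Upsert n_docs placeholder records into a dict and return its final size."""
    fake_index = {}  # stand-in for a vector index: id -> record
    for start in range(0, n_docs, batch_size):
        end = min(start + batch_size, n_docs)
        for idx in range(end - start):
            # batch-local id (idx) versus globally unique id (start + idx)
            doc_id = str(start + idx) if offset_ids else str(idx)
            fake_index[doc_id] = {"source_row": start + idx}  # same id overwrites
    return len(fake_index)


print(upsert_all(50_000, 64, offset_ids=False))  # 64
print(upsert_all(50_000, 64, offset_ids=True))   # 50000
```

Offsetting each ID by the batch start makes it unique across the whole dataset, so no record is overwritten.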


In [None]:
# We will use batches of 64
batch_size = 64

# Add a counter to track the number of batches processed
batch_counter = 0

# You will create embedding for the passage_text variable and include the metadata in each batch
for i in tqdm(range(0, len(df), batch_size)):
    # Increment the batch counter
    batch_counter += 1

    # Find end of batch
    end = min(i + batch_size, len(df))
    # Extract batch
    batch = df.iloc[i:end]

    # Debug: Print batch info
    print(f"Processing batch {batch_counter} with records from index {i} to {end}")

    # Generate embeddings for batch
    embeddings = retriever.encode(batch['passage_text'].tolist()).tolist()

    # Prepare the metadata for each document in the batch
    to_upsert = [
        {
            "id": f"{idx + i}",  # Unique ID for the document (offset by i for global uniqueness)
            "values": embedding,  # The embedding vector
            "metadata": {
                "article_title": row['article_title'],
                "section_title": row['section_title'],
                "passage_text": row['passage_text']
            }
        }
        for idx, (embedding, row) in enumerate(zip(embeddings, batch.to_dict(orient="records")))
    ]

    # Debug: Print the size of the current batch and a sample document
    print(f"Batch size: {len(to_upsert)}, Sample document ID: {to_upsert[0]['id']}")

    # Upsert the batch to Pinecone
    _ = index.upsert(vectors=to_upsert)

    # Optional: Debugging total vector count after each upsert
    index_stats = index.describe_index_stats()
    print(f"Current vector count after batch {batch_counter}: {index_stats['total_vector_count']}")

# Final statistics
index_stats = index.describe_index_stats()
print("Final Index stats:", index_stats)

  0%|          | 0/782 [00:00<?, ?it/s]

Processing batch 1 with records from index 0 to 64
Batch size: 64, Sample document ID: 0
Current vector count after batch 1: 64
Processing batch 2 with records from index 64 to 128
Batch size: 64, Sample document ID: 64
Current vector count after batch 2: 64
Processing batch 3 with records from index 128 to 192
Batch size: 64, Sample document ID: 128
Current vector count after batch 3: 64
Processing batch 4 with records from index 192 to 256
Batch size: 64, Sample document ID: 192
Current vector count after batch 4: 64
Processing batch 5 with records from index 256 to 320
Batch size: 64, Sample document ID: 256
Current vector count after batch 5: 64
Processing batch 6 with records from index 320 to 384
Batch size: 64, Sample document ID: 320
Current vector count after batch 6: 64
Processing batch 7 with records from index 384 to 448
Batch size: 64, Sample document ID: 384
Current vector count after batch 7: 64
Processing batch 8 with records from index 448 to 512
Batch size: 64, Sample

In [19]:
index_stats = index.describe_index_stats()  # confirm that all 50,000 vectors were upserted

print(index_stats)

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 50000}},
 'total_vector_count': 50000}


# Initialize Generator

We will use ELI5 BART as the generator. It is a sequence-to-sequence model trained on the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and details of how the ELI5 BART model was trained are available [here](https://yjernite.github.io/lfqa.html).

Let's initialize the BART model using transformers.

In [10]:
import torch

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [12]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are now ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [13]:
def query_pinecone(query, top_k):
    # generate embeddings for the query
    xq = retriever.encode([query]).tolist()
    # search pinecone index for context passage with the answer
    xc = index.query(
        vector=xq[0],  # Query vector
        top_k=top_k,  # Number of results to retrieve
        include_metadata=True  # Include metadata in the results
    )
    return xc # Return the query results
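Conceptually, a `top_k` query ranks the stored vectors by similarity to the query vector and returns the best few. The brute-force sketch below illustrates that idea on a tiny in-memory store with made-up 2-dimensional vectors. It is not how Pinecone is implemented (a vector database avoids scoring every vector), and `brute_force_query` is a hypothetical function whose return shape only loosely mimics the query result.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_query(store, vector, top_k):
    """Score every stored vector against `vector` and keep the top_k best."""
    scored = (
        {"id": doc_id, "score": cosine(vector, rec["values"]), "metadata": rec["metadata"]}
        for doc_id, rec in store.items()
    )
    return {"matches": heapq.nlargest(top_k, scored, key=lambda m: m["score"])}


# Made-up records for illustration.
store = {
    "0": {"values": [1.0, 0.0], "metadata": {"passage_text": "about radio"}},
    "1": {"values": [0.0, 1.0], "metadata": {"passage_text": "about railways"}},
    "2": {"values": [0.8, 0.6], "metadata": {"passage_text": "about telegraphy"}},
}
toy_result = brute_force_query(store, [1.0, 0.1], top_k=2)
print([m["id"] for m in toy_result["matches"]])  # ['0', '2']
```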

In [14]:
def format_query(query, context):
    # extract passage_text from each Pinecone match and prefix it with the <P> tag
    context_passages = [f"<P> {m['metadata']['passage_text']}" for m in context]
    # concatenate all context passages
    context_text = " ".join(context_passages)
    # concatenate the query and context passages
    query = f"Question: {query} Context: {context_text}"
    return query

Let's test the helper functions. We will query the Pinecone index with `query_pinecone` to get context passages and then pass them to the `format_query` function.

In [20]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '18787',
              'metadata': {'article_title': 'Pundooah railway station',
                           'passage_text': '25 kV AC, in 1958.',
                           'section_title': 'History'},
              'score': 0.591892362,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 6}}

In [21]:
from pprint import pprint

In [22]:
# format the query in the form generator expects the input
query = format_query(query, result["matches"])
pprint(query)

('Question: when was the first electric power system built? Context: <P> 25 kV '
 'AC, in 1958.')


The query is formatted as expected. Now let's write a function to generate answers.

In [23]:
def generate_answer(query):
    # tokenize the query, explicitly truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # pretty-print the answer (the function itself returns None)
    pprint(answer)

In [24]:
generate_answer(query)


('Electricity was invented in the 19th century. The first electric power '
 'system was a steam engine. The first steam engine was built in 1812.')


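The generator returns a fluent answer, but note that with `top_k=1` the only retrieved passage ("25 kV AC, in 1958.") says little about the question, so the answer is not grounded in it. Let's run some more queries with more context passages.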

In [25]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by a telegraph. The first telegraph was '
 'built in the early 1800s, and was used to send messages between London and '
 'Paris. The first telegraph')


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [26]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

the wireless telegraphy or "spark" era, primitive radio transmitters called spark gap transmitters were used, which generated radio waves by an electric spark.  These transmitters were unable to produce the continuous sinusoidal waves which are used to transmit audio (sound) in modern AM or FM radio transmission. Instead spark gap transmitters transmitted information by wireless telegraphy; the user turned the transmitter on and off rapidly by tapping on a telegraph key, producing pulses of radio waves which spelled out text messages in Morse code.  Therefore, the radio receivers of this era did not have to demodulate the radio
---
Amplitude modulation History Although AM was used in a few crude experiments in multiplex telegraph and telephone transmission in the late 1800s, the practical development of amplitude modulation is synonymous with the development between 1900 and 1920 of "radiotelephone" transmission, that is, the effort to send sound (audio) by radio waves.  The first radi

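In this case, the retrieved passages are on topic (wireless telegraphy), although the details in the generated answer should still be checked against them. If we ask a question for which no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19: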

In [27]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('COVID-19 is a program that was created by the United States Department of '
 'Defense in response to the outbreak of cholera in the United States in the '
 'early 1990s. The CDC')


In [28]:
for doc in context["matches"]:
    print(doc["metadata"]["passage_text"], end='\n---\n')

from bombing by the German Luftwaffe in May 1940, in the course of the Battle of Belgium.
---
History The 19th century held major discoveries in medicine and public health. The Broad Street cholera outbreak of 1854 was central to the development of modern epidemiology. The microorganisms responsible for malaria and tuberculosis were identified in 1880 and 1882, respectively. The 20th century saw the development of preventive and curative treatments for many diseases, including the BCG vaccine (for tuberculosis) and penicillin in the 1920s. The eradication of smallpox, with the last naturally occurring case recorded in 1977, raised hope that other diseases could be eradicated as well.
Important steps were taken towards global cooperation in health with the formation
---
Europe.
---
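One hedged way to guard against answers generated from irrelevant context is to drop low-scoring matches before formatting the query, and to skip generation when nothing remains. The sketch below assumes matches shaped like the query results shown earlier; `filter_matches` and the 0.5 threshold are hypothetical. Similarity scores are not calibrated relevance probabilities, so any threshold would need tuning on real queries.

```python
def filter_matches(matches, min_score):
    """Keep only matches whose similarity score is at least `min_score`."""
    return [m for m in matches if m["score"] >= min_score]


# Made-up matches for illustration.
matches = [
    {"id": "a", "score": 0.62, "metadata": {"passage_text": "relevant passage"}},
    {"id": "b", "score": 0.21, "metadata": {"passage_text": "unrelated passage"}},
]

kept = filter_matches(matches, min_score=0.5)
if kept:
    print([m["id"] for m in kept])
else:
    print("No sufficiently similar context found; skipping generation.")
```

A score cut-off is a crude filter: the first query in this notebook retrieved a weakly related passage with a score of about 0.59, so a threshold alone would not have excluded it.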


Let’s finish with a final few questions.

In [29]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The War of the Currents was a battle between Edison and Westinghouse over '
 "the contract to light the 1893 Chicago World's Fair as well as build the "
 "world's largest hydro-electric system")


In [30]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first man to walk on the Moon was Neil Armstrong, who walked on the Moon '
 'in 1969. He was the first man to walk on the Moon, and the first man to walk '
 'on')


In [31]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost $2.6 billion to build.')


As we can see, the model can generate some decent answers.

#### Add a few more questions

In [32]:
query = "What is the tallest mountain in the world?"
context = query_pinecone(query, top_k=1)
query = format_query(query, context["matches"])
generate_answer(query)

('The tallest mountain in the world is Mount Everest, which is the tallest '
 'mountain in the world.')


In [33]:
query = "Who invented the light bulb?"
context = query_pinecone(query, top_k=2)
query = format_query(query, context["matches"])
generate_answer(query)

('The light bulb was invented in 1801 by Sir Humphry Davy in an 1801 paper '
 "published in William Nicholson's Journal of Natural Philosophy and the Arts.")


In [34]:
query = "What caused the Great Depression?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Great Depression was caused by a combination of two things: 1. The Great '
 'Depression was caused by a combination of two things: 1. The Great '
 'Depression was caused by a combination of two')


In [35]:
query = "How has artificial intelligence impacted healthcare in the 21st century?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is what you're looking for, but I can tell you that "
 'there is a lot of work being done in the field of artificial intelligence in '
 'the medical field.')


In [36]:
query = "What are the similarities and differences between black holes and neutron stars?"
context = query_pinecone(query, top_k=7)
query = format_query(query, context["matches"])
generate_answer(query)

('A neutron star is a neutron star that is so massive that it has no escape '
 'velocity. A black hole is a black hole that is so massive that it has no '
 'escape velocity.')


# Index Deletion

In [37]:
pc.delete_index(index_name)