# LAB | Abstractive Question Answering

Abstractive question-answering focuses on the generation of multi-sentence answers to open-ended questions. It usually works by searching massive document stores for relevant information and then using this information to synthetically generate answers. This notebook demonstrates how Pinecone helps you build an abstractive question-answering system. We need three main components:

- A vector index to store and run semantic search
- A retriever model for embedding context passages
- A generator model to generate answers

# Install Dependencies

In [1]:
!pip install -qU datasets pinecone-client==3.1.0 sentence-transformers torch

# Load and Prepare Dataset

Our source data comes from the Wiki Snippets dataset, which contains over 17 million passages from Wikipedia. Since indexing the entire dataset may take some time, this demo uses only 50,000 passages whose `section_title` is "History". You can use the complete dataset if you want; the Pinecone vector database can manage millions of documents.

In [1]:
from datasets import load_dataset

# load the dataset from huggingface in streaming mode and shuffle it
wiki_data = load_dataset(
    'wiki_snippets',
    'wiki40b_en_100_0',
    split='train',
    streaming=True
).shuffle(seed=960)


We are loading the dataset in the streaming mode so that we don't have to wait for the whole dataset to download (which is over 9GB). Instead, we iteratively download records one at a time.

In [2]:
# show the contents of a single document in the dataset
next(iter(wiki_data))

{'_id': '{"datasets_id": 1251236, "wiki_id": "Q2049363", "sp": 16, "sc": 56, "ep": 16, "ec": 109}',
 'datasets_id': 1251236,
 'wiki_id': 'Q2049363',
 'start_paragraph': 16,
 'start_character': 56,
 'end_paragraph': 16,
 'end_character': 109,
 'article_title': 'Panemotichus',
 'section_title': 'Site',
 'passage_text': 'Archaeologists have revealed Iron Age remains there.'}

In [3]:
# keep only documents whose section_title is "History" (lazy generator, nothing is read yet)
history = (doc for doc in wiki_data if doc["section_title"] == "History")
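
The filter above is a generator expression, so no records are read until something iterates over it. The following is a minimal sketch of that behaviour, assuming a hypothetical `fake_stream` function that stands in for the streamed dataset; it is not part of the `datasets` library.

```python
from itertools import islice


def fake_stream():
    """Hypothetical stand-in for the streamed dataset: yields one dict at a time."""
    sections = ["Site", "History", "Geography", "History", "History"]
    for i in range(1_000_000):
        yield {
            "article_title": f"Article {i}",
            "section_title": sections[i % len(sections)],
        }


# lazy filter: nothing is pulled from the stream until we iterate
history_only = (doc for doc in fake_stream() if doc["section_title"] == "History")

# islice stops after 4 matches, so the rest of the stream is never generated
first_four = list(islice(history_only, 4))
print([d["article_title"] for d in first_four])
```

Capping the iteration with `islice` (or an explicit `break`) is what keeps the download small when working in streaming mode.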

In [4]:
print(wiki_data.features)


{'_id': Value('string'), 'datasets_id': Value('int32'), 'wiki_id': Value('string'), 'start_paragraph': Value('int32'), 'start_character': Value('int32'), 'end_paragraph': Value('int32'), 'end_character': Value('int32'), 'article_title': Value('string'), 'section_title': Value('string'), 'passage_text': Value('string')}


Let's iterate through the dataset and apply our filter to select the 50,000 historical passages. We will extract `article_title`, `section_title` and `passage_text` from each document.

In [5]:
from tqdm.auto import tqdm  # progress bar

total_doc_count = 50000  # number of "History" passages to collect
docs = []
# iterate through the filtered stream and keep the fields we need
for d in tqdm(history, total=total_doc_count):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article": d["article_title"],
        "section": d["section_title"],
        "passage": d["passage_text"]
    })
    # stop once we have collected enough passages
    if len(docs) >= total_doc_count:
        break

49999it [06:48, 122.28it/s]                         


In [6]:
import pandas as pd

# create a pandas dataframe with the documents we extracted
df = pd.DataFrame(docs)
df.head()

Unnamed: 0,article,section,passage
0,Paffendorf,History,Paffendorf History Ramon Zenker was involved i...
1,Peranakan,History,the Chinese was widespread.\nIt cannot be deni...
2,Pennsylvania Route 244,History,244 to its original alignment.
3,Patio 29,History,"kidnapped during the 1973 coup.\nIn 1981, Sant..."
4,Pecan Street Festival,History,Pecan Street Festival History This downtown Au...


# Initialize Pinecone Index

The Pinecone index stores vector representations of our historical passages, which we can retrieve later using another vector (the query vector). To build our vector index, we must first establish a connection with Pinecone. For this, we need an API key from Pinecone. You can get one for free from [here](https://app.pinecone.io/), and after that, we initialize the connection as follows:

In [7]:
import getpass
import os

from pinecone import Pinecone

# read the API key from the environment, or prompt for it (get a key at app.pinecone.io)
if not os.getenv("PINECONE_API_KEY"):
    os.environ["PINECONE_API_KEY"] = getpass.getpass("Enter your Pinecone API key: ")

pinecone_api_key = os.environ.get("PINECONE_API_KEY")

In [11]:
# configure client
pc = Pinecone(api_key=pinecone_api_key)

Now we set up our index specification, which defines the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects).

In [12]:
from pinecone import ServerlessSpec

cloud = os.environ.get('PINECONE_CLOUD') or 'aws'
region = os.environ.get('PINECONE_REGION') or 'us-east-1'

spec = ServerlessSpec(cloud=cloud, region=region)

In [13]:
from tqdm.auto import tqdm  # progress bar
from itertools import islice

max_docs = 1000  # limit this pass to the next 1,000 documents from the stream
docs = []
# iterate through the filtered stream, stopping after max_docs items
for d in tqdm(islice(history, max_docs), total=max_docs):
    # extract the fields we need - article, section, and passage
    docs.append({
        "article": d["article_title"],
        "section": d["section_title"],
        "passage": d["passage_text"]
    })

100%|██████████| 1000/1000 [00:08<00:00, 123.90it/s]


In [14]:
print(pc.list_indexes())


{'indexes': [{'dimension': 1536,
              'host': 'langchain-retrieval-augmentation-3-6da0nqw.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'langchain-retrieval-augmentation-3',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 3072,
              'host': 'langchain-retrieval-augmentation-6da0nqw.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'langchain-retrieval-augmentation',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},
              'status': {'ready': True, 'state': 'Ready'}},
             {'dimension': 384,
              'host': 'question-answering1-6da0nqw.svc.aped-4627-b74a.pinecone.io',
              'metric': 'cosine',
              'name': 'question-answering1',
              'spec': {'serverless': {'cloud': 'aws', 'region': 'us-east-1'}},


In [16]:
pc.delete_index("langchain-retrieval-augmentation")
pc.delete_index("question-answering1")
pc.delete_index("question-answering")


Now we create a new index. We will name it "abstractive-question-answering", but you can name it anything you want. We specify the metric type as "cosine" and the dimension as 768 because the retriever we use to generate context embeddings is optimized for cosine similarity and outputs 768-dimension vectors.

In [20]:
index_name = "abstractive-question-answering"  # give your index a meaningful name

# list_indexes() returns index descriptions, so compare against their names
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=768,                     # your embedding size
        metric="cosine",                   # similarity metric
        spec=ServerlessSpec(               # serverless configuration
            cloud="aws",
            region="us-east-1"
        ),
    )

# connect to the existing (or newly created) index
index = pc.Index(index_name)

# a newly created index should report zero vectors
print(index.describe_index_stats())

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}



# Initialize Retriever

Next, we need to initialize our retriever. The retriever will mainly do two things:

- Generate embeddings for all historical passages (context vectors/embeddings)
- Generate embeddings for our questions (query vector/embedding)

The retriever will create embeddings such that questions and the passages that hold their answers are close to one another in the vector space. We will use a SentenceTransformer model based on Microsoft's MPNet as our retriever. This model performs quite well for comparing the similarity between queries and documents. We can use cosine similarity to compare the query and context vectors generated by this model (Pinecone computes this for us because the index uses the cosine metric).

In [23]:
import torch
from sentence_transformers import SentenceTransformer

# set device to GPU if available
device = 'cuda' if torch.cuda.is_available() else 'cpu'
# load the retriever model from the Hugging Face model hub
retriever = SentenceTransformer(
    "flax-sentence-embeddings/all_datasets_v3_mpnet-base",
    device=device
)
retriever




SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False, 'architecture': 'MPNetModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)

In [24]:
emb = retriever.encode("Hello world!", convert_to_tensor=True)
print(emb.shape) 

torch.Size([768])
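
The model summary above ends with a `Normalize()` layer, so the embeddings it returns have unit length. For unit-length vectors, cosine similarity reduces to a plain dot product. The sketch below illustrates this with small hand-written vectors in pure Python; the helper names are illustrative and the vectors are made up, not real embeddings.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


query_vec = [3.0, 4.0, 0.0]
passage_vec = [4.0, 3.0, 0.0]

cos = cosine_similarity(query_vec, passage_vec)
dot_unit = dot(l2_normalize(query_vec), l2_normalize(passage_vec))
print(cos, dot_unit)  # the two values agree up to floating-point error
```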


# Generate Embeddings and Upsert

Next, we need to generate embeddings for the context passages. We will do this in batches, which speeds up both embedding generation and the upload to the Pinecone index. For each passage, Pinecone needs an id (a unique value), the context embedding, and metadata. The metadata is a dictionary containing data relevant to our embeddings, such as the article title, section title, and passage text.

In [25]:
df

Unnamed: 0,article,section,passage
0,Paffendorf,History,Paffendorf History Ramon Zenker was involved i...
1,Peranakan,History,the Chinese was widespread.\nIt cannot be deni...
2,Pennsylvania Route 244,History,244 to its original alignment.
3,Patio 29,History,"kidnapped during the 1973 coup.\nIn 1981, Sant..."
4,Pecan Street Festival,History,Pecan Street Festival History This downtown Au...
...,...,...,...
49995,Party finance in Germany,History,relied on donations given by wealthy individua...
49996,Raymond Park (West) Air Raid Shelter,History,"build trenches. However, after plans were amen..."
49997,Phoenix Air Defense Sector,History,Phoenix Air Defense Sector History PhADS was e...
49998,Party finance in Germany,History,German National People's Party (DNVP) received...



In [27]:
from uuid import uuid4

# we will use batches of 64
batch_size = 64

# embed the passage text of each batch and attach its metadata before upserting
for i in tqdm(range(0, len(df), batch_size)):
    # find end of batch
    j = min(i + batch_size, len(df))
    # extract batch
    batch_df = df.iloc[i:j]
    texts = batch_df["passage"].tolist()
    # generate embeddings for batch and convert them to plain Python lists
    emb = retriever.encode(texts, convert_to_tensor=False).tolist()
    # build one metadata dict per passage
    meta = [
        {
            "article": article,
            "section": section,
            "passage": passage
        }
        for article, section, passage in zip(
            batch_df["article"],
            batch_df["section"],
            batch_df["passage"]
        )
    ]
    # generate a unique id for each passage
    ids = [str(uuid4()) for _ in texts]
    # combine ids, embeddings, and metadata into upsert records
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)
# check that we have all vectors in index
print(index.describe_index_stats())

100%|██████████| 782/782 [39:32<00:00,  3.03s/it]

{'dimension': 768,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 49984}},
 'total_vector_count': 49984}
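
The batching pattern used above can be sketched without Pinecone or a real model. The example below assumes a hypothetical `fake_embed` function in place of `retriever.encode` and simply collects the batches instead of upserting them. It also checks the batch arithmetic: 50,000 passages at a batch size of 64 need 782 batches, which matches the progress bar. The reported count of 49,984 equals 781 full batches, which may simply mean the final partial batch had not yet been counted when the stats were read.

```python
import math
from uuid import uuid4


def fake_embed(texts):
    """Hypothetical stand-in for retriever.encode: one tiny vector per text."""
    return [[float(len(t)), 1.0] for t in texts]


def iter_upsert_batches(records, batch_size):
    """Yield lists of (id, vector, metadata) tuples, batch_size records at a time."""
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        vectors = fake_embed([r["passage"] for r in batch])
        ids = [str(uuid4()) for _ in batch]
        yield list(zip(ids, vectors, batch))


records = [
    {"article": f"A{i}", "section": "History", "passage": "x" * (i + 1)}
    for i in range(10)
]
batches = list(iter_upsert_batches(records, batch_size=4))
print([len(b) for b in batches])   # sizes of the batches that would be upserted
print(math.ceil(50_000 / 64))      # batches needed for 50,000 passages at size 64
```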





# Initialize Generator

For the generator we will use ELI5 BART, a sequence-to-sequence model trained on the ‘Explain Like I’m 5’ (ELI5) dataset. Sequence-to-sequence models take a text sequence as input and produce a different text sequence as output.

The input to the ELI5 BART model is a single string which is a concatenation of the query and the relevant documents providing the context for the answer. The documents are separated by a special token &lt;P>, so the input string will look as follows:

>question: What is a sonic boom? context: &lt;P> A sonic boom is a sound associated with shock waves created when an object travels through the air faster than the speed of sound. &lt;P> Sonic booms generate enormous amounts of sound energy, sounding similar to an explosion or a thunderclap to the human ear. &lt;P> Sonic booms due to large supersonic aircraft can be particularly loud and startling, tend to awaken people, and may cause minor damage to some structures. This led to prohibition of routine supersonic flight overland.

More detail on how the ELI5 dataset was built is available [here](https://arxiv.org/abs/1907.09190), and how the ELI5 BART model was trained is described [here](https://yjernite.github.io/lfqa.html).
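
The input layout described above can be built with a few lines of string formatting. This is a minimal sketch; the function name `format_eli5_input` is illustrative and not part of any library, and it takes plain passage strings rather than Pinecone matches.

```python
def format_eli5_input(question, passages):
    """Build 'question: ... context: <P> ... <P> ...' from a question and passages."""
    # prefix every passage with the <P> separator and join them with spaces
    context = " ".join(f"<P> {p}" for p in passages)
    return f"question: {question} context: {context}"


prompt = format_eli5_input(
    "What is a sonic boom?",
    [
        "A sonic boom is a sound associated with shock waves.",
        "Sonic booms generate enormous amounts of sound energy.",
    ],
)
print(prompt)
```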

Let's initialize the BART model using transformers.

In [28]:
from transformers import BartTokenizer, BartForConditionalGeneration

# load bart tokenizer and model from huggingface
tokenizer = BartTokenizer.from_pretrained('vblagoje/bart_lfqa')
generator = BartForConditionalGeneration.from_pretrained('vblagoje/bart_lfqa').to(device)

All the components of our abstractive QA system are complete and ready to be queried. But first, let's write some helper functions to retrieve context passages from the Pinecone index and to format the query in the way the generator expects.

In [45]:
def query_pinecone(query, top_k):
    """
    1) Embed the incoming query string
    2) Query Pinecone for the top_k most similar passages (with metadata)
    3) Return the full query response (the matches are under the "matches" key)
    """
    # generate embedding for the query (a single 768-d vector)
    xq = retriever.encode([query], convert_to_tensor=False)[0].tolist()
    # search Pinecone for the top_k most similar passages
    resp = index.query(
        vector=xq,
        top_k=top_k,
        include_metadata=True
    )
    return resp
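
To make the retrieval step concrete, here is a brute-force sketch of a top-k search over a small in-memory store. It is a stand-in for illustration only and does not reflect how Pinecone searches internally; the function `query_in_memory`, the toy vectors, and the store contents are all assumptions. The return value mimics the `matches` layout of the query response.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def query_in_memory(store, query_vec, top_k):
    """Score every stored vector against the query and keep the top_k best."""
    scored = (
        {"id": rid, "score": cosine(query_vec, vec), "metadata": meta}
        for rid, vec, meta in store
    )
    return {"matches": heapq.nlargest(top_k, scored, key=lambda m: m["score"])}


# toy store of (id, vector, metadata) records, in the same shape as the upsert tuples
store = [
    ("a", [1.0, 0.0], {"passage": "about electricity"}),
    ("b", [0.0, 1.0], {"passage": "about rivers"}),
    ("c", [0.8, 0.6], {"passage": "about power grids"}),
]
result = query_in_memory(store, [1.0, 0.1], top_k=2)
for m in result["matches"]:
    print(m["id"], round(m["score"], 3), m["metadata"]["passage"])
```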

In [50]:
def format_query(query, context):
    """
    1) Tag each retrieved passage with <P>
    2) Concatenate all passages into one context string
    3) Append the user question after the context
    4) Return the final prompt string to feed into the generator
    """
    # extract the passage text from each Pinecone match and add the <P> tag
    context = [f"<P> {m['metadata']['passage']}" for m in context]
    # concatenate all context passages
    context = " ".join(context)
    # concatenate the context passages and the query
    # NOTE: this layout differs from the "question: ... context: <P> ..." format
    # described earlier for ELI5 BART; the outputs recorded below were produced with it
    query = f"Context:\n{context}\n\nQuestion: {query}"
    return query

Let's test the helper functions. We will query the Pinecone index we created earlier with the `query_pinecone` function to get context passages, then pass them to the `format_query` function.

In [51]:
query = "when was the first electric power system built?"
result = query_pinecone(query, top_k=1)
result

{'matches': [{'id': '8c31f7a4-db1e-4dce-82b3-97446cba5e4a',
              'metadata': {'article': 'Pennsylvania Public Utility Commission',
                           'passage': 'in the way that the electricity was '
                                      'priced and regulated.',
                           'section': 'History'},
              'score': 0.609168112,
              'values': []}],
 'namespace': '',
 'usage': {'read_units': 1}}

In [52]:
from pprint import pprint

In [53]:
# format the query in the form generator expects the input
matches = result["matches"] 
query = format_query(query, matches)
pprint(query)

('Context:\n'
 '<P> in the way that the electricity was priced and regulated.\n'
 '\n'
 'Question: when was the first electric power system built?')


The output looks great. Now let's write a function to generate answers.

In [54]:
def generate_answer(query):
    # tokenize the query to get input_ids, truncating inputs longer than 1024 tokens
    inputs = tokenizer([query], max_length=1024, truncation=True, return_tensors="pt").to(device)
    # use generator to predict output ids
    ids = generator.generate(inputs["input_ids"], num_beams=2, min_length=20, max_length=40)
    # use tokenizer to decode the output ids
    answer = tokenizer.batch_decode(ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    # print the answer (the function itself returns None)
    pprint(answer)

In [55]:
generate_answer(query)


('The first electric power system was built in the early 1800s. The first '
 'electric power plant was built in the early 1900s.')


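The generator returned a fluent answer. Note, however, that the single retrieved passage says nothing about when the first power system was built, so the answer cannot have come from the context alone. Let's run some more queries with a larger `top_k`.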

In [56]:
query = "How was the first wireless message sent?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The first wireless message was sent by a telegraph. It was sent by a '
 'telegraph operator in London, England, to a telegraph operator in Newbury, '
 'Berkshire, England. The')
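
Because `generate_answer` caps the output at `max_length=40` tokens, answers can stop mid-sentence, as the trailing "The" above shows. One optional post-processing step is to drop the unfinished fragment. The helper below is a naive sketch and an assumption of this write-up, not part of `transformers`; it only looks for the last `.`, `!`, or `?`, so it can cut in the wrong place after abbreviations or decimal numbers.

```python
def trim_to_last_sentence(text):
    """Drop a trailing sentence fragment; return text unchanged if no terminator is found."""
    # position of the last sentence terminator, or -1 if there is none
    cut = max(text.rfind(ch) for ch in ".!?")
    return text[:cut + 1] if cut != -1 else text


raw = (
    "The first wireless message was sent by a telegraph. It was sent by a "
    "telegraph operator in London, England, to a telegraph operator in Newbury, "
    "Berkshire, England. The"
)
print(trim_to_last_sentence(raw))
```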


To confirm that this answer is correct, we can check the contexts used to generate the answer.

In [58]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

University of Hawaii began using radio to send digital information as early as 1971, using ALOHAnet. Friedhelm Hillebrand conceptualised SMS in 1984 while working for Deutsche Telekom. Sitting at a typewriter at home, Hillebrand typed out random sentences and counted every letter, number, punctuation, and space. Almost every time, the messages contained fewer than 160 characters, thus giving the basis for the limit one could type via text messaging. With Bernard Ghillebaert of France Télécom, he developed a proposal for the GSM (Groupe Spécial Mobile) meeting in February 1985 in Oslo. The first technical solution evolved in a GSM subgroup
---
the Map Communication Model (MCM) has its roots in information theory developed in the telephone industry before the war began. Mathematician, inventor, and teacher Claude Shannon worked at Bell Labs after completing his Ph.D. at the Massachusetts Institute of Technology in 1940. Shannon applied mathematical theory to information and demonstrated 

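In this case, the answer looks plausible, but the passages shown above do not mention a telegraph message between London and Newbury, so it is worth treating the answer with caution. If we ask a question and no relevant contexts are retrieved, the generator will typically return nonsensical or false answers, as with this question about COVID-19: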

In [59]:
query = "where did COVID-19 originate?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is the right subreddit to ask this question, but I "
 "think it's important to note that COVID-19 is not a new strain of influenza. "
 "It's")


In [61]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

in the scientific literature until 1965. The most recent outbreak occurred in 1985, on a farm in the town of Stetsonville, also in Wisconsin. Outbreaks of TME have also occurred in Canada, Finland, Germany, and the former Soviet Union.
---
Colorado, Louisiana, Alabama, New York, New Jersey, North Carolina, Michigan, Missouri, Iowa, Illinois, Montana, Kentucky, Kansas, Oklahoma, Indiana, Connecticut, Massachusetts, Rhode Island, and Wisconsin (Including one involving a previously asthmatic non-immunocompromised adult). In Canada in September 2014, 49 cases of the virus were confirmed in Alberta, three in British Columbia, and over 100 in Ontario. Health officials reported Los Angeles County's first case of viral infection on October 1, 2014. By October 2, 6 more cases had been reported in California: four in San Diego County, and one each in Ventura and Alameda counties.
The CDC later reported that from mid-August
---
The rest of the laboratories then became operational one by one. The 
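
One common mitigation for irrelevant contexts is to discard low-scoring matches and decline to answer when nothing is left. The sketch below assumes matches shaped like the query responses shown earlier; the function names, the `0.5` threshold, and the stub generator are illustrative assumptions. A suitable threshold depends on the model and data, and the irrelevant match returned for the first query above scored about 0.61, so this is a heuristic rather than a guarantee.

```python
ABSTAIN_MESSAGE = "No sufficiently relevant passage was retrieved."


def select_context(matches, min_score):
    """Keep only matches whose similarity score reaches min_score."""
    return [m for m in matches if m["score"] >= min_score]


def answer_or_abstain(matches, min_score, generate):
    """Call generate on the kept matches, or abstain when none pass the threshold."""
    kept = select_context(matches, min_score)
    if not kept:
        return ABSTAIN_MESSAGE
    return generate(kept)


def stub_generate(kept):
    """Stand-in for the real generator: just reports how many passages it received."""
    return f"answer built from {len(kept)} passage(s)"


weak = [
    {"id": "x", "score": 0.31, "metadata": {"passage": "unrelated text"}},
    {"id": "y", "score": 0.28, "metadata": {"passage": "more unrelated text"}},
]
mixed = [
    {"id": "p", "score": 0.72, "metadata": {"passage": "relevant text"}},
    {"id": "q", "score": 0.40, "metadata": {"passage": "unrelated text"}},
]
print(answer_or_abstain(weak, 0.5, stub_generate))
print(answer_or_abstain(mixed, 0.5, stub_generate))
```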

Let’s finish with a final few questions.

In [62]:
query = "what was the war of currents?"
context = query_pinecone(query, top_k=5)
query = format_query(query, context["matches"])
generate_answer(query)

('The War of the Currents was a series of naval battles between the United '
 'States and the Ottoman Empire. It was the culmination of a series of naval '
 'battles between the United States and the Ottoman')


In [None]:
query = "who was the first person on the moon?"
context = query_pinecone(query, top_k=10)
query = format_query(query, context["matches"])
generate_answer(query)

('The first person to walk on the moon was Neil Armstrong in 1969. He walked '
 'on the moon in 1969. He was the first person to walk on the moon.')


In [63]:
query = "what was NASAs most expensive project?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Space Shuttle was the most expensive project in the history of NASA. It '
 'cost about $3.5 Billion to build.')


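As we can see, the model can generate some decent answers, but not all of them are reliable. The War of the Currents answer above is wrong: it was the late-19th-century rivalry between alternating-current and direct-current power distribution, not a series of naval battles.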

#### Add a few more questions

In [64]:
query = "Who discovered the Philippines?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is the right subreddit to ask this question, but I'm "
 'curious as to who discovered the Philippines.')


In [65]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

Pacific Ocean the Philippines, the Carolines, the Marianas and Palau) until the 1898 Spanish–American War.
---
of Hindu deities.
In 1989, a laborer working in a sand mine at the mouth of Lumbang River near Laguna de Bay found a copper plate in Barangay Wawa, Lumban. This discovery, is now known as the Laguna Copperplate Inscription by scholars. It is the earliest known written document found in the Philippines, dated to be from the 9th century AD, and was deciphered in 1992 by Dutch anthropologist Antoon Postma. The copperplate inscription suggests economic and cultural links between the Tagalog people of Philippines with the Javanese Medang Kingdom, the Srivijaya empire, and the Hindu-Buddhist kingdoms of India. Hinduism in
---
of the birth of the Philippines' foremost poet, Francisco Balagtas.
---


In [66]:
query = "Who discovered the Peru?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this qualifies as a question for /r/AskAnthropology, but "
 "I'll give it a shot anyway. I'm not a historian, but I'm a")


In [67]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

of Venezuela until 1864.
---
export of guano in 1840. Spain, not having recognized Peru's independence (it was not to do so until 1879) and desiring the guano profits, occupied the islands in April 1864, setting off the Chincha Islands War (1864–1866).
---
a group of indigenous people came from the Pacific coast. They wanted to get to Tenochtitlan in order to attend the crowning of the emperor. On their way to their destination, these Pacific people established in Ixtapan de la Sal where they formed communities. Here they noticed that once the geothermally heated water was evaporated naturally in the sunlight, salt was formed. This amazed them because back then, salt was a very precious item. When the emperor found out about this discovery, he also ordered men and women to move there, which led to the foundation of Ixtapan de la
---


In [None]:
query = "Is Peru rich?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

("I'm not sure if this is what you're looking for, but I'll give it a shot. "
 "Peru is rich in gold, silver, copper, and other precious metals. It's")


In [69]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

Arica Province (Peru) The Province of Arica was a historical territorial division of Peru, which existed between 1823 and 1883. It was populated by pre-Hispanic peoples for a long period of time before Spanish colonisation in the early 16th century saw the transformation of a small town into a thriving port. Trade in both gold and silver  was facilitated through Arica after the precious metals were first extracted from the Potosí silver mines of Bolivia. Following the War of the Pacific, the province was transferred to Chile and became an official Chilean territory in 1929. History Arica was established
---
Chachapoyas, Peru History Named San Juan de la Frontera de los Chachapoyas, the city was first established near La Jalca, and then near Levanto.  The city's original locations were abandoned due to climate, disease and a lack of defenses against rebelling local groups. The location of the city changed several times, until it was settled in the place that it now occupies at 2334 m. A

In [72]:
query = "Are Peranakans rich?"
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('Peranakans are a group of people who migrated to Malaysia, Indonesia and '
 'Singapore during the colonial era. They are descended from the Chinese who '
 'settled in the region during the 19th century')


In [73]:
for doc in context["matches"]:
    print(doc["metadata"]["passage"], end='\n---\n')

they are now called, who came from the east coast of Sumatra and other places.
John Anderson - Agent to the Government of Prince of Wales Island
Peranakans themselves later on migrated between Malaysia, Indonesia and Singapore, which resulted in a high degree of cultural similarity between Peranakans in those countries. Economic / educational reasons normally propel the migration between of Peranakans between the Nusantara region (Malaysia, Indonesia and Singapore), their creole language is very close to the indigenous languages of those countries, which makes adaptations a lot easier. In Indonesia, a large population of Peranakans can be found in Tangerang, West
---
into a class of Straits-born Chinese known as the Peranakans.
Due to economic hardships in mainland China, waves of immigrants from China settled in Malaysia, Indonesia and Singapore. Some of them embraced the local customs, while still retaining some degree of their ancestral culture; they are known as the Peranakans. Per

In [74]:
query = "Tell me about the Peranakans."
context = query_pinecone(query, top_k=3)
query = format_query(query, context["matches"])
generate_answer(query)

('The Peranakans are a group of Chinese who migrated to Malaysia, Indonesia, '
 'and Singapore during the 19th century. They are descended from Chinese '
 'miners who migrated to the island during the')
