# Using AwaDB as a Vector database for Question Answering tasks

This notebook shows how to use AwaDB as a vector database to store embeddings obtained from OpenAI Embedding, and then use GPT with embedding-based search for question answering tasks.

The end-to-end workflow below covers the entire process:

1. Text Preprocessing
2. Embedding
3. Vector Store
4. Similarity Search
5. Question Answering

```mermaid
graph LR
  A[Text Preprocessing] --> B[Embedding]
  B --> C[Vector Store]
  C --> D[Similarity Search]
  D --> E[Question Answering]
```

## Install libraries
This sample requires the `openai` and `awadb` packages. Later cells also import `wget` and `langchain`.

You can install the first two with `pip install awadb` and `pip install openai`.

In [1]:
# Import necessary libraries

try:
    import openai
    import awadb
except ImportError as exc:
    raise ImportError(
        "Could not import libraries. "
        "Please install them with `pip install awadb` and `pip install openai`"
    ) from exc

You also need to set your OpenAI API key as an environment variable beforehand. For more information, see [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [2]:
import os
import wget

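# Use .get() so a missing key fails the assertion with a clear message instead of raising KeyError
assert os.environ.get("OPENAI_API_KEY") is not None, "Set the OPENAI_API_KEY environment variable first"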

## Load Dataset

Next, we load the dataset used in this example.

In [3]:
# URL of the raw text file (plain text, not precomputed embeddings)
data_url = "https://raw.githubusercontent.com/awa-ai/awadb/main/tests/state_of_the_union.txt"
file_path = "state_of_the_union.txt"

if not os.path.exists(file_path):
    wget.download(data_url, file_path)
    print("\nFile downloaded successfully.")
else:
    print("File already exists in the local file system.")
    
# Load the data file
from langchain.document_loaders import TextLoader
loader = TextLoader(file_path)

File already exists in the local file system.


### Split the text
Next, we preprocess the text. We split the text data into chunks with `chunk_size=100` and an overlap of 20 characters between neighboring chunks. Because `CharacterTextSplitter` only splits at its separator, a piece longer than `chunk_size` is kept whole, which is why the output reports chunks longer than 100.

The choice of these two hyperparameters depends on the average sentence length of your document. As a rule of thumb, each chunk should carry one complete unit of meaning, and no more than one.
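
To make the roles of chunk size and overlap concrete, here is a minimal sketch of fixed-window chunking in plain Python. It is an illustration only and not how LangChain's `CharacterTextSplitter` works internally; the function name `split_with_overlap` and the sample string are assumptions made for this example.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk when it is shorter than the overlap.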

In [4]:
# Transform to document
data = loader.load()
print(f'documents:{len(data)}')

# Initialize text splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)

# Split the document
split_docs = text_splitter.split_documents(data)
print("split_docs size:",len(split_docs))

Created a chunk of size 164, which is longer than the specified 100
Created a chunk of size 169, which is longer than the specified 100
Created a chunk of size 122, which is longer than the specified 100
Created a chunk of size 121, which is longer than the specified 100
Created a chunk of size 139, which is longer than the specified 100
Created a chunk of size 181, which is longer than the specified 100
Created a chunk of size 101, which is longer than the specified 100
Created a chunk of size 113, which is longer than the specified 100
Created a chunk of size 129, which is longer than the specified 100
Created a chunk of size 146, which is longer than the specified 100
Created a chunk of size 136, which is longer than the specified 100
Created a chunk of size 189, which is longer than the specified 100
Created a chunk of size 215, which is longer than the specified 100
Created a chunk of size 124, which is longer than the specified 100
Created a chunk of size 118, which is longer tha

Created a chunk of size 106, which is longer than the specified 100
Created a chunk of size 141, which is longer than the specified 100
Created a chunk of size 167, which is longer than the specified 100
Created a chunk of size 165, which is longer than the specified 100
Created a chunk of size 110, which is longer than the specified 100
Created a chunk of size 128, which is longer than the specified 100
Created a chunk of size 115, which is longer than the specified 100
Created a chunk of size 106, which is longer than the specified 100
Created a chunk of size 115, which is longer than the specified 100
Created a chunk of size 127, which is longer than the specified 100
Created a chunk of size 156, which is longer than the specified 100
Created a chunk of size 118, which is longer than the specified 100
Created a chunk of size 186, which is longer than the specified 100
Created a chunk of size 119, which is longer than the specified 100
Created a chunk of size 140, which is longer tha

documents:1
split_docs size: 336


In [5]:
from typing import Set

# Save the embedded texts with AwaDB
texts = [text.page_content for text in split_docs]

awadb_client = awadb.Client()
awadb_client.Create("testdb1")

# Add the split texts to the database
awadb_client.AddTexts("embedding_text", "testdb1", texts=texts)

not_include_fields: Set[str] = {"text_embedding"}

### Set the question

Use `awadb_client.Search` to run a similarity search for the question.
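
Conceptually, a similarity search ranks stored vectors by how close they are to the query vector. The sketch below shows a brute-force version using cosine similarity in plain Python. The source does not state which metric or index AwaDB uses internally, so this is an assumption for illustration; the helper names `cosine_similarity` and `top_n` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n(query_vec: Sequence[float],
          doc_vecs: Sequence[Sequence[float]],
          n: int) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the n vectors most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


# Toy 2-dimensional "embeddings" standing in for real ones
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
results = top_n([1.0, 0.1], doc_vecs, n=2)
print([index for index, _ in results])  # [0, 2]
```

A vector database avoids this linear scan over every stored vector, but the ranking idea is the same.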

In [6]:
# Set the question
query = "What measures does the speaker ask Congress to pass to reduce gun violence?"
# Similarity search results
similar_docs = awadb_client.Search(query=query, topn=3, not_include_fields=not_include_fields)

print(similar_docs)

[{'ResultSize': 3, 'ResultItems': [{'testdb1': [0.003860166296362877, 0.05287330225110054, 0.04189165309071541, 0.013958923518657684, 0.008931133896112442, -0.004174927249550819, 0.03456006199121475, -0.011983185075223446, -0.019423844292759895, -0.008950561285018921, 0.014751210808753967, 0.00012077309656888247, -0.028799518942832947, 0.014195965602993965, 0.0028611551970243454, -0.014142422005534172, 0.05331096798181534, -0.02960740216076374, -0.03736810013651848, 0.012383769266307354, -0.012664074078202248, 0.003340320196002722, 0.011186112649738789, 0.003241416299715638, 0.056063853204250336, -0.017529955133795738, 0.028422176837921143, -0.014435794204473495, -0.06394178420305252, -0.010992028750479221, 0.05998272821307182, -0.017841048538684845, 0.009276899509131908, -0.05242784693837166, 1.4265939398683258e-06, -0.03983017057180405, 0.029982086271047592, -0.002376806689426303, 0.017949869856238365, -0.005107346456497908, 0.005001213867217302, -0.09216341376304626, 0.0333099849522

## Create Prompt
Next, we create the prompt from our question and the results of the similarity search.
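
Indexing a fixed number of results raises an `IndexError` if the search returns fewer items than expected. As a hedged sketch, the helper below collects however many passages are present. It assumes the result shape printed by the search cell (a list whose first element holds a `ResultItems` list) and the `embedding_text` field name used in this notebook; `build_context` and `mock_results` are hypothetical names for illustration.

```python
from typing import Any, Dict, List


def build_context(search_results: List[Dict[str, Any]],
                  text_field: str = "embedding_text") -> str:
    """Join the text of every returned item, one passage per line."""
    if not search_results:
        return ""
    items = search_results[0].get("ResultItems", [])
    # Skip items that do not carry the expected text field
    lines = [item[text_field] for item in items if text_field in item]
    return "\n".join(lines) + ("\n" if lines else "")


# Mock data mimicking the shape of the search output; not real AwaDB results
mock_results = [{
    "ResultSize": 2,
    "ResultItems": [
        {"embedding_text": "First passage."},
        {"embedding_text": "Second passage."},
    ],
}]
context = build_context(mock_results)
print(context)
```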

In [7]:
# Create prompt
system_prompt = "You are a person who answers questions for people based on specified information\n"

similar_prompt = ""
for i in range(3):
    similar_prompt += similar_docs[0]['ResultItems'][i]['embedding_text'] + "\n"

#similar_prompt = similar_docs[0].page_content + "\n" + similar_docs[1].page_content + "\n" + similar_docs[2].page_content + "\n"
question_prompt = f"Here is the question: {query}\nPlease provide an answer only related to the question and do not include any information more than that.\n"
prompt = system_prompt + "Here is some information given to you:\n" + similar_prompt + question_prompt

print(prompt)

You are a person who answers questions for people based on specified information
Here is some information given to you:
Ban assault weapons and high-capacity magazines.
Repeal the liability shield that makes gun manufacturers the only industry in America that can’t be sued.
I ask Democrats and Republicans alike: Pass my budget and keep our neighborhoods safe.
Here is the question: What measures does the speaker ask Congress to pass to reduce gun violence?
Please provide an answer only related to the question and do not include any information more than that.



In [8]:
# Create response from gpt-3.5
# Note: openai.ChatCompletion is the interface of openai package versions before 1.0
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.7,
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": prompt},
    ],
    max_tokens=40,
)

print(response['choices'][0]['message']['content'].replace(' .', '.').strip())

The speaker asks Congress to pass measures to ban assault weapons and high-capacity magazines, as well as to repeal the liability shield for gun manufacturers.
