# Using AwaDB as a Vector database for Question Answering tasks

This notebook shows how to use AwaDB as the vector database for storing embeddings obtained from OpenAI Embedding, and then how to use GPT together with embedding-based search to answer questions.

We will provide an end-to-end workflow example to illustrate the entire process.

1. Text Preprocessing
2. Embedding
3. Vector Store
4. Similarity Search
5. Question Answering

```mermaid
graph LR
  A[Text Preprocessing] --> B[Embedding]
  B --> C[Vector Store]
  C --> D[Similarity Search]
  D --> E[Question Answering]
```

## Install libraries
The requirements for this sample are the `openai` and `awadb` packages. Later cells also import `wget` and `langchain`.

You can use `pip install awadb` and `pip install openai` to install them.

In [6]:
# Import necessary libraries

try:
    import openai
    import awadb
except ImportError as exc:
    raise ImportError(
        "Could not import the required libraries. "
        "Please install them with `pip install awadb` and `pip install openai`"
    ) from exc

You also need to set your OpenAI API key as an environment variable beforehand. You can find more information about this in [Best Practices for API Key Safety](https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety).

In [7]:
import os
import wget

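# .get() avoids a KeyError when the variable is missing, so the assertion message is shown instead
assert os.environ.get("OPENAI_API_KEY") is not None, "Set the OPENAI_API_KEY environment variable first"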

## Load Dataset

Next we load the dataset used in this example.

In [8]:
# URL of the raw text file (plain text, not precomputed embeddings)
data_url = "https://raw.githubusercontent.com/awa-ai/awadb/main/tests/state_of_the_union.txt"
file_path = "state_of_the_union.txt"

if not os.path.exists(file_path):
    wget.download(data_url, file_path)
    print("\nFile downloaded successfully.")
else:
    print("File already exists in the local file system.")
    
# Load the data file
from langchain.document_loaders import TextLoader
loader = TextLoader(file_path)

File already exists in the local file system.


### Split the text
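Next we preprocess the text. We split it into chunks with a target size of 40 characters and an overlap of 5 characters between neighboring chunks. `CharacterTextSplitter` splits on a separator (a blank line by default), so any piece longer than 40 characters is kept whole and reported in a warning.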

In [9]:
# Transform to document
data = loader.load()
print(f'documents:{len(data)}')

# Initialize the text splitter
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(chunk_size=40, chunk_overlap=5)
# Split the document
split_docs = text_splitter.split_documents(data)
print("split_docs size:",len(split_docs))

Created a chunk of size 164, which is longer than the specified 40
Created a chunk of size 75, which is longer than the specified 40
Created a chunk of size 95, which is longer than the specified 40
Created a chunk of size 71, which is longer than the specified 40
Created a chunk of size 78, which is longer than the specified 40
Created a chunk of size 169, which is longer than the specified 40
Created a chunk of size 122, which is longer than the specified 40
Created a chunk of size 121, which is longer than the specified 40
Created a chunk of size 139, which is longer than the specified 40
Created a chunk of size 181, which is longer than the specified 40
Created a chunk of size 101, which is longer than the specified 40
Created a chunk of size 113, which is longer than the specified 40
Created a chunk of size 129, which is longer than the specified 40
Created a chunk of size 70, which is longer than the specified 40
Created a chunk of size 100, which is longer than the specified 40


Created a chunk of size 95, which is longer than the specified 40
Created a chunk of size 47, which is longer than the specified 40
Created a chunk of size 78, which is longer than the specified 40
Created a chunk of size 41, which is longer than the specified 40
Created a chunk of size 46, which is longer than the specified 40
Created a chunk of size 47, which is longer than the specified 40
Created a chunk of size 49, which is longer than the specified 40
Created a chunk of size 55, which is longer than the specified 40
Created a chunk of size 75, which is longer than the specified 40
Created a chunk of size 72, which is longer than the specified 40
Created a chunk of size 72, which is longer than the specified 40
Created a chunk of size 166, which is longer than the specified 40
Created a chunk of size 159, which is longer than the specified 40
Created a chunk of size 124, which is longer than the specified 40
Created a chunk of size 107, which is longer than the specified 40
Create

Created a chunk of size 132, which is longer than the specified 40
Created a chunk of size 132, which is longer than the specified 40
Created a chunk of size 147, which is longer than the specified 40
Created a chunk of size 111, which is longer than the specified 40
Created a chunk of size 101, which is longer than the specified 40
Created a chunk of size 77, which is longer than the specified 40
Created a chunk of size 127, which is longer than the specified 40
Created a chunk of size 76, which is longer than the specified 40
Created a chunk of size 120, which is longer than the specified 40
Created a chunk of size 176, which is longer than the specified 40
Created a chunk of size 177, which is longer than the specified 40
Created a chunk of size 177, which is longer than the specified 40
Created a chunk of size 242, which is longer than the specified 40
Created a chunk of size 183, which is longer than the specified 40
Created a chunk of size 92, which is longer than the specified 4

documents:1
split_docs size: 359
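
The "longer than the specified 40" warnings come from separator-based splitting: pieces are merged up to the target size, but a single piece that is already too long is not cut. The sketch below illustrates that behavior in plain Python. It is a simplified assumption about the merging logic, not LangChain's actual implementation, and it leaves out the overlap handling; `merge_splits` is a hypothetical helper name.

```python
def merge_splits(text, chunk_size, separator="\n\n"):
    """Split on `separator`, then greedily merge pieces up to `chunk_size`.

    Simplified illustration only: overlap between chunks is not modeled.
    """
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            # Still fits: keep growing the current chunk.
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Start a new chunk; a single long piece may exceed chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa\n\nbbbb\n\n" + "c" * 50 + "\n\ndd"
for chunk in merge_splits(sample, chunk_size=12):
    flag = " (longer than chunk_size)" if len(chunk) > 12 else ""
    print(len(chunk), flag)
```

With a target as small as 40 characters, most paragraphs of a speech exceed the limit, which is why nearly every chunk triggers a warning.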


In [10]:
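# Store the split texts in AwaDB.
# The string literal below keeps an alternative that uses LangChain's AwaDB wrapper; it is not executed.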
"""
from langchain.vectorstores import AwaDB
db = AwaDB.from_documents(split_docs)

# Set the question
query = "What were the two main things the author worked on before college?"
# Similarity search results
similar_docs = db.similarity_search(query)
print(similar_docs)
"""

texts = [{'embeddingtext':text.page_content} for text in split_docs]

awadb_client = awadb.Client()
awadb_client.Create("testdb1")

awadb_client.Add(texts)

# Set the question
query = "What were the two main things the author worked on before college?"
# Similarity search results
similar_docs = awadb_client.Search(query, 3)
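
Conceptually, a similarity search embeds the query and returns the stored items whose vectors are closest to it. The following is a minimal sketch of that idea using cosine similarity over toy two-dimensional vectors. It does not reflect how AwaDB works internally; the function names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy "embeddings"; a real system would use high-dimensional model outputs.
doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(top_k([1.0, 0.1], doc_vecs, k=3))
```

A vector database replaces the linear scan in `top_k` with an index so that the search stays fast on large collections.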

## Create Prompt
We then create the prompt from our question and the results of the similarity search.

In [6]:
# Create prompt
system_prompt = "You are a person who answers questions for people based on specified information\n"
similar_prompt = similar_docs[0].page_content + "\n" + similar_docs[1].page_content + "\n" + similar_docs[2].page_content + "\n"
question_prompt = f"Here is the question: {query}\nPlease provide an answer only related to the question and do not include any information more than that.\n"
prompt = system_prompt + "Here is some information given to you:\n" + similar_prompt + question_prompt
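
The same prompt can be assembled by a small helper that accepts any number of retrieved passages, which avoids indexing exactly three results. This is a sketch only: `build_prompt` is a hypothetical name, and it assumes the retrieved passages have already been extracted as plain strings.

```python
def build_prompt(query, contexts):
    """Combine the instructions, retrieved passages, and question into one prompt."""
    system_prompt = (
        "You are a person who answers questions for people "
        "based on specified information\n"
    )
    # One retrieved passage per line.
    context_block = "".join(text + "\n" for text in contexts)
    question_prompt = (
        f"Here is the question: {query}\n"
        "Please provide an answer only related to the question "
        "and do not include any information more than that.\n"
    )
    return (
        system_prompt
        + "Here is some information given to you:\n"
        + context_block
        + question_prompt
    )


print(build_prompt("What is the topic?", ["First passage.", "Second passage."]))
```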

In [None]:
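# Create a response from gpt-3.5.
# Note: openai.ChatCompletion is the legacy interface available in openai<1.0.
# The system message is left empty because `prompt` already includes the instructions.
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    temperature=0.7,
    messages=[
        {"role": "system", "content": ""},
        {"role": "user", "content": prompt},
    ],
    max_tokens=40,
)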

print(response['choices'][0]['message']['content'].replace(' .', '.').strip())
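
The cell above relies on `openai.ChatCompletion`, which was removed in version 1.0 of the `openai` package. If you are on a newer version, the call might look like the sketch below. The helper name `ask` is hypothetical, and the client is passed in as a parameter so the function stays independent of how the client is constructed.

```python
def ask(client, prompt, model="gpt-3.5-turbo", temperature=0.7, max_tokens=40):
    """Send `prompt` as a single user message and return the reply text.

    `client` is expected to expose `chat.completions.create`, as the
    `OpenAI` client in openai>=1.0 does.
    """
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        max_tokens=max_tokens,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


# Usage (requires openai>=1.0 and the OPENAI_API_KEY environment variable):
#   from openai import OpenAI
#   client = OpenAI()
#   print(ask(client, prompt))
```

Keep in mind that `max_tokens=40` limits the reply to a short answer, so longer answers will be cut off.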