# KDB.AI for Q&A with ChatGPT Retrieval Plugin

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

    Before installing the dependencies with poetry, add PyKX and a local KDB.AI Python package as dependencies in the `pyproject.toml` file by adding the following lines under `[tool.poetry.dependencies]`:

* `pykx = "^1.6.3"`
* `kdbai = { path = "path/to/kdbai-python-client" }`

    Once these have been added, install dependencies to the poetry environment:

```
poetry install
```

7. Set up the KDB.AI Cloud Server:

    Sign up for an account [here](https://test.qa.cld.kx.com/kdbai/signup/) and verify your email to receive your host URL (access endpoint) and KDB.AI API key.


8. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. The token can be generated using [jwt.io](https://jwt.io/).
* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!


9. Set `KDBAI`-specific environment variables:

* `DATASTORE`: set to `kdbai`
* `KDBAI_ENDPOINT`: set to your Cloud host URL (for example `https://ui.qa.cld.kx.com/instance/abcde12345`; omit the trailing `/api/v1/config/table`)
* `KDBAI_API_KEY`: set to your Cloud KDB.AI API key
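
Before starting the app, it can help to confirm that the variables from steps 8 and 9 are actually set. The following is a minimal sketch, not part of the plugin: the helper names `missing_env_vars` and `check_env` are hypothetical, and the checks only mirror the rules stated in the lists above.

```python
import os

# Variable names taken from steps 8 and 9 above.
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "DATASTORE",
    "KDBAI_ENDPOINT",
    "KDBAI_API_KEY",
]


def missing_env_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_env(env=os.environ):
    """Hypothetical helper: list problems found in the app's environment."""
    problems = [f"{name} is not set" for name in missing_env_vars(env)]
    # Step 9 says DATASTORE must be `kdbai`.
    if env.get("DATASTORE") and env["DATASTORE"] != "kdbai":
        problems.append("DATASTORE should be 'kdbai'")
    # Step 9 says the endpoint must not include the config path.
    if env.get("KDBAI_ENDPOINT", "").endswith("/api/v1/config/table"):
        problems.append("KDBAI_ENDPOINT should omit '/api/v1/config/table'")
    return problems
```

Calling `check_env()` in the shell that will run the app returns an empty list when everything is in place, and one message per problem otherwise.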

10. Run the app with:

```
poetry run start
```
The ChatGPT plugin is ready to handle requests when `Application startup complete` is printed to the terminal.

When the application is up and running, a FastAPI server exposes the ChatGPT plugin's endpoints for `upserting`, `querying`, and `deleting` documents.

During application startup, a new table named `documents` is created in `KDB.AI` with default parameters `{type='flat', metric='L2', dims=1536}` (`dims=1536` matches the output size of `text-embedding-ada-002`). This table stores the vector embeddings of the data together with an index for similarity search, which is a `flat` index under the defaults shown.

```python
import os
import random

import requests
import openai
from datasets import load_dataset
from tqdm.auto import tqdm
```

```python
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")
```

### Load Dataset from Hugging Face

The Adversarial_QA dataset is chosen for this demo.
The adversarial human annotation paradigm ensures that these datasets consist of questions that current state-of-the-art models find challenging.

```python
data = load_dataset("adversarial_qa", "adversarialQA", split="train").to_pandas()
data = data.drop_duplicates(subset=["context"])
print(f"Number of unique contexts: {len(data)}")
data.head()
```

```
Number of unique contexts: 2648
```


| | id | title | context | question | answers | metadata |
|---|---|---|---|---|---|---|
| 0 | 7ba1e8f4261d3170fcf42e84a81dd749116fae95 | Brain | Another approach to brain function is to exami... | What sare the benifts of the blood brain barrir? | {'text': ['isolated from the bloodstream'], 'a... | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 12 | 936a8460bfffe437b54cf3ec1e825a3b7b5627a1 | Brain | Motor systems are areas of the brain that are ... | What do you think with? | {'text': ['brain'], 'answer_start': [467]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 24 | e40737d487964dbcd26a223f2799cf56390a98a8 | Brain | The brain is an organ that serves as the cente... | How are neurons connected? | {'text': ['synapses'], 'answer_start': [602]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 37 | a0f8e785a10f6e21e24207d24ba2823162383062 | Brain | The SCN projects to a set of areas in the hypo... | The body's central biological clock is contain... | {'text': ['SCN'], 'answer_start': [4]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 53 | 6d753d4a8878b5f5bc496d9a369b8c0b212079a0 | Brain | The brain contains several motor areas that pr... | What is at the highest level? | {'text': ['the primary motor cortex'], 'answer... | {'split': 'train', 'model_in_the_loop': 'Combi... |


```python
# extract text data from the dataset and reformat
documents = [
    {
        'text': r['context'],
    } for r in data.to_dict(orient='records')
]

print(documents[:3])
```

```
[{'text': 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.'}, {'text': 'Motor systems are areas of the brain that are directly or indirectly involved in producing body movements, that is, in activating muscles. Except for the mus
```

```python
# create an HTTP session to reuse for requests to the plugin's API
s = requests.Session()
```

### Insert data to the KDB.AI table

The `/upsert` endpoint inserts data into the KDB.AI datastore in batches. The app embeds each batch with `text-embedding-ada-002` before adding it to the table.

```python
batch_size = 100

# upsert documents from the dataset in batches
for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i + batch_size)

    res = s.post(
        "http://localhost:8000/upsert",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        json={"documents": documents[i:i_end]},
    )
    # stop early if the app rejects a batch (e.g. a wrong bearer token)
    res.raise_for_status()
```

```
  0%|          | 0/27 [00:00<?, ?it/s]
```
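
The batching arithmetic behind the loop above can be checked in isolation. This is a small sketch with placeholder data: `iter_batches` is a hypothetical helper, and `docs` only imitates the 2648 deduplicated contexts loaded earlier.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder documents, same count as the deduplicated contexts above.
docs = [{"text": f"document {n}"} for n in range(2648)]
batches = list(iter_batches(docs, 100))

print(len(batches))      # 27, the total shown by the progress bar
print(len(batches[-1]))  # 48, the final partial batch
```

Slicing past the end of a list is safe in Python, so the last batch is simply shorter and no explicit `min(...)` bound is needed.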

### Query the KDB.AI table

```python
# extract questions and reformat into queries
queries = [{'query': q} for q in data['question'].tolist()]

# choose a run of 5 consecutive queries, starting at a random position
i = random.randint(0, len(queries) - 5)
searchQueries = queries[i:i + 5]

print(searchQueries)
```

```
[{'query': 'Are there fewer counties or members in the House of Representatives?'}, {'query': 'What county in Liberia has the largest number of people living there?'}, {'query': "Which came first, the elections or Taylor's extradition?"}, {'query': 'How has Portuguese influenced another mentioned language?'}, {'query': "An adjective that describes characteristics from countries that are not of one's own is?"}]
```


The `/query` endpoint extracts relevant information from the KDB.AI datastore. Each query is embedded into a vector, and a similarity search finds its nearest neighbours, which represent the most relevant entries in the table.
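
To make the nearest-neighbour step concrete, here is a minimal sketch of a brute-force (flat) search under Euclidean (L2) distance. It is an illustration only, not KDB.AI's implementation: the three-dimensional vectors, the `table` contents, and the helper names `l2_distance` and `nearest_neighbours` are made up for the example, whereas real `text-embedding-ada-002` embeddings have 1536 dimensions.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbours(query_vec, table, k=3):
    """Brute-force search: score every row and return the k closest."""
    scored = [(l2_distance(query_vec, vec), text) for text, vec in table]
    return sorted(scored)[:k]


# Toy "table" of (text, embedding) rows; the values are invented.
table = [
    ("doc about brains", [0.9, 0.1, 0.0]),
    ("doc about Liberia", [0.1, 0.8, 0.1]),
    ("doc about languages", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.9, 0.0]

for score, text in nearest_neighbours(query_vec, table, k=2):
    print(round(score, 2), text)
```

A smaller distance means a closer match, so the row whose vector lies nearest the query vector is listed first.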

```python
# query the vector database
results = requests.post(
    "http://localhost:8000/query",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    json={'queries': searchQueries},
)

print(results)
```

```
<Response [200]>
```

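
The response body can be explored without a running server. The sketch below assumes the layout that this walkthrough reads from `results.json()`: a top-level `results` list with one entry per query, each holding the `query` string and a `results` list of hits with `text` and `score`. The payload values and the helper name `closest_hits` are invented for illustration.

```python
def closest_hits(payload):
    """Map each query to its lowest-score (closest) hit, or None if it has no hits."""
    closest = {}
    for query_response in payload["results"]:
        hits = query_response["results"]
        closest[query_response["query"]] = (
            min(hits, key=lambda hit: hit["score"]) if hits else None
        )
    return closest


# Hand-written stand-in for results.json(); not real output.
sample_payload = {
    "results": [
        {
            "query": "first example query",
            "results": [
                {"text": "passage A", "score": 0.36},
                {"text": "passage B", "score": 0.30},
            ],
        },
        {"query": "second example query", "results": []},
    ]
}

for query, hit in closest_hits(sample_payload).items():
    print(query, "->", hit["text"] if hit else "no results")
```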

In the cell below, we iterate through each query and its results, and pass this data to ChatGPT, which uses it to respond to the query with natural language.

```python
# iterate through each query and its retrieved results
for query_response in results.json()['results']:
    query = query_response['query']
    answers = []
    scores = []

    # extract answers and scores from each result
    for result in query_response['results']:
        # answer = textual information related to the query
        answers.append(result['text'])
        # score = distance between the query vector and the answer vector (smaller=better!)
        scores.append(round(result['score'], 2))

    # print the query
    print("\nQUERY:\n" + query)

    # print the retrieved passages, each prefixed with its score
    print("\nCONTEXT:\n" + "\n".join(f"{score}: {answer}" for answer, score in zip(answers, scores)) + "\n")

    # format the query and its answers into GPT messages
    messages = [
        {"role": "system", "content": f"You are a helpful assistant with the following knowledge: {answers}"},
        {"role": "user", "content": f"Using your knowledge, answer this query: {query}"}
    ]

    # send the messages to a GPT model
    # (openai.ChatCompletion is the interface of `openai` package versions before 1.0)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=100,
        n=1,
        stop=None,
    )

    # extract the generated text from the API response
    generated_response = response['choices'][0]['message']['content']

    # print the generated response
    print(f"RESPONSE: \n{generated_response}\n")
    print("-" * 70)
```


```
QUERY:
Are there fewer counties or members in the House of Representatives?

CONTEXT:
0.3: The Legislature is composed of the Senate and the House of Representatives. The House, led by a speaker, has 73 members apportioned among the 15 counties on the basis of the national census, with each county receiving a minimum of two members. Each House member represents an electoral district within a county as drawn by the National Elections Commission and is elected by a plurality of the popular vote of their district into a six-year term. The Senate is made up of two senators from each county for a total of 30 senators. Senators serve nine-year terms and are elected at-large by a plurality of the popular vote. The vice president serves as the President of the Senate, with a President pro tempore serving in their absence.
0.36: The House of Representatives currently has 59 members elected for a five-year term, 56 members by proportional representation and 3 observer members representing the A
[...]

RESPONSE: 
The adjective that describes characteristics from countries that are not of one's own is "foreign." However, it is important to note that the concept of "foreignness" can vary depending on the perspective or frame of reference of the individual or culture using the term.

----------------------------------------------------------------------
```
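
The loop above interpolates the Python list of answers directly into the system prompt. One alternative, sketched here under assumptions, is to number the passages and cap the total context length. The helper name `build_messages` and the `max_chars` budget are hypothetical choices for illustration, not part of the plugin or the OpenAI API.

```python
def build_messages(query, answers, max_chars=4000):
    """Format retrieved passages as a numbered context block for a chat model."""
    context_lines = []
    used = 0
    for n, answer in enumerate(answers, start=1):
        line = f"[{n}] {answer}"
        if used + len(line) > max_chars:
            break  # stop before exceeding the (hypothetical) character budget
        context_lines.append(line)
        used += len(line)
    context = "\n".join(context_lines)
    return [
        {
            "role": "system",
            "content": f"You are a helpful assistant with the following knowledge:\n{context}",
        },
        {
            "role": "user",
            "content": f"Using your knowledge, answer this query: {query}",
        },
    ]
```

The returned list has the same two-message shape as `messages` in the loop, so it could be passed to the chat completion call in its place.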


### Delete the KDB.AI table

```python
res = requests.delete(
    "http://localhost:8000/delete",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    json={"ids": ['documents']},
)
```