# KDB.AI for Q&A with ChatGPT Retrieval Plugin

## Getting Started

### Prerequisites

- Python 3
- Pip
- Git

### Install the KDB.AI ChatGPT Retrieval Plugin server app

```
git clone https://github.com/KxSystems/chatgpt-retrieval-plugin -b KDB.AI
cd chatgpt-retrieval-plugin
pip install poetry
poetry install
```

### Run the KDB.AI ChatGPT Retrieval Plugin server app

```
export BEARER_TOKEN='<BEARER TOKEN>'  # you can create your own bearer token on auth0.com
export DATASTORE=kdbai
export KDBAI_ENDPOINT='<KDB.AI ENDPOINT>'
export KDBAI_API_KEY='<KDB.AI API KEY>'
export OPENAI_API_KEY='<OPENAI API KEY>'  # You can get a free API key on https://platform.openai.com

poetry run start
```

### Install a separate Jupyter environment to run this notebook

```
pip install datasets jupyter openai tqdm
```

### Run Jupyter

```
export BEARER_TOKEN='<BEARER TOKEN>'  # Same bearer token as above
export OPENAI_API_KEY='<OPENAI API KEY>'

jupyter notebook
```

Then open this notebook in Jupyter.

In [1]:
import os
from pprint import pprint
import random

from datasets import load_dataset
import openai
import requests
from tqdm.auto import tqdm

In [2]:
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")
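
`os.environ.get` returns `None` when the variable is unset, in which case the requests further down would send the literal header `Bearer None`. A minimal sketch of a fail-fast check follows; `require_env` is a hypothetical helper written for this notebook, not part of the plugin.

```python
import os


def require_env(name, env=None):
    """Return the value of environment variable `name`, or raise if it is unset or empty.

    Hypothetical helper for this notebook; not part of the plugin.
    """
    env = os.environ if env is None else env
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# example with an explicit mapping, so the sketch does not depend on the real environment
print(require_env("BEARER_TOKEN", env={"BEARER_TOKEN": "example-token"}))
```

In the notebook itself, `BEARER_TOKEN = require_env("BEARER_TOKEN")` would surface a missing token before any request is sent.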

### Load Dataset from Hugging Face

This demo uses the AdversarialQA dataset (`adversarial_qa` on Hugging Face).
Its questions were written by human annotators with a model in the loop, so the dataset consists of questions that state-of-the-art models found challenging when it was collected.

In [3]:
data = load_dataset("adversarial_qa", 'adversarialQA', split="train").to_pandas()
data = data.drop_duplicates(subset=["context"])
print(f"Number of unique contexts: {len(data)}")
data.head()

Number of unique contexts: 2648


| | id | title | context | question | answers | metadata |
|---|---|---|---|---|---|---|
| 0 | 7ba1e8f4261d3170fcf42e84a81dd749116fae95 | Brain | Another approach to brain function is to exami... | What sare the benifts of the blood brain barrir? | {'text': ['isolated from the bloodstream'], 'a... | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 12 | 936a8460bfffe437b54cf3ec1e825a3b7b5627a1 | Brain | Motor systems are areas of the brain that are ... | What do you think with? | {'text': ['brain'], 'answer_start': [467]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 24 | e40737d487964dbcd26a223f2799cf56390a98a8 | Brain | The brain is an organ that serves as the cente... | How are neurons connected? | {'text': ['synapses'], 'answer_start': [602]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 37 | a0f8e785a10f6e21e24207d24ba2823162383062 | Brain | The SCN projects to a set of areas in the hypo... | The body's central biological clock is contain... | {'text': ['SCN'], 'answer_start': [4]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 53 | 6d753d4a8878b5f5bc496d9a369b8c0b212079a0 | Brain | The brain contains several motor areas that pr... | What is at the highest level? | {'text': ['the primary motor cortex'], 'answer... | {'split': 'train', 'model_in_the_loop': 'Combi... |


In [4]:
# extract text data from the dataset
documents = [
    {
        'text': r['context'],
    } for r in data.to_dict(orient='records')
]
pprint(documents[0])

{'text': 'Another approach to brain function is to examine the consequences of '
         'damage to specific brain areas. Even though it is protected by the '
         'skull and meninges, surrounded by cerebrospinal fluid, and isolated '
         'from the bloodstream by the blood–brain barrier, the delicate nature '
         'of the brain makes it vulnerable to numerous diseases and several '
         'types of damage. In humans, the effects of strokes and other types '
         'of brain damage have been a key source of information about brain '
         'function. Because there is no ability to experimentally control the '
         'nature of the damage, however, this information is often difficult '
         'to interpret. In animal studies, most commonly involving rats, it is '
         'possible to use electrodes or locally injected chemicals to produce '
         'precise patterns of damage and then examine the consequences for '
         'behavior.'}


In [5]:
# initialise an HTTP session with the KDB.AI ChatGPT Retrieval Plugin app
s = requests.Session()

### Insert data into the KDB.AI table

The `/upsert` endpoint inserts data into the KDB.AI datastore in batches. Each batch is embedded with `text-embedding-ada-002` before it is added to the table.

In [6]:
batch_size = 100

# upsert documents from the dataset in batches
for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i + batch_size)

    res = s.post(
        "http://localhost:8000/upsert",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        json={"documents": documents[i:i_end]},
    )
    # stop at the first failed batch instead of silently continuing
    res.raise_for_status()

  0%|          | 0/27 [00:00<?, ?it/s]
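
The batching in the loop above can be isolated into a small generator, which makes the batch boundaries easy to check without a running server. This is a sketch only; `batched` and `toy_documents` are illustrative names, not part of the plugin.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each holding at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# toy stand-in for the `documents` list built earlier
toy_documents = [{"text": f"document {n}"} for n in range(250)]
batch_sizes = [len(batch) for batch in batched(toy_documents, 100)]
print(batch_sizes)  # [100, 100, 50]
```

With the 2648 unique contexts loaded earlier and a batch size of 100, the same slicing produces 27 batches, which matches the total shown in the progress bar.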

### Query the KDB.AI table

In [7]:
# extract questions and reformat into queries
queries = [{'query': q} for q in data['question'].tolist()]

# pick a random starting index and take 5 consecutive queries
i = random.randint(0, len(queries)-5)
searchQueries = queries[i:i+5]

print(searchQueries)

[{'query': 'what group is mentioned second to last?'}, {'query': 'What is the type of the third mentioned location?'}, {'query': 'What intangible object is central to the proper functioning of a lighting fixture?'}, {'query': 'What is the exception to the less bright form being a type of light meant to illuminate broadly?'}, {'query': 'Solid state lighting has increased because of what?'}]


The `/query` endpoint retrieves relevant information from the KDB.AI datastore. Each query is embedded into a vector, and a similarity search finds that vector's nearest neighbours, which represent the most relevant entries in the table.
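
To make the nearest-neighbour idea concrete, here is a toy sketch over hand-made two-dimensional vectors. It assumes Euclidean distance and a brute-force scan purely for illustration; the metric and index that the plugin's KDB.AI table actually uses are not stated in this notebook, and `nearest_neighbours` and `toy_table` are hypothetical names.

```python
import math


def nearest_neighbours(query_vector, table, k=3):
    """Return the k (distance, text) pairs closest to query_vector, nearest first."""
    scored = [
        (math.dist(query_vector, vector), text)
        for text, vector in table
    ]
    return sorted(scored)[:k]


# hand-made 2-dimensional "embeddings"; real embeddings have many more dimensions
toy_table = [
    ("neurons and synapses", (0.9, 0.1)),
    ("motor cortex", (0.7, 0.3)),
    ("solid state lighting", (0.1, 0.9)),
]

for distance, text in nearest_neighbours((1.0, 0.0), toy_table, k=2):
    print(f"{distance:.2f}: {text}")
```

As with the scores printed later in this notebook, a smaller distance means a closer match.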

In [8]:
# query the vector database
results = requests.post(
    "http://localhost:8000/query",
    
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}"
    },
    
    json = {
        'queries': searchQueries
    }
)

print(results)

<Response [200]>
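
`<Response [200]>` only confirms that the request succeeded; the matches themselves are in the JSON body. The sketch below flattens such a body into rows, assuming the shape that the following cell relies on: a top-level `results` list whose entries carry a `query` and a nested `results` list of `text`/`score` items. `flatten_results` is a hypothetical helper, and `sample_body` is hand-written illustration data, not real server output.

```python
def flatten_results(body):
    """Turn a /query response body into a flat list of (query, score, text) rows."""
    rows = []
    for query_response in body["results"]:
        for match in query_response["results"]:
            rows.append((query_response["query"], match["score"], match["text"]))
    return rows


# hand-written illustration data in the assumed response shape (not real server output)
sample_body = {
    "results": [
        {
            "query": "How are neurons connected?",
            "results": [
                {"text": "first matching passage", "score": 0.31},
                {"text": "second matching passage", "score": 0.42},
            ],
        }
    ]
}

for query, score, text in flatten_results(sample_body):
    print(f"{query} | {score:.2f} | {text}")
```

In the notebook, `flatten_results(results.json())` would give the same kind of rows for the five search queries.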


In the cell below, we iterate through each query and its results, and pass this data to ChatGPT, which uses it to respond to the query with natural language.

In [9]:
# iterate through each set of queries/results
for query_response in results.json()['results']:
    query = query_response['query']
    answers = []
    scores = []
    
    # extract answers and scores from each result
    for result in query_response['results']:
        
        # answer = textual information related to the query
        answers.append(result['text'])
        
        # score = distance between the query vector and the answer vector (smaller=better!)
        scores.append(round(result['score'], 2))
    
    # print the query
    print("\nQUERY:\n"+query)
    
    # print the query responses, and their scores
    print("\nCONTEXT:\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n")
    
    # format the query and its answers into GPT messages
    messages = [
        {"role": "system", "content": f"You are a helpful assistant with the following knowledge: {answers}"},
        {"role": "user", "content": f"Using your knowledge, answer this query: {query}"}
    ]
    
    # send the messages to a GPT model
    # (openai.ChatCompletion is the interface of openai package versions before 1.0)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=messages,
        max_tokens=100,
        n=1,
        stop=None,
    )

    # extract the generated response from the API response
    generated_response = response['choices'][0]['message']['content']
    
    # Print the generated response
    print(f"RESPONSE: \n{generated_response}\n")
    print("-"*70)


QUERY:
what group is mentioned second to last?

CONTEXT:
0.43: Listing all finite simple groups was a major achievement in contemporary group theory. 1998 Fields Medal winner Richard Borcherds succeeded in proving the monstrous moonshine conjectures, a surprising and deep relation between the largest finite simple sporadic group—the "monster group"—and certain modular functions, a piece of classical complex analysis, and string theory, a theory supposed to unify the description of many physical phenomena.
0.44: note 2]
0.44: note 2]

RESPONSE: 
The group mentioned second to last is the "monster group," which is the largest finite simple sporadic group.

----------------------------------------------------------------------

QUERY:
What is the type of the third mentioned location?

CONTEXT:
0.44: Upper Palaeolithic deposits, including bones of Homo sapiens, have been found in local caves, and artefacts dating from the Bronze Age to the Middle Iron Age have been found at Mount Batten sh
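
The loop above interpolates the Python list `answers` directly into the system prompt, so the model sees the list's `repr`, brackets and quotes included. One possible alternative, sketched here with a hypothetical `build_messages` helper, numbers the passages and can drop those whose distance exceeds a cutoff. The cutoff value used below is an arbitrary assumption for illustration, not a recommended setting.

```python
def build_messages(query, answers, scores, max_distance=None):
    """Build chat messages from retrieved passages, optionally dropping distant ones."""
    kept = [
        answer
        for answer, score in zip(answers, scores)
        if max_distance is None or score <= max_distance
    ]
    context = "\n".join(f"[{n}] {answer}" for n, answer in enumerate(kept, start=1))
    return [
        {
            "role": "system",
            "content": f"You are a helpful assistant with the following knowledge:\n{context}",
        },
        {"role": "user", "content": f"Using your knowledge, answer this query: {query}"},
    ]


# toy inputs standing in for one query and its retrieved passages
messages = build_messages(
    "How are neurons connected?",
    ["Neurons are connected by synapses.", "note 2]"],
    [0.31, 0.44],
    max_distance=0.40,
)
print(messages[0]["content"])
```

The returned list has the same two-message structure as `messages` in the loop, so it could be passed to the chat completion call unchanged.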

### Delete the KDB.AI table

In [10]:
# After this call, restart the KDB.AI ChatGPT Retrieval Plugin server app
# to recreate the table
res = requests.delete(
    "http://localhost:8000/delete",
    headers = {
        "Authorization": f"Bearer {BEARER_TOKEN}"
    }, 
    json = {
        "delete_all": True
    }
)