# KDBAI for Q&A with ChatGPT Retrieval Plugin

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

    Before installing the dependencies with Poetry, add PyKX and the local KDBAI Python package as dependencies in `pyproject.toml`. Add the following lines under `[tool.poetry.dependencies]`, pointing `path` at your local KDBAI package:

    - `pykx = "^1.6.0"`
    - `kdbai = { path = "path/to/kdbaiPackage.tar.gz", develop = false }`

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. The token can be generated using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set `KDBAI`-specific environment variables:

* `DATASTORE`: set to `kdbai`.

9. Run the app with:

```
poetry run start
```
Once the application is running, a FastAPI server exposes the ChatGPT plugin's endpoints for upserting, querying, and deleting documents.

At startup, `KDBAI` creates a new HNSW index named `openaiEmbeddings` with the default parameters `{"efConstruction": 8, "efSearch": 8, "M": 32}`.
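Before running step 9, it can help to confirm that the variables from steps 7 and 8 are actually set in the shell that launches the app. The following is a minimal sketch of such a check, not part of the plugin itself; the function names `missing_env_vars` and `check_env` are hypothetical, and only the three variable names come from the steps above.

```python
import os

# Variable names taken from steps 7 and 8 above.
REQUIRED_VARS = ("BEARER_TOKEN", "OPENAI_API_KEY", "DATASTORE")


def missing_env_vars(env=os.environ):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_env(env=os.environ):
    """Return a list of readable problems; an empty list means the setup looks complete."""
    problems = [f"{name} is not set" for name in missing_env_vars(env)]
    datastore = env.get("DATASTORE")
    # Step 8 requires DATASTORE to be exactly `kdbai`.
    if datastore and datastore != "kdbai":
        problems.append(f"DATASTORE is {datastore!r}, expected 'kdbai'")
    return problems
```

Calling `check_env()` with no arguments inspects the current process environment; passing a plain dictionary makes the check easy to try without touching real settings.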

In [1]:
import os
import requests
from tqdm.auto import tqdm
from datasets import load_dataset




In [2]:
# If the bearer token is not set as an environment variable, you can set it manually here
BEARER_TOKEN = os.environ.get("BEARER_TOKEN")
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
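If `BEARER_TOKEN` is missing from the environment, `os.environ.get` returns `None` and the header above becomes the literal string `Bearer None`, which the server will reject. A small sketch of a stricter variant follows; `build_headers` is a hypothetical helper name, not something provided by the plugin.

```python
def build_headers(token):
    """Build the Authorization header, refusing a missing or empty token."""
    if not token:
        raise ValueError("BEARER_TOKEN is not set; export it or assign it manually")
    return {"Authorization": f"Bearer {token}"}


# Usage in the notebook (assumes `os` is imported as in the first cell):
# headers = build_headers(os.environ.get("BEARER_TOKEN"))
```

Failing at this point gives a clearer error than an authorization failure on the first request.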

### Load Dataset from Hugging Face

This demo uses the AdversarialQA dataset.
Its questions were written by human annotators with a model in the loop, so the dataset consists of questions that current state-of-the-art models find challenging.

In [3]:
data = load_dataset("adversarial_qa", 'adversarialQA', split="train").to_pandas()
data = data.drop_duplicates(subset=["context"])
print(f"Number of unique contexts: {len(data)}")
data.head()



Number of unique contexts: 2648


| | id | title | context | question | answers | metadata |
|---|---|---|---|---|---|---|
| 0 | 7ba1e8f4261d3170fcf42e84a81dd749116fae95 | Brain | Another approach to brain function is to exami... | What sare the benifts of the blood brain barrir? | {'text': ['isolated from the bloodstream'], 'a... | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 12 | 936a8460bfffe437b54cf3ec1e825a3b7b5627a1 | Brain | Motor systems are areas of the brain that are ... | What do you think with? | {'text': ['brain'], 'answer_start': [467]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 24 | e40737d487964dbcd26a223f2799cf56390a98a8 | Brain | The brain is an organ that serves as the cente... | How are neurons connected? | {'text': ['synapses'], 'answer_start': [602]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 37 | a0f8e785a10f6e21e24207d24ba2823162383062 | Brain | The SCN projects to a set of areas in the hypo... | The body's central biological clock is contain... | {'text': ['SCN'], 'answer_start': [4]} | {'split': 'train', 'model_in_the_loop': 'Combi... |
| 53 | 6d753d4a8878b5f5bc496d9a369b8c0b212079a0 | Brain | The brain contains several motor areas that pr... | What is at the highest level? | {'text': ['the primary motor cortex'], 'answer... | {'split': 'train', 'model_in_the_loop': 'Combi... |


In [5]:
# Formatting the documents for the retrieval plugin requests
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]

[{'id': '7ba1e8f4261d3170fcf42e84a81dd749116fae95',
  'text': 'Another approach to brain function is to examine the consequences of damage to specific brain areas. Even though it is protected by the skull and meninges, surrounded by cerebrospinal fluid, and isolated from the bloodstream by the blood–brain barrier, the delicate nature of the brain makes it vulnerable to numerous diseases and several types of damage. In humans, the effects of strokes and other types of brain damage have been a key source of information about brain function. Because there is no ability to experimentally control the nature of the damage, however, this information is often difficult to interpret. In animal studies, most commonly involving rats, it is possible to use electrodes or locally injected chemicals to produce precise patterns of damage and then examine the consequences for behavior.',
  'metadata': {'title': 'Brain'}},
 {'id': '936a8460bfffe437b54cf3ec1e825a3b7b5627a1',
  'text': 'Motor systems are 

### Insert data into the HNSW index

In [27]:
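batchSize = 100
s = requests.Session()

# Upsert the documents in batches of `batchSize`
for i in tqdm(range(0, len(documents), batchSize)):
    i_end = min(len(documents), i+batchSize)
    res = s.post(
        "http://localhost:8000/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # Stop at the first failed request instead of silently continuing
    res.raise_for_status()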

100%|██████████| 27/27 [00:38<00:00,  1.44s/it]
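The loop above slices `documents` into consecutive batches of `batchSize`; with 2648 documents and a batch size of 100 that gives the 27 requests shown in the progress bar. The same slicing can be sketched as a standalone helper, which makes the batch arithmetic easy to check without a running server. The names `iter_batches`, `upsert_in_batches`, and the `send` callback are hypothetical and are not part of the plugin or of `requests`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_in_batches(documents, send, batch_size=100):
    """Call `send(batch)` once per batch and return the number of batches sent.

    `send` is supplied by the caller and is expected to raise on failure.
    """
    count = 0
    for batch in iter_batches(documents, batch_size):
        send(batch)
        count += 1
    return count


# Possible use with the session and headers from this notebook:
# def send(batch):
#     res = s.post("http://localhost:8000/upsert", headers=headers,
#                  json={"documents": batch})
#     res.raise_for_status()
# upsert_in_batches(documents, send, batchSize)
```

Passing the request as a callback keeps the batching logic independent of the HTTP client, so it can be exercised with a stub before any data is sent.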


### Query the HNSW index

In [33]:
# Select questions from the AdversarialQA dataset and wrap each one
# in the {'query': ...} form expected by the /query endpoint
queries = [{'query': q} for q in data['question'].tolist()]
searchQueries = queries[150:152]
print(searchQueries)

[{'query': 'Who went into orbit following Gagarin?'}, {'query': 'Who was responsible for being the best in the Soveit space prgram?'}]


In [34]:
results = requests.post(
    "http://localhost:8000/query",
    headers=headers,
    json={
        'queries': searchQueries
    }
)
results

<Response [200]>
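The cell that follows walks the JSON body of this response: a top-level `results` list with one entry per query, each holding the `query` string and its own `results` list of items with `text` and `score`. As a minimal sketch under that assumed shape, the same traversal can be written as a pure function that also shortens long passages for display. `summarize_results` is a hypothetical helper, and it deliberately does not reorder the items, because the source does not state whether a lower or a higher score indicates a better match.

```python
import textwrap


def summarize_results(payload, max_chars=80):
    """Flatten a /query response body into (query, score, shortened_text) rows.

    Assumed shape, taken from how this notebook reads the response:
    {"results": [{"query": str, "results": [{"text": str, "score": float}]}]}
    """
    rows = []
    for query_response in payload["results"]:
        for result in query_response["results"]:
            # shorten() collapses whitespace and truncates on a word boundary
            text = textwrap.shorten(result["text"], width=max_chars, placeholder="...")
            rows.append((query_response["query"], round(result["score"], 2), text))
    return rows


# Possible use with the response above:
# for query, score, text in summarize_results(results.json()):
#     print(f"{score}: {text}")
```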

In [35]:
for query_response in results.json()['results']:
    query = query_response['query']
    answers = []
    scores = []
    for result in query_response['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    # Print the query followed by one "score: text" line per result
    lines = [f"{s}: {a}" for a, s in zip(answers, scores)]
    print("-"*70 + "\n" + query + "\n\n" + "\n".join(lines) + "\n" + "-"*70 + "\n\n")

----------------------------------------------------------------------
Who went into orbit following Gagarin?

0.35: On May 1, 1960, a U.S. one-man U-2 spy plane was reportedly shot down at high altitude over Soviet Union airspace. The flight was made to gain photo intelligence before the scheduled opening of an East–West summit conference, which had been scheduled in Paris, 15 days later. Captain Francis Gary Powers had bailed out of his aircraft and was captured after parachuting down onto Russian soil. Four days after Powers disappeared, the Eisenhower Administration had NASA issue a very detailed press release noting that an aircraft had "gone missing" north of Turkey. It speculated that the pilot might have fallen unconscious while the autopilot was still engaged, and falsely claimed that "the pilot reported over the emergency frequency that he was experiencing oxygen difficulties.
0.46: In 1955 American nuclear arms policy became one aimed primarily at arms control as opposed to 