# Using MongoDB Atlas as a datastore

In this walkthrough, we will see how to use the retrieval API with a MongoDB Atlas datastore for *search / question-answering*.

Before running this notebook, you should have already set up the retrieval API and have it running locally or elsewhere. See the README for instructions on how to do this.


## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set MongoDB-specific environment variables:

* `DATASTORE`: set to `mongodb`.

9. Set the MongoDB connection-specific environment variable `MONGODB_CONNECTION_URI`.
* `MONGODB_URL`: The MongoDB connection URL (often referred to as `MONGODB_URL` or the MongoDB URI). To obtain it, go to the Database section in the "Clusters" dashboard, click the "Connect" button for your cluster, choose "Drivers", and copy the URI string from the code example. The URI looks something like this: `mongodb+srv://<username>:<password>@<cluster>/?authSource=<authSource>&authMechanism=<authMechanism>`

10. Alternatively, set MongoDB authentication-specific environment variables:
`MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST`, `MONGODB_PORT`, `MONGODB_AUTHSOURCE`, `MONGODB_AUTHMECHANISM`, and `MONGODB_COLLECTION`.

11. Set the MongoDB index-specific environment variables.
* `MONGODB_INDEX`: Set to the name of the MongoDB index you want to use.

12. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

Now we're ready to move on to populating our index with some data.

## Required Libraries

This notebook needs a few Python libraries, which we install with `pip`:

In [None]:
!pip install -qU datasets pandas tqdm

## Preparing Data

In this example, we will use the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD2), which we download from Hugging Face Datasets.

In [None]:
from datasets import load_dataset

data = load_dataset("squad_v2", split="train")
data

Transform the data into a Pandas dataframe for simpler preprocessing.

In [None]:
data = data.to_pandas()
data.head()

The dataset contains many duplicate `context` paragraphs because each `context` can have several relevant questions. We don't want these duplicates, so we remove them:

In [None]:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()
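
To make explicit what `drop_duplicates(subset=["context"])` keeps, here is a plain-Python sketch of the same rule: the first row seen for each distinct `context` survives and later rows with that `context` are dropped. The function name and the sample rows are made up for illustration; the notebook itself relies on pandas.

```python
def drop_duplicate_contexts(records):
    """Keep the first record seen for each distinct `context` value."""
    seen = set()
    unique = []
    for record in records:
        if record["context"] in seen:
            continue  # a later question about a paragraph we already have
        seen.add(record["context"])
        unique.append(record)
    return unique


# Made-up rows: two questions share the same context paragraph.
rows = [
    {"id": "q1", "context": "Paragraph A", "question": "First question about A?"},
    {"id": "q2", "context": "Paragraph A", "question": "Second question about A?"},
    {"id": "q3", "context": "Paragraph B", "question": "A question about B?"},
]
unique_rows = drop_duplicate_contexts(rows)
print([row["id"] for row in unique_rows])  # ['q1', 'q3']
```

One consequence worth noting is that only the first question for each paragraph remains in the deduplicated data.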

The format required by the app's `upsert` function is a list of documents like:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional.
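
A quick local check of this rule before sending anything to the app might look like the sketch below. `validate_documents` is a hypothetical helper, not part of the retrieval API, and treating an empty `"text"` string as an error is an extra assumption made here.

```python
def validate_documents(documents):
    """Return a list of problems found; an empty list means the batch looks fine."""
    problems = []
    for position, doc in enumerate(documents):
        text = doc.get("text")
        # "text" is the only required field; "id" and "metadata" may be absent.
        if not isinstance(text, str) or not text.strip():
            problems.append(f"document {position}: missing or empty 'text'")
        if "metadata" in doc and not isinstance(doc["metadata"], dict):
            problems.append(f"document {position}: 'metadata' should be an object")
    return problems


sample = [
    {"id": "abc", "text": "some important document text", "metadata": {"field1": "x"}},
    {"text": "id and metadata are optional"},
    {"id": "123", "metadata": {"field2": 71}},  # no "text": should be flagged
]
problems = validate_documents(sample)
print(problems)
```

Only the third sample document is reported, since it is the only one without a `"text"` field.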

To create this format for our SQuAD data we do:

In [None]:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]

### Indexing the Docs

Now it's time to index, or upsert, our documents. To make these requests to the retrieval app API, we must authorize them with the `BEARER_TOKEN` we defined earlier:

In [None]:
import os

BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIxMjM0NTY3ODkwIiwibmFtZSI6IkpvaG4gRG9lIiwiaWF0IjoxNTE2MjM5MDIyfQ.SflKxwRJSMeKKF2QT4fwpMeJf36POk6yJV_adQssw5c"


In [None]:

headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

Now we will execute bulk inserts in batches whose size is set by `batch_size`. Once all our SQuAD2 records have been indexed, we can proceed to the querying phase.

In [None]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()

# we setup a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))
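# only the first 10 documents are upserted here to keep the demo quick;
# remove this slice to index the full dataset
documents = documents[:10]
for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )
    # stop early if the request was rejected (e.g. a 401 from a bad token)
    res.raise_for_status()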
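
The slicing arithmetic in the loop above can be isolated into a small generator, which makes it easy to check how a list is split without contacting the app. This is a sketch only; `iter_batches` is a hypothetical helper and the sample documents are made up.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        end = min(len(items), start + batch_size)
        yield items[start:end]


docs = [{"id": str(n), "text": f"document {n}"} for n in range(7)]
batch_sizes = [len(batch) for batch in iter_batches(docs, batch_size=3)]
print(batch_sizes)  # [3, 3, 1]
```

The last batch is simply shorter when the number of documents is not a multiple of the batch size.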

### Making Queries

We can query the datastore by passing one or more queries to the `/query` endpoint. For this, we can use a few questions from SQuAD2.

In [None]:
questions = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': question} for question in questions]
len(queries)

In [None]:
# reuse the same base URL as the upsert requests
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res
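
The next step reads the response as a `results` list with one entry per query, each carrying its own `results` list of `text` and `score` pairs. As a sketch of working with that shape, the hypothetical helper below picks the highest-scoring hit for each query. The sample payload is hand-written for illustration and is not real output from the API.

```python
def best_hits(response_json):
    """Map each query string to its highest-scoring result, or None if it has none."""
    best = {}
    for query_result in response_json["results"]:
        hits = query_result["results"]
        best[query_result["query"]] = (
            max(hits, key=lambda hit: hit["score"]) if hits else None
        )
    return best


# Hand-written stand-in with the same fields the notebook reads.
sample_response = {
    "results": [
        {
            "query": "example question one",
            "results": [
                {"text": "passage a", "score": 0.71},
                {"text": "passage b", "score": 0.89},
            ],
        },
        {"query": "example question two", "results": []},
    ]
}
top = best_hits(sample_response)
print(top["example question one"]["text"])  # passage b
```

With a real response, `best_hits(res.json())` would be called the same way, assuming the request succeeded.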

At this point, we have the ability to iterate through the responses and observe the outcomes obtained for each query:

In [None]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

The top results are all relevant, as we would have hoped. The `score` measures how relevant a document is to the query: the higher the score, the more relevant the document.

Finally, we can remove documents through the `/delete` endpoint, either by passing specific IDs or by deleting everything at once:

In [None]:
# check that deleting by ID works (replace these with IDs from your own datastore)
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"ids":["65991f75a315f755c3365ab2", "65991f75a315f755c3365ab3"]}
)

response.json()

In [None]:
# check that deleting all documents works
response = requests.delete(
    f"{endpoint_url}/delete",
    headers=headers,
    json={"delete_all":True}
)

response.json()