# Using Elasticsearch as a datastore

In this walkthrough we will see how to use the retrieval API with an Elasticsearch datastore for *search / question-answering*.

Before running this notebook, you should already have initialized the retrieval API and have it running locally or elsewhere. See the README for instructions on how to do this.

## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Elasticsearch-specific environment variables:

* `DATASTORE`: set to `elasticsearch`.

9. Set the Elasticsearch connection environment variables. Set either `ELASTICSEARCH_CLOUD_ID` or `ELASTICSEARCH_URL`.

* `ELASTICSEARCH_CLOUD_ID`: Set to your deployment cloud ID. You can find this in the [Elasticsearch console](https://cloud.elastic.co).

* `ELASTICSEARCH_URL`: Set to your Elasticsearch URL, which looks like `https://<username>:<password>@<host>:<port>`. You can find this in the [Elasticsearch console](https://cloud.elastic.co).

10. Set the Elasticsearch authentication environment variables. Set either `ELASTICSEARCH_USERNAME` and `ELASTICSEARCH_PASSWORD`, or `ELASTICSEARCH_API_KEY`.

* `ELASTICSEARCH_USERNAME`: Set to your Elasticsearch username. You can find this in the [Elasticsearch console](https://cloud.elastic.co). Typically this is set to `elastic`.

* `ELASTICSEARCH_PASSWORD`: Set to your Elasticsearch password. You can find this in the security section of the [Elasticsearch console](https://cloud.elastic.co).

* `ELASTICSEARCH_API_KEY`: Set to your Elasticsearch API key. You can create one on the Kibana Stack Management page.

11. Set the Elasticsearch index environment variable.

* `ELASTICSEARCH_INDEX`: Set to the name of the Elasticsearch index you want to use.

12. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app is automatically connected to our index (specified by `ELASTICSEARCH_INDEX`). If no index with that name existed beforehand, the app creates one for us.
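If the app fails to start, a common cause is an incomplete configuration. The following is a minimal sketch of a check that mirrors the either/or rules from steps 7 to 11. The helper name `missing_settings` is hypothetical and is not part of the retrieval app; it only inspects a mapping of environment variables and does not contact Elasticsearch.

```python
from typing import List, Mapping


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems with the app configuration."""
    problems = []

    # Variables that steps 7 and 11 ask for unconditionally.
    for name in ("BEARER_TOKEN", "OPENAI_API_KEY", "ELASTICSEARCH_INDEX"):
        if not env.get(name):
            problems.append(f"{name} is not set")

    # Step 8: the datastore must be Elasticsearch.
    if env.get("DATASTORE") != "elasticsearch":
        problems.append("DATASTORE must be set to 'elasticsearch'")

    # Step 9: either a cloud ID or a URL.
    if not (env.get("ELASTICSEARCH_CLOUD_ID") or env.get("ELASTICSEARCH_URL")):
        problems.append("set ELASTICSEARCH_CLOUD_ID or ELASTICSEARCH_URL")

    # Step 10: either username and password, or an API key.
    has_basic_auth = bool(
        env.get("ELASTICSEARCH_USERNAME") and env.get("ELASTICSEARCH_PASSWORD")
    )
    if not (has_basic_auth or env.get("ELASTICSEARCH_API_KEY")):
        problems.append(
            "set ELASTICSEARCH_USERNAME and ELASTICSEARCH_PASSWORD, "
            "or ELASTICSEARCH_API_KEY"
        )

    return problems
```

Calling `missing_settings(os.environ)` in the shell session that launches the app returns an empty list when every rule above is satisfied.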

Now we're ready to move on to populating our index with some data.

## Required Libraries

There are a few Python libraries we must `pip install` for this notebook to run:

In [39]:
!pip install -qU datasets pandas tqdm


## Preparing Data

In this example, we will use the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD2), which we download from Hugging Face Datasets.

In [None]:
from datasets import load_dataset

data = load_dataset("squad_v2", split="train")
data

Transform the data into a Pandas dataframe for simpler preprocessing.

In [41]:
data = data.to_pandas()
data.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | {'text': ['in the late 1990s'], 'answer_start'... |
| 1 | 56be85543aeaaa14008c9065 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | What areas did Beyonce compete in when she was... | {'text': ['singing and dancing'], 'answer_star... |
| 2 | 56be85543aeaaa14008c9066 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce leave Destiny's Child and bec... | {'text': ['2003'], 'answer_start': [526]} |
| 3 | 56bf6b0f3aeaaa14008c9601 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In what city and state did Beyonce grow up? | {'text': ['Houston, Texas'], 'answer_start': [... |
| 4 | 56bf6b0f3aeaaa14008c9602 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | In which decade did Beyonce become famous? | {'text': ['late 1990s'], 'answer_start': [276]} |


The dataset contains many duplicate `context` paragraphs because each `context` can have many relevant questions. We don't want these duplicates, so we remove them like so:

In [42]:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()

19029


| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 56be85543aeaaa14008c9063 | Beyoncé | Beyoncé Giselle Knowles-Carter (/biːˈjɒnseɪ/ b... | When did Beyonce start becoming popular? | {'text': ['in the late 1990s'], 'answer_start'... |
| 15 | 56be86cf3aeaaa14008c9076 | Beyoncé | Following the disbandment of Destiny's Child i... | After her second solo album, what other entert... | {'text': ['acting'], 'answer_start': [207]} |
| 27 | 56be88473aeaaa14008c9080 | Beyoncé | A self-described "modern-day feminist", Beyonc... | In her music, what are some recurring elements... | {'text': ['love, relationships, and monogamy']... |
| 39 | 56be892d3aeaaa14008c908b | Beyoncé | Beyoncé Giselle Knowles was born in Houston, T... | Beyonce's younger sibling also sang with her i... | {'text': ['Destiny's Child'], 'answer_start': ... |
| 52 | 56be8a583aeaaa14008c9094 | Beyoncé | Beyoncé attended St. Mary's Elementary School ... | What town did Beyonce go to school in? | {'text': ['Fredericksburg'], 'answer_start': [... |


The format required by the app's `upsert` function is a list of documents like:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional.
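Before sending anything to the app, it can be useful to check a batch against this rule locally. The sketch below uses a hypothetical helper, `validate_documents`, which is not part of the retrieval app. It is deliberately a little stricter than the rule above: it assumes `"text"` should be a non-empty string and that `"metadata"`, when present, is a dictionary.

```python
from typing import Any, Dict, List


def validate_documents(documents: List[Dict[str, Any]]) -> None:
    """Raise ValueError if a document does not match the expected shape."""
    for position, doc in enumerate(documents):
        # "text" is the only required field.
        text = doc.get("text")
        if not isinstance(text, str) or not text.strip():
            raise ValueError(
                f"document at position {position} has no usable 'text' field"
            )
        # "metadata" is optional; this sketch assumes it is a dict when given.
        metadata = doc.get("metadata")
        if metadata is not None and not isinstance(metadata, dict):
            raise ValueError(
                f"document at position {position} has non-dict 'metadata'"
            )
```

The function returns `None` when every document passes, so it can be called on the full list right before the first upsert request.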

To create this format for our SQuAD data we do:

In [None]:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]

### Indexing the Docs

Now it's time to index our documents, an operation also known as upserting. To make these requests to the retrieval app API, we must provide authorization using the `BEARER_TOKEN` we defined earlier:

In [44]:
import os

BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"

headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
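Because the cell above falls back to the literal string `"BEARER_TOKEN_HERE"` when the environment variable is missing, a forgotten token only shows up later as a rejected request. A minimal sketch of a stricter alternative follows; the helper name `auth_headers` is hypothetical and not part of the retrieval app.

```python
from typing import Dict


def auth_headers(token: str, placeholder: str = "BEARER_TOKEN_HERE") -> Dict[str, str]:
    """Build the Authorization header, refusing an empty or placeholder token."""
    if not token or token == placeholder:
        raise ValueError("BEARER_TOKEN is not configured")
    return {"Authorization": f"Bearer {token}"}
```

With this helper, `headers = auth_headers(BEARER_TOKEN)` fails immediately in the notebook instead of at request time.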

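Now we will execute bulk inserts in batches of `batch_size` documents. Note that the loop below iterates over `range(0, 10, batch_size)`, which yields a single batch, so only the first 100 documents are sent. To index the full dataset, iterate over `range(0, len(documents), batch_size)` instead.

Once the documents have been indexed, we can proceed with the querying phase.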

In [46]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()

# we setup a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, 10, batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

100%|██████████| 1/1 [00:16<00:00, 16.88s/it]
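The slicing in the upsert loop can also be factored into a small generator, which keeps the request code free of index arithmetic. This is a sketch only; `iter_batches` is a hypothetical helper name and is not provided by the retrieval app or by `requests`.

```python
from typing import Iterator, Sequence, TypeVar

T = TypeVar("T")


def iter_batches(items: Sequence[T], batch_size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield items[start:start + batch_size]
```

Each yielded slice can be passed as the `"documents"` value of the JSON body posted to `/upsert`.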


### Making Queries

By passing one or more queries to the `/query` endpoint, we can query the datastore. For this example, we use a few questions from SQuAD2.

In [47]:
queries = data['question'].tolist()
# format into the structure needed by the /query endpoint
queries = [{'query': q} for q in queries]
len(queries)

19029

In [49]:
res = requests.post(
    f"{endpoint_url}/query",  # same base URL as the upsert requests
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res

<Response [200]>

We can now iterate through the responses and inspect the results returned for each query:

In [50]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    # one "score: text" line per returned document
    lines = "\n".join(f"{score}: {answer}" for answer, score in zip(answers, scores))
    print("-"*70 + "\n" + query + "\n\n" + lines + "\n" + "-"*70 + "\n\n")

----------------------------------------------------------------------
When did Beyonce start becoming popular?

0.93: On December 13, 2013, Beyoncé unexpectedly released her eponymous fifth studio album on the iTunes Store without any prior announcement or promotion. The album debuted atop the Billboard 200 chart, giving Beyoncé her fifth consecutive number-one album in the US. This made her the first woman in the chart's history to have her first five studio albums debut at number one. Beyoncé received critical acclaim and commercial success, selling one million digital copies worldwide in six days; The New York Times noted the album's unconventional, unexpected release as significant. Musically an electro-R&B album, it concerns darker themes previously unexplored in her work, such as "bulimia, postnatal depression [and] the fears and insecurities of marriage and motherhood". The single "Drunk in Love", featuring Jay Z, peaked at number two on the Billboard Hot 100 chart.
0.93: Beyon

The top results are all relevant, as we would have hoped. The `score` measures how relevant a document is to the query: the higher the score, the more relevant the document.
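Since the returned passages can be long, a compact view is sometimes easier to scan. The sketch below assumes the response structure used in the loop above (a top-level `results` list whose items carry `query` and a nested `results` list with `text` and `score`). The helper name `summarize_results` and the `max_chars` parameter are hypothetical.

```python
from typing import Any, Dict, List


def summarize_results(response_json: Dict[str, Any], max_chars: int = 80) -> List[str]:
    """Return one line per query followed by one truncated line per result."""
    lines = []
    for query_result in response_json["results"]:
        lines.append(query_result["query"])
        for result in query_result["results"]:
            text = result["text"]
            # Truncate long passages so each result fits on one line.
            snippet = text if len(text) <= max_chars else text[:max_chars] + "..."
            lines.append(f"  {result['score']:.2f}: {snippet}")
    return lines
```

For example, `print("\n".join(summarize_results(res.json())))` prints each query with its scored, shortened passages.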