In [16]:
import os
import requests

# Document retrieval: upsert and query basic usage

In this walkthrough we will see how to use the retrieval API with a Redis datastore for *semantic search / question-answering*. We will also provide a basic demo showing how to use the "filter" function.

Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. The full instructions for doing this are found on the chatgpt-retrieval-plugin [page](https://github.com/openai/chatgpt-retrieval-plugin#quickstart). Please follow the instructions to start the app with the Redis datastore.

Additional examples using the search features can be found [here](https://github.com/openai/chatgpt-retrieval-plugin/blob/main/examples/providers/pinecone/semantic-search.ipynb).

## Document

First we will prepare a collection of documents. From the perspective of the retrieval plugin, a [document](https://github.com/openai/chatgpt-retrieval-plugin/blob/main/models/models.py) consists
of an "id", "text" and a collection of "metadata".

The "metadata" has "source", "source_id", "created_at", "url" and "author" fields. Query metadata does not expose the "url" field.

The "source" field is an Enum and can only be one of ("file", "email" or "chat").
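
The document shape described above can be checked on the client before upserting. The following is a minimal sketch, not part of the plugin: `check_document`, `ALLOWED_SOURCES` and `METADATA_FIELDS` are hypothetical names, and the rules encode only what this notebook states (the API itself may be more lenient about which fields are required).

```python
# Hypothetical client-side check based on the description in this notebook.
ALLOWED_SOURCES = {"file", "email", "chat"}
METADATA_FIELDS = {"source", "source_id", "created_at", "url", "author"}


def check_document(doc):
    """Return a list of problems found in a document dict (empty if none)."""
    problems = []
    # Top-level fields named in the text above.
    for key in ("id", "text", "metadata"):
        if key not in doc:
            problems.append(f"missing top-level field: {key}")
    metadata = doc.get("metadata", {})
    # Flag metadata keys that the text above does not list.
    unknown = set(metadata) - METADATA_FIELDS
    if unknown:
        problems.append(f"unknown metadata fields: {sorted(unknown)}")
    # "source" must be one of the three allowed values.
    source = metadata.get("source")
    if source is not None and source not in ALLOWED_SOURCES:
        problems.append(f"invalid source: {source!r}")
    return problems


example = {"id": "demo", "text": "some text", "metadata": {"source": "web"}}
print(check_document(example))
```

Returning a list of problems rather than raising makes it easy to report every issue in a batch of documents at once.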

Text is taken from company SEC 10-K filings which are in the public domain.

For demonstration, we will insert some **fake** authors for the documents; see the respective links for the original sources.

In [17]:
document_1 = {
    "id": "twtr",
    "text": """Postponements, suspensions or cancellations of major events, such as sporting events
                and music festivals, may lead to people perceiving the content on Twitter as less
                relevant or useful or of lower quality, which could negatively affect mDAU growth,
                or may reduce monetization opportunities in connection with such events.""",
    "metadata" : {
        "source" : "file",
        "source_id" : "test:twtr10k",
        "created_at": "2020-12-31",
        "url": "https://www.sec.gov/Archives/edgar/data/1418091/000141809121000031/twtr-20201231.htm",
        "author": 'Elvis Tusk Sr.'        
    }
}

document_2 = {
    "id": "tsla",
    "text": """Because we do not have independent dealer networks, we are responsible for delivering
               all of our vehicles to our customers.""",
    "metadata" : {
        "source" : "file",
        "source_id" : "test:tesla10k",
        "created_at": "2021-12-31",
        "url": "https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/tsla-20211231.htm",
        "author": 'Elvis Tusk Jr.'        
    }     
}

document_3 = {
    "id": "xom",
    "text": """All practical and economically-viable energy sources will need to be pursued to continue
               meeting global energy demand, recognizing the scale and variety of worldwide energy needs
               as well as the importance of expanding access to modern energy to promote better standards
               of living for billions of people.""",
    "metadata" : {
        "source" : "file",
        "source_id" : "test:xom10k",
        "created_at": "2020-12-31",
        "url": "https://www.sec.gov/Archives/edgar/data/34088/000003408821000012/xom-20201231.htm",
        "author": 'Vape Jordan'        
    }     
}


### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [18]:
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"
endpoint_url = 'http://0.0.0.0:8000'
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

With the authorization `headers` defined, we upsert the three documents:

In [19]:
response = requests.post(
    f"{endpoint_url}/upsert",
    headers=headers,
    json={
        "documents": [document_1, document_2, document_3]
    }
)
response.raise_for_status()

### Example filter syntax
In our example data we have tagged each company's 10-K document with a "source_id": test:twtr10k, test:tesla10k, and test:xom10k.
We have also assigned **fake** authors to the documents: Elvis Tusk Jr., Elvis Tusk Sr. and Vape Jordan. We will filter on these fields.

### TAG Fields

"source" and "source_id" are "TAG" fields. Redis supports a limited [query syntax](https://redis.io/docs/stack/search/reference/tags/) on TAG fields, which includes an "or" syntax, e.g. "test:twtr10k|test:tesla10k", and a ```*``` wildcard to match a prefix.

In this example we have only two documents that match the filter so only two documents will show.

Gotcha: There cannot be spaces around the bar "|", i.e. "test:twtr10k|test:tesla10k" is valid, but "test:twtr10k | test:tesla10k" is not.
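
To avoid the whitespace gotcha when the tag values come from user input or a list, the "or" filter can be assembled programmatically. This is a small sketch; `tag_or` is a hypothetical helper, not part of the plugin or of Redis.

```python
def tag_or(*values):
    """Join tag values with "|" and no surrounding whitespace."""
    # Strip stray spaces and drop empty entries before joining.
    cleaned = [value.strip() for value in values if value.strip()]
    return "|".join(cleaned)


source_filter = {"source_id": tag_or("test:twtr10k ", " test:tesla10k")}
print(source_filter)
```

The resulting dictionary can be passed as the "filter" value of a query.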

In [20]:
query = {
    "query": "How does Tesla deliver cars?",
    "filter": {"source_id": "test:twtr10k|test:tesla10k"},
    "top_k": 3
}

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        "queries": [query]
    }
)
response.raise_for_status()

response.json()

{'results': [{'query': 'How does Tesla deliver cars?',
   'results': [{'id': 'tsla',
     'text': 'Because we do not have independent dealer networks, we are responsible for delivering                all of our vehicles to our customers.',
     'metadata': {'source': 'file',
      'source_id': 'test:tesla10k',
      'url': 'https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/tsla-20211231.htm',
      'created_at': '1640908800',
      'author': 'Elvis Tusk Jr.',
      'document_id': 'tsla'},
     'embedding': None,
     'score': 0.185401830213},
    {'id': 'twtr',
     'text': 'Postponements, suspensions or cancellations of major events, such as sporting events                 and music festivals, may lead to people perceiving the content on Twitter as less                 relevant or useful or of lower quality, which could negatively affect mDAU growth,                 or may reduce monetization opportunities in connection with such events.',
     'metadata': {'source': 

In this example we use a wildcard to filter by prefix. There are three documents matching this filter so three results will be printed.

Gotcha: Only prefix filtering is supported for Redis TAG fields, i.e. "test*" is valid, whereas "te\*t\*" is not.
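
The prefix rule can be enforced on the client before a query is sent. The check below is a sketch under the assumption that a valid pattern is either a plain value or a value with exactly one trailing `*`; `is_prefix_pattern` is a hypothetical name, and rejecting a bare `*` is a choice made for this example rather than something the source specifies.

```python
def is_prefix_pattern(pattern):
    """Return True for a plain value or a value with one trailing "*"."""
    star_count = pattern.count("*")
    if star_count == 0:
        return True  # plain tag value, no wildcard
    # Exactly one "*", at the end, with at least one character before it.
    return star_count == 1 and pattern.endswith("*") and len(pattern) > 1


for candidate in ("test*", "test:*", "te*t*"):
    print(candidate, is_prefix_pattern(candidate))
```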

In [21]:
query = {
    "query": "I want information related to car dealerships.",
    "filter": {"source_id": "test:*"},
    "top_k": 3
}

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        "queries": [query]
    }
)
response.raise_for_status()

response.json()

{'results': [{'query': 'I want information related to car dealerships.',
   'results': [{'id': 'tsla',
     'text': 'Because we do not have independent dealer networks, we are responsible for delivering                all of our vehicles to our customers.',
     'metadata': {'source': 'file',
      'source_id': 'test:tesla10k',
      'url': 'https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/tsla-20211231.htm',
      'created_at': '1640908800',
      'author': 'Elvis Tusk Jr.',
      'document_id': 'tsla'},
     'embedding': None,
     'score': 0.204279193893},
    {'id': 'twtr',
     'text': 'Postponements, suspensions or cancellations of major events, such as sporting events                 and music festivals, may lead to people perceiving the content on Twitter as less                 relevant or useful or of lower quality, which could negatively affect mDAU growth,                 or may reduce monetization opportunities in connection with such events.',
     'meta

In the remaining examples we filter by the "author" field. The author field is a TextField, so we have more options for filtering;
see [here](https://redis.io/docs/stack/search/reference/query_syntax/) for a complete set of examples.

We can select a specific author; here we expect only a single result.

In [22]:
query = {
    "query": "I want information related to car dealerships.",
    "filter": {"source_id": "test:*", "author": "Vape Jordan"},
    "top_k": 3
}

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        "queries": [query]
    }
)
response.raise_for_status()

response.json()

{'results': [{'query': 'I want information related to car dealerships.',
   'results': [{'id': 'xom',
     'text': 'All practical and economically-viable energy sources will need to be pursued to continue                meeting global energy demand, recognizing the scale and variety of worldwide energy needs                as well as the importance of expanding access to modern energy to promote better standards                of living for billions of people.',
     'metadata': {'source': 'file',
      'source_id': 'test:xom10k',
      'url': 'https://www.sec.gov/Archives/edgar/data/34088/000003408821000012/xom-20201231.htm',
      'created_at': '1609372800',
      'author': 'Vape Jordan',
      'document_id': 'xom'},
     'embedding': None,
     'score': 0.305264299269}]}]}
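
Note that "created_at" comes back as a Unix epoch string ('1609372800') rather than the "2020-12-31" string that was upserted. A minimal sketch for converting it back to a date follows, assuming the value is seconds since the epoch interpreted in UTC (which matches the dates upserted above); `epoch_to_date` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def epoch_to_date(created_at):
    """Convert an epoch-seconds string to an ISO date string (UTC)."""
    moment = datetime.fromtimestamp(int(created_at), tz=timezone.utc)
    return moment.date().isoformat()


print(epoch_to_date("1609372800"))  # 2020-12-31
print(epoch_to_date("1640908800"))  # 2021-12-31
```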

Here we use the negation "-" to select all documents except those published by an author called Elvis:

In [23]:
query = {
    "query": "I want information related to car dealerships.",
    "filter": {"source_id": "test:*", "author": "-Elvis"},
    "top_k": 3
}

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        "queries": [query]
    }
)
response.raise_for_status()

response.json()

{'results': [{'query': 'I want information related to car dealerships.',
   'results': [{'id': 'xom',
     'text': 'All practical and economically-viable energy sources will need to be pursued to continue                meeting global energy demand, recognizing the scale and variety of worldwide energy needs                as well as the importance of expanding access to modern energy to promote better standards                of living for billions of people.',
     'metadata': {'source': 'file',
      'source_id': 'test:xom10k',
      'url': 'https://www.sec.gov/Archives/edgar/data/34088/000003408821000012/xom-20201231.htm',
      'created_at': '1609372800',
      'author': 'Vape Jordan',
      'document_id': 'xom'},
     'embedding': None,
     'score': 0.305264299269}]}]}
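
The full JSON response is verbose, so when comparing filters it can help to reduce it to a few columns. The sketch below assumes the response shape shown in the outputs above; `summarize` is a hypothetical helper and `sample` is an abbreviated copy of the preceding output.

```python
def summarize(response_json):
    """Return (id, author, score) tuples for every hit in a query response."""
    rows = []
    for query_result in response_json["results"]:
        for hit in query_result["results"]:
            rows.append((hit["id"], hit["metadata"]["author"], hit["score"]))
    return rows


# Abbreviated copy of the output above, keeping only the fields used here.
sample = {
    "results": [
        {
            "query": "I want information related to car dealerships.",
            "results": [
                {
                    "id": "xom",
                    "metadata": {"author": "Vape Jordan"},
                    "score": 0.305264299269,
                }
            ],
        }
    ]
}

print(summarize(sample))
```

With a live server, the same function can be applied to `response.json()`.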

In the last example we filter for two of the authors:

In [24]:
query = {
    "query": "I want information related to car dealerships.",
    "filter": {"source_id": "test:*", "author": "Elvis*Jr.|Vape"},
    "top_k": 3
}

response = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        "queries": [query]
    }
)
response.raise_for_status()

response.json()

{'results': [{'query': 'I want information related to car dealerships.',
   'results': [{'id': 'tsla',
     'text': 'Because we do not have independent dealer networks, we are responsible for delivering                all of our vehicles to our customers.',
     'metadata': {'source': 'file',
      'source_id': 'test:tesla10k',
      'url': 'https://www.sec.gov/Archives/edgar/data/1318605/000095017022000796/tsla-20211231.htm',
      'created_at': '1640908800',
      'author': 'Elvis Tusk Jr.',
      'document_id': 'tsla'},
     'embedding': None,
     'score': 0.204279193893},
    {'id': 'xom',
     'text': 'All practical and economically-viable energy sources will need to be pursued to continue                meeting global energy demand, recognizing the scale and variety of worldwide energy needs                as well as the importance of expanding access to modern energy to promote better standards                of living for billions of people.',
     'metadata': {'source': 'file