In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

%config IPCompleter.greedy=True

# Using the Pinecone Retrieval App

In this walkthrough we will see how to use the retrieval API with a Pinecone datastore for *semantic search / question-answering*.

Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere. The full instructions for doing this are found in the [project README]().

We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.

## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `retrieval-app` repository:

```
git clone git@github.com:openai/retrieval-app.git
```

3. Navigate to the app directory:

```
cd /path/to/retrieval-app
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Pinecone-specific environment variables:

* `DATASTORE`: set to `pinecone`.

* `PINECONE_API_KEY`: Set to your Pinecone API key. This requires a free Pinecone account and can be [found in the Pinecone console](https://app.pinecone.io/).

* `PINECONE_ENVIRONMENT`: Set to your Pinecone environment, which looks like `us-east1-gcp` or `us-west1-aws`. It can be found next to your API key in the [Pinecone console](https://app.pinecone.io/).

* `PINECONE_INDEX`: Set this to your chosen index name. Any name works, but we recommend something descriptive like `"openai-retrieval-app"`. *Note that index names may contain only alphanumeric characters and `"-"`, and can be at most 45 characters long.*

9. Run the app with:

```
poetry run start
```

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app has automatically connected to our index (specified by `PINECONE_INDEX`). If no index with that name existed beforehand, the app creates one for us.
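
Before starting the app, it can help to confirm that the variables from steps 7 and 8 are actually set. The sketch below is a hypothetical helper and not part of `retrieval-app`; the function names are our own, and the index-name pattern only encodes the rule stated above (alphanumeric characters and `-`, at most 45 characters).

```python
import re

# Variables named in steps 7 and 8 of the quickstart.
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "DATASTORE",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]

# Assumption: this encodes only the naming rule quoted in this walkthrough.
INDEX_NAME_PATTERN = re.compile(r"[A-Za-z0-9-]{1,45}")


def missing_vars(env):
    """Return the required variable names that are unset or empty in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def is_valid_index_name(name):
    """Check an index name against the rule described above."""
    return INDEX_NAME_PATTERN.fullmatch(name) is not None
```

Calling `missing_vars(os.environ)` in the shell where the app will run (after `import os`) lists anything still unset, and `is_valid_index_name` can be applied to the value of `PINECONE_INDEX`.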

Now we're ready to move on to populating our index with some data.

## Required Libraries

There are a few Python libraries we must `pip install` for this notebook to run:

In [1]:
!pip install -qU datasets pandas tqdm


## Preparing Data

In this example, we will use the **S**tanford **Qu**estion **A**nswering **D**ataset (SQuAD), which we download from Hugging Face Datasets.

In [2]:
from datasets import load_dataset

data = load_dataset("squad", split="train")
data

Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

We convert the dataset to a pandas DataFrame to make the preprocessing steps easier.

In [3]:
data = data.to_pandas()
data.head()

| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 1 | 5733be284776f4190066117f | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is in front of the Notre Dame Main Building? | {'text': ['a copper statue of Christ'], 'answe... |
| 2 | 5733be284776f41900661180 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | The Basilica of the Sacred heart at Notre Dame... | {'text': ['the Main Building'], 'answer_start'... |
| 3 | 5733be284776f41900661181 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What is the Grotto at Notre Dame? | {'text': ['a Marian place of prayer and reflec... |
| 4 | 5733be284776f4190066117e | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | What sits on top of the Main Building at Notre... | {'text': ['a golden statue of the Virgin Mary'... |


The dataset contains many duplicate `context` paragraphs because each `context` can have several relevant questions. We don't want these duplicates, so we keep one row per `context`:

In [4]:
data = data.drop_duplicates(subset=["context"])
print(len(data))
data.head()

18891


| | id | title | context | question | answers |
|---|---|---|---|---|---|
| 0 | 5733be284776f41900661182 | University_of_Notre_Dame | Architecturally, the school has a Catholic cha... | To whom did the Virgin Mary allegedly appear i... | {'text': ['Saint Bernadette Soubirous'], 'answ... |
| 5 | 5733bf84d058e614000b61be | University_of_Notre_Dame | As at most other universities, Notre Dame's st... | When did the Scholastic Magazine of Notre dame... | {'text': ['September 1876'], 'answer_start': [... |
| 10 | 5733bed24776f41900661188 | University_of_Notre_Dame | The university is the major seat of the Congre... | Where is the headquarters of the Congregation ... | {'text': ['Rome'], 'answer_start': [119]} |
| 15 | 5733a6424776f41900660f51 | University_of_Notre_Dame | The College of Engineering was established in ... | How many BS level degrees are offered in the C... | {'text': ['eight'], 'answer_start': [487]} |
| 20 | 5733a70c4776f41900660f64 | University_of_Notre_Dame | All of Notre Dame's undergraduate students are... | What entity provides help with the management ... | {'text': ['Learning Resource Center'], 'answer... |
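
By default, `drop_duplicates` keeps the first row it encounters for each distinct `context`, which is why the surviving index labels above are 0, 5, 10, and so on. As a rough illustration of that first-occurrence rule, here is a plain-Python sketch on made-up rows. The rows and the helper name are hypothetical, and this is not how pandas implements the operation.

```python
def first_per_context(rows):
    """Keep the first row for each distinct 'context', preserving order."""
    seen = set()
    kept = []
    for row in rows:
        if row["context"] not in seen:
            seen.add(row["context"])
            kept.append(row)
    return kept


# Made-up rows: two questions share paragraph A, one belongs to paragraph B.
rows = [
    {"id": "a1", "context": "paragraph A", "question": "first question about A"},
    {"id": "a2", "context": "paragraph A", "question": "second question about A"},
    {"id": "b1", "context": "paragraph B", "question": "first question about B"},
]
unique_rows = first_per_context(rows)
print([row["id"] for row in unique_rows])  # ['a1', 'b1']
```

One consequence is that the `question` column left after deduplication holds only the first question asked about each paragraph.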


The format required by the app's `/upsert` endpoint is a list of documents like:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional.
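
A quick pre-flight check can catch malformed documents before any request is sent. The following is a minimal sketch, not the app's own validation: `validate_documents` is a hypothetical helper, and the type checks on `"id"` and `"metadata"` are assumptions based on the example above.

```python
def validate_documents(documents):
    """Return a list of (position, problem) pairs; an empty list means no problems found."""
    problems = []
    for pos, doc in enumerate(documents):
        text = doc.get("text")
        # "text" is the only mandatory field.
        if not isinstance(text, str) or not text.strip():
            problems.append((pos, "missing or empty 'text'"))
        # Optional fields are checked only when present (assumed types).
        if "id" in doc and not isinstance(doc["id"], str):
            problems.append((pos, "'id' should be a string"))
        if "metadata" in doc and not isinstance(doc["metadata"], dict):
            problems.append((pos, "'metadata' should be an object"))
    return problems


example = [
    {"id": "abc", "text": "some important document text"},
    {"id": "123"},  # no "text" field
]
print(validate_documents(example))  # [(1, "missing or empty 'text'")]
```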

To create this format for our SQuAD data we do:

In [5]:
documents = [
    {
        'id': r['id'],
        'text': r['context'],
        'metadata': {
            'title': r['title']
        }
    } for r in data.to_dict(orient='records')
]
documents[:3]

[{'id': '5733be284776f41900661182',
  'text': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
  'metadata': {'title': 'University_of_Notre_Dame'}},
 {'id': '5733bf84d058e614000b61be',
  'text': "As at most other universities, Notre Dame's students run a number of news media outlets. The nine student-run outlets include three newspapers, both a ra

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [2]:
import os

# read the token from the environment; replace the placeholder if it is not set
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [3]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [14]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

batch_size = 100
endpoint_url = "http://localhost:8000"
s = requests.Session()

# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 189/189 [41:03<00:00, 13.04s/it]
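
The loop above never inspects `res`, so a response with an error status that is not in `status_forcelist` (such as a 4xx) would pass silently. A minimal sketch of one way to track this follows. `iter_batches` and `upsert_all` are hypothetical helpers, and `post` stands for any callable that accepts the same keyword arguments as `s.post`.

```python
def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def upsert_all(post, url, headers, documents, batch_size=100):
    """Post every batch and return the start offsets of batches that did not return 200."""
    failed = []
    offset = 0
    for batch in iter_batches(documents, batch_size):
        res = post(url, headers=headers, json={"documents": batch})
        if res.status_code != 200:
            failed.append(offset)
        offset += len(batch)
    return failed
```

With the session defined above, this would be called as `upsert_all(s.post, f"{endpoint_url}/upsert", headers, documents)`. For the 18,891 SQuAD contexts and a batch size of 100, `iter_batches` yields 189 batches, which matches the progress bar.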


In [111]:
!pip install markdownify

Collecting markdownify
  Downloading markdownify-0.11.6-py3-none-any.whl (16 kB)
Installing collected packages: markdownify
Successfully installed markdownify-0.11.6

In [48]:
# # Importing BeautifulSoup class from the bs4 module
# from bs4 import BeautifulSoup

# # Opening the html file
# HTMLFile = open("/Users/hminooei/Downloads/site/amazon-s3-connector/0.3.7/index.html")

# # Reading the file
# index = HTMLFile.read()

Option one for converting HTML to Markdown:

In [49]:
# import markdownify

# # convert html to markdown
# h = markdownify.markdownify(index, heading_style="ATX")

# # print markdown
# print(h)

Option two for converting HTML to Markdown:

In [51]:
# import html2text

# text = html2text.html2text(index)
# print(text)

Collect the paths of all HTML files under the site directory:

In [52]:
import os

site_path = '/Users/hminooei/Downloads/site'
html_files = []
for root, dirs, files in os.walk(site_path, topdown=False):
    for name in files:
        if name.endswith(".html"):
            # print(os.path.join(root, name))
            html_files.append(os.path.join(root, name))

In [53]:
import sys
sys.path.insert(0, os.path.abspath('..'))

Convert all the HTML files to Markdown:

In [54]:
import concurrent.futures
from pinecone.utils import html_to_md

md_folder = 'site-files'
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(html_to_md, html_files, [md_folder]*len(html_files))
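
`executor.map` returns a lazy iterator, and an exception raised inside a worker is only re-raised when the corresponding result is retrieved. Because the cell above never iterates over the results, a failed conversion would go unreported. The sketch below shows one way to surface failures per file. `convert` is a hypothetical stand-in for `html_to_md`, and a thread pool is used only to keep the example self-contained (`submit` and `as_completed` work the same way with `ProcessPoolExecutor`).

```python
import concurrent.futures


def convert(path, out_dir):
    """Hypothetical stand-in for html_to_md; fails on one input on purpose."""
    if path.endswith("broken.html"):
        raise ValueError(f"cannot parse {path}")
    return f"{out_dir}/{path}.md"


def convert_all(paths, out_dir):
    """Run `convert` on every path and collect (path, error) pairs for failures."""
    failures = []
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # Map each future back to its input path so failures can be reported.
        futures = {executor.submit(convert, p, out_dir): p for p in paths}
        for future in concurrent.futures.as_completed(futures):
            error = future.exception()
            if error is not None:
                failures.append((futures[future], error))
    return failures


failures = convert_all(["a.html", "broken.html", "c.html"], "site-files")
print([path for path, _ in failures])  # ['broken.html']
```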

Remove the preamble and the ending from each Markdown file:

In [55]:
import os

site_path = 'site-files'
md_files = []
for root, dirs, files in os.walk(site_path, topdown=False):
    for name in files:
        if name.endswith(".md"):
            md_files.append(os.path.join(root, name))

In [56]:
import concurrent.futures
from pinecone.utils import truncate_mulesoft_site_md_pages

md_folder = 'site-files'
with concurrent.futures.ProcessPoolExecutor() as executor:
    executor.map(truncate_mulesoft_site_md_pages, md_files)

In [57]:
# Due to the chatgpt bug, changing the file extensions to txt
import os

site_path = 'site-files'
for root, dirs, files in os.walk(site_path, topdown=False):
    for name in files:
        if name.endswith(".md"):
            pre, ext = os.path.splitext(os.path.join(root, name))
            os.rename(os.path.join(root, name), pre[:-4] + "txt") # pre[:-4] removes the html from the end
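
The slice `pre[:-4]` relies on every Markdown file name ending in `html.md`; a file without that ending would lose four unrelated characters. A slightly more defensive sketch, assuming the same naming scheme, is shown below. The helper name `md_to_txt_path` is hypothetical, and `str.removesuffix` needs Python 3.9 or later.

```python
import os


def md_to_txt_path(path):
    """Map '<name>.html.md' (or '<name>.md') to '<name>.txt'."""
    stem, ext = os.path.splitext(path)
    if ext != ".md":
        raise ValueError(f"not a Markdown file: {path}")
    # Drop a trailing '.html' only if it is actually present.
    stem = stem.removesuffix(".html")
    return stem + ".txt"


print(md_to_txt_path("site-files/page.html.md"))  # site-files/page.txt
print(md_to_txt_path("site-files/notes.md"))      # site-files/notes.txt
```

The result can then be passed to `os.rename` as the destination path, in place of `pre[:-4] + "txt"`.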

In [68]:
import os

site_path = 'site-files'
txt_files = []
for root, dirs, files in os.walk(site_path, topdown=False):
    for name in files:
        if name.endswith(".txt"):
            txt_files.append(os.path.join(root, name))

In [70]:
txt_files = [p for p in txt_files if ".ipynb_checkpoints" not in p]
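
The same `os.walk` loop appears three times above with only the extension changed. One way to consolidate it is sketched below; `find_files` is a hypothetical helper, and it skips checkpoint folders by pruning directories during the walk rather than by filtering the resulting paths as the cell above does.

```python
import os


def find_files(root, suffix, exclude=(".ipynb_checkpoints",)):
    """Return sorted paths under `root` ending in `suffix`, skipping excluded directories."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them
        # (this only works with the default topdown=True).
        dirnames[:] = [d for d in dirnames if d not in exclude]
        for name in filenames:
            if name.endswith(suffix):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

With this helper, the earlier cells reduce to calls such as `find_files(site_path, ".html")` and `find_files("site-files", ".txt")`.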

Sending documents to the doc-store:

In [72]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

endpoint_url = "http://localhost:8000"
s = requests.Session()

# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
s.mount('http://', HTTPAdapter(max_retries=retries))

In [80]:
# # Given the inconsistency of the mimetype, although the mimetype of the files are now 
# # text/plain but apparently chatgpt's code sees them as unauthorized type.
# for p in tqdm(txt_files[:2]):
#     # make post request that allows up to 5 retries
#     # mimetype, _ = mimetypes.guess_type(p)
#     # print(mimetype)   
#     with open(p, 'rb') as f:
#         res = s.post(
#             f"{endpoint_url}/upsert-file",
#             headers=headers,
#             files={'file':f}
#         ) 
#         print(res)

In [100]:
documents = []

for p in tqdm(txt_files):
    with open(p, 'r') as f:
        text = f.read()
        documents.append(
            {
                'id': p,
                'text': text,
            }
        )

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 13517/13517 [00:01<00:00, 7289.37it/s]


In [103]:
batch_size = 100

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 271/271 [1:23:22<00:00, 18.46s/it]


In [4]:
queries = [{'query': "What's dataweave?"},
           {'query': "How to install s3 connector?"}, 
           {'query': "how to create api groups?"},
           {'query': "mapObject example in dataweave"},
           {'query': "DataWeave example that sorts an array by a field"}
          ]

In [6]:
from tqdm.auto import tqdm
import requests
from requests.adapters import HTTPAdapter, Retry

endpoint_url = "http://localhost:8000"
s = requests.Session()

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res


<Response [200]>

In [7]:
results = res.json()['results']

In [8]:
for query_result in results:
    print('Query: ', query_result['query'])
    # show at most the top three hits; slicing avoids an IndexError when fewer are returned
    for hit in query_result['results'][:3]:
        print("id: ", hit['metadata']['document_id'])
        print("score: ", hit['score'])
        print("answer: ", hit['text'])
        print("----\n")

Query:  What's dataweave?
id:  site-files/_Users_hminooei_Downloads_site_dataweave_1.1_dataweave-reference-documentation.txt
score:  0.894097149
answer:  The DataWeave Language is a powerful template engine that allows you to transform data to and from any kind of format (XML, CSV, JSON, Pojos, Maps, etc).  Let us start off with some examples to demonstrate the prowess of Dataweave as a data transformation tool.  ## Basic Example  This example shows a simple mapping from JSON to XML  Input                {       "title": "Java 8 in Action: Lambdas, Streams, and functional-style programming",       "author": "Mario Fusco",       "year": 2014     }  Transform                %dw 1.0     %output application/xml     ---     {       order: {         type: "Book",         title: payload.title,         details: "By $(payload.author) ($(payload.year))"       }     }  Output                <?xml version='1.0' encoding='UTF-8'?>     <order>
----

id:  site-files/_Users_hminooei_Downloads_site_dat
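
The response nests one list of hits per query, each hit carrying `text`, `score`, and a `metadata.document_id`. As a sketch of how that structure can be flattened for inspection, the hypothetical helper `top_hits` below collects at most `k` hits per query. The sample payload is made up and only mirrors the fields accessed in the cell above.

```python
def top_hits(response_json, k=3):
    """Return {query: [(document_id, score, text), ...]} with at most k hits each."""
    summary = {}
    for item in response_json["results"]:
        hits = []
        # Slicing is safe even when fewer than k hits are returned.
        for hit in item["results"][:k]:
            doc_id = hit.get("metadata", {}).get("document_id")
            hits.append((doc_id, round(hit["score"], 2), hit["text"]))
        summary[item["query"]] = hits
    return summary


# Hypothetical response with the same shape as res.json() above.
sample = {
    "results": [
        {
            "query": "example question",
            "results": [
                {
                    "text": "example passage",
                    "score": 0.8941,
                    "metadata": {"document_id": "doc-1"},
                },
            ],
        },
        {"query": "question with no hits", "results": []},
    ]
}
print(top_hits(sample))
```

Queries that return no hits map to an empty list instead of raising an error.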

### Making Queries

To query the datastore all we need to do is pass one or more queries to the `/query` endpoint. We can take a few questions from SQuAD:

In [15]:
# format the questions into the structure needed by the /query endpoint
queries = [{'query': question} for question in data['question'].tolist()]
len(queries)

18891

We will use just the first *three* questions:

In [16]:
queries[:3]

[{'query': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'},
 {'query': 'When did the Scholastic Magazine of Notre dame begin publishing?'},
 {'query': 'Where is the headquarters of the Congregation of the Holy Cross?'}]

In [20]:
res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries[:3]
    }
)
res

<Response [200]>

Now we can loop through the responses and see the results returned for each query:

In [21]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
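    # one "score: text" line per hit, framed by separator rules
    hits = "\n".join(f"{score}: {answer}" for answer, score in zip(answers, scores))
    print("-"*70 + "\n" + query + "\n\n" + hits + "\n" + "-"*70 + "\n\n")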

----------------------------------------------------------------------
To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?


----------------------------------------------------------------------


----------------------------------------------------------------------
When did the Scholastic Magazine of Notre dame begin publishing?


----------------------------------------------------------------------


----------------------------------------------------------------------
Where is the headquarters of the Congregation of the Holy Cross?


----------------------------------------------------------------------




The top results are all relevant, as we would have hoped. With that, we've finished: the retrieval app API can be shut down and, to save resources, the Pinecone index can be deleted in the [Pinecone console](https://app.pinecone.io/).