[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/generation/chatgpt/plugins/langchain-docs-plugin.ipynb)

# Building a LangChain Docs Plugin for ChatGPT

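In this walkthrough we set up a ChatGPT retrieval plugin: we split documents into chunks, index them through the retrieval API, and query the result.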

Before running this notebook you should have already initialized the retrieval API and have it running locally or elsewhere (for example, on DigitalOcean). More detailed instructions for the setup and deployment can be [found in the video here](https://youtu.be/hpePPqKxNq8).

We will summarize the instructions (specific to the Pinecone datastore) before moving on to the walkthrough.

## App Quickstart

1. Install Python 3.10 if not already installed.

2. Clone the `chatgpt-retrieval-plugin` repository:

```
git clone git@github.com:openai/chatgpt-retrieval-plugin.git
```

_**Note**: To see how we set up the *hosted app* on DigitalOcean, [refer to this video](https://youtu.be/hpePPqKxNq8). Otherwise, continue to set up the app locally by following the remaining steps._

3. Navigate to the app directory:

```
cd /path/to/chatgpt-retrieval-plugin
```

4. Install `poetry`:

```
pip install poetry
```

5. Create a new virtual environment:

```
poetry env use python3.10
```

6. Install the `retrieval-app` dependencies:

```
poetry install
```

7. Set app environment variables:

* `BEARER_TOKEN`: Secret token used by the app to authorize incoming requests. We will later include this in the request `headers`. The token can be generated however you prefer, such as using [jwt.io](https://jwt.io/).

* `OPENAI_API_KEY`: The OpenAI API key used for generating embeddings with the `text-embedding-ada-002` model. [Get an API key here](https://platform.openai.com/account/api-keys)!

8. Set Pinecone-specific environment variables:

* `DATASTORE`: set to `pinecone`.

* `PINECONE_API_KEY`: Set to your Pinecone API key. This requires a free Pinecone account and can be [found in the Pinecone console](https://app.pinecone.io/).

* `PINECONE_ENVIRONMENT`: Set to your Pinecone environment, which looks like `us-east1-gcp` or `us-west1-aws` and can be found next to your API key in the [Pinecone console](https://app.pinecone.io/).

* `PINECONE_INDEX`: Set this to your chosen index name. Any name works; we recommend something descriptive like `"openai-retrieval-app"`. *Note that index names are restricted to alphanumeric characters and `"-"`, and can contain a maximum of 45 characters.*

9. Run the app with:

```
poetry run start
```
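
Before starting the app it can help to confirm that the environment variables listed above are set. The following is a minimal sketch and not part of the retrieval app itself: `REQUIRED_VARS`, `missing_env_vars`, and `is_valid_index_name` are hypothetical helper names, and the index-name check only encodes the rule quoted above (alphanumeric characters and `-`, at most 45 characters).

```python
import os
import re

# variable names taken from the quickstart steps above
REQUIRED_VARS = [
    "BEARER_TOKEN",
    "OPENAI_API_KEY",
    "DATASTORE",
    "PINECONE_API_KEY",
    "PINECONE_ENVIRONMENT",
    "PINECONE_INDEX",
]


def missing_env_vars(env=None):
    """Return the required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def is_valid_index_name(name):
    """Check a name against the rule quoted above (an assumption, not verified against Pinecone)."""
    return re.fullmatch(r"[A-Za-z0-9-]{1,45}", name) is not None


missing = missing_env_vars()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary to `missing_env_vars` makes it easy to check the logic without touching the real environment.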

If running the app locally you should see something like:

```
INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete.
```

In that case, the app has automatically connected to our index (specified by `PINECONE_INDEX`). If no index with that name existed beforehand, the app creates one for us.

Now we're ready to move on to populating our index with some data.

## Required Libraries

There are a few Python libraries we must `pip install` for this notebook to run:

In [9]:
!pip install -qU langchain tiktoken tqdm

## Preparing Data

In this example, we will download the LangChain docs from [langchain.readthedocs.io/](https://langchain.readthedocs.io/latest/en/). We get all `.html` files located on the site like so:

!wget -r -A.html -P rtdocs https://python.langchain.com/en/latest/

This downloads all HTML into the `rtdocs` directory. Now we can use LangChain itself to process these docs. We do this using the `ReadTheDocsLoader` like so:

In [15]:
!pip install pypdf


from langchain.document_loaders import ReadTheDocsLoader

loader = ReadTheDocsLoader('rtdocs')
docs = loader.load()
len(docs)

In [9]:
from pypdf import PdfReader

# In this run we read a local PDF instead, building one dict per page
file_source = "/rtdocs/BatteryGuide_AG_US-LowRes.pdf"
docs = []

# the path is resolved relative to the parent of the working directory
reader = PdfReader(".." + file_source)
for page in reader.pages:
    docs.append({
        "page_content": page.extract_text(),
        "lookup_str": "",
        "metadata": {"source": file_source},
    })


This leaves us with `32` processed pages. Let's take a look at the format each one contains:

In [10]:
print(len(docs))

32


In [36]:
print(docs[0]["page_content"])

Battery Testing Guide


The plaintext page content is accessed through the `page_content` key, as above. Printing the whole record shows every field:

In [37]:
print(docs[0])

{'page_content': 'Battery Testing Guide', 'lookup_str': '', 'metadata': {'source': '/rtdocs/BatteryGuide_AG_US-LowRes.pdf'}}


In [39]:
# list the attributes and methods available on a doc (a plain dict)
print(docs[0].__dir__())

['__new__', '__repr__', '__hash__', '__getattribute__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__iter__', '__init__', '__or__', '__ror__', '__ior__', '__len__', '__getitem__', '__setitem__', '__delitem__', '__contains__', '__sizeof__', 'get', 'setdefault', 'pop', 'popitem', 'keys', 'items', 'values', 'update', 'fromkeys', 'clear', 'copy', '__reversed__', '__class_getitem__', '__doc__', '__str__', '__setattr__', '__delattr__', '__reduce_ex__', '__reduce__', '__getstate__', '__subclasshook__', '__init_subclass__', '__format__', '__dir__', '__class__']


In [41]:
for key, value in docs[6].items():
    print(key, ' : ', value)

page_content  :  Battery types
There are several main types of battery technologies with 
subtypes:
	■Lead-acid
	■Flooded (wet): lead-calcium, lead-antimony
	■Valve Regulated Lead-acid, VRLA (sealed):  
lead-calcium, lead-antimony-selenium
	■ Absorbed Glass Matte (AGM)
	■Gel
	■Flat plate
	■Tubular plate
	■Nickel-cadmium
	■Flooded
	■Sealed
	■Pocket plate
	■Flat plate
Lead-acid overview
The basic lead-acid chemical reaction in a sulphuric acid 
electrolyte, where the sulphate of the acid is part of the 
reaction, is:
PbO2 + Pb + 2H2SO4  2PbSO4 + 2H2 + 1⁄2 O2
The acid is depleted upon discharge and regenerated upon 
recharge. Hydrogen and oxygen form during discharge and 
float charging (because float charging is counteracting self-
discharge). In flooded batteries, they escape and water must 
be periodically added. In valve-regulated, lead-acid (sealed) 
batteries, the hydrogen and oxygen gases recombine to form 
water. Additionally, in VRLA batteries, the acid is immobilized 
by an abso

The source of each document is stored under `metadata["source"]`.

This looks good. We also need to consider the length of each page relative to the number of tokens that will reasonably fit within the context window of a ChatGPT model. We will use `gpt-3.5-turbo` as the assumed model.

### Chunking the Text

At the time of writing, `gpt-3.5-turbo` supports a context window of 4096 tokens. That means input tokens plus generated (completion) tokens cannot total more than 4096 without hitting an error.

So we must stay below this limit. We assume a very safe margin of ~2000 tokens for the input prompt into `gpt-3.5-turbo`, leaving ~2000 tokens for conversation history and completion.

With this ~2000 token limit we may want to include *five* snippets of relevant information, meaning each snippet can be no more than **400** tokens long.
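
The arithmetic behind the 400-token limit can be written out as a quick check. This is only a sketch of the budget described above; the constant and function names are hypothetical.

```python
CONTEXT_WINDOW = 4096  # gpt-3.5-turbo limit quoted above
PROMPT_BUDGET = 2000   # tokens reserved for the input prompt
NUM_SNIPPETS = 5       # snippets we want to fit into the prompt


def max_tokens_per_snippet(prompt_budget, num_snippets):
    """Split the prompt budget evenly across the snippets."""
    return prompt_budget // num_snippets


chunk_limit = max_tokens_per_snippet(PROMPT_BUDGET, NUM_SNIPPETS)
remaining = CONTEXT_WINDOW - PROMPT_BUDGET  # left for history and completion
print(chunk_limit, remaining)  # 400 2096
```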

To create these snippets we use the `RecursiveCharacterTextSplitter` from LangChain. To measure the length of snippets we also need a *length function*. This is a function that consumes text, counts the number of tokens within the text (after tokenization using the `gpt-3.5-turbo` tokenizer), and returns that number. We define it like so:

In [42]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

Note that for the tokenizer we defined the encoder as `"cl100k_base"`. This is the specific tiktoken encoder used by `gpt-3.5-turbo`. Other encoders exist; at the time of writing they are summarized as:

| Encoder | Models |
| --- | --- |
| `cl100k_base` | `gpt-4`, `gpt-3.5-turbo`, `text-embedding-ada-002` |
| `p50k_base` | `text-davinci-003`, `code-davinci-002`, `code-cushman-002` |
| `r50k_base` | `text-davinci-001`, `davinci`, `text-similarity-davinci-001` |
| `gpt2` | `gpt2` |

You can find these details in the [Tiktoken `model.py` script](https://github.com/openai/tiktoken/blob/main/tiktoken/model.py), or using `tiktoken.encoding_for_model`:

In [43]:
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

With the length function defined we can initialize our `RecursiveCharacterTextSplitter` object like so:

In [44]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    separators=['\n\n', '\n', ' ', '']
)
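
To build intuition for `chunk_size` and `chunk_overlap`, here is a deliberately simplified sketch. It is not LangChain's algorithm (which recursively tries the listed separators); it only slides a fixed window over a list of tokens so that consecutive chunks share `chunk_overlap` items. The function name `window_chunks` is hypothetical.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Split `tokens` into windows of `chunk_size` that overlap by `chunk_overlap`."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in window_chunks(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk starts with the last word of the previous one, which is the role the 20-token overlap plays in the splitter above.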

Then we split the text for a document like so:

In [49]:
chunks = text_splitter.split_text(docs[5]["page_content"])
len(chunks)

3

In [50]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1])

(369, 367)

For `docs[5]` we created `3` chunks; the first two have token lengths `369` and `367`.

This is for a single document; we need to do this over all of our documents. While we iterate through the docs to create these chunks, we will reformat them into the format required by our API app. This format needs to align with the `/upsert` endpoint's required document format, which looks like this:

```json
[
    {
        "id": "abc",
        "text": "some important document text",
        "metadata": {
            "field1": "optional metadata goes here",
            "field2": 54
        }
    },
    {
        "id": "123",
        "text": "some other important text",
        "metadata": {
            "field1": "another metadata",
            "field2": 71,
            "field3": "not all metadatas need the same structure"
        }
    }
    ...
]
```

Every document *must* have a `"text"` field. The `"id"` and `"metadata"` fields are optional, however, we will include both.
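
A small check against this format can catch malformed records before they are sent. The sketch below is an assumption-laden illustration: `validate_documents` is a hypothetical helper, and the type expectations for `"id"` (string) and `"metadata"` (dictionary) are inferred from the example above rather than taken from the API's schema.

```python
def validate_documents(documents):
    """Return a list of problems found; an empty list means the batch looks well-formed."""
    problems = []
    for n, doc in enumerate(documents):
        # "text" is the only required field
        if not isinstance(doc.get("text"), str) or not doc["text"]:
            problems.append(f"document {n}: missing or empty 'text'")
        # the optional fields are only checked when present
        if "id" in doc and not isinstance(doc["id"], str):
            problems.append(f"document {n}: 'id' should be a string")
        if "metadata" in doc and not isinstance(doc["metadata"], dict):
            problems.append(f"document {n}: 'metadata' should be a dictionary")
    return problems


sample = [
    {"id": "abc", "text": "some important document text", "metadata": {"field1": "x"}},
    {"text": ""},
    {"id": 123, "text": "some other important text"},
]
for problem in validate_documents(sample):
    print(problem)
```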

The `"id"` will be created from the URL of the text plus its chunk number.

In [52]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[5]["metadata"]['source']
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

/rtdocs/BatteryGuide_AG_US-LowRes.pdf
bc3a09561505


We then use the `uid` alongside the chunk number and the actual `url` to create the format needed:

In [53]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'metadata': {'url': url}
    } for i, chunk in enumerate(chunks)
]
data

[{'id': 'bc3a09561505-0',
  'text': 'Why backup \nbatteries are \nneeded\nBatteries are used to ensure that critical electrical equipment \nis always on. There are so many places where batteries are \nused – it is nearly impossible to list them all. Some of the \napplications for batteries include:\n\t■Electric generating stations and substations for protection and \ncontrol of switches and relays\n\t■Telephone systems to support phone service, especially emergency \nservices\n\t■Industrial applications for protection and control \n\t■Back up of computers, especially financial data  \nand information \n\t■“Less critical” business information systems\nWithout battery back-up hospitals would have to close their \ndoors until power is restored. But even so, there are patients \non life support systems that require absolute 100% electric \npower. For those patients, as it was once said, “failure is not \nan option.” \nJust look around to see how much electricity we use and then \nto see ho

Now we repeat the same logic across our full dataset:

In [58]:
from tqdm.auto import tqdm

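documents = []

for doc in tqdm(docs):
    url = doc["metadata"]['source']
    # note: `m` is the same hash object created earlier, so each update()
    # appends to everything hashed so far; the uid therefore changes on
    # every iteration even when pages share one source
    m.update(url.encode('utf-8'))
    uid = m.hexdigest()[:12]
    chunks = text_splitter.split_text(doc["page_content"])
    for i, chunk in enumerate(chunks):
        documents.append({
            'id': f'{uid}-{i}',
            'text': chunk,
            'metadata': {'url': url}
        })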

len(documents)

100%|██████████| 32/32 [00:00<00:00, 309.11it/s]


76

We're now left with `76` documents in the format required by our API.
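
Because the loop above reuses a single hash object, each `update` call extends the previously hashed input, so the resulting IDs depend on how many times earlier cells were run. A minimal alternative sketch, assuming we want IDs that are reproducible across runs: hash the source from scratch each time and add the page number so that pages sharing one source stay distinct. `make_chunk_id` and the three-part ID layout are hypothetical choices, not something the retrieval API requires.

```python
import hashlib


def make_chunk_id(source, page_number, chunk_number):
    """Build an ID from a fresh hash of the source plus page and chunk numbers."""
    uid = hashlib.md5(source.encode("utf-8")).hexdigest()[:12]
    return f"{uid}-{page_number}-{chunk_number}"


source = "/rtdocs/BatteryGuide_AG_US-LowRes.pdf"
ids = [make_chunk_id(source, page, chunk) for page in range(3) for chunk in range(2)]
print(ids[0])
print(len(ids) == len(set(ids)))  # True when every ID is unique
```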

---

#### (Optional) Load Dataset from Hugging Face

Rather than running the above scripts to build the dataset, you can load a prepared version from Hugging Face Datasets like so:

!pip install -qU datasets

from datasets import load_dataset

documents = load_dataset('jamescalam/langchain-docs', split='train')
documents

In [59]:
documents[0]

{'id': 'ee08476fe8ff-0',
 'text': 'Battery Testing Guide',
 'metadata': {'url': '/rtdocs/BatteryGuide_AG_US-LowRes.pdf'}}

This needs to be reformatted into the format we need for the API:

documents = [{
    'id': doc['id'],
    'text': doc['text'],
    'metadata': {'url': doc['url']}
} for doc in documents]

documents[0]

---

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [1]:
import os

# fall back to a placeholder; never commit a real token to a notebook
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "BEARER_TOKEN_HERE"

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [2]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.
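
The batching itself is simple index arithmetic and can be checked in isolation. This is a minimal sketch with a hypothetical helper name, `batch_ranges`; it makes no requests and only shows which slices of `documents` each batch would cover.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs covering `total` items in order."""
    for start in range(0, total, batch_size):
        yield start, min(total, start + batch_size)


print(list(batch_ranges(76, 100)))   # [(0, 76)]
print(list(batch_ranges(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

With the 76 documents built earlier and a `batch_size` of 100, everything fits into a single request.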

In [4]:
import requests
from requests.adapters import HTTPAdapter, Retry
from tqdm.auto import tqdm

batch_size = 100
endpoint_url = "https://lobster-app-hfwib.ondigitalocean.app"
s = requests.Session()

# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504]
)
# mount the adapter for both schemes so the strategy also covers https URLs
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # send one batch of documents to the /upsert endpoint
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )


With that, our document records have all been indexed and we can move on to querying.

### Making Queries

To query the datastore, all we need to do is pass one or more queries to the `/query` endpoint. We can ask a question related to the indexed documents and see whether relevant info is returned:

In [7]:
queries = [
    {'query': "How to maintain the battery?"},
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [200]>

In [8]:
print(res.json())

{'results': [{'query': 'How to maintain the battery?', 'results': [{'id': '5a0c779fbae8-0_1', 'text': 'This might also be a risky approach. Batteries can fail earlier than  expected.  Also it is waste of capital if the batteries are replaced  earlier than needed.  Properly maintained batteries can live longer  than the predetermined replacement time. \t■A serious maintenance and testing program in order to ensure the  batteries are in good condition, prolong their life and to find the  optimal time for replacement .   A maintenance program including inspection, impedance and  capacity testing is the way to track the battery’s state of health.   Degradation and faults will be found before they become serious  and surprises can be avoided.  Maintenance costs are higher but  this is what you have to pay for to get the reliability you want for  your back-up system.', 'metadata': {'source': None, 'source_id': None, 'url': '/rtdocs/BatteryGuide_AG_US-LowRes.pdf', 'created_at': None, 'author'

Now we can loop through the responses and see the results returned for each query:

In [70]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
How to maintain the battery?

0.85: This might also be a risky approach. Batteries can fail earlier than  expected.  Also it is waste of capital if the batteries are replaced  earlier than needed.  Properly maintained batteries can live longer  than the predetermined replacement time. 	■A serious maintenance and testing program in order to ensure the  batteries are in good condition, prolong their life and to find the  optimal time for replacement .   A maintenance program including inspection, impedance and  capacity testing is the way to track the battery’s state of health.   Degradation and faults will be found before they become serious  and surprises can be avoided.  Maintenance costs are higher but  this is what you have to pay for to get the reliability you want for  your back-up system.
0.85: Battery Testing Guide
0.85: Battery Testing Guide
-------------------------------------------------------------------
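
The output above contains the same short text twice. A minimal sketch, assuming the response shape shown above, that keeps only the first occurrence of each text and optionally drops low-scoring results; `unique_results` and `min_score` are hypothetical names, and the sample data is hand-written for illustration.

```python
def unique_results(query_result, min_score=0.0):
    """Keep the first occurrence of each text whose score is at least `min_score`."""
    seen = set()
    kept = []
    for result in query_result["results"]:
        text = result["text"]
        if result["score"] < min_score or text in seen:
            continue
        seen.add(text)
        kept.append(result)
    return kept


# hand-written sample shaped like one entry of res.json()["results"]
sample = {
    "query": "How to maintain the battery?",
    "results": [
        {"id": "a-0", "text": "A maintenance program ...", "score": 0.85},
        {"id": "b-0", "text": "Battery Testing Guide", "score": 0.85},
        {"id": "c-0", "text": "Battery Testing Guide", "score": 0.85},
    ],
}
print([r["id"] for r in unique_results(sample)])  # ['a-0', 'b-0']
```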

The top results are all relevant, as we would have hoped. With that, we've finished. The retrieval app API can be shut down and, to save resources, the Pinecone index can be deleted within the [Pinecone console](https://app.pinecone.io/).