In [143]:
%pip install -qU langchain tiktoken tqdm

Note: you may need to restart the kernel to use updated packages.


In [144]:
%pip install pypdf

Note: you may need to restart the kernel to use updated packages.


In [145]:
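from pypdf import PdfReader
import os

pdfFilesPath = "../rtdocs/"
# only pick up PDF files; other files in the folder would make PdfReader fail
pdfList = [f for f in os.listdir(pdfFilesPath) if f.lower().endswith(".pdf")]
docs = []  # one list of page records per PDF file

for pdf in pdfList:
  fileSource = os.path.join(pdfFilesPath, pdf)
  reader = PdfReader(fileSource)
  doc = []
  for pdfPage in reader.pages:
    # one record per page: extracted text plus the file it came from
    text = {
      "page_content": pdfPage.extract_text(),
      "lookup_str": "",
      "metadata": { "source": fileSource },
    }
    doc.append(text)
  docs.append(doc)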


In [146]:
print("Number of files:" + str(len(docs)))
for x in range(0, len(docs)):
  print("   Number of pdf" + str(x) + " pages:" + str(len(docs[x])))

os.listdir('../rtdocs/')


Number of files:3
   Number of pdf0 pages:32
   Number of pdf1 pages:8
   Number of pdf2 pages:7


['BatteryGuide_AG_US-LowRes.pdf',
 'Torkel900_BR_EN_V03.pdf',
 'TORKEL900_DS_en.pdf']

In [147]:
for doc in docs:
    for key, value in doc[0].items():
        print(key, ' : ', value)

page_content  :  Battery Testing Guide
lookup_str  :  
metadata  :  {'source': '../rtdocs/BatteryGuide_AG_US-LowRes.pdf'}
page_content  :  TORKEL 900
The capacity test is the most important of all the 
battery tests
  Testing batteries, including during operation
 Dynamic discharge technology — full power 
at all voltages
automatic shut-off, for example in the event 
of blocked air flow
 Load resistors can be expanded with TXL load 
units
 Real-time monitoring during the test
 Reverse polarity protection
 Automatic quick log
lookup_str  :  
metadata  :  {'source': '../rtdocs/Torkel900_BR_EN_V03.pdf'}
page_content  :   Batteries can be tested in service
 Dynamic discharge technology – full power at all 
voltages
 Safety in all details, e.g. detection of blocked 
airflow
 Real time monitoring during test
 Easy report function and calibration
 Easily expandable for larger battery banks using 
TXL extra load units
 Battery cell monitor control integrated in the 
system
 Can b

In [148]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

In [149]:
tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [150]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

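text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # target maximum chunk length, measured in tokens by tiktoken_len
    chunk_overlap=50,  # number of tokens overlap between chunks
    length_function=tiktoken_len,
    # '\n\n' is the only separator, so text without a blank line is never
    # split further and a chunk can end up longer than chunk_size
    separators=['\n\n']
)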

In [151]:
chunks = text_splitter.split_text(docs[0][5]["page_content"])
len(chunks)

1

In [152]:
chunks[0]

'Why backup \nbatteries are \nneeded\nBatteries are used to ensure that critical electrical equipment \nis always on. There are so many places where batteries are \nused – it is nearly impossible to list them all. Some of the \napplications for batteries include:\n\t■Electric generating stations and substations for protection and \ncontrol of switches and relays\n\t■Telephone systems to support phone service, especially emergency \nservices\n\t■Industrial applications for protection and control \n\t■Back up of computers, especially financial data  \nand information \n\t■“Less critical” business information systems\nWithout battery back-up hospitals would have to close their \ndoors until power is restored. But even so, there are patients \non life support systems that require absolute 100% electric \npower. For those patients, as it was once said, “failure is not \nan option.” \nJust look around to see how much electricity we use and then \nto see how important batteries have become in

In [154]:
tiktoken_len(chunks[0])

850
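
The chunk above is 850 tokens even though `chunk_size=500`, presumably because `'\n\n'` is the only separator and this page has no blank line to split on. The following is a minimal sketch of the fallback idea in plain Python, not LangChain's actual implementation: try the coarsest separator first and fall through to finer ones only for pieces that are still too long. The names `split_with_fallback` and `count_words` are hypothetical, and a word count stands in for the token count.

```python
def split_with_fallback(text, max_len, separators, length=len):
    """Split text into pieces of at most max_len where possible,
    trying the separators in the order given."""
    if length(text) <= max_len:
        return [text]
    if not separators:
        # No separator left to try: the oversized piece is returned unchanged.
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if not part:
            continue
        # Only parts that are still too long are split with finer separators.
        pieces.extend(split_with_fallback(part, max_len, finer, length))
    return pieces


def count_words(text):
    # Stand-in length function (assumption): words instead of tokens.
    return len(text.split())


page = "line one has four\nline two has four\nline three has four"
coarse_only = split_with_fallback(page, 5, ["\n\n"], count_words)
with_fallback = split_with_fallback(page, 5, ["\n\n", "\n"], count_words)
print(len(coarse_only), len(with_fallback))
```

With only `"\n\n"` the 12-word page stays in one oversized piece; adding `"\n"` as a second separator yields three pieces that each fit the limit. Unlike LangChain's splitter, this sketch does not merge small pieces back together or add overlap.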

In [155]:
import hashlib
m = hashlib.md5()  # this will convert URL into unique ID

url = docs[0][5]["metadata"]['source']
print(url)

# convert URL to unique ID
m.update(url.encode('utf-8'))
uid = m.hexdigest()[:12]
print(uid)

../rtdocs/BatteryGuide_AG_US-LowRes.pdf
ff7e1fa1fb8d
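
A `hashlib` object accumulates everything passed to `update()`, so an ID depends only on one URL if a fresh hash object is created for each URL. A minimal sketch of such a helper follows; the name `make_chunk_id` and the extra page-number component are assumptions for illustration, not part of the original notebook.

```python
import hashlib


def make_chunk_id(source, page_no, chunk_no):
    """Build a deterministic ID from the source path, page and chunk index."""
    # A fresh md5 object per call: the digest depends on `source` alone.
    uid = hashlib.md5(source.encode("utf-8")).hexdigest()[:12]
    # page_no (assumption) keeps IDs distinct across pages of the same file.
    return f"{uid}-{page_no}-{chunk_no}"


first = make_chunk_id("../rtdocs/BatteryGuide_AG_US-LowRes.pdf", 5, 0)
again = make_chunk_id("../rtdocs/BatteryGuide_AG_US-LowRes.pdf", 5, 0)
other_page = make_chunk_id("../rtdocs/BatteryGuide_AG_US-LowRes.pdf", 6, 0)
print(first == again, first == other_page)
```

The same inputs always give the same ID, which makes repeated upserts overwrite a record instead of duplicating it.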


In [156]:
data = [
    {
        'id': f'{uid}-{i}',
        'text': chunk,
        'metadata': {'url': url}
    } for i, chunk in enumerate(chunks)
]
data

[{'id': 'ff7e1fa1fb8d-0',
  'text': 'Why backup \nbatteries are \nneeded\nBatteries are used to ensure that critical electrical equipment \nis always on. There are so many places where batteries are \nused – it is nearly impossible to list them all. Some of the \napplications for batteries include:\n\t■Electric generating stations and substations for protection and \ncontrol of switches and relays\n\t■Telephone systems to support phone service, especially emergency \nservices\n\t■Industrial applications for protection and control \n\t■Back up of computers, especially financial data  \nand information \n\t■“Less critical” business information systems\nWithout battery back-up hospitals would have to close their \ndoors until power is restored. But even so, there are patients \non life support systems that require absolute 100% electric \npower. For those patients, as it was once said, “failure is not \nan option.” \nJust look around to see how much electricity we use and then \nto see ho

Now we repeat the same logic across our full dataset:

In [157]:
from tqdm.auto import tqdm

documents = []

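for docItem in docs:
    for page_no, doc in enumerate(tqdm(docItem)):
        url = doc["metadata"]['source']
        # hash with a fresh md5 object: reusing `m` would mix in every URL
        # hashed so far, so the ID would no longer depend on this URL alone
        uid = hashlib.md5(url.encode('utf-8')).hexdigest()[:12]
        chunks = text_splitter.split_text(doc["page_content"])
        for i, chunk in enumerate(chunks):
            # skip very short chunks (under 50 characters)
            if len(chunk) < 50:
                continue
            documents.append({
                # the page number keeps IDs unique across pages of one PDF
                'id': f'{uid}-{page_no}-{i}',
                'text': chunk,
                'metadata': {'url': url}
            })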

len(documents)

100%|██████████| 32/32 [00:00<00:00, 1777.77it/s]
100%|██████████| 8/8 [00:00<00:00, 1602.64it/s]
100%|██████████| 7/7 [00:00<00:00, 1732.98it/s]


44

In [158]:
with open('output.md', 'w', encoding="utf-8") as f:
    for chunk in documents:
        f.write('CHUNK:' + str(chunk['text']))
        f.write('\n\n\n')

### Indexing the Docs

We're now ready to begin indexing (or *upserting*) our `documents`. To make these requests to the retrieval app API, we will need to provide authorization in the form of the `BEARER_TOKEN` we set earlier. We do this below:

In [159]:
import os

# never commit a real token to the notebook; set BEARER_TOKEN in the environment
BEARER_TOKEN = os.environ.get("BEARER_TOKEN") or "<your-bearer-token>"

Use the `BEARER_TOKEN` to create our authorization `headers`:

In [160]:
headers = {
    "Authorization": f"Bearer {BEARER_TOKEN}"
}
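
One way to avoid sending a placeholder or empty token by accident is to fail early when the variable is missing. A small sketch, assuming a hypothetical helper named `build_headers` that takes the environment mapping as a parameter so it can be tried without touching the real environment:

```python
import os


def build_headers(env=os.environ):
    """Return the Authorization header, or raise if no token is configured."""
    token = env.get("BEARER_TOKEN")
    if not token:
        raise RuntimeError("BEARER_TOKEN is not set")
    return {"Authorization": f"Bearer {token}"}


# "example-token" is a made-up value used only for this demonstration.
print(build_headers({"BEARER_TOKEN": "example-token"}))
```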

We'll perform the upsert in batches of `batch_size`. Make sure that the `endpoint_url` variable is set to the correct location for your running *retrieval-app* API.

In [161]:
import requests
from requests.adapters import HTTPAdapter, Retry
from tqdm.auto import tqdm

batch_size = 100
endpoint_url = "https://lobster-app-hfwib.ondigitalocean.app"
s = requests.Session()

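# we set up a retry strategy to retry on 5xx errors
retries = Retry(
    total=5,  # number of retries before raising error
    backoff_factor=0.1,
    status_forcelist=[500, 502, 503, 504],
    # POST is not retried on status codes by default (needs urllib3 >= 1.26)
    allowed_methods=["POST"]
)
# endpoint_url is https, so the adapter must be mounted for https as well
s.mount('http://', HTTPAdapter(max_retries=retries))
s.mount('https://', HTTPAdapter(max_retries=retries))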

for i in tqdm(range(0, len(documents), batch_size)):
    i_end = min(len(documents), i+batch_size)
    # make post request that allows up to 5 retries
    res = s.post(
        f"{endpoint_url}/upsert",
        headers=headers,
        json={
            "documents": documents[i:i_end]
        }
    )

100%|██████████| 1/1 [00:14<00:00, 14.71s/it]
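
The progress bar shows a single iteration because all 44 records fit into one batch of 100. The slicing logic can be sketched on its own with a small generator; the name `batched` is an assumption here, and plain integers stand in for the document records.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        # Slicing past the end is safe: the last batch is simply shorter.
        yield items[start:start + batch_size]


print([len(b) for b in batched(list(range(44)), 100)])
print([len(b) for b in batched(list(range(250)), 100)])
```

The first call gives one batch of 44 items, matching the run above; the second shows how a longer list would be cut into batches of 100, 100 and 50.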


With that, all of our PDF chunk records have been indexed and we can move on to querying.

### Making Queries

To query the datastore, all we need to do is pass one or more queries to the `/query` endpoint. We can ask a question and check whether the datastore returns relevant passages from the indexed PDFs:

In [165]:
queries = [
    {'query': "Are you smarter or dumper?"},
]

res = requests.post(
    f"{endpoint_url}/query",
    headers=headers,
    json={
        'queries': queries
    }
)
res

<Response [200]>

In [166]:
for key, value in res.json().items():
  print(key, ' : ', value)

results  :  [{'query': 'Are you smarter or dumper?', 'results': [{'id': '648452fb90e4-0_2', 'text': 'Features and benefits', 'metadata': {'source': None, 'source_id': None, 'url': '../rtdocs/Torkel900_BR_EN_V03.pdf', 'created_at': None, 'author': None, 'document_id': '648452fb90e4-0'}, 'embedding': None, 'score': 0.731910408}, {'id': 'ae8ef241bff1-0_5', 'text': 'Yes it is possible to do. Megger has test equipment that  automatically senses and regulate the discharge current even  when the batteries are connected to the ordinary load. Most  users choose to make a 80% discharge test when on-line in  order to still have some backup time at the end of the test. Battery technology summary As you can see, there is a lot to a battery. It is a complex electro- chemical device. There is much more information available that  goes further into the details of Tafel curves and depolariza - tion but that is beyond this scope. Essentially, batteries need  maintenance and care to get the most of them 

Now we can loop through the responses and see the results returned for each query:

In [167]:
for query_result in res.json()['results']:
    query = query_result['query']
    answers = []
    scores = []
    for result in query_result['results']:
        answers.append(result['text'])
        scores.append(round(result['score'], 2))
    print("-"*70+"\n"+query+"\n\n"+"\n".join([f"{s}: {a}" for a, s in zip(answers, scores)])+"\n"+"-"*70+"\n\n")

----------------------------------------------------------------------
Are you smarter or dumper?

0.73: Features and benefits
0.72: Yes it is possible to do. Megger has test equipment that  automatically senses and regulate the discharge current even  when the batteries are connected to the ordinary load. Most  users choose to make a 80% discharge test when on-line in  order to still have some backup time at the end of the test. Battery technology summary As you can see, there is a lot to a battery. It is a complex electro- chemical device. There is much more information available that  goes further into the details of Tafel curves and depolariza - tion but that is beyond this scope. Essentially, batteries need  maintenance and care to get the most of them which is the  main reason people spend so much on batteries – to support far  more expensive equipment and to ensure continuous revenue  streams.  23
0.72: reviewed and then filed away, probably not reviewed again  until a problem a
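
Since every result carries a similarity score, weak matches can be dropped before the text is used further. A minimal sketch, assuming the response shape printed above (`results` → `query` / `results` → `text` / `score`); the helper name `filter_by_score`, the sample data, and the threshold of 0.8 are made up for illustration.

```python
def filter_by_score(response, min_score):
    """Keep only results whose score is at least min_score, per query."""
    kept = {}
    for query_result in response["results"]:
        kept[query_result["query"]] = [
            (round(r["score"], 2), r["text"])
            for r in query_result["results"]
            if r["score"] >= min_score
        ]
    return kept


# Hypothetical response used only to exercise the function.
sample = {
    "results": [
        {
            "query": "example question",
            "results": [
                {"text": "strong match", "score": 0.91},
                {"text": "weak match", "score": 0.72},
            ],
        }
    ]
}
print(filter_by_score(sample, 0.8))
```

A suitable threshold depends on the embedding model and the data, so it is best chosen by inspecting scores for questions whose correct passages are known.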

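The returned passages all come from the indexed PDFs, but the scores shown are nearly identical (0.72 to 0.73) and the test query is not about batteries, so this run mainly confirms that the pipeline works end to end; a question grounded in the documents is a better check of relevance. With that we've finished. The retrieval app API can be shut down, and to save resources the Pinecone index can be deleted within the [Pinecone console](https://app.pinecone.io/).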