## RAG with llmsherpa (nlm-ingestor), lancedb and OpenAI

- [llmsherpa](https://github.com/nlmatics/llmsherpa) provides simple and effective chunking.
- [nlm-ingestor](https://github.com/nlmatics/nlm-ingestor) provides a self-hosted llmsherpa server that runs in Docker.
- OpenAI is used for embeddings and chat completion (either can be swapped for an alternative).
- This notebook shows how to load chunks created by llmsherpa into lancedb. It also uses lancedb's built-in hybrid search.


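Pull and run the Docker image of nlm-ingestor. The `-p 5010:5001` flag maps the container's port 5001 to port 5010 on the host, which is the port used in the API URL later in this notebook.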

In [None]:
!docker pull ghcr.io/nlmatics/nlm-ingestor:latest
!docker run -p 5010:5001 ghcr.io/nlmatics/nlm-ingestor:latest
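The container can take a little while to start accepting connections. As a minimal sketch (the helper name `wait_for_port` is hypothetical and not part of nlm-ingestor or llmsherpa), the following polls the mapped port with the standard library before any PDF is sent to the server. It only confirms that something is listening on the port, not that the parser itself is fully ready.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # A successful connect means a process is listening on the port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            # Connection refused or timed out: wait briefly and retry.
            time.sleep(0.5)
    return False
```

Assuming the port mapping above, `wait_for_port("localhost", 5010)` could be called before reading the first PDF.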

Install the required libraries, or install them from requirements.txt in a venv.

In [3]:
%pip install llmsherpa lancedb pandas openai

Collecting llmsherpa
  Using cached llmsherpa-0.1.4-py3-none-any.whl.metadata (14 kB)
Collecting lancedb
  Using cached lancedb-0.6.13-cp38-abi3-macosx_10_15_x86_64.whl.metadata (4.7 kB)
Collecting pandas
  Using cached pandas-2.2.2-cp311-cp311-macosx_10_9_x86_64.whl.metadata (19 kB)
Collecting openai
  Downloading openai-1.30.1-py3-none-any.whl.metadata (21 kB)
Collecting urllib3 (from llmsherpa)
  Using cached urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting deprecation (from lancedb)
  Using cached deprecation-2.1.0-py2.py3-none-any.whl.metadata (4.6 kB)
Collecting pylance==0.10.12 (from lancedb)
  Using cached pylance-0.10.12-cp38-abi3-macosx_10_15_x86_64.whl.metadata (7.3 kB)
Collecting ratelimiter~=1.0 (from lancedb)
  Using cached ratelimiter-1.2.0.post0-py3-none-any.whl.metadata (4.0 kB)
Collecting requests>=2.31.0 (from lancedb)
  Downloading requests-2.32.2-py3-none-any.whl.metadata (4.6 kB)
Collecting retry>=0.9.2 (from lancedb)
  Using cached retry-0.9.2-py2.py3-

In [10]:
%pip install tantivy # for lancedb full text search

Collecting tantivy
  Using cached tantivy-0.22.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl.metadata (1.2 kB)
Using cached tantivy-0.22.0-cp311-cp311-macosx_10_9_x86_64.macosx_11_0_arm64.macosx_10_9_universal2.whl (6.2 MB)
Installing collected packages: tantivy
Successfully installed tantivy-0.22.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
from llmsherpa.readers import LayoutPDFReader

import os
import lancedb
from lancedb.embeddings import get_registry
from lancedb.pydantic import LanceModel, Vector

import pandas as pd

from openai import OpenAI

os.environ["OPENAI_API_KEY"] = "<add your OpenAI API key>"

In [2]:
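def get_embedding(text):
    """Return the OpenAI embedding vector for a single piece of text."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    text = text.replace("\n", " ")
    return client.embeddings.create(input=[text], model="text-embedding-3-large").data[0].embedding

def text_sanitize(text):
    """Replace empty or missing chunk text with a single space so it can still be embedded."""
    if not text:
        return " "
    return text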
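`get_embedding` sends one request per chunk. The OpenAI embeddings endpoint also accepts a list of inputs, so a batched variant can cut the number of requests. The sketch below is an assumption-laden alternative, not the notebook's method: `batched`, `get_embeddings`, and the `batch_size` default are hypothetical names and values, and the client is passed in as an argument so the helper can be exercised without network access.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def get_embeddings(
    client,
    texts: Sequence[str],
    model: str = "text-embedding-3-large",
    batch_size: int = 64,  # hypothetical default; tune to your rate limits
) -> List[List[float]]:
    """Embed many texts with one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        # Same cleanup as the single-text helpers: flatten newlines and
        # replace empty strings with a single space.
        cleaned = [t.replace("\n", " ") or " " for t in batch]
        response = client.embeddings.create(input=cleaned, model=model)
        # Each returned item carries the index of its input within the batch.
        for item in sorted(response.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

With the imports from the earlier cell, this could be called as `get_embeddings(OpenAI(), [chunk.to_text() for chunk in doc.chunks()])`.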

In [3]:
llmsherpa_api_url = "http://localhost:5010/api/parseDocument?renderFormat=all"
# Hosted llmsherpa API, if not self-hosting:
# llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"

pdf_url = "https://arxiv.org/pdf/1910.13461.pdf"  # a local file path also works, e.g. /home/downloads/xyz.pdf
pdf_reader = LayoutPDFReader(llmsherpa_api_url)
doc = pdf_reader.read_pdf(pdf_url)

In [4]:
# Build one record per chunk: raw text, text with context, and the embedding
data = []
for chunk in doc.chunks():
    chunk_text = text_sanitize(chunk.to_text())
    data.append({
        "text": chunk_text,
        # to_context_text() also includes the chunk's parent section headings
        "context": chunk.to_context_text(),
        "vector": get_embedding(chunk_text),
    })

In [5]:
db = lancedb.connect("./db")
table = db.create_table("chunks", data=data, mode="overwrite")

In [6]:
table.create_fts_index("text")  # Build a full-text search (FTS) index on the text column; needed before hybrid search and requires tantivy

In [7]:
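# No embedding function is registered on this table, so a plain string query
# appears to run as a full-text search (note the score column in the result).
table.search(
    "What are the objectives of pre-training?"
).limit(5).to_pandas()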

| | text | context | vector | score |
|---|---|---|---|---|
| 0 | Table 1: Comparison of pre-training objectives... | BART: Denoising Sequence-to-Sequence Pre-train... | [0.026339235, 0.009338064, -0.02416826, 0.0143... | 12.379797 |
| 1 | Token masking is crucial Pre-training objectiv... | BART: Denoising Sequence-to-Sequence Pre-train... | [-0.024983516, -0.032836277, -0.010570487, 0.0... | 7.768296 |
| 2 | Left-to-right pre-training improves generation... | BART: Denoising Sequence-to-Sequence Pre-train... | [-0.014563556, -0.008050171, -0.013285295, 0.0... | 7.281164 |
| 3 | While many pre-training objectives have been p... | BART: Denoising Sequence-to-Sequence Pre-train... | [-0.0027317216, 0.017693207, -0.01675744, 0.01... | 7.237228 |
| 4 | Performance of pre-training methods varies sig... | BART: Denoising Sequence-to-Sequence Pre-train... | [-0.0013883491, 0.0031317624, -0.018050993, -0... | 6.996423 |
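Hybrid search combines a keyword (full-text) ranking with a vector-similarity ranking of the same chunks. One common way to merge two such rankings is reciprocal rank fusion (RRF). The sketch below is purely illustrative and is not lancedb's implementation, whose reranking may differ; the function name and the constant `k = 60` are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]], k: int = 60
) -> List[Hashable]:
    """Merge several ranked lists of ids into one list, best first."""
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Items near the top of any list contribute the most.
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical chunk ids ranked by full-text search and by vector search
fts_ranking = ["a", "b", "c"]
vector_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([fts_ranking, vector_ranking])
print(fused)  # ['b', 'a', 'd', 'c']
```

Chunks that appear high in both rankings ("b" and "a" here) rise to the top, while chunks found by only one method still make it into the merged list.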


In [8]:
def create_prompt_with_context(query, context):
    """Build a prompt from the query and the retrieved chunks (a DataFrame with a text column)."""
    limit = 3750  # maximum number of context characters to include
    separator = "\n\n---\n\n"

    prompt_start = (
        "Answer the question based on the context below.\n\n"+
        "Context:\n"
    )
    prompt_end = (
        f"\n\nQuestion: {query}\nAnswer:"
    )
    # Append chunks in rank order until adding another would exceed the limit.
    # This also handles a single-row or empty result, where the original loop
    # never assigned a prompt.
    selected = []
    for text in context.text:
        if len(separator.join(selected + [text])) > limit:
            break
        selected.append(text)
    return prompt_start + separator.join(selected) + prompt_end

In [9]:
def complete(prompt):
    """Send the prompt to the chat model and return the text of its reply."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content

In [10]:
# Baseline: ask the model directly, without any retrieved context
query = "What are the objectives of pre-training?"
complete(query)

"Pre-training, particularly in the context of machine learning and artificial intelligence, serves several important objectives that generally aim to improve the performance, adaptability, and efficiency of models. The objectives of pre-training include:\n\n1. **Learning General Features:** Pre-training allows models to learn general features and patterns from large datasets. This is especially useful in domains where labeled data is scarce, but unlabeled data is abundant. By pre-training on a large, unlabeled dataset, the model can learn a good representation of the input space which can be useful across a variety of tasks.\n\n2. **Transfer Learning:** Pre-trained models can be fine-tuned on specific, smaller tasks. This approach, known as transfer learning, is beneficial when the target task has limited training data. The pre-trained model, having learned a broad representation, needs only minimal adjustment to adapt to the specifics of the new task.\n\n3. **Improving Performance:** 