# Introduction

This is a simple example of how to use InterSystems Vector Search as a vector store for embeddings extracted from PDF files containing InterSystems documentation, and how to use an OpenAI LLM to perform retrieval-augmented generation (RAG) over them.

This example is adapted from the work presented in [Alex Woodhead's article](https://community.intersystems.com/post/langchain-intersystems-pdf-documentation), with some modifications.

# Installation

In [1]:
!pip install -q openai langchain wget tiktoken pypdf langchain-iris python-dotenv



# Prepare the docs

In [2]:
import glob
import zipfile

import wget

# Download the documentation PDF archive
url = 'https://docs.intersystems.com/irisforhealth20231/csp/docbook/pdfs.zip'
wget.download(url)

# Extract the docs
with zipfile.ZipFile('pdfs.zip', 'r') as zip_ref:
    zip_ref.extractall('.')

# Get a list of the extracted files
pdfFiles = glob.glob("./pdfs/pdfs/*")

100% [........................................................................] 66448881 / 66448881
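The glob in the cell above picks up every file in the extracted directory, whatever its type. A minimal sketch of a helper that extracts the archive into a dedicated directory and returns only the PDF files, using just the standard library. The function name `extract_pdfs` and the `docs` target directory are hypothetical and not part of the original notebook:

```python
import zipfile
from pathlib import Path


def extract_pdfs(zip_path, dest):
    """Extract zip_path into dest and return the sorted paths of the PDF files found."""
    dest = Path(dest)
    with zipfile.ZipFile(zip_path, "r") as zip_ref:
        zip_ref.extractall(dest)
    # Search recursively, so the nesting level inside the archive does not matter,
    # and compare the suffix case-insensitively to catch names such as "GUIDE.PDF"
    return sorted(str(p) for p in dest.rglob("*") if p.suffix.lower() == ".pdf")
```

Under these assumptions, `pdfFiles = extract_pdfs('pdfs.zip', 'docs')` could stand in for the extraction and glob steps.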

In [3]:
import getpass
import os
from dotenv import load_dotenv

load_dotenv(override=True)

if not os.environ.get("OPENAI_API_KEY"): 
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Load docs into Vector Store

In [4]:
from langchain.embeddings import OpenAIEmbeddings
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.prompts.prompt import PromptTemplate
from langchain import OpenAI
from langchain.chains import LLMChain
from langchain_iris import IRISVector

# Extract text from the PDF files
documentsAll = []
pdfFiles = glob.glob("./pdfs/pdfs/*")
for file_name in pdfFiles:
    loader = PyPDFLoader(file_name)
    pages = loader.load_and_split()
    # Strip non-breaking spaces (\xa0) left over from PDF extraction
    for page in pages:
        page.page_content = page.page_content.replace('\xa0', '')
    documents = CharacterTextSplitter().split_documents(pages)
    # Skip the first two chunks of each file, which hold the cover pages
    documentsAll.extend(documents[2:])

# IRIS connection
username = 'demo'
password = 'demo' 
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '1972' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

# Under the hood, this becomes a SQL table.
COLLECTION_NAME = "intersystems_doc"

# This takes a couple of minutes to complete
embeddings = OpenAIEmbeddings()
db = IRISVector.from_documents(
    embedding=embeddings,
    documents=documentsAll,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)
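The connection string above works because the demo credentials contain no special characters. A hedged sketch of building the same URL with percent-encoded credentials, assuming the string is parsed as a SQLAlchemy-style URL (where characters such as `@` or `/` in a password must be escaped). The helper name `build_iris_url` is hypothetical:

```python
from urllib.parse import quote_plus


def build_iris_url(username, password, hostname="localhost", port=1972, namespace="USER"):
    """Build an iris:// URL, percent-encoding the credentials."""
    return (
        f"iris://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/{namespace}"
    )


# Plain credentials pass through unchanged
print(build_iris_url("demo", "demo"))
# → iris://demo:demo@localhost:1972/USER

# Special characters in the password are escaped
print(build_iris_url("demo", "p@ss/word"))
# → iris://demo:p%40ss%2Fword@localhost:1972/USER
```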


# Prep the search template

In [5]:
_GetDocWords_TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""

PROMPT = PromptTemplate(
    input_variables=["docs", "question"], template=_GetDocWords_TEMPLATE
)

llm = OpenAI(temperature=0, verbose=True)

chain = LLMChain(llm=llm, prompt=PROMPT)
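The template is filled by plain placeholder substitution, so whatever is passed as `docs` is converted to a string before it reaches the model. The sketch below reproduces that step with `str.format` and joins only the text of each document. The `Doc` class is a hypothetical stand-in for a LangChain `Document`, and `render_prompt` is not part of the original notebook:

```python
from dataclasses import dataclass

TEMPLATE = """Answer the Question: {question}

By considering the following documents:
{docs}
"""


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document; only page_content is modelled."""
    page_content: str


def render_prompt(question, docs):
    # Join only the text of each document, separated by blank lines
    docs_text = "\n\n".join(d.page_content for d in docs)
    return TEMPLATE.format(question=question, docs=docs_text)


print(render_prompt("Example question?", [Doc("First passage."), Doc("Second passage.")]))
```

Passing the joined text instead of the raw list would presumably keep metadata and object representations out of the prompt, which also saves tokens.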


# Let's talk with the documentation

In [6]:
query = "What is Adaptive Analytics?"
docs = db.similarity_search(query)
# Use only the first two documents to reduce the number of tokens sent to OpenAI
answer = chain.run(docs=docs[:2],question=query)
print(f"> {query}\n>{answer}")


> What is Adaptive Analytics?
>
Adaptive Analytics is an optional extension for InterSystems IRIS that provides a business-oriented, virtual data model layer between InterSystems IRIS and popular Business Intelligence (BI) and Artificial Intelligence (AI) client tools. It allows for the development of a centralized common data model, making it easier for enterprises to provide their end users with a consistent view of business metrics and data characterization. Adaptive Analytics includes features such as a modeler for making data accessible to business users, publication of data model changes as virtual cubes, and unified access to an online analytical processing (OLAP) model. It is powered by AtScale and requires the installation or upgrade of an instance of AtScale to integrate with InterSystems IRIS.
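Conceptually, `db.similarity_search(query)` embeds the query and returns the stored chunks whose vectors are closest to it. A toy sketch of that ranking in pure Python, assuming cosine similarity as the metric and a default of four results. In the notebook the search runs inside IRIS; the function names and the two-dimensional vectors here are illustrative only:

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=4):
    """Return the indices of the k vectors most similar to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 2-dimensional "embeddings"; real OpenAI embeddings have far more dimensions
toy_vectors = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], toy_vectors, k=2))
# → [0, 1]
```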


In [7]:
query = "Can I use pre build InterSystems BI cubes with Adaptive Analytics?"
docs = db.similarity_search(query)
# Use only the first two documents to reduce the number of tokens sent to OpenAI
answer = chain.run(docs=docs[:2],question=query)
print(f"> {query}\n>{answer}")

> Can I use pre build InterSystems BI cubes with Adaptive Analytics?
>
Yes, you can use pre-built InterSystems BI cubes with Adaptive Analytics. The process for importing these cubes is outlined in the document "Integrating Adaptive Analytics with InterSystems Reports". However, there may be some incompatibilities between the structure of the exported .json file and Adaptive Analytics' expected data model, so it is recommended to manually review and adjust any calculated measures and drill-throughs to ensure compliance. Additionally, cubes based on data connectors and cube relationships may not export in this process.


# References

https://community.intersystems.com/post/langchain-intersystems-pdf-documentation

https://github.com/intersystems-community/hackathon-2024/blob/main/demo/langchain_demo.ipynb