## 3. Use Helper Libraries like LangChain

In section 1, we used a basic sentence transformer model and direct SQL access in InterSystems IRIS to load and vectorize data, then run vector searches on that data.

In this section, we will use LangChain to load, embed, and interact with our data. The section is written around OpenAI's *OpenAIEmbeddings* model; note that the loading block further down has the *OpenAIEmbeddings* line commented out and uses *FastEmbedEmbeddings* instead, so you can swap the two depending on which model you want to use. LangChain provides several advantages in building a RAG application, including streamlining the retrieval process, adding conversation history, and enabling guardrails to keep your application within its intended usage.

To use the *OpenAIEmbeddings* model, we will need an OpenAI API key. The following block loads environment variables, including the OpenAI API key, from a *.env* file. It imports *os* for access to environment variables, *getpass* for hidden password input, and *load_dotenv* to read the file.

For this workshop, InterSystems has provided a short-term OpenAI API key that is already configured in the environment variables. Run the block below to load these settings.

In [1]:
import getpass
import os
from dotenv import load_dotenv

load_dotenv(override=True)

True
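
If you run this notebook outside the workshop environment, the key may not be set in advance. Below is a minimal sketch of a fallback, assuming the key lives under the conventional *OPENAI_API_KEY* variable name. The helper *ensure_api_key* is hypothetical and not part of any library.

```python
import getpass
import os


def ensure_api_key(name="OPENAI_API_KEY", prompt=getpass.getpass):
    """Return the key from the environment, prompting for it only if it is missing."""
    key = os.environ.get(name)
    if not key:
        # getpass hides the typed value so the key is not echoed into the notebook
        key = prompt(f"Enter {name}: ")
        # store it so later code in this session can read it from the environment
        os.environ[name] = key
    return key
```

Calling *ensure_api_key()* after *load_dotenv()* would leave an already configured key untouched and only prompt when nothing was loaded.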


The next block imports the libraries used in the rest of this section: document classes and loaders for text and JSON data, text splitters, embedding models from OpenAI, Hugging Face, and FastEmbed, and the *IRISVector* store along with SQLAlchemy for connecting to InterSystems IRIS.

In [78]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.fastembed import FastEmbedEmbeddings

from langchain_iris import IRISVector
from sqlalchemy import create_engine, text


### Load data set with 100 patient case reports via LangChain
Next, we will load, split, and prepare to embed the text documents from the same data set of 100 case reports that we used with SQL in section 1.

The first step is to initialize a *JSONLoader* to load documents from a specified file. The argument *json_lines=True* below specifies that we are loading a JSON Lines file, a format in which each line is a complete JSON object. The argument *jq_schema='.note'* selects the *note* field of each object as the document text. JSON Lines is particularly useful for handling large datasets or streams of data, because it allows for reading, writing, and processing one line (one JSON object) at a time, rather than loading an entire file into memory at once.
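
To make the JSON Lines format concrete, here is a minimal standard-library sketch that reads one object per line and pulls out its *note* field. The two records are made up for illustration and only stand in for the workshop file. This is not how *JSONLoader* is implemented; it only shows the idea.

```python
import io
import json

# Two made-up records standing in for the workshop file; only the "note" field matters here.
sample = io.StringIO(
    '{"idx": 1, "note": "Patient A presented with knee pain."}\n'
    '{"idx": 2, "note": "Patient B presented with a cough."}\n'
)


def iter_notes(lines):
    """Yield the "note" field of each JSON object, one line at a time."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)["note"]


notes = list(iter_notes(sample))
print(notes)
```

Because *iter_notes* is a generator, only one line needs to be held in memory at a time, which is the property that makes the format convenient for large files.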

After loading the data, the text is split into smaller chunks to facilitate more efficient processing and embedding. Here, we use a chunk size of 2,500 characters with an overlap of 100 characters. Chunking the text helps in managing large documents by breaking them into smaller pieces that can each be embedded into vector format. The overlap helps preserve context that would otherwise be cut off at chunk boundaries, which improves the quality of the resulting embeddings.

Run the block of code below to load and split the documents. It uses *CONNECTION_STRING*, which is defined in the connection-string block further down, so run that block first if you have not already.
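
The effect of chunk size and overlap can be seen in a small sliding-window sketch. This is a simplification: *RecursiveCharacterTextSplitter* also tries to break at the listed separators, which this hypothetical *chunk_text* helper does not do.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so text near a boundary appears in both neighbors.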

In [87]:
## drop the table first if re-running this
engine = create_engine(CONNECTION_STRING)
with engine.connect() as conn:
    with conn.begin():
        # remove any table left over from a previous run
        sql = "DROP TABLE IF EXISTS case_reports"
        conn.execute(text(sql))

## load data from json lines file
loader = JSONLoader(
    file_path='./data/healthcare/augmented_notes_100.jsonl',
    jq_schema='.note',
    json_lines=True
)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

docs = text_splitter.split_documents(documents)
print(f"Got {len(docs)} chunks")

##text_splitter = CharacterTextSplitter(separator="", chunk_size=100, chunk_overlap=10)
##docs = text_splitter.split_documents(documents)

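# choose the embedding model; swap the two lines to use OpenAIEmbeddings (requires an OpenAI API key)
# embeddings = OpenAIEmbeddings()
embeddings = FastEmbedEmbeddings()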


Got 100 chunks


Run the following two blocks to create and print the connection string that will be used to connect to InterSystems IRIS. 

In [88]:
username = '_SYSTEM'
password = 'sys'
hostname = 'localhost'
port = 1972
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

In [89]:
print(CONNECTION_STRING)


iris://_SYSTEM:sys@localhost:1972/USER
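
The connection string above hardcodes the default credentials, which is fine for a local workshop instance. As a hedged sketch for other setups, the credentials could be read from environment variables and percent-encoded so that characters such as *@* or */* in a password do not break the URL. The variable names *IRIS_USERNAME* and *IRIS_PASSWORD* and the helper *build_connection_string* are assumptions made for this example.

```python
import os
from urllib.parse import quote_plus


def build_connection_string(username, password, hostname, port, namespace):
    """Assemble an iris:// URL, percent-encoding the credentials."""
    return (
        f"iris://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/{namespace}"
    )


# Hypothetical environment variable names; fall back to the workshop defaults.
conn_str = build_connection_string(
    username=os.environ.get("IRIS_USERNAME", "_SYSTEM"),
    password=os.environ.get("IRIS_PASSWORD", "sys"),
    hostname="localhost",
    port=1972,
    namespace="USER",
)
```

With the workshop defaults this produces the same string as the block above, since none of those characters need encoding.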


Next, let's initialize a vector store in InterSystems IRIS, which you will populate with the case report chunks that we have just prepared, along with their embeddings.

This setup is essential for applications involving search and retrieval of information where the semantic content of the documents is more important than their keyword content. The vector store uses embeddings to perform similarity searches, offering significant advantages over traditional search methods by understanding the context and meaning embedded within the text.

Run the block below to load your data and embeddings into InterSystems IRIS. This may take a few moments.

In [90]:
COLLECTION_NAME = "case_reports"
db = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

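Confirm the number of documents in your vector store by running the following block. The count should match the number of chunks printed when you split the documents. A different number, as in the saved output below, typically means the store was populated with different chunking settings or loaded more than once.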

In [81]:
print(f"Number of docs in vector store: {len(db.get()['ids'])}")

Number of docs in vector store: 192


### Try out vector search

Now that the text documents are loaded, embedded, and stored in the vector database, you can try running a vector search. In the code block below, we will use the search phrase "Have any children presented with knee injuries?" to retrieve similar vectors from our storage.

The second line in the block returns the documents along with their scores, which measure how far each document is from the query. Lower scores indicate greater relevance.

In [91]:
query = "Have any children presented with knee injuries?"
docs_with_score = db.similarity_search_with_score(query)

Run the following block to print the returned documents along with their scores.

In [92]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.308900874012489
At the time of diagnosis this patient was a 9 year-old female with a one year history of pain and swelling about her left knee. She had experienced a fall and related all symptoms to the fall. She was seen in her local emergency room by her family physician; there was no diagnosis or treatment. Approximately one month prior to her representation, she was struck in the left knee by a basketball and developed worsening pain. She was seen by an orthopedic surgeon (December 1999) and was noted to have a valgus posture of both lower extremities, exaggerated on the left by external rotation and she walked with a mild limp. The left knee had no effusion but was hypersensitive to light touch over the lateral aspect where there was soft tissue swelling just below the knee. There was no obvious mass in the area, although firm palpation was difficult because of patient discomfort. Plain film
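
To see why a lower score means a closer match, here is a small sketch that assumes the score is a cosine distance (1 minus cosine similarity). That is a common choice, but the exact metric used by the vector store is not stated in this section, so treat it as an assumption. The two-dimensional vectors are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
close = [0.9, 0.1]  # points in nearly the same direction as the query
far = [0.0, 1.0]    # orthogonal to the query

print(cosine_distance(query, close))
print(cosine_distance(query, far))
```

The vector pointing in nearly the same direction as the query gets a distance near 0, while the orthogonal one gets a distance of 1.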

Next, you will enter a search phrase of your choice and perform a similarity search on that phrase.

Set the *content* variable to a word or phrase of your choice. You can reference conditions or symptoms you saw in the case reports when initially browsing the data set in section 1, or choose any other phrase you wish. Then, run the block.

In [93]:
content="Upper respiratory illness"
docs_with_score = db.similarity_search_with_score(content)

Run the block below to print the returned documents with their scores. The first document in the list is the most similar result. Observe its score, which here is generally a number between 0 and 1. The closer to 0 it is, the more similar the document is to your search phrase.

In [94]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.359710919987393
The patient was a 60-year-old man who referred to our emergency department due to worsening dyspnea and hemoptysis since 2-3 days prior to admission. He reported to have dyspnea and hoarseness during the previous year. He had undergone a direct laryngoscopy which had revealed left vocal cord palsy and a chest computed tomography (CT) scan which had shown a mediastinal mass and with possibility of a malignancy process a direct needle biopsy was done which demonstrated inflammatory cells in the background of blood. His past medical history only included a mild stroke 6 years before without any sequel.
At the emergency department he had pulse rate of 86 beats/min, respiratory rate of 18/min and blood pressure of 135/80 mm Hg. He had a continuous murmur in left sternal and pulmonic area and decreased breathing sound in left hemithorax.
A chest x-ray was obtained which showed a large d
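
Since scores closer to 0 mean closer matches, one simple way to discard weak matches is to apply a cutoff. The sketch below uses plain *(text, score)* tuples with made-up values in place of real search results, and the threshold of 0.5 is arbitrary. The helper *filter_by_score* is hypothetical.

```python
def filter_by_score(results, max_score):
    """Keep (document, score) pairs scoring at or below max_score, best match first."""
    kept = [(doc, score) for doc, score in results if score <= max_score]
    return sorted(kept, key=lambda pair: pair[1])


# Made-up results for illustration only.
sample_results = [
    ("report about dyspnea", 0.36),
    ("report about knee pain", 0.71),
    ("report about cough", 0.42),
]

relevant = filter_by_score(sample_results, max_score=0.5)
print(relevant)
# [('report about dyspnea', 0.36), ('report about cough', 0.42)]
```

The same function could be applied to *docs_with_score*, since it is also a list of document and score pairs.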

### Prepare retrieval mechanism for RAG app
Next, we'll set up a retriever for our database. A retriever is an essential component in information retrieval systems, as it allows us to fetch relevant documents based on a query efficiently. By converting the database into a retriever, we enhance our ability to interact with the data, enabling more advanced search and retrieval operations.

The `as_retriever()` method transforms the database into a retriever object. This object can then be used to perform various retrieval tasks, making it a versatile tool for working with our embedded documents.

Run the block below to create the retriever and print it to confirm its setup.

In [95]:
retriever = db.as_retriever()
print(retriever)

tags=['IRISVector'] vectorstore=<langchain_iris.vectorstores.IRISVector object at 0x1347aeab0> search_kwargs={}
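
Conceptually, a retriever is a thin wrapper that holds a reference to the vector store plus its search settings, which matches the *vectorstore* and *search_kwargs* fields in the printed output. The toy classes below sketch that relationship. They are not LangChain's implementation, and unlike a real retriever they take a query vector directly instead of embedding a query string.

```python
import math


class ToyRetriever:
    """Holds a store and search settings, and exposes a single lookup method."""

    def __init__(self, store, k):
        self.store = store
        self.k = k

    def invoke(self, query_vector):
        return self.store.similarity_search(query_vector, k=self.k)


class ToyVectorStore:
    """In-memory store of (text, vector) pairs, ranked by Euclidean distance."""

    def __init__(self):
        self._items = []

    def add(self, text, vector):
        self._items.append((text, vector))

    def similarity_search(self, query_vector, k=4):
        ranked = sorted(self._items, key=lambda item: math.dist(query_vector, item[1]))
        return [text for text, _ in ranked[:k]]

    def as_retriever(self, k=4):
        return ToyRetriever(self, k)


store = ToyVectorStore()
store.add("knee injury report", [1.0, 0.0])
store.add("respiratory report", [0.0, 1.0])
store.add("cardiac report", [5.0, 5.0])

toy_retriever = store.as_retriever(k=2)
print(toy_retriever.invoke([0.9, 0.1]))
# ['knee injury report', 'respiratory report']
```

Code that only needs to fetch relevant documents can depend on the retriever's single method without knowing how the underlying store works.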


This final step ensures that your database is ready for advanced retrieval operations, leveraging the power of vector embeddings to find and return the most relevant documents efficiently. In Steps 4 and 5, we will further build our chat application to leverage these documents.