## 2. Load Data and Run Some Vector Searches
Now that you have created and run a simple Streamlit application that acts as a basic LLM chat interface, let's load some data into InterSystems IRIS and run vector searches against it. Later, we will connect this vector storage to the Streamlit application so that the chat app can use the data.

We will use the *FastEmbed* embeddings model to vectorize our data, and we'll use LangChain to load and interact with the data. LangChain provides several advantages in building a RAG application, including streamlining the retrieval process, adding conversation history, and enabling guardrails that keep your application within its intended usage. We will implement these chat features later in the workshop.

Throughout the code snippets in this workshop, you may see lines of code commented out, which can be used in later iterations when you take the code home. For example, you could use the *OpenAIEmbeddings* model to create your embeddings. This requires an OpenAI API key. For this workshop, InterSystems has provided a short-term OpenAI API key that is already configured in the environment variables. The key is also used for the base LLM that the chat application uses.

The following block manages environment variables, specifically loading the OpenAI API key. It imports modules for operating system interaction and secure password input, then loads the settings from a `.env` file. Run the block below to load these settings.

In [None]:
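import getpass
import os
from dotenv import load_dotenv

# Load variables from a local .env file, overriding values already set in the environment
load_dotenv(override=True)

# Optional fallback (an assumption, not part of the workshop setup): prompt for the key if it is missing
# if not os.environ.get("OPENAI_API_KEY"):
#     os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")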


The next block imports the libraries used in the rest of this section. These cover document handling, loaders for text and JSON data, text splitters, and embeddings models from OpenAI, Hugging Face, and FastEmbed. The block also imports the InterSystems IRIS vector store for LangChain and the SQLAlchemy helpers.

In [None]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.fastembed import FastEmbedEmbeddings

from langchain_iris import IRISVector
from sqlalchemy import create_engine, text


### View the Case Reports data
Before loading this data into InterSystems IRIS, let's take a quick look at it. The data currently lives in a JSON Lines (`.jsonl`) file, a format in which each line is a complete JSON object. This format is particularly useful for large data sets or streams of data, because a program can read, write, and process one line (one JSON object) at a time rather than loading the entire file into memory. We can use a Pandas DataFrame to format and view the data easily.
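
To make the line-at-a-time property concrete, here is a minimal sketch that streams a JSON Lines source using only the Python standard library. The `iter_jsonl` helper and the two sample records are made up for illustration and are not part of the workshop data.

```python
import io
import json

def iter_jsonl(stream):
    """Yield one parsed JSON object per non-empty line of a text stream."""
    for line in stream:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)

# Hypothetical sample data standing in for a real .jsonl file
sample = io.StringIO(
    '{"id": 1, "note": "Patient presented with knee pain."}\n'
    '{"id": 2, "note": "Patient reported a persistent cough."}\n'
)

records = list(iter_jsonl(sample))
print(len(records), records[0]["note"])
```

Because `iter_jsonl` is a generator, a caller could process each record as it arrives instead of collecting them all into a list as done here.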

Run the snippet below to create a DataFrame and view the first 10 case reports in the data set.

In [None]:
import pandas as pd

# Load JSONL file into DataFrame
file_path = './data/healthcare/augmented_notes_100.jsonl'
df_cases = pd.read_json(file_path, lines=True)
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_colwidth', None)
df_cases.head(10)

### Load the case reports into InterSystems IRIS via LangChain
Next, we will set up the process for loading, splitting, and preparing to embed text documents from the data set of 100 case reports.

The first step is to initialize a *JSONLoader* to load documents from the specified file. The argument *json_lines=True* tells the loader that the source is a JSON Lines file, and *jq_schema='.note'* selects each record's *note* field as the document text.

After loading the data, the text can be split into smaller chunks to facilitate more efficient processing and embedding. Here, we use a chunk size of 2,500 characters with an overlap of 100 characters. With these settings, each case report will be a single chunk.

Chunking the text helps in managing large documents by breaking them into smaller, more manageable pieces, which can be individually embedded into vector format. The overlap ensures that important contextual information is preserved across chunks, enhancing the quality of the resulting embeddings. We will further chunk this data set later in this section.
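
The idea of chunk size and overlap can be sketched with a plain fixed-width sliding window. This is a simplified illustration only: the `chunk_text` helper is hypothetical, and the recursive splitter used in this workshop additionally prefers to break at separators such as paragraph and sentence boundaries, so its output will differ.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

# Toy input so the overlap is easy to see by eye
text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is how context is carried across chunk boundaries. A text shorter than the chunk size comes back as a single chunk.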

Run the block of code below to prepare these chunks. The output will indicate how many chunks were created; in this case, you should have 100 chunks, since the chunk size is large enough to fit each case report.

In [None]:
# Load data from the JSON Lines file; each record's "note" field becomes one document
loader = JSONLoader(
    file_path='./data/healthcare/augmented_notes_100.jsonl',
    jq_schema='.note',
    json_lines=True
)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

docs = text_splitter.split_documents(documents)
print(f"Got {len(docs)} chunks")

With chunks created, let's prepare to load the data into InterSystems IRIS. Run the following block to create and print the connection string that will be used to connect to InterSystems IRIS. 

In [None]:
username = '_SYSTEM'
password = 'sys'
hostname = 'localhost'
port = 1972
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
print(CONNECTION_STRING)
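
The connection string follows the usual URL layout of `scheme://user:password@host:port/path`. As a quick sanity check, the sketch below takes it apart with Python's standard `urllib.parse`. It assumes the same values defined in the block above and only inspects the string; it does not connect to InterSystems IRIS.

```python
from urllib.parse import urlsplit

# Same shape and values as the workshop's connection string
connection_string = "iris://_SYSTEM:sys@localhost:1972/USER"

parts = urlsplit(connection_string)
namespace = parts.path.lstrip("/")  # the path component carries the namespace

print("scheme:   ", parts.scheme)
print("username: ", parts.username)
print("hostname: ", parts.hostname)
print("port:     ", parts.port)
print("namespace:", namespace)
```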

Next, let's initialize a vector store in InterSystems IRIS and populate it with the case reports.

A vector store is essential for search and retrieval applications where the semantic content of the documents matters more than their keyword content. It uses embeddings to perform similarity searches, which lets it match on the context and meaning of the text rather than on exact words, a significant advantage over traditional keyword search.

Run the block below to load your data and embeddings into InterSystems IRIS. Note that we are using the *FastEmbed* model to create our embeddings, but your application could use a variety of different embeddings models.

This may take a few moments.

In [None]:
# embeddings = OpenAIEmbeddings()
embeddings = FastEmbedEmbeddings()

COLLECTION_NAME = "case_reports"
db = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

Confirm that there are 100 documents in your InterSystems IRIS vector storage by running the following block.

In [None]:
print(f"Number of docs in vector store: {len(db.get()['ids'])}")

### Try out vector search

Now that the case reports are loaded, embedded, and stored in the vector database, you can try running a vector search. In the code block below, we will use the search phrase "Have any children presented with knee injuries?" to retrieve similar vectors from our storage.

The second line in the block returns the documents along with their similarity scores, which quantify how similar each document is to the query.

What is this vector search really doing? Recall that we chose the *FastEmbed* model to create our embeddings. When you provide a search query, an embedding is created for the query using the same embeddings model that we used to embed our data. Then the `similarity_search_with_score` function provided by LangChain finds the results in the InterSystems IRIS vector storage that are most semantically similar to your query.

NOTE: Lower similarity scores indicate greater similarity.

In [None]:
query = "Have any children presented with knee injuries?"
docs_with_score = db.similarity_search_with_score(query)

Run the following block to print the returned documents along with their scores.

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
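
To see why lower scores mean greater similarity, consider a toy version of the ranking step. The sketch below assumes a cosine-distance style score, which is an assumption: the actual metric depends on how the vector store is configured. The three-dimensional vectors and their labels are invented, and real embeddings have hundreds of dimensions.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical "embeddings" for three stored documents
stored = {
    "child with knee injury": [0.9, 0.1, 0.0],
    "adult with knee surgery": [0.6, 0.4, 0.1],
    "upper respiratory infection": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.0, 0.0]  # hypothetical embedding of the search query

# Rank documents from smallest distance (most similar) to largest
ranked = sorted(
    (cosine_distance(query_vector, vector), label) for label, vector in stored.items()
)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

The document whose vector points in nearly the same direction as the query lands at the top with a score close to 0, while the unrelated document lands at the bottom.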

Next, you will enter a search phrase of your choice and perform a similarity search on that phrase.

Set the *content* variable to a word or phrase of your choice. You can reference keywords from some of the case reports you saw when browsing the data set earlier in this section, or you can choose any other phrase you wish. Then, run the block.

In [None]:
content = "Upper respiratory illness"
docs_with_score = db.similarity_search_with_score(content)

Run the block below to print the most similar results. Observe the similarity scores, keeping in mind that the closer a score is to 0, the more similar the document is to your search phrase.

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

### Experiment with chunk sizes
Next, let's try experimenting with varied chunk sizes. Run the block below, which is the same as your previous iteration for loading and chunking, but this time uses a chunk size of 400 instead of 2,500. Observe the change in the number of chunks that were created.

In [None]:
# Load data from the JSON Lines file; each record's "note" field becomes one document
loader = JSONLoader(
    file_path='./data/healthcare/augmented_notes_100.jsonl',
    jq_schema='.note',
    json_lines=True
)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=100,
    separators=["\n\n", "\n", ".", " ", ""]
)

docs = text_splitter.split_documents(documents)
print(f"Got {len(docs)} chunks")
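
The jump in chunk count follows from simple arithmetic. As a rough sketch, assume fixed-width windows: a text of length L, split with chunk size s and overlap o, needs 1 chunk if L ≤ s, and otherwise 1 + ⌈(L − s) / (s − o)⌉ chunks. The `estimate_chunk_count` helper and the 2,000-character note length below are hypothetical. The recursive splitter breaks at separators, so real counts will differ somewhat.

```python
import math

def estimate_chunk_count(text_length, chunk_size, chunk_overlap):
    """Estimate how many fixed-size overlapping windows cover text_length characters."""
    if text_length <= 0:
        return 0
    if text_length <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # how far each new window advances
    return 1 + math.ceil((text_length - chunk_size) / step)

note_length = 2000  # hypothetical length of one case report, in characters
for size in (2500, 400):
    count = estimate_chunk_count(note_length, size, chunk_overlap=100)
    print(f"chunk_size={size}: about {count} chunk(s) per note")
```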

Now let's load this newly chunked data into a separate data store within InterSystems IRIS.

In [None]:
# embeddings = OpenAIEmbeddings()
embeddings = FastEmbedEmbeddings()

COLLECTION_NAME = "case_reports-chunked"
dbchunked = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

Confirm how many documents are loaded into this second vector store in InterSystems IRIS by running the block below.

In [None]:
print(f"Number of docs in vector store: {len(dbchunked.get()['ids'])}")

Now let's try another vector search against the more finely chunked data. Run the block below, optionally replacing the query with the search phrase you chose earlier. Whichever phrase you use, observe how the results and similarity scores differ from before, when larger chunks were used.

In [None]:
query = "Have any children presented with knee injuries?"
docs_with_score = dbchunked.similarity_search_with_score(query)

for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

### Prepare retrieval mechanism for RAG app
Now that we have experimented with vector searches and chunk sizes, let's set up a retriever for our vector store. A retriever is an essential component of an information retrieval system: given a query, it efficiently fetches the relevant documents. Converting the vector store into a retriever gives the rest of the application a single, uniform way to ask for relevant data.

The `as_retriever()` method transforms the database into a retriever object. This object can then be used to perform various retrieval tasks, making it a versatile tool for working with our embedded documents.
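
Conceptually, a retriever wraps a search function behind a "query in, documents out" interface. The toy class below illustrates that idea only. It is not LangChain's implementation, and the `ToyRetriever` class, the `fake_search` function, and the choice of `k` are all hypothetical.

```python
class ToyRetriever:
    """Conceptual stand-in for a retriever: query in, top-k documents out."""

    def __init__(self, search_with_score, k=4):
        # search_with_score is any callable: query -> [(document, score), ...]
        self.search_with_score = search_with_score
        self.k = k

    def invoke(self, query):
        # Sort ascending because lower scores mean greater similarity, then drop the scores
        results = sorted(self.search_with_score(query), key=lambda pair: pair[1])
        return [doc for doc, _ in results[: self.k]]

def fake_search(query):
    # Hypothetical canned results standing in for a real vector store
    return [("note B", 0.42), ("note A", 0.10), ("note C", 0.75)]

toy_retriever = ToyRetriever(fake_search, k=2)
print(toy_retriever.invoke("knee injuries"))
```

The caller never sees scores or storage details, which is what lets a retriever be plugged into a larger chain without that chain knowing which vector store sits behind it.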

Run the block below to create the retriever and print it to confirm its setup. We'll use the more finely chunked data, consisting of 764 chunks, for this retriever.

In [None]:
retriever = dbchunked.as_retriever()
print(retriever)