## Load and embed your data

In this notebook, you will use LlamaIndex to load and embed the data.

Run the first block below, which imports the needed libraries and environment variables, then loads the data from a JSON Lines file. In this file format, each line is a complete JSON object, and objects are separated by newlines. The format is particularly useful for large datasets or streams of data because it can be read, written, and processed one line (one JSON object) at a time, rather than loading the entire file into memory at once.

In [1]:
##############################################################
## This code is for academic and educational purposes only. ##
## Event: Global Summit 2024 Maryland USA                   ##
## InterSystems Corporation 2024 (C)                        ##
## Date: June 9th 2024                                      ##
##############################################################

##### We use LlamaIndex to load data from a file and store it in InterSystems IRIS
from llama_index import download_loader
from llama_index import SimpleDirectoryReader, StorageContext, ServiceContext
from llama_index.readers.json import JSONReader
from llama_index.indices.vector_store import VectorStoreIndex
from llama_iris import IRISVectorStore

from dotenv import load_dotenv
load_dotenv(override=True)

import os

##### Let's load our dataset
reader = JSONReader(is_jsonl=True)
documents = reader.load_data('./data/healthcare/augmented_notes_100.jsonl')
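
To illustrate the line-at-a-time property of JSON Lines described above, here is a minimal sketch that uses only the Python standard library. The sample records and the `iter_jsonl` helper are hypothetical and are not part of the notebook's dataset or of LlamaIndex; the notebook itself relies on `JSONReader` for loading.

```python
import io
import json

# Hypothetical sample data: one complete JSON object per line.
sample = io.StringIO(
    '{"id": 1, "note": "Patient reports knee pain."}\n'
    '{"id": 2, "note": "Follow-up visit, no complaints."}\n'
    '{"id": 3, "note": "History of hypertension."}\n'
)


def iter_jsonl(lines):
    """Yield one parsed object per non-empty line, without reading everything at once."""
    for line in lines:
        line = line.strip()
        if line:  # skip blank lines
            yield json.loads(line)


# Each record is parsed independently of the others.
records = list(iter_jsonl(sample))
print(len(records), records[0]["note"])
```

Because `iter_jsonl` is a generator, the same function could be passed an open file handle and would only hold one line in memory at a time.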


Run the next block to see the first 5 documents that were loaded. 

In [None]:
##### Let's see the first 5 documents
documents[:5]

##### We have already reduced these documents (in Step 0) to just the text and first 100 documents

Next, you need to connect to InterSystems IRIS so that the data can be vectorized and stored in an InterSystems IRIS database. The following two blocks configure the connection and initialize the table where your data will be stored.

In [3]:
##### Configuring IRIS
# Set up our demo connectivity
username = 'demo'
password = 'demo' 
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '61209' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
#####

In [4]:
##### Here, we point the IRISVectorStore helper at our connection and at the table that will hold the embeddings
vector_store = IRISVectorStore.from_params(
    connection_string=CONNECTION_STRING,
    table_name="augmented_notes_llamaindex",
    embed_dim=1536,  # OpenAI embedding dimension
)
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
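
The connection string built above has the shape of a URL: scheme, credentials, host, port, and namespace. As a rough sketch, the snippet below assembles a string of the same shape and checks it with the standard library. The `build_connection_string` helper and the sample password are hypothetical, and percent-encoding the credentials is only appropriate if the driver parses the string as a standard URL, which is an assumption here.

```python
from urllib.parse import quote, unquote, urlsplit


def build_connection_string(username, password, hostname, port, namespace):
    """Hypothetical helper: assemble an iris:// URL, escaping the credentials."""
    # quote() escapes characters such as '@' and ':' that would otherwise
    # be mistaken for the URL's own separators.
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"iris://{user}:{pwd}@{hostname}:{port}/{namespace}"


# Made-up password containing characters that need escaping.
conn = build_connection_string("demo", "p@ss:word", "localhost", "61209", "USER")

# Split the string back into its parts to confirm that it is well formed.
parts = urlsplit(conn)
print(parts.scheme, parts.hostname, parts.port, parts.path)
print(unquote(parts.password))
```

With the demo credentials used in this notebook, no escaping is needed, so the plain f-string in the cell above produces the same result.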

Finally, you can connect to the InterSystems IRIS instance and save your data in a vectorized format. Run the following block to complete this step. Vectorizing, or embedding, your data creates numerical representations that capture its semantic properties, so that texts with similar meanings are represented by numerically close vectors.

In a RAG setup, embeddings help quickly find relevant documents by measuring the similarity between the embedded vectors of a query and those in the database.

In [None]:
##### Finally, we can connect to the InterSystems IRIS instance and save our data in a vectorized format
##### Below, we set up how we are going to index the vectorized data (using an embeddings model)
index = VectorStoreIndex.from_documents(
    documents,                              ##### These are our clinical notes we loaded up
    storage_context=storage_context,        ##### This is our connection to the vector store
    show_progress=True,                     ##### Let's see the progress as it happens
)

##### To interact with our embeddings, we take the query engine from our documents
query_engine = index.as_query_engine()      ##### "as_query_engine" is a llama_index method that lets 
                                            ##### us search and retrieve based on vector similarity
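
To make the idea of "numerically close" vectors concrete, here is a minimal sketch that ranks a few documents against a query by cosine similarity. The three-dimensional vectors and document labels are made up for illustration (real embeddings in this notebook have 1536 dimensions), and cosine similarity is only one common measure; the measure the vector store actually uses may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; values near 1.0 mean similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings", not produced by any real model.
docs = {
    "chronic back pain": [0.9, 0.1, 0.0],
    "knee pain after injury": [0.8, 0.3, 0.1],
    "seasonal allergies": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # stands in for an embedded query about pain

# Rank documents from most to least similar to the query.
ranked = sorted(
    docs,
    key=lambda name: cosine_similarity(query_vec, docs[name]),
    reverse=True,
)
print(ranked)
```

In this toy data, the two pain-related vectors point in nearly the same direction as the query, so they rank ahead of the unrelated one.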

## Try out vector search 

Now that the text documents are loaded, embedded, and stored in the vector database, you can try running a vector search. In the code block below, set `query` equal to "36 year old patient with a history of pain" and run the block. 

In [None]:

import textwrap

##### Set `query` to the text you want to search for
query = ""
response = query_engine.query(query)
print(textwrap.fill(str(response), 100))