## Load and embed your data

In this step, we use `OpenAIEmbeddings` to vectorize the data, which requires an OpenAI API key.

The following block of code manages environment variables, specifically loading and setting the OpenAI API key. It imports `os` for access to the environment, `getpass` for hidden password input, and `load_dotenv` to read variables from a local `.env` file.

The script then checks whether `OPENAI_API_KEY` is already set in the environment. If it is not, the `getpass` call prompts for the key at runtime without echoing it to the screen. Keeping credentials in environment variables, rather than hardcoding them into your application, keeps them out of the source code and under the control of the environment configuration.

In [1]:
import getpass
import os
from dotenv import load_dotenv

# Load variables from a local .env file, overriding any that are already set.
load_dotenv(override=True)

if not os.environ.get("OPENAI_API_KEY"):
    # Prompt for the key at runtime without echoing it to the screen.
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")




The next block imports the libraries used in the rest of this section: the `Document` class, loaders for text and JSON data, a character-based text splitter, and embedding classes for OpenAI, Hugging Face, and FastEmbed. It also imports `IRISVector`, the vector store class from `langchain_iris`.

In [2]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.fastembed import FastEmbedEmbeddings

from langchain_iris import IRISVector


Next, set up the process for loading, splitting, and preparing to embed text documents from a dataset.

The first step is to initialize a `JSONLoader` to load documents from a specified file. The argument `json_lines=True` tells the loader that the file is in JSON Lines format, in which each line is a complete JSON object and records are separated by newlines. This format is particularly useful for large datasets or streams of data because it can be read, written, and processed one line (one JSON object) at a time, rather than loading the entire file into memory at once. The `jq_schema='.note'` argument selects the `note` field of each object as the document text.
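
To make the format concrete, here is a minimal sketch that reads JSON Lines data with only the standard library and extracts a `note` field from each record. The two sample records and the `iter_notes` helper are made up for illustration; this is not how `JSONLoader` is implemented internally.

```python
import io
import json

# Hypothetical two-record JSON Lines payload; the real file holds one note per line.
sample = io.StringIO(
    '{"note": "Patient A presented with fever."}\n'
    '{"note": "Patient B reported chest pain."}\n'
)

def iter_notes(lines):
    """Yield the "note" field of each non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)  # each line is a complete JSON object
        yield record["note"]

notes = list(iter_notes(sample))
print(notes)
```

Because `iter_notes` is a generator, it only ever holds one record in memory at a time, which is the property that makes the format convenient for large files.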

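The loaded documents are then split into smaller chunks, targeting roughly 1,000 characters per chunk with a 100-character overlap, and an `OpenAIEmbeddings` object is created. The embedding itself happens later, when the chunks are written to the vector store.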
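
The effect of a chunk size and an overlap can be sketched with a plain sliding window over characters. This is a simplified illustration, not LangChain's algorithm: `CharacterTextSplitter` first splits on a separator and then merges the pieces, so its chunks can differ from the ones below. The `chunk_text` helper is hypothetical.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: CharacterTextSplitter splits on a
    separator and merges pieces, so its output can differ.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

chunks = chunk_text("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk repeats the last three characters of the previous one. The overlap is there so that a sentence cut at a chunk boundary still appears intact in at least one chunk.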

In [3]:
# loader = TextLoader("../data/state_of_the_union.txt", encoding='utf-8')
# Windows only install: 
# ! pip install https://jeffreyknockel.com/jq/jq-1.4.0-cp311-cp311-win_amd64.whl
# Other platforms
# ! pip install jq
#

loader = JSONLoader(
    file_path='./data/healthcare/augmented_notes_1000.jsonl',
    jq_schema='.note',
    json_lines=True  # each line of the file is a separate JSON object
)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
# embeddings = FastEmbedEmbeddings()



Run the following two blocks to create and print the connection string that will be used to connect to InterSystems IRIS. 

In [4]:
username = 'demo'
password = 'demo' 
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '61209'  # 1972 is the default IRIS superserver port; change this to match your instance
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

In [5]:
# print(os.environ.get("OPENAI_API_KEY"))
print(CONNECTION_STRING)


iris://demo:demo@localhost:61209/USER
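
Hardcoded credentials are fine for a local demo, but a password containing characters such as `@` or `:` would break the URL. The sketch below percent-encodes the credentials before assembling the string. It assumes the connection string is parsed as a SQLAlchemy-style URL, which decodes percent-encoded credentials; `build_connection_string` is a hypothetical helper, not part of `langchain_iris`.

```python
from urllib.parse import quote

def build_connection_string(username, password, hostname, port, namespace):
    """Assemble an iris:// URL, percent-encoding the credentials."""
    # safe="" makes quote() encode every reserved character, including "/" and ":".
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    return f"iris://{user}:{pwd}@{hostname}:{port}/{namespace}"

# Same values as the demo above: the result is unchanged.
plain = build_connection_string("demo", "demo", "localhost", "61209", "USER")

# A made-up password with reserved characters gets encoded.
tricky = build_connection_string("demo", "p@ss:word", "localhost", "1972", "USER")

print(plain)
print(tricky)
```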



The following code block creates a collection named `augmented_notes` in InterSystems IRIS and populates it with the chunks prepared above: `IRISVector.from_documents` embeds each chunk and stores the text alongside its vector.

This setup matters for search and retrieval applications where the meaning of a document is more important than the exact keywords it contains. The vector database compares embeddings to perform similarity searches, so a query can match documents that express the same idea in different words, which keyword-based search would miss.
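
As a rough picture of what a similarity search does, the sketch below ranks a few texts by cosine distance to a query vector. The three-dimensional vectors are hand-made stand-ins rather than model output, and the brute-force `search` function is a hypothetical illustration, not how IRIS executes the query.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output).
toy_store = {
    "teenage patient with asthma": [0.9, 0.1, 0.0],
    "elderly patient with fracture": [0.1, 0.9, 0.0],
    "hospital parking policy": [0.0, 0.1, 0.9],
}

def search(query_vector, store, k=2):
    """Rank stored texts by distance to the query, closest first."""
    scored = [(text, cosine_distance(query_vector, vec)) for text, vec in store.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]

results = search([0.8, 0.2, 0.0], toy_store)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The query vector shares no keywords with anything, yet the text whose vector points in nearly the same direction ranks first. A real embedding model produces vectors with hundreds or thousands of dimensions, but the ranking idea is the same.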

In [6]:
COLLECTION_NAME = "augmented_notes"

db = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

In [7]:
# If reconnecting to the database, use this:

# db = IRISVector(
#     embedding_function=embeddings,
#     dimension=1536,
#     collection_name=COLLECTION_NAME,
#     connection_string=CONNECTION_STRING,
# )

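Run the following code block to add documents to the existing vector store. `add_documents` returns the IDs of the new entries. Note that this block passes `documents` (the unsplit notes) on top of the chunks in `docs` that `from_documents` already stored, which would explain the count of 2,000 reported afterward.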

In [8]:
# To add documents to existing vector store:

db.add_documents(documents)

['8c73887e-0c10-11ef-82ba-3448ed843086',
 '8c73887f-0c10-11ef-8e03-3448ed843086',
 '8c738880-0c10-11ef-832d-3448ed843086',
 '8c738881-0c10-11ef-9675-3448ed843086',
 '8c738882-0c10-11ef-9716-3448ed843086',
 '8c738883-0c10-11ef-9ba4-3448ed843086',
 '8c738884-0c10-11ef-a03a-3448ed843086',
 '8c738885-0c10-11ef-a9b0-3448ed843086',
 '8c738886-0c10-11ef-be87-3448ed843086',
 '8c738887-0c10-11ef-87a3-3448ed843086',
 '8c738888-0c10-11ef-a64e-3448ed843086',
 '8c738889-0c10-11ef-9f90-3448ed843086',
 '8c73888a-0c10-11ef-9cab-3448ed843086',
 '8c73888b-0c10-11ef-84e5-3448ed843086',
 '8c73888c-0c10-11ef-b2d7-3448ed843086',
 '8c73888d-0c10-11ef-af21-3448ed843086',
 '8c73888e-0c10-11ef-9c2b-3448ed843086',
 '8c73888f-0c10-11ef-b27d-3448ed843086',
 '8c738890-0c10-11ef-8094-3448ed843086',
 '8c738891-0c10-11ef-a960-3448ed843086',
 '8c738892-0c10-11ef-bf47-3448ed843086',
 '8c738893-0c10-11ef-bd6a-3448ed843086',
 '8c738894-0c10-11ef-92ab-3448ed843086',
 '8c738895-0c10-11ef-9c59-3448ed843086',
 '8c738896-0c10-

In [9]:
print(f"Number of docs in vector store: {len(db.get()['ids'])}")

Number of docs in vector store: 2000


## Try out vector search 

Now that the text documents are loaded, embedded, and stored in the vector database, you can try running a vector search. The code block below sets `query` to "19 year old patient"; run it as is, or substitute a query of your own.

The second line in the block returns the documents along with their scores, which quantify how close each document is to the query. The scores behave as distances, so lower scores indicate greater relevance.
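
Because lower scores mean closer matches, one common follow-up is to drop results above a distance cutoff. The sketch below does this on stand-in `(text, score)` pairs; the texts, scores, the `filter_by_distance` helper, and the `0.35` cutoff are all hypothetical, and the real search returns `(Document, score)` tuples instead.

```python
# Stand-in (text, score) pairs shaped like the search results;
# the texts and scores are hypothetical.
scored = [
    ("unrelated note", 0.58),
    ("note about a 19-year-old", 0.12),
    ("note about a 45-year-old", 0.31),
]

def filter_by_distance(pairs, max_distance):
    """Keep pairs whose distance is at most max_distance, closest first."""
    kept = [pair for pair in pairs if pair[1] <= max_distance]
    return sorted(kept, key=lambda pair: pair[1])

relevant = filter_by_distance(scored, max_distance=0.35)
print(relevant)
```

A suitable cutoff depends on the embedding model and the data, so it is usually chosen by inspecting scores for a few known-good and known-bad queries.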

In [10]:
query = "19 year old patient"
docs_with_score = db.similarity_search_with_score(query)

Run the following block to print the returned documents along with their scores.

In [11]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

--------------------------------------------------------------------------------
Score:  0.178323928204704
The patient is a 30-year-old pregnant woman, gravida 1 para 0, 170 cm and weighted 82 kg at 18 weeks’ gestation. Her initial NIPT result showed an unexpected 5 Mb deletion and 9 Mb duplication on the short arm of chromosome 18. Because of the rare discovery, the patient was then referred to us for genetic counseling sessions and further genetic tests were issued with the complete consent of her parents to investigate if the pregnant woman, her biological parents and the fetus were healthy. After cytogenetic and molecular examinations, a rare de novo 18p terminal deletion with inverted duplication was identified in the pregnant woman, but her parents and the fetus were normal.
The course of her pregnancy was uneventful with the exception of hypothyroidism at 7 weeks’ gestation and treated with Euthyrox from then on. Despite an uneventful family history, the patient had a healthy ap

In the following two blocks, you will add a new document to the database and perform a similarity search on the contents of this document.

The first block adds the document, runs the search, and displays the first result in the list: the document itself is returned as the most similar, with a score of 0.0.

The second block shows what else was returned by the similarity search.

In [12]:
db.add_documents([Document(page_content="foo")])
docs_with_score = db.similarity_search_with_score("foo")
docs_with_score[0]

(Document(page_content='foo'), 0.0)

In [13]:
docs_with_score

[(Document(page_content='foo'), 0.0),
 (Document(page_content='A 45 year old male patient who was run over by a train resulting in a right leg amputation at the level of the knee and a crush injury of the left foot. He was brought to our hospital about 2 h after the accident. The right lower limb had a severe comminution and bone loss at the knee joint, with the loss of skin and soft- tissue and crushing of muscle above and below the knee [Figures and ]. The left forefoot was completely degloved and all the toes were crushed and degloved as well [Figures and ].\nThe right lower limb was deemed not replantable as the knee joint was severely damaged and not salvageable, In addition, debridement of crushed and devitalized tissues would result in a 15-20 cm shortening and a limb that was at least 15 cm short with fused knee joint would not be functionally useful and primary insertion of prosthetic knee joint was not considered to be feasible by the attending orthopaedic surgeon.\nFocus was

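Finally, `as_retriever()` wraps the vector store in a LangChain retriever object, which can be passed to components that need to fetch relevant documents for a query.

In [14]: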
retriever = db.as_retriever()
print(retriever)

tags=['IRISVector'] vectorstore=<langchain_iris.vectorstores.IRISVector object at 0x000001D2FA6EDE50>
