## 3. Use Helper Libraries like LangChain

In section 1, we used a basic sentence transformer model and direct SQL access in InterSystems IRIS to load and vectorize data, then run vector searches on that data.

In this section, we will use LangChain's *OpenAIEmbeddings* class, which calls OpenAI's embedding models, to vectorize our data, and we'll use LangChain to load and interact with that data. LangChain provides several advantages in building a RAG application, including streamlining the retrieval process, adding conversation history, enabling guardrails to keep your application within its intended usage, and more.

To use *OpenAIEmbeddings*, we need an OpenAI API key. The following block loads environment variables, including the OpenAI API key, from a `.env` file. It imports `os` for access to environment variables, `getpass` for secure password input, and `load_dotenv` to read the `.env` file.

For this workshop, InterSystems has provided a short-term OpenAI API key that is already configured in the environment variables. Run the block below to load these settings.

In [None]:
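import getpass
import os
from dotenv import load_dotenv

# Load variables from a local .env file, overriding any that are already set.
load_dotenv(override=True)
# Fall back to a hidden prompt if the key was not provided by the environment.
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key: ")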


The next block imports the LangChain components used in this section: the *Document* class, loaders for text and JSON data, a character-based text splitter, and embedding classes for OpenAI, Hugging Face, and FastEmbed (a lightweight local alternative). It also imports *IRISVector*, the InterSystems IRIS vector store integration from the `langchain_iris` package.

In [None]:
from langchain.docstore.document import Document
from langchain.document_loaders import TextLoader, JSONLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.embeddings.fastembed import FastEmbedEmbeddings

from langchain_iris import IRISVector


### Load data set with 1,000 financial tweets via LangChain
Next, we will set up the process for loading, splitting, and preparing to embed text documents from the same data set of 1,000 tweets that we used with SQL in section 1.

The first step is to initialize a *JSONLoader* to load documents from a specified file. The argument *json_lines=True* below specifies that the input is a JSON Lines file, a format in which each line is a complete JSON object. The argument *jq_schema='.note'* selects the `note` field of each object as the document text. JSON Lines is particularly useful for large datasets or streams of data, because records can be read, written, and processed one line at a time instead of loading an entire file into memory at once.
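
As a rough illustration of the format (not of how *JSONLoader* is implemented), the sketch below parses JSON Lines text one record at a time with Python's standard `json` module and pulls out the `note` field, mirroring the `jq_schema='.note'` setting used in this section. The two sample records and the helper name `iter_notes` are made up for the example.

```python
import io
import json

# Two made-up records in JSON Lines form: one complete JSON object per line.
SAMPLE_JSONL = (
    '{"note": "Example tweet about $AAA earnings", "sentiment": 1}\n'
    '{"note": "Example tweet about $BBB guidance", "sentiment": 0}\n'
)


def iter_notes(stream):
    """Yield the 'note' field of each JSON Lines record, one line at a time."""
    for line in stream:
        line = line.strip()
        if not line:  # tolerate blank lines
            continue
        record = json.loads(line)
        yield record["note"]


# io.StringIO stands in for an open file; only one line is parsed at a time.
notes = list(iter_notes(io.StringIO(SAMPLE_JSONL)))
print(notes)
```

Because each line is parsed independently, the same loop works on a file of any size without holding the whole file in memory.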

After loading the data, the text is split into smaller chunks so that each piece can be embedded on its own. Here, we use a chunk size of 1,000 characters with an overlap of 100 characters. Chunking breaks large documents into manageable pieces, and the overlap repeats the end of one chunk at the start of the next so that context spanning a boundary is not lost. In this example, we do not expect chunking to change much, since each tweet is already far shorter than 1,000 characters.
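
To make the size and overlap parameters concrete, here is a simplified sliding-window chunker. It is only a sketch: *CharacterTextSplitter* first splits on a separator and then merges pieces up to the chunk size, so its output will differ. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters, where each window
    starts chunk_size - chunk_overlap characters after the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks


# A 2,500-character document with the settings used in this section.
long_doc = "x" * 2500
pieces = chunk_text(long_doc, chunk_size=1000, chunk_overlap=100)
print([len(p) for p in pieces])

# A tweet-sized text fits in one chunk, so it passes through unchanged.
tweet = "Short example tweet"
print(chunk_text(tweet, chunk_size=1000, chunk_overlap=100))
```

With these settings, consecutive chunks share 100 characters, and any text shorter than the chunk size comes back as a single chunk.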

Run the block below to load and split the tweets and to create the embedding model:

In [None]:
loader = JSONLoader(
    file_path='./data/financial/tweets_all.jsonl',
    jq_schema='.note',
    json_lines=True
)
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embeddings = OpenAIEmbeddings()
# embeddings = FastEmbedEmbeddings()

Run the following two blocks to create and print the connection string that will be used to connect to InterSystems IRIS. 

In [None]:
username = '_SYSTEM'
password = 'SYS'
hostname = 'iris'
port = 1972
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

In [None]:
print(CONNECTION_STRING)
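
If you want to double-check what the connection string encodes, the standard library can split it back into its parts. This is an illustrative sketch using the same values as above. The `quote` step is an added suggestion for credentials that contain characters such as `@` or `/`, which would otherwise make the URL ambiguous; whether your driver version decodes percent-escapes in the password is an assumption to verify.

```python
from urllib.parse import quote, urlsplit

username = "_SYSTEM"
password = "SYS"
hostname = "iris"
port = 1972
namespace = "USER"

# Percent-encode the credentials so reserved characters cannot break the URL.
connection_string = (
    f"iris://{quote(username, safe='')}:{quote(password, safe='')}"
    f"@{hostname}:{port}/{namespace}"
)

# Split the URL back into its components to confirm each one.
parts = urlsplit(connection_string)
print(parts.scheme, parts.username, parts.hostname, parts.port, parts.path)
```

For the workshop defaults, the encoded string is identical to the one printed above, because none of the values contain reserved characters.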


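Next, let's create a vector store in InterSystems IRIS and populate it with the tweets that we have loaded. *IRISVector.from_documents* embeds each document with the chosen embedding model and stores the text together with its vector.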

This setup is essential for applications involving search and retrieval of information where the semantic content of the documents is more important than their keyword content. The vector store uses embeddings to perform similarity searches, offering significant advantages over traditional search methods by understanding the context and meaning embedded within the text.

Run the block below to load your data and embeddings into InterSystems IRIS. This may take a few moments.

In [None]:
COLLECTION_NAME = "financial_tweets"

db = IRISVector.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
)

Confirm that there are 1,000 documents in your vector storage by running the following block.

In [None]:
print(f"Number of docs in vector store: {len(db.get()['ids'])}")

### Try out vector search

Now that the text documents are loaded, embedded, and stored in the vector database, you can try running a vector search. In the code block below, we will use the search phrase "How is Beyond Meat doing?" to retrieve similar vectors from our storage.

The second line in the block returns the documents along with their scores. Each score measures the distance between the document's vector and the query's vector, so lower scores indicate greater relevance.

In [None]:
query = "How is Beyond Meat doing?"
docs_with_score = db.similarity_search_with_score(query)

Run the following block to print the returned documents along with their scores.

In [None]:
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)
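
To see why a lower score can mean a closer match, the sketch below computes cosine distance (1 minus cosine similarity) for a few hand-made three-dimensional vectors. It assumes the store reports a distance of this kind; the exact metric used by *IRISVector* is not stated here, and real embeddings have far more dimensions. The vectors and labels are invented for the example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vec = [1.0, 0.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0, 0.0],
    "partly related": [1.0, 1.0, 0.0],
    "unrelated": [0.0, 0.0, 1.0],
}

# Sort ascending: the smallest distance is the best match.
ranked = sorted(
    (cosine_distance(query_vec, vec), label) for label, vec in candidates.items()
)
for score, label in ranked:
    print(f"{score:.3f}  {label}")
```

A vector pointing in the same direction as the query has distance 0, which is why a document identical to the query is expected to come back with a score at or near 0.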

In the following two blocks, you will add a new document to the database and perform a similarity search on the contents of this document.

Set the *content* variable to a word or phrase of your choice. For example, you can use the name of one of the companies you saw when browsing the data set in section 1. Then, run the block.

In [None]:
content = "your search phrase"
db.add_documents([Document(page_content=content)])
docs_with_score = db.similarity_search_with_score(content)

Run the block below to print the first item in the returned list, which is the closest match. Because you just added this exact text to the store, the document itself is returned as the closest match, with a score of 0.0 (or very close to it).

In [None]:
docs_with_score[0]

Run the block below to see the set of results beyond the single most similar document.

In [None]:
docs_with_score

### Prepare retrieval mechanism for RAG app
Next, we'll set up a retriever for our database. A retriever is an essential component in information retrieval systems, as it allows us to fetch relevant documents based on a query efficiently. By converting the database into a retriever, we enhance our ability to interact with the data, enabling more advanced search and retrieval operations.

The `as_retriever()` method wraps the vector store in a retriever object. This object exposes a common "query in, documents out" interface, which lets other LangChain components use the store without knowing how it searches.

Run the block below to create the retriever and print it to confirm its setup.

In [None]:
retriever = db.as_retriever()
print(retriever)
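
Conceptually, a retriever hides the store behind a single call that takes a query and returns documents. The toy class below illustrates that idea only. It is not LangChain's implementation: `ToyRetriever` and its word-overlap scoring are hypothetical stand-ins for the object returned by `as_retriever()` and for real embeddings, and its `invoke` method is merely named to echo the call style of recent LangChain retrievers.

```python
class ToyRetriever:
    """Return the k stored texts that share the most words with the query."""

    def __init__(self, texts, k=2):
        self.texts = list(texts)
        self.k = k

    @staticmethod
    def _words(text):
        return set(text.lower().split())

    def invoke(self, query):
        query_words = self._words(query)
        # Rank texts by word overlap; a real retriever compares embeddings.
        ranked = sorted(
            self.texts,
            key=lambda text: len(query_words & self._words(text)),
            reverse=True,
        )
        return ranked[: self.k]


toy = ToyRetriever(
    [
        "plant based burger sales are rising",
        "chip maker beats earnings estimates",
        "burger chain adds plant based menu",
    ],
    k=2,
)
print(toy.invoke("plant based burger demand"))
```

The caller never sees how the ranking is done, which is the property that lets a retriever be swapped or reconfigured without changing the code that consumes its results.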

With the retriever in place, your vector store is ready to supply the most relevant documents for a given query. In sections 4 and 5, we will build on it to create the chat application.