## Vector Search with IRIS SQL
This tutorial covers how to use InterSystems IRIS as vector storage for the same set of financial tweets that we loaded and vectorized in steps 1A and/or 1B.

Begin by running the block of code below, which imports the necessary components to get started.

In [None]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

from dotenv import load_dotenv
load_dotenv(override=True)

Next, we will set the InterSystems IRIS connection details: the username, the password, the hostname and port of the InterSystems IRIS container in this lab, and the namespace. We then combine all of those elements into a connection string.

In [9]:
username = 'demo'
password = 'demo'
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '55665' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
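
The demo credentials above are plain alphanumeric strings, so they can be placed into the URL as they are. If a password contains characters such as `@` or `/`, it generally needs to be percent-encoded before it goes into a connection URL. Here is a minimal sketch, assuming a hypothetical helper named `build_connection_string` that is not part of the lab code:

```python
from urllib.parse import quote_plus

def build_connection_string(username, password, hostname, port, namespace):
    """Assemble an iris:// URL, percent-encoding the credentials.

    Hypothetical helper for illustration; the tutorial builds the string inline.
    """
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"iris://{user}:{pwd}@{hostname}:{port}/{namespace}"

# The same values as in the cell above produce the same string
demo_url = build_connection_string('demo', 'demo', 'localhost', '55665', 'USER')
print(demo_url)
```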

Using the connection string we just built, let's establish a connection to InterSystems IRIS.

In [10]:
engine = create_engine(CONNECTION_STRING)

### Load financial tweet data
Next, we will load the JSON file with financial tweets into a Pandas DataFrame that can be easily imported into InterSystems IRIS as a SQL table.

In [3]:
import pandas as pd

# Load JSONL file into DataFrame
file_path = './data/financial/tweets_all.jsonl'
df_tweets = pd.read_json(file_path, lines=True)


Let's display the first few rows of our DataFrame by running the line below.

In [None]:
df_tweets.head()

With the new release of InterSystems IRIS vector search capability, InterSystems IRIS supports vectors as a datatype in tables! In the block below, we will create a table with two columns. The first column, 'note', holds the text of a tweet. The second column, 'note_vector', will be used to store the vector that is generated by passing the 'note' of a tweet through an embedding model.

In [11]:

with engine.connect() as conn:
    with conn.begin():
        # Create a table holding the tweet text and a 384-dimensional vector
        sql = """
            CREATE TABLE financial_tweets (
                note VARCHAR(255),
                note_vector VECTOR(DOUBLE, 384)
            )
        """
        result = conn.execute(text(sql))
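
The `note` column is declared as `VARCHAR(255)`, so tweets longer than 255 characters may not fit. As a quick sanity check before inserting, one could look for such rows. The sketch below works on a plain list of strings. The function `find_overlong_notes` and the sample data are hypothetical, and how IRIS handles an over-length value is not covered here.

```python
def find_overlong_notes(notes, max_len=255):
    """Return (index, length) pairs for notes longer than max_len characters."""
    return [(i, len(n)) for i, n in enumerate(notes) if len(n) > max_len]

# Made-up sample data for illustration only
sample_notes = ["short tweet", "x" * 300, "another short tweet"]
overlong = find_overlong_notes(sample_notes)
print(overlong)
```

With the tutorial's DataFrame, the call would be `find_overlong_notes(df_tweets['note'].tolist())`.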

Next, let's load a pre-trained sentence transformer model. This model's output vectors are of size 384. We will use this model to create vector embeddings for each financial tweet in our data set.

In [12]:
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
model = SentenceTransformer('all-MiniLM-L6-v2') 

Using the sentence transformer above, we will create embeddings for all of the financial tweets in the data set.

In [None]:

# Generate embeddings for all tweets at once. Batch processing makes it faster
embeddings = model.encode(df_tweets['note'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df_tweets['note_vector'] = embeddings.tolist()


Let's view the first few entries again, this time with an added column for the vector embedding that goes with the tweet.

In [None]:
df_tweets.head()

In the next block of code, we will insert each tweet and its associated vector from the Pandas DataFrame into InterSystems IRIS.

In [16]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df_tweets.iterrows():
            sql = text("""
                INSERT INTO financial_tweets 
                (note, note_vector) 
                VALUES (:note, TO_VECTOR(:note_vector))
            """)
            conn.execute(sql, {
                'note': row['note'], 
                'note_vector': str(row['note_vector'])
            })
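
The insert above passes each embedding to `TO_VECTOR` as a string, using Python's `str()` on a list, which yields a bracketed, comma-separated form. The sketch below makes that conversion explicit and checks the vector length against the 384 dimensions declared in the table. The helper `vector_to_string` is hypothetical and not part of the lab code:

```python
def vector_to_string(vector, expected_dim=384):
    """Format a list of floats as a bracketed, comma-separated string.

    Raises ValueError if the length does not match the expected dimension.
    """
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(vector)}")
    return "[" + ", ".join(str(float(v)) for v in vector) + "]"

# Tiny example with a 3-dimensional vector
example = vector_to_string([0.5, -0.25, 1.0], expected_dim=3)
print(example)
```

Catching a dimension mismatch in Python gives a clearer error message than waiting for the database to reject the row.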


Let's run a vector search! The block below will take a search phrase -- in this case, "covid effect" -- and convert it into a vector to be used in searching for similar content.

In [17]:
note_search = "covid effect"
search_vector = model.encode(note_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

Next, we will use the vector that was just created from the search phrase to find the three tweets whose vectors are most similar to the vector for our "covid effect" search.

In [18]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 3 * FROM financial_tweets
            ORDER BY VECTOR_DOT_PRODUCT(note_vector, TO_VECTOR(:search_vector)) DESC
        """)

        results = conn.execute(sql, {'search_vector': str(search_vector)}).fetchall()
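
The query ranks rows by `VECTOR_DOT_PRODUCT`. Since the embeddings were created with `normalize_embeddings=True`, every vector has unit length, and for unit vectors the dot product equals cosine similarity. The sketch below illustrates this in plain Python with made-up 2-D vectors. The helper names are hypothetical, and this is not a description of how IRIS computes the result internally:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity, computed without assuming unit length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def top_k(query, candidates, k=3):
    """Return indices of the k candidates with the highest dot product."""
    scores = [dot(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)[:k]

# Made-up 2-D vectors for illustration only
query = normalize([1.0, 0.0])
docs = [normalize(v) for v in ([0.0, 1.0], [1.0, 1.0], [2.0, 0.1], [-1.0, 0.0])]
ranking = top_k(query, docs, k=3)
print(ranking)
```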


Let's print the results.

In [None]:
print(results)

For an output that is a bit more readable, we can take the results and process them for better display using the block below.

In [None]:
# Name the columns after the table definition (note, note_vector), then drop the vector
results_df = pd.DataFrame(results, columns=['note', 'note_vector']).drop(columns='note_vector')
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()