## Vector Search with IRIS SQL
This tutorial covers how to use InterSystems IRIS as vector storage for the same set of financial tweets that we loaded and vectorized in steps 1A and/or 1B.

Begin by running the block of code below, which imports the necessary components to get started.

In [1]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

from dotenv import load_dotenv
load_dotenv(override=True)

  from .autonotebook import tqdm as notebook_tqdm


False

Next, we will set InterSystems IRIS-specific information such as username, password, the hostname and port of the InterSystems IRIS container in this lab, the namespace, and a connection string putting all of those elements together.

In [2]:
username = 'demo'
password = 'demo'
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '55665' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

Using the the connection string we just built, let's establish a connection to InterSystems IRIS.

In [3]:
engine = create_engine(CONNECTION_STRING)

### Load financial tweet data
Next, we will load the JSON file with financial tweets into a Pandas DataFrame that can be easily imported into InterSystems IRIS as a SQL table.

In [4]:
import pandas as pd

# Load JSONL file into DataFrame
file_path = './data/financial/tweets_all.jsonl'
df_tweets = pd.read_json(file_path, lines=True)


Let's display the first few rows of our DataFrame by running the line below.

In [5]:
df_tweets.head()

Unnamed: 0,note,sentiment,url
0,$BYND - JPMorgan reels in expectations on Beyo...,2,https://huggingface.co/datasets/zeroshot/twitt...
1,$CCL $RCL - Nomura points to bookings weakness...,2,https://huggingface.co/datasets/zeroshot/twitt...
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan ...",2,https://huggingface.co/datasets/zeroshot/twitt...
3,$ESS: BTIG Research cuts to Neutral https://t....,2,https://huggingface.co/datasets/zeroshot/twitt...
4,$FNKO - Funko slides after Piper Jaffray PT cu...,2,https://huggingface.co/datasets/zeroshot/twitt...


With the new release of InterSystems IRIS vector search capability, InterSystems IRIS supports vectors as a datatype in tables! In the block below, we will create a table with a few different columns. The last column, 'note_vector', will be used to store vectors that are generated by passing the 'note' of a tweet through an embedding model.

In [28]:
with engine.connect() as conn:
    with conn.begin():# Load 
        sql = f"""
                DROP TABLE IF EXISTS financial_tweets
            """
        result = conn.execute(text(sql))

In [29]:

with engine.connect() as conn:
    with conn.begin():# Load 
        sql = f"""
                CREATE TABLE financial_tweets (
        note VARCHAR(255),
        sentiment INTEGER,
        note_vector VECTOR(DOUBLE, 384)
        )
                """
        result = conn.execute(text(sql))

Next, let's load a pre-trained sentence transformer model. This model's output vectors are of size 384. We will use this model to create vector embeddings for each financial tweet in our data set.

In [11]:
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
model = SentenceTransformer('all-MiniLM-L6-v2') 

Using the sentence transformer above, we will create embeddings for all of the financial tweets in the data set.

In [14]:

# Generate embeddings for all tweets at once. Batch processing makes it faster
embeddings = model.encode(df_tweets['note'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df_tweets['note_vector'] = embeddings.tolist()


Let's view the first few entries again, this time with an added column for the vector embedding that goes with the tweet.

In [15]:
df_tweets.head()

Unnamed: 0,note,sentiment,url,note_vector
0,$BYND - JPMorgan reels in expectations on Beyo...,2,https://huggingface.co/datasets/zeroshot/twitt...,"[-0.13631078600883484, 0.026333356276154518, -..."
1,$CCL $RCL - Nomura points to bookings weakness...,2,https://huggingface.co/datasets/zeroshot/twitt...,"[-0.033777981996536255, 0.06702922284603119, -..."
2,"$CX - Cemex cut at Credit Suisse, J.P. Morgan ...",2,https://huggingface.co/datasets/zeroshot/twitt...,"[-0.08540519326925278, 0.04619771987199783, 0...."
3,$ESS: BTIG Research cuts to Neutral https://t....,2,https://huggingface.co/datasets/zeroshot/twitt...,"[-0.13111060857772827, 0.03535114973783493, 0...."
4,$FNKO - Funko slides after Piper Jaffray PT cu...,2,https://huggingface.co/datasets/zeroshot/twitt...,"[-0.0776449665427208, 0.055340882390737534, -0..."


In the next block of code, we will insert each tweet and its associated vector from the Pandas DataFrame into InterSystems IRIS.

In [30]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df_tweets.iterrows():
            sql = text("""
                INSERT INTO financial_tweets 
                (note, sentiment, note_vector) 
                VALUES (:note, :sentiment, TO_VECTOR(:note_vector))
            """)
            conn.execute(sql, {
                'note': row['note'], 
                'sentiment': row['sentiment'],
                'note_vector': str(row['note_vector'])
            })


Let's run a vector search! The block below will take a search phrase -- in this case, "covid effect" -- and convert it into a vector to be used in searching for similar content.

In [23]:
note_search = "covid effect"
search_vector = model.encode(note_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

Next, we will use the vector that was just created based on the search phrase and find the top three vectors that are closest in similarity to the vector for our "covid effect" search.

We are also specifying that the "sentiment" field be equal to 1, which in this dataset refers to "positive sentiment". This ability to use additional data to filter, directly in SQL, is a unique capability of InterSystems IRIS and how we have implemented our vector search functionality.

In [33]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 3 * FROM financial_tweets
            WHERE sentiment = 1
            ORDER BY VECTOR_DOT_PRODUCT(note_vector, TO_VECTOR(:search_vector)) DESC
        """)

        results = conn.execute(sql, {'search_vector': str(search_vector)}).fetchall()


Let's print the results.

In [34]:
print(results)

[('$NVDA - Nvidia set for gaming tailwinds - BofA https://t.co/l3m78pJzrW', 1, 'https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment', '-.020523741841316223144,.050246082246303558349,-.050210062414407730102,-.042630381882190704346,-.019571455195546150207,.0044612870551645755767,.02922 ... (8836 characters truncated) ... 79621,-.0025744738522917032241,-.042040660977363586426,.0043053296394646167756,-.062128648161888122558,-.020193470641970634461,.091304995119571685791'), ("Morgan Stanley upgrades Nvidia to buy, predicting 2020 will be 'a return to solid growth' https://t.co/9gTGxKbSGj", 1, 'https://huggingface.co/datasets/zeroshot/twitter-financial-news-sentiment', '-.074603758752346038818,-.046326998621225357056,-.0060204053297638893127,.030615020543336868286,.0050287107005715370178,.0047806329093873500823,-.047 ... (8826 characters truncated) ... 20913696,-.083879895508289337158,.042892597615718841552,.037991441786289215087,-.13488604128360748291,-.001465251552872

For an output that is a bit more readable, we can take the results and process them for better display using the block below.

In [26]:
results_df = pd.DataFrame(results, columns=df_tweets.columns).iloc[:, :-1] # Remove vector
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()

Unnamed: 0,note,sentiment,url
0,New developments added to @FedFRASER's COVID-19 timeline in the latest week: second historic rise in unemployment i… https://t.co/o4yYfNRbhA,,
1,Central banks must evolve to help governments fight coronavirus https://t.co/mfSJuTKUDm,,
2,Luckin Coffee and Yum China hit again by coronavirus anxiety,,
