## 1. Embed Data Manually Using InterSystems IRIS SQL
In this section, you will use InterSystems IRIS as vector storage for a data set that contains 1,000 tweets about financial news and analysis. You will begin by loading and viewing this data, and then you will generate vector embeddings for each tweet. By generating embeddings, you will be able to run some simple vector searches to return relevant information based on a search string.

Let's begin by running the block of code below, which imports the necessary components to get started. This includes the *sentence_transformers* library that will be used to generate embeddings for this data.

In [None]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

from dotenv import load_dotenv
load_dotenv(override=True)

Next, we will set InterSystems IRIS-specific information such as username, password, the hostname and port of the InterSystems IRIS container in this lab, the namespace, and a connection string putting all of those elements together.

In [None]:
username = '_SYSTEM'
password = 'SYS'
hostname = 'iris'
port = 1972
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"

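Since `load_dotenv` was called in the first cell, these settings could also be read from environment variables instead of being hard-coded. The sketch below is one possible way to do that, not part of the lab itself: the variable names (`IRIS_USERNAME`, `IRIS_PASSWORD`, and so on) and the helper `build_connection_string` are hypothetical, and the defaults simply mirror the values above. It also percent-encodes the credentials so that characters such as `@` or `:` in a password do not break the URL.

```python
import os
from urllib.parse import quote_plus

def build_connection_string(env=os.environ):
    """Build an iris:// URL from environment-style settings.

    The variable names are hypothetical; adjust them to match your .env file.
    """
    username = env.get("IRIS_USERNAME", "_SYSTEM")
    password = env.get("IRIS_PASSWORD", "SYS")
    hostname = env.get("IRIS_HOSTNAME", "iris")
    port = int(env.get("IRIS_PORT", 1972))
    namespace = env.get("IRIS_NAMESPACE", "USER")
    # Percent-encode credentials so special characters stay unambiguous in the URL
    return f"iris://{quote_plus(username)}:{quote_plus(password)}@{hostname}:{port}/{namespace}"

# With no overrides, this reproduces the connection string built above
print(build_connection_string({}))
```

Passing a plain dictionary, as in the last line, makes the helper easy to try without touching the real environment.
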
Using the connection string we just built, let's establish a connection to InterSystems IRIS.

In [None]:
engine = create_engine(CONNECTION_STRING)

### Load data set with 1,000 financial tweets
Next, we will load the JSON file with 1,000 tweets into a Pandas DataFrame that can be easily imported into InterSystems IRIS as a SQL table. A Pandas DataFrame is a powerful data structure that allows for efficient data manipulation and analysis. It provides a convenient way to handle and preprocess data, making it easy to clean, transform, and organize the data into a structured format suitable for SQL operations.

By using a DataFrame, we can leverage Pandas' robust functionality to ensure the data is correctly formatted and ready for seamless integration into InterSystems IRIS.

In [None]:
import pandas as pd

# Load JSONL file into DataFrame
file_path = './data/financial/tweets_all.jsonl'
df_tweets = pd.read_json(file_path, lines=True)
pd.set_option('display.max_rows', 1000)

Let's display the entire set of tweets to get a comprehensive view of the data by running the line below. This will help you understand the structure and content of the dataset before we proceed. Scroll through the data and see the types of tweets that exist, noting some of the companies referenced in the tweets.

In [None]:
df_tweets

With the release of InterSystems IRIS vector search capability, InterSystems IRIS supports vectors as a datatype in tables! In the next block, we will create a new table in InterSystems IRIS for our data to be loaded into: the *financial_tweets_sql* table. This table has columns for *note*, *sentiment*, and *note_vector*. The *note_vector* column will be used to store a vector embedding for each tweet in the data set.

In [None]:

with engine.connect() as conn:
    with conn.begin():
        # Create the table that holds each tweet, its sentiment label, and its embedding
        sql = """
            CREATE TABLE financial_tweets_sql (
                note VARCHAR(255),
                sentiment INTEGER,
                note_vector VECTOR(DOUBLE, 384)
            )
        """
        result = conn.execute(text(sql))

### Create vector embeddings using a sentence transformer
Before we load the tweets into the *financial_tweets_sql* table in InterSystems IRIS, we will first create the vector embedding that goes with each tweet. Vector embeddings are numerical representations of text that capture its semantic meaning, making it easier to perform tasks like similarity search, clustering, and classification. To generate these embeddings, we will use a pre-trained sentence transformer model.

Sentence transformers are a type of model designed to create dense vector representations of sentences, which can be used for various natural language processing tasks. These models are trained on large datasets and fine-tuned to understand the context and semantics of sentences. The specific model we will use is *"all-MiniLM-L6-v2"*, a lightweight and efficient transformer model. This model produces output vectors of size 384, providing a compact yet powerful representation of the tweets.

In [None]:
model = SentenceTransformer('all-MiniLM-L6-v2') 

Using this sentence transformer, let's create embeddings for all of the financial tweets in the data set and add them to the Pandas DataFrame we created earlier.

In [None]:

# Generate embeddings for all tweets in a single call, which is faster than encoding them one at a time
embeddings = model.encode(df_tweets['note'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df_tweets['note_vector'] = embeddings.tolist()

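The `normalize_embeddings=True` argument scales each embedding to unit length. As a minimal sketch of why that matters, the pure-Python example below uses toy two-dimensional vectors (not real model output) and illustrative helper names to show that, for unit-length vectors, the dot product and cosine similarity give the same value.

```python
import math

def l2_normalize(vec):
    """Scale a vector so that its L2 norm (Euclidean length) is 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity: the dot product divided by the product of the lengths."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for embeddings
a, b = [3.0, 4.0], [1.0, 0.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

print("dot product of normalized vectors:", dot(a_unit, b_unit))
print("cosine similarity of raw vectors: ", cosine(a, b))
```

Both printed values agree, which is why a dot-product ranking over normalized embeddings orders results the same way a cosine-similarity ranking would.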

Let's explore the *df_tweets* dataframe again, this time using just the *head* call to see the first 100 entries in the data set. Notice the addition of the vector embeddings in the newly added *note_vector* column.

In [None]:
pd.set_option('display.max_colwidth', 200)
df_tweets.head(100)

Next, let's load this into InterSystems IRIS by inserting each tweet and its associated vector from the Pandas DataFrame into the *financial_tweets_sql* table we created earlier.

In [None]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df_tweets.iterrows():
            sql = text("""
                INSERT INTO financial_tweets_sql 
                (note, sentiment, note_vector) 
                VALUES (:note, :sentiment, TO_VECTOR(:note_vector))
            """)
            conn.execute(sql, {
                'note': row['note'], 
                'sentiment': row['sentiment'],
                'note_vector': str(row['note_vector'])
            })

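Before inserting, it can help to confirm that each row fits the table definition. The sketch below is an optional addition rather than part of the lab: the helper `to_insert_params` and the two constants are hypothetical names, and the limits are taken from the `CREATE TABLE` statement above (`VARCHAR(255)` and `VECTOR(DOUBLE, 384)`). It builds the same parameter dictionary the loop above passes to `conn.execute`, converting the sentiment to a plain Python `int` on the assumption that some drivers may not accept NumPy integer types.

```python
EXPECTED_DIM = 384   # matches VECTOR(DOUBLE, 384) in the table definition
MAX_NOTE_LEN = 255   # matches VARCHAR(255) in the table definition

def to_insert_params(note, sentiment, vector):
    """Validate one row and return the parameter dict for the INSERT statement."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(f"expected {EXPECTED_DIM} dimensions, got {len(vector)}")
    if len(note) > MAX_NOTE_LEN:
        raise ValueError(f"note has {len(note)} characters; the column allows {MAX_NOTE_LEN}")
    return {
        "note": note,
        "sentiment": int(sentiment),
        # str() of a list yields "[v1, v2, ...]", the same string form the loop above sends to TO_VECTOR
        "note_vector": str([float(v) for v in vector]),
    }

# Example with a placeholder vector of the right size
params = to_insert_params("Example tweet text", 1, [0.0] * EXPECTED_DIM)
print(params["note"], params["sentiment"], params["note_vector"][:15])
```

In the insert loop, a call such as `to_insert_params(row['note'], row['sentiment'], row['note_vector'])` could replace the inline dictionary, so a malformed row raises a clear error before it reaches the database.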

### Run a vector search
With tweets loaded into InterSystems IRIS and vector embeddings stored alongside each tweet, let's run a vector search!

The block below will take a search phrase -- in this case, "Beyond Meat" -- and convert it into a vector to be used in searching for similar content.

In [None]:
note_search = "Beyond Meat"
search_vector = model.encode(note_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

Next, we will use the vector that was just created from the "Beyond Meat" search phrase to find the three stored vectors that are most similar to it. In this case, we are using the dot product of the vectors to measure similarity; other methods include cosine similarity and Euclidean distance. Because the embeddings were normalized to unit length when they were generated, the dot product here gives the same ranking as cosine similarity.

Notice that we are also specifying that the *sentiment* field should be equal to 1, which refers to "positive sentiment" in this set of tweets. Because vector search in InterSystems IRIS is implemented directly in SQL, similarity ranking can be combined with ordinary filters on other columns in a single query.

Run the block below to return the three "positive sentiment" tweets that our vector search indicates are most similar to the "Beyond Meat" search phrase.

In [None]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 3 * FROM financial_tweets_sql
            WHERE sentiment = 1
            ORDER BY VECTOR_DOT_PRODUCT(note_vector, TO_VECTOR(:search_vector)) DESC
        """)

        results = conn.execute(sql, {'search_vector': str(search_vector)}).fetchall()

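To make the query's logic concrete, here is a brute-force, in-memory sketch of the same three steps: filter by sentiment, score by dot product, and keep the best matches. It is only an illustration of what the SQL expresses, not how InterSystems IRIS executes it; the function name and the toy two-dimensional rows are made up for the example.

```python
def top_k_by_dot_product(rows, query, k=3, sentiment=1):
    """Filter rows by sentiment, rank them by dot product with the query, keep the k best."""
    # WHERE sentiment = 1
    candidates = [r for r in rows if r["sentiment"] == sentiment]
    # ORDER BY VECTOR_DOT_PRODUCT(...) DESC
    ranked = sorted(
        candidates,
        key=lambda r: sum(x * y for x, y in zip(r["note_vector"], query)),
        reverse=True,
    )
    # TOP k
    return ranked[:k]

# Toy rows standing in for the table contents
rows = [
    {"note": "A", "sentiment": 1, "note_vector": [1.0, 0.0]},
    {"note": "B", "sentiment": 0, "note_vector": [1.0, 0.0]},
    {"note": "C", "sentiment": 1, "note_vector": [0.0, 1.0]},
    {"note": "D", "sentiment": 1, "note_vector": [0.6, 0.8]},
]

for row in top_k_by_dot_product(rows, query=[1.0, 0.0], k=2):
    print(row["note"])
```

Row B matches the query vector exactly but is excluded by the sentiment filter, mirroring how the `WHERE` clause narrows the candidates before they are ranked.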

Let's print the results using the line below.

In [None]:
print(results)

For an output that is a bit more readable, we can take the results and process them for better display using the block below.

In [None]:
results_df = pd.DataFrame(results, columns=df_tweets.columns).iloc[:, :-1] # Remove vector
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()