Adapted from https://github.com/intersystems-community/iris-vector-search/blob/main/demo/sql_demo.ipynb

## Vector Search with IRIS SQL (and **dc.embedding**)
This tutorial covers how to use IRIS as a vector database. 

For this tutorial, we will use a dataset of 2.2k online reviews of Scotch whisky (from https://www.kaggle.com/datasets/koki25ando/22000-scotch-whisky-reviews). With the IRIS vector database functionality, we can use embedding models to run semantic search over the review text. In addition, we can apply filters on columns with structured data. For example, we can search for whiskies that are priced under $100 and are 'earthy, smooth, and easy to drink'. Let's find our perfect whisky!

In [1]:
!pip install -q pandas sqlalchemy-iris

In [2]:
import time

start = time.time()

In [3]:
import os, pandas as pd
# -- this line is no longer necessary: embeddings are generated in SQL by dc.embedding
# from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

In [4]:
username = '_system'
password = 'SYS'
hostname = 'sql-embeddings-iris-1'
port = '1972' 
namespace = 'IRISAPP'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
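
The connection string above works as written because the default credentials contain no reserved URL characters. As a minimal sketch, assuming a password that does contain characters such as `@` or `/`, the standard-library function `urllib.parse.quote_plus` could be used to escape the credentials before they are placed in the URL. The helper name `build_iris_url` and the sample password are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus

def build_iris_url(username, password, hostname, port, namespace):
    """Build an iris:// URL, percent-encoding the credentials (hypothetical helper)."""
    user = quote_plus(username)
    pwd = quote_plus(password)
    return f"iris://{user}:{pwd}@{hostname}:{port}/{namespace}"

# Same values as the cell above: nothing needs escaping, so the URL is unchanged
default_url = build_iris_url('_system', 'SYS', 'sql-embeddings-iris-1', '1972', 'IRISAPP')

# Made-up password containing reserved characters, for illustration only
escaped_url = build_iris_url('_system', 'p@ss/word', 'sql-embeddings-iris-1', '1972', 'IRISAPP')

print(default_url)
print(escaped_url)
```

Either string could then be passed to `create_engine` in the same way as `CONNECTION_STRING`.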

In [5]:
engine = create_engine(CONNECTION_STRING)

In [6]:
# Load the CSV file
df = pd.read_csv('https://raw.githubusercontent.com/intersystems-community/iris-vector-search/refs/heads/main/data/scotch_review.csv')
df = df.head(100)

In [7]:
df.head()

Unnamed: 0.1,Unnamed: 0,name,category,review.point,price,currency,description
0,1,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,$,"Magnificently powerful and intense. Caramels, ..."
1,2,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,$,What impresses me most is how this whisky evol...
2,3,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,$,There have been some legendary Bowmores from t...
3,4,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,$,With a name inspired by a 1926 Buster Keaton m...
4,5,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,$,"Captivating, enticing, and wonderfully charmin..."


In [8]:
# Clean data
# Remove the specified columns
df.drop(['currency'], axis=1, inplace=True)

# Drop the first column
df.drop(columns=df.columns[0], inplace=True)

# Remove rows without a price
df.dropna(subset=['price'], inplace=True)

# Ensure values in 'price' are numbers
# (.copy() avoids modifying a view of the original frame in the next step)
df = df[pd.to_numeric(df['price'], errors='coerce').notna()].copy()

# Replace NaN values in other columns with an empty string
df.fillna('', inplace=True)

In [9]:
df.head()

Unnamed: 0,name,category,review.point,price,description
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ..."
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin..."


InterSystems IRIS now supports vectors as a data type in tables! Here, we create a table with a few different columns. The last column, 'description_vector', stores the vectors generated by passing the 'description' of each review through an embedding model.

In [10]:
with engine.connect() as conn:
    with conn.begin():
        # Drop the table if a previous run already created it
        sql = "DROP TABLE IF EXISTS scotch_reviews"
        result = conn.execute(text(sql))

In [11]:
with engine.connect() as conn:
    with conn.begin():
        # Empty the dc_musketeersbr_sqlembeddings.Cache table
        # (judging by its name, a cache used by dc.embedding)
        sql = "TRUNCATE TABLE dc_musketeersbr_sqlembeddings.Cache"
        result = conn.execute(text(sql))

In [12]:

with engine.connect() as conn:
    with conn.begin():
        # Create the table; description_vector holds 384-dimensional embeddings
        sql = """
            CREATE TABLE scotch_reviews (
                name VARCHAR(255),
                category VARCHAR(255),
                review_point INT,
                price DOUBLE,
                description VARCHAR(2000),
                description_vector VECTOR(DOUBLE, 384)
            )
        """
        result = conn.execute(text(sql))

In [13]:
# -- this line is no longer necessary
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
# model = SentenceTransformer('all-MiniLM-L6-v2') 

In [14]:
# -- this block is no longer necessary

# # Generate embeddings for all descriptions at once. Batch processing makes it faster
# embeddings = model.encode(df['description'].tolist(), normalize_embeddings=True)

# # Add the embeddings to the DataFrame
# df['description_vector'] = embeddings.tolist()

df['description_vector'] = None

In [15]:
df.head()

Unnamed: 0,name,category,review.point,price,description,description_vector
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ...",
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...,
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...,
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...,
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin...",


In [16]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df.iterrows():
            sql = text("""
                INSERT INTO scotch_reviews 
                (name, category, review_point, price, description, description_vector) 
                VALUES (:name, :category, :review_point, :price, :description, dc.embedding(:description, 'fastembed/BAAI/bge-small-en-v1.5'))
            """)
            conn.execute(sql, {
                'name': row['name'], 
                'category': row['category'], 
                'review_point': row['review.point'], 
                'price': row['price'], 
                'description': row['description']
            })
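
The loop above issues one `INSERT` per row. A minimal sketch of an alternative, assuming the rows are available as plain dictionaries (for example via `df.to_dict('records')`): build the full list of parameter dictionaries first. SQLAlchemy's `Connection.execute` accepts such a list and runs the statement once per dictionary; whether the IRIS driver handles that efficiently with `dc.embedding` in the `VALUES` clause has not been verified here. The helper `to_insert_params` and the sample records are hypothetical.

```python
def to_insert_params(records):
    """Map raw review records to the bind-parameter names used in the INSERT statement."""
    return [
        {
            'name': rec['name'],
            'category': rec['category'],
            'review_point': int(rec['review.point']),  # the CSV column name uses a dot
            'price': float(rec['price']),
            'description': rec['description'],
        }
        for rec in records
    ]

# Two made-up records shaped like rows of the cleaned DataFrame
sample_records = [
    {'name': 'Example Malt A', 'category': 'Single Malt Scotch',
     'review.point': 90, 'price': 55.0, 'description': 'Smoky and sweet.'},
    {'name': 'Example Blend B', 'category': 'Blended Scotch Whisky',
     'review.point': 88, 'price': '40', 'description': 'Light and floral.'},
]

params = to_insert_params(sample_records)
print(params[1]['price'])

# With a live connection, the whole list could be passed in one call:
# conn.execute(sql, params)
```

Converting `price` and `review_point` up front also surfaces malformed values before any statement reaches the database.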


Let's look for a scotch that costs less than $100 and has an earthy and creamy taste.

In [17]:
description_search = "earthy and creamy taste"
# -- this line is no longer necessary: dc.embedding converts the search phrase inside the query
# search_vector = model.encode(description_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

In [18]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 3 * FROM scotch_reviews 
            WHERE price < 100 
            ORDER BY VECTOR_DOT_PRODUCT(description_vector, dc.embedding(:description_search)) DESC
        """)

        results = conn.execute(sql, {'description_search': str(description_search)}).fetchall()
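
The query above combines a structured filter (`price < 100`) with a similarity ranking (`VECTOR_DOT_PRODUCT ... DESC`). As a rough illustration of that logic outside the database, the sketch below filters and ranks a few made-up 3-dimensional vectors in plain Python. The vectors, the names, and the helper `top_k_by_dot_product` are invented for illustration; the real embeddings have 384 dimensions. Note that the dot product equals cosine similarity only for unit-length vectors, which is assumed rather than checked for the model used in this notebook.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def top_k_by_dot_product(rows, query_vector, max_price, k=3):
    """Keep rows under max_price, then return the k rows most similar to query_vector."""
    candidates = [r for r in rows if r['price'] < max_price]
    ranked = sorted(candidates, key=lambda r: dot(r['vector'], query_vector), reverse=True)
    return ranked[:k]

# Made-up rows with unit-length vectors
rows = [
    {'name': 'A', 'price': 60.0, 'vector': [0.8, 0.6, 0.0]},
    {'name': 'B', 'price': 250.0, 'vector': [0.0, 1.0, 0.0]},  # best match, but too expensive
    {'name': 'C', 'price': 45.0, 'vector': [0.6, 0.8, 0.0]},
    {'name': 'D', 'price': 90.0, 'vector': [0.0, 0.0, 1.0]},
]
query = [0.0, 1.0, 0.0]

best = top_k_by_dot_product(rows, query, max_price=100, k=2)
best_names = [r['name'] for r in best]
print(best_names)
```

Row `B` has the highest dot product with the query but is excluded by the price filter, which mirrors how the `WHERE` clause restricts the candidates before the `ORDER BY` ranks them.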


In [19]:
print(results)

[('Compass Box The Peat Monster, 46%', 'Blended Malt Scotch Whisky', 94, 60.0, "The formula for this whisky has changed slightly since its inception -- and I think for the better. They've added some Laphroaig into the mix of Caol ... (408 characters truncated) ... uit add complexity. Long, warming finish. Amazing how a small change in composition can significantly benefit the overall flavor profile of a whisky.", '-.020112115889787673951,.023213732987642288208,-.0077229822054505348206,.0027969358488917350769,.018323017284274101257,.019524058327078819274,-.02228 ... (8780 characters truncated) ... 241,-.0035571814514696598052,-.0051112188957631587982,.024038551375269889831,-.020489530637860298156,.038185123354196548461,-.017557691782712936401,0'), ('Bowmore, 16 year old, 1989 vintage, 51.8%', 'Single Malt Scotch', 93, 90.0, 'No frills here, just pure, unadulterated Bowmore. This Islay whisky speaks of its location in a very pure and natural way. I find invigorating brine, ... (50 charac

In [20]:
# SELECT * returns the six table columns in the same order as df's columns,
# so df.columns can be reused as-is (adding a list to it would concatenate strings element-wise)
results_df = pd.DataFrame(results, columns=df.columns).iloc[:, :-1] # Remove vector
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()

Unnamed: 0,name,category,review.point,price,description
0,"Compass Box The Peat Monster, 46%",Blended Malt Scotch Whisky,94,60.0,"The formula for this whisky has changed slightly since its inception -- and I think for the better. They've added some Laphroaig into the mix of Caol Ila and Ardmore. This whisky demonstrates the layered complexity that can be achieved by marrying whisky from different distilleries and different regions. I particularly enjoy the rich maltiness and oily texture that provide firm bedding and flavor contrast to the classic Islay notes: tar, boat docks, brine, smoked olive, seaweed, and kiln ash. More subtle cracked peppercorn, mustard seed, and citrus fruit add complexity. Long, warming finish. Amazing how a small change in composition can significantly benefit the overall flavor profile of a whisky."
1,"Bowmore, 16 year old, 1989 vintage, 51.8%",Single Malt Scotch,93,90.0,"No frills here, just pure, unadulterated Bowmore. This Islay whisky speaks of its location in a very pure and natural way. I find invigorating brine, seaweed, green olive, and fishnets, along with the classic Bowmore peat smoke. All these flavors are softened by gentle vanilla and honeyed malt, while background tropical fruit add complexity. \r\n"
2,"Ardbeg, 10 year old, 46%",Single Malt Scotch,93,55.0,"Straw-gold color. On the nose, sweet toffee, citrus notes, seaweed, and spice complement a powerful peat smoke infusion. In body, it is thick and oily. On the palate, a somewhat sweet maltiness up front is run over by a powerful peat smoke locomotive. Again, the whisky is enriched with citrus and pear notes, spice, and seaweed. The finish is powerful, long, and warming. The smoke lingers for minutes, if not hours. If you like your Ardbeg to go to a phenolic extreme, you will cherish this one. This big, powerful whisky makes no apologies for its Islay roots. And the fact that this whisky is bottled at 46% ABV just makes this big whisky even bigger."


## Elapsed time

In [21]:
done = time.time()
elapsed = done - start
print(f'elapsed time: {elapsed:.3f}')

elapsed time: 9.816
