## Vector Search with IRIS SQL
This tutorial covers how to use IRIS as a vector database. 

For this tutorial, we will use a dataset of 2.2k online reviews of scotch (
dataset from https://www.kaggle.com/datasets/koki25ando/22000-scotch-whisky-reviews) . With our latest vector database functionality, we can leverage the latest embedding models to run semantic search on the online reviews of scotch whiskeys. In addition, we'll be able to apply filters on columns with structured data. For example, we will be able to search for whiskeys that are priced under $100, and are 'earthy, smooth, and easy to drink'. Let's find our perfect whiskey!

In [109]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
import iris 

In [83]:
username = 'demo'
password = 'demo'
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '1972' 
namespace = 'USER'
CONNECTION_STRING = f"{hostname}:{port}/{namespace}"

In [84]:
# Note: Ideally conn and cursor should be used with context manager or with try-execpt-finally 
conn = iris.connect(CONNECTION_STRING, username, password)
cursor = conn.cursor()

In [85]:
# Load the CSV file
df = pd.read_csv('../data/scotch_review.csv')

In [86]:
df.head()

Unnamed: 0.1,Unnamed: 0,name,category,review.point,price,currency,description
0,1,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,$,"Magnificently powerful and intense. Caramels, ..."
1,2,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,$,What impresses me most is how this whisky evol...
2,3,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,$,There have been some legendary Bowmores from t...
3,4,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,$,With a name inspired by a 1926 Buster Keaton m...
4,5,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,$,"Captivating, enticing, and wonderfully charmin..."


In [87]:
# Clean data
# Remove the specified columns
df.drop(['currency'], axis=1, inplace=True)

# Drop the first column
df.drop(columns=df.columns[0], inplace=True)

# Remove rows without a price
df.dropna(subset=['price'], inplace=True)

# Ensure values in 'price' are numbers
df = df[pd.to_numeric(df['price'], errors='coerce').notna()]

# Replace NaN values in other columns with an empty string
df.fillna('', inplace=True)

In [88]:
df.head()

Unnamed: 0,name,category,review.point,price,description
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ..."
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin..."


Now, InterSystems IRIS supports vectors as a datatype in tables! Here, we create a table with a few different columns. The last column, 'description_vector', will be used to store vectors that are generated by passing the 'description' of a review through an embedding model.

In [89]:

sql = """
        CREATE TABLE scotch_reviews_dbapi (
name VARCHAR(255),
category VARCHAR(255),
review_point INT,
price DOUBLE,
description VARCHAR(2000),
description_vector VECTOR(DOUBLE, 384)
)
        """
result = cursor.execute(sql)

In [90]:
# sql = "DROP TABLE scotch_reviews_dbapi"
# result = cursor.execute(sql)

In [91]:
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
model = SentenceTransformer('all-MiniLM-L6-v2') 



In [92]:

# Generate embeddings for all descriptions at once. Batch processing makes it faster
embeddings = model.encode(df['description'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df['description_vector'] = embeddings.tolist()


In [93]:
df.head()

Unnamed: 0,name,category,review.point,price,description,description_vector
0,"Johnnie Walker Blue Label, 40%",Blended Scotch Whisky,97,225.0,"Magnificently powerful and intense. Caramels, ...","[-0.010494397953152657, 0.014729012735188007, ..."
1,"Black Bowmore, 1964 vintage, 42 year old, 40.5%",Single Malt Scotch,97,4500.0,What impresses me most is how this whisky evol...,"[0.023181220516562462, -0.051230352371931076, ..."
2,"Bowmore 46 year old (distilled 1964), 42.9%",Single Malt Scotch,97,13500.0,There have been some legendary Bowmores from t...,"[0.04333317279815674, -0.01706666499376297, -0..."
3,"Compass Box The General, 53.4%",Blended Malt Scotch Whisky,96,325.0,With a name inspired by a 1926 Buster Keaton m...,"[-0.07594005763530731, -0.036762338131666183, ..."
4,"Chivas Regal Ultis, 40%",Blended Malt Scotch Whisky,96,160.0,"Captivating, enticing, and wonderfully charmin...","[-0.0128188980743289, -0.0976979061961174, 0.0..."


In [100]:
# Prepare SQL query
sql = """
    INSERT INTO scotch_reviews_dbapi
    (name, category, review_point, price, description, description_vector) 
    VALUES (?, ?, ?, ?, ?, TO_VECTOR(?))
"""

# Iterate through DataFrame rows and execute insert for each row
for index, row in df.iterrows():
    # Prepare the parameters for each row
    params = [
        row['name'], 
        row['category'], 
        row['review.point'], 
        row['price'], 
        row['description'], 
        str(row['description_vector'])] # Convert to string if necessary
    
    
    # Execute the SQL statement for each row
    cursor.execute(sql, params)

Let's look for a scotch that costs less than $100, and has an earthy and creamy taste.

In [101]:
description_search = "earthy and creamy taste"
search_vector = model.encode(description_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

In [111]:
# Bug here
# Define the SQL query with placeholders for the vector and limit
sql = """
    SELECT TOP ? id, name, category, price, review_point, description
    FROM scotch_reviews_dbapi

"""
    # WHERE price < 100 
    # ORDER BY VECTOR_DOT_PRODUCT(description_vector, TO_VECTOR(?)) DESC
numberOfResults = 3

# Execute the query with the number of results and search vector as parameters
# cursor.execute(sql, [numberOfResults, str(search_vector)])
cursor.execute(sql, [numberOfResults])
# Fetch all results
results = cursor.fetchall()


TypeError: can only concatenate str (not "int") to str

In [105]:
print(results)

[[998, 'Signatory (distilled at Bowmore), 16 year old, 1988 vintage, cask #42508, 46%', 'Single Malt Scotch', 60.0, 87], [1564, 'Shieldaig 12 year old, 40%', 'Blended Scotch Whisky', 31.0, 85], [1182, 'The Arran Malt, Single Bourbon Cask, (Cask#1801), 1996 Vintage, 50.5%', 'Single Malt Scotch', 80.0, 86]]


In [98]:
results_df = pd.DataFrame(results, columns=df.columns).iloc[:, :-1] # Remove vector
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df.head()

Unnamed: 0,name,category,review.point,price,description
