## Vector Search with IRIS SQL
This tutorial covers how to use IRIS as a vector database. 

For this tutorial, we will use a dataset of landmarks, each with coordinates, an image URL, and the text of its Wikipedia article. With IRIS vector search, we can use an embedding model to run semantic search over the article text. In addition, we can apply filters on columns with structured data. For example, we can restrict results by latitude and then rank the remaining landmarks by how closely they match a phrase such as 'Best monuments in Barcelona'.

In [12]:
import os, pandas as pd
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine, text

In [13]:
username = 'demo'
password = 'demo'
hostname = os.getenv('IRIS_HOSTNAME', 'localhost')
port = '1972' 
namespace = 'USER'
CONNECTION_STRING = f"iris://{username}:{password}@{hostname}:{port}/{namespace}"
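
The connection string above works because 'demo' contains no reserved URL characters. As a minimal sketch, assuming a password that does contain characters such as '@' or '/', the credentials can be percent-encoded before they are placed in the URL. The helper name `build_iris_url` and the sample password are hypothetical and not part of the original notebook.

```python
from urllib.parse import quote_plus


def build_iris_url(username, password, hostname, port, namespace):
    """Build an iris:// URL, percent-encoding the credentials."""
    return (
        f"iris://{quote_plus(username)}:{quote_plus(password)}"
        f"@{hostname}:{port}/{namespace}"
    )


# Hypothetical credentials, for illustration only
url = build_iris_url("demo", "p@ss/word", "localhost", "1972", "USER")
print(url)
```

The resulting string can be passed to `create_engine` in the same way as `CONNECTION_STRING`.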

In [14]:
engine = create_engine(CONNECTION_STRING)

In [15]:
# Load the CSV file
df = pd.read_csv('./data/data.csv')

In [16]:
df.head()

Unnamed: 0,landmark,latitude,longitude,wiki_title,image_url,wiki_content
0,"Eiffel Tower, Paris, France",48.85826,2.294501,Eiffel Tower,https://upload.wikimedia.org/wikipedia/commons...,The Eiffel Tower ( EYE-fəl; French: Tour Eiffe...
1,"Colosseum, Rome, Italy",41.890261,12.493087,Colosseum,https://upload.wikimedia.org/wikipedia/commons...,The Colosseum ( KOL-ə-SEE-əm; Italian: Colosse...
2,"Sagrada Familia, Barcelona, Spain",41.403505,2.174428,Sagrada Família,https://upload.wikimedia.org/wikipedia/commons...,The Basílica i Temple Expiatori de la Sagrada ...
3,"Acropolis of Athens, Greece",37.971689,23.72632,Acropolis of Athens,https://upload.wikimedia.org/wikipedia/commons...,The Acropolis of Athens (Ancient Greek: ἡ Ἀκρό...
4,"Saint Basil's Cathedral, Moscow, Russia",55.752474,37.623162,Saint Basil's Cathedral,https://upload.wikimedia.org/wikipedia/commons...,The Cathedral of Vasily the Blessed (Russian: ...


In [17]:

# Replace NaN values with an empty string
df.fillna('', inplace=True)

In [18]:
df.head()

Unnamed: 0,landmark,latitude,longitude,wiki_title,image_url,wiki_content
0,"Eiffel Tower, Paris, France",48.85826,2.294501,Eiffel Tower,https://upload.wikimedia.org/wikipedia/commons...,The Eiffel Tower ( EYE-fəl; French: Tour Eiffe...
1,"Colosseum, Rome, Italy",41.890261,12.493087,Colosseum,https://upload.wikimedia.org/wikipedia/commons...,The Colosseum ( KOL-ə-SEE-əm; Italian: Colosse...
2,"Sagrada Familia, Barcelona, Spain",41.403505,2.174428,Sagrada Família,https://upload.wikimedia.org/wikipedia/commons...,The Basílica i Temple Expiatori de la Sagrada ...
3,"Acropolis of Athens, Greece",37.971689,23.72632,Acropolis of Athens,https://upload.wikimedia.org/wikipedia/commons...,The Acropolis of Athens (Ancient Greek: ἡ Ἀκρό...
4,"Saint Basil's Cathedral, Moscow, Russia",55.752474,37.623162,Saint Basil's Cathedral,https://upload.wikimedia.org/wikipedia/commons...,The Cathedral of Vasily the Blessed (Russian: ...


InterSystems IRIS supports vectors as a datatype in tables. Here, we create a table with one column per field in the dataset. The last column, 'description_vector', stores the vectors generated by passing each landmark's 'wiki_content' through an embedding model.

In [27]:

with engine.connect() as conn:
    with conn.begin():
        # The vector length (384) must match the embedding model's output size
        sql = """
            CREATE TABLE monuments (
                landmark VARCHAR(255),
                latitude DOUBLE,
                longitude DOUBLE,
                wiki_title VARCHAR(255),
                image_url VARCHAR(20000),
                wiki_content VARCHAR(20000),
                description_vector VECTOR(DOUBLE, 384)
            )
        """
        result = conn.execute(text(sql))
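
Because the column is declared with a fixed length of 384, it may help to check each vector before it is sent to the database. The following is a minimal sketch of such a check; the helper `to_vector_literal` is hypothetical, and it assumes that `TO_VECTOR` accepts a comma-separated string of numbers, which is consistent with the string input this notebook passes to it.

```python
import math


def to_vector_literal(values, expected_dim=384):
    """Serialize a sequence of floats as a comma-separated string.

    Raises ValueError if the length differs from expected_dim or if any
    value is NaN or infinite.
    """
    values = list(values)
    if len(values) != expected_dim:
        raise ValueError(f"expected {expected_dim} values, got {len(values)}")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinite values")
    return ",".join(repr(float(v)) for v in values)


# Toy three-dimensional vector, for illustration only
print(to_vector_literal([0.5, -0.25, 1.0], expected_dim=3))
```

A length mismatch is then reported in Python, before any INSERT is attempted.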

In [28]:
# Load a pre-trained sentence transformer model. This model's output vectors are of size 384
model = SentenceTransformer('all-MiniLM-L6-v2') 

In [29]:

# Generate embeddings for all descriptions at once. Batch processing makes it faster
embeddings = model.encode(df['wiki_content'].tolist(), normalize_embeddings=True)

# Add the embeddings to the DataFrame
df['description_vector'] = embeddings.tolist()
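
The `normalize_embeddings=True` argument scales each embedding to unit length. As a rough illustration in plain Python, using made-up two-dimensional vectors rather than real embeddings, the sketch below shows that once two vectors are normalized, their dot product equals their cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale vec to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, for illustration only
a = [3.0, 4.0]
b = [4.0, 3.0]
ua, ub = l2_normalize(a), l2_normalize(b)

unit_length = math.isclose(dot(ua, ua), 1.0)
same_score = math.isclose(dot(ua, ub), cosine(a, b))
print(unit_length, same_score)
```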


In [30]:
len(df['description_vector'][0])

384

In [31]:
with engine.connect() as conn:
    with conn.begin():
        for index, row in df.iterrows():
            sql = text("""
                INSERT INTO monuments 
                (landmark, latitude, longitude, wiki_title, image_url, wiki_content, description_vector) 
                VALUES (:landmark, :latitude, :longitude, :wiki_title, :image_url, :wiki_content, TO_VECTOR(:description_vector))
            """)
            conn.execute(sql, {
                'landmark': row['landmark'], 
                'latitude': row['latitude'], 
                'longitude': row['longitude'], 
                'wiki_title': row['wiki_title'], 
                'image_url': row['image_url'], 
                'wiki_content': row['wiki_content'], 
                'description_vector': str(row['description_vector'])
            })


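Let's look for landmarks that match the phrase 'Best monuments in Barcelona'. The query below also filters on the structured 'latitude' column; the threshold of 5000 is a placeholder that matches every row, so replace it with a real bound to narrow the results.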

In [36]:
description_search = "Best monuments in Barcelona"
search_vector = model.encode(description_search, normalize_embeddings=True).tolist() # Convert search phrase into a vector

In [37]:
with engine.connect() as conn:
    with conn.begin():
        sql = text("""
            SELECT TOP 10 * FROM monuments 
            WHERE latitude < 5000 
            ORDER BY VECTOR_COSINE(description_vector, TO_VECTOR(:search_vector)) DESC
        """)

        results = conn.execute(sql, {'search_vector': str(search_vector)}).fetchall()
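
To make the shape of the query above concrete, here is a small in-memory sketch of the same idea: filter rows on a structured column, score the survivors by cosine similarity against the search vector, and keep the top k. It is only an illustration in plain Python with made-up rows and two-dimensional vectors; it says nothing about how IRIS evaluates the query internally, and the function `top_k` is hypothetical.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot_ab = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot_ab / (norm_a * norm_b)


def top_k(rows, query_vec, k, max_latitude):
    """Keep rows below max_latitude, then return the k most similar ones."""
    candidates = [r for r in rows if r["latitude"] < max_latitude]
    return heapq.nlargest(
        k, candidates, key=lambda r: cosine(r["vector"], query_vec)
    )


# Made-up rows, for illustration only
rows = [
    {"landmark": "A", "latitude": 41.4, "vector": [1.0, 0.0]},
    {"landmark": "B", "latitude": 48.9, "vector": [0.0, 1.0]},
    {"landmark": "C", "latitude": 60.0, "vector": [1.0, 0.1]},
]

top = top_k(rows, [1.0, 0.2], k=2, max_latitude=50)
names = [r["landmark"] for r in top]
print(names)
```

Row C is closest to the query vector but is removed by the latitude filter, which is the same effect a meaningful WHERE clause has in the SQL version.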


In [38]:
print(results)

[('Pont du Gard, France', 43.9470703, 4.535600512520862, 'Pont du Gard', 'https://upload.wikimedia.org/wikipedia/commons/thumb/4/42/Pont_du_Gard_BLS.jpg/500px-Pont_du_Gard_BLS.jpg', "The Pont du Gard is an ancient Roman aqueduct bridge built in the first century AD to carry water over 50 km (31 mi) to the Roman colony of Nemausus  ... (168 characters truncated) ... s added to UNESCO's list of World Heritage sites in 1985 because of its exceptional preservation, historical importance, and architectural ingenuity.", '-.037369482219219207763,.0033026244491338729858,-.023289291188120841979,.0029516082722693681716,.052871260792016983032,-.017278179526329040527,.01912 ... (17932 characters truncated) ... 0305786,.013148564845323562622,-.021633584052324295043,-.021005593240261077881,-.054070044308900833129,-.028030565008521080017,.036217868328094482421'), ("Fisherman's Bastion, Budapest, Hungary", 47.50232795, 19.034710434555507, "Fisherman's Bastion", 'https://upload.wikimedia.org/wikipedia/

In [39]:
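# The query returns the table's seven columns, so name them explicitly
columns = ['landmark', 'latitude', 'longitude', 'wiki_title', 'image_url', 'wiki_content', 'description_vector']
results_df = pd.DataFrame(results, columns=columns).iloc[:, :-1] # Remove vector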
pd.set_option('display.max_colwidth', None)  # Easier to read description
results_df["landmark"].head(10)

0                                  Pont du Gard, France
1                Fisherman's Bastion, Budapest, Hungary
2                              Alhambra, Granada, Spain
3        National Archaeological Museum, Athens, Greece
4                 La Concha Beach, San Sebastián, Spain
5                         Knossos Palace, Crete, Greece
6    National Art Museum of Catalonia, Barcelona, Spain
7                        Arc de Triomphe, Paris, France
8                          Park Güell, Barcelona, Spain
9                             Epidaurus Theater, Greece
Name: landmark, dtype: object