# Incorporating semantic similarity in tabular databases

In this notebook we will cover how to run semantic search over a specific table column within a single SQL query, combining tabular querying with RAG.


### Overall workflow

1. Generating embeddings for a specific column
2. Storing the embeddings in a new column (if the column has low cardinality, it's better to use a separate table containing the unique values and their embeddings)
3. Querying with standard SQL plus the [PGVector](https://github.com/pgvector/pgvector) extension, which provides L2 distance (`<->`), cosine distance (`<=>`; cosine similarity is `1 - (a <=> b)`) and negative inner product (`<#>`)
4. Running the standard SQL query
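
To make the three operators from step 3 concrete, here is a minimal pure-Python sketch of what each one computes. The function names are illustrative only (they are not part of pgvector or LangChain); inside PostgreSQL the extension performs these calculations itself.

```python
import math

def l2_distance(a, b):
    # What `<->` computes: Euclidean distance between the two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # What `<=>` computes: 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def negative_inner_product(a, b):
    # What `<#>` computes: the inner product multiplied by -1,
    # so that "smaller is closer" holds for ORDER BY ... ASC
    return -sum(x * y for x, y in zip(a, b))

a, b = [1.0, 2.0], [3.0, 4.0]
print(l2_distance(a, b), cosine_distance(a, b), negative_inner_product(a, b))
```

All three return smaller values for closer vectors, which is why the queries in this notebook can sort ascending and take the first rows with `LIMIT`.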

### Requirements

We will need a PostgreSQL database with the [pgvector](https://github.com/pgvector/pgvector) extension enabled. For this example, we will use the `Chinook` database on a local PostgreSQL server.

In [7]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = os.environ.get('OPENAI_API_KEY') or getpass.getpass("OpenAI API Key:")

In [8]:
from langchain.sql_database import SQLDatabase
from langchain.chat_models import ChatOpenAI

CONNECTION_STRING = "postgresql+psycopg2://postgres:test@localhost:5432/vectordb" # Replace with your own
db = SQLDatabase.from_uri(CONNECTION_STRING)
llm = ChatOpenAI(model_name='gpt-4', temperature=0)

### Embedding the song titles

For this example, we will run queries based on semantic meaning of song titles. In order to do this, let's start by adding a new column in the table for storing the embeddings:

In [9]:
# db.run('ALTER TABLE "Track" ADD COLUMN "embeddings" vector;')

Let's generate the embedding for each *track title* and store it as a new column in our "Track" table

In [17]:
from langchain.embeddings import OpenAIEmbeddings

embeddings_model = OpenAIEmbeddings()

In [15]:
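import ast

tracks = db.run('SELECT "Name" FROM "Track"')
# db.run returns the rows as a string; literal_eval parses it without executing arbitrary code
song_titles = [s[0] for s in ast.literal_eval(tracks)]
# song_titles[:10]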

In [18]:
title_embeddings = embeddings_model.embed_documents(song_titles)
len(title_embeddings)

3503

Now let's insert the embeddings into the new column of our table:

In [19]:
from tqdm import tqdm

for i in tqdm(range(len(title_embeddings))):
    title = song_titles[i].replace("'","''")
    embedding = title_embeddings[i]
    sql_command = f'UPDATE "Track" SET "embeddings" = ARRAY{embedding} WHERE "Name" =' +  f"'{title}'"
    db.run(sql_command)

100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3503/3503 [00:27<00:00, 128.85it/s]
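
The loop above builds each statement by string interpolation and matches rows on `"Name"`, so tracks that share a title are all rewritten by the same statement. As a hedged alternative, the sketch below formats the embedding as a pgvector text literal and pairs it with a parameterized statement keyed on `"TrackId"`. The helper names `to_vector_literal` and `build_update` are hypothetical, and the statement is only constructed here, not executed.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"

def build_update(track_id, embedding):
    """Return a parameterized UPDATE statement and its parameters (hypothetical helper)."""
    sql = (
        'UPDATE "Track" SET "embeddings" = CAST(:embedding AS vector) '
        'WHERE "TrackId" = :track_id'
    )
    params = {"embedding": to_vector_literal(embedding), "track_id": track_id}
    return sql, params

sql, params = build_update(1, [0.25, -0.5, 1.0])
print(sql)
print(params)
```

A statement in this form could be executed through a SQLAlchemy connection with `connection.execute(text(sql), params)`, which leaves quoting to the driver. That assumes you also select `"TrackId"` alongside `"Name"` when fetching the titles.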


We can test the semantic search by running the following query:

In [21]:
embeded_title = embeddings_model.embed_query("hope about the future")
query = 'SELECT "Track"."Name" FROM "Track" WHERE "Track"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> ' +  f"'{embeded_title}' LIMIT 5"
db.run(query)

'[("Tomorrow\'s Dream",), (\'Remember Tomorrow\',), (\'Remember Tomorrow\',), (\'The Best Is Yet To Come\',), ("Thinking \'Bout Tomorrow",)]'
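
The query above ranks rows by L2 distance. A variant that ranks by cosine distance and also returns the cosine similarity score could look like the sketch below. `cosine_search_sql` is an illustrative helper, not part of LangChain, and it only builds the SQL text.

```python
def cosine_search_sql(embedding, limit=5):
    """Build a cosine-similarity search over "Track" (illustrative helper)."""
    vector_literal = "[" + ",".join(str(float(x)) for x in embedding) + "]"
    quoted = "'" + vector_literal + "'"
    return (
        'SELECT "Track"."Name", 1 - ("Track"."embeddings" <=> ' + quoted + ') AS similarity '
        'FROM "Track" WHERE "Track"."embeddings" IS NOT NULL '
        'ORDER BY "Track"."embeddings" <=> ' + quoted + ' LIMIT ' + str(int(limit))
    )

# Tiny made-up vector just to show the shape of the generated SQL
print(cosine_search_sql([0.5, 1.0], limit=3))
```

In the notebook this would presumably be called as `db.run(cosine_search_sql(embeddings_model.embed_query("hope about the future")))`. Sorting by `<=>` ascending is equivalent to sorting by similarity descending.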

### Creating the SQL Chain

Now let's try to generate a query using the SQL chain:

In [119]:
from langchain.chains.sql_database.prompt import PROMPT, SQL_PROMPTS
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_sql_query_chain

prompt = SQL_PROMPTS['postgresql']

#TODO: Load prompt from LangChain HUB
prompt.template = """
You are a PostgreSQL expert. Given an input question, first create a syntactically correct PostgreSQL query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per PostgreSQL. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CURRENT_DATE function to get the current date, if the question involves "today".

IMPORTANT NOTE: you can use specialized pgvector syntax (`<->`) to run
semantic search using an embeddings column in the table.
The embeddings value for a given row typically represents the semantic meaning of that row.
The vector represents an embedding representation of the question, given below. 

Do NOT fill in the vector values directly, but rather specify a
`[search_word]` placeholder, which should contain the word that should be embedded for filtering.
The column containing the embedding is called 'embeddings'.
FOR EXAMPLE, if the user asks for songs about 'the feeling of loneliness' the query could be:
'SELECT "Track"."Name" FROM "Track" ORDER BY "embeddings" <-> '[loneliness]' LIMIT 5'

If you need to combine embeddings from different tables, you can use the WITH statement:
WITH table1 AS (
    SELECT "ColumnName"
    FROM "TableName"
    ORDER BY "embeddings" <-> '[keyword_1]'
    LIMIT 5
)
SELECT "table2"."Column_Name", table1."ColumnName"
FROM "table2"
JOIN table1 ON "table2"."Column_Name" = table1."ColumnName"
ORDER BY "table2"."embeddings" <-> '[keyword_2]'
LIMIT 3;

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}
"""

In [120]:
db = SQLDatabase.from_uri(CONNECTION_STRING) # We reconnect to the db so the new columns are loaded as well.
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
chain = create_sql_query_chain(llm, db, prompt=prompt)

## Using the Chain

### Example 1: Filtering a column based on semantic meaning

Let's say we want to retrieve songs that express sadness, but filtering based on genre:

In [30]:
question = "Which are the 5 rock songs with titles about deep feeling of dispair?"
query = chain.invoke({"question": question})
print(query)

SELECT "Track"."Name" 
FROM "Track" 
JOIN "Genre" ON "Track"."GenreId" = "Genre"."GenreId" 
WHERE "Genre"."Name" = 'Rock' 
ORDER BY "Track"."embeddings" <-> '[dispair]' 
LIMIT 5


As we can see, we have a placeholder provided by the LLM for us to insert the actual embedding for `dispair`. 

Let's use this function to replace the embedding placeholder with the actual embedding:

In [31]:
import re

def replace_brackets(match):
    words_inside_brackets = match.group(1).split(', ')
    embedded_words = [str(embeddings_model.embed_query(word)) for word in words_inside_brackets]
    return "', '".join(embedded_words)

final_query = re.sub(r'\[([\w\s,]+)\]', replace_brackets, query)

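`replace_brackets` is the callback passed to `re.sub`: for each bracketed placeholder it receives the regex match, splits the text inside the brackets on `, `, and embeds each word or phrase with `embeddings_model.embed_query()`.

Each embedding is converted to its string form (a bracketed list of floats). When a placeholder contains several phrases, the embeddings are joined with `', '` so that each one ends up as its own quoted vector literal in the query.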
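
To see the substitution in isolation, the sketch below repeats the same regex and callback but swaps the OpenAI call for a stub. `fake_embed_query` is a made-up stand-in that returns a two-element vector, so the result is short enough to read.

```python
import re

def fake_embed_query(text):
    # Stand-in for embeddings_model.embed_query: tiny deterministic vector
    return [float(len(text)), 0.0]

def replace_brackets_demo(match):
    # Same logic as replace_brackets, using the stub embedder
    words_inside_brackets = match.group(1).split(', ')
    embedded_words = [str(fake_embed_query(word)) for word in words_inside_brackets]
    return "', '".join(embedded_words)

pattern = r'\[([\w\s,]+)\]'
demo_query = 'SELECT "Name" FROM "Track" ORDER BY "embeddings" <-> \'[dispair]\' LIMIT 5'
demo_final = re.sub(pattern, replace_brackets_demo, demo_query)
print(demo_final)
# SELECT "Name" FROM "Track" ORDER BY "embeddings" <-> '[7.0, 0.0]' LIMIT 5
```

Note that the pattern only matches letters, digits, underscores, whitespace and commas, so a placeholder such as `[heart-break]` would be left untouched. With the real `embeddings_model`, `final_query` computed above carries the full-length vector in the same position.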

And we can run it:

In [32]:
db.run(final_query)

"[('Sea Of Sorrow',), ('Surrender',), ('Indifference',), ('Hard Luck Woman',), ('Desire',)]"

### Some insights

What makes this method substantially different is that it combines:
- Semantic search (songs that have sad titles)
- Traditional tabular querying (running JOIN statements to filter track based on genre)

This is something we _could_ potentially achieve using metadata filtering, but it's more complex to do so (we would need to use a vector database containing the embeddings, and use metadata filtering based on genre).

However, for other use cases metadata filtering **wouldn't be enough**.

### Example 2: Combining filters

In [42]:
question = "I want to know the 3 albums which have the most amount of songs in the top 150 saddest songs"
query = chain.invoke({"question": question})
# print(query)

In [38]:
final_query = re.sub(r'\[([\w\s,]+)\]', replace_brackets, query)
db.run(final_query)

"[('International Superhits', 5), ('Ten', 4), ('Album Of The Year', 3)]"

So we get the 3 albums with the most songs among the 150 saddest ones. This **wouldn't** be possible using only standard metadata filtering. Without this _hybrid query_, we would need some postprocessing to get the result.
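
For comparison, here is a rough sketch of the postprocessing that a plain vector search would require: fetch the nearest tracks together with their album, then count per album on the client side. The `(track, album)` pairs are made-up placeholder data, not rows from Chinook.

```python
from collections import Counter

# Hypothetical output of a vector-only search: (track name, album title) pairs
top_sad_tracks = [
    ("Track A", "Album X"),
    ("Track B", "Album Y"),
    ("Track C", "Album X"),
    ("Track D", "Album Z"),
    ("Track E", "Album X"),
    ("Track F", "Album Y"),
]

# Count how many of the retrieved tracks belong to each album
album_counts = Counter(album for _, album in top_sad_tracks)
top_albums = album_counts.most_common(3)
print(top_albums)  # [('Album X', 3), ('Album Y', 2), ('Album Z', 1)]
```

With the hybrid query, this grouping and counting happens inside the database in the same statement as the similarity search.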

Another similar example:

In [39]:
question = "I need the 6 albums with shortest title, as long as they contain songs which are in the 20 saddest song list."
query = chain.invoke({"question": question})
print(query)

SELECT "Album"."Title", "Album"."AlbumId"
FROM "Album"
JOIN "Track" ON "Album"."AlbumId" = "Track"."AlbumId"
WHERE "Track"."TrackId" IN (
    SELECT "TrackId" 
    FROM "Track" 
    ORDER BY "embeddings" <-> '[sad]' 
    LIMIT 20
)
ORDER BY "Album"."title_len" ASC
LIMIT 6


In [40]:
final_query = re.sub(r'\[([\w\s,]+)\]', replace_brackets, query)
db.run(final_query)

"[('Ten', 181), ('Core', 206), ('Big Ones', 5), ('One By One', 81), ('Black Album', 148), ('Miles Ahead', 157)]"

### Example 3: Combining two separate semantic searches

One interesting aspect of this approach which is **substantially different from using standard RAG** is that we can even **combine** two semantic search filters:
- _Get 5 saddest songs..._
- _**...obtained from albums with "lovely" titles**_

This could generalize to **any kind of combined RAG** (paragraphs discussing topic _X_ from books about _Y_, replies to a tweet about topic _ABC_ that express feeling _XYZ_).

We will combine semantic search on song and album titles, so we need to do the same for the `Album` table:
1. Generate the embeddings
2. Store them in a new column (which we first need to add to the table)

In [60]:
# db.run('ALTER TABLE "Album" ADD COLUMN "embeddings" vector;')

In [43]:
import ast

albums = db.run('SELECT "Title" FROM "Album"')
# Parse the stringified rows safely instead of using eval
album_titles = [title[0] for title in ast.literal_eval(albums)]
album_title_embeddings = embeddings_model.embed_documents(album_titles)

In [44]:
for i in tqdm(range(len(album_title_embeddings))):
    album_title = album_titles[i].replace("'","''")
    album_embedding = album_title_embeddings[i]
    sql_command = f'UPDATE "Album" SET "embeddings" = ARRAY{album_embedding} WHERE "Title" =' +  f"'{album_title}'"
    db.run(sql_command)

100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 347/347 [00:03<00:00, 114.59it/s]


In [46]:
embeded_title = embeddings_model.embed_query("hope about the future")
query = 'SELECT "Album"."Title" FROM "Album" WHERE "Album"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> ' +  f"'{embeded_title}' LIMIT 5"
db.run(query)

"[('Realize',), ('Morning Dance',), ('Into The Light',), ('New Adventures In Hi-Fi',), ('Miles Ahead',)]"

Now we can combine both filters:

In [110]:
db = SQLDatabase.from_uri(CONNECTION_STRING) # We reconnect to the db so the new columns are loaded as well.
llm = ChatOpenAI(model_name='gpt-4', temperature=0)
chain = create_sql_query_chain(llm, db, prompt=prompt)

In [111]:
question = "I want to know songs about love obtained from 5 saddest albums"
query = chain.invoke({"question": question})

In [112]:
print(query)

WITH SadAlbums AS (
    SELECT "AlbumId"
    FROM "Album"
    ORDER BY "embeddings" <-> '[sadness]'
    LIMIT 5
)
SELECT "Track"."Name"
FROM "Track"
JOIN SadAlbums ON "Track"."AlbumId" = SadAlbums."AlbumId"
ORDER BY "Track"."embeddings" <-> '[love]'
LIMIT 5;


In [113]:
final_query = re.sub(r'\[([\w\s,]+)\]', replace_brackets, query)
db.run(final_query)

"[('For Your Life',), ('Frantic',), ('Believer',), ('My World',), ('Dee',)]"

In [114]:
question = "Which are the 5 saddest albums?"
query = chain.invoke({"question": question})

In [117]:
db.run(re.sub(r'\[([\w\s,]+)\]', replace_brackets, query))

"[('St. Anger', 'Metallica'), ('Presence', 'Led Zeppelin'), ('Tribute', 'Ozzy Osbourne'), ('Quiet Songs', 'Aisha Duo'), ('Allegri: Miserere', 'Richard Marlow & The Choir of Trinity College, Cambridge')]"