# Incorporating semantic search in tabular databases

In this notebook we cover how to add embeddings to a SQL database so that semantic search can be **combined** with standard tabular queries in the same solution.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

The example queries the Chinook database in `PostgreSQL` through `SQLDatabase`. We will combine standard SQL queries with the semantic meaning of song titles.

The database needs to have the vector extension enabled, so follow the installation guide for [pgvector](https://github.com/pgvector/pgvector).

In [None]:
from langchain.sql_database import SQLDatabase
from langchain.chat_models import ChatOpenAI

CONNECTION_STRING = "postgresql+psycopg2://postgres:test@localhost:5432/chinook" # Replace with your own
db = SQLDatabase.from_uri(CONNECTION_STRING)
llm = ChatOpenAI(model_name='gpt-4', temperature=0)

### Embedding the song titles

We need to add a new column to the table where we want to store the embeddings:

In [356]:
# db.run('ALTER TABLE "Track" ADD COLUMN "embeddings" vector;')
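
The `vector` type used above comes from the pgvector extension, which has to be enabled in the database before the column can be created. pgvector also accepts a fixed dimension, written `vector(n)`. The sketch below only builds the SQL statements as strings. The helper name `vector_column_statements` is hypothetical, and the dimension 1536 is an assumption that matches OpenAI's `text-embedding-ada-002` model, so check the length of one of your own embeddings before relying on it.

```python
from typing import List, Optional


def vector_column_statements(table: str, column: str = "embeddings",
                             dim: Optional[int] = None) -> List[str]:
    """Build the SQL that enables pgvector and adds a vector column (hypothetical helper)."""
    # Without a dimension the column accepts vectors of any length
    column_type = "vector" if dim is None else f"vector({dim})"
    return [
        "CREATE EXTENSION IF NOT EXISTS vector;",
        f'ALTER TABLE "{table}" ADD COLUMN IF NOT EXISTS "{column}" {column_type};',
    ]


# 1536 is an assumed dimension, see the note above
statements = vector_column_statements("Track", dim=1536)
for statement in statements:
    print(statement)  # each statement could then be passed to db.run(statement)
```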

Let's generate the embedding for each *track title* and store it as a new column in our "Track" table

In [51]:
from langchain.embeddings import OpenAIEmbeddings
from tqdm import tqdm

embeddings_model = OpenAIEmbeddings()

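Now we generate the embeddings for each song title. This can take a while on large tables. The run below embeds all 3,503 titles; add a `LIMIT` clause to the query if you prefer to work with a smaller sample.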

In [32]:
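import ast

# Fetch the titles from the Track table
res = db.run('SELECT "Name" FROM "Track"')
# db.run returns the rows as a string, so parse it back into a list of tuples
titles = [title[0] for title in ast.literal_eval(res)]
titles[:5]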

['Princess of the Dawn',
 'Put The Finger On You',
 "Let's Get It Up",
 'Inject The Venom',
 'Snowballed']

In [20]:
title_embeddings = embeddings_model.embed_documents(titles)
# len(title_embeddings)
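
If embedding every title is too slow or too costly, one option is to work on a sample and send the titles in batches. This is a minimal sketch under stated assumptions: `batched` and `embed_in_batches` are hypothetical helpers, the sample size of 200 and batch size of 64 are arbitrary, and `fake_embed` stands in for `embeddings_model.embed_documents` so the snippet runs without an API key.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(texts: Sequence[str],
                     embed: Callable[[List[str]], List[List[float]]],
                     size: int = 64) -> List[List[float]]:
    """Call `embed` once per batch and concatenate the results in order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, size):
        vectors.extend(embed(list(batch)))
    return vectors


def fake_embed(texts: List[str]) -> List[List[float]]:
    # Stand-in for embeddings_model.embed_documents: a toy 2-d vector per text
    return [[float(len(text)), 1.0] for text in texts]


all_titles = [f"Song {i}" for i in range(230)]  # made-up titles
sample_titles = all_titles[:200]                # keep only the first 200
sample_embeddings = embed_in_batches(sample_titles, fake_embed, size=64)
print(len(sample_embeddings))
```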

Now let's insert the embeddings into the new column of our table:

In [38]:
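# Write each embedding back to the row(s) holding the matching title
for title, embedding in tqdm(zip(titles, title_embeddings), total=len(titles)):
    escaped_title = title.replace("'", "''")  # escape single quotes for the SQL literal
    sql_command = (
        f'UPDATE "Track" SET "embeddings" = ARRAY{embedding} '
        f"WHERE \"Name\" = '{escaped_title}'"
    )
    db.run(sql_command)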

100%|██████████| 3503/3503 [00:23<00:00, 146.68it/s]
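
The loop above interpolates values into the SQL string and matches rows by title, which is not unique (the search results further down list 'Remember Tomorrow' twice). A possible alternative is to key the update on `TrackId` and pass the values as bound parameters. The sketch below only prepares the statement and the parameter dictionaries. `to_vector_literal` and `build_update_params` are hypothetical helpers, the rows are made up, and executing the statement through SQLAlchemy (as hinted in the comment) is an assumption about your setup.

```python
from typing import Dict, List, Sequence, Tuple

# Named bind parameters; the text value is cast to pgvector's vector type
UPDATE_SQL = (
    'UPDATE "Track" SET "embeddings" = CAST(:embedding AS vector) '
    'WHERE "TrackId" = :track_id'
)


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Render a list of floats in pgvector's text format, e.g. [0.1,0.2]."""
    return "[" + ",".join(repr(float(value)) for value in embedding) + "]"


def build_update_params(rows: Sequence[Tuple[int, str]],
                        embeddings: Sequence[Sequence[float]]) -> List[Dict[str, object]]:
    """Pair each (TrackId, Name) row with its embedding as one parameter dict."""
    return [
        {"track_id": track_id, "embedding": to_vector_literal(embedding)}
        for (track_id, _name), embedding in zip(rows, embeddings)
    ]


toy_rows = [(1, "Song A"), (2, "Song B")]   # made-up (TrackId, Name) pairs
toy_embeddings = [[0.1, 0.2], [0.3, -0.4]]  # made-up 2-d vectors
params = build_update_params(toy_rows, toy_embeddings)
print(params)
# With a SQLAlchemy connection this could be run as:
#   connection.execute(sqlalchemy.text(UPDATE_SQL), params)
```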


We can test the semantic search by running the following query:

In [52]:
embedded_title = embeddings_model.embed_query("hope about the future")
query = 'SELECT "Track"."Name" FROM "Track" WHERE "Track"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> ' +  f"'{embedded_title}' LIMIT 5"
print(query)
db.run(query)

SELECT "Track"."Name" FROM "Track" WHERE "Track"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> '[-0.005296176500367851, -0.024523600003614955, -0.0007927474517293692, -0.01893319261373208, 0.0035659644242875383, 0.002227528338073427, -0.00032501444871436925, 0.0019508862909545174, -0.017282933652393022, 0.008219307600334083, 0.006239637969211901, 0.009786413422444112, -0.022975684083882866, -0.00395933972790567, 0.012645581285559251, 0.007451745964153147, 0.03661269833460187, -0.01778184876247676, 0.0186133745668314, -0.013918454355509036, -0.0026528854929477575, 0.0038154219793294078, 0.01217225147021595, -0.00718309943805595, -0.006217250836314019, -0.014506918928506913, 0.021440560811520994, -0.026992590259293216, 0.02301406202599352, -0.022898928199661557, 0.018626167214201617, -0.010643523590216577, -0.03175147370746667, -0.017206177768171713, -0.0012264998396480686, -0.00873101535778804, 0.0011041696835090498, 0.0010410057542879437, 0.012747922464520998, -0.01273512981715078

'[("Tomorrow\'s Dream",), (\'Remember Tomorrow\',), (\'Remember Tomorrow\',), (\'The Best Is Yet To Come\',), ("Thinking \'Bout Tomorrow",)]'
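
In pgvector, `<->` orders rows by Euclidean (L2) distance. The extension also provides `<=>` for cosine distance and `<#>` for negative inner product. The pure-Python sketch below mirrors the first two on toy 2-d vectors (made up for illustration, not real embeddings) to show what the `ORDER BY` clause is computing.

```python
import math
from typing import Sequence


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance, the quantity behind pgvector's <-> operator."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity, the quantity behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query_vector = [1.0, 0.0]
toy_rows = {
    "close": [0.9, 0.1],
    "orthogonal": [0.0, 1.0],
    "opposite": [-1.0, 0.0],
}

# Equivalent of: ORDER BY "embeddings" <-> query_vector
ranked = sorted(toy_rows, key=lambda name: l2_distance(toy_rows[name], query_vector))
print(ranked)
```

A smaller distance means a closer match, which is why the queries in this notebook sort in ascending order and keep the first rows.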

### Creating the SQL Chain

Now let's try to generate a query using the SQL chain:

In [41]:
from langchain.chains.sql_database.prompt import PROMPT, SQL_PROMPTS
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_sql_query_chain

prompt = SQL_PROMPTS['postgresql']

prompt.template = """
You are a PostgreSQL expert. Given an input question, first create a syntactically correct PostgreSQL query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per PostgreSQL. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CURRENT_DATE function to get the current date, if the question involves "today".

IMPORTANT NOTE: you can use specialized pgvector syntax (`<->`) to do nearest
neighbors/semantic search to a given vector from an embeddings column in the table.
The embeddings value for a given row typically represents the semantic meaning of that row.
The vector represents an embedding representation
of the question, given below. Do NOT fill in the vector values directly, but rather specify a
`[embedding]` placeholder.
The only tables containing 'embeddings' columns are: Track. No other tables contain 'embeddings' column.

FOR EXAMPLE, the query to get 3 songs which title have a certain semantic meaning is:
'SELECT "Track"."Name" FROM "Track" ORDER BY "embeddings" <-> [embedding] LIMIT 3'


Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}
"""


llm = ChatOpenAI(model_name='gpt-4', temperature=0)
chain = create_sql_query_chain(llm, db, prompt=prompt)

### Using the Chain

#### Example 1: Getting sad songs

Let's say we want to retrieve songs that express sadness, while also filtering by genre:

In [42]:
question = "Which are the 5 rock songs with saddest titles?"
query = chain.invoke({"question": question})
print(query)

SELECT "Track"."Name" 
FROM "Track" 
JOIN "Genre" ON "Track"."GenreId" = "Genre"."GenreId" 
WHERE "Genre"."Name" = 'Rock' 
ORDER BY "Track"."embeddings" <-> [embedding] 
LIMIT 5


As we can see, we have a placeholder provided by the LLM for us to insert the actual embedding for "sadness", so let's do it:

In [43]:
final_query = query.replace('[embedding]', "'" + str(embeddings_model.embed_query("sadness")) + "'")
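
Because the substitution is a plain string replacement, it silently does nothing when the model omits or renames the placeholder. Below is a minimal sketch of a stricter variant. `fill_placeholder` is a hypothetical helper, and the short vector stands in for a real `embed_query` result.

```python
from typing import Sequence


def fill_placeholder(query: str, placeholder: str, embedding: Sequence[float]) -> str:
    """Replace `placeholder` with a quoted vector literal, failing loudly if it is absent."""
    if placeholder not in query:
        raise ValueError(f"placeholder {placeholder!r} not found in generated query")
    vector_literal = "'" + str(list(embedding)) + "'"
    return query.replace(placeholder, vector_literal)


demo_query = 'SELECT "Name" FROM "Track" ORDER BY "embeddings" <-> [embedding] LIMIT 5'
filled_query = fill_placeholder(demo_query, "[embedding]", [0.1, -0.2, 0.3])
print(filled_query)
```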

And now we can run it:

In [44]:
db.run(final_query)

"[('Sea Of Sorrow',), ('Out Of Tears',), ('My Melancholy Blues',), ('Indifference',), ('Confusion',)]"

#### Some insights

The previous example would be awkward to reproduce with a standard vector database that only offers metadata filtering on the genre. What is substantially different about this method is that it combines both in the same solution:
- Semantic search (songs that have sad titles)
- Traditional tabular querying (running JOIN statements to filter tracks by genre)

This is something we potentially _could_ achieve using metadata filtering, but it's more complex to do so.

For other use cases, metadata filtering wouldn't be enough:

#### Example 2: Combining filters

In [47]:
question = "Get 3 albums which have the most amount of songs in the top 500 saddest songs"
query = chain.invoke({"question": question})

if '[embedding]' in query:
    final_query = query.replace('[embedding]', "'" + str(embeddings_model.embed_query("sadness")) + "'")

In [48]:
db.run(final_query)

"[('Unplugged', 9), ('Instant Karma: The Amnesty International Campaign to Save Darfur', 7), ('Lost, Season 2', 7)]"

So we get the 3 albums with the most songs among the top 500 saddest ones. This **wouldn't** be possible using only standard metadata filtering; we would need to run some post-processing to get the result.
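
To make the comparison concrete, this is roughly the client-side post-processing that a vector store with metadata would require: fetch the top hits, then group and count them by album. The hits below are made-up toy data rather than Chinook results, and `top_albums` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List, Tuple


def top_albums(hits: List[Dict[str, str]], n: int = 3) -> List[Tuple[str, int]]:
    """Count how many hits fall in each album and return the n most frequent albums."""
    counts = Counter(hit["album"] for hit in hits)
    return counts.most_common(n)


# Pretend these are the nearest neighbours returned by a vector store
toy_hits = [
    {"title": "Song A", "album": "Album X"},
    {"title": "Song B", "album": "Album Y"},
    {"title": "Song C", "album": "Album X"},
    {"title": "Song D", "album": "Album Z"},
    {"title": "Song E", "album": "Album Y"},
    {"title": "Song F", "album": "Album X"},
]
print(top_albums(toy_hits, n=2))
```

In the SQL version, the same grouping and counting happens inside the database as part of a single query.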


#### Example 3: Combining filters #2

In [49]:
question = "I need the 6 albums with shortest title, as soon as they contain songs which are in the 20 saddest song list."
query = chain.invoke({"question": question})
print(query)

WITH saddest_songs AS (
    SELECT "TrackId" FROM "Track" ORDER BY "embeddings" <-> [embedding] LIMIT 20
),
albums_with_sad_songs AS (
    SELECT DISTINCT "AlbumId" FROM "Track" WHERE "TrackId" IN (SELECT "TrackId" FROM saddest_songs)
)
SELECT "Album"."Title" FROM "Album" 
JOIN albums_with_sad_songs ON "Album"."AlbumId" = albums_with_sad_songs."AlbumId" 
ORDER BY "Album"."title_len" ASC 
LIMIT 6


In [50]:
final_query = query.replace('[embedding]', "'" + str(embeddings_model.embed_query("sadness")) + "'")
db.run(final_query)

"[('Vs.',), ('Faceless',), ('Facelift',), ('Unplugged',), ('By The Way',), ('Black Album',)]"

We can complicate things even further and still use the same solution. This would require much more complex post-processing if we were limited to vector databases with metadata filtering:

### Combining two semantic similarity filters

One interesting aspect of this approach is that we can even **combine** two semantic search filters:
- the 5 saddest songs...
- ...as long as they belong to the top 20 "love-related" albums

Let's do the same as we did for the song titles with the album titles:
1. Generate the embeddings
2. Add them to the table as a new column (which we need to add)
3. Test the embeddings

In [60]:
# db.run('ALTER TABLE "Album" ADD COLUMN "embeddings" vector;')

In [61]:
import ast

albums = db.run('SELECT "Title" FROM "Album"')
album_titles = [title[0] for title in ast.literal_eval(albums)]  # parse the string result safely
album_title_embeddings = embeddings_model.embed_documents(album_titles)

In [58]:
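# Write each embedding back to the row(s) holding the matching album title
for album_title, album_embedding in tqdm(zip(album_titles, album_title_embeddings), total=len(album_titles)):
    escaped_title = album_title.replace("'", "''")  # escape single quotes for the SQL literal
    sql_command = (
        f'UPDATE "Album" SET "embeddings" = ARRAY{album_embedding} '
        f"WHERE \"Title\" = '{escaped_title}'"
    )
    db.run(sql_command)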

100%|██████████| 347/347 [00:02<00:00, 162.18it/s]


In [59]:
embedded_title = embeddings_model.embed_query("hope about the future")
query = 'SELECT "Album"."Title" FROM "Album" WHERE "Album"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> ' +  f"'{embedded_title}' LIMIT 5"
print(query)
db.run(query)

SELECT "Album"."Title" FROM "Album" WHERE "Album"."embeddings" IS NOT NULL ORDER BY "embeddings" <-> '[-0.005296176500367851, -0.024523600003614955, -0.0007927474517293692, -0.01893319261373208, 0.0035659644242875383, 0.002227528338073427, -0.00032501444871436925, 0.0019508862909545174, -0.017282933652393022, 0.008219307600334083, 0.006239637969211901, 0.009786413422444112, -0.022975684083882866, -0.00395933972790567, 0.012645581285559251, 0.007451745964153147, 0.03661269833460187, -0.01778184876247676, 0.0186133745668314, -0.013918454355509036, -0.0026528854929477575, 0.0038154219793294078, 0.01217225147021595, -0.00718309943805595, -0.006217250836314019, -0.014506918928506913, 0.021440560811520994, -0.026992590259293216, 0.02301406202599352, -0.022898928199661557, 0.018626167214201617, -0.010643523590216577, -0.03175147370746667, -0.017206177768171713, -0.0012264998396480686, -0.00873101535778804, 0.0011041696835090498, 0.0010410057542879437, 0.012747922464520998, -0.0127351298171507

"[('Realize',), ('Morning Dance',), ('Into The Light',), ('New Adventures In Hi-Fi',), ('Miles Ahead',)]"

#### Combining both filters

In [104]:
from langchain.chains.sql_database.prompt import PROMPT, SQL_PROMPTS
from langchain.chat_models import ChatOpenAI
from langchain.chains import create_sql_query_chain

prompt = SQL_PROMPTS['postgresql']

prompt.template = """
You are a PostgreSQL expert. Given an input question, first create a syntactically correct PostgreSQL query to run, then look at the results of the query and return the answer to the input question.
Unless the user specifies in the question a specific number of examples to obtain, query for at most {top_k} results using the LIMIT clause as per PostgreSQL. You can order the results to return the most informative data in the database.
Never query for all columns from a table. You must query only the columns that are needed to answer the question. Wrap each column name in double quotes (") to denote them as delimited identifiers.
Pay attention to use only the column names you can see in the tables below. Be careful to not query for columns that do not exist. Also, pay attention to which column is in which table.
Pay attention to use CURRENT_DATE function to get the current date, if the question involves "today".

IMPORTANT NOTE: you can use specialized pgvector syntax (`<->`) to do nearest
neighbors/semantic search to a given vector from an embeddings column in the table.
The embeddings value for a given row typically represents the semantic meaning of that row.
The vector represents an embeddings representation
of the question, given below. Do NOT fill in the vector values directly, but rather specify a
`[search_word]` placeholder, which should contain the word that should be embedded for filtering.
The only tables containing 'embeddings' columns are: 'Track' and 'Album'. No other tables contain 'embeddings' column.
The column is ALWAYS called 'embeddings'. MAKE SURE you call it 'embeddings' and not 'embedding' or similar.
FOR EXAMPLE, the query to get 3 songs which title are saddest:
'SELECT "Track"."Name" FROM "Track" ORDER BY "embeddings" <-> [sad] LIMIT 3'

Or three happiest albums:
'SELECT "Album"."Title" FROM "Album" ORDER BY "embeddings" <-> [happiness] LIMIT 3'

Use the following format:

Question: Question here
SQLQuery: SQL Query to run
SQLResult: Result of the SQLQuery
Answer: Final answer here

Only use the following tables:
{table_info}

Question: {input}
"""


llm = ChatOpenAI(model_name='gpt-4', temperature=0)
chain = create_sql_query_chain(llm, db, prompt=prompt)

In [113]:
# question = "I want to know 5 songs that have a happy name but belonging to an album with a sad name"
question = "I want the 3 saddest songs and album title for each one of the top 5 happiest albums"
query = chain.invoke({"question": question})

In [114]:
print(query)

SELECT "Track"."Name", "Album"."Title" 
FROM "Track" 
JOIN "Album" ON "Track"."AlbumId" = "Album"."AlbumId" 
ORDER BY "Album"."embeddings" <-> [happiness] DESC, "Track"."embeddings" <-> [sad] ASC 
LIMIT 15


In [115]:
final_query = query.replace('[happiness]', "'" + str(embeddings_model.embed_query("happiness")) + "'")
final_query = final_query.replace('[sad]', "'" + str(embeddings_model.embed_query("sad")) + "'")

In [117]:
print(db.run(final_query))

[('Leila', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Aloha', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Soul Parsifal', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Natália', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Mil Pedaços', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Dezesseis', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Música De Trabalho', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Música Ambiente', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Longe Do Meu Lado', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Esperando Por Mim', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('Quando Você Voltar', 'A TempestadeTempestade Ou O Livro Dos Dias'), ("L'Avventura", 'A TempestadeTempestade Ou O Livro Dos Dias'), ('A Via Láctea', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('1º De Julho', 'A TempestadeTempestade Ou O Livro Dos Dias'), ('O Livro Dos Dias', 'A TempestadeTempestade Ou O Livro Dos Dias')]
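
Note that the generated query sorts by album distance in descending order, which puts the albums farthest from "happiness" first, and a single `LIMIT 15` does not guarantee 3 songs per album. The output above indeed comes from a single album. Below is a hedged sketch of a query closer to the question, using a CTE and `ROW_NUMBER()`. It was not produced by the chain and has not been run against Chinook here, and the name `per_album_query` is hypothetical.

```python
# Hypothetical hand-written query: 3 saddest tracks for each of the 5 happiest albums
per_album_query = """
WITH happiest_albums AS (
    SELECT "AlbumId", "Title"
    FROM "Album"
    ORDER BY "embeddings" <-> [happiness]
    LIMIT 5
),
ranked_tracks AS (
    SELECT t."Name", a."Title",
           ROW_NUMBER() OVER (
               PARTITION BY a."AlbumId"
               ORDER BY t."embeddings" <-> [sad]
           ) AS sad_rank
    FROM "Track" AS t
    JOIN happiest_albums AS a ON t."AlbumId" = a."AlbumId"
)
SELECT "Name", "Title"
FROM ranked_tracks
WHERE sad_rank <= 3
ORDER BY "Title", sad_rank
"""

# The placeholders still have to be replaced with quoted embeddings before running
placeholders = [p for p in ("[happiness]", "[sad]") if p in per_album_query]
print(placeholders)
```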


In [120]:
question = "Three songs about a combination of happiness with hope"
query = chain.invoke({"question": question})
print(query)

SELECT "Track"."Name" 
FROM "Track" 
ORDER BY "embeddings" <-> [happiness, hope] 
LIMIT 3


In [126]:
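import re

def replace_brackets(match):
    # Split the placeholder contents, e.g. "happiness, hope" -> ["happiness", "hope"]
    words_inside_brackets = match.group(1).split(', ')
    # Embed each word and render each vector as its own quoted literal
    embedded_words = [str(embeddings_model.embed_query(word)) for word in words_inside_brackets]
    return "'" + "', '".join(embedded_words) + "'"

# Replace every [word, ...] placeholder in the generated query
updated_sql = re.sub(r'\[([\w\s,]+)\]', replace_brackets, query)
print(updated_sql)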

SELECT "Track"."Name" 
FROM "Track" 
ORDER BY "embeddings" <-> '[0.002974223739582622, -0.02512427741888291, 0.016198377071763158, -0.032035069414131874, -0.0022589244241709074, 0.013963673089441988, -0.015294162454309174, -0.024465491458854405, -0.009106743643139601, -0.01993149627016118, 0.014377029460722387, 0.01652131132937232, -0.0010527653491863428, 0.0001054581864862774, 0.009868868230326892, 0.022747482474972422, 0.041335573798101495, -0.017554701326250696, 0.028030685737365387, -0.023277093986899162, -0.010527654190355397, 0.011489997309455175, -0.018768933687826294, -0.03309429243798535, -0.017296354292692415, -0.007401651172745169, 0.01659881599823338, -0.020409439865492467, -0.008002308631127879, -0.019440637092664973, 0.030071628978856534, -0.002683582861168243, -0.03673698952271214, 0.003384350051168509, -0.02354835846526762, -0.020448192199922995, -0.0018342661035888998, -0.011515832199075529, 0.002099071975967597, 0.012478176249497933, -0.012775275617486743, -0.00372343
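
With a multi-word placeholder such as `[happiness, hope]`, `replace_brackets` emits one quoted vector per word, separated by commas, whereas `<->` compares the column against a single vector. One possible way to obtain a single vector (an assumption, not something this notebook does) is to average the word embeddings element-wise. Embedding the whole phrase in one call would be another option. In the sketch below, `mean_vector` and `fill_with_mean` are hypothetical helpers, and `toy_embed` with its 3-d vectors stands in for `embeddings_model.embed_query`.

```python
import re
from typing import Callable, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise average of equally sized vectors."""
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def fill_with_mean(query: str, embed: Callable[[str], List[float]]) -> str:
    """Replace each [word, ...] placeholder with one averaged, quoted vector literal."""
    def replace(match: "re.Match[str]") -> str:
        words = [word.strip() for word in match.group(1).split(",")]
        return "'" + str(mean_vector([embed(word) for word in words])) + "'"

    return re.sub(r"\[([\w\s,]+)\]", replace, query)


# Toy stand-in for embeddings_model.embed_query
toy_vectors = {"happiness": [1.0, 0.0, 2.0], "hope": [0.0, 1.0, 2.0]}


def toy_embed(word: str) -> List[float]:
    return toy_vectors[word]


demo_sql = fill_with_mean(
    'SELECT "Name" FROM "Track" ORDER BY "embeddings" <-> [happiness, hope] LIMIT 3',
    toy_embed,
)
print(demo_sql)
```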

Let's handle everything using LCEL