# Running Hybrid VSS Queries with Dragonfly and OpenAI

This notebook introduces Dragonfly as a vector database for OpenAI embeddings and shows how to run hybrid queries that combine vector similarity search (VSS) with lexical search. Dragonfly is a scalable, real-time database whose Query and Search capability lets you index and search vectors. The notebook walks through creating vectors with the OpenAI API, storing and indexing them in Dragonfly, and querying them.

Hybrid queries combine vector similarity with traditional Dragonfly Query and Search filtering on GEO, NUMERIC, TAG, or TEXT data, which simplifies application code. A common example in e-commerce is finding items that are visually similar to a given query image, limited to items available in a given geographic location and within a price range.

## Prerequisites

Before we start this project, we need to set up the following:

* Start Dragonfly
* Install libraries
* Get your [OpenAI API key](https://platform.openai.com/api-keys)

## Start Dragonfly

```bash
$ docker run -d -p 6379:6379 --name df docker.dragonflydb.io/dragonflydb/dragonfly
```

## Install Requirements

In [None]:
! pip install -r requirements.txt

## Prepare your OpenAI API key

The OpenAI API key is used to vectorize both the product data and the query text.

If you don't have an OpenAI API key, you can get one from [https://platform.openai.com/api-keys](https://platform.openai.com/api-keys).

Once you have your key, add it to your environment variables as `OPENAI_API_KEY` using one of the following approaches:

In [1]:
import os
import openai

# Set your OpenAI API key here
# Option 1: Set environment variable
# os.environ["OPENAI_API_KEY"] = "your-api-key-here"

# Option 2: Set directly in openai
# openai.api_key = "your-api-key-here"

# Option 3: Use getpass for interactive input
import getpass
if not os.getenv("OPENAI_API_KEY"):
    openai.api_key = getpass.getpass("OpenAI API Key:")
else:
    openai.api_key = os.getenv("OPENAI_API_KEY")
    print("OPENAI_API_KEY loaded from environment")

## Load data

In this section we'll load and clean an e-commerce dataset. We'll then generate embeddings with OpenAI, use the data to create an index in Dragonfly, and search for similar vectors.

In [2]:
import pandas as pd
import numpy as np
from typing import List
import openai

EMBEDDING_MODEL = "text-embedding-3-small"

# Simple embedding function compatible with openai==0.28.1
def get_embeddings(list_of_text: List[str], model="text-embedding-3-small") -> List[List[float]]:
    assert len(list_of_text) <= 2048, "The batch size should not be larger than 2048."
    # replace newlines, which can negatively affect performance.
    list_of_text = [text.replace("\n", " ") for text in list_of_text]
    data = openai.Embedding.create(input=list_of_text, model=model)["data"]
    return [d["embedding"] for d in data]

# load in data and clean data types and drop null rows
df = pd.read_csv("../../data/styles_2k.csv", on_bad_lines='skip')
df.dropna(inplace=True)
df["year"] = df["year"].astype(int)
df.info()

# print dataframe
n_examples = 5
df.head(n_examples)

<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  1978 non-null   int64 
 1   gender              1978 non-null   object
 2   masterCategory      1978 non-null   object
 3   subCategory         1978 non-null   object
 4   articleType         1978 non-null   object
 5   baseColour          1978 non-null   object
 6   season              1978 non-null   object
 7   year                1978 non-null   int64 
 8   usage               1978 non-null   object
 9   productDisplayName  1978 non-null   object
dtypes: int64(2), object(8)
memory usage: 170.0+ KB


|   | id | gender | masterCategory | subCategory | articleType | baseColour | season | year | usage | productDisplayName |
|---|----|--------|----------------|-------------|-------------|------------|--------|------|-------|--------------------|
| 0 | 15970 | Men | Apparel | Topwear | Shirts | Navy Blue | Fall | 2011 | Casual | Turtle Check Men Navy Blue Shirt |
| 1 | 39386 | Men | Apparel | Bottomwear | Jeans | Blue | Summer | 2012 | Casual | Peter England Men Party Blue Jeans |
| 2 | 59263 | Women | Accessories | Watches | Watches | Silver | Winter | 2016 | Casual | Titan Women Silver Watch |
| 3 | 21379 | Men | Apparel | Bottomwear | Track Pants | Black | Fall | 2011 | Casual | Manchester United Men Solid Black Track Pants |
| 4 | 53759 | Men | Apparel | Topwear | Tshirts | Grey | Summer | 2012 | Casual | Puma Men Grey T-shirt |


In [3]:
df["product_text"] = df.apply(lambda row: f"name {row['productDisplayName']} category {row['masterCategory']} subcategory {row['subCategory']} color {row['baseColour']} gender {row['gender']}".lower(), axis=1)
df.rename({"id":"product_id"}, inplace=True, axis=1)

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1978 entries, 0 to 1998
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   product_id          1978 non-null   int64 
 1   gender              1978 non-null   object
 2   masterCategory      1978 non-null   object
 3   subCategory         1978 non-null   object
 4   articleType         1978 non-null   object
 5   baseColour          1978 non-null   object
 6   season              1978 non-null   object
 7   year                1978 non-null   int64 
 8   usage               1978 non-null   object
 9   productDisplayName  1978 non-null   object
 10  product_text        1978 non-null   object
dtypes: int64(2), object(9)
memory usage: 185.4+ KB


In [4]:
# check out one of the texts we will use to create semantic embeddings
df["product_text"][0]

'name turtle check men navy blue shirt category apparel subcategory topwear color navy blue gender men'

## Connect to Dragonfly

Now that the Dragonfly process is running, we can connect to it. We will use the default host and port, `localhost:6379`.


In [5]:
import redis
from redis.commands.search.indexDefinition import (
    IndexDefinition,
    IndexType
)
from redis.commands.search.query import Query
from redis.commands.search.field import (
    TagField,
    NumericField,
    TextField,
    VectorField
)

# Connect to Dragonfly
df_client = redis.Redis(
    host="localhost",
    port=6379,
    password=""
)

df_client.ping()

True

## Creating a Search Index in Dragonfly

The cells below show how to specify and create a search index in Dragonfly. We will:

1. Set some constants for defining our index like the distance metric and the index name
2. Define the index schema with fields
3. Create the index
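
For orientation, the three steps above correspond roughly to a single RediSearch-style `FT.CREATE` command. The sketch below only assembles an argument list in plain Python so the shape of the schema is visible. The helper name `ft_create_args` is hypothetical, the arguments redis-py actually sends may differ in detail, and the options Dragonfly accepts should be checked against its documentation.

```python
from typing import Dict, List, Union

Arg = Union[str, int]


def ft_create_args(
    index_name: str,
    prefix: str,
    scalar_fields: Dict[str, str],
    vector_field: str,
    vector_attrs: Dict[str, Arg],
    algorithm: str = "FLAT",
) -> List[Arg]:
    """Assemble an FT.CREATE-style argument list (illustrative only)."""
    args: List[Arg] = [
        "FT.CREATE", index_name, "ON", "HASH", "PREFIX", 1, prefix, "SCHEMA",
    ]
    # Each scalar field contributes "<name> <TYPE>".
    for name, field_type in scalar_fields.items():
        args += [name, field_type]
    # The vector field lists its algorithm, the number of attribute tokens
    # that follow, and then the attribute key/value pairs.
    args += [vector_field, "VECTOR", algorithm, 2 * len(vector_attrs)]
    for key, value in vector_attrs.items():
        args += [key, value]
    return args


args = ft_create_args(
    index_name="product_embeddings",
    prefix="doc",
    scalar_fields={
        "productDisplayName": "TEXT",
        "masterCategory": "TAG",
        "year": "NUMERIC",
    },
    vector_field="product_vector",
    vector_attrs={"TYPE": "FLOAT32", "DIM": 1536, "DISTANCE_METRIC": "L2"},
)
print(" ".join(str(a) for a in args))
```

The notebook itself builds the same information through redis-py field objects, which keeps the schema definition in Python rather than in a command string.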

In [6]:
# Constants
INDEX_NAME = "product_embeddings"         # name of the search index
PREFIX = "doc"                            # prefix for the document keys
DISTANCE_METRIC = "L2"                    # distance metric for the vectors (ex. COSINE, IP, L2)
NUMBER_OF_VECTORS = len(df)

In [7]:
# Define fields for each of the columns in the dataset
name = TextField(name="productDisplayName")
category = TagField(name="masterCategory")
articleType = TagField(name="articleType")
gender = TagField(name="gender")
season = TagField(name="season")
year = NumericField(name="year")
text_embedding = VectorField("product_vector",
    "FLAT", {
        "TYPE": "FLOAT32",
        "DIM": 1536,
        "DISTANCE_METRIC": DISTANCE_METRIC,
        "INITIAL_CAP": NUMBER_OF_VECTORS,
    }
)
fields = [name, category, articleType, gender, season, year, text_embedding]

In [8]:
# Check if index exists
try:
    df_client.ft(INDEX_NAME).info()
    print("Index already exists")
except redis.exceptions.ResponseError:
    # The server reports an error for an unknown index, so create it
    df_client.ft(INDEX_NAME).create_index(
        fields=fields,
        definition=IndexDefinition(prefix=[PREFIX], index_type=IndexType.HASH),
    )

## Generate OpenAI Embeddings and Load Documents into the Index

Now that we have a search index, we can load documents into it. We will use the dataframe containing the styles dataset loaded previously. In Dragonfly, either the HASH or JSON data types can be used to store documents. We will use the HASH data type in this example. The cells below will show how to get OpenAI embeddings for the different products and load documents into the index.
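
Embedding requests are sent in batches rather than one text at a time. The following is a minimal sketch of the chunking step using only the standard library. The helper name `batched` is hypothetical, and the batch size is an arbitrary value below the 2048-input limit asserted in `get_embeddings`. A generator like this never yields an empty batch, so no request is sent with empty input.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder texts standing in for the product descriptions.
texts = [f"product {i}" for i in range(2500)]
sizes = [len(batch) for batch in batched(texts, 1000)]
print(sizes)  # [1000, 1000, 500]
```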

In [9]:
# Use OpenAI get_embeddings batch requests to speed up embedding creation
def embeddings_batch_request(documents: pd.DataFrame):
    records = documents.to_dict("records")
    print("Records to process: ", len(records))
    product_vectors = []
    docs = []
    batchsize = 1000

    for idx, doc in enumerate(records, start=1):
        # collect the text to embed; send a request once a full batch has accumulated
        docs.append(doc["product_text"])
        if idx % batchsize == 0:
            product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
            docs.clear()
            print("Vectors processed ", len(product_vectors), end='\r')
    # embed the final partial batch, if any, to avoid sending an empty request
    if docs:
        product_vectors += get_embeddings(docs, EMBEDDING_MODEL)
        print("Vectors processed ", len(product_vectors), end='\r')
    return product_vectors

In [10]:
def index_documents(client: redis.Redis, prefix: str, documents: pd.DataFrame):
    product_vectors = embeddings_batch_request(documents)
    records = documents.to_dict("records")
    batchsize = 500

    pipe = client.pipeline()
    for idx,doc in enumerate(records,start=1):
        key = f"{prefix}:{str(doc['product_id'])}"

        # create byte vectors
        text_embedding = np.array((product_vectors[idx-1]), dtype=np.float32).tobytes()

        # replace list of floats with byte vectors
        doc["product_vector"] = text_embedding

        pipe.hset(key, mapping = doc)
        if idx % batchsize == 0:
            pipe.execute()
    pipe.execute()

In [11]:
%%time
index_documents(df_client, PREFIX, df)
print(f"Loaded {df_client.info()['db0']['keys']} documents in Dragonfly search index with name: {INDEX_NAME}")

Records to process:  1978
Loaded 1978 documents in Dragonfly search index with name: product_embeddings
CPU times: user 546 ms, sys: 143 ms, total: 689 ms
Wall time: 5.29 s
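
Each embedding is written to its hash as raw bytes: 4 bytes per float32 component, so a 1536-dimensional vector occupies 6144 bytes. The sketch below illustrates that round trip with the standard-library `struct` module in place of NumPy. It assumes little-endian byte order, which is what a float32 NumPy array's `tobytes()` produces on typical x86 and ARM machines. The helper names are hypothetical.

```python
import struct
from typing import List


def to_float32_bytes(vector: List[float]) -> bytes:
    """Pack floats as consecutive little-endian 32-bit values."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob: bytes) -> List[float]:
    """Unpack a byte string produced by to_float32_bytes."""
    count = len(blob) // 4  # 4 bytes per float32 component
    return list(struct.unpack(f"<{count}f", blob))


# These toy values are exactly representable in float32.
vector = [0.5, -0.25, 1.0, 0.0]
blob = to_float32_bytes(vector)
print(len(blob))                 # 16
print(from_float32_bytes(blob))  # [0.5, -0.25, 1.0, 0.0]
```

The byte length is a quick sanity check: if a stored vector is not `DIM * 4` bytes long, it does not match the index definition.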


## Simple Vector Search Queries with OpenAI Query Embeddings

Now that we have a search index and documents loaded into it, we can run search queries. Below we will provide a function that will run a search query and return the results. Using this function we run a few queries that will show how you can utilize Dragonfly as a vector database.

In [12]:
def search_df(
    df_client: redis.Redis,
    user_query: str,
    index_name: str = "product_embeddings",
    vector_field: str = "product_vector",
    return_fields: list = ["productDisplayName", "masterCategory", "gender", "season", "year", "vector_score"],
    hybrid_fields = "*",
    k: int = 20,
    print_results: bool = True,
) -> List[dict]:

    # Use OpenAI to create embedding vector from user query
    embedded_query = openai.Embedding.create(input=user_query,
                                            model="text-embedding-3-small",
                                            )["data"][0]['embedding']

    # Prepare the Query
    base_query = f'{hybrid_fields}=>[KNN {k} @{vector_field} $vector AS vector_score]'
    query = (
        Query(base_query)
         .return_fields(*return_fields)
         .sort_by("vector_score")
         .paging(0, k)
         .dialect(2)
    )
    params_dict = {"vector": np.array(embedded_query).astype(dtype=np.float32).tobytes()}

    # perform vector search
    results = df_client.ft(index_name).search(query, params_dict)
    if print_results:
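        for i, product in enumerate(results.docs):
            # vector_score is a distance (smaller is closer); 1 - distance is a cosine
            # similarity only under the COSINE metric, so with L2 it can be negative
            score = 1 - float(product.vector_score)
            print(f"{i}. {product.productDisplayName} (Score: {round(score, 3)})")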
    return results.docs

In [13]:
# Execute a simple vector search in Dragonfly
results = search_df(df_client, 'man blue jeans', k=10)

0. Locomotive Men Washed Blue Jeans (Score: 0.205)
1. Locomotive Men Washed Blue Jeans (Score: 0.205)
2. French Connection Men Blue Jeans (Score: 0.181)
3. John Players Men Blue Jeans (Score: 0.178)
4. Denizen Women Blue Jeans (Score: 0.17)
5. Lee Men Blue Chicago Fit Jeans (Score: 0.159)
6. Lee Men Blue Chicago Fit Jeans (Score: 0.159)
7. Peter England Men Party Blue Jeans (Score: 0.156)
8. Levis Kids Blue Solid Jean (Score: 0.145)
9. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.144)
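
The printed score is `1 - vector_score`, where `vector_score` is the distance returned by the index. That expression equals cosine similarity only when the index uses the `COSINE` metric. With `L2`, as configured in this notebook, it is just a decreasing transform of the distance, which is how negative scores can appear. Here is a small pure-Python sketch of the metrics on toy vectors. The function names are illustrative, and nothing is assumed about whether a given engine reports plain or squared L2 distance.

```python
import math
from typing import Sequence


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return 1.0 - inner_product(a, b) / (norm_a * norm_b)


# Toy unit-length vectors: one close to the query, one pointing the opposite way.
query = [1.0, 0.0]
near = [0.6, 0.8]
far = [-1.0, 0.0]

for name, vec in [("near", near), ("far", far)]:
    l2 = l2_distance(query, vec)
    cos = cosine_distance(query, vec)
    print(name, "cosine distance:", round(cos, 3),
          "L2 distance:", round(l2, 3),
          "1 - L2:", round(1 - l2, 3))
```

For unit-length vectors the two metrics rank neighbours identically, since the squared L2 distance equals 2 minus twice the cosine similarity. Only the printed numbers differ.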


## Hybrid Queries with Dragonfly

The previous examples showed how to run vector search queries. In this section, we will show how to combine vector search with filters on other fields for hybrid search. In the example below, we will combine vector search with full-text search.
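
Structurally, a hybrid query string has two parts: a filter expression that selects candidate documents, followed by `=>` and a KNN clause that ranks those candidates by vector distance. The sketch below assembles that string in the same way `search_df` does. The helper name `build_hybrid_query` is hypothetical, and the field names come from the index schema defined earlier.

```python
from typing import Iterable


def build_hybrid_query(
    filters: Iterable[str] = (),
    k: int = 10,
    vector_field: str = "product_vector",
    score_alias: str = "vector_score",
) -> str:
    """Combine filter clauses with a KNN clause into one query string."""
    clauses = [clause for clause in filters if clause]
    if not clauses:
        prefilter = "*"  # no filter: every document is a candidate
    elif len(clauses) == 1:
        prefilter = clauses[0]
    else:
        # space-separated clauses inside parentheses are combined with AND
        prefilter = "(" + " ".join(clauses) + ")"
    return f"{prefilter}=>[KNN {k} @{vector_field} $vector AS {score_alias}]"


print(build_hybrid_query())
# *=>[KNN 10 @product_vector $vector AS vector_score]
print(build_hybrid_query(["@year:[2011 2012]", "@season:{Summer}"]))
# (@year:[2011 2012] @season:{Summer})=>[KNN 10 @product_vector $vector AS vector_score]
```

The cells that follow pass such filter expressions through the `hybrid_fields` argument of `search_df`.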

In [14]:
# improve search quality by combining the vector query for "man blue jeans" with a full-text filter for "blue jeans" (unquoted, so the terms need not be adjacent)
results = search_df(df_client,
                    "man blue jeans",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='@productDisplayName:blue jeans'
                    )

0. Locomotive Men Washed Blue Jeans (Score: 0.205)
1. Locomotive Men Washed Blue Jeans (Score: 0.205)
2. French Connection Men Blue Jeans (Score: 0.181)
3. John Players Men Blue Jeans (Score: 0.178)
4. Denizen Women Blue Jeans (Score: 0.17)
5. Lee Men Blue Chicago Fit Jeans (Score: 0.159)
6. Lee Men Blue Chicago Fit Jeans (Score: 0.159)
7. Peter England Men Party Blue Jeans (Score: 0.156)
8. Palm Tree Kids Boy Washed Blue Jeans (Score: 0.144)
9. Lee Men Tino Blue Jeans (Score: 0.136)


In [15]:
# hybrid query for shirt in the product vector, keeping only results whose title matches the text filter "slim fit"
results = search_df(df_client,
                    "shirt",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='@productDisplayName:slim fit'
                    )

0. Basics Men White Slim Fit Striped Shirt (Score: -0.106)
1. ADIDAS Men's Slim Fit White T-shirt (Score: -0.126)
2. Basics Men Red Slim Fit Checked Shirt (Score: -0.135)
3. Basics Men Navy Slim Fit Checked Shirt (Score: -0.142)
4. Basics Men Blue Slim Fit Checked Shirt (Score: -0.143)
5. Basics Men Blue Slim Fit Checked Shirt (Score: -0.143)
6. Tokyo Talkies Women Navy Slim Fit Jeans (Score: -0.174)
7. Lee Rinse Navy Blue Slim Fit Jeans (Score: -0.177)


In [16]:
# hybrid query for watch in the product vector and only include results with the tag "Accessories" in the masterCategory field
results = search_df(df_client,
                    "watch",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='@masterCategory:{Accessories}'
                    )

0. Q&Q Women Blue Watch (Score: -0.121)
1. Q&Q Men Silver-Toned Dial Analogue Watch Q252J404Y (Score: -0.123)
2. Q&Q Women White Dial Watch (Score: -0.126)
3. Citizen Men Silver Dial Watch (Score: -0.13)
4. Q&Q Unisex Blue Dial Watch (Score: -0.131)
5. Q&Q Men Black Dial Watch (Score: -0.133)
6. Q&Q Men Black Dial Watch (Score: -0.133)
7. Q&Q Men Black Dial Watch (Score: -0.133)
8. Q&Q Men Black Dial Watch (Score: -0.133)
9. Q&Q Men Black Dial Watch (Score: -0.133)


In [17]:
# hybrid query for sandals in the product vector and only include results within the 2011-2012 year range
results = search_df(df_client,
                    "sandals",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='@year:[2011 2012]'
                    )

0. Red Tape Men Black Sandals (Score: 0.061)
1. Coolers Men Black Sandals (Score: 0.056)
2. Coolers Men Black Sandals (Score: 0.056)
3. Gliders Men Yellow Sandals (Score: 0.043)
4. Coolers Men Black Sports Sandals (Score: 0.041)
5. Rocia Women Casual Black Sandal (Score: 0.031)
6. Ganuchi Men Casual Black Sandals (Score: 0.031)
7. Rocia Women Maroon Sandals (Score: 0.029)
8. Rocia Women Maroon Sandals (Score: 0.029)
9. Rocia Women Black & Brown Sandals (Score: 0.027)


In [18]:
# hybrid query for blue sandals in the product vector and only include results within the 2011-2012 year range from the summer season
results = search_df(df_client,
                    "blue sandals",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='(@year:[2011 2012] @season:{Summer})'
                    )

0. Red Tape Men Black Sandals (Score: 0.086)
1. Coolers Men Black Sandals (Score: 0.06)
2. Coolers Men Black Sandals (Score: 0.06)
3. Enroute Teens Orange Sandals (Score: 0.058)
4. Enroute Teens Brown Sandals (Score: 0.052)
5. Rocia Women Brown Sandals (Score: 0.051)
6. Puma Women Purple Techno Sandals (Score: 0.05)
7. Coolers Men Black Sports Sandals (Score: 0.046)
8. Enroute Kids Girls Pink Sandals (Score: 0.044)
9. Ganuchi Men Casual Olive Sandals (Score: 0.043)


In [19]:
# hybrid query for a brown belt, filtering results by year (NUMERIC), by specific article types (TAG), and by brand name (TEXT)
results = search_df(df_client,
                    "brown belt",
                    vector_field="product_vector",
                    k=10,
                    hybrid_fields='(@year:[2012 2012] @articleType:{Shirts | Belts} @productDisplayName:"Wrangler")'
                    )

0. Wrangler Women Black Belt (Score: 0.03)
1. Wrangler Men Leather Brown Belt (Score: -0.002)
2. Wrangler Men Green Striped Shirt (Score: -0.194)
3. Wrangler Men Griffith White Shirt (Score: -0.209)
4. Wrangler Men Purple Striped Shirt (Score: -0.214)
5. Wrangler Women Stella Green Shirt (Score: -0.245)
