**Before getting started with the code examples**
<br>For Python, you will need to create a virtual environment and install the following Python packages:

1. redis
2. pandas
3. sentence-transformers
4. (optional) tabulate; this package is used by Pandas to convert dataframe tables to Markdown



In [43]:
!pip3 install redis pandas sentence-transformers tabulate



# **Imports**

In [44]:
import json
import time

import numpy as np
import pandas as pd
import redis
import requests
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

# **Loading json bikes dataset**

In [45]:
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()
pd.DataFrame(bikes)

| | model | brand | price | type | specs | description |
|---|---|---|---|---|---|---|
| 0 | Jigger | Velorim | 270 | Kids bikes | {'material': 'aluminium', 'weight': '10'} | Small and powerful, the Jigger is the best rid... |
| 1 | Hillcraft | Bicyk | 1200 | Kids Mountain Bikes | {'material': 'carbon', 'weight': '11'} | Kids want to ride with as little weight as pos... |
| 2 | Chook air 5 | Nord | 815 | Kids Mountain Bikes | {'material': 'alloy', 'weight': '9.1'} | The Chook Air 5 gives kids aged six years and... |
| 3 | Eva 291 | Eva | 3400 | Mountain Bikes | {'material': 'carbon', 'weight': '9.1'} | The sister company to Nord, Eva launched in 20... |
| 4 | Kahuna | Noka Bikes | 3200 | Mountain Bikes | {'material': 'alloy', 'weight': '9.8'} | Whether you want to try your hand at XC racing... |
| 5 | XBN 2.1 Alloy | Breakout | 810 | Road Bikes | {'material': 'alloy', 'weight': '7.2'} | The XBN 2.1 Alloy is our entry-level road bike... |
| 6 | WattBike | ScramBikes | 2300 | eBikes | {'material': 'alloy', 'weight': '15'} | The WattBike is the best e-bike for people who... |
| 7 | Soothe Electric bike | Peaknetic | 1950 | eBikes | {'material': 'alloy', 'weight': '14.7'} | The Soothe is an everyday electric bike, from ... |
| 8 | Secto | Peaknetic | 430 | Commuter bikes | {'material': 'aluminium', 'weight': '10.0'} | If you struggle with stiff fingers or a kinked... |
| 9 | Summit | nHill | 1200 | Mountain Bike | {'material': 'alloy', 'weight': '11.3'} | This budget mountain bike from nHill performs ... |


Let's take a look at the structure of one of our bike JSON documents:

In [46]:
json.dumps(bikes[0], indent=2)

'{\n  "model": "Jigger",\n  "brand": "Velorim",\n  "price": 270,\n  "type": "Kids bikes",\n  "specs": {\n    "material": "aluminium",\n    "weight": "10"\n  },\n  "description": "Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids\\u2019 pedal bike on the market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring to go. We say rare because this smokin\\u2019 little bike is not ideal for a nervous first-time rider, but it\\u2019s a true giddy up for a true speedster. The Jigger is a 12 inch lightweight kids bicycle and it will meet your little one\\u2019s need for speed. It\\u2019s a single speed bike that makes learning to pump pedals simple and intuitive. It even has  a handle in the bottom of the saddle so you can easily help your child during training!  The Jigger is among the most lightweight children\\u2019s bikes on the planet. It is designed so that 2-3 year-olds fit com

# **Generating text embeddings using SentenceTransformers**
We will use the SentenceTransformers framework to generate embeddings for the bike descriptions. Sentence-BERT (SBERT) is a modification of the BERT model that produces consistent and contextually rich sentence embeddings. SBERT improves tasks like semantic search and text grouping by allowing efficient and meaningful comparison of sentence-level semantic similarity.

**Selecting a suitable pre-trained model**
<br>We must pick a model suited to the task at hand when generating embeddings. In our case, we want to query for bicycles using short sentences against the longer bicycle descriptions. This is referred to as asymmetric semantic search, which applies when the search query and the documents being searched differ in nature or structure. Suitable models for asymmetric semantic search include the pre-trained MS MARCO models, which are optimized for understanding real-world queries and retrieving relevant responses. They are widely used in search engines, chatbots, and other AI applications. At the time this tutorial was written, the highest performing MS MARCO model tuned for cosine similarity available in the SentenceTransformers package is msmarco-distilbert-base-v4.

Let's load the model using the SentenceTransformer class:


In [47]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v4')

Let's grab the description from the first bike in the JSON array:

In [48]:
from textwrap import TextWrapper

sample_description = bikes[0]['description']
wrapped_sample_description = TextWrapper(width=120).wrap(sample_description)
print(wrapped_sample_description)

['Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids’ pedal bike on the', 'market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring', 'to go. We say rare because this smokin’ little bike is not ideal for a nervous first-time rider, but it’s a true giddy', 'up for a true speedster. The Jigger is a 12 inch lightweight kids bicycle and it will meet your little one’s need for', 'speed. It’s a single speed bike that makes learning to pump pedals simple and intuitive. It even has  a handle in the', 'bottom of the saddle so you can easily help your child during training!  The Jigger is among the most lightweight', 'children’s bikes on the planet. It is designed so that 2-3 year-olds fit comfortably in a molded ride position that', 'allows for efficient riding, balanced handling and agility. The Jigger’s frame design and gears work together so your', 'buddingbiker can stand up out o

To generate the vector embeddings, we use the **encode** method:

In [49]:
embedding = embedder.encode(sample_description)

Let's take a peek at the first 5 elements of the generated vector:

In [50]:
print(embedding.tolist()[:5])

[0.20076504349708557, -0.13000667095184326, 0.3081606328487396, 0.20627866685390472, -0.3692346513271332]


Let's look at the length of the vector embeddings generated by the model.

In [51]:
print(len(embedding))

768


**Create a redis-py client and test the server**

In [52]:
client = redis.Redis(host='localhost', port=6379, decode_responses=True)
res = client.ping()
res

True

**Storing the bikes as JSON documents in Redis**

In [53]:
pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
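
The `{i:03}` format specifier in the cell above zero-pads the counter to three digits. The following is a minimal sketch in plain Python (no Redis connection needed) of why that is convenient: zero-padded keys sort lexicographically in the same order as their numeric ids. The unpadded variant is hypothetical and shown only for comparison.

```python
# Zero-padded keys, as generated by f"bikes:{i:03}" in the cell above.
padded_keys = [f"bikes:{i:03}" for i in range(1, 12)]
# Hypothetical unpadded variant, shown only for comparison.
unpadded_keys = [f"bikes:{i}" for i in range(1, 12)]

# Lexicographic sorting preserves numeric order only for the padded keys.
print(sorted(padded_keys)[:3])    # ['bikes:001', 'bikes:002', 'bikes:003']
print(sorted(unpadded_keys)[:3])  # ['bikes:1', 'bikes:10', 'bikes:11']
```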

Let's retrieve a specific value from one of the JSON bikes in Redis using a JSONPath expression:

In [54]:
res = client.json().get("bikes:009", "$.model")
res

['Secto']

**Vectorize all of the bike descriptions**

To vectorize all the descriptions in the database, we will first collect all the Redis keys for the bikes.

We'll use the keys as a parameter to the JSON.MGET command, along with the JSONPath expression $.description to collect the descriptions in a list. We will then pass the list to the encode method to get a list of vectorized embeddings:

In [55]:
keys = sorted(client.keys("bikes:*"))

#Create Embeddings
descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
print(VECTOR_DIMENSION)

768
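
When called with a JSONPath expression, JSON.MGET returns one list of matches per key, which is why the cell above flattens the response before encoding it. Here is a small sketch of that flattening step; the `mget_response` value is made-up stand-in data, not real server output.

```python
from itertools import chain

# Hypothetical stand-in for the JSON.MGET response: one list of matches per key.
mget_response = [["First bike description"], ["Second bike description"]]

# Same flattening as the nested comprehension in the cell above.
flat = [item for sublist in mget_response for item in sublist]

# Equivalent alternative using the standard library.
flat_alt = list(chain.from_iterable(mget_response))

print(flat)
```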


Now we can add the vectorized descriptions to the JSON documents in Redis using the JSON.SET command to insert a new field in each of the documents under the JSONPath **$.description_embeddings**.

Once again we'll do this using a pipeline:

In [56]:
pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()

[True, True, True, True, True, True, True, True, True, True, True]

Let's inspect one of the vectorized bike documents using the JSON.GET command:

In [57]:
res = client.json().get("bikes:010")
print(res)

{'model': 'Summit', 'brand': 'nHill', 'price': 1200, 'type': 'Mountain Bike', 'specs': {'material': 'alloy', 'weight': '11.3'}, 'description': 'This budget mountain bike from nHill performs well both on bike paths and on the trail. The fork with 100mm of travel absorbs rough terrain. Fat Kenda Booster tires give you grip in corners and on wet trails. The Shimano Tourney drivetrain offered enough gears for finding a comfortable pace to ride uphill, and the Tektro hydraulic disc brakes break smoothly. Whether you want an affordable bike that you can take to work, but also take trail riding on the weekends or you’re just after a stable, comfortable ride for the bike path, the Summit gives a good value for money.', 'description_embeddings': [-0.538114607334137, -0.49465855956077576, -0.025176964700222015, 0.6540342569351196, -0.062413860112428665, -0.689881443977356, -0.5430229902267456, -0.5903489589691162, 0.506131649017334, 0.20084935426712036, 0.8015637397766113, 1.068802833557129, 0.6

**Making the bikes collection searchable**

Redis Stack provides a powerful search engine that introduces commands to create and maintain search indexes for both collections of HASHES and JSON documents.

To create a search index for the bikes collection, we'll use the FT.CREATE command:



In [None]:
schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)

**Check the state of the index**

After the FT.CREATE command creates the index, the indexing process is automatically started in the background. In a short amount of time, our 11 JSON documents should be indexed and ready to be searched. To validate that, we use the FT.INFO command to check some information and statistics of the index. Of particular interest are the number of documents successfully indexed and the number of failures:

In [59]:
info = client.ft("idx:bikes_vss").info()
num_docs = info["num_docs"]
indexing_failures = info["hash_indexing_failures"]
print(info)
print(num_docs)
print(indexing_failures)

{'index_name': 'idx:bikes_vss', 'index_options': [], 'index_definition': ['key_type', 'JSON', 'prefixes', ['bikes:'], 'default_score', '1'], 'attributes': [['identifier', '$.model', 'attribute', 'model', 'type', 'TEXT', 'WEIGHT', '1', 'NOSTEM'], ['identifier', '$.brand', 'attribute', 'brand', 'type', 'TEXT', 'WEIGHT', '1', 'NOSTEM'], ['identifier', '$.price', 'attribute', 'price', 'type', 'NUMERIC'], ['identifier', '$.type', 'attribute', 'type', 'type', 'TAG', 'SEPARATOR', ','], ['identifier', '$.description', 'attribute', 'description', 'type', 'TEXT', 'WEIGHT', '1'], ['identifier', '$.description_embeddings', 'attribute', 'vector', 'type', 'VECTOR']], 'num_docs': '11', 'max_doc_id': '55', 'num_terms': '778', 'num_records': '5323', 'inverted_sz_mb': '0.030668258666992188', 'vector_index_sz_mb': '2.998809814453125', 'total_inverted_index_blocks': '787', 'offset_vectors_sz_mb': '0.005707740783691406', 'doc_table_size_mb': '7.867813110351563e-4', 'sortable_values_size_mb': '0', 'key_tabl
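
As the output above shows, the statistics come back as strings (for example `'num_docs': '11'`), so they need converting before any numeric comparison. Below is a minimal sketch of a readiness check. The helper name `index_is_ready` and the `sample_info` dictionary are illustrative assumptions, not part of redis-py.

```python
def index_is_ready(info, expected_docs):
    """Return True when all expected documents are indexed without failures."""
    # FT.INFO values may arrive as strings (e.g. '11'), so convert first.
    num_docs = int(info["num_docs"])
    failures = int(info["hash_indexing_failures"])
    return num_docs == expected_docs and failures == 0


# Hypothetical subset of an FT.INFO response, for illustration only.
sample_info = {"num_docs": "11", "hash_indexing_failures": "0"}
print(index_is_ready(sample_info, expected_docs=11))
```

In this notebook, the helper could be called as `index_is_ready(client.ft("idx:bikes_vss").info(), len(bikes))`.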

# **Structured data searches with Redis**
The index **idx:bikes_vss** indexes the structured fields of our JSON documents: model, brand, price, and type. It also indexes the unstructured free-form text in description and the generated embeddings in description_embeddings. Before we dive deeper into Vector Similarity Search (VSS), we need to understand the basics of querying a Redis index. The Redis command of interest is FT.SEARCH. Like a SQL select statement, an FT.SEARCH statement can be as simple or as complex as needed.

Let's try a few simple queries that give enough context to complete our VSS examples. For example, to retrieve all bikes where the brand is Peaknetic, we can use the following command:

In [None]:
query = Query("@brand:Peaknetic")
res = client.ft("idx:bikes_vss").search(query).docs
res

[Document {'id': 'bikes:008', 'payload': None, 'json': '{"model":"Soothe Electric bike","brand":"Peaknetic","price":1950,"type":"eBikes","specs":{"material":"alloy","weight":"14.7"},"description":"The Soothe is an everyday electric bike, from the makers of Exercycle  bikes, that conveys style while you get around the city. The Soothe lives up to its name by keeping your posture upright and relaxed for the ride ahead, keeping those aches and pains from riding at bay. It includes a low-step frame , our memory foam seat, bump-resistant shocks and conveniently placed thumb throttle. ","description_embeddings":[0.15016977488994598,-0.8277863264083862,-0.09484998136758804,0.3845418691635132,-0.1938203871250153,-0.12874773144721985,-0.35355475544929504,-0.5088202953338623,0.4719179272651672,-0.6207221150398254,0.12762244045734406,0.4462100863456726,-0.4208441376686096,-0.5660984516143799,-0.24973604083061215,-0.45723506808280945,-0.1304808109998703,0.05307814106345177,0.2264653593301773,0.359

This command will return all matching documents. With the inclusion of the vector embeddings, that's a little too verbose. If we wanted only to return specific fields from our JSON documents, for example, the document id, the brand, model and price, we could use:

In [60]:
query = Query("@brand:Peaknetic").return_fields("id", "brand", "model", "price")
res = client.ft("idx:bikes_vss").search(query).docs
res

[Document {'id': 'bikes:008', 'payload': None, 'brand': 'Peaknetic', 'model': 'Soothe Electric bike', 'price': '1950'},
 Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]

Let's say we only wanted bikes under $1000. We can add a numeric range clause to our query since the price field is indexed as NUMERIC:

In [61]:
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
    "id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
res

[Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]
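
Note that the range `[0 1000]` is inclusive at both ends, so a bike priced at exactly 1000 would also match. The query syntax lets you mark a bound as exclusive by prefixing it with `(`. The sketch below only builds the query strings; `price_range_clause` is a hypothetical helper written for this illustration.

```python
def price_range_clause(low, high, exclusive_high=False):
    """Build a numeric range clause for the indexed price field."""
    # A "(" prefix marks a bound as exclusive in the query syntax.
    upper = f"({high}" if exclusive_high else f"{high}"
    return f"@price:[{low} {upper}]"


print(price_range_clause(0, 1000))                       # @price:[0 1000]
print(price_range_clause(0, 1000, exclusive_high=True))  # @price:[0 (1000]
```

The resulting string can be combined with other clauses, for example `Query("@brand:Peaknetic " + price_range_clause(0, 1000, exclusive_high=True))`.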

# **Semantic searching with VSS**
Now that the bikes collection is stored and properly indexed in Redis, we want to query it using short query prompts. Let's put our queries in a list so we can execute them in bulk:

In [62]:
queries = [
    "Bike for small kids",
    "Best Mountain bikes for kids",
    "Cheap Mountain bike for kids",
    "Female specific mountain bike",
    "Road bike for beginners",
    "Commuter bike for people over 60",
    "Comfortable commuter bike",
    "Good bike for college students",
    "Mountain bike for beginners",
    "Vintage bike",
    "Comfortable city bike",
]

We need to encode the query prompts to query the database using VSS. Just like we did with the descriptions of the bikes, we'll use the SentenceTransformers model to encode the queries:

In [63]:
encoded_queries = embedder.encode(queries)
len(encoded_queries)

11

# **Constructing a pure K-nearest neighbors (KNN) VSS query**
We'll start with a KNN query. KNN is a foundational algorithm used in VSS, where the goal is to find the most similar items to a given query item. Using the chosen distance metric, the KNN algorithm calculates the distance between the query vector and each vector in the database. It then returns the K items with the smallest distances to the query vector. These are the most similar items.
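
To make the procedure concrete, here is a brute-force sketch of KNN in NumPy using cosine distance (1 minus cosine similarity). It is a conceptual illustration under simplifying assumptions, not how Redis implements the search; the function name `knn_cosine` and the toy vectors are invented for this example.

```python
import numpy as np


def knn_cosine(query_vector, vectors, k):
    """Return (index, distance) pairs for the k vectors closest to the query."""
    query = np.asarray(query_vector, dtype=np.float32)
    matrix = np.asarray(vectors, dtype=np.float32)
    # Cosine similarity = dot product divided by the product of the norms.
    similarity = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    distance = 1.0 - similarity  # cosine distance: smaller means more similar
    nearest = np.argsort(distance)[:k]
    return [(int(i), float(distance[i])) for i in nearest]


# Toy 2-dimensional vectors, invented for illustration.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(knn_cosine([1.0, 0.1], toy_vectors, k=2))
```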

The syntax for vector similarity KNN queries is (*)=>[<vector_similarity_query>] where the (*) (the * meaning all) is the filter query for the search engine. That way, one can reduce the search space by filtering the collection on which the KNN algorithm operates.

In [64]:
query = (
    Query('(*)=>[KNN 3 @vector $query_vector AS vector_score]')
     .sort_by('vector_score')
     .return_fields('vector_score', 'id', 'brand', 'model', 'description')
     .dialect(2)
)

We pass the vectorized query as $query_vector to the search function to execute the query. The following code shows an example of creating a NumPy array from a vectorized query prompt (encoded_query) as a single precision floating point array and converting it into a compact, byte-level representation that we can pass as a Redis parameter:

In [None]:
#client.ft(INDEX_NAME).search(query, { 'query_vector': np.array(encoded_query, dtype=np.float32).tobytes() }).docs
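
The byte conversion can be checked in isolation. The sketch below uses a toy 4-dimensional vector (invented for illustration; the embeddings in this tutorial have 768 dimensions) to show that each float32 value takes 4 bytes and that the bytes decode back to the original values. The dtype should match the `FLOAT32` type declared in the index schema.

```python
import numpy as np

# Toy 4-dimensional vector, invented for illustration.
encoded_query = [0.1, -0.2, 0.3, 0.4]

# Pack the vector as raw single-precision bytes.
blob = np.array(encoded_query, dtype=np.float32).tobytes()
print(len(blob))  # 16: four float32 values at 4 bytes each

# Decoding the bytes recovers the float32 values.
restored = np.frombuffer(blob, dtype=np.float32)
print(restored.shape)
```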

With the template for the query in place, we can use Python to execute all query prompts in a loop, passing the vectorized query prompts. Notice that for each result we calculate the similarity score as 1 - doc.vector_score. Because we use cosine "distance" as the metric, the items with the smallest "distance" are closest to our query, so subtracting the distance from 1 yields a score where higher means more similar.

We will then loop over the matched documents and create a list of results we can convert into a Pandas table to visualize the results:

In [70]:
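def create_query_table(query, queries, encoded_queries, extra_params=None):
    # Avoid a mutable default argument; fall back to an empty dict.
    extra_params = extra_params or {}
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        # Redis expects the query vector as raw float32 bytes.
        query_params = {
            "query_vector": np.array(encoded_query, dtype=np.float32).tobytes()
        }
        # The dict union operator (|) requires Python 3.9 or later.
        result_docs = (
            client.ft("idx:bikes_vss")
            .search(query, query_params | extra_params)
            .docs
        )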
        for doc in result_docs:
            vector_score = round(1 - float(doc.vector_score), 2)
            results_list.append(
                {
                    "query": queries[i],
                    "score": vector_score,
                    "id": doc.id,
                    "brand": doc.brand,
                    "model": doc.model,
                    "description": doc.description,
                }
            )

    # Build a Pandas table sorted by query, then by descending score
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(
        by=["query", "score"], ascending=[True, False], inplace=True
    )
    # Show each query text only on the first row of its group
    queries_table["query"] = queries_table.groupby("query")["query"].transform(
        lambda x: [x.iloc[0]] + [""] * (len(x) - 1)
    )
    # Truncate long descriptions to at most 500 characters
    queries_table["description"] = queries_table["description"].apply(
        lambda x: (x[:497] + "...") if len(x) > 500 else x
    )
    # Optional: render the table as Markdown (requires tabulate);
    # the returned string is not used here
    queries_table.to_markdown(index=False)
    return queries_table

The query results show the top 3 matches (our K parameter) for each query, along with the bike's id, brand, and model. For example, for the query "Best Mountain bikes for kids", the highest similarity score (0.54), and therefore the closest match, was the 'Nord' brand 'Chook air 5' bike model.

In [72]:
query = (
    Query("(*)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(query, queries, encoded_queries)

| | query | score | id | brand | model | description |
|---|---|---|---|---|---|---|
| 3 | Best Mountain bikes for kids | 0.54 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 4 | | 0.51 | bikes:010 | nHill | Summit | This budget mountain bike from nHill performs ... |
| 5 | | 0.46 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 0 | Bike for small kids | 0.52 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 1 | | 0.45 | bikes:007 | ScramBikes | WattBike | The WattBike is the best e-bike for people who... |
| 2 | | 0.41 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 6 | Cheap Mountain bike for kids | 0.49 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 7 | | 0.45 | bikes:010 | nHill | Summit | This budget mountain bike from nHill performs ... |
| 8 | | 0.39 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 30 | Comfortable city bike | 0.45 | bikes:007 | ScramBikes | WattBike | The WattBike is the best e-bike for people who... |


### **Hybrid queries**
<br>Pure KNN queries, as described in the previous section, evaluate a query against the whole space of vectors in a data collection. The larger the collection, the more computationally expensive the KNN search will be. But in the real world, unstructured data does not live in isolation, and users expecting rich search experiences need to be able to search via a combination of structured and unstructured data.

For example, users might arrive at your search interface with a brand preference in mind for the bikes dataset. Redis VSS queries can use this information to pre-filter the search space using a primary filter query. In the following query definition, we pre-filter using brand to consider only Peaknetic brand bikes. Before, our primary filter query was (*), in other words, everything. But now we can narrow the search space using (@brand:Peaknetic) before the KNN query.

Filtering by the Peaknetic brand, for which there are 2 bikes in our collection, we can see the results returned for each of the query prompts. The query with the highest returned similarity score is "Comfortable commuter bike", followed by "Road bike for beginners". By filtering on brand, we honor the users' preferences and shrink the KNN search space from 11 bikes to 2, a reduction of roughly 80%.

In [None]:
hybrid_query = (
    Query("(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
create_query_table(hybrid_query, queries, encoded_queries)
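
Conceptually, a hybrid query is a two-step process: filter first, then rank the survivors by vector distance. The NumPy sketch below mimics that idea on toy data. The brands and vectors are invented for illustration, and it does not reflect how Redis executes the query internally.

```python
import numpy as np

# Toy data, invented for illustration: one brand label and one vector per item.
brands = np.array(["Peaknetic", "Nord", "Peaknetic", "Eva"])
vectors = np.array(
    [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]], dtype=np.float32
)
query = np.array([1.0, 0.2], dtype=np.float32)

# Step 1: the primary filter narrows the candidate set.
candidates = np.flatnonzero(brands == "Peaknetic")

# Step 2: KNN by cosine distance runs over the surviving candidates only.
subset = vectors[candidates]
similarity = (subset @ query) / (
    np.linalg.norm(subset, axis=1) * np.linalg.norm(query)
)
order = np.argsort(1.0 - similarity)[:3]  # K = 3, but only 2 candidates exist
top_ids = candidates[order].tolist()
print(top_ids)
```

As with the Peaknetic filter in the query above, asking for K = 3 returns only two results because only two items pass the filter.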

### **Creating a VSS range query**

Range queries in VSS involve retrieving items within a specific distance from a query vector. In this case, we consider "distance" to be the measure of similarity we've used to build our search indexes; the smaller the distance, the more similar the items.

Let's say you want to find the bikes whose descriptions are within a certain distance from a query vector. We can use a range query to achieve this. For example, the query command to return the top 4 documents within a 0.55 radius of a vectorized query would be as follows:

In [None]:
range_query = (
    Query(
        "@vector:[VECTOR_RANGE $range $query_vector]=>{$YIELD_DISTANCE_AS: vector_score}"
    )
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .paging(0, 4)
    .dialect(2)
)
create_query_table(
    range_query, queries[:1], encoded_queries[:1], {"range": 0.55}
)
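
The same idea can be sketched in NumPy: compute the cosine distance to every vector, keep those inside the radius, sort by distance, and return the first few. This is a conceptual illustration only. The function name `range_search`, the toy vectors, and the inclusive treatment of the radius boundary are assumptions of this sketch.

```python
import numpy as np


def range_search(query_vector, vectors, radius, limit):
    """Return indices of vectors within `radius`, closest first, up to `limit`."""
    query = np.asarray(query_vector, dtype=np.float32)
    matrix = np.asarray(vectors, dtype=np.float32)
    similarity = (matrix @ query) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(query)
    )
    distance = 1.0 - similarity  # cosine distance
    # Keep only items inside the radius, then sort and page the result.
    inside = np.flatnonzero(distance <= radius)
    ranked = inside[np.argsort(distance[inside])]
    return ranked[:limit].tolist()


# Toy 2-dimensional vectors, invented for illustration.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
print(range_search([1.0, 0.0], toy_vectors, radius=0.55, limit=4))
```

Unlike KNN, the number of results is not fixed in advance: it depends on how many items fall inside the radius, capped here by `limit` in the same way `.paging(0, 4)` caps the Redis query.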