<a href="https://colab.research.google.com/github/osaeed-ds/vector-hello/blob/main/Osaeed_Redis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Redis as a Vector Database**


This is a hello-world exercise that follows the Vector Search quickstart on the Redis website:

https://redis.io/docs/interact/search-and-query/search/vectors/

## **Prerequisites**

In [6]:
%pip install redis sentence-transformers datasets tabulate ipywidgets


In [7]:
import json
import time

import numpy as np
import pandas as pd
import redis
import requests
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
from sentence_transformers import SentenceTransformer

## **Load Dataset**
A sample dataset about bicycles.

In [8]:
url = "https://raw.githubusercontent.com/bsbodden/redis_vss_getting_started/main/data/bikes.json"
response = requests.get(url)
bikes = response.json()

In [9]:
json.dumps(bikes[0], indent=2)

'{\n  "model": "Jigger",\n  "brand": "Velorim",\n  "price": 270,\n  "type": "Kids bikes",\n  "specs": {\n    "material": "aluminium",\n    "weight": "10"\n  },\n  "description": "Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids\\u2019 pedal bike on the market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring to go. We say rare because this smokin\\u2019 little bike is not ideal for a nervous first-time rider, but it\\u2019s a true giddy up for a true speedster. The Jigger is a 12 inch lightweight kids bicycle and it will meet your little one\\u2019s need for speed. It\\u2019s a single speed bike that makes learning to pump pedals simple and intuitive. It even has  a handle in the bottom of the saddle so you can easily help your child during training!  The Jigger is among the most lightweight children\\u2019s bikes on the planet. It is designed so that 2-3 year-olds fit com

## **Embedding Engine**
We will use sentence_transformers to generate our embeddings.  

In [10]:
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer('msmarco-distilbert-base-v4')


In [13]:
from textwrap import TextWrapper

sample_description = bikes[0]['description']
wrapped_sample_description = TextWrapper(width=120).wrap(sample_description)
print(wrapped_sample_description)
['Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids’ pedal bike on the',
 'market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring',
 'to go. We say rare because this smokin’ little bike is not ideal for a nervous first-time rider, but it’s a true giddy']

['Small and powerful, the Jigger is the best ride for the smallest of tikes! This is the tiniest kids’ pedal bike on the', 'market available without a coaster brake, the Jigger is the vehicle of choice for the rare tenacious little rider raring', 'to go. We say rare because this smokin’ little bike is not ideal for a nervous first-time rider, but it’s a true giddy', 'up for a true speedster. The Jigger is a 12 inch lightweight kids bicycle and it will meet your little one’s need for', 'speed. It’s a single speed bike that makes learning to pump pedals simple and intuitive. It even has  a handle in the', 'bottom of the saddle so you can easily help your child during training!  The Jigger is among the most lightweight', 'children’s bikes on the planet. It is designed so that 2-3 year-olds fit comfortably in a molded ride position that', 'allows for efficient riding, balanced handling and agility. The Jigger’s frame design and gears work together so your', 'buddingbiker can stand up out o


In [14]:
embedding = embedder.encode(sample_description)
print(embedding.tolist()[:5])

[0.20076492428779602, -0.1300070732831955, 0.3081614375114441, 0.20627953112125397, -0.3692361116409302]


In [15]:
print(len(embedding))



768


## **Connect to Redis**
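Set up a database in the Redis UI first, then note its host, port, and password. The password is read interactively below so that it is not stored in the notebook.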

In [2]:
from getpass import getpass
# Input your Redis password
REDIS_PASSWORD = getpass('Your Redis password: ')

In [4]:
import redis

client = redis.Redis(
  host='redis-15330.c280.us-central1-2.gce.cloud.redislabs.com',
  port=15330,
  password=REDIS_PASSWORD)

res = client.ping()
display(res)

True

## **Load Data into Redis and build the index**


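Load each bike into Redis as a JSON document stored under a key with the `bikes:` prefix. Redis has no tables; the shared key prefix is what groups the documents.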

In [16]:
pipeline = client.pipeline()
for i, bike in enumerate(bikes, start=1):
    redis_key = f"bikes:{i:03}"
    pipeline.json().set(redis_key, "$", bike)
res = pipeline.execute()
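
The `:03` format spec in the key zero-pads the counter to three digits. That matters because a later cell sorts the keys as strings. A minimal sketch of the difference, using made-up records (`sample_bikes` is an assumption, not the real dataset):

```python
# Hypothetical stand-in records; the notebook itself iterates over `bikes`.
sample_bikes = [{"model": f"Bike {n}"} for n in range(1, 13)]

# Same key pattern as above: prefix "bikes:" plus a zero-padded counter.
padded_keys = [f"bikes:{i:03}" for i, _ in enumerate(sample_bikes, start=1)]
unpadded_keys = [f"bikes:{i}" for i, _ in enumerate(sample_bikes, start=1)]

# Zero-padded keys keep their numeric order when sorted as strings.
print(sorted(padded_keys)[:3])    # ['bikes:001', 'bikes:002', 'bikes:003']
# Unpadded keys do not: "bikes:10" sorts before "bikes:2".
print(sorted(unpadded_keys)[:3])  # ['bikes:1', 'bikes:10', 'bikes:11']
```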

Verify the load by reading one field back from a single document.

In [17]:
res = client.json().get("bikes:010", "$.model")
res

['Summit']

Generate embeddings from the bike descriptions. The keys are sorted so that each description lines up with its key.

In [18]:
keys = sorted(client.keys("bikes:*"))
descriptions = client.json().mget(keys, "$.description")
descriptions = [item for sublist in descriptions for item in sublist]
embedder = SentenceTransformer("msmarco-distilbert-base-v4")
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()
VECTOR_DIMENSION = len(embeddings[0])
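
The flattening step is needed because a JSONPath such as `$.description` can match several values, so the reply holds one inner list per key. A small sketch of that step with a hard-coded stand-in for the reply (`mget_reply` is an assumption for illustration; no Redis connection is used):

```python
from itertools import chain

# Hypothetical reply shaped like client.json().mget(keys, "$.description"):
# one inner list per key, each holding the values matched by the path.
mget_reply = [["Small and powerful ..."], ["This budget mountain bike ..."]]

# Equivalent to the nested list comprehension used in the cell above.
flat = list(chain.from_iterable(mget_reply))
print(flat)
```

Either spelling works. The comprehension in the notebook avoids the extra import.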

Store each embedding in its JSON document under a new field, `description_embeddings`.

In [19]:
pipeline = client.pipeline()
for key, embedding in zip(keys, embeddings):
    pipeline.json().set(key, "$.description_embeddings", embedding)
pipeline.execute()

[True, True, True, True, True, True, True, True, True, True, True]

In [21]:
res = client.json().get("bikes:010")
display(res)

{'model': 'Summit',
 'brand': 'nHill',
 'price': 1200,
 'type': 'Mountain Bike',
 'specs': {'material': 'alloy', 'weight': '11.3'},
 'description': 'This budget mountain bike from nHill performs well both on bike paths and on the trail. The fork with 100mm of travel absorbs rough terrain. Fat Kenda Booster tires give you grip in corners and on wet trails. The Shimano Tourney drivetrain offered enough gears for finding a comfortable pace to ride uphill, and the Tektro hydraulic disc brakes break smoothly. Whether you want an affordable bike that you can take to work, but also take trail riding on the weekends or you’re just after a stable, comfortable ride for the bike path, the Summit gives a good value for money.',
 'description_embeddings': [-0.5381147265434265,
  -0.4946592152118683,
  -0.025176815688610077,
  0.6540350914001465,
  -0.06241396814584732,
  -0.6898809671401978,
  -0.5430220365524292,
  -0.5903496146202087,
  0.5061324238777161,
  0.20084969699382785,
  0.8015643954277

Build the index over all keys with the `bikes:` prefix. The index covers the standard fields as well as the vector field. The distance metric is set when the vector field is defined; here it is `COSINE`.

In [22]:
schema = (
    TextField("$.model", no_stem=True, as_name="model"),
    TextField("$.brand", no_stem=True, as_name="brand"),
    NumericField("$.price", as_name="price"),
    TagField("$.type", as_name="type"),
    TextField("$.description", as_name="description"),
    VectorField(
        "$.description_embeddings",
        "FLAT",
        {
            "TYPE": "FLOAT32",
            "DIM": VECTOR_DIMENSION,
            "DISTANCE_METRIC": "COSINE",
        },
        as_name="vector",
    ),
)
definition = IndexDefinition(prefix=["bikes:"], index_type=IndexType.JSON)
res = client.ft("idx:bikes_vss").create_index(
    fields=schema, definition=definition
)
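
With `DISTANCE_METRIC` set to `COSINE`, the score reported for a match is a cosine distance, that is, 1 minus the cosine similarity. The sketch below computes the same quantity with NumPy on toy vectors. The function name `cosine_distance` and the sample vectors are assumptions for illustration, not part of the Redis API.

```python
import numpy as np

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two 1-D vectors."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 1.0 - similarity

same_direction = cosine_distance([1.0, 2.0], [2.0, 4.0])  # close to 0
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])      # close to 1
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])       # close to 2
print(same_direction, orthogonal, opposite)
```

Smaller distances mean closer matches, which is why the vector queries later sort by the score in ascending order.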

Rerun the cell below to monitor the index build. It is complete when `num_docs` reaches 11 and `hash_indexing_failures` is still 0.

In [25]:
info = client.ft("idx:bikes_vss").info()
num_docs = info["num_docs"]
indexing_failures = info["hash_indexing_failures"]
display(info)
display(num_docs)
display(indexing_failures)

{'index_name': 'idx:bikes_vss',
 'index_options': [],
 'index_definition': [b'key_type',
  b'JSON',
  b'prefixes',
  [b'bikes:'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'$.model',
   b'attribute',
   b'model',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1',
   b'NOSTEM'],
  [b'identifier',
   b'$.brand',
   b'attribute',
   b'brand',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1',
   b'NOSTEM'],
  [b'identifier', b'$.price', b'attribute', b'price', b'type', b'NUMERIC'],
  [b'identifier',
   b'$.type',
   b'attribute',
   b'type',
   b'type',
   b'TAG',
   b'SEPARATOR',
   b','],
  [b'identifier',
   b'$.description',
   b'attribute',
   b'description',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'$.description_embeddings',
   b'attribute',
   b'vector',
   b'type',
   b'VECTOR']],
 'num_docs': '11',
 'max_doc_id': '11',
 'num_terms': '778',
 'num_records': '1069',
 'inverted_sz_mb': '0.006133079528808594',
 'vector_index_sz_mb': '2.9910

'11'

'0'
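
Instead of rerunning the cell by hand, the check can be wrapped in a polling loop. The helper below is a hypothetical sketch, not part of redis-py. It takes any zero-argument callable that returns a dict with the `num_docs` and `hash_indexing_failures` keys shown above, so it can be tried here with a fake source in place of a live connection.

```python
import time

def wait_for_index(get_info, expected_docs, timeout=30.0, interval=0.5):
    """Poll get_info() until num_docs reaches expected_docs or timeout expires.

    get_info is any zero-argument callable returning a dict shaped like the
    output above, e.g. lambda: client.ft("idx:bikes_vss").info().
    """
    deadline = time.monotonic() + timeout
    while True:
        info = get_info()
        # The values appear as strings in the output above, so cast them.
        num_docs = int(info["num_docs"])
        failures = int(info["hash_indexing_failures"])
        if failures:
            raise RuntimeError(f"{failures} document(s) failed to index")
        if num_docs >= expected_docs:
            return num_docs
        if time.monotonic() >= deadline:
            raise TimeoutError(f"only {num_docs}/{expected_docs} documents indexed")
        time.sleep(interval)

# Fake info source standing in for a live Redis connection.
fake_replies = iter([
    {"num_docs": "4", "hash_indexing_failures": "0"},
    {"num_docs": "11", "hash_indexing_failures": "0"},
])
indexed = wait_for_index(lambda: next(fake_replies), expected_docs=11, interval=0.0)
print(indexed)  # 11
```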

## **Start Querying Redis**


First, some simple non-vector queries: a text match on the brand, then the same match combined with a numeric price range.

In [27]:
query = Query("@brand:Peaknetic")
res = client.ft("idx:bikes_vss").search(query).docs
res

[Document {'id': 'bikes:008', 'payload': None, 'json': '{"model":"Soothe Electric bike","brand":"Peaknetic","price":1950,"type":"eBikes","specs":{"material":"alloy","weight":"14.7"},"description":"The Soothe is an everyday electric bike, from the makers of Exercycle  bikes, that conveys style while you get around the city. The Soothe lives up to its name by keeping your posture upright and relaxed for the ride ahead, keeping those aches and pains from riding at bay. It includes a low-step frame , our memory foam seat, bump-resistant shocks and conveniently placed thumb throttle. ","description_embeddings":[0.15017177164554596,-0.8277867436408997,-0.09484907239675522,0.3845421075820923,-0.19382081925868988,-0.12874725461006165,-0.35355469584465027,-0.5088202953338623,0.47191816568374634,-0.6207213401794434,0.1276237517595291,0.4462098479270935,-0.4208437204360962,-0.566098153591156,-0.24973604083061215,-0.45723432302474976,-0.13048124313354492,0.05307783558964729,0.22646388411521912,0.3

In [28]:
query = Query("@brand:Peaknetic @price:[0 1000]").return_fields(
    "id", "brand", "model", "price"
)
res = client.ft("idx:bikes_vss").search(query).docs
res

[Document {'id': 'bikes:009', 'payload': None, 'brand': 'Peaknetic', 'model': 'Secto', 'price': '430'}]

Next, run several vector queries in one pass. Set up a list of natural-language queries for the similarity searches and encode all of them with the same embedder used for the descriptions.

In [29]:
queries = [
    "Bike for small kids",
    "Best Mountain bikes for kids",
    "Cheap Mountain bike for kids",
    "Female specific mountain bike",
    "Road bike for beginners",
    "Commuter bike for people over 60",
    "Comfortable commuter bike",
    "Good bike for college students",
    "Mountain bike for beginners",
    "Vintage bike",
    "Comfortable city bike",
]

In [30]:
encoded_queries = embedder.encode(queries)
len(encoded_queries)

11

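Define a helper function that runs the query once per encoded query vector and collects the results in a Pandas table. Each score is reported as `1 - vector_score`, which converts the cosine distance into a similarity.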

In [31]:
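def create_query_table(query, queries, encoded_queries, extra_params=None):
    # Avoid a mutable default argument; treat None as "no extra parameters".
    extra_params = extra_params or {}
    results_list = []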
    for i, encoded_query in enumerate(encoded_queries):
        result_docs = (
            client.ft("idx:bikes_vss")
            .search(
                query,
                {
                    "query_vector": np.array(
                        encoded_query, dtype=np.float32
                    ).tobytes()
                }
                | extra_params,
            )
            .docs
        )
        for doc in result_docs:
            vector_score = round(1 - float(doc.vector_score), 2)
            results_list.append(
                {
                    "query": queries[i],
                    "score": vector_score,
                    "id": doc.id,
                    "brand": doc.brand,
                    "model": doc.model,
                    "description": doc.description,
                }
            )

    # Optional: convert the table to Markdown using Pandas
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(
        by=["query", "score"], ascending=[True, False], inplace=True
    )
    queries_table["query"] = queries_table.groupby("query")["query"].transform(
        lambda x: [x.iloc[0]] + [""] * (len(x) - 1)
    )
    queries_table["description"] = queries_table["description"].apply(
        lambda x: (x[:497] + "...") if len(x) > 500 else x
    )
    queries_table.to_markdown(index=False)
    return queries_table
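
The helper passes each query vector as raw bytes. For an index declared with `TYPE` `FLOAT32`, that blob should hold 4 bytes per dimension, so the dtype used for packing matters. A minimal NumPy sketch of the round trip, using a made-up 4-dimensional vector (the real embeddings have 768 values):

```python
import numpy as np

# Hypothetical 4-dimensional query embedding, chosen for illustration.
encoded_query = [0.25, -0.5, 1.0, 0.0]

# Same conversion as in create_query_table: float32 values packed as raw bytes.
blob = np.array(encoded_query, dtype=np.float32).tobytes()
print(len(blob))  # 16, i.e. 4 values * 4 bytes each

# Reading the bytes back with the same dtype recovers the vector.
restored = np.frombuffer(blob, dtype=np.float32)
print(restored.tolist())  # [0.25, -0.5, 1.0, 0.0]

# Packing as float64 doubles the length and would not match a FLOAT32 field.
wide_blob = np.array(encoded_query, dtype=np.float64).tobytes()
print(len(wide_blob))  # 32
```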

Now start running vector queries. The first one runs every query in the list above and returns the three nearest bikes (`KNN 3`) for each. The `(*)` prefix means no filter is applied before the nearest-neighbour search.

In [32]:
query = (
    Query("(*)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
result_table = create_query_table(query, queries, encoded_queries)
result_table

| | query | score | id | brand | model | description |
|---|---|---|---|---|---|---|
| 3 | Best Mountain bikes for kids | 0.54 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 4 | | 0.51 | bikes:010 | nHill | Summit | This budget mountain bike from nHill performs ... |
| 5 | | 0.46 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 0 | Bike for small kids | 0.52 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 1 | | 0.45 | bikes:007 | ScramBikes | WattBike | The WattBike is the best e-bike for people who... |
| 2 | | 0.41 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 6 | Cheap Mountain bike for kids | 0.49 | bikes:003 | Nord | Chook air 5 | The Chook Air 5 gives kids aged six years and... |
| 7 | | 0.45 | bikes:010 | nHill | Summit | This budget mountain bike from nHill performs ... |
| 8 | | 0.39 | bikes:001 | Velorim | Jigger | Small and powerful, the Jigger is the best rid... |
| 30 | Comfortable city bike | 0.45 | bikes:007 | ScramBikes | WattBike | The WattBike is the best e-bike for people who... |
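
Conceptually, a `KNN 3` query against a `FLAT` index compares the query vector with every stored vector and keeps the three with the smallest distance. The NumPy sketch below mimics that on toy 2-D vectors. The function `knn_cosine` and the data are assumptions for illustration; it is not how Redis implements the search internally.

```python
import numpy as np

def knn_cosine(query, doc_vectors, k=3):
    """Return (index, distance) pairs for the k nearest rows by cosine distance."""
    docs = np.asarray(doc_vectors, dtype=np.float32)
    q = np.asarray(query, dtype=np.float32)
    # Normalise so that a dot product equals cosine similarity.
    docs_unit = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    q_unit = q / np.linalg.norm(q)
    distances = 1.0 - docs_unit @ q_unit
    # Smallest distance first, mirroring sort_by("vector_score").
    nearest = np.argsort(distances)[:k]
    return [(int(i), float(distances[i])) for i in nearest]

# Toy 2-D "embeddings", made up for illustration.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
hits = knn_cosine([1.0, 0.1], toy_docs, k=3)
print([i for i, _ in hits])  # [0, 2, 1]
```

The helper in the notebook then reports `1 - distance`, so the closest match shows the highest score in the table.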


Finally, run a hybrid query that combines a non-vector filter with a vector search. It asks for up to 3 nearest bikes for each query, restricted to the brand Peaknetic.

In [33]:
hybrid_query = (
    Query("(@brand:Peaknetic)=>[KNN 3 @vector $query_vector AS vector_score]")
    .sort_by("vector_score")
    .return_fields("vector_score", "id", "brand", "model", "description")
    .dialect(2)
)
result_table = create_query_table(hybrid_query, queries, encoded_queries)
result_table

| | query | score | id | brand | model | description |
|---|---|---|---|---|---|---|
| 2 | Best Mountain bikes for kids | 0.3 | bikes:008 | Peaknetic | Soothe Electric bike | The Soothe is an everyday electric bike, from ... |
| 3 | | 0.23 | bikes:009 | Peaknetic | Secto | If you struggle with stiff fingers or a kinked... |
| 0 | Bike for small kids | 0.37 | bikes:008 | Peaknetic | Soothe Electric bike | The Soothe is an everyday electric bike, from ... |
| 1 | | 0.25 | bikes:009 | Peaknetic | Secto | If you struggle with stiff fingers or a kinked... |
| 4 | Cheap Mountain bike for kids | 0.21 | bikes:008 | Peaknetic | Soothe Electric bike | The Soothe is an everyday electric bike, from ... |
| 5 | | 0.12 | bikes:009 | Peaknetic | Secto | If you struggle with stiff fingers or a kinked... |
| 20 | Comfortable city bike | 0.42 | bikes:008 | Peaknetic | Soothe Electric bike | The Soothe is an everyday electric bike, from ... |
| 21 | | 0.36 | bikes:009 | Peaknetic | Secto | If you struggle with stiff fingers or a kinked... |
| 12 | Comfortable commuter bike | 0.47 | bikes:008 | Peaknetic | Soothe Electric bike | The Soothe is an everyday electric bike, from ... |
| 13 | | 0.38 | bikes:009 | Peaknetic | Secto | If you struggle with stiff fingers or a kinked... |
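
Each query above returns only two rows even though the query asks for `KNN 3`, because only two documents pass the brand filter. The plain-Python sketch below illustrates the idea of filtering first and ranking second. The function names and toy documents are assumptions for illustration, and how Redis actually combines the filter with the vector search internally may differ.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def hybrid_knn(docs, brand, query_vector, k=3):
    """Keep docs of one brand, then return up to k closest by cosine distance."""
    candidates = [d for d in docs if d["brand"] == brand]
    ranked = sorted(
        candidates, key=lambda d: cosine_distance(d["vector"], query_vector)
    )
    return ranked[:k]

# Made-up documents with 2-D vectors; brand names borrowed from the dataset.
toy_docs = [
    {"id": "a", "brand": "Peaknetic", "vector": [1.0, 0.0]},
    {"id": "b", "brand": "Velorim", "vector": [1.0, 0.1]},
    {"id": "c", "brand": "Peaknetic", "vector": [0.0, 1.0]},
]
top = hybrid_knn(toy_docs, "Peaknetic", [1.0, 0.1], k=3)
print([d["id"] for d in top])  # ['a', 'c']
```

Document `b` is the closest vector overall but is excluded by the brand filter, just as the non-Peaknetic bikes drop out of the hybrid results.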
