# Redis as a Vector Store for Similarity Search

The dataset used is available at https://www.kaggle.com/datasets/akashkotal/imbd-top-1000-with-description

This notebook reuses code and inline comments from the vector search tutorial notebook by `RedisVentures`: https://github.com/RedisVentures/redis-vss-getting-started/blob/main/vector_similarity_with_redis.ipynb

In [105]:
import csv

with open('Top 1000 IMDB movies.csv', newline='') as csvfile:
    reader = csv.DictReader(csvfile)
    data = [row for row in reader]

In [106]:
print(data[2:5])

[{'': '2', 'movie_name': 'The Dark Knight', 'release_year': '(2008)', 'watch_time': '152 min', 'movie_rating': 9.0, 'moview_meatscore': '84        ', 'votes': '34,709', 'gross': '$534.86M', 'description': 'When the menace known as the Joker wreaks havoc and chaos on the people of Gotham, Batman must accept one of the greatest psychological and physical tests of his ability to fight injustice.'}, {'': '3', 'movie_name': 'The Lord of the Rings: The Return of the King', 'release_year': '(2003)', 'watch_time': '201 min', 'movie_rating': 9.0, 'moview_meatscore': '94        ', 'votes': '34,709', 'gross': '$377.85M', 'description': "Gandalf and Aragorn lead the World of Men against Sauron's army to draw his gaze from Frodo and Sam as they approach Mount Doom with the One Ring."}, {'': '4', 'movie_name': "Schindler's List", 'release_year': '(1993)', 'watch_time': '195 min', 'movie_rating': 9.0, 'moview_meatscore': '94        ', 'votes': '34,709', 'gross': '$96.90M', 'description': 'In German-o

In [107]:
import pandas as pd
pd.DataFrame(data)

Unnamed: 0,Unnamed: 1,movie_name,release_year,watch_time,movie_rating,moview_meatscore,votes,gross,description
0,0,The Shawshank Redemption,(1994),142 min,9.3,81,34709,$28.34M,Two imprisoned men bond over a number of years...
1,1,The Godfather,(1972),175 min,9.2,100,34709,$134.97M,The aging patriarch of an organized crime dyna...
2,2,The Dark Knight,(2008),152 min,9.0,84,34709,$534.86M,When the menace known as the Joker wreaks havo...
3,3,The Lord of the Rings: The Return of the King,(2003),201 min,9.0,94,34709,$377.85M,Gandalf and Aragorn lead the World of Men agai...
4,4,Schindler's List,(1993),195 min,9.0,94,34709,$96.90M,"In German-occupied Poland during World War II,..."
...,...,...,...,...,...,...,...,...,...
995,995,Sabrina,(1954),113 min,7.6,72,34709,%^%^%^,A playboy becomes interested in the daughter o...
996,996,From Here to Eternity,(1953),118 min,7.6,85,34709,$30.50M,"At a U.S. Army base in 1941 Hawaii, a private ..."
997,997,Snow White and the Seven Dwarfs,(1937),83 min,7.6,95,34709,$184.93M,Exiled into the dangerous forest by her wicked...
998,998,The 39 Steps,(1935),86 min,7.6,93,34709,%^%^%^,A man in London tries to help a counter-espion...


In [108]:
import json
print(json.dumps(data[50], indent=4))

{
    "": "50",
    "movie_name": "Grave of the Fireflies",
    "release_year": "(1988)",
    "watch_time": "89 min",
    "movie_rating": 8.5,
    "moview_meatscore": "94        ",
    "votes": "34,709",
    "gross": "#48",
    "description": "A young boy and his little sister struggle to survive in Japan during World War II."
}
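
The record above shows that several fields arrive as loosely formatted strings (`"(1988)"`, `"89 min"`, a padded `"94        "`). The rest of this notebook stores them unchanged. As an optional illustration only, the sketch below shows one way such values could be normalized before loading. `clean_movie` is a hypothetical helper, not part of the original notebook, and the index schema used later would need `NumericField` entries if the cleaned values were stored instead.

```python
import re

def clean_movie(row):
    """Return a copy of a raw CSV row with a few fields normalized (hypothetical helper)."""
    cleaned = dict(row)
    # '(1988)' -> 1988; keep the raw string if no 4-digit year is present
    year = re.search(r'\d{4}', row.get('release_year', ''))
    if year:
        cleaned['release_year'] = int(year.group())
    # '89 min' -> 89
    minutes = re.search(r'\d+', row.get('watch_time', ''))
    if minutes:
        cleaned['watch_time'] = int(minutes.group())
    # '94        ' -> 94; blank or non-numeric scores become None
    score = row.get('moview_meatscore', '').strip()
    cleaned['moview_meatscore'] = int(score) if score.isdigit() else None
    return cleaned

sample = {
    'movie_name': 'Grave of the Fireflies',
    'release_year': '(1988)',
    'watch_time': '89 min',
    'moview_meatscore': '94        ',
}
print(clean_movie(sample))
```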


### Generate embeddings for Movie Description using a Sentence Transformer model

For the sentence transformer, we use the `msmarco-distilbert-base-v4` model because it is well suited to asymmetric semantic search, a type of search in which short text queries are matched against longer text passages. We generate the vector embeddings with the model's `encode` function.
https://huggingface.co/sentence-transformers/msmarco-distilbert-base-v4

In [109]:
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer('msmarco-distilbert-base-v4')

from textwrap import TextWrapper

# sample description from movies dataset
sample_description = data[40]['description']
TextWrapper(width=120).wrap(sample_description)

sentence_embedding = embedder.encode(sample_description)
print(sentence_embedding.tolist()[:3])

[0.8253005146980286, -0.10552430897951126, -0.689163863658905]


Get the length (dimensionality) of the generated vector embeddings:

In [110]:
VECTOR_DIM = len(sentence_embedding)
VECTOR_DIM

768

## Using Redis to load movies

### Leveraging redis-py

Install the redis-py client library:
https://github.com/redis/redis-py

In [111]:
#### Use `redis-py` client to test connectivity

import redis

client = redis.Redis(host = 'localhost', port=6379, decode_responses=True)
client.ping()

True

### Movies as JSON Documents in Redis

We have the movie data loaded in memory as a JSON-like array. We iterate over the array, generate a Redis key for each movie, and store the movies in Redis using the [`JSON.SET`](https://redis.io/commands/json.set/) command. We do this in [pipeline](https://redis.io/docs/manual/pipelining/) mode to minimize round-trip times:

In [119]:
pipeline = client.pipeline()

for i, movie in enumerate(data, start=1):
    redis_key = f'movie:{i:05}'
    pipeline.json().set(redis_key, '$', movie)

pipeline.execute()

[True,
 True,
 True,
 ...

Fetch one movie from Redis using a [JSONPath](https://goessner.net/articles/JsonPath/) expression:

In [120]:
client.json().get('movie:00500', '$.movie_name')

['Star Trek']
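
The keys above are zero-padded to five digits (`movie:00500`). One practical effect of this, sketched below, is that plain string sorting of the keys matches numeric order, which matters wherever keys are sorted as strings. `movie_key` is a hypothetical helper that mirrors the `f'movie:{i:05}'` expression used in the loading loop.

```python
def movie_key(index, prefix='movie:', width=5):
    """Build a zero-padded Redis key such as 'movie:00042' (hypothetical helper)."""
    return f'{prefix}{index:0{width}}'

indices = (1, 2, 10, 100)
padded = [movie_key(i) for i in indices]
unpadded = [f'movie:{i}' for i in indices]

# Padded keys keep their numeric order when sorted as strings
print(sorted(padded))
# Unpadded keys do not: 'movie:10' and 'movie:100' sort before 'movie:2'
print(sorted(unpadded))
```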

### Enrich Movies Descriptions with Vector Embeddings

In [121]:
redis_keys = sorted(client.keys('movie:*'))
print(redis_keys)

['movie:00001', 'movie:00002', 'movie:00003', 'movie:00004', 'movie:00005', 'movie:00006', 'movie:00007', 'movie:00008', 'movie:00009', 'movie:00010', 'movie:00011', 'movie:00012', 'movie:00013', 'movie:00014', 'movie:00015', 'movie:00016', 'movie:00017', 'movie:00018', 'movie:00019', 'movie:00020', 'movie:00021', 'movie:00022', 'movie:00023', 'movie:00024', 'movie:00025', 'movie:00026', 'movie:00027', 'movie:00028', 'movie:00029', 'movie:00030', 'movie:00031', 'movie:00032', 'movie:00033', 'movie:00034', 'movie:00035', 'movie:00036', 'movie:00037', 'movie:00038', 'movie:00039', 'movie:00040', 'movie:00041', 'movie:00042', 'movie:00043', 'movie:00044', 'movie:00045', 'movie:00046', 'movie:00047', 'movie:00048', 'movie:00049', 'movie:00050', 'movie:00051', 'movie:00052', 'movie:00053', 'movie:00054', 'movie:00055', 'movie:00056', 'movie:00057', 'movie:00058', 'movie:00059', 'movie:00060', 'movie:00061', 'movie:00062', 'movie:00063', 'movie:00064', 'movie:00065', 'movie:00066', 'movie:00

In [122]:
import numpy as np

descriptions = client.json().mget(redis_keys, '$.description')
descriptions = [item for sublist in descriptions for item in sublist]
embeddings = embedder.encode(descriptions).astype(np.float32).tolist()

Next, we add the vector embeddings to the JSON documents in Redis, using the `JSON.SET` command to insert a new field into each document under the JSONPath `$.description_embeddings`:

In [123]:
pipeline = client.pipeline()

for key, embedding in zip(redis_keys, embeddings):
    pipeline.json().set(key, '$.description_embeddings', embedding)

pipeline.execute()

[True,
 True,
 True,
 ...

Inspect one of the vectorized documents using the `JSON.GET` command:

In [124]:
print(json.dumps(client.json().get('movie:00045'), indent=4)) 

{
    "": "44",
    "movie_name": "Gladiator",
    "release_year": "(2000)",
    "watch_time": "155 min",
    "movie_rating": 8.5,
    "moview_meatscore": "67        ",
    "votes": "34,709",
    "gross": "$187.71M",
    "description": "A former Roman General sets out to exact vengeance against the corrupt emperor who murdered his family and sent him into slavery.",
    "description_embeddings": [
        0.8531678915023804,
        0.5070703625679016,
        -0.502671480178833,
        0.4632340371608734,
        -0.2895973324775696,
        0.6518955230712891,
        -0.0959697738289833,
        -0.41193637251853943,
        0.3949194252490998,
        0.35345950722694397,
        -0.238039955496788,
        0.3955552279949188,
        -0.8880314826965332,
        0.37111836671829224,
        0.027599068358540535,
        -0.12024996429681778,
        -0.34391430020332336,
        0.10061223804950714,
        -1.0479873418807983,
        0.3386252522468567,
        0.01860526949167

### Indexing Documents in Redis

In [127]:
from redis.commands.search.field import TextField, NumericField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query

INDEX_NAME = 'index:movies_desc'
DOC_PREFIX = 'movie:'

schema = (
    TextField('$.movie_name', no_stem=True, as_name='movie_name'),
    TextField('$.release_year', no_stem=True, as_name='release_year'),
    NumericField('$.movie_rating', as_name='movie_rating'),
    TextField('$.moview_meatscore', as_name='movie_meat_score'),
    TextField('$.description', as_name='description'),
    VectorField('$.description_embeddings',
        'FLAT', {
            'TYPE': 'FLOAT32',
            'DIM': VECTOR_DIM,
            'DISTANCE_METRIC': 'COSINE',
        },  as_name='vector'
    ),
)

definition = IndexDefinition(prefix=[DOC_PREFIX], index_type=IndexType.JSON)
client.ft(INDEX_NAME).create_index(fields=schema, definition=definition)

'OK'
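
The schema above asks for a `FLAT` index with the `COSINE` distance metric. Conceptually, a flat index compares the query against every stored vector, and cosine distance is one minus cosine similarity. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors. It is not how Redis implements the index internally, and `cosine_distance`, `knn_flat` and `toy_vectors` are hypothetical names used only for this illustration.

```python
import math

def cosine_distance(a, b):
    """Cosine distance between two vectors: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def knn_flat(query, vectors, k):
    """Score every stored vector against the query and keep the k closest."""
    scored = [(cosine_distance(query, vec), key) for key, vec in vectors.items()]
    return sorted(scored)[:k]

toy_vectors = {
    'doc:a': [1.0, 0.0],
    'doc:b': [0.0, 1.0],
    'doc:c': [1.0, 1.0],
}

# Smaller distance means a closer match
for distance, key in knn_flat([1.0, 0.1], toy_vectors, k=2):
    print(key, round(distance, 3))
```

Because every vector is scored, the cost of a flat search grows linearly with the number of documents, which is acceptable for a dataset of this size.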

### Check Status of Indexing Progress

In [128]:
# Indexing status can be obtained with the FT.INFO command, which reports the number of indexed documents and indexing failures.

info = client.ft(INDEX_NAME).info()

# Values may be returned as strings, so convert them explicitly before doing arithmetic
num_docs = int(info['num_docs'])
successfully_indexed = int(float(info['percent_indexed']) * num_docs)
indexing_failures = info['hash_indexing_failures']
time_taken = info['total_indexing_time']


print(f"Total Documents: {num_docs}, Successfully Indexed: {successfully_indexed} , Total Failures: {indexing_failures}, Time taken: {float(time_taken):.4f} milliseconds")

Total Documents: 1000, Successfully Indexed: 1000 , Total Failures: 0, Time taken: 478.6050 milliseconds
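
Indexing runs in the background, so on a larger dataset `percent_indexed` may still be below 1 right after the index is created. The sketch below shows one possible way to wait for completion. `wait_until_indexed` is a hypothetical helper, and it takes a callable returning an `FT.INFO`-style dictionary so that it can be demonstrated here with canned responses instead of a live server.

```python
import time

def wait_until_indexed(get_info, timeout=10.0, interval=0.1):
    """Poll an FT.INFO-style dict until 'percent_indexed' reaches 1 (hypothetical helper)."""
    deadline = time.monotonic() + timeout
    while True:
        info = get_info()
        if float(info['percent_indexed']) >= 1.0:
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError('index did not finish building in time')
        time.sleep(interval)

# Canned responses standing in for successive FT.INFO calls
responses = iter([
    {'percent_indexed': '0.4', 'num_docs': '400'},
    {'percent_indexed': '1', 'num_docs': '1000'},
])
final = wait_until_indexed(lambda: next(responses), interval=0.01)
print(final['num_docs'])
```

With a live connection, `client.ft(INDEX_NAME).info` could be passed as `get_info`.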


### Querying Documents in Redis using Vector Similarity

In [129]:
# The queries below search against
# a) a schema field of type `TEXT`
# b) a `NUMERIC` field with a value between 0.0 and 9.0

In [130]:
query = (
    # Parentheses scope both terms to the movie_name field
    Query('@movie_name:(Cinema Paradiso)').return_fields('id', 'movie_name', 'movie_rating', 'release_year')
)
client.ft(INDEX_NAME).search(query).docs

[Document {'id': 'movie:00050', 'payload': None, 'movie_name': 'Cinema Paradiso', 'movie_rating': '8.5', 'release_year': '(1988)'}]

In [132]:
query = (
    # Parentheses scope both terms to the movie_name field
    Query('@movie_name:(Cinema Paradiso) @movie_rating:[0.0 9.0]').return_fields('id', 'movie_name', 'movie_rating', 'release_year')
)
client.ft(INDEX_NAME).search(query).docs

[Document {'id': 'movie:00050', 'payload': None, 'movie_name': 'Cinema Paradiso', 'movie_rating': '8.5', 'release_year': '(1988)'}]

In [133]:
query = (
    # Parentheses scope both terms to the movie_name field
    Query('@movie_name:(Cinema Paradiso) @movie_rating:[0.0 8.0]').return_fields('id', 'movie_name', 'movie_rating', 'release_year')
)
client.ft(INDEX_NAME).search(query).docs

[]

In [101]:
# Below are query prompts for the movies description vector
queries = [
    'Best thriller movie',
    'Best romantic drama movie',
    'Best movie about movie business and movie stars',
    'Vintage movie about war',
    'Best science fiction movie'
]
encoded_queries = embedder.encode(queries)
len(encoded_queries)

5

In [134]:
# Vector Search Query with K-Nearest Neighbours
query = (
    Query('(*)=>[KNN 5 @vector $query_vector AS similarity_score]')
     .sort_by('similarity_score')
     .return_fields('similarity_score', 'id', 'movie_name', 'movie_rating', 'release_year', 'description')
     .dialect(2)
)
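
In the query string above, the expression before `=>` is a filter that selects the candidate documents, and `(*)` matches all of them. The same position can hold a field filter such as the numeric range used earlier, so that the nearest-neighbour search runs only over the filtered documents. The sketch below only assembles the query string. `knn_query_string` is a hypothetical helper and the rating threshold is an arbitrary example value.

```python
def knn_query_string(k, filter_expr='*', vector_field='vector',
                     param='query_vector', score_alias='similarity_score'):
    """Compose a '(<filter>)=>[KNN ...]' query string (hypothetical helper)."""
    return f'({filter_expr})=>[KNN {k} @{vector_field} ${param} AS {score_alias}]'

# Same string as the unfiltered query used in this notebook
print(knn_query_string(5))

# Restrict the candidates to movies rated 8.5 or higher before the KNN step
print(knn_query_string(5, '@movie_rating:[8.5 +inf]'))
```

The resulting string can be wrapped in `Query(...)` with the same `sort_by`, `return_fields` and `dialect(2)` calls as above.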

In [135]:
# Next we create a NumPy array from a vectorized query prompt (`encoded_query`)

from IPython.display import display, HTML

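def create_query_table(query, queries, encoded_queries, extra_params=None):
    # Avoid a shared mutable default argument
    extra_params = extra_params or {}
    results_list = []
    for i, encoded_query in enumerate(encoded_queries):
        # The query vector is passed to Redis as raw float32 bytes
        query_params = {'query_vector': np.array(encoded_query, dtype=np.float32).tobytes()} | extra_params
        result_docs = client.ft(INDEX_NAME).search(query, query_params).docs
        for doc in result_docs:
            # Redis returns the cosine distance; convert it to a similarity score
            similarity_score = round(1 - float(doc.similarity_score), 2)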
            results_list.append({
                'query': queries[i], 
                'similarity_score': similarity_score, 
                'key': doc.id,
                'movie_name': doc.movie_name,
                'movie_rating': doc.movie_rating,
                'release_year': doc.release_year,
                'description': doc.description
            })

    # Pretty-print the table
    queries_table = pd.DataFrame(results_list)
    queries_table.sort_values(by=['query', 'similarity_score'], ascending=[True, False], inplace=True)
    queries_table['query'] = queries_table.groupby('query')['query'].transform(lambda x: [x.iloc[0]] + ['']*(len(x)-1))
    queries_table['description'] = queries_table['description'].apply(lambda x: (x[:497] + '...') if len(x) > 500 else x)
    html = queries_table.to_html(index=False)
    display(HTML(html))
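
The function above passes the query vector to Redis as a byte string produced by `np.array(..., dtype=np.float32).tobytes()`. For a `FLOAT32` field, that blob holds four bytes per component, so a 768-dimensional query should serialize to 3072 bytes. The sketch below illustrates the same packing with the standard-library `array` module on a three-component toy vector. `to_float32_bytes` is a hypothetical helper, and like NumPy's `tobytes()` on a native `float32` array it uses the machine's native byte order.

```python
from array import array

def to_float32_bytes(values):
    """Pack floats as consecutive 4-byte values in native byte order (hypothetical helper)."""
    return array('f', values).tobytes()

blob = to_float32_bytes([0.5, -1.0, 2.0])

# Four bytes per component
print(len(blob))

# Unpacking the bytes recovers the original components
print(array('f', blob).tolist())
```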

In [136]:
create_query_table(query, queries, encoded_queries)

query,similarity_score,key,movie_name,movie_rating,release_year,description
Best movie about movie business and movie stars,0.32,movie:00976,Barton Fink,7.6,(1991),A renowned New York playwright is enticed to California to write for the movies and discovers the hellish truth of Hollywood.
,0.31,movie:00680,Ed Wood,7.8,(1994),"Ambitious but troubled movie director Edward D. Wood Jr. tries his best to fulfill his dreams, despite his lack of talent."
,0.3,movie:00586,Sullivan's Travels,7.9,(1941),Hollywood director John L Sullivan sets out to experience life as a homeless person in order to gain relevant life experience for his next movie.
,0.29,movie:00827,"South Park: Bigger, Longer & Uncut",7.7,(1999),"When Stan Marsh and his friends go see an R-rated movie, they start cursing and their parents think that Canada is to blame."
,0.28,movie:00439,8½,8.0,(1963),A harried movie director retreats into his memories and fantasies.
Best romantic drama movie,0.32,movie:00664,The Notebook,7.8,(2004),"A poor yet passionate young man (Ryan Gosling) falls in love with a rich young woman (Rachel McAdams), giving her a sense of freedom, but they are soon separated because of their social differences."
,0.27,movie:00156,Miracle in Cell No. 7,8.2,(2019),A story of love between a mentally-ill father who was wrongly accused of murder and his lovely six year old daughter. Prison will be their home. Based on the 2013 Korean movie Miracle in Cell No. 7 (2013).
,0.26,movie:00568,Hiroshima Mon Amour,7.9,(1959),A French actress filming an anti-war film in Hiroshima has an affair with a married Japanese architect as they share their differing perspectives on war.
,0.26,movie:00815,Adaptation.,7.7,(2002),A lovelorn screenwriter becomes desperate as he tries and fails to adapt 'The Orchid Thief' by Susan Orlean for the screen.
,0.25,movie:00805,Lost in Translation,7.7,(2003),A faded movie star and a neglected young woman form an unlikely bond after crossing paths in Tokyo.
