<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02-Redis_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with Redis

[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [1]:
#install Redis client and Hugging Face sentence transformers
!pip install redis sentence_transformers

Collecting redis
  Downloading redis-5.2.1-py3-none-any.whl (261 kB)
Collecting sentence_transformers
  Downloading sentence_transformers-3.3.1-py3-none-any.whl (268 kB)
Collecting transformers<5.0.0,>=4.41.0
  Downloading transformers-4.47.1-py3-none-any.whl (10.1 MB)
Collecting torch>=1.11.0
  Downloading torch-2.5.1-cp39-cp39-manylinux1_x86_64.whl (906.5 MB)
Collecting huggingface-hub>=0.20.0
  Downloading huggingface_hub-0.27.0-py3-none-any.whl (450 kB)

Install Redis Stack locally

### Connect to the Redis server

In [3]:
!pwd

/opt/app-root/src/Redis-Workshops/02-Vector_Similarity_Search


In [6]:
import os

# Updated Redis connection parameters
REDIS_HOST = "searchdb-redis-operator.apps.cluster-rtz4w.rtz4w.sandbox1981.opentlc.com"
REDIS_PORT = "443"
REDIS_PASSWORD = "lYlU1uI4"
CA_CERT_PATH = "./ca.pem"
SNI = "searchdb-redis-operator.apps.cluster-rtz4w.rtz4w.sandbox1981.opentlc.com"

# Shortcut for redis-cli $REDIS_CONN command with TLS and authentication
os.environ["REDIS_CONN"] = (
    f"-h {REDIS_HOST} -p {REDIS_PORT} --tls --cacert {CA_CERT_PATH} --sni {SNI} "
    f"-a {REDIS_PASSWORD} --no-auth-warning"
)

# Redis URL with TLS prefix
REDIS_URL = f"rediss://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
INDEX_NAME = "tweet:idx"  # name of the index created and queried in the cells below

# Test Redis connection
!redis-cli $REDIS_CONN PING
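
The `REDIS_URL` above interpolates the password directly into the URL. That works for a simple alphanumeric password, but a password containing characters such as `@`, `:` or `/` would need to be percent-encoded first. A minimal sketch, assuming a hypothetical helper named `build_redis_url` (it is not part of the notebook or of redis-py) and made-up connection values:

```python
from urllib.parse import quote

def build_redis_url(host, port, password, tls=True):
    """Build a redis:// or rediss:// URL with a percent-encoded password.

    Hypothetical helper for illustration; the notebook builds the URL inline.
    """
    scheme = "rediss" if tls else "redis"
    # safe="" makes quote() encode every reserved character, including "/" and "@"
    return f"{scheme}://:{quote(password, safe='')}@{host}:{port}"

# Example with made-up values (not the workshop's credentials)
url = build_redis_url("redis.example.com", 443, "p@ss/word")
print(url)
```

A URL built this way could then be handed to redis-py's `Redis.from_url`, which is expected to undo the percent-encoding when it parses the URL; TLS options such as `ssl_ca_certs` would still be passed as keyword arguments.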


In [8]:
import redis
import ssl

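# Establish a secure connection to Redis using SSL/TLS
# Note: this rebinds the name `redis` from the module to the client instance; the cells below use `redis` as the client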
redis = redis.Redis(
    host=REDIS_HOST,
    port=int(REDIS_PORT),  # Ensure port is passed as an integer
    password=REDIS_PASSWORD,
    ssl=True,
    ssl_cert_reqs=ssl.CERT_REQUIRED,  # Enforce certificate verification
    ssl_ca_certs=CA_CERT_PATH  # Path to the CA certificate
)

# Test the connection
try:
    response = redis.ping()
    print("Redis connection successful:", response)
except Exception as e:
    print("Redis connection failed:", e)


Redis connection successful: True


In [10]:
import pandas as pd
import numpy as np
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm
from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


tqdm.pandas()



### Embedding generation model

This notebook uses [`sentence-transformers/all-MiniLM-L6-v2`](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) from Hugging Face, which maps each text to a 384-dimensional vector.



In [11]:
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Download 12k+ tweets

In [12]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2024-12-27 17:42:13--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2024-12-27 17:42:13 (232 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



In [13]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
#df=df.head(3000) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
12415,12587,RT @PeterLBrandt: $SPX $ES_F \r\nFollowing thi...
12416,12588,RT @vieiraUAE: Fearless Alex Vieira Calls Best...
12417,12589,$spy $spx $qqq $ndx #nyse going from poking th...
12418,12590,RT @DavidScottAdams: On watch tomorrow // Pt. ...


### Generate Embeddings

Generate a vector embedding for each tweet and store it in a new dataframe column. For all 12k records, this step can take 2-3 minutes on a GPU runtime.

In [14]:
def text_to_embedding(text):
  # Encode the text and serialize it as raw float32 bytes, the layout a FLOAT32 vector field stores
  return model.encode(text).astype(np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()

  0%|          | 0/12420 [00:00<?, ?it/s]

Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b'(\x92\x81\xbd\x1ah\x8b\xbd~\xdf\xe4\xbc\xbb\...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b'W\x1b\x02\xbd\x14~/\xbd\x83z\xb1\xbc_\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x10\xaa\xa3\xbdh}\x10\xbd\xc9\xe8\xb9=!\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,"b""\xe3\x7f\xd1\xbc\xc0\n`\xbd89 =\xe3\xc0\xef=..."
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xcd\r\x1e\xbdC\\\xd4\xbcZ/\xa1\xbc\xdfq7=\x...
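
Each value in `text_embedding` is the raw memory of a float32 array, so its length in bytes is four times the vector dimension: 384 × 4 = 1536 bytes for this model. The sketch below uses only the standard library to show that layout and a round trip. `check_blob` and `blob_to_floats` are hypothetical helpers, the vector is a stand-in rather than a real embedding, and little-endian byte order is an assumption that holds on typical x86 and ARM machines.

```python
import struct

DIM = 384          # output dimension of all-MiniLM-L6-v2
FLOAT32_SIZE = 4   # bytes per float32 value

def check_blob(blob, dim=DIM):
    """Return True if blob has the byte length of a dim-dimensional float32 vector."""
    return len(blob) == dim * FLOAT32_SIZE

def blob_to_floats(blob):
    """Decode a little-endian float32 blob back into a list of Python floats."""
    count = len(blob) // FLOAT32_SIZE
    return list(struct.unpack(f"<{count}f", blob))

# Stand-in vector instead of a real embedding: 0.0, 0.5, 1.0, ...
# (multiples of 0.5 in this range are exactly representable in float32)
vector = [i * 0.5 for i in range(DIM)]
blob = struct.pack(f"<{DIM}f", *vector)

print(len(blob))                       # byte length of the blob
print(check_blob(blob))                # does the length match DIM?
print(blob_to_floats(blob) == vector)  # does the round trip preserve the values?
```

In the notebook itself, the same decoding could be done with `np.frombuffer(blob, dtype=np.float32)`. A length check like this is worth running before loading, since a blob whose size does not match the index `DIM` is unlikely to be indexed as intended.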


### Create Helper Functions
 
- Save dataframe to Redis HASH
- Create RediSearch Index

In [15]:
def load_dataframe(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
    # Write each row to a Redis HASH, sending commands in batches of pipe_size
    records = df.to_dict(orient="records")
    pipe = redis.pipeline(transaction=False)
    for i, record in enumerate(tqdm(records), start=1):
        key = f"{key_prefix}:{record[id_column]}"
        pipe.hset(key, mapping=record)
        if i % pipe_size == 0:
            pipe.execute()
    pipe.execute()  # flush the remaining commands

def create_redis_index(redis, idxname="tweet:idx"):
    # Drop a previous index with the same name so the cell can be re-run
    try:
        redis.ft(idxname).dropindex()
    except Exception:
        print("no index found")

    # Index every HASH whose key starts with "tweet:"
    indexDefinition = IndexDefinition(
        prefix=["tweet:"],
        index_type=IndexType.HASH,
    )

    redis.ft(idxname).create_index(
        (
            # Full-text field for lexical search
            TextField("full_text", no_stem=False, sortable=False),
            # 384-dimensional float32 vectors in an HNSW index, compared by cosine distance
            VectorField("text_embedding", "HNSW", {
                "TYPE": "FLOAT32",
                "DIM": 384,
                "DISTANCE_METRIC": "COSINE",
            }),
        ),
        definition=indexDefinition,
    )
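
Pipelining lets `load_dataframe` queue many `HSET` commands and send them in one round trip. To see how often a pipeline is flushed for a given `pipe_size`, the sketch below replays the flush-every-N pattern against `FakePipeline`, a hypothetical stand-in that only counts calls and needs no Redis server. It flushes after every `pipe_size` queued commands, which is the intent of the helper above rather than a line-by-line copy of it.

```python
class FakePipeline:
    """Stand-in for a Redis pipeline: buffers commands and counts flushes."""

    def __init__(self):
        self.buffer = []
        self.flush_sizes = []

    def hset(self, key, mapping):
        self.buffer.append((key, mapping))

    def execute(self):
        # Record how many commands this flush would send, then clear the buffer
        self.flush_sizes.append(len(self.buffer))
        self.buffer = []


def load_records(pipe, records, key_prefix="tweet", id_column="id", pipe_size=100):
    """Queue one HSET per record and flush every pipe_size records."""
    for i, record in enumerate(records, start=1):
        pipe.hset(f"{key_prefix}:{record[id_column]}", mapping=record)
        if i % pipe_size == 0:
            pipe.execute()
    if pipe.buffer:  # flush the final partial batch, if any
        pipe.execute()


# 12420 stand-in records, the same count as the tweet dataframe
records = [{"id": n, "full_text": f"tweet {n}"} for n in range(1, 12421)]
pipe = FakePipeline()
load_records(pipe, records)

print(len(pipe.flush_sizes))   # number of round trips
print(pipe.flush_sizes[-1])    # size of the final partial batch
```

A larger `pipe_size` means fewer round trips but more commands buffered in memory per flush, so the value is a trade-off rather than a fixed rule.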



### Create index and load data to Redis

In [16]:
# clear the Redis database (optional): flushdb() deletes every key in the current database
redis.flushdb()

# create Index
create_redis_index(redis)

# load data from Dataframe to Redis HASH
load_dataframe(redis,df,key_prefix="tweet", pipe_size=100)


no index found


  0%|          | 0/12420 [00:00<?, ?it/s]

In [18]:
# Check how the data is stored in Redis
key = "tweet:1001"  # The key to inspect

try:
    data = redis.hgetall(key)  # Fetch all fields and values in the hash
    if data:
        # Attempt to decode values, handle decoding errors gracefully
        decoded_data = {}
        for k, v in data.items():
            try:
                decoded_key = k.decode('utf-8')
                decoded_value = v.decode('utf-8')
                decoded_data[decoded_key] = decoded_value
            except UnicodeDecodeError:
                # Fall back to raw bytes if decoding fails
                decoded_data[k] = v
        print(f"Data stored in Redis under '{key}':")
        for field, value in decoded_data.items():
            if isinstance(field, bytes):
                field = field.hex()  # Represent binary keys as hex
            if isinstance(value, bytes):
                value = value.hex()  # Represent binary values as hex
            print(f"  {field}: {value}")
    else:
        print(f"No data found for key: {key}")
except Exception as e:
    print(f"Error retrieving data from Redis: {e}")


Data stored in Redis under 'tweet:1001':
  746578745f656d62656464696e67: 171e033ac062bdbc30b094bc0a4b293c0b55943b243e11bda4330d3d01987f3da258883c695552bdaaf668bc1cb5d2bb103b09bdc76d233ce5d4433d93c822bd20a32dbd3b5e09bd6900fdbcb44745bd6aad5abc51ae3dbd4a3c993cdd62cebcef5fab3c2a4454bc764dd7bc10f182bb86b0283d2a768abcd2a84dbd642c503d1053863cac6afa3c313fb53c645beebbc88d903d1157383d528182bda8fb263c022e3a3d57e6e4bdca1d11bccdb7acbd22593cbb57c9ccbbf83962bc3ee3903d2d9fa73df5370f3d96bbbbbb5dad173dc2ae45bde08e903d71d702bd0a029fbc28c4653bfea4d8bc5a7b663c9a78afbd14c589bc516f033d62428ebc7a91473b5ce797babac19a3d11f742bdc40dca3cadf247bbde0e863ca272843cc2f75ebd35c9bebd604b35bc21e039bdbed73d3d3079b33da510b4bd7afc11bd1a874fbcc56f3cbd828780bc20a922bd522920bd27fc85bac279953d07e0a63cd876463c08d0803dba9acabde17fb2bb4772a33c5235183d18cbb53dcdbf3e3cac888d3c2423ee3c467e633c087c56bde9644b3dc871423d4bd6d63c96752bbd9f71823d2f21a6bd1952c73c9f76ec3b83fd993dd366a6bcd9b282bd4aebbd3c8d1facbcdbbec03ccb8890bdc27fe4bcc20c4eb
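
The field name in this output is printed as hex because the inspection code falls back to raw bytes for both the key and the value whenever the value cannot be decoded as UTF-8, which is the case for the binary embedding. Both parts can be decoded by hand with the standard library. The sketch below assumes little-endian float32 values and uses only the leading bytes of the dump shown above.

```python
import struct

field_hex = "746578745f656d62656464696e67"   # field name as printed above
value_hex_prefix = "171e033ac062bdbc"         # first 8 bytes (two float32 values) of the value

# The field name is plain ASCII once the hex encoding is undone
field_name = bytes.fromhex(field_hex).decode("utf-8")
print(field_name)

# Each group of 4 bytes is one little-endian float32 component of the embedding
first_two = struct.unpack("<2f", bytes.fromhex(value_hex_prefix))
print(first_two)
```

The small magnitudes of the decoded components are what one would expect from a sentence embedding spread over 384 dimensions.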

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


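Try queries like “Oil”, “Oil Reserves”, and “Fossil fuels”:

- Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms (or their stems).
- Vector search continues to discover relevant tweets, because it ranks tweets by how close their embeddings are to the query embedding.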

In [19]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [20]:
#using Full Text Index
q = Query(user_query)\
  .return_fields("full_text")
res = redis.ft("tweet:idx").search(q)
if res.total==0:
  print("No matches found")
else:
  res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
  display(res_df)

Unnamed: 0,id,full_text
0,tweet:3220,The relative performance of TIPS has historica...
1,tweet:1490,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
2,tweet:1585,"RT @Benzinga: Oil Prices Rise, Fall As Russia,..."
3,tweet:1610,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
4,tweet:311,"Oil Prices Rise, Fall As Russia, Saudi Arabia ..."
5,tweet:7189,Do higher oil prices help the consumer and sma...
6,tweet:636,Told you Saudi Arabia will bend the knee @jimc...
7,tweet:5405,https://t.co/3IJBXa5wuf Historic oil price plu...
8,tweet:5406,Historic oil price plunge trashes sector's pro...
9,tweet:3865,Today's book recommendation goes for the winne...


In [21]:
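#using Vector Similarity Index
query_vector=text_to_embedding(user_query)
# KNN 10 returns the 10 nearest neighbours of $vector; result_score is the cosine distance (lower is closer)
q = Query("*=>[KNN 10 @text_embedding $vector AS result_score]")\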
                .return_fields("result_score","full_text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = redis.ft("tweet:idx").search(q, query_params={"vector": query_vector})
#print(res)
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df

Unnamed: 0,id,result_score,full_text
0,tweet:444,0.369450867176,Would you spend $2 more a gallon of gasoline i...
1,tweet:11529,0.37109541893,RT @tradingcrudeoil: Crude oil closed up $0.48...
2,tweet:5654,0.381934165955,..and oil still 25.74 LMAO &gt;&gt;&gt;NO DEM...
3,tweet:204,0.396132290363,Bad news for #oil. It’s going to between $10 ...
4,tweet:9189,0.409308612347,Oil erases gains for the day in fall to $25 ht...
5,tweet:7189,0.42981672287,Do higher oil prices help the consumer and sma...
6,tweet:9330,0.430081129074,The price of Texas intermediate oil (WTI) slum...
7,tweet:531,0.431391596794,#OIL Sentiment ($22.50)\r\n\r\nWhat’s next for...
8,tweet:178,0.441844880581,OH how bullish for #oil LOL\r\n\r\n#OOTT #Oi...
9,tweet:6867,0.442979991436,$DXY 99.55-0.57%&lt;==US Dollar lower #Fed $2....
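
With `DISTANCE_METRIC` set to `COSINE`, `result_score` is a cosine distance, that is, one minus the cosine similarity. Lower scores therefore mean closer matches, and sorting in ascending order puts the best match first. A minimal pure-Python sketch of that calculation, using toy 3-dimensional vectors in place of the 384-dimensional embeddings (`cosine_distance` is an illustrative helper, not a Redis or redis-py function):

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0, 0.0]
same_direction = [2.0, 0.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0, 0.0]       # unrelated direction
opposite = [-1.0, 0.0, 0.0]        # opposite direction

print(cosine_distance(query, same_direction))
print(cosine_distance(query, orthogonal))
print(cosine_distance(query, opposite))
```

The distance depends only on direction, not on vector length, and ranges from 0 (same direction) to 2 (opposite direction). On this scale, the top hit above, with a score of about 0.369, corresponds to a cosine similarity of about 0.63.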
