<a href="https://colab.research.google.com/github/antonum/Redis-Workshops/blob/main/02-Vector_Similarity_Search/02-RedisVL.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with RedisVL



[Always-on demo](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook generates vector embeddings with the pre-trained `sentence-transformers/all-MiniLM-L6-v2` model from Hugging Face, loads them into Redis, and runs vector similarity search against the Redis database.

In [1]:
# install RedisVL (which pulls in the Redis client) and Hugging Face sentence-transformers
!pip install -q sentence_transformers git+https://github.com/RedisVentures/redisvl.git@readme-enhancement


Install Redis Stack locally

In [2]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes


deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


### Connect to the Redis server

In [3]:
import os
import redis
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#Replace values above with your own if using Redis Cloud instance
#REDIS_HOST="redis-12110.c82.us-east-1-2.ec2.cloud.redislabs.com"
#REDIS_PORT=12110
#REDIS_PASSWORD="pobhBJP7Psicp2gV0iqa2ZOc1WdXXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"
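
RedisVL connects with a single `redis://` URL later in this notebook, while the cell above keeps host, port, and password separately. A minimal sketch of joining them into a URL follows; `build_redis_url` is a hypothetical helper, not part of RedisVL or redis-py, and it assumes password-only authentication.

```python
from urllib.parse import quote


def build_redis_url(host: str, port, password: str = "") -> str:
    """Hypothetical helper: build a redis:// URL from separate settings.

    The password, if present, is percent-encoded so that characters such
    as '@' or '/' do not break URL parsing.
    """
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{int(port)}"


print(build_redis_url("localhost", "6379"))           # redis://localhost:6379
print(build_redis_url("example.com", 12110, "p@ss"))  # redis://:p%40ss@example.com:12110
```

With the variables from the cell above, `build_redis_url(REDIS_HOST, REDIS_PORT, REDIS_PASSWORD)` would give a URL suitable for a Redis Cloud instance as well as the local server.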


In [4]:
os.environ["TOKENIZERS_PARALLELISM"] = "false"
from redisvl.utils.vectorize import HFTextVectorizer


# create a vectorizer
# choose your model from the huggingface website
hf = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [5]:
# embed a sentence
test = hf.embed("This is a test sentence.")
test[:10]

[0.08429647237062454,
 0.057953670620918274,
 0.0044933767057955265,
 0.10582111030817032,
 0.007083416450768709,
 -0.01784469559788704,
 -0.016888044774532318,
 -0.01522829756140709,
 0.040473099797964096,
 0.03342254459857941]

In [6]:
import pandas as pd
import numpy as np
from tqdm.auto import tqdm

"""from sentence_transformers import SentenceTransformer

from redis import Redis
from redis.commands.search.field import (
    NumericField,
    TagField,
    TextField,
    VectorField,
)
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query
"""

tqdm.pandas()



### Embedding generation model

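Here we use `sentence-transformers/all-MiniLM-L6-v2` from Hugging Face (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2). It maps text to 384-dimensional vectors, which is why the index schema later in this notebook sets `dims` to 384.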



In [7]:
#from sentence_transformers import SentenceTransformer
#model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
#test=model.encode("This is a test sentence.")
#test[:]

Download 12k+ tweets

In [8]:
!wget https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv

--2024-02-05 21:44:03--  https://raw.githubusercontent.com/antonum/Redis-VSS-Streamlit/main/Labelled_Tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2486081 (2.4M) [text/plain]
Saving to: ‘Labelled_Tweets.csv’


2024-02-05 21:44:03 (59.8 MB/s) - ‘Labelled_Tweets.csv’ saved [2486081/2486081]



In [9]:
df = pd.read_csv('Labelled_Tweets.csv').drop(columns=['created_at','score'])
df=df.head(100) #trim dataframe to fit results into 30MB Redis database
df


Unnamed: 0,id,full_text
0,1,@KennyDegu very very little volume. With $10T ...
1,2,#ES_F achieved Target 2780 closing above 50% #...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...
3,4,@Issaquahfunds Hedged our $MSFT position into ...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...
...,...,...
95,101,RT @stocks_hound: Torque Esports @TorqueEsport...
96,102,RT @TommyThornton: Added back 5 ETF shorts lat...
97,103,https://t.co/9VjKMnpm7n\r\n\r\nIndependence Re...
98,104,Torque Esports @TorqueEsport &amp; ESPN @espn ...


### Generate Embeddings

Generate vector embeddings within the dataframe. For all 12k records this step can take 2-3 minutes on a GPU runtime; the 100-row sample used here is much quicker.

In [10]:
def text_to_embedding(text):
  return np.array(hf.embed(text), dtype=np.float32).tobytes()

#generate vector embeddings
df["text_embedding"] = df["full_text"].progress_apply(text_to_embedding)
df.head()

  0%|          | 0/100 [00:00<?, ?it/s]

Unnamed: 0,id,full_text,text_embedding
0,1,@KennyDegu very very little volume. With $10T ...,b'*\x92\x81\xbd\x14h\x8b\xbdy\xdf\xe4\xbc\xc0\...
1,2,#ES_F achieved Target 2780 closing above 50% #...,b'Q\x1b\x02\xbd\x14~/\xbd\xa6z\xb1\xbcb\x99\xd...
2,3,RT @KimbleCharting: Silver/Gold indicator crea...,b'\x10\xaa\xa3\xbd]}\x10\xbd\xc3\xe8\xb9=%\x08...
3,4,@Issaquahfunds Hedged our $MSFT position into ...,b'\xc4\x7f\xd1\xbc\xbe\n`\xbd79 =\xe4\xc0\xef=...
4,5,RT @zipillinois: 3 Surprisingly Controversial ...,b'\xc6\r\x1e\xbdK\\\xd4\xbcH/\xa1\xbc\xebq7=\x...
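
The `text_embedding` column holds raw bytes because Redis stores a vector field in a hash as a packed float32 buffer. The round trip can be sketched with the standard library alone; the notebook itself uses NumPy for this, and the helper names below are illustrative only.

```python
from array import array


def to_float32_bytes(values):
    """Pack a sequence of floats into raw float32 bytes (4 bytes each)."""
    return array("f", values).tobytes()


def from_float32_bytes(blob):
    """Unpack raw float32 bytes back into a list of Python floats."""
    out = array("f")
    out.frombytes(blob)
    return list(out)


vec = [0.5, -0.25, 0.125]  # values chosen to be exactly representable in float32
blob = to_float32_bytes(vec)
print(len(blob))                 # 12 bytes: 3 values * 4 bytes
print(from_float32_bytes(blob))  # [0.5, -0.25, 0.125]

# A 384-dimensional embedding therefore occupies 384 * 4 = 1536 bytes.
print(len(to_float32_bytes([0.0] * 384)))  # 1536
```

Checking that each stored blob is exactly `dims * 4` bytes long is a quick way to catch a mismatch between the embedding model and the index schema.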


### Create Helper Functions

- Save dataframe to Redis HASH
- Create RediSearch Index

In [11]:
tmp="""
def load_dataframe(redis, df, key_prefix="tweet", id_column="id", pipe_size=100):
    records = df.to_dict(orient="records")
    pipe = redis.pipeline(transaction=False)
    for i, record in enumerate(tqdm(records), start=1):
        key = f"{key_prefix}:{record[id_column]}"
        pipe.hset(key, mapping=record)
        # flush the pipeline once every pipe_size records
        if i % pipe_size == 0:
          pipe.execute()
    pipe.execute()

def create_redis_index(redis, idxname="tweet:idx"):
  try:
    redis.ft(idxname).dropindex()
  except:
    print("no index found")

  # Create an index
  indexDefinition = IndexDefinition(
      prefix=["tweet:"],
      index_type=IndexType.HASH,
  )

  redis.ft(idxname).create_index(
      (
          TextField("full_text", no_stem=False, sortable=False),
          VectorField("text_embedding", "HNSW", {  "TYPE": "FLOAT32",
                                                    "DIM": 384,
                                                    "DISTANCE_METRIC": "COSINE",
                                                  })
      ),
      definition=indexDefinition
  )

"""

In [12]:
schema = {
    "index": {
        "name": "tweet:idx",
        "prefix": "tweet:",
        "storage_type": "hash", # default setting -- HASH
    },
    "fields": [
        {"name": "full_text", "type": "text"},
        {
            "name": "text_embedding",
            "type": "vector",
            "attrs": {
                "dims": 384,
                "distance_metric": "cosine",
                "algorithm": "HNSW",
                "datatype": "float32"
            }

        }
    ],
}
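
The schema sets `distance_metric` to `cosine`. Redis reports cosine distance as one minus the cosine similarity, so a smaller score means a closer match. A small pure-Python sketch of that calculation, using toy 2-d vectors rather than real embeddings:

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # 0.0 -> same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # 1.0 -> orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # 2.0 -> opposite direction
```

Vector length does not affect the result, only direction does, which is why the first pair scores 0.0 even though the vectors differ in magnitude.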

### Create index and load data to Redis

In [22]:
# clear Redis database (optional)
#redis.flushdb()

from redisvl.index import SearchIndex

index = SearchIndex.from_dict(schema)

# connect to Redis, then create the index (it must exist before any query runs)
#create_redis_index(redis)
index.connect("redis://localhost:6379")
index.create(overwrite=True)

# load data from Dataframe to Redis HASH
#load_dataframe(redis,df,key_prefix="tweet", pipe_size=100)

#keys = index.load(df.to_dict())
#df.to_dict('records')
keys = index.load(df.to_dict('records'))
keys

['tweet::1a79314d528844e78952b1ff687cfae8',
 'tweet::9010a8d52ea940dbbf868643e7c94dd5',
 'tweet::df0fd7da2bc54b439915b59dde12c5d2',
 'tweet::8cb3961f14ac4930a2b67f31d61ec7e1',
 'tweet::6a874d90cce44e0286eea9c8b087ddca',
 'tweet::b39a13318ffd432ab25e22cf74dee6a9',
 'tweet::7f8065b10c8b40ffa2269a468dcd4532',
 'tweet::02c499f905714fb1813f26ed50493315',
 'tweet::b0be9d8f5b124ee98b44077a0c433dbe',
 'tweet::d6b8c3ae671444339fb296b2c395c365',
 'tweet::cb7027503ad5400dbb64e922603b5f84',
 'tweet::07b8bdfa7e8b4921901e3170bfb5efcc',
 'tweet::ea7a78bfb1464499954e3869cad94736',
 'tweet::38e5ceb066814c4984ba03de904e5605',
 'tweet::776ee8c1c6f64fcab8f6e62c97f1a53f',
 'tweet::8471eb8edffc49c399992b081d584e1d',
 'tweet::6a063ab525df4379a5f6fc8f02ad41d8',
 'tweet::45aab25bc58646ae9a95e13cc276032f',
 'tweet::306ec70518234c80839ef8ad861a17f6',
 'tweet::9387fdf4869e4d24a3e976e52c08be44',
 'tweet::8b9510de1d00450b89f24263ba087dee',
 'tweet::658f57a6bfb24d88a7e7abc14da13ded',
 'tweet::da5e0dfb0924499a8d08434

In [19]:
#Check how the data is stored in Redis
!redis-cli $REDIS_CONN keys "*"
!redis-cli $REDIS_CONN hgetall "tweet::c4b63c9494194d4ea65b7e265bcd2d5f"

  1) "tweet::f9fb022c955e4c9782f30737bf814be8"
  2) "tweet::e83a57bad2204e90bae01aa52bbec588"
  3) "tweet::fa6973b183d44797b4f791f0304ddfd9"
  4) "tweet::22bbe15d1fd242e1b3de75a155c8b494"
  5) "tweet::8c3cb06e90ab4117828aa46cb2aeae90"
  6) "tweet::c9fc46ade5894ec5899723856e231d53"
  7) "tweet::f981adcf9ddc4ca3b541f7f8cdf86585"
  8) "tweet::81829c2fcac540afbca0af2dd5b7e8e5"
  9) "tweet::257535c9ae2744ecbb23d85fe7b7d9f4"
 10) "tweet::0fa38b1f64b54313bb61930a16dd75af"
 11) "tweet::08c92ee7b72141cfad04a54b0b1b1cea"
 12) "tweet::88f4072f6c2048edbaf3ceccda290b2e"
 13) "tweet::6ccd35c4b2604d56a20e242adb4f4564"
 14) "tweet::dc5ba772c1ad49a7ad53e53dcb928258"
 15) "tweet::a53455dbc59b48a4892a47b809984a5e"
 16) "tweet::110a8230a78e4889b8682533b50c5475"
 17) "tweet::9e1cfc7399444f6f91bcd6742473f63a"
 18) "tweet::7410f8dba24a4413a89c9b6e7ac7f5e1"
 19) "tweet::ad6f6f04c2024fdd911497d81acb7dd2"
 20) "tweet::61d8b65faf6446e2ae593f54d1dd9eb1"
 21) "tweet::1121125a78fb4db68128a294aaa0deda"
 22) "tweet::

## Query the database

[Always-on Streamlit app](https://antonum-redis-vss-streamlit-streamlit-app-p4z5th.streamlit.app/)


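Try queries like:
“Oil”, “Oil Reserves”, “Fossil fuels”

Lexical full-text search quickly runs out of matches, because it only returns tweets that contain the query terms.

Vector search continues to discover relevant tweets, because it ranks tweets by how close their embeddings are to the query embedding.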
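
The contrast can be illustrated without Redis. The sketch below uses three toy sentences with made-up 3-d vectors; these are hypothetical values chosen for illustration, not real model embeddings, and the brute-force ranking only stands in for the KNN search that Redis performs.

```python
import math

# Toy corpus: (text, made-up 3-d "embedding"). Hypothetical values only.
docs = [
    ("Oil prices jump after supply cut", [0.9, 0.1, 0.0]),
    ("Crude futures slide on weak demand", [0.8, 0.2, 0.1]),
    ("Tech stocks rally on earnings", [0.0, 0.1, 0.9]),
]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def lexical_search(term):
    """Return texts that contain `term` as a whole word (case-insensitive)."""
    return [text for text, _ in docs if term.lower() in text.lower().split()]


def knn_search(query_vec, k):
    """Return the k texts whose vectors are closest to `query_vec`."""
    ranked = sorted(docs, key=lambda d: cosine_distance(query_vec, d[1]))
    return [text for text, _ in ranked[:k]]


query_vec = [1.0, 0.0, 0.0]  # pretend embedding of the query "oil"
print(lexical_search("oil"))     # only the tweet containing the word "oil"
print(knn_search(query_vec, 2))  # also surfaces the "crude futures" tweet
```

The keyword lookup misses the second sentence because it never mentions "oil", while the nearest-neighbour ranking still places it ahead of the unrelated one.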

In [20]:
user_query="oil price"
# queries to try "oil reserve", "fossil fuels"

In [21]:
from redisvl.query import FilterQuery
from redisvl.query.filter import Text

# text filter -- document must contain the phrase stored in user_query
text_filter = Text("full_text") == user_query

filter_query = FilterQuery(
    return_fields=["full_text"],
    filter_expression=text_filter
)

results = index.query(filter_query)

results

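ValueError: Index has not been created. Must be created before calling search

Note: this error was recorded from a run in which `index.create()` had not been called. The index must be created, after `index.connect(...)`, before `index.query(...)` can return results.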

In [None]:
#using Full Text Index
from redis import Redis
from redis.commands.search.query import Query

# `redis` is the module imported earlier, so create a client instance for raw commands
client = Redis(host=REDIS_HOST, port=int(REDIS_PORT), password=REDIS_PASSWORD or None)

q = Query(user_query)\
  .return_fields("full_text")
res = client.ft("tweet:idx").search(q)
if res.total==0:
  print("No matches found")
else:
  res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
  display(res_df)

In [None]:
#using Vector Similarity Index
query_vector=text_to_embedding(user_query)
# KNN 10: return the 10 nearest neighbours, sorted by ascending distance
q = Query("*=>[KNN 10 @text_embedding $vector AS result_score]")\
                .return_fields("result_score","full_text")\
                .dialect(2)\
                .sort_by("result_score", True)
res = client.ft("tweet:idx").search(q, query_params={"vector": query_vector})
res_df = pd.DataFrame([t.__dict__ for t in res.docs ]).drop(columns=["payload"])
res_df