<a href="https://colab.research.google.com/github/antonum/Redis-Workshops/blob/main/02-Vector_Similarity_Search/Redis_VL_getting_started_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting Started with RedisVL
`redisvl` is a versatile Python library with an integrated CLI, designed to enhance AI applications using Redis. This guide will walk you through the following steps:

1. Defining an `IndexSchema`
2. Preparing a sample dataset
3. Creating a `SearchIndex` object
4. Testing `rvl` CLI functionality
5. Loading the sample data
6. Building `VectorQuery` objects and executing searches
7. Updating a `SearchIndex` object

...and more!

Prerequisites:
- Ensure `redisvl` is installed in your Python environment.
- Have a running instance of [Redis Stack](https://redis.io/docs/install/install-stack/) or [Redis Cloud](https://redis.com/try-free).

_____

In [1]:
!pip install -q git+https://github.com/RedisVentures/redisvl.git@readme-enhancement


In [2]:

%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes



deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


## Define an `IndexSchema`

The `IndexSchema` maintains crucial **index configuration** and **field definitions** to
enable search with Redis. For ease of use, the schema can be constructed from a
Python dictionary or a YAML file.

### Example Schema Creation
Consider a dataset with user information, including `job`, `age`, `credit_score`,
and a 3-dimensional `user_embedding` vector.

You must also decide on a Redis index name and key prefix to use for this
dataset. Below are example schema definitions in both YAML and Dict format.

**YAML Definition:**

```yaml
version: '0.1.0'

index:
  name: user_simple
  prefix: user_simple_docs

fields:
    - name: user
      type: tag
    - name: credit_score
      type: tag
    - name: job
      type: text
    - name: age
      type: numeric
    - name: user_embedding
      type: vector
      attrs:
        algorithm: flat
        dims: 3
        distance_metric: cosine
        datatype: float32
```
> Store this in a local file, such as `schema.yaml`, for RedisVL usage.

**Python Dictionary:**

In [3]:
schema = {
    "index": {
        "name": "user_simple",
        "prefix": "user_simple_docs",
    },
    "fields": [
        {"name": "user", "type": "tag"},
        {"name": "credit_score", "type": "tag"},
        {"name": "job", "type": "text"},
        {"name": "age", "type": "numeric"},
        {
            "name": "user_embedding",
            "type": "vector",
            "attrs": {
                "dims": 3,
                "distance_metric": "cosine",
                "algorithm": "flat",
                "datatype": "float32"
            }
        }
    ]
}

## Sample Dataset Preparation

Below, create a mock dataset with `user`, `job`, `age`, `credit_score`, and
`user_embedding` fields. The `user_embedding` vectors are synthetic examples
for demonstration purposes.

For more information on creating real-world embeddings, refer to this
[article](https://mlops.community/vector-similarity-search-from-basics-to-production/).

In [4]:
import numpy as np
import pandas as pd


data = [
    {
        'user': 'john',
        'age': 1,
        'job': 'engineer',
        'credit_score': 'high',
        'user_embedding': np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes()
    },
    {
        'user': 'mary',
        'age': 2,
        'job': 'doctor',
        'credit_score': 'low',
        'user_embedding': np.array([0.1, 0.1, 0.5], dtype=np.float32).tobytes()
    },
    {
        'user': 'joe',
        'age': 3,
        'job': 'dentist',
        'credit_score': 'medium',
        'user_embedding': np.array([0.9, 0.9, 0.1], dtype=np.float32).tobytes()
    }
]
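Before loading, it can help to confirm that each record provides every field the schema declares, since a misspelled field name in the schema is easy to miss. The sketch below is plain Python, not a RedisVL feature: `missing_fields` is a hypothetical helper, and the records are trimmed copies of the sample data with placeholder embedding bytes.

```python
# Field names declared in the schema dictionary.
schema_fields = ["user", "credit_score", "job", "age", "user_embedding"]

# Trimmed copies of the sample records; the embedding bytes are placeholders.
records = [
    {"user": "john", "age": 1, "job": "engineer", "credit_score": "high",
     "user_embedding": b"\x00" * 12},
    {"user": "mary", "age": 2, "job": "doctor", "credit_score": "low",
     "user_embedding": b"\x00" * 12},
]


def missing_fields(record, fields):
    """Return the schema fields a record does not provide (hypothetical helper)."""
    return [name for name in fields if name not in record]


for record in records:
    print(record["user"], missing_fields(record, schema_fields))

# A schema with a misspelled field name is flagged for every record.
typo_fields = ["user", "credit_store", "job", "age", "user_embedding"]
print(missing_fields(records[0], typo_fields))
```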

In [23]:
pd.DataFrame(data)

| | user | age | job | credit_score | user_embedding |
|---|---|---|---|---|---|
| 0 | john | 1 | engineer | high | `b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'` |
| 1 | mary | 2 | doctor | low | `b'\xcd\xcc\xcc=\xcd\xcc\xcc=\x00\x00\x00?'` |
| 2 | joe | 3 | dentist | medium | `b'fff?fff?\xcd\xcc\xcc='` |


>As seen above, the sample `user_embedding` vectors are converted into bytes. With NumPy, this takes a single call to `tobytes()`.
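To see what those bytes contain, the sketch below reproduces the same layout with the standard library's `struct` module. It assumes little-endian float32 values, which is what `tobytes()` yields for a `float32` array on typical little-endian hardware. It then decodes the bytes shown for `joe` back into floats.

```python
import struct

# Pack three float32 values in little-endian order (assumed layout).
vector = [0.1, 0.1, 0.5]
packed = struct.pack("<3f", *vector)
print(packed)
print(len(packed))  # 3 dims * 4 bytes per float32

# Decode the bytes shown for joe back into three floats.
joe = struct.unpack("<3f", b"fff?fff?\xcd\xcc\xcc=")
print([round(x, 2) for x in joe])
```

The byte length is a quick sanity check: a vector field with `dims: 3` and `datatype: float32` expects 12 bytes per record.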

## Create a `SearchIndex`

With the schema and sample dataset ready, instantiate a `SearchIndex`:

In [5]:
from redisvl.index import SearchIndex

index = SearchIndex.from_dict(schema)
# or use .from_yaml('schema_file.yaml')

Next, the index needs a Redis connection. There are two ways to provide one:

- Create & manage your own client connection (recommended)
- Provide a simple Redis URL and let RedisVL connect on your behalf

### Bring your own Redis connection instance

This is ideal in scenarios where you have custom settings on the connection instance or where your application shares a connection pool:

In [6]:
from redis import Redis

client = Redis.from_url("redis://localhost:6379")

index.set_client(client)
# optionally provide an async Redis client object to enable async index operations

<redisvl.index.index.SearchIndex at 0x7a2391d77d60>

### Let the index manage the connection instance

This is ideal for simple cases:

In [7]:
index.connect("redis://localhost:6379")
# optionally use an async client by passing use_async=True

<redisvl.index.index.SearchIndex at 0x7a2391d77d60>

### Create the underlying index

Now that we are connected to Redis, we need to run the create command.

In [8]:
index.create(overwrite=True)

>Note that at this point, the index has no entries. Data loading follows.

## Inspect with the `rvl` CLI
Use the `rvl` CLI to inspect the created index and its fields:

In [9]:
!rvl index listall

19:44:57 [RedisVL] INFO   Indices:
19:44:57 [RedisVL] INFO   1. user_simple


In [10]:
!rvl index info -i user_simple



Index Information:
╭──────────────┬────────────────┬──────────────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes             │ Index Options   │   Indexing │
├──────────────┼────────────────┼──────────────────────┼─────────────────┼────────────┤
│ user_simple  │ HASH           │ ['user_simple_docs'] │ []              │          0 │
╰──────────────┴────────────────┴──────────────────────┴─────────────────┴────────────╯
Index Fields:
╭────────────────┬────────────────┬─────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name           │ Attribute      │ Type    │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │
├────────────────┼────────────────┼─────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────────┼────────────

## Load Data to `SearchIndex`

Load the sample dataset to Redis:

In [11]:
keys = index.load(data)

print(keys)

['user_simple_docs:bfc9e0f992964483ab389aa267230b42', 'user_simple_docs:86c63306c0574019ba9b968cd33cd579', 'user_simple_docs:f1f277d1ae894a72a512a752e90540f8']


>By default, `load` will create a unique Redis "key" as a combination of the index key `prefix` and a UUID. You can also customize the key by providing direct keys or pointing to a specified `id_field` on load.
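The sketch below only illustrates the shape of those keys. It is not RedisVL's internal implementation. The `:` separator and the 32-character hex UUID are inferred from the keys printed above, and `make_key` is a hypothetical helper.

```python
import uuid

prefix = "user_simple_docs"


def make_key(prefix, record=None, id_field=None):
    """Hypothetical helper: build '<prefix>:<id>' from a field or a random UUID."""
    if record is not None and id_field is not None:
        return f"{prefix}:{record[id_field]}"
    return f"{prefix}:{uuid.uuid4().hex}"


# Default shape: prefix plus a random 32-character hex UUID.
print(make_key(prefix))

# Deterministic shape: prefix plus the value of a chosen field.
print(make_key(prefix, {"user": "john"}, id_field="user"))
```

Deterministic keys matter for upserts: loading a record under an existing key overwrites it, while a fresh UUID always adds a new entry.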

### Upsert the index with new data
Upsert data by using the `load` method again:

In [12]:
# Add more data
new_data = [{
    'user': 'tyler',
    'age': 9,
    'job': 'engineer',
    'credit_score': 'high',
    'user_embedding': np.array([0.1, 0.3, 0.5], dtype=np.float32).tobytes()
}]
keys = index.load(new_data)

print(keys)

['user_simple_docs:579268c76fca4447bc57b2e557cda804']


## Creating `VectorQuery` Objects

Next, we will create a vector query object for our newly populated index. This example uses a simple vector to demonstrate how vector similarity works. Vectors in production will likely be much larger than 3 floats and are usually generated by machine learning models (e.g., Hugging Face sentence transformers) or an embeddings API (Cohere, OpenAI). `redisvl` provides a set of [Vectorizers](https://www.redisvl.com/user_guide/vectorizers_04.html#openai) to assist in vector creation.

In [13]:
from redisvl.query import VectorQuery
#from jupyterutils import result_print

query = VectorQuery(
    vector=[0.1, 0.1, 0.5],
    vector_field_name="user_embedding",
    return_fields=["user", "age", "job", "credit_score", "vector_distance"],
    num_results=5
)

### Executing queries
With our `VectorQuery` object defined above, we can execute the query over the `SearchIndex` using the `query` method.

In [14]:

results = index.query(query)
results


[{'id': 'user_simple_docs:bfc9e0f992964483ab389aa267230b42',
  'vector_distance': '0',
  'user': 'john',
  'age': '1',
  'job': 'engineer',
  'credit_score': 'high'},
 {'id': 'user_simple_docs:86c63306c0574019ba9b968cd33cd579',
  'vector_distance': '0',
  'user': 'mary',
  'age': '2',
  'job': 'doctor',
  'credit_score': 'low'},
 {'id': 'user_simple_docs:579268c76fca4447bc57b2e557cda804',
  'vector_distance': '0.0566298961639',
  'user': 'tyler',
  'age': '9',
  'job': 'engineer',
  'credit_score': 'high'},
 {'id': 'user_simple_docs:f1f277d1ae894a72a512a752e90540f8',
  'vector_distance': '0.653301358223',
  'user': 'joe',
  'age': '3',
  'job': 'dentist',
  'credit_score': 'medium'}]
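Note that every value in the output above is a string, including `age` and `vector_distance`. If you want to compare or sort on them, cast them first. A minimal sketch in plain Python, using two rows copied from that output:

```python
# Two rows copied from the query output.
results = [
    {"id": "user_simple_docs:579268c76fca4447bc57b2e557cda804",
     "vector_distance": "0.0566298961639", "user": "tyler", "age": "9",
     "job": "engineer", "credit_score": "high"},
    {"id": "user_simple_docs:f1f277d1ae894a72a512a752e90540f8",
     "vector_distance": "0.653301358223", "user": "joe", "age": "3",
     "job": "dentist", "credit_score": "medium"},
]

# Cast the numeric fields; everything else stays a string.
typed = [
    {**row, "age": int(row["age"]), "vector_distance": float(row["vector_distance"])}
    for row in results
]

# Numeric comparisons now behave as expected.
close_matches = [row["user"] for row in typed if row["vector_distance"] < 0.1]
print(close_matches)
```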

In [15]:
pd.DataFrame(results)


| | id | vector_distance | user | age | job | credit_score |
|---|---|---|---|---|---|---|
| 0 | user_simple_docs:bfc9e0f992964483ab389aa267230b42 | 0.0 | john | 1 | engineer | high |
| 1 | user_simple_docs:86c63306c0574019ba9b968cd33cd579 | 0.0 | mary | 2 | doctor | low |
| 2 | user_simple_docs:579268c76fca4447bc57b2e557cda804 | 0.0566298961639 | tyler | 9 | engineer | high |
| 3 | user_simple_docs:f1f277d1ae894a72a512a752e90540f8 | 0.653301358223 | joe | 3 | dentist | medium |
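The reported distances can be reproduced by hand. With the `cosine` metric, the distance is one minus the cosine similarity, and a `flat` index compares the query against every stored vector. The sketch below does the same exhaustive scan in plain Python. It uses 64-bit floats, so the trailing digits may differ slightly from the float32 values Redis reports.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus cosine similarity, clamped at 0 to absorb rounding."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return max(0.0, 1.0 - dot / (norm_a * norm_b))


query_vector = [0.1, 0.1, 0.5]
embeddings = {
    "john": [0.1, 0.1, 0.5],
    "mary": [0.1, 0.1, 0.5],
    "joe": [0.9, 0.9, 0.1],
    "tyler": [0.1, 0.3, 0.5],
}

# Score every stored vector against the query, then sort by distance.
ranked = sorted(
    (cosine_distance(query_vector, vec), user) for user, vec in embeddings.items()
)
for distance, user in ranked:
    print(f"{user}: {distance:.4f}")
```

`john` and `mary` share the query vector exactly, so they tie at the top. `tyler` differs in one component and `joe` points in a very different direction.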


## Using an Asynchronous Redis Client

The `AsyncSearchIndex` class, paired with an async Redis Python client, lets you run queries, index creation, and data loading asynchronously. This is the
recommended route for working with `redisvl` in production-like settings.

In [16]:
from redisvl.index import AsyncSearchIndex
from redis.asyncio import Redis

client = Redis.from_url("redis://localhost:6379")

index = AsyncSearchIndex.from_dict(schema)
index.set_client(client)

<redisvl.index.index.AsyncSearchIndex at 0x7a23637bab90>

In [17]:
# execute the vector query async
results = await index.query(query)
pd.DataFrame(results)

| | id | vector_distance | user | age | job | credit_score |
|---|---|---|---|---|---|---|
| 0 | user_simple_docs:bfc9e0f992964483ab389aa267230b42 | 0.0 | john | 1 | engineer | high |
| 1 | user_simple_docs:86c63306c0574019ba9b968cd33cd579 | 0.0 | mary | 2 | doctor | low |
| 2 | user_simple_docs:579268c76fca4447bc57b2e557cda804 | 0.0566298961639 | tyler | 9 | engineer | high |
| 3 | user_simple_docs:f1f277d1ae894a72a512a752e90540f8 | 0.653301358223 | joe | 3 | dentist | medium |
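The bare `await` in the cell above works because notebooks run an event loop for you. In a standalone script, the same call has to sit inside a coroutine started with `asyncio.run`. A minimal sketch follows. `fake_query` is a hypothetical stand-in for `index.query(query)` so the example runs without a Redis server.

```python
import asyncio


async def fake_query(vector):
    """Hypothetical stand-in for `await index.query(query)`."""
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    return [{"user": "john", "vector_distance": "0"}]


async def main():
    # In a real script, create the async index and run the query here.
    return await fake_query([0.1, 0.1, 0.5])


# asyncio.run starts an event loop, runs main() to completion, and closes the loop.
results = asyncio.run(main())
print(results)
```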


## Updating a schema
In some scenarios, it makes sense to update the index schema. With Redis and `redisvl`, this is easy because Redis keeps the underlying data in place while you update the index configuration.

So for our scenario, let's imagine we want to reindex this data in 2 ways:
- by using a `Tag` type for `job` field instead of `Text`
- by using an `hnsw` vector index for the `user_embedding` field instead of a `flat` vector index

In [18]:
# Modify this schema to have what we want

index.schema.remove_field("job")
index.schema.remove_field("user_embedding")
index.schema.add_fields([
    {"name": "job", "type": "tag"},
    {
        "name": "user_embedding",
        "type": "vector",
        "attrs": {
            "dims": 3,
            "distance_metric": "cosine",
            "algorithm": "hnsw",
            "datatype": "float32"
        }
    }
])

In [19]:
# Run the index update but keep underlying data in place
await index.create(overwrite=True, drop=False)

19:44:58 redisvl.index.index INFO   Index already exists, overwriting.


In [20]:
# Execute the vector query async
results = await index.query(query)
pd.DataFrame(results)

| | id | vector_distance | user | age | job | credit_score |
|---|---|---|---|---|---|---|
| 0 | user_simple_docs:86c63306c0574019ba9b968cd33cd579 | 0.0 | mary | 2 | doctor | low |
| 1 | user_simple_docs:bfc9e0f992964483ab389aa267230b42 | 0.0 | john | 1 | engineer | high |
| 2 | user_simple_docs:579268c76fca4447bc57b2e557cda804 | 0.0566298961639 | tyler | 9 | engineer | high |
| 3 | user_simple_docs:f1f277d1ae894a72a512a752e90540f8 | 0.653301358223 | joe | 3 | dentist | medium |


## Check Index Stats
Use the `rvl` CLI to check the stats for the index:

In [21]:
!rvl stats -i user_simple


Statistics:
╭─────────────────────────────┬─────────────╮
│ Stat Key                    │ Value       │
├─────────────────────────────┼─────────────┤
│ num_docs                    │ 4           │
│ num_terms                   │ 0           │
│ max_doc_id                  │ 4           │
│ num_records                 │ 20          │
│ percent_indexed             │ 1           │
│ hash_indexing_failures      │ 0           │
│ number_of_uses              │ 2           │
│ bytes_per_record_avg        │ 1           │
│ doc_table_size_mb           │ 0.00044632  │
│ inverted_sz_mb              │ 1.90735e-05 │
│ key_table_size_mb           │ 0.000138283 │
│ offset_bits_per_record_avg  │ nan         │
│ offset_vectors_sz_mb        │ 0           │
│ offsets_per_term_avg        │ 0           │
│ records_per_doc_avg         │ 5           │
│ sortable_values_size_mb     │ 0           │
│ total_indexing_time         │ 0.334       │
│ total_inverted_index_blocks │ 11          │
│ vector_index_sz_mb 

## Cleanup

In [22]:
# clean up the index
await index.delete()