Redis metadata filtering and specification, index customization #8612

Merged
merged 30 commits into from
Aug 26, 2023

Conversation

@Spartee Spartee commented Aug 2, 2023

Description

The previous Redis implementation did not let users specify the index configuration (e.g. changing the underlying algorithm) or add additional metadata to use for querying (e.g. hybrid or "filtered" search).

This PR introduces the ability to specify custom index attributes and metadata attributes as well as use that metadata in filtered queries. Overall, more structure was introduced to the Redis implementation that should allow for easier maintainability moving forward.

Example data

Suppose we have the following sample data

metadata = [
    {
        "user": "john",
        "age": 18,
        "job": "engineer",
        "credit_score": "high",
    },
    {
        "user": "derrick",
        "age": 14,
        "job": "doctor",
        "credit_score": "low",
    },
    {
        "user": "nancy",
        "age": 94,
        "job": "doctor",
        "credit_score": "high",
    },
    {
        "user": "tyler",
        "age": 100,
        "job": "engineer",
        "credit_score": "high",
    },
    {
        "user": "tim",
        "age": 12,
        "job": "dermatologist",
        "credit_score": "high",
    },
    {
        "user": "taimur",
        "age": 15,
        "job": "CEO",
        "credit_score": "low",
    },
    {
        "user": "joe",
        "age": 35,
        "job": "dentist",
        "credit_score": "medium",
    },
]

texts = ["foo", "foo", "foo", "foo", "bar", "bar", "bar"]

New Features

The following features are now available with the Redis integration into Langchain

Index schema generation

The schema for the index is now generated automatically if the user does not specify one. For example, the data above has multiple metadata categories. Consider the following example:

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

embeddings = OpenAIEmbeddings()


rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users"
)

Loading the data through this method, or through the other from_documents and from_texts methods, now generates an index schema in Redis like the following.

View the index schema with the redisvl tool (redisvl.com).

$ rvl index info -i users

Index Information:

Index Name    Storage Type    Prefixes         Index Options    Indexing
------------  --------------  ---------------  ---------------  ----------
users         HASH            ['doc:users']    []               0

Index Fields:

Name              Attribute         Type       Field Option      Option Value
----------------  ----------------  ---------  ----------------  ----------------
user              user              TEXT       WEIGHT            1
job               job               TEXT       WEIGHT            1
credit_score      credit_score      TEXT       WEIGHT            1
content           content           TEXT       WEIGHT            1
age               age               NUMERIC
content_vector    content_vector    VECTOR

Custom Metadata specification

The metadata schema generation follows these rules:

  1. All text fields are indexed as text fields.
  2. All numeric fields are indexed as numeric fields.
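
As a rough illustration of these two rules, the inference could be sketched as follows. This is a minimal sketch and not the code in this PR: `infer_schema` is a hypothetical helper, and skipping booleans and other value types is an assumption made for the example.

```python
from typing import Any, Dict, List


def infer_schema(metadata: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Hypothetical helper: assign each metadata key a field type from its value."""
    schema: Dict[str, List[Dict[str, str]]] = {}
    for key, value in metadata.items():
        if isinstance(value, bool):
            # Assumption: booleans are skipped (bool is a subclass of int in Python).
            continue
        if isinstance(value, (int, float)):
            schema.setdefault("numeric", []).append({"name": key})
        elif isinstance(value, str):
            schema.setdefault("text", []).append({"name": key})
    return schema


# First record of the sample data above.
sample = {"user": "john", "age": 18, "job": "engineer", "credit_score": "high"}
generated = infer_schema(sample)
```

For this record the sketch produces text fields for user, job and credit_score and a numeric field for age.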

To index a text field as a tag field instead, specify overrides like the following for the example data:

# this can also be a path to a yaml file
index_schema = {
    "text": [{"name": "user"}, {"name": "job"}],
    "tag": [{"name": "credit_score"}],
    "numeric": [{"name": "age"}],
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users2",
    index_schema=index_schema,  # pass the override, otherwise it is never applied
)

This will change the index specification to

Index Information:

Index Name    Storage Type    Prefixes          Index Options    Indexing
------------  --------------  ----------------  ---------------  ----------
users2        HASH            ['doc:users2']    []               0

Index Fields:

Name              Attribute         Type       Field Option      Option Value
----------------  ----------------  ---------  ----------------  ----------------
user              user              TEXT       WEIGHT            1
job               job               TEXT       WEIGHT            1
content           content           TEXT       WEIGHT            1
credit_score      credit_score      TAG        SEPARATOR         ,
age               age               NUMERIC
content_vector    content_vector    VECTOR

and log a warning that the generated schema does not match the specified schema:

index_schema does not match generated schema from metadata.
index_schema: {'text': [{'name': 'user'}, {'name': 'job'}], 'tag': [{'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}
generated_schema: {'text': [{'name': 'user'}, {'name': 'job'}, {'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}

This is fine as long as the override is intentional.
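
The comparison behind that warning could look roughly like the sketch below. It is an assumption about the shape of the check, not the PR's implementation; `warn_on_mismatch` is a hypothetical name, and only the message text is taken from the log output above.

```python
import logging
from typing import Dict, List

logger = logging.getLogger(__name__)

Schema = Dict[str, List[Dict[str, str]]]


def warn_on_mismatch(index_schema: Schema, generated_schema: Schema) -> bool:
    """Hypothetical check: log a warning and return True when the schemas differ."""
    if index_schema == generated_schema:
        return False
    logger.warning(
        "index_schema does not match generated schema from metadata.\n"
        "index_schema: %s\ngenerated_schema: %s",
        index_schema,
        generated_schema,
    )
    return True


user_schema: Schema = {
    "text": [{"name": "user"}, {"name": "job"}],
    "tag": [{"name": "credit_score"}],
    "numeric": [{"name": "age"}],
}
auto_schema: Schema = {
    "text": [{"name": "user"}, {"name": "job"}, {"name": "credit_score"}],
    "numeric": [{"name": "age"}],
}
mismatch = warn_on_mismatch(user_schema, auto_schema)
```

The user-supplied schema still wins in this sketch; the warning only flags that it differs from what would have been inferred.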

The schema can be defined as a YAML file or a dictionary. As a YAML file it looks like

text:
  - name: user
  - name: job
tag:
  - name: credit_score
numeric:
  - name: age

and is passed in as a path:

from pathlib import Path

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    index_schema=Path("sample1.yml").resolve()
)

This creates the same schema as the dictionary example:

Index Information:

Index Name    Storage Type    Prefixes          Index Options    Indexing
------------  --------------  ----------------  ---------------  ----------
users3        HASH            ['doc:users3']    []               0

Index Fields:

Name              Attribute         Type       Field Option      Option Value
----------------  ----------------  ---------  ----------------  ----------------
user              user              TEXT       WEIGHT            1
job               job               TEXT       WEIGHT            1
content           content           TEXT       WEIGHT            1
credit_score      credit_score      TAG        SEPARATOR         ,
age               age               NUMERIC
content_vector    content_vector    VECTOR

Custom Vector Indexing Schema

Users with large-scale use cases may want to change how the vector index created by Langchain is configured.

To take advantage of Redis's vector database features in such cases, you can now pass in index attribute modifiers, for example to change the indexing algorithm to HNSW.

vector_schema = {
    "algorithm": "HNSW"
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)

A more complex example may look like

vector_schema = {
    "algorithm": "HNSW",
    "ef_construction": 200,
    "ef_runtime": 20
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)

All names correspond to the arguments you would set if using Redis-py or RedisVL. (put in doc link later)
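
One way to picture how a partial vector_schema is applied is as an overlay on a set of defaults. The sketch below is an assumption about that behaviour, not the PR's code: `with_defaults` is a hypothetical helper, the "dims", "distance_metric" and "datatype" values are the ones quoted in the review comments on this PR, and "algorithm": "FLAT" is an assumed default.

```python
from typing import Dict, Optional, Union

VectorSchema = Dict[str, Union[str, int]]

# "algorithm": "FLAT" is an assumption for this sketch; the other three
# values are the defaults quoted in the review comments on this PR.
DEFAULT_VECTOR_SCHEMA: VectorSchema = {
    "algorithm": "FLAT",
    "dims": 1536,
    "distance_metric": "COSINE",
    "datatype": "FLOAT32",
}


def with_defaults(vector_schema: Optional[VectorSchema] = None) -> VectorSchema:
    """Hypothetical helper: user-supplied keys override the defaults."""
    merged = dict(DEFAULT_VECTOR_SCHEMA)
    merged.update(vector_schema or {})
    return merged


merged = with_defaults(
    {"algorithm": "HNSW", "ef_construction": 200, "ef_runtime": 20}
)
```

Keys the user does not set, such as "dims" here, keep their default values.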

Better Querying

Both vector queries and range (limit) queries are now available, and metadata is returned by default. Example outputs are shown below.

>>> query = "foo"
>>> results = rds.similarity_search(query, k=1)
>>> print(results)
[Document(page_content='foo', metadata={'user': 'derrick', 'job': 'doctor', 'credit_score': 'low', 'age': '14', 'id': 'doc:users:657a47d7db8b447e88598b83da879b9d', 'score': '7.15255737305e-07'})]

>>> results = rds.similarity_search_with_score(query, k=1, return_metadata=False)
>>> print(results) # no metadata, but with scores
[(Document(page_content='foo', metadata={}), 7.15255737305e-07)]

>>> results = rds.similarity_search_limit_score(query, k=6, score_threshold=0.0001)
>>> print(len(results)) # range query (only results within the threshold, even if k is higher)
4
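
The difference between the top-k query and the range query can be illustrated in plain Python. This is a conceptual sketch only: the distances are made up (four near-exact "foo" matches and three distant "bar" entries), and `knn` and `range_search` are hypothetical helpers, not the Redis or Langchain implementation.

```python
from typing import List, Tuple

Scored = Tuple[str, float]

# Hypothetical (text, distance) pairs: four near-exact "foo" matches, three distant "bar".
scored: List[Scored] = [("foo", 7e-07)] * 4 + [("bar", 0.21)] * 3


def knn(items: List[Scored], k: int) -> List[Scored]:
    """Top-k: returns the k closest items regardless of how far away they are."""
    return sorted(items, key=lambda pair: pair[1])[:k]


def range_search(items: List[Scored], threshold: float, k: int) -> List[Scored]:
    """Range query: keep only items within the distance threshold, capped at k."""
    within = [pair for pair in items if pair[1] <= threshold]
    return sorted(within, key=lambda pair: pair[1])[:k]


top_k = knn(scored, k=6)
in_range = range_search(scored, threshold=0.0001, k=6)
```

With k=6 the top-k helper returns six items, while the range helper returns only the four that fall within the threshold.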

Custom metadata filtering

A big advantage of Redis in this space is the ability to filter on data stored alongside the vector itself. With the example above, the following is now possible in Langchain. The comparison operators are overloaded to define a new expression language that mimics that of redisvl. This allows arbitrarily long sequences of filters, resembling SQL conditions, to be used directly with vector queries and range queries.

There are two interfaces for doing so; both are shown.

>>> from langchain.vectorstores.redis import RedisFilter, RedisNum, RedisText

>>> age_filter = RedisFilter.num("age") > 18
>>> age_filter = RedisNum("age") > 18 # equivalent
>>> results = rds.similarity_search(query, filter=age_filter)
>>> print(len(results))
3

>>> job_filter = RedisFilter.text("job") == "engineer" 
>>> job_filter = RedisText("job") == "engineer" # equivalent
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2

# fuzzy match text search
>>> job_filter = RedisFilter.text("job") % "eng*"
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2


# combined filters (AND)
>>> combined = age_filter & job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
1

# combined filters (OR)
>>> combined = age_filter | job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
4

All the above filter results can be checked against the data above.
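
For instance, the expected counts can be reproduced in plain Python without Redis. This sketch only re-checks the arithmetic against the sample metadata; the predicates are ordinary Python functions written for the example, not the filter classes from this PR.

```python
from typing import Callable, Dict

Person = Dict[str, object]

# The user, age and job fields of the sample metadata above.
people = [
    {"user": "john", "age": 18, "job": "engineer"},
    {"user": "derrick", "age": 14, "job": "doctor"},
    {"user": "nancy", "age": 94, "job": "doctor"},
    {"user": "tyler", "age": 100, "job": "engineer"},
    {"user": "tim", "age": 12, "job": "dermatologist"},
    {"user": "taimur", "age": 15, "job": "CEO"},
    {"user": "joe", "age": 35, "job": "dentist"},
]


def over_18(person: Person) -> bool:
    return person["age"] > 18


def is_engineer(person: Person) -> bool:
    return person["job"] == "engineer"


def count(predicate: Callable[[Person], bool]) -> int:
    return sum(1 for person in people if predicate(person))


counts = {
    "age > 18": count(over_18),
    "job == engineer": count(is_engineer),
    "AND": count(lambda p: over_18(p) and is_engineer(p)),
    "OR": count(lambda p: over_18(p) or is_engineer(p)),
}
```

The counts come out as 3, 2, 1 and 4, matching the outputs shown for the corresponding filters.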

TODO

  • more tests
  • docstrings
  • docs

Other

@dosubot dosubot bot added Ɑ: vector store Related to vector store module 🤖:improvement Medium size change to existing code to handle new use-cases labels Aug 2, 2023

@tylerhutcherson tylerhutcherson left a comment

nice -- left a few comments and questions

@Spartee Spartee marked this pull request as ready for review August 7, 2023 02:18

@baskaryan baskaryan left a comment

would be good to have someone from redis side review the redis-specific logic!

"distance_metric": "COSINE",
"datatype": "FLOAT32",
}

def __init__(
    self,
    redis_url: str,
    index_name: str,
    embedding_function: Callable,

since we're already making breaking changes, i'd suggest we update this to be

Suggested change
- embedding_function: Callable,
+ embedding: Embeddings,

which is the interface all newer VectorStores have

@baskaryan

fixed some of the lint issues here #9705 if you want to merge that into this pr

base_query = f"({query_prefix})=>[KNN {k} @{vector_key} $vector AS score]"

query = (
Query(base_query).return_fields(*return_fields).sort_by("score").dialect(2)

still need to add .paging(0, k) here

Suggested change
-     Query(base_query).return_fields(*return_fields).sort_by("score").dialect(2)
+     Query(base_query)
+     .return_fields(*return_fields)
+     .sort_by("score")
+     .paging(0, k)
+     .dialect(2)

@alonre24 alonre24 left a comment

Added a few RediSearch syntax-related comments

from langchain.vectorstores import Redis
from langchain.embeddings import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()
redisearch, keys = RediSearch.from_texts_return_keys(

Isn't it Redis.from_texts_return_keys... ?

OPERATOR_MAP = {
    RedisFilterOperator.EQ: '@%s:"%s"',
    RedisFilterOperator.NE: '(-@%s:"%s")',
    RedisFilterOperator.LIKE: "@%s:%s",

I'm not sure that LIKE represents what you meant it would be. The difference between @%s:"%s" and @%s:%s is that the first one allows you to find exact matches for phrases, while the last one also looks for exact matches but for each token separately. Also, using @%s:%s (without quotes) will support stemming by default.
To get a behavior that is more similar to SQL's LIKE operator, we can use prefix, infix or suffix matching (see query docs).
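
To make the distinction concrete, the query strings for each kind of match could be built as in the sketch below. The first two templates are the ones from the OPERATOR_MAP snippet under review; the prefix, suffix and infix forms follow RediSearch wildcard syntax as I understand it and should be checked against the query docs. All function names are hypothetical, and escaping of special characters is not handled.

```python
def exact_phrase(field: str, value: str) -> str:
    """Quoted form: matches the value as an exact phrase."""
    return '@%s:"%s"' % (field, value)


def per_token(field: str, value: str) -> str:
    """Unquoted form: each token is matched separately (with stemming)."""
    return "@%s:%s" % (field, value)


def prefix(field: str, value: str) -> str:
    """Assumed wildcard form: terms starting with the value."""
    return "@%s:%s*" % (field, value)


def suffix(field: str, value: str) -> str:
    """Assumed wildcard form: terms ending with the value."""
    return "@%s:*%s" % (field, value)


def infix(field: str, value: str) -> str:
    """Assumed wildcard form: terms containing the value."""
    return "@%s:*%s*" % (field, value)


queries = [
    exact_phrase("job", "engineer"),
    per_token("job", "engineer"),
    prefix("job", "eng"),
    suffix("job", "eer"),
    infix("job", "gin"),
]
```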

Query(query_string)
.return_fields(*return_fields)
.sort_by("distance")
.paging(0, k)

One of the advantages of range queries is that you can get all the results that are within a given range (not just the top k ones), and in particular you can see how many results are within the range. To utilize this feature I would suggest having k as an optional argument (and perhaps using an upper bound such as 1000 for paging), if that makes sense to you.

agreed -- was going to suggest using K as an optional upper bound. @alonre24 isn't paging a default param 0-10 though?

Maybe something like:

query = (
    Query(query_string)
    .return_fields(*return_fields)
    .sort_by("distance")
    .dialect(2)
)

if k:
    query = query.paging(0, k)

Yes, the default paging is 10, that's why I suggest also choosing an upper bound, so we have:

query = query.paging(0, k if k else UPPER_BOUND)

It's part of the higher level interface abstraction so it would have to be set. I've started this conversation with the lc folks though. Most likely the best route is another method that exposes this feature better.


# if it's a list of strings, we assume it's a tag
if isinstance(value, (list, tuple)):
    if not value or isinstance(value[0], str):

why is `if not value` ok?

same thing as saying len(value) == 0.

>>> x = []
>>> if not x:
...     print("it's like an emptiness check")
... 
it's like an emptiness check

looks cleaner than len check or try/except

# if it's a list/tuple of strings, we join it
elif isinstance(value, (list, tuple)):
    if not value or isinstance(value[0], str):
        clean_meta[key] = ",".join(value)

Note that if there are tag values within the list that contain a comma, they will be split when indexed in RediSearch. Consider validating that there are no such values, or allowing the user to specify a different separator (API allows it).
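
A validation step along these lines could be sketched as follows. This is only an illustration of the suggestion, not code from the PR: `join_tags` and its `separator` parameter are hypothetical.

```python
from typing import Sequence


def join_tags(values: Sequence[str], separator: str = ",") -> str:
    """Hypothetical helper: join tag values, rejecting any that contain the separator."""
    offending = [value for value in values if separator in value]
    if offending:
        raise ValueError(
            f"tag values {offending!r} contain the separator {separator!r}"
        )
    return separator.join(values)


joined = join_tags(["nike", "adidas"])
# A value containing a comma is accepted once a different separator is chosen.
piped = join_tags(["shoes, running", "outdoor"], separator="|")
```

With the default separator, a value such as "shoes, running" raises a ValueError instead of being split silently at index time.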

So currently, they could specify it in the schema and clean up the data themselves beforehand, but for the automatically generated metadata (how most will use it because they just use the data loaders), it's defaulting to , right now. Been thinking I should have a default here anyway. Good call out.


>>> from langchain.vectorstores.redis import RedisTag, RedisNum
>>> brand_is_nike = RedisTag("brand") == "nike"
>>> price_is_over_100 = RedisNum("price") < 100

Suggested change
- >>> price_is_over_100 = RedisNum("price") < 100
+ >>> price_is_under_100 = RedisNum("price") < 100

@baskaryan baskaryan merged commit a28eea5 into langchain-ai:master Aug 26, 2023
26 checks passed
toddkim95 pushed a commit to toddkim95/langchain that referenced this pull request Aug 26, 2023
…chain-ai#8612)

### Other

  - Issue: langchain-ai#3967 
  - Dependencies: No added dependencies
  - Tag maintainer: @hwchase17 @baskaryan @rlancemartin 
  - Twitter handle: @sampartee

---------

Co-authored-by: Naresh Rangan <naresh.rangan0@walmart.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
@meiravgri meiravgri left a comment

I know it was already merged but I had a few comments :)

Comment on lines +244 to +246
"dims": 1536,
"distance_metric": "COSINE",
"datatype": "FLOAT32",

I suggest updating the vector field attributes according to the embedding model, i.e.
self._schema = self._get_schema_with_defaults(index_schema, vector_schema, embedding)

Comment on lines +338 to +342
index_schema (Optional[Union[Dict[str, str], str, os.PathLike]], optional):
Optional fields to index within the metadata. Overrides generated
schema. Defaults to None.
vector_schema (Optional[Dict[str, Union[str, int]]], optional): Optional
vector schema to use. Defaults to None.

I wouldn't say that it defaults to None. I think that the default schema used if index_schema or/and vector_schema are not passed should be documented.

)
else:
# use the generated schema
index_schema = generated_schema

I wonder what was the idea behind indexing all the metadata fields? They are anyway stored and loaded from redis key space during the query.
Was it to allow the hybrid search?


# filled by default_vector_schema
vector: Optional[List[Union[FlatVectorField, HNSWVectorField]]] = None
content_key: str = "content"

If I understand correctly, having a field called "content" is mandatory.
Correct me if I'm wrong, but the only place I found where a user-defined name for this field might be a problem is similarity_search_with_score, where we return result.content
(in similarity_search() it is handled better IMO by using getattr(result, content_key) instead)
Another solution is to define the return fields as
Query().return_field(self.content_field, as_field="content")

Also, where is the field name enforced in from_existing_index()?

    assert output == TEST_RESULT
    assert drop(docsearch.index_name)


def test_redis_from_existing(texts: List[str]) -> None:
    """Test adding a new document"""
-   Redis.from_texts(
+   docsearch = Redis.from_texts(

I'd add tests that create or connect to an existing index that doesn't include the default fields names
