Redis metadata filtering and specification, index customization #8612
nice -- left a few comments and questions
would be good to have someone from redis side review the redis-specific logic!
        "distance_metric": "COSINE",
        "datatype": "FLOAT32",
    }

    def __init__(
        self,
        redis_url: str,
        index_name: str,
        embedding_function: Callable,
since we're already making breaking changes, i'd suggest we update this to be
-        embedding_function: Callable,
+        embedding: Embeddings,
which is the interface all newer VectorStores have
Add the ability to clean the metadata before it goes into redis enabling document_loaders that return lists of strings to create categorical values for Tags in Redis indices. Also, added docstrings and updated the jupyter notebook
0341cc4 to 48b7df5
fixed some of the lint issues here #9705 if you want to merge that into this pr
        base_query = f"({query_prefix})=>[KNN {k} @{vector_key} $vector AS score]"

        query = (
            Query(base_query).return_fields(*return_fields).sort_by("score").dialect(2)
still need to add .paging(0, k) here
-            Query(base_query).return_fields(*return_fields).sort_by("score").dialect(2)
+            Query(base_query)
+            .return_fields(*return_fields)
+            .sort_by("score")
+            .paging(0, k)
+            .dialect(2)
Added a few RediSearch syntax-related comments
    from langchain.vectorstores import Redis
    from langchain.embeddings import OpenAIEmbeddings
    embeddings = OpenAIEmbeddings()
    redisearch, keys = RediSearch.from_texts_return_keys(
Isn't it Redis.from_texts_return_keys...?
    OPERATOR_MAP = {
        RedisFilterOperator.EQ: '@%s:"%s"',
        RedisFilterOperator.NE: '(-@%s:"%s")',
        RedisFilterOperator.LIKE: "@%s:%s",
I'm not sure that LIKE does what you meant it to. The difference between @%s:"%s" and @%s:%s is that the first finds exact matches for the whole phrase, while the second also looks for exact matches but for each token separately. Also, using @%s:%s (without quotes) supports stemming by default.
To get behavior closer to SQL's LIKE operator, we can use prefix, infix or suffix matching (see query docs).
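To make the distinction concrete, here is a minimal sketch of the query fragments each matching style would produce. The helper names are hypothetical (they are not part of this PR), values are assumed to need no escaping, and infix/suffix matching is assumed to be available in the RediSearch version in use.

```python
# Hypothetical helpers, for illustration only; not part of the PR.
# Values are assumed to contain no characters that need escaping.


def exact_phrase(field: str, value: str) -> str:
    """Quoted form: matches the whole phrase exactly."""
    return f'@{field}:"{value}"'


def prefix_match(field: str, value: str) -> str:
    """Roughly LIKE 'value%': terms that start with value."""
    return f"@{field}:{value}*"


def suffix_match(field: str, value: str) -> str:
    """Roughly LIKE '%value': terms that end with value."""
    return f"@{field}:*{value}"


def infix_match(field: str, value: str) -> str:
    """Roughly LIKE '%value%': terms that contain value."""
    return f"@{field}:*{value}*"


if __name__ == "__main__":
    for build in (exact_phrase, prefix_match, suffix_match, infix_match):
        print(build.__name__, "->", build("job", "eng"))
```

Mapping LIKE to one of the wildcard forms, rather than to the unquoted token form, would keep the operator's behavior closer to what SQL users expect.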
        Query(query_string)
        .return_fields(*return_fields)
        .sort_by("distance")
        .paging(0, k)
One of the advantages of range queries is that you can get all the results that are within a given range (not just the top k ones), and in particular you can see how many results are within the range. To utilize this feature I would suggest having k as an optional argument (and perhaps using an upper bound such as 1000 for paging), if that makes sense to you.
agreed -- was going to suggest using K as an optional upper bound. @alonre24 isn't paging a default param 0-10 though?
Maybe something like:
    query = (
        Query(query_string)
        .return_fields(*return_fields)
        .sort_by("distance")
        .dialect(2)
    )
    if k:
        query = query.paging(0, k)
Yes, the default paging is 10, that's why I suggest also choosing an upper bound, so we have:
query = query.paging(0, k if k else UPPER_BOUND)
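As a rough illustration of the idea, the sketch below builds the raw FT.SEARCH argument list with an explicit LIMIT, so the default page size of 10 never applies. The function name and the value of UPPER_BOUND are assumptions (the thread only suggests "such as 1000"), and the sort field "distance" is taken from the hunk above.

```python
from typing import List, Optional

UPPER_BOUND = 1000  # assumed cap; the thread only suggests "such as 1000"


def range_search_args(
    index: str,
    query_string: str,
    k: Optional[int] = None,
    upper_bound: int = UPPER_BOUND,
) -> List[str]:
    """Build FT.SEARCH arguments for a range query (hypothetical helper).

    Without an explicit LIMIT, only the first 10 results are returned,
    so a LIMIT is always emitted: k when given, else the upper bound.
    """
    limit = k if k else upper_bound
    return [
        "FT.SEARCH", index, query_string,
        "SORTBY", "distance",
        "LIMIT", "0", str(limit),
        "DIALECT", "2",
    ]


if __name__ == "__main__":
    print(range_search_args("users", "@content_vector:[VECTOR_RANGE 0.2 $vector]"))
```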
It's part of the higher-level interface abstraction, so it would have to be set. I've started this conversation with the lc folks though. Most likely the best route is another method that exposes this feature better.
        # if it's a list of strings, we assume it's a tag
        if isinstance(value, (list, tuple)):
            if not value or isinstance(value[0], str):
why is `if not value` ok?
same thing as saying len(value) == 0.
>>> x = []
>>> if not x:
...     print("it's like an emptiness check")
...
it's like an emptiness check
looks cleaner than len check or try/except
        # if it's a list/tuple of strings, we join it
        elif isinstance(value, (list, tuple)):
            if not value or isinstance(value[0], str):
                clean_meta[key] = ",".join(value)
Note that if there are tag values within the list that contain a comma, they will be split when indexed in RediSearch. Consider validating that there are no such values, or allowing the user to specify a different separator (API allows it).
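One way to act on this, sketched under assumptions: validate each tag before joining and let the caller choose the separator. The function name `join_tags` and its signature are hypothetical and not part of the PR.

```python
from typing import Sequence


def join_tags(values: Sequence[str], separator: str = ",") -> str:
    """Join tag values for a TAG field (hypothetical helper).

    Raises ValueError if any value contains the separator, because that
    value would be split into several tags when indexed.
    """
    for value in values:
        if separator in value:
            raise ValueError(
                f"tag value {value!r} contains the separator {separator!r}"
            )
    return separator.join(values)


if __name__ == "__main__":
    print(join_tags(["shoes", "running"]))
    print(join_tags(["shoes, running", "sale"], separator="|"))
```

If a non-default separator is used here, the TAG field in the index schema would need to be declared with the same separator.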
So currently, they could specify it in the schema and clean up the data themselves beforehand, but for the automatically generated metadata (how most will use it because they just use the data loaders), it's defaulting to `,` right now. Been thinking I should have a default here anyway. Good call out.
    >>> from langchain.vectorstores.redis import RedisTag, RedisNum
    >>> brand_is_nike = RedisTag("brand") == "nike"
    >>> price_is_over_100 = RedisNum("price") < 100
-    >>> price_is_over_100 = RedisNum("price") < 100
+    >>> price_is_under_100 = RedisNum("price") < 100
…chain-ai#8612)

### Description

The previous Redis implementation did not allow the user to specify the index configuration (e.g. changing the underlying algorithm) or add additional metadata to use for querying (e.g. hybrid or "filtered" search).

This PR introduces the ability to specify custom index attributes and metadata attributes as well as use that metadata in filtered queries. Overall, more structure was introduced to the Redis implementation that should allow for easier maintainability moving forward.

# New Features

The following features are now available with the Redis integration into Langchain

## Index schema generation

The schema for the index will now be automatically generated if not specified by the user. For example, the data above has multiple metadata categories. Consider the following example:

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores.redis import Redis

embeddings = OpenAIEmbeddings()

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users"
)
```

Loading the data in through this and the other ``from_documents`` and ``from_texts`` methods will now generate an index schema in Redis like the following. You can view the index schema with the ``redisvl`` tool. [link](redisvl.com)

```bash
$ rvl index info -i users
```

Index Information:

| Index Name   | Storage Type   | Prefixes      | Index Options   | Indexing   |
|--------------|----------------|---------------|-----------------|------------|
| users        | HASH           | ['doc:users'] | []              | 0          |

Index Fields:

| Name           | Attribute      | Type    | Field Option   | Option Value   |
|----------------|----------------|---------|----------------|----------------|
| user           | user           | TEXT    | WEIGHT         | 1              |
| job            | job            | TEXT    | WEIGHT         | 1              |
| credit_score   | credit_score   | TEXT    | WEIGHT         | 1              |
| content        | content        | TEXT    | WEIGHT         | 1              |
| age            | age            | NUMERIC |                |                |
| content_vector | content_vector | VECTOR  |                |                |

### Custom Metadata specification

The metadata schema generation has the following rules:

1. All text fields are indexed as text fields.
2. All numeric fields are indexed as numeric fields.

If you would like to have a text field indexed as a tag field, you can specify overrides like the following for the example data:

```python
# this can also be a path to a yaml file
index_schema = {
    "text": [{"name": "user"}, {"name": "job"}],
    "tag": [{"name": "credit_score"}],
    "numeric": [{"name": "age"}],
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users2",  # matches the index name in the output below
    index_schema=index_schema,  # pass the override so it takes effect
)
```

This will change the index specification to

Index Information:

| Index Name   | Storage Type   | Prefixes       | Index Options   | Indexing   |
|--------------|----------------|----------------|-----------------|------------|
| users2       | HASH           | ['doc:users2'] | []              | 0          |

Index Fields:

| Name           | Attribute      | Type    | Field Option   | Option Value   |
|----------------|----------------|---------|----------------|----------------|
| user           | user           | TEXT    | WEIGHT         | 1              |
| job            | job            | TEXT    | WEIGHT         | 1              |
| content        | content        | TEXT    | WEIGHT         | 1              |
| credit_score   | credit_score   | TAG     | SEPARATOR      | ,              |
| age            | age            | NUMERIC |                |                |
| content_vector | content_vector | VECTOR  |                |                |

and log a warning to the user that the generated schema does not match the specified schema.

```text
index_schema does not match generated schema from metadata.
index_schema: {'text': [{'name': 'user'}, {'name': 'job'}], 'tag': [{'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}
generated_schema: {'text': [{'name': 'user'}, {'name': 'job'}, {'name': 'credit_score'}], 'numeric': [{'name': 'age'}]}
```

As long as this is on purpose, this is fine.

The schema can be defined as a yaml file or a dictionary

```yaml
text:
  - name: user
  - name: job
tag:
  - name: credit_score
numeric:
  - name: age
```

and you pass in a path like

```python
from pathlib import Path

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    index_schema=Path("sample1.yml").resolve()
)
```

This will create the same schema as defined in the dictionary example.

Index Information:

| Index Name   | Storage Type   | Prefixes       | Index Options   | Indexing   |
|--------------|----------------|----------------|-----------------|------------|
| users3       | HASH           | ['doc:users3'] | []              | 0          |

Index Fields:

| Name           | Attribute      | Type    | Field Option   | Option Value   |
|----------------|----------------|---------|----------------|----------------|
| user           | user           | TEXT    | WEIGHT         | 1              |
| job            | job            | TEXT    | WEIGHT         | 1              |
| content        | content        | TEXT    | WEIGHT         | 1              |
| credit_score   | credit_score   | TAG     | SEPARATOR      | ,              |
| age            | age            | NUMERIC |                |                |
| content_vector | content_vector | VECTOR  |                |                |

### Custom Vector Indexing Schema

Users with large use cases may want to change how they formulate the vector index created by Langchain.

To utilize all the features of Redis for vector database use cases like this, you can now pass in index attribute modifiers, for example changing the indexing algorithm to HNSW.

```python
vector_schema = {
    "algorithm": "HNSW"
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)
```

A more complex example may look like

```python
vector_schema = {
    "algorithm": "HNSW",
    "ef_construction": 200,
    "ef_runtime": 20
}

rds, keys = Redis.from_texts_return_keys(
    texts,
    embeddings,
    metadatas=metadata,
    redis_url="redis://localhost:6379",
    index_name="users3",
    vector_schema=vector_schema
)
```

All names correspond to the arguments you would set if using Redis-py or RedisVL. (put in doc link later)

### Better Querying

Both vector queries and Range (limit) queries are now available and metadata is returned by default. The outputs are shown.

```python
>>> query = "foo"
>>> results = rds.similarity_search(query, k=1)
>>> print(results)
[Document(page_content='foo', metadata={'user': 'derrick', 'job': 'doctor', 'credit_score': 'low', 'age': '14', 'id': 'doc:users:657a47d7db8b447e88598b83da879b9d', 'score': '7.15255737305e-07'})]

>>> results = rds.similarity_search_with_score(query, k=1, return_metadata=False)
>>> print(results) # no metadata, but with scores
[(Document(page_content='foo', metadata={}), 7.15255737305e-07)]

>>> results = rds.similarity_search_limit_score(query, k=6, score_threshold=0.0001)
>>> print(len(results)) # range query (only above threshold even if k is higher)
4
```

### Custom metadata filtering

A big advantage of Redis in this space is being able to do filtering on data stored alongside the vector itself. With the example above, the following is now possible in langchain. The equivalence operators are overridden to describe a new expression language that mimics that of [redisvl](redisvl.com). This allows for arbitrarily long sequences of filters that resemble SQL commands and can be used directly with vector queries and range queries.

There are two interfaces by which to do so and both are shown.

```python
>>> from langchain.vectorstores.redis import RedisFilter, RedisNum, RedisText

>>> age_filter = RedisFilter.num("age") > 18
>>> age_filter = RedisNum("age") > 18 # equivalent
>>> results = rds.similarity_search(query, filter=age_filter)
>>> print(len(results))
3

>>> job_filter = RedisFilter.text("job") == "engineer"
>>> job_filter = RedisText("job") == "engineer" # equivalent
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2

# fuzzy match text search
>>> job_filter = RedisFilter.text("job") % "eng*"
>>> results = rds.similarity_search(query, filter=job_filter)
>>> print(len(results))
2

# combined filters (AND)
>>> combined = age_filter & job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
1

# combined filters (OR)
>>> combined = age_filter | job_filter
>>> results = rds.similarity_search(query, filter=combined)
>>> print(len(results))
4
```

All the above filter results can be checked against the data above.

### Other

- Issue: langchain-ai#3967
- Dependencies: No added dependencies
- Tag maintainer: @hwchase17 @baskaryan @rlancemartin
- Twitter handle: @sampartee

---------

Co-authored-by: Naresh Rangan <naresh.rangan0@walmart.com>
Co-authored-by: Bagatur <baskaryan@gmail.com>
I know it was already merged but I had a few comments :)
        "dims": 1536,
        "distance_metric": "COSINE",
        "datatype": "FLOAT32",
I suggest updating the vector field attributes according to the embedding model, i.e.
self._schema = self._get_schema_with_defaults(index_schema, vector_schema, embedding)
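A minimal sketch of what that could look like, assuming the dimensionality is inferred by embedding a short probe string. The function name, the `embed_query` callable, and the probe approach are all assumptions for illustration; only the three default values come from the hunk above.

```python
from typing import Callable, Dict, List, Optional, Union

# Defaults as shown in the diff hunk above.
DEFAULT_VECTOR_SCHEMA: Dict[str, Union[str, int]] = {
    "dims": 1536,
    "distance_metric": "COSINE",
    "datatype": "FLOAT32",
}


def vector_schema_with_defaults(
    embed_query: Callable[[str], List[float]],
    overrides: Optional[Dict[str, Union[str, int]]] = None,
) -> Dict[str, Union[str, int]]:
    """Merge defaults, inferred dims, and user overrides (hypothetical).

    Precedence, lowest to highest: defaults, dims inferred from the
    embedding model, then explicit user overrides.
    """
    schema = dict(DEFAULT_VECTOR_SCHEMA)
    # Assumption: one probe embedding call is acceptable at setup time.
    schema["dims"] = len(embed_query("probe"))
    schema.update(overrides or {})
    return schema


if __name__ == "__main__":
    def fake_embed(text: str) -> List[float]:
        return [0.0] * 384  # stand-in for a real embedding model

    print(vector_schema_with_defaults(fake_embed, {"algorithm": "HNSW"}))
```

The trade-off is one extra embedding request per store construction in exchange for not hard-coding 1536.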
            index_schema (Optional[Union[Dict[str, str], str, os.PathLike]], optional):
                Optional fields to index within the metadata. Overrides generated
                schema. Defaults to None.
            vector_schema (Optional[Dict[str, Union[str, int]]], optional): Optional
                vector schema to use. Defaults to None.
I wouldn't say that it defaults to None. I think that the default schema used if index_schema and/or vector_schema are not passed should be documented.
            )
        else:
            # use the generated schema
            index_schema = generated_schema
I wonder what was the idea behind indexing all the metadata fields? They are anyway stored and loaded from redis key space during the query.
Was it to allow the hybrid search?
    # filled by default_vector_schema
    vector: Optional[List[Union[FlatVectorField, HNSWVectorField]]] = None
    content_key: str = "content"
If I understand correctly, having a field called "content" is mandatory.
Correct me if I'm wrong, but the only place I found where giving this field a user-defined name might be a problem is in similarity_search_with_score, where we return result.content (in similarity_search() it is handled better IMO by using getattr(result, content_key) instead).
Another solution is to define the return fields as
Query().return_field(self.content_field, as_field="content")
Also, where is the field name enforced in from_existing_index()?
    assert output == TEST_RESULT
    assert drop(docsearch.index_name)


def test_redis_from_existing(texts: List[str]) -> None:
    """Test adding a new document"""
-    Redis.from_texts(
+    docsearch = Redis.from_texts(
I'd add tests that create or connect to an existing index that doesn't include the default field names
Description
The previous Redis implementation did not allow the user to specify the index configuration (e.g. changing the underlying algorithm) or add additional metadata to use for querying (e.g. hybrid or "filtered" search).
This PR introduces the ability to specify custom index attributes and metadata attributes as well as use that metadata in filtered queries. Overall, more structure was introduced to the Redis implementation that should allow for easier maintainability moving forward.
Example data
Suppose we have the following sample data
New Features
The following features are now available with the Redis integration into Langchain
Index schema generation
The schema for the index will now be automatically generated if not specified by the user. For example, the data above has multiple metadata categories. Consider the following example.
Loading the data in through this and the other from_documents and from_texts methods will now generate an index schema in Redis like the following. You can view the index schema with the redisvl tool (link).
Index Information:
Custom Metadata specification
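The metadata schema generation has the following rules:
1. All text fields are indexed as text fields.
2. All numeric fields are indexed as numeric fields.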
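The generation rules (text values indexed as text fields, numeric values as numeric fields) can be sketched roughly as follows. This is an illustrative approximation, not the PR's implementation: the function name is hypothetical, and treating lists of strings as tags is taken from the review hunk earlier in this thread.

```python
from typing import Any, Dict, List


def generate_schema(metadata: Dict[str, Any]) -> Dict[str, List[Dict[str, str]]]:
    """Infer an index schema from one metadata dict (hypothetical sketch)."""
    schema: Dict[str, List[Dict[str, str]]] = {"text": [], "numeric": [], "tag": []}
    for key, value in metadata.items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            schema["numeric"].append({"name": key})
        elif isinstance(value, str):
            schema["text"].append({"name": key})
        elif isinstance(value, (list, tuple)) and (
            not value or isinstance(value[0], str)
        ):
            # Empty lists and lists of strings are treated as tags.
            schema["tag"].append({"name": key})
        # Any other type (bool, dict, ...) is left out of the schema here.
    # Drop categories that ended up empty.
    return {kind: fields for kind, fields in schema.items() if fields}


if __name__ == "__main__":
    sample = {"user": "derrick", "job": "doctor", "credit_score": "low", "age": 14}
    print(generate_schema(sample))
```

For the sample row in the usage block, `credit_score` lands under text, which is why an explicit override is needed to index it as a tag.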
If you would like to have a text field as a tag field, users can specify overrides like the following for the example data
This will change the index specification to
Index Information:
and throw a warning to the user (log output) that the generated schema does not match the specified schema.
As long as this is on purpose, this is fine.
The schema can be defined as a yaml file or a dictionary
and you pass in a path like
Which will create the same schema as defined in the dictionary example
Index Information:
Custom Vector Indexing Schema
Users with large use cases may want to change how they formulate the vector index created by Langchain.
To utilize all the features of Redis for vector database use cases like this, you can now pass in index attribute modifiers, for example changing the indexing algorithm to HNSW.
A more complex example may look like
All names correspond to the arguments you would set if using Redis-py or RedisVL. (put in doc link later)
Better Querying
Both vector queries and Range (limit) queries are now available and metadata is returned by default. The outputs are shown.
Custom metadata filtering
A big advantage of Redis in this space is being able to do filtering on data stored alongside the vector itself. With the example above, the following is now possible in langchain. The equivalence operators are overridden to describe a new expression language that mimics that of redisvl. This allows for arbitrarily long sequences of filters that resemble SQL commands and can be used directly with vector queries and range queries.
There are two interfaces by which to do so and both are shown.
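To illustrate how overriding comparison operators can yield such an expression language, here is a minimal sketch. The class names `FilterExpr`, `NumField` and `TagField` are hypothetical stand-ins, not the PR's RedisNum/RedisTag/RedisFilter classes, and values are assumed to need no escaping.

```python
class FilterExpr:
    """A RediSearch filter fragment that can be combined with & and |."""

    def __init__(self, query: str) -> None:
        self.query = query

    def __and__(self, other: "FilterExpr") -> "FilterExpr":
        # Space-separated clauses are intersected (AND).
        return FilterExpr(f"({self.query} {other.query})")

    def __or__(self, other: "FilterExpr") -> "FilterExpr":
        # A pipe between clauses is a union (OR).
        return FilterExpr(f"({self.query} | {other.query})")

    def __str__(self) -> str:
        return self.query


class NumField:
    """Numeric field; comparisons return FilterExpr instead of bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __gt__(self, value: float) -> FilterExpr:
        return FilterExpr(f"@{self.name}:[({value} +inf]")  # "(" = exclusive

    def __lt__(self, value: float) -> FilterExpr:
        return FilterExpr(f"@{self.name}:[-inf ({value}]")

    def __eq__(self, value: object) -> FilterExpr:  # type: ignore[override]
        return FilterExpr(f"@{self.name}:[{value} {value}]")


class TagField:
    """Tag field; equality becomes an exact tag match."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, value: object) -> FilterExpr:  # type: ignore[override]
        return FilterExpr(f"@{self.name}:{{{value}}}")


if __name__ == "__main__":
    adult = NumField("age") > 18
    nike = TagField("brand") == "nike"
    print(adult & nike)
    print(adult | nike)
```

Returning an expression object rather than a bool is what lets filters be chained with `&` and `|` and rendered to a query string only when the search runs.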
All the above filter results can be checked against the data above.
TODO
Other