[FEATURE] Support filtering by Metadata during search in EmbeddingStore #151

TyCoding · 2023-08-31T07:12:56Z

Thanks for the project, and I'd like to ask about EmbeddingStore. I know that documents can be stored using EmbeddingStore, but are there any categorizations or divisions between different documents? For example, if I have App1 associated with Doc1 and App2 associated with Doc2 (both stored), when using App1 for chatting, I shouldn't be able to query data from Doc2. How can this be achieved?

stoneLee81 · 2023-08-31T08:27:06Z

Use two collections, or use a field to split them at query time.

langchain4j · 2023-08-31T10:06:50Z

Hi @TyCoding, can you describe your use case in more detail? What is app1 and what is app2? At first glance it seems that you need 2 different embedding stores for your use case.

TyCoding · 2023-09-01T08:52:20Z

@langchain4j
Thank you for your reply. What I meant is adding a field that identifies the source of each text when writing vectors into EmbeddingStore. For example, if Document1 comes from Book1 and Document2 comes from Book2, then when I'm chatting about Book1, I should only query the data in Document1, not the data in Document2. I did see your response in issue #114, which describes similar functionality: "metadata filtering."
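The Book1/Book2 case above can be sketched in plain Java. This is only an illustration of the idea, not langchain4j code: the `Entry` record, the `findRelevant` signature, and the `"source"` metadata key are all assumptions made for the example. Each stored entry carries a metadata map, and the search drops entries from other sources before ranking by cosine similarity.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class MetadataFilterSketch {

    // Hypothetical stored entry: a vector, its text, and free-form metadata.
    record Entry(float[] vector, String text, Map<String, String> metadata) {}

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Keep only entries whose metadata[key] equals value, then rank by similarity.
    static List<Entry> findRelevant(List<Entry> store, float[] query,
                                    String key, String value, int maxResults) {
        return store.stream()
                .filter(e -> value.equals(e.metadata().get(key)))
                .sorted(Comparator.comparingDouble(
                        (Entry e) -> cosine(e.vector(), query)).reversed())
                .limit(maxResults)
                .toList();
    }

    public static void main(String[] args) {
        List<Entry> store = List.of(
                new Entry(new float[]{1f, 0f}, "Chapter from Book1", Map.of("source", "Book1")),
                new Entry(new float[]{0.9f, 0.1f}, "Chapter from Book2", Map.of("source", "Book2")));

        // Chatting about Book1: the Book2 entry is never considered.
        findRelevant(store, new float[]{1f, 0f}, "source", "Book1", 5)
                .forEach(e -> System.out.println(e.text()));
    }
}
```

Filtering before ranking means `maxResults` counts only entries from the requested source, which is the behaviour asked for here.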

stoneLee81 · 2023-09-02T04:08:49Z

The project is great. Currently it may be the best open-source Java project for LLMs. The problem is that the project has just started, and some functions are not yet complete. For example, in the embedding store the fields encapsulated into the collection are hardcoded, and it only wraps part of the add functionality. If you want to define fields yourself, or add update and delete functions, you need to extend the class yourself or repackage it.

langchain4j · 2023-09-02T06:21:36Z

@TyCoding metadata filtering is something we will work on soon.

stephanj · 2024-01-27T08:25:22Z

Any update on metadata filtering? It's a feature I could also use for my own project! #Thanks

kuraleta · 2024-01-27T08:47:19Z

Hi @stephanj may I ask you about your use case? This is one of our top priorities, and we will start working on it next week.

lukasstanek · 2024-01-27T13:50:57Z

We have various document properties that we want the user to be able to filter on before the vector search, for example document type.

andyflury · 2024-01-28T16:33:52Z

I would also very much appreciate having metadata filtering.

Spring AI has something like this already; they call it metadata filters.

Our use case is that we index different types of documents, e.g. web pages, pages from our documentation, support tickets, knowledge base articles, etc. In total we have several thousand documents. With a metadata filter we could, for example, retrieve relevant support tickets but leave out other document types.

Following the Spring AI concept, dev.langchain4j.store.embedding.EmbeddingStore.findRelevant could be extended by adding an optional Filter parameter.

A FilterExpressionBuilder could be used to create a Filter.
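To make the proposal more concrete, here is a minimal sketch of what a store-agnostic filter with builder-style composition might look like. It is not the Spring AI API and not langchain4j code: the `Filter` interface, `matches`, `eq`, `and`, `or`, and `negate` are hypothetical names chosen for this example.

```java
import java.util.Map;

public class FilterSketch {

    // Hypothetical store-agnostic filter: a predicate over a segment's metadata.
    interface Filter {
        boolean matches(Map<String, Object> metadata);

        default Filter and(Filter other) { return m -> matches(m) && other.matches(m); }
        default Filter or(Filter other)  { return m -> matches(m) || other.matches(m); }
        default Filter negate()          { return m -> !matches(m); }
    }

    // Hypothetical builder entry point: metadata[key] must equal value.
    static Filter eq(String key, Object value) {
        return m -> value.equals(m.get(key));
    }

    public static void main(String[] args) {
        // "Support tickets written in English".
        Filter filter = eq("type", "TICKET").and(eq("lang", "en"));

        System.out.println(filter.matches(Map.of("type", "TICKET", "lang", "en")));
        System.out.println(filter.matches(Map.of("type", "WEB_PAGE", "lang", "en")));
    }
}
```

A predicate like this is enough for an in-memory store. A remote store would more likely need the filter as a data structure that it can translate into its own query language.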

langchain4j · 2024-01-31T07:12:45Z

Hi all, please share which embedding stores you need this feature for, so that we can prioritize.
Thank you!

andyflury · 2024-01-31T07:35:37Z

We are using ElasticSearch on our side. Thx!

1402564807 · 2024-02-02T06:46:31Z

> Hi all, please share for which embedding stores do you need this feature, so that we can prioritize. Thank you!

Our team is using Milvus

andyflury · 2024-02-04T17:49:37Z

As a quick workaround I created my own extension of ElasticsearchEmbeddingStoreWithFilter, which overwrites buildDefaultScriptScoreQuery as follows:

    private ScriptScoreQuery buildDefaultScriptScoreQuery(float[] vector, Query query, float minScore) throws JsonProcessingException {
        JsonData queryVector = toJsonData(vector);
        return ScriptScoreQuery.of(q -> q
                .minScore(minScore)
                .query(query)
                .script(s -> s.inline(InlineScript.of(i -> i
                        // The script adds 1.0 to the cosine similarity to prevent the score from being negative.
                        // divided by 2 to keep score in the range [0, 1]
                        .source("(cosineSimilarity(params.query_vector, 'vector') + 1.0) / 2")
                        .params("query_vector", queryVector)))));
    }
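The score arithmetic described in the script comments can be checked in isolation. The sketch below assumes nothing about Elasticsearch; `cosineSimilarity` and `toRelevanceScore` are helper names made up for this example. It shows how `(cosine + 1.0) / 2` maps the cosine range [-1, 1] onto [0, 1].

```java
public class RelevanceScoreSketch {

    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Same formula as the script source: shift by 1.0, then halve.
    static double toRelevanceScore(double cosine) {
        return (cosine + 1.0) / 2;
    }

    public static void main(String[] args) {
        float[] q = {1f, 0f};
        System.out.println(toRelevanceScore(cosineSimilarity(q, new float[]{1f, 0f})));  // same direction
        System.out.println(toRelevanceScore(cosineSimilarity(q, new float[]{0f, 1f})));  // orthogonal
        System.out.println(toRelevanceScore(cosineSimilarity(q, new float[]{-1f, 0f}))); // opposite
    }
}
```

With this mapping, a `minScore` of 0.5 corresponds to a raw cosine similarity of 0.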

In addition, I added a co.elastic.clients.elasticsearch._types.query_dsl.Query parameter to findRelevant:

    public List<EmbeddingMatch<TextSegment>> findRelevant(Embedding referenceEmbedding, Query query, int maxResults, double minScore) {

When calling findRelevant I can now pass an extra Query parameter like this:

    Query query = Query.of(qu -> qu.bool(qb -> qb.filter(qf -> qf.term(qt -> qt.field("metadata.source.keyword").value("TICKET")))));

The above query limits searches to documents whose 'metadata.source.keyword' equals 'TICKET', but you could pass any type of Query you like.

This is obviously not a full solution, as you definitely would not want Elasticsearch-specific queries in the langchain4j API.
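One way to keep store-specific query types out of a public API is to accept a generic condition and translate it inside each store. The sketch below illustrates that idea only. `IsEqualTo` and `toElasticsearchQuery` are hypothetical names, and the query is built as plain nested maps rather than with the Elasticsearch client. The resulting structure mirrors the bool/filter/term query above, including the `metadata.<key>.keyword` field naming.

```java
import java.util.List;
import java.util.Map;

public class FilterTranslationSketch {

    // Hypothetical store-agnostic condition: metadata[key] == value.
    record IsEqualTo(String key, String value) {}

    // Translate the generic condition into an Elasticsearch-style query body,
    // represented as nested maps: {"bool": {"filter": [{"term": {field: value}}]}}.
    static Map<String, Object> toElasticsearchQuery(IsEqualTo condition) {
        String field = "metadata." + condition.key() + ".keyword";
        return Map.of("bool",
                Map.of("filter",
                        List.of(Map.of("term", Map.of(field, condition.value())))));
    }

    public static void main(String[] args) {
        // The caller only sees the generic condition, never a store-specific type.
        System.out.println(toElasticsearchQuery(new IsEqualTo("source", "TICKET")));
    }
}
```

Another store would provide its own translation of the same condition, so callers would not need to change when the backend does.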

langchain4j · 2024-02-06T10:07:59Z

@andyflury Thanks for the insights! I am not very familiar with Elasticsearch, but shouldn't filtering be done outside of ScriptScoreQuery?

andyflury · 2024-02-06T10:16:27Z

I'm also not an Elasticsearch expert. I basically copied this logic from here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions-cosine

langchain4j · 2024-02-07T16:43:15Z

Hi all, here is a draft, comments are welcome!

langchain4j · 2024-03-12T07:49:47Z

Metadata filtering is now supported in version 0.28.0 for InMemoryEmbeddingStore, MilvusEmbeddingStore and ElasticsearchEmbeddingStore. You are welcome to contribute support for the remaining EmbeddingStores!

TyCoding added the enhancement (New feature or request) label Aug 31, 2023

langchain4j changed the title ~~About EmbeddingStore~~ Support filtering by Metadata during search in EmbeddingStore Oct 6, 2023

langchain4j changed the title ~~Support filtering by Metadata during search in EmbeddingStore~~ [FEATURE] Support filtering by Metadata during search in EmbeddingStore Oct 6, 2023

Heezer mentioned this issue Oct 17, 2023

[FEATURE] Classify embeddings in store. #207

Closed

boris-petrov mentioned this issue Feb 1, 2024

[FEATURE] Update and remove methods should be added to EmbeddingStore #583

Closed

langchain4j closed this as completed Mar 12, 2024
