[FEATURE] Support filtering by Metadata during search in EmbeddingStore #151

Closed
TyCoding opened this issue Aug 31, 2023 · 17 comments
Labels
enhancement New feature or request

Comments

@TyCoding

Thanks for the project, and I'd like to ask about EmbeddingStore. I know that documents can be stored using EmbeddingStore, but are there any categorizations or divisions between different documents? For example, if I have App1 associated with Doc1 and App2 associated with Doc2 (both stored), when using App1 for chatting, I shouldn't be able to query data from Doc2. How can this be achieved?

@TyCoding TyCoding added the enhancement New feature or request label Aug 31, 2023
@stoneLee81

Use two collections, or use a field to separate them at query time.

@langchain4j
Owner

Hi @TyCoding, can you describe your use case in more detail? What is app1 and what is app2? At first glance it seems that you need 2 different embedding stores for your use case.

@TyCoding
Author

TyCoding commented Sep 1, 2023

@langchain4j
Thank you for your reply. What I actually meant is adding a field that identifies the source of the current text when writing vectors into EmbeddingStore. For example, if Document1 comes from Book1 and Document2 comes from Book2, then when I'm chatting about Book1, I should only query the data in Document1 and not the data in Document2. However, I did see your response in issue #114, which describes similar functionality: "metadata filtering."
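
The idea described above (tag each stored segment with its source, then restrict the search to one source) can be sketched in plain Java. This is only an illustration under stated assumptions: `TaggedVectorStore`, `Entry`, and this `findRelevant` signature are hypothetical names and are not the langchain4j `EmbeddingStore` API; vectors are assumed to be non-zero and of equal length.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical in-memory store for illustration only; NOT the langchain4j EmbeddingStore API.
class TaggedVectorStore {

    // One stored segment: its text, its embedding, and free-form metadata such as "source".
    static final class Entry {
        final String text;
        final float[] vector;
        final Map<String, String> metadata;

        Entry(String text, float[] vector, Map<String, String> metadata) {
            this.text = text;
            this.vector = vector;
            this.metadata = metadata;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    void add(String text, float[] vector, Map<String, String> metadata) {
        entries.add(new Entry(text, vector, metadata));
    }

    // Keep only entries whose metadata passes the filter, then rank them by cosine similarity.
    List<Entry> findRelevant(float[] query, Predicate<Map<String, String>> filter, int maxResults) {
        return entries.stream()
                .filter(e -> filter.test(e.metadata))
                .sorted(Comparator.comparingDouble((Entry e) -> cosine(query, e.vector)).reversed())
                .limit(maxResults)
                .collect(Collectors.toList());
    }

    // Cosine similarity; assumes both vectors are non-zero and have the same length.
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

With segments stored under a `source` key, a chat about Book1 would pass a filter such as `m -> "Book1".equals(m.get("source"))`, so segments from Book2 are never ranked, however similar their vectors are.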

@stoneLee81

The project is great; it may currently be the best open-source Java project for LLMs. The problem is that the project has only just started, so some features are incomplete. In the embedding store, for example, the fields written to the collection are hardcoded, and it wraps only part of the functionality, such as add. If you want to define your own fields, or add update and delete operations, you need to extend the class yourself or re-wrap it.

@langchain4j
Owner

@TyCoding metadata filtering is something we will work on soon.

@langchain4j langchain4j changed the title About EmbeddingStore Support filtering by Metadata during search in EmbeddingStore Oct 6, 2023
@langchain4j langchain4j changed the title Support filtering by Metadata during search in EmbeddingStore [FEATURE] Support filtering by Metadata during search in EmbeddingStore Oct 6, 2023
@stephanj

Is there any update on metadata filtering? It's a feature I could also use in my own project. Thanks!

@kuraleta
Collaborator

Hi @stephanj may I ask you about your use case? This is one of our top priorities, and we will start working on it next week.

@lukasstanek

We have various document properties that we want users to be able to filter on before the vector search, for example document type.

@andyflury

We would also very much appreciate having metadata filtering.

Spring AI already has something like this; they call it metadata filters.

Our use case is that we index different types of documents, e.g. web pages, pages from our documentation, support tickets, knowledge base articles, etc. In total we have several thousand documents. With a metadata filter we could, for example, retrieve relevant support tickets but leave out other document types.

Following the Spring AI concept, dev.langchain4j.store.embedding.EmbeddingStore.findRelevant could be extended by adding an optional Filter parameter.

A FilterExpressionBuilder could be used to create a Filter.
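
To make the proposal more concrete, here is a minimal sketch of what a composable filter abstraction could look like. All names (`MetadataFilter`, `eq`, `in`, `not`) are hypothetical and are not the Spring AI or langchain4j API. The sketch evaluates filters in memory only to show the semantics; a real store would presumably translate the expression into its native query language instead.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical filter abstraction for illustration; not the Spring AI or langchain4j API.
interface MetadataFilter {

    // Returns true if a segment with this metadata should be considered by the search.
    boolean matches(Map<String, Object> metadata);

    // Both this filter and the other one must match.
    default MetadataFilter and(MetadataFilter other) {
        return m -> matches(m) && other.matches(m);
    }

    // Either this filter or the other one must match.
    default MetadataFilter or(MetadataFilter other) {
        return m -> matches(m) || other.matches(m);
    }

    // Inverts a filter.
    static MetadataFilter not(MetadataFilter filter) {
        return m -> !filter.matches(m);
    }

    // Matches when the metadata value stored under key equals value.
    static MetadataFilter eq(String key, Object value) {
        return m -> value.equals(m.get(key));
    }

    // Matches when the metadata value stored under key is one of the given values.
    static MetadataFilter in(String key, Object... values) {
        return m -> Arrays.asList(values).contains(m.get(key));
    }
}
```

Under these assumptions, the support-ticket use case above would read `MetadataFilter.eq("source", "TICKET")`, and several document types could be selected with `MetadataFilter.in("source", "TICKET", "KB_ARTICLE")`.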

@langchain4j
Owner

Hi all, please share which embedding stores you need this feature for, so that we can prioritize.
Thank you!

@andyflury

We are using ElasticSearch on our side. Thx!

@1402564807
Contributor

> Hi all, please share which embedding stores you need this feature for, so that we can prioritize. Thank you!

Our team is using Milvus

@andyflury

As a quick workaround, I created my own extension, ElasticsearchEmbeddingStoreWithFilter, which overrides buildDefaultScriptScoreQuery as follows:

    private ScriptScoreQuery buildDefaultScriptScoreQuery(float[] vector, Query query, float minScore) throws JsonProcessingException {
        JsonData queryVector = toJsonData(vector);
        return ScriptScoreQuery.of(q -> q
                .minScore(minScore)
                .query(query)
                .script(s -> s.inline(InlineScript.of(i -> i
                        // The script adds 1.0 to the cosine similarity to prevent the score from being negative.
                        // divided by 2 to keep score in the range [0, 1]
                        .source("(cosineSimilarity(params.query_vector, 'vector') + 1.0) / 2")
                        .params("query_vector", queryVector)))));
    }

I also added a co.elastic.clients.elasticsearch._types.query_dsl.Query parameter to findRelevant:

    public List<EmbeddingMatch<TextSegment>> findRelevant(Embedding referenceEmbedding, Query query, int maxResults, double minScore) {

When calling findRelevant, I can now pass an extra Query parameter like this:

    Query query = Query.of(qu -> qu.bool(qb -> qb.filter(qf -> qf.term(qt -> qt.field("metadata.source.keyword").value("TICKET")))));

The above query limits searches to documents whose 'metadata.source.keyword' equals 'TICKET', but you could pass any type of Query you like.

This is obviously not a full solution, as you definitely would not want Elasticsearch-specific queries in the langchain4j API.
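
As a side note on the script source in the workaround above: it maps cosine similarity from the range [-1, 1] to [0, 1], which is why minScore can be given as a value between 0 and 1. A minimal plain-Java sketch of the same arithmetic follows; the class and method names are hypothetical and are not part of langchain4j or the Elasticsearch client, and the vectors are assumed to be non-zero and of equal length.

```java
// Hypothetical helper for illustration; mirrors the arithmetic of the script source above.
class RelevanceScoreSketch {

    // Cosine similarity of two non-zero vectors of equal length, in the range [-1, 1].
    static double cosineSimilarity(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Same formula as the script: shift by 1.0 so the value is non-negative, then halve it.
    static double toRelevanceScore(double cosine) {
        return (cosine + 1.0) / 2;
    }
}
```

Identical directions score 1.0, orthogonal vectors 0.5, and opposite directions 0.0, so a minScore of 0.5 keeps only vectors with non-negative cosine similarity.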

@langchain4j
Owner

@andyflury Thanks for the insights! I am not very familiar with Elasticsearch, but shouldn't filtering be done outside of ScriptScoreQuery?

@andyflury

I'm also not an Elasticsearch expert. I basically copied this logic from here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-script-score-query.html#vector-functions-cosine

@langchain4j
Owner

Hi all, here is a draft, comments are welcome!

@langchain4j
Owner

langchain4j commented Mar 12, 2024

Metadata filtering is now supported in version 0.28.0 for InMemoryEmbeddingStore, MilvusEmbeddingStore and ElasticsearchEmbeddingStore. You are welcome to contribute support for the remaining EmbeddingStores!
