Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Feature Request] Dynamically update maxClauseCount based on resources on a node #12549

Closed
harshavamsi opened this issue Mar 6, 2024 · 5 comments
Labels
enhancement Enhancement or improvement to existing feature or request Search:Resiliency v2.16.0 Issues and PRs related to version 2.16.0 v3.0.0 Issues and PRs related to version 3.0.0

Comments

@harshavamsi
Copy link
Contributor

Is your feature request related to a problem? Please describe

Today lucene has a concept of max_clause_count which prevents the number of clauses from exceeding when expanding wildcard and prefix queries. OpenSearch inherited ES 7.10 which used a static index.query.bool.max_clause_count setting that is set to a max of 1024 by default.

In https://issues.apache.org/jira/browse/LUCENE-8811, lucene changed the clause count from affecting only boolean queries to all queries on the index by adding the constraint at the indexSearcher level. This means that the expansion effects all the fields and is now very easy to cross the limit.

A workaround was to use a rewrite to avoid hitting that limit, but it no longer works since the number of clauses after a rewrite is now what is counted.

A lot of users trying to migrate from OS 1.x to 2.x have reported this problem where their queries would previously not hit the clause limit, but since upgrading have now run into the limits. ElasticSearch themselves have updated the way that the limit is computed and moved away from a static limit of 1024 to a more dynamic resource based(thread pool and heap size) limit.

Graylog2/graylog2-server#14272 is an example of people running into this issue.

Describe the solution you'd like

Rather than relying on INDICES_MAX_CLAUSE_COUNT_SETTING from cluster settings, compute threadpool and heap sizes from jvm stats and use them to determine clause limit.

Is there a better metric to compute them? The limit was introduced to prevent rouge queries from consuming too many threads, CPU, and memory. It would make sense to update the limit based on these attributes.

Related component

Search:Resiliency

Describe alternatives you've considered

No response

Additional context

No response

@harshavamsi harshavamsi added enhancement Enhancement or improvement to existing feature or request untriaged labels Mar 6, 2024
@harshavamsi
Copy link
Contributor Author

@msfroh @reta @andrross thoughts?

@reta
Copy link
Collaborator

reta commented Mar 8, 2024

Thanks @harshavamsi

Rather than relying on INDICES_MAX_CLAUSE_COUNT_SETTING from cluster settings, compute threadpool and heap sizes from jvm stats and use them to determine clause limit

AFAIK this is not a cluster but node level setting

Is there a better metric to compute them? The limit was introduced to prevent rouge queries from consuming too many threads, CPU, and memory. It would make sense to update the limit based on these attributes.

I think it would make sense to explore this path, but it also not clear to me how to align this max number with cpu / heap (that would vary per node), @msfroh any heuristics you may be aware of? thank you

@andrross
Copy link
Member

[Triage - attendees 1 2 3]
@harshavamsi Thanks for filing this issue.

@msfroh
Copy link
Collaborator

msfroh commented Mar 29, 2024

@msfroh any heuristics you may be aware of? thank you

I have a few thoughts about this:

  1. I'm not convinced that a good heuristic exists for how many clauses is "too many" for some hardware. In particular, the cost of a clause really depends on the clause itself. What kind of clause is it? How many docs does the clause match? Which index structure is the clause hitting (postings, point tree, doc values)?
  2. Any estimate that we come up with may be wrong. We should allow users to adjust the value without needing to restart their node.
  3. In a perfect world, I would make it an index-level setting, since we can theoretically connect the index with the workload running on it. Unfortunately, while the setting was moved into IndexSearcher, it's a static variable (probably for backward compatibility?). Before Lucene 10 is released, maybe we can turn it into a member variable?

AFAIK this is not a cluster but node level setting

@reta -- do you know if there's any backward-compatibility issue with turning it into a dynamic cluster-level setting? I suppose it would break if different nodes have different values in their opensearch.yml file.

@reta
Copy link
Collaborator

reta commented Mar 30, 2024

@reta -- do you know if there's any backward-compatibility issue with turning it into a dynamic cluster-level setting? I suppose it would break if different nodes have different values in their opensearch.yml file.

Thanks @msfroh , at high level I don't see why this settings couldn't be dynamic and at a cluster level, but the evil is in details I think: the IndexSearcher::maxClauseCount is not supposed to be changed dynamically at runtime, it is not volatile nor designed to be managed safely in a presence of multiple threads. I was looking into original pull request as well [1] for the clues at it seems like it was the concern too [2] (with BQ but IndexSearcher inherited same implementation).

[1] elastic/elasticsearch#18341
[2] elastic/elasticsearch#18341 (comment)

@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.16.0 Issues and PRs related to version 2.16.0 labels Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Enhancement or improvement to existing feature or request Search:Resiliency v2.16.0 Issues and PRs related to version 2.16.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: Done
Status: Done
Development

No branches or pull requests

4 participants