[BUG] Field masking has inconsistent memory issues with certain queries #4031
Comments
Analysis
Based on these results, here are a couple of findings. Overall, I would recommend avoiding StringQueries when using masked fields, as they appear to use a disproportionate amount of memory compared to other search types. There are likely optimizations that could be added for some kinds of fields, such as building a cache of masked values to avoid the allocation tax, but it's unclear whether this would help in high-cardinality scenarios. The following are a couple of sources for this recommendation.
GC Impact
The attempts column increases if GC was called during a test run, as that makes the memory calculation likely to be incorrect. So when attempts is higher, more GCs were performed, which indicates higher JVM memory usage. Baseline performs best, Term is impacted slightly more often, and StringQuery seems to trip GC most frequently.
Masking Value Type
The complexity of the masking operation, masking a LONG vs. a STRING, did not appear to impact memory usage; however, there is a clear increase of 15-30% in memory usage when a field is masked in either the Baseline or Term scenarios. This overhead is semi-expected because the masking operations happen as the data is accessed.
Memory expense of
[Triage] Hi @peternied, thank you for the extremely detailed issue. You provided a lot of good details and we can take a couple of paths forward:
The first task is definitely actionable. The other two tasks can be investigated on an RFC or help-wanted basis, since it may not be clear how to correct this quickly.
@peternied I noticed that field masking is being applied on search requests where the request size is 0 and the only aggregations requested are count or cardinality. This may be an area that can be optimized to reduce the memory footprint of field masking. This PR on my fork contains an example: cwperks#24
Sample Documents
SearchRequest of size 0 with cardinality aggregation
For these types of queries, is applying the field masking necessary? It looks like extra overhead for a query that is not requesting the sensitive info in the first place. If the size is >0, or the request has aggregations other than cardinality and count (such as max or min, where raw data can be exposed), then the field masking should be applied.
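The rule proposed in this comment can be sketched as a small predicate. This is only an illustration of the decision logic, not the code from the linked PR: the class name `MaskingDecision`, the method `canSkipMasking`, and the representation of aggregations as a list of type names are all hypothetical, and the count aggregation is written as `value_count` on the assumption that this is the aggregation type meant.

```java
import java.util.List;
import java.util.Set;

public class MaskingDecision {
    // Aggregation types assumed (per the comment above) not to expose raw field values.
    // "value_count" is an assumption about what "count" refers to.
    private static final Set<String> SAFE_AGGREGATIONS = Set.of("cardinality", "value_count");

    /**
     * Returns true only when the request returns no hits (size == 0) and every
     * requested aggregation is in the safe set. Anything else keeps masking on.
     */
    public static boolean canSkipMasking(int size, List<String> aggregationTypes) {
        if (size > 0) {
            return false; // hits are returned, so masked fields could be exposed
        }
        for (String type : aggregationTypes) {
            if (!SAFE_AGGREGATIONS.contains(type)) {
                return false; // e.g. max or min can reveal raw values
            }
        }
        return true;
    }
}
```

Defaulting to "mask" for any aggregation type not explicitly listed keeps the sketch conservative: a new aggregation type would be masked until someone decides it is safe.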
What is the bug?
We've seen reports of heap memory usage spiking with field masking enabled. Given how masking is implemented at the leaf level, it's possible that certain types of queries cause the masked fields to be materialized even when they are not used.
How can one reproduce the bug?
Steps to reproduce the behavior:
./gradlew integrationTest --tests org.opensearch.security.MaskingTests
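The results report an attempts count because a garbage collection during a run makes the heap comparison unreliable. A minimal sketch of that kind of GC-aware measurement is shown below. It is not the code used by `MaskingTests`; the class `GcAwareMeasurement`, its `measure` method, and the retry policy are assumptions made for illustration. It relies only on the standard `java.lang.management` API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcAwareMeasurement {
    /** Outcome of a measurement: heap growth in bytes and how many attempts were used. */
    public static final class Result {
        public final long heapDeltaBytes;
        public final int attempts;

        Result(long heapDeltaBytes, int attempts) {
            this.heapDeltaBytes = heapDeltaBytes;
            this.attempts = attempts;
        }
    }

    // Sums collection counts across all collectors; a collector reports -1 when undefined.
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean bean : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = bean.getCollectionCount();
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Runs the work and records heap growth, retrying when a GC happened during
     * the run. Expects maxAttempts >= 1. If every attempt saw a GC, the last
     * (unreliable) delta is returned with attempts == maxAttempts.
     */
    public static Result measure(Runnable work, int maxAttempts) {
        long delta = 0;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long gcBefore = totalGcCount();
            long heapBefore = usedHeap();
            work.run();
            delta = usedHeap() - heapBefore;
            if (totalGcCount() == gcBefore) {
                return new Result(delta, attempt); // no GC observed, delta is usable
            }
        }
        return new Result(delta, maxAttempts);
    }
}
```

Under this scheme a higher attempts value means more runs were interrupted by GC, which is why it can be read as a rough signal of memory pressure.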
Baseline behavior
Baseline query with the following format:
Creating 3 Indices with 5000 Documents
Creating 3 Indices with 50000 Documents
Query with Aggregate Filter
Query with the following format:
Creating 3 Indices with 5000 Documents
Creating 3 Indices with 50000 Documents
Term Match Query:
Query with the following format:
Creating 3 Indices with 5000 Documents
Creating 3 Indices with 50000 Documents