[BUG] Field masking has inconsistent memory issues with certain queries #4031

Open
peternied opened this issue Feb 8, 2024 · 3 comments
Labels
bug Something isn't working documentation For code documentation/ javadocs/ comments / readme etc.. help wanted Community contributions are especially encouraged for these issues. triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable.

Comments

@peternied
Member

What is the bug?
We've seen reports of heap memory usage spiking with field masking enabled. Because masking is implemented at the leaf level, it's possible that certain types of queries cause the masked fields to be materialized even when they are not used.

How can one reproduce the bug?
Steps to reproduce the behavior:

  • Check out this branch: main...peternied:masking-perf
  • Run ./gradlew integrationTest --tests org.opensearch.security.MaskingTests
  • Analyze the results

Baseline behavior

Baseline query with the following format:

final SearchSourceBuilder ssb = new SearchSourceBuilder();
ssb.size(0); // request no hits
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb); // pass the builder declared above

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 888,814 | 3,621,248 | 191,136 | 480,198 |
| reader | | 100 | 105 | 808,317 | 1,470,368 | 517,640 | 195,083 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 813,600 | 1,602,272 | 536,416 | 206,571 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 936,948 | 1,631,088 | 618,704 | 195,653 |
| reader | MASKING_RANDOM_LONG | 100 | 105 | 919,187 | 1,593,456 | 548,264 | 236,892 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 951,826 | 1,656,432 | 564,784 | 191,008 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 873,379 | 4,872,552 | 575,208 | 508,483 |
| reader | | 100 | 109 | 753,489 | 4,020,832 | 567,304 | 341,413 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 715,624 | 988,688 | 553,504 | 78,209 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 970,238 | 1,222,008 | 650,520 | 89,596 |
| reader | MASKING_RANDOM_LONG | 100 | 108 | 942,349 | 1,215,448 | 679,768 | 115,690 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 971,219 | 1,646,856 | 660,856 | 119,863 |

Query with Aggregate Filter

Query with the following format:

final SearchSourceBuilder ssb = new SearchSourceBuilder();
// Filters aggregation driven by a query string query
ssb.aggregation(AggregationBuilders.filters("my-filter", QueryBuilders.queryStringQuery("last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0); // request no hits
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb); // pass the builder declared above

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,069,939 | 4,905,008 | 288,144 | 679,704 |
| reader | | 100 | 105 | 877,725 | 1,604,048 | 562,032 | 215,923 |
| reader | ROLE_WITH_NO_MASKING | 100 | 106 | 898,739 | 1,860,632 | 354,032 | 260,686 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 106 | 2,441,865 | 3,504,944 | 2,040,768 | 235,363 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 2,500,135 | 3,215,712 | 1,984,096 | 247,712 |
| reader | MASKING_RANDOM_STRING | 100 | 106 | 2,414,665 | 3,330,960 | 2,075,848 | 232,634 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 113 | 1,265,168 | 2,698,248 | 993,528 | 301,134 |
| reader | | 100 | 112 | 1,308,585 | 5,408,936 | 851,672 | 579,648 |
| reader | ROLE_WITH_NO_MASKING | 100 | 109 | 1,021,555 | 1,504,480 | 798,400 | 136,735 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 114 | 2,420,922 | 7,183,000 | 1,828,456 | 660,701 |
| reader | MASKING_RANDOM_LONG | 100 | 112 | 2,225,070 | 2,815,568 | 1,994,176 | 142,038 |
| reader | MASKING_RANDOM_STRING | 100 | 111 | 2,169,297 | 2,673,376 | 1,964,144 | 128,983 |

Query with Term Match Filter

Query with the following format:

final SearchSourceBuilder ssb = new SearchSourceBuilder();
// Filters aggregation driven by a term query on the title field
ssb.aggregation(AggregationBuilders.filters("my-filter", QueryBuilders.termQuery("title", "last")));
ssb.aggregation(AggregationBuilders.count("counting").field("genre.keyword"));
ssb.aggregation(AggregationBuilders.avg("averaging").field("longId"));
ssb.size(0); // request no hits
final SearchRequest request = new SearchRequest(INDEX_NAME_PREFIX + "*");
request.source(ssb); // pass the builder declared above

Creating 3 Indices with 5000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 106 | 1,154,269 | 2,497,744 | 441,920 | 353,223 |
| reader | | 100 | 105 | 906,467 | 1,586,928 | 563,872 | 203,493 |
| reader | ROLE_WITH_NO_MASKING | 100 | 105 | 873,926 | 1,390,368 | 453,184 | 199,539 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 105 | 1,184,290 | 1,857,944 | 840,680 | 218,853 |
| reader | MASKING_RANDOM_LONG | 100 | 106 | 1,184,382 | 1,847,496 | 795,520 | 224,178 |
| reader | MASKING_RANDOM_STRING | 100 | 105 | 1,162,373 | 1,811,936 | 782,312 | 215,824 |

Creating 3 Indices with 50000 Documents

| Role | Condition | Count | Attempts | Avg Heap Used | Max Heap Used | Min Heap Used | Std Deviation |
|---|---|---|---|---|---|---|---|
| admin | | 100 | 110 | 966,775 | 1,264,696 | 825,536 | 79,341 |
| reader | | 100 | 109 | 971,058 | 1,393,544 | 779,968 | 95,795 |
| reader | ROLE_WITH_NO_MASKING | 100 | 108 | 964,071 | 1,392,504 | 765,104 | 90,630 |
| reader | MASKING_LOW_REPEAT_VALUE | 100 | 108 | 1,249,295 | 1,707,896 | 1,028,104 | 120,058 |
| reader | MASKING_RANDOM_LONG | 100 | 109 | 1,225,250 | 1,717,792 | 982,744 | 111,090 |
| reader | MASKING_RANDOM_STRING | 100 | 108 | 1,227,667 | 1,667,544 | 991,424 | 116,841 |

@peternied peternied added bug Something isn't working untriaged Require the attention of the repository maintainers and may need to be prioritized labels Feb 8, 2024
@peternied
Member Author

Analysis

Based on these results, here are a couple of findings. Overall, I would recommend avoiding StringQueries when using masked fields, as they appear to use a disproportionate amount of memory compared to other search types. There are likely optimizations that could be added for some kinds of fields, such as building a cache of masked values to avoid the allocation tax, but it's unclear if this would be helpful in high-cardinality scenarios.
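
To make the caching idea concrete, here is a minimal sketch of a bounded LRU cache of masked values. It is not how the security plugin works: the `MaskedValueCache` class is hypothetical, SHA-256 is only a stand-in for whatever masking function is configured, and the class is not thread-safe.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the security plugin: memoizes masked
// values in a bounded LRU map so a repeated field value is hashed only once.
final class MaskedValueCache {
    private final Map<String, String> cache;
    private long misses = 0;

    MaskedValueCache(final int maxEntries) {
        // accessOrder=true makes iteration order least-recently-used first.
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries; // evict once the bound is exceeded
            }
        };
    }

    String mask(String raw) {
        String cached = cache.get(raw);
        if (cached != null) {
            return cached; // cache hit: no new hash computation
        }
        misses++;
        String masked = sha256Hex(raw);
        cache.put(raw, masked);
        return masked;
    }

    long misses() { return misses; }

    int size() { return cache.size(); }

    // SHA-256 is an assumption used for illustration only.
    private static String sha256Hex(String raw) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(raw.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

With a low-repeat field most lookups would be hits. With random values almost every lookup is a miss followed by an eviction, which is why the benefit in high-cardinality scenarios is uncertain.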

The following observations support the recommendation to avoid StringQueries:

GC Impact

The attempts column increases if GC was called during a test run, as that makes the memory calculation likely to be incorrect. So a higher attempts value means more GCs were performed, which indicates higher JVM memory usage. Baseline performs best, Term is impacted slightly more often, and StringQuery seems to trip GC most frequently.
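
The retry-on-GC idea behind the attempts column could be sketched as follows. This is an assumption about the general approach, not the code on the masking-perf branch: `HeapSampler` and its methods are hypothetical names, and the GC counter is injectable only so the retry logic can be exercised deterministically.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.function.LongSupplier;

// Hypothetical harness, not the code on the masking-perf branch.
final class HeapSampler {
    /** One accepted measurement and how many tries it took. */
    static final class Sample {
        final long heapDeltaBytes;
        final int attempts;

        Sample(long heapDeltaBytes, int attempts) {
            this.heapDeltaBytes = heapDeltaBytes;
            this.attempts = attempts;
        }
    }

    /** Sum of collection counts across all collectors in this JVM. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when undefined
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Runs the action, discarding any attempt during which a GC occurred. */
    static Sample measure(Runnable action, LongSupplier gcCounter, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            long gcBefore = gcCounter.getAsLong();
            long heapBefore = usedHeapBytes();
            action.run();
            long heapAfter = usedHeapBytes();
            if (gcCounter.getAsLong() == gcBefore) {
                return new Sample(heapAfter - heapBefore, attempt);
            }
            // A collection ran mid-measurement, so the delta is unreliable: retry.
        }
        throw new IllegalStateException("GC occurred during every attempt");
    }

    public static void main(String[] args) {
        Sample s = measure(() -> {
            byte[] scratch = new byte[1 << 20];
            scratch[0] = 1;
        }, HeapSampler::totalGcCount, 10);
        System.out.println(s.heapDeltaBytes + " bytes after " + s.attempts + " attempt(s)");
    }
}
```

Under this scheme, the number of attempts above the requested count is a rough proxy for how often the workload triggered a collection.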

Masking Value Type

The complexity of the masking operation (masking a LONG vs. a STRING) did not appear to impact the usage; however, there is a clear increase of roughly 15-30% in memory usage when a field is masked in either the Baseline or Term scenarios. This overhead is semi-expected because the masking operations happen as the data is accessed.

Memory expense of StringQueries

When comparing Term queries with StringQueries, there is a 2x increase in memory usage when masked fields are involved. This may be additional rework, as it doesn't align with the behavior pattern seen with non-masked fields, which are generally more memory expensive.
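
As a worked check of the figures quoted in this analysis, the sketch below recomputes the masking overhead and the StringQuery-to-Term ratio from the average heap values reported for the reader role with 5000 documents. The `MaskingOverhead` class and its method names are hypothetical; only the numbers come from the tables in this issue.

```java
// Hypothetical helper for reading the result tables; only the input numbers
// are taken from this issue.
final class MaskingOverhead {
    /** Relative increase of the masked average over the unmasked one, in percent. */
    static double overheadPercent(long unmaskedAvg, long maskedAvg) {
        return 100.0 * (maskedAvg - unmaskedAvg) / unmaskedAvg;
    }

    static double ratio(long numerator, long denominator) {
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        // Avg Heap Used, reader role, 3 indices with 5000 documents.
        long baselineNoMasking = 813_600L;   // Baseline, ROLE_WITH_NO_MASKING
        long baselineMasked = 936_948L;      // Baseline, MASKING_LOW_REPEAT_VALUE
        long termMasked = 1_184_290L;        // Term, MASKING_LOW_REPEAT_VALUE
        long stringQueryMasked = 2_441_865L; // StringQuery, MASKING_LOW_REPEAT_VALUE

        System.out.printf("Baseline masking overhead: %.1f%%%n",
                overheadPercent(baselineNoMasking, baselineMasked));
        System.out.printf("StringQuery vs Term (masked): %.2fx%n",
                ratio(stringQueryMasked, termMasked));
    }
}
```

For these inputs the overhead comes out at roughly 15% and the ratio at a little over 2x, consistent with the ranges described above.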

@scrawfor99
Collaborator

[Triage] Hi @peternied, thank you for the extremely detailed issue. You provided a lot of good details and we can take a couple of paths forward:

  • update documentation
  • look for bugs causing overhead
  • improve resource utilization

The first task is definitely actionable. The other two can be investigated on an RFC or help-wanted basis, since it may not be clear how to correct this quickly.

@scrawfor99 scrawfor99 added help wanted Community contributions are especially encouraged for these issues. triaged Issues labeled as 'Triaged' have been reviewed and are deemed actionable. documentation For code documentation/ javadocs/ comments / readme etc.. and removed untriaged Require the attention of the repository maintainers and may need to be prioritized labels Feb 12, 2024
@cwperks
Member

cwperks commented May 23, 2024

@peternied I noticed that field masking is being applied on search requests where the request size is 0 and the only aggregations requested are count or cardinality. This may be an area that can be optimized to reduce the memory footprint of field masking.

This PR on my fork contains an example: cwperks#24 #4362

  1. Create an index and index some documents
Sample Documents
{
    "title": "Magnum Opus",
    "artist": "First artist",
    "lyrics": "Very deep subject",
    "stars": 1,
    "genre": "rock"
},
{
    "title": "Song 1+1",
    "artist": "String",
    "lyrics": "Once upon a time",
    "stars": 2,
    "genre": "blues"
},
{
    "title": "Next song",
    "artist": "Twins",
    "lyrics": "giant nonsense",
    "stars": 3,
    "genre": "jazz"
},
{
    "title": "Poison",
    "artist": "No!",
    "lyrics": "Much too much",
    "stars": 4,
    "genre": "rock"
},
{
    "title": "ABC",
    "artist": "First artist",
    "lyrics": "abcdefghijklmnopqrstuvwxyz",
    "stars": 7,
    "genre": "rock"
}
  2. Assign a user to be able to search the index, but anonymize the artist field.

  3. Run a Search Request of size 0 with a cardinality aggregation (uniqueness count)

SearchRequest of size 0 with cardinality aggregation
POST /songs/_search
{
  "size": 0,
  "aggs": {
    "unique_artists": {
      "cardinality": {
        "field": "artist.keyword"
      }
    }
  }
}

For these types of queries, is applying the field masking necessary? It looks like extra overhead for a query that is not requesting the sensitive info in the first place. If the size is greater than 0, or the request has aggregations other than cardinality and count (such as max or min, where raw data can be exposed), then the field masking should still be applied.
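
The condition proposed above could be expressed as a small predicate. This is only a sketch of the rule as stated in this comment, not the change in the linked PR: `MaskingSkipCheck`, `AggType`, and `canSkipMasking` are hypothetical names, and a real implementation would inspect the actual search request rather than a list of enum values.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the proposed rule; not the plugin's implementation.
final class MaskingSkipCheck {
    /** Simplified stand-in for the aggregation types mentioned in this issue. */
    enum AggType { COUNT, CARDINALITY, AVG, MIN, MAX }

    // Aggregations assumed not to expose raw field values.
    private static final Set<AggType> NON_REVEALING =
            EnumSet.of(AggType.COUNT, AggType.CARDINALITY);

    /**
     * Returns true only when no hits are requested and every aggregation
     * is a count or cardinality aggregation.
     */
    static boolean canSkipMasking(int size, List<AggType> aggregations) {
        if (size > 0) {
            return false; // hits may carry the masked field's values
        }
        return NON_REVEALING.containsAll(aggregations);
    }
}
```

For the sample request above (size 0, a single cardinality aggregation) the predicate returns true. Adding a max or min aggregation, or requesting any hits, makes it return false.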
