Memory leak with high cardinality tag and MeterFilter meant to cap cardinality #4971
I'm still not sure what to do about this yet, but I wanted to get the issue open because this is a critical regression. We need to figure out something before 1.13 GA, or we may need to revert the performance enhancement that introduced this behavior, which would be rather unfortunate.
WDYT about making this cache pluggable via config? We could provide a default cache that caps the cache at 2x the number of registered meters, or any other strategy that would make an ideal default.
@lenin-jaganathan I usually prefer not to add more configuration, but I think in this case we may not have a better way forward. I like the general direction of what you mentioned.

A tangential question: should the cause of this (a filter like the one in the issue description) be something we discourage or eventually even disallow? It doesn't seem unreasonable absent the new implementation detail in MeterRegistry that results in this issue. That said, I think we have in some ways encouraged using a deny filter when reaching the max allowable tags, rather than the approach in the meter filter implementation here. There may be other use cases to consider as well.
When I said config, I didn't literally mean a property. What I was thinking of is more of a pluggable interface with a basic default implementation, which could be plugged to get the desired behavior based on needs.

I would probably disagree a bit on this. There are a lot of open source libraries out there using Micrometer, and there is always a chance they have high-cardinality dimensions, e.g. HTTP path.
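To make the "pluggable interface with a basic default implementation" idea concrete, here is a hypothetical sketch. The `PreFilterIdCache` interface and `LruPreFilterIdCache` class names are illustrative only, not Micrometer API; the default strategy shown is a simple LRU bound, whereas a real default could instead cap at 2x the number of registered meters as suggested above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical pluggable cache strategy for the pre-filter ID lookup.
// Names are illustrative, not Micrometer API.
interface PreFilterIdCache<K, V> {
    V get(K key);
    void put(K key, V value);
    int size();
}

// One possible default: an access-order LinkedHashMap that evicts the
// least-recently-used entry once the bound is exceeded.
class LruPreFilterIdCache<K, V> implements PreFilterIdCache<K, V> {
    private final Map<K, V> map;

    LruPreFilterIdCache(int maxEntries) {
        this.map = new LinkedHashMap<>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    public synchronized V get(K key) { return map.get(key); }
    public synchronized void put(K key, V value) { map.put(key, value); }
    public synchronized int size() { return map.size(); }
}
```

A user needing different behavior (for example, no cache at all, or a size tied to the registry) could plug in their own `PreFilterIdCache` implementation.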
Understood. If it is pluggable, it is exposed API that adds complexity users need to understand (as opposed to an implementation detail), which is the concern.
I understand why the example filter results in more useful metrics than denying meters, but metrics instrumentation with a high-cardinality tag is essentially a bug that should be fixed if at all possible. It already requires prior knowledge that such a filter needs to be configured for a high-cardinality tag on a meter. Such a filter should be a stop-gap solution, or for cases where the thing doing the configuration cannot ensure tags are limited to low cardinality (for example, Spring Boot configures a high-cardinality tag filter for HTTP client metrics because users may not use URI templates - see here).

With the most common case of HTTP client instrumentation, there should always be a way that templated paths can be tagged, but this requires users to use the HTTP client in a certain way that isn't (and can't be) enforced. Proper instrumentation should make it possible for the cardinality to be low as long as usage follows some pattern. It is then up to the end user (application code) to use it that way. If not, such MeterFilters will be needed, but they should be an indication that the code should be fixed, not a permanent "solution". The filter's primary purpose is to prevent an OOM error and signal a problem for the application owners to fix. Does that make sense? Am I missing something?

I like the idea of a pluggable strategy (other than the complexity it adds), but I do worry about getting the API for this right directly in GA with no pre-releases. I'm wondering if maybe the safest way forward is to disable the cache by default and offer some way to enable it, or possibly configure a strategy for it in an experimental way, so that it is clearly not GA. I'll try to experiment with this.
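For contrast with the replace-style filter from the issue description, the deny-style approach mentioned above (which Micrometer supports out of the box via `MeterFilter.maximumAllowableTags`) simply rejects meters once a tag's value budget is exhausted, losing those counts entirely. A hypothetical, dependency-free mimic of that policy, with Micrometer's `MeterFilter`/`Meter.Id` types replaced by plain strings, might look like:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical mimic of a deny-style cardinality cap: once more than
// maxValues distinct tag values have been seen, meters with new values
// are rejected outright (rather than rewritten to a dummy value).
class TagValueBudget {
    private final int maxValues;
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    TagValueBudget(int maxValues) {
        this.maxValues = maxValues;
    }

    /** Returns true if a meter with this tag value should be accepted. */
    boolean accept(String tagValue) {
        if (seen.contains(tagValue)) {
            return true; // already-known values stay accepted
        }
        if (seen.size() >= maxValues) {
            return false; // budget exhausted: deny new values
        }
        seen.add(tagValue);
        return true;
    }
}
```

The trade-off discussed in this thread is exactly this: denying prevents the OOM but drops measurements, while the replace-style filter keeps the request counts under a collapsed tag value.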
I agree with this.
After much thought, I eventually went to write a test for this, only to realize we don't have the memory leak we thought we would: meters are only added to the preFilterIdToMeterMap if they are not already in the meterMap. So in the case given, or in the general case of a filter that maps multiple IDs to the same ID, only the first one will be added to the cache. After that, other IDs that end up mapping to the same post-filter ID will not use the cache; all filters will be applied to them and the shared meter will be retrieved from the meterMap.

Incidentally, this seems like the right behavior here, without the need to make anything configurable. I think we get the performance gain in most cases like we want, and in cases like this where we don't get the performance boost, we avoid a memory leak by not adding to the cache. Here's the test I wrote:

```java
@Test
void differentPreFilterIdsMapToSameId_thenCacheIsBounded() {
    registry.config().meterFilter(MeterFilter.replaceTagValues("secret", s -> "redacted"));
    Counter c1 = registry.counter("counter", "secret", "value");
    Counter c2 = registry.counter("counter", "secret", "value2");

    assertThat(c1).isSameAs(c2);
    assertThat(registry.preFilterIdToMeterMap).hasSize(1);
    assertThat(registry.remove(c1)).isSameAs(c2);
    assertThat(registry.remove(c2)).isNull();
    assertThat(registry.getMeters()).isEmpty();
    assertThat(registry.preFilterIdToMeterMap).isEmpty();
}
```
More safely encapsulates the recently added preFilterIdToMeterMap and stalePreFilterIds fields by making only an unmodifiable view accessible, via package-private getters intended for testing only. Uses an unconventional name starting with an underscore (`_`) to differentiate these methods from supported API. This also adds a test case for what we worried would result in a memory leak; the test demonstrates it does not. See gh-4971
Yeah, I can see that now. We won't have a many-to-one mapping in the cache. Sorry for the confusion, and thanks for confirming it.
Retrospectively, thinking about what sort of memory implications this will have, consider the below scenario.
E.g:
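The filter snippet originally posted here was not captured in this thread. As a hypothetical stand-in, the core logic of the filter described below, stripped of Micrometer's `MeterFilter`/`Meter.Id` types for brevity, might look roughly like:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the described filter: once more than maxUris
// distinct URI tag values have been seen, every new URI is rewritten to a
// single dummy value, so request counts are kept but cardinality is capped.
class UriCappingFilter {
    private static final String DUMMY_URI = "OTHER";
    private final int maxUris;
    private final Set<String> seenUris = ConcurrentHashMap.newKeySet();

    UriCappingFilter(int maxUris) {
        this.maxUris = maxUris;
    }

    /** Maps an incoming URI tag value to the value that gets recorded. */
    String mapUriTag(String uri) {
        if (seenUris.contains(uri)) {
            return uri; // known URI: keep as-is
        }
        if (seenUris.size() >= maxUris) {
            return DUMMY_URI; // cap reached: collapse new URIs into one value
        }
        seenUris.add(uri);
        return uri;
    }
}
```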
The above meter filter adds a dummy URI once the number of URIs exceeds a limit (in this case 100). After 100 unique URIs are added, it maps every new ID to a single meter. But after this fix, while the mapping stays the same, we will keep adding the pre-filter IDs to the map, which will grow indefinitely.

If filters return a NOOP meter we are fine, since we don't store those anyway. But in the above scenario we are altering only the URI tag, so that we don't lose the count of requests while still keeping the tags to a reasonable count.
Originally posted by @lenin-jaganathan in #4857 (comment)