
Memory leak with high cardinality tag and MeterFilter meant to cap cardinality #4971

Closed
shakuzen opened this issue Apr 18, 2024 · 8 comments
Labels: invalid (An issue that we don't feel is valid), module: micrometer-core (An issue that is related to our core module)

Comments

@shakuzen (Member) commented Apr 18, 2024

Retrospectively thinking about what sort of memory implications this will have, consider the scenario below:

  1. A meter with dynamic tag values (like a uri tag that contains a userId) and a MeterFilter that imposes some limit?
    E.g.:
new MeterFilter() {
    private final Set<String> observedTagValues = ConcurrentHashMap.newKeySet();

    @Override
    public Meter.Id map(Meter.Id id) {
        String value = id.getTag("uri");
        if (value != null) {
            if (!observedTagValues.contains(value)) {
                if (observedTagValues.size() >= 100) {
                    // limit reached: collapse any new uri value into a single masked tag value
                    Tag maskedTag = Tag.of("uri", "MASKED");
                    return id.withTag(maskedTag);
                }
                observedTagValues.add(value);
            }
        }
        return id;
    }
}

The above MeterFilter adds a dummy URI once the number of URIs exceeds a limit (in this case 100). Once 100 unique URIs are observed, the filter maps all new IDs to a single meter. But after this fix, the mapping will be the same, yet we will keep adding the pre-filter IDs to the map, which will grow indefinitely.

If filters return a NOOP meter we are fine, since we don't store them anyway. But in the above scenario we are altering only the URI tag, so that we don't lose the count of requests while still keeping the number of tag values reasonable.

Originally posted by @lenin-jaganathan in #4857 (comment)

@shakuzen added the regression (A regression from a previous release) and module: micrometer-core (An issue that is related to our core module) labels Apr 18, 2024
@shakuzen added this to the 1.13.0 milestone Apr 18, 2024
@shakuzen (Member Author)

I'm still not sure what to do about this yet, but I wanted to get the issue open because this is a critical regression. We need to figure out something before 1.13 GA, or we may need to revert the performance enhancement that introduced this behavior, which would be rather unfortunate.

@lenin-jaganathan (Contributor)

WDYT about making this cache pluggable via config? We could provide a default cache that caps itself at 2x the number of registered meters, or any other strategy that makes for an ideal default.
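To make that concrete, here is a purely hypothetical sketch of what such a pluggable cache could look like. None of these types or method names exist in Micrometer; they are invented only to illustrate the idea:

import io.micrometer.core.instrument.Meter;

// Hypothetical API sketch only -- PreFilterIdCache does not exist in Micrometer.
public interface PreFilterIdCache {

    Meter get(Meter.Id preFilterId);

    // Implementations decide whether to actually store the mapping. A default
    // implementation could, for example, stop caching once the cache holds more
    // than 2x the number of currently registered meters.
    void put(Meter.Id preFilterId, Meter meter, int registeredMeterCount);
}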

@shakuzen (Member Author)

@lenin-jaganathan I usually prefer to not make more configuration, but I think in this case we may not have a better way forward. I like the general direction of what you mentioned. It somewhat reminds me of the HighCardinalityTagsDetector we already have. I'll think on it some and discuss internally. A challenge is how to describe to users what is configurable, why, and what the tradeoffs are of different configurations. I think (but unfortunately we can't really know) that most users should be unaffected by this issue, which means most should not need to touch the default or know about this configuration. However, for those that do need to know about it and change the default, how do they find out they need to change it? We can log when we reach the cap (or some point before reaching the cap perhaps) but we'll need to figure out what the log message should say that will be clear to users.

A tangential question is: should the cause of this (a filter like the one in the issue description) be something we discourage or eventually even disallow? It doesn't seem unreasonable without the new implementation detail in MeterRegistry that results in this issue. That said, we have in some way, I think, encouraged using a deny filter when reaching max allowable tags rather than the approach in the meter filter implementation here. There may be other use cases to consider as well.
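For reference, a minimal sketch of that deny-on-max approach using MeterFilter's built-in helpers (the meter name prefix and the limit of 100 are illustrative only):

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.config.MeterFilter;

class CardinalityCapConfig {

    // Deny any further http.server.requests meters once the uri tag has seen 100 distinct
    // values, instead of masking the tag value as the filter in the issue description does.
    static void apply(MeterRegistry registry) {
        registry.config().meterFilter(
                MeterFilter.maximumAllowableTags("http.server.requests", "uri", 100, MeterFilter.deny()));
    }
}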

@lenin-jaganathan (Contributor)

I usually prefer to not make more configuration

When I said config, I didn't literally mean a property. What I was thinking of is more of a pluggable interface with a sensible basic implementation as the default, which can be swapped out for the desired behavior based on needs.

should the cause of this (a filter like the one in the issue description) be something we discourage or eventually even disallow?

I probably would disagree a bit on this. There are a lot of open source libraries out there using Micrometer, and there is always a chance that they emit high-cardinality dimensions, e.g. an HTTP path.
Disallowing a metric would mean the metric is no longer usable, because its count cannot be relied upon; it represents only a sample. Flattening out the high-cardinality dimension instead loses a dimension, sure, but we can still answer what the total count of things is (or the distribution, in the case of timers/summaries).

@shakuzen (Member Author)

When I said config, I didn't literally mean a property.

I understood. If it is pluggable, it is exposed API that adds complexity users need to understand (as opposed to an implementation detail), which is the concern.

Disallowing a metric would mean the metric is no longer usable, because its count cannot be relied upon; it represents only a sample.

I understand why the example filter results in more useful metrics than denying meters, but metrics instrumentation with a high cardinality tag is essentially a bug that should be fixed if at all possible. It already requires prior knowledge that such a filter needs to be configured for a high cardinality tag on a meter. Such a filter should be a stop-gap solution, or reserved for cases where the code doing the configuration cannot ensure tags are limited to low cardinality (for example, Spring Boot configures a high cardinality tag filter for HTTP client metrics because users may not use uri templates - see here). In the most common case of HTTP client instrumentation, there should always be a way that templated paths can be tagged, but this requires users to use the HTTP client in a certain way that isn't (and can't be) enforced. Proper instrumentation should make it possible for the cardinality to be low as long as usage follows some pattern. Then it is up to the end user (application code) to use it that way. If not, such MeterFilters will be needed, but they should be an indication that the code should be fixed, not a permanent "solution". The filter's primary purpose is to prevent an OOM error and indicate a problem for the application owners to fix.
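To illustrate that "certain way", a sketch assuming Spring's RestTemplate as the HTTP client (UserClient and userId are placeholder names):

import org.springframework.web.client.RestTemplate;

class UserClient {

    private final RestTemplate restTemplate = new RestTemplate();

    String fetchUser(String userId) {
        // Templated path: instrumentation can tag uri="/users/{id}", keeping cardinality low.
        return restTemplate.getForObject("/users/{id}", String.class, userId);
    }

    String fetchUserExpanded(String userId) {
        // Pre-expanded path: the instrumentation only ever sees the concrete URL, so every
        // distinct userId produces a new uri tag value -- the high-cardinality case above.
        return restTemplate.getForObject("/users/" + userId, String.class);
    }
}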

Does that make sense? Am I missing something?

I like the idea of a pluggable strategy (other than the complexity it adds), but I do worry about getting the API for this right directly in GA with no pre-releases. I'm wondering if maybe the safest way forward is to disable the cache by default and offer some way to enable it, or possibly configure a strategy for it, in an experimental way so that it is clear it is not GA API. I'll try to experiment with this.

@lenin-jaganathan (Contributor)

I do worry about getting the API for this right directly in GA with no pre-releases

I agree with this.

@shakuzen (Member Author) commented May 1, 2024

After much thought, I eventually went to write a test for this, only to realize we don't have the memory leak we thought we would, because meters are only added to the preFilterIdToMeterMap if they are not already in the meterMap. So in the case given, or the general case of a filter that maps multiple IDs to the same ID, only the first pre-filter ID will be added to the cache. After that, other IDs that end up mapping to the same post-filter ID will not use the cache; all filters will be applied to them and the shared meter will be retrieved from the meterMap. Incidentally, this seems like the right behavior here, without the need to make anything configurable. I think we get the performance gain in most cases like we want, and in cases like this where we don't get the performance boost, we avoid a memory leak by not adding to the cache. Here's the test I wrote:

@Test
void differentPreFilterIdsMapToSameId_thenCacheIsBounded() {
    registry.config().meterFilter(MeterFilter.replaceTagValues("secret", s -> "redacted"));
    Counter c1 = registry.counter("counter", "secret", "value");
    Counter c2 = registry.counter("counter", "secret", "value2");

    assertThat(c1).isSameAs(c2);
    assertThat(registry.preFilterIdToMeterMap).hasSize(1);
    assertThat(registry.remove(c1)).isSameAs(c2);
    assertThat(registry.remove(c2)).isNull();
    assertThat(registry.getMeters()).isEmpty();
    assertThat(registry.preFilterIdToMeterMap).isEmpty();
}
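And roughly why the cache stays bounded - a pseudocode sketch of the lookup order described above, not the actual MeterRegistry code (applyFilters and register are stand-in names for internal steps):

// Pseudocode sketch only.
Meter getOrCreateMeter(Meter.Id preFilterId) {
    Meter cached = preFilterIdToMeterMap.get(preFilterId); // fast path: skips the filters
    if (cached != null) {
        return cached;
    }

    Meter.Id mappedId = applyFilters(preFilterId); // e.g. secret=value2 -> secret=redacted
    Meter existing = meterMap.get(mappedId);
    if (existing != null) {
        // The mapped ID already has a meter: reuse it, but do NOT cache this pre-filter ID.
        // Many pre-filter IDs mapping to one post-filter ID therefore cannot grow the cache.
        return existing;
    }

    Meter created = register(mappedId);
    meterMap.put(mappedId, created);
    preFilterIdToMeterMap.put(preFilterId, created); // cached only when the meter is first created
    return created;
}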

shakuzen added a commit that referenced this issue May 2, 2024
More safely encapsulates the recently added preFilterIdToMeterMap and stalePreFilterIds fields by making only an unmodifiable view accessible via package-private getters for testing purposes only. Uses an unconventional name starting with an underscore `_` to differentiate these methods from supported API.

This also adds a test case for what we worried would result in a memory leak but the test demonstrates it does not. See gh-4971
@lenin-jaganathan (Contributor)

Yeah, I can see that now. We won't have a many-to-one mapping in the cache. Sorry for the confusion, and thanks for confirming it.

@shakuzen removed this from the 1.13.0 milestone May 2, 2024
@shakuzen added the invalid (An issue that we don't feel is valid) label and removed the regression label May 2, 2024
@shakuzen closed this as not planned May 2, 2024