
Update AggregatorStore to reclaim unused MetricPoints for Delta aggregation temporality #4486

Conversation

utpilla
Contributor

@utpilla utpilla commented May 13, 2023

Addresses #2360

This PR shows a possible approach to reclaim MetricPoints for Delta aggregation temporality.

Current behavior:

For any metric, once the SDK has encountered a given number (default: 2000) of unique dimension combinations, it drops any measurement with a previously unseen dimension combination.

With this PR, the AggregatorStore is updated to start reclaiming unused MetricPoints once it has encountered a particular number of unique dimension combinations. This number is set to 75% of the max MetricPoints allowed. In the default case, this means the AggregatorStore does not begin reclaiming MetricPoints until it has seen 1500 (= 75% * 2000) unique dimension combinations. Once it hits 1500, it changes its behavior and begins reclaiming unused MetricPoints. I have set this threshold because there is some cost associated with reclaiming MetricPoints, and we should avoid making every user pay it.

What changes when AggregatorStore starts to reclaim unused MetricPoints?

  • On the Collect call, the Snapshot method checks for unused MetricPoints, that is, any MetricPoint for which there is no collection pending and no other thread waiting to update it. If it finds such a MetricPoint, it adds that MetricPoint's index to a queue (a simplified sketch follows this list).
  • Any update thread that cannot find a MetricPoint for its dimension combination now takes an available MetricPoint index from the queue instead of simply incrementing the array index.
  • When exporting metrics, the MetricPoints would not necessarily be provided in a sorted order, that is, the order in which they were emitted.
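
Roughly, the queue hand-off looks like the following. This is a minimal, self-contained sketch; the class and method names are hypothetical and heavily simplified, not the SDK's actual internals (the PR uses a Queue<int> under a lock, which the sketch mirrors).

using System.Collections.Generic;
using System.Threading;

internal sealed class ReclaimSketch
{
    // One slot per MetricPoint: a reference count > 0 means an update thread is
    // currently writing to that point; collectPending means the point still has
    // data waiting to be exported.
    private readonly int[] referenceCounts;
    private readonly bool[] collectPending;

    // Indices of reclaimed MetricPoints that update threads can reuse.
    private readonly Queue<int> availableMetricPoints = new();
    private readonly object queueLock = new();

    public ReclaimSketch(int maxMetricPoints)
    {
        this.referenceCounts = new int[maxMetricPoints];
        this.collectPending = new bool[maxMetricPoints];
    }

    // Collect path: any MetricPoint with no pending collection and no thread
    // waiting to update it is treated as unused, so its index goes back on the queue.
    public void SnapshotAndReclaim()
    {
        for (int i = 0; i < this.referenceCounts.Length; i++)
        {
            if (!this.collectPending[i] && Volatile.Read(ref this.referenceCounts[i]) == 0)
            {
                lock (this.queueLock)
                {
                    this.availableMetricPoints.Enqueue(i);
                }
            }
        }
    }

    // Update path: a thread that cannot find a MetricPoint for its tag
    // combination takes a reclaimed index from the queue instead of simply
    // incrementing the next-free-slot counter.
    public bool TryTakeAvailableIndex(out int index)
    {
        lock (this.queueLock)
        {
            return this.availableMetricPoints.TryDequeue(out index);
        }
    }
}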

Changes

  • Added a Func named lookupAggregatorStore to allow for different behavior for Delta and Cumulative aggregation
  • Created separate lookup dictionaries for Delta and Cumulative aggregation
  • Added a Queue<int> named availableMetricPoints - This holds the indices of reclaimed MetricPoints. In the AggregatorStore ctor, it's initialized with the remaining 25% of the indices after the threshold.
  • There are two new fields on the MetricPoint struct:
    • ReferenceCount - This is used by the Snapshot method to determine if a MetricPoint is in use by an update thread. Update threads increment the ReferenceCount of the MetricPoint before the update and decrement it after the update (a sketch of this handshake follows the list).
    • LookupData - This is the lookup dictionary value type for Delta aggregation. It helps update threads determine whether the MetricPoint they are about to update has already been reclaimed by the Snapshot thread or by some other update thread for a different dimension combination.
  • Added an internal field named DroppedCount - This is used by the unit tests to verify that no measurements were dropped when testing the MetricPoint reclaim feature.
  • Appropriate CHANGELOG.md files updated for non-trivial changes
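
A rough sketch of the ReferenceCount handshake mentioned above (hypothetical names; the real PR also consults LookupData to detect that a point was reclaimed between the dictionary lookup and the increment, which is omitted here):

using System.Threading;

internal sealed class MetricPointSketch
{
    // > 0 while at least one update thread is writing to this point; the
    // Snapshot thread only reclaims a point whose count is zero.
    internal int ReferenceCount;

    private long runningSum;

    // Update path: mark the point as in-use, apply the measurement, release it.
    public void Update(long value)
    {
        Interlocked.Increment(ref this.ReferenceCount);
        Interlocked.Add(ref this.runningSum, value);
        Interlocked.Decrement(ref this.ReferenceCount);
    }

    // Snapshot path: a point with nothing pending for collection and no
    // in-flight update is a candidate for reclamation.
    public bool IsUnused(bool collectPending)
        => !collectPending && Volatile.Read(ref this.ReferenceCount) == 0;
}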

@utpilla
Contributor Author

utpilla commented May 13, 2023

Stress Tests

For max MetricPoints = 1000, there is a drop of ~4.6% in the total number of loops executed over roughly 5 minutes:

main branch: Loops: 7,809,957,791, Loops/Second: 23,745,565, CPU Cycles/Loop: 878, RunwayTime (Seconds): 315
This PR: Loops: 7,448,850,671, Loops/Second: 22,608,029, CPU Cycles/Loop: 904, RunwayTime (Seconds): 315

Replace Program.cs with the code below:

using System.Diagnostics.Metrics;
using System.Runtime.CompilerServices;
using OpenTelemetry.Metrics;

namespace OpenTelemetry.Tests.Stress;

public partial class Program
{
    private const int ArraySize = 10;

    private static readonly Meter TestMeter = new(Utils.GetCurrentMethodName());
    private static readonly Counter<long> TestCounter = TestMeter.CreateCounter<long>("TestCounter");
    private static readonly string[] DimensionValues = new string[ArraySize];
    private static readonly ThreadLocal<Random> ThreadLocalRandom = new(() => new Random());

    public static void Main()
    {
        for (int i = 0; i < ArraySize; i++)
        {
            DimensionValues[i] = $"DimValue{i}";
        }

        using var exporter = new CustomExporter();
        using var metricReader = new PeriodicExportingMetricReader(exporter, exportIntervalMilliseconds: 10)
        {
            TemporalityPreference = MetricReaderTemporalityPreference.Delta,
        };

        using var meterProvider = Sdk.CreateMeterProviderBuilder()
            .AddMeter(TestMeter.Name)
            .AddReader(metricReader)
            .Build();

        Stress(prometheusPort: 9464);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    protected static void Run()
    {
        var random = ThreadLocalRandom.Value;
        TestCounter.Add(
            100,
            new("DimName1", DimensionValues[random.Next(0, ArraySize)]),
            new("DimName2", DimensionValues[random.Next(0, ArraySize)]),
            new("DimName3", DimensionValues[random.Next(0, ArraySize)]));
    }

    private class CustomExporter : BaseExporter<Metric>
    {
        public long Sum = 0;
        public string DimensionKey;
        public object DimensionValue;

        public CustomExporter()
        {
        }

        public override ExportResult Export(in Batch<Metric> batch)
        {
            foreach (var metric in batch)
            {
                foreach (ref readonly var metricPoint in metric.GetMetricPoints())
                {
                    foreach (var tag in metricPoint.Tags)
                    {
                        this.DimensionKey = tag.Key;
                        this.DimensionValue = tag.Value;
                    }

                    if (metric.MetricType.IsSum())
                    {
                        this.Sum += metricPoint.GetSumLong();
                    }
                }
            }

            return ExportResult.Success;
        }
    }
}

@github-actions
Contributor

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale label May 24, 2023
@github-actions
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this May 31, 2023
@utpilla utpilla reopened this Sep 1, 2023
@codecov

codecov bot commented Sep 1, 2023

Codecov Report

Merging #4486 (380b159) into main (535c819) will decrease coverage by 0.27%.
Report is 1 commit behind head on main.
The diff coverage is 65.89%.


@@            Coverage Diff             @@
##             main    #4486      +/-   ##
==========================================
- Coverage   83.19%   82.93%   -0.27%     
==========================================
  Files         293      294       +1     
  Lines       11984    12193     +209     
==========================================
+ Hits         9970    10112     +142     
- Misses       2014     2081      +67     
Flag Coverage Δ
unittests 82.93% <65.89%> (-0.27%) ⬇️

Flags with carried forward coverage won't be shown.

Files Changed Coverage
src/OpenTelemetry/Metrics/AggregatorStore.cs 63.50%
src/OpenTelemetry/Metrics/MetricPoint.cs 90.90%
src/OpenTelemetry/Metrics/LookupData.cs 100.00%
src/OpenTelemetry/Metrics/Tags.cs 100.00%

@github-actions github-actions bot removed the Stale label Sep 1, 2023
@github-actions
Contributor

github-actions bot commented Sep 9, 2023

This PR was marked stale due to lack of activity and will be closed in 7 days. Commenting or Pushing will instruct the bot to automatically remove the label. This bot runs once per day.

@github-actions github-actions bot added the Stale label Sep 9, 2023
@utpilla utpilla marked this pull request as ready for review September 12, 2023 03:02
@utpilla utpilla requested a review from a team as a code owner September 12, 2023 03:02
@github-actions github-actions bot removed the Stale label Sep 12, 2023
@vishweshbankwar
Member

vishweshbankwar commented Sep 18, 2023

@utpilla - Could you please define what "limit of active points" means here? It is very specific to how the reclaiming is proposed here.

lock (this.tagsToMetricPointIndexDictionaryDelta!)
{
LookupData? dictionaryValue;
if (lookupData.SortedTags != Tags.EmptyTags)
Member

nit to consider... this might read a little better with a property: if (!lookupData.GivenEqualsSorted).

If this were the case I think EmptyTags could be moved to the LookupData class and made private.

internal class LookupData {
    private static readonly Tags EmptyTags = new(Array.Empty<KeyValuePair<string, object?>>());

    public LookupData(int index) { ... }
    public LookupData(int index, Tags tags) { ... }
    public LookupData(int index, Tags sorted, Tags given) { ... }

    public bool GivenEqualsSorted => SortedTags == EmptyTags;
}

utpilla and others added 2 commits September 18, 2023 16:57
Co-authored-by: Alan West <3676547+alanwest@users.noreply.github.com>
Co-authored-by: Alan West <3676547+alanwest@users.noreply.github.com>
@utpilla
Contributor Author

utpilla commented Sep 20, 2023

With this PR, the AggregatorStore is updated to start reclaiming unused MetricPoints once it has encountered a particular number of unique dimension combinations. This number is set to 75% of the max MetricPoints allowed. In the default case, this means the AggregatorStore does not begin reclaiming MetricPoints until it has seen 1500 (= 75% * 2000) unique dimension combinations. Once it hits 1500, it changes its behavior and begins reclaiming unused MetricPoints. I have set this threshold because there is some cost associated with reclaiming MetricPoints, and we should avoid making every user pay it.

The MetricPoint reclaim behavior makes some fundamental changes to how the SDK exports metrics. It might surprise some users.

For example, @vishweshbankwar pointed out a specific case where the SDK has exported a particular MetricPoint (k1,v1) at time T1. After a while, this MetricPoint gets reclaimed. Now if their application produces a totally new set of unique metric points and consistently updates them, they might run into a situation where there is no MetricPoint available when a thread later tries to update the MetricPoint (k1,v1) at time T2 and the measurement gets dropped. When consuming the metrics, their dashboard would show them that no measurements were recorded for the MetricPoint (k1,v1) at time T2, which is not correct as it was actually dropped.

I would be in favor of making the reclaim behavior something the user opts in to, so that they are aware of what to expect, instead of simply making it the default behavior for Delta aggregation temporality.

I also spoke to @alanwest, who brought up a good point: check how the Java SIG sets the default behavior and assess whether it would make sense to be consistent with them.

I think for now it's okay to have this behavior (even for the 1.7.0-alpha release). But before the 1.7.0 stable release, we should follow up on what the desired default behavior should be.

@Yun-Ting
Contributor

For example, @vishweshbankwar pointed out a specific case where the SDK has exported a particular MetricPoint (k1,v1) at time T1. After a while, this MetricPoint gets reclaimed. Now if their application produces a totally new set of unique metric points and consistently updates them, they might run into a situation where there is no MetricPoint available when a thread later tries to update the MetricPoint (k1,v1) at time T2 and the measurement gets dropped. When consuming the metrics, their dashboard would show them that no measurements were recorded for the MetricPoint (k1,v1) at time T2, which is not correct as it was actually dropped.

I'm curious what the behavioral difference would be from the client side between the two cases below:

  1. The client's cardinality reaches 2000 and new distinct tags start getting dropped, and
  2. (with this reclaim feature) the cardinality reaches the 75% * 2000 limit and the MetricPoint with the lowest reference count starts getting dropped?

It seems that Delta temporality is designed to unburden the client from keeping high-cardinality state?

@reyang
Member

reyang commented Sep 22, 2023

For example, @vishweshbankwar pointed out a specific case where the SDK has exported a particular MetricPoint (k1,v1) at time T1. After a while, this MetricPoint gets reclaimed. Now if their application produces a totally new set of unique metric points and consistently updates them, they might run into a situation where there is no MetricPoint available when a thread later tries to update the MetricPoint (k1,v1) at time T2 and the measurement gets dropped. When consuming the metrics, their dashboard would show them that no measurements were recorded for the MetricPoint (k1,v1) at time T2, which is not correct as it was actually dropped.

I don't feel we should be too worried about this case. Folks using DELTA aggregation temporality should expect:

  1. If the SDK is configured to keep memory, the behavior will be the same as CUMULATIVE - new metric streams get dropped or collapsed after reaching the cardinality cap.
  2. If the SDK is configured to reclaim memory, the behavior will be essentially the same as if we're doing CUMULATIVE for each collection cycle, and the start time is set to the wall clock after the collection is done.
  3. In both cases, the user SHOULD ensure that cardinality is planned and properly configured, so they can get accurate data (a configuration sketch follows this list). Otherwise, they won't get accurate data - the SDK might still try its best effort (e.g. the total count might still be accurate) within the limitations.
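
As an illustration of point 3, planning cardinality up front might look like the sketch below. This assumes the SetMaxMetricPointsPerMetricStream extension was the knob behind the 2000-point default at the time of this PR; treat the method name and value as assumptions rather than a confirmed recommendation.

using OpenTelemetry;
using OpenTelemetry.Metrics;

// Sketch: raise the per-metric-stream cap so expected dimension combinations
// are not dropped (or reclaimed) unexpectedly. The meter name and the value
// 5000 are placeholders for this example.
using var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyApp")
    .SetMaxMetricPointsPerMetricStream(5000)
    .Build();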

I would be in favor of making the reclaim behavior something the user opts in to, so that they are aware of what to expect, instead of simply making it the default behavior for Delta aggregation temporality.

+1 on making this opt-in.

@alanwest alanwest left a comment (Member)

LGTM

I spoke with @utpilla and I agree we can ship this PR as-is for the 1.7.0-alpha release. Making this opt-in vs. opt-out will be a follow-up discussion to resolve prior to the 1.7.0 stable release.

@vishweshbankwar I think between you and me, we've reviewed this PR most closely. If you're ready to give it a 👍, I'll merge it.

@alanwest alanwest merged commit 2a480cd into open-telemetry:main Sep 25, 2023
49 of 50 checks passed
// We never increment the ReferenceCount for MetricPoint with no tags (index == 0) and the MetricPoint for overflow attribute,
// but we always decrement it (in the Update methods). This should be fine.
// ReferenceCount doesn't matter for MetricPoint with no tags and overflow attribute as they are never reclaimed.
internal int ReferenceCount;
Member

@alanwest @vishweshbankwar This is going to add to our memory consumption (4 bytes x MaxMetricPoints). Below there is another field, AggregationType aggType, which I'm guessing also takes up 4 bytes. I don't think either of those two things really needs the full 4 bytes; what if we did some bit manipulation on a single int field to cover both needs with better memory utilization?
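
For illustration only, packing both values into one int might look something like this. The 8/24-bit split and all names here are assumptions for the example, not what the SDK ended up doing.

using System.Threading;

internal static class PackedStateSketch
{
    // Low 8 bits: aggregation type; high 24 bits: reference count.
    private const int AggTypeBits = 8;
    private const int AggTypeMask = (1 << AggTypeBits) - 1;
    private const int RefCountUnit = 1 << AggTypeBits;

    public static int Create(byte aggType) => aggType;

    public static byte GetAggType(int state) => (byte)(state & AggTypeMask);

    public static int GetReferenceCount(int state) => (int)((uint)state >> AggTypeBits);

    // Adding/subtracting RefCountUnit only touches the high bits, so the
    // aggregation type is preserved across concurrent increments and decrements.
    public static void IncrementRefCount(ref int state) => Interlocked.Add(ref state, RefCountUnit);

    public static void DecrementRefCount(ref int state) => Interlocked.Add(ref state, -RefCountUnit);
}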

@okonaraddi-msft

okonaraddi-msft commented May 31, 2024

For example, @vishweshbankwar pointed out a specific case where the SDK has exported a particular MetricPoint (k1,v1) at time T1. After a while, this MetricPoint gets reclaimed. Now if their application produces a totally new set of unique metric points and consistently updates them, they might run into a situation where there is no MetricPoint available when a thread later tries to update the MetricPoint (k1,v1) at time T2 and the measurement gets dropped. When consuming the metrics, their dashboard would show them that no measurements were recorded for the MetricPoint (k1,v1) at time T2, which is not correct as it was actually dropped.

Hi @utpilla @vishweshbankwar, I'd like to better understand the risk associated with enabling reclaim. Does the above mean that any reclaimed MetricPoint's key-value set will be dropped moving forward? So if (k1,v1) is reclaimed and then some time later a metric with (k1,v1) is emitted, that metric will be dropped instead of a new metric point being created?

EDIT: Sorry, I missed the "they might run into a situation where there is no MetricPoint available" part earlier; now that case makes sense. If I understand correctly, the quoted case applies only when the max limit for metric points is reached and nothing can be reclaimed to make room for (k1,v1) again.
