[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546

Merged

@utpilla utpilla commented Apr 17, 2024

Changes

On the main branch:

An update thread does the following:

  1. Get index for the MetricPoint array
  2. Increment the ReferenceCount for the MetricPoint at that index
  3. Ensure the MetricPoint is valid for use
  4. Update necessary values
  5. Set the MetricPointStatus to CollectPending
  6. Decrement the ReferenceCount for the MetricPoint
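
The six update steps above can be sketched roughly as follows. This is an illustration only: `SketchPoint`, `UpdatePath.TryUpdate`, and the `Sum` field are hypothetical names, not the SDK's actual types, and the sketch assumes that a negative ReferenceCount means the Collect thread has already marked the point invalid.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for a MetricPoint; not the SDK's actual type.
public sealed class SketchPoint
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount; // int.MinValue is assumed to mean "invalid for use"
    public int Status;         // NoCollectPending or CollectPending
    public long Sum;           // the value being aggregated (assumption)
}

public static class UpdatePath
{
    // Returns false when the point was already marked invalid by the collector.
    public static bool TryUpdate(SketchPoint[] points, int index, long value)
    {
        // Step 1: get the point at the index.
        SketchPoint point = points[index];

        // Step 2: increment the ReferenceCount.
        int count = Interlocked.Increment(ref point.ReferenceCount);

        // Step 3: a negative count means the collector set int.MinValue.
        if (count < 0)
        {
            // Undo our increment so the sentinel stays intact (a choice of this sketch).
            Interlocked.Decrement(ref point.ReferenceCount);
            return false;
        }

        // Step 4: update the necessary values.
        Interlocked.Add(ref point.Sum, value);

        // Step 5: flag the point so the next Collect exports it.
        Volatile.Write(ref point.Status, SketchPoint.CollectPending);

        // Step 6: release the reference.
        Interlocked.Decrement(ref point.ReferenceCount);
        return true;
    }
}
```

A caller that gets `false` back would have to find or create another point; that part is left out of the sketch.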

The collect thread does the following:

  1. Enumerate over all the MetricPoints
  2. If the MetricPointStatus for a given MetricPoint is NoCollectPending, then check if it can be marked invalid
  3. Try marking the MetricPoint invalid (this happens by setting the ReferenceCount to int.MinValue when no one is using it)
  4. If the 3rd step was successful, then reclaim the MetricPoint
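
A rough sketch of the collect steps as they behave on the main branch, before the fix. `SketchPoint` and `CollectPath.Collect` are hypothetical names. The branch that handles a point with CollectPending status (export it and clear the flag) is an assumption added so the loop is complete; it is not described in the list above.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical stand-in for a MetricPoint; not the SDK's actual type.
public sealed class SketchPoint
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount; // int.MinValue is assumed to mean "invalid for use"
    public int Status;
}

public static class CollectPath
{
    // Returns the indices of the points that were reclaimed in this cycle.
    public static List<int> Collect(SketchPoint[] points)
    {
        var reclaimed = new List<int>();

        // Step 1: enumerate over all the points.
        for (int i = 0; i < points.Length; i++)
        {
            SketchPoint point = points[i];

            // Step 2: only points with no pending update are reclaim candidates.
            if (Volatile.Read(ref point.Status) != SketchPoint.NoCollectPending)
            {
                // Assumption: the point is exported here and its flag is cleared.
                Volatile.Write(ref point.Status, SketchPoint.NoCollectPending);
                continue;
            }

            // Step 3: mark invalid only if no one is using it (count is exactly 0).
            int previous = Interlocked.CompareExchange(ref point.ReferenceCount, int.MinValue, 0);

            // Step 4: reclaim if the mark succeeded.
            if (previous == 0)
            {
                reclaimed.Add(i);
            }
        }

        return reclaimed;
    }
}
```

Note that nothing in this loop looks at the status again between step 2 and step 4.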

Where's the race condition?

Consider this sequence of steps, where both the Update thread and the Collect thread are working on the same MetricPoint:

| Time | Update thread | Collect thread |
|------|---------------|----------------|
| T1 | Get MetricPoint index | Enumerate over all MetricPoints |
| T2 | Update thread gets switched out | Check MetricPointStatus for the MetricPoint |
| T3 | Update thread is switched out | Collect thread finds that MetricPointStatus is NoCollectPending |
| T4 | Update thread is back and it increments the ReferenceCount to 1 | Collect thread gets switched out |
| T5 | Ensure that the MetricPoint is valid for use | Switched out |
| T6 | Update the required values | Switched out |
| T7 | Set MetricPointStatus to CollectPending (This is the update that we would miss) | Switched out |
| T8 | Decrement the ReferenceCount to 0 | Switched out |
| T9 | | Collect thread is back and sets the ReferenceCount to int.MinValue as it finds the ReferenceCount to be 0 |
| T10 | | Reclaims the MetricPoint and misses the update that happened at T7 |
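
The interleaving from T1 to T10 can be replayed deterministically on a single thread by running the steps in that order. The sketch below does this with plain local variables; `RaceReplay`, `ReplayLosesUpdate`, and the `sum` variable are hypothetical and only model the state described here, not the SDK's code.

```csharp
using System;
using System.Threading;

public static class RaceReplay
{
    private const int NoCollectPending = 0;
    private const int CollectPending = 1;

    // Returns true when the point ends up marked invalid while it still
    // carries an update that the collector never looked at.
    public static bool ReplayLosesUpdate()
    {
        int referenceCount = 0;
        int status = NoCollectPending;
        long sum = 0;

        // T2-T3 (Collect): reads the status and sees NoCollectPending.
        bool collectSawIdle = Volatile.Read(ref status) == NoCollectPending;

        // T4 (Update): increments the ReferenceCount to 1.
        int count = Interlocked.Increment(ref referenceCount);

        // T5 (Update): the point is still valid because the count is positive.
        bool valid = count > 0;

        // T6 (Update): records a value.
        if (valid)
        {
            sum += 10;
        }

        // T7 (Update): sets the status to CollectPending.
        Volatile.Write(ref status, CollectPending);

        // T8 (Update): decrements the ReferenceCount back to 0.
        Interlocked.Decrement(ref referenceCount);

        // T9 (Collect): acts on the stale status read and finds the count at 0.
        bool markedInvalid = collectSawIdle
            && Interlocked.CompareExchange(ref referenceCount, int.MinValue, 0) == 0;

        // T10 (Collect): reclaims although an unexported update is pending.
        return markedInvalid && status == CollectPending && sum == 10;
    }
}
```

The key point is that the compare-and-swap at T9 succeeds because the Update thread has already released its reference, so the ReferenceCount alone cannot reveal the update made at T7.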

Fix

I'm introducing a construct similar to double-checked locking: after the Collect thread marks the MetricPoint invalid for use, it rechecks whether the MetricPointStatus was changed to CollectPending before the mark took effect. When that happens, we now call Snapshot for that MetricPoint and mark it to be reclaimed in the next Collect cycle. Note that the MetricPoint remains invalid for use until the next Collect cycle reclaims it.
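
A minimal sketch of the recheck described above, under stated assumptions. `SketchPoint`, `ReclaimNextCycle`, `CollectOutcome`, and `CollectOne` are hypothetical names, and `Snapshot` here only moves `Sum` into `Exported` and clears the status; the SDK's Snapshot does more. The `interleave` callback is a test seam of this sketch that lets an update be injected between the first status check and the invalid mark.

```csharp
using System;
using System.Threading;

public enum CollectOutcome { Skipped, Exported, Reclaimed, ExportedAndDeferred }

// Hypothetical stand-in for a MetricPoint; not the SDK's actual type.
public sealed class SketchPoint
{
    public const int NoCollectPending = 0;
    public const int CollectPending = 1;

    public int ReferenceCount;    // int.MinValue is assumed to mean "invalid for use"
    public int Status;
    public bool ReclaimNextCycle; // hypothetical "reclaim in the next cycle" flag
    public long Sum;
    public long Exported;
}

public static class FixedCollectPath
{
    public static CollectOutcome CollectOne(SketchPoint point) => CollectOne(point, () => { });

    public static CollectOutcome CollectOne(SketchPoint point, Action interleave)
    {
        if (Volatile.Read(ref point.Status) == SketchPoint.CollectPending)
        {
            Snapshot(point);
            return CollectOutcome.Exported;
        }

        // Reclaim a point that was deferred in the previous cycle.
        if (point.ReclaimNextCycle)
        {
            point.ReclaimNextCycle = false;
            return CollectOutcome.Reclaimed;
        }

        interleave(); // simulates the Update thread running at T4-T8

        if (Interlocked.CompareExchange(ref point.ReferenceCount, int.MinValue, 0) != 0)
        {
            return CollectOutcome.Skipped; // still in use
        }

        // Second check: did an update land before the point was marked invalid?
        if (Volatile.Read(ref point.Status) == SketchPoint.CollectPending)
        {
            Snapshot(point);
            point.ReclaimNextCycle = true; // stays invalid until the next cycle
            return CollectOutcome.ExportedAndDeferred;
        }

        return CollectOutcome.Reclaimed;
    }

    // Simplified snapshot: export the pending value and clear the flag.
    private static void Snapshot(SketchPoint point)
    {
        point.Exported += Interlocked.Exchange(ref point.Sum, 0);
        Volatile.Write(ref point.Status, SketchPoint.NoCollectPending);
    }
}
```

With the second check in place, an update that slips in before the mark is exported in the current cycle, and the actual reclaim happens one cycle later.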

@utpilla utpilla requested a review from a team as a code owner April 17, 2024 23:30

codecov bot commented Apr 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.58%. Comparing base (6250307) to head (c2d0230).
Report is 186 commits behind head on main.

@@            Coverage Diff             @@
##             main    #5546      +/-   ##
==========================================
+ Coverage   83.38%   85.58%   +2.19%     
==========================================
  Files         297      289       -8     
  Lines       12531    12493      -38     
==========================================
+ Hits        10449    10692     +243     
+ Misses       2082     1801     -281     
| Flag | Coverage Δ |
|------|------------|
| unittests | ? |
| unittests-Solution-Experimental | 85.31% <100.00%> (?) |
| unittests-Solution-Stable | 85.53% <100.00%> (?) |

| Files | Coverage Δ |
|-------|------------|
| src/OpenTelemetry/Metrics/AggregatorStore.cs | 87.46% <100.00%> (+7.08%) ⬆️ |
| src/OpenTelemetry/Metrics/LookupData.cs | 100.00% <100.00%> (ø) |

... and 77 files with indirect coverage changes

@CodeBlanch CodeBlanch added pkg:OpenTelemetry Issues related to OpenTelemetry NuGet package metrics labels Apr 18, 2024
@CodeBlanch CodeBlanch changed the title Fix race condition for MemoryPoint Reclaim [sdk-metrics] Fix race condition for MemoryPoint Reclaim Apr 18, 2024
@CodeBlanch

@Yun-Ting Would you mind opening an issue to make sure we have Coyote test coverage for this delta MetricPoint reclaim feature?

@CodeBlanch CodeBlanch left a comment

LGTM

@CodeBlanch CodeBlanch merged commit 84bdeb3 into open-telemetry:main Apr 18, 2024
39 checks passed