Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bulkheads in Metrics Only mode #72

Merged
merged 12 commits into from
Apr 24, 2018
Merged

Bulkheads in Metrics Only mode #72

merged 12 commits into from
Apr 24, 2018

Conversation

lewislabs
Copy link
Contributor

v3.2.0

This adds the ability to put Bulkheads into a metrics only mode, through configuration at a per bulkhead or global level.

If bulkhead concurrency isn't properly tuned in a production scenario then it's possible that you get a lot of BulkheadRejectionExceptions that actually would have been safe to execute without a throttle. These config keys add the ability to quickly turn off bulkhead rejections during a production scenario like this.

We felt like it was still important to get metrics for rejections during a time when they're effectively not rejecting, as this may help with tuning the concurrency levels in the future.

This PR also adds an IMetricEvent implementation, with GaugeLogMetrics to help with some of the diagnostic actions mentioned above.

The wiki is being updated with relevant information in the MARM-UpdateDocsForBulkheadMetricsOnly branch of the wiki repo.

@robhruska
Copy link
Member

Looks good to me. The gauge logger has a few gnarly bits, but given that it's a logger that's probably okay as long as it doesn't deadlock or otherwise impact a thread that's doing the actual work.

@lewislabs
Copy link
Contributor Author

@robhruska is there something specific about the logger/ConcurrentDictionary in there that you feel could be avoided?

@robhruska
Copy link
Member

Yeah, I'll comment inline.

{
_concurrencyExceededLog.Debug($"BulkheadRejections since last gauge - [Bulkhead={bulkheadName}, BulkheadType={bulkheadType}, MaxConcurrent={maxConcurrent}, CountAvailable={countAvailable}, Rejections={currentCount}]");
// Remove the rejections we've just logged from the current rejection count
_currentBulkheadsRejected.AddOrUpdate(bulkheadName, 0, (b, c) => c - currentCount);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be able to get rid of some of the code here that calculates deltas, and instead just keep a running incrementing count of rejections. The logging aggregator/search (e.g. Sumo Logic, Splunk) could possibly be used to find the per-window counts, depending on its capabilities.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm yeh but it could be tricky to write that kind of query in something like Sumo Logic. The context of a log message is only valuable with past data. It would also reset to 0 on a restart/redeploy of the app so you would need to some how factor that into an higher level log aggregation.

_metricEvents.RejectedByBulkhead(bulkhead.Name, command.Name);
throw new BulkheadRejectedException();
if (!onlyMetrics) throw new BulkheadRejectedException();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this will change the flow of semaphore.
We'll be in this code when TryEnter() will return false. That will indicate that we don't have space in a semaphore. Still then later on in that function we'll call Release() function on that semaphore which will increase it's CurrentValue. This will potentially cause to increase maximum allowed value in the semaphore.

Scenario:
Assuming that the semaphore.CurrentCount = 2 and at the very same time we'll have 5 threads try to enter it, 2 will get inside (with above if not being hit) resulting with CurrentCount to be set to 0 and after execution increased to 2 again. The problem starts with remaining 3. Normally (without onlyMetrics config) those would throw and then would not call.Release() code below. But because we are not returning function here, Release() will be call and we'll end up with CurrentCount = 5.

So I think that what we created here will be a self-regulated semaphore ;p. Which over time have CurrentCount equal to number of maximum concurrent calls. The think is I guess that was not the intention?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good critique @damtur, you're totally correct there. I'll refactor a bit more and cc you again for another pass.

@lewislabs
Copy link
Contributor Author

cc @damtur. I think this should be good now. I'm going to add a couple of unit tests around this now.

@@ -123,7 +125,7 @@ public TResult ExecuteWithBulkhead<TResult>(SyncCommand<TResult> command, Cancel
}
finally
{
bulkhead.Release();
if (!onlyMetrics) bulkhead.Release();

_metricEvents.LeaveBulkhead(bulkhead.Name, command.Name);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One more question. Do we want the LeveBulkhead metrics for those calls which would have tripped bulkhead if only metrics was not set?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we do, as it will give the full picture. In reality the call has gone through the bulkhead successfully, because it's disabled. But I could see it skewing metrics if they were being collected in a different way to logs, so I could be swayed to not make the Enter and Leave calls in this case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, as long as it's not confusing and clear what we want to achieve here I'm ok with anything. Just wanted to highlight how it works now, and make sure this is the intention.

@lewislabs lewislabs merged commit 94950da into master Apr 24, 2018
@lewislabs lewislabs deleted the MARM-ToggleBulkheads branch April 24, 2018 18:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants