Fix lock contention in the Signaller controller #69
Lock contention was observed under a combination of a relatively high rate of PURGE requests, which the Signaller propagates among the instances of a Varnish cluster, and changes to the endpoints of the K8s service for Varnish. The contention delays updates to the Varnish configuration file, which effectively amounts to an outage of the Varnish cluster.
The Signaller controller locks the `Signaller` structure whenever it reads or updates the `endpoints` field of that structure, so broadcasting PURGE requests and processing endpoint changes contend for the same lock.
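A minimal sketch of the contended pattern (hypothetical names; the actual fields and methods in the controller may differ): a single mutex guards the endpoint list, and the broadcast holds it across every HTTP round-trip, so a slow PURGE broadcast blocks `SetEndpoints`.

```go
package signaller

import (
	"net/http"
	"sync"
)

// Hypothetical sketch, not the actual source: one mutex guards the
// endpoint list for both readers and writers.
type Signaller struct {
	mu        sync.Mutex
	endpoints []string // addresses of the Varnish pods behind the K8s service
}

// SetEndpoints blocks until any in-flight broadcast releases the lock.
func (s *Signaller) SetEndpoints(endpoints []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.endpoints = endpoints
}

// broadcast holds the lock across every HTTP round-trip, which is
// where the contention comes from.
func (s *Signaller) broadcast(path string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for _, ep := range s.endpoints {
		req, err := http.NewRequest("PURGE", "http://"+ep+path, nil)
		if err != nil {
			continue
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}
}
```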
This commit shortens the time the `Signaller` structure stays locked: the controller now copies the current set of endpoints while holding the lock and releases it before sending PURGE requests, instead of holding the lock for the entire broadcast.
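A sketch of the fixed pattern, reusing the hypothetical `Signaller` type from the sketch above: the endpoint slice is snapshotted under the lock and the lock is released before any network I/O, so `SetEndpoints` only ever waits for the duration of the copy.

```go
// Hypothetical sketch of the fix: snapshot the endpoints under the
// lock, then release it before sending any PURGE requests.
func (s *Signaller) broadcast(path string) {
	s.mu.Lock()
	endpoints := make([]string, len(s.endpoints))
	copy(endpoints, s.endpoints) // cheap copy; the lock is held only for this
	s.mu.Unlock()

	// Network I/O now happens outside the critical section, so a slow
	// Varnish instance no longer delays SetEndpoints.
	for _, ep := range endpoints {
		req, err := http.NewRequest("PURGE", "http://"+ep+path, nil)
		if err != nil {
			continue
		}
		if resp, err := http.DefaultClient.Do(req); err == nil {
			resp.Body.Close()
		}
	}
}
```

The snapshot is safe to iterate without the lock because `SetEndpoints` replaces the slice wholesale rather than mutating it in place.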
The following are extended (ad-hoc) logs that allowed me to trace timings inside several methods when the endpoints of the K8s service for the Varnish cluster (frontends) change. The ad-hoc logging is not included in this PR.
Before the changes, 9s passed before the lock inside `Signaller.SetEndpoints` was acquired:
In other test scenarios, even longer delays were observed; the worst case of lock contention noticed was around 30 minutes.
After the changes: