Fix a lock contention in Signaller controller #69

Merged
merged 2 commits into mittwald:master on Apr 6, 2021
Conversation

@mqmr (Contributor) commented on Apr 6, 2021

Lock contention was observed under a relatively high volume of PURGE requests, which Signaller propagates among the instances of a Varnish cluster, combined with changes to the endpoints of the K8s service for Varnish. The contention delays updates to the Varnish configuration file, which leads to (effectively) an outage of the Varnish cluster.
The Signaller controller locks the "Signaller" structure whenever it reads or updates one of its fields (the endpoints).

This commit reduces the time for which the "Signaller" structure is locked: the controller now copies the current set of endpoints under the lock instead of keeping the structure locked while it sends the PURGE requests.
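A minimal sketch of this copy-under-lock pattern, assuming a simplified `Signaller` type; the names (`Endpoint`, `snapshotEndpoints`, `broadcast`) are illustrative and not the repository's exact API:

```go
// Sketch only: type and method names are illustrative, not the exact
// structures from this repository.
package signaller

import (
	"net/http"
	"sync"
)

// Endpoint is a simplified stand-in for a Varnish instance address.
type Endpoint struct {
	Host string
	Port string
}

// Signaller guards its endpoint list with a mutex.
type Signaller struct {
	mutex     sync.RWMutex
	endpoints []Endpoint
}

// SetEndpoints swaps in a new endpoint set; the lock is held only for the swap.
func (s *Signaller) SetEndpoints(endpoints []Endpoint) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.endpoints = endpoints
}

// snapshotEndpoints copies the current endpoints under a read lock so that
// callers can iterate over the copy without holding the lock.
func (s *Signaller) snapshotEndpoints() []Endpoint {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	endpoints := make([]Endpoint, len(s.endpoints))
	copy(endpoints, s.endpoints)
	return endpoints
}

// broadcast sends one request (e.g. a PURGE) to every Varnish instance.
// It iterates over the copy, so slow or numerous requests no longer delay
// SetEndpoints.
func (s *Signaller) broadcast(req *http.Request) error {
	for _, ep := range s.snapshotEndpoints() {
		r := req.Clone(req.Context())
		r.URL.Host = ep.Host + ":" + ep.Port
		r.Host = r.URL.Host
		resp, err := http.DefaultClient.Do(r)
		if err != nil {
			return err
		}
		resp.Body.Close()
	}
	return nil
}
```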

The following are extended (ad-hoc) logs that allowed me to trace timings inside several methods when the endpoints of the K8s service for the Varnish cluster (frontends) change. The ad-hoc logging is not included in this PR; a sketch of the approach is shown below.
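A sketch of this kind of timing instrumentation, assuming k8s.io/klog/v2 (which emits the "I0225 ..." lines below) and reusing the illustrative `Signaller` type from the sketch above; the helper name is hypothetical:

```go
// Sketch of the ad-hoc instrumentation only; it is not part of this PR.
package signaller

import (
	"time"

	"k8s.io/klog/v2"
)

// setEndpointsTimed is SetEndpoints with timing around the lock acquisition,
// mirroring the "T since start" measurements in the logs below.
func (s *Signaller) setEndpointsTimed(endpoints []Endpoint) {
	start := time.Now()
	klog.Info("XXX GOING TO LOCK MUTEX IN SetEndpoints;")
	s.mutex.Lock()
	klog.Infof("XXX MUTEX IS LOCKED IN SetEndpoints; T since start - %s", time.Since(start))
	s.endpoints = endpoints
	klog.Infof("XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - %s", time.Since(start))
	s.mutex.Unlock()
}
```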

Before the changes, 9 s passed before the lock inside Signaller.SetEndpoints was acquired:

I0225 11:23:03.963121       1 endpoints_watch.go:99] XXX newConfig: &{[{varnish-1 10.219.137.130 80 <nil>}] 0xc000158340}
I0225 11:23:03.965096       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 11:23:03.965172       1 watch.go:37] XXX ENTER FRONTEND UPDATES CASE; T since loop start - 21.835µs
I0225 11:23:03.965371       1 watch.go:42] XXX FRONTENDS ARE UPDATED; T since loop start - 234.991µs
I0225 11:23:03.965397       1 watch.go:45] XXX GOING TO UPDATE SIGNALLER ENDPOINTS; T since loop start - 262.643µs
I0225 11:23:03.965439       1 types.go:56] XXX GOING TO LOCK MUTEX IN SetEndpoints;
I0225 11:23:03.965471       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 11:23:12.991572       1 types.go:58] XXX MUTEX IS LOCKED IN SetEndpoints; T since start - 9.026124302s
I0225 11:23:12.991656       1 types.go:62] XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - 9.026218379s
I0225 11:23:12.991696       1 watch.go:52] XXX SIGNALLER ENDPOINTS ARE SET; T since loop start - 9.026556667s

In other test scenarios, even longer delays were observed; the worst case of lock contention noticed was around 30 minutes.

After the changes:

I0225 13:28:20.285131       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 13:28:20.285178       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 13:28:31.272964       1 endpoints_watch.go:99] XXX newConfig: &{[{varnish-1 10.219.138.228 80 <nil>}] 0xc000680f80}
I0225 13:28:31.273082       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 13:28:31.273151       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 13:28:31.273143       1 watch.go:37] XXX ENTER FRONTEND UPDATES CASE; T since loop start - 15.896µs
I0225 13:28:31.273312       1 watch.go:42] XXX FRONTENDS ARE UPDATED; T since loop start - 195.035µs
I0225 13:28:31.273353       1 watch.go:45] XXX GOING TO UPDATE SIGNALLER ENDPOINTS; T since loop start - 236.996µs
I0225 13:28:31.273380       1 types.go:56] XXX GOING TO LOCK MUTEX IN SetEndpoints;
I0225 13:28:31.273402       1 types.go:58] XXX MUTEX IS LOCKED IN SetEndpoints; T since start - 22.576µs
I0225 13:28:31.273429       1 types.go:62] XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - 49.522µs
I0225 13:28:31.273460       1 watch.go:52] XXX SIGNALLER ENDPOINTS ARE SET; T since loop start - 344.207µs

@martin-helmich (Member) left a comment:
Again, good catch! 👍 I suppose the original design overlooked that even though the actual requests were offloaded into another Goroutine, the write into that channel would still block for the entire duration of all requests (at minimum -- possibly longer, if there's contention), locking the entire signaller for the whole time.
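A hypothetical illustration of that failure mode (not the repository's actual code, and reusing the illustrative `Endpoint` type from the sketch above): the lock is held across the channel write, so a busy consumer goroutine stalls every other user of the lock.

```go
// Hypothetical illustration of the original contention, not the repository's
// actual code.
package signaller

import (
	"net/http"
	"sync"
)

type blockingSignaller struct {
	mutex     sync.RWMutex
	endpoints []Endpoint
	requests  chan *http.Request
}

// broadcast holds the (read) lock while writing to the channel. If the worker
// goroutine draining `requests` is still busy sending earlier PURGE requests,
// this send blocks, and SetEndpoints, which needs the write lock on the same
// mutex, waits behind it for the whole time.
func (b *blockingSignaller) broadcast(req *http.Request) {
	b.mutex.RLock()
	defer b.mutex.RUnlock()
	b.requests <- req // blocks until the worker is ready to receive
}
```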

@martin-helmich merged commit 6dbce14 into mittwald:master on Apr 6, 2021.