Fix a lock contention in Signaller controller #69

Merged
merged 2 commits into mittwald:master on Apr 6, 2021
Conversation

@mqmr (Contributor) commented on Apr 6, 2021

Lock contention was observed under a relatively high volume of PURGE requests, which Signaller propagates among the instances of a Varnish cluster, combined with changes to the endpoints of the K8s service for Varnish. The contention delays updates to the Varnish configuration file, which leads to (effectively) an outage of the Varnish cluster.
The Signaller controller locks the "Signaller" structure whenever it reads or updates one of its fields (the endpoints).

This commit reduces the time for which the "Signaller" structure is locked: the controller now copies the current set of endpoints under the lock instead of keeping the structure locked while it sends the PURGE requests.
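A minimal sketch of this copy-under-lock pattern, assuming a simplified `Signaller` type; the names (`Endpoint`, `snapshotEndpoints`, `broadcast`) are illustrative and not the repository's exact API:

```go
// Sketch only: type and method names are illustrative, not the exact
// structures from this repository.
package signaller

import (
	"net/http"
	"sync"
)

// Endpoint is a simplified stand-in for a Varnish instance address.
type Endpoint struct {
	Host string
	Port string
}

// Signaller guards its endpoint list with a mutex.
type Signaller struct {
	mutex     sync.RWMutex
	endpoints []Endpoint
}

// SetEndpoints swaps in a new endpoint set; the lock is held only for the swap.
func (s *Signaller) SetEndpoints(endpoints []Endpoint) {
	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.endpoints = endpoints
}

// snapshotEndpoints copies the current endpoints under a read lock so that
// callers can iterate over the copy without holding the lock.
func (s *Signaller) snapshotEndpoints() []Endpoint {
	s.mutex.RLock()
	defer s.mutex.RUnlock()
	endpoints := make([]Endpoint, len(s.endpoints))
	copy(endpoints, s.endpoints)
	return endpoints
}

// broadcast sends one request (e.g. a PURGE) to every Varnish instance.
// It iterates over the copy, so slow or numerous requests no longer delay
// SetEndpoints.
func (s *Signaller) broadcast(req *http.Request) error {
	for _, ep := range s.snapshotEndpoints() {
		r := req.Clone(req.Context())
		r.URL.Host = ep.Host + ":" + ep.Port
		r.Host = r.URL.Host
		resp, err := http.DefaultClient.Do(r)
		if err != nil {
			return err
		}
		resp.Body.Close()
	}
	return nil
}
```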

The following are extended (ad-hoc) logs that allowed me to trace timings inside several methods when the endpoints of the K8s service for the Varnish cluster (frontends) change. The ad-hoc logging is not included in this PR; a sketch of the approach is shown below.
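A sketch of this kind of timing instrumentation, assuming k8s.io/klog/v2 (which emits the "I0225 ..." lines below) and reusing the illustrative `Signaller` type from the sketch above; the helper name is hypothetical:

```go
// Sketch of the ad-hoc instrumentation only; it is not part of this PR.
package signaller

import (
	"time"

	"k8s.io/klog/v2"
)

// setEndpointsTimed is SetEndpoints with timing around the lock acquisition,
// mirroring the "T since start" measurements in the logs below.
func (s *Signaller) setEndpointsTimed(endpoints []Endpoint) {
	start := time.Now()
	klog.Info("XXX GOING TO LOCK MUTEX IN SetEndpoints;")
	s.mutex.Lock()
	klog.Infof("XXX MUTEX IS LOCKED IN SetEndpoints; T since start - %s", time.Since(start))
	s.endpoints = endpoints
	klog.Infof("XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - %s", time.Since(start))
	s.mutex.Unlock()
}
```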

Before the changes, 9 s passed before the lock inside Signaller.SetEndpoints was acquired:

I0225 11:23:03.963121       1 endpoints_watch.go:99] XXX newConfig: &{[{varnish-1 10.219.137.130 80 <nil>}] 0xc000158340}
I0225 11:23:03.965096       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 11:23:03.965172       1 watch.go:37] XXX ENTER FRONTEND UPDATES CASE; T since loop start - 21.835µs
I0225 11:23:03.965371       1 watch.go:42] XXX FRONTENDS ARE UPDATED; T since loop start - 234.991µs
I0225 11:23:03.965397       1 watch.go:45] XXX GOING TO UPDATE SIGNALLER ENDPOINTS; T since loop start - 262.643µs
I0225 11:23:03.965439       1 types.go:56] XXX GOING TO LOCK MUTEX IN SetEndpoints;
I0225 11:23:03.965471       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 11:23:12.991572       1 types.go:58] XXX MUTEX IS LOCKED IN SetEndpoints; T since start - 9.026124302s
I0225 11:23:12.991656       1 types.go:62] XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - 9.026218379s
I0225 11:23:12.991696       1 watch.go:52] XXX SIGNALLER ENDPOINTS ARE SET; T since loop start - 9.026556667s

In other test scenarios, even longer delays were observed; the worst case of lock contention noticed was around 30 minutes.

After the changes:

I0225 13:28:20.285131       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 13:28:20.285178       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 13:28:31.272964       1 endpoints_watch.go:99] XXX newConfig: &{[{varnish-1 10.219.138.228 80 <nil>}] 0xc000680f80}
I0225 13:28:31.273082       1 endpoints_watch.go:103] XXX GOING TO SEND newConfig to updates channel
I0225 13:28:31.273151       1 endpoints_watch.go:106] XXX SENT newConfig to updates channel
I0225 13:28:31.273143       1 watch.go:37] XXX ENTER FRONTEND UPDATES CASE; T since loop start - 15.896µs
I0225 13:28:31.273312       1 watch.go:42] XXX FRONTENDS ARE UPDATED; T since loop start - 195.035µs
I0225 13:28:31.273353       1 watch.go:45] XXX GOING TO UPDATE SIGNALLER ENDPOINTS; T since loop start - 236.996µs
I0225 13:28:31.273380       1 types.go:56] XXX GOING TO LOCK MUTEX IN SetEndpoints;
I0225 13:28:31.273402       1 types.go:58] XXX MUTEX IS LOCKED IN SetEndpoints; T since start - 22.576µs
I0225 13:28:31.273429       1 types.go:62] XXX ENDPOINTS ARE UPDATED IN SetEndpoints; T since start - 49.522µs
I0225 13:28:31.273460       1 watch.go:52] XXX SIGNALLER ENDPOINTS ARE SET; T since loop start - 344.207µs

@martin-helmich (Member) left a comment:
Again, good catch! 👍 I suppose the original design overlooked that even though the actual requests were offloaded into another Goroutine, the write into that channel would still block for the entire duration of all requests (at minimum -- possibly longer, if there's contention), locking the entire signaller for the whole time.
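A hypothetical illustration of that failure mode (not the repository's actual code, and reusing the illustrative `Endpoint` type from the sketch above): the lock is held across the channel write, so a busy consumer goroutine stalls every other user of the lock.

```go
// Hypothetical illustration of the original contention, not the repository's
// actual code.
package signaller

import (
	"net/http"
	"sync"
)

type blockingSignaller struct {
	mutex     sync.RWMutex
	endpoints []Endpoint
	requests  chan *http.Request
}

// broadcast holds the (read) lock while writing to the channel. If the worker
// goroutine draining `requests` is still busy sending earlier PURGE requests,
// this send blocks, and SetEndpoints, which needs the write lock on the same
// mutex, waits behind it for the whole time.
func (b *blockingSignaller) broadcast(req *http.Request) {
	b.mutex.RLock()
	defer b.mutex.RUnlock()
	b.requests <- req // blocks until the worker is ready to receive
}
```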

@martin-helmich merged commit 6dbce14 into mittwald:master on Apr 6, 2021.