
authorize: refactor store locking #2151

Merged
merged 2 commits into master on Apr 29, 2021

Conversation

calebdoxsey
Contributor

Summary

It looks like the way we're doing locking with the OPA evaluator store is causing deadlocks. I misunderstood how the RWMutex works, and it turns out it can't reliably be used recursively:

If a goroutine holds a RWMutex for reading and another goroutine might
call Lock, no goroutine should expect to be able to acquire a read lock
until the initial read lock is released. In particular, this prohibits
recursive read locking. This is to ensure that the lock eventually becomes
available; a blocked Lock call excludes new readers from acquiring the
lock.

I believe the deadlock happens like this:

  1. We perform an evaluation, which takes a read-only transaction on the store (RLock)
  2. During sync we update a record (Lock)
  3. During policy evaluation we attempt to retrieve a record (RLock)

Step (2) is blocked on (1) and step (3) is blocked on (2). But step (1) is blocked on (3), because the policy evaluation can't be completed until we get the record. This PR removes taking a read lock in the transaction, and instead only uses the lock for map access (which is very short and non-recursive).

The original purpose of the transaction lock was to prevent an update while we were evaluating a policy. This is so that the version numbers we get back from evaluation (server version / record version) reflect the data that was actually used during evaluation. So to preserve this behavior I introduce another lock in the authorize service itself. The syncer's update and evaluator will never run simultaneously, and since they don't directly depend on each other, we should have no issues with deadlocks.

Checklist

  • reference any related issues
  • updated docs
  • updated unit tests
  • updated UPGRADING.md
  • add appropriate tag (improvement / bug / etc)
  • ready for review

@calebdoxsey calebdoxsey added the bug Something isn't working label Apr 28, 2021
@calebdoxsey calebdoxsey requested a review from a team as a code owner April 28, 2021 23:09
@codeclimate

codeclimate bot commented Apr 28, 2021

Code Climate has analyzed commit e0933b5 and detected 0 issues on this pull request.


@codecov

codecov bot commented Apr 28, 2021

Codecov Report

Merging #2151 (e0933b5) into master (9215833) will decrease coverage by 0.1%.
The diff coverage is 85.7%.

@@           Coverage Diff            @@
##           master   #2151     +/-   ##
========================================
- Coverage    60.3%   60.2%   -0.2%     
========================================
  Files         167     167             
  Lines       11375   11385     +10     
========================================
- Hits         6869    6859     -10     
- Misses       3716    3735     +19     
- Partials      790     791      +1     
Impacted Files Coverage Δ
authorize/authorize.go 67.5% <ø> (ø)
authorize/sync.go 30.4% <0.0%> (-1.6%) ⬇️
authorize/evaluator/store.go 79.6% <94.1%> (+3.8%) ⬆️
authorize/grpc.go 75.4% <100.0%> (+0.4%) ⬆️
pkg/grpc/databroker/syncer.go 93.7% <100.0%> (-2.5%) ⬇️
pkg/storage/inmemory/stream.go 69.3% <0.0%> (-4.1%) ⬇️
pkg/storage/redis/redis.go 68.1% <0.0%> (-2.5%) ⬇️
pkg/storage/inmemory/backend.go 83.2% <0.0%> (-2.3%) ⬇️
internal/databroker/server.go 46.5% <0.0%> (-2.2%) ⬇️
... and 1 more

Comment on lines 49 to +60
func (syncer *dataBrokerSyncer) ClearRecords(ctx context.Context) {
syncer.authorize.stateLock.Lock()
syncer.authorize.store.ClearRecords()
syncer.authorize.stateLock.Unlock()
}

func (syncer *dataBrokerSyncer) UpdateRecords(ctx context.Context, serverVersion uint64, records []*databroker.Record) {
syncer.authorize.stateLock.Lock()
for _, record := range records {
syncer.authorize.store.UpdateRecord(serverVersion, record)
}
syncer.authorize.stateLock.Unlock()
}
Contributor

@wasaga wasaga Apr 29, 2021


Why not expand the store interface and add an UpdateRecords() method there? Then you wouldn't seem to need this extra stateLock mutex, and you wouldn't have to acquire the other mutex inside dataBrokerData for every UpdateRecord() call.

Contributor Author


If we remove the state lock, there's no guarantee we won't update records in the middle of an evaluation. You would then have an inconsistent view of the data: for example, the audit log would show record version 1234, but the evaluation would actually have used data from version 1235.

The mutex around map access is necessary because maps are not thread-safe in Go. If one goroutine reads a map while another writes to it, the runtime will crash the program.

If having two locks is unacceptable, what should I do instead to preserve thread safety and consistency?

Contributor


I missed that you also acquire RLock in Check(); I see why you do need that extra mutex now.

My comment about acquiring the mutex in a loop,

for _, record := range records {
  syncer.authorize.store.UpdateRecord(serverVersion, record)
}

was about acquiring the mutex on each UpdateRecord call. You could probably change its interface to accept multiple records, and change the underlying dataBrokerData funcs to setLocked and deleteLocked, so that the mutex is acquired once outside the loop. But it's probably not too important in terms of optimization.

Contributor Author


This violates separation of concerns. The reason I created this type was to avoid having to think about the lock from the store, since locks are easy to misuse. If you're really concerned about performance I can make the change, but I think it's negligible. FWIW, during sync the method is only called with a single record.

@travisgroth
Copy link
Contributor

Was running some load against this branch and got a panic after some time:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0xf62689]
goroutine 70 [running]:
github.com/pomerium/pomerium/pkg/grpc/databroker.(*Syncer).sync(0xc00067c420, 0x1afaba0, 0xc004c0bc80, 0x0, 0x2)
    /go/src/github.com/pomerium/pomerium/pkg/grpc/databroker/syncer.go:160 +0x369
github.com/pomerium/pomerium/pkg/grpc/databroker.(*Syncer).Run(0xc00067c420, 0x1afaba0, 0xc004c0bc80, 0xc004c759a0, 0x0)
    /go/src/github.com/pomerium/pomerium/pkg/grpc/databroker/syncer.go:100 +0x297
github.com/pomerium/pomerium/authorize.(*Authorize).Run(0xc000cc0b80, 0x1afaba0, 0xc004c0bc00, 0x0, 0x0)
    /go/src/github.com/pomerium/pomerium/authorize/authorize.go:55 +0xc5
github.com/pomerium/pomerium/internal/cmd/pomerium.Run.func4(0x0, 0x0)
    /go/src/github.com/pomerium/pomerium/internal/cmd/pomerium/pomerium.go:145 +0x3c
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc004c7ae40, 0xc004c75980)
    /go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
    /go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:54 +0x66

@calebdoxsey
Contributor Author

Based on the line number, this is a log message and appears unrelated. I'll fix it though.

Contributor

@travisgroth travisgroth left a comment


Haven't been able to reproduce the deadlock on this branch. 👍

LGTM unless there's other feedback from @wasaga.

@calebdoxsey calebdoxsey merged commit c85c8b0 into master Apr 29, 2021
@calebdoxsey calebdoxsey deleted the cdoxsey/376-deadlock branch April 29, 2021 14:37