
authorize: refactor store locking #2151

Merged
merged 2 commits into master on Apr 29, 2021

Conversation

calebdoxsey
Contributor

Summary

It looks like the way we're doing locking with the OPA evaluator store is causing deadlocks. I misunderstood how the RWMutex works, and it turns out it can't reliably be used recursively:

If a goroutine holds a RWMutex for reading and another goroutine might
call Lock, no goroutine should expect to be able to acquire a read lock
until the initial read lock is released. In particular, this prohibits
recursive read locking. This is to ensure that the lock eventually becomes
available; a blocked Lock call excludes new readers from acquiring the
lock.

I believe the deadlock happens like this:

  1. We perform an evaluation, which takes a read-only transaction on the store (RLock)
  2. During sync we update a record (Lock)
  3. During policy evaluation we attempt to retrieve a record (RLock)

Step (2) is blocked on (1) and step (3) is blocked on (2). But step (1) is blocked on (3), because the policy evaluation can't be completed until we get the record. This PR removes taking a read lock in the transaction, and instead only uses the lock for map access (which is very short and non-recursive).

The original purpose of the transaction lock was to prevent an update while we were evaluating a policy. This is so that the version numbers we get back from evaluation (server version / record version) reflect the data that was actually used during evaluation. So to preserve this behavior I introduce another lock in the authorize service itself. The syncer's update and evaluator will never run simultaneously, and since they don't directly depend on each other, we should have no issues with deadlocks.

Checklist

  • reference any related issues
  • updated docs
  • updated unit tests
  • updated UPGRADING.md
  • add appropriate tag (improvement / bug / etc)
  • ready for review

@calebdoxsey calebdoxsey added the bug Something isn't working label Apr 28, 2021
@calebdoxsey calebdoxsey requested a review from a team as a code owner April 28, 2021 23:09
@codeclimate

codeclimate bot commented Apr 28, 2021

Code Climate has analyzed commit e0933b5 and detected 0 issues on this pull request.


@codecov

codecov bot commented Apr 28, 2021

Codecov Report

Merging #2151 (e0933b5) into master (9215833) will decrease coverage by 0.1%.
The diff coverage is 85.7%.

@@           Coverage Diff            @@
##           master   #2151     +/-   ##
========================================
- Coverage    60.3%   60.2%   -0.2%     
========================================
  Files         167     167             
  Lines       11375   11385     +10     
========================================
- Hits         6869    6859     -10     
- Misses       3716    3735     +19     
- Partials      790     791      +1     
Impacted Files Coverage Δ
authorize/authorize.go 67.5% <ø> (ø)
authorize/sync.go 30.4% <0.0%> (-1.6%) ⬇️
authorize/evaluator/store.go 79.6% <94.1%> (+3.8%) ⬆️
authorize/grpc.go 75.4% <100.0%> (+0.4%) ⬆️
pkg/grpc/databroker/syncer.go 93.7% <100.0%> (-2.5%) ⬇️
pkg/storage/inmemory/stream.go 69.3% <0.0%> (-4.1%) ⬇️
pkg/storage/redis/redis.go 68.1% <0.0%> (-2.5%) ⬇️
pkg/storage/inmemory/backend.go 83.2% <0.0%> (-2.3%) ⬇️
internal/databroker/server.go 46.5% <0.0%> (-2.2%) ⬇️
... and 1 more

Comment on lines 49 to +60
func (syncer *dataBrokerSyncer) ClearRecords(ctx context.Context) {
syncer.authorize.stateLock.Lock()
syncer.authorize.store.ClearRecords()
syncer.authorize.stateLock.Unlock()
}

func (syncer *dataBrokerSyncer) UpdateRecords(ctx context.Context, serverVersion uint64, records []*databroker.Record) {
syncer.authorize.stateLock.Lock()
for _, record := range records {
syncer.authorize.store.UpdateRecord(serverVersion, record)
}
syncer.authorize.stateLock.Unlock()
}
Contributor

@wasaga wasaga Apr 29, 2021


Why not expand the store interface and add an UpdateRecords() method there? Then you wouldn't seem to need this extra stateLock mutex, and you wouldn't have to acquire the other mutex inside dataBrokerData for every UpdateRecord() call.

Contributor Author


If we remove the state lock, there's no guarantee we won't update records in the middle of an evaluation. You would then have an inconsistent view of the data: for example, the audit log would show record version 1234, but the evaluation would actually have used data from version 1235.

The mutex around map access is necessary because maps are not thread-safe in Go. If one goroutine reads a map while another writes to it, the runtime will crash the program.

If having two locks is unacceptable, what should I do instead to preserve thread safety and consistency?

Contributor


I missed that you also acquire RLock in Check(); I see why you do need that extra mutex now.

My comment about acquiring the mutex in a loop,

for _, record := range records {
  syncer.authorize.store.UpdateRecord(serverVersion, record)
}

was about acquiring the mutex on each UpdateRecord call. You could probably change its interface to accept multiple records, and change the underlying dataBrokerData funcs to setLocked and deleteLocked, so that the mutex is acquired once outside the loop. But it's probably not too important in terms of optimization.

Contributor Author


This violates separation of concerns. The reason I created this type was to avoid having to think about the lock from the store, since locks are easy to misuse. If you're really concerned about performance I can make the change, but I think it's negligible. FWIW, during sync the method is only called with a single record.

@travisgroth
Copy link
Contributor

Was running some load against this branch and got a panic after some time:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0xf62689]
goroutine 70 [running]:
github.com/pomerium/pomerium/pkg/grpc/databroker.(*Syncer).sync(0xc00067c420, 0x1afaba0, 0xc004c0bc80, 0x0, 0x2)
    /go/src/github.com/pomerium/pomerium/pkg/grpc/databroker/syncer.go:160 +0x369
github.com/pomerium/pomerium/pkg/grpc/databroker.(*Syncer).Run(0xc00067c420, 0x1afaba0, 0xc004c0bc80, 0xc004c759a0, 0x0)
    /go/src/github.com/pomerium/pomerium/pkg/grpc/databroker/syncer.go:100 +0x297
github.com/pomerium/pomerium/authorize.(*Authorize).Run(0xc000cc0b80, 0x1afaba0, 0xc004c0bc00, 0x0, 0x0)
    /go/src/github.com/pomerium/pomerium/authorize/authorize.go:55 +0xc5
github.com/pomerium/pomerium/internal/cmd/pomerium.Run.func4(0x0, 0x0)
    /go/src/github.com/pomerium/pomerium/internal/cmd/pomerium/pomerium.go:145 +0x3c
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc004c7ae40, 0xc004c75980)
    /go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
    /go/pkg/mod/golang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:54 +0x66

@calebdoxsey
Contributor Author

Based on the line number, this is a log message and appears unrelated. I'll fix it though.

Contributor

@travisgroth travisgroth left a comment


Haven't been able to reproduce the deadlock on this branch. 👍

LGTM unless there's other feedback from @wasaga.

@calebdoxsey calebdoxsey merged commit c85c8b0 into master Apr 29, 2021
@calebdoxsey calebdoxsey deleted the cdoxsey/376-deadlock branch April 29, 2021 14:37