
Conversation

@shcheklein
Contributor

Users were getting:

dvc.rwlock.RWLockFileCorruptedError: Unable to read RWLock-file '.dvc/tmp/rwlock'. JSON structure is corrupted

when running multiple pipelines in parallel.

The reason for this is, again, our unlocked(cmd) logic, which fails to re-acquire repo.lock if other pipelines are doing some heavy work or there are too many of them. It means that rwlock, on context manager exit, cleans itself up while the repo is not protected, so multiple processes can write to it simultaneously.

Overall, proper locking is pretty hard. I don't see good options for Python in terms of IPC or anything that would be potentially faster / more reliable (could clean up after itself if the process fails). For now I think it's better to keep this as a workaround until we reconsider parallel execution in general (e.g. deprecate it in favor of the queue?).
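To make the workaround concrete, here is a minimal sketch of the idea - guard every read-modify-write of the rwlock JSON with its own small lock, so it stays protected even when repo.lock is not held. This is an illustration, not DVC's actual code: the _guard helper, its file names, and the bare O_CREAT | O_EXCL lock file are all hypothetical stand-ins for DVC's real locking primitives.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def _guard(path=".dvc/tmp/rwlock.lock", timeout=10.0):
    """Hypothetical guard: an O_EXCL lock file protecting rwlock edits.

    Acquire by exclusively creating the file; release by removing it.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire guard {path!r}")
            time.sleep(0.05)
    try:
        os.close(fd)
        yield
    finally:
        os.remove(path)


def edit_rwlock(path=".dvc/tmp/rwlock"):
    # Every read-modify-write of the rwlock file happens under the guard,
    # so concurrent processes can no longer interleave writes and corrupt
    # the JSON, even if repo.lock is not held at that moment.
    with _guard():
        try:
            with open(path) as f:
                data = json.load(f)
        except FileNotFoundError:
            data = {"read": {}, "write": {}}
        # ... mutate data here ...
        with open(path, "w") as f:
            json.dump(data, f)
```

The trade-off is exactly the one discussed below: a second lock layered on top of the first, with its own failure modes (e.g. a stale guard file if the process dies while holding it).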

@shcheklein shcheklein requested a review from a team as a code owner August 10, 2022 05:05
@shcheklein shcheklein requested review from efiop and pmrowla August 10, 2022 05:05
@skshetry
Collaborator

skshetry commented Aug 10, 2022

One way would be to make rwlock itself granular, e.g. .dvc/tmp/rwlocks/<path>.lock. The key could be the path itself or a checksum of the path. That way, we can replace the file atomically without any guards.

@shcheklein
Contributor Author

@skshetry it's a bit more complicated, as far as I understand. It would have to manage multiple lock files at once, and there are R and W locks, so we would have to maintain a few locations. It probably can be done, but we would still have to guarantee atomicity, absence of deadlocks (ordering), etc. Locking in general can get very complicated, and the simpler the system, the better (my preference here: don't try to be smart :) ). Even in this "simple" scenario - one global lock plus one internal rwlock - it has gotten very complicated because of just one additional piece: the internal unlock. We started hitting edge cases, corruptions, etc.

Contributor

@efiop efiop left a comment


Feels odd to add yet another lock to fix a problem with another lock, but if this fixes it, let's merge for now in the interest of time.

@efiop efiop merged commit 206e00c into main Aug 11, 2022
@efiop efiop deleted the guard-rwlock branch August 11, 2022 12:54
@shcheklein
Contributor Author

shcheklein commented Aug 11, 2022

@efiop agreed, it feels weird and looks complicated. But if we think about it - rwlock was written in a way that requires protection (repo.lock serves as that guard). It just happened that the initial assumption - that repo.lock protects it - was wrong.

Doing IPC right is very hard. Even in the current code I still see a few potential edge cases:

  • the guard fails (too much pressure) - and we are left with an rwlock file that needs manual cleanup or has to be removed completely. That's terrible UX.
  • I would make rwlock writes atomic for sure, using mv (write to a temp file, then rename it into place). As it stands, if something happens mid-write, we'll get a corrupted file again.
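The second point - atomic writes via mv - can be sketched like this. The atomic_write_json helper is a hypothetical illustration, not DVC's code; the key property is that a rename within one filesystem is atomic, so readers always see either the old or the new complete JSON, never a partial write:

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    # Create the temp file in the same directory as the target, because
    # rename is only guaranteed atomic within a single filesystem.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # the "mv": atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp)  # don't leave a stray temp file behind on failure
        raise
```

If the process is killed mid-write, only the temp file is partially written; the rwlock file itself is never left in a corrupted state.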

But the best locking logic is to not deal with it at all :) My take is that we can probably deprecate parallel repro runs in favor of the experiments queue.
