
Conversation

@shcheklein
Contributor

Users were getting:

dvc.rwlock.RWLockFileCorruptedError: Unable to read RWLock-file '.dvc/tmp/rwlock'. JSON structure is corrupted

when running multiple pipelines in parallel.

The reason for this is, again, our unlocked(cmd) logic, which fails to re-acquire repo.lock if other pipelines are doing some heavy work or there are too many of them. It means that rwlock, on context manager exit, cleans itself up while the repo is not protected, so multiple processes can write to it simultaneously.

Overall, proper locking is pretty hard. I don't see good options for Python in terms of IPC or anything that would be potentially faster / more reliable (could clean up after itself if the process fails). For now I think it's better to keep this as a workaround until we reconsider parallel execution in general (e.g. deprecate it in favor of the queue?).
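To make the workaround concrete, here is a minimal sketch of the idea - guard every read-modify-write of the rwlock JSON with its own small lock, so it stays protected even when repo.lock is not held. This is an illustration, not DVC's actual code: the _guard helper, its file names, and the bare O_CREAT | O_EXCL lock file are all hypothetical stand-ins for DVC's real locking primitives.

```python
import json
import os
import time
from contextlib import contextmanager


@contextmanager
def _guard(path=".dvc/tmp/rwlock.lock", timeout=10.0):
    """Hypothetical guard: an O_EXCL lock file protecting rwlock edits.

    Acquire by exclusively creating the file; release by removing it.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire guard {path!r}")
            time.sleep(0.05)
    try:
        os.close(fd)
        yield
    finally:
        os.remove(path)


def edit_rwlock(path=".dvc/tmp/rwlock"):
    # Every read-modify-write of the rwlock file happens under the guard,
    # so concurrent processes can no longer interleave writes and corrupt
    # the JSON, even if repo.lock is not held at that moment.
    with _guard():
        try:
            with open(path) as f:
                data = json.load(f)
        except FileNotFoundError:
            data = {"read": {}, "write": {}}
        # ... mutate data here ...
        with open(path, "w") as f:
            json.dump(data, f)
```

The trade-off is exactly the one discussed below: a second lock layered on top of the first, with its own failure modes (e.g. a stale guard file if the process dies while holding it).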

@shcheklein shcheklein requested a review from a team as a code owner August 10, 2022 05:05
@shcheklein shcheklein requested review from efiop and pmrowla August 10, 2022 05:05
@skshetry
Collaborator

skshetry commented Aug 10, 2022

One way would be to make rwlock itself granular, e.g. .dvc/tmp/rwlocks/<path>.lock. The key could be the path itself or a checksum of the path. That way, we can replace the file atomically without any guards.

@shcheklein
Contributor Author

@skshetry it's a bit more complicated, as far as I understand. It would have to manage multiple lock files at once, and there are R and W locks, so we would have to maintain a few locations. It probably can be done, but we would still have to guarantee atomicity, absence of deadlocks (ordering), etc. Locking in general can get very complicated, and the simpler the system, the better (my preference here: don't try to be smart :) ). Even in this "simple" scenario - one global lock plus one internal rwlock - it has gotten very complicated because of just one additional piece: the internal unlock. We started hitting edge cases, corruptions, etc.

Contributor

@efiop efiop left a comment


Feels odd to add yet another lock to fix a problem with another lock, but if this fixes it, let's merge for now in the interest of time.

@efiop efiop merged commit 206e00c into main Aug 11, 2022
@efiop efiop deleted the guard-rwlock branch August 11, 2022 12:54
@shcheklein
Contributor Author

shcheklein commented Aug 11, 2022

@efiop agreed, it feels weird and looks complicated. But if we think about it - rwlock was written in a way that requires protection (repo.lock serves as that guard). It just happened that the initial assumption - that repo.lock protects it - was wrong.

Doing IPC right is very hard. Even in the current code I still see a few potential edge cases:

  • the guard fails (too much pressure) - and we are left with an rwlock file that needs manual cleanup or has to be removed completely. That's terrible UX.
  • I would make rwlock writes atomic for sure, using mv (write to a temp file, then rename it into place). As it stands, if something happens mid-write, we'll get a corrupted file again.
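The second point - atomic writes via mv - can be sketched like this. The atomic_write_json helper is a hypothetical illustration, not DVC's code; the key property is that a rename within one filesystem is atomic, so readers always see either the old or the new complete JSON, never a partial write:

```python
import json
import os
import tempfile


def atomic_write_json(path, data):
    # Create the temp file in the same directory as the target, because
    # rename is only guaranteed atomic within a single filesystem.
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
        os.replace(tmp, path)  # the "mv": atomic on POSIX and Windows
    except BaseException:
        os.remove(tmp)  # don't leave a stray temp file behind on failure
        raise
```

If the process is killed mid-write, only the temp file is partially written; the rwlock file itself is never left in a corrupted state.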

But the best locking logic is to not deal with it at all :) My take is that we can probably deprecate parallel repro runs in favor of the experiments queue.
