Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: LockError due to unnecessary write lock #5795

Closed
maximerischard opened this issue Apr 11, 2021 · 9 comments
Closed

repro: LockError due to unnecessary write lock #5795

maximerischard opened this issue Apr 11, 2021 · 9 comments
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC p2-medium Medium priority, should be done, but less important

Comments

@maximerischard
Copy link
Contributor

Bug Report

Description

A LockError is raised if two dvc steps that share a dependency are run in parallel.

Reproduce

dvc init
echo "hello" > dependency.txt
dvc add dependency.txt

The add the following to dvc.yaml:

stages:
  step1:
    cmd:
        - sleep 20
        - echo 'output1' > out1.txt
    deps:
    - dependency.txt
    outs:
    - out1.txt
  step2:
    cmd:
        - sleep 20
        - echo 'output2' > out2.txt
    deps:
    - dependency.txt
    outs:
    - out2.txt

The run dvc repro step1 and dvc repro step2 in two separate terminals. The second command will result in a LockError:

ERROR: failed to reproduce 'dependency.txt.dvc': 'dependency.txt' is busy, it is being blocked by:
  (PID 159252): […]/dvc repro step1

If there are no processes with such PIDs, you can manually remove '.dvc/tmp/rwlock' and try again.

The reason seems to be that:

  1. step1 grabs a read lock on dependency.txt
  2. step2 triggers dependency.txt.dvc to be “run“
  3. dependency.txt.dvc requests a write lock on dependency.txt
  4. the write lock is refused because of the read lock from (1)

Step (3) seems to be the problematic one, there should be no need for a write lock on dependency.txt to be requested.

Expected

Two dvc repro or dvc run commands that read from the same file should be able to run in parallel.

Workaround

step2 can be run with dvc repro --single-item step2 but this should not be necessary.

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 2.0.17 (pip)
---------------------------------
Platform: Python 3.9.0 on Linux-4.15.0-136-generic-x86_64-with-glibc2.27
Supports: http, https, s3
@maximerischard
Copy link
Contributor Author

Egregiously, dvc repro --dry step2 also results in a LockError. It seems pretty clear-cut to me that dvc repro run with the --dry should never request a write lock.

@maximerischard
Copy link
Contributor Author

Deleting the @rwlock decorator from the reproduce() method fixes this particular issue. Note that the locks will still be obtained if necessary since the run() method is also decorated. Furthermore, I would say the rwlock decorator on the run() method is still too aggressive and is likely to cause similar issues. A write lock should only be obtained if the output file is actually going to be written to.

@karajan1001 karajan1001 added bug Did we break something? discussion requires active participation to reach a conclusion labels Apr 12, 2021
@karajan1001
Copy link
Contributor

karajan1001 commented Apr 12, 2021

@maximerischard

Seems that the lock error happens in the stage dependency.txt.dvc, it is a single stage. dependency.txt is the output of this stage.

Because some of the dvc run may take quite a long time (a few days in some training process), rwlock on the run() can't be removed.

But we can remove rwlock on run() in a single-stage as it didn't change during dvc repro I think.

@pmrowla pmrowla removed the bug Did we break something? label Apr 12, 2021
@pmrowla
Copy link
Contributor

pmrowla commented Apr 12, 2021

This isn't really a bug, it's more of a known limitation in current DVC. DVC is not designed in a way to allow running more than one stage at a time within a single repository instance/workspace. There are existing feature requests to support this behavior, and it is something we would like to get to eventually:

See these related issues:
#755
#3783

The --dry behavior is something we could consider adjusting, but as it stands the current behavior for repro --dry is for DVC to do everything it would normally do during repro except for executing the actual stage command - this includes checking all of the repo locks that would normally need to be acquired during repro.

@skshetry
Copy link
Member

Closing in favor of #5007

@efiop
Copy link
Member

efiop commented Apr 12, 2021

Hi @maximerischard !

Step (3) seems to be the problematic one, there should be no need for a write lock on dependency.txt to be requested.

dvc repro for dependency.txt.dvc will try to add a new version of dependency if it has changed, hence grabbing a write lock for it. So to improve it, we could try to grab it only if the output (of dependency.txt.dvc which is dependency.txt) has changed. Pipeline stages could also benefit from such change.

@efiop
Copy link
Member

efiop commented Apr 12, 2021

For the record: Closed #5007 in favor of this one, since it contains a bit more contenxt.

@efiop efiop added enhancement Enhances DVC p2-medium Medium priority, should be done, but less important labels Apr 12, 2021
@maximerischard
Copy link
Contributor Author

Thank you so much for responding to the issue. It's hugely encouraging that dvc devs and the community are so responsive.

To respond to a few comments:

Because some of the dvc run may take quite a long time (a few days in some training process), rwlock on the run() can't be removed.

There's no question that a lock needs to be obtained when the command is actually executed. But the run() method has multiple exit conditions for which the execution is avoided. In those cases no write lock should be acquired. In other words, the lock should be obtained only once we're sure we're going to execute the step.

This isn't really a bug, it's more of a known limitation in current DVC. DVC is not designed in a way to allow running more than one stage at a time within a single repository instance/workspace.

I would question this slightly. The read-write lock functionality has no reason to exist except to enable this kind of manually parallel workflow. The docs are quite clear on this topic: “Currently, dvc repro is not able to parallelize stage execution automatically. If you need to do this, you can launch dvc repro multiple times manually.” The latter is exactly what I'm trying to do, and I would say dvc is a few bug fixes away from fulfilling that promise. (I would say dvc is a handful of coroutines away from delivering “parallelized stage execution” but that's certainly a trickier task.)

The --dry behavior is something we could consider adjusting, but as it stands the current behavior for repro --dry is for DVC to do everything it would normally do during repro except for executing the actual stage command - this includes checking all of the repo locks that would normally need to be acquired during repro.

I can see how that makes sense. It's a tricky design decision, thank you for explaining the reasoning behind it.

@efiop
Copy link
Member

efiop commented Aug 10, 2021

Fixed by #5815

@efiop efiop closed this as completed Aug 10, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion enhancement Enhances DVC p2-medium Medium priority, should be done, but less important
Projects
None yet
Development

No branches or pull requests

5 participants