-
Notifications
You must be signed in to change notification settings - Fork 1.3k
dvc.data: save and try loading raw dir objects #7597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dvc.data: save and try loading raw dir objects #7597
Conversation
86c491e to
d6152f7
Compare
8d4fc81 to
73e1716
Compare
73e1716 to
a207fcf
Compare
This comment was marked as outdated.
This comment was marked as outdated.
a207fcf to
22fe50b
Compare
|
@dtrifiro Thanks for the benches! 🙏 Btw, could you take a look at what is still taking that long in relation to this PR? I suppose it is mostly the overhead of those ref obj recreations? Just wondering if there are any low hangers that we could pick until a proper solution with persistent refobjects arrives. |
This comment was marked as outdated.
This comment was marked as outdated.
22fe50b to
6124165
Compare
|
@dtrifiro Thanks for the traces 🙏 Any particular points you were able to identify? Any particular ideas? |
352517c to
27b7927
Compare
27b7927 to
3f24f91
Compare
3f24f91 to
b8362bc
Compare
|
Updated benchmarks can be found here: A summary for ----------------------------------------------------------------------------------- benchmark 'test_status-status': 3 tests ------------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status (test_status/0001_fix-739) 25.2779 (1.0) 25.2779 (1.0) 25.2779 (1.0) 0.0000 (1.0) 25.2779 (1.0) 0.0000 (1.0) 0;0 0.0396 (1.0) 1 1
test_status-status (test_status/0002_main) 36.9391 (1.46) 36.9391 (1.46) 36.9391 (1.46) 0.0000 (1.0) 36.9391 (1.46) 0.0000 (1.0) 0;0 0.0271 (0.68) 1 1
test_status-status (test_status/0003_2.10.0) 36.9755 (1.46) 36.9755 (1.46) 36.9755 (1.46) 0.0000 (1.0) 36.9755 (1.46) 0.0000 (1.0) 0;0 0.0270 (0.68) 1 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------- benchmark 'test_status-status-changed': 3 tests ------------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status-changed (test_status/0001_fix-739) 14.9064 (1.0) 14.9064 (1.0) 14.9064 (1.0) 0.0000 (1.0) 14.9064 (1.0) 0.0000 (1.0) 0;0 0.0671 (1.0) 1 1
test_status-status-changed (test_status/0002_main) 26.9573 (1.81) 26.9573 (1.81) 26.9573 (1.81) 0.0000 (1.0) 26.9573 (1.81) 0.0000 (1.0) 0;0 0.0371 (0.55) 1 1
test_status-status-changed (test_status/0003_2.10.0) 27.1236 (1.82) 27.1236 (1.82) 27.1236 (1.82) 0.0000 (1.0) 27.1236 (1.82) 0.0000 (1.0) 0;0 0.0369 (0.55) 1 1
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------ benchmark 'test_status-status-changed-noop': 3 tests -----------------------------------------------------------------------------------
Name (time in s) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_status-status-changed-noop (test_status/0001_fix-739) 3.9865 (1.0) 3.9865 (1.0) 3.9865 (1.0) 0.0000 (1.0) 3.9865 (1.0) 0.0000 (1.0) 0;0 0.2508 (1.0) 1 1
test_status-status-changed-noop (test_status/0002_main) 26.6765 (6.69) 26.6765 (6.69) 26.6765 (6.69) 0.0000 (1.0) 26.6765 (6.69) 0.0000 (1.0) 0;0 0.0375 (0.15) 1 1
test_status-status-changed-noop (test_status/0003_2.10.0) 27.0214 (6.78) 27.0214 (6.78) 27.0214 (6.78) 0.0000 (1.0) 27.0214 (6.78) 0.0000 (1.0) 0;0 0.0370 (0.15) 1 1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
dvc/data/stage.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| raise FileNotFoundError | |
| raise FileNotFoundError from exc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While we are here:
| raise FileNotFoundError | |
| raise FileNotFoundError(errno.ENOENT, "No such object", odb.hash_to_path(hash_info.value)) |
Or, just use hash_info.value for simplicity.
b8362bc to
3c7d08e
Compare
We could push performance a little bit more by avoiding calling Hacking around, I was able to get a working prototype and ran some benchmarks with dvc-bench's |
3c7d08e to
42f87aa
Compare
When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb, `stage()` calls can load the tree from the raw dir object instead of rebuilding it by walking the directory. This can lead to significant speed improvements when calling `dvc status` for modified directories. Fixes treeverse#7390
42f87aa to
1f92c4b
Compare
|
@dtrifiro, are those improvements on |
I forgot to update the https://github.com/dtrifiro/dvc-bench/runs/6404025103?check_suite_focus=true#step:8:258 |
efiop
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, it looks like we could clean this up a bit more, but let's run with it for now and get back to it when we safe ref objects.
| obj = load(odb, hash_info) | ||
| check(odb, obj, check_hash=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, this PR introduced a typo: it should be _odb here 🙁
Introduced in treeverse/dvc#7597
Introduced in treeverse/dvc#7597
When staging a directory, always save a "raw dir object" to the odb. If the corresponding ".dir" object has not been added to the odb,
stage()calls can load the tree from the raw dir object instead of rebuilding it by walking the directory.This can lead to significant speed improvements when calling
dvc statusfor modified directories.Fixes #7390