checkout: figure out whether to link or not #548
Conversation
Codecov Report

Coverage diff against `main` for #548:

| | main | #548 | +/- |
|---|---|---|---|
| Coverage | 62.98% | 67.29% | +4.30% |
| Files | 62 | 65 | +3 |
| Lines | 4342 | 4785 | +443 |
| Branches | 740 | 803 | +63 |
| Hits | 2735 | 3220 | +485 |
| Misses | 1448 | 1377 | -71 |
| Partials | 159 | 188 | +29 |
Review comment on `src/dvc_data/hashfile/checkout.py` (outdated):

```python
if link_type == "symlink" and is_symlink and destination:
    return destination != obj_path
if link_type == "hardlink" and is_hardlink and samefile(path, obj_path):
```
This `os` call is still a bottleneck. Here's the profile for the 1.5 million files in the ImageNet 2017 annotations.

For this to really be worth it, I think we would at least have to parallelize so these calls are less of a blocker, or else rely on metadata to determine whether the files are the same.
Opened #549, which adds a thread pool. Feel free to reimplement it a different way if you want, but this has a much more reasonable profile.
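Not the actual implementation in #549, but a minimal sketch of the idea (hypothetical helper name): `os.stat` releases the GIL while waiting on the kernel, so a thread pool can overlap the per-file latency that dominates on slow filesystems:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def stat_many(paths, max_workers=16):
    """Stat many paths concurrently.

    os.stat releases the GIL during the syscall, so threads overlap the
    per-file latency even though this is CPU-bound-looking Python code.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return list(pool.map(os.stat, paths))
```

The worker count is a guess; the sweet spot depends on the filesystem (network filesystems and macOS benefit from more concurrency than a local ext4 volume).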
> Here's the profile for the 1.5 million files

I downloaded and extracted that file. There were `tar.gz` files inside again, which I re-extracted, and then I deleted the existing `.tar.gz` file.

I see the following stats, which say 1.07 million files. Did I do anything wrong?

```
Number of files: 1,077,367 (reg: 1,073,739, dir: 3,628)
```
I know macOS filesystem operations are slow, but I did not know they were this terrible.

On my machine (Linux/Intel), `_check_relink` finishes in 5s-6s. The first time I did `dvc add`, it took ~9 mins, and repeated `dvc add` takes <1 min 30s.

I can drop one more stat call and use metadata, but I need to stat the path to the object in the cache to compare inodes (we don't have cache metadata). That drops it to 3s for me. :)
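A minimal sketch of that trick (hypothetical function name): if the workspace file's stat result is already in hand, only the cache object needs an extra stat, so `os.path.samefile`'s two stat calls shrink to one:

```python
import os


def is_hardlink_to(path_info: os.stat_result, obj_path: str) -> bool:
    """Return True if the file described by path_info is hardlinked to obj_path.

    Hardlinks share a (device, inode) pair; comparing those fields needs only
    one stat call here, because path_info was gathered earlier (e.g. during
    an integrity check), halving the syscalls vs. os.path.samefile.
    """
    obj_info = os.stat(obj_path)
    return (path_info.st_dev, path_info.st_ino) == (obj_info.st_dev, obj_info.st_ino)
```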
Btw, I changed this to be a strictly metadata comparison now. The inode data comes from when we check for object integrity with `HashFileDB.check()` (it's a bit hacky, but it works).
> On my machine (Linux/Intel), `_check_relink` finishes in 5s-6s. First time I did `dvc add`, it took ~9 mins, and repeated `dvc add` takes <1 min 30s.
🤯 I just spun up a Linux VM and still got 40+ minutes for the same files with this PR. I guess it shows how much the filesystem impacts performance, but I think you have done what you can in this PR.
Could you please share how much time `_determine_files_to_relink` takes on either machine? I am just curious what it costs.

You can add `@funcy.print_durations` at the top of the function, and then run `dvc add ...`.
> I guess it shows how much the filesystem impacts performance

This is what I see on my machine for `posix.stat`:

| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|---|---|---|---|---|---|
| 354974 | 0.9792 | 2.758e-06 | 0.9792 | 2.758e-06 | `~:0(<built-in method posix.stat>)` |
Compared to what I see in yours:

| ncalls | tottime | percall | cumtime | percall | filename:lineno(function) |
|---|---|---|---|---|---|
| 11907017 | 2235 | 0.0001877 | 2235 | 0.0001877 | `~:0(<built-in method posix.stat>)` |

which is ~68x slower per call.
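The 68x figure follows directly from the `percall` columns of the two tables:

```python
# Per-call times for posix.stat, taken from the two cProfile tables above.
fast_percall = 2.758e-06   # seconds per call on the Linux machine (354,974 calls)
slow_percall = 0.0001877   # seconds per call on the slower machine (11,907,017 calls)

ratio = slow_percall / fast_percall
print(round(ratio))  # → 68
```

Note the slower run also made ~34x more calls, so the total cost multiplies both effects.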
Here's my cProfile output for adding 1M files: add.prof.zip
> Could you please share how much time `_determine_files_to_relink` takes on either machine? I am just curious what it costs.
On the last run I did on my Mac with this PR, this step only took ~4 seconds. Most of the time is spent on stat calls, as you noted.
Requires iterative/dvc#10513 for this to work.