Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

checkout: figure out whether to link or not #548

Merged
merged 3 commits into from
Aug 12, 2024
Merged

checkout: figure out whether to link or not #548

merged 3 commits into from
Aug 12, 2024

Conversation

skshetry
Copy link
Member

@skshetry skshetry commented Aug 8, 2024

Requires iterative/dvc#10513 for this to work.

@skshetry skshetry force-pushed the relinking-diff branch 3 times, most recently from e200a09 to c956fb8 Compare August 8, 2024 14:14
@skshetry skshetry force-pushed the relinking-diff branch 2 times, most recently from 9fbb805 to abed49d Compare August 8, 2024 14:40
@codecov-commenter
Copy link

codecov-commenter commented Aug 8, 2024

Codecov Report

Attention: Patch coverage is 35.71429% with 36 lines in your changes missing coverage. Please review.

Project coverage is 67.29%. Comparing base (081a1d8) to head (39ae650).
Report is 23 commits behind head on main.

Files Patch % Lines
src/dvc_data/hashfile/checkout.py 12.19% 36 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #548      +/-   ##
==========================================
+ Coverage   62.98%   67.29%   +4.30%     
==========================================
  Files          62       65       +3     
  Lines        4342     4785     +443     
  Branches      740      803      +63     
==========================================
+ Hits         2735     3220     +485     
+ Misses       1448     1377      -71     
- Partials      159      188      +29     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skshetry skshetry force-pushed the relinking-diff branch 4 times, most recently from 7c4200f to f2d272d Compare August 8, 2024 16:05

if link_type == "symlink" and is_symlink and destination:
return destination != obj_path
if link_type == "hardlink" and is_hardlink and samefile(path, obj_path):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This os call is still a bottleneck. Here's the profile for the 1.5 million files in the imagenet 2017 annotations:

add-relink-diff.prof.zip

For this to be really worth it, I think we would at least have to parallelize so these are less of a blocker or else rely on metadata to determine if they are the same.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #549 that adds a thread pool. Feel free to reimplement a different way if you want, but this has a much more reasonable profile:

add-relink-diff-thread.prof.zip

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's the profile for the 1.5 million files

I downloaded that, extracted that file. There were tar.gz files inside again, which I re-extracted. And I deleted the existing .tar.gz file.

I see following stats that says 1.07 million files. Did I do anything wrong?

Number of files: 1,077,367 (reg: 1,073,739, dir: 3,628)

I know macOS filesystem operations are slow, but I did not know they were this terrible.
On my machine (Linux/Intel), _check_relink finishes in 5s - 6s. First time I did dvc add, it took ~9mins, and repeated dvc add takes <1min 30s.

I can drop one more stat calls and use metadata. But I need to stat the path to object in the cache to compare inodes (we don't have cache metadata). That drops to 3s for me. :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I changed this to be strictly metadata comparison now. The inode data comes from when we check for object integrity with HashFileDB.check() (it's a bit hacky but works).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On my machine (Linux/Intel), _check_relink finishes in 5s - 6s. First time I did dvc add, it took ~9mins, and repeated dvc add takes <1min 30s.

🤯 I just spun up a linux vm and still got 40+ minutes for the same files with this PR. I guess it shows how much the filesystem impacts performance, but I think you have done what you can with this PR.

Copy link
Member Author

@skshetry skshetry Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share how much time _determine_files_to_relink takes on either machines? I am just curious what it costs.

You can add @funcy.print_durations at the top of the function, and then dvc add ....


I guess it shows how much the filesystem impacts performance

This is what I see in my machine for posix.stat.

ncalls tottime percall cumtime percall filename:lineno(function)
354974 0.9792 2.758e-06 0.9792 2.758e-06 ~:0(<built-in method posix.stat>)

Compared that to what I see in yours:

ncalls tottime percall cumtime percall filename:lineno(function)
11907017 2235 0.0001877 2235 0.0001877 ~:0(<built-in method posix.stat>)

which is 68x slower.

Here's my cprofile for adding 1M files: add.prof.zip

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please share how much time _determine_files_to_relink takes on either machines? I am just curious what it costs.

The last run I did on my mac with this PR, this only took ~4 seconds. Most of the time is spent on stat calls as you noted.

@skshetry skshetry force-pushed the relinking-diff branch 3 times, most recently from 22ddcd8 to 000ab86 Compare August 9, 2024 04:32
@skshetry skshetry force-pushed the relinking-diff branch 2 times, most recently from aed86fb to 9d26064 Compare August 9, 2024 10:33
@skshetry skshetry force-pushed the relinking-diff branch 3 times, most recently from 30b29bc to 4706282 Compare August 12, 2024 05:57
@skshetry skshetry marked this pull request as ready for review August 12, 2024 06:37
@skshetry skshetry merged commit a3dc83d into main Aug 12, 2024
15 of 17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants