status: implement support for imported files #3150
dvc/dependency/repo.py
Outdated
    updated_path = os.path.join(
        updated_repo.root_dir, self.def_path
    )
    has_changed = not filecmp.cmp(current_path, updated_path)
There is a gotcha in filecmp.cmp:
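    - has_changed = not filecmp.cmp(current_path, updated_path)
    + has_changed = not filecmp.cmp(current_path, updated_path, shallow=False)
Without it, cmp() only compares the os.stat() signatures (file type, size, and modification time) and skips reading the contents when they match, I think.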
Btw, these paths might be dirs and IIRC cmp() doesn't work with dirs. Maybe let's use md5 that we use for cached outputs? See RemoteLOCAL.get_checksum(), which you could invoke through self.repo.cache.local.get_checksum().
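To make the gotcha concrete, here is a minimal sketch using only the standard library. The helper name demo_shallow_gotcha and the file names are made up for illustration; nothing here is DVC code. It writes two files with the same size and forced-identical modification times but different contents, then compares them both ways, and also compares two empty directories.

```python
import filecmp
import os
import tempfile


def demo_shallow_gotcha():
    """Return (shallow_equal, deep_equal, dirs_equal) for a contrived case."""
    with tempfile.TemporaryDirectory() as tmp:
        a = os.path.join(tmp, "a.txt")
        b = os.path.join(tmp, "b.txt")
        with open(a, "w") as f:
            f.write("foo")
        with open(b, "w") as f:
            f.write("bar")  # same size as "foo", different contents

        # Force identical modification times so the stat signatures match.
        os.utime(a, (1_000_000, 1_000_000))
        os.utime(b, (1_000_000, 1_000_000))

        filecmp.clear_cache()
        shallow_equal = filecmp.cmp(a, b)  # shallow=True is the default
        deep_equal = filecmp.cmp(a, b, shallow=False)  # reads the contents

        # cmp() reports anything that is not a regular file as unequal.
        d1 = os.path.join(tmp, "d1")
        d2 = os.path.join(tmp, "d2")
        os.mkdir(d1)
        os.mkdir(d2)
        dirs_equal = filecmp.cmp(d1, d2, shallow=False)

        return shallow_equal, deep_equal, dirs_equal
```

The shallow comparison calls the two files equal even though their contents differ, and two empty directories never compare equal, which is why a content checksum is the safer choice here.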
Thanks for pointing out where I can find a checksum function :) I already suspected filecmp wasn't good for this.
dvc/dependency/repo.py
Outdated
        updated = updated_repo.find_out_by_relpath(self.def_path).info

        has_changed = current != updated
    except OutputNotFoundError:
Hm, there is also an interesting case, when a file/dir was git-tracked before but then became dvc-tracked or vice versa. But the thingy with get_checksum should make those comparable, I think.
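As a rough illustration of why a content checksum makes the git-tracked and dvc-tracked cases comparable, here is a sketch of a hash that depends only on what is on disk. The content_md5 helper and the ".dir" suffix are hypothetical choices for this example and are not DVC's get_checksum() implementation.

```python
import hashlib
import os


def _file_md5(path, chunk_size=1024 * 1024):
    """MD5 of a single file, read in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def content_md5(path):
    """Hypothetical helper: checksum of a file or of a whole directory tree.

    The result depends only on relative paths and file contents, not on
    whether git or dvc tracks the path.
    """
    if os.path.isfile(path):
        return _file_md5(path)

    entries = []
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            rel = os.path.relpath(full, path).replace(os.sep, "/")
            entries.append(f"{rel}:{_file_md5(full)}")
    entries.sort()  # make the result independent of traversal order

    digest = hashlib.md5("\n".join(entries).encode("utf-8")).hexdigest()
    # Illustrative suffix so a file and a directory never hash the same.
    return digest + ".dir"
```

With a helper like this, a dependency counts as changed when content_md5(current_path) != content_md5(updated_path), regardless of how either side is tracked.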
    if targets:
        stages = cat(self.collect(t, with_deps=with_deps) for t in targets)
    else:
        stages = self.collect(None, with_deps=with_deps)
I guess you've split these because of DeepSource, right? Nothing wrong with that, just asking 🙂
That's exactly it :)
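For readers unfamiliar with the pattern in the snippet above, here is a standalone sketch of the same branching. It assumes cat simply concatenates the per-target lists into one, and uses itertools.chain.from_iterable as a stand-in; collect_stages and fake_collect are hypothetical names, not part of DVC.

```python
from itertools import chain


def collect_stages(collect, targets=None, with_deps=False):
    """Collect stages per target and flatten, or collect everything at once."""
    if targets:
        per_target = (collect(t, with_deps=with_deps) for t in targets)
        return list(chain.from_iterable(per_target))
    return collect(None, with_deps=with_deps)


def fake_collect(target, with_deps=False):
    """Stand-in for a collect method: returns stage names as strings."""
    if target is None:
        return ["all-stages"]
    suffix = "+deps" if with_deps else ""
    return [f"{target}{suffix}"]
```

Calling collect_stages(fake_collect, ["a.dvc", "b.dvc"], with_deps=True) yields one flat list covering both targets, while passing no targets falls back to a single collect(None, ...) call.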
…us-of-imported-files
@efiop now that was quite fun :) I got […] The case where a file turns into a directory (or vice versa) is treated with an […] Also, since the repository is supposedly remote, I didn't give any thought to symlinks and hardlinks.
Thank you for your patience 🙂
@fabiosantoscode Hm, but why is that an issue though? 🤔
@efiop thanks a lot for your comments! I wouldn't have thought that it was possible to generate checksums without a DVC repo. The code seemed to require a repo. This changes a lot, and the resulting code is way easier to understand.
@efiop done
@fabiosantoscode almost done here, just one minor comment left. Thank you for your patience 🙂
@efiop done, it looks way simpler now.
        # Fall through and clone
        pass

    repo_path = cached_clone(
Luckily we do have a git + dvc clones refactor on our todo list 🙂 Not a part of this PR or anything, it is just us messing up earlier. 😅
Thanks! 🎉
    )
    path = PathInfo(os.path.join(repo_path, self.def_path))

    return self.repo.cache.local.get_checksum(path)
We really shouldn't do this. If path happens to be a dir then it will add that dir listing to self.repo cache.
@Suor It is not an issue really. Those are tiny files and will get cleaned up on gc.
This fix only works when the remote is located on your machine. For remote DVC projects, it doesn't work and I'm looking into why.
Also I'm having a lot of trouble writing a unit test for this. The CLI works fine, but when I call dvc.status(with_deps=True) in the test it reports that nothing changed.
Fixes #2959
❗ Have you followed the guidelines in the Contributing to DVC list?
📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.
☝️ dvc lock docs imply that locked files are impervious to dvc status.
☝️ I changed some code which already had DeepSource antipatterns and it re-detected them.
Thank you for the contribution - we'll try to review it as soon as possible. 🙏