Skip to content

Conversation

@fabiosantoscode
Copy link
Contributor

@fabiosantoscode fabiosantoscode commented Jan 14, 2020

This fix only works when the remote is located in your machine. For remote DVC projects, it doesn't work and I'm looking into why.

Also I'm having a lot of trouble writing a unit test for this. The CLI works fine but when I call dvc.status(with_deps=True) in the test it reports that nothing changed.

Fixes #2959


  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

☝️ dvc lock docs imply that locked files are impervious to dvc status.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

☝️ I changed some code which already had DeepSource antipatterns and it re-detected them.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@fabiosantoscode fabiosantoscode changed the title status: implement support for imported files status: implement support for imported files (WIP) Jan 14, 2020
@fabiosantoscode fabiosantoscode changed the title status: implement support for imported files (WIP) status: implement support for imported files Jan 14, 2020
updated_path = os.path.join(
updated_repo.root_dir, self.def_path
)
has_changed = not filecmp.cmp(current_path, updated_path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a gotcha in filecmp.cmp:

Suggested change
has_changed = not filecmp.cmp(current_path, updated_path)
has_changed = not filecmp.cmp(current_path, updated_path, shallow=False)

without it, it will only check sizes, I think.

Btw, these paths might be dirs and IIRC cmp() doesn't work with dirs. Maybe let's use md5 that we use for cached outputs? See RemoteLOCAL.get_checksum(), which you could invoke through self.repo.cache.local.get_checksum().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing out where I can find a checksum function :) I already suspected filecmp wasn't good for this.

updated = updated_repo.find_out_by_relpath(self.def_path).info

has_changed = current != updated
except OutputNotFoundError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, there is also an interesting case, when a file/dir was git-tracked before but then became dvc-tracked or vice versa. But the thingy with get_checksum should make those comparable, I think.

Comment on lines -16 to -20
if targets:
stages = cat(self.collect(t, with_deps=with_deps) for t in targets)
else:
stages = self.collect(None, with_deps=with_deps)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you've split these because of deep source, right? Nothing wrong with that, just asking 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's exactly it :)

@fabiosantoscode
Copy link
Contributor Author

fabiosantoscode commented Jan 15, 2020

@efiop now that was quite fun :)

I got dvc status for non-DVC remote imports working. The cross-cases are now working as well, since the old and new files are treated individually. In the case that both have an md5 we can use, that md5 is used. If they don't, it's a matter of doing a deep copy. If one of the versions is tracked by DVC but not the other, a dvc checkout is performed beforehand. Hopefully they're both small! If they're not, I think this is enough of a corner case to not matter.

The case where a file turns into a directory (or vise-versa) is treated with an assert. I'm not sure if we're just supposed to say "hey, this changed" or raise a specific exception in this case.

Also since the repository is supposedly remote I didn't give any thought to symlinks and hardlinks.

@efiop
Copy link
Contributor

efiop commented Jan 16, 2020

Thank you for your patience 🙂

The case where a file turns into a directory (or vise-versa) is treated with an assert. I'm not sure if we're just supposed to say "hey, this changed" or raise a specific exception in this case.

@fabiosantoscode Hm, but why is that an issue though? 🤔

@fabiosantoscode
Copy link
Contributor Author

@efiop thanks a lot for your comments! I wouldn't have thought that it was possible to generate checksums without a DVC repo. The code seemed to require a repo.

This changes a lot, and the resulting code is way easier to understand.

@fabiosantoscode
Copy link
Contributor Author

@efiop done

@efiop
Copy link
Contributor

efiop commented Jan 16, 2020

@fabiosantoscode almost done here, just one minor comment left. Thank you for your patience 🙂

@fabiosantoscode
Copy link
Contributor Author

@efiop done, it looks way simpler now.

# Fall through and clone
pass

repo_path = cached_clone(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Luckily we do have git + dvc clones refactor on our todo list 🙂Not a part of this PR or anything, it is just us messing up earlier. 😅

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🎉

@efiop efiop merged commit d081627 into iterative:master Jan 16, 2020
)
path = PathInfo(os.path.join(repo_path, self.def_path))

return self.repo.cache.local.get_checksum(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We really shouldn't do this. If path happens to be a dir then it will add that dir listing to self.repo cache.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor It is not an issue really. Those are tiny files and will get cleaned up on gc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

status: implement support for import-ed files

3 participants