diff: use RepoTree to compare directory contents #4518

Merged
merged 4 commits into from
Sep 3, 2020

Contributor

@pmrowla pmrowla commented Sep 2, 2020

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Partially addresses #2982.

  • Directory file contents are now diffed properly (instead of only showing the top-level dir/ added/modified/deleted status)
  • Uncached directory hashes are fetched from remotes using RepoTree(stream=True)

This does not implement pulling all uncached file contents by default, or the file size related tasks.
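
As a rough illustration of the first bullet, a file-level directory diff can be thought of as a comparison of two path-to-hash mappings. This is only a minimal sketch under stated assumptions, not the code in this PR: `diff_dir_entries` and the sample entries are hypothetical names and data, and the real implementation reads its entries through RepoTree.

```python
from typing import Dict, List


def diff_dir_entries(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare two {relative_path: hash} mappings for a directory.

    Hypothetical helper for illustration; not part of DVC.
    """
    old_paths, new_paths = set(old), set(new)
    return {
        # Paths present only in the new revision.
        "added": sorted(new_paths - old_paths),
        # Paths present only in the old revision.
        "deleted": sorted(old_paths - new_paths),
        # Paths present in both revisions whose hashes differ.
        "modified": sorted(
            path for path in old_paths & new_paths if old[path] != new[path]
        ),
    }


# Made-up sample data: one unchanged file, one changed file, one new file.
old_entries = {"dir/a.txt": "aaa", "dir/b.txt": "bbb"}
new_entries = {"dir/a.txt": "aaa", "dir/b.txt": "b2b", "dir/c.txt": "ccc"}
result = diff_dir_entries(old_entries, new_entries)
```

With a comparison like this, a change inside dir/ is reported per file rather than as a single modified entry for the whole directory.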

@pmrowla pmrowla added the bugfix fixes bug label Sep 2, 2020
@pmrowla pmrowla self-assigned this Sep 2, 2020
@pmrowla pmrowla added this to In progress in DVC 25 August - 8 September 2020 via automation Sep 2, 2020
@pmrowla pmrowla moved this from In progress to Review in progress in DVC 25 August - 8 September 2020 Sep 2, 2020
dvc/repo/diff.py Outdated
Comment on lines 107 to 108
# if dir hash is missing from cache, and no remote to pull it from,
# there is nothing we can do here
Member

Should we print a warning or at least a debug message?
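
One way the suggestion above could look is sketched below with Python's standard `logging` module. This is a sketch under assumptions, not the change that was merged: `lookup_dir_entries`, its parameters, and the dict-based cache are hypothetical stand-ins for illustration.

```python
import logging
from typing import Dict, Optional

logger = logging.getLogger(__name__)


def lookup_dir_entries(
    path: str, dir_hash: str, cache: Dict[str, Dict[str, str]]
) -> Optional[Dict[str, str]]:
    """Return the cached {path: hash} entries for a directory hash.

    Hypothetical helper: `cache` is a plain dict standing in for the
    local cache, and no remote is assumed to be available.
    """
    entries = cache.get(dir_hash)
    if entries is None:
        # Nothing can be done without the cached dir hash or a remote,
        # but leave a trace so the missing contents are not skipped silently.
        logger.debug(
            "dir hash '%s' for '%s' is missing from cache; "
            "skipping its file contents",
            dir_hash,
            path,
        )
    return entries
```

A debug-level message keeps the default output unchanged while still making the skipped directory visible in verbose runs.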



def _output_paths(repo):
    on_working_tree = isinstance(repo.tree, LocalTree)
Member

Does on_working_tree matter anymore though? RepoTree will just use repo.tree and will act accordingly if it is a wtree or a git rev tree.

Contributor Author

So the issue here is that DvcTree (and RepoTree) do not recompute dir output hashes when the workspace is dirty. I'm not sure that we would want to change that behavior just for dvc diff. It seems like in the majority of scenarios we would want to reuse the already computed hashes for dir outputs when we are using DvcTree and RepoTree.

Member

That's actually an inconsistency that should be solved in RepoTree. Ideally, we should be able to work with both dirty and clean repos. But we could start with the current approach too.

Contributor Author

Filed #4523 for this issue

dvc/repo/diff.py Outdated (resolved)
Member

@skshetry skshetry left a comment

Looks good in general, but I am not sure if this is a good thing, as output could get quite lengthy.

@efiop efiop merged commit 1f495d6 into iterative:master Sep 3, 2020
DVC 25 August - 8 September 2020 automation moved this from Review in progress to Done Sep 3, 2020
@pmrowla pmrowla deleted the diff-repo-tree branch September 3, 2020 12:18