Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pipeline: show: outs: eliminate extra edges in DAG #3217

Conversation

fabiosantoscode
Copy link
Contributor

This fixes extra nodes edges appearing in pipeline show --outs, by using networkx to traverse the graph instead of listing all edges.

I spent a lot of time trying to figure out how to test this, so I thought I'd either leave it untested or take suggestions from the PR review.

Fixes #1744


  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea was to connect outputs to deps, this is not changed here. The code still connects outs to outs.

@fabiosantoscode
Copy link
Contributor Author

@Suor I initially looked at the issue description on Discord and tried to reproduce the same issue. This fixes that diamond problem, where A generates B and C, and B doesn't go on to be a dependency of the ultimate target but is still in the graph.

I'll try to reproduce the more general issue and see what fixes it.

@fabiosantoscode
Copy link
Contributor Author

@Suor there. It's getting pretty nested and I could use itertools.product() to make those two for loops just one, but it's probably unnecessarily complicated.

@fabiosantoscode
Copy link
Contributor Author

Wait, that implementation is incorrect.

@fabiosantoscode
Copy link
Contributor Author

This should do it.

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we need a test to clear this out.

dvc/command/pipeline.py Outdated Show resolved Hide resolved
)

args = ["pipeline", "show", "--outs", "--dot", final_stage.path]
with tmp_dir.chdir():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are already in tmp_dir, so no need to chdir. Or am I missing something here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is correct! I didn't know about that. Updating.

@efiop efiop requested review from Suor and pared January 28, 2020 22:21
final_stage.path, commands=False, outs=True
)

assert set(nodes) == set(["final", "derived1", "base"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need for converting list to set, set(["final", "derived1", "base"]) == {"final", "derived1", "base"}

deps=["derived1"], outs=["final"], cmd="echo final > final"
)

args = ["pipeline", "show", "--outs", "--dot", final_stage.path]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are calling __build_graph anyway, why would we provide those args? You can simply initialize command with empty list. :)

@@ -98,6 +99,28 @@ def test_dot_commands(self):
self.assertEqual(ret, 0)


def test_disconnected_stage(tmp_dir, dvc, caplog):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

caplog seems to be unused

Copy link
Contributor

@pared pared left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Little suggestions for tests.

Another thing, not necessarily in this change: @efiop, do you think that we should keep the name of the method __build_graph? It has special meaning in python, and at the same time we do not leverage the functionality it provides (preventing method name collisions with child classes), so maybe its better to just replace it with _build_graph and access it by command._build_graph?

@efiop
Copy link
Contributor

efiop commented Jan 29, 2020

Another thing, not necessarily in this change: @efiop, do you think that we should keep the name of the method __build_graph

Yeah, that hurts my eyes too, it was introduced a few years back by one of the contributors and we didn't touch that since. We should change it, but not necessarily in this PR (though, might as well) 🙂

@pared
Copy link
Contributor

pared commented Jan 29, 2020

@efiop that is very tactful of you, to say "one of our contributors" and not "you did that, remember?" :D
@fabiosantoscode If it's not a problem, I would suggest refactoring the name __build_graph -> _build_graph. If you mind, it can stay as is.

@efiop
Copy link
Contributor

efiop commented Jan 29, 2020

@efiop that is very tactful of you, to say "one of our contributors" and not "you did that, remember?"

Wait, I didn't mean that, though. I don't think it was you, was it? 👀

EDIT: oh, it was https://github.com/iterative/dvc/pull/1141/files . I didn't mean to shame anyone. But at least we are now clear that we shouldn't have done that, so that is what really matters. We've all come a long way 🙂

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some cosmetic things, but LGTM overall.


return nodes, edges, networkx.is_tree(G)
if outs and not commands:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aren't this two mutually exclusive?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes they seem to be!

@fabiosantoscode
Copy link
Contributor Author

I'll remove the __ for a regular _. I agree that it's a bit uncalled for.

@fabiosantoscode
Copy link
Contributor Author

I believe I've addressed all of the feedback. Additionally, I found a __write_dot method and turned it into _write_dot.

Copy link
Contributor

@efiop efiop left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much @fabiosantoscode ! 🙏

@efiop efiop merged commit 93d1b6d into iterative:master Feb 4, 2020
@fabiosantoscode fabiosantoscode deleted the feature/1744-pipeline-eliminate-extra-connection branch February 4, 2020 00:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pipeline: show: outs: improper DAG
4 participants