Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support push/pull/metrics/gc, etc across different commits #1691

Closed
silverdna opened this issue Mar 6, 2019 · 71 comments
Closed

support push/pull/metrics/gc, etc across different commits #1691

silverdna opened this issue Mar 6, 2019 · 71 comments
Assignees
Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do research

Comments

@silverdna
Copy link

Currently dvc metrics show can show metric values across different branches (-a) and different tags (-T).
Can you consider supporting showing different metric values across different commits in the same branch?


The background of this is (simplified example): say I'm currently training a model, where I'm changing a certain parameter, param1 (for instance, number of trees in a forest). The way I probably would like to work is to find a first value for param1, commit the current state, continue changing param1 and continue committing the successive states that I consider worth saving. At some point I would like to look back and identify the setup that gave me the best results.

The way DVC currently works forces me to create a new branch/tag for each trial I want to keep track of, and this seems a bit overwhelming.

Depending on how different the experiments I'm running are and their level of granularity I could decide how to keep track of them (new commits VS new branches/tags).

Notes:

  • The example above is overly simplified and there are better ways of tuning specific models parameters. But this gets more complicated if I'm changing more stuff (model hyperparameters, data processing, features to use, etc).
  • If dvc were to support what I'm proposing here, an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy.
@efiop
Copy link
Member

efiop commented Mar 7, 2019

@andrethrill Would you like to compare two specific commits or just the dynamics of your metrics changing across a range of commit? The latter one is probably more suitable for a graphical tool, like tensorboard or something. Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

@silverdna
Copy link
Author

Hi @efiop !

I'm aware of TensorBoard but that's not exactly what I was talking about.

I would like to have a way of running a few consecutive different experiments and see their metrics. Just like dvc metrics show -a currently does but without needing to create different branches. DVC seems like a good fit for this because I could checkout the experiment that gave me the best results and have everything version controlled (model, data, etc).

Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

If that were to be supported it would be great of course. But for what I'm talking about, just looking at the output in the same form as in dvc metrics show -a would be enough.

@efiop
Copy link
Member

efiop commented Mar 7, 2019

@andrethrill Ah, so something like dvc metrics show HEAD~10 to show metrics for 10 last commits on the current branch?

@silverdna
Copy link
Author

Exactly @efiop ! And/or some other nice variations of it: dvc metrics show HEAD~{commitHash} show metrics since commitHash on the current branch.

@efiop
Copy link
Member

efiop commented Mar 7, 2019

@andrethrill AFAIK HEAD~{commitHash} is not supported by git and it would be great to leave the syntax similar if not the same as in git :) But I get your point, there is probably a git-way to do that. Thanks for a great feature request!

@efiop efiop changed the title dvc metrics show across different commits metrics: show: support metrics across different commits Mar 7, 2019
@efiop efiop added the feature request Requesting a new feature label Mar 7, 2019
@silverdna
Copy link
Author

@efiop indeed, I was not thinking from git perspective. The syntax would have to be different :)

@brbarkley
Copy link
Contributor

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

Anyhow, I would find the ability to dvc metrics show only a specific tag or tags useful for some parts of my workflow.

@shcheklein shcheklein added the p1-important Important, aka current backlog of things to do label Jun 29, 2019
@shcheklein shcheklein changed the title metrics: show: support metrics across different commits support push/pull/metrics/gc, etc across different commits Jun 29, 2019
@shcheklein
Copy link
Member

Since the logic behind all the commands is similar it's probably make sense to implement it for all commands that support -T, -a options now.

@nik123
Copy link
Contributor

nik123 commented Jul 1, 2019

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

What about to dvc metrics show not just for arbitrary tag but for arbitrary commit? The same way git checkout command allows to checkout both arbitrary tag (git checkout <tag_id>) and commits (git checkout <commit_id>).

The syntax of course would be different from git. Something like dvc metrics show --id <tag/commit>. Additionally you can use the same syntax in other commands, like dvc push --id <tag/commit>.

This approach also solves the issue when you have several local commits and in each commit the single data file tracked by DVC has been overriden. Current implementation of dvc push only pushes last version of your data file from the last commit. The --id option will allow to push all the previous versions of your data by executing dvc push --id <commit> for all the previous commits.

@pared
Copy link
Contributor

pared commented Jul 10, 2019

@andrethrill @brbarkley @nik123

  1. I see that we might have here few smaller tasks. Ill try to identify them:
  • metrics show HEAD~X - which will show metrics from last X commits
  • metrics show --since {rev}- show metrics since rev
  • metrics show --ids={rev1},{rev2},{rev3} - like show -a but restricted to particular revs
    Do those options make sense?
  1. This is just for metrics, how would you guys see other operations, like push, pull, etc?
    a) I imagine, that for example in case of push, one might want to push all dependencies that has been binded to some git revision with stage files. Do you think it would be feasible to include some option for that? Like dvc push --all-revs? That could also be used for other ops, like pulling.
    b) Does points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe --all-revs for push would be enough?

@shcheklein
Copy link
Member

My 2 cents on this.

  1. Since -a and -T are symmetric across push/pull/gc/metrics-show, we should make a new option symmetric as well. Especially considering that it does not make implementation more complicated.
  2. Using commits is yet another way to manage "experiments". So, it makes sense to provide these options to all the commands that support -a, -T.
  3. CLI-interface-wise: would be great if we can keep a single option - --revisions? or something like that. Would really want to avoid introducing positions arguments, and a few options on top just to manage this case.
  4. (not scope of this issue, but can affect certain decisions) We'll need to introduce a filter on top -a, -T. Something like regexp to filter branch/tag names. Can we reuse this new option to simplify certain options in this case?

@pared
Copy link
Contributor

pared commented Jul 11, 2019

@shcheklein I agree that for simplicity it would be much better to implement --revisions but looking at original feature request, supporting something like dvc metric show HEAD~10 looks like desired approach too. I think that using revisions to compare last 10 commits results will be a headache for users. One will need to either tag all commits, or git log and copy-paste revs to the --revisions option.

EDIT: let me clarify that I am talking here specifically about show

@pared
Copy link
Contributor

pared commented Jul 11, 2019

I would also like to start another discussion about --revisions option.
As we discussed privately, we come to the conclusion, that probably the most convinient way for user to use --revsions would be to provide coma separeted revision ids, like:

--revisions {rev1},{rev2},{rev3} (note that providing revisions like: --revisions {rev1} {rev2} {rev3} is not viable options, since we would not know when to start parsing targets)

The problem with this approach is that coma is viable character to be included in branch name. So this edge case would break currently considered approach.

The other way to do that would be --revsion option (--revision {rev1} --revision {rev2} --revision {rev3})

I think we cannot expect users to name branches in a way that would be convinient for us, do you agree?

EDIT: as discussed with @Suor, we not necesarily need to use coma as separator, git forbids some characters in branch names, like colons.

@pared
Copy link
Contributor

pared commented Jul 11, 2019

Possible solution: require providing revisions after parsing targets, that would make parsing multiple targets and multiple revisions possible.
How that could look like:

https://asciinema.org/a/pSVHHQ17uQBwzN2v0BUI8VaSK

@Suor
Copy link
Contributor

Suor commented Jul 11, 2019

We are using rev already, why not --revs?

@pared
Copy link
Contributor

pared commented Jul 11, 2019

@Suor I agree, especially that its short and understandable.

@shcheklein
Copy link
Member

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

@pared
Copy link
Contributor

pared commented Jul 12, 2019

@shcheklein looking through documentation, I think closest example would be using refspec:
https://git-scm.com/book/en/v2/Git-Internals-The-Refspec

What do you mean by what do we want to address? Determining whether it is commit sha, branch or tag?

@shcheklein
Copy link
Member

Yep, either it's a commit, branch, tag or a list of those.

@pared
Copy link
Contributor

pared commented Jul 15, 2019

@shcheklein do we actually need to know what it is? AFAIK git checkout accepts any of those.

@pared
Copy link
Contributor

pared commented Jul 15, 2019

It seems to me, that we need to decide which way we go with implementation of this feature.

  1. I would leave discussion "Do we support dvc metrics show HEAD~10" for some other issue, as it is some particular use case that is not related to push/pull etc...

  2. How to implement data sync operations for few revisions, we discussed 3 approaches so far:

  • dvc pull -rev rev1 -rev rev2 -rev rev3 file.dvc
  • dvc pull file.dvc --revs rev1 rev2
  • dvc pull --revs=rev1:rev2:rev3 file.dvc

I think we should go with the last one, because its faster to use that first one, and does not introduce some strong assumption as do the second approach (I mean requiring passing revs after targets)

@efiop
Copy link
Member

efiop commented Jul 15, 2019

@pared 1 and 2 are tied together. metrics and pull/push/etc should have (if it is feasible) the same syntax for working with references. Unless we decide to redesign it of course, but I don't see the point of that just yet. I totally agree with you, that most would probably just want to have an ability to do something with last N commits or something, so we need to give that syntax a bit of though, which might actually change the approach with --revs.

We've discussed that the second approach (the one that is requiring passing revs after targets) is absolutely terrible, just forget about it 🙂

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

@shcheklein I agree with @pared , this is a terrible idea, git doesn't distinguish between those so neither should we, especially just to adopt some joining syntax. I would much rather go with --rev rev1 --rev2 rev2 and have it deterministic than invent comma joining syntax to be able to --revs rev1,rev2. Though, the : looks promising, if it is indeed forbidden by git to have tags/branches with those. That being said, using colons is not intuitive at all.

I see that we might have here few smaller tasks. Ill try to identify them:
metrics show HEAD~X - which will show metrics from last X commits
metrics show --since {rev}- show metrics since rev
metrics show --ids={rev1},{rev2},{rev3} - like show -a but restricted to particular revs
Do those options make sense?

@pared This makes a lot of sense to me from user perspective, but I would probably go with something like

metrics show --from-rev HEAD~X (it is implied that --to-rev is HEAD, same as any git command does)
metrics show --from-rev rev
metrics show --rev rev1 --rev rev2 --rev rev3

This is just for metrics, how would you guys see other operations, like push, pull, etc?
a) I imagine, that for example in case of push, one might want to push all dependencies that has been binded to some git revision with stage files. Do you think it would be feasible to include some option for that? Like dvc push --all-revs? That could also be used for other ops, like pulling.
b) Does points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe --all-revs for push would be enough?

I'm not sure --all-revs makes any sense in git world, since it feels like it would include detached heads and stuff, which is generally considered to be trash. That being said, being able to push/pull/etc all history seems very useful to me. If talking in git terms, one would maybe expect dvc push to push everything from the initial commit, but we didn't do that because it might be excessive because of the data size. But maybe users have another opinion? 🙂

@pared
Copy link
Contributor

pared commented Jul 15, 2019

@efiop by saying --all-revs I was thinking about current branch, probably the naming could be improved :)

@efiop efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 19, 2019
@yfarjoun
Copy link
Contributor

yfarjoun commented Dec 4, 2019

any update on this issue? I see it have been declared "important" but also removed from "In progress"....Would love to have this!

@dashohoxha
Copy link
Contributor

It seems to me that what the user wanted to accomplish (dvc metrics show accross different commits -- making small parameter changes and checking the metrics for these parameter values) can be implemented more easily and cleaner with directories for each experiment.

In general, let's say that the user has a table with parameters and their values. He can write a script that for each parameter values creates a new experiment directory and (re)produces the results. Then he stores on the table all the results (metrics), removes all the experiment directories (cleanup), and commits on Git this table that contains the parameter values and the corresponding results. This is much cleaner than making a small commit for each parameter value and considering each commit as an experiment.

Regarding the other idea of limiting the output of dvc metrics show -a -T with a range of commits, this might be useful in some cases.

@efiop
Copy link
Member

efiop commented Dec 5, 2019

@yfarjoun Sorry for such a huge delay. We've introduced required changes for internal brancher, as well as introduced non-official hidden --all-commits flag for gc(please don't rely on it, it is really in a beta mode for now). So changes for metrics and other commands should be not that far, yet they are not on this sprint. I'm bumping the priority to make this move faster. Thanks a lot for the feedback! 🙂

@efiop efiop added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important c5-half-a-day labels Dec 5, 2019
@efiop
Copy link
Member

efiop commented Dec 5, 2019

Btw, if anyone would be willing to give a shot contributing a patch for this, we will be happy to help 🙂

@yfarjoun
Copy link
Contributor

yfarjoun commented Dec 5, 2019

thanks for the update. no need to apologize, I just wanted to make sure you know that this is still a desired feature!

@charlesbaynham
Copy link
Contributor

To give a new user's perspective on the issue (talking about push/pull really rather than gc), I had assumed that dvc push was equivalent to git push: i.e. you make several local commits then push all of them to a remote. What @pared said basically:

By default, in git, if you want to save something in repo, you will commit it. You do 10 commits and then push. All 10 commits has been pushed. What dvc is currently allowing you to do, is to push current changes. So... we do 10 commits and push dependencies/outputs from only last one. I believe that default behaviour should be pushing all dependencies from all commits, that have not been yet pushed. That is the only way to make sure all commits are not broken, and not demanding from user to periodically making pushes on branch.

... so the actual behaviour of dvc push caught me by surprise initially. I understand that this is hard from a performance perspective, but from a data integrity point of view, I think it's an important option to have. Particularly for raw data which isn't reproducable from anywhere, the dvc cache is the only place where it exists: if you push in the wrong order then you can end up with lost data.

@shcheklein
Copy link
Member

And yet another confusing and missing option to push multiple commits I believe - iterative/dvc.org#1087 ... may be also make sense to have --all-commits.

@nik123
Copy link
Contributor

nik123 commented Mar 24, 2020

Is this feature still in plans?

I ended up with little workaround for pushing data among various commits. I simply added git hook at .git/hooks/pre-commit. So every time I commit something my data is syncrhonized. Here is my hook:

 #!/bin/sh

# 1. List files staged for commit (excluding deleted files)
# 2. Filter dvc files.
# 3. Push updated dvc files into remote
git diff --cached --name-only --diff-filter=d | egrep ".dvc$" | xargs --no-run-if-empty dvc push

Of course it noticeably increases time for each commit but it also solves my problem with data synchronization. I hope it would help someone else but me.

@skshetry skshetry added this to To do in DVC 24 March - 7 April 2020 via automation Mar 24, 2020
@efiop efiop self-assigned this Apr 2, 2020
efiop added a commit to efiop/dvc that referenced this issue Apr 3, 2020
efiop added a commit that referenced this issue Apr 4, 2020
@efiop efiop moved this from To do to Review in progress in DVC 24 March - 7 April 2020 Apr 6, 2020
@efiop efiop moved this from Review in progress to Done in DVC 24 March - 7 April 2020 Apr 6, 2020
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jul 21, 2020

Hi! Resurrecting this discussion 🧟 (per a support question related to deep learning: having to pick a winner from 500K epochs, and it's definitely not the last one):

Specifically on metrics diff commands, refer to #4211: dvc plots diff already accepts multiple revisions, so dvc metrics diff could also do so (and you can send it ranges of commits with something like git log --format:$h HEAD~10..).
But I'm guessing this will totally crash if I send it 500K SHAs... Plus you wouldn't even want to commit that many variations of an experiment (so this relates to run-cache as well)

But what about accepting standard Git commit ranges? (Both dit diff and git range-diff accept them, for different purposes.) And then print a summary with just some stats like mean, norm, max, min (configurable, perhaps).

Ivan mentioned we may want to avoid cryptic Git syntax in #1691 (comment), but I'm not sure why. We use Git as the underlying versioning engine so why not leverage more of it's features?

@Suor
Copy link
Contributor

Suor commented Jul 23, 2020

I don't think this issue is really related to that discussion. Epoch is not the result of the run, so there is no commit or model for each of those. In current terms it might be a datapoint in some plot or simply an intermediate state, which might be saved or not upon users wish.

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jul 24, 2020

I think you're right with respect to that particular user's support case. Still I think this idea is worth considering for some of our commands:

accepting standard Git commit ranges? ... And then print a summary with just some stats like mean, norm, max, min (configurable

@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Aug 23, 2020

p.s. add dvc exp diff per another support case.

@dberenbaum
Copy link
Contributor

Now that we have experiment flags like exp show --rev HEAD -n 10, I think this addresses the initial concern here. We could keep it open to consider whether to implement those flags in other commands like gc, but it might be better to open a new issue at this point than follow this whole discussion. Closing for now, but feel free to reopen or create a new issue if you think there's more to address here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request Requesting a new feature p1-important Important, aka current backlog of things to do research
Projects
No open projects
Development

No branches or pull requests