support push/pull/metrics/gc, etc across different commits #1691

silverdna · 2019-03-06T15:04:32Z

Currently dvc metrics show can show metric values across different branches (-a) and different tags (-T).
Can you consider supporting showing different metric values across different commits in the same branch?

The background of this is (simplified example): say I'm currently training a model, where I'm changing a certain parameter, param1 (for instance, number of trees in a forest). The way I probably would like to work is to find a first value for param1, commit the current state, continue changing param1 and continue committing the successive states that I consider worth saving. At some point I would like to look back and identify the setup that gave me the best results.

The way DVC currently works forces me to create a new branch/tag for each trial I want to keep track of, and this seems a bit overwhelming.

Depending on how different the experiments I'm running are and their level of granularity I could decide how to keep track of them (new commits VS new branches/tags).

Notes:

The example above is overly simplified and there are better ways of tuning specific model parameters. But this gets more complicated if I'm changing more stuff (model hyperparameters, data processing, features to use, etc).
If dvc were to support what I'm proposing here, an extra argument would probably be required to limit how many commits DVC would look back at. Otherwise it would show all the metric values since the beginning of the repo history, which can be unhelpful and messy.

efiop · 2019-03-07T13:34:37Z

@andrethrill Would you like to compare two specific commits or just the dynamics of your metrics changing across a range of commits? The latter is probably more suitable for a graphical tool, like TensorBoard or something. Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

silverdna · 2019-03-07T14:30:32Z

Hi @efiop !

I'm aware of TensorBoard but that's not exactly what I was talking about.

I would like to have a way of running a few consecutive different experiments and see their metrics. Just like dvc metrics show -a currently does but without needing to create different branches. DVC seems like a good fit for this because I could checkout the experiment that gave me the best results and have everything version controlled (model, data, etc).

Or are you looking for a CLI way of doing that, using different filters (e.g. find max metric across N commits)?

If that were to be supported it would be great of course. But for what I'm talking about, just looking at the output in the same form as in dvc metrics show -a would be enough.

efiop · 2019-03-07T14:53:16Z

@andrethrill Ah, so something like dvc metrics show HEAD~10 to show metrics for 10 last commits on the current branch?

silverdna · 2019-03-07T15:04:01Z

Exactly @efiop ! And/or some other nice variations of it: dvc metrics show HEAD~{commitHash} to show metrics since commitHash on the current branch.

efiop · 2019-03-07T15:07:10Z

@andrethrill AFAIK HEAD~{commitHash} is not supported by git and it would be great to leave the syntax similar if not the same as in git :) But I get your point, there is probably a git-way to do that. Thanks for a great feature request!
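For readers following along: plain git can already list the commits that both forms refer to. A minimal sketch, assuming a regular git repository; revs_last_n and revs_since are hypothetical helper names, and nothing here implies that any dvc command accepts this output:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; only plain git is used.

# Print the hashes of the last N commits reachable from HEAD, newest first.
revs_last_n() {
    git rev-list --max-count="$1" HEAD
}

# Print the hashes of the commits made after REV on the current branch
# (REV itself is excluded).
revs_since() {
    git rev-list "$1"..HEAD
}
```

Inside a repository, revs_since followed by a commit hash prints the commits made after that commit, which is roughly what HEAD~{commitHash} was meant to express.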

silverdna · 2019-03-07T15:13:24Z

@efiop indeed, I was not thinking from git perspective. The syntax would have to be different :)

brbarkley · 2019-05-21T14:39:06Z

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

Anyhow, I would find the ability to dvc metrics show only a specific tag or tags useful for some parts of my workflow.

shcheklein · 2019-06-29T21:06:54Z

Since the logic behind all the commands is similar, it probably makes sense to implement it for all commands that currently support the -T and -a options.

nik123 · 2019-07-01T03:52:28Z

@andrethrill @efiop It seems the ability to dvc metrics show only specific tag(s) (instead of either the current HEAD only or all tags) might be one feasible way to engineer this feature request. Of course, you would have to create a tag for each of the commits you would like to show.

What about letting dvc metrics show work not just for an arbitrary tag but for an arbitrary commit? The same way the git checkout command allows checking out both arbitrary tags (git checkout <tag_id>) and commits (git checkout <commit_id>).

The syntax of course would be different from git. Something like dvc metrics show --id <tag/commit>. Additionally you can use the same syntax in other commands, like dvc push --id <tag/commit>.

This approach also solves the issue where you have several local commits and in each commit the single data file tracked by DVC has been overwritten. The current implementation of dvc push only pushes the last version of your data file, from the last commit. The --id option would allow pushing all the previous versions of your data by executing dvc push --id <commit> for each of the previous commits.

pared · 2019-07-10T10:23:10Z

@andrethrill @brbarkley @nik123

I see that we might have a few smaller tasks here. I'll try to identify them:

metrics show HEAD~X - show metrics from the last X commits
metrics show --since {rev} - show metrics since rev
metrics show --ids={rev1},{rev2},{rev3} - like show -a, but restricted to particular revs
Do those options make sense?

This is just for metrics, how would you guys see other operations, like push, pull, etc?
a) I imagine that, for example in the case of push, one might want to push all dependencies that have been bound to some git revision by stage files. Do you think it would be feasible to include some option for that? Like dvc push --all-revs? That could also be used for other ops, like pulling.
b) Do the points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe --all-revs for push would be enough?

shcheklein · 2019-07-11T00:03:17Z

My 2 cents on this.

Since -a and -T are symmetric across push/pull/gc/metrics-show, we should make a new option symmetric as well. Especially considering that it does not make implementation more complicated.
Using commits is yet another way to manage "experiments". So, it makes sense to provide these options to all the commands that support -a, -T.
CLI-interface-wise: it would be great if we could keep a single option, --revisions or something like that. I would really want to avoid introducing positional arguments, and a few options on top, just to manage this case.
(not in the scope of this issue, but can affect certain decisions) We'll need to introduce a filter on top of -a, -T. Something like a regexp to filter branch/tag names. Can we reuse this new option to simplify certain options in this case?

pared · 2019-07-11T09:44:09Z

@shcheklein I agree that for simplicity it would be much better to implement --revisions, but looking at the original feature request, supporting something like dvc metrics show HEAD~10 looks like a desired approach too. I think that using --revisions to compare the results of the last 10 commits will be a headache for users. One will need to either tag all commits, or run git log and copy-paste revs into the --revisions option.

EDIT: let me clarify that I am talking here specifically about show

pared · 2019-07-11T10:16:18Z

I would also like to start another discussion about --revisions option.
As we discussed privately, we came to the conclusion that probably the most convenient way for the user to use --revisions would be to provide comma-separated revision ids, like:

--revisions {rev1},{rev2},{rev3} (note that providing revisions like --revisions {rev1} {rev2} {rev3} is not a viable option, since we would not know when to start parsing targets)

The problem with this approach is that a comma is a valid character in a branch name. So this edge case would break the currently considered approach.

The other way to do that would be a repeatable --revision option (--revision {rev1} --revision {rev2} --revision {rev3})

I think we cannot expect users to name branches in a way that would be convenient for us, do you agree?

EDIT: as discussed with @Suor, we do not necessarily need to use a comma as the separator; git forbids some characters in branch names, like colons.
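Whether a separator can appear in a branch name can be checked with git itself. A minimal sketch; ref_name_ok and split_revs are hypothetical helpers, and the colon-joined form is only the proposal under discussion here, not an existing DVC option:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; only plain git and tr are used.

# Succeed if git accepts NAME as a branch name (checked as refs/heads/NAME).
ref_name_ok() {
    git check-ref-format "refs/heads/$1"
}

# Split a colon-joined revision list into one revision per line.
split_revs() {
    printf '%s\n' "$1" | tr ':' '\n'
}

# A comma passes the check, a colon does not.
ref_name_ok 'exp,1' && echo "comma is allowed in a branch name"
ref_name_ok 'exp:1' || echo "colon is rejected in a branch name"
```

Because a colon can never be part of a branch or tag name, splitting on it cannot cut a valid name in half, which is not true for a comma.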

pared · 2019-07-11T10:43:51Z

Possible solution: require revisions to be provided after the targets; that would make parsing multiple targets and multiple revisions possible.
Here is how that could look:

https://asciinema.org/a/pSVHHQ17uQBwzN2v0BUI8VaSK

Suor · 2019-07-11T12:01:20Z

We are using rev already, why not --revs?

pared · 2019-07-11T12:59:37Z

@Suor I agree, especially since it's short and understandable.

shcheklein · 2019-07-11T17:52:09Z

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

pared · 2019-07-12T07:43:17Z

@shcheklein looking through documentation, I think closest example would be using refspec:
https://git-scm.com/book/en/v2/Git-Internals-The-Refspec

What do you mean by what do we want to address? Determining whether it is commit sha, branch or tag?

shcheklein · 2019-07-13T01:21:50Z

Yep, either it's a commit, branch, tag or a list of those.

pared · 2019-07-15T09:50:45Z

@shcheklein do we actually need to know what it is? AFAIK git checkout accepts any of those.

pared · 2019-07-15T10:16:18Z

It seems to me, that we need to decide which way we go with implementation of this feature.

1. I would leave the discussion of "Do we support dvc metrics show HEAD~10" for some other issue, as it is a particular use case that is not related to push/pull etc.
2. How to implement data sync operations for a few revisions. We have discussed 3 approaches so far:

dvc pull --rev rev1 --rev rev2 --rev rev3 file.dvc
dvc pull file.dvc --revs rev1 rev2
dvc pull --revs=rev1:rev2:rev3 file.dvc

I think we should go with the last one, because it's faster to use than the first one and does not introduce a strong assumption as the second approach does (I mean requiring revs to be passed after targets).

efiop · 2019-07-15T10:42:50Z

@pared 1 and 2 are tied together. metrics and pull/push/etc should have (if it is feasible) the same syntax for working with references. Unless we decide to redesign it of course, but I don't see the point of that just yet. I totally agree with you that most users would probably just want the ability to do something with the last N commits or so, so we need to give that syntax a bit of thought, which might actually change the approach with --revs.

We've discussed that the second approach (the one that is requiring passing revs after targets) is absolutely terrible, just forget about it 🙂

My thinking was - is it possible to derive from the string that is passed to --revs what exactly do we want to address? - like if it's a comma separated list of git hashes then we work with ids, etc . Is there an example in Git cli where it expects a list of revs?

@shcheklein I agree with @pared , this is a terrible idea; git doesn't distinguish between those, so neither should we, especially just to adopt some joining syntax. I would much rather go with --rev rev1 --rev rev2 and have it deterministic than invent a comma-joining syntax to be able to write --revs rev1,rev2. Though the : looks promising, if git does indeed forbid tags/branches containing it. That being said, using colons is not intuitive at all.

I see that we might have a few smaller tasks here. I'll try to identify them:
metrics show HEAD~X - show metrics from the last X commits
metrics show --since {rev} - show metrics since rev
metrics show --ids={rev1},{rev2},{rev3} - like show -a, but restricted to particular revs
Do those options make sense?

@pared This makes a lot of sense to me from user perspective, but I would probably go with something like

metrics show --from-rev HEAD~X (it is implied that --to-rev is HEAD, same as any git command does)
metrics show --from-rev rev
metrics show --rev rev1 --rev rev2 --rev rev3
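To illustrate why repeating --rev is unambiguous: each revision is introduced by its own flag, so targets can appear anywhere on the command line. A minimal sketch of such parsing; parse_revs is a hypothetical helper written for this illustration and is not DVC's actual argument parser:

```shell
#!/bin/sh
# Hypothetical helper for illustration only; not part of DVC.

# Separate repeated "--rev VALUE" options from positional targets and
# print both lists on one line.
parse_revs() {
    revs=""
    targets=""
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --rev)
                # The word right after --rev is always a revision.
                if [ "$#" -lt 2 ]; then
                    echo "--rev needs a value" >&2
                    return 1
                fi
                revs="${revs:+$revs }$2"
                shift 2
                ;;
            *)
                # Anything else is treated as a target.
                targets="${targets:+$targets }$1"
                shift
                ;;
        esac
    done
    printf 'revs=%s targets=%s\n' "$revs" "$targets"
}
```

For example, parse_revs --rev rev1 file.dvc --rev rev2 collects rev1 and rev2 as revisions and file.dvc as the only target, regardless of the order in which they were given.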

This is just for metrics, how would you guys see other operations, like push, pull, etc?
a) I imagine that, for example in the case of push, one might want to push all dependencies that have been bound to some git revision by stage files. Do you think it would be feasible to include some option for that? Like dvc push --all-revs? That could also be used for other ops, like pulling.
b) Do the points from 1. make sense for operations like push, pull etc? Have you ever needed to pull for some range of commits/tags/branches? Or maybe --all-revs for push would be enough?

I'm not sure --all-revs makes any sense in git world, since it feels like it would include detached heads and stuff, which is generally considered to be trash. That being said, being able to push/pull/etc all history seems very useful to me. If talking in git terms, one would maybe expect dvc push to push everything from the initial commit, but we didn't do that because it might be excessive because of the data size. But maybe users have another opinion? 🙂

pared · 2019-07-15T10:46:27Z

@efiop by saying --all-revs I was thinking about current branch, probably the naming could be improved :)

yfarjoun · 2019-12-04T16:42:55Z

Any update on this issue? I see it has been declared "important" but also removed from "In progress"... Would love to have this!

dashohoxha · 2019-12-05T09:30:25Z

It seems to me that what the user wanted to accomplish (dvc metrics show across different commits, i.e. making small parameter changes and checking the metrics for these parameter values) can be implemented more easily and more cleanly with a directory for each experiment.

In general, let's say that the user has a table with parameters and their values. He can write a script that, for each set of parameter values, creates a new experiment directory and (re)produces the results. Then he stores all the results (metrics) in the table, removes all the experiment directories (cleanup), and commits to Git this table that contains the parameter values and the corresponding results. This is much cleaner than making a small commit for each parameter value and considering each commit as an experiment.

Regarding the other idea of limiting the output of dvc metrics show -a -T with a range of commits, this might be useful in some cases.

efiop · 2019-12-05T17:38:45Z

@yfarjoun Sorry for such a huge delay. We've introduced the required changes for the internal brancher, as well as a non-official hidden --all-commits flag for gc (please don't rely on it, it is really in a beta mode for now). So changes for metrics and other commands should not be that far off, yet they are not in this sprint. I'm bumping the priority to make this move faster. Thanks a lot for the feedback! 🙂

efiop · 2019-12-05T17:39:43Z

Btw, if anyone would be willing to take a shot at contributing a patch for this, we will be happy to help 🙂

yfarjoun · 2019-12-05T19:07:19Z

thanks for the update. no need to apologize, I just wanted to make sure you know that this is still a desired feature!

charlesbaynham · 2020-02-18T13:39:54Z

To give a new user's perspective on the issue (talking about push/pull really rather than gc), I had assumed that dvc push was equivalent to git push: i.e. you make several local commits then push all of them to a remote. What @pared said basically:

By default, in git, if you want to save something in a repo, you commit it. You do 10 commits and then push. All 10 commits have been pushed. What dvc currently allows you to do is push the current changes. So... we do 10 commits and push dependencies/outputs from only the last one. I believe that the default behaviour should be pushing all dependencies from all commits that have not yet been pushed. That is the only way to make sure no commit is broken, without demanding that the user periodically push on the branch.

... so the actual behaviour of dvc push caught me by surprise initially. I understand that this is hard from a performance perspective, but from a data integrity point of view, I think it's an important option to have. Particularly for raw data which isn't reproducible from anywhere, the dvc cache is the only place where it exists: if you push in the wrong order then you can end up with lost data.

shcheklein · 2020-03-24T02:26:26Z

And yet another case, I believe, of the confusing and missing option to push multiple commits: iterative/dvc.org#1087 ... maybe it also makes sense to have --all-commits.

nik123 · 2020-03-24T11:52:23Z

Is this feature still in plans?

I ended up with a little workaround for pushing data across various commits. I simply added a git hook at .git/hooks/pre-commit, so every time I commit something my data is synchronized. Here is my hook:

#!/bin/sh

# 1. List files staged for commit (excluding deleted files).
# 2. Keep only .dvc files.
# 3. Push the data tracked by those .dvc files to the remote.
# Note: --no-run-if-empty is a GNU xargs option.
git diff --cached --name-only --diff-filter=d | grep -E '\.dvc$' | xargs --no-run-if-empty dvc push

Of course it noticeably increases the time for each commit, but it also solves my problem with data synchronization. I hope it helps someone besides me.

jorgeorpinel · 2020-07-21T19:11:18Z

Hi! Resurrecting this discussion 🧟 (per a support question related to deep learning: having to pick a winner from 500K epochs, and it's definitely not the last one):

Specifically on metrics diff commands, refer to #4211: dvc plots diff already accepts multiple revisions, so dvc metrics diff could also do so (and you can send it ranges of commits with something like git log --format=%h HEAD~10..).
But I'm guessing this will totally crash if I send it 500K SHAs... Plus you wouldn't even want to commit that many variations of an experiment (so this relates to run-cache as well)

But what about accepting standard Git commit ranges? (Both git diff and git range-diff accept them, for different purposes.) And then print a summary with just some stats like mean, norm, max, min (configurable, perhaps).

Ivan mentioned we may want to avoid cryptic Git syntax in #1691 (comment), but I'm not sure why. We use Git as the underlying versioning engine, so why not leverage more of its features?

Suor · 2020-07-23T07:53:28Z

I don't think this issue is really related to that discussion. An epoch is not the result of a run, so there is no commit or model for each of those. In current terms it might be a datapoint in some plot or simply an intermediate state, which might be saved or not as the user wishes.

jorgeorpinel · 2020-07-24T15:34:36Z

I think you're right with respect to that particular user's support case. Still I think this idea is worth considering for some of our commands:

accepting standard Git commit ranges? ... And then print a summary with just some stats like mean, norm, max, min (configurable, perhaps).

jorgeorpinel · 2020-08-23T01:48:44Z

p.s. add dvc exp diff per another support case.

dberenbaum · 2022-02-18T22:26:06Z

Now that we have experiment flags like exp show --rev HEAD -n 10, I think this addresses the initial concern here. We could keep it open to consider whether to implement those flags in other commands like gc, but it might be better to open a new issue at this point than follow this whole discussion. Closing for now, but feel free to reopen or create a new issue if you think there's more to address here.

efiop changed the title ~~dvc metrics show across different commits~~ metrics: show: support metrics across different commits Mar 7, 2019

efiop added the feature request Requesting a new feature label Mar 7, 2019

shcheklein added the p1-important Important, aka current backlog of things to do label Jun 29, 2019

shcheklein changed the title ~~metrics: show: support metrics across different commits~~ support push/pull/metrics/gc, etc across different commits Jun 29, 2019

efiop assigned pared Jul 9, 2019

efiop added research c13-half-a-week labels Jul 9, 2019

efiop added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Aug 19, 2019

shcheklein mentioned this issue Dec 5, 2019

User Guide / Manage Experiments iterative/dvc.org#828

Closed

efiop added p1-important Important, aka current backlog of things to do and removed p2-medium Medium priority, should be done, but less important c5-half-a-day labels Dec 5, 2019

shcheklein mentioned this issue Jan 29, 2020

[WIP] gc: change option to use single remove flags #3207

Closed

3 tasks

shcheklein mentioned this issue Mar 24, 2020

Default push behaviour not analogous to git. iterative/dvc.org#1087

Closed

skshetry added this to To do in DVC 24 March - 7 April 2020 via automation Mar 24, 2020

efiop self-assigned this Apr 2, 2020

efiop added a commit to efiop/dvc that referenced this issue Apr 3, 2020

push/pull/status/metrics: add support for --all-commits

3bc7759

First steps for iterative#1691

efiop mentioned this issue Apr 3, 2020

push/pull/status/metrics: add support for --all-commits #3587

Merged

3 tasks

efiop added a commit that referenced this issue Apr 4, 2020

push/pull/status/metrics: add support for --all-commits (#3587)

7bd4733

First steps for #1691

efiop moved this from To do to Review in progress in DVC 24 March - 7 April 2020 Apr 6, 2020

efiop moved this from Review in progress to Done in DVC 24 March - 7 April 2020 Apr 6, 2020

jorgeorpinel mentioned this issue Jul 21, 2020

Back-fill metrics #4210

Closed

dberenbaum closed this as completed Feb 18, 2022
