User Guide / Manage Experiments #828
@dashohoxha, the "manage experiments by directories" image could be improved a lot by using more meaningful names. What I've seen is that users create directories to try different hypotheses, not exactly different versions 🤔
@MrOutis I have used dummy names just to give an idea of what it should look like (and also to show that sub-experiments can be used as well, that is, experiments that are based on other experiments). However, I completely agree with you. If you (or someone else) could provide some more realistic example names for directories and subdirectories, I can fix that image.
I am aware that it is not complete or perfect, especially the pages on directory management and combined management. However, this is the best that I could do. If someone has any ideas for improving it further, that would be great.
It looks great, @dashohoxha ! A few comments from the first pass I've done:
I'll read it more carefully today.
Unfortunately this is not easy with the tool that I am using (umlet).
I used "Manage Experiments" as a shorter version of "How To Manage Experiments". It seems smoother to read than "Managing Experiments". However, I agree with your consistency argument.
This seems like a case for tags. If you do a single commit for each experiment, then why not put a tag on each commit? It is true that you don't need the tags to switch back to a previous commit (you can look at the log and find the commit ID), but it is easier with tags. Besides, the command |
Nice read. I also really like the last points for Branches and Tags. Moving the best experiment to master is definitely not a beginner's use case.
```dvc
$ git checkout bigrams
$ git branch -c master old-master
```
Add `branch`, `diff`, and the other commands that are not highlighted to the list here:
keyword: |
What are `-c` and `-C` for?
I couldn't follow all the branching going on 😅 It would be great if you could add some comments.
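For reference, a minimal sketch of what the two flags do, run in a throwaway repository. The commit messages and the `-C` call are illustrative assumptions, not taken from the page under review; `git branch -c` needs a reasonably recent Git.

```shell
set -e
# Throwaway repository standing in for the project (hypothetical setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
git commit -q --allow-empty -m "Baseline experiment"
git checkout -q -b bigrams
git commit -q --allow-empty -m "Bigrams experiment"

# -c (--copy): create old-master as a copy of master.
# It refuses to run if old-master already exists.
git branch -c master old-master

# -C (--copy --force): copy bigrams over master even though master exists.
# This overwrites where master points, so it is the destructive step.
git branch -C bigrams master

git log --oneline --decorate --all
```

After this, `old-master` keeps the previous state of `master`, and `master` points at the same commit as `bigrams`.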
$ git add .
$ git commit -m 'Baseline experiment'
$ dvc commit
$ git tag baseline-experiment
This is a lightweight tag. Why do we use it here, as opposed to the regular one below? Are you sure that `metrics show` works well with it?
This one is a "normal" tag, the other one is an annotated tag. It is just an example.
As far as I know, `metrics show` works with all the tags.
It's not even about what can be called normal. It's more about why we use a lightweight tag here and an annotated one below. If there are reasons, then an explanation is required of when to use which type.
I don't think there is any reason. They are both tags and both of them can be used.
So, why use both? Let's simplify. There are some guidelines behind those, though. I think people tend to use lightweight tags for things that they won't be sharing or saving. Maybe we can use that difference as a guideline? I'm not sure that we need to put in so many details. I would probably just keep the simple version in both cases if it works fine.
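To make the difference concrete, here is a small sketch in a throwaway repository. The tag name `bigrams-experiment` and the commit messages are assumptions for illustration; only `baseline-experiment` comes from the diff above.

```shell
set -e
# Throwaway repository (hypothetical setup, not the project itself).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m "Baseline experiment"
# Lightweight tag: just a name that points directly at the commit.
git tag baseline-experiment

git commit -q --allow-empty -m "Bigrams experiment"
# Annotated tag: a separate tag object with its own message, tagger and date.
git tag -a bigrams-experiment -m "Bigrams experiment"

# The object type behind each tag name shows the difference.
light_type=$(git cat-file -t baseline-experiment)
annotated_type=$(git cat-file -t bigrams-experiment)
echo "baseline-experiment -> $light_type"
echo "bigrams-experiment  -> $annotated_type"

# Both kinds of tag can be checked out in the same way.
git checkout -q baseline-experiment
```

Both names resolve to a commit on checkout, so for switching between experiments either kind works; the annotated one only adds metadata.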
```dvc
$ git checkout baseline-experiment
$ dvc checkout
$ dvc repro evaluate.dvc
```
Why do we need to run `repro` here?
Maybe to make sure that `dvc checkout` has retrieved all the correct data? I don't know.
Then `dvc status` is better and makes more sense?
You are right, `dvc status` would make more sense.
🤝
$ git diff --cached
$ git revert --continue

# delete the old tag and add it to the current version
Please split it into two blocks.
Move the comments out of the code and turn them into human-readable explanations of the blocks.
### Move the best experiment on top

Let's say that `baseline-experiment` has the best performance and we want to
Those are destructive, right? Let's put a note with ! emoji.
What do you mean by "destructive"?
Yep, right. I probably meant to put this comment on the branches way of doing experiments.
So, how about the branches part? In the place where we do a force push and so on, let's put a comment there?
```dvc
$ git tag
$ git log --oneline
```
Show the output here?
I don't think it will help to improve the explanation in this case.
It would improve it for me, as I was reading it for the first time. Reading it a second time, I don't even understand why we need `git log --oneline` everywhere; you don't say a word about it. By the way, I meant the output of `git tag`:
$ git tag
baseline
bigrams
It would be easier to read, and it is what the previous paragraph is about.
Both `git tag` and `git log --oneline` are ways to check what tags are available. Unlike `git tag`, `git log --oneline` also shows the position of the tags in the commit history and their order: which one is first and which one is last.
This explanation makes it 10x better! You see, you can do it :) And I'm not kidding: I didn't know the difference between those with respect to tags.
I still have some questions though. Would the log include all commits? If you have a lot of commits (a very common thing in a real project), the tags would get lost.
Is there an option for the `git tag` command to sort its output?
I would still prefer to simplify it, tbh, and to show some output; it will make it much easier to understand and read. At least you have one reader who is saying that it would be easier to read. Don't rely only on your own opinion here.
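On the two questions above, a sketch that may help: `git tag` accepts `--sort=<key>`, and `git log --no-walk --tags` lists only the tagged commits, so untagged commits do not drown them out. The repository, tag names and dates below are made up for illustration.

```shell
set -e
# Throwaway repository with three commits, two of them tagged (hypothetical).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.com"

GIT_COMMITTER_DATE="2019-01-01T12:00:00" git commit -q --allow-empty -m "Unigrams experiment"
git tag unigrams
GIT_COMMITTER_DATE="2019-01-02T12:00:00" git commit -q --allow-empty -m "Tuning, not tagged"
GIT_COMMITTER_DATE="2019-01-03T12:00:00" git commit -q --allow-empty -m "Bigrams experiment"
git tag bigrams

# Default output is sorted by name.
by_name=$(git tag)
# Sorted by creation date, oldest first (use -creatordate for newest first).
by_date=$(git tag --sort=creatordate)
# Only the commits that tags point at, without walking the rest of the history.
tagged_only=$(git log --oneline --no-walk --tags)

printf 'by name:\n%s\n\nby date:\n%s\n\ntagged commits:\n%s\n' \
  "$by_name" "$by_date" "$tagged_only"
```

With this, the docs could show one short, ordered listing of experiments instead of the full `git log --oneline` output.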
which is based on the first one, often it is as easy as:

```dvc
$ cp --reflink -R experiment1/ experiment2/
```
Reflinks are not supported on the majority of systems. So it's not that easy to copy the directory without hitting performance problems from copying the data. We should think a bit more about what to do here.
> reflinks are not supported on majority of systems

"reflink" is a filesystem feature, and all the systems support filesystems with reflink. For example, in Linux there are XFS, Btrfs, ZFS etc. The option "--reflink" here is a hint or a reminder that they need to use a filesystem that supports reflinks (like XFS), which they should be doing anyway from the start of the project.
See my bigger comment in the review. We can't rely on XFS etc. For now, 80-90% of systems are not configured to use them by default, and that won't change anytime soon. So, if we want to provide a meaningful solution to experiment management, we should at least mention something about data, code and other things.
> 80-90% of systems are not configured to use them by default

That's right, but they are not configured to use DVC by default either. Installing and using XFS is actually much easier than installing and using DVC.
What I am trying to say is that if someone is working on a data project, he can and should install and use XFS (or some other deduplicating filesystem). This is a basic requirement.
That's very, very far from reality. There are a lot of cases (the majority?) where you don't have a choice. Your argument applies only to freelancers and students.
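One middle ground that could be mentioned, assuming GNU `cp`: with `--reflink=auto` the copy is a cheap clone where the filesystem supports it and silently falls back to a regular copy elsewhere, whereas plain `--reflink` fails on filesystems without reflink support. It does not remove the cost of a full copy on those systems. The directory contents below are placeholders.

```shell
set -e
# Scratch area with a stand-in for the first experiment (hypothetical contents).
work=$(mktemp -d)
cd "$work"
mkdir experiment1
echo "data" > experiment1/data.txt

# Clone when the filesystem supports it, otherwise fall back to a normal copy.
cp --reflink=auto -R experiment1 experiment2

ls experiment2
```

On a filesystem without reflinks this still duplicates the data, so the advice about cleaning large outputs before copying remains relevant.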
- [How to Manage Experiments by Tags](/doc/user-guide/experiments/tags)
- [How to Manage Experiments by Branches](/doc/user-guide/experiments/branches)
- [How to Manage Experiments by Directories](/doc/user-guide/experiments/dirs)
- [How to Manage Experiments by Several Methods](/doc/user-guide/experiments/mixed)
"by a Mix of Methods"? "Several Methods" sounds a bit strange.
I don't actually mind, choose the wording that seems right to you.
@jorgeorpinel what's your take on this and on the one in #828 (comment)? What's the best way to word it?
Maybe **How to Manage Experiments with a Mix of Methods**, or use the word "Combination", as in that doc's intro.
@@ -0,0 +1,26 @@
# How to Manage Experiments by Several Methods
Mix Methods to Manage Experiments?
That's fine too.
👍
Maybe **How to Manage Experiments with a Mix of Methods**, or use the word "Combination", as in the intro paragraph.
## How it works

If we have a directory named `experiment1/` which contains the pipeline of the
Besides the pipeline, what else does this directory contain? What about code? Do we copy the whole project?
I think this is too project-specific, and it is impossible to explain such details in a section like this.
The best that can be done is to have a concrete example somewhere else and to link to it from the examples section.
There are a few recommendations we can make: clean the data before copying? Use a simple JSON config? Use DVC-files that are set up properly to be relative?
Those ^^ are general problems. Without at least some hints, the section is not very actionable. People who can come up with a solution to those problems probably don't even need a section like this (and I can send you a few names).
> there are a few recommendations we can make - clean data before copying? use a simple JSON conf? use DVC-files that are setup properly to be relative.

All these seem common sense to me. Anyway, I don't see how they can be explained without having an example as a reference.
Maybe someone can give it a try and let's see whether they make sense.
Since there is no understanding on our end of how to give guidance, let's remove this page for now, wait until the ticket on DVC core is resolved, and wait for someone to take #159.
> All these seem common sense to me.

If they are common sense, why would it be so hard to explain or mention them?
It looks good. The directory image is not a blocker, so disregard that comment. Commits: agreed, we can update it later when we have some interface ready.
Did a full-pass review and made some minor modifications. Mostly minor comments that should be easy to address. The one major concern is the logic behind dirs experiments: it makes sense at a high level but can be tricky to achieve in reality, or at least some recommendations should be put in that section:
- reflinks are not supported on a majority of systems. We should do something with data on those systems. Clean before copying the dir?
- code: how do we copy and/or modify it? Or do we copy only some JSON config? (which is a good solution that should be mentioned!)
- pipeline: what are the considerations/requirements to make it copyable?
Regarding the "dirs" page, I agree that in practice there are certain "tricks" that are needed to make it work, but I don't see how they can be explained in a meaningful way without a concrete example.
By the way, the new command |
Partially reviewed it, left some comments.
```dvc
$ git commit -am 'Evaluate'
$ dvc commit # just to make sure all the data is committed
$ git checkout -b unigrams
```
Why not `git branch unigrams`?
Actually, `git branch` is pretty new and we don't use it in the docs. But we do use `git checkout` a lot. I would open a separate terminology-focused issue to decide whether to change all instances or not.
```dvc
$ git checkout unigrams
$ dvc checkout
```
Why not `git checkout -b bigrams unigrams` (go to the bigrams experiment, based on unigrams)?
Would this depend on your previous suggestion, #828 (comment)? Perhaps you can replace these 2 comments with a single one covering the full thing? 🙂
per #828 (comment) Co-Authored-By: Mr. Outis <mroutis@protonmail.com>
Closing this as stale.
Close #816, Close #159