Cyclic dependencies #296
Comments
Hi @kskyten, thank you for your kind words about DVC. You have an interesting scenario, similar to auto-ML. We designed DVC with this kind of scenario in mind. Not all of our ideas were implemented in the first version (the currently released version); more scenarios are coming in the next version.
First, let's discuss how we can avoid your issue. Your current code creates cyclic dependencies with symlinks: data/res.txt version N --> data/res.txt version N-1 --> data/res.txt version N-2 --> ... --> the current data/cache file. So, you actually have the right symlink with the right version when you do
To avoid the issue you should not write to an existing symlink file directly. I'd read the data file first, delete the file and increment it later.
mkdir myrepo_old/
cd myrepo_old/
git init
dvc init
dvc run echo "0" --stdout data/res.txt
cat data/res.txt
# 0 expected
# A single increment
VAL=`cat data/res.txt`
dvc remove data/res.txt # Keep cache. So, you won't need dvc repro
dvc run echo $(($VAL + 1)) --stdout data/res.txt
cat data/res.txt
# 1 expected
# A batch increment
for num in {0..32}; do
    VAL=`cat data/res.txt`
    dvc remove data/res.txt
    dvc run echo $(($VAL + 1)) --stdout data/res.txt
done
cat data/res.txt
# 34 expected
# Git checkout works!
git checkout HEAD~12 -b alpha_optimum
dvc repro
cat data/res.txt
# 28 expected
I'd appreciate it if you could share more details. This will help us support more scenarios.
Thank you for using DVC.
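The advice above about not writing to an existing symlink directly can be illustrated without DVC at all. The following is a minimal sketch using only plain files and `ln -s`; the directory layout (`cache/v0`, `data/res.txt` under a temporary directory) is a made-up stand-in and is not how DVC actually organizes its cache. It shows that a redirect through a symlink overwrites the file the link points to, while the read / delete / write sequence leaves the old file untouched.

```shell
#!/bin/sh
# Illustration only: plain files and symlinks, no DVC involved.
# The cache/ and data/ layout here is hypothetical.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
mkdir "$work/cache" "$work/data"

# A "cached" version and a workspace symlink pointing at it.
echo 0 > "$work/cache/v0"
ln -s "$work/cache/v0" "$work/data/res.txt"

# Writing through the symlink overwrites the cached file itself.
echo 99 > "$work/data/res.txt"
through=$(cat "$work/cache/v0")

# Restore the cached value, then read / delete / write instead.
echo 0 > "$work/cache/v0"
val=$(cat "$work/data/res.txt")             # read the current value
rm "$work/data/res.txt"                     # removes only the link
echo $((val + 1)) > "$work/data/res.txt"    # creates a new regular file
cached=$(cat "$work/cache/v0")
current=$(cat "$work/data/res.txt")

echo "write-through: cache=$through"
echo "read-delete-write: cache=$cached data=$current"
```

In the second half the old version survives in the stand-in cache, which is the property that makes checking out an earlier state meaningful.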
Thanks for the great response. I now realize that my example was a bit naive. Unfortunately, I couldn't get your example working. I got the following error:
A typical use for Bayesian optimization in machine learning is to optimize the hyperparameters of a model (i.e., tune the parameters of your model to minimize the test set error). Here's some more information on the subject: Practical Bayesian Optimization of Machine Learning Algorithms. My specific use case is applying Bayesian optimization to simulator-based statistical models (BOLFI). These statistical models contain arbitrary simulator code, so encoding the statistical models as workflows with user-provided components would be great.
I am one of the developers of ELFI, which is a framework for likelihood-free inference. I think building a likelihood-free inference library on top of something like DVC would provide more flexibility and better provenance data, and would make collaboration between researchers easier.
I am also very excited about DataLad, which is similar to DVC but focuses more on data than on workflow. DataLad is built on top of git-annex, which was mentioned in another issue (#211). In my particular case, I would like to use DataLad for distributing the different components of the statistical models in a collaborative way. It also has a way of reproducing results similar to DVC, but the workflow is encoded in the git graph, so as far as I know it is not possible to change a component in the middle and re-run the same workflow. Ideally, I would like to create a new branch when a component is changed and then re-run the workflow. It would be awesome if it were possible to use DVC and DataLad interoperably.
I also looked at Pachyderm, which is quite interesting, but not entirely suitable for my purposes. Pachyderm defines the workflows explicitly. Using branches/tags for each iteration sounds like a good idea.
Bayesian optimization
Tools
It seems that you can use a git-lfs like interface in the latest version of git-annex with the correct configuration.
Parallel computing
Bayesian optimization. I think I got your scenario. Yes, a few separate branches (a branch per experiment) is probably a better fit for gradient-based methods. DVC can handle your linear history scenario and visualize it. It would be great if you could share more details; we are always happy to implement new features.
Tools. DVC supports Windows: we have a custom implementation of symlinks as well as hardlinks for Windows. Windows is one of risky part of
Anyway, the current internal DVC sync is just a file back-end which can be relatively easily replaced by
Your parallel computing scenario matches our vision with a few small additions: you should initiate the communication (run commands) from your local DVC (
And thank you for the new feature requests you've created and the feedback. We really appreciate it! Please feel free to share any other comments that you have. We are at an early stage, and good new ideas could easily be incorporated into the DVC DNA.
It seems like "Cyclic dependencies" was not the root cause of the issue. Please feel free to reopen if I missed something.
I hope you do not mind me chiming in with some clarifications on the previous discussion topics.
re @dmpetrov Windows is one of risky part of
re @dmpetrov git-annex that no natural: it works only with buckets (like mybucket) although a "bucket directory" (like mybucket/classifiers/cat-dog) is more practical
re it uses internal format (with compressing and internal names) although a more transparent format might be a better fit (keep sync files as is)
re What is the advantage of using hardlinks compared to symlinks? Some tools/browsers (file finder in OSX?) follow the symlinks, so in case of git-annex repositories you end up deep under
re if it were possible to use DVC and DataLad interoperably Well -- you could still use DataLad as a git/git-annex frontend while using dvc to track the workflow. I guess, similar to our
Hope this all comes useful/informative to some degree ;-) Cheers and keep up the good work!
Thank you @yarikoptic! I'm very sorry for the delay; this thread was lost since the issue was closed. Your clarifications are very helpful. Now (after a year) I agree about the Windows part as well as the AWS bucket part. However, I don't quite agree with the hardlink vs. symlink part but DVC supports CoW (by
I like the idea of using DataLad (Git annex) as file storage and DVC to track workflow. It might be helpful to chat more about DataLad and use cases. Please let me know if you have time for a chat. Thank you again for the feedback and the clarifications!
I will be happy to chat! And I guess other DataLad folks will be as well. We typically have our DataLad Jitsi meeting each Friday 9am EST (but not next Friday for me at least, traveling), https://meet.jit.si/DataLad . You are most welcome to join if time works for you, just let me know which date so I am there for sure -- don't want to miss it. Otherwise, we will find time I bet (the same Jitsi url would work any time).
@yarikoptic great! 9am EST is a bit early in my time zone.
Sorry that I wasn't complete - traveling the entire next week. But the week after, if your availability holds, Tuesday 15th should work, although probably not for the other half of the team - Germans (@mih?). Should we schedule preliminary for 1pm
Sure, 15th at 1pm (EST) works for me. Please let me know if anything changes.
Thanks for making dvc. It looks really interesting.
I'm trying to use Bayesian optimization to optimize the hyperparameters of a model. This creates a cyclic dependency in the workflow. Executing
dvc run python code.py
multiple times with the following minimal example updates the state of the file 'data/res.txt', but reverting back to an older commit does not revert the changes. I think adding cyclic dependencies would be a useful feature to have. What kind of changes would be needed to add it?