train a model on large dataset with gitlab-ci.yaml #20
Hey folks, if the data is pre-downloaded locally, might there be a way to check whether there have been changes to the data? As I understand it, CML is supposed to automate the ML pipeline, so downloading the data only when needed might be a way to save some time. Could this be achieved with some git-like commands in DVC?
👋 @lenaherrmann-dfki the thing is that CML helps you launch GPU runners across several vendors (AWS, Azure, and GCP). Those runners can be ephemeral, launched only when you need to train (to save costs), and they need to access that data. We are designing volumes to make this lightweight, but a big part of this responsibility resides in DVC.
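As a rough sketch of what that looks like in practice (the image tag, cloud, region, instance type, and label here are illustrative assumptions, not from this thread), launching an ephemeral GPU runner from GitLab CI with CML might be configured like this:

```yaml
# hypothetical .gitlab-ci.yml fragment -- adjust cloud/region/instance to your setup
launch-runner:
  image: iterativeai/cml:0-dvc2-base1
  script:
    # provision an ephemeral GPU machine and register it as a GitLab runner
    - cml runner launch
        --cloud=aws
        --cloud-region=us-west
        --cloud-type=g4dn.xlarge
        --labels=cml-gpu

train:
  tags: [cml-gpu]   # this job gets picked up by the ephemeral runner
  script:
    - dvc pull
    - dvc repro
```

Because the runner only exists for the duration of the job, any data it needs has to be fetched (or mounted) at job start, which is why the data-tracking side lives in DVC.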
@lenaherrmann-dfki you can track the data with DVC and a "local remote":

```shell
# setup
cd myrepo
# assuming `git init && dvc init` already done
cp -R /predownloaded/data .
dvc add ./data
git add data.dvc .gitignore
dvc remote add --local localcache /predownloaded/cache
dvc push -r localcache
git push
```

Now, in the future, you can:

```shell
cd myrepo
dvc remote add --local localcache /predownloaded/cache
dvc pull -r localcache
echo "new stuff" >> ./data/new
dvc add ./data
git add data.dvc
dvc push -r localcache
git push
```
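To address the original question about detecting changes before downloading: DVC can report whether the tracked data differs from what is stored, so a pull can be skipped when nothing changed. A minimal sketch, reusing the `localcache` remote name from above:

```shell
# check whether tracked files in the workspace differ from what DVC recorded
dvc status

# compare the local cache against the remote: has anything changed remotely?
dvc status -c -r localcache
```

If `dvc status -c` reports everything in sync, `dvc pull -r localcache` would be a no-op, so running it conditionally (or just always, since up-to-date objects are not re-downloaded) keeps the pipeline cheap.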
Hi,

First of all, thank you for the nice tools you have developed. I am trying to create an ML training workflow with GitLab, CML, and DVC, with MinIO storage as my remote backup where my training dataset is stored. My `.gitlab-ci.yaml` looks like this:

My setup is configured as follows:

My workflow is working and I am able to train my model on the runner and queue jobs, but I have the following issues with it (maybe there is a better way to do this, hence I am here asking for directions):

- How can I avoid running `dvc pull -r minio_data` every time, and instead reuse the same data between different training jobs? (Maybe mount volumes into the docker container?)
- How should I handle the `.gitlab-ci.yaml` file in case more than one person wants to use this workflow to queue their training jobs in a collaborative environment? What other options do I have?

Any feedback or suggestions would be appreciated. Thank you.
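For the first question, one common approach (assuming the runner or a shared cache persists between jobs; the key and paths below are illustrative, not from this thread) is to use GitLab CI's built-in `cache:` keyword to keep the DVC cache directory around between jobs, so `dvc pull` only downloads objects it does not already have:

```yaml
# hypothetical .gitlab-ci.yaml fragment -- cache key and paths are illustrative
train:
  cache:
    key: dvc-cache        # shared key so all training jobs reuse the same cache
    paths:
      - .dvc/cache        # DVC stores file contents here
  script:
    - dvc pull -r minio_data   # only fetches objects missing from .dvc/cache
    - dvc repro
```

Mounting a host volume into the runner's Docker container at `.dvc/cache` (or pointing `cache.dir` in the DVC config at a shared location) achieves the same effect when all jobs run on the same machine.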