
train a model on large dataset with gitlab-ci.yaml #20

Open
niqbal996 opened this issue Jul 15, 2021 · 4 comments
Labels
question Further information is requested

Comments

@niqbal996

Hi,

First of all, thank you for the nice tools you have developed. I am trying to create an ML training workflow with GitLab, CML and DVC, with MinIO storage as the remote where my training dataset is stored. My .gitlab-ci.yaml looks like this:

stages:
  - cml_run
cml:
  stage: cml_run
  image: dvcorg/cml:0-dvc2-base1-gpu

  script:
  - echo 'Hi from CML' >> report.md
  - apt-get update && apt-get install -y python3-opencv
  - pip3 install -r requirements.txt
  - dvc remote add -d minio_data s3://bucket/dataset/
  - dvc remote modify minio_data endpointurl http://<MINIOSERVER_IP_ADDRESS>:9000
  - dvc remote modify minio_data use_ssl False
  - export AWS_ACCESS_KEY_ID="xxxxxxx"
  - export AWS_SECRET_ACCESS_KEY="xxxxxxx"
  - dvc pull -r minio_data
  - python main.py
  - cml-send-comment report.md --repo=https://<my_gitlab_repo_url>

My setup is configured as follows:

  • A GitLab self-hosted runner listening for jobs (works; Ubuntu 20.04, 2 x RTX 3070 GPUs).
  • An S3-compatible MinIO storage server configured as a local DVC remote (works with my credentials).
  • A training script (works).

My workflow is working and I am able to train my model on the runner and queue jobs, but I have the following issues (maybe there is a better way to do this, hence I am here asking for directions):

  1. For each training job, the entire dataset is pulled from the remote and then the model is trained. This is really slow. It is my requirement to keep using DVC for data versioning, but is there a way to bypass the dataset pull (dvc pull -r minio_data) every time and reuse the same data between different training jobs? (Maybe mount volumes into the Docker container?)
  2. For MinIO authentication, I do not want to put my credentials such as AWS_SECRET_ACCESS_KEY in the .gitlab-ci.yaml file, in case more than one person wants to use this workflow to queue their training jobs in a collaborative environment. What other options do I have?
  3. Is there a way to configure a local container registry cache for the runner (and this workflow) where I can put all the necessary Docker images and use them, instead of adding dependencies to the workflow like I am doing, and let Docker handle it?

Any feedback or suggestions would be appreciated. Thank you.

@DavidGOrtega
Contributor

DavidGOrtega commented Jul 16, 2021

👋 @niqbal996

For each training job, the entire dataset is pulled from the remote and then the model is trained. This is really slow. It is my requirement to keep using DVC for data versioning, but is there a way to bypass the dataset pull (dvc pull -r minio_data) every time and reuse the same data between different training jobs? (Maybe mount volumes into the Docker container?)

Why not download the dataset locally on the machine, in a folder accessible by the local runners?
If you start the runners with Docker (I recommend this, as you are already doing), you need to add the volume.
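For example, with the Docker executor the mount can be declared once when registering the runner (it ends up under [runners.docker] volumes in config.toml). A minimal sketch; the URL, token, and host path below are placeholders for your own values:

gitlab-runner register \
    --non-interactive \
    --url https://gitlab.example.com \
    --registration-token $REGISTRATION_TOKEN \
    --executor docker \
    --docker-image dvcorg/cml:0-dvc2-base1-gpu \
    --docker-volumes "/data/myhugedataset:/myhugedataset:ro"

Every job container then sees the pre-downloaded dataset at /myhugedataset without pulling it from the remote.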

For MinIO authentication, I do not want to put my credentials such as AWS_SECRET_ACCESS_KEY in the .gitlab-ci.yaml

You need to set them up as secrets in GitLab, also called CI/CD variables. You can find them under Settings -> CI/CD.
[screenshot: GitLab CI/CD variables settings]
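Once AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are defined there (ideally masked), GitLab injects them into the job environment automatically, so the two export lines can simply be dropped from the script. A sketch of the trimmed script section, everything else unchanged:

script:
  # AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY now come from Settings -> CI/CD -> Variables
  - dvc remote add -d minio_data s3://bucket/dataset/
  - dvc remote modify minio_data endpointurl http://<MINIOSERVER_IP_ADDRESS>:9000
  - dvc remote modify minio_data use_ssl False
  - dvc pull -r minio_data
  - python main.py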

Is there a way to configure a local container registry cache for the runner (and this workflow) where I can put all the necessary Docker images and use them, instead of adding dependencies to the workflow like I am doing, and let Docker handle it?

Of course. You could:

  • extend our Docker image, installing your stack
  • create a Docker image with your stack, installing DVC and CML

Once you build it, you have that image locally; you can also publish it to Docker Hub to preserve it.

# Dockerfile
FROM iterativeai/cml:0-dvc2-base1-gpu
RUN pip install 'your-libraries'

# build the image and start the runner with the dataset volume mounted
docker build -t mycml .
docker run --name myrunner -d --gpus all \
    -e RUNNER_IDLE_TIMEOUT=300 \
    -e RUNNER_LABELS=cml \
    -e RUNNER_REPO=$my_repo_url \
    -e repo_token=$my_repo_token \
    -v /path/myhugedataset:/myhugedataset \
    mycml
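Publishing it to Docker Hub would then look something like the following, where <your-dockerhub-user> is a placeholder for your own account:

docker tag mycml <your-dockerhub-user>/mycml:latest
docker push <your-dockerhub-user>/mycml:latest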

@lenaherrmann-dfki

Hey folks,

If the data is pre-downloaded locally, might there be a way to check if there have been changes to the data?

If I understood CML correctly, it's supposed to automate the ML pipeline. So, downloading the data only when needed might be an idea to save some time. Could this be achieved by some git-like commands in DVC?

@DavidGOrtega
Contributor

If I understood CML correctly, it's supposed to automate the ML pipeline. So, downloading the data only when needed might be an idea to save some time. Could this be achieved by some git-like commands in DVC?

👋 @lenaherrmann-dfki the thing is that CML helps you launch GPU runners on many vendors (AWS, Azure and GCP). Those runners can be ephemeral, launched only when you need to train (to save costs), and they need to access that data. We are designing volumes to make this lightweight, but a big part of this responsibility resides in DVC.
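For reference, launching such an ephemeral cloud runner looks roughly like this with the cml-runner command of that era (the cloud, region and instance type are example values, not a recommendation):

cml-runner \
    --cloud aws \
    --cloud-region us-west \
    --cloud-type g4dn.xlarge \
    --labels cml \
    --idle-timeout 300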

@casperdcl

casperdcl commented Jul 16, 2021

If the data is pre-downloaded locally, might there be a way to check if there have been changes to the data?

@lenaherrmann-dfki you can track the data with DVC and a "local remote":

# setup
cd myrepo
# assuming `git init && dvc init` already done
cp -R /predownloaded/data .
dvc add ./data                                         # track the data with DVC
git add data.dvc .gitignore
git commit -m "track data with DVC"
dvc remote add --local localcache /predownloaded/cache
dvc push -r localcache                                 # populate the local remote
git push

Now, in the future, you can:

cd myrepo
dvc remote add --local localcache /predownloaded/cache
dvc pull -r localcache        # fetches only what is missing from the local cache
echo "new stuff" >> ./data/new
dvc add ./data                # re-track the modified data
git add data.dvc
git commit -m "update data"
dvc push -r localcache
git push
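As a side note on the "download only when needed" point: dvc pull already skips files that are present in the local cache, and you can check whether anything differs from the remote without downloading, e.g. (using the localcache remote from above):

dvc status -c -r localcache   # compare the local cache against the remote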

@casperdcl casperdcl added the question Further information is requested label Jul 16, 2021