
import: flag / parameter to skip the computation of the checksums #10346

Open
Honzys opened this issue Mar 8, 2024 · 1 comment
Labels
feature request (Requesting a new feature), performance (improvement over resource / time consuming tasks)

Comments


Honzys commented Mar 8, 2024

Hello,

First of all, thank you very much for the nice tooling you provide to everyone and thus making the data life cycle easier!

I'd like to ask if there is a way to skip the computation of checksums for imported files.

Imagine having TBs of data saved on the DVC remote. Imagine having isolated environments for ML training where only the directory with the DVC cache is shared (e.g. Docker containers with mounted volumes, Kubernetes pods, etc.), so that the data is used from the cache instead of being downloaded from the remote.

Our use-case is training ML models on the data mentioned above. Briefly, our cycle looks like this:

  1. Initialize the isolated environment & start the actual training script.
  2. Import the data (once import: local cache is ignored when importing data that already exist in cache #10255 is fixed, the data will no longer be copied from the remote after the first import, but only reflinked/hardlinked/copied from the shared cache directory).
  3. Run the training & save the output models.
  4. Destroy the environment.
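The cycle above could be sketched roughly as follows. This is illustrative only: the image name, mount path, and data-registry URL are hypothetical placeholders, and the `dvc cache dir` / `dvc config cache.type` / `dvc import` commands are used as documented, with no checksum-skipping flag (that is exactly what this issue requests).

```shell
# 1. Start an isolated environment with the shared cache mounted
#    (image name and mount path are placeholders).
docker run --rm -it -v /mnt/dvc-cache:/cache my-training-image bash

# Inside the container: point DVC at the shared cache and prefer
# reflinks/hardlinks over full copies.
dvc cache dir /cache
dvc config cache.type reflink,hardlink,symlink,copy

# 2. Import the data (placeholder registry URL). Today this step
#    re-hashes every file even when the objects already sit in /cache,
#    which is the cost this issue asks to avoid.
dvc import https://github.com/example/data-registry data/train

# 3. Run the training and version the output models.
python train.py
dvc add models/

# 4. The container is destroyed; only the mounted cache persists.
```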

The current issue is in step 2: the checksums of the data are re-computed on every import, even when nothing is actually downloaded from the remote.

Have you considered a parameter / flag that would disable the checksum computation when the data source in the shared cache directory can be trusted?
I understand the computation exists to make sure the data didn't get corrupted on the way, but with TBs of data the import step can take a lot of time, because all the data needs to be read and hashed. :/
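To illustrate why such a flag would help: full checksum verification must read every byte of every file, while a trust-based shortcut can rely on file metadata (size, mtime) obtained from a single `stat()` call per file. The sketch below is a generic illustration of that difference, not DVC's internal API; the function names are made up for this example.

```python
import hashlib
import os
import tempfile

def full_checksum(path, chunk=1 << 20):
    """Full verification: read every byte and hash it.
    This is the kind of work that dominates import time on TBs of data."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def cheap_fingerprint(path):
    """Trust-based check: size + mtime only, no file contents are read.
    A hypothetical --no-verify style flag could rely on something like this."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)

# Demo: the cheap check stays O(1) per file regardless of file size.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * (4 << 20))  # 4 MiB of dummy data
    path = f.name

digest = full_checksum(path)           # reads all 4 MiB
size, mtime = cheap_fingerprint(path)  # one stat() call
```

The point is only about asymptotics: hashing cost grows linearly with data volume, while the metadata check does not, which is why skipping the hash matters at TB scale.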

Thank you very much in advance for your answers or any insights regarding this issue!

@shcheklein added the feature request and performance labels Mar 8, 2024
@johnyaku

Related #9813 #9982
