Description
Say I have a million files in the directory ./data/pre.
I have a Python script process_dir.py which goes over each file in ./data/pre, processes it, and creates a file with the same name in a directory ./data/post (if such a file already exists, it skips processing it).
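For reference, a minimal sketch of what process_dir.py is assumed to look like (the actual processing is a placeholder; the relevant part is that it skips files whose output already exists):

```python
# Hypothetical sketch of process_dir.py -- the real processing logic is
# replaced by a plain copy; only the skip-if-output-exists behaviour matters.
import shutil
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")

def main() -> None:
    POST.mkdir(parents=True, exist_ok=True)
    for src in PRE.iterdir():
        dst = POST / src.name
        if dst.exists():
            continue  # output already produced on a previous run, skip it
        shutil.copyfile(src, dst)  # placeholder for the real processing step

if __name__ == "__main__":
    main()
```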
I defined a pipeline:
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
Now let’s say I removed one file from data/pre.
When I run dvc repro, it will still unnecessarily process all 999,999 remaining files again, because DVC (by design) removes the entire contents of the ./data/post directory before running the process stage. Can we think of an elegant way to define the pipeline so that process_dir.py will not process the same file twice?
Suggestion: if we were able to define a rule that directly connects, in the DAG, pairs of data/pre/X.txt and data/post/X.txt in the context of the process stage, then we could adjust the process stage as follows (see the sketch after this list):
- identify which file-pairs haven't changed and move those files to a temp dir
- run the process stage as you normally would
- move the file-pairs from the temp dir back to their original locations
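Purely as an illustration of those three steps, here is a rough sketch of a helper that DVC could conceptually run around the process stage. The .pair_tmp and .pair_hashes directories and the md5-marker check for detecting unchanged pairs are assumptions for the sake of the example, not an existing DVC mechanism:

```python
# Illustrative only: stash unchanged file-pairs, run the stage, restore them.
import hashlib
import shutil
import subprocess
from pathlib import Path

PRE, POST, TMP = Path("data/pre"), Path("data/post"), Path(".pair_tmp")
MARKERS = Path(".pair_hashes")  # hypothetical record of input hashes from the last run

def md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def main() -> None:
    TMP.mkdir(exist_ok=True)
    MARKERS.mkdir(exist_ok=True)

    # 1. identify unchanged pairs and move both members to the temp dir
    stashed = []
    for src in sorted(PRE.iterdir()):
        out, marker = POST / src.name, MARKERS / src.name
        if out.exists() and marker.exists() and marker.read_text() == md5(src):
            shutil.move(str(src), TMP / ("pre_" + src.name))
            shutil.move(str(out), TMP / ("post_" + src.name))
            stashed.append(src.name)

    # 2. run the process stage as you normally would; it only sees changed inputs
    subprocess.run(["python", "process_dir.py"], check=True)

    # 3. move the stashed pairs back to their original locations
    for name in stashed:
        shutil.move(str(TMP / ("pre_" + name)), PRE / name)
        shutil.move(str(TMP / ("post_" + name)), POST / name)

    # record input hashes so the next run can tell which pairs changed
    for src in sorted(PRE.iterdir()):
        (MARKERS / src.name).write_text(md5(src))

if __name__ == "__main__":
    main()
```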