Description
Say I have a million files in the directory ./data/pre.
I have a Python script process_dir.py which goes over each file in ./data/pre, processes it, and creates a file with the same name in a directory ./data/post (if such a file already exists, it skips processing it).
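For reference, a minimal sketch of what process_dir.py is assumed to look like (the actual processing is a placeholder; the relevant part is that it skips files whose output already exists):

```python
# Hypothetical sketch of process_dir.py -- the real processing logic is
# replaced by a plain copy; only the skip-if-output-exists behaviour matters.
import shutil
from pathlib import Path

PRE = Path("data/pre")
POST = Path("data/post")

def main() -> None:
    POST.mkdir(parents=True, exist_ok=True)
    for src in PRE.iterdir():
        dst = POST / src.name
        if dst.exists():
            continue  # output already produced on a previous run, skip it
        shutil.copyfile(src, dst)  # placeholder for the real processing step

if __name__ == "__main__":
    main()
```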
I defined a pipeline:
dvc run -n process -d process_dir.py -d data/pre -o data/post python process_dir.py
Now let’s say I removed one file from data/pre.
When I run dvc repro, it will still unnecessarily process all 999,999 remaining files again, because DVC (by design) removes the entire contents of the ./data/post directory before running the process stage. Can we think of an elegant way to define the pipeline so that process_dir.py will not process the same file twice?
Suggestion: if we were able to define a rule that directly connects, in the DAG, pairs of data/pre/X.txt and data/post/X.txt in the context of the process stage, then we could adjust the process stage as follows (see the sketch after this list):
- identify which file-pairs haven't changed and move those files to a temp dir
- run the process stage as you normally would
- move the file-pairs from the temp dir back to their original locations
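Purely as an illustration of those three steps, here is a rough sketch of a helper that DVC could conceptually run around the process stage. The .pair_tmp and .pair_hashes directories and the md5-marker check for detecting unchanged pairs are assumptions for the sake of the example, not an existing DVC mechanism:

```python
# Illustrative only: stash unchanged file-pairs, run the stage, restore them.
import hashlib
import shutil
import subprocess
from pathlib import Path

PRE, POST, TMP = Path("data/pre"), Path("data/post"), Path(".pair_tmp")
MARKERS = Path(".pair_hashes")  # hypothetical record of input hashes from the last run

def md5(path: Path) -> str:
    return hashlib.md5(path.read_bytes()).hexdigest()

def main() -> None:
    TMP.mkdir(exist_ok=True)
    MARKERS.mkdir(exist_ok=True)

    # 1. identify unchanged pairs and move both members to the temp dir
    stashed = []
    for src in sorted(PRE.iterdir()):
        out, marker = POST / src.name, MARKERS / src.name
        if out.exists() and marker.exists() and marker.read_text() == md5(src):
            shutil.move(str(src), TMP / ("pre_" + src.name))
            shutil.move(str(out), TMP / ("post_" + src.name))
            stashed.append(src.name)

    # 2. run the process stage as you normally would; it only sees changed inputs
    subprocess.run(["python", "process_dir.py"], check=True)

    # 3. move the stashed pairs back to their original locations
    for name in stashed:
        shutil.move(str(TMP / ("pre_" + name)), PRE / name)
        shutil.move(str(TMP / ("post_" + name)), POST / name)

    # record input hashes so the next run can tell which pairs changed
    for src in sorted(PRE.iterdir()):
        (MARKERS / src.name).write_text(md5(src))

if __name__ == "__main__":
    main()
```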