Reconfigurable modules #1472

dmpetrov · 2019-01-07T10:43:07Z

This feature request is extracted from a comment on #1462. See also the related submodels issue #301, opened by @kskyten.

Users should be able to create a "library" of reconfigurable pipelines (#1462) and reuse them across projects. A pipeline could be imported by copying it, through Git submodules, or with git clone https://my-dvc-repo.

An analogy with programming:

| DVC | Programming | Goal |
| --- | --- | --- |
| Stage | An operator (line of code) | The simplest unit |
| Pipeline (the current DVC implementation) | A function with no parameters | Organize and manage complexity in a project |
| Reconfigurable pipelines #1462 | A function with parameters | Organize and manage complexity in a project |
| Reconfigurable modules (this issue) | A function with parameters in a library (lib, so, jar file) | Reuse from different projects |

UPDATE: Added a link to @kskyten's issue.

dmpetrov · 2019-01-09T07:43:22Z

A possible interface for reconfigurable modules is shown below. The example is based on https://dvc.org/doc/get-started.

# Clone a repository and pull its data (for reconfigurable modules that ship with data)
$ dvc clone https://github.com/iterative/so-dataset-posts-25K

$ dvc run -d prepare.py -d so-dataset-posts-25K/data.xml \
        -o data.tsv -o data-test.tsv \
        python prepare.py so-dataset-posts-25K/data.xml

$ dvc clone https://github.com/iterative/text-to-bag-of-words

# Run cloned module instead of:
# dvc run -d featurization.py -d data.tsv -o matrix.pkl \
#              python featurization.py data.tsv matrix.pkl
# -d1: pass a file as the module's first input/dependency (a module can have several)
# -o1: instantiate (create a hardlink to) the module's first output as a data file
$ dvc sub text-to-bag-of-words -d1 data.tsv -o1 matrix.pkl \
              -p columns=1,2 -p lowercase=true -p max_features=9000

# Just a regular run
$ dvc run -d train.py -d matrix.pkl \
              -o model.pkl \
              python train.py matrix.pkl model.pkl
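To make the proposed flags concrete, here is a minimal Python sketch of how a `dvc sub`-style command line might be parsed. `dvc sub`, `-d1`, `-o1` and `-p` are only the proposal from this comment, not an existing DVC interface, and the helper names `parse_params` and `build_parser` are hypothetical.

```python
import argparse


def parse_params(pairs):
    """Turn ["columns=1,2", "lowercase=true"] into a dict of string values."""
    params = {}
    for pair in pairs:
        key, sep, value = pair.partition("=")
        if not sep or not key:
            raise ValueError(f"expected key=value, got {pair!r}")
        params[key] = value
    return params


def build_parser():
    # Hypothetical parser that mirrors the flags proposed in this issue.
    parser = argparse.ArgumentParser(prog="dvc sub")
    parser.add_argument("module")              # e.g. text-to-bag-of-words
    parser.add_argument("-d1", dest="dep1")    # first module input/dependency
    parser.add_argument("-o1", dest="out1")    # first module output
    parser.add_argument("-p", dest="params", action="append", default=[])
    return parser


# The same arguments as in the example above.
argv = [
    "text-to-bag-of-words",
    "-d1", "data.tsv",
    "-o1", "matrix.pkl",
    "-p", "columns=1,2",
    "-p", "lowercase=true",
    "-p", "max_features=9000",
]
args = build_parser().parse_args(argv)
params = parse_params(args.params)
```

Values are kept as strings here; whether `lowercase=true` should become a boolean is left to the module that consumes the parameters.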

Details

dvc sub should not execute the module inside its own directory (text-to-bag-of-words), because the same module may need to be run more than once. Instead, each run should create a separate module instance in some directory (say, .dvc/inst/) with a unique suffix (for example, .dvc/inst/text-to-bag-of-words_8bf3cfe).

Connection to build cache issue

The module's unique suffix can be derived from the module instance config file (not shown in the example above) and the set of params. This way DVC can easily identify similar runs, and the same mechanism can be reused as a build cache (#1234) for regular, non-module runs.
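A rough sketch of how such a suffix might be computed is below. The hashing scheme (SHA-1 over a JSON dump of params and config, truncated to 7 hex characters), the function names, and the `.dvc/inst` root are assumptions made for illustration; the issue does not specify them.

```python
import hashlib
import json
from pathlib import Path


def instance_suffix(params, config_text="", length=7):
    """Derive a short, deterministic suffix from params and instance config."""
    # sort_keys makes the result independent of the order params were passed in.
    payload = json.dumps({"params": params, "config": config_text}, sort_keys=True)
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:length]


def instance_dir(module, params, config_text="", root=".dvc/inst"):
    """Directory for one module instance, e.g. .dvc/inst/<module>_<suffix>."""
    return Path(root) / f"{module}_{instance_suffix(params, config_text)}"


run_a = instance_dir(
    "text-to-bag-of-words",
    {"columns": "1,2", "lowercase": "true", "max_features": "9000"},
)
# Same params in a different order map to the same instance.
run_b = instance_dir(
    "text-to-bag-of-words",
    {"max_features": "9000", "lowercase": "true", "columns": "1,2"},
)
# A changed parameter produces a different instance directory.
run_c = instance_dir(
    "text-to-bag-of-words",
    {"columns": "1,2", "lowercase": "true", "max_features": "5000"},
)
```

Because identical params and config always produce the same directory name, an existing instance directory could serve as a signal that the result is already available, which is the build cache idea mentioned above.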

dmpetrov · 2019-01-10T23:42:32Z

@kskyten I'd love to hear your feedback on this.

dmpetrov mentioned this issue Jan 7, 2019

Reconfigurable pipelines #1462

Closed

dmpetrov added the feature request label Jan 7, 2019

This was referenced Feb 25, 2019

Submodels and metadata #301

Closed

Deploying on AWS Lambda #302

Closed

dmpetrov mentioned this issue Feb 26, 2019

Dataset storage improvements #1487

Closed

efiop closed this as completed May 3, 2021

iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion.
