Improve DataPipeline API #67

tchaton · 2021-02-03T17:43:22Z

🚀 Feature

Motivation

Should DataPipeline be split in two: collate_fn and uncollate_fn?

Reason:

collate_fn is really coming from the dataset to process new raw_data
uncollate_fn is usually related to a task such as classification. It can be confusing.

Default DataPipeline:

class DataPipeline(CollatePipeline, UnCollatePipeline):
    ...

Text classification data pipeline:

class TextClassificationDataPipeline(TextCollatePipeline, ClassificationUnCollatePipeline):
    ...

Create a BaseDataModule per data type, such as TextDataModule and ImageDataModule, with their associated TextDataPipeline and ImageDataPipeline as defaults.

Pitch

Alternatives

Additional context


carmocca · 2021-02-03T18:07:13Z

collate_fn is really coming from the dataset to process new raw_data

To be precise, before_collate is the hook that processes new raw_data. Then collate handles batching, and after_collate handles any batch-level processing.

So the first one does have some degree of conflict with the dataset, but the second two do not.
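A minimal sketch of how the three hooks might chain together. The class name, the string-based samples, and the `__call__` wiring are assumptions made for illustration, not the library's actual API:

```python
from typing import List


class SketchCollatePipeline:
    """Hypothetical stand-in, not the actual DataPipeline class."""

    def before_collate(self, samples: List[str]) -> List[List[str]]:
        # Per-sample processing of raw data (here: whitespace tokenization).
        return [s.split() for s in samples]

    def collate(self, samples: List[List[str]]) -> List[List[str]]:
        # Batching: pad every sample to the longest length in the batch.
        width = max(len(s) for s in samples)
        return [s + ["<pad>"] * (width - len(s)) for s in samples]

    def after_collate(self, batch: List[List[str]]) -> List[List[str]]:
        # Batch-level processing (here: lowercase every token).
        return [[tok.lower() for tok in row] for row in batch]

    def __call__(self, samples: List[str]) -> List[List[str]]:
        return self.after_collate(self.collate(self.before_collate(samples)))


batch = SketchCollatePipeline()(["Hello world", "Hi"])
print(batch)  # [['hello', 'world'], ['hi', '<pad>']]
```

Only the first stage touches individual raw samples, which is where the overlap with the dataset comes from.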

About the proposal

I generally like the idea. I can see us having to duplicate collate logic between tasks with different uncollate logic.

But doing it with mixins might grow to be confusing. See this example:

class A:
    def a(self):
        print('a')

class B:
    def b(self):
        print('b')

class C(A, B):
    ...

x = C()
x.a() # a
x.b() # b
# great!


class D:
    def a(self):
        print('d')


class C(D, B): ...

x = C()
x.a() # d
x.b() # b
# great!


class E:
    def a(self):
        print('e')


# If they subclass C, now order matters
class F(E, C):
    ...

class G(C, E):
    ...


x = F()
x.a() # e 
x.b() # b

x = G()
x.a() # d eek!
x.b() # b

so yeah... if this grows it will become a nightmare to follow
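For anyone following along, the difference comes from the method resolution order. A small sketch that prints it for the same classes as above (`mro_names` is a helper introduced here for illustration):

```python
class B:
    def b(self):
        return "b"


class D:
    def a(self):
        return "d"


class E:
    def a(self):
        return "e"


class C(D, B): ...
class F(E, C): ...
class G(C, E): ...


def mro_names(cls):
    # __mro__ is the order Python searches when resolving an attribute.
    return [k.__name__ for k in cls.__mro__]


print(mro_names(F))  # ['F', 'E', 'C', 'D', 'B', 'object']
print(mro_names(G))  # ['G', 'C', 'D', 'B', 'E', 'object']
```

In G, D comes before E, so D.a wins even though E was listed explicitly as a base.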

carmocca · 2021-02-03T18:10:32Z

A better solution is to do:

class DataPipeline:
    def __init__(self, collate: CollatePipeline, uncollate: UncollatePipeline):
        self.collate = collate
        self.uncollate = uncollate

    def before_collate(self, *args, **kwargs):
        # Delegate to the injected collate pipeline and return its result
        return self.collate.before_collate(*args, **kwargs)

    ...
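A runnable sketch of the composition idea above. `TextCollate`, `ClassificationUncollate`, and `ComposedPipeline` are hypothetical names, and the list-based batches are placeholders for real tensors. The attributes are named `collate_part` and `uncollate_part` here because an instance attribute called `collate` would shadow a forwarding method of the same name:

```python
from typing import Any, List, Sequence


class TextCollate:
    """Hypothetical collate half: cleans raw strings and batches them."""

    def before_collate(self, samples: Sequence[str]) -> List[str]:
        return [s.strip().lower() for s in samples]

    def collate(self, samples: Sequence[str]) -> List[str]:
        return list(samples)


class ClassificationUncollate:
    """Hypothetical uncollate half: maps per-class scores to label names."""

    def __init__(self, labels: Sequence[str]) -> None:
        self.labels = list(labels)

    def uncollate(self, scores: Sequence[Sequence[float]]) -> List[str]:
        # Pick the label with the highest score in each row.
        return [
            self.labels[max(range(len(row)), key=row.__getitem__)]
            for row in scores
        ]


class ComposedPipeline:
    """Holds the two halves and forwards each hook to the right one."""

    def __init__(self, collate_part: Any, uncollate_part: Any) -> None:
        self.collate_part = collate_part
        self.uncollate_part = uncollate_part

    def before_collate(self, samples):
        return self.collate_part.before_collate(samples)

    def collate(self, samples):
        return self.collate_part.collate(samples)

    def uncollate(self, batch):
        return self.uncollate_part.uncollate(batch)


pipeline = ComposedPipeline(TextCollate(), ClassificationUncollate(["neg", "pos"]))
batch = pipeline.collate(pipeline.before_collate(["  Good ", "BAD"]))
labels = pipeline.uncollate([[0.1, 0.9], [0.8, 0.2]])
print(batch)   # ['good', 'bad']
print(labels)  # ['pos', 'neg']
```

Swapping the uncollate half for a different task leaves the text collate logic untouched, and there is no inheritance order to reason about.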

tchaton · 2021-02-03T18:13:19Z

Yes, and let's create the default for each data-type and data-type task.

tchaton · 2021-02-03T18:13:36Z

So people are just left to implement uncollate_fn

carmocca · 2021-02-03T18:16:27Z

Note that this is stepping into over-engineering territory. Right now there is no real duplication to warrant the extra abstraction, but as we implement new tasks we will find out whether it is worth it.

edenlightning · 2021-02-16T15:24:04Z

Users should be able to modify the preprocessing step (preferably on the GPU) after data loading/batching and before model execution.

There should be an overridable "batch preprocessing" function defined in the DataPipeline that is called unconditionally before the model, whether it runs for training or inference, or possibly split into separate training and inference hooks.
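One way such a hook could look, as a rough sketch. `preprocess_batch`, the `training` flag, and `run_step` are names invented here for illustration, and plain Python lists stand in for tensors, so device placement is not shown:

```python
from typing import Callable, List


class SketchDataPipeline:
    """Hypothetical base class with a no-op batch preprocessing hook."""

    def preprocess_batch(self, batch: List[float], training: bool) -> List[float]:
        return batch


class ScaledPipeline(SketchDataPipeline):
    def preprocess_batch(self, batch: List[float], training: bool) -> List[float]:
        scaled = [x / 255.0 for x in batch]
        # Training-only step; reversing stands in for a real augmentation.
        return scaled[::-1] if training else scaled


def run_step(
    pipeline: SketchDataPipeline,
    model: Callable[[List[float]], List[float]],
    batch: List[float],
    training: bool,
) -> List[float]:
    # The hook is always called right before the model.
    return model(pipeline.preprocess_batch(batch, training=training))


def identity(batch: List[float]) -> List[float]:
    return batch


train_out = run_step(ScaledPipeline(), identity, [0, 255], training=True)
infer_out = run_step(ScaledPipeline(), identity, [0, 255], training=False)
print(train_out)  # [1.0, 0.0]
print(infer_out)  # [0.0, 1.0]
```

A single hook with a flag and two separate hooks are both compatible with this shape; the sketch uses the flag only because it is shorter.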

edenlightning · 2021-02-16T17:40:51Z

Adding @carmocca and @kaushikb11 as reviewers!

edgarriba · 2021-05-03T10:37:51Z

@tchaton DataPipeline was already merged. Can we close this?

tchaton added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels Feb 3, 2021

edenlightning assigned justusschock Feb 16, 2021

edenlightning assigned carmocca and kaushikb11 Feb 16, 2021

edenlightning added P0 Priority API / design refactors & code health and removed P0 labels Feb 16, 2021

justusschock mentioned this issue Feb 18, 2021

Datapipeline poc #130

Closed

8 tasks

edenlightning assigned tchaton and unassigned carmocca and kaushikb11 Feb 22, 2021

edenlightning assigned carmocca and unassigned justusschock Mar 8, 2021

edenlightning added this to the 0.2 milestone Mar 22, 2021

edenlightning unassigned carmocca Mar 22, 2021

edenlightning modified the milestones: 0.2, 0.3 Apr 19, 2021

tchaton closed this as completed May 3, 2021
