Skip to content
This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Improve DataPipeline API #67

Closed
tchaton opened this issue Feb 3, 2021 · 8 comments
Closed

Improve DataPipeline API #67

tchaton opened this issue Feb 3, 2021 · 8 comments
Assignees
Labels
Milestone

Comments

@tchaton
Copy link
Contributor

tchaton commented Feb 3, 2021

🚀 Feature

Motivation

  • Should we DataPipeline by splitted in 2, collate_fn and uncollate_fn ?

Reason:

  • collate_fn is really coming from the dataset to process new raw_data
  • uncollate_fn is usually related to a task such as classification. It can be confusing.

default DataPipeline

class DataPipeline(CollatePipeline, UnCollatePipeline):

    ...

 

create text classification data pipeline.

class TextClassificationDataPipeline(TextCollatePipeline, ClassificationUnCollatePipeline):

    ...

 
  • Create BaseDataModule per data-type such as TextDataModule, ImageDataModule + their associated TextDataPipeline, ImageDataPipeline as default.

Pitch

Alternatives

Additional context

@tchaton tchaton added enhancement New feature or request help wanted Extra attention is needed labels Feb 3, 2021
@carmocca
Copy link
Contributor

carmocca commented Feb 3, 2021

collate_fn is really coming from the dataset to process new raw_data

To be precise, before_collate is the one to process new raw_data. Then you have collate to handle batching and after_collate for any batch processing

So the first one does have some degree of conflict with the dataset, but the second two do not.

About the proposal

I generally like the idea. I can see us having to duplicate collate logic between tasks with different uncollate logic.

But doing it with mixins might grow to be confusing. See this example:

class A:
    def a(self):
        print('a')

class B:
    def b(self):
        print('b')

class C(A, B):
    ...

x = C()
x.a() # a
x.b() # b
# great!


class D:
    def a(self):
        print('d')


class C(D, B): ...

x = C()
x.a() # d
x.b() # b
# great!


class E:
    def a(self):
        print('e')


# If they subclass C, now order matters
class F(E, C):
    ...

class G(C, E):
    ...


x = F()
x.a() # e 
x.b() # b

x = G()
x.a() # d eek!
x.b() # b

so yeah... if this grows it will become a nightmare to follow

@carmocca
Copy link
Contributor

carmocca commented Feb 3, 2021

A better solution is to do:

class DataPipeline:
    def __init__(self, collate: CollatePipeline, uncollate: UncollatePipeline):
        self.collate = collate
        self.uncollate = uncollate

    def before_collate(self, ...):
        self.collate.before_collate(...)
    
    ...

@tchaton
Copy link
Contributor Author

tchaton commented Feb 3, 2021

Yes, and let's create the default for each data-type and data-type task.

@tchaton
Copy link
Contributor Author

tchaton commented Feb 3, 2021

So people are just left to implement uncollate_fn

@carmocca
Copy link
Contributor

carmocca commented Feb 3, 2021

Note that this is stepping into over-engineering territory. As of right now, there is no real duplication to warrant this extra abstraction but as we implement new tasks we will find if this is worth

@edenlightning
Copy link
Contributor

Users should be able to modify the preprocessing step (on the GPU preferably) in after the dataloading/batching and before the model execution.

There should be a overridable "batch preprocessing" function defined in the datapipeline that is called unconditionally before the model when running it for either training or inference or maybe split by training or inference

@edenlightning
Copy link
Contributor

Adding @carmocca and @kaushikb11 as reviewers!

@edgarriba
Copy link
Contributor

@tchaton DataPipeline was already merge. Can we close this ?

@tchaton tchaton closed this as completed May 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants