Enable mapping subsets and fields #352

Closed · PhilippeMoussalli wants to merge 24 commits

Conversation

@PhilippeMoussalli (Contributor) commented Aug 14, 2023

PR that enables remapping the column names of a user's dataset to match the names of a reusable/custom component spec.

  • Mapping one source subset to different target subsets, or vice versa, is not allowed since it might lead to unexpected behavior and edge cases.

  • Mapping one source subset to a different target subset with a different schema is also not allowed and is checked during the static evaluation of the pipeline.

  • Remapping the column names is transient and occurs only within the mapped component:

    • mapped subsets/fields are written back under their original (pre-mapping) names
    • new subsets/fields introduced in the component are written under their new names, without the need to specify an explicit mapping

  • In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.

  • We still need to remap the column names of the dataframe after loading (to be able to execute the transform function of the component) and before writing (to revert back to the original column names). This is better illustrated in the following figure.

[image: diagram of the transient column remapping at load and write time]
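
A minimal sketch of the transient renaming described above, assuming a flat column mapping; the names mapping, to_component_names, and to_original_names are illustrative, not the actual Fondant API:

import pandas as pd

# user dataset column name -> name expected by the component spec
mapping = {"file_filename": "text_data"}
inverse = {v: k for k, v in mapping.items()}

def to_component_names(df: pd.DataFrame) -> pd.DataFrame:
    # rename after loading so the transform sees the spec's column names
    return df.rename(columns=mapping)

def to_original_names(df: pd.DataFrame) -> pd.DataFrame:
    # revert before writing so mapped fields keep their original names
    return df.rename(columns=inverse)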

@PhilippeMoussalli changed the title from "Remapping dictionary" to "Enable mapping subsets and fields" on Aug 14, 2023
@PhilippeMoussalli linked an issue ("Add custom default index to DaskLoadComponent") on Aug 14, 2023 that may be closed by this pull request
@mrchtr (Contributor) commented Aug 16, 2023

Thanks @PhilippeMoussalli. I think we should add a description and maybe a small example of how to use it to the documentation. I can imagine that this will be used a lot.

@mrchtr (Contributor) commented Aug 17, 2023

I have retested your modifications end-to-end using the following pipeline:

# Component 1
load_file = ComponentOp(
    component_dir="load_from_files",
    ...
)

# Component 2
language_filter = ComponentOp(
    component_dir="...",
    spec_mapping={
        # produced by component 1 -> consumed by component 2
        "file_filename": "text_data"
    }
)

# Component 3
filter_comp = ComponentOp(
    component_dir="...",
    spec_mapping={
        # produced by component 1 -> consumed by component 3
        "file_filename": "text_data"
    }
)

The subset and field mappings are working as expected. Personally, I found it a bit unintuitive to determine the correct spec_mappings. Maybe we could utilise a dataclass for defining the spec_mappings. This approach would allow us to name the properties appropriately, and we would get type hints and documentation within the code.

E.g. something like this:

spec_mapping = [
    ColumnMapping(global_column="file_filename", component_column="text_data")
]
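
A minimal sketch of such a dataclass, using the field names from the example above (this is a proposal, not the final API):

from dataclasses import dataclass

@dataclass
class ColumnMapping:
    # column name as produced in the global dataset
    global_column: str
    # column name the component spec expects
    component_column: str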

@PhilippeMoussalli (Contributor, Author)

> The subset and field mappings are working as expected. [...] Maybe we could utilise a dataclass for defining the spec_mappings. This approach would allow us to name the properties appropriately, and we would get type hints and documentation within the code.

Thanks for the feedback @mrchtr! I'm glad it works as expected.
I hadn't thought about adding it as a dataclass, but it does make sense. I think we will still have to translate it internally into a dictionary to pass it to Kubeflow, but that should not affect the interface you proposed.
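
For illustration, flattening the proposed dataclass list back into a plain dict before handing it to Kubeflow could look like this (mappings_to_dict is a hypothetical helper; ColumnMapping is the dataclass sketched above):

from dataclasses import dataclass

@dataclass
class ColumnMapping:
    global_column: str
    component_column: str

def mappings_to_dict(mappings: list) -> dict:
    # flatten to the serializable dict passed along in the Kubeflow spec
    return {m.global_column: m.component_column for m in mappings}

assert mappings_to_dict(
    [ColumnMapping(global_column="file_filename", component_column="text_data")]
) == {"file_filename": "text_data"}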

@PhilippeMoussalli (Contributor, Author) commented Aug 17, 2023

@mrchtr I added a commit to change the user interface as you suggested: e891f3f

Let me know if you have any additional feedback to make it more intuitive. Once we align on the interface, I'll add documentation and examples on how to use it.

@mrchtr (Contributor) commented Aug 17, 2023

> I added a commit to change the user interface as you suggested: e891f3f

I like it. It is quite smooth and intuitive now.

@RobbeSneyders (Member)

>   • Remapping the column names is transient and occurs only within the mapped component:
>
>     • mapped subsets/fields are written back under their original (pre-mapping) names
>     • new subsets/fields introduced in the component are written under their new names, without the need to specify an explicit mapping
>
>   • We still need to remap the column names of the dataframe after loading (to be able to execute the transform function of the component) and before writing (to revert back to the original column names).

Thanks @PhilippeMoussalli! Haven't been able to look at the code in detail yet, but I don't agree with the points above. I believe we should only map the input of the component and follow the original component specification for the output.

I think this is a lot more intuitive since you can still infer which data will be produced by the component by just looking at its component specification, while still providing the same flexibility.

@PhilippeMoussalli (Contributor, Author)

> I believe we should only map the input of the component and follow the original component specification for the output.

Thanks for the feedback! The reasoning behind the transient renaming is mainly to avoid writing duplicate subsets; if we only rename the outputs, we will end up with duplicate subsets:

[image: manifest evolution showing duplicate subsets when only the outputs are renamed]

@RobbeSneyders (Member)

If a subset is defined in the produces section of the component spec, it will be rewritten anyway. If a subset is consumed but not produced, we leave it in its original state. Its naming should not impact whether data is written or not.

@PhilippeMoussalli (Contributor, Author)

> If a subset is defined in the produces section of the component spec, it will be rewritten anyway. If a subset is consumed but not produced, we leave it in its original state. Its naming should not impact whether data is written or not.

Ok that makes sense! I updated the PR accordingly

@RobbeSneyders (Member)

Sorry for the delay on this @PhilippeMoussalli, but I'm a bit hesitant to merge until we have a better view on #244, as I believe the component spec manipulation in this PR might lead to issues when implementing that functionality.

@RobbeSneyders (Member) left a comment

Thanks @PhilippeMoussalli!

I only looked at the high-level description before, so this is the first time that I dove into the code. A couple of remarks:

I'm not sure if updating the component spec is the best way to solve this.

I understand your reasoning:

> In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.

But:

  • Inside the component, we only need the mapping in the DataLoader:

    • Reverse-map the original consumes spec to know which fields to pass to read_parquet
    • Map the columns of the loaded dataframe to the ones in the original consumes spec

    This should be quite easy to implement without updating the spec, by just using a mapping (see the sketch after this list).

  • Even though we now update the component spec, we still need to pass both the component spec and the mapping to the component, so we don't win a lot. I actually think it's a bit confusing.

  • Since we map the columns of the loaded dataframe to the ones in the original consumes spec before passing it to the transform method, the spec passed to the component will actually not match the data it receives. Components accessing the data dynamically based on the spec will fail because of this.

  • The other place where we need the mapping is during pipeline validation. Here I see the benefit of updating the component spec, but I don't think it outweighs the downsides mentioned above and it shouldn't be too hard to take the mapping into account separately.
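
A sketch of how the DataLoader could apply the mapping without touching the spec, under the assumptions above (load and mapping are illustrative names, not the actual Fondant implementation):

import dask.dataframe as dd

# dataset column name on disk -> name in the component's consumes spec
mapping = {"file_filename": "text_data"}
inverse = {v: k for k, v in mapping.items()}

def load(path: str, consumes: list) -> dd.DataFrame:
    # reverse-map the consumes spec to the on-disk column names
    columns = [inverse.get(col, col) for col in consumes]
    df = dd.read_parquet(path, columns=columns)
    # rename so the transform method sees the names from the original spec
    return df.rename(columns=mapping)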

We might have to split the mapping of subsets and the mapping of fields

You mention the following:

> Mapping one source subset to different target subsets, or vice versa, is not allowed since it might lead to unexpected behavior and edge cases.

This is logical, so maybe we need to make this explicit: let the user define a subset mapping, and a mapping per subset. I'm not sure yet what this should exactly look like, but I think it would be a lot more robust.
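
Purely illustrative, one possible shape for making the subset mapping and the per-subset field mapping explicit (all names here are hypothetical):

# source subset -> target subset (one-to-one only)
subset_mapping = {"files": "text"}

# per target subset: source field -> target field
field_mappings = {
    "text": {"file_filename": "data"},
}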

@@ -91,16 +91,170 @@ def additional_fields(self) -> bool:
        return self._specification.get("additionalFields", True)


@dataclass
class ColumnMapping:

Member commented:

What's the benefit of this class compared to just using a mapping dict?

@mrchtr (Contributor) commented Oct 10, 2023

The dataclass is based on a personal wish of mine. I found it a bit confusing to define the column mapping using a dictionary: at first glance, it wasn't clear whether the key or the value of a dictionary entry represented the component column. With a dataclass, you can define it explicitly: [ColumnMapping(dataset_column=xxx, component_column=yyy), ..]

Even if you haven't read the documentation and are looking at a pipeline for the first time, it becomes clear which column is being mapped to which.

@PhilippeMoussalli (Contributor, Author) commented Oct 24, 2023

I agree with most of your points; indeed, there is some redundancy and some things can be simplified.

> Since we map the columns of the loaded dataframe to the ones in the original consumes spec before passing it to the transform method, the spec passed to the component will actually not match the data it receives.

Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.

> Components accessing the data dynamically based on the spec will fail because of this.

I think this only applies to generic components, but in any case the mapping there is not relevant, since most of them have a distinct mapping argument (we should maybe think about combining the two at some point for less confusion).

> This is logical, so maybe we need to make this explicit: let the user define a subset mapping, and a mapping per subset. I'm not sure yet what this should exactly look like, but I think it would be a lot more robust.

Based on your feedback, I defined a new schema for how this could look. It also shows what the new implementation would look like. Let me know what you think.

[screenshot: proposed schema splitting the subset mapping and the per-subset field mapping]

@PhilippeMoussalli (Contributor, Author)

One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.

[screenshot: remapping repeated in subsequent components when the produces section is not mapped]

@RobbeSneyders (Member)

> Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.

That's a good point, I think we currently use it for both. Maybe we should make this explicit, e.g. by letting the ComponentSpec have two different representations: ComponentSpec.io_spec and ComponentSpec.transform_spec. That way we can still encapsulate the related behavior in a single class, and use the relevant representation in each place.
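
A rough sketch of what the two representations could look like; only the property names io_spec and transform_spec come from the comment above, everything else is assumed:

from typing import Dict, Optional

class ComponentSpec:
    def __init__(self, specification: Dict, mapping: Optional[Dict] = None):
        self._specification = specification
        # maps field names in the spec to the names used in the stored dataset
        self._mapping = mapping or {}

    @property
    def transform_spec(self) -> Dict:
        # the spec as written: the names the transform method works with
        return self._specification

    @property
    def io_spec(self) -> Dict:
        # the spec with the mapping applied: the names to load from storage
        consumes = {
            self._mapping.get(name, name): schema
            for name, schema in self._specification.get("consumes", {}).items()
        }
        return {**self._specification, "consumes": consumes}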

> I think this only applies to generic components, but in any case the mapping there is not relevant, since most of them have a distinct mapping argument (we should maybe think about combining the two at some point for less confusion).

True, but we still pass the spec to each component, which can lead to confusion and might still be used in some cases we currently don't plan for.

> Based on your feedback, I defined a new schema for how this could look. It also shows what the new implementation would look like. Let me know what you think.

Looks good!

> One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.

If we can split the different functionalities of the ComponentSpec as mentioned above, I indeed see more benefits than downsides to adding produces mapping.

@PhilippeMoussalli (Contributor, Author)

>> Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.
>
> That's a good point, I think we currently use it for both. Maybe we should make this explicit, e.g. by letting the ComponentSpec have two different representations: ComponentSpec.io_spec and ComponentSpec.transform_spec. That way we can still encapsulate the related behavior in a single class, and use the relevant representation in each place.

So the default ComponentSpec would be equivalent to the transform_spec, and the io_spec could be a class method that takes the original spec and modifies its specification based on the mapping spec.

The io_spec is used for loading/writing and for evolving the manifest. The purpose of the transform_spec, however, is to map the names of the dataframe columns before/after the transformation, but that is not achievable by itself, since by definition the spec does not contain a mapping to the io_spec. So we could represent the mapping between the two specs as an additional property and use that for the mapping. However, something similar can be achieved by just using the mapper and its inverse, in which case the only purpose of having both a transform and an io spec plus the mapping between them is a better representation of the mapping (without really enabling additional functionality that cannot be achieved by the suggestion above). Is that correct?

>> One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.
>
> If we can split the different functionalities of the ComponentSpec as mentioned above, I indeed see more benefits than downsides to adding produces mapping.

Alright, then I will also add a separate mapping for the produces section.

@PhilippeMoussalli (Contributor, Author)

Closed in favor of redesigning the subsets and fields: #567

@RobbeSneyders RobbeSneyders deleted the remapping-dictionary branch January 11, 2024 09:09