Enable mapping subsets and fields #352

Closed · PhilippeMoussalli wants to merge 24 commits

Conversation

@PhilippeMoussalli (Contributor) commented Aug 14, 2023

PR that enables remapping the column names of a user's dataset to match the names of a reusable/custom component spec.

  • Mapping one source subset to different target subsets, or vice versa, is not allowed since it might lead to unexpected behavior and edge cases.

  • Mapping one source subset to a different target subset with a different schema is also not allowed and is checked during the static evaluation of the pipeline.

  • Remapping the column names is transient and occurs only within the mapped component:

    • mapped subsets/fields are written back under their original (pre-mapping) names
    • new subsets/fields introduced in the component are written under their new names, without the need to specify an explicit mapping

  • In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.

  • We still need to remap the column names of the dataframe after loading (to be able to execute the transform function of the component) and before writing (to revert back to the original column names). This is better illustrated in the following figure.

[image: diagram of the transient column remapping at load and write time]
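
A minimal sketch of the transient renaming described above, assuming a flat column mapping; the names mapping, to_component_names, and to_original_names are illustrative, not the actual Fondant API:

import pandas as pd

# user dataset column name -> name expected by the component spec
mapping = {"file_filename": "text_data"}
inverse = {v: k for k, v in mapping.items()}

def to_component_names(df: pd.DataFrame) -> pd.DataFrame:
    # rename after loading so the transform sees the spec's column names
    return df.rename(columns=mapping)

def to_original_names(df: pd.DataFrame) -> pd.DataFrame:
    # revert before writing so mapped fields keep their original names
    return df.rename(columns=inverse)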

@PhilippeMoussalli changed the title from "Remapping dictionary" to "Enable mapping subsets and fields" on Aug 14, 2023
@PhilippeMoussalli linked an issue ("Add custom default index to DaskLoadComponent") on Aug 14, 2023 that may be closed by this pull request
@mrchtr (Contributor) commented Aug 16, 2023

Thanks @PhilippeMoussalli. I think we should add a description and maybe a small example of how to use it to the documentation. I can imagine that this will be used a lot.

@mrchtr (Contributor) commented Aug 17, 2023

I have retested your modifications end-to-end using the following pipeline:

# Component 1
load_file = ComponentOp(
    component_dir="load_from_files",
    ...
)

# Component 2
language_filter = ComponentOp(
    component_dir="...",
    spec_mapping={
        # produced by component 1 -> consumed by component 2
        "file_filename": "text_data"
    }
)

# Component 3
filter_comp = ComponentOp(
    component_dir="...",
    spec_mapping={
        # produced by component 1 -> consumed by component 3
        "file_filename": "text_data"
    }
)

The subset and field mappings are working as expected. Personally, I found it a bit unintuitive to determine the correct spec_mappings. Maybe we could utilise a dataclass for defining the spec_mappings. This approach would allow us to name the properties appropriately, and we would get type hints and documentation within the code.

E.g. something like this:

spec_mapping = [
    ColumnMapping(global_column="file_filename", component_column="text_data")
]
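
A minimal sketch of such a dataclass, using the field names from the example above (this is a proposal, not the final API):

from dataclasses import dataclass

@dataclass
class ColumnMapping:
    # column name as produced in the global dataset
    global_column: str
    # column name the component spec expects
    component_column: str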

@PhilippeMoussalli (Contributor, Author)

> The subset and field mappings are working as expected. [...] Maybe we could utilise a dataclass for defining the spec_mappings. This approach would allow us to name the properties appropriately, and we would get type hints and documentation within the code.

Thanks for the feedback @mrchtr! I'm glad it works as expected.
I hadn't thought about adding it as a dataclass, but it does make sense. I think we will still have to translate it internally into a dictionary to pass it to Kubeflow, but that should not affect the interface you proposed.
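
For illustration, flattening the proposed dataclass list back into a plain dict before handing it to Kubeflow could look like this (mappings_to_dict is a hypothetical helper; ColumnMapping is the dataclass sketched above):

from dataclasses import dataclass

@dataclass
class ColumnMapping:
    global_column: str
    component_column: str

def mappings_to_dict(mappings: list) -> dict:
    # flatten to the serializable dict passed along in the Kubeflow spec
    return {m.global_column: m.component_column for m in mappings}

assert mappings_to_dict(
    [ColumnMapping(global_column="file_filename", component_column="text_data")]
) == {"file_filename": "text_data"}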

@PhilippeMoussalli (Contributor, Author) commented Aug 17, 2023

@mrchtr I added a commit to change the user interface as you suggested: e891f3f

Let me know if you have any additional feedback to make it more intuitive. Once we align on the interface, I'll add documentation and examples on how to use it.

@mrchtr (Contributor) commented Aug 17, 2023

> I added a commit to change the user interface as you suggested: e891f3f

I like it. It is quite smooth and intuitive now.

@RobbeSneyders (Member)

>   • Remapping the column names is transient and occurs only within the mapped component:
>
>     • mapped subsets/fields are written back under their original (pre-mapping) names
>     • new subsets/fields introduced in the component are written under their new names, without the need to specify an explicit mapping
>
>   • We still need to remap the column names of the dataframe after loading (to be able to execute the transform function of the component) and before writing (to revert back to the original column names).

Thanks @PhilippeMoussalli! Haven't been able to look at the code in detail yet, but I don't agree with the points above. I believe we should only map the input of the component and follow the original component specification for the output.

I think this is a lot more intuitive since you can still infer which data will be produced by the component by just looking at its component specification, while still providing the same flexibility.

@PhilippeMoussalli (Contributor, Author)

> I believe we should only map the input of the component and follow the original component specification for the output.

Thanks for the feedback! The reasoning behind the transient renaming is mainly to avoid writing duplicate subsets; if we only rename the outputs, we will end up with duplicate subsets:

[image: manifest evolution showing duplicate subsets when only the outputs are renamed]

@RobbeSneyders (Member)

If a subset is defined in the produces section of the component spec, it will be rewritten anyway. If a subset is consumed but not produced, we leave it in its original state. Its naming should not impact whether data is written or not.

@PhilippeMoussalli (Contributor, Author)

> If a subset is defined in the produces section of the component spec, it will be rewritten anyway. If a subset is consumed but not produced, we leave it in its original state. Its naming should not impact whether data is written or not.

Ok that makes sense! I updated the PR accordingly

@RobbeSneyders (Member)

Sorry for the delay on this @PhilippeMoussalli, but I'm a bit hesitant to merge until we have a better view on #244, as I believe the component spec manipulation in this PR might lead to issues when implementing that functionality.

@RobbeSneyders (Member) left a comment

Thanks @PhilippeMoussalli!

I only looked at the high-level description before, so this is the first time that I dove into the code. A couple of remarks:

I'm not sure if updating the component spec is the best way to solve this.

I understand your reasoning:

> In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.

But:

  • Inside the component, we only need the mapping in the DataLoader:

    • Reverse-map the original consumes spec to know which fields to pass to read_parquet
    • Map the columns of the loaded dataframe to the ones in the original consumes spec

    This should be quite easy to implement without updating the spec, by just using a mapping (see the sketch after this list).

  • Even though we now update the component spec, we still need to pass both the component spec and the mapping to the component, so we don't win a lot. I actually think it's a bit confusing.

  • Since we map the columns of the loaded dataframe to the ones in the original consumes spec before passing it to the transform method, the spec passed to the component will actually not match the data it receives. Components accessing the data dynamically based on the spec will fail because of this.

  • The other place where we need the mapping is during pipeline validation. Here I see the benefit of updating the component spec, but I don't think it outweighs the downsides mentioned above and it shouldn't be too hard to take the mapping into account separately.
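
A sketch of how the DataLoader could apply the mapping without touching the spec, under the assumptions above (load and mapping are illustrative names, not the actual Fondant implementation):

import dask.dataframe as dd

# dataset column name on disk -> name in the component's consumes spec
mapping = {"file_filename": "text_data"}
inverse = {v: k for k, v in mapping.items()}

def load(path: str, consumes: list) -> dd.DataFrame:
    # reverse-map the consumes spec to the on-disk column names
    columns = [inverse.get(col, col) for col in consumes]
    df = dd.read_parquet(path, columns=columns)
    # rename so the transform method sees the names from the original spec
    return df.rename(columns=mapping)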

We might have to split the mapping of subsets and the mapping of fields

You mention the following:

> Mapping one source subset to different target subsets, or vice versa, is not allowed since it might lead to unexpected behavior and edge cases.

This is logical, so maybe we need to make this explicit: let the user define a subset mapping, and a mapping per subset. I'm not sure yet what this should exactly look like, but I think it would be a lot more robust.
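
Purely illustrative, one possible shape for making the subset mapping and the per-subset field mapping explicit (all names here are hypothetical):

# source subset -> target subset (one-to-one only)
subset_mapping = {"files": "text"}

# per target subset: source field -> target field
field_mappings = {
    "text": {"file_filename": "data"},
}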

@@ -91,16 +91,170 @@ def additional_fields(self) -> bool:
        return self._specification.get("additionalFields", True)


@dataclass
class ColumnMapping:

Member commented:

What's the benefit of this class compared to just using a mapping dict?

@mrchtr (Contributor) commented Oct 10, 2023

The dataclass is based on a personal wish of mine. I found it a bit confusing to define the column mapping using a dictionary: at first glance, it wasn't clear whether the key or the value of a dictionary entry represented the component column. With a dataclass, you can define it explicitly: [ColumnMapping(dataset_column=xxx, component_column=yyy), ..]

Even if you haven't read the documentation and are looking at a pipeline for the first time, it becomes clear which column is being mapped to which.

@PhilippeMoussalli (Contributor, Author) commented Oct 24, 2023

I agree with most of your points; indeed, there is some redundancy and some things can be simplified.

> Since we map the columns of the loaded dataframe to the ones in the original consumes spec before passing it to the transform method, the spec passed to the component will actually not match the data it receives.

Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.

> Components accessing the data dynamically based on the spec will fail because of this.

I think this only applies to generic components, but in any case the mapping there is not relevant, since most of them have a distinct mapping argument (we should maybe think about combining the two at some point for less confusion).

> This is logical, so maybe we need to make this explicit: let the user define a subset mapping, and a mapping per subset. I'm not sure yet what this should exactly look like, but I think it would be a lot more robust.

Based on your feedback, I defined a new schema for how this could look. It also shows what the new implementation would look like. Let me know what you think.

[screenshot: proposed schema splitting the subset mapping and the per-subset field mapping]

@PhilippeMoussalli (Contributor, Author)

One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.

[screenshot: remapping repeated in subsequent components when the produces section is not mapped]

@RobbeSneyders (Member)

> Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.

That's a good point, I think we currently use it for both. Maybe we should make this explicit, e.g. by letting the ComponentSpec have two different representations: ComponentSpec.io_spec and ComponentSpec.transform_spec. That way we can still encapsulate the related behavior in a single class, and use the relevant representation in each place.
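
A rough sketch of what the two representations could look like; only the property names io_spec and transform_spec come from the comment above, everything else is assumed:

from typing import Dict, Optional

class ComponentSpec:
    def __init__(self, specification: Dict, mapping: Optional[Dict] = None):
        self._specification = specification
        # maps field names in the spec to the names used in the stored dataset
        self._mapping = mapping or {}

    @property
    def transform_spec(self) -> Dict:
        # the spec as written: the names the transform method works with
        return self._specification

    @property
    def io_spec(self) -> Dict:
        # the spec with the mapping applied: the names to load from storage
        consumes = {
            self._mapping.get(name, name): schema
            for name, schema in self._specification.get("consumes", {}).items()
        }
        return {**self._specification, "consumes": consumes}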

> I think this only applies to generic components, but in any case the mapping there is not relevant, since most of them have a distinct mapping argument (we should maybe think about combining the two at some point for less confusion).

True, but we still pass the spec to each component, which can lead to confusion and might still be used in some cases we currently don't plan for.

> Based on your feedback, I defined a new schema for how this could look. It also shows what the new implementation would look like. Let me know what you think.

Looks good!

> One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.

If we can split the different functionalities of the ComponentSpec as mentioned above, I indeed see more benefits than downsides to adding produces mapping.

@PhilippeMoussalli (Contributor, Author)

>> Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transform method. I was considering it more as an indicator of which fields to load.
>
> That's a good point, I think we currently use it for both. Maybe we should make this explicit, e.g. by letting the ComponentSpec have two different representations: ComponentSpec.io_spec and ComponentSpec.transform_spec. That way we can still encapsulate the related behavior in a single class, and use the relevant representation in each place.

So the default ComponentSpec would be equivalent to the transform_spec, and the io_spec could be a class method that takes the original spec and modifies its specification based on the mapping spec.

The io_spec is used for loading/writing and for evolving the manifest. The purpose of the transform_spec, however, is to map the names of the dataframe columns before/after the transformation, but that is not achievable by itself, since by definition the spec does not contain a mapping to the io_spec. So we could represent the mapping between the two specs as an additional property and use that for the mapping. However, something similar can be achieved by just using the mapper and its inverse, in which case the only purpose of having both a transform and an io spec plus the mapping between them is a better representation of the mapping (without really enabling additional functionality that cannot be achieved by the suggestion above). Is that correct?

>> One other thing: I think that by not mapping the produces section, we might introduce more work on the user's side, since they will have to repeat the remapping for all subsequent components in their pipelines. Not sure how big of an issue this would be.
>
> If we can split the different functionalities of the ComponentSpec as mentioned above, I indeed see more benefits than downsides to adding produces mapping.

Alright, then I will also add a separate mapping for the produces section.

@PhilippeMoussalli (Contributor, Author)

Closed in favor of redesigning the subsets and fields: #567

@RobbeSneyders RobbeSneyders deleted the remapping-dictionary branch January 11, 2024 09:09