Enable mapping subsets and fields #352
Conversation
Thanks @PhilippeMoussalli. I think we should add a description and maybe a small example of how to use it to the documentation. I can imagine that this will be used a lot.
I have retested your modifications end-to-end using the constructed pipeline:
The subset and field mappings are working as expected. Personally, I found it a bit unintuitive to determine the correct spec_mappings. Maybe we could utilise a dataclass for defining the spec_mappings. This approach would allow us to name the properties appropriately, and we would get type hints and documentation within the code. E.g. something like this:
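The suggested dataclass might look something like this (a sketch only; the attribute names dataset_column and component_column are taken from the usage shown later in this discussion, everything else is assumed):

```python
from dataclasses import dataclass


@dataclass
class ColumnMapping:
    """Maps one column of the user's dataset to one column of the component spec."""
    dataset_column: str    # name of the column in the user's dataset
    component_column: str  # name expected by the component spec

# Example (hypothetical column names):
spec_mappings = [
    ColumnMapping(dataset_column="image_url", component_column="url"),
]
```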
Thanks for the feedback @mrchtr! I'm glad it works as expected.
Thanks @PhilippeMoussalli! Haven't been able to look at the code in detail yet, but I don't agree with the points above. I believe we should only map the input of the component and follow the original component specification for the output. I think this is a lot more intuitive, since you can still infer which data will be produced by the component by just looking at its component specification, while still providing the same flexibility.
Thanks for the feedback! The main reasoning behind the transient renaming is to avoid writing duplicate subsets: if we only rename the outputs, we will end up with duplicate subsets.
If a subset is defined in the
Force-pushed from 5bf20d2 to 69ebfa7
Force-pushed from 4242b00 to 0f4786a
Ok, that makes sense! I updated the PR accordingly.
Sorry for the delay on this @PhilippeMoussalli, but I'm a bit hesitant to merge until we have a better view on #244, as I believe the component spec manipulation in this PR might lead to issues when implementing that functionality.
Thanks @PhilippeMoussalli!
I only looked at the high level description before, so this is the first time that I dove into the code. A couple of remarks:
I'm not sure if updating the component spec is the best way to solve this.
I understand your reasoning:
In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.
But:
- Inside the component, we only need the mapping in the DataLoader:
  - Reverse map the original consumes spec to know which fields to pass to read_parquet
  - Map the columns of the loaded dataframe to the ones in the original consumes spec

  This should be quite easy to implement without updating the spec, by just using a mapping.
- Even though we now update the component spec, we still need to pass both the component spec and the mapping to the component, so we don't win a lot. I actually think it's a bit confusing.
- Since we map the columns of the loaded dataframe to the ones in the original consumes spec before passing it to the transform method, the spec passed to the component will actually not match the data it receives. Components accessing the data dynamically based on the spec will fail because of this.
- The other place where we need the mapping is during pipeline validation. Here I see the benefit of updating the component spec, but I don't think it outweighs the downsides mentioned above, and it shouldn't be too hard to take the mapping into account separately.
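The loader-side approach described above could be sketched as follows (plain Python over column-name lists rather than the actual DataLoader and its dataframes; the function names and the mapping direction, dataset column to component column, are assumptions):

```python
def resolve_read_columns(spec_columns, mapping):
    """Reverse-map the original consumes spec columns to the dataset's
    column names, to know which fields to pass to read_parquet."""
    reverse = {component: dataset for dataset, component in mapping.items()}
    return [reverse.get(col, col) for col in spec_columns]


def rename_to_spec(columns, mapping):
    """Map loaded dataset column names back to the names in the original
    consumes spec before handing the dataframe to transform."""
    return [mapping.get(col, col) for col in columns]
```

For example, with mapping {"image_url": "url"}, a spec requiring ["url", "width"] would read ["image_url", "width"] from the dataset, and the loaded columns would be renamed back to the spec's names. Unmapped columns pass through unchanged.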
We might have to split mapping the subsets and the fields
You mention the following:
Mapping one source subset to different target subsets or vice-versa is not allowed since it might lead to unexpected behavior and edge cases.
This is logical, so maybe we need to make this explicit: let the user define a subset mapping, and a field mapping per subset. I'm not sure yet exactly what this should look like, but I think it would be a lot more robust.
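Since the exact shape is still open, here is one hypothetical form such a two-level mapping could take (all names invented for illustration; a sketch, not a proposal from the PR itself):

```python
# Hypothetical two-level mapping: subsets first, then fields within each subset.
subset_mapping = {
    "pictures": "images",  # dataset subset -> component subset
}
field_mappings = {
    # per-subset field mapping: dataset field -> component field
    "pictures": {"raw": "data", "w": "width"},
}


def resolve(subset, field):
    """Resolve a dataset (subset, field) pair to the component's names.

    Anything not listed in the mappings passes through unchanged."""
    return (
        subset_mapping.get(subset, subset),
        field_mappings.get(subset, {}).get(field, field),
    )
```

Separating the two levels makes it structurally impossible to map one source subset to several target subsets, which is exactly the edge case the design wants to rule out.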
@@ -91,16 +91,170 @@ def additional_fields(self) -> bool:
        return self._specification.get("additionalFields", True)


@dataclass
class ColumnMapping:
What's the benefit of this class compared to just using a mapping dict?
The dataclass is based on a personal wish of mine. I found it a bit confusing when I had to define the column mapping using a dictionary. At first glance, it wasn't clear whether the key or the value of a dictionary entry represented the component column. With a dataclass, you can define it explicitly, as follows: [ColumnMapping(dataset_column=xxx, component_column=yyy), ...]
Even if you haven't read the documentation and are looking at a pipeline for the first time, it becomes clear which column is being mapped to which.
I agree with most of your points; indeed, there is some redundancy and some things can be simplified.
Yes indeed, that is if you look at the spec mainly as specifying the fields required for the transformation method. I was considering it more as an indicator of which fields to load.
I think this only applies to generic components, but in any case the mapping there is not relevant, since most of them have a distinct mapping argument (we should maybe think about combining the two at some point for less confusion).
I defined a new schema for how this could look based on your feedback. It also shows what the new implementation would look like. Let me know what you think!
That's a good point, I think we currently use it for both. Maybe we should make this explicit, e.g. by letting the
True, but we still pass the spec to each component, which can lead to confusion, and it might still be used in some cases we currently don't plan for.
Looks good!
If we can split the different functionalities of the
So then the default componentSpec would be equivalent to the
Alright, then I will also add a separate mapping for the produces section.
Closed in favor of redesigning the subsets and fields: #567
This PR enables remapping the column names of a user's dataset to match the names of a reusable/custom component spec.
Mapping one source subset to different target subsets or vice-versa is not allowed since it might lead to unexpected behavior and edge cases.
Mapping one source subset to a different target subset with a different schema is also not allowed and is checked during the static evaluation of the pipeline.
Remapping the column names is transient and occurs only within the mapped component.
In order to achieve the different mapping between the components, the mapping happens at the component spec level, since the component spec defines which datasets to load and how to evolve the manifest. This ensures that we don't have to remap in many places (static checking, manifest evolution), facilitates static pipeline evaluation, and helps keep the lineage more consistent.
We still need to remap the column names of the dataframe after loading (to be able to execute the transform function of the component) and before writing (to revert back to the original column names). This is better illustrated in the following figure.
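The load/write round trip described here reduces to applying a mapping and its inverse (a sketch over column-name lists, assuming the mapping is stored as dataset column to component column; names invented for illustration):

```python
mapping = {"image_url": "url"}  # dataset column -> component spec column (assumed direction)
inverse = {component: dataset for dataset, component in mapping.items()}


def after_load(columns):
    # Rename dataset columns to the spec names the transform function expects.
    return [mapping.get(c, c) for c in columns]


def before_write(columns):
    # Revert to the original dataset column names, keeping the renaming
    # transient to the mapped component.
    return [inverse.get(c, c) for c in columns]
```

Because before_write is the exact inverse of after_load, the dataset outside the component never observes the renamed columns, which is what keeps the lineage consistent.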