-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use evolve in Component class #56
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NielsRogge.
I would propose a different approach for the Dataset
class, see my comments.
d88c22f
to
2bc4bfa
Compare
I see some potential conflict with the join and merge subset PR's, how would you advice we proceed ? |
@GeorgesLorre this PR can be merged after the join and merge PRs |
1 similar comment
@GeorgesLorre this PR can be merged after the join and merge PRs |
50016c2
to
49b30be
Compare
evolved_manifest = self.copy() | ||
|
||
# Update `component_id` of the metadata | ||
component_id = component_spec.name.lower().replace(" ", "_") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we take this from the image
field directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the image
field currently isn't used, and should point to a URL like ghcr.io/ml6team/image_filtering:latest
, right?
In that case, we could split the string to get the component ID from that. For now I just used the name
property.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use it to build the kubeflow_component.yaml
example the FondantSpec so it should be properly defined for each component and pointing to a valid url.
Splitting it should be easy though so would rather do it now or add a TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What benefit would using the image
provide over using the name
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could include the docker tag of the images that you've used to run a certain a component in the manifest which can be used for versioning. (Figuring out what component version generated which manifest)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @NielsRogge! Some minor comments left.
fondant/component.py
Outdated
dataset = FondantDataset(manifest) | ||
dataset = self._process_dataset(dataset) | ||
input_manifest = self._load_or_create_manifest() | ||
dataset = FondantDataset(input_manifest) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this dataset is used by the LoadComponent
. We might be able to restructure a bit so we don't create it in this case.
@@ -74,27 +74,37 @@ def _load_or_create_manifest(self) -> Manifest: | |||
"""Abstract method that returns the dataset manifest""" | |||
|
|||
@abstractmethod | |||
def _process_dataset(self, dataset: FondantDataset) -> FondantDataset: | |||
def _process_dataset(self, dataset: FondantDataset) -> dd.DataFrame: | |||
"""Abstract method that processes the input dataframe and updates the Fondant Dataset | |||
with the new or loaded subsets""" | |||
|
|||
def run(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be nice to have a long descriptive docstring describing in-details what the run()
method is responsible for. Alternatively, you can include this doc at the class definition level (FondantComponent(ABC)
) to provide a more comprehensive explanation of the component's functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like we might need to refactor the run
and process_dataset
methods in a follow-up PR as the load component doesn't use the dataset
argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for taking this up Niels! Left a few comments mainly related to docstrings but the overall logic of handling the manifest evolve and component spec in the Component seems to be valid.
examples/pipelines/simple_pipeline/components/image_filtering/src/main.py
Show resolved
Hide resolved
Feel free to approve the PR, I've addressed all comments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work
This PR uses the `evolve` functionality (which updates the manifest based on a component spec) in the `Component` class. Fixes #45 --------- Co-authored-by: Niels Rogge <nielsrogge@Nielss-MacBook-Pro.local>
This PR uses the
evolve
functionality (which updates the manifest based on a component spec) in theComponent
class.Fixes #45