WG Data(name provisional) proposal #673

tarilabs · 2023-12-14T18:21:10Z

I'm following up on action item: raise WG proposal to Kubeflow per yesterday's Model Registry meeting (recording timestamp).

As discussed in KF community meeting.

Main links:

👉 I'm starting to raise a draft PR in order to "seed/bootstrap" the work of raising the request to form the WG. Using a draft PR gives us a branch we can collaborate on between stakeholders @andreyvelich @Tomcli @dhirajsb @rimolive

This also gives us a medium we can keep tabs on, so we can report back on progress during the Tuesday community plenary meetings, wdyt?

google-oss-prow · 2023-12-14T18:21:17Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: tarilabs
Once this PR has been reviewed and has the lgtm label, please assign theadactyl for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

thesuperzapper · 2023-12-14T20:07:34Z

I am very strongly opposed to using the name WG-Lifecycle, because that implies that the working group is related to the lifecycle of Kubeflow itself.

My proposal for the name is: WG-Data

Where "data" can mean both actual data (spark) and metadata (model registry). We can also split it up in the future, if the members who are maintaining these components diverge.

tarilabs · 2023-12-14T20:22:14Z

> My proposal for the name is: WG-Data

Very well noted, @thesuperzapper, as also marked here:
https://github.com/kubeflow/community/pull/673/files#diff-11b55409b3d27f083915bd4b910672caaf0e9550cf34d77fe76e8b6b9515023dR524

I just wanted a branch where we can start collecting this kind of feedback in one place, and also report back to you and the group on progress at the Tuesday meetings.

wgs.yaml

dhirajsb · 2023-12-14T20:26:18Z

@thesuperzapper how about we make it more explicit: WG ML Model Data?

thesuperzapper · 2023-12-14T20:33:08Z

As it currently stands, this WG does not meet the requirement for diverse leadership, given that all chairs come from one company (IBM, which owns Red Hat).

dhirajsb · 2023-12-14T20:36:27Z

@thesuperzapper Andrey is listed as a Chair, he's from Apple

tarilabs · 2023-12-14T20:37:04Z

I'm noticing only now that this was not marked as a Draft PR, despite that being my intent:

> using a draft PR give us a branch we can collaborate on

My sincerest apologies.

Marked as Draft PR per the original message in this thread.

rimolive · 2023-12-14T20:37:07Z

@thesuperzapper Is there a minimum number of companies to compose the chair to make the WG eligible?

thesuperzapper · 2023-12-14T20:52:25Z

While there is no specific number requirement, the steering committee (currently @jbottum and @james-jwu) must approve the new WG in line with the community's interests. I would expect at least some concern with having 4 leads from one company and only 1 from another.

For reference, here is the lifecycle and other info about forming a working group:

Also, there are only meant to be 2-3 chairs. Some other WGs have more, but in most cases there are 2 active members and we just need to formally clean up the inactive chairs.

thesuperzapper · 2023-12-14T20:59:06Z

Also, some of the proposed chairs are not even current Kubeflow org members, so are ineligible unless they go through that process first:

rimolive · 2023-12-14T21:03:52Z

Thank you for the references! Those are valid points though, and I'll see how we can work on the eligibility topic as well as your concerns.

tarilabs · 2023-12-14T21:08:56Z

As Ricardo noted, thanks!

Is there guidance on deputies who keep WG work going during leaves, please?
The reason for >3 chairs is that I was going through this point earlier today and, seeing that other WGs have >3, I assumed it was for that purpose.

As noted, I will work to account for all the feedback received; thank you, those points are very helpful.

andreyvelich · 2023-12-15T15:56:48Z

Thank you for starting this @tarilabs! Let's collaborate together on this PR for the WG Charter and Name.

Please share your suggestions on how we should name this WG, which will initially include the Spark Operator and Model Registry components.

A few initial suggestions if WG Lifecycle is too ambitious:

WG Data
WG ML Data
WG ML Lifecycle

> I would expect at least some concern with having 4 leads from one company and only 1 from another.

This is a valid concern, @thesuperzapper. We can add folks from the Spark Operator maintainers to this WG.
cc @mwielgus @vara-bonthu @yuchaoran2011

andreyvelich · 2023-12-15T15:59:26Z

cc @kubeflow/wg-training-leads
@kubeflow/wg-pipeline-leads
@kubeflow/wg-deployment-leads
@kubeflow/wg-notebooks-leads
@kubeflow/wg-manifests-leads

bigsur0 · 2023-12-15T18:20:37Z

I would request "WG ML Lifecycle" if the purpose of the group is to house things in the MLOps orbit that don't have a more specific working group yet, so they can "incubate". Data Preparation, Feature Store, and Model Registry are 3 examples that have been recently discussed and that likely aren't big enough yet to have their own working group. I guess one key aspect here is to consider how new efforts can happen without the overhead of setting up a new working group for each one until it is truly merited and bandwidth is available.

Is there a process that exists for refactoring a topic out of one working group to a new working group?

jbottum · 2023-12-18T23:02:11Z

Kubeflow seems to be entering a new growth phase. The community needs a structure to support add-on components (Spark, Ray, Model Registry, Feature Store, etc). We want to encourage contributors and users to meet, discuss, experiment, decide, store code and produce documentation, with the goal that integrations will help both Kubeflow and the add-on projects. We need to minimize overhead. We need to set expectations (of support...to/from Kubeflow and for users), especially if we are experimenting and trying to find market acceptance. Most importantly, we need active user participation, comment and leadership. I want to move this forward...I am a +1 to adding a single umbrella WG for all of these projects to get things moving. @james-jwu would you please provide your thoughts?

thesuperzapper · 2023-12-19T04:44:27Z

I think that the name WG Data will happily encompass the various categories proposed:

- distributed processing (Spark, Ray, etc.)
- model registry (unnamed Red Hat proposal)
- feature store (potentially Feast)

Also, WG Data follows the convention of being a single word, like all other working group names.

I am still very much against WG Lifecycle. At best it's like calling it WG Other, because the whole point of Kubeflow is to map across the MLOps lifecycle, so it's just confusing.

Separately from the discussion around names, I think we should confirm that the maintainers of these various components actually overlap; otherwise it will be difficult for this "mega working group" to function.

vara-bonthu · 2023-12-19T13:11:40Z

+1 to @thesuperzapper

I would suggest voting for WG Data, as it seems most appropriate for the Spark Operator. This is because it is primarily used for data processing, both batch and streaming, as well as some ML processing.

tarilabs · 2023-12-19T14:06:45Z

New commit ae188fe incorporates some of the feedback received:

- made it even more prominent that the name is provisional. I noted that more recent feedback (here and here) seems likely to converge on WG Data, but while this is still a draft there is a chance to account for all proposals, like here
- reflected that the name is provisional in the PR title
- reworked the designated chairs

I will keep everyone posted on any further updates during the KF Community meeting.

thesuperzapper · 2023-12-19T17:45:50Z

Just so we are clear, I think WG Data should be the name, not WG ML Data as the PR currently stands.


tarilabs · 2024-01-23T10:45:44Z

FYI, I've added a draft of the charter for this WG, at the suggestion of other members, with commit f77d17b.

According to the KF process, the Charter is to be submitted after:

> Add WG-related docs like charter.md, schedules, roadmaps, etc. to your new kubeflow/community/wg-foo directory once the above PR is merged

from here: https://github.com/kubeflow/community/blob/master/wgs/wg-lifecycle.md#:~:text=Add%20WG%2Drelated%20docs%20like%20charter.md%2C%20schedules%2C%20roadmaps%2C%20etc.%20to%20your%20new%20kubeflow/community/wg%2Dfoo%20directory%20once%20the%20above%20PR%20is%20merged

The group, however, pointed out that in more recent WG creations the Charter was submitted with the WG creation PR.
Example: #358

Therefore, I'm advancing the Charter proposal at once in this PR.
I'm going to "migrate" some comments as review comments on the Markdown.

tarilabs · 2024-01-23T10:46:31Z

wg-data/charter.md

+The WG "Data" is focused on enhancing the support for Data/metadata-related tasks within Kubeflow, with a specific focus on the [Spark operator](https://github.com/kubeflow/community/pull/672) and [Model Registry](https://github.com/kubeflow/kubeflow/issues/7396).
+The group aims to streamline data processing workflows, facilitate efficient data lifecycle and ML models' metadata management, while ensuring seamless integration with other Kubeflow components.
+
+An additional goal of the group is to offer a common ground for data/metadata-related topics in the MLOps orbit that didn't have a more specific working group yet, so they can "incubate as one", coherent effort.


incorporating from: https://github.com/kubeflow/community/pull/673#issuecomment-1858308169

tarilabs · 2024-01-23T10:47:00Z

wg-data/charter.md

+
+An additional goal of the group is to offer a common ground for data/metadata-related topics in the MLOps orbit that didn't have a more specific working group yet, so they can "incubate as one", coherent effort.
+
+For example: Data Preparation, Feature Store, and Model Registry have been recently discussed in the Kubeflow community while not mature enough yet to have their own working group, they can be nurtured together as part of this WG.


Data Preparation: I'm not 100% sure whether this might be confused with Notebooks? 🤔

@rimolive suggested s/Data Preparation/Big Data Processing/ or something like that.

tarilabs · 2024-01-23T10:48:48Z

wg-data/charter.md

+- Providing technical guidance and mentorship to contributors working on Spark operator, Model Registry, and the projects in scope of this WG.
+- Overseeing the technical direction of the subprojects and ensuring consistency with Kubeflow's vision for data processing and metadata management.
+
+### Deviations from [wg-governance]


refs https://github.com/kubeflow/community/blob/cd3a73a86e43bf43ceb72d1e0030f07ee1792e2a/wgs/templates/wg-charter-template.md?plain=1#L48-L52

tarilabs · 2024-01-23T10:49:05Z

wg-data/charter.md

+### Subproject Creation
+
+CHOOSE ONE
+1. WG Technical Leads
+2. Federation of Subprojects


To me, "WG Technical Leads" is the most applicable option, but I'm open to comments.

andreyvelich

Thank you for adding the initial charter, @tarilabs!
I left a few comments.

andreyvelich · 2024-01-23T17:46:24Z

wg-data/charter.md

+
+## Scope
+
+The WG "Data" is focused on enhancing the support for Data/metadata-related tasks within Kubeflow, with a specific focus on the [Spark operator](https://github.com/kubeflow/community/pull/672) and [Model Registry](https://github.com/kubeflow/kubeflow/issues/7396).


Do we want to mention specific tools in the Scope? What about other data processing frameworks like Dask and Ray? I'm not sure whether we need to mention them in scope.

andreyvelich · 2024-01-23T17:49:54Z

wg-data/charter.md

+The WG "Data" is focused on enhancing the support for Data/metadata-related tasks within Kubeflow, with a specific focus on the [Spark operator](https://github.com/kubeflow/community/pull/672) and [Model Registry](https://github.com/kubeflow/kubeflow/issues/7396).
+The group aims to streamline data processing workflows, facilitate efficient data lifecycle and ML models' metadata management, while ensuring seamless integration with other Kubeflow components.
+
+An additional goal of the group is to offer a common ground for data/metadata-related topics in the MLOps orbit that didn't have a more specific working group yet, so they can "incubate as one", coherent effort.


When we say data/metadata, what exactly do we mean here? What would the differences be from the ML perspective?

andreyvelich · 2024-01-23T17:56:18Z

wg-data/charter.md

+## Scope
+
+The WG "Data" is focused on enhancing the support for Data/metadata-related tasks within Kubeflow, with a specific focus on the [Spark operator](https://github.com/kubeflow/community/pull/672) and [Model Registry](https://github.com/kubeflow/kubeflow/issues/7396).
+The group aims to streamline data processing workflows, facilitate efficient data lifecycle and ML models' metadata management, while ensuring seamless integration with other Kubeflow components.


Should we elaborate a little on what "streamline data processing" means?
E.g. simplify and improve data processing between the various stages of the ML lifecycle, for example from Data Preparation to model training and fine-tuning.
cc @bigsur0 @kubeflow/wg-training-leads

andreyvelich · 2024-01-23T18:00:01Z

wg-data/charter.md

+- Onboarding and maintenance of the Spark operator for scalable and distributed data processing.
+[See also](https://github.com/kubeflow/community/pull/672)
+- Continued development of the Model Registry to manage and version machine learning models efficiently.
+[See also](https://github.com/kubeflow/kubeflow/issues/7396)


Let's add actual links to the repos here.
https://github.com/kubeflow/model-registry
https://github.com/kubeflow/spark-operator

andreyvelich · 2024-01-23T18:01:41Z

wg-data/charter.md

+
+### In scope
+
+#### Code, Binaries, and Services


I think we need to mention the APIs for SparkJob here, like TFJob.
@vara-bonthu @yuchaoran2011 @mwielgus Is there anything else that we are going to maintain from the Spark Operator side? For example, SDKs or UIs.

andreyvelich · 2024-01-23T18:05:33Z

wg-data/charter.md

+#### Cross-cutting and Externally Facing Processes
+
+- Ensuring seamless integration of these WG subprojects with the rest of the Kubeflow platform. For example:
+  - Coordinating with [wg-pipelines] for integrations of Model Registry with KFP.


Let's name it accordingly:

Suggested change

- Coordinating with [wg-pipelines] for integrations of Model Registry with KFP.

- Coordinating with WG Pipelines for integrations of Model Registry with KFP.

andreyvelich · 2024-01-23T18:05:44Z

wg-data/charter.md

+
+- Ensuring seamless integration of these WG subprojects with the rest of the Kubeflow platform. For example:
+  - Coordinating with [wg-pipelines] for integrations of Model Registry with KFP.
+  - Coordinating with [wg-serving] for integrations of Model Registry with KServe and ModelMesh.


Suggested change

- Coordinating with [wg-serving] for integrations of Model Registry with KServe and ModelMesh.

- Coordinating with WG Serving for integrations of Model Registry with KServe and ModelMesh.

andreyvelich · 2024-01-23T18:07:09Z

wg-data/charter.md

+- Ensuring seamless integration of these WG subprojects with the rest of the Kubeflow platform. For example:
+  - Coordinating with [wg-pipelines] for integrations of Model Registry with KFP.
+  - Coordinating with [wg-serving] for integrations of Model Registry with KServe and ModelMesh.
+  - ...


I think we can also add coordinating with WG Training to streamline passing ML training data between Spark and distributed ML training workers.
Like what we started to discuss here: kubeflow/training-operator#1923
WDYT @bigsur0?

andreyvelich · 2024-01-23T18:09:56Z

wg-data/charter.md

+1. WG Technical Leads
+2. Federation of Subprojects
+
+[wg-governance]: ../wg-governance.md


We need to fix the path for this doc.

andreyvelich · 2024-01-23T18:12:46Z

wgs.yaml

+    chairs:
+    - github: andreyvelich
+      name: Andrey Velichkevich
+      company: Apple
+    - github: rimolive
+      name: Ricardo Martinelli de Oliveira
+      company: Red Hat
+    - github: Tomcli
+      name: Tommy Li
+      company: IBM
+    tech_leads:
+    - github: dhirajsb
+      name: Dhiraj Bokde
+      company: Red Hat
+    - github: andreyvelich
+      name: Andrey Velichkevich
+      company: Apple


@vara-bonthu @yuchaoran2011 @mwielgus as maintainers of the Spark Operator component, do you want to be part of this WG as tech leads or chairs?
It's OK if we have different folks as tech leads and chairs.

create WG Lifecycle proposal

15d00ec

google-oss-prow bot requested review from james-jwu and theadactyl December 14, 2023 18:21

google-oss-prow bot added the size/M label Dec 14, 2023

tarilabs commented Dec 14, 2023


tarilabs marked this pull request as draft December 14, 2023 20:35

google-oss-prow bot added the do-not-merge/work-in-progress label Dec 14, 2023

implement draft feeback 1/n

ae188fe

tarilabs changed the title ~~WG Lifecycle proposal~~ WG Data(name provisional) proposal Dec 19, 2023

implement draft feeback 2/n

6213313

tarilabs mentioned this pull request Jan 3, 2024

Model Registry proposal (ref KF community meeting 20240102) #682

Open

rareddy mentioned this pull request Jan 5, 2024

Action items for adoption of Model Registry in Kubeflow #685

Open

9 tasks

google-oss-prow bot added size/L and removed size/M labels Jan 23, 2024

tarilabs commented Jan 23, 2024


andreyvelich reviewed Jan 23, 2024


tarilabs mentioned this pull request Jan 24, 2024

add tarilabs as member kubeflow/internal-acls#648

Merged

andreyvelich mentioned this pull request Apr 15, 2024

Add Data WG Team kubeflow/internal-acls#671

Merged


