[Release] Training Operator 1.8 Roadmap #1994

Open
9 of 11 tasks
andreyvelich opened this issue Jan 24, 2024 · 19 comments

Comments

@andreyvelich
Member

andreyvelich commented Jan 24, 2024

This is the tracking issue for the Training Operator 1.8 release.
The feature freeze date for the next Kubeflow 1.9 release is April 15th.

We are targeting the following features for Training Operator 1.8:

SDK

Backend

Misc

@deepanker13 @droctothorpe @tenzen-y @kubeflow/wg-training-leads @kuizhiqing @terrytangyuan @lowang-bh Please let me know of any items that we want to add for Training Operator 1.8.

cc @kubeflow/release-team

@andreyvelich
Member Author

@johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-tune API for LLMs?

@terrytangyuan
Member

terrytangyuan commented Jan 25, 2024

I'd like to get #1953 merged as well. I think the risk is pretty low.

@StefanoFioravanzo
Member

@andreyvelich thanks for putting this together. On the "Misc: Improve docs for the training operator" item: if you can start a separate issue highlighting known issues, doc areas to be improved, or particular topics you want to address, we can start coordinating with the release team doc leads as well to get some help.

I would suggest having a separate issue for autogen APIs, in case you want to address that as well.

@andreyvelich
Member Author

@terrytangyuan Sure, can we discuss the MXJob deprecation plan at the next AutoML and Training WG meeting?
I think it would be better to remove support for MXJob over two releases. For example, in the 1.8 release we will inform users that MXJob will be removed in the next version, and when we release 1.9 we will remove MXJob.
That should give users sufficient time to migrate, even though MXNet has already been archived.
WDYT @kubeflow/wg-training-leads @tenzen-y?

if you can start a separate issue highlighting known issues

@StefanoFioravanzo Sure, I will create an issue based on the tasks that we discussed on the last call.
Also, I will create an issue for SDK doc autogen.

@tenzen-y
Member

First of all, as I mentioned here: kubeflow/katib#2255 (comment), I would suggest supporting Kubernetes v1.27-v1.29.

Also, moving #1906 forward would be better. It probably isn't possible to complete all the tasks, but I think we will be able to get some results.

@tenzen-y
Member

tenzen-y commented Jan 25, 2024

I think it would be better to remove support for MXJob over two releases. For example, in the 1.8 release we will inform users that MXJob will be removed in the next version, and when we release 1.9 we will remove MXJob.
That should give users sufficient time to migrate, even though MXNet has already been archived.
WDYT @kubeflow/wg-training-leads @tenzen-y?

SGTM. We can say that we won't do any maintenance for MXJob during one release, which means it is deprecated.
Creating a dedicated issue would be better.

@terrytangyuan
Member

@andreyvelich Sounds good

@andreyvelich
Member Author

First of all, as I mentioned here: kubeflow/katib#2255 (comment), I would suggest supporting Kubernetes v1.27-v1.29.

That's a good point about the Kubernetes version, @tenzen-y!
I agree that 1.27 - 1.29 should be our target.
@kubeflow/release-team What do you think about targeting support for Kubernetes 1.27 - 1.29 in the Kubeflow 1.9 release?

@andreyvelich pinned this issue Jan 29, 2024
@tenzen-y
Member

Ah, I found the features that we dropped from the previous release due to the release deadline.

Can we add the following to improve UX:

@andreyvelich
Member Author

I just had a discussion with @kubeflow/release-managers on Kubernetes versions.
We are going to target Kubernetes 1.27 - 1.29 for the next release of Training Operator.

@tenzen-y
Member

I just had a discussion with @kubeflow/release-managers on Kubernetes versions. We are going to target Kubernetes 1.27 - 1.29 for the next release of Training Operator.

That's a nice update! Thank you!

@deepanker13
Contributor

@johnugeorge @deepanker13 Do we need to create a tracking issue with the remaining items for the Train/Fine-tune API for LLMs?

Okay, I will create one.

@StefanoFioravanzo
Member

StefanoFioravanzo commented Feb 28, 2024

Hello @kubeflow/wg-training-leads, this is a kind reminder that Monday, March 4th will be our Kubeflow 1.9 release development checkpoint. We will be halfway through our dev cycle, and we expect most of the work to be well underway (reminder: code freeze is scheduled for Apr 15th).

Can you please acknowledge your status with respect to your roadmap, comment on the progress made so far, and provide an assessment of the work that remains?

(understandably) Not everything may be completed in time. Please proactively let the release team know if there are delays, blockers, or uncertain situations, so that we can align expectations and try to help you out, if possible.

@satishpasumarthi

satishpasumarthi commented Apr 22, 2024

Hi! When is v1.8 planned for release? Some managed k8s versions (e.g. on EKS) reach end of support very soon (July 24, 2024): https://docs.aws.amazon.com/eks/latest/userguide/kubernetes-versions.html#kubernetes-release-calendar
So this release is very important for k8s users who plan to migrate. Is there a tentative timeline? Please advise.
@StefanoFioravanzo @andreyvelich

@andreyvelich
Member Author

Hi @satishpasumarthi, we are planning to make the first RC.0 for Training Operator v1.8 this week.
We will support Kubernetes v1.27-1.29 in that release.

@satishpasumarthi

Hi @satishpasumarthi, we are planning to make the first RC.0 for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.

Thanks for the reply, @andreyvelich. I only see PRs for supporting v1.28 and v1.29 (#2039 and #2038). My understanding was that v1.27 is already supported in v1.7. Please correct me if I am mistaken.

@tenzen-y
Member

Hi @satishpasumarthi, we are planning to make the first RC.0 for Training Operator v1.8 this week. We will support Kubernetes v1.27-1.29 in that release.

Thanks for the reply, @andreyvelich. I only see PRs for supporting v1.28 and v1.29 (#2039 and #2038). My understanding was that v1.27 is already supported in v1.7. Please correct me if I am mistaken.

@satishpasumarthi You're correct.
In v1.7, the training-operator supports v1.25-v1.27. In v1.8, the training-operator will support v1.27-v1.29.
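
For anyone planning a cluster upgrade, the support ranges stated above can be written down as a small lookup. This is only a sketch: the table holds just the two ranges mentioned in this thread, and the names `SUPPORTED_K8S`, `parse_minor`, `is_supported`, and `releases_supporting` are hypothetical helpers, not part of the training-operator or its SDK.

```python
# Assumption: ranges are copied from this thread only
# (v1.7 -> Kubernetes 1.25-1.27, v1.8 -> Kubernetes 1.27-1.29).
SUPPORTED_K8S = {
    "v1.7": ((1, 25), (1, 27)),
    "v1.8": ((1, 27), (1, 29)),
}


def parse_minor(version: str) -> tuple[int, int]:
    """Parse 'v1.28', '1.28', or '1.28.3' into a (major, minor) tuple."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(operator_release: str, k8s_version: str) -> bool:
    """Return True if the Kubernetes version falls inside the stated range."""
    low, high = SUPPORTED_K8S[operator_release]
    return low <= parse_minor(k8s_version) <= high


def releases_supporting(k8s_version: str) -> list[str]:
    """List the operator releases whose stated range covers this version."""
    return [rel for rel in SUPPORTED_K8S if is_supported(rel, k8s_version)]
```

Under these assumptions, Kubernetes 1.27 is the only version covered by both releases, which makes it the natural stepping stone when moving from v1.7 to v1.8.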

@rimolive
Member

Is there anything missing to cut the release? We want to start the manifests sync for training-operator for Kubeflow 1.9.0-rc0

@tenzen-y
Member

Is there anything missing to cut the release? We want to start the manifests sync for training-operator for Kubeflow 1.9.0-rc0

Not yet. Johnu will prepare the release today.


7 participants