This repository has been archived by the owner on Aug 17, 2023. It is now read-only.

Upgrades in 1.1 should follow kustomize off the shelf workflow #304

Open
jlewi opened this issue Apr 11, 2020 · 3 comments

Comments

@jlewi
Contributor

jlewi commented Apr 11, 2020

Filing this issue to track simplifying the upgrade process in Kubeflow 1.1.

Here are the current instructions for how Kubeflow upgrades are done:
https://www.kubeflow.org/docs/upgrading/upgrade/

This differs from the standard off-the-shelf workflow for kustomize applications:
https://github.com/kubernetes-sigs/kustomize/blob/master/docs/workflows.md#off-the-shelf-configuration

In particular, we introduce a KFUpgrade resource which defines pointers to the old and new KFDef.
https://www.kubeflow.org/docs/upgrading/upgrade/#upgrade-instructions

kfctl then does a lot of magic to try to reapply any user-defined kustomizations on top of the new configs.

With the new kustomize patterns (http://bit.ly/kf_kustomize_v3) we should be able to simplify this and, I think, eliminate the need for kfctl. Instead users should be able to just:

  1. Update .cache to point to a new version of the kubeflow/manifests directory
  2. Run kustomize build to regenerate the package.

This is because, with the new stacks pattern, kfctl generates the kustomize package using the Kubeflow-defined packages in .cache as the base. So a user can regenerate .cache without losing any of their kustomizations; a rough sketch of that workflow is below.
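
As a minimal sketch, assuming a stack-based deployment whose kustomization.yaml uses .cache/manifests as its base (the v1.1.0 tag and the directory names here are illustrative, not the real layout):

  # 1. Point .cache at the new version of kubeflow/manifests
  #    (the tag and the .cache/manifests path are just examples).
  rm -rf .cache/manifests
  git clone --branch v1.1.0 https://github.com/kubeflow/manifests.git .cache/manifests

  # 2. Rebuild the kustomize package; user overlays and patches layered on
  #    the stack are preserved because only the base in .cache changed.
  kustomize build kustomize/ > kubeflow.yaml

  # 3. Apply the regenerated manifests.
  kubectl apply -f kubeflow.yaml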

There are a couple of issues that we run into when applying the updated manifests:

  1. Pruning - how do we clean up resources from earlier versions of Kubeflow that are no longer present in the latest Kubeflow manifests?
  2. Updating immutable fields - Certain fields are immutable and will cause errors when apply is called.

Rather than rely on kfctl logic to solve these problems, we should follow a shift-left pattern: our expectation should be that we rely on existing tools (e.g. kubectl, kpt, etc.) to apply the manifests and handle these problems.

kpt, for example, supports pruning (see the sketch below).
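
A rough sketch of how that could look with kpt's live commands (treat the package ref, directory name, and exact CLI behavior as assumptions, not a vetted recipe):

  # Fetch the package and initialize inventory tracking; the inventory is
  # what lets kpt work out which previously applied resources to prune.
  kpt pkg get https://github.com/kubeflow/manifests.git@v1.0.2 kubeflow/
  kpt live init kubeflow/

  # Apply the package; when an updated package is applied later, resources
  # that no longer appear in it are pruned.
  kpt live apply kubeflow/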

/cc @richardsliu @yanniszark @kunmingg

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.72


@jlewi
Contributor Author

jlewi commented Apr 20, 2020

As noted in kubeflow/kubeflow#4873, kustomize commonLabels should only be used for immutable labels
https://kubectl.docs.kubernetes.io/pages/app_management/labels_and_annotations.html

This is because these labels are used in selectors, which are immutable.

Right now our applications include the version in both the version and instance labels, which are used in selectors and set via commonLabels:
https://github.com/kubeflow/manifests/blob/abc6898ba535515e88846e7cc97faa208ffdacb9/jupyter/jupyter-web-app/overlays/application/kustomization.yaml#L11

We need to fix this so that labels will be immutable across version updates.
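
As a rough illustration (this is not the actual tooling, just a hypothetical check), something along these lines could flag kustomizations that still set the mutable labels via commonLabels:

  # List kustomization.yaml files that both define commonLabels and mention
  # the version/instance labels; these are candidates for manual review.
  grep -rl --include=kustomization.yaml 'commonLabels' . \
    | xargs grep -l -e 'app.kubernetes.io/version' -e 'app.kubernetes.io/instance'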

It looks like kubectl has alpha support in prune.

  # Note: --prune is still in Alpha
  # Apply the configuration in manifest.yaml that matches label app=nginx and
  # delete all the other resources that are not in the file and match label app=nginx.
  kubectl apply --prune -f manifest.yaml -l app=nginx

So if we have appropriate, immutable labels for each application, then we should be able to use kubectl to apply a new version and prune any removed resources.
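
Roughly, upgrading a single application could then look like this (the jupyter-web-app paths and the app.kubernetes.io/name label are only examples of a package and of a label that stays constant across versions):

  # Rebuild the manifests from the new version of the package.
  kustomize build jupyter/jupyter-web-app/overlays/application > jupyter-web-app.yaml

  # Apply the new version; --prune (alpha) deletes resources that match the
  # label selector but are no longer present in the regenerated manifests.
  kubectl apply --prune -l app.kubernetes.io/name=jupyter-web-app -f jupyter-web-app.yaml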

jlewi pushed a commit to jlewi/manifests that referenced this issue May 1, 2020
* Fix kubeflow#1131

* kustomize commonLabels get substituted into selector fields. Selector fields
  are immutable. So if commonLabels change (e.g. between versions) then
  we can't reapply/update the existing resources, which breaks upgrades
  (kubeflow/kfctl#304)

* For the most part the problematic commonLabels were on our Application
  resources. The following labels were being set:

  "app.kubernetes.io/version"
  "app.kubernetes.io/instance"
  "app.kubernetes.io/managed-by"
  "app.kubernetes.io/part-of"

* Version was definitely changing between versions. Instance was also changing
  between versions to include the version number.

* managed-by and part-of could also change (e.g. we may not be using kfctl)
* We could still set these labels if we wanted to; we just shouldn't set
  them as commonLabels and/or include them in the selector, as they will
  inhibit upgrades with kubectl apply.

* I created a test validate_resources_test.go to ensure none of these
  labels are included in commonLabels

* I created a simple go binary tools/fix_common_labels.go to update
  all the resources.

* generat_tests.py - Delete the code to remove unmatched tests.
  * We no longer generate tests that way and the delete code was going
    to delete valid tests like our new validation test

* Get rid of the clean rule in the Makefile for the same reason.
k8s-ci-robot pushed a commit to kubeflow/manifests that referenced this issue May 1, 2020
@jbottum

jbottum commented May 21, 2020

@jlewi Hey Jeremy - will this feature be included in Kubeflow 1.1?

jlewi pushed a commit to jlewi/kfctl that referenced this issue Jun 5, 2020
* This is GCP-specific code that allows CloudEndpoints to be created using
  the CloudEndpoint controller. A Cloud Endpoint is a KRM-style resource,
  so we can just have `kfctl apply -f {path}` invoke the appropriate
  logic.

* For GCP this addresses GoogleCloudPlatform/kubeflow-distribution#36; specifically, when
  deploying private GKE the CloudEndpoints controller won't be able
  to contact the servicemanagement API. This provides a workaround
  by running it locally.

* This pattern seems extensible; i.e. other platforms could link in
  code to handle CRs specific to their platforms. This could basically
  be an alternative to plugins.

* I added a context flag to control the kubecontext that apply applies to.
  Unfortunately, it doesn't look like there is an easy way to use
  that in the context of applying KFDef. It looks like the current logic
  assumes the cluster will be added to the KFDef metadata and then
  looks up that cluster in .kubeconfig.

  * Modifying that logic to support the context flag seemed riskier
    than simply adding a comment to the flag.

* Added some warnings that KFUpgrade is deprecated since per kubeflow#304
  we want to follow the off-the-shelf workflow.
jlewi pushed a commit to jlewi/kfctl that referenced this issue Jun 5, 2020
k8s-ci-robot pushed a commit that referenced this issue Jun 6, 2020
vpavlin pushed a commit to vpavlin/kfctl that referenced this issue Jul 10, 2020
vpavlin pushed a commit to vpavlin/kfctl that referenced this issue Jul 20, 2020
vpavlin pushed a commit to vpavlin/kfctl that referenced this issue Jul 22, 2020
vpavlin pushed a commit to vpavlin/kfctl that referenced this issue Jul 22, 2020
crobby pushed a commit to crobby/kfctl that referenced this issue Feb 25, 2021
jbottum added a commit to jbottum/kubeflow that referenced this issue Feb 26, 2021
Per Yuan, I deleted - * Process and tools for upgrades from Release N-1 to N i.e. 1.0.x to 1.1, [kubeflow#304](kubeflow/kfctl#304)
Per James, I added - * Manage recurring Runs via new “Jobs” page (exact name on UI is TBD)
jbottum added a commit to jbottum/kubeflow that referenced this issue Mar 3, 2021
google-oss-robot pushed a commit to kubeflow/kubeflow that referenced this issue Mar 4, 2021
* Update ROADMAP.md

I updated Kubeflow 1.1, and added Kubeflow 1.2 and Kubeflow 1.3 roadmap items.

* Update ROADMAP.md

Improved wording of features to simplify understanding

* Update ROADMAP.md

Added details on KFServing 0.5 enhancements

* Update ROADMAP.md

Updated the notebooks section in Kubeflow 1.3 with these modifications:

* Notebooks
  * Important backend updates to Notebooks (i.e. to improve interop with Tensorboard)
  * New and expanded Jupyter Notebook stack along with easy to customize common base images
  * Addition of R-Studio and Code-Server (VS-Code) support

* Update ROADMAP.md

Reorganized Working Group updates into the 1st section. Added that customizing the Jupyter base image is a stretch feature.

* Update ROADMAP.md

Per Yuan, I deleted - * Process and tools for upgrades from Release N-1 to N i.e. 1.0.x to 1.1, [#304](kubeflow/kfctl#304)
Per James, I added - * Manage recurring Runs via new “Jobs” page (exact name on UI is TBD)

* Update ROADMAP.md

Added Multi-Model Serving, https://github.com/yuzliu/kfserving/blob/master/docs/MULTIMODELSERVING_GUIDE.md to KFServing 0.5 roadmap items
juliusvonkohout pushed a commit to juliusvonkohout/kubeflow that referenced this issue Mar 7, 2021
Subreptivus pushed a commit to equinor/kubeflow that referenced this issue Mar 10, 2021
Projects: Kubeflow 1.1 · To do