
Kubeflow release process and customization of v1.2.0+ installation #5440

Closed
sylus opened this issue Nov 27, 2020 · 9 comments


sylus commented Nov 27, 2020

/kind bug

Hello amazing Kubeflow community!

We’ve been using Kubeflow in an experimental phase since before 1.0, and pretty heavily in our Data Analytics as a Service environment after the 1.0 release, and it has been working quite well! Very recently we attempted to upgrade to the 1.1.0 release and encountered a bunch of workflow issues.

For reasons specific to our environment, we have to customize the installation of Kubeflow. With the 1.0 release of Kubeflow, we were able to accomplish this as follows (a rough sketch of these commands appears after the repository link below):

  1. Modified the kfctl_k8s_istio.yaml
  2. Ran kfctl build -f kfctl_k8s_istio.yaml
  3. And performed customizations by modifying manifests in the kustomize folder

https://github.com/statcan/kubeflow-manifest
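
For context, here is a rough sketch of that 1.0-era workflow in commands (assuming the standard kfctl flags; our exact invocation may have differed slightly):

# Generate the full manifests locally from the (modified) KfDef config.
kfctl build -V -f kfctl_k8s_istio.yaml
# Edit the generated manifests under ./kustomize/ as needed, then apply.
kfctl apply -V -f kfctl_k8s_istio.yaml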

We first attempted the upgrade to 1.1.0 by following the same process, re-applying our modifications and aligning them with what we needed. However, after deployment, we found out that the manifests in the 1.1.0 tag are not actually the latest manifests for the 1.1.0 release; they were instead made available after the release (I believe in August, based on kubeflow/manifests#1364 (comment)). Moreover, when attempting to re-apply our modifications to the manifests, the new structure of the 1.1 release manifests does not allow for the same process. We don’t want to modify the .cache folder (and in fact, we don’t commit the .cache folder) for obvious reasons.

Figuring that we were incorrectly modifying the release manifests, we researched further how Kubeflow recommends modifying a release. We found https://www.kubeflow.org/docs/other-guides/kustomize/, which doesn't really explain how to actually make changes (it gives some “ideas” but no examples).

Our workflow depended on the full output into the kustomize folder, as was done in 1.0, and we no longer get that output to modify. We started looking for docs on this and found https://developers.redhat.com/blog/2020/07/23/open-data-hub-and-kubeflow-installation-customization/, which documents some ways to customize the install involving a fork and ways of working with it. Our preferred method is #3, which is to use overlays to apply configuration. But according to kubeflow/kfctl#402, the use of overlays has been deprecated.

We were able to add a component, say for example by creating a dex folder, adding a configmap, and then doing a strategic merge. We can then run kustomize build . on just that specific component to confirm our replacement works, so kustomize build kustomize/dex works. However, when we try kustomize build kustomize/kubeflow-apps, we get an error rendering the manifests:

Error: accumulating resources: accumulateFile "accumulating resources from '../../.cache/manifests/manifests-1.1-branch/stacks/azure': '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/stacks/azure' must resolve to a file", accumulateDirector: "recursed accumulation of path '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/stacks/azure': accumulating resources: accumulateFile \"accumulating resources from '../../admission-webhook/webhook/v3': '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/admission-webhook/webhook/v3' must resolve to a file\", accumulateDirector: \"recursed accumulation of path '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/admission-webhook/webhook/v3': accumulating resources: accumulateFile \\\"accumulating resources from '../overlays/application/application.yaml': security; file '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/admission-webhook/webhook/overlays/application/application.yaml' is not in or below '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/admission-webhook/webhook/v3'\\\", loader.New \\\"Error loading ../overlays/application/application.yaml with git: url lacks host: ../overlays/application/application.yaml, dir: got file 'application.yaml', but '/Users/XXXXX/Desktop/kubeflow/daaas/.cache/manifests/manifests-1.1-branch/admission-webhook/webhook/overlays/application/application.yaml' must be a directory to be a root, get: invalid source string: ../overlays/application/application.yaml\\\"\""
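
For reference, here is a minimal sketch of the kind of per-component overlay described above, using dex as the example (the base path and file names are illustrative, not our exact files):

kustomize/dex/kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Pull the upstream dex component from the kfctl cache as the base.
resources:
- ../../.cache/manifests/manifests-1.1-branch/dex-auth/dex-crds/base
# Strategic-merge patch that swaps in our own dex ConfigMap.
patchesStrategicMerge:
- config-map.yaml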

Therefore, I think our questions are as follows:

  1. What is the officially supported way of applying customizations to the install?
  2. With kustomize, how can we preview the changes of our overlays in kubeflow-apps like we were able to do with the other components, given that kustomize build kustomize/kubeflow-apps doesn't work?
  3. We are also a bit confused about where to find the manifests associated with a release and a fixed set of artifacts. Sometimes the manifests reference 1.1.0 and other times the v1.1 branch. Which manifests are we supposed to use? We would expect to pull them from the tagged release in kubeflow/manifests (1.1.0).

Thank you in advance for your time and please reach out if you have any questions.

Note: You can see a list of all the customizations we want to make that were done without issue in Kubeflow 1.0.0 here: StatCan#11

Environment:

  • Kubeflow version: (version number can be found at the bottom left corner of the Kubeflow dashboard):
  • kfctl version: (use kfctl version): 1.1.0
  • Kubernetes platform: (e.g. minikube): 1.16.15
  • Kubernetes version: (use kubectl version): 1.16.15
  • OS (e.g. from /etc/os-release): Ubuntu 16.04

sylus commented Nov 27, 2020

@aronchick is it possible you would be able to assist with this? :D


sylus commented Dec 1, 2020

So I think I might understand this a bit better now after going through it :D

Some of the confusion arose because pre-v1.1.0 the full output was written to the kustomize folder, whereas in v1.1.0+ we are encouraged to express only the delta via strategic merges, etc. With that understanding I was able to get all of our custom configuration working on the v1.2.0 line, though in order to make our customizations we had to explicitly reference all of the components that the kubeflow-apps entry (which calls stacks/azure) was using.

https://github.com/StatCan/kubeflow-manifest/tree/feat-kubeflow

The biggest problem I was having was how to test the delta, so what I ended up doing was running the following command and temporarily storing the output in git while performing the adjustments. This let me see the delta and confirm things were working correctly.

kfctl build -V -f ./kfctl_azure.yaml -d > output.yaml
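
A quick sketch of that loop (assuming output.yaml is temporarily tracked in git):

# Re-render the manifests after each change to the overlays...
kfctl build -V -f ./kfctl_azure.yaml -d > output.yaml
# ...then compare against the previously committed render to see the delta.
git diff output.yaml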

Here are the contents of kfctl_azure.yaml, which allows us to store our overrides in the kustomize folder:

SEE BELOW


sylus commented Dec 2, 2020

I was actually able to improve upon the above. We still needed some vars from the top level to be initialized first, so we made our own “stack”, similar to how the Azure one functions, except with our customizations and improvements. It also does things the "kustomize way", and we can definitely see the benefits: our kfctl file is now really small and the stack does most of the grunt work. The nice thing is that inheritance works well, and it is really easy to modify the defaults now without needing a tool like yq for everything.

We also got ours working on Azure, with multi-user pipelines, and under Istio.

https://github.com/StatCan/kubeflow-manifest/tree/feat-kubeflow

Build from our KfDef file:

kfctl build -V -f kfctl_azure.yaml

To check the output:

kfctl build -V -f ./kfctl_azure.yaml -d > output.yaml

Contents of kfctl_azure.yaml:
apiVersion: kfdef.apps.kubeflow.org/v1
kind: KfDef
metadata:
  clusterName: k8s-cancentral-01-default-aks
  name: kubeflowmanifests
  namespace: kubeflow
spec:
  applications:
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: namespaces/base
    name: namespaces
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: application/v3
    name: application
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: metacontroller/base
    name: metacontroller
  - kustomizeConfig:
      repoRef:
        name: daaas
        path: stacks/daaas
    name: kubeflow-apps
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: knative/installs/generic
    name: knative
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: kfserving/installs/generic
    name: kfserving
  - kustomizeConfig:
      repoRef:
        name: manifests
        path: stacks/azure/application/oidc-authservice
    name: oidc-authservice
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v1.2-branch.tar.gz
  version: v1.2-branch
status:
  reposCache:
  - localPath: '".cache/manifests/manifests-1.2-branch"'
    name: manifests
  - localPath: '"kustomize"'
    name: daaas

This in turn calls our “DAaaS” stack along with other application-level components such as OIDC, which in turn call the Kubeflow components in .cache with our overrides:

.
├── application
│   ├── admission-webhook
│   │   └── kustomization.yaml
│   ├── argo
│   │   ├── cluster-role.yaml
│   │   ├── config-map.yaml
│   │   ├── kustomization.yaml
│   │   └── service.yaml
│   ├── centraldashboard
│   │   ├── deployment.yaml
│   │   └── kustomization.yaml
│   ├── jupyter-web-app
│   │   ├── cluster-role.yaml
│   │   ├── configs
│   │   │   └── spawner_ui_config.yaml
│   │   ├── deployment.yaml
│   │   ├── deployment_patch.yaml
│   │   ├── kustomization.yaml
│   │   └── params.env
│   ├── katib
│   │   └── kustomization.yaml
│   ├── kubeflow-roles
│   │   └── kustomization.yaml
│   ├── kustomization.yaml
│   ├── metadata
│   │   └── kustomization.yaml
│   ├── minio
│   │   ├── kustomization.yaml
│   │   └── persistent-volume-claim.yaml
│   ├── mpi-job
│   │   └── kustomization.yaml
│   ├── mxnet-job
│   │   └── kustomization.yaml
│   ├── mysql
│   │   └── kustomization.yaml
│   ├── notebook-controller
│   │   ├── deployment.yaml
│   │   └── kustomization.yaml
│   ├── pipeline
│   │   ├── deployment.yaml
│   │   ├── kustomization.yaml
│   │   └── service.yaml
│   ├── profiles
│   │   ├── deployment.yaml
│   │   └── kustomization.yaml
│   ├── pytorch-job
│   │   └── kustomization.yaml
│   ├── spark-operator
│   │   └── kustomization.yaml
│   └── tf-training
│       └── kustomization.yaml
├── kfserving
│   └── kustomization.yaml
├── knative
│   └── kustomization.yaml
├── kubeflow-apps
│   └── kustomization.yaml
├── metacontroller
│   └── kustomization.yaml
├── namespaces
│   ├── kustomization.yaml
│   └── namespace_patch.json
├── oidc-authservice
│   ├── configmap.yaml
│   ├── envoy-filter.yaml
│   ├── kustomization.yaml
│   └── statefulset.yaml
└── stacks
    └── daaas
        ├── config
        │   └── params.env
        └── kustomization.yaml
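
For illustration, a minimal sketch of what a stack kustomization such as stacks/daaas/kustomization.yaml might contain (the component path and the params.env generator below are assumptions based on the tree above, not our exact file):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
# Compose the application-level overlays defined alongside this stack.
resources:
- ../../application
# Shared parameters (e.g. OIDC settings) interpolated before the build.
configMapGenerator:
- name: daaas-config
  envs:
  - config/params.env
generatorOptions:
  disableNameSuffixHash: true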

Then, with the above, I can still use our envsubst and yq replacements on the kustomize folder to interpolate our OIDC parameters through a GitHub Actions PR.
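
As a rough sketch of that interpolation step (the variable, field, and file names here are hypothetical):

# Hypothetical CI step: fill in OIDC parameters before kfctl build runs.
export OIDC_CLIENT_ID="my-client-id"
export OIDC_ISSUER_URL="https://login.example.com/oidc"
envsubst < kustomize/stacks/daaas/config/params.env.tpl > kustomize/stacks/daaas/config/params.env
# Or patch a single value in place (yq v4 syntax shown here).
yq eval -i '.data.OIDC_PROVIDER = strenv(OIDC_ISSUER_URL)' kustomize/oidc-authservice/configmap.yaml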

I think my questions now get reduced to the following:

  1. Does this look correct, and is this the officially supported way of applying customizations to the install?
  2. I noticed that the KfDef now references the manifests v1.2 branch instead of a stable tag; is it okay that we are doing that?

@sylus sylus changed the title Kubeflow release process and customization of v1.1.0+ installation Kubeflow release process and customization of v1.2.0+ installation Dec 3, 2020

berndverst commented Dec 15, 2020

@sylus -- the manifests v1.2 changes (and initial issues) are my fault. I noticed only 2 weeks before the v1.2 release that a release was upcoming and was scrambling to get some things out. We don't have any dedicated person at Microsoft focused on Kubeflow, so I took this on myself. Let's just say it has been a bit of a steep learning curve. The changes you observed above with the .cache folder, for example, were new to me as well.

The Azure v1.2 manifests in the v1.2 branch point to a fixed commit. This is not unlike pointing at a tag, which also points at a fixed commit. Prior to Kubeflow v1.2, all manifests for all platforms simply pointed at the branch without pinning any commit.

Also FYI, just now I made another change, as I noticed that all along Knative and KFServing were deploying to the wrong namespaces (kubeflow/manifests#1698).

Now, Knative may not actually work for another reason I just learned about (kubeflow/kfctl#462), but that's an issue for another day.


slenky commented Dec 15, 2020

Regarding testing your changes on specific modules, can you please try kustomize build stacks/azure --load_restrictor none? It does the trick for me.
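
Note: the flag spelling depends on the kustomize version. With kustomize v4+, the equivalent should be:

kustomize build stacks/azure --load-restrictor LoadRestrictionsNone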


slenky commented Dec 15, 2020

> Now Knative may not actually work for another reason I just learned about... kubeflow/kfctl#462 but that's an issue for another day

After each execution of kfctl on Azure, just use kubectl label ns kubeflow control-plane- --overwrite (the trailing dash removes the control-plane label from the namespace).

berndverst (Member) commented:

> Now Knative may not actually work for another reason I just learned about... kubeflow/kfctl#462 but that's an issue for another day
>
> After each execution of kfctl on Azure just use kubectl label ns kubeflow control-plane- --overwrite

Good tip!

I think I also know how to contribute the ability to define a custom Kubeflow control-plane selector label to the upstream project. I might do so in the coming weeks if I have time.


stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the lifecycle/stale label Jun 3, 2021
@stale stale bot closed this as completed Jun 11, 2021
Needs Triage automation moved this from To Do to Closed Jun 11, 2021
@kubeflow-bot kubeflow-bot removed this from Closed in Needs Triage Jun 11, 2021

pwzhong commented Mar 1, 2022

@berndverst The Kubeflow Azure distribution has remained at v1.2 for two years, while v1.5 is coming up soon. Do you know if there is any plan to release the Azure distribution for a more recent version? Who is the point of contact/representative?
As you have worked on v1.2, do you plan to continue working on a later version?
