Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

all distributions of Kubeflow 1.1 have incorrect container images #1553

Closed
thesuperzapper opened this issue Sep 14, 2020 · 17 comments · Fixed by #1645
Closed

all distributions of Kubeflow 1.1 have incorrect container images #1553

thesuperzapper opened this issue Sep 14, 2020 · 17 comments · Fixed by #1645

Comments

@thesuperzapper
Copy link
Member

thesuperzapper commented Sep 14, 2020

Currently, this repository (both master and v1.1-branch) have the incorrect images for some components of kubefow.
As far as I can see, this should affect all stacks at least partially, including: GCP, AWS, Azure, IBM, Vanilla K8S.

Most of these errors have occurred because the stacks have moved to the base_v3 versions of the components, whereas the maintainers/bot has only updated the base spec. But in some cases, clearly no-one is maintaining the YAML for the component.

In terms of fixing this, we obviously need the owners of each section to fix their manifests.

Assuming we fix these issues, and cherry-pick them into the v1.1-branch, there will still be an issue, as kfdef files target the branch directly rather than a tag, so users may not realise that if they rerun kfctl build ... they will get different manifests.

Related issues: #1496

admission-webhook

centraldashboard

jupyter-web-app

notebook-controller

kfam

profile-controller

pytorch-operator

tf_operator

@issue-label-bot
Copy link

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.95

Please mark this comment with 👍 or 👎 to give our bot feedback!
Links: app homepage, dashboard and code for this bot.

@thesuperzapper
Copy link
Member Author

@jlewi you might want to be across this

@thesuperzapper
Copy link
Member Author

/priority p0

@jlewi
Copy link
Contributor

jlewi commented Sep 17, 2020

@thesuperzapper I would suggest breaking this issue upto into specific issues for each each application and assigning it to the application owners. Application owners are responsible for ensuring their manifests are shipping the right docker image release.

@jlewi
Copy link
Contributor

jlewi commented Sep 17, 2020

I spot checked the date of the commits of two of the images:
(TFJob) Jun 22 - https://github.com/kubeflow/tf-operator/tree/a2ae7bff612bf26d0b05a54ddcfc6a400a5c8e89
(Jupyter Web app) Feb 6 - https://github.com/kubeflow/kubeflow/tree/d9be4b9e
(Notebook controller) Feb 25 - https://github.com/kubeflow/kubeflow/tree/f39279c0

The apps cd pipeline looks like it has entries to update the base_v3 entry for most manifests
https://github.com/kubeflow/testing/blob/d4394eaff73af5645d156412f1de374aba31bf33/apps-cd/applications.yaml#L85

Likewise there are auto opened the PRs to update the manifests
#1479

So it looks to me like application owners didn't review the PRs and get them merged and they missed the release train.

@jlewi
Copy link
Contributor

jlewi commented Sep 18, 2020

@thesuperzapper Can you and the other proposed notebook lead fix this for notebooks?

@jlewi
Copy link
Contributor

jlewi commented Sep 18, 2020

/assign @thesuperzapper

@thesuperzapper
Copy link
Member Author

For anyone who comes across this issue and needs a fix, add this to your images section inside ./kustomize/kubeflow-apps/kustomization.yaml which was generated by kfctl build:

images:
  - name: gcr.io/kubeflow-images-public/admission-webhook
    newName: gcr.io/kubeflow-images-public/admission-webhook
    newTag: v1.1.0-g3ac3d08b
  - name: gcr.io/kubeflow-images-public/centraldashboard
    newName: gcr.io/kubeflow-images-public/centraldashboard
    newTag: v1.1.0-g35d7484a
  - name: gcr.io/kubeflow-images-public/jupyter-web-app
    newName: gcr.io/kubeflow-images-public/jupyter-web-app
    newTag: v1.1.0-gd3377cbd
  - name: gcr.io/kubeflow-images-public/notebook-controller
    newName: gcr.io/kubeflow-images-public/notebook-controller
    newTag: v1.1.0-gd3377cbd
  - name: gcr.io/kubeflow-images-public/kfam
    newName: gcr.io/kubeflow-images-public/kfam
    newTag: v1.1.0-g9f3bfd00
  - name: gcr.io/kubeflow-images-public/profile-controller
    newName: gcr.io/kubeflow-images-public/profile-controller
    newTag: v1.1.0-ga49f658f
  - name: gcr.io/kubeflow-images-public/pytorch-operator
    newName: gcr.io/kubeflow-images-public/pytorch-operator
    newTag: v1.1.0-gd596e904
  - name: gcr.io/kubeflow-images-public/tf_operator
    newName: gcr.io/kubeflow-images-public/tf_operator
    newTag: v1.1.0-g92389064

@jlewi
Copy link
Contributor

jlewi commented Sep 28, 2020

@thesuperzapper ping? What's the plan for notebooks?

@thesuperzapper
Copy link
Member Author

@jlewi, other than fixing it for Kubeflow 1.2, we don't have a plan right now.

Can I ask why we have chosen to use a branch like v1.1-branch rather than tags on master (which are immutable)?

The current kfdef files directly clone that branch (with any changes which may have been added), which makes it quite hard for us to release a patched version of 1.1 without pulling the rug out from under users.

@jlewi
Copy link
Contributor

jlewi commented Sep 29, 2020

@thesuperzapper I'm not sure what you are referring to. We do have immutable tags for each specific patch release.
https://github.com/kubeflow/manifests/tags

We use branches as opposed to immutable tags on master because following standard practice we want to allow the releases and masters to evolve independently in case of need to support fixes.

The current kfdef files directly clone that branch (with any changes which may have been added), which makes it quite hard for us to release a patched version of 1.1 without pulling the rug out from under users

It is up to the KFDef owners to decide whether to pin to a specific tag or pull from the head of the branch depending on what they are optimizing for.

@jlewi
Copy link
Contributor

jlewi commented Sep 29, 2020

@thesuperzapper @kimwnasptd what's the plan for getting the manifests on master upgraded?

@kimwnasptd
Copy link
Member

@jlewi I'll take a look at the current open PRs that update the controller and jupyter web app and test the latest built tags. If they work as expected then I'll move on with merging the corresponding PRs and closing the older ones.

@Jeffwan
Copy link
Member

Jeffwan commented Oct 8, 2020

This is a good point.. I think last release didn't handle v1.1-branch assets well. We fixed few issues in the past like profile-controller and training images. It's WG owner's responsibility to update them before the release. This definitely needs more collaboration

@jlewi
Copy link
Contributor

jlewi commented Oct 30, 2020

@kubeflow/wg-notebook-leads and @thesuperzapper Any update on this for notebooks? Do you have a plan for getting the notebook manifests updated before the release?

@thesuperzapper thesuperzapper added this to To do in Notebooks WG Nov 3, 2020
@kimwnasptd
Copy link
Member

@jlewi @Jeffwan for the controller we have #1570 which I propose to drop, since the changes were introduced from an irrelevant PR. So the controller should be the current ones.

For the web app the we have #1531, which contains @thesuperzapper's changes for Tolerations and affinity. Merging this PR alone will not suffice since we will need to also make a follow up PR to update the jupyter web app's config map in the manifests.

If @thesuperzapper has the cycles to make the PR to also update the config map in the manifests then we can merge #1531 and be ready for 1.2. If not then lets stick to the current manifests and aim for an overall update on 1.3 which will include the updated jupyter web app code kubeflow/kubeflow#5310.

@jlewi
Copy link
Contributor

jlewi commented Nov 4, 2020

Thanks @kimwnasptd for the update. Do you have a plan for how you will publish docker images going forward?

Kubeflow 1.2 automation moved this from In progress to Done Nov 16, 2020
Notebooks WG automation moved this from To do to Done Nov 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Kubeflow 1.2
  
Done
Development

Successfully merging a pull request may close this issue.

5 participants