Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add image trigger annotation for filling in image-field-value of container. Use imagestream name and tag for JUPYTER_IMAGE #800

Conversation

shalberd
Copy link
Contributor

@shalberd shalberd commented Nov 23, 2022

but not for Notebook container image ref

achieving independence from the internal openshift docker registry in handling images in conjunction with imagestreams.
See https://itnext.io/variations-on-imagestreams-in-openshift-4-f8ee5e8be633 for background.

Jupyter uses referencepolicy type: Source, making an internal openshift image repo unneccessary.
However, the internal openshift repo can be used, just via imagestream name and tag, not via direct reference of the image location in the internal openshift docker repo.

to be tested in conjunction with Kubeflow notebook controller PR-133

Description

fixes #792 #813 #1340

no dependency on dockerImageRepository field of ImageStream in imageUrl variable, thereby no dependency on internal Docker registry when spawning images that are coming from an imageStream.

Also, now, the image: field of Notebook spec has the combination of image name derived from imagestream name plus imagestream tag value (imageSelection variable).

The Notebook annotation notebooks.opendatahub.io/last-image-selection https://github.com/opendatahub-io/odh-dashboard/blob/main/backend/src/utils/notebookUtils.ts#L266, keeps the imageSelection variable, that is referencing imagestream name plus tag value.

With imagestreams, be they spec.tags.referencePolicy type Source (external) or Local (internal openshift registry), we can use the imagestream tag and name to reference the image, wherever it may be. So, we use the imageSelection variable in the container image-field.

That is so because spec.lookupPolicy.local is set to true, it is thus absolutely ok to reference the image stream name directly for any image: field of a container, as long as the ImageStream is in the same namespace as the dashboard and controller, which it is.

Even for the DSG feature in which you spawn Notebooks in different namespaces, you are already covering that correctly with system:image-puller, making it even more logical not to reference from the internal docker repo location in the image field of the notebook.

Image-field fill-in happens on StatefulSet creation via metadata annotation and image policy admission plugin of openshift. Image-field can either initially be set empty or even be left out completely, tested with StatefulSet outside of notebook controller context.

Also, for JUPYTER_IMAGE env variable, I am using the value of imagestreammetadata.name:imgestreamtag / variable imageUrl, that is, the original location in the source repo. That is the only location where I could think of a direct image reference being useful, but I have not gotten any response on how JUPYTER_IMAGE is used in RHOSDS.

Do not merge until changes in notebook controller reconciliation logic have been implemented.

How Has This Been Tested?

Would like to test this with a sample docker image someone would provide. In an Openshift environment that does NOT have an internal openshift docker registry. Need assistence from odh-dashboard developers.

@andrewballantyne Would like to test how the Notebook CR is created with the latest changes. Coordination via slack. It will not lead to a working pod deployment, but I want to check the status of Notebook CR instance in a notebook namespace that is != dashboard namespace.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work

@openshift-ci openshift-ci bot added the needs-ok-to-test The openshift bot needs to label PRs from non members to avoid strain on the CI label Nov 23, 2022
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 23, 2022

Hi @shalberd. Thanks for your PR.

I'm waiting for a opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@andrewballantyne
Copy link
Member

/ok-to-test

@openshift-ci openshift-ci bot added ok-to-test The openshift bot needs `ok-to-test` to allow non member PRs to run the tests. and removed needs-ok-to-test The openshift bot needs to label PRs from non members to avoid strain on the CI labels Dec 5, 2022
Copy link
Member

@andrewballantyne andrewballantyne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good. We typically don't use underscores for property names, but if we don't have a linter for it, it's not worth changing until we do.

Sven and I are in talks about issues with backoff errors when starting a Notebook with his PR.

@andrewballantyne andrewballantyne self-assigned this Dec 6, 2022
@andrewballantyne
Copy link
Member

/hold

Looks like there may be an issue with the OpenShift side of things that prevents us from merging this atm. The change in image references seems to cause it not to find the image itself and pulling falls and the pod does not start (backoff error). Sven is investigating - openshift/openshift-apiserver#339

@openshift-ci openshift-ci bot added the do-not-merge/hold This PR is hold for some reason label Dec 7, 2022
@shalberd
Copy link
Contributor Author

shalberd commented Dec 9, 2022

Larger issue with image-resolve from imagestreams in kubeflow notebook controller. Explained the situation to Kubeflow at kubeflow/kubeflow#6829

Basically, the issue lies less with the Notebook CRD, but with the resource creating the StatefulSet from a Notebook CR; that is the notebook-controller.

I have tested this within-namespace (Notebook / StatefulSet and imagestream in same namespace) and across-namespaces and it is working beautifully for StatefulSets:

https://docs.openshift.com/container-platform/4.10/openshift_images/triggering-updates-on-imagestream-changes.html#images-triggering-updates-imagestream-changes-kubernetes-about_triggering-updates-on-imagestream-changes

When one of the core Kubernetes resources contains both a pod template and this annotation, OpenShift Container Platform attempts to update the object by using the image currently associated with the image stream tag that is referenced by trigger. The update is performed against the fieldPath specified.

Example, having a StatefulSet in odhsven2, referring to an imagestream in odhsven namespace, just adding an annotation and leaving the image-field empty automatically fills in the correct image.

The StatefulSet annotation is like this, in metadata:

  annotations:
    image.openshift.io/triggers: '[{"from":{"kind":"ImageStreamTag","name":"s2i-generic-data-science-notebook:v0.0.5", "namespace":"odhsven"},"fieldPath":"spec.template.spec.containers[?(@.name==\"jupyter-nb-sthoms\")].image"}]'

So, notebook controller needs to exclude this annotation as well as the image-field from reconciliation. Most importantly, notebook controller needs to take it from the Notebook CR, if present, on StatefulSet creation.

With the lookup-fieldPath to the image-value being the same value as Notebook metadata.name, as that is also the notebook containers[0].name.

@shalberd shalberd changed the title new field from_name in ImageTagInfo. Use the value for JUPYTER_IMAGE [WIP] new field from_name in ImageTagInfo. Use the value for JUPYTER_IMAGE Dec 14, 2022
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress This PR is in WIP state label Dec 14, 2022
@shalberd
Copy link
Contributor Author

/retest

frontend/src/api/network/notebooks.ts Outdated Show resolved Hide resolved
frontend/src/api/network/notebooks.ts Outdated Show resolved Hide resolved
frontend/src/api/network/notebooks.ts Outdated Show resolved Hide resolved
backend/src/utils/notebookUtils.ts Outdated Show resolved Hide resolved
backend/src/utils/notebookUtils.ts Outdated Show resolved Hide resolved
backend/src/types.ts Outdated Show resolved Hide resolved
@lucferbux
Copy link
Contributor

@shalberd can you rebase latest changes? there has been some upgrades in the environment and depencies, once that's done we can re-review it. Thanks in advance!

@andrewballantyne
Copy link
Member

@shalberd can you rebase latest changes? there has been some upgrades in the environment and depencies, once that's done we can re-review it. Thanks in advance!

Also if you could run our linter and fix the issues the PR is flagging, that'll also help. npm test I think works from the root of the repo and it will run everything including our limited test suite (atm).

@andrewballantyne
Copy link
Member

andrewballantyne commented Mar 22, 2023

Also @lucferbux has a issue to fix this -- but @shalberd we are using Node 18 now. You'll have to upgrade that. We have to fix the documentation (which is for the next release -- end of next week)

Ref: #1034

@shalberd shalberd force-pushed the assembleNotebook_image_independent_of_dockerImageRepository_openshift branch from a75bad2 to 3061877 Compare March 22, 2023 14:45
@openshift-merge-robot openshift-merge-robot removed the needs-rebase PR needs to be rebased label Mar 22, 2023
@shalberd
Copy link
Contributor Author

shalberd commented Mar 22, 2023

@andrewballantyne rebased, will run linter tomorrow, but most issues should be fixed now. The general idea of the new annotation is that there can be any value in the container image-field really, so I used an empty space. It is later overwritten with the correct value by the image policy admission plug-in. Can you run th linting remotely?

See what happens to the image-field cause of the image policy admission plug-in doing its work:

  • Without open shift registry:

<image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) always resolve to external image reference from imagestream spec.tags[tagname].from.name regardless of whether spec.tags[tagname].referencePolicy.type is set to Local or Source in image stream--> perfect for air gapped and ImageContentSourcePolicy use or if one just wants to pull directly from the external location when there is no internal openshift registry.

containers:
    - name: testissource
      image: >-
         quay.io/thoth-station/s2i-generic-data-science notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
  • With openshift registry:

if spec.tags[tagname].referencePolicy.type of the imagestream is set to Source then <image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) resolves to the remote repo location:

containers:
    - name: testissource
      image: >-
         quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58

if spec.tags[tagname].referencePolicy.type of the imagestream is set to Local then <image stream name>:<image stream tag> references in image-field or annotation of a Kubernetes object (Deployment, StatefulSet, Pod etc.) resolves to the namespace-specific location of the image in the open shift-internal registry:

containers
    - name: testisreflocal
      image: >-
        image-registry.openshift-image-registry.svc:5000/testis/jupyter-tensorflow-notebook@sha256:fc52e4fbc8c1c70dfa22dbfe6b0353f5165c507c125df4438fca6a3f31fe976e

@shalberd
Copy link
Contributor Author

/retest

@shalberd
Copy link
Contributor Author

shalberd commented Jul 27, 2023

@lucferbux ok, squashing achieved. One single, meaningful commit. Because there was a mix of local changes and merge commits in-between, I had to use a tool called git-squash being in the current feature branch and having main branch up-to-date, doing git-squash main main being the branch to compare to for changes, entered single commit message.

That helped me with the process, in the end having that one new commit and porce-pushing to the feature branch afterwards.

@shalberd
Copy link
Contributor Author

shalberd commented Jul 27, 2023

@lucferbux @atheo89 In the manifests deployment recipes, we currently still have image pull policy Always, so this fast-changing image behind the pr-800 tag is not an issue when using tag-format.

If you want to test with modified odh-manifests, you can replace the values, either leave it at newTag: or use digest: sha256:xxx notation.

Advantage of using digests is also that is it 100% comparable in terms of image used between several people trying it out, as the digest is truly unique.

https://github.com/opendatahub-io/odh-manifests/blob/master/odh-dashboard/base/kustomization.yaml#L28

put

newTag: pr-800

or

digest: sha256:84690bc6833889364212f4ccb30ac18636c81e137cf27c129ed9ad92180eeeef

then, for kubeflow notebook controller

https://github.com/opendatahub-io/odh-manifests/blob/master/odh-notebook-controller/kf-notebook-controller/overlays/openshift/kustomization.yaml#L14

put

newTag: pr-133

or

sha256:090cf186fe7feae9399455911e03164c43b41f0b0f1dc80cd754b46535598382

An advantage I see with image behind a certain tag changing, even when using imagePullPolicy: IfNotPresent:

when using digest notation, collaborators can be absolutely sure they are using the same images :-)
That is how I did it when doing testing on our openshift cluster with custom manifests

https://quay.io/repository/opendatahub/kubeflow-notebook-controller?tab=tags

https://quay.io/repository/opendatahub/odh-dashboard?tab=tags

@andrewballantyne andrewballantyne added the pr/no-tests-allowed Added by an official approver - this PR is allowed no tests. Omitted, a test must accompany the PR label Sep 7, 2023
@shalberd
Copy link
Contributor Author

shalberd commented Oct 18, 2023

I'll check the code once more once we have kubeflow notebook controller PR-133 done for references to https://github.com/search?q=repo%3Aopendatahub-io%2Fodh-dashboard%20%22image.split%22&type=code

and resolve conflicts.

update Oct 19: rebase done to latest HEAD of main. Will change the base branch of this PR from main to f/imagestreams once that one is fast-forwarded.

@openshift-merge-robot openshift-merge-robot added the needs-rebase PR needs to be rebased label Oct 18, 2023
@shalberd shalberd force-pushed the assembleNotebook_image_independent_of_dockerImageRepository_openshift branch from d3f0d96 to f4a6755 Compare October 19, 2023 07:17
@openshift-merge-robot openshift-merge-robot removed the needs-rebase PR needs to be rebased label Oct 19, 2023
@shalberd shalberd force-pushed the assembleNotebook_image_independent_of_dockerImageRepository_openshift branch from 641f214 to 8af82c3 Compare October 19, 2023 07:29
@andrewballantyne andrewballantyne changed the base branch from main to f/imagestreamhandling October 19, 2023 13:32
@shalberd shalberd force-pushed the assembleNotebook_image_independent_of_dockerImageRepository_openshift branch from 8af82c3 to dcb33b3 Compare October 29, 2023 14:19
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 29, 2023

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from andrewballantyne. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shalberd shalberd force-pushed the assembleNotebook_image_independent_of_dockerImageRepository_openshift branch 5 times, most recently from 8105b82 to 56ef006 Compare October 30, 2023 16:33
@openshift-merge-robot openshift-merge-robot added needs-rebase PR needs to be rebased and removed needs-rebase PR needs to be rebased labels Oct 30, 2023
…me as value of env JUPYTER_IMAGE while using the combination of imagestream/image name and tag as the image source for notebook; achieving independence from the internal openshift docker registry
Copy link
Contributor Author

@shalberd shalberd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes implemented

backend/src/utils/notebookUtils.ts Outdated Show resolved Hide resolved
backend/src/utils/notebookUtils.ts Show resolved Hide resolved
frontend/src/api/k8s/notebooks.ts Outdated Show resolved Hide resolved
frontend/src/utilities/imageUtils.ts Show resolved Hide resolved
backend/src/utils/notebookUtils.ts Show resolved Hide resolved
backend/src/utils/notebookUtils.ts Show resolved Hide resolved
@shalberd
Copy link
Contributor Author

/retest

Copy link
Member

@harshad16 harshad16 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for you continued patience with this feature.
Provided my review on this, i believe , we should this whole part towards the notebook-controller itself, instead of trying to do this via dashboard, as it would require more changes based on restart and creation of notebook.

If you agree, i would share that lets address the changes in Notebook-controller and close this PR if all works as we require.

@@ -268,6 +266,7 @@ export const assembleNotebook = async (
'opendatahub.io/username': username,
'kubeflow-resource-stopped': null,
'opendatahub.io/accelerator-name': accelerator.accelerator?.metadata.name || '',
'image.openshift.io/triggers': `[{"from":{"kind":"ImageStreamTag","name":"${imageSelection}", "namespace":"${namespace}"},"fieldPath":"spec.template.spec.containers[?(@.name==\\"${name}\\")].image"}]`,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the preview test of this,

In case of the change of referencePolicy from Local to Source or vice-versa; or change to the image tag source. The annotation quickly updates the resources, which causes the pods to restart.
As it restarts, in case of a change of image tag, this becomes an issue for long-running notebooks.
We have a requirement for not hindering long-running notebook, and also that we update tag for the same imagestream tag based on cve fix.

In this case, we would need to use the image-trigger annotation field called paused, based on notebook scale down or up. If the user scales down, then only the image should be changed or else it should stay the same.

Perhaps, we should do this whole action directly on Notebook-Controller, which notebook-controller takes care of the Notebook Image, based on its type.

Copy link
Contributor Author

@shalberd shalberd May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Harshad, sorry for the delay, thank you for your thoughts. Generally speaking, you are right when it comes to long-running notebooks (of course, if a node is ever down, even the best notebook when not replicated is not truly long-running, there is no guarantee).

In case of the change of referencePolicy from Local to Source or vice-versa; or change to the image tag source. The annotation quickly updates the resources, which causes the pods to restart.

Yes, that is true, it kind of "ripples through" without much delay, leading to pod restarts. We have that currently in our private Harbor repo that imagestream tag.from.name is based on, when the sha hash changes, yup. Leads to restarts of workbench pods.

We have a requirement for not hindering long-running notebook ... In this case, we would need to use the image-trigger annotation field called paused, based on notebook scale down or up. If the user scales down, then only the image should be changed

paused: true or false could be based on a custom notebook annotation. Possibly together with a slider in dashboard GUI that allows a user to override if they want to update.

About stopped slider in Dashboard GUI, so a stopped workbench and notebook / statefulset: the change in the imagestream tag underlying digest applies there, too, which is good. What I mean:
I looked up a workbench statefulset with replicas: 0 and looked at the notebook container image field value.
It contained the latest digest behind the tag of the imgestream / docker image base. So when paused: false is set in the annotation (which is the default) https://docs.openshift.com/container-platform/4.12/openshift_images/triggering-updates-on-imagestream-changes.html then changes ripple though to even a replicas: 0 statefulset / podspec.

You wrote: "if the user scales down, then only should the image be changed". It does not matter whether scaled down or not, image is always changed. What matters only the the value of paused: in the image change trigger annotation.

I think I understand what you want to achieve with either odh notebook controller or kubeflow notebook controller with respect to paused: true vs paused: false, I just think setting the value on the image change trigger annotation as a whole should start on odh dashboard. That is of course up fro discussion :-)

Perhaps, we should do this whole action directly on Notebook-Controller, which notebook-controller takes care of the Notebook Image, based on its type.

You mean based on a new notebook opendatahub annotation, like long-running: true/false or pauseable: true/false?

And by whole action, you mean the injection of the image stream change trigger annotation into the pod (not container)?

Copy link
Contributor Author

@shalberd shalberd May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot just close this PR as dashboard notebook assembly won't work since it is based on imagestream status:
dockerImageRepository, which is empty when no internal openshift registry is present.

Also, on each notebook podspec update via dashboard GUI (requests and limits sizes, env vars ...), the image-field would again be set by dashboard. So we do need to touch the assemble and update logic here in some minimal way at least.

I agree on handling some aspects by odh (or kubeflow?) notebook controller, but we defininitely need some changes here in dashboard, too.

@shalberd
Copy link
Contributor Author

shalberd commented Jun 4, 2024

@harshad16 @jiridanek @jstourac @atheo89 @lucferbux

closed in favor odh odh notebook controller doing the notebook image field lookup in all cases (no openshift internal registry, openshift internal registry): opendatahub-io/kubeflow#336 and opendatahub-io/kubeflow#329

as well as dashboard changes #2867

Thank you all very much, neat work

minor remaining: changing notebook container imagePullPolicy: IfNotPresent from old imagePullPolicy: Always

as a nice-to-have feature for external registry and large images with a unique hash. Needs to be validated if that works with internal openshift registry, too. Saves times on workbench startup.

@shalberd shalberd closed this Jun 4, 2024
@andrewballantyne
Copy link
Member

Thanks again Sven for all your hard work on this front!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
do-not-merge/hold This PR is hold for some reason ok-to-test The openshift bot needs `ok-to-test` to allow non member PRs to run the tests. pr/no-tests-allowed Added by an official approver - this PR is allowed no tests. Omitted, a test must accompany the PR
Projects
None yet
5 participants