Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cannot reference imagestream name and tag from StatefulSet image field, only DeploymentConfig in OCP 4.10.3 #339

Closed
shalberd opened this issue Dec 7, 2022 · 21 comments
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@shalberd
Copy link

shalberd commented Dec 7, 2022

@dmage @benluddy @bparees I hope I am at the correct place here, found this online for context:

Image streams are part of the OpenShift extension APIs. Other OpenShift extension resources, such as build configurations and deployment configurations, provide native support for image streams. OpenShift tooling, such as the oc command, offers easy-to-use commands to manage image stream resources, as well as other extension API resources.
OpenShift adds its extension APIs using standard Kubernetes extension mechanisms, such as custom resource definitions (CRDs) and admission plugins. This feature allows OpenShift to support using image streams together with standard Kubernetes workload API resources, such as Deployments, StatefulSets, and Jobs.

When I have an imagestream in the same namespace as a StatefulSet, I cannot use the image field and reference imagestream name and tag there. It does not get resolved.

In contrast, when I reference the imagestream from a DeploymentConfig, or from a Pod, or from a Deployment, the image name and digest get resolved correctly, as it should be.

Should this not have been fixed in the admission controller or something like that a long time ago, for all kinds of objects?

See https://bugzilla.redhat.com/show_bug.cgi?id=2000216 and https://access.redhat.com/solutions/6455191

In any case, the expected result would be: image stream tags get resolved in created and edited StatefulSets.

@bparees
Copy link
Contributor

bparees commented Dec 7, 2022

yes, it should work. Can you provide the exact steps(oc commands, in order) you are taking along w/ your yaml files for the imagestream and statefulset?

Due to some limitations in the admission logic, i think it's critical that your imagestream exist and be successfully imported before you create the statefulset. At least depending what mechanism you're using (which will be clearer if we can see your statefulset yaml and imagestream).

i'm guessing you've already been here, but here are the docs on these features:
https://docs.openshift.com/container-platform/4.11/openshift_images/using-imagestreams-with-kube-resources.html

https://docs.openshift.com/container-platform/4.11/openshift_images/triggering-updates-on-imagestream-changes.html

@shalberd
Copy link
Author

shalberd commented Dec 8, 2022

Hi Ben, thank you for grabbing this topic, I really appreciate it.

Imagestream is applied by a kustomize-style yaml files deployment. Statefulset is created by Kubeflow Notebooks Controller as a result of a Notebook CR being present.

The imagestream exists and tag has been successfully imported at created: '2022-11-21T13:14:31Z'. We do not use an the internal openshift docker registry.

The statefulset references imagestream name and tag in its container image field. Its creation timestamp is way past the imagestream one creationTimestamp: '2022-12-07T16:54:40Z'

We are getting the following events on the pod, which is why I got to the bugzilla link, access.redhat link and so on in the first message of this issue here.

 [Warning] Failed to pull image "s2i-generic-data-science-notebook:v0.0.5": rpc error: code = Unknown desc = reading manifest v0.0.5 in docker.io/library/s2i-generic-data-science-notebook: errors: denied: requested access to the resource is denied unauthorized: authentication required
jupyter-nb-sthoms-0
Namespace odhsven
Dec 8, 2022, 5:21 AM
Generated from kubelet xxxx-ostlxxx.ocp4dev.xxxx.mydomain.ch
138 times in the last 14 hours
Failed to pull image "s2i-generic-data-science-notebook:v0.0.5": rpc error: code = Unknown desc = reading manifest v0.0.5 in docker.io/library/s2i-generic-data-science-notebook: errors: denied: requested access to the resource is denied unauthorized: authentication required

I am referencing the yamls below. The problem occurs on OCP 4.10.3 and also on the latest DEV versions 4.13.x. Imagestream and StatefulSet are in the same namespace. Imagestream, when referenced by DeploymentConfigs, Deployments, and Pods in the image-field, resolves to the src location of the imagestream-tag. In those types of objects, the resolve and image pull through is working fine, just not with StatefulSet.

There are two containers in the pod, disregard the oauth-one, please, we are trying to instead use the central openshift imagestream for that some day, too :-)

kind: ImageStream
apiVersion: image.openshift.io/v1
metadata:
  annotations:
    kfctl.kubeflow.io/kfdef-instance: opendatahub.odhsven
    opendatahub.io/notebook-image-desc: >-
      Jupyter notebook image with a set of data science libraries that advanced
      AI/ML notebooks will use as a base image to provide a standard for
      libraries avialable in all notebooks
    opendatahub.io/notebook-image-name: Standard Data Science
    opendatahub.io/notebook-image-url: 'https://github.com/thoth-station/s2i-generic-data-science-notebook'
    openshift.io/image.dockerRepositoryCheck: '2022-11-21T13:14:31Z'
  resourceVersion: '804058030'
  name: s2i-generic-data-science-notebook
  uid: 9be25400-03ae-4672-9e4d-06e6d31ebd07
  creationTimestamp: '2022-11-21T13:14:30Z'
  generation: 2
  namespace: odhsven
  labels:
    app.kubernetes.io/part-of: jupyterhub
    component.opendatahub.io/name: jupyterhub
    opendatahub.io/component: 'true'
    opendatahub.io/notebook-image: 'true'
spec:
  lookupPolicy:
    local: true
  tags:
    - name: v0.0.5
      annotations:
        opendatahub.io/notebook-python-dependencies: >-
          [{"name":"Boto3","version":"1.17.11"},{"name":"Kafka-Python","version":"2.0.2"},{"name":"Matplotlib","version":"3.4.2"},{"name":"Numpy","version":"1.21.0"},{"name":"Pandas","version":"1.2.5"},{"name":"Scipy","version":"1.7.0"}]
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.8.6"}]'
        openshift.io/imported-from: quay.io/thoth-station/s2i-generic-data-science-notebook
      from:
        kind: DockerImage
        name: 'quay.io/thoth-station/s2i-generic-data-science-notebook:v0.0.5'
      generation: 2
      importPolicy: {}
      referencePolicy:
        type: Source
status:
  dockerImageRepository: ''
  tags:
    - tag: v0.0.5
      items:
        - created: '2022-11-21T13:14:31Z'
          dockerImageReference: >-
            quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
          image: >-
            sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
          generation: 2
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: jupyter-nb-sthoms
  namespace: odhsven
  uid: f3692652-6718-40fc-909f-355eb22b5002
  resourceVersion: '829380751'
  generation: 1
  creationTimestamp: '2022-12-07T16:54:40Z'
  ownerReferences:
    - apiVersion: kubeflow.org/v1beta1
      kind: Notebook
      name: jupyter-nb-sthoms
      uid: 15d89396-c670-4c36-b0c9-5872fd040811
      controller: true
      blockOwnerDeletion: true
spec:
  replicas: 1
  selector:
    matchLabels:
      statefulset: jupyter-nb-sthoms
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jupyter-nb-sthoms
        notebook-name: jupyter-nb-sthoms
        opendatahub.io/odh-managed: 'true'
        opendatahub.io/user: sthoms
        statefulset: jupyter-nb-sthoms
    spec:
      restartPolicy: Always
      serviceAccountName: jupyter-nb-sthoms
      schedulerName: default-scheduler
      enableServiceLinks: false
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: NotIn
                    values:
                      - 'true'
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources:
            limits:
              cpu: '2'
              memory: 8Gi
            requests:
              cpu: '1'
              memory: 8Gi
          readinessProbe:
            httpGet:
              path: /notebook/odhsven/jupyter-nb-sthoms/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: jupyter-nb-sthoms
          livenessProbe:
            httpGet:
              path: /notebook/odhsven/jupyter-nb-sthoms/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: NOTEBOOK_ARGS
              value: |-
                --ServerApp.port=8888
                                  --ServerApp.token=''
                                  --ServerApp.password=''
                                  --ServerApp.base_url=/notebook/odhsven/jupyter-nb-sthoms
                                  --ServerApp.quit_button=False
                                  --ServerApp.tornado_settings={"user":"sthoms","hub_host":"https://odh-dashboard-odhsven.apps.ocp4dev.xxxx.mydomain.ch","hub_prefix":"/notebookController/sthoms"}
            - name: JUPYTER_IMAGE
              value: 'quay.io/thoth-station/s2i-generic-data-science-notebook:v0.0.5'
            - name: JUPYTER_NOTEBOOK_PORT
              value: '8888'
            - name: NB_PREFIX
              value: /notebook/odhsven/jupyter-nb-sthoms
          ports:
            - name: notebook-port
              containerPort: 8888
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - name: jupyterhub-nb-sthoms-pvc
              mountPath: /opt/app-root/src
          terminationMessagePolicy: File
          image: 's2i-generic-data-science-notebook:v0.0.5'
          workingDir: /opt/app-root/src
        - resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 64Mi
          readinessProbe:
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: oauth-proxy
          livenessProbe:
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          ports:
            - name: oauth-proxy
              containerPort: 8443
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - name: oauth-config
              mountPath: /etc/oauth/config
            - name: tls-certificates
              mountPath: /etc/tls/private
          terminationMessagePolicy: File
          image: 'registry.redhat.io/openshift4/ose-oauth-proxy:v4.10'
          args:
            - '--provider=openshift'
            - '--https-address=:8443'
            - '--http-address='
            - '--openshift-service-account=jupyter-nb-sthoms'
            - '--cookie-secret-file=/etc/oauth/config/cookie_secret'
            - '--cookie-expire=24h0m0s'
            - '--tls-cert=/etc/tls/private/tls.crt'
            - '--tls-key=/etc/tls/private/tls.key'
            - '--upstream=http://localhost:8888'
            - '--upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
            - >-
              --skip-auth-regex=^(?:/notebook/$(NAMESPACE)/jupyter-nb-sthoms)?/api$
            - '--email-domain=*'
            - '--skip-provider-button'
            - >-
              --openshift-sar={"verb":"get","resource":"notebooks","resourceAPIGroup":"kubeflow.org","resourceName":"jupyter-nb-sthoms","namespace":"$(NAMESPACE)"}
            - >-
              --logout-url=https://odh-dashboard-odhsven.apps.ocp4dev.xxxx.mydomain.ch/notebookController/sthoms/home
      serviceAccount: jupyter-nb-sthoms
      volumes:
        - name: jupyterhub-nb-sthoms-pvc
          persistentVolumeClaim:
            claimName: jupyterhub-nb-sthoms-pvc
        - name: oauth-config
          secret:
            secretName: jupyter-nb-sthoms-oauth-config
            defaultMode: 420
        - name: tls-certificates
          secret:
            secretName: jupyter-nb-sthoms-tls
            defaultMode: 420
      dnsPolicy: ClusterFirst
  serviceName: ''
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
status:
  observedGeneration: 1
  replicas: 1
  currentReplicas: 1
  updatedReplicas: 1
  currentRevision: jupyter-nb-sthoms-8578f7d994
  updateRevision: jupyter-nb-sthoms-8578f7d994
  collisionCount: 0
  availableReplicas: 0

It does not really matter, too, if it is that particular StatefulSet or one created ad-hoc by me, the behavior and bug is always the same. This is problematic because KF Notebook Controller instantiation of pods is based on StatefuSet, not Deployment or DeploymentConfig or Pod directly.

@bparees
Copy link
Contributor

bparees commented Dec 8, 2022

I attempted to create this behavior and could not. Here's my version of the yamls (just stripped the status fields and the fields that are populated at creation time, and stripped the NS):

kind: ImageStream
apiVersion: image.openshift.io/v1
metadata:
  annotations:
    kfctl.kubeflow.io/kfdef-instance: opendatahub.odhsven
    opendatahub.io/notebook-image-desc: >-
      Jupyter notebook image with a set of data science libraries that advanced
      AI/ML notebooks will use as a base image to provide a standard for
      libraries avialable in all notebooks
    opendatahub.io/notebook-image-name: Standard Data Science
    opendatahub.io/notebook-image-url: 'https://github.com/thoth-station/s2i-generic-data-science-notebook'
  name: s2i-generic-data-science-notebook
  labels:
    app.kubernetes.io/part-of: jupyterhub
    component.opendatahub.io/name: jupyterhub
    opendatahub.io/component: 'true'
    opendatahub.io/notebook-image: 'true'
spec:
  lookupPolicy:
    local: true
  tags:
    - name: v0.0.5
      annotations:
        opendatahub.io/notebook-python-dependencies: >-
          [{"name":"Boto3","version":"1.17.11"},{"name":"Kafka-Python","version":"2.0.2"},{"name":"Matplotlib","version":"3.4.2"},{"name":"Numpy","version":"1.21.0"},{"name":"Pandas","version":"1.2.5"},{"name":"Scipy","version":"1.7.0"}]
        opendatahub.io/notebook-software: '[{"name":"Python","version":"v3.8.6"}]'
        openshift.io/imported-from: quay.io/thoth-station/s2i-generic-data-science-notebook
      from:
        kind: DockerImage
        name: 'quay.io/thoth-station/s2i-generic-data-science-notebook:v0.0.5'
      generation: 2
      importPolicy: {}
      referencePolicy:
        type: Source
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: jupyter-nb-sthoms
spec:
  replicas: 1
  selector:
    matchLabels:
      statefulset: jupyter-nb-sthoms
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: jupyter-nb-sthoms
        notebook-name: jupyter-nb-sthoms
        opendatahub.io/odh-managed: 'true'
        opendatahub.io/user: sthoms
        statefulset: jupyter-nb-sthoms
    spec:
      restartPolicy: Always
      serviceAccountName: jupyter-nb-sthoms
      schedulerName: default-scheduler
      enableServiceLinks: false
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              preference:
                matchExpressions:
                  - key: nvidia.com/gpu.present
                    operator: NotIn
                    values:
                      - 'true'
      terminationGracePeriodSeconds: 30
      securityContext: {}
      containers:
        - resources:
            limits:
              cpu: '2'
              memory: 8Gi
            requests:
              cpu: '1'
              memory: 8Gi
          readinessProbe:
            httpGet:
              path: /notebook/odhsven/jupyter-nb-sthoms/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: jupyter-nb-sthoms
          livenessProbe:
            httpGet:
              path: /notebook/odhsven/jupyter-nb-sthoms/api
              port: notebook-port
              scheme: HTTP
            initialDelaySeconds: 10
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: NOTEBOOK_ARGS
              value: |-
                --ServerApp.port=8888
                                  --ServerApp.token=''
                                  --ServerApp.password=''
                                  --ServerApp.base_url=/notebook/odhsven/jupyter-nb-sthoms
                                  --ServerApp.quit_button=False
                                  --ServerApp.tornado_settings={"user":"sthoms","hub_host":"https://odh-dashboard-odhsven.apps.ocp4dev.xxxx.mydomain.ch","hub_prefix":"/notebookController/sthoms"}
            - name: JUPYTER_IMAGE
              value: 'quay.io/thoth-station/s2i-generic-data-science-notebook:v0.0.5'
            - name: JUPYTER_NOTEBOOK_PORT
              value: '8888'
            - name: NB_PREFIX
              value: /notebook/odhsven/jupyter-nb-sthoms
          ports:
            - name: notebook-port
              containerPort: 8888
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - name: jupyterhub-nb-sthoms-pvc
              mountPath: /opt/app-root/src
          terminationMessagePolicy: File
          image: 's2i-generic-data-science-notebook:v0.0.5'
          workingDir: /opt/app-root/src
        - resources:
            limits:
              cpu: 100m
              memory: 64Mi
            requests:
              cpu: 100m
              memory: 64Mi
          readinessProbe:
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 5
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: oauth-proxy
          livenessProbe:
            httpGet:
              path: /oauth/healthz
              port: oauth-proxy
              scheme: HTTPS
            initialDelaySeconds: 30
            timeoutSeconds: 1
            periodSeconds: 5
            successThreshold: 1
            failureThreshold: 3
          env:
            - name: NAMESPACE
              valueFrom:
                fieldRef:
                  apiVersion: v1
                  fieldPath: metadata.namespace
          ports:
            - name: oauth-proxy
              containerPort: 8443
              protocol: TCP
          imagePullPolicy: Always
          volumeMounts:
            - name: oauth-config
              mountPath: /etc/oauth/config
            - name: tls-certificates
              mountPath: /etc/tls/private
          terminationMessagePolicy: File
          image: 'registry.redhat.io/openshift4/ose-oauth-proxy:v4.10'
          args:
            - '--provider=openshift'
            - '--https-address=:8443'
            - '--http-address='
            - '--openshift-service-account=jupyter-nb-sthoms'
            - '--cookie-secret-file=/etc/oauth/config/cookie_secret'
            - '--cookie-expire=24h0m0s'
            - '--tls-cert=/etc/tls/private/tls.crt'
            - '--tls-key=/etc/tls/private/tls.key'
            - '--upstream=http://localhost:8888'
            - '--upstream-ca=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt'
            - >-
              --skip-auth-regex=^(?:/notebook/$(NAMESPACE)/jupyter-nb-sthoms)?/api$
            - '--email-domain=*'
            - '--skip-provider-button'
            - >-
              --openshift-sar={"verb":"get","resource":"notebooks","resourceAPIGroup":"kubeflow.org","resourceName":"jupyter-nb-sthoms","namespace":"$(NAMESPACE)"}
            - >-
              --logout-url=https://odh-dashboard-odhsven.apps.ocp4dev.xxxx.mydomain.ch/notebookController/sthoms/home
      serviceAccount: jupyter-nb-sthoms
      volumes:
        - name: jupyterhub-nb-sthoms-pvc
          persistentVolumeClaim:
            claimName: jupyterhub-nb-sthoms-pvc
        - name: oauth-config
          secret:
            secretName: jupyter-nb-sthoms-oauth-config
            defaultMode: 420
        - name: tls-certificates
          secret:
            secretName: jupyter-nb-sthoms-tls
            defaultMode: 420
      dnsPolicy: ClusterFirst
  serviceName: ''
  podManagementPolicy: OrderedReady
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 0
  revisionHistoryLimit: 10
status:
  observedGeneration: 1
  replicas: 1
  currentReplicas: 1
  updatedReplicas: 1
  currentRevision: jupyter-nb-sthoms-8578f7d994
  updateRevision: jupyter-nb-sthoms-8578f7d994
  collisionCount: 0
  availableReplicas: 0

I ran:

# create imagestream
oc create -f is.yaml

then i ran

# check import was successful
oc get is -o yaml

to confirm the import, which i did by checking the status section of the output:

  status:
    dockerImageRepository: image-registry.openshift-image-registry.svc:5000/p1/s2i-generic-data-science-notebook
    tags:
    - items:
      - created: "2022-12-08T15:31:49Z"
        dockerImageReference: quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
        generation: 2
        image: sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
      tag: v0.0.5
# create stateful set
oc create -f ss.yaml

and then i confirmed that the image field is substituted as expected (this happens during creation admission):

oc get statefulset -o yaml | grep image:
          image: quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58
          image: registry.redhat.io/openshift4/ose-oauth-proxy:v4.10

No pods ended up being created and i didn't really look into why, but fundamentally the statefulset's pod template is as expected: the image field points to the resolved imagestreamtag value and not 's2i-generic-data-science-notebook:v0.0.5' as was in the original statefulset yaml. (your pod's "failure to pull the image" error are directly caused by the fact that the image field in the statefulset was not substituted correctly when the statefulset was created)

The OCP version i used was one of our 4.13 CI builds: 4.13.0-0.ci-2022-12-07-190258 but this behavior hasn't been modified in several releases as far as i know (since the bug you pointed to in 4.6 i believe)

i'm not really sure what to tell you since i used pretty much your exact yaml files to attempt the recreate and it seems to be working as expected.

What OCP version are you on?

@shalberd
Copy link
Author

shalberd commented Dec 8, 2022

I am on OCP 4.10.3 on-prem without openshift registry and on 4.13 DEV on AWS with openshift registry.

So hold on, I am probably doing circular reasoning, should work more along the lines of Occam. You analysis helps for now. @bparees Do not go any further for now, thank you.

The differentiating feature I have is the KF Notebook and ODH Notebook Controller, that watches for Notebook CRs and then creates StatefulSets from them. I think I until now only manually modified the existing, KF or ODH Notebook controller-created StatefulSet / Notebook CR. Will try with a brand new one without any annotations from notebook controller.

What bugs me is that the substitution you mention works for everything except a StatefulSet, which is really pretty weird. The substitution in my case is image: quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58, in DeploymentConfigs, Pods, Deployments, just not in StatefulSets .... But, that makes sense, as I think I only edited StatefulSet objects created by the notebook controller with the image field pointing to the imagestream. I think the KF and ODH Notebook Controller is keeping the image field in the StatefulSet from being updated.

@bparees
Copy link
Contributor

bparees commented Dec 8, 2022

The differentiating feature I have is the KF Notebook and ODH Notebook Controller, that watches for Notebook CRs and then creates StatefulSets from them. I think I until now only manually modified the existing, KF or ODH Notebook controller-created StatefulSet. Will try with a brand new one without any annotations from notebook controller.

ah hah. Yes, if you have some controller that is reconciling the statefulset, it is very likely that the substituted value is being stomped back by the controller, after the admission logic has mutated it.

What bugs me is that the substitution you mention works for everything except a StatefulSet, which is really pretty weird.

assuming those other resources don't have a controller reconciling them, that would explain the difference. The substitution is probably working, but then immediately getting stomped/reverted by the notebook controller.

Try creating a statefulset directly and i think you'll see it work correctly.

I think the KF and ODH Notebook Controller is keeping the image field in the StatefulSet from being updated.

more accurately, it is probably reverting the change after it is made, but yes, fundamentally it's a case of multiple writers competing over the resource.

@shalberd
Copy link
Author

shalberd commented Dec 8, 2022

Will look more into that tomorrow, but we are on to something, yes.

@shalberd
Copy link
Author

shalberd commented Dec 9, 2022

hah, indeed. Turns out with a fresh StatefulSet, without any notebook controller annotations, the image name and digest is resolved correctly from the imagestream tag and name.

So yes, the Notebook controller Kubeflow Notebook CR owns StatefulSets created from a Notebook def instance and thus interferes with or overrides imagestream resolve on admission of the StatefulSet image resolve to the cluster.

  ownerReferences:
    - apiVersion: kubeflow.org/v1beta1
      kind: Notebook
      controller: true
      blockOwnerDeletion: true

@kpouget @DaoDaoNoCode seems the notebook controller overrides the value of the container.image field as resolved before by the Openshift image policy admission plugin/controller from the imagestream tag and name. As far as notebook controller goes, reconciling logic and StatefulSet logic is here, it seems:

https://github.com/kubeflow/kubeflow/blob/master/components/notebook-controller/controllers/notebook_controller.go#L75

Apparently, there is a kubeflow reconcilehelper package and the reconciliation logic is here (update-logic)

https://github.com/kubeflow/kubeflow/blob/master/components/notebook-controller/controllers/notebook_controller.go#L162

The whole StatefulSet template spec, of which the image-field is a part of, is continually updated, thus overwriting any changes from the image policy admission plug-in:

https://github.com/kubeflow/kubeflow/blob/0327c5dcd943806f75b646012002646cff627d8b/components/common/reconcilehelper/util.go#L131

Anyway, the image policy admission plugin itself is working as expected on the Kubernetes / Openshift side for StatefulSets not owned by KF Notebook controller. Here is the result with a fresh StatefulSet, working as expected:

Statefulset before apply, non-notebook-controller-related, referencing imagestream name and tag in same namespace:

piVersion: apps/v1
kind: StatefulSet
metadata:
  name: testwithoutnotebookannotation
  namespace: odhsven
spec:
  serviceName: testwithoutnotebookannotation
  replicas: 1
  selector:
    matchLabels:
      app: testwithoutnotebookannotation
  template:
    metadata:
      labels:
        app: testwithoutnotebookannotation
    spec:
      terminationGracePeriodSeconds: 10
      containers:
        - name: jupyter
          image: 's2i-generic-data-science-notebook:v0.0.5'

Result working as expected, excerpt from the StatefulSet container spec after applied in cluster:

  labels:
        app: testwithoutnotebookannotation
    spec:
      containers:
        - name: jupyter
          image: >-
            quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58

With this fresh StatefulSet and the pod created from it, all is working fine afterwards.
Image-pull is working as expected, since nothing else was defined, image pull policy ifNotPresent (on node) kicks in, the image had been present on my worker node before, all good:

Container image "quay.io/thoth-station/s2i-generic-data-science-notebook@sha256:3f619c61501218f03d39d97541336dee024f446e64f3a47e2bc7e62cddeb2e58" already present on machine

Looks like I need to escalate this error with the ODH Notebook Controller team. Leaving ticket open for now, but thank you for all the help. I also created a ticket at kubeflow explaining the context kubeflow/kubeflow#6829

@shalberd
Copy link
Author

shalberd commented Dec 15, 2022

@bparees We have had some issues on a hyperscaler-based Openshift 4.11 and 4.13.0-ec.0 cluster with the image policy admission plugin / metadata annotation doing nothing. Where in Openshift's config is it defined which plugins are active and which are not?
For Openshift 3.10, I found some hints, but cannot find documentation about that cluster-specific stuff for 4.10 or later anywhere.

@bparees
Copy link
Contributor

bparees commented Dec 15, 2022

it's not configurable in 4.x.

@shalberd
Copy link
Author

shalberd commented Dec 15, 2022

weird, thank you. Could PodSecurityPolicy be an issue potentially interfering with image-resolve? Context: fresh StatefulSet on a hyperscaler cluster, annotation in metadata

annotations:
    [image.openshift.io/triggers](http://image.openshift.io/triggers): >-
      [{"from":{"kind":"ImageStreamTag","name":"s2i-generic-data-science-notebook:v0.0.5","namespace":"odhsven"},"fieldPath":"spec.template.spec.containers[?(@.name==\"nginx\")].image"}]

on 4.10.37 in-house, it worked. On the hyperscalers with 4.11 and 4.13, it did not work. The image-field did not get updated. We carefully made sure to reference the correct container name, imagestream name and tag, all that.

I wonder if something keeps changing in the backend. For example, on 4.10.3 in-house, I had to either have

image: " "

or no image-field specified initially in the containers section of the StatefulSet for lookup to work. Now, on 4.10.37 in-house, I can use whatever value I want for the image-field, and it gets updated accordingly.

@bparees
Copy link
Contributor

bparees commented Dec 15, 2022

does the same annotation work on a deployment or any other resource type?

I don't think PSP could be interfering. The way this is supposed to work, iirc is:

  1. an imagechangetrigger controller watches all the workload(deployment, statefulset, daemonset, etc) resources that have the imagetrigger annotation. Basically anything that contains a PodTemplate
  2. it also watches all the imagestreams that are ref'd by the annotations on those workload resources
  3. when the imagestream changes (new image tag SHA), it updates the workload resource (using the field path indicated by the annotation)
  4. assuming the annotation is specified correctly, once it updates, the workload resource controller (deployment controller, statefulset controller, etc) will see that the pod template has changed (image field value change) and rollout new pods for the workload resource

I don't think the initial value of the image field should matter, though you should be aware that an initial rollout of pods may occur w/ the initial value of the image field, before the imagechangetrigger observes the resources and updates them (which will then cause a subsequent rollout)

@shalberd
Copy link
Author

shalberd commented Dec 15, 2022

Yes, the annotation works on Pods, for example, on my in-house OCP 4.10.3 and 4.10.37 clusters. Just not on the hyperscalers 4.11 and 4.13.

That one-space-string or no-field-necessary issue for image-value might have been something specific to 4.10.3.

Now, on 4.10.37, I can have any string I want as the image-value, the actual value from the imagestream gets inserted / replaced. Once I apply the definition, the image-field gets updated accordingly.

Also, the imagestream is initialized and applied, no changes to the images referenced by the imagestream. So, in that way, that substitution of the image-field does not even necessitate an image-change when applying to a new Pod, Deployment, StatefulSet.

@dmage
Copy link
Member

dmage commented Dec 15, 2022

You may have hit a HyperShift bug - https://issues.redhat.com/browse/OCPBUGS-4025

@shalberd
Copy link
Author

hyper-cool you found that, Oleg, děkuji / merci vielmal

@bparees
Copy link
Contributor

bparees commented Dec 15, 2022

@shalberd these are hypershift guest clusters where you're seeing the issue? When you said hyperscalers i just thought you meant a traditional ocp cluster running on AWS/GCP/Azure.

@shalberd
Copy link
Author

shalberd commented Dec 15, 2022

It is on aws *aws-2.ci.openshift.org .... For more @andrewballantyne from Open Data Hub (ODH) team can answer. I am not a member of Open Data Hub, but pushing the use of imagestream abstractions, non-dependence on openshift-internal docker registry, ImageContentSourcePolicy support from an enterprise perspective, very on-prem OCP oriented, but making sure to test extensively with the ODH team.

Why use hypershift with AWS and Azure anyways? I mean, does that not add an unneccessary layer of abstraction, with reconciliations interfering there, too. Just interested ...

@shalberd
Copy link
Author

shalberd commented Dec 19, 2022

@bparees Regarding collisions between the image policy admission plugin / imagechangetrigger writing to the image-field and the notebook controller overwriting it on reconcilation, what is the conceptually best way to avoid overwriting the image-field by KF notebook controller? Is it

a) excluding the image-field from reconciliation, for example, or is it instead to
b) let the controller handle getting the image-value from the imagestream itself, similar to this approach for the oauth-image in another controller?

To me, the approach that the Elastic operator took looks like duplicating logic, but I am very much new to the topic of operators and object reconciliations. Also, since imagestream is specific to openshift and not always is the image-field of a notebook CR content based an a metadata annotation, this being a speciality case on openshift, I'd go towards approach a.

@bparees
Copy link
Contributor

bparees commented Dec 21, 2022

I think approach (a) is reasonable in this case, but keep in mind that it means that the kf notebook controller now loses the ability to update the image to a new value (such as it might want to do during an upgrade). Unless your plan is that the resource would always be defined with the imagetrigger annotation, in which case upgrades would be handled by updating the imagestream itself and letting the trigger flow through to rollout the workload resource. But now you've tied your logic to openshift behavior (the imagetrigger annotation and imagestream resource, which aren't things that exist on vanilla k8s)

So you need to fully think through who should be controlling the value of the image field, and where this operator/controller will be used.

One way to approach it would be for the kf controller to manage the image field value if, and only if, there is no image.openshift.io/triggers annotation on the resource. The presence of that annotation would be an indication that the user has chosen an alternate way to manage the image field value, and the kf notebook controller should defer to that alternate source of image values. Unfortunately this means adding "openshift specific/aware" logic to the controller, which i understand might be undesirable, but the good news is the logic would be completely portable to vanilla k8s (since it's just looking at annotation keys)

And then if there is not such an annotation, the kf notebook controller can directly set the image field, allowing it to update the image field to new values during upgrades of the kf notebook controller.

This would allow a user to update the workload resource to be imagetriggered based on an imagestream of their choice, when they want that, but if they didn't want it (or weren't on openshift), they'd get whatever standard image the kf notebook controller specifies/defaults.

Again, ultimately this really comes down to a philosophy of how the image value should be managed, and by whom. Is it a default provided by the notebook, but intended to be modified/controlled by the user? Or is it something that's intended to be tightly specified/controlled by the notebook? That will guide you as to how best to implement the behavior and what controls to offer users.

(e.g. another option would be to introduce an entirely new annotation that tells the controller whether or not to manage the image field. Then users could disable controller management of the image field for both that case of "i want the imagetrigger to manage it" but also "i just want to manually set it to something else for some reason")

@bparees
Copy link
Contributor

bparees commented Dec 21, 2022

btw, serverside apply may offer another approach to this problem, by allowing controllers to assert ownership/priority of specific fields within a resource:

https://kubernetes.io/docs/reference/using-api/server-side-apply/

I suspect that would first require updates to the imagechangetrigger logic to respect those semantics (in addition to the kf notebook controller supporting them), but since serverside apply is k8s' long term strategic answer to these sorts of conflicts, i'd be remiss in not at least mentioning it.

@shalberd
Copy link
Author

shalberd commented Jan 5, 2023

Thanks a lot, that helps me get started, I really appreciate your input. Here's to a good new year.

@openshift-bot
Copy link
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 5, 2023
@shalberd shalberd closed this as completed Apr 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
Projects
None yet
Development

No branches or pull requests

4 participants