
Health Check (Liveness and Readiness probe for Operator) #1234

Closed
luszczynski opened this issue Mar 20, 2019 · 25 comments
Labels: help wanted, kind/feature, language/go, language/helm, lifecycle/frozen, priority/important-soon

Comments

@luszczynski

What did you do?

I've created a new Ansible operator using:

operator-sdk new gogs-operator --api-version=org.example.gogs/v1alpha1 --kind=Gogs --type=ansible --generate-playbook

What did you expect to see?

I expected to see a Deployment object using liveness and readiness probes to check the health of my operator.

The file is gogs-operator/deploy/operator.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gogs-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: gogs-operator
  template:
    metadata:
      labels:
        name: gogs-operator
    spec:
      serviceAccountName: gogs-operator
      containers:
        - name: ansible
          command:
          - /usr/local/bin/ao-logs
          - /tmp/ansible-operator/runner
          - stdout
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
            readOnly: true
        - name: operator
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "gogs-operator"
      volumes:
        - name: runner
          emptyDir: {}

What did you see instead? Under which circumstances?

A Deployment without liveness and readiness probes.

Environment

  • operator-sdk version:

operator-sdk version v0.6.0

  • Kubernetes version information:

oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

  • Kubernetes cluster kind:

Additional context

Additionally, operator-sdk should:

  1. Create liveness and readiness probes in the Deployment object
  2. Expose a REST endpoint to be checked by OpenShift/Kubernetes

Do these requirements above make sense?
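For illustration, probes on the operator container could look like the snippet below. This is a hypothetical sketch only: the /healthz and /readyz paths, the port, and the timings are assumptions, not something the SDK scaffolds today.

```yaml
# Hypothetical sketch: HTTP probes added to the "operator" container.
# Paths, port, and timings are assumptions for illustration only.
        - name: operator
          image: "{{ REPLACE_IMAGE }}"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 6789
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 6789
            initialDelaySeconds: 5
            periodSeconds: 10
```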

joelanford added the kind/feature and language/ansible labels Mar 20, 2019
@lilic
Member

lilic commented Mar 22, 2019

Do these requirements above make sense?

cc @shawn-hurley @jmrodri as it's an ansible operator enhancement issue ^

@luszczynski
Author

@lilic
Maybe we should extend this issue to other kinds of operators. All operators should have liveness and readiness probes out of the box, IMO.

lilic added the needs discussion label and removed the language/ansible label Mar 22, 2019
@joelanford
Member

@lilic @luszczynski We actually used to have a default readiness check, but it (in combination with leader election) interfered with rolling out operator deployment updates. See #920 and #932.

If we reintroduce these (or similar) checks, we need to make sure we don't have a regression with the operator deployment rollout issue.

@shawn-hurley
Member

@luszczynski Thanks for the request! I think this makes sense to tackle in a future sprint.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label Jul 25, 2019
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 24, 2019
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@camilamacedo86
Contributor

camilamacedo86 commented Sep 23, 2019

Hi @joelanford, should it be closed? I am re-opening it, since it seems this still makes sense.

@flickerfly
Contributor

flickerfly commented Oct 17, 2019

What did the checks look like when they were conflicting with the election?

EDIT: Found where this was originally discussed in #920.

@christianh814

I got asked this question at a workshop and it's actually a valid one. There should be SOME sort of default readiness/health check for an Ansible operator.

@jeyaramashok
Contributor

Is there a design proposal on how to handle this?

@flickerfly
Contributor

flickerfly commented Dec 5, 2019

I haven't found a design proposal in my poking around.

I also haven't run into any problems simply using ansible --version since my operators are mostly Ansible. I've been thinking about developing a python script to do some more intense checking for permissions, dependencies and the like, but haven't yet. (Someday maybe I'll be a cool Go guy. Y'all do some great work.)

EDIT: Removed sentence about something I was looking for because I found it.

@christianh814

cc @chris-short

camilamacedo86 removed the lifecycle/rotten label Dec 5, 2019
@chris-short

I see a health check as being as simple as running ansible --version and getting a 0 exit code. Yes, I'd love to see something that is a more accurate description of health, but I don't know what that would be at this point.
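This suggestion maps directly onto a Kubernetes exec probe, which marks the container healthy whenever the command exits 0. A minimal sketch (the probe timings are arbitrary assumptions):

```yaml
# Hypothetical exec liveness probe for an Ansible-based operator container.
# Healthy as long as `ansible --version` exits 0; timings are assumptions.
          livenessProbe:
            exec:
              command:
              - ansible
              - --version
            initialDelaySeconds: 10
            periodSeconds: 30
```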

@joelanford joelanford removed this from the 1.0.0 milestone Jan 10, 2020
@estroz estroz added this to the v1.0.0 milestone Mar 2, 2020
@camilamacedo86
Contributor

Should we not add it for all types? Maybe we could do it upstream... wdyt @asmacdo

estroz added the language/go and language/helm labels Jul 24, 2020
@estroz estroz modified the milestones: v1.0.0, v1.1.0 Jul 24, 2020
@camilamacedo86
Contributor

IMO: to close this one we need to address it upstream (Go) and in the SDK for Helm.

@akang0

akang0 commented Sep 24, 2020

@asmacdo @camilamacedo86 Short question: are liveness and readiness probes available for Go operators?

@jberkhahn jberkhahn modified the milestones: v1.1.0, Backlog Sep 30, 2020
jberkhahn added the priority/important-soon label and removed the priority/important-longterm label Sep 30, 2020
estroz added the help wanted label Oct 21, 2020
@camilamacedo86
Contributor

camilamacedo86 commented Oct 30, 2020

Is it implemented in controller-runtime already? See kubernetes-sigs/controller-runtime#855.
If yes, I think we should use the controller-runtime implementation. So, IMO we need here:

  • Check if the above are features that can be used from controller-runtime
  • Do it upstream (Kubebuilder)
  • Port the solution to Helm/Ansible (we might need to review what is done for Ansible already)

@asmacdo I understand that we have a liveness check for Ansible. So, could you please give a hand by supplementing here what is or is not done for Ansible, and why? PS: it would be nice to have a link to the code implementation and/or the PR where it was addressed.

@camilamacedo86
Contributor

camilamacedo86 commented Nov 9, 2020

Just to keep it updated: controller-runtime provides features to address this need, and it is addressed upstream for the v3-alpha plugin. See kubernetes-sigs/kubebuilder#1856.

So, the next step would be to check whether we can add it to the v1+ Ansible/Helm plugins or need to push it to their v2+ plugins as well. Also, we might need to deprecate some Ansible-specific implementation that addresses the same need, in favor of the controller-runtime/upstream one.

@asmacdo do we have any specific reason for not using the controller-runtime implementation to address this for Ansible as well?

c/c @joelanford @jmrodri @estroz @asmacdo

camilamacedo86 self-assigned this Nov 9, 2020
@camilamacedo86
Contributor

/lifecycle frozen

openshift-ci-robot added the lifecycle/frozen label Dec 9, 2020
@camilamacedo86
Contributor

It will be solved for Go with the go/v3 plugin, which will be available in the next release; for Helm/Ansible we have PR #4326.

camilamacedo86 added a commit that referenced this issue Dec 18, 2020
…4326)

**Description of the change:**

- Add the probes for Helm/Ansible by default, as was done for Golang go/v3 upstream.
- For Ansible/Helm-based operators, add a new flag `--health-probe-bind-address` to allow customizing the probe port used.
- For Ansible-based operators, deprecate the ping endpoint.

**Motivation for the change:**

- #1234
@camilamacedo86 camilamacedo86 modified the milestones: Backlog, v1.5.0 Dec 19, 2020
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this issue Feb 5, 2021
rearl-scwx pushed a commit to rearl-scwx/operator-sdk that referenced this issue Feb 5, 2021