
Health Check (Liveness and Readiness probe for Operator) #1234

Closed
luszczynski opened this issue Mar 20, 2019 · 25 comments
Labels: help wanted, kind/feature, language/go, language/helm, lifecycle/frozen, priority/important-soon

Comments

@luszczynski

What did you do?

I've created a new Ansible operator using:

operator-sdk new gogs-operator --api-version=org.example.gogs/v1alpha1 --kind=Gogs --type=ansible --generate-playbook

What did you expect to see?

I expected to see a Deployment object using liveness and readiness probes to check the health of my operator.

The file is gogs-operator/deploy/operator.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gogs-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      name: gogs-operator
  template:
    metadata:
      labels:
        name: gogs-operator
    spec:
      serviceAccountName: gogs-operator
      containers:
        - name: ansible
          command:
          - /usr/local/bin/ao-logs
          - /tmp/ansible-operator/runner
          - stdout
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
            readOnly: true
        - name: operator
          # Replace this with the built image name
          image: "{{ REPLACE_IMAGE }}"
          imagePullPolicy: "{{ pull_policy|default('Always') }}"
          volumeMounts:
          - mountPath: /tmp/ansible-operator/runner
            name: runner
          env:
            - name: WATCH_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: OPERATOR_NAME
              value: "gogs-operator"
      volumes:
        - name: runner
          emptyDir: {}

What did you see instead? Under which circumstances?

A Deployment without liveness and readiness probes.

Environment

  • operator-sdk version:

operator-sdk version v0.6.0

  • Kubernetes version information:

oc v3.11.88
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

  • Kubernetes cluster kind:

Additional context

Additionally, operator-sdk should:

  1. Create liveness and readiness probes in the Deployment object
  2. Expose a REST endpoint to be checked by OpenShift/Kubernetes

Do these requirements above make sense?
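For illustration, probes on the operator container could look like the snippet below. This is a hypothetical sketch only: the /healthz and /readyz paths, the port, and the timings are assumptions, not something the SDK scaffolds today.

```yaml
# Hypothetical sketch: HTTP probes added to the "operator" container.
# Paths, port, and timings are assumptions for illustration only.
        - name: operator
          image: "{{ REPLACE_IMAGE }}"
          livenessProbe:
            httpGet:
              path: /healthz
              port: 6789
            initialDelaySeconds: 15
            periodSeconds: 20
          readinessProbe:
            httpGet:
              path: /readyz
              port: 6789
            initialDelaySeconds: 5
            periodSeconds: 10
```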

joelanford added the kind/feature and language/ansible labels Mar 20, 2019
@lilic
Member

lilic commented Mar 22, 2019

Do these requirements above make sense?

cc @shawn-hurley @jmrodri as it's an ansible operator enhancement issue ^

@luszczynski
Author

@lilic
Maybe we should extend this issue to other kinds of operators. All operators should have liveness and readiness probes out of the box, IMO.

lilic added the needs discussion label and removed the language/ansible label Mar 22, 2019
@joelanford
Member

@lilic @luszczynski We actually used to have a default readiness check, but it (in combination with leader election) interfered with rolling out operator deployment updates. See #920 and #932.

If we reintroduce these (or similar) checks, we need to make sure we don't have a regression with the operator deployment rollout issue.

@shawn-hurley
Member

@luszczynski Thanks for the request! I think this makes sense to tackle in a future sprint.

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-ci-robot added the lifecycle/stale label Jul 25, 2019
@openshift-bot

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Aug 24, 2019
@openshift-bot

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@camilamacedo86
Contributor

camilamacedo86 commented Sep 23, 2019

Hi @joelanford, should it be closed? I am re-opening it, since it seems this still makes sense.

@flickerfly
Contributor

flickerfly commented Oct 17, 2019

What did the checks look like when they were conflicting with the election?

EDIT: Found where this was originally discussed in #920.

@christianh814

I got asked this question at a workshop and it's actually a valid one. There should be SOME sort of default readiness/health check for an Ansible operator.

@jeyaramashok
Contributor

Is there a design proposal on how to handle this?

@flickerfly
Contributor

flickerfly commented Dec 5, 2019

I haven't found a design proposal in my poking around.

I also haven't run into any problems simply using ansible --version since my operators are mostly Ansible. I've been thinking about developing a python script to do some more intense checking for permissions, dependencies and the like, but haven't yet. (Someday maybe I'll be a cool Go guy. Y'all do some great work.)

EDIT: Removed sentence about something I was looking for because I found it.

@christianh814

cc @chris-short

camilamacedo86 removed the lifecycle/rotten label Dec 5, 2019
@chris-short

I see a health check as being as simple as running ansible --version and getting a 0 exit code. Yes, I'd love to see something that is a more accurate description of health, but I don't know what that would be at this point.
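This suggestion maps directly onto a Kubernetes exec probe, which marks the container healthy whenever the command exits 0. A minimal sketch (the probe timings are arbitrary assumptions):

```yaml
# Hypothetical exec liveness probe for an Ansible-based operator container.
# Healthy as long as `ansible --version` exits 0; timings are assumptions.
          livenessProbe:
            exec:
              command:
              - ansible
              - --version
            initialDelaySeconds: 10
            periodSeconds: 30
```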

@joelanford joelanford removed this from the 1.0.0 milestone Jan 10, 2020
@estroz estroz added this to the v1.0.0 milestone Mar 2, 2020
@camilamacedo86
Contributor

Should we not add it for all types? Maybe we could do it upstream... wdyt @asmacdo

estroz added the language/go and language/helm labels Jul 24, 2020
@estroz estroz modified the milestones: v1.0.0, v1.1.0 Jul 24, 2020
@camilamacedo86
Contributor

IMO: to close this one we need to address it upstream (Go) and in the SDK for Helm.

@akang0

akang0 commented Sep 24, 2020

@asmacdo @camilamacedo86 Short question: are liveness and readiness probes available for Go operators?

@jberkhahn jberkhahn modified the milestones: v1.1.0, Backlog Sep 30, 2020
jberkhahn added the priority/important-soon label and removed the priority/important-longterm label Sep 30, 2020
estroz added the help wanted label Oct 21, 2020
@camilamacedo86
Contributor

camilamacedo86 commented Oct 30, 2020

Is it implemented in controller-runtime already? See kubernetes-sigs/controller-runtime#855.
If yes, I think we should use the controller-runtime implementation. So, IMO we need here:

  • Check if the above are features that can be used from controller-runtime
  • Do it upstream (Kubebuilder)
  • Port the solution to Helm/Ansible (we might need to review what is done for Ansible already)

@asmacdo I understand that we have a liveness check for Ansible. So, could you please give a hand by supplementing here what is or is not done for Ansible, and why? PS: it would be nice to have a link to the code implementation and/or the PR where it was addressed.

@camilamacedo86
Contributor

camilamacedo86 commented Nov 9, 2020

Just to keep it updated: controller-runtime provides features to address this need, and it is addressed upstream for the v3-alpha plugin. See kubernetes-sigs/kubebuilder#1856.

So, the next step would be to check whether we can add it to the v1+ Ansible/Helm plugins or need to push it to their v2+ plugins as well. Also, we might need to deprecate some Ansible-specific implementation that addresses the same need, in favor of the controller-runtime/upstream one.

@asmacdo do we have any specific reason for not using the controller-runtime implementation to address this for Ansible as well?

c/c @joelanford @jmrodri @estroz @asmacdo

camilamacedo86 self-assigned this Nov 9, 2020
@camilamacedo86
Contributor

/lifecycle frozen

openshift-ci-robot added the lifecycle/frozen label Dec 9, 2020
@camilamacedo86
Contributor

It will be solved for Go with the go/v3 plugin, which will be available in the next release; for Helm/Ansible we have PR #4326.

camilamacedo86 added a commit that referenced this issue Dec 18, 2020
…4326)

**Description of the change:**

- Add the probes for Helm/Ansible by default, as was done for Golang go/v3 upstream.
- For Ansible/Helm-based operators, add a new flag `--health-probe-bind-address` to allow customizing the probe port used.
- For Ansible-based operators, deprecate the ping endpoint.

**Motivation for the change:**

- #1234
@camilamacedo86 camilamacedo86 modified the milestones: Backlog, v1.5.0 Dec 19, 2020
reinvantveer pushed a commit to reinvantveer/operator-sdk that referenced this issue Feb 5, 2021
rearl-scwx pushed a commit to rearl-scwx/operator-sdk that referenced this issue Feb 5, 2021