
Probing container should *not* be restarted with a /healthz http liveness probe [Conformance] {E2eNode Suite} #59150

Closed
ravisantoshgudimetla opened this issue Jan 31, 2018 · 5 comments · Fixed by #59218
Labels: kind/flake (Categorizes issue or PR as related to a flaky test), sig/node (Categorizes an issue or PR as relevant to SIG Node)

Comments

@ravisantoshgudimetla (Contributor)

We noticed a few months ago that this test flaked again (openshift/origin#12072 (comment)).

Opening this issue again to start a discussion on the test.

Following are my observations after going through the history and logs:

  • We are bringing up an nginx container and checking whether <container_IP:80> is available.
  • While this is not a bad idea, many things can go wrong that keep nginx from reaching a running state. Some are related to the underlying infrastructure, over which we have no control; others are related to the network plugin being used, etc. Some were mentioned in [k8s.io] Probing container should *not* be restarted with a /healthz http liveness probe [Conformance] {E2eNode Suite} #30714 (comment) as well.
    For example, the latest failure was dial tcp 10.128.1.69:80: getsockopt: no route to host, which could also be related to the network setup.
  • This makes the test non-deterministic and therefore flaky.

Since the goal of this test is to check that a healthy container is not restarted because of its liveness probe, would it make sense to use a liveness command that checks for the existence of a particular directory (or some other marker) created during container startup, instead of an HTTP GET?
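For illustration only, here is a minimal sketch of the two probe styles using the k8s.io/api/core/v1 types. It assumes a recent API version where the embedded handler field is named ProbeHandler (older releases call it Handler); the /tmp/started marker path is a hypothetical example, not what the e2e test uses.

```go
// Sketch of the two probe styles under discussion. Assumes a recent
// k8s.io/api where the embedded handler field is named ProbeHandler
// (older releases call it Handler). The /tmp/started path is hypothetical.
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// httpLiveness mirrors the flaky pattern: liveness depends on the network
// path to <container_IP:80> being healthy.
var httpLiveness = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/",
			Port: intstr.FromInt(80),
		},
	},
	InitialDelaySeconds: 15,
	FailureThreshold:    3,
}

// execLiveness checks for a marker created at container startup, so it does
// not depend on the node's network setup.
var execLiveness = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		Exec: &corev1.ExecAction{
			Command: []string{"test", "-e", "/tmp/started"},
		},
	},
	InitialDelaySeconds: 15,
	FailureThreshold:    3,
}
```

The exec variant fails only when the container itself is broken, which matches the property the test is actually asserting.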

/sig node
cc @Random-Liu

@k8s-ci-robot added the sig/node label on Jan 31, 2018
@tianshapjq (Contributor)

Not sure if this is helpful, but to my understanding this is technically already supported, as described in "configure liveness and readiness probes". You just have to figure out which probe you need most, either httpGet or exec.

@ravisantoshgudimetla (Contributor, Author)

@tianshapjq It is not about configuring which probe to use. The problem is that the test case flakes, and I believe it flakes because of infrastructure-related issues. I raised this issue to make sure everyone is OK with switching from httpGet to exec.

@ravisantoshgudimetla (Contributor, Author) commented Feb 1, 2018

https://github.com/kubernetes/kubernetes/blob/master/test/e2e/common/container_probe.go#L114 - I just noticed that we already have a test that uses an exec liveness probe. Do we still need the HTTP test? If not, can we delete it or increase its timeouts?
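For reference, the property both variants assert is simply that a healthy container's restart count stays at zero over the observation window. Below is a simplified approximation using plain client-go; it is not the actual e2e framework helper, and the function name, namespace, and pod name are placeholders supplied by the caller.

```go
// Sketch: poll a pod and fail if any container has been restarted. This is a
// simplified stand-in for the e2e framework's liveness helper, written against
// plain client-go; namespace and pod name are supplied by the caller.
package probes

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func expectNoRestarts(ctx context.Context, c kubernetes.Interface, ns, pod string, observe time.Duration) error {
	deadline := time.Now().Add(observe)
	for time.Now().Before(deadline) {
		p, err := c.CoreV1().Pods(ns).Get(ctx, pod, metav1.GetOptions{})
		if err != nil {
			return err
		}
		for _, cs := range p.Status.ContainerStatuses {
			if cs.RestartCount > 0 {
				return fmt.Errorf("container %q restarted %d times", cs.Name, cs.RestartCount)
			}
		}
		time.Sleep(10 * time.Second)
	}
	return nil
}
```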

@fejta (Contributor) commented Feb 8, 2018

/kind flake

@k8s-ci-robot added the kind/flake label on Feb 8, 2018
@fejta (Contributor) commented Feb 8, 2018

@kubernetes/sig-node-test-failures

k8s-github-robot pushed a commit that referenced this issue on Feb 28, 2018
Automatic merge from submit-queue (batch tested with PRs 60342, 60505, 59218, 52900, 60486). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Increase failureThresholds for failing HTTP liveness test

**What this PR does / why we need it**:
Removes the test from e2e that relies on HTTP liveness as a measure of whether the container is good or bad. While this is not a bad idea, we cannot rely on this test, because HTTP liveness depends on the network/infrastructure, etc., over which we sometimes have no control. Increasing the timeout may be an option, but it may not be ideal for all cloud providers and types of hardware (a rough sketch of that tuning follows this PR description).

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #59150 

**Special notes for your reviewer**:
I have stated the reasons in issue #59150. We have seen this test flaking recently in openshift/origin#12072

**Release note**:

```release-note
NONE
```
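For completeness, the tuning route the PR title refers to (keeping the HTTP probe but tolerating transient failures) would look roughly like the sketch below. The values are illustrative, not the ones chosen in the PR, and the same ProbeHandler naming caveat as the earlier sketch applies.

```go
// Sketch: a more forgiving HTTP liveness probe, so a brief network hiccup does
// not immediately trigger a restart. Values are illustrative, not taken from
// the PR; in older k8s.io/api versions ProbeHandler is named Handler.
package probes

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

var tolerantHTTPLiveness = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{Path: "/", Port: intstr.FromInt(80)},
	},
	TimeoutSeconds:   5,  // each GET has 5s before it counts as a failure
	PeriodSeconds:    10, // probe every 10s
	FailureThreshold: 5,  // restart only after 5 consecutive failures (~50s)
}
```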