Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Log non-graceful termination to /var/log/kube-apiserver/termination.log and stdout #876

Merged
merged 1 commit into from Jul 16, 2020

Conversation

sttts
Copy link
Contributor

@sttts sttts commented Jun 4, 2020

This

  • adds a /var/log/kube-apiserver/termination.log file to masters (sitting next to the audit logs we already have). It lists non-graceful terminations for easier CI and customer cluster analysis.
  • lets the watch-termination binary create NonGracefulTermination events in the openshift-kube-apiserver namespace on next launch.

Container logs alone are not enough because logging during termination is broken, and we lose logs from old pods. So we have to persist the data somewhere on disk.

Etcd is also no option as etcd struggles with the same termination issues and is not reliable.

Depends on openshift/origin#25192.

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2020
@sttts sttts force-pushed the sttts-termination-log branch 2 times, most recently from d36f601 to 6188db2 Compare June 4, 2020 09:19
Copy link
Contributor

@tnozicka tnozicka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer a carry patch on kube-apiserver which is unlikely to conflict and has lower risk of getting some signal handling wrong, but this doesn't look too bad either :)

@sttts sttts force-pushed the sttts-termination-log branch 3 times, most recently from 9730e1e to 164fe52 Compare June 4, 2020 15:45
@p0lyn0mial
Copy link
Contributor

/test e2e-gcp-upgrade

@openshift-ci-robot
Copy link

@p0lyn0mial: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

  • /test e2e-aws
  • /test e2e-aws-operator
  • /test e2e-aws-operator-encryption
  • /test e2e-aws-operator-encryption-perf
  • /test e2e-aws-serial
  • /test e2e-aws-upgrade
  • /test images
  • /test unit
  • /test verify
  • /test verify-deps

Use /test all to run the following jobs:

  • pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-operator
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-images
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-unit
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-verify
  • pull-ci-openshift-cluster-kube-apiserver-operator-master-verify-deps

In response to this:

/test e2e-gcp-upgrade

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sttts sttts force-pushed the sttts-termination-log branch 2 times, most recently from 654d7df to 57486ec Compare June 5, 2020 11:42
@sttts
Copy link
Contributor Author

sttts commented Jun 5, 2020

/retest

1 similar comment
@sttts
Copy link
Contributor Author

sttts commented Jun 8, 2020

/retest

@sttts sttts force-pushed the sttts-termination-log branch 3 times, most recently from 3817cdb to 6e96841 Compare June 10, 2020 07:53
@smarterclayton
Copy link
Contributor

I really don't like nested signal termination in this case either - why wouldn't kube-apiserver summarize failure better on exit to logs? That's why brian and I added fallbacktologsonerror - so that our infra components could log better failures on exit?

@sttts
Copy link
Contributor Author

sttts commented Jun 12, 2020

/retest

@sttts
Copy link
Contributor Author

sttts commented Jun 15, 2020

Flakes.

/retest

@sttts
Copy link
Contributor Author

sttts commented Jun 15, 2020

I really don't like nested signal termination in this case either - why wouldn't kube-apiserver summarize failure better on exit to logs? That's why brian and I added fallbacktologsonerror - so that our infra components could log better failures on exit?

Because we don't have logs. Logging of termination is broken in kubelet since 4.1. Our BZ is open since 4.1 too. Am happy to remove this again as soon as

a) kubelet and cri-o start working and doing their job
b) we have termination logs of old pods e.g. through loki.

We are spending half of our time hunting issues because we are blind. Waiting is no option.

@p0lyn0mial
Copy link
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Jun 15, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

20 similar comments
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@p0lyn0mial
Copy link
Contributor

/hold

failing ci/prow/e2e-aws is real, it looks like the must-gather test doesn't know how to handle terminating.gz files

STEP: /tmp/test.oc-adm-must-gather.044028853/registry-svc-ci-openshift-org-ci-op-6nby4y5y-stable-sha256-90e78122d7d240f2b1a0eaa05507303205292cd745fbfa5b36298942dab81fe4/audit_logs/kube-apiserver/ip-10-0-163-93.us-west-2.compute.internal-.terminating.gz

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2020
@sttts
Copy link
Contributor Author

sttts commented Jul 16, 2020

openshift/origin#25282 merged.

/retest

@sttts
Copy link
Contributor Author

sttts commented Jul 16, 2020

/hold cancel

@openshift-ci-robot openshift-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 16, 2020
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

1 similar comment
@openshift-bot
Copy link
Contributor

/retest

Please review the full test history for this PR and help us cut down flakes.

@openshift-merge-robot openshift-merge-robot merged commit 9bd499c into openshift:master Jul 16, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants