Log non-graceful termination to /var/log/kube-apiserver/termination.log and stdout #876

sttts · 2020-06-04T09:07:48Z

This PR:

- adds a /var/log/kube-apiserver/termination.log file to masters (sitting next to the audit logs we already have). It lists non-graceful terminations for easier analysis of CI and customer clusters.
- lets the watch-termination binary create NonGracefulTermination events in the openshift-kube-apiserver namespace on next launch.

Container logs alone are not enough because logging during termination is broken, and we lose logs from old pods. So we have to persist the data somewhere on disk.
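To illustrate the "persist on disk" idea, here is a minimal sketch in Python. It is not the watch-termination implementation; the function name, record format, and file mode are assumptions made for the example. It appends one timestamped line and calls fsync so the record is flushed to disk before the process goes away.

```python
import os
import time


def append_termination_record(path, message, now=None):
    """Append one timestamped line to the log at `path` and flush it to disk.

    Hypothetical helper for illustration only; the line format is an assumption.
    """
    # time.gmtime(None) uses the current time; tests can pass a fixed epoch value.
    ts = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    line = f"{ts} {message}\n"
    # O_APPEND preserves records written by earlier container instances.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, line.encode("utf-8"))
        # fsync so the record is not left only in the page cache if the node dies.
        os.fsync(fd)
    finally:
        os.close(fd)
    return line
```

Because the file lives on a host path rather than in the container's log stream, the record would still be readable after the pod has been replaced, which is the property the description above relies on.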

etcd is not an option either: it struggles with the same termination issues and is not reliable.

Depends on openshift/origin#25192.

bindata/v4.1.0/kube-apiserver/pod.yaml

tnozicka

I'd prefer a carry patch on kube-apiserver which is unlikely to conflict and has lower risk of getting some signal handling wrong, but this doesn't look too bad either :)

p0lyn0mial · 2020-06-04T15:54:21Z

/test e2e-gcp-upgrade

openshift-ci-robot · 2020-06-04T15:54:36Z

@p0lyn0mial: The specified target(s) for /test were not found.
The following commands are available to trigger jobs:

/test e2e-aws
/test e2e-aws-operator
/test e2e-aws-operator-encryption
/test e2e-aws-operator-encryption-perf
/test e2e-aws-serial
/test e2e-aws-upgrade
/test images
/test unit
/test verify
/test verify-deps

Use /test all to run the following jobs:

pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-operator
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-serial
pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws-upgrade
pull-ci-openshift-cluster-kube-apiserver-operator-master-images
pull-ci-openshift-cluster-kube-apiserver-operator-master-unit
pull-ci-openshift-cluster-kube-apiserver-operator-master-verify
pull-ci-openshift-cluster-kube-apiserver-operator-master-verify-deps

In response to this:

/test e2e-gcp-upgrade

sttts · 2020-06-05T14:58:02Z

/retest

sttts · 2020-06-08T07:49:20Z

/retest

smarterclayton · 2020-06-10T15:41:17Z

I really don't like nested signal termination in this case either - why wouldn't kube-apiserver summarize failure better on exit to logs? That's why Brian and I added FallbackToLogsOnError - so that our infra components could log better failures on exit?

sttts · 2020-06-12T14:58:05Z

/retest

sttts · 2020-06-15T09:39:55Z

Flakes.

/retest

sttts · 2020-06-15T10:00:42Z

> I really don't like nested signal termination in this case either - why wouldn't kube-apiserver summarize failure better on exit to logs? That's why Brian and I added FallbackToLogsOnError - so that our infra components could log better failures on exit?

Because we don't have logs. Logging of termination has been broken in kubelet since 4.1. Our BZ has been open since 4.1 too. I am happy to remove this again as soon as

a) kubelet and cri-o start working and doing their job
b) we have termination logs of old pods e.g. through loki.

We are spending half of our time hunting issues because we are blind. Waiting is not an option.

p0lyn0mial · 2020-06-15T10:01:46Z

/lgtm

openshift-bot · 2020-06-15T11:23:33Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T00:42:04Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T02:13:08Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T02:52:08Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T04:23:16Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T06:07:05Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T06:22:48Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T07:38:14Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T10:14:06Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T11:57:37Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T12:10:33Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T14:07:29Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T14:33:42Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T16:17:29Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T16:30:29Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T19:58:15Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-08T22:47:14Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-09T00:57:08Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-09T02:54:09Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-09T03:07:20Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-09T04:38:18Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-09T06:09:24Z

/retest

Please review the full test history for this PR and help us cut down flakes.

p0lyn0mial · 2020-07-09T06:13:08Z

/hold

The failing ci/prow/e2e-aws is real. It looks like the must-gather test doesn't know how to handle terminating.gz files:

STEP: /tmp/test.oc-adm-must-gather.044028853/registry-svc-ci-openshift-org-ci-op-6nby4y5y-stable-sha256-90e78122d7d240f2b1a0eaa05507303205292cd745fbfa5b36298942dab81fe4/audit_logs/kube-apiserver/ip-10-0-163-93.us-west-2.compute.internal-.terminating.gz
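A test that walks the audit_logs directory can tolerate such files by skipping names it does not recognise. The sketch below shows the idea in Python; the naming rule (an audit log has "audit" in its base name) and both function names are assumptions for illustration, not what the must-gather test actually does.

```python
import os


def is_audit_log(path):
    """Hypothetical rule: a file counts as an audit log only if its base name contains 'audit'."""
    # Only the base name is checked, because the parent directory is itself called audit_logs.
    return "audit" in os.path.basename(path)


def select_audit_logs(paths):
    """Return only the paths that the rule above classifies as audit logs."""
    return [p for p in paths if is_audit_log(p)]
```

With a rule of this kind, a file such as the `.terminating.gz` one in the STEP output above would be ignored instead of being parsed as an audit log.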

sttts · 2020-07-16T09:06:22Z

openshift/origin#25282 merged.

/retest

sttts · 2020-07-16T09:06:38Z

/hold cancel

openshift-bot · 2020-07-16T10:49:41Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-bot · 2020-07-16T11:02:42Z

/retest

Please review the full test history for this PR and help us cut down flakes.

openshift-ci-robot requested review from mfojtik and soltysh June 4, 2020 09:08

openshift-ci-robot added the approved label Jun 4, 2020

sttts force-pushed the sttts-termination-log branch from cbe2285 to d5b388f June 4, 2020 09:09

sttts commented on bindata/v4.1.0/kube-apiserver/pod.yaml Jun 4, 2020 (outdated)

sttts force-pushed the sttts-termination-log branch 2 times, most recently from d36f601 to 6188db2 June 4, 2020 09:19

tnozicka reviewed bindata/v4.1.0/kube-apiserver/pod.yaml Jun 4, 2020 (outdated)

tnozicka reviewed bindata/v4.1.0/kube-apiserver/pod.yaml Jun 4, 2020 (outdated)

tnozicka reviewed Jun 4, 2020

sttts force-pushed the sttts-termination-log branch 3 times, most recently from 9730e1e to 164fe52 June 4, 2020 15:45

sttts force-pushed the sttts-termination-log branch 2 times, most recently from 654d7df to 57486ec June 5, 2020 11:42

sttts force-pushed the sttts-termination-log branch 3 times, most recently from 3817cdb to 6e96841 June 10, 2020 07:53

openshift-ci-robot assigned p0lyn0mial Jun 15, 2020

openshift-ci-robot added the lgtm label Jun 15, 2020

openshift-ci-robot added the do-not-merge/hold label Jul 9, 2020

sttts mentioned this pull request Jul 15, 2020 in openshift/origin#25282 (e2e/mustgather: ignore non audit files in /var/log/*-apiserver dir), which merged.

openshift-ci-robot removed the do-not-merge/hold label Jul 16, 2020

openshift-merge-robot merged commit 9bd499c into openshift:master Jul 16, 2020
