
Revert "Revert "Revert "[Re-Apply][Distroless] Convert the GCE manifests for master containers.""" #83390

Conversation

mborsz
Member

@mborsz mborsz commented Oct 2, 2019

Reverts #78466

This makes the gce 5000-node scale test fail: https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1179079009347047428

[namespace test-cpzh07-18 object latency-deployment-61 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-18/deployments: dial tcp 35.237.145.223:443: connect: connection refused
namespace test-cpzh07-18 object latency-deployment-64 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-18/deployments: dial tcp 35.237.145.223:443: connect: connection refused
namespace test-cpzh07-18 object latency-deployment-65 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-18/deployments: dial tcp 35.237.145.223:443: connect: connection refused
namespace test-cpzh07-17 object latency-deployment-24 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-17/deployments: dial tcp 35.237.145.223:443: connect: connection refused
namespace test-cpzh07-17 object latency-deployment-74 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-17/deployments: dial tcp 35.237.145.223:443: connect: connection refused
namespace test-cpzh07-18 object latency-deployment-24 creation error: Post https://35.237.145.223/apis/apps/v1/namespaces/test-cpzh07-18/deployments: dial tcp 35.237.145.223:443: connect: connection refused
(...)

I believe this failure was caused by that PR because:

  • It seems to be the only significant change in that commit range: 2d491ac...c2c8215
  • It fails in a similar way to the previous attempts to submit the distroless PR (kube-apiserver restarts on the first logrotate and never manages to recover)
NONE

/assign @wojtek-t

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Oct 2, 2019
@wojtek-t wojtek-t added kind/bug Categorizes issue or PR as related to a bug. priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Oct 2, 2019
@wojtek-t wojtek-t added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Oct 2, 2019
@wojtek-t
Member

wojtek-t commented Oct 2, 2019

@dims @yuwenma - FYI

/lgtm
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 2, 2019
@k8s-ci-robot k8s-ci-robot merged commit fdd1f27 into kubernetes:master Oct 2, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 2, 2019
@dims
Member

dims commented Oct 3, 2019

@mborsz @wojtek-t Looks like the liveness probe failed, triggering the restart of the API server:

$ rg -i liveness systemd.log  | grep -v dockerd | grep kube-apiserver | grep -i start
Oct 01 17:51:16.097400 gce-scale-cluster-master kubelet[2779]: I1001 17:51:16.095374    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "13fad7ec052d788db8bfcf9678f50dbd1b1e17d0b51a8501097d41fb95cbea27"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 17:53:11.079120 gce-scale-cluster-master kubelet[2779]: I1001 17:53:11.079085    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "34f9e54bb8054fb48a1883cb191647e0b9b08c359b27aa0c6d92984f72a95332"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 17:56:16.072287 gce-scale-cluster-master kubelet[2779]: I1001 17:56:16.072145    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "10bc3bf994cc31461d8f17129ee469d1dc7d8a6d066abb90bf1bc71cd0ca9e17"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 17:57:56.072276 gce-scale-cluster-master kubelet[2779]: I1001 17:57:56.072219    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "5fde61ce66ceac5f1d5e4737a561212d477317ddbb88d2c943416ae8e9844048"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 17:59:56.076562 gce-scale-cluster-master kubelet[2779]: I1001 17:59:56.074764    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "b226dcd7d2cb5b3f34dde6748eb35512b323cb0e9f5bf57c9566384f5d2d5ba0"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 18:05:36.073246 gce-scale-cluster-master kubelet[2779]: I1001 18:05:36.073208    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "6074d26423c7a201e0c1cc472095424c33b0a394fcd7c167ac20810e93574e49"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 18:12:46.075489 gce-scale-cluster-master kubelet[2779]: I1001 18:12:46.072512    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "4c1b3310e340c72a27d7ccbf0da1cd4b99da306b711a1dcf61f5795ef7b16b1c"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 18:14:19.608688 gce-scale-cluster-master kubelet[2779]: I1001 18:14:19.608638    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "4407bd6e90da4549e5e870db0e9b07ba75509e791b0b44d1d40fde6f2a7241c8"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 18:20:51.072711 gce-scale-cluster-master kubelet[2779]: I1001 18:20:51.072680    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "de9d0ad7ac0fc5b74ffb1129a158d7b02c75c37741191f23c65b53043fc44adb"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted
Oct 01 18:22:21.100552 gce-scale-cluster-master kubelet[2779]: I1001 18:22:21.100509    2779 kuberuntime_manager.go:612] Container "kube-apiserver" ({"docker" "a30cb14a2ecec3322cafcd42d0ee10ee1cad1eaa9d49dc0fbe41ca57d0bf2cb5"}) of pod kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322): Container kube-apiserver failed liveness probe, will be restarted

@dims
Member

dims commented Oct 3, 2019

and the failure seems to be related to the HTTP health check request timing out:

17929-Oct 01 17:51:16.092916 gce-scale-cluster-master kubelet[2779]: I1001 17:51:16.092786    2779 prober.go:116] Liveness probe for "kube-apiserver-gce-scale-cluster-master_kube-system(ad5dcc4cf986873a44cd4b433aa90322):kube-apiserver" failed (failure): Get https://127.0.0.1:443/healthz?exclude=etcd&exclude=kms-provider-0&exclude=kms-provider-1: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
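For reference, the same check can be reproduced by hand from the master with a plain HTTPS request against the probed endpoint; the curl invocation below is only an illustration, and the 15s timeout is an assumption about the probe's timeoutSeconds, not a value taken from this run:

$ curl -sk --max-time 15 \
    'https://127.0.0.1:443/healthz?exclude=etcd&exclude=kms-provider-0&exclude=kms-provider-1'

When the apiserver is wedged, the request simply hangs until the timeout, which matches the "Client.Timeout exceeded while awaiting headers" failure the kubelet reports above.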

@dims
Member

dims commented Oct 3, 2019

and this as well at the same time:

Oct 01 17:50:44.407323 gce-scale-cluster-master kernel: TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
Oct 01 17:51:56.337244 gce-scale-cluster-master kernel: TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
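For whoever picks up the SYN-flood angle, the counters that kernel message points at can be inspected on the master with standard Linux tooling; the commands below are generic suggestions, not output from this run:

$ netstat -s | grep -i -e 'SYNs to LISTEN' -e 'listen queue'                        # dropped SYNs / listen queue overflows
$ nstat -az | grep -i -e TcpExtListenDrops -e TcpExtTCPReqQFullDoCookies            # same counters by SNMP name
$ sysctl net.ipv4.tcp_syncookies net.ipv4.tcp_max_syn_backlog net.core.somaxconn    # relevant backlog tunables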

@dims
Member

dims commented Oct 3, 2019

the log rotation seems to have run a few times before

[dims@dims-a01 21:18] ~/Downloads ⟩ rg "Started .* log rotation" systemd.log
3463:Oct 01 17:08:03.415655 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
4793:Oct 01 17:10:00.971497 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
6206:Oct 01 17:15:10.965969 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
7721:Oct 01 17:20:10.966515 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
9343:Oct 01 17:25:10.964466 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
10967:Oct 01 17:30:10.964996 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
12596:Oct 01 17:35:10.965277 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
14209:Oct 01 17:40:10.965352 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
15887:Oct 01 17:45:10.967283 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
18756:Oct 01 17:52:46.233058 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
19986:Oct 01 17:55:10.964841 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
22268:Oct 01 18:00:10.964951 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
25109:Oct 01 18:05:10.964749 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
28416:Oct 01 18:10:10.965011 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
31094:Oct 01 18:15:10.964775 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
34237:Oct 01 18:20:10.967123 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.
38986:Oct 01 18:27:48.607859 gce-scale-cluster-master systemd[1]: Started Kubernetes log rotation.

Looking at the timestamp on the gz file kube-apiserver.log-20191001-1569954310.gz, the epoch suffix 1569954310 (epoch time converter: https://www.unixtimestamp.com/index.php) is 10/01/2019 @ 6:25pm UTC, which is way later than when the problem started (17:50-ish per the logs above).
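For anyone re-checking the arithmetic, GNU date gives the same conversion:

$ date -u -d @1569954310
Tue Oct  1 18:25:10 UTC 2019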

@dims
Member

dims commented Oct 3, 2019

Do we know anyone who can help poke at the request_sock_TCP: Possible SYN flooding angle?

@mborsz
Member Author

mborsz commented Oct 3, 2019

@dims The logrotate process wakes up every 5 minutes and checks whether there is work to do (i.e. whether any of the log files has grown beyond 5 GiB). In this run, that happened twice:

  • kube-apiserver-audit.log-20191001-1569952210.gz at 17:50:10
  • kube-apiserver.log-20191001-1569954310.gz at 18:25:10

I think the first logrotate attempt is what triggered this issue.
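For reference, the 17:50:10 time can be read straight off the audit file's epoch suffix the same way (GNU date shown for illustration), and it lines up with the first liveness-probe failure at 17:51:16 in the kubelet log above:

$ date -u -d @1569952210
Tue Oct  1 17:50:10 UTC 2019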

@mborsz
Member Author

mborsz commented Oct 3, 2019

Worth mentioning: kube-apiserver-audit.log hasn't been changed by the distroless change (it doesn't use klog), so any heavy I/O load on the disk (potentially with some fsyncs) will likely trigger a similar issue.

@dims
Member

dims commented Oct 3, 2019

@mborsz Ah! I missed the kube-apiserver-audit.log-20191001-1569952210.gz one. Thanks!
