
Revert "Revert "[Re-Apply][Distroless] Convert the GCE manifests for master containers."" #78466

Conversation

@yuwenma yuwenma commented May 29, 2019

We fixed the duplicate-log issue in klog (kubernetes/klog#65). The new klog release (v0.3.2) is being introduced into k/k in #78465.

Does this PR introduce a user-facing change?:

NONE

What type of PR is this?

/kind cleanup

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels May 29, 2019
@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/gcp and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels May 29, 2019
yuwenma commented May 29, 2019

/hold (Wait until #78465 is merged)
/assign @wojtek-t @MaciekPytel
/cc @dims @tallclair

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 29, 2019
yuwenma commented May 29, 2019

/assign @MaciekPytel
This PR is a re-apply of #76396. We pushed a fix in klog.

@xichengliudui
/test pull-kubernetes-kubemark-e2e-gce-big

fi
params+=" --log-file=${LOG_PATH}"
params+=" --logtostderr=false"
params+=" --log-file-max-size=0"
Member

Could we also add --stderrthreshold=FATAL to avoid logging anything to stderr (as was the case before the change)?

My understanding is that this contributes to the issue -- any data written to stderr must be read by docker, and in my experience docker reads that data quite slowly.

I think this accounts for a significant part of the issue.

Member

I meant adding stderrthreshold here and in other components as well.

Contributor Author

With the new klog logic, if logtostderr and alsologtostderr are both false, stderrthreshold is not considered (since nothing is ever written to stderr).
See here

In that case, no data is written to stderr regardless of log severity.

Member

I see. Makes sense.
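The flag set discussed in this thread can be sketched as follows. This is illustrative only: LOG_PATH and the component name are placeholders, not the actual GCE manifest values, and the comment on --stderrthreshold assumes the klog behavior described above (the threshold is never consulted once logtostderr and alsologtostderr are both false).

```shell
# Illustrative sketch of the reviewed flag set; paths are placeholders.
LOG_PATH="/var/log/kube-apiserver.log"
params=""
params+=" --log-file=${LOG_PATH}"      # klog writes directly to this file
params+=" --logtostderr=false"         # stop mirroring every line to stderr
params+=" --log-file-max-size=0"       # disable klog's built-in rotation
# --stderrthreshold=FATAL is omitted: with logtostderr and alsologtostderr
# both false, klog never writes to stderr, so the threshold is never used.
echo "kube-apiserver${params}"
```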

mborsz commented May 29, 2019

I think this change makes sense. I also think we should add --stderrthreshold=FATAL to avoid logging anything to stderr (as was the case before the change), since stderr is redirected to a pipe with docker on the other side. In my experience docker reads from that pipe quite slowly, so this can lead to scalability issues.

Another thing: given that we have seen changes like this affect the performance of 5k-node tests, I suggest running a manual test before we merge this. Could you run something like that?

yuwenma commented May 29, 2019

> I think this change makes sense. I think we should add --stderrthreshold=FATAL as well to avoid logging anything to stderr (it is the case before the change), which is redirected to pipe with docker on the other side. From my experience docker usually reads data quite slowly from that pipe so this can lead to potential scalability issues.
>
> Another thing is that given we have seen that changes like that can affect performance of 5k node tests, I suggest running manual test first before we merge this. Could you run something like that?

Can you give some guidance on how to run a 5k-node test?
Just a reminder that we are hitting the v1.15 code freeze (EOD this week), and there's another PR blocked by this change. Is there any chance to get this PR in today?

mborsz commented May 29, 2019

In my opinion this PR should work, but I would really prefer testing it before we submit, to avoid a third revert.

/test pull-kubernetes-e2e-gce-large-performance

This should test this PR at 2k scale. 5k scale would be better, but we won't have the resources to test at 5k until the end of next week.

yuwenma commented May 30, 2019

/test pull-kubernetes-e2e-gce-large-performance

mborsz commented May 30, 2019

I'm afraid the first test failure is not a flake.

I took a look at the logs and I see a few problems there:

  • kube-apiserver is restarting a few times
➜  e2e-021a5abcf8-a7a7f-master zgrep -- --address= kube-apiserver.log*
kube-apiserver.log:I0530 01:41:22.465795       1 flags.go:33] FLAG: --address="127.0.0.1"
kube-apiserver.log-20190530-1559174412.gz:I0529 22:21:45.415090       1 flags.go:33] FLAG: --address="127.0.0.1"
kube-apiserver.log-20190530-1559174412.gz:I0529 23:41:21.603442       1 flags.go:33] FLAG: --address="127.0.0.1"
➜  e2e-021a5abcf8-a7a7f-master
  • some log entries are still repeated, e.g.:
➜  e2e-021a5abcf8-a7a7f-master zgrep 'E0529 23:41:19.450955' kube-apiserver.log*
kube-apiserver.log-20190530-1559174412.gz:E0529 23:41:19.450955       1 metrics.go:96] Error in audit plugin 'buffered' affecting 1 audit events: audit backend shut down
kube-apiserver.log-20190530-1559174412.gz:E0529 23:41:19.450955       1 metrics.go:96] Error in audit plugin 'buffered' affecting 1 audit events: audit backend shut down
kube-apiserver.log-20190530-1559174412.gz:E0529 23:41:19.450955       1 metrics.go:96] Error in audit plugin 'buffered' affecting 1 audit events: audit backend shut down
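A generic way to spot repeated entries like the ones above is to compare sorted lines with `uniq -d`. This sketch uses a small synthetic log file rather than the real apiserver logs:

```shell
# Create a synthetic log with one duplicated entry (illustrative only).
LOG=$(mktemp)
cat > "$LOG" <<'EOF'
E0529 23:41:19.450955 1 metrics.go:96] Error in audit plugin 'buffered' affecting 1 audit events: audit backend shut down
E0529 23:41:19.450955 1 metrics.go:96] Error in audit plugin 'buffered' affecting 1 audit events: audit backend shut down
I0529 23:41:21.603442 1 flags.go:33] FLAG: --address="127.0.0.1"
EOF
# uniq -d prints each line that occurs more than once, exactly once.
dups=$(sort "$LOG" | uniq -d | wc -l | tr -d ' ')
echo "duplicated lines: $dups"   # → duplicated lines: 1
rm -f "$LOG"
```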

mborsz commented May 30, 2019

I see what happened: this PR doesn't contain the klog version update, which happens in #78465.

We need to rebase this PR to include that commit and rerun the test. At least now we know that the 2000-node test reproduces the issue we saw at 5k-node scale.

mborsz commented May 30, 2019

As soon as pull-kubernetes-e2e-gce-large-performance passes, I think this is good to submit.

/lgtm
/approve
/hold

dims commented Sep 18, 2019

/test pull-kubernetes-e2e-gce-large-performance

dims commented Sep 18, 2019

Let's try this again for 1.17.

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 18, 2019
dims commented Sep 18, 2019

/test pull-kubernetes-e2e-gce-large-performance

@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel or /hold comment for consistent failures.

6 similar comments

dims commented Sep 19, 2019

W0919 10:10:30.497] ERROR: (gcloud.compute.instances.create) Could not fetch resource:
W0919 10:10:30.498]  - Quota 'CPUS' exceeded.  Limit: 5200.0 in region us-east1.

:(

@fejta-bot

/retest

1 similar comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 20, 2019
dims commented Sep 23, 2019

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 23, 2019
@fejta-bot

/retest

1 similar comment

@wojtek-t
/hold

Holding for a moment given that we currently have a regression (since Friday). Fortunately, we seem to already know where the problem is - an issue will be opened later today.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 23, 2019
dims commented Sep 30, 2019

@wojtek-t how are things? could we try again?

wojtek-t commented Oct 1, 2019

We recovered from the regression. Maybe we can try.

/hold cancel

@mborsz @mm4tt - FYI

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Oct 1, 2019
@k8s-ci-robot k8s-ci-robot merged commit 6610260 into kubernetes:master Oct 1, 2019
@k8s-ci-robot k8s-ci-robot added this to the v1.17 milestone Oct 1, 2019
Labels
approved, area/provider/gcp, cncf-cla: yes, kind/cleanup, lgtm, needs-priority, release-note-none, sig/cluster-lifecycle, size/M

10 participants