
intermittent master init failures in kubemark and other tests #55686

Closed
porridge opened this Issue Nov 14, 2017 · 6 comments

porridge (Member) commented Nov 14, 2017

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:

Last 3 failures to bring up a cluster for the job ci-kubernetes-kubemark-5-gce-last-release:

Looking at the only log available, kubemark-5-old-master/serial-1.log, everything looks good and there are no unexplained deviations from the log of a successful run.

I think in such cases we need more logs from the master than just the serial console output.

/cc @kubernetes/sig-scalability-bugs

porridge (Member) commented Nov 14, 2017

/assign

k8s-merge-robot added a commit that referenced this issue Nov 23, 2017

Merge pull request #55690 from porridge/debug-curl
Automatic merge from submit-queue (batch tested with PRs 56208, 55690). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Dump last curl output if cluster fails to come up.

**What this PR does / why we need it**:
This is a step toward solving #55686

**Release note**:
```release-note
NONE
```
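
For context, the idea behind that PR can be sketched roughly as follows: keep the last probe response around and print it when the bring-up loop gives up, instead of discarding it. The real change lives in the bash kube-up scripts and uses curl; the Go program below, the endpoint, and the retry budget are purely illustrative assumptions.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Illustrative endpoint; the real scripts probe the master's external address.
	const endpoint = "https://127.0.0.1:6443/healthz"
	var lastProbe string

	client := &http.Client{Timeout: 5 * time.Second}
	for attempt := 1; attempt <= 60; attempt++ {
		resp, err := client.Get(endpoint)
		if err != nil {
			lastProbe = err.Error()
		} else {
			body, _ := io.ReadAll(resp.Body)
			resp.Body.Close()
			lastProbe = resp.Status + " " + string(body)
			if resp.StatusCode == http.StatusOK {
				fmt.Println("cluster is up")
				return
			}
		}
		time.Sleep(5 * time.Second)
	}

	// Instead of failing silently, dump whatever the final probe returned so the
	// logs show why the master never became healthy.
	fmt.Fprintf(os.Stderr, "cluster failed to come up; last probe output: %s\n", lastProbe)
	os.Exit(1)
}
```
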
porridge (Member) commented Dec 6, 2017

I faced the same problem when trying to bring up a 4k+ node cluster: #56714 (comment)
Digging deeper, it seems that listing all pods in all namespaces is unlikely to succeed within 5 seconds on a cluster that large. In fact, I'm surprised it has worked so well so far; I think we were just very lucky.
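
(Added for context, not part of the original comment.) A rough client-go rendering of the probe being described; the actual check is a curl from the bring-up script, and the List signature below assumes a recent client-go release. The point is that an unbounded list of every pod in every namespace, under a 5-second deadline, is exactly the request that struggles to finish on a 4k+ node cluster.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Same 5-second budget as the check discussed above.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Listing *all* pods in *all* namespaces: on a very large cluster the
	// response is enormous and routinely misses the deadline.
	pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{})
	if err != nil {
		fmt.Println("probe failed:", err) // e.g. context deadline exceeded
		return
	}
	fmt.Println("probe ok, pods:", len(pods.Items))
}
```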

@porridge porridge changed the title from intermittent master init failures in kubemark tets to intermittent master init failures in kubemark and other tets Dec 8, 2017

@porridge porridge changed the title from intermittent master init failures in kubemark and other tets to intermittent master init failures in kubemark and other tests Dec 8, 2017

k8s-merge-robot added a commit that referenced this issue Dec 28, 2017

Merge pull request #56888 from porridge/limit-curl-get
Automatic merge from submit-queue (batch tested with PRs 57670, 56888). If you want to cherry-pick this change to another branch, please follow the instructions [here](https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md).

Limit number of pods listed as master liveness check.

**What this PR does / why we need it**:

Another step in making #55686 less likely.

**Release note**:
```release-note
NONE
```
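
For readers following along, here is a hedged sketch of what that mitigation amounts to, expressed in client-go terms rather than the curl-based script the PR actually touches: request at most a single pod via the list `limit` parameter, so the probe's cost no longer scales with cluster size. The limit value and the helper below are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// masterIsLive reports whether the apiserver can answer a *bounded* pod list
// within the deadline. The limit keeps the response size independent of
// cluster size, unlike an unbounded list of every pod in every namespace.
func masterIsLive(client kubernetes.Interface) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	_, err := client.CoreV1().Pods(metav1.NamespaceAll).List(ctx, metav1.ListOptions{Limit: 1})
	return err == nil
}

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	fmt.Println("master live:", masterIsLive(kubernetes.NewForConfigOrDie(config)))
}
```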

fejta-bot commented Mar 8, 2018

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

porridge (Member) commented Mar 8, 2018

/unsassign

fejta-bot commented Apr 7, 2018

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten
/remove-lifecycle stale

fejta-bot commented May 7, 2018

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close
