Check for hostError and automaticRestart when test finishes. #71456

Merged 1 commit into kubernetes:master on Nov 30, 2018

Conversation

mborsz
Member

@mborsz mborsz commented Nov 27, 2018

What type of PR is this?
/kind flake

What this PR does / why we need it:
When a test finishes, check the GCE activity logs for hostError and automaticRestart events. This is useful for debugging failed (flaky) test attempts.
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/flake Categorizes issue or PR as related to a flaky test. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Nov 27, 2018
@mborsz
Member Author

mborsz commented Nov 27, 2018

/assign @wojtek-t

@k8s-ci-robot k8s-ci-robot added sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Nov 27, 2018
logName=\"projects/${PROJECT}/logs/compute.googleapis.com%2Factivity_log\"
(jsonPayload.event_subtype=\"compute.instances.hostError\" OR jsonPayload.event_subtype=\"compute.instances.automaticRestart\")
jsonPayload.resource.name:\"${group}\"
timestamp >= \"${creation_timestamp}\""
Member

This looks great.
My only question is: what is the estimated time to run this command? It essentially scans the activity log for all nodes, which may be quite a lot for 5k-node clusters.
Did you try running that on larger clusters?

Member Author

It takes ~20 seconds per instance group. So approx. 1 minute for all of them.

Member

Cool - thanks.
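
For readers following the review thread, here is a minimal sketch of how the filter quoted in the diff could be assembled and passed to `gcloud logging read`. The function names `build_activity_log_filter` and `find_interrupted_instances` are hypothetical and not taken from the PR; only the filter text itself comes from the diff. Comparisons placed on separate lines are combined with an implicit AND.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not from the PR): prints the activity-log filter
# for one instance group, using the same four clauses as the reviewed diff.
build_activity_log_filter() {
  local project="$1"
  local group="$2"
  local creation_timestamp="$3"
  cat <<EOF
logName="projects/${project}/logs/compute.googleapis.com%2Factivity_log"
(jsonPayload.event_subtype="compute.instances.hostError" OR jsonPayload.event_subtype="compute.instances.automaticRestart")
jsonPayload.resource.name:"${group}"
timestamp >= "${creation_timestamp}"
EOF
}

# Hypothetical wrapper (not from the PR): runs the query for one group.
# Requires an authenticated gcloud; it is defined here but not invoked.
find_interrupted_instances() {
  local project="$1"
  local group="$2"
  local creation_timestamp="$3"
  gcloud logging read \
    "$(build_activity_log_filter "${project}" "${group}" "${creation_timestamp}")" \
    --project="${project}" \
    --format=json
}

# Print the filter for illustrative (made-up) values.
build_activity_log_filter "my-project" "my-group" "2018-11-27T00:00:00Z"
```

Building the filter in its own function keeps the quoting in one place and lets the filter be inspected without calling the logging API.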

@wojtek-t
Member

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 27, 2018
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mborsz, wojtek-t

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 27, 2018
@fejta-bot

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to fejta).

Review the full test history for this PR.

Silence the bot with an /lgtm cancel comment for consistent failures.

@k8s-ci-robot k8s-ci-robot merged commit d460cb2 into kubernetes:master Nov 30, 2018
@mikedanese
Member

We started to see LogDump failures in GKE when this merged:

https://gubernator.k8s.io/build/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gke/34578

Could this be related?

@mborsz
Member Author

mborsz commented Nov 30, 2018

It is related. The log says:

W1130 16:04:52.986] Zone: us-central1-f
W1130 16:04:54.206] INSTANCE_GROUPS=
W1130 16:04:54.206] NODE_NAMES=
W1130 16:04:54.206] ./cluster/log-dump/log-dump.sh: line 438: INSTANCE_GROUPS[@]: unbound variable

Line 438 was added in this PR. I will revert it and investigate the problem next week.
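
The `unbound variable` error in the log is consistent with expanding an empty array under `set -o nounset`, which Bash versions before 4.4 treat as a reference to an unset variable. Below is a minimal sketch of one way to guard against that, assuming this is the cause. It is not necessarily the check used in the later roll-forward, and `check_group` and `check_all_groups` are hypothetical names.

```shell
#!/usr/bin/env bash
set -o nounset

# Hypothetical stand-in for the per-group activity-log check.
check_group() {
  echo "checking ${1}"
}

# Loops over the global INSTANCE_GROUPS array, returning early when it is
# empty. The ":-" default makes the expansion safe under nounset even on
# Bash versions older than 4.4.
check_all_groups() {
  if [[ -z "${INSTANCE_GROUPS[*]:-}" ]]; then
    echo "no instance groups found, skipping"
    return 0
  fi
  local group
  for group in "${INSTANCE_GROUPS[@]}"; do
    check_group "${group}"
  done
}

# Mirrors the failing run, where INSTANCE_GROUPS was empty.
INSTANCE_GROUPS=()
check_all_groups
```

The guard runs before the loop, so the `"${INSTANCE_GROUPS[@]}"` expansion is reached only when the array has at least one element.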

k8s-ci-robot added a commit that referenced this pull request Dec 18, 2018
Roll forward #71456 with a check whether INSTANCE_GROUPS is empty.
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/flake Categorizes issue or PR as related to a flaky test. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
5 participants