Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detail how to debug CI failures as a contributor #3143

Merged
merged 8 commits into from Feb 22, 2019
Merged

Detail how to debug CI failures as a contributor #3143

merged 8 commits into from Feb 22, 2019

Conversation

mariantalla
Copy link
Contributor

@mariantalla mariantalla commented Jan 23, 2019

This PR adds more detail on what contributors can do to debug CI failures and flakes on their contributions.

Tasks:

  • What to do if failure is not related to the changes of the PR
    • Triaging failure
    • Opening an issue
    • Notifying SIG

Is it worth, as part of this PR:

  • Including an overview of testgrid? Other tooling?

Related:

Not complete yet, just outline.
@mariantalla mariantalla changed the title WIP - Detail how to debug CI failure as a contributor [WIP] Detail how to debug CI failure as a contributor Jan 23, 2019
@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 23, 2019
@k8s-ci-robot k8s-ci-robot added area/developer-guide Issues or PRs related to the developer guide cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Jan 24, 2019
@mariantalla
Copy link
Contributor Author

@tpepper @nikhita @guineveresaenger @parispittman

I've added some documentation for "What do I do when my PR fails CI tests?". How does this align with what you wanted to include as part of #1537 ?

@spiffxp @BenTheElder: A lot of this content comes from the testing.md file that kubernetes/sig-release#428 aims to remove - mostly stuff that I thought were useful to contributors in general (as opposed to specific CI-related roles) but would love to hear your feedback.

@BenTheElder
Copy link
Member

thanks @mariantalla -- I'm pretty swamped at the moment, mind pinging me sometime next week to revisit? I would love to help add some details 😅

Copy link
Member

@neolit123 neolit123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the writeup @mariantalla 👍
i've added some minor comments.

- If yes, comment on it and link your PR, the failed run that affected you and any other information you think might be relevant
- If no, open a new issue and notify the appropriate SIG (see: SIG test escalation)

#### SIG test escalation
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would title this:

Escalating failures to a SIG

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, changed it!


#### SIG test escalation
- Figure out corresponding sig from test name/description
- Mention the sig's github handle on the issue, optionally cc the SIG's chair(s) (locate them under kubernetes/community/sig-<name>)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some capitalizations:

sig -> SIG
github -> GitHub
cc -> CC

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, capitalized sig and github except where (I thought) it made sense to keep lower-case (e.g. in urls)

I changed cc to code format (cc) rather than capitalizing with the logic that it's sort of a command within a GitHub conversation... no strong opinions though. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good 👍

@@ -225,3 +225,40 @@ version and the watch cache test is skipped.
## End-to-End tests

Please refer to [End-to-End Testing in Kubernetes](e2e-tests.md).

## Running your contribution in the Kubernetes CI
Once you open a PR, prow will run pre-submit tests in CI.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly mention /ok-to-test and /retest and/or link to some of the PR / test related bot commands?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

### Troubleshooting a failure
Click on `Details` and look at the [`gubernator`](gubernator.k8s.io/) output for the test.

#### Troubleshooting failures/flakes that are not caused by your change
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

possibly unify the two sections?

we can explain that if the gubernator output seems unrelated to the change they might try to first call /retest them self. but if a test continues to fail, to contact a reviewer/maintainer.

usually what happens is that contributors first ping in the PR and if nobody responds they go asking on slack.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, we unified them under a single title and structured them a bit more.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 5, 2019
Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Feb 14, 2019
@mariantalla mariantalla changed the title [WIP] Detail how to debug CI failure as a contributor Detail how to debug CI failures as a contributor Feb 19, 2019
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 19, 2019
@mariantalla
Copy link
Contributor Author

/cc @spiffxp @fejta @stevekuznetsov @timothysc @tpepper

(sig-testing-leads and #1537 creator)

Copy link
Member

@spiffxp spiffxp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few comments

- preferably as a comment on your PR, by tagging the [team][k-teams] (for example a [reviewers team for the SIG][k-teams-review])
- if you don't get a response in 24h, engage with the SIG on their channel on the Kubernetes slack and/or attend one of the [SIG meetings][sig-meetings] to ask for input.

[prow-doc]: https://kubernetes.io/blog/2018/08/29/the-machines-can-do-the-work-a-story-of-kubernetes-testing-ci-and-automating-the-contributor-experience/#enter-prow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect this to link to https://git.k8s.io/test-infra/prow or https://prow.k8s.io

If you want to link to the blog post, maybe a "see this blog post for more information on all of the automation involved in testing your PR" or something

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, changed it!

Find out more about [other commands][prow-cmds] you can use to interact with prow through GitHub comments.

### Troubleshooting a failure
Click on `Details` and look at the [`gubernator`](gubernator.k8s.io/) output for the test.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reword this as we are looking to move away from gubernator. It also doesn't make much sense to a new contributor. What am I going here to look for? Artifacts produced by the test and cluster under test, such as:

  • test results
  • more metadata about the tests (what versions of things were used? how long did they take?)
  • output from tests that have failed
  • build log showing the full test run
  • logs from the cluster under test (k8s components such as kublet and apiserver, possibly other logs such as etcd and kernel)
  • junit xml files
  • coverage files
  • etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, removed the reference to gubernator and added specifics on what specifically they can find under Details that is useful to debugging. I haven't gone into a lot of detail as to how to read through logs etc (I worry it might get too tiresome for folks to read & follow), any opinions on this?

- Run [`/retest`][retest] on your PR to re-trigger the tests

- Is it a failure that shouldn't be happening (in other words, is the test now wrong)
- Get in touch with the SIG
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which SIG?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think the "sig-on-PR-label" is probably the best bet - is that right?

Changed the text to say that for now.

- Is it a flake?
- Check if an issue has already been opened for that flake
- If not, open a new one (like [this example][new-issue-example]) and [label it `kind/flake`][kind/flake]
- Run [`/retest`][retest] on your PR to re-trigger the tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it is a flake, and there is an issue for it, is there anything else I can do to aid with troubleshooting beyond running /retest on my PR?

eg: I personally use N=1 to say fine it's a flake, N=2 to spam retest, N=3 to say OK what can I do to help identify this flake and make it go away. Sometimes that's looking at the actual failure and seeing if it's the same failure each time it happens. Maybe it's looking back to see when this flake started to happen, and possibly identify the window of commits that should be explored further. etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, good point! I added a small section with pointers to what contributors can do - similar to working through artifacts I was cautious of being too prescriptive with a course of action that changes a lot depending on the context of the specific flake occurrence. Happy to add more though if you feel this is the right place to put more information.


- Is it a failure that shouldn't be happening (in other words, is the test now wrong)
- Get in touch with the SIG
- preferably as a comment on your PR, by tagging the [team][k-teams] (for example a [reviewers team for the SIG][k-teams-review])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anecdotally, does the use of these teams work as described here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, interesting. @hoegaarden and I have had mixed experiences so far, i.e. it's not much overhead & it hasn't hurt to tag reviewers, it sometimes helped (but sometimes made no significant difference). Not sure how representative this is though; would you recommend a different/additional suggestion here? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no particular suggestion, I wish we had ways of measuring efficacy here

@fejta
Copy link
Contributor

fejta commented Feb 20, 2019

/uncc
Looks great, looks like lots of activity. please add me back if I'm needed here!

@k8s-ci-robot k8s-ci-robot removed the request for review from fejta February 20, 2019 19:43
Stop mentioning gubernator, add pointers on how to help with flake
resolution as a contributor.

Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
@spiffxp
Copy link
Member

spiffxp commented Feb 22, 2019

/approve
/lgtm
Thanks for making this better

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 22, 2019
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mariantalla, spiffxp

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 22, 2019
@k8s-ci-robot k8s-ci-robot merged commit 7efa062 into kubernetes:master Feb 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/developer-guide Issues or PRs related to the developer guide cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants