New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Detail how to debug CI failures as a contributor #3143
Detail how to debug CI failures as a contributor #3143
Conversation
Not complete yet, just outline.
@tpepper @nikhita @guineveresaenger @parispittman I've added some documentation for "What do I do when my PR fails CI tests?". How does this align with what you wanted to include as part of #1537 ? @spiffxp @BenTheElder: A lot of this content comes from the |
thanks @mariantalla -- I'm pretty swamped at the moment, mind pinging me sometime next week to revisit? I would love to help add some details 😅 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the writeup @mariantalla 👍
i've added some minor comments.
contributors/devel/testing.md
Outdated
- If yes, comment on it and link your PR, the failed run that affected you and any other information you think might be relevant | ||
- If no, open a new issue and notify the appropriate SIG (see: SIG test escalation) | ||
|
||
#### SIG test escalation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would title this:
Escalating failures to a SIG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, changed it!
contributors/devel/testing.md
Outdated
|
||
#### SIG test escalation | ||
- Figure out corresponding sig from test name/description | ||
- Mention the sig's github handle on the issue, optionally cc the SIG's chair(s) (locate them under kubernetes/community/sig-<name>) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some capitalizations:
sig
-> SIG
github
-> GitHub
cc
-> CC
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, capitalized sig and github except where (I thought) it made sense to keep lower-case (e.g. in urls)
I changed cc to code format (cc
) rather than capitalizing with the logic that it's sort of a command within a GitHub conversation... no strong opinions though. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good 👍
contributors/devel/testing.md
Outdated
@@ -225,3 +225,40 @@ version and the watch cache test is skipped. | |||
## End-to-End tests | |||
|
|||
Please refer to [End-to-End Testing in Kubernetes](e2e-tests.md). | |||
|
|||
## Running your contribution in the Kubernetes CI | |||
Once you open a PR, prow will run pre-submit tests in CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly mention /ok-to-test
and /retest
and/or link to some of the PR / test related bot commands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, thanks!
contributors/devel/testing.md
Outdated
### Troubleshooting a failure | ||
Click on `Details` and look at the [`gubernator`](gubernator.k8s.io/) output for the test. | ||
|
||
#### Troubleshooting failures/flakes that are not caused by your change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possibly unify the two sections?
we can explain that if the gubernator output seems unrelated to the change they might try to first call /retest
them self. but if a test continues to fail, to contact a reviewer/maintainer.
usually what happens is that contributors first ping in the PR and if nobody responds they go asking on slack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, we unified them under a single title and structured them a bit more.
Opt for active voicing for better readability.
Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
/cc @spiffxp @fejta @stevekuznetsov @timothysc @tpepper ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments
- preferably as a comment on your PR, by tagging the [team][k-teams] (for example a [reviewers team for the SIG][k-teams-review]) | ||
- if you don't get a response in 24h, engage with the SIG on their channel on the Kubernetes slack and/or attend one of the [SIG meetings][sig-meetings] to ask for input. | ||
|
||
[prow-doc]: https://kubernetes.io/blog/2018/08/29/the-machines-can-do-the-work-a-story-of-kubernetes-testing-ci-and-automating-the-contributor-experience/#enter-prow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would expect this to link to https://git.k8s.io/test-infra/prow or https://prow.k8s.io
If you want to link to the blog post, maybe a "see this blog post for more information on all of the automation involved in testing your PR" or something
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, changed it!
Find out more about [other commands][prow-cmds] you can use to interact with prow through GitHub comments. | ||
|
||
### Troubleshooting a failure | ||
Click on `Details` and look at the [`gubernator`](gubernator.k8s.io/) output for the test. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reword this as we are looking to move away from gubernator. It also doesn't make much sense to a new contributor. What am I going here to look for? Artifacts produced by the test and cluster under test, such as:
- test results
- more metadata about the tests (what versions of things were used? how long did they take?)
- output from tests that have failed
- build log showing the full test run
- logs from the cluster under test (k8s components such as kublet and apiserver, possibly other logs such as etcd and kernel)
- junit xml files
- coverage files
- etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, removed the reference to gubernator and added specifics on what specifically they can find under Details
that is useful to debugging. I haven't gone into a lot of detail as to how to read through logs etc (I worry it might get too tiresome for folks to read & follow), any opinions on this?
- Run [`/retest`][retest] on your PR to re-trigger the tests | ||
|
||
- Is it a failure that shouldn't be happening (in other words, is the test now wrong) | ||
- Get in touch with the SIG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which SIG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I think the "sig-on-PR-label" is probably the best bet - is that right?
Changed the text to say that for now.
- Is it a flake? | ||
- Check if an issue has already been opened for that flake | ||
- If not, open a new one (like [this example][new-issue-example]) and [label it `kind/flake`][kind/flake] | ||
- Run [`/retest`][retest] on your PR to re-trigger the tests |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is a flake, and there is an issue for it, is there anything else I can do to aid with troubleshooting beyond running /retest
on my PR?
eg: I personally use N=1 to say fine it's a flake, N=2 to spam retest, N=3 to say OK what can I do to help identify this flake and make it go away. Sometimes that's looking at the actual failure and seeing if it's the same failure each time it happens. Maybe it's looking back to see when this flake started to happen, and possibly identify the window of commits that should be explored further. etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, good point! I added a small section with pointers to what contributors can do - similar to working through artifacts I was cautious of being too prescriptive with a course of action that changes a lot depending on the context of the specific flake occurrence. Happy to add more though if you feel this is the right place to put more information.
|
||
- Is it a failure that shouldn't be happening (in other words, is the test now wrong) | ||
- Get in touch with the SIG | ||
- preferably as a comment on your PR, by tagging the [team][k-teams] (for example a [reviewers team for the SIG][k-teams-review]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anecdotally, does the use of these teams work as described here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, interesting. @hoegaarden and I have had mixed experiences so far, i.e. it's not much overhead & it hasn't hurt to tag reviewers, it sometimes helped (but sometimes made no significant difference). Not sure how representative this is though; would you recommend a different/additional suggestion here? 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no particular suggestion, I wish we had ways of measuring efficacy here
/uncc |
Stop mentioning gubernator, add pointers on how to help with flake resolution as a contributor. Signed-off-by: Maria Ntalla <mntalla@pivotal.io>
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mariantalla, spiffxp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
This PR adds more detail on what contributors can do to debug CI failures and flakes on their contributions.
Tasks:
Is it worth, as part of this PR:
Related: