build dashboard triage log #52653

Closed
bcmills opened this issue May 2, 2022 · 110 comments
Labels: Builders, FrozenDueToAge, NeedsInvestigation, umbrella


bcmills commented May 2, 2022

We've been publishing minutes for various recurring discussions (proposal review, Go 2 review, compiler & runtime meeting notes). This issue is an attempt to apply the same pattern for builder triage.

We'll add a post here for the commands run throughout the week to triage failures on the Go build dashboard (https://build.golang.org).

For each day's triage, I first run fetchlogs to fetch the previous day's logs (and then some, because fetchlogs doesn't yet have a date flag). Then, I use greplogs to identify failures since the previous run, excluding known-bad commits and known-flaky builders.

greplogs --triage outputs Markdown containing GitHub task lists. Entries that have been triaged will be checked off in the corresponding post.

The commands to perform a typical triage run look like:

$ fetchlogs -branch=release-branch.go1.19,release-branch.go1.18
$ fetchlogs -n 1024 -repo all
$ greplogs --triage --since=$LAST_TRIAGE_DATE

fetchlogs may take several minutes to finish; greplogs should be faster.

If the greplogs output has too much noise (for example, because of a large build break or a malfunctioning builder), use the --omit, --since, and/or --before flags to prune it down. When you've got it down to a manageable size, paste the Markdown output from greplogs into a new comment on this issue.

Then, check off the failures from the list as you triage them. (It's ok to leave entries unchecked if you haven't gotten to them yet, but try to finish the last run before moving on to a new one.)
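
As a rough sketch of the run described above, the three commands can be wrapped in a small helper that prints them for a given date. The function name triage_commands is hypothetical, and the sketch only echoes the commands from this post rather than running them:

```sh
# Print (rather than run) the commands for one triage pass.
# The argument plays the role of $LAST_TRIAGE_DATE from the post above.
# triage_commands is a hypothetical name, not part of fetchlogs or greplogs.
triage_commands() {
    last_triage_date=$1
    echo "fetchlogs -branch=release-branch.go1.19,release-branch.go1.18"
    echo "fetchlogs -n 1024 -repo all"
    echo "greplogs --triage --since=${last_triage_date}"
}

# Example: show the commands for a run that last triaged on 2022-05-02.
triage_commands 2022-05-02
```

Printing first makes it easy to review the date before piping the output to a shell.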

bcmills added the Builders and umbrella labels May 2, 2022
bcmills added this to the Unreleased milestone May 2, 2022


bcmills commented May 2, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|2278a51\|3ce203d\|e7c56fe\|a5dd684\|e7b0559 --since=2022-04-29 --before=2022-05-02

(29 matching logs)
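
The --omit value in the command above reads as a single alternation of builder names and commit-hash prefixes; the backslash before each | only keeps the shell from treating it as a pipe. A minimal sketch for building that value from a list, assuming a hypothetical helper named omit_pattern:

```sh
# Join all arguments with "|" to form one regular-expression alternation.
# omit_pattern is a hypothetical helper, not part of greplogs.
# The function body runs in a subshell so the IFS change does not leak out.
omit_pattern() (
    IFS='|'
    printf '%s\n' "$*"
)

# Prints: loong64|plan9|linux-amd64-unified|2278a51
omit_pattern loong64 plan9 linux-amd64-unified 2278a51
```

The result can then be passed in quotes, for example --omit="$(omit_pattern loong64 plan9)", which avoids escaping each | by hand.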


bcmills commented May 2, 2022

Notes for today:


bcmills commented May 2, 2022

(The above commands are using a greplogs patched for triage: patches are in aclements/go-misc#11.)


bcmills commented May 3, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-386\|plan9-amd64\|linux-amd64-unified\|freebsd-arm-paulzhol --since=2022-05-02 --before=2022-05-03 --details

(112 matching logs)

Lots of fallout from #52666. I'll notch that out by omitting the affected x/sys runs.


bcmills commented May 3, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-386\|plan9-amd64\|linux-amd64-unified\|freebsd-arm-paulzhol\|b6088cc --since=2022-05-02 --before=2022-05-03

(48 matching logs)


bcmills commented May 3, 2022

Notes for today:


bcmills commented May 3, 2022

I see some major breakage on the dashboard for tomorrow, so I'm going to go ahead and run the logs from just before that.
(The tree is broken at CL 353989 and mostly-fixed at CL 397018, save for a few builders with remaining failures.)


bcmills commented May 3, 2022

Notes:


bcmills commented May 4, 2022

Notes:


bcmills commented May 5, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-04 --before=2022-05-05 --details

(601 matching logs)

Clearly I need to notch out some of yesterday's breakage. 😅


bcmills commented May 5, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|a5481fb\|78819d0\|f52b4ec --since=2022-05-04 --before=2022-05-05

(52 matching logs)

That's better, but still in bad shape. 😞


bcmills commented May 5, 2022

Notes:


bcmills commented May 6, 2022

Another rough day for the builders.

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-05 --before=2022-05-06 --details

(348 matching logs)


bcmills commented May 6, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|5073c1c\|a7ab208\|983906f\|bb1f441\|7c74b0d --since=2022-05-05 --before=2022-05-06

(39 matching logs)


bcmills commented May 6, 2022

Notes:


bcmills commented May 9, 2022

greplogs --triage -l -E . --omit=loong64\|plan9\|linux-amd64-unified\|linux-amd64-boringcrypto\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-06 --before=2022-05-09

(57 matching logs)


bcmills commented May 9, 2022

Notes:

heschi added the NeedsInvestigation label May 9, 2022

bcmills commented May 10, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing --since=2022-05-09 --before=2022-05-09T16:02:00

(4 matching logs)

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|ppc64\|riscv64\|darwin-arm64 --since=2022-05-09T16:02:00 --before=2022-05-10

(18 matching logs)
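
The two commands above split one day at 2022-05-09T16:02:00 so that each half can carry its own --omit list. A small sketch of that split, assuming a hypothetical helper named split_window; it only prints the --since/--before pairs and does not invoke greplogs:

```sh
# Print the --since/--before flags for the two halves of a time window.
# Arguments: window start, cut point, window end.
# split_window is a hypothetical helper, not part of greplogs.
split_window() {
    since=$1
    cut=$2
    before=$3
    echo "--since=${since} --before=${cut}"
    echo "--since=${cut} --before=${before}"
}

# Example: the split used in this comment.
split_window 2022-05-09 2022-05-09T16:02:00 2022-05-10
```

Each printed line can be appended to a greplogs --triage command with the --omit pattern appropriate to that half of the window.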


bcmills commented May 10, 2022

Notes:


bcmills commented May 11, 2022

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|ppc64\|riscv64 --since=2022-05-10 --before=2022-05-10T17:20:00

(14 matching logs)

greplogs --triage -l -E . --omit=loong64\|plan9-\(386\|amd64\)\|linux-amd64-unified\|freebsd-arm-paulzhol\|openbsd-mips64-jsing\|riscv64 --since=2022-05-10T17:20:00 --before=2022-05-11

(13 matching logs)


bcmills commented May 11, 2022

Notes:

  • The solaris-amd64-oraclerel builder is catching a lot of platform-independent invalid assumptions about timeouts in tests, but the high rate of other failures on that builder (such as #51443, "cmd/compile,runtime: frequent test timeouts on solaris-amd64-oraclerel") makes it tedious to diagnose. @rorth, I'm going to skip triaging that builder (and consider the port broken) until some of these issues are resolved.
  • ppc64 and ppc64le are fixed as of CL 405116; riscv64 is still broken at head.


toothrot commented Jul 20, 2022


cherrymui commented Jul 27, 2022

greplogs --triage --since=2022-07-20


cherrymui commented Jul 27, 2022


dmitshur commented Aug 5, 2022

greplogs --triage --since=2022-07-27

(127 matching logs)


findleyr commented Aug 5, 2022

In the last triage batch, there were a couple of duplicate issues filed for gopls flakes that had already been fixed (with closed issues).

That's fine, I don't mind de-duping, but is there a way that I can preempt the triage process by making sure the existing issues are associated with the flakes? Perhaps a label I can add, or a particular format to the issue I create?


heschi commented Aug 19, 2022

greplogs --triage --since=2022-08-04 --omit dragonfly-amd64-622 --omit android-arm.\*-corellium --omit linux-ppc64le-.\* --omit openbsd-arm.\*-jsing --omit linux-amd64-alpine --omit 40e737f\|04bbc27\|12ff722 --known-issue=#54416=TestTestConn/UnixPipe '--known-issue=#54553=lock ordering' --known-issue=#54503=gopkg.in '--known-issue=#54555=connections still open after closing DB' --known-issue=#53456=TestDebugLines --known-issue=#51323=INTERNAL_ERROR --known-issue=#29951=TestNewIntAllocs --known-issue=#38111=TestLookup --known-issue=#53397=issue52788.go --known-issue=#54337=TestAppendOfMake --known-issue=#53702=issue53702.go --known-issue=#54557=wycheproof --known-issue=#22857=TestLookupLongTXT --known-issue=#54458=TestCgoTraceParser '--known-issue=#54411=newstack at runtime/internal/atomic' --omit js-wasm '--known-issue=#53722=Get "https://proxy.golang.com.cn.*lookup'
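
A command line this long is easier to maintain when the known issues live in a list of ISSUE=REGEXP pairs. The sketch below assembles the arguments from such a list and prints them one per line as a dry run. The helper name triage_args and the one-pair-per-line input format are assumptions for illustration; only the greplogs flags shown in this thread are used. Keeping the # inside the word (--known-issue=#NNN=...) also stops the shell from reading it as the start of a comment.

```sh
# Read ISSUE=REGEXP pairs from standard input, one per line, and print the
# resulting greplogs arguments one per line (a dry run; nothing is executed).
# triage_args is a hypothetical helper, not part of greplogs.
triage_args() {
    since=$1
    set -- greplogs --triage "--since=${since}"
    while IFS= read -r pair; do
        # Skip blank lines in the input.
        [ -n "$pair" ] || continue
        # Each argument stays a single word, even if the regexp has spaces.
        set -- "$@" "--known-issue=#${pair}"
    done
    printf '%s\n' "$@"
}

# Example with two of the pairs from the command above.
printf '%s\n' '54553=lock ordering' '54503=gopkg.in' | triage_args 2022-08-04
```

Because the arguments are accumulated with set --, a pattern containing spaces such as "lock ordering" stays one argument without manual quoting.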