x/build: frequent timeouts running js-wasm TryBots #35170

Open
bcmills opened this issue Oct 26, 2019 · 16 comments
Comments

@bcmills (Member) commented Oct 26, 2019

I've noticed that the js-wasm TryBots on my recent CLs seem to start with one or more spurious failures before finally running successfully.

I suspect that this increases load on the TryBots generally, and may increase overall TryBot latency in some circumstances.

A log from a recent example is here (https://farmer.golang.org/temporarylogs?name=js-wasm&rev=9516a47489b92b64e9cf5633e3ab7e37def3353f&st=0xc00ab39b80):

  builder: js-wasm
      rev: 9516a47489b92b64e9cf5633e3ab7e37def3353f
 buildlet: (nil *buildlet.Client)
  started: 2019-10-26 01:37:51.137407213 +0000 UTC m=+287652.939853079
    ended: 2019-10-26 01:38:21.20458221 +0000 UTC m=+287683.007028020
  success: false

Events:
  2019-10-26T01:37:51Z ask_maintner_has_ancestor 
  2019-10-26T01:38:21Z finish_ask_maintner_has_ancestor after 30s; err=context deadline exceeded

Build log:
Error checking whether commit 9516a47489b92b64e9cf5633e3ab7e37def3353f includes ancestor 3dced519cbabc213df369d9112206986e62687fa: context deadline exceeded


Error: context deadline exceeded

CC @dmitshur @toothrot @bradfitz

@agnivade (Contributor) commented Oct 26, 2019

One change was that the builder was recently bumped to Node 13.

@bradfitz (Contributor) commented Oct 27, 2019

I think that's unrelated. When I've seen this, the error is also as Bryan quoted above: a timeout doing a gRPC call to maintner, asking whether we should even do this build. We used to need to skip js/wasm builds when the git history of the rev being built didn't include js/wasm support. But now that js/wasm has been supported for a number of releases (since Go 1.11), and we no longer support Go 1.10 (or even Go 1.11), we can just remove that condition on that builder.

But also: maintner shouldn't be so slow at that query.
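
For context, the check that was timing out has roughly this shape: before scheduling a js-wasm build, the coordinator asks maintner over gRPC whether the revision has the commit that introduced js/wasm support as an ancestor, under a 30-second deadline. The sketch below only illustrates that flow and is not the coordinator's actual code; the client interface and function names are invented for the example.

    package buildsched

    import (
        "context"
        "fmt"
        "time"
    )

    // maintnerClient is a hypothetical stand-in for the maintner gRPC client:
    // it answers "does commit have ancestor as an ancestor?".
    type maintnerClient interface {
        HasAncestor(ctx context.Context, commit, ancestor string) (bool, error)
    }

    // shouldBuildJSWasm sketches the pre-build check described above. The RPC
    // runs under a 30-second deadline; when maintner is slow, the call fails
    // with "context deadline exceeded", which is the error seen in the log.
    func shouldBuildJSWasm(mc maintnerClient, rev string) (bool, error) {
        // Commit that first added js/wasm support (from the log above).
        const jsWasmSupportCommit = "3dced519cbabc213df369d9112206986e62687fa"

        ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
        defer cancel()

        ok, err := mc.HasAncestor(ctx, rev, jsWasmSupportCommit)
        if err != nil {
            return false, fmt.Errorf("error checking whether commit %s includes ancestor %s: %w",
                rev, jsWasmSupportCommit, err)
        }
        return ok, nil
    }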

@gopherbot commented Oct 27, 2019

Change https://golang.org/cl/203757 mentions this issue: dashboard: stop checking whether the branch/rev supports js/wasm for js/wasm

gopherbot pushed a commit to golang/build that referenced this issue Oct 28, 2019
dashboard: stop checking whether the branch/rev supports js/wasm for js/wasm

It's been supported for ages. We no longer support or build branches
that didn't have js/wasm support.

See golang/go#35170 (comment)

Updates golang/go#35170 (still need a maintner fix, ideally)

Change-Id: If5489dd266dea9a3a48bbc29f83528c0f939d23b
Reviewed-on: https://go-review.googlesource.com/c/build/+/203757
Reviewed-by: Bryan C. Mills <bcmills@google.com>

@bradfitz (Contributor) commented Oct 28, 2019

Not very scientific, but the has-ancestor RPCs are taking only ~100ms for me (including TCP+TLS setup, etc.), nowhere near 30 seconds.

$ time maintq has-ancestor 0ae9389609f23dc905c58fc2ad7bcc16b770f337 3dced519cbabc213df369d9112206986e62687fa
has_ancestor:true

real    0m0.105s
user    0m0.076s
sys     0m0.028s

codebien added a commit to codebien/build that referenced this issue Nov 13, 2019

@agnivade (Contributor) commented May 8, 2020

@dmitshur - Would you be able to check if this is still happening?

@neelance (Member) commented Oct 28, 2020

@dmitshur Do you know if this is still an issue?

@dmitshur (Contributor) commented Oct 28, 2020

I feel like I've seen some recent discussion about this still happening occasionally, more than the comments in this issue would suggest. Perhaps it was in another thread. I'll keep an eye out for this.

@rsc (Contributor) commented Oct 29, 2020

Yes, it is still happening. Last week I caught the js-wasm trybot taking 25 minutes when most were done in 15, and just now I caught it taking 34 minutes when everything else was done in 16. That one was https://go-review.googlesource.com/c/go/+/266357, and I gave up waiting and just submitted the CL.

Any advice about what to do when js-wasm looks like it is stuck/very slow and how to debug further would be greatly appreciated.

@dmitshur (Contributor) commented Oct 29, 2020

Any advice about what to do when js-wasm looks like it is stuck/very slow and how to debug further would be greatly appreciated.

I suggest trying to determine whether there is a specific place where the build progresses more slowly than the rest. For example, is the time to the first test delayed and then everything runs quickly, is one package's tests very slow while the rest are fast, or is the slowness spread evenly across all packages?

I'll also try to look at the logs for the two occurrences you've shared and see if I can determine the cause from them.

@dmitshur (Contributor) commented Oct 29, 2020

From CL 266374 where I recently started trybots:

[screenshot: trybot status for CL 266374]

In this instance, the js-wasm builder is in waiting_for_machine state for over 10 minutes. That is very unexpected—it's a GCP builder that we should be able to spin up without running out of physical machines. Perhaps we're running out of some quota. Not sure why only js-wasm and not other GCP builders. But this is a lead.

Edit: Here's the scheduler state at the time:

[screenshot: scheduler state]

Issues #31282, #36088 may be related.

@bcmills (Member, Author) commented Oct 30, 2020

In several occurrences that I saw, the js-wasm TryBot started running and partially completed, then was descheduled and started over from the beginning. I think some sort of failure is being buried by retry logic, rather than this just being a slow test.

If we don't have the bandwidth to diagnose this at the moment, perhaps we should temporarily demote js-wasm to an opt-in SlowBot?

@thanm (Contributor) commented Oct 30, 2020

I'm also experiencing js-wasm slowdowns for my trybot runs. Here is an example:

[screenshot from 2020-10-30 13-18-40: trybot status]

Log (https://farmer.golang.org/temporarylogs?name=js-wasm&rev=ff065bd219cd7a9df01466e338b28f1891671da3&st=0xc0184e91e0):

  builder: js-wasm
      rev: ff065bd219cd7a9df01466e338b28f1891671da3
 buildlet: (nil *buildlet.Client)
  started: 2020-10-30 16:46:48.188839196 +0000 UTC m=+727935.299326955
   status: still running

Events:
  2020-10-30T16:46:48Z checking_for_snapshot 
  2020-10-30T16:46:48Z finish_checking_for_snapshot after 23.9ms
  2020-10-30T16:46:48Z get_buildlet 
 +2093.2s (now)

Build log:

(buildlet still starting; no live streaming. reload manually to see status)

Seems as though it is not getting past "get buildlet".

[Edit: did finally make progress after 40 min wait.]

@neelance (Member) commented Oct 30, 2020

I don't know much about the farmer, but on https://farmer.golang.org/#sched it looks as if multiple trybot runs are waiting on the js-wasm tests, yet I can see only a few buildlets with the prefix GCE VM: buildlet-js-wasm-. Is something preventing these buildlets from spawning?

@dmitshur (Contributor) commented Oct 30, 2020

Is something preventing these buildlets from spawning?

It seems so. I don't yet have an idea of what it might be. It may be related to #42285. I'm going to look more later today.

@dmitshur (Contributor) commented Nov 30, 2020

I extracted some raw data from some Wasm trybot runs, which I was going to analyze further and post here, but that work got preempted by other work, and by now I've lost the data due to a forced browser update.

From memory, the high-level summary was that the Wasm trybot runs are generally quick, with two components that contribute a fairly significant proportion of the overall time:

  • test directory
  • reboot test

There's a commonly used dist-test adjust policy, fasterTrybots, that skips precisely those two components during pre-submit tests (but not during post-submit tests), which is one available option for speeding up the Wasm trybots.
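
A dist-test adjust policy of that kind might look roughly like the sketch below; the signature and names are approximate rather than copied from x/build/dashboard:

    package dashboard

    import "strings"

    // fasterTrybotsSketch mirrors the policy described above: it is consulted
    // per dist test, and on pre-submit (trybot) runs it skips the "test:"
    // directory shards and the "reboot" test, while post-submit builds keep
    // running everything.
    func fasterTrybotsSketch(run bool, distTest string, isNormalTry bool) bool {
        if isNormalTry {
            if strings.HasPrefix(distTest, "test:") || distTest == "reboot" {
                return false // skip on trybots to save time
            }
        }
        return run
    }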

However, it seems the root problem here was with scheduling the Wasm builders and starting the trybot runs, not with the length of trybot test execution itself under a typical "happy" run. There are many possible places to look next. As @bcmills also pointed out in #42699, there is a problem of stalls that sometimes result in extended test executions instead of what should be quick failures. Stalls could be due to something Wasm-specific (e.g., #31282) or something more general. We also have some possible blind spots on the coordinator side (e.g., #39349, #39665). As I recently experienced during #42379 (comment), it's possible for stalls to happen when test execution is sharded and not happen during a sequential all.bash run.

All this is to say that more work needs to be done to get closer to finding and resolving the root issue(s) here. I think we have confirmation that this problem is intermittent, so there is no deterministically slow test we could simply skip to resolve it.

@gopherbot commented Sep 2, 2021

Change https://golang.org/cl/347295 mentions this issue: dashboard: disable reboot test on js-wasm builder

gopherbot pushed a commit to golang/build that referenced this issue Sep 2, 2021
The reboot test verifies that the modified version of Go can be used
for bootstrapping purposes. The js-wasm builder is configured with
GOHOSTOS=linux and GOHOSTARCH=amd64 so that it builds packages and
commands for host, linux/amd64, which is used during bootstrapping
in the reboot test. Since this doesn't provide additional coverage
beyond what we get from linux-amd64's execution of the reboot test,
disable this test on the js-wasm builder.

Also remove some accidental duplicate code.

For golang/go#35170.

Change-Id: Ic770d04f6f5adee7cb836b1a8440dc818fdc5e5b
Reviewed-on: https://go-review.googlesource.com/c/build/+/347295
Trust: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Bryan C. Mills <bcmills@google.com>
Reviewed-by: Alexander Rakoczy <alex@golang.org>
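
To illustrate the shape of that change, here is a heavily simplified sketch; the struct is a made-up stand-in for the real builder configuration type in x/build (which has many more fields), and the GOOS/GOARCH values are assumed:

    package dashboard

    // buildConfigSketch is a hypothetical, pared-down builder configuration.
    type buildConfigSketch struct {
        Name           string
        env            []string
        distTestAdjust func(run bool, distTest string, isNormalTry bool) bool
    }

    // jsWasmBuilderSketch captures the idea in the commit message above: the
    // builder cross-builds host tools for linux/amd64 (GOHOSTOS/GOHOSTARCH),
    // so its reboot dist test duplicates linux-amd64's coverage and is skipped.
    var jsWasmBuilderSketch = buildConfigSketch{
        Name: "js-wasm",
        env:  []string{"GOOS=js", "GOARCH=wasm", "GOHOSTOS=linux", "GOHOSTARCH=amd64"},
        distTestAdjust: func(run bool, distTest string, isNormalTry bool) bool {
            if distTest == "reboot" {
                return false // bootstrapping is already exercised by linux-amd64
            }
            return run
        },
    }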