
x/tools/internal/lsp: frequent out-of-memory errors on windows-amd64-race builder #33951

Open
bcmills opened this issue Aug 29, 2019 · 11 comments

Comments

Member
@bcmills bcmills commented Aug 29, 2019

The tests under x/tools/internal/lsp flake frequently on the windows-amd64-race builder with out-of-memory errors:
https://build.golang.org/log/23e29f4766fa39ac7b7945a729cec7d6ed82a41d
https://build.golang.org/log/086b928e2af996b8955666acce9c2c36ff7f82e8
https://build.golang.org/log/1c6543c8679c8710e897463a62921a3bddfa9a2e

The flakiness of these tests makes it too easy to ignore failures and miss real bugs (such as #31749) detected by the same builder:
https://build.golang.org/log/2c3d2505bf5145815ce4fc57a449c39a46a32cf2

See also #30309, #32834.

CC @ianthehat @stamblerre

Member Author
@bcmills bcmills commented Aug 30, 2019

I'm not sure whether this problem is getting worse, or we're just having a run of bad luck, but we've got five of what appear to be OOM failures in a row now, four of them on https://golang.org/cl/184165.

(https://build.golang.org/?repo=golang.org%2fx%2ftools#short)

CC @matloob @heschik

@bcmills bcmills added the Soon label Sep 3, 2019
Member Author
@bcmills bcmills commented Sep 3, 2019

Definitely not just a run of bad luck. x/tools went from mostly passing on windows-amd64-race to mostly failing at some point around CL 192277 or CL 184165. I'm guessing that one of those changes increased the memory footprint of the test above some critical threshold.

Member Author
@bcmills bcmills commented Nov 6, 2019

These also occur regularly on the windows-amd64-longtest builder: https://build.golang.org/log/4837bb3076d9cee69a232dba45f4e4a9a939cf03

Contributor
@matloob matloob commented Nov 6, 2019

Okay. Can the builders' sizes be increased?

Member Author
@bcmills bcmills commented Nov 6, 2019

Looks like they are currently n1-highcpu-4,¹ ² which is only 3.6 GB per instance, so they should probably be made substantially larger.

@dmitshur upsized the corresponding linux builders to n1-highcpu-16 in CL 192679.

(CC @bradfitz @toothrot @cagedmantis)

¹https://github.com/golang/build/blob/0babffaf2095de16932bb07c70d58a2730a161fd/dashboard/builders.go#L2029-L2058
²https://github.com/golang/build/blob/0babffaf2095de16932bb07c70d58a2730a161fd/dashboard/builders.go#L421-L441
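For reference, a minimal sketch of what bumping the machine type for a host entry in dashboard/builders.go amounts to. The type, field names, host key, and image name below are simplified assumptions for illustration, not the actual golang/build API; see the linked lines for the real configuration.

```go
package main

import "fmt"

// hostConfig is a simplified stand-in for the host configuration type in
// dashboard/builders.go; the real type has many more fields, and these
// field names are illustrative assumptions.
type hostConfig struct {
	VMImage     string // boot image for the Windows VM (hypothetical name)
	MachineType string // GCE machine type, e.g. "n1-highcpu-4"
}

// hosts mirrors the idea of the Hosts map in dashboard/builders.go.
var hosts = map[string]*hostConfig{
	// Current configuration: 4 vCPUs, 3.6 GB of RAM per instance.
	"host-windows-amd64-race": { // hypothetical host key
		VMImage:     "windows-amd64-2016", // hypothetical image name
		MachineType: "n1-highcpu-4",
	},
}

func main() {
	// The proposed fix is essentially a one-line swap to a larger machine
	// type, e.g. n1-highcpu-8 (7.2 GB) or n1-standard-4 (16 GB).
	hosts["host-windows-amd64-race"].MachineType = "n1-standard-4"
	for name, h := range hosts {
		fmt.Printf("%s: %s (image %s)\n", name, h.MachineType, h.VMImage)
	}
}
```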

Member
@dmitshur dmitshur commented Nov 7, 2019

I agree we should increase the memory for the -race builder at the very least. In order to do that, we need to make a decision about the machine type to use next.

Right now all the windows/amd64 hosts use the n1-highcpu-4 machine type, since they were added in 2017 in CL 41393. This includes the normal builders, the -race builder, and the -longtest builders.

Highcpu is described as:

High-CPU machine types are ideal for tasks that require a moderate increase of vCPUs relative to memory. High-CPU machine types have 0.90 GB of memory per vCPU.

In contrast to standard:

Standard machine types have 4 GB of memory per vCPU.

So, we can consider changing the windows -race builder to use either n1-highcpu-8, or n1-standard-4. Here's a comparison (based on https://cloud.google.com/compute/docs/machine-types and https://cloud.google.com/compute/vm-instance-pricing):

Machine                     vCPUs   RAM      Price (per hour)
n1-highcpu-4 (Current)      4       3.6 GB   $0.14
n1-highcpu-8 (Option 1)     8       7.2 GB   $0.28
n1-standard-4 (Option 2)    4       16 GB    $0.19

I would prefer for us to make smaller changes and iterate based on the results. My current impression is that the standard machine type would be a better fit for our needs, so I propose we change all the windows hosts to use n1-standard-4 (Option 2) instead of the current n1-highcpu-4. Both options seem quite close, though, so I'm on the fence; feedback welcome. Otherwise we can proceed with that plan as the next step.
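For concreteness, here is a tiny sketch that derives the RAM column of the table above purely from the per-vCPU figures quoted from the machine-type docs (0.90 GB per vCPU for high-CPU types, 4 GB per vCPU for standard types); it assumes nothing beyond those numbers.

```go
package main

import "fmt"

func main() {
	// GB of memory per vCPU, as quoted from the GCE machine-type docs above.
	const (
		highcpuGBPerVCPU  = 0.90
		standardGBPerVCPU = 4.0
	)

	options := []struct {
		name     string
		vcpus    int
		gbPerCPU float64
	}{
		{"n1-highcpu-4 (Current)", 4, highcpuGBPerVCPU},
		{"n1-highcpu-8 (Option 1)", 8, highcpuGBPerVCPU},
		{"n1-standard-4 (Option 2)", 4, standardGBPerVCPU},
	}

	for _, o := range options {
		// e.g. n1-highcpu-4: 4 vCPUs * 0.90 GB/vCPU = 3.6 GB
		fmt.Printf("%-26s %d vCPUs, %4.1f GB RAM\n",
			o.name, o.vcpus, float64(o.vcpus)*o.gbPerCPU)
	}
}
```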

Member Author
@bcmills bcmills commented Nov 7, 2019

I would like to have lower-latency longtest slowbots, so (budget permitting) I have a slight preference for the highcpu configuration.

(But really, anything is better than OOMing at head, so I'm not going to hold things up if folks would prefer a different option.)

Member
@dmitshur dmitshur commented Nov 7, 2019

Bryan, how does n1-highcpu-8 for -longtest and n1-standard-4 for the rest (normal and -race) sound?

Member Author
@bcmills bcmills commented Nov 7, 2019

I would be inclined to do n1-highcpu-8 for -longtest and -race and keep n1-highcpu-4 for the rest, although it's possible that -race will need more than 7.2 GB to run x/tools.

(If the “short” builders are doing fine today with only 3.6 GB, I'm doubtful that throwing more RAM at them will do much, although I suppose it could improve filesystem buffering performance.)

Member Author
@bcmills bcmills commented Nov 7, 2019

FWIW, it looks like freebsd-amd64-race is doing fine on n1-highcpu-4, so it may be worth some time for the x/tools folks to figure out why Windows isn't managing. (Are some tests being unexpectedly skipped on FreeBSD? Is the platform just that much more efficient?)

Member Author
@bcmills bcmills commented Nov 7, 2019

I'm having a hard time following the configuration for linux-amd64-race, but I believe it may be on an n1-standard-4 (due to running in a container).
