Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

x/build/env/linux-arm64/packet: restrict CPUs to a reasonable number of cores #36170

Open
toothrot opened this issue Dec 16, 2019 · 21 comments
Open

Comments

@toothrot
Copy link
Contributor

@toothrot toothrot commented Dec 16, 2019

During the Go 1.14 beta1 release, we had a number of failures due to issues on our arm64 builders #36166.

Inspecting the build box, we discovered that our 96 core machine was very overloaded: a 1 minute load of over 300. This appeared to be related to the failures we encountered.

An initial fix of a maximum of 2 containers with 48 cores each seemed to resolve the issue, but there's likely a more efficient configuration.

@toothrot toothrot changed the title env/linux-arm64/packet: restrict CPUs to a reasonable number of cores x/build/env/linux-arm64/packet: restrict CPUs to a reasonable number of cores Dec 16, 2019
@gopherbot gopherbot added this to the Unreleased milestone Dec 16, 2019
@toothrot toothrot self-assigned this Dec 16, 2019
@gopherbot
Copy link

@gopherbot gopherbot commented Dec 16, 2019

Change https://golang.org/cl/211579 mentions this issue: dashboard: restrict CPU usage on packet builders

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 17, 2019

I like that the packet builder is so effective in sussing out fragile tests.

Users sometimes overload their own machines, and sometimes run on machines that are much bigger than most of our builders — if we scale down the packet builders too much, we will fail to detect both real, timing-sensitive bugs, and fragile tests that manifest as flakiness for our users.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 17, 2019

The specific failures observed in #36166 may be due to memory constraints, and in particular the CPU∶RAM ratio.

If so, then perhaps we need to re-tune the go and/or dist commands to pay more attention to RAM limits instead of (just) CPU limits.

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Dec 17, 2019

I agree with that sentiment and benefit. We do still need to figure out a way to be able to build a release for this platform without wasting a day in re-running tests (which is what happened for the latest release, see #36166).

I'm not sure whether we should be adding more complexity to the scheduler for this, if we have releasebot change the configuration during a release, or if we should instead use a dedicated builder to suss out timing or memory sensitive test failures.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 17, 2019

I agree that we should get to a state where we can build a release for linux/arm64 as quickly and reliably as for other platforms.

Dedicated builders for unusually high and unusually low CPU∶RAM ratios seem like a nice solution if we can do that. I'm not sure that the scheduler needs to be particularly more complex — if we have some reasonable idea of what a build costs, CPU and RAM limits are pretty easy to handle with semaphores. (I think the hard part is getting a reliable indicator of how much CPU and RAM the machine or container actually has available.)

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Dec 17, 2019

Sure, I understand conceptually how to keep track of such things in a scheduler, but I don't think that the act of accurately accounting for all of the CPU and RAM across our builders is low complexity. It's not impossible, but I don't know if it's at the top of my list.

I appreciate your concern here, and I don't want to lose any coverage by making this change. That being said, I don't know if relying on punishing a single platform is the best way to do that. It is convenient that the packet (and ARM servers in general) have a large number of cores.

Maybe there's another way to move forward with this that doesn't involve breaking the releases for one of our platforms, but still achieves part of your goal? A dedicated highcore builder might suffice to cover what we currently have.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 17, 2019

The scheduler doesn't have to be “accurate”. (Like most compiler-y problems, “conservative” often suffices. 😉)

At any rate: I think it's very important that our hardware be representative of what we expect real users to run. Do real users deploy to many-core linux/arm64 servers? If so, what kind of resource ratios are they likely to have? I'm happy to turn it down to “realistic”, but preferably not below that point.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 18, 2019

Thinking some more about this, there are actually two separate bits of value that we're currently getting from the packet builders:

  1. Due to the very high core count, the packet builder tends to be more sensitive to concurrency bugs, especially in the runtime.
  2. Due to the CPU overload, the packet builder tends to be more sensitive to timing-sensitivity bugs in tests.

So if we replace our large, overloaded packet builders with dedicated builders for unusual conditions, that suggests two new builders:

  1. Probably just one “very high CPU count” builder to suss out concurrency bugs, analogous to a longtest builder, or perhaps actually a longtest builder because the more CPU-intensive tests will tend to be long tests anyway). An ARM64 builder seems ideal for this, because we can get lots of independent cores for cheap.
  2. One or more “very low CPU per builder” builders to suss out timing bugs, but not necessarily an ARM64 builder (generally any arch will do).

@cherrymui
Copy link
Contributor

@cherrymui cherrymui commented Dec 18, 2019

We almost never see such a OOM failure on the builder dashboard, but we do see this when building the release. I also sometimes saw the failure when running all.bash manually on gomote. Do you know what the difference is?

As Bryan said, it is helpful to have this builder with very high core count. I think it would be much better if we could limit the number of parallel processes in building the release or running all.bash, but not limiting the number of accessible cores (for each individual process). But I'm not sure how to do this.

@bcmills
Copy link
Member

@bcmills bcmills commented Dec 18, 2019

Note that https://farmer.golang.org/ is currently showing 18 of 20 packet builders down. Is that related to the release errors?

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Dec 19, 2019

@bcmills The 18 of 20 being down is related to the pending CL. We manually configured it to 2 builders in order to get the release out. I was expecting to be able to move forward with that CL to update the dashboard configuration.

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Dec 19, 2019

@cherrymui We could limit the number of parallel builds without restricting the number of accessible cores. With an unrestricted number of cores, I found that if there were more than two trybot runs happening in addition to a release, the release generally failed.

I believe packet has a newer hardware configuration that we could use, as @bcmills mentioned, that has a similar number of cores but significantly greater RAM.

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Dec 19, 2019

As long as we all agree on the following thoughts, I think we can move forward:

  • The linux-arm64 builder should be able to reliably create releases. We'll restrict CPU, RAM, and parallel builds on the current builder in order to achieve this.
  • A second linux-arm64 machine can be provisioned as a highcpu-longtest builder to capture our loss of a builder that covers this scenario. Ideally, it also runs releases at some cadence in order to capture the OOM failure @cherrymui described.
  • A yet-to-be-determined low-CPU (single CPU?) builder should be provisioned somewhere

gopherbot pushed a commit to golang/build that referenced this issue Dec 19, 2019
The packet arm64 build box has 96 cores, and easily gets overloaded with
multiple builds running simultaneously. Restrict each container to fewer
cores, and decrease the number of workers.

In order to not lose some of the coverage we got from packet as a
high-CPU builder, we will need to create new builder types going forward
to ensure we don't lose coverage. See the issue for details.

Updates golang/go#36170

Change-Id: I16478fa05258bc14ac59a1e708133ffcc4c5a664
Reviewed-on: https://go-review.googlesource.com/c/build/+/211579
Run-TryBot: Alexander Rakoczy <alex@golang.org>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
@vielmetti
Copy link

@vielmetti vielmetti commented Apr 29, 2020

@toothrot - Packet has also available a 2d generation Ampere eMag server, a c2.large.arm - happy to allocate one more of those systems to the build pool there. It's 32 cores with appreciably faster per-core performance than the older ThunderX 96-core system.

@toothrot
Copy link
Contributor Author

@toothrot toothrot commented Apr 30, 2020

/cc @cagedmantis, who is currently working on our ARM builders.

@toothrot toothrot removed their assignment Apr 30, 2020
@cagedmantis cagedmantis self-assigned this Apr 30, 2020
@vielmetti
Copy link

@vielmetti vielmetti commented Apr 30, 2020

Good to be acquainted @cagedmantis - looking forward to adding a system to the collection.

@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented May 27, 2020

@vielmetti Would it be possible to continue this conversation over email? You can reach me at carlos@golang.org.

@vielmetti
Copy link

@vielmetti vielmetti commented Jun 1, 2020

@cagedmantis email sent!

@gopherbot
Copy link

@gopherbot gopherbot commented Mar 19, 2021

Change https://golang.org/cl/303232 mentions this issue: dashboard: add linux-arm64-aws to set of trybots

@vielmetti
Copy link

@vielmetti vielmetti commented Mar 19, 2021

@cagedmantis Do you still need the linux-arm64-packet trybot for anything? We've been planning to transition that system to much faster Ampere Altra hardware.

@cagedmantis
Copy link
Contributor

@cagedmantis cagedmantis commented Mar 24, 2021

@vielmetti Yes, we still think it's useful to have the linux-arm64-packet builder. As @bcmills noted above, one reason it is valuable is in sussing out fragile tests. Thank you for the contribution.

gopherbot pushed a commit to golang/build that referenced this issue Apr 2, 2021
This change adds linux-arm64-aws to the set of trybots. It also
removes linux-arm and linux-arm64 from the list of builds attempted in
the misc-compile-linuxarm builder.

Updates golang/go#36170
Fixes golang/go#45065

Change-Id: If25fed08e35b2a91c9c9dbbf31701ff5464dc913
Reviewed-on: https://go-review.googlesource.com/c/build/+/303232
Trust: Carlos Amedee <carlos@golang.org>
Run-TryBot: Carlos Amedee <carlos@golang.org>
TryBot-Result: Go Bot <gobot@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
7 participants