x/build: openbsd-amd64-64 trybots are too slow #29223
SELECT Builder, AVG(Seconds) as Sec FROM builds.Builds WHERE IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" AND FailureURL = "" GROUP BY 1 ORDER BY Sec DESC;
|
SELECT Builder, Event, AVG(Seconds) as Sec FROM builds.Spans WHERE Builder LIKE 'openbsd-amd64%' AND Error='' And IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" GROUP BY 1, 2 ORDER BY Sec DESC;
|
Wow, just running make.bash (which isn't sharded out over N buildlets) is more than twice as slow as other platforms: SELECT Builder, Event, AVG(Seconds) as Sec FROM builds.Spans WHERE Event = 'make' AND Error='' And IsTry=True AND StartTime > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 100 HOUR) and Repo = "go" GROUP BY 1, 2 ORDER BY Sec DESC;
|
Likely suspect: #18314 (use a tmpfs on OpenBSD) |
I tried doing the memory filesystem on /tmp/ in an OpenBSD 6.4 amd64 instance (via gomote ssh) and it works, but it's still not any faster. Still 5 minutes ....
It sees 4 cores:
The kernel we're running is:
Is this the Spectre/Meltdown mitigations shutting down SMT? Can we disable that for the builders? /cc @mdempsky |
@bradfitz I think you can try setting "sysctl hw.smt=1" to re-enable hyper threading. |
It's already enabled:
So, that's not it. It's crazy that OpenBSD is 2x slower. If it were 10% slower I'd assume, "Oh, OpenBSD prioritizes security over performance" and be fine with that. But 2x makes me think we have a configuration problem somewhere. |
Have you tried increasing login.conf limits (as I suggested on twitter)? |
Which would you increase? We have:
|
The default settings are low. You could try setting datasize-max/cur and stacksize-cur to "unlimited" |
@stmuk Wouldn't the resource limits being too low just cause the build to fail rather than to proceed slowly? |
Yeah. The issue is speed, not failure to build. |
This is all very tedious & slow to work on, so I don't eagerly pursue avenues that don't at least make sense. Maybe if I were really desperate. But given limited time, I'd rather spend it on trying to collect system-wide profiling information or otherwise getting more visibility into the problem, rather than just changing random things.
We push a pre-built Go 1.4 to it and use that. |
@bradfitz Maybe a first step would be to use cmd/dist's GOBUILDTIMELOGFILE to see if any particular steps are slower, or the whole thing is proportionally slower?
|
@bradfitz Too many negatives in that for me to parse that or motivate me to try and help further. I just regret wasting my time trying to help. |
@stmuk, sorry, I didn't mean to waste your time. But with me and @mdempsky both thinking that such a tweak wouldn't do anything, it's not a high priority of mine to try. I appreciate you throwing it out there, even if it's not the answer. I at least went and read the OpenBSD man pages for those knobs. |
@bradfitz You were right the login cap limit relaxation made no difference whatever. @mdempsky Running on i5 Gen 5 Vbox host with OEL7.6 and OpenBSD 6.4 guests under vagrant I get the unexpected result of a slightly faster OpenBSD build! There are different compilers in use to build the 1.4 I bootstrapped tip off. OpenBSD has their patched clang 6 whereas Linux has gcc 4.8.5. OBSD has a noatime mount but otherwise no changes were made. I'm wondering if we are just seeing differences due to the underlying virtualisation. I may experiment with QEMU and more similar C compilers if I get a chance. go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 linux/amd64 go version devel +5538a9a Fri Jan 18 22:41:47 2019 +0000 openbsd/amd64 |
The problem is not the compiler or the VM software or the FS. I'm the maintainer of BaCon which also runs a big bash script and it's slow as hell. Something happens between bash and the OpenBSD base which makes the bash scripts slow. Maybe something related to the memory protections. |
@juanfra684, our bash script is a few lines that just calls into a Go program that does the rest. Our performance issue is not bash related. |
You're right, sorry for the misunderstanding. I've built the Go port (only make, no tests) on 6.4 and -current, and there is a difference of -14%:
OpenBSD doesn't have magic knobs to speed things up, but you could tune a few things to help the bootstrap. Firstly, if the VM host is using flash drives for storage, forget mfs. It's not equivalent in speed to Linux tmpfs, and you can usually run the FS operations faster on a simple consumer-grade SSD. About the mount options, use You could add also a few entries to
|
Sounds good. Yes, we're using GCE's SSDs, which are fast.
We already had
We had 20.
We had |
One of your commands shows:
If you have a few GB of RAM, try with
A better way to check the correct value for
Anyway, those things will only speed up the process slightly. There is an underlying problem on OpenBSD slowing down the build and tests. |
We remount it when the build daemon starts up: That runs before any build. |
|
For debugging builder make.bash speed. Updates golang/go#29223 Change-Id: I030c61ec3fdd7af45c6a96ea5cede0bbb54f97bc Reviewed-on: https://go-review.googlesource.com/c/160317 Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
No, it's being measured from a program (on Linux) that drives the buildlet over the network.
Doesn't look great:
Getting off by 21.5 seconds in 60 seconds looks like the TSC calibration definitely didn't work, which explains why it wasn't selected in the first place. But then I ran it again (new VM boot, different CPU type?) and:
There the adjustments were much more reasonable, but from the client's point of view it took from I wonder if we're not specifying which CPU type we want so we're getting something random. (https://cloud.google.com/compute/docs/cpu-platforms) I'd just run -CURRENT with pvclock instead of worrying about TSC, but even though -CURRENT is effectively 6.4 right now, I worry that in a month or two, CURRENT might introduce some breaking change that Go isn't ready for and we'll be unable to rebuild our builder's VM image if we need something new in it. So I'd like to do one of:
Recommendations? |
Never mind, it's not random. The docs say:
We run in
So we're on Ivy Bridge for now, and will switch to Haswell by default in a couple months here. We could request a different CPU type if the TSC/OpenBSD is better in one of those. |
On third run, the whole rdate+sleep 60+rdate took exactly 60 seconds from client's perspective, with 9.4 second drift:
|
Hi, I agree with the analysis by @mdempsky that PMTimer is too slow on GCE which prevents TSC calibration.
Probably the easiest until 6.5 arrives in May.
You cannot adjust TSC frequency via a sysctl interface. It would be useless for our users in the long run anyway.
Backporting is straightforward, but would you be able to compile and store the kernel yourself? OpenBSD doesn't backport drivers to -stable due to a short development cycle.
Mirrors don't retain snapshots. If you're able to store and fetch the snapshot yourself that might be a way to go. Then you can schedule updates on your own. FYI, right now LLVM 7 is being integrated into OpenBSD-current.
|
@mbelop, I take that to mean that OpenBSD does do better on Skylake? At least my test seems to confirm. On Skylake it boots and says:
So looks like it had already selected TSC on its own with Skylake and the sysctl was a no-op. But the rdate result wasn't great:
8.5 seconds off in 60 seconds. But fastest build yet: 2m11s. So, yay Skylake? Maybe the wall clock drift or rdate result doesn't matter. |
As long as your hypervisor sets the CPU model of the VCPU to whatever value Skylake uses,
Supposedly because KVM doesn't compensate for VM switches.
I don't think it matters a whole lot for building and running regress tests especially since you |
Yes. |
Should the FreeBSD builders be made to stop using TSC as well? Can Broadwell be set as the minimum CPU type (I think it's available in all the regions)? |
So Skylake alone doesn't seem to do the trick for OpenBSD. I tried booting our original OpenBSD 6.4 image (without the sysctl to set kern.timecounter.hardware) and it comes up and says...
(The 2.0 GHz to me implies Skylake based on https://cloud.google.com/compute/docs/cpu-platforms at least) But then when I run sysctl kern.timecounter.hardware:
And it's slow. So it seems I still need to force it to TSC? Or is GCE not giving me Skylake, despite (seemingly?) asking for it? |
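A builder could check at startup which timecounter the kernel actually picked, rather than inferring it from boot messages. The following is only a sketch under assumptions: it shells out to sysctl for kern.timecounter.hardware (the name used in this thread) and assumes the TSC source is reported as the string "tsc". It is not taken from the buildlet code.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// timecounterName extracts the value from sysctl output, accepting both the
// "name=value" form and the bare value printed by "sysctl -n".
func timecounterName(out string) string {
	out = strings.TrimSpace(out)
	if i := strings.LastIndex(out, "="); i >= 0 {
		out = out[i+1:]
	}
	return strings.TrimSpace(out)
}

// usesTSC reports whether the selected timecounter is the TSC.
// Assumption: the kernel names that source "tsc".
func usesTSC(out string) bool {
	return timecounterName(out) == "tsc"
}

func main() {
	out, err := exec.Command("sysctl", "-n", "kern.timecounter.hardware").Output()
	if err != nil {
		// Expected on systems that do not have this sysctl.
		fmt.Println("could not read kern.timecounter.hardware:", err)
		return
	}
	name := timecounterName(string(out))
	if usesTSC(string(out)) {
		fmt.Println("timecounter is TSC")
	} else {
		fmt.Printf("timecounter is %q, not TSC; builds may be slow\n", name)
	}
}
```

Logging this once per VM boot would make it obvious from build logs whether a slow run coincided with a non-TSC timecounter.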
@paulzhol, I don't know. You want to investigate and open a separate bug? But they're not slow (same speed as Linux, which does use the kvm clock), so I'm not worried. Whatever FreeBSD is doing by default seems to work. |
As you can see from
|
Change https://golang.org/cl/160457 mentions this issue: |
Linux fix is torvalds/linux@b511203 https://lwn.net/Articles/752159/ says:
So Linux redoes calibration on this CPU. OpenBSD gives up on it. I'm starting to think mirroring ~today's CURRENT snapshot to our own storage for reproducible VM image builds might be the best path forward, rather than backporting pvclock to 6.4 or worrying about the TSC / CPU issues. |
@bradfitz FreeBSD is not doing the right thing by default, I've forced it to use TSC in golang/build@2e001a4#diff-ca6ebc387c9f22d5808dfac3060344bd based on https://lists.freebsd.org/pipermail/freebsd-cloud/2017-January/000080.html (so the 10.4, 11.2 and 12.0 images are affected). That's why I'm asking if we should revert (or maybe you can request Broadwell/Skylake machines for FreeBSD builders as well, no kvmclock is being used). |
This isn't correct. We don't give up immediately, we attempt to recalibrate with a PM timer or an HPET timer, except that it doesn't work for you. Then we give up.
I can provide a backport patch for 6.4, but I don't know how you would go about integrating it in your test system. I totally support your decision to move to -current however. |
A relevant discussion about TSC calibration (on FreeBSD): http://freebsd.1045724.x6.nabble.com/TSC-calibration-in-virtual-machines-td6266249.html#a6266308
|
Pardon me if this is just taken out of context, but this question sounds silly, since if you're running under a hypervisor that has total control of the guest memory and execution flow you're trusting it by the very definition. There's no question. |
Ah, okay. Thanks. It'd be nice of course if it worked, or at least whitelisted GCE as having a correctly advertised value like it sounds like FreeBSD is considering. I'll probably go the -current route. |
Correctly advertised value of what?
|
It was my impression from that FreeBSD thread that some hypervisors advertise the TSC frequency. For instance, VMware does: https://svnweb.freebsd.org/base/head/sys/x86/x86/tsc.c?r1=221178&r2=221214&pathrev=221214 That thread made it sound like GCE does something similar. |
Yeah, I don't think that's the case. But if you do, please let us know. |
Doesn't GCE (kvm) have this commit qemu/qemu@9954a15 ? |
GCE uses KVM but with its own implementation; it doesn't use QEMU. So it might do something like that, but that QEMU code wouldn't tell us whether it does. |
So if I'm reading https://www.kernel.org/doc/Documentation/virtual/kvm/timekeeping.txt correctly
My understanding is that if the hypervisor reports a TSC to the guest, then the hypervisor does all the validations required (trapping RDTSC instructions if necessary, keeping the vCPUs' TSCs in sync, etc.) and no calibration should be required (or possible) in the guest. So are we safe to keep forcing TSC on Broadwell/Skylake hosts until something like https://svnweb.freebsd.org/base/head/sys/x86/x86/tsc.c?r1=221178&r2=221214&pathrev=221214 is applied upstream for GCE/KVM? |
…penBSD I thought this would be enough for OpenBSD to select the TSC on its own without being forced to (as in CL 160319), but apparently it is not: golang/go#29223 (comment) So it seems like we want both this CL and CL 160319. Updates golang/go#29223 Change-Id: I0a092d62881d8dcce0ef1129d8d32d8f4025b6ac Reviewed-on: https://go-review.googlesource.com/c/160457 Reviewed-by: Andrew Bonventre <andybons@golang.org>
openbsd-amd64-64 trybots are taking 11+ minutes (which causes TryBots as a whole to take 11+ minutes rather than ~5)
We need to figure out what's slow on them, and/or just shard it out more.
/cc @dmitshur @bcmills @andybons