Naive question. The runtime has a bunch of top-level vars, some of which are fairly hot, e.g. the writeBarrier struct (checked before every write barrier call), the debug struct (checked during every malloc for e.g. allocfreetrace), and the trace struct (to know whether tracing is enabled). Some are written a lot (writeBarrier), whereas others are read-mostly (debug, trace).
They are organized for readability and thus end up potentially scattered around the final binary. However, I wonder whether it would be better to ensure that all the hottest read-mostly variables are in a single cache line and ensure that the hottest read-write variables don't trigger false sharing.
Many of these aren't easy to move around and experiment with, because of compiler integration. So first: Any instincts about whether this is likely to matter in practice?
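To make the layout question concrete, here is a minimal sketch of how one might check whether a group of read-mostly flags lands on a single cache line. The struct `hotReadMostly`, its field names, and the 64-byte line size are all assumptions for illustration; this is not the runtime's actual layout.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption; 64 bytes is common on x86-64.
const cacheLineSize = 64

// hotReadMostly is a hypothetical grouping of read-mostly flags.
// The field names are illustrative only.
type hotReadMostly struct {
	traceEnabled   bool
	allocFreeTrace int32
	gcTrace        int32
}

var hot hotReadMostly

// onOneLine reports whether the size bytes starting at addr
// all fall within a single cache line.
func onOneLine(addr, size uintptr) bool {
	return addr/cacheLineSize == (addr+size-1)/cacheLineSize
}

func main() {
	addr := uintptr(unsafe.Pointer(&hot))
	size := unsafe.Sizeof(hot)
	fmt.Println("size in bytes:", size)
	fmt.Println("on one cache line:", onOneLine(addr, size))
}
```

Fitting within the line size is not enough on its own: a small struct can still straddle two lines depending on where the linker places it, which is why the check uses the address as well as the size.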
I don't recall seeing serious contention on any globals when I ran https://godoc.org/github.com/aclements/go-perf/cmd/memlat, but that was quite a while ago and I wasn't necessarily looking. It would be easy enough to run that again. Particularly if it's run on a multi-node system, any globals with poor cacheability or false sharing should stick out as expensive remote DRAM events.
It would also be easy enough to crank up the PEBS recording rate and just write a simple tool to look for hot globals. Sort of like https://godoc.org/github.com/aclements/go-perf/cmd/memanim, but obviously looking for different things in the memory trace. With memanim, I found the hardware could easily record every single load over 50 cycles.
If we do find any, the cheap solution is to add padding variables around them. We already do this in a few places (grep for CacheLineSize), but I think those are all based on assumptions about hot cache lines and aren't backed up by measurements.
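As a rough illustration of the padding approach, the sketch below pads each counter out to an assumed 64-byte cache line so that adjacent counters never share a line. The names `paddedCounter` and `cacheLineSize` are hypothetical; the runtime uses its own constant for this.

```go
package main

import (
	"fmt"
	"unsafe"
)

// cacheLineSize is an assumption; the real value is per-architecture.
const cacheLineSize = 64

// paddedCounter occupies a full cache line, so neighbouring
// counters in an array cannot falsely share a line.
type paddedCounter struct {
	n uint64
	_ [cacheLineSize - 8]byte // pad the 8-byte counter to one line
}

var counters [4]paddedCounter

// paddedSize returns the size of one padded counter in bytes.
func paddedSize() uintptr { return unsafe.Sizeof(counters[0]) }

// stride returns the distance in bytes between adjacent counters.
func stride() uintptr {
	a := uintptr(unsafe.Pointer(&counters[0]))
	b := uintptr(unsafe.Pointer(&counters[1]))
	return b - a
}

func main() {
	fmt.Println("size:", paddedSize(), "stride:", stride())
}
```

Padding like this separates the counters from each other, but it does nothing for a variable that several cores genuinely write to.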
Frequent write sharing can be very expensive and prevent scaling on higher core counts. We need to get rid of each and every case.
But note that processors don't have circuitry to distinguish between false and true sharing. They penalize both equally. So it is not about adding padding and shuffling variables, it is about eliminating frequently written variables. See the following commits for examples: d6ed1b7 d839a80 66d5c9b 909f318 013ad89 c9152a8 86e7323
And the scheduler (distributed run queues), the memory allocator (MCache) and the parallel GC (Workbuf, parfor) were designed around the idea of not creating heavy write sharing in the first place.
If new instances of frequently written variables have been added since then, we need to get rid of them as well.
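The idea of avoiding write sharing by design can be sketched in user code: give each worker its own slot to write to, and combine the slots only when reading. This is an analogy to the per-thread structures mentioned above, not the runtime's implementation; `shardedCounter` and its methods are hypothetical names, and the 64-byte line size is an assumption.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// cacheLineSize is an assumption; 64 bytes is common on x86-64.
const cacheLineSize = 64

// shard holds one worker's private count, padded to a full line.
type shard struct {
	n uint64
	_ [cacheLineSize - 8]byte
}

// shardedCounter spreads writes across per-worker shards.
type shardedCounter struct {
	shards []shard
}

func newShardedCounter(n int) *shardedCounter {
	return &shardedCounter{shards: make([]shard, n)}
}

// add increments only the shard belonging to the given worker.
func (c *shardedCounter) add(worker int, delta uint64) {
	atomic.AddUint64(&c.shards[worker%len(c.shards)].n, delta)
}

// total sums all shards; reads are expected to be rare.
func (c *shardedCounter) total() uint64 {
	var sum uint64
	for i := range c.shards {
		sum += atomic.LoadUint64(&c.shards[i].n)
	}
	return sum
}

// countWithWorkers runs the given number of goroutines, each
// incrementing its own shard perWorker times, and returns the total.
func countWithWorkers(workers, perWorker int) uint64 {
	c := newShardedCounter(workers)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for i := 0; i < perWorker; i++ {
				c.add(id, 1)
			}
		}(w)
	}
	wg.Wait()
	return c.total()
}

func main() {
	fmt.Println(countWithWorkers(4, 1000))
}
```

The trade-off is that reads become more expensive, since they have to visit every shard, so this only pays off when writes dominate.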