Your observation is correct. Currently the runtime never frees the g objects created for goroutines, though it does reuse them. The main reason for this is that the scheduler often manipulates g pointers without write barriers (a lot of scheduler code runs without a P, and hence cannot have write barriers), and this makes it very hard to determine when a g can be garbage collected.
One possible solution is to use an RCU-like reclamation scheme over the Ms that understands when each M's scheduler passes through a quiescent state. Then we could schedule unused gs to be reclaimed after a grace period, when all of the Ms have been in a quiescent state. Unfortunately, we can't simply use STWs to detect this grace period because those stop all Ps, so, just like the write barriers, those won't protect against scheduler instances manipulating gs without a P.
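To make the grace-period idea concrete, here is a minimal user-level sketch of quiescent-state-based reclamation. It is not runtime code: `worker` stands in for an M, and `domain`, `quiesce`, `retire`, and `reclaim` are hypothetical names invented for this illustration. It also assumes a single goroutine calls `retire` and `reclaim`.

```go
package main

import "sync/atomic"

// worker stands in for an M in this sketch (hypothetical type).
type worker struct {
	seen atomic.Uint64 // global epoch last observed at a quiescent state
}

// retiredItem is an object waiting for its grace period to end.
type retiredItem struct {
	epoch uint64
	free  func()
}

// domain tracks the global epoch, the workers, and the retired objects.
// Assumption: retire and reclaim are only called from one goroutine.
type domain struct {
	epoch   atomic.Uint64
	workers []*worker
	retired []retiredItem
}

// quiesce is called by a worker at a point where it holds no references
// to objects that might be retired.
func (d *domain) quiesce(w *worker) {
	w.seen.Store(d.epoch.Load())
}

// retire starts a new epoch and queues the object for reclamation.
func (d *domain) retire(free func()) {
	e := d.epoch.Add(1)
	d.retired = append(d.retired, retiredItem{epoch: e, free: free})
}

// reclaim frees every object whose retirement epoch has been observed by
// all workers, and returns how many objects it freed.
func (d *domain) reclaim() int {
	oldest := d.epoch.Load()
	for _, w := range d.workers {
		if s := w.seen.Load(); s < oldest {
			oldest = s
		}
	}
	freed := 0
	kept := d.retired[:0]
	for _, r := range d.retired {
		if r.epoch <= oldest {
			r.free()
			freed++
		} else {
			kept = append(kept, r)
		}
	}
	d.retired = kept
	return freed
}
```

A worker that never reaches a quiescent state blocks reclamation indefinitely in this sketch. A real scheme over Ms would need some way to account for idle or blocked Ms.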
@changkun, I'm not sure what your benchmark is measuring. Calling runtime.GC from within a RunParallel doesn't make sense. The garbage collector is already concurrent, and calling runtime.GC doesn't start another garbage collection until the first one is done. Furthermore, if there are several pending runtime.GC calls, they'll all be coalesced into a single GC. If the intent is simply to measure how long a GC takes, call runtime.GC without the RunParallel.
Calling runtime.GC within a RunParallel does not measure contention on allglock. The GCs are serialized by runtime.GC itself, so they're not fighting over allglock, and they're coalesced by runtime.GC, so calling runtime.GC N times concurrently can result in anywhere from 1 to N GCs depending on vagaries of scheduling.
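For reference, a serial version of such a benchmark might look like the sketch below. It is only an illustration: `spawnPeak`, `peakGoroutines`, and the benchmark name are made up here, and the peak of 100000 goroutines is an arbitrary choice. The goroutines are held open until all of them have started, so the runtime really has to create that many gs before they exit.

```go
package main

import (
	"runtime"
	"sync"
	"testing"
)

// peakGoroutines is an arbitrary peak chosen for illustration.
const peakGoroutines = 100000

// spawnPeak starts n goroutines, keeps them all alive at the same time,
// then lets them exit and waits until each has signalled completion.
func spawnPeak(n int) {
	var started, done sync.WaitGroup
	release := make(chan struct{})
	started.Add(n)
	done.Add(n)
	for i := 0; i < n; i++ {
		go func() {
			started.Done()
			<-release
			done.Done()
		}()
	}
	started.Wait() // all n goroutines exist at this point
	close(release)
	done.Wait()
}

// BenchmarkGCAfterGoroutinePeak times one full, blocking GC cycle per
// iteration after the goroutine peak has passed. No RunParallel is used.
func BenchmarkGCAfterGoroutinePeak(b *testing.B) {
	spawnPeak(peakGoroutines)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		runtime.GC()
	}
}
```

Placed in a `_test.go` file, this runs with `go test -bench=GCAfterGoroutinePeak`. Comparing against a run with a small `peakGoroutines` gives a rough idea of how much the retained gs cost per GC.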
Benchmark aside, though, I think we're all clear on the issue: allgs is never collected, and that impacts both GC time and heap size.
Since gs are just heap allocated, it would make the most sense to collect them during GC like other heap allocations. The question is when it's safe to unlink them from allgs and allow them to be collected, given that the normal GC reachability invariants don't apply to gs. (At the same time, we don't want to be over-aggressive about unlinking them from allgs either, since we want the allocation pooling behavior to reduce the cost of starting a goroutine.) This is certainly doable, though it would require a fair amount of care.
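The pooling trade-off can be illustrated with a toy free list that keeps a bounded number of idle objects and drops its references to the rest so the GC can collect them. This is a user-level analogy only: `gStub`, `freeList`, and `trim` are hypothetical names, the 2048-byte stack size is arbitrary, and the sketch sidesteps the hard part described above, namely knowing when no scheduler code still holds a pointer to the object.

```go
package main

// gStub is a stand-in for a goroutine descriptor (hypothetical type).
type gStub struct {
	stack []byte
}

// freeList pools idle gStubs for reuse. Not safe for concurrent use.
type freeList struct {
	items []*gStub
}

// get reuses a pooled gStub if one is available, else allocates a new one.
func (f *freeList) get() *gStub {
	if n := len(f.items); n > 0 {
		g := f.items[n-1]
		f.items[n-1] = nil // drop the pool's reference
		f.items = f.items[:n-1]
		return g
	}
	return &gStub{stack: make([]byte, 2048)}
}

// put returns an idle gStub to the pool.
func (f *freeList) put(g *gStub) {
	f.items = append(f.items, g)
}

// trim keeps at most keep pooled objects (keep must be >= 0) and unlinks
// the rest so the garbage collector can reclaim them. It returns the
// number of objects unlinked.
func (f *freeList) trim(keep int) int {
	if len(f.items) <= keep {
		return 0
	}
	dropped := len(f.items) - keep
	for i := keep; i < len(f.items); i++ {
		f.items[i] = nil
	}
	f.items = f.items[:keep]
	return dropped
}
```

Choosing `keep` is the policy question: too low and goroutine creation loses the benefit of pooling, too high and a past peak keeps inflating the heap.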