Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: avoid clearing the full Pool on every GC #22950

Closed
dsnet opened this issue Dec 1, 2017 · 34 comments
Closed

sync: avoid clearing the full Pool on every GC #22950

dsnet opened this issue Dec 1, 2017 · 34 comments

Comments

@dsnet
Copy link
Member

@dsnet dsnet commented Dec 1, 2017

Consider the following benchmark:

func Benchmark(b *testing.B) {
	b.ReportAllocs()
	var a, z [1000]*flate.Writer
	p := sync.Pool{New: func() interface{} { return &flate.Writer{} }}
	for i := 0; i < b.N; i++ {
		for j := 0; j < len(a); j++ {
			a[j] = p.Get().(*flate.Writer)
		}
		for j := 0; j < len(a); j++ {
			p.Put(a[j])
		}
		a = z
		runtime.GC()
	}
}

This currently reports: Benchmark-8 10 263962095 ns/op 663587733 B/op 1017 allocs/op
where I find it surprising that there is around 1000 allocations again every cycle. This seems to indicate that the Pool is clearing its entire contents upon every GC. A peek at the implementation seems to indicate that this is so.

If the workload is such that it is cycling through a high number of items, then it makes sense to actually not clear everything. A simple heuristic to accomplish this is to keep track of the number of calls to Pool.Get since the last attempt clearing the pool and ensure that we retain up to that number of items.

\cc @LMMilewski @aclements

@dsnet dsnet added the Performance label Dec 1, 2017
@dsnet dsnet changed the title sync: avoid clearing sync: avoid clearing the full Pool on every GC Dec 1, 2017
@dsnet dsnet added the Suggested label Dec 1, 2017
@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 1, 2017

When proposing any changes here please consider the discussion at https://groups.google.com/forum/#!msg/golang-dev/kJ_R6vYVYHU/LjoGriFTYxMJ .

@OneOfOne

This comment has been minimized.

Copy link
Contributor

@OneOfOne OneOfOne commented Dec 1, 2017

I'd like to see that too, a lot of times I use a ghetto chan xxx as a memory pool just to get around that.

@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Dec 1, 2017

I've also found the sync.Pool clearing behavior to be less than satisfactory. Here's a concrete example taken from a real server.

We have a performance-critical HTTP server. Reducing allocations has been a big part of getting throughput to be consistently high and latency to be consistently low. There are a handful of objects that we cache and reuse between requests as part of the allocation strategy. This server tends to have a ~5 GB heap and runs a GC every ~4 seconds.

One kind of reused object is a certain buffer that we use at the rate of ~10000 / sec. For this use case, a sync.Pool works really well; we get a 99.9% cache hit rate since we reuse buffers tens of thousands of times between GCs.

Another kind of reused object is a gzip.Writer that we use at the rate of ~500 / sec. For this use case, a sync.Pool did not work well for us: the combination of GC frequency and relatively lower usage rate meant that the hit rate when we used a sync.Pool was as low as 50%. Fresh gzip.Writers are extremely expensive to create, so we needed to do better. Now our code uses a buffered channel of *gzip.Writers that it can reuse, but it still uses a sync.Pool as a backup once the channel is drained so that if we suddenly need more writers than the channel size we don't have to keep allocating all of the extras.

Ideally a sync.Pool would be suitable for both of these scenarios.

Finally, the tight coupling between GC frequency and pool efficiency also has weird implications for reasoning about server performance. For instance, if totally unrelated code in our server changes the GC frequency (say, by introducing more allocations or, more perversely, decreasing the overall heap size), the pool hit rates get worse and thus the cost of allocations for these objects increases, even though nothing about the usage or cache-ability of those objects changed.

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Dec 1, 2017

@ianlancetaylor. Much of the discussion seems to be around the question of "when to drain the pool?" and not much around "how much of the pool to drain?". In the thread, Ingo Oeser asked "Will it be fully or partially drained?", but that question does not seem to have been addressed.

Using the GC seems like a great time to trigger drainages, but the scenario that @cespare describes was the failure mode I imagined could happen. I think there's a worthwhile discussion to be had about "how much to drain".

@rasky

This comment has been minimized.

Copy link
Member

@rasky rasky commented Dec 1, 2017

I'll note my real world experience as well. We use go on embedded systems and we often have soft real-time constraints (aka, low latency processing). In a specific software we were working on, we tried using sync.Pool to reuse allocations for handling packets coming from a high freq UDP server. Unfortunately, whenever the GC was triggered, the full pool was exhausted and this caused a peak of high latency (due to the allocations) that went beyond the soft real-time constraints we had. We had to discard sync.Pool and use a hand-rolled memory pool implementation.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Dec 1, 2017

@dsnet It sounds like your proposal is:

  1. Add a new field count to sync.Pool
  2. At each call to the Get method, atomically increment the count field.
  3. In poolCleanup
    • keep count elements distributed across the p's
    • if some elements were kept then keep the pool in allPools
    • set count to 0

Does that sound right? That seems reasonable to me. It basically delays the release of pool elements for one GC cycle.

It won't help @cespare 's case. It may help @rasky 's case. Of course, sync.Pool can't be a general mechanism that addresses all memory cache needs.

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Dec 1, 2017

The algorithm to determine how much to drain has some parallels with cache eviction algorithms,
for which most rely on the principal that the past is a good predictor of the future.
In the same way, there are several pieces of information that can be used as a heuristic to develop
a policy for how much of the pool to drain. Ultimately, it comes down to a trade-off between CPU time vs. memory space.

There are two extremes:

  • We could drain nothing, in which case the CPU time may be great never needing to allocate, but is obviously costly in terms of memory.
    A pool of this type can trivially be implemented as a slice acting as a free-list.
  • We could drain everything, in which case we prioritize saving memory space, but may be costly in CPU time since the items cleared may need to be allocated again shortly. This is the current policy of sync.Pool.

My assertion is that neither of these extremes are ideal and well-suited for most use-cases.

The algorithm I proposed (which you correctly summarize) is a simple algorithm that tries to find a better middle-ground between those two extremes. It certainly not a perfect algorithm as it completely ignores Pool.Get statistics from farther than one GC ago. Thus, it suffers when the workload on the server is highly non-uniform. An extension to the algorithm may perform a weighted average over N GC cycles. As long it can be proven that the pool is eventually drained given no Pool.Get traffic, then we know it doesn't leak memory.

More complicated policies can take into account information from the GC like what proportion of the heap the pool is occupying.
However, those algorithms can get very complicated quickly and are much harder to prove that they are correct.

In terms of whether it helps @cespare's or @rasky's use-cases, I believe my simple algorithm helps both.
The degree to how much it helps really depends on the exact work-load.

For @cespare's example with the gzip.Writer Pool, there were ~2000 calls to Pool.Get between each GC (assuming ~4s between GC and ~500/s Pool.Get calls), this means that when the GC occurs, it will preserve at least up to 2000 items in the Pool when drain is called.
Of course, there are probably far less than 2000 items in the pool since some are in active use and the computation of 2000 Pool.Get calls doesn't track how many intervening Pool.Put calls were made. The important property is that it is a useful heuristic of the demand for allocations from that Pool and should (hopefully) increase the hit-rate from 50% to something much higher.

@jimmyfrasche

This comment has been minimized.

Copy link
Member

@jimmyfrasche jimmyfrasche commented Dec 2, 2017

A simpler heuristic would be to only drain some fraction of the pool at each GC.

Doing both would be good, too. It would be less likely to take too much or leave too little.

@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Dec 2, 2017

I believe that @dsnet's suggestion would work for my use case. The gzip.Writer pool that I described never gets very large (I don't have an exact number, but it's always <100 elements) so it seems like @dsnet's proposal would result in the pool never being cleared.

I don't quite see why using the number of Gets as the limit (without regard to Puts) is a good heuristic, though. For instance, in the scenario I described, if the pool grew abnormally large during one GC cycle, it would be trimmed down to 2000 elements and then stay at 2000, even though there are normally only <100 live elements.

ISTM that a better heuristic would be to use the max live set size during the cycle as the limit. This is somewhat more bookkeeping, though. (The straightforward way I can think of doing it would be atomic increment/decrement to track the size plus a CAS loop to track the max size.)

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Dec 4, 2017

I was hoping using Gets alone would be a good approximation of "max live set", but your example provides an case where that approximation falls short.

ISTM that a better heuristic would be to use the max live set size during the cycle as the limit. This is somewhat more bookkeeping, though. (The straightforward way I can think of doing it would be atomic increment/decrement to track the size plus a CAS loop to track the max size.)

This could probably be simplified to just a atomic store. It's okay if it is slightly racy.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Dec 20, 2017

I think only counting the number of Gets can lead to bad behavior. For example, a sync.Pool could have large number of objects in it, but the only access right now is alternating between Get and Put rapidly. In this case, the Pool only needs to retain one object, but the count of Get calls is a function of the call rate and GC frequency, neither of which seem like they should matter.

Along the lines of what @cespare was suggesting, I think what we want to track is actually a low watermark on the occupancy of the Pool between GCs. Then we would free that many objects from the Pool, since that's how many we didn't need. This is more like the "min dead set" than the "max live set". :) I think tracking how many to free rather than how many to retain is simpler because some of the "live" objects that logically belong to the Pool may not actually be in the Pool when we clean it up.

I'm not sure if we want to actually track this count and free that many objects or create a two-tiered victim cache where:

  1. Get draws from tier 1 unless it's empty, in which case it draws from tier 2.
  2. Put always puts to tier 1.
  3. Cleanup drops tier 2 and demotes tier 1 to tier 2.

The victim cache approach keeps cleanup O(1), whereas I think the counting approach would have to be O(n). OTOH, the counting approach is more flexible. For example, we could smooth the count across multiple GC cycles to avoid transient problems. I'm not sure this flexibility is really necessary, though.

Either way, the hot path currently doesn't use any atomic operations or cause any cache contention, and it would be nice to keep it that way. That suggests we should shard whatever the solution is just like the rest of the Pool is sharded.

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Dec 20, 2017

See #23199 of an example where this optimization can cause indefinite pinning of large buffers. I argue the usage pattern described in #23199 is actually incorrect. In some recent work, I used an array of sync.Pools to bucketize each item by capacity so that each Pool contained elements of the approximate same memory cost.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Dec 22, 2017

FWIW, I took a stab at implementing the victim cache approach and got completely tangled up in the carefully constructed races surrounding poolCleanup. I'll give it another go after the holidays, and will probably try to fix the bad (O(n)) STW behavior of poolCleanup while I'm at it.

@rasky

This comment has been minimized.

Copy link
Member

@rasky rasky commented Feb 7, 2018

@aclements are you planning to tackle this in 1.11? I'm hitting this as well, and it's quite hard to write a correct lockless pool implementation.

@aclements

This comment has been minimized.

Copy link
Member

@aclements aclements commented Feb 8, 2018

I'm hoping to tackle this for 1.11, but we'll see what happens. As you point out, it's not an easy problem. :) I think I have to largely rewrite sync.Pool.

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Feb 8, 2018

Should we remove the NeedsDecision label? It seems that we're in agreement that the default behavior is not desired.

@ianlancetaylor

This comment has been minimized.

Copy link
Contributor

@ianlancetaylor ianlancetaylor commented Feb 9, 2018

As far as I know we don't actually know what we want to do, so I think this does still need a decision.

@jimmyfrasche

This comment has been minimized.

Copy link
Member

@jimmyfrasche jimmyfrasche commented Feb 9, 2018

This seems like a place where some simulations would provide a good idea of what sort of improvements are worth pursuing.

@cespare

This comment has been minimized.

Copy link
Contributor

@cespare cespare commented Mar 8, 2018

@jimmyfrasche yes, good idea. I did just that: https://github.com/cespare/misc/blob/master/drainalgo/drainalgo.go.

Doing this exercise made me realize that in this discussion we (or at least I) have been discounting an important, observable characteristic of the current implementation: the per-P private item.

By "observable" I mean what a user can see in terms of the pool hit/miss rate.

The sync.Pool implementation is quite sophisticated, but in terms of this user-observable behavior, it can be thought of as consisting of:

  1. A single cached item (private) per P.
  2. A shared, unbounded cache of items that may be drawn from if there is no private item available.

All of these are cleared on each GC.

(Note: The per-P item is an important fast-path optimization in the high-contention case; my analysis below is concerned with the cases where the hit rate is low -- generally low-contention situations.)

In my test program, I implemented several different pool strategies:

  • sync.Pool: sync.Pool.
  • simple: a simplified simulation of sync.Pool which has a single shared pool that's cleared on each GC, but no per-P private item.
  • maxlive: the "max live set" strategy I described in #22950 (comment).
  • mindead: the "min dead set" strategy Austin described in #22950 (comment).

We'll consider a really bad scenario for the current implementation:

  • Frequent GCs (every 100ms)
  • Relatively infrequent Gets (approx. every 10ms) -> small live set

On my 4-core laptop, this does poorly:

4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype sync.Pool
...
  60s: 5706 gets (95.1/sec) / 66.70% hit rate / 3.01 avg pool size / 599 GCs (9.983/sec)

But on a 72-core server, it is much worse, because of the per-P caches:

72-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype sync.Pool
...
  60s: 5863 gets (97.7/sec) / 12.81% hit rate / 8.39 avg pool size / 599 GCs (9.983/sec)

The simplified simulation without the per-P items does better in this case, and (as expected) there's no difference between 4 cores and 72 cores:

4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype simple
...
  60s: 5797 gets (96.6/sec) / 85.01% hit rate / 1.45 avg pool size / 599 GCs (9.983/sec)
72-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype simple
...
  60s: 5862 gets (97.7/sec) / 84.68% hit rate / 1.49 avg pool size / 599 GCs (9.983/sec)

The maxlive and mindead algorithms perform much better yet, yielding 97% hit rates even in this extreme scenario while only keeping a tiny number of items in the pool:

4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype maxlive
...
  60s: 5803 gets (96.7/sec) / 97.29% hit rate / 1.76 avg pool size / 599 GCs (9.983/sec)
4-cores$ ./drainalgo -getinterval 10ms -gcinterval 100ms -pooltype mindead
...
  60s: 5796 gets (96.6/sec) / 97.19% hit rate / 1.72 avg pool size / 599 GCs (9.983/sec)

In all the scenarios I tested, maxlive and mindead gave similar hit rates and kept a similar number of items in the pool. I think this makes sense because, intuitively, they're equivalent strategies.


My conclusion is that the low hit rate in certain scenarios with the current sync.Pool implementation is partially explained by the clear-the-pool-each-GC strategy that it employs and is also partially explained by the existence of the per-P private single-item cache, and that the latter cause has a more pronounced effect with more CPU cores.

Therefore, if we implement a better high-level strategy, such as mindead, but we keep the per-P items, it may only amount to a partial fix for these scenarios.

That could still be an overall improvement so it may be worth doing (and, separately, the O(n) cleanup would be great to fix).

This might be too complex, but perhaps it would be possible to have two modes of operation (a la sync.Mutex). It could start with a simpler implementation suitable for low-contention cases and switch to an optimized implementation with per-P items when it detects high contention.

@bcmills

This comment has been minimized.

Copy link
Member

@bcmills bcmills commented Mar 8, 2018

It could start with a simpler implementation suitable for low-contention cases and switch to an optimized implementation with per-P items when it detects high contention.

If the per-P cache is significant but contention is low, I think it's worth reconsidering whether that use-case is one that sync.Pool should address at all.

I generally assume that sync.Pool is only for cases that are known to be high-contention, since the remaining cases are already well handled by other means (e.g. a mutex-guarded stack) for which it is much easier to reason about the interactions.

The other thing you potentially lose if you drop per-P caches is L2 cache locality, if the pool is accessed frequently enough to reside in L2 cache. (Moving an object from one P to another potentially invalidates all of the cache lines containing that object, at a cost of ~40ns per line depending on the CPU and access pattern.)

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2018

Change https://golang.org/cl/100036 mentions this issue: sync: Use a single global pool instead of per-P shared ones

@erikdubbelboer

This comment has been minimized.

Copy link
Contributor

@erikdubbelboer erikdubbelboer commented Aug 31, 2018

Is there any update on this? It looked very promising.

CAFxX added a commit to CAFxX/go that referenced this issue Oct 19, 2018
Rework how sync.Pool is structured internally: instead of per-P private and shared
pools, have a per-P private pool and a global shared pool. This change makes Get()
never more than O(1), whereas previously it was O(n), with n the number of Ps.

The per-P private pool is now composed of one fixed and one additional shard (each
shard contains 32 objects). When the additional shard is full, it gets moved to
the global pool. When the additional shard is empty, a full shard is moved from
the global pool to the private pool. Moving shards is extremely cheap because it
is effectively a single pointer-sized write.

To control access to the shared pool a non-blocking mutex is used (trylock). Because
the private pools are much bigger than previously, accesses to the global pools are
less frequent (at least in non-pathological cases, but even in pathological cases
like the one in the PoolUnderflowUnbalanced benchmark this approach is faster).

The non-blocking mutex approach is faster but can lead to increased memory usage
since contention on the global pools can cause some Get() calls to return nil even
if objects exist in the global pools. Similarly, it can cause some Put() calls to
fail to add the object to the pool.

Another source of increased memory consumption is the increase in size of the per-P
private pool, that goes from 1 to 32 elements. The shardSize (32) is somewhat
arbitrary and can be further tweaked to trade off memory usage for CPU usage.

To minimize shard allocations a second global pool (globalFree) is used as a
freelist for empty shards. Both global pools are implemented as a simple linked
list that is used as a stack.

A number of potential followup tweaks can be done:
- Tweak the shardSize, as discussed above.
- Decouple the size of the per-P fixed shard from the shardSize. This can help
  reduce memory usage (since the contents of the fixed shard are never shared with
  the other Ps) while maintaining most of the benefits.
- Make the shardSize dynamic: during GC increase it if contention is detected on
  globalLock, decrease it otherwise. This should help keep contention under control
  on machines with many cores.
- Investigate whether it makes sense to handle NUMA topologies (e.g. have per-NUMA
  node global pools, or a per-NUMA node shared pool before the global one).
- Get rid of globalLock, and use atomic writes/CAS when operating on the global and
  globalFree linked lists. I attempted this but it seemed to complicate the code
  excessively.
- Move the global pool part of Get/Put into getGlobal/putGlobal to improve
  readability.

On my MacBook Pro this CL has the following effects:

name                       old time/op  new time/op  delta
Pool-4                     14.6ns ± 4%   9.0ns ± 2%  -38.06%  (p=0.000 n=18+19)
PoolOverflow-4             1.87µs ± 2%  1.06µs ± 1%  -43.56%  (p=0.000 n=18+20)
PoolUnderflowUnbalanced-4   151ns ± 6%    39ns ± 1%  -73.99%  (p=0.000 n=20+20)
PoolOverflowUnbalanced-4   64.9ns ± 8%  17.5ns ±10%  -73.12%  (p=0.000 n=20+19)

Benchmark results on machines with higher core counts would be very welcome, as
would be benchmarks on real-life workloads instead of the microbenchmarks above -
especially to measure the impact on memory usage.

This CL supersedes CL 95579 and is a (likely better) alternative to CL 49110.

Potentially related to issue golang#22950.

Change-Id: Ic16e2954a4302385fbc6f165d468600adc608069
@djadala

This comment has been minimized.

Copy link
Contributor

@djadala djadala commented Dec 20, 2018

I wonder if sync.Pool can be leaved as is, and separate type be defined that is not cleared on every GC ?

@Gnukos

This comment has been minimized.

Copy link

@Gnukos Gnukos commented Feb 12, 2019

Maybe we should add a public "Size" field in sync.Pool. During the GC we keep "Size" elements in it. I do not think this can break any actual implementation and it provides a way to reduce allocations.

CAFxX added a commit to CAFxX/go that referenced this issue Feb 14, 2019
Specifically, we clear only half of the poolLocals. To do this we
also have to switch from a global array of Pools to a linked list,
so that we can retain Pools in use and drop Pools that are empty
without allocating.

This means that, for a Pool that suddenly stops being used and gets
dropped:
- during the first GC: half of the poolLocals are cleared
- during the second GC: the second half of the poolLocals are cleared
- during the third GC: the Pool itself is dropped from allPools

Fixes golang#22950

Change-Id: I993666496ca529d301ab6faa94898a45615bb92a
CAFxX added a commit to CAFxX/go that referenced this issue Feb 14, 2019
Specifically, we clear only half of the poolLocals. To do this we
also have to switch from a global array of Pools to a linked list,
so that we can retain Pools in use and drop Pools that are empty
without allocating.

This means that, for a Pool that suddenly stops being used and gets
dropped:
- during the first GC: half of the poolLocals are cleared
- during the second GC: the second half of the poolLocals are cleared
- during the third GC: the Pool itself is dropped from allPools

This simplified approach is chosen as this allows to not have to worry
about resizing the shared arrays during clearPools and it does not add
any synchronization (or atomic operations) during Put/Get.

Fixes golang#22950
CAFxX added a commit to CAFxX/go that referenced this issue Feb 14, 2019
Specifically, we clear only half of the poolLocals. To do this we
also have to switch from a global array of Pools to a linked list,
so that we can retain Pools in use and drop Pools that are empty
without allocating.

This means that, for a Pool that suddenly stops being used and gets
dropped:
- during the first GC: half of the poolLocals are cleared
- during the second GC: the second half of the poolLocals are cleared
- during the third GC: the Pool itself is dropped from allPools

This simplified approach is chosen as this allows to not have to worry
about resizing the shared arrays during clearPools and it does not add
any synchronization (or atomic operations) during Put/Get.

Fixes golang#22950
CAFxX added a commit to CAFxX/go that referenced this issue Feb 14, 2019
Specifically, we clear only half of the poolLocals. To do this we
also have to switch from a global array of Pools to a linked list,
so that we can retain Pools in use and drop Pools that are empty
without allocating.

This means that, for a Pool that suddenly stops being used and gets
dropped:
- during the first GC: half of the poolLocals are cleared
- during the second GC: the second half of the poolLocals are cleared
- during the third GC: the Pool itself is dropped from allPools

This simplified approach is chosen as this allows to not have to worry
about resizing the shared arrays during clearPools and it does not add
any synchronization (or atomic operations) during Put/Get.

Fixes golang#22950
@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Feb 15, 2019

Change https://golang.org/cl/162919 mentions this issue: sync: do not clear sync.Pools completely every GC

@djadala

This comment has been minimized.

Copy link
Contributor

@djadala djadala commented Feb 15, 2019

I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size

@dsnet

This comment has been minimized.

Copy link
Member Author

@dsnet dsnet commented Feb 15, 2019

I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size

That has always been true. It's not well documented today, but having unbounded sized items in a sync.Pool is problematic. See #23199.

@CAFxX

This comment has been minimized.

Copy link
Contributor

@CAFxX CAFxX commented Feb 16, 2019

I dont like idea that now i have correct go program, and after next release it become incorrect, because i put in pool things with different memory size

I'm not sure I follow. If you are already using the Pool like that, how should my CL make it worse? Can you provide a benchmark/test case that shows the problem?

@djadala

This comment has been minimized.

Copy link
Contributor

@djadala djadala commented Feb 22, 2019

I dont have anything against your CL nor any benchmark, just dont like idea that pool is not cleaned completely at GC.
But as dsnet commented and pointed to #23199, it is already bad idea to put things with different memory sizes in pool.
I still think that leaving current sync.Pool as is, and adding new pool type that is not cleaned completely, is good idea.

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2019

Change https://golang.org/cl/166957 mentions this issue: sync: internal fixed size lock-free queue for sync.Pool

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2019

Change https://golang.org/cl/166958 mentions this issue: sync: internal dynamically sized lock-free queue for sync.Pool

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2019

Change https://golang.org/cl/166961 mentions this issue: sync: smooth out Pool behavior over GC with a victim cache

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2019

Change https://golang.org/cl/166959 mentions this issue: sync: add Pool benchmarks to stress STW and reuse

@gopherbot

This comment has been minimized.

Copy link

@gopherbot gopherbot commented Mar 11, 2019

Change https://golang.org/cl/166960 mentions this issue: sync: use lock-free structure for Pool stealing

gopherbot pushed a commit that referenced this issue Apr 5, 2019
This is the first step toward fixing multiple issues with sync.Pool.
This adds a fixed size, lock-free, single-producer, multi-consumer
queue that will be used in the new Pool stealing implementation.

For #22950, #22331.

Change-Id: I50e85e3cb83a2ee71f611ada88e7f55996504bb5
Reviewed-on: https://go-review.googlesource.com/c/go/+/166957
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
gopherbot pushed a commit that referenced this issue Apr 5, 2019
This adds a dynamically sized, lock-free, single-producer,
multi-consumer queue that will be used in the new Pool stealing
implementation. It's built on top of the fixed-size queue added in the
previous CL.

For #22950, #22331.

Change-Id: Ifc0ca3895bec7e7f9289ba9fb7dd0332bf96ba5a
Reviewed-on: https://go-review.googlesource.com/c/go/+/166958
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
gopherbot pushed a commit that referenced this issue Apr 5, 2019
This adds two benchmarks that will highlight two problems in Pool that
we're about to address.

The first benchmark measures the impact of large Pools on GC STW time.
Currently, STW time is O(# of items in Pools), and this benchmark
demonstrates 70µs STW times.

The second benchmark measures the impact of fully clearing all Pools
on each GC. Typically this is a problem in heavily-loaded systems
because it causes a spike in allocation. This benchmark stresses this
by simulating an expensive "New" function, so the cost of creating new
objects is reflected in the ns/op of the benchmark.

For #22950, #22331.

Change-Id: I0c8853190d23144026fa11837b6bf42adc461722
Reviewed-on: https://go-review.googlesource.com/c/go/+/166959
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
gopherbot pushed a commit that referenced this issue Apr 5, 2019
Currently, Pool stores each per-P shard's overflow in a slice
protected by a Mutex. In order to store to the overflow or steal from
another shard, a P must lock that shard's Mutex. This allows for
simple synchronization between Put and Get, but has unfortunate
consequences for clearing pools.

Pools are cleared during STW sweep termination, and hence rely on
pinning a goroutine to its P to synchronize between Get/Put and
clearing. This makes the Get/Put fast path extremely fast because it
can rely on quiescence-style coordination, which doesn't even require
atomic writes, much less locking.

The catch is that a goroutine cannot acquire a Mutex while pinned to
its P (as this could deadlock). Hence, it must drop the pin on the
slow path. But this means the slow path is not synchronized with
clearing. As a result,

1) It's difficult to reason about races between clearing and the slow
path. Furthermore, this reasoning often depends on unspecified nuances
of where preemption points can occur.

2) Clearing must zero out the pointer to every object in every Pool to
prevent a concurrent slow path from causing all objects to be
retained. Since this happens during STW, this has an O(# objects in
Pools) effect on STW time.

3) We can't implement a victim cache without making clearing even
slower.

This CL solves these problems by replacing the locked overflow slice
with a lock-free structure. This allows Gets and Puts to be pinned the
whole time they're manipulating the shards slice (Pool.local), which
eliminates the races between Get/Put and clearing. This, in turn,
eliminates the need to zero all object pointers, reducing clearing to
O(# of Pools) during STW.

In addition to significantly reducing STW impact, this also happens to
speed up the Get/Put fast-path and the slow path. It somewhat
increases the cost of PoolExpensiveNew, but we'll fix that in the next
CL.

name                 old time/op     new time/op     delta
Pool-12                 3.00ns ± 0%     2.21ns ±36%  -26.32%  (p=0.000 n=18+19)
PoolOverflow-12          600ns ± 1%      587ns ± 1%   -2.21%  (p=0.000 n=16+18)
PoolSTW-12              71.0µs ± 2%      5.6µs ± 3%  -92.15%  (p=0.000 n=20+20)
PoolExpensiveNew-12     3.14ms ± 5%     3.69ms ± 7%  +17.67%  (p=0.000 n=19+20)

name                 old p50-ns/STW  new p50-ns/STW  delta
PoolSTW-12               70.7k ± 1%       5.5k ± 2%  -92.25%  (p=0.000 n=20+20)

name                 old p95-ns/STW  new p95-ns/STW  delta
PoolSTW-12               73.1k ± 2%       6.7k ± 4%  -90.86%  (p=0.000 n=18+19)

name                 old GCs/op      new GCs/op      delta
PoolExpensiveNew-12       0.38 ± 1%       0.39 ± 1%   +2.07%  (p=0.000 n=20+18)

name                 old New/op      new New/op      delta
PoolExpensiveNew-12       33.9 ± 6%       40.0 ± 6%  +17.97%  (p=0.000 n=19+20)

(https://perf.golang.org/search?q=upload:20190311.1)

Fixes #22331.
For #22950.

Change-Id: Ic5cd826e25e218f3f8256dbc4d22835c1fecb391
Reviewed-on: https://go-review.googlesource.com/c/go/+/166960
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: David Chase <drchase@google.com>
@gopherbot gopherbot closed this in 2dcbf8b Apr 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
You can’t perform that action at this time.