
Drop getClockSequence mutex in favour of atomics #83

Closed
wants to merge 1 commit into from

Conversation

@alcore commented Apr 3, 2020

This PR drops the mutex used by getClockSequence in favour of an atomic CAS and an atomic sequence increment, primarily motivated by performance.

It introduces a (documented) edge case: under extreme contention, a time progression can result in the first UUID in the new time interval not reusing the old clock sequence, instead ending up with an incremented sequence (as a safety measure to avoid duplicates). This changes the internal behaviour and may not be spec-compliant, depending on the interpretation. The safety net could be simplified if the generator were allowed to increment the clockSequence along with each time progression (which I refrained from doing), not just within the current (unchanged) time interval.
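The approach can be sketched as follows. This is a simplified stand-in, not the PR's actual code: the type, field, and method names here are hypothetical, and the real generator does more bookkeeping.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gen is a hypothetical, simplified stand-in for the generator.
type gen struct {
	lastTime      uint64 // accessed atomically; kept first for 64-bit alignment
	clockSequence uint32
}

// clockSequenceFor returns the timestamp and clock sequence to use for
// the next UUID. If the observed time does not advance past lastTime,
// the sequence is bumped atomically to avoid duplicates. A failed CAS
// means another goroutine concurrently advanced the time, so the
// sequence is bumped as a safety net (the documented edge case).
func (g *gen) clockSequenceFor(timeNow uint64) (uint64, uint32) {
	timeHi := atomic.LoadUint64(&g.lastTime)
	if timeNow <= timeHi {
		// Same (or regressed) time interval: increment the sequence.
		return timeHi, atomic.AddUint32(&g.clockSequence, 1)
	}
	if !atomic.CompareAndSwapUint64(&g.lastTime, timeHi, timeNow) {
		// Lost the race: another goroutine already published a new time.
		return timeNow, atomic.AddUint32(&g.clockSequence, 1)
	}
	// New time interval: reuse the current sequence.
	return timeNow, atomic.LoadUint32(&g.clockSequence)
}

func main() {
	g := &gen{}
	t1, s1 := g.clockSequenceFor(100)
	t2, s2 := g.clockSequenceFor(100) // same interval: sequence bumps
	fmt.Println(t1, s1, t2, s2)
}
```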

The gains on Go 1.14.1, Windows 10, i7 4770K @ 4.4GHz (windows/amd64) are as follows:

Sequential:

benchmark                              old ns/op     new ns/op     delta
BenchmarkGenerator/NewV1               36.4          28.0          -23.08%
BenchmarkGenerator/NewV2               45.1          34.8          -22.84%

Concurrent (8 threads):

benchmark                              old ns/op     new ns/op     delta
BenchmarkGenerator/NewV1Parallel-8     138           24.0          -82.61%
BenchmarkGenerator/NewV2Parallel-8     142           24.2          -82.96%

Since there is no concurrent benchmark in the test suite, I used the trivial...

b.Run("NewV1Parallel", func(b *testing.B) {
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			NewV1()
		}
	})
})

The PR includes a new test case for the V1 generator (and V2 by extension) which simulates a concurrent race of G=1024 goroutines attempting to generate N=256 UUIDs each, to tickle the race detector. Afterwards it checks for duplicates (i.e. primarily to cover the documented edge case). The test is deliberately slow (about 3s on my platform) to give the CPU and Go's scheduler some chances to shuffle things around in the execution flow.
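The shape of such a duplicate-hunting race test can be sketched as below. This is a scaled-down illustration, not the PR's test: a plain atomic counter stands in for the real NewV1 generator so the example stays self-contained.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// run spawns g goroutines that each generate n values concurrently
// (tickling the race detector when run with -race) and returns the
// number of duplicates observed afterwards. An atomic counter stands
// in for the real generator; the PR's test uses G=1024, N=256.
func run(g, n int) int {
	var ctr uint64
	out := make(chan uint64, g*n)

	var wg sync.WaitGroup
	for i := 0; i < g; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < n; j++ {
				out <- atomic.AddUint64(&ctr, 1)
			}
		}()
	}
	wg.Wait()
	close(out)

	// Check for duplicates after all goroutines are done.
	seen := make(map[uint64]struct{}, g*n)
	dups := 0
	for v := range out {
		if _, ok := seen[v]; ok {
			dups++
		}
		seen[v] = struct{}{}
	}
	return dups
}

func main() {
	fmt.Println(run(64, 128)) // 0
}
```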

Sidenote: This is the only test that actually utilizes the race detector, given that no other tests in the suite run concurrently to make use of it (while it's enabled for Travis builds). If there's interest, I may rectify this in a subsequent PR and race through all generators in a similar manner.

// Should increase clock sequence.
if timeNow <= g.lastTime {
	g.clockSequence++

timeHi := atomic.LoadUint64(&g.lastTime)
@niaow (Member) commented Apr 3, 2020

From the sync/atomic docs:

On ARM, x86-32, and 32-bit MIPS, it is the caller's responsibility to arrange for 64-bit alignment of 64-bit words accessed atomically. The first word in a variable or in an allocated struct, array, or slice can be relied upon to be 64-bit aligned.

@alcore (Author) commented Apr 3, 2020

It should be. sync.Once structs are 96 bits each, io.Reader as an interface is 2*word size, epochFunc and hwAddrFunc are word-sized func pointers, which puts lastTime at bit 320. Am I missing something?

Regardless of the above, I agree that the PR should at the very least mention this (in the struct declaration?) if it were to get merged. I'll wait for additional feedback before I commit further.

Member commented:

@alcore would it be fair to say one risk would be if the standard library changed the sync.Once struct? Would we be able to mitigate that risk by moving the lastTime field to the beginning of the struct definition?

@alcore (Author) commented May 5, 2020

The structure of sync.Once should be irrelevant to the alignment unless it suddenly starts containing a 128-bit value. But this concern then applies to just about any struct member. If Go ever ran on a platform with 128-bit memory addressing, the function pointers would become 128-bit sized and the compiler would be using 128-bit alignment for the struct - assuming current rules.

I think my previous comment might've been a bit misleading here. Alignment is inferred from the largest sized field of a struct (as per https://golang.org/ref/spec#Size_and_alignment_guarantees). In our case that's precisely the timestamp. Off the top of my head: if it were stored as a pointer directly (instead of being referenced for the CAS in getClockSequence), on 32-bit platforms it would be a 32-bit value and therefore drop the entire struct's alignment to 32 bits, since no other field would then be larger than a 32-bit word.

My previous statement made the mistake of counting the memory offset and potential padding (i.e. there is none) instead of alignment.

tl;dr: lastTime can be moved to the beginning without any adverse effect and fall under the guarantee that "The first word in (...) an allocated struct (...) can be relied upon to be 64-bit aligned", but as I understand https://golang.org/ref/spec#Size_and_alignment_guarantees, it is not necessary. It could even be moved to the very end - that would increase the size of the generator (padding after hardwareAddr), but as per Go's spec, the alignment would still be that of the largest member. I believe the comment in the sync/atomic docs is merely a reminder for people who juggle around with unsafe, raw memory pointers, since the 64-bit atomics are implemented as 32-bit words on 32-bit ARM IIRC.

Member commented:

but as per Go's spec, the alignment would still be that of the largest member

This is not correct. The alignment of the beginning of the struct is equal to the alignment of the field with the most alignment (which is often not the largest field). There are no types on 32-bit platforms with 64-bit alignment. This creates a requirement that the sum of the sizes of the fields before it is a multiple of 8 bytes and that this struct is in its own allocation.

So we really do need to move lastTime to the beginning of the struct.

@alcore (Author) commented:

An issue with my phrasing: the suggestion that it can be moved to the end followed from a count of the preceding bits - it'd end up at bit 384 on 32-bit platforms (hardwareAddr being padded), which puts it on a 64-bit boundary again.

But yes, the fact this conversation is even taking place is proof enough that it should be moved to the 0-offset, along with the quote from the sync/atomic docs as a reminder why it's there :-) Makes it oh so much simpler.

Sidenote: Storing it as a pointer would also alleviate the requirements altogether, wouldn't it? It will get its own alloc once it escapes to the heap during struct construction, and by extension Go will allocate the value on a 64-bit boundary, while the pointer in the struct will take the alignment of the struct. I generally do just that, but didn't want to introduce a pointer to chase in this case.

@niaow (Member) commented Apr 3, 2020

I am not personally in favor of merging this PR, as it makes the code a lot more complicated, and doesn't seem to have a huge impact. The lock is unlikely to be problematically contended unless the user code does nothing but generate UUIDs.

@theckman (Member) commented May 5, 2020

@alcore thank you for this contribution. I'm interested in merging this in, and want to understand the alignment risk before merging. I'm also aligned with Jaden on the readability concern.

Edit: I do wonder if you could speak more about the system or environment that saw benefit from this change.

@alcore (Author) commented May 5, 2020

@theckman I hope my comment in #83 (comment) clears up the memory alignment concern.

As for readability - I share that concern :-) I could try to reduce the verbiage of the comments a bit, but as for the code itself, I can't currently think of anything that would simplify it. The goto could be dropped in favour of a for statement, the only difference then being style (I refrained since it introduces an additional level of indentation, but then again it might be easier to read).

The setup where this mattered is (was) precisely the one @jaddr2line mentioned as unlikely: 8 cores running various log and event aggregation within a data center (several million/sec). The ID generation was deferred from individual services to the aggregator so that a single ID generator could by default ensure no duplicates get generated, without having to check for collisions further down the line (some objects would hit databases). In this case, using one generator per thread of course alleviates the contention problem, but it also makes the generator's mutex obsolete and doesn't solve the issue of potential duplicates.

In the end I simply rolled my own, but I thought this library could benefit from this change, since the ~20% speedup is - IMHO - relevant not just to cases with such contention.

Edit: The code can be simplified, as I mentioned in the PR comment. Essentially if increasing the clock sequence is not gated behind the...

if timeNow <= timeHi {...

... condition but applies to every ID, then getting and setting the clock sequence becomes a single atomic.AddUint32(&g.clockSequence, 1), both clockSequence loads can be dropped, and the CAS success check becomes obsolete as well, which in turn means the loop can be dropped. Since the UUID spec does not really handle clock drifts well, the CAS might then just as well be a plain store.
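A sketch of that simplified path, under the assumption that the sequence may be bumped on every ID. The names here are hypothetical, not the library's code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// gen is a hypothetical, simplified generator state.
type gen struct {
	lastTime      uint64 // kept first for 64-bit alignment on 32-bit platforms
	clockSequence uint32
}

// next bumps the clock sequence unconditionally and publishes the new
// time with a plain store: no loads, no CAS check, no retry loop.
func (g *gen) next(timeNow uint64) (uint64, uint32) {
	seq := atomic.AddUint32(&g.clockSequence, 1)
	atomic.StoreUint64(&g.lastTime, timeNow)
	// The clock sequence is a 14-bit field in RFC 4122, so it wraps.
	return timeNow, seq & 0x3fff
}

func main() {
	g := &gen{}
	_, s1 := g.next(1)
	_, s2 := g.next(2)
	fmt.Println(s1, s2) // 1 2
}
```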

But this then turns into a case of undefined spec, which causes issues like uuid-rs/uuid#106 (comment). The author of the issue interprets the spec as stating that the sequence only gets increased under the conditions (time drift, partition change etc.) the spec mentions. Meanwhile I read the words...

the clock sequence is used to help avoid duplicates that could arise when (...) If (...) then the clock sequence has to be changed.

... as specifying when it has to be changed, but not that it can't be increased all the time. It doesn't really affect the safety, as the sequence will just wrap around anyways. But it would be a breaking change if anyone relies on the assumption that it only increases for IDs in the same time frame (I can't think of a use case for this assumption).

@j-mie commented Nov 4, 2020

@alcore Out of curiosity is there any reason why the logic has changed such that you no longer set the last generation time when the time has gone backwards? https://github.com/gofrs/uuid/pull/83/files#diff-0d0495ad16c6f603876bbf484c43b549f53d0b33d4cd74c908b0ee95a94369eaR238-R240 vs https://github.com/gofrs/uuid/blob/master/generator.go#L237-L242

@cameracker (Collaborator) commented:

Thanks for the contribution here. For the most part, this review has sat idle for a while due to some strong objections about increase of code complexity and potential byte alignment concerns. Going to close this. If anybody has any objections, we can continue to talk about the changes here, and the PR can be re-opened.

@cameracker cameracker closed this Sep 7, 2021