Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add PCG64DXSM BitGenerator #18906

Merged
merged 12 commits into from May 5, 2021
Merged

ENH: Add PCG64DXSM BitGenerator #18906

merged 12 commits into from May 5, 2021

Conversation

rkern
Copy link
Member

@rkern rkern commented May 4, 2021

Closes #16313

This PR adds the PCG64DXSM BitGenerator.

I have a documentation page explaining the issue and when you should worry about it. It is my usual overcomplicated prose (with my penchant for parentheticals), but I did try to give clear practical advice upfront and shunt all of the technical explanation down to the bottom. I imagine it could use a few more editing passes, but probably nothing that would prevent a release. Inserting references to it in all of the relevant places was even less elegant, but it gets the job done. As in NEP 19, I name myself as the maker of mistakes, which I feel is appropriate, but I'm not wedded to it if it doesn't match the tone we want in the documentation.

I updated the performance benchmarking script that feeds the tables in the documentation, but I did not replace the tables as I don't have ready access to all of the platforms. @bashtage I think you ran those, am I right?

@charris
Copy link
Member

charris commented May 4, 2021

@rkern I fixed the linting. Could you add a release note?

`PCG64DXSM`: collisions are possible, but both streams have to simultaneously
be both "close" in the 128 bit state space *and* "close" in the 127-bit
increment space, so the negligible chance of colliding in the 128-bit internal
`SeedSequence` pool is more likely. The DXSM output function is more
Copy link
Member

@charris charris May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"pool is less likely"?

Copy link
Member Author

@rkern rkern May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No. It is more likely that you will observe a collision in the SeedSequence pool (creating completely identical seeds) than to see one of the poorly-correlated pairs of distinct streams.

Copy link
Member Author

@rkern rkern May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the verb to "would be" would help (and be more consistent with the other verbiage we have on the subject).

Copy link
Member

@charris charris May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the SeedSequence affect both increment and state, or just the state?

Copy link
Member Author

@rkern rkern May 4, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both increment and state. During the mega-thread, we decided that was the best design, especially once we got strong seeding with SeedSequence.

Copy link
Member

@charris charris May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both increment and state

Does this make the seed sequences differ between bit-generators or is the increment derived from the state?

Copy link
Member Author

@rkern rkern May 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me back up a bit to where I suspect the disconnect is. We'll see if I guessed right.

SeedSequence takes an arbitrary sequence of (arbitrary-sized) integers and hashes them down into an internal pool of bits. The default pool size is 128 bits. The BitGenerator then asks the SeedSequence for however many bits it needs to initialize itself. For MT19937, that's a plump 624 uint32s. For the PCG64s, that's the relatively-svelte 128-bit state and 127-bit increment (added together and rounded up to an even 256 bits total). The SeedSequence uses what amounts to a slow, carefully constructed PRNG to generate however many bits are requested. It's not just copying its internal pool of 128 bits into the BitGenerator state.

The claim I am making here is that the odds that two distinct SeedSequences generate a pair of distinct, bad PCG64DXSM instances is less than the odds that two distinct user inputs hashed down to the same 128-bit internal pool. And the odds of that 128-bit collision are ignorably tiny.

@charris
Copy link
Member

charris commented May 4, 2021

I like the technical explanation. It might be useful to others considering using a PCG64 generator, for instance the Go 2 developers.

charris
charris approved these changes May 5, 2021
@bashtage
Copy link
Contributor

bashtage commented May 5, 2021

Is this PCG64CMDXSM or PCG64DXSM? I always thought these were distinct variants of PCG64. If it is CM then I think the docstring for the class should mention the difference since both the step and the output function are changing.

@rkern
Copy link
Member Author

rkern commented May 5, 2021

@bashtage I added some more words to the appropriate part of the docstring. I'm not going to expand the name of the class, though. It's already a carcrash of random characters, and I don't want to make it any longer. :-)

@charris
Copy link
Member

charris commented May 5, 2021

I just ping @imneme to keep her updated.

@bashtage
Copy link
Contributor

bashtage commented May 5, 2021

LGTM. I'll redo timings after this is merged.

@charris charris merged commit 73a0892 into numpy:main May 5, 2021
36 checks passed
@charris
Copy link
Member

charris commented May 5, 2021

Thanks Robert.

@flyingmutant
Copy link

flyingmutant commented Jul 20, 2021

@rkern can you please explain how the CSV files with the test data were generated, are they for PCG64 CM DXSM or another variant? E.g. for 0 seed, shouldn't we start with:

0x0,
0x5238ea76d1f0df4a,
0x1a3c4747022e48a4,
0x340b0228e6afc056,
0x81bb52f8baaa203a,

This is what I get when using pcg_cm_ functions from pcg64.c, and my own implementation as well.

@bashtage
Copy link
Contributor

bashtage commented Jul 20, 2021

@flyingmutant Are you transforming the seed by SeedSequence or are you passing 0x0 to the pcg initialization routine?

@flyingmutant
Copy link

flyingmutant commented Jul 20, 2021

I am using the seed directly (like pcg64-test-data-gen.c does). I thought the CSV files were generated with a version of pcg64-test-data-gen.c as well.

@rkern
Copy link
Member Author

rkern commented Jul 20, 2021

Not anymore. We should probably delete that.

@bashtage
Copy link
Contributor

bashtage commented Jul 21, 2021

It comes from an initial version of randomstate that directly passed through the seed to PCG's own initialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The PCG implementation provided by Numpy has significant, dangerous self-correlation
4 participants