Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ensure SeedSequence 0-padding does not collide with spawn keys #16551

Merged
merged 2 commits into from Jun 10, 2020

Conversation

rkern
Copy link
Member

@rkern rkern commented Jun 9, 2020

Fixes #16539

The implicit 0-padding that is done to small entropy inputs to make them the
size of the internal pool conflicts with the spawn keys, which start with an
appended 0.

In order to maintain stream compatibility with unspawned SeedSequences, we
explicitly 0-pad short inputs out to the pool size only if the spawn key is
provided, and thus would trigger the bug. This should minimize the impact on
users that were not encountering the bug.

Fixes #16539

The implicit 0-padding that is done to small entropy inputs to make them the
size of the internal pool conflicts with the spawn keys, which start with an
appended 0.

In order to maintain stream compatibility with unspawned `SeedSequence`s, we
explicitly 0-pad short inputs out to the pool size only if the spawn key is
provided, and thus would trigger the bug. This should minimize the impact on
users that were not encountering the bug.
Copy link
Member

@seberg seberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. Do we have to limit the number of spawned children to something like 2**32 or maybe 2**64:

In [6]: SeedSequence(42, spawn_key=(2**32,)).generate_state(4)                                                          
Out[6]: array([3908812709, 3341582407, 3454793571,  679120907], dtype=uint32)

In [7]: SeedSequence(42, spawn_key=(0,1)).generate_state(4)                                                             
Out[7]: array([3908812709, 3341582407, 3454793571,  679120907], dtype=uint32)

This may be a bit hypothetical, but just noticed that it is possible...

# Ensure that large integers are inserted in little-endian fashion to avoid
# trailing 0s.
ss0 = SeedSequence(42)
ss1 = SeedSequence(42 << 32)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this test (also) spawn a single child? Otherwise, does it actually test the trailing 0s, since the length would differ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one isn't related to #16539, just explicitly testing that we did design the conversion of large integers to uint32 words correctly. There were two obvious ways to do it, and the other (perhaps more-obvious) way would have also caused a problem.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way we convert 42 << 32 into a uint32 array for consumption by SeedSequence converts it to [0, 42]. The other way to do it ("big-endian") would have converted it to [42, 0], which would mean that 42, 42 << 32, 42 << 64 and 42 << 96 all would generate the same states.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I relalize, I figured that in one case (if this was the "bad" version) this would be [42,] and in the other [42, 0] and I assume that this would still give a different result due to the hashing procedure being sensitive to trailing 0s as well.
But with the 0 appending from the spawn key, it becomes [42, ..., 0] in both cases, where ... is filled with 0s.
In any case, I don't mind, it is still a good test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that the concatenated words of entropy (entropy and spawn_key, as returned by get_assembled_entropy()) will still get 0-padded to pool_size if it's shorter. It just happens implicitly.

# Our pool size is bigger than our entropy, so just keep
# running the hash out.
mixer[i] = hashmix(0, hash_const)

The hashing algorithm is only sensitive to different number of trailing 0s for trailing 0s beyond the pool_size. This is testing for a potential design flaw (which I avoided) even without the spawn_key.

@seberg
Copy link
Member

seberg commented Jun 9, 2020

One thing is whether we should add release notes for this. Which I guess means in this case we have to edit the 1.19 release notes directly, not sure how to best do t hat.

@charris
Copy link
Member

charris commented Jun 9, 2020

If it has a release note it will be easy to pick up from doc/release/upcoming_changes/ in the 1.19.x branch.

@rkern
Copy link
Member Author

rkern commented Jun 9, 2020

Do we have to limit the number of spawned children to something like 2**32

Notionally, yes, we should prevent spawning if it would go past 2**32.

@seberg
Copy link
Member

seberg commented Jun 9, 2020

Yeah, its a different (and very corner case) issue in any case, thanks. I think we can just merge this, just would like to make you/Chuck a call on whether to put it into 1.19 and how the release note should be.

@charris
Copy link
Member

charris commented Jun 10, 2020

Still needs a release note. It can be removed from master after backporting. There will also be a release note for the default rng change, so this isn't unique. The PR could also be made against the 1.19 branch and forward ported.

@charris
Copy link
Member

charris commented Jun 10, 2020

That release note file will be overwritten, the real one is in the maintenance/1.19.x branch. I can work with this, but it will be easier to deal with if next time it is a file in doc/release/upcoming_changes, that way it will show up when backported instead of causing conflicts with the working file.

@charris
Copy link
Member

charris commented Jun 10, 2020

Thinking it over, the best bet is probably to make the default rng change against maintenance/1.19.x and then we can forward port it to master.

@rkern
Copy link
Member Author

rkern commented Jun 10, 2020

Are you referring to the changes in this PR?

@charris charris merged commit 073316d into master Jun 10, 2020
@charris
Copy link
Member

charris commented Jun 10, 2020

Thanks Robert.

@charris
Copy link
Member

charris commented Jun 10, 2020

@rkern I was thinking of the upcoming change of the default generator. But this was easy enough to backport, I just cherry-picked the first commit and cut and pasted the notes insert, so that works.

@rkern
Copy link
Member Author

rkern commented Jun 10, 2020

Thanks! Sorry for the faffing around.

@seberg seberg deleted the fix/seed-sequence-zeros branch June 10, 2020 19:31
@seberg seberg restored the fix/seed-sequence-zeros branch June 10, 2020 19:31
@seberg seberg deleted the fix/seed-sequence-zeros branch June 16, 2020 14:07
@charris charris removed the 09 - Backport-Candidate PRs tagged should be backported label Jun 20, 2020
cloudbopper added a commit to cloudbopper/anamod that referenced this pull request Oct 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected behaviour of spawned numpy.random.SeedSequence
3 participants