Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: randomgen #13163

Merged
merged 139 commits into from May 28, 2019

Conversation

@mattip
Copy link
Member

commented Mar 20, 2019

A start at mMerging bashtage/randomgen into numpy, as part of NEP 19.

The original repo was cloned, moved to a subdirectory, and then merged into numpy, as documented in _randomgen/README-git.md.
Then I moved the code into numpy/random and the docs into doc/source/random and doc/source/papers.

Still very much a work in progress.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 20, 2019

So far I have begun to extend the setup.py to build the modules, but our cythonize script is very primitive.

@mattip mattip referenced this pull request Mar 20, 2019

Open

ENH: tracking issue for merging randomgen into numpy #13164

11 of 16 tasks complete
@rgommers

This comment has been minimized.

Copy link
Member

commented Mar 20, 2019

License related question: if we list this under "bundled libraries" with its own license, does it mean that randomgen will continue to be developed standalone? I thought it was meant solely for inclusion in NumPy.

@rkern

This comment has been minimized.

Copy link
Member

commented Mar 20, 2019

It's intended for incorporation. randomgen as a separate package will be frozen after this gets merged.

@rgommers

This comment has been minimized.

Copy link
Member

commented Mar 20, 2019

It's intended for incorporation. randomgen as a separate package will be frozen after this gets merged.

Nice. If that NCSA license is absolutely necessary then I guess it's no problem, however for something developed for inclusion in NumPy and only being a part of NumPy going forward, just having it under the NumPy license would be preferred.

@mattip mattip force-pushed the mattip:randomgen branch from 1368fc9 to 0adab55 Mar 21, 2019

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

I had to move pcg32.pyx, pcg64.pyx, and the examples temporarily to randomgen_ignore.

The pcg cython files use compiler directives, which we will need to refactor since the source files generated will differ by platform, preventing distributing a single source-tarball.

The examples use from randomgen import ... or so, and our cythonize script runs from the directory containing the pyx file, so it cannot import that way.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

@bashtage, @rkern: I have imported randomgen in a way that preserves all the commits from that repo, assuming we wish to keep the commit history. Is that OK? Are there things we should git rebase to collapse into a single commit?

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2019

@mattip I would think that it should be squashed at least some eventually. When picking, I think it would be nice to keep at least one commit from randomgen contributors in, is possible.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

The current failures are due to cython using its own numpy.pxd rather than the one we ship along-side mtrand.pyx(since that is in a different subdir) which has the line cdef extern from "numpy/npy_no_deprecated_api.h": pass. Choices are:

  • work on getting #12284 merged, then update the PR to use it
  • copy the numpy.pxd from random/mtrand into random/randomgen
  • do the relevant parts of #12284 for randomgen to get this working without actually merging #12284
@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 21, 2019

I think a clear official #12284 would be the best path since it would resolve this ambiguity once and for all.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 21, 2019

in d079759 I took the third path, which it turns out is necessary anyway to merge this work with #12284. It was enough on my machine to quiet all the warnings and build the c-extension modules via cython.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2019

Can we add a FIXME here ...

@rkern please feel free to push to my branch, I would be happy for collaboration.

@charris

This comment has been minimized.

Copy link
Member

commented Mar 23, 2019

When picking, I think it would be nice to keep at least one commit from randomgen contributors in, is possible.

Could maybe add a Thanks file or similar listing all the contributors. Should be easy to generate with git.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2019

The code is sourced from places other than git. For instance, this file has contributors that do not appear in the git history, and has a different license than the rest of randomgen

@mattip mattip force-pushed the mattip:randomgen branch from d092529 to 407b96a Mar 23, 2019

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 23, 2019

Many (all) of the C source files for the BRNGs have no git history since they were made available from the authors directly.

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 23, 2019

@rkern 's idea is clearly the right way to do compat in PCG64. I once tried to do it but gave up since the upside for me was small in randomgen. I found it a bit tricky since Cython understands uint128 and so using a fake type wasn't so simple. At one point I thought some small abstraction beyond a simple #ifdef was needed before I stopped trying.

@mattip mattip force-pushed the mattip:randomgen branch from 407b96a to a3303a2 Mar 23, 2019

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 23, 2019

which platforms do not support __unit128?

Does the strategy of breaking the single 128bit int into two 64 bit integers with high and low support both big- and little- endian systems transparently?

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 23, 2019

Microsoft compilers (of course).

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 23, 2019

I looked through it and I believe that it is endian independent (and it passed on ARM, but I'm not sure whether this is BE or LE, and it might depend). @rkern wrote it, so he would know better.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 24, 2019

Tests that come with randomgen now pass.
I think next I should try to move randongen into the numpy.random namespace and drop mtrand. It seems

numpy.random.RandomState = randomgen.legacy.LegacyGenerator
numpy.random.mtrand = randomgen.legacy.LegacyGenerator()
for x in dir(mtrand):
    if x[0] != '_':
        locals()[x] = getattr(mtrand, x)

answers the demands of the NEP (selectively quoted below) except for the missing set_state and get_state. LegacyGenerator().__getstate__ returns a dictionary, where numpy.random.get_state returns a list. I could make that work by coercing the dict to a list and visa-versa for set_state.

Am I on the right track?

First, we will maintain API source compatibility just as we do with the rest of numpy. If we must make a breaking change, we will only do so with an appropriate deprecation period and warnings.
Second, breaking stream-compatibility in order to introduce new features or improve performance will be allowed with caution. ...
[The] legacy distributions class MUST be accessible under the name numpy.random.RandomState for backwards compatibility. All current ways of instantiating numpy.random.RandomState with a given state should instantiate the Mersenne Twister basic RNG with the same state. ... Instances of the legacy distributions class MUST respond True to isinstance(rg, numpy.random.RandomState) because there is current utility code that relies on that check. Similarly, old pickles of numpy.random.RandomState instances MUST unpickle correctly.
Specifically, the initial release of the new PRNG subsystem SHALL leave these (numpy.random.*) convenience functions as aliases to the methods on a global RandomState that is initialized with a Mersenne Twister basic RNG object. A call to numpy.random.seed() will be forwarded to that basic RNG object. In addition, the global RandomState instance MUST be accessible in this initial release by the name numpy.random.mtrand.

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 24, 2019

I used a property for state which seems more natural than a java-ish set/get paradigm. I also with with a dictionary which is much easier to handle in case of breaks since nothing in the dictionary is really a committment as long as future versions use different keys.

One or both of these should possibly be overruled.

Does this the method you have above produce an identical dir as current random? I can't remember what I did with ancient functions names like ranf that are only aliases

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 24, 2019

The aliases are set in __init__.py. The only missing attributes from dir(randomgen.legacy.LegacyGenerator()) are set_state and get_state.

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Mar 24, 2019

One further issue about using the same BRNG between legacy and a modern generator -- the legacy state has an additional value containing the next gaussian, since it uses a polar transformation. This might have some implications for setting the state if the basic RNG is shared.

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 25, 2019

I am getting an non-terminating loop in random_zipf from distributions.c for certain conditions in tests

I modified the code to print the values in each loop, details are below. I am not sure of how to provide a better stopping condition when a == 2.0, X == -sys.maxint - 1, V is very large.

code ``` int64_t random_zipf(brng_t *brng_state, double a) { double T, U, V; int64_t X; double am1, b;

am1 = a - 1.0;
b = pow(2.0, am1);
do {
U = 1.0 - next_double(brng_state);
V = next_double(brng_state);
X = (int64_t)floor(pow(U, -1.0 / am1));
/* The real result may be above what can be represented in a int64.
* It will get casted to -sys.maxint-1. Since this is
* a straightforward rejection algorithm, we can just reject this value
* in the rejection condition below. This function then models a Zipf
* distribution truncated to sys.maxint.
*/
T = pow(1.0 + 1.0 / X, am1);
fprintf(stdout, "am1 %10.3g T %10.3g U %10.3g V %10.3g X %ld\n", T, U, V, X);
} while (((V * X * (T - 1.0) / (b - 1.0)) > (T / b)) || X < 1);
return X;
}

which prints

am1 1 T 0.188 U 0.518 V 1.8e+308 X -9223372036854775808
am1 1 T 0.524 U 0.319 V 1.8e+308 X -9223372036854775808
am1 1 T 0.638 U 0.816 V 1.8e+308 X -9223372036854775808
am1 1 T 0.867 U 0.56 V 1.8e+308 X -9223372036854775808
am1 1 T 0.841 U 0.822 V 1.8e+308 X -9223372036854775808
am1 1 T 0.525 U 0.189 V 1.8e+308 X -9223372036854775808
am1 1 T 0.735 U 0.15 V 1.8e+308 X -9223372036854775808
am1 1 T 0.839 U 0.546 V 1.8e+308 X -9223372036854775808
am1 1 T 0.582 U 0.753 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0421 U 0.854 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0792 U 0.67 V 1.8e+308 X -9223372036854775808
am1 1 T 0.31 U 0.436 V 1.8e+308 X -9223372036854775808
am1 1 T 0.951 U 0.889 V 1.8e+308 X -9223372036854775808
am1 1 T 0.862 U 0.641 V 1.8e+308 X -9223372036854775808
am1 1 T 0.967 U 0.285 V 1.8e+308 X -9223372036854775808
am1 1 T 0.0981 U 0.865 V 1.8e+308 X -9223372036854775808
am1 1 T 0.444 U 0.884 V 1.8e+308 X -9223372036854775808
am1 1 T 0.73 U 0.122 V 1.8e+308 X -9223372036854775808
...

@mattip

This comment has been minimized.

Copy link
Member Author

commented Mar 25, 2019

It helps to debug the debugging code. The printf statement was off-by-one, fixing it shows a==NAN.

@rkern

This comment has been minimized.

Copy link
Member

commented May 27, 2019

Thank you, @imneme for your comments! As this PR is getting big enough to make Github groan, we're going to let it merge without making the final decision about the default. I hate to forum-shift newcomers, especially after they've contributed a lot to the conversation, but I've made a new issue (#13635) for the discussion about that decision. Hopefully, it will be easier on our browsers over there if there is much more to discuss.

One thing that was less cool to me was naming, such as _pcg_mult64. I don't know about NumPy coding standards, but a general rule in C, C++ and POSIX that names with leading underscores are reserved by the system and are best avoided.

Ah, noted. Using leading underscores for "private" functions is a Python convention, and I am less familiar with the C/C++/POSIX conventions (at least, at that level of detail). We should fix that.

@imneme

This comment has been minimized.

Copy link

commented May 27, 2019

Thanks @rkern. Seems like a wise course.

The thing with the underscores is both funny and sad, and sadly common. Just so you're 100% clear, code that uses leading underscores is breaking both C and POSIX rules (but getting way with it today, but the code could legally be summarily rejected by any conforming compiler tomorrow). Using leading underscores for something “private” (perhaps in the belief that these names are less likely to be used by other people) is a bit like the person who smuggles a bomb onto every plane they travel on because the chance of there being two bombs on a plane is really low.

@seberg

This comment has been minimized.

Copy link
Member

commented May 27, 2019

@mattip there seems to be some real error in the tests while compiling.

I cannot say I reviewed anything here really, but I think Matti, Robert, and Kevin did look through most carefully. I will plan to merge it maybe late tonight or tomorrow morning, assuming the tests are passing and no-one speaks up.

bashtage added some commits May 23, 2019

DOC: Add __all__ and document lock
Add docstring for lock
Use __all__ to discuorage unless attributes from appearing
MAINT: Remove remnants of bit generators
Remove traces of the three removed bit generators
Add lock to Cython examples
BLD: Improve setup
Attempt to avoid defining variables that are incorrect for some platforms
STY: Clean up code
Pep8 fixes
Remove unused imports
Fix name error
PERF: Reorder header for philox (#34)
* PERF: Reorder header for philox

Reorder header so that support uint128 is always used if avilable, irrespective of platform

@mattip mattip force-pushed the mattip:randomgen branch from dbc1374 to 9c261e6 May 27, 2019

@mattip

This comment has been minimized.

Copy link
Member Author

commented May 27, 2019

hmm, getting a FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. It seems there is bitrot when merging with master

@mattip mattip force-pushed the mattip:randomgen branch from 6152173 to 70d6293 May 27, 2019

@mattip

This comment has been minimized.

Copy link
Member Author

commented May 28, 2019

tests are passing

@seberg

This comment has been minimized.

Copy link
Member

commented May 28, 2019

I would like to thank everyone for making this milestone possible. Especially Robert and Kevin who made all of randomgen possible. I am also very happy and impressed to see so many specialists weighing in!
Probably, there will be a few follow-up issues, but I trust that this is very high quality work and we have discussed the exact API in depth so that I believe it is hashed out very well.

I am looking forward to using the new API, thanks all!

@seberg seberg merged commit 22239d1 into numpy:master May 28, 2019

16 checks passed

LGTM analysis: C/C++ 1 new alert and 2 fixed alerts
Details
LGTM analysis: JavaScript No code changes detected
Details
LGTM analysis: Python 1 fixed alert
Details
Shippable Run 4212 status is SUCCESS.
Details
azure-pipeline numpy.numpy Build #20190527.20 succeeded
Details
azure-pipeline numpy.numpy (Linux_PyPy3) Linux_PyPy3 succeeded
Details
azure-pipeline numpy.numpy (Linux_Python_36_32bit_full_with_asserts) Linux_Python_36_32bit_full_with_asserts succeeded
Details
azure-pipeline numpy.numpy (Windows Python35-64bit-full) Windows Python35-64bit-full succeeded
Details
azure-pipeline numpy.numpy (Windows Python36-32bit-fast) Windows Python36-32bit-fast succeeded
Details
azure-pipeline numpy.numpy (Windows Python36-64bit-full) Windows Python36-64bit-full succeeded
Details
azure-pipeline numpy.numpy (Windows Python37-32bit-fast) Windows Python37-32bit-fast succeeded
Details
azure-pipeline numpy.numpy (Windows Python37-64bit-full) Windows Python37-64bit-full succeeded
Details
azure-pipeline numpy.numpy (macOS) macOS succeeded
Details
ci/circleci Your tests passed on CircleCI!
Details
codecov/project 85.63% (target 1%)
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@bashtage

This comment has been minimized.

Copy link
Contributor

commented May 28, 2019

Great. How long does it take for this to make it out to projects that test using prerelease builds of NumPy? As fun as the API discussion is about Generator, my main concern is verifying that I didn't accidentally break something in RandomState.

@seberg

This comment has been minimized.

Copy link
Member

commented May 28, 2019

Indeed, I do think we have a few projects testing against master so that should be fairly quick. The big test will be the rc releases probably.

@WarrenWeckesser

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

Would anyone object to a pull request that renames distributions-boxmuller.c in numpy/random/src/legacy/ to distributions-legacy.c? Or would that be needless churn? The file contains all the legacy distributions, not just the normal distribution.

When I first saw the file name, I didn't realize that it was the file I was looking for. It only took a quick look to figure it out, so this isn't a high priority, but I think a more accurate name would be helpful.

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

I switched in randomgen after @mattip started the process with legacy-distributions.c and .h

https://github.com/bashtage/randomgen/tree/master/randomgen/src/legacy

I also moved them to src/legacy which I'm not sure is totally necessary, so I'm +1 on the clarifications. IMO a git mv is minimal churn.

Perhaps a comment at the top would be helpful to explain that this is the file where hanged fucntions from distributions.c go to maintain the RandomState promise would be helpful too.

@WarrenWeckesser

This comment has been minimized.

Copy link
Contributor

commented Jun 9, 2019

OK, pull request is #13743

@mattip

This comment has been minimized.

Copy link
Member Author

commented Jun 10, 2019

hanged fucntions from distributions.c

@rkern could you rephrase that request in the PR? Do you mean frozen versions of the functions?

@WarrenWeckesser

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

@mattip, it was @bashtage who made that comment.

@bashtage

This comment has been minimized.

Copy link
Contributor

commented Jun 10, 2019

Do you mean frozen versions of the functions?

Yes. I mean the canonical version that must be used in RandomState. legacy-distributions.c is the file where the canonical version should be moved to when the generator for a specific distribution in distributions.c is improved.

Does their energy intake deviate systematically from the recommended
value of 7725 kJ?

We ha

This comment has been minimized.

Copy link
@WarrenWeckesser

WarrenWeckesser Jun 12, 2019

Contributor

Why is the actual output not shown here? Is it because the indeterminate value causes problems with doctests?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.