Use np.random's RandomState when seed is None #13161

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
6 participants
Contributor

ariddell commented May 12, 2016 edited

  • closes #13143
  • tests added / passed
  • passes git diff upstream/master | flake8 --diff
  • whatsnew entry

The handle for numpy's current random state is
np.random.mtrand._rand.

Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573

@shoyer shoyer and 1 other commented on an outdated diff May 12, 2016

pandas/core/common.py
@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:
- return np.random.RandomState()
+ return np.random.mtrand._rand
@shoyer

shoyer May 12, 2016 edited

Member

Can we use something like this instead?

seed = np.random.randint(2 ** 32)
return np.random.RandomState(seed)

That way we don't need to use NumPy's private API.

@jorisvandenbossche

jorisvandenbossche May 12, 2016

Owner

I don't think that will do the same?

As this is to catch a previously called np.random.seed(), so feeding it another seed will not give the desired result I think?

@shoyer

shoyer May 12, 2016

Member

We want results to be reproducible after calling np.random.seed(). But we don't need to reuse exactly the same seed -- it's OK to also use a seed derived from NumPy's seed.

Using the exact same seed would only be necessary if we want to promise the sample makes the exact same choice as np.random.choice. But that's really an implementation detail.

(Clearly a comment in the code would also be in order if we use my suggested approach.)

@jorisvandenbossche

jorisvandenbossche May 12, 2016

Owner

Aha, I see.
For me it does not really matter which approach we take (sklearn also uses _rand, so although it is private it seems ok)

@shoyer shoyer and 1 other commented on an outdated diff May 12, 2016

pandas/tests/test_generic.py
@@ -415,6 +415,10 @@ def test_sample(self):
o.sample(frac=0.7, random_state=np.random.RandomState(test)),
o.sample(frac=0.7, random_state=np.random.RandomState(test)))
+ np.random.seed(test)
+ self._compare(o.sample(n=4), o.sample(n=4))
@shoyer

shoyer May 12, 2016

Member

This test doesn't look right to me. You need to call np.random.seed(test) before every call to sample, not just the first one.

@ariddell

ariddell May 12, 2016

Contributor

You're right. I fixed the test.

On 05/12, Stephan Hoyer wrote:

@@ -415,6 +415,10 @@ def test_sample(self):
o.sample(frac=0.7, random_state=np.random.RandomState(test)),
o.sample(frac=0.7, random_state=np.random.RandomState(test)))

  •        np.random.seed(test)
    
  •        self._compare(o.sample(n=4), o.sample(n=4))
    

This test doesn't look right to me. You need to call np.random.seed(test) before every call to sample, not just the first one.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub:
https://github.com/pydata/pandas/pull/13161/files/08b2cd1ae1e581e720d169cbf8d4888adcb07e67#r63099499

Contributor

jreback commented May 13, 2016

@ariddell pls also add a whatsnew entry. I would do it in API changes section.

Contributor

ariddell commented May 16, 2016

@jreback added. Thanks!

Member

shoyer commented May 16, 2016

@ariddell could you please fix the use of NumPy's private state as noted above? This is something simple that could avoid significant pain down the road.

Contributor

ariddell commented May 16, 2016

Fair enough. What should the seed be? Did you mean 2 ** 32 above?

Member

shoyer commented May 16, 2016

@ariddell Yes, I meant 2 ** 32 :).

Contributor

rkern commented May 17, 2016

Using np.random.mtrand._rand is the right approach. It's not going to disappear. This is an authorized use.

Member

shoyer commented May 17, 2016

Let's just return np.random from _random_state instead of np.random.mtrand._rand, using @rkern's suggestion of duck typing for RandomState objects from the related thread on numpy-discussion: https://mail.scipy.org/pipermail/numpy-discussion/2016-May/075489.html

Contributor

ariddell commented May 17, 2016

@shoyer duck typing it is.

@shoyer shoyer commented on the diff May 17, 2016

pandas/core/common.py
@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:
- return np.random.RandomState()
+ return np.random
@shoyer

shoyer May 17, 2016

Member

can you please update the docstring, too?

@ariddell

ariddell May 18, 2016

Contributor

Yes. Fixed.

On 05/17, Stephan Hoyer wrote:

@@ -2072,7 +2072,7 @@ def _random_state(state=None):
elif isinstance(state, np.random.RandomState):
return state
elif state is None:

  •    return np.random.RandomState()
    
  •    return np.random
    

can you please update the docstring, too?


You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/pydata/pandas/pull/13161/files/c8a38f99637ad44dad0db9118b17fd4e3c8643f3#r63603099

jreback added this to the 0.18.2 milestone May 18, 2016

Contributor

jreback commented May 18, 2016

you have a failing test

Contributor

jreback commented May 20, 2016

@ariddell can you rebase / update

Allen Riddell Use np.random's RandomState when seed is None
The handle for numpy's current random state is
``np.random.mtrand._rand``. Rather than use the private API, return
np.random, as the module makes available the same functions as an
instance of RandomState.

Compare https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/validation.py#L573
595b0bc
Contributor

ariddell commented May 20, 2016

OK. I think I've got it.

jreback closed this in 6f90340 May 21, 2016

Contributor

jreback commented May 21, 2016

thanks @ariddell

@nps nps added a commit to nps/pandas that referenced this pull request May 30, 2016

@nps Allen Riddell + nps API: Use np.random's RandomState when seed is None in .sample
closes #13143

Author: Allen Riddell <abr@ariddell.org>

Closes #13161 from ariddell/feature/sample-numpy-random-seed and squashes the following commits:

595b0bc [Allen Riddell] Use np.random's RandomState when seed is None
72d6963

pydolan commented Feb 9, 2017 edited

I'm not getting the behavior I'd expect from this fix in Pandas 0.19.2. Example:

>>> import pandas as pd
>>> import numpy as np
>>> np.random.seed(12345678)
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> df.sample(n=2)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
0  1.281455 -0.423852  0.011236 -0.945194 -0.088124
>>> df.sample(n=2)
          a         b         c         d         e
2  0.341751 -1.158737  0.313814  1.827552 -0.351045
9  1.355003  0.532651  0.250463 -0.281751 -0.342741
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8  0.365786 -0.855795 -0.511062 -0.116734  1.133541
6 -0.401077 -0.047115 -0.410951 -1.608354  0.290594
>>> pd.__version__
u'0.19.2'
>>> np.__version__
'1.12.0'

Any thoughts on why df.sample(n=2) doesn't match df.sample(n=2, random_state=12345678)?

Member

shoyer commented Feb 9, 2017

@pydolan is there a missing np.random.seed() call somewhere in your example?

pydolan commented Feb 9, 2017

@shoyer -- sorry about that; will update example with my actual code (i.e., with the one missing line)

Member

shoyer commented Feb 9, 2017

@pydolan It looks like you are calling np.random.randn() after setting the seed before df.sample()

pydolan commented Feb 9, 2017

If I move the seed() call after instantiating the dataframe, I still get inconsistent behavior with calls to sample(), except when I provide the random_state arg. Example:

>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame(np.random.randn(10, 5), columns=['a', 'b', 'c', 'd', 'e'])
>>> np.random.seed(12345678)
>>> df.sample(n=2)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2)
          a         b         c         d         e
4  0.895497  0.609853 -1.548664 -1.238415 -1.058904
5  0.196420  0.472877 -0.918205  1.019862 -0.631993
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679
>>> df.sample(n=2, random_state=12345678)
          a         b         c         d         e
8 -1.250204  0.551508  1.408080  0.397452  0.424326
6 -0.028298  0.203270  0.939094 -1.802227 -0.088679

Note that this time, the first call to sample() uses the seed, but the second call does not use the seed. Is it expected that seed() needs to be called before every sample call? I thought it is supposed to be, "set once", and all future randomization-related calls should use (including my original example, where randn() is called after seed() and before sample()).

(Also note that I did verify that calling seed() before every call to sample() does indeed produce the same sampled rows.)

Member

shoyer commented Feb 9, 2017

Is it expected that seed() needs to be called before every sample call? I thought it is supposed to be, "set once", and all future randomization-related calls should use (including my original example, where randn() is called after seed() and before sample()).

This is working as intended. Try just sampling random numbers from numpy -- it does the exact same thing.

pydolan commented Feb 9, 2017

Good point; I completely spaced on the nature of random sampling. Sorry for the waste of time!

ariddell deleted the unknown repository branch Feb 18, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment