New R-like sample function #151

Closed
wants to merge 3 commits
+6,082 −4,258

2 participants

@chrisjordansquire

This is a convenience function that wraps together several different functionalities. It allows the user to pass in a 1-D array-like and generate a possibly weighted, possibly without replacement random sample from that array.

It is essentially an implementation of the sample function in R's base package, which is a basic function for quickly generating random samples from a given vector.

With the exception of the weighted sample without replacement, all the functionality of sample can be obtained using various functions and tricks, but it would be nice to have them encapsulated in one function. It also eases the transition for current and former R/S/S-plus users.

Its biggest use case is quickly and easily generating random samples from a vector of strings or from non-contiguous, irregularly spaced integers (as occur with codes in survey data). This is very useful for generating simulated categorical data, such as

numpy.random.sample(['male', 'female'], 20, p=[0.55, 0.45])

Such a sampling can currently be accomplished by creating an rv_discrete instance in scipy.stats with P(X=0)=0.55 and P(X=1)=0.45, drawing the appropriate random sample, and then using the output as indices into the array ['male', 'female']. Besides requiring a somewhat cumbersome number of commands for quickly generating data, this approach can't use the existing stats.rv_discrete to create a distribution directly on 'male' and 'female', since scipy.stats random variables assume the sample space is the real line.
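
For concreteness, a minimal sketch of that rv_discrete workaround (the variable names are illustrative only):

import numpy as np
from scipy import stats

# Define a discrete distribution on the integer codes 0 and 1,
# then map the sampled codes back onto the string labels.
codes = stats.rv_discrete(values=([0, 1], [0.55, 0.45]))
idx = codes.rvs(size=20)                    # integer sample in {0, 1}
labels = np.array(['male', 'female'])[idx]  # use the codes as indices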

I am open to other names for this function, as long as they are short. While creating this function I realized that random.random_sample was given a number of aliases in the random __init__.py file which weren't in the reference docs. I removed those aliases, along with the limited number of places they were used in the numpy code base (all in test files). This included random.sample as an alias for random.random_sample. While this new sample function replaces the previous random.sample, it does not break the API, as the previous function was not in the docs. It will also fail on all calls to the previous function, since the function signature is different.

@chrisjordansquire

Made the changes discussed on the mailing list. Reverted ranf, random, and sample to their previous aliases, as well as the tests dependent on them, and changed the name of the new function from sample to choice.

Also added the aliases to the reference docs.

@charris
Owner

Could you add this to doc/release/2.0.0-notes.rst? And also add

.. versionadded:: 1.7.0

to the function documentation.

'choice' doesn't sound quite right to me, maybe sample_from or some such?

@charris
Owner

What is the status of this?

@chrisjordansquire

Getting back to work on this now that I'm done with fall coursework. I'll make the few changes requested.

numpy/random/mtrand/mtrand.pyx
((84 lines not shown))
+                raise ValueError("a must be greater than 0")
+        else:
+            a = np.asarray(a)
+            if len(a.shape) != 1:
+                raise ValueError("a must be 1-dimensional")
+            pop_size = a.size
+            if pop_size is 0:
+                raise ValueError("a must be non-empty")
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):
@charris Owner
charris added a note

Spaces around '<'.

numpy/random/mtrand/mtrand.pyx
((86 lines not shown))
+            a = np.asarray(a)
+            if len(a.shape) != 1:
+                raise ValueError("a must be 1-dimensional")
+            pop_size = a.size
+            if pop_size is 0:
+                raise ValueError("a must be non-empty")
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):
@charris Owner
charris added a note

Any reason not to just normalize the sum to 1?

Mostly as an error check. If the sum isn't close to 1, then the user likely made an error. I didn't want that to propagate silently, especially since coding errors in the random-number-generation portion of a program can be especially hard to track down.

@charris Owner
charris added a note

OK, I was mostly thinking of the p as weights, but requiring the user to use probabilities might introduce a bit more discipline.
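
For illustration, if what you have are unnormalized weights, the caller-side fix is simply to normalize before passing them in. The weights and labels below are made up, and this assumes the function is merged as numpy.random.choice, as discussed later in the thread:

import numpy as np

w = np.array([3.0, 1.0, 1.0])                    # hypothetical weights
p = w / w.sum()                                  # passes the "sums to 1" check
draw = np.random.choice(['a', 'b', 'c'], size=5, p=p)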

numpy/random/mtrand/mtrand.pyx
((93 lines not shown))
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):
+                raise ValueError("probabilities do not sum to 1")
+
+        # Actual sampling
+        if replace:
+            if None != p:
+                x = np.arange(pop_size)
+                number_each = self.multinomial(size, p)
@charris Owner
charris added a note

Did you try the inverse cdf method Josef mentioned in the thread following your post (http://thread.gmane.org/gmane.comp.python.numeric.general/45916)?

At the time I'd thought it would be more efficient. But then I just did a quick speed test, and the methods seem equivalent. I can change it, but I'm not sure if there's a compelling reason to.

import numpy as np

def msa(pop_size, size, p):
    # multinomial approach: draw counts per category, expand, then shuffle
    x = np.arange(pop_size)
    number_each = np.random.multinomial(size, p)
    idx = np.repeat(x, number_each)
    np.random.shuffle(idx)
    return idx

def cdfsa(pop_size, size, p):
    # inverse-cdf approach: invert the cumulative distribution with searchsorted
    cdf = np.cumsum(p)
    return np.searchsorted(cdf, np.random.random(size))

p = np.random.random(20)
p = p/np.sum(p)

def w1():
    return msa(20, 10000, p)

def w2():
    return cdfsa(20, 10000, p)
@charris Owner
charris added a note

On my machine with len(p) == 1000:

In [3]: timeit test_choice.w1()
100 loops, best of 3: 2.69 ms per loop

In [4]: timeit test_choice.w2()
1000 loops, best of 3: 1.14 ms per loop

Using cdf.searchsorted(x, side='right') might be a bit cleaner than using the module-level function and also avoids a copy. Searching with side='right' also avoids the possibility of returning a result that has zero probability.

Ah. I'd doubted that side='right' would give me what I wanted, but now I see it does. OK, I'll make the change.
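
A rough standalone sketch of the change being agreed on here; this mimics the logic with module-level numpy calls and a made-up function name, not the actual mtrand.pyx code:

import numpy as np

def weighted_draw_with_replacement(a, size, p):
    # Inverse-cdf sampling: uniform draws in [0, 1) mapped through the
    # cumulative distribution; side='right' skips zero-probability entries
    # and never indexes past the end because the draws never equal 1.
    a = np.asarray(a)
    cdf = np.cumsum(p)
    cdf /= cdf[-1]                  # guard against floating-point drift in the sum
    u = np.random.random(size)
    idx = cdf.searchsorted(u, side='right')
    return a[idx]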

@charris charris commented on the diff
numpy/random/mtrand/mtrand.pyx
((109 lines not shown))
+                idx = np.repeat(x, number_each)
+                self.shuffle(idx)
+            else:
+                idx = self.randint(0, pop_size, size=size)
+        else:
+            if size > pop_size:
+                raise ValueError(''.join(["Cannot take a larger sample than ",
+                                          "population when 'replace=False'"]))
+
+            if None != p:
+                if np.sum(p>0) < size:
+                    raise ValueError("Fewer non-zero entries in p than size")
+                n_uniq = 0
+                p = p.copy()
+                found = np.zeros(size, dtype=np.int)
+                while n_uniq < size:
@charris Owner
charris added a note

Heh. OK, it has O(n^2) running time. I wonder if there is something clever that could be done. Maybe a crossover between just doing repeated samples, discarding those already taken, until the unavailable slots exceed a certain probability threshold, and then redoing the inverse cdf.

The case with zero probability for some of the elements can be subtle and might best be handled by removing them in the beginning.

@charris Owner
charris added a note

OK, I see you solved the O(n^2) thing.

numpy/random/mtrand/mtrand.pyx
((115 lines not shown))
+                raise ValueError(''.join(["Cannot take a larger sample than ",
+                                          "population when 'replace=False'"]))
+
+            if None != p:
+                if np.sum(p>0) < size:
+                    raise ValueError("Fewer non-zero entries in p than size")
+                n_uniq = 0
+                p = p.copy()
+                found = np.zeros(size, dtype=np.int)
+                while n_uniq < size:
+                    x = self.rand(size-n_uniq)
+                    if n_uniq > 0:
+                        p[found[0:n_uniq]] = 0
+                    p = p/p.sum()
+                    cdf = np.cumsum(p)
+                    new = np.searchsorted(cdf, x)
@charris Owner
charris added a note

What if x is zero and the first probability is zero? I think cdf.searchsorted(x, side='right') might work better, since self.rand never returns 1. Runs of zeros need some testing. Admittedly, the chance of getting x == 0 is pretty slim, but it isn't quite zero.
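
A toy check of the edge case described above (the probabilities here are made up):

import numpy as np

cdf = np.cumsum([0.0, 0.3, 0.7])       # first entry has zero probability
cdf.searchsorted(0.0, side='left')     # -> 0, the zero-probability element
cdf.searchsorted(0.0, side='right')    # -> 1, skips it, which is what we want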

@charris
Owner

How about choose, or does that get confused with R? Maybe draw, drawfrom, or pick instead of choice?

@chrisjordansquire

We'd discussed this on the list ages ago. The consensus at the time seemed to be that choice was a good match for the corresponding functions in the Python standard library's random module, especially when the default size is 1. (I preferred the name sample, but that's already taken, and more people preferred not to make so drastic a change to an existing function.)

@charris
Owner

OK.

@chrisjordansquire

Ok. I think that's all the changes. (Unless I forgot something.) Seems to pass the tests still, too.

@charris

Indentation looks off, did some tabs slip in there?

@charris charris commented on the diff
numpy/random/mtrand/mtrand.pyx
((94 lines not shown))
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p < 0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):
+                raise ValueError("probabilities do not sum to 1")
+
+        # Actual sampling
+        if replace:
+            if None != p:
+                cdf = p.cumsum()
@charris Owner
charris added a note

Indentation looks off. Did a tab slip in there?

Yep. Good catch.

@charris
Owner

Looks like mtrand.c needs to be regenerated also.

@chrisjordansquire

Using the generate_mtrand_c.py script? I had thought that would be done by re-building and re-installing all of numpy.

@charris
Owner

I don't think it is regenerated during install; I just tried it and the file wasn't updated. Also, that would make Cython one of the dependencies, and I don't think it is. You must have regenerated the file at some point, because it is among the commits.

@charris
Owner

And after regeneration there is a test error:

Traceback (most recent call last):
  File "/home/charris/.local/lib/python2.7/site-packages/numpy/random/tests/test_random.py", line 129, in test_choice_nonuniform_replace
@charris
Owner

I think the test failure is bogus and only due to the random numbers being used differently. After looking, I'm not sure those tests really help much, since it is the statistics that need testing. But at least we will know if something changes in the future; I think that is why they are there, so we don't break backward compatibility.

@chrisjordansquire

I'm not sure the tests are totally helpful. But there is, of course, no really great way to test a random number generator, unless you do a large amount of sampling and compare the proportions generated against the reference distribution. And such a test will reject with some non-zero probability, which seems unacceptable for a unit test.

That's why I settled on using the tests that are there. I'm open to other ideas, or just eliminating the existing tests.

@charris
Owner

Now that my memory has been jogged, I think the tests were added after a change in the normal distribution algorithm changed the output stream for a given seed. That was a problem for some folks, so the tests were added just to make sure that the behavior was preserved for backward compatibility. So your tests are quite OK, especially since this will be the first version and only later revisions will need to toe the line. I'll just change the check and push it.
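
For illustration, the kind of seed-pinning regression test being described looks roughly like this; the function name and expected values below are placeholders, not the contents of test_random.py:

import numpy as np
from numpy.testing import assert_array_equal

def test_choice_stream_is_stable():
    # Pin the seed and compare against the output recorded when the test
    # was written, so a future change to the draw algorithm is noticed.
    np.random.seed(1234)
    got = np.random.choice(5, size=4, replace=True, p=[0.1, 0.2, 0.3, 0.2, 0.2])
    expected = np.array([2, 3, 1, 4])    # placeholder values, not real output
    assert_array_equal(got, expected)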

@charris
Owner

Pushed in 3d0b348..2988b0e.

@charris charris closed this
@charris
Owner

I made a couple of fixes in pull request 177. Could you check them?
