New R-like sample function #151

Closed
Contributor

chrisjordansquire commented Sep 1, 2011

This is a convenience function that wraps together several pieces of functionality. It lets the user pass in a 1-D array-like and generate a random sample from it, optionally weighted and optionally without replacement.

It is essentially an implementation of the sample function in R's base package, which is a basic function for quickly generating random samples from a given vector.

With the exception of the weighted sample without replacement, all of the functionality of sample can be obtained with various existing functions and tricks, but it would be nice to have it encapsulated in one function. It also eases the transition for current and former R/S/S-plus users.
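
For illustration, here is roughly what those existing tricks look like for the unweighted cases (a hypothetical example, not part of the patch):

import numpy as np

a = np.array([2, 3, 5, 7, 11])
# unweighted sample of 3 with replacement, via random indices
with_repl = a[np.random.randint(0, len(a), size=3)]
# unweighted sample of 3 without replacement, via a permutation
without_repl = np.random.permutation(a)[:3]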

Its biggest use case is quickly and easily generating random samples from a vector of strings or of non-contiguous, irregularly spaced integers (as occurs with codes in survey data). This is very useful for generating simulated categorical data, such as

numpy.random.sample( ['male', 'female'], 20, p=[0.55, 0.45])

Such a sampling can currently be accomplished by creating an rv_discrete instance in scipy.stats with P(X=0)=0.55 and P(X=1)=0.45, getting the appropriate random sample, and then using the output as indices in an array ['male', 'female']. In addition to the number of commands being somewhat cumbersome for quickly generating data, one can't use the existing stats.rv_discrete function to directly create a distribution on 'male' and 'female'. scipy.stats random variables assume the sample space is the real line.
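
For comparison, the existing workaround looks roughly like this (a sketch only; the variable names are just for illustration):

import numpy as np
from scipy import stats

labels = np.array(['male', 'female'])
# discrete distribution on the integer codes 0 and 1
sex_dist = stats.rv_discrete(values=([0, 1], [0.55, 0.45]))
codes = sex_dist.rvs(size=20)
sample = labels[codes]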

I am open to other names for this function, as long as they are short. While creating this function I realized that random.random_sample was given a number of different aliases in the numpy/random/__init__.py file which weren't in the reference docs. I removed those aliases, as well as the limited number of times they were used in the numpy code base (all in test files). This included random.sample as an alias for random.random_sample. While this new sample function replaces the previous random.sample function, it does not break the API, as the previous function was not in the docs. It will also fail on all calls to the previous function, since the function signature is different.

Contributor

chrisjordansquire commented Sep 2, 2011

Made the changes discussed on the mailing list. Reverted ranf, random, and sample to their previous aliases (and the tests that depend on them), and changed the name of the new function from sample to choice.

Also added the aliases to the reference docs.

Owner

charris commented Oct 1, 2011

Could you add this to doc/release/2.0.0-notes.rst? And also add

.. versionadded:: 1.7.0

to the function documentation.
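
A minimal sketch of where that directive would sit in the function's numpydoc-style docstring (the signature shown here is only illustrative):

def choice(a, size=1, replace=True, p=None):
    """
    Generate a random sample from a given 1-D array.

    .. versionadded:: 1.7.0

    Parameters
    ----------
    a : 1-D array-like or int
        ...
    """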

'choice' doesn't sound quite right to me, maybe sample_from or some such?

Owner

charris commented Oct 22, 2011

What is the status of this?

Contributor

chrisjordansquire commented Dec 14, 2011

Getting back to work on this now that I'm done with fall coursework. I'll make the few changes requested.

@charris charris commented on an outdated diff Dec 15, 2011

numpy/random/mtrand/mtrand.pyx
+                raise ValueError("a must be greater than 0")
+        else:
+            a = np.asarray(a)
+            if len(a.shape) != 1:
+                raise ValueError("a must be 1-dimensional")
+            pop_size = a.size
+            if pop_size is 0:
+                raise ValueError("a must be non-empty")
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):

charris Dec 15, 2011

Owner

Spaces around '<'.

@charris charris and 1 other commented on an outdated diff Dec 15, 2011

numpy/random/mtrand/mtrand.pyx
+            a = np.asarray(a)
+            if len(a.shape) != 1:
+                raise ValueError("a must be 1-dimensional")
+            pop_size = a.size
+            if pop_size is 0:
+                raise ValueError("a must be non-empty")
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):

charris Dec 15, 2011

Owner

Any reason not to just normalize the sum to 1?


chrisjordansquire Dec 15, 2011

Contributor

Mostly as an error check. If the sum isn't close to 1, then the user likely made an error. I didn't want that to propagate silently, especially since coding errors in the random-number-generation part of a program can be particularly hard to catch.


charris Dec 15, 2011

Owner

OK, I was mostly thinking of the p as weights, but requiring the user to use probabilities might introduce a bit more discipline.
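
For users starting from unnormalized weights, the explicit normalization is a one-liner anyway (a hypothetical example using the new function):

import numpy as np

weights = np.array([3.0, 1.0, 1.0])
p = weights / weights.sum()
draws = np.random.choice(['a', 'b', 'c'], size=10, p=p)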

@charris charris and 1 other commented on an outdated diff Dec 15, 2011

numpy/random/mtrand/mtrand.pyx
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p<0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):
+                raise ValueError("probabilities do not sum to 1")
+
+        # Actual sampling
+        if replace:
+            if None != p:
+                x = np.arange(pop_size)
+                number_each = self.multinomial(size, p)

charris Dec 15, 2011

Owner

Did you try the inverse cdf method Josef mentioned (in the thread following your post at http://thread.gmane.org/gmane.comp.python.numeric.general/45916)?


chrisjordansquire Dec 15, 2011

Contributor

At the time I'd thought it would be more efficient. But then I just did a quick speed test, and the methods seem equivalent. I can change it, but I'm not sure if there's a compelling reason to.

import numpy as np

def msa(pop_size, size, p):
    # multinomial-and-shuffle approach (what the pull request currently does)
    x = np.arange(pop_size)
    number_each = np.random.multinomial(size, p)
    idx = np.repeat(x, number_each)
    np.random.shuffle(idx)
    return idx

def cdfsa(pop_size, size, p):
    # inverse-cdf approach
    cdf = np.cumsum(p)
    return np.searchsorted(cdf, np.random.random(size))

p = np.random.random(20)
p = p/np.sum(p)

def w1():
    return msa(20, 10000, p)

def w2():
    return cdfsa(20, 10000, p)
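
(For completeness, a quick way to time the two approaches; this is a sketch using the standard library, and IPython's %timeit works just as well:)

import timeit
print(timeit.timeit(w1, number=100))
print(timeit.timeit(w2, number=100))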

charris Dec 15, 2011

Owner

On my machine with len(p) == 1000:

In [3]: timeit test_choice.w1()
100 loops, best of 3: 2.69 ms per loop

In [4]: timeit test_choice.w2()
1000 loops, best of 3: 1.14 ms per loop

Using cdf.searchsorted(x, side='right') might be a bit cleaner than using the module-level function, and it also avoids a copy. Searching with side='right' also avoids the possibility of returning a result that has zero probability.
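
A minimal sketch of that form, assuming p is a 1-D array of probabilities summing to 1:

import numpy as np

p = np.array([0.2, 0.5, 0.3])
size = 10000
cdf = p.cumsum()
idx = cdf.searchsorted(np.random.random(size), side='right')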


chrisjordansquire Dec 16, 2011

Contributor

Ah. I'd wondered whether side='right' would give me what I wanted, but now I see it does. OK, I'll make the change.

@charris charris commented on the diff Dec 15, 2011

numpy/random/mtrand/mtrand.pyx
+                idx = np.repeat(x, number_each)
+                self.shuffle(idx)
+            else:
+                idx = self.randint(0, pop_size, size=size)
+        else:
+            if size > pop_size:
+                raise ValueError(''.join(["Cannot take a larger sample than ",
+                                          "population when 'replace=False'"]))
+
+            if None != p:
+                if np.sum(p>0) < size:
+                    raise ValueError("Fewer non-zero entries in p than size")
+                n_uniq = 0
+                p = p.copy()
+                found = np.zeros(size, dtype=np.int)
+                while n_uniq < size:

charris Dec 15, 2011

Owner

Heh. OK, it has O(n^2) running time. I wonder if there is something clever that could be done. Maybe a crossover: keep doing repeated samples, discarding those already taken, until the unavailable slots exceed a certain probability threshold, then redo the inverse cdf.

The case with zero probability for some of the elements can be subtle and might best be handled by removing them in the beginning.
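
As a standalone illustration of what the loop in the diff does (a sketch only, not the patch itself):

import numpy as np

def weighted_sample_without_replacement(p, size):
    # repeatedly draw via the inverse cdf, zeroing out the probability of
    # items already chosen and renormalizing, until `size` unique items are found
    p = np.asarray(p, dtype=float).copy()
    chosen = []
    while len(chosen) < size:
        p[chosen] = 0
        p = p / p.sum()
        cdf = np.cumsum(p)
        cdf /= cdf[-1]          # guard against floating-point drift
        draws = cdf.searchsorted(np.random.random(size - len(chosen)), side='right')
        for d in draws:
            if d not in chosen:
                chosen.append(d)
    return np.array(chosen)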


charris Dec 15, 2011

Owner

OK, I see you solved the O(n^2) thing.

@charris charris commented on an outdated diff Dec 15, 2011

numpy/random/mtrand/mtrand.pyx
+                raise ValueError(''.join(["Cannot take a larger sample than ",
+                                          "population when 'replace=False'"]))
+
+            if None != p:
+                if np.sum(p>0) < size:
+                    raise ValueError("Fewer non-zero entries in p than size")
+                n_uniq = 0
+                p = p.copy()
+                found = np.zeros(size, dtype=np.int)
+                while n_uniq < size:
+                    x = self.rand(size-n_uniq)
+                    if n_uniq > 0:
+                        p[found[0:n_uniq]] = 0
+                    p = p/p.sum()
+                    cdf = np.cumsum(p)
+                    new = np.searchsorted(cdf, x)

charris Dec 15, 2011

Owner

What if x is zero and the first probability is zero? I think cdf.searchsorted(x, side='right') might work better, since self.rand never returns 1. Runs of zeros need some testing. Admittedly, the chance of getting x == 0 is pretty slim, but it isn't quite zero.
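
A quick illustration of the edge case (hypothetical values):

import numpy as np

cdf = np.array([0.0, 0.3, 1.0])                 # first probability is zero
print(np.searchsorted(cdf, 0.0, side='left'))   # 0: selects the zero-probability item
print(np.searchsorted(cdf, 0.0, side='right'))  # 1: skips past it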

Owner

charris commented Dec 15, 2011

How about choose, or does that get confused with R? Maybe draw, drawfrom, or pick instead of choice?

Contributor

chrisjordansquire commented Dec 15, 2011

We'd discussed this on the list ages ago. The consensus at the time seemed to be that choice was a good match for the functions in the Python standard library's random module, especially when the default size is 1. (I preferred the name sample, but that's already taken, and more people preferred not to make so drastic a change to an existing function.)

Owner

charris commented Dec 15, 2011

OK.

Contributor

chrisjordansquire commented Dec 17, 2011

Ok. I think that's all the changes. (Unless I forgot something.) Seems to pass the tests still, too.


@charris charris commented on the diff Dec 17, 2011

numpy/random/mtrand/mtrand.pyx
+
+        if None != p:
+            p = np.asarray(p)
+            if len(p.shape) != 1:
+                raise ValueError("p must be 1-dimensional")
+            if p.size != pop_size:
+                raise ValueError("a and p must have same size")
+            if any(p < 0):
+                raise ValueError("probabilities are not non-negative")
+            if not np.allclose(p.sum(), 1):
+                raise ValueError("probabilities do not sum to 1")
+
+        # Actual sampling
+        if replace:
+            if None != p:
+                cdf = p.cumsum()

charris Dec 17, 2011

Owner

Indentation looks off. Did a tab slip in there?


chrisjordansquire Dec 17, 2011

Contributor

Yep. Good catch.

Owner

charris commented Dec 17, 2011

Looks like mtrand.c needs to be regenerated also.

Contributor

chrisjordansquire commented Dec 17, 2011

Using the generate_mtrand_c.py script? I had thought that would be done by re-building and re-installing all of numpy.

Owner

charris commented Dec 17, 2011

I don't think it is regenerated during install; I just tried it and the file wasn't updated. Also, that would make Cython one of the dependencies, and I don't think it is. You must have regenerated the file at some point, because it is among the commits.

Owner

charris commented Dec 17, 2011

And after regeneration there is a test error:

Traceback (most recent call last):
  File "/home/charris/.local/lib/python2.7/site-packages/numpy/random/tests/test_random.py", line 129, in test_choice_nonuniform_replace
Owner

charris commented Dec 17, 2011

I think the test failure is bogus and only due to the random numbers being used differently. After looking, I'm not sure those tests really help much, since it is the statistics that need testing. But at least we will know if something changes in the future. I think that is why they are there: so we don't break backward compatibility in the future.

Contributor

chrisjordansquire commented Dec 17, 2011

I'm not sure if the tests are totally helpful. But there is, of course, no really great way to test a random number generator. Unless you do a large amount of sampling and compare the proportions generated... but then the usual approach is just to compare against the reference distribution, and such a test will fail with some non-zero probability, which seems unacceptable for a test.

That's why I settled on using the tests that are there. I'm open to other ideas, or just eliminating the existing tests.
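
For what it's worth, a large-sample proportion check might look something like this (a sketch only; the tolerance is arbitrary, and the check still fails with some small probability):

import numpy as np

p = [0.55, 0.45]
draws = np.random.choice(['male', 'female'], size=100000, p=p)
freq_male = np.mean(draws == 'male')
assert abs(freq_male - p[0]) < 0.01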

Owner

charris commented Dec 17, 2011

Now that my memory has been jogged, I think the tests were added after a change in the normal distribution algorithm changed the output stream for a given seed. That was a problem for some folks, so the tests were added just to make sure that behavior was preserved for backward compatibility. So your tests are quite OK, especially since this will be the first version and only later revisions will need to toe the line. I'll just change the check and push it.

Owner

charris commented Dec 17, 2011

Pushed in 3d0b348..2988b0e.

@charris charris closed this Dec 17, 2011

Owner

charris commented Dec 17, 2011

I made a couple of fixes in pull request 177. Could you check them?
