
Series/DataFrame sample method with/without replacement #2419

Closed
wesm opened this issue Dec 3, 2012 · 35 comments

@wesm (Member) commented Dec 3, 2012

Should use a more intelligent algorithm than using np.random.permutation

@changhiskhan (Contributor)

Something like Series/DataFrame.sample(ntrials, shape=None, axis=0, replace=True, iterator=False)?

@wesm (Member, Author) commented Dec 7, 2012

Or even just .sample(size, replace=True/False) would be fine. @rkern had a reservoir sampling impl floating around (for efficient sampling w/o replacement), maybe only on the mailing list
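For reference, the reservoir-sampling idea mentioned here (drawing without replacement in one pass, without materializing a full permutation as np.random.permutation does) can be sketched roughly as follows; the function name and signature are hypothetical, not from any pandas or numpy API:

```python
import random

def reservoir_sample(iterable, k, rng=None):
    """Algorithm R: draw k items uniformly without replacement
    from an iterable of unknown length, in a single pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a reservoir slot with probability k/(i+1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir
```

This keeps memory at O(k) regardless of input length, which is the appeal over shuffling the whole index.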

@wesm (Member, Author) commented Dec 7, 2012

This doesn't need to get done for 0.10

@jreback jreback modified the milestones: 0.14.1, Someday May 29, 2014
hayd added a commit to hayd/pandas that referenced this issue May 30, 2014
@jreback jreback modified the milestones: 0.15.0, 0.14.1 Jun 26, 2014
@shoyer (Member) commented Jan 21, 2015

I would like to propose that we should copy the API from dplyr for this method: namely, we should have two methods, sample_n and sample_frac. These methods are especially nice when coupled with groupby.

CC @hayd

@TomAugspurger (Contributor)

Steal all the dplyr!

To keep the number of new methods low, would you favor a single method df.sample(sample_size) where the behavior is like sample_frac if sample_size is between (0, 1), and like sample_n if it's a positive integer? There's precedent for this in scikit-learn's train_test_split:

test_size: If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples...

And we can have a with_replacement keyword argument as well. np.random.choice has a default of replace=True.

@shoyer (Member) commented Jan 21, 2015

@TomAugspurger Hmm. I've used train_test_split, but don't like the degeneracy of size = 1. I think Hadley Wickham has the right idea in dplyr with "each function does only one thing, but does it well." So I would prefer for two methods with the same prefix. In my opinion, similarly named methods do not cause much more cognitive load than a single method.

@TomAugspurger (Contributor)

Good enough for me.


@stared commented Jan 26, 2015

+1

@nickeubank (Contributor)

I'd be happy to take a look at this in about a week (after a presentation).

How would people feel about an implementation built around a numpy sampling of the index, followed by a .loc[] call, similar to the sketch below (though with the df.sample_n() and .sample_frac() naming suggested above)?

import pandas as pd
from numpy import random as rm

def rand_rows(df, num_rows=5):
    subset = rm.choice(df.index.values, size=num_rows)
    return df.loc[subset]

a_data_frame = pd.DataFrame({'col1': range(10, 20), 'col2': range(20, 30)})
rand_rows(a_data_frame)
rand_rows(a_data_frame, 6)

@TomAugspurger (Contributor)

That sounds fine. You'll also want to accept a seed parameter.

The only wrinkle is how to handle duplicates in the index. If you use .loc, you could get back more than num_rows rows when a duplicated index label is selected. I think you should use .iloc and make everything position-based.
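A minimal illustration of that wrinkle, using a toy frame with a duplicated index label:

```python
import pandas as pd

# A frame with a duplicated index label.
df = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'a', 'b'])

# Label-based selection of two labels returns three rows,
# because 'a' matches two positions.
print(len(df.loc[['a', 'b']]))   # 3

# Position-based selection returns exactly as many rows
# as positions requested.
print(len(df.iloc[[0, 2]]))      # 2
```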

@nickeubank (Contributor)

Sounds great -- I'll get to it next week!

@shoyer (Member) commented Mar 1, 2015

@nickeubank glad you're excited about this! It would be great if you could get this finished :).

Here are the rough versions (mostly untested) that I wrote a few weeks ago:

import numpy as np

def sample_n(df, n, replace=False, weight=None, seed=None):
    """Sample n rows from a DataFrame at random."""
    rs = np.random.RandomState(seed)
    locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
    return df.take(locs, axis=0)

def sample_frac(df, frac, replace=False, weight=None, seed=None):
    """Sample some fraction of a DataFrame at random."""
    n = int(round(frac * df.shape[0]))
    return sample_n(df, n, replace=replace, weight=weight, seed=seed)

I think these get a couple of things right:

  1. Accepts a random number seed, which is essential for reproducibility.
  2. Samples integers and does position-based indexing. This lets us side-step the complexity of .loc and label-based indexing.
  3. Uses .take, which is usually considerably faster than indexing with .iloc.
  4. API borrowed from dplyr.

What this needs:

  1. Tests!
  2. Documentation!
  3. Probably should accept a string for the weight argument, which would map to a DataFrame column.

Also, it would be really nice for these methods to work with grouped operations, so you could write something like df.groupby('category').sample_n(100) -> get 100 samples from each category.
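The grouped use case can already be approximated with apply; a rough sketch reusing the sample_n helper above (an approximation, not the proposed groupby API):

```python
import numpy as np
import pandas as pd

def sample_n(df, n, replace=False, weight=None, seed=None):
    """Sample n rows from a DataFrame at random."""
    rs = np.random.RandomState(seed)
    locs = rs.choice(df.shape[0], size=n, replace=replace, p=weight)
    return df.take(locs, axis=0)

df = pd.DataFrame({'category': ['a'] * 5 + ['b'] * 5,
                   'value': range(10)})

# Two random rows per category; group_keys=False keeps the
# original (non-hierarchical) index on the result.
sampled = df.groupby('category', group_keys=False).apply(
    lambda g: sample_n(g, 2, seed=0))
```

A dedicated GroupBy.sample_n would avoid the per-group Python-level overhead of apply.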

@nickeubank (Contributor)

@shoyer Great! looks like this is in great shape. I'll start by building some tests and look into a weight implementation and get back to you, then we can pivot to the groupby once that's done.

Do you have an existing fork I should work on?

@shoyer (Member) commented Mar 1, 2015

@nickeubank Nope, feel free to start from scratch. I needed sample_n for a notebook, but didn't have time to clean it up for a PR.

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@nickeubank (Contributor)

Quick poll: I'm inclined to call the function "rand()" and accept both "size" and "size_type = {number, frac}" to accommodate requests for both an exact number of rows and a fraction of rows.

My personal interest in this is mostly for being able to quickly query a random set of rows to examine my data frame, so having "df.rand()" return 5 random rows in a manner analogous to "df.head()" feels more appealing than longer function names like sample_n() or sample_frac().

But I'm open to input -- would people prefer sample_n() and sample_frac()? or is rand() seem ok?

@shoyer (Member) commented Mar 13, 2015

I am not a fan of df.rand() because it's not clear what rand means in the context of a DataFrame. Sure, it means something random is happening, but rand makes me think of generating random numbers (e.g., with np.random.rand()), not sampling at random.

For me, adding a few characters to the length of the function is not such a big concern, because I'm almost always using auto-complete in IPython, anyways.

I'm afraid I'm also not a fan of returning 5 random rows as the default. That feels like a very arbitrary number to me -- and again, something that would be hard to guess.

@TomAugspurger (Contributor)

I'm also in favor of sample_n and sample_frac. Long method names don't bother me (up to point). The only trouble is that tab completion doesn't work through method chains.

@jorisvandenbossche (Member)

@nickeubank be sure to also check #7274, a closed PR that tried to implement this, for some inspiration (comments, tests)

I also like sample more than rand. Whether it should be two functions or one function with two kwargs, I have no real opinion; slightly leaning to one method, but if the others prefer two, that is OK with me.

@nickeubank (Contributor)

OK, sounds like a consensus in favor of .sample() over .rand().

Like @jorisvandenbossche, I'm inclined toward one method with an n_or_frac option, but am open to following @TomAugspurger's suggestion if that's what people prefer.

@shoyer Regarding the default return of five rows, it's a little arbitrary, but is analogous to what head() and tail() provide. And while I realize not everyone will use this for quick data interrogations, I don't see a lot of harm in a default for those who are -- I have trouble imagining a situation in which having a default N would cause problems in analysis.

@shoyer (Member) commented Mar 13, 2015

Like I said before, my main issue with plain sample is that size=1 is degenerate. And unfortunately, getting one sample at random and getting a number of samples equal to the length of the frame (e.g., for bootstrapping) are both common use cases. What's your proposal for this edge case?

@nickeubank (Contributor)

Ah, I see -- you were thinking that if a size value is between 0 and 1, the function infers the user wants a share of rows; if size is an integer greater than 1, the function assumes they want N rows?

I was just going to make it a function option. That gets rid of the corner case. Basically:

def sample(self, size=5, n_or_frac='n', replacement=False, weights=None, seed=None):
    """
    Return a sample of rows from the object.

    Parameters
    ----------
    size : Number of rows (if n_or_frac='n') or
        share of rows (if n_or_frac='frac'). Default 5.
    n_or_frac : {'n', 'frac'}
        If 'n', return a sample of `size` rows.
        If 'frac', return `size` fraction of rows.
        Default is 'n'.
    replacement : bool
        Sample with or without replacement.
    weights : Series or ndarray of weights, same length as the index.
        Default None results in equal-probability weighting.
    seed : Seed passed to numpy's RandomState. Default None.
    """

@jorisvandenbossche (Member)

If we would make it one sample function, I think it should have two separate keywords like sample(n=None, frac=None) instead of one keyword controlling what the other does.

But also ok to make two functions of it

@jorisvandenbossche (Member)

Also, I would use replace instead of replacement to be consistent with numpy

@shoyer (Member) commented Mar 13, 2015

sample(n=None, frac=None) looks pretty nice to me, actually. I suppose if it's called like df.sample() then we could even default to sampling five rows (not entirely sure that's a good idea, though).

@jorisvandenbossche (Member)

and actually df.sample_frac(0.5) is not shorter than df.sample(frac=0.5), and the latter looks a bit nicer to me.

@nickeubank (Contributor)

Ha! Do you think this is the exact conversation that the dplyr developers had?

Sounds like there's a pretty good consensus around 2 functions -- I'll code that up!

@shoyer (Member) commented Mar 13, 2015

Actually, I think @jorisvandenbossche and I are now voting for one function, two arguments :).

@nickeubank (Contributor)

Oh! Misread post on length. :)

OK, so something like the following, with an error thrown if both n and frac values are provided:

def sample(self, n=5, frac=None, replace=False, weights=None, seed=None):
    """
    Return a sample of rows from the object.

    Parameters
    ----------
    n : Number of rows to return. Cannot be used with frac.
        Default 5 if frac is None.
    frac : Fraction of rows to return. Cannot be used with n.
    replace : bool
        Sample with or without replacement.
    weights : Series or ndarray of weights, same length as the index.
        Default None results in equal-probability weighting.
    seed : Seed passed to numpy's RandomState. Default None.
    """

@shoyer (Member) commented Mar 13, 2015

Yes, that looks very close. One thing to note is that you'll need to make n=None in the function signature -- otherwise we can't tell cleanly if n=5 was intentional or merely the default value. This matters because of the alternative frac option.
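A minimal sketch of that dispatch (a hypothetical standalone helper, not the final pandas method): with n=None in the signature, n and frac can be made mutually exclusive, and a fallback of 5 rows applies only when neither is given.

```python
import numpy as np
import pandas as pd

def sample(obj, n=None, frac=None, replace=False, seed=None):
    # n=None lets us distinguish "caller passed n" from the default,
    # which matters because frac is an alternative to n.
    if n is not None and frac is not None:
        raise ValueError("pass either `n` or `frac`, not both")
    if n is None:
        n = 5 if frac is None else int(round(frac * len(obj)))
    rs = np.random.RandomState(seed)
    locs = rs.choice(len(obj), size=n, replace=replace)
    return obj.take(locs)

df = pd.DataFrame({'x': range(10)})
print(len(sample(df, frac=0.3)))  # 3
print(len(sample(df)))            # 5 (default when neither is given)
```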

Also, weights (on DataFrame) should accept a string, which tries to look up the weights from that column of the data frame.

@nickeubank (Contributor)

On first point: Great.

On weights: I was coding this into core/generic.py so it would also work with Series, and for a Series the string wouldn't mean anything. With that in mind, I thought I'd just ask for a Series in the weights field, and the user could pass df.weightColumn if they had one.

Or do you think we need an if isinstance(self, pd.DataFrame): clause to allow strings for DataFrames?

@nickeubank (Contributor)

Never mind -- I'll just add an "if DataFrame" clause. :)

@cpcloud (Member) commented Mar 16, 2015

Little late to the party here, but I am -1 on passing in a string to weights to mean a column. Why not just accept a single thing -- a Series -- so it works with both Series and frames without having to know the type of self? It's also clearer what the meaning is, IMO.

@shoyer (Member) commented Mar 16, 2015

Little late to the party here, but I am -1 on passing in a string to weights to mean a column.

I agree this functionality is not essential, but we already use this sort of syntax as a shortcut (e.g., with groupby), so I doubt it will be confusing. The main advantage, from my perspective, is enhanced chain-ability (similar to assign), because you don't need to write the variable for the containing frame again.
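For illustration, resolving a string weights argument might look like this hypothetical helper (it normalizes to probabilities, since np.random.choice's p argument must sum to 1):

```python
import numpy as np
import pandas as pd

def resolve_weights(obj, weights):
    """If `weights` is a string, treat it as a DataFrame column name;
    normalize whatever we get to probabilities summing to 1."""
    if isinstance(weights, str):
        if not isinstance(obj, pd.DataFrame):
            raise ValueError("string weights are only valid for a DataFrame")
        weights = obj[weights]
    w = np.asarray(weights, dtype=float)
    return w / w.sum()

df = pd.DataFrame({'x': [1, 2, 3], 'w': [1.0, 1.0, 2.0]})
p = resolve_weights(df, 'w')  # probabilities 0.25, 0.25, 0.5
```

Accepting a Series directly still works with this shape, so the string form is a convenience on top, not a replacement.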

@nickeubank (Contributor)

Submitted as pull request #9666. Input welcome!

@jreback jreback modified the milestones: 0.16.1, Next Major Release Mar 17, 2015
@jreback (Contributor) commented May 1, 2015

closed by #9666

@jreback jreback closed this as completed May 1, 2015