ENH: Implement groupby.sample #34069

Merged
merged 51 commits into from
Jun 14, 2020
Conversation

@dsaxton (Member) commented May 8, 2020

@dsaxton requested a review from mroeschke May 8, 2020 14:49
@mroeschke (Member) left a comment

Implementation looks good. Just some doc comments.

@jreback (Contributor) left a comment

cc @TomAugspurger @jorisvandenbossche if you'd have a look

else:
ws = [None] * self.ngroups

if random_state:
Contributor

I don't think this is enough; you always need a random_state here that is consistent across the entire groupby.

Contributor

I think either is fine. Either we get a random state from NumPy's global random state initially and reuse it, or we have each group draw from the global random state pool. It's similar to the difference between these two calls:

  1. .sample(random_state=0) # each call uses the seed 0
  2. .sample(random_state=np.random.RandomState(0)) # each call makes an independent draw
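
The difference between the two calls can be sketched on a plain DataFrame. This is a minimal illustration, not code from the PR: the frame `df` and all variable names are made up, and it assumes an integer seed is turned into a fresh `np.random.RandomState` on each call while an existing `RandomState` object is reused as-is.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from the PR.
df = pd.DataFrame({"value": range(100)})

# 1. Integer seed: every call starts again from seed 0,
#    so repeated calls select the same rows.
first = df.sample(n=5, random_state=0)
second = df.sample(n=5, random_state=0)
same_with_int_seed = first.index.equals(second.index)

# 2. Shared RandomState: each call consumes from the same stream,
#    so successive calls are successive draws rather than repeats.
rs = np.random.RandomState(0)
draw_a = df.sample(n=5, random_state=rs)
draw_b = df.sample(n=5, random_state=rs)

# The sequence as a whole is still reproducible from the same seed.
rs_again = np.random.RandomState(0)
replay_a = df.sample(n=5, random_state=rs_again)
replay_b = df.sample(n=5, random_state=rs_again)

print(same_with_int_seed)
print(list(draw_a.index), list(draw_b.index))
```

Applied to a groupby, the first style would give every group the same seed, while the second lets each group take its own draw from one shared stream.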

Member Author

I actually meant to write this check as random_state is not None (I hadn't considered other "falsy" values).
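
The pitfall being described is that a seed of 0 is falsy in Python. A small sketch, using two hypothetical helper functions that exist only for this illustration, shows how the two checks differ:

```python
def uses_seed_truthy(random_state=None):
    # Mirrors `if random_state:`; a seed of 0 is falsy and would be ignored.
    return bool(random_state)


def uses_seed_explicit(random_state=None):
    # Mirrors `if random_state is not None:`; only a missing seed is skipped.
    return random_state is not None


for seed in (None, 0, 42):
    print(seed, uses_seed_truthy(seed), uses_seed_explicit(seed))
```

With the truthiness check, `random_state=0` would silently behave as if no seed had been passed.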

@@ -275,6 +275,7 @@ Other enhancements
such as ``dict`` and ``list``, mirroring the behavior of :meth:`DataFrame.update` (:issue:`33215`)
- :meth:`~pandas.core.groupby.GroupBy.transform` and :meth:`~pandas.core.groupby.GroupBy.aggregate` has gained ``engine`` and ``engine_kwargs`` arguments that supports executing functions with ``Numba`` (:issue:`32854`, :issue:`33388`)
- :meth:`~pandas.core.resample.Resampler.interpolate` now supports SciPy interpolation method :class:`scipy.interpolate.CubicSpline` as method ``cubicspline`` (:issue:`33670`)
- :class:`DataFrameGroupBy` and :class:`SeriesGroupBy` now implement the ``sample`` method for doing random sampling within groups (:issue:`31775`)
Contributor

Need the full path to these classes in the docs.

the underlying object and will be used as sampling probabilities
after normalization within each group.
random_state : int, array-like, BitGenerator, np.random.RandomState, optional
If int, array-like, or BitGenerator (NumPy>=1.17), seed for
Contributor

If it is a BitGenerator, do you use a Generator or a RandomState to produce the random samples? Best practice is to use a Generator, since RandomState is effectively frozen in time. If it is an int, is it used as a seed for np.random.default_rng(), or for RandomState, when NumPy >= 1.17?

Member Author

This is following a pattern similar to the one used in pandas.core.generic.sample of processing the random_state according to pandas.core.common.random_state:

def random_state(state=None):
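
The kind of normalization this reply refers to can be sketched roughly as follows. This is a simplified, hypothetical helper (`normalize_random_state` is a made-up name) and not the pandas source: it covers only `None`, integers, and existing `RandomState` objects, and leaves out the array-like and BitGenerator cases mentioned in the docstring.

```python
import numpy as np


def normalize_random_state(state=None):
    """Hypothetical helper, not the pandas source: coerce `state` to a sampler."""
    if isinstance(state, np.random.RandomState):
        # Reuse the caller's stream so successive draws advance it.
        return state
    if state is None:
        # Fall back to NumPy's global random state (the np.random module).
        return np.random
    if isinstance(state, (int, np.integer)):
        # An integer seeds a fresh legacy RandomState.
        return np.random.RandomState(state)
    raise ValueError("state must be None, an int, or a RandomState")


rs = normalize_random_state(0)
print(type(rs).__name__)
```

Under this pattern an integer seed always yields a legacy `RandomState` rather than a `Generator`, which is the behavior the reviewer's question is probing.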

@jreback (Contributor) left a comment

looks fine, can you add a reference in doc/source/reference/groupby.rst

also a mention / small example in user_guide/groupby.rst if appropriate
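
A small example of the kind requested might look like the sketch below. The data and column names are invented for illustration, and it assumes a pandas version that includes `groupby.sample` (1.1 or later, per the milestone on this PR).

```python
import pandas as pd

# Illustrative data only; not taken from the PR or the pandas docs.
df = pd.DataFrame({"a": ["x", "x", "y", "y"], "b": [1, 2, 3, 4]})

# One random row per group; random_state makes the draw repeatable.
one_per_group = df.groupby("a").sample(n=1, random_state=1)

# Weights are normalized within each group, so a zero weight
# means that row is never selected.
weighted = df.groupby("a").sample(n=1, weights=[1, 0, 0, 1], random_state=1)

print(one_per_group)
print(weighted)
```

Because the weights are normalized within each group, the second call can only return the rows whose weight is nonzero.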

@jreback added this to the 1.1 milestone Jun 14, 2020
@jreback merged commit b3f483f into pandas-dev:master Jun 14, 2020
@jreback (Contributor) commented Jun 14, 2020

thanks @dsaxton very nice!

@dsaxton deleted the groupby-sample branch June 14, 2020 19:05
Development

Successfully merging this pull request may close these issues.

Feature Request: Sample method for Groupby objects
6 participants