ENH: option for groupby.hist to match bins #22228
Thanks for the PR!
pandas/core/groupby/groupby.py
Outdated
@@ -582,6 +582,15 @@ def wrapper(*args, **kwargs):
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis

if (name == 'hist' and
        kwargs.pop('equal_bins', False)):
Can you get the same behavior without mutating the kwargs dict?
Yes, I can pass `equal_bins` to `hist_series`. Do you also mean this for line 588? Just asking, because that might be a bit tricky to pipe through.
Less concerned about 588 since you end up replacing that anyway
Thanks. I ended up copying kwargs and mutating that.
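For reference, a minimal sketch of reading the flag without mutating the caller's dictionary. The helper name `split_hist_kwargs` is hypothetical and not part of pandas; it only illustrates the "copy, then filter" idea discussed above.

```python
def split_hist_kwargs(kwargs):
    """Return (equal_bins, forwarded) without modifying ``kwargs``.

    Hypothetical helper for illustration only; not part of pandas.
    """
    equal_bins = bool(kwargs.get("equal_bins", False))
    # Build a new dict that omits the pandas-only flag, so the remaining
    # keywords can be forwarded to the plotting backend untouched.
    forwarded = {k: v for k, v in kwargs.items() if k != "equal_bins"}
    return equal_bins, forwarded


user_kwargs = {"bins": 5, "alpha": 0.7, "equal_bins": True}
flag, forwarded = split_hist_kwargs(user_kwargs)
print(flag, forwarded)
print(user_kwargs)  # the caller's dict still contains "equal_bins"
```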
pandas/core/groupby/groupby.py
Outdated
        kwargs.pop('equal_bins', False)):
    # GH-22222
    bins = kwargs.pop('bins', None)
    if type(bins) == int:
I think it would be preferable to use `is_integer` from `pandas.core.dtypes.inference`.
done
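As a side note on why an exact type comparison is fragile: it rejects subclasses of `int` and other integer-like types. The sketch below uses the standard library's `numbers.Integral` purely as a stand-in so that it runs without pandas; it is not equivalent to pandas' `is_integer`, and `MyInt` is a made-up class.

```python
import numbers


class MyInt(int):
    """Made-up int subclass, standing in for any non-builtin integer type."""


bins = MyInt(5)

# Exact type comparison: fails for anything that is not literally ``int``.
exact_match = type(bins) == int

# Abstract check: accepts any registered integer type.
abstract_match = isinstance(bins, numbers.Integral)

print(exact_match, abstract_match)
```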
ax = g.hist(bins=bins, alpha=0.7, equal_bins=True)
both_hists_max = g.apply(lambda x: max(
    np.histogram(x, bins=bin_range)[0])).max()
assert ax.iloc[0].get_ylim()[1] >= both_hists_max
Is it not possible to make a stronger assertion here about the actual binned values?
Good suggestion! I'm comparing bar widths in the new test.
Codecov Report
@@ Coverage Diff @@
## master #22228 +/- ##
==========================================
- Coverage 92.08% 92.08% -0.01%
==========================================
Files 169 169
Lines 50691 50698 +7
==========================================
+ Hits 46681 46687 +6
- Misses 4010 4011 +1
pandas/core/groupby/groupby.py
Outdated
def curried_with_axis(x):
    return f(x, *args, **kwargs_with_axis)

def curried(x):
    return f(x, *args, **kwargs)
    return f(x, *args, **kwargs_wo_axis)
Line 611 will be run because 'hist' is in `base.plotting_methods` and the result is returned, so this change should only have a local effect.
@@ -2470,8 +2470,11 @@ def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
    bin edges are calculated and returned. If bins is a sequence, gives
    bin edges, including left edge of first bin and right edge of last
    bin. In this case, bins is returned unmodified.
bins: integer, default 10
    Number of histogram bins to be used
This is repeated and seems redundant.
Repeated where? This should still stay, no?
`bins` is defined on lines 2468 and 2473. It looks like a mistake made when writing the docstring. What do you think?
Ah OK didn't see that. This is fine then
pandas/plotting/_core.py
Outdated
@@ -2480,6 +2483,7 @@ def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
matplotlib.axes.Axes.hist : Plot a histogram using matplotlib.

"""
# TODO: separate docstrings of series and groupby hist functions (GH-22241)
Don't need this comment here - the open issue should suffice
g = df.groupby('group')['rand']
ax = g.hist(bins=bins, alpha=0.7, equal_bins=True)[0]
bin_width_group0 = ax.patches[0].get_width()
bin_width_group1 = ax.patches[bins].get_width()
Instead of using `np.random`, can you not define a set of data whereby you can also easily assert that both groups have an equal number of bins across the same X-scale?
I can replace `np.random` with some predefined values. But I think that comparing the number of bins wouldn't do the job, since that is already the current behavior of the code: here we only change the range, not the number of bins.
I've written a new test that compares the x-axis values of the bins, which should be more rigorous. For some reason the functions see kwargs differently in Python 2 and 3, so some CI tests fail in the former. I'll make a commit once I figure out what's going on.
@@ -12,6 +12,23 @@
from pandas.tests.plotting.common import TestPlotBase


@td.skip_if_no_mpl
def test_hist_bins_match():
Move this function into the class that already exists in the module
pandas/core/groupby/groupby.py
Outdated
if 'axis' not in kwargs_with_axis or \
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis

if name == 'hist' and kwargs_wo_axis.pop('equal_bins', False):
I don't think another copy of kwargs is the cleanest solution here - can you not just get from the kwargs dict instead of popping?
In that case `matplotlib.pyplot.hist` throws an error that `equal_bins` is not recognized. An alternative solution that I can think of is to pass `equal_bins` as a named argument to `hist_series` in pandas/plotting/_core.py. But then it will be a dummy variable that's not used within the function. Any suggestions?
To clarify a bit more ~~~, in python3 (but for some reason not python2),~~~ any extra argument passed to kwargs will be read in this line (as `kwds`):
pandas/pandas/plotting/_core.py
Line 2501 in 0370740
ax.hist(values, bins=bins, **kwds)
There is a python2 versus python3 discrepancy that's causing some CI tests to fail. I opened #22285 to deal with it.
Merge branch 'master' of https://github.com/pandas-dev/pandas into hist_match_bins_tmp1_rebased
One job failed for no reason. Otherwise it's green.
@@ -2443,7 +2443,7 @@ def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None,

def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
                xrot=None, ylabelsize=None, yrot=None, figsize=None,
                bins=10, **kwds):
                bins=10, equal_bins=False, **kwds):
I'm trying the alternative approach here. `equal_bins` will be unused within the function. This is just to avoid popping `kwds`.
This also takes care of the py2/3 compat issue.
Might have missed the compat issue but if you are using `.get` it shouldn't matter whether this exists as a keyword or not
`kwds` can only contain `matplotlib.pyplot.hist` input args (b/c of line 2504), and an error is thrown if `equal_bins` is a keyword.
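The forwarding problem described here can be reproduced without matplotlib. In the sketch below, `backend_hist` is a made-up stand-in for a plotting call that accepts only a fixed set of keywords, and both `hist_series_*` functions are simplified illustrations rather than the pandas implementations.

```python
def backend_hist(values, bins=10, alpha=1.0):
    """Stand-in for a plotting backend that accepts only known keywords."""
    return {"n_values": len(values), "bins": bins, "alpha": alpha}


def hist_series_old(values, bins=10, **kwds):
    # Every keyword the signature does not name ends up in ``kwds``
    # and is forwarded to the backend as-is.
    return backend_hist(values, bins=bins, **kwds)


def hist_series_new(values, bins=10, equal_bins=False, **kwds):
    # ``equal_bins`` is captured by the signature, so it never reaches
    # ``kwds`` (it is deliberately unused in this function body).
    return backend_hist(values, bins=bins, **kwds)


try:
    hist_series_old([1, 2, 3], equal_bins=True)
except TypeError as exc:
    print("old signature:", exc)

print("new signature:", hist_series_new([1, 2, 3], equal_bins=True))
```

Naming the argument in the signature keeps it out of `kwds` at the cost of a parameter that the function itself never reads, which is the trade-off raised in the thread.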
@pytest.mark.parametrize(
    'bins, equal_bins',
    zip([5, None, np.linspace(-3, 3, 10)], [True, False])
)
Trying all "possible" forms of `bins` with and without `equal_bins`
Would still prefer if you could be explicit about what is expected here, i.e. add `expected` as a third parameter here and assert that we get the correct number of bins across the two groups
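One detail worth noting about the parametrization above: `zip` stops at the shorter input, so it yields two cases rather than every combination. Assuming all combinations of `bins` and `equal_bins` were intended, `itertools.product` would generate them. In this sketch the string `"edges"` is a placeholder for the array of bin edges.

```python
import itertools

bins_options = [5, None, "edges"]  # "edges" is a placeholder for a bin-edge array
equal_bins_options = [True, False]

# zip pairs items positionally and stops at the shorter sequence.
paired = list(zip(bins_options, equal_bins_options))

# product yields every (bins, equal_bins) combination.
crossed = list(itertools.product(bins_options, equal_bins_options))

print(paired)
print(len(crossed))
```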
pandas/core/groupby/groupby.py
Outdated
@@ -581,6 +582,17 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis
if (name == 'hist' and
        kwargs.get('equal_bins', False) is True):
Do we need `is True` here or can we use the implicit truthiness of objects?
It's just a bit safer this way, as it guards against some non-boolean values passed by the user, such as `equal_bins=None`. But your suggestion should be fine if we ignore bad inputs from the user.
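To make the difference between the two checks concrete, here is a small comparison. The function names `strict` and `truthy` are made up for this sketch. Note that `None` is rejected by both checks; they only disagree on truthy non-boolean values such as `1` or `"yes"`.

```python
def strict(flag):
    """Accept only the literal ``True`` object."""
    return flag is True


def truthy(flag):
    """Accept anything Python considers truthy."""
    return bool(flag)


for value in (True, False, None, 0, 1, "yes"):
    print(repr(value), strict(value), truthy(value))
```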
pandas/core/groupby/groupby.py
Outdated
bins = kwargs.get('bins')
if bins is None:
    bins = 10  # use default value used in `hist_series`
if is_integer(bins):
Could just use `is_scalar` here since we only need to guard against scalar vs sequence
pandas/core/groupby/groupby.py
Outdated
        kwargs.get('equal_bins', False) is True):
    # GH-22222
    bins = kwargs.get('bins')
    if bins is None:
What is the purpose of this? Doesn't the keyword already default to 10?
The keyword defaults to 10 after the function is called (`curried_axis`), but we need its value before that (on line 593)
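The point about defaults can be shown in isolation: a wrapper that receives `**kwargs` only sees what the caller passed explicitly, and the wrapped function's defaults apply only once it is called. The functions below are simplified stand-ins, not the pandas code; reading the default through `inspect.signature` is one possible alternative to hard-coding `10`.

```python
import inspect


def hist_series(values, bins=10, **kwds):
    """Simplified stand-in for the wrapped plotting function."""
    return bins


def wrapper(**kwargs):
    # Only explicitly passed keywords are visible here.
    seen = kwargs.get("bins")
    # The declared default can be read from the signature instead of
    # being duplicated as a literal.
    default = inspect.signature(hist_series).parameters["bins"].default
    return seen, default


print(wrapper())        # (None, 10)
print(wrapper(bins=5))  # (5, 10)
```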
@td.skip_if_no_mpl
class TestDataFrameGroupByPlots(TestPlotBase):
    @pytest.mark.parametrize(
Readability nit but let's add an empty line here
index=[0, 1], columns=['min', 'max'])
group_ranges = g.agg([min, max])
assert np.isclose(group_ranges, hist_ranges).all()
tm.close()
Not sure what the purpose of this call is?
`tm.close()` closes any open figure. Otherwise, because of `pytest.mark.parametrize`, we will keep plotting on the same figure and accumulate counts.
axes = g.hist(bins=bins, alpha=0.7, equal_bins=equal_bins)
ax = axes[0]

if bins is None:
Just generally avoid this type of code in tests - you aren't asserting anything about the functionality of the code being tested but rather just adding separate logic here in the test case
The test should be slightly different, depending on the type provided for `bins`. Do you recommend dropping the `parametrize` decorator and writing separate tests for these instead?
@@ -640,6 +640,7 @@ Plotting

- Bug in :func:`DataFrame.plot.scatter` and :func:`DataFrame.plot.hexbin` caused x-axis label and ticklabels to disappear when colorbar was on in IPython inline backend (:issue:`10611`, :issue:`10678`, and :issue:`20455`)
- Bug in plotting a Series with datetimes using :func:`matplotlib.axes.Axes.scatter` (:issue:`22039`)
- The new argument ``equal_bins`` in :func:`SeriesGroupBy.hist` sets histogram bins to equal values (:issue:`22222`)
This is actually the `hist` method of a `SeriesGroupBy` object, although that's not available in the docs:
Should it be added?
This is available in api.rst if that suffices:
The following methods are available in both ``SeriesGroupBy`` and
``DataFrameGroupBy`` objects, but may differ slightly, usually in that
the ``DataFrameGroupBy`` version usually permits the specification of an
axis argument, and often an argument indicating whether to restrict
application to columns of a specific data type.
.. autosummary::
:toctree: generated/
...
DataFrameGroupBy.hist
Just change the reference to DataFrameGroupBy.hist in the whatsnew - should suffice
@td.skip_if_no_mpl
class TestDataFrameGroupByPlots(TestPlotBase):

    @pytest.mark.parametrize('equal_bins, bins, expected1, expected2', (
Need two `expected` variables for the two groups. They correspond to the leftmost part of each histogram bar on the x-axis.
tm.assert_almost_equal(points[:num_bins], np.array(expected1))
tm.assert_almost_equal(points[num_bins:], np.array(expected2))
tm.close()
This is to close the figure (to avoid overwriting it during `parametrize`)
@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
    # GH-22222
    bins = kwargs.get('bins')
do not add anything here
this is obscuring already extremely fragile code
Do you mean not in the wrapper? Or also beyond that? Is there any other place you'd suggest?
@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
    # GH-22222
    bins = kwargs.get('bins')
Hmm wondering if we even need to use .get here - can we just access by index? I'm assuming that 'equal_bins' should always be there so a KeyError when it's not would be fine
Same py2/3 issue. In py3, `equal_bins`, `bins`, etc. do not take their default values unless the code inside the `hist_series` function is being run.
@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
        kwargs_with_axis['axis'] is None:
    kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
Can just do `kwargs.get('equal_bins')` - no need for the `False` argument
if name == 'hist' and kwargs.get('equal_bins', False):
    # GH-22222
    bins = kwargs.get('bins')
    if bins is None:
Can you point me to where `bins` doesn't default to 10?
(True, 10,
 [-3., -2.4, -1.8, -1.2, -0.6, 0., 0.6, 1.2, 1.8, 2.4],
 [-3., -2.4, -1.8, -1.2, -0.6, 0., 0.6, 1.2, 1.8, 2.4]),
(True, None,
Assuming the default is always 10 for `bins` then `None` is not a valid argument here and should in fact raise an error instead of giving valid results
points = np.array([patch.get_bbox().get_points()
                   for patch in ax.patches])[:, 0, 0]

tm.assert_almost_equal(points[:num_bins], np.array(expected1))
Would prefer to use `tm.assert_numpy_array_equal` instead here and on line below
Making a new function that does this is pretty straightforward:
diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 3f84fa0..2d324f4 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -1001,6 +1001,13 @@ class GroupBy(_GroupBy):
len(grouped) : int
Number of groups
"""
+ def new_hist(self, bins=10, equal_bins=False, **kwargs):
+ if equal_bins and is_scalar(bins):
+ # share the same numpy array for all group bins
+ bins = np.linspace(self.obj.min(),
+ self.obj.max(), bins + 1)
+ return self.hist(bins=bins, **kwargs)
+
def _bool_agg(self, val_test, skipna):
"""Shared func to call any / all Cython GroupBy implementations"""
but I don't know how to overload
Closing as stale. Ping if you'd like to pick this back up |
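For anyone picking this up again, the shared-bins idea in the proposed `new_hist` (compute one set of edges from the overall minimum and maximum and reuse it for every group) can be sketched in plain Python. The helper `shared_bin_edges` and the sample data are made up for illustration; the pandas proposal uses `np.linspace` on the grouped object instead.

```python
def shared_bin_edges(groups, bins=10):
    """Return ``bins + 1`` evenly spaced edges spanning all groups.

    Hypothetical helper; ``groups`` maps a group label to its values.
    """
    lo = min(min(values) for values in groups.values())
    hi = max(max(values) for values in groups.values())
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


groups = {"a": [-3.0, -1.0, 0.5], "b": [0.0, 2.0, 3.0]}
edges = shared_bin_edges(groups, bins=10)
print(len(edges), edges[0], edges[-1])
```

Passing the same edge sequence as `bins` for every group is what makes the bars line up, since a sequence of edges is used as given rather than recomputed from each group's range.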