Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: option for groupby.hist to match bins #22228

Closed
wants to merge 14 commits into from

Conversation

javadnoorb
Copy link
Contributor

@javadnoorb javadnoorb commented Aug 7, 2018

Example code and result:

%matplotlib inline
import numpy as np
import pandas as pd

def test_hist_bins_match():
    # GH-22222
    N = 100
    bins = 5

    np.random.seed(0)
    df = pd.DataFrame(np.append(np.random.randn(N), np.random.randn(N) / 10),
                   columns=['rand'])
    df['group'] = [0] * N + [1] * N
    g = df.groupby('group')['rand']
    bin_range = np.linspace(df['rand'].min(), df['rand'].max(), bins + 1)

    ax = g.hist(bins=bins, alpha=0.7, equal_bins=True)
    both_hists_max = g.apply(lambda x: max(
        np.histogram(x, bins=bin_range)[0])).max()
    assert ax.iloc[0].get_ylim()[1] >= both_hists_max

test_hist_bins_match()

image

@javadnoorb javadnoorb changed the title ENH: option to for goupby.hist to match bins ENH: option for goupby.hist to match bins Aug 7, 2018
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR!

@@ -582,6 +582,15 @@ def wrapper(*args, **kwargs):
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis

if (name == 'hist' and
kwargs.pop('equal_bins', False)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you get the same behavior without mutating the kwargs dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I can pass equal_bins to hist_series. Do you also mean this for line 588? Just asking, because that might be a bit tricky to pipe through.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Less concerned about 588 since you end up replacing that anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I ended up copying kwargs and mutating that.

kwargs.pop('equal_bins', False)):
# GH-22222
bins = kwargs.pop('bins', None)
if type(bins) == int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be preferable I think to use is_integer from pandas.core.dtypes.inference

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

ax = g.hist(bins=bins, alpha=0.7, equal_bins=True)
both_hists_max = g.apply(lambda x: max(
np.histogram(x, bins=bin_range)[0])).max()
assert ax.iloc[0].get_ylim()[1] >= both_hists_max
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not possible to make a stronger assertion here about the actual binned values?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion! I'm comparing bar widths in the new test.

@javadnoorb javadnoorb changed the title ENH: option for goupby.hist to match bins ENH: option for groupby.hist to match bins Aug 9, 2018
@codecov
Copy link

codecov bot commented Aug 9, 2018

Codecov Report

Merging #22228 into master will decrease coverage by <.01%.
The diff coverage is 85.71%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #22228      +/-   ##
==========================================
- Coverage   92.08%   92.08%   -0.01%     
==========================================
  Files         169      169              
  Lines       50691    50698       +7     
==========================================
+ Hits        46681    46687       +6     
- Misses       4010     4011       +1
Flag Coverage Δ
#multiple 90.49% <85.71%> (-0.01%) ⬇️
#single 42.33% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/plotting/_core.py 83.48% <ø> (ø) ⬆️
pandas/core/groupby/groupby.py 96.04% <85.71%> (-0.12%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ab274c2...adccd1a. Read the comment docs.

def curried_with_axis(x):
return f(x, *args, **kwargs_with_axis)

def curried(x):
return f(x, *args, **kwargs)
return f(x, *args, **kwargs_wo_axis)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 611 will be run b/c 'hist' is in base.plotting_methods, the result is returned, so this change should only have a local effect.

@@ -2470,8 +2470,11 @@ def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
bin edges are calculated and returned. If bins is a sequence, gives
bin edges, including left edge of first bin and right edge of last
bin. In this case, bins is returned unmodified.
bins: integer, default 10
Number of histogram bins to be used
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is repeated and seems redundant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeated where? This should still stay, no?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bins is defined on line 2468 and 2473. It looks like a mistake when writing the docstring. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah OK didn't see that. This is fine then

@@ -2470,8 +2470,11 @@ def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
bin edges are calculated and returned. If bins is a sequence, gives
bin edges, including left edge of first bin and right edge of last
bin. In this case, bins is returned unmodified.
bins: integer, default 10
Number of histogram bins to be used
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeated where? This should still stay, no?

@@ -2480,6 +2483,7 @@ def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
matplotlib.axes.Axes.hist : Plot a histogram using matplotlib.

"""
# TODO: separate docstrings of series and groupby hist functions (GH-22241)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need this comment here - the open issue should suffice

g = df.groupby('group')['rand']
ax = g.hist(bins=bins, alpha=0.7, equal_bins=True)[0]
bin_width_group0 = ax.patches[0].get_width()
bin_width_group1 = ax.patches[bins].get_width()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using np.random can you not define a set of data whereby you can also easily assert that both groups have an equal number of bins across the same X-scale?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can replace np.random with some predefined values. But I think that comparing the number of bins wouldn't do the job. That's the current state of the code. We're here just changing the range without changing the number of bins.

I've written a new test, that compares the x-axis values of the bins, that should be more rigorous. For some reason the functions see kwargs differently in python 2 and 3, so some CI tests fail in the former. I'll make a commit once I figure out what's going on.

@@ -12,6 +12,23 @@
from pandas.tests.plotting.common import TestPlotBase


@td.skip_if_no_mpl
def test_hist_bins_match():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this function into the class that already exists in the module

if 'axis' not in kwargs_with_axis or \
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis

if name == 'hist' and kwargs_wo_axis.pop('equal_bins', False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think another copy of kwargs is the cleanest solution here - can you not just get from the kwargs dict instead of popping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case matplotltib.pyplot.hist throws an error that equal_bins is not recognized. An alternative solution that I can think of is to pass equal_bins as a named argument to hist_series in pandas/plotting/_core.py. But then it will be a dummy variable that's not used within the function. Any suggestions?

Copy link
Contributor Author

@javadnoorb javadnoorb Aug 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify a bit more ~~~, in python3 (but for some reason not python2),~~~ any extra argument passed to kwargswill be read in this line (as kwds):

ax.hist(values, bins=bins, **kwds)

@javadnoorb
Copy link
Contributor Author

javadnoorb commented Aug 11, 2018

There is a python2 versus python3 discrepancy that's causing some CI tests to fail. I opened #22285 to deal with it.

@javadnoorb
Copy link
Contributor Author

One job failed for no reason. Otherwise it's green.

@@ -2443,7 +2443,7 @@ def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None,

def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
xrot=None, ylabelsize=None, yrot=None, figsize=None,
bins=10, **kwds):
bins=10, equal_bins=False, **kwds):
Copy link
Contributor Author

@javadnoorb javadnoorb Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm trying the alternative approach here. equal_bins will be unused within the function. This is just to avoid popping kwds.

This also takes care of py2/3 compat issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have missed the compat issue but if you are using .get it shouldn't matter whether this exists as a keyword or not

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kwds can only contain matploltib.pyplot.hist input args (b/c of line 2504) and throws an error if equal_bins is a keyword.

@pytest.mark.parametrize(
'bins, equal_bins',
zip([5, None, np.linspace(-3, 3, 10)], [True, False])
)
Copy link
Contributor Author

@javadnoorb javadnoorb Aug 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying all "possible" forms of bins with and without equal_bins

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would still prefer if you could be explicit about what is expected here, i.e. add expected as a third parameter here and assert that we get the correct number of bins across the two groups

@@ -581,6 +582,17 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis
if (name == 'hist' and
kwargs.get('equal_bins', False) is True):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need is True here or can we use the implicit truthiness of objects?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a bit safer this way, as it guards against some non-boolean statements by the user, such as equal_bins=None. But your suggestion should be fine if we ignore bad inputs from user.

bins = kwargs.get('bins')
if bins is None:
bins = 10 # use default value used in `hist_series`
if is_integer(bins):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could just use is_scalar here since we only need to guard against scalar vs sequence

kwargs.get('equal_bins', False) is True):
# GH-22222
bins = kwargs.get('bins')
if bins is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of this? Doesn't the keyword already default to 10?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keyword defaults to 10 after the function is called (curried_axis), but we need its value before that (on line 593)



@td.skip_if_no_mpl
class TestDataFrameGroupByPlots(TestPlotBase):
@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Readability nit but let's add an empty line here

index=[0, 1], columns=['min', 'max'])
group_ranges = g.agg([min, max])
assert np.isclose(group_ranges, hist_ranges).all()
tm.close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what the purpose of this call is?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tm.close() closes any open figure. Otherwise because of pytest.parameterize we will keep plotting on the same figure and accumulate counts.

axes = g.hist(bins=bins, alpha=0.7, equal_bins=equal_bins)
ax = axes[0]

if bins is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just generally avoid this type of code in tests - you aren't asserting anything about the functionality of the code being tested but rather just adding separate logic here in the test case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test should be slightly different, depending on the type provided for bins. Do you recommend dropping parametrize decorator and writing separate tests for these instead?

@pytest.mark.parametrize(
'bins, equal_bins',
zip([5, None, np.linspace(-3, 3, 10)], [True, False])
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would still prefer if you could be explicit about what is expected here, i.e. add expected as a third parameter here and assert that we get the correct number of bins across the two groups

@@ -640,6 +640,7 @@ Plotting

- Bug in :func:`DataFrame.plot.scatter` and :func:`DataFrame.plot.hexbin` caused x-axis label and ticklabels to disappear when colorbar was on in IPython inline backend (:issue:`10611`, :issue:`10678`, and :issue:`20455`)
- Bug in plotting a Series with datetimes using :func:`matplotlib.axes.Axes.scatter` (:issue:`22039`)
- The new argument ``equal_bins`` in :func:`SeriesGroupBy.hist` sets histogram bins to equal values (:issue:`22222`)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually hist method of a SeriesGroupBy object, although that't not available in the docs:

https://pandas.pydata.org/pandas-docs/stable/api.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should it be added?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is available in api.rst if that suffices:

The following methods are available in both ``SeriesGroupBy`` and
``DataFrameGroupBy`` objects, but may differ slightly, usually in that
the ``DataFrameGroupBy`` version usually permits the specification of an
axis argument, and often an argument indicating whether to restrict
application to columns of a specific data type.

.. autosummary::
   :toctree: generated/

...
   DataFrameGroupBy.hist

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just change the reference to DataFrameGroupBy.hist in the whatsnew - should suffice



@td.skip_if_no_mpl
class TestDataFrameGroupByPlots(TestPlotBase):

@pytest.mark.parametrize('equal_bins, bins, expected1, expected2', (
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need two expected variables for the two groups. They correspond to the leftmost part of each histogram bar on the x-axis.


tm.assert_almost_equal(points[:num_bins], np.array(expected1))
tm.assert_almost_equal(points[num_bins:], np.array(expected2))
tm.close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is to close the figure (to avoid overwriting it during parametrize)

@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
# GH-22222
bins = kwargs.get('bins')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do not add anything here

this is obscuring already extremely fragile code

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean not in the wrapper? Or also beyond that? Is there any other place you'd suggest?

@@ -640,6 +640,7 @@ Plotting

- Bug in :func:`DataFrame.plot.scatter` and :func:`DataFrame.plot.hexbin` caused x-axis label and ticklabels to disappear when colorbar was on in IPython inline backend (:issue:`10611`, :issue:`10678`, and :issue:`20455`)
- Bug in plotting a Series with datetimes using :func:`matplotlib.axes.Axes.scatter` (:issue:`22039`)
- The new argument ``equal_bins`` in :func:`SeriesGroupBy.hist` sets histogram bins to equal values (:issue:`22222`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just change the reference to DataFrameGroupBy.hist in the whatsnew - should suffice

@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
# GH-22222
bins = kwargs.get('bins')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm wondering if we even need to use .get here - can we just access by index? I'm assuming that 'equal_bins' should always be there so a KeyError when it's not would be fine

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same py2/3 issue. In py3 equal_bins, bins, etc do not take their default values unless the code inside hist_series function is being run.

@@ -2443,7 +2443,7 @@ def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None,

def hist_series(self, by=None, ax=None, grid=True, xlabelsize=None,
xrot=None, ylabelsize=None, yrot=None, figsize=None,
bins=10, **kwds):
bins=10, equal_bins=False, **kwds):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might have missed the compat issue but if you are using .get it shouldn't matter whether this exists as a keyword or not

@@ -581,6 +581,16 @@ def wrapper(*args, **kwargs):
if 'axis' not in kwargs_with_axis or \
kwargs_with_axis['axis'] is None:
kwargs_with_axis['axis'] = self.axis
if name == 'hist' and kwargs.get('equal_bins', False):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just do kwargs.get('equal_bins') - no need for the False argument

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason py3 and py2 kwargs behavior are different. In py3 kwargs defaults are not set here if the parameter is not provided by the user. I opened an issue on this (#22285). I'm not sure where to move the code (per @jreback's comment) to avoid this.

if name == 'hist' and kwargs.get('equal_bins', False):
# GH-22222
bins = kwargs.get('bins')
if bins is None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you point me to where bins doesn't default to 10?

(True, 10,
[-3., -2.4, -1.8, -1.2, -0.6, 0., 0.6, 1.2, 1.8, 2.4],
[-3., -2.4, -1.8, -1.2, -0.6, 0., 0.6, 1.2, 1.8, 2.4]),
(True, None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming the default is always 10 for bins then None is not a valid argument here and should in fact raise an error instead of giving valid results

points = np.array([patch.get_bbox().get_points()
for patch in ax.patches])[:, 0, 0]

tm.assert_almost_equal(points[:num_bins], np.array(expected1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would prefer to use tm.assert_numpy_array_equal instead here and on line below

@javadnoorb
Copy link
Contributor Author

Making a new function that does this is pretty straightforward:

diff --git a/pandas/core/groupby/groupby.py b/pandas/core/groupby/groupby.py
index 3f84fa0..2d324f4 100644
--- a/pandas/core/groupby/groupby.py
+++ b/pandas/core/groupby/groupby.py
@@ -1001,6 +1001,13 @@ class GroupBy(_GroupBy):
     len(grouped) : int
         Number of groups
     """
+    def new_hist(self, bins=10, equal_bins=False, **kwargs):
+        if equal_bins and is_scalar(bins):
+            # share the same numpy array for all group bins
+            bins = np.linspace(self.obj.min(),
+                               self.obj.max(), bins + 1)
+        return self.hist(bins=bins, **kwargs)
+
     def _bool_agg(self, val_test, skipna):
         """Shared func to call any / all Cython GroupBy implementations"""

but I don't know how to overload SeriesGroupBy.hist without touching the wrappers, or some other sensitive part of the code. I defer this PR to anyone who knows how to do this.

@WillAyd
Copy link
Member

WillAyd commented Nov 23, 2018

Closing as stale. Ping if you'd like to pick this back up

@WillAyd WillAyd closed this Nov 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: groupby.hist bins don't match
3 participants