ENH: add np.nan funcs to _cython_table #21123

topper-123 · 2018-05-18T18:47:01Z

closes #19629
closes #21134

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This started as a copy of #19670 by @AaronCritchley, but has solved two bugs that the tests surfaced along the way.

Bug 1:
There is currently a bug in df.aggregate, where the method incorrectly defers to df.apply in a corner case. This only shows up in the result when using numpy < 1.13 and passing np.nan* functions to df.aggregate. This is the reason for the change in base.py line 571. (See #8383 for further details on the bug in numpy < 1.13 and how it affects pandas.)

Bug 2:
Passing builtins to df.aggregate is ok when axis=0, but gives the wrong result when axis=1 (#21134). The reason for this difference is that df.aggregate defers to df._aggregate when axis=0, but defers to df.apply when axis=1, and these give different results when passed functions and the series/frame contains NaN values. This can be solved by transposing df and deferring the transposed frame to its _aggregate method when axis=1.

The added tests have been heavily parametrized (this helped unearth the bugs above). They have been placed in series/test_apply.py and frame/test_apply.py, as a lot of other tests for ser/df.aggregate were already there.
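The difference described under Bug 2 can be illustrated without pandas. The following is a minimal sketch, not the pandas implementation: `TinySeries` and `_func_table` are hypothetical stand-ins for a Series and for `_cython_table`, and the NaN-skipping `sum` method only mimics the default behaviour of pandas reductions. It shows why looking a callable up in a table first gives a different answer from applying the callable directly when NaN is present.

```python
import builtins
import math


class TinySeries:
    """Hypothetical stand-in for a Series; this is not pandas code."""

    def __init__(self, values):
        self.values = list(values)

    def sum(self):
        # NaN-skipping sum, mimicking the default skipna behaviour.
        return builtins.sum(v for v in self.values if not math.isnan(v))

    def max(self):
        # NaN-skipping max.
        return builtins.max(v for v in self.values if not math.isnan(v))

    # Maps callables to method names, in the spirit of _cython_table.
    _func_table = {builtins.sum: "sum", builtins.max: "max"}

    def agg(self, func):
        name = self._func_table.get(func)
        if name is not None:
            # Known callable: dispatch to the NaN-aware method.
            return getattr(self, name)()
        # Unknown callable: apply it to the raw values.
        return func(self.values)


s = TinySeries([math.nan, 2.0])
looked_up = s.agg(sum)      # goes through the table to TinySeries.sum
applied_raw = sum(s.values)  # builtin sum propagates NaN
```

In this sketch the table lookup plays the role of the `_aggregate` path, and the raw application plays the role of the `apply` path, which skips the lookup.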

codecov · 2018-05-18T23:21:31Z

Codecov Report

Merging #21123 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21123      +/-   ##
==========================================
+ Coverage   91.84%   91.84%   +<.01%     
==========================================
  Files         153      153              
  Lines       49505    49512       +7     
==========================================
+ Hits        45466    45473       +7     
  Misses       4039     4039

Flag	Coverage Δ
#multiple	`90.24% <100%> (ø)`	⬆️
#single	`41.88% <21.87%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/frame.py	`97.22% <100%> (-0.01%)`	⬇️
pandas/core/groupby/groupby.py	`92.66% <100%> (ø)`	⬆️
pandas/core/base.py	`96.87% <100%> (+0.04%)`	⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f6abb61...580edcf. Read the comment docs.

WillAyd · 2018-05-19T02:25:51Z

pandas/tests/test_nanops.py

+    pd.Series([1, 2, 3, 4, 5, 6]),
+    pd.DataFrame([[1, 2, 3], [4, 5, 6]])
+])
+def nan_test_object(request):


Should we add NA data to these?

WillAyd · 2018-05-19T02:27:43Z

pandas/core/base.py

-        if f and not args and not kwargs:
-            return getattr(self, f)(), None
+        if f:
+            return getattr(self, f)(*args, **kwargs), None


Is this equivalent? Wondering if there's any case where providing args / kwargs before would have routed the function to a different place

It's not, but this is better :-). This fixes the bug that was causing #19629 to fail.

The issue is subtle, but it has to do with a bug in numpy < 1.13 np.nan* functions. Numpy < 1.13 handles e.g. np.nanmin(pd_obj) incorrectly while it handles np.min(pd_obj) correctly. Numpy >= 1.13 handles both correctly. See #19753 for another issue regarding the same numpy problem.

Anyway, if f is a string, you should call getattr(self, f)(*args, **kwargs), so it was perhaps more luck than design that the previous version did work:-)
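The change in the diff above can be sketched in isolation. This is a simplified illustration, assuming that `name` is the method name found for the callable; `Values`, `dispatch_old`, and `dispatch_new` are hypothetical names, and returning `None` stands in for falling through to another code path.

```python
import builtins


class Values:
    """Hypothetical container with a method that accepts an extra argument."""

    def __init__(self, data):
        self.data = list(data)

    def sum(self, start=0):
        return builtins.sum(self.data, start)


def dispatch_old(obj, name, *args, **kwargs):
    # Before: the named method was only used when no extra arguments were given.
    if name and not args and not kwargs:
        return getattr(obj, name)()
    return None  # the caller would fall back to a different code path


def dispatch_new(obj, name, *args, **kwargs):
    # After: the named method is always used and the arguments are forwarded.
    if name:
        return getattr(obj, name)(*args, **kwargs)
    return None


v = Values([1, 2, 3])
old_result = dispatch_old(v, "sum", start=10)
new_result = dispatch_new(v, "sum", start=10)
```

With extra arguments, the old condition skips the named method entirely, while the new one forwards them.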

WillAyd · 2018-05-19T02:29:02Z

pandas/tests/test_nanops.py

+    (np.min, np.nanmin),
+])
+def test_np_nan_functions(standard, nan_method, nan_test_object):
+    tm.assert_almost_equal(nan_test_object.agg(standard),


Might be a silly question but can we not use the frame / series equals methods here? Don't think precision is that much of a factor with the fixtures

I think these are ok as nan_test_object can be either a series or a frame. I think the name should be changed though, to e.g. series_or_frame.

pep8speaks · 2018-05-19T08:50:32Z

Hello @topper-123! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on May 26, 2018 at 07:47 Hours UTC

WillAyd · 2018-05-19T16:43:15Z

doc/source/whatsnew/v0.23.1.txt

+^^^^^^^
+
+- :meth:`~DataFrame.agg` now correctly handles numpy NaN-aware methods like :meth:`numpy.nansum` (:issue:`19629`)
+- :meth:`~DataFrame.agg` now correctly handles built-in methods like ``sum`` when axis=1 (:issue:`19629`)


Don't think you meant to add this in this PR?

Well, without those changes the tests in test_apply don't pass. At the same time, the tests in test_apply are sufficient for testing for this bug, so these two issues are very related...

The fixed bug looks like this, BTW:

>>> df = pd.DataFrame([[np.nan, 2], [3, 4.]])
>>> df.agg(sum, axis=1)
0    NaN  # should say 2.0
1    7.0
dtype: float64

WillAyd · 2018-05-19T16:44:29Z

pandas/core/frame.py

-                result, how = self._aggregate(func, axis=0, *args, **kwargs)
-            except TypeError:
-                pass
+        df = self if not axis else self.T


Same thing?

Yeah, a bit. df.agg and df.apply with axis=1 give in many cases identical output, but not in all, as df.apply doesn't make the lookup in _cython_table. So, this is needed for some tests to pass.

AFAIK, transposition is cheap in numpy/pandas, so this is an ok approach?

I would rather you not do this, this is a quite hacky way of handling this, there is a small bug on the lower level i think.

I don't think so, as df._aggregate currently doesn't take an axis parameter. I could add an axis parameter to df._aggregate and do the transposition there instead?

you can try. this should be handled on a much lower level.

I've added the axis parameter to df._aggregate, so this is handled there.
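The approach settled on here, handling axis=1 by transposing before a column-wise aggregation, can be sketched on plain nested lists. This is only an illustration under that assumption; `aggregate` and `aggregate_columns` are hypothetical helpers, not pandas functions.

```python
def aggregate_columns(rows, func):
    # Apply func to each column of a row-major table.
    return [func(col) for col in zip(*rows)]


def aggregate(rows, func, axis=0):
    # For axis=1, transpose first so that the same column-wise code
    # path is used for both axes.
    if axis == 1:
        rows = [list(col) for col in zip(*rows)]
    return aggregate_columns(rows, func)


table = [[1, 2], [3, 4]]
by_column = aggregate(table, sum, axis=0)
by_row = aggregate(table, sum, axis=1)
```

Because both axes end up in `aggregate_columns`, any lookup or special-casing done there applies to axis=1 as well.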

WillAyd · 2018-05-19T16:45:45Z

pandas/tests/frame/test_apply.py

+        pd.DataFrame([[np.nan, 2], [3, 4]]),
+        pd.DataFrame(),
+    ])
+    @pytest.mark.parametrize(


Hmm wonder if this would be better as a shared fixture - thoughts?

yes pls make this a fixture (in conftest.py)

WillAyd · 2018-05-19T16:46:29Z

pandas/tests/frame/test_apply.py

+        # GH21123
+        np_func, str_func = cython_table_items
+
+        if isinstance(test_input, pd.DataFrame):


Don't need this type check since the parameters are all DataFrames

You are right. I was preparing for eventual tests where test_input would be tuple([frame, args, kwargs]), but of course it looks silly now, when I could not find such tests that make sense.

I'll remove that unless someone comes up with a test that requires args and/or kwargs

Also, the assert should now be tm.assert_frame_equal. I'll change that.

topper-123 · 2018-05-19T17:20:01Z

The travis failures say "Different tests were collected between gw1 and gw0. The difference is:..."

I don't think this has anything to do with my PR. Does anyone know?

WillAyd · 2018-05-19T17:22:51Z

I have not seen that before - I'd say let's just take a look after your next push and see if it repeats

jreback · 2018-05-19T17:39:51Z

pandas/core/frame.py

-                result, how = self._aggregate(func, axis=0, *args, **kwargs)
-            except TypeError:
-                pass
+        df = self if not axis else self.T


I would rather you not do this, this is a quite hacky way of handling this, there is a small bug on the lower level i think.

jreback · 2018-05-19T17:43:03Z

pandas/tests/frame/test_apply.py

+        pd.DataFrame([[np.nan, 2], [3, 4]]),
+        pd.DataFrame(),
+    ])
+    @pytest.mark.parametrize(


yes pls make this a fixture (in conftest.py)

jreback · 2018-05-19T17:43:20Z

pandas/tests/series/test_apply.py

+        pd.Series(),
+    ])
+    @pytest.mark.parametrize(
+        "cython_table_items",


topper-123 · 2018-05-20T07:11:27Z

All the failures are of the type

Different tests were collected between gw0 and gw1. The difference is:
--- gw0
+++ gw1

Everything passes locally. I don't think this has anything to do with my PR. I can try pushing again, but this has happened every time so far, so I don't think that will help...

topper-123 · 2018-05-21T08:49:01Z

The failure above has been solved. It was a mistake concerning python3.5 dicts in xdist in conftest.py: python3.5 randomizes dict retrieval, and of course xdist doesn't like randomized work distribution. I've made the params parameter in cython_table in conftest.py deterministic, so it doesn't happen again.
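A minimal sketch of the idea, assuming the parameters come from a dict that maps callables to names: sorting the items by the callable's `__name__` gives every worker the same order. The `table` below is a hypothetical three-entry stand-in, and the id format is the one shown in the conftest.py diff further down.

```python
import builtins

# Hypothetical stand-in for the function-to-name table.
table = {builtins.sum: "sum", builtins.max: "max", builtins.min: "min"}

# Sort by function name so the order does not depend on dict iteration order.
params = sorted(table.items(), key=lambda item: item[0].__name__)

# Readable, stable test ids built from the same items.
ids = ["({}-{!r})".format(func.__name__, name) for func, name in params]
```

If two callables shared a `__name__`, a secondary sort key would be needed to keep the order fully deterministic, since `sorted` preserves the input order for ties.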

jreback · 2018-05-21T10:54:58Z

pandas/core/groupby/groupby.py

@@ -4086,7 +4086,10 @@ def _post_process_cython_aggregate(self, obj):
    def aggregate(self, arg, *args, **kwargs):

        _level = kwargs.pop('_level', None)
-        result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
+        _agg_kwargs = kwargs.copy()


you can just list axis as a kwarg

That breaks a test. The issue is that axis can be considered to be supplied twice, and you may get (from the breaking test):

>>> _level, args, _agg_kwargs = None, (80,), {'axis': 0}
>>> self._aggregate(arg, _level=_level, *args, **_agg_kwargs)
TypeError: _aggregate() got multiple values for argument 'axis'

Not sure I understand this - what test is breaking? Perhaps the test is configured incorrectly?

Both 80 and 0 may be the value for parameter axis: The function signature is def _aggregate(self, arg, axis=0, *args, **kwargs), so the second unnamed argument (80) will be considered to be axis, but this clashes with the parameter in kwargs ({'axis': 0}), causing the exception.

To avoid this we'd prefer the signature to be def _aggregate(self, arg, *args, axis=0, **kwargs), but this syntax is only supported in Python3...

Where is that test located?

pandas\tests\groupby\test_groupby.py::test_pass_args_kwargs. It's the line agg_result = df_grouped.agg(np.percentile, 80, axis=0)
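The clash described in this thread can be reproduced with plain functions. This is a sketch with hypothetical module-level functions that only borrow the two signatures discussed above; they are not the pandas methods.

```python
def _aggregate(arg, axis=0, *args, **kwargs):
    # axis sits before *args, so the first extra positional argument binds to it.
    return arg, axis, args, kwargs


def _aggregate_kwonly(arg, *args, axis=0, **kwargs):
    # Python 3 only: axis is keyword-only, so positionals never bind to it.
    return arg, axis, args, kwargs


args, agg_kwargs = (80,), {"axis": 0}

try:
    _aggregate("percentile", *args, **agg_kwargs)
    error = None
except TypeError as exc:
    # 80 was bound to axis positionally, and axis=0 arrived as a keyword too.
    error = str(exc)

result = _aggregate_kwonly("percentile", *args, **agg_kwargs)
```

With the keyword-only form, 80 stays in `args` and `axis` is taken from the keyword, which is the behaviour the call `df_grouped.agg(np.percentile, 80, axis=0)` expects.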

jreback · 2018-05-21T10:55:18Z

pandas/conftest.py

+                       key=lambda x: x[0].__name__)),
+    ids=lambda x: "({}-{!r})".format(x[0].__name__, x[1]),
+)
+def cython_table_items(request):


add _fixture to the end of the name

jreback · 2018-05-21T10:56:48Z

pandas/tests/frame/test_apply.py

+        # GH21123
+        np_func, str_func = cython_table_items
+
+        tm.assert_almost_equal(df.agg(np_func),


use assert_frame_equal it provides stronger guarantees

Most of these aggregate to series (so assert_series_equal), but cumprod and cumsum are in _cython_table, so in that case a DataFrame is returned.

I could add a conditional, so the more correct assert is used each time.

jreback · 2018-05-21T10:56:56Z

pandas/tests/series/test_apply.py

+        # GH21123
+        np_func, str_func = cython_table_items
+
+        tm.assert_almost_equal(series.agg(np_func),


use assert_series_equal

jreback · 2018-05-21T10:58:07Z

pandas/tests/frame/test_apply.py

+    def test_agg_function_input(self, df, cython_table_items):
+        # test whether the functions (keys) in
+        # pd.core.base.SelectionMixin._cython_table give the same result
+        # as the related strings (values) when used in df.agg. Examples:


add an example which actually tests (say for sum, nansum) axis=1, IOW construct the resultant frame

this test doesn't actually test that axis=1 works, just that it matches with a string (which doesn't have tests itself)

I don't understand, could you expand?

This tests whether the results are the same, e.g. whether supplying np.sum gives the same result as supplying the string 'sum'. So it is correct that this doesn't verify the result. I considered that to be a different test, where you test against the string versions.

topper-123 · 2018-05-25T22:08:05Z

I've rewritten the tests to now have expected results also.

WillAyd · 2018-05-25T23:06:36Z

pandas/core/base.py

@@ -316,13 +331,14 @@ def _try_aggregate_string_function(self, arg, *args, **kwargs):

        raise ValueError("{arg} is an unknown string function".format(arg=arg))

-    def _aggregate(self, arg, *args, **kwargs):
+    def _aggregate(self, arg, axis=0, *args, **kwargs):


I still think this should be a separate PR. I know you mentioned there was some interweaving of test dependency, but I feel like we are injecting this keyword in here without any regard to existing test coverage for axis=1.

I can look into making a separate PR. That will have to be pulled in before this one, so the tests of this PR won't break.

If timing is a concern you can also xfail the axis=1 tests. Rebase thereafter would be minor

This has now been created as #21224

WillAyd · 2018-05-25T23:12:13Z

pandas/tests/frame/test_apply.py

+        df = inputs[0]
+        expected = inputs[1][str_func]
+
+        if isinstance(expected, type) and issubclass(expected, Exception):


Anything that raises should be done in a separate test, i.e. test_agg_function_input_raises

I agree in principle, but this test iterates over all items in _cython_table, of which some will fail on some inputs.

So I'd have to construct the tests quite a bit differently and probably the fixture in conftest.py couldn't be used (because it returns all combinations and I now have to select the relevant ones for each test method). So something like:

(builtins.sum, 'sum', 0), (np.sum, 'sum', 0), (np.nansum, 'sum', 0), etc...

which will be very inelegant and repetitive IMO. Is it not possible to bend this rule on this one (or give a hint on how to do it elegantly)?...

Hmm OK understood. May make sense as an exception then - don't have anything off the top of my head to improve but will think more about it
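The pattern under discussion, one parametrized test whose expected value is either a result or an exception class, can be sketched without pytest. `check_case` is a hypothetical helper; only the `isinstance(...) and issubclass(...)` check is taken from the diff above.

```python
def check_case(func, data, expected):
    # An exception class as the expected value means the call must raise it.
    if isinstance(expected, type) and issubclass(expected, Exception):
        try:
            func(data)
        except expected:
            return "raised"
        raise AssertionError("expected {}".format(expected.__name__))
    # Otherwise the expected value is compared with the result.
    assert func(data) == expected
    return "matched"


cases = [
    (sum, [1, 2, 3], 6),
    (max, [1, 2, 3], 3),
    (max, [], ValueError),  # max() of an empty sequence raises ValueError
]
outcomes = [check_case(func, data, expected) for func, data, expected in cases]
```

This keeps one table of inputs and expectations, at the cost of mixing the raising and non-raising cases in a single test body.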

WillAyd · 2018-05-25T23:13:42Z

pandas/tests/frame/test_apply.py

+                df.agg(np_func, axis=axis)
+                df.agg(str_func, axis=axis)
+        elif str_func in ('cumprod', 'cumsum'):
+            tm.assert_frame_equal(df.agg(np_func, axis=axis), expected)


For readability / consistency with other tests create a variable called result and assign to it before the call to assert_frame_equal

I've uploaded a changed version.

WillAyd · 2018-05-25T23:13:52Z

pandas/tests/series/test_apply.py

+            tm.assert_series_equal(series.agg(np_func), expected)
+            tm.assert_series_equal(series.agg(str_func), expected)
+        else:
+            tm.assert_almost_equal(series.agg(np_func), expected)


Should be assert_series_equal no?

series.agg(np_func) returns a scalar. I could use assert np.isclose(...)?
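For the scalar case, one thing to keep in mind is that `np.isclose` treats NaN as unequal to NaN unless `equal_nan=True` is passed. A minimal standard-library sketch of a NaN-aware scalar comparison follows; `scalars_almost_equal` is a hypothetical helper, not a pandas or numpy function.

```python
import math


def scalars_almost_equal(result, expected, rel_tol=1e-9):
    # Treat two NaNs as equal, which a plain closeness check does not.
    if math.isnan(result) and math.isnan(expected):
        return True
    return math.isclose(result, expected, rel_tol=rel_tol)


both_nan = scalars_almost_equal(math.nan, math.nan)
rounding = scalars_almost_equal(0.1 + 0.2, 0.3)
different = scalars_almost_equal(1.0, 2.0)
```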

WillAyd · 2018-05-26T23:01:53Z

pandas/tests/frame/test_apply.py

+        df = inputs[0]
+        expected = inputs[1][str_func]
+
+        if isinstance(expected, type) and issubclass(expected, Exception):


Hmm OK understood. May make sense as an exception then - don't have anything off the top of my head to improve but will think more about it

WillAyd · 2018-05-26T23:06:31Z

pandas/core/groupby/groupby.py

@@ -4086,7 +4086,10 @@ def _post_process_cython_aggregate(self, obj):
    def aggregate(self, arg, *args, **kwargs):

        _level = kwargs.pop('_level', None)
-        result, how = self._aggregate(arg, _level=_level, *args, **kwargs)
+        _agg_kwargs = kwargs.copy()


Not sure I understand this - what test is breaking? Perhaps the test is configured incorrectly?

WillAyd · 2018-05-26T23:09:41Z

pandas/core/frame.py

+        try:
+            result, how = self._aggregate(func, axis=axis, *args, **kwargs)
+        except TypeError:
+            pass
        if result is None:


Related to the axis change, do we still hit this condition?

The axis change is related to #21134. So if I move that to a separate PR, this will move too.

Yep that is expected just wanted to see if we still needed it (regardless of the PR it appears in)

jreback · 2018-05-29T01:00:53Z

as @WillAyd indicates, can you split this up into a cython table PR and on top of that the agg fixes?

first should be straightforward and we can get it quickly. pls put tests, changes and whatsnew for that one (in this PR is fine), and issue another PR for other changes.

jreback · 2018-06-19T01:42:24Z

pandas/conftest.py

+
+@pytest.fixture(
+    # params: Python 3.5 randomizes dict access and xdist doesn't like that
+    # in fixtures. In order to get predetermined values we need to sort


so would like to have this fixture in your other PR

topper-123 · 2018-07-28T23:08:20Z

Closing in favor of #22109.

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 0537897 to b0a7a0f Compare May 18, 2018 23:21

WillAyd requested changes May 19, 2018

topper-123 force-pushed the np-nan-funcs-to-cython-map branch 8 times, most recently from 2313461 to 075427d Compare May 19, 2018 16:36

WillAyd requested changes May 19, 2018

topper-123 changed the title ~~WIP/ENH: add np.nan funcs to _cython_table~~ ENH: add np.nan funcs to _cython_table May 19, 2018

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 075427d to 73048a1 Compare May 19, 2018 17:30

jreback requested changes May 19, 2018

jreback added the Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels May 19, 2018

topper-123 force-pushed the np-nan-funcs-to-cython-map branch 4 times, most recently from 2f0c0bc to 6ece143 Compare May 19, 2018 23:15

tp added 2 commits May 20, 2018 23:44

add np.nan funcs to _cython_table

aa1b457

test changes

39e2e59

topper-123 force-pushed the np-nan-funcs-to-cython-map branch 3 times, most recently from 2e40325 to f353376 Compare May 21, 2018 00:25

add tests for func input to .agg to TestDataFrameAggregate and TestSeriesAggregate

5ec7e18

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from f353376 to 5ec7e18 Compare May 21, 2018 08:59

jreback requested changes May 21, 2018

tp added 2 commits May 21, 2018 13:21

comments addressed for adding nan_funcs to _cython_table

f91b716

new tests

396b327

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 327f1a9 to 396b327 Compare May 25, 2018 20:36

WillAyd requested changes May 25, 2018

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 9a053b8 to 3399bcd Compare May 26, 2018 02:06

changed tests

580edcf

topper-123 force-pushed the np-nan-funcs-to-cython-map branch from 3399bcd to 580edcf Compare May 26, 2018 07:46

WillAyd requested changes May 26, 2018

This was referenced May 27, 2018

BUG: df.agg(sum, axis=1) uses different method than when axis=0 #21222

Closed

BUG: df.agg, df.transform and df.apply use different methods when axis=1 than when axis=0 #21224

Merged

jreback reviewed Jun 19, 2018

topper-123 closed this Jul 28, 2018
