API: clarify DataFrame.apply reduction on empty frames #6007

wolever · 2014-01-20T03:48:55Z

Add an is_reduction argument to DataFrame.apply to avoid undefined behavior when .apply is called on an empty DataFrame and the function being applied is only defined for valid inputs.

Currently, if the DataFrame is empty, a guess is made at the correct return value (either a DataFrame or a Series) by calling the function being applied with an empty Series as an argument:

if not all(self.shape):
    # How to determine this better?
    is_reduction = False
    try:
        is_reduction = not isinstance(f(_EMPTY_SERIES), Series)
    except Exception:
        pass

    if is_reduction:
        return Series(NA, index=self._get_agg_axis(axis))
    else:
        return self.copy()

For reduction functions which produce undefined results on unexpected input (e.g., a function which doesn't expect an empty argument), this means that the result of apply is also undefined.
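The guessing step can be isolated into a small standalone sketch. This is not the pandas source: `guess_is_reduction` is a hypothetical name, and the empty float Series is only assumed to stand in for `_EMPTY_SERIES`.

```python
import pandas as pd


def guess_is_reduction(func):
    """Guess whether func reduces a Series to a scalar by probing it with an empty Series."""
    probe = pd.Series([], dtype=float)  # assumed stand-in for _EMPTY_SERIES
    try:
        # A non-Series return value is taken to mean "this is a reduction".
        return not isinstance(func(probe), pd.Series)
    except Exception:
        # Any failure on the empty probe is swallowed, so the guess
        # silently falls back to "not a reduction".
        return False


print(guess_is_reduction(lambda s: s.sum()))  # True: sum returns a scalar
print(guess_is_reduction(lambda s: s * 2))    # False: result is still a Series
```

A reduction that raises on the empty probe is indistinguishable from an elementwise transform here, which is the ambiguity described above.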

This pull request adds an explicit is_reduction argument so that it's possible to explicitly control this otherwise undefined behavior.

Update: there has been the suggestion that the existing reduce argument should be used. Is this reasonable? The PR would be updated as follows:

Remove the is_reduction argument
Change the default value of reduce from True to None (to preserve the current behavior of checking the return value of the function being applied)
In the case of an empty DataFrame, treat reduce in the same way that I'm currently treating is_reduction
Otherwise treat reduce as normal
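The steps above could be sketched roughly as follows. This is an illustration of the proposal, not the code in the PR: `apply_on_empty` and `picky` are hypothetical names, and the helper only covers the empty-frame branch.

```python
import numpy as np
import pandas as pd


def apply_on_empty(frame, func, axis=0, reduce=None):
    """Hypothetical helper: decide what apply returns for an empty frame."""
    if reduce is None:
        # No explicit instruction: keep the old behavior and probe func
        # with an empty Series, treating any exception as "not a reduction".
        try:
            reduce = not isinstance(func(pd.Series([], dtype=float)), pd.Series)
        except Exception:
            reduce = False
    if reduce:
        # A reduction yields one value per column (axis=0) or per row (axis=1).
        labels = frame.columns if axis == 0 else frame.index
        return pd.Series(np.nan, index=labels, dtype=float)
    return frame.copy()


def picky(row):
    # Only defined for "real" rows; raises on the empty probe.
    return row.iloc[0]


empty = pd.DataFrame({"first": [], "second": []})
print(type(apply_on_empty(empty, picky, axis=1)).__name__)               # DataFrame (guess failed)
print(type(apply_on_empty(empty, picky, axis=1, reduce=True)).__name__)  # Series (explicit)
```

With reduce=None the caller gets the current guessing behavior, so existing calls keep working; passing True or False bypasses the probe entirely.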

Ref:

jreback · 2014-01-20T03:52:51Z

there is already a reduce argument

wolever · 2014-01-20T03:54:46Z

The reduce argument has no effect if the DataFrame is empty.

jreback · 2014-01-20T03:56:08Z

don't u think it would be better to fix that than add another keyword?

wolever · 2014-01-20T04:01:42Z

Would it make sense to change the default value of reduce to None (so as not to break existing calls to .apply() which depend on the current empty-data-frame-reduction-guessing), then replace is_reduction with reduce? (I don't grok the semantics of reduce well enough to say for sure)

jreback · 2014-01-20T04:05:03Z

that's possible I think

can u give a use case for this?

apply on an empty frame should just return the frame - I don't think a reduction makes sense

wolever · 2014-01-20T04:12:27Z

My use case is that I've got a reduction function which behaves poorly (raises an exception) when the current implementation calls the reduction function with an empty Series to try and guess the return value (https://github.com/wolever/pandas/blob/5ca822b35d196928fc9ef1b14d457553ea7f3e68/pandas/core/frame.py#L3437)

This causes a problem because it's surprising that .apply(…) returns a Series except when it's empty (which essentially means I've got to litter otherwise correct algorithms with checks to see if the input DataFrame is empty).

jreback · 2014-01-20T04:15:33Z

can u put a complete example up
as well as pandas version?

wolever · 2014-01-20T04:17:24Z

Specifically, the code I'm working with right now looks like this:

a = df.apply(row_reduction_func_a, axis=1)
b = df.apply(row_reduction_func_b, axis=1)
result = DataFrame({ "a": a, "b": b })

Which works fine, except when the df is empty (because then a will be a DataFrame and the DataFrame constructor raises an exception when it tries to create a column from a DataFrame)
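For reference, the kind of emptiness check this forces on callers might look like the sketch below. The wrapper `apply_row_reduction` and the stand-in reductions `count_a` and `count_x` are hypothetical; the real reduction functions are not shown in this thread.

```python
import pandas as pd


def apply_row_reduction(df, func):
    """Hypothetical wrapper: always return a Series from a row-wise reduction."""
    if df.empty:
        # Skip apply entirely so func is never probed with an empty Series.
        return pd.Series(index=df.index, dtype=float)
    return df.apply(func, axis=1)


def count_a(row):
    # Stand-in reduction: number of cells equal to "a".
    return (row == "a").sum()


def count_x(row):
    # Stand-in reduction: number of cells equal to "x".
    return (row == "x").sum()


for df in (pd.DataFrame({"first": ["a"], "second": ["x"]}),
           pd.DataFrame({"first": [], "second": []})):
    a = apply_row_reduction(df, count_a)
    b = apply_row_reduction(df, count_x)
    result = pd.DataFrame({"a": a, "b": b})
    print(result.shape)
```

The guard works, but every call site that combines reductions this way needs it.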

Pandas version 0.12.0 but this issue is present in trunk (see the associated test case).

jreback · 2014-01-20T04:19:11Z

can u put up the function itself along with a sample frame

wolever · 2014-01-20T04:20:00Z

Is there something wrong with the test cases I supplied?

jreback · 2014-01-20T04:23:19Z

no I am trying to see if there is a more general issue

your solution is treating the symptom not the problem

wolever · 2014-01-20T04:31:20Z

Here is an example of a reduction function which will behave poorly (forgive me for being unable to share my specific function):

>>> def row_reduction_function(row):
...    return (row == correct).sum()
>>> correct = pd.Series(["a", "b"], index=["a", "b"])

And when applied to a full DataFrame, it produces the expected result (a Series with one element):

>>> nonempty = pd.DataFrame({ "first": ["a"], "second": ["x"] })
>>> nonempty.apply(row_reduction_function, axis=1)
0    1
dtype: int64

But when applied to an empty DataFrame, it doesn't produce an empty Series; it produces a DataFrame:

>>> empty = pd.DataFrame({ "first": [], "second": [] })
>>> empty.apply(row_reduction_function, axis=1)
Empty DataFrame
Columns: [first, second]
Index: []
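The reason the guess lands on "not a reduction" for this function can be checked directly by calling it with an empty Series, the way the probe does. This is a sketch; the empty object Series is an assumed stand-in for the probe value, and the exact exception type may vary between pandas versions.

```python
import pandas as pd

correct = pd.Series(["a", "b"], index=["a", "b"])


def row_reduction_function(row):
    return (row == correct).sum()


# Mimic the probe: an empty Series stands in for a row.
try:
    row_reduction_function(pd.Series([], dtype=object))
    probe_failed = False
except Exception as exc:
    # The comparison against `correct` is not defined for an empty row.
    probe_failed = True
    print(type(exc).__name__)

print(probe_failed)
```

Because the probe's exception is swallowed, .apply never learns that the function returns a scalar for real rows.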

wolever · 2014-01-20T04:33:48Z

The problem is that, with an empty DataFrame, .apply tries to guess what the result should be, and in situations where the reduction function is only defined for a "real" row (as opposed to the empty row .apply uses to try and guess the result), the result is also undefined. There needs to be some way to explicitly define whether the result should be a Series or a DataFrame. If using the reduce argument for this purpose is appropriate, I'd be all for that!

MichaelWS · 2014-01-20T16:01:44Z

I think it would make more sense to have an empty frame in this example. I would rather not get an exception as missing data may throw exceptions in many functions.

What would be the expected behavior for this:
http://stackoverflow.com/questions/21225608/pandas-dataframe-applyf-axis-1-on-an-empty-dataframe-ignores-exceptions

wolever · 2014-01-20T16:52:08Z

@MichaelWS the issue is not one of "when should an empty DataFrame be returned versus an empty Series".

The problem is that, when the input DataFrame is empty, the result of .apply is undefined if the function being applied has undefined behavior for unexpected input.

wolever · 2014-01-20T17:08:57Z

@MichaelWS as for the question of expected behavior: my problem isn't with the expected behavior. My problem is that there's no way to override the current behavior and explicitly control the currently undefined behavior when the DataFrame is empty.

MichaelWS · 2014-01-20T17:24:12Z

I get your point of view, but I agree with jreback and think the average user would be a bit confused with the reduce and is_reduction combination. It might be easier to use the reduce keyword to handle the empty frame.

wolever · 2014-01-20T17:35:46Z

Absolutely agree.

In that case, do you agree that it makes sense to update the PR as follows:

Remove the is_reduction argument
Change the default value of reduce from True to None (to preserve the current behavior of checking the return value of the function being applied)
In the case of an empty DataFrame, treat reduce in the same way that I'm currently treating is_reduction
Otherwise treat reduce as normal

MichaelWS · 2014-01-20T17:45:54Z

That would be my preference, but I would wait for some others for a consensus.

wolever · 2014-01-20T17:47:32Z

Okay, cool. I'll update the description and see if we can get some more consensus. I don't grok reduce well enough to know for sure that it will be okay.

jreback · 2014-01-20T21:20:14Z

why don't u give a try by changing reduce
I don't remember exactly why I added it anyhow

wolever · 2014-01-20T22:17:00Z

Okay, cool — there we go. It does feel a bit like we're overloading the reduce argument… but, like I said, I don't totally grok the regular functioning of reduce, so I'll trust you if you say this is the way to do it.

wolever · 2014-01-21T01:47:19Z

In fact, I noticed another issue today too: if the DataFrame is completely empty (no rows or columns), .apply(…) always returns a DataFrame, when in fact it should use the same logic to determine what type should be returned. PR has been updated.

I'll update the unit tests to make sure they are all passing after this gets approval.

jreback · 2014-01-21T03:12:42Z

I have 2 look at this some more
but can u add a release note in the API section (reference this pr number)

also pls put a short example in v0.13.1.txt (illustrating both cases)

jreback · 2014-01-21T19:11:51Z

@wolever ok...this looks fine...pls add the docs as I mentioned above and good 2 go

wolever · 2014-01-24T02:44:22Z

Okay! Docs and tests updated too. Unfortunately I wasn't able to test the docs since my laptop ran out of RAM while trying to build them… but hopefully they should look decent.

jreback · 2014-01-24T11:59:48Z

doc/source/v0.13.1.txt

+        print "Apply function being called with:", col
+        return col.sum()
+
+     import pandas as pd


take out the import pandas line and the pd (pandas is auto imported in docs and DataFrame is too)

jreback · 2014-01-24T14:15:00Z

also, pls rebase and squash down to a single commit, see here: https://github.com/pydata/pandas/wiki/Using-Git

jreback · 2014-01-24T22:25:27Z

@wolever pls rebase this and squash

jreback · 2014-01-26T15:31:13Z

closed via d74bd31

thanks for the nice fix!

wolever · 2014-01-26T16:50:05Z

Urg, sorry for not doing this myself - I've been out of commission the last couple of days.

Thanks for the merge!


Add is_reduction argument to DataFrame.apply

5ca822b

Use 'reduce' argument instead of adding 'is_reduction' argument

1fe8e6b

Fix apply on empty DataFrame returns DataFrame

a342be6

Add changelog notes and examples

55b1d80

jreback reviewed Jan 24, 2014
jreback closed this Jan 26, 2014
