Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: clarify DataFrame.apply reduction on empty frames #6007

Closed
wants to merge 4 commits into from

Conversation

@wolever
Copy link
Contributor

commented Jan 20, 2014

Add is_reduction argument to DataFrame.apply to avoid undefined behavior when .apply is called on an empty DataFrame when function being applied is only defined for valid inputs.

Currently, if the DataFrame is empty, a guess is made at the correct return value (either a DataFrame or a Series) by calling the function being applied with an empty Series as an argument:

if not all(self.shape):
    # How to determine this better?
    is_reduction = False
    try:
        is_reduction = not isinstance(f(_EMPTY_SERIES), Series)
    except Exception:
        pass

    if is_reduction:
        return Series(NA, index=self._get_agg_axis(axis))
    else:
        return self.copy()

For reduction functions which produce undefined results on unexpected input (ex, a function which doesn't expect an empty argument), this means that the the result of apply is also undefined.

This pull request adds an explicit is_reduction argument so that it's possible to explicitly control this otherwise undefined behavior.

Update: there has been the suggestion that the existing reduce argument should be used. Is this reasonable? The PR would be updated as follows:

  • Remove the is_reduction argument
  • Change the default value of reduce from True to None (to preserve the current behavior of checking the return value of the function being applied)
  • In the case of an empty DataFrame, treat reduce in the same way that I'm currently treating is_reduction
  • Otherwise treat reduce as normal

Ref:

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

their is already a reduce argument

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

The reduce argument has no effect if the DataFrame is empty.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

don't u think it would be better to fix that than add another keyword?

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Would it make sense to change the default value of reduce to None (so as not to break existing calls to .apply() which depend on the current empty-data-frame-reduction-guessing), then replace is_reduction with reduce? (I don't grok the semantics of reduce well enough to say for sure)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

that's possible I think

can u give a use case for this?

apply on an empty frame should just return the frame - I don't think a reduction makes sense

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

My use case is that I've got a reduction function which behaves poorly (raises an exception) when the current implementation calls the reduction function with an empty Series to try and guess the return value (https://github.com/wolever/pandas/blob/5ca822b35d196928fc9ef1b14d457553ea7f3e68/pandas/core/frame.py#L3437)

This causes a problem because it's surprising that .apply(…) returns a Series except when it's empty (which essentially means I've got to litter otherwise correct algorithms with checks to see if the input DataFrame is empty).

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

can u put a complete example up
as well as pandas version?

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Specifically, the code I'm working with right now looks like this:

a = df.apply(row_reduction_func_a, axis=1)
b = df.apply(row_reduction_func_b, axis=1)
result = DataFrame({ "a": a, "b": b })

Which works fine, except when the df is empty (because then a will be a DataFrame and the DataFrame constructor raises an exception when it tries to create a column from a DataFrame)

Pandas version 0.12.0 but this issue is present in trunk (see the associated test case).

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

can u put up the function itself along with a sample frame

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Is there something wrong with the test cases I supplied?

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

no I am trying to see if their is a more general issue

your solution is treating the symptom not the problem

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Here is an example of a reduction function which will behave poorly (forgive me for being unable to share my specific function):

>>> def row_reduction_function(row):
...    return (row == correct).sum()
>>> correct = pd.Series(["a", "b"], index=["a", "b"])

And when applied to a full DataFrame, it produces the expected result (a Series with one element):

>>> nonempty = pd.DataFrame({ "first": ["a"], "second": ["x"] })
>>> nonempty.apply(row_reduction_function, axis=1)
0    1
dtype: int64

But when applied to an empty DataFrame, it doesn't produce an empty Series; it produces a DataFrame:

>>> empty = pd.DataFrame({ "first": [], "second": [] })
>>> empty.apply(row_reduction_function, axis=1)
Empty DataFrame
Columns: [first second]
Index: []
@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

The problem is that, with an empty DataFrame, .apply tries to guess what the result should be, and in sitautions where the reduction function is only defined for a "real" row (as opposed to the empty row .apply uses to try and guess the result), the result is also undefined. There needs to be some way to explicitly define whether the result should be a Series of a DataFrame. If using the reduce argument for this purpose is appropriate, I'd be all for that!

@MichaelWS

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

I think it would make more sense to have a empty frame in this example. I would rather not get an exception as missing data may throw exceptions in many functions.

What would be the expected behavior expected for this:
http://stackoverflow.com/questions/21225608/pandas-dataframe-applyf-axis-1-on-an-empty-dataframe-ignores-exceptions

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

@MichaelWS the issue is not one of "when should an empty DataFrame be returned versus an empty Series".

The problem is that, when the input DataFrame is empty, the result of .apply is undefined if the function being applied has undefined behavior for unexpected input.

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

@MichaelWS as for the question of expected behavior: my problem isn't with the expected behavior. My problem is that there's no way to override the current behavior and explicitly control the currently undefined behavior when the DataFrame is empty.

@MichaelWS

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

I get your point of view, but I agree with jreback and think the average user would be a bit confused with the reduce and is_reduction combination. It might be easier to use the reduce keyword to handle the empty frame.

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Absolutely agree.

In that case, do you agree that it makes sense to update the PR as follows:

  • Remove the is_reduction argument
  • Change the default value of reduce from True to None (to preserve the current behavior of checking the return value of the function being applied)
  • In the case of an empty DataFrame, treat reduce in the same way that I'm currently treating is_reduction
  • Otherwise treat reduce as normal
@MichaelWS

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

That would be my preference, but I would wait for some others for a consensus.

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Okay, cool. I'll update the description and see if we can get some more consensus. I don't grok reduce to know for sure that it will be okay.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 20, 2014

why don't u give a try by changing reduce
I don't remember exactly why I added it anyhow

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 20, 2014

Okay, cool — there we go. It does feel a bit like we're overloading the reduce argument… but, like I said, I don't totally grok the regular functioning of reduce, so I'll trust you if you say this is the way to do it.

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 21, 2014

In fact, I noticed another issue today too: if the DataFrame is completely empty (no rows or columns), .apply(…) always returns a DataFrame, when in fact it should use the same logic to determine what type should be returned. PR has been updated.

I'll update the unit tests to make sure they are all passing after this gets approval.

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 21, 2014

I have 2 look at this some more
but can u add a release note in the API section (reference this pr number)

also pls put a short example in v0.13.1.txt (illustrating both cases)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 21, 2014

@wolever ok...this looks fine...pls add the docs as I mentioned above and good 2 go

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 24, 2014

Okay! Docs and tests updated too. Unfortunately I wasn't able to test the docs since my laptop ran out of RAM while trying to build them… but hopefully they should look decent.

print "Apply function being called with:", col
return col.sum()

import pandas as pd

This comment has been minimized.

Copy link
@jreback

jreback Jan 24, 2014

Contributor

take out the import pandas line and the pd (pandas is auto imported in docs and Dataframe is too)

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 24, 2014

also, pls rebase and squash down to a single commit, see here: https://github.com/pydata/pandas/wiki/Using-Git

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 24, 2014

@wolever pls rebase this and squash

@jreback

This comment has been minimized.

Copy link
Contributor

commented Jan 26, 2014

closed via d74bd31

thanks for the nice fix!

@jreback jreback closed this Jan 26, 2014

@wolever

This comment has been minimized.

Copy link
Contributor Author

commented Jan 26, 2014

Urg, sorry for not doing this myself - I've been out of commission the last couple of days.

Thanks for the merge!

On Jan 26, 2014, at 10:31, jreback notifications@github.com wrote:

closed via d74bd31

thanks for the nice fix!


Reply to this email directly or view it on GitHub.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.