API: clarify DataFrame.apply reduction on empty frames #6007
there is already a reduce argument |
The |
don't u think it would be better to fix that than add another keyword? |
Would it make sense to change the default value of |
that's possible I think can u give a use case for this? apply on an empty frame should just return the frame - I don't think a reduction makes sense |
My use case is that I've got a reduction function which behaves poorly (raises an exception) when the current implementation calls the reduction function with an empty Series to try and guess the return value (https://github.com/wolever/pandas/blob/5ca822b35d196928fc9ef1b14d457553ea7f3e68/pandas/core/frame.py#L3437) This causes a problem because it's surprising that |
can u put a complete example up |
Specifically, the code I'm working with right now looks like this:
a = df.apply(row_reduction_func_a, axis=1)
b = df.apply(row_reduction_func_b, axis=1)
result = DataFrame({ "a": a, "b": b })
Which works fine, except when the Pandas version 0.12.0 but this issue is present in trunk (see the associated test case). |
can u put up the function itself along with a sample frame |
Is there something wrong with the test cases I supplied? |
no, I am trying to see if there is a more general issue; your solution is treating the symptom, not the problem |
Here is an example of a reduction function which will behave poorly (forgive me for being unable to share my specific function):
>>> def row_reduction_function(row):
...     return (row == correct).sum()
>>> correct = pd.Series(["a", "b"], index=["a", "b"])
And when applied to a full DataFrame, it produces the expected result (a Series with one element):
>>> nonempty = pd.DataFrame({ "first": ["a"], "second": ["x"] })
>>> nonempty.apply(row_reduction_function, axis=1)
0    1
dtype: int64
But when applied to an empty DataFrame, it doesn't produce an empty Series; it produces a DataFrame:
>>> empty = pd.DataFrame({ "first": [], "second": [] })
>>> empty.apply(row_reduction_function, axis=1)
Empty DataFrame
Columns: [first second]
Index: [] |
The problem is that, with an empty DataFrame, |
I think it would make more sense to have an empty frame in this example. I would rather not get an exception as missing data may throw exceptions in many functions. What would be the expected behavior for this: |
@MichaelWS the issue is not one of "when should an empty DataFrame be returned versus an empty Series". The problem is that, when the input DataFrame is empty, the result of |
@MichaelWS as for the question of expected behavior: my problem isn't with the expected behavior. My problem is that there's no way to override the current behavior and explicitly control the currently undefined behavior when the DataFrame is empty. |
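For readers hitting the same problem, one way to sidestep the guess is to check for the empty case before calling apply at all. The sketch below is only an illustration: `apply_row_reduction` and `count_a` are hypothetical names that are not part of pandas or this PR, and the float64 dtype of the empty result is an arbitrary assumption.

```python
import pandas as pd


def apply_row_reduction(df, func):
    """Apply a row-wise reduction; return an empty Series for a frame with no rows.

    Hypothetical helper, not part of pandas.
    """
    if len(df.index) == 0:
        # No rows: skip apply so func is never called on a made-up empty Series.
        # The float64 dtype is an assumption; pick whatever func normally returns.
        return pd.Series([], index=df.index, dtype="float64")
    return df.apply(func, axis=1)


def count_a(row):
    # Stand-in reduction: number of cells in the row equal to "a".
    return (row == "a").sum()


nonempty = pd.DataFrame({"first": ["a"], "second": ["x"]})
empty = pd.DataFrame({"first": [], "second": []})

full_result = apply_row_reduction(nonempty, count_a)
empty_result = apply_row_reduction(empty, count_a)
```

With this guard both calls return a Series, so the two results can be combined into a DataFrame without special-casing the empty input at every call site.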
I get your point of view, but I agree with jreback and think the average user would be a bit confused with the reduce and is_reduction combination. It might be easier to use the reduce keyword to handle the empty frame. |
Absolutely agree. In that case, do you agree that it makes sense to update the PR as follows:
|
That would be my preference, but I would wait for some others for a consensus. |
Okay, cool. I'll update the description and see if we can get some more consensus. I don't grok |
why don't u give it a try by changing reduce |
Okay, cool — there we go. It does feel a bit like we're overloading the |
In fact, I noticed another issue today too: if the DataFrame is completely empty (no rows or columns), I'll update the unit tests to make sure they are all passing after this gets approval. |
I have 2 look at this some more also pls put a short example in v0.13.1.txt (illustrating both cases) |
@wolever ok...this looks fine...pls add the docs as I mentioned above and good 2 go |
Okay! Docs and tests updated too. Unfortunately I wasn't able to test the docs since my laptop ran out of RAM while trying to build them… but hopefully they should look decent. |
print "Apply function being called with:", col
return col.sum()

import pandas as pd |
take out the import pandas line and the pd (pandas is auto imported in docs and DataFrame is too)
also, pls rebase and squash down to a single commit, see here: https://github.com/pydata/pandas/wiki/Using-Git |
@wolever pls rebase this and squash |
closed via d74bd31 thanks for the nice fix! |
Urg, sorry for not doing this myself - I've been out of commission the last couple of days. Thanks for the merge!
|
Add is_reduction argument to DataFrame.apply to avoid undefined behavior when .apply is called on an empty DataFrame and the function being applied is only defined for valid inputs.
Currently, if the DataFrame is empty, a guess is made at the correct return value (either a DataFrame or a Series) by calling the function being applied with an empty Series as an argument:
For reduction functions which produce undefined results on unexpected input (e.g., a function which doesn't expect an empty argument), this means that the result of apply is also undefined.
This pull request adds an explicit is_reduction argument so that it's possible to explicitly control this otherwise undefined behavior.
Update: there has been the suggestion that the existing reduce argument should be used. Is this reasonable? The PR would be updated as follows:
is_reduction argument
reduce from True to None (to preserve the current behavior of checking the return value of the function being applied)
reduce in the same way that I'm currently treating is_reduction
reduce as normal
Ref:
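The decision described in the update can be sketched roughly as follows. This is not the code merged into pandas: `guess_empty_result_kind`, `fragile_reduction`, and the "series"/"frame" labels are hypothetical. It assumes that reduce=None means "probe the function with an empty Series", that an explicit True or False skips the probe, and that a failing probe falls back to returning the frame.

```python
import pandas as pd


def guess_empty_result_kind(func, reduce=None):
    """Return "series" or "frame" for apply on an empty frame.

    Hypothetical helper that mirrors the proposal, not pandas internals.
    """
    if reduce is not None:
        # Explicit choice from the caller: never call func on made-up input.
        return "series" if reduce else "frame"
    try:
        # reduce=None: guess by calling func with an empty Series.
        probe = func(pd.Series([], dtype="float64"))
    except Exception:
        # Assumption: if func cannot handle the probe, fall back to the frame.
        return "frame"
    # A Series result suggests a transform; anything else suggests a reduction.
    return "frame" if isinstance(probe, pd.Series) else "series"


def fragile_reduction(row):
    # Stand-in for a reduction that is only defined for non-empty input.
    if len(row) == 0:
        raise ValueError("empty input")
    return row.sum()


guessed = guess_empty_result_kind(fragile_reduction)
forced = guess_empty_result_kind(fragile_reduction, reduce=True)
```

Here `guessed` comes back as "frame" because the probe raises, while `forced` is "series" because the explicit reduce=True bypasses the probe, which is the kind of control the PR is asking for.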