API: sum of Series of all NaN should return 0 or NaN ? #9422

Closed
shoyer opened this issue Feb 5, 2015 · 116 comments · Fixed by #17630
Labels
API Design Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Comments

@shoyer
Member

shoyer commented Feb 5, 2015

Summary

The question is what the sum of a Series of all NaNs should return (which is equivalent to an empty Series after skipping the NaNs): NaN or 0?

In [1]: s = Series([np.nan])                 

In [2]: s.sum(skipna=True)  # skipping NaNs is the default
Out[2]: nan or 0     <---- DISCUSSION POINT

In [3]: s.sum(skipna=False)
Out[3]: nan

The reason this is a discussion point is the following: pandas' internal nansum implementation returns NaN, but when bottleneck is installed, pandas uses bottleneck's implementation of nansum, which returns 0 (for versions >= 1.0).
Bottleneck changed this behaviour from returning NaN to returning 0 to match numpy's nansum function.

This has the very annoying consequence that you get different behaviour depending on whether bottleneck (which is only an optional dependency) is installed.

So the decision we need to make is to either:

  • adapt pandas' internal implementation to return 0, so that 0 is returned for all-NaN/empty series in all cases;
  • work around bottleneck's behaviour, or not use it for nansum, in order to consistently return NaN instead of 0;
  • choose one of the two above as the default, but add an option to switch the behaviour.
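(For illustration, a minimal user-level sketch of how either of the first two behaviours can be forced today, independent of which nansum backend pandas picks up; this is plain user code, not a proposed pandas API:)

import numpy as np
import pandas as pd

s = pd.Series([np.nan])

# Force "all-NaN/empty sums to 0", whichever nansum backend pandas uses:
total_zero = 0.0 if s.dropna().empty else s.dropna().sum()

# Force "all-NaN (or empty) sums to NaN":
total_nan = np.nan if s.isnull().all() else s.sum()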

Original title: nansum in bottleneck 1.0 will return 0 for all NaN arrays instead of NaN

xref pydata/bottleneck#96
xref #9421

This matches a change from numpy 1.8 -> 1.9.

We should address this for pandas 0.16.

Should we work around the new behavior (probably the simplest choice) or change nansum in pandas?

@shoyer shoyer added the Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate label Feb 5, 2015
@jreback
Contributor

jreback commented Feb 5, 2015

I guess the np.nan* methods 'ignore' nans by default and thus all-nans is really an empty array. But this seems wrong to me. You can never get a NaN result back.

However pandas wants to propagate NaN results, so this would break existing behavior (which, looking at it, seems slightly wrong).

In [13]: s = Series([np.nan])                 

In [15]: s.sum(skipna=True)
Out[15]: nan

In [16]: s.sum(skipna=False)
Out[16]: nan

so I think that you are proposing that [15] should be 0.0 (this is the default, FYI).

@jreback jreback added this to the 0.16.0 milestone Feb 5, 2015
@shoyer
Member Author

shoyer commented Feb 6, 2015

Yes, we might want to change your line [15] to 0.0. There is certainly something to be said for staying in sync with numpy/bottleneck.

I think sum and prod are the only functions that would need to change to match the numpy behavior (nanprod will be new in numpy 1.10: numpy/numpy#5418). All the other NA-skipping aggregation functions (mean, std, var, median, etc) are undefined for empty arrays, so it still makes sense for them to return NaN in the all NaN case.

This change also has implications for windowed and grouped sum/prod. These might also need a corresponding change to match.
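
(For reference, a quick look at the NumPy behaviour being matched here; outputs assume NumPy >= 1.9, and nanprod only exists from 1.10:)

import numpy as np

a = np.array([np.nan, np.nan])
np.nansum(a)    # 0.0 in numpy >= 1.9 (returned nan in earlier versions)
np.nanmean(a)   # nan, plus a "Mean of empty slice" RuntimeWarning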

@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 5, 2015
jreback added a commit that referenced this issue Jun 4, 2015
TST: fix for bottleneck >= 1.0 nansum behavior, xref #9422
cgevans added a commit to cgevans/pandas that referenced this issue Jun 5, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Jul 2, 2015
@jreback
Contributor

jreback commented Jul 31, 2015

Yeh an option passed to core/nanops/_maybe_null_out in nansum/nanprod would do the trick here.
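
(A rough standalone sketch of the kind of switch being described; the function and argument names here are hypothetical, and the real pandas code in core/nanops looks different:)

import numpy as np

def nansum_with_policy(values, all_na_returns_na=True):
    # Hypothetical sketch only: sum while skipping NaNs, but optionally
    # "null out" the result when every value was missing.
    mask = np.isnan(values)
    if all_na_returns_na and mask.all():
        return np.nan
    return np.nansum(values)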

@jreback jreback modified the milestones: 0.17.0, Next Major Release Aug 13, 2015
jreback added a commit to jreback/pandas that referenced this issue Aug 21, 2015
… compat with numpy >= 1.8.2 and bottleneck >= 1.0, pandas-dev#9422

     note that passing skipna=False will still return a NaN
@shoyer
Member Author

shoyer commented Nov 10, 2017

@sam-s Pandas uses NaN to represent missing values (i.e., NULL in database semantics). This has its downsides, but is a well-established part of the current pandas data model (it will eventually be fixed in pandas 2.0).

@shoyer
Member Author

shoyer commented Nov 10, 2017

We're now seriously considering three different options:

  1. Empty and all missing sums are both zero (NumPy, bottleneck and R use this): SUM([]) = 0, SUM([NULL]) = 0
  2. Empty sum is zero, all missing sum is missing (pandas <0.20 uses this): SUM([]) = 0, SUM([NULL]) IS NULL
  3. Empty and all missing sums are both missing (databases use this): SUM([]) IS NULL, SUM([NULL]) IS NULL

My question: Is there any precedent (other than old versions of pandas) for choice 2? This may be a compromise solution that keeps both the "result should be zero" and "result should be missing" camps happy, but distinguishing between "empty" and "missing" is something we don't do anywhere else in pandas, so this seems like a huge gotcha to me. I think this status quo was worse than making either choice.

@sam-s

sam-s commented Nov 10, 2017

@shoyer so your rationale is that sum([]) is "a missing value". Right?
How is it missing?
A sum of an empty set is, naturally (because associativity!), the "additive unit" (i.e., 0).
Nothing is missing.

@kawochen
Contributor

@jreback I believe the information is not lost, but shifted from one case to another. If we decided, instead, to return zero for all-NaN Series, then a NaN result from .sum() would most likely mean that inf and -inf were involved. In the current API, we can't be sure. I do, however, think that I see NaN a lot more often than inf.

At the same time, I believe that, regardless of the decision, people would complain, citing mathematics and yelling 'obviously'. If mathematics were a person, she would perhaps be surprised at how often she was mentioned by both camps.

@sam-s

sam-s commented Nov 10, 2017

@jorisvandenbossche: you say I don't think this comparison is fully correct, as it is comparing two different kinds of sums (the pandas Series.sum vs + with scalars), where the one skips NaNs and the other propagates them.
Computer operations should stay as close as possible to their mathematical ideals.
Creating gratuitous incompatibilities and breaking associativity is, IMO, counterproductive.

@sam-s

sam-s commented Nov 10, 2017

@kawochen: could you please explain how sum([])=NaN makes sense from the mathematical POV?

@shoyer
Member Author

shoyer commented Nov 10, 2017

@sam-s I think we all understand the mathematical issues. That's not the only basis on which this decision is being made.

@sam-s

sam-s commented Nov 10, 2017

@shoyer: of course math is not the only consideration!
the reason for my question is that @kawochen said "If mathematics were a person, she would perhaps be surprised at how often she was mentioned by both camps", so I am asking how the sum([])=NaN folks invoke math.

@kawochen
Contributor

I was only making an observation that both camps cited math, and I was not wrong.

@brazilbean said:

From a mathematics-theory perspective, returning 0 as the sum of "no numbers" is wrong. The correct behavior to "sum no numbers" is to return "not a number". From this perspective, everybody has been doing this wrong (except pandas... :).

@sam-s The sum of an empty set being zero is a convention. It is convenient, but it doesn't follow from any definition of (+, R). When it is used in any math books I've read, it is explicitly defined to be zero. Do I want it to be 0? Yes, I think that'd be prettier. Do I think it needs to be? No.

@sam-s

sam-s commented Nov 10, 2017

@kawochen : The sum of an empty set being zero is a convention
This is false.
sum([])==0 is a theorem which follows from associativity of the group (R,+):
sum(list1 + list2) == sum(list1) + sum(list2)
Since list1 == list1 + [], we must have
sum(list1) == sum(list1 + []) == sum(list1) + sum([])
thus sum([])==0.

@kawochen : When it is used in any math books I've read, it is explicitly defined to be zero.
This is merely a didactic point - it is not a part of the definition, just an illustration.

@brazilbean : From a mathematics-theory perspective, returning 0 as the sum of "no numbers" is wrong.
This is false, as proven above.

@shoyer: I think we all understand the mathematical issues
Apparently, at least @kawochen and @brazilbean were confused at some point. ;-)

@shoyer
Member Author

shoyer commented Nov 10, 2017

@sam-s Please stop. This is not the place to prove that others are wrong.

@kawochen
Contributor

@sam-s I am not confused. (+, R) is a binary operation on R, not a mysterious unary operation on lists.

@sam-s

sam-s commented Nov 10, 2017

@kawochen : I have no desire to fight. I apologize if I insulted you or anyone else.
My only wish is to understand how one can justify sum([])==NaN mathematically.
I believe I offered a proof that sum([])==0.
Can you point out an error in my argument?

PS. Again, I understand that math is not the only rationale for making software design decisions (moreover, R with NaN/Inf/&c is not a field). However, a claim was made that sum([])==0 is somehow not necessarily true from a mathematical POV. I am trying to figure out what was meant by that statement.

@wesm
Member

wesm commented Nov 10, 2017

@sam-s pandas is not a mathematics library, so these mathematical arguments are not persuasive. The issue is comparing all-null data versus empty data. The change that was made was to make the empty data behavior consistent across all reductions. This has nothing to do with mathematical arithmetic theory. Also please stop using NaN as a straw man. NaN is null in pandas.

@sam-s

sam-s commented Nov 10, 2017

@wesm : I am replying to claims that sum([])==0 is not necessarily true mathematically.
As I said at least twice and will repeat again, I understand that math is not the only rationale.
I am not sure what you mean by straw man -- null/NA/NaN are, indeed, interchangeable in this context.

@wesm : The change that was made was to make the empty data behavior consistent across all reductions.
I am not sure what you mean here. My guess is that you are talking about different orders of summation along different dimensions, but I would appreciate a clarification. What is the test case where sum([])==null would be valuable for preserving consistency?
Thank you very much for your help!

@wesm
Member

wesm commented Nov 10, 2017

Here you go:

In [1]: import numpy as np

In [2]: np.prod([])
Out[2]: 1.0

In [3]: np.sum([])
Out[3]: 0.0

In [4]: np.mean([])
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py:59: RuntimeWarning: Mean of empty slice.
  warnings.warn("Mean of empty slice.", RuntimeWarning)
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py:70: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Out[4]: nan

In [5]: np.std([])
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py:82: RuntimeWarning: Degrees of freedom <= 0 for slice
  warnings.warn("Degrees of freedom <= 0 for slice", RuntimeWarning)
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py:94: RuntimeWarning: invalid value encountered in true_divide
  arrmean, rcount, out=arrmean, casting='unsafe', subok=False)
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py:116: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
Out[5]: nan

In [6]: np.max([])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims)
   2289         try:
-> 2290             amax = a.max
   2291         except AttributeError:

AttributeError: 'list' object has no attribute 'max'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-6-2c7d6c30ced7> in <module>()
----> 1 np.max([])

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims)
   2291         except AttributeError:
   2292             return _methods._amax(a, axis=axis,
-> 2293                                 out=out, **kwargs)
   2294         return amax(axis=axis, out=out, **kwargs)
   2295     else:

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py in _amax(a, axis, out, keepdims)
     24 # small reductions
     25 def _amax(a, axis=None, out=None, keepdims=False):
---> 26     return umr_maximum(a, axis, None, out, keepdims)
     27 
     28 def _amin(a, axis=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation maximum which has no identity

In [7]: np.min([])
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/fromnumeric.py in amin(a, axis, out, keepdims)
   2389         try:
-> 2390             amin = a.min
   2391         except AttributeError:

AttributeError: 'list' object has no attribute 'min'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-7-6b53ced98bfe> in <module>()
----> 1 np.min([])

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/fromnumeric.py in amin(a, axis, out, keepdims)
   2391         except AttributeError:
   2392             return _methods._amin(a, axis=axis,
-> 2393                                 out=out, **kwargs)
   2394         return amin(axis=axis, out=out, **kwargs)
   2395     else:

/home/wesm/anaconda3/envs/arrow-test/lib/python3.5/site-packages/numpy/core/_methods.py in _amin(a, axis, out, keepdims)
     27 
     28 def _amin(a, axis=None, out=None, keepdims=False):
---> 29     return umr_minimum(a, axis, None, out, keepdims)
     30 
     31 def _sum(a, axis=None, dtype=None, out=None, keepdims=False):

ValueError: zero-size array to reduction operation minimum which has no identity

Do these answers all seem reasonable to you?

In [8]: import pandas as pd

In [9]: pd.__version__
Out[9]: '0.21.0'

In [10]: s = pd.Series([])

In [11]: s.sum()
Out[11]: nan

In [12]: s.mean()
Out[12]: nan

In [13]: s.max()
Out[13]: nan

In [14]: s.min()
Out[14]: nan

In [15]: s.std()
Out[15]: nan

In [16]: s.prod()
Out[16]: nan

This is what I mean by consistency. The question is not whether sum([]) == 0 is a mathematically reasonable answer, but what are the semantics of an all-null Series versus a length-0 Series.

@sam-s

sam-s commented Nov 10, 2017

@wesm: I am chagrined by your approach.
The previous behavior was more or less reasonable.
These are different operations with different semantics and different invariants:
Yes, sum==0, prod==1, mean/std --> error all make sense.
max=-inf and min=+inf would make more sense than errors because max(list1+list2)==max(max(list1),max(list2)) &c.

Now, returning nan/null makes no sense whatsoever, except for a superficial similarity between unrelated cases.
Empty lists/sets are not "weird unicorns requiring special treatment", but normal objects whose properties should be governed by common invariants like the one for max above or associativity for sum &c. Yes, mean([]) and std([1]) make no sense and should be errors. But others are well defined.

@wesm : semantics of an all-null Series versus a length-0 Series
I think you are looking at this from the wrong side.
The question is not all-null but some-null:
The point is that the dichotomy is

  • "do we have a null" vs "we have no nulls" rather than
  • "do we have only nulls" vs "do we have some real data".

If you have a list [1,null], then you have two possible approaches:

  1. null means "unknown", so we really have no idea what the (say) sum is, as the second list element may be 1 (in which case the sum is 2) or -1 (then sum is 0) &c, so all operations should return null.
  2. null means "missing", so we ignore its value and work with the rest of the list. IOW, we view [null,3] as if it were [3] and return 3 for sum &c.

(please do not get hung up on "unknown" vs "missing" terminology, use whatever makes sense to you)
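
(Concretely, these two treatments correspond to pandas' existing skipna flag; a small illustration:)

import numpy as np
import pandas as pd

s = pd.Series([np.nan, 3.0])

s.sum(skipna=False)  # nan -> approach 1: null means "unknown", so propagate it
s.sum(skipna=True)   # 3.0 -> approach 2: null means "missing", so ignore it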

Both of the above 2 approaches to treating nulls (1: return null, 2: ignore nulls) are only relevant to situations where we have nulls. But there are no nulls in an empty set, so this is not relevant!

When we have no nulls, returning null means that the operation on the data supplied is undefined. IOW, there is no way to produce a number which would satisfy the basic properties of the operations in question. But this is not the case here!

Let us see: associativity implies that prod([])=1 and sum([])=0.
The "pseudo-associativity" implies that max([])=-inf and min([])=inf (yep, in this case max<min and that's okay - these are separate operations and internal consistency of each of them is more important than the inter-operation relationships ;-)
mean and std have no nice invariants and thus should raise exceptions: there is no way to return anything meaningful. (Or, if you insist, return null.)

Thank you very much for listening to me and replying.
I hope you will give my ideas some thought and find them reasonable on due consideration.

Again, while I do understand that math is not the only consideration, I beg you to remember that your customer is an applied mathematician like yours truly and we have certain expectations from the basic math operations. Please do not surprise us like this! ;-)

Thanks again!

@shoyer
Member Author

shoyer commented Nov 10, 2017

The point is that the dichotomy is

  • "do we have a null" vs "we have no nulls" rather than
  • "do we have only nulls" vs "do we have some real data".

This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction.

@kawochen
Contributor

@sam-s

Can you point out an error in my argument?

@jorisvandenbossche already pointed out the error.

As I mentioned in the other issue: I don't think this comparison is fully correct, as it is comparing two different kinds of sums (the pandas Series.sum vs + with scalars), where the one skips NaNs and the other propagates them.
If you would write it as pd.Series([s1.sum(), s2.sum()]).sum(), then the comparison would still hold with 0.21.0

In other words, you are mixing (the sometimes overloaded) + and .sum() in your argument.
pd.Series([pandas_series_1.sum(), pandas_series_2.sum(), ..., pandas_series_n.sum()]).sum() still equals pd.concat([pandas_series_1, ..., pandas_series_n]).sum(), if we ignore inf and -inf. (We ignore inf and -inf because it was that way before and is unrelated to the recent change.) If you'd like to show that the binary operation 'addition' over reals, when applied to an empty set, should be zero, then you must first define what it means to apply a binary operation to a set. I am afraid there is no way to will the result to be zero when it's an empty set, except to define it so. You must first define such an operation over sets, before using properties of this operation (it may or may not be associative depending on your definition).
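
(Restating that identity as plain code; it holds whichever answer is chosen for the all-NaN sum, because a NaN inner result would simply be skipped by the outer NaN-skipping sum:)

import numpy as np
import pandas as pd

s1 = pd.Series([1.0, np.nan])
s2 = pd.Series([np.nan, np.nan])

nested = pd.Series([s1.sum(), s2.sum()]).sum()  # sum of per-series sums
flat = pd.concat([s1, s2]).sum()                # sum of the concatenation
assert nested == flat                           # both are 1.0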

All reasonable return values will have their own supporters, and that's OK -- null apparently means different things to different people in different fields. It would have been better if those opinions were voiced before the decision, because after the fact, we often only hear from those who have been inconvenienced, and can't get a sense of who may have benefited from the new choice.

@kenahoo

kenahoo commented Nov 11, 2017

Perhaps another justification:

Adding up things a certain number of times is called multiplication. [4, 4, 4].sum is also known as 4*3. [4, 4].sum is also known as 4*2. [4].sum is 4*1. And [].sum is 4*0. Why 4? Yeah, doesn't matter, anything times zero is zero. This is why we use the additive identity for an empty sum.

My point isn't to prove that an empty sum is zero, that's been settled for hundreds of years, and doesn't really need our help.

My point is that if you define it as anything else, huge numbers of inconsistencies will pop up all over the place, and the solution will be to either add special-case code all over the place, or to stop using pd.sum altogether and replace it with np.sum, as was required in a couple places earlier in this thread. That is not a good outcome for the sum function.

Is it possible that people are sticking to their guns here because they don't want to back down? Or they want it acknowledged that one could see things that way, even if it's not the best decision? I'm just having a hard time seeing a logical reason for this discussion and I'm trying to think of psychological reasons, I guess.

The justifications for [].sum = null (or NaN or whatever) I can pick out of the above messages are:

  • "All reasonable return values will have their own supporters, and that's OK -- null apparently means different things to different people in different fields."
    -I don't think so. What actual fields are we talking about? I work in time series and machine learning and that kind of stuff. What fields have the requirement that an empty sum is undefined?

  • "This is reasonable from a mathematical perspective, but again is not the choice made by databases. In databases (and pandas) nulls can be introduced into empty results quite easily from joins, and in general there is no careful distinction."
    -Which is exactly why Pandas shouldn't follow that lead. It's a major pain and widely acknowledged as a design mistake, as I cited above. Furthermore, I don't know of any situations where the standard mathematical rules for something turn out to not be the most convenient choice in a programming language; is there a precedent someone can think of?

  • @wesm's illustration that for pd.Series([]), sum() & mean() & max() & min() & std() & prod() all return nan
    -That may look neat in a terminal session, but it's false consistency for no good reason I can see. Why don't we also include all() and any(), which return True and False? Why don't we also include size, under the thinking that if we don't have any data, how can we measure its size? Why is ftype defined?
    sum() and prod() (and if I had my druthers, max() and min()) shouldn't be "consistent" in this way.

Let the NaN-inducing behavior fall out of something real like dividing by zero or subtracting infinities, not out of a committee.
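
(For what it's worth, the reductions mentioned in the third bullet above already return their identities on an empty Series:)

import pandas as pd

s = pd.Series([], dtype=float)
s.all()   # True  -> identity of logical "and"
s.any()   # False -> identity of logical "or"
s.size    # 0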

@kawochen
Contributor

kawochen commented Nov 11, 2017

@kenahoo Well, I said more than a year ago I would like .sum() to return additive identity when it's an all-null or empty Series -- yes, scroll up to my post; we have been discussing this for more than a year -- so don't mistake my stance. We are all very well aware of the theoretical justifications, because, again, we've been discussing this for more than a year.

What actual fields are we talking about? I work in time series and machine learning and that kind of stuff. What fields have the requirement that an empty sum is undefined?

How do your fields require that an empty sum be non-null? I am always quite willing to simply not deal with empty time series and empty data sets. But if you could provide a closer-to-real-life example of how it breaks things, it would really help the discussion. @sam-s provided a useful data point: that he is used to checking for the presence of NaNs with .sum().

We shouldn't talk about the sum of an empty Series in isolation. Instead, consider also

pd.Series([6])
pd.Series([np.nan, np.nan, 6])
pd.Series()
pd.Series([np.nan, np.nan])

I don't know about your fields, but I quite like the fact that pandas handles null values like they aren't even there. So .sum() returns 6 for the first two. If you don't agree, you have had plenty of opportunities to start a discussion, because it's been this way for quite a while.

I personally think .sum() on the last two should both be zero, and, if not zero, they should at least equal each other. The behavior that they sometimes didn't equal each other was what was fixed. But there are people (e.g. @sam-s and you) who think the former should be 0 while the latter should be NaN. Like I said before, null values mean different things to different people. For @sam-s, it is something that should be 'reported ASAP'; for me, it is not; for you, I don't know. I see null and think 'ignore' -- you either agree with me on this, or agree with what I said earlier, that different people think of null differently. I honestly don't even know how you can say "I don't think so" to that statement of mine, which only stated the obvious and hardly had any substance in it.

This consideration isn't even new. It is hard to keep operations consistent over the real numbers extended with positive infinity, negative infinity, and not-a-number. Now, why is no one bothered by the fact that with inf and -inf, all those nice properties you have been championing don't exist anymore (they break the associativity you have forced upon .sum())? My guess is because they don't come up as often. So real-life examples, which we thank you for in advance, do matter.

Our voices have been heard. Let's only respond when we have new substance to bring to this discussion.

@wesm
Member

wesm commented Nov 11, 2017

I said several messages ago that there's a valid argument both for a sum that yields 0 on no data and for one that yields null. I suggest we consider making a 0.21.1 release with this improvement and reverting the behavior of pd.Series([]).sum() to 0.

I'm locking this thread because I think the discussion has ceased to be productive some time ago.

@pandas-dev pandas-dev locked and limited conversation to collaborators Nov 11, 2017
@jorisvandenbossche
Member

@wesm it's maybe not the best place here, but there are still some things to discuss.

First, there is the question of whether we want to give users the choice between the options themselves (for those who want the option we don't choose), e.g. the suggestion of another method (total()) or a keyword (min_count).

Second, I would be fine with having empty sum return 0. But I personally feel this is very inconsistent with Series([NaN]).sum() returning NaN.
I would expect that s.sum(skipna=True) == s.dropna().sum(skipna=False), which would no longer be the case when we do this change.
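
(For reference, what the keyword option looks like from the user's side; a min_count keyword for sum/prod did eventually land in pandas 0.22:)

import numpy as np
import pandas as pd

s = pd.Series([np.nan, np.nan])

s.sum()             # 0.0 -> with the default min_count=0, empty/all-NA sums to 0
s.sum(min_count=1)  # nan -> require at least one non-NA value, otherwise return NA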

@shoyer
Member Author

shoyer commented Nov 12, 2017

@jorisvandenbossche I agree with you, let's try to make a final decision at the dev meeting.

Two other proposals based on precedent from databases:

  • pd.coalesce(x.sum(), 0), as suggested by Postgres/StackOverflow. We have equivalent functionality in the Series/DataFrame method combine_first, but it would need to be a function to work for scalars (as often result from sum).
  • pd.zeroifnull(x.sum()), suggested by @wesm above.

Also, for what it's worth: I did some searching for justifications for the SQL behavior, and the best I came up with was this scathing critique. This isn't really decisive, but it's clear there are lots of SQL users who aren't happy with this behavior.
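
(Neither pd.coalesce nor pd.zeroifnull exists; as a sketch, the scalar version could be as simple as the following, using only pd.isna:)

import numpy as np
import pandas as pd

def coalesce(value, default=0):
    # Hypothetical helper in the spirit of SQL COALESCE / the proposed pd.coalesce;
    # not part of the pandas API.
    return default if pd.isna(value) else value

s = pd.Series([np.nan, np.nan])
coalesce(s.sum(), 0)   # 0, whichever way the all-NaN sum question is decided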

@wesm
Member

wesm commented Nov 14, 2017

In the case of s.sum(skipna=True) == s.dropna().sum(skipna=False), I don't think these should necessarily yield the same result.

I think what we've suggested here is:

  • Sum of "no data" is by default 0 (unless you use the "other sum" method)
  • All-null data is still null. But s.dropna() yields "no data", so it is semantically a different thing from summing all-null data.

We can discuss on the call tomorrow

@jorisvandenbossche
Member

As @shoyer listed above (#9422 (comment)), I think we should consider the three possible options tomorrow:

  1. Empty / all-NA sum is always zero: SUM([]) = 0 and SUM([NA]) = 0 (behaviour of pandas < 0.21 + bottleneck)
  2. Empty / all-NA sum is always NA: SUM([]) = NA and SUM([NA]) = NA (how it is currently in 0.21.0)
  3. The mixed behaviour you explain in the comment above: SUM([]) = 0 and SUM([NA]) = NA (behaviour of pandas < 0.21 without bottleneck)

And each of the three options might have different options to circumvent the default behaviour (coalesce-like function, separate method, new keyword, ...)

@jorisvandenbossche
Member

To all subscribed here, note that there is a mailing list discussion about this: https://mail.python.org/pipermail/pandas-dev/2017-November/000657.html
