
Incorrect result when resampling DatetimeIndex due to hidden cast of ints to floats #3707

Closed
gerdemb opened this issue May 29, 2013 · 20 comments
Labels
Bug · Dtype Conversions · Groupby · Numeric Operations

Comments


gerdemb commented May 29, 2013

To summarize: pandas should raise when precision would be lost casting ints to floats for computations (everywhere!)

The wrong result is returned in the following case because of unnecessary casting of ints to floats.

from datetime import datetime
import numpy as np
from pandas import DataFrame

max_int = np.iinfo(np.int64).max
min_int = np.iinfo(np.int64).min
assert max_int + min_int == -1
# fails: the resampled sum comes back as 0, not -1
assert DataFrame([max_int, min_int], index=[datetime(2013, 1, 1), datetime(2013, 1, 1)]).resample('M', how=np.sum)[0][0] == -1
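The precision is already gone after the first int-to-float cast; a quick REPL check (reusing max_int and min_int from above, added here for illustration, not part of the original report) shows where the answer goes wrong:

>>> float(max_int) + float(min_int)
0.0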

This is the offending code in groupby.py that does the casting:

class NDFrameGroupBy(GroupBy):
    def _cython_agg_blocks(self, how, numeric_only=True):
        .....
        if is_numeric:
            values = com.ensure_float(values)

and, even worse, it sneakily casts the floats back to integers, hiding the problem!

        # see if we can cast the block back to the original dtype
        result = block._try_cast_result(result)

It should be possible to perform this calculation without any casting. For example, the sum() of a DataFrame returns the correct result:

assert DataFrame([max_int, min_int]).sum()[0] == -1

I am working with financial data that needs to stay as int to protect its precision. In my opinion, any operation that casts values to floats to do a calculation, but then returns the results as ints, is very wrong. At the very least, the results should be returned as floats to show that a cast has been done. Even better would be to perform the calculations with ints whenever possible.


jreback commented May 29, 2013

Since numpy cannot support a native NaN for integers, this must be cast to floats (it IS possible to do this in some cases as integers, e.g. when you KNOW that NaNs will not result, but this makes the operation take extra time). As for casting back, that is consistent with all other pandas operations. You are free to start with a dtype of float64. Working with extreme-range integers is, IMO, not useful, and what you are seeing is machine roundoff error.
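The underlying constraint (a minimal illustration added here, not part of the original comment) is that a numpy integer array simply has no slot for a missing value:

>>> import numpy as np
>>> a = np.array([1, 2, 3], dtype=np.int64)
>>> a[0] = np.nan
Traceback (most recent call last):
  ...
ValueError: cannot convert float NaN to integer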


cpcloud commented May 29, 2013

@jreback how annoying would it be to roll up a NaI like NaT? There have probably been discussions about this, but then we'd have a missing value for all the types that pandas supports (ints, floats, objects (which can still use regular nan), and dates/times).


jreback commented May 29, 2013

@cpcloud you could use the NaT value as a not-available marker for ints, BUT this runs into issues on 32-bit (where you'd effectively have to use the 32-bit maxint). That's why we ONLY use datetime64[ns] (as it's always 64-bit). Another option is to do masked-array type stuff. None of these are good solutions, unfortunately.
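For reference (a quick illustration of the sentinel approach, not from the thread): pandas marks NaT by reserving a single int64 bit pattern, the minimum value, which is exactly the kind of trick an integer NA would have to replicate:

>>> import numpy as np
>>> np.datetime64('NaT').astype(np.int64) == np.iinfo(np.int64).min
True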


cpcloud commented May 29, 2013

Yeah, I hate masked arrays... (link doesn't work) :)


gerdemb commented May 29, 2013

If I understand you correctly, rather than trying to determine whether an operation would create NaN values, pandas simply casts ints to floats in any case where this might happen, because this performs better. Is that correct? In my trivial examples it's obvious that there is no missing data, but I can see how that may not always be obvious.

I still think that in the case where an int is cast to a float to do an operation, pandas should return floats. It's very non-obvious that an operation that accepts ints and returns ints is actually doing floating-point calculations in the background!

My example uses big integers to demonstrate the problem, but it could occur with smaller numbers too. int(float(i)) == i is not true for all integer values of i. In monetary applications this is important.
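The smallest positive integer where that round trip fails is 2**53 + 1, since float64 carries a 53-bit significand (a one-line check added for illustration, not from the original comment):

>>> i = 2**53 + 1
>>> int(float(i)) == i
False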

Finally, changing the topic slightly, the behavior I'm actually looking for is to replace NaN with 0 when working with ints. For example:

>>> from datetime import datetime
>>> from pandas import Series
>>> dates = (datetime(2013, 1, 1), datetime(2013, 1, 2), datetime(2013, 3, 1))
>>> s = Series([1, 2, 4], index=dates)
>>> s
2013-01-01    1
2013-01-02    2
2013-03-01    4
dtype: int64
>>> s.resample('M', how='sum')
2013-01-31     3
2013-02-28     0
2013-03-31     4
Freq: M, dtype: int64

I can't see how this is easily possible, but I'm not an expert here.


jreback commented May 29, 2013

@gerdemb

Checking for NaNs costs time, and in any event these types of routines are implemented only for floats (they could in theory be done for ints, but this ONLY works if there are NO NaNs at all, which, when doing resampling, is very rare if you think about it; most of the time you will end up generating NaN for missing periods).

pandas tries to return the input type; it's just more intuitive that way (before 0.11 pandas used to return floats in almost all cases, and THAT was a bit of an issue). Now that pandas supports many dtypes it tries to return the correct one; casting back is done ONLY when there is no loss of result. (Do your above operation with floats; it returns the same result, 0.)

On your 2nd question:

In [7]: s.astype('float').resample('M',how='sum').fillna(0).astype('int')
Out[7]: 
2013-01-31    3
2013-02-28    0
2013-03-31    4
Freq: M, dtype: int64

will work; I don't think pandas supports a fill= in resample; in theory it could, I suppose.


jreback commented May 29, 2013


gerdemb commented May 29, 2013

@jreback

I'm not sure what's wrong with returning floats when floats have been used to do the calculations, and I personally think it is much clearer than silently casting them back into ints with the subsequent loss of precision, but it sounds like there may be other issues that I am not aware of.

Anyway, it seems that when using ints there are two possible scenarios: (1) the calculation is done entirely with ints without casting, or (2) the ints are cast to floats for the calculation and then cast back to ints when the results are returned. When calling a method, how can I tell which one will happen?

It seems like this should at least be documented, or, even better, a signal could be emitted when precision is lost by casting (perhaps by using a mechanism like decimal.Context, which allows the user to decide which conditions they wish to trap).

Thank you for your answers to my second question. The stackoverflow question you linked was actually my post! Using a custom aggregation function as suggested in the answer does work correctly, but unfortunately the performance is about 30 times worse than how='sum'. I suppose this is because of the overhead of calling a Python function instead of a C function?
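(The linked answer is not preserved in this thread. An aggregator of roughly this shape, a guess rather than a quote, keeps the arithmetic in Python's arbitrary-precision ints, which is also what makes it so much slower:

s.resample('M', how=lambda x: int(sum(int(v) for v in x)))

Each bucket pays a Python-level call plus a per-element conversion, versus a single vectorized C loop for how='sum'.)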

Finally, just want to say thanks for developing and supporting pandas. Despite my complaints, I've found it to be an incredibly useful tool in my work.


jreback commented May 29, 2013

@gerdemb

yes, unfortunately using a lambda is much slower; that said, it might be worth it to build this type of filling in (e.g. provide a fill_value= to resample), but it's a bit non-trivial

using ints in general is problematic because of the lack of a native NaN, as I have mentioned

however, pandas does try to maintain dtype where possible; that is why the cast back happens. And as I said, there is NO loss of precision (IF numpy allows it, which it does in this case), e.g.

In [1]: np.array([1],dtype='float64')
Out[1]: array([ 1.])

In [2]: np.array([1],dtype='int64')
Out[2]: array([1])

In [3]: np.array([1],dtype='float64') == np.array([1],dtype='int64')
Out[3]: array([ True], dtype=bool)


gerdemb commented May 29, 2013

@jreback

It's not true that there is no loss of precision when casting from int to float. The code in my first post demonstrates the problem and how it can affect the results of a resample operation. Another trivial example:

>>> a = np.int64(18014398509481983)
>>> a, np.float64(a)
(18014398509481983, 18014398509481984.0)
>>> np.int64(np.float64(a)) == a
False

The value of a has more precision than it is possible to store in a float64. Your example uses a value of 1, which is small enough to store exactly as a float64. Additionally, I believe it is casting the int64 to a float64 and then comparing the two floats, which doesn't really prove anything.


cpcloud commented May 29, 2013

It's not true that there is no loss of precision when casting from int to float.

In general, sure (and by definition this has to be true: a 64-bit float cannot represent every 64-bit int).

@jreback pandas does lose precision with these values. For example,

from pandas import Series, date_range

s = Series([18014398509481983] * 10, date_range(start='1/1/2000', periods=10, freq='D'))
s.resample('M')[0] == s[0]      # False
s.resample('M')[0] == s[0] + 1  # True
s += 1
s.resample('M')[0] == s[0]      # True

It (numpy) should be able to handle up to 2 ** 62 without overflow...
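(The real cutoff is the 53-bit float64 significand rather than the int64 range; a one-line check, added here for illustration:

>>> np.float64(2**53) == np.float64(2**53 + 1)
True

Any integer needing more than 53 bits can collide with a neighbor once it passes through float64.)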


jreback commented May 29, 2013

I think you are misunderstanding what pandas is actually doing. The conversion from float back to int occurs ONLY if numpy compares the values as equal (and there are no NaNs, and the original dtype was int to begin with); this is considered safe casting.

Your above example fails because the numbers don't compare equal in the first place.

However, if there is a loss of precision, e.g. your example, then I think the conversion is 'wrong' in numpy, and so pandas is wrong as well. But that is a very narrow case, and I don't think it's guaranteed to work near the endpoints of the value range in any event.

That said, if we could figure out a reasonable way to determine whether there would possibly be a loss of precision, then we could raise an error (and let the user decide what to do), as you suggested above.

Here's a related numpy discussion:
numpy/numpy#3118

I think we could do the test before casting, essentially this:

if np.issubdtype(x.dtype, np.integer):
    # round-trip through float64 and back: a direct `==` between the int and
    # float arrays would itself promote the ints to float64 and hide the loss
    if not (x.astype('float64').astype('int64') == x).all():
        raise Exception("possible data loss from casting int to float")

I don't believe this is expensive to do either.

The only issue is we'd probably need to do this in lots of places :)

The cast back already does this kind of test, so that's not a problem (or it will leave the result as float).


gerdemb commented May 30, 2013

@jreback

There are TWO casts being done in the resample example that I originally posted. The first is a cast from int to float; the second is a cast back to int from float. It doesn't matter that the second cast from float to int occurs only if numpy compares the values as equal, because we've already lost the precision in the FIRST cast from int to float. The code @cpcloud posted shows another example of the same problem. You said "your above example fails as the numbers don't compare equal in the first place", which is exactly the point! We CANNOT guarantee that a cast from int to float will not lose precision! (More precisely, for 64-bit types, if the number of bits in the integer exceeds 53, it cannot be represented exactly as a float, as that would exceed the float's significand precision.)

Anyway, I propose the following possible solutions:

  1. Leave the behavior as-is and document it as a gotcha.
  2. Leave the behavior as-is, but create a mechanism to warn the user about loss of precision. We could consider whether this warning would be raised only on a potential loss of precision (i.e. we are doing a cast that might lose precision) or on a confirmed loss of precision (like your example above, which actually checks the values).
  3. Remove the behavior of casting results back into their original dtypes.

Personally, I am strongly in favor of 3, but I realize there may be other opinions. My reasons are:

  1. It is very confusing and non-obvious that a method that accepts and returns integers is actually doing the calculations with floats and potentially losing precision. If the values are returned as floats, this would be more obvious to the user.
  2. We already have cases where a method passed ints will return floats. For example, if I reindex a Series of ints with missing values, they are converted to floats so that NaNs can be inserted.
  3. It leaves the decision of what to do with the results in the hands of the programmer. Perhaps they are happy to work with floats, or they can cast the results back into integers if they wish. It is a waste of time to always perform a cast back to the original dtype if the user doesn't care.
  4. Adding a mechanism to check for the potential loss of precision would slow down the default case, when the majority of users probably don't care. Adding these checks also makes the code more complicated.

In short, I believe this solution has the least confusing behavior, performs the fastest, and requires the least amount of code.

Thoughts?


jreback commented May 30, 2013

Nice summary. The issue is that the FIRST cast is not 'safe' (as I defined above), while the SECOND is. As for your solutions:

I know you like option 3 (cast most things to float and leave them that way), but:

  1. This violates the principle of least surprise. Things shouldn't change just because of an implementation detail. If you have an int, you generally have a good reason for it, so pandas shouldn't change it unless there is no other way.
  2. numpy and pandas go to great lengths to avoid changing dtype except where necessary. Your example is actually a special case (reindexing where you have missing values, so the dtype changes to float). In fact, see this: http://pandas.pydata.org/pandas-docs/dev/whatsnew.html#dtype-gotchas; there ARE times when we can preserve the dtype even though the implementation actually 'changes' it.
  3. You make the point (3) that the user should be in control of the dtype. Well, this is exactly right, but you can't have it both ways. Either pandas takes an int and returns an int, or the user should pass a float. Doing it halfway is not a good solution (but is unavoidable in certain cases, as we have seen). Converting EVERYTHING to float is a worse option.
  4. Your point (4) is relevant; it does make the code a bit more dense. I don't actually think it slows things down, as an astype check is pretty cheap (however, a-priori checking for NaN to determine whether we need to cast is not a good option).

So we are left with a gotchas doc, or providing a warning (or exception).

I think providing a LossOfPrecisionWarning (a sub-class of UserWarning) might be a good solution, as well as documenting it; may not get to this for 0.12.
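A minimal sketch of what that could look like, wrapping the cast done in _cython_agg_blocks; the names ensure_float_checked and LossOfPrecisionWarning are hypothetical, nothing like this exists in pandas:

import warnings

import numpy as np

class LossOfPrecisionWarning(UserWarning):
    """Hypothetical: a lossy int -> float cast happened on the way into a computation."""

def ensure_float_checked(values):
    # hypothetical stand-in for com.ensure_float(): same cast, plus a round-trip check
    if np.issubdtype(values.dtype, np.integer):
        as_float = values.astype('float64')
        if not (as_float.astype('int64') == values).all():
            warnings.warn("casting int64 to float64 loses precision for these values",
                          LossOfPrecisionWarning)
        return as_float
    return values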

Would you do a PR for a short section in http://pandas.pydata.org/pandas-docs/dev/gotchas.html for 0.11.1?


gerdemb commented May 30, 2013

@jreback

  1. In my opinion, it is much more surprising to get the WRONG ANSWER (see the code in my very first post) than a value cast to a different dtype! You're right, I am using ints for a reason, and it's more than just an "implementation detail" that they are changed to floats (and then sneakily cast back to integers) for some calculations, because this can give INCORRECT integer results.
  2. You're only giving the user the false illusion of being in control of the dtype, since some calculations may be done by casting to intermediate types, effectively ignoring the original dtype anyway. As The Zen of Python says, "Explicit is better than implicit". If it's not possible for a method to do a calculation without casting to float, you should explicitly return that float and not implicitly hide it by converting it back to an int.

I've been using pandas for quite a while, but this is my first time venturing into the "development" side of things, and I'm not sure how changes are introduced to the code, but I'd be curious to hear the opinions of other developers or users on this issue.

Finally, I realize my use case is probably unusual, and that is why no one has reported this problem before. I'm going to redesign my application to use Decimal objects instead of integers, which will be a big performance hit, but I don't want to risk finding any more "gotchas" like this.


jreback commented May 30, 2013

Some calculations could be done using ints directly, but this means additional code generation and complexity, and this is a really narrow case that in practice is very unlikely. Dtype preservation is quite important; it comes up much more in that almost all calculations with, for example, datetime64[ns] are actually done using ints and then cast back.
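A quick illustration of that last point, added here for reference: datetime64[ns] data is int64 nanoseconds under the hood, so "compute on ints, cast back" is the normal mode of operation there:

>>> import numpy as np
>>> ts = np.array(['2013-01-01', '2013-01-02'], dtype='datetime64[ns]')
>>> ts.view('int64')  # nanoseconds since the epoch
array([1356998400000000000, 1357084800000000000])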

There are tradeoffs made in pandas for performance gains; if they are giving an incorrect result, that can be fixed, and it's not really difficult (as I mentioned above). So I think a doc change now is appropriate, and we can fix this in 0.12.

My 2c: using Decimal objects is really missing the point of using pandas. Everything will be object dtype.

There are not many gotchas in pandas; in fact, great effort has gone into removing many, many more gotchas/outright bugs that exist in numpy.

Thanks for bringing this issue up, that's how things get fixed!


gerdemb commented May 30, 2013

@jreback

OK. I think we've both made all our points. :) Agreed that at least a documentation gotcha should be added here, and perhaps a LossOfPrecisionWarning later.

Finally, can I make a feature request for adding a fill_value option to resample() like reindex() has? If I could resample an integer series and fill it with zeros instead of NaN to avoid a cast to float, that would be perfect.
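(Hypothetically, mirroring the reindex() signature, the request would look something like this; a fill_value= keyword on resample() does not exist at the time of this thread:

s.resample('M', how='sum', fill_value=0)  # hypothetical API; would keep int64 throughout

)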


jreback commented May 30, 2013

@gerdemb go ahead and put up the resample request. It's a good idea.

If you could do a PR for the docs (or just post here and I can put it in), either way...

thanks.....


jreback commented Jan 9, 2017

closing in favor of #11199


gfyoung commented Jun 12, 2017

Resurrected in #16674.
