Wrong results from floordiv
#14596
Comments
Thanks for the report. Labelling as high prio, as silently incorrect results can potentially break analyses.
@orlp can you take a look?
First I note that Unfortunately So in reality you are calculating 1391332687909454336 / 15258.7890625. If we put this into WolframAlpha we find the full correct answer: Now, in Polars we compute floating point flooring division as
>>> 1391332687909454210 / (1e9 / JIFFIES_IN_ONE_SECOND)
91182379034834.0
>>> 1391332687909454210 // (1e9 / JIFFIES_IN_ONE_SECOND)
91182379034833.0
Python computes floating point flooring division as
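The two REPL results above can be reproduced in plain Python. The following is a minimal sketch, not the Polars implementation: it only shows that the integer timestamp is rounded when converted to a float, and that flooring an already-rounded quotient can differ from Python's `//` on the same operands. The variable names are illustrative.

```python
import math

JIFFIES_IN_ONE_SECOND = 65536
ts_ns = 1391332687909454210  # integer nanosecond timestamp from the report

# Converting the integer to a float rounds it to the nearest representable double.
as_float = float(ts_ns)
rounding_error = ts_ns - int(as_float)

# 1e9 / 65536 is exactly representable as a double.
divisor = 1e9 / JIFFIES_IN_ONE_SECOND

# Divide first (the quotient is rounded to a double), then floor.
divide_then_floor = math.floor(ts_ns / divisor)

# Python's float floor division on the same operands.
python_floordiv = ts_ns // divisor

print(rounding_error, divide_then_floor, python_floordiv)
```

With the values from this thread, the divide-then-floor result ends in ...834 while Python's `//` gives ...833, matching the outputs quoted above.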
And furthermore, we actually optimize
import timeit
import polars as pl
import numpy as np
a = np.random.rand(10**7)
pa = pl.Series(a)
print(timeit.timeit(lambda: a // 42.0, number=100)) # 2.65
print(timeit.timeit(lambda: pa // 42.0, number=100)) # 0.83
Does |
After some more internal discussion we've decided that we want to keep the current behavior, where Python implements (as far as I can see) true flooring division in a single logical step, whereas we first do a division followed by a flooring operation. This can be subtly different. Note that in general introducing hard discontinuous points with floating point will almost always result in surprising behavior. For example in Python:
>>> 6.0 / 0.1
60.0
>>> 6.0 // 0.1
59.0
This is because
>>> pl.Series([6.0]) // 0.1
shape: (1,)
Series: '' [f64]
[
60.0
]
In general in Polars we have the following guideline for floating point operations:
I think Take your motivating example. You start with the integer value
>>> 1391332687909454210 - int(float(1391332687909454210))
-126
For your specific example of converting integer nanosecond timestamps to a whole number of 'jiffies', I'd suggest the following using entirely integer arithmetic, making sure not to overflow since your example values are very close to the limits of a 64-bit signed integer:
JIFFIES_PER_SECOND = 65536
NANOSECONDS_PER_SECOND = 10**9
# Logical calculation: (pl.col.ts / NANOSECONDS_PER_SECOND * JIFFIES_PER_SECOND).floor()
# Calculation without risk of overflow, or using floating point:
whole_seconds = pl.col.ts // NANOSECONDS_PER_SECOND
rest_nanosec = pl.col.ts % NANOSECONDS_PER_SECOND
jiffies = whole_seconds * JIFFIES_PER_SECOND + rest_nanosec * JIFFIES_PER_SECOND // NANOSECONDS_PER_SECOND
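The same split into whole seconds and leftover nanoseconds can be checked in plain Python. This is a sketch under the assumption that timestamps are non-negative and fit in a signed 64-bit integer; the helper name `ns_to_jiffies` is hypothetical. Python integers never overflow, so the split is not needed here for correctness. It mirrors what a fixed-width integer column has to do.

```python
JIFFIES_PER_SECOND = 65536
NANOSECONDS_PER_SECOND = 10**9


def ns_to_jiffies(ts_ns: int) -> int:
    """Convert integer nanoseconds to whole jiffies using only integer arithmetic."""
    # Split first so that neither product below can exceed the int64 range
    # when ts_ns itself fits in int64.
    whole_seconds, rest_nanosec = divmod(ts_ns, NANOSECONDS_PER_SECOND)
    return (
        whole_seconds * JIFFIES_PER_SECOND
        + rest_nanosec * JIFFIES_PER_SECOND // NANOSECONDS_PER_SECOND
    )


# Reference result using arbitrary-precision integers (no overflow in Python).
ts = 1391332687909454210
reference = ts * JIFFIES_PER_SECOND // NANOSECONDS_PER_SECOND
print(ns_to_jiffies(ts), reference)
```

Because `ts = q * 10**9 + r` with `0 <= r < 10**9`, the floor of `ts * 65536 / 10**9` equals `q * 65536 + floor(r * 65536 / 10**9)`, which is why the split form agrees with the direct computation.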
I understand your point and I see how it matches the rest of the rationale for handling floats. However, I would still kindly ask for a change in the documentation for the function, as it says:
Clearly, they are not equivalent, as these two give different results. I was already computing correct jiffies in Python using Thanks for the suggestion, I actually started using it after discovering the issue. The way I got there is to convert seconds to jiffies, then take the rest, as it doesn't overflow.
@knl I will open an issue to improve the documentation of the floor division.
Well, if you were using
>>> 4611686021 * 65536 # Clearly correct.
302231455072256
>>> 4611686021000000000 // (1e9 / JIFFIES_IN_ONE_SECOND) # Oops.
302231455072255.0
If you want exact integer results for exact integer inputs, you're better off avoiding floating-point arithmetic unless you're very careful about error bounds.
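One way to see where the float path goes wrong for this whole-second timestamp is to compare it against exact rational arithmetic. The sketch below uses the standard-library `fractions.Fraction` purely as a checking tool; it is an illustration, not something proposed in this thread.

```python
from fractions import Fraction

JIFFIES_IN_ONE_SECOND = 65536
ts_ns = 4611686021000000000  # exactly 4611686021 seconds, in nanoseconds

# Exact rational result: no rounding anywhere.
exact = Fraction(ts_ns * JIFFIES_IN_ONE_SECOND, 10**9)

# Float path from the comment above: ts_ns is converted to a double first.
via_float = ts_ns // (1e9 / JIFFIES_IN_ONE_SECOND)

# Error introduced by the int -> float conversion alone.
float_input_error = int(float(ts_ns)) - ts_ns

print(exact, via_float, float_input_error)
```

The exact result is an integer, while the float path lands one below it: the input was already rounded down when it became a double, so even a perfectly accurate floor division afterwards cannot recover the intended value.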
thanks a lot @orlp. And many thanks for the explanation of the issue, it’s quite educational!
hah, true! I’m avoiding floats as much as possible exactly for the reasons you mention. Dealing with these timestamps produces a lot of surprises when there are silent conversions to floats. It was painful with pandas to do a simple
Checks
Reproducible example
Log output
Issue description
I'm trying to convert StratusVOS jiffies (there are 65536 jiffies in a second) to nanoseconds and vice-versa. In Python, I can do the following:
and will get the correct result 91182379034833. However, the given example in Polars produces the wrong values in the last two rows.
As you can see, the jiffy in the second row is 91182379034834, not 91182379034833. The diff column shows that the nanosecond value of that jiffy is higher than the starting jiffy, meaning that floordiv did not return the integral part of the division, but rounded the result.
In addition, the suggested improvement does not yield the right result either.
Expected behavior
The correct values are the ones in row jiffy2.
Installed versions