
BUG: Rounding floats which are already equal to an integer changes the value #20514

FudgeMunkey opened this issue Dec 4, 2021 · 11 comments

@FudgeMunkey

Describe the issue:

Running np.round adds 0.5 to some whole-integer floats. For example, 3061040371728385.0 becomes 3061040371728385.5. This error occurs if decimals values of 2, 5, or 8 are used.

The error becomes larger for bigger numbers: 6.2768919806476296e16 is shifted up by 8.0 when decimals values of 1, 4, 7, or 10 are used. Note that this isn't a problem if the integer 62768919806476296 is used instead, regardless of the decimals value.

*I checked decimals values in the range [0, 15].

*This bug was found using Hypothesis.

Reproduce the code example:

import numpy as np
from numpy.testing import assert_array_equal

A = np.array([3061040371728385.0])
out = np.round(A, 2)
diff = out - A # 0.5

assert_array_equal(out, A)

A = np.array([6.2768919806476296e16])
out = np.round(A, 1)
diff = out - A # 8.0

assert_array_equal(out, A)

Error message:

# For np.array([3061040371728385.0]), 2
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 0.5
Max relative difference: 1.63343158e-16
 x: array([3.06104e+15])
 y: array([3.06104e+15])

# For np.array([6.2768919806476296e16]), 1
Mismatched elements: 1 / 1 (100%)
Max absolute difference: 8.
Max relative difference: 1.27451612e-16
 x: array([6.276892e+16])
 y: array([6.276892e+16])

NumPy/Python version information:

1.21.4 3.8.10 (default, Sep 28 2021, 16:10:42) 
[GCC 9.3.0]
@FudgeMunkey
Author

@Zac-HD

@matthew-brett
Contributor

Presumably due to:

#include <stdio.h>
#include <math.h>

int main(int argc, char **argv) {
    double v = 3061040371728385.0;
    double rv = round(v * 100) / 100;
    printf("round 2dp of %f is  %f\n", v, rv);
}

Output:

round 2dp of 3061040371728385.000000 is  3061040371728385.500000
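The intermediate product is where the error enters: the exact value 3061040371728385 * 100 = 306104037172838500 needs 59 bits of mantissa, so it is rounded to a representable double before the round-to-integer step even runs. A minimal Python sketch of what I'd expect under IEEE-754 double precision (the values in the comments are my expectation, not captured output):

v = 3061040371728385.0

# At this magnitude doubles are spaced 64 apart, so the exact product
# 306104037172838500 is rounded to the nearest representable value.
scaled = v * 100
print(f"{scaled:.1f}")        # expected: 306104037172838528.0

# scaled is already integer-valued, so rounding it changes nothing; dividing
# back by 100 must then choose between doubles spaced 0.5 apart, and the
# nearest one to 3061040371728385.28 is 3061040371728385.5.
print(f"{scaled / 100:.1f}")  # expected: 3061040371728385.5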

@seberg
Member

seberg commented Dec 4, 2021

Yes, it can happen. Should be the same issue as gh-13699. Not sure if there is a good way to get around it on at least some of these code paths. Python's builtin round seems to do slightly better, so maybe there is something to steal, if it doesn't make things much slower.
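As a point of comparison (a minimal check; it relies on CPython's float.__round__ being correctly rounded), the builtin round leaves the first example from this issue alone:

import numpy as np

v = 3061040371728385.0
A = np.array([v])

# CPython computes the correctly-rounded result directly from the stored
# double, so the already-integral value should come back unchanged ...
print(round(v, 2) - v)     # expected: 0.0

# ... while the multiply/round/divide recipe picks up the 0.5 reported above.
print(np.round(A, 2) - A)  # expected: [0.5]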

@Dharmatva

Dharmatva commented Dec 25, 2021

I'd love to have a try at this issue, would that be fine with you guys? @seberg @matthew-brett @FudgeMunkey

I think it might be possible to mitigate this issue in some of the cases, but not remove it entirely. IMO the issue originates from inherent limitations of the floating-point representation.

@seberg
Member

seberg commented Dec 26, 2021

Nobody will be working on this specifically. An improvement proposal would be great!

@Dharmatva

Dharmatva commented Dec 26, 2021

Improvement Proposal

@matthew-brett, @seberg as you correctly identified, the error @FudgeMunkey stumbled upon can be attributed to limitations of the floating point representation (especially at larger scales). To get around it, we can treat the integral and the decimal portion of the input array separately and combine them after rounding the decimal portion. This leverages the fact that rounding works just fine at smaller scales.

This will ensure that:

  1. Large errors aren't introduced when rounding numbers, especially at larger scale.
  2. Rounding still works properly for smaller-scale floating point numbers

However, it is important to keep in mind that:

  1. Errors cannot be entirely removed by this technique - only reduced (likely to at most 1.0).
  2. Performance will take a hit as the following technique simply uses more operations than the existing version of np.round.

Here are my suggestions:

  • Proposal 1: We can implement two versions of round (as shown in the updated_round function below) - users may pass a specific parameter if they want a more 'precise' answer, otherwise numpy will default to the existing version of np.round

  • Proposal 2: An alternative (but still imperfect) option is to cast np.float64 inputs to np.float128 to preserve accuracy during the intermediate arithmetic, and cast back to np.float64 afterwards (a minimal sketch follows after the Proposal 1 demonstration). This also has a significant performance cost, because np.float128 isn't natively supported on most machines. Additionally, note that numbers which are large relative to np.float128's precision will suffer the same issue.

Demonstration of Proposal 1

import numpy as np

def updated_round(input_array, input_decimals, carefully_handle_large_numbers=False):
    
    if (carefully_handle_large_numbers):
        
        # separate the integral, decimal portion of the input
        integral_portion = np.floor(input_array)
        decimal_portion = input_array - integral_portion

        # combine integral, decimal portion after rounding the decimal portion
        # (relies on the EXISTING version of np.round - which should be fine for small numbers < 1)
        output_array = integral_portion + np.round(decimal_portion, input_decimals) 
    
    else:
    
        # resort to normal rounding if user hasn't requested careful
        # handling of large numbers
        output_array = np.round(input_array, input_decimals)    

    return output_array
        
# EXAMPLE 1

A = np.array([3061040371728385.0])
old_out = updated_round(A, 2, carefully_handle_large_numbers=False)
new_out = updated_round(A, 2, carefully_handle_large_numbers=True)

old_diff = old_out - A # 0.5
new_diff = new_out - A # 0.0

print("EXAMPLE 1: ")
print(old_diff, new_diff)
print()

# EXAMPLE 2 

A = np.array([6.2768919806476296e16])
old_out = updated_round(A, 1, carefully_handle_large_numbers=False)
new_out = updated_round(A, 1, carefully_handle_large_numbers=True)

old_diff = old_out - A # 8.0
new_diff = new_out - A # 0.0

print("EXAMPLE 2: ")
print(old_diff, new_diff)
print()

# EXAMPLE 3 (just to demonstrate that rounding still works fine on small numbers)

A = np.array([6.6789])

old_out = updated_round(A, 3, carefully_handle_large_numbers=False)
new_out = updated_round(A, 3, carefully_handle_large_numbers=True)

old_diff = old_out - A # 0.0001
new_diff = new_out - A # 0.0001

print("EXAMPLE 3: ")
print(old_diff, new_diff)
print()

The above code has the following output:

EXAMPLE 1: 
[0.5] [0.]

EXAMPLE 2: 
[8.] [0.]

EXAMPLE 3: 
[0.0001] [0.0001]
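And a minimal sketch of Proposal 2 on the first example (assuming np.float128 is available on the platform, as it typically is on x86-64 Linux, where it maps to 80-bit extended precision):

import numpy as np

A = np.array([3061040371728385.0])

# Round in extended precision, where A * 100 is exactly representable,
# then cast the result back to float64.
out = np.round(A.astype(np.float128), 2).astype(np.float64)
print(out - A)  # expected: [0.]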

Performance comparison of existing np.round, Proposal 1, and Proposal 2

I tested the performance of the above suggestions. It turns out that Proposal 1 is 2.3x slower, and Proposal 2 is 6.5x slower than the existing np.round implementation.

Here's the code for testing performance and the output of the test(s):

import random
import time
import numpy as np

def updated_round(input_array, input_decimals, method_type='Existing Round'):
    
    if (method_type == 'Updated Round'):
        
        # separate the integral, decimal portion of the input
        integral_portion = np.floor(input_array)
        decimal_portion = input_array - integral_portion

        # combine integral, decimal portion after rounding the decimal portion
        # (relies on the EXISTING version of np.round - which should be fine for small numbers < 1)
        output_array = integral_portion + np.round(decimal_portion, input_decimals) 
    
    elif method_type == 'Existing Round':

        # use the existing np.round implementation
        output_array = np.round(input_array, input_decimals)

    else:

        # Proposal 2: cast to np.float128 and round in extended precision
        output_array = np.round(np.array(input_array, dtype=np.float128), input_decimals)
        
    return output_array

method_name = ['Existing Round', 'Updated Round', 'Cast and Round']

# specify scale of test - number of test iterations, number of samples per test, and the magnitude of values used in testing
TEST_FREQ = 5
ITER_COUNT = 10000000
LOG_SCALE = 16

# run tests
for test_count in range(TEST_FREQ):
    method_time = []

    for METHOD_TYPE in method_name:

        rand_list = np.random.random(ITER_COUNT) * (10 ** LOG_SCALE)

        start = time.perf_counter()
        calc_round = updated_round(rand_list, input_decimals=3, method_type=METHOD_TYPE)
        end = time.perf_counter()

        method_time.append((end - start) / np.float64(ITER_COUNT))

    # normalize timing
    method_time = np.round(method_time / np.min(method_time), 2)
    
    # print some stats
    print(f"TEST ITERATION {test_count + 1} with 1e{np.round(np.log10(ITER_COUNT), 2)} numbers")
    print()
    for METHOD_TYPE, TIME_VALUE in zip(method_name, method_time):
        print(f"{METHOD_TYPE} took {TIME_VALUE}x")
    print()

Here's the output:

TEST ITERATION 1 with 1e7.0 numbers

Existing Round took 1.0x
Updated Round took 1.54x
Cast and Round took 5.47x

TEST ITERATION 2 with 1e7.0 numbers

Existing Round took 1.0x
Updated Round took 2.16x
Cast and Round took 7.4x

TEST ITERATION 3 with 1e7.0 numbers

Existing Round took 1.0x
Updated Round took 1.99x
Cast and Round took 7.08x

TEST ITERATION 4 with 1e7.0 numbers

Existing Round took 1.0x
Updated Round took 2.23x
Cast and Round took 7.42x

TEST ITERATION 5 with 1e7.0 numbers

Existing Round took 1.0x
Updated Round took 2.11x
Cast and Round took 7.27x

@Dharmatva

@seberg @matthew-brett @Zac-HD Hey guys, what do you think about my proposal?

@mattip
Member

mattip commented Dec 29, 2021

In what proportion of the cases is there a difference in the output between the methods?

I think this would be relevant if we add a dtype kwarg to np.round, since currently the output is always float, and the inherent limits of float representation of integers will limit the usefulness of the change. As mentioned in the other issue I would be curious to hear what the use case is for rounding large floats.

@Dharmatva

Dharmatva commented Dec 29, 2021

@mattip I don't think such a scenario is too common 'in the wild' (but the same is true of many other bugs). However, here are some examples where it might come into play:

  • Someone trying to create discrete bins (for let's say, a histogram) over a range of values that are really large.
  • Sensitive ML training algorithms relying on large loss values to iterate might use calculations that indirectly rely on np.round...
  • Someone who deliberately just wants to try it, I guess?

But in any case, I think rounding shouldn't lead to massive errors. Even if someone uses this function with a large value, the code should at least try to minimize the error. If a change to the code isn't warranted, there should at least be a warning in the np.round documentation with some examples of this anomalous behavior.

@seberg
Member

seberg commented Jan 4, 2022

A factor of two slower seems like quite a lot. OTOH, I would not be surprised if we could have much faster rounding for certain cases (e.g. all values without digits after the decimal point)?
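Something along these lines, perhaps (a hypothetical Python-level sketch; a real fast path would live in the C loop):

import numpy as np

def round_with_fast_path(a, decimals):
    # Hypothetical fast path: values that are already whole numbers are fixed
    # points of rounding when decimals >= 0, so the multiply/round/divide
    # round-trip (the source of the error in this issue) can be skipped.
    a = np.asarray(a, dtype=np.float64)
    if decimals >= 0 and np.all(np.isfinite(a)) and np.all(a == np.floor(a)):
        return a.copy()
    return np.round(a, decimals)

A = np.array([3061040371728385.0])
print(round_with_fast_path(A, 2) - A)  # expected: [0.]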

@miccoli
Contributor

miccoli commented Dec 28, 2023

The problem here is how "round to n decimal places" is defined for floating-point numbers with a non-decimal base.

If I got it right (but I still have to check the source), numpy uses a variation of the round-to-specified-multiple formula

$$ \mathrm {roundToMultiple} (x,m)=\mathrm {round} (x/m)\times m $$

namely

$$ \mathrm {roundToDecimalPlaces} (x, n) = \mathrm {round} (x \times 10^n ) / 10^n $$

or

$$ \mathrm {roundToDecimalPlaces} (x, n) = \mathrm {round} (x / 10^{-n} ) \times 10^{-n} $$

The problem is that both the division $\cdot / 10^n$ and the exponentiation $10^{-n}$ are inexact for binary floating-point numbers. Strictly speaking, a more refined approach should be used, because the adopted approach can break in edge cases. [1]

A related, much more serious, consequence of the way in which np.round(a, decimals=d) is defined for d>0 is that numpy floats break the Liskov substitution principle.

>>> py_f = 2.765
>>> np_f = np.float_(py_f)
>>> assert np_f == py_f
>>> assert isinstance(np_f, float)
>>> assert round(np_f, 2) == round(py_f, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError

In fact (remember that round(number) delegates to number.__round__()):

>>> round(py_f, 2)
2.77
>>> round(np_f, 2)
2.76

Who is right? One could argue in favor of numpy (round half to even), or of Python (2.77 is nearer to the stored value of 2.765 than 2.76 is):

>>> round(py_f, 2) - py_f
0.004999999999999893
>>> round(np_f, 2) - np_f
-0.0050000000000003375
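For what it's worth, the numpy result appears to come from the scaling step itself: the stored double for 2.765 is about 1.2e-16 above 2.765, so the product np.float64(2.765) * 100 should round to exactly 276.5 (the excess of roughly 1.2e-14 is well under half an ulp at that magnitude), and rint's ties-to-even then sends it down. A sketch of what I'd expect:

>>> float(np.float64(2.765) * 100)              # lands exactly on the tie point
276.5
>>> float(np.rint(np.float64(2.765) * 100))     # ties-to-even rounds down
276.0
>>> float(np.rint(np.float64(2.765) * 100) / 100)
2.76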

There are other problems with np.round, e.g. #11881, so maybe a general revamping of the code could be undertaken.

Footnotes

  1. edit: I just noticed that my observations are clearly described and discussed in the np.round docs; please excuse me for the extra noise on this thread. Nevertheless, the fact that builtin round delegates to np.round for np.float_ variables should be addressed.
