Floating point precision in DataFrame.to_csv #2069

Closed
wesm opened this Issue Oct 14, 2012 · 13 comments

Comments

Projects
None yet
6 participants

Hey all,

I just started using Pandas a few days ago and ran into a related issue.

Basically I am reading in data from a .csv file. I have been writing some unit tests and was getting some errors because my expected values were different from the ones I calculated in Excel. At first, I assumed it was due to rounding but when I inspected my data frame, I realized that I was getting errors because of floating point issues. Basically, an input price of 7.34 was now 7.3399999999999999 (I am working with stock prices).

I was just wondering what the recommended way of dealing with this is, if any? Should I be converting my data frame to another type once imported?

Thanks in advance for your help and great job on this solid library.

Owner

wesm commented Nov 3, 2012

It seems that CPython does a better job of float formatting than NumPy. I'll see what I can do

Owner

wesm commented Nov 3, 2012

I can't manage to find a standalone reproduction of this. The csv module uses str (via PyObject_Str) to format the numbers, and that appears to work fine on numbers like 0.085 or 7.34. If someone can post an example illustrating this breaking down, I'll see what I can do

Contributor

adamobeng commented Nov 28, 2012

I think I've been able to reproduce this:

    df = pa.DataFrame({'float' : [9.728141, 4.810295]})
    df.to_csv('floats.csv')

floats.csv looks like:

,float                                                                                                                                                                 
0,9.7281410000000008
1,4.810295
Owner

wesm commented Nov 28, 2012

What OS/Python/NumPy combination are you using?

Contributor

adamobeng commented Nov 28, 2012

uname -a

 Darwin boron 12.2.0 Darwin Kernel Version 12.2.0: Sat Aug 25 00:48:52 PDT 2012; root:xnu-2050.18.24~1/RELEASE_X86_64 x86_64

sys.version

'2.7.3 (default, Nov  3 2012, 17:31:26) \n[GCC 4.2.1 Compatible Apple Clang 4.0 ((tags/Apple/clang-421.0.57))]'

np.version

1.6.2

Edit: This does not happen (i.e. the output is as expected) on an EC2 node running starcluster with:

uname -a

Linux master 3.0.0-14-virtual #23-Ubuntu SMP Mon Nov 21 21:09:11 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

sys.version

'2.7.2+ (default, Oct  4 2011, 20:06:09) \n[GCC 4.6.1]'

np.version

'1.6.2'
Owner

wesm commented Nov 28, 2012

Urgh I've dug down into the belly of the Python interpreter and believe that the formatting is eventually happening in the C stdlib, which means that Linux and OS X (BSD) have slightly different implementations. This is annoying is crap.

Contributor

adamobeng commented Nov 28, 2012

If I understand correctly, the problem comes from trying to write the underlying ndarray directly.

Is there a philosophical reason why there could not be a DataFrameFormatter for the CSV format, given that FloatArrayFormatter already takes care of this problem when outputting to LaTeX, HTML and plain text?

Owner

wesm commented Nov 28, 2012

I guess the concern would be loss of precision

Contributor

adamobeng commented Nov 28, 2012

It depends whether you're using the CSV file for display or storage (i.e. as a faithful reproduction of the DataFrame). You might argue that using CSVs for storage is a bad idea anyway, because if the DataFrame contains arbitrary objects, you'll only end up with their string representations. Especially when you can serialize the same data very easily.

antonywu commented Apr 6, 2013

So the current workaround is to use Linux, instead of Mac to get the results we wanted in csv file?
Honestly, for display purpose, I would prefer the option to intentionally drop trailing digits (yes, I mean rounding)... I wonder if there is a way to make it happen with .to_csv()..or would I have to write my own .to_csv() with dataframe iteration + round()

frgomes commented May 8, 2013

I detected that read_csv has this bug too.

It's not a Python format issue. It's not a general floating point issue, despite it's true that floating point arithmetic is a subject which demands some care from the programmer. This article below clarifies a bit this subject:

http://docs.python.org/2/tutorial/floatingpoint.html

The problem is that it's necessary to employ fixed point arithmetic and only convert to floating point in the end, applying a convenient divisor.

A classic one-liner which shows the "problem" is ...

0.1 + 0.1 + 0.1
0.30000000000000004

... which does not display 0.3 as one would expect. On the other hand, if you handle the calculation using fixed point arithmetic and only in the last step you employ floating point arithmetic, it will work as you expect. See this:

(1 + 1 + 1) * 1.0 / 10
0.3

So, it's necessary to account to the position of the decimal point, ignore it initially and go ahead with the algorithm which converts text to integers (not floats!). The last step consists on converting an integer to a float by dividing by an adequate power of 10.

If you desperately need to circumvent this problem quickly, I recommend you create another CSV file which contains all figures as integers, for example multiplying by 100, 1000 or other factor which turns out to be convenient. Inside your application, read the CSV file as usual and you will get those integer values back. Then convert those values to floating point, dividing by the same factor you multiplied before.

Contributor

jreback commented Sep 21, 2013

closing in favor of #4668

jreback closed this Sep 21, 2013

olafveerman referenced this issue in developmentseed/climatescope-data Sep 12, 2014

Closed

Floating point precision #8

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment