DOC: floating point precision on writing/reading to csv #13159

Open
FBartlett opened this issue May 12, 2016 · 8 comments

@FBartlett

Code Sample

import pandas as pd

x0 = 18292498239.824
df1 = pd.DataFrame({'One': x0},index=["bignum"])
df1.to_csv('repr_test.csv')
# Read the file back two ways through pandas
df2 = pd.DataFrame.from_csv('repr_test.csv')
df3 = pd.read_csv('repr_test.csv')
x1 = df1['One'][0]
x2 = df2['One'][0]
x3 = df3['One'][0]
# Read the raw text and parse the value with Python's own float()
fh = open('repr_test.csv','rb')
ll = fh.readlines()
fh.close()
x4 = float(ll[1].split(',')[1].split()[0])
print "x0 = %f; x1 = %f; Are they equal? %s" % (x0,x1,(x0 == x1))
print "x0 = %f; x2 = %f; Are they equal? %s" % (x0,x2,(x0 == x2))
print "x0 = %f; x3 = %f; Are they equal? %s" % (x0,x3,(x0 == x3))
print "x0 = %f; x4 = %f; Are they equal? %s" % (x0,x4,(x0 == x4))

Expected Output

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x3 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x4 = 18292498239.824001; Are they equal? True

output of pd.show_versions()

(Note that there are two, presented side-by-side, with results underneath)

INSTALLED VERSIONS                      INSTALLED VERSIONS
------------------                      ------------------
commit: None                            commit: None
python: 2.7.5.final.0                   python: 2.7.11.final.0
python-bits: 64                         python-bits: 64
OS: Linux                               OS: Linux
OS-release: 2.6.32-431.56.1.el6.x86_64  OS-release: 2.6.32-431.56.1.el6.x86_64
machine: x86_64                         machine: x86_64
processor: x86_64                       processor: x86_64
byteorder: little                       byteorder: little
LC_ALL: None                            LC_ALL: None
LANG: en_US.UTF-8                       LANG: en_US.UTF-8

pandas: 0.15.1                          pandas: 0.18.0
nose: 1.3.4                             nose: 1.3.7
Cython: 0.21.2                          Cython: 0.23.4
numpy: 1.9.1                            numpy: 1.10.4
scipy: 0.14.0                           scipy: 0.17.0                 
statsmodels: 0.6.0                      statsmodels: 0.6.1            
IPython: 2.3.0                          IPython: 4.1.2 
sphinx: 1.2.3                           sphinx: 1.3.5  
patsy: 0.3.0                            patsy: 0.4.0   
dateutil: 2.2                           dateutil: 2.5.1
pytz: 2014.9                            pytz: 2016.2   
bottleneck: None                        bottleneck: 1.0.0
tables: 3.1.1                           tables: 3.2.2    
numexpr: 2.4                            numexpr: 2.5     
matplotlib: 1.4.2                       matplotlib: 1.5.1
openpyxl: None                          openpyxl: 2.3.2  
xlrd: 0.9.3                             xlrd: 0.9.4      
xlwt: 0.7.5                             xlwt: 1.0.0      
xlsxwriter: 0.6.3                       xlsxwriter: 0.8.4
lxml: 3.3.3                             lxml: 3.6.0      
bs4: 4.3.2                              bs4: 4.4.1       
html5lib: None                          html5lib: None   
httplib2: None                          httplib2: None   
apiclient: None                         apiclient: None  
rpy2: None                              
sqlalchemy: None                        sqlalchemy: 1.0.12                                                    
pymysql: None                           pymysql: None 
psycopg2: None                          psycopg2: None
                                        pip: 8.1.1      
                                        xarray: None    
                                        setuptools: 20.3
                                        blosc: None     
                                        jinja2: 2.8     
                                        boto: 2.39.0    

Results from left setup (0.15.1):

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.823997; Are they equal? False
x0 = 18292498239.824001; x3 = 18292498239.823997; Are they equal? False
x0 = 18292498239.824001; x4 = 18292498239.824001; Are they equal? True

Results from right setup (0.18.0):

x0 = 18292498239.824001; x1 = 18292498239.824001; Are they equal? True
x0 = 18292498239.824001; x2 = 18292498239.799999; Are they equal? False
x0 = 18292498239.824001; x3 = 18292498239.799999; Are they equal? False
x0 = 18292498239.824001; x4 = 18292498239.799999; Are they equal? False

Expectations

I expect to be able to write a DataFrame to a csv file and later read it in to a new DataFrame such that the two DataFrames will be identical. The older version (result 0.15.1) is quite a bit better than the newer (since I can round to three decimal places to get the expected results or read from a filehandle instead of using from_csv() or read_csv()). The newer version (0.18.0) loses information, which is not acceptable.

Note that the documentation at http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.from_csv.html reads

It is preferable to use the more powerful pandas.read_csv() for most general purposes, but from_csv makes for an easy roundtrip to and from a file (the exact counterpart of to_csv), especially with a DataFrame of time series data.

But this does not describe what actually happens, as demonstrated above.

@sinhrks
Member

sinhrks commented May 12, 2016

Specify required precision via float_format.

df1.to_csv('repr_test.csv', float_format='%.6f')
df2 = pd.DataFrame.from_csv('repr_test.csv')
df2.iloc[0, 0]
# 18292498239.824001

Maybe the docs should have a float_format section (for output), as they do for float_precision (for input).
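To illustrate the float_format suggestion, here is a minimal sketch using plain Python `%` formatting, on the assumption that the same printf-style strings are what one would pass as float_format. The helper name `survives` and the sample values other than 18292498239.824 are made up for illustration. It checks which format strings let `float()` recover the original value exactly.

```python
# Sample values: the one from this issue, a very small number, and 0.1.
values = [18292498239.824, 1.2345678901234567e-10, 0.1]


def survives(fmt, x):
    """Return True if formatting x with fmt and parsing the text gives x back exactly."""
    return float(fmt % x) == x


# '%.6f' fixes the number of digits after the decimal point.
fixed = {x: survives('%.6f', x) for x in values}
# '%.17g' asks for 17 significant digits, which is enough for any finite double.
general = {x: survives('%.17g', x) for x in values}

for x in values:
    print(repr(x), "| %.6f round-trips:", fixed[x], "| %.17g round-trips:", general[x])
```

In this sketch '%.6f' happens to be enough for the value in this issue but not for the very small number, so a format based on significant digits looks like the safer choice when an exact round trip matters.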

@sinhrks sinhrks added the Docs label May 12, 2016
@jreback
Contributor

jreback commented May 13, 2016

Yes, this is a tradeoff between reading speed and exactness out to a certain ULP. As @sinhrks indicated, for reading we offer a higher-precision option; writing is subject to the vagaries of floating-point stringification.
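For readers unfamiliar with the term, a ULP is the gap between two adjacent representable doubles. The sketch below (standard library only, Python 3.9+ for `math.ulp` and `math.nextafter`) shows the size of that gap near the value from this issue. That the 0.15.1 read results are off by roughly one ULP is my reading of the numbers above, not something confirmed in the thread.

```python
import math

x0 = 18292498239.824

# Spacing between adjacent doubles in the neighbourhood of x0.
ulp = math.ulp(x0)
# The closest representable double below x0.
below = math.nextafter(x0, 0.0)

print("ulp near x0:", ulp)
print("x0    = %f" % x0)
print("below = %f" % below)
print("exactly one ULP apart:", x0 - below == ulp)
```

If exactness matters more than speed, the higher-precision reading option mentioned here is, as far as I know, `float_precision='round_trip'` in `read_csv`.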

@jreback jreback added this to the Next Major Release milestone May 13, 2016
@jreback jreback changed the title from "to_csv() / from_csv() roundtrip breaks for floats in 0.15.1 and 0.18.0" to "DOC: floating point precision on writing/reading to csv" May 13, 2016
@kawochen
Contributor

kawochen commented May 13, 2016

I think writing should have something similar to float_precision, since the round-trip-ability is based mostly on the number of significant digits, not the number of digits after the decimal point.

I haven't looked at the code, but the difference here seems to be related to defaulting to __str__() vs __repr__() on Python 2. __repr__() has enough digits for a round trip.
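A small sketch of the __str__() vs __repr__() point, runnable on Python 3. Since `str` and `repr` agree for floats on Python 3, the Python 2 `str(float)` behaviour is emulated here with `'%.12g'` (12 significant digits); that emulation is an assumption for illustration, not code taken from pandas.

```python
x0 = 18292498239.824

# Assumption: '%.12g' stands in for what Python 2's str(float) produced.
py2_str_like = '%.12g' % x0
# Python 3 repr gives the shortest string that parses back to the same double.
shortest = repr(x0)

print(py2_str_like, "round-trips:", float(py2_str_like) == x0)
print(shortest, "round-trips:", float(shortest) == x0)
```

The 12-digit text parses back to a different double, which matches the 18292498239.799999 values reported for 0.18.0 above.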

@BlGene

BlGene commented Nov 15, 2017

Please also consider the case where different columns have different rounding levels.
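One way to picture per-column rounding, sketched with the standard library `csv` module rather than pandas. The column names, formats, and rows are hypothetical. With pandas, as far as I know, a single float_format applies to every float column, so one would probably format the chosen columns before calling to_csv.

```python
import csv
import io

# Hypothetical per-column formats: two decimals for 'price', full precision for 'ratio'.
column_formats = {'price': '%.2f', 'ratio': '%.17g'}
rows = [
    {'price': 1234.5678, 'ratio': 1 / 3},
    {'price': 0.005, 'ratio': 2 / 3},
]

# Write each cell with the format chosen for its column.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(column_formats))
writer.writeheader()
for row in rows:
    writer.writerow({name: fmt % row[name] for name, fmt in column_formats.items()})

# Read the text back and parse every cell as a float.
buf.seek(0)
read_back = [{k: float(v) for k, v in rec.items()} for rec in csv.DictReader(buf)]

print(buf.getvalue())
print("ratio column round-trips:", all(r['ratio'] == o['ratio'] for r, o in zip(read_back, rows)))
```

Here the 'ratio' column survives exactly, while 'price' is deliberately rounded and so does not compare equal to the original values.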

@jbrockmendel jbrockmendel added the IO CSV read_csv, to_csv label Jan 11, 2022
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@dhavide

dhavide commented Oct 19, 2022

take

@hualiu01

@dhavide Is this issue resolved? Can I take this issue?

@hualiu01

take

@hualiu01

hualiu01 commented Aug 20, 2024

Tested the original error with Python 3 (3.10.14); the printed values are as expected.

Specifically, see code:

import pandas as pd

WORKDIR = '../tmp'

x0 = 18292498239.824
df1 = pd.DataFrame({'One': x0}, index=["bignum"])
df1.to_csv(f'{WORKDIR}/repr_test.csv')
# from_csv has been removed from recent pandas, so the df2/x2 case is skipped
# df2 = pd.DataFrame.from_csv('repr_test.csv')
df3 = pd.read_csv(f'{WORKDIR}/repr_test.csv')
x1 = df1['One'].loc[df1.index[0]]
# x2 = df2['One'][0]
x3 = df3['One'].loc[df3.index[0]]

# Read the raw bytes; each line must be decoded before splitting on Python 3
with open(f'{WORKDIR}/repr_test.csv', 'rb') as fh:
    ll = fh.readlines()

# x4 = float(ll[1].split(',')[1].split()[0])
x4 = float(ll[1].decode().split(',')[1].split()[0])

print(f"x0 = {x0}; x1 = {x1}; Are they equal? {x0 == x1}")
# print(f"x0 = {x0}; x2 = {x2}; Are they equal? {x0 == x2}")
print(f"x0 = {x0}; x3 = {x3}; Are they equal? {x0 == x3}")
print(f"x0 = {x0}; x4 = {x4}; Are they equal? {x0 == x4}")

output

x0 = 18292498239.824; x1 = 18292498239.824; Are they equal? True
x0 = 18292498239.824; x3 = 18292498239.824; Are they equal? True
x0 = 18292498239.824; x4 = 18292498239.824; Are they equal? True
