BUG: incorrect reading of CSV containing large integers #52505

Closed
3 tasks done
fingoldo opened this issue Apr 7, 2023 · 5 comments · Fixed by #54679
Labels
Bug IO CSV read_csv, to_csv


fingoldo commented Apr 7, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

scsv = """#SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL
RIH3,F,S,20230301181139587,1925036343869802844,0,96690.00000,2,,
USDRUBF,F,S,20230301181139587,2023552585717888193,0,75.10000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,1,75.00000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,B,20230301181139587,2023552585717882895,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,B,20230301181139587,2023552585717888161,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889759,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000
"""

for engine in "c pyarrow python".split():
    test = StringIO(scsv)
    orders = pd.read_csv(test, engine=engine)
    print(engine, len(orders.query("ID_DEAL==2023552585717263360")))

Issue Description

Reading this piece of financial data goes terribly wrong. I get the following output:

c 8
pyarrow 8
python 0

None of the engines parses the data correctly, I suppose because the integers are so large. But pandas produces no warning; it just silently butchers the data. The issue is not new to 2.0; it was present in 1.5 as well.

It seems to be connected to float64 not having enough precision to hold all the digits. int64 would probably be fine, since the previous column parses correctly as int64, but the current column has missing values.
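The collapse can be reproduced without pandas at all; plain Python floats show the same loss (the ids below are taken from the sample data):

```python
# Above 2**53, float64 (a 53-bit significand) cannot represent every
# integer; near 2e18, adjacent representable doubles are 256 apart.
ids = [2023552585717263358, 2023552585717263359,
       2023552585717263360, 2023552585717263361]

as_float = {float(i) for i in ids}

# All four distinct deal ids round to the very same double, which is
# why a float-based query on one id matches the rows of all of them.
print(len(ids), len(as_float))  # 4 1
assert len(as_float) == 1
```

This is exactly why the query on `ID_DEAL==2023552585717263360` matches 8 rows instead of 2 under the float64 parse.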
Using the new nullable integer dtype partly works, but not for pyarrow:
...
orders = pd.read_csv(test, engine=engine, dtype={"ID_DEAL": "Int64"})
...

c 2
pyarrow 8
python 2
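For reference, here is a self-contained version of that workaround, using a minimal hypothetical three-row sample rather than the full CSV and the default C engine:

```python
import pandas as pd
from io import StringIO

csv = ("ID,ID_DEAL\n"
       "1,2023552585717263360\n"
       "2,2023552585717263361\n"
       "3,\n")  # missing ID_DEAL, so plain int64 cannot hold the column

# Nullable Int64 keeps full 64-bit integer precision and still admits
# the missing value; the default float64 parse would merge the two ids.
df = pd.read_csv(StringIO(csv), dtype={"ID_DEAL": "Int64"})
print(df["ID_DEAL"].tolist())  # [2023552585717263360, 2023552585717263361, <NA>]
assert df["ID_DEAL"].nunique() == 2
```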

But surely the loss of precision should be detected automatically, and either an error raised or the Int64 dtype suggested. Otherwise this bug can lead to catastrophic consequences in decision-support systems powered by pandas.

Expected Behavior

c 2
pyarrow 2
python 2

Installed Versions

INSTALLED VERSIONS
------------------
commit      : 478d340
python      : 3.8.8.final.0
python-bits : 64
OS          : Windows
OS-release  : 10
Version     : 10.0.17763
machine     : AMD64
processor   : Intel64 Family 6 Model 45 Stepping 5, GenuineIntel
byteorder   : little
LC_ALL      : None
LANG        : None
LOCALE      : Russian_Russia.1251
fingoldo added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Apr 7, 2023
fingoldo changed the title from "BUG: incorrect reading of CSV containing large intergers" to "BUG: incorrect reading of CSV containing large integers" on Apr 7, 2023

reddyrg1 commented Apr 7, 2023

take

manjalc pushed a commit to manjalc/pandas that referenced this issue Apr 23, 2023
@manjalc manjalc mentioned this issue Apr 23, 2023

manjalc commented Apr 23, 2023

The reason you're getting an output of 8 or 0 when not passing a dtype is that the values are treated as float64, which has precision issues; setting the dtype to Int64 should therefore work. However, the pyarrow engine does not respect the manually passed dtype.

Without passing the dtype, the engines use float64 which rounds the large integers:
c 8
pyarrow 8
python 0

When passing the dtype, every engine but pyarrow uses the passed dtype:
c 2
pyarrow 8
python 2

In the read() function of ArrowParserWrapper (arrow_parser_wrapper.py), the check self.kwds["dtype_backend"] == "pyarrow" fails because dtype_backend is NO_DEFAULT, so the frame that is created has no dtype mapping:

if self.kwds["dtype_backend"] == "pyarrow":
  frame = table.to_pandas(types_mapper=pd.ArrowDtype) # This should be returned
elif self.kwds["dtype_backend"] == "numpy_nullable":
  frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
else:
  frame = table.to_pandas() # This is being returned as dtype_backend is NO_DEFAULT

I submitted a pull request that sets dtype_backend to "pyarrow" when the engine is pyarrow in io/parsers/readers.py. This fixes the bug when the dtype is explicitly passed to read_csv with the pyarrow engine; the other engines are unaffected by this bug.
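The fall-through can be illustrated with a stdlib-only sketch (the names here are illustrative stand-ins, not the actual pandas internals):

```python
NO_DEFAULT = object()  # stand-in for pandas' "no default" sentinel

def choose_conversion(dtype_backend):
    """Mirrors the branch quoted above from ArrowParserWrapper.read()."""
    if dtype_backend == "pyarrow":
        return "to_pandas(types_mapper=pd.ArrowDtype)"
    elif dtype_backend == "numpy_nullable":
        return "to_pandas(types_mapper=_arrow_dtype_mapping().get)"
    return "to_pandas()"  # bare path: the passed dtype mapping is dropped

# The user never sets dtype_backend, so the sentinel matches neither
# string and control falls through to the bare path:
print(choose_conversion(NO_DEFAULT))  # to_pandas()

# The proposed fix normalizes the backend for the pyarrow engine first:
print(choose_conversion("pyarrow"))   # to_pandas(types_mapper=pd.ArrowDtype)
```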

fingoldo (Author) commented

Thanks! Would it also be possible to add a check and use nullable Int64 rather than float64 when a column contains only integers and a float dtype would incur a precision loss? Otherwise, for transactional data, the current approach simply ruins the data, and it does so silently: when one filters by transaction id or client id, they can get totally wild results.

lithomas1 added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label on May 5, 2023
fingoldo (Author) commented
Hi, any progress on this issue?


kvn4 commented Aug 17, 2023

take
