BUG: incorrect reading of CSV containing large integers #52505

Closed
3 tasks done
fingoldo opened this issue Apr 7, 2023 · 5 comments · Fixed by #54679
Labels
Bug IO CSV read_csv, to_csv


fingoldo commented Apr 7, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
from io import StringIO

scsv = """#SYMBOL,SYSTEM,TYPE,MOMENT,ID,ACTION,PRICE,VOLUME,ID_DEAL,PRICE_DEAL
RIH3,F,S,20230301181139587,1925036343869802844,0,96690.00000,2,,
USDRUBF,F,S,20230301181139587,2023552585717888193,0,75.10000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,1,75.00000,14,,
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,B,20230301181139587,2023552585717882895,2,75.00000,1,2023552585717263358,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,B,20230301181139587,2023552585717888161,2,75.00000,1,2023552585717263359,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889759,2,75.00000,10,2023552585717263360,75.00000
USDRUBF,F,S,20230301181139587,2023552585717889863,2,75.00000,2,2023552585717263361,75.00000
USDRUBF,F,B,20230301181139587,2023552585717889827,2,75.00000,2,2023552585717263361,75.00000
"""

for engine in "c pyarrow python".split():
    test = StringIO(scsv)
    orders = pd.read_csv(test, engine=engine)
    print(engine, len(orders.query("ID_DEAL==2023552585717263360")))

Issue Description

Reading this piece of financial data goes terribly wrong. I get the following output:

c 8
pyarrow 8
python 0

None of the engines parses the data correctly, I suppose because the integers are so large. But pandas produces no warning; it just silently butchers the data. The issue is not new to 2.0; it was present in 1.5 as well.

It seems to be connected to float64 not having enough precision to hold all the digits. int64 would probably be fine, since the previous column parses correctly as int64, but the current column has missing values.
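The collapse can be reproduced without pandas at all; plain Python floats show the same loss (the ids below are taken from the sample data):

```python
# Above 2**53, float64 (a 53-bit significand) cannot represent every
# integer; near 2e18, adjacent representable doubles are 256 apart.
ids = [2023552585717263358, 2023552585717263359,
       2023552585717263360, 2023552585717263361]

as_float = {float(i) for i in ids}

# All four distinct deal ids round to the very same double, which is
# why a float-based query on one id matches the rows of all of them.
print(len(ids), len(as_float))  # 4 1
assert len(as_float) == 1
```

This is exactly why the query on `ID_DEAL==2023552585717263360` matches 8 rows instead of 2 under the float64 parse.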
Using the new nullable integer dtype partly works, but not for pyarrow:
...
orders = pd.read_csv(test, engine=engine, dtype={"ID_DEAL": "Int64"})
...

c 2
pyarrow 8
python 2
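For reference, here is a self-contained version of that workaround, using a minimal hypothetical three-row sample rather than the full CSV and the default C engine:

```python
import pandas as pd
from io import StringIO

csv = ("ID,ID_DEAL\n"
       "1,2023552585717263360\n"
       "2,2023552585717263361\n"
       "3,\n")  # missing ID_DEAL, so plain int64 cannot hold the column

# Nullable Int64 keeps full 64-bit integer precision and still admits
# the missing value; the default float64 parse would merge the two ids.
df = pd.read_csv(StringIO(csv), dtype={"ID_DEAL": "Int64"})
print(df["ID_DEAL"].tolist())  # [2023552585717263360, 2023552585717263361, <NA>]
assert df["ID_DEAL"].nunique() == 2
```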

But surely the loss of precision should be detected automatically, and either an error raised or the Int64 dtype suggested. Otherwise this bug can lead to catastrophic consequences in decision-support systems powered by pandas.

Expected Behavior

c 2
pyarrow 2
python 2

Installed Versions

INSTALLED VERSIONS
------------------
commit      : 478d340
python      : 3.8.8.final.0
python-bits : 64
OS          : Windows
OS-release  : 10
Version     : 10.0.17763
machine     : AMD64
processor   : Intel64 Family 6 Model 45 Stepping 5, GenuineIntel
byteorder   : little
LC_ALL      : None
LANG        : None
LOCALE      : Russian_Russia.1251
fingoldo added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on Apr 7, 2023
fingoldo changed the title from "BUG: incorrect reading of CSV containing large intergers" to "BUG: incorrect reading of CSV containing large integers" on Apr 7, 2023

reddyrg1 commented Apr 7, 2023

take

manjalc pushed a commit to manjalc/pandas that referenced this issue Apr 23, 2023
@manjalc manjalc mentioned this issue Apr 23, 2023

manjalc commented Apr 23, 2023

The reason you're getting an output of 8 or 0 when not passing a dtype is that the values are treated as float64, which has precision issues; setting the dtype to Int64 should therefore work. However, the pyarrow engine does not respect the manually passed dtype.

Without passing the dtype, the engines use float64 which rounds the large integers:
c 8
pyarrow 8
python 0

When passing the dtype, every engine but pyarrow uses the passed dtype:
c 2
pyarrow 8
python 2

In the read() function of ArrowParserWrapper (arrow_parser_wrapper.py), the check self.kwds["dtype_backend"] == "pyarrow" fails because dtype_backend is NO_DEFAULT, so the frame that is created has no dtype mapping:

if self.kwds["dtype_backend"] == "pyarrow":
  frame = table.to_pandas(types_mapper=pd.ArrowDtype) # This should be returned
elif self.kwds["dtype_backend"] == "numpy_nullable":
  frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
else:
  frame = table.to_pandas() # This is being returned as dtype_backend is NO_DEFAULT

I submitted a pull request that sets dtype_backend to "pyarrow" when the engine is pyarrow in io/parsers/readers.py. This fixes the bug when the dtype is explicitly passed to read_csv with the pyarrow engine; the other engines are unaffected by this bug.
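The fall-through can be illustrated with a stdlib-only sketch (the names here are illustrative stand-ins, not the actual pandas internals):

```python
NO_DEFAULT = object()  # stand-in for pandas' "no default" sentinel

def choose_conversion(dtype_backend):
    """Mirrors the branch quoted above from ArrowParserWrapper.read()."""
    if dtype_backend == "pyarrow":
        return "to_pandas(types_mapper=pd.ArrowDtype)"
    elif dtype_backend == "numpy_nullable":
        return "to_pandas(types_mapper=_arrow_dtype_mapping().get)"
    return "to_pandas()"  # bare path: the passed dtype mapping is dropped

# The user never sets dtype_backend, so the sentinel matches neither
# string and control falls through to the bare path:
print(choose_conversion(NO_DEFAULT))  # to_pandas()

# The proposed fix normalizes the backend for the pyarrow engine first:
print(choose_conversion("pyarrow"))   # to_pandas(types_mapper=pd.ArrowDtype)
```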

fingoldo (Author) commented

Thanks! Would it also be possible to add a check and use nullable Int64 rather than float64 when a column contains only integers and a float dtype would incur a precision loss? Otherwise, for transactional data, the current approach simply ruins the data, and it does so silently: when one filters by transaction id or client id, they can get totally wild results.

lithomas1 added the IO CSV (read_csv, to_csv) label and removed the Needs Triage label on May 5, 2023
fingoldo (Author) commented
Hi, any progress on this issue?


kvn4 commented Aug 17, 2023

take
