BUG: incorrect reading of CSV containing large integers #52505
Comments
take
The reason you're getting an output of 8 or 0 when not passing a dtype is that the values are treated as float64, which has precision issues, so setting the dtype to int64 should work. However, the pyarrow engine does not respect the manually passed dtype. Without a dtype, all engines use float64, which rounds the large integers; when a dtype is passed, every engine except pyarrow uses it.

In the read() function of ArrowParserWrapper (arrow_parser_wrapper.py), the check self.kwds["dtype_backend"] == "pyarrow" fails because dtype_backend is NO_DEFAULT, so the frame is created without any dtype mapping:

    if self.kwds["dtype_backend"] == "pyarrow":
        frame = table.to_pandas(types_mapper=pd.ArrowDtype)  # This should be returned
    elif self.kwds["dtype_backend"] == "numpy_nullable":
        frame = table.to_pandas(types_mapper=_arrow_dtype_mapping().get)
    else:
        frame = table.to_pandas()  # This is returned because dtype_backend is NO_DEFAULT

I submitted a pull request that sets dtype_backend to "pyarrow" when the engine is pyarrow in io/parsers/readers.py. This fixes the bug when the dtype is explicitly passed to read_csv with the pyarrow engine; the other engines are unaffected by this bug.
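The float64 precision loss described above can be reproduced without pandas at all: a float64 has a 53-bit significand, so integers above 2**53 cannot all be represented exactly. A minimal sketch (the sample value is chosen only to sit just past that boundary):

```python
# float64 has 53 bits of significand, so integers above 2**53
# are silently rounded to the nearest representable value.
big = 2**53 + 1          # 9007199254740993
as_float = float(big)    # rounds to 9007199254740992.0
print(int(as_float) == big)  # False: the round trip loses the last digit
```

This is exactly what happens when a CSV engine parses a large-integer column as float64: IDs that differ only in the last digits collapse to the same value.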
Thanks. Would it also be possible to check and use nullable Int64 instead of float64 when a column contains only integers and the float dtype would incur a precision loss? Otherwise, for transactional data, the current approach simply ruins the data, and it does so silently: when one filters by transaction id or client id, they might get totally wild results.
Hi, any progress on this issue?
take
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
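(The original snippet did not survive extraction; the following is a hedged reconstruction based on the ID_DEAL column and per-engine output described below. The CSV values are invented: two large IDs differing only in the last digit, plus a missing value.)

```python
import io
import pandas as pd

# Invented stand-in for the reported file (the real data is not available).
csv = "A,ID_DEAL\n1,1234567890123456789\n2,1234567890123456788\n3,\n"

for engine in ["c", "python", "pyarrow"]:
    try:
        orders = pd.read_csv(io.StringIO(csv), engine=engine)
    except ImportError:  # the pyarrow engine needs pyarrow installed
        continue
    # With default parsing the column becomes float64 (it contains a missing
    # value), and both IDs round to the same double, so only 1 distinct ID
    # survives instead of 2.
    print(engine, orders["ID_DEAL"].dtype, orders["ID_DEAL"].nunique())
```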
Issue Description
Reading of this piece of financial data is terribly incorrect. I get the following output:
c 8
pyarrow 8
python 0
None of the engines is able to parse the data correctly, I suppose because the integers are pretty big. But pandas produces no warning; it just silently butchers the data. The issue is not new to 2.0; it was present in 1.5 as well.
It seems to be connected to float64 not having enough precision to hold all the digits. int64 would probably be fine, as the previous column works with int64, but the current column has missing values.
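The missing values are the key detail: numpy-backed int64 has no NA representation, so pandas upcasts such a column to float64, while the nullable extension dtype Int64 can hold exact integers alongside pd.NA. A minimal sketch:

```python
import pandas as pd

# A numpy int64 column cannot hold NA, so a missing value forces float64,
# and a large ID is already rounded at that point ...
s_float = pd.Series([1234567890123456789, None])
print(s_float.dtype)  # float64

# ... while the nullable extension dtype keeps exact integers plus pd.NA.
s_int = pd.Series([1234567890123456789, None], dtype="Int64")
print(s_int.dtype)                       # Int64
print(s_int[0] == 1234567890123456789)   # True: no precision loss
```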
Using the new nullable integer type kind of works, but not for pyarrow:
...
orders = pd.read_csv(test, engine=engine,dtype={"ID_DEAL": "Int64"})
...
c 2
pyarrow 8
python 2
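The Int64 workaround above can be sketched end to end. The CSV content is invented, since the original file is not part of the report; the point is that the c and python engines honor the requested dtype and keep the IDs distinct, while at the time of the report the pyarrow engine ignored it:

```python
import io
import pandas as pd

# Invented stand-in for the reported file: two IDs differing in the last digit.
csv = "A,ID_DEAL\n1,1234567890123456789\n2,1234567890123456788\n3,\n"

for engine in ["c", "python", "pyarrow"]:
    try:
        orders = pd.read_csv(io.StringIO(csv), engine=engine,
                             dtype={"ID_DEAL": "Int64"})
    except ImportError:  # the pyarrow engine needs pyarrow installed
        continue
    # c and python keep both IDs distinct under Int64; the buggy pyarrow
    # engine still went through float64 and collapsed them.
    print(engine, orders["ID_DEAL"].dtype, orders["ID_DEAL"].nunique())
```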
But certainly the loss of precision needs to be detected automatically, and either an error should be raised or the Int64 dtype suggested. Otherwise, this bug can lead to catastrophic consequences in decision support systems powered by pandas.
Expected Behavior
c 2
pyarrow 2
python 2
Installed Versions