pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

heygrain · 2019-03-09T12:06:49Z

Code Sample

import pandas as pd

x = {'user_id':['100000715097692381911', 
                '100003840837471130074'], 
     'item_id': [1, 2]
     }
dfx = pd.DataFrame(x)
dfx['user_id'].replace(
    {
     '100000715097692381911': 0, 
     '100003840837471130074': 1
     }, inplace=True)

Problem description

I got overflow error:

File "/home/grain/ml/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 683, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 546, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long

This error occurs on my laptop with pandas 0.24.1, python 3.6.7. The example code can handle the replacement correctly on another computer with an older version pandas 0.20.3.

Expected Output

Since I loaded big id numbers like '100000715097692381911' as string type, the pandas.DataFrame.replace() method should replace it with the corresponding value in the dictionary. But I got overflow error. It seemed pandas took the id string as integer?

And if I modify just the first element of 'user_id' column to 's100000715097692381911', then the code won't run into overflow.

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

The text was updated successfully, but these errors were encountered:

WillAyd · 2019-03-09T19:00:06Z

Hmm ok that is strange - investigation and PRs would certainly be welcome

ArtificialQualia · 2019-03-10T19:13:55Z

I'm new to working on pandas, so please note that some of the below may be incorrect.

I've looked into this a bit, and it looks like the cause is code added in #21477 in Block._replace_coerce, specifically the coerce_to_target_dtype function.

This didn't exist in the older version of pandas that the above code works on, and conversion was handled totally by the ObjectBlock.convert function. At some point, it seems that function no longer met all the needs of converting, and coerce_to_target_dtype was introduced into the replace process to meet some of those needs. However, that function tries too forcefully to convert the original Block (and it does so before it even finishes replacing values!), and thus we get an OverflowError in this case.

An easy fix to this is to just add OverflowError to the list of errors coerce_to_target_dtype can catch (here), which would fix this and retain other existing functionality. However, I'm not sure if coerce_to_target_dtype is meant to fail in the case of an int that is too large.

If we can't change that, a much more narrow fix would be to only call coerce_to_target_dtype after the last replacement is done, but that doesn't some other cases that the old code could handle (for instance: replacing one of those values in the example instead of both of them).

I couldn't see a fix we could make to ObjectBlock.convert to handle all conversions without needing to call coerce_to_target_dtype. I'm guessing this has to do with numpy changes, or perhaps changes to maybe_convert_objects, but I'm not sure.

WillAyd · 2019-03-10T20:06:50Z

@ArtificialQualia thanks a lot for your input!

However, that function tries too forcefully to convert the original Block (and it does so before it even finishes replacing values!

I haven't stepped through the code myself but this seems to be the actual root issue - have you tried debugging this behavior to see if a fix can be applied to prevent this?

ArtificialQualia · 2019-03-10T20:32:29Z

@WillAyd

I did try both the fixes I mentioned in my comment. They fix the issue mentioned in this issue, as well as passing all the tests in tests\series\test_replace.py.

If you want my recommendation, I would add OverflowError to coerce_to_target_dtype @ line 1082 and also move the call to coerce_to_target_dtype in _replace_coerce to after putmask so the coercion happens after the replacement. Possibly also add in a if convert: around the coerce_to_target_dtype call so it only happens on the last entry, which is similar to how ObjectBlock.convert works which I'm assuming is for performance reasons.

WillAyd · 2019-03-10T20:40:25Z

@ArtificialQualia if you'd like to submit a PR would probably be easier to review and give you feedback with that. Thanks again!

ArtificialQualia · 2019-03-10T20:49:50Z

No problem, I'll submit a PR.

…-dev#25616)

#25644)

…-dev#25616) (pandas-dev#25644) (cherry picked from commit 12fd316)

* master: (22 commits) Fixturize tests/frame/test_operators.py (pandas-dev#25641) Update ValueError message in corr (pandas-dev#25729) DOC: fix some grammar and inconsistency issues in the User Guide (pandas-dev#25728) ENH: Add public start, stop, and step attributes to RangeIndex (pandas-dev#25720) Make Rolling.apply documentation clearer (pandas-dev#25712) pandas-dev#25707 - Fixed flakiness in stata write test (pandas-dev#25714) Json normalize nan support (pandas-dev#25619) TST: resolve issues with test_constructor_dtype_datetime64 (pandas-dev#24868) DEPR: Deprecate box kwarg for to_timedelta and to_datetime (pandas-dev#24486) BUG: Preserve name in DatetimeIndex.snap (pandas-dev#25585) Fix concat not respecting order of OrderedDict (pandas-dev#25224) CLN: remove pandas.core.categorical (pandas-dev#25655) TST/CLN: Remove more Panel tests (pandas-dev#25675) Pinned pycodestyle (pandas-dev#25701) DOC: update date of 0.24.2 release notes (pandas-dev#25699) BUG: Fix error in replace with strings that are large numbers (pandas-dev#25616) (pandas-dev#25644) BUG: fix usage of na_sentinel with sort=True in factorize() (pandas-dev#25592) BUG: Fix to_string output when using header (pandas-dev#16718) (pandas-dev#25602) CLN: Remove unused test code (pandas-dev#25670) CLN: remove Panel from concat error message (pandas-dev#25676) ... # Conflicts: # doc/source/whatsnew/v0.25.0.rst

WillAyd added Bug Dtype Conversions Unexpected or buggy dtype conversions labels Mar 9, 2019

WillAyd added this to the Contributions Welcome milestone Mar 9, 2019

WillAyd added the Regression Functionality that used to work in a prior pandas version label Mar 9, 2019

ArtificialQualia added a commit to ArtificialQualia/pandas that referenced this issue Mar 10, 2019

BUG: Fix error in replace with strings that are large numbers (pandas…

742d81f

…-dev#25616)

ArtificialQualia mentioned this issue Mar 10, 2019

BUG: Fix error in replace with strings that are large numbers (#25616) #25644

Merged

4 tasks

jorisvandenbossche modified the milestones: Contributions Welcome, 0.24.3 Mar 12, 2019

jorisvandenbossche closed this as completed in #25644 Mar 12, 2019

jorisvandenbossche pushed a commit that referenced this issue Mar 12, 2019

BUG: Fix error in replace with strings that are large numbers (#25616) (

12fd316

#25644)

jorisvandenbossche modified the milestones: 0.24.3, 0.24.2 Mar 12, 2019

jorisvandenbossche pushed a commit to jorisvandenbossche/pandas that referenced this issue Mar 12, 2019

BUG: Fix error in replace with strings that are large numbers (pandas…

f4e1127

…-dev#25616) (pandas-dev#25644) (cherry picked from commit 12fd316)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

heygrain commented Mar 9, 2019 •

edited

Loading

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

WillAyd commented Mar 9, 2019

ArtificialQualia commented Mar 10, 2019 •

edited

Loading

WillAyd commented Mar 10, 2019

ArtificialQualia commented Mar 10, 2019

WillAyd commented Mar 10, 2019

ArtificialQualia commented Mar 10, 2019

pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

Comments

heygrain commented Mar 9, 2019 • edited Loading

Code Sample

Problem description

Expected Output

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line] INSTALLED VERSIONS

WillAyd commented Mar 9, 2019

ArtificialQualia commented Mar 10, 2019 • edited Loading

WillAyd commented Mar 10, 2019

ArtificialQualia commented Mar 10, 2019

WillAyd commented Mar 10, 2019

ArtificialQualia commented Mar 10, 2019

heygrain commented Mar 9, 2019 •

edited

Loading

Output of `pd.show_versions()`

[paste the output of `pd.show_versions()` here below this line]
INSTALLED VERSIONS

ArtificialQualia commented Mar 10, 2019 •

edited

Loading