Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pandas.DataFrame.replace seems taking number string as integer and run into overflow error #25616

Closed
heygrain opened this issue Mar 9, 2019 · 6 comments

Comments

@heygrain
Copy link

commented Mar 9, 2019

Code Sample

import pandas as pd

x = {'user_id':['100000715097692381911', 
                '100003840837471130074'], 
     'item_id': [1, 2]
     }
dfx = pd.DataFrame(x)
dfx['user_id'].replace(
    {
     '100000715097692381911': 0, 
     '100003840837471130074': 1
     }, inplace=True)

Problem description

I got overflow error:

File "/home/grain/ml/lib/python3.6/site-packages/pandas/core/dtypes/cast.py", line 683, in astype_nansafe
return lib.astype_intsafe(arr.ravel(), dtype).reshape(arr.shape)
File "pandas/_libs/lib.pyx", line 546, in pandas._libs.lib.astype_intsafe
OverflowError: Python int too large to convert to C long

This error occurs on my laptop with pandas 0.24.1, python 3.6.7. The example code can handle the replacement correctly on another computer with an older version pandas 0.20.3.

Expected Output

Since I loaded big id numbers like '100000715097692381911' as string type, the pandas.DataFrame.replace() method should replace it with the corresponding value in the dictionary. But I got overflow error. It seemed pandas took the id string as integer?

And if I modify just the first element of 'user_id' column to 's100000715097692381911', then the code won't run into overflow.

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]
INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-46-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: None
pip: 19.0.3
setuptools: 40.6.3
Cython: None
numpy: 1.16.0
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@WillAyd

This comment has been minimized.

Copy link
Member

commented Mar 9, 2019

Hmm ok that is strange - investigation and PRs would certainly be welcome

@WillAyd WillAyd added Bug Dtypes labels Mar 9, 2019

@WillAyd WillAyd added this to the Contributions Welcome milestone Mar 9, 2019

@WillAyd WillAyd added the Regression label Mar 9, 2019

@ArtificialQualia

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

I'm new to working on pandas, so please note that some of the below may be incorrect.

I've looked into this a bit, and it looks like the cause is code added in #21477 in Block._replace_coerce, specifically the coerce_to_target_dtype function.

This didn't exist in the older version of pandas that the above code works on, and conversion was handled totally by the ObjectBlock.convert function. At some point, it seems that function no longer met all the needs of converting, and coerce_to_target_dtype was introduced into the replace process to meet some of those needs. However, that function tries too forcefully to convert the original Block (and it does so before it even finishes replacing values!), and thus we get an OverflowError in this case.

An easy fix to this is to just add OverflowError to the list of errors coerce_to_target_dtype can catch (here), which would fix this and retain other existing functionality. However, I'm not sure if coerce_to_target_dtype is meant to fail in the case of an int that is too large.

If we can't change that, a much more narrow fix would be to only call coerce_to_target_dtype after the last replacement is done, but that doesn't some other cases that the old code could handle (for instance: replacing one of those values in the example instead of both of them).

I couldn't see a fix we could make to ObjectBlock.convert to handle all conversions without needing to call coerce_to_target_dtype. I'm guessing this has to do with numpy changes, or perhaps changes to maybe_convert_objects, but I'm not sure.

@WillAyd

This comment has been minimized.

Copy link
Member

commented Mar 10, 2019

@ArtificialQualia thanks a lot for your input!

However, that function tries too forcefully to convert the original Block (and it does so before it even finishes replacing values!

I haven't stepped through the code myself but this seems to be the actual root issue - have you tried debugging this behavior to see if a fix can be applied to prevent this?

@ArtificialQualia

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

@WillAyd

I did try both the fixes I mentioned in my comment. They fix the issue mentioned in this issue, as well as passing all the tests in tests\series\test_replace.py.

If you want my recommendation, I would add OverflowError to coerce_to_target_dtype @ line 1082 and also move the call to coerce_to_target_dtype in _replace_coerce to after putmask so the coercion happens after the replacement. Possibly also add in a if convert: around the coerce_to_target_dtype call so it only happens on the last entry, which is similar to how ObjectBlock.convert works which I'm assuming is for performance reasons.

@WillAyd

This comment has been minimized.

Copy link
Member

commented Mar 10, 2019

@ArtificialQualia if you'd like to submit a PR would probably be easier to review and give you feedback with that. Thanks again!

@ArtificialQualia

This comment has been minimized.

Copy link
Contributor

commented Mar 10, 2019

No problem, I'll submit a PR.

@jorisvandenbossche jorisvandenbossche modified the milestones: 0.24.3, 0.24.2 Mar 12, 2019

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this issue Mar 12, 2019

sighingnow added a commit to sighingnow/pandas that referenced this issue Mar 14, 2019

Merge branch 'master' into fix-25587
* master: (22 commits)
  Fixturize tests/frame/test_operators.py (pandas-dev#25641)
  Update ValueError message in corr (pandas-dev#25729)
  DOC: fix some grammar and inconsistency issues in the User Guide (pandas-dev#25728)
  ENH: Add public start, stop, and step attributes to RangeIndex (pandas-dev#25720)
  Make Rolling.apply documentation clearer (pandas-dev#25712)
  pandas-dev#25707 - Fixed flakiness in stata write test (pandas-dev#25714)
  Json normalize nan support (pandas-dev#25619)
  TST: resolve issues with test_constructor_dtype_datetime64 (pandas-dev#24868)
  DEPR: Deprecate box kwarg for to_timedelta and to_datetime (pandas-dev#24486)
  BUG: Preserve name in DatetimeIndex.snap (pandas-dev#25585)
  Fix concat not respecting order of OrderedDict (pandas-dev#25224)
  CLN: remove pandas.core.categorical (pandas-dev#25655)
  TST/CLN: Remove more Panel tests (pandas-dev#25675)
  Pinned pycodestyle (pandas-dev#25701)
  DOC: update date of 0.24.2 release notes (pandas-dev#25699)
  BUG: Fix error in replace with strings that are large numbers (pandas-dev#25616) (pandas-dev#25644)
  BUG: fix usage of na_sentinel with sort=True in factorize() (pandas-dev#25592)
  BUG: Fix to_string output when using header (pandas-dev#16718) (pandas-dev#25602)
  CLN: Remove unused test code (pandas-dev#25670)
  CLN: remove Panel from concat error message (pandas-dev#25676)
  ...

# Conflicts:
#	doc/source/whatsnew/v0.25.0.rst
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.