BUG: OverflowError on to_json with numbers larger than sys.maxsize #34395

Closed
2 of 3 tasks
kinghuang opened this issue May 26, 2020 · 8 comments · Fixed by #34473
Labels
Bug IO JSON read_json, to_json, json_normalize
Milestone

Comments

@kinghuang

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of pandas.
  • (optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import sys
from pandas.io.json import dumps

dumps(sys.maxsize)
dumps(sys.maxsize + 1)

Problem description

The pandas JSON dumper doesn't seem to handle integers larger than sys.maxsize (the largest value that fits in a signed machine word, 9223372036854775807 on this 64-bit build). I have a DataFrame that I'm trying to write with to_json, but it fails with OverflowError: int too big to convert, because it contains some numbers larger than that.

Passing a default_handler doesn't help; it never gets called for this error.

>>> dumps(sys.maxsize)
'9223372036854775807'
>>> dumps(sys.maxsize + 1, default_handler=str)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: int too big to convert
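
To make the boundary concrete: on a 64-bit CPython build, sys.maxsize is 2**63 - 1, the largest signed 64-bit integer. The following is a minimal sketch of which values fall outside that range. The helper name fits_in_int64 is hypothetical and not part of pandas, and the assumption that the encoder stores integers in a signed 64-bit type is inferred from the error above rather than confirmed here.

```python
import sys

# Bounds of a signed 64-bit integer.
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def fits_in_int64(value: int) -> bool:
    """Return True if value is representable as a signed 64-bit integer."""
    return INT64_MIN <= value <= INT64_MAX


# Print each candidate next to whether it fits; the results for the
# sys.maxsize-based candidates depend on the platform's word size.
for candidate in (sys.maxsize, sys.maxsize + 1, INT64_MIN, INT64_MIN - 1):
    print(candidate, fits_in_int64(candidate))
```

Under that assumption, the values for which the helper returns False are the ones that would be expected to trigger the OverflowError.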

Expected Output

Python's built-in json module handles large numbers without issues.

>>> import json
>>> json.dumps(sys.maxsize)
'9223372036854775807'
>>> json.dumps(sys.maxsize+1)
'9223372036854775808'

I expect Pandas to be able to output large numbers to JSON. An option to use the built-in json module instead of ujson would be fine.
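
As a possible stopgap along those lines, the data can be converted to plain Python objects first and then serialized with the built-in json module. The sketch below uses a hand-written list of dicts as a stand-in for what DataFrame.to_dict(orient="records") would return; the helper name records_to_json is hypothetical, and whether to_dict yields plain Python ints for a given column dtype is an assumption that should be checked against the actual data.

```python
import json


def records_to_json(records):
    """Serialize a list of row dicts with the standard-library encoder.

    The compact separators drop the spaces that json.dumps inserts by
    default after ',' and ':'.
    """
    return json.dumps(records, separators=(",", ":"))


# Stand-in for df.to_dict(orient="records"); values chosen to fall outside int64.
records = [
    {"id": 1, "value": 2**63},
    {"id": 2, "value": -(2**63) - 1},
]

print(records_to_json(records))
# [{"id":1,"value":9223372036854775808},{"id":2,"value":-9223372036854775809}]
```

This trades the speed of the C encoder for arbitrary-precision integer support, so it is probably only attractive for small frames.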

Output of pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.8.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.19.76-linuxkit
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : None
pyxlsb : None
s3fs : 0.4.2
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

@kinghuang kinghuang added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2020
@arw2019
Member

arw2019 commented May 27, 2020

I checked that this bug exists in the master version.

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 62c7dd3
python : 3.8.2.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-101-generic
Version : #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020
machine : x86_64
processor :
byteorder : little
LC_ALL : C.UTF-8
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.1.0.dev0+1681.g62c7dd3e7
numpy : 1.17.5
pytz : 2020.1
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.4.0.post20200518
Cython : 0.29.19
pytest : 5.4.2
hypothesis : 5.15.1
sphinx : 3.0.4
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.14.0
pandas_datareader: None
bs4 : 4.9.1
bottleneck : 1.3.2
fastparquet : 0.4.0
gcsfs : None
matplotlib : 3.2.1
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : 0.17.1
pytables : None
pyxlsb : None
s3fs : 0.4.2
scipy : 1.4.1
sqlalchemy : 1.3.17
tables : 3.6.1
tabulate : 0.8.7
xarray : 0.15.1
xlrd : 1.2.0
xlwt : 1.3.0
numba : 0.49.1

@dsaxton dsaxton added the IO JSON read_json, to_json, json_normalize label May 27, 2020
@arw2019
Member

arw2019 commented May 27, 2020

I dug a little and tracked the problem down to the version of dumps specified in pandas._libs. The following reproduces the same bug as above:

import sys
import pandas as pd

pd._libs.json.dumps(sys.maxsize)
pd._libs.json.dumps(sys.maxsize + 1)

I'm stuck on finding the actual code for dumps inside _libs. I'm happy to keep going with this, though, if somebody can give me a prod in the right direction!

@kinghuang
Author

I think the implementation comes from the embedded version of ultrajson in pandas/_libs/src/ujson. I'm not sure how it's vendored or gets linked up, though.

@mroeschke mroeschke removed the Needs Triage Issue that has not been reviewed by a pandas team member label May 27, 2020
@arw2019
Member

arw2019 commented May 27, 2020

Thanks!

It looks to me like the code which does the encoding is in pandas/_libs/src/ujson/lib/ultrajsonenc.c and it gets linked up in pandas/_libs/src/ujson/python/objToJSON.c.

@arw2019
Member

arw2019 commented May 27, 2020

I guess there is no way to fix the problem without modifying the ultrajson source code?

In pandas/io/json/_json.py, dumps is simply an alias for ultrajson's dumps function, so I think resolving the current bug requires changes to ultrajson.

import pandas._libs.json as json                       # line 10
dumps = json.dumps                                     # line 28

@WillAyd
Member

WillAyd commented May 27, 2020

Related to #20599. This isn't really feasible to do in the ujson source, so we would probably have to catch the error and coerce the value to a serializable type.

@arw2019
Member

arw2019 commented May 27, 2020

@WillAyd Thanks for this!

Reading through that thread it seems like a solution to this issue would be to wrap ultrajson's dumps and catch the OverflowError inside pandas/io/json/_json.py.

So, instead of:

dumps = json.dumps   # line 28

we would do something like this:

def dumps(obj, default_handler=str, **kwargs):
    try:
        return json.dumps(obj, **kwargs)
    except OverflowError:
        return json.dumps(default_handler(obj), **kwargs)

This fixes the original error. I checked that with this change the code still passes the unit test in pandas/tests/io/json/test_ujson.py - so the rewrite doesn't seem to break anything.
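
To see what this wrapper would emit without needing an affected pandas build, here is a minimal sketch that swaps in a stand-in encoder. strict_dumps is hypothetical: it is the built-in json.dumps plus an explicit check that rejects top-level ints outside the signed 64-bit range, which only imitates the failure reported in this issue and does not inspect nested values.

```python
import json

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def strict_dumps(obj, **kwargs):
    """Hypothetical stand-in for the ujson-based dumps: rejects out-of-range ints."""
    is_plain_int = isinstance(obj, int) and not isinstance(obj, bool)
    if is_plain_int and not INT64_MIN <= obj <= INT64_MAX:
        raise OverflowError("int too big to convert")
    return json.dumps(obj, **kwargs)


def dumps(obj, default_handler=str, **kwargs):
    """The wrapper proposed above, applied to the stand-in encoder."""
    try:
        return strict_dumps(obj, **kwargs)
    except OverflowError:
        # Coerce the whole object with default_handler and encode the result.
        return strict_dumps(default_handler(obj), **kwargs)


print(dumps(2**63 - 1))  # 9223372036854775807
print(dumps(2**63))      # "9223372036854775808" (a quoted JSON string)
```

With default_handler=str the large value comes out as a JSON string rather than a JSON number, so the result differs from what the built-in json module produces for the same input. The handler is also applied to the whole object, so a container holding one large int would be stringified in its entirety.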

I'm happy to keep working on this if this solution isn't quite right!

Once we've settled on the fix, would the next steps be these?

  • add the testcase to pandas/tests/io/json/test_ujson.py
  • submit a pull request
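
A test case along these lines might look like the sketch below. It is written against an arbitrary encode callable so it can be run here with the built-in json.dumps; the function name check_large_int_encoding and the chosen values are assumptions, not the test that ended up in pandas/tests/io/json/test_ujson.py.

```python
import json


def check_large_int_encoding(encode):
    """Check that encode writes out-of-range ints as bare JSON numbers.

    Expected strings are built by hand from str(value) so the check does
    not depend on another encoder. Returns the number of values checked.
    """
    values = (2**63, 2**64, -(2**63) - 1)
    for value in values:
        # A scalar should encode to its plain decimal representation.
        assert encode(value) == str(value)
        # A single-element list avoids any dependence on separator style.
        assert encode([value]) == "[" + str(value) + "]"
    return len(values)


# Run the check against the standard-library encoder as a reference.
print(check_large_int_encoding(json.dumps))
```

In a pandas test module the same body could be pointed at the ujson-based encoder instead of json.dumps.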

@WillAyd
Member

WillAyd commented May 28, 2020

Yeah, if you want to add a test case and submit a pull request, we can go from there. You will also want to check the performance benchmarks for JSON; there is more info on running them here:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

arw2019 added a commit to arw2019/pandas that referenced this issue May 30, 2020
@jreback jreback added this to the 1.1 milestone Jun 24, 2020
WillAyd added a commit that referenced this issue Jun 24, 2020
* BUG: overflow on to_json with numbers larger than sys.maxsize

* TST: overflow on to_json with numbers larger than sys.maxsize (#34395)

* DOC: update with issue #34395

* TST: removed unused import

* ENH: added case JT_BIGNUM to encode

* ENH: added JT_BIGNUM to JSTYPES

* BUG: changed error for ints>sys.maxsize into JT_BIGNUM

* ENH: removed debug statements

* BUG: removed dumps wrapper

* removed bigNum from TypeContext

* TST: fixed bug in the test

* added pointer to string rep converter for BigNum

* TST: removed ujson.loads from the test

* added getBigNumStringValue

* added code to JT_BIGNUM handler by analogy with JT_UTF8

* TST: update pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <william.ayd@icloud.com>

* added Object_getBigNumStringValue to pyEncoder

* added skeletal code for Object_GetBigNumStringValue

* completed Object_getBigNumStringValue using PyObject_Repr

* BUG: changed Object_getBigNumStringValue

* improved Object_getBigNumStringValue some more

* update getBigNumStringValue argument

* corrected Object_getBigNumStringValue

* more fixes to Object_getBigNumStringValue

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

* Update pandas/_libs/src/ujson/python/objToJSON.c

* updated pyEncoder for JT_BIGNUM

* updated pyEncoder

* moved getBigNumStringValue to pyEncoder

* fixed declaration of Object_getBigNumStringValue

* fixed Object_getBigNumStringValue

* catch overflow error with PyLong_AsLongLongAndOverflow

* remove unnecessary error check

* added shortcircuit for error check

* simplify int overflow error catching

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update long int test in pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <william.ayd@icloud.com>

* removed tests expecting numeric overflow

* remove underscore from overflow

Co-authored-by: William Ayd <william.ayd@icloud.com>

* removed underscores from _overflow everywhere

* fixed small typo

* fix type of exc

* deleted numeric overflow tests

* remove extraneous condition in if statement

Co-authored-by: William Ayd <william.ayd@icloud.com>

* remove extraneous condition in if statement

Co-authored-by: William Ayd <william.ayd@icloud.com>

* change _Bool into int

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* Update pandas/_libs/src/ujson/lib/ultrajsonenc.c

Co-authored-by: William Ayd <william.ayd@icloud.com>

* allocate an extra byte in Object_getBigNumStringValue

Co-authored-by: William Ayd <william.ayd@icloud.com>

* allocate an extra byte in Object_getBigNumStringValue

Co-authored-by: William Ayd <william.ayd@icloud.com>

* reinstate RESERVE_STRING(szlen) in JT_BIGNUM case

* replaced (private) with (public) in whatnew

* release bytes in Object_endTypeContext

* in JT_BIGNUM change if+if into if+else if

* added reallocation of bigNum_bytes

* removed bigNum_bytes

* added to_json test for ints>sys.maxsize

* Use python malloc to match PyObject_Free in endTypeContext

Co-authored-by: William Ayd <william.ayd@icloud.com>

* TST: added manually constructed strs to compare encodings

* fixed styling to minimize diff with master

* fixed styling

* fixed conflicts with master

* fix styling to minimize diff

* fix styling to minimize diff

* fixed styling

* added negative nigNum to test_to_json_large_numers

* added negative nigNum to test_to_json_large_numers

* Update pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <william.ayd@icloud.com>

* fixe test_to_json_for_large_nums for -ve

* TST: added xfail for ujson.encode with long int input

* TST: fixed variable names in test_to_json_large_numbers

* TST: added xfail test for json.decode Series with long int

* TST: added xfail test for json.decode DataFrame with long int

* BENCH: added benchmarks for long ints

Co-authored-by: William Ayd <william.ayd@icloud.com>
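
The commit list above suggests the final fix detects the overflow in C (PyLong_AsLongLongAndOverflow) and routes such values through a separate JT_BIGNUM case that writes their string representation. The following is a rough Python analogue of that control flow, offered only as a sketch: the function names and the "fixed64" label are hypothetical, and the real logic lives in objToJSON.c and ultrajsonenc.c, which are not reproduced in this thread.

```python
LLONG_MIN = -(2**63)
LLONG_MAX = 2**63 - 1


def as_long_long_and_overflow(value):
    """Mimic PyLong_AsLongLongAndOverflow: return (result, overflow).

    overflow is 0 when value fits in a signed 64-bit integer, +1 when it
    is too large and -1 when it is too small; result is -1 on overflow.
    """
    if value > LLONG_MAX:
        return -1, 1
    if value < LLONG_MIN:
        return -1, -1
    return value, 0


def encode_int(value):
    """Return (kind, text), where text is the JSON number for value."""
    result, overflow = as_long_long_and_overflow(value)
    if overflow:
        # Bignum path: copy the decimal digits through as an unquoted number.
        return "bignum", str(value)
    # Fixed-width path: the value fits, so format the 64-bit result.
    return "fixed64", str(result)


for number in (2**63 - 1, 2**63, -(2**63) - 1):
    print(number, encode_int(number))
```

Both paths produce a bare JSON number, which matches what the built-in json module emits for the same inputs.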
fangchenli pushed a commit to fangchenli/pandas that referenced this issue Jun 27, 2020