Segmentation fault when trying to load large json file #11344

Closed
eddie-dunn opened this Issue Oct 16, 2015 · 21 comments

I have a 1.1 GB JSON file that I try to load with pandas.json.load:

import pandas as pd
with open('/tmp/file.json') as fileh:
    text = pd.json.load(fileh)

It crashes with the following output:

Error in `python3': double free or corruption (out): 0x00007ffe082171f0 ***

I can load the file with Python's built-in json module. What is going wrong here?

Contributor

jreback commented Oct 16, 2015

you need to use read_json (or the built-in json module). pd.json is not the same thing
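A minimal sketch of the two supported paths mentioned here; the sample file and its contents are illustrative, not from the thread:

```python
import json
import tempfile

import pandas as pd

# write a small sample file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump({'a': {'x': 1}, 'b': {'x': 2}}, tmp)
    path = tmp.name

# pandas' public JSON reader (pd.json is an internal C module, not a loader API)
df = pd.read_json(path)

# the standard-library equivalent
with open(path) as fileh:
    data = json.load(fileh)
```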

jreback closed this Oct 16, 2015

I get the same segfault when using read_json.

Contributor

jreback commented Oct 16, 2015

well not sure how you generated it; it's possible it's not valid json / parseable by read_json

I've generated the json file with both Python's built-in json.dump and pandas.json.dump; same result.

json.load can read the file just fine, so the json should be valid.

Contributor

TomAugspurger commented Oct 16, 2015

Can you try to narrow it down to a smaller example that generates the segfault, and post that file?

  1. This issue should be reopened
  2. I extracted a subset of the json-file (~200 MB) and read_json worked fine

As I said originally, I think the issue is with the size of the json file.

Contributor

TomAugspurger commented Oct 19, 2015

Can you

  1. Join 5 or 6 of those 200MB sections into a larger one and try to read that
  2. Work through the next 5 or 6 sections of the file in 200MB chunks

and see if either of those fail?
It shouldn't be segfaulting, but we'll need a reproducible example before we can start to diagnose where the bug is.
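The bisection suggested above can be sketched as follows; the data, slice sizes, and helper name are placeholders, not from the thread:

```python
import itertools
import json
import tempfile

import pandas as pd

full_data = {i: i for i in range(1000)}  # stand-in for the real 1.1 GB payload

def dump_slice(data, n_keys, path):
    """Write the first n_keys entries of data to a JSON file."""
    subset = dict(itertools.islice(data.items(), n_keys))
    with open(path, 'w') as fileh:
        json.dump(subset, fileh)

path = tempfile.NamedTemporaryFile(suffix='.json', delete=False).name

# grow the slice until read_json fails; the last size in sizes_ok
# before a crash brackets the problematic file size
sizes_ok = []
for n in (250, 500, 750, 1000):
    dump_slice(full_data, n, path)
    series = pd.read_json(path, typ='series')
    sizes_ok.append(n)
```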

I realized that the subset only took 200 MB of RAM when loaded in the REPL; once dumped to a JSON file on disk it was much smaller. Four joined subsets did not present a problem for read_json, as the resulting file was only 321 MB.

I increased the number of subsets: 7 subsets at 561 MB and 11 at 881 MB worked fine. 14 joined subsets at 1.1 GB crashed IPython:

In [44]: with open('/tmp/pandas14x.json') as fileh:
   ....:     pand = pd.read_json(fileh)
   ....:
*** Error in `/usr/bin/python3': double free or corruption (out): 0x00007ffedeb05b40 ***
[1]    4914 abort (core dumped)  ipython3

It might be worth your while to generate your own large JSON file to debug why read_json crashes Python when the file is too big.

Contributor

jreback commented Oct 19, 2015

@eddie-dunn it would be really helpful for you to show a copy-pastable example that reproduces the problem. These cases are almost always a function of the dtypes of the structure you are saving.

This would make debugging much easier, as this is the first report I have ever seen for this type of bug.

Further pls show pd.show_versions().

Code to generate the example file:

#!/usr/bin/env python3
import json

SIZE = int(5e7)

FILENAME = 'generated.json'

print("Generating json with {} elements...".format(SIZE))
mdict = {i: i for i in range(SIZE)}

print("Dumping json to {}...".format(FILENAME))
with open(FILENAME, 'w') as fileh:
    json.dump(mdict, fileh)

Code to test loading the file:

#!/usr/bin/env python3
import json
import pandas as pd
import sys


FILENAME = 'generated.json'

try:
    PANDAS = sys.argv[1] != 'n'  # pass 'n' to use the built-in json module
except IndexError:
    PANDAS = True

print("Loading json{}...".format(" with pandas" if PANDAS else ""))
with open(FILENAME) as fileh:
    if PANDAS:
        pd.json.load(fileh)  # this is the call that crashes
    else:
        json.load(fileh)  # set PANDAS=False and it will work

print("If you see this, it worked!")

Please note that if you run the script with PANDAS=False you will need approximately 8 GB of RAM, or it will exit with an out-of-memory error.
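As an aside, not from the thread: if the data can be rewritten as line-delimited JSON, later pandas versions can read it in bounded-memory chunks via the lines= and chunksize= arguments of read_json (check that your version supports them):

```python
import json
import tempfile

import pandas as pd

# write a small line-delimited sample; each line is one independent record
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for i in range(100):
        tmp.write(json.dumps({'key': i, 'val': i * 2}) + '\n')
    path = tmp.name

# iterate over DataFrames of at most 25 rows instead of loading everything
reader = pd.read_json(path, lines=True, chunksize=25)
total_rows = sum(len(chunk) for chunk in reader)
```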

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: None
numpy: 1.8.2
scipy: 0.13.3
statsmodels: None
IPython: 3.2.0
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Contributor

jreback commented Oct 20, 2015

pls try this with 0.17.0 and report back.

Contributor

TomAugspurger commented Oct 20, 2015

I can reproduce it on 0.17.

Yes, pandas still segfaults on 0.17.

jreback added this to the Next Major Release milestone Oct 20, 2015

jreback reopened this Oct 20, 2015

Contributor

jreback commented Oct 20, 2015

ok will reopen if anyone cares to dig into the c-code

@kawochen

Contributor

kawochen commented Oct 20, 2015

Will hopefully submit a PR tonight

Contributor

jreback commented Oct 20, 2015

xref #7641 as well

jreback referenced this issue Oct 20, 2015

Closed

kernel crash #7641

jreback modified the milestone: 0.17.1, Next Major Release Oct 21, 2015

Contributor

jreback commented Oct 21, 2015

does #11393 fix it for you? (you have to build from the PR to test)

adri0 commented Oct 23, 2015

I'm also getting a segmentation fault when using the read_csv method to load a file of around 160 MB in pandas 0.17.0.
When I downgrade to 0.16.2 it works fine.

Contributor

TomAugspurger commented Oct 23, 2015

@adri0 read_csv uses different C code than read_json, so that's probably a separate bug. Could you open a new issue for that, with an example that reproduces the segfault?

adri0 commented Oct 23, 2015

Okay, thanks. I just opened #11419

jreback closed this in #11393 Oct 23, 2015
