Segmentation fault when trying to load large json file #11344

Closed
eddie-dunn opened this Issue Oct 16, 2015 · 21 comments

I have a 1.1 GB JSON file that I try to load with pandas.json.load:

import pandas as pd
with open('/tmp/file.json') as fileh:
    text = pd.json.load(fileh)

It crashes with the following output:

Error in `python3': double free or corruption (out): 0x00007ffe082171f0 ***

I can load the file with Python's built-in json module. What is going wrong here?

Contributor

jreback commented Oct 16, 2015

you need to use read_json (or the built-in json module). pd.json is not the same thing
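A minimal sketch of the two supported paths mentioned here; the sample file and its contents are illustrative, not from the thread:

```python
import json
import tempfile

import pandas as pd

# write a small sample file so the example is self-contained
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    json.dump({'a': {'x': 1}, 'b': {'x': 2}}, tmp)
    path = tmp.name

# pandas' public JSON reader (pd.json is an internal C module, not a loader API)
df = pd.read_json(path)

# the standard-library equivalent
with open(path) as fileh:
    data = json.load(fileh)
```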

jreback closed this Oct 16, 2015

I get the same segfault when using read_json.

Contributor

jreback commented Oct 16, 2015

well not sure how you generated it; it's possible it's not valid json / parseable by read_json

I've generated the json file with both Python's built-in json.dump and pandas.json.dump; same result.

json.load can read the file just fine, so the json should be valid.

Contributor

TomAugspurger commented Oct 16, 2015

Can you try to narrow it down to a smaller example that generates the segfault, and post that file?

  1. This issue should be reopened
  2. I extracted a subset of the json-file (~200 MB) and read_json worked fine

As I said originally, I think the issue is with the size of the json file.

Contributor

TomAugspurger commented Oct 19, 2015

Can you

  1. Join 5 or 6 of those 200MB sections into a larger one and try to read that
  2. Work through the next 5 or 6 sections of the file in 200MB chunks

and see if either of those fail?
It shouldn't be segfaulting, but we'll need a reproducible example before we can start to diagnose where the bug is.
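The bisection suggested above can be sketched as follows; the data, slice sizes, and helper name are placeholders, not from the thread:

```python
import itertools
import json
import tempfile

import pandas as pd

full_data = {i: i for i in range(1000)}  # stand-in for the real 1.1 GB payload

def dump_slice(data, n_keys, path):
    """Write the first n_keys entries of data to a JSON file."""
    subset = dict(itertools.islice(data.items(), n_keys))
    with open(path, 'w') as fileh:
        json.dump(subset, fileh)

path = tempfile.NamedTemporaryFile(suffix='.json', delete=False).name

# grow the slice until read_json fails; the last size in sizes_ok
# before a crash brackets the problematic file size
sizes_ok = []
for n in (250, 500, 750, 1000):
    dump_slice(full_data, n, path)
    series = pd.read_json(path, typ='series')
    sizes_ok.append(n)
```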

I realized that the subset only took 200 MB of RAM when loaded in the REPL; once dumped to a JSON file on disk it was much smaller. Four joined subsets did not present a problem for read_json, as the resulting file was only 321 MB.

I increased the number of subsets: 7 subsets at 561 MB and 11 at 881 MB worked fine. 14 joined subsets at 1.1 GB crashed IPython:

In [44]: with open('/tmp/pandas14x.json') as fileh:
   ....:     pand = pd.read_json(fileh)
   ....:
*** Error in `/usr/bin/python3': double free or corruption (out): 0x00007ffedeb05b40 ***
[1]    4914 abort (core dumped)  ipython3

It might be worth your while to generate your own large JSON file to debug why read_json crashes Python when the file is too big.

Contributor

jreback commented Oct 19, 2015

@eddie-dunn it would be really helpful for you to show a copy-pastable example that reproduces the problem. These cases are almost always a function of the dtypes of the structure you are saving.

This would make debugging much easier, as this is the first report I have ever seen for this type of bug.

Further pls show pd.show_versions().

Code to generate the example file:

#!/usr/bin/env python3
import json

SIZE = int(5e7)

FILENAME = 'generated.json'

print("Generating json with {} elements...".format(SIZE))
mdict = {i: i for i in range(SIZE)}

print("Dumping json to {}...".format(FILENAME))
with open(FILENAME, 'w') as fileh:
    json.dump(mdict, fileh)

Code to test loading the file:

#!/usr/bin/env python3
import json
import pandas as pd
import sys


FILENAME = 'generated.json'

try:
    PANDAS = sys.argv[1] != 'n'  # pass 'n' to use the built-in json module
except IndexError:
    PANDAS = True

print("Loading json{}...".format(" with pandas" if PANDAS else ""))
with open(FILENAME) as fileh:
    if PANDAS:
        pd.json.load(fileh)  # this is the call that crashes
    else:
        json.load(fileh)  # set PANDAS=False and it will work

print("If you see this, it worked!")

Please note that if you run the script with PANDAS=False you will need approximately 8 GB of RAM, or it will exit with an out-of-memory error.
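As an aside, not from the thread: if the data can be rewritten as line-delimited JSON, later pandas versions can read it in bounded-memory chunks via the lines= and chunksize= arguments of read_json (check that your version supports them):

```python
import json
import tempfile

import pandas as pd

# write a small line-delimited sample; each line is one independent record
with tempfile.NamedTemporaryFile('w', suffix='.json', delete=False) as tmp:
    for i in range(100):
        tmp.write(json.dumps({'key': i, 'val': i * 2}) + '\n')
    path = tmp.name

# iterate over DataFrames of at most 25 rows instead of loading everything
reader = pd.read_json(path, lines=True, chunksize=25)
total_rows = sum(len(chunk) for chunk in reader)
```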

Output of pd.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: None
numpy: 1.8.2
scipy: 0.13.3
statsmodels: None
IPython: 3.2.0
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

Contributor

jreback commented Oct 20, 2015

pls try this with 0.17.0 and report back.

Contributor

TomAugspurger commented Oct 20, 2015

I can reproduce it on 0.17.

Yes, pandas still segfaults on 0.17.

jreback added this to the Next Major Release milestone Oct 20, 2015

jreback reopened this Oct 20, 2015

Contributor

jreback commented Oct 20, 2015

ok will reopen if anyone cares to dig into the c-code

@kawochen

Contributor

kawochen commented Oct 20, 2015

Will hopefully submit a PR tonight

Contributor

jreback commented Oct 20, 2015

xref #7641 as well

jreback referenced this issue Oct 20, 2015

Closed

kernel crash #7641

jreback modified the milestone: 0.17.1, Next Major Release Oct 21, 2015

Contributor

jreback commented Oct 21, 2015

does #11393 fix it for you? (you have to build from the PR to test)

adri0 commented Oct 23, 2015

I'm also getting a segmentation fault when using the read_csv method to load a file of around 160 MB in pandas 0.17.0.
When I downgrade to 0.16.2 it works fine.

Contributor

TomAugspurger commented Oct 23, 2015

@adri0 read_csv uses different C code than read_json, so that's probably a separate bug. Could you open a new issue for that, with an example that reproduces the segfault?

adri0 commented Oct 23, 2015

Okay, thanks. I just opened #11419

jreback closed this in #11393 Oct 23, 2015
