
Segmentation fault when trying to load large json file #11344

Closed
eddie-dunn opened this issue Oct 16, 2015 · 22 comments · Fixed by #11393
Labels: Compat (pandas objects compatibility with NumPy or Python functions), IO JSON (read_json, to_json, json_normalize)
Milestone: 0.17.1

Comments

@eddie-dunn

I have a 1.1 GB JSON file that I try to load with pandas.json.load:

import pandas as pd
with open('/tmp/file.json') as fileh:
    text = pd.json.load(fileh)

It breaks with the following output:

*** Error in `python3': double free or corruption (out): 0x00007ffe082171f0 ***

I can load the file with Python's built-in json module. What is going wrong here?

@jreback (Contributor) commented Oct 16, 2015

You need to use read_json (or the built-in json module); pd.json is not the same thing.
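
For illustration, a minimal sketch of the intended call (the file path is taken from the report above):

import pandas as pd

# read_json is the supported high-level entry point and returns a DataFrame;
# pd.json is pandas' bundled ujson module, a low-level parser that is not
# part of the public API.
df = pd.read_json('/tmp/file.json')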

@jreback jreback closed this as completed Oct 16, 2015
@jreback jreback added the IO JSON and Compat labels Oct 16, 2015
@eddie-dunn (Author)

I get the same segfault when using read_json.

@jreback (Contributor) commented Oct 16, 2015

Well, not sure how you generated it; it's possible it's not valid JSON / not parseable by read_json.

@eddie-dunn (Author)

I've generated the json file with both Python's built-in json.dump and pandas.json.dump; same result.

json.load can read the file just fine, so the json should be valid.

@TomAugspurger (Contributor)

Can you try to narrow it down to a smaller example that generates the segfault, and post that file?

@eddie-dunn (Author)

  1. This issue should be reopened
  2. I extracted a subset of the JSON file (~200 MB) and read_json worked fine

As I said originally, I think the issue is with the size of the JSON file.

@TomAugspurger (Contributor)

Can you

  1. Join 5 or 6 of those 200 MB sections into a larger one and try to read that
  2. Work through the next 5 or 6 sections of the file in 200 MB chunks

and see if either of those fails? (A sketch of the joining step is below.)
It shouldn't be segfaulting, but we'll need a reproducible example before we can start to diagnose where the bug is.
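
A hypothetical sketch of the joining step, assuming the subsets were dumped as top-level JSON objects into files named subset_0.json, subset_1.json, and so on (the file names and count are assumptions, not from this thread):

import json
import pandas as pd

# Merge several subset files back into one larger JSON object on disk.
merged = {}
for i in range(6):  # hypothetical: six ~200 MB subsets
    with open('subset_{}.json'.format(i)) as fh:
        merged.update(json.load(fh))

with open('joined.json', 'w') as fh:
    json.dump(merged, fh)

# Does the larger joined file still crash?
df = pd.read_json('joined.json')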

@eddie-dunn (Author)

I realized that the subset only took 200 MB of RAM when loaded in the REPL; once dumped to a JSON file on disk it was much smaller. Four joined subsets did not present a problem for read_json, as the resulting JSON file was only 321 MB.

I kept incrementing the number of subsets: 7 subsets at 561 MB and 11 at 881 MB worked fine, but 14 joined subsets at 1.1 GB crashed IPython:

In [44]: with open('/tmp/pandas14x.json') as fileh:    pand = pd.read_json(fileh)
   ....:     
*** Error in `/usr/bin/python3': double free or corruption (out): 0x00007ffedeb05b40 ***
[1]    4914 abort (core dumped)  ipython3

It might be worth your while to generate your own large JSON file to debug why read_json crashes Python when the file is too big.

@jreback (Contributor) commented Oct 19, 2015

@eddie-dunn it would be really helpful for you to show a copy-pastable example that reproduces the problem. These cases are almost always a function of the dtypes of the structure you are saving.

This would make debugging much easier, as this is the first report I have ever seen for this type of bug.

Further, please show the output of pd.show_versions().

@eddie-dunn (Author)

Code to generate the example file:

#!/usr/bin/env python3
import json

SIZE = int(5e7)

FILENAME = 'generated.json'

print("Generating json with {} elements...".format(SIZE))
# A dict mapping 50 million integer keys to themselves.
mdict = {key: val for key, val in zip(range(SIZE), range(SIZE))}

print("Dumping json to {}...".format(FILENAME))
with open(FILENAME, 'w') as fileh:
    json.dump(mdict, fileh)

Code to test loading it:

#!/usr/bin/env python3
import json
import sys

import pandas as pd

FILENAME = 'generated.json'

# Pass 'n' as the first command-line argument to use the stdlib json
# module instead of pandas.
try:
    PANDAS = sys.argv[1] != 'n'
except IndexError:
    PANDAS = True

print("Loading json{}...".format(" with pandas" if PANDAS else ""))
with open(FILENAME) as fileh:
    if PANDAS:
        pd.json.load(fileh)
    else:
        json.load(fileh)  # set PANDAS=False and it will work

print("If you see this, it worked!")

Please note that if you run the script with PANDAS=False, you will need approximately 8 GB of RAM, or it will exit with an out-of-memory error.
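
For reference, assuming the two scripts above are saved as generate.py and load.py (both file names are assumptions), a run might look like:

python3 generate.py      # writes generated.json
python3 load.py          # parses with pd.json.load -- crashes
python3 load.py n        # parses with stdlib json -- works, needs ~8 GB RAM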

pandas.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: None
numpy: 1.8.2
scipy: 0.13.3
statsmodels: None
IPython: 3.2.0
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

@jreback (Contributor) commented Oct 20, 2015

Please try this with 0.17.0 and report back.

@TomAugspurger (Contributor)

I can reproduce it on 0.17.

@eddie-dunn (Author)

Yes, pandas still segfaults on 0.17.

@jreback jreback added this to the Next Major Release milestone Oct 20, 2015
@jreback jreback reopened this Oct 20, 2015
@jreback (Contributor) commented Oct 20, 2015

OK, will reopen if anyone cares to dig into the C code. @kawochen

@kawochen (Contributor)

Will hopefully submit a PR tonight

@jreback (Contributor) commented Oct 20, 2015

xref #7641 as well

@jreback jreback mentioned this issue Oct 20, 2015
@jreback (Contributor) commented Oct 20, 2015

cc @Komnomnomnom

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 21, 2015
@jreback (Contributor) commented Oct 21, 2015

Does #11393 fix this for you? (You have to build from the PR to test.)
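
For anyone who wants to test: a hypothetical sequence for building from the PR, assuming a source checkout of pandas with origin pointing at pandas/pandas and a C compiler available (the local branch name pr-11393 is arbitrary):

git fetch origin pull/11393/head:pr-11393
git checkout pr-11393
python setup.py build_ext --inplace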

@adri0 commented Oct 23, 2015

I'm also getting a segmentation fault when using the read_csv method to load a file of around 160 MB in pandas 0.17.0. When I downgrade to 0.16.2, it works fine.

@TomAugspurger (Contributor)

@adri0 read_csv uses different C code than read_json, so that's probably a separate bug. Could you make a new issue for that, with an example that reproduces the segfault?

@adri0 commented Oct 23, 2015

Okay, thanks. I just opened #11419

@bsolomon1124

I'm seeing the same failure on pandas 0.23.4 + Python 3.7.0.
