Segmentation fault when trying to load large json file #11344

Closed
eddie-dunn opened this Issue Oct 16, 2015 · 22 comments

eddie-dunn commented Oct 16, 2015

I have a 1.1 GB JSON file that I try to load with pandas.json.load:

import pandas as pd
with open('/tmp/file.json') as fileh:
    text = pd.json.load(fileh)

It breaks with the following output:

Error in `python3': double free or corruption (out): 0x00007ffe082171f0 ***

I can load the file with Python's built-in json module. What is going wrong here?

jreback (Contributor) commented Oct 16, 2015

You need to use read_json (or the built-in json module); pd.json is not the same thing.
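For reference, a minimal sketch of the suggested call, using the path from the original report (read_json returns a DataFrame by default):

import pandas as pd

# read_json is the public parser; it accepts a path or an open file
# handle and returns a DataFrame (or a Series with typ='series').
df = pd.read_json('/tmp/file.json')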

eddie-dunn commented Oct 16, 2015

I get the same segfault when using read_json.

jreback (Contributor) commented Oct 16, 2015

Well, not sure how you generated it; it's possible it's not valid JSON / not parseable by read_json.

eddie-dunn commented Oct 16, 2015

I've generated the json file with both Python's built-in json.dump and pandas.json.dump; same result.

json.load can read the file just fine, so the json should be valid.

TomAugspurger (Contributor) commented Oct 16, 2015

Can you try to narrow it down to a smaller example that generates the segfault, and post that file?

eddie-dunn commented Oct 19, 2015

  1. This issue should be reopened
  2. I extracted a subset of the json file (~200 MB) and read_json worked fine

As I said originally, I think the issue is with the size of the json file.

TomAugspurger (Contributor) commented Oct 19, 2015

Can you

  1. Join 5 or 6 of those 200 MB sections into a larger one and try to read that
  2. Work through the next 5 or 6 sections of the file in 200 MB chunks

and see if either of those fails?
It shouldn't be segfaulting, but we'll need a reproducible example before we can start to diagnose where the bug is.
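One rough way to do that splitting, sketched here under the assumption that the top-level JSON value is a flat dict (as in the repro script later in the thread); the paths and chunk count are illustrative:

import json

N_CHUNKS = 6  # illustrative; pick enough pieces to bracket the failing size

# Load once with the stdlib parser (which handles the full file), split the
# top-level dict into roughly equal pieces, and dump each piece to disk so
# progressively larger concatenations can be fed to pd.read_json.
with open('/tmp/file.json') as fileh:
    data = json.load(fileh)

items = list(data.items())
chunk_size = -(-len(items) // N_CHUNKS)  # ceiling division

for i in range(N_CHUNKS):
    chunk = dict(items[i * chunk_size:(i + 1) * chunk_size])
    with open('/tmp/chunk_{}.json'.format(i), 'w') as out:
        json.dump(chunk, out)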

eddie-dunn commented Oct 19, 2015

I realized that the subset only took 200 MB of RAM when loaded in the REPL. Once dumped to a json file on disk it was much smaller. 4 joined subsets did not present a problem for read_json, as the resulting json file was only 321 MB.

I incremented the number of subsets; 7 subsets at 561 MB and 11 at 881 MB worked fine. 14 joined subsets at 1.1 GB crashed ipython:

In [44]: with open('/tmp/pandas14x.json') as fileh:    pand = pd.read_json(fileh)
   ....:     
*** Error in `/usr/bin/python3': double free or corruption (out): 0x00007ffedeb05b40 ***
[1]    4914 abort (core dumped)  ipython3

Might be worth your while to generate your own large json file to debug why read_json crashes Python if the json file is too big.

jreback (Contributor) commented Oct 19, 2015

@eddie-dunn it would be really helpful for you to show a copy-pastable example that reproduces the problem. These cases are almost always a function of the dtypes of the structure you are saving.

This would make debugging much easier, as this is the first report I have ever seen of this type of bug.

Further, please show pd.show_versions().
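For reference, the requested diagnostics are printed with a single call:

import pandas as pd

# Prints the pandas version plus the versions of its optional dependencies.
pd.show_versions()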

eddie-dunn commented Oct 20, 2015

Generate example code:

#!/usr/bin/env python3
import json
import sys

SIZE = int(5e7)

FILENAME = 'generated.json'

print("Generating json with {} elements...".format(SIZE))
mdict = {key: val for key, val in zip(range(SIZE), range(SIZE))}

print("Dumping json to {}...".format(FILENAME))
with open(FILENAME, 'w') as fileh:
    json.dump(mdict, fileh)

Test run example code:

#!/usr/bin/env python3
import json
import pandas as pd
import sys


FILENAME = 'generated.json'

try:
    PANDAS = False if sys.argv[1] == 'n' else True
except IndexError:
    PANDAS = True

print("Loading json{}...".format(" with pandas" if PANDAS else ""))
with open(FILENAME) as fileh:
    if PANDAS:
        pd.json.load(fileh)
    else:
        json.load(fileh)  # set PANDAS=False and it will work

print("You see this, it worked!")

Please note that if you try to run the script with PANDAS=False you will need approximately 8 GB RAM or it will exit with an out of memory exception.

pandas.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-63-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: None
numpy: 1.8.2
scipy: 0.13.3
statsmodels: None
IPython: 3.2.0
sphinx: None
patsy: None
dateutil: 2.4.0
pytz: 2014.10
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.3.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999
httplib2: 0.8
apiclient: None
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: None

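As an aside, a stopgap that avoids the crashing C parser (a sketch, not part of the scripts above) is to parse the file with the stdlib json module and hand the result to pandas:

import json
import pandas as pd

# The pure-Python path: json.load handles the large file, and the parsed
# dict is then converted to a pandas object without going through pd.json.
with open('generated.json') as fileh:
    data = json.load(fileh)

series = pd.Series(data)  # or pd.DataFrame.from_dict(data, orient='index')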
jreback (Contributor) commented Oct 20, 2015

Please try this with 0.17.0 and report back.

TomAugspurger (Contributor) commented Oct 20, 2015

I can reproduce it on 0.17.

eddie-dunn commented Oct 20, 2015

Yes, pandas still segfaults on 0.17.

@jreback jreback added this to the Next Major Release milestone Oct 20, 2015

@jreback jreback reopened this Oct 20, 2015

jreback (Contributor) commented Oct 20, 2015

OK, will reopen if anyone cares to dig into the C code.

@kawochen

kawochen (Contributor) commented Oct 20, 2015

Will hopefully submit a PR tonight

jreback (Contributor) commented Oct 20, 2015

xref #7641 as well

@jreback jreback referenced this issue Oct 20, 2015: kernel crash #7641 (Closed)

@jreback jreback modified the milestones: 0.17.1, Next Major Release Oct 21, 2015

jreback (Contributor) commented Oct 21, 2015

Does #11393 fix this for you? (You have to build from the PR to test.)

adri0 commented Oct 23, 2015

I'm also getting a segmentation fault when using read_csv to load a file of around 160 MB in pandas 0.17.0.
When I downgrade to 0.16.2 it works fine.

TomAugspurger (Contributor) commented Oct 23, 2015

@adri0 read_csv uses different C code than read_json, so that's probably a separate bug. Could you make a new issue for that, with an example that reproduces the segfault?

adri0 commented Oct 23, 2015

Okay, thanks. I just opened #11419

bsolomon1124 commented Sep 11, 2018

I'm seeing this traceback from the same cause on pandas 0.23.4 + Python 3.7.0.
