
df.to_json segfaults with categorical column types #10778

Closed
blink1073 opened this issue Aug 9, 2015 · 5 comments
Labels: Categorical (Categorical Data Type), IO JSON (read_json, to_json, json_normalize)
Milestone: Next Major Release

Comments

@blink1073

This code from the Categorical tutorial:

df = pd.DataFrame({"A":["a","b","c","a"]})
df["B"] = df["A"].astype('category')

It segfaults when df.to_json is called. Tested on:

Python 3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
$ conda list  | grep pandas
pandas                    0.16.2               np19py34_0
@blink1073 (Author)

If I convert using df['B'] = df['B'].astype(str), then df.to_json works fine.
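The workaround can be sketched as follows (on pandas versions where the bug is fixed, to_json works either way; on affected 0.16.x versions, the str cast avoids the crashing code path):

```python
import pandas as pd

# Reproduce the DataFrame from the report.
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")

# Cast the categorical column back to plain strings before calling
# to_json, sidestepping the categorical serialization path entirely.
df["B"] = df["B"].astype(str)
result = df.to_json()
```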

@jreback (Contributor)

jreback commented Aug 10, 2015

xref #10321

This has not been implemented yet. Pull requests are welcome.

@jreback jreback added IO JSON read_json, to_json, json_normalize Categorical Categorical Data Type labels Aug 10, 2015
@jreback jreback added this to the Next Major Release milestone Aug 10, 2015
@bbirand

bbirand commented Dec 18, 2015

👍

@Timusan

Timusan commented Jan 30, 2016

My apologies if this is simply extra noise, but I think I came across the same issue in a slightly different situation; perhaps this gives more insight into it.

For me this happens when trying to convert a DataFrame with binned data to JSON.

The kernel reports back the following:

Jan 30 18:18:04 debian kernel: [ 1632.371968] python[24794]: segfault at ffffffffffffffff ip 00007f58f570655d sp 00007ffe13a8c410 error 7 in multiarray.so[7f58f55bf000+1d9000]

Version info:

  • Linux Debian Jessie x86_64
  • Python 2.7.10 (GCC 4.9.2)
  • NumPy 1.10.4
  • Pandas 0.17.1

Greatly simplified piece of code which reproduces the error:

import pandas

def calculate_age_dist(df):
    # Setup the bins to divide. Setting a final bin
    # of 200 to mimic 60+ behavior.
    age_ranges = [0,2,10,20,30,40,50,60,200]
    age_ranges_labels = ['0-1', '2-9', '10-19', '20-29',
                         '30-39', '40-49', '50-59', '60+']

    # Divide the data into the bins. Using "right=False" to
    # exclude the upper range (so 30 is only included in bin 30-40
    # and not in bin 20-30).
    df['age_cat'] = pandas.cut(x = df['age'],
                               bins = age_ranges,
                               right = False,
                               labels = age_ranges_labels)

    # Leave only the top ten sports per age group
    # and convert the whole back to a DataFrame.
    age_dist = df.groupby('age_cat').head(10).reset_index(drop=True)

    return age_dist

# The data.
df = pandas.DataFrame(data = {'age': [55,20,30,40,50,60,65,23],
                              'sport': ['soccer', 'baseball', 'soccer', 'football',
                                        'swimming', 'tennis', 'swimming', 'baseball']},
                      columns = ['age', 'sport'])

# The call which causes the segfault.
print(calculate_age_dist(df).to_json())

Converting the "age_cat" column with df['age_cat'] = df['age_cat'].astype(str) does indeed circumvent the segfault.
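A more general version of that workaround (decategorize is a hypothetical helper name, not a pandas API) casts every categorical column to str before serializing:

```python
import pandas as pd

def decategorize(frame):
    # Return a copy with every categorical column cast to plain strings,
    # so to_json never hits the categorical serialization path.
    out = frame.copy()
    for col in out.select_dtypes(include=["category"]).columns:
        out[col] = out[col].astype(str)
    return out

# Example with binned data, as in the comment above.
ages = pd.DataFrame({"age": [1, 15, 35]})
ages["age_cat"] = pd.cut(ages["age"], bins=[0, 10, 20, 200],
                         labels=["0-9", "10-19", "20+"], right=False)
json_str = decategorize(ages).to_json()
```

The original frame is left untouched; only the copy handed to to_json loses its categorical dtype.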

@annacnev

Apologies for the noise. I came across this issue two weeks ago while converting a JSON file to a pandas DataFrame using pd.DataFrame.from_records. I had pulled JSON from multiple webpages into one large file, which I was parsing and splitting up in a loop with that function. One of the webpages had failed to load and went into the larger file as an empty list, unbeknownst to me. When that empty list was read into pandas, it caused the segfault. It took me forever to figure that out; eventually adding a try/except to my code solved the issue.

Not sure if this is the appropriate place to post this, but it was the first forum that came up when I googled "seg fault json pandas", and I wanted to put it out there in case someone runs into the same problem in the future and can save some time.

On the development side, just a recommendation: it would be nice if from_records raised a clear error when it encounters an empty list, because that segfault had me messing with my Python framework thinking something was very wrong.
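Until something along those lines lands, a small wrapper (safe_from_records is a hypothetical helper, not a pandas API) can turn the silent crash into a clear error:

```python
import pandas as pd

def safe_from_records(records):
    # Fail loudly on an empty payload instead of letting it reach
    # pandas internals, where on affected versions it could segfault.
    if len(records) == 0:
        raise ValueError("from_records got an empty record list")
    return pd.DataFrame.from_records(records)

df = safe_from_records([{"a": 1, "b": 2}, {"a": 3, "b": 4}])
```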


5 participants