df.to_json segfaults with categorical column types #10778

Closed
blink1073 opened this Issue Aug 9, 2015 · 4 comments


blink1073 commented Aug 9, 2015

This code from the Categorical tutorial:

import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype('category')

Will segfault when df.to_json is called. Tested on:

Python 3.4.3 |Anaconda 2.3.0 (x86_64)| (default, Mar  6 2015, 12:07:41)
[GCC 4.2.1 (Apple Inc. build 5577)] on darwin
$ conda list  | grep pandas
pandas                    0.16.2               np19py34_0
blink1073 commented Aug 9, 2015

If I convert using df['B'] = df['B'].astype(str), then df.to_json works fine.
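The workaround above can be applied to every categorical column at once. The following is only a minimal sketch: `stringify_categoricals` is a hypothetical helper name, not a pandas function, and it assumes that losing the categorical dtype in the JSON output is acceptable.

```python
import json

import pandas as pd


def stringify_categoricals(frame):
    """Return a copy of frame with every categorical column cast to str.

    Hypothetical helper for illustration; not part of pandas.
    """
    out = frame.copy()
    for col in out.select_dtypes(include="category").columns:
        out[col] = out[col].astype(str)
    return out


# Same frame as in the report: "B" is a categorical copy of "A".
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["B"] = df["A"].astype("category")

# Serialize the converted copy; the original frame keeps its dtype.
safe = stringify_categoricals(df)
records = json.loads(safe.to_json())
print(records["B"])
```

Working on a copy leaves the original categorical column intact for further analysis. Note that `astype(str)` will most likely turn missing values into the literal string 'nan', so frames with missing categories may need extra handling.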

Contributor

jreback commented Aug 10, 2015

xref #10321

This has not been implemented. Pull requests are welcome.

bbirand commented Dec 18, 2015

👍

Timusan commented Jan 30, 2016

My apologies if this is simply extra noise, but I think I came across the same issue in (maybe) a slightly different situation. Thought this might help to give more insight into this issue.

For me, this happens when converting a DataFrame with binned data to JSON.

The kernel reports back the following:

Jan 30 18:18:04 debian kernel: [ 1632.371968] python[24794]: segfault at ffffffffffffffff ip 00007f58f570655d sp 00007ffe13a8c410 error 7 in multiarray.so[7f58f55bf000+1d9000]

Version info:

  • Linux Debian Jessie x86_64
  • Python 2.7.10 (GCC 4.9.2)
  • NumPy 1.10.4
  • Pandas 0.17.1

Greatly simplified piece of code which reproduces the error:

import pandas

def calculate_age_dist(df):
    # Set up the bin edges. The final edge of 200
    # mimics "60+" behavior.
    age_ranges = [0,2,10,20,30,40,50,60,200]
    age_ranges_labels = ['0-1', '2-9', '10-19', '20-29',
                         '30-39', '40-49', '50-59', '60+']

    # Divide the data into the bins. Using "right=False" to
    # exclude the upper edge of each bin (so 30 falls only in
    # bin 30-39 and not in bin 20-29).
    df['age_cat'] = pandas.cut(x = df['age'],
                               bins = age_ranges,
                               right = False,
                               labels = age_ranges_labels)

    # Keep at most the first ten rows per age group
    # and reset the index of the resulting DataFrame.
    age_dist = df.groupby('age_cat').head(10).reset_index(drop=True)

    return age_dist

# The data.
df = pandas.DataFrame(data = {'age': [55,20,30,40,50,60,65,23],
                              'sport': ['soccer', 'baseball', 'soccer', 'football',
                                        'swimming', 'tennis', 'swimming', 'baseball']},
                      columns = ['age', 'sport'])

# The call which causes the segfault.
print(calculate_age_dist(df).to_json())

Converting the "age_cat" column with df['age_cat'] = df['age_cat'].astype(str) does indeed circumvent the segfault.
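The link to the original report is that `pandas.cut` with `labels` returns a categorical result, so the binned column hits the same unimplemented path. A small sketch, reusing the ages, bin edges, and labels from the example above (the variable names `ages`, `binned`, and `as_text` are only illustrative):

```python
import pandas as pd

ages = pd.Series([55, 20, 30, 40, 50, 60, 65, 23])

# Same bin edges and labels as in the reproduction above.
binned = pd.cut(
    ages,
    bins=[0, 2, 10, 20, 30, 40, 50, 60, 200],
    right=False,
    labels=['0-1', '2-9', '10-19', '20-29',
            '30-39', '40-49', '50-59', '60+'],
)

# The result is categorical, which is what to_json could not handle
# on the affected versions reported in this thread.
print(binned.dtype.name)

# Casting to str yields ordinary string labels that serialize normally.
as_text = binned.astype(str)
print(as_text.tolist())
```

Checking `dtype.name` before calling `to_json` is one way to tell whether a column needs the `astype(str)` conversion on affected versions.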

@Komnomnomnom Komnomnomnom referenced this issue Apr 5, 2016

Closed

BUG: fix json segfaults #12802

6 of 6 tasks complete

@jreback jreback modified the milestones: 0.18.1, Next Major Release Apr 5, 2016

@jreback jreback closed this in 37a7e69 Apr 26, 2016

nps added a commit to nps/pandas that referenced this issue May 17, 2016

BUG: fix json segfaults
closes pandas-dev#11473
closes pandas-dev#10778
closes pandas-dev#11299

Author: Kieran O'Mahony <kieranom@gmail.com>

Closes pandas-dev#12802 from Komnomnomnom/json-seg-faults and squashes the following commits:

b14d0df [Kieran O'Mahony] CLN: rename json test inline with others
af006a4 [Kieran O'Mahony] BUG: fix json segfaults

brainstorm added a commit to brainstorm/altair that referenced this issue Jun 12, 2018
