
ERR: Segfault with df.astype(category) on empty dataframe #18004

Closed
jcrist opened this Issue Oct 27, 2017 · 5 comments

4 participants
@jcrist
Contributor

jcrist commented Oct 27, 2017

Pandas segfaults when calling DataFrame.astype('category') on an empty dataframe. This fails in 0.21.0rc1, 0.20.3, and probably previous versions as well.

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'x': ['a', 'b', 'c'], 'y': ['a', 'b', 'c'], 'z': ['a', 'b', 'c']})

In [3]: empty = df.iloc[:0].copy()  # copy is necessary, no segfault without it

In [4]: empty.astype('category')
Segmentation fault: 11

For non-empty frames, an error is raised saying this operation isn't supported yet. Also note that the copy is needed to cause the segfault; without it, the error message is raised instead.

pd.show_versions:
In [5]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.4.final.0
python-bits: 64
OS: Darwin
OS-release: 16.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.0
pip: 9.0.1
setuptools: 36.4.0
Cython: 0.25.2
numpy: 1.13.3
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: 1.5.3
patsy: None
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: None
tables: 3.4.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: 0.1.1
pandas_gbq: None
pandas_datareader: None

@jreback jreback changed the title from Segfault with `astype(category)` on empty dataframe to ERR: Segfault with df.astype(category) on empty dataframe Oct 27, 2017

@jreback jreback added this to the 0.21.1 milestone Oct 27, 2017

@jreback

Contributor

jreback commented Oct 27, 2017

thanks @jcrist yep, this should raise

@jschendel

Member

jschendel commented Oct 27, 2017

I can confirm this on master with some slightly simpler code:

pd.DataFrame(columns=['x', 'y']).astype('category')

I get the segfault when 2+ columns are specified in the example above. If only one column is specified, I get a NotImplementedError.
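The 1-vs-2-column difference makes sense if pandas' internal block layout is the culprit: as far as I understand, a DataFrame block stores its values transposed, as (n_columns, n_rows), so an empty two-column object frame hands dtype inference a 2-D array whose len() is 2 even though it holds zero elements. A minimal sketch of that shape mismatch (the (2, 0) shape here is my assumption about the block layout, not taken from pandas source):

```python
import numpy as np

# An empty two-column object block would look roughly like this:
# shape (n_columns, n_rows) == (2, 0), dtype object.
block_like = np.empty((2, 0), dtype=object)

# len() measures the first axis (2 columns), not the element count (0):
print(len(block_like))          # 2
print(block_like.size)          # 0
print(len(block_like.ravel()))  # 0
```

So any code that checks len() before flattening sees a "non-empty" array with no elements to index.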

@jschendel

Member

jschendel commented Oct 27, 2017

Looks like this boils down to factorize in this specific case, and the hash table code in general.

In the case of this specific issue, what ultimately is happening is something like this:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.factorize(arr)
Segmentation fault (core dumped)

This isn't specific to factorize though, and seems to impact functions that rely on hash tables, e.g. unique:

In [3]: arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])

In [4]: pd.unique(arr)
Segmentation fault (core dumped)

Note that this isn't segfaulting for integer or float dtypes:

In [3]: arr = np.array([np.array([], dtype='int64'), np.array([], dtype='int64')])

In [4]: pd.factorize(arr)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-65d36072b155> in <module>()
----> 1 pd.factorize(arr)

/usr/local/lib/python3.4/dist-packages/pandas/core/algorithms.py in factorize(values, sort, order, na_sentinel, size_hint)
    558     uniques = vec_klass()
    559     check_nulls = not is_integer_dtype(original)
--> 560     labels = table.get_labels(values, uniques, 0, na_sentinel, check_nulls)
    561
    562     labels = _ensure_platform_int(labels)

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.Int64HashTable.get_labels (pandas/_libs/hashtable.c:15265)()

ValueError: Buffer has wrong number of dimensions (expected 1, got 2)

Following the error above, it looks like there's a template approach for int/float dtype hashtables but StringHashTable and PyObjectHashTable have their own custom code, which I'm guessing will need to be patched to raise a ValueError similar to the one above? Probably a bit beyond my current knowledge level.
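The templated int/float tables presumably get their 1-D check for free from typed-buffer arguments, which is why Int64HashTable raises the ValueError above instead of crashing. A hedged sketch of what an equivalent explicit guard on the object path might look like (`get_labels_guarded` is a hypothetical name, not pandas API):

```python
import numpy as np

def get_labels_guarded(values):
    # Hypothetical guard mirroring what Int64HashTable.get_labels gets
    # for free from its 1-D typed-buffer signature: reject anything
    # that is not one-dimensional before touching the raw buffer.
    if values.ndim != 1:
        raise ValueError("Buffer has wrong number of dimensions "
                         "(expected 1, got %d)" % values.ndim)
    # ... a real implementation would hash each element here ...
    return np.zeros(len(values), dtype=np.intp)

arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])
try:
    get_labels_guarded(arr)
except ValueError as e:
    print(e)  # Buffer has wrong number of dimensions (expected 1, got 2)
```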

@cgohlke

Contributor

cgohlke commented Oct 27, 2017

The crash is at inference.pyx#L361, where values is an empty array that should not be indexed.

Could be due to:

>>> len(np.array([[],[]]))
2
>>> len(np.array([[],[]]).ravel())
0

Possible fix: move n = len(values) after values = values.ravel():

diff --git a/pandas/_libs/src/inference.pyx b/pandas/_libs/src/inference.pyx
index b0a64e1cc..c340e870e 100644
--- a/pandas/_libs/src/inference.pyx
+++ b/pandas/_libs/src/inference.pyx
@@ -349,13 +349,13 @@ def infer_dtype(object value, bint skipna=False):
     if values.dtype != np.object_:
         values = values.astype('O')

+    # make contiguous
+    values = values.ravel()
+
     n = len(values)
     if n == 0:
         return 'empty'

-    # make contiguous
-    values = values.ravel()
-
     # try to use a valid value
     for i in range(n):
         val = util.get_value_1d(values, i)

Raises ValueError: Buffer has wrong number of dimensions (expected 1, got 2)
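The ordering bug in the diff can be sketched in pure Python (function names are mine, not from inference.pyx):

```python
import numpy as np

def looks_empty_prepatch(values):
    # Pre-patch order: length is taken before the array is flattened,
    # so a (2, 0) array reports n == 2 and the code falls through to
    # index a zero-element buffer -- the segfault.
    n = len(values)
    values = values.ravel()
    return n == 0

def looks_empty_patched(values):
    # Patched order: ravel first, then measure, so emptiness is decided
    # on the flattened element count.
    values = values.ravel()
    return len(values) == 0

arr = np.array([np.array([], dtype=object), np.array([], dtype=object)])
print(looks_empty_prepatch(arr))  # False -- would keep going and crash
print(looks_empty_patched(arr))   # True  -- returns 'empty' safely
```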

@jschendel

Member

jschendel commented Oct 28, 2017

Thanks @cgohlke! Fix looks good and passes all tests locally for me.
