Dir fails on dataframes with pathological column names #25509

mrocklin · 2019-03-01T17:42:53Z

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'\ud83d': []})
_ = dir(df)

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-4e949fe17c82> in <module>
----> 1 _ = dir(df)

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/accessor.py in __dir__(self)
     37         """
     38         rv = set(dir(type(self)))
---> 39         rv = (rv - self._dir_deletions()) | self._dir_additions()
     40         return sorted(rv)
     41

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/generic.py in _dir_additions(self)
   5110         If info_axis is a MultiIndex, it's first level values are used.
   5111         """
-> 5112         additions = {c for c in self._info_axis.unique(level=0)[:100]
   5113                      if isinstance(c, string_types) and isidentifier(c)}
   5114         return super(NDFrame, self)._dir_additions().union(additions)

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/indexes/base.py in unique(self, level)
   1999         if level is not None:
   2000             self._validate_index_level(level)
-> 2001         result = super(Index, self).unique()
   2002         return self._shallow_copy(result)
   2003

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/base.py in unique(self)
   1312         else:
   1313             from pandas.core.algorithms import unique1d
-> 1314             result = unique1d(values)
   1315
   1316         return result

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/algorithms.py in unique(values)
    360
    361     table = htable(len(values))
--> 362     uniques = table.unique(values)
    363     uniques = _reconstruct_data(uniques, dtype, original)
    364     return uniques

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.StringHashTable.unique()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.StringHashTable._unique()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Problem description

`dir(df)` raises `UnicodeEncodeError` on DataFrames with pathological column names, such as a lone surrogate character.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.3
IPython: 7.2.0
sphinx: 1.8.4
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None


jbrockmendel · 2019-07-08T16:42:32Z

On OSX this segfaults

WillAyd · 2019-07-08T17:29:21Z

What is the expectation here? Is this the first half of a surrogate pair?

jbrockmendel · 2019-12-17T02:28:30Z

Tracking this down, it looks like we get to `tslibs.util.get_c_string_buf_and_size`, which calls `PyUnicode_AsUTF8AndSize`, and the failure happens there.

WillAyd · 2019-12-17T02:39:29Z

With regard to the OP, I don't think this is a bug in pandas. Python itself raises an exception when the string is passed to `print`, though strangely enough the object seems fine when it is simply held in a string:

>>> print('\ud83d')
>>> type('\ud83d')
<class 'str'>
>>> alist = ['\ud83d']
>>> alist[0] # surprised this works
'\ud83d'
>>> print(alist[0]) # this failure matches pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

With respect to Cython, I see the following warnings before the segfault, so there may be something of interest there:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
[1]    32913 segmentation fault  python
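
For reference, the interpreter behavior described above can be reproduced without pandas or `print`. This is a minimal sketch, assuming CPython 3; the variable names are illustrative only. A `str` can hold a lone surrogate code point, but the strict UTF-8 codec refuses to encode it, which is the same error `print` hits when stdout is UTF-8.

```python
# A lone high surrogate: a valid element of a Python str, but not valid UTF-8.
name = '\ud83d'

# Holding and inspecting the string works fine.
length = len(name)
codepoint = hex(ord(name))

# Strict UTF-8 encoding is where the failure occurs.
try:
    name.encode('utf-8')
    encode_failed = False
    reason = None
except UnicodeEncodeError as exc:
    encode_failed = True
    reason = exc.reason

# The 'surrogatepass' error handler encodes the surrogate anyway,
# producing bytes that are not well-formed UTF-8.
raw = name.encode('utf-8', errors='surrogatepass')
```

Here `encode_failed` ends up `True` with the reason `'surrogates not allowed'`, matching the message in the traceback, while `raw` holds the three bytes `b'\xed\xa0\xbd'`.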

jbrockmendel · 2019-12-18T17:59:35Z

After some googling: is there a specific range of Unicode characters that are surrogates, which we might be able to screen for? Are there non-surrogate "pathological" cases we need to worry about?
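
On the first question: Unicode reserves the code points U+D800 through U+DFFF for surrogates, so a screen is straightforward to write. The following is a minimal sketch in pure Python; `has_surrogate` is a hypothetical helper name, not an existing pandas function.

```python
def has_surrogate(s: str) -> bool:
    """Return True if s contains any code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in s)


# A lone high surrogate is flagged.
lone = has_surrogate('\ud83d')

# An ordinary column name is not.
plain = has_surrogate('col_a')

# A real astral character is a single code point in a Python 3 str,
# so it is not flagged.
emoji = has_surrogate('\U0001F600')

# Two surrogate escapes written side by side stay as two separate
# surrogate code points in a str literal, so this is flagged too.
pair = has_surrogate('\ud83d\ude00')
```

As far as the strict UTF-8 codec is concerned, surrogates are the only code points a Python `str` can hold that it rejects.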

jbrockmendel · 2019-12-20T04:07:23Z

In pd._libs.tslibs.util.get_c_string_buf_and_size if we change:

    if PyUnicode_Check(py_string):
        buf = PyUnicode_AsUTF8AndSize(py_string, length)

to

    if PyUnicode_Check(py_string):
        if not py_string.isprintable():
            py_string = repr(py_string)
        buf = PyUnicode_AsUTF8AndSize(py_string, length)

then a) the dir(df) call doesn't raise/segfault, b) dir(df) does not contain the column name like we would expect, c) df.columns.unique() doesn't raise/segfault, d) df.columns.unique() does contain the 1 entry we expect.

Not sure what to do with this information, but it's out there.
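
The effect of that guard can be seen in pure Python. This is a sketch only: `printable_or_repr` is a hypothetical stand-in for the two added lines, not the Cython function itself. `str.isprintable()` is false for a surrogate, and `repr()` escapes it into plain ASCII, which UTF-8 can encode.

```python
def printable_or_repr(py_string: str) -> str:
    """Hypothetical pure-Python mirror of the guard sketched above."""
    if not py_string.isprintable():
        py_string = repr(py_string)
    return py_string


# The surrogate is replaced by its escaped, quoted repr.
safe = printable_or_repr('\ud83d')
encoded = safe.encode('utf-8')  # no longer raises

# Printable strings pass through unchanged.
unchanged = printable_or_repr('abc')

# Side effect: any non-printable string is rewritten, not just surrogates.
newline = printable_or_repr('a\nb')
```

One thing to keep in mind is that the guard changes the value being encoded: strings containing newlines, tabs, or other non-printable characters would also be replaced by their `repr`, quotes included.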

roberthdevries · 2020-03-11T11:47:31Z

I have traced this down to get_c_string (from util.pxd) in hashtable_class_helper.pxi.in.
The unicode string containing the surrogate character is passed to get_c_string, which passes it to get_c_string_buf_and_size, which in turn passes it to PyUnicode_AsUTF8AndSize.
That last function returns a NULL pointer when it receives a surrogate character. The NULL pointer is then passed down the line to the point where it is dereferenced in khash.h (probably to calculate a hash value).
Hence the segfault.

This is where the NULL pointer gets assigned to an array value in hashtable_class_helper.pxi.in:791:

                # if ignore_na is False, we also stringify NaN/None/etc.
                v = get_c_string(<str>val)
                vecs[i] = v
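
In Python terms, the failed conversion surfaces as a `UnicodeEncodeError` rather than a NULL pointer, so the missing check corresponds to a missing `try`/`except`. The following is a sketch of one possible guard, assuming a fallback to `repr()` is acceptable; `encode_for_hashing` is a hypothetical name, and this is not the Cython code from the hashtable helper.

```python
def encode_for_hashing(val: str) -> bytes:
    """Encode val as UTF-8, falling back to its repr if encoding fails.

    Hypothetical helper: at the C level the equivalent step is checking
    the conversion result before storing or dereferencing it.
    """
    try:
        return val.encode('utf-8')
    except UnicodeEncodeError:
        # repr() escapes the surrogate into plain ASCII, e.g. '\ud83d'
        return repr(val).encode('utf-8')


ok = encode_for_hashing('abc')
fallback = encode_for_hashing('\ud83d')
```

Unlike an `isprintable()` check, this only alters strings that cannot be encoded, so ordinary non-printable values such as `'a\nb'` keep their original bytes.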

mrocklin added a commit to mrocklin/dask that referenced this issue Mar 1, 2019

Avoid calling dir on dataframes

d0627cf

See pandas-dev/pandas#25509

jreback changed the title ~~Dir fails on dataframes with pathalogical column names~~ Dir fails on dataframes with pathological column names Mar 3, 2019

mroeschke added the Bug and Output-Formatting (__repr__ of pandas objects, to_string) labels May 27, 2019

jbrockmendel mentioned this issue Dec 21, 2019

BUG: passing non-printable unicode to datetime parsing functions #30374

Closed

roberthdevries mentioned this issue Mar 14, 2020

BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name #32701

Merged

5 tasks

jreback added this to the 1.1 milestone Mar 14, 2020

WillAyd closed this as completed in #32701 Mar 19, 2020

roberthdevries mentioned this issue Apr 28, 2020

TST: Use try/except block to properly catch and handle the exception #33235

Merged

3 tasks
