Dir fails on dataframes with pathological column names #25509

Closed
mrocklin opened this issue Mar 1, 2019 · 7 comments · Fixed by #32701
Labels
Bug Output-Formatting __repr__ of pandas objects, to_string

@mrocklin
Contributor

mrocklin commented Mar 1, 2019

Code Sample, a copy-pastable example if possible

import pandas as pd
df = pd.DataFrame({'\ud83d': []})
_ = dir(df)
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-3-4e949fe17c82> in <module>
----> 1 _ = dir(df)

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/accessor.py in __dir__(self)
     37         """
     38         rv = set(dir(type(self)))
---> 39         rv = (rv - self._dir_deletions()) | self._dir_additions()
     40         return sorted(rv)
     41

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/generic.py in _dir_additions(self)
   5110         If info_axis is a MultiIndex, it's first level values are used.
   5111         """
-> 5112         additions = {c for c in self._info_axis.unique(level=0)[:100]
   5113                      if isinstance(c, string_types) and isidentifier(c)}
   5114         return super(NDFrame, self)._dir_additions().union(additions)

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/indexes/base.py in unique(self, level)
   1999         if level is not None:
   2000             self._validate_index_level(level)
-> 2001         result = super(Index, self).unique()
   2002         return self._shallow_copy(result)
   2003

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/base.py in unique(self)
   1312         else:
   1313             from pandas.core.algorithms import unique1d
-> 1314             result = unique1d(values)
   1315
   1316         return result

~/miniconda/envs/dev/lib/python3.7/site-packages/pandas/core/algorithms.py in unique(values)
    360
    361     table = htable(len(values))
--> 362     uniques = table.unique(values)
    363     uniques = _reconstruct_data(uniques, dtype, original)
    364     return uniques

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.StringHashTable.unique()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.StringHashTable._unique()

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Problem description

dir() fails on DataFrames with pathological column names: a column name containing a lone surrogate such as '\ud83d' raises UnicodeEncodeError.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.1
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.3
IPython: 7.2.0
sphinx: 1.8.4
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None

mrocklin added a commit to mrocklin/dask that referenced this issue Mar 1, 2019
@jreback jreback changed the title Dir fails on dataframes with pathalogical column names Dir fails on dataframes with pathological column names Mar 3, 2019
@mroeschke mroeschke added Bug Output-Formatting __repr__ of pandas objects, to_string labels May 27, 2019
@jbrockmendel
Member

On OSX this segfaults

@WillAyd
Member

WillAyd commented Jul 8, 2019

What is the expectation here? Is this the first half of a surrogate pair?

@jbrockmendel
Member

Tracking this down, it looks like we get to tslibs.util.get_c_string_buf_and_size and within that we call PyUnicode_AsUTF8AndSize and fail there

@WillAyd
Member

WillAyd commented Dec 17, 2019

With regard to the OP, I don't think this is a bug in pandas: an exception is thrown when the string is passed as an argument to print in Python, though strangely enough the object seems OK when held in a string

>>> print('\ud83d')
>>> type('\ud83d')
<class 'str'>
>>> alist = ['\ud83d']
>>> alist[0] # surprised this works
'\ud83d'
>>> print(alist[0]) # this failure matches pandas
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

w.r.t. Cython I see the following warnings before segfault, so maybe something of interest there:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
[1]    32913 segmentation fault  python
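
The difference between holding the string and printing it can be reproduced without pandas. The following is a minimal sketch using only the standard library; the helper name can_encode_utf8 is hypothetical. A Python str may contain a lone surrogate, but strict UTF-8 encoding rejects it, which is what print has to do on a UTF-8 terminal. The REPL echo of alist[0] works because it displays repr(), which escapes the surrogate into plain ASCII.

```python
# A lone surrogate is a legal element of a Python str, but it has no strict UTF-8 encoding.
s = "\ud83d"


def can_encode_utf8(text):
    """Return True if text can be encoded as strict UTF-8 (hypothetical helper)."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


strict_ok = can_encode_utf8(s)  # False: surrogates not allowed

# repr() escapes the surrogate, so the escaped form is pure ASCII and encodes fine.
escaped = repr(s)
escaped_ok = can_encode_utf8(escaped)

# The "surrogatepass" error handler encodes the surrogate anyway and round-trips it.
raw = s.encode("utf-8", errors="surrogatepass")
round_trip = raw.decode("utf-8", errors="surrogatepass")
```

Whether "surrogatepass" is appropriate for a hash key is a separate design question; the sketch only shows that the failure comes from strict encoding rather than from the str object itself.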

@jbrockmendel
Member

After some googling: is there a specific range of Unicode characters that are surrogates, which we might be able to screen for? Are there non-surrogate "pathological" cases we need to worry about?
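
On the first question: Unicode reserves the code points U+D800 through U+DFFF for surrogates (general category Cs), so a range check is possible. A minimal sketch of such a screen follows; the function name has_surrogate is hypothetical and this is not code from pandas.

```python
import unicodedata


def has_surrogate(text):
    """Return True if text contains any code point in the surrogate range U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


lone = "\ud83d"            # high (leading) surrogate on its own
emoji = "\U0001F600"       # a complete astral code point, not a surrogate in a Python 3 str

checks = (has_surrogate(lone), has_surrogate(emoji), has_surrogate("abc"))

# The Unicode database agrees: surrogates have general category "Cs".
category = unicodedata.category(lone)
```

This only answers the surrogate part of the question; it says nothing about other inputs that might fail for different reasons.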

@jbrockmendel
Member

In pd._libs.tslibs.util.get_c_string_buf_and_size if we change:

    if PyUnicode_Check(py_string):
        buf = PyUnicode_AsUTF8AndSize(py_string, length)

to

    if PyUnicode_Check(py_string):
        if not py_string.isprintable():
            py_string = repr(py_string)
        buf = PyUnicode_AsUTF8AndSize(py_string, length)

then a) the dir(df) call doesn't raise/segfault, b) dir(df) does not contain the column name like we would expect, c) df.columns.unique() doesn't raise/segfault, d) df.columns.unique() does contain the 1 entry we expect.

Not sure what to do with this information, but it's out there.
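
The Python-level effect of that change can be sketched in isolation. This is an illustration only, with a hypothetical function name encodable_form; the real code path is Cython and hands the result to PyUnicode_AsUTF8AndSize. A lone surrogate is not printable, so it is replaced by its repr, which is plain ASCII and therefore encodable.

```python
def encodable_form(py_string):
    """Mirror the patched branch: swap a non-printable string for its repr (hypothetical helper)."""
    if not py_string.isprintable():
        py_string = repr(py_string)
    return py_string


key = encodable_form("\ud83d")
key_bytes = key.encode("utf-8")      # no UnicodeEncodeError any more

# Printable strings pass through unchanged.
plain = encodable_form("column_a")

# Side effect: other non-printable strings are rewritten too, not only surrogates.
newline_key = encodable_form("a\nb")
```

Two things stand out in this sketch: the check is broader than surrogates, since it also rewrites names containing control characters, and the rewritten key is a different string from the original, so a column literally named "'\\ud83d'" would produce the same buffer.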

@roberthdevries
Contributor

I have traced this down to get_c_string (from util.pxd) in hashtable_class_helper.pxi.in.
The unicode string with the surrogate character in it is passed to get_c_string which passes it to get_c_string_buf_and_size, which passes it to PyUnicode_AsUTF8AndSize.
This last one will happily return a NULL pointer when it receives a surrogate character. This is passed down the line to a point where it is dereferenced in khash.h (probably to calculate a hash value).
Hence the segfault.

This is where the NULL pointer gets assigned to an array value in hashtable_class_helper.pxi.in:791:

                # if ignore_na is False, we also stringify NaN/None/etc.
                v = get_c_string(<str>val)
                vecs[i] = v
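
The failing C API call can be exercised from Python through ctypes, assuming CPython. This is a sketch, not pandas code, and the wrapper name utf8_buffer is hypothetical. One caveat: ctypes.pythonapi checks the interpreter's error flag after each call, so the NULL return with an exception set surfaces here as a raised UnicodeEncodeError. Cython code that ignores the exception, as in the "Exception ignored" warning above, is left holding the NULL pointer instead.

```python
import ctypes

# Bind the same C API function that get_c_string_buf_and_size calls.
as_utf8 = ctypes.pythonapi.PyUnicode_AsUTF8AndSize
as_utf8.argtypes = [ctypes.py_object, ctypes.POINTER(ctypes.c_ssize_t)]
as_utf8.restype = ctypes.c_char_p


def utf8_buffer(text):
    """Return (bytes, size) for text, or None when the C call fails (hypothetical wrapper)."""
    size = ctypes.c_ssize_t()
    try:
        buf = as_utf8(text, ctypes.byref(size))
    except UnicodeEncodeError:
        # The C function returned NULL with an exception set; a caller must
        # check for this before using the pointer.
        return None
    return buf, size.value


ok = utf8_buffer("abc")
bad = utf8_buffer("\ud83d")
```

The point of the wrapper is the explicit failure branch: whatever value is stored in place of the buffer, the result of the call has to be checked before it is hashed.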
