BUG: .filter with unicode labels when can't encode #13101

Closed
gritschel opened this Issue May 6, 2016 · 2 comments

gritschel commented May 6, 2016

Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters.

import pandas as pd
df = pd.DataFrame({u'a': [1, 2, 3], u'ä': [4, 5, 6]})
df.filter(regex=u'a')

raises:

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-10-9de5a19c260e> in <module>()
----> 1 df.filter(regex=u'a')

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in filter(self, items, like, regex, axis)
   2013             matcher = re.compile(regex)
   2014             return self.select(lambda x: matcher.search(str(x)) is not None,
-> 2015                                axis=axis_name)
   2016         else:
   2017             raise TypeError('Must pass either `items`, `like`, or `regex`')

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in select(self, crit, axis)
   1545         if len(axis_values) > 0:
   1546             new_axis = axis_values[
-> 1547                 np.asarray([bool(crit(label)) for label in axis_values])]
   1548         else:
   1549             new_axis = axis_values

C:\Users\...\AppData\Local\Continuum\32bit\Anaconda\envs\test\lib\site-packages\pandas\core\generic.pyc in <lambda>(x)
   2012         elif regex:
   2013             matcher = re.compile(regex)
-> 2014             return self.select(lambda x: matcher.search(str(x)) is not None,
   2015                                axis=axis_name)
   2016         else:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
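
The last frame points at `str(x)`: on Python 2, calling `str()` on a unicode label implicitly encodes it with the ASCII codec. A minimal sketch of that encoding step, without pandas, is below. The helper name `ascii_encodable` is hypothetical and only for illustration; the explicit `.encode('ascii')` call stands in for what Python 2's `str()` does implicitly.

```python
def ascii_encodable(label):
    """Return True if a text label survives encoding to ASCII.

    Hypothetical helper: the explicit encode mimics the implicit
    encoding that Python 2 performs inside str(unicode_label).
    """
    try:
        label.encode('ascii')
        return True
    except UnicodeEncodeError:
        # Same exception type as in the traceback above.
        return False


# u'a' is plain ASCII; u'\xe4' is the non-ASCII label from the report.
print(ascii_encodable(u'a'))
print(ascii_encodable(u'\xe4'))
```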

@gritschel gritschel changed the title from Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters. to BUG: Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters. May 6, 2016

Contributor

jreback commented May 6, 2016

xref #10384

Yeah, str(x) will try to encode, so it is probably easiest either to catch this (and pass the label through if it cannot be encoded), or to stringify only integers (but that leaves out things like float columns).

So I think the former is OK. Want to do a PR?

It would need some tests for other column label types as well (e.g. the tests should loop through all of the index types).
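
The catch-and-pass-through idea could look roughly like the sketch below. This is not the pandas patch: `to_searchable` and `filter_labels` are hypothetical names, and the plain list of labels stands in for an axis. On Python 3 `str()` never raises here, so the `except` branch only matters on Python 2.

```python
import re


def to_searchable(label):
    """Stringify a label for regex matching, passing it through on failure."""
    try:
        return str(label)
    except UnicodeEncodeError:
        # Python 2: str() cannot ASCII-encode the unicode label,
        # so hand the original unicode object to the regex instead.
        return label


def filter_labels(labels, regex):
    """Keep the labels whose string form matches the regex (hypothetical helper)."""
    matcher = re.compile(regex)
    return [lab for lab in labels
            if matcher.search(to_searchable(lab)) is not None]


# Mixed label types: text, non-ASCII text, integer, float.
labels = [u'a', u'\xe4', 10, 1.5]
print(filter_labels(labels, u'a'))
print(filter_labels(labels, u'1'))
```

Because non-string labels still go through `str()`, integer and float labels remain matchable, which is the case the stringify-only-integers alternative would miss.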

@jreback jreback added this to the 0.18.2 milestone May 6, 2016

@jreback jreback changed the title from BUG: Edit #10506 breaks if the DataFrame contains unicode column names with non-ASCII characters. to BUG: .filter with unicode labels when can't encode May 6, 2016

gritschel commented May 9, 2016

I don't have a git environment installed at the moment, so unfortunately I cannot open the pull request.
I would support passing the argument through when it cannot be encoded, since that is the easiest and a fairly general fix (although this fallback mechanism might seem a bit opaque).
