get_dummies chokes on unicode values #6885

Closed
maxgrenderjones opened this issue Apr 15, 2014 · 12 comments
Labels
Bug Strings String extension data type and string data Unicode Unicode strings

@maxgrenderjones
Contributor

(Context: pandas version 0.13.1 running on Python 2.7.6 |Anaconda 1.9.1 (64-bit)| (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)])

In my code I have a category containing lots of non-English names and want to create dummies out of it.

So I call:

dummies=pandas.get_dummies(data[cat], prefix=prefix)

and get:

c:\Anaconda\lib\site-packages\pandas\core\reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na)
    971     if prefix is not None:
    972         dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v))
--> 973                       for v in levels]
    974     else:
    975         dummy_cols = levels

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)

Issue would appear to be the call to str(v) - if v is a unicode string with non-ascii, this is liable to explode.
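
To illustrate the failure mode: on Python 2, calling `str(v)` on a `unicode` object implicitly encodes it with the ASCII codec. The sketch below writes that implicit step out as `.encode("ascii")` so it behaves the same on Python 2 and Python 3. The sample value and the helper name `ascii_encode_error` are made up for illustration and are not taken from the reporter's data.

```python
# Hypothetical sample value containing u'\xe9' (e-acute); not the reporter's data.
value = u"caf\xe9"


def ascii_encode_error(text):
    """Return the UnicodeEncodeError raised by an ASCII encode, or None."""
    try:
        # Python 2's str(unicode_obj) performs this encode implicitly.
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return exc
    return None


err = ascii_encode_error(value)
print(type(err).__name__)  # UnicodeEncodeError
print(ascii_encode_error(u"cafe"))  # None: pure-ASCII text encodes fine
```

Pure-ASCII values pass through unharmed, which is probably why the problem only shows up with non-English names.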

@TomAugspurger
Contributor

Could you try:

import pandas as pd
from StringIO import StringIO

s = """å,b
œ,c
"""
df = pd.read_csv(StringIO(s), header=None)
pd.get_dummies(df)

This works fine on my machine. Or share a link to the file that generated your data and the code so we can try to reproduce the problem.

@maxgrenderjones
Contributor Author

Thanks for the assist!

import pandas as pd
from StringIO import StringIO

s = """letter,cat
å,b
œ,c
"""
df = pd.read_csv(StringIO(s))
pd.get_dummies(df['letter'], prefix=u'foo')

reproduces the bug

@jreback
Contributor

jreback commented Apr 15, 2014

I think @hayd mentioned this
there is a str somewhere inside which should encode if it fails

@hayd
Contributor

hayd commented Apr 15, 2014

That one was in relation to str.get_dummies. This is slightly different: is it iterating through these incorrectly (as strings rather than unicode)?

ipdb> levels
Index([u'å', u'œ'], dtype='object')
ipdb> levels[0]
'\xc3\xa5'
ipdb> type(levels[0])
<type 'str'>
ipdb> [v for v in levels]
['\xc3\xa5', '\xc5\x93']

@jreback
Contributor

jreback commented Apr 15, 2014

that should be fine

@maxgrenderjones
Contributor Author

No. Issue is that it tries to create a column name for each of the different values in the columns. When it creates a column name for the dummy (and there's a prefix) it calls:

dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels]

(as per the stacktrace I pasted earlier)

since v (the value of the element in the category) is non-ascii-able unicode, calling str(v) explodes.

Instead of calling str(v) you want to do something like v if isinstance(v, six.string_types) else safely_get_str(v)

It may be sufficient to use str as safely_get_str as any object that implements a __str__ method ought to return something rather than a unicode explosion (I think)
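
One possible reading of the suggestion above, as a sketch only: leave values that are already text untouched and call `str` on everything else. `to_label` is a hypothetical name. The snippet targets Python 3, where `str` stands in for `six.string_types` so that no extra dependency is needed, and it assumes that any byte strings are UTF-8 encoded.

```python
def to_label(value, encoding="utf-8"):
    """Return a text label for value without an implicit ASCII round-trip."""
    if isinstance(value, str):
        # Already text: keep it as-is rather than re-encoding it.
        return value
    if isinstance(value, bytes):
        # Assumption: byte strings are UTF-8 encoded.
        return value.decode(encoding)
    # Numbers, None and other objects fall back to their str() form.
    return str(value)


labels = [to_label(v) for v in [u"\xe5", b"\xc5\x93", 3]]
```

Here `labels` holds three text values, with the byte string decoded to the single character u'\u0153'.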

@hayd
Contributor

hayd commented Apr 15, 2014

It actually breaks in the format stage:

ipdb> prefix
u'foo'
ipdb> prefix_sep
'_'
ipdb> str(v)  # weirdly this is the same as v
'\xc5\x93'
ipdb> '%s%s%s' % (prefix, prefix_sep, str(v))
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
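
This is a decode error rather than the encode error in the original traceback. When a Python 2 `%` format mixes a `unicode` prefix with a byte `str`, the result is promoted to `unicode` and the bytes are implicitly decoded as ASCII. A minimal sketch with that decode written out, so it also runs on Python 3 (the helper name `py2_style_format` is hypothetical):

```python
def py2_style_format(prefix, sep, raw_value):
    """Mimic Python 2 formatting of a unicode prefix with a byte string."""
    # Python 2 would perform this ASCII decode implicitly.
    return u"%s%s%s" % (prefix, sep, raw_value.decode("ascii"))


raw = b"\xc5\x93"  # UTF-8 bytes for u'\u0153', as in the ipdb session

try:
    py2_style_format(u"foo", u"_", raw)
    failed = False
except UnicodeDecodeError:
    failed = True

# Decoding explicitly with the codec the bytes were written in works.
fixed = u"%s%s%s" % (u"foo", u"_", raw.decode("utf-8"))
```

So the same line can fail in two ways depending on whether the value arrives as `unicode` (encode error from `str(v)`) or as UTF-8 bytes alongside a `unicode` prefix (decode error in the format).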

@maxgrenderjones
Contributor Author

I assume changing it to u('%s%s%s') % (prefix, prefix_sep, str(v)) would be cheating? :)

Alternative would be

try:
    dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels]
except UnicodeDecodeError:
    dummy_cols = [u('%s%s%s') % (prefix, prefix_sep, str(v)) for v in levels]

@hayd
Contributor

hayd commented Apr 16, 2014

@maxgrenderjones I don't think that's cheating; in fact, always returning unicode is IMO correct (i.e. just your second line). Fancy putting together a PR? :)

@maxgrenderjones
Contributor Author

I think there's a bug in our test case. Changing the test to:

import pandas as pd
reload(pd.core)
reload(pd.core)
from StringIO import StringIO

s = u"""letter,cat
å,b
œ,c
""".encode('utf-8')
df = pd.read_csv(StringIO(s), encoding='utf-8')
print(df)
pd.get_dummies(df['letter'], prefix='foo')

(i.e. make sure that pandas knows it's reading unicode) and all that is needed to get correct output is to remove the call to str. You can't go wantonly surrounding things with u() as u() raises an Exception if its input is already unicode. Relevant line now reads:

    if prefix is not None:
        dummy_cols = ['%s%s%s' % (prefix, prefix_sep, v)
                      for v in levels]

Trivial change - if it's enough for a pull request, happy to create one.
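
The change described above can be sketched in isolation. `make_dummy_cols` is a hypothetical stand-alone helper, not the pandas function; it only mirrors the column-naming step with the `str(v)` call removed. Using a unicode format string is an assumption made here so that the result is always text.

```python
def make_dummy_cols(levels, prefix=None, prefix_sep=u"_"):
    """Build dummy column names, formatting each value directly (no str())."""
    if prefix is None:
        # Without a prefix the levels themselves are the column names.
        return list(levels)
    return [u"%s%s%s" % (prefix, prefix_sep, v) for v in levels]


cols = make_dummy_cols([u"\xe5", u"\u0153"], prefix=u"foo")
```

With the two non-ASCII levels from the test case, `cols` comes out as `foo_` followed by each original character, and non-string levels such as integers are still formatted by `%s`.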

@hayd
Contributor

hayd commented Apr 20, 2014

Definitely sounds like enough / would be a good PR, with the tests :)

@jreback
Contributor

jreback commented Apr 30, 2014

closed by #6975

@jreback closed this as completed Apr 30, 2014
4 participants