get_dummies chokes on unicode values #6885

maxgrenderjones · 2014-04-15T09:12:43Z

(Context: pandas version 0.13.1 running on 2.7.6 |Anaconda 1.9.1 (64-bit)| (default, Nov 11 2013, 10:49:15) [MSC v.1500 64 bit (AMD64)])

In my code I have a category containing lots of non-English names and want to create dummies out of it.

So I call:

dummies=pandas.get_dummies(data[cat], prefix=prefix)

and get:

c:\Anaconda\lib\site-packages\pandas\core\reshape.pyc in get_dummies(data, prefix, prefix_sep, dummy_na)
    971     if prefix is not None:
    972         dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v))
--> 973                       for v in levels]
    974     else:
    975         dummy_cols = levels

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 19: ordinal not in range(128)

Issue would appear to be the call to str(v) - if v is a unicode string with non-ascii, this is liable to explode.
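
The failure mode can be reproduced without pandas. On Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec. The sketch below makes that encode explicit so it behaves the same on Python 2 and 3. The helper name ascii_str is hypothetical and is not part of pandas.

```python
def ascii_str(value):
    """Mimic Python 2's str() on a unicode object: encode it as ASCII."""
    return value.encode('ascii')


name = u'caf\xe9'  # u'\xe9' (e-acute) is outside the ASCII range

try:
    ascii_str(name)
    failed = False
    message = ''
except UnicodeEncodeError as exc:
    # Same error class as in the traceback above
    failed = True
    message = str(exc)

print(failed, message)
```

Pure-ASCII values pass through unchanged, which is why the problem only shows up once the data contains non-English names.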

TomAugspurger · 2014-04-15T20:40:34Z

Could you try:

import pandas as pd
from StringIO import StringIO

s = """å,b
œ,c
"""
df = pd.read_csv(StringIO(s), header=None)
pd.get_dummies(df)

This works fine on my machine. Or share a link to the file that generated your data and the code so we can try to reproduce the problem.

maxgrenderjones · 2014-04-15T21:20:30Z

Thanks for the assist!

import pandas as pd
from StringIO import StringIO

s = """letter,cat
å,b
œ,c
"""
df = pd.read_csv(StringIO(s))
pd.get_dummies(df['letter'], prefix=u'foo')

reproduces the bug

jreback · 2014-04-15T21:28:39Z

I think @hayd mentioned this
there is a str somewhere inside which should encode if it fails

hayd · 2014-04-15T21:48:47Z

That one was in relation to str.get_dummies. This is slightly different. Is it iterating through these incorrectly (as strings rather than unicode)?

ipdb> levels
Index([u'å', u'œ'], dtype='object')
ipdb> levels[0]
'\xc3\xa5'
ipdb> type(levels[0])
<type 'str'>
ipdb> [v for v in levels]
['\xc3\xa5', '\xc5\x93']

jreback · 2014-04-15T21:56:34Z

that should be fine

maxgrenderjones · 2014-04-15T22:21:37Z

No. Issue is that it tries to create a column name for each of the different values in the columns. When it creates a column name for the dummy (and there's a prefix) it calls:

dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels]

(as per the stacktrace I pasted earlier)

since v (the value of the element in the category) is non-ascii-able unicode, calling str(v) explodes.

Instead of calling str(v) you want to do something like v if isinstance(v, six.string_types) else safely_get_str(v)

It may be sufficient to use str as safely_get_str as any object that implements a __str__ method ought to return something rather than a unicode explosion (I think)

hayd · 2014-04-15T22:29:42Z

It actually breaks in the format stage:

ipdb> prefix
u'foo'
ipdb> prefix_sep
'_'
ipdb> str(v)  # weirdly this is the same as v
'\xc5\x93'
ipdb> '%s%s%s' % (prefix, prefix_sep, str(v))
*** UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 0: ordinal not in range(128)
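
A minimal sketch of what goes wrong at the format stage, assuming the level really is a UTF-8 byte string as the debugger session shows. Python 2 decodes a byte str with the ASCII codec whenever it is mixed into a unicode result. Here that decode is written out explicitly so the example also runs on Python 3. The variable names mirror the session above, and the choice of UTF-8 as the real encoding is an assumption.

```python
prefix = u'foo'
prefix_sep = u'_'
raw = b'\xc5\x93'  # UTF-8 bytes for U+0153, as seen in the debugger

# The implicit step Python 2 performs: decode the byte string as ASCII.
try:
    raw.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Decoding with the actual encoding first makes the formatting safe.
label = u'%s%s%s' % (prefix, prefix_sep, raw.decode('utf-8'))

print(decode_failed, repr(label))
```

This suggests the byte-string levels are the real problem: if the values are already unicode when they reach get_dummies, no implicit decode is needed.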

maxgrenderjones · 2014-04-15T22:59:03Z

I assume changing it to u('%s%s%s') % (prefix, prefix_sep, str(v)) would be cheating? :)

Alternative would be

try:
    dummy_cols = ['%s%s%s' % (prefix, prefix_sep, str(v)) for v in levels]
except UnicodeDecodeError:
    dummy_cols = [u('%s%s%s') % (prefix, prefix_sep, str(v)) for v in levels]

hayd · 2014-04-16T00:46:04Z

@maxgrenderjones I don't think that's cheating, in fact always returning unicode is IMO correct (i.e. just your second line). fancy putting together a PR ? :)

maxgrenderjones · 2014-04-19T23:52:30Z

I think there's a bug in our test case. Changing the test to:

import pandas as pd
reload(pd.core)
reload(pd.core)
from StringIO import StringIO

s = u"""letter,cat
å,b
œ,c
""".encode('utf-8')
df = pd.read_csv(StringIO(s), encoding='utf-8')
print(df)
pd.get_dummies(df['letter'], prefix='foo')

(i.e. make sure that pandas knows it's reading unicode) and all that is needed to get correct output is to remove the call to str. You can't go wantonly surrounding things with u() as u() raises an Exception if its input is already unicode. Relevant line now reads:

    if prefix is not None:
        dummy_cols = ['%s%s%s' % (prefix, prefix_sep, v)
                      for v in levels]

Trivial change - if it's enough for a pull request, happy to create one.
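
For illustration, the name-building step can be pulled out into a standalone function. This is only a sketch of the idea, not the code from the eventual pull request: make_dummy_cols is a hypothetical helper, and it assumes the levels are already text (unicode) rather than encoded bytes.

```python
def make_dummy_cols(levels, prefix=None, prefix_sep='_'):
    """Build dummy column names without forcing each level through str()."""
    if prefix is not None:
        # %s formatting converts non-string levels and leaves text untouched
        return ['%s%s%s' % (prefix, prefix_sep, v) for v in levels]
    return list(levels)


cols = make_dummy_cols([u'\xe5', u'\u0153'], prefix=u'foo')
print(cols)
```

Leaving the conversion to the %s formatting means non-string levels such as integers still get a readable name, while unicode levels are never pushed through an ASCII encode.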

hayd · 2014-04-20T06:40:44Z

Definitely sounds like enough / would be a good PR, with the tests :)

jreback · 2014-04-30T12:40:32Z

closed by #6975

hayd added this to the 0.14.0 milestone Apr 21, 2014

jreback mentioned this issue Apr 21, 2014

str.get_dummies uses astype(str) #6634

Closed

jreback added Bug labels Apr 21, 2014

maxgrenderjones mentioned this issue Apr 26, 2014

Fix for GH 6885 - get_dummies chokes on unicode values #6975

Closed

jreback closed this as completed Apr 30, 2014
