astype(unicode) does not work as expected #7758

Closed
fulmicoton opened this Issue Jul 15, 2014 · 11 comments

Contributor

fulmicoton commented Jul 15, 2014

astype("unicode") seems to call str, so the following code throws:

import pandas
df = pandas.DataFrame({"somecol": [u"適当"]})
df["somecol"].astype("unicode")

raises:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)
Contributor

jreback commented Jul 15, 2014

you can do: df['somecol'].values.astype('unicode')

what are you doing with this?

pandas keeps all string-likes as object dtype so this is really only for external usage
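
For readers on Python 3, where str is already the unicode type, the workaround above can be sketched with NumPy alone. The object array below is an assumed stand-in for what df['somecol'].values returns, not code taken from this issue.

```python
import numpy as np

# Assumed stand-in for df['somecol'].values: an object-dtype array of text.
values = np.array(["適当"], dtype=object)

# Casting the ndarray (not the Series) yields a fixed-width unicode array.
as_unicode = values.astype(str)

print(values.dtype.kind)      # 'O'
print(as_unicode.dtype.kind)  # 'U'
```

The result is a plain ndarray, which fits the remark that the cast is mostly useful for code outside pandas.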

jreback added the Unicode label Jul 15, 2014

Contributor

fulmicoton commented Jul 15, 2014

I have a method that detects whether a column should be considered a category based on its type and cardinality. Columns that are considered categories are cast to unicode objects.

I know how to work around this issue, but I thought I should report what I thought was a bug.

Let me know if you need more information.

Contributor

jreback commented Jul 15, 2014

OK, the error could be more informative, but it's fundamentally an issue. The cast would return a numpy array (and NOT a Series); putting that back into a Series would simply recast it and lose the cast to unicode.

I think that is a bit odd though. What do you think should happen?

Contributor

fulmicoton commented Jul 15, 2014

Ideally, I would have wanted the cast to work like Python's unicode() function.
That is: returned objects are always of the "unicode" type.

  • Unicode objects are left unchanged.
  • Numbers are stringified into unicode strings.
  • str objects are decoded using the default encoding and a unicode object is returned.

Does that make sense in Pandas?
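
A minimal sketch of those three rules in Python 3 terms, where str plays the role of unicode and bytes plays the role of the old str. The helper name to_text, its encoding parameter, and the utf-8 default are all assumptions made for illustration; this is not what pandas does internally.

```python
def to_text(value, encoding="utf-8"):
    """Hypothetical helper mirroring the three rules listed above."""
    if isinstance(value, str):
        # Already text: left unchanged.
        return value
    if isinstance(value, (bytes, bytearray)):
        # Byte strings are decoded with the given (assumed) encoding.
        return bytes(value).decode(encoding)
    # Numbers and anything else are stringified.
    return str(value)


print([to_text(v) for v in ["適当", 42, "適当".encode("utf-8")]])
# ['適当', '42', '適当']
```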

Member

cpcloud commented Jul 15, 2014

@fulmicoton Why do you need to convert to unicode? Do you have things that are convertible to unicode but aren't already converted? Can you give a more detailed example that illustrates why you need to do this? I think I'm just missing something.

Contributor

jreback commented Jul 15, 2014

This could all be done, I think (may need to allow an encoding argument for your 3rd bullet).
Keep in mind that pandas currently does not have a unicode type per se (str and unicode are stored as object dtype), but it's really not a big deal, as when a unicode dtype is presented it can simply be inferred.

here's a picture of the internal structure:

In [16]: df
Out[16]: 
  somecol
0      適当

In [17]: df._data
Out[17]: 
BlockManager
Items: Index([u'somecol'], dtype='object')
Axis 1: Int64Index([0], dtype='int64')
ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [18]: df._data.blocks[0]
Out[18]: ObjectBlock: slice(0, 1, 1), 1 x 1, dtype: object

In [19]: df._data.blocks[0].values
Out[19]: array([[u'\u9069\u5f53']], dtype=object)

In [20]: pd.lib.infer_dtype(df._data.blocks[0].values)
Out[20]: 'unicode'

jreback added this to the 0.15.0 milestone Jul 15, 2014

Contributor

jreback commented Jul 15, 2014

@fulmicoton interested in doing a pull-request for this?

Contributor

fulmicoton commented Jul 15, 2014

@cpcloud I just have a piece of code that tries to coerce a bunch of columns marked as categorical into unicode strings. Some of them are already unicode; others were detected as int but have such low cardinality that I want to handle them as categories.
They get dummified afterwards, so it's important they all end up as unicode strings at one point or another.
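
A rough sketch of that workflow on Python 3 with a recent pandas. The column names, the sample data, and the categorical_cols list are made up for illustration, and the real detection logic based on type and cardinality is not shown.

```python
import pandas as pd

df = pd.DataFrame({
    "label": ["適当", "other", "適当"],  # already text
    "code": [1, 2, 1],                   # int with low cardinality
})

# Hypothetical result of the cardinality-based detection step.
categorical_cols = ["label", "code"]

# Coerce every categorical column to text so they are handled uniformly.
as_text = df[categorical_cols].astype(str)

# Dummify: one indicator column per (column, value) pair.
dummies = pd.get_dummies(as_text)
print(sorted(dummies.columns))
```

Casting the int column to text first is what makes get_dummies treat it as a category rather than pass it through as a number.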

Contributor

fulmicoton commented Jul 15, 2014

@jreback I'll take a look at that tonight.

Contributor

jreback commented Jul 15, 2014

@fulmicoton you might want to explore this as well (just merged in): http://pandas-docs.github.io/pandas-docs-travis/categorical.html. Prob not a lot of tests for unicode (but it should work)

fulmicoton added a commit to fulmicoton/pandas that referenced this issue Jul 15, 2014

Closes #7758 - astype(unicode) returning unicode.
Just calls numpy.unicode on all the values.
Seems to work alright on python2 and python3.
a92e593
Contributor

fulmicoton commented Jul 15, 2014

Here is the pull request. I didn't have to use infer_dtype, so I hope I didn't do anything wrong.

fulmicoton added a commit to fulmicoton/pandas that referenced this issue Jul 15, 2014

Added bugfix of #7758 to v0.15.0 changelog. 01d6897

jreback closed this in a797b28 Jul 16, 2014
