Unicode : change df.to_string() and friends to always return unicode objects #2224

Merged
merged 9 commits into from Nov 27, 2012

@ghost ghost commented Nov 11, 2012

closes #2225

Note: Although all the tests pass with minor fixes, this PR has an above-average chance of
breaking things for people who have relied on broken behaviour thus far.

df.tidy_repr combines several strings to produce a result. When one component is unicode
and the other is a non-ascii bytestring, it tries to convert the latter to a unicode string
using the 'ascii' codec and fails.
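
The failure mode can be reproduced outside pandas. In Python 2, concatenating a unicode object with a bytestring implicitly decodes the bytestring with the 'ascii' codec. The sketch below spells out that hidden step so that it also runs on Python 3, where mixing the two types raises a TypeError instead. The names `py2_style_concat`, `label` and `header` are illustrative only and do not come from pandas.

```python
# A UTF-8 encoded bytestring containing a non-ASCII character.
label = u"caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9'
header = u"name: "


def py2_style_concat(text, raw):
    """Mimic Python 2's implicit coercion: decode the bytes as ASCII, then join."""
    return text + raw.decode("ascii")


try:
    py2_style_concat(header, label)
    failed = False
except UnicodeDecodeError:
    # Same class of error as the one reported for tidy_repr.
    failed = True

# Decoding explicitly with the correct codec avoids the problem.
combined = header + label.decode("utf-8")
```

The point of the sketch is that the error comes from the implicit ASCII step, not from the data itself: the same bytes decode cleanly once the codec is stated.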

I suggest that _get_repr -> to_string should always return unicode, as implemented by this PR,
and that the force_unicode argument be deprecated everywhere.

The force_unicode argument in to_string conflates two things:

  • which codec to use to decode the string (which can only be a hopeful guess)
  • whether to return a unicode() object or str() object.

The first is now no longer necessary since pprint_thing already resorts to the same hack
of using utf-8 (with errors='replace') as a fallback.
I believe making the latter optional is wrong, precisely because it brings about situations
like the test case above.
to_string, like all internal functions, should use unicode objects whenever feasible.
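
As a rough illustration of the "always return unicode, fall back to utf-8 with errors='replace'" idea, here is a minimal sketch written for Python 3, where `str` is the text type that Python 2 calls `unicode`. The helper `as_text` is hypothetical and is not pandas' actual `pprint_thing`.

```python
def as_text(obj, encoding="utf-8"):
    """Return a text rendering of obj, never a bytestring.

    Hypothetical helper; the UTF-8 default is only a hopeful guess.
    """
    if isinstance(obj, bytes):
        # Undecodable bytes become U+FFFD instead of raising an exception.
        return obj.decode(encoding, errors="replace")
    return str(obj)


decoded = as_text(u"\u00e9".encode("utf-8"))   # valid UTF-8 bytes
replaced = as_text(b"\xff")                    # not valid UTF-8
number = as_text(3.5)                          # non-string input
```

Because every branch returns text, callers can combine the results freely without triggering an implicit codec guess.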

@wesm
Member

wesm commented Nov 11, 2012

This seems pretty reasonable. Should I take a chance merging this for 0.9.1? I've encountered the bug you fixed here before

@ghost
Author

ghost commented Nov 11, 2012

I would at least wait a few days before merging this (perhaps @jseabold or someone else would like
to argue their use case).

@wesm
Member

wesm commented Nov 11, 2012

I guess the question is what code will break because the string is coming back as unicode. Obviously if you had df.to_string(force_unicode=True).decode('utf-8') that is going to break. Maybe this should be held off until the 0.10 series.

@ghost
Author

ghost commented Nov 11, 2012

it depends whether you consider this a bug fix or a breaking change. I'm fine with 0.10 though.

@changhiskhan
Contributor

Let's wait 'til 0.10. Let's merge it into master as soon as the release is out though.

@wesm
Member

wesm commented Nov 12, 2012

Agreed...

@aldanor
Contributor

aldanor commented Nov 13, 2012

This would be great. As of right now, you have to do something dirty (at least that's the only way I found it works) like DataFrame(series).to_string(force_unicode=True, header=False) to correctly print a Series object with unicode characters to a utf-8 console.

@ghost
Author

ghost commented Nov 14, 2012

I took this a step further, realizing that the unicode issue really matters only
when we want to get a string representation of an object.

So:

  • I converted more related functions to work exclusively with unicode.
  • Since everything should taper down to pprint_thing at the bottom, any utf-8 bytestrings
    should get silently decoded into unicode.
  • If your data is not unicode and not utf-8, it's unreasonable to expect str(df) to do
    the right thing, and so you'll get � (the unicode replacement character), but not exceptions
    (hopefully).
  • Fixing a couple of corner cases along the way, I added all the boilerplate so that
    str(x)/unicode(x)/bytes(x) work on py2 and py3 for series/df/panel.

Yell if something broke.
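
A minimal sketch of the str(x)/bytes(x) boilerplate idea, shown for Python 3 only. `Table` is a hypothetical stand-in for a Series/DataFrame-like object, and the UTF-8 encoding in `__bytes__` is an assumption, not something stated in this PR. On Python 2 the equivalent split would be `__unicode__` returning text and `__str__` returning encoded bytes.

```python
class Table(object):
    """Hypothetical container used only to illustrate the pattern."""

    def __init__(self, rows):
        self.rows = rows

    def _render(self):
        # Single source of truth: the rendering is always built as text.
        return u"\n".join(u"{0}: {1}".format(k, v) for k, v in self.rows)

    def __str__(self):
        # On Python 3, str(x) must return text.
        return self._render()

    def __bytes__(self):
        # bytes(x) returns an encoded form of the same text (UTF-8 assumed).
        return self._render().encode("utf-8")


t = Table([(u"city", u"Z\u00fcrich")])
as_str = str(t)
as_bytes = bytes(t)
```

Keeping one text-producing method and deriving the bytes form from it means the two representations cannot drift apart.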

@wesm
Member

wesm commented Nov 21, 2012

@aldanor I see you deleted your comment but I checked that your example works now, at least on my environment...

@aldanor
Contributor

aldanor commented Nov 21, 2012

@wesm Thanks, sounds good. I just didn't want to confuse everyone because I wasn't sure this wasn't something specific to my environment. I will try to test it again as soon as I can.

y-p added 9 commits November 22, 2012 20:48
…e force_unicode #2225

using pprint_thing will try to decode using utf-8 as a fallback,
but these functions will now return unicode() rather than str()
objects.
DOC: add note about formatters needing to return unicode (if returning strings)

we need to keep everything unicode at the bottom levels, so that
we can combine strings with other unicode strings at the I/O
choke-points, otherwise python tries to coerce bytestring
into unicode using 'ascii' encoding, and we get UnicodeDecodeError
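
To make the formatter note concrete, here is a small sketch in plain Python 3 (no pandas). The functions `format_price`, `bad_format_price` and `render_row` are hypothetical; `render_row` stands in for the I/O choke-point where formatted cells are combined with other text.

```python
def format_price(value):
    # Hypothetical formatter: returns text, never an encoded bytestring.
    return u"{0:.2f} \u20ac".format(value)


def bad_format_price(value):
    # Returns UTF-8 bytes; mixing this with text is what goes wrong.
    return format_price(value).encode("utf-8")


def render_row(values, formatter):
    # The choke-point: formatted cells are joined with a text separator.
    return u" | ".join(formatter(v) for v in values)


row = render_row([1, 2.5], format_price)

try:
    render_row([1, 2.5], bad_format_price)
    bytes_rejected = False
except TypeError:
    # Python 3 refuses to join bytes with text. Python 2 would instead
    # attempt an implicit ASCII decode and raise UnicodeDecodeError.
    bytes_rejected = True
```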
…ries,df,panel

- If you put in proper unicode data, you're good.
- If you put in utf-8 bytestrings you should still be good (it works if rendering
is wrapped by pprint_thing, I may have missed a few spots).
- If you put in non utf-8 bytestrings, with the encoding unknown, and expect
unicode(x) or str(x) to do the right thing - you're doing it wrong.
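
The three cases in this commit message can be sketched with plain decode calls (Python 3 syntax; the sample word and the choice of latin-1 as the "unknown" encoding are assumptions made for illustration):

```python
text = u"na\u00efve"                     # case 1: proper unicode
utf8_bytes = text.encode("utf-8")        # case 2: UTF-8 bytestring
latin1_bytes = text.encode("latin-1")    # case 3: some other encoding

# Case 2: a UTF-8 fallback recovers the original text.
from_utf8 = utf8_bytes.decode("utf-8", errors="replace")

# Case 3: guessing UTF-8 does not raise, but the offending byte
# is replaced with U+FFFD, so information is lost.
guessed = latin1_bytes.decode("utf-8", errors="replace")

# Only the caller knows the real encoding, so only the caller can fix it.
explicit = latin1_bytes.decode("latin-1")
```

In other words, data in an unknown encoding should be decoded by whoever knows the encoding before it is handed over for rendering.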
@ghost
Author

ghost commented Nov 22, 2012

Added str/unicode/bytes support for Index,MultiIndex.

wesm added a commit that referenced this pull request Nov 27, 2012
@wesm wesm merged commit 436bf36 into pandas-dev:master Nov 27, 2012
@ghost
Author

ghost commented Nov 27, 2012

takeback

Successfully merging this pull request may close these issues.

BUG: series tidy_repr UnicodeDecodeError
3 participants