utf8 problems with msgpack? #15

Closed
igrigorik opened this Issue Sep 11, 2010 · 7 comments

@igrigorik

igrigorik commented Sep 11, 2010

A bit of a shot in the dark, but has anyone come across problems with UTF-8 + msgpack? I'm using the Ruby bindings. I logged ~500 GB of data in zmpac format (stream + zlib), in ~200 MB chunks (~1 GB uncompressed). I'm now trying to read the data back and running into parse errors on random files.

Haven't had much luck tracking down the culprit so far, but if I sysread the file 1024 bytes at a time and parse out the messages, then dump the buffer once the parse error is thrown, I see Chinese characters, etc.

Same behavior under 1.8 and under 1.9. Any suggestions for how to recover this data, and/or any other tips?
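
For reference, a minimal sketch of the chunked read described above, using only the Ruby standard library. It assumes each file is a single zlib stream, which is a guess about the "stream + zlib" layout. `inflate_in_chunks` is a hypothetical helper name, and a StringIO stands in for the real file, which would need to be opened in binary mode (for example with File.open(path, "rb")). The point is to keep every buffer tagged as ASCII-8BIT so that no step reinterprets the bytes as text.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: inflate a zlib stream from +io+ in fixed-size chunks.
# Assumes the whole input is ONE zlib stream (an assumption, not confirmed above).
def inflate_in_chunks(io, chunk_size = 1024)
  inflater = Zlib::Inflate.new
  out = "".b                          # binary (ASCII-8BIT) accumulator
  while (chunk = io.read(chunk_size)) # read returns nil at end of input
    out << inflater.inflate(chunk)
  end
  out << inflater.finish              # flush anything still buffered
  out
ensure
  inflater.close if inflater
end

# Demo with in-memory data standing in for a log file.
payload    = "\x81\xA1k\xC3".b * 2000 # arbitrary binary bytes
compressed = Zlib::Deflate.deflate(payload)
restored   = inflate_in_chunks(StringIO.new(compressed), 16)
puts restored == payload
```

Each inflated chunk could be handed to a streaming unpacker instead of being concatenated; the accumulator here only keeps the sketch short.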

@rb2k

rb2k commented Aug 15, 2011

I have similar problems: to_msgpack() -> redis -> MessagePack.unpack(data) leads to UTF-8 errors.
I get an "invalid byte sequence in UTF-8" error after a simple decoded_data.include?("something").

It seems to happen with hashes and this is what the unpacked data looks like:
"DATE"=>", 11 Aug 2011 21:39:30 GMT\xA6SE"

(ruby 1.9.3 preview 1)
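
The symptom above can be reproduced without msgpack or Redis. The sketch below takes the bytes from the quoted value, tags them as UTF-8 (an assumption about what happens on the way back from Redis), and checks whether the result is valid text. It also decodes the stray 0xA6 byte: in the MessagePack format, bytes 0xA0..0xBF are short-string headers whose low five bits carry the length, so 0xA6 would be consistent with a following 6-byte string starting with "SE". `mislabel_as_utf8` and `fixstr_length` are hypothetical helper names.

```ruby
# Return a copy of +bytes+ tagged as UTF-8. This mimics, as an assumption,
# a client labelling a binary response with a text encoding.
def mislabel_as_utf8(bytes)
  bytes.dup.force_encoding(Encoding::UTF_8)
end

# MessagePack short-string headers occupy 0xA0..0xBF; the low five bits
# are the string length. Returns nil for any other byte.
def fixstr_length(byte)
  (0xA0..0xBF).cover?(byte) ? (byte & 0x1F) : nil
end

value = ", 11 Aug 2011 21:39:30 GMT\xA6SE".b # bytes from the dump quoted above
text  = mislabel_as_utf8(value)

# 0xA6 is a UTF-8 continuation byte with no lead byte before it.
puts text.valid_encoding?  # => false

begin
  text =~ /GMT/            # regexp matching is one operation that rejects broken strings
rescue ArgumentError => e
  puts e.message
end

puts fixstr_length(0xA6)   # => 6
```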

@arsduo

arsduo commented Feb 21, 2012

Did either of you have any luck figuring this out? We're having a similar issue.

@igrigorik

igrigorik commented Feb 21, 2012

Nope, never got to the bottom of it.

@sgtFloyd

sgtFloyd commented Mar 21, 2012

I ran into the same issue with to_msgpack -> redis -> MessagePack.unpack. I tracked it down to a single UTF character \xC8 (é)

Forcing ASCII-8BIT encoding before deserialization seems to fix the problem: MessagePack.unpack(data.force_encoding('ASCII-8BIT'))

@mikelaurence

mikelaurence commented Jul 11, 2012

Experienced the same problem - "force_encoding" solution described by @sgtFloyd fixed it!

@sickp

sickp commented Jul 14, 2012

The redis-rb gem forces the Redis response encoding to Encoding::default_external in Redis::Connection::CommandHelper -- this is logical, as the string is coming from an external I/O stream so it uses the default here, which in most setups is UTF-8. The normal case of setting/getting UTF-8 encoded strings in Redis works as expected.

But MessagePack is a binary serialization format, and it expects to unpack from a raw binary string, so you need to force the string you get from redis-rb into binary (or ASCII-8BIT as @sgtFloyd suggested above):

MessagePack.unpack(data.force_encoding(Encoding::BINARY))

I think the MessagePack.unpack method itself should perform this force_encoding in a future version, but for now we have to do it ourselves.
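
A small sketch of what the force_encoding fix does, using only core Ruby. force_encoding changes the encoding label on the string it is called on; the bytes themselves stay the same. Because it modifies the receiver, it raises on a frozen string, so the hypothetical helper `binary_copy` below uses String#b, which returns a new ASCII-8BIT copy and leaves the original alone. The sample bytes are a hand-written MessagePack map and are only illustrative.

```ruby
# Hypothetical helper: return a binary-tagged copy without touching the original.
def binary_copy(data)
  data.b
end

# Bytes tagged as UTF-8 and frozen, standing in for a string returned by a client.
from_redis = "\x81\xA4DATE\xA3Thu".dup.force_encoding(Encoding::UTF_8).freeze

binary = binary_copy(from_redis)

puts from_redis.valid_encoding?       # => false (0x81 is not valid UTF-8 here)
puts binary.valid_encoding?           # => true  (every byte is valid in binary)
puts binary.bytes == from_redis.bytes # => true  (only the label changed)
```

With the msgpack gem loaded, the call would then be MessagePack.unpack(binary_copy(data)).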

@kuenishi

Member

kuenishi commented Aug 17, 2013

Each language implementation has since been split into its own repository, so please open a new issue in the relevant repository if this is still a problem. Thank you.

@kuenishi kuenishi closed this Aug 17, 2013
