utf8 problems with msgpack? #15

igrigorik · 2010-09-11T22:22:18Z

A bit of a shot in the dark, but has anyone come across problems with UTF-8 + msgpack? I'm using the Ruby bindings. I logged ~500 GB of data in zmpac format (stream + zlib), in ~200 MB chunks (~1 GB uncompressed). I'm trying to read the data back and running into parse errors on random files.

Haven't had much luck tracking down the culprit so far, but if I sysread the file 1024 bytes at a time and parse out the messages, then dump the buffer once the error is thrown, I see Chinese characters, etc.

Same behavior under 1.8 and under 1.9. Any suggestions for how to recover this data, and/or any other tips?

rb2k · 2011-08-15T13:01:49Z

I have similar problems: to_msgpack() -> redis -> MessagePack.unpack(data) leads to UTF-8 errors.
I get an "invalid byte sequence in UTF-8" error after a simple decoded_data.include?("something").

It seems to happen with hashes and this is what the unpacked data looks like:
"DATE"=>", 11 Aug 2011 21:39:30 GMT\xA6SE"

(ruby 1.9.3 preview 1)
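
One possible reading of the output above (a guess, not something confirmed in this thread) is that string boundaries are being counted in characters in one place and in bytes in another, so a value loses bytes at the front and picks up a length-prefix byte at the end. The sketch below uses only core Ruby and a made-up sample string to show how the two counts drift apart once a multibyte character is present:

```ruby
# Sample text is hypothetical; "\u00E9" is "é", which takes 2 bytes in UTF-8.
text = "caf\u00E9 GMT"

char_count = text.length     # counts characters
byte_count = text.bytesize   # counts bytes

# Taking "the first 5 units" gives different results depending on the unit.
by_chars = text[0, 5]            # 5 characters
by_bytes = text.byteslice(0, 5)  # 5 bytes, which is only 4 characters here

# A binary (ASCII-8BIT) copy counts one "character" per byte, which is what
# a length-prefixed binary format needs.
binary_count = text.b.length

puts "chars=#{char_count} bytes=#{byte_count} binary=#{binary_count}"
puts by_chars.inspect
puts by_bytes.inspect
```

Any code that mixes the two kinds of count on a string labelled UTF-8 would end up with shifted offsets, which is at least consistent with a value ending in "\xA6SE".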

arsduo · 2012-02-21T12:31:05Z

Did either of you have any luck figuring this out? We're having a similar issue.

igrigorik · 2012-02-21T17:35:16Z

Nope, never got down to the bottom of it..

sgtFloyd · 2012-03-21T20:16:20Z

I ran into the same issue with to_msgpack -> redis -> MessagePack.unpack. I tracked it down to a single UTF character \xC8 (é)

Forcing ASCII-8BIT encoding before deserialization seems to fix the problem:

MessagePack.unpack(data.force_encoding('ASCII-8BIT'))

mikelaurence · 2012-07-11T20:33:59Z

Experienced the same problem - "force_encoding" solution described by @sgtFloyd fixed it!

trashpanda001 · 2012-07-14T18:47:25Z

The redis-rb gem forces the Redis response encoding to Encoding::default_external in Redis::Connection::CommandHelper -- this is logical, as the string is coming from an external I/O stream so it uses the default here, which in most setups is UTF-8. The normal case of setting/getting UTF-8 encoded strings in Redis works as expected.

But MessagePack is a binary serialization format, and it expects to unpack from a raw binary string, so you need to force the string you get from redis-rb into binary (or ASCII-8BIT as @sgtFloyd suggested above):

MessagePack.unpack(data.force_encoding(Encoding::BINARY))

I think the MessagePack.unpack method itself should perform this force_encoding in a future version, but for now we have to do it ourselves.
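
As a rough sketch of what "forcing to binary" does, the snippet below uses only core Ruby, so it runs without redis-rb or msgpack installed. The helper name as_binary and the sample bytes are my own assumptions for illustration, not part of either library. It shows that force_encoding changes only the encoding label, never the bytes:

```ruby
# Hypothetical helper (not part of msgpack or redis-rb): returns a copy of
# the data labelled as binary, leaving the caller's string untouched.
def as_binary(data)
  data.dup.force_encoding(Encoding::BINARY)
end

# Simulate a reply whose raw bytes were labelled UTF-8 by a client library.
# The bytes are an arbitrary sample; 0x81 on its own is not valid UTF-8.
raw = [0x81, 0xA1, 0x61, 0xA2, 0xC3, 0xA9].pack("C*").force_encoding(Encoding::UTF_8)

binary = as_binary(raw)

puts raw.encoding            # label applied by the (simulated) client
puts raw.valid_encoding?     # the bytes do not form valid UTF-8
puts binary.encoding         # same bytes, now labelled as binary
puts binary.valid_encoding?  # any byte sequence is valid as binary
puts binary.bytes == raw.bytes

# With the msgpack gem loaded, the binary copy is what you would pass on:
#   MessagePack.unpack(as_binary(raw))
```

Working on a dup is a design choice here: force_encoding mutates its receiver, so calling it directly on a string shared with other code would relabel it for them too.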

kuenishi · 2013-08-17T07:51:34Z

As each language implementation was separated, please open another issue at each repository if this is still problematic. Thank you.

kuenishi closed this as completed Aug 17, 2013
