utf8 problems with msgpack? #15

Closed
igrigorik opened this issue Sep 11, 2010 · 7 comments
Comments

@igrigorik

A bit of a shot in the dark, but has anyone come across problems with UTF-8 + msgpack? I'm using the Ruby bindings. I logged ~500 GB of data in zmpac format (stream + zlib), in ~200 MB chunks (~1 GB uncompressed). I'm now trying to read the data back and running into parse errors on random files.

I haven't had much luck tracking down the culprit so far. If I sysread the file 1024 bytes at a time and parse out the messages, then dump the buffer once the error is thrown, I see Chinese characters, etc.

Same behavior under Ruby 1.8 and 1.9. Any suggestions for how to recover this data, or any other tips?

@rb2k

rb2k commented Aug 15, 2011

I have similar problems: to_msgpack() -> redis -> MessagePack.unpack(data) leads to UTF-8 errors.
I get an "invalid byte sequence in UTF-8" error after a simple decoded_data.include?("something").

It seems to happen with hashes and this is what the unpacked data looks like:
"DATE"=>", 11 Aug 2011 21:39:30 GMT\xA6SE"

(ruby 1.9.3 preview 1)
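
A quick way to confirm this kind of mismatch is to look at the encoding label Ruby has attached to the string and ask whether the bytes are valid under that label. The sketch below uses only core String methods. The sample value is the one quoted above; tagging it as UTF-8 by hand is an assumption standing in for whatever the client library did.

```ruby
# Sample value copied from the comment above. The UTF-8 label is applied by
# hand to imitate a client that tags raw bytes as text (an assumption).
value = ", 11 Aug 2011 21:39:30 GMT\xA6SE".b.force_encoding(Encoding::UTF_8)

puts value.encoding          # the label Ruby attached to the bytes
puts value.valid_encoding?   # false: a lone \xA6 is not valid UTF-8

# A binary-tagged copy has exactly the same bytes, and every byte is valid.
binary = value.b
puts binary.bytes == value.bytes
puts binary.valid_encoding?
```

If `valid_encoding?` is false on data that is about to be unpacked, the string is most likely binary data carrying a text label.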

@arsduo

arsduo commented Feb 21, 2012

Did either of you have any luck figuring this out? We're having a similar issue.

@igrigorik
Author

Nope, never got to the bottom of it.

@sgtFloyd

I ran into the same issue with to_msgpack -> redis -> MessagePack.unpack. I tracked it down to a single UTF character \xC8 (é)

Forcing ASCII-8BIT encoding before deserialization seems to fix the problem:

MessagePack.unpack(data.force_encoding('ASCII-8BIT'))
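
For anyone wondering why this works: `force_encoding` changes only the label on the string, never its bytes. A minimal sketch, using a hand-built byte string in place of a real Redis reply (the payload is an assumed example, the MessagePack bytes for {"k" => "é"}):

```ruby
# Assumed example payload: the MessagePack bytes for {"k" => "é"}.
bytes = [0x81, 0xA1, 0x6B, 0xA2, 0xC3, 0xA9].pack("C*")

# Imitate a client that tags everything it reads as UTF-8.
from_client = bytes.dup.force_encoding(Encoding::UTF_8)
puts from_client.valid_encoding?   # false: 0x81 cannot start a UTF-8 character

# force_encoding mutates the receiver and only swaps the encoding label.
before = from_client.bytes
from_client.force_encoding(Encoding::ASCII_8BIT)
puts from_client.bytes == before   # true: the bytes are untouched
puts from_client.valid_encoding?   # true: any byte sequence is valid as binary
```

Note that `force_encoding` modifies the string in place, so it raises on a frozen string.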

@mikelaurence

Experienced the same problem. The force_encoding solution described by @sgtFloyd fixed it!

@trashpanda001

The redis-rb gem forces the Redis response encoding to Encoding::default_external in Redis::Connection::CommandHelper -- this is logical, as the string is coming from an external I/O stream so it uses the default here, which in most setups is UTF-8. The normal case of setting/getting UTF-8 encoded strings in Redis works as expected.

But MessagePack is a binary serialization format, and it expects to unpack from a raw binary string, so you need to force the string you get from redis-rb into binary (or ASCII-8BIT as @sgtFloyd suggested above):

MessagePack.unpack(data.force_encoding(Encoding::BINARY))

I think the MessagePack.unpack method itself should perform this force_encoding in a future version, but for now we have to do it ourselves.
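
Until then, one option is a small wrapper. This is only a sketch: `unpack_binary` is a hypothetical helper name, and the unpacker is passed in as a block so the snippet runs without the msgpack gem. It uses `String#b` (Ruby 2.0 and later), which returns a binary-tagged copy rather than mutating the caller's string the way `force_encoding` does.

```ruby
# Hypothetical helper: hand a binary-tagged copy of `data` to the given block.
# In real use the block would call MessagePack.unpack.
def unpack_binary(data)
  yield data.b
end

# Stand-in for a Redis reply: raw bytes labeled UTF-8, frozen to show that
# the caller's string is never modified.
reply = [0x81, 0xA1, 0x6B, 0xA2, 0xC3, 0xA9]
        .pack("C*")
        .force_encoding(Encoding::UTF_8)
        .freeze

seen = unpack_binary(reply) { |raw| raw.encoding }
puts seen             # the binary encoding seen inside the block
puts reply.encoding   # still UTF-8 on the caller's side

# With the msgpack gem loaded, the call would look like:
#   unpack_binary(reply) { |raw| MessagePack.unpack(raw) }
```

Copying costs an extra allocation per message; when the string is known to be mutable and not shared, calling `force_encoding(Encoding::BINARY)` in place avoids that.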

@kuenishi
Member

Each language implementation has since been split into its own repository, so if this is still a problem, please open a new issue in the relevant repository. Thank you.
