utf8 problems with msgpack? #15
A bit of a shot in the dark, but has anyone come across problems with utf8 + msgpack? I'm using the Ruby bindings. I logged ~500 GB of data in zmpac format (stream + zlib), in ~200 MB chunks (~1 GB uncompressed). Now I'm trying to read the data back and running into parse errors on random files.
I haven't had much luck tracking down the culprit so far. If I sysread the file 1024 bytes at a time and parse out the messages, then dump the buffer once the error is thrown, I see Chinese characters, etc.
Same behavior under 1.8 and under 1.9. Any suggestions for how to recover this data, and/or any other tips?
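For reference, a rough sketch of the chunked read described above, using only the standard library. It assumes each file is a single zlib stream, which may not match the actual zmpac layout, and the helper name `inflate_in_chunks` is made up for illustration. Feeding the inflated bytes to the unpacker is left out.

```ruby
require "zlib"
require "stringio"

# Hypothetical helper: inflate a zlib stream by reading fixed-size chunks
# from an IO and return the decompressed bytes as one binary string.
def inflate_in_chunks(io, chunk_size = 1024)
  inflater = Zlib::Inflate.new
  out = String.new(encoding: Encoding::BINARY)
  # IO#read(n) returns nil at end of file, which ends the loop.
  while (chunk = io.read(chunk_size))
    out << inflater.inflate(chunk)
  end
  out << inflater.finish # flush whatever is still buffered
  out
ensure
  inflater&.close
end

# StringIO stands in for a file opened with File.open(path, "rb").
compressed = Zlib::Deflate.deflate("some logged payload " * 200)
restored = inflate_in_chunks(StringIO.new(compressed))
puts restored.bytesize
puts restored.encoding
```

Keeping the result tagged as binary matters because the decompressed bytes are a serialized payload, not text.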
I have a similar problem: to_msgpack() -> redis -> MessagePack.unpack(data) leads to UTF-8 errors.
It seems to happen with hashes and this is what the unpacked data looks like:
(ruby 1.9.3 preview 1)
The redis-rb gem forces the Redis response encoding to Encoding::default_external in Redis::Connection::CommandHelper -- this is logical, as the string is coming from an external I/O stream so it uses the default here, which in most setups is UTF-8. The normal case of setting/getting UTF-8 encoded strings in Redis works as expected.
But MessagePack is a binary serialization format, and it expects to unpack from a raw binary string, so you need to force the string you get from redis-rb into binary (or ASCII-8BIT as @sgtFloyd suggested above):
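A minimal sketch of that workaround, assuming the string came back from redis-rb tagged as UTF-8. The `to_binary` helper and the sample bytes are illustrative and not part of either gem; the final `MessagePack.unpack` call only runs if the msgpack gem is installed.

```ruby
# Hypothetical helper: re-tag a string as raw bytes without changing them.
# dup keeps the caller's string untouched.
def to_binary(str)
  str.dup.force_encoding(Encoding::BINARY)
end

# Stand-in for a Redis reply: the MessagePack encoding of {"a" => true},
# tagged as UTF-8 the way an external-encoding default would tag it.
from_redis = [0x81, 0xA1, 0x61, 0xC3].pack("C*").force_encoding(Encoding::UTF_8)
puts from_redis.valid_encoding? # these bytes are not valid UTF-8

raw = to_binary(from_redis)
puts raw.encoding                  # Encoding::BINARY is an alias of ASCII-8BIT
puts raw.bytes == from_redis.bytes # only the tag changed, not the bytes

begin
  require "msgpack"
  p MessagePack.unpack(raw)
rescue LoadError
  warn "msgpack gem not installed; skipping the unpack step"
end
```

`force_encoding` only changes the encoding label on the string and never transcodes, so the bytes handed to the unpacker are exactly what was stored.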
I think the MessagePack.unpack method itself should perform this force_encoding in a future version, but for now we have to do it ourselves.