GRPC Ruby client eventually fails to decode JSON string, v3.15.5 #8420

Closed
heyZeus opened this issue Mar 22, 2021 · 19 comments

@heyZeus

heyZeus commented Mar 22, 2021

What version of protobuf and what language are you using?
Version: v3.15.6
Language: Ruby

What operating system (Linux, Windows, ...) and version?
MacOS 10.15.7
Ubuntu 16.04

What runtime / compiler are you using (e.g., python version or gcc version)
Ruby 2.6.5

What did you do?
The RPC server will Base64.encode some JSON, like:

    json_metadata = Base64.encode64(json)

the client will decode like:

    decoded_metadata = Base64.decode64(json_metadata)
    JSON.parse(decoded_metadata).to_h

The proto type is defined as string json_metadata = 7; // base64 encoded
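
Putting those pieces together, the whole round trip (minus the gRPC transport) looks roughly like this -- just a sketch of the intended behavior, with a made-up hash standing in for the real metadata:

    require 'base64'
    require 'json'

    # Server side: serialize a hash to JSON, then Base64-encode it into the string field.
    json = { 'Title' => 'bieber_short', 'Pages' => '6' }.to_json
    json_metadata = Base64.encode64(json)
    # ...json_metadata is assigned to the proto's string field and sent over gRPC...

    # Client side: read the field back, Base64-decode it, then parse the JSON.
    decoded_metadata = Base64.decode64(json_metadata)
    metadata = JSON.parse(decoded_metadata).to_h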

If I go into a REPL, as a gRPC client, and make a call in a loop to fetch the metadata and then decode it, eventually parsing the metadata fails.

What did you expect to see?
No JSON::ParserError, no matter how many times the code is executed.

What did you see instead?
It will work for a while in a loop, then eventually fail (on a random iteration) with an error:

 JSON.parse(decoded_metadata).to_h  # the line that throws a JSON::ParserError
 
 JSON::ParserError:  784: unexpected token at (some garbled JSON string sent from the server)

The loop will return the same string every time, but will eventually fail to parse the string and throw the JSON::ParserError.

The GRPC server is using a different version of Ruby (v2.5.8) and protobuf (v3.9.0) than the Ruby client.

If I downgrade the Ruby client's protobuf version to 3.14.0 this error goes away. All I need to do to reproduce this error is to upgrade the Ruby client to protobuf version 3.15.5.

@haberman
Member

If I'm understanding correctly, you are saying that the protobuf library is corrupting the json_metadata field, which then causes Ruby's JSON.parse() function to fail?

You said that the loop will return the same string every time, but parsing sporadically fails. Can you reproduce this if the loop does not use GRPC at all?

payload = "..."

loop do
  proto = YourProto.decode(payload)
  decoded_metadata = Base64.decode64(proto.json_metadata)
  JSON.parse(decoded_metadata)  # Does this eventually fail?
end

If you can get that to sporadically fail, please send the payload and .proto definition and I can definitely debug that.

@heyZeus
Author

heyZeus commented Mar 22, 2021

@haberman The GRPC server always sends the same metadata string value to the client. That metadata value is:

eyJGb3JtIjoibm9uZSIsIlBhZ2VzIjoiNiIsIlRpdGxlIjoiYmllYmVyX3No
b3J0IiwiQXV0aG9yIjoiQ2xheSBDYXJyaWdhbiIsIlRhZ2dlZCI6Im5vIiwi
Q3JlYXRvciI6IlByZXZpZXciLCJNb2REYXRlIjoiRnJpIEZlYiAyMiAxNzo1
MDowMyAyMDE5IE1TVCIsIlN1YmplY3QiOiJLZXl3b3JkczogICAgICAgIiwi
UGFnZSByb3QiOiIwIiwiUHJvZHVjZXIiOiJtYWNPUyBWZXJzaW9uIDEwLjE0
LjMgKEJ1aWxkIDE4RDEwOSkgUXVhcnR6IFBERkNvbnRleHQiLCJTdXNwZWN0
cyI6Im5vIiwiRW5jcnlwdGVkIjoibm8iLCJGaWxlIHNpemUiOiIzMzU5NjMg
Ynl0ZXMiLCJPcHRpbWl6ZWQiOiJubyIsIlBhZ2Ugc2l6ZSI6IjYxMiB4IDc5
MiBwdHMgKGxldHRlcikiLCJKYXZhU2NyaXB0Ijoibm8iLCJwYWdlX3NpemVz
Ijp7ImxlZ2FsIjpbXSwibGV0dGVyIjpbMSwyLDMsNCw1LDZdLCJ1bmtub3du
IjpbXX0sIlBERiB2ZXJzaW9uIjoiMS4zIiwiQ3JlYXRpb25EYXRlIjoiRnJp
IEZlYiAyMiAxNzo1MDowMyAyMDE5IE1TVCIsIlVzZXJQcm9wZXJ0aWVzIjoi
bm8ifQ==

On the client side, in the loop, I print out the metadata value and it is the same every time, until eventually the value changes (but only on the client side) and causes the JSON parsing to fail. When it does fail, the client's metadata value is:

joibm9uZSIsIlBhZ2VzIjoiNiIsIlRpdGxlIjoiYmllYmVyX3No
b3J0IiwiQXV0aG9yIjoiQ2xheSBDYXJyaWdhbiIsIlRhZ2dlZCI6Im5vIiwi
Q3JlYXRvciI6IlByZXZpZXciLCJNb2REYXRlIjoiRnJpIEZlYiAyMiAxNzo1
MDowMyAyMDE5IE1TVCIsIlN1YmplY3QiOiJLZXl3b3JkczogICAgICAgIiwi
UGFnZSByb3QiOiIwIiwiUHJvZHVjZXIiOiJtYWNPUyBWZXJzaW9uIDEwLjE0
LjMgKEJ1aWxkIDE4RDEwOSkgUXVhcnR6IFBERkNvbnRleHQiLCJTdXNwZWN0
cyI6Im5vIiwiRW5jcnlwdGVkIjoibm8iLCJGaWxlIHNpemUiOiIzMzU5NjMg
Ynl0ZXMiLCJPcHRpbWl6ZWQiOiJubyIsIlBhZ2Ugc2l6ZSI6IjYxMiB4IDc5
MiBwdHMgKGxldHRlcikiLCJKYXZhU2NyaXB0Ijoibm8iLCJwYWdlX3NpemVz
Ijp7ImxlZ2FsIjpbXSwibGV0dGVyIjpbMSwyLDMsNCw1LDZdLCJ1bmtub3du
IjpbXX0sIlBERiB2ZXJzaW9uIjoiMS4zIiwiQ3JlYXRpb25EYXRlIjoiRnJp
IEZlYiAyMiAxNzo1MDowMyAyMDE5IE1TVCIsIlVzZXJQcm9wZXJ0aWVzIjoi
bm8ifQ==

It appears the leading characters eyJGb3JtI are being trimmed from the value. If I downgrade the client side to use protobuf 3.14.0 it never fails in the loop. It only fails when I upgrade the client to use protobuf 3.15.5.

The client is always requesting the same resource in the loop. If I start the loop on the client for resource 1, it runs maybe 10 times and then throws the error. If I run the loop again for resource 1, it runs about 56 times and then throws the error. The server is always returning the same data for resource 1.

The proto definition is:

message Document {
  int32 id = 1;
  string source_id = 2;
  SourceType source_type = 3;
  string document_type = 4;
  int32 file_size = 5;
  string content_type = 6;
  string json_metadata = 7; // base64 encoded
  string filename = 8;
  int32 uploaded_by_user_id = 9;
  string uploaded_by_user_role = 10;
  google.protobuf.Timestamp created_at = 11;
  google.protobuf.Timestamp deleted_at = 12;
  string remote_url = 13;
}

Thanks.

@haberman
Member

@heyZeus,

I still wonder if you can reproduce this without gRPC. Can you create a repro that uses the protobuf library only? The info you've given me so far isn't enough to reproduce this problem on my own machine.

@heyZeus
Author

heyZeus commented Mar 23, 2021

It's unclear how to reproduce this without gRPC, but I'll work on it.

@haberman
Member

I would hope that you could grab the payload right before it is passed to protobuf. You might have to edit the gRPC code that calls protobuf, but hopefully that's not too tricky.

If protobuf is truly being inconsistent in how it parses the payload, a loop like I proposed in #8420 (comment) should reproduce the error.
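
One way to do that (a sketch -- where exactly the bytes show up depends on your stubs, and raw_response_bytes here is a placeholder name) is to dump the serialized payload right before the generated code decodes it, then replay it in a standalone loop:

# In the client, wherever the response bytes are handed to protobuf
# (this may mean editing or monkey-patching the generated service code):
File.binwrite('payload.bin', raw_response_bytes)  # raw_response_bytes is hypothetical

# Then, in a separate script with no gRPC involved:
payload = File.binread('payload.bin')
loop do
  proto = YourProto.decode(payload)
  decoded_metadata = Base64.decode64(proto.json_metadata)
  JSON.parse(decoded_metadata)  # does this eventually fail?
end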

@heyZeus
Author

heyZeus commented Mar 25, 2021

@haberman I created a loop like you proposed in #8420, and it never produced the error. I'll see about editing the gRPC code that calls protobuf. Thanks.

Here is the loop I created:

loop do
  temp_proto = DocumentManagement::Document.new(json_metadata: payload)
  proto = DocumentManagement::Document.decode(DocumentManagement::Document.encode(temp_proto))
  decoded_metadata = Base64.decode64(proto.json_metadata)
  JSON.parse(decoded_metadata)  # Does this eventually fail?
end

Where payload is

payload = %Q(
eyJGb3JtIjoibm9uZSIsIlBhZ2VzIjoiNiIsIlRpdGxlIjoiYmllYmVyX3No
b3J0IiwiQXV0aG9yIjoiQ2xheSBDYXJyaWdhbiIsIlRhZ2dlZCI6Im5vIiwi
Q3JlYXRvciI6IlByZXZpZXciLCJNb2REYXRlIjoiRnJpIEZlYiAyMiAxNzo1
MDowMyAyMDE5IE1TVCIsIlN1YmplY3QiOiJLZXl3b3JkczogICAgICAgIiwi
UGFnZSByb3QiOiIwIiwiUHJvZHVjZXIiOiJtYWNPUyBWZXJzaW9uIDEwLjE0
LjMgKEJ1aWxkIDE4RDEwOSkgUXVhcnR6IFBERkNvbnRleHQiLCJTdXNwZWN0
cyI6Im5vIiwiRW5jcnlwdGVkIjoibm8iLCJGaWxlIHNpemUiOiIzMzU5NjMg
Ynl0ZXMiLCJPcHRpbWl6ZWQiOiJubyIsIlBhZ2Ugc2l6ZSI6IjYxMiB4IDc5
MiBwdHMgKGxldHRlcikiLCJKYXZhU2NyaXB0Ijoibm8iLCJwYWdlX3NpemVz
Ijp7ImxlZ2FsIjpbXSwibGV0dGVyIjpbMSwyLDMsNCw1LDZdLCJ1bmtub3du
IjpbXX0sIlBERiB2ZXJzaW9uIjoiMS4zIiwiQ3JlYXRpb25EYXRlIjoiRnJp
IEZlYiAyMiAxNzo1MDowMyAyMDE5IE1TVCIsIlVzZXJQcm9wZXJ0aWVzIjoi
bm8ifQ==
).squish.delete(' ')

@haberman
Member

haberman commented Apr 7, 2021

Hi @heyZeus, any luck reproducing this?

It does concern me that your bug is fixed by reverting to an older version. That suggests that there is a real bug here of some kind. If we can reproduce it reliably, I'm confident I can find a fix.

@heyZeus
Author

heyZeus commented Apr 8, 2021

@haberman I've been on vacation and just got back. I hope to try and reproduce tomorrow or Friday. Thanks for reaching out.

@haberman
Member

@heyZeus Are you still running into this?

@heyZeus
Author

heyZeus commented Jun 15, 2021

@haberman Yes, I am still seeing it. I tried upgrading today to v3.17.3. I haven't had time to debug it though. I'm guessing no one else is complaining about this issue?

@Muyinza256

Actually, I have a similar issue. I was using v3.14 to decode something. It was working well until I upgraded to v3.15, and then I started getting parsing errors.

@haberman
Member

@Muyinza256 do you have a repro? With a repro or crash report, there is a high chance of being able to fix the bug. Without a repro, it's very difficult to debug.

@elharo elharo added the ruby label Aug 21, 2021
@xanderman

@haberman I'm picking this up where @heyZeus left off. Using a whole bunch of logging, I've narrowed down where the data is getting corrupted, but I'm still not able to create a repro. I'm hoping this information will be enough to get us further. The problem seems to happen when lists of protos are concatenated. Let me explain with a high-level overview of our code.

The client code does 3 or 4 lookups to the document server, concatenates all the results, then creates wrapper objects for each document -- which is the point at which the Base64 decode and JSON parse happen. At a high level it looks something like this:

documents = fetch_some_docs + fetch_some_more + fetch_a_third_type
if some_condition
  documents += fetch_a_fourth_type
end
documents.map { |d| create_wrapper(d, other_superfluous_data) }

Each of the fetch methods calls the documents service with different parameters and returns message.documents (the return proto of the service has a repeated Document documents field). By inserting logging inside each of the fetch methods and right before documents.map, I can tell that the json_metadata field is valid inside each fetch and only becomes corrupted just after the concatenation. Also, it is always the first document in the final list that is corrupted.
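
For context, each fetch method is shaped roughly like this (a sketch -- the stub and request names are placeholders, not our real code):

def fetch_some_docs
  response = documents_stub.get_documents(some_request)  # gRPC call; names are hypothetical
  response.documents  # a Google::Protobuf::RepeatedField of Document, not a plain Array
end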

To lend further credence, if I change the logic slightly to the below, the problem disappears:

documents = []
documents += fetch_some_docs.map { |d| create_wrapper(d, other_superfluous_data) }
documents += fetch_some_more.map { |d| create_wrapper(d, other_superfluous_data) }
documents += fetch_a_third_type.map { |d| create_wrapper(d, other_superfluous_data) }
if some_condition
  documents += fetch_a_fourth_type.map { |d| create_wrapper(d, other_superfluous_data) }
end
documents

It so happens that for the particular repro case I'm working with, the first and third fetches return no documents, and some_condition is false, so in theory it should be equivalent to do something like [] + fetch_some_more + []. However, I'm not able to repro it so far with this:

docs = fetch_some_more
2000.times do
  a = [] + docs + []
  a.each { |proto| JSON.parse(Base64.decode64(proto.json_metadata)) }
end

@xanderman

Shortly after writing the above, I remembered that in Ruby repeated fields aren't actually arrays, they're RepeatedField. So I wondered if converting to a real array before concatenating would also solve the problem, and sure enough it does. I changed each of the fetch methods, which used to return message.documents, to instead return message.documents.to_a and the problem disappears again. Sadly, doing the reverse translation in my attempt to repro didn't work. This still does not reproduce the error:

docs = fetch_some_docs
2000.times do
  a = Google::Protobuf::RepeatedField.new(:message, DocumentManagement::Document, []) + docs + Google::Protobuf::RepeatedField.new(:message, DocumentManagement::Document, [])
  a.each { |proto| JSON.parse(Base64.decode64(proto.json_metadata)) }
end
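
For reference, the workaround that makes the problem disappear is just the .to_a conversion inside each fetch method, roughly (same hypothetical names as the sketch above):

def fetch_some_docs
  response = documents_stub.get_documents(some_request)
  response.documents.to_a  # copy out of the RepeatedField before anything concatenates it
end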

@haberman
Member

Thanks so much for narrowing this down! I think I have found the bug, and I have a fix.

I'm working on getting the fix into the release. For now do you want to see if this patch fixes the bug?

diff --git a/ruby/ext/google/protobuf_c/repeated_field.c b/ruby/ext/google/protobuf_c/repeated_field.c
index 88e8434f0..5ff3c769a 100644
--- a/ruby/ext/google/protobuf_c/repeated_field.c
+++ b/ruby/ext/google/protobuf_c/repeated_field.c
@@ -551,6 +551,7 @@ VALUE RepeatedField_plus(VALUE _self, VALUE list) {
     RepeatedField* dupped = ruby_to_RepeatedField(dupped_);
     upb_array *dupped_array = RepeatedField_GetMutable(dupped_);
     upb_arena* arena = Arena_get(dupped->arena);
+    Arena_fuse(list_rptfield->arena, arena);
     int size = upb_array_size(list_rptfield->array);
     int i;
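
To make the failure mode concrete, here is a simplified illustration of what the missing fuse could allow (assumed behavior before the fix, not the exact mechanics -- whether you actually see corruption depends on when the GC frees the underlying arena):

combined = fetch_some_docs + fetch_some_more  # RepeatedField#+ dups the left side,
                                              # then appends the right side's elements
# If the right-hand RepeatedField becomes unreachable and is collected, its arena
# could be freed even though `combined` still references messages stored in it.
GC.start
combined.each { |d| d.json_metadata }  # may now read freed memory -> garbled strings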

@xanderman

xanderman commented Sep 13, 2021

I'm trying to verify the patch, and having a devil of a time getting it working. I've tried compiling from source and using a path: option in my Gemfile, and then tried gem 'google-protobuf', github: 'protocolbuffers/protobuf', branch: '3.18.x' after seeing that your patch is merged, and nothing's working right. I'm either getting cannot load such file -- google/protobuf/descriptor_pb or .../ruby/lib/google/protobuf/descriptor_pb.rb:6:in '<main>': undefined method 'build' for #<Google::Protobuf::DescriptorPool:0x00007fccaec2c220> (NoMethodError). I'm not sure what to do about that. Is there a released version I can use instead?

@haberman
Member

Sorry about that. The fix is in 3.18.0, which was just released: https://rubygems.org/gems/google-protobuf/versions/3.18.0-x86_64-linux

Want to give it a try and see if it fixes your problem?
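
With Bundler that would be something along the lines of:

# Gemfile
gem 'google-protobuf', '>= 3.18.0'

followed by a bundle update google-protobuf.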

@xanderman

I'm a few thousand iterations into a loop of 20k, with no issues so far. It's always failed in the first few hundred before, so I'd say that it's fixed. Thank you!

@haberman
Member

Excellent, thanks for your patience and for creating a helpful repro.
