Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proto2 protobuf cannot handle enums with value higher than 31 during serialization in Python #10950

Closed
tinix0 opened this issue Nov 9, 2022 · 16 comments
Assignees
Labels

Comments

@tinix0
Copy link

tinix0 commented Nov 9, 2022

What version of protobuf and what language are you using?
Version: protoc 3.21.9
Language: Python - protobuf 4.21.9

OS: Windows 11

Interpreter: Python 3.11.0

What did you do?

  1. Create proto2 protobuf that contains Enum with values larger than 31
  2. Compile using protoc
  3. Instantiate the protobuf in code and set the enum field to value larger than 31
  4. Serialize the protobuf to string
  5. Deserialize the protobuf

What did you expect to see
The deserialized protobuf has the enum field set to the value larger than 31

What did you see instead?
The deserialized protobuf has the field clear (the serialized form does not contain the value at all).

Anything else we should know about your project / environment
I can provide minimal reproduction example if desired, but it was pretty simple to reproduce. The issue was also reproduceable on protobuf lib versions as low as 4.21.3 and with grpcio-tools as low as 1.49.1. Lastly, switching the syntax to proto3 does fix the issue, however it is currently not viable for us.

@tinix0 tinix0 added the untriaged auto added to all issues by default when created. label Nov 9, 2022
@tinix0 tinix0 changed the title /roto2 protobuf cannot handle enums with value higher than 31 during serialization in Python Proto2 protobuf cannot handle enums with value higher than 31 during serialization in Python Nov 9, 2022
@acozzette acozzette added python and removed untriaged auto added to all issues by default when created. labels Nov 10, 2022
@acozzette
Copy link
Member

Thank you for the bug report. This sounds like a potentially serious bug. Would you mind providing the reproducible example?

@acozzette acozzette assigned zhangskz and haberman and unassigned zhangskz Nov 10, 2022
@haberman
Copy link
Member

haberman commented Nov 10, 2022

Yes a repro would be very helpful here. I was not able to reproduce it:

// test.proto
syntax = "proto2";

enum E {
  ONE = 1;
  THIRTY_ONE = 31;
  THIRTY_TWO = 32; 
  ONE_HUNDRED = 100; 
}       
        
message M {
  optional E e = 1;
} 
# test.py
from test_pb2 import M, E
import unittest

class TestEnumValueSerialize(unittest.TestCase):
    def do_test_val(self, val):
        m1 = M(e = val)
        serialized = m1.SerializeToString()
        m2 = M()
        m2.ParseFromString(serialized)
        self.assertEqual(val, m2.e)
  
    def test_enum_values(self):
        self.do_test_val(E.ONE)
        self.do_test_val(E.THIRTY_ONE)
        self.do_test_val(E.THIRTY_TWO)
        self.do_test_val(E.ONE_HUNDRED)

if __name__ == '__main__':
    unittest.main()

Output:

$ venv/bin/python test.py --verbose
test_enum_values (__main__.TestEnumValueSerialize) ... ok

----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

@tinix0
Copy link
Author

tinix0 commented Nov 10, 2022

Hello, here is the repro with output. Note that I have also seen the issue with holes in the numeric ranges, but the total number of values was still larger than 31.

// test.proto
syntax = "proto2";
package Test;

message Person {
  
    enum PhoneType {
      blaa = 0;
      HOME = 1;
      WORK = 2;
      a = 3;
      b = 4;
      c = 5;
      d = 6;
      e = 7;
      f = 8;
      g = 9;
      h = 10;
      i = 11;
      j = 12;
      k = 13;
      l = 14;
      m = 15;
      n = 16;
      o = 17;
      p = 18;
      q = 19;
      r = 20;
      s = 21;
      t = 22;
      u = 23;
      v = 24;
      w = 25;
      x = 26;
      y = 27;
      z = 28;
      aa = 29;
      ab = 30;
      ac = 31;
      ad = 32;
      ae = 33;

    }
  
    optional PhoneType TaskType = 1;
    optional string Name = 2;
  }
# main.py
import test_pb2


person = test_pb2.Person()
person.TaskType = 32
person.Name = "Name1"
serialized = person.SerializeToString()
personDe = test_pb2.Person()
print("a", person)
print("---")
personDe.ParseFromString(serialized)
print("b", personDe)
print("---")

person2 = test_pb2.Person()
person2.TaskType = 31
person2.Name = "Name2"
serialized2 = person2.SerializeToString()
personDe2 = test_pb2.Person()
print("a", person2)
print("---")
personDe2.ParseFromString(serialized2)
print("b", personDe2)

Ouput:

a TaskType: ad
Name: "Name1"

b Name: "Name1"

a TaskType: ac
Name: "Name2"

b TaskType: ac
Name: "Name2"

@haberman
Copy link
Member

This must be a Windows-specific bug. When I try your repro on Linux, I get:

a TaskType: ad
Name: "Name1"

---
b TaskType: ad
Name: "Name1"

---
a TaskType: ac
Name: "Name2"

---
b TaskType: ac
Name: "Name2"

I suspect this is related to long being only 32-bits on Windows. I will investigate further.

@ericsalo
Copy link
Member

I also cannot reproduce the bug on (64-bit) Linux.

@tinix0
Copy link
Author

tinix0 commented Nov 10, 2022

I can also confirm that I cannot reproduce it inside WSL. It does indeed seem to be Windows specific (both machines I tried this on were either Windows 11 or Windows 10).

@haberman
Copy link
Member

Can you print out len(serialized)? I suspect it will be non-zero, but want to verify.

(I don't have quick access to a Windows machine to verify).

@tinix0
Copy link
Author

tinix0 commented Nov 10, 2022

len(serialized): 9
len(serialized2): 9
serialized in base64: 'CCASBU5hbWUx'
serialized2 in base64: 'CB8SBU5hbWUy'

@haberman
Copy link
Member

Great, thanks. It appears to be serializing correctly. When we parse it though, the decoder thinks the value (32) is not in the enum for some reason. In proto2, when you parse an unrecognized enum value, it goes into the unknown field set, which is what appears to be happening here.

The only question now is why the decoder thinks that 32 is an unknown value for this enum.

@haberman
Copy link
Member

This is the code performing the check. The bug is likely here or nearby, but I can't see it: https://github.com/protocolbuffers/upb/blob/f45eeec625a1b5b4514683aba4b38d1864f0e211/upb/decode.c#L410-L440

In particular, this check looks correct to me. We cast 1ULL to unsigned long long, so it should be 64-bit even on Windows, which has 32-bit longs: https://github.com/protocolbuffers/upb/blob/f45eeec625a1b5b4514683aba4b38d1864f0e211/upb/decode.c#L437

@tinix0
Copy link
Author

tinix0 commented Nov 10, 2022

Another helpful thing I can point out is that this is an regression introduced in 4.21.x. Protobuf 3.20.3 deserializes the message correctly.

@haberman
Copy link
Member

Another helpful thing I can point out is that this is an regression introduced in 4.21.x.

That makes perfect sense, as 4.21.x introduced a completely rewritten native extension. More info here: https://developers.google.com/protocol-buffers/docs/news/2022-05-06#python-updates

@ericsalo
Copy link
Member

I would be mildly surprised if the bug turned out to be in decode.c since that code has not changed much at all. I'm wondering whether this may be related to the mini descriptors.

@haberman
Copy link
Member

It's true that decode.c has not changed much, but we also have much less usage on Windows than other platforms.

Keep in mind that we're releasing from the 21.x branch, which has not gotten any of the recent changes around MiniDescriptors: https://github.com/protocolbuffers/upb/tree/21.x/upb

@haberman
Copy link
Member

I have a root cause and a fix for the bug. The problem was in the UPB_LIKELY() macro, which uses __builtin_expect(). The arguments to __buitin_expect() are long, so on Windows this truncates the arguments to 32 bits.

The fix is to cast the argument to bool:

--- a/upb/port_def.inc
+++ b/upb/port_def.inc
@@ -96,8 +96,8 @@
 
 /* Hints to the compiler about likely/unlikely branches. */
 #if defined (__GNUC__) || defined(__clang__)
-#define UPB_LIKELY(x) __builtin_expect((x),1)
-#define UPB_UNLIKELY(x) __builtin_expect((x),0)
+#define UPB_LIKELY(x) __builtin_expect((bool)(x),1)
+#define UPB_UNLIKELY(x) __builtin_expect((bool)(x),0)
 #else
 #define UPB_LIKELY(x) (x)
 #define UPB_UNLIKELY(x) (x)

copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Nov 16, 2022
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Nov 16, 2022
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Nov 16, 2022
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Nov 17, 2022
copybara-service bot pushed a commit to protocolbuffers/upb that referenced this issue Nov 17, 2022
@ericsalo
Copy link
Member

A fix for this has just been pushed so I am closing.

ericsalo added a commit to protocolbuffers/upb that referenced this issue Nov 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants