Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nanopb breaks Clang optimizations by violating strict aliasing on _Bool #434

Closed
saleemrashid opened this issue Oct 1, 2019 · 23 comments
Closed

Comments

@saleemrashid
Copy link

According to the C99 standard, _Bool will only be assigned 0 or 1. However, Nanopb violates strict aliasing rules by using an int_least8_t pointer to write to the underlying memory. This allows the _Bool to contain a value greater than one.

🚨 UNDEFINED BEHAVIOUR ALERT 🚨

Clang heavily relies on this assumption. For example, message.bool_field ? X : Y is often optimized into message.bool_field ^ N, causing bugs in a codebase using Nanopb.

@invd
Copy link

invd commented Oct 1, 2019

Some background: we have run into this problem while fuzzing an x86_64 binary that was built with recent versions of clang and includes the stable nanopb-0.3.9.3 release.

The specific code issue in the target code is triggered via the nanopb undefined behavior in combination with the relevant clang optimizations at -O2 and above. The optimization levels -O0 and -O1 do not appear to include the relevant clang optimization and did not trigger this problem.

@PetteriAimonen
Copy link
Member

I've seen a similar effect myself with GCC before:
https://stackoverflow.com/questions/27661768/weird-results-for-conditional-operator-with-gcc-and-bool-pointers

But yeah, I didn't realize this could occur when deserializing bool fields. Not sure if better option is to separate bool deserialization from other varints, or to make custom pb_bool_t as in issue #287.

@PetteriAimonen
Copy link
Member

After considering this, I think most reasonable approach is to add separate decoder for bool in 0.3.9.4 as that has least worries about backwards compatibility.

Whether to go pb_bool_t route for 0.4.0, I'm still not sure. It feels annoying to introduce a separate type, but the problems in #287 crop up from time to time also. But I guess better to solve one issue at a time, add the separate decoder for 0.4.0 also and worry about #287 later.

@PetteriAimonen
Copy link
Member

Hmm, the reason why this isn't picked up by existing fuzzing is here:
https://github.com/nanopb/nanopb/blob/master/tests/SConstruct#L106

The bool sanitizer was skipped for fuzz test because the fuzzer writes random stuff to the structure being encoded also, and encoder reads that via bool pointer. I guess that is something to be fixed also, as it's undefined behavior anyway and the input structure to pb_decode() is declared as mostly untrusted.

PetteriAimonen added a commit that referenced this issue Oct 2, 2019
Previously nanopb didn't enforce that decoded bool fields
had valid true/false values. This could lead to undefined
behavior in user code.

This has potential security implications when

1) message contains bool field (has_ fields are safe)
and
2) user code uses ternary operator dependent on the field value,
   such as: int value = msg.my_bool ? 1234 : 0
and
3) the value returned from ternary operator affects a memory access,
   such as: data_array[value] = 9999
PetteriAimonen added a commit that referenced this issue Oct 2, 2019
Previously nanopb didn't enforce that decoded bool fields
had valid true/false values. This could lead to undefined
behavior in user code.

This has potential security implications when

1) message contains bool field (has_ fields are safe)
and
2) user code uses ternary operator dependent on the field value,
   such as: int value = msg.my_bool ? 1234 : 0
and
3) the value returned from ternary operator affects a memory access,
   such as: data_array[value] = 9999
@PetteriAimonen
Copy link
Member

Fixed now both in master and maintenance_0.3 branches.

I'll be publishing the fix in 0.3.9.4 soon, but if you could try with your fuzz tests also, that would be great.

@invd
Copy link

invd commented Oct 2, 2019

@PetteriAimonen: Thank you for the swift fix.
At the moment, it looks like the nanopb generator is running into build problems:

TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "<removed-filename>.proto":
  <removed-type>.node: ".<removed-type>" is not defined.

(<removed> sections have been removed from the output)

At the moment, this looks like a regression in the current master that was introduced before the patches for this issue. I'll look into it later today.

@PetteriAimonen
Copy link
Member

Can you post more of the backtrace from that error? i.e. where in nanopb_generator.py that occurs?

If it is in default_value() at around line 1146, it would also be important to know the types of the fields involved.

@invd
Copy link

invd commented Oct 2, 2019

@PetteriAimonen: I can confirm that the target binary no longer runs into the undefined behavior problem when built with nanopb from the maintenance_0.3 branch. The mentioned nanopb generator errors only turn up on master.

Backtrace:

Traceback (most recent call last):
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1908, in <module>
    main_cli()
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1809, in main_cli
    results = process_file(filename, None, options)
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1772, in process_file
    headerdata = ''.join(f.generate_header(includes, headerbasename, options))
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1435, in generate_header
    yield msg.fields_declaration(self.dependencies) + '\n'
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1031, in fields_declaration
    defval = self.default_value(dependencies)
  File "../../vendor/nanopb/generator/nanopb_generator.py", line 1146, in default_value
    desc = google.protobuf.descriptor.MakeDescriptor(optional_only)
  File "/usr/local/lib/python3.7/dist-packages/google/protobuf/descriptor.py", line 1016, in MakeDescriptor
    _message.default_pool.Add(file_descriptor_proto)

I plan to provide more information later today.

@PetteriAimonen
Copy link
Member

PetteriAimonen commented Oct 2, 2019

Yeah, those generator errors have been appearing on the master branch before, though I can't know what's causing it this time. It's due to the new default value generation that seems to be a bit prone to break with particular field types.

I guess "<removed-type>" is some custom type? But messages and enums are already handled specially.

@invd
Copy link

invd commented Oct 2, 2019

@PetteriAimonen: it might make sense to move this to a new Github issue.

Here is the specific type:

TypeError: Couldn't build proto file into descriptor pool!
Invalid proto descriptor for file "7feec53021e50c950c04a15a3f990191.proto":
  GetPublicKey.script_type: Couldn't parse default value "SPENDADDRESS".
  GetPublicKey.script_type: ".InputScriptType" is not defined.

This is related to https://github.com/trezor/trezor-firmware/tree/master/legacy/firmware/protob .

@PetteriAimonen
Copy link
Member

Hmm, I don't get that error with those files. I did get an error about unhandled field type 'MESSAGE' when compiling with only some files present, but that seems unrelated.

@invd
Copy link

invd commented Oct 2, 2019

@PetteriAimonen:

  File "../../vendor/nanopb/generator/nanopb_generator.py", line 600, in data_size
    raise Exception("Unhandled field type: %s" % self.pbtype)
Exception: Unhandled field type: MESSAGE

was also seen with the recent master, but not on my specific build environment. I'm assuming that 0.4.x still has a few regressions to work out.

However, it appears that the current fix for the bool handling is faulty, as my target binary runs into a segfault through the null pointer:

==26033==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x000000000000 <removed> T0)
==26033==Hint: pc points to the zero page.
==26033==The signal is caused by a READ memory access.
==26033==Hint: address points to the zero page.

This happens on 252f419 , but not on 7b0a42f .

@PetteriAimonen
Copy link
Member

PetteriAimonen commented Oct 2, 2019

Yep, that's mostly why 0.4.0 has been stuck on master branch for a year now and not released, I haven't had time to test it enough that I would trust it enough to release. That particular bug about MESSAGE should be now fixed.

I guess you'll have to look a bit deeper into the two issues you are currently having, as with the information you've given I can't reproduce them here.

One possible cause for that jump to null pointer could be if you haven't recompiled some of the source code files, as the LTYPE numbering changed.

@invd
Copy link

invd commented Oct 3, 2019

I've further debugged the segfault.

Backtrace:

0x0000000000000000 in ?? ()
(gdb) bt
#0  0x0000000000000000 in ?? ()
#1  0x00000000006f68fd in encode_basic_field (stream=0x7fffffffd300, field=0x806f13 <TxRequest_fields+19>, pData=0x117648c <resp+12>) at pb_encode.c:364
#2  0x00000000006f358e in encode_field (stream=0x7fffffffd300, field=0x806f13 <TxRequest_fields+19>, pData=0x117648c <resp+12>) at pb_encode.c:428
#3  0x00000000006f2e0d in pb_encode (stream=0x7fffffffd300, fields=0x806f00 <TxRequest_fields>, src_struct=0x1176480 <resp>) at pb_encode.c:515
#4  0x00000000006f46b5 in pb_get_encoded_size (size=0x7fffffffd480, fields=0x806f00 <TxRequest_fields>, src_struct=0x1176480 <resp>) at pb_encode.c:542

Stepping through to the error:

Thread 1 "binary.elf" hit Breakpoint 1, encode_basic_field (stream=0x7fffffffd300, field=0x806f00 <TxRequest_fields>, pData=0x806f02 <TxRequest_fields+2>) at pb_encode.c:314
314     {
(gdb) next
316         bool implicit_has;
(gdb)
317         const void *pSize = &implicit_has;
(gdb)
319         func = PB_ENCODERS[PB_LTYPE(field->type)];
(gdb)
321         if (field->size_offset)
(gdb) print func
$2 = (pb_encoder_t) 0x0
324             pSize = (const char*)pData + field->size_offset;
(gdb)
325         }
(gdb)
337         if (PB_ATYPE(field->type) == PB_ATYPE_POINTER)
(gdb)
347         switch (PB_HTYPE(field->type))
(gdb)
359                 if (*(const bool*)pSize)
(gdb)
361                     if (!pb_encode_tag_for_field(stream, field))
(gdb)
364                     if (!func(stream, field, pData))
(gdb)

Thread 1 "binary.elf" received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()

For some reason

func = PB_ENCODERS[PB_LTYPE(field->type)];
returns the 0x0 as the function pointer that is later called, leading to the crash.

This is the type in question:

(gdb) print field->type
$1 = 24 '\030'

(Side note: gdb reports the line numbers with an offset, but as far as I can see, everything else is normal and relates to the linked commit.)

@PetteriAimonen
Copy link
Member

That field->type has value 0x18, i.e. LTYPE of 8. In the array

static const pb_encoder_t PB_ENCODERS[PB_LTYPES_COUNT] = {
that should be &pb_enc_submessage. But there was one entry added to that array in the bool fix, if that is missing from your code base for some reason, that would cause it to read the NULL pointer for extensions instead.

Considering your line numbers are messed up also, it seems like there may be some git merge error or other local modification in your repo. Can you diff your pb_encode.c against the version downloaded straight from github?

@invd
Copy link

invd commented Oct 3, 2019

@PetteriAimonen the problem has been resolved. The mentioned segfault was indeed caused by a local toolchain issue and the upstream code appears to be fine.

The nanopb git repo was on the correct commit (I had double-checked this) without local changes and the hashes over relevant files such as pb_encode.c matched. However, our project Makefile still included a local patch for pb_encode.c that was related to 6d4bd5c and effectively overwrote the relevant source file.

The resulting structures were a bit puzzling, but explain the error behavior:

(gdb) print PB_ENCODERS
$1 = {0x6f8a60 <pb_enc_varint>, 0x6f9170 <pb_enc_uvarint>, 0x6f9880 <pb_enc_svarint>, 0x6f9f90 <pb_enc_fixed32>, 0x6fa040 <pb_enc_fixed64>, 0x6fa0f0 <pb_enc_bytes>, 0x6fa580 <pb_enc_string>, 0x6fa8f0 <pb_enc_submessage>, 0x0, 0x6fac30 <pb_enc_fixed_length_bytes>, 0x0}
(gdb) print PB_DECODERS
$2 = {0x6e8c60 <pb_dec_bool>, 0x6ec810 <pb_dec_varint>, 0x6ed3f0 <pb_dec_uvarint>, 0x6edf30 <pb_dec_svarint>, 0x6eea70 <pb_dec_fixed32>, 0x6eeb20 <pb_dec_fixed64>, 0x6eebd0 <pb_dec_bytes>, 0x6ef5f0 <pb_dec_string>, 0x6eff00 <pb_dec_submessage>, 0x0,
  0x6f04e0 <pb_dec_fixed_length_bytes>}
(gdb) print pb_enc_bool
No symbol "pb_enc_bool" in current context.
(gdb) print pb_dec_bool
$3 = {_Bool (pb_istream_t *, const pb_field_t *, void *)} 0x6e8c60 <pb_dec_bool>

Thanks for looking into this.

@saleemrashid
Copy link
Author

@invd From what I can tell, Nanopb 0.4 will replace PB_DECODERS with switch & case which should be far more robust. Thanks @PetteriAimonen!

@invd
Copy link

invd commented Oct 9, 2019

@PetteriAimonen: I think this issue can be closed. Thanks!

@PetteriAimonen
Copy link
Member

@invd I'll close it once the fix is released, so it's easier to find if someone else hits it.

@prusnak
Copy link
Contributor

prusnak commented Oct 10, 2019

@PetteriAimonen Any ETA for 0.3.9.4 release?

@PetteriAimonen
Copy link
Member

@prusnak I hope to find time to do it in the next few weeks, but currently I'm too busy with for-profit work.

@prusnak
Copy link
Contributor

prusnak commented Oct 10, 2019

Okay, I will use the tip of maintenance_0.3 branch for now. Thank you for all the non-profit hard work you do!

@PetteriAimonen
Copy link
Member

Fix released in 0.3.9.4.

shreyasbharath pushed a commit to halter-corp/nanopb that referenced this issue Apr 10, 2020
shreyasbharath pushed a commit to halter-corp/nanopb that referenced this issue Apr 10, 2020
Previously nanopb didn't enforce that decoded bool fields
had valid true/false values. This could lead to undefined
behavior in user code.

This has potential security implications when

1) message contains bool field (has_ fields are safe)
and
2) user code uses ternary operator dependent on the field value,
   such as: int value = msg.my_bool ? 1234 : 0
and
3) the value returned from ternary operator affects a memory access,
   such as: data_array[value] = 9999
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants