
0.4.0 stack usage increase? #484

Open

zacwbond opened this issue Jan 24, 2020 · 8 comments

@zacwbond commented Jan 24, 2020

Recently I upgraded some code to use nanopb 0.4.0, but I'm seeing significantly higher stack usage with the new version. We're building for an embedded ARM-based microcontroller (STM32F4), compiling with GCC at optimization level -O2.

Our compile options for nanopb are:

-DPB_FIELD_16BIT
-DPB_WITHOUT_64BIT
-DPB_NO_ERRMSG

One example is a schema with the following definition:

message ItemList
{
    repeated uint32 ids  = 1;
    uint32 version       = 2;
}

We specify a maximum length in the options file:

ItemList.ids max_count  : 16

For that reason, decoding the ItemList pb is very simple:

bool LoadModel(uint8_t* buffer, size_t size, ItemList* model)
{
    pb_istream_t iStream =
        pb_istream_from_buffer(buffer, size);
    return pb_decode(&iStream, ItemList_fields, model);
}

Here's the 5-byte buffer that is being decoded:

0A 01 00 10 01

In my experiments, nanopb 0.4.0 used about 60% more stack than the old version we were using, going from 236 to 376 bytes of stack to decode the buffer above. The old version was a pull of master from late 2018 (git describe reports nanopb-0.4.0-dev-32-g580d7c8).

I also ran a similar test with a much more complicated schema (see below) and got similar results: an increase from 884 bytes to 1445 bytes to decode the message.

Is this change intentional/expected? Are there any options I can set to reduce stack usage?

Schema for the more complicated message:

message Property
{
    enum Mode
    {
        A = 1;
        B = 2;
        C = 3;
    }

    oneof field
    {
        Mode   DeviceA_Mode     = 1; 
        uint32 DeviceA_Size     = 2;
        uint32 DeviceA_Length   = 3; 

        Mode   DeviceB_Mode     = 4; 
        uint32 DeviceB_Size     = 5;
        uint32 DeviceB_Length   = 6; 
    }
}

message SettingsGroup
{
    message Settings
    {
        message Command
        {
            uint32 label                 = 1; 
            repeated Property properties = 2;
        }
    
        uint32      id      = 1; 
        string      name    = 2; 
        uint32      flags   = 3; 
        bool        en      = 5; 
        Command     begin   = 6; 
        Command     end     = 7; 
    }

    Settings settings   = 1;
    uint32 version      = 2;
}

Options:

* anonymous_oneof: true
* long_names     : false
SettingsGroup.Settings.name               max_length : 32
SettingsGroup.Settings.Command.properties max_count  : 6
@PetteriAimonen (Member)

It's somewhat expected, since there hasn't been much optimization work since the large changes in 0.4.0. But a 60% increase does sound like a bit much.

@mselseth

I'm seeing a similar issue. Last week I switched from 0.3.6 to the latest nanopb 0.4.0 from master.
Like zacwbond, I could revert to nanopb-0.4.0-dev-32-g580d7c8.
Which is the last commit after nanopb-0.4.0-dev-32-g580d7c8 that still has the old RAM usage?

We were defining PB_FIELD_16BIT with the 0.3.6 code.
Note that the latest nanopb defaults to 16-bit field descriptors, so the PB_FIELD_16BIT define no longer has any effect.

@PetteriAimonen (Member)

This is the last build before the merge of the new field descriptor format:
http://raichu:50140/jenkins/job/nanopb-binary-windows/279/
As a side effect of the merge mistake that also caused #388, the version number shown there is "0.3.9.2-33-g580d7c8", even though it is actually built from the 0.4 development branch.

That 580d7c8 commit is the last one before the merge.

@PetteriAimonen (Member)

Ok, I've now looked into this a bit and the stack usage is indeed ridiculous.

I'm not yet sure exactly why, but it seems to boil down to compilers treating stack space as essentially free. For example, in pb_decode_ex() the field iterator that is only used for initializing default values is allocated on the stack. Even though it goes out of scope, that space remains allocated during all the rest of the execution, because keeping it reserved saves one addition instruction before the function call.

I'll try to get some basic fixes into 0.4.1, which should go out in February, and will look at bringing stack usage down further later on.

PetteriAimonen added a commit that referenced this issue Jan 31, 2020
Gives about 15% stack usage reduction.
@mselseth

PetteriAimonen, thank you for the feedback.

PetteriAimonen added a commit that referenced this issue Feb 1, 2020
Moving the message initialization to pb_decode_inner()
avoids allocating a second pb_field_iter_t on the stack.
shreyasbharath pushed a commit to halter-corp/nanopb that referenced this issue Apr 10, 2020
@maximevince

I have a similar question about stack usage. I am using GCC 9.2.1 with -Os and nanopb 0.4.4.

When decoding a fairly large protobuf message into a statically allocated struct (target size ~925 bytes),
stack usage seems very large for the tiny Cortex-M0 running this application.

This is the GDB backtrace:

(gdb) bt
#0  0x10059896 in load_descriptor_values (iter=iter@entry=0x20000700 <snt_task_stack+148>) at ../../../libs/nanopb/pb_common.c:14
#1  0x100599e8 in pb_field_iter_begin (iter=iter@entry=0x20000700 <snt_task_stack+148>, desc=0x10067c44 <firmware_version_t_msg>, message=0x20001168 <profile+312>) at ../../../libs/nanopb/pb_common.c:163
#2  0x10058eb4 in pb_decode_inner (stream=stream@entry=0x20000778 <snt_task_stack+268>, fields=<optimized out>, dest_struct=<optimized out>, flags=0) at ../../../libs/nanopb/pb_decode.c:986
#3  0x10058a6c in pb_dec_submessage (field=0x20000810 <snt_task_stack+420>, stream=0x20000888 <snt_task_stack+540>) at ../../../libs/nanopb/pb_decode.c:1594
#4  decode_basic_field (stream=stream@entry=0x20000888 <snt_task_stack+540>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000810 <snt_task_stack+420>) at ../../../libs/nanopb/pb_decode.c:451
#5  0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000888 <snt_task_stack+540>) at ../../../libs/nanopb/pb_decode.c:538
#6  decode_field (stream=0x20000888 <snt_task_stack+540>, wire_type=<optimized out>, field=0x20000810 <snt_task_stack+420>) at ../../../libs/nanopb/pb_decode.c:792
#7  0x10059072 in pb_decode_inner (stream=stream@entry=0x20000888 <snt_task_stack+540>, fields=<optimized out>, dest_struct=<optimized out>, flags=0) at ../../../libs/nanopb/pb_decode.c:1089
#8  0x10058a6c in pb_dec_submessage (field=0x20000920 <snt_task_stack+692>, stream=0x20000998 <snt_task_stack+812>) at ../../../libs/nanopb/pb_decode.c:1594
#9  decode_basic_field (stream=stream@entry=0x20000998 <snt_task_stack+812>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000920 <snt_task_stack+692>) at ../../../libs/nanopb/pb_decode.c:451
#10 0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000998 <snt_task_stack+812>) at ../../../libs/nanopb/pb_decode.c:538
#11 decode_field (stream=0x20000998 <snt_task_stack+812>, wire_type=<optimized out>, field=0x20000920 <snt_task_stack+692>) at ../../../libs/nanopb/pb_decode.c:792
#12 0x10059072 in pb_decode_inner (stream=stream@entry=0x20000998 <snt_task_stack+812>, fields=<optimized out>, dest_struct=<optimized out>, flags=1) at ../../../libs/nanopb/pb_decode.c:1089
#13 0x10058a6c in pb_dec_submessage (field=0x20000a30 <snt_task_stack+964>, stream=0x20000ab0 <snt_task_stack+1092>) at ../../../libs/nanopb/pb_decode.c:1594
#14 decode_basic_field (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000a30 <snt_task_stack+964>) at ../../../libs/nanopb/pb_decode.c:451
#15 0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000ab0 <snt_task_stack+1092>) at ../../../libs/nanopb/pb_decode.c:538
#16 decode_field (stream=0x20000ab0 <snt_task_stack+1092>, wire_type=<optimized out>, field=0x20000a30 <snt_task_stack+964>) at ../../../libs/nanopb/pb_decode.c:792
#17 0x10059072 in pb_decode_inner (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, fields=<optimized out>, dest_struct=dest_struct@entry=0x20001030 <profile>, flags=flags@entry=0) at ../../../libs/nanopb/pb_decode.c:1089
#18 0x100590e8 in pb_decode (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, fields=<optimized out>, dest_struct=dest_struct@entry=0x20001030 <profile>) at ../../../libs/nanopb/pb_decode.c:1165
#19 0x1005d5b0 in pb_profile_decode (dst_profile=0x20001030 <profile>, src_pb=src_pb@entry=0x2000182c <smalloc_pool> "\b\356\237\360\326\v\022\006\066\065\061\063\067e\032\004\b\001\020\t\"P\n\020\b\254\002\020(\030\310\001 \002(\024\060\003H\001\022\022\b<\020", 
    len=len@entry=578) at ../../../libs/libsnt/src/protobuf/profile_protobuf.c:79
<...>

pb_decode_inner is called recursively and uses quite a lot of stack.
In this case I calculated total nanopb stack usage to be about 1040 bytes.

Is this to be expected? Is there a way to work around it?
Would using FT_CALLBACK help, or does that not change stack usage?

Thanks,
Maxime

@PetteriAimonen (Member)

@maximevince Yeah, it is to be expected that stack usage grows in relation to the number of submessage levels.

Callbacks don't directly make a difference. In the long term it would be better for nanopb to store only the required information (buffer pointer etc.) at each recursion level, instead of the full stack frame needed by a recursive C function, but that is quite a large change.

I'm not aware of any particularly effective workaround, other than restructuring the .proto to have fewer submessage levels.
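As one hypothetical illustration of that workaround, applied to the schema from the first comment: inlining the fields of the Command submessage directly into Settings removes one pb_decode_inner() recursion level per decode. The field names and numbers below are illustrative, not from the original schema:

```protobuf
// Hypothetical flattened variant of SettingsGroup.Settings: the two
// Command submessages (begin/end) are inlined as plain fields, so the
// decoder recurses one level less.
message Settings
{
    uint32 id    = 1;
    string name  = 2;
    uint32 flags = 3;
    bool   en    = 5;

    uint32 begin_label                 = 6;
    repeated Property begin_properties = 7;
    uint32 end_label                   = 8;
    repeated Property end_properties   = 9;
}
```

The trade-off is a less structured schema and renumbered fields, so this only makes sense for new messages or where wire compatibility can be broken.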

@maximevince

@PetteriAimonen thanks for your swift reply. Now that we know the limitations, we can work with/around them.
Thanks for your excellent project, by the way.
