
0.4.0 stack usage increase? #484

Open

zacwbond opened this issue Jan 24, 2020 · 8 comments

@zacwbond commented Jan 24, 2020

Recently I upgraded some code to use nanopb 0.4.0, but I'm seeing significantly higher stack usage with the new version. We're building for an embedded ARM-based microcontroller (STM32F4), compiling with GCC at optimization level -O2.

Our compile options for nanopb are:

-DPB_FIELD_16BIT
-DPB_WITHOUT_64BIT
-DPB_NO_ERRMSG

One example is a schema with the following definition:

message ItemList
{
    repeated uint32 ids  = 1;
    uint32 version       = 2;
}

We specify a maximum length in the options file:

ItemList.ids max_count  : 16

For that reason, decoding the ItemList pb is very simple:

bool LoadModel(uint8_t* buffer, size_t size, ItemList* model)
{
    pb_istream_t iStream =
        pb_istream_from_buffer(buffer, size);
    return pb_decode(&iStream, ItemList_fields, model);
}

Here's the 5-byte buffer that is being decoded:

0A 01 00 10 01

In my experiments, nanopb 0.4.0 used about 60% more stack than the old version we were using, going from 236 to 376 bytes of stack to decode the buffer above. The old version was a pull of master from late 2018 (git describe reports nanopb-0.4.0-dev-32-g580d7c8).

I also ran a similar test with a much more complicated schema (see below) and got similar results: an increase from 884 bytes to 1445 bytes to decode the message.

Is this change intentional/expected? Are there any options I can set to reduce stack usage?

Schema for the more complicated message:

message Property
{
    enum Mode
    {
        A = 1;
        B = 2;
        C = 3;
    }

    oneof field
    {
        Mode   DeviceA_Mode     = 1; 
        uint32 DeviceA_Size     = 2;
        uint32 DeviceA_Length   = 3; 

        Mode   DeviceB_Mode     = 4; 
        uint32 DeviceB_Size     = 5;
        uint32 DeviceB_Length   = 6; 
    }
}

message SettingsGroup
{
    message Settings
    {
        message Command
        {
            uint32 label                 = 1; 
            repeated Property properties = 2;
        }
    
        uint32      id      = 1; 
        string      name    = 2; 
        uint32      flags   = 3; 
        bool        en      = 5; 
        Command     begin   = 6; 
        Command     end     = 7; 
    }

    Settings settings   = 1;
    uint32 version      = 2;
}

Options:

* anonymous_oneof: true
* long_names     : false
SettingsGroup.Settings.name               max_length : 32
SettingsGroup.Settings.Command.properties max_count  : 6
@PetteriAimonen (Member)

It's somewhat expected, since there hasn't been much optimization work since the large changes in 0.4.0. But a 60% increase does sound like a bit much.

@mselseth

I'm seeing a similar issue. Last week I switched from 0.3.6 to the latest nanopb 0.4.0 from master.
Like zacwbond, I could revert to nanopb-0.4.0-dev-32-g580d7c8.
Which is the last commit after nanopb-0.4.0-dev-32-g580d7c8 that still has the old RAM usage?

We were defining PB_FIELD_16BIT with the 0.3.6 code.
Note that the latest nanopb defaults to 16-bit field descriptors, so the PB_FIELD_16BIT define no longer has any effect.

@PetteriAimonen (Member)

This is the last build before the merge of the new field descriptor format:
http://raichu:50140/jenkins/job/nanopb-binary-windows/279/
As a side effect of the merge mistake that also caused #388, the version number shown there is "0.3.9.2-33-g580d7c8", even though it is actually built from the 0.4 development branch.

That 580d7c8 commit is the last one before the merge.

@PetteriAimonen (Member)

Ok, I've now looked into this a bit and the stack usage is indeed ridiculous.

I'm not yet sure exactly why, but it seems to boil down to compilers treating stack space as essentially free. For example, in pb_decode_ex() the field iterator that is only used for initializing default values is allocated on the stack. Even though it goes out of scope, that space remains allocated during all the rest of the execution, because keeping it reserved saves one addition instruction before the function call.

I'll try to get some basic fixes into 0.4.1, which should go out in February, and will look at bringing stack usage down further later on.

PetteriAimonen added a commit that referenced this issue Jan 31, 2020
Gives about 15% stack usage reduction.
@mselseth

PetteriAimonen, thank you for the feedback.

PetteriAimonen added a commit that referenced this issue Feb 1, 2020
Moving the message initialization to pb_decode_inner()
avoids allocating a second pb_field_iter_t on the stack.
shreyasbharath pushed a commit to halter-corp/nanopb that referenced this issue Apr 10, 2020
@maximevince

I have a similar question about stack usage. I am using GCC 9.2.1 with -Os and nanopb 0.4.4.

When decoding a fairly large protobuf message into a statically allocated struct (target size ~925 bytes),
stack usage seems very large for the tiny Cortex-M0 running this application.

This is the GDB backtrace:

(gdb) bt
#0  0x10059896 in load_descriptor_values (iter=iter@entry=0x20000700 <snt_task_stack+148>) at ../../../libs/nanopb/pb_common.c:14
#1  0x100599e8 in pb_field_iter_begin (iter=iter@entry=0x20000700 <snt_task_stack+148>, desc=0x10067c44 <firmware_version_t_msg>, message=0x20001168 <profile+312>) at ../../../libs/nanopb/pb_common.c:163
#2  0x10058eb4 in pb_decode_inner (stream=stream@entry=0x20000778 <snt_task_stack+268>, fields=<optimized out>, dest_struct=<optimized out>, flags=0) at ../../../libs/nanopb/pb_decode.c:986
#3  0x10058a6c in pb_dec_submessage (field=0x20000810 <snt_task_stack+420>, stream=0x20000888 <snt_task_stack+540>) at ../../../libs/nanopb/pb_decode.c:1594
#4  decode_basic_field (stream=stream@entry=0x20000888 <snt_task_stack+540>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000810 <snt_task_stack+420>) at ../../../libs/nanopb/pb_decode.c:451
#5  0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000888 <snt_task_stack+540>) at ../../../libs/nanopb/pb_decode.c:538
#6  decode_field (stream=0x20000888 <snt_task_stack+540>, wire_type=<optimized out>, field=0x20000810 <snt_task_stack+420>) at ../../../libs/nanopb/pb_decode.c:792
#7  0x10059072 in pb_decode_inner (stream=stream@entry=0x20000888 <snt_task_stack+540>, fields=<optimized out>, dest_struct=<optimized out>, flags=0) at ../../../libs/nanopb/pb_decode.c:1089
#8  0x10058a6c in pb_dec_submessage (field=0x20000920 <snt_task_stack+692>, stream=0x20000998 <snt_task_stack+812>) at ../../../libs/nanopb/pb_decode.c:1594
#9  decode_basic_field (stream=stream@entry=0x20000998 <snt_task_stack+812>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000920 <snt_task_stack+692>) at ../../../libs/nanopb/pb_decode.c:451
#10 0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000998 <snt_task_stack+812>) at ../../../libs/nanopb/pb_decode.c:538
#11 decode_field (stream=0x20000998 <snt_task_stack+812>, wire_type=<optimized out>, field=0x20000920 <snt_task_stack+692>) at ../../../libs/nanopb/pb_decode.c:792
#12 0x10059072 in pb_decode_inner (stream=stream@entry=0x20000998 <snt_task_stack+812>, fields=<optimized out>, dest_struct=<optimized out>, flags=1) at ../../../libs/nanopb/pb_decode.c:1089
#13 0x10058a6c in pb_dec_submessage (field=0x20000a30 <snt_task_stack+964>, stream=0x20000ab0 <snt_task_stack+1092>) at ../../../libs/nanopb/pb_decode.c:1594
#14 decode_basic_field (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, wire_type=wire_type@entry=PB_WT_STRING, field=field@entry=0x20000a30 <snt_task_stack+964>) at ../../../libs/nanopb/pb_decode.c:451
#15 0x10058bb2 in decode_static_field (field=<optimized out>, wire_type=<optimized out>, stream=0x20000ab0 <snt_task_stack+1092>) at ../../../libs/nanopb/pb_decode.c:538
#16 decode_field (stream=0x20000ab0 <snt_task_stack+1092>, wire_type=<optimized out>, field=0x20000a30 <snt_task_stack+964>) at ../../../libs/nanopb/pb_decode.c:792
#17 0x10059072 in pb_decode_inner (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, fields=<optimized out>, dest_struct=dest_struct@entry=0x20001030 <profile>, flags=flags@entry=0) at ../../../libs/nanopb/pb_decode.c:1089
#18 0x100590e8 in pb_decode (stream=stream@entry=0x20000ab0 <snt_task_stack+1092>, fields=<optimized out>, dest_struct=dest_struct@entry=0x20001030 <profile>) at ../../../libs/nanopb/pb_decode.c:1165
#19 0x1005d5b0 in pb_profile_decode (dst_profile=0x20001030 <profile>, src_pb=src_pb@entry=0x2000182c <smalloc_pool> "\b\356\237\360\326\v\022\006\066\065\061\063\067e\032\004\b\001\020\t\"P\n\020\b\254\002\020(\030\310\001 \002(\024\060\003H\001\022\022\b<\020", 
    len=len@entry=578) at ../../../libs/libsnt/src/protobuf/profile_protobuf.c:79
<...>

pb_decode_inner is called recursively and uses quite a lot of stack.
In this case I calculated total nanopb stack usage to be about 1040 bytes.

Is this to be expected? Is there a way to work around it?
Would using FT_CALLBACK help, or does that not change stack usage?

Thanks,
Maxime

@PetteriAimonen (Member)

@maximevince Yeah, it is to be expected that stack usage grows in relation to the number of submessage levels.

Callbacks don't directly make a difference. In the long term it would be better for nanopb to store only the required information (buffer pointer etc.) at each recursion level, instead of the full stack frame needed by a recursive C function, but that is quite a large change.

I'm not aware of any particularly effective workaround, other than restructuring the .proto to have fewer submessage levels.
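As one hypothetical illustration of that workaround, applied to the schema from the first comment: inlining the fields of the Command submessage directly into Settings removes one pb_decode_inner() recursion level per decode. The field names and numbers below are illustrative, not from the original schema:

```protobuf
// Hypothetical flattened variant of SettingsGroup.Settings: the two
// Command submessages (begin/end) are inlined as plain fields, so the
// decoder recurses one level less.
message Settings
{
    uint32 id    = 1;
    string name  = 2;
    uint32 flags = 3;
    bool   en    = 5;

    uint32 begin_label                 = 6;
    repeated Property begin_properties = 7;
    uint32 end_label                   = 8;
    repeated Property end_properties   = 9;
}
```

The trade-off is a less structured schema and renumbered fields, so this only makes sense for new messages or where wire compatibility can be broken.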

@maximevince

@PetteriAimonen thanks for your swift reply. Now that we know the limitations, we can work with/around them.
Thanks for your excellent project, by the way.
