Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document cppreg (zero) impact on performance and code size. #1

Closed
3 tasks done
sendyne-nicocvn opened this issue Feb 22, 2018 · 16 comments
Closed
3 tasks done
Assignees
Labels
discussion Discussion about possible features or revisions.

Comments

@sendyne-nicocvn
Copy link
Contributor

sendyne-nicocvn commented Feb 22, 2018

The main goal is to provide a document (e.g., Performance.md) where we show assembly outputs for various level of optimizations (and possibly compilers) to illustrate that cppreg does not affect runtime performance or code size.

This requires:

  • design a code example with a cppreg implementation and a traditional implementation (à la CMSIS)
  • produce compiler outputs (the target is GCC ARM but others could be added) for Og, O1, O2, O3, Os (although O[1-3] will most likely be identical)
  • write Performance.md

We could put the various materials in a benchmark directory to not pollute the main one.

@sendyne-nicocvn sendyne-nicocvn added the discussion Discussion about possible features or revisions. label Feb 22, 2018
@hak8or
Copy link
Collaborator

hak8or commented Feb 23, 2018

As I was cleaning up an old project of mine, I started thinking it might be a decent example showing a full use case comparing CMSIS and cppreg. On an unrelated note, the minimal example to do a clock tree init and get a LED toggling at ~1hz is only ~280 bytes!

For the performance.md, my idea is to include a godbolt based example for ::write, ::chained_write.with.with, and ::is_set (so four operations total) in terms of the C++ code and assembly output using -Os. This would be used to show that the resulting instructions cannot be optimized further.

Then, a full example, possibly implementing clock tree initialization and then some UART transmissions and LED blinking, with various optimization flags (-O0,-Os,-O1,-O2, -O3, -Og) for cppreg vs CMSIS. The data would be presented in the form of two tables, each with two columns (cppreg vs CMSIS) and 6 rows (optimization levels) table. The first table would be total size of the resulting binary in bytes, and the second table somehow showing how long it takes to do the clock init and a few million loops of the LED toggling and UART sending, in milliseconds or cycles.

@sendyne-nicocvn
Copy link
Contributor Author

Sounds good to me.

The clock tree + UART + LED would be a great example. I assume 280 bytes is without startup code (i.e., no interrupts table and SystemInit) but that would actually be clearer that way. For the data this seems like a good start. Once we collect them we can decide on how to organize them.

I created the performance branch so that we can start putting code in the repository.

@hak8or
Copy link
Collaborator

hak8or commented Feb 23, 2018 via email

@hak8or
Copy link
Collaborator

hak8or commented Mar 6, 2018

Scenario

I decided to heavily leverage godbolt in the comparison because I feel it will be much more accessible to anyone who will want to go through the edit-compile-lookatassembly loop themselves. They won't have to download a compiler, will be easier to show comparisons, etc.

Anyways, for a decent yet simple comparison it would be good to turn a led on, and then in a loop turn the led off, wait a bit turn the led on, and loop back. This is small enough that in terms of code it fits in a screen in terms of lines of code, is easy to understand by everyone, and doesn't bog the example down in terms of setting up a lot of registers (which would be needed to do a full demo including clocking and power gating).

Progress

Here is what I am working with right now (the URL is so long that even neither google nor bit.ly can shorten it!). It looks very similar but there are two interesting differences.

I want to focus a bit on #5 before continuing on writing up a performance comparison. @sendyne-nclauvelin Do let me know if you think the comparison scenario is appropriate though so the first task can be marked as complete.

@hak8or
Copy link
Collaborator

hak8or commented Mar 6, 2018

On a somewhat related note, I wonder if it would be worthwhile to include this in a larger (many registers) comparison somehow as another metric when looking at the assembly alone is not feasible.

@sendyne-nicocvn
Copy link
Contributor Author

sendyne-nicocvn commented Mar 9, 2018

Subsequent to #6, #8 and #9 I think we can resume the performance assessment. I forwarded the performance branch to the current state.

@hak8or
Copy link
Collaborator

hak8or commented Mar 9, 2018

Agreed, will aim to have it in a good state by tonight.

@hak8or
Copy link
Collaborator

hak8or commented Mar 10, 2018

I was working and spotted two "issues" that cause differences.

  1. For registers which have a read only field and you are writing to another field, the bits relevant to the read only field are being cleared. I do not see why there should be more work done (clearing it) compared to just writing the old relevant bits back. I do not recall ever seeing a register that contains a read only field which required writing back 0's to the associated field (which would make it not read only), instead just that all writes were ignored. This does introduce overhead I feel because with cppreg as-is you cannot disable it (write back old values instead of clearing). What was your line of reasoning for clearing the bits?

  2. The lack of expressing related registers (CMSIS style structures mapped to memory) sadly does introduce "overhead" in the form of the compiler sticking two close addresses in .text and then overwriting an old CPU register with the new address (introducing more register pressure), instead to just doing memory interfacing using offsets.

This is the example I am planning to use for the comparison.

Ignoring those two issues, the assembly is identical.

@sendyne-nicocvn
Copy link
Contributor Author

For the second point this relates to #7 and I have first to create an example which isolate the issue to understand a bit better what is exactly happening. For the first point can you provide a minimal example because I am not clear on what you are describing?

@sendyne-nicocvn
Copy link
Contributor Author

Also in your example you have:

UART::STATUS::merge_write<UART::STATUS::Enable>(1).with<UART::STATUS::Sending>(1).done();

Using:

UART::STATUS::merge_write<UART::STATUS::Enable,1>().with<UART::STATUS::Sending,1>().done();

simplifies the generated assembly (look at the L3 branch). This an important detail ... cppreg uses a faster implementation if the data are known at compile time and to indicate that you need to use the template version. This is also true for regular write calls.

At this point the only difference is how offsets between registers is managed in cppreg. Will work on that for now.

@hak8or
Copy link
Collaborator

hak8or commented Mar 10, 2018

Ah, I didn't realize that could be done via template arguments for merge writes (is that part of the new API changes you did recently?).

The minimal example is here. Looking at it again, I realized I misread the assembly originally. It looked like the BICS instruction was clearing bits that are in the readonly field, but turns out it was actually clearing the bits that were being actually written to via the merge write. In that case, I agree that your fix (via template arg) does fix the issue.

When/if the register offsets concept gets put in, then the assembly should finally match for pretty much all use cases, hopefully.

@sendyne-nicocvn
Copy link
Contributor Author

sendyne-nicocvn commented Mar 11, 2018

Ah, I didn't realize that could be done via template arguments for merge writes (is that part of the new API changes you did recently?).

No this was already there. As part of #10 I added more details to the API documentation regarding this particular point.

@hak8or
Copy link
Collaborator

hak8or commented Mar 28, 2018

What do you think as of 08bc98c?

@sendyne-nicocvn
Copy link
Contributor Author

Looks good. I actually prefer that we only present a small example rather than a lengthy and complex one. I will fix some typos and tie it with the README.

@hak8or
Copy link
Collaborator

hak8or commented Mar 28, 2018

Awesome!

@sendyne-nicocvn
Copy link
Contributor Author

The performance comparison is now available in the master branch so I consider this issue closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion Discussion about possible features or revisions.
Projects
None yet
Development

No branches or pull requests

2 participants