Will there be a README or other documentation? #1

Open
travisdowns opened this issue Oct 9, 2016 · 8 comments

Comments

@travisdowns
Contributor

It would be awesome to have a README or documentation on this tool. A lot of what you've described in this answer could simply be copied over.

Are you willing to answer questions about the tool? What's the best forum for it? Issues here on GitHub? Questions on Stack Overflow? Somewhere else?

@obilaniu
Owner

obilaniu commented Oct 9, 2016

@travisdowns Yessir, will get down to writing it. This GitHub repo would probably be the best place to discuss it, either in the README or on the Wiki page. I don't want to just copy over what I wrote there, since I was explaining why the counters must have been set to count in OS mode, but I'll certainly draw on it.

@obilaniu
Owner

obilaniu commented Oct 9, 2016

@travisdowns There's the beginnings of a README.md in the repo now, though much remains unsaid, especially about the kernel code.

@travisdowns
Contributor Author

Awesome, reading it now.

What's the approximate cost of the PFCSTART/PFCEND calls? Do they make a kernel transition, or does the LKM enable user-space setting & reading of the PMC counters?

How does this compare to Agner Fog's testp program:

http://www.agner.org/optimize/#testp

?

How does this compare to PAPI?

I'm actually looking for a lightweight way to time smallish sections of code. My current approach is to use Linux perf, but it doesn't have an API (you could, in principle, use the underlying perf_events syscalls, but I haven't looked into how hard that would actually be). It seems like libpfc could be that lightweight way.
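For reference, here is a minimal sketch of what calling the perf_events syscall directly might look like on Linux. It is not part of libpfc or perf. The helper names perf_open, count_user_instructions and spin are made up for illustration, and the sketch assumes the kernel permits unprivileged user-space counting (it returns -1 otherwise).

```c
#include <assert.h>
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* glibc ships no wrapper for perf_event_open, so go through syscall(2).
 * pid = 0 and cpu = -1 mean "this thread, on whatever CPU it runs on". */
static int perf_open(struct perf_event_attr *attr) {
    return (int)syscall(SYS_perf_event_open, attr, 0, -1, -1, 0UL);
}

/* Hypothetical workload, present only so there is something to count. */
void spin(void *arg) {
    volatile long sink = 0;
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        sink += i;
}

/* Hypothetical helper: counts user-space instructions retired while
 * fn(arg) runs. Returns -1 if the counter cannot be opened or read. */
int64_t count_user_instructions(void (*fn)(void *), void *arg) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof attr;
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;        /* start stopped, enable explicitly below */
    attr.exclude_kernel = 1;  /* count user mode only */
    attr.exclude_hv = 1;

    int fd = perf_open(&attr);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn(arg);
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t got = read(fd, &count, sizeof count);
    close(fd);
    return got == (ssize_t)sizeof count ? (int64_t)count : -1;
}
```

Note that every ioctl and read in this sketch is itself a system call, so the measured section is bracketed by kernel transitions.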

@obilaniu
Owner

obilaniu commented Oct 10, 2016

@travisdowns They are defined here. pfcRemoveBias() automatically computes the costs for the current counter configuration. In particular, both sequences cost precisely the same (assuming add/sub with memory operands cost the same), and both cost 37 instructions, ~240 unhalted core cycles and 0 branches (at least on my systems). There is no other overhead, and no system calls. In my experience, n pairs of PFCSTART() and PFCEND() followed by pfcRemoveBias(, n) produce essentially exact counts; for instance, if they sandwich no code, they'll reliably measure about 0 on all metrics.

The software does allow userspace to write configurations and counts to the hardware MSRs, and makes a kernel transition when doing so. The macros PFCSTART() and PFCEND(), however, employ rdpmc instructions and specifically do not make kernel transitions, which keeps their run-time and cost deterministic. pfcRemoveBias() relies on this determinism to compensate for that cost.

testp is software in the same vein as libpfc, and supports more OSes and more CPUs. But it is not library-based, and its overhead estimation is not as deterministic as mine. The overhead estimation is written in C and involves loops; the code size for this is much greater, it touches more icache lines, it involves loops (and therefore branches), and there is no guarantee that the code for overhead estimation is exactly the same as the code that actually times the hot section. Moreover, the start code and end code are not precisely the same (the former involves an assignment, the latter a subtraction). Lastly, the rdpmc readouts from testp are int, which is 32 bits, while my macros perform full-bitwidth reads as reported by CPUID (on Haswell, 48-bit) and accumulate them into a 64-bit integer. However, testp does correctly set the User bit and clear the Operating System bit, as libpfc does.
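To illustrate why the readout width matters, here is a small sketch comparing a full-width delta with one computed from readouts truncated to 32 bits. It is not code from libpfc or testp. The function names are made up, and the 48-bit width is hard-coded as an assumption taken from the Haswell figure above, whereas real code would query CPUID.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed counter width for this sketch (48 bits, as on Haswell). */
enum { COUNTER_BITS = 48 };
#define COUNTER_MASK ((UINT64_C(1) << COUNTER_BITS) - 1)

/* Events elapsed between two full-width readouts. Masking the difference
 * keeps the result correct even if the counter wrapped once in between. */
uint64_t delta_full(uint64_t start, uint64_t end) {
    assert(start <= COUNTER_MASK && end <= COUNTER_MASK);
    return (end - start) & COUNTER_MASK;
}

/* The same difference when each readout is first squeezed into 32 bits.
 * Any interval of 2^32 events or more silently loses its upper bits. */
uint32_t delta_truncated(uint64_t start, uint64_t end) {
    return (uint32_t)end - (uint32_t)start;
}
```

With start = 1000 and end = 1000 + 5000000000, delta_full returns 5000000000, while delta_truncated returns that value modulo 2^32.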

The PFCSTART() and PFCEND() macros are written in inline assembler. The instructions within them are precisely the same (except for the add/sub distinction), have the same cost and instruction size, and are branchless. pfcRemoveBias() contains a single inline assembler chunk with both of them, to measure their overhead precisely. The PFCSTART() macro subtracts while the PFCEND() macro adds the current readouts of rdpmc, which means you can use multiple pairs to perform fine-grained performance measurements within the code, then invoke pfcRemoveBias(, n) with n equaling the number of such pairs to remove the overhead precisely.
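The subtract-on-start, add-on-end scheme can be sketched in plain C with a simulated counter. Everything below is a stand-in invented for illustration (sim_counter, sim_read, sim_start, sim_end, sim_remove_bias and the tick costs); none of it is libpfc's actual code, which uses rdpmc in inline assembler.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated hardware counter, advanced by hand. */
static uint64_t sim_counter = 0;

/* Assumed fixed cost of one readout, split around the sampling point. */
enum { SIM_PRE = 2, SIM_POST = 3 };

/* One readout: always the same cost, which is what makes bias removal work. */
uint64_t sim_read(void) {
    sim_counter += SIM_PRE;
    uint64_t value = sim_counter;
    sim_counter += SIM_POST;
    return value;
}

/* Start subtracts the readout, end adds it, so the accumulator gains
 * (end - start) per pair. Unsigned wraparound makes the order harmless. */
void sim_start(uint64_t *acc) { assert(acc); *acc -= sim_read(); }
void sim_end(uint64_t *acc)   { assert(acc); *acc += sim_read(); }

/* Stand-in for the code being timed. */
void sim_work(uint64_t ticks) { sim_counter += ticks; }

/* Measure an empty start/end pair, then subtract that cost n times. */
void sim_remove_bias(uint64_t *acc, uint64_t n) {
    uint64_t bias = 0;
    sim_start(&bias);
    sim_end(&bias);
    *acc -= n * bias;
}

/* n sandwiched sections of `ticks` each, accumulated and then corrected. */
uint64_t sim_measure(uint64_t n, uint64_t ticks) {
    uint64_t acc = 0;
    for (uint64_t i = 0; i < n; i++) {
        sim_start(&acc);
        sim_work(ticks);
        sim_end(&acc);
    }
    sim_remove_bias(&acc, n);
    return acc;
}
```

In this model, four sections of 100 ticks each accumulate 420 raw ticks, and removing four pairs' worth of bias leaves 400. An empty sandwich comes out as 0.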

IIRC, PAPI on Linux sits on top of the same perf_events interface that perf uses, in which case it would suffer from the same overcounting problem as perf.

My pfcdemo code should get you started using my library. The "hot section" is where you'd place your code for isolated snippets, but alternatively you can ditch that and use my code as a library within your larger projects. For that, call my initialization, thread-pinning and counter setup code in your main, define a global array of 7 64-bit integers, and sandwich my PFC* macros around any chunk of code you wish to time. Then at program exit call pfcRemoveBias(, n) with the number of times n that chunk of code was executed, divide the counts by n to compute an average, and print out this value.

@travisdowns
Contributor Author

The 240 cycles is for reading all 8 counters, right? Is there an option to only read a subset?

@obilaniu
Owner

@travisdowns Well, technically, 7 counters (3 fixed, 4 general-purpose).

It would be possible to read a subset by hacking the inline asm macros, but I wanted to avoid branches in them, both for predictability and so that the macros themselves don't increment counters (like # of branches encountered and (mis)predicted). Avoiding branches in that code while allowing any subset of 7 counters would require 2^7 versions of the macro, which would be a bit painful.

Is the overhead of 7 counter reads that considerable?

@obilaniu
Owner

@travisdowns One other thing to note: certain performance events can only be counted on certain counters (some L1/L2 events can only be counted in GP1, for instance). I've no idea why.

@ms2pony

ms2pony commented Jul 18, 2021

I can't build by following the README.md. The step "meson.py .. -Dbuildtype=release --prefix=/path/to/prefixdir # Such as $HOME/.local" is hard to understand.


3 participants