Perf difference with Nim 2.0.2 and --threads:on (the default) vs --threads:off #23347
Comments
As a quick update, I have tried the same test on my M1 Mac and found interesting results. First, the difference between
Took 1.35s
Took 1.42s
This seems much more like what one might expect. If I use x64 instead on the M1, I get:
2.58s
1.68s
The huge difference comes back. So something is related to amd64, now replicated on Mac as well.
I have done a little more experimenting and found significant evidence that something related to thread locals is the cause here. To check this, I:
Obviously this is broken when more than one thread is involved; however, the little benchmark is single-threaded, so it is good enough for a with/without comparison. The Nim stable compiler was used for this experiment, on the same PC (amd64). Nearly double the time for some reason.
Looked at the
I am compiling on Windows 10 using a fresh
As another test I compiled with
Did some more investigating. It appears mingw uses emulated TLS on Windows: http://mails.dpdk.org/archives/dev/2020-February/157446.html
This could be wrong, outdated, incomplete, who knows what. I am not a GCC or mingw expert. Just putting it here as part of what I'm finding.
Try compiling inside the MSYS2 clang environment; it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speedup for `--threads:on`.
EDIT: also see #21810
@RSDuck Thanks for the suggestion and the link to that previous issue; I had not seen it. It does appear this is more a consequence of using mingw/gcc on Windows than a Nim issue. Given that this is the default compilation path for Nim on Windows, it seems worth seeing if relatively small steps could mitigate much of the perf hit. Earlier in this issue I suggested some potential low-hanging opportunities to skip some frequently generated checks of the thread-local error bool. I imagine this could reduce the noticeable perf hit by a lot. I don't know if it is actually safe, though, within the error model of Nim's allocator and value types etc. Sorry to bother you @Araq, but could some error checks be safely skippable as I theorize, or are there problems here I'm not aware of?
You're fine with
Maybe this should be documented more explicitly. |
This should be shipped to Windows users by default, not only documented.
Agreed, how do we do that? |
Let me think... When I find more time, I might play with building Nim on Windows in a VM in the MSYS2 clang environment in an automated fashion, and then provide some sort of PR. Is that acceptable?
As a first step, I can propose trying with such a toolchain. I will definitely fire up a Windows 10 VM in the following days and build Nim using it. PS. I'm sorry if my previous comment sounded rude in any way, shape or form.
Description
Hello, I have been working on parsing JSON with Nim in a way I wanted to explore and have found a few things of interest to share and ask about. Everything in this initial post will be with Nim stable (2.0.2).
First, I have found that compiling with `--threads:off` makes execution ~40% faster. This seems higher than I would expect.

The command used: `nim c --mm:arc -d:release --debugger:native -rf .\tests\bench.nim` (then just add `--threads:on` or `--threads:off`).

I have set up a repo for an easy and real reproduction of this here. There are no non-standard-lib deps, so it should be as easy as possible to run. The repo only has:
I have also included the generated C code here.
You can also see the diff going from `--threads:off` to `--threads:on` here. (Note the little source file itself is stripped from a larger project, so the focus is not particularly on what the code does, just that it demonstrates a significant difference in execution time based on compilation flags.)
I had previously opened this issue for reasons related to this code as well.
From what I can tell, there is a significant difference between something here, and IMO `*nimErr_` is a likely candidate. It changes to a thread local with `--threads:on`, and I'm wondering whether that makes dereferencing more costly or whether it prevents the C compiler from doing some optimizations. I don't know the exact source of the performance difference; that is simply a guess.
It would make sense for there to be an unavoidable cost to error checking, however I wanted to present what I'm seeing and learn more about how expected vs unexpected this is.
If there is nothing to be done about the cost of `*nimErr_` with `--threads:on` (assuming it even is the culprit), then I do see some opportunities that could mitigate some of the cost. For example:
- Is it the case that zero-initializing memory always needs to be checked?
- Can the default destroy hook for seqs / objects with just value types or strings set `*nimErr_`? If not, it's a common check that could be skipped. I think the same question applies to things like `deallocShared` and much of `std/bitops` etc.

I am not sure what is and is not possible, what is expected, etc.; however, I thought presenting all of this could be helpful.
Thanks for your time and thoughts.
Nim Version
Nim Compiler Version 2.0.2 [Windows: amd64]
Compiled at 2023-12-15
Copyright (c) 2006-2023 by Andreas Rumpf
active boot switches: -d:release
Current Output
No response
Expected Output
No response
Possible Solution
No response
Additional Information
No response