
Perf difference with Nim 2.0.2 and --threads:on (the default) vs --threads:off #23347

guzba opened this issue Feb 24, 2024 · 12 comments

guzba commented Feb 24, 2024

Description

Hello, I have been exploring a different approach to parsing JSON with Nim and have found a few things of interest to share and ask about. Everything in this initial post uses Nim stable (2.0.2).

First, I have found that compiling with --threads:off makes execution ~40% faster than --threads:on. This is a larger difference than I would expect.

nim c --mm:arc -d:release --debugger:native -rf .\tests\bench.nim (then just add --threads:on or --threads:off)

I have set up a repo for an easy and real reproduction of this here. There are no non-standard-library dependencies, so it should be as easy as possible to run.

I have also included the generated C code here.

You can also see the diff going from --threads:off to --threads:on here.

(Note: the small source file is stripped down from a larger project, so the focus is not particularly on what the code does, just that it demonstrates a significant difference in execution time based on compilation flags.)

I had previously opened this issue for reasons related to this code as well.

From what I can tell, something here makes a significant difference, and IMO *nimErr_ is a likely candidate. It becomes thread-local with --threads:on, and I'm wondering whether that makes dereferencing more costly or prevents the C compiler from performing some optimizations.

I don't know the exact source of the performance difference; that is simply a guess.
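
To make the hypothesis concrete, here is a minimal sketch (all names are hypothetical except nimErrorFlag, which appears in the generated code) of the pattern the generated C appears to follow, with the error flag as either a plain global or a GCC __thread variable:

#include <stdbool.h>
#include <stdio.h>

/* --threads:off roughly corresponds to a plain global flag: */
static bool errFlagGlobal;
/* --threads:on makes it thread-local (NIM_THREADVAR ~ __thread on GCC): */
static __thread bool errFlagTls;

static bool *errorFlag(void) {
  /* Under MinGW's emulated TLS, taking this address becomes a call to
     __emutls_get_address rather than a cheap segment-relative load. */
  return &errFlagTls;
}

static long work(long n) {
  bool *err = errorFlag(); /* fetched once per function, like nimErr_ */
  long acc = 0;
  for (long i = 0; i < n; i++) {
    acc += i;
    if (*err) return -1; /* the per-operation check in question */
  }
  return acc;
}

int main(void) {
  printf("%ld\n", work(100000000));
  return errFlagGlobal ? 1 : 0;
}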

It would make sense for there to be an unavoidable cost to error checking; however, I wanted to present what I'm seeing and learn more about how expected or unexpected this is.

If there is nothing to be done about the cost of *nimErr_ with --threads:on (assuming it even is the culprit), then I do see some opportunities that could mitigate some of the cost. For example:

#line 30 "C:\\Users\\Me\\.choosenim\\toolchains\\nim-2.0.2\\lib\\system\\memory.nim"
static N_INLINE(void, nimZeroMem)(void* p_p0, NI size_p1) {
  NIM_BOOL* nimErr_;
  {
    nimErr_ = nimErrorFlag();
#line 31
    nimSetMem__systemZmemory_u7(p_p0, ((int)0), size_p1);
    if (NIM_UNLIKELY(*nimErr_)) goto BeforeRet_;
  }
  BeforeRet_: ;
}

Is it the case that zero-initializing memory always needs to be checked?

eqdestroy___OOZsrcZghissuefeb505250485052_u514((&value));
if (NIM_UNLIKELY(*nimErr_)) goto LA1_;

Can the default destroy hook for seqs / objects containing just value types or strings ever set *nimErr_? If not, it's a common check that could be skipped. I think the same question applies to things like deallocShared and much of std/bitops, etc.
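
To illustrate why the check looks skippable, here is a self-contained sketch (hypothetical names, not actual Nim codegen): a destroy hook that only frees memory can never set the error flag, so the branch emitted after it is dead code:

#include <stdbool.h>
#include <stdlib.h>

static __thread bool nimInError; /* stand-in for Nim's error flag */

/* A destroy hook for a type holding only values/strings: it just frees
   memory and never touches the error flag. */
static void destroyValue(char **p) {
  free(*p);
  *p = NULL;
}

static int useValue(void) {
  char *value = malloc(16);
  if (value == NULL) return 1;
  destroyValue(&value);
  /* Mirrors the generated `if (NIM_UNLIKELY(*nimErr_)) goto LA1_;` after a
     destructor call; if destroyValue provably cannot raise, this branch
     could be omitted. */
  if (nimInError) return 1;
  return 0;
}

int main(void) { return useValue(); }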

I am not sure what is and is not possible, what is expected, etc., but I thought presenting all of this could be helpful.

Thanks for your time and thoughts.

Nim Version

Nim Compiler Version 2.0.2 [Windows: amd64]
Compiled at 2023-12-15
Copyright (c) 2006-2023 by Andreas Rumpf

active boot switches: -d:release

Current Output

No response

Expected Output

No response

Possible Solution

No response

Additional Information

No response


guzba commented Feb 24, 2024

As a quick update, I have tried the same test on my M1 Mac and found interesting results.

First, the difference between --threads:on and --threads:off is much smaller when using arm64 on M1:

nim c --gc:arc --cpu:arm64 --passL:"-arch arm64" --passC:"-arch arm64" --debugger:native -rf -d:release --threads:off tests/bench.nim

Took 1.35s

nim c --gc:arc --cpu:arm64 --passL:"-arch arm64" --passC:"-arch arm64" --debugger:native -rf -d:release --threads:on tests/bench.nim

Took 1.42s

This seems much more like what one might expect.

If I use x64 instead on M1 I get:

nim c --mm:arc -d:release --threads:on -r tests/bench.nim

2.58s

nim c --mm:arc -d:release --threads:off -r tests/bench.nim

1.68s

The huge difference comes back. So this is something related to amd64, now replicated on Mac as well.

Nim Compiler Version 2.0.2 [MacOSX: amd64]
Compiled at 2024-01-14
Copyright (c) 2006-2023 by Andreas Rumpf

active boot switches: -d:release


guzba commented Feb 26, 2024

I have done a little more experimenting and found significant evidence that something related to thread locals is the cause here.

To check this, I:

  • Cloned the Nim repo on devel
  • Confirmed no performance difference between devel and stable without modification
  • Commented out this line
  • Confirmed in the generated C code that nimInErrorMode was no longer NIM_THREADVAR.

Obviously this is broken when more than one thread is involved; however, the little benchmark is single-threaded, so it is good enough for a with/without comparison.
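
Concretely, the modification amounts to toggling roughly this in the generated C (a sketch of my own, not the actual compiler source; NIM_THREADVAR expands to __thread under GCC):

#include <stdbool.h>

#if 1 /* unmodified compiler, --threads:on */
static __thread bool nimInErrorMode;
#else /* with the NIM_THREADVAR marking commented out: a plain global */
static bool nimInErrorMode;
#endif

bool *nimErrorFlag(void) { return &nimInErrorMode; }

int main(void) { return *nimErrorFlag() ? 1 : 0; }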

Nim stable compiler generating NIM_THREADVAR: ~1.6ms
Modified to not include NIM_THREADVAR: 0.86ms

The same PC (amd64) was used for this experiment. Nearly double the time for some reason.


guzba commented Feb 26, 2024

I looked at the --asm output and found mov rcx, QWORD PTR .refptr.__emutls_v.nimInErrorMode__system_u3501[rip], etc.

__emutls appears to be GCC's TLS emulation. I am not sure whether this is expected to be in use.

I am compiling on Windows 10 using a fresh choosenim install (choosenim stable); gcc -v reports gcc version 11.1.0 (MinGW-W64 x86_64-posix-seh, built by Brecht Sanders).

As another test, I compiled with --cc:vcc and saw a much smaller performance difference. --cc:vcc produces cmp BYTE PTR nimInErrorMode__system_u3423, 0, etc.
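
For anyone who wants to check which TLS model their toolchain uses, a minimal repro (my own): compile with gcc -O2 -S tls_check.c and inspect tls_check.s. MinGW-gcc emits references to __emutls_v.tlsFlag / __emutls_get_address, while a compiler using native Windows TLS references the variable through the _tls_index / segment-relative mechanism instead.

/* tls_check.c */
static __thread int tlsFlag;

int readFlag(void) {
  return tlsFlag;
}

int main(void) {
  tlsFlag = 1;
  return readFlag();
}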


guzba commented Feb 26, 2024

I did some more investigating. It appears MinGW uses emulated TLS on Windows:

http://mails.dpdk.org/archives/dev/2020-February/157446.html

> The second aspect is performance. Per [2], Win32 API TLS functions are ~10% slower than non-emulated access on Linux, and MinGW emulation layer slows access by another 20% (30% total). Clang emulation code is similar to MinGW's [3], although I wasn't able to find any benchmarks. As a DPDK user, I know that rte_lcore_id() is heavily used on the data-path, so this is severe.

https://sourceforge.net/p/mingw-w64/mailman/mingw-w64-public/thread/d72aad95-b6aa-af03-667b-5898456a5a63@gmx.com/

> Surprisingly it's getting almost 30% slower on windows in the cpu-intensive part (single threaded, no syscalls, memory intensive integer arithmetic mostly).
> ...
> Ah, found the big one: Was testing single threaded but code uses __thread thread-local storage which slows things down a lot on mingw.

This could be wrong, outdated, or incomplete; who knows. I am not a GCC or MinGW expert, just putting this here as part of what I'm finding.


RSDuck commented Feb 26, 2024

Try compiling inside the MSYS2 clang environment; it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed-up for --threads:on.

EDIT: also see #21810
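
Roughly, that would be something like this (package and flag names as I understand the MSYS2 and Nim docs, so treat it as a sketch): in an MSYS2 CLANG64 shell, pacman -S mingw-w64-clang-x86_64-toolchain, then nim c --cc:clang --mm:arc -d:release --threads:on -r tests/bench.nim.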


guzba commented Feb 26, 2024

@RSDuck Thanks for the suggestion and the link to that previous issue; I had not seen it. It does appear this is more a consequence of using MinGW/GCC on Windows than a Nim issue.

Given that this is the default compilation path for Nim on Windows, it seems worth seeing whether relatively small steps could mitigate much of the perf hit.

Earlier in this issue I suggested some potential low-hanging opportunities to skip some frequently generated checks of the thread-local error bool. I imagine this could reduce the noticeable perf hit by a lot, but I don't know whether it is actually safe within the error model of Nim's allocator, value types, etc.

Sorry to bother you @Araq, but could some error checks be safely skippable as I theorize, or are there problems here I'm not aware of?


Araq commented Mar 3, 2024

> Sorry to bother you @Araq, but could some error checks be safely skippable as I theorize, or are there problems here I'm not aware of?

You're fine with --fieldChecks:on --boundChecks:on and disabling all the rest IME.
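
Applied to the benchmark above, that advice might look like this (my own rendering of the flags, per the compiler user guide): nim c --mm:arc -d:release --threads:on --checks:off --fieldChecks:on --boundChecks:on -r tests/bench.nim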

juancarlospaco commented:

> Try compiling inside the MSYS2 clang environment; it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed-up for --threads:on.

Maybe this should be documented more explicitly.

KhazAkar commented:

> > Try compiling inside the MSYS2 clang environment; it should be a lot faster, because clang actually supports native TLS on Windows. In my case it gave a giant speed-up for --threads:on.
>
> Maybe this should be documented more explicitly.

This should be shipped to Windows users by default, not only documented.


Araq commented Mar 14, 2024

> This should be shipped to Windows users by default, not only documented.

Agreed, how do we do that?

@KhazAkar
Copy link

> > This should be shipped to Windows users by default, not only documented.
>
> Agreed, how do we do that?

Let me think... When I find more time, I might play with building Nim on Windows in a VM, in the MSYS2 clang environment, in an automated fashion, and then provide some sort of PR. Is that acceptable?


KhazAkar commented Apr 15, 2024

As a first step, I can propose trying the following toolchain; I will definitely fire up a Windows 10 VM in the coming days and build Nim using it:
https://github.com/mstorsjo/llvm-mingw

PS. I'm sorry if my previous comment sounded rude in any way, shape or form.
