Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missed optimization for std::atomic_thread_fence(std::memory_order_seq_cst) #91731

Open
ConorWilliams opened this issue May 10, 2024 · 7 comments

Comments

@ConorWilliams
Copy link

ConorWilliams commented May 10, 2024

While implementing a lock-free-queue I noticed that the pop function was about twice as slow on clang vs gcc. After digging through the assembly on compiler explorer and then reducing to a minimal example it seems that this is happening:

Source GCC Clang
std::atomic_thread_fence(std::memory_order_seq_cst) lock or QWORD PTR [rsp], 0 mfence

The mfence instruction is much slower, MSVC also generates lock inc DWORD PTR __Guard$1[esp+4] instead of an mfence. I raised this on the r/cpp a while ago and was referred to this GCC patch which introduced the optimisation. How can we go about getting something like this into llvm? I have been using boost atomic which seems to generate better assembly but, it would be really nice to drop the dependency.

@ConorWilliams
Copy link
Author

Digging into boost atomic they note the same missed optimization here and here

@ConorWilliams
Copy link
Author

Some 3rd party benchmarks here

@dtcxzyw
Copy link
Member

dtcxzyw commented May 10, 2024

cc @RKSimon @topperc

@llvmbot
Copy link
Collaborator

llvmbot commented May 10, 2024

@llvm/issue-subscribers-backend-x86

Author: Conor Williams (ConorWilliams)

While implementing a [lock-free-queue](https://github.com/ConorWilliams/libfork/blob/main/include/libfork/core/ext/deque.hpp) I noticed that the `pop` function was about twice as slow on clang vs gcc. After digging through the assembly [on compiler explorer](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgATKVpMGoZAFI9AIQuXSS%2BsgJ4BlRugDCqWgFcWDfaRuADJ4DJgAcr4ARpjEIADMABykAA6oCoTODJ4%2BfgFpGU4CIWGRLDFxSXaYDkUMQgRMxAQ5vv4G9piOWQ1NBCUR0bEJyQqNza15HeP9oYPlw0kAlHao3sTI7BwW8aHIPlgA1Gbx7kxELHjm8dhmGgCCd/djxN6Oh1gsqMcA7NYPh0BhyY3iIhzSKUOECWv3%2B9yBCMOoQIhzQDDGhyiqAIF2O8QAIocWAB9LE41AsAB0tFQTHQEDG6BAIDYX2IAE9iSQsMRicQakxVJh0DCALSHLgnSyHAD0MqeiKBILBACoot4qHjCST1VRqbT6YzmaySJzubE%2BQKhSKpYq5QrFUTSdiLpSxiRMBAyRdSIcjSzMGyzcQeZajNalrbHdGY7KZYcGpgxLEFKjjIcGN8aSZYn6CEnaJSng7Ff7zhSrsSCAh%2BXTiVRGJsGQQmQGg1yQxalABHYnIMaR%2BJw6PIw5ESEnbVV1ApfV05utk0cjuh/nh4WDuElxF4TUQcd49yTzEuikwsx/bfRuWHcICUWBlIEdkfTDd7yYK%2BO3dQg%2BT4/emesJfna8YACoIJghxGBihCBqiay0OghzAJgKLuvQDC%2BlEL5MHmBa5tW5zQQawonuSLCYpg/D8ocazEIcADuxBwcCWIAG6YEWAKxkie5gGAJLjpSaAsCkTSYMSmCqMgCDGKhxIvAIwD7jOvp/lYEq%2Bv6S7BqGPZ9mM2ktsagamiuFproKG7npePG8UCN4AGJMHQZHEEwmzqQgeDUSh2JjpB0FMLB%2BZUiBMY6qeVLuvyXrRccmlcMZi5mcu5q8lZEZShF0b8gQ6wMIcGg5fZMYXviuVAlFFFukQcWAZRNhaXmqXthlYbWTaQ5VYCFW9XGhxnEVWBGC%2BNJ0oc3iCHQwJUPmDGee%2BeAsSYfo0iiIV0QwtAvtWeCptWtbIaIRUpN4CgINCpADTeEEHSRk0PZmKL8st/LIWCMTAhcVzAqmjFQRJgVQdJ%2BwXXgnFMSxC3cfCvH5YVmIamYACsliNWj%2BJzoaJltuZHVZRupXw4q/VlQiNWurFnqNYl0rJa1pntZ2mVWsTPUU0CiPEEVJWc6TvyVQ8Dplr91zuMiJzYE644k4CYsVhLUs3E6mMCwrePlpcEuKzrJyS4I0uHCqxs6ijAv9ZbDzKt8qEMMSIUsBAnyoBYABshyqJGdmC6olIQtCJMVRwKy0JwqO8P4HBaKQqCcEeVjNQo9GbIl8R6LwBCaKHKwANYJJIlLuz8iQAJzuxoeio3oGdcD8ZfSOHHCSLwLASBoGikNHsfxxwvAKCAXfZzHoekHAsBIKJKRuWQFAQNPs8gMACjMCkl3YnwdALYPXo56QUShE07KcDwB9HxyADyUTaF0I9n6JbCCJfO0n6PpBYOqwBnLQtCD9wvBPjyXEO/fA/JuicX/rHaSXRQRbDPsiGo%2B9aB4CiB5DkngsD7wICxduADSCcWIFiJQ%2BJAzANCKAUeKwqBGBXgANV8oxS%2BKRGCn14PwQQIhkwSGkBw%2BQSg1D710MlIwJgUA2BsIYVBg9IArBnHUf%2BopGSTnMEnKwegNCHFFJfKgIUUSinxOEfE2BLByAAOJaKoJmB8MlMBPiyAoXgqBCEsSwDI66nRuguAgG4SY/hkrBDmGUCoEhUjpEyAIPxoSCgRIYAMYJwxkqeLqL0CYXg2ihOST0GY8ShhxCSTMKJBS%2Bi5IWPklYKd1ibH0GHCOUd959y9okd2op3aSBQsgZAv5XgMDzjCCAuBCAkHTnoJYWcc5LBWJBOkwxroF3iBoSkGg2k/C4FwMuVcfiSA0KjERnBW7dwaZwAeQ9SAjy0CsCeiAUAUhnvQOelBF73JQKI4AVYel5y3rQHelAoj70PswDkbDz6AvZNfW%2BjhgWP0YAQF%2Bu196f28N/MQf9gVAJMCA2OYC76Q0wFA3gMDkBwOBYg5uscUFoOPpgrYsccF4DwWfQhxDMCkJYOQsRVC%2BC0IUAwzATCWHRzPnwrh4gpAyEEIoFQ6h366AMK88RajbAUvcXI%2BxAhFHKIJKo6w6jNHaPiFowxxjTEWNFFY1ANjNhqvRE4lxeA3HwAqTUHF3jfHpLyAEhg6BSkhOSjEuoRSwmFCyD6xJ1RajZL6IGrJAhUmzFKHkzJhT3X%2BLsDkoJiauAVNTuwUZhg6mHPfo01QzTWntP2PJbp00%2BlQkGfVEZYyzkTPzgkVGlJq7uwzqXSQkgy7xB%2BKs1G%2BaW71KLccuwpzzm52HZnQtvdx1TsmQQlMWQQCSCAA%3D%3D%3D) and then reducing to a [minimal example](https://godbolt.org/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAUgBMAIUtXyKxqiIEh1ZpgDC6egFc2TEABWADZydwAZAiZsADk/ACNsUhBZAAd0JWIXJi9ffxTydMznIUjouLZE5NkHbCdskSIWUiJcvwCQ%2B2xHEqZG5qIy2ISkgqUmlrb8monBqOHK0ekASnt0H1JUTi5LAGYo1F8cAGozXY8WEjYCc13cMy0AQQfHgDd0AkxjmmYtiGXTgB2GxPY5g8aYEAgS7oa6oAD6RCQpGwLEw8J%2BTD%2BEKhHDYZAAnvCyDhSPCVABHeGocbLM4g56AgAiL1ZTxeXFW9G4gX4AS4OnI6G4HlstmOSnWm2wpwsuz45CI2k5qwA1iBdgBOAB0gMCAA4eJrgvr9btAbtpJrNYZuNJ%2BGwglpyPzBcKuPwlCBnUqBZzyHBYCgMGxUgwkpRqCGw4xkq9UKlUvDXkb4UYiNhxvD1PrQnR6BnSF6IPFleR4lFmgTuAqK6xSASAPLxXR1X0KkMcYSNpj0at%2B8g4eI%2BYAXej0L28fg4NjGYCSAeEFH1V6ZsvYdR1HwZmv8KIZ7kD%2BgEeKkKteHBloikAiOqfkVekeIZbBM7CzkzHkzK1Y0IzAJQADUCGwAB3RtUmYXc5GEMQJE4GQYMUFQNDLfQLEMOc0DFaxDBPL1IFWdBUl6L0AHpG3UIVHxvHACP%2BLoemyNwmE8bx2iCDCInmCoqikXZCgyLIhCmDoMKKYSmCGXjRh4ATanqIR%2Bkmdj8htBTemUuZyhGZI5PsWZRJAdTZmk3T%2BNWSUNi2KQuR5Pky3dY4c2CY42CUeNjhTHV00zIhjggfBiDIWV5WWfhfR0ZZViQVFSWoOyuHtchHUCZ1XX4d1PW9RUfzVIJdm1YJpF2QJNTSwFpAsLRAWCQFbS4XYHIHLLcr9VZAwQeAIGDWEYwjKgIGjcNkkOOdEVIHwmFVAQGELYtSwHOsq2g5aG2bVsnGgztmCIHs%2BzLIcRzHCdoJnOcF0FJc2wIVdJ0FDctx3e9926Mtj1Pc8sG2QVr1vaDH2fFQ3w/ecolAdqBH/ICQPAyD%2BQVQRYPESREKR5C1E0Ad9B4TDv3MaxbDw%2BJ6KIkjsknY4AFpGya6iklotdCMYm7XAgdwjJ4QIwlYszFmScShN6TnuYk3o%2Bb4jCNIaQzVICerpaU0yePMqXZbyeWDIGCXRgsSypRsnhEt5F1HO4Zzcyp4rjjGkwAuvKbVQBQLCBIUhQqNiK8vIWK0VGBj1WkYJtVK4qtCtCxAkBTU5QEw9ktS9KzY9ewcsi/1Op6tA%2BpGyMhpz2MQGAGRcfzebqEWwU1v7WtK3Wls2222Euz23t%2Byu7Bh1HCRTvvc7Px%2B6cCGXZw7vXTdUG3bYFVew9BQ%2Bs8GwvQfFRvO8FUBl8QYu8GfyhlgAOAsCIKg%2B90bg1HZHR5RMbQjU8dMHC7A%2B0mhXJoRKZpun0Boz4mYYxWAR2asSMqVHmmAdYC0EsUbIRkbRi2yJAkAUtuisz6OrDiYDAFaSQWrAYcCtYtFwfrayCFjbNTdObFyVtpDHGAKgVAxwZDah4AFIKbsPbhTalFGKcV/aJQTk6U2LVuDZR9N7QOwdQ7SHDjHKOMddhx24HTDKQpRHcJVA1CwFDMrqPTtFB8SRMiuGkEAA%3D%3D%3D) it seems that this is happening:
Source GCC Clang
std::atomic_thread_fence(std::memory_order_seq_cst) lock or QWORD PTR [rsp], 0 mfence

The mfence instruction is much slower, MSVC also generates lock inc DWORD PTR __Guard$1[esp+4] instead of an mfence. I raised this on the r/cpp a while ago and was referred to this GCC patch which introduced the optimisation. How can we go about getting something like this into llvm? I have been using boost atomic which seems to generate better assembly but, it would be really nice to drop the dependency.

@RKSimon
Copy link
Collaborator

RKSimon commented May 10, 2024

Funnily enough, if the cpu target doesn't have MFENCE LowerATOMIC_FENCE will emit LOCK OR $(RSP), 0

@topperc
Copy link
Collaborator

topperc commented May 10, 2024

There's a patch here that never merged https://reviews.llvm.org/D129947

@ConorWilliams
Copy link
Author

ConorWilliams commented May 10, 2024

There's a patch here that never merged https://reviews.llvm.org/D129947

@topperc could it still be merged? I'm not sure I follow why it got stalled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants