-
Notifications
You must be signed in to change notification settings - Fork 10.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Missed optimization for std::atomic_thread_fence(std::memory_order_seq_cst)
#91731
Comments
Some 3rd party benchmarks here |
@llvm/issue-subscribers-backend-x86 Author: Conor Williams (ConorWilliams)
While implementing a [lock-free-queue](https://github.com/ConorWilliams/libfork/blob/main/include/libfork/core/ext/deque.hpp) I noticed that the `pop` function was about twice as slow on clang vs gcc. After digging through the assembly [on compiler explorer](https://godbolt.org/#z:OYLghAFBqd5QCxAYwPYBMCmBRdBLAF1QCcAaPECAMzwBtMA7AQwFtMQByARg9KtQYEAysib0QXACx8BBAKoBnTAAUAHpwAMvAFYgATKVpMGoZAFI9AIQuXSS%2BsgJ4BlRugDCqWgFcWDfaRuADJ4DJgAcr4ARpjEIADMABykAA6oCoTODJ4%2BfgFpGU4CIWGRLDFxSXaYDkUMQgRMxAQ5vv4G9piOWQ1NBCUR0bEJyQqNza15HeP9oYPlw0kAlHao3sTI7BwW8aHIPlgA1Gbx7kxELHjm8dhmGgCCd/djxN6Oh1gsqMcA7NYPh0BhyY3iIhzSKUOECWv3%2B9yBCMOoQIhzQDDGhyiqAIF2O8QAIocWAB9LE41AsAB0tFQTHQEDG6BAIDYX2IAE9iSQsMRicQakxVJh0DCALSHLgnSyHAD0MqeiKBILBACoot4qHjCST1VRqbT6YzmaySJzubE%2BQKhSKpYq5QrFUTSdiLpSxiRMBAyRdSIcjSzMGyzcQeZajNalrbHdGY7KZYcGpgxLEFKjjIcGN8aSZYn6CEnaJSng7Ff7zhSrsSCAh%2BXTiVRGJsGQQmQGg1yQxalABHYnIMaR%2BJw6PIw5ESEnbVV1ApfV05utk0cjuh/nh4WDuElxF4TUQcd49yTzEuikwsx/bfRuWHcICUWBlIEdkfTDd7yYK%2BO3dQg%2BT4/emesJfna8YACoIJghxGBihCBqiay0OghzAJgKLuvQDC%2BlEL5MHmBa5tW5zQQawonuSLCYpg/D8ocazEIcADuxBwcCWIAG6YEWAKxkie5gGAJLjpSaAsCkTSYMSmCqMgCDGKhxIvAIwD7jOvp/lYEq%2Bv6S7BqGPZ9mM2ktsagamiuFproKG7npePG8UCN4AGJMHQZHEEwmzqQgeDUSh2JjpB0FMLB%2BZUiBMY6qeVLuvyXrRccmlcMZi5mcu5q8lZEZShF0b8gQ6wMIcGg5fZMYXviuVAlFFFukQcWAZRNhaXmqXthlYbWTaQ5VYCFW9XGhxnEVWBGC%2BNJ0oc3iCHQwJUPmDGee%2BeAsSYfo0iiIV0QwtAvtWeCptWtbIaIRUpN4CgINCpADTeEEHSRk0PZmKL8st/LIWCMTAhcVzAqmjFQRJgVQdJ%2BwXXgnFMSxC3cfCvH5YVmIamYACsliNWj%2BJzoaJltuZHVZRupXw4q/VlQiNWurFnqNYl0rJa1pntZ2mVWsTPUU0CiPEEVJWc6TvyVQ8Dplr91zuMiJzYE644k4CYsVhLUs3E6mMCwrePlpcEuKzrJyS4I0uHCqxs6ijAv9ZbDzKt8qEMMSIUsBAnyoBYABshyqJGdmC6olIQtCJMVRwKy0JwqO8P4HBaKQqCcEeVjNQo9GbIl8R6LwBCaKHKwANYJJIlLuz8iQAJzuxoeio3oGdcD8ZfSOHHCSLwLASBoGikNHsfxxwvAKCAXfZzHoekHAsBIKJKRuWQFAQNPs8gMACjMCkl3YnwdALYPXo56QUShE07KcDwB9HxyADyUTaF0I9n6JbCCJfO0n6PpBYOqwBnLQtCD9wvBPjyXEO/fA/JuicX/rHaSXRQRbDPsiGo%2B9aB4CiB5DkngsD7wICxduADSCcWIFiJQ%2BJAzANCKAUeKwqBGBXgANV8oxS%2BKRGCn14PwQQIhkwSGkBw%2BQSg1D710MlIwJgUA2BsIYVBg9IArBnHUf%2BopGSTnMEnKwegNCHFFJfKgIUUSinxOEfE2BLByAAOJaKoJmB8MlMBPiyAoXgqBCEsSwDI66nRuguAgG4SY/hkrBDmGUCoEhUjpEyAIPxoSCgRIYAMYJwxkqeLqL0CYXg2ihOST0GY8ShhxCSTMKJBS%2Bi5IWPklYKd1ibH0GHCOUd959y9okd2op3aSBQsgZAv5XgMDzjCCAuBCAkHTnoJYWcc5LBWJBOkwxroF3iBoSkGg2k/C4FwMuVcfiSA0KjERnBW7dwaZwAeQ9SAjy0CsCeiAUAUhnvQOelBF73JQKI4AVYel5y3rQHelAoj70PswDkbDz6AvZNfW%2BjhgWP0YAQF%2Bu196f28N/MQf9gVAJMCA2OYC76Q0wFA3gMDkBwOBYg5uscUFoOPpgrYsccF4DwWfQhxDMCkJYOQsRVC%2BC0IUAwzATCWHRzPnwrh4gpAyEEIoFQ6h366AMK88RajbAUvcXI%2BxAhFHKIJKo6w6jNHaPiFowxxjTEWNFFY1ANjNhqvRE4lxeA3HwAqTUHF3jfHpLyAEhg6BSkhOSjEuoRSwmFCyD6xJ1RajZL6IGrJAhUmzFKHkzJhT3X%2BLsDkoJiauAVNTuwUZhg6mHPfo01QzTWntP2PJbp00%2BlQkGfVEZYyzkTPzgkVGlJq7uwzqXSQkgy7xB%2BKs1G%2BaW71KLccuwpzzm52HZnQtvdx1TsmQQlMWQQCSCAA%3D%3D%3D) and then reducing to a [minimal example](https://godbolt.org/#z:OYLghAFBqd5TKALEBjA9gEwKYFFMCWALugE4A0BIEAZgQDbYB2AhgLbYgDkAjF%2BTXRMiAZVQtGIHgBYBQogFUAztgAKAD24AGfgCsp5eiyahUAUgBMAIUtXyKxqiIEh1ZpgDC6egFc2TEABWADZydwAZAiZsADk/ACNsUhBZAAd0JWIXJi9ffxTydMznIUjouLZE5NkHbCdskSIWUiJcvwCQ%2B2xHEqZG5qIy2ISkgqUmlrb8monBqOHK0ekASnt0H1JUTi5LAGYo1F8cAGozXY8WEjYCc13cMy0AQQfHgDd0AkxjmmYtiGXTgB2GxPY5g8aYEAgS7oa6oAD6RCQpGwLEw8J%2BTD%2BEKhHDYZAAnvCyDhSPCVABHeGocbLM4g56AgAiL1ZTxeXFW9G4gX4AS4OnI6G4HlstmOSnWm2wpwsuz45CI2k5qwA1iBdgBOAB0gMCAA4eJrgvr9btAbtpJrNYZuNJ%2BGwglpyPzBcKuPwlCBnUqBZzyHBYCgMGxUgwkpRqCGw4xkq9UKlUvDXkb4UYiNhxvD1PrQnR6BnSF6IPFleR4lFmgTuAqK6xSASAPLxXR1X0KkMcYSNpj0at%2B8g4eI%2BYAXej0L28fg4NjGYCSAeEFH1V6ZsvYdR1HwZmv8KIZ7kD%2BgEeKkKteHBloikAiOqfkVekeIZbBM7CzkzHkzK1Y0IzAJQADUCGwAB3RtUmYXc5GEMQJE4GQYMUFQNDLfQLEMOc0DFaxDBPL1IFWdBUl6L0AHpG3UIVHxvHACP%2BLoemyNwmE8bx2iCDCInmCoqikXZCgyLIhCmDoMKKYSmCGXjRh4ATanqIR%2Bkmdj8htBTemUuZyhGZI5PsWZRJAdTZmk3T%2BNWSUNi2KQuR5Pky3dY4c2CY42CUeNjhTHV00zIhjggfBiDIWV5WWfhfR0ZZViQVFSWoOyuHtchHUCZ1XX4d1PW9RUfzVIJdm1YJpF2QJNTSwFpAsLRAWCQFbS4XYHIHLLcr9VZAwQeAIGDWEYwjKgIGjcNkkOOdEVIHwmFVAQGELYtSwHOsq2g5aG2bVsnGgztmCIHs%2BzLIcRzHCdoJnOcF0FJc2wIVdJ0FDctx3e9926Mtj1Pc8sG2QVr1vaDH2fFQ3w/ecolAdqBH/ICQPAyD%2BQVQRYPESREKR5C1E0Ad9B4TDv3MaxbDw%2BJ6KIkjsknY4AFpGya6iklotdCMYm7XAgdwjJ4QIwlYszFmScShN6TnuYk3o%2Bb4jCNIaQzVICerpaU0yePMqXZbyeWDIGCXRgsSypRsnhEt5F1HO4Zzcyp4rjjGkwAuvKbVQBQLCBIUhQqNiK8vIWK0VGBj1WkYJtVK4qtCtCxAkBTU5QEw9ktS9KzY9ewcsi/1Op6tA%2BpGyMhpz2MQGAGRcfzebqEWwU1v7WtK3Wls2222Euz23t%2Byu7Bh1HCRTvvc7Px%2B6cCGXZw7vXTdUG3bYFVew9BQ%2Bs8GwvQfFRvO8FUBl8QYu8GfyhlgAOAsCIKg%2B90bg1HZHR5RMbQjU8dMHC7A%2B0mhXJoRKZpun0Boz4mYYxWAR2asSMqVHmmAdYC0EsUbIRkbRi2yJAkAUtuisz6OrDiYDAFaSQWrAYcCtYtFwfrayCFjbNTdObFyVtpDHGAKgVAxwZDah4AFIKbsPbhTalFGKcV/aJQTk6U2LVuDZR9N7QOwdQ7SHDjHKOMddhx24HTDKQpRHcJVA1CwFDMrqPTtFB8SRMiuGkEAA%3D%3D%3D) it seems that this is happening:
The |
Funnily enough, if the cpu target doesn't have MFENCE LowerATOMIC_FENCE will emit |
There's a patch here that never merged https://reviews.llvm.org/D129947 |
@topperc could it still be merged? I'm not sure I follow why it got stalled. |
While implementing a lock-free-queue I noticed that the
pop
function was about twice as slow on clang vs gcc. After digging through the assembly on compiler explorer and then reducing to a minimal example it seems that this is happening:std::atomic_thread_fence(std::memory_order_seq_cst)
lock or QWORD PTR [rsp], 0
mfence
The
mfence
instruction is much slower, MSVC also generateslock inc DWORD PTR __Guard$1[esp+4]
instead of anmfence
. I raised this on the r/cpp a while ago and was referred to this GCC patch which introduced the optimisation. How can we go about getting something like this into llvm? I have been using boost atomic which seems to generate better assembly but, it would be really nice to drop the dependency.The text was updated successfully, but these errors were encountered: