The latter code looks better, so I suspect BTSQ is slow for some reason. The CL referenced above fixed some rules that as a side effect triggered this optimization. I think it would be worth going back to the CLs that introduced BTSQmodify instructions to make sure those are actually performant.
If BitBase is a memory address, the BitOffset has different ranges depending on the operand size (see Table 3-2).
Table 3-2. Range of Bit Positions Specified by Bit Offset Operands

Operand Size | Immediate BitOffset | Register BitOffset
16           | 0 to 15             | −2^15 to 2^15 − 1
32           | 0 to 31             | −2^31 to 2^31 − 1
64           | 0 to 63             | −2^63 to 2^63 − 1
The addressed bit is numbered (Offset MOD 8) within the byte at address (BitBase + (BitOffset DIV 8)) where DIV is signed division with rounding towards negative infinity and MOD returns a positive number (see
The lack of masking was the cause of the issue which prompted the fix.
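To make the quoted addressing rule concrete, here is a minimal Go sketch of how a memory-operand bit offset splits into a byte displacement and a bit index. The function name bitLocation is made up for illustration and is not part of the compiler. It relies on Go's arithmetic right shift for signed integers, which rounds towards negative infinity the way DIV is described above.

```go
package main

import "fmt"

// bitLocation is a hypothetical helper that models the addressing rule
// quoted from the manual for a memory BitBase:
//   byte address = BitBase + (BitOffset DIV 8)
//   bit in byte  = BitOffset MOD 8
// It returns the byte displacement relative to BitBase and the bit index.
func bitLocation(bitOffset int64) (byteDisp int64, bit uint) {
	// >> on a signed integer is an arithmetic shift, i.e. floor division
	// by 8. Go's / operator truncates towards zero and would be wrong
	// for negative offsets.
	byteDisp = bitOffset >> 3
	// The low three bits give a result in 0..7, even for negative offsets.
	bit = uint(bitOffset & 7)
	return byteDisp, bit
}

func main() {
	for _, off := range []int64{0, 9, 63, 64, -1, -9} {
		d, b := bitLocation(off)
		fmt.Printf("offset %3d -> byte %+d, bit %d\n", off, d, b)
	}
}
```

Without masking, an offset of 64 or more (or a negative one) yields a nonzero byte displacement beyond the first 8 bytes, so a 64-bit memory-form instruction touches memory outside the quadword at BitBase. Masking the offset with 63 first keeps the access inside that quadword.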