Vectorize the ASCII check using SSE2 instructions #74
At least with Point taken for the "documented" part. The macros should be at least as stable as the platform.machine option.
        n -= 1;
    }
    // Check the most significant bits in the accumulated words and chars.
    return !(_mm_movemask_epi8(all_words) || (all_chars & ASCII_MASK_1BYTE));
Nice how the movemask instruction is such a good fit here.
It is a very useful instruction. The intrinsic compare functions set the most significant bit too: if you compare one vector to another, you end up with a vector of bytes in which every byte that satisfies the comparison has its most significant bit set. There is also a popcnt (POPCOUNT) instruction that simply reports the number of set bits. So you can use mm_cmpneq + mm_movemask + popcnt to calculate the Hamming distance between two vectors in just three instructions.
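To make the compare + movemask + popcount idea concrete, here is a minimal sketch for one 16-byte block. It is an illustration, not code from this PR: the names `hamming16` and `count_bits` are hypothetical. Plain SSE2 only provides an equality compare for bytes (`_mm_cmpeq_epi8`), so the sketch inverts the equality mask instead of using a "not equal" compare, and it falls back to a scalar loop on non-x86-64 targets.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* Hypothetical helper: number of set bits in x. GCC and Clang can lower
 * the builtin to the POPCNT instruction when the target supports it. */
static int count_bits(unsigned int x)
{
#if defined(__GNUC__) || defined(__clang__)
    return __builtin_popcount(x);
#else
    int n = 0;
    while (x) {
        x &= x - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
#endif
}

/* Hypothetical helper: number of byte positions at which two 16-byte
 * blocks differ. */
static int hamming16(const unsigned char *a, const unsigned char *b)
{
#if USE_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Each equal byte becomes 0xFF, each differing byte 0x00. */
    __m128i eq = _mm_cmpeq_epi8(va, vb);
    /* Collect the most significant bit of every byte into 16 bits. */
    unsigned int equal_mask = (unsigned int)_mm_movemask_epi8(eq);
    /* Invert within the low 16 bits to count the differing positions. */
    return count_bits(~equal_mask & 0xFFFFu);
#else
    int n = 0;
    for (size_t i = 0; i < 16; i++) {
        n += (a[i] != b[i]);
    }
    return n;
#endif
}
```

For longer sequences the same pattern would be applied per 16-byte block, with a scalar loop for the remainder.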
There is also mm_blend, where you create a new vector from two other vectors, based on a provided mask. Very useful, as this allows branchless programming while still using conditionals (create a mask with a compare function, calculate the two possible result vectors, then select based on the mask). They use this in minimap2 for the alignment algorithm. So that might be interesting for cutadapt.
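The mask-then-select pattern described above could look roughly like the sketch below. Two caveats: `score16` and its parameters are hypothetical names chosen for illustration, and this is not how minimap2 or cutadapt implement it. The byte-wise blend intrinsic `_mm_blendv_epi8` requires SSE4.1, so the sketch stays within SSE2 and expresses the same selection with and / andnot / or.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* Hypothetical helper: for 16 positions, write `match` where q and t
 * agree and `mismatch` where they differ, without branching per byte. */
static void score16(const unsigned char *q, const unsigned char *t,
                    signed char match, signed char mismatch,
                    signed char *out)
{
#if USE_SSE2
    __m128i vq = _mm_loadu_si128((const __m128i *)q);
    __m128i vt = _mm_loadu_si128((const __m128i *)t);
    /* The mask is 0xFF in every byte where q and t are equal. */
    __m128i mask = _mm_cmpeq_epi8(vq, vt);
    /* Compute both candidate results up front. */
    __m128i if_equal = _mm_set1_epi8(match);
    __m128i if_differ = _mm_set1_epi8(mismatch);
    /* Select: (mask & if_equal) | (~mask & if_differ). */
    __m128i result = _mm_or_si128(_mm_and_si128(mask, if_equal),
                                  _mm_andnot_si128(mask, if_differ));
    _mm_storeu_si128((__m128i *)out, result);
#else
    for (size_t i = 0; i < 16; i++) {
        out[i] = (q[i] == t[i]) ? match : mismatch;
    }
#endif
}
```

Both candidate vectors are always computed, which is the price of avoiding the branch; that trade usually pays off when the condition is hard to predict.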
Looks good now – although I am a bit disappointed that the Thanks!
The sacrifices we make for a few % performance gains... What have we become?! If it makes you feel better you can take a look at the python-isal setup.py ;-). Although that has become slightly less verbose with the move to a pure C extension.
SSE2 is guaranteed to be present on all AMD64 (x86_64) platforms, so a simple check for such a platform is sufficient to enable the instruction set without running into compile problems.
This increases the ASCII check speed from 20 GB/s to 50 GB/s, making our ASCII string creation almost free.
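As a rough sketch of how the pieces fit together (not the exact code from this PR), the check below ORs all input bytes into accumulators and tests the most significant bits once at the end. The names `all_words`, `all_chars` and `ASCII_MASK_1BYTE` are borrowed from the reviewed snippet; the function name `is_ascii`, the `USE_SSE2` macro and the value `0x80` for the mask are assumptions. The platform test uses the compiler-defined `__x86_64__` (GCC, Clang) and `_M_X64` (MSVC) macros.

```c
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 is always available on x86-64 */
#define USE_SSE2 1
#else
#define USE_SSE2 0
#endif

/* Assumed value: the most significant bit of a single byte. */
#define ASCII_MASK_1BYTE 0x80

/* Returns 1 if every byte in buf[0..len) is ASCII (below 0x80), else 0. */
static int is_ascii(const char *buf, size_t len)
{
    const unsigned char *p = (const unsigned char *)buf;
    const unsigned char *end = p + len;
    unsigned char all_chars = 0;
#if USE_SSE2
    __m128i all_words = _mm_setzero_si128();
    /* OR 16 bytes at a time into the vector accumulator. */
    while ((size_t)(end - p) >= 16) {
        all_words = _mm_or_si128(
            all_words, _mm_loadu_si128((const __m128i *)p));
        p += 16;
    }
#endif
    /* OR the remaining bytes (or all bytes without SSE2) one by one. */
    while (p < end) {
        all_chars |= *p++;
    }
#if USE_SSE2
    /* movemask gathers the top bit of each byte; nonzero means non-ASCII. */
    if (_mm_movemask_epi8(all_words)) {
        return 0;
    }
#endif
    return !(all_chars & ASCII_MASK_1BYTE);
}
```

Because the loop only accumulates and never exits early, it has no data-dependent branches; the cost is that a non-ASCII byte near the start is not detected until the whole buffer has been scanned.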