Vectorize the ASCII check using SSE2 instructions #74

Merged: marcelm merged 3 commits into marcelm:main from rhpvorderman:SSE2 on Apr 20, 2022

@rhpvorderman
Collaborator

SSE2 is guaranteed to be present on all AMD64 (x86_64) platforms, so a simple check for such a platform is sufficient to enable the instruction set without running into compile problems.

This increases the ASCII check speed from 20 GB/s to 50 GB/s, making ASCII string creation almost free for us.

@rhpvorderman rhpvorderman requested a review from marcelm April 19, 2022 09:54
Comment thread src/dnaio/_core.pyx
Comment thread setup.py Outdated
Comment thread src/dnaio/_core.pyx
@rhpvorderman
Collaborator Author

> Regarding your other point, I disagree. Querying a documented, pre-defined compiler macro is totally fine and in my opinion not worse than relying on platform.machine(), which is at the same level of "universalness", demonstrated by having to check for both "x86_64" and "AMD64".

At least with platform.machine there are only two options: x86_64 and AMD64. There are quite a lot of C compilers out there: GCC, Clang, MSVC, the Intel C compiler, the AMD Optimizing Compiler, etc. So there is definitely going to be more variety in pre-defined compiler macros. There is no standardization in this space at all, so I feel the platform.machine choice is safer.

Point taken for the "documented" part. The macros should be at least as stable as the platform.machine option.
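For readers following the macro side of this discussion, here is a minimal sketch of what a compile-time check could look like. It is an illustration of the alternative being debated, not the code that was merged. The macro `SKETCH_HAVE_SSE2` and the function name are hypothetical; the pre-defined macros tested are the ones documented by GCC/Clang (`__x86_64__`, `__amd64__`) and MSVC (`_M_X64`, `_M_AMD64`).

```c
/* Pre-defined macros for 64-bit x86 targets:
 * GCC and Clang define __x86_64__ and __amd64__,
 * MSVC defines _M_X64 and _M_AMD64.
 * SKETCH_HAVE_SSE2 is a hypothetical name used only in this sketch. */
#if defined(__x86_64__) || defined(__amd64__) || \
    defined(_M_X64) || defined(_M_AMD64)
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Returns 1 when the SSE2 code path would be compiled in, 0 otherwise. */
static int sketch_sse2_enabled(void) {
    return SKETCH_HAVE_SSE2;
}
```

The variety of spellings in the `#if` line is the concern raised above: each compiler family has its own macro names, whereas `platform.machine()` only needs two strings.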

n -= 1;
}
// Check the most significant bits in the accumulated words and chars.
return !(_mm_movemask_epi8(all_words) || (all_chars & ASCII_MASK_1BYTE));
Owner

Nice how the movemask instruction is such a good fit here.
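To illustrate why movemask fits, here is a self-contained sketch of an accumulate-then-test ASCII check in the spirit of the reviewed snippet. It is not the PR's actual function: the name `sketch_is_ascii`, the guard macro `SKETCH_ASCII_SSE2`, and the scalar fallback are assumptions made for this example.

```c
#include <stddef.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_ASCII_SSE2 1
#else
#define SKETCH_ASCII_SSE2 0
#endif

/* Returns 1 if every byte in buf[0..n) is below 0x80, else 0. */
static int sketch_is_ascii(const char *buf, size_t n) {
    const unsigned char *p = (const unsigned char *)buf;
    unsigned char all_chars = 0;
    int high_bits = 0;
#if SKETCH_ASCII_SSE2
    __m128i all_words = _mm_setzero_si128();
    while (n >= 16) {
        /* Unaligned 16-byte load; OR keeps any high bit seen so far. */
        all_words = _mm_or_si128(all_words,
                                 _mm_loadu_si128((const __m128i *)p));
        p += 16;
        n -= 16;
    }
    /* movemask collects the top bit of each of the 16 bytes into an int,
     * so a non-zero result means at least one non-ASCII byte was seen. */
    high_bits = _mm_movemask_epi8(all_words);
#endif
    /* Remaining bytes (or all bytes without SSE2) are handled one by one. */
    while (n > 0) {
        all_chars |= *p;
        p += 1;
        n -= 1;
    }
    return !(high_bits || (all_chars & 0x80));
}
```

Because the loop only ORs values together, the high-bit test happens once at the end instead of once per 16 bytes.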

Collaborator Author

It is a very useful instruction. Intrinsic compare functions set the most significant bit too. So if you compare one vector to another you end up with a vector of bytes with the most significant bit set. There is also a popcnt (POPCOUNT) instruction that simply reports the number of set bits. So you can use mm_cmpneq + mm_movemask + popcnt to calculate the Hamming distance of a vector in just three instructions.
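A rough sketch of the compare + movemask + popcount idea for two 16-byte blocks follows. One caveat: plain SSE2 only offers a byte-wise equality compare (`_mm_cmpeq_epi8`), so this version counts the equal positions and subtracts from 16. The function names and the loop-based popcount are assumptions for portability; a real build would more likely use a compiler popcount builtin.

```c
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_HAMMING_SSE2 1
#else
#define SKETCH_HAMMING_SSE2 0
#endif

/* Portable stand-in for the POPCNT instruction (hypothetical helper). */
static int sketch_popcount(unsigned int x) {
    int count = 0;
    while (x) {
        x &= x - 1; /* clears the lowest set bit */
        count++;
    }
    return count;
}

/* Number of positions at which two 16-byte blocks differ. */
static int sketch_hamming16(const char *a, const char *b) {
#if SKETCH_HAMMING_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Each byte becomes 0xFF where a and b are equal, 0x00 otherwise. */
    __m128i eq = _mm_cmpeq_epi8(va, vb);
    /* One bit per byte: set where the bytes were equal. */
    unsigned int equal_mask = (unsigned int)_mm_movemask_epi8(eq);
    return 16 - sketch_popcount(equal_mask);
#else
    /* Scalar fallback for targets without SSE2. */
    int distance = 0;
    for (int i = 0; i < 16; i++) {
        distance += (a[i] != b[i]);
    }
    return distance;
#endif
}
```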

There is also mm_blend, where you create a new vector from two other vectors, based on a provided mask. Very useful, as this allows branchless programming while still using conditionals (create a mask with a compare function, calculate the two possible result vectors, then select based on the mask). They use this in minimap2 for the alignment algorithm. So that might be interesting for cutadapt.
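The mask-then-select pattern can be sketched as below. Note that the byte-wise blend intrinsic (`_mm_blendv_epi8`) belongs to SSE4.1, so this example emulates the selection with SSE2 AND/ANDNOT/OR instead. The task (a per-byte signed maximum) and all names are chosen for illustration only and are not taken from minimap2 or cutadapt.

```c
#include <stdint.h>

#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>
#define SKETCH_SELECT_SSE2 1
#else
#define SKETCH_SELECT_SSE2 0
#endif

/* out[i] = a[i] > b[i] ? a[i] : b[i] for 16 signed bytes. */
static void sketch_max_i8x16(const int8_t *a, const int8_t *b, int8_t *out) {
#if SKETCH_SELECT_SSE2
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Step 1: build the mask, 0xFF in every byte where a > b (signed). */
    __m128i mask = _mm_cmpgt_epi8(va, vb);
    /* Step 2: both candidate results (va and vb) are already available.
     * Step 3: select per byte without branching:
     *         (mask & va) | (~mask & vb). */
    __m128i picked = _mm_or_si128(_mm_and_si128(mask, va),
                                  _mm_andnot_si128(mask, vb));
    _mm_storeu_si128((__m128i *)out, picked);
#else
    /* Scalar fallback for targets without SSE2. */
    for (int i = 0; i < 16; i++) {
        out[i] = a[i] > b[i] ? a[i] : b[i];
    }
#endif
}
```

With SSE4.1 available, step 3 could be a single blend instruction; the three-instruction form above is the usual substitute when only SSE2 can be assumed.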

@marcelm
Owner

marcelm commented Apr 20, 2022

Looks good now – although I am a bit disappointed that the setup.py isn’t so nice and short anymore ...

Thanks!

@marcelm marcelm merged commit 3f261a3 into marcelm:main Apr 20, 2022
@rhpvorderman rhpvorderman deleted the SSE2 branch April 20, 2022 10:31
@rhpvorderman
Collaborator Author

> Looks good now – although I am a bit disappointed that the setup.py isn’t so nice and short anymore ...

The sacrifices we make for a few % performance gains... What have we become?!

If it makes you feel better you can take a look at the python-isal setup.py ;-). Although that has become slightly less verbose with the move to a pure C extension.
