Use a newline index for FastqIter #108
Comments
This didn't help for performance. Also, it made the code slightly more complex: the `__next__` call got easier, but that was offset by added complexity elsewhere. So this experiment can be considered failed.
Thanks for doing that experiment and documenting it here.
In the meantime I did another experiment: what if I only count the newlines but do not index them? It also resulted in a very fast SIMD-enhanced newline counter. I am posting it here so it may be reused for some other purpose later. I initially thought of using the popcount instruction and the movemask instruction, but popcount is not part of SSE2. I think the following solution is much faster anyway: https://github.com/rhpvorderman/dnaio/blob/78174ac701be61f5691e6a0cf2b3c4b90d83f3d9/src/dnaio/count_newlines_sse2.h

The reason I have time to do this is that I am wrapping up a lot of work today before going on holiday (to move house), so in between meetings and e-mails I decided to do some educational hacking.
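For readers who do not want to dig into the linked header, here is a minimal sketch of this style of SSE2 newline counting (not the exact code from `count_newlines_sse2.h`; the function and variable names are illustrative). `_mm_cmpeq_epi8` yields -1 in every byte lane that matches `'\n'`, so subtracting the comparison result from a per-lane counter adds one per newline; the 8-bit lanes just have to be flushed into a scalar total before they can overflow.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Sketch only: count '\n' bytes in a buffer with SSE2.
 * _mm_cmpeq_epi8 gives 0xFF (-1) per matching byte, so subtracting the
 * comparison result from a per-lane counter adds 1 for every newline.
 * The 16 per-lane counters are 8 bits wide, so they are flushed into a
 * scalar total at least once every 255 vectors to avoid overflow. */
static size_t
count_newlines_sse2_sketch(const char *data, size_t length)
{
    const __m128i newline = _mm_set1_epi8('\n');
    const char *ptr = data;
    const char *vec_end = data + (length & ~(size_t)15);  /* last whole vector */
    size_t total = 0;

    while (ptr < vec_end) {
        __m128i counts = _mm_setzero_si128();
        const char *block_end = ptr + 255 * 16;  /* stay below lane overflow */
        if (block_end > vec_end)
            block_end = vec_end;
        while (ptr < block_end) {
            __m128i chunk = _mm_loadu_si128((const __m128i *)ptr);
            __m128i matches = _mm_cmpeq_epi8(chunk, newline);  /* -1 per '\n' */
            counts = _mm_sub_epi8(counts, matches);            /* += 1 per '\n' */
            ptr += 16;
        }
        /* Flush the per-lane counters into the scalar total. */
        uint8_t lanes[16];
        _mm_storeu_si128((__m128i *)lanes, counts);
        for (size_t i = 0; i < 16; i++)
            total += lanes[i];
    }
    /* Scalar tail for the remaining bytes. */
    for (const char *end = data + length; ptr < end; ptr++)
        total += (*ptr == '\n');
    return total;
}
```

The scalar flush loop at the end of each block is the per-byte summation that the following comments discuss.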
Nice educational hacking you did there 😄! I’m just getting started again after my holidays. I looked at the SIMD newline counting function, just to learn a bit about SSE intrinsics. I am not quite sure, but is it possible that the part that sums up the per-lane counts works on each byte individually? Also, could `_mm_sad_epu8` be used to do that summing in a vectorized way?
It unrolls the loop. That is actually quite efficient, as it eliminates branches from the code. It only happens every 2000-something bytes or so, so efficiency is not that big of a deal.

Thanks! That is great! That looks so much more convenient. I will have a look at it.
Sorry, I didn’t make this clear: I know; my point was that it works on each byte individually instead of in a vectorized fashion.
Thanks, that’s the point I didn’t get, so then it shouldn’t be an issue.
I blame that on unclear commenting on my part. The following updated version adds some more comments, clearer vector variable names, and uses `_mm_sad_epu8` as per your excellent find. I knew there are instructions with …

This looks much better in the compiler explorer. Unfortunately I can't benchmark it properly on this laptop, and I haven't assembled my home PC yet (lots of it is still in boxes). I do wonder, with modern processors having branch prediction etc., whether it would be faster to include …
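For illustration, a horizontal byte sum with `_mm_sad_epu8` looks roughly like this (a sketch, not the code from the updated version mentioned above, and it assumes an x86-64 target for `_mm_cvtsi128_si64`). Summing against an all-zero vector turns the "sum of absolute differences" into a plain sum of the 16 byte lanes, delivered as two 64-bit partial sums.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Sketch: add up the 16 unsigned byte lanes of `counts`.
 * _mm_sad_epu8 against zero sums each 8-byte half into the low bits of a
 * 64-bit lane; the two halves are then extracted and added. */
static uint64_t
horizontal_sum_epu8_sketch(__m128i counts)
{
    __m128i sums = _mm_sad_epu8(counts, _mm_setzero_si128());
    uint64_t low  = (uint64_t)_mm_cvtsi128_si64(sums);
    uint64_t high = (uint64_t)_mm_cvtsi128_si64(_mm_srli_si128(sums, 8));
    return low + high;
}
```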
Thanks, and cool that it works with …
I am back and can now benchmark on my own machine again. Unfortunately, the `_mm_sad_epu8` solution is slower. I did a count and found that if the compiler were very naive and assigned a new vector register for each differently named variable, it would need 9 vector registers (out of the 8 available). The `_mm_storeu` option is faster and easier to understand:
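For comparison, a store-based flush of the byte counters looks roughly like this (a sketch under the same assumptions as above, not the snippet from the comment). Storing to a small stack buffer and summing with plain scalar code keeps vector register pressure low, which matches the register-count argument above.

```c
#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Sketch: flush the 16 byte-wide counters by storing them to memory and
 * summing them with scalar code. */
static size_t
flush_counts_storeu_sketch(__m128i counts)
{
    uint8_t lanes[16];
    size_t total = 0;
    _mm_storeu_si128((__m128i *)lanes, counts);
    for (size_t i = 0; i < 16; i++)
        total += lanes[i];
    return total;
}
```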
Welcome back, and thanks for testing! As you explained to me, since that part of the code isn’t called that often, any speedup wouldn’t have been that large anyway, so I guess it doesn’t matter. But it’s good that you added the comment.
Currently the `__next__` function is quite messy. `memchr` is called several times, and several times a `NULL` check needs to be performed. Any of these conditions might exit the while loop that is put in place.

Alternatively, the `update_buffer` call can be used to create a newline index. The `__next__` call can then simply get the next four lines from the index and yield a record (a rough sketch of this scheme is given after the lists below).

Advantages:

- A much simpler `__next__` call, which now uses a while loop to allow for repeated buffer resizings. Instead, the newline indexing step can indicate whether there is an entire record in the buffer, which means `update_buffer` can guarantee that at least one FASTQ record is present. This makes the while loop redundant. Also a lot fewer `NULL` checks; basically only FASTQ integrity checks.

Disadvantages:

- The buffer has to be traversed again for the FASTQ integrity (`@` and `+`) and `\r` checks. This means these cache lines need to be fetched from memory/cache again. In contrast, they were probably already populated with the correct memory in our current solution.

In case there is no notable speedup, a reduction in code complexity is also nice (taking the line diff as a measure).
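To make the proposal concrete, here is a rough C sketch of the scheme. It is purely illustrative: only `update_buffer` and `__next__` are names from the text above, and the struct, fields, and helper function are assumptions, not dnaio's actual implementation. The buffer update step records the offset of every newline, and the iterator then just takes four indexed lines per record and performs the remaining integrity checks.

```c
#include <stddef.h>

/* Purely illustrative sketch of the proposed newline-index scheme. */
typedef struct {
    const char *buffer;      /* current chunk of FASTQ data, assumed to start
                                at a record boundary after each buffer update */
    const size_t *newlines;  /* offsets of every '\n', built during the buffer update */
    size_t n_newlines;       /* number of indexed newlines (complete lines) */
    size_t line;             /* index of the next unconsumed line */
} FastqIterSketch;

/* Return 1 and fill starts/ends with the four line spans of the next record,
 * 0 when fewer than four indexed lines remain (the buffer update step would
 * then guarantee at least one whole record before the next call, so no while
 * loop is needed here), or -1 for a malformed record. */
static int
next_record_sketch(FastqIterSketch *it, size_t starts[4], size_t ends[4])
{
    if (it->line + 4 > it->n_newlines)
        return 0;
    for (int i = 0; i < 4; i++) {
        size_t line_no = it->line + i;
        starts[i] = line_no ? it->newlines[line_no - 1] + 1 : 0;
        ends[i] = it->newlines[line_no];  /* exclusive end, points at the '\n' */
    }
    /* Only the FASTQ integrity checks remain: header and separator markers. */
    if (it->buffer[starts[0]] != '@' || it->buffer[starts[2]] != '+')
        return -1;
    it->line += 4;
    return 1;
}
```

With the index built up front, the iterator body reduces to a bounds check and the integrity checks, which is exactly the simplification described under the advantages.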