New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to adjust the searching window for positions near the end of the sequence #2
Comments
For computing the searching window of m2, I'm not sure if there's a bug. https://github.com/ksahlin/strobemers/blob/main/modules/indexing.py#L189
Does the condition |
My brain is down, would you please help to mark the correct windows in the illustration.xlsx. Thank you so much. |
Hehe sure. Overall I don't know if it matters much, but nevertheless, it is good to be consistent. I implemented alternative A in your first illustration (up to any offset bugs that you may find). That is, first the offset (w_min) decreases, then the window gets narrower at the very end. For three or more, I implemented (and suggested in the paper) to scale down all segments equally with floor operator. With a "segment" here, I mean the range between the previous w_max to current w_max (i.e., a segment is w_max bases). In practice, this leads to what I've drawn out in the figure below in the highlighted box. Note that as soon as there is not sufficient room, all the windows are divided. the number to the left shows the total number of positions to the last valid strobe position. there are two segments, to we get 9//2 = 4, 8//2 =4, 7//2=3, 6//2=3. However, a final remark. If all this "extra checking" is detrimental to runtime, I think speed should be prioritized. One could consider stopping to generate strobemers whenever when one needs rescaling (or any other heuristic solution). Essentially, I don't think my particular scaling is significantly better than any other solution. |
Thank you for the details. |
@shenwei356 - It is amazing how I missed your main concern in this issue: "the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.", which can matter a lot for short sequences. This is a very good point that was also just raised by another user @lutfia95 which made me revisit this. My opinion is now that one should pick whichever strategy that maximizes potential matches in the end, or alternatively stop before any modification is needed. Given this, your strategy B would be preferable. Thanks for pointing this out and sorry for my initial answer which is not a good answer for read mapping applications. |
Hi, I'm a little confused about how to adjust the searching window for positions near the end of the sequence.
I know that these positions need to be kept for producing an equal number of strobemers as k-mers. But in sequence mapping/searching scenarios, unlike k-mers, the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.
In the function
seq_to_randstrobes2_iter
:For positions near the end of the sequence (
p + strobe_w_max_offset > len(hash_seq_list)
),equals to
As I understand, it keeps the size of the searching window and moves the window to the left (box A in the figure below), am I right? Have you tried the way in box B?
Besides, for order 3, the windows of
m2
andm3
would have some duplicated regions, is this OK?The text was updated successfully, but these errors were encountered: