Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

How to adjust the searching window for positions near the end of the sequence #2

Closed
shenwei356 opened this issue Apr 13, 2021 · 5 comments

Comments

@shenwei356
Copy link

Hi, I'm a little confused about how to adjust the searching window for positions near the end of the sequence.

I know that these positions need to be kept for producing an equal number of strobemers as k-mers. But in sequence mapping/searching scenarios, unlike k-mers, the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.

In the function seq_to_randstrobes2_iter:

window_p_start = p + strobe_w_min_offset if p + strobe_w_max_offset <= len(hash_seq_list) else max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )
window_p_end = min(p + strobe_w_max_offset, len(hash_seq_list))

For positions near the end of the sequence (p + strobe_w_max_offset > len(hash_seq_list)),

max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )

equals to

max( len(hash_seq_list) - (strobe_w_max_offset - strobe_w_in_offset), p)

As I understand, it keeps the size of the searching window and moves the window to the left (box A in the figure below), am I right? Have you tried the way in box B?

Screenshot_20210413_230241

Besides, for order 3, the windows of m2 and m3 would have some duplicated regions, is this OK?

Screenshot_20210413_230706

@shenwei356
Copy link
Author

For computing the searching window of m2, I'm not sure if there's a bug.

https://github.com/ksahlin/strobemers/blob/main/modules/indexing.py#L189

window_p1_start = p + strobe_w_min_offset if p + 2*strobe_w_max_offset <= len(hash_seq_list) else max(p,  len(hash_seq_list)  + 2*(strobe_w_min_offset - strobe_w_max_offset))

Does the condition if p + 2*strobe_w_max_offset <= len(hash_seq_list) should be if p + strobe_w_max_offset <= len(hash_seq_list).

@shenwei356
Copy link
Author

My brain is down, would you please help to mark the correct windows in the illustration.xlsx. Thank you so much.

@ksahlin
Copy link
Owner

ksahlin commented Apr 13, 2021

Hehe sure. Overall I don't know if it matters much, but nevertheless, it is good to be consistent.

I implemented alternative A in your first illustration (up to any offset bugs that you may find). That is, first the offset (w_min) decreases, then the window gets narrower at the very end.

For three or more, I implemented (and suggested in the paper) to scale down all segments equally with floor operator. With a "segment" here, I mean the range between the previous w_max to current w_max (i.e., a segment is w_max bases).

In practice, this leads to what I've drawn out in the figure below in the highlighted box. Note that as soon as there is not sufficient room, all the windows are divided. the number to the left shows the total number of positions to the last valid strobe position. there are two segments, to we get 9//2 = 4, 8//2 =4, 7//2=3, 6//2=3.

However, a final remark. If all this "extra checking" is detrimental to runtime, I think speed should be prioritized. One could consider stopping to generate strobemers whenever when one needs rescaling (or any other heuristic solution). Essentially, I don't think my particular scaling is significantly better than any other solution.

Screen Shot 2021-04-13 at 7 11 34 PM

@shenwei356
Copy link
Author

Thank you for the details.

@ksahlin
Copy link
Owner

ksahlin commented Jun 10, 2021

@shenwei356 - It is amazing how I missed your main concern in this issue: "the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.", which can matter a lot for short sequences. This is a very good point that was also just raised by another user @lutfia95 which made me revisit this.

My opinion is now that one should pick whichever strategy that maximizes potential matches in the end, or alternatively stop before any modification is needed. Given this, your strategy B would be preferable.

Thanks for pointing this out and sorry for my initial answer which is not a good answer for read mapping applications.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants