How to adjust the searching window for positions near the end of the sequence #2

shenwei356 · 2021-04-13T15:17:51Z

Hi, I'm a little confused about how to adjust the searching window for positions near the end of the sequence.

I know that these positions need to be kept for producing an equal number of strobemers as k-mers. But in sequence mapping/searching scenarios, unlike k-mers, the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.

In the function seq_to_randstrobes2_iter:

window_p_start = p + strobe_w_min_offset if p + strobe_w_max_offset <= len(hash_seq_list) else max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )
window_p_end = min(p + strobe_w_max_offset, len(hash_seq_list))

For positions near the end of the sequence (p + strobe_w_max_offset > len(hash_seq_list)),

max( (p + strobe_w_min_offset) -  (p + strobe_w_max_offset - len(hash_seq_list)), p )

equals to

max( len(hash_seq_list) - (strobe_w_max_offset - strobe_w_in_offset), p)

As I understand, it keeps the size of the searching window and moves the window to the left (box A in the figure below), am I right? Have you tried the way in box B?

Besides, for order 3, the windows of m2 and m3 would have some duplicated regions, is this OK?

The text was updated successfully, but these errors were encountered:

shenwei356 · 2021-04-13T15:41:00Z

For computing the searching window of m2, I'm not sure if there's a bug.

https://github.com/ksahlin/strobemers/blob/main/modules/indexing.py#L189

window_p1_start = p + strobe_w_min_offset if p + 2*strobe_w_max_offset <= len(hash_seq_list) else max(p,  len(hash_seq_list)  + 2*(strobe_w_min_offset - strobe_w_max_offset))

Does the condition if p + 2*strobe_w_max_offset <= len(hash_seq_list) should be if p + strobe_w_max_offset <= len(hash_seq_list).

shenwei356 · 2021-04-13T15:43:26Z

My brain is down, would you please help to mark the correct windows in the illustration.xlsx. Thank you so much.

ksahlin · 2021-04-13T17:07:00Z

Hehe sure. Overall I don't know if it matters much, but nevertheless, it is good to be consistent.

I implemented alternative A in your first illustration (up to any offset bugs that you may find). That is, first the offset (w_min) decreases, then the window gets narrower at the very end.

For three or more, I implemented (and suggested in the paper) to scale down all segments equally with floor operator. With a "segment" here, I mean the range between the previous w_max to current w_max (i.e., a segment is w_max bases).

In practice, this leads to what I've drawn out in the figure below in the highlighted box. Note that as soon as there is not sufficient room, all the windows are divided. the number to the left shows the total number of positions to the last valid strobe position. there are two segments, to we get 9//2 = 4, 8//2 =4, 7//2=3, 6//2=3.

However, a final remark. If all this "extra checking" is detrimental to runtime, I think speed should be prioritized. One could consider stopping to generate strobemers whenever when one needs rescaling (or any other heuristic solution). Essentially, I don't think my particular scaling is significantly better than any other solution.

shenwei356 · 2021-04-14T02:05:08Z

Thank you for the details.

ksahlin · 2021-06-10T06:56:31Z

@shenwei356 - It is amazing how I missed your main concern in this issue: "the strobemers near the end of the sequences would be different from these in the reference sequence, because of the incomplete searching window.", which can matter a lot for short sequences. This is a very good point that was also just raised by another user @lutfia95 which made me revisit this.

My opinion is now that one should pick whichever strategy that maximizes potential matches in the end, or alternatively stop before any modification is needed. Given this, your strategy B would be preferable.

Thanks for pointing this out and sorry for my initial answer which is not a good answer for read mapping applications.

shenwei356 closed this as completed Apr 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to adjust the searching window for positions near the end of the sequence #2

How to adjust the searching window for positions near the end of the sequence #2

shenwei356 commented Apr 13, 2021

shenwei356 commented Apr 13, 2021

shenwei356 commented Apr 13, 2021

ksahlin commented Apr 13, 2021 •

edited

shenwei356 commented Apr 14, 2021

ksahlin commented Jun 10, 2021

How to adjust the searching window for positions near the end of the sequence #2

How to adjust the searching window for positions near the end of the sequence #2

Comments

shenwei356 commented Apr 13, 2021

shenwei356 commented Apr 13, 2021

shenwei356 commented Apr 13, 2021

ksahlin commented Apr 13, 2021 • edited

shenwei356 commented Apr 14, 2021

ksahlin commented Jun 10, 2021

ksahlin commented Apr 13, 2021 •

edited