
Include read length in the index filename #195

Merged
merged 8 commits into from
Jan 30, 2023

Conversation

@marcelm (Collaborator) commented Jan 19, 2023

Closes #133.

We currently have the concept of an (actual) read length, which is either estimated from the input FASTQs or provided using the -r command-line option. This PR introduces the concept of a "canonical read length". The canonical read length is one of 50, 100, 150, 225, 325 and 400 (we choose the one that is closest to the actual read length) and represents one of the six pre-defined index parameter settings.

With this PR, all parameters are entirely based on the canonical read length only, which means that it can be re-used for datasets that have similar read lengths. This was almost the case before except that the max_dist parameter was computed from the actual read length. Commit 8ba1f93 changes this so that max_dist is computed from the canonical read length. (This changes mapping results when actual and canonical read length differ.)

To make it possible to store multiple indices with different canonical read lengths alongside each other, the canonical read length is also included in the index filename, as in ref.fasta.r100.sti. I chose to include the r here mainly because I think it may be clearer that way what the number represents. Another minor reason is that we could extend the naming scheme at a later point to include other parameters. For example, .r100k24.sti could mean that one needs to use -r 100 -k 24 to reproduce the settings the index was created with.

At the moment, if the user overrides any indexing parameters, the file name extension is going to be just .sti as before. The user then again has the problem that they cannot store multiple of these indices in the same directory, but I think solving this has much lower priority. (One workaround is to symlink the reference into separate directories for each parameter setting and then create the index there.) If the index doesn’t match the command-line settings, then there’s an error message in any case, so accidentally using the incorrect index should not be possible.


  • Update documentation
  • Update changelog
  • Ensure we get sensible error messages if index settings don’t match or an index couldn’t be found
  • Measure how accuracy changes due to max_dist now being based on canonical, not actual read length
  • Update baseline commit

@marcelm (Collaborator, Author) commented Jan 19, 2023

CHM13

2 x 100 bp reads

-r    mapping rate  accuracy
76    99.79551      93.263595
100   99.804485     93.27491
125   99.803985     93.27183

rye

2 x 100 bp reads

-r    mapping rate  accuracy
76    99.692785     85.54961
100   99.797135     85.56528
125   99.80467      85.52795

Hm, that drop in mapping rate for -r 76 is a bit bigger than the other differences.

@ksahlin (Owner) commented Jan 20, 2023

So if I interpret your analysis correctly, we are 'reversing the order of the analysis' as you described earlier. That is, you test different settings for -r (or max_dist, really) for a fixed read length. Your analysis suggests that for the shortest read lengths (76) that get assigned to the canonical read length 100, indexing with a max_dist based on 100 could have a larger effect.

Some brainstorming:

  1. Set max_dist based on the lowest read length in the interval.
  2. Use tighter canonical intervals in the lower range. One could imagine splitting the 75-125 range into 75-100 and 100-125, where only max_dist changes (set to 87 or 112, I guess).

I like suggestion 2 better, because I think this difference is only noticeable in the shorter read range(?). One analysis that would shed some light on this question would be to index a genome with different values of -r and see how many times max_dist is 'active' when getting our window of candidate syncmers.

Suggestion 1 is an easy fix but it has the possible drawback of slower relative runtime by being conservative (not leveraging the larger spans that we could use for the reads).

For suggestion 2, in https://github.com/ksahlin/strobealign/blob/main/src/index.cpp:

        settings {75, 20, -4, -4, 2},
        settings {100, 20, -4, -3, 2},
        settings {125, 20, -4, -2, 2},

Here I just made a guess on settings {100, 20, -4, -3, 2}, but I think the (-3, 2) interval makes sense.

What do you think?

@marcelm (Collaborator, Author) commented Jan 20, 2023

More fine-grained read-length intervals is an option. This would make it a bit more difficult to generate all possible indices in advance. I think some users/admins would want to do that "just to be prepared". And then strobealign could gain a reputation for requiring massive amounts of storage for its indices ...

Shouldn’t the ranges have the most common read lengths as their middle points to ensure that the index works best with them? (Or at least have the max_dist setting optimized for it.) 76-125 was good because it had 100 in the middle, so perhaps it needs to be split into three parts. Or change the other intervals as well. Have you by any chance done a survey to find out which read lengths are most commonly used in practice? I would guess many people use the maximum Illumina read lengths, which are 100, 125, 150, 250, 300 (interesting, didn’t know about the 125).

One analysis that would shed some light on this question would be to index a genome with different values of -r and see how many times max_dist is 'active' when getting our window of candidate syncmers.

I’ll try to do something in that regard.

@ksahlin (Owner) commented Jan 20, 2023

More fine-grained read-length intervals is an option. This would make it a bit more difficult to generate all possible indices in advance. I think some users/admins would want to do that "just to be prepared". And then strobealign could gain a reputation for requiring massive amounts of storage for its indices ...

Ok agree.

Shouldn’t the ranges have the most common read lengths as their middle points to ensure that the index works best with them?

That makes sense. Would then be interesting to investigate parameters for 100, 125, 150, 250, 300. In my study I benchmarked only 100, 150, and 300 out of those numbers. (The other lengths I investigated were 50, 75, 200, 500 which are less relevant with Illumina, for now.)

I can with little effort add 125 and 250 to my benchmarks. Interestingly 125 is right on the border which probably means the current parameters (also the syncmer span) aren't ideal for them.

How do you want to proceed with this? I am happy to run benchmarks whenever we have something to try.

@marcelm (Collaborator, Author) commented Jan 20, 2023

I added a "dumprandstrobes" tool in #198 to be able to look at the randstrobes for any FASTA file and any parameters.

I have only started using it, but it was already helpful because it reminded me that 1) max_dist is computed as read length minus 70 and 2) max_dist is restricted to be at least $k$. That means that it is always 20 within the range 76-90. (That is, the index is identical for -r 76 and -r 90.)

I’ll run the tool a bit to get an impression of how much the max_dist setting changes the index.

@ksahlin (Owner) commented Jan 20, 2023

great!

@marcelm (Collaborator, Author) commented Jan 23, 2023

I ran dumprandstrobes on the fruit fly genome, varying the read length from 70 to 287 (at which point my disk was full) and then counted how many randstrobes change when comparing read length $n$ to read length $n+1$.

As expected, small max_dist values have the strongest effect. I’ll post another table later looking at the max_dist values directly.

See the table below. The total number of randstrobes was always about 26.5 million.
len1 → len2   no. of changes
70 → 71 0
71 → 72 0
72 → 73 0
73 → 74 0
74 → 75 0
75 → 76 11067172
76 → 77 0
77 → 78 0
78 → 79 0
79 → 80 0
80 → 81 0
81 → 82 0
82 → 83 0
83 → 84 0
84 → 85 0
85 → 86 0
86 → 87 0
87 → 88 0
88 → 89 0
89 → 90 0
90 → 91 1299302
91 → 92 1177129
92 → 93 1097366
93 → 94 1048296
94 → 95 938567
95 → 96 873557
96 → 97 828832
97 → 98 742453
98 → 99 673770
99 → 100 623985
100 → 101 545579
101 → 102 482820
102 → 103 424230
103 → 104 357368
104 → 105 303858
105 → 106 254989
106 → 107 204637
107 → 108 165840
108 → 109 134761
109 → 110 105163
110 → 111 82478
111 → 112 65060
112 → 113 49456
113 → 114 38211
114 → 115 29376
115 → 116 22025
116 → 117 16366
117 → 118 12462
118 → 119 8984
119 → 120 6692
120 → 121 4947
121 → 122 3844
122 → 123 2637
123 → 124 2013
124 → 125 1437
125 → 126 20664155
126 → 127 374043
127 → 128 334959
128 → 129 301984
129 → 130 276357
130 → 131 242701
131 → 132 215097
132 → 133 191469
133 → 134 164461
134 → 135 141917
135 → 136 123035
136 → 137 104145
137 → 138 89577
138 → 139 74392
139 → 140 62325
140 → 141 50928
141 → 142 42005
142 → 143 34895
143 → 144 28495
144 → 145 23336
145 → 146 18428
146 → 147 15286
147 → 148 12245
148 → 149 10053
149 → 150 7786
150 → 151 6236
151 → 152 4753
152 → 153 3854
153 → 154 2976
154 → 155 2277
155 → 156 1911
156 → 157 1627
157 → 158 1107
158 → 159 969
159 → 160 700
160 → 161 549
161 → 162 425
162 → 163 314
163 → 164 239
164 → 165 210
165 → 166 155
166 → 167 117
167 → 168 118
168 → 169 79
169 → 170 63
170 → 171 51
171 → 172 47
172 → 173 39
173 → 174 33
174 → 175 33
175 → 176 17849451
176 → 177 28711
177 → 178 24443
178 → 179 20888
179 → 180 17691
180 → 181 14755
181 → 182 12228
182 → 183 10303
183 → 184 8806
184 → 185 6798
185 → 186 5977
186 → 187 4915
187 → 188 4138
188 → 189 3334
189 → 190 2651
190 → 191 2214
191 → 192 1848
192 → 193 1444
193 → 194 1200
194 → 195 1000
195 → 196 832
196 → 197 638
197 → 198 557
198 → 199 477
199 → 200 352
200 → 201 277
201 → 202 211
202 → 203 197
203 → 204 152
204 → 205 125
205 → 206 118
206 → 207 86
207 → 208 56
208 → 209 61
209 → 210 55
210 → 211 44
211 → 212 45
212 → 213 28
213 → 214 44
214 → 215 24
215 → 216 26
216 → 217 23
217 → 218 20
218 → 219 28
219 → 220 30
220 → 221 26
221 → 222 57
222 → 223 65
223 → 224 80
224 → 225 101
225 → 226 106
226 → 227 107
227 → 228 138
228 → 229 164
229 → 230 156
230 → 231 173
231 → 232 165
232 → 233 176
233 → 234 174
234 → 235 158
235 → 236 173
236 → 237 148
237 → 238 142
238 → 239 152
239 → 240 147
240 → 241 138
241 → 242 116
242 → 243 126
243 → 244 99
244 → 245 124
245 → 246 110
246 → 247 103
247 → 248 104
248 → 249 90
249 → 250 89
250 → 251 105
251 → 252 80
252 → 253 76
253 → 254 77
254 → 255 75
255 → 256 64
256 → 257 63
257 → 258 68
258 → 259 61
259 → 260 69
260 → 261 67
261 → 262 74
262 → 263 62
263 → 264 55
264 → 265 54
265 → 266 63
266 → 267 48
267 → 268 52
268 → 269 56
269 → 270 40
270 → 271 40
271 → 272 50
272 → 273 43
273 → 274 43
274 → 275 42
275 → 276 0 (bug?)
276 → 277 29
277 → 278 24
278 → 279 20
279 → 280 23
280 → 281 20
281 → 282 23
282 → 283 19
283 → 284 11
284 → 285 16
285 → 286 12
286 → 287 8

@marcelm (Collaborator, Author) commented Jan 23, 2023

Here’s a histogram for the randstrobes in CHM13, generated using -r 125 and -r 76. This shows the frequencies of offsets, which are computed as strobe2.start - strobe1.start. The read lengths result in max. offsets of 75 and 40, respectively (max_dist is $k=20$ lower because the "distance" would be strobe2.start - strobe1.end).

CHM13
There are 577 million randstrobes in CHM13. If running with -r 125, 235 million of them have an offset greater than 40. They hit max_dist when running with -r 90 or less. 70 million have an offset greater than 50. They hit max_dist for -r 100.

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

@ksahlin (Owner) commented Jan 25, 2023

The read lengths result in max. offsets of 75 and 40,

To establish terminology, do you mean they result in maximum seed lengths of 75 and 40? Because the 'offset' between the first and second syncmer would be max(r-70, k), which would always be 20 for r=76; together with a syncmer size of 20, that gives a maximum seed size of 40 (the offset of the second syncmer from the first plus the syncmer size).

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

The w_max is the limit on the number of syncmers in the window. Some regions have a sparser syncmer distribution (and there are no syncmers in regions of N's). max_dist ensures a seed size limit in nucleotide space. For example, without max_dist we would get seeds that were millions of bases long (spanning the telomere regions with N's in hg38). max_dist was a practical way to limit this, and it also lets us store the offset size in 8 bits.

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

Sounds interesting, but I am worried there will be many practical complications:
i) The k and s vary for the higher settings (k-s=6), which will not be compatible with the lower (k-s=4) settings
ii) I used a different k for two of the highest options. (I will see if someone can perform a parameter optimisation analysis.)
iii) If max_dist should be adjusted, then better to just store the syncmers and their positions, as the window [w_min, w_max] (in syncmers) will be determined by the read length. (I don't think you can modify the randstrobes after they are created)

or perhaps I am thinking about it the wrong way?

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

You mean your modified version of my option 2, right? Which also has "the ranges have the most common read lengths as their middle points to ensure that the index works best with them".

I don't have a strong preference here, basically because I can't come up with a this-is-the-way-to-go solution.

@marcelm (Collaborator, Author) commented Jan 25, 2023

The read lengths result in max. offsets of 75 and 40,

To establish terminology, do you mean they result in maximum seed lengths of 75 and 40? Because the 'offset' between the first and second syncmer would be max(r-70, k), which would always be 20 for r=76; together with a syncmer size of 20, that gives a maximum seed size of 40 (the offset of the second syncmer from the first plus the syncmer size).

Yes, you’re right. What’s shown in the figure is the seed length. Sorry about the confusion – I computed the length incorrectly for the statistics, which made me think offset and distance are somehow different things. But of course offset and distance are both the difference in strobe start positions, and adding $k$ gives us the seed length.

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

The w_max is the limit on the number of syncmers in the window. Some regions have a sparser syncmer distribution (and there are no syncmers in regions of N's). max_dist ensures a seed size limit in nucleotide space. For example, without max_dist we would get seeds that were millions of bases long (spanning the telomere regions with N's in hg38). max_dist was a practical way to limit this, and it also lets us store the offset size in 8 bits.

That makes sense. Not that it matters here, but I just wonder whether it would be nicer (conceptually) to only have one criterion. So one could only use max_dist and abandon w_max.

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

Sounds interesting, but I am worried there will be many practical complications:
i) The k and s vary for the higher settings (k-s=6), which will not be compatible with the lower (k-s=4) settings

Yes, I thought one would still have multiple indices as before and only adjust those randstrobes that differ due to the different max_dist settings.

ii) I used a different k for two of the highest options. (I will see if someone can perform a parameter optimisation analysis.)
iii) If max_dist should be adjusted, then better to just store the syncmers and their positions, as the window [w_min, w_max] (in syncmers) will be determined by the read length. (I don't think you can modify the randstrobes after they are created)

Good point, only storing syncmers would be an option. On the other hand, the savings would be lower because generating randstrobes also takes time ... But I don’t think it’s worth going in this direction at the moment, it was just a brainstorming idea.

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

You mean your modified version of my option 2, right? Which also has "the ranges have the most common read lengths as their middle points to ensure that the index works best with them".

Yes, but I described it unclearly: The essential point is that we choose max_dist so that it works best with the read lengths we’re most likely to encounter. Instead of computing max_dist from the canonical read length, we can hardcode it as part of the settings, and then we can choose whatever we want for each canonical read length, and it doesn’t need to be centered. (The canonical read length would then just become the name of one of the pre-defined index parameters.)

Because you designed these settings initially, I don’t have a good intuition for how they would need to be changed, but let me make a suggestion as a starting point anyway. I think we want this:

  • One extra setting centered at read length 125 to cater for that read length
  • Pick the max_dist so that it works best with the most likely read length in that interval (read length minus 70)

Current settings

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=75                20  16  1      6      20
100                    <=125               20  16  2      6      20..55
150                    <=175               20  16  5      11     56..105
225                    <=275               20  16  8      17     106..205
325                    <=375               22  18  6      16     206..255
400                    >375                23  17  5      15     255

Possible settings

For the added read length 125, I interpolated w_min and w_max from the adjacent settings, but one might as well pick 4 and 9.

Bold numbers designate changed or new values.

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=75                20  16  1      6      20
100                    <=110               20  16  2      6      30
125                    <=135               20  16  3      8      55
150                    <=175               20  16  5      11     80
250                    <=275               20  16  8      17     180
300                    <=375               22  18  6      16     230
400                    >375                23  17  5      15     255

@ksahlin (Owner) commented Jan 25, 2023

I think most of those settings look good. The only one I am hesitant about is the <=75 interval. Reads of length 76 will now be bumped up to max_dist 30 while being capped at 20 before. I believe we should be conservative with this interval and do:

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=90                20  16  1      6      20
100                    <=110               20  16  2      6      30
125                    <=135               20  16  3      8      55
150                    <=175               20  16  5      11     80
250                    <=275               20  16  8      17     180
300                    <=375               22  18  6      16     230
400                    >375                23  17  5      15     255

(I have edited only the read len. interval column in the first row.) This basically states that max_dist has precedence over the w_min/w_max boundaries for very short reads.

Before this request is merged, I could start a benchmark including some new read lengths and compare to v0.7.1. As the results should be similar for the actual canonical read lengths (125, 250, 300), I should probably test read lengths 76, 111, 136, and 176 for 'worst case scenarios', right?

There are six canonical read lengths (50, 100, 150, 225, 325 and 400) which
represent the six pre-defined index parameter settings.

The canonical read length is now used as part of index filename as in
"ref.fasta.r100.sti"
@marcelm (Collaborator, Author) commented Jan 25, 2023

I added a new setting for read length 125 using your version of the table. Makes absolute sense to end that interval at 90. I also bumped the baseline commit so that we finally get a green checkmark.

Before this request is merged, I could start a benchmark including some new read lengths and compare to v0.7.1. As the results should be similar for the actual canonical read lengths (125, 250, 300), I should probably test read lengths 76, 111, 136, and 176 for 'worst case scenarios', right?

Yes, this PR is ready now; it would be good to run some benchmarks that also test some of these worst-case lengths.

@ksahlin (Owner) commented Jan 25, 2023

Great! I will set up and start experiments for read lengths 91, 111, 136, and 176 tomorrow (in addition to previously compared read lengths) on Drosophila, maize, CHM13 and rye. I will also add 125 and 250 as these are standard lengths I did not benchmark before.

@ksahlin (Owner) commented Jan 26, 2023

The evaluation is up and running but will take some time to complete due to job scheduling waiting times. I am also benchmarking the mapping modes of the two versions (-x) to more directly check the effect of seed parameters.

It can happen that alignment mode is "masking" a bit of the true difference if we, e.g., enter alignment rescue mode more often (which negatively affects runtime). Accuracy differences at seed level will be more apparent in map mode.

@ksahlin (Owner) commented Jan 27, 2023

Attached are the plots from the benchmark.

Overall: Decent cut in both memory and runtime. This may be a good time to say that I am very impressed with the work you have done!

Minor 'issue' in the results: The accuracy is slightly lower for read lengths 100 and 150 (and 75?) on maize and rye. To me, this is surprising. I would have expected the read length matching canonical read lengths to be identical-ish, as we use the same max_dist as v0.7.1. Any ideas?

I was also curious about why strobealign-master has many more reads aligned for the lowest read lengths.

accuracy_plot.pdf
time_plot.pdf
memory_plot.pdf
percentage_aligned_plot.pdf

@ksahlin (Owner) commented Jan 27, 2023

Also, the evaluation used strobealign with 16 cores for all runs.

Scrutinising the results a bit more:

  1. It looks like the decrease in accuracy comes from some alignment-specific feature, because the map-mode accuracy is identical (and map mode also maps many more reads now!). Perhaps the fix of the buggy if-statement (long ago), or is there a recent regression? The benchmark also highlights that we should be careful when extrapolating from Drosophila.
  2. Observe the peak memory difference in Drosophila for reads 300 and 500. Master uses something like 5-9 times less peak memory than v0.7.1. Could there be a memory leak in v0.7.1 or something?
  3. Related to 2: the runtime saving is mostly in align mode, not so much in map mode. As they both share seed generation and the find_nams functionality, it looks like something in the "alignment part" of the code is actually faster now. Or a more efficient parallelization (CPU usage)?
  4. As we predicted, the memory saving is not as substantial for highly repetitive genomes (maize and rye).

@ksahlin (Owner) commented Jan 27, 2023

Final note for now: I am restarting all strobealign v0.7.1 experiments, as some of the datapoints (50,75,100,150,200,250,300,500) were 'old runs' from the benchmark we did related to the accuracy regressions that we observed for some master commits a while ago. Just to make sure all runs are up to date.

Will get back confirming identity or update with new plots when they finish.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Final note for now: I am restarting all strobealign v0.7.1 experiments, as some of the datapoints (50,75,100,150,200,250,300,500) were 'old runs' from the benchmark we did related to the accuracy regressions that we observed for some master commits a while ago. Just to make sure all runs are up to date.

I’m writing a longer reply, but wanted to comment briefly regarding this: You should make sure that you book an entire node for the benchmark runs. The nodes have 20 cores now and if you book 16 (-n 16), you could get interference from other processes running on the same node. And you could also bump the number of threads to 20, but that won’t change much.

@ksahlin (Owner) commented Jan 27, 2023

Yes, I always book entire nodes for strobealign runtime benchmarks, even when I benchmark using a single core. I already got an email about potential over-allocation from uppmax ;)

@ksahlin (Owner) commented Jan 27, 2023

Another possibility explaining the runtime improvement is compilation flags. I am now using the cmake-compiled version of v0.7.1.

When I compare the runtime to what I reported in Fig S13 (suppl data) in the paper, they differ. In fact, the paper runtimes look very similar to runtimes of the current master.

So I am wondering if my compilation of v0.7.1 was not optimal? I used the version I compiled a while ago for the accuracy regressions benchmark. In the paper I used my old compilation instructions.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Nice plots! I’m surprised (but delighted) to see that mapping runtimes have decreased as well, and also that the mapping rate went up for the short reads.

I don’t know where the mapping runtime improvement actually comes from. I wonder if something else is going on. One thought was that there could be a bias depending on which node the job runs on, but the rackham nodes all have identical CPUs, so I think it’s fine to compare runtimes obtained on different nodes. Did you enable AVX2?

Minor 'issue' in the results: The accuracy is slightly lower for read lengths 100 and 150 (and 75?) on maize and rye. To me, this is surprising. I would have expected the read length matching canonical read lengths to be identical-ish, as we use the same max_dist as v0.7.1. Any ideas?

This must have something to do with the other changes made since v0.7.1. I’m guessing it’s once again from the changed sort order for the randstrobes.

BTW, just to confirm: Your plots say strobealign-master, but the measurements are actually for the readlen-sti branch, right?

@marcelm (Collaborator, Author) commented Jan 27, 2023

I ran a test over lunch, and I’m now pretty sure that most of the increase in mapping rate at the short read lengths is explained by commit 87fb45e, where I made a change to generate one more randstrobe at the end of each sequence. For 2x50 bp Drosophila reads, mapping rate and (!) accuracy increase by 0.1 percentage points with that commit.

The benchmark also highlights we should be careful to extrapolate from drosophila.

Yes, you’re right. I’d like to update the automated tests to use another reference that is perhaps a bit bigger and more repetitive, and perhaps also test other read lengths. I’ll open an issue about this.

  1. Observe the peak memory difference in Drosophila for reads 300 and 500. Master uses something like 5-9 times less peak memory than v0.7.1. Could there be a memory leak in v0.7.1 or something?

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

@ksahlin (Owner) commented Jan 27, 2023

New runs are still pending. Some answers:

BTW, just to confirm: Your plots say strobealign-master, but the measurements are actually for the readlen-sti branch, right?

Yes. "strobealign" is strobealign v0.7.1. "strobealign-master" is commit 8e53e41 (the readlen-sti branch).

Did you enable AVX2?

I am becoming unsure about which compiled version of 0.7.1 I am benchmarking against. I will let the new experiments run. Then, if we still observe the same results, I will run the conda installation of 0.7.1.

What I can say for sure is that:

  • For this benchmark, I installed commit 8e53e41 (named strobealign-master, which I realize should be renamed to strobealign-main) according to your instructions with cmake.
  • For my benchmark of 0.7.1 in the paper, I used my old installation instructions (with -mavx2), pasted below:
git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

Some more brainstorming:

  • What about us now having a smaller batch size (if this is the case)? That would explain why the difference gets larger with increased read length, although I cannot see how this causes several gigabytes of overhead.
  • Also, there is a drastic difference between align and map mode. I know SW alignment had a bug where it tried to align to reference segments larger than 2000 nt. Maybe this bug was unintentionally fixed? Again, I don't see how it can explain several gigabytes.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Did you enable AVX2?

I am becoming unsure about which compiled version of 0.7.1 I am benchmarking against. I will let the new experiments run. Then, if we still observe the same results, I will run the conda installation of 0.7.1.

Can you point me to the two binaries? I can try to check whether they have AVX2 instructions in them.

What I can say for sure is that:

  • For this benchmark, I installed commit 8e53e41 (named strobealign-master, which I realize should be renamed to strobealign-main) according to your instructions with cmake.

Hm, if you did as it said, you didn’t get AVX2, because we disabled that by default in #38. I’ll need to add to the instructions how to re-enable it (add -DENABLE_AVX=ON to the cmake options).

git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

That looks quite similar to what we do now, except for -fopenmp and the fact that -mavx2 is now optional.

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

Some more brainstorming:

  • What about us now having a smaller batch size (if this is the case)? That would explain why the difference gets larger with increased read length, although I cannot see how this causes several gigabytes of overhead.

Oh, but maybe that’s the explanation, at least for Drosophila? For simplicity, let’s say a 500 bp read uses 1 KiB (sequence + quality + name). Then a pair uses 2 KiB. We changed the chunk size from 100'000 to 10'000, so memory usage per chunk is reduced by 90'000 * 2 KiB ≈ 176 MiB. There’s one chunk for each thread, so with 16 threads, we arrive at about 2.7 GiB less memory used (which is actually larger than the index).

  • Also, there is a drastic difference between align and map mode. I know SW alignment had a bug where it tried to align to reference segments larger than 2000 nt. Maybe this bug was unintentionally fixed? Again, I don't see how it can explain several gigabytes.

I don’t know. If you want, I can adjust my "test all commits" scripts to run in map-only mode, which would allow us to pinpoint the commit where this changed.

@marcelm
Collaborator Author

marcelm commented Jan 27, 2023

If you want, I can adjust my "test all commits" scripts to run in map-only mode, which would allow us to pinpoint the commit where this changed.

I ran the adjusted script and, while watching it run, realized that there’s a second buffer for each thread that contains the processed output. If it’s PAF, it’s small, but if it’s SAM, it’s strictly larger than the input FASTQ. So the buffers alone roughly double the memory usage, to ~6 GiB.
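To illustrate why a SAM record is strictly larger than the FASTQ record it came from (a toy comparison with made-up alignment values, not strobealign’s actual output code):

```python
# A FASTQ record stores name, sequence and qualities (plus '@' and '+' markers).
# A SAM line repeats all three and adds FLAG, RNAME, POS, MAPQ, CIGAR,
# RNEXT, PNEXT and TLEN, so per record it is always larger.
name, seq = "read1", "ACGT" * 25            # a 100 bp read
qual = "I" * len(seq)

fastq = f"@{name}\n{seq}\n+\n{qual}\n"
# Hypothetical alignment values, just to fill the mandatory SAM columns:
sam = f"{name}\t0\tchr1\t12345\t60\t{len(seq)}M\t*\t0\t0\t{seq}\t{qual}\n"

print(len(fastq), len(sam))  # → 211 235
assert len(sam) > len(fastq)
```

The fixed columns add only a few dozen bytes here, but since sequence and qualities dominate both formats, an output buffer holding SAM records is always at least as large as the input buffer it was produced from.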

I think this explains the deviation between map-only and align mode.

@ksahlin
Owner

ksahlin commented Jan 28, 2023

A failure on my part. I now have the results after rerunning strobealign v0.7.1 (attached).

Accuracy

The previous analysis compared against old data for strobealign run with -M40 (remember that previous evaluation mishap). Somehow I convinced myself that I had rerun v0.7.1 in default mode after the mishap, but what I actually did was simply look up the 'old' CSV files with the accuracies for strobealign with default parameters. So the statistics on the server were a mix of default parameters (from the new read lengths) and -M40 (old read lengths).

For the new runs with default parameters, accuracy is near identical to before, except for very short reads, where our new version does a bit better, likely due to the extra seed.

Memory

I consider the reduced batch size together with the SAM/PAF buffer difference sufficient to explain the memory increase that we observe. Mystery solved.

Runtime

Runtime looks a tad better for the readlen-sti branch, but is overall quite similar. Also, the runtimes match what I reported in the paper, so the mystery is solved here as well.

accuracy_plot.pdf
time_plot.pdf
memory_plot.pdf
percentage_aligned_plot.pdf

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Can you point me to the two binaries? I can try to check whether they have AVX2 instructions in them.

I have put the two versions I ran in the folder /proj/snic2022-6-31/nobackup/kris/strobealign_compilations/

Would be interesting to know.

Hm, if you did as it said, you didn’t get AVX2 because we disabled that by default #38. I’ll need to add to the instructions how to re-enable it (add -DENABLE_AVX=ON to the cmake options).

Is automatic CPU detection easy to add? I feel like this would be a nice addition because I certainly didn't think about it, just blindly followed the installation instructions :)

If the version(s) were not run with AVX2, I am happy to recompile and rerun them. Although I claim that v0.7.1 should have those instructions, given how I previously installed it.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

I have put the two versions I ran in the folder /proj/snic2022-6-31/nobackup/kris/strobealign_compilations/

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Is Automatic CPU detection easy to add? I feel like this would be a nice addition because I certainly didn't think about it, just blindly followed the installation instructions :)

I’ll open a separate issue about this. There are various ways of achieving this.

For now, we can just change the default compilation instructions to include -DENABLE_AVX=ON and then tell people to omit that option if their CPU doesn’t support it.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

I now have the results after rerunning strobealign v0.7.1 (attached)

Great, so is this an accurate high-level summary for the changelog?

  • Reduced memory usage (due to (1) a different way of storing the index and (2) smaller chunk sizes)
  • Essentially same accuracy
  • Higher mapping rate for the shorter read lengths
  • Slightly faster mapping

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Okay, then I should rerun strobealign-8e53e41 with AVX2 instructions enabled so we can compare the actual difference in runtime. Do you mind giving me instructions for how to do this with the CMake installation?

Great, so is this an accurate high-level summary for the changelog?

I would wait a bit with "Slightly faster mapping" until we have also tried AVX2, although it may not change much.

Also, I would add a point stating something like "Several minor bug fixes (including SAM formatting)".

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

Besides documentation for any interface changes, nothing that I can think of. We have verified that this version performs as well as the previous version on accuracy, and better in terms of runtime and memory. All good to me.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Okay, then I should rerun strobealign-8e53e41 with AVX2 instructions enabled so we can compare the actual difference in runtime. Do you mind giving me instructions for how to do this with the CMake installation?

I opened #211 to update the README. In short:

cmake -B build -DENABLE_AVX=ON
make -j -C build

(With -C build, make changes into the directory build/ before starting.)

Also, I would add a point stating something like "Several minor bug fixes (including SAM formatting)".

This is already in the changelog under the "bug fixes" heading, but because the changelog is going to be somewhat longish, I wanted to add a short summary at the top.

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

Besides documentation for any interface changes, nothing that I can think of. We have verified that this version performs as well as the previous version on accuracy, and better in terms of runtime and memory. All good to me.

Good, then I’ll do those last things and merge.

@ksahlin
Owner

ksahlin commented Jan 28, 2023

cmake -B build -DENABLE_AVX=ON
make -j -C build

I did this on the same commit I evaluated here (8e53e41) and restarted all the runs for this version. Experiments are currently running.

This is already in the changelog under the "bug fixes" heading, but because the changelog is going to be somewhat longish, I wanted to add a short summary at the top.

Okay!

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Runs are completed. I verified that accuracy, percent aligned, and memory are identical (I checked only because of the previous mishaps).

Now, for the runtime (new time plot attached): while the results are for one run per data point (so expect some variation), it looks like the AVX2 version is slightly faster. The gain seems to be largest on Drosophila(?). Maybe because our new indexing approach in this branch is most active in smaller genomes with a higher fraction of unique seeds? (See the difference in map-only mode on Drosophila.)

All in all, a nice combination of reduced runtime and memory in this pull request. I am happy to merge it.

time_plot.pdf

@marcelm
Collaborator Author

marcelm commented Jan 30, 2023

Runs are completed. I verified that accuracy, percent aligned, and memory are identical (I checked only because of the previous mishaps).

Great!

Now, for the runtime (new time plot attached): while the results are for one run per data point (so expect some variation), it looks like the AVX2 version is slightly faster. The gain seems to be largest on Drosophila(?). Maybe because our new indexing approach in this branch is most active in smaller genomes with a higher fraction of unique seeds? (See the difference in map-only mode on Drosophila.)

I also think that’s the reason for the difference. Fewer memory accesses for unique seeds should make things slightly faster.

All in all, a nice combination of reduced runtime and memory in this pull request. I am happy to merge it.

Nice, then let’s do it.

Successfully merging this pull request may close these issues.

The separate indexing and alignment interface.