
Include read length in the index filename #195

Merged
merged 8 commits into from
Jan 30, 2023

Conversation

@marcelm (Collaborator) commented Jan 19, 2023

Closes #133.

We currently have the concept of an (actual) read length, which is either estimated from the input FASTQs or provided using the -r command-line option. This PR introduces the concept of a "canonical read length". The canonical read length is one of 50, 100, 150, 225, 325 and 400 (we choose the one that is closest to the actual read length) and represents one of the six pre-defined index parameter settings.

With this PR, all parameters are entirely based on the canonical read length only, which means that it can be re-used for datasets that have similar read lengths. This was almost the case before except that the max_dist parameter was computed from the actual read length. Commit 8ba1f93 changes this so that max_dist is computed from the canonical read length. (This changes mapping results when actual and canonical read length differ.)

To make it possible to store multiple indices with different canonical read lengths alongside each other, the canonical read length is also included in the index filename, as in ref.fasta.r100.sti. I chose to include the r here mainly because I think it may be clearer that way what the number represents. Another minor reason is that we could extend the naming scheme at a later point to include other parameters. For example, .r100k24.sti could mean that one needs to use -r 100 -k 24 to reproduce the settings the index was created with.

At the moment, if the user overrides any indexing parameters, the file name extension is going to be just .sti as before. The user then again has the problem that they cannot store multiple of these indices in the same directory, but I think solving this has much lower priority. (One workaround is to symlink the reference into separate directories for each parameter setting and then create the index there.) If the index doesn’t match the command-line settings, then there’s an error message in any case, so accidentally using the incorrect index should not be possible.


  • Update documentation
  • Update changelog
  • Ensure we get sensible error messages if index settings don’t match or an index couldn’t be found
  • Measure how accuracy changes due to max_dist now being based on canonical, not actual read length
  • Update baseline commit

@marcelm (Collaborator, Author) commented Jan 19, 2023

CHM13

2 x 100 bp reads

-r    mapping rate  accuracy
76    99.79551      93.263595
100   99.804485     93.27491
125   99.803985     93.27183

rye

2 x 100 bp reads

-r    mapping rate  accuracy
76    99.692785     85.54961
100   99.797135     85.56528
125   99.80467      85.52795

Hm, that drop in mapping rate for -r 76 is a bit bigger than the other differences.

@ksahlin (Owner) commented Jan 20, 2023

So if I interpret your analysis correctly, we are 'reversing the order of the analysis' as you described earlier. That is, you test different settings for -r (or max_dist, really) for a fixed read length. Your analysis suggests that for the shortest read lengths (76) that get assigned to the canonical read length 100, indexing with a max_dist based on 100 could have a larger effect.

Some brainstorming:

  1. Set max_dist based on the lowest read length in the interval.
  2. Use tighter canonical intervals in the lower range. One could imagine splitting the 75-125 range into 75-100 and 100-125, where only max_dist changes (set to 87 or 112, I guess).

I like suggestion 2 better, because I think this difference is only noticeable in the shorter read range(?). One analysis that would shed some light on this question would be to index a genome with different values of -r and see how many times max_dist is 'active' when getting our window of candidate syncmers.

Suggestion 1 is an easy fix but it has the possible drawback of slower relative runtime by being conservative (not leveraging the larger spans that we could use for the reads).

For suggestion 2, in https://github.com/ksahlin/strobealign/blob/main/src/index.cpp:

        settings {75, 20, -4, -4, 2},
        settings {100, 20, -4, -3, 2},
        settings {125, 20, -4, -2, 2},

Here I just made a guess on settings {100, 20, -4, -3, 2}, but I think the (-3, 2) interval makes sense.

What do you think?

@marcelm (Collaborator, Author) commented Jan 20, 2023

More fine-grained read-length intervals is an option. This would make it a bit more difficult to generate all possible indices in advance. I think some users/admins would want to do that "just to be prepared". And then strobealign could gain a reputation for requiring massive amounts of storage for its indices ...

Shouldn’t the ranges have the most common read lengths as their middle points to ensure that the index works best with them? (Or at least have the max_dist setting optimized for it.) 76-125 was good because it had 100 in the middle, so perhaps it needs to be split into three parts. Or change the other intervals as well. Have you by any chance done a survey to find out which read lengths are most commonly used in practice? I would guess many people use the maximum Illumina read lengths, which are 100, 125, 150, 250, 300 (interesting, didn’t know about the 125).

One analysis that would shed some light on this question would be to index a genome with different values of -r and see how many times max_dist is 'active' when getting our window of candidate syncmers.

I’ll try to do something in that regard.

@ksahlin (Owner) commented Jan 20, 2023

More fine-grained read-length intervals is an option. This would make it a bit more difficult to generate all possible indices in advance. I think some users/admins would want to do that "just to be prepared". And then strobealign could gain a reputation for requiring massive amounts of storage for its indices ...

Ok agree.

Shouldn’t the ranges have the most common read lengths as their middle points to ensure that the index works best with them?

That makes sense. Would then be interesting to investigate parameters for 100, 125, 150, 250, 300. In my study I benchmarked only 100, 150, and 300 out of those numbers. (The other lengths I investigated were 50, 75, 200, 500 which are less relevant with Illumina, for now.)

I can with little effort add 125 and 250 to my benchmarks. Interestingly 125 is right on the border which probably means the current parameters (also the syncmer span) aren't ideal for them.

How do you want to proceed with this? I am happy to run benchmarks whenever we have something to try.

@marcelm (Collaborator, Author) commented Jan 20, 2023

I added a "dumprandstrobes" tool in #198 to be able to look at the randstrobes for any FASTA file and any parameters.

I have only started using it, but it was already helpful because it reminded me that 1) max_dist is computed as read length minus 70 and 2) max_dist is restricted to be at least $k$. That means that it is always 20 within the range 76-90. (That is, the index is identical for -r 76 and -r 90.)

I’ll run the tool a bit to get an impression of how much the max_dist setting changes the index.

@ksahlin (Owner) commented Jan 20, 2023

great!

@marcelm (Collaborator, Author) commented Jan 23, 2023

I ran dumprandstrobes on the fruit fly genome, varying the read length from 70 to 287 (at which point my disk was full) and then counted how many randstrobes change when comparing read length $n$ to read length $n+1$.

As expected, small max_dist values have the strongest effect. I’ll post another table later looking at the max_dist values directly.

See the table below. The total number of randstrobes was always about 26.5 million.
len1 → len2   no. of changes
70 → 71 0
71 → 72 0
72 → 73 0
73 → 74 0
74 → 75 0
75 → 76 11067172
76 → 77 0
77 → 78 0
78 → 79 0
79 → 80 0
80 → 81 0
81 → 82 0
82 → 83 0
83 → 84 0
84 → 85 0
85 → 86 0
86 → 87 0
87 → 88 0
88 → 89 0
89 → 90 0
90 → 91 1299302
91 → 92 1177129
92 → 93 1097366
93 → 94 1048296
94 → 95 938567
95 → 96 873557
96 → 97 828832
97 → 98 742453
98 → 99 673770
99 → 100 623985
100 → 101 545579
101 → 102 482820
102 → 103 424230
103 → 104 357368
104 → 105 303858
105 → 106 254989
106 → 107 204637
107 → 108 165840
108 → 109 134761
109 → 110 105163
110 → 111 82478
111 → 112 65060
112 → 113 49456
113 → 114 38211
114 → 115 29376
115 → 116 22025
116 → 117 16366
117 → 118 12462
118 → 119 8984
119 → 120 6692
120 → 121 4947
121 → 122 3844
122 → 123 2637
123 → 124 2013
124 → 125 1437
125 → 126 20664155
126 → 127 374043
127 → 128 334959
128 → 129 301984
129 → 130 276357
130 → 131 242701
131 → 132 215097
132 → 133 191469
133 → 134 164461
134 → 135 141917
135 → 136 123035
136 → 137 104145
137 → 138 89577
138 → 139 74392
139 → 140 62325
140 → 141 50928
141 → 142 42005
142 → 143 34895
143 → 144 28495
144 → 145 23336
145 → 146 18428
146 → 147 15286
147 → 148 12245
148 → 149 10053
149 → 150 7786
150 → 151 6236
151 → 152 4753
152 → 153 3854
153 → 154 2976
154 → 155 2277
155 → 156 1911
156 → 157 1627
157 → 158 1107
158 → 159 969
159 → 160 700
160 → 161 549
161 → 162 425
162 → 163 314
163 → 164 239
164 → 165 210
165 → 166 155
166 → 167 117
167 → 168 118
168 → 169 79
169 → 170 63
170 → 171 51
171 → 172 47
172 → 173 39
173 → 174 33
174 → 175 33
175 → 176 17849451
176 → 177 28711
177 → 178 24443
178 → 179 20888
179 → 180 17691
180 → 181 14755
181 → 182 12228
182 → 183 10303
183 → 184 8806
184 → 185 6798
185 → 186 5977
186 → 187 4915
187 → 188 4138
188 → 189 3334
189 → 190 2651
190 → 191 2214
191 → 192 1848
192 → 193 1444
193 → 194 1200
194 → 195 1000
195 → 196 832
196 → 197 638
197 → 198 557
198 → 199 477
199 → 200 352
200 → 201 277
201 → 202 211
202 → 203 197
203 → 204 152
204 → 205 125
205 → 206 118
206 → 207 86
207 → 208 56
208 → 209 61
209 → 210 55
210 → 211 44
211 → 212 45
212 → 213 28
213 → 214 44
214 → 215 24
215 → 216 26
216 → 217 23
217 → 218 20
218 → 219 28
219 → 220 30
220 → 221 26
221 → 222 57
222 → 223 65
223 → 224 80
224 → 225 101
225 → 226 106
226 → 227 107
227 → 228 138
228 → 229 164
229 → 230 156
230 → 231 173
231 → 232 165
232 → 233 176
233 → 234 174
234 → 235 158
235 → 236 173
236 → 237 148
237 → 238 142
238 → 239 152
239 → 240 147
240 → 241 138
241 → 242 116
242 → 243 126
243 → 244 99
244 → 245 124
245 → 246 110
246 → 247 103
247 → 248 104
248 → 249 90
249 → 250 89
250 → 251 105
251 → 252 80
252 → 253 76
253 → 254 77
254 → 255 75
255 → 256 64
256 → 257 63
257 → 258 68
258 → 259 61
259 → 260 69
260 → 261 67
261 → 262 74
262 → 263 62
263 → 264 55
264 → 265 54
265 → 266 63
266 → 267 48
267 → 268 52
268 → 269 56
269 → 270 40
270 → 271 40
271 → 272 50
272 → 273 43
273 → 274 43
274 → 275 42
275 → 276 0 (bug?)
276 → 277 29
277 → 278 24
278 → 279 20
279 → 280 23
280 → 281 20
281 → 282 23
282 → 283 19
283 → 284 11
284 → 285 16
285 → 286 12
286 → 287 8

@marcelm (Collaborator, Author) commented Jan 23, 2023

Here’s a histogram for the randstrobes in CHM13, generated using -r 125 and -r 76. This shows the frequencies of offsets, which are computed as strobe2.start - strobe1.start. The read lengths result in max. offsets of 75 and 40, respectively (max_dist is $k=20$ lower because the "distance" would be strobe2.start - strobe1.end).

CHM13
There are 577 million randstrobes in CHM13. If running with -r 125, 235 million of them have an offset greater than 40. They hit max_dist when running with -r 90 or less. 70 million have an offset greater than 50. They hit max_dist for -r 100.

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

@ksahlin (Owner) commented Jan 25, 2023

The read lengths result in max. offsets of 75 and 40,

To establish terminology, do you mean they result in maximum seed lengths of 75 and 40? Because the 'offset' between the first and second syncmer would be max(r-70, k), which would always be 20 for r=76; together with a syncmer size of 20, that gives a maximum seed size of 40 (the offset of the second syncmer from the first plus the syncmer size).

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

The w_max is the limit on the number of syncmers in the window. Some regions have a sparser syncmer distribution (and there are no syncmers in regions of N's). max_dist ensures a seed size limit in nucleotide space. For example, without max_dist we would get seeds that were millions of bases long (spanning the telomere regions with N's in hg38). max_dist was a practical way to limit this, and it also lets us store the offset size in 8 bits.

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

Sounds interesting, but I am worried there will be many practical complications:
i) The k and s vary for the higher settings (k-s=6), which will not be compatible with the lower (k-s=4) settings
ii) I used a different k for two of the highest options. (I will see if someone can perform a parameter optimisation analysis.)
iii) If max_dist should be adjusted, then better to just store the syncmers and their positions, as the window [w_min, w_max] (in syncmers) will be determined by the read length. (I don't think you can modify the randstrobes after they are created)

or perhaps I am thinking about it the wrong way?

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

You mean your modified version of my option 2, right? Which also has "the ranges have the most common read lengths as their middle points to ensure that the index works best with them".

I don't have a strong preference here, basically because I can't come up with a this-is-the-way-to-go solution.

@marcelm (Collaborator, Author) commented Jan 25, 2023

The read lengths result in max. offsets of 75 and 40,

To establish terminology, do you mean they result in maximum seed lengths of 75 and 40? Because the 'offset' between the first and second syncmer would be max(r-70, k), which would always be 20 for r=76; together with a syncmer size of 20, that gives a maximum seed size of 40 (the offset of the second syncmer from the first plus the syncmer size).

Yes, you’re right. What’s shown in the figure is the seed length. Sorry about the confusion – I computed the length incorrectly for the statistics, which made me think offset and distance are somehow different things. But of course offset and distance are both the difference in strobe start positions, and adding $k$ gives us the seed length.

BTW, why did you decide to introduce max_dist instead of adjusting w_max?

The w_max is the limit on the number of syncmers in the window. Some regions have a sparser syncmer distribution (and there are no syncmers in regions of N's). max_dist ensures a seed size limit in nucleotide space. For example, without max_dist we would get seeds that were millions of bases long (spanning the telomere regions with N's in hg38). max_dist was a practical way to limit this, and it also lets us store the offset size in 8 bits.

That makes sense. Not that it matters here, but I just wonder whether it would be nicer (conceptually) to only have one criterion. So one could only use max_dist and abandon w_max.

I had this idea of using the highest randstrobe setting to create the on-disk index and then dynamically updating it to use a lower max_dist setting after reading it in from disk, based on the actual read length, but that sounds complicated and may not be much quicker than just recomputing the index ...

Sounds interesting, but I am worried there will be many practical complications:
i) The k and s vary for the higher settings (k-s=6), which will not be compatible with the lower (k-s=4) settings

Yes, I thought one would still have multiple indices as before and only adjust those randstrobes that differ due to the different max_dist settings.

ii) I used a different k for two of the highest options. (I will see if someone can perform a parameter optimisation analysis.)
iii) If max_dist should be adjusted, then better to just store the syncmers and their positions, as the window [w_min, w_max] (in syncmers) will be determined by the read length. (I don't think you can modify the randstrobes after they are created)

Good point, only storing syncmers would be an option. On the other hand, the savings would be lower because generating randstrobes also takes time ... But I don’t think it’s worth going in this direction at the moment, it was just a brainstorming idea.

I’m reluctant because we need to do another round of benchmarks, but it seems the simplest option is your option 2 (tighter intervals).

You mean your modified version of my option 2, right? Which also has "the ranges have the most common read lengths as their middle points to ensure that the index works best with them".

Yes, but I described it unclearly: The essential point is that we choose max_dist so that it works best with the read lengths we’re most likely to encounter. Instead of computing max_dist from the canonical read length, we can hardcode it as part of the settings, and then we can choose whatever we want for each canonical read length, and it doesn’t need to be centered. (The canonical read length would then just become the name of one of the pre-defined index parameters.)

Because you designed these settings initially, I don’t have a good intuition for how they would need to be changed, but let me make a suggestion as a starting point anyway. I think we want this:

  • One extra setting centered at read length 125 to cater for that read length
  • Pick the max_dist so that it works best with the most likely read length in that interval (read length minus 70)

Current settings

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=75                20  16  1      6      20
100                    <=125               20  16  2      6      20..55
150                    <=175               20  16  5      11     56..105
225                    <=275               20  16  8      17     106..205
325                    <=375               22  18  6      16     206..255
400                    >375                23  17  5      15     255

Possible settings

For the added read length 125, I interpolated w_min and w_max from the adjacent settings, but one might as well pick 4 and 9.

Bold numbers designate changed or new values.

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=75                20  16  1      6      20
100                    <=110               20  16  2      6      30
125                    <=135               20  16  3      8      55
150                    <=175               20  16  5      11     80
250                    <=275               20  16  8      17     180
300                    <=375               22  18  6      16     230
400                    >375                23  17  5      15     255

@ksahlin (Owner) commented Jan 25, 2023

I think most of those settings look good. The only one I am hesitant about is the <=75 interval. Reads of length 76 will now be bumped up to max_dist 30 while being capped at 20 before. I believe we should be conservative with this interval and do:

canonical read length  read len. interval  k   s   w_min  w_max  max_dist
50                     <=90                20  16  1      6      20
100                    <=110               20  16  2      6      30
125                    <=135               20  16  3      8      55
150                    <=175               20  16  5      11     80
250                    <=275               20  16  8      17     180
300                    <=375               22  18  6      16     230
400                    >375                23  17  5      15     255

(I have edited only the read len. interval column in the first row.) This basically states that max_dist has precedence over the w_min/w_max boundaries for very short reads.

Before this request is merged, I could start a benchmark including some new read lengths and compare to v0.7.1. As the results should be similar for the actual canonical read lengths (125, 250, 300), I should probably test read lengths 76, 111, 136, and 176 for 'worst case scenarios', right?

There are six canonical read lengths (50, 100, 150, 225, 325 and 400) which
represent the six pre-defined index parameter settings.

The canonical read length is now used as part of index filename as in
"ref.fasta.r100.sti"
@marcelm (Collaborator, Author) commented Jan 25, 2023

I added a new setting for read length 125 using your version of the table. Makes absolute sense to end that interval at 90. I also bumped the baseline commit so that we finally get a green checkmark.

Before this request is merged, I could start a benchmark including some new read lengths and compare to v0.7.1. As the results should be similar for the actual canonical read lengths (125, 250, 300), I should probably test read lengths 76, 111, 136, and 176 for 'worst case scenarios', right?

Yes, this PR is ready now; it would be good to run some benchmarks that also test some of these worst-case lengths.

@ksahlin (Owner) commented Jan 25, 2023

Great! I will set up and start experiments for read lengths 91, 111, 136, and 176 tomorrow (in addition to previously compared read lengths) on Drosophila, maize, CHM13 and rye. I will also add 125 and 250 as these are standard lengths I did not benchmark before.

@ksahlin (Owner) commented Jan 26, 2023

The evaluation is up and running but will take some time to complete due to job scheduling waiting times. I am also benchmarking the mapping modes of the two versions (-x) to more directly check the effect of seed parameters.

It can happen that alignment mode is "masking" a bit of the true difference if we, e.g., enter alignment rescue mode more often (which negatively affects runtime). Accuracy differences at seed level will be more apparent in map mode.

@ksahlin (Owner) commented Jan 27, 2023

Attached are the plots from the benchmark.

Overall: Decent cut in both memory and runtime. This may be a good time to say that I am very impressed with the work you have done!

Minor 'issue' in the results: The accuracy is slightly lower for read lengths 100 and 150 (and 75?) on maize and rye. To me, this is surprising. I would have expected the read length matching canonical read lengths to be identical-ish, as we use the same max_dist as v0.7.1. Any ideas?

I was also curious about why strobealign-master has many more reads aligned for the lowest read lengths.

accuracy_plot.pdf
time_plot.pdf
memory_plot.pdf
percentage_aligned_plot.pdf

@ksahlin (Owner) commented Jan 27, 2023

Also, the evaluation used strobealign with 16 cores for all runs.

Scrutinising the results a bit more:

  1. It looks like the decrease in accuracy comes from some alignment-specific feature, because the map-mode accuracy is identical (and map mode also maps many more reads now!). Perhaps the fix of the buggy if-statement (long ago), or is there a recent regression? The benchmark also highlights that we should be careful when extrapolating from Drosophila.
  2. Observe the peak memory difference in Drosophila for reads 300 and 500. Master uses something like 5-9 times less peak memory than v0.7.1. Could there be a memory leak in v0.7.1 or something?
  3. Related to 2: the runtime saving is mostly in align mode, not so much in map mode. As they both share seed generation and the find_nams functionality, it looks like something in the "alignment part" of the code is actually faster now. Or a more efficient parallelization (CPU usage)?
  4. As we predicted, the memory saving is not as substantial for highly repetitive genomes (maize and rye).

@ksahlin (Owner) commented Jan 27, 2023

Final note for now: I am restarting all strobealign v0.7.1 experiments, as some of the datapoints (50,75,100,150,200,250,300,500) were 'old runs' from the benchmark we did related to the accuracy regressions that we observed for some master commits a while ago. Just to make sure all runs are up to date.

Will get back confirming identity or update with new plots when they finish.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Final note for now: I am restarting all strobealign v0.7.1 experiments, as some of the datapoints (50,75,100,150,200,250,300,500) were 'old runs' from the benchmark we did related to the accuracy regressions that we observed for some master commits a while ago. Just to make sure all runs are up to date.

I’m writing a longer reply, but wanted to comment briefly regarding this: You should make sure that you book an entire node for the benchmark runs. The nodes have 20 cores now and if you book 16 (-n 16), you could get interference from other processes running on the same node. And you could also bump the number of threads to 20, but that won’t change much.

@ksahlin (Owner) commented Jan 27, 2023

Yes, I always book entire nodes for strobealign runtime benchmarks, even when I benchmark using a single core. I already got an email about potential over-allocation from uppmax ;)

@ksahlin (Owner) commented Jan 27, 2023

Another possibility explaining the runtime improvement is compilation flags. I am now using the cmake-compiled version of v0.7.1.

When I compare the runtime to what I reported in Fig S13 (suppl data) in the paper, they differ. In fact, the paper runtimes look very similar to runtimes of the current master.

So I am wondering if my compilation of v0.7.1 was not optimal? I used the version I compiled a while ago for the accuracy regressions benchmark. In the paper I used my old compilation instructions.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Nice plots! I’m surprised (but delighted) to see that mapping runtimes have decreased as well, and also that the mapping rate went up for the short reads.

I don’t know where the mapping runtime improvement actually comes from. I wonder if something else is going on. One thought was that there could be a bias depending on which node the job runs on, but the rackham nodes all have identical CPUs, so I think it’s fine to compare runtimes obtained on different nodes. Did you enable AVX2?

Minor 'issue' in the results: The accuracy is slightly lower for read lengths 100 and 150 (and 75?) on maize and rye. To me, this is surprising. I would have expected the read length matching canonical read lengths to be identical-ish, as we use the same max_dist as v0.7.1. Any ideas?

This must have something to do with the other changes made since v0.7.1. I’m guessing it’s once again from the changed sort order for the randstrobes.

BTW, just to confirm: Your plots say strobealign-master, but the measurements are actually for the readlen-sti branch, right?

@marcelm (Collaborator, Author) commented Jan 27, 2023

I ran a test over lunch, and I’m now pretty sure that most of the increase in mapping rate at the short read lengths is explained by commit 87fb45e, where I made a change to generate one more randstrobe at the end of each sequence. For 2x50 bp Drosophila reads, mapping rate and (!) accuracy increase by 0.1 percentage points with that commit.

The benchmark also highlights we should be careful to extrapolate from drosophila.

Yes, you’re right. I’d like to update the automated tests to use another reference that is perhaps a bit bigger and more repetitive, and perhaps also test other read lengths. I’ll open an issue about this.

  1. Observe the peak memory difference in Drosophila for reads 300 and 500. Master uses something like 5-9 times less peak memory than v0.7.1. Could there be a memory leak in v0.7.1 or something?

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

@ksahlin (Owner) commented Jan 27, 2023

New runs are still pending. Some answers:

BTW, just to confirm: Your plots say strobealign-master, but the measurements are actually for the readlen-sti branch, right?

Yes. "strobealign" is strobealign v0.7.1. "strobealign-master" is commit 8e53e41 (the readlen-sti branch).

Did you enable AVX2?

I am becoming unsure about which compiled version of 0.7.1 I am benchmarking against. I will let the new experiments run. Then, if we still observe the same results, I will run the conda installation of 0.7.1.

What I can say for sure is that:

  • For this benchmark, I installed commit 8e53e41 (named strobealign-master, which I realize should be renamed to strobealign-main) according to your instructions with cmake.
  • For my benchmark of 0.7.1 in the paper, I used my old installation instructions (with -mavx2), pasted below:
git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

Some more brainstorming:

  • What about us now having a smaller batch size (if this is the case)? That would explain why the difference gets larger with increased read length, although I cannot see how this causes several gigabytes of overhead.
  • Also, there is a drastic difference between align and map mode. I know SW alignment had a bug where it tried to align to reference segments larger than 2000 nt. Maybe this bug was unintentionally fixed? Again, I don't see how it can explain several gigabytes.

@marcelm (Collaborator, Author) commented Jan 27, 2023

Did you enable AVX2?

I am becoming unsure about which compiled version of 0.7.1 I am benchmarking against. I will let the new experiments run. Then, if we still observe the same results, I will run the conda installation of 0.7.1.

Can you point me to the two binaries? I can try to check whether they have AVX2 instructions in them.

What I can say for sure is that:

  • For this benchmark, I installed commit 8e53e41 (named strobealign-master, which I realize should be renamed to strobealign-main) according to your instructions with cmake.

Hm, if you did as it said, you didn’t get AVX2, because we disabled that by default in #38. I’ll need to add to the instructions how to re-enable it (add -DENABLE_AVX=ON to the cmake options).

git clone https://github.com/ksahlin/StrobeAlign
cd StrobeAlign
# Needs a newer g++ version. Tested with version 8 and upwards.
g++ -std=c++14 main.cpp source/index.cpp source/xxhash.c source/ksw2_extz2_sse.c source/ssw_cpp.cpp source/ssw.c -lz -fopenmp -o strobealign -O3 -mavx2

That looks quite similar to what we do now, except for -fopenmp and the fact that -mavx2 is now optional.

That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.

Some more brainstorming:

  • What about us now having a smaller batch size (if this is the case)? That would explain why the difference gets larger with increased read length, although I cannot see how this causes several gigabytes of overhead.

Oh, but maybe that’s the explanation, at least for Drosophila? For simplicity, let’s say a 500 bp read uses 1 KiB (sequence + quality + name). Then a pair uses 2 KiB. We changed the chunk size from 100'000 to 10'000, so memory usage per chunk is reduced by 90'000 * 2 KiB ≈ 176 MiB. There’s one chunk for each thread, so with 16 threads, we arrive at about 2.7 GiB less memory used (which is actually larger than the index).

  • Also, there is a drastic difference between align and map mode. I know SW alignment had a bug where it tried to align to reference segments larger than 2000 nt. Maybe this bug was unintentionally fixed? Again, I don't see how it can explain several gigabytes.

I don’t know. If you want, I can adjust my "test all commits" scripts to run in map-only mode, which would allow us to pinpoint the commit where this changed.

@marcelm
Collaborator Author

marcelm commented Jan 27, 2023

If you want, I can adjust my "test all commits" scripts to run in map-only mode, which would allow us to pinpoint the commit where this changed.

I ran the adjusted script and, while watching it run, realized that there’s a second buffer for each thread that contains the processed output. If it’s PAF, it’s small, but if it’s SAM, it’s strictly larger than the input FASTQ. So the buffers alone roughly double the memory usage, to ~6 GiB.
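To illustrate why a SAM record is strictly larger than the FASTQ record it came from (a toy comparison with made-up alignment values, not strobealign’s actual output code):

```python
# A FASTQ record stores name, sequence and qualities (plus '@' and '+' markers).
# A SAM line repeats all three and adds FLAG, RNAME, POS, MAPQ, CIGAR,
# RNEXT, PNEXT and TLEN, so per record it is always larger.
name, seq = "read1", "ACGT" * 25            # a 100 bp read
qual = "I" * len(seq)

fastq = f"@{name}\n{seq}\n+\n{qual}\n"
# Hypothetical alignment values, just to fill the mandatory SAM columns:
sam = f"{name}\t0\tchr1\t12345\t60\t{len(seq)}M\t*\t0\t0\t{seq}\t{qual}\n"

print(len(fastq), len(sam))  # → 211 235
assert len(sam) > len(fastq)
```

The fixed columns add only a few dozen bytes here, but since sequence and qualities dominate both formats, an output buffer holding SAM records is always at least as large as the input buffer it was produced from.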

I think this explains the deviation between map-only and align mode.

@ksahlin
Owner

ksahlin commented Jan 28, 2023

A failure on my part. I now have the results after rerunning strobealign v0.7.1 (attached).

Accuracy

The previous analysis compared against old data for strobealign run with -M40 (remember that previous evaluation mishap). Somehow I convinced myself that I had rerun v0.7.1 in default mode after the mishap, but what I actually did was simply look up the 'old' CSV files with the accuracies for strobealign with default parameters. So the statistics on the server were a mix of default parameters (from the new read lengths) and -M40 (old read lengths).

For the new runs with default parameters, accuracy is near identical to before, except for very short reads, where our new version does a bit better, likely due to the extra seed.

Memory

I consider the reduced batch size together with the SAM/PAF buffer difference sufficient to explain the memory increase that we observe. Mystery solved.

Runtime

Runtime looks a tad better for the readlen-sti branch, but is overall quite similar. Also, the runtimes match what I reported in the paper, so the mystery is solved here as well.

accuracy_plot.pdf
time_plot.pdf
memory_plot.pdf
percentage_aligned_plot.pdf

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Can you point me to the two binaries? I can try to check whether they have AVX2 instructions in them.

I have put the two versions I ran in the folder /proj/snic2022-6-31/nobackup/kris/strobealign_compilations/

Would be interesting to know.

Hm, if you did as it said, you didn’t get AVX2 because we disabled that by default #38. I’ll need to add to the instructions how to re-enable it (add -DENABLE_AVX=ON to the cmake options).

Is automatic CPU detection easy to add? I feel like this would be a nice addition because I certainly didn't think about it, just blindly followed the installation instructions :)

If the version(s) were not run with AVX2, I am happy to recompile and rerun them. Although I claim that v0.7.1 should have those instructions, given how I previously installed it.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

I have put the two versions I ran in the folder /proj/snic2022-6-31/nobackup/kris/strobealign_compilations/

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Is Automatic CPU detection easy to add? I feel like this would be a nice addition because I certainly didn't think about it, just blindly followed the installation instructions :)

I’ll open a separate issue about this. There are various ways of achieving this.

For now, we can just change the default compilation instructions to include -DENABLE_AVX=ON and then tell people to omit that option if their CPU doesn’t support it.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

I now have the results after rerunning strobealign v0.7.1 (attached)

Great, so is this an accurate high-level summary for the changelog?

  • Reduced memory usage (due to (1) a different way of storing the index and (2) smaller chunk sizes)
  • Essentially same accuracy
  • Higher mapping rate for the shorter read lengths
  • Slightly faster mapping

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Okay, then I should rerun strobealign-8e53e41 with AVX2 instructions enabled so we can compare the actual difference in runtime. Do you mind giving me instructions for how to do this with the CMake installation?

Great, so is this an accurate high-level summary for the changelog?

I would wait a bit with "Slightly faster mapping" until we have also tried AVX2, although it may not change much.

Also, I would add a point stating something like "Several minor bug fixes (including SAM formatting)".

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

Besides documentation for any interface changes, nothing that I can think of. We have verified that this version performs as well as the previous version on accuracy, and better in terms of runtime and memory. All good to me.

@marcelm
Collaborator Author

marcelm commented Jan 28, 2023

Poking around a bit in the disassembly, I’m quite sure that strobealign-8e53e41 was compiled without AVX2 and strobealign-v0.7.1 with AVX2.

Okay, then I should rerun strobealign-8e53e41 with AVX2 instructions enabled so we can compare the actual difference in runtime. Do you mind giving me instructions for how to do this with the CMake installation?

I opened #211 to update the README. In short:

cmake -B build -DENABLE_AVX=ON
make -j -C build

(With -C build, make changes into the directory build/ before starting.)

Also, I would add a point stating something like "Several minor bug fixes (including SAM formatting)".

This is already in the changelog under the "bug fixes" heading, but because the changelog is going to be somewhat longish, I wanted to add a short summary at the top.

And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later)

Besides documentation for any interface changes, nothing that I can think of. We have verified that this version performs as well as the previous version on accuracy, and better in terms of runtime and memory. All good to me.

Good, then I’ll do those last things and merge.

@ksahlin
Owner

ksahlin commented Jan 28, 2023

cmake -B build -DENABLE_AVX=ON
make -j -C build

I did this on the same commit I evaluated here (8e53e41) and restarted all the runs for this version. Experiments are currently running.

This is already in the changelog under the "bug fixes" heading, but because the changelog is going to be somewhat longish, I wanted to add a short summary at the top.

Okay!

@ksahlin
Owner

ksahlin commented Jan 28, 2023

Runs are completed. I verified that accuracy, percent aligned, and memory are identical (I checked only because of the previous mishaps).

Now, for the runtime (new time plot attached): while the results are for one run per data point (so expect some variation), it looks like the AVX2 version is slightly faster. The gain seems to be largest on Drosophila(?). Maybe because our new indexing approach in this branch is most active in smaller genomes with a higher fraction of unique seeds? (See the difference in map-only mode on Drosophila.)

All in all, a nice combination of reduced runtime and memory in this pull request. I am happy to merge it.

time_plot.pdf

@marcelm
Collaborator Author

marcelm commented Jan 30, 2023

Runs are completed. I verified that accuracy, percent aligned, and memory are identical (I checked only because of the previous mishaps).

Great!

Now, for the runtime (new time plot attached): while the results are for one run per data point (so expect some variation), it looks like the AVX2 version is slightly faster. The gain seems to be largest on Drosophila(?). Maybe because our new indexing approach in this branch is most active in smaller genomes with a higher fraction of unique seeds? (See the difference in map-only mode on Drosophila.)

I also think that’s the reason for the difference. Fewer memory accesses for unique seeds should make things slightly faster.

All in all, a nice combination of reduced runtime and memory in this pull request. I am happy to merge it.

Nice, then let’s do it.

Successfully merging this pull request may close these issues.

The separate indexing and alignment interface.