Include read length in the index filename #195
Conversation
CHM13 2 x 100 bp reads
rye 2 x 100 bp reads
Hm, that drop in mapping rate for ...
So if I interpret your analysis correctly, we are 'reversing the order of the analysis' as you described earlier. That is, you test different settings for ...

Some brainstorming:
I like suggestion 2 better, because I think this difference is only noticeable in the shorter read range(?). One analysis that would shed some light on this question would be to index a genome with different values of -r and see how many times max_dist is 'active' when getting our window of candidate syncmers. Suggestion 1 is an easy fix, but it has the possible drawback of slower relative runtime by being conservative (not leveraging the larger spans that we could use for the reads). For suggestion 2, in https://github.com/ksahlin/strobealign/blob/main/src/index.cpp:
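For the analysis idea above (counting how often max_dist is the active constraint), here is a minimal sketch of what such a counter could look like; the names are made up and this is not the actual index.cpp code:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: pick the end of the window of candidate second strobes for the
// syncmer at index i. The window is bounded by w_max (number of downstream
// syncmers) and additionally by max_dist (maximum distance in bases between
// the two strobe start positions).
size_t candidate_window_end(const std::vector<uint64_t>& syncmer_positions,
                            size_t i, size_t w_max, uint64_t max_dist,
                            size_t& max_dist_active_count) {
    size_t end = std::min(i + w_max, syncmer_positions.size() - 1);
    // If max_dist is the tighter constraint, shrink the window further and
    // count the event; this count is the statistic discussed above.
    if (syncmer_positions[end] - syncmer_positions[i] > max_dist) {
        max_dist_active_count++;
        while (end > i && syncmer_positions[end] - syncmer_positions[i] > max_dist) {
            end--;
        }
    }
    return end;
}
```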
Here I just made a guess on ... What do you think?
More fine-grained read-length intervals are an option. This would make it a bit more difficult to generate all possible indices in advance. I think some users/admins would want to do that "just to be prepared", and then strobealign could gain a reputation for requiring massive amounts of storage for its indices ...

Shouldn’t the ranges have the most common read lengths as their middle points to ensure that the index works best with them? (Or at least have the max_dist setting optimized for it.) 76-125 was good because it had 100 in the middle, so perhaps it needs to be split into three parts. Or change the other intervals as well. Have you by any chance done a survey to find out which read lengths are most commonly used in practice? I would guess many people use the maximum Illumina read lengths, which are 100, 125, 150, 250, 300 (interesting, didn’t know about the 125).
I’ll try to do something in that regard.
Ok agree.
That makes sense. It would then be interesting to investigate parameters for 100, 125, 150, 250, 300. In my study I benchmarked only 100, 150, and 300 out of those numbers. (The other lengths I investigated were 50, 75, 200, 500, which are less relevant for Illumina, for now.) I can with little effort add 125 and 250 to my benchmarks. Interestingly, 125 is right on the border, which probably means the current parameters (also the syncmer span) aren’t ideal for it. How do you want to proceed with this? I am happy to run benchmarks whenever we have something to try.
I added a "dumprandstrobes" tool in #198 to be able to look at the randstrobes for any FASTA file and any parameters. I have only started using it, but it was already helpful because it reminded me of a couple of things. I’ll run the tool a bit to get an impression of how much the ...
Great!
I ran it. As expected, small ... See the table here. The total number of randstrobes was always about 26.5 million.
To establish terminology, do you mean they result in maximum seed lengths 75 and 40? Because the 'offset' between first and second syncmer would be ...
Sounds interesting, but I am worried there will be many practical complications ... Or perhaps I am thinking about it the wrong way?
You mean your modified version of my option 2, right? The one which also has "the ranges have the most common read lengths as their middle points to ensure that the index works best with them". I don’t have a strong preference here, basically because I can’t come up with a this-is-the-way-to-go solution.
Yes, you’re right. What’s shown in the figure is the seed length. Sorry about the confusion – I computed the length incorrectly for the statistics, which made me think offset and distance are somehow different things. But of course offset and distance are both the difference in strobe start positions, and adding k gives the seed length.
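As a small illustration of that relationship (a hypothetical helper, not taken from the code base):

```cpp
// Toy illustration: distance/offset between strobes vs. resulting seed length,
// assuming both strobes are syncmers of length k.
int seed_length(int strobe1_start, int strobe2_start, int k) {
    int distance = strobe2_start - strobe1_start;  // the "offset" discussed above
    return distance + k;  // e.g. seed_length(0, 80, 20) == 100; max_dist + k is the maximum seed length
}
```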
That makes sense. Not that it matters here, but I just wonder whether it would be nicer (conceptually) to only have one criterion. So one could only use max_dist and abandon w_max.
Yes, I thought one would still have multiple indices as before and only adjust those randstrobes that differ due to the different max_dist settings.
Good point, only storing syncmers would be an option. On the other hand, the savings would be lower because generating randstrobes also takes time ... But I don’t think it’s worth going in this direction at the moment, it was just a brainstorming idea.
Yes, but I described it unclearly: The essential point is that we choose max_dist so that it works best with the read lengths we’re most likely to encounter. Instead of computing max_dist from the canonical read length, we can hardcode it as part of the settings, and then we can choose whatever we want for each canonical read length, and it doesn’t need to be centered. (The canonical read length would then just become the name of one of the pre-defined index parameters.) Because you designed these settings initially, I don’t have a good intuition for how they would need to be changed, but let me make a suggestion as a starting point anyway. I think we want this:
Current settings
Possible settings
For the added read length 125, I interpolated w_min and w_max from the adjacent settings, but one might as well pick 4 and 9. Bold numbers designate changed or new values.
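As a sketch of what "hardcoding max_dist as part of the settings" could look like in code (the struct name and all numbers below are placeholders, not the values from the tables):

```cpp
// Placeholder values only; the point is that max_dist is hardcoded per
// pre-defined setting instead of being derived from the actual read length.
struct PresetIndexSettings {
    int canonical_read_length;  // used as the name of the setting (and in the filename)
    int k;
    int w_min;
    int w_max;
    int max_dist;
};

const PresetIndexSettings presets[] = {
    // r     k   w_min  w_max  max_dist   (placeholder numbers)
    { 50,   20,   4,     10,    20 },
    {100,   20,   5,     11,    30 },
    {150,   20,   6,     12,    40 },
    // ... one entry per canonical read length
};
```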
I think most of those settings look good. The only one I am hesitant about is the <=75 interval. Reads of length 76 will now be bumped up to max_dist 30 while being capped at 20 before. I believe we should be conservative with this interval and do:
(I have edited only the ...) Before this pull request is merged, I could start a benchmark including some new read lengths and compare to v0.7.1. Since the results should be similar for the actual canonical read lengths (125, 250, 300), I should probably test read lengths 76, 111, 136, and 176 as 'worst case scenarios', right?
There are six canonical read lengths (50, 100, 150, 225, 325 and 400), which represent the six pre-defined index parameter settings. The canonical read length is now used as part of the index filename, as in "ref.fasta.r100.sti".
This allows re-using an on-disk index even if the read length differs slightly.
I added a new setting for read length 125 using your version of the table. Makes absolute sense to end that interval at 90. I also bumped the baseline commit so that we finally get a green checkmark.
Yes, this PR is ready now. It would be good to have some benchmarks that also test some of these worst-case lengths.
Great! I will set up and start experiments for read lengths 91, 111, 136, and 176 tomorrow (in addition to previously compared read lengths) on Drosophila, maize, CHM13 and rye. I will also add 125 and 250 as these are standard lengths I did not benchmark before.
The evaluation is up and running but will take some time to complete due to job scheduling waiting times. I am also benchmarking the mapping modes of the two versions. It can happen that alignment mode is "masking" a bit of the true difference if we, e.g., enter alignment rescue mode more often (which negatively affects runtime). Accuracy differences at seed level will be more apparent in map mode.
Attached are the plots from the benchmark. Overall: a decent cut in both memory and runtime. This may be a good time to say that I am very impressed with the work you have done! A minor 'issue' in the results: the accuracy is slightly lower for read lengths 100 and 150 (and 75?) on maize and rye. To me, this is surprising. I would have expected the read lengths matching canonical read lengths to give identical-ish results, as we use the same max_dist as v0.7.1. Any ideas? I was also curious about why strobealign-master has many more reads aligned for the lowest read lengths.
accuracy_plot.pdf
Also, the evaluation used strobealign with 16 cores for all runs. Scrutinising the results a bit more:
A final note for now: I am restarting all strobealign v0.7.1 experiments, as some of the datapoints (50, 75, 100, 150, 200, 250, 300, 500) were 'old runs' from the benchmark we did related to the accuracy regressions that we observed for some master commits a while ago. Just to make sure all runs are up to date. I will get back to either confirm that the results are identical or update with new plots when they finish.
I’m writing a longer reply, but wanted to comment briefly regarding this: you should make sure that you book an entire node for the benchmark runs. The nodes have 20 cores now, and if you book only 16 of them, other jobs can be scheduled on the remaining cores and affect the runtime measurements.
Yes, I always book entire nodes for strobealign runtime benchmarks, even when I benchmark using a single core. I already got an email about potential over-allocation from UPPMAX ;)
Another possibility explaining the runtime improvement is compilation flags. I am now using the CMake-compiled version of v0.7.1. When I compare the runtime to what I reported in Fig. S13 (supplementary data) in the paper, they differ. In fact, the paper runtimes look very similar to the runtimes of the current master. So I am wondering whether my compilation of v0.7.1 was not optimal? I used the version I compiled a while ago for the accuracy regressions benchmark. In the paper I used my old compilation instructions.
Nice plots! I’m surprised (but delighted) to see that mapping runtimes have decreased as well, and also that the mapping rate went up for the short reads. I don’t know where the mapping runtime improvement actually comes from. I wonder if something else is going on. One thought was that there could be a bias depending on which node the job runs on, but the Rackham nodes all have identical CPUs, so I think it’s fine to compare runtimes obtained on different nodes. Did you enable AVX2?
This must have something to do with the other changes made since v0.7.1. I’m guessing it’s once again from the changed sort order for the randstrobes. BTW, just to confirm: your plots say "strobealign" and "strobealign-master" – which versions are those exactly?
I ran a test over lunch, and I’m now pretty sure that most of the increase in mapping rate at the short read lengths is explained by commit 87fb45e, where I made a change to generate one more randstrobe at the end of each sequence. For 2x50 bp Drosophila reads, mapping rate and (!) accuracy increase by 0.1 percentage points with that commit.
Yes, you’re right. I’d like to update the automated tests to use another reference that is perhaps a bit bigger and more repetitive, and perhaps also test other read lengths. I’ll open an issue about this.
That is weird. The only memory leak I remember fixing was in #181, but I had introduced that myself.
New runs are still pending. Some answers:
Yes. "strobealign" is strobealign v0.7.1. "strobealign-master" is commit 8e53e41 (the readlen-sti branch).
I am becoming unsure about which compiled version of 0.7.1 I am benchmarking against. I will let the new experiments run. Then, if we still observe the same results, I will run the conda installation of 0.7.1. What I can say for sure is that:
Some more brainstorming:
Can you point me to the two binaries? I can try to check whether they have AVX2 instructions in them.
Hm, if you did as it said, you didn’t get AVX2 because we disabled that by default in #38. I’ll need to add to the instructions how to re-enable it (add ...).
That looks quite similar to what we do now (except for ...).
Oh, but maybe that’s the explanation, at least for Drosophila? For simplicity, let’s say a 500 bp read uses 1 kiB (sequence + quality + name). Then a pair uses 2 kiB. We changed the chunk size from 100'000 to 10'000, so memory usage per chunk is reduced by 90'000 * 2 kiB ≈ 176 MiB. There’s one chunk for each thread, so with 16 threads, we arrive at about 2.7 GiB less memory used (which is actually larger than the index).
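For reference, the same back-of-the-envelope calculation as a standalone snippet (using the assumptions stated above):

```cpp
#include <cstdio>

int main() {
    const double bytes_per_read = 1024.0;              // assume ~1 kiB per 500 bp read
    const double bytes_per_pair = 2 * bytes_per_read;  // ~2 kiB per read pair
    const double pairs_saved = 100'000 - 10'000;       // chunk size reduction
    const int threads = 16;                            // one chunk buffer per thread

    const double per_thread_mib = pairs_saved * bytes_per_pair / (1024.0 * 1024.0);
    const double total_gib = per_thread_mib * threads / 1024.0;
    std::printf("per thread: %.0f MiB, total: %.1f GiB\n", per_thread_mib, total_gib);
    // prints roughly: per thread: 176 MiB, total: 2.7 GiB
}
```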
I don’t know. If you want, I can adjust my "test all commits" scripts to run in map-only mode, which would allow us to pinpoint the commit where this changed.
I ran the adjusted script and, while watching it run, realized that there’s a second buffer for each thread that contains the processed output. If it’s PAF, then it’s small, but when it’s SAM, it’s strictly larger than the input FASTQ. So that roughly doubles the memory usage just for the buffers, to ~6 GiB. I think this explains the deviation between map-only and align mode.
A fail by me. I now have the results after rerunning strobealign v0.7.1 (attached).

Accuracy
The previous analysis compared against old data for strobealign run with ... For the new runs with default parameters, accuracy is near identical to before, except for very short reads where our new version does a bit better, likely due to the extra seed.

Memory
I consider the reduced batch size together with the SAM/PAF buffer difference to explain the memory increase that we observe. Mystery solved.

Runtime
Runtime looks a tad better for the readlen-sti branch, but overall quite similar. Also, the runtimes match what I reported in the paper, so mystery solved here as well.

accuracy_plot.pdf
I have put the two versions I ran in the folder ... Would be interesting to know whether they have AVX2.
Is automatic CPU detection easy to add? I feel like this would be a nice addition because I certainly didn’t think about it, just blindly followed the installation instructions :) If the version(s) were not run with AVX2, I am happy to recompile and run them. Although I claim that v0.7.1 should have those instructions, per my previous installation.
Poking around a bit in the disassembly, I’m quite sure that ...
I’ll open a separate issue about this. There are various ways of achieving this. For now, we can just change the default compilation instructions to include ...
Great, so is this an accurate high-level summary for the changelog?
And coming back to the PR, does anything else need to be done? (I’ll push changes for the changelog and documentation update later.)
Okay, then I should rerun ...
I would wait a bit with "Slightly faster mapping" until we also try the AVX2, although it may not change much. Also, I would add a point stating something like "Several minor bug fixes (including SAM formatting)".
Besides documentation for any interface changes – nothing that I can think of. We have verified that this version performs as well as the previous version on accuracy, and better in terms of runtime and memory. All good to me.
I opened #211 to update the README. In short:
(With ...)
This is already in the changelog under the "bug fixes" heading, but because the changelog is going to be somewhat longish, I wanted to add a short summary at the top.
Good, then I’ll do those last things and merge.
I did this on the same commit I evaluated here (8e53e41) and restarted all the runs for this version. Experiments currently running.
Okay!
Runs are completed. I verified that accuracy, percent aligned, and memory are identical (just because of previous mishaps). Now, for the runtime (new time plot attached): while the results are for one run per data point (so expect some variation), it looks like the AVX2 version is slightly faster. The gain seems to be the largest on Drosophila(?). Maybe because our new indexing approach in this branch is the most active in smaller genomes with a higher fraction of unique seeds? (See the difference in map-only mode on Drosophila.) All in all, a nice combination of reduced runtime and memory in this pull request. I am happy to merge it.
Great!
I also think that that’s the reason for the difference. Fewer memory accesses for unique seeds should make things slightly faster.
Nice, then let’s do it.
Closes #133.
We currently have the concept of an (actual) read length, which is either estimated from the input FASTQs or provided using the `-r` command-line option. This PR introduces the concept of a "canonical read length". The canonical read length is one of 50, 100, 150, 225, 325 and 400 (we choose the one that is closest to the actual read length) and represents one of the six pre-defined index parameter settings.

With this PR, all parameters are entirely based on the canonical read length only, which means that an index can be re-used for datasets that have similar read lengths. This was almost the case before, except that the `max_dist` parameter was computed from the actual read length. Commit 8ba1f93 changes this so that `max_dist` is computed from the canonical read length. (This changes mapping results when actual and canonical read length differ.)

To make it possible to store multiple indices with different canonical read lengths alongside each other, the canonical read length is also included in the index filename, as in `ref.fasta.r100.sti`. I chose to include the `r` here mainly because I think it may be clearer that way what the number represents. Another minor reason is that we could extend the naming scheme at a later point to include other parameters. For example, `.r100k24.sti` could mean that one needs to use `-r 100 -k 24` to reproduce the settings the index was created with.

At the moment, if the user overrides any indexing parameters, the file name extension is going to be just `.sti` as before. The user then again has the problem that they cannot store multiple of these indices in the same directory, but I think solving this has much lower priority. (One workaround is to symlink the reference into separate directories for each parameter setting and then create the index there.) If the index doesn’t match the command-line settings, then there’s an error message in any case, so accidentally using the incorrect index should not be possible.

`max_dist` now being based on canonical, not actual read length
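As a compact sketch of the scheme described above (illustrative only; the function names are made up, and the real implementation, including the interval boundaries discussed earlier in this thread, may differ):

```cpp
#include <array>
#include <cstdlib>
#include <string>

// Pick the pre-defined canonical read length closest to the actual read length.
int canonical_read_length(int actual_read_length) {
    constexpr std::array<int, 6> canonical{50, 100, 150, 225, 325, 400};
    int best = canonical[0];
    for (int c : canonical) {
        if (std::abs(c - actual_read_length) < std::abs(best - actual_read_length)) {
            best = c;
        }
    }
    return best;
}

// Build the index filename with the canonical read length embedded,
// e.g. index_filename("ref.fasta", 120) -> "ref.fasta.r100.sti".
std::string index_filename(const std::string& reference, int actual_read_length) {
    return reference + ".r" + std::to_string(canonical_read_length(actual_read_length)) + ".sti";
}
```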