
How does using a multi part index affect the accuracy? #141

Closed
hasindu2008 opened this issue Apr 2, 2018 · 26 comments
@hasindu2008
Contributor

A nanopore read when mapped using minimap2 to the human genome (the whole index is in memory) gives an output as follows.
[image]

When using a multi-part index (I manually split the human genome into two pieces and ran minimap2 on each part separately, the same as using the -I option in minimap2, I presume), the combined output is as follows.

[image]

I made the following observations:

  1. In the case of the multi-part index, there are additional mappings (their alignment scores are comparatively low, so they can be ignored).
  2. There are two primary alignments in the multi-part output. By looking at the alignment scores, the actual primary mapping can be deduced (the one with AS 64771, as opposed to 18246). For the other mapping (the one with alignment score 18246), it should be possible to deduce whether it is a secondary or a supplementary mapping by looking at the CIGAR strings.

I have the following two questions.

  1. Can the output from a multi-part index be merged such that it is exactly the same as the output for the single whole index? If not, is there any reduction in accuracy when the multi-part index is used?
  2. Are there any existing scripts/methods to perform this merging?

How is the mapping accuracy affected when a multi-part index is used rather than a single whole index?

@hasindu2008 changed the title from "How does using a multi part index affects the accuracy?" to "How does using a multi part index affect the accuracy?" on Apr 2, 2018
@lh3
Owner

lh3 commented Apr 3, 2018

Given a multi-part index, minimap2 aligns all query sequences against each part in turn. It doesn't know sequences in other parts and assigns mapping quality based on sequences in one part only. This is why your highlighted read got mapQ=60 twice, though clearly the chr17 hit is the optimal one.

Can the output from a multi-part index be merged such that it is exactly the same as the output for the single whole index?

Unfortunately, no. As it is often impractical to load all queries into memory, doing this would require minimap2 to save the exact internal state of each query to disk. That would involve a lot of effort and code refactoring.

A possibly easier solution is to save the SAM/PAF mapped against each part to a separate file and then merge the files at the end. However, mapping quality calculated this way will sometimes differ from the mapQ computed from a uni-part index. In addition, this strategy is still tricky to implement.
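To make the file-then-merge route concrete, here is a minimal Python sketch (a hypothetical helper, not part of minimap2): it groups the per-part PAF records by query name and orders each query's hits by the AS tag, so the best-scoring hit can be re-flagged as primary. It deliberately does not recompute mapping quality, which is the part that genuinely needs minimap2's internal state.

```python
from collections import defaultdict

def merge_paf_parts(paf_lines):
    """Group PAF records by query name and sort each query's hits by the
    AS tag (alignment score), highest first, mimicking a naive
    primary-selection step. This does NOT recompute mapping quality."""
    def as_score(fields):
        # optional tags start at column 13; AS:i:<score> holds the DP score
        for tag in fields[12:]:
            if tag.startswith("AS:i:"):
                return int(tag[5:])
        return 0

    by_query = defaultdict(list)
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        by_query[fields[0]].append(fields)
    for hits in by_query.values():
        hits.sort(key=as_score, reverse=True)
    return dict(by_query)
```

With the two primary records from the first post (AS 64771 on chr17 vs AS 18246 elsewhere), the chr17 hit would come out on top for that query.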

If not, is there any reduction in accuracy when the multi-part index is used?

When you use a multi-part index, I believe accuracy is not reduced as long as you merge alignments to each part at the end. However, the results will be different in corner cases.

Are there any existing scripts/methods to perform this merging?

Unfortunately, no. I know your intended applications; however, minimap2 is not designed for those. A multi-part index is mostly used for read overlapping, or for searching hits against gigantic databases (e.g. nt) when you care about all local hits.

@hasindu2008
Contributor Author

Thank you. I will try the approach you suggested: write to separate files and then merge them back at the end.

@Confurious

Hi, I actually want to do something along the lines of mapping query reads/assemblies against a gigantic database, more gigantic than nt/nr. People have recommended minimap2, but I suppose a multi-index strategy is a must. A few questions, if I may ask here:

  1. What is the memory usage of minimap2 compared to bwa-mem? I know that for bwa-mem the rough rule is 2x the database size.
  2. Are there any other scores (not mapQ) that are calculated in a way that does not take the whole database into consideration, so that they are comparable across the parts of a multi-part index? The bit score of BLAST comes to mind.
  3. If minimap2 is not the ideal tool for the task I am describing, what would you recommend? I have so far been using a brute-force blastn strategy, which gives good results but is just too computationally expensive.

  Thanks!

@lh3
Owner

lh3 commented Apr 3, 2018

What is the memory usage of minimap2 compared to bwa-mem? I know that for bwa-mem the rough rule is 2x the database size.

The minimap2 index size varies greatly with -H, -k and -w. For typical use cases, it is about 2.5X-4X of the database size when you use a uni-part index.

Are there any other scores (not mapQ) that are calculated in a way that does not take the whole database into consideration, so they are comparable among multi-index? The bit score of blast comes to my mind

mapQ is a global score: it considers all hits to the database. The BLAST bit score is proportional to the Smith-Waterman score; it is a local score in that it is computed from a single hit. The two scores serve distinct purposes. With base-level alignment enabled, minimap2 computes the local SW score.

If minimap2 is not the ideal tool for the task I am describing, what would you recommend? I have so far been using a brute-force blastn strategy, which gives good results but is just too computationally expensive.

When you use a multi-part index, minimap2 is broadly similar to blastn in functionality. It will be faster than blastn but less sensitive in their typical settings. What are your query sequences? What level of sensitivity do you want to achieve?

@Confurious

Thanks! So it does have an option to give a local SW score for each query; that's great! I understand mapQ is probably more similar to the e-value of blastn, which takes the queried database into consideration but is not necessarily comparable between different databases.

The query sequences are environmental metagenomic samples, so the database is very large, to hopefully classify as many reads/contigs as possible. I am basically using all of the blastn hits for each query (limited to 5 per reference per query) and doing LCA to get the most conservative estimate of the taxonomy. I have been using contigs assembled by MEGAHIT, because blasting every read would just take years for the stuff I am working with. I have been using dc-megablast, so I suppose a sensitivity level comparable to that algorithm would be most ideal. In my experience, this pipeline was able to pick up sequences that are 70% or more similar (also the reason to use contigs, as 70% for reads is too shaky).
Thanks!

@lh3
Owner

lh3 commented Apr 3, 2018

@hasindu2008 I had a look at the minimap2 code and realized that it is not that difficult to dump the internal states of each query against each part of the index to disk. For a multi-part index, the ideal pipeline in my head would be:

  1. Add a new command line option e.g. --multi-prefix, which specifies the prefix of files storing internal states.

  2. For the first part of the index, write the mm_reg1_t of each query, in binary form, to file 'prefix.000.dat'. This file keeps the aligned regions of all queries mapped against the first part of the index. Similarly, file 'prefix.001.dat' is for all queries mapped against the second part, etc.

  3. At the end, open all 'prefix.*.dat' files along with the query sequence file, combine them, re-calculate mapQ, output the alignments, and then delete the intermediate 'prefix.*.dat' files.

The output of this procedure will be very similar, though not identical, to alignment against a uni-part index. The procedure will definitely be cleaner, and perhaps even simpler, than manipulating the SAM/PAF from the current multi-part minimap2 output.
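The steps above could be mocked up as follows, assuming a much-simplified record of (query name, score) in place of the real binary mm_reg1_t layout; the file naming follows the 'prefix.NNN.dat' convention described above.

```python
import pickle

def dump_part(prefix, part_idx, hits):
    """Write the hits of one index part to 'prefix.NNN.dat'
    (a stand-in for serializing mm_reg1_t records in binary form)."""
    path = "%s.%03d.dat" % (prefix, part_idx)
    with open(path, "wb") as fh:
        pickle.dump(hits, fh)
    return path

def merge_parts(prefix, n_parts):
    """Read every per-part dump, pool the hits by query name, and keep
    each query's scores sorted so a final mapQ pass could run over them."""
    pooled = {}
    for i in range(n_parts):
        with open("%s.%03d.dat" % (prefix, i), "rb") as fh:
            for qname, score in pickle.load(fh):
                pooled.setdefault(qname, []).append(score)
    return {q: sorted(s, reverse=True) for q, s in pooled.items()}
```

The real implementation would additionally carry the fields needed to emit SAM/PAF (positions, CIGAR, flags) and the per-read replen discussed later in this thread.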

Unfortunately, I am busy with another project. I may not be able to implement this feature soon.

@lh3
Owner

lh3 commented Apr 3, 2018

@Confurious in your case, I think you can give minimap2 a try. The command line could be something like:

minimap2 -c -k15 -w10 -t 8 db.fa contigs.fa

You can reduce -w to 5 (for example) for higher sensitivity. The peak memory will be around 20-30 GB, depending on -w. In the output, the AS tag gives the SW score, and col10/col11 (matching bases over alignment block length) gives sequence identity, defined in the BLAST way.
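For clarity, a small sketch of how that identity could be computed from a PAF line, assuming standard PAF columns (column 10 is the number of matching bases, column 11 the alignment block length):

```python
def paf_identity(paf_line):
    """Return BLAST-style sequence identity from one PAF record:
    column 10 (residue matches) divided by column 11 (alignment
    block length, including gaps)."""
    fields = paf_line.split("\t")
    matches = int(fields[9])   # column 10: matching bases
    block = int(fields[10])    # column 11: alignment block length
    return matches / block
```

On the PAF records posted later in this thread (556 matches over a 715 bp block), this gives an identity of roughly 0.78.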

@lh3
Owner

lh3 commented Apr 3, 2018

mapQ is probably more similar to the e-value of blastn

mapQ largely evaluates whether there is another hit of similar SW score. The e-value measures how likely it is that the hit is random. They are distinct, too. The BLAST e-value is in effect a local score; it differs from the original definition of the e-value, which is computed by fitting the tail of the extreme-value distribution of all hits. That original definition needs to see all hits across the entire database.

@Confurious

"mapQ largely evaluates if there is a hit of similar SW score. E-value measures how likely the hit is random. They are distinct, too. Blast e-value is a local score in effect. It is different from the initial definition of e-value which is computed by fitting the tail of the extreme-value distribution of all hits. The initial e-value definition needs to see all hits across the entire database."

So it looks like, to make things truly comparable in a multi-index situation, the SW score is the absolute objective measurement of how "good" the alignment is for each query. So instead of a list of hits from blastn, I will be dealing with a list of hits with the top SW scores after pooling the results across index parts. mapQ, the blastn e-value, and even the blastn bit score all take the size of the search space/database into account, directly or indirectly (based on reading https://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html). Unless, of course, you implement the internal-state merging steps.

Btw, thanks for your tips on using minimap2! I am giving it a try. I assume the peak memory of 20-30 GB is based on the database defined in your post, right? You do not mean that minimap2 can achieve mapping, maybe more slowly, under a constant memory cap, do you?

Sorry for hijacking the thread. Thanks!

@lh3
Owner

lh3 commented Apr 3, 2018

I assume the peak memory of 20-30 GB is based on the database defined in your post, right?

When you use a multi-part index, the peak memory is capped by the total number of bases loaded into memory per batch, regardless of the total size of the database. The default batch size is 4 Gb (option -I).
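As a back-of-the-envelope illustration (assumed numbers, not measurements): with the default -I batch of 4 Gb of bases, the number of index parts just follows from the database size, while the peak index memory tracks the batch size.

```python
import math

def n_index_parts(db_bases, batch_bases=4_000_000_000):
    """How many parts a multi-part index is split into for a given
    -I batch size (default 4 Gb of bases per batch). Peak index
    memory scales with batch_bases, not with db_bases."""
    return math.ceil(db_bases / batch_bases)
```

For example, a 500 Gb database at the default batch size would be processed in 125 parts, each batch with roughly the same peak memory.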

You do not imply that minimap2 can achieve mapping, maybe slower, with a constant memory cap?

Mapping with a multi-part index will be slower. You essentially trade speed for lower peak memory. It is hard to say about accuracy. Probably similar.

@Confurious

That makes sense. I assume it would be in my best interest to construct the index first for my database, which is around 500 GB, and I assume I will need more than 1 TB of memory for that? I can manage to find a powerful enough computer for this one-time task, but considering that the index will essentially be searched against in a multi-part way, I am wondering if there is a better way of doing this?

Or do you suggest simply using the 500 GB fasta file directly with the -I option? I assume in this case the program basically takes a 4 GB (default) chunk, builds an index, and does the alignment? That would seem to redo the indexing work for every set of query sequences.

Thanks!

@lh3
Owner

lh3 commented Apr 3, 2018

Minimap2 indexing is very fast. If your network file system is slow, indexing can sometimes be faster than loading an index from disk. For a first try, I would just use the command line above:

minimap2 -c -k15 -w10 -t 16 db.fa contigs.fa

This will take 20-30 GB. You can consider using a larger -I if you have lots of memory. Try the simple thing first and then optimize if it really works.

@hasindu2008
Contributor Author

Thank you. Meanwhile, I will also attempt an implementation, at least for primary mappings of non-paired reads. If secondary mappings and paired reads are considered, more information would need to be dumped, wouldn't it? I think merging would be further complicated in such a case.

@Confurious

Confurious commented Apr 4, 2018

Just want to update here that your suggestion worked beautifully for a 500 GB database! Now I just need to dig through all the tags to come up with an objective "score". What would you recommend for calculating an objective score to evaluate which alignment is best across the split indexes? I noticed that in my case AS is almost always the same as ms; I do not know what that means.

Posting some output here for reference. Thanks!

k99_1335274     898     22      732     +       GL456022.2      1934410 1207650 1208360 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:P  cm:i:7  s1:i:40 s2:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584266.1      4910976 1277142 1277852 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584328.1      4256209 1276903 1277613 556     715     0       NM:i:159        ms:i:460        AS:i:460        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I155M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       GL456022.2      1934410 1220564 1221274 557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       JH584266.1      4910976 1289953 1290663 557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M
k99_1335274     898     22      732     +       CM001010.2      94987271        34970860        34971570        557     716     0       NM:i:159        ms:i:458        AS:i:458        nn:i:0  tp:A:S  cm:i:7  s1:i:40 dv:f:0.1695     cg:Z:12M1I3M1D58M1D3M1I65M1I3M1D86M1I3M1D414M2D6M1I2M1I49M

@lh3
Owner

lh3 commented Apr 4, 2018

@hasindu2008 For long reads, minimap2 reports secondary mappings, which should often be adequate for merging. Minimap2 doesn't report secondary mappings in short-read mode (as there would be too many). You can still estimate mapping quality from primary alignments, but occasionally such an estimate can be inaccurate.

@Confurious Thanks for the confirmation. AS gives the conventional Smith-Waterman score; you can just use that. I forgot to add that you would probably want to increase option -N to 50 or higher (the default is 5). This option specifies the number of secondary alignments minimap2 will report. In your example there are six hits; you will probably see a lot more when increasing -N. By the way, if you have further questions, please open a new issue. There are also a few other parameters that trade between speed, memory and sensitivity, if you would like to achieve better alignment. I am happy to discuss there. Thank you!

@hasindu2008
Contributor Author

@lh3
I have implemented the following so far.

  1. --multi-prefix option to specify a temporary file prefix
  2. Dumping the sequence IDs, sequence names and the lengths of the sequences in the index mi
  3. Dumping mm_reg1_t for each mapping along with the content in mm_extra_t *p (flattened it)
  4. At the end of dumping, collect the sequence names and the lengths of all dumps to generate a better SAM header (SQ lines)
  5. Reading through each line of the fastq while reading through the dumped mm_reg1_t for each read
  6. Fixing the reference sequence name by fixing the rid in mm_reg1_t (step 2 helps with this)
  7. Writing all the regs through mm_write_sam2

By doing so, I have verified that dumping and reading back work perfectly, giving the same output. Now I am attempting to generate the MAPQ. I see that the function mm_set_mapq(b->km, n_regs0, regs0, opt->min_chain_score, opt->a, rep_len, is_sr) is used for this. Can you briefly tell me about this replen, as it is the only parameter that is not stored inside mm_reg1_t? Can it be estimated from the parameters inside mm_reg1_t, or should I dump all the replen values for all mappings and then add them back?

@lh3
Owner

lh3 commented Apr 9, 2018

That's great! Thanks a lot.

Can you briefly tell me about this replen as it is the only parameter that is not stored inside the mm_reg1_t?

replen is calculated by collect_matches() in map.c. It is the sum of the lengths of the regions covered by highly repetitive k-mers. You can't infer replen from a list of mm_reg1_t, unfortunately.

or should I dump all the replen values for all mappings and then add them back?

You have to save replen for each read in each batch. You don't need to save it for each mm_reg1_t (that's why replen is not a member of mm_reg1_t). When you generate the final MAPQ, read all the replen values for a read and choose the largest (or the average; I am not sure) as the value fed to mm_set_mapq().
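A sketch of that bookkeeping in Python (hypothetical data layout: one dict of read name to replen per index part):

```python
def merge_replens(per_part):
    """per_part: a list of {read_name: replen} dicts, one per index part.
    Return, per read, the maximum replen across parts, which is the value
    the discussion here suggests feeding to mm_set_mapq()."""
    merged = {}
    for part in per_part:
        for name, replen in part.items():
            merged[name] = max(replen, merged.get(name, 0))
    return merged
```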

Note that it is not possible to generate identical alignments for each read, because minimap2 throws away k-mers with high occurrences and this set of k-mers differs between a uni-part index and a multi-part index. Nonetheless, this only affects a tiny fraction of reads, and these reads tend to have low MAPQ. The impact on downstream processing should be small.

@hasindu2008
Contributor Author

It seems the maximum replen is a better approximation than the average replen, as observed for a set of reads. See below.
[image]

The merging is implemented as follows.

  • the maximum replen is selected
  • then, for all regs (pooled from the different index parts) for a particular read:
    • call mm_hit_sort_by_dp
    • call mm_set_parent
    • call mm_select_sub
    • call mm_set_sam_pri
    • call mm_set_mapq
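A rough Python analogue of that merge loop, with simplified stand-ins for the mm_* calls (the toy mapQ formula below is illustrative only, not minimap2's actual calculation):

```python
def rerank_pooled_hits(hits):
    """hits: list of dicts with a 'score' key, pooled from all index
    parts for one read. Sort by score (mm_hit_sort_by_dp), flag the best
    hit primary and the rest secondary (roughly what mm_set_parent and
    mm_set_sam_pri decide), and attach a toy mapQ that shrinks as the
    runner-up score approaches the best score."""
    hits = sorted(hits, key=lambda h: h["score"], reverse=True)
    for i, h in enumerate(hits):
        h["primary"] = (i == 0)
    # toy mapQ: drops toward 0 as the second-best score nears the best
    if len(hits) >= 2 and hits[0]["score"] > 0:
        hits[0]["mapq"] = int(40 * (1 - hits[1]["score"] / hits[0]["score"]))
    else:
        hits[0]["mapq"] = 60
    return hits
```

With the two primary hits from the first post (scores 64771 and 18246), the 64771 hit is flagged primary and gets a nonzero mapQ, while the other becomes secondary.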

I tested it with a long nanopore read, and the comparison between the uni-part index and the multi-part index looks identical except for a repeat region in chr11.
[image]

Do you have any "standard" test cases that check the impact of a new feature on existing features (such as the introduction of bugs)?
Also, are there any "gold standard" datasets that can be used with paftools.js to check mapping accuracy?

@lh3
Owner

lh3 commented Apr 12, 2018

I do have a few files that I run minimap2 on before every release. I am out of town these two days; I will make them available tomorrow.

@hasindu2008
Contributor Author

@lh3
Did you make those files available?

@lh3
Owner

lh3 commented Apr 20, 2018

Sorry for the late response. I was on another trip. You can find files here. More specifically:

@hasindu2008
Contributor Author

Thank you very much for the data set.

For the basic merging implementation, I used pb-1.fa.gz for testing.

The results for the SAM output are as follows.
[image]

The results for the PAF output are as follows.
[image]

There seems to be about a 1.5% difference in the fraction of mapped reads. I pulled out a read that is correctly mapped with the single-part index but not with the multi-part index. Some parameters in mm_reg1_t, some intermediate values during the mapq calculation, and the computed mapq for the same read in each case are as follows.
[image]

The difference in the value score0 seems to be the one that most affects the mapq.
Is this difference in score0 due to the difference in indexing that you mentioned earlier (throwing out high-occurrence k-mers in each case)?

@lh3
Owner

lh3 commented May 3, 2018

I had actually forgotten about this score0. It is the score from the initial chaining. Initial chains may be broken into smaller pieces or joined together; "score" is computed after these operations. "score0" was added to improve accuracy in a corner case.

How did you choose "score0" when combining multiple mm_reg1_t?

@hasindu2008
Contributor Author

hasindu2008 commented May 4, 2018

Under mm_reg1_t there is a score0.

`int32_t score0;         // initial chaining score (before chain merging/splitting)`

I directly used this value from each entry when computing the MAPQ after merging. However, the value stored in score0 for the same hit seems to differ between a uni-part and a multi-part index. Is this score0 what you were asking about?

@hasindu2008
Contributor Author

Hi Heng,

I recently performed a test using a much bigger dataset (4,000,000 simulated reads generated with pbsim from the hg38 genome).
The result for the SAM output is as follows.
[image]

The result for the PAF output is as follows.
[image]

For the sake of comparison, I also evaluated the accuracy without any merging, by sorting the alignment outputs by read ID and then calling paftools mapeval.

I would like to hear your comments/suggestions on these results, such as whether this is an acceptable level of accuracy.

Thank you!

@lh3
Owner

lh3 commented Jul 26, 2018

With #189 effectively pulled, I am closing this issue.
