Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Questions about the alignment of MashMap #12

Open
YiweiNiu opened this issue Jul 13, 2018 · 6 comments
Open

Questions about the alignment of MashMap #12

YiweiNiu opened this issue Jul 13, 2018 · 6 comments

Comments

@YiweiNiu
Copy link

Hello,

Following #6, I checked the outputs of MashMap using blastn and got some quesions with the alignment. I think I shoud open a new issue to understand the ouput of MashMap.

I wanted to remove posssible contaminants within the PacBio data, which was from several insects (whole organisms were used to extract DNA, so there maybe some bacteria DNA). I downloaed all the archaea, bacteria, fungi, protozoa, and viral sequences and merged them together as a contamination library. I also included mitochondrion sequences of the insect into the library.

Then I used MashMap with different parameters and got several outputs:

# run1, with default parameters
$path2mashmap -r $contaminants -q third_all.fasta -o mashmap.out

# run2, with -s 2500 --pi 80
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 2500 --pi 80 -o mashmap2.out

# run3, with -s 500 --pi 85
$path2mashmap -t 8 -r $contaminants -q third_all.fasta -s 500 --pi 85 -o mashmap3.out

And the outputs from three runs varied:

# there are 6633142 sequences of input
$ grep -c '>' third_all.fasta 
6633142

# run1
cut -f 1 -d ' ' mashmap.out |sort|uniq|wc -l
463569

# run2
cut -f 1 -d ' ' mashmap2.out |sort|uniq|wc -l
2821004

# run3
cut -f 1 -d ' ' mashmap3.out |sort|uniq|wc -l
6189307

As can be seen, nearly all the sequences were aligned to contaminant library. That really shocked me.

Then I checked the top 10 sequences with highest identity and top 10 ones with loweset identity from the first run using blastn.

$ sort -r -t ' ' -k10 mashmap.out|head
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/135507/26123_34791 8668 0 8667 + AV_I75 15629 300 8661 94.3979
m161111_191517_42256_c101055312550000001823247601061715_s1_p0/118488/0_10466 10466 0 4999 + AV_I338 16706 10618 15617 93.7729
m161111_062618_42256_c101055312550000001823247601061713_s1_p0/84256/10406_16318 5912 0 5911 + AV_I336 15276 533 6284 93.5688
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/108916/11216_16430 5214 0 5213 - kraken:taxid|163164|NC_002978.6 1267782 837214 842416 93.4375
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/29550/13832_24025 10193 0 10192 - AV_I75 15629 5633 15425 93.3745
m161123_002016_42256_c101049952550000001823247601061782_s1_p0/107348/0_13397 13397 0 4999 + kraken:taxid|1633785|NZ_CP011148.1 1267840 408697 413696 93.326
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/96135/27165_36074 8909 0 8908 - kraken:taxid|615|NZ_CP021984.1 5241555 4731262 4740479 93.2709
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/37699/0_7490 7490 0 7489 + AV_I337 13969 4110 11232 93.2274
m161118_115530_42256_c101055932550000001823247601061713_s1_p0/148198/23006_28097 5091 0 5090 - AV_I336 15276 2279 7323 93.1338
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/86966/0_10197 10197 5197 10196 - kraken:taxid|1633785|NZ_CP011148.1 1267840 642883 647882 93.0843

$ sort -t ' ' -k10 mashmap.out|head > mashmap.lowest
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100026/37064_49118 12054 7054 12053 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/100308/48537_58156 9619 4619 9618 - kraken:taxid|880070|NC_015914.1 6221273 4110216 4115215 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/103161/48125_53357 5232 232 5231 + kraken:taxid|336810|NZ_CP021172.1 218034 75901 80900 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104812/38481_50150 11669 0 4999 - kraken:taxid|877455|NC_015216.1 2583753 1814585 1819584 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104851/0_12783 12783 7783 12782 - kraken:taxid|76857|NZ_CP022123.1 2521394 1534616 1539615 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/104909/21773_32004 10231 0 4999 - kraken:taxid|134821|NZ_CP021988.1 722452 74550 79549 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105012/0_20923 20923 0 4999 + kraken:taxid|552509|NC_033778.1 111453 67843 72842 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105790/29860_45799 15939 0 4999 - kraken:taxid|1360|NZ_CP025500.1 2346663 699615 704614 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/105963/29235_38484 9249 4249 9248 - kraken:taxid|29430|NZ_CP018260.1 3267348 1015045 1020044 80.8247
m161109_080520_42256_c101052872550000001823247601061737_s1_p0/106112/0_20765 20765 0 4999 + kraken:taxid|2017483|NZ_CP022315.1 4071214 1765868 1770867 80.8247

The highest ones were fine. There were some differences between hits reported by blastn and MashMap, but maybe it's because they used different databases. But the loweset ones were problematic. Most of them were 'No significant similarity found' when default parameters of blastn were used. And when I unselected 'Low complexity regions', the alignments were unreliable. There maybe something with 'low complexity regions' or 'repeat' somthething.

So my question are:

  1. If blastn cannot find any significant alignment, why MashMap could?
  2. In my case, what percentage identity threshold do you think is proper for removing contaminants?

Thank you! Sorry if I missed something.

Bests,
Yiwei Niu

@cjain7
Copy link
Contributor

cjain7 commented Jul 14, 2018

  • For q1, Mashmap identity is an estimate based on Jaccard similarity- not the precise identity; unfortunately the Jaccard-similarity based metric delivers poor specificity in cases when the source of reads is absent from database. See if you can include an insect reference genome (s) in the reference list to avoid this.

  • For q2, I don't have a precise answer to this. My guess is if you run with -s X --pi Y, you can ignore the mappings of identity below Y, and length < 2X. We soon plan to add fast alignment support in Mashmap so users don't have to do any post-processing.

@YiweiNiu
Copy link
Author

YiweiNiu commented Aug 2, 2018

Thanks for your reply! I'm sorry for the delayed response.

As you recommended, I included several genomes of the same order of my target insect. Then I ran MashMap like this:

$path2mashmap -t 20 -r contaminants.and.neighbors.fa -q third_all.fasta -s 500 --pi 80 -o mashmap4.out

Here are the logs:

Start time is 2018/07/17--13:53
>>>>>>>>>>>>>>>>>>
Reference = [contaminants.and.neighbors.fa]
Query = [third_all.fasta]
Kmer size = 16
Window size = 5
Segment length = 500 (read split allowed)
Alphabet = DNA
Percentage identity threshold = 80%
Mapping output file = mashmap4.out
Filter mode = 1 (1 = map, 2 = one-to-one, 3 = none)
Execution threads  = 20
>>>>>>>>>>>>>>>>>>
INFO, skch::Sketch::build, minimizers picked from reference = 16054318849
INFO, skch::Sketch::index, unique minimizers = 770476138
INFO, skch::Sketch::computeFreqHist, Frequency histogram of minimizers = (1, 142744242) ... (255123, 1)
INFO, skch::Sketch::computeFreqHist, With threshold 0.001%, ignore minimizers occurring >= 4307 times during lookup.
INFO, skch::main, Time spent computing the reference index: 149116 sec
INFO, skch::Map::mapQuery, [count of mapped reads, reads qualified for mapping, total input reads] = [6214500, 6214500, 6633142]
INFO, skch::main, Time spent mapping the query : 825763 sec
INFO, skch::main, mapping results saved in : mashmap4.out
Finish time is 2018/07/28--20:54

Then I counted the number of mapped reads (good - mapped to insects, bad - mapped to contaminants, ambivalent - mapped to both insects and contaminants) and unmapped ones. This is the numbers:

No. of total reads: 6633142
  No. of reads in the mashmap.output: 6214500
    No. of good: 853515
    No. of bad: 138799
    No. of ambivalent: 5222186
  No. of reads not in the mashmap.output: 418642

So most reads had been mapped to the reference but the majority of them had been mapped ambivalently, which are troublesome to extract good ones. Here is an example of this kind of reads:

m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3500 4015 + NW_017852934.1 2683736 1681820 1682319 79.4204
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1000 1999 + NW_019280650.1 1003565 813077 813577 78.0766
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 2500 2999 - LJIG01019880.1 38067 34865 35364 79.3626
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 500 999 + NC_007418.3 31381287 24981081 24981580 79.5573
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 3000 3499 - kraken:taxid|76857|NZ_CP022123.1 2521394 1537365 1537864 81.3159
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 0 499 - kraken:taxid|1202539|NC_018417.1 157543 40864 41363 79.4397
m161123_064622_42256_c101049952550000001823247601061783_s1_p0/52396/4594_8610 4016 1500 2499 + kraken:taxid|1936081|NZ_CP019389.1 3752836 1813582 1814081 76.7726

The first 4 lines are sequences from insects, and the last 3 lines are sequences from contaminants. As can be seen, in this case I don't know which alignment is more reliable and shoud be kept for downstream analysis.

Thank you again for your quick reply!

@cjain7
Copy link
Contributor

cjain7 commented Aug 7, 2018

I still think many of the reads don't have a good (closely-related) reference in the DB which is causing false hits. I've few ideas to avoid this, and plan to implement them soon. Meanwhile, did you try my second suggestion in my previous response?

@YiweiNiu
Copy link
Author

YiweiNiu commented Aug 7, 2018

Thank you for your reply!

Sorry, which suggestion do you mean? include an insect genomes in the reference list? I already done that. I include 12 other insects of the same order of my target insects.

By the way, I also mapped the same FASTA file to the same library using minimap2. And I found the majority of reads were unmapped.

No. of reads in the SAM: 6633142
    No. of mapped: 661646
    No. of good: 490150
    No. of bad: 125390
    No. of ambivalent: 46106
  No. of Unmapped: 5971496

I'm not an expert on "mapping" things, but I guess that maybe the reason why MashMap cannot distinguish them (when using Jacard similiarity to assign the reads to the reference).

@cjain7
Copy link
Contributor

cjain7 commented Aug 8, 2018

These results are good to know; it suggests that majority of "ambivalent" mashmap mappings are false-positives. This is a good feedback; will soon put additional checks to avoid this.

For your downstream analysis, I guess it's clear that you've roughly 100,000 contaminant reads.

Re: which suggestion?
I was referring to my earlier comment: if you run mashmap with parameters -s X --pi Y, you can try discarding the weak matches i.e. identity < Y, and length < 2X.

@YiweiNiu
Copy link
Author

YiweiNiu commented Aug 9, 2018

I counted the number of reads in different categories again. Since I 've used -s 500 --pi 80, I used identity < 80 and length < 1000 as the thereshold.

Here are the results:

No. of total reads: 6633142
  No. of reads in the mashmap.output: 6221577
    No. of above threshold: 5909169
      No. of good: 637490
      No. of bad: 62995
      No. of ambivalent: 5208684
    No. of below threshold: 312408
      No. of good: 222457
      No. of bad: 79505
      No. of ambivalent: 10446
  No. of reads not in the mashmap.output: 418642

Most of reads had alignments above the threshold, so using that as the threshold maybe not appropriate in my case.

Thank you for your help!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants