invertebrate support #34

lurebgi · 2019-01-04T09:26:18Z

Hi,

I was wondering whether LTR_retriever supports invertebrate genomes. We have an amphioxus genome derived from 60X PacBio sequencing; however, its LAI score is only 7.07. Moreover, all of the 206 LTRs in LTRlib.fa were classified as 'Unknown'. Does this look normal to you?

Thank you!

Luohao

oushujun · 2019-01-04T22:43:56Z

Hi Luohao,

I have tried LTR_retriever on fruit fly, mouse, micro- and megabats, and human, and it worked much as it does in plants, although most of these species have far less LTR content in their genomes. For an accurate evaluation, LAI requires at least 5% total LTR sequence and 0.1% intact LTR sequence in the genome, so you may need to check these two values.
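A minimal sketch of that pre-check, assuming the total and intact LTR percentages have already been read off the annotation summary. The function name `lai_applicable` and the example values are illustrative only and are not part of LTR_retriever:

```sh
# Hypothetical helper, not part of LTR_retriever.
# Usage: lai_applicable TOTAL_LTR_PCT INTACT_LTR_PCT
# Succeeds (exit status 0) only when both thresholds quoted above are met:
# at least 5% total LTR and at least 0.1% intact LTR sequence.
lai_applicable() {
    awk -v total="$1" -v intact="$2" 'BEGIN {
        exit !((total + 0 >= 5) && (intact + 0 >= 0.1))
    }'
}

# Made-up percentages for a genome with very little LTR sequence.
if lai_applicable 0.9 0.05; then
    echo "LAI can be evaluated"
else
    echo "LTR content too low for a reliable LAI"
fi
```

Keeping the check as an exit status rather than printed text makes it easy to gate a later LAI step in a pipeline script.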

For classification of LTR superfamilies, LTR_retriever uses models trained on rice LTR classifications, so the same models may not be applicable to invertebrate genomes. However, classification information is not the major factor in identifying LTR elements. You may need to classify the identified LTR elements yourself.

Best,
Shujun

lurebgi · 2019-01-09T08:49:36Z

Hi Shujun,

Thanks for your email. In amphioxus the LTR content seems to be less than 1%, which might be the reason.

On another note, LTR_retriever annotated 25.27% of a tilapia genome as LTRs (according to the .tbl file), while the actual proportion should be about 4%. I wonder whether it has a lot of false positives? The LAI score is also unexpectedly low for a PacBio assembly: 3.01. Below is the script I used; would you have any suggestions for reducing false positives?

```
/apps/genometools/1.5.9/bin/gt suffixerator -db $genome -indexname gt_index/$g -suf -lcp -des -ssp -sds -dna
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 > $g.harvest.scn
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 -motif TGCA -motifmis 1 > $g.harvest.motif.scn

/scratch/luohao/software/LTR_Finder/source/ltr_finder -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9 $genome > $g.finder.scn

perl /scratch/luohao/software/mgescan-1.1/mgescan/ltr/find_ltr.pl -seq=$genome -min-ltr=100 -max-ltr=7000 -min_iden=90

/scratch/luohao/software/LTR_retriever-2.0/LTR_retriever -genome $g -nonTGCA $g.harvest.scn -inharvest $g.harvest.motif.scn -infinder $g.finder.scn -threads=20
```

Thanks!

wangzhennan14 · 2019-01-16T01:17:02Z

Hi Luohao,
Where did you download mgescan-1.1 from? Could you give me the URL? I have downloaded three MGEScan packages, but none of them worked.

Thank you very much!
Zhennan

lurebgi · 2019-01-16T05:54:20Z

I did not actually use mgescan-1.1, as Shujun suggested in some of the threads.

oushujun · 2019-01-16T06:46:34Z

@wangzhennan14

For MGEScan_LTR please refer to #8 and #19. Let me know if you need further help, thanks!

Shujun

oushujun · 2019-01-16T06:56:38Z

@lurebgi

Sorry for the delayed response (somehow I thought I had already replied).

Your commands look good, but I have no idea about the total LTR content of amphioxus. If you suspect a high proportion of false positives, you may manually curate a few of them to verify (try NCBI BLAST and see what they are). If you do find some, please post example sequences here with 100 bp of flanking sequence on both the upstream and downstream sides, which would help with debugging.
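A rough sketch of how the flanked coordinates could be derived, assuming 1-based inclusive coordinates for the candidate and a known sequence length. `flank_region` is a hypothetical helper and the scaffold name and positions are made up; the `seqid:start-end` string it prints is the region syntax accepted by `samtools faidx`, which is one possible way to pull out the sequence:

```sh
# Hypothetical helper: extend a 1-based inclusive interval by FLANK bp
# on each side, clamped to the sequence boundaries.
# Usage: flank_region SEQID START END SEQLEN [FLANK]   (FLANK defaults to 100)
flank_region() {
    flank=${5:-100}
    start=$(( $2 - flank ))
    end=$(( $3 + flank ))
    if [ "$start" -lt 1 ]; then start=1; fi
    if [ "$end" -gt "$4" ]; then end=$4; fi
    printf '%s:%d-%d\n' "$1" "$start" "$end"
}

# Made-up candidate on a made-up 50 kb scaffold.
flank_region scaffold_1 1250 7800 50000   # prints scaffold_1:1150-7900

# One possible extraction step (assumes samtools is installed and that
# $genome is the assembly FASTA):
# samtools faidx "$genome" "$(flank_region scaffold_1 1250 7800 50000)"
```

Clamping matters for candidates near scaffold ends, where a fixed 100 bp extension would otherwise run past the sequence.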

If the LTR content is too low, then LAI is not accurate. You may plot the regional LAI values in the *.LAI file to see whether their distribution is uneven. Using long reads does not guarantee assembly quality, which also depends on many other factors.
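Before plotting, a quick numeric summary of the regional values can already hint at an uneven distribution. The sketch below assumes the *.LAI file is a tab-delimited table with one header line, the LAI value in the last column, and a genome-wide row labelled `whole_genome` in the first column; these layout details are assumptions to verify against your own file, and `regional_lai_summary` is a hypothetical helper:

```sh
# Hypothetical helper: min, max and mean of per-window LAI values.
# Assumed layout: tab-delimited, one header line, LAI in the last column,
# genome-wide row labelled "whole_genome" in column 1. Check your own file.
regional_lai_summary() {
    awk -F '\t' '
        NR == 1 || $1 == "whole_genome" { next }   # skip header and genome-wide row
        {
            v = $NF + 0
            if (n == 0 || v < min) min = v
            if (n == 0 || v > max) max = v
            sum += v
            n++
        }
        END {
            if (n > 0)
                printf "windows=%d min=%.2f max=%.2f mean=%.2f\n", n, min, max, sum / n
        }
    ' "$1"
}

# Usage (the path is a placeholder):
# regional_lai_summary path/to/your.LAI
```

A wide gap between the minimum and maximum relative to the mean would be a reason to look at the per-window values along each sequence.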

Shujun

lurebgi · 2019-01-18T14:38:45Z

Hi, thanks for getting back to me. Yes, you replied before on the amphioxus issue. However, I am no longer interested in amphioxus LTRs, since there are not many anyway. My second question (sorry for mixing up the questions) was about a cichlid fish (tilapia), which should have about 4% LTR content. If you are interested in the false positives, maybe you could download the genome from https://www.ncbi.nlm.nih.gov/assembly/GCF_001858045.2 and test your program? Sorry, but for now I am not going to analyze the LTR_retriever results for tilapia any further.

L

oushujun · 2019-01-29T00:54:29Z

@lurebgi I am curious how the 4% LTR content in tilapia was estimated.

lurebgi · 2019-01-29T08:06:15Z

By RepeatMasker, using a library from Repbase plus a RepeatModeler library. This paper shows a similar result: https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-017-3723-5

oushujun · 2019-01-29T14:42:05Z

@lurebgi Repbase is a database of known TEs. LTR element sequences vary widely between species, so using LTR sequences from other species to identify tilapia LTR sequences is likely to give an underestimate. RepeatModeler is a general method for TE identification. It makes some attempt to classify TEs, but in our experience the classification is not accurate. RepeatModeler can work as a supplement once some good identifications are in hand, but Repbase is not a good approach for LTR finding.

lurebgi · 2019-01-29T17:58:35Z

Thanks for the explanation. However, according to https://www.nature.com/articles/nature13726, it is likely true that cichlid fish (including tilapia) have a relatively low content of LTRs. That said, it would be very interesting to note that LTR_retriever actually identified many unannotated LTRs in cichlids.

oushujun · 2019-01-29T18:23:49Z

@lurebgi Thanks for sharing the paper. I read the methods section. The TE annotations were based on RepeatModeler or RepeatScout, so the reasoning is somewhat circular. Since both methods are copy-number based, low-copy-number TEs will be missed. You may try to figure out which new elements are annotated by LTR_retriever. I'll be happy to see how it works or fails.

oushujun closed this as completed Feb 28, 2019

oushujun added the question label Feb 28, 2019
