invertebrate support #34

Closed
lurebgi opened this issue Jan 4, 2019 · 12 comments

@lurebgi

lurebgi commented Jan 4, 2019

Hi,

I was wondering if LTR_retriever supports invertebrate genomes. We have an amphioxus genome derived from 60X PacBio sequencing, but its LAI score is only 7.07. Moreover, all 206 LTRs in LTRlib.fa were classified as 'Unknown'. Does this look normal to you?

Thank you!

Luohao

@oushujun
Owner

oushujun commented Jan 4, 2019

Hi Luohao,

I have tried LTR_retriever on fruit fly, mouse, micro- and megabats, and human, and it worked much as it does in plants, although most of these species have much lower LTR content in their genomes. For an accurate evaluation, LAI requires that total LTR sequence make up at least 5% of the genome and intact LTR sequence at least 0.1%, so you may want to check these two values.
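
As a rough illustration of the two thresholds mentioned above, the check can be sketched in a few lines of Python. The function name and its inputs (genome size, total LTR base pairs, intact LTR base pairs) are hypothetical and not part of LTR_retriever; the values would have to be taken from your own annotation summary.

```python
def lai_is_applicable(genome_bp, total_ltr_bp, intact_ltr_bp,
                      min_total_pct=5.0, min_intact_pct=0.1):
    """Check the LAI prerequisites quoted in this thread.

    Hypothetical helper, not part of LTR_retriever. Returns a tuple
    (ok, total_pct, intact_pct), with both percentages relative to
    the genome size.
    """
    if genome_bp <= 0:
        raise ValueError("genome_bp must be positive")
    total_pct = 100.0 * total_ltr_bp / genome_bp
    intact_pct = 100.0 * intact_ltr_bp / genome_bp
    ok = total_pct >= min_total_pct and intact_pct >= min_intact_pct
    return ok, total_pct, intact_pct


# Made-up numbers: a 500 Mb genome with 4 Mb of LTR sequence in total
# and 0.3 Mb of intact LTR elements falls below both thresholds.
ok, total_pct, intact_pct = lai_is_applicable(500_000_000, 4_000_000, 300_000)
print(ok, round(total_pct, 2), round(intact_pct, 2))
```

If either percentage is below its threshold, the LAI value is probably best read as unreliable rather than as evidence of a poor assembly.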

For classification of LTR superfamilies, LTR_retriever uses models trained on rice LTR classifications, so the same models may not be applicable to invertebrate genomes. However, classification is not the major factor in identifying LTR elements. You may need to classify the identified LTR elements yourself.

Best,
Shujun

@lurebgi
Author

lurebgi commented Jan 9, 2019

Hi Shujun,

Thanks for your email. In amphioxus it seems the LTR content is less than 1%, so that might be the reason.

On another note, LTR_retriever annotated 25.27% of a tilapia genome as LTRs (according to the .tbl file), while the actual proportion should be about 4%. I wonder if there are a lot of false positives? The LAI score is also unexpectedly low for a PacBio assembly: 3.01. Below is the script I used. Would you have any suggestions for reducing false positives?

```
/apps/genometools/1.5.9/bin/gt suffixerator -db $genome -indexname gt_index/$g -suf -lcp -des -ssp -sds -dna
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 > $g.harvest.scn
/apps/genometools/1.5.9/bin/gt ltrharvest -index gt_index/$g -maxlenltr 7000 -maxtsd 6 -mintsd 4 -seqids yes -vic 10 -similar 90 -seed 20 -motif TGCA -motifmis 1 > $g.harvest.motif.scn

/scratch/luohao/software/LTR_Finder/source/ltr_finder -D 15000 -d 1000 -L 7000 -l 100 -p 20 -C -M 0.9 $genome > $g.finder.scn

perl /scratch/luohao/software/mgescan-1.1/mgescan/ltr/find_ltr.pl -seq=$genome -min-ltr=100 -max-ltr=7000 -min_iden=90

/scratch/luohao/software/LTR_retriever-2.0/LTR_retriever -genome $g -nonTGCA $g.harvest.scn -inharvest $g.harvest.motif.scn -infinder $g.finder.scn -threads=20
```

Thanks!

@wangzhennan14

Hi Luohao,
Where did you download mgescan-1.1? Can you give me the URL? I have downloaded three mgescan packages, but none of them worked.

Thank you very much!
Zhennan

@lurebgi
Author

lurebgi commented Jan 16, 2019 via email

@oushujun
Owner

@wangzhennan14

For MGEScan_LTR please refer to #8 and #19. Let me know if you need further help, thanks!

Shujun

@oushujun
Owner

@lurebgi

Sorry for the delayed response (somehow I thought I had already replied).

Your commands look good, but I have no idea about the total LTR content of amphioxus. If you suspect a high proportion of false positives, you may manually curate a few of them to verify (try NCBI BLAST and see what they are). If you do find some, please post example sequences here, extended by 100 bp both upstream and downstream, which would help with debugging.
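
For anyone preparing such example sequences, here is a minimal Python sketch of cutting a candidate out of a sequence together with 100 bp of flanking sequence on each side. It assumes the candidate's coordinates are 1-based and inclusive, and that the contig is already loaded as a plain string; the function name is hypothetical and is not part of LTR_retriever.

```python
def extract_with_flanks(seq, start, end, flank=100):
    """Return seq[start..end] extended by `flank` bp on both sides.

    Assumes 1-based, inclusive coordinates. The flanks are clipped at
    the sequence ends, so the result may be shorter than
    (end - start + 1) + 2 * flank near a contig boundary.
    """
    if not (1 <= start <= end <= len(seq)):
        raise ValueError("coordinates fall outside the sequence")
    left = max(0, start - 1 - flank)   # convert to a 0-based slice start
    right = min(len(seq), end + flank)  # slice end is exclusive
    return seq[left:right]


# Toy contig: the candidate is the run of C's at positions 151-200.
contig = "A" * 150 + "C" * 50 + "G" * 150
region = extract_with_flanks(contig, 151, 200)
print(len(region))
```

The extracted string can then be pasted into a FASTA record and posted, or submitted to BLAST.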

If the LTR content is too low, then LAI is not accurate. You may plot the regional LAI values in the *.LAI file to see whether their distribution is uneven. Using long reads does not guarantee assembly quality, which also depends on many other factors.
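
A minimal sketch of how the regional values might be inspected before plotting. It assumes the *.LAI file is tab-separated with the columns Chr, From, To, Intact, Total, raw_LAI and LAI, plus a `whole_genome` summary row; that layout is an assumption and should be checked against your own file. The function names and the "half the median" cutoff are arbitrary choices for illustration.

```python
import statistics


def parse_regional_lai(lines):
    """Parse per-window LAI rows into (chrom, start, end, lai) tuples.

    Assumed columns: Chr, From, To, Intact, Total, raw_LAI, LAI.
    The header row and the whole_genome summary row are skipped.
    """
    windows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7 or fields[0] in ("Chr", "whole_genome"):
            continue
        windows.append((fields[0], int(fields[1]), int(fields[2]),
                        float(fields[6])))
    return windows


def low_lai_windows(windows, fraction=0.5):
    """Return windows whose LAI is below `fraction` of the median LAI."""
    if not windows:
        return []
    median = statistics.median(w[3] for w in windows)
    return [w for w in windows if w[3] < fraction * median]


# Made-up rows in the assumed format.
example = [
    "Chr\tFrom\tTo\tIntact\tTotal\traw_LAI\tLAI\n",
    "whole_genome\t1\t9000000\t0.02\t0.2\t9.0\t9.5\n",
    "chr1\t1\t3000000\t0.02\t0.2\t10.0\t10.0\n",
    "chr1\t300001\t3300000\t0.02\t0.2\t12.0\t12.0\n",
    "chr1\t600001\t3600000\t0.001\t0.2\t2.0\t2.0\n",
]
for window in low_lai_windows(parse_regional_lai(example)):
    print(window)
```

The tuples returned by `parse_regional_lai` can be handed to any plotting library, with window start on the x-axis and LAI on the y-axis.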

Shujun

@lurebgi
Author

lurebgi commented Jan 18, 2019 via email

@oushujun
Owner

@lurebgi I am curious how the 4% LTR in tilapia is estimated?

@lurebgi
Author

lurebgi commented Jan 29, 2019 via email

@oushujun
Owner

@lurebgi Repbase is a database of known TEs. The sequences of LTR elements vary wildly between species, so using LTR sequences from other species to identify tilapia LTR sequences is likely to give an underestimate. RepeatModeler is a general method for TE identification. It makes some attempt to classify TEs, but in our experience the classification is not accurate either. RepeatModeler can work as a supplement after some good identifications, but Repbase is not a good approach for LTR finding.

@lurebgi
Author

lurebgi commented Jan 29, 2019 via email

@oushujun
Owner

@lurebgi Thanks for sharing the paper. I read the methods section. The TE annotations were based on RepeatModeler or RepeatScout, so this is somewhat circular. Since both methods are copy-number based, low-copy-number TEs will be missed. You may try to figure out which new elements are annotated by LTR_retriever. I'll be happy to see how it works or fails.
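
One possible way to list the elements that only LTR_retriever annotates is a simple interval comparison, sketched below in Python. It assumes both annotations have already been reduced to (sequence ID, start, end) tuples with 1-based inclusive coordinates; how to get there from the actual output files is left out, and the function names are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one position."""
    return a_start <= b_end and b_start <= a_end


def novel_elements(ltr_retriever_hits, other_hits):
    """Return LTR_retriever hits that overlap nothing in `other_hits`.

    Both arguments are iterables of (seq_id, start, end) tuples.
    This is a quadratic scan per sequence, which is fine for a sketch
    but slow for very large annotations.
    """
    by_seq = {}
    for seq_id, start, end in other_hits:
        by_seq.setdefault(seq_id, []).append((start, end))
    return [
        (seq_id, start, end)
        for seq_id, start, end in ltr_retriever_hits
        if not any(overlaps(start, end, s, e)
                   for s, e in by_seq.get(seq_id, []))
    ]


# Made-up coordinates: only the second hit is absent from the other set.
ours = [("chr1", 100, 5000), ("chr1", 9000, 15000), ("chr2", 50, 800)]
theirs = [("chr1", 4000, 4500), ("chr2", 700, 900)]
print(novel_elements(ours, theirs))
```

The elements that come back are the candidates worth checking by hand, since they are either new low-copy-number elements or false positives.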
