I recently identified a bug that dropped intact LTR elements whose two LTRs differ in length by more than 15 bp due to InDels. After manual checks, I determined these are still high-quality intact elements, so they are now salvaged in the output. This marginally improves sensitivity, especially for genomes with limited LTR sequences (e.g. Arabidopsis, ~7%). The margin decreases for genomes with decent amounts of LTRs, such as rice (~25%) and maize (~75%), because their intact elements were already abundant enough to construct a comprehensive library. However, the number of intact LTR elements could increase by 10-20% compared with the last version (v2.7), which has some positive effects on the calculation of LAI. Some benchmarking results:
- Allow for mirrored candidates produced by LTRharvest
- Improve convert_ltrdetector.pl for the published version (v1.0) of LtrDetector (contributed by @baozg)
- Add a converter, convert_ltr_finder2.pl, to convert the LTR_FINDER -w 2 table format into the LTRharvest screen output format
- For LAI, allow the -all file to contain other TEs (i.e., whole-genome TE annotation)
I am excited to release this much faster version of LTR_retriever. Its multithreading module had been slowing down the program and I finally got the chance to improve it. This part of the update will not change the program outcome, since it is just a more efficient implementation of parallel computation.
In a test on the 14.5 Gb bread wheat genome, a total of 941,338 raw LTR candidates were processed and a non-redundant library was generated. With the current version (v2.7) and 10 threads (-threads 10), this process took only 8 days 3 hours and 31 minutes; the last version (v2.6) would have required 3 weeks.
- Classification of Copia elements was improved to be more sensitive (#51)
- Print out the program version number on screen.
- Improved genome and sequence reading.
Support three more LTR programs!
Three more LTR programs are now supported. Users need to convert candidates identified by these programs into the LTRharvest format with the provided conversion scripts, then feed the converted files to the program with -inharvest. You may concatenate multiple LTRharvest-format input files together.
Note: You won't find a lot of intact LTR elements from LtrDetector outputs due to the fuzzy sequence boundaries these programs provided. So please use these two as supplements to other inputs.
Minor bug fix
- Fix the bug that would count 1 extra bit for sequence names
- Maintain codes for solo LTR identification and solo-intact ratio calculations:
- Format sequence names in the redundant library output
- Add detailed notes to fix the conda RepeatMasker issue. Notes can be found here: #43
Checkpointing is implemented!
Users can recover interrupted runs from a number of major checkpoints. This is particularly useful when a run of LTR_retriever on a huge genome (e.g., common wheat) gets interrupted (for example, when the job is killed due to a walltime limit). Use LTR_retriever -h for further information.
Remove nesting of entire LTR elements in library
Previous versions would remove nested insertion of solo LTRs. However, when a full element is nested in a library sequence, the internal region of the nesting element won't be removed, causing sequence mosaics and library redundancy. In this update, a new module is developed to clean up composite sequences caused by full-element nesting. This update was inspired by Mr. Robert Hubley's report.
The current version shows a slight decrease in accuracy with a marginal gain in sensitivity. This is likely because the removal of nesting sequences slightly shifted the annotation dynamic of RepeatMasker. Nevertheless, no extra sequence is added in this process; instead, it removes up to 60% of library sequences (e.g., in common wheat) that are redundant due to nested full-element insertions.
- Update README: MGEScan_LTR is no longer supported due to the inability to run it on modern Linux platforms.
- Add an easy way (conda) to install dependencies.
- Fix a bug that occurred when chromosome names are pure numbers.
- Improve the estimation of LTR age. Previous versions included InDels for divergence estimation, which would result in overestimation of LTR age. This version will only use SNPs, no indels, to compute LTR divergence and age.
- Fix the bug 'substr sequences out of range' when the candidate is located at the boundary of a contig.
- Fix the bug that sometimes produced slightly different results when using both LTRharvest and LTR_FINDER inputs.
- Fix the bias toward identifying TGCA motifs over non-TGCA motifs and improve TSD identification.
- Improve detection/filtering sensitivity for LINE/DNA transposases and plant proteins.
- Remove short sequences (<100bp) in the final library.
- Update README and citations.
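The SNP-only age estimate described in the list above can be sketched as follows. This is a minimal illustration, not the program's own code: it assumes the two LTRs of an element are already aligned with `-` marking gaps, applies a Jukes-Cantor correction, and converts divergence to age with T = K / (2 * mu). The correction model, the function names, and the example substitution rate are all assumptions made for this sketch.

```python
import math


def ltr_divergence(ltr5: str, ltr3: str) -> float:
    """Fraction of mismatched sites between two aligned LTRs, skipping gap columns."""
    if len(ltr5) != len(ltr3):
        raise ValueError("aligned sequences must have equal length")
    compared = mismatches = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue  # InDel column: excluded from the divergence estimate
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance (assumed model, valid for p < 0.75)."""
    if p >= 0.75:
        raise ValueError("divergence too high for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


def ltr_age(ltr5: str, ltr3: str, mu: float = 1.3e-8) -> float:
    """Insertion age in years; mu (substitutions per site per year) is an example value."""
    return jukes_cantor(ltr_divergence(ltr5, ltr3)) / (2 * mu)


# One gap column (ignored) and one SNP across nine compared sites.
print(ltr_divergence("ACGT-ACGTA", "ACGTTACGTT"))
```

Counting the gap column as a difference would raise the divergence in this toy example from 1/9 to 2/10, which illustrates how including InDels inflates the estimated age.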
LTR_retriever v2.0 performs similarly well compared with the v1.x versions.
- Add LTR_digest support
- The *pass.list.gff3 becomes readable to LTR_digest.
- You can also use /LTR_retriever/database/TEfam.hmm to feed LTR_digest.
- Improved gff3 output for intact LTR-RTs
- Add strand info for each element.
- The ones with '?' (unknown direction) in the *pass.list will remain '?'s in the *pass.list.gff3 file.
- Improve multi-threading efficiency
- Use the Thread::Queue module to replace the Thread::Semaphore module
- At least 100% more efficient
- Add Mac OS X support (High Sierra v10.13.3 tested)
- Add a script to summarize the genome % of each TE family using RepeatMasker .out files
perl ./LTR_retriever/bin/fam_coverage.pl TE_lib RM_output genome_size_bp > TE_fam.size.list
- Not only works with LTRs but also other TEs in the RM.out file.
- Add a script to summarize the genome % of each TE superfamily (TE summary table for genome publications)
perl ./LTR_retriever/bin/fam_summary.pl TE_fam.size.list genome_size_bp > TE_fam.sum.txt
- Summary tables for LTR families and superfamilies are added to the output of LTR_retriever
- Add a script to calculate LTR distribution (Copia, Gypsy, and unknown) on chromosomes.
perl ./LTR_retriever/bin/LTR_sum.pl -genome genome.fa -all genome.fa.RM.out [options]
-window [int] bp size of the sliding window, default 3,000,000
-step [int] bp size of the moving step, default 300,000
-intact indicate the -all file is an LTR_retriever .pass.list instead of a RepeatMasker .out file
- The .out.LTR.distribution.txt file is generated by default.
- Add a script for whole-genome forward simulation (randomly add mutations on the genome)
perl ./LTR_retriever/bin/simulate_mutation.pl -g genome.fasta -u [0-1] > genome.mutated.fasta
- -u specifies the mutation rate, e.g., -u 0.01 will randomly mutate 1% of the entire genome.
- Replace annotate_gff.pl with make_gff3_with_RMout.pl for better whole-genome LTR-RT annotation
perl ./LTR_retriever/bin/make_gff3.pl genome.fa.RepeatMasker.out > genome.fa.RepeatMasker.gff
- Applied basic hit filtering: SW_score>=300, alignment length >= 80 bp
- Add more usage information to -h
- Update README
- Program halt when nothing is masked in truncated candidates.
- Program halt when multiple LTR_retriever tasks simultaneously check RepeatMasker in the same directory
- substr sequences out of range when self-corrected reads are used as input
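The basic hit filtering mentioned in the list above (SW_score >= 300, alignment length >= 80 bp) can be illustrated with a short sketch over RepeatMasker .out lines. It assumes the standard whitespace-separated .out layout (SW score in the first column, query begin and end in the sixth and seventh) and measures alignment length on the query coordinates; that length definition and the function name are assumptions, not necessarily what the script itself does.

```python
def filter_rm_hits(lines, min_sw=300, min_len=80):
    """Keep RepeatMasker .out records passing the SW-score and length cutoffs."""
    kept = []
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with an integer SW score.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        sw_score = int(fields[0])
        query_begin, query_end = int(fields[5]), int(fields[6])
        aln_len = query_end - query_begin + 1  # assumed definition of alignment length
        if sw_score >= min_sw and aln_len >= min_len:
            kept.append(line.rstrip("\n"))
    return kept


example = [
    "   SW   perc perc perc  query  position in query    matching  repeat",
    "score   div. del. ins.  sequence  begin end (left)  repeat    class/family",
    "",
    "  450 10.0  1.0  0.5  chr1  1000 1200 (5000) + Copia_1 LTR/Copia 1 201 (0) 1",
    "  250 12.0  0.0  0.0  chr1  2000 2300 (3900) + Gypsy_2 LTR/Gypsy 1 301 (0) 2",
    "  900  5.0  0.0  0.0  chr1  3000 3050 (3150) C Copia_3 LTR/Copia (0) 51 1 3",
]
for hit in filter_rm_hits(example):
    print(hit)
```

Only the first record survives here: the second fails the score cutoff and the third is shorter than 80 bp.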
LAI Version b2
- Rewrite LTR_calc.pl with more accurate and efficient algorithms.
- Add the -step parameter for overlapping-sliding window scheme to estimate LAI
- Output the size of the genome for genome LAI
- Memory consumption of this script is approx. 2X the size of the input genome
- To control the boom and bust dynamic of LTR-RTs, adjust the raw LAI based on LTR identity.
- Estimate mean identity of LTR sequences in a genome using all-versus-all blastn search
- Add a quick estimation (-q) of genomic LTR identity based on a log-linear model with the slope estimated from three small subsets of LTRs
- To avoid abnormal adjustment, if the estimated LTR identity is <= 92% or >= 96%, it is corrected to 92% or 96%, respectively
- Use the -unlock parameter to release the restriction of LTR identity ([92, 96]) for good genomes with extreme LTR activities
- Set LAI_adj=0 if raw LAI==0
- The alignment identity cutoff (-iden) excludes hits with identity higher than this value from the LTR identity calculation. Default: 100 (%)
- Change the output naming of LAI to raw_LAI and LAI_adj to LAI for easier description.
- Add polyploid support.
- If the input genome is a polyploid (a diploidized ancient polyploid does not count), then only one set of chromosomes (1x, a monoploid) should be used to estimate LAI; otherwise the LTR identity will be erroneously estimated to be very high, which substantially decreases the LAI.
- Use the -mono parameter to provide a list of chromosome names of a monoploid, LAI will be calculated only on these sequences.
- Users can run LAI multiple times with different monoploids specified to obtain the whole genome LAI estimation.
- Set prerequisites of LAI estimation
- set intact LTR-RT limit >= 0.01%;
- set total LTR limit >= 5%
- Add the -totLTR parameter for customized total LTR content;
- Add the -window parameter to control window size
- Add the rush mode (-qq) to quickly estimate raw LAI for version comparison. Raw LAI should not be used to compare between different species because the LTR dynamic is not controlled.
- Add status output of the LAI program. LAI is a default output of LTR_retriever. You should rerun LAI with the -mono parameter if the target genome is a polyploid.
- Add Mac OS X support (High Sierra v10.13.3 tested).
- Add the citation for LTR_retriever. Please cite our program: S. Ou and N. Jiang (2017) LTR_retriever: a highly accurate and sensitive program for identification of long terminal-repeat retrotransposons. Plant Physiology, pp.01310.2017; DOI: 10.1104/pp.17.01310 http://www.plantphysiol.org/content/early/2017/12/12/pp.17.01310
- Retain the unreduced library (*.LTRlib.redundant.fa). Please use the non-redundant library unless you have a specific reason not to. Note that using the unreduced library may not improve annotation sensitivity; any gain is marginal, and the annotation will take significantly more time.
- Remove the entire candidate if plant protein sequence is found dominant (70%) in either the LTR region or the internal region.
- Add module # in the status output.
- Remove space(s) at the end of each seq ID to avoid error;
- Check if seq names are duplicated;
- Fix bugs in reading the window size parameter for LAI;
- Fix a bug that halted the program when nothing is masked in truncated candidates.
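The LAI prerequisites and the LTR-identity restriction listed in the LAI notes above (identity kept within [92, 96] unless -unlock is given, LAI set to 0 when raw LAI is 0, intact LTR-RT >= 0.01% and total LTR >= 5%) can be sketched as below. The function names are hypothetical, and the linear form of the adjustment with its `slope` and `reference_identity` parameters is an assumption for illustration; these notes do not state the actual adjustment formula.

```python
def restrict_identity(identity, low=92.0, high=96.0, unlock=False):
    """Clamp the estimated LTR identity (%) to [low, high] unless unlocked."""
    if unlock:
        return identity
    return min(max(identity, low), high)


def adjust_lai(raw_lai, identity, slope, reference_identity, unlock=False):
    """Adjust raw LAI by LTR identity; the linear form is an assumed placeholder."""
    if raw_lai == 0:
        return 0.0  # rule from the notes: adjusted LAI is 0 when raw LAI is 0
    restricted = restrict_identity(identity, unlock=unlock)
    return raw_lai + slope * (reference_identity - restricted)


def meets_prerequisites(intact_pct, total_ltr_pct, min_intact=0.01, min_total=5.0):
    """Check the intact LTR-RT and total LTR content limits (both in genome %)."""
    return intact_pct >= min_intact and total_ltr_pct >= min_total


print(restrict_identity(98.0))               # clamped to the upper bound
print(restrict_identity(98.0, unlock=True))  # left untouched when unlocked
```

Here `min_total` plays the role of a customized total LTR content limit, as the -totLTR parameter is described to do.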
New feature: The LTR-RT Assembly Index (LAI) for evaluation of genome assembly continuity
Description: LTR retrotransposons (LTR-RTs) are very difficult to assemble due to their repetitive nature (up to 75% of a genome, e.g., maize) and long length (up to 20 Kb). The simple idea that more intact LTR-RTs can be found in a more continuous genome provides the theoretical support for LAI. This module uses the list of intact LTR-RTs and the whole-genome annotation of LTR-RTs produced by LTR_retriever (*.pass.list and *.out, respectively) to calculate LAI. A window-based calculation is implemented for estimation of regional continuity. A manuscript describing this feature is in preparation.
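To make the window-based idea concrete, here is a minimal sketch that assumes raw LAI in a window is the length of intact LTR-RT sequence divided by the length of all annotated LTR sequence, times 100. That ratio, the half-open coordinate convention, and the function names are assumptions for illustration; the window and step sizes are passed in explicitly.

```python
def covered_bp(intervals, win_start, win_end):
    """Bases of [win_start, win_end) covered by half-open intervals, overlaps merged."""
    clipped = sorted(
        (max(s, win_start), min(e, win_end))
        for s, e in intervals
        if s < win_end and e > win_start
    )
    total, cur_end = 0, win_start
    for s, e in clipped:
        if e > cur_end:
            total += e - max(s, cur_end)
            cur_end = e
    return total


def windowed_raw_lai(chrom_len, intact, all_ltr, window, step):
    """Return (start, end, raw_LAI) for overlapping sliding windows on one sequence."""
    results = []
    start = 0
    while start < chrom_len:
        end = min(start + window, chrom_len)
        total = covered_bp(all_ltr, start, end)
        raw = 100.0 * covered_bp(intact, start, end) / total if total else 0.0
        results.append((start, end, raw))
        if end == chrom_len:
            break
        start += step
    return results


intact = [(100, 200)]
all_ltr = [(100, 200), (300, 400), (600, 700)]
for row in windowed_raw_lai(1000, intact, all_ltr, window=500, step=250):
    print(row)
```

Choosing a step smaller than the window gives the overlapping sliding-window scheme; setting them equal gives non-overlapping windows.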
- Improve purging criteria: introduce an identity cutoff for alignment hits (>=30%) and replace the alignment-length criterion with an identity-length criterion, where identity-length = alignment length - mismatch and must be >=90 for a real hit.
- add scripts to identify solo LTR and complete LTR, and to estimate solo-complete ratio for each family, and count family size in the genome. These codes were initially developed for this study: https://www.nature.com/articles/s41467-017-02546-5
- Control the length of internal regions (>=100 bp) on LTR candidates.
- Updated the manual
- Introduce fingerprints for databases to avoid accidentally deleting these files, especially when running multiple LTR_retriever instances in the same folder (e.g., #2). In other words, you can now run multiple LTR_retriever instances in the same folder.
- Add warnings if specified file(s) do not exist.
- Update license to GNU-GPL v3, i.e., LTR_retriever is open-source software.
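The purging criteria in the list above (identity >= 30% and identity-length = alignment length - mismatch >= 90) amount to a simple predicate. The sketch below assumes hits arrive as BLAST tabular lines (`-outfmt 6`, whose third to fifth columns are percent identity, alignment length, and mismatches); the input format and function names are assumptions for illustration.

```python
def is_real_hit(aln_len, mismatches, identity_pct,
                min_identity=30.0, min_identity_len=90):
    """Apply the identity cutoff and the identity-length criterion to one hit."""
    identity_len = aln_len - mismatches
    return identity_pct >= min_identity and identity_len >= min_identity_len


def parse_outfmt6(line):
    """Extract (alignment length, mismatches, percent identity) from a tabular hit."""
    fields = line.rstrip("\n").split("\t")
    return int(fields[3]), int(fields[4]), float(fields[2])


hit = "cand1\tprot7\t95.00\t100\t5\t0\t1\t100\t1\t100\t1e-40\t180"
print(is_real_hit(*parse_outfmt6(hit)))
```

A hit of 95 bp with 10 mismatches would be rejected despite its high identity, because its identity-length is only 85.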
Provide a workaround for the blast bug (described in v1.3 and #4 #3) that occurs under high CPU usage or resource over-allocation. Each blast attempt is checked and redone up to 100 times if it returns an error status. This is not a total fix, but there should at least be no more such errors.
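The check-and-redo workaround can be pictured with a small retry wrapper. This is only a sketch of the general pattern: it treats a non-zero exit status as the error signal, which is an assumption about how failures are detected, and it uses a stand-in command so that it runs without BLAST installed.

```python
import subprocess
import sys


def run_with_retry(cmd, max_attempts=100):
    """Run cmd, retrying on a non-zero exit status; return (stdout, attempts used)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last = None
    for attempt in range(1, max_attempts + 1):
        last = subprocess.run(cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last.stdout, attempt
    raise RuntimeError(
        f"command failed {max_attempts} times; last stderr: {last.stderr.strip()}"
    )


# Stand-in command; a real call would pass a blastn command line instead.
stdout, attempts = run_with_retry([sys.executable, "-c", "print('ok')"])
print(stdout.strip(), attempts)
```

Capping the number of attempts keeps a persistently failing candidate from blocking the run indefinitely.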
Several bugs have been reported since the last release. Most of them were fixed in this release.
- Sorting a list of sequence coordinate failed when special characters occurred.
- RepeatMasker ran incorrectly if it was installed with HMMER as the default search engine. Added a checking procedure to make sure the blast+ engine is available. Reinstalling RepeatMasker is needed if the user receives similar errors.
- Copy the database files to the working directory instead of working in the installed directory to avoid write error.
- Update the Manual. I didn't realize it was a commented version.
- Further steps were halted when there was no coding sequence contamination needed to be cleaned.
- Tested lowest dependency versions:
- CDHIT/4.5.6 or up
- BLAST+/2.2.25 or up
- RepeatMasker/3.3.0 or up
- HMMER/3.1b1 or up
BLAST engine error: Warning: Sequence contains no data
Warning: [blastn] Subject_1 chr:from..to|chr:from..to: Subject sequence contains no data
Some candidate sequences appeared to be empty when analyzing sequence structures. This could happen when more CPUs were allocated to the program than the system could provide. Some users said it also occurred even when plenty of CPUs were available. So far I could not reproduce the second situation, so I don't know how to fix it.
The good news is that such LTR candidates are usually problematic, which means they are usually false LTRs and will be screened out anyway. So if you only have this kind of error, you should be fine. The results are still reliable.