TE-GER is a high-performance bioinformatics tool based on Deep Learning for the automatic detection and annotation of Transposable Elements (TEs) in raw genomic sequences (FASTA).
The system uses a state-of-the-art hybrid architecture that combines the representational power of DNABERT-2 (a language model pre-trained on DNA) with bidirectional recurrent neural networks (BiLSTM) to capture the sequential and structural context of TEs.
Model weights are hosted on Hugging Face and are downloaded automatically on the first run; no manual setup is required.
| Model | HuggingFace | Task |
|---|---|---|
| Binary | Jspinad/te-ger-binary | TE vs. Background |
| Order | Jspinad/te-ger-order | LTR, LINE, SINE, TIR, … |
| Superfamilies | Jspinad/te-ger-superfamilies | Gypsy, Copia, HAT, … (21 classes) |
- Advanced Hybrid Architecture: Integrates DNABERT-2 (for rich k-mer embeddings) + BiLSTM (for sequential memory) + Linear Classifier.
- Configurable Multi-GPU Support: Automatically detects and uses all available GPUs by default (DataParallel). Optionally, select specific GPU IDs (`--gpu-ids`) or limit the number of GPUs (`--num-gpus`) to fine-tune resource usage.
- Vectorized Inference: Uses matrix operations (NumPy/PyTorch) and mixed precision (FP16) for post-processing, eliminating CPU bottlenecks.
- 3 Levels of Classification:
- Binary: Presence/absence detection (TE vs. Background).
- Order: General taxonomic classification (e.g., LTR, LINE, SINE, DNA).
- Superfamily: Detailed taxonomic classification (e.g., Gypsy, Copia, Mutator, etc.).
- "Mega-Chunks" Strategy: Processes the genome in massive configurable fragments (e.g., 1,000,000 - 5,000,000 bp) to saturate VRAM and minimize communication overhead.
- Standard Output: Generates GFF3 files compatible with IGV, JBrowse, and other genomic viewers.
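The "Mega-Chunks" strategy above can be sketched as a simple generator. The function name, overlap handling, and sizes here are illustrative assumptions, not TE-GER's actual implementation:

```python
def iter_mega_chunks(sequence, chunk_size=2_000_000, overlap=512):
    """Yield (start, fragment) pairs covering the whole sequence.

    Consecutive chunks overlap by `overlap` bp so a TE spanning a chunk
    boundary is still seen whole by at least one chunk.
    (Illustrative sketch -- not TE-GER's actual code.)
    """
    step = chunk_size - overlap
    pos = 0
    while pos < len(sequence):
        yield pos, sequence[pos:pos + chunk_size]
        if pos + chunk_size >= len(sequence):
            break
        pos += step
```

Each yielded fragment is then cut into 512 bp model windows, so the per-chunk cost is dominated by GPU inference rather than Python overhead.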
- Python 3.9 or higher.
- (Recommended) NVIDIA GPU with CUDA drivers installed for fast inference.
- Git.
Installation with pip and a virtual environment:

```bash
# 1. Clone the repository
git clone https://github.com/johanpina/TE-GER.git
cd TE-GER

# 2. Create and activate the virtual environment
python -m venv venv
source venv/bin/activate   # Linux / Mac
# .\venv\Scripts\activate  # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run
python Te_annotator.py genome.fasta output.gff3 --level binary
```

Installation with conda (recommended for CUDA setups):

```bash
# 1. Clone the repository
git clone https://github.com/johanpina/TE-GER.git
cd TE-GER

# 2. Create and activate the conda environment
conda create -n teger python=3.10 -y
conda activate teger

# 3. Install PyTorch with CUDA support (adjust the pytorch-cuda version to match your drivers)
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y

# 4. Install the remaining dependencies
pip install -r requirements.txt

# 5. Run
python Te_annotator.py genome.fasta output.gff3 --level binary
```

Tip: Check your CUDA version with `nvidia-smi` and pick the matching `pytorch-cuda` version from pytorch.org.
On first run, TE-GER detects that the model weights are missing and downloads them directly from HuggingFace (~460 MB per level). Subsequent runs use the cached local copy.
```text
📥 Model weights not found locally. Downloading from HuggingFace: Jspinad/te-ger-binary
   Destination: ./models/binary/
✅ Weights downloaded successfully.
🧠 Loading Hybrid Model: binary...
```
The trained weights are hosted publicly on Hugging Face and do not need to be downloaded manually. TE-GER handles this automatically.
| Level | Repo | Size | Labels |
|---|---|---|---|
| `binary` | Jspinad/te-ger-binary | ~460 MB | Background, TE |
| `order` | Jspinad/te-ger-order | ~460 MB | Background, DIRS, HELITRON, LINE, LTR, PLE, SINE, TIR |
| `superfamilies` | Jspinad/te-ger-superfamilies | ~460 MB | 21 superfamilies (Gypsy, Copia, HAT, Mutator, …) |
Once downloaded, the weights are stored in ./models/{level}/ and reused on every subsequent run.
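The cache-or-download logic can be sketched as below. The function name and the injected `download` callback are assumptions for testability; in practice the download step would use something like `huggingface_hub.snapshot_download`:

```python
import os

# Assumed level-to-repo mapping, mirroring the table above.
REPOS = {
    "binary": "Jspinad/te-ger-binary",
    "order": "Jspinad/te-ger-order",
    "superfamilies": "Jspinad/te-ger-superfamilies",
}

def ensure_weights(level, models_dir="./models", download=None):
    """Return the local weights directory for `level`, downloading it
    only if it is not already cached.

    (Illustrative sketch -- not TE-GER's actual code.)
    """
    if level not in REPOS:
        raise ValueError(f"No HuggingFace repo configured for level {level!r}")
    target = os.path.join(models_dir, level)
    if not os.path.isdir(target):
        download(REPOS[level], target)  # first run: fetch ~460 MB
    return target
```

Because the check is a plain directory test, deleting an incomplete `./models/{level}/` folder is enough to force a clean re-download.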
The program is run from the command line (CLI). The basic syntax is:
```bash
python Te_annotator.py [ARGUMENTS] [OPTIONS]
```

| Argument | Description | Required |
|---|---|---|
| `fasta_file` | Path to the input file (`.fasta`, `.fa`, `.fna`). | ✅ |
| `output_gff` | Path where the annotation file will be saved (`.gff3`). | ✅ |
| Option | Command | Description | Default |
|---|---|---|---|
| Level | `--level` | Classification level: `binary`, `order`, `superfamilies`. | `binary` |
| Create Library | `--create-library` | Generate a FASTA library of candidate TE sequences. Use `--no-create-library` to disable. | `True` |
| Chunk Size | `--chunk-size` | Size of the genome fragment processed in memory (base pairs). Increase for speed; decrease on memory errors. | 2,000,000 |
| Workers | `--num-workers` | CPU threads for data loading. Keeping it low (2) is recommended, since GPU inference is very fast. | 4 |
| Device | `--device` | Execution device: `cuda` (GPU) or `cpu`. | `cuda` |
| GPU IDs | `--gpu-ids` | Comma-separated list of specific GPU IDs to use (e.g., `0,2,3`). Useful when sharing a server or when certain GPUs are busy. | None (all) |
| Num GPUs | `--num-gpus` | Maximum number of GPUs to use, selected sequentially starting from GPU 0. `0` means use all available. | 0 (all) |
Scans the genome using a large chunk (2,000,000 bp) for maximum speed on GPUs with ample VRAM (24 GB+).

```bash
python Te_annotator.py \
  ./test/corn_genome.fasta \
  ./results/binary_detection.gff3 \
  --level binary \
  --chunk-size 2000000 \
  --num-workers 2
```

Standard configuration for mid-range GPUs (12-16 GB VRAM), using a 1,000,000 bp chunk.
```bash
python Te_annotator.py \
  ./test/rice_genome.fasta \
  ./results/order_classification.gff3 \
  --level order \
  --chunk-size 1000000 \
  --device cuda
```

The most detailed analysis. If free VRAM is low, use a smaller chunk size (e.g., 200,000 bp).
```bash
python Te_annotator.py \
  ./test/unknown_genome.fasta \
  ./results/full_annotation.gff3 \
  --level superfamilies \
  --chunk-size 200000
```

By default, TE-GER detects and uses all available GPUs automatically. The following options control which and how many GPUs are used.
Use only specific GPUs by ID (e.g., GPUs 0 and 2 on a 4-GPU server):

```bash
python Te_annotator.py \
  ./test/corn_genome.fasta \
  ./results/binary_detection.gff3 \
  --level binary \
  --gpu-ids "0,2"
```

Limit to a fixed number of GPUs (e.g., the first 2 GPUs on an 8-GPU node):
```bash
python Te_annotator.py \
  ./test/corn_genome.fasta \
  ./results/order_classification.gff3 \
  --level order \
  --num-gpus 2
```

Run on a single GPU (useful for debugging or when sharing resources):
```bash
python Te_annotator.py \
  ./test/corn_genome.fasta \
  ./results/binary_detection.gff3 \
  --level binary \
  --gpu-ids "0"
```

Force CPU execution (no GPU required):
```bash
python Te_annotator.py \
  ./test/corn_genome.fasta \
  ./results/binary_detection.gff3 \
  --level binary \
  --device cpu
```

Note: If neither `--gpu-ids` nor `--num-gpus` is specified, TE-GER automatically uses all available GPUs. `--gpu-ids` takes precedence over `--num-gpus` if both are provided.
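The precedence rule above can be expressed as a small resolver; the function name is hypothetical and this is only a sketch of the documented behavior:

```python
def resolve_gpus(available, gpu_ids=None, num_gpus=0):
    """Decide which GPU IDs to use.

    Precedence (as documented): --gpu-ids wins over --num-gpus;
    if neither is given, use all available GPUs.
    (Illustrative sketch -- not TE-GER's actual code.)
    """
    if gpu_ids:  # e.g. "0,2"
        ids = [int(i) for i in gpu_ids.split(",")]
        missing = [i for i in ids if i not in available]
        if missing:
            raise ValueError(f"Requested GPUs not available: {missing}")
        return ids
    if num_gpus and num_gpus > 0:
        return available[:num_gpus]  # first N, starting from GPU 0
    return list(available)  # default: all GPUs
```

Validating `--gpu-ids` against the detected devices up front gives a clear error instead of a CUDA failure mid-run.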
The main output file follows the GFF3 (Generic Feature Format version 3) standard. Example:
```text
##gff-version 3
chr1  TE-GER  LTR   10500  12400  .  +  .  ID=LTR_10500_12400;Name=LTR_prediction
chr1  TE-GER  LINE  15000  15800  .  +  .  ID=LINE_15000_15800;Name=LINE_prediction
```
- Column 1 (SeqID): Sequence ID (chromosome/contig).
- Column 2 (Source): Source (`TE-GER`).
- Column 3 (Type): Predicted TE type (e.g., `LTR`).
- Columns 4-5 (Start-End): 1-based coordinates.
- Column 9 (Attributes): Unique ID and metadata for visualization.
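A feature line with the nine columns described above can be assembled like this (hypothetical helper, shown only to make the column layout concrete; real GFF3 fields are tab-separated):

```python
def gff3_line(seqid, te_type, start, end, strand="+"):
    """Format one TE prediction as a tab-separated GFF3 feature line.

    Columns: seqid, source, type, start, end, score, strand, phase,
    attributes. Start/end are 1-based and inclusive.
    (Illustrative sketch -- not TE-GER's actual code.)
    """
    attrs = f"ID={te_type}_{start}_{end};Name={te_type}_prediction"
    fields = [seqid, "TE-GER", te_type, str(start), str(end),
              ".", strand, ".", attrs]
    return "\t".join(fields)
```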
By default (or with `--create-library`), the tool also generates a FASTA file containing the DNA sequences of all predicted TEs. The output path is the GFF path with `.fasta` appended (e.g., full_annotation.gff3.fasta).
The FASTA headers are formatted similarly to RepeatModeler 2 to be easily parsable and informative:
```text
>TE_1#LTR
AGCT...
>TE_2#LINE
TTCA...
```

- The ID is a unique sequential number for each candidate (`TE_1`, `TE_2`, etc.).
- The classification is appended after a `#` symbol, taken directly from the model's prediction.
This library is useful for downstream analyses like building consensus sequences, BLASTing against other databases, or manual inspection.
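Because headers follow the `>ID#Classification` convention, downstream scripts can parse them trivially; a minimal sketch (hypothetical function name):

```python
def parse_te_header(header):
    """Split a '>TE_1#LTR'-style FASTA header into (id, classification).

    (Illustrative sketch of the header convention described above.)
    """
    name = header.lstrip(">").strip()
    te_id, _, classification = name.partition("#")
    return te_id, classification
```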
TE-GER includes a second script, library_builder.py, that takes the candidate FASTA library and generates a consensus library through clustering and multiple sequence alignment.
Candidate FASTA → MMseqs2 (clustering) → MAFFT (MSA per cluster) → CIAlign (consensus per MSA)
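The first stage of this pipeline could be driven from Python via `subprocess`. The sketch below only builds the MMseqs2 `easy-cluster` command list (these are real MMseqs2 flags, but library_builder.py's actual invocation may differ):

```python
def build_cluster_command(candidates_fasta, out_dir,
                          min_seq_id=0.8, coverage=0.8, threads=4):
    """Build the MMseqs2 easy-cluster command for the clustering stage.

    MAFFT and CIAlign then run once per resulting cluster.
    (Illustrative sketch -- not library_builder.py's actual code.)
    """
    return [
        "mmseqs", "easy-cluster",
        candidates_fasta, f"{out_dir}/clusterRes", f"{out_dir}/tmp",
        "--min-seq-id", str(min_seq_id),
        "-c", str(coverage),
        "--threads", str(threads),
    ]
```

The command would then be executed with `subprocess.run(cmd, check=True)`.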
These tools must be installed in your conda environment before using library_builder.py:

```bash
conda install -c bioconda -c conda-forge mmseqs2 mafft
pip install cialign
```

```bash
python library_builder.py [FASTA_INPUT] [OUTPUT_DIR] [OPTIONS]
```

| Argument | Description | Required |
|---|---|---|
| `fasta_input` | FASTA file of candidate TEs (output from `Te_annotator.py`). | ✅ |
| `output_dir` | Directory where all results will be saved. | ✅ |
| Option | Command | Description | Default |
|---|---|---|---|
| Min Seq ID | `--min-seq-id` | Minimum sequence identity for MMseqs2 clustering (0-1). | 0.8 |
| Coverage | `--coverage` | Minimum alignment coverage for MMseqs2 clustering (0-1). | 0.8 |
| Threads | `--threads` | CPU threads for MMseqs2 and MAFFT. | 4 |
| Workers | `--workers` | Parallel processes for MAFFT + CIAlign (multiprocessing). | 4 |
| Min Cluster Size | `--min-cluster-size` | Minimum sequences in a cluster to generate an MSA. Smaller clusters are skipped. | 2 |
Basic run using default parameters:
```bash
python library_builder.py \
  ./results/binary_detection.gff3.fasta \
  ./results/library_output/
```

Fine-tune clustering stringency and parallelism:
```bash
python library_builder.py \
  ./results/order_classification.gff3.fasta \
  ./results/library_output/ \
  --min-seq-id 0.6 \
  --coverage 0.7 \
  --threads 8 \
  --workers 6 \
  --min-cluster-size 3
```

Include singleton clusters (clusters with a single sequence):
```bash
python library_builder.py \
  ./results/binary_detection.gff3.fasta \
  ./results/library_output/ \
  --min-cluster-size 1
```

```text
output_dir/
├── clusterRes_cluster.tsv        ← MMseqs2 cluster assignments
├── clusterRes_rep_seq.fasta      ← Representative sequences
├── clusterRes_all_seqs.fasta     ← All sequences with cluster assignments
├── tmp/                          ← MMseqs2 temp files
├── clusters/
│   ├── cluster_0.fasta           ← Sequences per cluster
│   ├── cluster_0_msa.fasta       ← MAFFT alignment per cluster
│   └── ...
├── consensus/
│   ├── cluster_0_consensus.fasta ← CIAlign consensus per cluster
│   └── ...
└── consensus_library.fasta       ← FINAL CONSENSUS LIBRARY
```
The final file consensus_library.fasta contains one consensus sequence per cluster and can be used directly with tools like RepeatMasker (-lib consensus_library.fasta).
TE-GER solves the problem of the limited input length of BERT-like models through an optimized "Divide and Conquer" strategy:
- Mega-Chunking: The genome is divided into large fragments (e.g., 1,000,000 - 2,000,000 bp) that are loaded into VRAM at once.
- Parallel Sliding Window: Each Mega-Chunk contains thousands of 512 bp windows, which are automatically distributed among the selected GPUs (all by default, or a user-defined subset via `--gpu-ids`/`--num-gpus`).
- Hybrid Inference (FP16):
  - DNABERT-2: Extracts deep features from the DNA sequence.
  - BiLSTM: Analyzes the sequential context.
- Vectorized Reconstruction: Predictions are decoded using NumPy boolean masks, avoiding slow Python loops and allowing millions of bases to be processed per second.
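The boolean-mask reconstruction can be illustrated with NumPy: given a per-base label array, contiguous runs of TE labels become annotation intervals with no per-base Python loop. The function name is hypothetical and this is only a sketch of the technique:

```python
import numpy as np

def mask_to_intervals(labels, te_label=1):
    """Convert a per-base label array into 1-based (start, end) intervals.

    Run boundaries are found with np.diff on the boolean mask, so the
    whole reconstruction is vectorized.
    (Illustrative sketch -- not TE-GER's actual code.)
    """
    mask = np.asarray(labels) == te_label
    # Pad with False on both sides so runs touching the edges are closed.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(np.int8))
    starts = np.flatnonzero(edges == 1)    # 0-based run starts
    ends = np.flatnonzero(edges == -1)     # 0-based exclusive run ends
    # Convert to 1-based inclusive coordinates, as used in GFF3.
    return [(int(s) + 1, int(e)) for s, e in zip(starts, ends)]
```

Each returned interval maps directly onto a GFF3 feature line.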
- `CUDA out of memory` error: The fragment is too large for your GPU's VRAM. Reduce `--chunk-size` (e.g., from `1000000` to `200000`).
- Slow or failed download: The first run downloads ~460 MB from HuggingFace. Make sure you have an active internet connection. If the download is interrupted, delete the incomplete `./models/{level}/` folder and run again.
- `No HuggingFace repo configured` error: Check that `--level` is one of `binary`, `order`, or `superfamilies`.
- `Triton` / `Flash Attention` warnings: Normal on GPUs older than Ampere/Hopper; TE-GER switches to a compatible attention implementation automatically.
This project is licensed under the MIT License.
Developed by Johan S. PiΓ±a - 2025