DENSE is a pipeline that detects genes that have emerged de novo (from non-coding DNA regions), based on phylostratigraphy and synteny.
from Roginski et al 2024 (https://doi.org/10.1093/gbe/evae159)
DENSE uses a genome of interest (focal) and its phylogenetic neighbors (genomes FASTA and GFF3 annotation files).
It has two main parts :
- 1 : search for the taxonomically restricted genes (TRGs) among the annotated genes of a the focal genome (A and B),
- 2 : identifies through a cascade of filters, TRGs with homology traces in the neighbor genomes (C and D)
More precisely, the pipeline includes the following steps :
- A : extracts the coding sequences of all protein coding genes in the focal genome and search for homologs among the Refseq Non-redundant protein database (NR), and the neighbor genomes.
- B : based on the previous step, selects genes that are taxonomically restricted.
- C : selects TRGs with homology in non-coding regions of neighbor genomes. If a phylogenetic tree was provided, DENSE can require from these genomes to be 'outgroup', meaning that they are more distant from the focal genome that any neighbor actually sharing the gene.
- D : DENSE finally determines whether the homologous non-coding regions are in synteny with their TRGs (the step can be switch off).
It generates a file containing all the genes that have emerged de novo.
Flexibility : DENSE allows to use different "strategies" (combinations of filters) to detect de novo genes :
- 1 : any TRG with a outgroup (see below) non-coding hit
- 2 : any TRG with a non-coding hit
- 3 : any orphan gene with a non-coding hit
- Introduction
- Table of contents
- Key concepts
- Set-up
- Input files
- Usage
- Options
- Pipeline output
- Credits
- Citations
A genome labeled as "outgroup" is a genome where a given gene is absent and which branches in the tree after the last genome where the gene is present.
from Roginski et al 2024 (https://doi.org/10.1093/gbe/evae159)
Before anything, you need to have an recent Nextflow installed.
Tip
If you do not have Nextflow yet, you can find simple instructions here : this page.
In order to use the latest Nextflow version, you should use:
nextflow self-update
Important
DENSE requires Nextflow >=23.04.3. A previous version could lead to errors.
To test your Nextflow installation you can use :
nextflow run hello
In order to use DENSE in an fully-ready and reproducible environment, you need to have a container manager installed on your machine.
You can use any of the following :
You can now test DENSE on the example data with the following command :
nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY>,test
For example, if you have Docker installed on your machine, your command could be :
nextflow run i2bc/dense -profile docker,test
Note
The very first time you run DENSE, Nextflow will download the repository along with the appropriate container images from DockerHub. It takes about a minute and do not need do be repeated.
Warning
Docker users may encounter the following error :
docker: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?.
See 'docker run --help'.
In that case, restart Docker desktop (if appropriate) or follow this fix.
In order to detect taxonomically restricted genes (TRGs), DENSE uses GenEra to search the Refseq Non-redundant protein database (NR).
To download and properly install the NR along with taxonomic data, you can follow these instructions.
Warning
Do not install the nr database in the root directory of your device (i.e. "/nr.dmnd").
Note
The downloading step can take a couple of hours, but is necessary to assess the absence of homology of your genes candidate to any other known protein coding gene.
You can ignore this step if you want to use you own user-defined TRG list instead (see Usage).
To run DENSE you always need a directory that contains a genomic FASTA file ('.fna','.fasta') and a GFF3 annotation file ('.gff','.gff3') for each genome (focal and neighbors, e.g. : the mouse and some close rodents). --gendir
GFF3 files must have a classical CDS < mRNA < gene parent relationship between features.
If you want to use DENSE the most complete way, you also need :
- The phylogenetic tree that shows relations between the genomes (Newick format)
--tree
- A '.tsv' file with two columns : col1 = genome name, col2 = taxid. Must include all genomes (focal and neighbors)
--taxids
Tip
Get the taxid of your species:
- GFF3 files from NCBI have a header line. ##species https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=\<TAXID>
- Find your organism on the NCBI Taxonomy browser.
Here is a example of a --taxids
TSV file :
Droso_melanogaster 7227
Droso_virilis 7244
Droso_simulans 7240
...
Here is a example of a --trgs
file :
rna-NM_181684.3
rna-XM_047421177.1
rna-XM_047438033.1
rna-XM_047438039.1
rna-NM_001012416.1
...
The GFF3 annotation files must be compliant with the gff3 specifications.
In addition, they must have 'CDS' features with 'mRNA' parents, and these 'mRNA' features must have 'gene' features as 'Parent'.
Warning
The phylostratigraphy is performed with genEra. It requires several dozens of Gb of RAM (and as many CPUs as possible).
nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY> -c yourparams.config
params {
gendir = "/PATH/" // a directory that contains the genome FASTA and GFF3
focal = "name_of_the_focal_genome"
tree = "/PATH/tree.nwk" // a tree with the same names as the genome files
genera_db = "/PATH/nr.dmnd" // the diamond database
taxids = "/PATH/taxids.tsv" // see the Input files section
}
nextflow run i2bc/dense -profile <DOCKER|APPTAINER|SINGULARITY> -c yourparams.config
params {
gendir = "/PATH/" // a directory that contains the genome FASTA and GFF3
focal = "name_of_the_focal_genome"
tree = "/PATH/tree.nwk" // a tree with the same names as the genome files
trgs = "/PATH/trgs.txt" // a single column file with CDS IDs (from GFF3). Their parent genes are assumed to be taxonomically restricted. See the Input files section
}
Note
DENSE runs a tBLASTn of all TRG translated CDS against the neighbor genomes. For certain big genomes, a few queries can have several millions of hits (e.g. repeated elements) which can slow down the analysis.
e.g. with 16 cpus per taks (neighbor), the human (GRCh38.p14) tBLASTn finishes in about 10 days, whereas the mouse genome takes about 30 hours and the yeast (S. cer) only a few minutes.
Lucy has a favorite species. She wants to collect genes from that species with the best indications of de novo emergence.
Therefore, she runs a complete DENSE analysis.
Since her HPC's admin does not like Docker (they all do), she uses Apptainer.
nextflow run i2bc/dense -profile apptainer -c Lucy.config
Lucy.config content :
params {
gendir = "../GENOMES/" // a directory that contains favorite.fna, favorite.gff3, cousin1.fna, cousin1.gff3, cousin2.fna, cousin2.gff3, ...
focal = "favorite" // the name of the focal genome
tree = "family_tree.nwk" // a tree with the same names as the genome files
genera_db = "/PATH/nr.dmnd" // the diamond database
taxids = "taxids.tsv" // see the Input files section
}
Luca has dozen of annotated strains from its most cherished Yeast.
He wants to know if its first strain has genes that seem to have emerged de novo by comparison with the eleven other strains.
He already has a list of orphan genes for this yeast, and so he provides it to DENSE (basically skip step A and B) (trgs
).
He does not know the evolutionary relationship between the genomes (no tree
and strategy = 2
).
He does not care about checking the synteny (synteny = false
).
He changed his mind about this options in the middle of a first analysis, so this time he use -resume
to reuse pre-computed steps.
nextflow run i2bc/dense -profile docker -c Luca.config -resume
Luca.config content :
params {
gendir = "input_file/" // a directory that contains strain1.fna, strain1.gff3, strain2.fna, strain2.gff3, etc...
focal = "strain1" // the name of the focal genome
trgs = "list_of_orphan_genes.txt" // see the Input files section
strategy = 2 // select any TRG with a non-coding homolog region (and no coding homolog) of a neighbor.
synteny = false // turn off synteny checking
}
[!TIP] Find out more ways to use options in Nextflow : configs
see PARAMETERS.md
- denovogenes.tsv : this is the main output. A two columns TSV file (col1: gene, col2:CDS).
- TRG_match_matrix.tsv : synthetically shows the present/absence (homolog) of every TRG coding sequence among the provided genome.
- TRG_table.tsv : details all homolog names/coordinates
- directories with some useful precomputed intermediate files :
- genera_results
- diamondblast_out
- orthologs
- blast_out
- synteny
We thank the following people for their extensive assistance in the development of this pipeline:
Ambre Baumann
Simon Herman
If you use DENSE for your analysis, please cite it using the following doi: https://doi.org/10.1093/gbe/evae159
An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md
file.