GenEra output

Josué Barrera Redondo edited this page Sep 11, 2023 · 3 revisions

Output files

The main output files of `GenEra` are the following:

- `[TAXID]_gene_ages.tsv`: Tab-delimited table that contains the age assignment for every gene in the query species, the taxonomic rank of that assignment (ranging from 1 for the oldest taxonomic level, i.e., genes conserved throughout all cellular organisms, to N for the youngest taxonomic level, i.e., putative orphans at the species level), and the taxonomic representativeness score for each gene age assignment. This table can be used as the input to perform evolutionary transcriptomics through myTAI.
- `[TAXID]_gene_age_summary.tsv`: Summary file with the number of genes in the query species that could be assigned to each taxonomic level.
- `[TAXID]_founder_events.tsv`: Tab-delimited table that contains the oldest age assignment for each gene family (as defined by MCL clustering), with its respective taxonomic rank and the number of genes contained within each gene family. These can be regarded as the putative gene-family founder events (i.e., the points in time at which gene families are expected to have originated).
- `[TAXID]_founder_summary.tsv`: Summary file with the number of putative gene-family founder events per taxonomic level.
- `[TAXID]_HDF_gene_ages.tsv`: (Optional) Tab-delimited table that contains the genes whose age assignment cannot be explained by homology detection failure (HDF). This file is created by GenEra whenever the user specifies a table with pairwise evolutionary distances using `-s`. The genes are selected based on the detection failure probabilities (calculated with abSENSE) of the closest outgroup for each given taxonomic level in the analysis. Genes whose detection failure probability in the closest outgroup is lower than 0.05 are considered to have passed the HDF test.
- `[TAXID]_HDF_gene_age_summary.tsv`: (Optional) Summary file with the number of gene-age assignments that passed the HDF test for each taxonomic level. This file is created by GenEra whenever the user specifies a table with pairwise evolutionary distances using `-s`. GenEra assigns an NA to the gene count of every taxonomic level that lacked an outgroup in the file with evolutionary distances. Species-specific genes are also treated as NA, since detection failure probabilities cannot be calculated for single data points. A gene count of 0 means that an appropriate outgroup was available in the analysis, but GenEra could not detect any high-confidence gene for that taxonomic level (sometimes because there were too few data points to calculate detection failure probabilities). The fourth column contains the NCBI taxonomy ID of the outgroup species that was used to determine whether these genes passed the HDF test.
- `[TAXID]_HDF_founder_events.tsv`: (Optional) Tab-delimited table that contains the oldest age assignment of the gene families that contain at least one gene that passed the HDF test for that taxonomic level (i.e., gene-family founder events whose age assignment cannot be explained by HDF). This file is created by GenEra whenever the user specifies a table with pairwise evolutionary distances using `-s`.
- `[TAXID]_HDF_founder_summary.tsv`: (Optional) Summary file with the number of gene-family founder events that passed the HDF test per taxonomic level. This file is created by GenEra whenever the user specifies a table with pairwise evolutionary distances using `-s`. The fourth column contains the NCBI taxonomy ID of the outgroup species that was used to determine whether these gene families passed the HDF test.
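
The HDF cutoff and the difference between an NA and a 0 in the HDF summary files can be illustrated with a short Python sketch. This is not GenEra's own code: the function names `passes_hdf_test` and `parse_gene_count` are hypothetical, and the only values taken from the descriptions above are the 0.05 cutoff and the `NA` marker.

```python
from typing import Optional

# Cutoff quoted in the description of [TAXID]_HDF_gene_ages.tsv.
HDF_THRESHOLD = 0.05


def passes_hdf_test(detection_failure_prob: Optional[float],
                    threshold: float = HDF_THRESHOLD) -> Optional[bool]:
    """Hypothetical helper: decide whether a gene passes the HDF test.

    Returns None when no probability is available (for example, when the
    taxonomic level has no outgroup), mirroring the NA case.
    """
    if detection_failure_prob is None:
        return None
    # A gene passes only when the probability is strictly lower than the cutoff.
    return detection_failure_prob < threshold


def parse_gene_count(field: str) -> Optional[int]:
    """Hypothetical helper: read a gene-count field from an HDF summary file.

    'NA' (no usable outgroup, or species-specific genes) becomes None, while
    '0' stays 0 (an outgroup was available but no gene passed the test).
    """
    field = field.strip()
    return None if field == "NA" else int(field)
```

Keeping `None` separate from `0` preserves the distinction the summary file makes between "could not be tested" and "tested, but nothing passed".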

Other relevant output files:

- `[TAXID]_ambiguous_phylostrata.tsv`: Tab-delimited table with genes that ranked low in taxonomic representativeness (by default, below 30%) and were therefore flagged as potential contaminants in the genome or putative horizontal gene transfer events. The table gives a list of possible taxonomic levels to which these genes could be assigned.
- `[TAXID]_deepest_homolog.tsv`: (Optional) Additional output file that is generated when `-i` is set to true. The file contains the best sequence hit (as defined by the bitscore value) responsible for the oldest gene age assignment for each of the query genes. This file is useful to identify erroneous age assignments due to false positive matches, and to manually evaluate genes with a low taxonomic representativeness.
- `[TAXID]_Diamond_results.bout`: Homology table generated by DIAMOND (and MMseqs2 when using `-f`) with all the traceable homologs for each query protein (this file is only generated when using a FASTA file as input). This is an intermediate file generated in the first step of the pipeline. It can be supplied with the `-p` argument to resume GenEra from step 2 onwards, which saves a considerable amount of time. This file is usually very large, so it is stored by default as a temporary file within a directory made by GenEra (`tmp_[TAXID]_[RANDOMNUM]/`), but it can also be redirected to any specified location using the `-x` argument.
- `[TAXID]_Foldseek_results.bout`: Homology table generated by Foldseek with all the traceable homologs for each query protein structure (this file is only generated when using PDB files as input). This is an intermediate file generated in the first step of the pipeline. It can be supplied with the `-p` argument to resume GenEra from step 2 onwards, which saves a considerable amount of time. This file is usually large, so it is stored by default as a temporary file within a directory made by GenEra (`tmp_[TAXID]_[RANDOMNUM]/`), but it can also be redirected to any specified location using the `-x` argument.
- `[TAXID]_HMMER_results.bout`: (Optional) Homology table generated by JackHMMER with all the traceable homologs of all the query proteins that were re-analyzed (this file is only generated when `-j` is set to true). This file is concatenated on top of the DIAMOND results (`[TAXID]_Diamond_results.bout`), and the combined table is then used by GenEra to re-calculate gene ages. This file is stored by default as a temporary file within a directory made by GenEra (`tmp_[TAXID]_[RANDOMNUM]/`), but it can also be redirected to any specified location using the `-x` argument.
- `[TAXID]_ncbi_lineages.csv`: A modified version of the lineage table generated by NCBItax2lin, with the phylostrata ordered in accordance with the query species, and without the phylostrata that lack genomic data for a reliable age assignment. This is an intermediate file generated in the second step of the pipeline. It can be supplied with the `-c` argument to resume GenEra while skipping step 2, which saves some time.
- `[TAXID]_orthofinder_tree.nwk`: A NEWICK tree containing the phylogenetic relationships of the strains/subspecies that were added to the analysis using the argument `-v`. This phylogeny is automatically created with OrthoFinder, and it is retained in the output so the user can evaluate the evolutionary relationships that were used to perform the detection of gene ages at the infraspecies level.
- `[TAXID]_abSENSE_results/`: (Optional) Files generated by abSENSE when the user specifies a table with pairwise evolutionary distances using `-s`. These files include a table with all the calculated detection failure probabilities, the bitscore predictions that were used to calculate these probabilities, as well as other parameters and general information about the analysis. Please refer to the abSENSE README for more detailed information.
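
The idea behind `[TAXID]_deepest_homolog.tsv` can be sketched in Python as follows. This is one possible reading of the description above, not GenEra's implementation: the `Hit` record, its field names, and the `deepest_homologs` function are hypothetical, and the sketch assumes that each hit already carries the taxonomic rank it supports (1 being the oldest level, as in `[TAXID]_gene_ages.tsv`).

```python
from typing import Dict, Iterable, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one homology hit of a query gene."""
    query: str       # query gene identifier
    subject: str     # matched sequence identifier
    bitscore: float  # alignment bitscore
    rank: int        # taxonomic rank supported by the hit (1 = oldest)


def deepest_homologs(hits: Iterable[Hit]) -> Dict[str, Hit]:
    """For each query gene, keep the hit at the oldest taxonomic rank,
    breaking ties by the highest bitscore."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.query)
        # Lower rank means older; negating the bitscore prefers higher scores.
        if current is None or (hit.rank, -hit.bitscore) < (current.rank, -current.bitscore):
            best[hit.query] = hit
    return best


example_hits = [
    Hit("gene1", "subjA", 50.0, 3),
    Hit("gene1", "subjB", 40.0, 1),
    Hit("gene1", "subjC", 80.0, 1),
    Hit("gene2", "subjD", 30.0, 2),
]
deepest = deepest_homologs(example_hits)
```

Inspecting the single hit that drives the oldest assignment is what makes it practical to spot a false positive match by eye.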