FAQ
The use of gene trees is recommended; however, utilizing a species tree is also permissible, under the condition that the analysis is limited to single-copy genes and orthogroups with lineage-specific duplications are not the focus of the study. It is also important to be aware that discordance in gene trees may adversely affect the outcomes if a species tree is used, although CSUBST is designed to mitigate such artifacts.
In CSUBST, branch lengths undergo recalculation, allowing for the use of any trees as long as their topology is accurate and tip labels match sequence names in the input codon alignment.
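Before running the analysis, it may help to confirm that tip labels and sequence names agree. The following is a minimal sketch, not part of CSUBST: the function names are hypothetical, the tree is assumed to be Newick with unquoted labels, the alignment is assumed to be FASTA, and the first whitespace-delimited token of each header is assumed to be the sequence name.

```python
import re


def newick_tip_labels(newick):
    """Collect unquoted tip labels, i.e. names that directly follow '(' or ','.

    Internal node labels follow ')' and branch lengths follow ':', so neither
    is captured. Quoted labels are not handled in this sketch.
    """
    return set(re.findall(r"[(,]\s*([^\s():,;]+)", newick))


def fasta_names(fasta_text):
    """Collect sequence names (assumed: first token of each '>' header line)."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                names.add(tokens[0])
    return names


def label_mismatches(newick, fasta_text):
    """Return (labels only in the tree, names only in the alignment)."""
    tips = newick_tip_labels(newick)
    names = fasta_names(fasta_text)
    return sorted(tips - names), sorted(names - tips)


# Toy data for illustration only.
tree = "((sp1_geneA:0.1,sp2_geneA:0.2)n1:0.05,sp3_geneA:0.3);"
alignment = ">sp1_geneA\nATGAAA\n>sp2_geneA\nATGAAG\n>sp4_geneA\nATGAAC\n"

only_in_tree, only_in_alignment = label_mismatches(tree, alignment)
print("Only in tree:", only_in_tree)
print("Only in alignment:", only_in_alignment)
```

Both lists should be empty for an input pair that is ready to use.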
Yes, these are actual estimates, not the result of a bug. Very large ωC values are common in branch combinations where one or more branches have undergone only a few substitutions, so that the observed synonymous and nonsynonymous convergence (OCS and OCN) is zero or very small, e.g., 0.001 in posterior probability. Even if ωC is very large, convergence supported by only 0.001 nonsynonymous convergence over the entire protein (OCN) should not be considered biologically significant, because in all likelihood there is a 99.9% chance that no convergent substitution actually occurred. Branch combinations of this nature are typically excluded by applying an OCN cutoff. We recommend using this cutoff in conjunction with an ωC cutoff, as suggested in Fukushima & Pollock (2023).
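Applying the two cutoffs together can be sketched as follows. This is an illustration under assumptions, not CSUBST code: the column names `OCNany2spe` and `omegaCany2spe`, the branch ID columns, and the threshold values are placeholders, so check the header of your own output file and choose cutoffs appropriate to your study.

```python
import csv
import io

# Assumed column names; verify them against the header of your csubst_cb_2.tsv.
OCN_COL = "OCNany2spe"
OMEGA_COL = "omegaCany2spe"


def filter_combinations(tsv_text, min_ocn, min_omega):
    """Keep rows whose OCN and omegaC both reach their cutoffs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = []
    for row in reader:
        try:
            ocn = float(row[OCN_COL])
            omega = float(row[OMEGA_COL])
        except ValueError:
            continue  # skip rows with empty or non-numeric values
        if ocn >= min_ocn and omega >= min_omega:
            kept.append(row)
    return kept


# Toy table for illustration only; the values are made up.
example = (
    "branch_id_1\tbranch_id_2\tOCNany2spe\tomegaCany2spe\n"
    "1\t2\t0.001\t950.0\n"   # huge omegaC but negligible OCN: dropped
    "3\t4\t3.2\t8.5\n"       # passes both cutoffs: kept
    "5\t6\t4.0\t1.1\n"       # enough OCN but low omegaC: dropped
)

# Placeholder thresholds, not recommended values.
kept_rows = filter_combinations(example, min_ocn=2.0, min_omega=5.0)
for row in kept_rows:
    print(row["branch_id_1"], row["branch_id_2"], row[OCN_COL], row[OMEGA_COL])
```

The first toy row shows the case described above: its ωC is very large, but the OCN cutoff removes it.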
The --s, --bs, or --cs option may be useful. They produce site-wise total posteriors of substitutions (--s), site-wise total posteriors of paired substitutions (--cs), or site-wise posteriors of substitutions in individual branches (--bs).
The values displayed in CSUBST outputs like csubst_cb_2.tsv appear unreasonably high when viewed in Excel. What could be the issue?
CSUBST utilizes a dot (.) as the decimal separator. However, if your Excel is configured for a language that uses a comma (,) as the decimal separator, the values may be displayed incorrectly on your screen.
When using the same ID in foreground.txt with the option --fg_exclude_wg set to no, it functions correctly. However, assigning different IDs to individual lineages is also possible, and in this case, the --fg_exclude_wg option is not necessary. Although both approaches yield similar analyses, there are subtle differences. For instance, in the first scenario, convergence within a lineage comprising multiple species is analyzed, which isn't the case in the second scenario.
Yes, you should list all genes of interest in foreground.txt. Omitting any gene can complicate the process of identifying relevant foreground branch combinations in the CSUBST output files.
If I have 10 sequences to compare and set --max_arity to 10, will csubst_cb_10.tsv be generated if convergence occurred among all 10 species?
The csubst_cb_10.tsv file is generated only under certain conditions: each of the 10 sequences must be independent (i.e., there are no sister pairs among them), and there must be a 9- or 10-way convergence that meets the thresholds set by --cutoff_stat for higher-order convergence analysis.
Do all branch combinations in csubst_cb_K.tsv meet the convergence metrics specified by --cutoff_stat?
No. Not all branch combinations listed in csubst_cb_K.tsv necessarily show convergence. Only a subset of them may meet the thresholds set by --cutoff_stat, and the file may also include non-convergent combinations. The file provides all relevant convergence metrics, allowing you to assess each branch combination individually.
CSUBST deterministically generates numerical branch IDs. Branch IDs should remain consistent across trees that share the same topology and identical leaf labels, but they may vary when these conditions are not met.