Kenji Fukushima edited this page Sep 18, 2024 · 14 revisions

Can a species tree be used for genome-wide analysis instead of gene trees?

Gene trees are recommended. A species tree is also acceptable, provided that the analysis is limited to single-copy genes and that orthogroups with lineage-specific duplications are not the focus of the study. Be aware that gene tree discordance may adversely affect the results when a species tree is used, although CSUBST is designed to mitigate such artifacts.

Is the branch length significant in an input tree?

CSUBST recalculates branch lengths, so any tree can be used as long as its topology is accurate and its tip labels match the sequence names in the input codon alignment.
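The label-matching requirement can be checked before running an analysis. The following is a minimal sketch, not part of CSUBST: the function names are hypothetical, and it assumes an unquoted Newick string and a FASTA alignment in which the sequence name is the first word of each header line.

```python
import re

def newick_leaf_names(newick):
    """Collect leaf labels: unquoted tokens that directly follow '(' or ','.

    Internal node labels follow ')' and are therefore not collected.
    """
    return set(re.findall(r"[(,]\s*([^():,;\s]+)", newick))

def fasta_names(fasta_text):
    """Collect sequence names: the first word of each '>' header line."""
    names = set()
    for line in fasta_text.splitlines():
        if line.startswith(">") and line[1:].strip():
            names.add(line[1:].split()[0])
    return names

def label_mismatches(newick, fasta_text):
    """Return (labels only in the tree, labels only in the alignment)."""
    leaves = newick_leaf_names(newick)
    seqs = fasta_names(fasta_text)
    return sorted(leaves - seqs), sorted(seqs - leaves)

# Toy inputs for illustration only.
tree = "((geneA:0.1,geneB:0.2)n1:0.3,geneC:0.4);"
alignment = ">geneA\nATGAAA\n>geneB\nATGAAG\n>geneD\nATGAAA\n"

only_in_tree, only_in_alignment = label_mismatches(tree, alignment)
print(only_in_tree)       # ['geneC']
print(only_in_alignment)  # ['geneD']
```

Both lists should be empty before the tree and alignment are used together. Trees with quoted labels would need a proper Newick parser instead of this regular expression.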

I am encountering "ωC = Inf" in many branch pairs. Are these actual results?

Yes, these are actual estimates and not the result of a bug. Very large ωC values are common in branch combinations where one or more branches have undergone only a small number of substitutions, so that the observed synonymous and nonsynonymous convergence (OCS and OCN) is zero or very small, e.g., 0.001 in posterior probability. Even if ωC is very large, convergence supported by only 0.001 nonsynonymous convergence over the entire protein (OCN) should not be considered biologically significant, because there is a 99.9% chance that no convergent substitution actually occurred. Typically, branch combinations of this nature are excluded by applying an OCN cutoff. We recommend using this cutoff in conjunction with an ωC cutoff, as suggested in Fukushima & Pollock (2023).
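Combining the two cutoffs might look like the sketch below. It is illustrative only: the column names `OCN` and `omegaC`, the helper name, and the threshold values are placeholders chosen for this example, so check the header of your own csubst_cb_2.tsv and the cutoffs in Fukushima & Pollock (2023) before adapting it.

```python
import csv
import io
import math

def filter_branch_pairs(tsv_text, ocn_col, omega_col, min_ocn, min_omega):
    """Keep rows whose OCN and omegaC both reach their cutoffs."""
    kept = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        ocn = float(row[ocn_col])
        omega = float(row[omega_col])  # float() parses "inf"/"Inf" as infinity
        if math.isnan(ocn) or math.isnan(omega):
            continue  # skip rows with undefined estimates
        if ocn >= min_ocn and omega >= min_omega:
            kept.append(row)
    return kept

# Toy table with hypothetical column names and values.
example_tsv = (
    "branch_id_1\tbranch_id_2\tOCN\tomegaC\n"
    "1\t5\t0.001\tinf\n"   # infinite omegaC, but almost no convergence
    "2\t7\t3.2\t8.5\n"     # passes both cutoffs
    "3\t9\t4.0\t1.1\n"     # enough convergence, but low omegaC
    "4\t11\t2.5\tInf\n"    # infinite omegaC with sufficient OCN
)

kept = filter_branch_pairs(example_tsv, "OCN", "omegaC", min_ocn=2.0, min_omega=5.0)
print([(r["branch_id_1"], r["branch_id_2"]) for r in kept])
# [('2', '7'), ('4', '11')]
```

The first row shows why the OCN cutoff matters: its ωC is infinite, yet it is dropped because its OCN is only 0.001.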

Is there any way to characterize site-wise substitution/convergence?

The --s, --bs, and --cs options may be useful. They produce site-wise total posteriors of substitutions (--s), site-wise posteriors of substitutions in individual branches (--bs), and site-wise total posteriors of paired substitutions (--cs).

The values displayed in CSUBST outputs like csubst_cb_2.tsv appear unreasonably high when viewed in Excel. What could be the issue?

CSUBST uses a dot (.) as the decimal separator. If your Excel is configured for a language that uses a comma (,) as the decimal separator, the values may be displayed incorrectly on your screen; for example, a value such as 1.234 may be read as 1234.
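One possible workaround, besides changing the Excel settings, is to write a copy of the table with comma decimals. The sketch below is not a CSUBST feature; the function names and the column name `omegaC` in the toy table are hypothetical. It relies on Python's `float()`, which always expects a dot as the decimal separator regardless of the system locale.

```python
import csv
import io

def dot_to_comma(field):
    """Replace the decimal dot with a comma in numeric fields only."""
    try:
        float(field)  # non-numeric fields (e.g. headers) raise ValueError
    except ValueError:
        return field
    return field.replace(".", ",")

def to_comma_decimals(tsv_text):
    """Return a copy of a tab-separated table that uses comma decimals."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        writer.writerow([dot_to_comma(field) for field in row])
    return out.getvalue()

# Toy table for illustration only.
example_tsv = "branch_id_1\tomegaC\n1\t1.234\n2\tinf\n"
print(to_comma_decimals(example_tsv))
```

Headers, integers, and values such as `inf` pass through unchanged; only fields that parse as numbers and contain a dot are rewritten. Keep the original file for any downstream scripts, since the converted copy is meant only for viewing.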

Why does foreground.txt have the same ID for all foreground genes in the PEPC test dataset?

Using the same ID for all foreground genes in foreground.txt works correctly when the option --fg_exclude_wg is set to no. Assigning different IDs to individual lineages is also possible, and in this case the --fg_exclude_wg option is not necessary. Both approaches yield similar analyses, but there are subtle differences. For instance, the first approach also analyzes convergence within a lineage comprising multiple species, whereas the second does not.

Should I list all the sequences that I want to test for convergence in foreground.txt?

Yes, you should list all genes of interest in foreground.txt. Omitting any gene can complicate the process of identifying relevant foreground branch combinations in the CSUBST output files.

If I have 10 sequences to compare and set --max_arity to 10, will csubst_cb_10.tsv be generated if convergence occurred among all 10 species?

The csubst_cb_10.tsv file is generated if two conditions are met: each of the 10 sequences is independent (i.e., there are no sister pairs among them), and there is a 9- or 10-way convergence that meets the thresholds set by --cutoff_stat for higher-order convergence analysis.

Do all branch combinations in csubst_cb_K.tsv meet the convergence metrics specified by --cutoff_stat?

No. Only a subset of the branch combinations listed in csubst_cb_K.tsv may meet the thresholds set by --cutoff_stat, and the file can also include non-convergent combinations. The file provides all relevant convergence metrics, allowing you to assess each branch combination individually.

Are branch IDs consistent among trees?

CSUBST deterministically generates numerical branch IDs. Branch IDs should remain consistent across trees that share the same topology and identical leaf labels, but they may vary when these conditions are not met.