Amino acid substitutions that are consistent with phenotypic variation indicate that the gene product is potentially involved in the genetic determination of the trait. We define these cases as Convergent Amino Acid Substitutions (CAAS).
It is possible to retrieve CAAS by scanning Multiple Sequence Alignments (MSA) of orthologous proteins or translated nucleotides. We will isolate those positions in which we can verify that species with diverging trait values converge to different amino acids. A very simple way to do this is to define two groups of species, isolate those positions in which they won’t share any amino acids, and test the statistical significance of this association.
Although the implementation of this strategy can be easily achieved through simple scripting, its scaling to proteome-size requires some additional effort in terms of code optimization. Also, CAAS analysis is usually validated through bootstrap-based approaches. All these operations together are computationally expensive, especially when brought to proteome-wide scale.
In recent years, our group worked on optimizing our in-house scripts for CAAS discovery and validation. CAAStools is the result of these efforts. A small suite of bioinformatics tools that allow the user to identify and validate CAAS analysis on MSA of orthologous proteins.
CAAStools is a collection of 3 bioinformatics tools written in Python 3.9.4. Figure 1 resumes the functioning of each tool. Globally, CAAStools relies on three main pieces of information that are required in different formats (see the input specification section of those paragraphs discussing each single tool). The discovery tool detects CAAS from a single amino acid MSA (protein or translated nucleotide). The resample tools elaborates virtual phenotype groups through different strategies (brownian motion simulation, random sorting of species or phylogeny-restricted resampling). The result of this tool can be submitted to the bootstrap tool that runs a bootstrap CAAS analysis on a single MSA.
The output of the discovery tool consists in a table reporting a list of CAAS associations. A corresponding p-value is calculated as the probability of randomly finding a CAAS association in that position (see our preprint for a full explanation about the statistical testing of CAAS).
The caastools/examples folder in the repository contains the files to run the examples described in this documentation. The supplementary information in our preprint (Barteri et al., 2023) describe the detection and the statistical testing of the gene BRCA2 from Farré et al., 2021 analysis. The results and raw files about this example are in the folder caastools/examples/longevity.2021.
[Under construction]
CAAStools is available as a GitHub repository
https://github.com/linudz/caastools.
The repository can be cloned through command line
git clone github.com/linudz/caastools
Also, you can clone it through the GitHub GUI client or download the code as a zipped file through your browser.
The file ct in the CAAStools Git folder is an executable python script that will launch the different tools of the suite. To run it system-wide, you can decide to either download the code to the usr/local/bin folder or to add the caastools folder to the $PATH variable.
Please, follow the github.com/linudz/caastools repository for code updates.
CAAStools is written in Pyhton 3.9.4 and is compatible with any Python 3+ environment. Each tool has its own set of dependencies.
Biopython 1.79
Scipy (Biopython dependency)
Numpy (Biopython dependency)
Dendropy 4+
Also, the Brownian Motion mode (--mode bm
) for trait simulation relies on the simpermvec() function from R library RERconverge (https://github.com/nclark-lab/RERconverge). This library will be requested only by the resample tool when run in Brownian Motion mode (--mode bm
).
Warning. Apple M1 and M2 users might experience issues during RERconverge installation (nclark-lab/RERconverge#60).
CAAStools discovery identifies CAAS from an amino acid Multiple Sequence Alignment file (MSA). It is possible to view the inputs and the options through the -h --help
command
ct discovery -h/--help
CAAStools discovery returns those positions in which amino acids differ between two groups of species. We call these groups discovery groups, distinguishing them into foreground (FG) and background (BG). The program will detect a CAAS in those MSA positions that will meet two conditions. First, the foreground and the background won’t share any amino acids. Second, all the species in at least one group need to share (or converge to) the same amino acid. The two conditions define a set of 4 possible mutation patterns. Table 3.1 reports a set of possible mutation patterns, indicating which ones are accepted as CAAS, that are enumerated from 1 to 4.
FG | BG | Difference between groups | Convergence within… | Pattern | Is it CAAS? |
K | W | YES | Both groups | Pattern 1 | YES |
K | WE | YES | FG | Pattern 2 | YES |
KE | W | YES | BG | Pattern 3 | YES |
KE | WF | YES | None | - | NO |
K | KE | NO | FG | - | NO |
KE | K | NO | BG | - | NO |
Table 3.1 - Mutation patterns.
Note that the pattern 4 can be included as CAAS by user specification. Through the --patterns
option, the user will be able to select the number of patterns to include in the output. By default, ct discovery returns the patterns 1,2 and 3.
The user can filter the result based on the maximum number of indels (or gaps, “-”) accepted per position or the maximum species missing in the alignment.
For each CAAS prediction, the program will calculate an empiric p-value that corresponds to the probability of finding the same set of mutational patterns with random species. This probability is calculated through the hypergeometric distribution.
For further details on the CAAStools discovery algorithm, please refer to our preprint on BioRXIV.
-t /--traitfile $config_file
The **configuration file **or config is the file that we’ll use to tell the program which species we are comparing and how they are arranged into the FG and BG. It consists of a simple tab file containing the name of the species and a label indicating the corresponding group (FG = 1, **BG **= 0).
A config file is present in the examples/ folder (examples/conifig.tab)
Aotus_griseimembra 0
Avahi_peyrierasi 0
Callibella_humilis 0
Gorilla_beringei 1
Gorilla_gorilla 1
Macaca_thibetana 1
Mandrillus_leucophaeus 1
This config file will instruct the program to create two groups.
FG | Gorilla_beringei, Gorilla_gorilla, Macaca_thibetana, Mandrillus_leucophaeus |
BG | Aotus_griseimembra, Avahi_peyrierasi, Callibella_humilis |
Note that:
- The values are tab-separated and no further space is admitted
- The order of the species is irrelevant.
The second fundamental input is the amino acid MSA file. CAAStools relies on Biopython 1.7 to import sequence files. Hence, the accepted formats are the ones specified in Biopython docs ( https://biopython.org/wiki/AlignIO ):
clustal, emboss, fasta, fasta-m10, ig, maf, mauve, msf, nexus, phylip, phylip-sequential, phylip-relaxed, stockholm
By default, ct discovery will read the MSA as a clustal file. To specify a different format, we’ll need to specify it through the --fmt
option (e.g. --fmt phylip-relaxed
). In the examples folder, the examples/MSA directory contains an MSA in different formats.
CAAStools will associate the sequence in the MSA to the species by name identity. The program will save the name of each species in FG and BG groups in string variables, and will select those sequences in the MSA whose ID field will coincide with one of the species in the config file. Please, format your config file and MSA in order to match the sequence IDs with the name of the species in the config file.
CAAS results can be filtered for a maximum of gaps or missing species. In this, we define as a “gap” the presence of an indel which is indicated with the “-” character. A missing species will be a species that is mentioned in the config file but it is not found in the MSA. This situation can occur when we iterate the analysis over different MSAs with variable coverage.
By default, CAAStools discovery accepts a maximum of n-1 gaps or missing species per group, where n is the size of the group. This means that the program will need the presence of at least one species per group to verify the conditions for CAAS assignment. The user can decide to limit the number of gaps and missing species per group, or to skip those positions in which gaps represent more than a maximum percentage of total symbols (default=50%). The following tables report the different options for gaps and missing species.
Max background gaps | --max_bg_gaps | Filter by number of gaps in the background. | Default: No filter |
Max foreground gaps | --max_fg_gaps | Filter by number of gaps in the foreground. | Default: No filter |
Max overall gaps | --max_gaps | Filter by total number of gaps | Default: No filter |
Max gaps per position | --max_gaps_per_position | Filter by number of gaps per position. | Default: 0.5 (50%) |
Max background missing species | --max_bg_miss | Filter by number of missing species in the background. | Default: No filter |
Max foreground missing species | --max_fg_miss | Filter by number of missing species in the foreground. | Default: No filter |
Max overall missing species | --max_miss | Filter by total number of missing species | Default: No filter |
CAAStools discovery will output a tab file with all the CAAS found in one single MSA.
Column | Header | Description |
1 | Gene | The name of the gene (from MSA filename) |
2 | Trait | The name of the trait (from binary config file) |
3 | Position | The position in the MSA (0-based) |
4 | Substitution | The substitution FG/BG |
5 | Pvalue | The p-value from hypergeometric distribution |
6 | Scenario | The mutational pattern (see “patterns” table in 3.1 - Algorithm overview) |
7 | FFGN | Species Found in FG: Number. Number of species found in the FG group (it excludes those ones having an indel). |
8 | FBGN | Species Found in BG: Number. Number of species found in the BG group (it excludes those ones having an indel). |
9 | GFG | Number of Gaps in the FG. |
10 | GBG | Number of Gaps in the BG. |
11 | MFG | Number of Missing species in the FG. |
12 | MBG | Number of Missing species in the BG. |
13 | FFG | List of species Found in the FG. Comma-separated. |
14 | FBG | List of species Found in the FG. Comma-separated. |
15 | MS | List of missing species. Comma-separated. |
Run ct discovery with example alignment (phylip-relaxed
format, that has to be specified).
ct discovery -a examples/MSA/primates.msa.pr -t examples/config.tab -o examples/discovery.output.usr.example --fmt phylip-relaxed
CAAStools resample (ct resample) elaborates a set of resampled discovery groups for bootstrap analysis. The global options for the tool can be fetched through the help command:
ct resample -h/--help
The simulation of discovery groups is propaedeutic to bootstrap analysis
can consist in a simple randomization, a randomization that is restricted to some parts of the phylogeny, or be based on brownian-motion trait evolution simulation. The user will indicate one of these three strategies, the size of the resampled FG and BG groups and the number of simulation cycles. The program outputs a tab file in
ct resample --mode random
In this case, the simulation will consist in the bare random sorting of species into a pair of FG/BG discovery groups.
ct resample --mode random --limit_by_group $groupfile
In this case, the simulation is based on the random choice of species, but is limited to the families that are present in a config file provided as a template. A further file, the species file, specifies the composition of the families. The random scooping takes into account the number of groups (or families) present in the template groups and will replicate that composition. For instance, if our template FG group consists of 3 species from group A and 2 species from groupB, the randomisation will follow this pattern. In each cycle, the program scoops 3 random species from group A and 2 random species from group B.
ct resample --mode bm
This strategy resamples neutral evolution by brownian motion simulation. The FG/BG group size is defined by a template config file. The R function
simpermvec()
from R library RERconverge (https://github.com/nclark-lab/RERconverge) will perform a brownian motion simulation of neutral evolution. FG and BG will be defined as the n-th species with higher values and the m-th species with lower values, where n and m are the size of FG and BG respectively.
Each simulation strategy will require a specific set of input files. The following table reports all the inputs that are needed for each strategy.
Input File | Random | Random (Phylogeny restricted) | Brownian Motion |
Phylogenetic tree in newick format.
-p/--phylogeny |
Mandatory | Mandatory | Mandatory |
Config File as a template
--bytemp |
Can be replaced by -f/--fg_size and -b/--bg_size for FG/BG size definition | Mandatory | Mandatory |
Trait values
--traitvalues |
NO | NO | Mandatory. It is used by the program to shuffle the |
The user will need to provide a phylogenetic tree in Newick format (https://evolution.genetics.washington.edu/phylip/newicktree.html) to tell the program on which species base its simulation. The binary configuration file is required for the phylogeny restricted and Brownian motion strategies. The phylogeny restricted strategy limits trait randomisation to some specific families.
The resample tool outputs 1000 resampled traits by default. The user can decide the number of cycles through the --cycles option:
--cycles 10
--cycles 100
--cycles 1000
The output consists in a tab file with three columns.
Column 1: The name of the cycle, indicated in the b_numberofcycle format
Column 2: The FG species (comma-separated)
Column 3: The BG species (comma-separated)
Here’s an example of the first ten lines of a resampled traits output file.
b_1 Cercopithecus_mitis,Alouatta_palliata,Pan_troglodytes,Colobus_polykomos Saimiri_sciureus,Alouatta_puruensis,Trachypithecus_crepusculus,Cercocebus_torquatus
b_2 Macaca_fuscata,Papio_cynocephalus,Cercopithecus_petaurista,Cheracebus_lucifer Trachypithecus_phayrei,Tarsius_wallacei,Lophocebus_aterrimus,Macaca_silenus
b_3 Saimiri_oerstedii,Nomascus_concolor,Lemur_catta,Saguinus_oedipus Pongo_abelii,Indri_indri,Eulemur_albifrons,Eulemur_fulvus
b_4 Saimiri_ustus,Eulemur_rubriventer,Leontocebus_nigricollis,Macaca_mulatta Semnopithecus_hypoleucos,Mus_musculus,Otolemur_garnettii,Eulemur_macaco
b_5 Cacajao_hosomi,Alouatta_puruensis,Saimiri_macrodon,Pithecia_mittermeieri Ateles_belzebuth,Macaca_maura,Prolemur_simus,Trachypithecus_laotum
b_6 Eulemur_coronatus,Trachypithecus_geei,Hapalemur_griseus,Prolemur_simus Eulemur_flavifrons,Macaca_fuscata,Trachypithecus_pileatus,Cercopithecus_mona
b_7 Cheracebus_lugens,Cercocebus_lunulatus,Hapalemur_occidentalis,Ateles_chamek Tarsius_lariang,Cercopithecus_neglectus,Cercopithecus_diana,Macaca_thibetana
b_8 Perodicticus_potto,Pan_paniscus,Cacajao_hosomi,Lepilemur_ankaranensis Cercocebus_chrysogaster,Papio_anubis,Callimico_goeldii,Plecturocebus_miltoni
b_9 Macaca_leonina,Cercopithecus_diana,Propithecus_diadema,Macaca_siberu Galagoides_demidovii,Alouatta_seniculus,Propithecus_coquereli,Plecturocebus_dubius
b_10 Leontopithecus_rosalia,Papio_cynocephalus,Hylobates_agilis,Mus_musculus Cheracebus_regulus,Cercopithecus_mitis,Lophocebus_aterrimus,Ateles_marginatus
Ex.1 Resampling based on random selection of species {#ex-1-resampling-based-on-random-selection-of-species}
With input fg/bg size (-f and -b options)
ct resample -p examples/phylogeny.nw -f 5 -b 4 -m random --cycles 500 -o test/resample/random.resampling.tab
By template (binary config)
ct resample -p examples/phylogeny.nw --bytemp examples/config.tab -m random --cycles 500 -o test/resample/random.resampling.bytemplate.tab
Phylogeny restricted (must go by template)
ct resample -p examples/phylogeny.nw --bytemp examples/config.tab -m random --limit_by_group test/sp2fam.210727.tab --cycles 500 -o test/resample/random.resampling.bytemplate.tab
Template and trait values mandatory
ct resample -p examples/phylogeny.nw --bytemp examples/config.tab -m random --cycles 500 --traitvalues examples/traitvalues.tab -o test/resample/BM.resampling.tab
NOTE: The comparison of the different statistical testings are described in Supplementary Material 3 in CAAStools manuscript.
The bootstrap tool is designed to repeat the CAAS discovery on a large number of discovery groups. In this case, the discovery groups are defined through the output file of the resample tool, in which each line represents a single cycle (see previous paragraph). The program will scan an MSA and will count the number of cycles that return a CAAS in that position.
-s $resampled_trait (output of ct resample)
-a $MSA
The output consists in a tabbed file with three columns
Column 1: Position
Column 2: Number of resamples returning a CAAS in the position
Column 3: Number of cycles
Column 4: Bootstrap value
Column 5: Cycles with positive CAAS
ct bootstrap -s test/resample/random.resampling.tab -t examples/config.tab -a examples/MSA/primates.msa.pr -o examples/random.bootstrap.tab --fmt phylip-relaxed
- License
This software is licensed under GNU General Public License. The kind of license is to be decided with UPF.
Barteri, F., Valenzuela, A., Farré, X., de Juan, D., Muntané, G., Esteve-Altava, B., & Navarro, A. (2023). CAAStools: a toolbox to identify and test Convergent Amino Acid Substitutions. Bioinformatics, 39(10), btad623. Open Access
You can ask your questions through the discussions section of CAAStools github. Also, you can contact Fabio Barteri at Pompeu Fabra University / BBRC fabio.barteri@upf.edu