Protein function is often transferred from characterized proteins to novel proteins based on sequence similarity, e.g. by using the best BLAST hit. Building on the SIFTER phylogenomic tool (1), we use a statistical inference algorithm to propagate annotations such as Gene Ontology (GO) terms within a phylogenetic tree, taking into account branch lengths and the evidence codes of the GO annotations. PhyloFun computes the likelihood of a GO term being inherited along a given phylogenetic branch from probability distributions that have been calibrated separately for each GO term.
To generate accurate phylogenetic trees that contain a maximum of functional information at reasonable computational cost, we implemented a reusable workflow that, for a given input protein, searches for candidate orthologs with known functions, adds paralogs so that duplications can be detected reliably, and builds a maximum likelihood phylogenetic tree from a filtered multiple alignment. This tree is then used as input to the inference algorithm, which outputs, for each protein in the tree, a probability for assigning each GO term occurring in the tree.
We call this new phylogenomic workflow for protein function prediction PhyloFun.
PhyloFun requires R version 2.15.2 or greater.
PhyloFun requires three external tools:
- MAFFT http://mafft.cbrc.jp/alignment/software/
- GBlocks http://molevol.cmima.csic.es/castresana/Gblocks.html
- FastTree[MP] http://www.microbesonline.org/fasttree/
Make sure all three programs are installed and on your search path. That is, the following must work from an interactive
R shell:
system( "mafft" )
system( "GBlocks" )
system( "FastTree" )
or, if you installed the preferred multi-processor (OpenMP) version of FastTree:
system( "FastTreeMP" )
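The same check can also be done from a regular shell before starting R. The following is a minimal sketch and not part of PhyloFun: `check_tools` is a hypothetical helper name, and the executable names are the ones used in the system() calls above, so adjust them if your installation names the binaries differently (for example FastTreeMP instead of FastTree).

```shell
# Report which of the given executables can be found on the PATH.
# Prints one "found"/"MISSING" line per tool and returns non-zero
# if at least one tool is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Tool names as written in this README (assumption: your binaries use the same names).
check_tools mafft GBlocks FastTree || echo "Install the missing tools before running PhyloFun."
```

Because R's system() resolves programs through the same search path, a tool reported as missing here will most likely also fail from within R.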
Install required R packages:
From within R, execute the following code to install the R packages PhyloFun requires:
source("http://bioconductor.org/biocLite.R")
biocLite( c( "Rcpp", "Biostrings", "RCurl", "RMySQL", "XML", "ape", "biomaRt", "brew", "gRain", "phangorn", "rredis", "stringr", "xtable" ) )
Install the PhyloFun R package itself:
- Download source:
git clone git://github.com/groupschoof/PhyloFun.git ./PhyloFun
R CMD INSTALL PhyloFun
Please note that PhyloFun requires a working internet connection to run properly!
The PhyloFun R package comes with a number of executable Rscripts, all stored in folder
After installation, find the path to the installed PhyloFun package. Open an interactive
R shell and type
The returned path can then be used to run the provided Rscript
runPhyloFun.R as follows:
Rscript <path_to_your_PhyloFun_installation>/exec/runPhyloFun.R <arguments>
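One way to look up the installation path without opening an interactive session is R's `system.file()`, which returns the directory of an installed package, or an empty string if the package is not installed. The sketch below assumes `Rscript` is on your PATH; `r_package_dir` and `PHYLOFUN_DIR` are illustrative names, not something PhyloFun provides.

```shell
# Print the installation directory of an R package.
# Fails if Rscript is unavailable or the package is not installed.
r_package_dir() {
  command -v Rscript >/dev/null 2>&1 || { echo "Rscript not found on PATH" >&2; return 1; }
  # system.file() yields "" for packages that are not installed
  dir=$(Rscript -e "cat(system.file(package = '$1'))" 2>/dev/null)
  [ -n "$dir" ] || { echo "R package '$1' is not installed" >&2; return 1; }
  printf '%s\n' "$dir"
}

if PHYLOFUN_DIR=$(r_package_dir PhyloFun); then
  # Only prints the command line to use; replace <arguments> with real options.
  echo "Rscript $PHYLOFUN_DIR/exec/runPhyloFun.R <arguments>"
fi
```

Storing the path in a variable avoids retyping it for every run.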
PhyloFun command line arguments
The arguments are printed whenever the Rscript is executed.
-q path to the query proteins’ amino acid sequences in FASTA format
-b path to sequence similarity search results. Provide -b for BLAST tables (tabular output -m 8) or -p for PHMMER (HMMER-3) search results in
-e comma-separated list of GO evidence codes you want to be accepted as “trusted”, e.g. -e ISO,RCA, or provide -e ALL if any evidence code is acceptable. See http://www.geneontology.org/GO.evidence.shtml for details on GO evidence codes.
-n use the n best hits for function prediction. If you want to speed up PhyloFun, reduce this number to e.g.
-h true to print out statistics of each query protein’s found homologs – result file
-m true to print out statistics of each generated Multiple Sequence Alignment (MSA) – result file
-r true to generate an HTML report of PhyloFun’s results
A short test run can be done as follows:
Rscript <path_to_your_PhyloFun_installation>/exec/runPhyloFun.R -q <path_to_your_PhyloFun_installation>/protein_1.fasta -b <path_to_your_PhyloFun_installation>/protein_1_blastout.tbl -f FastTree -h true -r true -m true
Note that, depending on your installation of FastTree, you’ll have to provide either
-f FastTree or the faster multi-processor version of it
Results of this test run will be written into the folder
Protein_1/ in your current directory.
Generate PhyloFun’s input
All you have to do is run either BLAST or PHMMER for your query proteins against the provided database of UniProtKB proteins with trusted Gene Ontology term annotations (trusted-UniKB).
- First copy and unpack trusted-UniKB from PhyloFun to your directory of choice.
cp inst/ukb_proteins_with_trusted_go_annos.fasta.bz2 path/to/your/directory, and then unpack it
- Now run BLAST or PHMMER. If you choose BLAST, you must provide its output as a table; use the command line switch
-m 8. If you choose PHMMER, a tabular result file is also required and can be obtained with the command line argument
- Recommended E-value cutoffs for BLAST are
-e 1 and for PHMMER
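Before handing a BLAST result to PhyloFun, it can be worth checking that the file really is in the tabular -m 8 format, which has 12 tab-separated columns per hit (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). The following sketch is an optional sanity check and not part of PhyloFun; `check_blast_table` is a hypothetical helper name.

```shell
# Verify that every data line of a BLAST -m 8 table has exactly 12
# tab-separated fields. Prints one message per offending line and
# returns non-zero if any line does not match.
check_blast_table() {
  awk -F '\t' '
    /^#/ || NF == 0 { next }   # skip comment lines and empty lines
    NF != 12 { printf "line %d: expected 12 fields, found %d\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Usage, with the table from the short test run as an example:
# check_blast_table <path_to_your_PhyloFun_installation>/protein_1_blastout.tbl && echo "table looks fine"
```

A file produced with a different output format (for example the default pairwise report) fails this check on its first lines, which is usually faster to diagnose than an error later in the workflow.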
Coding style guide
PhyloFun’s code mostly follows the Google style guide for R: