Function prediction of uncharacterized proteins using their evolutionary history.
R C++
Latest commit 2da4470 Jul 28, 2015 @asishallab asishallab Made GO term based F1-scoe calculation possible without extension of …
…parent and/or spawned GO terms. Set option 'extend.go.annos.with.parents' to FALSE
Permalink
Failed to load latest commit information.
R
data
exec
inst
man
src
.gitignore
DESCRIPTION
NAMESPACE
README.textile
todos.textile

README.textile

PhyloFun

Abstract

Protein function has often been transferred from characterized proteins to novel proteins based on sequence similarity, e.g. using the best BLAST hit. Based on the SIFTER phylogenomic tool (1), we use a statistical inference algorithm to propagate e.g. Gene Ontology (GO) terms inside a phylogenetic tree, scoring branch lengths and evidence codes of GO annotations. Here PhyloFun computes the likelihood of a GO term being inherited on a given phylogenetic branch based on probability distributions that have been carefully calibrated for each GO term separately.

In order to generate accurate phylogenetic trees that contain a maximum of functional information at reasonable computational costs, we implemented a reusable workflow that, for a given input protein, searches candidate orthologs with known functions, adds paralogs so that duplications can be detected reliably and builds a maximum likelihood phylogenetic tree from a filtered multiple alignment. This tree is then used as input to the inference algorithm which outputs, for each protein in the tree, a probability for assigning each GO term occurring in the tree.

We call this new phylogenomic workflow for protein function prediction PhyloFun.

Install PhyloFun

PhyloFun requires R version 2.15.2 or greater.

PhyloFun requires three external tools:

Make sure all three programms are installed and in your search path. That means, that from an interactive R shell the following must work:
system( "mafft" )
system( "GBlocks" )
system( "FastTree" ) or, if you installed the preferred multi processor (OpenMP) version of FastTree:
system( "FastTreeMP" )

Install required R packages:

From within R execute the followin code to install the R packages PhyloFun requires.

source("http://bioconductor.org/biocLite.R")
biocLite( c(
  "Rcpp", "Biostrings", "RCurl", "RMySQL", "XML", "ape", "biomaRt", "brew",
  "gRain", "phangorn", "rredis", "stringr", "xtable"
) )

Install the PhyloFun R package itself:

  • Download source: git clone git://github.com/groupschoof/PhyloFun.git ./PhyloFun
  • Install: R CMD INSTALL PhyloFun

Run PhyloFun

Please note that PhyloFun requires a working internet connection to run properly!

The PhyloFun R package comes with a number of executable Rscripts, all stored in folder exec/.

After installation find the path to the installed PhyloFun package. Open an interactive R shell and type .path.package("PhyloFun").
The returned path can than be used to run the provided Rscript runPhyloFun.R as follows:

Rscript <path_to_your_PhyloFun_installation>/exec/runPhyloFun.R <arguments>

PhyloFun command line arguments

The arguments are printed whenever the Rscript is executed.

  • -q path to Query Proteins’ amino acid sequences in fasta format
  • -p or -b path to sequence similarity search results. Provide -b for BLAST tables (tabular output -m 8) or -p for PHMMER (HMMER-3) search results in --tblout format.
  • -e provide comma separated list of GO evidence codes you want to be accepted as “trusted”, e.g. -e ISO,RCA or provide -e ALL if any evidence code is good for you. See http://www.geneontology.org/GO.evidence.shtml for details on GO evidence codes.
  • -n use the n best hits for function prediction. If you want to speed up PhyloFun, reduce this number to e.g. -n 250
  • -h true to print out statistics of each Query Proteins’ found homologs – result file homologs_stats.txt
  • -m true to print out statistics of each generated Multiple Sequnce Alignment (MSA) – result file msa_stats.txt
  • -r true to generate an HTML report of PhyloFun’s results

A short test run can be done as follows

Rscript <path_to_your_PhyloFun_installation>/exec/runPhyloFun.R -q <path_to_your_PhyloFun_installation>/protein_1.fasta -b <path_to_your_PhyloFun_installation>/protein_1_blastout.tbl -f FastTree -h true -r true -m true

Note that depending on you installation of FastTree you’ll have to provide either -f FastTree or the faster multi processor version of if -f FastTreeMP.

Results of this test run will be written into the folder Protein_1/ in your current directory.

Generate PhyloFun’s input

The only thing you have to do, is to run either BLAST or PHMMER for your query proteins against the provided database of UniprotKB proteins with trusted Gene Ontology term annotations (trusted-UniKB).

  1. First copy and unpack trusted-UniKB from PhyloFun to your directory of choice. cp inst/ukb_proteins_with_trusted_go_annos.fasta.bz2 path/to/your/directory, and then unpack it bunzip2 ukb_proteins_with_trusted_go_annos.fasta.bz2
  2. Now run BLAST or PHMMER. If you choose BLAST you must provide its output as a table, use the command line switch -m 8. If you choose PHMMER a tabular result file will also be required and can be obtained by usage of the command line argument --tblout.
  3. Recommended E-Value cutoffs for BLAST are -e 1 and for PHMMER -E 0.000001

Coding style guide

PhyloFun’s code mostly follows the google style guide for R:
http://google-styleguide.googlecode.com/svn/trunk/google-r-style.html