
Automated Assignment of Human Readable Descriptions (AHRD)

Protein function has often been transferred from characterized proteins to
novel proteins based on sequence similarity, e.g. using the best BLAST hit. To
assign human readable descriptions to predicted proteins we developed a new
program called Automated Assignment of Human Readable Descriptions (AHRD). We
aim to select descriptions that are concise and informative, precise with regard
to function, and that use standard nomenclature. AHRD scores BLAST hits taken from
searches against different databases on the basis of the trust put into these
databases and the local alignment quality. The BLAST hit descriptions are
tokenized into informative words, and a lexical analysis scores these tokens
according to their frequency and the quality of the BLAST hits they occur in.
Tokens shared with Gene Ontology annotations increase the description score,
so that standard nomenclature is used where possible. Finally, the best-scoring
description is assigned.

1 Getting started

1.1 Requirements

AHRD is a Java program that requires Java 1.7 or higher and ant.

1.2 Installation

1.2.1 Get AHRD

Copy (clone) AHRD to your computer using git via command-line:

git clone git://github.com/groupschoof/AHRD.git
git checkout tags/2.0.2-stable

1.2.2 Build the executable jar

Running

ant dist
will create the executable JAR-File: ./dist/ahrd.jar

An executable jar (Java 1.6 compiled) of the paper release version can also be downloaded here:
https://www3.uni-bonn.de/cropbio/tools-and-databases

2 Usage

All AHRD inputs are passed to AHRD in a single YAML file.
See ./ahrd_example_input.yml for details.
(For an introduction to the YAML format, see Wikipedia.)

Basically AHRD needs a FASTA file of amino acid sequences and different files
containing the results of the respective BLAST searches. In our example we
searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note
that AHRD is generic and can make use of any number of different BLAST
databases, which do not necessarily have to be the above ones. If, for example,
you are annotating genes from a fungal genome, searching yeast databases might
be more appropriate than using TAIR (Arabidopsis thaliana).

All parameters can be set manually, or the default ones can be used as given in
the example input file ./test/resources/ahrd_input_test_run.yml (see section
Parameters).
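
To give an idea of the overall shape of such a file, here is a minimal sketch for
a single BLAST database. The parameter names that also appear elsewhere in this
document (the token_score_* weights, weight, description_score_bit_score_weight,
blacklist, filter, token_blacklist) are real; the remaining keys and all paths are
assumptions for illustration and should be checked against ./ahrd_example_input.yml:

proteins_fasta: ./proteins.fasta          # assumed key: your query proteins (FASTA)
output: ./ahrd_output.csv                 # assumed key: where to write the result table
token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2
blast_dbs:                                # assumed key: one entry per searched database
  swissprot:
    weight: 100
    description_score_bit_score_weight: 0.2
    file: ./swissprot_blastout.pairwise   # assumed key: BLAST result for this database
    database: ./swissprot.fasta           # assumed key: the searched database FASTA
    blacklist: ./test/resources/blacklist_descline.txt
    filter: ./test/resources/filter_descline_sprot.txt
    token_blacklist: ./test/resources/blacklist_token.txt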

In order to parallelize the protein function annotation process,
AHRD can be run on batches of a recommended size of 1,000 to 2,000 proteins.
If you want to annotate large protein sets, use the included Batcher to split your
input data into batches of appropriate size (see section Batcher).
With Java 7 or higher AHRD is quite fast, so batching might no longer be necessary.

2.1 AHRD Example run

java -Xmx2g -jar ./dist/ahrd.jar ahrd_example_input.yml 

or just execute

ant test.run 

2.2 Input

Example files for all input files can be found under ./test/resources/

2.2.1 Required input data

  1. Protein sequences in fasta format
  2. BLAST search results in pairwise format for the protein sequences

(If you run AHRD in batches the blast search results need to be batched in the same way as the fasta files.)

Recommended BLAST-Search:

For your query proteins you should start independent BLAST searches e.g.
in the three different databases mentioned above:

blastall -p blastp -i proteins.fasta -o swissprot_blastout.pairwise -d swissprot.fasta -e 0.0001 -v 200 -b 200 -m 0

Note: We are currently working on extending AHRD to parse BLAST+ output. As of now, legacy BLAST is still a requirement.

2.2.2 Optional input data

AHRD optionally considers other protein function annotations. Formerly this was requested in genome projects [1]. We do not recommend their use any more, because AHRD performs excellently without this additional protein function information.

  1. Domain search results from InterProScan in raw format
  2. Gene ontology annotations in csv format

In case InterProScan results are given as input, AHRD needs the InterPro database XML and DTD files.

2.2.3 Required config files

  1. Input yml with all paths and parameters set according to your needs (see ahrd_example_input.yml and section Parameters)
  2. Blacklists and filters (they can either be used as provided or adapted to your needs and databases). Each of these files contains a list of valid Java regular expressions, one per line; a few illustrative example entries are shown after this list. For details on Java regular expressions please refer to http://docs.oracle.com/javase/tutorial/essential/regex/.
    1. Description blacklist (Argument blacklist: ./test/resources/blacklist_descline.txt) – Any Blast-Hit description matching one of the regular expressions in this file will be ignored.
    2. Description filter for each single blast database (Argument filter: ./test/resources/filter_descline_sprot.txt) – Any part of a Blast-Hit description that matches any one of the regular expressions in this file will be deleted from the description.
    3. Token blacklist (Argument token_blacklist: ./test/resources/blacklist_token.txt) – Blast-Hit descriptions are composed of words. Any such word matching any one of the regular expressions in this file will be ignored by AHRD’s scoring, although it will not be deleted from the description and thus might still appear in the final output.
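
For instance, a description blacklist entry that rejects shotgun-sequence descriptions, a description filter entry that strips trailing species information (the "OS=…" suffix mentioned in section 3), and a token blacklist entry for the word "protein" might look like the following, one per respective file (illustrative examples only, not verbatim copies of the shipped files):

(?i)whole genome shotgun sequence
\sOS=.*$
(?i)^protein$
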
2.2.3.1 Test custom blacklists and filters

As explained in 2.2.3 AHRD makes use of blacklists and filters provided as Java regular expressions. You can test your own custom blacklists and filters:

  1. Put the strings representing Blast-Hit descriptions or words contained in them in the file ./test/resources/regex_list.txt. Note that each line is interpreted as a single entry.
  2. Put the Java regular expressions you want to test in file ./test/resources/match_list.txt, using one regular expression per line.
  3. Execute ant test.regexs and study the output.

Example output for the test string “activity” and the regular expressions “(?i)interacting” and “(?i)activity”, applied in series:

[junit] activity 
[junit] (?i)interacting -> activity
[junit] (?i)activity -> 

The above example demonstrates that the first regular expression does not match anything in the test string “activity”, but after matching against the second regular expression nothing remains, because the matched substring has been filtered out. As you can see, this test applies all provided regular expressions in order of appearance and shows what remains of the test string after filtering.

2.3 Batcher

AHRD provides a function to generate several input.yml files from large datasets by splitting them into batches of query proteins. For each batch the user is expected to provide the batch’s query proteins in FASTA format and one BLAST result file for each database searched. The AHRD Batcher will then generate a unique input.yml file and an entry in a batch shell script, so that AHRD can be executed on the respective batches in parallel, for example on a compute cluster using a batch system like LSF. We recommend this for genome-scale datasets.

To generate the mentioned input.yml files and batcher shell script that can subsequently be used to start AHRD in parallel use the batcher function as follows:

java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./batcher_input_example.yml

You will have to edit ./batcher_input_example.yml and provide the following arguments (a sketch of such a file is shown after this list). Note that within the mentioned directories, files are interpreted as belonging to the same unique batch if and only if they have identical file names.

  1. shell_script: Path to the shell-script file which later contains all the statements to invoke AHRD on each supplied batch in parallel.
  2. ahrd_call: "java -Xmx2048m -jar ./dist/ahrd.jar #batch#" This line will be the basis for executing AHRD on each supplied batch. The line must contain #batch#, which will be replaced with the generated input.yml files. To use a batch system like LSF this line has to be modified to something like bsub -q normal 'java -Xmx2048m -jar ./dist/ahrd.jar #batch#'
  3. proteins_dir: The path to the directory the query protein batches are stored in.
  4. batch_ymls_dir: The path to the directory the AHRD-Batcher should store the generated input.yml files in.
  5. dir: Each database entry requires this argument, the path to the directory in which each batch’s BLAST result file from searches against the corresponding BLAST database is located.
  6. interpro_results_dir: The directory in which each Batch’s interproscan result file is located.
  7. gene_ontology_results_dir: The directory in which each Batch’s Gene Ontology annotation file is stored.
  8. output_dir: The directory in which each AHRD run will create a subdirectory holding the output for the processed batch.

Batch-Name requirement: All of the above files belonging to the same batch must have the same name. This name must start with alphanumeric characters and may end with digits indicating the batch’s number. File extensions may vary.
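
Put together, a batcher input file might look like the sketch below. The top-level keys are the ones listed above; the nesting of the per-database dir entries under a blast_dbs section and all paths are assumptions, so compare with ./batcher_input_example.yml before use:

shell_script: ./start_ahrd_batched.sh
ahrd_call: "java -Xmx2048m -jar ./dist/ahrd.jar #batch#"
proteins_dir: ./batches/proteins
batch_ymls_dir: ./batches/ymls
output_dir: ./batches/output
blast_dbs:                                   # nesting assumed, see note above
  swissprot:
    dir: ./batches/blast_results/swissprot
  tair:
    dir: ./batches/blast_results/tair
interpro_results_dir: ./batches/interpro     # optional
gene_ontology_results_dir: ./batches/go      # optional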

2.4 Output

AHRD supports two different formats. The default one is a tab-delimited table.
The other is FASTA-Format.

2.4.1 Tab-Delimited Table

AHRD writes out a tab-delimited table with the following columns:

  1. Protein-Accession — The Query Protein’s Accession
  2. Blast-Hit-Accession — The Accession of the Protein the assigned description was taken from.
  3. AHRD-Quality-Code — explained below
  4. Human-Readable-Description — The assigned HRD
  5. Interpro-ID (Description) — If AHRD was started with InterProScan-Results, they are appended here.
  6. Gene-Ontology-ID (Name) — If AHRD was started with Gene-Ontology-Annotations, they are appended here.

AHRD’s quality-code consists of a four-character string, where each character
is either ‘*’ if the respective criterion is met or ‘-’
otherwise. Their meaning is explained in the following table:

Position  Criterion
1         Bit score of the blast result is >50 and e-value is <e-10
2         Overlap of the blast result is >60%
3         Top token score of assigned HRD is >0.5
4         Gene ontology terms found in description line
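
A single output line might then look like this (purely hypothetical accessions and description, tab-separated columns shown here with spaces for readability):

Protein1   sp|Q9XYZ1|EXMPL_ARATH   ***-   Putative disease resistance protein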

2.4.2 Fasta-Format

To set AHRD to write its output in FASTA-Format set the following switch in the input.yml:

output_fasta: true

AHRD will write a valid FASTA file of your query proteins where the header is
composed of the same parts as above, but separated by whitespace.
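
With this switch set, a header line might look like this (same hypothetical values as in the example above):

>Protein1 sp|Q9XYZ1|EXMPL_ARATH ***- Putative disease resistance protein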

2.5 AHRD run using BLASTX results

In order to run AHRD on BLASTX results instead of BLASTP results you have to
modify the following parameters in the ahrd_example_input.yml:

token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2

Since the algorithm is based on protein sequences while BLASTX searches are
based on nucleotide sequences, the overlap score of the BLAST result cannot be
calculated properly. To overcome this problem the
token_score_overlap_score_weight has to be set to 0.0, and the other two
weights have to be raised accordingly, because these three parameters have to
sum up to 1. The resulting parameter configuration could look like this:

token_score_bit_score_weight: 0.6
token_score_database_score_weight: 0.4
token_score_overlap_score_weight: 0.0

3 Algorithm

Based on e-values, the 200 best-scoring BLAST results are chosen from each
database search (e.g. Swissprot, TAIR, trEMBL). For all 600 resulting candidate
description lines a score is calculated using a lexical approach. First, each
description line is passed through two regular expression filters. The first
filter discards any matching description line in order to ignore descriptions
like e.g. ‘Whole genome shotgun sequence’, while the second filter tailors the
description lines by deleting matching parts, in order to discard e.g. trailing
species descriptions such as ‘OS=Arabidopsis thaliana […]’.

In the second step of the scoring, each description line is split into single
tokens, which are passed through a blacklist filter, ignoring all matching
tokens in terms of score. Tokens are sequences of characters with a collective
meaning. For each token a score is calculated from three single scores with
different weights: the bit score, the database score and the overlap score. The
bit score is provided within the BLAST result. The database score is a fixed
score for each BLAST database, based on the description quality of the
database. The overlap score reflects the overlap of the query and subject
sequence. The sum of all token scores of a description line is then divided by
a correction factor that prevents the scoring from being biased towards longer
or shorter description lines.

In the third step the predicted gene ontology terms, if available, are used to
evaluate the description lines and to obtain a better ranking; only gene
ontology terms with a probability greater than 0.4 are used. From this ranking
the best-scoring description line is chosen. In the last step a domain name
provided by InterProScan results, if available, is extracted and appended to
the best-scoring description line for each uncharacterized protein.
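
Schematically, the token score described above is the weighted sum of its three components, using the weights given in the Parameters section (this is a simplified sketch; normalisation details are omitted and the exact formulae are those referenced in section 3.2):

token score = token_score_bit_score_weight * bit score
            + token_score_database_score_weight * database score
            + token_score_overlap_score_weight * overlap score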

In the end, for each uncharacterized protein a description line is selected that
comes from a high-scoring BLAST match, that contains words occurring frequently
in the descriptions of the highest-scoring BLAST matches, and that does not
contain meaningless “fill words”. If available, an assigned InterPro domain is
appended to the description line, and each line contains an evaluation section
that reflects the significance of the assigned human readable description.

3.1 Pseudo-Code

  1. Choose 600 best scoring blast results
  2. Filter description lines of above blast-results using regular expressions:
    1. Reject those matched by any regex given in e.g. ./test/resources/blacklist_descline.txt,
    2. Delete those parts of each description line, matching any regex in e.g. ./test/resources/filter_descline_sprot.txt.
  3. Divide each description line into tokens (sequences of characters with a collective meaning)
    1. In terms of score, ignore any tokens matching regexes given e.g. in ./test/resources/blacklist_token.txt.
  4. Token score (calculated from: bitscore, database weight, overlap score)
  5. Lexical score (calculated from: Token score, High score factor, Pattern factor, Correction factor)
  6. Description score (calculated from: Lexical score, GO score, Blast score)
  7. Choose best scoring description line
  8. Append InterProScan description to chosen description line if available

3.2 Used Formulae and Parameters

3.3 Parameters

Above formulae use the following parameters as given in ./ahrd_example_input.yml.
These parameters can either be used as provided or can be adapted to your needs.

3.3.1 The weights in formula Token-Score are

token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2 

and Blast-Database specific:

weight: 100 

3.3.2 The weight in formula Description-Score is also Blast-database specific

description_score_bit_score_weight: 0.2 

3.3.3 Parameters controlling the parsing of tabular sequence similarity search result tables (legacy BLAST, BLAST+, and BLAT)

AHRD has been designed to work with sequence similarity search results in tabular format. By default it parses the tabular “Blast 8” output:

Sequence similarity search tool   Recommended output format (command line switch)
Blast+                            -outfmt 6
legacy Blast                      -m 8
Blat                              -out=blast8

The following parameters can be set optionally, if your tabular sequence similarity search result deviates from the above “Blast 8” format. See the example file ./test/resources/ahrd_input_seq_sim_table.yml for more details.

Optional parameter (example value): meaning
seq_sim_search_table_comment_line_regex ("#"): single character that starts a comment line to be ignored
seq_sim_search_table_sep ("\t"): single character that separates columns
seq_sim_search_table_query_col (10): number of the column holding the query protein’s accession
seq_sim_search_table_subject_col (11): number of the column holding the Hit protein’s accession
seq_sim_search_table_query_start_col (16): number of the column holding the query’s start amino acid position in the local alignment
seq_sim_search_table_query_end_col (17): number of the column holding the query’s end amino acid position in the local alignment
seq_sim_search_table_subject_start_col (18): number of the column holding the Hit’s start amino acid position in the local alignment
seq_sim_search_table_subject_end_col (19): number of the column holding the Hit’s end amino acid position in the local alignment
seq_sim_search_table_e_value_col (20): number of the column holding the Hit’s E-Value
seq_sim_search_table_bit_score_col (21): number of the column holding the Hit’s Bit-Score

NOTE: All above column numbers start counting with zero, i.e. the first column has number 0.
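
Taken together, overriding the defaults for a deviating table layout could look like this in the input YAML (using the illustrative values from the table above):

seq_sim_search_table_comment_line_regex: "#"
seq_sim_search_table_sep: "\t"
seq_sim_search_table_query_col: 10
seq_sim_search_table_subject_col: 11
seq_sim_search_table_query_start_col: 16
seq_sim_search_table_query_end_col: 17
seq_sim_search_table_subject_start_col: 18
seq_sim_search_table_subject_end_col: 19
seq_sim_search_table_e_value_col: 20
seq_sim_search_table_bit_score_col: 21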

3.3.4 Parameters controlling Gene Ontology term annotations

AHRD is capable of annotating the query proteins with Gene Ontology (GO) terms. It does so by transferring the reference GO terms found in the BLAST Hit that AHRD selects as the source of the resulting HRD. To be able to transfer these reference GO terms AHRD needs a reference GO annotation file (GOA). By default AHRD expects this GOA file to be in the standard Uniprot format. You can download the latest GOA file from the Uniprot server. To obtain GO annotations for all UniprotKB proteins download the file gene_association.goa_uniprot.gz (last visited March 24th 2015).

To have AHRD annotate your proteins with GO terms, you just need to provide the optional parameter gene_ontology_result. See example file ./test/resources/ahrd_input_seq_sim_table_go_prediction.yml for more details.
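
For example (the path to the reference GOA file is a placeholder):

gene_ontology_result: ./data/gene_association.goa_uniprot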

3.3.4.1 Custom reference Gene Ontology annotations (non UniprotKB GOA)

Unfortunately UniprotKB GOA files use short protein accessions like W9QFR0, while the UniprotKB FASTA databases use long protein accessions with pipes, like tr|W9QFR0|W9QFR0_9ROSA. In order to enable AHRD to match the correct short and long accessions, and thus the reference GO annotations, it uses regular expressions to parse both the long accessions and the GOA files. By default AHRD is set up to handle Uniprot formats, but you can provide custom regular expressions for your own custom GOA files:

  1. Set the regular expression used to extract short protein accessions mapped to GO terms from the reference GOA with reference_go_regex, e.g. reference_go_regex: ^UniProtKB\\s+(?<shortAccession>\\S+)\\s+\\S+\\s+(?<goTerm>GO:\\d{7})
  2. Set the Blast database specific regular expression used to extract the short accessions from the long ones with short_accession_regex. For example for TAIR10, use short_accession_regex: "^(?<shortAccession>.+)$".

Note: You must provide the above named match groups shortAccession and goTerm, respectively.
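
In the input YAML this could look as follows; the exact placement of the database-specific short_accession_regex inside its BLAST database entry is an assumption here, so compare with the example file mentioned above:

reference_go_regex: ^UniProtKB\\s+(?<shortAccession>\\S+)\\s+\\S+\\s+(?<goTerm>GO:\\d{7})
blast_dbs:
  tair:
    short_accession_regex: "^(?<shortAccession>.+)$"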

4 Testing

If you want to run the complete JUnit Test-Suite execute:

ant

5 License

See attached file LICENSE.txt for details.

6 Authors

Asis Hallab, Kathrin Klee, Sri Girish, and Prof. Dr. Heiko Schoof

INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany

References

[1] Young, Nevin D., Frédéric Debellé, Giles E. D. Oldroyd, Rene Geurts, Steven B. Cannon, Michael K. Udvardi, Vagner A. Benedito, et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480, no. 7378 (December 22, 2011): 520–24. doi:10.1038/nature10625.