AHRD)Automated Assignment of Human Readable Descriptions (
Short descriptions in sequence databases are useful to quickly gain insight into important information about a sequence, for example in search results. We developed a new program called “Automatic assignment of Human Readable Descriptions” (AHRD) with the aim to select descriptions and Gene Ontology terms that are concise, informative and precise. AHRD outperforms competing methods and can overcome problems caused by wrong annotations, lack of similar sequences and partial alignments.
Table of contents
- Getting started
- Used Formulae and Parameters
1 Getting started
AHRD is a Java-Program which requires
Java 1.7 or higher and
Copy (clone) AHRD to your computer using git via command-line, then change into AHRD’s directory, and finally use the latest stable version:
git clone https://github.com/groupschoof/AHRD.git
git checkout tags/v3.3.3
1.2.2 Build the executable jar
will create the executable JAR-File:
All AHRD-Inputs are passed to AHRD in a single YML-File. See
./ahrd_example_input.yml for details. (About YAML-Format see Wikipedia/YAML)
Basically AHRD needs a FASTA-File of amino acid sequences and different files containing the results from the respective BLAST searches, in our example we searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note, that AHRD is generic and can make use of any number of different Blast databases that do not necessarily have to be the above ones. If e.g. annotating genes from a fungal genome searching yeast databases might be more recommendable than using TAIR (Arabidopsis thaliana).
In order to parallelize the protein function annotation processes, AHRD can be run on batches of recommended size between 1,000 to 2,000 proteins. If you want to annotate very large protein sets or have low memory capacities use the included Batcher to split your input-data into Batches of appropriate size (see section 2.3). Note: As of Java 7 or higher AHRD is quite fast and batching might no longer be necessary.
AHRD example usages2.1
There are two template AHRD input files provided that you should use according to your use case. All example input files are stored in
./test/resources and are named
ahrd_example_input*.yml. You can run AHRD on any of these use cases with
java -Xmx2g -jar ./dist/ahrd.jar your_use_case_file.yml
|Use Case||Template File|
|Annotate your Query proteins with Human Readable Descriptions (HRD)||
|Annotate your Query proteins with HRD and Gene Ontology (GO) terms||
Example files for all input files can be found under
./test/resources/. NOTE: Only files containing
example in their filename should be used as template input files. Other YAML files are used for testing purposes.
2.2.1 Required input data
- Protein sequences in fasta format
- Sequence Similarity Search (
blat) results in tabular format
(If you run AHRD in batches the blast search results need to be batched in the same way as the fasta files.)
Recommended Sequence Similarity Search:
For your query proteins you should start independent BLAST searches e.g. in the three different databases mentioned above:
blastp -outfmt 6 -query query_sequences_AA.fasta -db uniprot_swissprot.fasta -out query_vs_swissprot.txt
2.2.2 Optional input data
If you want AHRD to predict your query protein’s functions with Gene Ontology (GO) terms, you need to provide the GO annotations of reference proteins. See section 3.3.2 for more details.
2.2.3 Required config files
- Input yml with all pathes and parameters according to your needs (see ahrd_example_input.yml and section Parameters)
- Blacklists and filters (they can either be used as provided or can be adapted to your needs and databases). Each of these files contains a list of valid Java regular expressions, one per line. For details on Java regular expressions please refer to http://docs.oracle.com/javase/tutorial/essential/regex/.
- Description blacklist (Argument
blacklist: ./test/resources/blacklist_descline.txt) – Any Blast-Hit’s description matching one of the regular expressions in this file will be ignored.
- Description filter for each single blast database (Argument
filter: ./test/resources/filter_descline_sprot.txt) – Any part of a Blast-Hit description that matches any one of the regular expressions in this file will be deleted from the description.
- Token blacklist (Argument
token_blacklist: ./test/resources/blacklist_token.txt) – Blast-Hit’s descriptions are composed of words. Any such word matching any one of the regular expressions in this file will be ignored by AHRD’s scoring, albeit it will not be deleted from the description and thus might still be seen in the final output.
- Description blacklist (Argument
22.214.171.124 Test custom blacklists and filters
As explained in 2.2.3 AHRD makes use of blacklists and filters provided as Java regular expressions. You can test your own custom blacklists and filters:
- Put the strings representing Blast-Hit descriptions or words contained in them in the file
./test/resources/regex_list.txt. Note, that each line is interpreted as a single entry.
- Put the Java regular expressions you want to test in file
./test/resources/match_list.txt, using one regular expression per line.
ant test.regexsand study the output.
Example Output for test string “activity”, and regular expressions “(?i)interacting”, and “(?i)activity” applied in serial:
[junit] activity [junit] (?i)interacting -> activity [junit] (?i)activity ->
The above example demonstrates how the first regular expression does not match anything in the test string “activity”, but after matching it against the second regular expression nothing remains, because the matched substring has been filtered out. As you can see, this test applies all provided regular expression in order of appearance and shows what remains of the provided test string after filtering with the provided regular expressions.
AHRD provides a function to generate several input.yml files from large datasets, consisting of batches of query proteins. For each of these batches the user is expected to provide the batch’s query proteins in FASTA format, and one Blast result file for each database searched. The AHRD batcher will then generate a unique input.yml file and entry in a batch shell script to execute AHRD on the respective batches in parallel for example on a compute cluster using a batch-system like LSF. We recommend this for very large datasets (more than a genome) or computers with low RAM.
To generate the mentioned input.yml files and batcher shell script that can subsequently be used to start AHRD in parallel use the batcher function as follows:
java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./batcher_input_example.yml
You will have to edit
./batcher_input_example.ymland provide the following arguments. Note, that in the mentioned directories each file will be interpreted as belonging to one unique Batch, if and only if they have identical file names.
shell_script:Path to the shell-script file which later contains all the statements to invoke AHRD on each supplied batch in parallel.
ahrd_call: "java -Xmx2048m -jar ./dist/ahrd.jar #batch#"This line will be the basis of executing AHRD on each supplied batch. The line must contain
#batch#wich will be replaced with the input.yml files. To use a batch system like LSF this line has to be modified to something like
bsub -q normal 'java -Xmx2048m -jar ./dist/ahrd.jar #batch#'
proteins_dir:The path to the directory the query protein batches are stored in.
batch_ymls_dir:The path to the directory the AHRD-Batcher should store the generated input.yml files in.
dir:Each database entry requires this argument, the path to the directory each batch’s blast result file from searches in the corresponding Blast-database is located.
output_dir:The directory each AHRD run should create a subdirectory with the output for the processed batch.
Batch-Name requirement: All above explained files belonging to the same Batch must have the same name. This name must start with alpha-numeric characters and may finish with digits indicating the Batch’s number. File extensions are allowed to be varying.
AHRD supports two different formats. The default one is a tab-delimited table.
The other is FASTA-Format.
2.4.1 Tab-Delimited Table
AHRD writes out a CSV table with the following columns:
- Protein-Accesion — The Query Protein’s Accession
- Blast-Hit-Accession — The Accession of the Protein the assigned description was taken from.
- AHRD-Quality-Code — explained below
- Human-Readable-Description — The assigned HRD
- Gene-Ontology-ID — If AHRD was requested to generate Gene-Ontology-Annotations, they are appended here.
AHRD’s quality-code consists of a three character string, where each character is either ‘*’ if the respective criteria is met or ‘-’ otherwise. Their meaning is explained in the following table:
|1||Bit score of the blast result is >50 and e-value is <e-10|
|2||Overlap of the blast result is >60%|
|3||Top token score of assigned HRD is >0.5|
To set AHRD to write its output in FASTA-Format set the following switch in the input.yml:
AHRD will write a valid FASTA-File of your query-proteins where the Header will be composed of the same parts as above, but here separated by whitespaces.
AHRD run using BLASTX results2.5
In order to run AHRD on BLASTX results instead of BLASTP results you have to modify the following parameters in the ahrd_example_input.yml:
token_score_bit_score_weight: 0.5 token_score_database_score_weight: 0.3 token_score_overlap_score_weight: 0.2
Since the algorithm is based on protein sequences and the BLASTX searches are based on nucleotide sequence there will be a problem calculating the overlap score of the blast result. To overcome this problem the token_score_overlap_score_weight has to be set to 0.0. Therefore the other two scores have to be raised. These three parameters have to sum up to 1. The resulting parameter configuration could look like this:
2.6 Parameter Optimization
If you have well annotated reference proteins (at least 1000) from an organism closely related to the proteins you want to annotate, you can optimize AHRD’s parameters to reproduce protein function predictions more alike to your references. In order to do so you need to start a simulated annealing approach:
java -cp dist/ahrd.jar ahrd.controller.Trainer trainer_example_input.yml
./test/resources/trainer_example_input.yml for details.
Basically the input file is almost identical to the standard AHRD run. Only the following additional parameters are available; all numeric values given show the default used for the respective parameter:
references_fasta: path_to_your_references.fastaRequired parameter pointing to the fasta formatted file containing the reference proteins including the reference Human Readable Descriptions. (See
path_log: your_log_file.tsvStores the parameters evaluated in any optimization iteration, their F1-Score, and the difference in score to the currently accepted parameter set.
temperature: 75000Temperature used in simulated annealing
cool_down_by: 1Dimish temperature by this value after each simulated annealing iteration
f_measure_beta_parameter: 1.0The scaling factor used to compute the F-Score of an assigned HRD when comparing it with the reference. See Wikipedia/F1_score for details.
optimization_acceptance_probability_scaling_factor: 2500000000.0Used to scale the probability of accepting a currently evaluated parameter set.
P(Accept) := exp(-delta_scores*scaling_factor/current-temperature)
mutator_mean: 0.25Mutate a randomly selected parameter by value gaussian normal distributed with this mean
mutator_deviation: 0.15Mutate a randomly selected parameter by value gaussian normal distributed with this standard deviation
remember_simulated_annealing_path: falseSet to
true, if you want the optimization to remember already visited parameter sets and their performance. This increases memory usage but improves final results.
2.6.1 Optimization in parallel (Trainer-Batcher)
Simulated Annealing is a time and resource consuming process. Because AHRD uses quite a number of parameters, the search space is accordingly large. Hence it is recommendable to start independend simulated annealing runs in parallel. Do this using
java -cp ./dist/ahrd.jar ahrd.controller.TrainerBatcher ./test/resources/trainer_batcher_example.yml
The following parameters are specific to the Trainer-Batcher:
shell_script: ./start_ahrd_batched.shThe path to the shell script you will use to start the simulated annealing jobs in parallel.
ahrd_call: "java -Xmx2048m -cp ./dist/ahrd.jar ahrd.controller.Trainer #batch#"Defines a line in the resulting shell script, requires the presence of
#batch#which will be replaced with the respective generated Trainer inputs. This line can be used to submit the parallel optimizations into a job queue.
no_start_positions_in_parameter_space: 1024Defines how many parallel optimization runs you want to start.
batch_ymls_dir: ./trainer_batch_ymlsPath to the directory in which to store the generated Trainer input files.
output_dir: ./trainer_resultsPath to the directory in which to store each Trainer run’s output(s).
Note, that the Trainer-Batcher works very much like the AHRD-Batcher (section 2.3). Particularly you need to respect the batch-name requirements explained there.
The above Trainer-Batcher example input shows how to automatically generated a desired number of input files with different starting points in parameter space. These input files can then directly be used with the above documented AHRD Trainer (section 2.6).
AHRD-Evaluator)2.7 Computing F-Scores for selected parameter sets (
Having different parameter sets AHRD enables you to compute their performance in terms of F-Scores for each reference protein. Optionally you can also revise the theorectically maximum attainable F-Score and see how well the best Hits from each sequence similarity search perform. In order to do so, use the Evaluator function:
java -cp ./dist/ahrd.jar ahrd.controller.Evaluator ./evaluator_example.yml
./test/resources/evaluator_example.yml for more details.
Parameters specific to the Evaluator function are:
find_highest_possible_evaluation_score: falseSet to
trueif you want to see how the best fitting candidate description among the sequence similarity search hits performs. This gives you the theoretical best possible performance.
write_scores_to_output: falseSet to
trueif you want to see all internal intermediate scores: Token-Scores, Lexical-Scores, and Description-Scores. Use with extreme caution, because this option is meant for developers only.
write_best_blast_hits_to_output: falseSet to
trueif you want to see the respective best sequence similarity search Hits and their performances.
f_measure_beta_parameter: 1.0This parameter is also used by AHRD-Trainer. See section 2.6 for details.
Optionally you can apply AHRD’s balcklisting and filtering on the Reference Descriptions, too. This results in performance only being assessed on non-blacklistes and filtered descriptions, of which only meaningful words, i.e. non blacklisted tokens will be scored. In order to do so, you need to set all of the three following parameters. Note, that you can point any of them to empty dummy files, if you wish.
references_description_blacklist: ./test/resources/blacklist_descline.txtRegular expressions that identify reference descriptions not to score.
references_description_filter: ./test/resources/filter_descline_sprot.txtRegular expressions that filter out parts from the reference descriptions to be ignored for performance scoring.
references_token_blacklist: ./test/resources/blacklist_token.txtRegular expressions that identify non-meaningful words (tokens) not to be considered when computing the performance scores.
Based on e-values the 200 best scoring blast results are chosen from each database-search (e.g. Swissprot, TAIR, trEMBL). For all resulting candidate description lines a score is calculated using a lexical approach. First each description line is passed through two regular expression filters. The first filter discards any matching description line in order to ignore descriptions like e.g. ‘Whole genome shotgun sequence’, while the second filter tailors the description lines deleting matching parts, in order to discard e.g. the trailing Species-Descriptions ’OS=Arabidopsis thaliana […]". In the second step of the scoring each description line is split into single tokens, which are passed through a blacklist filter, ignoring all matching tokens in terms of score. Tokens are sequences of characters with a collective meaning. For each token a score is calculated from three single scores with different weights, the bit score, the database score and the overlap score. The bit score is provided within the blast result. The database score is a fixed score for each blast database, based on the description quality of the database. The overlap score reflects the overlap of the query and subject sequence. In the second step the sum of all token scores from a description line is divided by a correction factor that avoids the scoring system from being biased towards longer or shorter description lines. From this ranking now the best scoring description line can be chosen. In the last step a domain name provided by InterProScan results, if available, is extracted and appended to the best scoring description line for each uncharacterized protein. In the end for each uncharacterized protein a description line is selected that comes from a high-scoring BLAST match, that contains words occurring frequently in the descriptions of highest scoring BLAST matches and that does not contain meaningless “fill words”. Each HRD line will contain an evaluation section that reflects the significance of the assigned human readable description.
- Choose best scoring blast results, 200 from each database searched
- Filter description lines of above blast-results using regular expressions:
- Reject those matched by any regex given in e.g. ./test/resources/blacklist_descline.txt,
- Delete those parts of each description line, matching any regex in e.g. ./test/resources/filter_descline_sprot.txt.
- Divide each description line into tokens (characters of collective meaning)
- In terms of score ignore any tokens matching regexs given e.g. in ./test/resources/blacklist_token.txt.
- Token score (calculated from: bitscore, database weight, overlap score)
- Lexical score (calculated from: Token score, High score factor, Pattern factor, Correction factor)
- Description score (calculated from: Lexical score and Blast score)
- Choose best scoring description line
3.2 Used Formulae and Parameters
Above formulae use the following parameters as given in ./ahrd_example_input.yml. These parameters can either be used as provided or can be adapted to your needs.
The weights in the above formulae are
token_score_bit_score_weight: 0.468 token_score_database_score_weight: 0.2098 token_score_overlap_score_weight: 0.3221
and Blast-Database specific, for example for UniprotKB/Swissprot:
weight: 653 description_score_bit_score_weight: 2.717061
BLAST, BLAST+, and BLAT)3.3.1 Parameters controlling the parsing of tabular sequence similarity search result tables (legacy
AHRD has been designed to work with tabular sequence similarity search results in tabular format. By default it parses the tabular “Blast 8” output:
|Sequence Similarity Search Tool|| recommended output format (
The following paramters can be set optionally, if your tabular sequence similarity search result deviates from the above “Blast 8” format. See example file
./test/resources/ahrd_input_seq_sim_table.yml for more details.
|Optional Parameter||example||meaning of parameter|
||single character that starts a to be ignored comment line|
||single character that separates columns|
||number of column holding the query protein’s accession|
||number of column holding the Hit protein’s accession|
||number of column holding the query’s start amino acid position in the local alignment|
||number of column holding the query’s end amino acid position in the local alignment|
||number of column holding the Hit’s start amino acid position in the local alignment|
||number of column holding the Hit’s end amino acid position in the local alignment|
||number of column holding the Hit’s E-Value|
||number of column holding the Hit’s Bit-Score|
NOTE: All above column numbers start counting with zero, i.e. the first column has number 0.
3.3.2 Parameters controlling Gene Ontology term annotations
AHRD is capable of annotating the Query proteins with Gene Ontology (GO) terms. It does so, by transferring the reference GO terms found in the Blast Hit AHRD selects as source of the resulting HRD. To be able to pass these reference GO terms AHRD needs a reference GO annotation file (GOA). By default AHRD expects this GOA file to be in the standard Uniprot format. You can download the latest GOA file from the Uniprot server. To obtain GO annotations for all UniprotKB proteins download file
goa_uniprot_all.gaf.gz (last visit Feb 16th 2017)
To have AHRD annotate your proteins with GO terms, you just need to provide the optional parameter
gene_ontology_result. See example file
./test/resources/ahrd_input_seq_sim_table_go_prediction.yml for more details.
126.96.36.199 Prefer reference proteins as candidates that have GO Term annotations
prefer_reference_with_go_annos: true is highly recommended when aiming to annotate your query proteins with GO Terms. If this parameter is set to true only those candidate references are considered that also have GO Term annotations. However, if you put more emphasis on Human Readable Descriptions and are prepared to accept a couple of your queries to not get any GO Term predictions you can switch this off with
prefer_reference_with_go_annos: false or just omit the parameter as by default it is set to
GOA)188.8.131.52 Custom reference Gene Ontology annotations (non UniprotKB
Unfortunately UniprotKB GOA files use short protein accessions like
W9QFR0, while the UniprotKB FASTA databases use the long protein accessions with pipes like this
tr|W9QFR0|W9QFR0_9ROSA. In order to enable AHRD to match the correct short and long accessions, and thus the reference GO annotations it uses regular expressions to parse both the long accessions and the GOA files. By default AHRD is setup to handle Uniprot formats, but you can provide custom regular expressions for your own custom GOA files:
- Set the regular expression to extract short protein accessions mapped to GO terms from the reference GOA using
- Set the Blast database specific regular expression, used to extract the short accessions from long ones with
short_accession_regex:For example for TAIR10, use
Note: You must provide the above named match groups
If you want to run the complete JUnit Test-Suite execute:
If you want to execute a test run on example proteins use:
See attached file LICENSE.txt for details.
Dr. Asis Hallab, Kathrin Klee, Florian Boecker, Dr. Girish Srinivas, and Prof. Dr. Heiko Schoof
INRES Crop Bioinformatics
University of Bonn
High quality genome projects that used AHRD:
1 Young, Nevin D., Frédéric Debellé, Giles E. D. Oldroyd, Rene Geurts, Steven B. Cannon, Michael K. Udvardi, Vagner A. Benedito, et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480, no. 7378 (December 22, 2011): 520–24. doi:10.1038/nature10625.
2 The Tomato Genome Consortium. “The Tomato Genome Sequence Provides Insights into Fleshy Fruit Evolution.” Nature 485, no. 7400 (May 31, 2012): 635–41. doi:10.1038/nature11119.
3 International Wheat Genome Sequencing Consortium (IWGSC). “A Chromosome-Based Draft Sequence of the Hexaploid Bread Wheat (Triticum Aestivum) Genome.” Science (New York, N.Y.) 345, no. 6194 (July 18, 2014): 1251788. doi:10.1126/science.1251788.
4 International Barley Genome Sequencing Consortium, Klaus F. X. Mayer, Robbie Waugh, John W. S. Brown, Alan Schulman, Peter Langridge, Matthias Platzer, et al. “A Physical, Genetic and Functional Sequence Assembly of the Barley Genome.” Nature 491, no. 7426 (November 29, 2012): 711–16. doi:10.1038/nature11543.
5 Wang, W., G. Haberer, H. Gundlach, C. Gläßer, T. Nussbaumer, M. C. Luo, A. Lomsadze, et al. “The Spirodela Polyrhiza Genome Reveals Insights into Its Neotenous Reduction Fast Growth and Aquatic Lifestyle.” Nature Communications 5 (2014): 3311. doi:10.1038/ncomms4311.
6 Guo, Shaogui, Jianguo Zhang, Honghe Sun, Jerome Salse, William J. Lucas, Haiying Zhang, Yi Zheng, et al. “The Draft Genome of Watermelon (Citrullus Lanatus) and Resequencing of 20 Diverse Accessions.” Nature Genetics 45, no. 1 (January 2013): 51–58. doi:10.1038/ng.2470.
AHRD was applied on all plant genomes present in the public database PlantsDB:
7 Spannagl, Manuel, Thomas Nussbaumer, Kai C. Bader, Mihaela M. Martis, Michael Seidel, Karl G. Kugler, Heidrun Gundlach, and Klaus F. X. Mayer. “PGSB PlantsDB: Updates to the Database Framework for Comparative Plant Genome Research.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D1141–47. doi:10.1093/nar/gkv1130.