
Automated Assignment of Human Readable Descriptions (AHRD)

Short descriptions in sequence databases are useful to quickly gain insight into important information about a sequence, for example in search results. We developed a new program called “Automatic assignment of Human Readable Descriptions” (AHRD) with the aim to select descriptions and Gene Ontology terms that are concise, informative and precise. AHRD outperforms competing methods and can overcome problems caused by wrong annotations, lack of similar sequences and partial alignments.

Table of contents

  1. Getting started
    1. Requirements
    2. Installation
      1. Get AHRD
      2. Build the executable jar
  2. Usage
    1. AHRD example usages
    2. Input
      1. Required input data
      2. Optional input data
      3. Required config files
        1. Test custom blacklists and filters
    3. Batcher
    4. Output
      1. Tab-Delimited Table
      2. Fasta-Format
    5. AHRD run using BLASTX results
    6. Parameter Optimization
      1. Optimization in parallel
    7. Computing F-Scores for selected parameter sets
  3. Algorithm
    1. Pseudo-Code
    2. Used Formulae and Parameters
    3. Parameters
      1. Parameters controlling the parsing of tabular sequence similarity search result tables
      2. Parameters controlling Gene Ontology term annotations
        1. Prefer reference proteins as candidates that have GO Term annotations
        2. Custom reference Gene Ontology annotations
          1. Custom Gene Ontology Database
  4. Testing
  5. License
  6. Authors
  7. References

1 Getting started

1.1 Requirements

AHRD is a Java program that requires Java 1.7 or higher, as well as Apache Ant (ant) for building the executable jar.

1.2 Installation

1.2.1 Get AHRD

Copy (clone) AHRD to your computer using git on the command line, then change into AHRD’s directory, and finally check out the latest stable version:

git clone https://github.com/groupschoof/AHRD.git
cd AHRD
git checkout tags/v3.3.3

Alternatively, without using git, you can download AHRD version v3.3.3 (zip or tar.gz) and extract it.
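For example, assuming GitHub’s standard archive URL scheme for release tags (the exact URL and the name of the extracted directory are assumptions), the download could look like this:

wget https://github.com/groupschoof/AHRD/archive/v3.3.3.tar.gz
tar -xzf v3.3.3.tar.gz
cd AHRD-3.3.3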

1.2.2 Build the executable jar

Running

ant dist

will create the executable JAR-File: ./dist/ahrd.jar

2 Usage

All AHRD inputs are passed to AHRD in a single YML-File. See ./ahrd_example_input.yml for details. (For an introduction to the YAML format see Wikipedia/YAML.)

Basically, AHRD needs a FASTA-File of amino acid sequences and one file per database containing the results of the respective BLAST searches; in our example we searched three databases: Uniprot/trEMBL, Uniprot/Swissprot and TAIR10. Note that AHRD is generic and can make use of any number of different Blast databases, which do not necessarily have to be the above ones. For example, when annotating genes from a fungal genome, searching yeast databases might be more appropriate than using TAIR (Arabidopsis thaliana).

All parameters can be set manually, or the default ones can be used as given in the example input file ahrd_example_input.yml (see sections 2.1 and 3.2 for more details).
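As a rough orientation, the overall structure of such an input file might look like the following sketch. It is assembled from the parameters mentioned throughout this document; key names such as blast_dbs, file and database, as well as the nesting, are assumptions and should be checked against ./ahrd_example_input.yml:

proteins_fasta: ./your_query_proteins.fasta
output: ./ahrd_output.csv
token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221
blast_dbs:
  swissprot:
    weight: 653
    description_score_bit_score_weight: 2.717061
    file: ./query_vs_swissprot.txt
    database: ./uniprot_swissprot.fasta
    blacklist: ./test/resources/blacklist_descline.txt
    filter: ./test/resources/filter_descline_sprot.txt
    token_blacklist: ./test/resources/blacklist_token.txt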

In order to parallelize the protein function annotation process, AHRD can be run on batches of a recommended size of 1,000 to 2,000 proteins. If you want to annotate very large protein sets or have limited memory, use the included Batcher to split your input data into batches of appropriate size (see section 2.3). Note: As of Java 7 or higher AHRD is quite fast and batching might no longer be necessary.

2.1 AHRD example usages

There are two template AHRD input files provided that you should use according to your use case. All example input files are stored in ./test/resources and are named ahrd_example_input*.yml. You can run AHRD on any of these use cases with

java -Xmx2g -jar ./dist/ahrd.jar your_use_case_file.yml

Use case: Annotate your Query proteins with Human Readable Descriptions (HRD)
Template file: ./test/resources/ahrd_example_input.yml

Use case: Annotate your Query proteins with HRD and Gene Ontology (GO) terms
Template file: ./test/resources/ahrd_example_input_go_prediction.yml

2.2 Input

Examples of all input files can be found under ./test/resources/. NOTE: Only files containing example in their filename should be used as template input files. Other YAML files are used for testing purposes.

2.2.1 Required input data

  1. Protein sequences in fasta format
  2. Sequence Similarity Search (blastp or blat) results in tabular format

(If you run AHRD in batches the blast search results need to be batched in the same way as the fasta files.)

Recommended Sequence Similarity Search:

For your query proteins you should start independent BLAST searches, e.g. against the three different databases mentioned above:

blastp -outfmt 6 -query query_sequences_AA.fasta -db uniprot_swissprot.fasta -out query_vs_swissprot.txt
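Analogous searches against the other two example databases might look like the following; the database FASTA file names here are assumptions and should be adapted to your local copies:

blastp -outfmt 6 -query query_sequences_AA.fasta -db uniprot_trembl.fasta -out query_vs_trembl.txt
blastp -outfmt 6 -query query_sequences_AA.fasta -db tair10.fasta -out query_vs_tair.txt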

2.2.2 Optional input data

If you want AHRD to predict your query proteins’ functions with Gene Ontology (GO) terms, you need to provide the GO annotations of reference proteins. See section 3.3.2 for more details.

2.2.3 Required config files

  1. Input yml with all paths and parameters according to your needs (see ahrd_example_input.yml and section Parameters)
  2. Blacklists and filters (they can either be used as provided or can be adapted to your needs and databases). Each of these files contains a list of valid Java regular expressions, one per line. For details on Java regular expressions please refer to http://docs.oracle.com/javase/tutorial/essential/regex/.
    1. Description blacklist (Argument blacklist: ./test/resources/blacklist_descline.txt) – Any Blast-Hit’s description matching one of the regular expressions in this file will be ignored.
    2. Description filter for each single blast database (Argument filter: ./test/resources/filter_descline_sprot.txt) – Any part of a Blast-Hit description that matches any one of the regular expressions in this file will be deleted from the description.
    3. Token blacklist (Argument token_blacklist: ./test/resources/blacklist_token.txt) – Blast-Hit descriptions are composed of words. Any such word matching any one of the regular expressions in this file will be ignored by AHRD’s scoring, although it will not be deleted from the description and thus might still appear in the final output.
2.2.3.1 Test custom blacklists and filters

As explained in 2.2.3 AHRD makes use of blacklists and filters provided as Java regular expressions. You can test your own custom blacklists and filters:

  1. Put the strings representing Blast-Hit descriptions, or words contained in them, into the file ./test/resources/regex_list.txt. Note that each line is interpreted as a single entry.
  2. Put the Java regular expressions you want to test in file ./test/resources/match_list.txt, using one regular expression per line.
  3. Execute ant test.regexs and study the output.

Example output for the test string “activity” and the regular expressions “(?i)interacting” and “(?i)activity” applied in series:

[junit] activity 
[junit] (?i)interacting -> activity
[junit] (?i)activity -> 

The above example demonstrates how the first regular expression does not match anything in the test string “activity”, while after matching against the second regular expression nothing remains, because the matched substring has been filtered out. As you can see, this test applies all provided regular expressions in order of appearance and shows what remains of the provided test string after filtering.

2.3 Batcher

AHRD provides a function to generate several input.yml files from large datasets, consisting of batches of query proteins. For each of these batches the user is expected to provide the batch’s query proteins in FASTA format, and one Blast result file for each database searched. The AHRD batcher will then generate a unique input.yml file and an entry in a batch shell script for each batch, so that AHRD can be executed on the respective batches in parallel, for example on a compute cluster using a batch system like LSF. We recommend this for very large datasets (more than a genome) or computers with low RAM.

To generate the mentioned input.yml files and batcher shell script that can subsequently be used to start AHRD in parallel use the batcher function as follows:

java -cp ./dist/ahrd.jar ahrd.controller.Batcher ./batcher_input_example.yml

You will have to edit ./batcher_input_example.yml and provide the following arguments. Note that files in the mentioned directories are interpreted as belonging to the same unique Batch if and only if they have identical file names.

  1. shell_script: Path to the shell-script file which later contains all the statements to invoke AHRD on each supplied batch in parallel.
  2. ahrd_call: "java -Xmx2048m -jar ./dist/ahrd.jar #batch#" This line will be the basis of executing AHRD on each supplied batch. The line must contain #batch#, which will be replaced with the generated input.yml files. To use a batch system like LSF this line has to be modified to something like bsub -q normal 'java -Xmx2048m -jar ./dist/ahrd.jar #batch#'
  3. proteins_dir: The path to the directory the query protein batches are stored in.
  4. batch_ymls_dir: The path to the directory the AHRD-Batcher should store the generated input.yml files in.
  5. dir: Each database entry requires this argument: the path to the directory in which each batch’s blast result file from searches in the corresponding Blast database is located.
  6. output_dir: The directory in which each AHRD run should create a subdirectory holding the output for the processed batch.

Batch-Name requirement: All of the above-mentioned files belonging to the same Batch must have the same name. This name must start with alpha-numeric characters and may end with digits indicating the Batch’s number. File extensions may vary.
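Putting the above arguments together, a Batcher input file might be sketched as follows. The nesting of the per-database dir entries is an assumption based on the argument descriptions; please compare with ./batcher_input_example.yml:

shell_script: ./start_ahrd_batched.sh
ahrd_call: "java -Xmx2048m -jar ./dist/ahrd.jar #batch#"
proteins_dir: ./batches/proteins
batch_ymls_dir: ./batches/ymls
output_dir: ./batches/output
blast_dbs:
  swissprot:
    dir: ./batches/blast_vs_swissprot
  trembl:
    dir: ./batches/blast_vs_trembl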

2.4 Output

AHRD supports two different formats. The default one is a tab-delimited table.
The other is FASTA-Format.

2.4.1 Tab-Delimited Table

AHRD writes out a tab-delimited table with the following columns:

  1. Protein-Accession — The Query Protein’s Accession
  2. Blast-Hit-Accession — The Accession of the Protein the assigned description was taken from.
  3. AHRD-Quality-Code — explained below
  4. Human-Readable-Description — The assigned HRD
  5. Gene-Ontology-ID — If AHRD was requested to generate Gene-Ontology-Annotations, they are appended here.

AHRD’s quality-code consists of a three-character string, where each character is either ‘*’ if the respective criterion is met or ‘-’ otherwise. Their meaning is explained in the following table:

Position Criteria
1 Bit score of the blast result is >50 and e-value is <e-10
2 Overlap of the blast result is >60%
3 Top token score of assigned HRD is >0.5

2.4.2 Fasta-Format

To set AHRD to write its output in FASTA-Format set the following switch in the input.yml:

output_fasta: true

AHRD will write a valid FASTA-File of your query proteins, where each header is composed of the same parts as above, but separated by whitespace.
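As a purely hypothetical illustration (the accession, hit, description and GO term are made up, and GO terms only appear if GO annotation was requested), such a header could look like:

>MyQuery_001 sp|Q00001|EXMPL_ARATH *** Putative example protein kinase GO:0016301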

2.5 AHRD run using BLASTX results

In order to run AHRD on BLASTX results instead of BLASTP results you have to modify the following parameters in the ahrd_example_input.yml:

token_score_bit_score_weight: 0.5
token_score_database_score_weight: 0.3
token_score_overlap_score_weight: 0.2

Since the algorithm is based on protein sequences while the BLASTX searches are based on nucleotide sequences, the overlap score of the blast result cannot be calculated meaningfully. To overcome this problem the token_score_overlap_score_weight has to be set to 0.0, and the other two weights have to be raised accordingly. These three parameters have to sum up to 1. The resulting parameter configuration could look like this:

token_score_bit_score_weight: 0.6
token_score_database_score_weight: 0.4
token_score_overlap_score_weight: 0.0

2.6 Parameter Optimization

If you have well-annotated reference proteins (at least 1000) from an organism closely related to the proteins you want to annotate, you can optimize AHRD’s parameters to produce protein function predictions more similar to your references. In order to do so you need to start a simulated annealing approach:

java -cp dist/ahrd.jar ahrd.controller.Trainer trainer_example_input.yml

Note: See ./test/resources/trainer_example_input.yml for details.

The input file is almost identical to that of a standard AHRD run. Only the following additional parameters are available; all numeric values given are the defaults used for the respective parameter:

  1. references_fasta: path_to_your_references.fasta Required parameter pointing to the fasta formatted file containing the reference proteins including the reference Human Readable Descriptions. (See ./test/resources/references.fasta for details.)
  2. path_log: your_log_file.tsv Stores the parameters evaluated in each optimization iteration, their F1-Score, and the difference in score to the currently accepted parameter set.
  3. temperature: 75000 Temperature used in simulated annealing
  4. cool_down_by: 1 Diminish the temperature by this value after each simulated annealing iteration
  5. f_measure_beta_parameter: 1.0 The scaling factor used to compute the F-Score of an assigned HRD when comparing it with the reference. See Wikipedia/F1_score for details.
  6. optimization_acceptance_probability_scaling_factor: 2500000000.0 Used to scale the probability of accepting a currently evaluated parameter set. P(Accept) := exp(-delta_scores * scaling_factor / current_temperature)
  7. mutator_mean: 0.25 Mutate a randomly selected parameter by a value drawn from a Gaussian (normal) distribution with this mean
  8. mutator_deviation: 0.15 Mutate a randomly selected parameter by a value drawn from a Gaussian (normal) distribution with this standard deviation
  9. remember_simulated_annealing_path: false Set to true, if you want the optimization to remember already visited parameter sets and their performance. This increases memory usage but improves final results.
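For orientation, these additional Trainer parameters could be appended to an otherwise regular AHRD input file as sketched below, using the defaults listed above; see ./test/resources/trainer_example_input.yml for the authoritative example:

# ... regular AHRD parameters as in ahrd_example_input.yml ...
references_fasta: ./path_to_your_references.fasta
path_log: ./trainer_log.tsv
temperature: 75000
cool_down_by: 1
f_measure_beta_parameter: 1.0
optimization_acceptance_probability_scaling_factor: 2500000000.0
mutator_mean: 0.25
mutator_deviation: 0.15
remember_simulated_annealing_path: false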

2.6.1 Optimization in parallel (Trainer-Batcher)

Simulated annealing is a time and resource consuming process. Because AHRD uses quite a number of parameters, the search space is accordingly large. Hence it is recommended to start independent simulated annealing runs in parallel. Do this using

java -cp ./dist/ahrd.jar ahrd.controller.TrainerBatcher ./test/resources/trainer_batcher_example.yml

The following parameters are specific to the Trainer-Batcher:

  1. shell_script: ./start_ahrd_batched.sh The path to the shell script you will use to start the simulated annealing jobs in parallel.
  2. ahrd_call: "java -Xmx2048m -cp ./dist/ahrd.jar ahrd.controller.Trainer #batch#" Defines a line in the resulting shell script, requires the presence of #batch# which will be replaced with the respective generated Trainer inputs. This line can be used to submit the parallel optimizations into a job queue.
  3. no_start_positions_in_parameter_space: 1024 Defines how many parallel optimization runs you want to start.
  4. batch_ymls_dir: ./trainer_batch_ymls Path to the directory in which to store the generated Trainer input files.
  5. output_dir: ./trainer_results Path to the directory in which to store each Trainer run’s output(s).

Note that the Trainer-Batcher works very much like the AHRD-Batcher (section 2.3). In particular, you need to respect the batch-name requirements explained there.

The above Trainer-Batcher example input shows how to automatically generate a desired number of input files with different starting points in parameter space. These input files can then directly be used with the AHRD Trainer documented above (section 2.6).
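Combining the parameters above, a Trainer-Batcher input file might be sketched as follows; see ./test/resources/trainer_batcher_example.yml for the authoritative example:

shell_script: ./start_ahrd_batched.sh
ahrd_call: "java -Xmx2048m -cp ./dist/ahrd.jar ahrd.controller.Trainer #batch#"
no_start_positions_in_parameter_space: 1024
batch_ymls_dir: ./trainer_batch_ymls
output_dir: ./trainer_results
# ... the Trainer parameters from section 2.6 are presumably also required here ...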

2.7 Computing F-Scores for selected parameter sets (AHRD-Evaluator)

Given different parameter sets, AHRD enables you to compute their performance in terms of F-Scores for each reference protein. Optionally you can also inspect the theoretically maximum attainable F-Score and see how well the best hits from each sequence similarity search perform. In order to do so, use the Evaluator function:

java -cp ./dist/ahrd.jar ahrd.controller.Evaluator ./evaluator_example.yml

See ./test/resources/evaluator_example.yml for more details.

Parameters specific to the Evaluator function are:

  1. find_highest_possible_evaluation_score: false Set to true if you want to see how the best fitting candidate description among the sequence similarity search hits performs. This gives you the theoretical best possible performance.
  2. write_scores_to_output: false Set to true if you want to see all internal intermediate scores: Token-Scores, Lexical-Scores, and Description-Scores. Use with extreme caution, because this option is meant for developers only.
  3. write_best_blast_hits_to_output: false Set to true if you want to see the respective best sequence similarity search Hits and their performances.
  4. f_measure_beta_parameter: 1.0 This parameter is also used by AHRD-Trainer. See section 2.6 for details.

Optionally you can apply AHRD’s blacklisting and filtering to the Reference Descriptions, too. Performance is then only assessed on non-blacklisted and filtered descriptions, of which only meaningful words, i.e. non-blacklisted tokens, will be scored. In order to do so, you need to set all three of the following parameters. Note that you can point any of them to empty dummy files, if you wish.

  1. references_description_blacklist: ./test/resources/blacklist_descline.txt Regular expressions that identify reference descriptions not to score.
  2. references_description_filter: ./test/resources/filter_descline_sprot.txt Regular expressions that filter out parts from the reference descriptions to be ignored for performance scoring.
  3. references_token_blacklist: ./test/resources/blacklist_token.txt Regular expressions that identify non-meaningful words (tokens) not to be considered when computing the performance scores.
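Combining the Evaluator-specific parameters above, the corresponding part of an Evaluator input file might be sketched as follows; see ./test/resources/evaluator_example.yml for the authoritative example:

# ... regular AHRD parameters; reference proteins are presumably provided as in the Trainer setup (section 2.6) ...
find_highest_possible_evaluation_score: true
write_scores_to_output: false
write_best_blast_hits_to_output: true
f_measure_beta_parameter: 1.0
references_description_blacklist: ./test/resources/blacklist_descline.txt
references_description_filter: ./test/resources/filter_descline_sprot.txt
references_token_blacklist: ./test/resources/blacklist_token.txt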

3 Algorithm

Based on e-values, the 200 best scoring blast results are chosen from each database search (e.g. Swissprot, TAIR, trEMBL). For all resulting candidate description lines a score is calculated using a lexical approach. First, each description line is passed through two regular expression filters. The first filter discards any matching description line in order to ignore descriptions like e.g. ‘Whole genome shotgun sequence’, while the second filter tailors the description lines by deleting matching parts, in order to discard e.g. trailing species descriptions such as ‘OS=Arabidopsis thaliana […]’. In the second step of the scoring, each description line is split into single tokens, which are passed through a blacklist filter, ignoring all matching tokens in terms of score. Tokens are sequences of characters with a collective meaning. For each token a score is calculated from three single scores with different weights: the bit score, the database score and the overlap score. The bit score is provided within the blast result. The database score is a fixed score for each blast database, based on the description quality of the database. The overlap score reflects the overlap of the query and subject sequence.

Subsequently, the sum of all token scores of a description line is divided by a correction factor that prevents the scoring system from being biased towards longer or shorter description lines. From this ranking the best scoring description line can then be chosen. In the last step a domain name provided by InterProScan results, if available, is extracted and appended to the best scoring description line for each uncharacterized protein. In the end, for each uncharacterized protein a description line is selected that comes from a high-scoring BLAST match, that contains words occurring frequently in the descriptions of the highest scoring BLAST matches, and that does not contain meaningless “fill words”. Each HRD line will contain an evaluation section that reflects the significance of the assigned human readable description.

3.1 Pseudo-Code

  1. Choose best scoring blast results, 200 from each database searched
  2. Filter description lines of above blast-results using regular expressions:
    1. Reject those matched by any regex given in e.g. ./test/resources/blacklist_descline.txt,
    2. Delete those parts of each description line, matching any regex in e.g. ./test/resources/filter_descline_sprot.txt.
  3. Divide each description line into tokens (characters of collective meaning)
    1. In terms of score, ignore any tokens matching regexes given e.g. in ./test/resources/blacklist_token.txt.
  4. Token score (calculated from: bitscore, database weight, overlap score)
  5. Lexical score (calculated from: Token score, High score factor, Pattern factor, Correction factor)
  6. Description score (calculated from: Lexical score and Blast score)
  7. Choose best scoring description line

3.2 Used Formulae and Parameters

3.3 Parameters

Above formulae use the following parameters as given in ./ahrd_example_input.yml. These parameters can either be used as provided or can be adapted to your needs.

The weights in the above formulae are

token_score_bit_score_weight: 0.468
token_score_database_score_weight: 0.2098
token_score_overlap_score_weight: 0.3221 

and Blast-Database specific, for example for UniprotKB/Swissprot:

weight: 653
description_score_bit_score_weight: 2.717061 
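Schematically, the token score combines its three component scores as a weighted sum along the following lines; this is a simplified sketch that omits the normalizations used in the actual formulae:

token_score = token_score_bit_score_weight * bit_score
            + token_score_database_score_weight * database_score
            + token_score_overlap_score_weight * overlap_score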

3.3.1 Parameters controlling the parsing of tabular sequence similarity search result tables (legacy BLAST, BLAST+, and BLAT)

AHRD has been designed to work with sequence similarity search results in tabular format. By default it parses the tabular “Blast 8” output:

Sequence similarity search tool: recommended output format (command line switch)
Blast+: -outfmt 6
legacy Blast: -m 8
Blat: -out=blast8

The following parameters can be set optionally, if your tabular sequence similarity search result deviates from the above “Blast 8” format. See the example file ./test/resources/ahrd_input_seq_sim_table.yml for more details.

Optional parameter (example value): meaning of the parameter
seq_sim_search_table_comment_line_regex ("#"): single character that starts a comment line to be ignored
seq_sim_search_table_sep ("\t"): single character that separates columns
seq_sim_search_table_query_col (10): number of the column holding the query protein’s accession
seq_sim_search_table_subject_col (11): number of the column holding the Hit protein’s accession
seq_sim_search_table_query_start_col (16): number of the column holding the query’s start amino acid position in the local alignment
seq_sim_search_table_query_end_col (17): number of the column holding the query’s end amino acid position in the local alignment
seq_sim_search_table_subject_start_col (18): number of the column holding the Hit’s start amino acid position in the local alignment
seq_sim_search_table_subject_end_col (19): number of the column holding the Hit’s end amino acid position in the local alignment
seq_sim_search_table_e_value_col (20): number of the column holding the Hit’s E-Value
seq_sim_search_table_bit_score_col (21): number of the column holding the Hit’s Bit-Score

NOTE: All above column numbers start counting with zero, i.e. the first column has number 0.
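For example, a result table whose columns deviate from the “Blast 8” layout could be described with a parameter block like the following sketch, reusing the example values from the table above; whether these parameters are set globally or per database is best verified against ./test/resources/ahrd_input_seq_sim_table.yml:

seq_sim_search_table_comment_line_regex: "#"
seq_sim_search_table_sep: "\t"
seq_sim_search_table_query_col: 10
seq_sim_search_table_subject_col: 11
seq_sim_search_table_query_start_col: 16
seq_sim_search_table_query_end_col: 17
seq_sim_search_table_subject_start_col: 18
seq_sim_search_table_subject_end_col: 19
seq_sim_search_table_e_value_col: 20
seq_sim_search_table_bit_score_col: 21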

3.3.2 Parameters controlling Gene Ontology term annotations

AHRD is capable of annotating the Query proteins with Gene Ontology (GO) terms. It does so by transferring the reference GO terms found in the Blast Hit that AHRD selects as the source of the resulting HRD. To be able to transfer these reference GO terms, AHRD needs a reference GO annotation file (GOA). By default AHRD expects this GOA file to be in the standard Uniprot format. You can download the latest GOA file from the Uniprot server. To obtain GO annotations for all UniprotKB proteins download the file goa_uniprot_all.gaf.gz (last visited Feb 16th 2017).

To have AHRD annotate your proteins with GO terms, you just need to provide the optional parameter gene_ontology_result. See example file ./test/resources/ahrd_input_seq_sim_table_go_prediction.yml for more details.
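A minimal example of that parameter, assuming the GOA file from above has been downloaded and unpacked into the working directory (the file name is an assumption, and whether the parameter is set globally or per blast database should be checked in the example file):

gene_ontology_result: ./goa_uniprot_all.gaf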

3.3.2.1 Prefer reference proteins as candidates that have GO Term annotations

The parameter prefer_reference_with_go_annos: true is highly recommended when aiming to annotate your query proteins with GO Terms. If this parameter is set to true, only those candidate references are considered that also have GO Term annotations. However, if you put more emphasis on Human Readable Descriptions and are prepared to accept that some of your queries will not get any GO Term predictions, you can switch this off with prefer_reference_with_go_annos: false or simply omit the parameter, as it is set to false by default.

3.3.2.2 Custom reference Gene Ontology annotations (non UniprotKB GOA)

Unfortunately, UniprotKB GOA files use short protein accessions like W9QFR0, while the UniprotKB FASTA databases use long protein accessions with pipes like tr|W9QFR0|W9QFR0_9ROSA. In order to match the correct short and long accessions, and thus the reference GO annotations, AHRD uses regular expressions to parse both the long accessions and the GOA files. By default AHRD is set up to handle Uniprot formats, but you can provide custom regular expressions for your own custom GOA files:

  1. Set the regular expression to extract short protein accessions mapped to GO terms from the reference GOA using reference_go_regex, e.g. reference_go_regex: ^UniProtKB\\s+(?<shortAccession>\\S+)\\s+\\S+\\s+(?<goTerm>GO:\\d{7})
  2. Set the Blast database specific regular expression used to extract the short accessions from long ones with short_accession_regex. For example, for TAIR10, use short_accession_regex: "^(?<shortAccession>.+)$".

Note: You must provide the above named match groups shortAccession and goTerm, respectively.
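Put together in the input YAML, such a custom setup might look like the following sketch; the exact placement of reference_go_regex and short_accession_regex (globally versus inside the respective blast database block) is an assumption and should be verified against the example input files:

reference_go_regex: ^UniProtKB\\s+(?<shortAccession>\\S+)\\s+\\S+\\s+(?<goTerm>GO:\\d{7})
blast_dbs:
  tair:
    short_accession_regex: "^(?<shortAccession>.+)$"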

4 Testing

If you want to run the complete JUnit Test-Suite execute:

ant

If you want to execute a test run on example proteins use:

ant test.run

5 License

See attached file LICENSE.txt for details.

6 Authors

Dr. Asis Hallab, Kathrin Klee, Florian Boecker, Dr. Girish Srinivas, and Prof. Dr. Heiko Schoof

INRES Crop Bioinformatics
University of Bonn
Katzenburgweg 2
53115 Bonn
Germany

7 References

High quality genome projects that used AHRD:

1 Young, Nevin D., Frédéric Debellé, Giles E. D. Oldroyd, Rene Geurts, Steven B. Cannon, Michael K. Udvardi, Vagner A. Benedito, et al. “The Medicago Genome Provides Insight into the Evolution of Rhizobial Symbioses.” Nature 480, no. 7378 (December 22, 2011): 520–24. doi:10.1038/nature10625.

2 The Tomato Genome Consortium. “The Tomato Genome Sequence Provides Insights into Fleshy Fruit Evolution.” Nature 485, no. 7400 (May 31, 2012): 635–41. doi:10.1038/nature11119.

3 International Wheat Genome Sequencing Consortium (IWGSC). “A Chromosome-Based Draft Sequence of the Hexaploid Bread Wheat (Triticum Aestivum) Genome.” Science (New York, N.Y.) 345, no. 6194 (July 18, 2014): 1251788. doi:10.1126/science.1251788.

4 International Barley Genome Sequencing Consortium, Klaus F. X. Mayer, Robbie Waugh, John W. S. Brown, Alan Schulman, Peter Langridge, Matthias Platzer, et al. “A Physical, Genetic and Functional Sequence Assembly of the Barley Genome.” Nature 491, no. 7426 (November 29, 2012): 711–16. doi:10.1038/nature11543.

5 Wang, W., G. Haberer, H. Gundlach, C. Gläßer, T. Nussbaumer, M. C. Luo, A. Lomsadze, et al. “The Spirodela Polyrhiza Genome Reveals Insights into Its Neotenous Reduction Fast Growth and Aquatic Lifestyle.” Nature Communications 5 (2014): 3311. doi:10.1038/ncomms4311.

6 Guo, Shaogui, Jianguo Zhang, Honghe Sun, Jerome Salse, William J. Lucas, Haiying Zhang, Yi Zheng, et al. “The Draft Genome of Watermelon (Citrullus Lanatus) and Resequencing of 20 Diverse Accessions.” Nature Genetics 45, no. 1 (January 2013): 51–58. doi:10.1038/ng.2470.

AHRD was applied on all plant genomes present in the public database PlantsDB:

7 Spannagl, Manuel, Thomas Nussbaumer, Kai C. Bader, Mihaela M. Martis, Michael Seidel, Karl G. Kugler, Heidrun Gundlach, and Klaus F. X. Mayer. “PGSB PlantsDB: Updates to the Database Framework for Comparative Plant Genome Research.” Nucleic Acids Research 44, no. D1 (January 4, 2016): D1141–47. doi:10.1093/nar/gkv1130.
