Fast implementation of HMMSearch, optimized for high-memory systems, using PyHMMER. PyHMMSearch accepts FASTA input in uncompressed or gzip format and databases in either HMM or Python pickle serialized format. No intermediate files are created.
pip install pyhmmsearch
- pyhmmer >=0.10.12
- pandas
- tqdm
| Database | Tool | Single Threaded | 12 Threads |
|---|---|---|---|
| Pfam | PyHMMSearch | 2:24 | 0:20 |
| Pfam | HMMER HMMSearch | 2:53 | 2:27 |
* Time in minutes for 4977 proteins in test/test.faa.gz.
Official benchmarking of the hmmsearch algorithm implemented in PyHMMER against HMMER is reported in Larralde et al. 2023.
PyHMMSearch is recommended for systems with 1) high RAM; 2) large numbers of threads; and/or 3) metered disk reads and writes (e.g., AWS EFS). It is also useful when querying a large number of proteins.
# Download database
DATABASE_DIRECTORY=/path/to/database_directory/
mkdir -p ${DATABASE_DIRECTORY}/Annotate/Pfam
wget -v -P ${DATABASE_DIRECTORY}/Annotate/Pfam https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz

# Run PyHMMSearch (an HMM database is passed with -d; -b is for serialized databases)
pyhmmsearch -i test/test.faa.gz -o output.tsv -d ${DATABASE_DIRECTORY}/Annotate/Pfam/Pfam-A.hmm.gz -p=-1
# Provide a database
serialize_hmm_models -d path/to/Pfam-A.hmm.gz -b path/to/database.pkl.gz

# or a directory of HMMs
serialize_hmm_models -d path/to/hmm_directory/ -b path/to/database.pkl.gz

# or from a list of filepaths to HMM models
serialize_hmm_models -l path/to/hmms.list -b path/to/database.pkl.gz

# or from a list through stdin
ls path/to/directory/*.hmm | serialize_hmm_models -b path/to/database.pkl.gz
Database can be uncompressed pickle or gzipped pickle.
pyhmmsearch -i test/test.faa.gz -o output.tsv -b ~/Databases/Pfam/database.pkl.gz -p=-1
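To inspect a serialized database from Python, something like the following may work. It is a minimal sketch, not PyHMMSearch's own loader: the function name `load_serialized_database` is hypothetical, and it assumes the file is a pickled `{name: hmm}` dictionary as described in the help text. Unpickling real HMM objects would additionally require `pyhmmer` to be installed, and pickles should only be loaded from trusted sources.

```python
import gzip
import pickle

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def load_serialized_database(path):
    """Load a pickled {name: hmm} dictionary from a plain or gzipped pickle.

    Hypothetical helper for illustration; not part of PyHMMSearch.
    """
    # Peek at the first two bytes rather than trusting the file extension
    with open(path, "rb") as handle:
        is_gzipped = handle.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzipped else open
    with opener(path, "rb") as handle:
        return pickle.load(handle)
```

Checking the magic bytes means the same call handles both `database.pkl` and `database.pkl.gz`, even if a file has been renamed.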
pyhmmsearch -i test/test.faa.gz -o output.tsv -d test/bacteria_odb10/bacteria_odb10.hmm.gz -s test/bacteria_odb10/scores_cutoff -f name -p=-1
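The file passed with `-s` is described in the help text as `[id_hmm]<tab>[score_threshold]` with no header. As a rough illustration of that layout, the sketch below parses such a file into a dictionary. The function name `read_scores_cutoff` and the identifiers and thresholds in the example are hypothetical, and the code assumes one numeric threshold per row.

```python
import csv
import io


def read_scores_cutoff(handle):
    """Parse [id_hmm]<tab>[score_threshold] rows (no header) into a dict."""
    cutoffs = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:  # skip blank lines
            continue
        id_hmm, score_threshold = row[0], row[1]
        cutoffs[id_hmm] = float(score_threshold)
    return cutoffs


# Hypothetical identifiers and thresholds, for illustration only
example = io.StringIO("marker_A\t25.5\nmarker_B\t110.0\n")
cutoffs = read_scores_cutoff(example)
print(cutoffs)
```

With `-f name`, the first column would presumably need to match HMM names rather than accessions.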
reformat_pyhmmsearch -i pyhmmsearch_output.tsv -o pyhmmsearch_output.reformatted.tsv
$ pyhmmsearch -h
usage: pyhmmsearch -i <proteins.fasta> -o <output.tsv> -d
Running: pyhmmsearch v2024.4.25 via Python v3.10.14 | /Users/jolespin/miniconda3/envs/kofamscan_env/bin/python
options:
-h, --help show this help message and exit
I/O arguments:
-i PROTEINS, --proteins PROTEINS
path/to/proteins.fasta. stdin does not stream and loads everything into memory. [Default: stdin]
-o OUTPUT, --output OUTPUT
path/to/output.tsv [Default: stdout]
--no_header No header
Utility arguments:
-p N_JOBS, --n_jobs N_JOBS
Number of threads to use [Default: 1]
HMMSearch arguments:
-s SCORES_CUTOFF, --scores_cutoff SCORES_CUTOFF
path/to/scores_cutoff.tsv [id_hmm]<tab>[score_threshold], No header.
-f {accession,name}, --hmm_marker_field {accession,name}
HMM reference type (accession, name) [Default: accession]
-t SCORE_TYPE, --score_type SCORE_TYPE
{full, domain} [Default: full]
-m {gathering,noise,e,trusted}, --threshold_method {gathering,noise,e,trusted}
Cutoff threshold method [Default: e]
-e EVALUE, --evalue EVALUE
E-value threshold [Default: 10.0]
Database arguments:
-d HMM_DATABASE, --hmm_database HMM_DATABASE
path/to/database.hmm cannot be used with -b/-serialized_database
-b SERIALIZED_DATABASE, --serialized_database SERIALIZED_DATABASE
path/to/database.pkl cannot be used with -d/--database_directory. Database should be pickled dictionary {name:hmm}
Copyright 2024 Josh L. Espinoza (jolespin@newatlantis.io)
From pyhmmsearch:
| id_protein | id_hmm | threshold | score | bias | best_domain-score | best_domain-bias | e-value |
|---|---|---|---|---|---|---|---|
| SRR13615825__k127_453760_1 | PF00389.34 | (24.600000381469727, 24.600000381469727) | 93.686 | 6.702 | 89.856 | 6.702 | 1.984e-27 |
| SRR13615825__k127_295655_1 | PF00389.34 | (24.600000381469727, 24.600000381469727) | 83.195 | 0.005 | 83.167 | 0.005 | 3.456e-24 |
| SRR13615825__k127_218710_3 | PF00389.34 | (24.600000381469727, 24.600000381469727) | 42.235 | 0.004 | 42.073 | 0.004 | 1.559e-11 |
| SRR13615825__k127_272080_1 | PF00389.34 | (24.600000381469727, 24.600000381469727) | 24.673 | 0.000 | 22.067 | 0.000 | 4.154e-06 |
| SRR13615825__k127_297426_1 | PF02826.23 | (25.100000381469727, 25.100000381469727) | 170.426 | 0.003 | 170.122 | 0.003 | 6.392e-51 |
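For downstream filtering, the output can be read with the standard library alone. The sketch below assumes the file is tab-separated with the header row shown above (that is, `--no_header` was not used). The function name `hits_below_evalue` is hypothetical, and the two data rows are copied from the example output.

```python
import csv
import io
from collections import defaultdict

header = ["id_protein", "id_hmm", "threshold", "score", "bias",
          "best_domain-score", "best_domain-bias", "e-value"]
rows = [
    ["SRR13615825__k127_453760_1", "PF00389.34",
     "(24.600000381469727, 24.600000381469727)",
     "93.686", "6.702", "89.856", "6.702", "1.984e-27"],
    ["SRR13615825__k127_272080_1", "PF00389.34",
     "(24.600000381469727, 24.600000381469727)",
     "24.673", "0.000", "22.067", "0.000", "4.154e-06"],
]
# Rebuild a small tab-separated text that stands in for output.tsv
tsv_text = "\n".join("\t".join(fields) for fields in [header] + rows) + "\n"


def hits_below_evalue(handle, max_evalue):
    """Group (id_hmm, score) pairs by protein, keeping e-values <= max_evalue."""
    hits = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if float(row["e-value"]) <= max_evalue:
            hits[row["id_protein"]].append((row["id_hmm"], float(row["score"])))
    return dict(hits)


hits = hits_below_evalue(io.StringIO(tsv_text), max_evalue=1e-10)
print(hits)
```

To read a real file, pass `open("output.tsv", newline="")` instead of the `io.StringIO` stand-in.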
From reformat_pyhmmsearch:
| id_protein | number_of_hits | ids | evalues | scores |
|---|---|---|---|---|
| SRR13615825__k127_453760_1 | 3 | ['PF00389.34', 'PF02826.23', 'PF03446.19'] | [1.984e-27, 2.113e-39, 2.41e-08] | [93.686, 132.902, 32.336] |
| SRR13615825__k127_295655_1 | 2 | ['PF00389.34', 'PF02826.23'] | [3.456e-24, 7.794e-21] | [83.195, 72.421] |
| SRR13615825__k127_218710_3 | 1 | ['PF00389.34'] | [1.559e-11] | [42.235] |
| SRR13615825__k127_272080_1 | 2 | ['PF00389.34', 'PF02826.23'] | [4.154e-06, 2.035e-41] | [24.673, 139.471] |
| SRR13615825__k127_297426_1 | 1 | ['PF02826.23'] | [6.392e-51] | [170.426] |
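The `ids`, `evalues`, and `scores` columns hold Python list literals, so they arrive as plain strings when the file is read back. One way to recover real lists is `ast.literal_eval`, as sketched below. The function name `parse_reformatted_row` is hypothetical, the row is copied from the example output, and tab-separated columns are assumed.

```python
import ast

# One data row from the reformatted output, with tab-separated columns
line = (
    "SRR13615825__k127_295655_1\t2\t['PF00389.34', 'PF02826.23']"
    "\t[3.456e-24, 7.794e-21]\t[83.195, 72.421]"
)


def parse_reformatted_row(line):
    """Turn one reformatted row into a dict with real Python lists."""
    id_protein, number_of_hits, ids, evalues, scores = line.rstrip("\n").split("\t")
    return {
        "id_protein": id_protein,
        "number_of_hits": int(number_of_hits),
        # literal_eval only evaluates literals, so it is safer than eval()
        "ids": ast.literal_eval(ids),
        "evalues": ast.literal_eval(evalues),
        "scores": ast.literal_eval(scores),
    }


record = parse_reformatted_row(line)
print(record["ids"], record["scores"])
```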
From reformat_pyhmmsearch with -b/--best_hits_only:
| id_protein | id | evalue | score |
|---|---|---|---|
| SRR13615825__k127_453760_1 | PF02826.23 | 2.113e-39 | 132.902 |
| SRR13615825__k127_295655_1 | PF00389.34 | 3.456e-24 | 83.195 |
| SRR13615825__k127_218710_3 | PF00389.34 | 1.559e-11 | 42.235 |
| SRR13615825__k127_272080_1 | PF02826.23 | 2.035e-41 | 139.471 |
| SRR13615825__k127_297426_1 | PF02826.23 | 6.392e-51 | 170.426 |
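The best-hit rows above are consistent with keeping, for each protein, the hit with the highest score (which is also the hit with the lowest e-value in these examples). Whether `reformat_pyhmmsearch` ranks by score or by e-value is not stated here, so the sketch below is only an assumption that reproduces the example rows; `best_hit` is a hypothetical helper.

```python
def best_hit(ids, evalues, scores):
    """Return (id, evalue, score) for the highest-scoring hit.

    Assumption: "best" means highest score; ties keep the first hit.
    """
    best = max(range(len(scores)), key=scores.__getitem__)
    return ids[best], evalues[best], scores[best]


# Values copied from the multi-hit example for SRR13615825__k127_453760_1
result = best_hit(
    ["PF00389.34", "PF02826.23", "PF03446.19"],
    [1.984e-27, 2.113e-39, 2.41e-08],
    [93.686, 132.902, 32.336],
)
print(result)
```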
Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011 Oct;7(10):e1002195. doi: 10.1371/journal.pcbi.1002195. Epub 2011 Oct 20. PMID: 22039361; PMCID: PMC3197634.
Larralde M, Zeller G. PyHMMER: a Python library binding to HMMER for efficient sequence analysis. Bioinformatics. 2023 May 4;39(5):btad214. doi: 10.1093/bioinformatics/btad214. PMID: 37074928; PMCID: PMC10159651.
The code for PyHMMSearch is licensed under the MIT License.
Please contact jolespin@newatlantis.io regarding any licensing concerns.
