-
Notifications
You must be signed in to change notification settings - Fork 40
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add hmmsearch option that use hmmsearch instead of hmmscan #118
Conversation
Hi @chtsai0105 |
And would you be interested in collaborating on another pull request that disables the problematic tons-of-subprocesses creation feature (#117) as well? |
Hi @cmkobel Actually I just rewrote a light weight version of run_dbcan, dbcanLight, which implement the hmmer and dbcan_sub part using pyhmmer. The output of dbcanLight is compatible to the original run_dbcan. Currently there is a out-of-memory bug described here which is due to the accumulated matches eventually hit the maximum memory. I already got some feedback from the developer and probably can fix this really quick. |
Thanks for your PR. I will test your code on my local machine and merge it if there is no issues. |
If hmmsearch has the same result as hmmer, by the way, could you please do not add a new parameter for hmmsearch? You can directly add hmmsearch in the choices. and revise the default from all to ['hmmsearch', 'diamond', 'dbcansub']. parser.add_argument('--tools', '-t', nargs='+', choices=['hmmer', 'diamond', 'dbcansub', 'all'], default='all', help='Choose a combination of tools to run') |
Hi @chtsai0105, there are no new changes in your new commit just now. Could you submit it again? Thanks! |
Ohhh I just updated my branch... I'm working on changing the parser behavior you mentioned earlier and will push the fix later. |
Got it. Thank you very much. |
Hi - is that possible to create a new branch on the original run_dbcan repo? |
Hi @chtsai0105 , we already tested performance for hmmsearch and hmmscan, and we found that there are only a small partial of mismatch for hmmsearch and hmmscan, which is bearable. Therefore, we decide to replace hmmscan with hmmsearch. I will do this by myself and thanks for your PR. Since it only just changes one command and we have multiple other changes, we won't merge your PR. But we will leave thank you msg in our package. |
Got it! Since the issue is going to be resolved I'm closing this PR. |
I basically add an option that allow user to switch to hmmsearch. Nothing change if the option is not specified.
In general, hmmsearch is doing exactly the same thing as hmmscan. There have been a lot of discussions (#27, #49 and #105) mentioning that hmmsearch is way faster than hmmscan when querying a lot of data (please also refer to this thread) and my recent issue #117 that the current code will cause too much burden on the cluster if several tasks are run parallelly. The differences are the e-value (which will be influenced by the sequence length) and the format. I take a look into the final overview.txt and found that we just need to slightly modify the order of the column extraction to get the identical results (except the e-value) processed from hmmsearch output.
Code changes
run_dbcan.py
--hmmsearch
to theargparse
object to allow user to use hmmsearch instead of hmmscan.runHmmScan
,split_uniInput
andrun
.hmmscan_parser.py
run
.Evaluation
I evaluated the changes in one of my data (prodigal results come from a metagenomic sample) and got 634 hits from the original hmmscan version and 628 from the hmmsearch version. Among these hits, 48 are not identical but many of them are actually matching to the same target. (please refer to the results_comparison.csv)
And there is a lot improvement in terms of the speed.
Since the previous version do not have timer for hmmer. I just compare the dbcan_sub module. To process 25320 sequences, the total time for the hmmscan version is 318683.97225904465 while the hmmsearch version is 3085.800838470459. It's around 100X faster.
Caveat
I didn't go through the other submodule so maybe need further check whether they have any dependency on the function I made change of.