Add hmmsearch option that use hmmsearch instead of hmmscan #118

chtsai0105 · 2023-05-25T18:55:02Z

I basically add an option that allow user to switch to hmmsearch. Nothing change if the option is not specified.
In general, hmmsearch is doing exactly the same thing as hmmscan. There have been a lot of discussions (#27, #49 and #105) mentioning that hmmsearch is way faster than hmmscan when querying a lot of data (please also refer to this thread) and my recent issue #117 that the current code will cause too much burden on the cluster if several tasks are run parallelly. The differences are the e-value (which will be influenced by the sequence length) and the format. I take a look into the final overview.txt and found that we just need to slightly modify the order of the column extraction to get the identical results (except the e-value) processed from hmmsearch output.

Code changes

run_dbcan.py

Add an option --hmmsearch to the argparse object to allow user to use hmmsearch instead of hmmscan.
Add the hmmsearch argument to the function runHmmScan, split_uniInput and run.
Modify the corresponding sections that run hmmscan with a if-else statement that determine which runner to use.
Add timer to hmmer and diamond. Would be great to know the processing time for all these core tools.

hmmscan_parser.py

Add the hmmsearch argument to the function run.
Add an if-else statement to switch the parsing codes to corresponding runner.

Evaluation

I evaluated the changes in one of my data (prodigal results come from a metagenomic sample) and got 634 hits from the original hmmscan version and 628 from the hmmsearch version. Among these hits, 48 are not identical but many of them are actually matching to the same target. (please refer to the results_comparison.csv)

And there is a lot improvement in terms of the speed.
Since the previous version do not have timer for hmmer. I just compare the dbcan_sub module. To process 25320 sequences, the total time for the hmmscan version is 318683.97225904465 while the hmmsearch version is 3085.800838470459. It's around 100X faster.

Caveat

I didn't go through the other submodule so maybe need further check whether they have any dependency on the function I made change of.

cmkobel · 2023-09-07T09:19:31Z

Hi @chtsai0105
Do you have a (bio)conda recipe for the code base that you propose in this pull request? I'm curious to try it out but generally rely on conda to install most software.
Best, Carl.

cmkobel · 2023-09-07T09:21:59Z

And would you be interested in collaborating on another pull request that disables the problematic tons-of-subprocesses creation feature (#117) as well?

chtsai0105 · 2023-09-07T21:34:05Z

Hi @cmkobel

Actually I just rewrote a light weight version of run_dbcan, dbcanLight, which implement the hmmer and dbcan_sub part using pyhmmer. The output of dbcanLight is compatible to the original run_dbcan.

Currently there is a out-of-memory bug described here which is due to the accumulated matches eventually hit the maximum memory. I already got some feedback from the developer and probably can fix this really quick.

linnabrown · 2023-09-07T21:41:09Z

Thanks for your PR. I will test your code on my local machine and merge it if there is no issues.

linnabrown · 2023-09-07T21:44:06Z

If hmmsearch has the same result as hmmer, by the way, could you please do not add a new parameter for hmmsearch? You can directly add hmmsearch in the choices. and revise the default from all to ['hmmsearch', 'diamond', 'dbcansub'].

parser.add_argument('--tools', '-t', nargs='+', choices=['hmmer', 'diamond', 'dbcansub', 'all'], default='all', help='Choose a combination of tools to run')
parser.add_argument('--hmmsearch', default=False, action='store_true', help='Use hmmsearch instead of hmmscan')

linnabrown · 2023-09-07T22:11:35Z

Hi @chtsai0105, there are no new changes in your new commit just now. Could you submit it again? Thanks!

chtsai0105 · 2023-09-07T22:17:02Z

Ohhh I just updated my branch... I'm working on changing the parser behavior you mentioned earlier and will push the fix later.

linnabrown · 2023-09-07T22:22:21Z

Got it. Thank you very much.
BTW, I am also resolving another issue. I will merge your new changes after I finish that.

chtsai0105 · 2023-09-07T22:35:51Z

Hi - is that possible to create a new branch on the original run_dbcan repo?
Like I mentioned before - the results from hmmsearch will be slightly different from hmmscan since the evalue will be varied on the size of the database. So I think we can create a branch for you to test how it works and you can decide later whether you want to totally replace hmmscan by hmmsearch or not. I guess this program is also the backend of your web service so just want to be cautious not to break it.

linnabrown · 2023-09-08T01:34:08Z

Hi @chtsai0105 , we already tested performance for hmmsearch and hmmscan, and we found that there are only a small partial of mismatch for hmmsearch and hmmscan, which is bearable. Therefore, we decide to replace hmmscan with hmmsearch. I will do this by myself and thanks for your PR. Since it only just changes one command and we have multiple other changes, we won't merge your PR. But we will leave thank you msg in our package.

chtsai0105 · 2023-09-08T01:43:00Z

Got it! Since the issue is going to be resolved I'm closing this PR.

Add hmmsearch option that use hmmsearch instead of hmmscan

a9aad1e

Merge branch 'linnabrown:master' into master

7f864ee

chtsai0105 closed this Sep 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add hmmsearch option that use hmmsearch instead of hmmscan #118

Add hmmsearch option that use hmmsearch instead of hmmscan #118

chtsai0105 commented May 25, 2023

cmkobel commented Sep 7, 2023

cmkobel commented Sep 7, 2023 •

edited

Loading

chtsai0105 commented Sep 7, 2023

linnabrown commented Sep 7, 2023

linnabrown commented Sep 7, 2023

linnabrown commented Sep 7, 2023

chtsai0105 commented Sep 7, 2023 •

edited

Loading

linnabrown commented Sep 7, 2023 •

edited

Loading

chtsai0105 commented Sep 7, 2023

linnabrown commented Sep 8, 2023 •

edited

Loading

chtsai0105 commented Sep 8, 2023

Add hmmsearch option that use hmmsearch instead of hmmscan #118

Add hmmsearch option that use hmmsearch instead of hmmscan #118

Conversation

chtsai0105 commented May 25, 2023

Code changes

Evaluation

Caveat

cmkobel commented Sep 7, 2023

cmkobel commented Sep 7, 2023 • edited Loading

chtsai0105 commented Sep 7, 2023

linnabrown commented Sep 7, 2023

linnabrown commented Sep 7, 2023

linnabrown commented Sep 7, 2023

chtsai0105 commented Sep 7, 2023 • edited Loading

linnabrown commented Sep 7, 2023 • edited Loading

chtsai0105 commented Sep 7, 2023

linnabrown commented Sep 8, 2023 • edited Loading

chtsai0105 commented Sep 8, 2023

cmkobel commented Sep 7, 2023 •

edited

Loading

chtsai0105 commented Sep 7, 2023 •

edited

Loading

linnabrown commented Sep 7, 2023 •

edited

Loading

linnabrown commented Sep 8, 2023 •

edited

Loading