Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hmmsearch option that use hmmsearch instead of hmmscan #118

Closed
wants to merge 2 commits into from

Conversation

chtsai0105
Copy link

I basically add an option that allow user to switch to hmmsearch. Nothing change if the option is not specified.
In general, hmmsearch is doing exactly the same thing as hmmscan. There have been a lot of discussions (#27, #49 and #105) mentioning that hmmsearch is way faster than hmmscan when querying a lot of data (please also refer to this thread) and my recent issue #117 that the current code will cause too much burden on the cluster if several tasks are run parallelly. The differences are the e-value (which will be influenced by the sequence length) and the format. I take a look into the final overview.txt and found that we just need to slightly modify the order of the column extraction to get the identical results (except the e-value) processed from hmmsearch output.

Code changes

run_dbcan.py

  1. Add an option --hmmsearch to the argparse object to allow user to use hmmsearch instead of hmmscan.
  2. Add the hmmsearch argument to the function runHmmScan, split_uniInput and run.
  3. Modify the corresponding sections that run hmmscan with a if-else statement that determine which runner to use.
  4. Add timer to hmmer and diamond. Would be great to know the processing time for all these core tools.

hmmscan_parser.py

  1. Add the hmmsearch argument to the function run.
  2. Add an if-else statement to switch the parsing codes to corresponding runner.

Evaluation

I evaluated the changes in one of my data (prodigal results come from a metagenomic sample) and got 634 hits from the original hmmscan version and 628 from the hmmsearch version. Among these hits, 48 are not identical but many of them are actually matching to the same target. (please refer to the results_comparison.csv)

And there is a lot improvement in terms of the speed.
Since the previous version do not have timer for hmmer. I just compare the dbcan_sub module. To process 25320 sequences, the total time for the hmmscan version is 318683.97225904465 while the hmmsearch version is 3085.800838470459. It's around 100X faster.

Caveat

I didn't go through the other submodule so maybe need further check whether they have any dependency on the function I made change of.

@cmkobel
Copy link

cmkobel commented Sep 7, 2023

Hi @chtsai0105
Do you have a (bio)conda recipe for the code base that you propose in this pull request? I'm curious to try it out but generally rely on conda to install most software.
Best, Carl.

@cmkobel
Copy link

cmkobel commented Sep 7, 2023

And would you be interested in collaborating on another pull request that disables the problematic tons-of-subprocesses creation feature (#117) as well?

@chtsai0105
Copy link
Author

Hi @cmkobel

Actually I just rewrote a light weight version of run_dbcan, dbcanLight, which implement the hmmer and dbcan_sub part using pyhmmer. The output of dbcanLight is compatible to the original run_dbcan.

Currently there is a out-of-memory bug described here which is due to the accumulated matches eventually hit the maximum memory. I already got some feedback from the developer and probably can fix this really quick.

@linnabrown
Copy link
Owner

Thanks for your PR. I will test your code on my local machine and merge it if there is no issues.

@linnabrown
Copy link
Owner

If hmmsearch has the same result as hmmer, by the way, could you please do not add a new parameter for hmmsearch? You can directly add hmmsearch in the choices. and revise the default from all to ['hmmsearch', 'diamond', 'dbcansub'].

parser.add_argument('--tools', '-t', nargs='+', choices=['hmmer', 'diamond', 'dbcansub', 'all'], default='all', help='Choose a combination of tools to run')
parser.add_argument('--hmmsearch', default=False, action='store_true', help='Use hmmsearch instead of hmmscan')

@linnabrown
Copy link
Owner

Hi @chtsai0105, there are no new changes in your new commit just now. Could you submit it again? Thanks!

@chtsai0105
Copy link
Author

chtsai0105 commented Sep 7, 2023

Ohhh I just updated my branch... I'm working on changing the parser behavior you mentioned earlier and will push the fix later.

@linnabrown
Copy link
Owner

linnabrown commented Sep 7, 2023

Got it. Thank you very much.
BTW, I am also resolving another issue. I will merge your new changes after I finish that.

@chtsai0105
Copy link
Author

Hi - is that possible to create a new branch on the original run_dbcan repo?
Like I mentioned before - the results from hmmsearch will be slightly different from hmmscan since the evalue will be varied on the size of the database. So I think we can create a branch for you to test how it works and you can decide later whether you want to totally replace hmmscan by hmmsearch or not. I guess this program is also the backend of your web service so just want to be cautious not to break it.

@linnabrown
Copy link
Owner

linnabrown commented Sep 8, 2023

Hi @chtsai0105 , we already tested performance for hmmsearch and hmmscan, and we found that there are only a small partial of mismatch for hmmsearch and hmmscan, which is bearable. Therefore, we decide to replace hmmscan with hmmsearch. I will do this by myself and thanks for your PR. Since it only just changes one command and we have multiple other changes, we won't merge your PR. But we will leave thank you msg in our package.

@chtsai0105
Copy link
Author

Got it! Since the issue is going to be resolved I'm closing this PR.

@chtsai0105 chtsai0105 closed this Sep 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants