GTDBtk is being run on only one (of several) samples, even when running with the option --binning_map_mode own #641
Comments
Some more info, in case it's helpful for identifying the issue: the file
|
The GTDB-Tk process only takes bins which pass a given (user-specifiable) quality threshold (a minimum completeness of 50% and maximum contamination of 10%). Could it be that none of the bins from the samples that aren't going into GTDB-Tk meet these thresholds? |
I've also observed the same behavior when running the pipeline on my own data (7 samples), where I only had GTDBtk running on a single sample. I think it's highly unlikely that only that specific sample had bins passing the quality thresholds, and none of the other 6 samples did... |
Certainly sounds like it could be a bug in that case - could you double-check the bin qualities from the busco_summary.tsv file first, just to make sure? If there are bins passing the default threshold there that might help in figuring out where the problem is. |
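For anyone wanting to do this check without a spreadsheet, here is a minimal sketch of the threshold filter described above (minimum completeness 50%, maximum contamination 10%). The column names `bin`, `completeness` and `contamination` are placeholders and not the real headers of busco_summary.tsv, so they would need adjusting to the actual file. Treating the thresholds as inclusive is also an assumption and may not match what the pipeline does.

```python
import csv
import io


def passing_bins(tsv_text, min_completeness=50.0, max_contamination=10.0,
                 name_col="bin", comp_col="completeness", cont_col="contamination"):
    """Return the names of bins that meet both quality thresholds.

    Column names are hypothetical placeholders; rows with missing or
    non-numeric values are skipped rather than treated as failures.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    keep = []
    for row in reader:
        try:
            name = row[name_col]
            comp = float(row[comp_col])
            cont = float(row[cont_col])
        except (KeyError, TypeError, ValueError):
            continue
        # Inclusive comparison is an assumption made for this sketch.
        if comp >= min_completeness and cont <= max_contamination:
            keep.append(name)
    return keep


# Made-up example rows, not taken from any real run.
EXAMPLE = (
    "bin\tcompleteness\tcontamination\n"
    "sampleA.bin1\t92.5\t1.2\n"
    "sampleA.bin2\t40.0\t0.5\n"
    "sampleB.bin1\t75.0\t12.0\n"
    "sampleB.bin2\t50.0\t10.0\n"
)

if __name__ == "__main__":
    print(passing_bins(EXAMPLE))
```

If bins from more than one sample come out of a filter like this, then the quality thresholds alone would not explain why only one sample reaches GTDB-Tk.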
Had a look now inside I'm also uploading a screenshot with some filters I made in Excel. |
Here is the log file for the single GTDBtk job that was run for the data posted above, as well as the full log and execution trace for the pipeline run. command.log.txt |
@prototaxites |
I’m dealing with the same problem. I tried several times by modifying the quality thresholds for GTDB, and found that only the first bin meeting the thresholds gets processed by GTDB-tk classifywf. It doesn’t process all the bins that meet the thresholds. |
I've done some brute-force print debugging (sorry manuscript...) - I can't find any obvious fault with the code in the GTDBTK subworkflow that collects the QC metrics, matches them to the input channel, and filters using the provided thresholds - at least with the test data (with a low enough min_completeness threshold), samples from both test samples should get through to the GTDBTK_CLASSIFYWF process. So I'm not really sure what's up here... |
Thanks a lot for looking into this. By any chance, have you tried running the commands from the first post above, and did you manage to reproduce the issue in this way? |
Afraid I didn’t try to reproduce your example above as I was poking about inside a Gitpod instance - which doesn’t have enough juice to actually run GTDB! But I had hoped to catch any obvious problems with the filtering… |
I also provided a directory for --gtdb_db like amizeranschi did, then only the first bin that met the GTDB threshold was processed by GTDBTK_CLASSIFYWF. So, I tried again by providing a tar.gz file for --gtdb_db, and this time it worked. All bins that met the thresholds were processed. |
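As a stopgap based on the observation above, an already-extracted database directory could be packed back into a tar.gz archive and that archive passed to --gtdb_db. The sketch below uses only Python's standard tarfile module; the paths in the usage comment are hypothetical. The layout the pipeline expects inside the archive is not confirmed anywhere in this thread, so keeping the directory itself as the top-level entry is an assumption.

```python
import tarfile
from pathlib import Path


def pack_directory(src_dir, archive_path):
    """Pack src_dir into a gzip-compressed tar archive and return its path.

    The directory itself is kept as the top-level entry of the archive
    (an assumption about the expected layout, not a confirmed requirement).
    """
    src = Path(src_dir)
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(src), arcname=src.name)
    return Path(archive_path)


# Hypothetical usage; both paths are placeholders:
# pack_directory("/data/gtdb_db", "/data/gtdb_db.tar.gz")
```

Compressing a large database this way can be slow, so this only makes sense as a workaround until the directory input itself is fixed.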
I can also confirm that running the pipeline with I'm guessing that the quickest "fix" for this issue would be to change the docs to include instructions for only passing the tar.gz archive to the https://github.com/nf-core/mag/blob/dev/nextflow_schema.json#L539 |
Ah - I think I found the problem! Could one of you test this fix with GTDB directory input, replacing the standard nextflow run command with the below?
edit: updated the fix |
I ran your branch now (took 7 hours to run on the data from the first post above) with |
Description of the bug
When running the pipeline on several samples, it appears that the pipeline only runs GTDBtk on a single sample, which is featured in the output. This happens even when running the pipeline with the option --binning_map_mode own.
The job list printed by the pipeline also suggests that there was only one GTDBtk job spawned, for one sample:
This issue can be reproduced using data from the test_all profile, as shown below.
Command used and terminal output
Relevant files
No response
System information
No response