BUSCO update #77

Closed
skrakau opened this issue Jul 17, 2020 · 5 comments
@skrakau
Member

skrakau commented Jul 17, 2020

BUSCO v3.0.2 is currently failing on some datasets, including the test profile when scratch = false. The results and the errors thrown differ between scratch = true and scratch = false (on CFC), which we cannot currently explain. Previously such errors were ignored; this has now changed (see #68 and #72).
In addition, it seems that when no tblastn hits are found, BUSCO raises an error that is not handled properly.

To gain more control, update BUSCO to v4.0.6 and handle the case of no tblastn hits. Use a single parameter to pass one of the following: the path to an already downloaded db (test whether this works with --offline), the name of a db for automatic download, or the auto-lineage option.
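As a rough sketch of how such a single parameter could be mapped onto BUSCO v4 options, assuming only the flags named in this issue (--lineage_dataset, --auto-lineage, --offline): the function name `lineage_args`, the parameter name `busco_reference`, and the special value "auto" are hypothetical, and whether a local path combined with --offline behaves as intended still needs to be tested.

```python
import os
from typing import List


def lineage_args(busco_reference: str) -> List[str]:
    """Map one (hypothetical) pipeline parameter to BUSCO v4 lineage flags.

    Assumption: the value is either the literal "auto", a path to an
    already downloaded dataset directory, or a dataset name such as
    "bacteria_odb10" that BUSCO should download itself.
    """
    if busco_reference == "auto":
        # Let BUSCO choose the lineage on its own.
        return ["--auto-lineage"]
    if os.path.isdir(busco_reference):
        # Local copy of the dataset: avoid any download attempt.
        # Untested assumption: --offline works together with a local path.
        return ["--lineage_dataset", busco_reference, "--offline"]
    # Otherwise treat the value as a dataset name for automatic download.
    return ["--lineage_dataset", busco_reference]
```

For example, `lineage_args("bacteria_odb10")` yields the flags for automatic download, provided no directory of that name exists in the working directory.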

However, currently there is an issue with the download of the BUSCO databases: https://busco.ezlab.org/frames/bact.htm.
See also https://gitlab.com/ezlab/busco/-/issues/293. So I need to wait until this works again, to test this.

Additionally, for offline use, one can also download the whole dataset https://busco-data.ezlab.org/v4/data/ and add the path to the custom config file. I think this should work both with --lineage_dataset bacteria_odb10 and --auto-lineage

@skrakau skrakau self-assigned this Jul 17, 2020
@skrakau skrakau added the bug Something isn't working label Jul 17, 2020
@d4straub
Collaborator

If all options were implemented, I think the most requested cases would be --lineage_dataset bacteria_odb10 >> --auto-lineage > --lineage_dataset archaea_odb10 > anything else. This is because the majority of genomes are usually bacterial, and BUSCO evaluation results are most comparable when all bins are evaluated with the same reference data. Making only a single db available for now would therefore be sufficient, I think. But the --auto-lineage option looks so tempting ;)

@skrakau skrakau mentioned this issue Jul 30, 2020
8 tasks
@skrakau
Member Author

skrakau commented Aug 13, 2020

UPDATE:

With BUSCO version 4.0.6, frequent, non-reproducible errors occur. They are caused by a replace("faa", "fna") call that corrupts Nextflow file paths whose hash id happens to contain the substring "faa". I prepared a fix for BUSCO (https://gitlab.com/ezlab/busco/-/issues/305) and am currently waiting on it.

We need a new BUSCO release before preparing the pipeline release.
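To illustrate the failure mode, here is a minimal sketch. It is not BUSCO's actual code and not the proposed fix; the function names and the example path are made up. Nextflow work directories are named after hexadecimal hashes, so the letters "faa" can appear in them by chance, which would also explain why the errors look non-reproducible.

```python
import os


def naive_fna_path(faa_path: str) -> str:
    # Replaces EVERY occurrence of "faa", including one inside a
    # hashed work directory name, not just the file extension.
    return faa_path.replace("faa", "fna")


def safe_fna_path(faa_path: str) -> str:
    # Only swap the extension and leave the rest of the path untouched.
    root, ext = os.path.splitext(faa_path)
    return root + ".fna" if ext == ".faa" else faa_path


# Hypothetical path whose hash directory happens to contain "faa".
example = "/work/3c/1faa27b0e9/run_bin1/seqs/gene1.faa"

print(naive_fna_path(example))  # /work/3c/1fna27b0e9/run_bin1/seqs/gene1.fna
print(safe_fna_path(example))   # /work/3c/1faa27b0e9/run_bin1/seqs/gene1.fna
```

The naive version points at a directory that does not exist, so any later attempt to open the file fails.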

@ropolomx

Hi @skrakau. Are you referring to this type of error? I am getting a lot of these with BUSCO version 3.0.2 when running mag with revision: 8586c49 [dev]

Aug-26 20:41:52.612 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'busco (MEGAHIT-SRR9030455.61.fa)'

Caused by:
  Process `busco (MEGAHIT-SRR9030455.61.fa)` terminated with an error exit status (1)

Command executed:

  run_BUSCO.py             --in MEGAHIT-SRR9030455.61.fa             --lineage_path bacteria_odb9             --cpu "4"             --blast_single_core             --mode genome             --out MEGAHIT-SRR9030455.61.fa             >MEGAHIT-SRR9030455.61.fa_busco_log.txt
  cp run_MEGAHIT-SRR9030455.61.fa/short_summary_MEGAHIT-SRR9030455.61.fa.txt short_summary_MEGAHIT-SRR9030455.61.fa.txt

  for f in run_MEGAHIT-SRR9030455.61.fa/single_copy_busco_sequences/*faa; do
      [ -e "$f" ] && cat run_MEGAHIT-SRR9030455.61.fa/single_copy_busco_sequences/*faa >MEGAHIT-SRR9030455.61.fa_buscos.faa || touch MEGAHIT-SRR9030455.61.fa_buscos.faa
      break
  done
  for f in run_MEGAHIT-SRR9030455.61.fa/single_copy_busco_sequences/*fna; do
      [ -e "$f" ] && cat run_MEGAHIT-SRR9030455.61.fa/single_copy_busco_sequences/*fna >MEGAHIT-SRR9030455.61.fa_buscos.fna || touch MEGAHIT-SRR9030455.61.fa_buscos.fna
      break
  done

Command exit status:
  1

Command output:
  (empty)

Command error:
  cp: cannot stat ‘run_MEGAHIT-SRR9030455.61.fa/short_summary_MEGAHIT-SRR9030455.61.fa.txt’: No such file or directory

@skrakau
Member Author

skrakau commented Aug 27, 2020

Hi @ropolomx, to be precise, the problem I described affects BUSCO v4, and it prevented us from updating BUSCO to solve some other issues. But BUSCO 3.0.2 has a related problem, which can cause errors like the one you posted above: BUSCO itself did not return an error, but because an output file is missing, the downstream cp command failed.
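The guard logic that would avoid this kind of downstream failure can be sketched as follows. This is only an illustration in Python, not the pipeline's shell code and not necessarily what the eventual fix does; the helper name `collect_summary` is hypothetical, and the file name pattern is taken from the log above.

```python
import shutil
from pathlib import Path
from typing import Optional


def collect_summary(run_dir: str, name: str, dest_dir: str) -> Optional[Path]:
    """Copy BUSCO's short summary out of run_dir if it was produced.

    Returns the destination path, or None when BUSCO exited without
    writing the summary, so the caller can decide how to report that
    instead of failing on a blind copy.
    """
    summary = Path(run_dir) / f"short_summary_{name}.txt"
    if not summary.is_file():
        return None
    return Path(shutil.copy(summary, Path(dest_dir) / summary.name))
```

Called with the run directory from the log, this would return None instead of raising, making the missing output an explicit case to handle.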

@skrakau skrakau added this to the 1.1.0 milestone Aug 27, 2020
@skrakau skrakau mentioned this issue Sep 23, 2020
8 tasks
@skrakau
Member Author

skrakau commented Sep 29, 2020

Solved in #103 :)

@skrakau skrakau closed this as completed Sep 29, 2020