Internal KeyError when processing externally generated gene calls #12

Closed
jdwinkler-lanzatech opened this issue Jun 15, 2021 · 7 comments

Comments

@jdwinkler-lanzatech

Hi,

I am using gunc v1.0.2 in a fresh Conda environment to perform chimerism checks on a few test genomes. I have already generated Prodigal calls, so I am providing them as the input fasta after setting the gene_calls flag.

Initial command:

gunc run --db_file /home/annotator/database/gunc_db_gtdb95.dmnd --input_fasta proteins.faa --file_suffix .faa --gene_calls --threads 64 --out_dir /tmp/tmppbxk2bwz

After DIAMOND finishes running, I consistently get the following error:

  Traceback (most recent call last):
    File "/opt/conda/envs/gunc_env/bin/gunc", line 10, in <module>
      sys.exit(main())
    File "/opt/conda/envs/gunc_env/lib/python3.9/site-packages/gunc/gunc.py", line 567, in main
      run(args)
    File "/opt/conda/envs/gunc_env/lib/python3.9/site-packages/gunc/gunc.py", line 475, in run
      gunc_output = run_gunc(diamond_outfiles, genes_called, args.out_dir,
    File "/opt/conda/envs/gunc_env/lib/python3.9/site-packages/gunc/gunc.py", line 389, in run_gunc
      gene_call_count = genes_called[basename]
  KeyError: 'proteins.faa'

The basename is generated by the following line of code, so I am not sure of the exact source of the error:

basename = os.path.basename(diamond_file).split('.diamond.')[0]

@jdwinkler-lanzatech
Author

I think I figured it out. It looks like the basename extracted from the DIAMOND output file includes the original extension, while the gene counts are stored under the filename without the extension. Running GUNC with a gene-calls file that has no extension (e.g. proteins) works fine.
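
A minimal sketch of the mismatch described above. Only the `.split('.diamond.')` expression is taken from this issue; the DIAMOND output file name and the contents of `genes_called` are hypothetical and stand in for whatever GUNC actually produces.

```python
import os

# Hypothetical DIAMOND output path; the real naming scheme may differ.
diamond_file = "/tmp/out/proteins.faa.diamond.example.out"

# Expression quoted in the issue: the extension survives the split.
basename = os.path.basename(diamond_file).split('.diamond.')[0]

# Assumed shape of the gene-count mapping: keyed without the extension.
genes_called = {"proteins": 1234}

# The lookup genes_called[basename] would raise KeyError: 'proteins.faa'.
missing = basename not in genes_called
```

Under these assumptions the key with the extension is absent from the mapping, which matches the reported traceback.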

@fullama
Contributor

fullama commented Jun 15, 2021

Yeah, I just saw that too. I can only apologise for that. I will release a new version with a fix as soon as I can; for now, you could run without an extension. Thanks for pointing it out, though. I don't know how that got through testing.

@jdwinkler-lanzatech
Author

No worries, happens to all of us.

@fullama
Contributor

fullama commented Jun 16, 2021

I just released v1.0.3, which should fix this issue. It is taking a while to appear on conda, but once it's there this issue should be solved!

If you notice any other issues, do let me know. Thanks again.

@fullama fullama closed this as completed Jun 16, 2021
@cpauvert

Hi @fullama,
I am reviving this issue because I could reproduce the same error as @jdwinkler-lanzatech in GUNC v1.0.5.
For externally generated gene calls to work, one must still pass the (.faa) FASTA file without the extension. This can be problematic for workflow management systems that rely on filenames to infer dependencies (e.g. Snakemake).

As far as I understand, there are two possible fixes:
A. Keeping the suffix (.faa) in the genes_counts json file (and modifying the code that writes it)
B. Adding a specific condition (if gene_calls) to the extraction of the basename

Solution A could be complemented by a final pass that reads the JSON, trims the suffixes, and rewrites it, for compatibility purposes.
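
Option B could look roughly like the sketch below. This is not GUNC's code: `gene_count_key` and its `file_suffix` parameter are hypothetical names, and the DIAMOND output path is invented for illustration. Only the `.split('.diamond.')` expression comes from this thread.

```python
import os

def gene_count_key(diamond_file, file_suffix=""):
    """Hypothetical helper: derive the lookup key for the gene-count mapping."""
    basename = os.path.basename(diamond_file).split('.diamond.')[0]
    # Trim the input suffix, if one was given, so the key has no extension.
    if file_suffix and basename.endswith(file_suffix):
        basename = basename[: -len(file_suffix)]
    return basename

# Assumed mapping, keyed by file name without extension.
genes_called = {"proteins": 1234}
key = gene_count_key("/tmp/out/proteins.faa.diamond.example.out", ".faa")
gene_call_count = genes_called[key]
```

Trimming the suffix only when it is actually present leaves inputs without an extension unchanged.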
Let me know your thoughts and keep up the good work!
Best,

@fullama
Contributor

fullama commented Apr 29, 2022

Hi, I'm not sure I understand what's happening here. Did you run with --file_suffix .faa?

@cpauvert

cpauvert commented May 2, 2022

My bad, it worked with --file_suffix .faa
Thanks @fullama for the quick reply.
