Writing gene-related data failed #232

Closed
dmartimarti opened this issue May 30, 2024 · 3 comments

Comments

@dmartimarti

Hi, first of all, thanks for creating and supporting this amazing software. It has been very helpful so far.

I am building a pangenome from several E. coli strains we have sequenced in our lab. I annotated them with Bakta using the latest complete db (5.1), and then fed these annotations to the complete workflow:

ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f

However, when it writes the gene-related data to the HDF5 file (pangenome.h5), I get a TypeError for the ``name`` column:

Traceback (most recent call last):
  File "tables/tableextension.pyx", line 1676, in tables.tableextension.Row.__setitem__
TypeError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin", line 10, in <module>
    sys.exit(main())
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/main.py", line 219, in main
    ppanggolin.workflow.all.launch(args)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 288, in launch
    launch_workflow(args, panrgp=True, panmodule=True)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/workflow/all.py", line 61, in launch_workflow
    write_pangenome(pangenome, filename, args.force, disable_bar=args.disable_prog_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeBinaries.py", line 711, in write_pangenome
    write_annotations(pangenome, h5f, disable_bar=disable_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 342, in write_annotations
    write_genedata(pangenome, h5f, annotation, genedata2gene, disable_bar)
  File "/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/ppanggolin/formats/writeAnnotations.py", line 309, in write_genedata
    genedata_row["name"] = genedata.name
  File "tables/tableextension.pyx", line 1681, in tables.tableextension.Row.__setitem__
TypeError: invalid type (<class 'str'>) for column ``name``
/rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/lib/python3.10/site-packages/tables/file.py:113: UnclosedFileWarning:

Closing remaining open file: ppanggolin_results/pangenome.h5

Here is the complete output from the run.

2024-05-30 12:39:04 utils.py:l168 INFO	Command: /rds/general/user/dmarti14/home/anaconda3/envs/ppanggo/bin/ppanggolin all --anno genomes.gbff.txt --output ppanggolin_results -c 2 --verbose 2 -f
2024-05-30 12:39:04 utils.py:l169 INFO	PPanGGOLiN version: 2.0.5
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--anno: genomes.gbff.txt" has been specified in the command line with a non-default value. Its value overwrites the default value (None).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--force: True" has been specified in the command line with a non-default value. Its value overwrites the default value (False).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--output: ppanggolin_results" has been specified in the command line with a non-default value. Its value overwrites the default value (ppanggolin_output_DATE2024-05-30_HOUR12.39.04_PID2061566).
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--verbose: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l668 DEBUG	4 all parameters have non-default value: cpu=2, force=True, output=ppanggolin_results, verbose=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing annotate arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 annotate parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing cluster arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 cluster parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing graph arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing partition arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 partition parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing rarefaction arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 rarefaction parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing rgp arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing spot arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing module arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 module parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing draw arguments in config file.
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing write_pangenome arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 write_pangenome parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l679 DEBUG	Parsing write_genomes arguments in config file.
2024-05-30 12:39:04 utils.py:l529 DEBUG	The parameter "--cpu: 2" has been specified in the command line with a non-default value. Its value overwrites the default value (1).
2024-05-30 12:39:04 utils.py:l709 DEBUG	1 write_genomes parameters have a non-default value: cpu=2
2024-05-30 12:39:04 utils.py:l722 INFO	11 parameters have a non-default value.
2024-05-30 12:39:04 annotate.py:l503 INFO	Reading genomes.gbff.txt the list of genome files ...
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:01<00:00,  5.76file/s]
2024-05-30 12:39:06 annotate.py:l535 INFO	gene identifiers used in the provided annotation files were unique, PPanGGOLiN will use them.
2024-05-30 12:39:06 writeBinaries.py:l709 INFO	Writing genome annotations...
2024-05-30 12:39:06 writeAnnotations.py:l71 DEBUG	Writing 8 genomes
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 160547.52genome/s]
2024-05-30 12:39:06 writeAnnotations.py:l105 DEBUG	Writing 1600 contigs
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1600/1600 [00:00<00:00, 652365.74contigs/s]
2024-05-30 12:39:06 writeAnnotations.py:l148 DEBUG	Writing 36656 genes
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 36656/36656 [00:00<00:00, 169713.94gene/s]
2024-05-30 12:39:06 writeAnnotations.py:l297 DEBUG	Writing 36509 gene-related data (can be lower than the number of genes)
 93%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌           | 33897/36509 [00:00<00:00, 687257.45genedata/s]

I can supply a few of the annotation files that I'm using as a test if necessary.

Thanks a lot.

@JeanMainguy
Member

Hi @dmartimarti,
A few of the annotation files would indeed be very helpful to check what's going on here.
Thanks!

@JeanMainguy
Member

The error seems a bit similar to the one encountered in these issues: #95, #175 and #222. However, here the problem seems to be with the gene name and not the product.

You might try to catch any problematic characters with this grep command on your gbff files:
LC_ALL=C grep -n -P [$'\x80'-$'\xFF'] *.g*ff
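
For anyone without a PCRE-enabled grep, the same check can be sketched in Python. This is only a rough equivalent of the command above, not part of PPanGGOLiN: the function names `find_non_ascii_lines` and `scan_annotation_files` are made up for this example, and the `*.g*ff` pattern is simply copied from the grep call.

```python
from pathlib import Path


def find_non_ascii_lines(path):
    """Return (line_number, text) pairs for lines that contain any byte >= 0x80."""
    hits = []
    with open(path, "rb") as handle:
        for line_number, raw in enumerate(handle, start=1):
            if not raw.isascii():
                # Decode only for display; undecodable bytes become U+FFFD.
                text = raw.decode("utf-8", errors="replace").rstrip("\r\n")
                hits.append((line_number, text))
    return hits


def scan_annotation_files(directory=".", pattern="*.g*ff"):
    """Map each matching file that has non-ASCII lines to the list of those lines."""
    results = {}
    for path in sorted(Path(directory).glob(pattern)):
        hits = find_non_ascii_lines(path)
        if hits:
            results[str(path)] = hits
    return results
```

Calling `scan_annotation_files()` in the directory that holds the annotation files returns a dictionary keyed by file name, so the offending line numbers can be printed or inspected directly.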

@dmartimarti
Author

Hi @JeanMainguy

That was it! For the record, it was again one of these double-wing motif proteins (gene mmcQ) that was responsible for the error.
I tried removing the non-ASCII characters from the gff3 files and this time it worked with a test subset I was playing with.
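
The thread does not show how the characters were removed or which character was at fault, so the following is only one possible approach, sketched in Python. The `REPLACEMENTS` table and the names `to_ascii` and `clean_file` are assumptions made for this example: common typographic dashes and quotes are mapped to ASCII, accented letters are reduced to their base letter, and anything else outside ASCII is dropped.

```python
import unicodedata

# Hypothetical table for punctuation that Unicode normalization does not turn into ASCII.
REPLACEMENTS = {
    "\u2010": "-",  # hyphen
    "\u2011": "-",  # non-breaking hyphen
    "\u2013": "-",  # en dash
    "\u2014": "-",  # em dash
    "\u2018": "'",  # left single quote
    "\u2019": "'",  # right single quote
    "\u201c": '"',  # left double quote
    "\u201d": '"',  # right double quote
}
_TABLE = str.maketrans(REPLACEMENTS)


def to_ascii(text):
    """Return an ASCII-only version of text."""
    text = text.translate(_TABLE)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop whatever still cannot be encoded as ASCII (combining marks, other symbols).
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


def clean_file(src, dst):
    """Write an ASCII-only copy of src to dst, keeping the original line endings."""
    with open(src, encoding="utf-8", errors="replace", newline="") as fin, \
            open(dst, "w", encoding="ascii", newline="") as fout:
        for line in fin:
            fout.write(to_ascii(line))
```

Writing to a separate `dst` keeps the original annotation untouched, which makes it easy to diff the two files and confirm that only the intended characters changed.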

Thanks a lot for your prompt help!
