
[REVIEW]: Pyrodigal: Python bindings and interface to Prodigal, an efficient method for gene prediction in prokaryotes. #4296

Closed
editorialbot opened this issue Apr 4, 2022 · 41 comments

Labels: accepted, Cython, published, Python, recommend-accept, review, Shell, TeX

@editorialbot commented Apr 4, 2022

Submitting author: @althonos (Martin Larralde)
Repository: https://github.com/althonos/pyrodigal/
Branch with paper.md (empty if default branch): paper
Version: v1.0.0
Editor: @mikldk
Reviewers: @Gab0, @standage
Archive: 10.5281/zenodo.6472583

Status


Status badge code:

HTML: <a href="https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e"><img src="https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e/status.svg"></a>
Markdown: [![status](https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e/status.svg)](https://joss.theoj.org/papers/b86b61e7d2f4ac0140ae284cfe910a6e)

Reviewers and authors:

Please avoid lengthy details of difficulties in the review thread. Instead, please create a new issue in the target repository and link to those issues (especially acceptance-blockers) by leaving comments in the review thread below. (For completists: if the target issue tracker is also on GitHub, linking the review thread in the issue or vice versa will create corresponding breadcrumb trails in the link target.)

Reviewer instructions & questions

@Gab0 & @standage, your review will be checklist-based. Each of you will have a separate checklist that you should update when carrying out your review.
First of all, you need to run this command in a separate comment to create the checklist:

@editorialbot generate my checklist

The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. Any questions/concerns please let @mikldk know.

Please start on your review when you are able, and be sure to complete it within the next six weeks at the very latest.

Checklists

📝 Checklist for @standage

📝 Checklist for @Gab0

@editorialbot

Hello humans, I'm @editorialbot, a robot that can help you with some common editorial tasks.

For a list of things I can do to help you, just type:

@editorialbot commands

For example, to regenerate the paper pdf after making changes in the paper's md or bib files, type:

@editorialbot generate pdf

@editorialbot

Software report:

github.com/AlDanial/cloc v 1.88  T=0.16 s (479.7 files/s, 188929.2 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
JSON                             3              0              0          11512
SVG                              4              1             83           7787
Cython                          19            706            741           3927
Python                          20            324            231           1700
YAML                             3             22             67            627
Markdown                         4            221              0            543
C                                4             25             39            215
C/C++ Header                     5             38             17            167
TeX                              1              9              0            123
reStructuredText                 9            109            132             93
DOS Batch                        1              8              1             26
CSS                              1              4              7             16
make                             1              4              7              9
-------------------------------------------------------------------------------
SUM:                            75           1471           1325          26745
-------------------------------------------------------------------------------


gitinspector failed to run statistical information for the repository

@editorialbot

Wordcount for paper.md is 1246

@editorialbot

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1038/s41592-018-0046-7 is OK
- 10.1145/1250734.1250746 is OK
- 10.1093/nar/gkz1002 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1101/2021.05.03.442509 is OK
- 10.1093/nargab/lqaa109 is OK
- 10.1101/2021.07.21.453177 is OK
- 10.1093/bioinformatics/btz714 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@editorialbot

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@mikldk commented Apr 4, 2022

@Gab0, @standage: Thanks for agreeing to review. Please carry out your review in this issue by first creating a checklist (@editorialbot generate my checklist) and then updating it as the review proceeds. The reviewer guidelines are available here: https://joss.readthedocs.io/en/latest/reviewer_guidelines.html. If possible, create issues (and cross-reference them) in the submission's repository to avoid overly specific discussions in this review thread.

If you have any questions or concerns please let me know.

@standage commented Apr 4, 2022

Review checklist for @standage

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at https://github.com/althonos/pyrodigal/?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI-approved software license?
  • Contribution and authorship: Has the submitting author (@althonos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@Gab0 commented Apr 11, 2022

Review checklist for @Gab0

Conflict of interest

  • I confirm that I have read the JOSS conflict of interest (COI) policy and that: I have no COIs with reviewing this work or that any perceived COIs have been waived by JOSS for the purpose of this review.

Code of Conduct

General checks

  • Repository: Is the source code for this software available at https://github.com/althonos/pyrodigal/?
  • License: Does the repository contain a plain-text LICENSE file with the contents of an OSI-approved software license?
  • Contribution and authorship: Has the submitting author (@althonos) made major contributions to the software? Does the full list of paper authors seem appropriate and complete?
  • Substantial scholarly effort: Does this submission meet the scope eligibility described in the JOSS guidelines?

Functionality

  • Installation: Does installation proceed as outlined in the documentation?
  • Functionality: Have the functional claims of the software been confirmed?
  • Performance: If there are any performance claims of the software, have they been confirmed? (If there are no claims, please check off this item.)

Documentation

  • A statement of need: Do the authors clearly state what problems the software is designed to solve and who the target audience is?
  • Installation instructions: Is there a clearly-stated list of dependencies? Ideally these should be handled with an automated package management solution.
  • Example usage: Do the authors include examples of how to use the software (ideally to solve real-world analysis problems)?
  • Functionality documentation: Is the core functionality of the software documented to a satisfactory level (e.g., API method documentation)?
  • Automated tests: Are there automated tests or manual steps described so that the functionality of the software can be verified?
  • Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support

Software paper

  • Summary: Has a clear description of the high-level functionality and purpose of the software for a diverse, non-specialist audience been provided?
  • A statement of need: Does the paper have a section titled 'Statement of Need' that clearly states what problems the software is designed to solve and who the target audience is?
  • State of the field: Do the authors describe how this software compares to other commonly-used packages?
  • Quality of writing: Is the paper well written (i.e., it does not require editing for structure, language, or writing quality)?
  • References: Is the list of references complete, and is everything cited appropriately that should be cited (e.g., papers, datasets, software)? Do references in the text use the proper citation syntax?

@standage

The author describes Pyrodigal, a software package implementing a Python interface to the widely-adopted gene prediction program Prodigal. The Prodigal source code is bundled with Pyrodigal and compiled as a Python extension, which Pyrodigal accesses directly via a Python interface rather than indirect system calls to a separate executable. In addition to Python bindings, the author describes performance optimizations made to Prodigal’s core scoring procedure.

Prodigal is the de facto standard for prokaryotic gene prediction in the genome informatics community, and Python is the community’s de facto general-purpose scripting language. Exposing Prodigal’s internals comprehensively and efficiently to Python programs is a valuable contribution, as underscored by Pyrodigal's usage statistics and its integration into various published workflows. I am happy to endorse this submission for publication at JOSS.

  • I was able to install Pyrodigal painlessly from both PyPI and Bioconda. The PyPI/pip package management system is much less heavyweight than the Conda/Bioconda system, so the author’s point about not requiring a separately handled external executable with pip installation is well taken.
  • Source code is only one component of many in an open source project. The Pyrodigal repository also includes CI configuration, an active issue tracker, descriptions of the software (both extensive and detailed), example usages, and comprehensive API documentation. The quality of these resources is at the level one would expect for a mature open source project with numerous contributors.
  • I was able to write a custom Python script with Pyrodigal in just a few minutes, and it performed as expected on data from a handful of bacterial isolates. Performance versus equivalent Prodigal invocations is consistent with the benchmarking reported in the submitted manuscript. I do not have convenient access to binned metagenomic assemblies for testing those features.
  • Pyrodigal performed well on a battery of tests designed to stress or break the code. Exception messages were easy to understand, with no verbose stack traces or cryptic error codes/messages.
  • The manuscript is clearly written, and after multiple read-throughs I cannot find any significant errors or mistakes requiring revision.

@standage

Hi @mikldk, no concerns from me on this submission so I think this concludes my review!

@Gab0 commented Apr 17, 2022

Overview

This project consists of Cython bindings to the widely used prokaryotic ORF finder software Prodigal.

The premise is interesting, allowing Prodigal functions to be called directly from Python code as an alternative to binary calls using subprocess.call or similar. This enables a more flexible pipeline design.
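
To illustrate, here is a minimal sketch of the two call styles, assuming a single-record FASTA (paths are placeholders, and the prodigal flags mirror a basic run):

import subprocess

import pyrodigal

# Traditional approach: shell out to the prodigal executable and
# read back the files it writes to disk (placeholder paths).
subprocess.run(
    ["prodigal", "-i", "genome.fna", "-a", "proteins.faa"],
    check=True,
)

# Pyrodigal approach: call the same machinery in-process and keep
# the predictions as Python objects, with no intermediate files.
with open("genome.fna") as f:
    sequence = "".join(line.strip() for line in f if not line.startswith(">"))

orf_finder = pyrodigal.OrfFinder(meta=True)
genes = orf_finder.find_genes(sequence.encode("ascii"))
print(len(genes), "genes predicted")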

The performance claims regarding CPU time, RAM use, and storage can be useful when analyzing large sets of genomes, a scenario that has become increasingly common in recent years.

The paper includes all the required information. It also includes interesting details about performance optimizations done on top of the Prodigal codebase.

Setup

The README.md describes the following installation methods:

  • Python's standard package repository, PyPI.
  • Bioconda.

Of these, I tested the PyPI method and it works nicely; I don't use Bioconda.

There seem to be two undocumented installation methods:

  • The AUR package named python-pyrodigal.
  • Build from the git source.

Both methods work nicely; the local build succeeds after pulling all the submodules into the local repository.

Tests

I wrote a quick comparison test for Pyrodigal against the Prodigal binary. The workflow is a basic run using default parameters. The Pyrodigal binary runs much faster (1.8x here) than the original Prodigal executable, reflecting the performance improvements described in the paper.

[ ] However, in this run, calling Pyrodigal routines through the library took more time than the binary: execution times were 15.5 s against 2.5 s. Maybe I did not understand the sample code shown in the README.md and wrote incorrect library calls.

I would like to ask the author a few questions about these benchmark results:

  1. Do both snippets shown below have equivalent functionality?

a) pyrodigal -i <genome.fna> -o genes -a proteins.faa

b)

import pyrodigal
from Bio import SeqIO

orf_finder = pyrodigal.OrfFinder(meta=True)

record = SeqIO.read(GENOMEfna, "fasta")  # GENOMEfna: path to the input FASTA (defined elsewhere)
genome_input = bytes(record.seq)         # raw sequence bytes for find_genes

predictions = orf_finder.find_genes(genome_input)
with open("proteins.faa", "w", encoding="utf-8") as f:
    for i, gene in enumerate(predictions):
        f.write(f">{record.id}_{i+1}\n")
        f.write(gene.translate() + "\n")
  2. If they are equivalent, is this performance difference expected? Maybe it is related to something in my setup. If applicable, an explanation of the performance differences between the library and the binary would be a good addition to the documentation.

Suggestions

  • The package could be improved by including complete example workflows, possibly with sample data. The code snippets provided in the README.md are basic demonstrations that require modification before they can be executed. They also include import statements that won't work with current package versions (e.g. import Bio.SeqIO, import multiprocessing.pool).

  • Automatic linters such as autopep8 could be applied to the source code, at least for the pure Python files, since I don't know whether they are effective for Cython code. That would improve code readability by enforcing line-length limits and fixing other minor formatting issues.

Summary

  1. The documentation and the paper are well written and contain all the relevant information.

  2. Setup works nicely.

  3. The expected performance gains are observable when comparing the Pyrodigal executable against the original Prodigal.

  4. I have a few questions about library performance in the Tests section. These correspond to the remaining checkboxes in my review.

The project is solid overall. My review status is minor revisions.

@althonos commented Apr 18, 2022

Hi @Gab0, thanks for the review! A few answers/updates:

There seem to be two undocumented installation methods:

The AUR package named python-pyrodigal.
Build from the git source.

I added the AUR installation method recently, which is why it wasn't documented yet; after your review I updated the documentation to mention it, so it will be rendered on the website with the next release. Installing from source is not covered in the README.md (because it's much less convenient than PyPI / Bioconda), but there is already a section on the [Install page](https://pyrodigal.readthedocs.io/en/stable/install.html) of the documentation.

I wrote a quick comparison test for Pyrodigal against the Prodigal binary. The workflow is a basic run using default parameters. The Pyrodigal binary runs much faster (1.8x here) than the original Prodigal executable, reflecting the performance improvements described in the paper.

However, in this run, calling Pyrodigal routines through the library took more time than the binary: execution times were 15.5 s against 2.5 s. Maybe I did not understand the sample code shown in the README.md and wrote incorrect library calls.

I would like to ask the author a few questions about these benchmark results:
Do both snippets shown below have equivalent functionality?

If they are equivalent, is this performance difference expected? Maybe it is related to something in my setup. If applicable, an explanation of the performance differences between the library and the binary would be a good addition to the documentation.

No, the two snippets are not equivalent: in the command-line invocation you called Pyrodigal/Prodigal in single mode (which trains and predicts on the same sequence), but in the script you used meta mode (as shown by the first line, orf_finder = pyrodigal.OrfFinder(meta=True)), which predicts genes using the best-scoring pre-trained data. Meta mode is much slower: it can compare up to 50 different pre-trained parameter sets to find the best one, and although a filter on the GC% of the input often narrows this down to around 10 bins, that is still a significant increase in runtime. A single-mode equivalent is sketched below.
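
For illustration, a minimal single-mode sketch that should be roughly equivalent to the command-line run (the path is a placeholder, and bytes(record.seq) assumes Biopython >= 1.78):

import pyrodigal
from Bio import SeqIO

record = SeqIO.read("genome.fna", "fasta")  # placeholder path
sequence = bytes(record.seq)                # raw sequence bytes (Biopython >= 1.78)

orf_finder = pyrodigal.OrfFinder()          # meta=False, i.e. single mode
orf_finder.train(sequence)                  # train on the input itself...
genes = orf_finder.find_genes(sequence)     # ...then predict on the same sequence

for i, gene in enumerate(genes):
    print(f">{record.id}_{i+1}")
    print(gene.translate())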

The package could be improved by including complete example workflows, possibly with sample data. The code snippets provided in the README.md are basic demonstrations that require modification before they can be executed. They also include import statements that won't work with current package versions (e.g. import Bio.SeqIO, import multiprocessing.pool).

Pyrodigal is not designed to perform I/O, which is why I included examples with Biopython and Scikit-Bio, two general-purpose bioinformatics libraries that can be used to load data from FASTA or GenBank files. multiprocessing.pool is a standard-library module, so it will work in any Python setup; a sketch of that pattern follows below. I can however update the examples with paths to real data so that they are immediately testable.
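
As a rough sketch of the thread-pool pattern the README examples aim at (contigs.fna is a hypothetical multi-record FASTA, and Biopython >= 1.78 is assumed):

import multiprocessing.pool

import pyrodigal
from Bio import SeqIO

records = list(SeqIO.parse("contigs.fna", "fasta"))  # placeholder path
orf_finder = pyrodigal.OrfFinder(meta=True)

# The gene finding runs in compiled code, so mapping find_genes over
# the contigs with a plain thread pool already gives a parallel speedup.
with multiprocessing.pool.ThreadPool() as pool:
    results = pool.map(orf_finder.find_genes, [bytes(r.seq) for r in records])

for record, genes in zip(records, results):
    print(record.id, len(genes), "genes predicted")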

Automatic linters such as autopep8 could be applied to the source code, at least for the pure Python files, since I don't know whether they are effective for Cython code. That would improve code readability by enforcing line-length limits and fixing other minor formatting issues.

I have linted the Python sources with black, but to my knowledge there are no source-code autoformatting tools for Cython. In addition, I'd rather not reformat the Cython code: some code blocks have been ported from the C code of Prodigal, and although they don't look very nice, keeping a 1-to-1 match between the Prodigal code and the Pyrodigal code for these sections improves readability when comparing the two. It also makes the code easier to maintain if the Prodigal code changes later, because diffing will be easier.

@Gab0 commented Apr 18, 2022

@althonos Cool, I'll address each remaining topic:

  • Extra setup methods: installation via AUR and from source both work, which was my main concern here.

  • Library performance issues: Ah, meta mode is a separate operation mode, got it. I'm now able to recreate the standard Prodigal workflow by training pyrodigal.OrfFinder on the same sequence it is going to predict on. This workflow is actually included in the README.md, but wouldn't pointing out which example corresponds to the standard Prodigal operation make things clearer for new users? The performance of the library is as good as the Pyrodigal binary, so there is no problem here.

  • Example import issues: Let's forget this part; I messed up the imports.

  • Code linter: black fixed the issues with Python sources. Good point on leaving the Cython code as close as possible to prodigal.

Ok, so all my concerns have been addressed. My suggestions about working examples and documentation updates can be implemented in future releases if the author feels they will improve the package.

My review is completed, and I endorse this submission for publication at JOSS. (@mikldk)

@althonos

Thanks for your reviews @Gab0 and @standage !

@mikldk commented Apr 19, 2022

@Gab0, @standage: Thank you very much for your effort with the very constructive reviews.

@althonos:

  • Please have a final read-through of the paper, checking language etc.
  • Have a final check of the proofs with @editorialbot generate pdf
  • In general, be sure that all versions are correct and that the repo is in the state that should be published.
  • Please make a tagged release and archive (e.g. with Zenodo) as described here, and report the version number and archive DOI in this thread. Please verify that the archive deposit has the correct metadata (title and author list), or edit these if that is not the case.

@althonos

@editorialbot generate pdf

@editorialbot

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@althonos

Hi @mikldk,

I have now made a new stable release (v1.0.0) and finalized the paper. The DOI for the v1.0.0 version on Zenodo is 10.5281/zenodo.6472583.

@althonos commented Apr 20, 2022

Please verify that the archive deposit has the correct metadata (title and author list), or edit these if that is not the case.

Does this mean the archive needs to be named the same as the JOSS paper title?

EDIT: Guess I do need to; I've updated the Zenodo record.

@mikldk commented Apr 25, 2022

@editorialbot set 10.5281/zenodo.6472583 as archive

@editorialbot

Done! Archive is now 10.5281/zenodo.6472583

@mikldk commented Apr 25, 2022

@editorialbot set v1.0.0 as version

@editorialbot

Done! version is now v1.0.0

@mikldk commented Apr 25, 2022

@editorialbot check references

@editorialbot

Reference check summary (note 'MISSING' DOIs are suggestions that need verification):

OK DOIs

- 10.1038/s41592-018-0046-7 is OK
- 10.1145/1250734.1250746 is OK
- 10.1093/nar/gkz1002 is OK
- 10.1186/1471-2105-11-119 is OK
- 10.1101/2021.05.03.442509 is OK
- 10.1093/nargab/lqaa109 is OK
- 10.1101/2021.07.21.453177 is OK
- 10.1093/bioinformatics/btz714 is OK

MISSING DOIs

- None

INVALID DOIs

- None

@mikldk commented Apr 25, 2022

@editorialbot generate pdf

@editorialbot

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@mikldk commented Apr 25, 2022

@althonos Thanks. A few comments:

  • There is a period at the end of the title (both paper and Zenodo); please remove it.
  • Summary: "One of the key step for analysing" -> steps

Please change and recompile (no need to make a new release, just update the Zenodo metadata and the paper).

@althonos

@mikldk

Thanks, done.

@althonos

@editorialbot generate pdf

@editorialbot

👉📄 Download article proof 📄 View article proof on GitHub 📄 👈

@mikldk commented Apr 25, 2022

@editorialbot recommend-accept

@editorialbot

Attempting dry run of processing paper acceptance...

@editorialbot

👋 @openjournals/joss-eics, this paper is ready to be accepted and published.

Check final proof 👉 openjournals/joss-papers#3171

If the paper PDF and the deposit XML files look good in openjournals/joss-papers#3171, then you can now move forward with accepting the submission by compiling again with the command @editorialbot accept

editorialbot added the recommend-accept label on Apr 25, 2022
@danielskatz

@editorialbot accept

looks good!

@editorialbot

Doing it live! Attempting automated processing of paper acceptance...

@editorialbot

🐦🐦🐦 👉 Tweet for this paper 👈 🐦🐦🐦

@editorialbot

🚨🚨🚨 THIS IS NOT A DRILL, YOU HAVE JUST ACCEPTED A PAPER INTO JOSS! 🚨🚨🚨

Here's what you must now do:

  1. Check final PDF and Crossref metadata that was deposited 👉 Creating pull request for 10.21105.joss.04296 joss-papers#3172
  2. Wait a couple of minutes, then verify that the paper DOI resolves https://doi.org/10.21105/joss.04296
  3. If everything looks good, then close this review issue.
  4. Party like you just published a paper! 🎉🌈🦄💃👻🤘

Any issues? Notify your editorial technical team...

editorialbot added the accepted and published labels on Apr 25, 2022
@althonos commented Apr 25, 2022

Thank you all! This was really a pleasant process and I'll definitely consider JOSS again when I have something else suitable for publication.

@danielskatz

Congratulations to @althonos (Martin Larralde)!!

And thanks to @Gab0 and @standage for reviewing, and to @mikldk for editing!
We couldn't do this without you!

@editorialbot

🎉🎉🎉 Congratulations on your paper acceptance! 🎉🎉🎉

If you would like to include a link to your paper from your README use the following code snippets:

Markdown:
[![DOI](https://joss.theoj.org/papers/10.21105/joss.04296/status.svg)](https://doi.org/10.21105/joss.04296)

HTML:
<a style="border-width:0" href="https://doi.org/10.21105/joss.04296">
  <img src="https://joss.theoj.org/papers/10.21105/joss.04296/status.svg" alt="DOI badge" >
</a>

reStructuredText:
.. image:: https://joss.theoj.org/papers/10.21105/joss.04296/status.svg
   :target: https://doi.org/10.21105/joss.04296

