JOSS review #3
Hi @gavinmdouglas,
Major comments:
I added a section about the choice of taxonomic classifiers on the usage page of the documentation ec6a7da
The default database was removed from the conda package itself, but remains in the GitHub repository. 25c3079
I added a page in the documentation on how to build custom sources using Kraken (a simple pipeline is available here: github.com/maxibor/kraken-nf) ec6a7da
I added an example analysis with sourcepredict in the Documentation and a comment on how to interpret the results. 44e7f81
The example analysis now contains a comparison of Sourcepredict with Sourcetracker 281083d
README comments:
Done a56f524
Updated the README with links to the docs and example files, as well as a rough overview of sourcepredict. e3fdac4 4115f40
The example file is now simplified so the logs are shorter. I prefer to keep the logs, though, so that users know what to expect when running sourcepredict.
The t-SNE embedding is available to the user if they use the
The integration and unit tests are run on each push to GitHub by Travis (also on forks).
Added issue templates 108fb0b and contributing guidelines 5b8f718
Manuscript edits / comments
General
Added an example in the field of microbial archaeology. 57a14e1
The tree is based on the NCBI taxonomy; I added a sentence to mention it: "A weighted UniFrac (default) pairwise distance matrix is then computed on the merged and normalized training dataset."
Regarding the default reference sources, I prefer to keep them out of the paper because they're not part of the method itself. Therefore, based on your comment, I removed the data from the conda package itself (they will remain on GitHub though), but I added the origin of the data in the README. a56f524
I'm not sure either.
Reformatted the mentioned paragraph and a few others, but kept the methods unchanged; I find it easier to read this way. 9e35118
Unfortunately no.
Summary section
Changed to suggested formulation. 23d516a
Changed to suggestion. ee20cb9
Changed to: "One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken" 85977cf
Changed to suggested formulation. 4c97a30
Method section
Done
Done
Changed to "derivative samples". These samples are not really simulated, but are derived from the sink samples. Commit: d3cdf5a
My major comments have been addressed, but I do have a few remaining minor points which I think should be addressed:
edit: I also think commenting specifically on the use case of identifying the degree of contamination in samples would be useful. I.e. do you think your tool is appropriate for this purpose?
Changed to "count and/or abundance table" in the README 9847a5c and the docs 78a5858.
Fixed with d7123ec. I had some double trailing spaces lying around; I've removed them.
Fixed with 7eba279 and 0fb3873
Fixed with 1c9fde9
I made an example of this, comparing Sourcepredict and Sourcetracker on data generated from mixed sources: sourcepredict.readthedocs.io/en/latest/mixed_prop.html
Great, all of my comments have been addressed!
Thanks a lot @gavinmdouglas for all the comments and suggestions!
My pleasure!
Related to openjournals/joss-reviews#1540
I think sourcepredict is really interesting and could potentially be very useful for metagenomics researchers. However, I think several areas of the documentation need to be clarified.
Major comments:
I think the key question users will have is which taxonomic classifier should be used. Different classifiers can produce starkly different results, so from the user's perspective it's very unclear how the choice of classifier will affect the outcome. If you have only tested Kraken1 with sourcepredict, I think this should be clarified. Either way, I think it should be emphasized that exactly the same taxonomic classifier should be used for the reference sources as for the sink samples (if you disagree, then this should definitely be clarified).
Would you recommend users use the default database, or do you envision users plugging in custom reference files? The latter makes more sense to me, since the combination of human, dog, and stool doesn't seem very generalizable. If you agree, then more documentation on how to construct custom reference files is needed, and adding a sentence in the manuscript related to this point would be useful.
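As a concrete illustration of the custom-source workflow, here is a minimal sketch of merging per-sample Kraken-style report files into a single taxid-by-sample count table. The six-column report layout (percentage, clade reads, direct reads, rank code, taxid, name) follows Kraken's report format, but the function name and output structure are hypothetical and would need adapting to whatever input format sourcepredict expects.

```python
import csv
from collections import defaultdict

def merge_kraken_reports(reports):
    """Merge per-sample Kraken-style reports into one taxid x sample table.

    `reports` maps a sample name to an iterable of tab-separated report
    lines (percentage, clade reads, direct reads, rank code, taxid, name).
    Returns {taxid: {sample: direct_read_count}}, keeping only taxa with
    at least one read assigned directly to them.
    """
    table = defaultdict(dict)
    for sample, lines in reports.items():
        for row in csv.reader(lines, delimiter="\t"):
            taxid, direct_reads = row[4].strip(), int(row[2])
            if direct_reads > 0:
                table[taxid][sample] = direct_reads
    return dict(table)
```

Running every source and every sink through the same classifier and database before a merge step like this is what keeps the resulting table comparable.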
A more real-world usage example is needed, since I think having the "ERR1915662_copy" sample is confusing. Ideally, an example of how to identify an outlier contaminated sample, or how to discriminate one dog and one human sample, would be more informative. A couple of sentences recommending how users should interpret the output and what next steps to take would also be useful (e.g. do you have any recommendations for hard cut-offs for saying that a sample is sufficiently of a single source? If not, then this would be good to clarify, since I'm sure users will be trying to figure that out).
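To make the cut-off question concrete, here is a hedged sketch of how a user might flag sink samples whose best source proportion falls below a threshold. The CSV layout (one row per sink sample, one proportion column per source) and the function name are assumptions for illustration, not sourcepredict's documented output.

```python
import csv
from io import StringIO

def flag_mixed_samples(pred_csv, cutoff=0.8):
    """Flag sink samples whose best source proportion is below `cutoff`,
    i.e. candidates for mixed or contaminated samples.

    `pred_csv`: CSV text with a header (sample name, then source names)
    and one row per sink sample with per-source proportions.
    Returns {sample: {source: proportion}} for the flagged samples.
    """
    reader = csv.reader(StringIO(pred_csv))
    sources = next(reader)[1:]
    flagged = {}
    for row in reader:
        if not row:
            continue
        sample, props = row[0], [float(x) for x in row[1:]]
        if max(props) < cutoff:
            flagged[sample] = dict(zip(sources, props))
    return flagged
```

A hard threshold like 0.8 is only a placeholder; any recommended value would have to come from the authors' own validation.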
No performance claims were made in the manuscript so I think it's beyond the scope of my review to request performance validations. However, I think some sort of validation of the accuracy of the tool or comparison with a tool like SourceTracker would be really useful and probably increase usage of the tool.
README comments:
Remove extra wget
Typo in README dog example
The README example would benefit from a very brief description of what the input file is, what is being performed, and what the output file means, in addition to the rough expected runtime.
I suggest that the intermediate log output be removed from the README for ease of reading
Outputting the t-SNE embedding and showing an example plot would also be really informative for users looking at the test example.
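For readers wondering what such a plot involves, the sketch below projects a samples-by-taxa matrix to two dimensions. Plain PCA (via an SVD) stands in here for sourcepredict's t-SNE embedding purely to show the plotting step, and `embed_2d` is an invented helper, not part of the tool.

```python
import numpy as np

def embed_2d(counts):
    """Project a samples x taxa count matrix to 2-D with PCA.

    PCA is used as a stand-in for a t-SNE embedding: the point is only
    to show how exported per-sample coordinates can be scatter-plotted.
    """
    x = counts - counts.mean(axis=0)   # center each taxon column
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T                # one (x, y) pair per sample
```

With matplotlib, the coordinates could then be plotted, e.g. `plt.scatter(coords[:, 0], coords[:, 1])` with points colored by source label.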
Possible to add a brief description of how unit tests can be run, or are these run automatically by conda? (checkbox: "Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?")
Community guidelines needed (checkbox: "Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support")
Manuscript edits / comments
General
Adding a sentence that gives better motivation for why source detection is useful is needed. Can you give an example of when the source of a sample is unknown? Also, I think many users would be interested in identifying outlier contaminated samples using this approach even if the sampling source is known. If you think the method would be appropriate for these cases, bringing up that example would help motivate the need for sourcepredict.
Also, adding a sentence or two mentioning what the reference sources are and what organisms they're based on is needed. The reference tree used for the UniFrac calculation also needs to be mentioned.
Spacing between paragraphs is inconsistent - I'm not sure if this is due to the PDF creation or in the original document.
I think many of the paragraphs throughout that are single sentences could be combined. For instance, the paragraph starting with "One aspect of metagenomics" should be combined with the preceding paragraph.
Missing DOIs for 2 citations - are these available?
Summary section
"a Python Conda package" sounds odd to me; is it more accurate to say "a Python package distributed through Conda"?
"will compute the organism taxonomic..." --> "compute the organism taxonomic..."
The sentence starting with "These taxonomic classifiers" is essentially repeating the preceding sentence (e.g. "community composition of organisms" vs "organism taxonomic composition") and giving Kraken as an example. I think this should be re-worded to make it less redundant.
I suggest the last sentence of the summary be re-worded to make it more clear, something like:
"However, the Sourcepredict results are more easily interpreted, since the samples are embedded in a human-observable low-dimensional space. This embedding is performed by a dimension reduction algorithm followed by K-Nearest-Neighbours (KNN) classification."
Method section
Write out what GMPR stands for on first usage
I suggest using the term "source" throughout in place of "class" to avoid confusion. They are currently used interchangeably.
Would it be more accurate to call the "estimated samples" "simulated samples" instead?
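For reference, GMPR stands for Geometric Mean of Pairwise Ratios, a normalization designed for zero-inflated count data. A rough sketch of the idea, assuming a taxa-by-samples count matrix (an illustration of the method's definition, not sourcepredict's implementation):

```python
import numpy as np

def gmpr_size_factors(counts):
    """GMPR (Geometric Mean of Pairwise Ratios) size factors.

    `counts`: array with rows = taxa, columns = samples (raw counts).
    For each pair of samples, take the median count ratio over taxa
    observed in both; a sample's size factor is the geometric mean of
    its median ratios against every sample. Dividing a sample's counts
    by its factor normalizes for sequencing depth.
    """
    n_samples = counts.shape[1]
    factors = np.zeros(n_samples)
    for i in range(n_samples):
        log_ratios = []
        for k in range(n_samples):
            # taxa with nonzero counts in both samples i and k
            shared = (counts[:, i] > 0) & (counts[:, k] > 0)
            if shared.any():
                ratio = np.median(counts[shared, i] / counts[shared, k])
                log_ratios.append(np.log(ratio))
        factors[i] = np.exp(np.mean(log_ratios))
    return factors
```

The medians make the factors robust to a few wildly differing taxa, and restricting each pairwise comparison to shared taxa is what handles the zero inflation typical of taxonomic count tables.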