JOSS review #3

Closed · gavinmdouglas opened this issue Jul 9, 2019 · 6 comments

gavinmdouglas commented Jul 9, 2019
Related to openjournals/joss-reviews#1540

I think sourcepredict is really interesting and could potentially be very useful for metagenomics researchers. However, I think several areas of the documentation need to be clarified.

Major comments:

  • I think the key question users will have is which taxonomic classifier should be used. Different classifiers can result in starkly different results, and so from the user's perspective it's very unclear how the use of different classifiers will affect the result. If you have only tested Kraken1 with sourcepredict I think this should be clarified. Either way, I think it should be emphasized that exactly the same taxonomic classifier should be used for the reference sources as for the sink samples (if you disagree then this should definitely be clarified).

  • Would you recommend users use the default database, or do you envision users plugging in custom reference files? The latter makes more sense to me since the combination of human, dog, and stool doesn't seem very generalizable. If you agree then more documentation on how to construct custom reference files is needed, and adding a sentence in the manuscript related to this point would be useful.

  • A more real-world usage example is needed since I think having the "ERR1915662_copy" sample is confusing. Ideally an example of how to identify an outlier contaminated sample or how to discriminate 1 dog and 1 human sample would be more informative. A couple of sentences advising users on how to interpret the output and what next steps to take would also be useful (e.g. do you have any recommendations for hard cut-offs for saying that a sample is sufficiently of a single source? If not, then this would be good to clarify since I'm sure users will be trying to figure that out).

  • No performance claims were made in the manuscript so I think it's beyond the scope of my review to request performance validations. However, I think some sort of validation of the accuracy of the tool or comparison with a tool like SourceTracker would be really useful and probably increase usage of the tool.

README comments:

  • Remove the extra `wget` typo in the README dog example

  • The README example would benefit from a very brief description of what the input file is, what is being performed, and what the output file means in addition to the rough expected runtime.

  • I suggest that the intermediate log output be removed from the README for ease of reading

  • Outputting the t-SNE embedding and showing an example plot would also be really informative for users looking at the test example.

  • Is it possible to add a brief description of how the unit tests can be run, or are these run automatically by conda? (checkbox: Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?)

  • Community guidelines needed (checkbox: Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support)

Manuscript edits / comments

General

  • Adding a sentence that gives a better motivation for why source detection is useful is needed. Can you give an example of when the source of a sample is unknown? Also, I think many users would be interested in identifying outlier contaminated samples using this approach even if the sampling source is known. If you think the method would be appropriate for these cases I think it would help motivate the need for sourcepredict to bring up that example.

  • Also, a sentence or two mentioning what the reference sources are and what organisms they're based on is needed. You also need to mention what the reference tree is for the UniFrac calculation.

  • Spacing between paragraphs is inconsistent - I'm not sure if this is due to the PDF creation or in the original document.

  • I think many of the paragraphs throughout that are single sentences could be combined. For instance, the paragraph starting with "One aspect of metagenomics" should be combined with the preceding paragraph.

  • Missing DOIs for 2 citations - are these available?

Summary section

  • "a Python Conda package" sounds odd to me; is it more accurate to say "a Python package distributed through Conda"?

  • "will compute the organism taxonomic..." --> "compute the organism taxonomic..."

  • The sentence starting with "These taxonomic classifiers" is essentially repeating the preceding sentence (e.g. "community composition of organisms" vs "organism taxonomic composition") and giving Kraken as an example. I think this should be re-worded to make it less redundant.

  • I suggest the last sentence of the summary be re-worded to make it more clear, something like:
    "However, the Sourcepredict results are easier interpreted since the samples are embedded in a human observable low-dimensional space. This embedding is performed by a dimension reduction algorithm followed by K-Nearest-Neighbours (KNN) classification."

Method section

  • Write out what GMPR stands for on first usage

  • I suggest using the term "source" throughout in place of "class" to avoid confusion. They are currently used interchangeably.

  • Would it be more accurate to call the "estimated samples" "simulated samples" instead?

maxibor (Owner) commented Jul 18, 2019

Hi @gavinmdouglas,
Thanks a lot for the time you spent on this review; your comments were very valuable!
Please find below the modifications I've made and my answers to your comments.

Major comments:

I think the key question users will have is which taxonomic classifier should be used. Different classifiers can result in starkly different results, and so from the user's perspective it's very unclear how the use of different classifiers will affect the result. If you have only tested Kraken1 with sourcepredict I think this should be clarified. Either way, I think it should be emphasized that exactly the same taxonomic classifier should be used for the reference sources as for the sink samples (if you disagree then this should definitely be clarified).

I added a section about the choice of taxonomic classifiers on the usage page of the documentation ec6a7da

Would you recommend users use the default database, or do you envision users plugging in custom reference files? The latter makes more sense to me since the combination of human, dog, and stool doesn't seem very generalizable. If you agree then more documentation on how to construct custom reference files is needed, and adding a sentence in the manuscript related to this point would be useful.

The default database was removed from the conda package itself, but it remains in the GitHub repository. 25c3079

I added a page in the documentation on how to build custom sources using Kraken (a simple pipeline is available at github.com/maxibor/kraken-nf) ec6a7da

A more real-world usage example is needed since I think having the "ERR1915662_copy" sample is confusing. Ideally an example of how to identify an outlier contaminated sample or how to discriminate 1 dog and 1 human sample would be more informative. A couple of sentences advising users on how to interpret the output and what next steps to take would also be useful (e.g. do you have any recommendations for hard cut-offs for saying that a sample is sufficiently of a single source? If not, then this would be good to clarify since I'm sure users will be trying to figure that out).

I added an example analysis with sourcepredict in the Documentation and a comment on how to interpret the results. 44e7f81

No performance claims were made in the manuscript so I think it's beyond the scope of my review to request performance validations. However, I think some sort of validation of the accuracy of the tool or comparison with a tool like SourceTracker would be really useful and probably increase usage of the tool.

The example analysis now contains a comparison of Sourcepredict with Sourcetracker 281083d

README comments:

Remove the extra `wget` typo in the README dog example

Done a56f524

The README example would benefit from a very brief description of what the input file is, what is being performed, and what the output file means in addition to the rough expected runtime.

Updated the README with links to the doc and example files, as well as a rough overview of sourcepredict e3fdac4 4115f40

I suggest that the intermediate log output be removed from the README for ease of reading

The example file is now simplified so the logs are shorter. I prefer to keep them so that users know what to expect when running sourcepredict.

Outputting the t-SNE embedding and showing an example plot would also be really informative for users looking at the test example.

The t-SNE embedding is available to the user if they use the -e flag.
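For users looking at the test example, the embedding can then be plotted with a few lines of pandas/matplotlib. This is only a minimal sketch: the file name and the `PC1`, `PC2`, and `labels` column names are assumptions, so check them against the header of the file actually written with `-e`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names -- inspect the header of the
# embedding file written by `sourcepredict ... -e embedding.csv`.
emb = pd.read_csv("embedding.csv", index_col=0)
fig, ax = plt.subplots()
for label, group in emb.groupby("labels"):
    ax.scatter(group["PC1"], group["PC2"], s=15, label=label)
ax.set_xlabel("dimension 1")
ax.set_ylabel("dimension 2")
ax.legend(title="source")
plt.show()
```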

Is it possible to add a brief description of how the unit tests can be run, or are these run automatically by conda? (checkbox: Automated tests: Are there automated tests or manual steps described so that the function of the software can be verified?)

The integration and unit tests are run on each push to GitHub by Travis (also on forks).

Community guidelines needed (checkbox: Community guidelines: Are there clear guidelines for third parties wishing to 1) Contribute to the software 2) Report issues or problems with the software 3) Seek support)

Added issue templates 108fb0b and contributing guidelines 5b8f718

Manuscript edits / comments

General

Adding a sentence that gives a better motivation for why source detection is useful is needed. Can you give an example of when the source of a sample is unknown?

Added an example in the field of microbial archaeology. 57a14e1

Also, I think many users would be interested in identifying outlier contaminated samples using this approach even if the sampling source is known. If you think the method would be appropriate for these cases I think it would help motivate the need for sourcepredict to bring up that example.

Also, a sentence or two mentioning what the reference sources are and what organisms they're based on is needed. You also need to mention what the reference tree is for the UniFrac calculation.

The tree is based on the NCBI taxonomy, I added a sentence to mention it "A weighted Unifrac (default) pairwise distance matrix is then computed on the merged and normalized training dataset $D_{ref}$ and test dataset $D_{sink}$ with scikit-bio, using the NCBI taxonomy as a reference tree."
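As an illustration of that step (a minimal sketch with toy data, not Sourcepredict's actual code), weighted UniFrac distances can be computed with scikit-bio as below; in practice the counts come from the merged $D_{ref}$ + $D_{sink}$ table and the tree from a Newick export of the NCBI taxonomy.

```python
from io import StringIO
from skbio import TreeNode
from skbio.diversity import beta_diversity

# Toy stand-ins: three TAXIDs, a tiny rooted tree with branch lengths,
# and an (n_samples x n_taxa) count table.
taxids = ["9606", "9615", "562"]
tree = TreeNode.read(StringIO("((9606:1,9615:1):1,562:2):0;"))
counts = [
    [10, 0, 5],   # sample_1
    [2, 8, 4],    # sample_2
    [0, 12, 1],   # sample_3
]
sample_ids = ["sample_1", "sample_2", "sample_3"]

wu_dm = beta_diversity(
    "weighted_unifrac", counts, ids=sample_ids, otu_ids=taxids, tree=tree
)
print(wu_dm)  # skbio DistanceMatrix of pairwise sample distances
```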

Regarding the default reference sources, I prefer to keep them out of the paper because they're not part of the method itself. Therefore, based on your comment, I removed the data from the conda package itself (they will remain on GitHub, though), but I added the origin of the data in the README a56f524

Spacing between paragraphs is inconsistent - I'm not sure if this is due to the PDF creation or in the original document.

I'm not sure either.

I think many of the paragraphs throughout that are single sentences could be combined. For instance, the paragraph starting with "One aspect of metagenomics" should be combined with the preceding paragraph.

Reformatted the mentioned paragraph and a few others, but kept the methods unchanged; I find it easier to read that way. 9e35118

Missing DOIs for 2 citations - are these available?

Unfortunately, no.

Summary section

"a Python Conda package" sounds odd to me; is it more accurate to say "a Python package distributed through Conda"?

Changed to suggested formulation. 23d516a

"will compute the organism taxonomic..." --> "compute the organism taxonomic..."

Changed to suggestion. ee20cb9

The sentence starting with "These taxonomic classifiers" is essentially repeating the preceding sentence (e.g. "community composition of organisms" vs "organism taxonomic composition") and giving Kraken as an example. I think this should be re-worded to make it less redundant.

Changed to: "One aspect of metagenomics is investigating the community composition of organisms within a sequencing sample with tools known as taxonomic classifiers, such as Kraken" 85977cf

I suggest the last sentence of the summary be re-worded to make it more clear, something like:
"However, the Sourcepredict results are easier interpreted since the samples are embedded in a human observable low-dimensional space. This embedding is performed by a dimension reduction algorithm followed by K-Nearest-Neighbours (KNN) classification."

Changed to suggested formulation. 4c97a30
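For readers curious what that last step looks like, here is a minimal scikit-learn sketch of the same idea (toy data, not Sourcepredict's actual code): embed a precomputed pairwise distance matrix in two dimensions with t-SNE, fit a KNN classifier on the labeled source samples, and predict the unlabeled sinks.

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neighbors import KNeighborsClassifier

# Toy stand-ins: `dist` plays the role of the weighted UniFrac matrix
# over all source + sink samples; the first 10 samples are labeled
# sources, the last 2 are unlabeled sinks.
rng = np.random.default_rng(42)
pts = rng.random((12, 5))
dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
labels = np.array(["dog"] * 5 + ["human"] * 5)
n_src = 10

# Embed all samples together, then classify sinks from their neighbors.
emb = TSNE(n_components=2, metric="precomputed", init="random",
           perplexity=5.0, random_state=42).fit_transform(dist)
knn = KNeighborsClassifier(n_neighbors=3).fit(emb[:n_src], labels)
print(knn.predict(emb[n_src:]))        # predicted source of each sink
print(knn.predict_proba(emb[n_src:]))  # per-source class probabilities
```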

Method section

Write out what GMPR stands for on first usage

Done
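For context, GMPR (Geometric Mean of Pairwise Ratios) computes one size factor per sample: for every pair of samples, take the median of count ratios over the taxa observed in both, then take the geometric mean of those medians. A minimal numpy sketch of the idea (not Sourcepredict's actual implementation):

```python
import numpy as np

def gmpr_size_factors(counts: np.ndarray) -> np.ndarray:
    """GMPR size factors for an (n_taxa, n_samples) raw count matrix.

    Normalize sample j by dividing counts[:, j] by factors[j].
    """
    n_samples = counts.shape[1]
    factors = np.empty(n_samples)
    for j in range(n_samples):
        medians = []
        for k in range(n_samples):
            if k == j:
                continue
            # Ratios are only taken over taxa present in both samples.
            shared = (counts[:, j] > 0) & (counts[:, k] > 0)
            if shared.any():
                medians.append(np.median(counts[shared, j] / counts[shared, k]))
        factors[j] = np.exp(np.log(medians).mean())  # geometric mean
    return factors

toy = np.array([[10, 20, 0], [5, 12, 7], [0, 3, 9], [8, 15, 4]])
print(gmpr_size_factors(toy))
```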

I suggest using the term "source" throughout in place of "class" to avoid confusion. They are currently used interchangeably.

Done

Would it be more accurate to call the "estimated samples" "simulated samples" instead?

Changed to "derivative samples". These samples are not really simulated, but are derived from the sink samples.

Commit: d3cdf5a

maxibor added this to the ADRSM 0.4 milestone Jul 18, 2019
gavinmdouglas (Author) commented Jul 19, 2019

My major comments have been addressed, but I do have a few remaining minor points which I think should be addressed:

  • The example file is referred to as an OTU table, but this is actually the output of Kraken, right? OTU typically just refers to clustered amplicon reads (e.g. based on 16S), so this should be re-phrased, or at the very least it should be clarified what you mean by "OTU table" in the example (i.e. a taxonomic abundance table).

  • I think it's fine to write up the paper as a series of very short paragraphs, but the paragraphs should be denoted consistently. It's currently a little hard to read since the spacing is inconsistent - I think there should either be an indentation at the start of each paragraph or a blank line separating them.

  • Typo in documentation: "the taxonomic classifier used to produce the source OTU count table must be the same as the one used to produced the sink OTU count table."

  • Manuscript typo: "Sourcepredict results are more easily interpreted..."

edit: I also think commenting specifically on the use case of identifying the degree of contamination in samples would be useful. I.e. do you think your tool is appropriate for this purpose?

maxibor (Owner) commented Jul 26, 2019

The example file is referred to as an OTU table, but this is actually the output of Kraken, right? OTU typically just refers to clustered amplicon reads (e.g. based on 16S), so this should be re-phrased, or at the very least it should be clarified what you mean by "OTU table" in the example (i.e. a taxonomic abundance table).

Changed to "count and/or abundance table" in the README 9847a5c and the docs 78a5858.

I think it's fine to write up the paper as a series of very short paragraphs, but the paragraphs should be denoted consistently. It's currently a little hard to read since the spacing is inconsistent - I think there should either be an indentation at the start of each paragraph or a blank line separating them.

Fixed with d7123ec. I had some double trailing spaces lying around; I've removed them.

Typo in documentation: "the taxonomic classifier used to produce the source OTU count table must be the same as the one used to produced the sink OTU count table."

Fixed with 7eba279 and 0fb3873

Manuscript typo: "Sourcepredict results are more easily interpreted..."

Fixed with 1c9fde9

I also think commenting specifically on the use case of identifying the degree of contamination in samples would be useful. I.e. do you think your tool is appropriate for this purpose?

I made an example of this comparing Sourcepredict and Sourcetracker on data generated from mixed sources: sourcepredict.readthedocs.io/en/latest/mixed_prop.html

gavinmdouglas (Author) commented

Great, all of my comments have been addressed!

maxibor (Owner) commented Jul 26, 2019

Thanks a lot @gavinmdouglas for all the comments and suggestions!

gavinmdouglas (Author) commented

My pleasure!
