Environmental contaminants #1

Open
mw55309 opened this issue Aug 2, 2023 · 8 comments

Comments

@mw55309

mw55309 commented Aug 2, 2023

Thanks for posting this public rebuttal! Good science is open science.

There's been some suggestion that removing environmental contaminants, as done in the original paper, removes the cancer sub-type signal. See attached and the tweet below.

Whilst that analysis is sadly not open, it would be good to respond.

anon

@mw55309
Author

mw55309 commented Aug 2, 2023

@gregpoore
Owner

Thanks for the question. I'm happy to respond, clarify a few things, and provide some reassuring data:

  1. When we worked on the original paper, there was no 'gold standard' list of microbes derived from tumors, and TCGA lacked experimental contamination controls. This forced us to use tools like decontam (Davis et al. 2018 Microbiome) and 'black lists' of genera to infer putative contaminants. However, these approaches have limitations, with both false negatives and false positives, leading us to state in the original paper: "We stress that these in silico decontamination methods are not substitutes for implementing gold-standard microbiology practices on cancer samples, including sterile processing, sterile-certified reagents, negative blanks of reagents processed from start to finish..." For reference, these issues have been well described by others (e.g., Austin et al. 2023 Nature Biotech).
  2. Fortunately, a study from the Weizmann Institute of Science (WIS) that implemented those "gold-standard microbiology practices on cancer samples" appeared in Science just a few months after our original paper (Nejman et al. 2020). That list of decontaminated bacteria was expanded during our collaboration with them on fungi (Narunsky-Haziza et al. 2022 Cell), collectively providing a 'gold standard' list of bacteria and fungi found in tumors. I note that between these two studies, >1100 experimental contamination controls were employed in parallel alongside the tumors.
  3. With this background in mind, our re-analyses of TCGA in the bioRxiv rebuttal and Narunsky-Haziza et al. 2022 Cell took the more conservative approach of focusing on WIS-overlapping taxa, followed by repeating all analyses. In other words, we intersected TCGA microbial features with highly-decontaminated taxa from an independent cohort of WIS tumors. Moreover, these taxa have much better supporting data than the contaminant vs. non-contaminant calls listed in Table S6 from our original paper. (Important note: Table S6 of the original paper was satisfactory for March 2020, but there are better approaches now).
  4. With the above in mind, I repeated the same machine learning analyses in this GitHub repo after subsetting the Gihawi et al. raw data to only the WIS-overlapping genera (n=149 genera). This is saved in the new R script, tcga_gihawhi_rebuttal_WIS_subset_3Aug23.R, and the results are approximately the same:

image
(A) After subsetting to WIS-overlapping genera (n=149 genera), we evaluated if multiclass machine learning could discriminate between cancer types using the raw data from all HMS PT samples. Gradient boosting machines were applied with 10-fold cross-validation such that every sample was left out once, and their predictions were used to generate a confusion matrix. The mean balanced accuracy was 93.62% in comparison to the no information rate (NIR) of 54.84% (p<2.2e-16).
(B) After subsetting to WIS-overlapping genera (n=149 genera), 10-fold cross-validation using gradient boosting machines was applied on HMS BDN samples. The balanced accuracy was 88.82% in comparison to the NIR of 80% (p=4.4e-5).
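
For readers unfamiliar with the two metrics quoted in the caption above, here is a minimal sketch of how a mean balanced accuracy and a no information rate (NIR) can be derived from a confusion matrix. It assumes the one-vs-rest convention (per-class balanced accuracy is the mean of sensitivity and specificity, then averaged across classes) and takes the NIR as the share of the largest class; the R script in this repository may compute them differently. The matrix `toy_cm` and both function names are made up for illustration and are not data from this analysis.

```python
from typing import List


def per_class_balanced_accuracy(cm: List[List[int]]) -> List[float]:
    """cm[i][j] = number of samples whose true class is i and predicted class is j.

    Assumes every class has at least one true sample and at least one
    sample from another class (otherwise a division by zero occurs).
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true k, predicted other
        fp = sum(cm[i][k] for i in range(n)) - tp   # true other, predicted k
        tn = total - tp - fn - fp
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        scores.append((sensitivity + specificity) / 2)
    return scores


def no_information_rate(cm: List[List[int]]) -> float:
    """Accuracy obtained by always predicting the most common true class."""
    total = sum(sum(row) for row in cm)
    return max(sum(row) for row in cm) / total


# Hypothetical 3-class confusion matrix (rows = true class), NOT real results.
toy_cm = [
    [50, 2, 0],
    [3, 20, 1],
    [0, 1, 10],
]

scores = per_class_balanced_accuracy(toy_cm)
mean_balanced_accuracy = sum(scores) / len(scores)
nir = no_information_rate(toy_cm)
print(f"mean balanced accuracy = {mean_balanced_accuracy:.4f}, NIR = {nir:.4f}")
```

Comparing against the NIR matters most when classes are imbalanced, since a classifier that always predicts the majority class already reaches the NIR.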

@gregpoore gregpoore reopened this Aug 3, 2023
@travisgibson

travisgibson commented Aug 3, 2023

Thanks for making this all open source and posting it on GitHub! Might I suggest running a variable importance analysis after training your ML models?

For the 2-class model in "tcga_gihawhi_rebuttal_31July23.R", the top taxa are soil microbes or known to be hospital-acquired:
image

For the 2-class model in "tcga_gihawhi_rebuttal_WIS_subset_3Aug23.R", the top taxa are soil microbes or could be hospital-acquired:
image

For the second analysis, Rhizobium averages about 2 reads per sample. It would be good to have some uncertainty quantification with such a low number of reads, or to see what a simple model like DESeq2 does when trying to discriminate the classes.
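
One simple way to attach uncertainty to a low average read count is a percentile bootstrap over samples. The sketch below is only an illustration of that idea, not something run in this thread: the counts in `toy_counts` are invented (chosen so the mean is 2 reads per sample), and `bootstrap_mean_ci` is a hypothetical helper name.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_mean_ci(
    counts: List[int], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean read count per sample."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(counts)
    # Resample the samples with replacement and record each resample's mean.
    boot_means = sorted(mean(rng.choices(counts, k=n)) for _ in range(n_boot))
    lower = boot_means[int((alpha / 2) * n_boot)]
    upper = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-sample read counts for one genus (NOT real data).
toy_counts = [0, 0, 1, 0, 5, 2, 0, 3, 0, 9, 1, 0, 4, 2, 3]

lower, upper = bootstrap_mean_ci(toy_counts)
print(f"mean = {mean(toy_counts):.2f}, 95% bootstrap CI = ({lower:.2f}, {upper:.2f})")
```

With counts this sparse the interval is typically wide relative to the mean, which is the practical reason to report it alongside any feature ranking.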

@gregpoore
Owner

gregpoore commented Aug 6, 2023

@travisgibson Happy to do this and provide some clarifications:

  • The raw data released from Gihawi et al. 2023 bioRxiv did not apply decontamination or suggest how it should be done. The feature importance lists you provided are based on their raw data, which is what we (and others) have access to.
  • There are lots of ways to approach in silico decontamination, but they are not a replacement for experimental contamination controls, as noted in our response above and our original paper. In retrospective studies that lacked those controls, a straightforward and conservative approach is to restrict analyses to microbes found in independent, highly-decontaminated studies that implemented hundreds of experimental contamination controls. The caveat is that it prevents discovery of new taxa associations, but it has the benefit of providing greater confidence in the underlying taxa. The largest cohorts of decontaminated, tumor-derived taxa (bacteria and fungi) come from Ravid Straussman's group at the Weizmann Institute of Science (WIS) based on their Nejman et al. 2020 Science paper and the WIS-cohort portion of the Narunsky-Haziza et al. 2022 Cell paper. The Nejman et al. work employed 811 experimental controls (see Fig 1A of their paper), and the Narunsky-Haziza et al. work employed 295 experimental controls (see Fig 1A of that paper). These included DNA extraction controls, paraffin controls, and PCR controls.
  • For the Gihawi et al. data, restricting the predictive modeling to their reported genera that overlap with the WIS data (n=149 genera) replicated the same predictive modeling conclusions (see here). I've gone ahead and added the feature importance calculation to the tcga_gihawhi_rebuttal_WIS_subset_3Aug23.R script, and the top 5 features for discriminating among HMS primary tumor types are:
    Feature            Gain           Cover          Frequency
    Prevotella         0.2572682836   0.1117143829   0.075675676
    Staphylococcus     0.1569590199   0.1529085973   0.124324324
    Rhizobium          0.1264024071   0.0797954770   0.048648649
    Methylobacterium   0.0981973793   0.0358081653   0.027027027
    Haemophilus        0.0505654742   0.0307605472   0.021621622
  • Notwithstanding the above, it is important to note that a high feature importance does not guarantee or imply a significant over- or under-abundance of a particular taxon. Establishing that requires statistical testing. Moreover, statistical comparisons with microbiome data require compositionally-coherent methods (see Gloor et al. 2017 and Lin & Peddada 2020 for background on this), a requirement that DESeq2 does not satisfy. We previously implemented a compositionally-coherent method called ANCOM-BC (Lin & Peddada 2020) on TCGA samples in our fungi-focused paper (e.g., Data S5.4), but ANCOM-BC was not available during the development of our 2020 paper.
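
To give a feel for what "compositionally coherent" means in the last bullet, here is a minimal sketch of the centered log-ratio (CLR) transform, a standard building block in compositional data analysis. This is not ANCOM-BC and not what any script in this repository does; it only shows that log-ratios are unchanged when every count in a sample is multiplied by the same factor (for example, deeper sequencing), whereas raw counts are not. The numbers and the `clr` helper are hypothetical.

```python
import math
from typing import List, Sequence


def clr(values: Sequence[float], pseudocount: float = 0.0) -> List[float]:
    """Centered log-ratio: log of each value minus the mean log of the sample."""
    shifted = [v + pseudocount for v in values]
    if any(v <= 0 for v in shifted):
        raise ValueError("CLR needs strictly positive values; add a pseudocount")
    logs = [math.log(v) for v in shifted]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Hypothetical counts for four taxa in one sample (NOT real data).
toy_counts = [120, 30, 2, 48]
# The same community sequenced ten times deeper.
deeper = [c * 10 for c in toy_counts]

clr_shallow = clr(toy_counts)
clr_deep = clr(deeper)
print([round(x, 4) for x in clr_shallow])
print([round(x, 4) for x in clr_deep])
```

Zeros need special handling (the `pseudocount` argument is one crude option), and adding a pseudocount breaks the exact scale invariance shown here, which is part of why dedicated methods exist.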

@clozupone

It is interesting that, even using the overlap with the WIS dataset that had all of these experimental controls, Rhizobium, typically regarded as a soil microbe, is coming up. Certain species within the Rhizobium genus cause plant tumors, and of these Rhizobium radiobacter can also be found in human infections, including case reports in cancer patients (https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=abf50af543d7357152c1e9c8da9a7097eeef881f). In this study, strains of R. radiobacter cultured out of human samples could not cause plant disease and were not found in environmental contaminant controls - suggestive of human adaptation. Is it possible to redo analyses here at the species level to see if the Rhizobium being identified is R. radiobacter?

Similarly, a full Bradyrhizobium genome was assembled from the biopsy of a cancer patient who got colitis following a cord blood transplant in this paper (https://www.nejm.org/doi/10.1056/NEJMoa1211115). They found that this organism was highly related to the soil microbe B. japonicum, but was different and named it B. enterica, and again suggested that this isolate might be human adapted. Would it be possible to test if the Bradyrhizobium reads in this analysis are mapping closer to B. enterica than other Bradyrhizobium? This might shed light on whether the "soil bacteria" being identified here are actually these relatives that may be adapted to humans.

That both Rhizobium and Bradyrhizobium closely interact with plant hosts to form symbiotic nodules, and that these relationships can "go awry" and form tumors in plants, makes it potentially interesting to explore mechanistically any overlap between the pathways these microbes exploit during tumorigenesis in plants and pathways of importance in human tumor formation. This may be hard to do, but I am just thinking of ways to dig a little more into mechanistic leads using sequence data.

@gregpoore
Owner

gregpoore commented Aug 9, 2023

@clozupone I really like your questions and suggestions. However, it's unfortunately not possible to answer them with the Gihawi et al. data, which were fixed at the genus level and released without the reads. The main goal of this repository, by re-analyzing their data, was to show that alternative bioinformatic pipelines and reduced feature sets still yield the conclusion that microbiomes are cancer type specific, even when limiting the analyses to 9 'well known' genera.

I think there are ways to get to the species/read level and do what you're suggesting/asking about. I'll reach out via email to discuss further.
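
The reduced-feature-set idea discussed in this thread (keep only genera that also appear in an independent reference list, such as the WIS-overlapping genera) can be sketched as a simple set intersection. The sketch below is an illustration in Python rather than the repository's R code; the table layout, the function name `subset_to_reference`, and the placeholder genera `GenusX` and `GenusY` are all assumptions, and the counts are invented.

```python
from typing import Dict, Iterable

# Assumed layout: sample ID -> {genus name -> read count}
CountTable = Dict[str, Dict[str, int]]


def subset_to_reference(table: CountTable, reference_genera: Iterable[str]) -> CountTable:
    """Drop every genus that is not present in the reference list."""
    reference = set(reference_genera)
    return {
        sample: {genus: n for genus, n in counts.items() if genus in reference}
        for sample, counts in table.items()
    }


# Hypothetical genus-level counts (NOT real data).
toy_table: CountTable = {
    "sample_A": {"Prevotella": 40, "Staphylococcus": 12, "GenusX": 7},
    "sample_B": {"Prevotella": 3, "Rhizobium": 2, "GenusY": 15},
}
# Hypothetical reference list standing in for an independently decontaminated cohort.
toy_reference = ["Prevotella", "Staphylococcus", "Rhizobium"]

subset = subset_to_reference(toy_table, toy_reference)
print(subset)
```

The trade-off is the one already noted above: taxa absent from the reference list can never be discovered, in exchange for more confidence in the taxa that remain.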

@mw55309
Author

mw55309 commented Aug 17, 2023

I appreciate the attempt, but neither of the studies quoted ruled out contamination.

@gregpoore
Owner

@mw55309 I had no involvement in those papers and suggest that you reach out to the original authors if you have concerns about contamination. However, I kindly note that the following text in Bhatt et al. 2013 NEJM directly addresses this topic:

Paired-end 76-bp or 101-bp massively parallel sequencing was performed at separate sequencing centers for each patient in order to control for possible contamination (see the Supplementary Appendix for a detailed description of the contamination analysis).

In their Supplementary Appendix, they devote 2.5 pages (pp. 5-7) specifically to detailing how they mitigated contamination. It thus seems difficult to conclude that they did not make good-faith attempts to rule it out, but I again encourage you to reach out to those authors if it remains a concern for you.
