update Results and Discussion text

nickp60 · Apr 30, 2020 · 69cc42e
1 parent 4f53c86
Showing 1 changed file with 17 additions and 15 deletions.
diff --git a/docs/main/main.Rmd b/docs/main/main.Rmd
@@ -63,21 +63,22 @@ Where both *in silico* phylotyping methods predicted a Clermont phylogroup diffe
 
 EzClermont is an open-source Python package distributed under the MIT License, available *via* PyPI, conda, and GitHub [**INCLUDE URIS HERE**]. The package comprises a command-line tool for batch execution, and a Flask-based webapp. The webapp is hosted as a live service at <http://ezclermont.org>, and a Docker container is available [**AT WHAT URI?**] for local deployment.
 
+### Performance {-}
 
-## Results {-}
+[**NEEDS PERFORMANCE METHOD DESCRIPTION HERE: COMPUTING SPECS, NUMBER OF RUNS, MEASURING "WALL TIME"**]
 
-Before addressing the performance of the *in silico* phylotypers, the experimentally-determined phylogroups must be confirmed with the maximum likelihood phylogeny.  Of the 125 strains assessed, seven of the reported phylogroup do not match.  Table 1 summarizes any disagreement between the literature reported phylotypes and that determined by maximum likelihood, ClermonTyping, or EzClermont. This could be for a number of reasons, including a strain being mislabelled, contamination, or a case where the Clermont quadruplex PCR does not accurately reflect the phylogeny.  In all cases, both *in silico* methods typed the strain in a manner matching the phylogeny, supporting the maximum likelihood phylogeny and suggesting that the integrity of the strain collection may have been compromised.  Indeed, comparing the two sequencing efforts revealed that at least in these two collections, the phylogentic identities of 12 of the 72 strains are not certain (Supplementary Figure \@ref(fig:bioproject-comparison)), and research labs may be refering to different strains by the same name. For assessing performance of the *in silico* tools, the phylogroup supported by the phylogeny was taken to be true, as we have no way of knowing the true identity of the strain on which the original Clermont PCR method was performed.
 
-```{r tree-ez, fig.cap="Cladogram of phylogenetic relationships between members of the ECOR collection and select phylogroup G isolates from Clermont et al. 2019. Clades are coloured by phylogroup; the heatmap surrounding the tree shows the reported phylogroup, ClermonTyping phylogroup, and EzClermont phylogroup (inside to outside).  The reported phylogroup was not supported by the phylogeny in seven of the strains. Both EzClermont and ClermonTyping show agreement with the phylogeny in all but two cases: ECOR44 and ECOR49.", out.width="110%" }
-knitr::include_graphics(path =  "../analysis/cladogram.png")
-```
+## Results {-}
 
+Clermont types reported in the literature are not guaranteed to correspond to phylogenetic lineage in *E. coli*; *in silico* predictions of phylotype may agree with the reported type, the lineage, both, or neither. We therefore first established the correspondence between lineage and Clermont type for each isolate in the 125-member validation set, visualised in Figure \@ref(fig:tree-ez). For seven isolates, the lineage was not consistent with the recorded Clermont type [**TABLE**]. In these cases we considered the phylogenetic lineage more reliable, and it took precedence over the literature-reported Clermont type when validating the *in silico* methods.
 
-The cladogram in Figure \@ref(fig:tree-ez) highlights two isolates (ECOR44 and ECOR49) were mistyped by both EzClermont and ClermonTyping; the maximum likelihood tree is shown in Supplementary Figure \@ref(fig:tree-ez-phy). Both of these mistyped  isolates  are phylogroup D.  Both tools type ECOR49 as phylogroup G. It appears that the canonical arpA fragment that should be present in phylogroup D isolates could not be identified in the assembly with either tool. Using a reciprocal BLAST approach to identify the region in the assembly also failed. Taken together, this suggests that the assembly may be incomplete, as the fragment must be present in the organism to be detected with PCR, but cannot be found; this was confirmed by running on the alternative assembly accession GCA_002190975.1, which both tools type as phylogroup D.
+Figure \@ref(fig:tree-ez) also summarises the results of applying both EzClermont and ClermonTyping to the validation dataset. For 123 of 125 isolates, the *in silico* predictions were consistent with the dominant Clermont type of the phylogenetic lineage. The two mismatched isolates, ECOR44 and ECOR49, are phylogroup D by both lineage and literature report, but were mistyped by both EzClermont and ClermonTyping as phylogroups G or E. We examined the source assembly for the ECOR49 isolate and found by reciprocal BLAST search that the canonical arpA fragment that should be present in phylogroup D could not be identified. This would be sufficient to cause misclassification, and suggested that the assembly used for validation might not be complete. We confirmed this by also analysing the alternative ECOR49 assembly GCA_002190975.1; this assembly contains the arpA fragment, and both tools assigned this genome correctly to phylogroup D. [**A FIGURE SHOWING ALIGNMENT OF THIS REGION OF THE TWO ECOR49 ASSEMBLIES WOULD BE USEFUL IN SUPP INFO**]
 
-ClermonTyping types ECOR44  as phylogroup E, suggesting a false positive with the Group E primer set. EzClermont types it as phylogroup G, suggesting a false negative of the arpA primer set. Investigation revealed that the quadruplex's arpA fragment was not detected due to a G->A mutation in base 17 of the reverse primer site. Because this occurs in the final 5 bases of the 3' end, it was not incorporated during the training process; a similar mutation was seen in 9 of the 1395 training isolates. @beghain_clermontyping_2018 notes the difficultly in typing with this arpA fragment, as it has likely been horizontally transfered to some phylogroup D isolates.
+```{r tree-ez, fig.cap="Cladogram of whole-genome phylogeny for members of the ECOR collection and phylogroup G isolates from Clermont et al. 2019. Clades are background-coloured by dominant phylogroup. The heatmap surrounding the tree shows phylogroups determined from: literature (inner ring), ClermonTyping, and EzClermont (outer ring). The literature phylogroup was not supported by *in silico* analysis for seven strains. Both EzClermont and ClermonTyping agree with the phylogenetic lineage in all but two cases: ECOR44 and ECOR49.", out.width="110%" }
+knitr::include_graphics(path =  "../analysis/cladogram.png")
+```
 
-<!-- Table (\#tab:sims)1: Summary of phylogroup by method. EzClermont and ClermonTyping were run on a set of strains with reported phylotypes. PhyML was used to reconstruct the phylogeny based on core SNPs from Parsnp, allowing comparison between the phylotype and the true phylogeny. Both tools type ECOR49 types as phylogroup G due to a contaminated assembly (*); ECOR49 from assembly GCA_002190975.1 is correctly typed by both tool s as phylogroup D. -->
+The ECOR44 isolate was mistyped by ClermonTyping as phylogroup E, and by EzClermont as phylogroup G. This was suggestive of a false negative result *in silico* for the arpA primer set. Closer inspection of the region indicated that the arpA fragment was not correctly identified due to a G->A substitution at base 17 of the reverse primer binding site. This mutation occurs in the final 5 bases of the reverse primer, and so was not incorporated during the training process for the primer regular expressions; a similar [**SIMILAR, OR THE SAME? IF SIMILAR A TABLE OF VARIANTS IN SUPP INFO, MAYBE?**] mutation was seen in a further eight of the 1395 training isolates.
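
The effect of a single substitution on a primer regular expression can be illustrated with a minimal Python sketch. The primer sequence, the `primer_regex` helper, and the variant table below are hypothetical and chosen only for illustration; they are not the real arpA primer or EzClermont's actual expressions. The sketch shows a strict pattern missing a binding site that carries a G->A change at base 17, while a pattern that permits the variant at that position still matches.

```python
import re

# Hypothetical 20-base primer binding site; NOT the real arpA reverse primer.
PRIMER = "ATGCCGTTAGGCTAACGTCA"


def primer_regex(primer, variants=None):
    """Build a regex for `primer`.

    `variants` maps 1-based positions to extra bases tolerated there
    (hypothetical helper, not part of EzClermont's API).
    """
    variants = variants or {}
    parts = []
    for pos, base in enumerate(primer, start=1):
        allowed = "".join(sorted({base} | set(variants.get(pos, ""))))
        # Use a character class only where more than one base is allowed.
        parts.append(base if len(allowed) == 1 else f"[{allowed}]")
    return re.compile("".join(parts))


# Binding site carrying a G->A substitution at base 17, with flanking sequence.
mutated_site = PRIMER[:16] + "A" + PRIMER[17:]
target = "CCTT" + mutated_site + "GGAA"

strict = primer_regex(PRIMER)               # no variation tolerated
tolerant = primer_regex(PRIMER, {17: "A"})  # G or A accepted at base 17

print(strict.search(target))    # the strict pattern misses the mutated site
print(tolerant.search(target))  # the tolerant pattern finds it
```

Whether a position should tolerate variation is a design decision: a looser pattern recovers variant sites such as this one, but may also increase false positives.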
 
 ```{r simstab}
 restab <- structure(list(Strain = c("SMS-3-5", "APEC01", "ECOR07", "ECOR72", 
@@ -91,25 +92,26 @@ restab <- structure(list(Strain = c("SMS-3-5", "APEC01", "ECOR07", "ECOR72",
 "B2"), Note = c("", 
 "", "", "", "", "", "ArpA1_r G17A", 
 "", "")), row.names = c(NA, -9L), class = "data.frame")
-knitr::kable( longtable=FALSE, booktabs=TRUE,
+knitr::kable(longtable=FALSE, booktabs=TRUE,
   #format = "markdown",
   escape = T, col.names = c( "Strain", "Accession",  "Reported",  "Phylogeny", " ClermonTyping", "EzClermont", "Note")  , 
-  caption="Summary of phylogroup by method. EzClermont and ClermonTyping were run on a set of strains with reported phylotypes. PhyML was used to reconstruct the phylogeny based on core SNPs from Parsnp, allowing comparison between the phylotype and the true phylogeny. Both tools type ECOR49 types as phylogroup G due to a contaminated assembly (*); ECOR49 from assembly GCA002190975.1 is correctly typed by both tool s as phylogroup D.",
+  caption="Isolates with inconsistent phylogroup predictions. EzClermont and ClermonTyping were run on a set of strains with reported phylotypes. A core SNP tree was constructed, allowing comparison between predicted and reported phylotypes, and the estimated phylogeny. Both tools mistype ECOR49 as phylogroup G due to a potentially contaminated assembly (*); ECOR49 from assembly GCA_002190975.1 is correctly typed by both tools as phylogroup D.",
   restab %>% arrange(Strain)
 ) %>% kable_styling(latex_options = "striped")
 ```
 
+We ran our analyses on the 125-member validation set [**X**] times with both EzClermont and ClermonTyping [**TABLE OF AVERAGE/RANGE OF TIME TAKEN**]. The mean execution time was [**X**] for EzClermont and [**X**] for ClermonTyping, indicating that there is [**NO COMPELLING EFFICIENCY DIFFERENCE BETWEEN THE TWO IMPLEMENTATIONS?**].
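
One way the repeated wall-time measurement could be collected is sketched below in Python. This is an assumption about a reasonable procedure, not the method used for the manuscript: `time_runs`, `summarise`, and `placeholder_task` are hypothetical names, and the placeholder stands in for whatever call runs a tool over the validation set.

```python
import statistics
import time


def time_runs(task, n_runs):
    """Return the wall-clock duration in seconds of each of n_runs calls to task."""
    durations = []
    for _ in range(n_runs):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return durations


def summarise(durations):
    """Summarise repeated timings as mean and range."""
    return {
        "n": len(durations),
        "mean": statistics.mean(durations),
        "min": min(durations),
        "max": max(durations),
    }


def placeholder_task():
    # Hypothetical stand-in for running a phylotyper on all 125 genomes.
    sum(range(10_000))


summary = summarise(time_runs(placeholder_task, n_runs=5))
print(summary)
```

Reporting the range alongside the mean makes it easier to judge whether a difference between the two tools exceeds run-to-run variation.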
 
-## Discussion {-}
-
-EzClermont was built to bridge the gap between traditional methods of classifying *E. coli* and the interpretation of whole-genome sequencing data. Considering the phylogeny of the strains tested, both EzClermont and ClermonTyping correctly classify 124/125 isolates.  Further, a wide application of EzClermont by @zhou_user_2019 to representative *E. coli* strains in EnteroBase was largely in agreement with both higher-resolution sequence typing and with ClermonTyping. 
 
-ClermonTyping has the advantage of classifying which cryptic lineage a strain belongs to, where EzClermont groups them all together. Due to the many methods that provide more accurate classification of these clades, we do not attempt to improve our method to classifiy those species. The execution time for the 125 strains tools 3 minutes 56 seconds for EzClermont; ClermonTyping executed 3 minutes 16 seconds; both tools execute in similar times via their webapp implementations. 
+## Discussion {-}
 
-As a python package, EzClermont is easy for developers to integrate into existing pipelines. Given the ease of use of the web app for simple queries, its incorporation into EnteroBase [@zhou_user_2019], and the standalone speed of execution for larger batches, we hope that EzClermont will be of continued use to the scientific community.
+EzClermont was built to bridge the gap between established laboratory and whole-genome sequencing methods of classifying *E. coli*. Both EzClermont and ClermonTyping correctly classified 123 of the 125 isolates in our validation set, indicating that each achieves a true positive rate (TPR) of approximately 98% (123/125 = 98.4%). Furthermore, a much broader application of EzClermont by @zhou_user_2019 to representative *E. coli* strains in EnteroBase was found to be strongly in agreement with both higher-resolution sequence typing and with ClermonTyping. Although there is [**NO COMPELLING EFFICIENCY DIFFERENCE BETWEEN THE METHODS?**], EzClermont identifies only that isolates are classified as "cryptic", whereas ClermonTyping distinguishes between cryptic lineages.
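
The quoted rate follows directly from the counts, as the short Python sketch below shows. The `agreement_rate` helper and the toy dictionaries are hypothetical: only the ECOR44 and ECOR49 entries reflect calls described in the text (lineage D, typed G by EzClermont), and the other two isolates are invented placeholders.

```python
def agreement_rate(predicted, reference):
    """Fraction of shared isolates whose predicted phylogroup matches the reference."""
    shared = predicted.keys() & reference.keys()
    if not shared:
        raise ValueError("no isolates in common")
    matches = sum(predicted[name] == reference[name] for name in shared)
    return matches / len(shared)


# Rate implied by the counts reported for the validation set.
reported_rate = 123 / 125
print(f"{reported_rate:.1%}")

# Toy example: "isolate_1" and "isolate_2" are invented placeholders.
lineage = {"ECOR44": "D", "ECOR49": "D", "isolate_1": "A", "isolate_2": "B2"}
ezclermont = {"ECOR44": "G", "ECOR49": "G", "isolate_1": "A", "isolate_2": "B2"}
print(agreement_rate(ezclermont, lineage))
```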
 
+Both tools mistyped the same pair of isolates from the validation set. Incomplete assemblies and misassembled genomes in particular are always likely to give erroneous results with genome sequence-based methods. Input genome quality is therefore critical for accurate classification. The arpA fragment appears to be particularly problematic, and @beghain_clermontyping_2018 noted the difficulty in typing with this region, which has likely been horizontally transferred to some phylogroup D isolates.
 
+However, the disagreement observed in this study between phylogenetic lineage and literature-reported phylotype for seven isolates reinforces that laboratory assays also share potential for error, and that these errors may be propagated in literature and metadata. Our comparison of sequencing efforts for the same isolates in two BioProjects implies that, at least in these two collections, the phylogenetic identities of 12 of the 72 strains were not certain (Supplementary Figure \@ref(fig:bioproject-comparison)). Such issues may lead to groups referring to distinct strains by the same name. We found that application of the *in silico* tools was able to correct misassigned phylotype for seven isolates.
 
+EzClermont is implemented as an application and as a Python package, for developers to integrate into existing and new pipelines. It is also presented as a web application with an intuitive interface for simple queries. We hope that the incorporation of EzClermont into EnteroBase [@zhou_user_2019], and the utility of applying the local program to large batches of genomes, mean that it will be of continued use to the scientific community.
 
 
 ## Supplementary Figures {-}