update in silico PCR method section text

This section has a number of requests in bold - they should be straightforward restructurings, mostly putting elements into supplementary data.
nickp60 · Apr 30, 2020 · 78eac64
1 parent a834f8a
commit 78eac64
Showing 1 changed file with 3 additions and 1 deletion.
diff --git a/docs/main/main.Rmd b/docs/main/main.Rmd
@@ -40,8 +40,10 @@ EzClermont was developed to bridge the gap between the traditional quadruplex pr
 
 ## Methods {-}
 ### *In silico* PCR {-}
-EzClermont software uses regular expressions to perform an *in silico* PCR, determining a phylotype according to the presence or absence of the target alleles. As PCR primers do not necessarily need 100\% sequence homology to function, the variability at the priming sites across a large set of *E. coli*  strains  was determined. This set of strains was selected from EnteroBase in April, 2019; after filtering the metadata based on metadata quality and source, one representative of each sequence type was selected. The list of 1395 isolates can be found in the EzClermont repository^[`docs/analysis/training/enterobase_training_subset.tab`; a detailed description and script of this filtering  procedure can be found at <https://github.com/nickp60/soil-persistent-ecoli/blob/master/3602-processing-Enterobase-metadata.Rmd>.]  These sequences provide a broad summary of the genomic diversity in *E. coli*. From each assembly generated by EnteroBase, the eight regions matching the theoretical amplicons of the quadriplex, E-specific, C-specific, G-specific, and E/C control primer sets  from @clermont_clermont_2013 and @clermont_characterization_2019 were identified and extracted using a reciprocal BLAST approach, and aligned the regions with mafft version 7.455 [@katoh_mafft_2013] using default parameters. Any differences between an assembly's sequence and the canonical primer sequence were incorporated into the regular expression in a manner akin to degenerate primer design. Differences occurring in the last 5 nucleotides on the 3’ regions were not incorporated, as those can be used to differentiate alleles [@stadhouders_effect_2010].
 
+To emulate PCR *in silico*, EzClermont uses regular expressions (regexes) that represent the Clermont primer sequences to locate their potential binding sites on a sequenced genome. The sequence regions lying between these sites are taken to be the predicted amplicons, and can be evaluated for sequence composition, or presence/absence to determine Clermont phylotype.
+
+In practice, Clermont primer sequences do not always exactly match their productive binding sites, so primer-binding sequence variability must be captured in the corresponding regexes. To represent this variability, we selected 1395 *E. coli* genomes from EnteroBase[**REF**] (accessed April, 2019), filtering on genome metadata quality and source [**MORE (BUT BRIEF) DETAIL/PRECISION NEEDED - PUT DETAILED METHOD IN SUPPLEMENTARY IF NECESSARY**]. The theoretical amplicons of each of the quadruplex, E-specific, C-specific, G-specific, and E/C control primer sets were identified and aligned [**PUT DETAIL - RECIP. BLAST, MAFFT ETC. IN SUPPLEMENTARY**] to identify variations in the canonical primer regions. These were used to establish the regular expressions used to search for primer binding sites in whole genomes [**INSERT A TABLE WITH THESE REGEX SEQUENCES**]. Sequence differences in the last five 5' bases of the reverse primer in each set were not included, as these variations can be used to differentiate alleles [@stadhouders_effect_2010] [**I AM NOT YET CLEAR WHAT THEY WERE NOT INCLUDED/INCORPORATED IN - DO THE PRIMER REGEXES STOP SHORT, BY 5bp?**].
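
Illustrative note (not part of the commit): the regex-based approach described in the new text can be sketched roughly as below. This is a minimal sketch under stated assumptions, not EzClermont's actual implementation. The function names (`primer_to_regex`, `predict_amplicon`), the `max_len` cutoff, and the primer and genome sequences are all hypothetical. Variable primer positions are written as IUPAC degenerate codes and expanded into regex character classes, which is one plausible way to capture binding-site variability.

```python
import re

# IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")


def revcomp(seq):
    """Reverse complement of a (possibly degenerate) DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def primer_to_regex(primer):
    """Expand a degenerate primer into a regex string."""
    return "".join(IUPAC[base] for base in primer.upper())


def predict_amplicon(genome, fwd, rev, max_len=2000):
    """Return the predicted amplicon on the given strand, or None.

    The reverse primer binds the opposite strand, so its reverse
    complement is what appears downstream on the searched strand.
    Only the strand passed in is searched (an assumed simplification).
    """
    fwd_re = primer_to_regex(fwd)
    rev_re = primer_to_regex(revcomp(rev))
    # Lazy quantifier: take the shortest span between the two sites.
    pattern = re.compile(f"{fwd_re}[ACGTN]{{0,{max_len}}}?{rev_re}")
    match = pattern.search(genome.upper())
    return match.group(0) if match else None


# Hypothetical primers and toy genome, for illustration only.
FWD = "ACGTRC"   # R matches A or G
REV = "GGATCA"   # appears as TGATCC on the searched strand
genome = "TTTT" + "ACGTAC" + "GGGCCC" + "TGATCC" + "AAAA"
amplicon = predict_amplicon(genome, FWD, REV)
```

Here a non-`None` result would count as "present" for that primer set. A full phylotype call would combine presence/absence across all primer sets and would also search the reverse-complement strand.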
 
 ### Validation Dataset {-}