This R package has code and data for papers by Jeffrey M. Dick. Plots from the papers are reproduced in the vignettes, which are installed with the package and can be viewed at https://chnosz.net/JMDplots/vignettes/.
Click on the paper titles for a list of files. Published papers are indicated by the year with a DOI link. Links to preprints, if available, are at the end of each list. See the manual page associated with each paper for additional details about scripts, data files, and plotting functions.
genoGOE: Genomes record the Great Oxidation Event (in-preparation manuscript)
- inst/extdata/genoGOE: scripts and processed data files
  - methanogen_genomes.csv: genome IDs and taxonomy in GTDB r220 for 19 Class I and 19 Class II methanogen species selected from Fig. 1 of Lyu and Lu (2018)
  - process_GTDB.R: script to obtain DNA and protein sequences for 53 archaeal marker genes in GTDB and amino acid compositions for all proteins in 38 methanogen genomes
  - ar53_msa_marker_info_r220_XHZ+06.csv: list of archaeal marker genes from GTDB augmented with protein abundance information for Methanococcus maripaludis from Xia et al. (2006)
  - methanogen: sequences and amino acid compositions generated using process_GTDB.R
    - marker/fna: Nucleotide sequences of marker genes
    - marker/faa: Amino acid sequences of marker genes
    - aa: Amino acid compositions of all proteins
- inst/extdata/evdevH2O/LMM16: scripts and processed data files for consensus gene ages from Liebeskind et al. (2016), modified from the files used by Dick (2022)
  - mkaa.R: script: sum amino acid compositions of proteins in each gene age category
  - reference_proteomes.csv: data: IDs of UniProt reference proteomes for 31 organisms
  - modeAges_names.csv: output file: Names of gene age categories for each organism
  - modeAges_aa.csv: output file: Summed amino acid composition for proteins in each gene age category
- ../canprot/inst/extdata/fasta/KHAB17.fasta: reconstructed ancestral Rubisco sequences taken from Kaçar et al. (2017)
microhum: Adaptations of microbial genomes to human body chemistry (submitted manuscript)
- inst/extdata/microhum: scripts and processed data files
  - ARAST: analysis of metagenomes
    - ARAST.R: script: metagenome processing pipeline
    - runARAST.R: script: run pipeline for particular metagenomes
    - *_aa.csv: output files: amino acid composition
    - *_stats.csv: output files: processing statistics
  - KWL22: analysis of metagenome-assembled genomes (MAGs) from Ke et al. (2022)
    - mkaa.R: script: metaproteome processing
    - KWL22_MAGs_prodigal_aa.csv.xz: output file: amino acid composition
    - BioSample_metadata.txt: data: BioSample metadata for MAGs obtained from NCBI BioProjects PRJNA624223 and PRJNA650244
  - metaproteome: analysis of metaproteomes
  - 16S: analysis of 16S rRNA gene sequences
    - metadata: data: sample metadata for 16S rRNA datasets
    - pipeline.R: script: 16S rRNA processing pipeline
    - RDP-GTDB: output files: taxonomic classifications for 16S rRNA datasets made using the RDP Classifier with a training set based on GTDB release 207
  - MR18_Table_S1_modified.csv: data: List of Prokaryotes according to their Aerotolerant or Obligate Anaerobic Metabolism, modified from Million and Raoult (2018)
  - Figure_5_genera.txt: data: List of genera in Figure 5, created from the value invisibly returned by microhum_5()
- R/microhum.R: code for plots
- man/microhum.Rd: manual page
- vignettes/microhum.Rmd: vignette including Figures 1–6 and figure supplements
  - microhum.html: compiled HTML version of the vignette (external link)
- bioRxiv: preprint (external link)
chem16S: Community-level chemical metrics for exploring genomic adaptation to environments (2023)
- R/chem16S.R: code for plots
- man/chem16S.Rd: manual page
- vignettes/chem16S.Rmd: vignette including Figure 1
  - chem16S.html: compiled HTML version of the vignette (external link)
- ../chem16S/inst/extdata: scripts and processed data files (NOTE: these files are in the chem16S package; see chem16S-package.Rd for details)
  - RefSeq: processing scripts and output files of amino acid composition of genus- and higher-level taxa derived from the RefSeq database
  - GTDB: processing scripts and output files of amino acid composition of genus- and higher-level taxa derived from the Genome Taxonomy Database (GTDB)
  - metadata: sample metadata for 16S rRNA datasets: Heart Lake Geyser Basin in Yellowstone National Park (Bowen De León et al., 2012), Baltic Sea (Herlemann et al., 2016), and Bison Pool in Yellowstone National Park (Swingley et al., 2012)
  - RDP: output of RDP Classifier for the above 16S rRNA datasets using the default training set
  - RDP-GTDB: output of RDP Classifier for the above 16S rRNA datasets using a GTDB-based training set
  - DADA2: Analysis of two 16S rRNA datasets with DADA2 using a GTDB-based training set: marine sediment from the Humboldt Sulfuretum (Fonseca et al., 2022) and hot springs in the Qinghai-Tibet Plateau (Zhang et al., 2023)
orp16S: Community- and genome-based evidence for a shaping influence of redox potential on bacterial protein evolution (2023)
- inst/extdata/orp16S: scripts and processed data files
  - metadata: data: sample metadata for 16S rRNA datasets
  - pipeline.R: script: 16S rRNA processing pipeline
  - RDP: output files: taxonomic classifications for 16S rRNA datasets made using the RDP Classifier with its default training set
  - hydro_p: data: shapefiles for the North American Great Lakes, downloaded from USGS (2010)
  - EZdat.csv: output file: sample data and computed values of Eh7 and Zc
  - EZlm.csv: output file: linear fits between Eh7 and Zc for each dataset
  - BKM60.csv: data: outline of Eh-pH range of natural environments, digitized from Fig. 32 of Baas Becking et al. (1960)
  - MR18_Table_S1.csv: data: list of strictly anaerobic and aerotolerant genera from Table S1 of Million and Raoult (2018)
- metaproteome: analysis of metaproteomes
- R/orp16S.R: code for plots
- man/orp16S.Rd: manual page
- vignettes/orp16S.Rmd: vignette including Figures 1–6, S1–S2, and Table 1
  - orp16S.html: compiled HTML version of the vignette (external link)
- bioRxiv: preprint (external link)
utogig: Using thermodynamics to obtain geochemical information from genomes (2023)
- inst/extdata/utogig: scripts and processed data files
- R/utogig.R: code for plots
- man/utogig.Rd: manual page
- vignettes/utogig.Rmd: vignette including Figures 1–4, S1–S4, Table S6, and conversions between redox scales
  - utogig.html: compiled HTML version of the vignette (external link)
Amino acid compositions and taxonomic information have been obtained from the Saccharomyces Genome Database (SGD), UniProt, RefSeq, GTDB, and MGnify. See man/JMDplots-package.Rd for further details.
Reference databases
- inst/extdata/RefDB/organisms: Data for particular organisms, downloaded from SGD or UniProt.
  - Sce.csv.xz: Saccharomyces cerevisiae (used in the scsc and aoscp papers)
  - yeastgfp.csv.xz: Subcellular localization and abundance of proteins in S. cerevisiae (used in the scsc paper)
  - UP000000805_243232.csv.xz: Methanocaldococcus jannaschii (used in the mjenergy paper)
  - UP000000625_83333.csv.xz: Escherichia coli K12
  - UP000000803_7227.csv.xz: Drosophila melanogaster (used in the evdevH2O paper)
  - UP000001570_224308.csv.xz: Bacillus subtilis strain 168 (used in the evdevH2O paper)
- inst/extdata/RefDB/RefSeq: Data files processed from RefSeq and used in the geo16S and orp16S papers
  - genome_AA.csv.xz: Amino acid compositions of species-level archaeal, bacterial, and viral taxa in the RefSeq database
  - taxonomy.csv.xz: Taxonomic names for the species
  - Scripts to produce these files are in chem16S
- inst/extdata/RefDB/GTDB: Data files processed from GTDB and used in the microhum manuscript
  - genome_AA.csv.xz: Amino acid compositions of predicted proteins
  - taxonomy.csv.xz: Taxonomic names
  - Scripts to produce these files are in chem16S
- inst/extdata/RefDB/UHGG: Data files processed from MGnify's UHGG and used in the microhum manuscript
  - MGnify_genomes.csv: List of 4744 species-level clusters in the Unified Human Gastrointestinal Genome (UHGG v.2.0.1)
  - getMGnify.R: Commands used to download FASTA files for proteins and to scrape the website for taxonomic information
  - taxonomy.csv.xz: Taxonomy for 2350 selected genomes with contamination < 2% and completeness > 95%
  - genome_AA.R: Calculates amino acid compositions of the selected genomes from FASTA files and writes the output file genome_AA.csv.xz
  - taxonomy.R: Combines amino acid compositions of genomes to generate reference proteomes for genera and higher taxonomic levels and writes the output file taxon_AA.csv.xz
  - fullset: Versions of taxonomy.csv.xz, genome_AA.csv.xz, and taxon_AA.csv.xz for the full set of 4744 genomes
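The compressed CSV files listed above can be read directly in R once the package is installed. The following is a minimal sketch, not the package's own loading code: `read_bundled_csv()` is a hypothetical helper defined here for illustration, and the example assumes that JMDplots is installed. Files under `inst/` in the source tree appear without the `inst/` prefix in an installed package, so the path starts at `extdata`.

```r
# Hypothetical helper (not part of JMDplots): read a CSV file bundled with a package.
# system.file() returns "" if the package or file cannot be found.
read_bundled_csv <- function(relpath, package = "JMDplots") {
  path <- system.file(relpath, package = package)
  if (!nzchar(path)) return(NULL)
  # read.csv() opens xz-compressed files transparently
  read.csv(path)
}

# Escherichia coli K12 reference proteome listed under inst/extdata/RefDB/organisms
ecoli <- read_bundled_csv("extdata/RefDB/organisms/UP000000625_83333.csv.xz")
if (is.null(ecoli)) {
  message("File not found; is JMDplots installed?")
} else {
  # Inspect the table size and the first few column names
  print(dim(ecoli))
  print(head(colnames(ecoli)))
}
```

Returning `NULL` for a missing file keeps the sketch usable on systems where the package is not yet installed.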
First install the remotes package from CRAN.
install.packages("remotes")
Then install other required packages: canprot and chem16S.
remotes::install_github("jedick/canprot")
remotes::install_github("jedick/chem16S")
Note: As of 2023-07-31, JMDplots depends on the development versions of canprot and chem16S from GitHub, not the released versions on CRAN.
Finally, install JMDplots. This command will install prebuilt vignettes; they might not be up-to-date with the source code.
remotes::install_github("jedick/JMDplots")
To view the plots, use the R help browser or this command to open the vignettes page:
browseVignettes("JMDplots")
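To build the vignettes from source during installation instead of using the prebuilt ones, use this command:
remotes::install_github("jedick/JMDplots", dependencies = TRUE, build_vignettes = TRUE)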
Note: It might be possible to build the vignettes without pandoc, but having pandoc available will make them look better.
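Before building the vignettes, it can help to check whether pandoc is visible to R. This is a small sketch that assumes the rmarkdown package is used for the check; the variable name `has_pandoc` is illustrative only.

```r
# TRUE only if rmarkdown is installed and can locate a pandoc executable
has_pandoc <- requireNamespace("rmarkdown", quietly = TRUE) &&
  rmarkdown::pandoc_available()

if (has_pandoc) {
  message("pandoc ", as.character(rmarkdown::pandoc_version()), " is available")
} else {
  message("pandoc not found; vignettes may build with reduced formatting")
}
```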
This package, except for the file inst/extdata/orp16S/metadata/PCL+18.csv, is licensed under the GNU General Public License v3 (GPLv3).
The ORP (mV), DO (mg/L) and Feature (Stream, Spring, Lake, Terrace, or Geyser) data for New Zealand hot springs (Power et al., 2018) in PCL+18.csv were obtained from the 1000 Springs Project and are licensed under CC-BY-NC-SA.
This package contains a copy of the dunnTest() function by Derek H. Ogle from CRAN package FSA, version 0.9.3 (License: GPL (>= 2)), which itself is a wrapper for dunn.test() from CRAN package dunn.test by Alexis Dinno.