The MIDASim package (v0.1.0) implements the MIcrobiome DAta Simulator functions that simulate realistic microbiome data, maintaining the distributional and taxon-taxon correlation of a template microbiome datase. Users may also change the parameter setup in the model to introduce ``effect'' to facilitate simulation.
Required packages for functions in MIDASim include: psych, MASS, vegan, pracma, scam. These packages are all available on CRAN.
Install from the local file (the .tar.gz file), which you can download from https://github.com/mengyu-he/MIDASim.
devtools::install_local("MIDASim_0.1.0.tar.gz", dependencies = TRUE)
A github installation is also available via
devtools::install_github("mengyu-he/MIDASim", build_vignettes = T, force = T)
browseVignettes("MIDASim")
or
vignette("MIDASim_vignette", package = "MIDASim")
There are three main functions in MIDASim package: MIDASim.setup(), MIDASim.modify(), MIDASim().
We demonstrate the method of MIDASim and functions in this package using a filtered microbiome dataset of patients with IBD(Inflammatory Bowel Disease) in Human Microbiome Project 2 (HMP2) [1]. The data can be directly loaded through
data("count.ibd")
This HMP2 data were modified through package HMP2Data on Bioconductor by filtering out samples with library sizes smaller than 3000 and taxa that appear in less than 2 samples. The filtered data have 146 rows (samples) and 614 columns (taxa).
library(HMP2Data)
IBD16S()
count.ibd = t(IBD16S_mtx[, colSums(IBD16S_mtx)>3000] )
count.ibd = count.ibd[, colSums(count.ibd>0)>1]
dim(count.ibd)
Other datasets come with the package include a vaginal microbiome dataset from HMP2 [1] and a upper-respiratory-tract microbiome dataset [2].
The function MIDASim.setup() fits the underlying MIDASim model and extracts information from the template data and returns the estimated parameters for microbiome simulation. MIDASim offers two modeling options for relative abundances: nonparametric and parametric. Both approaches utilize a two-stage procedure in which the first stage models the presence-absence relationship between taxa and the second stage models the non-zero values in the microbiome community. These two approaches differ by the way how the second stage marginal distributions are modeled. The nonparametric approach generates data that resemble the template data better than the parametric approach. However, the parametric approach offers modifications of the template data in more controlled way.
In nonparametric mode (the default), a rank based approach is used to obtain the relative abundances for taxa that are considered to be non-zero in the simulated data.
To fit MIDASim with nonparametric mode to the count.ibd data, use the following command:
count.ibd.setup = MIDASim.setup(otu.tab, mode = 'nonparametric', n.break.ties = 10)
The argument n.break.ties
specifies the number of replicates used to break ties when ranking relative abundances for each taxon, and the default is n.break.ties = 100
.
In parametric mode, MIDASim fits the generalized gamma model, a three-parameter distribution in the location-scale family that was proposed for analyzing right-censored survival data to the relative abundance data of each taxon separately.
Denote the simulated relative abundance for
We estimate the three parameters
To fit MIDASim in parametric mode to the count.ibd data, use the following command:
count.ibd.setup = MIDASim.setup(otu.tab, mode = 'parametric')
The returned values of MIDASim.setup() include the estimated tetrachoric correlation of observed presence-absence data (tetra.corr
), the estimated Pearson correlation of observed relative abundances (corr.rel.corrected
), the proportion of non-zero cells for each taxon (taxa.1.prop
), the proportion of non-zero cells for each subject (sample.1.prop
), the observed mean relative abundances of each taxon (mean.rel.abund
), and the observed relative abundances among non-zero cells for each taxon (rel.abund.1
). If mode = 'parametric', the function will also return the estimated parameters of the generalized gamma distribution (mu.est
, sigma.est
, and Q.est
).
The function MIDASim.modify() sets up quantities required to simulate data using MIDAS. This function also allows users to modify library sizes, taxa relative abundances, the location parameters in the parametric model, taxa proportion of non-zero cells, number of samples. Note that this is a required step even if no adjustment is made.
The default is no additional adjustment is wanted. The number of samples and their target library sizes for the simulated data are the same as that of the template. The following code is an example.
count.ibd.modified = MIDASim.modify(count.ibd.setup,
lib.size = NULL,
mean.rel.abund = NULL,
gengamma.mu = NULL,
sample.1.prop = NULL,
taxa.1.prop = NULL)
Next, we will explain how to modify the data features to accommodate a wide range of simulation scenarios, separately for the nonparametric and parametric modes.
We provide options for modifying the model under the nonparametric mode. However, it's important to note that such modifications may not be as controlled as in parametric mode. For example, if we wish to change the library sizes of certain observations or the relative abundances of various taxa, it is not clear how the proportion of non-zero taxa should change in the nonparametric mode.
In nonparametric mode, users are able to modify the sample size, library sizes, taxa relative abundances, taxa proportion of non-zero cells.
To change the library sizes, set the lib.size
argument equal to a vector with the target library sizes. The number of samples can be modified implicitly by specifying library sizes with a vector length equal to the desired sample size.
Note that, if we wish to change the library sizes, the proportion of non-zero taxa would also change. Typically, larger library sizes are accompanied by fewer zero cells. However, since the model is nonparametric, there is no simple parametric relationship between the changes in zero cells and relative abundances. MIDASim.modify() requires the user to specify the marginal totals of non-zeros together when a change in lib.size
is made. This can be achieved by providing the marginal proportions using the sample.1.prop
and taxa.1.prop
arguments within the function."
An example way determine the appropriate number of non-zero cells for the new library size is to fit a Shape Constrained Additive model (SCAM) model
$$ log10(Z_{i.}) = f\left(log_{10}(N_{i})\right)+\epsilon_i$$
where sample.1.prop
and the number of taxa should be equal to the product of taxa.1.prop
and the number of samples,
For example, if we want to generate data with library sizes uniformly distributed between 1,000 and 10,000 while changing the number of samples to 700 and allowing the proportion of non-zero cells to adjust according to the SCAM model, we can start by generating the marginal totals of non-zero cells and transforming the totals to proportions using the following:
new.lib.size = sample(1000:10000, size=700, replace=TRUE)
obs.sample.1.ct = rowSums(count.ibd > 0)
xvar = log10(rowSums(count.ibd))
scamfit.non0 = scam::scam( log10(obs.sample.1.ct) ~ s( xvar, bs = "mpi" ))
sample.1.ct = 10^(predict(scamfit.non0,
newdata = data.frame(xvar = log10(new.lib.size) )) )
n.taxa = ncol(count.ibd)
new.sample.1.prop = sample.1.ct/n.taxa
new.taxa.1.prop = fitted$taxa.1.prop * (sum(new.sample.1.prop) * n.taxa / 700) / sum(fitted$taxa.1.prop)
Then we can use MIDASim.modify() by
count.ibd.modified = MIDASim.modify(count.ibd.setup,
lib.size = new.lib.size,
sample.1.prop = new.sample.1.prop,
taxa.1.prop = new.taxa.1.prop)
If the mean relative abundances are modified, adjustments to the values of non-zero relative abundances may be necessary. Let
These three quantities are related by
For example, if
Suppose we want to increase
beta = 0.1
new.mean.rel.abund = count.ibd.setup$mean.rel.abund
new.mean.rel.abund[1:10] = exp(beta) * new.mean.rel.abund[1:10]
new.mean.rel.abund = new.mean.rel.abund / sum(new.mean.rel.abund)
then the target taxa relative abundances can be used in MIDASim.modify() as the following,
count.ibd.modified = MIDASim.modify(count.ibd.setup,
mean.rel.abund = new.mean.rel.abund)
Users also have the option to modify the proportion of non-zero cells for all taxa. However, because this modification affects the overall total count of non-zero cells throughout the entire dataset, users are required to specify the marginal proportions of non-zero cells for all samples.
Suppose we wish to increase the proportion of non-zero cells for first 10 taxa by 0.05. We can achieve this by
new.taxa.1.prop = count.ibd.setup$taxa.1.prop
new.taxa.1.prop[1:10] = new.taxa.1.prop[1:10] + 0.05
Then to ensure that appropriate proportions of non-zero cells are provided for all samples, here we adjust them accordingly by
r = sum(new.taxa.1.prop) / sum(count.ibd.setup$taxa.1.prop)
new.sample.1.prop = count.ibd.setup$sample.1.prop * r
then the target proportion of non-zero cells for taxa can be used in MIDASim.modify() as the following,
count.ibd.modified = MIDASim.modify(count.ibd.setup,
taxa.1.prop = new.taxa.1.prop,
sample.1.prop = new.sample.1.prop)
Users can also simultaneously adjust both the mean relative abundances and the proportion of non-zero cells for taxa using the following code by
count.ibd.modified = MIDASim.modify(count.ibd.setup,
mean.rel.abund = new.mean.rel.abund,
taxa.1.prop = new.taxa.1.prop,
sample.1.prop = new.sample.1.prop)
If the parametric mode is chosen in MIDASim.setup(), then the user are able to modify the sample size, library sizes, taxa relative abundances, the location parameters in the generalized gamma distribution.
Changes in the parameters of the parametric model (including library sizes) imply coordinated changes in all other quantities. After either modification of the parameters, we predict the probability of being non-zero of
\begin{equation} \label{gengamma_pred01}
P(\widetilde{Z}_{ij} = 1) = F_j(N_i ; \widehat{\mu}_j, \widehat{\sigma},j \widehat{Q}j),
\end{equation}
to obtain the number of non-zero cells in each subject $Z{i\cdot }$ and that in each taxon $Z{\cdot j}$.
To change the library sizes, set the lib.size
argument equal to a vector with the target library sizes. The length of this vector becomes the new number of samples. The advantage of the parametric mode is that it handles coordinated changes automatically. As a result, when changing library sizes in parametric mode, there's no need for the user to provide the marginal totals of non-zero cells explicitly.
For example, to generate data having library sizes that are uniformly distributed between 1,000 and 10,000 while changing the number of samples to 700, we execute the code
new.lib.size = sample(1000:10000, size=700, replace=TRUE)
count.ibd.modified = MIDASim.modify(count.ibd.setup,
lib.size = new.lib.size)
Specifying new values for mean relative abundances (mean.rel.abund
) or location parameters (gengamma.mu
) can modify taxa features.
When mean relative abundances are changed, it effectively adjusts the location parameters in the parametric model. In this case, when the user specifies mean relative abundances for taxa, MIDASim automatically estimates new location parameters for the generalized gamma distribution. These changes in location parameters influence the predicted probabilities of non-zero values for each subject and each taxon, subsequently leading to adjustments in the marginal totals of non-zero cells.
For example, if we want to increase
beta = 0.1
new.mean.rel.abund = count.ibd.setup$mean.rel.abund
new.mean.rel.abund[1:10] = exp(beta) * new.mean.rel.abund[1:10]
new.mean.rel.abund = new.mean.rel.abund / sum(new.mean.rel.abund)
then the target taxa relative abundances can be used in MIDASim.modify() as the following,
count.ibd.modified = MIDASim.modify(count.ibd.setup,
mean.rel.abund = new.mean.rel.abund)
Alternatively, the users can also directly change the location parameters
count.ibd.modified = MIDASim.modify(count.ibd.setup,
gengamma.mu = count.ibd.setup$mu.est + beta)
The function MIDASim() takes the quantities obtained from MIDASim.modify() and simulates microbiome data. The output is comprised of the simulated 0/1 (presence-absence) data, sim_rel
, the table of relative abundances, and sim_count
, the table of count data.
simulated.data = MIDASim(count.ibd.modified)
summary(simulated.data)
[1] Lloyd-Price, J., Arze, C., Ananthakrishnan, A.N. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019). https://doi.org/10.1038/s41586-019-1237-9.
[2] Charlson, E.S., Chen, J., Custers-Allen, R. et al. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS one, 5(12), e15216 (2010). https://doi.org/10.1371/journal.pone.0015216