Permalink
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
659 lines (571 sloc) 36.2 KB
---
title: '`methyvim`: Targeted, Robust, and Model-free Differential Methylation
Analysis in R'
author:
- name: Nima S. Hejazi
affiliation:
- Group in Biostatistics, University of California, Berkeley
- Center for Computational Biology, University of California, Berkeley
orcid: 0000-0002-7127-2789
email: nhejazi@berkeley.edu
- name: Rachael V. Phillips
affiliation:
- Group in Biostatistics, University of California, Berkeley
orcid: 0000-0002-8474-591X
- name: Alan E. Hubbard
affiliation:
- Group in Biostatistics, University of California, Berkeley
orcid: 0000-0002-3769-0127
- name: Mark J. van der Laan
affiliation:
- Group in Biostatistics, University of California, Berkeley
- Department of Statistics, University of California, Berkeley
orcid: 0000-0003-1432-5511
abstract: |
We present `methyvim`, an R package implementing an algorithm for the
nonparametric estimation of the effects of exposures on DNA methylation at CpG
sites throughout the genome, complete with straightforward statistical
inference for such estimates. The approach leverages variable importance
measures derived from statistical parameters arising in causal inference,
defined in such a manner that they may be used to obtain targeted estimates of
the relative importance of individual CpG sites with respect to a binary
treatment assigned at the phenotype level, thereby providing a new approach to
identifying differentially methylated positions. The procedure implemented is
computationally efficient, incorporating a preliminary screening step to
isolate a subset of sites for which there is cursory evidence of differential
methylation as well as a unique multiple testing correction to control the
False Discovery Rate with the same rigor as would be available if all sites
were subjected to testing. This novel technique for analysis of differentially
methylated positions provides an avenue for incorporating flexible
state-of-the-art data-adaptive regression procedures (i.e., machine learning)
into the estimation of differential methylation effects without the loss of
interpretable statistical inference for the estimated quantity.
keywords: DNA methylation, differential methylation, epigenetics, causal
inference, variable importance, machine learning, targeted loss-based
estimation
bibliography: references.bib
output: BiocWorkflowTools::f1000_article
---
# Introduction
DNA methylation is a fundamental epigenetic process known to play an important
role in the regulation of gene expression. DNA methylation mostly commonly
occurs at CpG sites and involves the addition of a methyl group ($\text{CH}_3$)
to the fifth carbon of the cytosine ring structure to form 5-methylcytosine.
Numerous biological and medical studies have implicated DNA methylation as
playing a role in disease and development [@robertson2005dna]. Perhaps
unsurprisingly then, biotechnologies have been developed to rigorously probe the
molecular mechanisms of this epigenetic process. Modern assays, like the
Illumina _Infinium_ HumanMethylation BeadChip assay, allow for quantitative
interrogation of DNA methylation, at single-nucleotide resolution, across a
comprehensive set of CpG sites scattered across the genome; moreover, the
computational biology community has invested significant effort in the
development of tools for properly removing technological effects that may
contaminate biological signatures measured by such assays
[@fortin2014functional, @dedeurwaerder2013comprehensive]. Despite these
advances in both biological and bioninformatical techniques, most statistical
methods available for differential analysis of data produced by such assays rely
on over-simplified models that do not readily extend to such high-dimensional
data structures without restrictive modeling assumptions and the use of
inferentially costly hypothesis testing corrections. When these standard
assumptions are violated, estimates of the population-level effect of an
exposure or treatment may suffer from large bias. What's more, reliance on
restrictive and misspecified statistical models naturally leads to biased effect
estimates that are not only misleading in assessing effect sizes but also result
in false discoveries as these biased estimates are subject to testing and
inferential procedures. Such predictably unreliable methods serve only to
produce findings that are later invalidated by replication studies and add still
further complexity to discovering biological targets for potential therapeutics.
Data-adaptive estimation procedures that utilize machine learning provide a way
to overcome many of the problems common in classical methods, controlling for
potential confounding even in high-dimensional settings; however, interpretable
statistical inference (i.e., confidence intervals and hypothesis tests) from
such data-adaptive estimates is challenging to obtain [@libbrecht2015machine].
In this paper, we briefly present an alternative to such statistical analysis
approaches in the form of a nonparametric estimation procedure that provides
simple and readily interpretable statistical inference, discussing at length a
recent implementation of the methodology in the `methyvim` R package. Inspired
by recent advances in statistical causal inference and machine learning, we
provide a computationally efficient technique for obtaining targeted estimates
of nonparametric _variable importance measures_ (VIMs) [@vdl2006statistical],
estimated at a set of pre-screened CpG sites, controlling for the False
Discovery Rate (FDR) as if all sites were tested. Under standard assumptions
(e.g., identifiability, strong ignorability) [@pearl2009causality], targeted
minimum loss-based estimators of regular asymptotically linear estimators have
sampling distributions that are asymptotically normal, allowing for reliable
point estimation and the construction of Wald-style confidence intervals
[@vdl2011targeted, @vdl2018targeted]. In the context of DNA methylation studies,
we define the counterfactual outcomes under a binary treatment as the observed
methylation (whether Beta- or M-) values a CpG site would have if all subjects
were administered the treatment and the methylation values a CpG site would have
if treatment were withheld from all subjects. Although these counterfactual
outcomes are, of course, impossible to observe, they do have statistical analogs
that may be reliably estimated (i.e., identified) from observed data under a
small number of untestable assumptions [@pearl2009causality]. We describe an
algorithm that incorporates, in its final step, the use _targeted minimum
loss-based estimators_ (__TMLE__) [@vdl2006targeted] of a given VIM of interest,
though we defer rigorous and detailed descriptions of this aspect of the
statistical methodology to work outside the scope of the present manuscript
[@vdl2006targeted, @vdl2011targeted, @vdl2018targeted]. The proposed methodology
assesses the individual importance of a given CpG site, as a proposed measure of
differential methylation, by utilizing state-of-the-art machine learning
algorithms in deriving targeted estimates and robust inference of a VIM, as
considered more broadly for biomarkers in @bembom2009biomarker and
@tuglus2011targeted. In the present work, we focus on the `methyvim` software
package, available through the Bioconductor project [@gentleman2004bioconductor,
@huber2015orchestrating] for the R language and environment for statistical
computing [@R], which implements a particular realization of this methodology
specifically tailored for the analysis and identification of differentially
methylated positions (DMPs).
For an extended discussion of the general framework of targeted minimum
loss-based estimation and detailed accounts of how this approach may be brought
to bear in developing answers to complex scientific problems through tatistical
and causal inference, the interested reader is invited to consult
@vdl2011targeted and @vdl2018targeted. For a more general introduction to causal
inference, @pearl2009causality and @hernan2018causal may be of interest.
# Methods
## Implementation
The core functionality of this package is made available via the eponymous
`methyvim` function, which implements a statistical algorithm designed to
compute targeted estimates of VIMs, defined in such a way that the VIMs
represent parameters of scientific interest in computational biology
experiments; moreover, these VIMs are defined such that they may be estimated in
a manner that is very nearly assumption-free, that is, within a _fully
nonparametric statistical model_. The statistical algorithm consists in the
several major steps summarized below. Additional methodological details on the
use of targeted minimum loss-based estimation in this problem setting is
provided in a brief appendix (see section __Appendix__).
1. _Pre-screening_ of genomic sites is used to isolate a subset of sites for
which there is cursory evidence of differential methylation. Currently, the
available screening approach adapts core routines from the
[`limma`](http://bioconductor.org/packages/limma) R package. Following the
style of the function for performing screening via `limma`, users may write
their own screening functions and are invited to contribute such functions to
the core software package by opening pull requests at the GitHub repository:
\url{https://github.com/nhejazi/methyvim}.
2. Nonparametric estimates of VIMs, for the specified target parameter, are
computed at each of the CpG sites passing the screening step. The VIMs are
defined in such a way that the estimated effects is of an binary treatment on
the methylation status of a target CpG site, controlling for the observed
methylation status of the neighbors of that site. Currently, routines are
adapted from the [`tmle`](https://CRAN.R-project.org/package=tmle) R package.
3. Since pre-screening is performed prior to estimating VIMs, we apply the
modified marginal Benjamini and Hochberg step-up False Discovery Rate
controlling procedure for multi-stage analyses (FDR-MSA), which is
well-suited for avoiding false positive discoveries when testing is only
performed on a subset of potential targets.
#### Parameters of Interest
For CpG sites that pass the pre-screening step, a user-specified target
parameter of interest is estimated independently at each site. _In all cases,
an estimator of the parameter of interest is constructed via targeted minimum
loss-based estimation_.
Two popular target causal parameters for discrete-valued treatments or
exposures are
* The average treatment effect (ATE): The effect of a binary exposure or
treatment on the observed methylation at a target CpG site is estimated,
controlling for the observed methylation at all other CpG sites in the same
neighborhood as the target site, based on an additive form. Often denoted
$\psi_0 = \psi_0(1) - \psi_0(0)$, the parameter estimate represents the
additive difference in methylation that would have been observed at the
target site had all observations received the treatment versus the
counterfactual under which none received the treatment.
* The relative risk (RR): The effect of a binary exposure or treatment on the
observed methylation at a target CpG site is estimated, controlling for the
observed methylation at all other CpG sites in the same neighborhood as the
target site, based on a geometric form. Often denoted,
$\psi_0 = \frac{\psi_0(1)}{\psi_0(0)}$, the parameter estimate represents the
multiplicative difference in methylation that would have
been observed at the target site had all observations received the treatment
versus the counterfactual under which none received the treatment.
Estimating the VIM corresponding to the parameters above, for discrete-valued
treatments or exposures, requires two separate regression steps: one for the
treatment mechanism (propensity score) and one for the outcome regression.
Technical details on the nature of these regressions are discussed in
@hernan2018causal, and details for estimating these regressions in the
framework of targeted minimum loss-based estimation are discussed in
@vdl2011targeted.
<!--
#### Future Work
Support for continuous-valued treatments or exposures is _planned and ongoing_
but not yet available in stable releases of the `methyvim` software package.
Future releases will allow users to assess continuous-valued treatments by
relying on parameters estimable through implementations available in the
following software packages:
* A nonparametric variable importance measure (NPVI) [@chambaz2012estimation]
(R package `tmle.npvi`). The effect of a continuous-valued exposure or
treatment (the observed methylation at a target CpG site) on an outcome of
interest is estimated, controlling for the observed methylation value at all
other CpG sites in the same neighborhood as the target site. This uses a
parameter that compares values of the treatment against a user-specified
reference value taken to be the null value. In particular, the implementation
to be provided is designed to assess the effect of differential methylation at
the target CpG site on an outcome of interest (e.g., survival), providing a
nonparametric evaluation of the impact of methylation at the target site.
* The causal effect of shifting the value of an observed intervention, defined
as the counterfactual outcome under a posited shift of a continuous-value
treatment of interest using stochastic intervention policies
[@munoz2012population, @diaz2018stochastic, @hejazi2018txshift] (R package
`txshift`). The value of an outcome of interest under an unobserved value of
the (continuous-valued) treatment, specified through a user-provided
_additive shift_ of the observed treatment, may be data-adaptively estimated.
This allows for the effect of changes/shifts in the observed methylation at a
CpG site on an outcome of interest to be evaluated, providing a nonparametric
evalution of the relative importance of a particular target CpG site with
respect to another variable measured in the same study.
-->
#### Class `methytmle`
We have adopted a class `methytmle` to help organize the functionality within
this package. The `methytmle` class builds upon the `GenomicRatioSet` class
provided by the `minfi` package so all of the slots of `GenomicRatioSet` are
contained in a `methytmle` object. The new class introduced in the `methyvim`
package includes several new slots:
* `call` - the form of the original call to the `methyvim` function.
* `screen_ind` - indices identifying CpG sites that pass the screening process.
* `clusters` - non-unique IDs corresponding to the manner in wich sites are
treated as neighbors. These are assigned by genomic distance (bp) and respect
chromosome boundaries (produced via a call to `bumphunter::clusterMaker`).
* `var_int` - the treatment/exposure status for each subject. Currently, these
must be binary, due to the definition of the supported targeted parameters.
* `param` - the name of the target parameter from which the estimated VIMs are
defined.
* `vim` - a table of statistical results obtained from estimating VIMs for
each of the CpG sites that pass the screening procedure.
* `ic` - the measured array values for each of the CpG sites passing the
screening, transformed into influence curve space based on the chosen target
parameter.
The `show` method of the `methytmle` class summarizes a selection of the above
information for the user while masking some of the wealth of information given
when calling the same method for `GenomicRatioSet`. All information contained in
`GenomicRatioSet` objects is preserved in `methytmle` objects, so as to easen
interoperability with other differential methylation software for experienced
users. We refer the reader to the package vignette, "`methyvim`: Targeted
Data-Adaptive Estimation and Inference for Differential Methylation Analysis,"
included in any distribution of the software package, for further details.
## Operation
A standard computer with the latest version of R and Bioconductor 3.6 installed
will handle applications of the `methyvim` package.
# Use Cases
To examine the practical applications and the full set of utilities of the
`methyvim` package, we will use a publicly available example data set produced
by the Illumina 450K array, from the `minfiData` R package.
#### Preliminaries: Setting up the Data
We begin by loading the package and the data set. After loading the data, which
comes in the form of a raw `MethylSet` object, we perform some further
processing by mapping to the genome (with `mapToGenome`) and converting the
values from the methylated and unmethylated channels to Beta-values
(via `ratioConvert`). These two steps together produce an object of class
`GenomicRatioSet`, provided by the `minfi` package.
```{r setup-minfidata, message=FALSE}
suppressMessages(
# numerous messages displayed at time of loading
library(minfiData)
)
data(MsetEx)
mset <- mapToGenome(MsetEx)
grs <- ratioConvert(mset)
grs
```
We can create an object of class `methytmle` from any `GenomicRatioSet` object
simply invoking the S4 class constructor `.methytmle`:
```{r make-methytmle}
library(methyvim)
grs_mtmle <- .methytmle(grs)
grs_mtmle
```
Additionally, a `GenomicRatioSet` can be created from a matrix with the
function `makeGenomicRatioSetFromMatrix` provided by the `minfi` package.
#### Differential Methylation Analysis
For this example analysis, we'll treat the condition of the patients as the
exposure/treatment variable of interest. The `methyvim` function requires that
this variable either be `numeric` or easily coercible to `numeric`. To
facilitate this, we'll simply convert the covariate (currently a `character`):
```{r, minfidata-maketx}
var_int <- (as.numeric(as.factor(colData(grs)$status)) - 1)
```
__n.b.__, the re-coding process results in "normal" patients being assigned a
value of 1 and cancer patients a 0.
Now, we are ready to analyze the effects of cancer status on DNA methylation
using this data set. We proceed as follows with a targeted minimum loss-based
estimate of the Average Treatment Effect.
```{r, minfidata-methyvim, warning=FALSE, cache=TRUE}
methyvim_cancer_ate <- methyvim(data_grs = grs, var_int = var_int,
vim = "ate", type = "Beta", filter = "limma",
filter_cutoff = 0.20, obs_per_covar = 2,
parallel = FALSE, sites_comp = 250,
tmle_type = "glm"
)
```
Note that we set the `obs_per_covar` argument to a relatively low value (just 2,
even though the recommended value, and default, is 20) for the purposes of this
example as the sample size is only 10. We do this only to exemplify the
estimation procedure and it is important to point out that such low values for
`obs_per_covar` will compromise the quality of inference obtained because this
setting directly affects the definition of the target parameter.
Further, note that here we apply the `glm` flavor of the `tmle_type` argument,
which produces faster results by fitting models for the propensity score and
outcome regressions using a limited number of parametric models. By contrast,
the `sl` (for "Super Learning") flavor fits these two regressions using highly
nonparametric and data-adaptive procedures (i.e., via machine learning).
Obtaining the estimates via GLMs results in each of the regression steps
being less robust than if nonparametric regressions were used.
We can view a table of results by examining the `vim` slot of the produced
object, most easily displayed by simply printing the resultant object:
```{r vim-cancer-ate}
methyvim_cancer_ate
```
Finally, we may compute FDR-corrected p-values, by applying a modified procedure
for controlling the False Discovery Rate for multi-stage analyses (FDR-MSA)
[@tuglus2009modified]. We do this by simply applying the `fdr_msa` function.
```{r fdr-msa}
fdr_p <- fdr_msa(pvals = vim(methyvim_cancer_ate)$pval,
total_obs = nrow(methyvim_cancer_ate))
```
Having explored the results of our analysis numerically, we now proceed to use
the visualization tools provided with the `methyvim` R package to further
enhance our understanding of the results.
#### Visualization of Results
While making allowance for users to explore the full set of results produced by
the estimation procedure (by way of exposing these directly to the user), the
`methyvim` package also provides _three_ (3) visualization utilities that
produce plots commonly used in examining the results of differential methylation
analyses.
A simple call to `plot` produces side-by-side histograms of the raw p-values
computed as part of the estimation process and the corrected p-values obtained
from using the FDR-MSA procedure.
```{r methyvim-pvals-raw, fig.height=4, fig.width=6}
plot(methyvim_cancer_ate, type = "raw_pvals")
```
```{r methyvim-pvals-fdr, fig.height=4, fig.width=6}
plot(methyvim_cancer_ate, type = "fdr_pvals")
```
__Remark:__ The plots displayed above may also be generated as side-by-side
histograms in a single plot object. This is the default for the `plot` method
and may easily be invoked by specifying no additional arguments to the `plot`
function, unlike in the above.
While histograms of the p-values may be generally useful in inspecting the
results of the estimation procedure, a more common plot used in examining the
results of differential methylation procedures is the volcano plot, which plots
the parameter estimate along the x-axis and $-\text{log}_{10}(\text{p-value})$
along the y-axis. We implement such a plot in the `methyvolc` function:
```{r methyvim-volcano, fig.height=4, fig.width=6}
methyvolc(methyvim_cancer_ate)
```
The purpose of such a plot is to ensure that very low (possibly statistically
significant) p-values do not arise from cases of low variance. This appears to
be the case in the plot above (notice that most parameter estimates are near
zero, even in cases where the raw p-values are quite low).
Yet another popular plot for visualizing effects in such settings is the
heatmap, which plots estimates of the raw methylation effects (as measured by
the assay) across subjects using a heat gradient. We implement this in the
`methyheat` function:
```{r methyvim-heatmap, fig.height=4, fig.width=6}
methyheat(methyvim_cancer_ate, smooth.heat = TRUE, left.label = "none")
```
__Remark__: Invoking `methyheat` in this manner produces a plot of the top sites
($25$, by default) based on the raw p-value, using the raw methylation measures
in the plot. This uses the exceptional `superheat` R package
[@barter2017superheat], to which we can easily pass additional parameters. In
particulat, we hide the CpG site labels that would appear by default on the left
of the heatmap (by setting `left.label = "none"`) to emphasize that this is only
an example and _not_ a scientific discovery.
# Summary
Here we introduce the R package `methyvim`, an implementation of a general
algorithm for differential methylation analysis that allows for recent advances
in causal inference and machine learning to be leveraged in computational
biology settings. The estimation procedure produces straightforward statistical
inference and takes great care to ensure computationally efficiency of the
technique for obtaining targeted estimates of nonparametric variable importance
measures. The software package includes techniques for pre-screening a set of
CpG sites, controlling for the False Discovery Rate as if all sites were tested,
and for visualzing the results of the analyses in a variety of ways. The anatomy
of the software package is dissected and the design described in detail. The
`methyvim` R package is available via the Bioconductor project.
# Software availability
Latest source code (development version): https://github.com/nhejazi/methyvim
Bioconductor (stable release): https://bioconductor.org/packages/methyvim
Archived source code as at time of publication:
https://github.com/nhejazi/methyvim/releases/tag/f1000
Documentation (development version): https://code.nimahejazi.org/methyvim
Software license: The MIT License, copyright Nima S. Hejazi
# Author contributions
NH designed and implemented the software package, applied the tool to the use
cases presented, and co-drafted the present manuscript. RP helped in designing
the software and co-drafted the present manuscript. AH and ML served as advisors
for the development of this software and the general statistical algorithm it
implements.
# Competing interests
No competing interests were disclosed at the time of publication.
# Grant information
NH was supported in part by the National Library of Medicine of the National
Institutes of Health under Award Number T32-LM012417, by P42-ES004705, and
by R01-ES021369. RP was supported by P42-ES004705. The content of this work is
solely the responsibility of the authors and does not necessarily represent the
official views of the various funding sources and agencies.
# Appendix
## Data Structure
We consider an observed data structure, on a single experimental subject (e.g.,
a patient), $O = (W, A, (Y(j) : j))$, where $(Y(j) : j = 1, \ldots J)$ is the
set of CpG sites measured by the assay in question, $A \in \{0, 1\}$ represents
a binary phenotype-level treatment, and $W$ is a vector of the phenotype-level
baseline covariates that are potential confounders (e.g., age, sex). We consider
having access to measurements on a large number $J$ of CpG sites (e.g.,
$850,000$, as measured by the Illumina _MethylationEPIC_ BeadChip arrays).
Further, let $K: j \to S_j$ be a procedure that assigns to a given CpG site $j$
a set of neighbors $S_j$ -- that is, $S_j$ is a collection of indices of the
neighbors of $j$ just as $j$ indexes the full set of CpG sites; moreover, since
$Y(j)$ is the measured methylation value for a given CpG site $j$, $Y(S_j)$ is
the measured methylation values at the neighbors $S_j$ of $j$. Note that the
definition of a neighborhood $\{j, S_j\}$ is left vague so as to facilitate the
use of user-specified strategies in implementation. We consider the case of
observing $n$ iid copies of $O$ (i.e., $O_1, \ldots, O_n$), where $O \sim P_0
\in \mathcal{M}$, which is to say that the random variable $O$ is governed by an
unknown probability distribution $P_0$, assumed only to reside in a
nonparametric statistical model $\mathcal{M}$ that places no restrictions on the
data-generating process.
## Variable Importance Measure
With the data structure above in hand, we let the estimand of interest be a
$j$-specific variable importance measure (VIM) $\Psi_j(P_0)$, which is defined
through the _true_ (and unknown) probability distribution $P_0$. As a motivating
example, consider the case where $\Psi_j(P_0)$ is
\begin{equation}\label{vim_param}
\Psi_j(P_0) = \mathbb{E}_{P_0}(\mathbb{E}_{P_0}(Y(j) \mid A = 1, W, Y(S_j)) -
\mathbb{E}_{P_0} \mathbb{E}_{P_0}(Y(j) \mid A = 0, W, Y(S_j)),
\end{equation}
where $Y(S_j)$ is the subvector of $Y$ that contains the measured methylation
values of the _neighboring sites_ of $j$ (but not site $j$ itself). This
parameter of interest is a variant of the _average treatment effect_, which has
been the subject of much attention in statistical causal inference
[@holland1986statistics, @hahn1998role, @hirano2003efficient]. As a measure of
variable importance, we seek to estimate this target parameter ($\Psi_j(P_0)$),
which quantifies the effect of changing a treatment $A$ on the methylation
$Y(j)$ of a CpG site $j$, accounting for any potential confounding from
phenotype-level covariates $W$ and observed methylation $Y(S_j)$ at the
neighboring sites $S_j$ of $j$. Importantly, the use of such well-defined
parameters as variable importance measures allows the construction of
inferential procedures (e.g., confidence intervals and hypothesis tests) for the
estimate [@vdl2006statistical]. We propose a procedure that estimates this
target parameter $\Psi_j(P_0)$ across all CpG sites of interest $j: 1, \ldots,
J$. When data-adaptive regression procedures (i.e., machine learning) are
employed in estimating $\Psi_j(P_0)$, this produces a nonparametric variable
importance measure of differential methylation, allowing for the identification
of differentially methylated positions (DMPs) while avoiding many assumptions
common in the use of standard parametric regression procedures. We propose
estimating $\Psi_j(P)$ via targeted minimum loss-based estimation, which allows
for data-adaptive regression procedures to be employed in a straightforward
manner using the Super Learner algorithm for the construction of ensemble models
[@vdl2007super, @wolpert1992stacked, @breiman1996stacked, @gruber2009targeted,
@vdl2011targeted].
## Pre-Screening Procedure
As a matter of practicality, we consider only estimating the target VIM
$\Psi_j(P_0)$ at a subset of CpG sites, so that $j: 1, \ldots J$ does not, in
fact span the full set of assayed CpG sites. In order to determine this subset
of CpG sites, we propose the use of a pre-screening procedure, which need be
nothing more than a method for differential methylation analysis that is
computationally less demanding than the method proposed here. Formally, let the
pre-screening procedure $\theta(j): j \to \{0, 1\}$, which is to say that
$\theta$ takes as input a CpG site $j$ and returns a binary decision rule of
whether to include CpG site $j$ in a subset to be evaluated further or not. As
an example, one might consider employing a classical differential methylation
analysis procedure, such as the linear modeling approach of the
[`limma`](https://bioconductor.org/packages/limma) R package
[@smyth2004linear, @robinson2014statistical], using only CpG sites that pass a
reasonable cutoff based on the linear model (e.g., magnitude of t-statistics,
p-values) for the subsequent analysis steps in the `methyvim` pipeline.
## Reducing Neighbors via Clustering
Due to the data-adaptive nature of the regression procedures employed in
evaluating the target VIM $\Psi_j(P_0)$ via targeted minimum loss-based
estimation, it is possible that a given CpG site $j$ may have _too many_
neighbors $S_j$ to be controlled for in the estimation procedure. Heuristically,
the inclusion of too many neighbors when controlling for potential confounders
may lead to instability in the estimates produced. In such cases, we propose and
implement the use of a clustering technique (e.g., partitioning around medoids)
to select a _representative_ subset of neighbors. Formally, there is likely no
best choice of a specific clustering algorithm, so we leave this aspect of the
proposed algorithm as flexible for the user [@kleinberg2003impossibility]. Note
that the goal of employing a clustering procedure in `methyvim` is to obtain a
smaller but still highly representative set of neighbors $S(j)$, so as to allow
for estimates of $\Psi_j(P_0)$ to account for as much confounding from
neighboring sites as is allowed by the available data.
## Targeted Minimum Loss-Based Estimation and Statistical Inference
Given the choice of target parameter $\Psi_j(P_0)$, we propose the use of
targeted minimum loss-based estimation (TMLE) to construct and evaluate an
estimator $\psi_{n, j}$ of $\Psi_j(P_0)$. In the case of our motivating example,
where the target VIM is based on the average treatment effect, we make use of a
TML estimator of this parameter, which has been implemented for a general case
in the [`tmle`](https://CRAN.R-project.org/package=tmle) R package
[@gruber2011tmle]; users of `methyvim` may wish to consult the documentation of
that software package as a supplement. Generally speaking, a TML estimator is
constructed from a few simple components: (1) an estimator of the propensity
score [@rosenbaum1983central], often denoted $g(A \mid W)$; (2) an estimator of
the outcome regression, often denoted $\bar{Q}(A, W)$; and (3) a targeting step
that revises the initial estimators of the aforementioned components such that
the resultant estimator satisfies a set of score-like equations
[@gruber2009targeted]. While we describe a few of the key properties of TML
estimators in the sequel, extended technical discussion is deferred to more
comprehensive work [@vdl2011targeted, @vdl2018targeted]. In order to construct
an estimate $\psi_{n, j}$ of $\Psi_j(P_0)$, it is necessary to accurately
estimate two nuisance parameters, these being the propensity score
($g(A \mid W)$) and the outcome regression ($\bar{Q}(A, W)$); moreover,
data-adaptive regression procedures may be used to obtain consistent estimates
of these quantities through the creation of a stacked regression model via the
Super Learner algorithm [@wolpert1992stacked, @breiman1996stacked,
@vdl2007super], which ensures that the resulting ensemble model satisfies
important optimality properties with respect to cross-validation
[@vdl2004asymptotic, @dudoit2005asymptotics, @vdl2003unified]. To use our
running example, one would construct a TML estimator of the average treatment
effect (ATE) as a variable importance measure in the following short series of
steps:
1. Construct initial estimates of the propensity score $g(A \mid W)$ and outcome
regression $\bar{Q}(A, W)$. As noted previously, the Super Learner algorithm
may be employed to construct such estimates.
2. Update these initial estimates (e.g., iteratively) so as to solve the
efficient influence function estimation equation of the parameter of interest
relative to a nonparametric statistical model.
3. In the efficient influence function for the ATE, $\text{EIF}(P_0)(O) =
H_n(Y - \bar{Q}(A, W))$, the auxiliary term $H_n$ may be denoted as
$H_n = \frac{\mathbb{I}(A = 1)}{g(1 \mid W)} -
\frac{\mathbb{I}(A = 0)}{g(0 \mid W)}$. Note that the exact form of the
efficient influence function and auxiliary covariate varies with the target
parameter of interest.
4. The TML estimator of the parameter of interest is constructed by a procedure
that updates the initial estimators such that the empirical mean of the
efficient influence function estimating equation(s) is nearly zero (with
respect to an error proportional to the sample size, i.e.,
$\frac{1}{\sqrt{n}}$).
Importantly, TML estimators are well-suited for statistical inference, having an
asymptotically normal limiting distribution [@tsiatis2007semiparametric,
@vdl2006targeted], which allows for a closed-form expression of the variance
to be derived:
\begin{equation}\label{eqn:lim_psi}
\sqrt{n}(\psi_{n, j} - \Psi_j(P_0)) \sim N(0, \sigma_{\text{EIF}}^2),
\end{equation}
where $\sigma_{\text{EIF}}^2$ is the variance of the efficient influence
function, with respect to a nonparametric statistical model, of the target
parameter evaluated at the observed data. Such a convenience allows for
confidence intervals and hypothesis tests to be constructed (i.e.,
$H_{0, j}: \Psi_j(P_0) = 0$) in a straightforward manner. What's more, under
standard regularity conditions, and with consistent estimation of the propensity
score and outcome regression, the TML estimator is asymptotically efficient and
achieve the lowest possible variance among a large class of estimators
[@bickel1993efficient].
## Correction for Multiple Testing
Given that we seek to estimate $\Psi_j(P_0)$ for a possibly large number of CpG
sites $(j: 1, \ldots J)$, the need to perform corrections for multiple testing
is clear. In order to curb the potential for false discoveries, we recommend the
use of the Benjamini and Hochberg procedure for controlling the False Discovery
Rate (FDR) [@benjamini1995controlling]; however, as the proposed procedure
involves a pre-screening step, naive application of the Benjamini and Hochberg
procedure (BH) is invalid -- instead, we rely on a modification of the procedure
to control the FDR, with established theoretical guarantees when pre-screening
is employed [@tuglus2009modified]. In brief, the modified marginal Benjamini and
Hochberg procedure to control the FDR under pre-screening works by applying the
standard BH procedure to a padded vector of p-values -- that is, letting $J$ be
the number of CpG sites tested and $K$ the number of CpG sites filtered out (so
that $J + K = P$, where $P$ is the original dimension of the genomic assay), the
modified marginal BH procedure is the application of the original BH procedure
to a vector of p-values, composed of the $J$ p-values from performing $J$
hypothesis tests (of the form $H_{0, j}: \Psi_j(P_0) = 0$) and $K$ additional
p-values automatically set to $1$ (for the hypothesis tests not performed on
account of pre-screening). This procedure is guaranteed to control the FDR at
the same desired rate when pre-screening is performed whereas naive application
of the BH procedure fails to do so.
# References