<!DOCTYPE html>
<html>
  <head>
    <title>Targeted Biomarker Discovery</title>
    <meta charset="utf-8">
    <meta name="author" content="Nima Hejazi" />
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <link rel="stylesheet" href="custom.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

# Targeted Biomarker Discovery
## SUSA Career Exploration
### <a href="https://nimahejazi.org">Nima Hejazi</a>
### 2018 Apr 18 (Wed), 19:00

---


# Accessing these slides

--

### View online:
[goo.gl/HZzosu](https://goo.gl/HZzosu)

--

### Via git:


```bash
git clone -b susa-2018 https://github.com/nhejazi/talk_admitday.git
```

???

- This talk will focus on challenges in analyzing high-dimensional biological
  data, with a brief introduction to common problems.

- We will look at how standard approaches from statistical causal inference,
  machine learning, and Targeted Learning can be extended to this class of data.

---
class: inverse, center, middle

# Statistics in High-Dimensional Biology

---

# Biological Sequencing I

## Why?

--

- We can probe biological and health processes at a very fine level.

--

- Learn more about the molecular basis of health and disease.

--

- Querying different genomic processes:
  1. DNA/RNA expression
  2. Protein expression
  3. Epigenetics (e.g., DNA methylation)

???

- We are involved in research with health applications and the creation of
biological domain knowledge.

- Of the list given, the top 2 together comprise the "central dogma of molecular
biology."

- There's a fascinating array of processes to be studied and much room for
statistical innovation.

---

# Biological Sequencing II

## Who?

--

- Experimental scientists in
  1. environmental epidemiology
  2. molecular biology
  3. bioengineering
  4. neuroscience

???

- From the diversity of research areas, you're sure to find something
interesting to work on.

- Just a list I came up with off the top of my head...before my morning coffee.

---

# Biological Sequencing III

## How?

--

- (some) Popular biotechnology:
  1. Microarrays (genes, CpG sites, etc.)
  2. RNA-Seq
  3. Single-Cell RNA-Seq

???

- There are numerous challenges in analyzing such data sets.

- High-dimensional? In my own work, `\(n \in (4, 125)\)`, `\(g \in (10000, 850000)\)`.

---

# Biological Sequencing Data

Conventions differ in genomics:

`\(\begin{bmatrix} x_{11} &amp; x_{12} &amp; x_{13} &amp; \dots &amp; x_{1g} \\ x_{21} &amp; x_{22} &amp; x_{23} &amp; \dots &amp; x_{2g} \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ x_{n1} &amp; x_{n2} &amp; x_{n3} &amp; \dots &amp; x_{ng} \end{bmatrix} \xrightarrow[]{\text{transpose}} \begin{bmatrix} x_{11} &amp; x_{21} &amp; x_{31} &amp; \dots &amp; x_{n1} \\ x_{12} &amp; x_{22} &amp; x_{32} &amp; \dots &amp; x_{n2} \\ \vdots &amp; \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ x_{1g} &amp; x_{2g} &amp; x_{3g} &amp; \dots &amp; x_{ng} \end{bmatrix}\)`

__n.b.__, `\(n \ll g\)`

--

- standard practice: _subjects in rows, variables in columns_

--

- genomics: _genes in rows, subjects in columns_

--

- Why this awful convention? 

--

- Thanks, Microsoft (Excel)...the root of all evil (in data science).

--

- "Doing data science with spreadsheets is like drunk driving."

???

- This threw me off when first starting to work in this area.

- We'd like to make claims about the genes, so why are they in the rows? Excel.

- The quote is from the podcast "Not So Standard Deviations"; it is originally
due to P.B. Stark.
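
A minimal sketch of the two layouts, using a small simulated matrix. The dimensions and the object names (`expr_genes_by_subjects`, `expr_subjects_by_genes`) are illustrative assumptions, not part of the data set analyzed later; `t()` converts between the genomics convention and the standard one.

```r
# Simulated expression values: 5 genes (rows) by 3 subjects (columns),
# i.e., the genomics convention. All values here are made up.
set.seed(1)
expr_genes_by_subjects <- matrix(
  rnorm(15),
  nrow = 5, ncol = 3,
  dimnames = list(paste0("gene", 1:5), paste0("subject", 1:3))
)

# Transpose to the standard layout: subjects in rows, genes in columns.
expr_subjects_by_genes <- t(expr_genes_by_subjects)

dim(expr_genes_by_subjects)  # 5 3
dim(expr_subjects_by_genes)  # 3 5
```

Row and column names travel with the transpose, so a value can still be looked up by gene and subject in either layout.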

---

# Targeted Learning

- A framework for causal inference and variable importance analysis.

--

- Let's represent observed data as `\(O = (W, A, Y)\)`

--

- `\(W\)`: baseline variables (e.g., age, sex, SES)


--

- `\(A\)`: exposure/treatment (e.g., benzene)

--

- `\(Y\)`: outcome of interest (e.g., gene expression)

--

- __goal:__ Estimate the effect of a treatment (A) on an outcome (Y) while
controlling for baseline covariates (W)

--

- Target parameters like the "average treatment effect" (ATE):

`$$\Psi(P_0) = E[E[Y \mid A = 1, W] - E[Y \mid A = 0, W]]$$`

???

- All of this is covered in a series of 1st-year courses.

- Significant area of research in our department (M. van der Laan, A. Hubbard,
M. Petersen).
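
To make the ATE formula concrete, here is a rough sketch of a plain substitution (G-computation) estimator on simulated data. It is not TMLE: a targeted estimator would start from a flexible fit and then update it, whereas this sketch assumes a simple linear working model. The data-generating mechanism and all object names (`sim`, `fit`, `ate_hat`) are invented for illustration.

```r
# Simulated data: binary baseline covariate W, binary exposure A, continuous Y.
# The true effect of A on Y is set to 2 in this made-up mechanism.
set.seed(2018)
n <- 500
sim <- data.frame(W = rbinom(n, size = 1, prob = 0.5))
sim$A <- rbinom(n, size = 1, prob = plogis(-0.5 + sim$W))
sim$Y <- 1 + 2 * sim$A + 0.5 * sim$W + rnorm(n)

# Outcome regression: a working model for E[Y | A, W].
fit <- lm(Y ~ A + W, data = sim)

# Predict each subject's outcome under A = 1 and under A = 0, then average the
# difference over the observed distribution of W.
pred_a1 <- predict(fit, newdata = transform(sim, A = 1))
pred_a0 <- predict(fit, newdata = transform(sim, A = 0))
ate_hat <- mean(pred_a1 - pred_a0)
ate_hat
```

With a linear model and no interaction terms, this average equals the fitted coefficient on `A`; the two-step form is shown because it mirrors the inner and outer expectations in the formula.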

---

# R package: `biotmle`

- What does this package do?

--

_Biomarker identification, combining targeted learning with moderated (empirical
Bayes) statistics to obtain conservative, robust estimates, with inference._

--

- Moderated statistics and targeted learning:

`$$\tilde{t}_b = \frac{\sqrt{n}(\Psi_b(P_n^*) - \Psi_b(P_0))}{\tilde{S}_b}$$`

--


```r
library(devtools)
devtools::install_github("nhejazi/biotmle")
devtools::install_github("nhejazi/biotmleData")
```

--


```r
library(biotmle)
```

```
## biotmle v:1.3.0: Moderated and Targeted Statistical Learning for Biomarker Discovery
```

```r
library(biotmleData)
data(illuminaData)
```


???

- A working implementation of a targeted learning approach to biomarker
discovery using moderated statistics.

- Next, we'll walk through analyzing some data.
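
The idea behind the moderated statistic can be sketched in base R: each biomarker's variance is shrunk toward a common prior value before forming the test statistic. The sketch below assumes fixed prior values (`df_prior`, `s2_prior`); in practice these are estimated from the data by empirical Bayes (e.g., in limma), and the per-biomarker estimates would come from the targeted estimator rather than simulated sample means.

```r
# Simulated data for 1000 biomarkers measured on n = 20 subjects each.
# Every number here is invented for illustration.
set.seed(42)
n <- 20
n_biomarkers <- 1000
x <- matrix(rnorm(n * n_biomarkers, sd = 2), nrow = n_biomarkers, ncol = n)

est <- rowMeans(x)        # per-biomarker estimate
s2 <- apply(x, 1, var)    # per-biomarker sample variance
df_resid <- n - 1         # residual degrees of freedom

# Assumed prior values (fixed here; normally estimated from all biomarkers).
df_prior <- 4
s2_prior <- mean(s2)

# Moderated variance: a weighted average of the prior and sample variances.
s2_mod <- (df_prior * s2_prior + df_resid * s2) / (df_prior + df_resid)

# Ordinary and moderated t-statistics for the null value 0.
t_ordinary <- est / sqrt(s2 / n)
t_moderated <- est / sqrt(s2_mod / n)

# Shrinkage pulls extreme variances toward the prior value.
range(s2)
range(s2_mod)
```

Biomarkers with unusually small sample variances no longer produce inflated statistics, which is what makes the resulting inference more conservative.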

---
class: inverse, middle, center

# Data Analysis with `biotmle`

---

# Baseline covariates (W)


```r
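# colData() is provided by SummarizedExperiment; %&gt;% and the verbs by dplyr
library(SummarizedExperiment)
library(dplyr)

# W - age, sex, smoking
W &lt;- illuminaData %&gt;%
  colData() %&gt;%
  data.frame() %&gt;%
  dplyr::select(which(colnames(.) %in% c("age", "sex", "smoking"))) %&gt;%
  dplyr::mutate(
    # dichotomize age at its first quartile (1 = above, 0 = at or below)
    age = as.numeric(age &gt; quantile(age, 0.25)),
    sex = I(sex),
    smoking = I(smoking)
  )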
```

--


```r
head(W)
```

```
##   age sex smoking
## 1   1   1       1
## 2   1   1       1
## 3   0   1       2
## 4   1   2       2
## 5   1   2       2
## 6   1   2       1
```

???

- Our department revolves around using R. You're more than welcome to use
whatever programming language you'd like, but you'll have to learn R if you want
to communicate and collaborate.

- Probably wouldn't hurt to start learning now, if you don't know it already.
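
The age covariate on this slide is dichotomized at its first quartile. A small sketch of that step in isolation, using hypothetical ages (these values are not taken from `illuminaData`):

```r
# Hypothetical ages in years, invented for illustration.
age <- c(23, 31, 35, 38, 41, 44, 47, 52, 58, 63)

# First quartile of the observed ages (R's default quantile definition).
q1 <- quantile(age, 0.25)

# Indicator: 1 if age is strictly above the first quartile, 0 otherwise.
age_binary <- as.numeric(age > q1)

table(age_binary)
```

Roughly three quarters of subjects end up coded 1, which is consistent with the mostly-1 `age` column in the `head(W)` output.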

---

# Exposure of interest (A)


```r
# A - benzene exposure (discretized)
A &lt;- illuminaData %&gt;%
  colData() %&gt;%
  data.frame() %&gt;%
  dplyr::select(which(colnames(.) %in% c("benzene")))
A &lt;- as.numeric(A[, 1])
```

--


```r
unique(A)
```

```
## [1] 1 3 2
```

--


```r
table(A)
```

```
## A
##  1  2  3 
## 42 59 24
```

???

- We discretize the exposure/treatment to make it fit the form of the parameter
that we saw before (the ATE).

- Decent distribution of observations across the levels of A (though there are
fewer individuals in the highest exposure level).
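
Since the ATE contrasts two exposure conditions, a three-level exposure has to be reduced to a binary contrast at some point. One possible contrast, sketched below, compares the highest level against the other two. The vector `A_levels` is rebuilt from the counts in the `table(A)` output, so the ordering of subjects is arbitrary; `A_high` is an illustrative name.

```r
# Rebuild an exposure vector with the level counts shown in table(A)
# (42, 59, and 24 subjects); subject order is arbitrary here.
A_levels <- rep(1:3, times = c(42, 59, 24))

# Indicator of the highest exposure level versus the two lower levels.
A_high <- as.numeric(A_levels == max(A_levels))

table(A_high)
```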

---

# Outcome of interest (Y)


```r
# Y - genes
Y &lt;- illuminaData %&gt;%
  assay() %&gt;%
  t() %&gt;%
  data.frame()
geneIDs &lt;- colnames(Y)
```

--


```r
dim(Y)
```

```
## [1]   125 22177
```

--


```r
head(Y[, 1:7])
```

```
##      X6960451 X2600731 X2120309 X7510608 X1570494 X6520451 X5960017
## 3101 450.6910 2778.857 119.8120 203.8761 135.3883 222.8536 200.8315
## 3102 339.4663 2856.571 113.6889 228.0108 126.4207 219.0222 185.6719
## 3103 481.7867 4252.924 113.0603 184.6628 165.4673 215.8639 190.5513
## 3108 284.1533 1477.202 101.9724 199.9513 110.4363 172.2799 151.1723
## 3109 334.6466 1316.800 114.9128 204.2139 127.4630 210.0835 194.2244
## 3110 415.4404 3646.593 125.2571 220.9299 145.5556 222.3194 179.5192
```

???

- Whoa, look at that dimensionality!

- Expression measures (from microarrays) are on an arbitrary scale.

---

# Identifying biomarkers

We can use the package to identify potential biomarkers:


```r
biomarkerTMLEout &lt;- biomarkertmle(Y = Y, # biomarkers
                                  W = W, # baseline covariates
                                  A = A, # exposure (benzene)
                                  type = "exposure",
                                  parallel = TRUE,
                                  family = "gaussian",
                                  g_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.polymars", "SL.mean"),
                                  Q_lib = c("SL.glmnet", "SL.randomForest",
                                            "SL.nnet", "SL.mean")
                                 )
```


--


```r
design &lt;- as.data.frame(cbind(rep(1, nrow(Y)),
                              as.numeric(A == max(unique(A)))))
colnames(design) &lt;- c("intercept", "Tx")
limmaTMLEout &lt;- modtest_ic(biotmle = biomarkerTMLEout)
```

???

- The procedure is quite resource-intensive, as it evaluates the association of
each potential biomarker (over `\(20{,}000\)`) with the exposure of interest A...

- ...while accounting for potential confounding based on the covariates included
in W.
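
To see why the cost scales with the number of biomarkers, here is a rough sketch of the "one adjusted model per biomarker" pattern. It uses an ordinary linear regression as a stand-in for the targeted estimator, so it is not what `biomarkertmle` does internally; the simulated objects (`covars`, `exposure`, `expr`) are invented substitutes for W, A, and Y.

```r
# Simulated stand-ins for the objects on the slides; all values are invented.
set.seed(7)
n <- 60
n_biomarkers <- 200
covars <- data.frame(age = rbinom(n, 1, 0.75), sex = rbinom(n, 1, 0.5))
exposure <- rbinom(n, 1, 0.4)
expr <- matrix(rnorm(n * n_biomarkers), nrow = n, ncol = n_biomarkers)

# Fit one adjusted linear model per biomarker and keep the exposure coefficient.
# This plain regression is a stand-in for the targeted estimator, not TMLE.
exposure_effects <- vapply(
  seq_len(n_biomarkers),
  function(b) {
    fit <- lm(expr[, b] ~ exposure + age + sex, data = covars)
    coef(fit)[["exposure"]]
  },
  numeric(1)
)

length(exposure_effects)
summary(exposure_effects)
```

Each iteration is independent of the others, which is why the per-biomarker fits can be run in parallel.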

---
class: inverse, middle, center

# Visualizing results (from `biotmle`)

---

# Unadjusted results from tests

&lt;img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" /&gt;

???

- still find a large number of biomarkers

---

# Adjusted results from tests

&lt;img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" /&gt;

???

- multiple testing correction (FDR control)

- account for simultaneous tests
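
A short sketch of FDR control with the Benjamini-Hochberg procedure via base R's `p.adjust()`. The p-values are simulated (mostly null, with a small fraction of true signals), so the counts it prints say nothing about the benzene analysis.

```r
# Simulated p-values: 950 null tests (uniform) and 50 with signal (small).
# These numbers are invented for illustration.
set.seed(123)
p_raw <- c(runif(950), rbeta(50, shape1 = 0.5, shape2 = 50))

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_bh <- p.adjust(p_raw, method = "BH")

# Number of tests called significant at the 0.05 level, before and after.
sum(p_raw < 0.05)
sum(p_bh < 0.05)
```

Adjusted p-values are never smaller than the raw ones, so the adjusted list of findings is always a subset of the unadjusted list.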

---

# Visualization I: Heatmap

&lt;img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" /&gt;

---

# Visualization II: Volcano Plot

&lt;img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" /&gt;

???

- volcano plots are pretty standard

- our goal was to reduce the number of significant findings at a low fold change
in the parameter of interest. It appears that we were successful.

---
class: center, middle

# Thanks!

Slides created via the R package
[**xaringan**](https://github.com/yihui/xaringan).

Powered by [remark.js](https://remarkjs.com),
[**knitr**](http://yihui.name/knitr), and
[R Markdown](https://rmarkdown.rstudio.com).

---
class: center, middle

# Me

[nimahejazi.org](https://nimahejazi.org)

[statistics.berkeley.edu/~nhejazi](https://statistics.berkeley.edu/~nhejazi)

_email:_ nhejazi -AT- berkeley -DOT- edu

_twitter:_ [@nshejazi](https://twitter.com/nshejazi)

_GitHub:_ [nhejazi](https://github.com/nhejazi)
    </textarea>
<script src="libs/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "zenburn",
"highlightLines": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function() {
  var d = document, s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})();</script>

<script type="text/x-mathjax-config">
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>