<!DOCTYPE html>
<html>
<head>
<title>Targeted Biomarker Discovery</title>
<meta charset="utf-8">
<meta name="author" content="Nima Hejazi" />
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link rel="stylesheet" href="custom.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: center, middle, inverse, title-slide
# Targeted Biomarker Discovery
## SUSA Career Exploration
### <a href="https://nimahejazi.org">Nima Hejazi</a>
### 2018 Apr 18 (Wed), 19:00
---
# Accessing these slides
--
### View online:
[goo.gl/HZzosu](https://goo.gl/HZzosu)
--
### Via git:
```bash
git clone https://github.com/nhejazi/talk_admitday.git susa-2018
```
???
- This talk will focus on challenges in analyzing high-dimensional biological
data, with a brief introduction to common problems.
- We will look at how standard approaches from statistical causal inference,
machine learning, and Targeted Learning can be extended to this class of data.
---
class: inverse, center, middle
# Statistics in High-Dimensional Biology
---
# Biological Sequencing I
## Why?
--
- We can probe biological and health processes at a very fine level.
--
- Learn more about the molecular basis of health and disease.
--
- Querying different genomic processes:
1. DNA/RNA expression
2. Protein expression
3. Epigenetics (e.g., DNA methylation)
???
- We are involved in research with health applications and the creation of
biological domain knowledge.
- Of the list given, the first two together make up the "central dogma of
molecular biology."
- There's a fascinating array of processes to be studied and much room for
statistical innovation.
---
# Biological Sequencing II
## Who?
--
- Experimental scientists in
1. environmental epidemiology
2. molecular biology
3. bioengineering
4. neuroscience
???
- From the diversity of research areas, you're sure to find something
interesting to work on.
- Just a list I came up with off the top of my head...before my morning coffee.
---
# Biological Sequencing III
## How?
--
- (some) Popular biotechnology:
1. Microarrays (genes, CpG sites, etc.)
2. RNA-Seq
3. Single-Cell RNA-Seq
???
- There are numerous challenges in analyzing such data sets.
- High-dimensional? In my own work, `\(n \in (4, 125)\)` and `\(g \in (10000, 850000)\)`.
---
# Biological Sequencing Data
Conventions differ in genomics:
`\(\begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1g} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2g} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & x_{n3} & \dots & x_{ng} \end{bmatrix} \xrightarrow[]{\text{transpose}} \begin{bmatrix} x_{11} & x_{12} & x_{13} & \dots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \dots & x_{2n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_{g1} & x_{g2} & x_{g3} & \dots & x_{gn} \end{bmatrix}\)`
__n.b.__, `\(n \ll g\)`
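As a small illustration of the transpose above, the sketch below builds a toy subjects-by-genes matrix and flips it into the genes-by-subjects layout. The dimensions and the `subject`/`gene` labels are made up for the example.

```r
# Toy expression matrix: n = 3 subjects (rows), g = 5 genes (columns).
n <- 3
g <- 5
x <- matrix(seq_len(n * g), nrow = n, ncol = g,
            dimnames = list(paste0("subject", 1:n), paste0("gene", 1:g)))

# Genomics convention: genes in rows, subjects in columns.
x_genomics <- t(x)

dim(x)           # 3 5
dim(x_genomics)  # 5 3
```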
--
- standard practice: _subjects in rows, variables in columns_
--
- genomics: _genes in rows, subjects in columns_
--
- Why this awful convention?
--
- Thanks, Microsoft (Excel)...the root of all evil (in data science).
--
- "Doing data science with spreadsheets is like drunk driving."
???
- This threw me off when first starting to work in this area.
- We'd like to make claims about the genes, so why are they in the rows? Excel.
- Quote from podcast ("Not so standard deviations"), originally P.B. Stark.
---
# Targeted Learning
- A framework for causal inference and variable importance analysis.
--
- Let's represent observed data as `\(O = (W, A, Y)\)`
--
- `\(W\)`: baseline variables (e.g., age, sex, SES)
--
- `\(A\)`: exposure/treatment (e.g., benzene)
--
- `\(Y\)`: outcome of interest (e.g., gene expression)
--
- __goal:__ Estimate the effect of a treatment (A) on an outcome (Y) while
controlling for baseline covariates (W)
--
- Target parameters like the "average treatment effect" (ATE):
`$$\Psi(P_0) = E[E[Y \mid A = 1, W] - E[Y \mid A = 0, W]]$$`
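To make the ATE concrete, here is a minimal sketch on simulated data. It uses a plain plug-in (G-computation) estimator with a linear outcome regression, not TMLE, and the data-generating model (true effect of 2, one binary confounder) is an assumption chosen only for illustration.

```r
set.seed(1)
n <- 5000

# Simulated data: W confounds the relationship between A and Y.
W <- rbinom(n, 1, 0.5)
A <- rbinom(n, 1, plogis(-0.5 + W))
Y <- 1 + 2 * A + 1.5 * W + rnorm(n)   # true ATE is 2 by construction
d <- data.frame(W = W, A = A, Y = Y)

# Outcome regression: an estimate of E[Y | A, W].
fit <- lm(Y ~ A + W, data = d)

# Predict for everyone under A = 1 and A = 0, then average the difference over W.
Q1 <- predict(fit, newdata = transform(d, A = 1))
Q0 <- predict(fit, newdata = transform(d, A = 0))
ate_hat <- mean(Q1 - Q0)

# Unadjusted contrast for comparison; it ignores W and is confounded.
naive <- mean(d$Y[d$A == 1]) - mean(d$Y[d$A == 0])

c(ate_hat = ate_hat, naive = naive)
```

The adjusted estimate should land near 2, while the unadjusted difference in means is pulled upward because exposed subjects are more likely to have `W = 1`.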
???
- All of this is covered in a series of 1st-year courses.
- Significant area of research in our department (M. van der Laan, A. Hubbard,
M. Petersen).
---
# R package: `biotmle`
- What does this package do?
--
_Biomarker identification, combining targeted learning with moderated (empirical
Bayes) statistics to obtain conservative, robust estimates, with inference._
--
- Moderated statistics and targeted learning:
`$$\tilde{t}_b = \frac{\sqrt{n}(\Psi_b(P_n^*) - \Psi_b(P_0))}{\tilde{S}_b}$$`
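One common way to obtain a moderated standard deviation like `\(\tilde{S}_b\)` is limma-style empirical Bayes shrinkage of the per-biomarker variances toward a prior value. The sketch below assumes that form; the prior degrees of freedom `d0` and prior variance `s0_sq` are fixed by hand here (hypothetical values), whereas in practice they are estimated from all biomarkers jointly. It is not meant as a description of what `biotmle` does internally.

```r
set.seed(42)

# Per-biomarker sample variances, each based on d residual degrees of freedom.
d  <- 4
s2 <- rchisq(1000, df = d) / d

# Hypothetical prior: d0 degrees of freedom and prior variance s0_sq.
d0    <- 3
s0_sq <- 1

# Shrink each variance toward the prior (weighted average on the df scale).
s2_tilde <- (d0 * s0_sq + d * s2) / (d0 + d)

# The moderated variances are less spread out than the raw ones.
c(raw = var(s2), moderated = var(s2_tilde))
```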
--
```r
library(devtools)
devtools::install_github("nhejazi/biotmle")
devtools::install_github("nhejazi/biotmleData")
```
--
```r
library(biotmle)
```
```
## biotmle v:1.3.0: Moderated and Targeted Statistical Learning for Biomarker Discovery
```
```r
library(biotmleData)
data(illuminaData)
```
???
- A working implementation of a targeted learning approach to biomarker
discovery using moderated statistics.
- Next, we'll walk through analyzing some data.
---
class: inverse, middle, center
# Data Analysis with `biotmle`
---
# Baseline covariates (W)
```r
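library(dplyr)                 # provides %>%, select(), mutate()
library(SummarizedExperiment)  # provides colData() and assay()

# W - age, sex, smoking
W <- illuminaData %>%
  colData() %>%
  data.frame() %>%
  dplyr::select(which(colnames(.) %in% c("age", "sex", "smoking"))) %>%
  dplyr::mutate(
    age = as.numeric((age > quantile(age, 0.25))),  # 1 if above the 25th percentile
    sex = I(sex),
    smoking = I(smoking)
  )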
```
--
```r
head(W)
```
```
## age sex smoking
## 1 1 1 1
## 2 1 1 1
## 3 0 1 2
## 4 1 2 2
## 5 1 2 2
## 6 1 2 1
```
???
- Our department revolves around using R. You're more than welcome to use
whatever programming language you'd like, but you'll have to learn R if you want
to communicate and collaborate.
- Probably wouldn't hurt to start learning now, if you don't know it already.
---
# Exposure of interest (A)
```r
# A - benzene exposure (discretized)
A <- illuminaData %>%
  colData() %>%
  data.frame() %>%
  dplyr::select(which(colnames(.) %in% c("benzene")))
A <- as.numeric(A[, 1])
```
--
```r
unique(A)
```
```
## [1] 1 3 2
```
--
```r
table(A)
```
```
## A
## 1 2 3
## 42 59 24
```
???
- We discretize the exposure/treatment to make it fit the form of the parameter
that we saw before (the ATE).
- Decent distribution of observations across the levels of A (though there are
fewer individuals in the highest exposure level).
---
# Outcome of interest (Y)
```r
# Y - genes
Y <- illuminaData %>%
  assay() %>%
  t() %>%
  data.frame()
geneIDs <- colnames(Y)
```
--
```r
dim(Y)
```
```
## [1] 125 22177
```
--
```r
head(Y[, 1:7])
```
```
## X6960451 X2600731 X2120309 X7510608 X1570494 X6520451 X5960017
## 3101 450.6910 2778.857 119.8120 203.8761 135.3883 222.8536 200.8315
## 3102 339.4663 2856.571 113.6889 228.0108 126.4207 219.0222 185.6719
## 3103 481.7867 4252.924 113.0603 184.6628 165.4673 215.8639 190.5513
## 3108 284.1533 1477.202 101.9724 199.9513 110.4363 172.2799 151.1723
## 3109 334.6466 1316.800 114.9128 204.2139 127.4630 210.0835 194.2244
## 3110 415.4404 3646.593 125.2571 220.9299 145.5556 222.3194 179.5192
```
???
- Whoa, look at that dimensionality!
- Expression measures (from microarrays) appear arbitrary.
---
# Identifying biomarkers
We can use the package to identify potential biomarkers:
```r
biomarkerTMLEout <- biomarkertmle(
  Y = Y,  # biomarkers
  W = W,  # baseline covariates
  A = A,  # exposure (benzene)
  type = "exposure",
  parallel = TRUE,
  family = "gaussian",
  g_lib = c("SL.glmnet", "SL.randomForest", "SL.polymars", "SL.mean"),
  Q_lib = c("SL.glmnet", "SL.randomForest", "SL.nnet", "SL.mean")
)
```
--
```r
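# Design matrix: intercept plus an indicator for the highest exposure level
design <- as.data.frame(cbind(rep(1, nrow(Y)),
                              as.numeric(A == max(unique(A)))))
colnames(design) <- c("intercept", "Tx")
# Moderated tests based on the TMLE results
limmaTMLEout <- modtest_ic(biotmle = biomarkerTMLEout)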
```
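The `Tx` column of the design matrix contrasts the highest exposure level with the other two. As a rough check of what that indicator looks like, the sketch below rebuilds a stand-in exposure vector from the counts shown earlier in `table(A)` (42, 59, 24); `A_toy` is hypothetical and does not preserve the original subject order.

```r
# Stand-in exposure vector with the same level counts as table(A).
A_toy <- rep(1:3, times = c(42, 59, 24))

# Same indicator as in the design matrix: 1 for the highest level, 0 otherwise.
Tx <- as.numeric(A_toy == max(unique(A_toy)))

table(Tx)
```

With these counts, 24 subjects fall in the highest-exposure group and the remaining 101 serve as the comparison group.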
???
- The procedure is quite resource-intensive, as it evaluates the association of
each potential biomarker (over `\(20,000\)`) with the exposure of interest A,
- ...while accounting for potential confounding by the covariates included
in W.
---
class: inverse, middle, center
# Visualizing results (from `biotmle`)
---
# Unadjusted results from tests
<img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />
???
- still find a large number of biomarkers
---
# Adjusted results from tests
<img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-17-1.png" style="display: block; margin: auto;" />
???
- multiple testing correction (FDR control)
- account for simultaneous tests
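As a sketch of FDR control, the Benjamini-Hochberg adjustment is available in base R through `p.adjust()`. The p-values below are made up for illustration, and whether Benjamini-Hochberg is the exact correction behind the figure is an assumption.

```r
# Hypothetical raw p-values from ten simultaneous tests.
p_raw <- c(0.001, 0.008, 0.039, 0.041, 0.042,
           0.060, 0.074, 0.205, 0.212, 0.216)

# Benjamini-Hochberg adjustment controls the false discovery rate.
p_adj <- p.adjust(p_raw, method = "BH")

# Compare how many tests pass a 0.05 threshold before and after adjustment.
c(unadjusted = sum(p_raw < 0.05), adjusted = sum(p_adj < 0.05))
```

Adjusted p-values are never smaller than the raw ones, so the number of findings can only stay the same or shrink.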
---
# Visualization I: Heatmap
<img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-18-1.png" style="display: block; margin: auto;" />
???
- volcano plots are pretty standard
- our goal was to reduce the number of significant findings at a low fold change
in the parameter of interest. It appears that we were successful.
---
# Visualization II: Volcano Plot
<img src="2018_susa_berkeley_files/figure-html/unnamed-chunk-19-1.png" style="display: block; margin: auto;" />
---
class: center, middle
# Thanks!
Slides created via the R package
[**xaringan**](https://github.com/yihui/xaringan).
Powered by [remark.js](https://remarkjs.com),
[**knitr**](http://yihui.name/knitr), and
[R Markdown](https://rmarkdown.rstudio.com).
---
class: center, middle
# Me
[nimahejazi.org](https://nimahejazi.org)
[statistics.berkeley.edu/~nhejazi](https://statistics.berkeley.edu/~nhejazi)
_email:_ nhejazi -AT- berkeley -DOT- edu
_twitter:_ [@nshejazi](https://twitter.com/nshejazi)
_GitHub:_ [nhejazi](https://github.com/nhejazi)
</textarea>
<script src="libs/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "zenburn",
"highlightLines": true
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function() {
var d = document, s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})();</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
tex2jax: {
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://cdn.bootcss.com/mathjax/2.7.1/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>