-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathindex.Rmd
More file actions
572 lines (509 loc) · 19.9 KB
/
index.Rmd
File metadata and controls
572 lines (509 loc) · 19.9 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
---
author: "VJ Carey"
title: "Cloud-enabled cis-eQTL searches with Bioconductor GGtools 5.x"
date: "3 June 2015"
output:
rmdformats::readthedown:
toc_depth: 3
number_sections: true
---
```{r echo=FALSE,results="hide"}
date()
```
```{r echo=FALSE,results="hide"}
scoresCis = function(...){NULL} # does commented unevald code get checked?
suppressMessages({ # include some warnings on symbol replacements
if (!("GGdata" %in% installed.packages()[,1])) {
source("http://www.bioconductor.org/biocLite.R")
biocLite("GGdata")
}
if (!("SNPlocs.Hsapiens.dbSNP144.GRCh37" %in% installed.packages()[,1])) {
source("http://www.bioconductor.org/biocLite.R")
biocLite("SNPlocs.Hsapiens.dbSNP144.GRCh37")
}
library(knitcitations)
library(bibtex)
allbib = read.bibtex("allbib.bib")
library(GenomeInfoDb)
library(S4Vectors)
library(GGtools)
library(GGdata)
library(yri1kgv)
library(snpStats)
library(scatterplot3d)
library(lumi)
library(parallel)
library(foreach)
library(doParallel)
library(biglm)
library(lumiHumanAll.db)
library(rmeta)
library(SNPlocs.Hsapiens.dbSNP144.GRCh37)
if (!exists("partceu100k_dt")) load("partceu100k_dt.rda")
})
```
# Background
Numerous studies have employed genome-wide measures of mRNA abundance (typically
assayed using DNA microarrays, and more recently RNA-seq) in combination with
high-resolution genotyping (often supplemented with statistical imputation to
loci not directly assayed, leading to genotype calls with quantified uncertainty)
to search for genetic determinants of variation in gene expression.
Key references in human applications include
`r citet(allbib[["Cheung:2005p446"]])`,
`r citet(allbib[["Majewski:2011p3139"]])`, and
`r citet(allbib[["Gaffney:2012p5397"]])`;
`r citet(allbib[["Shabalin:2012p4753"]])` addresses computational concerns.
This document focuses on searches for eQTL <emph>in cis</emph>, so that
DNA variants local to the gene assayed for expression are
tested for association.
A typical report describes tuning of the search (including, for example,
boundaries on minor allele
frequencies of variants to be tested, approach to correction for batch effects
and other forms of confounding, scope of search in terms of distance from gene
coding region), enumerates variants with evidence of association to expression
variation in nearby genes, and then characterizes the biological roles of the
discovered variants.
N.B. The gQTLstats package will supersede GGtools for scalable
eQTL analysis; look for a revised workflow 2016Q1.
# Objectives
Suppose there are $N$ independently sampled individuals
with gene expression measures on $G$ genes or probes.
Each individual is genotyped (or has genotype statistically imputed) at $S$
DNA locations, catalogued by NCBI dbSNP or 1000 genomes.
We are given a $G \times N$
matrix of expression assay results, and $N \times S$ genotyping results
in the form of the number of B alleles (or expected number of B alleles)
for each of the loci. Select the search radius $\rho$ (for example, 100kb) and
for each gene $g$, determine the search neighborhoods $N_g = N_{g,\rho} = [a_g-\rho, b_g+\rho]$,
where $a_g$ denotes the genomic coordinate of the 5' end of the transcript region for
gene $g$, and
$b_g$ is the coordinate at the 3' end. Let $|N_g|$ denote
the number of SNP in that neighborhood. Key objectives are
- For each gene, compute the $|N_g|$ test statistics
measuring association of SNPs in $N_g$ with mean expression of gene $g$;
- Obtain a measure of statistical significance for each test statistic;
- Support adjustment and assessment of sensitivity analysis of statistical
tests (e.g., adjustment for batch effects, effects of filtering on gene expression variation
or SNP minor allele frequency);
- Provide the test results in a format for ready interrogation using
various types of search key;
- Support visualization of associations at various scales.
## Basic execution/reporting structure
The code in the example for the GGtools function All.cis() yields an
example of a sharply restricted search for cis eQTL on chr21, using
data on the HapMap CEU population.
```{r dolo,echo=FALSE}
cc = new("CisConfig")
chrnames(cc) = "21"
radius(cc) = 25000L
lkp = try(library(parallel))
if (!inherits(lkp, "try-error")) {
nc = 2 # max(c(1,min(c(10, detectCores()-1)))) # attempt to trim for workflow builder
options(mc.cores=nc)
geneApply(cc) = mclapply
if (.Platform$OS.type == "windows") geneApply(cc) = lapply
}
estimates(cc) = FALSE
set.seed(1234)
f1 <- All.cis( cc ) # devel: cisScores
```
```{r lkex,eval=FALSE}
cc = new("CisConfig") # take a default configuration
chrnames(cc) = "21" # confine to chr21
estimates(cc) = FALSE # no point estimates neede
f1 <- All.cis( cc ) # compute the tests; can be slow without attendance
# to parallelization
```
The result of the function inherits
from GRanges, and includes metadata concerning its generation.
```{r lookf1,echo=TRUE}
length(f1)
f1[1:3]
metadata(f1)
```
Use of GRanges for the organization of association test statistics allows
easy amalgamation of findings with other forms of genomic annotation. Retention of
the association scores achieved under permutation allows recomputation of
plug-in FDR after combination or filtering.
## Visualization examples
Targeted visualization of association is supported with
the plot_EvG function in GGBase. To obtain the
figure on the right, the expression matrix
has been transformed by removing the principal components
corresponding to the 10 largest eigenvalues. This is a
crude approach to reducing ``expression heterogeneity'', a main
concern of eQTL analyses to date
`r citep(allbib[["Leek:2007p1723"]])`.
```{r demoy,fig=TRUE,fig.width=7,fig.height=4,echo=FALSE,results="hide"}
suppressMessages({
library(yri1kgv)
if (!exists("c20")) c20 = getSS("yri1kgv", "chr20")
par(mfrow=c(1,2))
plot_EvG(probeId("o67h4JQSuEa02CJJIQ"), rsid("rs2259928"), c20,
main="observed expr.")
if (!exists("c20f")) c20f = clipPCs(c20, 1:10)
plot_EvG(probeId("o67h4JQSuEa02CJJIQ"), rsid("rs2259928"), c20f,
main="10 expr. PC removed")
})
```
```{r shoit,eval=FALSE}
plot_EvG(probeId("o67h4JQSuEa02CJJIQ"), rsid("rs2259928"), c20f,
main="10 expr. PC removed")
```
Above we have a single SNP-gene association.
The family of
associations observed cis to ABHD12 can also be
visualized in conjunction with the transcript models.
<img src="abhd12.png">
# Raw materials: structuring expression, genotype, and sample data
## SnpMatrix from snpStats for called and imputed genotypes
As of November 2013, a reasonably efficient representation of expression, sample and
genotype data is defined using an R package containing
- an ExpressionSet instance, and
- a folder inst/parts containing genotype data as SnpMatrix instances, as defined in the snpStats package.
Elements of the sampleNames of the ExpressionSet instance must coincide with elements of the row names
of the SnpMatrix instances. At time of analysis, warnings will be issued and harmonization attempts will be made
when the coincidence is not exact.
The SnpMatrix instances make use of a byte code for (potentially) imputed genotypes.
Each element of the code defines a point on the simplex displayed below, allowing
a discrete but rich set of the key quantities of interest, the expected number of B alleles.
```{r bag,fig=TRUE,fig.width=4,fig.height=4,echo=FALSE}
library(snpStats)
library(scatterplot3d)
tmp = as.raw(1:253)
yy = g2post(tmp)
EB = yy %*% c(0,1,2)
scatterplot3d(yy[,1], yy[,3], EB, xlab="Pr(A/A)", ylab="Pr(B/B)", zlab="mean num. B")
```
Note that the nucleotide codes are not carried in this representation. Typically for
a diallelic SNP, B denotes the alphabetically later nucleotide.
## smlSet for coordinating genotype, expression, and sample-level data
We can illustrate the basic operations with this overall structure, using data collected
on Yoruban (YRI) HapMap cell
lines. Expression data were obtained at ArrayExpression
E-MTAB-264 `r citep(allbib[["Stranger:2012p5427"]])`.
```{r bag2}
library(GGtools)
library(yri1kgv)
library(lumiHumanAll.db)
if (!exists("y22")) y22 = getSS("yri1kgv", "chr22")
y22
dim(exprs(y22))
fn = featureNames(y22)[1:5]
```
The annotation of expression features can be explored in several directions.
First, the probe names themselves encode the 50mers on the chip.
```{r getseq}
library(lumi)
id2seq(fn) # get the 50mer for each probe
# and some annotation
```
Second, the mapping to institutionally curated gene identifiers
is available.
```{r getann}
select( lumiHumanAll.db, keys=fn, keytype="PROBEID", columns=c("SYMBOL", "CHR", "ENTREZID"))
```
Finally, we can look at the genotype information. This can be voluminous
and is managed in an environment to reduce potential copying expenses.
```{r getgen}
gt22 <- smList(y22)[[1]] # access to genotypes
as( gt22[1:5,1:5], "character" )
cs22 = col.summary(gt22) # some information on genotypes
cs22[1:10,]
```
# Cluster management with starcluster
This workflow is based on Amazon EC2 computation managed using the
[MIT starcluster utilities](http://star.mit.edu/cluster/). Configuration
and management of EC2 based machinery is quite simple. The bulk of the partial run described here
used configuration variables
- CLUSTER_SIZE = 4
- NODE_IMAGE_ID = ami-bdaa99d4
- NODE_INSTANCE_TYPE = c3.2xlarge # 8 cores, 15GB RAM on each
- MASTER_INSTANCE_TYPE = c3.2xlarge
In a complete run, for chromosome 1, a rescue run was required with a larger instance
type (m3.2xlarge).
# Programming the parallelized search: various approaches
## High-level, socket-based cluster: ciseqByCluster
We will describe an essentially monolithic approach to using a cluster to
search for eQTL in which evaluation of a single R function
drives the search. The master process will communicate with slaves via
sockets; slaves will write results to disk and ship back to master.
The task is executed across chromosomes that have been split roughly
in thirds to reduce RAM consumption.
The ciseqByCluster function of GGtools is the
workhorse for the search. Arguments to this function
determine how the search will be configured and executed.
The invocation here asks for a search on four chromosomes,
dispatching work from a master R process to a four node cluster,
with multicore concurrency for gene-specific searches
on eight cores per node. Three output files are
generated in the folder identified as targetfolder:
- an RDA file serializing a data.table instance with a record for
each SNP-probe pair satisfying the cis proximity criterion
- a tabix-indexed GFF3 file with the same information as
the data.table
- the tabix .tbi file for the GFF3
The following script is available on the AMI noted above
and will generate the partceu100k_dt data.table instance
used for analysis below.
```{r showscript,eval=FALSE}
library(parallel)
newcl = makePSOCKcluster(c("master", paste0("node00", 1:3)))
library(foreach)
library(doParallel)
registerDoParallel(cores=8) # may want to keep at 5
library(GGtools)
ceuDemoRecov = try(ciseqByCluster( newcl,
chromsToRun=19:22, finaltag="partceu100k",
outprefix="ceurun",
ncoresPerNode=8, targetfolder="/freshdata/CEU_DEMO" ))
save(ceuDemoRecov, file="ceuDemoRecov.rda")
stopCluster(newcl)
stopImplicitCluster()
sessionInfo()
```
The full set of arguments and defaults for ciseqByCluster is
- pack = "yri1kgv",
- outprefix = "yrirun",
- chromsToRun = 1:22, # if length is C will use 3C nodes
- targetfolder = "/freshdata/YRI_3", # for demo, a volume reference
- radius = 100000L,
- nperm = 3L,
- numNodes = 8,
- nodeNames = rep("localhost", numNodes),
- ncoresPerNode = 8,
- numPCtoFilter = 10,
- lowerMAF = .02,
- geneannopk = "lumiHumanAll.db",
- snpannopk = "SNPlocs.Hsapiens.dbSNP144.GRCh37"
- smchrpref = "chr"
The GFF3 file that is generated along
with the data.table instance is useful for targeted queries, potentially from
external applications. The primary difficulty with using this
in R is the need to parse the optional data subfields of field 9.
# Working with the results
## Overview QQ-plot
It is customary to inspect QQ-plots for genome-wide
association studies. For eQTL searches, the number of
test results can range into the billions, so a binned
approach is taken.
```{r lkqq,eval=FALSE}
binnedQQ(partceu100k_dt, ylim=c(0,30), xlim=c(-2,15), end45=12)
```
<img src="lkqq.png">
This gives an indication that the distribution
of the vast majority of observed SNP-gene
pair association statistics is consistent with the null model.
## Sensitivity analysis of tuning the search
Because our data.table output retains information on association scores
achieved after permuting expression against genotype, we can recompute
plug-in FDRs for loci that remain after filtering
the full set of records in various
ways. The eqsens_dt function allows exploration of
sensitivity of eQTL enumerations across various dimensions,
and users can add additional dimension through suitable
programming. The default behavior is illustrated first:
```{r co,echo=FALSE}
update_fdr_filt = function (tab, filt = function(x) x, by = c("pairs", "snps",
"probes")[1]) {
require(GGtools, quietly = TRUE)
tab = filt(tab)
psinds = grep("permScore", colnames(tab), value = TRUE)
nr = nrow(tab)
pscores = vector("numeric", nr * length(psinds))
for (np in 1:length(psinds)) pscores[(((np - 1) * nr) + 1):(np *
nr)] = tab[[psinds[np]]]
if (by == "pairs") {
newfd = pifdr(tab$score, pscores)
}
else {
if (by == "snps")
byvar = "snp"
else if (by == "probes")
byvar = "probeid"
base = tab[, max(score), by = byvar]
maxbysnp = base$V1
ol = list()
pnames = grep("permScore", names(tab))
for (i in 1:length(pnames)) {
tab$score = tab[, pnames[i], with = FALSE]
ol[[i]] = tab[, max(score), by = byvar]$V1
}
newfd = pifdr(maxbysnp, as.numeric(unlist(ol)))
tab = base
}
tab$fdr = newfd
tab
}
#
# defining here so that release version can be used
#
filtgen.maf.dist = function (maf.dist,
validate.tab = function(tab) all(c("mindist",
"MAF") %in% colnames(tab)))
{
stopifnot(is.atomic(maf.dist))
stopifnot(length(maf.dist) == 2)
maf = maf.dist[1]
dist = maf.dist[2]
function(tab) {
stopifnot(isTRUE(validate.tab(tab)))
tab[tab$mindist <= dist & tab$MAF >= maf, ]
}
}
eqsens_dt = function (dtab, filtgen = filtgen.maf.dist, by = c("pairs", "snps",
"probes")[1], targfdrs = c(0.05, 0.01, 0.005), parmslist = list(mafs = c(0.025,
0.05, 0.075, 0.1, 0.125), dists = c(1000, 5000, 10000, 25000,
50000, 1e+05)))
{
parmset = data.matrix(do.call(expand.grid, parmslist))
ntune = nrow(parmset)
ans = foreach(curp = 1:ntune) %dopar% {
tmp = update_fdr_filt(tab = dtab, filt = filtgen(parmset[curp,
]), by = by)
sapply(targfdrs, function(x) sum(tmp$fdr <= x))
}
hold = t(sapply(ans, force))
colnames(hold) = paste0("at_", targfdrs)
cbind(parmset, hold)
}
filtgen.maf.dist = function (maf.dist, validate.tab = function(tab) all(c("mindist",
"MAF") %in% colnames(tab)))
{
stopifnot(is.atomic(maf.dist))
stopifnot(length(maf.dist) == 2)
maf = maf.dist[1]
dist = maf.dist[2]
function(tab) {
stopifnot(isTRUE(validate.tab(tab)))
tab[tab$mindist <= dist & tab$MAF >= maf, ]
}
}
plotsens = function (eqsout, ylab = "count of eQTL at given FDR",
title = "cis radius in bp") {
require(reshape2)
require(ggplot2)
mdf = melt(data.frame(eqsout), id.vars = c("mafs", "dists"))
vind = which(names(mdf) == "variable")
names(mdf)[vind] = "FDR"
mdf[, vind] = gsub("at_", "", mdf[, vind])
ggplot(data = mdf) + geom_point(aes(x = mafs, y = value,
colour = FDR)) + facet_grid(~dists) + theme(axis.text.x = element_text(angle = 90)) +
ylab(ylab) + labs(title = title)
}
```
```{r dosens,fig=TRUE}
load("partceu100k_dt.rda")
library(foreach) # basic function includes a %dopar%
library(doParallel)
library(parallel)
# dcdown = max(c(detectCores()-1,1)) ## use with lots of RAM
dcdown = 1
registerDoSEQ()
if (.Platform$OS.type != "windows") {
#cl = makeForkCluster(dcdown)
registerDoParallel(cores=dcdown) # nesting?
}
```
```{r doeq,eval=FALSE}
eq1 = eqsens_dt(partceu100k_dt)
```
```{r nnn,echo=FALSE}
load("sensSave.rda")
eq1 = sensSave
plotsens(eq1)
```
The arguments to eqsens_dt are
```{r lkar}
args(eqsens_dt)
```
and setting components of the parmslist argument governs
the set of tuning parameter vectors that will be used.
Exploration of additional dimensions would involve passing
a different function for the filtgen parameter; filtgen.maf.dist
assumes a 2-vector of (MAF, dist) will be used to retain rows
with MAF exceeding the input MAF, and dist no greater than the
input mindist from SNP to gene.
## Visualizing results for a gene
As of 1/20/2014 the call to scoresCis() can only be executed with
R-devel and GGtools 4.11.28+.
We will focus on the data.table output. A basic objective
is targeted visualization. The scoresCis function
helps with this. We load the data.table instance first.
```{r coded,eval=FALSE}
library(data.table)
load("partceu100k_dt.rda")
scoresCis("CPNE1", partceu100k_dt)
```
<!-- some code dropped here -->
<img src="demomod.png">
## Statistical characteristics of search results
In this section we consider how
structural and genetic information
can be used to distinguish conditional probabilities
of SNP genotypes being associated with phenotypic variation.
We use some additional data
provided in the GGtools package concerning
a) chromatin state of the lymphoblastoid cell line GM12878,
a line similar to those form which expression data were generated,
and b) identities of SNP that have been found to be hits
or are in LD with hits at $ R^2 > 0.80 $
in the NHGRI GWAS catalog. See man pages for hmm878 in GGtools package
and gwastagger in gwascat package for more information. These
data are automatically propagated to ciseqByCluster data.table output.
Our approach is to use logistic regression on 1.5 million
records. We use the biglm package to keep memory images modest.
We discretize some of the key factors, and form an indicator
variable for the event that the SNP is in
a region of active or poised promoter chromatin state, as
determined by ChromHMM on GM12878.
```{r disc}
distcat = cut(partceu100k_dt$mindist,c(-1, 1, 1000, 5000, 10000, 50000, 100001))
fdrcat = cut(partceu100k_dt$fdr,c(-.01,.005, .05, .1, .2, 1.01))
fdrcat = relevel(fdrcat, "(0.2,1.01]")
mafcat = cut(partceu100k_dt$MAF,c(0,.05, .1, .2, .3, .51))
approm = 1*partceu100k_dt$chromcat878 %in% c("1_Active_Promoter", "3_Poised_Promoter")
```
Now we rebuild the data.table and fit the model to a randomly
selected training set of about half the total number of records.
```{r fit}
partceu100k_dt = cbind(partceu100k_dt, distcat, fdrcat, mafcat, approm)
set.seed(1234)
train = sample(1:nrow(partceu100k_dt),
size=floor(nrow(partceu100k_dt)/2), replace=FALSE)
library(biglm)
b1 = bigglm(isgwashit~distcat+fdrcat+mafcat+approm, fam=binomial(),
data=partceu100k_dt[train,], maxit=30)
```
A source of figures of merit is the calibration of predictions
against actual hit events in the test set.
```{r cali}
pp = predict(b1, newdata=partceu100k_dt[-train,], type="response")
summary(pp)
cpp = cut(pp, c(0,.025, .05, .12, .21))
table(cpp)
sapply(split(partceu100k_dt$isgwashit[-train], cpp), mean)
```
It seems that the model, fit using data on a small number of chromosomes,
has some predictive utility. We can visualize the coefficients:
```{r demomodco,fig=TRUE,fig.width=7,fig.height=4}
tmat = matrix(rownames(summary(b1)$mat),nc=1)
est = summary(b1)$mat[,1]
library(rmeta)
forestplot(tmat, est, est-.01, est+.01, xlog=TRUE,
boxsize=.35, graphwidth=unit(3, "inches"),
xticks=exp(seq(-4,2,2)))
```
Standard errors in the presence of correlations among responses
require further methodological development.
```{r cleanup, echo=FALSE}
rm(list=ls())
gc()
```
# References
```{r results='asis',echo=FALSE}
bibliography() #style="markdown")
```
```{r sess}
sessionInfo()
```