/
simulations.Rmd
598 lines (487 loc) · 28.8 KB
/
simulations.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
---
title: "Simulations"
author: "Jean Morrison"
date: "2019-07-10"
output: workflowr::wflow_html
---
# Introduction
This page gives instructions for running the simulations that we ran to evaluate CAUSE and will allow you to reproduce the results [the paper](https://www.biorxiv.org/content/10.1101/682237v3). It will hopefully also serve as a resource for explaining those results in more detail and allow you to modify simulations to run your own experiments. There are a lot of parameters that can be changed. We tried to explore a reasonable range of scenarios but there are many more variations that could be tried. Running the simulations involves multiple set-up steps, mostly software installation and may take a bit of time, especially if you are new to DSC. Additionally, completing the simulations requires several hours of computing and needs to be done on a cluster. If you don't have access to that kind of computing resource you may still find this section useful. All of our results can be downloaded and we will include some discussions of the patterns that we observed as we go along.
# Set Up
## Software Installation
Simulations are run using the `causeSims` R package. The package can be installed with
```
devtools::install_github("jean997/causeSims")
```
This package relies on a few R packages that are not on CRAN, which you will need to install separately.
For detailed installation instructions including installation of supporting R packages for other methods see [here](https://github.com/jean997/causeSims). There you can also find
a walk through of a single simulation and explanations of the functions contained in the package.
We conduct simulations using the [Dynamical Statisical Comparisons](https://stephenslab.github.io/dsc-wiki/overview) software. You will need to [have DSC installed](https://stephenslab.github.io/dsc-wiki/installation.html) and it will be helpful to take a look at [the tutorial](https://stephenslab.github.io/dsc-wiki/first_course/Intro_DSC) if you haven't used DSC before.
## Set Up The Analysis Directory
To set up the analysis create a working directory that you will run the simulations in. Make sure all the software in the previous section is installed.
From inside this working directory use the R command
```{r, eval=FALSE}
library(causeSims)
setup_sims()
```
This function downloads the following files:
+ LD data: This is used for simulating summary statistics and is placed into a sub-directory `data`. The LD data are downloaded from [here](https://zenodo.org/record/3235780/) and contain data for HapMap3 SNPs on chromosome 19 estimated from the CEU 1000 Genomes samples.
+ DSC files: These contain instructions for DSC on how to simulate data and analyze it. Jump to <a href="#details">this subsection</a> for a detailed description of the DSC files and simulation parameters.
If you want to run your own experiments, these files should provide a good starting place for that.
The set up function downloads three DSC files.
One is for evaluating false positives in the presence of different levels of correlated pleiotropy, one is for evaluating power when there is a true causal effect, and one is for evaluating the robustness of CAUSE to different prior distributions on $\gamma$ and $\eta$ (see supplementary note of the paper).
+ A configuration file for running DSC on a cluster. This file is set up for the University of Chicago Research Computing Cluster so you will need to change it to match your cluster set up. Even if you are using the U of C cluster, you will need to change the home directory and account lines of this file.
Read about using DSC with a compute cluster [here](https://stephenslab.github.io/dsc-wiki/advanced_course/Remote_Computations) (We are using the "On Host" option) and modify `config.yml` as needed.
+ Some code for plotting and results summarization
# Run DSC
To run DSC use the following commands at the command line
```
nohup dsc --replicate 100 --host config.yml -c 4 power.dsc > pwr_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 false_positives.dsc > fp_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 sigma_g.dsc > sg_dsc.out &
```
Depending on your computing resources and how many jobs you allow to run simultaneously, this could take several hours for each command. Errors can occur that stop DSC, usually a result of interfacing with the cluster. For example, DSC may fail to detect that a job has completed. If this happens, simply restart the failed job using the same command. DSC will pick up where it left off since all results are saved along the way.
Running these commands will generate lots of `*.err` and `*.out` files all of which can be safely deleted. Results will be in directories named `fp` (false positives), `pwr` (power) and `sigma_g` (results for testing the prior on $\gamma$ and $\eta$).
# Look at Results and Plot Them
## False positive and power results
The script `extract_results_main.R` will use the `dscrutils` package to extract results from the false positive and power analyses. If you are modifying the analysis and have changed the names of the output directories, you will need to change them in this file also. Depending on the changes you have made, you may need to make other modifications. Read more about [dscrutils here](https://stephenslab.github.io/dsc-wiki/first_course/Intro_DSC.html#query-the-results) and in the `dscrutils` documentation. This script will save R objects containing CAUSE, LCV, and other MR method results for both analyses. It will append the date so it won't over-write the results downloaded by `setup_sims()`. Run it at the command line using
```
Rscript extract_results_main.R
```
or from inside R using `source(extract_results_main.R)`. It will take several minutes to extract all of the results. If you would like to use your own files to plot, change the file names in the code below which uses the downloaded results.
```{r, message=FALSE, packages}
library(tidyverse)
library(gridExtra)
library(pROC)
library(jcolors)
```
The results extraction script makes a data frame for CAUSE results and a data frame for all the other MR methods. There is also a data frame for results from LCV but this is discussed later on. We first read in results and look around a little
```{r, lookat}
causedf <- readRDS("sim_results/main_causedf.RDS")
dim(causedf)
head(causedf)
```
The frist 18 columns of the data frame are all simulation parameters, information about the simulation files, or simple statistics computed about the simulated data. These include
+ `gamma`, `eta`, `q`: True values of $\gamma$, $\eta$ and $q$ used to simulate data. `params` gives a character string listing $(\gamma, \eta, q)$.
+ `tau`, `omega`: Transformations of $\gamma$ and $\eta$ respectively. $\tau$ is the proportion of $Y$ heritability explained by the causal effect of trait $M$, equal to $\tau = \gamma^2 h^2_{M}/h^2_Y$. In our case $h^2_M = h^2_Y$ for all simulations so $\tau = \gamma^2$. $\omega$ is defined as $\omega = \eta^2 h^2_M/h^2_Y = \eta^2$.
+ `h1`, `h2`: Heritabilitites for trait $M$ and $Y$ respectively `h1` $ = h^2_M$ and `h2`$= h^2_Y$. In our case these are equal to 0.25 for all simulations.
+ `n1`, `n2`: Sample size for GWAS of traits $M$ and $Y$ respectively. These can be equal to 12,000 (low power) or 40,000 (high power).
+ `neffect1`, `neffect2`: Expected number of effect SNPS for trait $M$ and $Y$. In our case equal to 1000 for all simulations.
+ `lcv_q1`, `lcv_q2`, `gcp`: True values of the LCV parameters $q_1$, $q_2$ and GCP. These are computed from $\gamma$, $\eta$ and $q$.
+ `m_sig`: The number of genome-wide significant variants for trait $M$ after pruning for LD using an threshold of $r^2 < 0.1$
Values extracted from the CAUSE results all begin with `cause.`. For example, `cause.p` is the CAUSE $p$-value. Posterior medians for $\gamma$, $\eta$ and $q$ are reported for the sharing and causal models (e.g. `cause.q_causal` is the posterior median of $q$ in the causal model).
The `mrdf` data frame gives results for alternative MR methods. The first 18 columns are the same as in the `causedf` data frame. For these methods, we have extracted only the method name (in the `mr` column) and the $p$-value (`mr.p`).
```{r, lookat2}
mrdf <- readRDS("sim_results/main_mrdf.RDS")
dim(mrdf)
head(mrdf)
with(mrdf, table(mr))
```
We first summarize results grouping by sample size and simulation setting. Some of the MR methods will have missing results if there are too few variants. We consider this as not having rejected the null hypothesis or a non-significant result. CAUSE does not have any missing results.
```{r, summ_main}
cause_summ <- causedf %>%
group_by(q, tau, omega, n1, n2) %>%
summarize(n_sig = sum(cause.p < 0.05, na.rm=T),
missing = sum(is.na(cause.p))) %>%
mutate(analysis = "cause")
mr_summ <- mrdf %>%
group_by(q, tau, omega, n1, n2, mr) %>%
summarize(n_sig = sum(mr.p < 0.05, na.rm=T),
missing = sum(is.na(mr.p))) %>%
rename(analysis = "mr")
res <- bind_rows(cause_summ, mr_summ) %>%
ungroup() %>%
mutate(label = fct_recode(analysis, "CAUSE"="cause",
"Egger" = "egger", "GSMR"="gsmr",
"IVW"="ivw",
"MR-PRESSO"="mrp", "Wtd. Median" = "wm"),
n1 = factor(paste0("N[M]==", n1)),
n2 = factor(paste0("N[Y]==", n2)))
filter(res, analysis=="cause") %>% with(., all(missing == 0))
```
### Power
Now we can plot!
First we plot power with $\tau$ on the horizontal axis and the number of times each method rejected the null hypothesis on the vertical axis.
```{r, plot_power}
#colors
cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5], "#D590DA")
#plot
pwr <- res %>%
filter(tau != 0 | (tau == 0 & omega == 0 & q == 0)) %>% #This line includes only the power simulations.
ggplot(.) +
geom_line(aes(x=tau, y=n_sig/100, color=label)) +
geom_point(aes(x=tau, y=n_sig/100, color=label, shape=label)) +
scale_color_manual(values=cols) +
scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
geom_hline(yintercept=0.05, linetype=2) +
xlab(expression("Percentage of Y heritability explained by M (" ~ tau ~ "=" ~ gamma^2 ~ h[M]^2/h[Y]^2 ~ ")")) +
ylab("Percent Significant") +
theme_bw() +
theme(legend.title = element_blank(),
strip.text = element_text(size=14),
axis.title = element_text(size=16),
legend.text = element_text(size=12),
legend.position = "bottom") +
guides(color=guide_legend(nrow=1),
linetype=guide_legend(nrow=1),
shape=guide_legend(nrow=1) ) +
facet_wrap(~n1*n2, nrow=2, labeller = label_parsed)
pwr
```
CAUSE has lower power than the other methods except when the power for trait $M$ is low and the power for trait $Y$ is high. This is expected since CAUSE uses a more stringent criterion to conclude that the data are consistent with a causal effect. It is able to achieve better power when $M$ is under-powered and $Y$ is well powered because it uses all variants genome-wide rather than only those that are highly significant for trait $M$. This is an important setting because it may be applicable to scenarios when $M$ is a biomarker or molecular trait and $Y$ is a complex triat.
### False Positives
Next we plot false positives. Our simulations include two values of $\eta$ ($\sqrt{0.05}$ and $\sqrt{0.02}$). Although we only had space to present one set of results in the paper, we will make plots for both here. We plot the proportion of $M$ effect variants that display correlated pleiotropy, $q$, on the horizontal axis and the number of discoveries on the vertical axis.
The bottom row of plots are for the larger value of $\eta$ and are the ones shown in the paper.
For convenience, in extracting results, we copy the results for $q=0$.
```{r, copy_zero}
z1 <- filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.05)
z2 <- filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.02)
res2 <- bind_rows(res, z1, z2) %>%
filter(!(q == 0 & omega == 0 & tau == 0))
```
```{r, plot_fp}
fp <- res2 %>%
filter(tau == 0) %>%
ggplot(.) +
geom_line(aes(x=q, y=n_sig/100, color=label)) +
geom_point(aes(x=q, y=n_sig/100, color=label, shape=label)) +
geom_hline(yintercept=0.05, linetype=2) +
scale_color_manual(values=cols) +
scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
geom_hline(yintercept=0.05, linetype=2) +
xlab("Proportion of variants acting through U (q)") +
ylab("Percent Significant") +
theme_bw() +
theme(legend.title = element_blank(),
strip.text = element_text(size=14),
axis.title = element_text(size=16),
legend.text = element_text(size=12),
legend.position = "bottom") +
guides(color=guide_legend(nrow=1),
linetype=guide_legend(nrow=1),
shape=guide_legend(nrow=1) ) +
facet_wrap(~omega*n1*n2, nrow=2, labeller=label_parsed)
fp
```
In all cases CAUSE has a lower false positive rate than alternative methods. The false positive rate increases for large values of $q$ and is worse when the effect of the shared factor is stronger.
### ROC Curves
We use ROC curves to evaluate how well each method is able to distinguish simulations with a causal effect of size $\gamma = \sqrt{0.05}$ from simulations with a shared factor with an equal effect acconting for 30\% of the trait $M$ vairants. Both of these parameters can be modified. The first step is to combine the CAUSE and alternative MR results into a single data frame
```{r, combine}
x1 <- causedf %>%
mutate(analysis = "cause", stat = -log10(cause.p)) %>%
select(tag, q, omega, tau, n1, n2, analysis, stat)
x2 <- mrdf %>%
mutate(stat = -log10(mr.p)) %>%
rename(analysis = mr) %>%
select(tag, q, omega, tau, n1, n2, analysis, stat)
full_results <- bind_rows(x1, x2)
full_results <- full_results %>%
mutate(label = fct_recode(analysis, "CAUSE"="cause",
"Egger" = "egger", "GSMR"="gsmr",
"IVW"="ivw",
"MR-PRESSO"="mrp", "Wtd. Median" = "wm"))
```
Next we use the `pROC` package.
```{r, message=FALSE, roc_curves}
eff <- 0.05
my_q <- 0.3
n1s <- c(12000, 12000, 40000, 40000)
n2s <- c(12000, 40000, 12000, 40000)
methods <- with(full_results, unique(label)) %>%
as.vector(.)
cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5], "#D590DA")
roc_df <-lapply(seq_along(n1s), function(i){
my_n1 <- n1s[i]
my_n2 <- n2s[i]
rocs <- lapply(methods, function(m){
#cat(m, " ")
xx <- filter(full_results, n1 == my_n1 & n2 == my_n2 & label == m &
( tau == eff | (omega == eff & q == my_q & tau == 0)))
xx$stat[is.na(xx$stat)] <- 0
with(xx, roc(predictor = stat, response = tau !=0)) %>%
with(., data.frame(thresh = thresholds, spec = specificities,
sens = sensitivities, method = m,
n1 = my_n1, n2 = my_n2,
stringsAsFactors=F)) %>%
return()
})
do.call(rbind, rocs)
})
roc_df <- do.call(rbind, roc_df) %>%
mutate(n1 = factor(paste0("N[M]==", n1)),
n2 = factor(paste0("N[Y]==", n2)))
head(roc_df)
# A data frame with points to add to the curve
pw <- filter(res, tau == eff )
f <- filter(res, omega == eff & q == my_q & tau ==0)
df <- select(pw, label, n_sig, n1, n2) %>%
rename(power = n_sig) %>%
full_join(., f, c("label", "n1", "n2")) %>%
rename(fp = n_sig) %>%
select(label, power, fp, n1, n2) %>%
filter(label %in% methods) %>%
rename(method=label)
## Plotting!
plt <- roc_df %>%
arrange(sens) %>%
ggplot(.) +
geom_line(aes(x=1-spec, y=sens, col=method, group=method, linetype=method)) +
geom_point(aes(x=fp/100, y = power/100, col=method, shape=method), size =4 , data=df) +
scale_color_manual(values=cols) +
scale_linetype_manual(values = rep(1, 6)) +
scale_shape_manual(values = c(rep(16, 6))) +
xlab("1 - Specificity (False Positive Rate)") + ylab("Sensitivity (Power)") +
theme_bw() +
theme(legend.title = element_blank(),
strip.text = element_text(size=14),
axis.title = element_text(size=16),
legend.text = element_text(size=12),
legend.position = "bottom") +
guides(color=guide_legend(nrow=1),
linetype=guide_legend(nrow=1),
shape=guide_legend(nrow=1)) +
facet_wrap(~n1*n2, nrow=2, labeller= label_parsed)
plt
```
## Prior robustness results
The `sigma_g` anlaysis is intended to evaluate how sensitive CAUSE is to changes in the prior on $\gamma$ and $\eta$. These parameters have the same $N(0, \sigma_{\gamma, \eta})$ prior distribution. By default, $\sigma_{\gamma \eta}$ is chosen guided by the data. These results dmeonstrate that, within a wide range, CAUSE is insensitive to the value of $\sigma_{\gamma \eta}$.
The R script `extract_sigma_g.R` will use `dscrutils` to extract simulation results and store them in a data frame.
Run it at the command line with
```
Rscript extract_sigma_g.R
```
or from inside an R session using `source`.
This code will append the date to the file it saves so it won't save over the results downloaded by `setup_sims()`. If you want to use your own data frame, change the file name in the commands below.
```{r, lookat3}
causedf <- readRDS("sim_results/sg_causedf.RDS")
head(causedf)
with(causedf, table(sigma_g))
with(causedf, table(params))
```
The only columns that we have not seen previously are `quant` and `sigma_g`. `sigma_g` gives the value of $\sigma_{\gamma \eta}$ used in each simulation. These are large, medium and small values which are chosen so that $\sqrt{0.05}$ is the 51st, 65th and 80th quantile (`quant`) of the $N(0, \sigma_{\gamma \eta})$ distribution respectively. In this data frame we have ommited the `cause.` prefix on columns because CAUSE is the only method we are evaluating.
There are three sets of parameters used in simulations. Below we calculate the correlation in $p$-value and posterior median for each parameter in each setting across the three values of $\sigma_{\gamma \eta}$. The following code only reports the minimum correlation for each parameter-simulation setting combination and the correlation between the medium and small values of $\sigma_{\gamma \eta}$.
```{r}
vars <- c("z", "p", "eta_sharing", "q_sharing",
"eta_causal", "q_causal", "gamma_causal")
min_cor <- expand.grid(variable = vars,
params = unique(causedf$params), stringsAsFactors=F)
min_cor$min_cor <- NA
min_cor$cor_med_small <- NA
for(i in seq(nrow(min_cor))){
v <- min_cor$variable[i]
X <- select(causedf, simulate.output.file, params, quant, one_of(v)) %>%
spread(quant, get(v)) %>%
filter(params==min_cor$params[i]) %>%
select(one_of(c("0.51", "0.65", "0.8")))
min_cor$min_cor[i] <- min(cor(X))
min_cor$cor_med_small[i] <- cor(X[,"0.65"], X[,"0.8"])
}
min_cor
```
This code is also contained in `summarize_sigma_g.R`.
# What's in the DSC Files? {#details}
We will now go in detail through one of the DSC files so you can see what is happening and how to make changes to run your own experiments. We will look at `false_positives.dsc` but `power.dsc` and `sigma_g.dsc` are similar.
The DSC statement is at the top:
```
DSC:
define:
mr: ivw, egger, mrp, wm, twas, wm, gsmr
run: simulate*gw_sig,
simulate*LCV,
simulate*mr,
simulate*cause_params*cause
replicate: 100
output: fp
```
This statement tells DSC what to run. In `define` we define a group of modules. These will turn up later in the file. In `run` we tell DSC what to run. For example, `simulate*gw_sig` means that DSC should run the `simulate` module followed by the `gw_sig` module which will use the results of the `simulate` module. `replicate` indicates the number of replicates to run (though this is over-ridden by the `--replicate` flag in the DSC call above) and `output` names the output directory.
The rest of the document defines modules. All of our modules use R code and most of them rely on the `causeSims` R package.
## Simulation Modules
The simulation module:
```
simulate: R(library(causeSims);
snps <- readRDS("data/chr19_snpdata_hm3only.RDS");
evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
dat <- sum_stats(snps, evd_list,
n_copies = 30,
n1 = n1, n2=n2,
h1=h1, h2=h1,
neffect1 = neffect1, neffect2 = neffect2,
tau = tau, omega = omega, q = q,
cores = 4, ld_prune_pval_thresh = 0.01,
r2_thresh = 0.1))
q: 0.1, 0.2, 0.3, 0.4, 0.5
omega: 0.02, 0.05
tau: 0
n1: 12000, 40000
n2: 12000, 40000
neffect1: 1000
neffect2: 1000
h1: 0.25
h2: 0.25
$sim_params: c(q = q, omega = omega, tau = tau, h1 = h1, h2 = h2, n1 = n1, n2 = n2, neffect1 = neffect1, neffect2 =neffect2)
$dat: dat
```
The R code contained in `R()` is the guts of this module. A data object named `dat` is generated using the `sum_stats` function from the `causeSims` R pacakge. Below this, we specify parameters used to simulate the data. Most important are `q`, `omega`, and `tau`. `q` is the proportion of variants acting through $U$ while `omega` and `tau` are standardized versions of $\gamma$ and $\eta$ in the paper. `tau` is the proportion of $Y$ heritability explained by the causal effect of $M$ on $Y$ times the sign of $\gamma$,
$$
\tau = sign(\gamma)\frac{\gamma^2 h^2_{M}}{h^{2}_{Y}},
$$
and `omega` is the proportion of $Y$ heritability explained by $U$, divided by $q$ and multiplied by the sign of $\eta$.
$$
\omega = sign(\eta)\frac{\eta^2 h^2_{M}}{h^{2}_{Y}}
$$
This parameterization is sometimes more convenient for running simulations that using $\eta$ and $\gamma$. However, the `sum_stats` function will accept either `omega` and `tau` arguments or `eta` and `gamma`. Note that here we have specified that the function can use 4 cores. This should match the `n_cpu` parameter in the `simulate` section of the config file. The `sum_stats` function generates data both with and without LD. In this file we only analyze the "with LD" data but we will describe how to make modifications to analyze both or only the "no LD" data.
The `gw_sig` module is a helper module. This module computes the GCP (a parameter defined by LCV (O'Connor and Price 2019)) and computes the number of genome-wide significant variants. It also stores all the simulation parameters which makes it easier to extract them. Extracting parameters from the `simulate` module will require reading in the entire data set, objects created by the `gw_sig` module are smaller and faster to read.
```
gw_sig: R(library(causeSims);
m_sig_nold <- with($(dat), sum(p_value_nold < thresh ));
y_sig_nold <- with($(dat), sum(2*pnorm(-abs(beta_hat_2_nold/seb2)) < thresh ));
m_sig <- with($(dat), sum(p_value < thresh & ld_prune==TRUE));
params <- $(sim_params);
tau <- as.numeric(params["tau"]);
omega <- as.numeric(params["omega"]);
q <- as.numeric(params["q"]);
h1 <- as.numeric(params["h1"]);
h2 <- as.numeric(params["h2"]);
neffect1 <- as.numeric(params["neffect1"]);
neffect2 <- as.numeric(params["neffect2"]);
gamma <- sqrt(tau*sum(h2)/sum(h1))*sign(tau);
eta <- sqrt(abs(omega)*sum(h2)/sum(h1))*sign(omega);
#LCV parameters
if(q == 0 & tau == 0){
q.1 <- q.2 <- 0;
gcp <- 0;
}else if(q == 0){
q.1 <- 1;
q.2 <- 0;
gcp <- 1;
}else{
q.1 <- sqrt(q);
q.2 <- eta * sqrt(q) * sqrt(h1) / sqrt(h2);
gcp = (log(q.2^2) - log(q.1^2)) / (log(q.2^2) + log(q.1^2));
};
gcp_mom <- gcp_moments($(dat), h1, h2)
)
thresh: 5e-8
$m_sig: m_sig
$m_sig_nold: m_sig_nold
$y_sig_nold: y_sig_nold
$q: q
$omega: omega
$tau: tau
$gamma: gamma
$eta: eta
$h1: h1
$h2: h2
$n1: as.numeric(params["n1"])
$n2: as.numeric(params["n2"])
$neffect1: as.numeric(params["neffect1"])
$neffect2: as.numeric(params["neffect2"])
$lcv_q1: q.1
$lcv_q2: q.2
$lcv_gcp: gcp
$lcv_gcp_mom: gcp_mom$gcp
```
## CAUSE Modules
We have divided CAUSE analysis into parameter estimation and model fitting steps. This makes it easier to experiment with changes only to the parameter estimation step.
```
cause_params: R(library(causeSims);
p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
)
null_wt: 10
$cause_params_ld: p1
cause: R(library(causeSims);
library(cause);
cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = FALSE);
z <- -1*summary(cause_res)$z;
p <- pnorm(-z);
quants <- summary(cause_res)$quants)
$cause_res: cause_res
$sigma_g: cause_res$sigma_g
$eta_med_2: quants[[1]][1,2]
$q_med_2: quants[[1]][1,3]
$eta_med_3: quants[[2]][1,2]
$gamma_med_3: quants[[2]][1,1]
$q_med_3: quants[[2]][1,3]
$z: z
$p: p
```
Both modules use wrapper functions from the `causeSims` package. These are very simple wrappers that make it easy to run CAUSE on the simulated data format. Note that here we use `no_ld = FALSE` in both modules. If you want to use the data that is simulated without LD instead you can change `no_ld = TRUE`. You could get results for both using the following set of modules.
```
cause_params: R(library(causeSims);
p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
p2 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=TRUE);
)
null_wt: 10
$cause_params_ld: p1
$cause_params_nold: p2
cause: R(library(causeSims);
library(cause);
if(no_ld==FALSE){
cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = no_ld);
else{
cause_res <- cause_sims($(dat), $(cause_params_nold), no_ld = no_ld);
}
z <- -1*summary(cause_res)$z;
p <- pnorm(-z);
quants <- summary(cause_res)$quants)
$no_ld: TRUE, FALSE
$cause_res: cause_res
$sigma_g: cause_res$sigma_g
$eta_med_2: quants[[1]][1,2]
$q_med_2: quants[[1]][1,3]
$eta_med_3: quants[[2]][1,2]
$gamma_med_3: quants[[2]][1,1]
$q_med_3: quants[[2]][1,3]
$z: z
$p: p
```
## Alternative MR Modules
There are many modules for alternative MR methods. Modules for IVW regression, Egger regression, MR-PRESSO, and the weighted median method are very similar and rely on wrappers from `causeSims`. We can just look at one of these.
```
ivw: R(library(causeSims);
res <- ivw($(dat), p_val_thresh=thresh, no_ld = no_ld);
)
thresh: 5e-8
no_ld: FALSE
$z: res$z
$p: res$p
```
To run on the data without LD change the `no_ld` parameter to `no_ld: TRUE` or `no_ld: TRUE, FALSE` to run both.
The GSMR module is a little more complicated
```
gsmr: R(library(causeSims);
evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
res <- gsmr_sims($(dat), evd_list, p_val_thresh = 5e-8, no_ld = FALSE);
if(!is.null(res)){
z <- res$bxy/res$bxy_se;
est <- res$bxy;
p <- res$bxy_pval;
}else{
z <- est <- p <- NA;
}
)
thresh: 5e-8
$z: z
$p: p
$est_gsmr: est
$gsmr: res
```
GSMR requires the LD structure and sometimes returns errors. In this case the `gsmr_sims` wrapper returns NULL and we set results to missing. If `no_ld = TRUE` the LD data in `evd_list` is simply ignored.
## LCV Module
Results from LCV are not compared with CAUSE in the paper because LCV is performing a different test than other MR methods. However, we have implemented wrappers and a module to run LCV.
```
LCV: R(library(causeSims);
res <- try(lcv_sims($(dat),no_ld = no_ld, sig.thresh = thresh));
if(class(res) == "try-error"){
res <- list(pval.gcpzero.2tailed=NA, gcp.pm = NA, gcp.pse = NA);
};
)
thresh: 30
no_ld: FALSE
$p: res$pval.gcpzero.2tailed
$gcp_mean: res$gcp.pm
$gcp_pse: res$gcp.pse
$gcp_obj: res
```
When LCV returns errors, we set results to missing.