analysis/simulations.Rmd

---
title: "Simulations"
author: "Jean Morrison"
date: "2019-07-10"
output: workflowr::wflow_html
---


# Introduction

This page gives instructions for running the simulations that we ran to evaluate CAUSE and will allow you to reproduce the results [the paper](https://www.biorxiv.org/content/10.1101/682237v3). It will hopefully also serve as a resource for explaining those results in more detail and allow you to modify simulations to run your own experiments. There are a lot of parameters that can be changed. We tried to explore a reasonable range of scenarios but there are many more variations that could be tried. Running the simulations involves multiple set-up steps, mostly software installation and may take a bit of time, especially if you are new to DSC. Additionally, completing the simulations requires several hours of computing and needs to be done on a cluster. If you don't have access to that kind of computing resource you may still find this section useful. All of our results can be downloaded and we will include some discussions of the patterns that we observed as we go along. 


# Set Up

## Software Installation

Simulations are run using the `causeSims` R package. The package can be installed with

```
devtools::install_github("jean997/causeSims")
```

This package relies on a few R packages that are not on CRAN, which you will need to install separately. 
For detailed installation instructions including installation of supporting R packages for other methods see [here](https://github.com/jean997/causeSims). There you can also find
a walk through of a single simulation and explanations of the functions contained in the package.


We conduct simulations using the [Dynamical Statisical Comparisons](https://stephenslab.github.io/dsc-wiki/overview) software. You will need to [have DSC installed](https://stephenslab.github.io/dsc-wiki/installation.html) and it will be helpful to take a look at [the tutorial](https://stephenslab.github.io/dsc-wiki/first_course/Intro_DSC) if you haven't used DSC before.

## Set Up The Analysis Directory

To set up the analysis create a working directory that you will run the simulations in. Make sure all the software in the previous section is installed.

From inside this working directory use the R command 
```{r, eval=FALSE}
library(causeSims)
setup_sims()
```

This function downloads the following files:

+ LD data: This is used for simulating summary statistics and is placed into a sub-directory `data`. The LD data are downloaded from [here](https://zenodo.org/record/3235780/) and contain data for HapMap3 SNPs on chromosome 19 estimated from the CEU 1000 Genomes samples.

+ DSC files: These contain instructions for DSC on how to simulate data and analyze it. Jump to <a href="#details">this subsection</a> for a detailed description of the DSC files and simulation parameters. 
If you want to run your own experiments, these files should provide a good starting place for that. 
The set up function downloads three DSC files.
One is for evaluating false positives in the presence of different levels of correlated pleiotropy, one is for evaluating power when there is a true causal effect, and one is for evaluating the robustness of CAUSE to different prior distributions on $\gamma$ and $\eta$ (see supplementary note of the paper). 

+ A configuration file for running DSC on a cluster. This file is set up for the University of Chicago Research Computing Cluster so you will need to change it to match your cluster set up. Even if you are using the U of C cluster, you will need to change the home directory and account lines of this file. 
 Read about using DSC with a compute cluster [here](https://stephenslab.github.io/dsc-wiki/advanced_course/Remote_Computations) (We are using the "On Host" option) and modify `config.yml` as needed.
 
+ Some code for plotting and results summarization


# Run DSC

To run DSC use the following commands at the command line

```
nohup dsc --replicate 100 --host config.yml -c 4 power.dsc > pwr_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 false_positives.dsc > fp_dsc.out &
nohup dsc --replicate 100 --host config.yml -c 4 sigma_g.dsc > sg_dsc.out &
```
Depending on your computing resources and how many jobs you allow to run simultaneously, this could take several hours for each command. Errors can occur that stop DSC, usually a result of interfacing with the cluster. For example, DSC may fail to detect that a job has completed. If this happens, simply restart the failed job using the same command. DSC will pick up where it left off since all results are saved along the way. 
Running these commands will generate lots of `*.err` and `*.out` files all of which can be safely deleted. Results will be in directories named `fp` (false positives), `pwr` (power) and `sigma_g` (results for testing the prior on $\gamma$ and $\eta$).


# Look at Results and Plot Them

## False positive and power results

The script `extract_results_main.R` will use the `dscrutils` package to extract results from the false positive and power analyses. If you are modifying the analysis and have changed the names of the output directories, you will need to change them in this file also. Depending on the changes you have made, you may need to make other modifications. Read more about [dscrutils here](https://stephenslab.github.io/dsc-wiki/first_course/Intro_DSC.html#query-the-results) and in the `dscrutils` documentation. This script will save R objects containing CAUSE, LCV, and other MR method results for both analyses. It will append the date so it won't over-write the results downloaded by `setup_sims()`. Run it at the command line using
```
Rscript extract_results_main.R
```
or from inside R using `source(extract_results_main.R)`. It will take several minutes to extract all of the results. If you would like to use your own files to plot, change the file names in the code below which uses the downloaded results.

```{r, message=FALSE, packages}
library(tidyverse)
library(gridExtra)
library(pROC)
library(jcolors)
```

The results extraction script makes a data frame for CAUSE results and a data frame for all the other MR methods. There is also a data frame for results from LCV but this is discussed later on. We first read in results and look around a little

```{r, lookat}
causedf <- readRDS("sim_results/main_causedf.RDS")
dim(causedf)
head(causedf)
```

The frist 18 columns of the data frame are all simulation parameters, information about the simulation files, or simple statistics computed about the simulated data. These include

+ `gamma`, `eta`, `q`: True values of $\gamma$, $\eta$ and $q$ used to simulate data. `params` gives a character string listing $(\gamma, \eta, q)$.
+ `tau`, `omega`: Transformations of $\gamma$ and $\eta$ respectively. $\tau$ is the proportion of $Y$ heritability explained by the causal effect of trait $M$, equal to $\tau = \gamma^2 h^2_{M}/h^2_Y$. In our case $h^2_M  = h^2_Y$ for all simulations so $\tau = \gamma^2$. $\omega$ is defined as $\omega = \eta^2 h^2_M/h^2_Y  = \eta^2$. 
+ `h1`, `h2`: Heritabilitites for trait $M$ and $Y$ respectively `h1` $ = h^2_M$ and `h2`$= h^2_Y$. In our case these are equal to 0.25 for all simulations.
+ `n1`, `n2`: Sample size for GWAS of traits $M$ and $Y$ respectively. These can be equal to 12,000 (low power) or 40,000 (high power).
+ `neffect1`, `neffect2`: Expected number of effect SNPS for trait $M$ and $Y$. In our case equal to 1000 for all simulations.
+ `lcv_q1`, `lcv_q2`, `gcp`: True values of the LCV parameters $q_1$, $q_2$ and GCP. These are computed from $\gamma$, $\eta$ and $q$. 
+ `m_sig`: The number of genome-wide significant variants for trait $M$ after pruning for LD using an threshold of $r^2 < 0.1$

Values extracted from the CAUSE results all begin with `cause.`. For example, `cause.p` is the CAUSE $p$-value. Posterior medians for $\gamma$, $\eta$ and $q$ are reported for the sharing and causal models (e.g. `cause.q_causal` is the posterior median of $q$ in the causal model).

The `mrdf` data frame gives results for alternative MR methods. The first 18 columns are the same as in the `causedf` data frame. For these methods, we have extracted only the method name (in the `mr` column) and the $p$-value (`mr.p`).
```{r, lookat2}
mrdf <- readRDS("sim_results/main_mrdf.RDS")
dim(mrdf)
head(mrdf)
with(mrdf, table(mr))
```


We first summarize results grouping by sample size and simulation setting. Some of the MR methods will have missing results if there are too few variants. We consider this as not having rejected the null hypothesis or a non-significant result. CAUSE does not have any missing results. 

```{r, summ_main}
cause_summ <- causedf %>% 
              group_by(q, tau, omega,  n1, n2) %>%
              summarize(n_sig = sum(cause.p < 0.05, na.rm=T),
                        missing = sum(is.na(cause.p))) %>% 
              mutate(analysis = "cause")

mr_summ <- mrdf %>%
           group_by(q, tau, omega,  n1, n2, mr) %>%
           summarize(n_sig = sum(mr.p < 0.05, na.rm=T),
                     missing = sum(is.na(mr.p))) %>%
           rename(analysis = "mr")


res <- bind_rows(cause_summ, mr_summ) %>%
        ungroup() %>%
        mutate(label = fct_recode(analysis, "CAUSE"="cause",
                                     "Egger" = "egger", "GSMR"="gsmr",
                                     "IVW"="ivw", 
                                     "MR-PRESSO"="mrp", "Wtd. Median" = "wm"),
                 n1 = factor(paste0("N[M]==", n1)),
                 n2 = factor(paste0("N[Y]==", n2)))

filter(res, analysis=="cause") %>% with(., all(missing == 0))
```

### Power 

Now we can plot!

First we plot power with $\tau$ on the horizontal axis and the number of times each method rejected the null hypothesis on the vertical axis. 
```{r, plot_power}
#colors
cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5],  "#D590DA")
#plot
pwr <- res %>%
       filter(tau != 0 | (tau == 0 & omega == 0 & q == 0)) %>% #This line includes only the power simulations. 
       ggplot(.) + 
          geom_line(aes(x=tau, y=n_sig/100, color=label)) +  
          geom_point(aes(x=tau, y=n_sig/100, color=label, shape=label)) +
          scale_color_manual(values=cols) + 
          scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
          geom_hline(yintercept=0.05, linetype=2) +
          xlab(expression("Percentage of Y heritability explained by M (" ~ tau ~ "=" ~ gamma^2 ~ h[M]^2/h[Y]^2 ~ ")")) +
          ylab("Percent Significant") + 
          theme_bw() + 
          theme(legend.title = element_blank(), 
                strip.text = element_text(size=14),
                axis.title = element_text(size=16),
                legend.text = element_text(size=12),
                legend.position = "bottom") +  
          guides(color=guide_legend(nrow=1),  
                 linetype=guide_legend(nrow=1),
                 shape=guide_legend(nrow=1) )  + 
          facet_wrap(~n1*n2, nrow=2, labeller = label_parsed)

pwr

```

CAUSE has lower power than the other methods except when the power for trait $M$ is low and the power for trait $Y$ is high. This is expected since CAUSE uses a more stringent criterion to conclude that the data are consistent with a causal effect. It is able to achieve better power when $M$ is under-powered and $Y$ is well powered because it uses all variants genome-wide rather than only those that are highly significant for trait $M$. This is an important setting because it may be applicable to scenarios when $M$ is a biomarker or molecular trait and $Y$ is a complex triat. 


### False Positives 

Next we plot false positives. Our simulations include two values of $\eta$ ($\sqrt{0.05}$ and $\sqrt{0.02}$). Although we only had space to present one set of results in the paper, we will make plots for both here. We plot the proportion of $M$ effect variants that display correlated pleiotropy, $q$, on the horizontal axis and the number of discoveries on the vertical axis. 
The bottom row of plots are for the larger value of $\eta$ and are the ones shown in the paper. 
For convenience, in extracting results, we copy the results for $q=0$. 

```{r, copy_zero}
z1 <-  filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.05)
z2 <-  filter(res, omega == 0 & tau == 0 & q == 0) %>% mutate(omega = 0.02)
res2 <- bind_rows(res, z1, z2) %>%
        filter(!(q == 0 & omega == 0 & tau == 0))
```

```{r, plot_fp}
fp <- res2 %>%
      filter(tau == 0) %>%
      ggplot(.) +
        geom_line(aes(x=q, y=n_sig/100, color=label)) +
        geom_point(aes(x=q, y=n_sig/100, color=label, shape=label)) +
        geom_hline(yintercept=0.05, linetype=2) +
        scale_color_manual(values=cols) +
        scale_shape_manual(values= c(16, 3, 4, 0, 2, 6:8)) +
        geom_hline(yintercept=0.05, linetype=2) +
        xlab("Proportion of variants acting through U (q)") +
        ylab("Percent Significant") +
        theme_bw() +
        theme(legend.title = element_blank(),
            strip.text = element_text(size=14),
            axis.title = element_text(size=16),
            legend.text = element_text(size=12),
            legend.position = "bottom") +
        guides(color=guide_legend(nrow=1),
             linetype=guide_legend(nrow=1),
             shape=guide_legend(nrow=1) )  +
        facet_wrap(~omega*n1*n2, nrow=2, labeller=label_parsed)

fp
```

In all cases CAUSE has a lower false positive rate than alternative methods. The false positive rate increases for large values of $q$ and is worse when the effect of the shared factor is stronger. 

### ROC Curves

We use ROC curves to evaluate how well each method is able to distinguish simulations with a causal effect of size $\gamma = \sqrt{0.05}$ from simulations with a shared factor with an equal effect acconting for 30\% of the trait $M$ vairants. Both of these parameters can be modified. The first step is to combine the CAUSE and alternative MR results into a single data frame

```{r, combine}
x1 <- causedf %>%
      mutate(analysis = "cause", stat = -log10(cause.p)) %>%
      select(tag, q, omega, tau, n1, n2, analysis, stat)

x2 <- mrdf %>%
      mutate(stat = -log10(mr.p)) %>%
      rename(analysis = mr) %>%
      select(tag, q, omega, tau, n1, n2, analysis, stat)

full_results <- bind_rows(x1, x2)

full_results <- full_results %>% 
                mutate(label = fct_recode(analysis, "CAUSE"="cause",
                                     "Egger" = "egger", "GSMR"="gsmr",
                                     "IVW"="ivw",
                                     "MR-PRESSO"="mrp", "Wtd. Median" = "wm"))

```

Next we use the `pROC` package. 

```{r, message=FALSE, roc_curves}
eff <- 0.05
my_q <- 0.3
n1s <- c(12000, 12000, 40000, 40000)
n2s <- c(12000, 40000, 12000, 40000)

methods <- with(full_results, unique(label)) %>%
            as.vector(.)

cc <- as.character(jcolors("default"))
cols <- c(cc[3], cc[1:2], cc[4:5],  "#D590DA")


roc_df <-lapply(seq_along(n1s), function(i){
                    my_n1 <- n1s[i]
                    my_n2 <- n2s[i]
                    rocs <- lapply(methods, function(m){
                       #cat(m, " ")
                       xx <- filter(full_results, n1 == my_n1 & n2 == my_n2 & label == m &
                                 ( tau == eff | (omega == eff & q == my_q & tau == 0)))
                       xx$stat[is.na(xx$stat)] <- 0
                       with(xx, roc(predictor = stat, response = tau !=0)) %>%
                            with(., data.frame(thresh = thresholds, spec = specificities,
                                            sens = sensitivities, method = m,
                                            n1 = my_n1, n2 = my_n2,
                                            stringsAsFactors=F)) %>% 
                            return()

                })
                do.call(rbind, rocs)
            })
roc_df <- do.call(rbind, roc_df) %>%
          mutate(n1 = factor(paste0("N[M]==", n1)),
                 n2 = factor(paste0("N[Y]==", n2)))
head(roc_df)

# A data frame with points to add to the curve
pw <- filter(res, tau == eff ) 
f <- filter(res, omega == eff & q == my_q & tau ==0)
df <- select(pw, label, n_sig, n1, n2) %>%
      rename(power = n_sig) %>%
      full_join(., f, c("label", "n1", "n2")) %>%
      rename(fp = n_sig) %>%
      select(label, power, fp, n1, n2) %>%
      filter(label %in% methods) %>%
      rename(method=label)

## Plotting!
plt <- roc_df %>% 
       arrange(sens) %>%
       ggplot(.) +
          geom_line(aes(x=1-spec, y=sens, col=method, group=method, linetype=method)) +
          geom_point(aes(x=fp/100, y = power/100, col=method, shape=method), size =4 , data=df) +
          scale_color_manual(values=cols) +
          scale_linetype_manual(values = rep(1, 6)) +
          scale_shape_manual(values = c(rep(16, 6))) +
          xlab("1 - Specificity (False Positive Rate)") + ylab("Sensitivity (Power)") +
          theme_bw() +
          theme(legend.title = element_blank(),
                strip.text = element_text(size=14),
                axis.title = element_text(size=16),
                legend.text = element_text(size=12),
                legend.position = "bottom") +
          guides(color=guide_legend(nrow=1),
                 linetype=guide_legend(nrow=1),
                 shape=guide_legend(nrow=1))  +
          facet_wrap(~n1*n2, nrow=2, labeller= label_parsed)
plt
```


## Prior robustness results

The `sigma_g` anlaysis is intended to evaluate how sensitive CAUSE is to changes in the prior on $\gamma$ and $\eta$. These parameters have the same $N(0, \sigma_{\gamma, \eta})$ prior distribution.  By default, $\sigma_{\gamma \eta}$ is chosen guided by the data. These results dmeonstrate that, within a wide range, CAUSE is insensitive to the value of $\sigma_{\gamma \eta}$. 

The R script `extract_sigma_g.R` will use `dscrutils` to extract simulation results and store them in a data frame. 
Run it at the command line with 
```
Rscript extract_sigma_g.R
```
or from inside an R session using `source`. 

This code will append the date to the file it saves so it won't save over the results downloaded by `setup_sims()`. If you want to use your own data frame, change the file name in the commands below.

```{r, lookat3}
causedf <- readRDS("sim_results/sg_causedf.RDS")
head(causedf)
with(causedf, table(sigma_g))
with(causedf, table(params))
```

The only columns that we have not seen previously are `quant` and `sigma_g`. `sigma_g` gives the value of $\sigma_{\gamma \eta}$ used in each simulation. These are large, medium and small values which are chosen so that $\sqrt{0.05}$ is the 51st, 65th and 80th quantile (`quant`) of the $N(0, \sigma_{\gamma \eta})$ distribution respectively. In this data frame we have ommited the `cause.` prefix on columns because CAUSE is the only method we are evaluating.
 
There are three sets of parameters used in simulations. Below we calculate the correlation in $p$-value and posterior median for each parameter in each setting across the three values of $\sigma_{\gamma \eta}$. The following code only reports the minimum correlation for each parameter-simulation setting combination and the correlation between the medium and small values of $\sigma_{\gamma \eta}$.
 
```{r}
vars <- c("z", "p", "eta_sharing", "q_sharing",
          "eta_causal", "q_causal", "gamma_causal")

min_cor <- expand.grid(variable = vars,
                       params = unique(causedf$params), stringsAsFactors=F)
min_cor$min_cor <- NA
min_cor$cor_med_small <- NA
for(i in seq(nrow(min_cor))){
    v <- min_cor$variable[i]
    X <- select(causedf, simulate.output.file, params, quant, one_of(v)) %>%
         spread(quant, get(v))  %>%
         filter(params==min_cor$params[i]) %>%
         select(one_of(c("0.51", "0.65", "0.8")))

    min_cor$min_cor[i] <- min(cor(X))
    min_cor$cor_med_small[i] <- cor(X[,"0.65"], X[,"0.8"])

}
min_cor
```

This code is also contained in `summarize_sigma_g.R`.

# What's in the DSC Files? {#details}

We will now go in detail through one of the DSC files so you can see what is happening and how to make changes to run your own experiments. We will look at `false_positives.dsc` but `power.dsc` and `sigma_g.dsc` are similar.


The DSC statement is at the top:
```
DSC:
    define:
        mr: ivw, egger, mrp, wm, twas, wm, gsmr
    run: simulate*gw_sig,
         simulate*LCV,
         simulate*mr,
         simulate*cause_params*cause
    replicate: 100
    output: fp
```
This statement tells DSC what to run. In `define` we define a group of modules. These will turn up later in the file. In `run` we tell DSC what to run. For example, `simulate*gw_sig` means that DSC should run the `simulate` module followed by the `gw_sig` module which will use the results of the `simulate` module. `replicate` indicates the number of replicates to run (though this is over-ridden by the `--replicate` flag in the DSC call above) and `output` names the output directory.

The rest of the document defines modules. All of our modules use R code and most of them rely on the `causeSims` R package.

## Simulation Modules 

The simulation module:
```
simulate: R(library(causeSims);
            snps <- readRDS("data/chr19_snpdata_hm3only.RDS");
            evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
            dat <- sum_stats(snps, evd_list,
                  n_copies = 30,
                  n1 = n1, n2=n2,
                  h1=h1, h2=h1,
                  neffect1 = neffect1, neffect2 = neffect2,
                  tau = tau, omega = omega, q = q,
                  cores = 4, ld_prune_pval_thresh = 0.01,
                  r2_thresh = 0.1))

     q: 0.1, 0.2, 0.3, 0.4, 0.5
     omega: 0.02, 0.05
     tau: 0
     n1: 12000, 40000
     n2: 12000, 40000
     neffect1: 1000
     neffect2: 1000
     h1: 0.25
     h2: 0.25
     $sim_params: c(q = q, omega = omega, tau = tau,  h1 = h1, h2 = h2, n1 = n1, n2 = n2, neffect1 = neffect1, neffect2 =neffect2)
     $dat: dat
```

The R code contained in `R()` is the guts of this module. A data object named `dat` is generated using the `sum_stats` function from the `causeSims` R pacakge. Below this, we specify parameters used to simulate the data. Most important are `q`, `omega`, and `tau`. `q` is the proportion of variants acting through $U$ while `omega` and `tau` are standardized versions of $\gamma$ and $\eta$ in the paper. `tau` is the proportion of $Y$ heritability explained by the causal effect of $M$ on $Y$ times the sign of $\gamma$, 
$$
\tau = sign(\gamma)\frac{\gamma^2 h^2_{M}}{h^{2}_{Y}},
$$
and `omega` is the proportion of $Y$ heritability explained by $U$, divided by $q$ and multiplied by the sign of $\eta$.
$$
\omega = sign(\eta)\frac{\eta^2 h^2_{M}}{h^{2}_{Y}}
$$
This parameterization is sometimes more convenient for running simulations that using $\eta$ and $\gamma$. However, the `sum_stats` function will accept either `omega` and `tau` arguments or `eta` and `gamma`. Note that here we have specified that the function can use 4 cores. This should match the `n_cpu` parameter in the `simulate` section of the config file. The `sum_stats` function generates data both with and without LD. In this file we only analyze the "with LD" data but we will describe how to make modifications to analyze both or only the "no LD" data. 

The `gw_sig` module is a helper module. This module computes the GCP (a parameter defined by LCV (O'Connor and Price 2019)) and computes the number of genome-wide significant variants. It also stores all the simulation parameters which makes it easier to extract them. Extracting parameters from the `simulate` module will require reading in the entire data set, objects created by the `gw_sig` module are smaller and faster to read.
```
gw_sig: R(library(causeSims);
           m_sig_nold <- with($(dat), sum(p_value_nold < thresh ));
           y_sig_nold <- with($(dat), sum(2*pnorm(-abs(beta_hat_2_nold/seb2)) < thresh ));
           m_sig <- with($(dat), sum(p_value < thresh & ld_prune==TRUE));

          params <- $(sim_params);
          tau <- as.numeric(params["tau"]);
          omega <- as.numeric(params["omega"]);
          q <- as.numeric(params["q"]);
          h1 <- as.numeric(params["h1"]);
          h2 <- as.numeric(params["h2"]);
          neffect1 <- as.numeric(params["neffect1"]);
          neffect2 <- as.numeric(params["neffect2"]);
          gamma <- sqrt(tau*sum(h2)/sum(h1))*sign(tau);
          eta <- sqrt(abs(omega)*sum(h2)/sum(h1))*sign(omega);

          #LCV parameters
          if(q == 0 & tau == 0){
            q.1 <- q.2 <- 0;
            gcp <- 0;
          }else if(q == 0){
            q.1 <- 1;
            q.2 <- 0;
            gcp <- 1;
          }else{
            q.1 <- sqrt(q);
            q.2 <- eta * sqrt(q) * sqrt(h1) / sqrt(h2);
            gcp = (log(q.2^2) - log(q.1^2)) / (log(q.2^2) + log(q.1^2));
          };
          gcp_mom <- gcp_moments($(dat), h1, h2)
          )
    thresh: 5e-8
    $m_sig: m_sig
    $m_sig_nold: m_sig_nold
    $y_sig_nold: y_sig_nold
    $q: q
    $omega: omega
    $tau: tau
    $gamma: gamma
    $eta: eta
    $h1: h1
    $h2: h2
    $n1: as.numeric(params["n1"])
    $n2: as.numeric(params["n2"])
    $neffect1: as.numeric(params["neffect1"])
    $neffect2: as.numeric(params["neffect2"])
    $lcv_q1: q.1
    $lcv_q2: q.2
    $lcv_gcp: gcp
    $lcv_gcp_mom: gcp_mom$gcp
```

## CAUSE Modules

We have divided CAUSE analysis into parameter estimation and model fitting steps. This makes it easier to experiment with changes only to the parameter estimation step. 

```
cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                )
    null_wt: 10
    $cause_params_ld: p1

cause: R(library(causeSims);
         library(cause);
         cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = FALSE);
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p
```

Both modules use wrapper functions from the `causeSims` package. These are very simple wrappers that make it easy to run CAUSE on the simulated data format. Note that here we use `no_ld = FALSE` in both modules. If you want to use the data that is simulated without LD instead you can change `no_ld = TRUE`. You could get results for both using the following set of modules.

```
cause_params: R(library(causeSims);
                p1 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=FALSE);
                p2 <- cause_params_sims($(dat), null_wt = null_wt, no_ld=TRUE);
                )
    null_wt: 10
    $cause_params_ld: p1
    $cause_params_nold: p2

cause: R(library(causeSims);
         library(cause);
         if(no_ld==FALSE){
          cause_res <- cause_sims($(dat), $(cause_params_ld), no_ld = no_ld);
         else{
          cause_res <- cause_sims($(dat), $(cause_params_nold), no_ld = no_ld);
         }
         z <- -1*summary(cause_res)$z;
         p <- pnorm(-z);
         quants <- summary(cause_res)$quants)
    $no_ld: TRUE, FALSE
    $cause_res: cause_res
    $sigma_g: cause_res$sigma_g
    $eta_med_2: quants[[1]][1,2]
    $q_med_2: quants[[1]][1,3]
    $eta_med_3: quants[[2]][1,2]
    $gamma_med_3: quants[[2]][1,1]
    $q_med_3: quants[[2]][1,3]
    $z: z
    $p: p
```

## Alternative MR Modules

There are many modules for alternative MR methods. Modules for IVW regression, Egger regression, MR-PRESSO, and the weighted median method are very similar and rely on wrappers from `causeSims`. We can just look at one of these.

```
ivw: R(library(causeSims);
       res <- ivw($(dat), p_val_thresh=thresh, no_ld = no_ld);
       )
    thresh: 5e-8
    no_ld: FALSE
    $z: res$z
    $p: res$p
```
To run on the data without LD change the `no_ld` parameter to `no_ld: TRUE` or `no_ld: TRUE, FALSE` to run both. 

The GSMR module is a little more complicated 
```
gsmr: R(library(causeSims);
        evd_list <- readRDS("data/evd_list_chr19_hm3.RDS");
        res <- gsmr_sims($(dat), evd_list, p_val_thresh  = 5e-8, no_ld = FALSE);
        if(!is.null(res)){
           z <- res$bxy/res$bxy_se;
           est <- res$bxy;
           p <- res$bxy_pval;
        }else{
           z <- est <- p <-  NA;
        }
      )
    thresh: 5e-8
    $z: z
    $p: p
    $est_gsmr: est
    $gsmr: res
```
GSMR requires the LD structure and sometimes returns errors. In this case the `gsmr_sims` wrapper returns NULL and we set results to missing. If `no_ld = TRUE` the LD data in `evd_list` is simply ignored.

## LCV Module

Results from LCV are not compared with CAUSE in the paper because LCV is performing a different test than other MR methods. However, we have implemented wrappers and a module to run LCV.

```
LCV: R(library(causeSims);
       res <- try(lcv_sims($(dat),no_ld = no_ld, sig.thresh = thresh));
       if(class(res) == "try-error"){
            res <- list(pval.gcpzero.2tailed=NA, gcp.pm = NA, gcp.pse = NA);
        };
       )
    thresh: 30
    no_ld: FALSE
    $p: res$pval.gcpzero.2tailed
    $gcp_mean: res$gcp.pm
    $gcp_pse: res$gcp.pse
    $gcp_obj: res
```
When LCV returns errors, we set results to missing.