analysis/gwas_pairs_2.Rmd

---
title: "Analyze Pairs of GWAS Traits Using CAUSE Pipeline"
author: "Jean Morrison"
date: "2019-10-15"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---


## [Click here to explore results interactively](https://jean-loves-stats.shinyapps.io/gwas_pairs_2/)
To reproduce these results or to learn how to use the CAUSE Snakemake pipeline, follow the tutorial below


## Introduction

We have setup a Snakemake pipeline that will make it easy to run CAUSE (and a handful of other methods) on many pairs of traits. We will demonstrate the pipeline with an example that produces a subset of the results above. 

## Set Up

We use [Snakemake](https://snakemake.readthedocs.io/en/stable/) and a conda environment to run this analysis. If you don't have Miniconda or Anaconda installed you can either install one of them (recomended) or just make sure that you have Python3, pandas and Snakemake installed. We have chosen to not include R in the conda environment so you will need to have R installed outside the environment and also have the `cause` and `tidyverse` R packages installed and running. If you were able to work through the [package tutorial](ldl_cad.html) you should be in good shape. For some alternative MR methods you will need the `MendelianRandomization` R package and the `MR-PRESSO` R package which can be installed [here](https://github.com/rondolab/MR-PRESSO).


First create the conda environment

```
conda create -n cause_large python=3.6 snakemake
```

Next create a working directory that you would like to analyze the data in. Change to that directory using `cd` in Mac or Linux. For example

```
mkdir gwas_pairs
cd gwas_pairs
```

Finally, inside the working directory and using R, setup a CAUSE pipeline. The `download_ld=TRUE` argument causes the function to download correctly formatted `ld` data computed using 1k genomes CEU individuals. If you've already downloaded this or you want to use different LD data, use `download_ld=FALSE`. The `download_eur_ld_scores=TRUE` argument will download LD scores to use with LCV. If you have these already or don't want to run LCV, set it to false. 

```
cause::setup_cause_pipeline(download_ld=TRUE, download_eur_ld_scores=TRUE)
```

## How to Use the Pipeline

Set up a pipeline analysis. From the directory you want to use, in R run 

```
cause::setup_cause_pipeline()
```

The `setup_cause_pipeline()` function provides you with everything you need to run an  analysis except for data and a `.csv` file describing that data. You can use this function to start any pipeline analysis, not just this example. The analysis is controlled by the `config.yaml` file which has four sections. You should edit the file to match your analysis desires.

The `input` section gives the location of a `csv` file that describes each set of GWAS summary statistics. More on this file later.

```
# Path to data spreadsheed
# csv format, columns include:
# name, raw_data_path, snp, A1, A2, beta_hat, se, p_value, sample_size, delimeter
input:
    sum_stats: "gwas_pairs.csv"
```

The `analysis` section gives instructions about what analysis to run.

```
# What analysis to do
# If all_pairs do all pairs of triats. Otherwise
# use traits in trait1 as M and traits in trait2 as Y
# methods should be a comma separated list with no spaces.
# Optional methods should be one of
# cause_*_*_*, mrpackage, lcv,  mrpresso
analysis:
    all_pairs: True
    trait1: 1,2
    trait2: 3,4,5
    methods: cause_1_2_1,cause_1_10_1,cause_1_100_1,mrpackage,mrpresso,lcv
    mr_pval: "5e-8"
    cause_seed: 100
```

If `all_pairs` is `True` then the pipeline will run the desired methods for all pairs of traits in the csv. Otherwise it will use the `trait1` and `trait2` fields to determine which pairs to run. Numbers in these fields refer to line numbers (beginnin at 1 after the header) in the csv. The `methods` line lists the methods you would like to run. Options are listed above in the comments. The numbers following `cause` designate the prior on $q$. If the method is given as `cause_a_b_c` then a Beta(a,b) prior truncated at c will but used. The `mrpackage` option runs a set of six different methods using the `MendelianRandomization` R package. These are IVW and Egger regression both with random effects, the weighted median and the weighted mode with with values of phi equal to 1, 0.5, and 0.25. The `mr_pval` field gives the minimum p-value for variants used in methods besides CAUSE and LCV which use all variants. The `cause_seed` provides a random seed to CAUSE and ensures that results can be reproduced exactly. 


The `ld` section tells the pipeline where to find correctly formatted LD data and what files are named. 
```
#Path to directory containing LD data
#ld_score_dir is used only by method lcv
ld:
    dir: "ld/"
    r2_file: "_AF0.05_0.1.RDS"
    info_file: "_AF0.05_snpdata.RDS"
    ld_score_dir: "ld_scores/eur_w_ld_chr/"
```
The pipeline will expect to find files in the directory given in `dir:` with names `{chr}_{r2_file}`  and `{chr}_{info_file}` where `{r2_file}` and `{info_file}` are the file endings given in those respective fields and `{chr}` is a chromosme (e.g. `chr1`, `chr2`) for chromosomes 1-22. `ld_score_dir` is only necessary if you want to run LCV using method `lcv`.

The `out` section tells the pipeline where to store output files.

```
out:
    gwas_data_dir: "cause_standard_format/"
    other_data_dir: "data/"
    output_dir: "results/"
```

The `gwas_data_dir` field is a directory to store formatted summary statistics. It can be helpful to store these in a centralized location if you are running multiple pipelines ot save on work and storage. The `other_data_dir` lists a directory to store other data files. These include some temporary files that are removed at the end of the pipeline and lists of SNPs pruned for LD that are saved. `output_dir` is a directory to store analysis results. 


The `cluster.yaml` file describes resources allocated for each kind of job. The default will work but if it requests more resources than are available on your cluster (e.g. memory) you may need to change it.

The Snakemake command for submitting the pipeline is in the `run-snakemake.sh` file. You will need to modify it to match your cluster.

Once you have downloaded summary statistics and created the csv file you are ready to run the pipeline. You can run with 
```
nohup ./run-snakemake.sh & 
```
You may need to change the permissons of `run-snakemake.sh` to be excecutable with `chmod a+x run-snakemake.sh`. The `nohup` is optional but is nice because the pipeline can run for a long time. I generally prefer to run the pipeline from a compute node rather than the login node but this will depend on your setup.

## Download data and build spreadsheet

The next step is to acquire some GWAS summary statistics and to describe them in a spreadsheet so that the pipeline is able to format them properly. The spreadsheet has the following mandatory column headers in any order:

`name`: a unique string naming the study

`delimeter`: Field delimeter. One of "tab", ",", "space", or any symbol.

`snp`: Column name of SNP ID (generally rs number but anything that matches the other file). This will be the field that studies are joined on.

`A1`: Column name of effect allele

`A2`: Column name of other allele

`beta_hat`: Column name of effect estimate

`se`: Column name of standard error of effect estimate

`p_value`: Column name of $p$-value

`sample_size`: Column name of per-SNP sample size

The `p_value` and `sample_size` fields may contain NAs if some studies don't have them. The rest are required. Most studies can be used exactly as downloaded but you may have to a little bit of formatting before you can use the pipeline. For example, some studies do not contain rs numbers or have atypical variant names. 


## Example 

To run an example, from inside of R run 

```
cause::cause_download_example_gwas_data()
```

This function will set up an example analysis of three traits LDL cholesterol (Willer et al 2013 PMID 24097068
), coronary artery disease (van der Harst et al 2017 PMID 29212778), and asthma (Demenais et al 2018 PMID 29273806). The function will download summary statistics directly from their sources. In these cases they are ready to use without any modifications, but this isn't always the case. For example, some studies do not have rs ids included or report odds ratio rather than log odds ratio (the coefficient estimate from logistic regression). In these cases, you will need to modify the data so they contain the five mandatory colums of snp name, effect allele, other allele, coefficient (effect) estimate, and standard error of effect estimate. The function also downloads a `csv` file called `gwas_pairs.csv`. Take a look at the csv if you are having trouble making your own. This file has some extra (non-required) columns that we find useful for keeping track of studies. 

Now to run an analysis all you need to do is

1. Edit the `run-snakemake.sh` file so that the cluster command is compatible with your cluster setup.

2. Edit the `config.yaml` file. You may not need to make any changes but you might need to change the location of the LD files or change where you would like data stored. Leave `all_pairs: True` or, alternatively to only run LDL -> CAD and LDL -> Asthma change to

```
analysis:
  all_pairs: False
  trait1: 1
  trait2: 2,3
```
The numbers correspond to the index of each trait in `gwas_pairs.csv`. Leave the other fields in the analysis section as they are. 

3. Change the persmissions of `run-snakemake.sh` to make it an executable. Use `chmod a+x run-snakemake.sh`.

4. Run (on a compute node if you like)

```
source activate cause_large
nohup ./run-snakemake.sh & 
```

A good thing to keep in mind is that Snakemake will pickup wherever it left off if a job fails or the analysis is interrupted. For example, suppose you find that you need to give one method more memory using the data you have. You can modify the `cluster.yaml` file and then simply re-run the command above. No work that has been completed will be repeated. You can use `snakemake -n` to do a "dry run" which tells you what commands will be run. 

When the pipeline is done, in the results directory there will be a handful of files named `results/df_{method}.RDS` where `{method}` is one of the methods run above. Source these into R and take a look using

```{r, eval=FALSE}
df <- readRDS("results/df_cause_1_10_1.RDS")
df
```

Additionally, the resuts folder will contain files named `results/{name1}__{name2}_{method}.RDS` containing results for each method. If the method is CAUSE, this will be a CAUSE object that you can look at using utilities in the `cause` package. Try

```{r, eval=FALSE}
library(cause)
res <- readRDS("results/glg_ldl__vanderHarst_cad_cause_1_10_1.RDS")
summary(res)
plot(res, type="posteriors")
plot(res, type="data")
```