analysis/analysis_plan.Rmd

---
title: "QGT-Columbia-analysis-plan"
author: "Hae Kyung Im"
date: "2020-06-03"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---

## Set up 

This information is also on the slides.

- [ ] Download the data and software [from Box](https://uchicago.box.com/s/zhapf2zfxcpj7thvq4sjnqale3emleum).
  The folder contains copies of all the software repositories and the models.

Linux is the operating system of choice for running bioinformatics software. We are offering two options:

- Option 1: full setup, recommended for the Linux-savvy
- Option 2: pre-installed RStudio in Google Cloud, recommended for people less familiar with Linux

The latest version of the analysis plan markdown document that generated this page is on [GitHub here](https://github.com/hakyimlab/QGT-Columbia-HKI/blob/master/analysis/analysis_plan.Rmd)
and is rendered [here as an HTML page](https://hakyimlab.github.io/QGT-Columbia-HKI/analysis_plan.html).

# Option 1

- [ ] install anaconda/miniconda
- [ ] create the imlabtools conda environment ([instructions here](https://github.com/hakyimlab/MetaXcan/blob/master/README.md#example-conda-environment-setup)), which installs all the Python modules needed for this analysis session
- [x] download software (copies of the repos are already included in the course folder QGT-Columbia-HKI/repos/)
  - download MetaXcan repo
  - download torus repo
  - download fastenloc repo
  - download TWMR repo
- [x] download prediction models from predictdb.org (a few models are included in the course folder QGT-Columbia-HKI/models/)
- [ ] install R/RStudio/tidyverse package
- [ ] (optional) install workflowr package in R
- [ ] git clone https://github.com/hakyimlab/QGT-Columbia-HKI.git
- [ ] start RStudio (if you installed workflowr, you can just open QGT-Columbia-HKI.Rproj)

# Option 2
- [ ] claim your RStudio server IP address ()
- [ ] connect to the RStudio server using the URL you claimed (http://xxx.xxx.xxx.xxx:8787)

# Both options

- [ ] update the analysis document

```{bash eval=FALSE}
PRE="/home/student/"
cd $PRE/../lab/
git pull 
```

- [ ] activate the imlabtools environment

```{bash, eval=FALSE}
conda activate imlabtools
```

**Note:** the bash chunks need to be copied and pasted into the terminal, not run within the chunk.
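
Because the bash chunks are pasted into a terminal, it is easy to run a later step in a shell where the environment was never activated. The following is a minimal sketch of a guard, assuming that conda exports the name of the active environment in `CONDA_DEFAULT_ENV`; the helper name `require_env` is hypothetical and not part of the course material.

```bash
# require_env NAME: succeed only if the active conda environment is NAME.
# Assumption: `conda activate` sets CONDA_DEFAULT_ENV; it is unset otherwise.
require_env() {
  local wanted="$1"
  if [ "${CONDA_DEFAULT_ENV:-}" != "$wanted" ]; then
    echo "conda environment '$wanted' is not active; run: conda activate $wanted" >&2
    return 1
  fi
}
```

For example, `require_env imlabtools && python3 --version` only runs Python when the expected environment is active.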

## Summary of analysis plan

- predict whole blood expression
- check how well the prediction works with GEUVADIS expression data
- run association between predicted expression and a simulated phenotype
- calculate association between expression levels and coronary artery disease risk using S-PrediXcan
- fine-map the coronary artery disease GWAS results using torus (needs some preformatting)
- calculate colocalization probability using fastENLOC
- run transcriptome-wide Mendelian randomization in one locus of interest


```{r preliminary definitions}

library(tidyverse)

```

# Transcriptome-wide association methods

```{r preliminaries}

print(getwd())

pre="~/Box/LargeFiles/QGT-Columbia-HKI"
#pre="/home/student/QGT-Columbia-HKI"
model.dir=glue::glue("{pre}/models")
metaxcan.dir=glue::glue("{pre}/repos/MetaXcan-master/software")
fastenloc.dir=glue::glue("{pre}/repos/fastenloc-master")
torus.dir=glue::glue("{pre}/repos/torus-master")
twmr.dir=glue::glue("{pre}/repos/TWMR-master")
results.dir=glue::glue("{pre}/results")


```

## predict expression 

![Visual summary of predixcan runs](https://raw.githubusercontent.com/hakyimlab/QGT-Columbia-HKI/master/extras/figures/PrediXcan-run.png)

```{bash predict genetic component of expression,eval=FALSE}

PRE="/home/student/QGT-Columbia-HKI"
DATA=$PRE/data/predixcan
MODEL=$PRE/models
METAXCAN=$PRE/repos/MetaXcan-master/software
RESULTS=$PRE/results

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $PRE/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/genotype/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--vcf_mode genotyped \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/predixcan/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Whole_Blood__summary.txt \
--verbosity 9 \
--throw

```
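
The commands in this plan write to several subfolders of `$RESULTS`. In case a script does not create a missing output folder itself, the sketch below creates them up front. The subfolder names are copied from the output paths used in this document; the helper name `make_results_dirs` is hypothetical.

```bash
# make_results_dirs DIR: create the output subfolders used by the
# PrediXcan, S-PrediXcan and S-MultiXcan commands in this plan.
# mkdir -p is a no-op for folders that already exist.
make_results_dirs() {
  local results="$1"
  mkdir -p \
    "$results/predixcan/random_pheno" \
    "$results/spredixcan/eqtl" \
    "$results/smultixcan/eqtl"
}
```

In the terminal this would be called as `make_results_dirs "$RESULTS"` after the variables above have been defined.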


## testing
```{bash, eval=FALSE}
PRE="/Users/haekyungim/Box/LargeFiles/QGT-Columbia-HKI"
DATA=$PRE/data/predixcan
MODEL=$PRE/models
METAXCAN=$PRE/repos/MetaXcan-master/software
RESULTS=$PRE/results

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path  /Users/haekyungim/Downloads/data/models/gtex_v8_mashr/mashr_Whole_Blood.db \
--vcf_genotypes /Users/haekyungim/Downloads/data/1000G_hg37/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz \
--vcf_mode genotyped \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/predixcan/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Whole_Blood__summary.txt \
--verbosity 9 \
--throw


python3 $METAXCAN/Predict.py \
--model_db_path $PRE/models/gtex_v8_mashr/mashr_Whole_Blood.db \
--vcf_genotypes /Users/haekyungim/Downloads/tempo/filtered.vcf.gz \
--vcf_mode genotyped  \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/predixcan/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Whole_Blood__summary.txt \
--verbosity 9 --throw

```


## assess prediction performance (optional)

```{r, eval=FALSE}

predicted_expression = read_tsv(glue::glue("{results.dir}/predixcan/Whole_Blood__predict.txt"))
dim(predicted_expression)
head(predicted_expression[,1:5])
prediction_summary = read_tsv(glue::glue("{results.dir}/predixcan/Whole_Blood__summary.txt"))
dim(prediction_summary)
head(prediction_summary)

## merge with GEUVADIS expression data

## calculate spearman correlation

## select a few genes and plot predicted vs observed expression

```

## run association with phenotype

```{bash, eval=FALSE}

export PRE="/home/student/QGT-Columbia-HKI"
export DATA=$PRE/data/predixcan
export MODEL=$PRE/models
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/predixcan/Whole_Blood__predict.txt \
--input_phenos_file $DATA/phenotype/random_pheno_1000G_hg38.txt \
--input_phenos_column pheno \
--output $RESULTS/predixcan/random_pheno/Whole_Blood__association.txt \
--verbosity 9 \
--throw

```

## read results

```{r, eval=FALSE}

predixcan_association = read_tsv(glue::glue("{results.dir}/predixcan/random_pheno/Whole_Blood__association.txt"))
dim(predixcan_association)
predixcan_association %>% arrange(pvalue) %>% head
predixcan_association %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)


```
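
For a quick look without opening R, the same file can be checked from the terminal. The sketch below counts genes that pass a Bonferroni threshold (0.05 divided by the number of genes tested), which is one common choice and not necessarily the one used in the course. It assumes a tab-separated file with a header row that contains a `pvalue` column, as used in the R chunk above; the helper name `count_bonferroni` is hypothetical.

```bash
# count_bonferroni FILE: print "<significant> <tested>" for a tab-separated
# association file, using the threshold 0.05 / (number of usable p-values).
count_bonferroni() {
  awk -F'\t' '
    NR == 1 {                                  # locate the pvalue column by name
      for (i = 1; i <= NF; i++) if ($i == "pvalue") col = i
      if (!col) { print "no pvalue column" > "/dev/stderr"; exit 1 }
      next
    }
    $col != "" && $col != "NA" { p[++n] = $col + 0 }   # skip missing values
    END {
      if (!col) exit 1
      for (i = 1; i <= n; i++) if (p[i] < 0.05 / n) hits++
      print hits + 0, n + 0
    }
  ' "$1"
}
```

It would be called as `count_bonferroni $RESULTS/predixcan/random_pheno/Whole_Blood__association.txt`. With a randomly generated phenotype, few or no genes are expected to pass the threshold.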


## Exercise
- [ ] Run the association with another phenotype, found in
  `$PRE/data/predixcan/phenotype/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38_x_en_Whole_Blood.simulated_phenotype.spike_n_slab_0.5_x_pve0.6.txt`

```{bash, eval=FALSE}

```

```{r}

```

-------
-------

# Summary PrediXcan

![Visual summary of s-predixcan](https://raw.githubusercontent.com/hakyimlab/QGT-Columbia-HKI/master/extras/figures/gwas-PrediXcan-spredixcan.png)


```{r}

## the harmonized and imputed GWAS results for coronary artery disease are available in
# $PRE/data/s-predixcan/

```

## run s-predixcan 

```{bash, eval=FALSE}

export PRE="/home/student/QGT-Columbia-HKI"
export DATA=$PRE/data/s-predixcan
export MODEL=$PRE/models
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

python $METAXCAN/SPrediXcan.py \
--gwas_file  $DATA/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore \
--model_db_path $MODEL/gtex_v8_mashr/mashr_Whole_Blood.db \
--covariance $MODEL/gtex_v8_mashr/mashr_Whole_Blood.txt.gz \
--keep_non_rsid --additional_output --model_db_snp_key varID \
--throw \
--output_file $RESULTS/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv

```

## plot and interpret s-predixcan results

```{r, eval=FALSE}

spredixcan_association = read_csv(glue::glue("{results.dir}/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv"))
dim(spredixcan_association)
spredixcan_association %>% arrange(pvalue) %>% head
spredixcan_association %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)

```

SORT1, considered to be a causal gene for LDL cholesterol and, as a consequence, for coronary artery disease, is not found here. Why? (Hint: consider the tissue.)
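
To confirm whether a gene such as SORT1 has a row in the whole blood results at all, the CSV can be searched from the terminal. This is a minimal sketch that assumes the S-PrediXcan output is a comma-separated file whose header contains a `gene_name` column and whose fields contain no quoted commas; both are assumptions about the output format, and the helper name `lookup_gene` is hypothetical.

```bash
# lookup_gene FILE NAME: print the header and any rows whose gene_name
# column equals NAME. Exit status: 0 if found, 1 if absent, 2 if the
# gene_name column is missing from the header.
lookup_gene() {
  awk -F',' -v name="$2" '
    NR == 1 {
      for (i = 1; i <= NF; i++) if ($i == "gene_name") col = i
      if (!col) { print "no gene_name column" > "/dev/stderr"; exit 2 }
      print; next
    }
    $col == name { print; found = 1 }
    END { if (!col) exit 2; exit !found }
  ' "$1"
}
```

For example, `lookup_gene $RESULTS/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv SORT1` returns a non-zero exit status when the gene has no row in this tissue's results.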


## run multixcan (optional)

```{bash, eval=FALSE}

export MODEL=$PRE/models
export DATA=$PRE/data/s-predixcan

python $METAXCAN/SMulTiXcan.py \
--models_folder $MODEL/gtex_v8_mashr \
--models_name_pattern "mashr_(.*).db" \
--snp_covariance $MODEL/gtex_v8_expression_mashr_snp_covariance.txt.gz \
--metaxcan_folder $RESULTS/spredixcan/eqtl/ \
--metaxcan_filter "CARDIoGRAM_C4D_CAD_ADDITIVE__PM__(.*).csv" \
--metaxcan_file_name_parse_pattern "(.*)__PM__(.*).csv" \
--gwas_file $DATA/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore --keep_non_rsid --model_db_snp_key varID \
--cutoff_condition_number 30 \
--verbosity 7 \
--throw \
--output $RESULTS/smultixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE_smultixcan.txt

```

# Colocalization methods

![Visual summary of colocalization](https://raw.githubusercontent.com/hakyimlab/QGT-Columbia-HKI/master/extras/figures/colocalization-run.png)


## fine-map GWAS results

Because of time limitations, we will run torus; ideally, we would use a fine-mapping method that allows multiple causal variants per locus.

```{bash, eval=FALSE}

#torus -d Height.torus.zval.gz --load_zval -dump_pip Height.gwas.pip
#gzip Height.gwas.pip
TORUSOFT=torus

$TORUSOFT -d $PRE/data/fastenloc/Height.torus.zval.gz --load_zval -dump_pip $PRE/data/fastenloc/Height.gwas.pip
cd $PRE/data/fastenloc
gzip Height.gwas.pip
cd $PRE 

```

## calculate colocalization with fastENLOC

```{bash, eval=FALSE}
## check out tutorial https://github.com/xqwen/fastenloc/tree/master/tutorial

export eqtl_annotation_gzipped=$PRE/data/fastenloc/FASTENLOC-gtex_v8.eqtl_annot.vcf.gz
export gwas_data_gzipped=$PRE/data/fastenloc/Height.gwas.pip.gz
export TISSUE=Whole_Blood
export FASTENLOCSOFT=fastenloc

mkdir -p $RESULTS/fastenloc/
cd $RESULTS/fastenloc/
$FASTENLOCSOFT -eqtl $eqtl_annotation_gzipped -gwas $gwas_data_gzipped -t $TISSUE 

#[-total_variants total_snp] [-thread n] [-prefix prefix_name] [-s shrinkage]

```

## analyze results 

```{r}

## optional - compare with s-predixcan results

```

- [ ] prepare


----------
# Mendelian randomization methods

## run SMR (optional)

```{bash, eval=FALSE}

```

## run TWMR (for a locus)

![TWMR](https://raw.githubusercontent.com/hakyimlab/QGT-Columbia-HKI/master/extras/figures/TWMR.png)

```{bash, eval=FALSE}

export TWMR=$PRE/repos/TWMR-master
export OUTPUT=$PRE/results
GENE=ENSG00000002919

cd $TWMR

R < $TWMR/MR.R --no-save $GENE

cd $PRE

## output: /home/student/QGT-Columbia-HKI/repos/TWMR-master/ENSG00000002919.alpha

```