analysis/analysis_plan.Rmd

---
title: "QGT-Columbia-analysis-plan"
author: "Hae Kyung Im"
date: "2020-06-03"
output: workflowr::wflow_html
editor_options:
  chunk_output_type: console
---

```{r preliminary definitions}

library(tidyverse)

```

# Preliminary information
Data and copies of the repositories can be downloaded from [Box here](https://uchicago.box.com/s/zhapf2zfxcpj7thvq4sjnqale3emleum).

The latest version of the analysis plan that generated this page is on [GitHub here](https://github.com/hakyimlab/QGT-Columbia-HKI/blob/master/analysis/analysis_plan.Rmd).

# Transcriptome-wide association methods

```{r preliminaries}

print(getwd())

pre="/home/rstudio/QGT-Columbia-HKI"
model.dir=glue::glue("{pre}/models")
metaxcan.dir=glue::glue("{pre}/repos/MetaXcan-master/software")
fastenloc.dir=glue::glue("{pre}/repos/fastenloc-master")
torus.dir=glue::glue("{pre}/repos/torus-master")
twmr.dir=glue::glue("{pre}/repos/TWMR-master")
results.dir=glue::glue("{pre}/results")


```

## predict expression 

```{bash predict genetic component of expression,eval=FALSE}

export PRE="/home/rstudio/QGT-Columbia-HKI"
export DATA=$PRE/predixcan/data
export MODEL=$PRE/models
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

printf "Predict expression\n\n"
python3 $METAXCAN/Predict.py \
--model_db_path $PRE/models/gtex_v8_en/en_Whole_Blood.db \
--vcf_genotypes $DATA/genotype/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz \
--vcf_mode genotyped \
--variant_mapping $DATA/gtex_v8_eur_filtered_maf0.01_monoallelic_variants.txt.gz id rsid \
--on_the_fly_mapping METADATA "chr{}_{}_{}_{}_b38" \
--prediction_output $RESULTS/predixcan/Whole_Blood__predict.txt \
--prediction_summary_output $RESULTS/predixcan/Whole_Blood__summary.txt \
--verbosity 9 \
--throw
```

## assess prediction performance (optional)

```{r}

predicted_expression = read_tsv(glue::glue("{results.dir}/predixcan/Whole_Blood__predict.txt"))
dim(predicted_expression)
head(predicted_expression[,1:5])
prediction_summary = read_tsv(glue::glue("{results.dir}/predixcan/Whole_Blood__summary.txt"))
dim(prediction_summary)
head(prediction_summary)

## merge with GEUVADIS expression data

## calculate spearman correlation

## select a few genes and plot predicted vs observed expression

```
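
The three commented steps above (merge, Spearman correlation, plot) can be sketched as follows. This is a minimal illustration on simulated data, not the GEUVADIS analysis itself: the gene names, the `IID` key, and the layout of the observed expression table (one row per individual, one column per gene) are assumptions made for the example.

```r
# Toy stand-ins for the real tables; all values are simulated and
# the gene names are hypothetical.
set.seed(1)
n <- 50
ids <- paste0("ID", seq_len(n))
genes <- c("geneA", "geneB", "geneC")

# Predicted expression: one row per individual, one column per gene
predicted <- data.frame(
  IID = ids,
  matrix(rnorm(n * 3), n, 3, dimnames = list(NULL, genes))
)

# Observed expression: rows shuffled to show why the merge is needed;
# only geneA tracks its prediction, geneB and geneC are pure noise
observed <- predicted[sample(n), ]
observed$geneA <- observed$geneA + rnorm(n, sd = 0.5)
observed$geneB <- rnorm(n)
observed$geneC <- rnorm(n)

# Merge on the individual ID so predicted and observed values line up
merged <- merge(predicted, observed, by = "IID", suffixes = c(".pred", ".obs"))

# Spearman correlation between predicted and observed expression, per gene
spearman <- sapply(genes, function(g) {
  cor(merged[[paste0(g, ".pred")]], merged[[paste0(g, ".obs")]],
      method = "spearman")
})
print(round(spearman, 2))

# Predicted vs observed expression for one selected gene
plot(merged$geneA.pred, merged$geneA.obs,
     xlab = "predicted expression", ylab = "observed expression",
     main = "geneA (simulated)")
```

With real data, `predicted` would be `predicted_expression` and `observed` the GEUVADIS expression table reshaped to the same layout.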

## run association with phenotype

```{bash, eval=FALSE}

export PRE="/home/rstudio/QGT-Columbia-HKI"
export DATA=$PRE/predixcan/data
export MODEL=$PRE/models
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

printf "association\n\n"
python3 $METAXCAN/PrediXcanAssociation.py \
--expression_file $RESULTS/predixcan/Whole_Blood__predict.txt \
--input_phenos_file $DATA/phenotype/random_pheno_1000G_hg38.txt \
--input_phenos_column pheno \
--output $RESULTS/predixcan/random_pheno/Whole_Blood__association.txt \
--verbosity 9 \
--throw

```

## read results

```{r, eval=FALSE}

predixcan_association = read_tsv(glue::glue("{results.dir}/predixcan/random_pheno/Whole_Blood__association.txt"))
dim(predixcan_association)
predixcan_association %>% arrange(pvalue) %>% head
predixcan_association %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)


```
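
Because the phenotype here is random, the p-values should look roughly uniform, which is what the histogram checks. A QQ plot is another common way to see this. The sketch below uses simulated p-values as a stand-in for `predixcan_association$pvalue`; it is an illustration under that assumption, not output from the actual run.

```r
# Simulated stand-in for predixcan_association$pvalue (assumption:
# under a random phenotype the p-values are uniform on [0, 1])
set.seed(42)
pvalue <- runif(1000)

n_tests <- length(pvalue)

# Expected quantiles under the null vs the sorted observed p-values,
# both on the -log10 scale (smallest p-value ends up top right)
expected <- -log10(ppoints(n_tests))
observed <- -log10(sort(pvalue))

plot(expected, observed,
     xlab = "expected -log10(p)", ylab = "observed -log10(p)",
     main = "QQ plot (simulated null p-values)")
abline(0, 1)  # points close to this line indicate no inflation
```

Points that rise well above the diagonal would suggest either true signal or inflation; with a random phenotype they should stay close to the line.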


## Exercise
- [ ] Run the association with another phenotype, located at
`$PRE/predixcan/data/phenotype/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38_x_en_Whole_Blood.simulated_phenotype.spike_n_slab_0.5_x_pve0.6.txt`

```{bash, eval=FALSE}

```

```{r}

```

-------
-------

# Summary PrediXcan

```{r}

## harmonized and imputed GWAS result for coronary artery disease is available in 
# $PRE/s-predixcan/data/

```

## run s-predixcan 

```{bash, eval=FALSE}

export PRE="/home/rstudio/QGT-Columbia-HKI"
export DATA=$PRE/s-predixcan/data
export MODEL=$PRE/models
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

python3 $METAXCAN/SPrediXcan.py \
--gwas_file  $DATA/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore \
--model_db_path $MODEL/gtex_v8_mashr/mashr_Whole_Blood.db \
--covariance $MODEL/gtex_v8_mashr/mashr_Whole_Blood.txt.gz \
--keep_non_rsid --additional_output --model_db_snp_key varID \
--throw \
--output_file $RESULTS/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv

```

## plot and interpret s-predixcan results

```{r, eval=FALSE}

spredixcan_association = read_csv(glue::glue("{results.dir}/spredixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE__PM__Whole_Blood.csv"))
dim(spredixcan_association)
spredixcan_association %>% arrange(pvalue) %>% head
spredixcan_association %>% arrange(pvalue) %>% ggplot(aes(pvalue)) + geom_histogram(bins=20)

```
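
When interpreting the table, it helps to recall how the z-score and p-value relate and which genes survive multiple-testing correction. The sketch below assumes a two-sided normal test and a Bonferroni threshold of 0.05 divided by the number of genes tested; the gene names and z-scores are made up for illustration.

```r
# Toy association table; gene names and z-scores are hypothetical
spx <- data.frame(
  gene_name = c("g1", "g2", "g3", "g4"),
  zscore    = c(-6.1, 0.3, 2.2, 5.0)
)

# Two-sided p-value from a z-score under the standard normal
spx$pvalue_from_z <- 2 * pnorm(-abs(spx$zscore))

# Bonferroni threshold: 0.05 divided by the number of genes tested
threshold <- 0.05 / nrow(spx)
spx$significant <- spx$pvalue_from_z < threshold

# Most significant genes first
print(spx[order(spx$pvalue_from_z), ])
```

With the real results, the same threshold would be computed from `nrow(spredixcan_association)`.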

SORT1, considered to be a causal gene for LDL cholesterol and, as a consequence, for coronary artery disease, is not found here. Why? (Hint: think about the tissue.)


## run multixcan (optional)

```{bash, eval=FALSE}

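# Each bash chunk runs in its own shell, so the variables are exported again.
# DATA is assumed to match the S-PrediXcan chunk; adjust it if the
# multi-tissue models and covariance live elsewhere.
export PRE="/home/rstudio/QGT-Columbia-HKI"
export DATA=$PRE/s-predixcan/data
export METAXCAN=$PRE/repos/MetaXcan-master/software
export RESULTS=$PRE/results

python3 $METAXCAN/SMulTiXcan.py \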
--models_folder $DATA/models/eqtl/mashr \
--models_name_pattern "mashr_(.*).db" \
--snp_covariance $DATA/models/gtex_v8_expression_mashr_snp_covariance.txt.gz \
--metaxcan_folder $RESULTS/spredixcan/eqtl/ \
--metaxcan_filter "CARDIoGRAM_C4D_CAD_ADDITIVE__PM__(.*).csv" \
--metaxcan_file_name_parse_pattern "(.*)__PM__(.*).csv" \
--gwas_file $RESULTS/processed_summary_imputation/imputed_CARDIoGRAM_C4D_CAD_ADDITIVE.txt.gz \
--snp_column panel_variant_id --effect_allele_column effect_allele --non_effect_allele_column non_effect_allele --zscore_column zscore --keep_non_rsid --model_db_snp_key varID \
--cutoff_condition_number 30 \
--verbosity 7 \
--throw \
--output $RESULTS/smultixcan/eqtl/CARDIoGRAM_C4D_CAD_ADDITIVE_smultixcan.txt

```

# Colocalization methods

## fine-map GWAS results
Because of time limitations we will run torus, which assumes a single causal variant per locus. Ideally we would run a fine-mapping method that allows multiple causal variants per locus.

```{bash, eval=FALSE}

#torus -d Height.torus.zval.gz --load_zval -dump_pip Height.gwas.pip
#gzip Height.gwas.pip

# torus writes the PIP file next to the input, so gzip must point at the same directory
export DATA=/Users/haekyungim/Box/LargeFiles/QGT-Columbia-HKI/fastenloc/data
torus -d $DATA/Height.torus.zval.gz --load_zval -dump_pip $DATA/Height.gwas.pip
gzip $DATA/Height.gwas.pip

```

## estimate priors
Open question: is this done internally by fastENLOC?
```{r}


```

## calculate colocalization with fastENLOC 

```{bash, eval=FALSE}
## tutorial https://github.com/xqwen/fastenloc/tree/master/tutorial

# placeholders: replace with the actual gzipped eQTL annotation and GWAS PIP files
export EQTLGZ=eqtl_annotation_gzipped
export GWASGZ=gwas_data_gzipped
export TISSUE=Whole_Blood
fastenloc -eqtl $EQTLGZ -gwas $GWASGZ -t $TISSUE #[-total_variants total_snp] [-thread n] [-prefix prefix_name] [-s shrinkage]

```

## analyze results 

```{r}

## optional - compare with s-predixcan results

```
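
The optional comparison with the s-predixcan results could be sketched as a join on the gene identifier. Everything in the example is an assumption made for illustration: the toy tables, the column names `gene` and `rcp` (a gene-level regional colocalization probability, which would first have to be derived from the signal-level fastENLOC output), and the 0.1 cutoff, which is an arbitrary choice here.

```r
# Toy S-PrediXcan results (hypothetical genes and p-values)
spx <- data.frame(
  gene   = c("ENSG_A", "ENSG_B", "ENSG_C"),
  pvalue = c(1e-8, 0.2, 3e-6)
)

# Toy gene-level colocalization results (hypothetical; 'rcp' is an
# assumed column holding the regional colocalization probability)
coloc <- data.frame(
  gene = c("ENSG_A", "ENSG_C", "ENSG_D"),
  rcp  = c(0.85, 0.02, 0.6)
)

# Keep every S-PrediXcan gene; genes without a colocalization
# result get NA in the rcp column
both <- merge(spx, coloc, by = "gene", all.x = TRUE)

# A gene is flagged only if it is Bonferroni-significant in the
# association test and also has colocalization support
rcp_cutoff <- 0.1  # arbitrary cutoff chosen for this sketch
both$supported <- both$pvalue < 0.05 / nrow(spx) &
  !is.na(both$rcp) & both$rcp > rcp_cutoff

print(both)
```

In this toy table a gene can be strongly associated yet show little colocalization support (`ENSG_C`), which is the kind of discrepancy the comparison is meant to surface.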

# Mendelian randomization methods

## run SMR (optional)

```{bash, eval=FALSE}

```

## run  TWMR (for a locus)

```{bash, eval=FALSE}

# Define the paths before using them
export PRE=/Users/haekyungim/Box/LargeFiles/QGT-Columbia-HKI
export TWMR=$PRE/repos/TWMR-master
export OUTPUT=$PRE/results
GENE=ENSG00000002919

cd $TWMR
R < $TWMR/MR.R --no-save $GENE

cd $PRE

```