# Colocalization with fastENLOC

## Aim

This notebook demonstrates a colocalization analysis workflow using fastENLOC.

## Input

1) __GWAS summary statistics file__ including the following columns at minimum (no header)

- snp_id: variant ID. Format: chr#_bp_a1_a2_build. Example: chr21_25380778_A_G_b38
- chr: chromosome number
- pos: base pair position
- z_score: z-score values

2) __eQTL SNPs data file__ including the following columns (no header) 

- gene: gene name
- snp: variant ID
- tss: distance to the transcription start site
- pval: p-value
- beta: beta-hat, effect size
- se: standard error of beta-hat

3) __LD correlation matrix file__ for all SNPs included in eQTL data file. (no header)


4) __eQTL Z-score file__ with the following columns (no header)

- snp: variant ID. Format: chr#_bp_a1_a2. Example: chr21_25380778_A_G
- z: z-score (beta divided by its standard error)

5) __SNP VCF file__ annotating all SNP positions, with the following columns (header included)

- CHROM: chromosome number. Format chr#. Example: chr1
- POS: base pair position
- ID: variant ID. Format: chr#_bp_a1_a2. Example: chr21_25380778_A_G
- REF: effect allele (a1 in the variant ID)
- ALT: other allele (a2 in the variant ID)
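
The z-score file (input 4) and the VCF columns (input 5) can both be derived from the eQTL SNP file (input 2). The sketch below illustrates this with `awk` on two made-up rows. It assumes the z-score is beta / se and that variant IDs follow the `chr#_bp_a1_a2` format. The gene name, values, and the leading `#` on the header line (borrowed from the usual VCF convention) are illustrative assumptions, not requirements stated by this notebook.

```shell
# Hypothetical eQTL rows (made-up values): gene, snp, tss, pval, beta, se
eqtl='GENE_A chr21_25380778_A_G 1200 0.01 0.50 0.25
GENE_A chr21_25381000_C_T -300 0.20 -0.30 0.20'

# z-score file: variant ID (column 2) and z = beta / se (column 5 / column 6)
z_rows=$(printf '%s\n' "$eqtl" | awk '{ printf "%s %.3f\n", $2, $5 / $6 }')
echo "$z_rows"

# VCF body: split each variant ID on "_" into CHROM, POS, REF (a1), ALT (a2)
vcf_rows=$(printf '%s\n' "$eqtl" |
    awk -v OFS='\t' '{ split($2, f, "_"); print f[1], f[2], $2, f[3], f[4] }')

# Prepend a header line; the leading "#" is an assumption based on VCF convention
printf '#CHROM\tPOS\tID\tREF\tALT\n%s\n' "$vcf_rows"
```

The same logic is implemented in R in Step 0 of the workflow; the shell version is only meant as a quick way to inspect the expected layouts.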


## Output 

1) Enrichment analysis result `prefix.enloc.enrich.rst`: estimated enrichment parameters and standard errors.

2) Signal-level colocalization result `prefix.enloc.sig.out`: the main output from the colocalization analysis, with the following format:
- column 1: signal cluster name (from eQTL analysis)
- column 2: number of member SNPs
- column 3: cluster PIP of eQTLs
- column 4: cluster PIP of GWAS hits (without eQTL prior)
- column 5: cluster PIP of GWAS hits (with eQTL prior)
- column 6: regional colocalization probability (RCP)

3) SNP-level colocalization result `prefix.enloc.snp.out`: SNP-level colocalization output, with the following format:
- column 1: signal cluster name
- column 2: SNP name
- column 3: SNP-level PIP of eQTLs
- column 4: SNP-level PIP of GWAS (without eQTL prior)
- column 5: SNP-level PIP of GWAS (with eQTL prior)
- column 6: SNP-level colocalization probability

4) Sorted list of colocalization signals (highest RCP first), obtained with `sort -grk6 prefix.enloc.sig.out`
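
To see what ranking by column 6 does, the sketch below applies `sort -grk6` to three made-up signal-level rows in the six-column layout described above, then keeps the signals whose RCP reaches a cutoff. The gene names, values, and the 0.5 cutoff are hypothetical and chosen only for illustration; this notebook does not prescribe an RCP threshold.

```shell
# Hypothetical signal-level rows (made-up values):
# signal, n_snps, cpip_qtl, cpip_gwas_marginal, cpip_gwas_qtl_prior, rcp
sig_out='GENE_A:1 3 9.9e-01 2.0e-01 6.0e-01 5.5e-01
GENE_B:1 5 8.0e-01 1.0e-02 3.0e-02 2.0e-02
GENE_C:2 2 9.5e-01 4.0e-01 9.2e-01 9.0e-01'

# Rank signals by RCP (column 6), highest first; -g understands scientific notation
ranked=$(printf '%s\n' "$sig_out" | LC_ALL=C sort -grk6)
echo "$ranked"

# Keep the names of signals whose RCP reaches the illustrative cutoff
strong=$(printf '%s\n' "$sig_out" |
    awk -v cutoff=0.5 '$6 + 0 >= cutoff { print $1 }' | LC_ALL=C sort)
echo "$strong"
```

If the real output file starts with a header line, as in the summary listings at the end of this notebook, that line is not numeric in column 6 and ends up at the bottom of the reverse-sorted output.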

## Workflow

### Step 0: Prepare Intermediate Input Files

- `gene.prior.txt.gz` : eQTL SNP data file 

__FIXME__: In R, read the input eQTL SNP `.txt` file (obtaining its full file name), create a z-score `.txt` file and a VCF file, then gzip the VCF file (again obtaining its full file name).

In [None]:
[global]
parameter: container = "/home/at3535/fastenloc/fastenloc.sif"
parameter: wd = path("./")
parameter: exe_dir = "/usr/local/bin/"
parameter: name = "demo"
parameter: sumstats = path("./")
parameter: eqtl = path("./")

In [None]:
[eqtl_z1]
parameter: sumstats = path
input: sumstats
bash:
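    # Create the output directories (no error if they already exist)
    mkdir -p z_files vcfs dap_rst_dir annot
    # Decompress to a file, keeping the original .gz for torus
    gunzip -c gene.prior.txt.gz > gene.prior.txt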

In [None]:
[eqtl_z2]
R: 
    # Create the z-score file: variant ID (V2) and z = beta / se (V5 / V6)
    library("tidyverse")
    data = data.table::fread("{_input}", header = F)
    gene = as.character(data[1,1])
    z.data = data %>% mutate(z = V5/V6) %>% select(c(V2, z))
    filename = paste0("z_files/",paste(gene, "z","txt",sep = "."))
    write.table(z.data,filename,col.names = F, row.names = F, quote = F)
  
    # Create the VCF file: split each variant ID (chr_pos_a1_a2) into its fields
    eqtl.split = function(var.id){
      fields = str_split(var.id, "_", simplify = TRUE)
      data.frame(chr = fields[,1], pos = fields[,2], var.id,
                 a1 = fields[,3], a2 = fields[,4])
    }
  
    # Use the variant ID column of the original table, not the z-score table
    vcf = eqtl.split(data$V2)
    colnames(vcf) = c("CHROM", "POS", "ID", "REF", "ALT")
    vcf.filename = paste0("vcfs/",paste(gene, "vcf",sep = "."))
    write.table(vcf,vcf.filename,sep = "\t",col.names = F, row.names = F, quote = F)

In [None]:
[eqtl_z3]
bash: 
    gzip vcfs/gene.vcf

### Step 1: Prepare GWAS PIP 

- `gwas_z.txt` : GWAS summary statistics file

__Part 1__: Assign LD block for each SNP.

In [None]:
[ld_block]
bash: 
    perl format2torus.pl gwas_z.txt > gwas.zval
    gzip gwas.zval

__Part 2__: Convert z-scores to PIPs.

In [None]:
[gwas_pip]
bash:
    torus -d gwas.zval.gz --load_zval -dump_pip gwas.pip
    gzip gwas.pip

### Step 2: Prepare eQTL Annotation File

#### Method 1: Use pre-computed GTEx multi-tissue eQTL annotation files

Download:

- hg38 Position ID: https://drive.google.com/open?id=1kfH_CffxyCtZcx3z7k63rIARNidLv1_P
- rsID: https://drive.google.com/open?id=1rSaHenk8xOFtQo7VuDZevRkjUz6iwuj0

Files obtained:
- gtex_v8.eqtl_annot.vcf.gz
- gtex_v8.eqtl_annot_rsid.vcf.gz

#### Method 2: Derive annotations based on own eQTL data, using DAP-G

DAP-G annotations are produced in two parts:

__Part 1__: Estimate priors with `torus`

- `gene.prior.txt.gz` : eQTL SNP data file 

In [None]:
[estimate_prior]
bash: 
    torus -d for_prior/gene.prior.txt.gz --fastqtl -dump_prior dumpgene

__Part 2__: Annotate with `DAP-G`

- `z_file.txt` : eQTL z-score file
- `chr1_ld.ld.bin` : eQTL LD correlation matrix
- `genes.vcf.gz` : eQTL SNP vcf file

In [None]:
[dap_annot]
bash: 
    dap-g -d_z z_file.txt -d_ld chr1_ld.ld.bin -p dumpgene/gene.prior -ld_control 0.5 --all -t 4 > dap_rst_dir/gene.dap
    perl summarize_dap2enloc.pl -dir dap_rst_dir -vcf vcfs/gene.vcf.gz | gzip - > annot/gene.annot.vcf.gz

### Step 3: Colocalization with fastENLOC

In [None]:
[fastenloc]
bash: 
    fastenloc -eqtl annot/gene.annot.vcf.gz -gwas gwas.pip.gz 

## Minimal Working Example

In [None]:
[example]
bash:
    mkdir dap_rst_dir annot 
    perl format2torus.pl ad.0921.sumstats.txt > ad.0921.sumstats.zval
    gzip ad.0921.sumstats.zval
    torus -d ad.0921.sumstats.zval.gz --load_zval -dump_pip ad.0921.sumstats.pip
    gzip ad.0921.sumstats.pip
    torus -d for_prior/ENSG00000203710.prior.txt.gz --fastqtl -dump_prior dumpENSG00000203710
    dap-g -d_z z_files/ENSG00000203710.z.txt -d_ld lds/ENSG00000203710.ld -p dumpENSG00000203710/ENSG00000203710.prior -ld_control 0.5 --all -t 4 > dap_rst_dir/ENSG00000203710.dap
    perl summarize_dap2enloc.pl -dir dap_rst_dir -vcf vcfs/ENSG00000203710.vcf.gz | gzip - > annot/ENSG00000203710.annot.vcf.gz
    fastenloc -eqtl annot/ENSG00000203710.annot.vcf.gz -gwas ad.0921.sumstats.pip.gz
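
Before running the example above, it can help to confirm that the input files it reads are in place, since a missing file otherwise only surfaces partway through the pipeline. The helper below is a hypothetical convenience (`check_inputs` is not part of fastENLOC, torus, or DAP-G); the file names are the ones used in the example.

```shell
# Report which of the given files are missing or empty.
# Returns 0 if all are present, 1 otherwise.
check_inputs() {
    missing=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Input files read by the example above
if check_inputs ad.0921.sumstats.txt \
                for_prior/ENSG00000203710.prior.txt.gz \
                z_files/ENSG00000203710.z.txt \
                lds/ENSG00000203710.ld \
                vcfs/ENSG00000203710.vcf.gz; then
    echo "all inputs present"
else
    echo "some inputs are missing; see the messages above"
fi
```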

#### Summary: 

In [1]:
head enloc.enrich.out

                Intercept   -13.668           -
Enrichment (no shrinkage)    12.122       1.145
Enrichment (w/ shrinkage)     5.244       0.753


In [3]:
head enloc.sig.out

Signal	Num_SNP	CPIP_qtl	CPIP_gwas_marginal	CPIP_gwas_qtl_prior	RCP
ENSG00000203710:1      4  9.972e-01 1.395e-01    8.832e-01      8.688e-01


In [4]:
head enloc.snp.out

Signal	SNP	PIP_qtl	PIP_gwas_marginal	PIP_gwas_qtl_prior	SCP
ENSG00000203710:1   chr1_207684192_T_G   1.789e-01 3.882e-02    1.627e-01      1.582e-01
ENSG00000203710:1   chr1_207685965_A_C   1.220e-01 3.726e-02    1.119e-01      1.074e-01
ENSG00000203710:1   chr1_207692049_A_G   1.896e-01 2.799e-02    1.632e-01      1.603e-01
ENSG00000203710:1   chr1_207738077_T_C   5.066e-01 3.540e-02    4.454e-01      4.429e-01
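
In the listings above, the signal-level values agree with the sums of the corresponding SNP-level values over the four member SNPs. The sketch below checks this for the last column by summing SCP per signal cluster, using the rows copied from the `enloc.snp.out` listing; the result can be compared with the RCP reported in `enloc.sig.out`. This is only a consistency check on this example output, written under the assumption that columns are whitespace-separated.

```shell
# SNP-level rows copied from the enloc.snp.out listing above
snp_out='ENSG00000203710:1 chr1_207684192_T_G 1.789e-01 3.882e-02 1.627e-01 1.582e-01
ENSG00000203710:1 chr1_207685965_A_C 1.220e-01 3.726e-02 1.119e-01 1.074e-01
ENSG00000203710:1 chr1_207692049_A_G 1.896e-01 2.799e-02 1.632e-01 1.603e-01
ENSG00000203710:1 chr1_207738077_T_C 5.066e-01 3.540e-02 4.454e-01 4.429e-01'

# Sum SCP (column 6) within each signal cluster (column 1)
rcp_from_scp=$(printf '%s\n' "$snp_out" |
    awk '{ total[$1] += $6 } END { for (k in total) printf "%s %.4f\n", k, total[k] }')
echo "$rcp_from_scp"
```

The sum is 0.8688, matching the RCP of 8.688e-01 shown for `ENSG00000203710:1`.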
