# How to conduct fine-mapping analysis for gene-level eQTL data from pancreatic tissue

Pancreatic gene-level eQTL data from GTEx v8 were downloaded from https://www.gtexportal.org/home/downloads/adult-gtex/qtl.

This document describes how to conduct fine-mapping analysis for pancreatic gene-level eQTL data.

In this analysis, we employed genotype data from 40,000 unrelated British individuals in the UK Biobank.

We thank Dr. Arushi Varshney (Parker Lab) for their valuable support in shaping the analysis strategies and code development.

General steps:
- Split summary files into chromosome files, then gene files.
- Bgzip and index gene files.
- Extract the lead SNPs, then bgzip and index them.
- Run fine-mapping for eQTLs.

## Step 1: Set up data for the fine-mapping pipeline

This step produces a file that lists the gene-level summary statistic files for all lead signals; the file is used in the next step.

### Step 1.1: Convert `parquet` format

Data from GTEx are in `parquet` format, which can be converted into tab-delimited text files (one `.bed` file per chromosome) using the following code:

In [None]:
library("dplyr")
library("tidyr")
library("data.table")
library(arrow)
# Input directory with the GTEx parquet files and output directory for the converted files
in_dir <- "/scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/0_data/GTEx_EUR/"
out_dir <- "/scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/"
prefix <- "GTEx_Analysis_v8_QTLs_GTEx_Analysis_v8_EUR_eQTL_all_associations_Pancreas.v8.EUR.allpairs."
files <- list.files(in_dir, pattern="parquet")
for (i in files) {
        # Recover the chromosome label from the file name (literal, not regex, matching)
        chr <- gsub(".parquet", "", gsub(prefix, "", i, fixed=TRUE), fixed=TRUE)
        a <- read_parquet(paste0(in_dir, i))
        a <- a[,1:(ncol(a)-1)]  # drop the last column
        setDT(a)
        # Split variant IDs such as chr1_666028_G_A_b38 into their components
        a[, c("snp_chr", "snp_stop", "ref_gtex", "alt_gtex", "gtex_code") := tstrsplit(variant_id, "_")]
        a <- as.data.frame(a)
        colnames(a) <- c("gene_id", "variant_id", "tss_distance", "maf", "ma_samples", "ma_count", "pval_nominal", "slope", 
                         "slope_se", "snp_chr", "snp_stop", "ref_gtex", "alt_gtex", "gtex_code")
        # Keep positions as integers so write.table never prints them in scientific notation
        a$snp_stop <- as.integer(a$snp_stop)
        a$snp_start <- a$snp_stop - 1L  # 0-based start, as in the BED convention
        a <- a[, c("gene_id", "variant_id", "tss_distance", "ma_samples", "ma_count", "maf", "pval_nominal", "slope", "slope_se", 
                   "snp_chr", "snp_stop", "ref_gtex", "alt_gtex", "gtex_code", "snp_start")]
        write.table(a, paste0(out_dir, chr, ".bed"),
                    col.names=TRUE, row.names=FALSE, sep="\t", quote=FALSE)
}

### Step 1.2: Split all variants based on feature-level (in this case, gene-level)

GTEx names variants in the form `chr1_666028_G_A_b38`, which differs from our reference data, so GTEx variants must be mapped to our reference. In addition, the downstream steps expect the variants for each gene in a separate file.
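
As an illustration of the naming scheme only, the sketch below splits a GTEx variant ID into its fields and derives the 0-based BED start used in Step 1.1. It is written in Python for brevity; the names `GtexVariant` and `parse_gtex_variant_id` are hypothetical and are not part of the pipeline scripts. The mapping to reference rsIDs is not covered here.

```python
from typing import NamedTuple


class GtexVariant(NamedTuple):
    chrom: str
    start: int  # 0-based BED start (position - 1)
    end: int    # 1-based position, used as the BED end
    ref: str
    alt: str
    build: str


def parse_gtex_variant_id(variant_id: str) -> GtexVariant:
    """Split a GTEx v8 variant ID of the form chrom_pos_ref_alt_build."""
    parts = variant_id.split("_")
    if len(parts) != 5:
        raise ValueError(f"unexpected GTEx variant ID: {variant_id!r}")
    chrom, pos, ref, alt, build = parts
    end = int(pos)
    return GtexVariant(chrom, end - 1, end, ref, alt, build)


variant = parse_gtex_variant_id("chr1_666028_G_A_b38")
print(variant.chrom, variant.start, variant.end, variant.ref, variant.alt)
```

The check on the number of fields assumes that contig names contain no underscores, which holds for IDs of the form shown above.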

This step can be done with Snakemake; see the example Snakemake file at `scripts/splitGenes.sf`. This Snakemake task uses a genome-wide VCF file, which can be obtained by following the instructions in Step 1 of the fine-mapping eQTL InsPIRE code. Every gene file then needs to be compressed and indexed, which can be done with the following code:
```
cd /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed
module load Bioinformatics
module load Bioinformatics  gcc/10.3.0-k2osx5y
module load samtools/1.13-fwwss5n

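# Chromosome label passed as the first argument
chr=$1

# Compress and index every per-gene BED file on this chromosome
for i in *__"$chr":*bed
do
    [ -e "$i" ] || continue  # skip if the glob matched nothing
    bgzip -@ 2 "$i"
    tabix --preset=bed "$i".gz
done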
```

To map GTEx variant names to our reference data for the lead SNPs (which are supplied in the file `Pancreas.v8.egenes.txt.gz`), use the script `scripts/getHg38SummStats_leads.R`. The script produces a file named `eQTL_EUR_leads.txt`, which should be bgzipped and indexed for use in the next step.

### Step 1.3: Set up a file with gene-level summary stat files for all lead signals

In [2]:
library(dplyr)


ind <- read.table("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/data/eQTL_EUR_leads.txt.gz", header = F)
df38 <- ind[, c("V9", "V4")]

In [3]:
files <- list.files("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/", pattern = "gz")
files <- files[grep("tbi", files, invert = T)]
file_df <- data.frame(eqtl_input = files)
file_df$gene <- unlist(lapply(strsplit(file_df$eqtl_input, '__'), '[', 1))
df38 <- inner_join(df38, file_df, by = c("V9" = "gene"))
df38$eqtl_input <- paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/",
                          df38$eqtl_input)
df38$gene_id <- unlist(lapply(strsplit(df38$V9, '\\.'), '[', 1))

In [4]:
tss <- read.table("/scratch/scjp_root/scjp99/vthihong/genome/geneTSS.bed", header = F)
df38 <- inner_join(df38, tss[, c(1, 2, 3, 4, 6)], by = c("gene_id" = "V6"))
df38$locus <- paste0(df38$V4.y, "__", df38$V4.x, "__P")
df <- df38[, c("V1", "V2", "V3", "locus", "gene_id", "eqtl_input")]
colnames(df) <- c("chr", "start", "end", "locus", "gene_id", "eqtl_input")
head(df)
write.table(df, row.names = F, sep = "\t", quote = F,
            "/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/data/gene_eQTLs-selected.tsv")

| | chr | start | end | locus | gene_id | eqtl_input |
| --- | --- | --- | --- | --- | --- | --- |
| | `<chr>` | `<int>` | `<int>` | `<chr>` | `<chr>` | `<chr>` |
| 1 | chr1 | 778770 | 778771 | `RP11-206L10.9__rs187772768__P` | ENSG00000237491 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000237491.8__GTEx_Pancreas_Gene__1:778770.bed.gz` |
| 2 | chr1 | 817371 | 817372 | `FAM87B__rs187772768__P` | ENSG00000177757 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000177757.2__GTEx_Pancreas_Gene__1:817371.bed.gz` |
| 3 | chr1 | 817712 | 817713 | `RP11-206L10.8__rs114525117__P` | ENSG00000230092 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000230092.7__GTEx_Pancreas_Gene__1:800879.bed.gz` |
| 4 | chr1 | 827522 | 827523 | `LINC00115__rs114525117__P` | ENSG00000225880 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000225880.5__GTEx_Pancreas_Gene__1:826206.bed.gz` |
| 5 | chr1 | 825138 | 825139 | `LINC01128__rs4970388__P` | ENSG00000228794 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000228794.8__GTEx_Pancreas_Gene__1:825138.bed.gz` |
| 6 | chr1 | 959309 | 959310 | `NOC2L__rs4970441__P` | ENSG00000188976 | `/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000188976.10__GTEx_Pancreas_Gene__1:944582.bed.gz` |


The `write.table` call above saves the `df` object to a file named `gene_eQTLs-selected.tsv`.
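
The file-name handling in the cells above (skipping `.tbi` index files, taking the text before the first `__` as the versioned gene ID, and dropping the version suffix) can be sketched in Python as follows. This is a minimal illustration, not the pipeline code; `select_gene_files` and `describe_gene_file` are hypothetical helper names.

```python
import os


def select_gene_files(names):
    """Keep bgzipped per-gene files and skip their .tbi index files."""
    return [n for n in names if n.endswith(".gz")]


def describe_gene_file(path):
    """Derive the versioned and unversioned Ensembl gene IDs from a file path."""
    name = os.path.basename(path)
    versioned_id = name.split("__")[0]    # text before the first "__"
    gene_id = versioned_id.split(".")[0]  # drop the version suffix, e.g. ".8"
    return {"versioned_id": versioned_id, "gene_id": gene_id, "eqtl_input": path}


names = [
    "ENSG00000237491.8__GTEx_Pancreas_Gene__1:778770.bed.gz",
    "ENSG00000237491.8__GTEx_Pancreas_Gene__1:778770.bed.gz.tbi",
]
for name in select_gene_files(names):
    print(describe_gene_file(name))
```

The unversioned `gene_id` is what the join against the TSS file relies on, since that file is keyed without version suffixes in the cell above.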

## Step 3: Set up scripts for every eGene of interest

First, set up a config file with housekeeping information such as file directories and parameters; see the example in `config.yaml`. The file `gene_eQTLs-selected.tsv` is used for both `trait1-leads` and `selected-stats`. Then use the `scripts/make_susie-sh.py` script to create a SLURM job per region of interest.

Important note: the script `scripts/make_susie-sh.py` requires two template scripts, which should be specified in the config file:
```
prep-template: "{base}/scripts/dosage-template.sh"
susie-template: "{base}/scripts/susie-template.sh"
```

```
cd /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie-region

python /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/1_eQTL-gtex-susie/scripts/make_susie-sh.py --config /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/1_eQTL-gtex-susie/scripts/config.yaml
```

At this point, we have a pair of scripts for each region, named `gene_eQTLs__<locus name>__<lead SNP rsID>__<P or S>__<region>__<window>.susieprep.sh` and `gene_eQTLs__<locus name>__<lead SNP rsID>__<P or S>__<region>__<window>.susie.sh`. The locus name contains no special characters such as `;` or `()`, and `P` or `S` marks a primary or secondary signal. The `*.susieprep.sh` script fetches the inputs for the region, namely the variants and their dosages. The `*.susie.sh` script runs the fine-mapping analysis.

An example `susieprep.sh` file is shown below:
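
To make the naming scheme concrete, here is a minimal Python sketch that splits such a script name back into its fields. The function `parse_script_name` is a hypothetical helper and is not part of `make_susie-sh.py`; it assumes that the fields are separated by exactly two underscores and that the region has the form `<chr>-<start>-<end>`.

```python
SCRIPT_SUFFIXES = (".susieprep.sh", ".susie.sh")


def parse_script_name(filename: str) -> dict:
    """Split a generated script name into its `__`-separated fields."""
    for suffix in SCRIPT_SUFFIXES:
        if filename.endswith(suffix):
            stem = filename[: -len(suffix)]
            step = suffix[1:-3]  # "susieprep" or "susie"
            break
    else:
        raise ValueError(f"not a susieprep/susie script: {filename!r}")

    fields = stem.split("__")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {filename!r}")
    prefix, locus, rsid, signal, region, window = fields
    chrom, start, end = region.split("-")
    return {
        "prefix": prefix, "locus": locus, "rsid": rsid, "signal": signal,
        "chrom": chrom, "start": int(start), "end": int(end),
        "window": window, "step": step,
    }


info = parse_script_name(
    "gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.susieprep.sh"
)
print(info["locus"], info["rsid"], info["signal"], info["chrom"], info["window"])
```

The suffix is stripped by exact match rather than by splitting on the first `.`, because locus names such as `RP11-206L10.9` contain dots themselves.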
```
cat gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.susieprep.sh 

#!/bin/bash

## fetch variants in the region and intersect UKBB vcfs
for i in /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/0_data/hg38/chr16.imputed.poly.vcf.gz; do tabix $i chr16:28042519-28542520 | awk '{if (($0 !~ /^#/ && $0 !~ /^chr/)) print "chr"$0; else print $0}' ; done | sort | uniq > gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.genotypes
zcat /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/0_data/hg38/chr16.imputed.poly.vcf.gz | head -10000 | awk '{if (($0 ~ /^#/)) print $0}' > gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.header
cat gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.header gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.genotypes | bgzip -c > gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.vcf.gz; tabix gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.vcf.gz
rm gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.genotypes gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.header

## fetch UKBB dosages 
zcat gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.vcf.gz | head -10000 | awk -F'\t' '{if (($0 ~/^#CHROM/)) print $0}' OFS='\t' | sed -e 's:#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT:ID:g' > gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-header.txt 
bcftools query -f "%ID-%REF-%ALT[\t%DS]\n" gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.vcf.gz | cat gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-header.txt - > gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-dosages.tsv 

## bgzip to save space
module load Bioinformatics
module load Bioinformatics  gcc/10.3.0-k2osx5y
module load samtools/1.13-fwwss5n

bgzip -@ 2 gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-dosages.tsv

## cleanup
rm -rf gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-header.txt gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb.vcf.gz*
```

An example `susie.sh` file is shown below:
```
cat gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.susie.sh

#!/bin/bash

################## running SuSiE for  gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb:

## Susie 
/scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/1_eQTL-gtex-susie/scripts/susie-eqtl.R --prefix gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb --type quant --beta slope --p pval_nominal --se slope_se --effect ALT --non_effect REF --sdY 1 --coverage 0.95 --maxit 10000 --min_abs_corr 0.1 --s_threshold 0.3 --number_signals_default 10 --number_signals_high_s 1 --marker rs62031562 --trait1 /scratch/scjp_root/scjp99/vthihong/2_PanKBase/colocGWAS_T1D/3_t1d-eQTL_GTEx-coloc/results/GTEx_EUR/gtex_indexed/ENSG00000188322.4__GTEx_Pancreas_Gene__16:28292519.bed.gz --trait1_ld gene_eQTLs__SBK1__rs62031562__P__chr16-28042519-28542520__250kb.ukbb-dosages.tsv.gz 
```

## Step 4: Conduct fine-mapping analysis for all regions of interest

After setting up the analysis scripts for each eGene, we can run the analysis for every eGene using Snakemake. See the example Snakemake file at `scripts/susie.sf`.

By default, the signals for each eGene are saved in an R object file named `*.susie.Rda`.

## Step 5: Obtain output files for PanKgraph

For PanKgraph, we extract some of the outputs into text files. Example code follows:

In [5]:
library(glue)
library(tidyr)
suppressPackageStartupMessages(library(dplyr))

In [6]:
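# Build an LD (correlation) matrix for the SNPs in `snplist` from a dosage file
process_dosage = function(f, snplist){
    ld = read.csv(f, sep='\t', check.names = F)
    # Flag every copy of a duplicated variant ID, then drop them all
    is_dup = duplicated(ld$ID) | duplicated(ld$ID, fromLast = TRUE)
    print(glue("N duplicates = {sum(is_dup)}"))
    ld = ld[!is_dup,]
    row.names(ld) = ld$ID
    ld$ID = NULL
    # Keep only the requested SNPs, in the order given by `snplist`
    idlist = intersect(snplist, row.names(ld))
    ld = ld[idlist, , drop = FALSE]
    # Preview the dosages without indexing past the bounds of the table
    print(ld[seq_len(min(5, nrow(ld))), seq_len(min(10, ncol(ld))), drop = FALSE])
    # Pearson correlation between SNPs (rows), computed across individuals (columns)
    ld = cor(t(ld))
    return(ld)
}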

meta <- read.table("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/data/gene_eQTLs-selected.tsv", header = T)
meta <- distinct(meta)
head(meta)



In [11]:
l <- "SBK1__rs62031562__P__chr16-28042519-28542520"
input <- meta[meta$locus == "SBK1__rs62031562__P", "eqtl_input"]

qtl <- read.csv(input, sep='\t', header=T, check.names=F)
qtl$snp <- paste0(qtl$SNP, "-", qtl$REF, "-", qtl$ALT)

load(paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie-prep/gene_eQTLs__", l, "__250kb.susie.Rda"))

n_cs <- length(S2$sets$cs)
if (n_cs > 0) {
    for (j in seq_len(n_cs)) {
        # Report and skip credible sets whose coverage falls below 0.95
        if (S2$sets$coverage[[j]] < 0.95) {
            print(names(S2$sets$cs[[j]]))
            next
        }
        pip <- data.frame(pip=S2$pip[names(S2$sets$cs[[j]])])
        pip$snp <- row.names(pip)
        # GTEx defines eQTL effect sizes as the effect of the alternative allele (ALT)
        # relative to the reference allele (REF), so the effect allele is ALT,
        # not the minor allele. https://gtexportal.org/home/faq
        pip <- inner_join(pip, qtl[,c("snp", "pval_nominal", "alt_gtex", "ref_gtex", "slope")], by = "snp")
        print(head(pip))

        # Log Bayes factors of the single-effect component behind this credible set
        idx = S2$sets$cs_index[j]
        isnps = colnames(S2$lbf_variable)
        bf = S2$lbf_variable[idx, isnps, drop=FALSE]
        bf = data.frame(snp = isnps, lbf = t(bf)[,1])
        pip <- inner_join(pip, bf, by = c("snp" = "snp"))
        colnames(pip) <- c("pip", "snp", "nominal_p", "effect_allele", "other_allele", "slope", "lbf")

        # Pairwise LD (r^2) among the credible-set SNPs
        ldf <- process_dosage(paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie-prep/gene_eQTLs__", l, "__250kb.ukbb-dosages.tsv.gz"), pip$snp)
        ldf <- ldf^2
        # Strip the -REF-ALT suffix so that only the rsID remains
        colnames(ldf) <- stringr::str_extract(colnames(ldf), "[^-]*")
        rownames(ldf) <- stringr::str_extract(rownames(ldf), "[^-]*")
        #write.table(ldf, paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie/", l, "__250kb__credibleSet", j, "__ld.txt"), sep = "\t", quote = F)

        pip$snp <- stringr::str_extract(pip$snp, "[^-]*")
        print(head(pip))
        #write.table(pip[, c("snp", "pip", "nominal_p", "effect_allele", "other_allele", "slope", "lbf")], paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie/", l, "__250kb__credibleSet", j, ".txt"), row.names = F, sep = "\t", quote = F)
    }

    # Purity and coverage of every credible set at this locus, built once outside the loop
    p <- data.frame(locus = rep(l, n_cs), purity = S2$sets$purity[, 1],
                    coverage = unlist(S2$sets$coverage), credibleset = seq_len(n_cs))
    print(head(p))
    #write.table(p, paste0("/nfs/turbo/umms-scjp-pank/vthihong/colocGWAS_T1D/1_eQTL-gtex-susie/results/susie/purity/gene_eQTLs__", 
    #                "A1CF", "__", l, ".txt"), row.names = F, sep = "\t", quote = F)
}

[1m[22mJoining with `by = join_by(snp)`


         pip                 snp pval_nominal alt_gtex ref_gtex      slope
1 0.03207257 rs141876325-C-CACCT 2.183181e-06    CACCT        C -0.2944332
2 0.03207257       rs2650492-G-A 2.183181e-06        A        G -0.2944332
3 0.01068336       rs2726034-T-C 6.400740e-06        C        T -0.2745833
4 0.01086953      rs12598357-A-G 6.292714e-06        G        A -0.2681219
5 0.04608285       rs2726036-A-C 1.536282e-06        C        A -0.2837276
6 0.02849677      rs13333976-T-C 2.449138e-06        C        T -0.2772882
N duplicates = 0
                    1000251 1000534 1000542 1000766 1000898 1000924 1000961
rs141876325-C-CACCT       0       0       1       2       0       0       0
rs2650492-G-A             0       0       1       2       0       0       0
rs2726034-T-C             0       0       1       2       0       0       0
rs12598357-A-G            0       0       1       2       0       0       1
rs2726036-A-C             0       0       1       2       0       0       1
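
The LD step in `process_dosage` drops every copy of a duplicated variant ID, keeps the requested SNPs, and squares the Pearson correlation of their dosages across individuals. A minimal pure-Python sketch of that logic on toy dosages is given below; the function names and the toy values are illustrative assumptions, not pipeline code or real data.

```python
import math
from collections import Counter


def drop_duplicated_ids(rows):
    """rows: list of (variant_id, dosages). Drop all copies of duplicated IDs."""
    counts = Counter(variant_id for variant_id, _ in rows)
    return {vid: dosages for vid, dosages in rows if counts[vid] == 1}


def pearson_r(x, y):
    """Pearson correlation of two equal-length dosage vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # monomorphic variant: correlation is undefined
    return sxy / math.sqrt(sxx * syy)


def ld_r2(dosages, snplist):
    """Pairwise r^2 for the SNPs in snplist that have dosages, in snplist order."""
    ids = [s for s in snplist if s in dosages]
    return {a: {b: pearson_r(dosages[a], dosages[b]) ** 2 for b in ids} for a in ids}


# Toy dosages for four individuals (made-up values and IDs)
rows = [
    ("snpA-G-A", [0, 0, 1, 2]),
    ("snpB-T-C", [2, 2, 1, 0]),
    ("snpC-A-G", [0, 1, 0, 1]),
    ("snpD-C-T", [1, 1, 0, 0]),
    ("snpD-C-T", [0, 0, 1, 1]),  # duplicated ID: both copies are dropped
]
dosages = drop_duplicated_ids(rows)
r2 = ld_r2(dosages, ["snpA-G-A", "snpB-T-C", "snpC-A-G", "snpD-C-T"])
print(sorted(r2))
```

Squaring the correlation removes its sign, so two variants whose dosages are perfectly anti-correlated still get an r² of 1.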
  