# Fine-mapping with PolyFun

## Aim

This notebook implements commands for [a functionally-informed fine-mapping workflow using the PolyFun method](https://github.com/omerwe/polyfun/wiki).

## Methods Overview 

`PolyFun` offers the following features:

1. Using and/or creating Functional Annotations
2. Estimating Functional Enrichment Using `S-LDSC`
3. Using and/or computing Prior Causal Probabilities from (1)
4. Functionally Informed Fine Mapping with Finemapper
5. Polygenic Localization with `PolyLoc`

**Note: this workflow does not implement feature 5, `PolyLoc`.**

## Input 

1) GWAS summary statistics including the following variables: 

    - variant_id - variant ID 
    - P - p-value 
    - CHR - chromosome number 
    - BP - base pair position
    - A1 - The effect allele (i.e., the sign of the effect size is with respect to A1)
    - A2 - the second allele 
    - MAF - minor allele frequency 
    - BETA - effect size 
    - SE - effect size standard error
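
Before launching the workflow, it can help to confirm that the summary statistics table carries every column listed above. The following is a minimal sketch, not part of PolyFun: the helper name `missing_sumstats_columns` and the toy table are hypothetical, and only the column names come from this notebook.

```python
import pandas as pd

# Column names as listed in this notebook's Input section.
REQUIRED_SUMSTATS_COLUMNS = ["variant_id", "P", "CHR", "BP", "A1", "A2", "MAF", "BETA", "SE"]

def missing_sumstats_columns(df: pd.DataFrame) -> list:
    """Return the required columns that are absent from df, in listed order."""
    return [col for col in REQUIRED_SUMSTATS_COLUMNS if col not in df.columns]

# Toy table (made-up values) that lacks MAF and SE on purpose.
toy = pd.DataFrame({
    "variant_id": ["rs1", "rs2"],
    "P": [0.01, 0.5],
    "CHR": [1, 1],
    "BP": [1000, 2000],
    "A1": ["A", "C"],
    "A2": ["G", "T"],
    "BETA": [0.1, -0.02],
})
missing = missing_sumstats_columns(toy)
print(missing)
```

An empty list means the table has all the listed columns; the check says nothing about whether the values themselves are valid.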


2) Functional annotation files including the following columns: 

    - CHR - chromosome number
    - BP - base pair position (in hg19 coordinates)
    - SNP - dbSNP reference number 
    - A1 - The effect allele 
    - A2 - the second allele
    - Arbitrary additional columns representing annotations


3) A `.l2.M` white-space delimited file containing a single line with the sums of the columns of each annotation

4) LD-score files 

    - Strongly recommended that LD-score files include A1,A2 columns


5) LD information, taken from one of three possible data sources:

    - plink files with genotypes from a reference panel
    - bgen file with genotypes from a reference panel
    - pre-computed LD matrix

    This input is optional if (4) has been obtained and there are no plans to compute prior causal probabilities non-parametrically.

6) LD-score weights files.

    - Strongly recommended that weight files include A1,A2 columns



## Output

A `.gz` file containing input summary statistics columns and additionally the following columns:

- PIP - posterior causal probability
- BETA_MEAN - posterior mean of causal effect size (in standardized genotype scale)
- BETA_SD - posterior standard deviation of causal effect size (in standardized genotype scale)
- CREDIBLE_SET - the index of the first (typically smallest) credible set that the SNP belongs to (0 means none).
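
As an illustration of how the `PIP` and `CREDIBLE_SET` columns can be read together, the sketch below groups variants by credible set and orders them by PIP. The function name `credible_set_members` and the toy values are hypothetical; the `SNP` column is assumed to be carried over from the input summary statistics.

```python
import pandas as pd

def credible_set_members(df: pd.DataFrame) -> dict:
    """Map each credible-set index (> 0) to its SNPs, sorted by decreasing PIP."""
    # CREDIBLE_SET == 0 means the SNP belongs to no credible set, so drop those rows.
    in_cs = df[df["CREDIBLE_SET"] > 0].sort_values("PIP", ascending=False)
    return {int(cs): grp["SNP"].tolist() for cs, grp in in_cs.groupby("CREDIBLE_SET")}

# Toy fine-mapping output (made-up values).
toy = pd.DataFrame({
    "SNP": ["rsA", "rsB", "rsC", "rsD"],
    "PIP": [0.60, 0.99, 0.35, 0.02],
    "CREDIBLE_SET": [1, 2, 1, 0],
})
members = credible_set_members(toy)
print(members)
```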


## Workflow

Steps 1 and 2 are optional if using pre-computed prior causal probabilities.

### Step 1: Obtain functional annotations 

For each chromosome, the following files need to be obtained: 

1) A `.gz` or `.parquet` annotations file containing the following columns:

- CHR - chromosome number
- BP - base pair position
- SNP - dbSNP reference number 
- A1 - The effect allele 
- A2 - the second allele
- Arbitrary additional columns representing annotations 

2) A `.l2.M` white-space delimited file containing a single line with the sums of the columns of each annotation

3) (Optional) A `.l2.M_5_50` file, which is the `.l2.M` file restricted to common SNPs (MAF between 5% and 50%)
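
The relationship between an annotations file and its `.l2.M` file can be sketched as follows, assuming the annotation columns are everything other than `CHR`, `BP`, `SNP`, `A1` and `A2`. The helper names and the three annotation columns are hypothetical, and the number formatting is an illustrative choice rather than a documented requirement.

```python
import pandas as pd

# Non-annotation columns, as listed in the annotations file description above.
META_COLUMNS = ["CHR", "BP", "SNP", "A1", "A2"]

def annotation_column_sums(annot: pd.DataFrame) -> pd.Series:
    """Sum every annotation column (everything that is not a metadata column)."""
    return annot.drop(columns=META_COLUMNS).sum(axis=0)

def format_l2_M(sums: pd.Series) -> str:
    """Render the sums as one white-space delimited line."""
    return " ".join(f"{value:g}" for value in sums)

# Toy annotations table; the three annotation columns are hypothetical.
toy = pd.DataFrame({
    "CHR": [1, 1, 1],
    "BP": [100, 200, 300],
    "SNP": ["rs1", "rs2", "rs3"],
    "A1": ["A", "C", "G"],
    "A2": ["G", "T", "A"],
    "base": [1, 1, 1],
    "coding": [0, 1, 0],
    "conserved": [0.5, 0.0, 0.25],
})
sums = annotation_column_sums(toy)
line = format_l2_M(sums)
print(line)
```

A `.l2.M_5_50` line would be built the same way after restricting the rows to common SNPs, which requires MAF values that the annotations file itself does not carry.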

The above files can be obtained either by using existing functional annotation files, or by creating your own through other software such as `TORUS`.

Example of existing functional annotation files: functional annotations for ~19 million UK Biobank imputed SNPs with MAF>0.1%, based on the baseline-LF 2.2.UKB annotations.

Download (30G): https://data.broadinstitute.org/alkesgroup/LDSCORE/baselineLF_v2.2.UKB.polyfun.tar.gz

### Step 2: Compute LD-scores for annotations 

Precomputed LD-score files can be used. LD-score files can also be generated through the methods below:
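
For orientation, the quantity being computed can be sketched in a few lines: in S-LDSC, the LD-score of SNP j for annotation c is the sum over SNPs k of r²(j, k) times the value of annotation c at SNP k. The toy example below assumes a small dense LD matrix and ignores the windowing and bias adjustment that real tools apply; the function name is hypothetical and this is not how the PolyFun scripts are implemented.

```python
import numpy as np

def stratified_ld_scores(ld: np.ndarray, annot: np.ndarray) -> np.ndarray:
    """Return an (n_snps, n_annotations) array of LD-scores: (r ** 2) @ annot."""
    return (ld ** 2) @ annot

# Toy LD (correlation) matrix for three SNPs.
ld = np.array([[1.0, 0.5, 0.0],
               [0.5, 1.0, 0.2],
               [0.0, 0.2, 1.0]])
# Two annotations: column 0 includes every SNP, column 1 only the second SNP.
annot = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 0.0]])
scores = stratified_ld_scores(ld, annot)
print(scores)
```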

#### Method 1: Compute with reference panel of sequenced individuals 

The reference panel should contain at least 3,000 sequenced individuals from the target population.

In [None]:
[global]
parameter: container = "/mnt/mfs/statgen/containers/xqtl_pipeline_sif/polyfun.sif"
parameter: wd = path("./")
parameter: exe_dir = "/usr/local/bin/"
parameter: name = "demo"
parameter: genoFile = path("./")
parameter: annot_file = path("./")
parameter: sumstats = path("./")

In [None]:
[ld_score]
input: annot_file, genoFile
output: f'{wd:a}/{name}.ref.ldscore.parquet'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/compute_ldscores.py \
        --bfile $[_input[1]:n] \
        --annot $[_input[0]] \
        --out $[_output]

#### Method 2: Compute with pre-computed UK Biobank LD matrices 

Matrices download: https://data.broadinstitute.org/alkesgroup/UKBB_LD

In [None]:
[ld_score_ukb]
input: annot_file
output: f'{wd:a}/{name}.ukb.ldscore.parquet'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/compute_ldscores_from_ld.py \
        --annot $[_input[0]] \
        --ukb \
        --out $[_output]

#### Method 3: Compute with own pre-computed LD matrices

Own pre-computed LD matrices should be in `.bcor` format. 

In [None]:
[ld_score_own]
parameter: sample_size = int
parameter: bcor_files = paths
input: annot_file,bcor_files
output: f'{wd:a}/{name}.original.ldscore.parquet'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/compute_ldscores_from_ld.py $[_input[1]] \
        --annot $[_input[0]] \
        --out $[_output] \
        --n $[sample_size]

### Step 3: Compute Prior Causal Probabilities

#### Method 1: Use precomputed prior causal probabilities

Use precomputed prior causal probabilities of 19 million imputed UK Biobank SNPs with MAF>0.1%, based on a meta-analysis of 15 UK Biobank traits. 

In [None]:
[prior_causal_prob]
input: sumstats
output: f'{wd:a}/{name}.pcp.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout', container = container
    python $[exe_dir]/extract_snpvar.py \
        --sumstats $[_input] \
        --out $[_output] \
        --allow-missing

#### Method 2: Compute via L2-regularized extension of S-LDSC (preferred)

Compute via an L2-regularized extension of stratified LD-score regression (S-LDSC). Use the annotation and LD-score files produced in Steps 1 and 2.

1) Create a munged summary statistics file in a PolyFun-friendly parquet format.

In [None]:
[munged_sumstats]
parameter: sample_size = 472868 
parameter: min_info = 0.6
parameter: min_maf = 0.01
input: sumstats
output: f'{wd:a}/{name}.sumstats_munged.parquet'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '127G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/munge_polyfun_sumstats.py \
      --sumstats $[_input] \
      --n $[sample_size] \
      --out $[_output] \
      --min-info $[min_info] \
      --min-maf $[min_maf]
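
The `min_info` and `min_maf` parameters act as quality thresholds on the summary statistics. The sketch below shows the kind of filter they imply, using the default values from the cell above. It is an assumption-laden illustration, not the munging script's code: the `INFO` column name, the inclusive boundaries and the function name `filter_sumstats` are all hypothetical.

```python
import pandas as pd

def filter_sumstats(df: pd.DataFrame, min_info: float = 0.6, min_maf: float = 0.01) -> pd.DataFrame:
    """Keep rows whose INFO and MAF both meet the thresholds (boundaries assumed inclusive)."""
    keep = (df["INFO"] >= min_info) & (df["MAF"] >= min_maf)
    return df.loc[keep].reset_index(drop=True)

# Toy table (made-up values): rs2 fails the INFO threshold, rs3 fails the MAF threshold.
toy = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3", "rs4"],
    "INFO": [0.90, 0.40, 0.95, 0.70],
    "MAF": [0.20, 0.30, 0.001, 0.05],
})
kept = filter_sumstats(toy)["SNP"].tolist()
print(kept)
```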

2) Run PolyFun with L2-regularized S-LDSC
- Requires at least 45 GB of memory

In [None]:
[L2_SLDSC]
# an LD-score file with suffix l2.ldscore.parquet
parameter: ref_ld = path
# another LD-score file (the regression weights) with suffix l2.ldscore.parquet; defaults to ref_ld
parameter: ref_wgt = ref_ld
parameter: partitions = ""
input: ref_ld, ref_wgt,output_from("munged_sumstats")
# parameter: sumstat = _input[2]
output: f'{wd:a}/{name}.ldsldsc.parquent'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '127G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/polyfun.py \
        --compute-h2-L2 \
        --output-prefix $[_output] \
        --sumstats $[_input[2]] \
        --ref-ld-chr $[_input[0]:nnnn]. \
        --w-ld-chr $[_input[1]:nnnn]. \
        --allow-missing $["" if partitions else "--no-partitions"]

#### Method 3: Compute Non-parametrically

1) Create a munged summary statistics file in a PolyFun-friendly parquet format.
Duplicated cells are commented out; the input of [ld_snpbin] is the output of [L2_SLDSC].

In [None]:
#[munged_sumstats2]
#parameter: sumstats = AD_sumstats_Jansenetal_2019sept.txt.gz
#parameter: sample_size = int
#parameter: container = none
#bash: container = container 
#    mkdir -p SLDSC_output
#    python munge_polyfun_sumstats.py \
#      --sumstats sumstats \
#      --n sample_size \
#      --out /SLDSC_output/sumstats_munged.parquet \
#      --min-info 0 \
#      --min-maf 0

2) Run PolyFun with L2-regularized S-LDSC

In [None]:
# [L2_regu_SLDSC2]
# 
# parameter: container = none
# parameter: ref_ld = example_data/annotations.
# parameter: ref_wgt = example_data/weights.
# bash: container=container
#     python polyfun.py \
#     --compute-h2-L2 \
#     --output-prefix output/testrun \
#     --sumstats example_data/sumstats.parquet \
#     --ref-ld-chr ref_ld \
#     --w-ld-chr ref_wgt

3) Compute LD-scores for each SNP bin

In [None]:
[ld_snpbin]
depends: sos_step("L2_SLDSC")
parameter: chrom = int
input: annot_file, genoFile
output: f'{wd:a}/{name}.snpbin.ldscore.parquet'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
     python $[exe_dir]/polyfun.py \
        --compute-ldscores \
        --bfile-chr $[_input[1]:n] \
        --output-prefix $[_output] \
        --chr $[chrom]

4) Re-estimate per-SNP heritabilities via S-LDSC

In [None]:
#[re_SLDSC]
#bash:
#    python polyfun.py \
#    --compute-h2-bins \
#    --output-prefix output/testrun \
#    --sumstats example_data/sumstats.parquet \
#    --w-ld-chr example_data/weights.

[L2_SLDSC_bins]
parameter: ref_ld = path
parameter: ref_wgt = ref_ld
parameter: partitions = ""
input: ref_ld, ref_wgt,output_from("munged_sumstats")
output: f'{wd:a}/{name}.txt.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/polyfun.py \
        --compute-h2-bins \
        --output-prefix $[_output] \
        --sumstats $[_input[2]] \
        --ref-ld-chr $[_input[0]] \
        --w-ld-chr $[_input[1]] 

### Step 4: Functionally informed fine mapping with finemapper

The input summary statistics file must have a `SNPVAR` column (per-SNP heritability) to perform functionally-informed fine-mapping. To fine-map without annotations, add the parameter `--non-funct`; the summary statistics file then does not need the `SNPVAR` column.
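
The role of `SNPVAR` can be sketched as follows: PolyFun specifies prior causal probabilities in proportion to per-SNP heritability, so within a locus the `SNPVAR` values can be normalized into prior weights. The helpers below are hypothetical illustrations, not functions from `finemapper.py`; the second one only suggests when `--non-funct` would be needed.

```python
import numpy as np
import pandas as pd

def prior_weights_from_snpvar(df: pd.DataFrame) -> np.ndarray:
    """Normalize per-SNP heritabilities in a locus so that they sum to 1."""
    snpvar = df["SNPVAR"].to_numpy(dtype=float)
    return snpvar / snpvar.sum()

def extra_finemapper_flags(df: pd.DataFrame) -> list:
    """Suggest `--non-funct` when the table carries no SNPVAR column."""
    return [] if "SNPVAR" in df.columns else ["--non-funct"]

# Toy locus (made-up values).
locus = pd.DataFrame({
    "SNP": ["rs1", "rs2", "rs3"],
    "SNPVAR": [1e-6, 3e-6, 4e-6],
})
weights = prior_weights_from_snpvar(locus)
print(weights)
print(extra_finemapper_flags(locus))
print(extra_finemapper_flags(locus.drop(columns="SNPVAR")))
```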

In [None]:
[fine_mapping]
parameter: sample_size = 383290
parameter: chrom = 1
parameter: start = 46000001
parameter: end = 49000001
#parameter: output_path = "output/finemap.1.46000001.49000001.gz"
parameter: max_num_causal = 5
input: genoFile,sumstats
output: f'{wd:a}/output/finemap.{chrom}.{start}.{end}.gz'
task: trunk_workers = 1, trunk_size = 1, walltime = '24h',  mem = '30G', tags = f'{step_name}_{_output[0]:bn}'
bash: expand = "$[ ]", stderr = f'{_output}.stderr', stdout = f'{_output}.stdout' , container = container
    python $[exe_dir]/finemapper.py \
        --geno $[_input[0]:n] \
        --sumstats $[_input[1]]  \
        --n $[sample_size] \
        --chr $[chrom] \
        --start $[start] \
        --end $[end] \
        --method susie \
        --max-num-causal $[max_num_causal] \
        --cache-dir $[_output:d]/cache \
        --out $[_output]

## Minimal Working Example

### Example 1: Functionally-informed fine-mapping using summary statistics file with precomputed prior causal probabilities

In [None]:
nohup sos run ~/GIT/xqtl-pipeline/pipeline/integrative_analysis/SuSiE_Ann/polyfun.ipynb prior_causal_prob \
    --sumstats /home/at3535/polyfun/AD_sumstats_Jansenetal_2019sept.txt.gz  \
    -J 200 -q csg \
    -c /home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml &

In [1]:
nohup sos run ~/GIT/xqtl-pipeline/pipeline/integrative_analysis/SuSiE_Ann/polyfun.ipynb fine_mapping \
    --sumstats /home/at3535/polyfun/AD_sumstats_Jansenetal_2019sept.txt.gz  \
    --genoFile /mnt/mfs/statgen/ROSMAP_xqtl/dataset/snvCombinedPlink/chr1.bed \
    -J 200 -q csg \
    -c /home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml &

### Example 2: Functionally-informed fine-mapping using summary statistics file generated from pre-obtained annotation and LD-score files 

In [None]:
nohup sos run ~/GIT/xqtl-pipeline/pipeline/integrative_analysis/SuSiE_Ann/polyfun.ipynb munged_sumstats  \
--sumstats /home/at3535/polyfun/GCST90012877_buildGRCh37_colrenamed.txt.gz \
-J 200 -q csg -c /home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml &

In [None]:
nohup sos run ~/GIT/xqtl-pipeline/pipeline/integrative_analysis/SuSiE_Ann/polyfun.ipynb L2_SLDSC  \
--sumstats demo.sumstats_munged.parquet \
--ref_ld /mnt/mfs/statgen/tl3030/baselineLF2.2.UKB/baselineLF2.2.UKB.1.l2.ldscore.parquet \
-J 200 -q csg -c /home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml &

In [None]:
nohup sos run ~/GIT/xqtl-pipeline/pipeline/integrative_analysis/SuSiE_Ann/polyfun.ipynb fine_mapping \
    --sumstats demo.ldsldsc.parquent  \
    --genoFile /mnt/mfs/statgen/ROSMAP_xqtl/dataset/snvCombinedPlink/chr1.bed \
    -J 200 -q csg -c /home/hs3163/GIT/ADSPFG-xQTL/code/csg.yml &

### Summary

#### Example 1:

In [23]:
bash:
    gzcat output/finemap.1.46000001.49000001.gz | head

CHR	SNP	BP	A1	A2	SNPVAR	Z	N	P	PIP	BETA_MEAN	BETA_SD	DISTANCE_FROM_CENTER	CREDIBLE_SET
1	rs2088102	46032974	T	C	1.70060e-06	1.25500e+01	383290	3.97510e-36	1.00000e+00	-2.03917e-02	1.61901e-03	1456799	1
1	rs7528714	47966058	G	A	1.18040e-06	5.14320e+00	383290	2.70098e-07	9.97870e-01	7.42146e-03	1.62305e-03	476285	2
1	rs7528075	47870271	G	A	1.18040e-06	4.40160e+00	383290	1.07456e-05	9.76545e-01	-5.98945e-03	1.81667e-03	380498	3
1	rs212968	48734666	G	A	1.70060e-06	-3.01130e+00	383290	2.60132e-03	3.75823e-01	-1.56305e-03	2.23942e-03	1244893	0
1	rs2622911	47837404	C	A	1.70060e-06	3.12520e+00	383290	1.77684e-03	3.71804e-01	1.54312e-03	2.22812e-03	347631	0
1	rs4511165	48293181	G	A	1.70060e-06	-1.18940e+00	383290	2.34282e-01	5.75970e-02	1.60630e-04	7.52226e-04	803408	0
1	rs3766196	47284526	C	A	6.93040e-06	-5.92360e-02	383290	9.52764e-01	4.89776e-02	-5.06776e-07	3.48039e-04	205247	0
1	rs12567716	48197570	T	C	1.18040e-06	2.14810e+00	383290	3.17058e-02	4.45128e-02	-1.28281e-04	6.81457e-04	707797	0


In [25]:
import numpy as np
import pandas as pd

data = pd.read_csv('output/finemap.1.46000001.49000001.gz', sep="\t")

data.head(5)
    
num_var_cs = np.count_nonzero(data['CREDIBLE_SET'])
total_cs = len(data.CREDIBLE_SET.unique())- 1
avg_var_cs = float(num_var_cs) / total_cs
pip50 = sum(1 for i in data['PIP'] if i >0.5)
pip95 = sum(1 for i in data['PIP'] if i >0.95)

result = "Number of variants with PIP > 0.5: " + str(pip50) + "\n" + "Number of variants with PIP > 0.95: " + str(pip95) + "\n" \
    + "Number of variants that have credible sets: " + str(num_var_cs) + "\n" \
    + "Number of unique credible sets: " + str(total_cs) + "\n" \
    + "Average number of variants per credible set: " + str(avg_var_cs) 


with open('results.txt', 'a') as the_file:
    the_file.write(result)

with open('results.txt') as f:
    contents = f.readlines()
    print(contents)

['Number of variants with PIP > 0.5: 3\n', 'Number of variants with PIP > 0.95: 3\n', 'Number of variants that have credible sets: 3\n', 'Number of unique credible sets: 3\n', 'Average number of variants per credible set: 1.0']


#### Example 2: 

In [None]:
bash:
    gzcat output/finemap.1.46000000.49000000.gz | head

In [25]:
import numpy as np
import pandas as pd

data = pd.read_csv('output/finemap.1.46000000.49000000.gz', sep="\t")

data.head(5)
    
num_var_cs = np.count_nonzero(data['CREDIBLE_SET'])
total_cs = len(data.CREDIBLE_SET.unique())- 1
avg_var_cs = float(num_var_cs) / total_cs
pip50 = sum(1 for i in data['PIP'] if i >0.5)
pip95 = sum(1 for i in data['PIP'] if i >0.95)

result = "Number of variants with PIP > 0.5: " + str(pip50) + "\n" + "Number of variants with PIP > 0.95: " + str(pip95) + "\n" \
    + "Number of variants that have credible sets: " + str(num_var_cs) + "\n" \
    + "Number of unique credible sets: " + str(total_cs) + "\n" \
    + "Average number of variants per credible set: " + str(avg_var_cs) 


with open('results.txt', 'a') as the_file:
    the_file.write(result)

with open('results.txt') as f:
    contents = f.readlines()
    print(contents)

['Number of variants with PIP > 0.5: 3\n', 'Number of variants with PIP > 0.95: 3\n', 'Number of variants that have credible sets: 3\n', 'Number of unique credible sets: 3\n', 'Average number of variants per credible set: 1.0']
