Skip to content

3. Functionally informed fine mapping with finemapper

Omer Weissbrod edited this page Mar 23, 2023 · 25 revisions

Below we explain how to use the prior causal probabilities estimated by PolyFun with SuSiE and FINEMAP. We recommend using the script finemapper.py, which saves many of the preprocessing steps requires to perform fine-mapping (e.g. handling allelic flips between the summary statistics and reference genotypes). Alternatively, you can run SuSiE or FINEMAP directly with the prior causal probabilities computed by PolyFun, as described below.

Using prior causal probabilities using the finemapper script

The script finemapper performs functionally-informed fine-mapping using methods like SuSiE or FINEMAP. It works seamlessly with PolyFun by taking input files created by munge_polyfun_sumstats.py or by PolyFun itself.

Specifically, the script takes two types of inputs:

  1. A summary statistics file with the following columns: SNP, CHR, BP, A1, A2, Z (Z-score), and (optionally) SNPVAR (per-SNP heritability, which is proportional to prior causal probability). A1 is the effect allele (i.e. the sign of Z should match the A1 allele). If the SNPVAR column is missing, you cannot perform functionally-informed fine-mapping.

  2. LD information, taken from one of three possible data sources:

    • A plink file with genotypes from a reference panel
    • A bgen file with genotypes from a reference panel
    • A pre-computed LD matrix. We have made summary LD information for 2,763 3Mb-long regions spanning the entire genome freely available from an AWS S3 bucket (thanks to the kind support of the AWS Open data program). The files can be downloaded from: s3://broad-alkesgroup-ukbb-ld/UKBB_LD/

If you provide a .bgen file, the script will compute an LD matrix using LDstore 2.0, which must be installed on your system. The script can also cache LD matrices on disk, which can save substantial time and effort when re-analyzing the same data multiple times with different configurations (which always happens).

To run finemapper with SuSiE, you need to install rpy2 and the SuSiE package on your system. To run it with FINEMAP, you need to install FINEMAP v1.4 on your system.

We will first show a few use examples and then describe all the command line arguments in detail.

Example 1: Fine-mapping with SuSiE, using genotypes from a plink file

mkdir -p LD_cache
mkdir -o output

python finemapper.py \
    --geno example_data/chr1 \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --n 383290 \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method susie \
    --max-num-causal 5 \
    --cache-dir LD_cache \
    --out output/finemap.1.46000001.49000001.gz

This command takes an input plink file (example_data/chr1.bed) and an input summary statistics file (corresponding to an analysis of n=383,290 individuals). The A1 allele is assumed to be the effect allele. The script performs fine-mapping in chromosome 1, in the locus spanning basepair positions 46000001-49000001, using SuSiE. It prints the output to the file output/finemap.1.46000001.49000001.gz, and saves the computed LD matrix in the directory LD_cache. The argument --max-num-causal 5 tells SuSiE to assume that there are exactly 5 causal SNPs in the locus (the argument name is general, but for SuSiE it specifies an exact rather than a max number). Here are the first few lines of the output (seen with cat output/finemap.1.46000001.49000001.gz | zcat | head):

CHR  SNP         BP        A1  A2  SNPVAR       Z             N       P            PIP          BETA_MEAN     BETA_SD      CREDIBLE_SET
1    rs2088102   46032974  T   C   1.70060e-06  1.25500e+01   383290  3.97510e-36  1.00000e+00  -2.03824e-02  1.64279e-03  1
1    rs7528714   47966058  G   A   1.18040e-06  5.14320e+00   383290  2.70098e-07  9.97861e-01  7.50858e-03   1.64286e-03  2
1    rs7528075   47870271  G   A   1.18040e-06  4.40160e+00   383290  1.07456e-05  9.73775e-01  -5.96171e-03  1.86070e-03  3
1    rs212968    48734666  G   A   1.70060e-06  -3.01130e+00  383290  2.60132e-03  3.93743e-01  -1.19187e-03  1.82430e-03  0
1    rs2622911   47837404  C   A   1.70060e-06  3.12520e+00   383290  1.77684e-03  3.83424e-01  1.14278e-03   1.78675e-03  0
1    rs3766196   47284526  C   A   6.93040e-06  -5.92360e-02  383290  9.52764e-01  2.56734e-01  4.62991e-06   6.60726e-04  0
1    rs4511165   48293181  G   A   1.70060e-06  -1.18940e+00  383290  2.34282e-01  1.64878e-01  3.06280e-04   8.96284e-04  0
1    rs12567716  48197570  T   C   1.18040e-06  2.14810e+00   383290  3.17058e-02  1.25214e-01  -2.40839e-04  8.08277e-04  0
1    rs4927234   48384796  G   A   1.70060e-06  -7.59600e-01  383290  4.47494e-01  1.21689e-01  1.80196e-04   6.76205e-04  0

Columns 1-9 describe the input summary statistics (and are based on data from the input files). The rows are sorted according to PIP in descending order. Columns 10-14 contain the following fields:

  1. PIP - posterior causal probability
  2. BETA_MEAN - posterior mean of causal effect size (in standardized genotype scale)
  3. BETA_SD - posterior standard deviation of causal effect size (in standardized genotype scale)
  4. CREDIBLE_SET - the index of the first (typically smallest) credible set that the SNP belongs to (0 means none).

Please note that posterior effect size estimates are provided in a per-standardized-genotype scale. To get estimates on a per-allele scale, please divide these estimates by sqrt(2*MAF*(1-MAF)).

Example 2: Fine-mapping with SuSiE, using pre-computed summary LD information from the UK Biobank

To run this example you will need to download summary LD information that we pre-computed using N=337K unrelated British-ancestry individuals from the UK Biobank (warning: This is a large download, requiring ~1GB of disk space). Ideally you should only do this when analyzing UK Biobank individuals, or closely-matched British-ancestry individuals.

#download an LD matrix
mkdir -p LD_cache
mkdir -o output
mkdir -p LD_temp
cd LD_temp
wget https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com/UKBB_LD/chr1_46000001_49000001.npz
wget https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com/UKBB_LD/chr1_46000001_49000001.gz
cd ..

#run fine-mapper
python finemapper.py \
    --ld LD_temp/chr1_46000001_49000001 \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --n 383290 \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method susie \
    --max-num-causal 5 \
    --out output/finemap.UKB.1.46000001.49000001.gz

Example 3: Fine-mapping with FINEMAP, using genotypes from a plink file

mkdir -p LD_cache
mkdir -o output

python finemapper.py \
    --geno example_data/chr1 \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --n 383290 \
    --chr 1 \
    --start 46000001 \
    --end 49000001 \
    --method finemap \
    --finemap-exe <path to FINEMAP v1.4 executable> \
    --max-num-causal 5 \
    --cache-dir LD_cache \
    --out output/finemap.finemap_exe.1.46000001.49000001.gz

Overview of all command like arguments of finemapper

We now describe the command-lime arguments of finemapper in detail:

  1. --geno - The name of a .bgen file, or the prefix of the name of a plink file (without the suffix .bed). finemapper will compute an LD matrix using the genotypes in this file. Warning: this file should ideally contain genotypes of the same individuals used to generate summary statistics, or at least very closely matched individuals. Using an external reference panel in fine-mapping is strongly discouraged and can lead to severe false-positive results (see Benner et al. 2017 AJHG, Ulirsch et al. 2019 Nat Genet for an investigation of this issue).

  2. --ld - Instead of --geno you can provide the prefix of the name of pre-computed LD summary information (e.g. LD that we pre-computed using N=337K unrelated British-ancestry individuals from the UK Biobank as in example 2 above, or cached LD from a previous run of finemapper.py). These matrices can be stored in two formats:

    1. Numpy sparse matrix format: For every such matrix you need two files: A .npz file with the actual LD matrix, and a .gz file with identities of the SNPs in the LD matrix. We have made LD matrices for 2,763 3Mb-long regions spanning the entire genome freely available to download from the following AWS S3 bucket (thanks to the kind support of the AWS Open Data Program): s3://broad-alkesgroup-ukbb-ld/UKBB_LD/.
    2. BCOR 1.1 format: This is the native format of LDstore 2.0, stored in a single .bcor file.
  3. --sumstats - The name of a summary statistics file, which must include the columns SNP, CHR, BP, A1, A2, Z (z-score). This file can also include a column called SNPVAR that specifies prior per-SNP heritability. If it exists (and unless requested otherwise), finemapper will use this column to perform functionally-informed fine-mapping. We recommend using the output files of PolyFun as input sumstats files for finemapper.

  4. --n - the sample size used to generate summary statistics. In case the summary statistics were computed with BOLT-LMM, we recommend specifying the effective sample size (this quantity is automatically computed by munge_polyfun_sumstats.py).

  5. --chr - the target chromosome to fine-map.

  6. --start, --end - the start and end positions of the target locus to finemap (base pair coordinates).

  7. --method - the fine-mapping method. We currently support susie and finemap

  8. --max-num-causal - the max number of causal SNPs that can be modeled (for FINEMAP) or the exact number (for SuSiE).

  9. --cache-dir - a directory that will cache LD matrices for future reuse. If not specified, LD matrices will be saved to a temp directory and deleted after the script terminates.

  10. --out - the name of the output file.

  11. --ldstore2 - the path of the LDstore 2.0 executable on your system.

  12. --finemap-exe - the path of the FINEMAP v1.4 executable on your system. Only required if you specify --method finemap.

  13. --non-funct - if specified, finemapper will perform non-functionally informed fine-mapping. In this case it does not require that the sumstats file has a column called SNPVAR.

  14. --verbose - if specified, finemapper and the tools it runs will print more detailed output messages.

  15. --hess - if specified, the prior causal effect size variance will be determined using a modified HESS procedure, as described in the PolyFun paper. Otherwise, the causal effect size variance will be estimated by SuSiE and/or FINEMAP.

  16. --sample-file - if you provide a bgen file for --geno, you must also provide a SNPTEST2 sample file, like for other tools that use bgen data. This is a simple text file without a header and a single column that contains individual ids. If --geno is a plink file, you do not need to provide this argument. Please see the LDstore website for more information and examples.

  17. --incl-samples - an optional text file without a header and with a single column that includes the ids of a subset of individuals to use in the LD matrix computation. This can be provided for both plink and bgen files. If not provided, all individuals will be used. Please see the LDstore website for more information and examples.

  18. --threads - the number of CPU threads that LDstore will use to compute LD matrices (if not specified, use the max number of available CPU cores).

  19. --memory - The maximum memory to allocate (in GB) when computing LD (default is 1GB).

  20. --verbose - This flag will show verbose output, including the full output of SuSiE/FINEMAP.

Genome-wide fine-mapping

We envision that PolyFun will primarily be used to fine-map genome-wide significant loci, because <10% of PIP>0.95 SNP-trait pairs in the analyses we performed in the PolyFun paper had P>5e-8, and because genome-wide fine-mapping is computationally challenging. However, we also provide support for genome-wide fine-mapping for completeness. Due to the computational challenges, we provide a script that creates commands that can be run in parallel on a computer cluster. It is up to the user to invoke these commands in their specific computer cluster environment.

Genome-wide fine-mapping consists of three steps:

  1. Create region-specific jobs
  2. Invoke the region-specific jobs
  3. Aggregate the results

Genome-wide fine-mapping step 1: Creating region-specific jobs

You can create region-specific jobs with the script create_finemappe_jobs.py. Here is an example command:

python create_finemapper_jobs.py \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --n 383290 \
    --method susie \
    --max-num-causal 5 \
    --out-prefix output/polyfun_all \
    --jobs-file output/polyfun_all_jobs.txt

This will create a file called output/polyfun_all_jobs.txt with fine-mapping commands. Here are the first 3 lines (which you can view using head -3 output/polyfun_all_jobs.txt):

python3 finemapper.py --chr 1 --start 1 --end 3000001 --out output/polyfun_all.chr1.1_3000001.gz --ld https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com/UKBB_LD/chr1_1_3000001 --method susie --sumstats example_data/chr1.finemap_sumstats.txt.gz --n 383290 --memory 1 --max-num-causal 5
python3 finemapper.py --chr 1 --start 1000001 --end 4000001 --out output/polyfun_all.chr1.1000001_4000001.gz --ld https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com/UKBB_LD/chr1_1000001_4000001 --method susie --sumstats example_data/chr1.finemap_sumstats.txt.gz --n 383290 --memory 1 --max-num-causal 5
python3 finemapper.py --chr 1 --start 2000001 --end 5000001 --out output/polyfun_all.chr1.2000001_5000001.gz --ld https://broad-alkesgroup-ukbb-ld.s3.amazonaws.com/UKBB_LD/chr1_2000001_5000001 --method susie --sumstats example_data/chr1.finemap_sumstats.txt.gz --n 383290 --memory 1 --max-num-causal 5

You can now submit each of these commands to your computer cluster.

Notes and explanations:

  1. The script accepts almost all command-line arguments that can be provided to finemapper.py, except for --start, --end, --ld.
  2. The argument --out-prefix specifies the prefix of the region-specific output files that will be created.
  3. The script automatically partitions the genome into 3Mb-long regions, and creates a fine-mapping job for each region.
  4. You can provide the argument --pvalue-cutoff <cutoff> (e.g. --pvalue-cutoff 5e-8). This will exclude regions in which all SNPs have p-values greater than this cutoff. You can use this argument to e.g. restrict the analysis to only genome-wide significant loci.
  5. You can provide a genotypes file from an LD-reference panel using --geno <genotypes file>. If you don't provide one, each command will automatically download the relevant region-specific UK Biobank LD matrices that we published online. Please note that this is only suitable for fine-mapping summary statistics based on N=337K unrelated British-ancestry UK Biobank individuals, as we did in the PolyFun paper.
  6. You can restrict to a single chromosome using --chr <chromoeome number>. This can be convenient if you have a chromosome-specific genotypes file, which you can provide using --geno.
  7. The script uses the command python3 as the python3 executable by default. You can change this using the argument --python3 <python3 executable>.

Genome-wide fine-mapping step 2: Invoking the region-specific jobs

In this step the user should submit each of the commands listed in the output file of create_finemapper_jobs.py (specified via the argument --jobs-file) to their computer cluster.

Genome-wide fine-mapping step 3: Aggregating the results

After all the region-specific jobs have been completed, you can aggregate the results with the script aggregate_finemapper_results.py. Here is an example command:

python aggregate_finemapper_results.py \
    --out-prefix output/polyfun_all \
    --sumstats example_data/chr1.finemap_sumstats.txt.gz \
    --out output/polyfun_agg.txt.gz

The script will aggregate all of the results of the region-speific jobs, keeping only one result for each SNP, based on the region in which it was most central. If you provided one of the arguments --chr, --pvalue-cutoff in step 1, you should provide it here as well.

Note: Since we merge the outputs of different fine-mapping jobs in different (possibly overlapping) regions, the credible sets are region-specific. To avoid confusion, the script adds the region name to the credible set, so that only credible sets coming from the same region are comparable (e.g. chr1:46000001-49000001:1 means that a SNP is in credible set 1, in the fine-mapping of chromosome 1 bp46M-49M. Only SNPs with the same exact credible set name are in the same credible set as this SNP).

Extracting annotations of top SNPs

We provide a script called extract_annotations.py which extracts the annotations of large-PIP SNPs. You can use it as follows:

python extract_annotations.py \
       --pips output/finemap.1.46000001.49000001.gz \
       --annot annotations/baselineLF2.2.UKB.1.annot.parquet \
       --pip-cutoff 0.95 \
       --out output/top_annot.1.46000001.49000001.txt.gz

You can view the first few columns of the output file via the command:

cat output/top_annot.1.46000001.49000001.txt.gz | zcat | cut -f1,2,3,4,5,6,7,8

The output should look like:

PIP         CHR  SNP        BP        A1  A2  Coding_UCSC_lowfreq  Coding_UCSC_common
1.0000e+00  1    rs2088102  46032974  T   C   0                    0
9.9786e-01  1    rs7528714  47966058  G   A   0                    0
9.7377e-01  1    rs7528075  47870271  G   A   0                    0

In this example we can see two annotation columns (Coding_UCSC_lowfreq, Coding_UCSC_common). The annotation value is 0 for the three SNPs with PIP>=0.95, indicating that none of them is coding. You can see details about all the annotations in Supplementary Table 1 of the PolyFun paper).

The script arguments are the following:

  1. --pips: An output file created by finemapper.py
  2. --annot: An annotations file (e.g., you can download the baseline-LF annotations as explained here).
  3. --pip-cutoff: The script will only extract annotations for SNPs with PIP greater than this cutoff (default is 0.95)
  4. --out: output file.

Using prior causal probabilities in SuSiE directly (without using the finemapper script)

All you have to do is provide SuSiE the flag prior_weights with per-SNP heritability estimates from PolyFun (i.e., the contents of the column SNPVAR).

Using prior causal probabilities in FINEMAP directly (without using the finemapper script)

To do this, please invoke FINEMAP v1.4 with the flag --prior-snps and add a new column to the .z file called prob with the prior causal probabilities. Please see the FINEMAP v1.4 documentation for more details.