# Fine-mapping of Alzheimer's disease GWAS summary statistics using GentroPy

This notebook demonstrates fine-mapping of a GWAS Catalog study of Alzheimer's disease ([link to study](https://genetics.opentargets.org/Study/GCST90012877/associations)). The study is a good benchmark for fine-mapping: it has a relatively large number of SNPs and a very strong signal on chromosome 19 (APOE). Note that very strong signals are usually excluded from fine-mapping because the results can be unstable.

We also excluded the MHC region (6:28M-34M) from fine-mapping because of its very high variant density; it is fine-mapped separately at the end of the notebook with a smaller window.
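
As an illustration of the MHC exclusion, here is a minimal sketch that flags lead variants by coordinates. The helper `in_mhc` and its constants are hypothetical and not part of GentroPy; the interval bounds come from the text above, and the example positions are lead variants reported in the clumping output of this notebook.

```python
# Hypothetical helper (not a GentroPy function): flags variants inside
# the MHC interval quoted in the text (6:28M-34M).
MHC_CHROMOSOME = "6"
MHC_START = 28_000_000
MHC_END = 34_000_000


def in_mhc(chromosome, position):
    """Return True if the variant lies inside the MHC interval (inclusive bounds)."""
    return chromosome == MHC_CHROMOSOME and MHC_START <= position <= MHC_END


# (chromosome, position) of three lead variants from the clumping output.
leads = [("1", 161185602), ("19", 44892009), ("6", 32592248)]

# Keep only the leads that fall outside the MHC interval.
kept = [(chrom, pos) for chrom, pos in leads if not in_mhc(chrom, pos)]
print(kept)
```

The notebook itself skips the MHC locus by its row index; a coordinate-based filter like this one is an alternative that does not depend on row order.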

To run this notebook on a local machine (rather than Dataproc), first install the GCS connector: https://github.com/broadinstitute/install-gcs-connector.

## Initialization

In [1]:
!gcloud auth application-default login



Credentials saved to file: [/Users/yt4/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Quota project "open-targets-genetics-dev" was added to ADC which can be used by Google client libraries for billing and quota. Note that some services may still bill the project owning

In [1]:
import os

import hail as hl
import pandas as pd
import pyspark.sql.functions as f

from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex
from gentropy.dataset.summary_statistics import SummaryStatistics
from gentropy.method.window_based_clumping import WindowBasedClumping
from gentropy.susie_finemapper import SusieFineMapperStep

pd.set_option("display.max_colwidth", None)
pd.set_option("display.expand_frame_repr", False)

hail_dir = os.path.dirname(hl.__file__)
session = Session(
    hail_home=hail_dir,
    start_hail=True,
    extended_spark_conf={
        "spark.driver.memory": "12g",
        "spark.kryoserializer.buffer.max": "500m",
        "spark.driver.maxResultSize": "3g",
    },
)

24/04/09 10:40:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
pip-installed Hail requires additional configuration options in Spark referring
  to the path to the Hail Python module directory HAIL_DIR,
  e.g. /path/to/python/site-packages/hail:
    spark.jars=HAIL_DIR/backend/hail-all-spark.jar
    spark.driver.extraClassPath=HAIL_DIR/backend/hail-all-spark.jar
    spark.executor.extraClassPath=./hail-all-spark.jar
Running on Apache Spark version 3.3.4
SparkUI available at http://192.168.0.232:4040
Welcome to
     __  __     <>__
    / /_/ /__  __/ /
   / __  / _ `/ / /
  /_/ /_/\_,_/_/_/   version 0.2.127-bb535cd096c5
LOGGING: writing to /dev/null


## Loading the data and clumping

In [2]:
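path_gwas1 = "gs://gwas_catalog_data/harmonised_summary_statistics/GCST90012877.parquet"
path_si = "gs://gwas_catalog_data/study_index"

# Load the harmonised summary statistics and the study index
gwas1 = SummaryStatistics.from_parquet(session, path_gwas1)
study_index = StudyIndex.from_parquet(session, path_si)
print("Number of SNPs in GWAS: ", gwas1._df.count())

# Window-based clumping with a significance threshold of 5e-8 and a distance of 1e6 bp
slt = WindowBasedClumping.clump(gwas1, gwas_significance=5e-8, distance=1e6)
slt_df = slt._df
print("Number of clumps: ", slt_df.count())
slt_df.show()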


                                                                                

Number of SNPs in GWAS:  10607272



Number of clumps:  33


                                                                                



+------------+----------------+----------+---------+----------------+----------+--------------+--------------+-------------------------------+----------------+--------------------+---------------+
|     studyId|       variantId|chromosome| position|            beta|sampleSize|pValueMantissa|pValueExponent|effectAlleleFrequencyFromSource|   standardError|        studyLocusId|qualityControls|
+------------+----------------+----------+---------+----------------+----------+--------------+--------------+-------------------------------+----------------+--------------------+---------------+
|GCST90012877| 1_161185602_G_A|         1|161185602| 0.0609052805639|      null|         4.302|            -8|                        0.23499| 0.0111181765833| 6360456299763482946|             []|
|GCST90012877| 1_207577223_T_C|         1|207577223| -0.122752564739|      null|         1.403|           -23|                       0.822818| 0.0122652043685|-6742466305250328444|             []|
|GCST90012877| 
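
To illustrate the idea behind distance-based clumping, here is a minimal sketch in plain Python. It is not the GentroPy implementation of `WindowBasedClumping.clump`; the `Variant` type, the `window_clump` function, and the toy variants are all hypothetical. It keeps the most significant variant as a lead and drops any other significant variant on the same chromosome within `distance` base pairs of an already selected lead.

```python
import math
from typing import NamedTuple


class Variant(NamedTuple):
    chromosome: str
    position: int
    neg_log10_p: float


def window_clump(variants, significance=5e-8, distance=1_000_000):
    """Greedy distance-based clumping (illustrative sketch only)."""
    threshold = -math.log10(significance)
    significant = [v for v in variants if v.neg_log10_p >= threshold]
    leads = []
    # Visit variants from most to least significant.
    for v in sorted(significant, key=lambda v: v.neg_log10_p, reverse=True):
        # Keep the variant only if it is far from every lead selected so far.
        if all(
            v.chromosome != lead.chromosome
            or abs(v.position - lead.position) > distance
            for lead in leads
        ):
            leads.append(v)
    return leads


# Toy input (made-up values, not taken from the study).
toy = [
    Variant("1", 100, 9.0),
    Variant("1", 500_000, 8.0),    # within 1 Mb of the first lead
    Variant("1", 2_000_000, 7.5),  # far enough to become a second lead
    Variant("2", 100, 6.0),        # below the significance threshold
]
leads = window_clump(toy)
print([(v.chromosome, v.position) for v in leads])
```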

                                                                                

## Fine-mapping without outlier detection or imputation, using a 2 Mb window

In [4]:
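# Index the clumped loci so that each one can be selected in turn.
# Note: monotonically_increasing_id() yields consecutive values only within a single partition.
df = slt_df.withColumn("row_index", f.monotonically_increasing_id())

# "eleapsed_time" is spelled as in the log returned by the fine-mapping step
columns = ["N_gwas", "N_ld", "N_overlap", "N_outliers", "N_imputed", "N_final_to_fm", "eleapsed_time"]
logs = pd.DataFrame(columns=columns)

for i in range(df.count()):
    # Row 27 is the MHC locus (6_32592248_A_G); it is fine-mapped separately below
    if i != 27:
        one_row = df.filter(df.row_index == i).first()

        res = SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
            GWAS=gwas1,
            session=session,
            study_locus_row=one_row,
            study_index=study_index,
            window=2_000_000,
            L=10,
            susie_est_tausq=False,
            run_carma=False,
            run_sumstat_imputation=False,
            carma_time_limit=600,
            imputed_r2_threshold=0.8,
            ld_score_threshold=4,
        )

        sl = res["study_locus"]
        # print(sl._df.withColumn("size", f.size(sl._df["locus"])).show())
        # print(res["log"])
        # Append the per-locus log to the running table
        logs = pd.concat([logs, res["log"]])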

                                                                                

2024-04-09 10:41:57.354 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'freq_index_dict' -> 'freq_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
    'rf' -> 'rf_1'
    'age_index_dict' -> 'age_index_dict_1'
    'freq_meta' -> 'freq_meta_1'
    'age_distribution' -> 'age_distribution_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
2024-04-09 10:42:15.499 Hail: INFO: Coerced sorted dataset
2024-04-09 10:42:28.284 Hail: INFO: Coerced sorted dataset
2024-04-09 10:44:41.305 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'freq_index_dict' -> 'freq_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
    'rf' -> 'rf_1'
    'age_index_dict' -> 'age_index_dict_1'
    'freq_meta' -> 'freq_meta_1'
    'age_distribution' -> 'age_distribution_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
2024-04-09 10:44:51.854 Hail: INFO: Coerced sorted dataset
2024-04-09 10:45:03.059 Hail: INFO:

Region:  1:160185602-162185602 ; number of CSs:  1 ; log:



The behavior of DataFrame concatenation with empty or all-NA entries is deprecated. In a future version, this will no longer exclude empty or all-NA columns when determining the result dtypes. To retain the old behavior, exclude the relevant entries before the concat operation.

                                                                                

Region:  1:206577223-208577223 ; number of CSs:  1 ; log:


                                                                                

Region:  10:10678309-12678309 ; number of CSs:  1 ; log:


                                                                                

Region:  10:58886075-60886075 ; number of CSs:  1 ; log:


                                                                                

Region:  10:79520381-81520381 ; number of CSs:  1 ; log:


                                                                                

Region:  11:120564878-122564878 ; number of CSs:  2 ; log:


                                                                                

Region:  11:46370397-48370397 ; number of CSs:  6 ; log:


                                                                                

Region:  11:59328267-61328267 ; number of CSs:  1 ; log:


                                                                                

Region:  11:85156833-87156833 ; number of CSs:  1 ; log:


                                                                                

Region:  14:51924962-53924962 ; number of CSs:  1 ; log:


                                                                                

Region:  14:91472511-93472511 ; number of CSs:  2 ; log:


                                                                                

Region:  15:49707194-51707194 ; number of CSs:  1 ; log:


                                                                                

Region:  15:57730416-59730416 ; number of CSs:  1 ; log:


                                                                                

Region:  15:62277703-64277703 ; number of CSs:  1 ; log:


                                                                                

Region:  16:30115000-32115000 ; number of CSs:  1 ; log:


                                                                                

Region:  17:4229833-6229833 ; number of CSs:  1 ; log:


                                                                                

Region:  17:62483402-64483402 ; number of CSs:  1 ; log:


                                                                                

Region:  19:50875-2050875 ; number of CSs:  2 ; log:


                                                                                

Region:  19:43892009-45892009 ; number of CSs:  10 ; log:


                                                                                

Region:  19:50224706-52224706 ; number of CSs:  1 ; log:


                                                                                

Region:  2:104749599-106749599 ; number of CSs:  1 ; log:


                                                                                

Region:  2:126135234-128135234 ; number of CSs:  2 ; log:


                                                                                

Region:  2:232117202-234117202 ; number of CSs:  1 ; log:


                                                                                

Region:  2:64381229-66381229 ; number of CSs:  1 ; log:


                                                                                

Region:  20:55423488-57423488 ; number of CSs:  1 ; log:




Region:  21:25775872-27775872 ; number of CSs:  2 ; log:


                                                                                

Region:  4:10025995-12025995 ; number of CSs:  1 ; log:


                                                                                

Region:  6:39974457-41974457 ; number of CSs:  2 ; log:


                                                                                

Region:  6:46627419-48627419 ; number of CSs:  1 ; log:




Region:  7:142410495-144410495 ; number of CSs:  2 ; log:


                                                                                

Region:  7:99374211-101374211 ; number of CSs:  1 ; log:



Region:  8:26610986-28610986 ; number of CSs:  3 ; log:
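
The region strings printed above are consistent with a window centred on the lead variant, extending half of `window` to each side. A minimal sketch of that arithmetic follows; `locus_window` is a hypothetical helper written for illustration and is not a GentroPy function.

```python
def locus_window(chromosome, position, window):
    """Format a region of total width `window` centred on `position`."""
    half = window // 2
    return f"{chromosome}:{position - half}-{position + half}"


# Lead variant 1_161185602_G_A with the 2 Mb window used in the loop above.
print(locus_window("1", 161185602, 2_000_000))
# MHC lead variant 6_32592248_A_G with the 1 Mb window used later in the notebook.
print(locus_window("6", 32592248, 1_000_000))
```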


                                                                                

In [9]:
pd.set_option("display.max_rows", None)
# Display the per-locus fine-mapping logs
logs

  N_gwas   N_ld N_overlap N_outliers N_imputed N_final_to_fm  eleapsed_time
0   7120  10431      6456          0         0          6456      56.839336
0   7128   8657      5769          0         0          5769      46.149004
0   9203  12106      7930          0         0          7930      93.531924
0   8351  10014      6995          0         0          6995      74.174323
0   9388  12551      8337          0         0          8337     120.602071
0   6560   8729      5758          0         0          5758      45.064894
0   5005   7701      3954          0         0          3954      55.229344
0   7012   8940      5815          0         0          5815      38.824251
0   8661  10303      7291          0         0          7291      68.802810
0   8081   9966      6771          0         0          6771      64.327746
0   8375  11213      7467          0         0          7467     141.808555
0   7377   9622      6369          0         0          6369      51.198955
0   8181  10

In [11]:
# Mean N_overlap across the 32 fine-mapped loci (33 clumps minus the skipped MHC locus)
summary = logs["N_overlap"].mean()
summary

6653.3125
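
Beyond the mean, the log columns can be combined into simple per-locus diagnostics. The sketch below computes the fraction of GWAS variants retained after the overlap step, using the first three rows of the log table shown above. It assumes that `N_overlap` counts the variants present in both the GWAS window and the LD panel, which the notebook does not state explicitly.

```python
# First three rows of the log table above (N_gwas, N_ld, N_overlap).
rows = [
    {"N_gwas": 7120, "N_ld": 10431, "N_overlap": 6456},
    {"N_gwas": 7128, "N_ld": 8657, "N_overlap": 5769},
    {"N_gwas": 9203, "N_ld": 12106, "N_overlap": 7930},
]

# Fraction of GWAS variants in each window that remain after the overlap step
# (interpretation of N_overlap is an assumption, see the text above).
fractions = [row["N_overlap"] / row["N_gwas"] for row in rows]

for row, fraction in zip(rows, fractions):
    print(f"N_gwas={row['N_gwas']}, N_overlap={row['N_overlap']}, retained={fraction:.3f}")
```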


## Fine-mapping of the APOE locus

In [4]:
df = slt_df.withColumn("row_index", f.monotonically_increasing_id())
one_row = df.filter(df.row_index == 18).first()
one_row

                                                                                

Row(studyId='GCST90012877', variantId='19_44892009_G_A', chromosome='19', position=44892009, beta=0.352722374032, sampleSize=None, pValueMantissa=1.9950000047683716, pValueExponent=-277, effectAlleleFrequencyFromSource=0.6050670146942139, standardError=0.00991069396551, studyLocusId=6814727764900576662, qualityControls=[], row_index=18)
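
To give a sense of how strong this signal is, the sketch below derives the z-score and the negative log10 p-value from the fields of the row above. Working with the mantissa and exponent separately avoids floating-point underflow for very small p-values. The variable names are illustrative; only the numeric values come from the row.

```python
import math

# Values copied from the APOE lead variant row above.
beta = 0.352722374032
standard_error = 0.00991069396551
p_value_mantissa = 1.995
p_value_exponent = -277

# z-score of the association.
z_score = beta / standard_error

# -log10(p) computed from mantissa and exponent without forming p itself.
neg_log10_p = -(math.log10(p_value_mantissa) + p_value_exponent)

print(f"z = {z_score:.2f}, -log10(p) = {neg_log10_p:.2f}")
```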

### Without CARMA, without imputation

In [5]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 2_000_000,
    L=10,
    susie_est_tausq=False,
    run_carma=False,
    run_sumstat_imputation=False,
    carma_time_limit=1000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
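sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()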

                                                                                

2024-04-08 21:34:03.208 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'freq_index_dict' -> 'freq_index_dict_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
    'age_distribution' -> 'age_distribution_1'
    'freq_meta' -> 'freq_meta_1'
    'rf' -> 'rf_1'
    'age_index_dict' -> 'age_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
2024-04-08 21:34:19.253 Hail: INFO: Coerced sorted dataset
2024-04-08 21:34:34.941 Hail: INFO: Coerced sorted dataset
2024-04-08 21:37:16.576 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'freq_index_dict' -> 'freq_index_dict_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
    'age_distribution' -> 'age_distribution_1'
    'freq_meta' -> 'freq_meta_1'
    'rf' -> 'rf_1'
    'age_index_dict' -> 'age_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
2024-04-08 21:37:28.867 Hail: INFO: Coerced sorted dataset
2024-04-08 21:37:44.733 Hail: INFO:

+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|              region|credibleSetIndex|               locus|      variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|-6417720984991662128|GCST90012877|19:43892009-45892009|               1|[{19_44908684_T_C...|19_44908684_T_C|        19|44908684|        SuSiE-inf| 2135.710824756712|   1|
|-1158278093713046158|GCST90012877|19:43892009-45892009|               2|[{19_44921094_A_T...|19_44921094_A_T|        19|44921094|        SuSiE-inf| 955.4948390766739|   1|
| 8324745608044585165|GCST90012877|19:43892009-45892009|               3|[{19_44917947_C_T...|19_44917947_C_T|        19|44917947|     
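
The `size` column counts the variants in each credible set. As a reminder of the general idea, the sketch below builds a credible set by adding variants in decreasing order of posterior probability until a target coverage is reached. This is a generic illustration, not the SuSiE or GentroPy implementation; the function `credible_set`, the 95% coverage, and the toy probabilities are assumptions.

```python
def credible_set(posteriors, coverage=0.95):
    """Return the smallest set of variants whose summed posterior reaches `coverage`.

    `posteriors` maps variant IDs to the posterior probability of being the
    causal variant for a single signal.
    """
    selected = []
    cumulative = 0.0
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    for variant, probability in ranked:
        selected.append(variant)
        cumulative += probability
        if cumulative >= coverage:
            break
    return selected


# Toy posteriors (made-up values): one dominant variant gives a set of size 1.
sharp = {"v1": 0.97, "v2": 0.02, "v3": 0.01}
# A more diffuse signal needs several variants to reach the coverage.
diffuse = {"a": 0.6, "b": 0.3, "c": 0.08, "d": 0.02}

print(credible_set(sharp))
print(credible_set(diffuse))
```

A credible set of size 1, as in the rows above, corresponds to the first case: a single variant carries nearly all of the posterior probability for that signal.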

### With CARMA, without imputation

In [6]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 2_000_000,
    L=10,
    susie_est_tausq=False,
    run_carma=True,
    run_sumstat_imputation=False,
    carma_time_limit=1000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
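sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()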

                                                                                

+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|              region|credibleSetIndex|               locus|      variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|-6417720984991662128|GCST90012877|19:43892009-45892009|               1|[{19_44908684_T_C...|19_44908684_T_C|        19|44908684|        SuSiE-inf|1995.6574121818223|   1|
|-1158278093713046158|GCST90012877|19:43892009-45892009|               2|[{19_44921094_A_T...|19_44921094_A_T|        19|44921094|        SuSiE-inf| 721.2637360279233|   1|
| 7760477027903907683|GCST90012877|19:43892009-45892009|               3|[{19_44911142_C_A...|19_44911142_C_A|        19|44911142|     

### Without CARMA, with imputation

In [5]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 2_000_000,
    L=10,
    susie_est_tausq=False,
    run_carma=False,
    run_sumstat_imputation=True,
    carma_time_limit=10000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
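sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()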

                                                                                

2024-04-08 22:25:15.739 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'age_index_dict' -> 'age_index_dict_1'
    'freq_index_dict' -> 'freq_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
    'freq_meta' -> 'freq_meta_1'
    'rf' -> 'rf_1'
    'age_distribution' -> 'age_distribution_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
2024-04-08 22:25:30.625 Hail: INFO: Coerced sorted dataset
2024-04-08 22:25:46.020 Hail: INFO: Coerced sorted dataset
2024-04-08 22:32:35.094 Hail: INFO: Table.join: renamed the following fields on the right to avoid name conflicts:
    'age_index_dict' -> 'age_index_dict_1'
    'freq_index_dict' -> 'freq_index_dict_1'
    'faf_index_dict' -> 'faf_index_dict_1'
    'freq_meta' -> 'freq_meta_1'
    'rf' -> 'rf_1'
    'age_distribution' -> 'age_distribution_1'
    'popmax_index_dict' -> 'popmax_index_dict_1'
2024-04-08 22:32:47.616 Hail: INFO: Coerced sorted dataset
2024-04-08 22:33:02.484 Hail: INFO:

+--------------------+------------+--------------------+----------------+--------------------+-----------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|              region|credibleSetIndex|               locus|        variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+--------------------+----------------+--------------------+-----------------+----------+--------+-----------------+------------------+----+
|-1350283509846281677|GCST90012877|19:43892009-45892009|               1|[{19_44909967_TGG...|19_44909967_TGG_T|        19|44909967|        SuSiE-inf| 2310.665662473933|   1|
|-1158278093713046158|GCST90012877|19:43892009-45892009|               2|[{19_44921094_A_T...|  19_44921094_A_T|        19|44921094|        SuSiE-inf| 903.6138342773536|   1|
| 8324745608044585165|GCST90012877|19:43892009-45892009|               3|[{19_44917947_C_T...|  19_44917947_C_T|        19|44

### With CARMA, with imputation

In [6]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 2_000_000,
    L=10,
    susie_est_tausq=False,
    run_carma=True,
    run_sumstat_imputation=True,
    carma_time_limit=10000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
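sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()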

                                                                                

+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|              region|credibleSetIndex|               locus|      variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
| 3030414938485808431|GCST90012877|19:43892009-45892009|               1|[{19_44895007_C_T...|19_44895007_C_T|        19|44895007|        SuSiE-inf|2680.9099711333456|   1|
|-2201142982564351776|GCST90012877|19:43892009-45892009|               2|[{19_44900601_A_G...|19_44900601_A_G|        19|44900601|        SuSiE-inf| 2103.873956796136|   1|
|-6417720984991662128|GCST90012877|19:43892009-45892009|               3|[{19_44908684_T_C...|19_44908684_T_C|        19|44908684|     

### With CARMA, with imputation, with estimation of infinitesimal effects (susie_est_tausq=True)

In [7]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 2_000_000,
    L=10,
    susie_est_tausq=True,
    run_carma=True,
    run_sumstat_imputation=True,
    carma_time_limit=10000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
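sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()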

                                                                                

+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|              region|credibleSetIndex|               locus|      variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+--------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|-6417720984991662128|GCST90012877|19:43892009-45892009|               1|[{19_44908684_T_C...|19_44908684_T_C|        19|44908684|        SuSiE-inf| 1105.297844890198|   1|
|-1158278093713046158|GCST90012877|19:43892009-45892009|               2|[{19_44921094_A_T...|19_44921094_A_T|        19|44921094|        SuSiE-inf|1042.0949995382389|   1|
|-2201142982564351776|GCST90012877|19:43892009-45892009|               3|[{19_44900601_A_G...|19_44900601_A_G|        19|44900601|     
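
One simple way to compare the five APOE runs is to count how often each lead variant appears among the top credible sets. The sketch below uses only the first three credible sets of each run, because the tables above are truncated after three rows; the dictionary labels are shorthand for the section titles, and the comparison is illustrative rather than a formal stability measure.

```python
from collections import Counter

# Lead variants of the first three credible sets in each run (from the tables above).
top_credible_set_leads = {
    "no CARMA, no imputation": ["19_44908684_T_C", "19_44921094_A_T", "19_44917947_C_T"],
    "CARMA, no imputation": ["19_44908684_T_C", "19_44921094_A_T", "19_44911142_C_A"],
    "no CARMA, imputation": ["19_44909967_TGG_T", "19_44921094_A_T", "19_44917947_C_T"],
    "CARMA, imputation": ["19_44895007_C_T", "19_44900601_A_G", "19_44908684_T_C"],
    "CARMA, imputation, tausq": ["19_44908684_T_C", "19_44921094_A_T", "19_44900601_A_G"],
}

# Number of runs in which each variant appears among the top three credible sets.
counts = Counter(
    variant for leads in top_credible_set_leads.values() for variant in leads
)

# Variants that appear in every run.
shared = [v for v, n in counts.items() if n == len(top_credible_set_leads)]

for variant, n in counts.most_common():
    print(f"{variant}: {n} of {len(top_credible_set_leads)} runs")
print("Present in all runs:", shared)
```

Within these truncated tables, no lead variant appears in all five runs, which is in line with the instability of very strong signals noted in the introduction.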

## Fine-mapping of the MHC region using a 1 Mb window

In [7]:
df = slt_df.withColumn("row_index", f.monotonically_increasing_id())
one_row = df.filter(df.row_index == 27).first()
one_row

                                                                                

Row(studyId='GCST90012877', variantId='6_32592248_A_G', chromosome='6', position=32592248, beta=-0.103604380043, sampleSize=None, pValueMantissa=2.877000093460083, pValueExponent=-15, effectAlleleFrequencyFromSource=0.21086899936199188, standardError=0.0131209374957, studyLocusId=5718491981995302674, qualityControls=[], row_index=27)

In [8]:
res=SusieFineMapperStep.susie_finemapper_one_studylocus_row_v2_dev(
    GWAS=gwas1,
    session=session,
    study_locus_row=one_row,
    study_index=study_index,
    window= 1_000_000,
    L=10,
    susie_est_tausq=False,
    run_carma=False,
    run_sumstat_imputation=False,
    carma_time_limit=10000,
    imputed_r2_threshold=0.8,
    ld_score_threshold=4
)
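sl = res["study_locus"]
# Show the credible sets together with the number of variants in each
sl._df.withColumn("size", f.size(sl._df["locus"])).show()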



+--------------------+------------+-------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|        studyLocusId|     studyId|             region|credibleSetIndex|               locus|      variantId|chromosome|position|finemappingMethod|credibleSetlog10BF|size|
+--------------------+------------+-------------------+----------------+--------------------+---------------+----------+--------+-----------------+------------------+----+
|-3446214959021623473|GCST90012877|6:32092248-33092248|               1|[{6_32557997_G_A,...| 6_32557997_G_A|         6|32557997|        SuSiE-inf| 4323.908142062261|   1|
| -439738150050389281|GCST90012877|6:32092248-33092248|               2|[{6_32558002_G_T,...| 6_32558002_G_T|         6|32558002|        SuSiE-inf|3428.8321277074765|   1|
| 5831857384024844796|GCST90012877|6:32092248-33092248|               3|[{6_32557987_C_A,...| 6_32557987_C_A|         6|32557987|        SuS