# Data preparation

This notebook collects information about the lead variants of credible sets.
It adds:

- the major population, sample size, and case/control counts from studyIndex,
- the most severe consequence and the consequence score derived from VEP annotations in variantIndex,
- the MAF (Minor Allele Frequency) of each lead variant,
- the variance explained by each lead variant.
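
As a rough illustration of the last item, a common approximation for the variance explained by a single variant is 2 * MAF * (1 - MAF) * beta^2. It assumes an additive model and a trait standardised to unit variance. The sketch below is plain Python rather than Spark, and the helper name `variance_explained` is hypothetical. The notebook itself may use a different definition.

```python
from typing import Optional


def variance_explained(beta: Optional[float], maf: Optional[float]) -> Optional[float]:
    """Approximate variance explained by one variant.

    Assumption (not taken from the notebook): additive model, trait
    standardised to unit variance, so PVE ~= 2 * MAF * (1 - MAF) * beta^2.
    """
    if beta is None or maf is None:
        # Missing effect size or frequency: nothing can be computed.
        return None
    return 2.0 * maf * (1.0 - maf) * beta**2


if __name__ == "__main__":
    # Hypothetical lead variant with beta = 0.1 and MAF = 0.5.
    print(variance_explained(0.1, 0.5))
```

A variant with MAF 0.0 yields a variance explained of 0.0 under this approximation, which is one reason to check the zero-MAF cases separately.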


## Data extraction and loading

Three datasets from the Open Targets Platform 25.03 release are required. They are downloaded below with rsync:

- credible_set
- variant
- study


In [1]:
# Check the installed Java version (this notebook was run with Java 11)
!java -version


openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment Temurin-11.0.26+4 (build 11.0.26+4)
OpenJDK 64-Bit Server VM Temurin-11.0.26+4 (build 11.0.26+4, mixed mode)


In [2]:
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/credible_set ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/study ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/variant ../../data/.


receiving incremental file list
credible_set/
credible_set/_SUCCESS
credible_set/part-00000-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00001-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00002-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00003-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00004-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00005-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00006-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00007-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00008-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00009-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00010-38026948-016a-4eae-8a31-1e6699a8333a-c000.snappy.parquet
credible_set/part-00011-38026948-016a-4ea

### Loading the data with gentropy


In [26]:
from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex
from pyspark.sql import Column, Window
from pyspark.sql import functions as f


In [2]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
variant_index_path = "../../data/variant"
study_index_path = "../../data/study"
credible_set_path = "../../data/credible_set"


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/30 10:27:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
session.spark


In [4]:
vi = VariantIndex.from_parquet(session, variant_index_path)
si = StudyIndex.from_parquet(session, study_index_path)
cs = StudyLocus.from_parquet(session, credible_set_path)


In [5]:
vi.df.show(n=1)
si.df.show(n=1)
cs.df.show(n=1)


+--------------+----------+--------+---------------+---------------+--------------------+-----------------------+----------------------+-------------+---------------+--------------------+--------------------+--------------------+
|     variantId|chromosome|position|referenceAllele|alternateAllele|       variantEffect|mostSevereConsequenceId|transcriptConsequences|        rsIds|         hgvsId|   alleleFrequencies|             dbXrefs|  variantDescription|
+--------------+----------+--------+---------------+---------------+--------------------+-----------------------+----------------------+-------------+---------------+--------------------+--------------------+--------------------+
|17_4901571_G_A|        17| 4901571|              G|              A|[{VEP, synonymous...|             SO_0001819|  [{[SO_0001819], N...|[rs137952354]|17:g.4901571G>A|[{sas_adj, 1.0978...|[{rs137952354, en...|Synonymous varian...|
+--------------+----------+--------+---------------+---------------+------------

25/04/30 10:28:04 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


+--------------------+-----------+---------+--------------------+------------------------+-------------+------+---------------------+-----------+--------+----------------+----------------------+---------------+------------------+----------------------------------+--------------------+--------------------+------+---------+--------+---------+---------------------+-------------------+------------------+---------------+-------------+--------------------+-----------+---------+---------------+
|             studyId|  projectId|studyType|     traitFromSource|traitFromSourceMappedIds|   diseaseIds|geneId|biosampleFromSourceId|biosampleId|pubmedId|publicationTitle|publicationFirstAuthor|publicationDate|publicationJournal|backgroundTraitFromSourceMappedIds|backgroundDiseaseIds|   initialSampleSize|nCases|nControls|nSamples|  cohorts|ldPopulationStructure|   discoverySamples|replicationSamples|qualityControls|analysisFlags|summarystatsLocation|hasSumstats|condition|sumstatQCValues|
+-------------

## MAF dataset

The dataset below contains the lead variants from credible sets, annotated with:

- MAF
- the major population used to calculate the MAF
- trait information derived from the `traitFromSourceMappedIds` or `geneId` field, depending on the studyType in the study index
- allele frequencies from the variant index (gnomAD) for the major population found in `ldPopulationStructure` in the study index
- VEP score
- case and control counts from the study index
- variant association statistics from the study locus
- study type

The code below collects all fields required to analyse MAF and variant effects.
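
To make the MAF rule concrete before the Spark implementation, here is a minimal plain-Python sketch. It mirrors the logic of the notebook's `major_population_allele_freq` and `maf` functions on ordinary dictionaries. The helper name `major_population_maf` and the sample frequencies are hypothetical and only serve as illustration.

```python
from typing import Optional


def major_population_maf(allele_frequencies: Optional[list], major_population: str) -> Optional[float]:
    """Return the MAF for the major population, or None if it cannot be inferred."""
    # gnomAD population names carry an "_adj" suffix (e.g. "nfe_adj"),
    # so strip it before comparing with the study's major population.
    matches = [
        freq
        for freq in (allele_frequencies or [])
        if freq["populationName"].replace("_adj", "") == major_population
    ]
    # Exactly one matching population is required, otherwise MAF stays undefined.
    if len(matches) != 1:
        return None
    af = matches[0]["alleleFrequency"]
    # The minor allele is whichever allele has frequency <= 0.5.
    return 1.0 - af if af > 0.5 else af


if __name__ == "__main__":
    # Hypothetical allele frequencies for a single variant.
    freqs = [
        {"populationName": "nfe_adj", "alleleFrequency": 0.8},
        {"populationName": "afr_adj", "alleleFrequency": 0.1},
    ]
    print(major_population_maf(freqs, "nfe"))
    print(major_population_maf(freqs, "eas"))
```

Returning `None` rather than a default value keeps variants without a matching population distinguishable from variants whose frequency is truly 0.0.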


### Methods


In [7]:
def major_population_in_study(ld_col: Column, default_major_pop: str = "nfe") -> Column:
    """Extract the major population from the study ld population structure.

    Args:
        ld_col (Column): ld population structure field  array<struct<ldPopulation: string, relativeSampleSize: double>>
        default_major_pop (str, optional): population to use as default, when no population was reported. Defaults to "nfe".

    Returns:
        Column: struct<ldPopulation: string, relativeSampleSize: double> describing the major population

    """

    def reduce_pops(pop1: Column, pop2: Column) -> Column:
        """Reduce two populations based on relative sample size.

        This function takes two populations and reports one of them based on the following conditions:
        * Use pop with bigger relativeSampleSize
        * In case of a tie, the default_major_pop is preferred,
        * In case of tie and no default_major_pop in pop1 and pop2, use pop1.
        """
        return (
            f.when(pop1.getField("relativeSampleSize") > pop2.getField("relativeSampleSize"), pop1)
            .when(pop1.getField("relativeSampleSize") < pop2.getField("relativeSampleSize"), pop2)
            .when(
                (
                    (pop1.getField("relativeSampleSize") == pop2.getField("relativeSampleSize"))
                    & (pop1.getField("ldPopulation") == f.lit(default_major_pop))
                ),
                pop1,
            )
            .when(
                (
                    (pop1.getField("relativeSampleSize") == pop2.getField("relativeSampleSize"))
                    & (pop2.getField("ldPopulation") == f.lit(default_major_pop))
                ),
                pop2,
            )
            .otherwise(pop1)
        )

    fallback = f.struct(f.lit(default_major_pop).alias("ldPopulation"), f.lit(0.0).alias("relativeSampleSize"))

    return f.when(
        f.size(ld_col) > 0,
        f.reduce(
            ld_col,
            fallback,
            reduce_pops,
        ),
    ).otherwise(fallback)


def vep_variant_effect(c: Column) -> Column:
    """Extract VEP variant effect."""

    def extract_fields(ve: Column) -> Column:
        return f.struct(
            ve.getField("assessment").alias("assessment"),
            ve.getField("normalisedScore").alias("normalisedScore"),
            ve.getField("targetId").alias("targetId"),
        )

    return f.transform(f.filter(c, lambda ve: ve.getField("method") == f.lit("VEP")), extract_fields).getItem(0)


def major_population_allele_freq(major_pop: Column, allele_freq: Column) -> Column:
    """Extract major population from variant.alleleFrequencies."""
    return f.filter(
        allele_freq,
        lambda freq: f.replace(freq.getField("populationName"), f.lit("_adj"), f.lit(""))
        == major_pop.getField("ldPopulation"),
    )


def maf(variant_freq: Column) -> Column:
    """Calculate Minor Allele Frequency from variant frequency."""
    return (
        f.when(
            ((f.size(variant_freq) == 1) & (variant_freq.getItem(0).getField("alleleFrequency") > 0.5)),
            f.lit(1.0) - variant_freq.getItem(0).getField("alleleFrequency"),
        )
        .when(
            ((f.size(variant_freq) == 1) & (variant_freq.getItem(0).getField("alleleFrequency") <= 0.5)),
            variant_freq.getItem(0).getField("alleleFrequency"),
        )
        .otherwise(None)
    )


def extract_pip_from_locus(variant_col: Column, locus: Column) -> Column:
    """Extract the posterior probability of the lead variant from the locus.

    In the case the lead variant is not present in the locus, None is returned.
    """
    lead_variant_stats = f.filter(locus, lambda v: v.getField("variantId") == variant_col)
    return (
        f.when(
            f.size(lead_variant_stats) == 1,
            lead_variant_stats.getItem(0).getField("posteriorProbability"),
        )
        .otherwise(None)
        .alias("posteriorProbability")
    )


def create_maf_dataset(si: StudyIndex, cs: StudyLocus, vi: VariantIndex):
    """Create MAF dataset from StudyIndex, StudyLocus and VariantIndex.

    Args:
        si (StudyIndex): StudyIndex object
        cs (StudyLocus): StudyLocus object
        vi (VariantIndex): VariantIndex object

    Returns:
        DataFrame: MAF dataset

    """
    _cs = cs.df.select(
        f.col("studyId"),
        f.col("studyLocusId"),
        f.col("variantId"),
        f.col("beta"),
        f.col("zScore"),
        f.col("pValueMantissa"),
        f.col("pValueExponent"),
        f.col("standardError"),
        f.col("finemappingMethod"),
        f.col("studyType"),
        f.size("locus").alias("credibleSetSize"),
        f.col("isTransQtl"),
        extract_pip_from_locus(f.col("variantId"), f.col("locus")),
    )
    _si = si.df.select(
        f.col("studyId"),
        f.col("nSamples"),
        f.col("nControls"),
        f.col("nCases"),
        f.col("geneId"),  # for molqtl traits
        f.col("traitFromSourceMappedIds"),
        major_population_in_study(f.col("ldPopulationStructure"), "nfe").alias("majorPopulation"),
    )

    _vi = vi.df.select(
        f.col("variantId"),
        f.col("alleleFrequencies"),
        vep_variant_effect(f.col("variantEffect")).alias("vepEffect"),
    )

    return (
        _cs.join(_si, how="left", on="studyId")
        .join(_vi, how="left", on="variantId")
        .select(
            "*",
            major_population_allele_freq(
                f.col("majorPopulation"),
                f.col("alleleFrequencies"),
            ).alias("majorPopulationAF"),
        )
        .select(
            "*",
            maf(f.col("majorPopulationAF")).alias("majorPopulationMAF"),
        )
    )
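
The tie-breaking rules in `reduce_pops` are easier to follow on plain data. The sketch below re-implements the same fold with `functools.reduce` on Python dictionaries. It is an illustration only, not part of the Spark pipeline, and the helper name `pick_major_population` and the sample populations are hypothetical.

```python
from functools import reduce
from typing import Optional


def pick_major_population(ld_structure: Optional[list], default_major_pop: str = "nfe") -> dict:
    """Pick the population with the largest relativeSampleSize, preferring the default on ties."""
    fallback = {"ldPopulation": default_major_pop, "relativeSampleSize": 0.0}

    def reduce_pops(pop1: dict, pop2: dict) -> dict:
        size1, size2 = pop1["relativeSampleSize"], pop2["relativeSampleSize"]
        if size1 > size2:
            return pop1
        if size1 < size2:
            return pop2
        # Tie: prefer the default population if either side is the default.
        if pop1["ldPopulation"] == default_major_pop:
            return pop1
        if pop2["ldPopulation"] == default_major_pop:
            return pop2
        # Tie without the default population: keep the first one seen.
        return pop1

    if not ld_structure:
        # No population reported: fall back to the default.
        return fallback
    # The fold starts from the fallback, as in the Spark version.
    return reduce(reduce_pops, ld_structure, fallback)


if __name__ == "__main__":
    # Hypothetical study with a tie between two populations.
    tie = [
        {"ldPopulation": "afr", "relativeSampleSize": 0.5},
        {"ldPopulation": "nfe", "relativeSampleSize": 0.5},
    ]
    print(pick_major_population(tie)["ldPopulation"])
```

Starting the fold from the fallback struct means a study with an empty population list and a study whose populations all have size 0.0 both resolve to the default population.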


The MAF dataset consists of selected fields from the credible sets, the study index, and the variant index.


In [8]:
dataset_maf = create_maf_dataset(si, cs, vi)


Total number of lead variants is 2622098


Save the dataset for further analysis


In [10]:
dataset_maf.write.mode("overwrite").parquet("../../data/lead-maf-vep")


                                                                                

### Systematic finemapping

The goal is to see for how many lead variants from credible sets the MAF was calculated correctly.

Two failure cases are expected:

1. `majorPopulationMAF == 0.0`: the gnomAD allele frequencies for the population matching the study's `majorPopulation` did not capture the variant.
2. `majorPopulationMAF` could not be inferred: the variant was not found in gnomAD, so allele frequencies could not be retrieved.
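
The grouping can be sketched in plain Python as a three-way classification. The labels match the ones used in this notebook's Spark query, while the helper name `maf_inference_group` and the sample values are hypothetical.

```python
from collections import Counter
from typing import Optional


def maf_inference_group(maf: Optional[float]) -> str:
    """Assign a lead variant to one of three MAF inference groups."""
    if maf is None:
        # No usable allele frequency for the major population.
        return "Unable to infer MAF"
    if maf == 0.0:
        # A frequency entry exists, but the variant was not observed.
        return "MAF eq 0.0"
    return "able to infer MAF"


if __name__ == "__main__":
    # Hypothetical MAF values for five lead variants.
    mafs = [0.12, None, 0.0, 0.45, 0.03]
    counts = Counter(maf_inference_group(m) for m in mafs)
    total = sum(counts.values())
    for group, count in counts.items():
        print(group, count, round(count / total * 100, 2))
```

The null check comes first because comparing `None` with `0.0` would otherwise hide the missing-data case.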


In [7]:
dataset_maf = session.spark.read.parquet("../../data/lead-maf-vep")


Count how many lead variants have a MAF calculated.


In [6]:
cs = StudyLocus.from_parquet(session, credible_set_path).df


In [8]:
cs.count() == dataset_maf.count()


True
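
The equality check matters because `create_maf_dataset` uses left joins. A left join keeps every row of the left table, but it also adds rows when a key occurs more than once on the right side. The plain-Python sketch below illustrates that behaviour. The `left_join` helper and the sample rows are hypothetical and stand in for the Spark join.

```python
def left_join(left: list, right: list, key: str) -> list:
    """Left-join two lists of dicts on `key`, mimicking SQL left-join row semantics."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    joined = []
    for row in left:
        matches = index.get(row[key])
        if not matches:
            # Unmatched left rows are kept once, without right-hand fields.
            joined.append(dict(row))
        else:
            # Each matching right row produces one output row.
            joined.extend({**row, **match} for match in matches)
    return joined


if __name__ == "__main__":
    # Hypothetical credible sets and study rows.
    credible_sets = [{"studyId": "S1", "variantId": "1_100_A_G"}, {"studyId": "S2", "variantId": "2_200_C_T"}]
    unique_studies = [{"studyId": "S1", "nSamples": 1000}]
    duplicated_studies = unique_studies + [{"studyId": "S1", "nSamples": 2000}]
    print(len(left_join(credible_sets, unique_studies, "studyId")) == len(credible_sets))
    print(len(left_join(credible_sets, duplicated_studies, "studyId")) == len(credible_sets))
```

Equal counts before and after the joins therefore suggest that `studyId` and `variantId` are unique in the study and variant selections.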

In [None]:
if cs.count() == dataset_maf.count():
    w = Window.partitionBy()
    print("Counts are equal")
    print("Estimating the MAF distribution")
    dataset_maf.select(
        f.col("majorPopulationMAF"),
        f.col("variantId"),
        f.when(f.col("majorPopulationMAF").isNull(), f.lit("Unable to infer MAF"))
        .when(f.col("majorPopulationMAF") == 0.0, f.lit("MAF eq 0.0"))
        .otherwise("able to infer MAF")
        .alias("majorPopulationMAFInterenceGroup"),
    ).groupBy("majorPopulationMAFInterenceGroup").agg(
        f.count("*").alias("count"),
        f.collect_list("variantId").alias("majorPopulationMAFList"),
    ).withColumn("total", f.sum(f.col("count")).over(w)).withColumn(
        "%", f.round(f.col("count") / f.col("total") * 100, 2)
    ).withColumn("majorPopulationMAFList", f.slice(f.col("majorPopulationMAFList"), 1, 3)).show(truncate=False)
else:
    print("Counts are not equal")


Counts are equal
Estimating the MAF distribution


25/04/30 11:00:22 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+--------------------------------+-------+----------------------------------------------------------------+-------+-----+
|majorPopulationMAFInterenceGroup|count  |majorPopulationMAFList                                          |total  |%    |
+--------------------------------+-------+----------------------------------------------------------------+-------+-----+
|able to infer MAF               |2616656|[10_100009635_T_G, 10_100043680_A_T, 10_100055122_T_C]          |2622098|99.79|
|MAF eq 0.0                      |3800   |[10_10495814_C_T, 10_122254422_C_T, 10_129680683_G_A]           |2622098|0.14 |
|Unable to infer MAF             |1642   |[10_12914618_G_A, 10_62759811_GAACTTCTCATAA_G, 11_114030799_G_A]|2622098|0.06 |
+--------------------------------+-------+----------------------------------------------------------------+-------+-----+




In [None]:
vi.df.filter(f.col("variantId") == "10_12914618_G_A").show(truncate=False)


+---------------+----------+--------+---------------+---------------+-------------------------------------------------------------------------------------------------------------------------------+-----------------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

                                                                                