# Lead Variant Effect dataset preparation

This notebook collects information about the effect of credible set lead variants.
**This includes**:

- Addition of **major population sample size** and **number of cases/controls** from _studyIndex_
- Addition of **VEP consequence score** derived annotations from _variantIndex_
- Addition of **study-specific major ancestry variant AF** (allele frequency)<a name="out of sample AF"></a>[<sup>[1]</sup>](#cite_note-1) annotations from _variantIndex_
- Calculation of **MAF (Minor Allele Frequency)** based on AF of the **credible set lead variants** derived from _studyLocus_
- Calculation of **Variance Explained by lead variant**
- Calculation of the **Rescaled estimated effect sizes** based on the trait class (dichotomous or continuous) and the MAF of the lead variant.

<a name="cite_note-1"></a>1. [^](#cite_ref-1) AF is derived from gnomAD v4.1 allele frequencies from the joint exome and WGS datasets.


## Data extraction and loading

<div class="alert alert-block alert-info"> 
    <b style="font-size: 1.2em">Downloading datasets</b><br><br>
    <b>The analysis can be performed on either of the following releases:</b>
    <ul>
        <li>2025.03 release (rsync from the EBI FTP server)</li>
        <li>2025.06 release (rsync from Google Cloud Storage)</li>
    </ul>
    <I>Run this code chunk only once to download the relevant datasets.</I>
</div>

The analysis requires data from three datasets:

- credible_set
- variant
- study


In [1]:
# Ensure a compatible Java version is installed (the run below used Java 11)
!java -version


openjdk version "11.0.26" 2025-01-21
OpenJDK Runtime Environment Temurin-11.0.26+4 (build 11.0.26+4)
OpenJDK 64-Bit Server VM Temurin-11.0.26+4 (build 11.0.26+4, mixed mode)


In [None]:
# Download the release data from the Open Targets Platform 25.03 release
# !rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/credible_set ../../data/.
# !rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/study ../../data/.
# !rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/variant ../../data/.

# Download the release data from the Open Targets Platform 25.06 release
!gcloud storage rsync --recursive --delete-unmatched-destination-objects gs://open-targets-pipeline-runs/szsz/25.06-testrun-4/output/credible_set ../../data/credible_set
!gcloud storage rsync --recursive --delete-unmatched-destination-objects gs://open-targets-pipeline-runs/szsz/25.06-testrun-4/output/study ../../data/study
!gcloud storage rsync --recursive --delete-unmatched-destination-objects gs://open-targets-pipeline-runs/szsz/25.06-testrun-4/output/variant ../../data/variant


## Session setup

- Create the sparkSession
- Set all input/output paths


In [3]:
from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex
from pyspark.sql import functions as f


In [4]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
variant_index_path = "../../data/variant"
study_index_path = "../../data/study"
credible_set_path = "../../data/credible_set"
output_dataset_path = "../../data/lead-maf-vep"


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/18 22:15:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [5]:
session.spark


## Building temporary dataset

The temporary dataset needs to be built from the _studyIndex_, _studyLocus_ and _variantIndex_ datasets.


In [6]:
vi = VariantIndex.from_parquet(session, variant_index_path)
si = StudyIndex.from_parquet(session, study_index_path)
cs = StudyLocus.from_parquet(session, credible_set_path)


_cs = cs.df.select(
    f.col("studyId"),
    f.col("studyLocusId"),
    f.col("variantId"),
    f.col("beta"),
    f.col("zScore"),
    f.col("pValueMantissa"),
    f.col("pValueExponent"),
    f.col("standardError"),
    f.col("finemappingMethod"),
    f.col("studyType"),
    f.size("locus").alias("credibleSetSize"),
    f.col("isTransQtl"),
)
_si = si.df.select(
    f.col("studyId"),
    f.col("nSamples"),
    f.col("nControls"),
    f.col("nCases"),
    f.col("geneId"),  # for molqtl traits
    f.col("traitFromSourceMappedIds"),
    f.col("ldPopulationStructure"),
    f.col("traitFromSource"),
)

_vi = vi.df.select(
    f.col("variantId"),
    f.col("alleleFrequencies"),
    f.col("variantEffect"),
)

dataset = _cs.join(_si, how="left", on="studyId").join(_vi, how="left", on="variantId")

dataset.show(5, truncate=False)


                                                                                


## MAF Calculation

To add the MAF (Minor Allele Frequency) to the dataset we need to extract the major ancestry from _studyIndex_ and use it to extract the relevant allele frequency from the _variantIndex_ dataset.

The MAF is calculated as follows:

<ol>
    <li>Extract the major ancestry from the <code>studyIndex</code> dataset.</li>
    <ol>
        <li>If multiple ancestries share the highest <code>relativeSampleSize</code> and one of them is <code>NFE</code>, use <code>NFE</code> as the major ancestry.</li>
        <li>If multiple ancestries share the highest <code>relativeSampleSize</code> and none of them is <code>NFE</code>, use the first ancestry in the list as the major ancestry.</li>
        <li>If there is no ancestry in the list, use <code>NFE</code> as the major ancestry and set the <code>relativeSampleSize</code> to 0.0.</li>
    </ol>
    <li>Extract the allele frequency for the major ancestry from the <code>variantIndex</code> dataset.</li>
</ol>
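The selection rules above can be sketched in plain Python. This is only an illustration under stated assumptions, not the `manuscript_methods` implementation: the `(ldPopulation, relativeSampleSize)` tuples, the function names and the `"NFE"` label are hypothetical stand-ins, and the MAF is assumed to be the usual `min(AF, 1 - AF)`.

```python
from typing import Optional, Sequence, Tuple

# Hypothetical stand-in for one ldPopulationStructure entry:
# (ldPopulation, relativeSampleSize)
Population = Tuple[str, float]


def major_population(
    populations: Optional[Sequence[Population]], default: str = "NFE"
) -> Population:
    """Pick the major ancestry following the tie-breaking rules listed above."""
    if not populations:
        # No ancestry reported: fall back to the default with a zero sample share.
        return (default, 0.0)
    top_size = max(size for _, size in populations)
    tied = [pop for pop in populations if pop[1] == top_size]
    for pop in tied:
        if pop[0] == default:
            # Prefer the default ancestry whenever it is among the tied ones.
            return pop
    # Otherwise keep the first of the tied ancestries.
    return tied[0]


def minor_allele_frequency(af: Optional[float]) -> Optional[float]:
    """Fold an allele frequency into a minor allele frequency (assumed definition)."""
    if af is None:
        return None
    return min(af, 1.0 - af)
```

For example, `major_population([("AFR", 0.5), ("NFE", 0.5)])` resolves the tie in favour of `NFE`, while an empty list yields `("NFE", 0.0)`.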


In [7]:
from manuscript_methods.ld_populations import LDPopulationName, LDPopulationStructure
from manuscript_methods.maf import AlleleFrequencies

ld_pop = LDPopulationStructure(f.col("ldPopulationStructure"))
major_ld_pop = ld_pop.major_population(default_major_pop=LDPopulationName.NFE)
major_ld_maf = AlleleFrequencies(f.col("alleleFrequencies")).ld_population_maf(major_ld_pop.ld_population)
major_ld_af = AlleleFrequencies(f.col("alleleFrequencies")).ld_population_af(major_ld_pop.ld_population)

dataset = dataset.withColumns(
    {
        "majorLdPopulation": major_ld_pop.col,
        "majorLdPopulationMaf": major_ld_maf.col,
        "majorLdPopulationAf": major_ld_af.col,
    }
)

dataset.show(5, truncate=False)


                                                                                


## Phenotypic variance explained by lead variant (Approximation)

The code below is used to calculate the PVE (Phenotypic Variance Explained) by the lead variant in the credible set.

The variance explained follows the simplified formula

${variance\;explained}=\chi^2 / n $

- The $\chi^2$ statistic is calculated from the lead variant $pValue$ (stored as `pValueMantissa` and `pValueExponent`) with the **inverse survival function** of the $\chi^2$ distribution, `scipy.stats.chi2.isf`.
- The $n$ parameter is the number of samples reported in the GWAS study description.

- When `pValueExponent < -300`, we estimate the $\chi^2$ statistic from $-log_{10}(pValue)$ instead, to avoid floating point errors.
- The $variance\;explained$ can only be calculated when $n > 0$.
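As a minimal sketch of the calculation described above, the snippet below derives the $\chi^2$ statistic and the approximate variance explained for a single lead variant. It is written under assumptions: the function names are hypothetical, one degree of freedom is assumed, and the standard library's `NormalDist` replaces scipy (for one degree of freedom the $\chi^2$ inverse survival function at $p$ equals the squared standard normal quantile at $p/2$). The $-log_{10}(pValue)$ fallback for extreme p-values is not reproduced.

```python
from statistics import NormalDist
from typing import Optional


def chi2_from_pvalue(mantissa: float, exponent: int) -> Optional[float]:
    """Chi-squared statistic (1 degree of freedom, assumed) for a p-value."""
    if exponent < -300:
        # The notebook switches to a -log10(p) based estimate here;
        # that approximation is not reproduced in this sketch.
        return None
    p_value = mantissa * 10.0**exponent
    if not 0.0 < p_value < 1.0:
        return None
    # For 1 degree of freedom: chi2.isf(p) == norm.ppf(p / 2) ** 2
    z = NormalDist().inv_cdf(p_value / 2.0)
    return z * z


def variance_explained(chi2_stat: Optional[float], n_samples: Optional[int]) -> Optional[float]:
    """Approximate variance explained as chi2 / n, defined only for n > 0."""
    if chi2_stat is None or n_samples is None or n_samples <= 0:
        return None
    return chi2_stat / n_samples
```

A p-value of 5e-2 gives a statistic of roughly 3.84, the familiar 5% critical value for one degree of freedom, which is a quick sanity check on the conversion.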


In [8]:
from manuscript_methods.variant_statistics import PValueComponents, VariantStatistics

pval_components = PValueComponents(p_value_mantissa=f.col("pValueMantissa"), p_value_exponent=f.col("pValueExponent"))
n_samples = f.col("nSamples")
variant_stats = VariantStatistics.compute(pval_components, n_samples)

dataset = dataset.withColumns(
    {
        "variantStatistics": variant_stats.col,
    }
)

dataset.show(5, truncate=False)



In Python 3.6+ and Spark 3.0+, it is preferred to specify type hints for pandas UDF instead of specifying pandas UDF type which will be deprecated in the future releases. See SPARK-28264 for more details.

25/06/18 22:16:09 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.


                                                                                

## Cohort statistics

The code below is used to combine and classify the cohort statistics from the _studyIndex_ dataset.

This includes:

- n_cases
- n_controls
- n_samples
- study_type
- trait
- trait_class - either binary, continuous or unknown
- trait_ids


In [9]:
from manuscript_methods.study_statistics import StudyStatistics

cohort_stat = StudyStatistics.compute(
    n_samples=f.col("nSamples"),
    n_cases=f.col("nCases"),
    n_controls=f.col("nControls"),
    trait=f.col("traitFromSource"),
    study_type=f.col("studyType"),
    is_trans_pqtl=f.col("isTransQtl"),
    trait_ids=f.col("traitFromSourceMappedIds"),
)

dataset = dataset.withColumns({"studyStatistics": cohort_stat.col})
dataset.show(5, truncate=False)





                                                                                

## Rescaling of the marginal effect size

The standardised marginal effect size is rescaled to the original scale with one of two formulas, depending on whether the trait is **quantitative** or **binary**.

The trait type is inferred from the `nCases` and `nControls` fields reported in the study description:

- If both fields are non-empty and non-zero, we assume a _binary trait_.
- If the number of cases is zero or not reported, we assume a _quantitative trait_.

In both cases we estimate the marginal effect size $estimated\;\beta$ with the following formula:
$$estimated\;\beta = zscore \cdot se$$

Where

- $zscore = \frac{\beta}{|{\beta}|} \cdot \sqrt{\chi^2}$
- $se$ depends on the trait type
- $\beta$ - _standardised beta reported in the summary statistics_

When $\beta$ was not reported, we assume $\frac{\beta}{|{\beta}|}$ to be equal to 1.

#### Binary trait marginal effect size estimation

$$se = \frac{1}{\sqrt{(varG \cdot prev \cdot (1 - prev))}}$$

- $varG = 2 \cdot f \cdot (1 - f)$ - _component of genetic variance_ - the original is $var_{G} = 2\beta^2f(1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_
- $prev = \frac{nCases}{nSamples}$ - _Trait prevalence_

#### Quantitative trait marginal effect size estimation

$$se = \frac{1}{\sqrt{varG}}$$

- $varG = 2 \cdot f \cdot (1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_

The $\chi^2$ statistic is estimated as described in the variance explained calculation.
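The formulas in this section can be put together in a short plain-Python sketch. It is an illustration under assumptions rather than the `RescaledStatistics` implementation: the function names are hypothetical, a reported $\beta$ of exactly zero is treated like a missing one (sign of +1), and `None` is returned whenever an input needed by the formula is missing or would cause a division by zero.

```python
import math
from typing import Optional


def infer_trait_class(n_cases: Optional[int], n_controls: Optional[int]) -> str:
    """Binary when both counts are reported and non-zero, quantitative otherwise."""
    return "binary" if n_cases and n_controls else "quantitative"


def rescaled_beta(
    chi2_stat: float,
    maf: float,
    beta: Optional[float] = None,
    n_cases: Optional[int] = None,
    n_controls: Optional[int] = None,
    n_samples: Optional[int] = None,
) -> Optional[float]:
    """Estimate the marginal effect size as zscore * se."""
    # Sign of the reported beta; +1 when beta is missing (or zero, an assumption).
    sign = -1.0 if beta is not None and beta < 0 else 1.0
    z_score = sign * math.sqrt(chi2_stat)

    var_g = 2.0 * maf * (1.0 - maf)
    if var_g <= 0.0:
        return None  # monomorphic variant: se is undefined

    if infer_trait_class(n_cases, n_controls) == "binary":
        if not n_samples:
            return None  # prevalence cannot be computed
        prev = n_cases / n_samples
        denominator = var_g * prev * (1.0 - prev)
        if denominator <= 0.0:
            return None
        se = 1.0 / math.sqrt(denominator)
    else:
        se = 1.0 / math.sqrt(var_g)
    return z_score * se
```

With $\chi^2 = 4$ and $f = 0.5$, $varG = 0.5$ and the quantitative estimate is $2 / \sqrt{0.5} \approx 2.83$; for a binary trait with 500 cases out of 1000 samples the same inputs give $2 / \sqrt{0.125} \approx 5.66$.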


In [10]:
from manuscript_methods.maf import MinorAlleleFrequency
from manuscript_methods.rescaled_beta import RescaledStatistics

rescaled_stats = RescaledStatistics.compute(
    beta=f.col("beta"),
    chi2_stat=VariantStatistics(f.col("variantStatistics")).chi2_stat,
    trait_class=StudyStatistics(f.col("studyStatistics")).trait_class,
    maf=MinorAlleleFrequency(f.col("majorLdPopulationMaf")).value,
    n_cases=StudyStatistics(f.col("studyStatistics")).n_cases,
    n_samples=StudyStatistics(f.col("studyStatistics")).n_samples,
)

dataset = dataset.withColumns({"rescaledStatistics": rescaled_stats.col})

dataset.show(5, truncate=False)



                                                                                

## Final dataset contract

```
root
 |-- variantId: string (nullable = true)
 |-- studyLocusId: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- originalBeta: double (nullable = true)
 |-- originalStandardError: double (nullable = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetSize: integer (nullable = false)
 |-- isTransQtl: boolean (nullable = true)
 |-- variantEffect: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- normalisedScore: double (nullable = true)
 |-- majorLdPopulation: struct (nullable = true)
 |    |-- ldPopulation: string (nullable = false)
 |    |-- relativeSampleSize: double (nullable = false)
 |-- majorLdPopulationMaf: struct (nullable = true)
 |    |-- value: double (nullable = true)
 |    |-- type: string (nullable = false)
 |-- majorLdPopulationAf: struct (nullable = true)
 |    |-- populationName: string (nullable = true)
 |    |-- alleleFrequency: double (nullable = true)
 |-- variantStatistics: struct (nullable = false)
 |    |-- chi2Stat: double (nullable = true)
 |    |-- pValueMantissa: float (nullable = true)
 |    |-- pValueExponent: integer (nullable = true)
 |    |-- ApproximatedVarianceExplained: double (nullable = true)
 |-- studyStatistics: struct (nullable = false)
 |    |-- nCases: integer (nullable = true)
 |    |-- nControls: integer (nullable = true)
 |    |-- nSamples: integer (nullable = true)
 |    |-- trait: string (nullable = true)
 |    |-- studyType: string (nullable = true)
 |    |-- traitClass: string (nullable = true)
 |    |-- traitFromSourceMappedIds: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- rescaledStatistics: struct (nullable = false)
 |    |-- directionOfEffect: string (nullable = false)
 |    |-- zScore: double (nullable = true)
 |    |-- varG: double (nullable = true)
 |    |-- prevalence: double (nullable = true)
 |    |-- estimatedSE: double (nullable = true)
 |    |-- estimatedBeta: double (nullable = true)
 |    |-- majorAlleleEstimatedBeta: double (nullable = true)
```


In [11]:
dataset = dataset.select(
    f.col("variantId"),
    f.col("studyLocusId"),
    f.col("studyId"),
    f.col("geneId"),
    f.col("beta").alias("originalBeta"),
    f.col("standardError").alias("originalStandardError"),
    f.col("finemappingMethod"),
    f.col("credibleSetSize"),
    f.col("isTransQtl"),
    f.col("variantEffect"),
    f.col("majorLdPopulation"),
    f.col("majorLdPopulationMaf"),
    f.col("majorLdPopulationAf"),
    f.col("variantStatistics"),
    f.col("studyStatistics"),
    f.col("rescaledStatistics"),
)


### Save the dataset to parquet


In [12]:
dataset.repartition(50).write.mode("overwrite").parquet(output_dataset_path)


                                                                                

### Show the dataset schema


In [13]:
dataset.printSchema()


root
 |-- variantId: string (nullable = true)
 |-- studyLocusId: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- originalBeta: double (nullable = true)
 |-- originalStandardError: double (nullable = true)
 |-- finemappingMethod: string (nullable = true)
 |-- credibleSetSize: integer (nullable = false)
 |-- isTransQtl: boolean (nullable = true)
 |-- variantEffect: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- method: string (nullable = true)
 |    |    |-- assessment: string (nullable = true)
 |    |    |-- score: float (nullable = true)
 |    |    |-- assessmentFlag: string (nullable = true)
 |    |    |-- targetId: string (nullable = true)
 |    |    |-- normalisedScore: double (nullable = true)
 |-- majorLdPopulation: struct (nullable = true)
 |    |-- ldPopulation: string (nullable = false)
 |    |-- relativeSampleSize: double (nullable = false)
 |-- majorLdPopulationMaf: struct (null

### Show the final dataset


In [14]:
dataset.show(5, truncate=False)



+------------------+--------------------------------+---------------------------------------------------------------------+---------------+------------+---------------------+-----------------+---------------+----------+-------------------------------------------------------------------------------------------------------------------------------+-----------------+----------------------------------+-------------------------------+-----------------------------------------------------+--------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
|variantId         |studyLocusId                    |studyId                                                              |geneId         |originalBeta|originalStandardError|finemappingMethod|credibleSetSize|isTransQtl|variantEffect                                                         

                                                                                

### Describe the dataset


In [15]:
dataset.describe().show(truncate=False)




+-------+----------------+--------------------------------+--------------------------------------------------------------------------------+---------------+------------------+---------------------+-----------------+------------------+
|summary|variantId       |studyLocusId                    |studyId                                                                         |geneId         |originalBeta      |originalStandardError|finemappingMethod|credibleSetSize   |
+-------+----------------+--------------------------------+--------------------------------------------------------------------------------+---------------+------------------+---------------------+-----------------+------------------+
|count  |2833758         |2833758                         |2833758                                                                         |2044305        |2781497           |2254038              |2833758          |2833758           |
|mean   |NULL            |Infinity                        |N

                                                                                