# Lead Variant Effect dataset preparation

The aim of this notebook is to collect the information about the effect of the credible set lead variants.
**This includes**:

- Addition of **Major population sample size** and **size of cases/controls** from _studyIndex_,
- Addition of **VEP consequence score** derived annotations from _variantIndex_
- Addition of **study specific major ancestry variant AF** (allele frequency)<a name="out of sample AF"></a>[<sup>[1]</sup>](#cite_note-1) annotations from _variantIndex_
- Calculation of **MAF (Minor Allele Frequency)** based on AF of the **credible set lead variants** derived from _studyLocus_
- Calculation of **Variance Explained by lead variant**
- Calculation of the **Rescaled estimated effect sizes** based on the trait class (dichotomous or continuous) and the MAF of the lead variant.

<a name="cite_note-1"></a>1. [^](#cite_ref-1) AF is derived from gnomAD v4.1 allele frequencies from the joint exome and WGS datasets.


## Data extraction and loading

<div class="alert alert-block alert-info"> 
    <b style="font-size: 1.2em">Downloading datasets</b><br><br>
    <b>The analysis can be performed on the:</b>
    <ul>
        <li>2025.03 release (rsync from EBI FTP server)</li>
        <li>2025.06 release (rsync from Google Cloud Storage)</li>
    </ul>
    <I>This code chunk should be run only once to download the relevant datasets.</I>
</div>

This analysis requires three datasets from the release:

- credible_set
- variant
- study
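
Before starting the Spark session it can help to confirm that all three datasets were actually downloaded. The following is a minimal sketch, not part of the original pipeline: the helper name `missing_datasets` is hypothetical, and it only assumes that each dataset is a non-empty directory under the data root used in this notebook.

```python
from pathlib import Path

# Dataset directory names as listed above.
REQUIRED_DATASETS = ("credible_set", "variant", "study")


def missing_datasets(data_root, required=REQUIRED_DATASETS):
    """Return the required dataset names that have no non-empty directory under data_root.

    Hypothetical helper for illustration only.
    """
    root = Path(data_root)
    missing = []
    for name in required:
        path = root / name
        # A dataset counts as present only if the directory exists and is not empty.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(name)
    return missing
```

Calling `missing_datasets("../../data")` after the rsync step should return an empty list when every dataset is in place.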


In [1]:
# Check the installed Java version; Spark needs a compatible JDK (the run below used OpenJDK 17)
!java -version


openjdk version "17.0.14" 2025-01-21
OpenJDK Runtime Environment Temurin-17.0.14+7 (build 17.0.14+7)
OpenJDK 64-Bit Server VM Temurin-17.0.14+7 (build 17.0.14+7, mixed mode, sharing)


In [None]:
# Download the release data from the Open Targets Platform 25.06 release
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.06/output/credible_set ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.06/output/study ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.06/output/variant ../../data/.


Transfer starting: 27 files

sent 16 bytes  received 1838 bytes  18540000 bytes/sec
total size is 2578157705  speedup is 1390591.32
Transfer starting: 3 files

sent 16 bytes  received 158 bytes  1740000 bytes/sec
total size is 93324727  speedup is 536345.92
Transfer starting: 27 files

sent 16 bytes  received 1836 bytes  18520000 bytes/sec
total size is 3456816333  speedup is 1866530.49


## Session setup

- Create the sparkSession
- Set all input/output paths


In [1]:
from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex
from pyspark.sql import functions as f


In [2]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
variant_index_path = "../../data/variant"
study_index_path = "../../data/study"
credible_set_path = "../../data/credible_set"
output_dataset_path = "../../data/lead_variant_effect"


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/07/03 13:36:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/07/03 13:36:47 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/07/03 13:36:47 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/07/03 13:36:47 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
25/07/03 13:36:47 WARN Utils: Service 'SparkUI' could not bind on port 4043. Attempting port 4044.
25/07/03 13:36:47 WARN Utils: Service 'SparkUI' could not bind on port 4044. Attempting port 4045.


In [5]:
session.spark


## Building temporary dataset

The temporary dataset needs to be built from the _studyIndex_, _studyLocus_ and _variantIndex_ datasets.


In [6]:
vi = VariantIndex.from_parquet(session, variant_index_path)
si = StudyIndex.from_parquet(session, study_index_path)
cs = StudyLocus.from_parquet(session, credible_set_path)


_cs = cs.df.select(
    f.col("studyId"),
    f.col("studyLocusId"),
    f.col("variantId"),
    f.col("beta"),
    f.col("zScore"),
    f.col("pValueMantissa"),
    f.col("pValueExponent"),
    f.col("standardError"),
    f.col("finemappingMethod"),
    f.col("studyType"),
    f.col("locus"),
    f.col("isTransQtl"),
)
_si = si.df.select(
    f.col("studyId"),
    f.col("nSamples"),
    f.col("nControls"),
    f.col("nCases"),
    f.col("geneId"),  # for molqtl traits
    f.col("traitFromSourceMappedIds"),  # selected once; a duplicate column makes later references ambiguous
    f.col("ldPopulationStructure"),
    f.col("traitFromSource"),
)

_vi = vi.df.select(
    f.col("variantId"),
    f.col("alleleFrequencies"),
    f.col("variantEffect"),
    f.col("transcriptConsequences"),
    f.col("chromosome"),
    f.col("position"),
    f.col("referenceAllele"),
    f.col("alternateAllele"),
)

dataset = _cs.join(_si, how="left", on="studyId").join(_vi, how="left", on="variantId")


## MAF Calculation

To add the MAF (Minor Allele Frequency) to the dataset we need to extract the major ancestry from _studyIndex_ and use it to extract the relevant allele frequency from the _variantIndex_ dataset.

The MAF is calculated as follows:

<ol>
    <li>Extract the major ancestry from the <code>studyIndex</code> dataset.</li>
    <ol>
        <li>If several ancestries tie for the largest <code>relativeSampleSize</code> and one of them is <code>NFE</code>, use <code>NFE</code> as the major ancestry.</li>
        <li>If several ancestries tie for the largest <code>relativeSampleSize</code> and none of them is <code>NFE</code>, use the first ancestry in the list as the major ancestry.</li>
        <li>If the list contains no ancestry, use <code>NFE</code> as the major ancestry and set <code>relativeSampleSize</code> to 0.0.</li>
    </ol>
    <li>Extract the allele frequency for the major ancestry from the <code>variantIndex</code> dataset.</li>
</ol>
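
The selection rules above can be sketched in plain Python. This is an illustration only, not the `manuscript_methods` implementation: the function names are hypothetical, the ancestry list is assumed to be a list of records with `ldPopulation` and `relativeSampleSize` keys, the allele frequencies are simplified to a population-to-AF mapping, and "matching" ancestries are read as ancestries that tie for the largest relative sample size.

```python
def major_population(ld_structure, default="nfe"):
    """Pick the major ancestry from a list of {ldPopulation, relativeSampleSize} records.

    Hypothetical helper; returns (population_name, relative_sample_size).
    """
    if not ld_structure:
        # Rule 3: no ancestry reported, fall back to the default with size 0.0.
        return default, 0.0
    top = max(p["relativeSampleSize"] for p in ld_structure)
    tied = [p["ldPopulation"] for p in ld_structure if p["relativeSampleSize"] == top]
    # Rule 1: prefer the default (NFE) among tied ancestries.
    # Rule 2: otherwise take the first tied ancestry in list order.
    name = default if default in tied else tied[0]
    return name, top


def population_maf(allele_frequencies, population):
    """Return (AF, MAF) for one population, or (None, None) if it is absent.

    `allele_frequencies` is assumed to be a simple {population: AF} mapping.
    """
    af = allele_frequencies.get(population)
    if af is None:
        return None, None
    # The minor allele frequency is the smaller of AF and 1 - AF.
    return af, min(af, 1.0 - af)
```

For example, a study with equal `afr` and `nfe` shares resolves to `nfe`, and an allele frequency of 0.75 gives a MAF of 0.25.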


In [7]:
from manuscript_methods.ld_populations import LDPopulationName, LDPopulationStructure
from manuscript_methods.maf import AlleleFrequencies

ld_pop = LDPopulationStructure(f.col("ldPopulationStructure"))
major_ld_pop = ld_pop.major_population(default_major_pop=LDPopulationName.NFE)
major_ld_maf = AlleleFrequencies(f.col("alleleFrequencies")).ld_population_maf(major_ld_pop.ld_population)
major_ld_af = AlleleFrequencies(f.col("alleleFrequencies")).ld_population_af(major_ld_pop.ld_population)

dataset = dataset.withColumns(
    {
        "majorLdPopulation": major_ld_pop.col,
        "majorLdPopulationMaf": major_ld_maf.col,
        "majorLdPopulationAf": major_ld_af.col,
    }
)


# dataset.select("majorLdPopulation", "majorLdPopulationMaf", "majorLdPopulationAf").show(5, truncate=False)
# dataset.select("majorLdPopulation", "majorLdPopulationMaf", "majorLdPopulationAf").printSchema()


## Phenotypic variance explained by lead variant (Approximation)

The code below calculates the PVE (Phenotypic Variance Explained) by the lead variant of each credible set.

The variance explained follows the simplified formula

$variance\;explained = \chi^2 / n$

- The $\chi^2$ statistic is calculated from the lead variant $pValue$ (stored as `pValueMantissa` and `pValueExponent`) with the **inverse survival function** of the $\chi^2$ distribution (`scipy.stats.chi2.isf`).
- The $n$ parameter is the number of samples taken from the GWAS study description.
- When `pValueExponent < -300`, the $\chi^2$ statistic is approximated by $-log_{10}(pValue)$ to avoid floating-point underflow.
- The $variance\;explained$ can only be calculated when $n > 0$.
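
The calculation described above can be sketched with the Python standard library alone. This is not the `VariantStatistics` implementation: the function names are hypothetical, one degree of freedom is assumed for the $\chi^2$ statistic, and the exponent threshold is assumed to refer to exponents below -300 (p-values too small for a double). With one degree of freedom, the inverse survival function at $p$ equals the squared standard normal quantile at $p/2$, which is what the sketch uses in place of SciPy.

```python
import math
from statistics import NormalDist


def chi2_from_pvalue(mantissa, exponent):
    """Approximate the 1-df chi-square statistic for p = mantissa * 10**exponent."""
    if exponent < -300:
        # Very small p-values: fall back to -log10(p), computed without forming p itself.
        return -(math.log10(mantissa) + exponent)
    p = mantissa * 10.0**exponent
    # For 1 degree of freedom, chi2.isf(p) equals the squared normal quantile at p / 2.
    z = NormalDist().inv_cdf(p / 2.0)
    return z * z


def variance_explained(mantissa, exponent, n_samples):
    """Return chi2 / n, or None when the sample size is missing or not positive."""
    if not n_samples or n_samples <= 0:
        return None
    return chi2_from_pvalue(mantissa, exponent) / n_samples
```

A p-value of 0.05 gives a statistic close to 3.84, and the genome-wide threshold of 5e-8 gives a value close to 29.72.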


In [8]:
from manuscript_methods.variant_statistics import PValueComponents, VariantStatistics

pval_components = PValueComponents(p_value_mantissa=f.col("pValueMantissa"), p_value_exponent=f.col("pValueExponent"))
n_samples = f.col("nSamples")
variant_stats = VariantStatistics.compute(pval_components, n_samples)

dataset = dataset.withColumns(
    {
        "variantStatistics": variant_stats.col,
    }
)
# dataset.select("variantStatistics").show(5, truncate=False)
# dataset.select("variantStatistics").printSchema()




## Study statistics

The code below is used to combine and classify the cohort statistics from the _studyIndex_ dataset.

This includes:

- n_cases
- n_controls
- n_samples
- study_type
- trait
- trait_ids
- gene_id


In [9]:
from manuscript_methods.study_statistics import StudyStatistics

cohort_stat = StudyStatistics.compute(
    n_samples=f.col("nSamples"),
    n_cases=f.col("nCases"),
    n_controls=f.col("nControls"),
    trait=f.col("traitFromSource"),
    study_type=f.col("studyType"),
    is_trans_pqtl=f.col("isTransQtl"),
    gene_id=f.col("geneId"),
)

dataset = dataset.withColumns({"studyStatistics": cohort_stat.col})
# dataset.select("studyStatistics").show(5, truncate=False)
# dataset.select("studyStatistics").printSchema()


## Rescaling of the marginal effect size

The marginal effect size is rescaled from the standardised scale back to the original scale with one of two formulas, depending on whether the trait is **quantitative** or **binary**.

The trait type is inferred from the availability of the reported `nCases` and `nControls` fields in the study description.

- If both fields are non-empty and non-zero, we assume a _binary trait_.
- If the number of cases is zero or not reported, we assume a _quantitative trait_.

In both cases we estimate the marginal effect size $estimated\;\beta$ with the following formula:
$$estimated\;\beta = zscore \cdot se$$

Where

- $zscore = \frac{\beta}{|{\beta}|} \cdot \sqrt{\chi^2}$
- $se$ depends on the trait type
- $\beta$ - _standardised beta reported in the summary statistics_

When $\beta$ was not reported, we assume $\frac{\beta}{|{\beta}|}$ to be equal to 1.

#### Binary trait marginal effect size estimation

$$se = \frac{1}{\sqrt{(varG \cdot prev \cdot (1 - prev))}}$$

- $varG = 2 \cdot f \cdot (1 - f)$ - _component of genetic variance_ - the original is $var_{G} = 2\beta^2f(1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_
- $prev = \frac{nCases}{nSamples}$ - _Trait prevalence_

#### Quantitative trait marginal effect size estimation

$$se = \frac{1}{\sqrt{varG}}$$

- $varG = 2 \cdot f \cdot (1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_

The $\chi^2$ statistic is estimated as described in the variance explained calculation.
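
The formulas in this section can be transcribed literally into a few lines of Python. The sketch below is an illustration under stated assumptions, not the `RescaledStatistics` implementation: the function names and the `"binary"` / `"quantitative"` labels are hypothetical, a reported $\beta$ of exactly zero is treated like a missing $\beta$ (sign of +1), and a study with cases but no reported controls is classified as quantitative.

```python
import math


def classify_trait(n_cases, n_controls):
    """Return 'binary' when both counts are reported and non-zero, else 'quantitative'."""
    return "binary" if n_cases and n_controls else "quantitative"


def rescaled_beta(beta, chi2, maf, trait_class, n_cases=None, n_samples=None):
    """Estimate the marginal effect size as zscore * se, following the formulas above."""
    # Sign of the reported beta; assumed positive when beta is missing (or zero).
    sign = -1.0 if beta is not None and beta < 0 else 1.0
    z_score = sign * math.sqrt(chi2)
    # varG = 2 * f * (1 - f), with f the minor allele frequency of the lead variant.
    var_g = 2.0 * maf * (1.0 - maf)
    if trait_class == "binary":
        # Binary traits additionally scale by prev * (1 - prev).
        prev = n_cases / n_samples
        se = 1.0 / math.sqrt(var_g * prev * (1.0 - prev))
    else:
        se = 1.0 / math.sqrt(var_g)
    return z_score * se
```

With $f = 0.5$ and $\chi^2 = 4$, a negative reported $\beta$ yields $-2\sqrt{2}$ for a quantitative trait and $-4\sqrt{2}$ for a binary trait with a prevalence of 0.5.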


In [10]:
from manuscript_methods.maf import MinorAlleleFrequency, PopulationFrequency
from manuscript_methods.rescaled_beta import RescaledStatistics

rescaled_stats = RescaledStatistics.compute(
    beta=f.col("beta"),
    chi2_stat=VariantStatistics(f.col("variantStatistics")).chi2_stat,
    trait_class=StudyStatistics(f.col("studyStatistics")).trait_class,
    af=PopulationFrequency(f.col("majorLdPopulationAf")).allele_frequency,
    maf=MinorAlleleFrequency(f.col("majorLdPopulationMaf")).value,
    n_cases=StudyStatistics(f.col("studyStatistics")).n_cases,
    n_samples=StudyStatistics(f.col("studyStatistics")).n_samples,
)

dataset = dataset.withColumns({"rescaledStatistics": rescaled_stats.col})

# dataset.select("rescaledStatistics").show(5, truncate=False)
# dataset.select("rescaledStatistics").printSchema()


## VEP consequence extraction

To extract the VEP annotations from the _variantIndex_ dataset we need to:

- for GWAS lead variants, extract the VEP annotation with the most severe consequence
- for QTL lead variants, extract the VEP annotation linked to the `geneId` defined in the _studyIndex_ dataset when
  that `geneId` is found in the transcript annotations (**in-gene effect**); otherwise use the most severe consequence annotation (**out-gene effect**).
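
The selection logic can be sketched as follows. This is an illustration only, not the `LeadVariantConsequences` implementation: the function names, the `geneId` and `score` keys of each transcript record, the `"gwas"` study type label and the returned effect labels are all hypothetical, a higher `score` is assumed to mean a more severe consequence, and when several transcripts belong to the QTL gene the most severe of them is taken.

```python
def _severity(transcript_consequence):
    """Hypothetical convention: a higher 'score' means a more severe consequence."""
    return transcript_consequence["score"]


def lead_variant_consequence(transcript_consequences, study_type, gene_id=None):
    """Return (consequence, effect_label) for a lead variant.

    The label is None for GWAS studies and 'in-gene' or 'out-gene' for QTL studies.
    """
    if not transcript_consequences:
        return None, None
    most_severe = max(transcript_consequences, key=_severity)
    if study_type == "gwas":
        return most_severe, None
    # QTL studies: prefer annotations on the gene measured by the study.
    in_gene = [tc for tc in transcript_consequences if tc["geneId"] == gene_id]
    if in_gene:
        return max(in_gene, key=_severity), "in-gene"
    # The study gene is not annotated for this variant: fall back to the most severe consequence.
    return most_severe, "out-gene"
```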


In [11]:
from manuscript_methods.study_statistics import StudyStatistics
from manuscript_methods.tc import LeadVariantConsequences, TranscriptConsequences

tc = TranscriptConsequences(f.col("transcriptConsequences"))
sstats = StudyStatistics(f.col("studyStatistics"))
lc = LeadVariantConsequences.compute(tc, sstats)


dataset = dataset.withColumn(lc.name, lc.col)
# dataset.select(lc.name).show(5, truncate=False)
# dataset.select(lc.name).printSchema()


## Locus statistics

The locus statistics include:

- Posterior Probability of the lead variant
- Locus length
- Locus size
- Locus start - (start of the first variant in the locus)
- Locus end - (end of the last variant in the locus)

All statistics are derived from the **studyLocus** dataset `locus` object.
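
As a rough illustration of these statistics, the sketch below computes them for a locus given as a list of records. It is not the `LocusStatistics` implementation, and several points are assumptions: the function name is hypothetical, every `variantId` is assumed to have the four-field `chromosome_position_ref_alt` form, the end of a variant is taken as `position + len(ref) - 1`, the locus size is taken as the number of variants, and the locus length is taken as `locusEnd - locusStart + 1`.

```python
def locus_statistics(locus, lead_variant_id):
    """Summarise a locus given as a list of {variantId, posteriorProbability} records."""
    spans = []
    lead_pip = None
    for entry in locus:
        # Assumed identifier layout: chromosome_position_ref_alt.
        _chrom, pos, ref, _alt = entry["variantId"].split("_")
        start = int(pos)
        # Assumed end coordinate: last reference base covered by the variant.
        spans.append((start, start + len(ref) - 1))
        if entry["variantId"] == lead_variant_id:
            lead_pip = entry["posteriorProbability"]
    if not spans:
        return None
    locus_start = min(start for start, _end in spans)
    locus_end = max(end for _start, end in spans)
    return {
        "locusSize": len(spans),
        "locusLength": locus_end - locus_start + 1,
        "locusStart": locus_start,
        "locusEnd": locus_end,
        "leadVariantPIP": lead_pip,
    }
```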


In [12]:
from manuscript_methods.locus_statistics import LocusStatistics

ls = LocusStatistics.compute(locus=f.col("locus"), lead_variant=f.col("variantId"))
dataset = dataset.withColumn(ls.name, ls.col)
# dataset.select(ls.name).show(5, truncate=False)
# dataset.select(f"{ls.name}.*").printSchema()


## Variant type for interval joins

Compute the variant type and the effective length of indels for use in downstream interval joins.
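
The idea can be sketched in plain Python. This is not the `Variant.compute` implementation: the function name and the type labels (`SNP`, `MNP`, `INS`, `DEL`) are hypothetical, and the interval is assumed to cover the reference allele, so that `end = position + len(ref) - 1` and `length = len(ref)`.

```python
def describe_variant(chromosome, position, ref, alt):
    """Classify a variant by allele lengths and derive an interval on the reference."""
    if len(ref) == 1 and len(alt) == 1:
        variant_type = "SNP"
    elif len(ref) == len(alt):
        variant_type = "MNP"
    elif len(ref) < len(alt):
        variant_type = "INS"
    else:
        variant_type = "DEL"
    # Assumed convention: the interval spans the reference allele only.
    end = position + len(ref) - 1
    return {
        "chromosome": chromosome,
        "start": position,
        "end": end,
        "type": variant_type,
        "ref": ref,
        "alt": alt,
        "length": end - position + 1,
    }
```

Under this convention a deletion such as `ATG>A` at position 100 spans 100 to 102, while an insertion keeps a one-base interval.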


In [14]:
from manuscript_methods.variant_type import Variant

dataset = dataset.withColumn(
    "variant",
    Variant.compute(f.col("chromosome"), f.col("position"), f.col("referenceAllele"), f.col("alternateAllele")).col,
)

# dataset.select("variant").show(5, truncate=False)
# dataset.select("variant").printSchema()
# # Show the number of lead variants per type
# dataset.select("variant.*").groupby("type").count().show()


## Final dataset contract


In [15]:
dataset = dataset.select(
    f.col("variantId"),
    f.col("variant"),
    f.col("studyLocusId"),
    f.col("studyId"),
    f.col("geneId"),
    f.col("beta").alias("originalBeta"),
    f.col("standardError").alias("originalStandardError"),
    f.col("locusStatistics"),
    f.col("finemappingMethod"),
    f.col("isTransQtl"),
    f.col("variantEffect"),
    f.col("majorLdPopulation"),
    f.col("majorLdPopulationMaf"),
    f.col("majorLdPopulationAf"),
    f.col("variantStatistics"),
    f.col("studyStatistics"),
    f.col("rescaledStatistics"),
    f.col("leadVariantConsequence"),
    f.col("traitFromSourceMappedIds"),
)


### Save the dataset to parquet


In [17]:
dataset.repartition(50).write.mode("overwrite").parquet(output_dataset_path)


25/07/02 14:06:02 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

In [18]:
import json

with open("../../src/manuscript_methods/schemas/lead_variant_effect.json", "w") as fp:
    json.dump(
        json.loads(dataset.schema.json()),
        fp,
        indent=2,
    )


In [None]:
print(f"Dataset with {dataset.count()} rows saved to {output_dataset_path}")


[Stage 115:>                                                        (0 + 8) / 9]

Dataset with 2833758 rows saved to ../../data/lead-maf-vep
1725150


                                                                                

In [20]:
from manuscript_methods.datasets import LeadVariantEffect

LeadVariantEffect.from_parquet(session, output_dataset_path).df.printSchema()


root
 |-- variantId: string (nullable = true)
 |-- variant: struct (nullable = true)
 |    |-- chromosome: string (nullable = true)
 |    |-- start: integer (nullable = true)
 |    |-- end: integer (nullable = true)
 |    |-- type: string (nullable = true)
 |    |-- ref: string (nullable = true)
 |    |-- alt: string (nullable = true)
 |    |-- length: integer (nullable = true)
 |-- studyLocusId: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- geneId: string (nullable = true)
 |-- originalBeta: double (nullable = true)
 |-- originalStandardError: double (nullable = true)
 |-- locusStatistics: struct (nullable = true)
 |    |-- locusSize: integer (nullable = true)
 |    |-- locusLength: integer (nullable = true)
 |    |-- locusStart: integer (nullable = true)
 |    |-- locusEnd: integer (nullable = true)
 |    |-- leadVariantPIP: double (nullable = true)
 |-- finemappingMethod: string (nullable = true)
 |-- isTransQtl: boolean (nullable = true)
 |-- variantEffect: a
