# Variance Explained & Marginal effect size

The goal of this notebook is to add a variance explained and unscaled marginal effect size (following non-standardised distribution)


## Loading data


In [1]:
import polars as pl
from scipy.stats import chi2


### Load the dataset from previous notebook


In [2]:
lead_variant_maf_dataset = pl.read_parquet("../../data/lead-maf-vep/*.parquet")
print(lead_variant_maf_dataset.shape)
print(lead_variant_maf_dataset.columns)


(2622098, 20)
['variantId', 'studyId', 'studyLocusId', 'beta', 'zScore', 'pValueMantissa', 'pValueExponent', 'standardError', 'finemappingMethod', 'studyType', 'credibleSetSize', 'posteriorProbability', 'nSamples', 'nControls', 'nCases', 'majorPopulation', 'allelefrequencies', 'vepEffect', 'majorPopulationAF', 'majorPopulationMAF']


## Variance explained


The variance explained follows the simplified formula

${variance\;explained}=\chi^2 / n $

- The $\chi^2$ is calculated as **Inverse survival function** by using `scipy.stats.isf` function from lead variant $pValue$ (depicted as `pValueMantissa` and `pValueExponent`).
- The $n$ parameter is the number of samples derived from GWAS study description.

- In case where the `pValueExponent < 300` to avoid floating point errors we estimate $\chi^2$ statistic with $-log_{10}(pValue)$
- The $variance\;explained$ can be only calculated where the $n > 0$


In [3]:
def variance_explained(p_value_mantissa: pl.Expr, p_value_exponent: pl.Expr, n_samples: pl.Expr) -> pl.Expr:
    """Estimate the variance explained by the lead variant in a dataset.

    # NOTE! Calculate variance explained requires removal of the studies that have nSamples = 0
    """
    p_value = (p_value_mantissa.cast(pl.Float64()) * pl.lit(10).pow(p_value_exponent.cast(pl.Float64()))).alias(
        "pValue"
    )
    neglog_pval = -1 * p_value_mantissa.log10() + p_value_exponent
    neglog_approximation_intercept = -5.367
    neglog_approximation_coeff = 4.596
    chi2_stat = (
        pl.when(p_value_exponent < -300)
        .then(neglog_pval * neglog_approximation_coeff + neglog_approximation_intercept)
        .otherwise(p_value.map_elements(lambda x: chi2.isf(x, df=1), pl.Float64()))
    ).alias("chi2Stat")
    variance_exp = (chi2_stat / n_samples).alias("varianceExplained")

    return pl.struct(chi2_stat, p_value, variance_exp).alias("leadVariantStats")


In [4]:
df = lead_variant_maf_dataset.filter(pl.col("nSamples").is_not_null() | pl.col("nSamples") > 0).select(
    "*",
    variance_explained(
        pl.col("pValueMantissa"),
        pl.col("pValueExponent"),
        pl.col("nSamples"),
    ),
)


In [5]:
df.write_parquet("../../data/variance-explained.parquet")


## Data sanity checks


In [6]:
lead_variant_maf_dataset.filter((pl.col("nSamples") == 0) | (pl.col("nSamples").is_null())).shape[0]


1050

Exactly 1050 samples have no `nSamples` disallowing us to calculate the varianceExplained


In [7]:
lead_variant_maf_dataset.filter(
    (
        (pl.col("nCases").is_not_null() & pl.col("nCases") != 0)
        | (pl.col("nControls").is_not_null() & pl.col("nControls") != 0)
    )
).filter(pl.col("nSamples") != pl.col("nCases") + pl.col("nControls")).shape[0]


1590

Exaclty 1590 studies for assumed binary traits have diverging `nSamples` as opposed to sum of `nCases` and `nControls`


In [8]:
lead_variant_maf_dataset.filter(pl.col("majorPopulationMAF") == 0.0).shape[0]


3800

Exactly lead variants have MAF estimated on the basis of major population ancestry is equal to 0.0.


## Rescale marginal effect size


Rescaling of marginal effect size to the original value from the standardised marginal effect size is done via two formulas depending on trait being **quantitative** or **binary**

Estimation of the trait type is done on the basis of availability of reported `nCases` and `nControls` fields in the study description.

- In case both fields are non empty and non zero we assume _binary trait_
- In case cases are zero or are not reported we assume _quantitative trait_

In both cases we estimate the marginal effect size $estimated\;\beta$ with following formula
$$estimated\;\beta = zscore \cdot se$$

Where

- $zscore = \frac{\beta}{|{\beta}|} \cdot \sqrt{\chi^2}$
- $se$ depends on the trait type
- $\beta$ - _standardised beta reported from in the summary statistics_

In case when $\beta$ was not reported we assumed the $\frac{\beta}{|{\beta}|}$ to be equal to 1

#### Binary trait marginal effect size estimation

$$se = \frac{1}{\sqrt{(varG \cdot prev \cdot (1 - prev))}}$$

- $varG = 2 \cdot f \cdot (1 - f)$ - _component of genetic variance_ - the original is $var_{G} = 2\beta^2f(1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_
- $prev = \frac{nCases}{nSamples}$ - _Trait prevelance_

#### Quantative trait marginal effect size estimation

$$se = \frac{1}{\sqrt{varG}}$$

- $varG = 2 \cdot f \cdot (1 - f)$
- $f$ - _Minor Allele Frequency of lead variant_

The $\chi^2$ was esteimated as described in `variance Explained` calculation.


In [9]:
def rescale_beta(
    beta: pl.Expr,
    n_cases: pl.Expr,
    n_controls: pl.Expr,
    n_samples: pl.Expr,
    p_value_mantissa: pl.Expr,
    p_value_exponent: pl.Expr,
    maf: pl.Expr,
) -> pl.Expr:
    """Rescale beta to be between 0 and 1."""
    neglog_approximation_intercept = -5.367
    neglog_approximation_coeff = 4.596
    trait_class = (
        pl.when(n_cases.is_null())
        .then(pl.lit("quantitative"))
        .when((n_cases == 0) | (n_controls == 0))
        .then(pl.lit("quantitative"))
        .otherwise(pl.lit("binary"))
    )
    p_value = (p_value_mantissa.cast(pl.Float64()) * pl.lit(10).pow(p_value_exponent.cast(pl.Float64()))).alias(
        "pValue"
    )
    neglog_pval = -1 * p_value_mantissa.log10() + p_value_exponent
    # Calculate the chi2 value - the calculation has to be approximated to the -neglog pval in case when exponent is very low
    # otherwise the chi2 will be infinity.
    chi2_stat = (
        pl.when(p_value_exponent < -300)
        .then(neglog_pval * neglog_approximation_coeff + neglog_approximation_intercept)
        .otherwise(p_value.map_elements(lambda x: chi2.isf(x, df=1), pl.Float64()))
    )
    # In case beta is positive or not reported we use 1 as a sign
    effect_direction = pl.when(beta < 0).then(pl.lit(-1)).otherwise(pl.lit(1))
    z_score = effect_direction * chi2_stat.sqrt()
    var_g = 2 * maf * (1 - maf)
    prev = n_cases / n_samples
    se = (
        pl.when(trait_class == "quantitative")
        .then((1 / (var_g * n_samples)).sqrt())
        .otherwise((1 / (var_g * n_samples * prev * (1 - prev))).sqrt())
    )
    new_beta = z_score * se

    return pl.struct(
        new_beta.alias("estimatedBeta"),
        trait_class.alias("traitClass"),
        chi2_stat.alias("chi2Stat"),
        se.alias("estimatedSE"),
        var_g.alias("varG"),
        prev.alias("prev"),
        n_samples.alias("nSamples"),
    ).alias("rescaledStatistics")


In [10]:
df2 = df.select(
    "*",
    rescale_beta(
        pl.col("beta"),
        pl.col("nCases"),
        pl.col("nControls"),
        pl.col("nSamples"),
        pl.col("pValueMantissa"),
        pl.col("pValueExponent"),
        pl.col("majorPopulationMAF"),
    ),
)


In [11]:
# Write the dataset
df2.write_parquet("../../data/rescaled-betas.parquet")


## Data Sanity checks

Check the statistics of estimated beta for `binary` and `quantitative` traits


In [12]:
# binary estimatedBeta distribtution
df2.select(pl.col("rescaledStatistics").struct.unnest()).filter(pl.col("traitClass") == "binary").filter(
    pl.col("estimatedBeta").is_finite()
).select(pl.col("estimatedBeta").abs()).describe()


statistic,estimatedBeta
str,f64
"""count""",69584.0
"""null_count""",0.0
"""mean""",0.907922
"""std""",6.412485
"""min""",0.010357
"""25%""",0.04851
"""50%""",0.084865
"""75%""",0.187728
"""max""",820.178296


In [13]:
# quantitative estimatedBeta distribution
df2.select(pl.col("rescaledStatistics").struct.unnest()).filter(pl.col("traitClass") == "quantitative").filter(
    pl.col("estimatedBeta").is_finite()
).select(pl.col("estimatedBeta").abs()).describe()


statistic,estimatedBeta
str,f64
"""count""",2539980.0
"""null_count""",0.0
"""mean""",1.066917
"""std""",2.991953
"""min""",0.003516
"""25%""",0.413165
"""50%""",0.693346
"""75%""",1.06016
"""max""",576.10404


#### Number of lead variants per binary vs quantitative traits


In [14]:
df2.select(pl.col("rescaledStatistics").struct.unnest()).group_by("traitClass").len()


traitClass,len
str,u32
"""binary""",70360
"""quantitative""",2550688


#### Number of lead variants per binary vs quantitative traits per studyType


In [15]:
df2.select(pl.col("rescaledStatistics").struct.unnest(), pl.col("studyType")).group_by("traitClass", "studyType").len()


traitClass,studyType,len
str,str,u32
"""quantitative""","""gwas""",506061
"""quantitative""","""pqtl""",33737
"""quantitative""","""sceqtl""",52746
"""quantitative""","""sqtl""",223507
"""quantitative""","""tuqtl""",384937
"""binary""","""gwas""",70360
"""quantitative""","""eqtl""",1349700


#### Number of lead variants per binary vs quantitative traits in 1k


In [16]:
df2.select(
    pl.col("rescaledStatistics").struct.field("traitClass"),
    "nSamples",
    pl.when(pl.col("nSamples") > 10000)
    .then(pl.lit("nSamples > 10k"))
    .otherwise(pl.lit("nSamples < 10k"))
    .alias("nSamplesClass"),
).group_by("traitClass", "nSamplesClass").len().sort("nSamplesClass").filter(pl.col("traitClass") == "binary")


traitClass,nSamplesClass,len
str,str,u32
"""binary""","""nSamples < 10k""",5313
"""binary""","""nSamples > 10k""",65047


#### Analysis of missingness of the nSamples compared to the sum of nCases and nControls


In [17]:
### Compare nSamples and nCases + nControls
si = pl.read_parquet("../../data/study/*.parquet")
pl.Config.set_fmt_str_lengths(1000)
x = (
    df2.select(
        pl.col("nSamples"),
        (pl.col("nCases") + pl.col("nControls")).alias("sumNcasesNcontrols"),
        pl.col("rescaledStatistics").struct.field("traitClass"),
        pl.col("rescaledStatistics").struct.field("prev"),
        pl.col("nCases"),
        pl.col("nControls"),
        pl.col("studyId"),
    )
    .filter(pl.col("traitClass") == "binary")
    .select("*", (pl.col("nSamples") == pl.col("sumNcasesNcontrols")).alias("isEqual"))
    .filter(pl.col("isEqual") == False)
    .join(si.select("studyId", "traitFromSourceMappedIds", "initialSampleSize"), how="left", on="studyId")
)

x.filter(pl.col("nSamples") > 100_000)


nSamples,sumNcasesNcontrols,traitClass,prev,nCases,nControls,studyId,isEqual,traitFromSourceMappedIds,initialSampleSize
i32,i32,str,f64,i32,i32,str,bool,list[str],str
388324,74046,"""binary""",0.065873,25580,48466,"""GCST005922""",false,"[""EFO_0009268"", ""MONDO_0004975""]","""up to 42,034 British ancestry individuals with parental history of Alzheimer's disease, at least 272,244 British ancestry individuals with no parental history of Alzheimer's disease, 25,580 Alzheimer's disease cases, 48,466 controls"""
556676,120552,"""binary""",0.138493,77096,43456,"""GCST008595""",false,"[""EFO_0004337"", ""EFO_0004784"", ""MONDO_0005090""]","""328,917 individuals with educational attainment measurements, 107,207 individuals with cognitive ability measurements, 77,096 European ancestry schizophrenia cases, 43,456 European ancestry controls"""
351696,127265,"""binary""",0.022596,7947,119318,"""GCST009722""",false,"[""MONDO_0005041""]","""133,492 European ancestry individuals with intraocular pressure measurements, 90,939 European ancestry individuals with vertical cup-disc ratio measurements, 7,947 British ancestry glaucoma cases, 119,318 British ancestry controls"""
132876,135957,"""binary""",0.15665,20815,115142,"""GCST004296""",false,"[""EFO_0000275""]","""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls."""
1133353,375752,"""binary""",0.052653,59674,316078,"""GCST011062""",false,"[""EFO_0005763"", ""MONDO_0005277""]","""757,601 European ancestry individuals with blood pressure measurements, 59,674 European ancestry migraine cases, 316,078 European ancestry controls"""
…,…,…,…,…,…,…,…,…,…
556676,120552,"""binary""",0.138493,77096,43456,"""GCST008595""",false,"[""EFO_0004337"", ""EFO_0004784"", ""MONDO_0005090""]","""328,917 individuals with educational attainment measurements, 107,207 individuals with cognitive ability measurements, 77,096 European ancestry schizophrenia cases, 43,456 European ancestry controls"""
556676,120552,"""binary""",0.138493,77096,43456,"""GCST008595""",false,"[""EFO_0004337"", ""EFO_0004784"", ""MONDO_0005090""]","""328,917 individuals with educational attainment measurements, 107,207 individuals with cognitive ability measurements, 77,096 European ancestry schizophrenia cases, 43,456 European ancestry controls"""
1478037,218997,"""binary""",0.016567,24486,194511,"""GCST90399470""",false,"[""EFO_0004190""]","""21,839 European, African, Central/South Asian, East Asian or Hispanic cases, 1,237,201 European, African, Central/South Asian, East Asian or Hispanic controls, 15,229 European ancestry cases, 177,473 European ancestry controls, 9,257 African ancestry cases, 17,038 African ancestry controls"""
248710,202356,"""binary""",0.121563,30234,172122,"""GCST90129537""",false,"[""EFO_0004286"", ""EFO_0004629""]","""30,234 European or African American VTE cases, 172,122 European or African American VTE controls, 46,354 European, African American or Hispanic individuals with measurements"""


In [18]:
def discovery_samples(col: pl.Expr) -> pl.Expr:
    return col.str.split(", ").alias("S")


In [19]:
# w = pl.

x.select(discovery_samples(pl.col("initialSampleSize")), "nSamples", "nCases", "nControls", "initialSampleSize").limit(
    4
).tail(1).explode("S").select(
    pl.col("S").str.extract(r"[0-9,]+", 0).str.replace(r",\d+", "").cast(pl.Int64()).alias("Z"),
    "nCases",
    "nControls",
    "nSamples",
    "S",
    "initialSampleSize",
    pl.when(pl.col("S").str.contains("cases"))
    .then(pl.lit("cases"))
    .when(pl.col("S").str.contains("controls"))
    .then(pl.lit("controls"))
    .otherwise(pl.lit("unknown"))
    .alias("type"),
).select(
    "type",
    "nCases",
    "nControls",
    "nSamples",
    "S",
    "Z",
    "initialSampleSize",
    pl.col("Z").over("type").filter(pl.col("type") == "cases").sum().alias("sumCases"),
    pl.col("Z").over("type").filter(pl.col("type") == "controls").sum().alias("sumControls"),
    pl.col("Z").over("type").sum().alias("sumSamples"),
)


type,nCases,nControls,nSamples,S,Z,initialSampleSize,sumCases,sumControls,sumSamples
str,i32,i32,i32,str,i64,str,i64,i64,i64
"""cases""",20815,115142,132876,"""15,979 European ancestry cases""",15,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""controls""",20815,115142,132876,"""102,776 European ancestry controls""",102,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""cases""",20815,115142,132876,"""641 African American cases""",641,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""controls""",20815,115142,132876,"""5,234 African American controls""",5,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""cases""",20815,115142,132876,"""837 Japanese ancestry cases""",837,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""controls""",20815,115142,132876,"""3,293 Japanese controls""",3,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""cases""",20815,115142,132876,"""277 Hispanic cases""",277,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""cases""",20815,115142,132876,"""3,081 Hispanic controls,197 Brazilian ancestry cases""",3,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641
"""controls""",20815,115142,132876,"""758 Brazilian ancestry controls.""",758,"""15,979 European ancestry cases, 102,776 European ancestry controls, 641 African American cases, 5,234 African American controls, 837 Japanese ancestry cases, 3,293 Japanese controls, 277 Hispanic cases, 3,081 Hispanic controls,197 Brazilian ancestry cases, 758 Brazilian ancestry controls.""",1773,868,2641


- We extract the `nCases` and `nControls` incorrectly due to the mismatch in regex `,\s+`, when the space is not present.
- We miss the `cases` and `controls` words in the extracted strings, this mean that we count correctly the `nSamples` but we do not evaluate `nCases` and `nControls` correctly.
- There is no sanity check to see if `nCases` and `nControls` sum match the `nSamples` in the `annotate_discovery_sample_size` method in gentropy.
