# Filter initial dataset for plots


For the downstream analysis of MAF, rescaled effect size and variant effects, the `LeadVariantEffect` dataset is filtered as follows:

1. Limit the dataset to `gwas`, `cis-pqtl` and `eqtl` studies.
2. Split the GWAS studies into two types:
   - measurements (continuous traits)
   - diseases (binary traits bound to therapeutic areas)
3. Apply the `replicated` mask to the GWAS-measurements, GWAS-diseases and molecular QTL datasets.
4. Apply the `qualified` mask to the GWAS-measurements and GWAS-diseases datasets.
5. Apply a Posterior Inclusion Probability (PIP) filter to all lead variants, keeping only those with PIP >= 0.9.

After filtering, check how many variants remain in the dataset and compute summary statistics on the rescaled effect size and MAF:

6. Ensure that no lead variant has a MAF above 0.5 and that the MAF is not empty.
7. Ensure that no lead variant has an absolute rescaled effect size of 3 or more and that the effect size is not empty.
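
The two sanity checks can be sketched in plain Python as a single predicate. This is only an illustration, not the implementation behind the `maf_filter()` and `effect_size_filter()` methods used later in this notebook, which is not shown here. The bounds (`MAX_MAF`, `MAX_ABS_EFFECT`) and the function name are assumptions inferred from the summary statistics reported at the end of the notebook, where MAF lies between 0 and 0.5 and the absolute rescaled effect size stays below 3.

```python
import math
from typing import Optional

MAX_MAF = 0.5          # assumed upper bound for a minor allele frequency
MAX_ABS_EFFECT = 3.0   # assumed cut-off for the absolute rescaled effect size


def passes_outlier_filters(maf: Optional[float], rescaled_beta: Optional[float]) -> bool:
    """Return True when both values are present and inside the assumed bounds."""
    if maf is None or rescaled_beta is None:
        return False  # empty values are rejected
    if math.isnan(maf) or math.isnan(rescaled_beta):
        return False  # NaN is treated like an empty value
    return 0.0 < maf <= MAX_MAF and abs(rescaled_beta) < MAX_ABS_EFFECT


# Hypothetical example values
print(passes_outlier_filters(0.26, 0.05))  # inside both bounds
print(passes_outlier_filters(None, 1.0))   # missing MAF
print(passes_outlier_filters(0.30, 3.5))   # effect size too large
```

Keeping both conditions in one predicate makes it easy to unit-test the thresholds before applying the equivalent filter to the Spark DataFrame.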


## Data Loading

The data required for the analysis is loaded from the following datasets:

- `lead variant effect` dataset
- `qualified gwas measurements` dataset
- `qualified gwas diseases` dataset
- `replicated molecular qtls` dataset
- `replicated gwas` dataset


### Data downloading


In [1]:
gwas_therapeutic_areas_path = "../../data/gwas_therapeutic_areas"
qualifying_gwas_disease_studies_path = "../../data/qualifying_disease_studies"
qualifying_gwas_measurements_studies_path = "../../data/qualifying_measurements_studies"
qualifying_gwas_disease_credible_set_path = "../../data/qualifying_disease_credible_sets"
qualifying_gwas_measurements_credible_set_path = "../../data/qualifying_measurement_credible_sets"
lead_variant_effect_dataset_path = "../../data/lead_variant_effect"
lead_variant_effect_filtered_dataset_path = "../../data/lead_variant_effect_filtered"
replicated_molqtls_path = "../../data/replicated_molqtl_credible_sets"
replicated_gwas_path = "../../data/replicated_gwas_credible_sets"
replicated_credible_sets_path = "../../data/replicated_credible_sets"


In [None]:
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/gwas_therapeutic_areas $gwas_therapeutic_areas_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_studies $qualifying_gwas_disease_studies_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_measurements $qualifying_gwas_measurements_studies_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_credible_sets $qualifying_gwas_disease_credible_set_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_measurement_credible_sets $qualifying_gwas_measurements_credible_set_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/yt4/20250403_for_gentropy_paper/list_of_molqtls_replicated_CSs.parquet $replicated_molqtls_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/yt4/20250403_for_gentropy_paper/list_of_gwas_replicated_CSs.parquet $replicated_gwas_path


### Data reading


In [None]:
import narwhals as nw
from gentropy.common.session import Session
from gentropy.dataset.study_locus import StudyLocus
from pyspark.sql import functions as f

from manuscript_methods.datasets import LeadVariantEffect
from manuscript_methods.maf import MinorAlleleFrequency
from manuscript_methods.rescaled_beta import RescaledStatistics
from manuscript_methods.study_statistics import StudyStatistics, StudyType


In [3]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
full_lve = LeadVariantEffect.from_parquet(session=session, path=lead_variant_effect_dataset_path)
qualified_gwas_measurements_lve = LeadVariantEffect.from_parquet(
    session=session, path=qualifying_gwas_measurements_credible_set_path
)
qualified_gwas_disease_lve = LeadVariantEffect.from_parquet(
    session=session, path=qualifying_gwas_disease_credible_set_path
)
replicated_molqtls = session.spark.read.parquet(replicated_molqtls_path)
replicated_gwas = session.spark.read.parquet(replicated_gwas_path)




## Analysis steps

1. Filter the full `LeadVariantEffect` (lve) dataset to keep only cis-pqtl and eqtl studies.
2. Union the result with the qualified GWAS measurements and GWAS diseases lve datasets.
3. Keep only lead variants whose credible sets appear in the replicated GWAS and molQTL datasets.
4. Remove lead variants with MAF above 0.5 or with an absolute rescaled effect size of 3 or more.
5. Save the filtered dataset to parquet.


In [4]:
print(f"Before filtering qtls: {full_lve.df.count():,}")
molqtl_lve = LeadVariantEffect(
    full_lve.df.filter(StudyStatistics().study_type.isin(StudyType.CIS_PQTL, StudyType.EQTL))
)
print(f"After filtering qtls: {molqtl_lve.df.count():,}")


Before filtering qtls: 2,833,758
After filtering qtls: 1,365,531


In [5]:
print(f"Before union - measurements: {qualified_gwas_measurements_lve.df.count():,}")
print(f"Before union - disease: {qualified_gwas_disease_lve.df.count():,}")
print(f"Before union - molqtl: {molqtl_lve.df.count():,}")

study_stats = StudyStatistics()

# Adjust the study types
qualified_gwas_measurements_lve = LeadVariantEffect(
    qualified_gwas_measurements_lve.df.withColumn(
        study_stats.name, study_stats.transform_study_type(StudyType.GWAS_MEASUREMENT).col
    )
)
qualified_gwas_disease_lve = LeadVariantEffect(
    qualified_gwas_disease_lve.df.withColumn(
        study_stats.name, study_stats.transform_study_type(StudyType.GWAS_DISEASE).col
    )
)


qualified_lve = LeadVariantEffect(
    molqtl_lve.df.unionByName(qualified_gwas_measurements_lve.df).unionByName(qualified_gwas_disease_lve.df)
)
print(f"After union: {qualified_lve.df.count():,}")


Before union - measurements: 472,873
Before union - disease: 72,444
Before union - molqtl: 1,365,531
After union: 1,910,848
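
As a quick consistency check on the counts printed above: `unionByName` appends rows without deduplicating them, so the combined row count should equal the sum of the three inputs. A minimal sketch in plain Python, with the counts copied from the cell output (the variable names are illustrative only):

```python
# Row counts copied from the cell output above.
counts_before_union = {
    "gwas-measurement": 472_873,
    "gwas-disease": 72_444,
    "molqtl": 1_365_531,
}

# A union without deduplication preserves every input row.
expected_after_union = sum(counts_before_union.values())
print(f"Expected rows after union: {expected_after_union:,}")  # 1,910,848
```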


In [6]:
# Filter based on replicated credible sets
print(f"Before filtering replicated credible sets: {qualified_lve.df.count():,}")
print(f"Before filtering replicated GWAS: {replicated_gwas.count():,}")
print(f"Before filtering replicated molQTLs: {replicated_molqtls.count():,}")
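# Persist the combined replicated credible sets so they can be re-read as a StudyLocus.
# The write returns None, so the result is not assigned to a variable.
replicated_gwas.unionByName(replicated_molqtls).write.mode("overwrite").parquet(replicated_credible_sets_path)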
replicated_cs = StudyLocus.from_parquet(session=session, path=replicated_credible_sets_path)
cs_count = replicated_cs.df.count()
print(f"Replicated credible sets count: {cs_count:,}")
replicated_qualified_lve = qualified_lve.filter_by_study_locus_id(replicated_cs)
print(f"After filtering replicated credible sets: {replicated_qualified_lve.df.count():,}")


Before filtering replicated credible sets: 1,910,848
Before filtering replicated GWAS: 263,705
Before filtering replicated molQTLs: 1,461,445
Replicated credible sets count: 1,725,150
StudyLocus dimension: 1725150, unique studyLocusId: 1725150
Initial rows: 1910848
Filtered 774663 rows based on the StudyLocus.
Remaining rows: 1136185
After filtering replicated credible sets: 1,136,185
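
The log above suggests that `filter_by_study_locus_id` keeps only the rows whose `studyLocusId` is present in the replicated `StudyLocus`, which is the behaviour of a semi-join. Its implementation is not shown in this notebook, so the sketch below is an assumption about the logic, written in plain Python with a hypothetical helper `keep_replicated` and toy identifiers. It also re-derives the remaining row count from the logged numbers.

```python
from typing import Iterable


def keep_replicated(rows: Iterable[dict], replicated_ids: Iterable[str]) -> list[dict]:
    """Keep rows whose studyLocusId is in the replicated set (semi-join logic)."""
    lookup = set(replicated_ids)  # set membership is O(1) per row
    return [row for row in rows if row["studyLocusId"] in lookup]


# Toy data with hypothetical identifiers.
rows = [{"studyLocusId": "a"}, {"studyLocusId": "b"}, {"studyLocusId": "c"}]
kept = keep_replicated(rows, ["a", "c", "z"])
print([row["studyLocusId"] for row in kept])  # ['a', 'c']

# Row accounting with the numbers from the log above.
initial_rows, filtered_rows = 1_910_848, 774_663
remaining_rows = initial_rows - filtered_rows
print(f"Remaining rows: {remaining_rows:,}")  # 1,136,185
```

Identifiers that exist only in the replicated set (`"z"` in the toy data) do not add rows, which is why the result can never be larger than the input.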


In [18]:
# Removal of maf and beta outliers
print(f"Before filtering maf and beta outliers: {replicated_qualified_lve.df.count():,}")
lve = replicated_qualified_lve.maf_filter().effect_size_filter()
print(f"After filtering maf and beta outliers: {lve.df.count():,}")
lve.df.write.mode("overwrite").parquet(lead_variant_effect_filtered_dataset_path)


Before filtering maf and beta outliers: 1,136,185
After filtering maf and beta outliers: 1,116,215


## Data after MAF and beta filtering


In [17]:
lve.df.select(RescaledStatistics().estimated_beta, f.col("majorLdPopulationMaf.value").alias("MAF")).describe().show()


+-------+--------------------+-------------------+
|summary|       estimatedBeta|                MAF|
+-------+--------------------+-------------------+
|  count|             1116215|            1116215|
|   mean|0.045912691340399284|0.26083995977706037|
| stddev|  0.9200104685863284|0.14225958637683978|
|    min| -2.9997936995577867|7.62779092394584E-6|
|    max|   2.999975493530603|                0.5|
+-------+--------------------+-------------------+


In [8]:
# Check the number of studyTypes after all filtering
from manuscript_methods import group_statistics

group_statistics(lve.df.select("studyStatistics.studyType"), [f.col("studyType")]).show()


+----------------+------+-----+-------------------+
|       studyType| count|    %|         percentage|
+----------------+------+-----+-------------------+
|            eqtl|924848|82.86|  82.85572223989107|
|gwas-measurement|164967|14.78| 14.779142011171684|
|    gwas-disease| 24276| 2.17|  2.174849827318214|
|        cis-pqtl|  2124| 0.19|0.19028592161904292|
+----------------+------+-----+-------------------+
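
The percentages in the table can be reproduced from the counts alone. The sketch below is plain Python and does not reflect how the `group_statistics` helper is implemented, which is not shown here; it only assumes that the `%` column is the group share of the total, rounded to two decimals.

```python
# Counts per study type copied from the table above.
study_type_counts = {
    "eqtl": 924_848,
    "gwas-measurement": 164_967,
    "gwas-disease": 24_276,
    "cis-pqtl": 2_124,
}

total = sum(study_type_counts.values())

# Share of each study type as a percentage of all lead variants.
shares = {name: round(100 * count / total, 2) for name, count in study_type_counts.items()}

print(f"Total lead variants: {total:,}")  # 1,116,215
for name, share in shares.items():
    print(f"{name}: {share}%")
```

The total matches the row count reported after the MAF and beta filtering, so no study type outside these four remains in the filtered dataset.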

