# Filter initial dataset for plots


For the downstream analysis of MAF, rescaled effect size and variant effects, the `LeadVariantEffect` dataset is filtered as follows:

1. Limit the dataset to the `gwas`, `cis-pqtl` and `eqtl` study types.
2. Split the GWAS studies into two types:
   - measurements (continuous traits)
   - diseases (binary traits bound to therapeutic areas)
3. Apply the `replicated` mask to the GWAS-measurements, GWAS-diseases and molecular QTL datasets.
4. Apply the `qualified` mask to the GWAS-measurements and GWAS-diseases datasets.
5. Apply a Posterior Inclusion Probability (PIP) filter to all lead variants, keeping only those with PIP >= 0.1.

After filtering, check how many variants remain in the dataset and compute statistics on rescaled effect size and MAF:

1. Ensure that no lead variant has a MAF equal to 0 or a missing MAF.
2. Ensure that no lead variant has a missing rescaled effect size or one whose absolute value exceeds 3.

The resulting dataset is saved in two versions:

- lead variant effects with MAF >= 0.01 (`qualified_lead_variant_effect_maf_filtered`)
- lead variant effects for all remaining variants, without the MAF >= 0.01 cut-off (`qualified_lead_variant_effect`)
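
The criteria above can be expressed as simple row-level predicates. The following is a minimal sketch on plain Python dictionaries, not the `manuscript_methods` implementation. The field names (`maf`, `rescaled_beta`, `pip`), the helper names and the exact boundary handling of the effect-size cut-off are assumptions made for illustration.

```python
from typing import Optional, TypedDict


class LeadVariant(TypedDict):
    """Toy record; the field names are hypothetical, not the real schema."""

    maf: Optional[float]
    rescaled_beta: Optional[float]
    pip: Optional[float]


def passes_quality_filters(row: LeadVariant, pip_threshold: float = 0.1, beta_limit: float = 3.0) -> bool:
    """Return True if the row has a usable MAF, a bounded effect size and PIP >= threshold."""
    maf, beta, pip = row["maf"], row["rescaled_beta"], row["pip"]
    if maf is None or maf == 0.0:
        return False
    if beta is None or abs(beta) > beta_limit:  # boundary handling is an assumption
        return False
    return pip is not None and pip >= pip_threshold


def split_by_maf(rows, maf_threshold: float = 0.01):
    """Return (all qualified rows, qualified rows with MAF >= maf_threshold)."""
    qualified = [row for row in rows if passes_quality_filters(row)]
    maf_filtered = [row for row in qualified if row["maf"] >= maf_threshold]
    return qualified, maf_filtered


rows = [
    {"maf": 0.25, "rescaled_beta": 0.4, "pip": 0.9},    # kept in both outputs
    {"maf": 0.005, "rescaled_beta": -1.2, "pip": 0.5},  # kept only without the MAF cut-off
    {"maf": 0.0, "rescaled_beta": 0.1, "pip": 0.8},     # dropped: MAF == 0
    {"maf": None, "rescaled_beta": 0.1, "pip": 0.8},    # dropped: MAF missing
    {"maf": 0.3, "rescaled_beta": 5.0, "pip": 0.8},     # dropped: effect-size outlier
    {"maf": 0.3, "rescaled_beta": 0.2, "pip": 0.05},    # dropped: PIP below 0.1
]
qualified, maf_filtered = split_by_maf(rows)
print(len(qualified), len(maf_filtered))
```

Note that the MAF-filtered output is a subset of the unfiltered one, so the two saved datasets overlap rather than partition the data.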


## Data Loading

The data required for the analysis is loaded from the following datasets:

- `lead variant effect` dataset
- `qualified gwas measurements` dataset
- `qualified gwas diseases` dataset
- `replicated molecular qtls` dataset
- `replicated gwas` dataset


### Data downloading


In [34]:
gwas_therapeutic_areas_path = "../../data/gwas_therapeutic_areas"
qualifying_gwas_disease_studies_path = "../../data/qualifying_disease_studies"
qualifying_gwas_measurements_studies_path = "../../data/qualifying_measurements_studies"
qualifying_gwas_disease_credible_set_path = "../../data/qualifying_disease_credible_sets"
qualifying_gwas_measurements_credible_set_path = "../../data/qualifying_measurement_credible_sets"
lead_variant_effect_dataset_path = "../../data/lead_variant_effect"
qualified_lead_variant_effect_path = "../../data/qualified_lead_variant_effect"
qualified_lead_variant_effect_maf_filtered_path = "../../data/qualified_lead_variant_effect_maf_filtered"
replicated_molqtls_path = "../../data/replicated_molqtl_credible_sets"
replicated_gwas_path = "../../data/replicated_gwas_credible_sets"
replicated_credible_sets_path = "../../data/replicated_credible_sets"


In [35]:
!gcloud storage rsync -r --delete-unmatched-destination-objects  gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/gwas_therapeutic_areas $gwas_therapeutic_areas_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_studies $qualifying_gwas_disease_studies_path
!gcloud storage rsync -r --delete-unmatched-destination-objects  gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_measurements $qualifying_gwas_measurements_studies_path
!gcloud storage rsync -r --delete-unmatched-destination-objects  gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_credible_sets $qualifying_gwas_disease_credible_set_path
!gcloud storage rsync -r --delete-unmatched-destination-objects  gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_measurement_credible_sets $qualifying_gwas_measurements_credible_set_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/yt4/20250403_for_gentropy_paper/list_of_molqtls_replicated_CSs.parquet $replicated_molqtls_path
!gcloud storage rsync -r --delete-unmatched-destination-objects gs://genetics-portal-dev-analysis/yt4/20250403_for_gentropy_paper/list_of_gwas_replicated_CSs.parquet $replicated_gwas_path


At file://../../data/gwas_therapeutic_areas/**, worker process 3482 thread 8797626112 listed 3...
At gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/gwas_therapeutic_areas/**, worker process 3482 thread 8797626112 listed 3...
  Completed files 0 | 0B                                                       



At file://../../data/qualifying_disease_studies/**, worker process 3589 thread 8797626112 listed 3...
At gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_studies/**, worker process 3589 thread 8797626112 listed 3...
  Completed files 0 | 0B                                                       
At file://../../data/qualifying_measurements_studies/**, worker process 3684 thread 8797626112 listed 3...
At gs://genetics-portal-dev-analysis/dc16/output/gentropy_paper/qualifying_measurements/**, worker process 3684 thread 8797626112 lis

### Data reading


In [36]:
from gentropy.common.session import Session
from gentropy.dataset.study_locus import StudyLocus
from pyspark.sql import functions as f

from manuscript_methods.datasets import LeadVariantEffect
from manuscript_methods.locus_statistics import LocusStatistics
from manuscript_methods.rescaled_beta import RescaledStatistics
from manuscript_methods.study_statistics import StudyStatistics, StudyType


In [37]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
full_lve = LeadVariantEffect.from_parquet(session=session, path=lead_variant_effect_dataset_path)
qualified_gwas_measurements_lve = LeadVariantEffect.from_parquet(
    session=session, path=qualifying_gwas_measurements_credible_set_path
)
qualified_gwas_disease_lve = LeadVariantEffect.from_parquet(
    session=session, path=qualifying_gwas_disease_credible_set_path
)
replicated_molqtls = session.spark.read.parquet(replicated_molqtls_path)
replicated_gwas = session.spark.read.parquet(replicated_gwas_path)


## Analysis steps

1. Filter the full LVE dataset to keep only the cis-pQTL and eQTL study types.
2. Union the result with the qualified GWAS measurements and GWAS diseases LVE datasets.
3. Keep only lead variants whose credible sets appear in the replicated GWAS or molQTL lists.
4. Remove lead variants with MAF == 0.0 or a missing MAF, and effect-size outliers (absolute value of rescaled effect size above 3).
5. Keep only lead variants with PIP >= 0.1.
6. Save the filtered dataset to parquet.
7. Apply the MAF >= 0.01 filter.
8. Save the MAF-filtered dataset to parquet.


In [38]:
print(f"Before filtering qtls: {full_lve.df.count():,}")
molqtl_lve = LeadVariantEffect(
    full_lve.df.filter(StudyStatistics().study_type.isin(StudyType.CIS_PQTL, StudyType.EQTL))
)
print(f"After filtering qtls: {molqtl_lve.df.count():,}")


Before filtering qtls: 2,833,758
After filtering qtls: 1,365,531


In [39]:
print(f"Before union - measurements: {qualified_gwas_measurements_lve.df.count():,}")
print(f"Before union - disease: {qualified_gwas_disease_lve.df.count():,}")
print(f"Before union - molqtl: {molqtl_lve.df.count():,}")

study_stats = StudyStatistics()

# Adjust the study types
qualified_gwas_measurements_lve = LeadVariantEffect(
    qualified_gwas_measurements_lve.df.withColumn(
        study_stats.name, study_stats.transform_study_type(StudyType.GWAS_MEASUREMENT).col
    )
)
qualified_gwas_disease_lve = LeadVariantEffect(
    qualified_gwas_disease_lve.df.withColumn(
        study_stats.name, study_stats.transform_study_type(StudyType.GWAS_DISEASE).col
    )
)


qualified_lve = LeadVariantEffect(
    molqtl_lve.df.unionByName(qualified_gwas_measurements_lve.df).unionByName(qualified_gwas_disease_lve.df)
)
print(f"After union: {qualified_lve.df.count():,}")


Before union - measurements: 472,873
Before union - disease: 72,444
Before union - molqtl: 1,365,531
After union: 1,910,848


In [40]:
# Filter based on replicated credible sets
print(f"Before filtering replicated credible sets: {qualified_lve.df.count():,}")
print(f"Before filtering replicated GWAS: {replicated_gwas.count():,}")
print(f"Before filtering replicated molQTLs: {replicated_molqtls.count():,}")
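# Persist the combined list of replicated credible sets so it can be reloaded as a StudyLocus.
# DataFrameWriter.parquet returns None, so its result is not assigned to a variable.
replicated_gwas.unionByName(replicated_molqtls).write.mode("overwrite").parquet(replicated_credible_sets_path)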
replicated_cs = StudyLocus.from_parquet(session=session, path=replicated_credible_sets_path)
cs_count = replicated_cs.df.count()
print(f"Replicated credible sets count: {cs_count:,}")
replicated_qualified_lve = qualified_lve.filter_by_study_locus_id(replicated_cs)
print(f"After filtering replicated credible sets: {replicated_qualified_lve.df.count():,}")


Before filtering replicated credible sets: 1,910,848
Before filtering replicated GWAS: 263,705
Before filtering replicated molQTLs: 1,461,445


                                                                                

Replicated credible sets count: 1,725,150


                                                                                

StudyLocus dimension: 1725150, unique studyLocusId: 1725150
Initial rows: 1910848
Filtered 774663 rows based on the StudyLocus.
Remaining rows: 1136185
After filtering replicated credible sets: 1,136,185
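
The log above reflects a keep-if-present filter: a lead variant survives only if its `studyLocusId` appears in the combined list of replicated credible sets. The sketch below mimics that behaviour with a Python set and re-checks the reported counts. It is an illustration only, not the code behind `filter_by_study_locus_id`; the helper name and the toy identifiers are hypothetical.

```python
def keep_replicated(rows, replicated_ids):
    """Semi-join style filter: keep rows whose studyLocusId is in replicated_ids."""
    allowed = set(replicated_ids)
    return [row for row in rows if row["studyLocusId"] in allowed]


toy_rows = [{"studyLocusId": "a"}, {"studyLocusId": "b"}, {"studyLocusId": "c"}]
kept = keep_replicated(toy_rows, replicated_ids=["a", "c", "z"])
print([row["studyLocusId"] for row in kept])  # ['a', 'c']

# Counts copied from the cell output above.
replicated_gwas_count = 263_705
replicated_molqtl_count = 1_461_445
initial_rows = 1_910_848
filtered_rows = 774_663

combined = replicated_gwas_count + replicated_molqtl_count
remaining = initial_rows - filtered_rows
print(f"{combined:,} {remaining:,}")  # 1,725,150 1,136,185
```

The combined count equals the number of unique `studyLocusId` values reported in the log, so the GWAS and molQTL lists do not share identifiers.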


In [41]:
# Remove effect-size (beta) outliers and lead variants with MAF == 0 or a missing MAF
print(f"Before filtering maf and beta outliers: {replicated_qualified_lve.df.count():,}")
lve_maf = replicated_qualified_lve.maf_filter(threshold=None).effect_size_filter()
print(f"After filtering maf and beta outliers: {lve_maf.df.count():,}")


Before filtering maf and beta outliers: 1,136,185




After filtering maf and beta outliers: 1,116,215


                                                                                

In [42]:
# Remove variants with low PIP
pip_threshold = 0.1
locus_stats = LocusStatistics()
print(f"Before filtering PIP outliers: {lve_maf.df.count():,}")
lve = lve_maf.filter(locus_stats.col.getField("leadVariantPIP") >= pip_threshold)
print(f"After filtering PIP outliers: {lve.df.count():,}")


                                                                                

Before filtering PIP outliers: 1,116,215
After filtering PIP outliers: 949,753


                                                                                

In [43]:
lve.df.write.mode("overwrite").parquet(qualified_lead_variant_effect_path)


                                                                                

In [44]:
# Keep lead variants with MAF >= 0.01 (maf_filter() is called with its default threshold)
print(f"Before MAF filtering: {lve.df.count():,}")
lve_maf_filtered = lve.maf_filter()
print(f"After MAF filtering: {lve_maf_filtered.df.count():,}")


                                                                                

Before MAF filtering: 949,753
After MAF filtering: 937,597
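
To see how much each step removes, the row counts printed by the cells above can be compared pairwise. This is a small bookkeeping sketch: the counts are copied from the outputs in this notebook, while the step labels and the helper name are chosen here for illustration.

```python
# Row counts copied from the cell outputs in this notebook.
steps = [
    ("union of molQTL and qualified GWAS", 1_910_848),
    ("replicated credible sets", 1_136_185),
    ("MAF and effect-size clean-up", 1_116_215),
    ("PIP >= 0.1", 949_753),
    ("MAF >= 0.01", 937_597),
]


def retention_table(steps):
    """Return (label, rows, removed, percent kept vs. previous step) for every step after the first."""
    table = []
    for (_, previous_rows), (label, rows) in zip(steps, steps[1:]):
        kept_pct = round(100 * rows / previous_rows, 2)
        table.append((label, rows, previous_rows - rows, kept_pct))
    return table


table = retention_table(steps)
for label, rows, removed, kept_pct in table:
    print(f"{label:<30} rows={rows:>9,} removed={removed:>7,} kept={kept_pct}%")
```

The replication filter removes by far the most rows (774,663), followed by the PIP filter (166,462).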


In [45]:
lve_maf_filtered.df.write.mode("overwrite").parquet(qualified_lead_variant_effect_maf_filtered_path)


                                                                                

## Sanity check


In [46]:
lve.df.select(RescaledStatistics().estimated_beta, f.col("majorLdPopulationMaf.value").alias("MAF")).describe().show()




+-------+-------------------+-------------------+
|summary|      estimatedBeta|                MAF|
+-------+-------------------+-------------------+
|  count|             949753|             949753|
|   mean|0.04908040009332393| 0.2596172211727276|
| stddev|  0.922123726627298|  0.143303106502015|
|    min|-2.9997936995577867|7.62779092394584E-6|
|    max|  2.999975493530603|                0.5|
+-------+-------------------+-------------------+
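
The summary can be checked against the filtering criteria programmatically. Spark's `describe()` reports the number of non-null values as `count`, so a count equal to the number of rows means no missing values remain. The sketch below hard-codes the values from the table above; the helper name and the effect-size limit of 3 as an inclusive bound are assumptions.

```python
# Values copied from the describe() output above.
summary = {
    "estimatedBeta": {"count": 949_753, "min": -2.9997936995577867, "max": 2.999975493530603},
    "MAF": {"count": 949_753, "min": 7.62779092394584e-6, "max": 0.5},
}
total_rows = 949_753  # row count after the PIP filter


def summary_problems(summary, total_rows, beta_limit=3.0):
    """Return a list of human-readable violations; an empty list means all checks passed."""
    problems = []
    for column, stats in summary.items():
        if stats["count"] != total_rows:
            problems.append(f"{column}: missing values")
    beta = summary["estimatedBeta"]
    if max(abs(beta["min"]), abs(beta["max"])) > beta_limit:
        problems.append("estimatedBeta: outliers remain")
    maf = summary["MAF"]
    if not (0.0 < maf["min"] and maf["max"] <= 0.5):
        problems.append("MAF: value outside (0, 0.5]")
    return problems


problems = summary_problems(summary, total_rows)
print(problems or "all checks passed")
```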



                                                                                

In [47]:
# Check the number of studyTypes after all filtering
from manuscript_methods import group_statistics

group_statistics(lve.df.select("studyStatistics.studyType"), [f.col("studyType")]).show()




+----------------+------+-----+-------------------+
|       studyType| count|    %|         percentage|
+----------------+------+-----+-------------------+
|            eqtl|778127|81.93|  81.92940690895423|
|gwas-measurement|148004|15.58| 15.583420110281304|
|    gwas-disease| 21615| 2.28|  2.275854880163579|
|        cis-pqtl|  2007| 0.21|0.21131810060089307|
+----------------+------+-----+-------------------+
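
The percentages in the table can be reproduced from the counts alone, which also confirms that the four study types add up to the 949,753 rows left after the PIP filter. This sketch recomputes the shares in plain Python. It does not call `group_statistics`, whose internals are not shown here.

```python
# Counts copied from the table above.
counts = {
    "eqtl": 778_127,
    "gwas-measurement": 148_004,
    "gwas-disease": 21_615,
    "cis-pqtl": 2_007,
}

total = sum(counts.values())
# Share of each study type, rounded to two decimals like the "%" column.
shares = {study_type: round(100 * n / total, 2) for study_type, n in counts.items()}

print(f"total rows: {total:,}")
for study_type, share in shares.items():
    print(f"{study_type:<17} {share:>6.2f}%")
```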



                                                                                