# Data preparation

This notebook collects information about the lead variants of credible sets.
For each lead variant it adds:

- the major-population sample size and the numbers of cases and controls, taken from `studyIndex`,
- the consequence score derived from VEP annotations in `variantIndex`,
- the MAF (minor allele frequency), calculated from the gnomAD v4.2 allele frequency of the study's major ancestry population,
- the variance explained by the lead variant.
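
The last two items can be illustrated with a minimal sketch. It is not taken from this notebook's own code: the function names `minor_allele_frequency` and `variance_explained` are hypothetical, and the formula $2 \cdot MAF \cdot (1 - MAF) \cdot \beta^2$ is an assumption that applies to an additive effect on a standardised quantitative trait. The project may define variance explained differently.

```python
def minor_allele_frequency(allele_frequency: float) -> float:
    """Fold an allele frequency into a minor allele frequency in [0, 0.5]."""
    if not 0.0 <= allele_frequency <= 1.0:
        raise ValueError("allele frequency must lie between 0 and 1")
    return min(allele_frequency, 1.0 - allele_frequency)


def variance_explained(beta: float, maf: float) -> float:
    """Approximate variance explained by one variant.

    Assumed formula (not confirmed by the notebook): 2 * MAF * (1 - MAF) * beta^2.
    """
    return 2.0 * maf * (1.0 - maf) * beta**2


# Illustrative values only: an alternate-allele frequency of 0.8 and beta of 0.1.
maf = minor_allele_frequency(0.8)
print(maf, variance_explained(beta=0.1, maf=maf))
```

Folding the frequency with `min(p, 1 - p)` keeps the result independent of which allele was recorded as the alternate allele.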


## Data extraction and loading

The data for this analysis come from three Open Targets Platform datasets, which have to be downloaded from the FTP server (here with rsync):

- credible_set
- variant
- study


In [1]:
# Ensure proper java version < 11
!java -version




In [2]:
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/credible_set ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/study ../../data/.
!rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.03/output/variant ../../data/.




### Loading the data with gentropy


In [6]:
from gentropy.common.session import Session
from gentropy.dataset.study_index import StudyIndex
from gentropy.dataset.study_locus import StudyLocus
from gentropy.dataset.variant_index import VariantIndex


In [7]:
session = Session(extended_spark_conf={"spark.driver.memory": "40G"})
variant_index_path = "../../data/variant"
study_index_path = "../../data/study"
credible_set_path = "../../data/credible_set"


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/06/04 13:08:00 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [None]:
session.spark


In [11]:
vi = VariantIndex.from_parquet(session, variant_index_path)
si = StudyIndex.from_parquet(session, study_index_path)
cs = StudyLocus.from_parquet(session, credible_set_path)


In [11]:
vi.df.show(n=1)
si.df.show(n=1)
cs.df.show(n=1)


+---------------+----------+--------+---------------+---------------+--------------------+-----------------------+----------------------+------------+----------------+--------------------+--------------------+--------------------+
|      variantId|chromosome|position|referenceAllele|alternateAllele|       variantEffect|mostSevereConsequenceId|transcriptConsequences|       rsIds|          hgvsId|   alleleFrequencies|             dbXrefs|  variantDescription|
+---------------+----------+--------+---------------+---------------+--------------------+-----------------------+----------------------+------------+----------------+--------------------+--------------------+--------------------+
|11_69447201_C_A|        11|69447201|              C|              A|[{VEP, upstream_g...|             SO_0001631|  [{[SO_0001632], N...|[rs58111031]|11:g.69447201C>A|[{sas_adj, 8.2850...|[{rs58111031, ens...|Upstream gene var...|
+---------------+----------+--------+---------------+---------------+-------

## MAF dataset

The dataset below contains the lead variants from credible sets, with the following information:

- maf
- major population used to calculate maf
- trait information derived from the `traitFromSourceMappedIds` or `geneId` field, depending on the `studyType` found in the study index
- allele frequencies derived from the variant index (gnomAD) for the major population found in `ldPopulationStructure` in the study index
- vep score
- information about the cases and controls counts from study index
- variant association statistics from study locus
- study type
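
The selection of the major population and its allele frequency described in the list above could look roughly like the sketch below. This is an illustration, not the implementation used by `create_maf_dataset`. Plain dictionaries stand in for Spark rows, the helper names are hypothetical, and the field names `ldPopulation`, `relativeSampleSize`, `populationName` and `alleleFrequency`, as well as the `_adj` suffix on gnomAD population labels, are assumptions about the schema. All values are made up.

```python
from typing import Dict, List, Optional


def major_population(ld_population_structure: List[Dict]) -> Optional[str]:
    """Return the population with the largest relative sample size, if any."""
    if not ld_population_structure:
        return None
    top = max(ld_population_structure, key=lambda p: p["relativeSampleSize"])
    return top["ldPopulation"]


def population_allele_frequency(
    allele_frequencies: List[Dict], population: str, suffix: str = "_adj"
) -> Optional[float]:
    """Look up the allele frequency recorded for one population label."""
    label = f"{population}{suffix}"  # assumed naming, e.g. "sas_adj"
    for entry in allele_frequencies:
        if entry["populationName"] == label:
            return entry["alleleFrequency"]
    return None


# Made-up example rows (assumed field names).
study_populations = [
    {"ldPopulation": "nfe", "relativeSampleSize": 0.9},
    {"ldPopulation": "sas", "relativeSampleSize": 0.1},
]
variant_frequencies = [
    {"populationName": "sas_adj", "alleleFrequency": 0.12},
    {"populationName": "nfe_adj", "alleleFrequency": 0.7},
]

population = major_population(study_populations)
af = population_allele_frequency(variant_frequencies, population)
maf = None if af is None else min(af, 1.0 - af)
print(population, af, maf)
```

Returning `None` when the study has no population structure or the variant has no frequency for that population keeps missing data explicit instead of silently defaulting to another ancestry.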

The code below collects all fields required to analyse MAF and variant effects.


### Building dataset


In [9]:
from manuscript_methods.datasets.maf_dataset import create_maf_dataset


The MAF dataset consists of specific fields from the credible sets, the study index and the variant index.


In [12]:
dataset_maf = create_maf_dataset(si, cs, vi)


Save the dataset for further analysis.


In [8]:
dataset_maf.write.mode("overwrite").parquet("../../data/lead-maf-vep")


                                                                                