# Benchmarking new PICS implementation

The objective of this notebook is to compare the new PICS implementation, run on GWAS Catalog associations with a gnomAD LD reference, against the previous implementation, which used the 1000 Genomes Phase 3 LD reference.

In [4]:
import pyspark.sql.functions as f
import pyspark.sql.types as t
from pyspark.sql import SparkSession, DataFrame

spark = SparkSession.builder.getOrCreate()


In [5]:
new_study_locus = spark.read.parquet("gs://genetics_etl_python_playground/XX.XX/output/python_etl/parquet/pics_credible_set/")
new_study_locus.printSchema()

                                                                                

root
 |-- chromosome: string (nullable = true)
 |-- studyId: string (nullable = true)
 |-- variantId: string (nullable = true)
 |-- tagVariantId: string (nullable = true)
 |-- R_overall: double (nullable = true)
 |-- pics_mu: double (nullable = true)
 |-- pics_postprob: double (nullable = true)
 |-- pics_95_perc_credset: boolean (nullable = true)
 |-- pics_99_perc_credset: boolean (nullable = true)



## Observations on new study-locus dataset

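- Some records lack chromosome or variantId information (expected, since the QC column is missing). The count below covers records whose `variantId` is null or contains `rs`, or whose `chromosome` is null.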

In [6]:
new_study_locus.filter(~(~f.col("variantId").contains("rs") & f.col("variantId").isNotNull() & f.col("chromosome").isNotNull())).count()

                                                                                

71723

In [7]:
new_study_locus_filtered = new_study_locus.filter(~f.col("variantId").contains("rs") & f.col("variantId").isNotNull() & f.col("chromosome").isNotNull())
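The filter logic can be sketched in plain Python, as a rough illustration only. The records below are hypothetical, and treating any `variantId` that contains `rs` as an rsID is an assumption carried over from the filter above.

```python
# Hypothetical records: (chromosome, variantId)
records = [
    ("1", "1_159436169_T_C"),  # kept: chrom_pos_ref_alt identifier
    ("1", "rs12345"),          # dropped: contains "rs"
    (None, "2_1000_A_G"),      # dropped: chromosome missing
    ("3", None),               # dropped: variantId missing
]


def is_valid(chromosome, variant_id):
    """Plain-Python mirror of the predicate behind new_study_locus_filtered."""
    return (
        variant_id is not None
        and "rs" not in variant_id
        and chromosome is not None
    )


kept = [r for r in records if is_valid(*r)]
dropped = [r for r in records if not is_valid(*r)]
print(len(kept), len(dropped))  # 1 3
```

The count in cell 6 corresponds to `dropped`; `new_study_locus_filtered` corresponds to `kept`.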

- Unexpected NaNs are returned by numpy in `pics_postprob` for the majority of records. Surprisingly, some of these records are still `true` in the credible set assessment.

In [8]:
new_study_locus.filter(f.isnan(f.col("pics_postprob"))).show()

+----------+------------+---------------+----------------+----------+------------------+-------------+--------------------+--------------------+
|chromosome|     studyId|      variantId|    tagVariantId| R_overall|           pics_mu|pics_postprob|pics_95_perc_credset|pics_99_perc_credset|
+----------+------------+---------------+----------------+----------+------------------+-------------+--------------------+--------------------+
|         1|GCST000083_9|1_159436169_T_C|1_159393815_C_CA|  0.844099| 8.121067838279345|          NaN|                true|                true|
|         1|GCST000083_9|1_159436169_T_C| 1_159412117_A_G|  0.952546|10.341851135593762|          NaN|               false|               false|
|         1|GCST000083_9|1_159436169_T_C| 1_159356328_C_T|  0.718341| 5.881494248066192|          NaN|               false|               false|
|         1|GCST000083_9|1_159436169_T_C| 1_159406553_A_G| 81.166539| 75089.70916952091|          NaN|               false|       

In [9]:
new_study_locus.filter(f.isnan(f.col("pics_postprob"))).count()

                                                                                

34067544
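One possible explanation, not verified against the pipeline code: the sample above includes a row with an `R_overall` of 81.166539, which is outside the valid range of a correlation. Assuming the usual PICS standard deviation term, which takes the square root of 1 - |r|^6.4, such a value gives a negative radicand, and numpy's `sqrt` returns NaN for negative input. If posterior probabilities are then normalised by a per-locus sum, a single NaN would spread to every tag in that locus. The sketch below uses plain Python and hypothetical weights.

```python
import math

# R_overall from the sample row above; valid correlations lie in [-1, 1]
r = 81.166539
radicand = 1 - abs(r) ** 6.4  # assumed PICS term under the square root
print(radicand < 0)  # True: numpy.sqrt would return NaN here

# Hypothetical unnormalised posterior weights for the tags of one locus,
# with one NaN produced by the out-of-range correlation
weights = [0.6, 0.3, math.nan]
total = sum(weights)                     # a single NaN makes the sum NaN
postprob = [w / total for w in weights]  # every normalised value is NaN
print(all(math.isnan(p) for p in postprob))  # True

# Comparisons with NaN are always False, so a flag written as a
# negated comparison evaluates to True for NaN input
print(math.nan <= 0.95)       # False
print(not (math.nan > 0.95))  # True
```

The last two lines show how a credible set flag could end up `true` for a NaN posterior, depending on how the comparison is written. Whether the pipeline does this is an assumption to check.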

# Comparing new credible sets with old credible sets

In [10]:
# Unique (studyId, variantId, tagVariantId) records, in new dataset
new_study_locus_filtered.select("studyId", "variantId", "tagVariantId").distinct().count()

                                                                                

30330984

In [11]:
# Unique studyIds, in new dataset
new_study_locus_filtered.select("studyId").distinct().count()

                                                                                

34557

In [12]:
# Unique projects (GCST IDs), in new dataset
new_study_locus_filtered = new_study_locus_filtered.withColumn("projectId", f.split(f.col("studyId"), "_").getItem(0))
new_study_locus_filtered.select("projectId").distinct().count()

                                                                                

22407
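As a plain-Python illustration of the split above, assuming study IDs follow the `<GCST accession>_<suffix>` pattern seen in the sample output (e.g. `GCST000083_9`):

```python
study_id = "GCST000083_9"  # studyId taken from the sample output above
project_id = study_id.split("_")[0]  # keep only the part before the first "_"
print(project_id)  # GCST000083
```

Several studyIds can share one projectId, which is why the project count is lower than the studyId count.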

In [13]:
# Unique variantIds, in new dataset
new_study_locus_filtered.select("variantId").distinct().count()

                                                                                

209911

In [15]:
old_study_locus = spark.read.parquet("gs://genetics-portal-dev-staging/v2d/220210/ld.parquet")
# Build lead and tag variant IDs as chrom_pos_ref_alt to match the new dataset
old_study_locus_ = old_study_locus.select(
    "study_id",
    f.concat_ws("_", f.col("lead_chrom"), f.col("lead_pos"), f.col("lead_ref"), f.col("lead_alt")).alias("variantId"),
    f.concat_ws("_", f.col("tag_chrom"), f.col("tag_pos"), f.col("tag_ref"), f.col("tag_alt")).alias("tagVariantId"),
).distinct()
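The ID construction can be sketched in plain Python. The helper name `make_variant_id` is hypothetical. It mimics the `concat_ws` behaviour of skipping null inputs, so a missing component gives a shorter ID rather than a null.

```python
def make_variant_id(chrom, pos, ref, alt):
    """Join the non-missing parts with "_", as concat_ws does."""
    parts = [chrom, pos, ref, alt]
    return "_".join(str(p) for p in parts if p is not None)


print(make_variant_id("1", 159436169, "T", "C"))   # 1_159436169_T_C
print(make_variant_id("1", 159436169, None, "C"))  # 1_159436169_C
```

The second call shows why malformed IDs may need a separate check if any of the old dataset's columns contain nulls.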

In [16]:
# Unique records, in old dataset
old_study_locus_.count()

                                                                                

19406516

In [17]:
# Unique projects (GCST IDs)
old_study_locus_ = old_study_locus_.withColumn("projectId", f.split(f.col("study_id"), "_").getItem(0))
old_study_locus_.select("projectId").distinct().count()

                                                                                

8190

In [18]:
# Unique studyIds, in old dataset
old_study_locus_.select("study_id").distinct().count()

                                                                                

18349

In [19]:
# Unique variantIds, in old dataset
old_study_locus_.select("variantId").distinct().count()

                                                                                

126076

# Compare old and new in overlapping projects

In [20]:
olderProjects_newStudyLocus = new_study_locus_filtered.join(old_study_locus_.select("projectId").distinct(), on=["projectId"], how="inner")

In [21]:
# Total records
olderProjects_newStudyLocus.count()

                                                                                

26944342

In [22]:
# unique studies
olderProjects_newStudyLocus.select("studyId").distinct().count()

                                                                                

17161

In [23]:
# unique variants
olderProjects_newStudyLocus.select("variantId").distinct().count()

                                                                                

152722

TODO: once posterior probabilities and credible sets are calculated correctly, compare the numbers. At the moment the new results are not filtered to credible set tags at either confidence level (95% or 99%).
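A minimal sketch of what that comparison might look like, in plain Python on hypothetical rows. It assumes the new results are first restricted to tags flagged in the 95% credible set and then matched to the old results on the (study, lead variant, tag variant) triple. In Spark the same logic could be expressed as joins on those three columns.

```python
# Hypothetical new rows: (studyId, variantId, tagVariantId, in 95% credible set)
new_rows = [
    ("GCST000001_1", "1_100_A_G", "1_100_A_G", True),
    ("GCST000001_1", "1_100_A_G", "1_150_C_T", True),
    ("GCST000001_1", "1_100_A_G", "1_900_G_A", False),
]
# Hypothetical old rows: (study_id, variantId, tagVariantId)
old_rows = [
    ("GCST000001_1", "1_100_A_G", "1_100_A_G"),
    ("GCST000001_1", "1_100_A_G", "1_300_T_C"),
]

# Keep only new tags that fall inside the 95% credible set
new_set = {(s, v, t) for s, v, t, in_credset in new_rows if in_credset}
old_set = set(old_rows)

shared = new_set & old_set
only_new = new_set - old_set
only_old = old_set - new_set
jaccard = len(shared) / len(new_set | old_set)

print(len(shared), len(only_new), len(only_old))  # 1 1 1
print(round(jaccard, 3))  # 0.333
```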