# **Variant filtering example**
This Jupyter notebook shows an example use case of the Oskar API, which works together with PySpark to provide a genomic analysis tool.

First, we import the modules that contain the Oskar API along with the Spark ones. We also create a new instance of the Oskar object, on which the whole functionality depends, and load our data as "df".

*Optional: create a temporary view if you plan to access the data from SQL.*

In [1]:
from pyoskar.core import Oskar
from pyoskar.spark.sql import *
from pyoskar.spark.analysis import *
from pyspark.sql.functions import col, udf, count, explode, concat, when, expr
from pyspark.sql.functions import *

oskar = Oskar(spark)
df = oskar.load("/home/roldanx/appl/oskar/oskar-spark/src/test/resources/platinum_chr22.small.parquet")
df.createOrReplaceTempView("platinum")

## Region filter
First, we filter variants by region. In this example we chose chromosome 22 and the positions between 17,000,000 and 17,500,000 (both bounds exclusive).

In [5]:
# Pyspark
df.select("id", "chromosome", "start", "end").filter(df.chromosome == 22).filter(df.start > 17000000).filter(df.end < 17500000).show(10)

+---------------+----------+--------+--------+
|             id|chromosome|   start|     end|
+---------------+----------+--------+--------+
|22:17001352:C:G|        22|17001352|17001352|
|22:17002352:C:A|        22|17002352|17002352|
|22:17004097:G:A|        22|17004097|17004097|
|22:17011943:G:C|        22|17011943|17011943|
|22:17012760:G:A|        22|17012760|17012760|
|22:17013084:C:T|        22|17013084|17013084|
|22:17013900:A:G|        22|17013900|17013900|
|22:17019574:C:A|        22|17019574|17019574|
|22:17030949:T:C|        22|17030949|17030949|
|22:17034810:T:C|        22|17034810|17034810|
+---------------+----------+--------+--------+
only showing top 10 rows



In [4]:
# SQL
spark.sql("SELECT id,chromosome,start,end FROM platinum WHERE chromosome =='22' AND start > 17000000 AND end < 17500000").show(10)

+---------------+----------+--------+--------+
|             id|chromosome|   start|     end|
+---------------+----------+--------+--------+
|22:17001352:C:G|        22|17001352|17001352|
|22:17002352:C:A|        22|17002352|17002352|
|22:17004097:G:A|        22|17004097|17004097|
|22:17011943:G:C|        22|17011943|17011943|
|22:17012760:G:A|        22|17012760|17012760|
|22:17013084:C:T|        22|17013084|17013084|
|22:17013900:A:G|        22|17013900|17013900|
|22:17019574:C:A|        22|17019574|17019574|
|22:17030949:T:C|        22|17030949|17030949|
|22:17034810:T:C|        22|17034810|17034810|
+---------------+----------+--------+--------+
only showing top 10 rows
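
The region filter above hard-codes the chromosome and both bounds. As a minimal sketch, the same condition can be generated from a region string. The helper name `region_to_where` and the `chrom:start-end` input format are assumptions made for illustration; they are not part of the Oskar API. The strict `>` and `<` comparisons mirror the cells above.

```python
import re

# Hypothetical helper, not part of the Oskar API: turn "chrom:start-end"
# into a WHERE clause with the same strict bounds used in the cells above.
_REGION = re.compile(r"^(\w+):(\d+)-(\d+)$")


def region_to_where(region):
    match = _REGION.match(region)
    if match is None:
        raise ValueError(f"expected 'chrom:start-end', got {region!r}")
    chrom = match.group(1)
    start = int(match.group(2))
    end = int(match.group(3))
    if start > end:
        raise ValueError("region start must not exceed region end")
    # The chromosome is quoted because it is compared as a string in the SQL cell.
    return f"chromosome = '{chrom}' AND start > {start} AND end < {end}"


where = region_to_where("22:17000000-17500000")
print(where)
```

The resulting string can then be appended to the SQL query (`"SELECT id,chromosome,start,end FROM platinum WHERE " + where`) or passed to `df.filter(where)`, since PySpark's `DataFrame.filter` also accepts a SQL expression string.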



## Gene filter
Next, we filter the variants associated with a specific gene; "NBEAP3" was chosen as the target. Here we start to appreciate the functionality of the Oskar API: the "genes" field is automatically located and easily accessed.

In [6]:
df.select(df.id, genes("annotation").alias("genes")).filter(array_contains("genes", "NBEAP3")).show()
# spark.sql("SELECT id,genes(annotation) FROM platinum WHERE array_contains(genes(annotation), 'NBEAP3')").show()

+---------------+--------------------+
|             id|               genes|
+---------------+--------------------+
|22:16096040:G:A|            [NBEAP3]|
|22:16099957:C:T|            [NBEAP3]|
|22:16100462:A:G|            [NBEAP3]|
|22:16105660:G:A|            [NBEAP3]|
|22:16112391:G:A|            [NBEAP3]|
|22:16114913:A:T|            [NBEAP3]|
|22:16127471:A:-|[LA16c-60H5.7, NB...|
+---------------+--------------------+
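
To make the matching rule explicit, here is a plain-Python model of what `array_contains` does on the "genes" column: it compares whole array elements, so a variant is kept only when one of its genes equals the target exactly. The rows and gene names other than "NBEAP3" are hypothetical, and the sketch does not reproduce how Oskar extracts genes from the annotation.

```python
# Plain-Python model of array_contains on a list column: the match is on
# whole elements, so a longer or partial gene name does not match.
rows = [  # hypothetical rows, not taken from the dataset
    {"id": "variant-1", "genes": ["NBEAP3"]},
    {"id": "variant-2", "genes": ["GENE_A", "NBEAP3"]},
    {"id": "variant-3", "genes": ["NBEAP30"]},
    {"id": "variant-4", "genes": []},
]


def has_gene(row, gene):
    # True only if the gene appears as an element of the list.
    return gene in row["genes"]


matching = [row["id"] for row in rows if has_gene(row, "NBEAP3")]
print(matching)  # ['variant-1', 'variant-2']
```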



## Genotype filter
In this case we show how to get the genotypes for any sample with the "sample_data_field" function.

In [20]:
# Spark
df.select(sample_data_field(col("studies"), 'NA12877', 'GT').alias("NA12877")).show(10)

+-------+
|NA12877|
+-------+
|    ./.|
|    0/1|
|    ./.|
|    0/1|
|    ./.|
|    ./.|
|    ./.|
|    0/1|
|    0/1|
|    0/1|
+-------+
only showing top 10 rows



In [21]:
# SQL
spark.sql("SELECT sample_data_field(studies, 'NA12877', 'GT')  AS NA12877 FROM platinum").show(10)

+-------+
|NA12877|
+-------+
|    ./.|
|    0/1|
|    ./.|
|    0/1|
|    ./.|
|    ./.|
|    ./.|
|    0/1|
|    0/1|
|    0/1|
+-------+
only showing top 10 rows
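
The GT strings returned above follow the VCF convention: `0` is the reference allele, a non-zero index is an alternate allele, `.` is a missing call, and alleles are separated by `/` (unphased) or `|` (phased). A minimal sketch of how such strings could be interpreted once collected is shown below; the function name and the category labels are hypothetical and not part of the Oskar API.

```python
from collections import Counter


def classify_genotype(gt):
    """Classify a VCF-style GT string such as '0/1' or './.'."""
    # Treat phased ('|') and unphased ('/') separators the same way.
    alleles = gt.replace("|", "/").split("/")
    if any(allele == "." for allele in alleles):
        return "missing"
    if all(allele == "0" for allele in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_alt"
    return "het"


# The ten genotypes shown in the output above for sample NA12877.
sample = ["./.", "0/1", "./.", "0/1", "./.", "./.", "./.", "0/1", "0/1", "0/1"]
counts = Counter(classify_genotype(gt) for gt in sample)
print(counts)
```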



## Population Frequency filter
Finally, we filter our dataframe by population frequency (PF): we call "population_frequency" and specify the field where it can be accessed, the ID of the study we want to check, and the target population.

In [3]:
df.select(df.id, population_frequency("annotation", "GNOMAD_GENOMES", "ALL").alias("GNOMAD_GENOMES:ALL"))\
    .filter(population_frequency("annotation", "GNOMAD_GENOMES", "ALL") != 0).show(10)

+---------------+--------------------+
|             id|  GNOMAD_GENOMES:ALL|
+---------------+--------------------+
|22:16054454:C:T| 0.07566695660352707|
|22:16065809:T:C| 0.14594951272010803|
|22:16077310:T:A|  0.2338419109582901|
|22:16080499:A:G|                 0.0|
|22:16084621:T:C|                 0.0|
|22:16091610:G:T|0.003129890421405...|
|22:16096040:G:A|                 0.0|
|22:16099957:C:T|  0.6782668232917786|
|22:16100462:A:G|                 0.0|
|22:16105660:G:A| 0.12387744337320328|
+---------------+--------------------+
only showing top 10 rows



Alternatively, if we are interested in getting all the available population frequencies, we can use another function named "population_frequency_as_map", which returns the whole set as a map (dictionary). We can then apply "explode" to convert that map into a new dataframe.

In [43]:
PF = df.select(df.id, population_frequency_as_map("annotation").alias("populationFrequencies"))\
    .filter(df.id == "22:16099957:C:T")
PF.select(explode(PF.populationFrequencies).alias("study", "populationFrequenciesDF")).show()

+--------------------+-----------------------+
|               study|populationFrequenciesDF|
+--------------------+-----------------------+
|  GNOMAD_GENOMES:OTH|     0.6896551847457886|
|  GNOMAD_GENOMES:ALL|     0.6782668232917786|
|  GNOMAD_GENOMES:AFR|     0.6747621297836304|
|  GNOMAD_GENOMES:NFE|     0.6699579954147339|
| GNOMAD_GENOMES:MALE|      0.682662844657898|
|  GNOMAD_GENOMES:FIN|     0.6726332306861877|
|  GNOMAD_GENOMES:EAS|     0.9523809552192688|
|  GNOMAD_GENOMES:ASJ|     0.5721649527549744|
|GNOMAD_GENOMES:FE...|     0.6727412939071655|
|  GNOMAD_GENOMES:AMR|     0.6950549483299255|
+--------------------+-----------------------+
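
As a rough illustration of what "explode" does with the map, the plain-Python sketch below turns each key/value pair into one row and additionally splits the key into study and population. It assumes the keys have the `STUDY:POPULATION` shape visible in the output above; the function name is hypothetical, and the dictionary is an excerpt of that output rather than the full map.

```python
# Excerpt of the map for variant 22:16099957:C:T, copied from the output above.
population_frequencies = {
    "GNOMAD_GENOMES:ALL": 0.6782668232917786,
    "GNOMAD_GENOMES:AFR": 0.6747621297836304,
    "GNOMAD_GENOMES:EAS": 0.9523809552192688,
}


def explode_frequencies(freq_map):
    """Return one (study, population, frequency) tuple per map entry."""
    rows = []
    for key, value in freq_map.items():
        # Assumes keys look like "STUDY:POPULATION"; split on the first colon only.
        study, population = key.split(":", 1)
        rows.append((study, population, value))
    return rows


for study, population, frequency in explode_frequencies(population_frequencies):
    print(study, population, frequency)
```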

