# **Facets example**
This Jupyter notebook shows an example use case of the PyOskar API, which works together with PySpark to provide a genomic analysis tool. It walks through several ways to get started with the facet functionality.

Before using PyOskar, we need to import the required modules, create an Oskar instance and load the data:

In [1]:
from pyoskar.core import Oskar
from pyoskar.sql import *
from pyoskar.analysis import *
# The wildcard import already provides col, udf, count, explode, concat, when and expr.
from pyspark.sql.functions import *

oskar = Oskar(spark)
df = oskar.load("/home/roldanx/appl/oskar/oskar-spark/src/test/resources/platinum_chr22.small.parquet")

## Simple Facet
Now that the data is loaded, we start with a simple facet. These examples run the classic "groupBy" and "count" on our DataFrame, first by variant type and then by the genes that contain the variants:

In [5]:
oskar.facet(df, "type").show()

+-----+-----+
| type|count|
+-----+-----+
|INDEL|  106|
|  SNV|  894|
+-----+-----+



In [8]:
oskar.facet(df, "gene").show(10)

+-----------+-----+
|       gene|count|
+-----------+-----+
|    ABCD1P4|    2|
|  ABHD17AP4|    2|
| AC000029.1|    1|
| AC000041.8|    2|
| AC000067.1|    1|
|AC000068.10|    1|
| AC000068.5|    1|
| AC000089.3|    1|
| AC002472.1|    4|
|AC004019.10|    1|
+-----------+-----+
only showing top 10 rows
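
To make the "groupBy" plus "count" idea concrete, here is a minimal pure-Python sketch of what a simple facet computes. It is not Oskar's implementation: the records, the `GENE_A`/`GENE_B` names and both helper functions are hypothetical, and it assumes that a variant overlapping several genes is counted once for each of them.

```python
from collections import Counter

# Hypothetical toy records; they are NOT taken from the Platinum dataset.
variants = [
    {"type": "SNV", "genes": ["GENE_A"]},
    {"type": "SNV", "genes": ["GENE_A", "GENE_B"]},
    {"type": "INDEL", "genes": ["GENE_B"]},
]


def simple_facet(rows, field):
    """Group rows by a single-valued field and count the rows in each group."""
    return Counter(row[field] for row in rows)


def multi_valued_facet(rows, field):
    """Group rows by a list-valued field; a row counts once per distinct value.

    Assumption: this is how a variant overlapping several genes is counted.
    """
    return Counter(value for row in rows for value in set(row[field]))


type_counts = simple_facet(variants, "type")
gene_counts = multi_valued_facet(variants, "genes")

print(sorted(type_counts.items()))
print(sorted(gene_counts.items()))
```

Note that, under this assumption, the per-gene counts can add up to more than the number of variants.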



## Include Facet
The next example goes a step further and also filters the results, keeping only the values we list explicitly in the query:

In [11]:
oskar.facet(df, "gene[BCL2L13,CECR2]").show()

+-------+-----+
|   gene|count|
+-------+-----+
|BCL2L13|    8|
|  CECR2|   11|
+-------+-----+
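
The query above combines a field name with a bracketed list of values. The sketch below shows one way such a string could be split and applied as a filter before counting. Both helpers (`parse_include_query`, `include_facet`) are hypothetical names written for illustration, the input list is made up, and Oskar's own parser may behave differently.

```python
import re
from collections import Counter


def parse_include_query(query):
    """Split a 'field[v1,v2,...]' string into (field, [v1, v2, ...])."""
    match = re.fullmatch(r"([^\[\]]+)\[([^\[\]]*)\]", query)
    if match is None:
        raise ValueError(f"not an include facet: {query!r}")
    field = match.group(1)
    values = [v.strip() for v in match.group(2).split(",") if v.strip()]
    return field, values


def include_facet(values, included):
    """Count only the values that appear in the include list."""
    wanted = set(included)
    return Counter(v for v in values if v in wanted)


field, included = parse_include_query("gene[BCL2L13,CECR2]")

# Made-up gene annotations, one entry per (variant, gene) pair.
genes = ["BCL2L13", "CECR2", "TOY1", "CECR2", "BCL2L13", "CECR2"]
counts = include_facet(genes, included)

print(field, sorted(counts.items()))
```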



## Range facet
Using a syntax similar to that of include facets, but for quantitative rather than qualitative fields, we can facet by range, setting the lower and upper thresholds as well as the step. For this example we have chosen the "phylop" conservation score, but other conservation, functional and substitution scores are available too:

In [21]:
oskar.facet(df, "phylop[-5..0]:1").show()

+-----------+-----+
|phylopRange|count|
+-----------+-----+
|       -4.0|    3|
|       -3.0|   12|
|       -2.0|   55|
|       -1.0|  171|
|        0.0|  681|
+-----------+-----+
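
A range facet can be pictured as binning followed by counting. The sketch below assumes that each value inside the range is assigned to the bucket `floor(value / step) * step`. This rule, the inclusive range check and the `range_facet` helper are assumptions made for illustration; the notebook does not state how Oskar treats bucket boundaries, so counts at the edges may differ from the output above. The scores are made up.

```python
import math
from collections import Counter


def range_facet(values, start, end, step):
    """Count values per bucket, keeping only values in [start, end].

    Assumption: a value belongs to the bucket floor(value / step) * step.
    """
    counts = Counter()
    for value in values:
        if start <= value <= end:
            bucket = float(math.floor(value / step) * step)
            counts[bucket] += 1
    return dict(sorted(counts.items()))


# Made-up conservation scores, not taken from the dataset.
scores = [-4.2, -3.7, -1.5, -1.1, -0.3, 0.0, 2.4]
result = range_facet(scores, -5, 0, 1)

for bucket, count in result.items():
    print(bucket, count)
```

The same helper accepts a fractional step, as in `range_facet(values, -10, 10, 0.5)`.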



## Aggregation facet
We may want to check whether the genomic positions of our variants have historically been well conserved or, on the contrary, have notably evolved. For this task we can use aggregation facets, which replace the default "count" function with one of the following: average [avg], sum [sum], sum of squares [sumsq], percentiles [percentile] or set of distinct values [unique].

In [23]:
oskar.facet(df, "sum(gerp)").show()

+------------------+-----+
|         sum(gerp)|count|
+------------------+-----+
|-351.8712293113349| 1000|
+------------------+-----+



In [27]:
oskar.facet(df, "percentile(gerp)").show(truncate=False)

+---------------------------------------------------------------------------------------+-----+
|percentile(gerp)                                                                       |count|
+---------------------------------------------------------------------------------------+-----+
|[-2.152000093460083, -0.6257500052452087, 0.0, 0.14900000393390656, 0.7430999755859375]|1000 |
+---------------------------------------------------------------------------------------+-----+
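
For reference, the aggregations named above can be written in plain Python as follows. This is only a sketch of the arithmetic, not Oskar's code: the `aggregate` and `percentile` helpers are hypothetical, the percentile uses linear interpolation (an assumption), and the percentile levels are supplied by the caller because the notebook does not say which five levels the output above corresponds to.

```python
import math


def percentile(values, level):
    """Percentile by linear interpolation between the closest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * level / 100.0
    low, high = math.floor(position), math.ceil(position)
    fraction = position - low
    return ordered[low] + (ordered[high] - ordered[low]) * fraction


def aggregate(values, func):
    """Apply one of the aggregation names used in facet queries."""
    if func == "avg":
        return math.fsum(values) / len(values)
    if func == "sum":
        return math.fsum(values)
    if func == "sumsq":
        return math.fsum(v * v for v in values)
    if func == "unique":
        return sorted(set(values))
    raise ValueError(f"unknown aggregation: {func!r}")


# Made-up scores, not taken from the dataset.
scores = [1.0, 2.0, 2.0, 3.0, 4.0]

print(aggregate(scores, "avg"))
print(aggregate(scores, "sum"))
print(aggregate(scores, "sumsq"))
print(aggregate(scores, "unique"))
print([percentile(scores, level) for level in (25, 50, 75)])
```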



## Nested facets
The last feature available for facet queries is nesting, which lets us chain groups with the ">>" operator to build more complex analyses.

In [52]:
oskar.facet(df, "biotype>>ct[splice_donor_variant]").show(100, False)

+-----------------------+--------------------+-----+
|biotype                |ct                  |count|
+-----------------------+--------------------+-----+
|nonsense_mediated_decay|splice_donor_variant|1    |
|processed_transcript   |splice_donor_variant|1    |
|protein_coding         |splice_donor_variant|1    |
|retained_intron        |splice_donor_variant|1    |
+-----------------------+--------------------+-----+
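
A nested facet can be read as grouping by a tuple of fields, one per level of the ">>" chain. The sketch below splits a query on ">>", applies any bracketed include list as a filter, and counts rows per combination. It handles plain and include levels only, not ranges or aggregations. The `parse_level` and `nested_facet` helpers and the toy rows are hypothetical and do not reflect how Oskar evaluates the query internally.

```python
from collections import Counter


def parse_level(level):
    """Return (field, allowed values or None) for 'field' or 'field[a,b]'."""
    if level.endswith("]") and "[" in level:
        field, _, rest = level.partition("[")
        return field, set(rest[:-1].split(","))
    return level, None


def nested_facet(rows, query):
    """Count rows per combination of the fields chained with '>>'."""
    levels = [parse_level(part) for part in query.split(">>")]
    counts = Counter()
    for row in rows:
        key = tuple(row[field] for field, _ in levels)
        # Skip the row if any level has an include list that rejects its value.
        if all(allowed is None or value in allowed
               for value, (_, allowed) in zip(key, levels)):
            counts[key] += 1
    return counts


# Made-up annotations, not taken from the dataset.
rows = [
    {"biotype": "protein_coding", "ct": "splice_donor_variant"},
    {"biotype": "protein_coding", "ct": "missense_variant"},
    {"biotype": "retained_intron", "ct": "splice_donor_variant"},
    {"biotype": "protein_coding", "ct": "splice_donor_variant"},
]

nested = nested_facet(rows, "biotype>>ct[splice_donor_variant]")
for key, count in sorted(nested.items()):
    print(key, count)
```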



## Gerp + PF ??

In [54]:
oskar.facet(df, "cadd_raw[-100..100]:10>>biotype>>type>>gerp[-10..10]:0.5>>gene[CNN2P1,EIF4ENIF1,IGLV3-12,CTA-85E5.10]").show(100, False)

+-------------+-----------------------+----+---------+-----------+-----+
|cadd_rawRange|biotype                |type|gerpRange|gene       |count|
+-------------+-----------------------+----+---------+-----------+-----+
|0.0          |IG_V_gene              |SNV |-1.0     |IGLV3-12   |1    |
|0.0          |antisense              |SNV |-1.0     |CTA-85E5.10|1    |
|0.0          |antisense              |SNV |-1.0     |EIF4ENIF1  |1    |
|0.0          |antisense              |SNV |-0.5     |CTA-85E5.10|1    |
|0.0          |antisense              |SNV |0.0      |EIF4ENIF1  |1    |
|0.0          |antisense              |SNV |0.5      |CTA-85E5.10|1    |
|0.0          |nonsense_mediated_decay|SNV |-1.0     |EIF4ENIF1  |1    |
|0.0          |nonsense_mediated_decay|SNV |0.0      |EIF4ENIF1  |1    |
|0.0          |nonsense_mediated_decay|SNV |2.5      |EIF4ENIF1  |1    |
|0.0          |processed_transcript   |SNV |2.5      |EIF4ENIF1  |1    |
|0.0          |protein_coding         |SNV |-1.0   