# **Advanced Variant Filtering tutorial**

In this Jupyter tutorial you will learn how to query variants using several well-known statistical tests implemented in the PyOskar API. We will use transformers to access this functionality through PySpark. These functions are called directly on an Oskar object, and each returns a whole DataFrame with the corresponding transformation applied.

First, we import the PyOskar and PySpark modules. Second, we create an instance of the _Oskar_ object, on which much of the functionality depends. Finally, we load our data into a DataFrame _df_, and we are ready to start.

In [2]:
from pyoskar.core import Oskar
from pyoskar.sql import *
from pyoskar.analysis import *
from pyspark.sql.functions import col, udf, count, explode, concat, when, expr

oskar = Oskar(spark)
df = oskar.load("/home/roldanx/appl/oskar/oskar-spark/src/test/resources/platinum_chr22.small.parquet")

You can use PySpark to print the data in _df_. This is what our test DataFrame looks like. For this tutorial we are using a small dataset from Illumina Platinum Genomes with 1,000 random variants from chromosome 22:

In [3]:
print("Print first 20 variants:")
df.show()

Print first 20 variants:
+---------------+-----+----------+--------+--------+---------+---------+------+----+------+-----+----+--------------------+--------------------+
|             id|names|chromosome|   start|     end|reference|alternate|strand|  sv|length| type|hgvs|             studies|          annotation|
+---------------+-----+----------+--------+--------+---------+---------+------+----+------+-----+----+--------------------+--------------------+
|22:16054454:C:T|   []|        22|16054454|16054454|        C|        T|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16054454, 16...|
|22:16065809:T:C|   []|        22|16065809|16065809|        T|        C|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16065809, 16...|
|22:16077310:T:A|   []|        22|16077310|16077310|        T|        A|     +|null|     1|  SNV|  []|[[hgvauser@platin...|[22, 16077310, 16...|
|22:16080499:A:G|   []|        22|16080499|16080499|        A|        G|     +|null|     1|  SNV|  []|[[h

In [4]:
print("Total number of variants:")
df.count()

Total number of variants:


1000

Below are a few examples of simple queries that may be of interest.

## Hardy-Weinberg
This transformer tests each variant for Hardy-Weinberg equilibrium (HWE) with a Fisher exact test, using the population data stored in the DataFrame. The result is added in the `HWE` column.
Usage:
```python
hardyWeinberg(df[DataFrame], studyId[str]=None)
```

In [4]:
oskar.hardyWeinberg(df, "hgvauser@platinum:illumina_platinum").select("id", "HWE").show(10)

+---------------+--------------------+
|             id|                 HWE|
+---------------+--------------------+
|22:16054454:C:T|                 1.0|
|22:16065809:T:C|                 1.0|
|22:16077310:T:A|  0.9254727474972191|
|22:16080499:A:G|                 1.0|
|22:16084621:T:C|                 1.0|
|22:16091610:G:T|                 1.0|
|22:16096040:G:A|  0.4746014089729329|
|22:16099957:C:T|0.016007636455477054|
|22:16100462:A:G|0.001011008618240...|
|22:16105660:G:A|  0.3037449017426771|
+---------------+--------------------+
only showing top 10 rows



In [9]:
oskar.hardyWeinberg(df, "hgvauser@platinum:illumina_platinum").select("id", "HWE").filter("HWE < 0.005").show(10, truncate = False)

+---------------+---------------------+
|id             |HWE                  |
+---------------+---------------------+
|22:16100462:A:G|0.0010110086182406558|
|22:16147398:G:A|0.0010112245929821014|
|22:16202382:C:T|0.0010110086182406558|
|22:16409256:C:A|0.0010110086182406558|
|22:16409275:T:C|0.0010110086182406558|
|22:16463338:T:C|4.51387209620996E-5  |
|22:16847903:T:A|1.1233429091562846E-4|
|22:16850925:C:T|1.1233429091562846E-4|
|22:16853987:T:C|1.1233429091562846E-4|
|22:16854418:G:A|1.1233429091562846E-4|
+---------------+---------------------+
only showing top 10 rows
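
To make the `HWE` column more concrete, the following is a minimal sketch of one common exact test for Hardy-Weinberg equilibrium, written in plain Python. It takes the three genotype counts of a biallelic variant and sums the probabilities of every heterozygote count that is no more likely than the observed one. The function name `hwe_exact_p` and its parameters are hypothetical, and whether PyOskar uses this exact formulation internally is an assumption, so the values may not match the `HWE` column digit for digit.

```python
from math import exp, lgamma, log


def hwe_exact_p(n_hom_ref, n_het, n_hom_alt):
    """Exact-test p-value for HWE from biallelic genotype counts (illustrative)."""
    n = n_hom_ref + n_het + n_hom_alt      # number of genotyped samples
    n_ref = 2 * n_hom_ref + n_het          # reference allele count
    n_alt = 2 * n_hom_alt + n_het          # alternate allele count
    if n == 0:
        return 1.0

    def log_prob(het):
        # Log-probability of seeing `het` heterozygotes given the allele counts.
        hom_ref = (n_ref - het) // 2
        hom_alt = (n_alt - het) // 2
        return (het * log(2)
                + lgamma(n + 1) - lgamma(hom_ref + 1) - lgamma(het + 1) - lgamma(hom_alt + 1)
                + lgamma(n_ref + 1) + lgamma(n_alt + 1) - lgamma(2 * n + 1))

    rare = min(n_ref, n_alt)
    # Heterozygote counts must share the parity of the allele counts.
    probs = [exp(log_prob(het)) for het in range(rare % 2, rare + 1, 2)]
    p_obs = exp(log_prob(n_het))
    # Sum every outcome at most as likely as the observed one (small tolerance for rounding).
    return min(1.0, sum(p for p in probs if p <= p_obs * (1 + 1e-9)))
```

With this sketch, a variant for which all of 17 samples are heterozygous, `hwe_exact_p(0, 17, 0)`, gives a p-value of about 5.6e-5, while perfectly balanced counts such as `hwe_exact_p(25, 50, 25)` give a p-value of essentially 1.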



## Inbreeding coefficient
This transformer calculates the inbreeding coefficient (F) of each sample, based on Hardy-Weinberg equilibrium, using the population data stored in the DataFrame. It requires a previous transformation that generates the stats data; that method is explained in full in the _stats_ tutorial.
Usage:
```python
inbreedingCoefficient(df[DataFrame], missingGenotypesAsHomRef[bool]=None, includeMultiAllelicGenotypes[bool]=None, mafThreshold[float]=None)
```

In [11]:
df2 = oskar.stats(df, studyId = "hgvauser@platinum:illumina_platinum", missingAsReference = True)
oskar.inbreedingCoefficient(df2).show(10)

+--------+-------------------+-----------+------------------+--------------+
|SampleId|                  F|ObservedHom|       ExpectedHom|GenotypesCount|
+--------+-------------------+-----------+------------------+--------------+
| NA12877|-1.0857581722788996|         70|233.97577702999115|           385|
| NA12878|-1.1024114888695444|         69|244.65916746854782|           404|
| NA12879|-1.1890914293957586|         69| 247.7093403339386|           398|
| NA12880|-1.1013660394101679|         71|248.15224742889404|           409|
| NA12881|-1.1560267972581504|         65| 252.6643579006195|           415|
| NA12882|-1.0112382612189488|         76| 224.8269881606102|           372|
| NA12883|-1.0602574055431329|         67|229.62110525369644|           383|
| NA12884|-1.0340014363992485|         74|224.47404664754868|           370|
| NA12885|-1.1105665251221366|         78| 254.8010356426239|           414|
| NA12886| -1.067867784696387|         72|244.48096668720245|           406|
+--------+-------------------+-----------+------------------+--------------+

In [19]:
df2 = oskar.stats(df, studyId = "hgvauser@platinum:illumina_platinum", missingAsReference = True)
oskar.inbreedingCoefficient(df2).filter("F > -1").show()

+--------+---+-----------+-----------+--------------+
|SampleId|  F|ObservedHom|ExpectedHom|GenotypesCount|
+--------+---+-----------+-----------+--------------+
+--------+---+-----------+-----------+--------------+



_* This empty result shows that every sample has F below -1: far fewer homozygous genotypes were observed than expected (equivalently, far more heterozygous ones), so inbreeding is unlikely in this population._
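
The columns in the output above appear to be related by the usual method-of-moments estimate `F = (ObservedHom - ExpectedHom) / (GenotypesCount - ExpectedHom)`. The sketch below recomputes F from the other three columns so the values can be checked by hand. The helper name `inbreeding_f` is hypothetical, and the formula is inferred from the printed rows rather than taken from the PyOskar source.

```python
def inbreeding_f(observed_hom, expected_hom, genotypes_count):
    """Method-of-moments inbreeding coefficient (formula assumed, see text)."""
    denominator = genotypes_count - expected_hom
    if denominator == 0:
        return float("nan")  # F is undefined when every genotype is expected to be homozygous
    return (observed_hom - expected_hom) / denominator


# Values copied from the NA12877 row of the output above.
f_na12877 = inbreeding_f(70, 233.97577702999115, 385)
print(f_na12877)  # approximately -1.0858, in line with the F column
```

Under this formula, F drops below -1 whenever `ObservedHom < 2 * ExpectedHom - GenotypesCount`, which is the case for every sample listed.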

### In fact, I think the coefficient is too low; something is off with these values.

## Mendelian error
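This transformer flags variants whose genotype in the child cannot have been inherited from the specified father and mother. The result is stored in the `mendelianError` column, where 0 means that no error was found.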
Usage:
```python
mendel(df[DataFrame], father[str], mother[str], child[str], studyId[str]=None)
```

In [24]:
oskar.mendel(df, "NA12877", "NA12878", "NA12879").select("id", "mendelianError").show(10)

+---------------+--------------+
|             id|mendelianError|
+---------------+--------------+
|22:16054454:C:T|             0|
|22:16065809:T:C|             0|
|22:16077310:T:A|             0|
|22:16080499:A:G|             0|
|22:16084621:T:C|             0|
|22:16091610:G:T|             0|
|22:16096040:G:A|             0|
|22:16099957:C:T|             0|
|22:16100462:A:G|             0|
|22:16105660:G:A|             0|
+---------------+--------------+
only showing top 10 rows



In [22]:
# Is the ID of this result normal??? # df.filter("id = '22:19748211:CCCC:-'").show()
oskar.mendel(df, "NA12877", "NA12878", "NA12879").select("id", "mendelianError").filter(col("mendelianError") != 0).show()

+------------------+--------------+
|                id|mendelianError|
+------------------+--------------+
|22:19748211:CCCC:-|             1|
+------------------+--------------+
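
As an illustration of what a Mendelian error is, the sketch below checks whether a child's genotype at an autosomal site can be formed by taking one allele from each parent. It is written in plain Python with hypothetical helpers `parse_gt` and `is_mendelian_consistent`, takes VCF-style genotype strings such as `0/1`, and treats missing genotypes as uninformative. It only returns a boolean; how PyOskar encodes the different error types in `mendelianError` is not covered here.

```python
import re


def parse_gt(genotype):
    """Split a VCF-style genotype such as '0/1' or '1|0' into its alleles."""
    return re.split(r"[/|]", genotype)


def is_mendelian_consistent(father, mother, child):
    """Return True if the child's diploid genotype is compatible with the parents'."""
    f, m, c = parse_gt(father), parse_gt(mother), parse_gt(child)
    # Missing or non-diploid calls cannot be assessed, so they are not flagged here.
    if any("." in gt or len(gt) != 2 for gt in (f, m, c)):
        return True
    a, b = c
    # One child allele must come from the father and the other from the mother.
    return (a in f and b in m) or (b in f and a in m)
```

For instance, `is_mendelian_consistent("0/0", "0/0", "0/1")` is `False`, because neither parent carries the alternate allele.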



# TDT??? [Cannot be added yet because the dataframe has no phenotype information, I think]

# COMPOUND HETEROZYGOTE???