# Basic Example

Here is a basic example of running a `VerificationSuite` with a single `Check` made up of several constraints, and then filtering those constraints by their results.

We'll start by creating a Spark session and a small sample dataframe.

In [1]:
import pydeequ
from pyspark.sql import Row, SparkSession


spark = (
    SparkSession.builder.config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate()
)

Please set env variable SPARK_VERSION
Ivy Default Cache set to: /home/studio-lab-user/.ivy2/cache
The jars for the packages stored in: /home/studio-lab-user/.ivy2/jars
:: loading settings :: url = jar:file:/home/studio-lab-user/.conda/envs/deepqu/lib/python3.10/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
com.amazon.deequ#deequ added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-0929f323-3dcc-4685-a441-3af705fd0e2d;1.0
	confs: [default]
	found com.amazon.deequ#deequ;1.2.2-spark-3.0 in central
	found org.scalanlp#breeze_2.12;0.13.2 in central
	found org.scalanlp#breeze-macros_2.12;0.13.2 in central
	found org.scala-lang#scala-reflect;2.12.1 in central
	found com.github.fommil.netlib#core;1.1.2 in central
	found net.sf.opencsv#opencsv;2.3 in central
	found com.github.rwl#jtransforms;2.4.0 in central
	found junit#junit;4.8.2 in central
	found org.apache.commons#commons-math3;3.2 in central
	found org.spire-math#
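
The first line of the output above asks for the `SPARK_VERSION` environment variable, which PyDeequ reads when it is imported in order to pick the matching Deequ artifact. A minimal sketch of setting it from Python is shown here. The value `"3.0"` is an assumption taken from the `deequ;1.2.2-spark-3.0` artifact resolved in the log, so adjust it to your own Spark installation, and run this before `import pydeequ`.

```python
import os

# Assumption: Spark 3.0, inferred from the resolved Deequ artifact in the log.
# setdefault leaves any value that is already exported in the shell untouched.
os.environ.setdefault("SPARK_VERSION", "3.0")

print(os.environ["SPARK_VERSION"])
```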

In [2]:
df = spark.sparkContext.parallelize(
    [Row(a="foo", b=1, c=5), Row(a="bar", b=2, c=6), Row(a="baz", b=3, c=None)]
).toDF()

Next, we import the `PyDeequ` modules needed to run a `VerificationSuite` with a `Check`. We will check the following:

- does `df` have a size of at least 3? 
- does the `b` column have a minimum value of 0? 
- is the `c` column complete? 
- is the `a` column unique? 
- are the values of the `a` column contained in "foo", "bar", and "baz"?
- are the values in the `b` column non-negative?

Once the checks have run, we'll display the results as a dataframe.


In [3]:
from pydeequ.checks import *
from pydeequ.verification import *


check = Check(spark, CheckLevel.Error, "Integrity checks")

checkResult = (
    VerificationSuite(spark)
    .onData(df)
    .addCheck(
        check.hasSize(lambda x: x >= 3)
        .hasMin("b", lambda x: x == 0)
        .isComplete("c")
        .isUnique("a")
        .isContainedIn("a", ["foo", "bar", "baz"])
        .isNonNegative("b")
    )
    .run()
)

checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
checkResult_df.show()

Python Callback server started!

+----------------+-----------+------------+--------------------+-----------------+--------------------+
|           check|check_level|check_status|          constraint|constraint_status|  constraint_message|
+----------------+-----------+------------+--------------------+-----------------+--------------------+
|Integrity checks|      Error|       Error|SizeConstraint(Si...|          Success|                    |
|Integrity checks|      Error|       Error|MinimumConstraint...|          Failure|Value: 1.0 does n...|
|Integrity checks|      Error|       Error|CompletenessConst...|          Failure|Value: 0.66666666...|
|Integrity checks|      Error|       Error|UniquenessConstra...|          Success|                    |
|Integrity checks|      Error|       Error|ComplianceConstra...|          Success|                    |
|Integrity checks|      Error|       Error|ComplianceConstra...|          Success|                    |
+----------------+-----------+------------+--------------------+-----------------+--------------------+

It's nice to see the results as a dataframe, but there are a couple of **Failures** in the `constraint_status` column. Let's pick them out by iterating over the `checkResults` property of our run.

In [4]:
if checkResult.status == "Success":
    print("The data passed the test, everything is fine!")

else:
    print("We found errors in the data, the following constraints were not satisfied:")

    for check_json in checkResult.checkResults:
        if check_json["constraint_status"] != "Success":
            print(
                f"\t{check_json['constraint']} failed because: {check_json['constraint_message']}"
            )

We found errors in the data, the following constraints were not satisfied:
	MinimumConstraint(Minimum(b,None)) failed because: Value: 1.0 does not meet the constraint requirement!
	CompletenessConstraint(Completeness(c,None)) failed because: Value: 0.6666666666666666 does not meet the constraint requirement!
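
To see where the two reported values come from, the same metrics can be recomputed by hand. The following is a plain-Python sketch that does not use Spark or PyDeequ. The row dictionaries mirror the sample dataframe, and the labels in `outcomes` are illustrative names, not Deequ constraint names.

```python
# Plain-Python stand-in for the three-row sample dataframe.
rows = [
    {"a": "foo", "b": 1, "c": 5},
    {"a": "bar", "b": 2, "c": 6},
    {"a": "baz", "b": 3, "c": None},
]

a_values = [r["a"] for r in rows]
b_values = [r["b"] for r in rows]
c_values = [r["c"] for r in rows]

# Metrics behind each constraint, recomputed by hand.
size = len(rows)
minimum_b = float(min(b_values))
# Fraction of non-null values in c.
completeness_c = sum(v is not None for v in c_values) / len(c_values)
# Fraction of values in a that occur exactly once.
uniqueness_a = sum(a_values.count(v) == 1 for v in a_values) / len(a_values)

# The same assertions as in the check, in the same order.
outcomes = {
    "hasSize >= 3": size >= 3,
    "hasMin(b) == 0": minimum_b == 0,
    "isComplete(c)": completeness_c == 1.0,
    "isUnique(a)": uniqueness_a == 1.0,
    "isContainedIn(a)": all(v in {"foo", "bar", "baz"} for v in a_values),
    "isNonNegative(b)": all(v >= 0 for v in b_values),
}

for name, passed in outcomes.items():
    print(f"{name}: {'Success' if passed else 'Failure'}")
```

The smallest value in `b` is 1, not 0, and only two of the three values in `c` are non-null, which is why those two constraints fail while the other four pass.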
