# Deequ

## What is Deequ?
In this example, we use PyDeequ, an open-source Python wrapper around Deequ (an open-source tool developed and used at Amazon), to check data validity. Deequ is written in Scala; PyDeequ lets you use its data quality and testing capabilities from Python and PySpark. PyDeequ also interoperates with Pandas DataFrames, so you are not restricted to Apache Spark DataFrames.

**Deequ allows you to calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution**. Instead of implementing checks and verification algorithms on your own, you can focus on describing how your data should look. Deequ can also suggest checks for you. Deequ is implemented on top of Apache Spark and is designed to scale with large datasets (billions of rows) that typically live in a data lake, distributed file system, or data warehouse. PyDeequ gives you access to this capability from the familiar environment of a Python Jupyter notebook.




## What are PyDeequ’s main components?

- **Metrics computation** : Deequ computes data quality metrics, that is, statistics such as completeness, maximum, or correlation. Deequ uses Spark to read from sources and compute metrics through an optimized set of aggregation queries. You have direct access to the raw metrics computed on the data.
- **Constraint verification** : With a given set of data quality constraints, Deequ generates a data quality report, which contains the result of the constraint verification.
- **Constraint suggestion** : Deequ can suggest constraints by profiling the data.
- **Python wrappers** : You can call each Deequ function using Python syntax. The wrappers translate the commands to the underlying Deequ calls and return their response.


The diagram below shows PyDeequ's main components:
![PyDeequ_architecture](../images/pydeequ_architecture.jpg)

In [14]:
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructType, IntegerType,StringType
from pydeequ.analyzers import *
from pydeequ.checks import *
from pydeequ.verification import *
import pydeequ
import os

In [2]:
local=True
deequ_jar_path="../lib/deequ-2.0.0-spark-3.1.jar"
if local:
    spark=SparkSession.builder\
            .master("local[4]")\
            .appName("deequ_example") \
            .config("spark.driver.extraClassPath", deequ_jar_path) \
            .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
            .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
            .getOrCreate()
else:
    spark=SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("deequ_example") \
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0") \
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
        .config("spark.executor.instances", "4") \
        .config("spark.executor.memory","8g") \
        .config("spark.driver.extraClassPath", deequ_jar_path) \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
        .getOrCreate()

22/01/27 01:48:00 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/01/27 01:48:00 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/spark-3.1.2/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pliu/.ivy2/cache
The jars for the packages stored in: /home/pliu/.ivy2/jars
com.amazon.deequ#deequ added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-233da43a-0e1a-41ff-b659-1d4b7614e086;1.0
	confs: [default]
	found com.amazon.deequ#deequ;1.2.2-spark-3.0 in central
	found org.scalanlp#breeze_2.12;0.13.2 in central
	found org.scalanlp#breeze-macros_2.12;0.13.2 in central
	found org.scala-lang#scala-reflect;2.12.1 in central
	found com.github.fommil.netlib#core;1.1.2 in central
	found net.sf.opencsv#opencsv;2.3 in central
	found com.github.rwl#jtransforms;2.4.0 in central
	found junit#junit;4.8.2 in central
	found org.apache.commons#commons-math3;3.2 in central
	found org.spire-math#spire_2.12;0.13.0 in central
	found org.spire-math#spire-macros_2.12;0.13.0 in central
	found org.typelevel#machinist_2.12;0.6.1 in central
	found com.chuusai#shapeless_2.12;2.3.2 in central
	found org.typelevel#macro-compat_2.12;1.1.1 in ce

In [3]:
# read data
val_file_path="../data/adult.csv"
test_file_path="../data/adult_with_duplicates.csv"

schema = StructType() \
      .add("age",IntegerType(),True) \
      .add("workclass",StringType(),True) \
      .add("fnlwgt",IntegerType(),True) \
      .add("education",StringType(),True) \
      .add("education-num",IntegerType(),True) \
      .add("marital-status",StringType(),True) \
      .add("occupation",StringType(),True) \
      .add("relationship",StringType(),True) \
      .add("race",StringType(),True) \
      .add("sex",StringType(),True) \
      .add("capital-gain",IntegerType(),True) \
      .add("capital-loss",IntegerType(),True) \
      .add("hours-per-week",IntegerType(),True) \
      .add("native-country",StringType(),True) \
      .add("income",StringType(),True)

In [4]:
df_val=spark.read.option("header", False).schema(schema).csv(val_file_path)
df_val.show(5)
df_val.printSchema()

+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

In [5]:
df_test=spark.read.options(header='True', delimiter=',').schema(schema).csv(test_file_path)
df_test.show(5)
df_test.printSchema()

+----+--------------+------+---------+-------------+--------------+------------+-------------+-----+----+------------+------------+--------------+--------------+------+
| age|     workclass|fnlwgt|education|education-num|marital-status|  occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+----+--------------+------+---------+-------------+--------------+------------+-------------+-----+----+------------+------------+--------------+--------------+------+
| 139|     State-gov| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-States| <=50K|
| -12|     State-gov| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-States| <=50K|
|null|emp-by-pengfei| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-S

## 1. Data profiling

Before defining checks on the data, we calculate some statistics on the dataset; this step is called data profiling. Like Deequ, PyDeequ supports a rich set of metrics. For more information, see [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) or the [GitHub repo](https://github.com/awslabs/deequ/tree/master/src/main/scala/com/amazon/deequ/analyzers). In the following example, we use the **AnalysisRunner** to capture the metrics on valid data. You can find the source code of AnalysisRunner [here](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/analyzers/runners/AnalysisRunner.scala).


In [11]:
analysisResult = AnalysisRunner(spark) \
                    .onData(df_val) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("age")) \
                    .addAnalyzer(ApproxCountDistinct("age")) \
                    .addAnalyzer(Mean("age")) \
                    .addAnalyzer(Compliance("top age", "age >= 30")) \
                    .addAnalyzer(Correlation("marital-status", "relationship")) \
                    .addAnalyzer(Correlation("education-num", "age")) \
                    .run()

In [12]:
# Render the analysis result as a Spark DataFrame
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show(10,truncate=False)

+-----------+-----------------+-------------------+-------------------+
|entity     |instance         |name               |value              |
+-----------+-----------------+-------------------+-------------------+
|Column     |top age          |Compliance         |0.7017597739627162 |
|Dataset    |*                |Size               |32561.0            |
|Mutlicolumn|education-num,age|Correlation        |0.03652718946410633|
|Column     |age              |Completeness       |1.0                |
|Column     |age              |ApproxCountDistinct|73.0               |
|Column     |age              |Mean               |38.58164675532078  |
+-----------+-----------------+-------------------+-------------------+



Now let's go through the output metrics:
- row 1, Compliance: the result of Compliance("top age", "age >= 30"). It is the fraction of rows whose age is greater than or equal to 30 (about 0.70, that is, roughly 70% of the rows).
- row 2, Size: the result of Size(). It is the number of rows in the dataframe.
- row 3, Correlation: the result of Correlation("education-num", "age"). It is the correlation between the columns education-num and age. The value lies between -1 and 1: 1 means a perfect positive linear relationship, -1 a perfect negative linear relationship, and 0 no linear correlation. Here the value is about 0.037, so the two columns are almost uncorrelated.
- row 4, Completeness: the result of Completeness("age"). The value 1.0 means the age column has no missing values.
- row 5, ApproxCountDistinct: the result of ApproxCountDistinct("age"). The value 73.0 means the age column has approximately 73 distinct values.
- row 6, Mean: the result of Mean("age"). The value 38.58 means the average age in the column is about 38.58.
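
To make these definitions concrete, here is a minimal sketch in plain Python (it does not use the PyDeequ API) that recomputes completeness, compliance, and mean on a small made-up list of ages. The sample values and the helper names are assumptions for illustration only, and the treatment of missing values in `compliance` is a choice made for this sketch.

```python
from typing import Callable, List, Optional

# Hypothetical sample; None stands for a missing (NULL) value.
ages: List[Optional[int]] = [39, 50, 38, None, 28, 17]


def completeness(values: List[Optional[int]]) -> float:
    """Fraction of values that are not missing."""
    return sum(v is not None for v in values) / len(values)


def compliance(values: List[Optional[int]], predicate: Callable[[int], bool]) -> float:
    """Fraction of all rows for which the predicate holds.

    In this sketch a missing value counts as not compliant.
    """
    return sum(1 for v in values if v is not None and predicate(v)) / len(values)


def mean(values: List[Optional[int]]) -> float:
    """Average over the values that are present."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


print("completeness:", completeness(ages))
print("compliance (age >= 30):", compliance(ages, lambda a: a >= 30))
print("mean:", mean(ages))
```

All three metrics are single numbers derived from one pass over the column, which is why Deequ can compute many of them together in a few aggregation queries.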

**Note that there is no output row for the analyzer Correlation("marital-status", "relationship"). In the current version, the Correlation analyzer only works on numeric columns. If you pass two string columns, it outputs nothing**.
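
The restriction to numeric columns follows from what the metric measures. Deequ documents its Correlation analyzer as the Pearson correlation coefficient, which is built from means and squared deviations and therefore needs numbers. The following plain-Python sketch (the function name and sample values are illustrative assumptions, not part of PyDeequ) shows the computation and the meaning of the values 1, -1, and 0.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


xs = [1, 2, 3, 4]
print(pearson(xs, [2, 4, 6, 8]))    # perfectly increasing together
print(pearson(xs, [8, 6, 4, 2]))    # one goes up while the other goes down
print(pearson(xs, [1, -1, -1, 1]))  # no linear relationship
```

To correlate categorical columns such as marital-status and relationship, they would first have to be encoded as numbers, and whether such an encoding is meaningful depends on the data.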

## 2. Defining data validation rules

After profiling and understanding the data, we can define validation rules for incoming datasets. By making these rules part of a data pipeline, we can check every processed dataset against them and quickly detect anomalies.

In the example below, we implement the following data validation rules:
- The dataframe has at least 30000 rows (the validation data has 32561)
- age is never NULL
- age is unique (will fail)
- age has a minimum of 1 and maximum of 100
- workclass column only contains 'Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay', or '?'
- fnlwgt does not contain negative values (the intended column was capital-gain, but hyphenated column names currently cause problems in checks)


To run a validation in Deequ, we need a VerificationSuite, a dataframe, and a check (the validation rules). The VerificationSuite ties the dataframe and the check together.
After `run()` is called on the VerificationSuite, PyDeequ translates your test description into Deequ, which in turn translates it into a series of Spark jobs that compute metrics on the data. Afterwards, it invokes your assertion functions (for example, lambda x: x == 1.0 for the minimum age check) on these metrics to see whether the constraints hold on the data.
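
The two-step flow described above (compute metrics first, then apply assertion functions to them) can be sketched in plain Python. This is only a conceptual illustration under simplifying assumptions: the metric names, the sample data, and the `verify` helper are hypothetical and do not mirror Deequ's internals.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical sample column.
ages = [39, 50, 38, 53, 28]

# Step 1: compute each metric once on the data.
metrics: Dict[str, float] = {
    "Size": float(len(ages)),
    "Minimum(age)": float(min(ages)),
    "Maximum(age)": float(max(ages)),
}

# Step 2: a constraint pairs a metric name with an assertion function.
constraints: List[Tuple[str, Callable[[float], bool]]] = [
    ("Size", lambda x: x >= 5.0),
    ("Minimum(age)", lambda x: x >= 1.0),
    ("Maximum(age)", lambda x: x <= 40.0),
]


def verify(
    metrics: Dict[str, float],
    constraints: List[Tuple[str, Callable[[float], bool]]],
) -> Dict[str, str]:
    """Apply every assertion to its metric and report Success or Failure."""
    return {
        name: "Success" if assertion(metrics[name]) else "Failure"
        for name, assertion in constraints
    }


report = verify(metrics, constraints)
for name, status in report.items():
    print(f"{name}: {status}")
```

Separating metric computation from assertions is what allows the computed metrics to be inspected on their own, independently of whether a constraint passed.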


In [48]:
# define an instance of a check object
check = Check(spark, CheckLevel.Warning, "Census income dataset Check")

# A validation run in Deequ is based on a VerificationSuite, to which we add the data and the checks.
checkResult = VerificationSuite(spark) \
    .onData(df_val) \
    .addCheck(
        check.hasSize(lambda x: x >= 30000.0) \
        .hasMin("age", lambda x: x == 1.0) \
        .hasMax("age", lambda x: x == 100.0)  \
        .isComplete("age")  \
        .isUnique("age")  \
        .isContainedIn("workclass", ['Federal-gov', 'Local-gov', 'Never-worked', 'Private', 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay','?']) \
        .isNonNegative("fnlwgt")) \
    .run()

                                                                                

In [49]:
checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult).drop("check","check_level","check_status")
print(checkResult_df.count())
checkResult_df.show(100)

7
+--------------------+-----------------+--------------------+
|          constraint|constraint_status|  constraint_message|
+--------------------+-----------------+--------------------+
|SizeConstraint(Si...|          Failure|Can't execute the...|
|MinimumConstraint...|          Failure|Can't execute the...|
|MaximumConstraint...|          Failure|Can't execute the...|
|CompletenessConst...|          Success|                    |
|UniquenessConstra...|          Failure|Value: 6.14231749...|
|ComplianceConstra...|          Success|                    |
|ComplianceConstra...|          Success|                    |
+--------------------+-----------------+--------------------+



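In the result dataframe, some constraints passed and some failed.

The first three constraints failed with a "Can't execute the..." message: the assertion could not be executed at all, so these failures say nothing about the data. This is caused by a bug in PyDeequ, which does not pass the lambda function to Deequ correctly. Going by the metrics shown further down, the size check should have passed, whereas the minimum and maximum checks would have failed in any case, because the actual minimum and maximum of age are 17 and 90. For more details about this bug, please visit this [page](https://github.com/awslabs/deequ/issues/367).

We also discovered another bug: in checks, you can't use column names that contain "-". For example, if you replace "fnlwgt" with "capital-gain", the check treats the column name as "capital" instead of "capital-gain".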
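
One possible workaround for the hyphen problem is to rename the columns before running the checks. The sketch below is plain Python, and the helper name `sanitize_column_names` is hypothetical. Whether renaming avoids the problem in every PyDeequ version is an assumption that has not been verified here.

```python
import re
from typing import List


def sanitize_column_names(names: List[str]) -> List[str]:
    """Replace every character that is not a letter, digit, or underscore with '_'."""
    return [re.sub(r"[^0-9A-Za-z_]", "_", name) for name in names]


original = ["age", "education-num", "capital-gain", "hours-per-week"]
cleaned = sanitize_column_names(original)
print(cleaned)

# With a Spark DataFrame, the new names could then be applied with toDF, e.g.:
#   df_clean = df_val.toDF(*sanitize_column_names(df_val.columns))
```

If two original names map to the same cleaned name, the renaming would create duplicate columns, so the result should be checked for collisions on wider schemas.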

We can also look at all the metrics that Deequ computed for this check by running the following:

In [50]:
checkResult_df = VerificationResult.successMetricsAsDataFrame(spark, checkResult)
checkResult_df.show()

+-------+--------------------+------------+--------------------+
| entity|            instance|        name|               value|
+-------+--------------------+------------+--------------------+
| Column|                 age|  Uniqueness|6.142317496391388E-5|
|Dataset|                   *|        Size|             32561.0|
| Column|workclass contain...|  Compliance|                 1.0|
| Column|fnlwgt is non-neg...|  Compliance|                 1.0|
| Column|                 age|     Minimum|                17.0|
| Column|                 age|     Maximum|                90.0|
| Column|                 age|Completeness|                 1.0|
+-------+--------------------+------------+--------------------+
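
The Uniqueness value for age is tiny because uniqueness is the fraction of values that occur exactly once, not the fraction of distinct values. The reported value multiplied by the 32561 rows gives approximately 2, which would be consistent with only two ages appearing exactly once. The plain-Python sketch below (helper names and sample data are illustrative assumptions) contrasts the two notions.

```python
from collections import Counter
from typing import Hashable, Sequence


def uniqueness(values: Sequence[Hashable]) -> float:
    """Fraction of rows whose value occurs exactly once in the column."""
    counts = Counter(values)
    return sum(1 for c in counts.values() if c == 1) / len(values)


def distinctness(values: Sequence[Hashable]) -> float:
    """Fraction of distinct values relative to the number of rows."""
    return len(set(values)) / len(values)


# Hypothetical sample: 28 and 17 occur once, 39 and 50 occur twice.
sample_ages = [39, 50, 39, 28, 50, 17]
print("uniqueness:", uniqueness(sample_ages))
print("distinctness:", distinctness(sample_ages))

# Relating the reported metric back to a row count.
reported_uniqueness = 6.142317496391388e-5
print("rows with a unique age:", round(reported_uniqueness * 32561))
```

A check such as `isUnique("age")` can only succeed when the uniqueness is 1.0, which is why it is expected to fail on a column like age.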



## 3. Automated constraint suggestion

If you own a large number of datasets or if your dataset has many columns, it may be challenging for you to manually define appropriate constraints. Deequ can automatically suggest useful constraints based on the data distribution. Deequ first runs a data profiling method and then applies a set of rules on the result. For more information about how to run a data profiling method, see the [GitHub repo](https://github.com/awslabs/deequ/blob/master/src/main/scala/com/amazon/deequ/examples/data_profiling_example.md).
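
The idea of profiling first and then applying rules to the profile can be illustrated with a small plain-Python sketch. The rules, thresholds, and helper names below are assumptions chosen for illustration; they are not Deequ's actual rule set, and the output strings merely imitate the style of check methods used earlier.

```python
import json
from typing import Any, Dict, List, Optional, Sequence


def profile_column(values: Sequence[Optional[Any]]) -> Dict[str, Any]:
    """Collect a few simple statistics for one column."""
    if not values:
        raise ValueError("cannot profile an empty column")
    present = [v for v in values if v is not None]
    numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
    return {
        "completeness": len(present) / len(values),
        "numeric": numeric,
        "minimum": min(present) if numeric else None,
        # Only non-numeric columns get a list of observed categories.
        "categories": sorted({str(v) for v in present}) if not numeric else [],
    }


def suggest_constraints(column: str, profile: Dict[str, Any], max_categories: int = 10) -> List[str]:
    """Apply simple, hypothetical rules to a column profile."""
    suggestions = []
    if profile["completeness"] == 1.0:
        suggestions.append(f'.isComplete("{column}")')
    if profile["numeric"] and profile["minimum"] >= 0:
        suggestions.append(f'.isNonNegative("{column}")')
    if 0 < len(profile["categories"]) <= max_categories:
        allowed = json.dumps(profile["categories"])
        suggestions.append(f'.isContainedIn("{column}", {allowed})')
    return suggestions


# Hypothetical sample data; None stands for a missing value.
data = {
    "age": [39, 50, None, 28],
    "sex": ["Male", "Female", "Male", "Male"],
}
suggestions = {
    name: suggest_constraints(name, profile_column(values))
    for name, values in data.items()
}
print(json.dumps(suggestions, indent=2))
```

Suggestions derived this way reflect only the profiled sample, so they should be reviewed before being turned into checks.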

In [51]:
import json
from pydeequ.suggestions import *
suggestionResult = ConstraintSuggestionRunner(spark) \
             .onData(df_val) \
             .addConstraintRule(DEFAULT()) \
             .run()

# Constraint Suggestions in JSON format
print(json.dumps(suggestionResult, indent=2))

Py4JError: An error occurred while calling None.com.amazon.deequ.suggestions.rules.CategoricalRangeRule. Trace:
py4j.Py4JException: Constructor com.amazon.deequ.suggestions.rules.CategoricalRangeRule([]) does not exist
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
	at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
	at py4j.Gateway.invoke(Gateway.java:237)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)



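This is another known bug: as the error shows, the Python wrapper calls a `CategoricalRangeRule` constructor that does not exist in the loaded Deequ version. See the [issue](https://github.com/awslabs/python-deequ/issues/70) for details.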