# Deequ

## What is Deequ?
In this example, we use PyDeequ, an open-source Python wrapper around Deequ (an open-source tool developed and used at Amazon), to check data validity. Deequ is written in Scala; PyDeequ exposes its data quality and testing capabilities to Python and PySpark. PyDeequ also interoperates with pandas DataFrames, so you are not restricted to Apache Spark DataFrames.

**Deequ lets you calculate data quality metrics on your dataset, define and verify data quality constraints, and be informed about changes in the data distribution.** Instead of implementing checks and verification algorithms yourself, you can focus on describing how your data should look, and Deequ can also suggest checks for you. Deequ is implemented on top of Apache Spark and is designed to scale to large datasets (billions of rows) that typically live in a data lake, a distributed file system, or a data warehouse. PyDeequ gives you access to these capabilities from the familiar environment of a Python Jupyter notebook.



## What are PyDeequ’s main components?

- **Metrics computation**: Deequ computes data quality metrics, that is, statistics such as completeness, maximum, or correlation. Deequ uses Spark to read from sources and computes the metrics through an optimized set of aggregation queries. You have direct access to the raw metrics computed on the data.
- **Constraint verification**: Given a set of data quality constraints, Deequ generates a data quality report that contains the result of the constraint verification.
- **Constraint suggestion**: Deequ can generate constraints automatically by profiling the data.
- **Python wrappers**: You can call each Deequ function using Python syntax. The wrappers translate the commands into the underlying Deequ calls and return their responses.


The diagram below shows PyDeequ's main components:
![PyDeequ_architecture](../images/pydeequ_architecture.jpg)
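
To make the first two components more concrete, here is a minimal plain-Python sketch of the idea: compute a few metrics on a column, then verify declarative constraints against those metrics and produce a report. It does not use Deequ or Spark. The function names `compute_metrics` and `verify`, the constraint names, and the sample values are hypothetical and chosen only for illustration; Deequ's own implementation differs.

```python
from typing import Callable, Dict, List, Optional

# Illustrative values only; they are not read from the notebook's CSV files.
ages: List[Optional[int]] = [39, 50, 38, None, 139, -12]


def compute_metrics(values: List[Optional[int]]) -> Dict[str, float]:
    """Compute a few simple data quality metrics for one non-empty column."""
    non_null = [v for v in values if v is not None]
    return {
        "size": len(values),                          # number of rows
        "completeness": len(non_null) / len(values),  # fraction of non-null rows
        "minimum": min(non_null),
        "maximum": max(non_null),
    }


# Each constraint is a name plus a predicate over the computed metrics.
constraints: Dict[str, Callable[[Dict[str, float]], bool]] = {
    "dataset is not empty": lambda m: m["size"] > 0,
    "age is complete": lambda m: m["completeness"] == 1.0,
    "age is non-negative": lambda m: m["minimum"] >= 0,
    "age is at most 120": lambda m: m["maximum"] <= 120,
}


def verify(metrics: Dict[str, float],
           checks: Dict[str, Callable[[Dict[str, float]], bool]]) -> Dict[str, str]:
    """Evaluate every constraint and return a small report."""
    return {name: "Success" if check(metrics) else "Failure"
            for name, check in checks.items()}


metrics = compute_metrics(ages)
report = verify(metrics, constraints)
for name, status in report.items():
    print(f"{name}: {status}")
```

The point of the design is that constraints only describe how the data should look; the metric computation behind them is shared and, in Deequ's case, pushed down to Spark aggregations.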

In [1]:
import os

import pydeequ
from pydeequ.analyzers import *
from pyspark.sql import SparkSession, DataFrame
from pyspark.sql.types import StructType, IntegerType, StringType

Please set env variable SPARK_VERSION
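
The message above is printed when `pydeequ` is imported without the `SPARK_VERSION` environment variable. A minimal sketch of setting it from Python, to be run before the import, could look like this. The value `"3.2"` is an assumption based on the Spark 3.2.0 installation that appears in the logs further down; the accepted format and the supported versions depend on the installed PyDeequ release.

```python
import os

# Assumption: "3.2" matches the Spark installation used in this notebook.
# Adjust it to your own Spark version. setdefault keeps any value that is
# already set in the environment.
os.environ.setdefault("SPARK_VERSION", "3.2")

print("SPARK_VERSION =", os.environ["SPARK_VERSION"])
```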


In [2]:
local = True
# Local Deequ jar; its file name indicates a build for Spark 3.1.
deequ_jar_path = "../lib/deequ-2.0.0-spark-3.1.jar"

if local:
    # Local mode: run Spark in this process with 4 worker threads.
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("deequ_example") \
        .config("spark.driver.extraClassPath", deequ_jar_path) \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .getOrCreate()
else:
    # Cluster mode: run executors on Kubernetes. The service account and
    # namespace are read from environment variables.
    spark = SparkSession.builder \
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("deequ_example") \
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0") \
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT']) \
        .config("spark.executor.instances", "4") \
        .config("spark.executor.memory", "8g") \
        .config("spark.driver.extraClassPath", deequ_jar_path) \
        .config("spark.jars.packages", pydeequ.deequ_maven_coord) \
        .config("spark.jars.excludes", pydeequ.f2j_maven_coord) \
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE']) \
        .getOrCreate()

22/01/26 11:01:07 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 192.168.184.146 instead (on interface ens33)
22/01/26 11:01:07 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address


:: loading settings :: url = jar:file:/opt/spark/spark-3.2.0/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pliu/.ivy2/cache
The jars for the packages stored in: /home/pliu/.ivy2/jars
com.amazon.deequ#deequ added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-d5a2ff8c-2249-41af-8c5b-f9d3c09755fa;1.0
	confs: [default]
	found com.amazon.deequ#deequ;1.2.2-spark-3.0 in central
	found org.scalanlp#breeze_2.12;0.13.2 in central
	found org.scalanlp#breeze-macros_2.12;0.13.2 in central
	found org.scala-lang#scala-reflect;2.12.1 in central
	found com.github.fommil.netlib#core;1.1.2 in central
	found net.sf.opencsv#opencsv;2.3 in central
	found com.github.rwl#jtransforms;2.4.0 in central
	found junit#junit;4.8.2 in central
	found org.apache.commons#commons-math3;3.2 in central
	found org.spire-math#spire_2.12;0.13.0 in central
	found org.spire-math#spire-macros_2.12;0.13.0 in central
	found org.typelevel#machinist_2.12;0.6.1 in central
	found com.chuusai#shapeless_2.12;2.3.2 in central
	found org.typelevel#macro-compat_2.12;1.1.1 in ce

In [3]:
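# Input files: a clean validation dataset and a test dataset that contains
# duplicates and invalid values.
val_file_path = "../data/adult.csv"
test_file_path = "../data/adult_with_duplicates.csv"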

schema = StructType() \
      .add("age",IntegerType(),True) \
      .add("workclass",StringType(),True) \
      .add("fnlwgt",IntegerType(),True) \
      .add("education",StringType(),True) \
      .add("education-num",IntegerType(),True) \
      .add("marital-status",StringType(),True) \
      .add("occupation",StringType(),True) \
      .add("relationship",StringType(),True) \
      .add("race",StringType(),True) \
      .add("sex",StringType(),True) \
      .add("capital-gain",IntegerType(),True) \
      .add("capital-loss",IntegerType(),True) \
      .add("hours-per-week",IntegerType(),True) \
      .add("native-country",StringType(),True) \
      .add("income",StringType(),True)

In [4]:
df_val=spark.read.option("header", False).schema(schema).csv(val_file_path)
df_val.show(5)
df_val.printSchema()

+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
|age|       workclass|fnlwgt|education|education-num|    marital-status|       occupation| relationship| race|   sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+---+----------------+------+---------+-------------+------------------+-----------------+-------------+-----+------+------------+------------+--------------+--------------+------+
| 39|       State-gov| 77516|Bachelors|           13|     Never-married|     Adm-clerical|Not-in-family|White|  Male|        2174|           0|            40| United-States| <=50K|
| 50|Self-emp-not-inc| 83311|Bachelors|           13|Married-civ-spouse|  Exec-managerial|      Husband|White|  Male|           0|           0|            13| United-States| <=50K|
| 38|         Private|215646|  HS-grad|            9|          Divorced|Handlers-cleaners|Not-i

In [5]:
df_test=spark.read.options(header='True', delimiter=',').schema(schema).csv(test_file_path)
df_test.show(5)
df_test.printSchema()

+----+--------------+------+---------+-------------+--------------+------------+-------------+-----+----+------------+------------+--------------+--------------+------+
| age|     workclass|fnlwgt|education|education-num|marital-status|  occupation| relationship| race| sex|capital-gain|capital-loss|hours-per-week|native-country|income|
+----+--------------+------+---------+-------------+--------------+------------+-------------+-----+----+------------+------------+--------------+--------------+------+
| 139|     State-gov| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-States| <=50K|
| -12|     State-gov| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-States| <=50K|
|null|emp-by-pengfei| 77516|Bachelors|           13| Never-married|Adm-clerical|Not-in-family|White|Male|        2174|           0|            40| United-S

## 1. Data analysis

Before we define checks on the data, we want to calculate some statistics on the dataset; we call this step data profiling. Like Deequ, PyDeequ supports a rich set of metrics. For more information, see [Test data quality at scale with Deequ](https://aws.amazon.com/blogs/big-data/test-data-quality-at-scale-with-deequ/) or the [GitHub repo](https://github.com/awslabs/deequ/tree/master/src/main/scala/com/amazon/deequ/analyzers). In the following example, we use the **AnalysisRunner** to capture metrics on the valid dataset (`df_val`).


In [6]:
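# Correlation is only meaningful for numeric columns, so it is computed
# on pairs of integer columns.
analysisResult = AnalysisRunner(spark) \
                    .onData(df_val) \
                    .addAnalyzer(Size()) \
                    .addAnalyzer(Completeness("age")) \
                    .addAnalyzer(ApproxCountDistinct("age")) \
                    .addAnalyzer(Mean("age")) \
                    .addAnalyzer(Compliance("top age", "age >= 30")) \
                    .addAnalyzer(Correlation("age", "hours-per-week")) \
                    .addAnalyzer(Correlation("education-num", "capital-gain")) \
                    .run()

# Collect the successfully computed metrics into a Spark DataFrame and display them.
analysisResult_df = AnalyzerContext.successMetricsAsDataFrame(spark, analysisResult)
analysisResult_df.show()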

Py4JJavaError: An error occurred while calling o89.run.
: java.lang.AbstractMethodError: Receiver class org.apache.spark.sql.catalyst.expressions.aggregate.StatefulHyperloglogPlus does not define or inherit an implementation of the resolved method 'abstract org.apache.spark.sql.catalyst.trees.TreeNode withNewChildrenInternal(scala.collection.IndexedSeq)' of abstract class org.apache.spark.sql.catalyst.trees.TreeNode.
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$withNewChildren$2(TreeNode.scala:359)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
	at org.apache.spark.sql.catalyst.trees.TreeNode.withNewChildren(TreeNode.scala:358)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:595)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
	at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$3(TreeNode.scala:486)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:595)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:486)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
	at org.apache.spark.sql.catalyst.trees.TreeNode.transform(TreeNode.scala:425)
	at org.apache.spark.sql.catalyst.util.package$.usePrettyExpression(package.scala:128)
	at org.apache.spark.sql.catalyst.util.package$.toPrettySQL(package.scala:158)
	at org.apache.spark.sql.RelationalGroupedDataset.alias(RelationalGroupedDataset.scala:89)
	at org.apache.spark.sql.RelationalGroupedDataset.$anonfun$toDF$1(RelationalGroupedDataset.scala:65)
	at scala.collection.immutable.List.map(List.scala:297)
	at org.apache.spark.sql.RelationalGroupedDataset.toDF(RelationalGroupedDataset.scala:65)
	at org.apache.spark.sql.RelationalGroupedDataset.agg(RelationalGroupedDataset.scala:255)
	at org.apache.spark.sql.Dataset.agg(Dataset.scala:1890)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.liftedTree1$1(AnalysisRunner.scala:326)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.runScanningAnalyzers(AnalysisRunner.scala:318)
	at com.amazon.deequ.analyzers.runners.AnalysisRunner$.doAnalysisRun(AnalysisRunner.scala:167)
	at com.amazon.deequ.analyzers.runners.AnalysisRunBuilder.run(AnalysisRunBuilder.scala:110)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.base/java.lang.Thread.run(Thread.java:829)
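
The `AbstractMethodError` above is consistent with a binary mismatch between Spark and Deequ. The logs show Spark 3.2.0 jars, while the Deequ artifacts on the classpath target older releases: the local jar is built for Spark 3.1 and the resolved Maven package for Spark 3.0. To our knowledge, `withNewChildrenInternal` was added to `TreeNode` in Spark 3.2, so classes compiled against an older Spark do not implement it. The sketch below is a hypothetical pre-flight check, not part of PyDeequ: the helper names are ours, and it assumes the artifact name encodes its target Spark version as `spark-<major>.<minor>`.

```python
import re


def spark_minor(version: str) -> str:
    """Return 'major.minor' from a Spark version string such as '3.2.0'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"cannot parse Spark version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def deequ_target_spark(artifact: str) -> str:
    """Extract the Spark version that a Deequ artifact name targets."""
    match = re.search(r"spark-(\d+\.\d+)", artifact)
    if match is None:
        raise ValueError(f"no Spark suffix found in: {artifact!r}")
    return match.group(1)


def is_compatible(spark_version: str, artifact: str) -> bool:
    """True when the artifact targets the same major.minor Spark release."""
    return spark_minor(spark_version) == deequ_target_spark(artifact)


# Values taken from this notebook: the Spark version from the logs, the local
# jar path, and the Maven artifact reported by the Ivy resolver.
runtime_spark = "3.2.0"
artifacts = [
    "../lib/deequ-2.0.0-spark-3.1.jar",
    "com.amazon.deequ#deequ;1.2.2-spark-3.0",
]
for artifact in artifacts:
    status = "ok" if is_compatible(runtime_spark, artifact) else "MISMATCH"
    print(f"{artifact}: {status}")
```

If the check reports a mismatch, the likely fix is to use a Deequ build whose suffix matches the running Spark version and to keep only one Deequ version on the classpath.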
