# Data Quality Test using Deequ

Deequ is a data quality management tool developed by Amazon, designed to help ensure the quality of data in large-scale pipelines. Deequ can be used to define data quality constraints and run checks against data sources to ensure that they meet those constraints.

PyDeequ is the Python API for Deequ. It lets users call Deequ functionality directly from Python code, which makes it convenient to apply Deequ's capabilities in Python-based data processing workflows.

In [2]:
# !pip install pydeequ

In [3]:
import findspark
findspark.init()

In [4]:
import os
import sys

os.environ["SPARK_VERSION"] = "3.0"

In [5]:
# Importing the necessary dependencies
import pandas as pd
import numpy as np
import pyspark
import pydeequ
import json
import sagemaker_pyspark

from pyspark.sql import SparkSession, Row, DataFrame, functions as F 
from pyspark.sql.functions import *
from pyspark.sql.types import *

from pydeequ.analyzers import *
from pydeequ.profiles import *
from pydeequ.suggestions import *
from pydeequ.checks import *
from pydeequ.verification import *



In [6]:
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

In [7]:
# Ingest the stock data for the specific year.
stock_data_path = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\transformed\1962_stock_data"
stock_data = spark.read.parquet(stock_data_path)
stock_data.count()

5106

In [8]:
df = stock_data

In [9]:
df.columns

['date', 'open', 'high', 'low', 'close', 'adj_close', 'volume', 'stock']

In [10]:
# Setting the display options so that results shown with .toPandas() are not truncated.
pd.set_option('display.max_rows', None) # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
pd.set_option('display.width', None)    # Disable the column width restriction
pd.set_option('display.max_colwidth', None)    # Disable the column content width restriction

* **Test for Completeness/Null Values**

Testing for null values using **Verification Suite**

In [11]:
# Completeness test

# Build a single check with one isComplete constraint per column
check = Check(spark, CheckLevel.Error, "Data Completeness Check")
for column in df.columns:
    check = check.isComplete(column)

# Run the verification once, covering all the constraints
checkResult = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check)
    .run())

resultDataFrame = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
resultDataFrame.toPandas()



Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(date,None))",Success,
1,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(open,None))",Success,
2,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(high,None))",Success,
3,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(low,None))",Failure,Value: 0.9917743830787309 does not meet the constraint requirement!
4,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(close,None))",Success,
5,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(adj_close,None))",Failure,Value: 0.9958871915393654 does not meet the constraint requirement!
6,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(volume,None))",Failure,Value: 0.9958871915393654 does not meet the constraint requirement!
7,Data Completeness Check,Error,Error,"CompletenessConstraint(Completeness(stock,None))",Success,
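
The `Value` in a failed completeness constraint is the fraction of non-null rows in that column. The plain-Python sketch below is only an illustration of that ratio, not Deequ's implementation; the helper names `completeness` and `missing_rows` are hypothetical, and it assumes the 5106 rows counted earlier are the rows that were checked. It converts a reported ratio back into an approximate number of missing rows.

```python
def completeness(values):
    """Fraction of entries that are not None."""
    values = list(values)
    if not values:
        return 1.0  # assumption: treat an empty column as complete
    return sum(v is not None for v in values) / len(values)


def missing_rows(ratio, total_rows):
    """Approximate number of null rows implied by a completeness ratio."""
    return round((1.0 - ratio) * total_rows)


TOTAL_ROWS = 5106  # row count reported by stock_data.count()

# Ratios copied from the constraint messages above
low_missing = missing_rows(0.9917743830787309, TOTAL_ROWS)
adj_close_missing = missing_rows(0.9958871915393654, TOTAL_ROWS)

print(low_missing, adj_close_missing)
```

Under these assumptions, `low` is missing about twice as many values as `adj_close` and `volume`.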


* **Test for Zeros**

Checking for the presence of entries with **zero** within the dataset using the **Verification Suite**

In [12]:
# Specifying only the numerical columns
numerical_cols = ['open', 'high', 'low', 'close', 'adj_close', 'volume']

In [13]:
# Setting up PyDeequ for Zero Values Check
# hasMin with `x == 0` succeeds only when the column minimum is exactly zero
check_zero = Check(spark, CheckLevel.Error, "Zero Values Check")

# Adding one minimum constraint per numerical column
for column in numerical_cols:
    check_zero = check_zero.hasMin(column, lambda x: x == 0)

# Running the verification once, covering all the constraints
checkResult_zero = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check_zero)
    .run())

# Displaying the results
resultDataFrame_zero = VerificationResult.checkResultsAsDataFrame(spark, checkResult_zero)
resultDataFrame_zero.toPandas()





Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(open,None))",Success,
1,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(high,None))",Success,
2,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(low,None))",Failure,Value: 0.05237788334488869 does not meet the constraint requirement!
3,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(close,None))",Failure,Value: 0.05362497642636299 does not meet the constraint requirement!
4,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(adj_close,None))",Failure,Value: 4.0381453914051235E-7 does not meet the constraint requirement!
5,Zero Values Check,Error,Error,"MinimumConstraint(Minimum(volume,None))",Success,
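
When reading this table, note what `hasMin(column, lambda x: x == 0)` asserts: the constraint succeeds only when the column minimum is exactly zero. Since the columns hold no negative values, `Success` signals that zeros are present, while `Failure` reports the smallest (positive) value found. The plain-Python model below illustrates that logic; the helper `min_is_zero` and the sample values are made up, and it assumes nulls are skipped when the minimum is taken, as Spark's `min` aggregate does.

```python
def min_is_zero(values):
    """Return (passed, minimum) for the assertion `minimum == 0`, skipping None."""
    present = [v for v in values if v is not None]
    minimum = min(present)
    return minimum == 0, minimum


# Made-up sample columns for illustration only
with_zero = [0.0, 1.5, None, 2.25]
without_zero = [0.05, 1.5, 2.25]

print(min_is_zero(with_zero))     # (True, 0.0): a zero is present
print(min_is_zero(without_zero))  # (False, 0.05): smallest value is positive
```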


* **Test for Negative Values**

Check for **Negative Values** in the dataset using **Verification Suite**

In [15]:
# Setting up PyDeequ for Negative Values Check
check_negative = Check(spark, CheckLevel.Error, "Negative Values Check")

# Adding one non-negativity constraint per numerical column
for column in numerical_cols:
    check_negative = check_negative.isNonNegative(column)

# Running the verification once, covering all the constraints
checkResult_negative = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check_negative)
    .run())

# Displaying the results
resultDataFrame_negative = VerificationResult.checkResultsAsDataFrame(spark, checkResult_negative)
resultDataFrame_negative.toPandas()



Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(open is non-negative,COALESCE(CAST(open AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,
1,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(high is non-negative,COALESCE(CAST(high AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,
2,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(low is non-negative,COALESCE(CAST(low AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,
3,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(close is non-negative,COALESCE(CAST(close AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,
4,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(adj_close is non-negative,COALESCE(CAST(adj_close AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,
5,Negative Values Check,Error,Success,"ComplianceConstraint(Compliance(volume is non-negative,COALESCE(CAST(volume AS DECIMAL(20,10)), 0.0) >= 0,None))",Success,


* **Test for Stock Tickers Consistency**

Using the **Verification Suite** to test the stock tickers for consistency with the symbols listed in the dataset metadata.

In [16]:
# Reading the metadata file "symbols_valid_meta.csv" into a pandas DataFrame.
metadata_path = r"C:\Users\USER\Desktop\Projects\Data-Profiling-and-Quality-Testing\data\symbols_valid_meta.csv"

metadata = pd.read_csv(metadata_path)
metadata.head()

Unnamed: 0,Nasdaq Traded,Symbol,Security Name,Listing Exchange,Market Category,ETF,Round Lot Size,Test Issue,Financial Status,CQS Symbol,NASDAQ Symbol,NextShares
0,Y,A,"Agilent Technologies, Inc. Common Stock",N,,N,100.0,N,,A,A,N
1,Y,AA,Alcoa Corporation Common Stock,N,,N,100.0,N,,AA,AA,N
2,Y,AAAU,Perth Mint Physical Gold ETF,P,,Y,100.0,N,,AAAU,AAAU,N
3,Y,AACG,"ATA Creativity Global - American Depositary Shares, each representing two common shares",Q,G,N,100.0,N,N,,AACG,N
4,Y,AADR,AdvisorShares Dorsey Wright ADR ETF,P,,Y,100.0,N,,AADR,AADR,N


In [18]:
# Converting the pandas metadata to a Spark DataFrame
meta = spark.createDataFrame(metadata)

In [19]:
# Distinct tickers present in the stock data
stock_symbol = [row['stock'] for row in df.select('stock').distinct().collect()]

# Distinct symbols listed in the metadata
symbol_column = set(metadata['Symbol'].unique())

# Printing every ticker that is missing from the metadata
for stock in stock_symbol:
    if stock not in symbol_column:
        print(stock)

ARNCA


In [23]:
# Converting the meta DataFrame's 'Symbol' column to a Python list of allowed values
allowed_stock_symbols = meta.select("Symbol").distinct().rdd.flatMap(lambda x: x).collect()

# Using VerificationSuite from PyDeequ to verify the stock tickers in the DataFrame `df`
verificationResult = VerificationSuite(spark) \
    .onData(df) \
    .addCheck(
        Check(spark, CheckLevel.Error, "Stock Ticker Verification") \
            .isContainedIn("stock", allowed_stock_symbols, 
                           hint="The stock ticker is not listed in the metadata.")
    ) \
    .run()

# Displaying the results of the verification
resultDataFrame_ticker = VerificationResult.checkResultsAsDataFrame(spark, verificationResult)
resultDataFrame_ticker.toPandas().drop(columns=['constraint'])



Unnamed: 0,check,check_level,check_status,constraint_status,constraint_message
0,Stock Ticker Verification,Error,Error,Failure,Value: 0.9958871915393654 does not meet the constraint requirement! The stock ticker is not listed in the metadata.
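
The value reported here is the fraction of rows whose `stock` ticker appears in the allowed list. The sketch below models that fraction in plain Python. It is not Deequ's implementation: the helper `contained_fraction` and the sample rows are hypothetical, and null handling is left out.

```python
def contained_fraction(values, allowed):
    """Fraction of values that appear in the allowed collection."""
    allowed = set(allowed)
    values = list(values)
    return sum(v in allowed for v in values) / len(values)


# Made-up sample: one of four rows uses a ticker missing from the allowed list
sample_tickers = ["AA", "AA", "ARNCA", "GE"]
allowed_sample = ["AA", "GE"]
fraction = contained_fraction(sample_tickers, allowed_sample)

# Rows implied by the reported value, assuming the 5106 rows counted earlier
off_list_rows = round((1 - 0.9958871915393654) * 5106)

print(fraction, off_list_rows)
```

Under that assumption, only a few dozen rows fail the check, which is consistent with a single ticker (`ARNCA`, printed earlier) being absent from the metadata.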


* **Test for Duplicates**

Using **Verification Suite** to check for the uniqueness of the entries in the dataset.

In [24]:
# Defining the primary key for uniqueness verification
primary_key = ['stock', 'date']

# Initializing VerificationSuite with the DataFrame
verification_suite = VerificationSuite(spark).onData(df)

# Defining a check for uniqueness
duplication_check = Check(spark, CheckLevel.Warning, "Duplication Check")\
    .hasUniqueness(primary_key, lambda x: x == 1) # Every (stock, date) pair must occur exactly once

# Adding the check to the VerificationSuite
verification_result = verification_suite.addCheck(duplication_check).run()

# Displaying the results
print("Duplication check results:")
results_df = VerificationResult.checkResultsAsDataFrame(spark, verification_result)
results_df.toPandas()

Duplication check results:




Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Duplication Check,Warning,Warning,"UniquenessConstraint(Uniqueness(Stream(stock, ?),None))",Failure,Value: 0.9776733254994124 does not meet the constraint requirement!
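
Deequ's uniqueness metric is the fraction of rows whose key occurs exactly once, so a value below 1 means some `(stock, date)` pairs repeat. A minimal plain-Python model of that metric is sketched below; the helper `uniqueness` and the sample rows are hypothetical and stand in for the Spark DataFrame.

```python
from collections import Counter


def uniqueness(rows, key_columns):
    """Fraction of rows whose key (tuple of key_columns) occurs exactly once."""
    keys = [tuple(row[c] for c in key_columns) for row in rows]
    counts = Counter(keys)
    unique_keys = sum(1 for n in counts.values() if n == 1)
    return unique_keys / len(keys)


# Made-up sample: the first two rows share the same (stock, date) key
rows = [
    {"stock": "AA", "date": "1962-01-02"},
    {"stock": "AA", "date": "1962-01-02"},
    {"stock": "AA", "date": "1962-01-03"},
    {"stock": "GE", "date": "1962-01-02"},
]

score = uniqueness(rows, ["stock", "date"])
print(score)  # 2 keys occur once out of 4 rows
```

Applied to the reported value, and assuming the 5106 rows counted earlier, roughly 114 rows belong to a repeated `(stock, date)` key.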
