# **Data Quality Verification - Deequ Analysis**
A library for measuring the data quality of large datasets. 



![](https://www.amurta.com/wp-content/uploads/2021/08/Infographics-The-6-Dimensions-of-Data-Quality-01.png)

# <a id="contents">Table of Contents</a><br>
1. [**Importing Dependencies**](#Import_dependencies)
  
     
    
2. [**Read Data into Spark DataFrame**](#preprocessing)
>   2.1 Import as CSV <br>
    2.2 Convert into Spark DataFrame<br>
    2.3 Display Dataset schemas<br>
   
3. **Data Quality Checks**<br>
>   3.1 [**Null Values Check**](#Null_Values)<br>
    3.2 [**Zero Values Check**](#Zero_Values)<br>
    3.3 [**Negative Values Check**](#Negative_Values)<br>
    3.4 [**Determine Maximum Values**](#Maximum_Values)<br>
    3.5 [**Duplications Check**](#Duplications)<br>
    
    
4. [**Conclusion and Recommendation**](#Conclusion/Recommendation) <br>
   > 4.1 [**Conclusion**](#Conclusion) <br>
     4.2 [**Recommendation**](#Recommendation) <br><br>


## Import_dependencies

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pydeequ
from pydeequ.analyzers import *
from pydeequ.profiles import *
from pydeequ.suggestions import *
from pydeequ.checks import *
from pydeequ.verification import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, DoubleType, IntegerType, DateType, NumericType, StructType, StringType, StructField

Please set env variable SPARK_VERSION
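
The message above appears to come from `pydeequ`, which looks for a `SPARK_VERSION` environment variable when it is imported. One way to avoid the message is to set the variable before the first `import pydeequ`. The sketch below rests on assumptions: the helper `major_minor` and the fallback value `"3.3"` are hypothetical, and `pydeequ` is assumed to accept a `major.minor` string, so substitute the Spark version actually installed.

```python
import os

def major_minor(version: str) -> str:
    """Reduce a full version string such as '3.3.1' to 'major.minor' ('3.3')."""
    return ".".join(version.split(".")[:2])

try:
    # Prefer the version of the installed PySpark package, if there is one
    import pyspark
    detected = major_minor(pyspark.__version__)
except ImportError:
    # Hypothetical fallback; replace with the Spark version you actually run
    detected = "3.3"

# Keep any value the user has already exported; otherwise use the detected one
os.environ.setdefault("SPARK_VERSION", detected)

# `import pydeequ` should come only after this point
```

Using `setdefault` leaves an explicitly exported `SPARK_VERSION` untouched.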


In [None]:
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

## Read Data into Spark DataFrame

In [None]:
#Read Test file
df_test_pandas = pd.read_csv('test.csv')
df_test = spark.createDataFrame(df_test_pandas)

#Read Train file
df_train_pandas = pd.read_csv('train.csv')
df_train = spark.createDataFrame(df_train_pandas)



In [None]:
# View the schema of each file as loaded

print("TEST FILE SCHEMA:")
df_test.printSchema() 

print("TRAIN FILE SCHEMA:")
df_train.printSchema()

TEST FILE SCHEMA:
root
 |-- doi: string (nullable = true)
 |-- text_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- labels_negative: long (nullable = true)
 |-- labels_positive: long (nullable = true)
 |-- agreement: double (nullable = true)
 |-- id: long (nullable = true)

TRAIN FILE SCHEMA:
root
 |-- doi: string (nullable = true)
 |-- text_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- sdg: long (nullable = true)
 |-- labels_negative: long (nullable = true)
 |-- labels_positive: long (nullable = true)
 |-- agreement: double (nullable = true)
 |-- id: long (nullable = true)



## **Data Quality Checks**

## Null_Values
Check every column for completeness, i.e. that it contains no null values.

In [None]:
def test_nulltest(data):
  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(
                    Check(spark,CheckLevel.Warning, "missing values")\
                    .isComplete('doi')\
                    .isComplete('text_id')\
                    .isComplete('text')\
                    .isComplete('labels_negative')\
                    .isComplete('labels_positive')\
                    .isComplete('agreement')\
                    .isComplete('id')\
                    .areComplete(data.columns))\
                    .run()

  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [None]:
# Null values check for Test file
test_nulltest(df_test)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,missing values,Warning,Success,"CompletenessConstraint(Completeness(doi,None))",Success,
1,missing values,Warning,Success,"CompletenessConstraint(Completeness(text_id,No...",Success,
2,missing values,Warning,Success,"CompletenessConstraint(Completeness(text,None))",Success,
3,missing values,Warning,Success,CompletenessConstraint(Completeness(labels_neg...,Success,
4,missing values,Warning,Success,CompletenessConstraint(Completeness(labels_pos...,Success,
5,missing values,Warning,Success,"CompletenessConstraint(Completeness(agreement,...",Success,
6,missing values,Warning,Success,"CompletenessConstraint(Completeness(id,None))",Success,
7,missing values,Warning,Success,ComplianceConstraint(Compliance(Combined Compl...,Success,


In [None]:
# Null values check for Train file
test_nulltest(df_train)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,missing values,Warning,Success,"CompletenessConstraint(Completeness(doi,None))",Success,
1,missing values,Warning,Success,"CompletenessConstraint(Completeness(text_id,No...",Success,
2,missing values,Warning,Success,"CompletenessConstraint(Completeness(text,None))",Success,
3,missing values,Warning,Success,CompletenessConstraint(Completeness(labels_neg...,Success,
4,missing values,Warning,Success,CompletenessConstraint(Completeness(labels_pos...,Success,
5,missing values,Warning,Success,"CompletenessConstraint(Completeness(agreement,...",Success,
6,missing values,Warning,Success,"CompletenessConstraint(Completeness(id,None))",Success,
7,missing values,Warning,Success,ComplianceConstraint(Compliance(Combined Compl...,Success,
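
Completeness here means the fraction of non-null entries in a column, and `isComplete` requires that fraction to be 1. One point may be worth verifying, stated as an assumption rather than a finding: the data is loaded through pandas, which represents missing numbers as `NaN`, and depending on how the pandas DataFrame is converted, a `NaN` may reach Spark as a floating-point value rather than a null. The plain-Python sketch below (hypothetical helper names, invented toy values) shows how the two notions of "missing" can disagree.

```python
import math

def completeness(values):
    """Fraction of entries that are not None; assumes a non-empty input."""
    values = list(values)
    return sum(v is not None for v in values) / len(values)

def completeness_nan_aware(values):
    """Same as completeness, but float NaN also counts as missing."""
    values = list(values)

    def present(v):
        return v is not None and not (isinstance(v, float) and math.isnan(v))

    return sum(present(v) for v in values) / len(values)

# Toy column: one NaN and one None among four entries
toy_column = [0.8, float("nan"), None, 1.0]

print(completeness(toy_column))            # 0.75
print(completeness_nan_aware(toy_column))  # 0.5
```

If the two numbers differed on the real data, a `NaN`-aware check would be a reasonable addition to the completeness checks.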


## Zero_Values

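Check the numeric columns for zero values. Each constraint asserts that a column contains no zeros, so a `Failure` in the results means that zeros are present, not that the data is invalid.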

In [None]:
def test_zerotest(data):
  # test.csv has no 'sdg' column, so only check the numeric columns present in this DataFrame
  zero_cols = [c for c in ['labels_negative','labels_positive','agreement','id','sdg'] if c in data.columns]

  # Each constraint asserts that the fraction of rows equal to zero is 0, i.e. the column has no zeros
  check = Check(spark,CheckLevel.Warning, "Non Zero Values")
  for c in zero_cols:
    check = check.satisfies(f"{c} == 0", "Zero values", lambda x: x==0)

  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(check) \
                    .run()

  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [None]:
#Zero values check on Test file
test_zerotest(df_test)

Python Callback server started!


Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,la...",Failure,Value: 0.39571450593494684 does not meet the c...
1,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,la...",Failure,Value: 0.026360413133960228 does not meet the ...
2,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,ag...",Failure,Value: 0.026668722059503623 does not meet the ...
3,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,id...",Success,


In [None]:
#Zero values check on Train file
test_zerotest(df_train)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,la...",Failure,Value: 0.3996685168054271 does not meet the co...
1,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,la...",Failure,Value: 0.024668516805427074 does not meet the ...
2,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,ag...",Failure,Value: 0.023936170212765957 does not meet the ...
3,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,id...",Success,
4,Non Zero Values,Warning,Warning,"ComplianceConstraint(Compliance(Zero values,sd...",Success,
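
The `Value:` reported in each failure message is the fraction of rows that satisfy the predicate (for example `labels_negative == 0`), and the assertion `lambda x: x==0` passes only when that fraction is exactly zero. A minimal plain-Python sketch of that arithmetic follows; the `compliance` and `assertion` helpers are hypothetical and the values are invented toy data, not the real files.

```python
def compliance(values, predicate):
    """Fraction of values for which predicate(value) is true; assumes a non-empty input."""
    values = list(values)
    return sum(1 for v in values if predicate(v)) / len(values)

def assertion(fraction):
    # Same assertion as in test_zerotest: no row may match the predicate
    return fraction == 0

# Toy column with two zeros among five entries
toy_labels_negative = [0, 3, 0, 1, 7]

zero_fraction = compliance(toy_labels_negative, lambda v: v == 0)
print(zero_fraction)             # 0.4
print(assertion(zero_fraction))  # False
```

Read this way, and going by the order of the constraints, the first value reported for the test file (about 0.396) says that roughly 40% of its rows have `labels_negative == 0`.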


## Test 3 - Negative values ➖️
Check that the numeric columns contain no negative values (zeros are allowed).


In [None]:
# Check for negative values
def test_negative(data):
  # test.csv has no 'sdg' column, so only check the numeric columns present in this DataFrame
  numeric_cols = [c for c in ['labels_negative','labels_positive','agreement','id','sdg'] if c in data.columns]

  check = Check(spark,CheckLevel.Warning, "Non Negative Values")
  for c in numeric_cols:
    check = check.isNonNegative(c)

  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(check) \
                    .run()

  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [None]:
#Negative values check on Test file
test_negative(df_test)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(labels_negativ...,Success,
1,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(labels_positiv...,Success,
2,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(agreement is n...,Success,
3,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(id is non-nega...,Success,


In [None]:
#Negative values check on Train file
test_negative(df_train)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(labels_negativ...,Success,
1,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(labels_positiv...,Success,
2,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(agreement is n...,Success,
3,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(id is non-nega...,Success,
4,Non Negative Values,Warning,Success,ComplianceConstraint(Compliance(sdg is non-neg...,Success,


## Test 4 - Determine Maximum Values ⚠️
We want to find the maximum value of each numeric column. These maxima can serve as upper bounds, i.e. as threshold values against which later data can be validated.

In [None]:
num_cols = ['sdg','labels_negative','labels_positive','agreement','id']

def test_maxvalue(data):
  # Profile every column, then report the maximum of each numeric column
  result = ColumnProfilerRunner(spark) \
    .onData(data) \
    .run()

  for col, profile in result.profiles.items():
    if col in num_cols:
        print(f"Column: '{col}'")
        print('\t',f'Maximum Value: {profile.maximum}')

In [None]:
#Maximum Values check for Test file
test_maxvalue(df_test)

Column: 'agreement'
	 Maximum Value: 1.0
Column: 'labels_positive'
	 Maximum Value: 882.0
Column: 'id'
	 Maximum Value: 6487.0
Column: 'labels_negative'
	 Maximum Value: 106.0


In [None]:
#Maximum Values check for Train file
test_maxvalue(df_train)

Column: 'sdg'
	 Maximum Value: 15.0
Column: 'agreement'
	 Maximum Value: 1.0
Column: 'labels_positive'
	 Maximum Value: 925.0
Column: 'id'
	 Maximum Value: 25944.0
Column: 'labels_negative'
	 Maximum Value: 837.0
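
As a sketch of how these maxima could act as thresholds, the train-file maxima printed above can be stored as upper bounds and compared with the test-file maxima. The dictionary names and the `exceeds_bounds` helper are hypothetical; the numbers are copied from the outputs above. The `id` column is left out on the assumption that it is an identifier rather than a measurement.

```python
# Maxima printed for the train file, used here as upper bounds
train_max = {"labels_negative": 837.0, "labels_positive": 925.0, "agreement": 1.0}

# Maxima printed for the test file
test_max = {"labels_negative": 106.0, "labels_positive": 882.0, "agreement": 1.0}

def exceeds_bounds(observed, bounds):
    """Return the columns whose observed maximum is above the stored bound."""
    return {col: value for col, value in observed.items()
            if col in bounds and value > bounds[col]}

print(exceeds_bounds(test_max, train_max))  # {}
```

An empty result means that no test-file maximum exceeds the corresponding train-file maximum.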


## Test 5 - Duplication 👥️
Lastly, we want to determine the uniqueness of the rows in the DataFrame.

The first thing to check is whether the primary key values within the dataset are unique; in our case the key is the combination of **text_id** and **id**. Secondly, we want every entry to be unique, i.e. no row duplicated across the whole dataset. The function below asserts the first condition only. Because **text_id** and **id** are part of every row, a unique key also implies that no complete row is repeated.

In [None]:
def test_duplicates(data):
  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(
                    Check(spark,CheckLevel.Error, "Unique Values")\
                    .hasUniqueness(("text_id","id"), lambda x: x == 1)\
                    )\
                    .run()

  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [None]:
#Duplicates check for Test file
test_duplicates(df_test)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Unique Values,Error,Success,UniquenessConstraint(Uniqueness(Stream(text_id...,Success,


In [None]:
#Duplicates check for Train file
test_duplicates(df_train)

Unnamed: 0,check,check_level,check_status,constraint,constraint_status,constraint_message
0,Unique Values,Error,Success,UniquenessConstraint(Uniqueness(Stream(text_id...,Success,
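
For reference, the uniqueness metric asserted above is the fraction of rows whose key occurs exactly once, so `x == 1` demands that no `(text_id, id)` pair repeats. The plain-Python sketch below reproduces that arithmetic with a hypothetical `uniqueness` helper and invented toy rows; applying it to whole rows instead of key tuples gives a full-row duplicate check.

```python
from collections import Counter

def uniqueness(keys):
    """Fraction of rows whose key occurs exactly once; assumes a non-empty input."""
    keys = list(keys)
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# Toy rows of (text_id, id, text); the second and third rows are duplicates
toy_rows = [("t1", 1, "a"), ("t2", 2, "b"), ("t2", 2, "b"), ("t3", 3, "c")]
key_only = [(text_id, id_) for text_id, id_, _ in toy_rows]

print(uniqueness(key_only))  # 0.5 -> the key check would fail
print(uniqueness(toy_rows))  # 0.5 -> the whole-row check would fail too
```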


## Conclusion/Recommendation

## Conclusion

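Based on the data quality checks, both the train and test files are clean: no column contains null or negative values, and the combination of text_id and id is unique in each file. Zero values are present in labels_negative (about 40% of rows) and in labels_positive and agreement (about 2-3% of rows each); these zero values are valid.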

## Recommendation

Both files may be used for modeling with minimal or no further data cleaning.