# Processing Big Data - Deequ Analysis
Deequ is a library for measuring the data quality of large datasets. This notebook uses its Python interface, PyDeequ, on a Spark DataFrame.

# **Data Quality Verification - Deequ Analysis**

![](https://miosoft.com/resources/articles/article-assets/img/data-dimensions/statcan_dimensions.png)


## Import dependencies

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pydeequ
from pydeequ.analyzers import *
from pydeequ.profiles import *
from pydeequ.suggestions import *
from pydeequ.checks import *
from pydeequ.verification import *

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DecimalType, DoubleType, IntegerType, DateType, NumericType, StructType, StringType, StructField

In [6]:
spark = (SparkSession
    .builder
    .config("spark.jars.packages", pydeequ.deequ_maven_coord)
    .config("spark.jars.excludes", pydeequ.f2j_maven_coord)
    .getOrCreate())

## Read data into a Spark DataFrame

In [8]:
# Read the CSV file with pandas, then convert it to a Spark DataFrame
df_pandas = pd.read_csv('/content/drive/MyDrive/data/test.csv')
df = spark.createDataFrame(df_pandas)

## **Run tests on the dataset**

## Test 1 - Null values ⛔️
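Check the data for completeness: no column should contain null values. The schema is printed first so that each column can be named in the check.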

In [None]:
df.printSchema()

root
 |-- doi: string (nullable = true)
 |-- text_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- labels_negative: long (nullable = true)
 |-- labels_positive: long (nullable = true)
 |-- agreement: double (nullable = true)
 |-- id: long (nullable = true)



In [12]:
def test_nulltest(data1):
  # Assert that every column is fully populated, individually and combined
  checkResult = VerificationSuite(spark) \
                    .onData(data1) \
                    .addCheck(
                    Check(spark,CheckLevel.Warning, "missing values")\
                    .isComplete('doi')\
                    .isComplete('text_id')\
                    .isComplete('text')\
                    .isComplete('labels_negative')\
                    .isComplete('labels_positive')\
                    .isComplete('agreement')\
                    .isComplete('id')\
                    .areComplete(data1.columns))\
                    .run()

  # Convert the verification result to a pandas DataFrame for display
  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [13]:
test_nulltest(df)

| | check | check_level | check_status | constraint | constraint_status | constraint_message |
|---|---|---|---|---|---|---|
| 0 | missing values | Warning | Success | CompletenessConstraint(Completeness(doi,None)) | Success | |
| 1 | missing values | Warning | Success | CompletenessConstraint(Completeness(text_id,No... | Success | |
| 2 | missing values | Warning | Success | CompletenessConstraint(Completeness(text,None)) | Success | |
| 3 | missing values | Warning | Success | CompletenessConstraint(Completeness(labels_neg... | Success | |
| 4 | missing values | Warning | Success | CompletenessConstraint(Completeness(labels_pos... | Success | |
| 5 | missing values | Warning | Success | CompletenessConstraint(Completeness(agreement,... | Success | |
| 6 | missing values | Warning | Success | CompletenessConstraint(Completeness(id,None)) | Success | |
| 7 | missing values | Warning | Success | ComplianceConstraint(Compliance(Combined Compl... | Success | |
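
To make the metric behind `isComplete` concrete, here is a minimal plain-Python sketch that treats completeness as the fraction of non-null values in a column. It is an illustration only, not how Deequ computes the metric on Spark; the `completeness` helper and the toy rows are hypothetical.

```python
# Plain-Python sketch of the completeness metric; not PyDeequ's implementation.
def completeness(rows, column):
    """Return the fraction of rows whose value in `column` is not None."""
    if not rows:
        raise ValueError("completeness is undefined for an empty dataset")
    non_null = sum(1 for row in rows if row.get(column) is not None)
    return non_null / len(rows)


# Hypothetical toy rows that reuse two column names from the schema above.
sample_rows = [
    {"text_id": "a1", "agreement": 1.0},
    {"text_id": "a2", "agreement": None},
    {"text_id": "a3", "agreement": 0.5},
    {"text_id": "a4", "agreement": 0.75},
]

text_id_completeness = completeness(sample_rows, "text_id")      # 4 of 4 present
agreement_completeness = completeness(sample_rows, "agreement")  # 3 of 4 present

# A completeness check passes only when the fraction is exactly 1.0.
print("text_id complete:", text_id_completeness == 1.0)
print("agreement complete:", agreement_completeness == 1.0)
```

In this toy data `agreement` would fail the check because one of its four values is missing, whereas every column of the real dataset passed.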


## Test 2 - Zero Values 🅾️

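Check for zero values in the numeric columns. Each `satisfies` constraint measures the fraction of rows where the column equals zero and asserts that this fraction is 0, so a constraint fails as soon as any zero is present.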

In [15]:
def test_zerotest(data):
  # For each column: the fraction of rows equal to zero must itself be 0
  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(
                    Check(spark,CheckLevel.Warning, "Non Zero Values")\
                    .satisfies("labels_negative == 0", "Zero values", lambda x: x==0)\
                    .satisfies("labels_positive == 0", "Zero values", lambda x: x==0)\
                    .satisfies("agreement == 0", "Zero values", lambda x: x==0)\
                    .satisfies("id == 0", "Zero values", lambda x: x==0)\
                    )\
                    .run()

  # Convert the verification result to a pandas DataFrame for display
  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [16]:
test_zerotest(df)


| | check | check_level | check_status | constraint | constraint_status | constraint_message |
|---|---|---|---|---|---|---|
| 0 | Non Zero Values | Warning | Warning | ComplianceConstraint(Compliance(Zero values,la... | Failure | Value: 0.39571450593494684 does not meet the c... |
| 1 | Non Zero Values | Warning | Warning | ComplianceConstraint(Compliance(Zero values,la... | Failure | Value: 0.026360413133960228 does not meet the ... |
| 2 | Non Zero Values | Warning | Warning | ComplianceConstraint(Compliance(Zero values,ag... | Failure | Value: 0.026668722059503623 does not meet the ... |
| 3 | Non Zero Values | Warning | Warning | ComplianceConstraint(Compliance(Zero values,id... | Success | |
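
The `Value:` figures in the failure messages are the fraction of rows that match each condition. The first row, which appears to correspond to `labels_negative`, reports about 0.396, so roughly 39.6% of its rows are zero and the assertion `x == 0` fails. The sketch below mimics that logic in plain Python, assuming compliance is the fraction of matching rows; the `compliance` helper and the sample rows are hypothetical stand-ins for the Spark computation.

```python
# Plain-Python sketch of a compliance-style check; not PyDeequ's implementation.
def compliance(rows, predicate):
    """Return the fraction of rows for which `predicate(row)` is true."""
    if not rows:
        raise ValueError("compliance is undefined for an empty dataset")
    matching = sum(1 for row in rows if predicate(row))
    return matching / len(rows)


def no_zero_rows(fraction):
    """Same assertion as in the check above: the zero fraction must be 0."""
    return fraction == 0


# Hypothetical toy rows reusing two column names from the dataset.
sample_rows = [
    {"labels_negative": 0, "id": 1},
    {"labels_negative": 3, "id": 2},
    {"labels_negative": 0, "id": 3},
    {"labels_negative": 7, "id": 4},
]

zero_fraction_labels = compliance(sample_rows, lambda r: r["labels_negative"] == 0)
zero_fraction_id = compliance(sample_rows, lambda r: r["id"] == 0)

labels_passes = no_zero_rows(zero_fraction_labels)  # half the rows are zero
id_passes = no_zero_rows(zero_fraction_id)          # no row is zero

print(zero_fraction_labels, labels_passes)
print(zero_fraction_id, id_passes)
```

Because the assertion receives a fraction rather than a row, loosening the rule (for example, tolerating a small share of zeros) only requires a different assertion function.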


## Test 3 - Negative values ➖️
Check that all numeric values are non-negative. Zero is allowed here, so the zeros found in Test 2 do not cause a failure.


In [18]:
# Check that the numeric columns contain no negative values
def test_negative_test(data):
  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(
                    Check(spark,CheckLevel.Warning, "Non Negative Values")\
                    .isNonNegative('labels_negative')\
                    .isNonNegative('labels_positive')\
                    .isNonNegative('agreement')\
                    .isNonNegative('id')\
                    )\
                    .run()

  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [19]:
test_negative_test(df)

| | check | check_level | check_status | constraint | constraint_status | constraint_message |
|---|---|---|---|---|---|---|
| 0 | Non Negative Values | Warning | Success | ComplianceConstraint(Compliance(labels_negativ... | Success | |
| 1 | Non Negative Values | Warning | Success | ComplianceConstraint(Compliance(labels_positiv... | Success | |
| 2 | Non Negative Values | Warning | Success | ComplianceConstraint(Compliance(agreement is n... | Success | |
| 3 | Non Negative Values | Warning | Success | ComplianceConstraint(Compliance(id is non-nega... | Success | |


## Test 4 - Determine Maximum Values ⚠️
We want to find the maximum value of each numerical field. An observed maximum can often serve as an upper bound for its column, so these values are candidates for threshold checks on future data.

In [21]:
# Numerical columns to report on
num_cols = ['labels_negative','labels_positive','agreement','id']

def test_maxvalue(data):
  # Profile every column, then print the maximum of each numerical one
  result = ColumnProfilerRunner(spark) \
    .onData(data) \
    .run()

  for col, profile in result.profiles.items():
    if col in num_cols:
        print(f"Column: '{col}'")
        print('\t',f'Maximum Value: {profile.maximum}')

In [22]:
test_maxvalue(df)

Column: 'agreement'
	 Maximum Value: 1.0
Column: 'labels_positive'
	 Maximum Value: 882.0
Column: 'id'
	 Maximum Value: 6487.0
Column: 'labels_negative'
	 Maximum Value: 106.0
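
One way to use these maxima as thresholds is sketched below in plain Python: take the per-column maximum of a reference batch as the upper bound, then flag values in a later batch that exceed it. The helper names and both batches are hypothetical, and the sketch does not call PyDeequ; it only illustrates the idea of a maximum-based bound.

```python
# Plain-Python sketch of maximum-based thresholds; helpers and data are hypothetical.
def column_maxima(rows, columns):
    """Return the largest observed value for each requested column."""
    return {column: max(row[column] for row in rows) for column in columns}


def rows_above_threshold(rows, thresholds):
    """Return (row_index, column, value) for every value above its threshold."""
    violations = []
    for index, row in enumerate(rows):
        for column, upper_bound in thresholds.items():
            if row[column] > upper_bound:
                violations.append((index, column, row[column]))
    return violations


# Reference batch used to derive the upper bounds.
reference_rows = [
    {"labels_positive": 5, "agreement": 0.5},
    {"labels_positive": 12, "agreement": 1.0},
    {"labels_positive": 9, "agreement": 0.8},
]

# Later batch to validate against those bounds.
new_rows = [
    {"labels_positive": 7, "agreement": 0.9},
    {"labels_positive": 40, "agreement": 1.0},
]

thresholds = column_maxima(reference_rows, ["labels_positive", "agreement"])
violations = rows_above_threshold(new_rows, thresholds)

print(thresholds)
print(violations)
```

A bound taken from a single batch can be too tight for a column that legitimately grows, such as a running `id`, so in practice the threshold is a judgment call per column.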


## Test 5 - Duplication 👥️
Lastly, we want to determine the uniqueness of the rows in the DataFrame.

First, the primary key should be unique; in our case the key is the combination of **text_id** and **id**. Second, complete rows should be unique, which means checking for duplicates across all columns of the dataset. The check that follows covers the first case, the composite key.

In [24]:
def test_duplicates(data):
  # Every (text_id, id) combination must occur exactly once (uniqueness of 1)
  checkResult = VerificationSuite(spark) \
                    .onData(data) \
                    .addCheck(
                    Check(spark,CheckLevel.Error, "Unique Values")\
                    .hasUniqueness(["text_id","id"], lambda x: x == 1)\
                    )\
                    .run()

  # Convert the verification result to a pandas DataFrame for display
  checkResult_df = VerificationResult.checkResultsAsDataFrame(spark, checkResult)
  return checkResult_df.toPandas()

In [25]:
test_duplicates(df)

| | check | check_level | check_status | constraint | constraint_status | constraint_message |
|---|---|---|---|---|---|---|
| 0 | Unique Values | Error | Success | UniquenessConstraint(Uniqueness(Stream(text_id... | Success | |
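
The two uniqueness questions raised in this test, a unique composite key and unique whole rows, can give different answers. The plain-Python sketch below illustrates this, assuming uniqueness is the fraction of rows whose value combination occurs exactly once. The `uniqueness` helper and the toy rows are hypothetical and are not PyDeequ's implementation.

```python
from collections import Counter


# Plain-Python sketch of the uniqueness metric; not PyDeequ's implementation.
def uniqueness(rows, columns):
    """Return the fraction of rows whose `columns` combination occurs exactly once."""
    if not rows:
        raise ValueError("uniqueness is undefined for an empty dataset")
    keys = [tuple(row[column] for column in columns) for row in rows]
    counts = Counter(keys)
    unique_rows = sum(1 for key in keys if counts[key] == 1)
    return unique_rows / len(keys)


# Hypothetical toy rows: the last two share a key but differ in their text.
sample_rows = [
    {"text_id": "t1", "id": 1, "text": "first"},
    {"text_id": "t1", "id": 2, "text": "second"},
    {"text_id": "t2", "id": 3, "text": "third"},
    {"text_id": "t2", "id": 3, "text": "third (edited)"},
]

key_uniqueness = uniqueness(sample_rows, ["text_id", "id"])          # key repeats once
row_uniqueness = uniqueness(sample_rows, ["text_id", "id", "text"])  # no identical rows

print("composite key unique:", key_uniqueness == 1)
print("whole rows unique:", row_uniqueness == 1)
```

In this toy data the whole-row check passes while the key check fails, which is why the two are worth testing separately. A whole-row variant of the PyDeequ check could presumably pass `data.columns` to `hasUniqueness`; that variant is an assumption and was not run in this notebook.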
