# 🎯 DATA QUALITY & VALIDATION WITH PYSPARK

---

## 📋 **OBJECTIVES**

1. Define data quality rules
2. Implement validation checks
3. Create data quality metrics
4. Build a data quality dashboard
5. Automate quality monitoring

---

## 🔧 **SETUP SPARK SESSION**

In [14]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql.window import Window
import pandas as pd
from datetime import datetime, timedelta
import builtins  # the wildcard pyspark imports above shadow round(), sum(), max(), etc.

spark = SparkSession.builder \
    .appName("DataQuality") \
    .master("spark://spark-master:7077") \
    .config("spark.executor.memory", "2g") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minioadmin") \
    .config("spark.hadoop.fs.s3a.secret.key", "minioadmin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .getOrCreate()

print("✅ Spark Session Created")
print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")

✅ Spark Session Created
Spark Version: 3.5.1
Master: spark://spark-master:7077


---

## 📊 **1. CREATE TEST DATASET**

Create a dataset with a variety of data quality issues.

In [15]:
# Create test data with quality issues
test_data = [
    # Valid records
    ("ORD001", "CUST001", "john@email.com", 100.0, "2024-01-01", "completed", "USA"),
    ("ORD002", "CUST002", "jane@email.com", 200.0, "2024-01-02", "completed", "UK"),
    ("ORD003", "CUST003", "bob@email.com", 150.0, "2024-01-03", "completed", "Canada"),
    
    # Missing values
    ("ORD004", None, "alice@email.com", 300.0, "2024-01-04", "pending", "USA"),
    ("ORD005", "CUST005", None, 250.0, "2024-01-05", "completed", "UK"),
    ("ORD006", "CUST006", "charlie@email.com", None, "2024-01-06", "completed", "Canada"),
    
    # Invalid values
    ("ORD007", "CUST007", "invalid-email", 400.0, "2024-01-07", "completed", "USA"),
    ("ORD008", "CUST008", "david@email.com", -100.0, "2024-01-08", "completed", "UK"),
    ("ORD009", "CUST009", "eve@email.com", 500.0, "invalid-date", "completed", "Canada"),
    ("ORD010", "CUST010", "frank@email.com", 600.0, "2024-01-10", "invalid-status", "USA"),
    
    # Duplicates
    ("ORD001", "CUST001", "john@email.com", 100.0, "2024-01-01", "completed", "USA"),
    
    # Outliers
    ("ORD011", "CUST011", "grace@email.com", 1000000.0, "2024-01-11", "completed", "UK"),
    
    # Future dates (hard-coded: only in the future when run before 2025-01-01)
    ("ORD012", "CUST012", "henry@email.com", 700.0, "2025-01-01", "completed", "Canada"),
    
    # Invalid country
    ("ORD013", "CUST013", "ivy@email.com", 800.0, "2024-01-13", "completed", "InvalidCountry"),
    
    # More valid records
    ("ORD014", "CUST014", "jack@email.com", 180.0, "2024-01-14", "completed", "USA"),
    ("ORD015", "CUST015", "karen@email.com", 220.0, "2024-01-15", "pending", "UK"),
]

schema = StructType([
    StructField("order_id", StringType(), True),
    StructField("customer_id", StringType(), True),
    StructField("email", StringType(), True),
    StructField("amount", DoubleType(), True),
    StructField("order_date", StringType(), True),
    StructField("status", StringType(), True),
    StructField("country", StringType(), True)
])

df = spark.createDataFrame(test_data, schema)

print("📊 TEST DATASET:")
df.show(20, truncate=False)
print(f"\nTotal rows: {df.count()}")

📊 TEST DATASET:
+--------+-----------+-----------------+---------+------------+--------------+--------------+
|order_id|customer_id|email            |amount   |order_date  |status        |country       |
+--------+-----------+-----------------+---------+------------+--------------+--------------+
|ORD001  |CUST001    |john@email.com   |100.0    |2024-01-01  |completed     |USA           |
|ORD002  |CUST002    |jane@email.com   |200.0    |2024-01-02  |completed     |UK            |
|ORD003  |CUST003    |bob@email.com    |150.0    |2024-01-03  |completed     |Canada        |
|ORD004  |NULL       |alice@email.com  |300.0    |2024-01-04  |pending       |USA           |
|ORD005  |CUST005    |NULL             |250.0    |2024-01-05  |completed     |UK            |
|ORD006  |CUST006    |charlie@email.com|NULL     |2024-01-06  |completed     |Canada        |
|ORD007  |CUST007    |invalid-email    |400.0    |2024-01-07  |completed     |USA           |
|ORD008  |CUST008    |david@email.com  |-

---

## 🎯 **2. DEFINE DATA QUALITY RULES**

In [16]:
# Define quality rules
quality_rules = {
    "completeness": {
        "order_id": "NOT NULL",
        "customer_id": "NOT NULL",
        "email": "NOT NULL",
        "amount": "NOT NULL",
        "order_date": "NOT NULL",
        "status": "NOT NULL",
        "country": "NOT NULL"
    },
    "validity": {
        "email": "REGEX: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}$",
        "amount": "RANGE: 0 to 100000",
        "order_date": "DATE FORMAT: yyyy-MM-dd",
        "status": "IN: [pending, completed, cancelled]",
        "country": "IN: [USA, UK, Canada]"
    },
    "uniqueness": {
        "order_id": "UNIQUE"
    },
    "timeliness": {
        "order_date": "NOT FUTURE DATE"
    }
}

print("📋 DATA QUALITY RULES:")
for category, rules in quality_rules.items():
    print(f"\n{category.upper()}:")
    for field, rule in rules.items():
        print(f"  • {field}: {rule}")

📋 DATA QUALITY RULES:

COMPLETENESS:
  • order_id: NOT NULL
  • customer_id: NOT NULL
  • email: NOT NULL
  • amount: NOT NULL
  • order_date: NOT NULL
  • status: NOT NULL
  • country: NOT NULL

VALIDITY:
  • email: REGEX: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
  • amount: RANGE: 0 to 100000
  • order_date: DATE FORMAT: yyyy-MM-dd
  • status: IN: [pending, completed, cancelled]
  • country: IN: [USA, UK, Canada]

UNIQUENESS:
  • order_id: UNIQUE

TIMELINESS:
  • order_date: NOT FUTURE DATE
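
The rule strings above are descriptive labels; nothing in this notebook parses them. As a minimal plain-Python sketch (no Spark required), the same rules can be paired with predicates and applied to a single record. The names `RULES`, `_is_date` and `violations` are hypothetical and not part of the notebook, the uniqueness and timeliness rules are left out for brevity, and Python's `strptime` is somewhat more lenient than a strict `yyyy-MM-dd` parser.

```python
import re
from datetime import datetime

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def _is_date(value):
    """True if value parses with Python's %Y-%m-%d format."""
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False

# Hypothetical rule table: (rule label, field, predicate on the field value).
RULES = [
    ("email REGEX", "email", lambda v: v is not None and EMAIL_RE.match(v) is not None),
    ("amount RANGE [0, 100000]", "amount", lambda v: v is not None and 0 <= v <= 100000),
    ("order_date DATE FORMAT", "order_date", _is_date),
    ("status IN list", "status", lambda v: v in {"pending", "completed", "cancelled"}),
    ("country IN list", "country", lambda v: v in {"USA", "UK", "Canada"}),
]

def violations(record):
    """Return the labels of all rules the record breaks, NOT NULL checks first."""
    broken = [f"{field} NOT NULL" for field, value in record.items() if value is None]
    broken += [label for label, field, ok in RULES if not ok(record.get(field))]
    return broken

order = {"order_id": "ORD008", "customer_id": "CUST008", "email": "david@email.com",
         "amount": -100.0, "order_date": "2024-01-08", "status": "completed", "country": "UK"}
print(violations(order))
```

In this sketch a missing value is reported twice, once by the NOT NULL check and once by the field's validity rule; whether that double counting is wanted is a design choice.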


---

## ✅ **3. COMPLETENESS CHECKS**

Check for missing/null values

In [17]:
def check_completeness(df):
    """
    Check completeness (null values) for all columns
    """
    total_rows = df.count()
    
    results = []
    
    for col_name in df.columns:
        null_count = df.filter(col(col_name).isNull()).count()
        null_percentage = (null_count / total_rows) * 100
        completeness = 100 - null_percentage
        
        status = "✅ PASS" if null_count == 0 else "❌ FAIL"
        
        results.append({
            "column": col_name,
            "total_rows": total_rows,
            "null_count": null_count,
            # builtins.round: the wildcard pyspark import shadows round()
            "null_percentage": builtins.round(null_percentage, 2),
            "completeness": builtins.round(completeness, 2),
            "status": status
        })
    
    return spark.createDataFrame(results)

# Run completeness check
print("✅ COMPLETENESS CHECK:")
completeness_report = check_completeness(df)
completeness_report.show(truncate=False)

# Summary
failed_checks = completeness_report.filter(col("status") == "❌ FAIL").count()
print(f"\n📊 Summary: {failed_checks} columns failed completeness check")

✅ COMPLETENESS CHECK:
+-----------+------------+----------+---------------+------+----------+
|column     |completeness|null_count|null_percentage|status|total_rows|
+-----------+------------+----------+---------------+------+----------+
|order_id   |100.0       |0         |0.0            |✅ PASS|16        |
|customer_id|93.75       |1         |6.25           |❌ FAIL|16        |
|email      |93.75       |1         |6.25           |❌ FAIL|16        |
|amount     |93.75       |1         |6.25           |❌ FAIL|16        |
|order_date |100.0       |0         |0.0            |✅ PASS|16        |
|status     |100.0       |0         |0.0            |✅ PASS|16        |
|country    |100.0       |0         |0.0            |✅ PASS|16        |
+-----------+------------+----------+---------------+------+----------+


📊 Summary: 3 columns failed completeness check
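
`check_completeness` filters and counts once per column. The arithmetic behind each row of the report is simply completeness = 100 minus the null percentage. A plain-Python sketch that counts nulls for every column in a single pass, using four rows taken from the test data (the helper name `completeness` is hypothetical):

```python
def completeness(rows, columns):
    """Return {column: (null_count, completeness_pct)} from one pass over rows."""
    nulls = dict.fromkeys(columns, 0)
    for row in rows:
        for name, value in zip(columns, row):
            if value is None:
                nulls[name] += 1
    total = len(rows)
    return {name: (n, round(100 - n / total * 100, 2)) for name, n in nulls.items()}

columns = ["order_id", "customer_id", "email", "amount"]
rows = [
    ("ORD001", "CUST001", "john@email.com", 100.0),
    ("ORD004", None, "alice@email.com", 300.0),
    ("ORD005", "CUST005", None, 250.0),
    ("ORD006", "CUST006", "charlie@email.com", None),
]
report = completeness(rows, columns)
for name, (null_count, pct) in report.items():
    print(f"{name}: {null_count} null(s), {pct}% complete")
```

With 16 rows and one null, the same formula gives the 93.75% shown in the report above.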


---

## ✅ **4. VALIDITY CHECKS**

Check if values meet business rules

In [18]:
# 4.1 Email validation
def check_email_validity(df):
    email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    
    df_check = df.withColumn(
        "is_valid_email",
        col("email").rlike(email_regex)
    )
    
    total = df_check.count()
    valid = df_check.filter(col("is_valid_email") == True).count()
    # A NULL email makes rlike() return NULL, so it is counted as invalid here
    # but is not matched by the == False filter in the listing below.
    invalid = total - valid
    
    print("📧 EMAIL VALIDITY:")
    print(f"Total: {total}")
    print(f"Valid: {valid} ({(valid/total)*100:.2f}%)")
    print(f"Invalid: {invalid} ({(invalid/total)*100:.2f}%)")
    
    print("\n❌ INVALID EMAILS:")
    df_check.filter(col("is_valid_email") == False).select("order_id", "email").show(truncate=False)
    
    return df_check

df_email_check = check_email_validity(df)

# 4.2 Amount validation
def check_amount_validity(df):
    df_check = df.withColumn(
        "is_valid_amount",
        (col("amount") >= 0) & (col("amount") <= 100000)
    )
    
    total = df_check.count()
    valid = df_check.filter(col("is_valid_amount") == True).count()
    # A NULL amount makes the range comparison NULL, so it is counted as invalid
    # here but is not matched by the == False filter in the listing below.
    invalid = total - valid
    
    print("\n💰 AMOUNT VALIDITY:")
    print(f"Total: {total}")
    print(f"Valid: {valid} ({(valid/total)*100:.2f}%)")
    print(f"Invalid: {invalid} ({(invalid/total)*100:.2f}%)")
    
    print("\n❌ INVALID AMOUNTS:")
    df_check.filter(col("is_valid_amount") == False).select("order_id", "amount").show(truncate=False)
    
    return df_check

df_amount_check = check_amount_validity(df)

# 4.3 Status validation
def check_status_validity(df):
    valid_statuses = ["pending", "completed", "cancelled"]
    
    df_check = df.withColumn(
        "is_valid_status",
        col("status").isin(valid_statuses)
    )
    
    total = df_check.count()
    valid = df_check.filter(col("is_valid_status") == True).count()
    invalid = total - valid
    
    print("\n📊 STATUS VALIDITY:")
    print(f"Total: {total}")
    print(f"Valid: {valid} ({(valid/total)*100:.2f}%)")
    print(f"Invalid: {invalid} ({(invalid/total)*100:.2f}%)")
    
    print("\n❌ INVALID STATUSES:")
    df_check.filter(col("is_valid_status") == False).select("order_id", "status").show(truncate=False)
    
    return df_check

df_status_check = check_status_validity(df)

# 4.4 Country validation
def check_country_validity(df):
    valid_countries = ["USA", "UK", "Canada"]
    
    df_check = df.withColumn(
        "is_valid_country",
        col("country").isin(valid_countries)
    )
    
    total = df_check.count()
    valid = df_check.filter(col("is_valid_country") == True).count()
    invalid = total - valid
    
    print("\n🌍 COUNTRY VALIDITY:")
    print(f"Total: {total}")
    print(f"Valid: {valid} ({(valid/total)*100:.2f}%)")
    print(f"Invalid: {invalid} ({(invalid/total)*100:.2f}%)")
    
    print("\n❌ INVALID COUNTRIES:")
    df_check.filter(col("is_valid_country") == False).select("order_id", "country").show(truncate=False)
    
    return df_check

df_country_check = check_country_validity(df)

# 4.5 Date validation
def check_date_validity(df):
    df_check = df.withColumn(
        "parsed_date",
        to_date(col("order_date"), "yyyy-MM-dd")
    ).withColumn(
        "is_valid_date",
        col("parsed_date").isNotNull()
    )
    
    total = df_check.count()
    valid = df_check.filter(col("is_valid_date") == True).count()
    invalid = total - valid
    
    print("\n📅 DATE VALIDITY:")
    print(f"Total: {total}")
    print(f"Valid: {valid} ({(valid/total)*100:.2f}%)")
    print(f"Invalid: {invalid} ({(invalid/total)*100:.2f}%)")
    
    print("\n❌ INVALID DATES:")
    df_check.filter(col("is_valid_date") == False).select("order_id", "order_date").show(truncate=False)
    
    return df_check

df_date_check = check_date_validity(df)

📧 EMAIL VALIDITY:
Total: 16
Valid: 14 (87.50%)
Invalid: 2 (12.50%)

❌ INVALID EMAILS:
+--------+-------------+
|order_id|email        |
+--------+-------------+
|ORD007  |invalid-email|
+--------+-------------+


💰 AMOUNT VALIDITY:
Total: 16
Valid: 13 (81.25%)
Invalid: 3 (18.75%)

❌ INVALID AMOUNTS:
+--------+---------+
|order_id|amount   |
+--------+---------+
|ORD008  |-100.0   |
|ORD011  |1000000.0|
+--------+---------+


📊 STATUS VALIDITY:
Total: 16
Valid: 15 (93.75%)
Invalid: 1 (6.25%)

❌ INVALID STATUSES:
+--------+--------------+
|order_id|status        |
+--------+--------------+
|ORD010  |invalid-status|
+--------+--------------+


🌍 COUNTRY VALIDITY:
Total: 16
Valid: 15 (93.75%)
Invalid: 1 (6.25%)

❌ INVALID COUNTRIES:
+--------+--------------+
|order_id|country       |
+--------+--------------+
|ORD013  |InvalidCountry|
+--------+--------------+


📅 DATE VALIDITY:
Total: 16
Valid: 15 (93.75%)
Invalid: 1 (6.25%)

❌ INVALID DATES:
+--------+-----------
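
The email summary above reports 2 invalid rows but lists only ORD007, and the amount summary reports 3 but lists 2. The gap comes from SQL three-valued logic: a comparison on NULL yields NULL rather than False, so a NULL row is neither counted as valid nor matched by the `== False` filter. A plain-Python sketch of that behavior, using `None` for NULL (the helpers `rlike` and `sql_filter` are hypothetical stand-ins, not PySpark APIs):

```python
import re

EMAIL_RE = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

def rlike(value):
    """Mimic SQL semantics: NULL in, NULL out; otherwise True or False."""
    if value is None:
        return None
    return EMAIL_RE.search(value) is not None

def sql_filter(rows, flag, wanted):
    """Keep rows whose flag equals `wanted`; a None flag never matches."""
    return [r for r in rows if flag(r) is not None and flag(r) == wanted]

emails = [("ORD001", "john@email.com"), ("ORD005", None), ("ORD007", "invalid-email")]
is_valid = lambda row: rlike(row[1])

valid = sql_filter(emails, is_valid, True)
listed_invalid = sql_filter(emails, is_valid, False)
invalid_by_subtraction = len(emails) - len(valid)

print(invalid_by_subtraction)            # the NULL row is counted as invalid
print([r[0] for r in listed_invalid])    # but it is not listed
```

To list NULL rows as invalid in Spark as well, one option is to wrap the flag in `coalesce(..., lit(False))` before filtering.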

---

## ✅ **5. UNIQUENESS CHECKS**

Check for duplicate records

In [19]:
def check_uniqueness(df, key_columns):
    """
    Check uniqueness of key columns
    """
    total_rows = df.count()
    distinct_rows = df.select(key_columns).distinct().count()
    duplicates = total_rows - distinct_rows
    
    print(f"🔑 UNIQUENESS CHECK ({', '.join(key_columns)}):")
    print(f"Total rows: {total_rows}")
    print(f"Distinct rows: {distinct_rows}")
    print(f"Duplicates: {duplicates}")
    print(f"Uniqueness: {(distinct_rows/total_rows)*100:.2f}%")
    
    if duplicates > 0:
        print("\n❌ DUPLICATE RECORDS:")
        
        # Find duplicates
        windowSpec = Window.partitionBy(key_columns)
        df_dup = df.withColumn("dup_count", count("*").over(windowSpec)) \
            .filter(col("dup_count") > 1) \
            .orderBy(key_columns)
        
        df_dup.show(truncate=False)
    else:
        print("\n✅ No duplicates found")
    
    return duplicates

# Check order_id uniqueness
check_uniqueness(df, ["order_id"])

# Check customer_id + order_date uniqueness
print("\n" + "="*60 + "\n")
check_uniqueness(df, ["customer_id", "order_date"])

🔑 UNIQUENESS CHECK (order_id):
Total rows: 16
Distinct rows: 15
Duplicates: 1
Uniqueness: 93.75%

❌ DUPLICATE RECORDS:
+--------+-----------+--------------+------+----------+---------+-------+---------+
|order_id|customer_id|email         |amount|order_date|status   |country|dup_count|
+--------+-----------+--------------+------+----------+---------+-------+---------+
|ORD001  |CUST001    |john@email.com|100.0 |2024-01-01|completed|USA    |2        |
|ORD001  |CUST001    |john@email.com|100.0 |2024-01-01|completed|USA    |2        |
+--------+-----------+--------------+------+----------+---------+-------+---------+



🔑 UNIQUENESS CHECK (customer_id, order_date):
Total rows: 16
Distinct rows: 15
Duplicates: 1
Uniqueness: 93.75%

❌ DUPLICATE RECORDS:
+--------+-----------+--------------+------+----------+---------+-------+---------+
|order_id|customer_id|email         |amount|order_date|status   |country|dup_count|
+--------+-----------+--------------+------+----------+-------

1
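
`check_uniqueness` reports `total - distinct` as the duplicate count, which is the number of surplus rows, while the listing shows every row of a duplicated group (two rows for ORD001). A plain-Python sketch of both numbers with `collections.Counter` (the function name `duplicate_stats` is hypothetical):

```python
from collections import Counter

def duplicate_stats(keys):
    """Return (surplus_rows, duplicated_keys) for a list of key values."""
    counts = Counter(keys)
    surplus = len(keys) - len(counts)          # total rows minus distinct keys
    duplicated = sorted(k for k, n in counts.items() if n > 1)
    return surplus, duplicated

order_ids = ["ORD001", "ORD002", "ORD003", "ORD001"]
surplus, duplicated = duplicate_stats(order_ids)
print(f"Surplus rows: {surplus}, duplicated keys: {duplicated}")
```

A key that appears three times contributes two surplus rows, so the duplicate count and the number of listed rows diverge further as groups grow.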

---

## ✅ **6. TIMELINESS CHECKS**

Check for order dates that lie in the future

In [20]:
def check_timeliness(df):
    """
    Check for future dates
    """
    df_check = df.withColumn(
        "parsed_date",
        to_date(col("order_date"), "yyyy-MM-dd")
    ).withColumn(
        "is_future_date",
        col("parsed_date") > current_date()
    )
    
    total = df_check.count()
    future_dates = df_check.filter(col("is_future_date") == True).count()
    
    print("📅 TIMELINESS CHECK:")
    print(f"Total: {total}")
    print(f"Future dates: {future_dates} ({(future_dates/total)*100:.2f}%)")
    
    if future_dates > 0:
        print("\n❌ FUTURE DATES FOUND:")
        df_check.filter(col("is_future_date") == True) \
            .select("order_id", "order_date", "parsed_date") \
            .show(truncate=False)
    else:
        print("\n✅ No future dates found")
    
    return df_check

df_timeliness_check = check_timeliness(df)

📅 TIMELINESS CHECK:
Total: 16
Future dates: 0 (0.00%)

✅ No future dates found
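
The check finds no future dates even though ORD012 was added to the test data as one: `2025-01-01` is hard-coded, so it stops being a future date once the notebook runs after that day. A plain-Python sketch that makes the reference date an explicit parameter, so the result is reproducible (the function `is_future` and its `today` parameter are hypothetical, and the dates passed in are only examples):

```python
from datetime import date, datetime

def is_future(order_date, today):
    """True/False for a parseable date string, None if it does not parse."""
    try:
        parsed = datetime.strptime(order_date, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return None  # stands in for the NULL of an unparseable date
    return parsed > today

print(is_future("2025-01-01", today=date(2024, 6, 1)))    # run before the date
print(is_future("2025-01-01", today=date(2026, 1, 8)))    # run after the date
print(is_future("invalid-date", today=date(2026, 1, 8)))  # unparseable input
```

For test data, generating the future date relative to the run date (for example, today plus one year) keeps the case meaningful no matter when the notebook runs.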


---

## ✅ **7. CONSISTENCY CHECKS**

Check for data consistency across columns

In [21]:
def check_consistency(df):
    """
    Check business logic consistency
    """
    # Rule: Completed orders must have amount > 0
    # (a NULL amount makes the condition NULL, so that row falls through to otherwise(True))
    df_check = df.withColumn(
        "is_consistent",
        when(
            (col("status") == "completed") & (col("amount") <= 0),
            False
        ).otherwise(True)
    )
    
    total = df_check.count()
    inconsistent = df_check.filter(col("is_consistent") == False).count()
    
    print("🔗 CONSISTENCY CHECK:")
    print("Rule: Completed orders must have amount > 0")
    print(f"Total: {total}")
    print(f"Inconsistent: {inconsistent} ({(inconsistent/total)*100:.2f}%)")
    
    if inconsistent > 0:
        print("\n❌ INCONSISTENT RECORDS:")
        df_check.filter(col("is_consistent") == False) \
            .select("order_id", "status", "amount") \
            .show(truncate=False)
    else:
        print("\n✅ All records are consistent")
    
    return df_check

df_consistency_check = check_consistency(df)

🔗 CONSISTENCY CHECK:
Rule: Completed orders must have amount > 0
Total: 16
Inconsistent: 1 (6.25%)

❌ INCONSISTENT RECORDS:
+--------+---------+------+
|order_id|status   |amount|
+--------+---------+------+
|ORD008  |completed|-100.0|
+--------+---------+------+



---

## 📊 **8. COMPREHENSIVE QUALITY REPORT**

In [22]:
def generate_quality_report(df):
    """
    Generate comprehensive data quality report
    """
    total_rows = df.count()
    
    # Add all validation flags
    email_regex = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
    valid_statuses = ["pending", "completed", "cancelled"]
    valid_countries = ["USA", "UK", "Canada"]
    
    df_report = df \
        .withColumn("has_null", 
            col("order_id").isNull() | 
            col("customer_id").isNull() | 
            col("email").isNull() | 
            col("amount").isNull() | 
            col("order_date").isNull() | 
            col("status").isNull() | 
            col("country").isNull()
        ) \
        .withColumn("is_valid_email", col("email").rlike(email_regex)) \
        .withColumn("is_valid_amount", (col("amount") >= 0) & (col("amount") <= 100000)) \
        .withColumn("is_valid_status", col("status").isin(valid_statuses)) \
        .withColumn("is_valid_country", col("country").isin(valid_countries)) \
        .withColumn("parsed_date", to_date(col("order_date"), "yyyy-MM-dd")) \
        .withColumn("is_valid_date", col("parsed_date").isNotNull()) \
        .withColumn("is_future_date", col("parsed_date") > current_date()) \
        .withColumn("is_quality_pass",
            ~col("has_null") &
            col("is_valid_email") &
            col("is_valid_amount") &
            col("is_valid_status") &
            col("is_valid_country") &
            col("is_valid_date") &
            ~col("is_future_date")
        )
    
    # Calculate metrics
    quality_pass = df_report.filter(col("is_quality_pass") == True).count()
    quality_fail = total_rows - quality_pass
    quality_score = (quality_pass / total_rows) * 100
    
    # Detailed breakdown
    has_null = df_report.filter(col("has_null") == True).count()
    invalid_email = df_report.filter(col("is_valid_email") == False).count()
    invalid_amount = df_report.filter(col("is_valid_amount") == False).count()
    invalid_status = df_report.filter(col("is_valid_status") == False).count()
    invalid_country = df_report.filter(col("is_valid_country") == False).count()
    invalid_date = df_report.filter(col("is_valid_date") == False).count()
    future_date = df_report.filter(col("is_future_date") == True).count()
    
    print("="*80)
    print("📊 DATA QUALITY REPORT")
    print("="*80)
    print(f"\n📈 OVERALL METRICS:")
    print(f"   Total Records: {total_rows}")
    print(f"   Quality Pass: {quality_pass} ({(quality_pass/total_rows)*100:.2f}%)")
    print(f"   Quality Fail: {quality_fail} ({(quality_fail/total_rows)*100:.2f}%)")
    print(f"   Quality Score: {quality_score:.2f}%")
    
    print(f"\n❌ QUALITY ISSUES BREAKDOWN:")
    print(f"   Missing Values: {has_null} ({(has_null/total_rows)*100:.2f}%)")
    print(f"   Invalid Email: {invalid_email} ({(invalid_email/total_rows)*100:.2f}%)")
    print(f"   Invalid Amount: {invalid_amount} ({(invalid_amount/total_rows)*100:.2f}%)")
    print(f"   Invalid Status: {invalid_status} ({(invalid_status/total_rows)*100:.2f}%)")
    print(f"   Invalid Country: {invalid_country} ({(invalid_country/total_rows)*100:.2f}%)")
    print(f"   Invalid Date: {invalid_date} ({(invalid_date/total_rows)*100:.2f}%)")
    print(f"   Future Date: {future_date} ({(future_date/total_rows)*100:.2f}%)")
    
    print(f"\n🎯 QUALITY GRADE:")
    if quality_score >= 95:
        grade = "A+ (Excellent)"
    elif quality_score >= 90:
        grade = "A (Very Good)"
    elif quality_score >= 80:
        grade = "B (Good)"
    elif quality_score >= 70:
        grade = "C (Fair)"
    elif quality_score >= 60:
        grade = "D (Poor)"
    else:
        grade = "F (Fail)"
    
    print(f"   Grade: {grade}")
    print("\n" + "="*80)
    
    return df_report, quality_score

# Generate report
df_with_quality, quality_score = generate_quality_report(df)

# Show failed records
print("\n❌ RECORDS THAT FAILED QUALITY CHECKS:")
df_with_quality.filter(col("is_quality_pass") == False) \
    .select("order_id", "customer_id", "email", "amount", "status", "country", "order_date") \
    .show(truncate=False)

📊 DATA QUALITY REPORT

📈 OVERALL METRICS:
   Total Records: 16
   Quality Pass: 7 (43.75%)
   Quality Fail: 9 (56.25%)
   Quality Score: 43.75%

❌ QUALITY ISSUES BREAKDOWN:
   Missing Values: 3 (18.75%)
   Invalid Email: 1 (6.25%)
   Invalid Amount: 2 (12.50%)
   Invalid Status: 1 (6.25%)
   Invalid Country: 1 (6.25%)
   Invalid Date: 1 (6.25%)
   Future Date: 0 (0.00%)

🎯 QUALITY GRADE:
   Grade: F (Fail)


❌ RECORDS THAT FAILED QUALITY CHECKS:
+--------+-----------+-----------------+---------+--------------+--------------+------------+
|order_id|customer_id|email            |amount   |status        |country       |order_date  |
+--------+-----------+-----------------+---------+--------------+--------------+------------+
|ORD004  |NULL       |alice@email.com  |300.0    |pending       |USA           |2024-01-04  |
|ORD005  |CUST005    |NULL             |250.0    |completed     |UK            |2024-01-05  |
|ORD006  |CUST006    |charlie@email.com|NULL     |completed     |Ca
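
The grading ladder inside `generate_quality_report` can be pulled out into a pure function, which makes the thresholds easy to test in isolation. A minimal sketch (the name `quality_grade` is hypothetical), checked against the 7 of 16 passing rows reported above:

```python
def quality_grade(score):
    """Map a quality score in percent to the grade bands used in the report."""
    bands = [(95, "A+ (Excellent)"), (90, "A (Very Good)"), (80, "B (Good)"),
             (70, "C (Fair)"), (60, "D (Poor)")]
    for threshold, grade in bands:
        if score >= threshold:
            return grade
    return "F (Fail)"

passed, total = 7, 16
score = passed / total * 100
print(f"{score:.2f}% -> {quality_grade(score)}")
```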

---

## 🔧 **9. DATA QUALITY FRAMEWORK**

Reusable framework for data quality checks

In [23]:
class DataQualityChecker:
    """
    Reusable Data Quality Framework
    """
    
    def __init__(self, df, rules):
        self.df = df
        self.rules = rules  # stored for reference only; run_all_checks() hard-codes its checks
        self.results = []
    
    def check_not_null(self, column):
        """Check for null values"""
        total = self.df.count()
        null_count = self.df.filter(col(column).isNull()).count()
        pass_rate = ((total - null_count) / total) * 100
        
        self.results.append({
            "rule": f"{column} NOT NULL",
            "total": total,
            "passed": total - null_count,
            "failed": null_count,
            "pass_rate": builtins.round(pass_rate, 2),
            "status": "✅ PASS" if null_count == 0 else "❌ FAIL"
        })
    
    def check_regex(self, column, pattern):
        """Check regex pattern"""
        total = self.df.count()
        passed = self.df.filter(col(column).rlike(pattern)).count()
        failed = total - passed
        pass_rate = (passed / total) * 100
        
        self.results.append({
            "rule": f"{column} REGEX",
            "total": total,
            "passed": passed,
            "failed": failed,
            "pass_rate": builtins.round(pass_rate, 2),
            "status": "✅ PASS" if failed == 0 else "❌ FAIL"
        })
    
    def check_range(self, column, min_val, max_val):
        """Check value range"""
        total = self.df.count()
        passed = self.df.filter((col(column) >= min_val) & (col(column) <= max_val)).count()
        failed = total - passed
        pass_rate = (passed / total) * 100
        
        self.results.append({
            "rule": f"{column} RANGE [{min_val}, {max_val}]",
            "total": total,
            "passed": passed,
            "failed": failed,
            "pass_rate": builtins.round(pass_rate, 2),
            "status": "✅ PASS" if failed == 0 else "❌ FAIL"
        })
    
    def check_in_list(self, column, valid_values):
        """Check if value in list"""
        total = self.df.count()
        passed = self.df.filter(col(column).isin(valid_values)).count()
        failed = total - passed
        pass_rate = (passed / total) * 100
        
        self.results.append({
            "rule": f"{column} IN {valid_values}",
            "total": total,
            "passed": passed,
            "failed": failed,
            "pass_rate": builtins.round(pass_rate, 2),
            "status": "✅ PASS" if failed == 0 else "❌ FAIL"
        })
    
    def check_unique(self, columns):
        """Check uniqueness"""
        total = self.df.count()
        distinct = self.df.select(columns).distinct().count()
        duplicates = total - distinct
        pass_rate = (distinct / total) * 100
        
        self.results.append({
            "rule": f"{', '.join(columns)} UNIQUE",
            "total": total,
            "passed": distinct,
            "failed": duplicates,
            "pass_rate": builtins.round(pass_rate, 2),
            "status": "✅ PASS" if duplicates == 0 else "❌ FAIL"
        })
    
    def run_all_checks(self):
        """Run all defined checks"""
        # Completeness checks
        for column in self.df.columns:
            self.check_not_null(column)
        
        # Validity checks
        self.check_regex("email", r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
        self.check_range("amount", 0, 100000)
        self.check_in_list("status", ["pending", "completed", "cancelled"])
        self.check_in_list("country", ["USA", "UK", "Canada"])
        
        # Uniqueness checks
        self.check_unique(["order_id"])
        
        return self.get_report()
    
    def get_report(self):
        """Get quality report as DataFrame"""
        return spark.createDataFrame(self.results)

# Use the framework
print("🔧 RUNNING DATA QUALITY FRAMEWORK:")
checker = DataQualityChecker(df, quality_rules)
quality_report = checker.run_all_checks()

print("\n📊 QUALITY REPORT:")
quality_report.show(truncate=False)

# Summary (a check passes when it has no failing rows)
total_checks = quality_report.count()
passed_checks = quality_report.filter(col("failed") == 0).count()
failed_checks = total_checks - passed_checks

print(f"\n📈 SUMMARY:")
print(f"Total Checks: {total_checks}")
print(f"Passed: {passed_checks} ({(passed_checks/total_checks)*100:.2f}%)")
print(f"Failed: {failed_checks} ({(failed_checks/total_checks)*100:.2f}%)")

🔧 RUNNING DATA QUALITY FRAMEWORK:

📊 QUALITY REPORT:
+------+---------+------+-----------------------------------------------+------+-----+
|failed|pass_rate|passed|rule                                           |status|total|
+------+---------+------+-----------------------------------------------+------+-----+
|0     |100.0    |16    |order_id NOT NULL                              |✅ PASS|16   |
|1     |93.75    |15    |customer_id NOT NULL                           |❌ FAIL|16   |
|1     |93.75    |15    |email NOT NULL                                 |❌ FAIL|16   |
|1     |93.75    |15    |amount NOT NULL                                |❌ FAIL|16   |
|0     |100.0    |16    |order_date NOT NULL                            |✅ PASS|16   |
|0     |100.0    |16    |status NOT NULL                                |✅ PASS|16   |
|0     |100.0    |16    |country NOT NULL                               |✅ PASS|16   |
|2     |87.5     |14    |email REGEX                     

---

## 💾 **10. SAVE QUALITY REPORT**

In [24]:
# Save quality report to MinIO
report_path = "s3a://warehouse/quality_reports/"

# Add timestamp
quality_report_with_ts = quality_report.withColumn(
    "report_timestamp",
    current_timestamp()
)

quality_report_with_ts.write \
    .mode("append") \
    .partitionBy("status") \
    .parquet(report_path)

print(f"✅ Quality report saved to: {report_path}")

# Save data with quality flags
data_with_quality_path = "s3a://warehouse/orders_with_quality/"

df_with_quality.write \
    .mode("overwrite") \
    .partitionBy("is_quality_pass") \
    .parquet(data_with_quality_path)

print(f"✅ Data with quality flags saved to: {data_with_quality_path}")

26/01/08 15:32:08 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

✅ Quality report saved to: s3a://warehouse/quality_reports/

✅ Data with quality flags saved to: s3a://warehouse/orders_with_quality/


---

## 📊 **11. QUALITY DASHBOARD**

In [25]:
# Create quality dashboard
def create_quality_dashboard(quality_report):
    """
    Create visual quality dashboard
    """
    # Convert to Pandas for visualization
    report_pd = quality_report.toPandas()
    
    print("="*80)
    print("📊 DATA QUALITY DASHBOARD")
    print("="*80)
    
    # Overall metrics (a check passes when it has no failing rows)
    total_checks = len(report_pd)
    passed = int((report_pd['failed'] == 0).sum())
    failed = total_checks - passed
    avg_pass_rate = report_pd['pass_rate'].mean()
    
    print(f"\n📈 OVERALL METRICS:")
    print(f"   Total Checks: {total_checks}")
    print(f"   Passed: {passed} ({(passed/total_checks)*100:.2f}%)")
    print(f"   Failed: {failed} ({(failed/total_checks)*100:.2f}%)")
    print(f"   Average Pass Rate: {avg_pass_rate:.2f}%")
    
    # Failed checks
    print(f"\n❌ FAILED CHECKS:")
    failed_checks = report_pd[report_pd['failed'] > 0]
    if len(failed_checks) > 0:
        for _, row in failed_checks.iterrows():
            print(f"   • {row['rule']}: {row['failed']} failures ({100-row['pass_rate']:.2f}%)")
    else:
        print("   None! All checks passed ✅")
    
    # Top issues (numbered by rank, not by the original DataFrame index)
    print(f"\n🔝 TOP 5 ISSUES:")
    top_issues = report_pd.nlargest(5, 'failed')
    for rank, (_, row) in enumerate(top_issues.iterrows(), start=1):
        print(f"   {rank}. {row['rule']}: {row['failed']} failures")
    
    print("\n" + "="*80)

create_quality_dashboard(quality_report)

📊 DATA QUALITY DASHBOARD

📈 OVERALL METRICS:
   Total Checks: 12
   Passed: 4 (33.33%)
   Failed: 8 (66.67%)
   Average Pass Rate: 94.27%

❌ FAILED CHECKS:
   • customer_id NOT NULL: 1 failures (6.25%)
   • email NOT NULL: 1 failures (6.25%)
   • amount NOT NULL: 1 failures (6.25%)
   • email REGEX: 2 failures (12.50%)
   • amount RANGE [0, 100000]: 3 failures (18.75%)
   • status IN ['pending', 'completed', 'cancelled']: 1 failures (6.25%)
   • country IN ['USA', 'UK', 'Canada']: 1 failures (6.25%)
   • order_id UNIQUE: 1 failures (6.25%)

🔝 TOP 5 ISSUES:
   1. amount RANGE [0, 100000]: 3 failures
   2. email REGEX: 2 failures
   3. customer_id NOT NULL: 1 failures
   4. email NOT NULL: 1 failures
   5. amount NOT NULL: 1 failures
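
The dashboard shows two numbers that are easy to confuse: the share of checks that passed (4 of 12) and the average row-level pass rate (94.27%). A check fails as soon as a single row violates it, so the first number is much harsher than the second. A plain-Python sketch that recomputes both from the failure counts listed above, with 16 rows per check (variable names are illustrative):

```python
from statistics import mean

TOTAL_ROWS = 16
# Failures per check: 7 NOT NULL checks, then regex, range, status, country, unique.
failed_per_check = [0, 1, 1, 1, 0, 0, 0, 2, 3, 1, 1, 1]

pass_rates = [(TOTAL_ROWS - f) / TOTAL_ROWS * 100 for f in failed_per_check]
checks_passed = sum(1 for f in failed_per_check if f == 0)

share_passed = checks_passed / len(failed_per_check) * 100
avg_pass_rate = mean(pass_rates)

print(f"Checks passed: {checks_passed}/{len(failed_per_check)} ({share_passed:.2f}%)")
print(f"Average pass rate: {avg_pass_rate:.2f}%")
```

Which of the two belongs in an SLA depends on whether a single bad row should count as a failed check.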



---

## 🎓 **KEY TAKEAWAYS**

### **✅ Data Quality Dimensions:**

1. **Completeness** - No missing values
2. **Validity** - Values meet business rules
3. **Uniqueness** - No duplicates
4. **Timeliness** - Data is current
5. **Consistency** - Data is logically consistent
6. **Accuracy** - Data reflects reality

### **🔧 Best Practices:**

1. **Define clear rules** - Document all quality requirements
2. **Automate checks** - Run quality checks in ETL pipeline
3. **Track metrics** - Monitor quality over time
4. **Flag bad data** - Don't delete, flag for review
5. **Create reports** - Share quality metrics with stakeholders
6. **Set thresholds** - Define acceptable quality levels
7. **Alert on failures** - Notify when quality drops
8. **Root cause analysis** - Investigate quality issues

### **🚀 Production Tips:**

- Run quality checks BEFORE and AFTER transformations
- Store quality reports for auditing
- Create quality SLAs (e.g., 95% pass rate)
- Integrate with monitoring tools (Grafana, DataDog)
- Use Great Expectations or Deequ for advanced checks
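
The threshold, SLA and alerting tips above can be combined into a small gate that a pipeline calls after the checks have run. A minimal plain-Python sketch, assuming report rows shaped like the framework's `rule` and `pass_rate` fields; the function `sla_breaches` is hypothetical and its 95% default is taken from the SLA example above:

```python
def sla_breaches(report_rows, min_pass_rate=95.0):
    """Return (rule, pass_rate) pairs below the SLA, worst first."""
    breaches = [(r["rule"], r["pass_rate"]) for r in report_rows
                if r["pass_rate"] < min_pass_rate]
    return sorted(breaches, key=lambda pair: pair[1])

report_rows = [
    {"rule": "order_id NOT NULL", "pass_rate": 100.0},
    {"rule": "email REGEX", "pass_rate": 87.5},
    {"rule": "amount RANGE [0, 100000]", "pass_rate": 81.25},
]

breaches = sla_breaches(report_rows)
for rule, rate in breaches:
    print(f"ALERT: {rule} at {rate:.2f}% is below the 95% SLA")
```

In a real pipeline the loop would raise an exception or send a notification instead of printing.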

---

## 🎉 **CONGRATULATIONS!**

You have completed **DAY 2: DATA I/O & CLEANING**!

### **✅ What you learned:**
- Reading data from multiple formats
- Writing data with partitioning
- Data cleaning techniques
- Data quality validation
- Quality reporting & monitoring

### **🚀 Next: DAY 3**
- Transformations & Aggregations
- Window Functions
- Complex Joins
- Performance Optimization

---

In [26]:
# Cleanup
spark.stop()
print("✅ Spark session stopped")
print("\n🎉 DAY 2 COMPLETED! Ready for DAY 3!")

✅ Spark session stopped

🎉 DAY 2 COMPLETED! Ready for DAY 3!
