This notebook implements an automated testing suite for the Data Profiling utility. It separates the core business logic from the testing procedures, so that the profiling metrics used in the production pipeline can be verified as accurate and reliable.

##1. Core Profiling Logic

This section contains the refined calculate_profile_metrics function.

####Logic: 

The function computes per-column null counts, the total row count, and the number of duplicate rows with respect to the Primary Key (PK) columns. When a date column is supplied, it also reports the minimum and maximum dates.

####Why this code: 

By returning a DataFrame instead of just printing results, this function becomes "testable." We can compare the output of this function against known expected results to verify its accuracy.

In [0]:
from pyspark.sql.functions import col, lit, sum as spark_sum, min as spark_min, max as spark_max
from pyspark.sql import DataFrame


def calculate_profile_metrics(df: DataFrame, table_name: str, pk_cols=None, date_col=None) -> DataFrame:
    """
    Pure logic for profiling. Returns the profile result as a single-row DataFrame:
    one null-count column per input column, plus summary columns.
    """
    total_rows = df.count()
    
    # 1. Null counts per column: cast each isNull flag to int and sum it
    null_counts = df.select([
        spark_sum(col(c).isNull().cast("int")).alias(c)
        for c in df.columns
    ])

    # 2. Duplicate detection: total rows minus distinct primary-key combinations
    duplicate_rows = 0
    if pk_cols:
        duplicate_rows = total_rows - df.select(pk_cols).distinct().count()

    # 3. Date range analysis (aliased imports avoid shadowing Python's built-in min/max)
    min_date, max_date = None, None
    if date_col:
        dates = df.select(
            spark_min(col(date_col)).alias("min_date"),
            spark_max(col(date_col)).alias("max_date")
        ).collect()[0]
        min_date = dates["min_date"]
        max_date = dates["max_date"]

    # 4. Final summary construction: cast dates to strings for schema consistency
    return (
        null_counts
        .withColumn("table_name", lit(table_name))
        .withColumn("total_rows", lit(total_rows))
        .withColumn("duplicate_rows", lit(duplicate_rows))
        # Cast in Spark so a missing date stays null instead of becoming the string "None"
        .withColumn("min_date", lit(min_date).cast("string"))
        .withColumn("max_date", lit(max_date).cast("string"))
    )
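
The arithmetic behind the null and duplicate metrics can be checked without a Spark session. The following is a minimal pure-Python sketch, not part of the pipeline: `reference_profile` and its list-of-dicts input are hypothetical stand-ins for the DataFrame logic above, and the date range is left out. It can help when deriving expected values for a test by hand.

```python
from typing import Any, Dict, List, Optional, Sequence


def reference_profile(
    rows: List[Dict[str, Any]],
    pk_cols: Optional[Sequence[str]] = None,
) -> Dict[str, Any]:
    """Hypothetical Spark-free mirror of the null and duplicate metrics."""
    total_rows = len(rows)
    columns = list(rows[0].keys()) if rows else []

    # Null count per column: number of rows where the value is None
    null_counts = {c: sum(1 for r in rows if r[c] is None) for c in columns}

    # Duplicates: total rows minus distinct primary-key tuples
    duplicate_rows = 0
    if pk_cols:
        distinct_keys = {tuple(r[c] for c in pk_cols) for r in rows}
        duplicate_rows = total_rows - len(distinct_keys)

    return {
        "total_rows": total_rows,
        "null_counts": null_counts,
        "duplicate_rows": duplicate_rows,
    }


# Illustrative rows: one repeated key (101) and one null in 'name'
rows = [
    {"pk_id": 101, "name": "item1"},
    {"pk_id": 101, "name": None},
    {"pk_id": 102, "name": "item2"},
]
profile = reference_profile(rows, pk_cols=["pk_id"])
print(profile)
```

With these rows the sketch reports 3 total rows, 1 null in `name`, and 1 duplicate, which is the same "total minus distinct keys" rule the Spark function applies.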

##2. Unit and Integration Test Suite

This section uses the unittest framework to validate the profiling logic and environment access.

####Logic: 
    
* Unit Tests: Create small, "dummy" DataFrames with known defects (such as specific null counts) and assert that the function detects them correctly.

* Integration Tests: Verify that the Databricks environment can actually access the external Volumes where the raw data is stored.
    
####Why this code: 

Automated tests catch "regressions" (a change that fixes one thing while breaking another). If the logic for calculating duplicates is ever changed, rerunning these tests flags any result that no longer matches the expected values.

In [0]:
import unittest
import io
from unittest import TextTestRunner

class ProfilingLogicTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Use the existing SparkSession in the Databricks environment
        cls.spark = spark

    def test_null_count_logic(self):
        """Unit Test: Verify null counts are calculated accurately"""
        # Create dummy data with 2 nulls in 'id' and 1 null in 'val'
        data = [("1", None), (None, "A"), (None, "B")]
        df = self.spark.createDataFrame(data, ["id", "val"])
        
        profile_result = calculate_profile_metrics(df, "test_table")
        
        # Expect 2 nulls for 'id' and 1 for 'val'
        actual_id_nulls = profile_result.select("id").collect()[0][0]
        actual_val_nulls = profile_result.select("val").collect()[0][0]
        
        self.assertEqual(actual_id_nulls, 2)
        self.assertEqual(actual_val_nulls, 1)

    def test_duplicate_logic(self):
        """Unit Test: Verify duplicate row calculation"""
        data = [(101, "item1"), (101, "item1"), (102, "item2")]
        df = self.spark.createDataFrame(data, ["pk_id", "name"])
        
        profile_result = calculate_profile_metrics(df, "dup_test", pk_cols=["pk_id"])
        
        actual_dups = profile_result.select("duplicate_rows").collect()[0][0]
        self.assertEqual(actual_dups, 1) # Total 3 rows - 2 distinct = 1 duplicate

    def test_integration_volume_access(self):
        """Integration Test: Check access to Chunked Data Volumes"""
        path = "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk1/"
        try:
            dbutils.fs.ls(path)
            accessible = True
        except Exception:
            accessible = False
        
        self.assertTrue(accessible, f"Path {path} must be accessible for profiling.")

# Initialize the suite
suite = unittest.TestLoader().loadTestsFromTestCase(ProfilingLogicTest)
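
The integration test above depends on `dbutils`, which exists only inside Databricks. One possible way to exercise the same try/except pattern elsewhere is to inject the listing function. The sketch below is an assumption rather than the notebook's method: `path_is_accessible` and its `lister` parameter are hypothetical, and `os.listdir` stands in for `dbutils.fs.ls` on a local machine.

```python
import os
import tempfile
from typing import Callable


def path_is_accessible(path: str, lister: Callable[[str], object]) -> bool:
    """Hypothetical helper: True if lister(path) completes without raising."""
    try:
        lister(path)
        return True
    except Exception:
        # Any listing failure (missing path, permissions) counts as inaccessible
        return False


# Local stand-in: os.listdir plays the role of dbutils.fs.ls
with tempfile.TemporaryDirectory() as tmp_dir:
    existing_ok = path_is_accessible(tmp_dir, os.listdir)
    missing_ok = path_is_accessible(os.path.join(tmp_dir, "no_such_chunk"), os.listdir)

print(existing_ok, missing_ok)
```

Inside Databricks, the same helper could presumably be called with `dbutils.fs.ls` as the `lister`, leaving the assertion in the test unchanged.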

##3. Quality Report Execution

####Logic: 

Executes the test suite and captures the output into a formatted report. 

####Why this code: 

In a production CI/CD (Continuous Integration/Continuous Deployment) pipeline, this report determines if the code is safe to deploy. A 100% success rate is required for the pipeline to proceed.

In [0]:
# Create a stream to capture the report
stream = io.StringIO()
runner = TextTestRunner(stream=stream, verbosity=2)

# Run tests
result = runner.run(suite)

# Print Final Report
print("●●● DATA PROFILING QUALITY REPORT ●●●")
print("-" * 40)
print(stream.getvalue())
print("-" * 40)
print(f"TOTAL TESTS RUN: {result.testsRun}")
print(f"SUCCESSES: {result.testsRun - len(result.failures) - len(result.errors)}")
print(f"FAILURES: {len(result.failures)}")
print(f"ERRORS: {len(result.errors)}")
print("-" * 40)

if not result.wasSuccessful():
    print("CRITICAL: Profiling logic validation failed. Use %debug to inspect.")

####Report Summary

* Total Tests Run: 3

* Successes: 3

* Failures: 0

* Status: OK