This notebook implements a validation and unit testing suite for the data chunking processes. It checks that the transformation logic for converting CSV data into JSON and XML formats is correct and that the Databricks environment can access the raw data volumes.

##1. Core Transformation Logic

This section isolates the business logic used during the chunking process into testable functions.

####Logic:

These functions represent the "unit of work" for Chunks 3 and 4. The JSON logic coalesces the data into a single partition so that the writer produces one output file, while the XML logic explicitly selects every column so that the schema carries through the format conversion unchanged.
 
####Why this code: 
 
 Separating logic from the main ingestion scripts allows for unit testing. We can verify that these specific transformations behave as expected without needing to run the entire data pipeline.

In [0]:
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def transform_csv_to_json_logic(df: DataFrame) -> DataFrame:
    """
    Business logic for Chunk 3: Preparing data for JSON output.
    Logic: Uses coalesce(1) to ensure the output is a single, clean JSON file.
    """
    return df.coalesce(1)

def transform_xml_logic(df: DataFrame) -> DataFrame:
    """
    Business logic for Chunk 4: Preparing data for XML output.
    Logic: Selects all columns to ensure the schema is preserved during format conversion.
    """
    return df.select([col(c) for c in df.columns])
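
The JSON helper's only job is to request a single partition, and a comparison of row content alone cannot detect whether that call was made. The following is a minimal sketch of one way to check it without starting Spark. It assumes a `MagicMock` stand-in instead of a real DataFrame, and it repeats the helper without PySpark type hints so that the block runs on its own.

```python
from unittest.mock import MagicMock


def transform_csv_to_json_logic(df):
    """Copy of the Chunk 3 helper, without type hints so no Spark import is needed."""
    return df.coalesce(1)


# Stand-in for a DataFrame (an assumption for illustration): it records calls
# instead of running Spark.
fake_df = MagicMock(name="fake_df")

result = transform_csv_to_json_logic(fake_df)

# The helper must request exactly one partition...
fake_df.coalesce.assert_called_once_with(1)
# ...and return whatever coalesce produced.
assert result is fake_df.coalesce.return_value
```

A mock only proves that the call was made. It says nothing about what Spark writes to disk, so it complements a test against a real DataFrame and does not replace it.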

##2. Unit and Integration Test Suite

This section uses the unittest framework and PySpark's built-in testing utilities to validate the transformation logic and environment connectivity.

####Logic:

* Schema Validation: Ensures the data structure matches the expected user-defined schema.

* Transformation Validation: Uses assertDataFrameEqual to confirm that the transform_csv_to_json_logic function returns the same rows and schema as its input.

* Integration Test: Verifies that the notebook has sufficient permissions to access the raw data Volumes.

####Why this code: 

Manual verification is error-prone. Automated tests provide an immediate safety net: a change to the transformation functions that breaks their expected behavior is caught before it reaches production data.

In [0]:
import unittest
from pyspark.testing.utils import assertDataFrameEqual, assertSchemaEqual
from pyspark.sql.types import StructType, StructField, StringType

class DataChunkingTest(unittest.TestCase):

    @classmethod
    def setUpClass(cls):
        # Access the existing SparkSession in Databricks
        cls.spark = spark

    def test_schema_equality(self):
        """Unit Test: Verify the schema matches the expected user structure"""
        data = [("1", "John Doe")]
        schema = StructType([
            StructField("id", StringType(), True),
            StructField("name", StringType(), True)
        ])
        df = self.spark.createDataFrame(data, schema)
        
        # Define the expected schema for comparison
        expected_schema = StructType([
            StructField("id", StringType(), True),
            StructField("name", StringType(), True)
        ])
        
        # Logic: Verify that the inferred or defined schema is identical to our requirement
        assertSchemaEqual(df.schema, expected_schema)

    def test_json_transform_content(self):
        """Test if transform_csv_to_json_logic preserves data correctly"""
        source_data = [("101", "Credit"), ("102", "Cash")]
        source_df = self.spark.createDataFrame(source_data, ["id", "method"])
        
        # Apply the transformation function
        transformed_df = transform_csv_to_json_logic(source_df)
        
        # Logic: Confirm that the data values remain unchanged after the transformation
        assertDataFrameEqual(transformed_df, source_df)

    def test_integration_path_exists(self):
        """Integration test: Check if the raw volumes are accessible"""
        try:
            dbutils.fs.ls("/Volumes/vstone-catalog/vstone_schema/raw_data")
            path_exists = True
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            path_exists = False
        
        # Logic: This ensures the ingestion pipeline won't fail due to missing path permissions
        self.assertTrue(path_exists, "Raw data volume should be accessible.")

# Prepare the test suite for execution
suite = unittest.TestLoader().loadTestsFromTestCase(DataChunkingTest)
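
The integration test above reduces every failure to `False`, so a missing path and a permissions error look the same in the report. The sketch below shows one way to keep the underlying reason. The helper name `check_path_accessible`, the fake listing functions, and the example path are all hypothetical. The listing function is passed in as an argument so that the block runs without `dbutils`.

```python
from typing import Callable, Optional, Tuple


def check_path_accessible(
    list_fn: Callable[[str], object], path: str
) -> Tuple[bool, Optional[str]]:
    """Hypothetical helper: try to list `path` and report why it failed, if it did."""
    try:
        list_fn(path)
        return True, None
    except Exception as exc:  # deliberately broad: the error type depends on the backend
        return False, f"{type(exc).__name__}: {exc}"


# Stand-ins for dbutils.fs.ls, used only to exercise the helper.
def fake_ls_ok(path):
    return ["file_a.csv"]


def fake_ls_denied(path):
    raise PermissionError("access denied")


ok, reason = check_path_accessible(fake_ls_ok, "/Volumes/example/path")
denied, denied_reason = check_path_accessible(fake_ls_denied, "/Volumes/example/path")
print(ok, reason)
print(denied, denied_reason)
```

Inside the notebook, the same helper could be called with `dbutils.fs.ls` and the raw data path, and the returned reason passed as the message to `self.assertTrue`, so that it appears in the test report.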

##3. Execution and Test Reporting

The final block executes the tests and generates a formatted report within the Databricks UI.

####Logic: 

It captures the test output into a string buffer and prints a clean summary of the results, including total runs, errors, and failures. 

####Why this code: 

Clear reporting is essential for Continuous Integration (CI). This summary allows developers to quickly confirm that the chunking logic is healthy or identify exactly which test failed for debugging.

In [0]:
import io
from unittest import TextTestRunner

# 1. Create a string buffer to catch the report
stream = io.StringIO()
runner = TextTestRunner(stream=stream, verbosity=2)

# 2. Run the tests
result = runner.run(suite)

# 3. Print the formatted report
print("======= DATA CHUNKING TEST REPORT =======")
print(stream.getvalue())
print(f"Tests Run: {result.testsRun}")
print(f"Errors: {len(result.errors)}")
print(f"Failures: {len(result.failures)}")
print("=========================================")

# Debugging Logic: Prints a notice if any test fails; the tracebacks appear in the report above
if not result.wasSuccessful():
    print("\n[DEBUG MODE] A test failed. You can use %debug in the next cell to investigate.")
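
Printing a summary leaves the cell in a successful state even when tests fail, so an automated run would still report success. The sketch below shows one possible way to turn a failed result into an exception. The helper name `raise_on_failure` and the two demo checks are hypothetical and exist only to make the block self-contained.

```python
import io
import unittest


def raise_on_failure(result: unittest.TestResult, report: str) -> None:
    """Hypothetical helper: raise if the given test result contains failures or errors."""
    if result.wasSuccessful():
        return
    failed = [test.id() for test, _ in result.failures + result.errors]
    raise RuntimeError(f"{len(failed)} test(s) failed: {failed}\n{report}")


# Two demo checks standing in for the real suite; one fails on purpose.
def always_passes():
    assert 1 + 1 == 2


def always_fails():
    assert 1 + 1 == 3, "deliberate failure"


demo_suite = unittest.TestSuite([
    unittest.FunctionTestCase(always_passes),
    unittest.FunctionTestCase(always_fails),
])

demo_stream = io.StringIO()
demo_result = unittest.TextTestRunner(stream=demo_stream, verbosity=2).run(demo_suite)

try:
    raise_on_failure(demo_result, demo_stream.getvalue())
    outcome = "passed"
    message = ""
except RuntimeError as exc:
    outcome = "failed"
    message = str(exc)

print(outcome)
```

In this notebook, the equivalent call would be `raise_on_failure(result, stream.getvalue())` after the report is printed. This reuses the `result` and `stream` objects from the cell above and does not run the suite a second time.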