
This notebook defines a quality assurance and test suite for the XML ingestion components of the data pipeline. It verifies that the modular ingestion logic adds the audit metadata correctly and that the Databricks notebook can receive job parameters through widgets.

## 1. Core Transformation Function

The primary business logic for XML processing is encapsulated in a dedicated function to allow for isolated testing.

#### Logic:

The function transform_xml_data takes a raw DataFrame and appends two technical audit columns: load_dt, the current timestamp, and source, a string literal that identifies the data source.

#### Why this code:

Isolating this logic from the database write operations allows us to run unit tests in-memory, ensuring the data is correctly "tagged" before it ever hits the Bronze layer.

In [0]:
from pyspark.sql.functions import current_timestamp, lit

def transform_xml_data(df, source_tag):
    """
    Logic: Core transformation that adds audit columns.
    Used for unit testing without writing to Delta tables.
    """
    # Append the load timestamp and the specific source identifier
    return (df.withColumn("load_dt", current_timestamp())
            .withColumn("source", lit(source_tag)))

## 2. Unit and Integration Test Suite

This section utilizes the unittest framework to validate both the data transformation logic and the Databricks environment integration.

#### Logic:

* Transformation Test: Creates a mock DataFrame and asserts that the load_dt and source columns are present and that source contains the expected tag.

* Integration Test: Creates a Databricks text widget and reads its value back to confirm that the pipeline can receive dynamic parameters from external Job tasks.

#### Why this code:

Automated tests keep the ingestion code maintainable. They flag changes to the global ingestion script that would break the audit-column requirements for XML sources such as vouchers or stores.

In [0]:
import io  # used by the reporting cell to capture the runner output
import unittest
from unittest import TextTestRunner

class XMLIngestionTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Attach the existing SparkSession to the test class
        cls.spark = spark

    def test_xml_transformation_columns(self):
        """Unit Test: Verify audit columns are added to the XML structure"""
        # Create mock data mimicking an XML item
        data = [("101", "VoucherA")]
        input_df = self.spark.createDataFrame(data, ["voucher_id", "voucher_code"])
        
        # Apply the transformation function
        output_df = transform_xml_data(input_df, "vstone_vouchers")
        
        # Assertions to ensure audit columns exist and are correct
        self.assertIn("load_dt", output_df.columns)
        self.assertIn("source", output_df.columns)
        actual_source = output_df.select("source").first()[0]
        self.assertEqual(actual_source, "vstone_vouchers")

    def test_integration_widget_params(self):
        """Integration Test: Validate widget retrieval logic for job parameters"""
        # Simulate setting a widget in Databricks
        dbutils.widgets.text("test_tag", "mock_xml")
        retrieved_tag = dbutils.widgets.get("test_tag")
        self.assertEqual(retrieved_tag, "mock_xml")

# Load the test cases into a suite for execution
suite = unittest.TestLoader().loadTestsFromTestCase(XMLIngestionTest)
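
The widget test above calls the real dbutils object, so it can only run inside a Databricks notebook. The following is a minimal sketch of how the same round trip might be checked elsewhere, for example on a local machine. It assumes a hypothetical helper, make_fake_dbutils, which is not part of Databricks: it builds a stand-in with unittest.mock and keeps widget values in a plain dictionary. The real widget behavior is richer than this simplification.

```python
import io
import unittest
from unittest.mock import MagicMock


def make_fake_dbutils():
    """Hypothetical stand-in for dbutils that keeps widget values in a dict."""
    store = {}
    fake = MagicMock()
    # text(name, default) registers a widget; an existing value is kept (a simplification)
    fake.widgets.text.side_effect = lambda name, default="": store.setdefault(name, default)
    # get(name) returns the stored value for that widget
    fake.widgets.get.side_effect = lambda name: store[name]
    return fake


class WidgetParamTest(unittest.TestCase):
    def test_widget_round_trip(self):
        dbutils = make_fake_dbutils()  # local name shadows the notebook global on purpose
        dbutils.widgets.text("test_tag", "mock_xml")
        self.assertEqual(dbutils.widgets.get("test_tag"), "mock_xml")
        # The mock also records how it was called
        dbutils.widgets.text.assert_called_once_with("test_tag", "mock_xml")


widget_suite = unittest.TestLoader().loadTestsFromTestCase(WidgetParamTest)
widget_result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(widget_suite)
```

A stand-in like this only proves that the calling code uses the widget interface as expected. The in-notebook test is still needed to confirm that a real job run delivers its parameters.
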

## 3. Execution and Quality Reporting

The final block runs the suite and generates a structured Quality Report to provide clear visibility into the health of the ingestion logic.

#### Logic:

The script runs the tests through a TextTestRunner and captures the verbose output in a string buffer. It then prints that output, followed by a summary of passes, failures, and errors taken from the returned result object.

#### Why this code:

This reporting format is essential for automated workflows. If this notebook is run as part of a CI/CD pipeline, the "Alert" logic at the end will trigger a failure if any test does not pass, preventing faulty code from being deployed.

In [0]:
# 1. Setup reporting stream to capture test results
stream = io.StringIO()
runner = TextTestRunner(stream=stream, verbosity=2)

# 2. Execute the tests
result = runner.run(suite)

# 3. Print the Final Formatted Report
print("●●● XML INGESTION QUALITY REPORT ●●●")
print("-" * 45)
print(stream.getvalue())
print("-" * 45)
print(f"TOTAL TESTS RUN: {result.testsRun}")
print(f"PASSED: {result.testsRun - len(result.failures) - len(result.errors)}")
print(f"FAILED: {len(result.failures)}")
print(f"ERRORS: {len(result.errors)}")
print("-" * 45)
# Final verdict for automated pipelines: raise so that the job run is marked as failed
if not result.wasSuccessful():
    print("\n[ALERT] Logic validation failed. See the report above for details.")
    raise RuntimeError("XML ingestion tests failed")