
This notebook sets up unit tests for the Auto Loader ingestion pipeline (Chunk 3). It validates the core transformation logic and the environment configuration independently of the streaming process, so that defects in either can be caught before data is written to the Bronze layer.

## 1. Refactored Transformation Logic


To enable effective testing, the ingestion logic is refactored into small, modular functions that can be called without starting a stream.

* **transform_json_ingestion:** Encapsulates the core transformation logic.

* **Logic:** It adds technical metadata to the incoming stream: a load timestamp (load_dt) and a source identifier (source).

* **get_table_name:** Handles environment-specific configuration.

* **Logic:** It constructs the full Unity Catalog table identifier from the provided catalog and schema names. The table name, users_bronze, is fixed.

In [0]:
import unittest
import io
from unittest import TextTestRunner
from pyspark.sql import Row
from pyspark.sql.functions import current_timestamp, lit



In [0]:
# --- 1. Refactored Logic for Transformation ---
def transform_json_ingestion(df):
    """
    Core transformation logic for Chunk 3.
    Adds audit columns to the stream.
    """
    return (
        df.withColumn("load_dt", current_timestamp())
          .withColumn("source", lit("dab_json_ingestion"))
    )

def get_table_name(catalog, schema):
    """Logic for Environment Configuration"""
    return f"`{catalog}`.`{schema}`.`users_bronze`"
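The identifier construction above wraps each part in backticks as-is. As a minimal sketch, assuming catalog or schema names might one day contain a backtick, the quoting could be made defensive by doubling embedded backticks, which is how Spark SQL escapes them inside a delimited identifier. The names `quote_identifier` and `get_table_name_escaped` are hypothetical and not part of the notebook.

```python
def quote_identifier(name):
    """Wrap a name in backticks, doubling any backtick it already contains."""
    return "`" + name.replace("`", "``") + "`"


def get_table_name_escaped(catalog, schema, table="users_bronze"):
    """Hypothetical variant of get_table_name that escapes each part."""
    return ".".join(quote_identifier(part) for part in (catalog, schema, table))


# Ordinary names produce the same string as the notebook's get_table_name.
print(get_table_name_escaped("prod_catalog", "bronze_db"))
```

For ordinary names the result matches the original function, so the existing test would keep passing if this variant were swapped in.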

## 2. Unit Test Suite


The test suite uses the unittest framework to verify that the two refactored functions behave as intended.

* **test_transformation_metadata:** Validates the addition of audit columns.

* **Logic:** It creates a small input DataFrame using pyspark.sql.Row and asserts that the resulting DataFrame contains both the load_dt and source columns, and that source holds the expected constant value. The value of load_dt is not checked.

* **test_environment_widget_logic:** Ensures correct table naming.

* **Logic:** It asserts that the get_table_name function correctly formats the three-level namespace string (catalog.schema.table) required for Unity Catalog.

In [0]:

# --- 2. Test Suite (Only Unit Tests) ---
class AutoloaderIngestionTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Reuse the `spark` session that the notebook runtime provides (Spark Connect compatible)
        cls.spark = spark

    def test_transformation_metadata(self):
        """Unit Test: Verify audit columns (load_dt, source) are added"""
        # Create data using Rows to avoid JVM/sparkContext dependencies
        input_data = [Row(user_id=1, name="John")]
        input_df = self.spark.createDataFrame(input_data)
        
        output_df = transform_json_ingestion(input_df)
        
        # Verify columns exist
        self.assertIn("load_dt", output_df.columns)
        self.assertIn("source", output_df.columns)
        
        # Verify constant value matches logic
        actual_source = output_df.select("source").distinct().collect()[0][0]
        self.assertEqual(actual_source, "dab_json_ingestion")

    def test_environment_widget_logic(self):
        """Unit Test: Ensure table identifier construction is correct"""
        actual_name = get_table_name("prod_catalog", "bronze_db")
        self.assertEqual(actual_name, "`prod_catalog`.`bronze_db`.`users_bronze`")
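The table naming check above covers a single catalog. A possible extension, sketched here with `unittest`'s `subTest`, runs the same assertion for several environments so that a failure names the offending combination. The environment names (`dev_catalog`, `test_catalog`) are hypothetical, and `get_table_name` is repeated only so that the sketch runs on its own.

```python
import io
import unittest


def get_table_name(catalog, schema):
    """Same logic as the notebook function, repeated to keep the sketch standalone."""
    return f"`{catalog}`.`{schema}`.`users_bronze`"


class TableNameAcrossEnvironmentsTest(unittest.TestCase):
    def test_table_name_per_environment(self):
        # Hypothetical environments; substitute the real catalogs and schemas.
        cases = [
            ("dev_catalog", "bronze_db"),
            ("test_catalog", "bronze_db"),
            ("prod_catalog", "bronze_db"),
        ]
        for catalog, schema in cases:
            # Each subTest is reported separately if its assertion fails.
            with self.subTest(catalog=catalog, schema=schema):
                parts = get_table_name(catalog, schema).split(".")
                self.assertEqual(
                    parts, [f"`{catalog}`", f"`{schema}`", "`users_bronze`"]
                )


env_suite = unittest.TestLoader().loadTestsFromTestCase(TableNameAcrossEnvironmentsTest)
env_result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(env_suite)
print("environment checks passed:", env_result.wasSuccessful())
```

This needs no Spark session, so it can run in any Python environment.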




## 3. Quality Reporting

The final section executes the test suite and generates a structured Pipeline Quality Report.

- **Execution:** The TextTestRunner runs the suite and captures detailed logs in an io.StringIO stream.

- **Summary:** The report displays the total number of tests, successes, and any failures or errors.

- **Verdict:** A final status message indicates whether all logic and unit tests are verified or if issues were detected.

In [0]:
# --- 3. Run and Generate Report ---
# Load the tests into the suite
suite = unittest.TestLoader().loadTestsFromTestCase(AutoloaderIngestionTest)

# Create a stream to capture the results
stream = io.StringIO()
runner = TextTestRunner(stream=stream, verbosity=2)

# Run the suite
result = runner.run(suite)

# Print the formatted report
print("●●● AUTOLOADER PIPELINE QUALITY REPORT ●●●")
print("-" * 45)
print(stream.getvalue())
print("-" * 45)
print(f"TOTAL TESTS: {result.testsRun}")
print(f"SUCCESSES: {result.testsRun - len(result.failures) - len(result.errors)}")
print(f"FAILURES/ERRORS: {len(result.failures) + len(result.errors)}")
print("-" * 45)

if result.wasSuccessful():
    print("STATUS: PASSED - All logic and unit tests are verified.")
else:
    print("STATUS: FAILED - Logic validation issues detected.")
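If the suite later gains skipped tests, or if the notebook runs as a scheduled task, a summary that separates skipped tests and raises on failure may be useful. The following is a minimal sketch under those assumptions. `DemoSuite`, `summarize_result`, and `raise_if_failed` are hypothetical names, and the demo suite exists only to produce one pass, one failure, and one skip.

```python
import io
import unittest


class DemoSuite(unittest.TestCase):
    """Hypothetical stand-in suite, used only to exercise the helpers below."""

    def test_passes(self):
        self.assertEqual(1 + 1, 2)

    def test_fails(self):
        self.assertEqual(1 + 1, 3)  # deliberately wrong, to produce one failure

    @unittest.skip("illustrative skip")
    def test_skipped(self):
        pass


def summarize_result(result):
    """Split a unittest result into counts, keeping skipped tests out of 'passed'."""
    failed = len(result.failures) + len(result.errors) + len(result.unexpectedSuccesses)
    skipped = len(result.skipped)
    expected_failures = len(result.expectedFailures)
    return {
        "total": result.testsRun,
        "passed": result.testsRun - failed - skipped - expected_failures,
        "failed": failed,
        "skipped": skipped,
    }


def raise_if_failed(result):
    """Raise so that a scheduled run stops instead of only printing a status line."""
    if not result.wasSuccessful():
        raise RuntimeError(f"Unit tests failed: {summarize_result(result)}")


demo_suite = unittest.TestLoader().loadTestsFromTestCase(DemoSuite)
demo_result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=2).run(demo_suite)
demo_summary = summarize_result(demo_result)
print(demo_summary)
```

Calling `raise_if_failed(result)` after the report is printed would turn a failed verdict into an uncaught exception. A printed status line alone does not stop execution, so whether to raise depends on how the notebook is orchestrated.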