## 1. Core Profiling Logic

This section defines the `basic_profile` function, the main routine used to profile any Spark DataFrame in this notebook.

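#### Logic:

The function iterates over every column to count nulls. When Primary Key (PK) columns are provided, it also counts duplicate rows as the total row count minus the number of distinct PK combinations, and when a date column is provided it reports the minimum and maximum dates.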
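
To make the two counts concrete, here is a minimal plain-Python sketch of the same arithmetic on a tiny in-memory table. The rows, column values, and the helper names `null_counts` and `duplicate_count` are hypothetical and exist only for illustration; the Spark function in this section works on DataFrames rather than lists of dicts.

```python
# Illustrative only: hypothetical rows standing in for a DataFrame.
rows = [
    {"transaction_id": 1, "store_id": "A", "created_at": "2024-01-01"},
    {"transaction_id": 2, "store_id": None, "created_at": "2024-01-02"},
    {"transaction_id": 2, "store_id": "B", "created_at": None},
]


def null_counts(rows, columns):
    """Count missing (None) values per column."""
    return {c: sum(1 for r in rows if r[c] is None) for c in columns}


def duplicate_count(rows, pk_cols):
    """Total rows minus the number of distinct primary-key combinations."""
    distinct_keys = {tuple(r[c] for c in pk_cols) for r in rows}
    return len(rows) - len(distinct_keys)


columns = ["transaction_id", "store_id", "created_at"]
print(null_counts(rows, columns))
print(duplicate_count(rows, ["transaction_id"]))
```

Note that the duplicate figure is the number of surplus rows, not the number of keys that repeat: a key that appears three times contributes 2 to the count.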

#### Why this code:

Manual profiling is time-consuming. This function centralizes the logic so that the same data quality checks run on every table in the pipeline, including a date range check for temporal datasets.

In [0]:
# Import essential PySpark functions for aggregation and column manipulation.
# Note: min and max imported here shadow Python's built-in min and max in this notebook.
from pyspark.sql.functions import col, count, lit, max, min, sum as spark_sum, when



In [0]:
def basic_profile(df, table_name, pk_cols=None, date_col=None):
    total_rows = df.count()

    # -------------------------
    # 1. Null counts per column
    # Logic: Cast isNull() booleans to integers (1 or 0) and sum them up for every column.
    # -------------------------
    null_counts = df.select([
        spark_sum(col(c).isNull().cast("int")).alias(c)
        for c in df.columns
    ])

    # -------------------------
    # 2. Duplicate detection
    # Logic: Subtract the count of distinct PK combinations from the total row count.
    # -------------------------
    duplicate_rows = 0
    if pk_cols:
        duplicate_rows = total_rows - df.select(pk_cols).distinct().count()

    # -------------------------
    # 3. Date range analysis
    # Logic: Identify the minimum and maximum dates to verify data freshness/coverage.
    # -------------------------
    min_date = None
    max_date = None
    if date_col:
        dates = df.select(
            min(col(date_col)).alias("min_date"),
            max(col(date_col)).alias("max_date")
        ).collect()[0]

        min_date = dates["min_date"]
        max_date = dates["max_date"]

    # -------------------------
    # 4. Final summary construction
    # Logic: Append the metadata (table name, row counts) to the null count results.
    # -------------------------
    profile_df = (
        null_counts
        .withColumn("table_name", lit(table_name))
        .withColumn("total_rows", lit(total_rows))
        .withColumn("duplicate_rows", lit(duplicate_rows))
        .withColumn("min_date", lit(min_date))
        .withColumn("max_date", lit(max_date))
    )
    # Display the final health report in the Databricks UI
    display(profile_df)


## 2. Transaction Data Profiling (CSV)

#### Logic:

This block loads transaction headers from the `chunk1` directory and transaction line items from the `chunk2` directory, then profiles each.
 
#### Why this code:

Transaction data is usually the largest dataset. Profiling it shows whether mandatory fields (like `transaction_id`) have missing values and whether there are integrity issues such as duplicate IDs.
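
One point to keep in mind when reading the results: a CSV loaded without a schema or `inferSchema` yields string columns, so minimum and maximum values follow string ordering. The plain-Python sketch below uses made-up values to show when that ordering agrees with the intended one. It assumes Spark's default string comparison orders these ASCII values the same way Python does, and it uses `sorted` rather than `min`/`max` because those names are shadowed by the PySpark imports in this notebook.

```python
# Hypothetical values, as they would look if read as strings.
iso_dates = ["2024-03-09", "2024-11-01", "2023-12-31"]
amounts = ["9.50", "10.00", "100.25"]

# Zero-padded ISO dates: string order matches chronological order.
ordered_dates = sorted(iso_dates)
print(ordered_dates[0], ordered_dates[-1])

# Numbers stored as strings are compared character by character,
# so "9.50" sorts after "100.25".
ordered_amounts = sorted(amounts)
print(ordered_amounts[0], ordered_amounts[-1])

# Casting to a numeric type first restores numeric ordering.
ordered_numbers = sorted(float(a) for a in amounts)
print(ordered_numbers[0], ordered_numbers[-1])
```

If a date column is not stored in a zero-padded, year-first format, casting it to a date or timestamp type before profiling is the safer choice.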

In [0]:
# Load Transaction Header data (no inferSchema, so every column is read as a string)
transactions_df = spark.read.option("header", True).csv(
    "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk1/"
)
# Run health check for Transactions
basic_profile(
    df=transactions_df,
    table_name="transactions",
    pk_cols=["transaction_id"],
    date_col="created_at"
)

In [0]:

display(transactions_df.describe())

In [0]:
# Load and Profile Transaction Items (using inferSchema to handle numeric prices)
transaction_items_df = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv(
        "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk2/transaction_items/*.csv"
    )
)

basic_profile(
    df=transaction_items_df,
    table_name="transaction_items",
    pk_cols=["transaction_id", "item_id"],
    date_col="created_at"
)

In [0]:
transaction_items_df.describe().display()

## 3. User & Reference Data Profiling (JSON/XML)

#### Logic:

This block profiles User data (JSON) and the static reference tables (XML): vouchers, menu items, payment methods, and stores.

#### Why this code:

Reference data often arrives in different formats. Profiling the JSON and XML sources helps confirm that Spark's `json` and `xml` readers parse the nested or tagged structures into the expected DataFrame columns.
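
As an illustration of what a row tag means, the sketch below uses Python's standard `xml.etree.ElementTree` to turn each `<item>` element of a small hypothetical document into one flat record. It is not how the Spark XML reader is implemented, and the element names and values are invented; it only shows the element-to-row mapping that a row tag of `item` selects.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML; the fields of the real reference files are assumptions here.
xml_text = """
<root>
  <item><voucher_id>V1</voucher_id><discount>10</discount></item>
  <item><voucher_id>V2</voucher_id><discount>15</discount></item>
</root>
""".strip()


def rows_from_xml(text, row_tag):
    """Return one dict per element named row_tag, keyed by child tag."""
    root = ET.fromstring(text)
    return [
        {child.tag: child.text for child in elem}
        for elem in root.iter(row_tag)
    ]


records = rows_from_xml(xml_text, "item")
print(records)
```

In this sketch every value stays a string. If the row tag does not match any element, the result is empty, which is one reason to check row counts right after loading.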

In [0]:
# Load User data from JSON
users_df = spark.read.json(
    "/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk3/users/"
)
# Profile Users (checking for duplicate user IDs and registration date ranges)
basic_profile(
    df=users_df,
    table_name="users",
    pk_cols=["user_id"],
    date_col="registered_at"
)

In [0]:
users_df.describe().display()

In [0]:
# Load Voucher data from XML, treating each <item> element as one row (rowTag)
vouchers_df = (
    spark.read
    .format("xml")
    .option("rowTag", "item")
    .load("/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk4/vouchers.xml")
)
# Profile Vouchers (verifying total discount entries and potential duplicates)
basic_profile(
    df=vouchers_df,
    table_name="vouchers",
    pk_cols=["voucher_id"]
)

In [0]:
display(vouchers_df.describe())

In [0]:
# Load and profile Menu Items from XML (checking for duplicate item IDs)
menu_items_df = (
    spark.read
    .format("xml")
    .option("rowTag", "item")
    .load("/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk4/menu_items.xml")
)

basic_profile(
    df=menu_items_df,
    table_name="menu_items",
    pk_cols=["item_id"]
)

In [0]:
menu_items_df.describe().display()

In [0]:
# Load and profile Payment Methods from XML (checking for duplicate method IDs)
payment_methods_df = (
    spark.read
    .format("xml")
    .option("rowTag", "item")
    .load("/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk4/payment_methods.xml")
)
basic_profile(
    df=payment_methods_df,
    table_name="payment_methods",
    pk_cols=["method_id"]
)

In [0]:
payment_methods_df.describe().display()

In [0]:
# Load and profile Stores from XML (checking for duplicate store IDs)
stores_df = (
    spark.read
    .format("xml")
    .option("rowTag", "item")
    .load("/Volumes/vstone-catalog/vstone_schema/chunked_data/chunk4/stores.xml")
)
basic_profile(
    df=stores_df,
    table_name="stores",
    pk_cols=["store_id"]
)

In [0]:
stores_df.describe().display()