## Delta Table JSON Stats Inspector

**Author:** Jitesh Soni 
**Date:** 2025-03-14

## Overview

This notebook extracts column-level statistics from a Delta table's JSON commit files, which are stored in the table's `_delta_log` directory. It performs the following tasks:
- Retrieves the Delta table's storage location using the Delta Lake API.
- Reads the Delta log JSON files and keeps only the `add` entries that contain column statistics.
- Automatically infers the schema of the statistics from a single sample JSON string, fetched with DataFrame operations (avoiding custom RDD code) to comply with shared cluster limitations.
- Parses and flattens the JSON statistics for further analysis.
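
To make the log layout concrete, here is a minimal pure-Python sketch of one `add` action as it might appear in a commit file. The file name, column names, and values are invented for illustration. The point it tries to show is that `stats` is itself a JSON-encoded string inside the JSON log line, which is why the notebook parses that column separately with an explicit schema.

```python
import json

# Hypothetical per-file statistics; column names and values are made up.
sample_stats = {
    "numRecords": 3,
    "minValues": {"device_id": 1, "temperature": 18.5},
    "maxValues": {"device_id": 3, "temperature": 24.0},
    "nullCount": {"device_id": 0, "temperature": 1},
}

# In the commit file, "stats" is stored as a JSON string, not a nested object.
log_line = json.dumps({
    "add": {
        "path": "part-00000-example.snappy.parquet",  # hypothetical file name
        "dataChange": True,
        "stats": json.dumps(sample_stats),
    }
})

action = json.loads(log_line)          # first decode: the log line itself
raw_stats = action["add"]["stats"]     # still a plain string at this point
parsed_stats = json.loads(raw_stats)   # second decode: the statistics

print(type(raw_stats).__name__)
print(parsed_stats["minValues"])
```

Reading the log with `spark.read.json` performs only the first decode, so the `add.stats` column arrives as a string and needs a second parsing step.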

## Usage

1. Update the `table_name` variable with the full name of your Delta table.
2. Run the cells sequentially to inspect the parsed column-level statistics.
3. The final DataFrame shows the file paths along with parsed statistics such as `numRecords`, `minValues`, `maxValues`, and `nullCount`.
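
As a rough illustration of the output shape, the sketch below mimics in plain Python what a one-level flattening of the parsed statistics produces: one row per data file, with the top-level stats keys as columns and the per-column values still nested. The helper `flatten_one_level`, the file name, and the `device_id` column are hypothetical and not part of the notebook.

```python
import json


def flatten_one_level(path: str, stats_json: str) -> dict:
    """Return one row: the file path plus the top-level keys of the stats JSON."""
    stats = json.loads(stats_json)
    # Only the top level is expanded; minValues, maxValues and nullCount stay nested.
    return {"path": path, **stats}


# Hypothetical input; the values are invented for illustration.
row = flatten_one_level(
    "part-00000-example.snappy.parquet",
    '{"numRecords": 3, "minValues": {"device_id": 1}, '
    '"maxValues": {"device_id": 3}, "nullCount": {"device_id": 0}}',
)

print(list(row.keys()))
print(row["maxValues"]["device_id"])
```

In the Spark DataFrame, the nested values should be reachable in a similar way through dotted struct references such as `maxValues.device_id`, assuming the table has a column with that name.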


### [Link To Demo Video](https://www.loom.com/share/001cfe9fc730481ea5b2a44c621c6b55?sid=03c724f5-e95a-42a1-838d-836eabbd65fd)

In [0]:
import json
from delta.tables import DeltaTable
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (
    StructType, StructField, StringType, LongType,
    DoubleType, BooleanType, ArrayType
)
from pyspark.sql import SparkSession

In [0]:

def infer_spark_schema_from_dict(d: dict) -> StructType:
    """
    Recursively infers a Spark StructType schema from a Python dict.
    """
    fields = []
    for k, v in d.items():
        if isinstance(v, dict):
            field_type = infer_spark_schema_from_dict(v)
        elif isinstance(v, list):
            # For non-empty lists, infer type from first element; else default to StringType
            if len(v) > 0:
                first_elem = v[0]
                if isinstance(first_elem, dict):
                    element_type = infer_spark_schema_from_dict(first_elem)
                elif isinstance(first_elem, bool):
                    # bool is a subclass of int in Python, so it must be checked first
                    element_type = BooleanType()
                elif isinstance(first_elem, int):
                    element_type = LongType()
                elif isinstance(first_elem, float):
                    element_type = DoubleType()
                else:
                    element_type = StringType()
            else:
                element_type = StringType()
            field_type = ArrayType(element_type)
        elif isinstance(v, bool):
            # bool is a subclass of int in Python, so it must be checked first
            field_type = BooleanType()
        elif isinstance(v, int):
            field_type = LongType()
        elif isinstance(v, float):
            field_type = DoubleType()
        else:
            field_type = StringType()
        fields.append(StructField(k, field_type, True))
    return StructType(fields)




In [0]:

def inspect_column_stats_from_delta_table(table_name: str, spark: SparkSession):
    """
    Extracts column-level statistics from Delta log JSON files for a given table,
    and automatically infers the schema of the JSON stats using a sample row.
    
    Args:
        table_name (str): The full name of the Delta table (e.g., "db_name.table_name").
        spark (SparkSession): Your active SparkSession.
    
    Returns:
        A DataFrame with the parsed column statistics.
    """
    # Get Delta table path
    delta_table_path = (
        DeltaTable.forName(spark, table_name)
        .detail()
        .select("location")
        .collect()[0][0]
    )
    print(f"🔍 Delta Table Path: {delta_table_path}")

    # Read the delta log JSON files
    log_path = f"{delta_table_path}/_delta_log/*.json"
    logs = spark.read.json(log_path).cache()

    # Filter logs with stats and select relevant columns
    relevant_logs = logs.filter(col("add.stats").isNotNull())
    stats_logs = relevant_logs.select(col("add.path").alias("path"), col("add.stats").alias("stats"))

    # Get one sample JSON string from the stats column (driver-side collection of one row only).
    # stats_logs already excludes rows with null stats, so no further filter is needed.
    sample_row = stats_logs.limit(1).collect()
    if not sample_row:
        raise ValueError("No stats found in the logs.")
    sample_json_str = sample_row[0]["stats"]

    # Convert the sample JSON string into a dict
    sample_dict = json.loads(sample_json_str)

    # Infer the schema from the sample dictionary
    inferred_schema = infer_spark_schema_from_dict(sample_dict)
    print("Inferred Schema:")
    print(inferred_schema.json())

    # Parse the stats column using the inferred schema
    stats_with_schema = stats_logs.withColumn("parsed_stats", from_json(col("stats"), inferred_schema))

    # Flatten the parsed stats into individual columns (along with the file path)
    final_df = stats_with_schema.select("path", "parsed_stats.*")
    display(final_df)
    return final_df

In [0]:
# Example usage:
table_name = "soni.default.iot_data_merge_partitioned"
column_stats_df = inspect_column_stats_from_delta_table(table_name, spark)