### Efficiently Processing & Querying Highly Nested, Schema-Variable JSON

* **Problem:** Source systems often output deeply nested JSON with arrays, complex objects, and fields that might appear/disappear between records. Flattening everything can be inefficient or lead to massive tables. Querying specific nested elements directly can be cumbersome with basic tools.
* **Trickiness:** Schema evolution, deep nesting, arrays, and inconsistent fields make naive flattening or plain SQL querying difficult and inefficient. `explode()` emits one row per array element, so it can multiply row counts quickly.
* **Technique (Spark Notebook):**
    * **Selective Flattening & Struct Navigation:** Instead of flattening everything, use PySpark to read the JSON (inferring the schema or supplying an explicit one). Use `select()` with nested path expressions (`col("customer.address.city")`, `col("items")[0]["item_id"]`) and functions such as `explode()` (one row per array element), `element_at()` (a specific array element or map value), and `getField()` to extract *only* the top-level and nested fields you need into a structured DataFrame. Keep the less frequently used nested parts as `StructType` columns.
* **Demo Focus:** Read complex `JSON` from `OneLake`/`Lakehouse` -> Apply selective flattening/struct navigation -> Show querying the resulting DataFrame with a mix of flat columns and accessible structs -> Optionally show a `Pandas UDF` tackling a particularly messy part -> Write results to a `Delta Lake` table, queryable via the `SQL endpoint`.


#### Reset Demo

This `%%configure` cell is an attempt to improve default Lakehouse stability: reading the JSON file via the relative path occasionally returns a 400 error. See: https://community.fabric.microsoft.com/t5/Data-Engineering/400-error-when-accessing-lakehouse-files-from-notebook/m-p/4315410

In [1]:
%%configure
{
    "defaultLakehouse": {
        "name": "HealthcareData"
    }
}


In [2]:
%%sql

DROP TABLE IF EXISTS orders_processed


<Spark SQL result set with 0 rows and 0 fields>

#### Cell 1: Setup & Imports

In [3]:
# Required import for the Fabric Warehouse/Lakehouse Spark Connector
import com.microsoft.spark.fabric

# Import PySpark functions
from pyspark.sql import functions as F
from pyspark.sql.types import * # Optional, for defining schema manually if needed

print("Setup complete. PySpark functions imported.")

# Define file paths (Adjust these to your Lakehouse structure)
# Assuming 'Files/landing/orders/' directory in your Lakehouse
json_file_path = "Files/landing/orders/orders.jsonl"
# Output Delta table name (the table is created under the Lakehouse 'Tables' folder)
output_delta_table_name = "orders_processed"


Setup complete. PySpark functions imported.


###### Highlights of this data:

* **Nesting:** `customer`, `address`, `items`, `shipping_details`.
* **Arrays:** `tags`, `items`.
* **Optional fields:** `middle_name` (record 2), `discount` in `items` (record 3), `tracking_no` missing in `shipping_details` (record 4).
* **Null objects / empty arrays:** `shipping_details` (record 2), `tags` (record 2), `items` (record 4).
* **Mixed types:** `customer_id` is numeric (123, 456) in some records and a string ("CUST-789") in another, so Spark infers `string` for the column.
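The source file itself is not shown in this notebook. To make the demo reproducible, the sketch below writes a small hypothetical `orders.jsonl` that has the traits listed above. Only the order IDs, first item names, item counts, the three `customer_id` values, and the listed traits come from this notebook. Every other value (names, addresses, prices, timestamps), the helper `make_customer`, and the reading that `middle_name` appears only in record 2 and `discount` only in record 3 are assumptions.

```python
import json
import tempfile
from pathlib import Path


def make_customer(customer_id, first, last, tags, middle=None):
    """Build a customer object; middle_name is only emitted when provided."""
    customer = {
        "customer_id": customer_id,
        "first_name": first,
        "last_name": last,
        "email": f"{first.lower()}@example.com",
        "tags": tags,
        "address": {"street": "1 Main St", "city": "Springfield",
                    "state": "IL", "zip": "62701"},
    }
    if middle is not None:
        customer["middle_name"] = middle
    return customer


sample_orders = [
    {   # Record 1: fully populated, numeric customer_id
        "order_id": "ORD-1001",
        "order_timestamp": "2024-01-05T10:15:00Z",
        "is_pickup": False,
        "customer": make_customer(123, "Ana", "Silva", ["vip"]),
        "items": [
            {"item_id": "SKU-A", "name": "Product A", "quantity": 1, "price": 19.99},
            {"item_id": "SKU-B", "name": "Product B", "quantity": 2, "price": 5.5},
        ],
        "shipping_details": {"method": "ground", "tracking_no": "TRK-0001"},
    },
    {   # Record 2: middle_name present, empty tags, null shipping_details
        "order_id": "ORD-1002",
        "order_timestamp": "2024-01-06T09:00:00Z",
        "is_pickup": True,
        "customer": make_customer(456, "Ben", "Ortiz", [], middle="J"),
        "items": [
            {"item_id": "SKU-L", "name": "Product L", "quantity": 1, "price": 42.0},
        ],
        "shipping_details": None,
    },
    {   # Record 3: string customer_id, discount on one item
        "order_id": "ORD-1003",
        "order_timestamp": "2024-01-07T14:45:00Z",
        "is_pickup": False,
        "customer": make_customer("CUST-789", "Chen", "Wu", ["new"]),
        "items": [
            {"item_id": "SKU-A", "name": "Product A", "quantity": 3,
             "price": 19.99, "discount": 0.1},
            {"item_id": "SKU-C", "name": "Product C", "quantity": 1, "price": 7.25},
        ],
        "shipping_details": {"method": "express", "tracking_no": "TRK-0003"},
    },
    {   # Record 4: empty items array, shipping_details without tracking_no
        "order_id": "ORD-1004",
        "order_timestamp": "2024-01-08T08:30:00Z",
        "is_pickup": False,
        "customer": make_customer(123, "Ana", "Silva", ["vip"]),
        "items": [],
        "shipping_details": {"method": "ground"},
    },
]

# Write one JSON object per line (JSON Lines), the layout spark.read.json expects
sample_path = Path(tempfile.mkdtemp()) / "orders.jsonl"
with sample_path.open("w", encoding="utf-8") as fh:
    for record in sample_orders:
        fh.write(json.dumps(record) + "\n")

print(f"Wrote {len(sample_orders)} records to {sample_path}")
```

Uploading the resulting file to `Files/landing/orders/` in the Lakehouse lets `json_file_path` from Cell 1 resolve to it.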

#### Cell 2: Read JSON and Inspect Schema

In [4]:
# Read the JSON lines file, letting Spark infer the schema
try:
    raw_df = spark.read.json(json_file_path)
    print("Successfully read JSON file.")
    print("Inferred Schema:")
    raw_df.printSchema()
    print("\nSample Data (Raw):")
    raw_df.show(truncate=False) # Show raw structure
except Exception as e:
    print(f"Error reading JSON: {e}")
    # Stop the notebook here: every later cell depends on raw_df
    mssparkutils.notebook.exit("Failed to read source JSON")


Successfully read JSON file.
Inferred Schema:
root
 |-- customer: struct (nullable = true)
 |    |-- address: struct (nullable = true)
 |    |    |-- city: string (nullable = true)
 |    |    |-- state: string (nullable = true)
 |    |    |-- street: string (nullable = true)
 |    |    |-- zip: string (nullable = true)
 |    |-- customer_id: string (nullable = true)
 |    |-- email: string (nullable = true)
 |    |-- first_name: string (nullable = true)
 |    |-- last_name: string (nullable = true)
 |    |-- middle_name: string (nullable = true)
 |    |-- tags: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |-- is_pickup: boolean (nullable = true)
 |-- items: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- discount: double (nullable = true)
 |    |    |-- item_id: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- quantity: long (nullable = true)
 ... (output truncated)

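##### Discussion Point:

Walk through the inferred schema. Point out the struct types for `customer`, `customer.address`, and `shipping_details`, and that `items` is an `array<struct>`. Note the inferred data types, for example `customer_id` coming out as `string` because the source mixes numbers and strings. Show how complex the raw structure looks.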

#### Cell 3: Selective Flattening & Struct Navigation

In [5]:
# Select top-level fields and specific nested fields
# Explode the items array to get one row per item per order
# Keep the full address as a struct for demonstration

if 'raw_df' in locals(): # Check if previous cell ran successfully
    processed_df = raw_df.select(
        F.col("order_id"),
        F.col("order_timestamp"),
        # Select specific customer fields, renaming for clarity
        F.col("customer.customer_id").alias("customer_id"), # Note potential type mixing
        F.col("customer.email").alias("customer_email"),
        F.col("customer.first_name").alias("customer_fname"),
        F.col("customer.last_name").alias("customer_lname"),
        # Handle optional middle_name - will be null if missing
        F.col("customer.middle_name").alias("customer_mname"),
        # Select specific address fields
        F.col("customer.address.city").alias("customer_city"),
        F.col("customer.address.state").alias("customer_state"),
        # Keep the full address struct as well
        F.col("customer.address").alias("customer_address_struct"),
        # No special handling needed: a null shipping_details struct yields null for its nested fields
        F.col("shipping_details.method").alias("shipping_method"),
        F.col("shipping_details.tracking_no").alias("shipping_tracking"), # Will be null if missing
        # Explode the items array - this creates multiple rows per order if >1 item
        # Use posexplode if you need the position in the array
        F.explode_outer("items").alias("item") # explode_outer handles null/empty arrays gracefully
    ).select(
        "*", # Select all columns generated so far
        # Now select fields from the exploded 'item' struct
        F.col("item.item_id").alias("item_id"),
        F.col("item.name").alias("item_name"),
        F.col("item.quantity").alias("item_quantity"),
        F.col("item.price").alias("item_price"),
        # Handle optional discount, defaulting to 0.0 if null or missing
        F.coalesce(F.col("item.discount"), F.lit(0.0)).alias("item_discount")
    ).drop("item") # Drop the intermediate exploded struct column

    print("Schema after selection and explosion:")
    processed_df.printSchema()
    print("\nSample Data (Processed):")
    processed_df.show(truncate=False)
else:
    print("raw_df not found. Skipping processing.")


Schema after selection and explosion:
root
 |-- order_id: string (nullable = true)
 |-- order_timestamp: string (nullable = true)
 |-- customer_id: string (nullable = true)
 |-- customer_email: string (nullable = true)
 |-- customer_fname: string (nullable = true)
 |-- customer_lname: string (nullable = true)
 |-- customer_mname: string (nullable = true)
 |-- customer_city: string (nullable = true)
 |-- customer_state: string (nullable = true)
 |-- customer_address_struct: struct (nullable = true)
 |    |-- city: string (nullable = true)
 |    |-- state: string (nullable = true)
 |    |-- street: string (nullable = true)
 |    |-- zip: string (nullable = true)
 |-- shipping_method: string (nullable = true)
 |-- shipping_tracking: string (nullable = true)
 |-- item_id: string (nullable = true)
 |-- item_name: string (nullable = true)
 |-- item_quantity: long (nullable = true)
 |-- item_price: double (nullable = true)
 |-- item_discount: double (nullable = false)


Sample Data (Processed):
... (output truncated)

##### Discussion Point: 

Explain the `select` choices. Show how dot notation reaches nested fields. Explain how `explode_outer` turns `items` into one row per item while keeping orders whose array is empty or null. Point out that missing fields (`middle_name`, `tracking_no`, `discount`) come through as null and that `coalesce` supplies a default. Show the resulting flatter, but still potentially wide, schema, and discuss the trade-off: more rows because of the explode.
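The row-count trade-off can be worked out by hand before running anything. The sketch below uses plain Python (no Spark) and the per-order item counts shown in the Cell 4 output of this notebook. It models the documented behavior of the two functions rather than calling them.

```python
# Item counts per order, copied from the "First Item Name & Count" output in Cell 4
item_counts = {"ORD-1001": 2, "ORD-1002": 1, "ORD-1003": 2, "ORD-1004": 0}

# explode() emits one row per array element, so orders with empty arrays vanish
rows_with_explode = sum(item_counts.values())

# explode_outer() still emits one row (with a null item) for an empty or null array
rows_with_explode_outer = sum(max(count, 1) for count in item_counts.values())

orders_lost_by_explode = [oid for oid, count in item_counts.items() if count == 0]

print(f"explode:       {rows_with_explode} rows")        # 5 rows
print(f"explode_outer: {rows_with_explode_outer} rows")  # 6 rows
print(f"orders dropped by explode: {orders_lost_by_explode}")
```

The six rows from `explode_outer` match the six rows reported by the SQL query in Cell 6; with plain `explode`, `ORD-1004` would silently disappear from the table.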

#### Cell 4 (Optional): Alternative Array Handling (e.g., First Item Only)

Show this as an _alternative_ if denormalization via explode isn't desired. Explain `element_at` and `size`.



In [6]:
# Alternative if you don't want to explode, e.g., get first item's name
if 'raw_df' in locals():
    first_item_df = raw_df.select(
        F.col("order_id"),
        # element_at uses a 1-based index; an out-of-range index returns NULL unless ANSI mode is enabled
        F.element_at(F.col("items"), 1).getField("name").alias("first_item_name"),
        F.size(F.col("items")).alias("item_count") # Number of elements in the array
    )
    print("\nSample Data (First Item Name & Count):")
    first_item_df.show()
else:
    print("raw_df not found. Skipping alternative.")



Sample Data (First Item Name & Count):
+--------+---------------+----------+
|order_id|first_item_name|item_count|
+--------+---------------+----------+
|ORD-1001|      Product A|         2|
|ORD-1002|      Product L|         1|
|ORD-1003|      Product A|         2|
|ORD-1004|           NULL|         0|
+--------+---------------+----------+



#### Cell 5: Write Processed Data to Delta Lake

In [7]:
# Write the selectively flattened DataFrame to a Delta table in the Lakehouse

if 'processed_df' in locals(): # Check if processing ran
    try:
        processed_df.write.format("delta").mode("overwrite").saveAsTable(output_delta_table_name)
        # Or use .save(f"Tables/{output_delta_table_name}") if you prefer path-based
        print(f"Successfully wrote processed data to Delta table: {output_delta_table_name}")
    except Exception as e:
        print(f"Error writing Delta table: {e}")
else:
    print("processed_df not found. Skipping Delta write.")


Successfully wrote processed data to Delta table: orders_processed


##### Discussion Point: 

Mention the benefits of Delta Lake (ACID, schema evolution, time travel). Show the table appearing in the Lakehouse UI under "Tables".

#### Cell 6: Query via SQL

In [8]:
%%sql

SELECT
    order_id,
    customer_id,
    customer_email,
    customer_city,
    -- Access fields within the stored struct column
    customer_address_struct.street AS customer_street,
    customer_address_struct.zip AS customer_zip,
    shipping_method,
    shipping_tracking,
    item_id,
    item_name,
    item_quantity,
    item_price,
    item_discount
FROM
    orders_processed
LIMIT 20;


<Spark SQL result set with 6 rows and 13 fields>

##### Discussion Point: 

Show how standard SQL can query the processed Delta table. Crucially, demonstrate querying inside the `customer_address_struct` column with dot notation, which shows that the retained structure remains easy to reach from SQL (this could also be done in the SQL endpoint of the Lakehouse). Discuss how this balances flattening against retaining structure.