# PySpark: Zero to Hero
## Module 15: Working with JSON and Complex Data

In the previous module, we looked at columnar formats like Parquet. Today, we focus on **JSON**, one of the most common formats for web APIs and NoSQL databases. JSON data in Spark can be complex, often containing nested structures (Structs) and Arrays.

### Agenda:
1.  **Reading JSON:** Single-line vs. Multi-line files.
2.  **Schema Handling:** Inference vs. Enforcement.
3.  **JSON Functions:** Parsing JSON strings (`from_json`) and writing JSON strings (`to_json`).
4.  **Complex Data:** Accessing nested fields and flattening arrays (`explode`).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_json, explode, struct
from pyspark.sql.types import StructType, StructField, StringType, LongType, ArrayType, DoubleType

spark = SparkSession.builder \
    .appName("JSON_Processing") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

## 1. Reading JSON Files

By default, Spark expects JSON Lines (newline-delimited JSON): each line holds one complete JSON object. However, many JSON files are "pretty-printed", with a single object or array spread across multiple lines.

*   **Single-line JSON:** Supported out of the box.
*   **Multi-line JSON:** Requires the option `multiline=True`.
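
To make the two layouts concrete, here is a minimal sketch that writes the same records both ways using only the standard library. The order values, the scratch directory, and the variable names are hypothetical; only the field names are borrowed from the schema used later in this module.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sample orders; field names mirror the schema used later in this module.
orders = [
    {
        "order_id": "O101",
        "customer_id": "C001",
        "contact": [9000010000, 9000010001],
        "order_line_items": [
            {"item_id": "I001", "qty": 6, "amount": 102.45},
            {"item_id": "I003", "qty": 2, "amount": 2.01},
        ],
    },
    {
        "order_id": "O102",
        "customer_id": "C002",
        "contact": [9000020000],
        "order_line_items": [{"item_id": "I002", "qty": 1, "amount": 15.0}],
    },
]

# Scratch directory, so the notebook's own data/input files are left alone.
out_dir = Path(tempfile.mkdtemp())

# Single-line layout: one compact JSON object per line.
single_path = out_dir / "order_singleline.json"
single_path.write_text("\n".join(json.dumps(o) for o in orders) + "\n")

# Multi-line layout: one pretty-printed document spanning many lines.
multi_path = out_dir / "order_multiline.json"
multi_path.write_text(json.dumps(orders, indent=2) + "\n")

single_lines = single_path.read_text().splitlines()
multi_lines = multi_path.read_text().splitlines()
print(len(single_lines), "lines in the single-line file")
print(len(multi_lines), "lines in the multi-line file")
```

In the first file every line parses on its own, which is what lets Spark split the file across tasks. In the second file no single line is a complete record, so the parser needs to see the whole document.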

In [None]:
# 1. Read Single-line JSON (Standard)
single_line_path = "data/input/order_singleline.json"

df_single = spark.read \
    .format("json") \
    .load(single_line_path)

print("--- Single Line JSON Schema ---")
df_single.printSchema()

# 2. Read Multi-line JSON
# If we read this without the option, Spark treats each line as a corrupt record.
multi_line_path = "data/input/order_multiline.json"

df_multi = spark.read \
    .format("json") \
    .option("multiline", "true") \
    .load(multi_line_path)

print("--- Multi Line JSON Data ---")
df_multi.show(truncate=False)

## 2. Enforcing Custom Schema

While Spark can infer the schema automatically, production pipelines should enforce an explicit one. This keeps column types stable from run to run and improves performance, because Spark can skip the extra pass over the data that inference requires.

In [None]:
# Defining a complex schema with Arrays and Structs manually
# Structure:
# - contact: Array of Strings (inference would yield Long; we declare String here)
# - customer_id: String
# - order_id: String
# - order_line_items: Array of Structs (Nested Data)

custom_schema = StructType([
    StructField("contact", ArrayType(StringType())),
    StructField("customer_id", StringType()),
    StructField("order_id", StringType()),
    StructField("order_line_items", ArrayType(StructType([
        StructField("amount", DoubleType()),
        StructField("item_id", StringType()),
        StructField("qty", LongType())
    ])))
])

# Reading with schema enforcement
df_schema = spark.read \
    .format("json") \
    .schema(custom_schema) \
    .load(single_line_path)

print("--- Data with Enforced Schema ---")
df_schema.printSchema()
df_schema.show()

## 3. The `from_json` Function

Sometimes, data isn't in a JSON file, but stored as a **JSON String** inside a text column (common in Kafka logs or CSVs).

To handle this:
1.  Read the data as Text (or CSV).
2.  Use `from_json()` with a schema to convert the String column into a Struct/Map.

In [None]:
# Step 1: Read the JSON file as a plain Text file (simulating a raw string column)
df_raw = spark.read.text(single_line_path)

print("--- Raw Text Data ---")
df_raw.show(truncate=False)

# Step 2: Parse the 'value' column using the schema we defined earlier
df_parsed = df_raw.withColumn("parsed_data", from_json(col("value"), custom_schema))

print("--- Parsed Data (Struct Column) ---")
df_parsed.printSchema()

# Step 3: Expand every field of the struct into top-level columns using `.*`
df_final_parsed = df_parsed.select("parsed_data.*")
df_final_parsed.show()

## 4. The `to_json` Function

This is the reverse of `from_json`. It takes a Struct or Array column and converts it back into a JSON string. This is useful when you need to write data to a downstream system (like Kafka) that expects a single payload string.

In [None]:
# Convert the structured columns back into a single JSON string
df_json_string = df_final_parsed.select(
    to_json(struct(col("*"))).alias("json_payload")
)

print("--- Converted Back to JSON String ---")
df_json_string.show(truncate=False)

## 5. Exploding Arrays

When you have an Array of items (e.g., `order_line_items`), you often want to flatten it so that each item in the array becomes its own row.

*   **`explode()`**: Creates a new row for each element in the given array.

In [None]:
# Our data has an array: order_line_items. One order has multiple items.
# We want 1 row per Item, not 1 row per Order.

# 1. Explode the array
df_exploded = df_final_parsed.withColumn("exploded_item", explode(col("order_line_items")))

# 2. Flatten the struct inside the array using Dot Notation
df_flattened = df_exploded.select(
    col("order_id"),
    col("customer_id"),
    col("exploded_item.item_id"),
    col("exploded_item.qty"),
    col("exploded_item.amount")
)

print("--- Flattened Transactional Data ---")
df_flattened.show()

## Summary

1.  **Read Options:** Use `.option("multiline", "true")` for pretty-printed JSON files whose records span multiple lines.
2.  **Schema:** Always define schemas for complex JSON to avoid inference costs and errors.
3.  **Parsing:** Use `from_json` to turn string columns into usable Structs.
4.  **Formatting:** Use `to_json` to turn Structs back into strings.
5.  **Transformation:** Use `explode` to convert Arrays into rows and `.` notation to access nested fields.

**Next Steps:**
In the next module, we will likely dive into Spark SQL or aggregations on this flattened data.