**⭐ 1. What This Pattern Solves**

Parsing JSON strings in a DataFrame column into structured columns so you can query nested data easily.

Use cases:

Reading JSON logs from Kafka or S3.

Converting API responses stored as strings into columns.

Preprocessing semi-structured data for analytics or Delta Lake pipelines.

**⭐ 2. SQL Equivalent**

In [0]:
%sql
-- Extract individual fields from a JSON string column by JSONPath (each result is a STRING)
SELECT 
  get_json_object(json_column, '$.user.id') AS user_id,
  get_json_object(json_column, '$.user.name') AS user_name
FROM logs;

**⭐ 3. Core Idea**

Use from_json with a schema to convert a JSON string column into a struct column, which can then be accessed using dot notation.

**⭐ 4. Template Code (MEMORIZE THIS)**

In [0]:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("field1", StringType()),
    StructField("field2", IntegerType())
])

df_parsed = df.withColumn("json_parsed", from_json(col("json_column"), schema))
df_parsed.select("json_parsed.*").show()

**⭐ 5. Detailed Example**

In [0]:
data = [
    ('{"name":"Alice","age":30}',),
    ('{"name":"Bob","age":25}',)
]
df = spark.createDataFrame(data, ["json_str"])
df.show(truncate=False)

In [0]:
+-------------------------+
|json_str                 |
+-------------------------+
|{"name":"Alice","age":30}|
|{"name":"Bob","age":25}  |
+-------------------------+

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import from_json, col

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df_parsed = df.withColumn("parsed", from_json(col("json_str"), schema))
df_parsed.select("parsed.*").show()


In [0]:
+-----+---+
| name|age|
+-----+---+
|Alice| 30|
|  Bob| 25|
+-----+---+

**⭐ 6. Mini Practice Problems**

Parse a JSON column with a nested struct: {"user":{"id":1,"name":"Alice"}}.

Parse a JSON array of objects and explode it into rows.

Use get_json_object to extract a single value from a JSON string column.

**⭐ 7. Full Data Engineering Problem**

You receive event logs from Kafka where each message is a JSON string with user, event_type, and metadata fields.

Task: Parse the JSON string → extract nested user info → explode metadata array → store in Delta Lake for analytics.

Pattern: from_json → explode → flatten → write to Silver table.

**⭐ 8. Time & Space Complexity**

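Time: O(n) in the number of rows; each row's parse cost grows with the length of its JSON string.

Space: The parsed struct column takes memory on top of the original string; exploding a nested array turns each row into one row per array element.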

**⭐ 9. Common Pitfalls**

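Defining a schema that does not match the JSON (wrong field names or incompatible types) → from_json returns nulls instead of raising an error.

Re-parsing the same large JSON strings in several transformations or actions → wasted work; parse once into a struct column and cache the result if it is reused.

Nested arrays/structs require a further explode or select after parsing → often overlooked.

Using get_json_object instead of from_json → every extracted value is a string, not a proper type; cast explicitly where needed.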