# PySpark: Zero to Hero
## Module 14: Reading Parquet, ORC, and Recursive File Lookup

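CSV is human-readable, but it performs poorly at scale: it is plain text with no embedded schema, and every row must be parsed in full even when a query needs a single column. In this module, we will learn how to read optimized **columnar file formats** such as Parquet and ORC.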

### Agenda:
1.  **Row vs. Columnar Storage:** Why use Parquet?
2.  **Reading Parquet Files:** `spark.read.parquet()`.
3.  **Reading ORC Files:** `spark.read.orc()`.
4.  **Recursive File Lookup:** Reading nested folders automatically.

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Complex_File_Formats") \
    .master("local[*]") \
    .getOrCreate()

print("Spark Session Active")

## Row vs. Columnar Storage

*   **Row-Oriented (CSV, Avro, RDBMS):** Stores data row by row. Good for writing and for fetching whole records; inefficient when a query needs only a few columns, because every row must still be scanned.
*   **Column-Oriented (Parquet, ORC):** Stores data column by column. 
    *   **Benefits:**
        1.  **Compression:** Better compression ratios.
        2.  **Column Pruning:** If you `select("salary")`, Spark reads only the column chunks for `salary` and skips the rest. This greatly reduces I/O for analytical queries.
        3.  **Schema Embedded:** Column names and types are stored in the file footer, so you don't need to specify a schema when reading.
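
To make the layout difference concrete, here is a toy sketch in pure Python. It is only an illustration: real Parquet and ORC files add row groups, encodings, and compression on top of this idea, and the records and field names below are made up.

```python
# Toy illustration only: NOT how Parquet or ORC encode data on disk.
# The records and field names are hypothetical.
rows = [
    {"id": 1, "name": "Asha", "salary": 5000},
    {"id": 2, "name": "Ben", "salary": 6200},
    {"id": 3, "name": "Chen", "salary": 5800},
]

# Row-oriented layout: all values of one record sit next to each other.
row_layout = [value for record in rows for value in record.values()]

# Column-oriented layout: all values of one column sit next to each other.
column_layout = {key: [record[key] for record in rows] for key in rows[0]}

# Reading only "salary" from the row layout means walking past every stored value,
# while the column layout lets us touch just that one column.
values_touched_row = len(row_layout)
salaries = column_layout["salary"]
values_touched_col = len(salaries)

print(row_layout)
print(column_layout)
print(f"Row layout: {values_touched_row} values touched; column layout: {values_touched_col}")
```

Keeping values of the same type and meaning together is also why columnar formats tend to compress well.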

In [None]:
# Reading a Parquet file.
# Notice we don't need to specify schema or header=true options.
# Parquet files contain the schema metadata internally.

parquet_path = "data/input/sales_data.parquet"

df_parquet = spark.read.parquet(parquet_path)
# OR: spark.read.format("parquet").load(parquet_path)

print("--- Parquet Data ---")
df_parquet.printSchema()
df_parquet.show(5)

In [None]:
# Reading an ORC file.
# Like Parquet, ORC is a columnar format; it is widely used in the Hive ecosystem.

orc_path = "data/input/sales_data.orc"

df_orc = spark.read.orc(orc_path)

print("--- ORC Data ---")
df_orc.printSchema()
df_orc.show(5)

In [None]:
# Spark can read a folder containing multiple part-files automatically.
# Path: "data/input/sales_total.parquet" (This is a folder, not a file)

folder_path = "data/input/sales_total.parquet"

df_total = spark.read.parquet(folder_path)

print(f"Total Records from Folder: {df_total.count()}")

In [None]:
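# Scenario: You have data nested deep inside folders like:
# /data/year=2023/month=01/day=01/file.parquet
# /data/year=2023/month=01/day=02/file.parquet

# Spark's partition discovery already reads Hive-style key=value folders like these
# by default and adds year, month, and day as columns.
# recursiveFileLookup is meant for arbitrarily nested folders that do not follow
# that naming (for example /data/2023/01/01/file.parquet).
# Turning it on disables partition inference, so year, month, and day
# will NOT appear as columns in the result below.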

recursive_path = "data/input/sales_recursive"

df_recursive = spark.read \
    .format("parquet") \
    .option("recursiveFileLookup", "true") \
    .load(recursive_path)

print(f"Recursive Read Count: {df_recursive.count()}")
df_recursive.show(5)

In [None]:
# Let's verify column pruning by inspecting the Spark execution plan.
# We select ONLY 'transacted_at', so Spark should read just that column.

df_pruned = df_parquet.select("transacted_at")

print("--- Execution Plan showing Column Pruning ---")
df_pruned.explain()

# Look for "ReadSchema" in the physical plan output.
# It should list only the selected column, e.g. struct<transacted_at:timestamp>
# (the exact type depends on the file's schema).
# This confirms that Spark did not read the other columns from disk.

## Summary

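1.  **Parquet/ORC** are better suited than CSV to analytical reads, thanks to **Column Pruning** and compression.
2.  **Schema** is stored in the file itself; no need to define it manually when reading.
3.  **`recursiveFileLookup`** reads arbitrarily nested directory structures, but it disables partition discovery, so `key=value` folder names are not turned into columns.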
4.  Spark optimizes I/O by only reading the columns you specifically select.

**Next Steps:**
In the next module, we tackle **JSON** files: handling nested structures, multiline JSON, and schema enforcement.