# PySpark: Zero to Hero
## Module 30: Data Scanning and Partitioning Strategy

In Big Data systems, I/O (Input/Output) is often the biggest bottleneck, because reading data from disk or over the network is far slower than processing it in memory.
**Data Scanning** refers to the process of reading data files. To optimize performance, we want to scan *only* the data we need.

**Partitioning** is the technique of dividing a large dataset into smaller, manageable parts (folders) based on specific columns (e.g., Date, Country).
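
To make the folder idea concrete before we touch Spark, here is a minimal pure-Python sketch of the Hive-style `column=value` layout. The helper `hive_partition_layout` and the sample rows are hypothetical and only illustrate the naming convention; this is not how Spark implements its writer.

```python
from collections import defaultdict

def hive_partition_layout(rows, partition_col, base_path):
    """Group rows by partition_col and map each group to a 'col=value' directory."""
    layout = defaultdict(list)
    for row in rows:
        directory = f"{base_path}/{partition_col}={row[partition_col]}"
        # The partition value is encoded in the path, so it is dropped from the stored row.
        stored = {k: v for k, v in row.items() if k != partition_col}
        layout[directory].append(stored)
    return dict(layout)

# Hypothetical sample rows, for illustration only
rows = [
    {"order_id": 1, "country": "USA", "amount": 100},
    {"order_id": 4, "country": "India", "amount": 200},
    {"order_id": 5, "country": "India", "amount": 250},
]

layout = hive_partition_layout(rows, "country", "data/sales_partitioned")
for directory, stored_rows in sorted(layout.items()):
    print(directory, len(stored_rows))
```

Each distinct value of the partition column becomes its own directory, which is what later allows a reader to ignore whole directories based on the path alone.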

### Agenda:
1.  **The Problem:** Unnecessary data scanning in non-partitioned datasets.
2.  **The Solution:** Using `partitionBy` to organize data physically.
3.  **Partition Pruning:** How Spark automatically skips irrelevant files.
4.  **High Cardinality:** Understanding the risks of over-partitioning (Small File Problem).

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit
import os
import shutil

# Setup Spark Session
spark = SparkSession.builder \
    .appName("Partitioning_Demo") \
    .master("local[*]") \
    .getOrCreate()

# Generate Synthetic Sales Data
# We create data for 3 distinct countries
data = [
    (1, "USA", 100, "2023-01-01"),
    (2, "USA", 150, "2023-01-02"),
    (3, "USA", 120, "2023-01-03"),
    (4, "India", 200, "2023-01-01"),
    (5, "India", 250, "2023-01-02"),
    (6, "UK", 300, "2023-01-01"),
    (7, "UK", 350, "2023-01-02"),
    (8, "UK", 400, "2023-01-03")
]

columns = ["order_id", "country", "amount", "date"]
df = spark.createDataFrame(data, columns)

print("Source Data:")
df.show()

In [None]:
# 1. Write data as standard Parquet output (flat structure)
# Spark writes a directory containing one or more part files (typically one per DataFrame partition),
# but all of them sit in the same folder; nothing is separated by column value.

output_path_flat = "data/sales_flat"
df.write.mode("overwrite").parquet(output_path_flat)

print(f"Data written to {output_path_flat}")

# Let's look at the file structure (using Python's os module)
print("\nFile Structure (Flat):")
for file in os.listdir(output_path_flat):
    if file.endswith(".parquet"):
        print(f"  - {file}")

In [None]:
# Reading the flat data
df_flat = spark.read.parquet(output_path_flat)

# Query: Get sales only for 'India'
india_sales_flat = df_flat.filter("country = 'India'")

print("Plan for Non-Partitioned Read:")
# explain() shows the physical plan. Look for 'PushedFilters'.
# PushedFilters lets the Parquet reader skip row groups using min/max statistics,
# but Spark still has to open EVERY file and read its footer to find out
# whether 'India' can occur inside it.
india_sales_flat.explain()

print(f"Count: {india_sales_flat.count()}")

In [None]:
# 2. Write data partitioned by 'country'
# This physically segregates data into folders based on the country column.

output_path_part = "data/sales_partitioned"
df.write.mode("overwrite").partitionBy("country").parquet(output_path_part)

print(f"Data written to {output_path_part}")

# Let's look at the file structure (Partitioned)
# Notice the sub-folders like 'country=India', 'country=USA'
print("\nFile Structure (Partitioned):")
for folder in os.listdir(output_path_part):
    if not folder.startswith(".") and not folder.startswith("_"):
        print(f"  - {folder}")

In [None]:
# Reading the partitioned data
# Spark discovers the 'country' column from the directory structure automatically.
df_part = spark.read.parquet(output_path_part)

# Query: Get sales only for 'India'
india_sales_part = df_part.filter("country = 'India'")

print("Plan for Partitioned Read:")
# Look for 'PartitionFilters' in the explain plan.
# Spark realizes it only needs to look into the 'country=India' folder.
# It completely SKIPS scanning folders for USA and UK.
india_sales_part.explain()

print(f"Count: {india_sales_part.count()}")

## The Risk: High Cardinality Columns

**Cardinality** refers to the number of unique values in a column.
*   **Low Cardinality:** Country, Region, Status (Active/Inactive) -> **Good for Partitioning**.
*   **High Cardinality:** Order ID, User ID, Timestamp, Transaction ID -> **Bad for Partitioning**.

**Why is High Cardinality bad?**
If you partition by `order_id` (unique for every row), Spark will create a separate folder containing at least one tiny file for *every single row*.
1.  **Small File Problem:** Reading thousands of tiny files is much slower than reading a few large ones, because every file adds its own open, footer-read, and task-scheduling overhead.
2.  **NameNode Pressure:** In Hadoop/HDFS, the NameNode keeps metadata for every file and block in memory, so millions of small files can exhaust its heap and destabilize the cluster.
3.  **Listing Overhead:** Just listing the files to plan the job takes a long time.
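
One way to guard against this is to count distinct values per candidate column before choosing a partition key. The sketch below does that in plain Python on a handful of rows; the helper names and the `MAX_PARTITIONS` threshold are assumptions made for illustration, not Spark settings. On a real DataFrame, similar counts could be obtained with `countDistinct` or `approx_count_distinct` from `pyspark.sql.functions`.

```python
def partition_cardinality(rows, candidate_cols):
    """Return the number of distinct values (= partition folders) per column."""
    return {c: len({row[c] for row in rows}) for c in candidate_cols}

def safe_partition_columns(rows, candidate_cols, max_partitions):
    """Keep only the columns that would create at most max_partitions folders."""
    counts = partition_cardinality(rows, candidate_cols)
    return [c for c in candidate_cols if counts[c] <= max_partitions]

columns = ["order_id", "country", "amount", "date"]
data = [
    (1, "USA", 100, "2023-01-01"),
    (2, "USA", 150, "2023-01-02"),
    (4, "India", 200, "2023-01-01"),
    (5, "India", 250, "2023-01-02"),
    (6, "UK", 300, "2023-01-01"),
]
rows = [dict(zip(columns, r)) for r in data]

MAX_PARTITIONS = 3  # hypothetical threshold, chosen only for this tiny example
candidates = ["order_id", "country", "date"]

counts = partition_cardinality(rows, candidates)
safe = safe_partition_columns(rows, candidates, MAX_PARTITIONS)
print(counts)
print(safe)
```

Here `order_id` has as many distinct values as there are rows, so it is rejected, while `country` and `date` stay within the limit. A sensible threshold for real data depends on dataset size and storage system.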

In [None]:
# Cleaning up the generated data
# Each directory is removed independently, so one failure does not skip the other.
for path in ["data/sales_flat", "data/sales_partitioned"]:
    try:
        shutil.rmtree(path)
        print(f"Removed {path}")
    except OSError as e:
        print(f"Cleanup failed for {path}: {e}")

## Summary

1.  **Data Scanning:** To make queries fast, minimize the amount of data scanned.
2.  **Partitioning:** Physically groups data into directories (`key=value`).
3.  **Partition Pruning:** When you filter on a partitioned column (e.g., `WHERE country='India'`), Spark skips the directories that don't match, so the less of the dataset a query needs, the larger the gain.
4.  **Best Practice:** 
    *   Partition by columns commonly used in `WHERE` clauses.
    *   Ensure columns have **Low Cardinality** (e.g., Region, or a day-level Date rather than a raw Timestamp).
    *   Avoid partitioning by unique IDs or high cardinality columns.
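
As a conceptual recap of partition pruning, the sketch below decides which directories to scan from their paths alone, without opening any file. It is a simplified model with hypothetical helper names (`parse_partition_dir`, `prune_partitions`), not Spark's actual planner logic.

```python
def parse_partition_dir(path):
    """Extract {column: value} pairs from path segments such as 'country=India'."""
    parts = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts

def prune_partitions(directories, column, wanted):
    """Keep directories whose partition value matches the filter.

    Paths that carry no value for `column` cannot be ruled out, so they are kept.
    """
    kept = []
    for directory in directories:
        values = parse_partition_dir(directory)
        if column not in values or values[column] == wanted:
            kept.append(directory)
    return kept

directories = [
    "data/sales_partitioned/country=India",
    "data/sales_partitioned/country=UK",
    "data/sales_partitioned/country=USA",
]

to_scan = prune_partitions(directories, "country", "India")
print(to_scan)
```

With the flat layout there is no `country=` segment in the path, so this model has to keep every path, which mirrors why the non-partitioned read cannot skip files up front.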

**Next Steps:**
In the next module, we will look at **Z-Ordering**, an advanced optimization technique used in Delta Lake to speed up queries that filter on high-cardinality columns, where partitioning isn't suitable.