# Projections and Column Pruning in Apache Iceberg with Spark

This notebook demonstrates how to use column projection and pruning with Apache Iceberg tables in Spark SQL.

---

## 1. Import Required Functions

We start by importing necessary functions from PySpark.

In [46]:
from pyspark.sql.functions import col, to_date
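
The cells below assume that a `spark` session already exists and that it has an Iceberg catalog named `local`. The notebook does not show how that session is created, so the following is only a sketch of one common setup. The Hadoop catalog type, the warehouse path, and the helper name `build_spark` are assumptions, and the Iceberg Spark runtime JAR is assumed to be on the classpath already.

```python
# Sketch of a Spark session with an Iceberg catalog named `local`.
# Assumptions (not taken from the notebook): Hadoop catalog type, the
# warehouse path, and an Iceberg Spark runtime JAR already on the classpath.
ICEBERG_CONF = {
    "spark.sql.extensions":
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.local": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.local.type": "hadoop",
    "spark.sql.catalog.local.warehouse": "/tmp/iceberg-warehouse",  # hypothetical path
}


def build_spark(app_name="iceberg-projection", conf=ICEBERG_CONF):
    """Create (or reuse) a SparkSession configured with the settings above."""
    from pyspark.sql import SparkSession  # imported lazily; requires pyspark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# In the notebook this would be used as: spark = build_spark()
```

If the environment already provides a configured session (for example a prebuilt Iceberg notebook image), this step can be skipped.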

## 2. Create the Iceberg Table

We drop the table if it exists, then create a new Iceberg table named `sales_projection` with a sample schema.

In [47]:
# Drop table if exists
spark.sql("DROP TABLE IF EXISTS local.db.sales_projection")

# Create table
spark.sql("""
    CREATE TABLE local.db.sales_projection (
        id INT,
        region STRING,
        product STRING,
        category STRING,
        sales DOUBLE,
        discount DOUBLE,
        quantity INT,
        sales_date DATE
    )
    USING iceberg
""")

DataFrame[]

## 3. Insert Sample Data

We create a DataFrame with sample sales data and write it to the Iceberg table.

In [48]:
data = [
    (1, 'North', 'Phone', 'Electronics', 500.0, 10.0, 1, '2024-01-01'),
    (2, 'South', 'Tablet', 'Electronics', 700.0, 20.0, 2, '2024-01-02'),
    (3, 'East', 'Laptop', 'Electronics', 1000.0, 50.0, 1, '2024-01-03'),
    (4, 'West', 'Headphones', 'Accessories', 200.0, 5.0, 3, '2024-01-04'),
]

columns = ["id", "region", "product", "category", "sales", "discount", "quantity", "sales_date"]

df = spark.createDataFrame(data, columns) \
    .withColumn("sales_date", to_date(col("sales_date"), "yyyy-MM-dd"))

df.writeTo("local.db.sales_projection").append()

## 4. Select All Columns

As a baseline, we select every column, so the scan has to read all eight of them.

In [49]:
# Select all columns
df_full = spark.sql("SELECT * FROM local.db.sales_projection")
df_full.show()

+---+------+----------+-----------+------+--------+--------+----------+
| id|region|   product|   category| sales|discount|quantity|sales_date|
+---+------+----------+-----------+------+--------+--------+----------+
|  1| North|     Phone|Electronics| 500.0|    10.0|       1|2024-01-01|
|  2| South|    Tablet|Electronics| 700.0|    20.0|       2|2024-01-02|
|  3|  East|    Laptop|Electronics|1000.0|    50.0|       1|2024-01-03|
|  4|  West|Headphones|Accessories| 200.0|     5.0|       3|2024-01-04|
+---+------+----------+-----------+------+--------+--------+----------+



## 5. Select Only Required Columns (Projection)

We select only the `region` and `sales` columns to demonstrate column projection.

In [50]:
# Select only region and sales
df_proj = spark.sql("SELECT region, sales FROM local.db.sales_projection")
df_proj.show()

+------+------+
|region| sales|
+------+------+
| North| 500.0|
| South| 700.0|
|  East|1000.0|
|  West| 200.0|
+------+------+
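
Projection pays off because Iceberg data files are columnar (Parquet by default), so a reader can fetch some columns without touching the rest. The toy model below illustrates the idea with the sample rows from this notebook. It is an illustration only, not how Iceberg or Parquet is implemented, and the names `columnar_file` and `project` are hypothetical.

```python
# Toy model of a columnar data file: each column is stored as its own list.
# Illustration only; real Parquet files add encoding, compression, row groups.
columnar_file = {
    "id": [1, 2, 3, 4],
    "region": ["North", "South", "East", "West"],
    "product": ["Phone", "Tablet", "Laptop", "Headphones"],
    "category": ["Electronics", "Electronics", "Electronics", "Accessories"],
    "sales": [500.0, 700.0, 1000.0, 200.0],
    "discount": [10.0, 20.0, 50.0, 5.0],
    "quantity": [1, 2, 1, 3],
    "sales_date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
}


def project(file, wanted):
    """Build rows from the requested columns; the other columns are never accessed."""
    selected = [file[name] for name in wanted]
    return list(zip(*selected))


rows = project(columnar_file, ["region", "sales"])
print(rows)
print(f"columns read: 2 of {len(columnar_file)}")
```

With a row-oriented layout, the same query would have to walk through every field of every row to pick out the two values it needs.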



## 6. View the Execution Plan for Projection

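We run `EXPLAIN` on the same query to inspect the physical plan. The `BatchScan` node lists only `region` and `sales`, which confirms that the other six columns are not requested from the table. Because `show()` prints the plan as a single cell, its line breaks appear as literal `\n`.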

In [51]:
# Print execution plan to confirm projection
spark.sql("EXPLAIN SELECT region, sales FROM local.db.sales_projection").show(truncate=False)

+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                                                                          |
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==\n*(1) ColumnarToRow\n+- BatchScan local.db.sales_projection[region#1656, sales#1659] local.db.sales_projection (branch=null) [filters=, groupedBy=] RuntimeFilters: []\n\n|
+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
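
Reading the plan inside a `show()` table is awkward, so it can help to pull it out as a plain string and check the scan node programmatically. The sketch below assumes the plan format shown in the output above, which can differ between Spark and Iceberg versions. The helper names `plan_text`, `scan_columns`, and `scan_filters` are hypothetical.

```python
import re

# Plan text copied from the output above; the expression IDs (#1656, #1659)
# change from run to run.
PLAN = (
    "== Physical Plan ==\n"
    "*(1) ColumnarToRow\n"
    "+- BatchScan local.db.sales_projection[region#1656, sales#1659] "
    "local.db.sales_projection (branch=null) [filters=, groupedBy=] "
    "RuntimeFilters: []\n"
)


def plan_text(spark, query):
    """Return the EXPLAIN output for `query` as a string (needs a live session)."""
    return spark.sql("EXPLAIN " + query).collect()[0]["plan"]


def scan_columns(plan):
    """Column names listed in the first BatchScan node, without expression IDs."""
    match = re.search(r"BatchScan\s+\S+?\[([^\]]*)\]", plan)
    if match is None:
        return []
    names = [part.strip() for part in match.group(1).split(",")]
    return [re.sub(r"#\d+$", "", name) for name in names if name]


def scan_filters(plan):
    """Raw text of the `filters=` entry in the first BatchScan node, or None."""
    match = re.search(r"\[filters=(.*?), groupedBy=", plan)
    return match.group(1) if match else None


print(scan_columns(PLAN))        # ['region', 'sales']
print(repr(scan_filters(PLAN)))  # '' because this query has no WHERE clause
```

In the notebook, `plan_text(spark, "SELECT region, sales FROM local.db.sales_projection")` would supply the plan string instead of the copied constant. Calling `.explain()` on the DataFrame is another way to print the plan with real line breaks.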



## 7. Column Projection with Filter Pushdown

We select `region` and `sales` for rows where `category` is `'Electronics'`. The query combines projection with a filter on a column that is not part of the result, and Spark can push that filter down to the Iceberg scan.

In [52]:
# Column projection + filter pushdown
spark.sql("""
    SELECT region, sales
    FROM local.db.sales_projection
    WHERE category = 'Electronics'
""").show()

+------+------+
|region| sales|
+------+------+
| North| 500.0|
| South| 700.0|
|  East|1000.0|
+------+------+
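
One reason filter pushdown helps is that Iceberg can record lower and upper bounds per column for each data file, so a scan can skip files whose range cannot match the predicate. The toy model below sketches that check. The file names and the two-file layout are hypothetical, since the four sample rows in this notebook would normally sit in a single data file.

```python
# Toy model of file skipping with per-file column bounds.
# File names and layout are hypothetical, for illustration only.
data_files = [
    {"path": "file-a.parquet", "category_min": "Accessories", "category_max": "Accessories"},
    {"path": "file-b.parquet", "category_min": "Electronics", "category_max": "Electronics"},
]


def may_contain(file, value):
    """True if `value` falls inside the file's [min, max] range for `category`."""
    return file["category_min"] <= value <= file["category_max"]


# Only files that might hold matching rows need to be opened.
to_scan = [f["path"] for f in data_files if may_contain(f, "Electronics")]
print(to_scan)
```

Bounds can only rule files out. Rows in the remaining files still have to be checked against the predicate.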



## 8. Execution Plan: Projection + Filter Pushdown

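We run `EXPLAIN` again and check two things in the `BatchScan` node: the column list, which shows column pruning, and the `filters=` entry, which should now list the pushed-down predicate. That entry was empty in the plan for the unfiltered query.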

In [53]:
spark.sql("""
    EXPLAIN SELECT region, sales
    FROM local.db.sales_projection
    WHERE category = 'Electronics'
""").show(truncate=False)

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|plan                                                                                                                                                                                                                                                                                                                                                                                        |
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------