# Copy-on-Write (COW) vs Merge-on-Read (MOR) in Apache Iceberg Tables

In [None]:
!pip install findspark
!pip install pyspark==3.5 # change the PySpark version here if you want to run on any other version

In [1]:
import findspark
findspark.init()
findspark.find()

'/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/venv/lib/python3.8/site-packages/pyspark'

In [2]:
from pyspark.sql import SparkSession

# The spark-avro jar is included so that the Avro manifest list and manifest files can be inspected.
# If your PySpark version is not 3.5, change the Iceberg jar to match: iceberg-spark-runtime-<pyspark_version>_2.12:1.4.2
# Update the warehouse path to the local directory where you want to create the Iceberg tables.
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("iceberg-poc") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.2,org.apache.spark:spark-avro_2.12:3.5.0')\
    .config('spark.sql.extensions','org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions') \
    .config('spark.sql.catalog.local','org.apache.iceberg.spark.SparkCatalog') \
    .config('spark.sql.catalog.local.type','hadoop') \
    .config('spark.sql.catalog.local.warehouse','/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse') \
    .getOrCreate()

24/01/24 16:22:53 WARN Utils: Your hostname, Akashdeeps-MacBook-Air.local resolves to a loopback address: 127.0.0.1; using 192.168.1.2 instead (on interface en0)
24/01/24 16:22:53 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/akashdeepgupta/.ivy2/cache
The jars for the packages stored in: /Users/akashdeepgupta/.ivy2/jars
org.apache.iceberg#iceberg-spark-runtime-3.5_2.12 added as a dependency
org.apache.spark#spark-avro_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-aa21dd6b-800e-4606-a927-ebad06fab1cc;1.0
	confs: [default]


:: loading settings :: url = jar:file:/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/venv/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.4.2 in central
	found org.apache.spark#spark-avro_2.12;3.5.0 in central
	found org.tukaani#xz;1.9 in central
:: resolution report :: resolve 153ms :: artifacts dl 4ms
	:: modules in use:
	org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.4.2 from central in [default]
	org.apache.spark#spark-avro_2.12;3.5.0 from central in [default]
	org.tukaani#xz;1.9 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   3   |   0   |   0   |   0   ||   3   |   0   |
	---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent-aa21dd6b-800e-4606-a927-ebad06fab1cc
	confs: [default]
	0 artifacts copied, 3 a
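
The comment in the cell above asks the reader to change the jar coordinates by hand when the Spark version differs. A minimal sketch of deriving the `spark.jars.packages` value from version numbers is shown below. The helper name `iceberg_spark_packages` and its parameters are hypothetical, and the sketch assumes that a matching Iceberg runtime artifact has actually been published for the chosen Spark version, which is not true of every combination.

```python
def iceberg_spark_packages(spark_version="3.5.0", scala_version="2.12", iceberg_version="1.4.2"):
    """Build a spark.jars.packages value (hypothetical helper, not part of Spark or Iceberg)."""
    # The Iceberg runtime artifact name uses only the Spark major.minor version, e.g. "3.5".
    spark_minor = ".".join(spark_version.split(".")[:2])
    iceberg = (
        f"org.apache.iceberg:iceberg-spark-runtime-{spark_minor}_{scala_version}:{iceberg_version}"
    )
    # The spark-avro artifact is versioned with the full Spark version, e.g. "3.5.0".
    avro = f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"
    return ",".join([iceberg, avro])


packages = iceberg_spark_packages()
print(packages)
```

With the default arguments this reproduces the coordinates used in the session builder above, so the result could be passed to `.config('spark.jars.packages', packages)`.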

## Reading NYC Taxi Trips [data](https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page) for creating Iceberg Tables

In [5]:
from pyspark.sql.functions import lit, col
# Reading NYC Yellow Taxi trip data for Sep 2023 and Oct 2023
yellow_sep_df = spark.read.parquet("../../../nyc-taxi-trips/yellow/sep-2023/")
yellow_oct_df = spark.read.parquet("../../../nyc-taxi-trips/yellow/oct-2023/")
# Adding month and year columns (used later as partition columns)
yellow_sep_df = yellow_sep_df.withColumn("month", lit(9)) \
        .withColumn("year", lit(2023))
yellow_oct_df = yellow_oct_df.withColumn("month", lit(10)) \
        .withColumn("year", lit(2023))
yellow_df = yellow_sep_df.unionByName(yellow_oct_df)
yellow_df.groupBy("VendorID","month").count().show()

                                                                                

+--------+-----+-------+
|VendorID|month|  count|
+--------+-----+-------+
|       1|    9| 731968|
|       2|    9|2113902|
|       6|    9|    852|
|       2|   10|2617320|
|       1|   10| 904463|
|       6|   10|    502|
+--------+-----+-------+



## Creating 2 Iceberg tables with the same data
- The two tables will differ only in the table properties that select COW or MOR behaviour.

In [7]:
yellow_df.writeTo("local.nyc_tlc.yellow_taxi_trips_cow").partitionedBy("year", "month") \
    .using("iceberg") \
    .tableProperty("format-version", "2")\
    .create()

                                                                                

In [8]:
yellow_df.writeTo("local.nyc_tlc.yellow_taxi_trips_mor").partitionedBy("year", "month") \
    .using("iceberg") \
    .tableProperty("format-version", "2")\
    .create()

                                                                                

In [9]:
cow_table = "local.nyc_tlc.yellow_taxi_trips_cow"
mor_table = "local.nyc_tlc.yellow_taxi_trips_mor"

### Setting up table properties for COW and MOR

In [10]:
spark.sql(f"""ALTER TABLE {cow_table} SET TBLPROPERTIES (
 'write.delete.mode'='copy-on-write',
 'write.update.mode'='copy-on-write',
 'write.merge.mode'='copy-on-write'
)""")

24/01/24 16:25:56 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up


DataFrame[]

In [11]:
spark.sql(f"""ALTER TABLE {mor_table} SET TBLPROPERTIES (
 'write.delete.mode'='merge-on-read',
 'write.update.mode'='merge-on-read',
 'write.merge.mode'='merge-on-read'
)""")

24/01/24 16:25:57 WARN BaseTransaction: Failed to load metadata for a committed snapshot, skipping clean-up


DataFrame[]

## Performing Delete Operation

Row-level deletes and updates can be configured in one of two modes:
- `copy-on-write`: rewrites every data file that contains a record affected by the delete/update operation.
- `merge-on-read`: does NOT rewrite the data files. Instead:
    - For a DELETE operation, only a Delete File is written. It records the `file_path` of the data file holding each deleted record and the exact position of that record within the file.
    - For an UPDATE operation:
        - A Delete File is written that records the path of the data file holding each updated record, along with the record's exact position in that file.
        - A new Data File is written that contains only the updated records with their new values.
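
The difference between the two modes can be sketched with a toy in-memory model. This is only an illustration of the idea, not how Iceberg is implemented: a "data file" is a plain Python list, and the file names, the `rewritten-` prefix, and both helper functions are hypothetical.

```python
# Toy table: file name -> list of rows. All names and rows are made up for illustration.
table = {
    "data-1.parquet": [{"VendorID": 1}, {"VendorID": 6}, {"VendorID": 2}],
    "data-2.parquet": [{"VendorID": 2}, {"VendorID": 1}],
}


def cow_delete(files, predicate):
    """Copy-on-write: every file holding a matching row is replaced by a rewritten copy."""
    new_files = {}
    for name, rows in files.items():
        if any(predicate(row) for row in rows):
            # The affected file is rewritten without the deleted rows.
            new_files["rewritten-" + name] = [row for row in rows if not predicate(row)]
        else:
            # Files without matching rows are carried over untouched.
            new_files[name] = rows
    return new_files


def mor_delete(files, predicate):
    """Merge-on-read: data files stay as they are; only (file_path, pos) pairs are recorded."""
    return [
        (name, pos)
        for name, rows in files.items()
        for pos, row in enumerate(rows)
        if predicate(row)
    ]


def is_vendor_6(row):
    return row["VendorID"] == 6


cow_result = cow_delete(table, is_vendor_6)
mor_deletes = mor_delete(table, is_vendor_6)
print(sorted(cow_result))
print(mor_deletes)
```

In the copy-on-write result the affected file appears under a new name with one row fewer, while the merge-on-read result is just a list of positions and the original files are unchanged.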

#### On COW table

In [12]:
# Delete from the COW table
spark.sql(f"DELETE FROM {cow_table} WHERE VendorID=6")

                                                                                

DataFrame[]

#### Analyzing COW metadata tables after delete operation

In [15]:
spark.sql(f"select * from {cow_table}.snapshots").orderBy(col("committed_at").desc()).show(truncate=False)

+-----------------------+-------------------+-------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at           |snapshot_id        |parent_id          |operation|manifest_list                                                                                                                                                                         |summary                                                                                  

In [14]:
# Check the manifests referenced by the table's latest snapshot
spark.sql(f"select * from {cow_table}.manifests").drop("partition_summaries").show(truncate=False)

+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+
|content|path                                                                                                                                                          |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+--

Things to look for in the output above:
- `added_data_files_count`, `deleted_data_files_count` and `added_delete_files_count`.
- These counts show that data files were rewritten when the DELETE operation ran on the table configured with `COW` properties.

In [16]:
# Check the data files that are present in the table and currently used by the latest snapshot
spark.sql(f"select content, file_path, file_format, partition,record_count from {cow_table}.files").show(truncate=False)

+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|content|file_path                                                                                                                                                                                   |file_format|partition |record_count|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|0      |/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_cow/data/year=2023/month=9/00000-36-5093cb84-2aaf-4f24-851f-f2a3fd67ffc4-00001.parquet |PARQUET    |{2023, 9} |2845870     |
|0      |/Users/akashdeepgupta/Documents/project-repos/pyspa

#### On MOR table

In [17]:
# Delete from the MOR table
spark.sql(f"DELETE FROM {mor_table} WHERE VendorID=6")

DataFrame[]

#### Analyzing MOR metadata tables after delete operation

In [18]:
# Get the latest_snapshot_id
spark.sql(f"select * from {mor_table}.snapshots").orderBy(col("committed_at").desc()).show(truncate=False)

+-----------------------+-------------------+-------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at           |snapshot_id        |parent_id          |operation|manifest_list                                                                                                                                                                         |summary                                                                                                                              

In [20]:
# Check the manifests added by the latest snapshot => no data files are written, only delete files,
# i.e. added_data_files_count=0 and added_delete_files_count=2
latest_snapshot_id = 1607848476326965355
spark.sql(f"select * from {mor_table}.manifests where added_snapshot_id={latest_snapshot_id}").drop("partition_summaries").show(truncate=False)

+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+
|content|path                                                                                                                                                          |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+--

In the `manifests` metadata table, the `content` column can have 2 possible values:
- `0`: the manifest file tracks Data Files.
- `1`: the manifest file tracks Delete Files.

In the output above, `added_delete_files_count` shows that 2 delete files were added as part of the DELETE operation, while `added_data_files_count` and `deleted_data_files_count` are both `0`, which means no data files were rewritten in this case.
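
The same reading of the counts can be written down as a small check. The sketch below is a hypothetical helper, not an Iceberg API: it takes plain dicts carrying the count columns of the `manifests` metadata table and reports whether a commit looks like a rewrite or a delete-file-only write. The merge-on-read sample uses the counts described above; the copy-on-write sample values are assumed for illustration.

```python
def describe_row_level_write(manifests):
    """Classify a commit from the count columns of the manifests it added (hypothetical helper)."""
    added_data = sum(m["added_data_files_count"] for m in manifests)
    deleted_data = sum(m["deleted_data_files_count"] for m in manifests)
    added_deletes = sum(m["added_delete_files_count"] for m in manifests)
    if deleted_data > 0:
        # Existing data files were dropped from the table, as in a copy-on-write rewrite.
        return "data files removed or rewritten"
    if added_deletes > 0:
        # Only delete files (plus possibly new data files) were added, as in merge-on-read.
        return "delete files added, no data files rewritten"
    if added_data > 0:
        return "data files added only"
    return "no file changes"


# Counts for the MOR DELETE as described in the text above.
mor_delete_manifests = [
    {"added_data_files_count": 0, "deleted_data_files_count": 0, "added_delete_files_count": 2},
]
# Assumed counts for a COW DELETE that rewrote two data files (illustrative values only).
cow_delete_manifests = [
    {"added_data_files_count": 2, "deleted_data_files_count": 2, "added_delete_files_count": 0},
]

print(describe_row_level_write(mor_delete_manifests))
print(describe_row_level_write(cow_delete_manifests))
```

With a live session, the input could be built from the metadata table, for example with `[row.asDict() for row in spark.sql(...).collect()]`.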

In [22]:
# The files table lists the data files and delete files referenced by the current snapshot
spark.sql(f"select content, file_path, file_format, partition,record_count from {mor_table}.files").show(truncate=False)

+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|content|file_path                                                                                                                                                                                           |file_format|partition |record_count|
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|0      |/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_mor/data/year=2023/month=9/00000-29-2d94390c-70a5-481f-9b11-203c6b6026a4-00001.parquet         |PARQUET    |{2023, 9} |2846722     |
|0      |/Users/akashdeepgup

In the `files` metadata table, the `content` column can have 3 possible values:
- `0`: represents a Data File.
- `1`: represents a Positional Delete File.
- `2`: represents an Equality Delete File.

In the output above, the rows with `content` = `1` show that 2 Positional Delete Files were added.
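
For quick inspection, the `content` codes can be translated into labels and tallied. The sketch below works on plain dicts; the mapping follows the three values listed above, while the helper name and the sample rows are assumptions made up for illustration (they are not the actual rows of the truncated output).

```python
from collections import Counter

# Labels for the `content` column of the `files` metadata table.
FILE_CONTENT = {0: "data", 1: "position deletes", 2: "equality deletes"}


def count_files_by_content(rows):
    """Tally file entries by content type (hypothetical helper)."""
    return dict(Counter(FILE_CONTENT[row["content"]] for row in rows))


# Assumed sample: two data files and two positional delete files.
sample_rows = [{"content": 0}, {"content": 0}, {"content": 1}, {"content": 1}]
summary = count_files_by_content(sample_rows)
print(summary)
```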

In [23]:
# Check the contents of one of the delete files
delete_file_path = "/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_mor/data/year=2023/month=9/00000-44-e4cc2c1d-c517-4b89-95f2-1597396e193d-00001-deletes.parquet"
spark.read.parquet(delete_file_path).show(10, truncate=False)

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|file_path                                                                                                                                                                                  |pos    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------+
|/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_mor/data/year=2023/month=9/00000-29-2d94390c-70a5-481f-9b11-203c6b6026a4-00001.parquet|2706579|
|/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_mor/data/year=2023/month=9/00000-29-2d94390c-70a5-481f-9b11-203c6b6026a4-00001.parquet|2707160|
|/Users/ak

Positional Delete Files store the `file_path` of the Data File from which a record was deleted, along with the exact position (`pos`) of the deleted record within that data file.
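
At read time, a merge-on-read reader combines each data file with the matching delete records. A minimal sketch of that merge step is shown below, assuming 0-based positions and using made-up file names and rows; the function `read_with_position_deletes` is hypothetical and only illustrates the idea.

```python
def read_with_position_deletes(file_path, rows, delete_records):
    """Return the rows of one data file, skipping positions listed in the delete records."""
    # Keep only the delete entries that target this particular data file.
    deleted_positions = {pos for path, pos in delete_records if path == file_path}
    return [row for pos, row in enumerate(rows) if pos not in deleted_positions]


# Made-up data file contents and delete records in the (file_path, pos) shape shown above.
data_rows = ["trip-a", "trip-b", "trip-c", "trip-d"]
delete_records = [
    ("data-1.parquet", 1),
    ("data-1.parquet", 3),
    ("data-2.parquet", 0),  # targets a different file, so it is ignored here
]

live_rows = read_with_position_deletes("data-1.parquet", data_rows, delete_records)
print(live_rows)
```

The data file itself is never modified; the deleted rows are simply filtered out each time the file is read, which is why merge-on-read makes writes cheaper and reads more expensive.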

## Performing update operation on MOR table

In [24]:
# Performing an update operation
spark.sql(f"update {mor_table} set fare_amount = 0 where VendorID=2 and fare_amount < 0")

24/01/24 17:10:09 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

DataFrame[]

In [25]:
spark.sql(f"select * from {mor_table}.history").orderBy(col("made_current_at").desc()).show(truncate=False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2024-01-24 17:10:14.418|3121591312207175607|1607848476326965355|true               |
|2024-01-24 16:39:14.878|1607848476326965355|7732589754235441173|true               |
|2024-01-24 16:25:29.092|7732589754235441173|NULL               |true               |
+-----------------------+-------------------+-------------------+-------------------+
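
The snapshot ids in this notebook are copied by hand from outputs like the one above. A small sketch of picking the latest one programmatically is shown below. The helper `latest_snapshot_id` is hypothetical; it only needs rows that can be indexed by column name, so it is demonstrated here on plain dicts built from the `history` output above.

```python
from datetime import datetime


def latest_snapshot_id(history_rows):
    """Return the snapshot_id of the row with the most recent made_current_at (hypothetical helper)."""
    newest = max(history_rows, key=lambda row: row["made_current_at"])
    return newest["snapshot_id"]


# Rows transcribed from the history output above.
history = [
    {"made_current_at": datetime(2024, 1, 24, 17, 10, 14, 418000), "snapshot_id": 3121591312207175607},
    {"made_current_at": datetime(2024, 1, 24, 16, 39, 14, 878000), "snapshot_id": 1607848476326965355},
    {"made_current_at": datetime(2024, 1, 24, 16, 25, 29, 92000), "snapshot_id": 7732589754235441173},
]

latest = latest_snapshot_id(history)
print(latest)
```

With a live session, the rows returned by `spark.sql(f"select * from {mor_table}.history").collect()` should also work as input, since PySpark `Row` objects support lookup by column name.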



In [26]:
spark.sql(f"select * from {mor_table}.snapshots").orderBy(col("committed_at").desc()).show(truncate=False)

+-----------------------+-------------------+-------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at           |snapshot_id        |parent_id          |operation|manifest_list                                                                                                                                                                         |summary                                                                         

In [27]:
latest_snapshot_id = 3121591312207175607
spark.sql(f"select * from {mor_table}.manifests").drop("partition_summaries").filter(col("added_snapshot_id") == latest_snapshot_id).show(truncate=False)

+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+
|content|path                                                                                                                                                          |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|
+-------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+--

Looking at the `manifests` metadata table output above for the current snapshot after the UPDATE operation:
- In the `content` = `0` row, `added_data_files_count` = `2`, so 2 Data Files have been added. These are the Data Files that contain the records with the updated values.
- In the `content` = `1` row, `added_delete_files_count` = `2`, so 2 Delete Files have been added. These Delete Files contain the `file_path` of the affected data files and the positions of the replaced records within them.

This shows that no data files were rewritten, since `deleted_data_files_count` is `0`.

In [5]:
# Count the records affected by the update operation
yellow_df.filter((col("fare_amount") < 0) & (col("VendorID") == 2)).groupBy("VendorID", "month").count().show()

+--------+-----+-----+
|VendorID|month|count|
+--------+-----+-----+
|       2|    9|29562|
|       2|   10|37099|
+--------+-----+-----+



In [40]:
spark.sql(f"select content, file_path, file_format, partition,record_count from {mor_table}.files").show(truncate=False)

+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|content|file_path                                                                                                                                                                                           |file_format|partition |record_count|
+-------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+----------+------------+
|0      |/Users/akashdeepgupta/Documents/project-repos/pyspark-playground/warehouse/nyc_tlc/yellow_taxi_trips_mor/data/year=2023/month=9/00000-86-ec64a270-e3b3-47a9-887f-f848abe94fd9-00001.parquet         |PARQUET    |{2023, 9} |29562       |
|0      |/Users/akashdeepgup

- Row#1 and Row#2, with `content` = `0`, show that the new Data Files contain the same number of records as the affected-record counts in the previous output.
- Row#5 and Row#6, with `content` = `1`, show that the Delete Files contain the same number of records.