## Introduction to Iceberg Architecture

In [1]:
!pip3 install -r requirements.txt

Collecting cmlbootstrap
  Cloning https://github.com/fastforwardlabs/cmlbootstrap to /tmp/pip-install-tb34dxdk/cmlbootstrap_c4102002e2d549098c3141d566e609e9
  Running command git clone -q https://github.com/fastforwardlabs/cmlbootstrap /tmp/pip-install-tb34dxdk/cmlbootstrap_c4102002e2d549098c3141d566e609e9


#### Launching a Spark Session with Iceberg

In [2]:
import cml.data_v1 as cmldata

CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()

# Sample usage to run query through spark
EXAMPLE_SQL_QUERY = "show databases"
spark.sql(EXAMPLE_SQL_QUERY).show()

23/09/04 03:58:03 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.
23/09/04 03:58:03 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.
Setting spark.hadoop.yarn.resourcemanager.principal to pauldefusco
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/09/04 03:58:03 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.
23/09/04 

+--------------------+
|           namespace|
+--------------------+
|         01_car_data|
|           01_car_dw|
|              adb101|
|            airlines|
|        airlines_csv|
|    airlines_iceberg|
|airlines_iceberg_...|
|      airlines_mjain|
|          airquality|
|                ajvp|
|          atlas_demo|
|            bankdemo|
|          bca_jps_l0|
|        cde_workshop|
|             cdedemo|
|        cdp_overview|
|      ceht_open_data|
|        ceht_scratch|
| ceht_transportation|
|        cgsifacebook|
+--------------------+
only showing top 20 rows



In [3]:
spark.sparkContext.getConf().getAll()

23/09/04 03:58:25 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.
23/09/04 03:58:25 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.


[('spark.eventLog.enabled', 'true'),
 ('spark.network.crypto.enabled', 'true'),
 ('spark.sql.hive.hwc.execution.mode', 'spark'),
 ('spark.jars',
  '/opt/spark/optional-lib/hive-warehouse-connector-assembly.jar,/opt/spark/optional-lib/iceberg-hive-runtime.jar,/opt/spark/optional-lib/iceberg-spark-runtime.jar'),
 ('spark.app.startTime', '1693799883971'),
 ('spark.kerberos.renewal.credentials', 'ccache'),
 ('spark.sql.catalog.spark_catalog',
  'org.apache.iceberg.spark.SparkSessionCatalog'),
 ('spark.dynamicAllocation.maxExecutors', '49'),
 ('spark.app.id', 'spark-application-1693799885897'),
 ('spark.eventLog.dir', 'file:///sparkeventlogs'),
 ('spark.hadoop.yarn.resourcemanager.principal', 'pauldefusco'),
 ('spark.kubernetes.driver.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict',
  'false'),
 ('spark.ui.port', '20049'),
 ('spark.yarn.access.hadoopFileSystems',
  's3a://go01-demo/warehouse/tablespace/external/hive'),
 ('spark.sql.extensions',
  'com.qubole.spark.hiveacid.HiveAc

### Iceberg Architecture

![alt text](../img/iceberg-metadata.png)

#### Iceberg Catalog

Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. Catalogs are configured using properties under spark.sql.catalog.(catalog_name).

In [4]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|  default|
+-------------+---------+



In [5]:
# Create a new database
#spark.sql("DROP DATABASE IF EXISTS spark_catalog.lakehouse")
spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.lakehouse")
spark.sql("USE spark_catalog.lakehouse")

DataFrame[]

In [6]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|lakehouse|
+-------------+---------+



#### Create an Iceberg Table with Spark SQL

In [7]:
spark.sql("DROP TABLE IF EXISTS lakehouse.coffees_table_3 PURGE")

                                                                                

DataFrame[]

# TEST 3 - COW

In [8]:
spark.sql("CREATE TABLE IF NOT EXISTS coffees_table_3 (coffee_id BIGINT, coffee_size STRING, coffee_sale_ts TIMESTAMP)\
          USING ICEBERG\
          PARTITIONED BY (months(coffee_sale_ts))\
          TBLPROPERTIES ('write.delete.mode'='merge-on-read',\
                          'write.update.mode'='merge-on-read',\
                          'write.merge.mode'='merge-on-read',\
                          'format-version' = '2')")

DataFrame[]

#### Verify that a Metadata JSON file has been created under the Metadata directory

In [9]:
metadata_path = "warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3"

In [10]:
import boto3

s3 = boto3.resource('s3')

my_bucket = s3.Bucket("go01-demo")

for object_summary in my_bucket.objects.filter(Prefix=metadata_path):
    #print(object_summary.key)
    metadata_file = object_summary.key
    
print("Metadata File Path: {}".format(metadata_file))

Metadata File Path: warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00000-60da2bb9-7e09-4618-af3f-68ab7e8df9fe.metadata.json


In [11]:
import pandas as pd
spark.read.option("multiline","true").json("s3a://go01-demo/" + "warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata").toPandas()

                                                                                

Unnamed: 0,current-schema-id,current-snapshot-id,default-sort-order-id,default-spec-id,format-version,last-column-id,last-partition-id,last-sequence-number,last-updated-ms,location,metadata-log,partition-specs,properties,schemas,snapshot-log,snapshots,sort-orders,statistics,table-uuid
0,0,-1,0,0,2,3,1000,0,1693799933483,s3a://go01-demo/warehouse/tablespace/external/...,[],"[([Row(field-id=1000, name='coffee_sale_ts_mon...","(pauldefusco, merge-on-read, merge-on-read, me...","[([Row(id=1, name='coffee_id', required=False,...",[],[],"[([], 0)]",[],0d731042-eee8-4223-aa6a-404e6248d543


![alt text](../img/s3_metadata.png)

#### Notice that no snapshots or other files have been created as data has not yet been inserted.

In [12]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.history").show()

+---------------+-----------+---------+-------------------+
|made_current_at|snapshot_id|parent_id|is_current_ancestor|
+---------------+-----------+---------+-------------------+
+---------------+-----------+---------+-------------------+



In [13]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.snapshots;").show()

+------------+-----------+---------+---------+-------------+-------+
|committed_at|snapshot_id|parent_id|operation|manifest_list|summary|
+------------+-----------+---------+---------+-------------+-------+
+------------+-----------+---------+---------+-------------+-------+



In [14]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.files;").show()

+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
|content|file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+



In [15]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.manifests;").show()

+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+



In [16]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.all_data_files;").show()

+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
|content|file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+



In [17]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.all_manifests;").show()

+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+-------

### Table Insert

In [18]:
from pyspark.sql.functions import date_format

In [19]:
#Row: coffee_id = 1, coffee_size = venti, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 2, coffee_size = grande, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 3, coffee_size = tall, coffee_sale_ts = 2023-04-01

spark.sql("INSERT INTO lakehouse.coffees_table_3 VALUES (1, 'venti', cast(date_format('2023-07-01 10:00:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
            (2, 'grande', cast(date_format('2023-07-01 10:00:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
            (3, 'tall', cast(date_format('2023-04-01 10:00:00', 'yyyy-MM-dd HH:mm:ss') as timestamp))")

                                                                                

DataFrame[]

#### Data has been added to the data folder

In [20]:
QUERY = "select h.made_current_at,\
            s.operation,\
            h.snapshot_id,\
            h.is_current_ancestor,\
            s.summary['spark.app.id']\
        from lakehouse.coffees_table.history h\
        join lakehouse.coffees_table.snapshots s\
            on h.snapshot_id = s.snapshot_id\
            order by made_current_at;"

In [21]:
spark.sql(QUERY).toPandas()

23/09/04 03:59:14 WARN SparkConf: The configuration key 'spark.yarn.access.hadoopFileSystems' has been deprecated as of Spark 3.0 and may be removed in the future. Please use the new key 'spark.kerberos.access.hadoopFileSystems' instead.


Unnamed: 0,made_current_at,operation,snapshot_id,is_current_ancestor,summary[spark.app.id]
0,2023-09-04 02:47:21.766,append,6464031631726591033,True,spark-application-1693795583203
1,2023-09-04 02:47:43.813,overwrite,5876718107486774833,True,spark-application-1693795583203


#### Notice there are now two json files and two avro files. 

The first json file is the metadata file created when the table was created. This is the metata file prefixed by 00000. The second json file is the new metadata file reflecting the insert of one row. This is the metadata file prefixed by 00001.

The avro file with the "snap" prefix is the manifest list. The other avro file created is the corresponding manifest file.

In [22]:
s3 = boto3.resource('s3')
my_bucket = s3.Bucket("go01-demo")

metadata_file_list = []

print("Current Metadata Files: \n")
for object_summary in my_bucket.objects.filter(Prefix=metadata_path+"/metadata"):
    print(object_summary.key +"\n")
    metadata_file_list.append(object_summary.key)

Current Metadata Files: 

warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00000-60da2bb9-7e09-4618-af3f-68ab7e8df9fe.metadata.json

warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00001-eeec84c8-a650-4033-83ef-9ef2cd6c81b7.metadata.json

warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro

warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-4815467340052359539-1-ebfddaac-adcc-4d9b-91b6-6b41c31286c5.avro



Showing Metadata Files (JSON)

In [23]:
import pandas as pd

print("Showing " + metadata_file_list[0])
spark.read.option("multiline","true").json("s3a://go01-demo/" + metadata_file_list[0]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00000-60da2bb9-7e09-4618-af3f-68ab7e8df9fe.metadata.json


                                                                                

Unnamed: 0,current-schema-id,current-snapshot-id,default-sort-order-id,default-spec-id,format-version,last-column-id,last-partition-id,last-sequence-number,last-updated-ms,location,metadata-log,partition-specs,properties,schemas,snapshot-log,snapshots,sort-orders,statistics,table-uuid
0,0,-1,0,0,2,3,1000,0,1693799933483,s3a://go01-demo/warehouse/tablespace/external/...,[],"[([Row(field-id=1000, name='coffee_sale_ts_mon...","(pauldefusco, merge-on-read, merge-on-read, me...","[([Row(id=1, name='coffee_id', required=False,...",[],[],"[([], 0)]",[],0d731042-eee8-4223-aa6a-404e6248d543


In [24]:
print("Showing " + metadata_file_list[1])
spark.read.option("multiline","true").json("s3a://go01-demo/" + metadata_file_list[1]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00001-eeec84c8-a650-4033-83ef-9ef2cd6c81b7.metadata.json


                                                                                

Unnamed: 0,current-schema-id,current-snapshot-id,default-sort-order-id,default-spec-id,format-version,last-column-id,last-partition-id,last-sequence-number,last-updated-ms,location,metadata-log,partition-specs,properties,refs,schemas,snapshot-log,snapshots,sort-orders,statistics,table-uuid
0,0,4815467340052359539,0,0,2,3,1000,1,1693799952760,s3a://go01-demo/warehouse/tablespace/external/...,[(s3a://go01-demo/warehouse/tablespace/externa...,"[([Row(field-id=1000, name='coffee_sale_ts_mon...","(pauldefusco, merge-on-read, merge-on-read, me...","((4815467340052359539, branch),)","[([Row(id=1, name='coffee_id', required=False,...","[(4815467340052359539, 1693799952760)]",[(s3a://go01-demo/warehouse/tablespace/externa...,"[([], 0)]",[],0d731042-eee8-4223-aa6a-404e6248d543


Showing Manifest List (AVRO - prefixed by "SNAP")

In [25]:
print("Showing " + metadata_file_list[3])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-4815467340052359539-1-ebfddaac-adcc-4d9b-91b6-6b41c31286c5.avro


                                                                                

Unnamed: 0,manifest_path,manifest_length,partition_spec_id,content,sequence_number,min_sequence_number,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_rows_count,existing_rows_count,deleted_rows_count,partitions
0,s3a://go01-demo/warehouse/tablespace/external/...,7163,0,0,1,1,4815467340052359539,2,0,0,3,0,0,"[(False, False, [127, 2, 0, 0], [130, 2, 0, 0])]"


Showing Manifest Files (Avro) i.e. which shows data file locations according to each snapshot_id

In [26]:
print("Showing " + metadata_file_list[2])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[2]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro


Unnamed: 0,status,snapshot_id,sequence_number,file_sequence_number,data_file
0,1,4815467340052359539,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
1,1,4815467340052359539,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."


### Table Merge Into

Create a staging table

In [27]:
spark.sql("DROP TABLE IF EXISTS lakehouse.coffee_staging_3 PURGE")

                                                                                

DataFrame[]

In [28]:
spark.sql("CREATE TABLE IF NOT EXISTS lakehouse.coffee_staging_3\
            (coffee_id BIGINT, coffee_size STRING, coffee_sale_ts TIMESTAMP)\
            USING iceberg\
            PARTITIONED BY (months(coffee_sale_ts))")

DataFrame[]

In [29]:
spark.sql("INSERT INTO lakehouse.coffee_staging_3\
    VALUES (2, 'tall', cast(date_format('2023-08-01 11:10:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (3, 'venti', cast(date_format('2023-04-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (4, 'venti', cast(date_format('2023-07-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (5, 'grande', cast(date_format('2023-07-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (6, 'grande', cast(date_format('2023-07-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (7, 'venti', cast(date_format('2023-05-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (8, 'grande', cast(date_format('2023-04-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (9, 'tall', cast(date_format('2023-05-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp)),\
    (10, 'tall', cast(date_format('2023-05-01 12:01:00', 'yyyy-MM-dd HH:mm:ss') as timestamp))")

#Row: coffee_id = 2, coffee_size = tall, coffee_sale_ts = 2023-08-01
#Row: coffee_id = 3, coffee_size = venti, coffee_sale_ts = 2023-04-01
#Row: coffee_id = 4, coffee_size = venti, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 5, coffee_size = grande, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 7, coffee_size = grande, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 8, coffee_size = venti, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 9, coffee_size = grande, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 10, coffee_size = venti, coffee_sale_ts = 2023-07-01
#Row: coffee_id = 11, coffee_size = grande, coffee_sale_ts = 2023-07-01


                                                                                

DataFrame[]

Merge Into Customers Table

In [30]:
spark.sql("MERGE INTO lakehouse.coffees_table_3 c\
            USING (SELECT * FROM lakehouse.coffee_staging_3) s\
            ON c.coffee_id = s.coffee_id \
            WHEN MATCHED THEN UPDATE SET c.coffee_size = s.coffee_size \
            WHEN NOT MATCHED THEN INSERT *")

                                                                                

DataFrame[]

In [31]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.snapshots;").toPandas()

Unnamed: 0,committed_at,snapshot_id,parent_id,operation,manifest_list,summary
0,2023-09-04 03:59:12.760,4815467340052359539,,append,s3a://go01-demo/warehouse/tablespace/external/...,{'spark.app.id': 'spark-application-1693799885...
1,2023-09-04 03:59:55.115,384109576346467287,4.815467e+18,overwrite,s3a://go01-demo/warehouse/tablespace/external/...,"{'added-data-files': '5', 'added-position-dele..."


In [32]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.manifests;").toPandas()

Unnamed: 0,content,path,length,partition_spec_id,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_delete_files_count,existing_delete_files_count,deleted_delete_files_count,partition_summaries
0,0,s3a://go01-demo/warehouse/tablespace/external/...,7350,0,384109576346467287,5,0,0,0,0,0,"[(False, False, 2023-04, 2023-07)]"
1,0,s3a://go01-demo/warehouse/tablespace/external/...,7163,0,4815467340052359539,2,0,0,0,0,0,"[(False, False, 2023-04, 2023-07)]"
2,1,s3a://go01-demo/warehouse/tablespace/external/...,7186,0,384109576346467287,0,0,0,2,0,0,"[(False, False, 2023-04, 2023-07)]"


In [33]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.all_data_files;").toPandas()

Unnamed: 0,content,file_path,file_format,spec_id,partition,record_count,file_size_in_bytes,column_sizes,value_counts,null_value_counts,nan_value_counts,lower_bounds,upper_bounds,key_metadata,split_offsets,equality_ids,sort_order_id
0,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(642,)",3,1074,"{1: 52, 2: 74, 3: 62}","{1: 3, 2: 3, 3: 3}","{1: 0, 2: 0, 3: 0}",{},"{1: [4, 0, 0, 0, 0, 0, 0, 0], 2: [103, 114, 97...","{1: [6, 0, 0, 0, 0, 0, 0, 0], 2: [118, 101, 11...",,[4],,0
1,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(642,)",1,985,"{1: 39, 2: 39, 3: 39}","{1: 1, 2: 1, 3: 1}","{1: 0, 2: 0, 3: 0}",{},"{1: [2, 0, 0, 0, 0, 0, 0, 0], 2: [116, 97, 108...","{1: [2, 0, 0, 0, 0, 0, 0, 0], 2: [116, 97, 108...",,[4],,0
2,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(640,)",3,1068,"{1: 52, 2: 72, 3: 62}","{1: 3, 2: 3, 3: 3}","{1: 0, 2: 0, 3: 0}",{},"{1: [7, 0, 0, 0, 0, 0, 0, 0], 2: [116, 97, 108...","{1: [10, 0, 0, 0, 0, 0, 0, 0], 2: [118, 101, 1...",,[4],,0
3,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(639,)",1,999,"{1: 39, 2: 41, 3: 39}","{1: 1, 2: 1, 3: 1}","{1: 0, 2: 0, 3: 0}",{},"{1: [8, 0, 0, 0, 0, 0, 0, 0], 2: [103, 114, 97...","{1: [8, 0, 0, 0, 0, 0, 0, 0], 2: [103, 114, 97...",,[4],,0
4,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(639,)",1,992,"{1: 39, 2: 40, 3: 39}","{1: 1, 2: 1, 3: 1}","{1: 0, 2: 0, 3: 0}",{},"{1: [3, 0, 0, 0, 0, 0, 0, 0], 2: [118, 101, 11...","{1: [3, 0, 0, 0, 0, 0, 0, 0], 2: [118, 101, 11...",,[4],,0
5,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(642,)",2,1015,"{1: 41, 2: 44, 3: 62}","{1: 2, 2: 2, 3: 2}","{1: 0, 2: 0, 3: 0}",{},"{1: [1, 0, 0, 0, 0, 0, 0, 0], 2: [103, 114, 97...","{1: [2, 0, 0, 0, 0, 0, 0, 0], 2: [118, 101, 11...",,[4],,0
6,0,s3a://go01-demo/warehouse/tablespace/external/...,PARQUET,0,"(639,)",1,969,"{1: 33, 2: 33, 3: 39}","{1: 1, 2: 1, 3: 1}","{1: 0, 2: 0, 3: 0}",{},"{1: [3, 0, 0, 0, 0, 0, 0, 0], 2: [116, 97, 108...","{1: [3, 0, 0, 0, 0, 0, 0, 0], 2: [116, 97, 108...",,[4],,0


In [34]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3.all_data_files;").toPandas()['file_path'][1]

's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/data/coffee_sale_ts_month=2023-07/00036-1035-7c946d62-1215-44f1-9a50-7d5ef7eb35e7-00001.parquet'

#### There is a new metadata file (json) prefixed by 0002.

#### There is a new manifest list file (avro) prefixed by "snap"

#### There is a new manifest file (avro)

In [35]:
import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket("go01-demo")

metadata_file_list = []

print("Current Metadata Files: \n")

for object_summary in my_bucket.objects.filter(Prefix=metadata_path+"/metadata"):
    #print(object_summary.key +"\n")
    metadata_file_list.append(object_summary.key)
    
print(*metadata_file_list, sep = "\n")

print("There is a total of " + str(len(metadata_file_list)) + " files in the Metadata layer")

Current Metadata Files: 

warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00000-60da2bb9-7e09-4618-af3f-68ab7e8df9fe.metadata.json
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00001-eeec84c8-a650-4033-83ef-9ef2cd6c81b7.metadata.json
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00002-f64af39b-d246-4783-afd2-2eef8c6a9941.metadata.json
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m0.avro
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m1.avro
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-384109576346467287-1-92964479-485c-40b2-9b4e-e49a1f19fdc9.avro
warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-48154673400

Showing Latest (Current) Metadata File (JSON)

In [36]:
print("Showing " + metadata_file_list[2])
spark.read.option("multiline","true").json("s3a://go01-demo/" + metadata_file_list[0]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/00002-f64af39b-d246-4783-afd2-2eef8c6a9941.metadata.json


Unnamed: 0,current-schema-id,current-snapshot-id,default-sort-order-id,default-spec-id,format-version,last-column-id,last-partition-id,last-sequence-number,last-updated-ms,location,metadata-log,partition-specs,properties,schemas,snapshot-log,snapshots,sort-orders,statistics,table-uuid
0,0,-1,0,0,2,3,1000,0,1693799933483,s3a://go01-demo/warehouse/tablespace/external/...,[],"[([Row(field-id=1000, name='coffee_sale_ts_mon...","(pauldefusco, merge-on-read, merge-on-read, me...","[([Row(id=1, name='coffee_id', required=False,...",[],[],"[([], 0)]",[],0d731042-eee8-4223-aa6a-404e6248d543


Showing Latest Manifest List (AVRO - prefixed by "SNAP")

In [37]:
print("Showing " + metadata_file_list[6])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[6]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-384109576346467287-1-92964479-485c-40b2-9b4e-e49a1f19fdc9.avro


Unnamed: 0,manifest_path,manifest_length,partition_spec_id,content,sequence_number,min_sequence_number,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_rows_count,existing_rows_count,deleted_rows_count,partitions
0,s3a://go01-demo/warehouse/tablespace/external/...,7350,0,0,2,2,384109576346467287,5,0,0,9,0,0,"[(False, False, [127, 2, 0, 0], [130, 2, 0, 0])]"
1,s3a://go01-demo/warehouse/tablespace/external/...,7163,0,0,1,1,4815467340052359539,2,0,0,3,0,0,"[(False, False, [127, 2, 0, 0], [130, 2, 0, 0])]"
2,s3a://go01-demo/warehouse/tablespace/external/...,7186,0,1,2,2,384109576346467287,2,0,0,2,0,0,"[(False, False, [127, 2, 0, 0], [130, 2, 0, 0])]"


In [38]:
print("Showing " + metadata_file_list[6])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[6]).toPandas()['manifest_path'][0]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-384109576346467287-1-92964479-485c-40b2-9b4e-e49a1f19fdc9.avro


's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m0.avro'

In [39]:
print("Showing " + metadata_file_list[6])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[6]).toPandas()['manifest_path'][1]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-384109576346467287-1-92964479-485c-40b2-9b4e-e49a1f19fdc9.avro


's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro'

In [40]:
print("Showing " + metadata_file_list[6])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[6]).toPandas()['manifest_path'][2]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-384109576346467287-1-92964479-485c-40b2-9b4e-e49a1f19fdc9.avro


's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m1.avro'

In [41]:
print("Showing " + metadata_file_list[7])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[7]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-4815467340052359539-1-ebfddaac-adcc-4d9b-91b6-6b41c31286c5.avro


                                                                                

Unnamed: 0,manifest_path,manifest_length,partition_spec_id,content,sequence_number,min_sequence_number,added_snapshot_id,added_data_files_count,existing_data_files_count,deleted_data_files_count,added_rows_count,existing_rows_count,deleted_rows_count,partitions
0,s3a://go01-demo/warehouse/tablespace/external/...,7163,0,0,1,1,4815467340052359539,2,0,0,3,0,0,"[(False, False, [127, 2, 0, 0], [130, 2, 0, 0])]"


In [42]:
print("Showing " + metadata_file_list[7])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[7]).toPandas()['manifest_path'][0]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/snap-4815467340052359539-1-ebfddaac-adcc-4d9b-91b6-6b41c31286c5.avro


's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro'

Showing Manifest Files (Avro) i.e. list of table partitions mapped to snapshot ID

In [43]:
print("Showing " + metadata_file_list[3])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m0.avro


Unnamed: 0,status,snapshot_id,sequence_number,file_sequence_number,data_file
0,1,384109576346467287,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
1,1,384109576346467287,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
2,1,384109576346467287,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
3,1,384109576346467287,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
4,1,384109576346467287,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."


In [44]:
print("Showing " + metadata_file_list[4])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[4]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m1.avro


Unnamed: 0,status,snapshot_id,sequence_number,file_sequence_number,data_file
0,1,384109576346467287,,,"(1, s3a://go01-demo/warehouse/tablespace/exter..."
1,1,384109576346467287,,,"(1, s3a://go01-demo/warehouse/tablespace/exter..."


In [45]:
print("Showing " + metadata_file_list[5])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[5]).toPandas()

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro


                                                                                

Unnamed: 0,status,snapshot_id,sequence_number,file_sequence_number,data_file
0,1,4815467340052359539,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."
1,1,4815467340052359539,,,"(0, s3a://go01-demo/warehouse/tablespace/exter..."


In [46]:
print("Showing " + metadata_file_list[3])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()['data_file'][0]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m0.avro


Row(content=0, file_path='s3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/data/coffee_sale_ts_month=2023-07/00011-839-9cc8e7da-16c5-46fa-8612-37d795596d15-00001.parquet', file_format='PARQUET', partition=Row(coffee_sale_ts_month=642), record_count=3, file_size_in_bytes=1074, column_sizes=[Row(key=1, value=52), Row(key=2, value=74), Row(key=3, value=62)], value_counts=[Row(key=1, value=3), Row(key=2, value=3), Row(key=3, value=3)], null_value_counts=[Row(key=1, value=0), Row(key=2, value=0), Row(key=3, value=0)], nan_value_counts=[], lower_bounds=[Row(key=1, value=bytearray(b'\x04\x00\x00\x00\x00\x00\x00\x00')), Row(key=2, value=bytearray(b'grande')), Row(key=3, value=bytearray(b'\x00W\xd3\xafk\xff\x05\x00'))], upper_bounds=[Row(key=1, value=bytearray(b'\x06\x00\x00\x00\x00\x00\x00\x00')), Row(key=2, value=bytearray(b'venti')), Row(key=3, value=bytearray(b'\x00W\xd3\xafk\xff\x05\x00'))], key_metadata=None, split_offsets=[4], equality_ids=None, sort_order_

In [47]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()['data_file'][0][1]
spark.read.parquet(data_file_path).show()

[Stage 51:>                                                         (0 + 1) / 1]

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        4|      venti|2023-07-01 12:01:00|
|        5|     grande|2023-07-01 12:01:00|
|        6|     grande|2023-07-01 12:01:00|
+---------+-----------+-------------------+



                                                                                

In [48]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()['data_file'][1][1]
spark.read.parquet(data_file_path).show()

[Stage 54:>                                                         (0 + 1) / 1]

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        2|       tall|2023-07-01 10:00:00|
+---------+-----------+-------------------+



                                                                                

In [69]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[3]).toPandas()['data_file'][2][1]
spark.read.parquet(data_file_path).show()

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        7|      venti|2023-05-01 12:01:00|
|        9|       tall|2023-05-01 12:01:00|
|       10|       tall|2023-05-01 12:01:00|
+---------+-----------+-------------------+



23/09/04 04:31:07 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
io.fabric8.kubernetes.client.WatcherException: too old resource version: 13837705 (13839217)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:265)
	at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:249)
	at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:93)
	at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323)
	at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219)
	at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105)
	at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274)
	at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214)
	at okhttp3.RealCall$AsyncCall.execute(Re

In [49]:
print("Showing " + metadata_file_list[4])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[4]).toPandas()['data_file'][0]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/92964479-485c-40b2-9b4e-e49a1f19fdc9-m1.avro


Row(content=1, file_path='s3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/data/coffee_sale_ts_month=2023-07/00036-1035-022f4aa1-8d4c-4ee9-812a-edb7044d29f6-00001.parquet', file_format='PARQUET', partition=Row(coffee_sale_ts_month=642), record_count=1, file_size_in_bytes=1794, column_sizes=[Row(key=2147483546, value=199), Row(key=2147483545, value=33)], value_counts=[Row(key=2147483546, value=1), Row(key=2147483545, value=1)], null_value_counts=[Row(key=2147483546, value=0), Row(key=2147483545, value=0)], nan_value_counts=[], lower_bounds=[Row(key=2147483546, value=bytearray(b's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/data/coffee_sale_ts_month=2023-07/00011-214-64d52d0e-b039-4681-89d9-6f15e487760b-00001.parquet')), Row(key=2147483545, value=bytearray(b'\x00\x00\x00\x00\x00\x00\x00\x00'))], upper_bounds=[Row(key=2147483546, value=bytearray(b's3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_

In [54]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[4]).toPandas()['data_file'][0][1]
spark.read.parquet(data_file_path).show()

+--------------------+---+
|           file_path|pos|
+--------------------+---+
|s3a://go01-demo/w...|  0|
+--------------------+---+



In [57]:
data_file_path = spark.read.parquet(data_file_path).toPandas()['file_path'][0]
spark.read.parquet(data_file_path).show()

                                                                                

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        2|     grande|2023-07-01 10:00:00|
|        1|      venti|2023-07-01 10:00:00|
+---------+-----------+-------------------+



In [58]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[4]).toPandas()['data_file'][1][1]
spark.read.parquet(data_file_path).show()

+--------------------+---+
|           file_path|pos|
+--------------------+---+
|s3a://go01-demo/w...|  0|
+--------------------+---+



In [59]:
data_file_path = spark.read.parquet(data_file_path).toPandas()['file_path'][0]
spark.read.parquet(data_file_path).show()

[Stage 80:>                                                         (0 + 1) / 1]

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        3|       tall|2023-04-01 10:00:00|
+---------+-----------+-------------------+



                                                                                

In [60]:
print("Showing " + metadata_file_list[5])
spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[5]).toPandas()['data_file'][0]

Showing warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/metadata/ebfddaac-adcc-4d9b-91b6-6b41c31286c5-m0.avro


Row(content=0, file_path='s3a://go01-demo/warehouse/tablespace/external/hive/lakehouse.db/coffees_table_3/data/coffee_sale_ts_month=2023-07/00011-214-64d52d0e-b039-4681-89d9-6f15e487760b-00001.parquet', file_format='PARQUET', partition=Row(coffee_sale_ts_month=642), record_count=2, file_size_in_bytes=1015, column_sizes=[Row(key=1, value=41), Row(key=2, value=44), Row(key=3, value=62)], value_counts=[Row(key=1, value=2), Row(key=2, value=2), Row(key=3, value=2)], null_value_counts=[Row(key=1, value=0), Row(key=2, value=0), Row(key=3, value=0)], nan_value_counts=[], lower_bounds=[Row(key=1, value=bytearray(b'\x01\x00\x00\x00\x00\x00\x00\x00')), Row(key=2, value=bytearray(b'grande')), Row(key=3, value=bytearray(b'\x00\x88\x18\xffi\xff\x05\x00'))], upper_bounds=[Row(key=1, value=bytearray(b'\x02\x00\x00\x00\x00\x00\x00\x00')), Row(key=2, value=bytearray(b'venti')), Row(key=3, value=bytearray(b'\x00\x88\x18\xffi\xff\x05\x00'))], key_metadata=None, split_offsets=[4], equality_ids=None, sort_

In [61]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[5]).toPandas()['data_file'][0][1]
spark.read.parquet(data_file_path).show()

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        2|     grande|2023-07-01 10:00:00|
|        1|      venti|2023-07-01 10:00:00|
+---------+-----------+-------------------+



In [62]:
data_file_path = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[5]).toPandas()['data_file'][1][1]
spark.read.parquet(data_file_path).show()

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        3|       tall|2023-04-01 10:00:00|
+---------+-----------+-------------------+



In [66]:
spark.sql("SELECT * FROM lakehouse.coffees_table_3").show()

23/09/04 04:09:09 WARN HiveMetaStoreClient: Failed to connect to the MetaStore Server...
[Stage 91:>                                                         (0 + 1) / 1]

+---------+-----------+-------------------+
|coffee_id|coffee_size|     coffee_sale_ts|
+---------+-----------+-------------------+
|        4|      venti|2023-07-01 12:01:00|
|        5|     grande|2023-07-01 12:01:00|
|        6|     grande|2023-07-01 12:01:00|
|        2|       tall|2023-07-01 10:00:00|
|        7|      venti|2023-05-01 12:01:00|
|        9|       tall|2023-05-01 12:01:00|
|       10|       tall|2023-05-01 12:01:00|
|        8|     grande|2023-04-01 12:01:00|
|        3|      venti|2023-04-01 10:00:00|
|        1|      venti|2023-07-01 10:00:00|
+---------+-----------+-------------------+



                                                                                

### Time Travel 

In [None]:
snapshots_df = spark.sql("SELECT * FROM lakehouse.customer_table.snapshots;")

In [None]:
first_snapshot = snapshots_df.select("snapshot_id").head(1)[0][0]

#### Validate that the output dataframe only includes one row per the original insert

In [None]:
spark.read\
    .option("snapshot-id", first_snapshot)\
    .format("iceberg")\
    .load("lakehouse.customer_table").toPandas()

In [None]:
avro_tempdf = spark.read.format("avro").load("s3a://go01-demo/" + metadata_file_list[6]).toPandas()

In [None]:
avro_tempdf.columns

In [None]:
avro_tempdf['partitions']

In [None]:
avro_tempdf['added_rows_count']

In [None]:
avro_tempdf['existing_rows_count']

In [None]:
avro_tempdf['added_data_files_count']

In [None]:
print("Showing " + metadata_file_list[2])
json_tempdf = spark.read.option("multiline","true").json("s3a://go01-demo/" + metadata_file_list[2]).toPandas()

In [None]:
json_tempdf.columns

In [None]:
json_tempdf['current-schema-id']

In [None]:
list(json_tempdf['snapshots'])

In [None]:
json_tempdf['partition-spec']

In [None]:
list(json_tempdf['partition-specs'])

In [None]:
spark.sql("SELECT * FROM lakehouse.coffees_table_2.all_data_files;").show()

### Partition Evolution

Spark partitioning is a way to split the data into multiple partitions so that you can execute transformations on multiple partitions in parallel which allows completing the job faster. You can also write partitioned data into a file system (multiple sub-directories) for faster reads by downstream systems.

Spark has several partitioning methods to achieve parallelism, based on your need, you should choose which one to use.

Creating New Data to Test Partition Evolution

In [None]:
from pyspark.sql.types import LongType, IntegerType, StringType

import dbldatagen as dg

shuffle_partitions_requested = 20
device_population = 100000
data_rows = 20 * 1000000
#partitions_requested = 20

spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

country_codes = [
    "CN", "US", "FR", "CA", "IN", "JM", "IE", "PK", "GB", "IL", "AU", 
    "SG", "ES", "GE", "MX", "ET", "SA", "LB", "NL", "IT"
]
#country_weights = [
#    1300, 365, 67, 38, 1300, 3, 7, 212, 67, 9, 25, 6, 47, 83, 
#    126, 109, 58, 8, 17,
#]

manufacturers = [
    "Delta corp", "Xyzzy Inc.", "Lakehouse Ltd", "Acme Corp", "Embanks Devices",
]

lines = ["delta", "xyzzy", "lakehouse", "gadget", "droid"]

testDataSpec = (
    dg.DataGenerator(spark, name="device_data_set", rows=data_rows) 
                     #,partitions=partitions_requested)
    .withIdOutput()
    # we'll use hash of the base field to generate the ids to
    # avoid a simple incrementing sequence
    .withColumn("internal_device_id", "long", minValue=0x1000000000000, 
                uniqueValues=device_population, omit=True, baseColumnType="hash",
    )
    # note for format strings, we must use "%lx" not "%x" as the
    # underlying value is a long
    .withColumn(
        "device_id", "string", format="0x%013x", baseColumn="internal_device_id"
    )
    # the device / user attributes will be the same for the same device id
    # so lets use the internal device id as the base column for these attribute
    .withColumn("country", "string", values=country_codes, #weights=country_weights, 
                baseColumn="internal_device_id")
    .withColumn("manufacturer", "string", values=manufacturers, 
                baseColumn="internal_device_id", )
    # use omit = True if you don't want a column to appear in the final output
    # but just want to use it as part of generation of another column
    .withColumn("line", "string", values=lines, baseColumn="manufacturer", 
                baseColumnType="hash", omit=True )
    .withColumn("model_ser", "integer", minValue=1, maxValue=11, baseColumn="device_id", 
                baseColumnType="hash", omit=True, )
    .withColumn("model_line", "string", expr="concat(line, '#', model_ser)", 
                baseColumn=["line", "model_ser"] )
    .withColumn("event_type", "string", 
                values=["activation", "deactivation", "plan change", "telecoms activity", 
                        "internet activity", "device error", ],
                random=True)
    .withColumn("event_ts", "timestamp", begin="2020-01-01 01:00:00", 
                end="2020-12-31 23:59:00", 
                interval="1 minute", random=True )
)

dfTestData = testDataSpec.build()

display(dfTestData)

In [None]:
dfTestData.head()

In [None]:
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")

In [None]:
spark.sql("DROP TABLE IF EXISTS spark_catalog.lakehouse.partition_evol_tbl PURGE")

In [None]:
#dfTestData.groupBy("country").count().show()

In [None]:
#dfTestData.rdd.getNumPartitions()

Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write against partitioned table. This applies both Writing with SQL and Writing with DataFrames.

In [None]:
dfTestData.sortWithinPartitions("country").writeTo("spark_catalog.lakehouse.p_evol_tbl").partitionedBy("country").using("iceberg").create()#.append()#replace()#overwritePartitions()#create()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.PARTITIONS").show()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.files").show()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.manifests").show()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.all_manifests").show()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.all_data_files").show()

In [None]:
#spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.snapshots").show()

Adding a partition field is a metadata operation and does not change any of the existing table data. New data will be written with the new partitioning, but existing data will remain in the old partition layout. Old data files will have null values for the new partition fields in metadata tables.

In [None]:
print("TABLE PARTITIONS BEFORE ALTER PARTITION STATEMENT: ")
spark.sql("SELECT * FROM spark_catalog.lakehouse.p_evol_tbl.PARTITIONS").show()

In [None]:
print("ADD PARTITION BY EVENT TIMESTAMP MONTHS: ")
print("ALTER TABLE spark_catalog.lakehouse.p_evol_tbl ADD PARTITION FIELD months(event_ts)")
spark.sql("ALTER TABLE spark_catalog.lakehouse.p_evol_tbl ADD PARTITION FIELD months(event_ts)")
#spark.sql("ALTER TABLE spark_catalog.lakehouse.part_evol_tbl REPLACE PARTITION FIELD hours(dob) WITH state")
#spark.sql("ALTER TABLE prod.db.sample ADD PARTITION FIELD month")

#ALTER TABLE spark_catalog.lakehouse.part_evol_tbl ADD PARTITION FIELD days(event_ts)

In [None]:
print("TABLE PARTITIONS AFTER ALTER PARTITION STATEMENT: ")
spark.sql("SELECT * FROM spark_catalog.lakehouse.p_evol_tbl.PARTITIONS").show()

In [None]:
appendDf = dfTestData.sample(fraction=0.3, seed=3)

In [None]:
appendDf.dtypes

In [None]:
appendDf.rdd.getNumPartitions()

In [None]:
appendDf.show()

In [None]:
appendDf.sortWithinPartitions("country").show()

In [None]:
#appendDf.sortWithinPartitions("country", "month(event_ts)").show()

In [None]:
appendDf.sortWithinPartitions("country").writeTo("spark_catalog.lakehouse.p_evol_tbl").using("iceberg").append() #.append()#replace()#overwritePartitions()#create()

In [None]:
print("TABLE PARTITIONS AFTER APPEND: ")
spark.sql("SELECT * FROM spark_catalog.lakehouse.p_evol_tbl.PARTITIONS").show(100)

Dropping a partition field is a metadata operation and does not change any of the existing table data. New data will be written with the new partitioning, but existing data will remain in the old partition layout.

In [None]:
spark.sql("ALTER TABLE spark_catalog.lakehouse.part_evol_tbl DROP PARTITION FIELD bucket(16, device_id)")

In [None]:
print("TABLE PARTITIONS AFTER ALTER PARTITION STATEMENT: ")
spark.sql("SELECT * FROM spark_catalog.lakehouse.part_evol_tbl.PARTITIONS").show()

##### Only json files have been added (one per each time you repartitioned) but Avro files have stayed the same

In [None]:
s3 = boto3.resource('s3')
my_bucket = s3.Bucket("go01-demo")

metadata_file_list = []

print("Current Metadata Files: \n")

for object_summary in my_bucket.objects.filter(Prefix=metadata_path+"/metadata"):
    #print(object_summary.key +"\n")
    metadata_file_list.append(object_summary.key)
    
metadata_file_list

In [None]:
spark.sql("CREATE TABLE IF NOT EXISTS customer_table (id BIGINT, state STRING, country STRING, dob TIMESTAMP) USING iceberg PARTITIONED BY ( hours(dob))")

In [None]:
spark.sql("SELECT HOUR(dob) FROM spark_catalog.lakehouse.customer_table").show()

In [None]:
spark.sql("SELECT DAY(dob) FROM spark_catalog.lakehouse.customer_table").show()

### Dropping Tables

In [None]:
spark.sql("DROP TABLE IF EXISTS lakehouse.staging")

Validate that the metadata folder is now empty but the data folder still retains parquet files.

![alt text](../img/s3_droptable_1.png)

![alt text](../img/s3_droptable_2.png)

![alt text](../img/s3_droptable_3.png)

In [None]:
spark.sql("ALTER TABLE lakehouse.customers_table\
            SET TBLPROPERTIES ('format-version' = '2')")

In [None]:
s3 = boto3.resource('s3')
my_bucket = s3.Bucket("go01-demo")

metadata_file_list = []

for object_summary in my_bucket.objects.filter(Prefix=metadata_path):
    print(object_summary.key +"\n")
    metadata_file_list.append(object_summary.key)

In [None]:
print("Showing " + metadata_file_list[3])
spark.read.option("multiline","true").json("s3a://go01-demo/" + metadata_file_list[3]).toPandas()