## Introduction to Iceberg Architecture

#### Launching a Spark Session with Iceberg

In [2]:
import cml.data_v1 as cmldata

CONNECTION_NAME = "go01-aw-dl"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()

# Run a sample query through Spark to confirm the session works
EXAMPLE_SQL_QUERY = "show databases"
spark.sql(EXAMPLE_SQL_QUERY).show()

Setting spark.hadoop.yarn.resourcemanager.principal to pauldefusco
Hive Session ID = 6af89b0f-d436-44c5-89a0-0187137f3aa6


+--------------------+
|           namespace|
+--------------------+
|         01_car_data|
|           01_car_dw|
|      adash_car_data|
|             airline|
|          airline_dw|
|            airlines|
|        airlines_csv|
|       airlines_csv1|
|   airlines_csv_vish|
|    airlines_iceberg|
|   airlines_iceberg1|
|airlines_iceberg_...|
|      airlines_mjain|
|          airquality|
|          atlas_demo|
|            bankdemo|
|              bhagan|
|             cdedemo|
|        cdp_overview|
|        cgsifacebook|
+--------------------+
only showing top 20 rows



In [3]:
spark.sparkContext.getConf().getAll()

[('spark.eventLog.enabled', 'true'),
 ('spark.repl.local.jars',
  'file:///runtime-addons/spark320-17-hf1-6xa3lk/opt/spark/optional-lib/iceberg-spark-runtime-3.2_2.12-0.14.1.1.17.7215.0-31.jar'),
 ('spark.network.crypto.enabled', 'true'),
 ('spark.ui.proxyRedirectUri',
  'https://spark-70c18njoj0mcvhmn.ml-4c5feac0-3ec.go01-dem.ylcu-atmi.cloudera.site'),
 ('spark.sql.hive.hwc.execution.mode', 'spark'),
 ('spark.kerberos.renewal.credentials', 'ccache'),
 ('spark.driver.bindAddress', '100.100.84.88'),
 ('spark.sql.catalog.spark_catalog',
  'org.apache.iceberg.spark.SparkSessionCatalog'),
 ('spark.dynamicAllocation.maxExecutors', '49'),
 ('spark.eventLog.dir', 'file:///sparkeventlogs'),
 ('spark.hadoop.yarn.resourcemanager.principal', 'pauldefusco'),
 ('spark.driver.port', '37549'),
 ('spark.ui.port', '20049'),
 ('spark.kubernetes.driver.annotation.cluster-autoscaler.kubernetes.io/safe-to-evict',
  'false'),
 ('spark.yarn.access.hadoopFileSystems',
  's3a://go01-demo/warehouse/tablespace/e

### Iceberg Architecture

![alt text](../img/iceberg-metadata.png)

#### Iceberg Catalog

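Iceberg provides catalog implementations that let Spark SQL commands create, manage, and load tables by name. Each catalog is configured through Spark properties under `spark.sql.catalog.<catalog_name>`. In this session the built-in `spark_catalog` is backed by `org.apache.iceberg.spark.SparkSessionCatalog`, as the configuration output above shows.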
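
To make the property naming concrete, the sketch below builds the `spark.sql.catalog.<catalog_name>` keys for an additional catalog. The helper `iceberg_catalog_conf`, the catalog name `my_catalog`, and the warehouse path are hypothetical and not part of this notebook; only the key layout follows the Iceberg Spark configuration convention.

```python
def iceberg_catalog_conf(name, catalog_type, warehouse=None):
    """Return Spark properties that would register an Iceberg catalog called `name`.

    Hypothetical helper for illustration; the values are assumptions.
    """
    prefix = f"spark.sql.catalog.{name}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",  # catalog implementation class
        f"{prefix}.type": catalog_type,                   # e.g. "hive" or "hadoop"
    }
    if warehouse is not None:
        conf[f"{prefix}.warehouse"] = warehouse
    return conf


# Assumed example values: catalog name and bucket are placeholders.
conf = iceberg_catalog_conf("my_catalog", "hadoop", "s3a://example-bucket/warehouse")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Properties like these would normally be set before the session is created, for example with `SparkSession.builder.config(key, value)`. In this notebook the CML data connection has already configured `spark_catalog`, so no extra setup is needed.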

In [4]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|  default|
+-------------+---------+



In [10]:
# Create a new database
spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.lakehouse_catalog")
spark.sql("USE spark_catalog.lakehouse_catalog")

DataFrame[]

In [11]:
# Show catalog and database
spark.sql("SHOW CURRENT NAMESPACE").show()

+-------------+-----------------+
|      catalog|        namespace|
+-------------+-----------------+
|spark_catalog|lakehouse_catalog|
+-------------+-----------------+



#### Create an Iceberg Table with Spark SQL

In [44]:
spark.sql("DROP TABLE IF EXISTS customers_table")

                                                                                

DataFrame[]

In [45]:
# Create an Iceberg table partitioned by the hour of the dob timestamp
spark.sql("CREATE TABLE IF NOT EXISTS customers_table (id BIGINT, state STRING, country STRING, dob TIMESTAMP) USING iceberg PARTITIONED BY (hours(dob))")

DataFrame[]
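
The `hours(dob)` clause asks Iceberg to derive a partition value from each timestamp rather than store a separate partition column. As a rough sketch of the idea, the hour transform can be thought of as the number of whole hours since the Unix epoch. The function name `hour_partition` is hypothetical, and the example assumes UTC timestamps; Spark's session time zone may affect how a timestamp literal is interpreted.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def hour_partition(ts):
    """Whole hours between the Unix epoch and `ts` (timezone-aware datetime)."""
    # Floor division of two timedeltas yields an int, rounding toward -infinity.
    return (ts - EPOCH) // timedelta(hours=1)


# Two timestamps within the same hour map to the same partition value.
a = datetime(2000, 1, 1, 0, 5, tzinfo=timezone.utc)
b = datetime(2000, 1, 1, 0, 55, tzinfo=timezone.utc)
c = datetime(2000, 1, 1, 1, 0, tzinfo=timezone.utc)
print(hour_partition(a), hour_partition(b), hour_partition(c))
```

Because the partition value is computed from `dob`, queries that filter on `dob` can be pruned to matching hours without the writer or reader managing an extra column.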

#### Verify that a Metadata JSON file has been created under the Metadata directory

In [47]:
metadata_path = "s3a://go01-demo/warehouse/tablespace/external/hive/lakehouse_catalog.db/customers_table/metadata/00000-d8c7b4ba-15f5-4bfa-a9c8-a6ad17eaa44b.metadata.json"

In [55]:
import pandas as pd
spark.read.option("multiline","true").json(metadata_path).toPandas()

                                                                                

|  | current-schema-id | current-snapshot-id | default-sort-order-id | default-spec-id | format-version | last-column-id | last-partition-id | last-updated-ms | location | metadata-log | partition-spec | partition-specs | properties | schema | schemas | snapshot-log | snapshots | sort-orders | table-uuid |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | -1 | 0 | 0 | 1 | 4 | 1000 | 1682462401362 | s3a://go01-demo/warehouse/tablespace/external/... | [] | [(1000, dob_hour, 4, hour)] | [([Row(field-id=1000, name='dob_hour', source-... | (pauldefusco,) | ([(1, id, False, long), (2, state, False, stri... | [([Row(id=1, name='id', required=False, type='... | [] | [] | [([], 0)] | a56f75e7-56b4-4b46-afd6-2a2773957f7d |
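
The fields that matter most at this stage are `current-snapshot-id`, `snapshots`, and `partition-spec`. The sketch below inspects them with the standard `json` module. The `sample_metadata` document is a trimmed, hand-written stand-in whose values are copied from the output above where they are visible; the real file contains more fields, and the helper `summarize` is hypothetical.

```python
import json

# Trimmed stand-in for the metadata JSON; not the full file.
sample_metadata = json.loads("""
{
  "format-version": 1,
  "table-uuid": "a56f75e7-56b4-4b46-afd6-2a2773957f7d",
  "current-snapshot-id": -1,
  "last-column-id": 4,
  "partition-spec": [
    {"name": "dob_hour", "transform": "hour", "source-id": 4, "field-id": 1000}
  ],
  "snapshots": []
}
""")


def summarize(metadata):
    """Pull out the snapshot state and partition fields from table metadata."""
    snapshots = metadata.get("snapshots", [])
    return {
        "has_snapshots": len(snapshots) > 0,
        "current_snapshot_id": metadata.get("current-snapshot-id"),
        "partition_fields": [
            (field["name"], field["transform"])
            for field in metadata.get("partition-spec", [])
        ],
    }


summary = summarize(sample_metadata)
print(summary)
```

A `current-snapshot-id` of `-1` together with an empty `snapshots` list indicates that the table has been defined but nothing has been committed to it yet.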


![alt text](../img/s3_metadata.png)

#### Notice that no snapshots, manifests, or data files exist yet because no data has been inserted

In [40]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.history").show()

+---------------+-----------+---------+-------------------+
|made_current_at|snapshot_id|parent_id|is_current_ancestor|
+---------------+-----------+---------+-------------------+
+---------------+-----------+---------+-------------------+



                                                                                

In [30]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.snapshots;").show()

+------------+-----------+---------+---------+-------------+-------+
|committed_at|snapshot_id|parent_id|operation|manifest_list|summary|
+------------+-----------+---------+---------+-------------+-------+
+------------+-----------+---------+---------+-------------+-------+



                                                                                

In [27]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.files;").show()

+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
|content|file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+



In [29]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.manifests;").show()

+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+



In [32]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.all_data_files;").show()

+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
|content|file_path|file_format|spec_id|partition|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+
+-------+---------+-----------+-------+---------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+



In [34]:
spark.sql("SELECT * FROM lakehouse_catalog.customers_table.all_manifests;").show()

+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
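
The six checks above follow the same pattern: append a metadata table name to the table identifier and select from it. A minimal sketch of that pattern as a loop is shown here. The helpers `metadata_queries` and `metadata_row_counts` are hypothetical conveniences, and `metadata_row_counts` assumes a live `spark` session such as the one created earlier.

```python
# Metadata tables queried in the cells above.
METADATA_TABLES = [
    "history",
    "snapshots",
    "files",
    "manifests",
    "all_data_files",
    "all_manifests",
]


def metadata_queries(table):
    """Build one SELECT statement per Iceberg metadata table."""
    return {name: f"SELECT * FROM {table}.{name}" for name in METADATA_TABLES}


def metadata_row_counts(spark, table):
    """Run each metadata query and return {metadata_table: row_count}."""
    return {
        name: spark.sql(sql).count()
        for name, sql in metadata_queries(table).items()
    }


queries = metadata_queries("lakehouse_catalog.customers_table")
for name, sql in queries.items():
    print(name, "->", sql)
```

Calling `metadata_row_counts(spark, "lakehouse_catalog.customers_table")` at this point should report zero rows for each entry, consistent with the empty results shown above.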

#### Insert a record into the table 

In [58]:
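# Optional: the INSERT below calls date_format inside the SQL string,
# which Spark SQL resolves without this Python import
from pyspark.sql.functions import date_format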

In [62]:
spark.sql("INSERT INTO lakehouse_catalog.customers_table VALUES (1, 'CA', 'USA', cast(date_format('2000-01-01 00:00:00', 'yyyy-MM-dd HH:mm:ss') as timestamp))")

                                                                                

DataFrame[]

In [71]:
# Join table history with snapshot details on the snapshot id
QUERY = """
    SELECT h.made_current_at,
           s.operation,
           h.snapshot_id,
           h.is_current_ancestor,
           s.summary['spark.app.id']
    FROM lakehouse_catalog.customers_table.history h
    JOIN lakehouse_catalog.customers_table.snapshots s
        ON h.snapshot_id = s.snapshot_id
    ORDER BY made_current_at;
"""

In [73]:
spark.sql(QUERY).show()

                                                                                

+--------------------+---------+-------------------+-------------------+---------------------+
|     made_current_at|operation|        snapshot_id|is_current_ancestor|summary[spark.app.id]|
+--------------------+---------+-------------------+-------------------+---------------------+
|2023-04-25 23:06:...|   append|7935150167096816875|               true| spark-application...|
+--------------------+---------+-------------------+-------------------+---------------------+
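
The `is_current_ancestor` column reports whether a snapshot lies on the lineage of the table's current snapshot. With a single append there is only one snapshot, so it is `true`. As a rough sketch of the idea, the flag can be derived by following parent pointers back from the current snapshot. The function `current_ancestors` and the small snapshot ids are hypothetical and only illustrate the concept.

```python
def current_ancestors(parent_of, current_id):
    """Return the ids reachable from `current_id` by following parent pointers."""
    ancestors = set()
    snapshot_id = current_id
    # Stop at the root (parent is None) or if an id repeats.
    while snapshot_id is not None and snapshot_id not in ancestors:
        ancestors.add(snapshot_id)
        snapshot_id = parent_of.get(snapshot_id)
    return ancestors


# Hypothetical lineage: 1 -> 2 -> 4 is current; 3 was written and then rolled back.
parent_of = {1: None, 2: 1, 3: 2, 4: 2}
ancestors = current_ancestors(parent_of, current_id=4)
for snapshot_id in sorted(parent_of):
    print(snapshot_id, snapshot_id in ancestors)
```

In this made-up history, snapshot 3 would show `is_current_ancestor = false` because the current snapshot does not descend from it.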



![alt text](../img/s3_data_1.png)

![alt text](../img/s3_data_2.png)