# Iceberg Lab 
## Unit 5: Table Inspection

In the previous unit, we:
1. Learned how schema is enforced in Iceberg
2. Learned how to perform schema evolution and how Iceberg keeps track of it

In this unit, we will:
1. Explore the metadata inspection tables that Iceberg provides

### 1. Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col


import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")
spark

### 3. Declare variables

In [3]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  nikhim-iceberg-lab


In [4]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  nikhim-iceberg-lab


In [5]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  928505941962


In [6]:
DPMS_NAME=f"iceberg-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]
print("HIVE_METASTORE_WAREHOUSE_DIR",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse


In [7]:
TABLE_NAME="loans_by_state_iceberg"
DB_NAME="loan_db"

#fully qualified table name
FQTN=f"{DB_NAME}.{TABLE_NAME}"

print("Fully qualified table name :",FQTN)

Fully qualified table name : loan_db.loans_by_state_iceberg


### 4. Table Inspection
Iceberg exposes a set of metadata tables that let you query the contents of a table's metadata folder (snapshots, manifests, data files, and so on) without reading the metadata files directly. The information in these tables can be used for time travel, rollback, snapshot correction, and maintenance.
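
Each metadata table is addressed by appending its name to the table identifier. A minimal sketch of a helper that builds these identifiers follows; `metadata_table` and `METADATA_TABLES` are hypothetical names introduced for illustration, and the tuple only covers the tables explored in this unit.

```python
# Metadata tables explored in this unit (not an exhaustive list).
METADATA_TABLES = (
    "history",
    "metadata_log_entries",
    "snapshots",
    "files",
    "all_files",
    "manifests",
    "all_manifests",
    "partitions",
)


def metadata_table(fqtn: str, name: str) -> str:
    """Return the identifier of a metadata table, e.g. db.table.history."""
    if name not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {name}")
    return f"{fqtn}.{name}"


print(metadata_table("loan_db.loans_by_state_iceberg", "history"))
```

With the session and variables declared earlier, `spark.table(metadata_table(FQTN, "history"))` should be equivalent to passing the hard-coded identifier.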


#### a. history

In [8]:
# Shows a history of snapshot updates on the table
spark.table("loan_db.loans_by_state_iceberg.history").show(truncate=False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2023-02-10 15:26:16.364|3648627921780331930|null               |true               |
|2023-02-10 22:29:45.022|5222601969543758311|3648627921780331930|true               |
|2023-02-10 22:33:06.455|9145457862466461068|5222601969543758311|true               |
|2023-02-10 22:35:11.244|8627182030940064924|9145457862466461068|true               |
|2023-02-10 22:36:30.093|2697368997376323351|8627182030940064924|true               |
|2023-02-10 22:44:01.95 |5865803199727045458|2697368997376323351|true               |
+-----------------------+-------------------+-------------------+-------------------+
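
The `snapshot_id` values in `history` are what time travel and rollback operate on. The sketch below finds the snapshot that preceded the current one and builds the corresponding statements. It assumes Spark 3.3 or later (for `VERSION AS OF`) and that the Iceberg catalog is registered as `spark_catalog`; both are assumptions about this lab's setup. `previous_snapshot_id`, `time_travel_sql`, and `rollback_sql` are hypothetical helper names.

```python
from typing import Optional


def previous_snapshot_id(history_rows) -> Optional[int]:
    """Return the parent of the snapshot that was made current most recently."""
    latest = max(history_rows, key=lambda row: row["made_current_at"])
    return latest["parent_id"]


def time_travel_sql(fqtn: str, snapshot_id: int) -> str:
    """Build a query that reads the table as of the given snapshot."""
    return f"SELECT * FROM {fqtn} VERSION AS OF {snapshot_id}"


def rollback_sql(catalog: str, fqtn: str, snapshot_id: int) -> str:
    """Build a call to Iceberg's rollback_to_snapshot procedure."""
    return f"CALL {catalog}.system.rollback_to_snapshot('{fqtn}', {snapshot_id})"


# Sample rows copied from the last three entries of the history output.
# Rows returned by spark.table(...).collect() can be indexed the same way.
history_rows = [
    {"made_current_at": "2023-02-10 22:35:11.244",
     "snapshot_id": 8627182030940064924, "parent_id": 9145457862466461068},
    {"made_current_at": "2023-02-10 22:36:30.093",
     "snapshot_id": 2697368997376323351, "parent_id": 8627182030940064924},
    {"made_current_at": "2023-02-10 22:44:01.95",
     "snapshot_id": 5865803199727045458, "parent_id": 2697368997376323351},
]

FQTN = "loan_db.loans_by_state_iceberg"
previous_id = previous_snapshot_id(history_rows)

# The statements are only printed here, not executed.
print(time_travel_sql(FQTN, previous_id))
print(rollback_sql("spark_catalog", FQTN, previous_id))  # catalog name is an assumption
```

Passing either string to `spark.sql(...)` would run it. The time travel query is read-only, whereas the rollback changes the table's current snapshot, so it is worth checking the target `snapshot_id` against `history` first.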



#### b. metadata_log_entries

In [9]:
# Tracks the metadata log entries, along with the snapshot that was current when each metadata file was written
spark.table("loan_db.loans_by_state_iceberg.metadata_log_entries").show(truncate=False)

+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----------------------+
|timestamp              |file                                                                                                                                                                                   |latest_snapshot_id |latest_schema_id|latest_sequence_number|
+-----------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------------+----------------+----------------------+
|2023-02-10 15:26:16.364|gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00000-e9f97377-9cf7-4d6b-a97f-cde34a16

#### c. snapshots

In [12]:
# Lists every snapshot of the table with its commit time, operation, parent snapshot, manifest list location, and a summary of statistics
# The entry with a null parent_id is the first snapshot created
spark.table("loan_db.loans_by_state_iceberg.snapshots").show()

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|        committed_at|        snapshot_id|          parent_id|operation|       manifest_list|             summary|
+--------------------+-------------------+-------------------+---------+--------------------+--------------------+
|2023-02-10 15:26:...|3648627921780331930|               null|   append|gs://gcs-bucket-i...|{spark.app.id -> ...|
|2023-02-10 22:29:...|5222601969543758311|3648627921780331930|overwrite|gs://gcs-bucket-i...|{spark.app.id -> ...|
|2023-02-10 22:33:...|9145457862466461068|5222601969543758311|   append|gs://gcs-bucket-i...|{spark.app.id -> ...|
|2023-02-10 22:35:...|8627182030940064924|9145457862466461068|overwrite|gs://gcs-bucket-i...|{spark.app.id -> ...|
|2023-02-10 22:36:...|2697368997376323351|8627182030940064924|overwrite|gs://gcs-bucket-i...|{spark.app.id -> ...|
|2023-02-10 22:44:...|5865803199727045458|2697368997376323351|overwrite|gs://gcs
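
One use of the `snapshots` table is to summarize which operations produced the table's history. A minimal sketch follows, using the `operation` values visible in the output above as sample data. `count_operations` is a hypothetical helper; it accepts plain dicts or the `Row` objects returned by `collect()`, since both support lookup by column name.

```python
from collections import Counter


def count_operations(snapshot_rows) -> Counter:
    """Count how many snapshots each operation type produced."""
    return Counter(row["operation"] for row in snapshot_rows)


# Operation column of the six snapshots shown above, in commit order.
sample_rows = [
    {"operation": op}
    for op in ("append", "overwrite", "append", "overwrite", "overwrite", "overwrite")
]

counts = count_operations(sample_rows)
for operation, n in sorted(counts.items()):
    print(operation, n)
```

Against the live table, the same summary could be computed in Spark itself with `spark.table("loan_db.loans_by_state_iceberg.snapshots").groupBy("operation").count()`.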

#### d. files

In [14]:
# Shows the data files of the current snapshot only, with the per-file metadata and column statistics used for efficient querying

spark.table("loan_db.loans_by_state_iceberg.files").show(truncate=True)

+-------+--------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|content|           file_path|file_format|spec_id|record_count|file_size_in_bytes|        column_sizes|      value_counts|null_value_counts|nan_value_counts|        lower_bounds|        upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+--------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+--------------------+--------------------+------------+-------------+------------+-------------+
|      0|gs://gcs-bucket-i...|    PARQUET|      0|          51|               998|{1 -> 175, 2 -> 220}|{1 -> 51, 2 -> 51}| {1 -> 0, 2 -> 0}|              {}|{1 -> AK, 2 -> \n...|{1 -> WY, 2 -> ��...|        null|          [4]|        null
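
The `files` table is a common starting point for maintenance, for example spotting small files that may be worth compacting. The sketch below totals the record counts and sizes and flags files under a size threshold. `summarize_files` is a hypothetical helper and the 1 MiB default threshold is an arbitrary illustrative value, not an Iceberg setting.

```python
def summarize_files(file_rows, small_file_bytes=1024 * 1024):
    """Summarize rows of the files metadata table and flag small files."""
    rows = list(file_rows)
    return {
        "file_count": len(rows),
        "total_records": sum(r["record_count"] for r in rows),
        "total_bytes": sum(r["file_size_in_bytes"] for r in rows),
        "small_files": [
            r["file_path"] for r in rows
            if r["file_size_in_bytes"] < small_file_bytes
        ],
    }


# Sample row taken from the output above; the path is truncated there
# and is kept truncated here.
sample_files = [
    {"file_path": "gs://gcs-bucket-i...", "record_count": 51, "file_size_in_bytes": 998},
]

summary = summarize_files(sample_files)
print(summary)
```

On the live table, the rows could come from `spark.table("loan_db.loans_by_state_iceberg.files").select("file_path", "record_count", "file_size_in_bytes").collect()`.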

#### e. all_files

In [15]:
# Similar to "files" above, but lists the files referenced by all snapshots of the table, not just the current one

spark.table("loan_db.loans_by_state_iceberg.all_files").show(truncate=False)

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+--------------------+------------------+-----------------+----------------+-------------------------+------------------------+------------+-------------+------------+-------------+
|content|file_path                                                                                                                                                                              |file_format|spec_id|record_count|file_size_in_bytes|column_sizes        |value_counts      |null_value_counts|nan_value_counts|lower_bounds             |upper_bounds            |key_metadata|split_offsets|equality_ids|sort_order_id|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------

#### f. manifests

In [16]:
# Shows details of the manifest files for the current snapshot only, as recorded in the Avro files under the table's metadata folder

spark.table("loan_db.loans_by_state_iceberg.manifests").show(truncate=False)

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path                                                                                                                                                                       |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+----------

#### g. all_manifests

In [17]:
# Similar to "manifests" above, but lists the manifest files referenced by all snapshots of the table, not just the current one

spark.table("loan_db.loans_by_state_iceberg.all_manifests").show(truncate=False)

+-------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path                                                                                                                                                                       |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------------

#### h. partitions

In [19]:
# Below is the output for an unpartitioned table
spark.table("loan_db.loans_by_state_iceberg.partitions").show(truncate=False)

+------------+----------+
|record_count|file_count|
+------------+----------+
|51          |1         |
+------------+----------+



In [20]:
# The statement below shows a more descriptive output for a partitioned table: the record count and file count in each partition, plus the spec_id
# (In our case spec_id = 0 refers to the table's initial partition spec, which partitions by the "addr_state" column)

spark.table("loan_db.loans_by_state_iceberg_partitioned.partitions").show(truncate=False)

+---------+------------+----------+-------+
|partition|record_count|file_count|spec_id|
+---------+------------+----------+-------+
|{NE}     |1           |1         |0      |
|{ND}     |1           |1         |0      |
|{VA}     |1           |1         |0      |
|{DE}     |1           |1         |0      |
|{AR}     |1           |1         |0      |
|{KY}     |1           |1         |0      |
|{MI}     |1           |1         |0      |
|{RI}     |1           |1         |0      |
|{NM}     |1           |1         |0      |
|{MN}     |1           |1         |0      |
|{UT}     |1           |1         |0      |
|{NJ}     |1           |1         |0      |
|{WY}     |1           |1         |0      |
|{DC}     |1           |1         |0      |
|{GA}     |1           |1         |0      |
|{KS}     |1           |1         |0      |
|{WI}     |1           |1         |0      |
|{NV}     |1           |1         |0      |
|{OH}     |1           |1         |0      |
|{ID}     |1           |1       

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK