# Apache Iceberg Lab 
## Unit 2: Create Iceberg table
In the previous unit, we:
1. Read Parquet data from the data lake
2. Cleansed it, reduced the dataset, and persisted it as Parquet to the data lake's parquet-consumable directory
3. Created a database called loan_db and defined an external table on the data in parquet-consumable

In this unit, you will learn to:
1. Create an Iceberg table in the Hadoop catalog from the Parquet table defined in the previous notebook, and explore its folder structure
2. Create an Iceberg table in the Hive catalog from the same Parquet table, and explore its folder structure
3. Inspect the data and metadata files

### 1. Imports

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

import warnings
warnings.filterwarnings('ignore')

### 2. Create a Spark session powered by Cloud Dataproc 

In [2]:
spark = SparkSession.builder.appName('Loan Analysis').getOrCreate()
spark.sparkContext.setLogLevel("WARN")

In [7]:
val = spark.conf.get("spark.sql.catalog.hdp_ctlg.warehouse")
print(val)

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir


### 3. Declare variables

In [4]:
project_id_output = !gcloud config list --format "value(core.project)" 2>/dev/null
PROJECT_ID = project_id_output[0]
print("PROJECT_ID: ", PROJECT_ID)

PROJECT_ID:  nikhim-iceberg-lab


In [5]:
project_name_output = !gcloud projects describe $PROJECT_ID | grep name | cut -d':' -f2 | xargs
PROJECT_NAME = project_name_output[0]
print("PROJECT_NAME: ", PROJECT_NAME)

PROJECT_NAME:  nikhim-iceberg-lab


In [6]:
project_number_output = !gcloud projects describe $PROJECT_ID | grep projectNumber | cut -d':' -f2 | xargs
PROJECT_NUMBER = project_number_output[0]
print("PROJECT_NUMBER: ", PROJECT_NUMBER)

PROJECT_NUMBER:  928505941962


In [8]:
DATA_LAKE_ROOT_PATH= f"gs://iceberg-data-bucket-{PROJECT_NUMBER}"


#### **Note:** Iceberg supports several catalog types for metadata tracking. This notebook explores two:

1. Hadoop catalog (metadata tracked through folders and files)
2. Hive catalog (metadata tracked through the Hive metastore)
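
Which catalog a table lands in is decided by Spark configuration. The cluster settings for this lab are not shown in the notebook, so the sketch below is only an assumption about what they might look like, based on Iceberg's documented Spark catalog properties. The helper name `iceberg_catalog_conf` and the warehouse path are hypothetical; only the catalog name `hdp_ctlg` and the property `spark.sql.catalog.hdp_ctlg.warehouse` appear in this notebook.

```python
def iceberg_catalog_conf(hadoop_catalog, hadoop_warehouse):
    """Build Spark properties for a Hadoop catalog plus a Hive-backed session catalog.

    Assumption: this mirrors Iceberg's documented catalog properties; the lab's
    real cluster configuration is not shown in this notebook.
    """
    prefix = f"spark.sql.catalog.{hadoop_catalog}"
    return {
        # Hadoop catalog: metadata is tracked purely through files in the warehouse.
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": hadoop_warehouse,
        # Hive catalog: the session catalog delegates metadata tracking to the Hive metastore.
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.type": "hive",
    }


# Hypothetical warehouse path, for illustration only.
conf = iceberg_catalog_conf("hdp_ctlg", "gs://example-bucket/iceberg-warehouse-dir")
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Properties like these are normally supplied when the cluster or session is created, which would explain why this notebook can simply call `getOrCreate()` and then read the warehouse path back with `spark.conf.get`.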

### 4. Create an unpartitioned Iceberg table in  **"Hadoop"** Catalog

In [10]:
# Fetch the Hadoop catalog warehouse directory for Iceberg. All Hadoop catalog tables are created under this directory.

ICEBERG_HDP_WAREHOUSE_DIR = spark.conf.get("spark.sql.catalog.hdp_ctlg.warehouse")

print("ICEBERG_HDP_WAREHOUSE_DIR = ",ICEBERG_HDP_WAREHOUSE_DIR)

ICEBERG_HDP_WAREHOUSE_DIR =  gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir


In [12]:
# Create an Iceberg table from the Parquet table using the Hadoop catalog
spark.sql("SELECT addr_state,loan_count FROM loan_db.loans_by_state_parquet") \
.writeTo("hdp_ctlg.loan_db.loans_by_state_iceberg_hdp") \
.using("iceberg") \
.createOrReplace()

23/02/10 15:19:57 WARN HadoopTableOperations: Error reading version hint file gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/version-hint.text
java.io.FileNotFoundException: Item not found: 'gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/version-hint.text'. Note, it is possible that the live version is still available but the requested generation is deleted.
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageExceptions.createFileNotFoundException(GoogleCloudStorageExceptions.java:46)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.open(GoogleCloudStorageImpl.java:824)
	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageFileSystem.open(GoogleCloudStorageFileSystem.java:325)
	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.<init>(GoogleH

**NOTE:** During table creation, Iceberg logs a **_FileNotFoundException_** at WARN level because **_version-hint.text_** does not yet exist in the table's metadata folder. This is expected for a new table and the statement does not fail: Iceberg goes on to create the folder structure and the files it requires.
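
The warning comes from the way a Hadoop catalog locates a table's current metadata file. The sketch below is a simplified model of that lookup, written from an understanding of Iceberg's behavior rather than taken from its source: read `version-hint.text` if it is present, otherwise fall back to the highest-numbered `vN.metadata.json`, and treat an empty folder as version 0 (a new table). The function name `find_current_version` and the dictionary standing in for the metadata folder are illustrative assumptions.

```python
import re

VERSION_FILE = re.compile(r"^v(\d+)\.metadata\.json$")


def find_current_version(metadata_files):
    """Return the current metadata version of a Hadoop-catalog table.

    `metadata_files` maps file name -> text content and stands in for the
    table's metadata/ folder. Simplified model, not Iceberg's actual code.
    """
    hint = metadata_files.get("version-hint.text")
    if hint is not None:
        return int(hint.strip())
    # No hint file (for example, a brand-new table): scan for vN.metadata.json.
    versions = []
    for name in metadata_files:
        match = VERSION_FILE.match(name)
        if match:
            versions.append(int(match.group(1)))
    # An empty folder means no version has been committed yet.
    return max(versions, default=0)


print(find_current_version({}))
print(find_current_version({"version-hint.text": "1\n", "v1.metadata.json": "{}"}))
```

The first call models the state at the time of the warning (nothing written yet), and the second models the state after `createOrReplace()` has committed the first metadata file.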

In [13]:
spark.sql("SHOW TABLES FROM hdp_ctlg.loan_db").show(truncate=False)

+---------+--------------------------+-----------+
|namespace|tableName                 |isTemporary|
+---------+--------------------------+-----------+
|loan_db  |loans_by_state_iceberg_hdp|false      |
+---------+--------------------------+-----------+



In [14]:
spark.sql("DESCRIBE FORMATTED hdp_ctlg.loan_db.loans_by_state_iceberg_hdp").show(truncate=False)

+----------------------------+-----------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                      |comment|
+----------------------------+-----------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                         |       |
|loan_count                  |bigint                                                                                         |       |
|                            |                                                                                               |       |
|# Partitioning              |                                                                                               |       |
|Not partitioned             |                         

In [16]:
#Take a peek at the Data Layout in hdp_ctlg warehouse directory
!gsutil ls -r $ICEBERG_HDP_WAREHOUSE_DIR 

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/:

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/:

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/:

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/data/:
gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/data/00000-0-3df93f8c-6e76-4966-91f4-4fef143630d2-00001.parquet

gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/:
gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/
gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/df295a30-7f8c-4a82-b4f2-19ffb7665ed0-m0.avro
gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp/metadata/snap-2630534839670009067-1-df295a

#### **Note**: Two folders are created for the table:

1. data - contains the Parquet file(s) holding the actual data
2. metadata - contains the metadata files associated with the table
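
To make the split between the two folders easier to see, a recursive listing can be grouped by the first folder below the table root. This is a minimal sketch: the helper name `group_by_folder` and the sample paths are hypothetical, and it assumes the listing has the shape printed by `gsutil ls -r` (folder headers ending in `:` and folder placeholders ending in `/`).

```python
from collections import defaultdict


def group_by_folder(table_root, object_paths):
    """Group object paths by their first folder below the table root."""
    root = table_root.rstrip("/") + "/"
    groups = defaultdict(list)
    for path in object_paths:
        path = path.strip()
        # Skip blank lines, other tables, folder headers and folder placeholders.
        if not path.startswith(root) or path.endswith((":", "/")):
            continue
        folder, _, rest = path[len(root):].partition("/")
        if rest:
            groups[folder].append(rest)
    return dict(groups)


# Hypothetical listing, shaped like `gsutil ls -r` output.
root = "gs://example-bucket/warehouse/loan_db/example_table"
listing = [
    root + "/:",
    "",
    root + "/data/:",
    root + "/data/part-00001.parquet",
    root + "/metadata/:",
    root + "/metadata/",
    root + "/metadata/v1.metadata.json",
    root + "/metadata/version-hint.text",
]
layout = group_by_folder(root, listing)
print(layout)
```

In the notebook, the listing could be captured with `listing = !gsutil ls -r $ICEBERG_HDP_WAREHOUSE_DIR` and passed to the same function along with the table's root path.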
    

In [18]:
#Version hint keeps track of the latest metadata file of the table. 

version_hint = !gsutil cat $ICEBERG_HDP_WAREHOUSE_DIR/loan_db/loans_by_state_iceberg_hdp/metadata/version-hint.text

METADATA_VERSION = version_hint[0]
print("METADATA_VERSION =",METADATA_VERSION)

METADATA_VERSION = 1


In [19]:
# Explore the contents of the latest metadata file. It tracks partition info, schema info, and the latest snapshot information

!gsutil cat {ICEBERG_HDP_WAREHOUSE_DIR}/loan_db/loans_by_state_iceberg_hdp/metadata/v{METADATA_VERSION}.metadata.json

{
  "format-version" : 1,
  "table-uuid" : "d81b513d-980d-40dd-9067-5b404de1c722",
  "location" : "gs://iceberg-spark-bucket-928505941962/iceberg-warehouse-dir/loan_db/loans_by_state_iceberg_hdp",
  "last-updated-ms" : 1676042397490,
  "last-column-id" : 2,
  "schema" : {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "loan_count",
      "required" : false,
      "type" : "long"
    } ]
  },
  "current-schema-id" : 0,
  "schemas" : [ {
    "type" : "struct",
    "schema-id" : 0,
    "fields" : [ {
      "id" : 1,
      "name" : "addr_state",
      "required" : false,
      "type" : "string"
    }, {
      "id" : 2,
      "name" : "loan_count",
      "required" : false,
      "type" : "long"
    } ]
  } ],
  "partition-spec" : [ ],
  "default-spec-id" : 0,
  "partition-specs" : [ {
    "spec-id" : 0,
    "fields" : [ ]
  } ],
  "last-part

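Reading the raw JSON works for a two-column table but gets tedious for wider ones. The sketch below pulls out the current schema and the default partition spec programmatically. The sample document is a trimmed stand-in that contains only keys visible in the output above (a real metadata file has more), and the function name `summarize_metadata` is an illustrative assumption.

```python
import json

# Trimmed sample with only keys that appear in the metadata file shown above.
SAMPLE_METADATA = """
{
  "format-version": 1,
  "current-schema-id": 0,
  "schemas": [{
    "type": "struct",
    "schema-id": 0,
    "fields": [
      {"id": 1, "name": "addr_state", "required": false, "type": "string"},
      {"id": 2, "name": "loan_count", "required": false, "type": "long"}
    ]
  }],
  "default-spec-id": 0,
  "partition-specs": [{"spec-id": 0, "fields": []}]
}
"""


def summarize_metadata(metadata_json):
    """Return format version, column types and partitioning status from table metadata."""
    meta = json.loads(metadata_json)
    # Pick the schema and partition spec that the metadata marks as current.
    schema = next(s for s in meta["schemas"] if s["schema-id"] == meta["current-schema-id"])
    spec = next(s for s in meta["partition-specs"] if s["spec-id"] == meta["default-spec-id"])
    return {
        "format_version": meta["format-version"],
        "columns": {field["name"]: field["type"] for field in schema["fields"]},
        "partitioned": bool(spec["fields"]),
    }


summary = summarize_metadata(SAMPLE_METADATA)
print(summary)
```

An empty `fields` list in the default partition spec is what marks this table as unpartitioned, which matches the `Not partitioned` line in the `DESCRIBE FORMATTED` output earlier.
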
### 5. Create Tables in **"Hive"** Catalog

In [20]:
# Fetch the Hive metastore warehouse directory

DPMS_NAME=f"iceberg-hms-{PROJECT_NUMBER}"
LOCATION="us-central1"

metastore_dir = !gcloud metastore services describe $DPMS_NAME --location $LOCATION |grep 'hive.metastore.warehouse.dir'| cut -d':' -f2- | xargs 
HIVE_METASTORE_WAREHOUSE_DIR = metastore_dir[0]

print("HIVE_METASTORE_WAREHOUSE_DIR =",HIVE_METASTORE_WAREHOUSE_DIR)

HIVE_METASTORE_WAREHOUSE_DIR = gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse


#### a. Create Unpartitioned Iceberg table
Note: The tables created in this section are used for the rest of the lab.

In [22]:
# Create an Iceberg table from the Parquet table
spark.sql("SELECT addr_state,loan_count FROM loan_db.loans_by_state_parquet") \
.writeTo("loan_db.loans_by_state_iceberg") \
.using("iceberg") \
.createOrReplace()

                                                                                

In [23]:
spark.sql("show tables from loan_db;").show(truncate=False)

+---------+----------------------+-----------+
|namespace|tableName             |isTemporary|
+---------+----------------------+-----------+
|loan_db  |loans_by_state_iceberg|false      |
|loan_db  |loans_by_state_parquet|false      |
|loan_db  |loans_cleansed_parquet|false      |
+---------+----------------------+-----------+



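Note: The **spark_catalog.loan_db** namespace is not the same as **hdp_ctlg.loan_db**. The two namespaces belong to different catalogs and are distinguished by their catalog prefix. An identifier without a prefix, such as loan_db.loans_by_state_iceberg, resolves to the session's default catalog, which is spark_catalog here.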
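
The effect of the catalog prefix can be sketched with a small resolver. This is a simplified model of how a multi-part identifier is split, not Spark's actual resolution logic; the function name `qualify` is hypothetical, and the model assumes the default catalog is `spark_catalog`.

```python
def qualify(identifier, known_catalogs, default_catalog="spark_catalog"):
    """Split an identifier into (catalog, namespace, table).

    Simplified model: the first part is treated as a catalog only if it
    names a registered catalog; otherwise the default catalog is used.
    """
    parts = identifier.split(".")
    if len(parts) > 1 and parts[0] in known_catalogs:
        catalog, rest = parts[0], parts[1:]
    else:
        catalog, rest = default_catalog, parts
    return catalog, ".".join(rest[:-1]), rest[-1]


catalogs = {"spark_catalog", "hdp_ctlg"}
print(qualify("loan_db.loans_by_state_iceberg", catalogs))
print(qualify("hdp_ctlg.loan_db.loans_by_state_iceberg_hdp", catalogs))
```

Both identifiers use the namespace `loan_db`, yet they resolve to different catalogs, which is why `SHOW TABLES FROM loan_db` and `SHOW TABLES FROM hdp_ctlg.loan_db` return different tables.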

In [24]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_iceberg").show(truncate=False)

+----------------------------+---------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                            |comment|
+----------------------------+---------------------------------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                                               |       |
|loan_count                  |bigint                                                                                                               |       |
|                            |                                                                                                                     |       |
|# Partitioning              |                            

**NOTE:** Hive catalog tables are created under the Hive metastore warehouse directory by default.
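
The default location follows the `<warehouse>/<database>.db/<table>` pattern that the `gsutil ls` commands in section 6 rely on. A minimal sketch of that convention is shown here; the helper name `default_hive_table_location` and the bucket name are hypothetical, and a database or table created with an explicit location would not follow it.

```python
def default_hive_table_location(warehouse_dir, database, table):
    """Return the conventional default path of a Hive catalog table."""
    return f"{warehouse_dir.rstrip('/')}/{database}.db/{table}"


# Hypothetical warehouse path, for illustration only.
location = default_hive_table_location(
    "gs://example-bucket/hive-warehouse/", "loan_db", "loans_by_state_iceberg"
)
print(location)
```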

In [25]:
spark.sql("select * from loan_db.loans_by_state_iceberg limit 2").show()

+----------+----------+
|addr_state|loan_count|
+----------+----------+
|        AZ|     10318|
|        SC|      5460|
+----------+----------+



                                                                                

#### b. Create Partitioned Iceberg table

In [29]:
# Create Iceberg partitioned table from the Parquet table
spark.sql("SELECT addr_state,loan_count FROM loan_db.loans_by_state_parquet") \
.writeTo("loan_db.loans_by_state_iceberg_partitioned") \
.partitionedBy("addr_state") \
.using("iceberg") \
.createOrReplace()

                                                                                

In [30]:
spark.sql("DESCRIBE FORMATTED loan_db.loans_by_state_iceberg_partitioned").show(truncate=False)

+----------------------------+---------------------------------------------------------------------------------------------------------------------------------+-------+
|col_name                    |data_type                                                                                                                        |comment|
+----------------------------+---------------------------------------------------------------------------------------------------------------------------------+-------+
|addr_state                  |string                                                                                                                           |       |
|loan_count                  |bigint                                                                                                                           |       |
|                            |                                                                                                                             

### 6. A quick peek at the data layout in the Hive metastore warehouse directory

Like the Hadoop catalog, the Hive catalog creates data and metadata folders for both partitioned and unpartitioned tables.
One noticeable difference is that the Hive catalog does not create a **version-hint.text** file, because it tracks the current metadata file in the table's Hive metastore entry (the `metadata_location` table property) instead.
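
The two catalogs also name their metadata files differently: the Hadoop catalog table in section 4 uses `v<N>.metadata.json`, while the Hive catalog listing that follows shows a zero-padded counter followed by a UUID. The sketch below tells the two styles apart and extracts the counter. The regular expressions are inferred from the file names visible in this notebook and are an assumption, not a formal specification; the function name `metadata_version` is hypothetical.

```python
import re

# Patterns inferred from the file names shown in this notebook.
HADOOP_STYLE = re.compile(r"^v(\d+)\.metadata\.json$")
HIVE_STYLE = re.compile(r"^(\d+)-[0-9a-f-]+\.metadata\.json$")


def metadata_version(file_name):
    """Return (style, version) for a metadata file name, or None if it is not one."""
    for style, pattern in (("hadoop", HADOOP_STYLE), ("hive", HIVE_STYLE)):
        match = pattern.match(file_name)
        if match:
            return style, int(match.group(1))
    return None


print(metadata_version("v1.metadata.json"))
print(metadata_version("00000-e9f97377-9cf7-4d6b-a97f-cde34a16cb51.metadata.json"))
print(metadata_version("version-hint.text"))
```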

In [26]:
!gsutil ls -r $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/loans_by_state_iceberg

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/:

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/:
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/data/00000-2-b2d1021e-d3ad-4aff-baa7-1c740e8a3144-00001.parquet

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/:
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/00000-e9f97377-9cf7-4d6b-a97f-cde34a16cb51.metadata.json
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg/metadata/d37b2f20-215b-4690-9fdf-1e59a8e68f24-m0.avro
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/lo

In [31]:
!gsutil ls -r $HIVE_METASTORE_WAREHOUSE_DIR/loan_db.db/loans_by_state_iceberg_partitioned

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/:

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/:

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AK/:
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AK/00000-7-b52442f2-5e23-498a-be5f-548faba5a686-00047.parquet

gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AL/:
gs://gcs-bucket-iceberg-hms-928505941962-71d67f3e-cf27-4b25-a996-86a/hive-warehouse/loan_db.db/loans_by_state_iceberg_partitioned/data/addr_state=AL/00000-7-b52442f2-5e23-498a-be5f-548faba5a686-00031.parquet

gs://gcs-
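
In the partitioned table, every data file sits under an `addr_state=<value>` folder. The sketch below counts data files per partition value from such a listing. The helper names and sample paths are hypothetical. Note that Iceberg itself records each file's partition values in its manifest files, so this path parsing is only a convenience for inspecting the layout, not how Iceberg finds data.

```python
from collections import Counter


def partition_value(path, column="addr_state"):
    """Return the value of `column` from a path segment like addr_state=AK, or None."""
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key == column:
            return value
    return None


def files_per_partition(paths, column="addr_state"):
    """Count Parquet data files per partition value."""
    counts = Counter()
    for path in paths:
        if not path.endswith(".parquet"):
            continue  # skip folder headers and metadata files
        value = partition_value(path, column)
        if value is not None:
            counts[value] += 1
    return counts


# Hypothetical listing, shaped like the `gsutil ls -r` output above.
sample_paths = [
    "gs://example-bucket/tbl/data/addr_state=AK/:",
    "gs://example-bucket/tbl/data/addr_state=AK/file-a.parquet",
    "gs://example-bucket/tbl/data/addr_state=AL/file-b.parquet",
    "gs://example-bucket/tbl/metadata/v1.metadata.json",
]
counts = files_per_partition(sample_paths)
print(dict(counts))
```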

### THIS CONCLUDES THIS UNIT. PROCEED TO THE NEXT NOTEBOOK