# Using Apache Iceberg with Spark 3 in CML

The official documentation for using Apache Iceberg with Spark is located at [this link](https://iceberg.apache.org/#getting-started/#using-iceberg-in-spark-3).

For a full list of Apache Iceberg terms, see [this link](https://iceberg.apache.org/#terms/).

### Start a PySpark session and set the Spark catalog configurations as shown below

In [1]:
"""from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
  .config("spark.jars.packages","org.apache.iceberg:iceberg-spark3-runtime:0.12.1") \
  .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
  .config("spark.sql.catalog.spark_catalog.type","hive") \
  .getOrCreate()"""

"""SimpleApp.py"""
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder\
  .appName("1.1 - Ingest") \
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2")\
  .config("spark.yarn.access.hadoopFileSystems", "s3a://demo-aws-go02")\
  .config("spark.jars","/home/cdsw/lib/iceberg-spark3-runtime-0.9.1.1.13.317211.0-9.jar") \
  .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
  .config("spark.sql.catalog.spark_catalog.type","hive") \
  .getOrCreate()

### Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. 
### Catalogs are configured using properties under spark.sql.catalog.(catalog_name).
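
To make the property naming pattern concrete, here is a minimal sketch of a helper that builds the properties for a named Iceberg catalog. The helper name `iceberg_catalog_conf`, the catalog name `local`, and the warehouse path are illustrative assumptions, not part of the session configured above. `org.apache.iceberg.spark.SparkCatalog` is the Iceberg catalog class for catalogs other than the built-in `spark_catalog`.

```python
def iceberg_catalog_conf(name, catalog_type="hive", warehouse=None):
    """Build the Spark properties that define an Iceberg catalog called `name`."""
    prefix = "spark.sql.catalog.{}".format(name)
    conf = {
        # The catalog implementation is registered under spark.sql.catalog.<name>
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Catalog-specific settings live under the same prefix
        prefix + ".type": catalog_type,
    }
    if warehouse is not None:
        conf[prefix + ".warehouse"] = warehouse
    return conf


# Hypothetical catalog name and path, for illustration only
for key, value in iceberg_catalog_conf("local", "hadoop", "/tmp/iceberg-warehouse").items():
    print(key, "=", value)
```

Each resulting pair can be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.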

In [2]:
  # Using a local Spark Catalog

spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.testdb")
spark.sql("USE spark_catalog.testdb")
spark.sql("SHOW CURRENT NAMESPACE").show()
#spark.sql("DROP TABLE testtable")

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|   testdb|
+-------------+---------+



### You can use simple Spark SQL commands to create Spark tables as you always have. Just make sure to specify the USING iceberg clause.

In [3]:
spark.sql("CREATE TABLE IF NOT EXISTS newtesttable (id bigint, data string) USING iceberg")

DataFrame[]

### To select a specific table snapshot or the snapshot at some time, Iceberg supports two Spark read options:

* snapshot-id selects a specific table snapshot
* as-of-timestamp selects the current snapshot at a timestamp, in milliseconds
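
The two read options can be wrapped in small helpers. This is a sketch under assumptions: the function names `read_snapshot` and `read_as_of` are hypothetical, and `spark` is expected to be an active session configured with the Iceberg runtime as shown earlier.

```python
def read_snapshot(spark, table, snapshot_id):
    """Load `table` as it was at the snapshot with the given ID."""
    return (
        spark.read
        .option("snapshot-id", snapshot_id)
        .format("iceberg")
        .load(table)
    )


def read_as_of(spark, table, epoch_millis):
    """Load `table` as of a point in time, in milliseconds since the Unix epoch."""
    return (
        spark.read
        .option("as-of-timestamp", int(epoch_millis))
        .format("iceberg")
        .load(table)
    )
```

For example, `read_snapshot(spark, "spark_catalog.testdb.testtable", some_id).show()` would display the rows of one snapshot, where `some_id` is a value taken from the table's `snapshot_id` column.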

#### You can view all snapshots associated with the table

In [4]:
spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable")

DataFrame[id: bigint, data: string]

In [5]:
spark.read.format("iceberg").load("spark_catalog.testdb.testtable.snapshots").show(20, False)

+-----------------------+-------------------+-------------------+---------+--------------------------------------------------------------------------------------------------------------------------------------------------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at           |snapshot_id        |parent_id          |operation|manifest_list                                                                                                                                           |summary                                                                                                                                                                                                                                                    

#### Or a full table version history 

In [6]:
spark.read.format("iceberg").load("spark_catalog.testdb.testtable.history").show(20, False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2022-02-17 18:25:41.503|3386067383611655633|null               |true               |
|2022-02-17 18:25:47.601|555494963528835889 |3386067383611655633|true               |
|2022-02-17 18:26:26.741|6919659684029084940|555494963528835889 |true               |
|2022-02-17 18:50:33.923|7094097289391594974|6919659684029084940|true               |
|2022-02-17 18:51:14.966|4337728976320510621|7094097289391594974|true               |
|2022-02-27 21:27:55.773|81890142028154896  |4337728976320510621|true               |
|2022-02-27 21:29:02.726|3184551259780646899|81890142028154896  |true               |
|2022-02-27 22:52:34.715|6113538126025083294|3184551259780646899|true               |
|2022-02-27 22:52:41.881|330895133654850392 |611353812

#### To show a table’s data files and each file’s metadata, run:

In [7]:
spark.read.format("iceberg").load("spark_catalog.testdb.testtable.files").show(20, False)

+-------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+------------------+------------------+----------------+-----------------+----------------+-----------------------+-----------------------+------------+-------------+------------+
|content|file_path                                                                                                                                  |file_format|record_count|file_size_in_bytes|column_sizes      |value_counts    |null_value_counts|nan_value_counts|lower_bounds           |upper_bounds           |key_metadata|split_offsets|equality_ids|
+-------+-------------------------------------------------------------------------------------------------------------------------------------------+-----------+------------+------------------+------------------+----------------+-----------------+----------------+-----------------------+------

### A manifest file is a metadata file that lists a subset of data files that make up a snapshot.

### Each data file in a manifest is stored with a partition tuple, column-level stats, and summary information used to prune splits during scan planning.

#### To show a table’s file manifests and each file’s metadata, run:

In [8]:
spark.read.format("iceberg").load("spark_catalog.testdb.testtable.manifests").show(20, False)

+--------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+-------------------+
|path                                                                                                                            |length|partition_spec_id|added_snapshot_id  |added_data_files_count|existing_data_files_count|deleted_data_files_count|partition_summaries|
+--------------------------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+-------------------+
|s3a://demo-aws-go02/warehouse/tablespace/external/hive/testdb.db/testtable/metadata/b4aceee5-0e2f-4d85-a47f-6c24de143bdd-m0.avro|5658  |0                |333250371788972430 |2              
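
The four metadata tables queried above share one naming pattern: the table identifier followed by a suffix. A minimal sketch of a helper that builds and loads these names follows. `metadata_table` and `load_metadata` are hypothetical names, and the list is limited to the suffixes used in this notebook.

```python
# Metadata table suffixes used in this notebook
METADATA_TABLES = ("snapshots", "history", "files", "manifests")


def metadata_table(table, kind):
    """Return the identifier of one of a table's metadata tables."""
    if kind not in METADATA_TABLES:
        raise ValueError("unknown metadata table: {!r}".format(kind))
    return "{}.{}".format(table, kind)


def load_metadata(spark, table, kind):
    """Load a metadata table as a DataFrame using the Iceberg reader."""
    return spark.read.format("iceberg").load(metadata_table(table, kind))


print(metadata_table("spark_catalog.testdb.testtable", "history"))
```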

## Time Travel

### Using snapshots as shown above, we can insert some data into the table and roll back to its original state

In [9]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.testtable VALUES (1, 'x'), (2, 'y'), (3, 'z')")

DataFrame[]

In [10]:
# Query using select
spark.sql("SELECT * FROM spark_catalog.testdb.testtable").show()

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
+---+----+
only showing top 20 rows



In [11]:
# Query using DF - All Data
df = spark.table("spark_catalog.testdb.testtable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
+---+----+



In [12]:
from datetime import datetime

# Current date and time
now = datetime.now()

# Seconds since the Unix epoch, as a float
timestamp = now.timestamp()
print("timestamp =", timestamp)

timestamp = 1646334004.441659


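#### Timestamps can be tricky. The as-of-timestamp option expects a whole number of milliseconds, while `datetime.timestamp()` returns seconds as a float, so multiply by 1000 and convert to an integer as the query below does.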
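
The conversion can also be done with integer arithmetic, which avoids floating-point rounding altogether. The helper name `to_epoch_millis` and the sample datetime are illustrative assumptions.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a datetime to whole milliseconds since the Unix epoch.

    Naive datetimes are interpreted as local time, as datetime.timestamp() does.
    """
    if moment.tzinfo is None:
        moment = moment.astimezone()  # attach the local timezone
    # Dividing two timedeltas with // yields an exact integer
    return (moment - EPOCH) // timedelta(milliseconds=1)


# Illustrative value; prints 1645122341503
print(to_epoch_millis(datetime(2022, 2, 17, 18, 25, 41, 503000, tzinfo=timezone.utc)))
```

The result can be passed directly as the value of the `as-of-timestamp` read option.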

In [13]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("spark_catalog.testdb.testtable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
+---+----+



In [14]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.testtable VALUES (1, 'd'), (2, 'e'), (3, 'f')")

DataFrame[]
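
The cells above read an older state of the table without changing it. To actually roll the table back, Iceberg provides a `rollback_to_snapshot` stored procedure. The sketch below rests on assumptions: the helper name is hypothetical, and it requires an Iceberg runtime and catalog that support stored procedures, which this notebook does not verify for the jar configured at the top.

```python
def rollback_to_snapshot(spark, catalog, table, snapshot_id):
    """Make `snapshot_id` the current snapshot of `table` again.

    Assumes the catalog exposes Iceberg's `system.rollback_to_snapshot` procedure.
    """
    statement = "CALL {}.system.rollback_to_snapshot('{}', {})".format(
        catalog, table, int(snapshot_id)
    )
    return spark.sql(statement)
```

A call such as `rollback_to_snapshot(spark, "spark_catalog", "testdb.testtable", 3386067383611655633)` would target the first snapshot listed in the history output earlier in this notebook.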

In [15]:
df.writeTo("spark_catalog.testdb.testtablethree").create()

#df.write.format("iceberg").mode("overwrite").save("testdb.testtabletwo")

In [16]:
df.writeTo("spark_catalog.testdb.testtabletwo").append()

AnalysisException: Cannot write into v1 table: `testdb`.`testtabletwo`
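
The exception suggests that `testtabletwo` exists as a plain Spark (v1) table rather than an Iceberg table, which the `writeTo` API cannot append to. A minimal sketch, assuming that is the cause: name the provider explicitly when creating the table, and append only to tables created that way. The helper name `write_iceberg` is hypothetical.

```python
def write_iceberg(df, table, append=False):
    """Create `table` as an Iceberg table from `df`, or append to it."""
    writer = df.writeTo(table)
    if append:
        # Appending requires that `table` already exists as an Iceberg table
        return writer.append()
    # Name the provider so the new table is not created as a plain Spark table
    return writer.using("iceberg").create()
```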

In [17]:
# Query using select
spark.sql("SELECT * FROM spark_catalog.testdb.testtablethree").show()

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   x|
|  2|   y|
+---+----+
only showing top 20 rows



In [None]:
import os
import xml.etree.ElementTree as ET

# Extract the cloud storage root (scheme plus bucket) from hive-site.xml
tree = ET.parse('/etc/hadoop/conf/hive-site.xml')
root = tree.getroot()

storage = None
for prop in root.findall('property'):
    if prop.find('name').text == "hive.metastore.warehouse.dir":
        # "s3a://bucket/path" splits into ["s3a:", "", "bucket", "path"]
        parts = prop.find('value').text.split("/")
        storage = parts[0] + "//" + parts[2]

# Fail with a clear message instead of a NameError if the property is missing
if storage is None:
    raise RuntimeError("hive.metastore.warehouse.dir not found in hive-site.xml")

print("The correct cloud storage URL is: {}".format(storage))

os.environ['STORAGE'] = storage
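
The string splitting above relies on the position of each slash. As an alternative sketch, `urllib.parse.urlsplit` can pick out the scheme and bucket by name. The function `storage_root` and the sample XML are illustrative assumptions; the sample only mimics the shape of a `hive-site.xml` entry.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def storage_root(hive_site_xml, prop_name="hive.metastore.warehouse.dir"):
    """Return 'scheme://bucket' for the warehouse directory, or None if absent."""
    root = ET.fromstring(hive_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == prop_name:
            url = urlsplit(prop.findtext("value", default="").strip())
            if url.scheme and url.netloc:
                return "{}://{}".format(url.scheme, url.netloc)
    return None


# Hypothetical hive-site.xml fragment, for illustration only
sample = """
<configuration>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>s3a://demo-aws-go02/warehouse/tablespace/external/hive</value>
  </property>
</configuration>
"""
print(storage_root(sample))
```

Returning `None` when the property is missing lets the caller decide whether to fail or fall back to a default.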