# Using Apache Iceberg with Spark 3 in CML

The official documentation for using Apache Iceberg with Spark is available at [this link](https://iceberg.apache.org/#getting-started/#using-iceberg-in-spark-3).

For a full list of Apache Iceberg terms, see [this link](https://iceberg.apache.org/#terms/).

In [3]:
import cml.data_v1 as cmldata

# Sample in-code customization of spark configurations
#from pyspark import SparkContext
#SparkContext.setSystemProperty('spark.executor.cores', '1')
#SparkContext.setSystemProperty('spark.executor.memory', '2g')

CONNECTION_NAME = "bco-cdp-prd-datalake"
conn = cmldata.get_connection(CONNECTION_NAME)
spark = conn.get_spark_session()

In [4]:
import os

username = os.environ["PROJECT_OWNER"]

#### You can create tables with ordinary Spark SQL commands, as you always have. Just make sure to include the `USING iceberg` clause.

In [5]:
spark.sql("CREATE TABLE IF NOT EXISTS proceso.{}_ice (id bigint, data string) USING iceberg".format(username))

24/03/19 23:55:46 WARN HiveMetaStoreClient: Failed to connect to the MetaStore Server...
24/03/19 23:55:47 WARN HiveClientImpl: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 71b95076-9aae-4287-80e8-21879de1b940
24/03/19 23:55:48 WARN HiveMetaStoreClient: Failed to connect to the MetaStore Server...


DataFrame[]

#### To select a specific table snapshot or the snapshot at some time, Iceberg supports two Spark read options:

* `snapshot-id` selects a specific table snapshot
* `as-of-timestamp` selects the snapshot that was current at a given timestamp, in milliseconds since the Unix epoch
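The two read options can be sketched as a small helper that builds the option for exactly one time-travel mode. The names `to_epoch_millis` and `time_travel_options` are hypothetical and not part of Iceberg or Spark. The helper also assumes that Iceberg rejects a read that sets both options at once, so it enforces a single choice up front.

```python
from datetime import datetime, timezone


def to_epoch_millis(moment):
    """Convert a datetime to whole milliseconds since the Unix epoch."""
    # Note: .timestamp() interprets a naive datetime in the local time zone.
    return int(moment.timestamp() * 1000)


def time_travel_options(snapshot_id=None, as_of=None):
    """Return the read option for exactly one time-travel mode (hypothetical helper)."""
    if (snapshot_id is None) == (as_of is None):
        raise ValueError("Pass exactly one of snapshot_id or as_of")
    if snapshot_id is not None:
        return {"snapshot-id": str(snapshot_id)}
    return {"as-of-timestamp": str(to_epoch_millis(as_of))}


# Usage with the session from this notebook (not executed here):
# reader = spark.read.format("iceberg")
# for key, value in time_travel_options(as_of=datetime.now()).items():
#     reader = reader.option(key, value)
# reader.load("proceso.{}_ice".format(username)).show()
```

Keeping the millisecond conversion in one place avoids the common mistake of passing seconds to `as-of-timestamp`.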

#### You can query the table directly; the snapshots associated with it are listed in its `snapshots` metadata table (`proceso.<username>_ice.snapshots`)

In [6]:
spark.sql("SELECT * FROM proceso.{}_ice".format(username))

DataFrame[id: bigint, data: string]

#### Or a full table version history 

In [7]:
spark.read.format("iceberg").load("proceso.{}_ice.history".format(username)).show(20, False)


+---------------+-----------+---------+-------------------+
|made_current_at|snapshot_id|parent_id|is_current_ancestor|
+---------------+-----------+---------+-------------------+
+---------------+-----------+---------+-------------------+

##### A manifest file is a metadata file that lists a subset of data files that make up a snapshot.

##### Each data file in a manifest is stored with a partition tuple, column-level stats, and summary information used to prune splits during scan planning.

##### To show a table’s file manifests and each file’s metadata, run:

In [8]:
spark.read.format("iceberg").load("proceso.{}_ice.manifests".format(username)).show(5, False)

+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
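The `history` and `manifests` reads above follow one pattern: append the metadata table name to the table identifier and load it with the `iceberg` format. A minimal sketch of that pattern follows. The helper names are hypothetical, and the tuple lists only a few of the standard Iceberg metadata tables.

```python
# Metadata tables used in this notebook plus a few other standard Iceberg ones.
METADATA_TABLES = ("history", "snapshots", "manifests", "files", "partitions")


def metadata_table(table, name):
    """Return the identifier of an Iceberg metadata table (hypothetical helper)."""
    if name not in METADATA_TABLES:
        raise ValueError("Unknown metadata table: {}".format(name))
    return "{}.{}".format(table, name)


def show_metadata(spark, table, name, rows=5):
    """Load a metadata table with the same read pattern used above and print it."""
    spark.read.format("iceberg").load(metadata_table(table, name)).show(rows, False)


# Usage with the session from this notebook (not executed here):
# show_metadata(spark, "proceso.{}_ice".format(username), "snapshots")
```

On a freshly created table these metadata tables are empty, as in the output above, because no snapshot has been committed yet.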



## Time Travel

### Using snapshots as shown above, we can insert some data into the table and roll back to its original state

In [9]:
# Insert using Iceberg format
spark.sql("INSERT INTO proceso.{}_ice VALUES (1, 'x'), (2, 'y'), (3, 'z')".format(username))

                                                                                

DataFrame[]

In [10]:
# Query using select
spark.sql("SELECT * FROM proceso.{}_ice".format(username)).show()


+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+

In [11]:
from datetime import datetime

# current date and time
now = datetime.now()

timestamp = datetime.timestamp(now)
print("timestamp =", timestamp)

timestamp = 1710892592.049442


In [12]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("proceso.{}_ice".format(username))
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



In [13]:
# Insert using Iceberg format
spark.sql("INSERT INTO proceso.{}_ice VALUES (1, 'd'), (2, 'e'), (3, 'f')".format(username))

DataFrame[]

#### Let's insert more data into the table

In [14]:
# Insert using Iceberg format
import string
import random

for i in range(25):
    number = random.randint(0, 10)
    letter = random.choice(string.ascii_letters)
    spark.sql("INSERT INTO proceso.{}_ice VALUES ({}, '{}')".format(username, number, letter))

                                                                                

#### Now let's read the data again, using the same timestamp as before. Notice that only the rows that existed at that time are returned, not the rows inserted afterwards.

In [15]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("proceso.{}_ice".format(username))
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+
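The result above contains only the original three rows because `as-of-timestamp` resolves to the last snapshot that became current at or before the given time. The pure-Python sketch below imitates that selection over a list of history entries. It is a simplified model for illustration, not Iceberg's implementation, and the timestamps and snapshot ids in the usage lines are made up.

```python
def snapshot_id_as_of(history, timestamp_ms):
    """Pick the snapshot that was current at timestamp_ms.

    history is a list of (made_current_at_ms, snapshot_id) pairs,
    ordered from oldest to newest, like the rows of the history table.
    """
    chosen = None
    for made_current_at_ms, snapshot_id in history:
        if made_current_at_ms <= timestamp_ms:
            chosen = snapshot_id  # keep the latest entry that is not in the future
    if chosen is None:
        raise ValueError("No snapshot existed at or before the given timestamp")
    return chosen


# Illustrative values only: three commits at 1000 ms, 2000 ms and 3000 ms.
example_history = [(1000, 11), (2000, 22), (3000, 33)]
print(snapshot_id_as_of(example_history, 2500))
```

A timestamp taken between two commits therefore always maps to the earlier of the two, which is why later inserts do not appear in the read.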



### Observe that many new snapshots have been created: each INSERT statement commits one new snapshot

In [16]:
spark.read.format("iceberg").load("proceso.{}_ice.history".format(username)).show(10, False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2024-03-19 23:56:29.051|9053106843819164994|null               |true               |
|2024-03-19 23:56:36.09 |4697307940450426611|9053106843819164994|true               |
|2024-03-19 23:56:44.493|538970338601657260 |4697307940450426611|true               |
|2024-03-19 23:56:46.104|8214684284224589907|538970338601657260 |true               |
|2024-03-19 23:56:47.452|6254021437124090673|8214684284224589907|true               |
|2024-03-19 23:56:48.781|3383582963401032161|6254021437124090673|true               |
|2024-03-19 23:56:50.073|4335001182549724735|3383582963401032161|true               |
|2024-03-19 23:56:51.451|8369306874504182592|4335001182549724735|true               |
|2024-03-19 23:56:52.679|4124923018102035572|836930687

### You can also query the table in its previous state as of a specific snapshot.

#### Copy a snapshot_id from the history above and paste it into the next Spark command

In [18]:
spark.read\
    .option("snapshot-id", 538970338601657260)\
    .table("proceso.{}_ice".format(username)).show()

+---+----+
| id|data|
+---+----+
|  1|   d|
|  2|   e|
|  3|   f|
|  4|   T|
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+
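Reading with `snapshot-id` leaves the table itself unchanged. To actually roll the table back, Iceberg provides a `rollback_to_snapshot` stored procedure that is invoked through `CALL`. The sketch below only builds the statement. It assumes that the Iceberg SQL extensions are enabled in the session and that the catalog is named `spark_catalog`; neither has been verified for this CML connection, and `rollback_statement` is a hypothetical helper.

```python
def rollback_statement(table, snapshot_id, catalog="spark_catalog"):
    """Build a CALL statement for Iceberg's rollback_to_snapshot procedure."""
    # int() ensures only a numeric snapshot id is interpolated into the SQL text.
    return "CALL {}.system.rollback_to_snapshot('{}', {})".format(
        catalog, table, int(snapshot_id)
    )


# Usage (not executed here, because it changes the table's current state).
# The id is the first snapshot listed in the history output above:
# spark.sql(rollback_statement("proceso.{}_ice".format(username), 9053106843819164994))
```

After a rollback, a plain `SELECT` returns the rows of the chosen snapshot, while the later snapshots remain visible in the `history` metadata table.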



### The Iceberg API allows you to create tables from Spark DataFrames, and more

In [19]:
new_df = spark.sql("SELECT * FROM proceso.{}_ice".format(username)).sample(fraction=0.5, seed=3)

In [20]:
new_df.dtypes

[('id', 'bigint'), ('data', 'string')]

In [21]:
new_df.show(10)


+---+----+
| id|data|
+---+----+
|  4|   B|
|  1|   C|
|  4|   p|
|  9|   i|
|  6|   p|
|  6|   D|
|  2|   e|
|  5|   Z|
|  2|   s|
|  6|   r|
+---+----+
only showing top 10 rows
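A sampled DataFrame such as `new_df` can be written out as a new Iceberg table with Spark's `DataFrameWriterV2` API (`writeTo`). This is a minimal sketch: the function name and the target table name in the usage line are hypothetical, and it assumes the user may create tables in the `proceso` database.

```python
def save_as_iceberg_table(df, table_name):
    """Create a new Iceberg table from a DataFrame (hypothetical helper).

    Uses DataFrameWriterV2: writeTo(...).using("iceberg").create().
    create() is expected to fail if the table already exists.
    """
    df.writeTo(table_name).using("iceberg").create()


# Usage with the objects from this notebook (not executed here):
# save_as_iceberg_table(new_df, "proceso.{}_ice_sample".format(username))
```

The new table starts with its own snapshot history, independent of the table the sample was taken from.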



                                                                                