# Using Apache Iceberg with Spark 3 in CML

The official documentation for using Apache Iceberg with Spark is available at [this link](https://iceberg.apache.org/#getting-started/#using-iceberg-in-spark-3).

For definitions of Apache Iceberg terms, see [this link](https://iceberg.apache.org/#terms/).

In [6]:
spark.sql("DROP TABLE IF EXISTS spark_catalog.testdb.newtesttable")
spark.sql("DROP TABLE IF EXISTS spark_catalog.testdb.secondtesttable")

Hive Session ID = b7dbfa88-a109-42da-949b-2011342a9f60


DataFrame[]

In [22]:
#spark.stop()

### Start a PySpark session as shown below, setting the Spark catalog configurations that Iceberg needs

In [5]:
"""SimpleApp.py"""
from pyspark.sql import SparkSession
from pyspark.sql import functions as F  # explicit alias; a wildcard import would shadow builtins such as sum and max

spark = SparkSession.builder\
  .appName("1.1 - Ingest") \
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2")\
  .config("spark.yarn.access.hadoopFileSystems", "s3a://demo-aws-go02")\
  .config("spark.jars","/home/cdsw/lib/iceberg-spark3-runtime-0.9.1.1.13.317211.0-9.jar") \
  .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
  .config("spark.sql.catalog.local","org.apache.iceberg.spark.SparkCatalog") \
  .config("spark.sql.catalog.local.type","hadoop") \
  .config("spark.sql.catalog.spark_catalog.type","hive") \
  .getOrCreate()

Setting spark.hadoop.yarn.resourcemanager.principal to pauldefusco


### Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. 
### Catalogs are configured using properties under spark.sql.catalog.(catalog_name).
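As a minimal sketch of that naming scheme, the helper below builds the property keys for a given catalog name. The function `catalog_properties` is hypothetical and not part of Spark or Iceberg; the values mirror the session configuration used earlier in this notebook.

```python
def catalog_properties(name, implementation, catalog_type):
    """Return the Spark properties that define one Iceberg catalog.

    Hypothetical helper for illustration only; Spark just sees the
    resulting key/value pairs.
    """
    prefix = "spark.sql.catalog.{}".format(name)
    return {
        prefix: implementation,
        prefix + ".type": catalog_type,
    }


# The two catalogs configured in the session above.
session_catalog = catalog_properties(
    "spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog", "hive"
)
local_catalog = catalog_properties(
    "local", "org.apache.iceberg.spark.SparkCatalog", "hadoop"
)

for key, value in {**session_catalog, **local_catalog}.items():
    print(key, "=", value)
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` instead of being written out by hand.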

In [7]:
# Use the session catalog (spark_catalog), which is configured above with type "hive"

#spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.newjar")
spark.sql("USE spark_catalog.testdb")
spark.sql("SHOW CURRENT NAMESPACE").show()
#spark.sql("DROP TABLE testtable")

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|   testdb|
+-------------+---------+



### You can use simple Spark SQL commands to create Spark tables as you always have. Just make sure to specify the USING iceberg clause.

In [8]:
spark.sql("CREATE TABLE IF NOT EXISTS newtesttable (id bigint, data string) USING iceberg")

DataFrame[]

### To select a specific table snapshot or the snapshot at some time, Iceberg supports two Spark read options:

* `snapshot-id` selects a specific table snapshot by its ID
* `as-of-timestamp` selects the snapshot that was current at a given timestamp, expressed in milliseconds since the epoch

#### You can view all snapshots associated with the table

In [9]:
spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable")

DataFrame[id: bigint, data: string]

In [10]:
#spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.snapshots").show(20, False)

#### Or a full table version history 

In [11]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.history").show(20, False)

+---------------+-----------+---------+-------------------+
|made_current_at|snapshot_id|parent_id|is_current_ancestor|
+---------------+-----------+---------+-------------------+
+---------------+-----------+---------+-------------------+



#### To show a table’s data files and each file’s metadata, run:

In [12]:
#spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.files").show(20, False)

### A manifest file is a metadata file that lists a subset of data files that make up a snapshot.

### Each data file in a manifest is stored with a partition tuple, column-level stats, and summary information used to prune splits during scan planning.
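To make the pruning idea concrete, here is a simplified sketch in plain Python. It is not Iceberg's implementation: the `DataFileEntry` class, the file names, and the bounds are all made up, and only the lower and upper bound of a single column are modeled.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """Simplified stand-in for one data file entry in a manifest."""
    path: str
    id_min: int  # lower bound of the id column in this file
    id_max: int  # upper bound of the id column in this file


def prune_for_id_equals(entries, wanted_id):
    """Keep only files whose [id_min, id_max] range could contain wanted_id."""
    return [e for e in entries if e.id_min <= wanted_id <= e.id_max]


# Made-up entries; real manifests are written and read by Iceberg itself.
manifest = [
    DataFileEntry("file-a.parquet", 1, 3),
    DataFileEntry("file-b.parquet", 4, 7),
    DataFileEntry("file-c.parquet", 8, 10),
]

candidates = prune_for_id_equals(manifest, 5)
print([e.path for e in candidates])
```

A filter such as `id = 5` only needs to open the files that survive this check, which is why column-level stats reduce the work done during scan planning.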

#### To show a table’s file manifests and each file’s metadata, run:

In [13]:
#spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.manifests").show(20, False)

## Time Travel

### Because each write creates a new snapshot, we can insert data into the table and still read it as it was at an earlier point in time

In [14]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES (1, 'x'), (2, 'y'), (3, 'z')")

                                                                                

DataFrame[]

In [15]:
# Query using select
spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable").show()

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



                                                                                

In [16]:
# Query using DF - All Data
df = spark.table("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



In [17]:
from datetime import datetime

# current date and time
now = datetime.now()

timestamp = datetime.timestamp(now)
print("timestamp =", timestamp)

timestamp = 1653497360.858986


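#### Timestamps can be tricky. `as-of-timestamp` expects whole milliseconds, while `datetime.timestamp()` returns fractional seconds, so multiply by 1000 and convert to an integer as shown below.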
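One way to avoid float rounding altogether is to do the conversion with integer arithmetic. The helper `to_epoch_millis` below is a hypothetical name, and the sketch assumes a timezone-aware datetime; the sample value is the timestamp printed above, rebuilt as a UTC datetime.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(moment):
    """Convert a timezone-aware datetime to whole milliseconds since the epoch."""
    # Floor division of two timedeltas returns an int, so no float is involved.
    return (moment - EPOCH) // timedelta(milliseconds=1)


# 1653497360.858986 seconds since the epoch, expressed as a UTC datetime.
moment = datetime(2022, 5, 25, 16, 49, 20, 858986, tzinfo=timezone.utc)
as_of_millis = to_epoch_millis(moment)
print(as_of_millis)
```

The resulting integer could be passed directly as `.option("as-of-timestamp", as_of_millis)`.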

In [18]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



                                                                                

In [19]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES (1, 'd'), (2, 'e'), (3, 'f')")

DataFrame[]

### Let's insert more data into the table

In [20]:
# Insert using Iceberg format
import string
import random

for i in range(25):
    number = random.randint(0, 10)
    letter = random.choice(string.ascii_letters)
    spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES ({}, '{}')".format(number, letter))

### Now let's read the data again with the same timestamp as before. Notice that only the rows that existed at that time are returned, not the rows we just inserted.
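The following is a simplified model of why the same timestamp keeps returning the same rows. It assumes the lookup picks the most recent snapshot that became current at or before the requested time; it is a sketch of that rule, not Iceberg's code. The history entries are the first three rows of the history output shown further down, converted to epoch milliseconds on the assumption that the output is rendered in UTC.

```python
def snapshot_as_of(history, as_of_millis):
    """Return the id of the snapshot that was current at as_of_millis.

    history is a list of (made_current_at_millis, snapshot_id) pairs.
    """
    current = None
    for made_current_at, snapshot_id in sorted(history):
        if made_current_at <= as_of_millis:
            current = snapshot_id
    if current is None:
        raise ValueError("no snapshot existed at the requested time")
    return current


# Assumption: made_current_at values from the history table, read as UTC.
history = [
    (1653497358821, 7674920815033026630),  # 2022-05-25 16:49:18.821
    (1653497369308, 1336987487715714838),  # 2022-05-25 16:49:29.308
    (1653497372320, 7194407856781086301),  # 2022-05-25 16:49:32.32
]

# int(timestamp * 1000) for the timestamp captured earlier in the notebook.
as_of = 1653497360858
print(snapshot_as_of(history, as_of))
```

Under these assumptions the requested time falls between the first and second snapshot, so the read resolves to the first snapshot no matter how many inserts follow.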

In [21]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



### Observe that many new snapshots have been created, one for each INSERT statement
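Each INSERT statement commits its own snapshot, so a loop of single-row inserts produces one history entry per row. If fewer snapshots are preferred, the rows can be batched into one statement. The helper `build_insert` below is hypothetical and only builds the SQL text; it rejects values containing quotes or backslashes rather than guessing at escaping rules.

```python
def build_insert(table, rows):
    """Build a single INSERT statement covering every (id, data) row."""
    values = []
    for row_id, data in rows:
        text = str(data)
        if "'" in text or "\\" in text:
            raise ValueError("unsupported character in value: {!r}".format(text))
        values.append("({}, '{}')".format(int(row_id), text))
    if not values:
        raise ValueError("rows must not be empty")
    return "INSERT INTO {} VALUES {}".format(table, ", ".join(values))


statement = build_insert(
    "spark_catalog.testdb.newtesttable", [(1, "a"), (2, "b"), (3, "c")]
)
print(statement)
```

Passing `statement` to `spark.sql` should then write all three rows in a single commit.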

In [22]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.history").show(10, False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2022-05-25 16:49:18.821|7674920815033026630|null               |true               |
|2022-05-25 16:49:29.308|1336987487715714838|7674920815033026630|true               |
|2022-05-25 16:49:32.32 |7194407856781086301|1336987487715714838|true               |
|2022-05-25 16:49:33.34 |1366768287190112582|7194407856781086301|true               |
|2022-05-25 16:49:34.344|7096417147095079776|1366768287190112582|true               |
|2022-05-25 16:49:35.354|2022712147794752114|7096417147095079776|true               |
|2022-05-25 16:49:36.422|611464847134655691 |2022712147794752114|true               |
|2022-05-25 16:49:37.476|6160926064314366538|611464847134655691 |true               |
|2022-05-25 16:49:38.489|4554987325204184770|616092606

### You can also query the table in its previous state as of a specific snapshot.

#### Copy a snapshot_id from the history above and paste it into the next Spark command
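To see which commits a given snapshot includes, you can follow the `parent_id` column back to the first snapshot. The sketch below does this in plain Python with pairs copied from the history output above; the function name `lineage` is made up for illustration.

```python
# (snapshot_id, parent_id) pairs copied from the history output above.
HISTORY = [
    (7674920815033026630, None),
    (1336987487715714838, 7674920815033026630),
    (7194407856781086301, 1336987487715714838),
    (1366768287190112582, 7194407856781086301),
    (7096417147095079776, 1366768287190112582),
]


def lineage(history, snapshot_id):
    """Return snapshot ids from the oldest ancestor down to snapshot_id."""
    parents = dict(history)
    chain = []
    current = snapshot_id
    while current is not None:
        chain.append(current)
        current = parents[current]  # raises KeyError for an unknown id
    return list(reversed(chain))


chain = lineage(HISTORY, 1366768287190112582)
print(len(chain), "snapshots:", chain)
```

The chain for snapshot 1366768287190112582 contains four commits: the two three-row inserts and the first two single-row inserts from the loop. That is consistent with the eight rows returned by the query that follows.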

In [24]:
spark.read\
    .option("snapshot-id", 1366768287190112582)\
    .table("spark_catalog.testdb.newtesttable").show()

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
|  1|   d|
|  2|   e|
|  3|   f|
|  8|   j|
|  3|   E|
+---+----+



### With Iceberg you can also create tables directly from Spark DataFrames, and more

In [25]:
new_df = spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable").sample(fraction=0.5, seed=3)

In [26]:
new_df.dtypes

[('id', 'bigint'), ('data', 'string')]

In [27]:
new_df.show(10)

+---+----+
| id|data|
+---+----+
|  8|   p|
|  5|   v|
| 10|   B|
|  3|   E|
| 10|   p|
|  8|   T|
|  1|   d|
|  2|   e|
| 10|   L|
|  5|   m|
+---+----+
only showing top 10 rows



                                                                                

#### Creating a new Spark Table with the API and Loading It

In [28]:
#spark.sql("DROP TABLE IF EXISTS spark_catalog.testdb.secondtesttable")

In [29]:
new_df.writeTo("spark_catalog.testdb.secondtesttable").create()

                                                                                

In [30]:
new_df_nodups = new_df.dropDuplicates(["id"])

In [31]:
new_df_nodups.show()

+---+----+
| id|data|
+---+----+
|  1|   d|
|  2|   e|
|  3|   E|
|  5|   v|
|  7|   t|
|  8|   p|
| 10|   B|
+---+----+



                                                                                

#### More ETL SQL Operations

In [32]:
spark.sql("INSERT INTO spark_catalog.testdb.secondtesttable SELECT * FROM spark_catalog.testdb.newtesttable")

                                                                                

DataFrame[]

In [33]:
new_df.count()

                                                                                

14

In [34]:
second_df = spark.sql("SELECT * FROM spark_catalog.testdb.secondtesttable") 

In [35]:
second_df.count()

45

In [36]:
second_df = second_df.withColumn("id", second_df.id * 3)

In [37]:
sec_df_nodups = second_df.dropDuplicates(["id"])

In [38]:
sec_df_nodups.show()

                                                                                

+---+----+
| id|data|
+---+----+
|  0|   J|
|  6|   e|
|  9|   E|
| 27|   K|
|  3|   d|
| 12|   K|
| 18|   K|
| 21|   t|
| 15|   v|
| 30|   B|
| 24|   p|
+---+----+



In [40]:
#new_df_nodups.writeTo("spark_catalog.testdb.new_df_nodups").create()

In [41]:
#sec_df_nodups.writeTo("spark_catalog.testdb.sec_df_nodups").create()

#### Update and Merge Into SQL Operations
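The intended effect of a MERGE with a matched-update branch and a not-matched-insert branch can be illustrated on plain dictionaries. This is only an emulation of the semantics with made-up rows, not Spark or Iceberg code: ids that exist in both inputs get the update's data combined with the existing data, and ids that exist only in the update are inserted.

```python
def merge_into(target, updates):
    """Emulate MERGE on plain dicts that map id -> data.

    Matched ids get the update's data prepended to the existing data;
    unmatched ids are inserted as new rows.
    """
    merged = dict(target)
    for row_id, data in updates.items():
        if row_id in merged:
            merged[row_id] = data + merged[row_id]  # WHEN MATCHED THEN UPDATE
        else:
            merged[row_id] = data  # WHEN NOT MATCHED THEN INSERT
    return merged


# Made-up sample rows for illustration.
target = {1: "d", 2: "e", 3: "E"}
updates = {3: "d", 6: "e", 9: "E"}

result = merge_into(target, updates)
print(result)
```

Rows in the target that have no match in the update are left untouched, which is the behavior the commented-out statements in this section rely on.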

In [94]:
#spark.sql("UPDATE spark_catalog.testdb.new_df_nodups SET data = '?' WHERE id IN (SELECT id FROM spark_catalog.testdb.sec_df_nodups)")

In [None]:
#spark.sql(
#"MERGE INTO spark_catalog.testdb.new_df_nodups t USING (SELECT * FROM spark_catalog.testdb.sec_df_nodups) u ON t.id = u.id \
#WHEN MATCHED THEN UPDATE SET t.data = concat(u.data, t.data) \
#WHEN NOT MATCHED THEN INSERT *")

In [None]:
#spark.sql("SELECT count(*) FROM spark_catalog.testdb.new_df_nodups").show() 

In [None]:
#spark.sql("SELECT * FROM spark_catalog.testdb.new_df_nodups").show() 