# Using Apache Iceberg with Spark 3 in CML

The official documentation for Apache Iceberg with Spark is available at [this link](https://iceberg.apache.org/#getting-started/#using-iceberg-in-spark-3).

For a full list of Apache Iceberg terms, see [this link](https://iceberg.apache.org/#terms/).

In [41]:
#spark.sql("DROP TABLE IF EXISTS spark_catalog.testdb.newtesttable")
#spark.sql("DROP TABLE IF EXISTS spark_catalog.testdb.secondtesttable")

### Start a PySpark session as shown below, setting the Spark catalog configurations as shown.

In [1]:
"""from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.master('local[*]') \
  .config("spark.jars.packages","org.apache.iceberg:iceberg-spark3-runtime:0.12.1") \
  .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
  .config("spark.sql.catalog.spark_catalog.type","hive") \
  .getOrCreate()"""

"""SimpleApp.py"""
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder\
  .appName("1.1 - Ingest") \
  .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-2")\
  .config("spark.yarn.access.hadoopFileSystems", "s3a://demo-aws-go02")\
  .config("spark.jars","/home/cdsw/lib/iceberg-spark3-runtime-0.9.1.1.13.317211.0-9.jar") \
  .config("spark.sql.extensions","org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
  .config("spark.sql.catalog.spark_catalog","org.apache.iceberg.spark.SparkSessionCatalog") \
  .config("spark.sql.catalog.spark_catalog.type","hive") \
  .getOrCreate()

### Iceberg comes with catalogs that enable SQL commands to manage tables and load them by name. 
### Catalogs are configured using properties under spark.sql.catalog.(catalog_name).
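
As a rough illustration of that naming scheme, the sketch below builds the property keys for one catalog. The helper name `iceberg_catalog_conf` and its `warehouse` argument are assumptions made for this example; the two catalog class names and the `type` property are the ones used in the session configuration above and in the Iceberg Spark documentation.

```python
def iceberg_catalog_conf(name, catalog_type="hive", session_catalog=False, warehouse=None):
    """Return the Spark properties that define one Iceberg catalog (hypothetical helper)."""
    prefix = "spark.sql.catalog." + name
    # SparkSessionCatalog wraps Spark's built-in catalog; SparkCatalog is a standalone catalog.
    impl = (
        "org.apache.iceberg.spark.SparkSessionCatalog"
        if session_catalog
        else "org.apache.iceberg.spark.SparkCatalog"
    )
    conf = {prefix: impl, prefix + ".type": catalog_type}
    if warehouse is not None:
        # Only needed for catalog types that require a warehouse path (assumption: e.g. "hadoop").
        conf[prefix + ".warehouse"] = warehouse
    return conf


conf = iceberg_catalog_conf("spark_catalog", session_catalog=True)
for key, value in conf.items():
    print(key, "=", value)
```

Each key and value pair can then be passed to `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.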

In [2]:
# Using a local Spark Catalog

#spark.sql("CREATE DATABASE IF NOT EXISTS spark_catalog.newjar")
spark.sql("USE spark_catalog.testdb")
spark.sql("SHOW CURRENT NAMESPACE").show()
#spark.sql("DROP TABLE testtable")

+-------------+---------+
|      catalog|namespace|
+-------------+---------+
|spark_catalog|   testdb|
+-------------+---------+



### You can use simple Spark SQL commands to create Spark tables as you always have. Just make sure to specify the USING iceberg clause.

In [3]:
spark.sql("CREATE TABLE IF NOT EXISTS newtesttable (id bigint, data string) USING iceberg")

DataFrame[]

### To select a specific table snapshot or the snapshot at some time, Iceberg supports two Spark read options:

* snapshot-id selects a specific table snapshot
* as-of-timestamp selects the current snapshot at a timestamp, in milliseconds
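
The two options can be sketched as a small helper that returns the option dictionary for a read. The function name `time_travel_options` is hypothetical, and treating the two options as mutually exclusive is an assumption of this sketch rather than something stated in this notebook.

```python
def time_travel_options(snapshot_id=None, as_of_timestamp_ms=None):
    """Build the read options for one time-travel query (hypothetical helper)."""
    # Assumption: a read targets either a snapshot id or a timestamp, never both.
    if (snapshot_id is None) == (as_of_timestamp_ms is None):
        raise ValueError("set exactly one of snapshot_id or as_of_timestamp_ms")
    if snapshot_id is not None:
        return {"snapshot-id": str(int(snapshot_id))}
    # as-of-timestamp is expressed in milliseconds since the epoch.
    return {"as-of-timestamp": str(int(as_of_timestamp_ms))}


opts = time_travel_options(as_of_timestamp_ms=1652920085489)
print(opts)
```

With an active session, the result could be applied as `spark.read.format("iceberg").options(**opts).load("spark_catalog.testdb.newtesttable")`.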

#### You can view all snapshots associated with the table

In [5]:
spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable")

DataFrame[id: bigint, data: string]

In [11]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.snapshots").show(20, False)

+------------+-----------+---------+---------+-------------+-------+
|committed_at|snapshot_id|parent_id|operation|manifest_list|summary|
+------------+-----------+---------+---------+-------------+-------+
+------------+-----------+---------+---------+-------------+-------+



#### Or a full table version history 

In [12]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.history").show(20, False)

+---------------+-----------+---------+-------------------+
|made_current_at|snapshot_id|parent_id|is_current_ancestor|
+---------------+-----------+---------+-------------------+
+---------------+-----------+---------+-------------------+



#### To show a table’s data files and each file’s metadata, run:

In [13]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.files").show(20, False)

+-------+---------+-----------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+
|content|file_path|file_format|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|
+-------+---------+-----------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+
+-------+---------+-----------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+



### A manifest file is a metadata file that lists a subset of data files that make up a snapshot.

### Each data file in a manifest is stored with a partition tuple, column-level stats, and summary information used to prune splits during scan planning.
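
To make the role of column-level stats concrete, here is a deliberately simplified sketch of pruning by lower and upper bounds. It is not Iceberg's implementation: the `DataFileEntry` class, the file names, and the bounds are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataFileEntry:
    """A toy stand-in for one data file entry in a manifest."""
    path: str
    record_count: int
    lower_id: int  # smallest value of the id column in the file
    upper_id: int  # largest value of the id column in the file


def files_to_scan(entries, wanted_id):
    """Keep only files whose [lower_id, upper_id] range could contain wanted_id."""
    return [e.path for e in entries if e.lower_id <= wanted_id <= e.upper_id]


manifest = [
    DataFileEntry("a.parquet", 3, 1, 3),
    DataFileEntry("b.parquet", 5, 4, 8),
    DataFileEntry("c.parquet", 2, 9, 10),
]

# A filter such as id = 7 only needs to open the second file.
print(files_to_scan(manifest, 7))  # ['b.parquet']
```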

#### To show a table’s file manifests and each file’s metadata, run:

In [14]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.manifests").show(20, False)

+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+-------------------+
|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|partition_summaries|
+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+-------------------+
+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+-------------------+



## Time Travel

### Using snapshots as shown above, we can insert some data into the table and then query it as it was at an earlier point in time

In [15]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES (1, 'x'), (2, 'y'), (3, 'z')")

DataFrame[]

In [16]:
# Query using select
spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable").show()

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



In [17]:
# Query using DF - All Data
df = spark.table("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



In [18]:
from datetime import datetime

# current date and time
now = datetime.now()

timestamp = datetime.timestamp(now)
print("timestamp =", timestamp)

timestamp = 1652920085.489827


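#### Timestamps can be tricky. The as-of-timestamp option expects an integer number of milliseconds since the epoch, while datetime.timestamp() returns seconds as a float, so multiply by 1000 and convert to int as shown below.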

In [19]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+
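
The conversion used in the query above can also be done with integer arithmetic, which avoids floating-point rounding. This is a minimal sketch; the helper name `to_epoch_millis` is hypothetical, and it assumes that a datetime without time zone information represents local time.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_millis(dt):
    """Convert a datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        # Assumption: naive datetimes are interpreted in the local time zone.
        dt = dt.astimezone()
    # Dividing two timedeltas with // yields an int, truncating sub-millisecond parts.
    return (dt - EPOCH) // timedelta(milliseconds=1)


saved = datetime(2022, 5, 19, 0, 28, 5, 489827, tzinfo=timezone.utc)
print(to_epoch_millis(saved))  # 1652920085489
```

The result can be passed directly as the value of the as-of-timestamp option.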



In [20]:
# Insert using Iceberg format
spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES (1, 'd'), (2, 'e'), (3, 'f')")

DataFrame[]

### Let's insert more data into the table

In [31]:
# Insert using Iceberg format
import string
import random

for i in range(25):
    number = random.randint(0, 10)
    letter = random.choice(string.ascii_letters)
    spark.sql("INSERT INTO spark_catalog.testdb.newtesttable VALUES ({}, '{}')".format(number, letter))

### Now let's access the data again, using the same timestamp as before. Notice that the query returns only the three rows that existed at that time, not the rows we have inserted since.

In [33]:
# Query using a point in time
df = spark.read.option("as-of-timestamp", int(timestamp*1000)).format("iceberg").load("spark_catalog.testdb.newtesttable")
df.show(100)

+---+----+
| id|data|
+---+----+
|  1|   x|
|  2|   y|
|  3|   z|
+---+----+



### Observe that many new snapshots have been created, one for each INSERT statement

In [32]:
spark.read.format("iceberg").load("spark_catalog.testdb.newtesttable.history").show(20, False)

+-----------------------+-------------------+-------------------+-------------------+
|made_current_at        |snapshot_id        |parent_id          |is_current_ancestor|
+-----------------------+-------------------+-------------------+-------------------+
|2022-05-19 00:27:36.444|5797822168349485877|null               |true               |
|2022-05-19 00:28:16.897|3326689113563627160|5797822168349485877|true               |
|2022-05-19 00:36:41.604|2080039523045202901|3326689113563627160|true               |
|2022-05-19 00:36:42.87 |8726800525946754873|2080039523045202901|true               |
|2022-05-19 00:36:43.934|5250560437891035012|8726800525946754873|true               |
|2022-05-19 00:36:44.966|3502850421386520639|5250560437891035012|true               |
|2022-05-19 00:36:46.076|6484488418513084975|3502850421386520639|true               |
|2022-05-19 00:36:47.109|6131843803655349628|6484488418513084975|true               |
|2022-05-19 00:36:48.125|9149412625213001646|613184380
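
The history output also explains why the earlier as-of-timestamp query returned only three rows. Conceptually, the read picks the latest snapshot that was made current at or before the requested time. The sketch below mimics that lookup in plain Python using the first three rows of the history output; the function `snapshot_as_of` is hypothetical, and it assumes the saved timestamp and the history timestamps are in the same time zone.

```python
from datetime import datetime

# (made_current_at, snapshot_id) pairs copied from the first rows of the history output.
history = [
    (datetime(2022, 5, 19, 0, 27, 36, 444000), 5797822168349485877),
    (datetime(2022, 5, 19, 0, 28, 16, 897000), 3326689113563627160),
    (datetime(2022, 5, 19, 0, 36, 41, 604000), 2080039523045202901),
]


def snapshot_as_of(history, when):
    """Return the id of the latest snapshot made current at or before `when`."""
    candidates = [(ts, sid) for ts, sid in history if ts <= when]
    if not candidates:
        raise ValueError("no snapshot exists at or before the given time")
    return max(candidates)[1]


# The timestamp saved earlier (1652920085.489827) falls between the first two snapshots.
print(snapshot_as_of(history, datetime(2022, 5, 19, 0, 28, 5)))  # 5797822168349485877
```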

### You can also query the table in its previous state as of a specific snapshot.

#### Copy a snapshot_id from the history output above and paste it into the next Spark command

In [37]:
spark.read\
    .option("snapshot-id", 6484488418513084975)\
    .table("spark_catalog.testdb.newtesttable").show()

+---+----+
| id|data|
+---+----+
|  8|   p|
|  1|   d|
|  2|   e|
|  3|   f|
|  1|   w|
|  7|   v|
|  0|   g|
|  1|   x|
|  2|   y|
|  3|   z|
|  7|   v|
+---+----+



### The Iceberg API allows you to create tables from Spark DataFrames, and more

In [38]:
new_df = spark.sql("SELECT * FROM spark_catalog.testdb.newtesttable").sample(fraction=0.5, seed=3)

In [39]:
new_df.show()

+---+----+
| id|data|
+---+----+
| 10|   M|
|  6|   u|
|  3|   g|
|  6|   F|
|  8|   Q|
|  2|   U|
|  5|   p|
|  7|   v|
|  3|   X|
|  8|   y|
| 10|   w|
|  3|   z|
|  1|   x|
|  2|   y|
+---+----+



#### Creating a new Spark Table with the API and Loading It

In [40]:
new_df.writeTo("spark_catalog.testdb.secondtesttable").create()

In [42]:
spark.sql("INSERT INTO spark_catalog.testdb.secondtesttable SELECT * FROM spark_catalog.testdb.newtesttable")

DataFrame[]

#### The Python script iceberg.py shows a merge operation if you are interested
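
For orientation only, a merge between the two tables from this notebook might look like the sketch below. It is not the content of iceberg.py, which is not shown here; the helper `merge_sql` is hypothetical, and the statement assumes the Iceberg SQL extensions configured earlier support MERGE INTO in the installed runtime version.

```python
def merge_sql(target, source, key):
    """Build a MERGE INTO statement that updates matches and inserts the rest."""
    return (
        f"MERGE INTO {target} t USING {source} s ON t.{key} = s.{key} "
        "WHEN MATCHED THEN UPDATE SET t.data = s.data "
        "WHEN NOT MATCHED THEN INSERT *"
    )


statement = merge_sql(
    "spark_catalog.testdb.newtesttable",
    "spark_catalog.testdb.secondtesttable",
    "id",
)
print(statement)
# With an active session: spark.sql(statement)
```

A merge generally expects each target row to match at most one source row, so tables with repeated id values, like the ones built in this notebook, would likely need deduplication first.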