Nessie Demo
===========
This demo shows how to use the Nessie Python API together with Iceberg's Spark 3 integration.

Initialize the PySpark + Nessie environment
----------------------------------------------

In [1]:
import os
import findspark
findspark.init()  # put pyspark and py4j on sys.path before importing them

from pyspark.sql import Row, SparkSession
from py4j.java_gateway import java_import

spark = SparkSession.builder \
                    .config("spark.jars", "../../clients/iceberg/spark3/target/nessie-iceberg-spark3-0.1-SNAPSHOT.jar") \
                    .config("spark.sql.execution.pyarrow.enabled", "true") \
                    .config("spark.hadoop.fs.defaultFS", 'file://' + os.getcwd() + '/spark_warehouse') \
                    .config("spark.hadoop.nessie.url", "http://localhost:19120/api/v1") \
                    .config("spark.hadoop.nessie.ref", "main") \
                    .config("spark.sql.catalog.nessie", "com.dremio.nessie.iceberg.spark.NessieIcebergSparkCatalog") \
                    .getOrCreate()
sc = spark.sparkContext
jvm = sc._gateway.jvm

java_import(jvm, "com.dremio.nessie.iceberg.NessieCatalog")
java_import(jvm, "org.apache.iceberg.catalog.TableIdentifier")
java_import(jvm, "org.apache.iceberg.Schema")
java_import(jvm, "org.apache.iceberg.types.Types")
java_import(jvm, "org.apache.iceberg.PartitionSpec")
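
The builder settings above can also be collected in a plain dictionary, which makes it easier to switch the Nessie reference or server URL between runs. The following is a minimal sketch: the helper name `nessie_spark_conf` and its parameters are hypothetical and not part of Nessie or PySpark, while the keys and values are the ones used in the cell above.

```python
import os


def nessie_spark_conf(ref="main", url="http://localhost:19120/api/v1", warehouse=None):
    """Return the Nessie-related Spark settings of this demo as a dict (hypothetical helper)."""
    if warehouse is None:
        # Same default warehouse location as in the cell above.
        warehouse = "file://" + os.getcwd() + "/spark_warehouse"
    return {
        "spark.hadoop.fs.defaultFS": warehouse,
        "spark.hadoop.nessie.url": url,
        "spark.hadoop.nessie.ref": ref,
        "spark.sql.catalog.nessie": "com.dremio.nessie.iceberg.spark.NessieIcebergSparkCatalog",
    }


conf = nessie_spark_conf(ref="dev")
for key, value in sorted(conf.items()):
    print(key, "=", value)
```

Each pair could then be passed to `SparkSession.builder.config(key, value)` in a loop before calling `getOrCreate()`.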

Set up nessie branches
----------------------------

- Branch `main` already exists
- Create branch `dev`
- List all branches (pipe JSON result into jq)

In [2]:
!nessie create-branch dev

Traceback (most recent call last):
  File "/home/ryan/workspace/nessie/python/demo/venv/bin/nessie", line 10, in <module>
    sys.exit(cli())
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/py

In [3]:
!nessie list-references | jq .

[
  {
    "type": "BRANCH",
    "name": "main",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"
  },
  {
    "type": "BRANCH",
    "name": "dev",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"
  }
]
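
The JSON printed by `nessie list-references` can also be inspected from Python instead of jq. A minimal sketch, assuming the output has the shape shown above (a list of objects with `type`, `name`, and `hash`); the sample string is copied from this run and `branch_hashes` is a hypothetical helper name.

```python
import json

# Sample output copied from the cell above.
raw = """
[
  {"type": "BRANCH", "name": "main", "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"},
  {"type": "BRANCH", "name": "dev", "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"}
]
"""


def branch_hashes(text):
    """Map each branch name to the commit hash it points to."""
    return {ref["name"]: ref["hash"] for ref in json.loads(text) if ref["type"] == "BRANCH"}


hashes = branch_hashes(raw)
print(hashes["main"] == hashes["dev"])  # True: both branches start at the same commit
```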


Create tables under dev branch
-------------------------------------

Creating two tables under the `dev` branch:
- region
- nation

Creating a table is not yet possible from PySpark with Iceberg, so the Java API is called through the JVM gateway instead.

In [4]:
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("nessie.ref", "dev")
catalog = jvm.NessieCatalog(sc._jsc.hadoopConfiguration())

# Creating region table
region_name = jvm.TableIdentifier.parse("testing.region")
region_schema = jvm.Schema([
    jvm.Types.NestedField.optional(1, "R_REGIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(2, "R_NAME", jvm.Types.StringType.get()),
    jvm.Types.NestedField.optional(3, "R_COMMENT", jvm.Types.StringType.get()),
])
region_spec = jvm.PartitionSpec.unpartitioned()

region_table = catalog.createTable(region_name, region_schema, region_spec)
region_df = spark.read.load("data/region.parquet")
region_df.write.option('hadoop.nessie.ref', 'dev').format("iceberg").mode("overwrite").save("testing.region")

# Creating nation table
nation_name = jvm.TableIdentifier.parse("testing.nation")
nation_schema = jvm.Schema([
    jvm.Types.NestedField.optional(1, "N_NATIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(2, "N_NAME", jvm.Types.StringType.get()),
    jvm.Types.NestedField.optional(3, "N_REGIONKEY", jvm.Types.LongType.get()),
    jvm.Types.NestedField.optional(4, "N_COMMENT", jvm.Types.StringType.get()),
])
nation_spec = jvm.PartitionSpec.builderFor(nation_schema).truncate("N_NAME", 2).build()
nation_table = catalog.createTable(nation_name, nation_schema, nation_spec)

nation_df = spark.read.load("data/nation.parquet")
nation_df.write.option('hadoop.nessie.ref', 'dev').format("iceberg").mode("overwrite").save("testing.nation")
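
The `nation` table is partitioned with `truncate("N_NAME", 2)`, which groups rows by the first two characters of the name. The sketch below imitates that grouping in plain Python to show which rows would share a partition value. It is an illustration only, not Iceberg's implementation, and the sample names are chosen for the example rather than read from `data/nation.parquet`.

```python
from collections import defaultdict


def truncate_string(value, width):
    """Keep at most `width` leading characters, mimicking a string truncate transform."""
    return value[:width]


# Illustrative names only (assumption: not taken from the demo data file).
names = ["ALGERIA", "ALBANIA", "BRAZIL", "SYLDAVIA", "SAN THEODOROS"]

partitions = defaultdict(list)
for name in names:
    partitions[truncate_string(name, 2)].append(name)

for key in sorted(partitions):
    print(key, partitions[key])
```

Names that share a two-character prefix end up under the same key, so a narrow width gives fewer, larger partitions.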


Check generated tables
----------------------------
   
Check that the tables were created under the `dev` branch and that the `main` branch does not have any tables yet.

In [5]:
!nessie list-tables main

Entries(entries=[], has_more=False, token=None)


In [6]:
!nessie list-tables dev

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'nation'])), Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'region']))], has_more=False, token=None)


Note that the `dev` and `main` branches point to different commits now

In [7]:
!nessie list-references | jq .

[
  {
    "type": "BRANCH",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf",
    "name": "main"
  },
  {
    "type": "BRANCH",
    "hash": "133fc5ce95c57a7a92f03305c855579e1a0585d0",
    "name": "dev"
  }
]


Dev promotion
-------------

Promote the dev branch to main.

* main now has the same tables as dev
* main and dev point to the same commit

In [8]:
!nessie assign-branch main dev
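
The effect of `assign-branch` on the references can be pictured with a toy model in which each branch is just a name pointing at a commit hash. This is a sketch of the idea only, not how Nessie stores references; the hashes are the ones printed earlier in this run and `assign` is a hypothetical helper.

```python
# Branch name -> commit hash, as listed before the promotion.
refs = {
    "main": "e0b41c30f0710277532f51242994e10acfdc46bf",
    "dev": "133fc5ce95c57a7a92f03305c855579e1a0585d0",
}


def assign(refs, branch, to_branch):
    """Return a copy of `refs` with `branch` moved to the commit of `to_branch`."""
    updated = dict(refs)
    updated[branch] = refs[to_branch]
    return updated


after = assign(refs, "main", "dev")
print(after["main"] == after["dev"])  # True: main now points at dev's commit
print(refs["main"] == refs["dev"])    # False: the original mapping is untouched
```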




In [9]:
!nessie list-tables main

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'nation'])), Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'region']))], has_more=False, token=None)


In [10]:
!nessie list-references | jq .

[
  {
    "type": "BRANCH",
    "hash": "133fc5ce95c57a7a92f03305c855579e1a0585d0",
    "name": "main"
  },
  {
    "type": "BRANCH",
    "hash": "133fc5ce95c57a7a92f03305c855579e1a0585d0",
    "name": "dev"
  }
]


Create `etl` branch
----------------------

- Create a branch `etl` out of `main`
- Add data to `nation`
- Alter the schema of `region`
- Create table `city`
- Query the tables in `etl`
- Query the tables in `main`
- Promote the `etl` branch to `main`

In [11]:
!nessie create-branch etl -r `nessie show-reference main | jq .hash | sed 's/"//g'`
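
The backtick expression in the cell above extracts the `hash` field from the JSON returned by `nessie show-reference main` and strips the surrounding quotes with `sed`. The same extraction can be sketched in Python. The sample payload below is an assumption about the shape of that output, based on the `jq .hash` filter and the references listed earlier; `reference_hash` is a hypothetical helper name.

```python
import json

# Assumed shape of the `nessie show-reference main` output (hypothetical sample).
payload = '{"type": "BRANCH", "name": "main", "hash": "133fc5ce95c57a7a92f03305c855579e1a0585d0"}'


def reference_hash(text):
    """Return the commit hash from a single JSON-encoded reference."""
    return json.loads(text)["hash"]


main_hash = reference_hash(payload)
print(main_hash)
```

On the command line, `jq -r .hash` prints the raw string without quotes, which makes the `sed` step unnecessary.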




In [12]:
Nation = Row("N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT")
new_nations = spark.createDataFrame([
    Nation(25, "SYLDAVIA", 3, "King Ottokar's Sceptre"),
    Nation(26, "SAN THEODOROS", 1, "The Picaros")])
new_nations.write.option('hadoop.nessie.ref', 'etl').format("iceberg").mode("append").save("testing.nation")

In [13]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'etl')

etl_catalog = jvm.NessieCatalog(hadoop_conf)
etl_catalog.loadTable(region_name).updateSchema().addColumn('R_ABBREV', jvm.Types.StringType.get()).commit()

In [14]:
# Creating city table
sc.getConf().set("spark.hadoop.nessie.ref", "etl")
spark.sql("CREATE TABLE nessie.testing.city (C_CITYKEY BIGINT, C_NAME STRING, N_NATIONKEY BIGINT, C_COMMENT STRING) USING iceberg PARTITIONED BY (N_NATIONKEY)")

DataFrame[]

In [15]:
from pynessie import init
nessie = init()
[i.name for i in nessie.list_tables('main').entries]

[EntryName(elements=['testing', 'nation']),
 EntryName(elements=['testing', 'region'])]

In [16]:
[i.name for i in nessie.list_tables('etl').entries]

[EntryName(elements=['testing', 'city']),
 EntryName(elements=['testing', 'nation']),
 EntryName(elements=['testing', 'region'])]

In [17]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': '133fc5ce95c57a7a92f03305c855579e1a0585d0',
 'dev': '133fc5ce95c57a7a92f03305c855579e1a0585d0',
 'etl': '4defe1e37e300cb890c4dd8509089ed9e5103ec5'}

In [18]:
nessie.assign('main','etl')

In [19]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': '4defe1e37e300cb890c4dd8509089ed9e5103ec5',
 'dev': '133fc5ce95c57a7a92f03305c855579e1a0585d0',
 'etl': '4defe1e37e300cb890c4dd8509089ed9e5103ec5'}
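
With more than two branches it can help to group them by the commit they point to. A minimal sketch using the dictionary printed above; `branches_by_commit` is a hypothetical helper, not part of `pynessie`.

```python
from collections import defaultdict

# Output of the previous cell: branch name -> commit hash.
refs = {
    "main": "4defe1e37e300cb890c4dd8509089ed9e5103ec5",
    "dev": "133fc5ce95c57a7a92f03305c855579e1a0585d0",
    "etl": "4defe1e37e300cb890c4dd8509089ed9e5103ec5",
}


def branches_by_commit(refs):
    """Invert the mapping: commit hash -> sorted list of branch names."""
    groups = defaultdict(list)
    for name, commit in refs.items():
        groups[commit].append(name)
    return {commit: sorted(names) for commit, names in groups.items()}


grouped = branches_by_commit(refs)
for commit, names in grouped.items():
    print(commit[:8], names)
```

Here `main` and `etl` share a commit after the assignment, while `dev` still points at the earlier one.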

Create `experiment` branch
--------------------------------

- create `experiment` branch from `main`
- drop `nation` table
- add data to `region` table
- compare `experiment` and `main` tables

In [20]:
!nessie create-branch experiment -r `nessie show-reference main | jq .hash | sed 's/"//g'`




In [21]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'experiment')

catalog = jvm.NessieCatalog(hadoop_conf)
catalog.dropTable(jvm.TableIdentifier.parse("testing.nation"), False)

True

In [22]:
spark.sql("set spark.hadoop.nessie.ref=experiment")
spark.sql('INSERT INTO TABLE nessie.testing.region VALUES (5, "AUSTRALIA", "Let\'s hop there", "AUS")')
spark.sql('INSERT INTO TABLE nessie.testing.region VALUES (6, "ANTARTICA", "It\'s cold", "ANT")')

DataFrame[]

In [23]:
!nessie list-tables experiment

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'city'])), Entry(kind='UNKNOWN', name=EntryName(elements=['testing', 'region']))], has_more=False, token=None)
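
The table lists of two branches can be compared with plain set operations. The sketch below uses the entries shown in this notebook: `main` was assigned to `etl` and therefore has `city`, `nation`, and `region`, while `experiment` has the two entries listed above. The tuples are hand-copied stand-ins for the `EntryName` elements, not objects returned by `pynessie`.

```python
# (namespace, table) pairs copied from the listings in this notebook.
main_tables = {("testing", "city"), ("testing", "nation"), ("testing", "region")}
experiment_tables = {("testing", "city"), ("testing", "region")}

dropped = main_tables - experiment_tables   # on main but not on experiment
added = experiment_tables - main_tables     # on experiment but not on main

print(sorted(".".join(t) for t in dropped))  # ['testing.nation']
print(sorted(added))                         # []
```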


Let's take a look at the contents of the region table on the experiment branch. Notice the use of the `nessie` catalog.

In [24]:
spark.sql("select * from nessie.testing.region").toPandas()

Unnamed: 0,R_REGIONKEY,R_NAME,R_COMMENT,R_ABBREV
0,0,AFRICA,lar deposits. blithely final packages cajole. ...,
1,1,AMERICA,"hs use ironic, even requests. s",
2,2,ASIA,ges. thinly even pinto beans ca,
3,3,EUROPE,ly final courts cajole furiously final excuse,
4,4,MIDDLE EAST,uickly special accounts cajole carefully blith...,
5,5,AUSTRALIA,Let's hop there,AUS
6,6,ANTARTICA,It's cold,ANT


Now compare this to the contents of the region table on the main branch. Notice the use of `@main` to view data on the main branch.

In [26]:
spark.sql("select * from nessie.testing.`region@main`").toPandas()

Unnamed: 0,R_REGIONKEY,R_NAME,R_COMMENT,R_ABBREV
0,0,AFRICA,lar deposits. blithely final packages cajole. ...,
1,1,AMERICA,"hs use ironic, even requests. s",
2,2,ASIA,ges. thinly even pinto beans ca,
3,3,EUROPE,ly final courts cajole furiously final excuse,
4,4,MIDDLE EAST,uickly special accounts cajole carefully blith...,
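
The `table@ref` form used in the last query can be built with a small helper when the same query needs to run against several branches. A minimal sketch: `table_at` is a hypothetical name, and the identifier layout (catalog `nessie`, namespace `testing`, backticks around `table@ref`) is taken from the queries in this notebook.

```python
def table_at(table, ref=None, catalog="nessie", namespace="testing"):
    """Build a table identifier, optionally pinned to a Nessie reference."""
    if ref is None:
        # No reference given: the session's default reference applies.
        return f"{catalog}.{namespace}.{table}"
    # The backticks keep `table@ref` together as one identifier part.
    return f"{catalog}.{namespace}.`{table}@{ref}`"


query = "select * from " + table_at("region", "main")
print(query)  # select * from nessie.testing.`region@main`
```

The resulting string could be passed to `spark.sql(...)` in the same way as the queries above.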
