Nessie Demo
===========
This demo shows how to use the Nessie Python API together with Spark.

Initialize Pyspark + Nessie environment
----------------------------------------------

In [1]:
import os
import findspark
from pyspark.sql import *
from pyspark import SparkConf, SparkContext
from py4j.java_gateway import java_import
findspark.init()

spark = SparkSession.builder \
                    .config("spark.jars", "../../clients/deltalake/spark3/target/nessie-deltalake-spark3-0.1-SNAPSHOT.jar") \
                    .config("spark.sql.execution.pyarrow.enabled", "true") \
                    .config("spark.hadoop.fs.defaultFS", 'file://' + os.getcwd() + '/spark_warehouse') \
                    .config("spark.hadoop.nessie.url", "http://localhost:19120/api/v1") \
                    .config("spark.hadoop.nessie.ref", "main") \
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
                    .config("spark.delta.logFileHandler.class", "com.dremio.nessie.deltalake.NessieLogFileMetaParser") \
                    .config("spark.delta.logStore.class", "com.dremio.nessie.deltalake.NessieLogStore") \
                    .getOrCreate()
sc = spark.sparkContext
jvm = sc._gateway.jvm

java_import(jvm, "org.apache.spark.sql.delta.DeltaLog")
java_import(jvm, "io.delta.tables.DeltaTable")
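
The long chain of `.config(...)` calls above can be easier to review when the settings are collected in one place. The following is a minimal sketch, not part of the original notebook: the helper name `nessie_delta_conf` and its parameters are hypothetical, and it only builds a plain dictionary with the same keys and values used in the cell above.

```python
import os


def nessie_delta_conf(jar_path, nessie_url, ref, warehouse_dir):
    """Return the Spark settings used in this demo as a plain dict.

    Hypothetical helper: it only groups the key/value pairs from the
    cell above and does not talk to Spark or Nessie.
    """
    return {
        "spark.jars": jar_path,
        "spark.sql.execution.pyarrow.enabled": "true",
        "spark.hadoop.fs.defaultFS": "file://" + warehouse_dir,
        "spark.hadoop.nessie.url": nessie_url,
        "spark.hadoop.nessie.ref": ref,
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.delta.logFileHandler.class": "com.dremio.nessie.deltalake.NessieLogFileMetaParser",
        "spark.delta.logStore.class": "com.dremio.nessie.deltalake.NessieLogStore",
    }


conf = nessie_delta_conf(
    jar_path="../../clients/deltalake/spark3/target/nessie-deltalake-spark3-0.1-SNAPSHOT.jar",
    nessie_url="http://localhost:19120/api/v1",
    ref="main",
    warehouse_dir=os.path.join(os.getcwd(), "spark_warehouse"),
)

# Print the settings so they can be checked before a session is created.
for key, value in sorted(conf.items()):
    print(key, "=", value)
```

With a Spark installation available, the dictionary could be applied by looping over `conf.items()` and calling `builder.config(key, value)` on `SparkSession.builder` before `getOrCreate()`.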

Set up Nessie branches
----------------------------

- Branch `main` already exists
- Create branch `dev`
- List all branches (pipe JSON result into jq)

In [2]:
!nessie create-branch dev

Traceback (most recent call last):
  File "/home/ryan/workspace/nessie/python/demo/venv/bin/nessie", line 10, in <module>
    sys.exit(cli())
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/home/ryan/workspace/nessie/python/demo/venv/lib/py

In [3]:
!nessie list-references | jq .

[
  {
    "type": "BRANCH",
    "name": "main",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"
  },
  {
    "type": "BRANCH",
    "name": "dev",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"
  }
]


Create tables under dev branch
-------------------------------------

Creating two tables under the `dev` branch:
- region
- nation

The tables are written with PySpark in Delta format from the Parquet files under `data/`, after pointing `nessie.ref` at the `dev` branch.

In [4]:
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("nessie.ref", "dev")

region_df = spark.read.load("data/region.parquet")
region_df.write.format("delta").save("spark_warehouse/testing/region")

nation_df = spark.read.load("data/nation.parquet")
nation_df.write.format("delta").save("spark_warehouse/testing/nation")


Check generated tables
----------------------------
   
Check that the tables were created under the `dev` branch and that the `main` branch still has no tables.

In [5]:
!nessie list-tables main

Entries(entries=[], has_more=False, token=None)


In [6]:
!nessie list-tables dev

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'nation', '_delta_log'])), Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'region', '_delta_log']))], has_more=False, token=None)


In [7]:
!nessie list-references | jq .

[
  {
    "type": "BRANCH",
    "name": "main",
    "hash": "e0b41c30f0710277532f51242994e10acfdc46bf"
  },
  {
    "type": "BRANCH",
    "name": "dev",
    "hash": "0374795983216838f3febfaaf8af000b4390a784"
  }
]


Dev promotion
-------------

Promote the `dev` branch to `main`

In [8]:
!nessie assign-branch main dev




In [9]:
!nessie list-tables main

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'nation', '_delta_log'])), Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'region', '_delta_log']))], has_more=False, token=None)


In [10]:
!nessie list-references | jq .

[
  {
    "name": "main",
    "type": "BRANCH",
    "hash": "0374795983216838f3febfaaf8af000b4390a784"
  },
  {
    "name": "dev",
    "type": "BRANCH",
    "hash": "0374795983216838f3febfaaf8af000b4390a784"
  }
]


Create `etl` branch
----------------------

- Create a branch `etl` out of `main`
- Add data to `nation`
- Alter `region`
- Create table `city`
- Query the tables in `etl`
- Query the tables in `main`
- Promote `etl` branch to `main`

In [15]:
!nessie create-branch etl -r `nessie show-reference main | jq .hash | sed 's/"//g'`
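
The backtick pipeline in the command above takes the JSON printed by `nessie show-reference main`, selects its `hash` field with `jq`, and strips the surrounding quotes with `sed`. A rough Python equivalent of that extraction step is sketched below. The function name `extract_hash` is hypothetical, and the sample JSON is an assumption: it reuses the shape of the entries printed by `nessie list-references` earlier rather than real `show-reference` output.

```python
import json


def extract_hash(reference_json):
    """Return the value of the `hash` field from a JSON reference description."""
    return json.loads(reference_json)["hash"]


# Assumed sample input, modelled on the `main` entry from `nessie list-references`.
sample = (
    '{"name": "main", "type": "BRANCH", '
    '"hash": "0374795983216838f3febfaaf8af000b4390a784"}'
)

main_hash = extract_hash(sample)
print(main_hash)
```

On the shell side, `jq -r .hash` prints the raw string without quotes, which would make the `sed` step unnecessary.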




In [16]:
hadoop_conf.set("nessie.ref", "etl")
Nation = Row("N_NATIONKEY", "N_NAME", "N_REGIONKEY", "N_COMMENT")
new_nations = spark.createDataFrame([
    Nation(25, "SYLDAVIA", 3, "King Ottokar's Sceptre"),
    Nation(26, "SAN THEODOROS", 1, "The Picaros")])
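# Note: "testing.nation" is resolved as a relative path, so this write creates a separate
# Delta table at testing.nation under the working directory (visible in the `etl` listing)
# instead of appending to spark_warehouse/testing/nation.
new_nations.write.option('hadoop.nessie.ref', 'etl').format("delta").mode("append").save("testing.nation")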

In [20]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'etl')
base_table = os.getcwd() + "/spark_warehouse/testing/"
spark.sql("ALTER TABLE delta.`" + base_table + "region` ADD COLUMNS (R_ABBREV STRING)")

DataFrame[]

In [21]:
# Creating city table
sc.getConf().set("spark.hadoop.nessie.ref", "etl")
spark.sql("CREATE TABLE city (C_CITYKEY BIGINT, C_NAME STRING, N_NATIONKEY BIGINT, C_COMMNT STRING) USING delta PARTITIONED BY (N_NATIONKEY) LOCATION 'spark_warehouse/testing/city'")

DataFrame[]

In [22]:
from pynessie import init
nessie = init()
[i.name for i in nessie.list_tables('main').entries]

[EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'nation', '_delta_log']),
 EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'region', '_delta_log'])]

In [23]:
[i.name for i in nessie.list_tables('etl').entries]

[EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'city', '_delta_log']),
 EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'nation', '_delta_log']),
 EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'region', '_delta_log']),
 EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'testing.nation', '_delta_log'])]
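
Comparing the two listings by eye is error-prone because every entry repeats the same long prefix. The sketch below turns the `elements` lists copied from the `main` and `etl` outputs into paths and prints the ones that exist only on `etl`. It works on plain Python lists rather than on pynessie objects, `table_path` is a hypothetical helper, and the leading `/` is an assumption based on the elements looking like an absolute path split on `/`.

```python
def table_path(elements):
    """Join entry-name elements into a path, dropping a trailing `_delta_log`."""
    if elements and elements[-1] == "_delta_log":
        elements = elements[:-1]
    return "/" + "/".join(elements)


# Prefix shared by all entries in the listings above.
base = ["home", "ryan", "workspace", "nessie", "python", "demo"]

main_entries = [
    base + ["spark_warehouse", "testing", "nation", "_delta_log"],
    base + ["spark_warehouse", "testing", "region", "_delta_log"],
]
etl_entries = main_entries + [
    base + ["spark_warehouse", "testing", "city", "_delta_log"],
    base + ["testing.nation", "_delta_log"],
]

main_tables = {table_path(e) for e in main_entries}
etl_tables = {table_path(e) for e in etl_entries}

# Tables present on `etl` but not on `main`.
only_on_etl = sorted(etl_tables - main_tables)
for path in only_on_etl:
    print(path)
```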

In [24]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': '0374795983216838f3febfaaf8af000b4390a784',
 'dev': '0374795983216838f3febfaaf8af000b4390a784',
 'etl': '5965f6e72fe80fdb7f99d9d07f26eff0298f6888'}

In [25]:
nessie.assign('main','etl')

In [26]:
{i.name:i.hash_ for i in nessie.list_references()}

{'main': '5965f6e72fe80fdb7f99d9d07f26eff0298f6888',
 'dev': '0374795983216838f3febfaaf8af000b4390a784',
 'etl': '5965f6e72fe80fdb7f99d9d07f26eff0298f6888'}
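
The two dictionaries above suggest a simple way to think about `assign`: a branch is a name pointing at a commit hash, and assigning `main` to `etl` makes `main` point at the hash `etl` already has. The toy model below illustrates that reading with the hashes from the outputs above. It is a conceptual sketch only: the `assign` function here is a local stand-in, not the pynessie call, and it ignores any checks a Nessie server may perform.

```python
def assign(refs, branch, target):
    """Return a copy of `refs` in which `branch` points at the hash of `target`."""
    updated = dict(refs)
    updated[branch] = refs[target]
    return updated


before = {
    "main": "0374795983216838f3febfaaf8af000b4390a784",
    "dev": "0374795983216838f3febfaaf8af000b4390a784",
    "etl": "5965f6e72fe80fdb7f99d9d07f26eff0298f6888",
}

after = assign(before, "main", "etl")

# Only the assigned branch moves; `dev` and `etl` keep their hashes.
moved = sorted(name for name in after if after[name] != before[name])
print(moved)  # ['main']
```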

Create `experiment` branch
--------------------------------

- create `experiment` branch from `main`
- delete all rows from the `nation` table
- add data to `region` table
- compare `experiment` and `main` tables

In [28]:
!nessie create-branch experiment -r `nessie show-reference main | jq .hash | sed 's/"//g'`




In [29]:
# changing the default branch
hadoop_conf.set('nessie.ref', 'experiment')


jvm.DeltaLog.clearCache()
deltaTable = jvm.DeltaTable.forPath("spark_warehouse/testing/nation")
deltaTable.delete()

In [32]:
spark.sql("set spark.hadoop.nessie.ref=experiment")
spark.sql('INSERT INTO TABLE delta.`' + base_table + 'region` VALUES (5, "AUSTRALIA", "Let\'s hop there", "AUS")')
spark.sql('INSERT INTO TABLE delta.`' + base_table + 'region` VALUES (6, "ANTARTICA", "It\'s cold", "ANT")')

DataFrame[]

In [33]:
!nessie list-tables experiment

Entries(entries=[Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'city', '_delta_log'])), Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'nation', '_delta_log'])), Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'spark_warehouse', 'testing', 'region', '_delta_log'])), Entry(kind='UNKNOWN', name=EntryName(elements=['home', 'ryan', 'workspace', 'nessie', 'python', 'demo', 'testing.nation', '_delta_log']))], has_more=False, token=None)


In [35]:
spark.sql("select * from delta.`" + base_table + "region`").toPandas()

   r_regionkey  r_name       r_comment                                          R_ABBREV
0  0            AFRICA       lar deposits. blithely final packages cajole. ...
1  1            AMERICA      hs use ironic, even requests. s
2  2            ASIA         ges. thinly even pinto beans ca
3  3            EUROPE       ly final courts cajole furiously final excuse
4  4            MIDDLE EAST  uickly special accounts cajole carefully blith...
5  5            AUSTRALIA    Let's hop there                                    AUS
6  6            ANTARTICA    It's cold                                          ANT


To query a different branch, the branch used for Delta queries has to be changed manually: set `nessie.ref` to the target branch and clear the Delta log cache before running the query.

In [37]:
hadoop_conf.set('nessie.ref', 'main')
jvm.DeltaLog.clearCache()
spark.sql("set spark.hadoop.nessie.ref=main")
spark.sql("select * from delta.`" + base_table + "region`").toPandas()

   r_regionkey  r_name       r_comment                                          R_ABBREV
0  0            AFRICA       lar deposits. blithely final packages cajole. ...
1  1            AMERICA      hs use ironic, even requests. s
2  2            ASIA         ges. thinly even pinto beans ca
3  3            EUROPE       ly final courts cajole furiously final excuse
4  4            MIDDLE EAST  uickly special accounts cajole carefully blith...
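
Because the branch has to be switched by hand, it is easy to leave the session pointing at the wrong one. One possible safeguard is a small context manager that sets `nessie.ref` for the duration of a block and restores the previous value afterwards. The sketch below is an assumption-laden illustration: `nessie_ref` and `FakeConf` are hypothetical names, and `FakeConf` only imitates the `get`/`set` methods of the Hadoop configuration object so the example can run without Spark.

```python
from contextlib import contextmanager


@contextmanager
def nessie_ref(conf, ref, on_switch=None):
    """Temporarily set `nessie.ref` to `ref`, then restore the previous value.

    `on_switch` is an optional callable invoked after every change, for
    example a function that clears cached table state.
    """
    previous = conf.get("nessie.ref")
    conf.set("nessie.ref", ref)
    if on_switch is not None:
        on_switch()
    try:
        yield conf
    finally:
        # If no value was set before, the new value is left in place.
        if previous is not None:
            conf.set("nessie.ref", previous)
        if on_switch is not None:
            on_switch()


class FakeConf:
    """Stand-in for the Hadoop configuration; only get/set are modelled."""

    def __init__(self):
        self._values = {}

    def get(self, key):
        return self._values.get(key)

    def set(self, key, value):
        self._values[key] = value


conf = FakeConf()
conf.set("nessie.ref", "experiment")

with nessie_ref(conf, "main"):
    inside = conf.get("nessie.ref")
after = conf.get("nessie.ref")

print(inside, after)  # main experiment
```

In the notebook, the same pattern could presumably be used with `hadoop_conf` as `conf` and `jvm.DeltaLog.clearCache` as `on_switch`, although that combination has not been tested here.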
