Nessie Spark SQL Demo with NBA Dataset
============================
This demo showcases how to use the Nessie Python API along with Spark 3 and Iceberg.

Initialize PySpark + Nessie environment
----------------------------------------------

In [None]:
# install the nessiedemo lib, which configures all required dependencies
!pip install -i https://test.pypi.org/simple/ nessiedemo


In [None]:
# Set up the demo: installs the required Python dependencies, downloads the sample datasets,
# and downloads and starts the Nessie-Quarkus-Runner.
from nessiedemo.demo import setup_demo
demo = setup_demo("nessie-0.5-iceberg-0.11.yml", ["nba"])

# This is separate, because NessieDemo.prepare() via .start() implicitly installs the required dependencies.
# Downloads Spark and sets up SparkSession, SparkContext, JVM-gateway
from nessiedemo.spark import spark_for_demo
spark, sc, jvm, demo_spark = spark_for_demo(demo)

Set up Nessie branches
----------------------------

- Branch `main` already exists
- Create branch `dev`
- List all branches

In [None]:
# create a new dev branch
!nessie branch dev

# session for dev branch
spark_dev = demo_spark.session_for_ref("dev")

In [None]:
# list all branches
!nessie --verbose branch

Create tables under dev branch
-------------------------------------

We create two tables under the `dev` branch using the `spark_dev` session:
- `salaries`
- `totals_stats`


In [None]:
# load the dataset
dataset = demo.fetch_dataset("nba")

# Creating salaries table
spark_dev.sql("CREATE TABLE IF NOT EXISTS nessie.nba.salaries (Season STRING, Team STRING, Salary STRING, Player STRING) USING iceberg")
salaries_df = spark_dev.read.csv(dataset["salaries.csv"], header=True)
salaries_df.write.format("iceberg").mode("overwrite").save("nessie.nba.salaries")

# Creating totals_stats table
spark_dev.sql("CREATE TABLE IF NOT EXISTS nessie.nba.totals_stats (Season STRING, Age STRING, Team STRING, ORB STRING, DRB STRING, TRB STRING, AST STRING, STL STRING, BLK STRING, TOV STRING, PTS STRING, Player STRING, RSorPO STRING) USING iceberg")
totals_stats_df = spark_dev.read.csv(dataset["totals_stats.csv"], header=True)
totals_stats_df.write.format("iceberg").mode("overwrite").save("nessie.nba.totals_stats")
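
Every column in both tables is declared as `STRING`, so the two `CREATE TABLE` statements can also be generated from a list of column names. The sketch below assumes a hypothetical helper, `create_table_sql`, that is not part of Nessie, Iceberg, or `nessiedemo`; it only builds the statement text and does not run it.

```python
def create_table_sql(table, columns, catalog="nessie", namespace="nba"):
    """Build a CREATE TABLE statement in which every column is a STRING.

    Hypothetical helper for this demo; it returns SQL text only.
    """
    cols = ", ".join(f"{name} STRING" for name in columns)
    return (
        f"CREATE TABLE IF NOT EXISTS {catalog}.{namespace}.{table} "
        f"({cols}) USING iceberg"
    )


# Same columns as the hand-written salaries statement above.
salaries_ddl = create_table_sql("salaries", ["Season", "Team", "Salary", "Player"])
print(salaries_ddl)
```

The resulting string could then be passed to `spark_dev.sql(...)` in place of the hand-written statement.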


In [None]:
# notice how we view the data of the salaries table on the dev branch via @dev
spark.sql("select * from nessie.nba.`salaries@dev`").show()
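
The `salaries@dev` part of the identifier is wrapped in backticks because it contains an `@`. A minimal sketch of a helper that builds such identifiers follows; `table_ref` is a hypothetical name introduced here for illustration, and the `nessie` catalog and `nba` namespace defaults are simply the ones used in this notebook.

```python
def table_ref(table, branch=None, catalog="nessie", namespace="nba"):
    """Build a table identifier, optionally pinned to a branch.

    Hypothetical helper for this demo, not part of any library.
    """
    if branch is None:
        # No suffix: the session's default branch is used (main in this demo).
        return f"{catalog}.{namespace}.{table}"
    # The `table@branch` part is quoted with backticks because of the `@`.
    return f"{catalog}.{namespace}.`{table}@{branch}`"


dev_ref = table_ref("salaries", "dev")
default_ref = table_ref("salaries")
print(dev_ref)
print(default_ref)
```

A query such as `spark.sql(f"select * from {dev_ref}")` would then read the table as it exists on `dev`.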

Check generated tables
----------------------------

Check tables generated under the `dev` branch (and that the `main` branch does not have any tables)

In [None]:
# there are no tables on the main branch
!nessie contents --list

In [None]:
# we should see the salaries & totals_stats tables on the dev branch
!nessie contents --list --ref dev

Note that the `dev` and `main` branches point to different commits now

In [None]:
# list all branches
!nessie --verbose branch

Dev promotion
-------------

Promote `dev` branch to `main`.

* `main` now has the same tables as `dev`
* `main` and `dev` point to the same commit

In [None]:
# merge dev into main
!nessie merge dev -b main --force

In [None]:
# list all branches
!nessie --verbose branch
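
One way to picture what the branch listing shows before and after the promotion is to treat each branch as a name that points at a commit. The following is a toy model under that assumption, not Nessie's implementation: `ToyRepo` and its methods are hypothetical names, and commit ids are plain integers rather than hashes.

```python
class ToyRepo:
    """Toy model: each branch is just a name pointing at a commit id."""

    def __init__(self):
        self.branches = {"main": 0}  # commit ids are plain integers here
        self._next_id = 1

    def create_branch(self, name, source="main"):
        # A new branch starts at the same commit as its source.
        self.branches[name] = self.branches[source]

    def commit(self, branch):
        # A write moves only the branch it was made on.
        self.branches[branch] = self._next_id
        self._next_id += 1

    def promote(self, source, target):
        # Simplification: the target pointer is moved to the source's commit.
        self.branches[target] = self.branches[source]


repo = ToyRepo()
repo.create_branch("dev")
repo.commit("dev")  # stands in for writing the salaries table
repo.commit("dev")  # stands in for writing the totals_stats table
diverged = repo.branches["dev"] != repo.branches["main"]
repo.promote("dev", "main")
print(diverged, repo.branches)
```

In this model `dev` and `main` differ after the writes and agree again after the promotion, which mirrors what the two branch listings above are meant to show.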

Create `etl` branch
----------------------

- Create a branch `etl` out of `main`
- add data to `salaries`
- alter the schema of `totals_stats`
- create table `allstar_games_stats`
- query the tables in `etl`
- query the tables in `main`
- promote `etl` branch to `main`

In [None]:
# create the etl branch based on main
!nessie branch etl main

# session for etl branch
spark_etl = demo_spark.session_for_ref("etl")

In [None]:
# add some salaries for Kevin Durant
from pyspark.sql import Row
Salary = Row("Season", "Team", "Salary", "Player")
kevin_durant = spark_etl.createDataFrame([
    Salary("2017-18", "Golden State Warriors", "$25000000", "Kevin Durant"),
    Salary("2018-19", "Golden State Warriors", "$30000000", "Kevin Durant"),
    Salary("2019-20", "Brooklyn Nets", "$37199000", "Kevin Durant"),
    Salary("2020-21", "Brooklyn Nets", "$39058950", "Kevin Durant")])
kevin_durant.write.format("iceberg").mode("append").save("nessie.nba.salaries")

In [None]:
# dropping a column in the totals_stats table
spark_etl.sql("ALTER TABLE nessie.nba.totals_stats DROP COLUMN Age")

In [None]:
# Creating allstar_games_stats table and viewing the contents
spark_etl.sql("CREATE TABLE IF NOT EXISTS nessie.nba.allstar_games_stats (Season STRING, Age STRING, Team STRING, ORB STRING, TRB STRING, AST STRING, STL STRING, BLK STRING, TOV STRING, PF STRING, PTS STRING, Player STRING) USING iceberg")
allstar_games_stats_df = spark_etl.read.csv(dataset["allstar_games_stats.csv"], header=True)
allstar_games_stats_df.write.format("iceberg").mode("overwrite").save("nessie.nba.allstar_games_stats")

spark.sql("select * from nessie.nba.`allstar_games_stats@etl`").show()

In [None]:
# allstar_games_stats is not on the main branch
!nessie contents --list

In [None]:
# we should see allstar_games_stats on the etl branch
!nessie contents --list --ref etl


In [None]:
# now merge the etl branch into main
!nessie merge etl -b main --force

In [None]:
# the etl and main branch should have the same revision
!nessie --verbose branch


Create `experiment` branch
--------------------------------

- create `experiment` branch from `main`
- drop `totals_stats` table
- add data to `salaries` table
- compare `experiment` and `main` tables

In [None]:
# create the experiment branch from main
!nessie branch experiment main

# session for experiment branch
spark_experiment = demo_spark.session_for_ref("experiment")

In [None]:
# drop the `totals_stats` table
spark_experiment.sql("DROP TABLE IF EXISTS nessie.nba.totals_stats")

In [None]:
# add some salaries for Dirk Nowitzki
Salary = Row("Season", "Team", "Salary", "Player")
dirk_nowitzki = spark_experiment.createDataFrame([
    Salary("2015-16", "Dallas Mavericks", "$8333333", "Dirk Nowitzki"),
    Salary("2016-17", "Dallas Mavericks", "$25000000", "Dirk Nowitzki"),
    Salary("2017-18", "Dallas Mavericks", "$5000000", "Dirk Nowitzki"),
    Salary("2018-19", "Dallas Mavericks", "$5000000", "Dirk Nowitzki")])
dirk_nowitzki.write.format("iceberg").mode("append").save("nessie.nba.salaries")

In [None]:
# we should see the salaries and allstar_games_stats tables only
!nessie contents --list --ref experiment

In [None]:
# main should still see the totals_stats table
!nessie contents --list
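
The two listings can also be compared programmatically. The sketch below assumes the table names have already been collected into plain Python lists; the names are typed by hand from what the comments above say each branch should contain, not read from the `nessie` CLI, and the actual listing format may differ. `diff_contents` is a hypothetical helper.

```python
def diff_contents(left, right):
    """Return (names only in left, names only in right), each sorted."""
    left, right = set(left), set(right)
    return sorted(left - right), sorted(right - left)


# Hand-typed from the expectations stated in this notebook.
main_tables = ["nba.salaries", "nba.totals_stats", "nba.allstar_games_stats"]
experiment_tables = ["nba.salaries", "nba.allstar_games_stats"]

only_main, only_experiment = diff_contents(main_tables, experiment_tables)
print("only on main:", only_main)
print("only on experiment:", only_experiment)
```

With these inputs, the only difference is `totals_stats`, which was dropped on `experiment` but is still present on `main`.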

Let's take a look at the contents of the `salaries` table on the `experiment` branch.
Notice the use of the `nessie` catalog and of `@experiment` to view data on the `experiment` branch.

In [None]:
spark.sql("select count(*) from nessie.nba.`salaries@experiment`").show()

Now compare this with the contents of the `salaries` table on the `main` branch. Notice that we didn't have to specify `@branchName`,
because it defaults to the `main` branch.

In [None]:
spark.sql("select count(*) from nessie.nba.salaries").show()