Nessie Iceberg/Flink SQL Demo with NBA Dataset
============================
This demo showcases how to use the Nessie Python API together with Flink and Iceberg.

Initialize PyFlink
----------------------------------------------
To get started, we first run a few setup steps that give us everything we need to work with Nessie.
If you are interested in the detailed setup steps for Flink, you can check out the [docs](https://projectnessie.org/tools/iceberg/flink/).

The Binder server has downloaded Flink and some data for us and has started a Nessie server in the background. All we have to do is start Flink.

The cell below starts a local Flink session with the parameters needed to configure Nessie. The important config options are explained in the comments that precede the `CREATE CATALOG` statement.

In [1]:
import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import StreamTableEnvironment
from pyflink.table.expressions import lit
from pynessie import init

# where we will store our data
warehouse = os.path.join(os.getcwd(), "flink-warehouse")
# this was downloaded when Binder started; it is also available on Maven Central
iceberg_flink_runtime_jar = os.path.join(os.getcwd(), "../iceberg-flink-runtime-0.12.0.jar")

env = StreamExecutionEnvironment.get_execution_environment()
env.add_jars("file://{}".format(iceberg_flink_runtime_jar))
table_env = StreamTableEnvironment.create(env)

nessie_client = init()

def create_ref_catalog(ref):
    """
    Create a flink catalog that is tied to a specific ref.

    In order to create the catalog we have to first create the branch
    """
    hash_ = nessie_client.get_reference(nessie_client.get_default_branch()).hash_
    try:
        nessie_client.create_branch(ref, hash_)
    except Exception:
        pass  # assume the branch already exists
    # The important args below are:
    # type: tell Flink to use Iceberg as the catalog
    # catalog-impl: which Iceberg catalog to use, in this case we want Nessie
    # uri: the location of the nessie server.
    # ref: the Nessie ref/branch we want to use (defaults to main)
    # warehouse: the location this catalog should store its data
    table_env.execute_sql(
            f"""CREATE CATALOG {ref}_catalog WITH (
            'type'='iceberg',
            'catalog-impl'='org.apache.iceberg.nessie.NessieCatalog',
            'uri'='http://localhost:19120/api/v1',
            'ref'='{ref}',
            'warehouse' = '{warehouse}')"""
        )
create_ref_catalog(nessie_client.get_default_branch())
print("\n\n\nFlink running\n\n\n")

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/srv/conda/envs/flink-demo/lib/python3.7/site-packages/pyflink/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/srv/conda/envs/flink-demo/lib/python3.7/site-packages/pyflink/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]





Flink running
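The catalog options used in `create_ref_catalog` can also be assembled by a small pure-Python helper, which makes it easy to inspect the statement before handing it to `table_env.execute_sql`. The following is a minimal sketch, not part of the demo itself: `build_catalog_ddl` is a hypothetical name, and the identifier check is an added assumption based on the fact that the ref ends up inside the unquoted catalog name `<ref>_catalog`.

```python
import re

NESSIE_CATALOG_IMPL = "org.apache.iceberg.nessie.NessieCatalog"


def build_catalog_ddl(ref, warehouse, uri="http://localhost:19120/api/v1"):
    """Return a CREATE CATALOG statement for a catalog tied to `ref` (hypothetical helper)."""
    # The catalog is named `<ref>_catalog`, so the ref must be usable
    # as part of an unquoted SQL identifier.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", ref):
        raise ValueError(f"ref {ref!r} cannot be used in a catalog name")
    options = {
        "type": "iceberg",                    # use Iceberg as the catalog type
        "catalog-impl": NESSIE_CATALOG_IMPL,  # the Iceberg catalog implementation
        "uri": uri,                           # location of the Nessie server
        "ref": ref,                           # Nessie ref/branch to use
        "warehouse": warehouse,               # where the catalog stores its data
    }
    rendered = ",\n    ".join(f"'{key}'='{value}'" for key, value in options.items())
    return f"CREATE CATALOG {ref}_catalog WITH (\n    {rendered})"


ddl = build_catalog_ddl("dev", "/tmp/flink-warehouse")
print(ddl)
```

A branch name such as `feature/x` would be rejected by this sketch, because it cannot be embedded in the catalog name as is.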





Solving Data Engineering problems with Nessie
============================

In this demo we play the role of a data engineer working at a fictional sports analytics blog. To write their articles, the authors need access to the relevant data. They need to be able to retrieve data quickly and create charts with it.

We have been asked to collect and expose some information about basketball players. We have located some data sources and are now ready to start ingesting data into our data lakehouse. We will perform the ingestion steps on a Nessie branch to test and validate the data before exposing it to the analysts.

Set up Nessie branches (via Nessie CLI)
----------------------------
Once all dependencies are configured, we can get started with ingesting our basketball data into `Nessie` with the following steps:

- Create a new branch named `dev`
- List all branches

It is worth mentioning that we don't have to explicitly create a `main` branch, since it's the default branch.

In [2]:
create_ref_catalog("dev")

We have created the branch `dev`.

Below we list all branches together with the Nessie `hash` each one is currently pointing to. Note that the automatically created `main` branch already exists and that both branches point at the same `hash`.
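The comparison of hashes can also be done programmatically instead of by eye. The sketch below parses text in the shape of the `nessie --verbose branch` listing used in this demo. `parse_branches` is a hypothetical helper, and the column layout (branch name, then hash, with an optional `*` marker and ANSI colour codes) is an assumption taken from that listing rather than a documented format.

```python
import re

# Matches ANSI colour sequences such as ESC[33m and ESC[0m.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*m")


def parse_branches(cli_output):
    """Map branch name -> hash for text shaped like `nessie --verbose branch` output."""
    branches = {}
    for line in cli_output.splitlines():
        # Drop colour codes and the `*` that marks the default branch.
        fields = ANSI_ESCAPE.sub("", line).replace("*", " ").split()
        if len(fields) >= 2:
            branches[fields[0]] = fields[1]
    return branches


sample = (
    "\x1b[33m* main  2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d comment\n"
    "\x1b[0m  dev   2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d comment\n"
)
branches = parse_branches(sample)
assert branches["main"] == branches["dev"]
```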

In [3]:
!nessie --verbose branch

* main  2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d comment
  dev   2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d comment



Create tables under dev branch
-------------------------------------
Once we created the `dev` branch and verified that it exists, we can create some tables and add some data.

We create two tables under the `dev` branch:
- `salaries`
- `totals_stats`

These tables list the salaries per player per year and their stats per year.

To create the data we:

1. switch our branch context to dev
2. create the table
3. insert the data from an existing CSV file. This CSV file is already stored locally on the demo machine. A production use case would likely take feeds from official data sources.
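The cell that follows uses the descriptor API (`connect`, `OldCsv`, `Schema`) and spells out every column list twice. To our knowledge this API is marked as deprecated in later Flink releases, so a possible alternative is to generate plain SQL DDL from a single column list. The sketch below only builds the statements as strings and has not been run against this demo's Flink version. The helper names are hypothetical, and the `filesystem` connector options are an assumption based on Flink's SQL connector documentation.

```python
def string_columns(columns):
    """Render a column list where every column is a STRING, as in this demo."""
    return ", ".join(f"`{name}` STRING" for name in columns)


def csv_source_ddl(table, path, columns):
    """DDL for a temporary table that reads a local CSV file (hypothetical helper)."""
    return (
        f"CREATE TEMPORARY TABLE {table} ({string_columns(columns)}) WITH ("
        f"'connector'='filesystem', 'path'='{path}', 'format'='csv')"
    )


def iceberg_table_ddl(table, columns):
    """DDL for the Iceberg table that receives the data."""
    return f"CREATE TABLE IF NOT EXISTS {table} ({string_columns(columns)})"


SALARIES_COLUMNS = ["Season", "Team", "Salary", "Player"]

source_ddl = csv_source_ddl(
    "dev_catalog.nba.salaries_temp", "../datasets/nba/salaries.csv", SALARIES_COLUMNS
)
target_ddl = iceberg_table_ddl("dev_catalog.nba.salaries", SALARIES_COLUMNS)
print(source_ddl)
print(target_ddl)
```

Each generated string could then be passed to `table_env.execute_sql`, followed by an `INSERT INTO ... SELECT * FROM ...` statement for the actual load.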

In [4]:
# Load the dataset
from pyflink.table import DataTypes
from pyflink.table.descriptors import Schema, OldCsv, FileSystem

# Creating `salaries` table
(table_env.connect(FileSystem().path('../datasets/nba/salaries.csv'))
  .with_format(OldCsv()
               .field('Season', DataTypes.STRING()).field("Team", DataTypes.STRING())
               .field("Salary", DataTypes.STRING()).field("Player", DataTypes.STRING()))
  .with_schema(Schema()
               .field('Season', DataTypes.STRING()).field("Team", DataTypes.STRING())
               .field("Salary", DataTypes.STRING()).field("Player", DataTypes.STRING()))
  .create_temporary_table('dev_catalog.nba.salaries_temp'))

table_env.execute_sql("""CREATE TABLE IF NOT EXISTS dev_catalog.nba.salaries
            (Season STRING, Team STRING, Salary STRING, Player STRING)""").wait()

tab = table_env.from_path('dev_catalog.nba.salaries_temp')
tab.execute_insert('dev_catalog.nba.salaries').wait()

# Creating `totals_stats` table
(table_env.connect(FileSystem().path('../datasets/nba/totals_stats.csv'))
  .with_format(OldCsv()
               .field('Season', DataTypes.STRING()).field("Age", DataTypes.STRING()).field("Team", DataTypes.STRING())
               .field("ORB", DataTypes.STRING()).field("DRB", DataTypes.STRING()).field("TRB", DataTypes.STRING())
               .field("AST", DataTypes.STRING()).field("STL", DataTypes.STRING()).field("BLK", DataTypes.STRING())
               .field("TOV", DataTypes.STRING()).field("PTS", DataTypes.STRING()).field("Player", DataTypes.STRING())
               .field("RSorPO", DataTypes.STRING()))
  .with_schema(Schema()
               .field('Season', DataTypes.STRING()).field("Age", DataTypes.STRING()).field("Team", DataTypes.STRING())
               .field("ORB", DataTypes.STRING()).field("DRB", DataTypes.STRING()).field("TRB", DataTypes.STRING())
               .field("AST", DataTypes.STRING()).field("STL", DataTypes.STRING()).field("BLK", DataTypes.STRING())
               .field("TOV", DataTypes.STRING()).field("PTS", DataTypes.STRING()).field("Player", DataTypes.STRING())
               .field("RSorPO", DataTypes.STRING()))
  .create_temporary_table('dev_catalog.nba.totals_stats_temp'))

table_env.execute_sql(
        """CREATE TABLE IF NOT EXISTS dev_catalog.nba.totals_stats (Season STRING, Age STRING, Team STRING,
        ORB STRING, DRB STRING, TRB STRING, AST STRING, STL STRING, BLK STRING, TOV STRING, PTS STRING,
        Player STRING, RSorPO STRING)""").wait()

tab = table_env.from_path('dev_catalog.nba.totals_stats_temp')
tab.execute_insert('dev_catalog.nba.totals_stats').wait()

salaries = table_env.from_path('main_catalog.nba.`salaries@dev`').select(lit(1).count).to_pandas().values[0][0]
totals_stats = table_env.from_path('main_catalog.nba.`totals_stats@dev`').select(lit(1).count).to_pandas().values[0][0]
print(f"\n\n\nAdded {salaries} rows to the salaries table and {totals_stats} rows to the totals_stats table.\n\n\n")



2022-03-10 17:08:05,907 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new compressor [.gz]

Now we count the rows in our tables to ensure that each table holds the same number of rows as its CSV file. Note that we use the `table@branch` notation, which overrides the branch context set by the catalog.
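Because `@` is not valid in an unquoted identifier, the `table@branch` part has to be wrapped in backticks, as in the paths used throughout this notebook. A small helper can keep that quoting in one place. This is a sketch only, and `table_path` is a hypothetical name.

```python
def table_path(catalog, namespace, table, ref=None):
    """Build a table path, optionally pinned to a Nessie ref via `table@ref`."""
    if ref is None:
        # No override: the ref configured on the catalog is used.
        return f"{catalog}.{namespace}.{table}"
    # `table@ref` must be quoted with backticks because of the `@`.
    return f"{catalog}.{namespace}.`{table}@{ref}`"


print(table_path("main_catalog", "nba", "salaries", "dev"))
print(table_path("main_catalog", "nba", "salaries"))
```

The first result has the same shape as the paths passed to `table_env.from_path` in this demo.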

In [5]:
table_count = table_env.from_path('dev_catalog.nba.`salaries@dev`').select('Season.count').to_pandas().values[0][0]
csv_count = table_env.from_path('dev_catalog.nba.salaries_temp').select('Season.count').to_pandas().values[0][0]
assert table_count == csv_count
print(table_count)

table_count = table_env.from_path('dev_catalog.nba.`totals_stats@dev`').select('Season.count').to_pandas().values[0][0]
csv_count = table_env.from_path('dev_catalog.nba.totals_stats_temp').select('Season.count').to_pandas().values[0][0]
assert table_count == csv_count
print(table_count)

2022-03-10 17:08:14,552 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new decompressor [.gz]

Check generated tables
----------------------------
Since we have been working solely on the `dev` branch, where we created 2 tables and added some data,
let's verify that the `main` branch was not altered by our changes.

In [6]:
!nessie contents --list




And on the `dev` branch we expect to see two tables

In [7]:
!nessie contents --list --ref dev

ICEBERG_TABLE:
	nba.salaries
	nba.totals_stats



We can also verify that the `dev` and `main` branches point to different commits

In [8]:
!nessie --verbose branch

* main  2e1cfa82b035c26cbbbdae632cea070514eb8b773f616aaeaf668e2f0be8f10d comment
  dev   3f743d120c8b4898237b19f7e82d3193bea7c5cfb6e1ffb4b8671de548f70422 comment



Dev promotion into main
-----------------------
Once we are done with our changes on the `dev` branch, we would like to merge those changes into `main`.
We merge `dev` into `main` via the command line `merge` command.
Both branches should be at the same revision after merging/promotion.
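The way the branch hashes move in this demo can be pictured with a toy model in which a branch is nothing more than a name pointing at a commit hash. This is deliberately simplified and is not how Nessie implements commits or merges. `ToyRepo` and its methods are hypothetical, and the merge only covers the case seen here, where `main` has no commits of its own since `dev` was created.

```python
import hashlib


class ToyRepo:
    """Toy model of branches as pointers to commit hashes (not Nessie's implementation)."""

    def __init__(self):
        self.branches = {"main": hashlib.sha256(b"root").hexdigest()}

    def create_branch(self, name, from_branch="main"):
        # A new branch starts at the hash of the branch it was created from.
        self.branches[name] = self.branches[from_branch]

    def commit(self, branch, change):
        # A commit derives a new hash from the parent hash and the change.
        parent = self.branches[branch]
        self.branches[branch] = hashlib.sha256((parent + change).encode()).hexdigest()

    def merge(self, source, target):
        # Simplified: the target just moves to the source's hash.
        self.branches[target] = self.branches[source]


repo = ToyRepo()
repo.create_branch("dev")
same_after_create = repo.branches["dev"] == repo.branches["main"]

repo.commit("dev", "create nba.salaries")
repo.commit("dev", "create nba.totals_stats")
differ_after_commits = repo.branches["dev"] != repo.branches["main"]

repo.merge("dev", "main")
same_after_merge = repo.branches["dev"] == repo.branches["main"]
print(same_after_create, differ_after_commits, same_after_merge)
```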

In [9]:
!nessie merge dev -b main --force




We can verify the branches are at the same hash and that the `main` branch now contains the expected tables and row counts.

The tables are now on `main` and ready for consumption by our blog authors and analysts!
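The row-count checks in this notebook compare the Iceberg table against the temporary CSV-backed table. As an independent cross-check, the CSV file can also be counted directly with the standard library. The sketch below uses a hypothetical helper, `count_csv_rows`, and writes two sample rows to a temporary file so that it runs without the demo datasets. The `has_header` flag is an assumption for files that carry a header line.

```python
import csv
import os
import tempfile


def count_csv_rows(path, has_header=False):
    """Count data records in a CSV file, ignoring completely empty lines."""
    with open(path, newline="") as handle:
        records = sum(1 for row in csv.reader(handle) if row)
    if has_header and records:
        records -= 1
    return records


# Two sample rows in the column order of the `salaries` table.
sample = (
    "2017-18,Golden State Warriors,$25000000,Kevin Durant\n"
    "2018-19,Golden State Warriors,$30000000,Kevin Durant\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write(sample)
    sample_path = tmp.name

csv_rows = count_csv_rows(sample_path)
os.remove(sample_path)
print(csv_rows)
```

In the notebook, the result for `../datasets/nba/salaries.csv` could be compared with the `table_count` value computed from the Iceberg table.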

In [10]:
!nessie --verbose branch

* main  3f743d120c8b4898237b19f7e82d3193bea7c5cfb6e1ffb4b8671de548f70422 comment
  dev   3f743d120c8b4898237b19f7e82d3193bea7c5cfb6e1ffb4b8671de548f70422 comment



In [11]:
!nessie contents --list

ICEBERG_TABLE:
	nba.salaries
	nba.totals_stats



In [12]:
table_count = table_env.from_path('main_catalog.nba.salaries').select('Season.count').to_pandas().values[0][0]
csv_count = table_env.from_path('dev_catalog.nba.salaries_temp').select('Season.count').to_pandas().values[0][0]
assert table_count == csv_count

table_count = table_env.from_path('main_catalog.nba.totals_stats').select('Season.count').to_pandas().values[0][0]
csv_count = table_env.from_path('dev_catalog.nba.totals_stats_temp').select('Season.count').to_pandas().values[0][0]
assert table_count == csv_count

2022-03-10 17:08:21,338 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new decompressor [.gz]

Perform regular ETL on the new tables
-------------------
Our analysts are happy with the data and we want to now regularly ingest data to keep things up to date. Our first ETL job consists of the following:

1. Update the `salaries` table to add new data
2. Rename the `totals_stats` table to `new_totals_stats`
3. Create a new table to hold information about the players' appearances in All-Star games

As always we will do this work on a branch and verify the results. This ETL job can then be set up to run nightly with new stats and salary information.
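The "work on a branch, verify, then merge" routine can be captured in a small wrapper so that a nightly job never merges unverified data. This is a sketch under stated assumptions: `run_etl_on_branch` and all four callables are hypothetical, and the stand-in functions below only record calls instead of talking to Nessie or Flink.

```python
def run_etl_on_branch(branch, create_branch, job, verify, merge):
    """Run `job` on `branch` and merge only if `verify` passes."""
    create_branch(branch)
    job(branch)
    if not verify(branch):
        raise RuntimeError(f"verification failed on {branch!r}; nothing was merged")
    merge(branch)
    return True


# Stand-ins that record what happened instead of calling Nessie or Flink.
calls = []
merged = run_etl_on_branch(
    "etl",
    create_branch=lambda b: calls.append(("create", b)),
    job=lambda b: calls.append(("job", b)),
    verify=lambda b: True,
    merge=lambda b: calls.append(("merge", b)),
)
print(calls)
```

In this notebook, `create_branch` could be `create_ref_catalog`, and `verify` could wrap the row-count assertions used in the other cells.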

In [13]:
create_ref_catalog("etl")

In [14]:
# add some salaries for Kevin Durant
table_env.execute_sql("""INSERT INTO etl_catalog.nba.salaries
                        VALUES ('2017-18', 'Golden State Warriors', '$25000000', 'Kevin Durant'),
                        ('2018-19', 'Golden State Warriors', '$30000000', 'Kevin Durant'),
                        ('2019-20', 'Brooklyn Nets', '$37199000', 'Kevin Durant'),
                        ('2020-21', 'Brooklyn Nets', '$39058950', 'Kevin Durant')""").wait()

2022-03-10 17:08:25,838 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new compressor [.gz]

In [15]:
# Rename the table `totals_stats` to `new_totals_stats`
table_env.execute_sql("ALTER TABLE etl_catalog.nba.totals_stats RENAME TO etl_catalog.nba.new_totals_stats").wait()

In [16]:
# Creating `allstar_games_stats` table
(table_env.connect(FileSystem().path('../datasets/nba/allstar_games_stats.csv'))
    .with_format(OldCsv()
                 .field('Season', DataTypes.STRING()).field("Age", DataTypes.STRING()).field("Team", DataTypes.STRING())
                 .field("ORB", DataTypes.STRING()).field("TRB", DataTypes.STRING()).field("AST", DataTypes.STRING())
                 .field("STL", DataTypes.STRING()).field("BLK", DataTypes.STRING()).field("TOV", DataTypes.STRING())
                 .field("PF", DataTypes.STRING()).field("PTS", DataTypes.STRING()).field("Player", DataTypes.STRING()))
    .with_schema(Schema()
                 .field('Season', DataTypes.STRING()).field("Age", DataTypes.STRING()).field("Team", DataTypes.STRING())
                 .field("ORB", DataTypes.STRING()).field("TRB", DataTypes.STRING()).field("AST", DataTypes.STRING())
                 .field("STL", DataTypes.STRING()).field("BLK", DataTypes.STRING()).field("TOV", DataTypes.STRING())
                 .field("PF", DataTypes.STRING()).field("PTS", DataTypes.STRING()).field("Player", DataTypes.STRING()))
    .create_temporary_table('etl_catalog.nba.allstar_games_stats_temp'))

table_env.execute_sql(
        """CREATE TABLE IF NOT EXISTS etl_catalog.nba.allstar_games_stats (Season STRING, Age STRING,
        Team STRING, ORB STRING, TRB STRING, AST STRING, STL STRING, BLK STRING, TOV STRING,
        PF STRING, PTS STRING, Player STRING)""").wait()

tab = table_env.from_path('etl_catalog.nba.allstar_games_stats_temp')
tab.execute_insert('etl_catalog.nba.allstar_games_stats').wait()

# Notice how we view the data on the etl branch via @etl
table_env.from_path('etl_catalog.nba.`allstar_games_stats@etl`').to_pandas()

2022-03-10 17:08:26,962 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new compressor [.gz]

    Season  Age  Team  ORB  TRB  AST  STL  BLK  TOV  PF  PTS        Player
0  2009-10   25   CLE    1    5    6    4    0    2   1   25  Lebron James
1  2010-11   26   MIA    2   12   10    0    0    4   3   29  Lebron James
2  2011-12   27   MIA    0    6    7    0    0    4   2   36  Lebron James
3  2012-13   28   MIA    0    3    5    1    0    4   0   19  Lebron James
4  2013-14   29   MIA    1    7    7    3    0    1   0   22  Lebron James
5  2014-15   30   CLE    1    5    7    2    0    4   1   30  Lebron James
6  2004-05   26   LAL    3    6    7    3    1    4   5   16   Kobe Bryant
7  2005-06   27   LAL    0    7    8    3    0    3   5    8   Kobe Bryant
8  2006-07   28   LAL    1    5    6    6    0    4   1   31   Kobe Bryant
9  2007-08   29   LAL    0    1    0    0    0    0   0    0   Kobe Bryant


We can verify that the new table isn't on the `main` branch but is present on the `etl` branch.

In [17]:
# Since we have been working on the `etl` branch, the `allstar_games_stats` table is not on the `main` branch
!nessie contents --list

ICEBERG_TABLE:
	nba.salaries
	nba.totals_stats



In [18]:
# We should see `allstar_games_stats` and the `new_totals_stats` on the `etl` branch
!nessie contents --list --ref etl

ICEBERG_TABLE:
	nba.salaries
	nba.allstar_games_stats
	nba.new_totals_stats



Now that we are happy with the data we can again merge it into `main`

In [19]:
!nessie merge etl -b main --force




Now let's verify that the changes exist on the `main` branch and that the `main` and `etl` branches have the same `hash`.

In [20]:
!nessie contents --list

ICEBERG_TABLE:
	nba.salaries
	nba.allstar_games_stats
	nba.new_totals_stats



In [21]:
!nessie --verbose branch

  etl   c410d11b2f4c57e5d649d405d46b19d62c389c9837103c44fb2da35ddff835d9 comment
* main  c410d11b2f4c57e5d649d405d46b19d62c389c9837103c44fb2da35ddff835d9 comment
  dev   3f743d120c8b4898237b19f7e82d3193bea7c5cfb6e1ffb4b8671de548f70422 comment



In [22]:
table_count = table_env.from_path('main_catalog.nba.allstar_games_stats').select('Season.count').to_pandas().values[0][0]
csv_count = table_env.from_path('etl_catalog.nba.allstar_games_stats_temp').select('Season.count').to_pandas().values[0][0]
assert table_count == csv_count

2022-03-10 17:08:33,686 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new decompressor [.gz]

Create `experiment` branch
--------------------------------
As a data analyst we might want to carry out some experiments with some data, without affecting `main` in any way.
As in the previous examples, we can just get started by creating an `experiment` branch off of `main`
and carry out our experiment, which could consist of the following steps:
- drop `new_totals_stats` table
- add data to `salaries` table
- compare `experiment` and `main` tables
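The comparison step can be sketched at the level of table names by diffing two `nessie contents --list` listings. The helpers below are hypothetical, and the listing format (a type header ending in `:` followed by indented table names) is assumed from the output shown in this notebook. Note that this only detects added or removed tables, not changes to the data inside a table.

```python
def parse_contents(cli_output):
    """Collect table names from text shaped like `nessie contents --list` output."""
    tables = set()
    for line in cli_output.splitlines():
        entry = line.strip()
        # Skip blank lines and type headers such as "ICEBERG_TABLE:".
        if entry and not entry.endswith(":"):
            tables.add(entry)
    return tables


def diff_contents(base, other):
    """Return the table names that exist on only one of the two refs."""
    return {
        "only_in_base": sorted(base - other),
        "only_in_other": sorted(other - base),
    }


main_listing = (
    "ICEBERG_TABLE:\n\tnba.salaries\n\tnba.allstar_games_stats\n\tnba.new_totals_stats\n"
)
experiment_listing = "ICEBERG_TABLE:\n\tnba.salaries\n\tnba.allstar_games_stats\n"

diff = diff_contents(parse_contents(main_listing), parse_contents(experiment_listing))
print(diff)
```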

In [23]:
create_ref_catalog("experiment")

In [24]:
# Drop the `new_totals_stats` table on the `experiment` branch
table_env.execute_sql("DROP TABLE experiment_catalog.nba.new_totals_stats")

<pyflink.table.table_result.TableResult at 0x7fce02a43e50>

In [25]:
# add some salaries for Dirk Nowitzki
table_env.execute_sql("""INSERT INTO experiment_catalog.nba.salaries VALUES
    ('2015-16', 'Dallas Mavericks', '$8333333', 'Dirk Nowitzki'),
    ('2016-17', 'Dallas Mavericks', '$25000000', 'Dirk Nowitzki'),
    ('2017-18', 'Dallas Mavericks', '$5000000', 'Dirk Nowitzki'),
    ('2018-19', 'Dallas Mavericks', '$5000000', 'Dirk Nowitzki')""").wait()

2022-03-10 17:08:35,088 INFO  org.apache.hadoop.io.compress.CodecPool                      [] - Got brand-new compressor [.gz]

In [26]:
# We should see the `salaries` and `allstar_games_stats` tables only (since we just dropped `new_totals_stats`)
!nessie contents --list --ref experiment

ICEBERG_TABLE:
	nba.salaries
	nba.allstar_games_stats



In [27]:
# `main` hasn't been changed and still has the `new_totals_stats` table
!nessie contents --list

ICEBERG_TABLE:
	nba.salaries
	nba.allstar_games_stats
	nba.new_totals_stats



Let's take a look at the row count of the `salaries` table on the `experiment` branch.
Notice that we query through `main_catalog` and use `@experiment` to view the data on the `experiment` branch.

In [28]:
table_env.from_path('main_catalog.nba.`salaries@experiment`').select(lit(1).count).to_pandas()

   EXPR$0
0      59


Now we compare this to the row count of the `salaries` table on the `main` branch. Here `@main` is spelled out, but it could be omitted
because `main_catalog` is already tied to the `main` branch.

In [29]:
table_env.from_path('main_catalog.nba.`salaries@main`').select(lit(1).count).to_pandas()

   EXPR$0
0      55
