# Nessie Iceberg Spark Setup

This demo shows how to set up Apache Spark + Apache Iceberg for Nessie using Python.

## Initialize Pyspark + Nessie environment

Before we can work with Nessie, a few setup steps are needed. The `nessiedemo` lib is used to start a Nessie server for this demo.
If you are interested in the detailed setup steps for Spark, check out the [docs](https://projectnessie.org/tools/spark/)
or look directly at the source code of the `nessiedemo` lib [here](https://github.com/projectnessie/nessie-demos/blob/main/pydemolib/nessiedemo/spark.py).

In [None]:
# install the nessiedemo lib, which configures all required dependencies
!pip install nessiedemo

In [None]:
# Set up the demo: installs the required Python dependencies, downloads the sample datasets and
# downloads + starts the Nessie-Quarkus-Runner.
from nessiedemo.demo import setup_demo
demo = setup_demo("nessie-0.5-iceberg-0.11.yml")

# This is separate, because NessieDemo.prepare() via .start() implicitly installs the required dependencies.
# Downloads and sets up Spark
from nessiedemo.spark_base import NessieDemoSparkSupport
helper = NessieDemoSparkSupport(demo)

In [None]:
# The above started the Nessie server for us and also installed the Nessie CLI.
!nessie branch

# Create a `SparkSession` to use Iceberg and Nessie

Creating a `SparkSession` takes two steps:
1. Gather the configuration options and put those into a `SparkConf` instance
1. Create the `SparkSession` instance using that `SparkConf`

## Gather configuration in `SparkConf`

In [None]:
# Get the Iceberg version
iceberg_version = demo.get_iceberg_version()

print(f"Using Iceberg version {iceberg_version}")

We use the Spark configuration option `spark.jars.packages`
(see [Spark Docs](https://spark.apache.org/docs/latest/configuration.html) for details) to let Spark pull the
Iceberg Spark runtime. This option takes so-called Maven coordinates, which we prepare in the `spark_jars`
variable.

In [None]:
spark_jars = f"org.apache.iceberg:iceberg-spark3-runtime:{iceberg_version}"

print(spark_jars)
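
`spark.jars.packages` accepts a comma-separated list of Maven coordinates in `groupId:artifactId:version` form, so more
packages could be appended to the same string. The following is only a minimal sketch: the helpers `maven_coordinate`
and `join_packages` are hypothetical (not part of `nessiedemo`), and the version `0.11.0` is a placeholder for the value
returned by `demo.get_iceberg_version()`.

```python
def maven_coordinate(group_id: str, artifact_id: str, version: str) -> str:
    """Build one Maven coordinate in `groupId:artifactId:version` form."""
    for part in (group_id, artifact_id, version):
        # A colon or comma inside a part would corrupt the coordinate list.
        if not part or ":" in part or "," in part:
            raise ValueError(f"invalid coordinate part: {part!r}")
    return f"{group_id}:{artifact_id}:{version}"


def join_packages(*coordinates: str) -> str:
    """Join coordinates into the comma-separated form used by spark.jars.packages."""
    return ",".join(coordinates)


# "0.11.0" is a placeholder version used for illustration only.
example_jars = join_packages(
    maven_coordinate("org.apache.iceberg", "iceberg-spark3-runtime", "0.11.0"),
)
print(example_jars)
```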

We also need the Nessie server's API endpoint URI.

In [None]:
# Get the Nessie server's API endpoint URI
nessie_api_uri = demo.get_nessie_api_uri()

print(nessie_api_uri)

We need a name for our catalog as well.

In [None]:
catalog_name = "iceberg_spark_setup"

Since we use the local filesystem for our "warehouse", we simply point it at a directory on the local disk.

In [None]:
import os
spark_warehouse = os.path.abspath("spark_warehouse")

In [None]:
from pyspark import SparkConf

conf = SparkConf()

conf.set("spark.jars.packages", spark_jars)
# Enable Arrow-based data transfer between Spark and Python (Spark 3 option name)
conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
conf.set(f"spark.sql.catalog.{catalog_name}.warehouse", spark_warehouse)
conf.set(f"spark.sql.catalog.{catalog_name}.cache-enabled", "false")
conf.set(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")

# Nessie specific configuration

# The Nessie API endpoint
conf.set(f"spark.sql.catalog.{catalog_name}.url", nessie_api_uri)
# Use the default branch called `main`
conf.set(f"spark.sql.catalog.{catalog_name}.ref", "main")
# Tell Iceberg to use the Nessie catalog implementation
conf.set(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
# Don't use authentication in this example
conf.set(f"spark.sql.catalog.{catalog_name}.auth_type", "NONE")
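
All catalog options above share the prefix `spark.sql.catalog.<catalog_name>`. As a rough sketch, they could first be
collected in a plain dictionary, which makes them easy to inspect before they go into `SparkConf`. The function
`nessie_catalog_options`, the example URI and the warehouse path are hypothetical and only serve as illustration.

```python
def nessie_catalog_options(catalog_name, api_uri, ref, warehouse):
    """Return the Iceberg + Nessie catalog options used in this notebook as a dict."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.url": api_uri,
        f"{prefix}.ref": ref,
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.cache-enabled": "false",
        f"{prefix}.auth_type": "NONE",
    }


# Placeholder values; in the notebook they come from `demo` and `os.path.abspath`.
options = nessie_catalog_options(
    "iceberg_spark_setup",
    "http://localhost:19120/api/v1",
    "main",
    "/tmp/spark_warehouse",
)
for key, value in sorted(options.items()):
    print(f"{key} = {value}")
```

Each pair could then be applied with `conf.set(key, value)`, or all at once with `conf.setAll(list(options.items()))`.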

## Create the `SparkSession`

The next step creates the `SparkSession`.

Note: If this step errors out with a message like "Java Gateway process exited", it probably means that you are running
the demo on your local machine and the `JAVA_HOME` environment variable is not set. In that case, make sure you have
Java 8 or Java 11 installed and set `JAVA_HOME` using `os.environ["JAVA_HOME"] = <path to java-home>` before creating
the session.
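
A quick check of the local Java setup can help narrow down such an error before the session is created. The helper
`check_java` below is a hypothetical sketch, not part of `nessiedemo` or Spark. It only looks at `JAVA_HOME` and the
`PATH` and does not verify the Java version.

```python
import os
import shutil


def check_java(environ=None):
    """Return a short, human-readable summary of the local Java setup."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if java_home and os.path.isdir(java_home):
        return f"JAVA_HOME is set to {java_home}"
    if java_home:
        return f"JAVA_HOME points to a missing directory: {java_home}"
    # Fall back to looking for a `java` executable on the PATH.
    java_on_path = shutil.which("java")
    if java_on_path:
        return f"JAVA_HOME is not set, but java was found at {java_on_path}"
    return "JAVA_HOME is not set and no java executable was found on PATH"


print(check_java())
```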

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()

The `SparkSession` is created and ready to be used with Nessie.

See the other Nessie demos to learn how to use Nessie with Iceberg and Spark. This demo only covers the boilerplate
initialization and setup code, which the other demos hide behind the `nessiedemo` library.

The following snippets illustrate a few more topics.

# Bonus content

There are a few more things that are worth knowing.

## Nessie references and Spark SQL with Iceberg

Using multiple Nessie branches or tags in Spark SQL is easy: just append something like `@my_branch_name` to the
table name.

Some Spark SQL examples:

| SQL | Explanation
| --- | ---
| ```SELECT * FROM my_table``` | Performs a `SELECT *` against the `my_table` table using the Nessie branch or tag from the `.ref` option in `SparkConf`.
| ```SELECT * FROM `my_table@dev_branch` ``` | Performs a `SELECT *` against the `my_table` table but using the Nessie branch `dev_branch`. Note the backticks (``` ` ```) around the table qualifier.
| ```SELECT * FROM `my_table@dev_branch` from_dev, `my_table@main` from_main WHERE ...``` | Performs a `SELECT` joining the `my_table` in the Nessie branch `dev_branch` with the `my_table` in the Nessie branch `main`, which can be handy to find differences in a table in different Nessie branches.
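
When such queries are built from Python, the backtick quoting is easy to forget. The following minimal sketch wraps it
in a small function. `table_at_ref` is a hypothetical helper introduced for illustration, and it assumes plain table and
reference names that contain neither backticks nor `@`.

```python
def table_at_ref(table_name: str, ref: str) -> str:
    """Return a backtick-quoted `table@ref` identifier for use in Spark SQL."""
    for part in (table_name, ref):
        # Reject characters that would break the quoting or the table@ref split.
        if not part or "`" in part or "@" in part:
            raise ValueError(f"unsupported name: {part!r}")
    return f"`{table_name}@{ref}`"


query = f"SELECT * FROM {table_at_ref('my_table', 'dev_branch')}"
print(query)  # prints: SELECT * FROM `my_table@dev_branch`
```

The resulting string could then be passed to `spark.sql(query)`.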


## Switching the `SparkSession` to another branch

With Spark 3, you can create a new `SparkSession` and set its `.ref` configuration option to point to the
Nessie branch or tag you want to use. In the following example, `spark_dev` will point to the Nessie `dev` branch.

Note: Spark has a few "static" ("thread local") pointers. One is a Java `ThreadLocal` that holds the current
`SparkSession`. If you want to use a different `SparkSession`, you have to call `SparkSession.setActiveSession(newSession)`
to inform Spark about the "right" `SparkSession`. It is _not_ sufficient to "just use" the "right" `SparkSession`.

In [None]:
# Create the dev branch
!nessie branch dev

In [None]:
# List the branches
!nessie branch

In [None]:
spark_dev = spark.newSession()
spark_dev.conf.set(f"spark.sql.catalog.{catalog_name}.ref", "dev")

from py4j.java_gateway import java_import
# Get the JVM (Java Virtual Machine) gateway used by pyspark
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "org.apache.spark.sql.SparkSession")

# This step instructs Spark to use `spark_dev` for the current thread.
jvm.SparkSession.setActiveSession(spark_dev._jsparkSession)