# Nessie Iceberg Spark Setup

This demo shows how to set up Apache Spark + Apache Iceberg for Nessie using Python.

## Initialize Pyspark + Nessie environment

First, a few setup steps give us everything we need to work with Nessie.
The `nessiedemo` lib is used to start a Nessie server for this demo.
If you are interested in the detailed setup steps for Spark, check out the [docs](https://projectnessie.org/tools/spark/)
or look directly at the source code of the `nessiedemo` lib [here](https://github.com/projectnessie/nessie-demos/blob/main/pydemolib/nessiedemo/iceberg_spark.py).

In [None]:
# install the nessiedemo lib, which configures all required dependencies
!pip install nessiedemo


In [None]:
# Set up the demo: installs the required Python dependencies, downloads the sample datasets,
# and downloads + starts the Nessie-Quarkus-Runner.
from nessiedemo.demo import setup_demo
demo = setup_demo("nessie-0.5-iceberg-0.11.yml")

In [None]:
# The above started the Nessie server for us and also installed the Nessie CLI.
!nessie branch

## Setup Pyspark with Iceberg

We have a running Nessie server.

To use Iceberg + Spark with Nessie, the following components are needed. They will be installed in the following steps.
1. Pyspark, the Python Spark library
1. A Spark distribution. One is included in the `pyspark` library, so we do not need to download it separately.
1. Iceberg-Spark, which will be installed by the Spark runtime

## Install `pyspark`

The `nessiedemo` lib has already installed `pyspark` for us, as shown below. Take care to use the same release
version for the Apache Spark distribution from the [Spark Download page](https://spark.apache.org/downloads.html)
and the `pyspark` library.

In a production environment, `pyspark` is installed using `pip` with a pinned requirement like `pyspark==3.0.2`,
for example `pip install pyspark==3.0.2`. The exact version depends on the actual environment.
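
To compare the installed `pyspark` release with the Spark distribution from Python, a minimal sketch like the following could be used. The helper name `installed_version` is hypothetical; it relies only on the standard `importlib.metadata` module (Python 3.8 or newer).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Prints the version string, or None if `pyspark` is not installed.
print(installed_version("pyspark"))
```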

In [None]:
!pip show pyspark

## Install Spark distribution

There are different approaches to set up the Spark distribution.
1. Use the Spark distribution that comes with the `pyspark` library
1. Download the Spark distribution from the [Spark Download page](https://spark.apache.org/downloads.html) and choose the
   Spark release that matches the `pyspark` library. In this demo, we will use Spark "Pre-built for Apache Hadoop 2.7".
   In a production environment, take care to use the Spark distribution for the Hadoop version you need.

### Alternative 1: Using Spark from `pyspark`

The above `pip show pyspark` gives us the path to the library in the line starting with `Location:`.
It usually ends with `site-packages` or `dist-packages`. You need to add `/pyspark` to the path shown in the line
starting with `Location:` to get the full path.

The following Python code should give us the same result:

In [None]:
import os
import site

# Search for `pyspark` in Python package installation directories
spark_dir = None
for dir in site.getsitepackages():
    test_dir = os.path.join(dir, "pyspark")
    if os.path.isdir(test_dir):
        spark_dir = test_dir
        break

if not spark_dir:
    raise Exception("No pyspark package installed")

print(f"Found Spark distribution in {spark_dir}")
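
As an alternative to scanning the `site-packages` directories, the import system itself can be asked where a package lives. The following is a minimal sketch, assuming the package is importable in the current environment; the helper name `find_package_dir` is hypothetical and not part of `nessiedemo`.

```python
import importlib.util
import os


def find_package_dir(package_name):
    """Return the directory of an importable package, or None if it cannot be found."""
    spec = importlib.util.find_spec(package_name)
    # Plain modules (not packages) have no submodule search locations.
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.abspath(list(spec.submodule_search_locations)[0])


# Prints the package directory, or None if `pyspark` is not installed.
print(find_package_dir("pyspark"))
```

This also finds packages installed in a virtual environment or a user-specific location.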

### Alternative 2: Download Spark

This will download the Spark distribution.

In [None]:
# The `nessiedemo` library provides the URL to download the Spark distribution.
spark_download_url = demo._get_versions_dict()["spark"]["tarball"]

print(f"Spark distribution download URL is {spark_download_url}")

In [None]:
# Get the Spark directory name from the download URL, so we have something like
# 'spark-3.0.2-bin-hadoop2.7' in spark_dir_name and
# 'spark-3.0.2-bin-hadoop2.7.tgz' in spark_file_name.
import re
spark_dir_name = re.match(".*[/]([a-zA-Z0-9-.]+)[.]tgz", spark_download_url).group(1)
spark_file_name = f"{spark_dir_name}.tgz"

# Now download the Spark distribution
from nessiedemo.demo import _Util
_Util.wget(spark_download_url, spark_file_name)
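
The directory and file names can also be derived without a regular expression. The following is a minimal sketch using only the standard library; the helper name `split_tarball_url` and the example URL are hypothetical placeholders, not values provided by `nessiedemo`.

```python
import posixpath
from urllib.parse import urlparse


def split_tarball_url(url):
    """Return (directory name, file name) for a URL that points to a .tgz file."""
    # Only the path component matters; query strings and fragments are ignored.
    file_name = posixpath.basename(urlparse(url).path)
    if not file_name.endswith(".tgz"):
        raise ValueError(f"Not a .tgz download URL: {url}")
    return file_name[: -len(".tgz")], file_name


# Placeholder URL, for illustration only
example_url = "https://example.org/dist/spark-3.0.2-bin-hadoop2.7.tgz"
dir_name, file_name = split_tarball_url(example_url)
print(dir_name)
print(file_name)
```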

In [None]:
# Look for the downloaded tarball
!ls -al .

In [None]:
# Extract the downloaded tarball
_Util.exec_fail(["tar", "-x", "-f", spark_file_name])
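
If the `tar` command is not available, the tarball could also be extracted with Python's standard `tarfile` module. The following is a minimal sketch; the helper name `extract_tarball` is hypothetical, and it should only be used with archives from a trusted source.

```python
import tarfile


def extract_tarball(tar_path, target_dir="."):
    """Extract a (possibly compressed) tarball into target_dir."""
    # "r:*" lets tarfile detect the compression (gzip for .tgz) automatically.
    with tarfile.open(tar_path, "r:*") as archive:
        archive.extractall(target_dir)


# Usage in this demo would look like this (assuming the download succeeded):
# extract_tarball(spark_file_name)
```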

In [None]:
# Look for the Spark distribution directory
!ls -al .

In [None]:
# Set the `spark_dir` variable (`os` is imported here in case the Alternative 1 cell was skipped)
import os
spark_dir = os.path.abspath(spark_dir_name)

print(f"Extracted Spark distribution in {spark_dir}")

### Common for both alternatives

Once we have the Spark distribution handy and its path in `spark_dir`, just set the `SPARK_HOME` environment variable
and use `findspark` to wire it up.

In [None]:
import os

# Set the SPARK_HOME environment variable
os.environ["SPARK_HOME"] = spark_dir

print(os.environ["SPARK_HOME"])

In [None]:
# Finally, use the Python `findspark` Package to "wire" it up.
import findspark
findspark.init()

# Create a `SparkSession` to use Iceberg and Nessie

Creating a `SparkSession` is basically two steps:
1. Gather the configuration options and put those into a `SparkConf` instance
1. Create the `SparkSession` instance using that `SparkConf`

## Gather configuration in `SparkConf`

In [None]:
# Get the Iceberg version
iceberg_version = demo.get_iceberg_version()

print(f"Using Iceberg version {iceberg_version}")

We are using the Spark configuration option `spark.jars.packages`
(see [Spark Docs](https://spark.apache.org/docs/latest/configuration.html) for details) to let Spark pull the
Iceberg Spark runtime. This option takes so-called Maven coordinates, which we will prepare in the `spark_jars`
variable.

In [None]:
spark_jars = f"org.apache.iceberg:iceberg-spark3-runtime:{iceberg_version}"

print(spark_jars)
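
`spark.jars.packages` accepts a comma-separated list of coordinates in the form `groupId:artifactId:version`. If more than one package is needed, the value could be assembled as in the following minimal sketch. The helper name `maven_coordinates` is hypothetical, and the version `0.11.0` is only an example value, not necessarily the one returned by `demo.get_iceberg_version()`.

```python
def maven_coordinates(packages):
    """Build a `spark.jars.packages` value from (group, artifact, version) tuples."""
    return ",".join(f"{group}:{artifact}:{version}" for group, artifact, version in packages)


# Example value only; in the demo the version comes from `iceberg_version`.
example_jars = maven_coordinates(
    [("org.apache.iceberg", "iceberg-spark3-runtime", "0.11.0")]
)
print(example_jars)
```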

We also need the Nessie server's API endpoint URI.

In [None]:
# Get the Nessie server's API endpoint URI
nessie_api_uri = demo.get_nessie_api_uri()

print(nessie_api_uri)

We need a name for our catalog as well.

In [None]:
catalog_name = "iceberg_spark_setup"

Since we use the local filesystem for our "warehouse", we just point it to a directory on the local disk.

In [None]:
spark_warehouse = os.path.abspath("spark_warehouse")

In [None]:
from pyspark import SparkConf

conf = SparkConf()

conf.set("spark.jars.packages", spark_jars)
conf.set("spark.sql.execution.pyarrow.enabled", "true")
conf.set(f"spark.sql.catalog.{catalog_name}.warehouse", spark_warehouse)
conf.set(f"spark.sql.catalog.{catalog_name}.cache-enabled", "false")
conf.set(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog")
conf.set("spark.sql.catalog.spark_catalog", "org.apache.iceberg.spark.SparkSessionCatalog")

# Nessie specific configuration

# The Nessie API endpoint
conf.set(f"spark.sql.catalog.{catalog_name}.url", nessie_api_uri)
# Use the default branch called `main`
conf.set(f"spark.sql.catalog.{catalog_name}.ref", "main")
# Tell Iceberg to use the Nessie catalog implementation
conf.set(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
# Don't use authentication in this example
conf.set(f"spark.sql.catalog.{catalog_name}.auth_type", "NONE")
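
To make the catalog-specific part of this configuration reusable, the options could be collected in a plain dictionary first. The following is a minimal sketch that uses only the option names shown above. The helper name `nessie_catalog_options`, the endpoint URL and the warehouse path are hypothetical placeholders.

```python
def nessie_catalog_options(catalog_name, nessie_uri, warehouse, ref="main"):
    """Return the Spark options for an Iceberg catalog backed by Nessie."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.nessie.NessieCatalog",
        f"{prefix}.url": nessie_uri,
        f"{prefix}.ref": ref,
        f"{prefix}.auth_type": "NONE",
        f"{prefix}.warehouse": warehouse,
        f"{prefix}.cache-enabled": "false",
    }


# Placeholder endpoint and warehouse path, for illustration only
options = nessie_catalog_options(
    "iceberg_spark_setup", "http://localhost:19120/api/v1", "/tmp/spark_warehouse"
)
for key, value in sorted(options.items()):
    print(f"{key} = {value}")
```

Each pair could then be applied with `conf.set(key, value)`.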

## Create the `SparkSession`

The next step creates the `SparkSession`.

Note: If this step fails with a message like "Java Gateway process exited", you are probably running
the demo on your local machine and the `JAVA_HOME` environment variable is not set. In that case, make sure you have
Java 8 or Java 11 installed and set `JAVA_HOME` using `os.environ["JAVA_HOME"] = <path to java-home>`.
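
Before creating the `SparkSession`, a quick check like the following minimal sketch can show whether a `java` executable is reachable at all. The helper name `find_java` is hypothetical. It only looks for the executable and does not verify the Java version.

```python
import os
import shutil


def find_java():
    """Return the path of a `java` executable, or None if none is found."""
    java_home = os.environ.get("JAVA_HOME")
    if java_home:
        # Look in $JAVA_HOME/bin first
        candidate = shutil.which("java", path=os.path.join(java_home, "bin"))
        if candidate:
            return candidate
    # Fall back to the regular PATH lookup
    return shutil.which("java")


java_path = find_java()
print(java_path if java_path else "No java executable found, check JAVA_HOME")
```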

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.config(conf=conf).getOrCreate()

The `SparkSession` is created and ready to be used with Nessie.

See the other Nessie demos to learn how to use Nessie with Iceberg and Spark. This demo only covers the boilerplate
initialization and setup code. Other demos hide those parts using the `nessiedemo` library.

The following snippets illustrate a few more topics.

# Bonus content

There are a few more things that are worth knowing.

## Nessie references and Spark SQL with Iceberg

Using multiple Nessie branches or tags in Spark SQL is easy: just append something like `@my_branch_name` to the
table name.

Some Spark SQL examples:

| SQL | Explanation
| --- | ---
| ```SELECT * FROM my_table``` | Performs a `SELECT *` against the `my_table` table using the Nessie branch or tag from the `.ref` option in `SparkConf`.
| ```SELECT * FROM `my_table@dev_branch` ``` | Performs a `SELECT *` against the `my_table` table but using the Nessie branch `dev_branch`. Note the backticks (``` ` ```) around the table qualifier.
| ```SELECT * FROM `my_table@dev_branch` from_dev, `my_table@main` from_main WHERE ...``` | Performs a `SELECT` joining the `my_table` in the Nessie branch `dev_branch` with the `my_table` in the Nessie branch `main`, which can be handy to find differences in a table in different Nessie branches.
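
When SQL statements are built in Python, the backtick quoting shown in the table is easy to forget. The following minimal sketch wraps it in a small helper. The name `table_at_ref` is hypothetical and not part of Nessie or `nessiedemo`.

```python
def table_at_ref(table_name, ref=None):
    """Return a table identifier for Spark SQL, optionally pinned to a Nessie branch or tag."""
    if ref is None:
        # Without a reference, the `.ref` option from the Spark configuration applies.
        return table_name
    if "`" in table_name or "`" in ref:
        raise ValueError("Backticks are not supported in table or reference names")
    return f"`{table_name}@{ref}`"


query = f"SELECT * FROM {table_at_ref('my_table', 'dev_branch')}"
print(query)
```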


## Switching the `SparkSession` to another branch

With Spark 3, you can create a new `SparkSession` and set its `.ref` configuration option to point it to the
Nessie branch or tag you want to use. In the following example, `spark_dev` will point to the Nessie `dev` branch.

Note: Spark has a few "static" ("thread local") pointers. One is a Java `ThreadLocal` that holds the current
`SparkSession`. If you want to use a different `SparkSession`, you have to call `SparkSession.setActiveSession(newSession)`
to inform Spark about the "right" `SparkSession`. It is _not_ sufficient to "just use" the "right" `SparkSession`.

In [None]:
# Create the dev branch
!nessie branch dev

In [None]:
# List the branches
!nessie branch

In [None]:
spark_dev = spark.newSession()
spark_dev.conf.set(f"spark.sql.catalog.{catalog_name}.ref", "dev")

from py4j.java_gateway import java_import
# Get the JVM (Java Virtual Machine) gateway used by pyspark
jvm = spark.sparkContext._gateway.jvm
java_import(jvm, "org.apache.spark.sql.SparkSession")

# This step instructs Spark to use `spark_dev` for the current thread.
jvm.SparkSession.setActiveSession(spark_dev._jsparkSession)