# Setting up Jupyter Notebooks on a Spark Cluster

To set up an interactive notebook environment on the Spark cluster, we follow the steps outlined [here](https://github.com/PiercingDan/spark-Jupyter-AWS).

## Preamble

In [1]:
from data_science_learning_paths import show_command

In [2]:
import html

## Preparation

In [3]:
proj_path = "/Users/cls/Documents/Work/Projects/point8/DataScienceLearningPaths/big-data-cluster"

In [4]:
show_command(f"cd {proj_path}")

In [5]:
cd {proj_path}

/Users/cls/Documents/Work/Projects/point8/DataScienceLearningPaths/big-data-cluster


In [6]:
cluster_name = "bigdata-cluster"

In [7]:
config_path = f"config/{cluster_name}.yaml"

In [8]:
import yaml

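def describe_cluster():
    """Run `flintrock describe` and return its YAML output as a dict."""
    # IPython's `!` syntax captures the command's stdout as a list of lines
    cluster_descr = !flintrock --config={config_path} describe {cluster_name}
    cluster_descr = "\n".join(cluster_descr)
    print(cluster_descr)
    # the description is YAML, keyed by the cluster name
    cluster_info = yaml.safe_load(cluster_descr)
    return cluster_info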

In [9]:
cluster_info = describe_cluster()

bigdata-cluster:
  state: running
  node-count: 3
  master: ec2-54-93-231-162.eu-central-1.compute.amazonaws.com
  slaves:
    - ec2-35-158-229-212.eu-central-1.compute.amazonaws.com
    - ec2-52-29-157-202.eu-central-1.compute.amazonaws.com


In [10]:
master_url = cluster_info[cluster_name]["master"]
master_url

'ec2-54-93-231-162.eu-central-1.compute.amazonaws.com'
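
The parsed description is a nested dictionary keyed by the cluster name. As a minimal sketch, the host names can be pulled out with a small check that the cluster is actually running. The helper name `cluster_hosts` is hypothetical, and the sample dictionary simply mirrors the `describe` output shown above.

```python
def cluster_hosts(cluster_info, cluster_name):
    """Return (master, workers) from a parsed `flintrock describe` result."""
    info = cluster_info[cluster_name]
    if info.get("state") != "running":
        raise RuntimeError(
            f"cluster {cluster_name!r} is not running: {info.get('state')}"
        )
    # Flintrock lists the worker nodes under the key "slaves".
    workers = list(info.get("slaves") or [])
    return info["master"], workers


# Sample data copied from the `describe` output above.
sample = {
    "bigdata-cluster": {
        "state": "running",
        "node-count": 3,
        "master": "ec2-54-93-231-162.eu-central-1.compute.amazonaws.com",
        "slaves": [
            "ec2-35-158-229-212.eu-central-1.compute.amazonaws.com",
            "ec2-52-29-157-202.eu-central-1.compute.amazonaws.com",
        ],
    }
}
master, workers = cluster_hosts(sample, "bigdata-cluster")
print(master, len(workers))
```

Failing early on a stopped cluster avoids a less obvious `KeyError` or connection error further down.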

## Set up the Python Environment on the Cluster

1. _Install [miniforge](https://github.com/conda-forge/miniforge), a lean version of the Anaconda Python distribution_

In [11]:
command = "wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh"

In [12]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")

In [13]:
command = "sh Miniforge3-Linux-x86_64.sh -b"

In [14]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")

2. _Add the miniforge Python to the PATH_

In [15]:
command = 'echo "export PATH=$HOME/miniforge3/bin:$PATH" >> ~/.bashrc'

In [16]:
show_command(f"""flintrock --config={config_path} run-command {cluster_name} '{command}'""")
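
The nested quoting in this command matters: the outer single quotes keep the local shell from expanding `$HOME` and `$PATH`, so the remote shell presumably expands them when it runs the `echo`. The sketch below uses `shlex.split`, which tokenizes like a POSIX shell without expanding variables, to show what the local shell would hand to `flintrock`. The variable values are copied from the cells above.

```python
import shlex

config_path = "config/bigdata-cluster.yaml"
cluster_name = "bigdata-cluster"
command = 'echo "export PATH=$HOME/miniforge3/bin:$PATH" >> ~/.bashrc'

full = f"flintrock --config={config_path} run-command {cluster_name} '{command}'"

# shlex.split applies POSIX quoting rules but performs no variable expansion,
# so the last token is the remote command exactly as flintrock receives it.
args = shlex.split(full)
remote_command = args[-1]
print(remote_command)
```

The remote command arrives as a single argument with `$HOME` and `$PATH` still unexpanded, which is what we want here.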

3. _Install Jupyter_

In [17]:
command = "conda install --yes jupyterlab"

In [18]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")

4. Add Spark Python to the PYTHONPATH

In [19]:
command = 'echo "export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.1-src.zip:$PYTHONPATH" >> ~/.bashrc'

In [20]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")
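
The py4j archive name contains a version number that differs between Spark releases, so `py4j-0.10.1-src.zip` only matches some installations. A minimal sketch of locating the archive by pattern instead is shown below. The helper name `find_py4j_zip` is hypothetical, and the demonstration runs against a temporary directory that mimics the layout `$SPARK_HOME/python/lib` used in the command above.

```python
import glob
import os
import tempfile


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


# Demonstration against a throwaway directory that mimics the Spark layout.
with tempfile.TemporaryDirectory() as fake_spark_home:
    lib_dir = os.path.join(fake_spark_home, "python", "lib")
    os.makedirs(lib_dir)
    open(os.path.join(lib_dir, "py4j-0.10.1-src.zip"), "w").close()
    found = find_py4j_zip(fake_spark_home)
    found_name = os.path.basename(found)
    print(found_name)
```

On the cluster, a check like this would have to run on the nodes themselves, since that is where `SPARK_HOME` lives.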

5. Set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

In [21]:
command = 'echo "export PYSPARK_PYTHON=~/miniforge3/bin/python PYSPARK_DRIVER_PYTHON=~/miniforge3/bin/python" >> ~/.bashrc'

In [22]:
show_command(f"flintrock --config={config_path} run-command {cluster_name} '{command}'")
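
The same `flintrock ... run-command` string is assembled by hand in every step. As a sketch, the pattern can be wrapped in a small helper that quotes the remote command with `shlex.quote`. The helper name `flintrock_run_command` is hypothetical; the flags are the ones already used in this notebook.

```python
import shlex


def flintrock_run_command(config_path, cluster_name, command, master_only=False):
    """Build a `flintrock run-command` invocation as a single shell string."""
    parts = ["flintrock", f"--config={config_path}", "run-command"]
    if master_only:
        parts.append("--master-only")
    parts += [cluster_name, command]
    # shlex.quote leaves safe tokens untouched and single-quotes the rest.
    return " ".join(shlex.quote(part) for part in parts)


cmd = flintrock_run_command(
    "config/bigdata-cluster.yaml",
    "bigdata-cluster",
    "conda install --yes jupyterlab",
)
print(cmd)
```

Because the whole remote command is wrapped in single quotes, `$VARIABLES` inside it are left for the remote shell, matching the hand-written strings above.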

## Running the Notebook Server

6. Install tmux on the master node so that the terminal session stays open after logging out.

In [23]:
command = "sudo yum install -y tmux"

In [24]:
show_command(f"flintrock --config={config_path} run-command --master-only {cluster_name} '{command}'")

7. Create the Jupyter setup script and copy it to the master node.

In [25]:
memory = 2000  # [MB] RAM for each executor (worker node process) and the driver process

In [32]:
port = "22322"

In [33]:
spark_port = 7077  # assumption: the Spark standalone master listens on its default port; Jupyter needs a different one
jupyter_setup = f"""
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="lab --no-browser --port={port}" pyspark --master spark://{master_url}:{spark_port} --executor-memory {memory}M --driver-memory {memory}M
"""
print(jupyter_setup)


PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="lab --no-browser --port=22322" pyspark --master spark://ec2-54-93-231-162.eu-central-1.compute.amazonaws.com:7077 --executor-memory 2000M --driver-memory 2000M
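
The launch command contains two port numbers with different roles: `--port` is where JupyterLab listens, while the port in the `spark://` URL is where the Spark master accepts connections. Both processes run on the master node, so the two cannot share one port. The sketch below keeps them as separate parameters. The helper name `build_jupyter_launch` is hypothetical, and the default of 7077 assumes the cluster keeps Spark's standard standalone master port.

```python
def build_jupyter_launch(master_url, jupyter_port, memory_mb, spark_port=7077):
    """Return the shell line that starts pyspark with JupyterLab as the driver."""
    if int(jupyter_port) == int(spark_port):
        raise ValueError("Jupyter and the Spark master need different ports")
    return (
        "PYSPARK_DRIVER_PYTHON=jupyter "
        f'PYSPARK_DRIVER_PYTHON_OPTS="lab --no-browser --port={jupyter_port}" '
        f"pyspark --master spark://{master_url}:{spark_port} "
        f"--executor-memory {memory_mb}M --driver-memory {memory_mb}M"
    )


launch = build_jupyter_launch(
    "ec2-54-93-231-162.eu-central-1.compute.amazonaws.com", 22322, 2000
)
print(launch)
```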



In [34]:
script_path = "scripts/jupyter_setup.sh"

In [35]:
with open(script_path, "w") as script_file:
    script_file.write(jupyter_setup)
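
Note that `open(script_path, "w")` fails if the `scripts/` directory does not exist yet. A minimal sketch that creates the directory first, using `pathlib`, is shown below. The helper name `write_script` is hypothetical, and the demonstration writes into a temporary directory rather than the project.

```python
import tempfile
from pathlib import Path


def write_script(script_path, content):
    """Write content to script_path, creating parent directories first."""
    path = Path(script_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Drop the blank lines left by the triple-quoted string; end with a newline.
    path.write_text(content.strip() + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    written = write_script(Path(tmp) / "scripts" / "jupyter_setup.sh", "\necho hello\n")
    written_text = written.read_text()
    print(repr(written_text))
```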

In [29]:
show_command(f"flintrock --config={config_path} copy-file --master-only {cluster_name} {script_path} /home/ec2-user/jupyter_setup.sh")

8. Log in to the master node.
9. Start a new tmux session.
10. `source` the Jupyter setup script to start JupyterLab.
11. Access Jupyter via the browser.
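
The manual steps on the master node can be sketched as the following commands, printed here so they can be copied into a terminal. The tmux session name `jupyter` is a hypothetical choice; the script location is the target path of the `copy-file` command above.

```python
config_path = "config/bigdata-cluster.yaml"
cluster_name = "bigdata-cluster"
session_name = "jupyter"  # hypothetical tmux session name

# (where to run it, command)
steps = [
    ("local", f"flintrock --config={config_path} login {cluster_name}"),
    ("master", f"tmux new-session -s {session_name}"),
    ("master", "source /home/ec2-user/jupyter_setup.sh"),
]
for where, cmd in steps:
    print(f"[{where}] {cmd}")
```

Detaching from tmux with `Ctrl-b d` leaves the notebook server running after logout; `tmux attach -t jupyter` reattaches to it later.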

In [30]:
master_url

'ec2-54-93-231-162.eu-central-1.compute.amazonaws.com'

In [36]:
from IPython.display import display, HTML
display(HTML(f"<a href='http://{master_url}:{port}'><button type='button' style='padding: 5px;'><b>-> Open Jupyter</b></button></a>"))

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_