# Computing on a Spark Cluster 

After [📕**Launching Spark in the Amazon Cloud with flintrock**](spark-flintrock-setup.ipynb), there are several ways to run Spark code on the cluster.

## Preamble

In [1]:
from data_science_learning_paths import show_command

In [2]:
proj_path = "/Users/cls/Documents/Work/Projects/point8/DataScienceLearningPaths/big-data-cluster"

In [3]:
show_command(f"cd {proj_path}")

In [4]:
cd {proj_path}

/Users/cls/Documents/Work/Projects/point8/DataScienceLearningPaths/big-data-cluster


## Prerequisites

Make sure that the cluster is running (see [**📕Setup**](spark-flintrock-setup.ipynb)).

In [5]:
#cluster_name = "test-cluster"
cluster_name = "bigdata-cluster"

In [6]:
config_path = f"config/{cluster_name}.yaml"

## Describe Cluster

In [7]:
show_command(f"flintrock --config={config_path} describe {cluster_name}")

The following Python function runs the command above and parses its YAML output, so that we can extract the hostname of the master node. We are going to need it below.

In [8]:
import yaml

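def describe_cluster():
    """Run `flintrock describe` and return its output parsed as a dict."""
    # The `!` prefix is IPython syntax: it runs a shell command and
    # returns the output as a list of lines (notebook only).
    cluster_descr = !flintrock --config={config_path} describe {cluster_name}
    cluster_descr = "\n".join(cluster_descr)
    print(cluster_descr)
    # The description is YAML, so it can be loaded into nested dicts and lists.
    cluster_info = yaml.safe_load(cluster_descr)
    return cluster_info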

In [9]:
cluster_info = describe_cluster()

bigdata-cluster:
  state: running
  node-count: 3
  master: ec2-18-156-5-29.eu-central-1.compute.amazonaws.com
  slaves:
    - ec2-18-198-1-74.eu-central-1.compute.amazonaws.com
    - ec2-3-67-100-151.eu-central-1.compute.amazonaws.com


In [10]:
master_url = cluster_info[cluster_name]["master"]
master_url

'ec2-18-156-5-29.eu-central-1.compute.amazonaws.com'
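If the cluster is stopped or the name is misspelled, the plain dictionary lookup above fails with an unhelpful error or returns nothing useful. The following is a minimal sketch of a more defensive lookup. It assumes only the structure of the `flintrock describe` output shown above (`state`, `master`); the function name `get_running_master` and the placeholder hostnames are made up for illustration.

```python
def get_running_master(cluster_info: dict, cluster_name: str) -> str:
    """Return the master hostname, or raise if the cluster is not usable."""
    if not cluster_info or cluster_name not in cluster_info:
        raise KeyError(f"cluster {cluster_name!r} not found in flintrock output")
    cluster = cluster_info[cluster_name]
    state = cluster.get("state")
    if state != "running":
        raise RuntimeError(
            f"cluster {cluster_name!r} is in state {state!r}, expected 'running'"
        )
    master = cluster.get("master")
    if not master:
        raise RuntimeError(f"cluster {cluster_name!r} reports no master node")
    return master


# Example with the same structure as the parsed output above
# (hostnames replaced by hypothetical placeholders).
example_info = {
    "bigdata-cluster": {
        "state": "running",
        "node-count": 3,
        "master": "master.example.com",
        "slaves": ["worker-1.example.com", "worker-2.example.com"],
    }
}
print(get_running_master(example_info, "bigdata-cluster"))
```

In the notebook, `get_running_master(cluster_info, cluster_name)` could stand in for the direct lookup `cluster_info[cluster_name]["master"]`.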

## Working Directly on the Cluster

The following ways of working with Spark assume that you are logged in to the master node of the cluster. With our setup, `flintrock` opens the SSH session for us:

In [11]:
show_command(f"flintrock --config={config_path} login {cluster_name}")

### Working with the Spark Shell

_Start the PySpark shell:_

In [12]:
show_command("pyspark")

**Exercise**

_Try out a few commands from [📕**Spark Fundamentals**](spark-fundamentals.ipynb) to verify that PySpark is working._

### Submitting a Job

1. _Copy `pi_approximation.py` to the cluster (the command below expects it at `jobs/pi_approximation.py`)._
2. _Log in to the cluster._
3. _Submit it as a batch job with the following command:_

In [13]:
submit_cmd = f"spark-submit --master spark://{master_url}:7077 --deploy-mode client jobs/pi_approximation.py"
show_command(submit_cmd)

## Working Remotely with the Cluster

We can also stay on our local machine and send jobs to the cluster via `flintrock`.
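Two `flintrock` subcommands are relevant here: `copy-file` uploads a local file to the cluster nodes, and `run-command` executes a shell command on them. The sketch below only builds the command strings; it does not run anything. The helper name `flintrock_command` is hypothetical, the local and remote paths are assumptions, and the sketch assumes that the remote target directory has to exist before copying, so it is created first. `hostname` serves as a harmless stand-in for a real command.

```python
import shlex


def flintrock_command(config_path: str, subcommand: str, cluster_name: str, *args: str) -> str:
    """Build a flintrock command line; arguments with spaces are quoted for the shell."""
    parts = ["flintrock", f"--config={config_path}", subcommand, cluster_name, *args]
    return shlex.join(parts)


config_path = "config/bigdata-cluster.yaml"
cluster_name = "bigdata-cluster"

# Create the target directory on the master node (assumed to be required before copying).
mkdir_cmd = flintrock_command(
    config_path, "run-command", cluster_name, "--master-only", "mkdir -p jobs"
)

# Upload a local file; both paths are assumptions about the project layout.
copy_cmd = flintrock_command(
    config_path, "copy-file", cluster_name, "--master-only",
    "jobs/pi_approximation.py", "jobs/pi_approximation.py",
)

# Run an arbitrary shell command on the master node.
run_cmd = flintrock_command(
    config_path, "run-command", cluster_name, "--master-only", "hostname"
)

for cmd in (mkdir_cmd, copy_cmd, run_cmd):
    print(cmd)
```

Without `--master-only`, both subcommands act on every node of the cluster, which is rarely what we want when submitting a single job.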

**Exercise**

_Use `flintrock` to run the pi approximation job on the cluster._

## Monitoring the Cluster

### Cluster UI

The web UI of the Spark master listens on port 8080 of the master node by default.

In [14]:
master_url

'ec2-18-156-5-29.eu-central-1.compute.amazonaws.com'

In [15]:
from IPython.display import display, HTML
display(HTML(f"<a href='http://{master_url}:8080'><button type='button' style='padding: 5px;'><b>-> Open Spark UI</b></button></a>"))
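Besides the HTML page, the standalone master also serves a machine-readable status document, which is handy for a quick check from a notebook. The sketch below assumes that this document is available under `/json/` on the same port and that it contains the fields `status`, `cores`, `coresused` and a `workers` list with a `state` per worker; treat these names as assumptions and verify them against your Spark version. The function names are hypothetical, and the example data is hand-written rather than taken from a real cluster.

```python
import json
from urllib.request import urlopen


def fetch_master_status(master_host: str, port: int = 8080, timeout: float = 5.0) -> dict:
    """Download the master's status document (needs network access to the cluster)."""
    with urlopen(f"http://{master_host}:{port}/json/", timeout=timeout) as response:
        return json.load(response)


def summarize_master_status(status: dict) -> str:
    """Condense the status document into a single line."""
    workers = status.get("workers", [])
    alive = sum(1 for worker in workers if worker.get("state") == "ALIVE")
    return (
        f"master {status.get('status', 'UNKNOWN')}: "
        f"{alive}/{len(workers)} workers alive, "
        f"{status.get('coresused', 0)}/{status.get('cores', 0)} cores in use"
    )


# Hand-written example data with the assumed field names.
example_status = {
    "status": "ALIVE",
    "cores": 4,
    "coresused": 0,
    "workers": [{"state": "ALIVE"}, {"state": "ALIVE"}],
}
print(summarize_master_status(example_status))
```

Against the running cluster, the call would be `summarize_master_status(fetch_master_status(master_url))`, provided that port 8080 is reachable from the local machine.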

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_