<a href="https://colab.research.google.com/github/groda/big_data/blob/master/Setting_up_Spark_Standalone_on_Google_Colab_BigtopEdition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<a href="https://github.com/groda/big_data"><div><img src="https://github.com/groda/big_data/blob/master/logo_bdb.png?raw=true" align=right width="90" alt="Logo Big Data for Beginners"></div></a>
# Install and run Spark in standalone mode: Apache Bigtop edition <div><img src="https://www.apache.org/logos/res/bigtop/bigtop.png" width="45" style='vertical-align:middle; display:inline;' alt="Apache Bigtop" data-url="https://www.apache.org/logos/#bigtop"><img src="https://www.apache.org/logos/res/spark/spark.png" width="45" style='vertical-align:middle; display:inline;' alt="Apache Spark" data-url="https://www.apache.org/logos/#spark"></div>

<br>

We will install Apache Spark on a single machine (the virtual machine hosting this notebook) in _standalone mode_, meaning it will run without an external cluster manager such as YARN, Mesos, or Kubernetes. For more information, see the [types of cluster managers supported by Spark](https://spark.apache.org/docs/latest/cluster-overview.html#cluster-manager-types).

We're following the official [Spark Standalone documentation](https://spark.apache.org/docs/latest/spark-standalone.html), using Apache Bigtop's Spark distribution, which conveniently packages Spark's start scripts as services.




Before running this notebook, you may want to update the Bigtop version (currently version `3.3.0` [with Hadoop 3.3.5](https://bigtop.apache.org/release-notes.html), see also the [full list of releases](https://bigtop.apache.org/download.html)).

### A side note:





<a href="https://spark.apache.org/"><img src="https://www.apache.org/logos/res/spark/spark.png" width="120" align="right" style='vertical-align:middle; display:inline;' alt="Apache Spark" data-url="https://www.apache.org/logos/#spark"></a>
<a href="https://bigtop.apache.org/"><img src="https://www.apache.org/logos/res/bigtop/bigtop.png" width="120" align="right" style='vertical-align:middle; display:inline;' alt="Apache Bigtop" data-url="https://www.apache.org/logos/#bigtop"></a>


I recently discovered [a website](https://www.apache.org/logos/) where you can find all Apache project logos, including Spark's, with transparent backgrounds. It's a great resource for anyone who needs these assets for presentations or documentation.





## Install Spark from Bigtop repository

**Note:** the machine underlying Colab currently runs Ubuntu `22.04`, so we can use Bigtop `3.3.0` at most: starting with Bigtop `3.4.0`, only Ubuntu `24.04` is supported.

**Note 2:** Bigtop `3.3.0` includes Spark `3.3.4` (see the [list of all libraries included in the release](https://cwiki.apache.org/confluence/display/BIGTOP/Bigtop+3.3.0+Release)). Spark `3.3.4` [runs on Java 8/11/17](https://archive.apache.org/dist/spark/docs/3.3.4), and Colab comes with Java 17 pre-installed, so we do not need to install Java ourselves.

In [1]:
!lsb_release -rs

22.04


In [2]:
!java -version

openjdk version "17.0.17" 2025-10-21
OpenJDK Runtime Environment (build 17.0.17+10-Ubuntu-122.04)
OpenJDK 64-Bit Server VM (build 17.0.17+10-Ubuntu-122.04, mixed mode, sharing)
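
As a quick sanity check, the reported Java major version can be compared against the versions Spark `3.3.4` runs on. The sketch below is an illustration only: `java_major` is a hypothetical helper, not part of Bigtop or Spark, and it assumes the usual `java -version` format shown above.

```shell
# Hypothetical helper: extract the Java major version from the first line
# of `java -version` output, e.g. 'openjdk version "17.0.17" 2025-10-21'.
java_major() {
  # Take the quoted version string, then handle the legacy "1.8.0_x" scheme.
  echo "$1" | awk -F '"' '{
    split($2, v, /\./)
    if (v[1] == "1") print v[2]; else print v[1]
  }'
}

# `java -version` prints to stderr, hence the redirection.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java -version 2>&1 | head -n 1)")
  case "$major" in
    8|11|17) echo "Java $major is supported by Spark 3.3.4" ;;
    *)       echo "Java $major is not listed for Spark 3.3.4" ;;
  esac
fi
```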


In [3]:
%%bash
# 1. Use sudo to write the repo list
echo "Adding Bigtop repository..."
curl -s https://archive.apache.org/dist/bigtop/bigtop-3.3.0/repos/$(lsb_release -is | tr '[:upper:]' '[:lower:]')-$(lsb_release -rs)/bigtop.list | sudo tee /etc/apt/sources.list.d/bigtop-3.3.0.list

# 2. Add the GPG key (Updated to modern trusted.gpg.d method)
echo "Adding Bigtop GPG key..."
wget -qO - https://archive.apache.org/dist/bigtop/bigtop-3.3.0/repos/GPG-KEY-bigtop | sudo gpg --dearmor -o /etc/apt/trusted.gpg.d/bigtop.gpg

# 3. Use sudo for apt update
echo "Updating package cache..."
sudo apt-get update

Adding Bigtop repository...
deb http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/$(ARCH) bigtop contrib
Adding Bigtop GPG key...
Updating package cache...
Get:1 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop InRelease [2,502 B]
Get:2 https://cli.github.com/packages stable InRelease [3,917 B]
Get:3 http://security.ubuntu.com/ubuntu jammy-security InRelease [129 kB]
Hit:4 http://archive.ubuntu.com/ubuntu jammy InRelease
Get:5 https://r2u.stat.illinois.edu/ubuntu jammy InRelease [6,555 B]
Get:6 https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/ InRelease [3,632 B]
Get:7 http://archive.ubuntu.com/ubuntu jammy-updates InRelease [128 kB]
Get:8 http://repos.bigtop.apache.org/releases/3.3.0/ubuntu/22.04/amd64 bigtop/contrib amd64 Packages [18.7 kB]
Get:9 https://cli.github.com/packages stable/main amd64 Packages [356 B]
Get:10 http://archive.ubuntu.com/ubuntu jammy-backports InRelease [127 kB]
Get:11 https://r2u.stat.illinois.edu/ubuntu jammy/main all

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
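
To try a different Bigtop release, the version string in the two URLs (and in the name of the `.list` file) is what needs to change. The following sketch builds the `bigtop.list` URL from its parts. `bigtop_list_url` is a hypothetical helper, and it assumes that other releases follow the same `archive.apache.org` layout as `3.3.0`, which is worth verifying before use.

```shell
# Hypothetical helper: build the URL of the bigtop.list file for a given
# Bigtop version, distribution ID, and distribution release.
bigtop_list_url() {
  version=$1
  distro=$(echo "$2" | tr '[:upper:]' '[:lower:]')   # e.g. Ubuntu -> ubuntu
  release=$3
  echo "https://archive.apache.org/dist/bigtop/bigtop-${version}/repos/${distro}-${release}/bigtop.list"
}

# The values used in this notebook:
bigtop_list_url 3.3.0 Ubuntu 22.04
```

On the machine itself, `$(lsb_release -is)` and `$(lsb_release -rs)` can supply the last two arguments, as in the cell above.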


### Explore Bigtop and Spark packages

In [4]:
%%bash
echo 'List all available packages that match "bigtop"'
sudo apt-get -qq update && sudo apt search bigtop

echo 'List all available packages that match "spark"'
sudo apt search spark

List all available packages that match "bigtop"
Sorting...
Full Text Search...
bigtop-groovy/stable 2.5.4-1 all
  An agile and dynamic language for the Java Virtual Machine

bigtop-jsvc/stable 1.2.4-1 amd64
  Application to launch java daemon

bigtop-utils/stable 3.3.0-1 all
  Collection of useful tools for Bigtop

List all available packages that match "spark"
Sorting...
Full Text Search...
alluxio/stable 2.9.3-1 all
  Reliable file sharing at memory speed across cluster frameworks

libjs-jquery.sparkline/jammy 2.1.2-3 all
  library for jQuery to generate sparklines

libsparkline-php/jammy 0.2-7 all
  sparkline graphing library for php

livy/stable 0.8.0-1 all
  Livy is an open source REST interface for interacting with Apache Spark from anywhere.

node-sparkles/jammy 1.0.1-2 all
  Namespaced global event emitter

nspark/jammy 1.7.8B2+git20210317.cb30779-2 amd64
  Unarchiver for Spark and ArcFS files

pcp-export-pcp2spark/jammy 5.3.6-1build1 amd64
  Tool for exporting data from PCP to

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
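
The full-text search matches the term anywhere in package names and descriptions, so it also returns unrelated packages such as `libsparkline-php`. A narrower listing can be obtained by filtering on the start of the package name. The sketch below assumes the `name - description` line format printed by `apt-cache search`; `names_with_prefix` is a hypothetical helper.

```shell
# Hypothetical helper: keep only package names that start with a prefix.
# Reads "name - description" lines (the apt-cache search format) on stdin.
names_with_prefix() {
  awk -v prefix="$1" 'index($1, prefix) == 1 { print $1 }'
}

# Only query the package cache when apt-cache is available.
if command -v apt-cache >/dev/null 2>&1; then
  apt-cache search spark | names_with_prefix spark
fi
```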






### Install the essential packages

To run a Spark job, we need the core libraries as well as the Spark master and a Spark worker. In this setup, master and worker both run on the same machine (localhost).

The package `bigtop-utils` will be used to start the services.

In [5]:
%%bash
for p in spark-core spark-master spark-worker bigtop-utils; do
  echo "🛠️ Installing $p"
  sudo apt install -qq -y $p
done

🛠️ Installing spark-core
The following additional packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-client hadoop-hdfs
  hadoop-mapreduce hadoop-yarn netcat-openbsd zookeeper
The following NEW packages will be installed:
  bigtop-groovy bigtop-jsvc bigtop-utils hadoop hadoop-client hadoop-hdfs
  hadoop-mapreduce hadoop-yarn netcat-openbsd spark-core zookeeper
0 upgraded, 11 newly installed, 0 to remove and 82 not upgraded.
Need to get 732 MB of archives.
After this operation, 905 MB of additional disk space will be used.
Selecting previously unselected package netcat-openbsd.
(Reading database ...



debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 11.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debconf/FrontEnd/Dialog.pm line 78, <> line 1.)
debconf: falling back to frontend: Readline
debconf: unable to initialize frontend: Readline
debconf: (This frontend requires a controlling tty.)
debconf: falling back to frontend: Teletype
dpkg-preconfigure: unable to re-open stdin: 


debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is insta

**Note:** in a future version of this notebook we are going to use an alternative to `apt` for installing packages, in order to avoid the following warning:

```
WARNING: apt does not have a stable CLI interface. Use with caution in scripts.

```
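
One possible alternative, offered here as a sketch rather than as the notebook's eventual solution, is `apt-get`, which is the interface intended for scripts and does not print this warning. Setting `DEBIAN_FRONTEND=noninteractive` should also silence the `debconf` frontend messages seen in the output above. `install_pkgs` and its `DRY_RUN` switch are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: install packages with apt-get instead of apt.
# With DRY_RUN=1 the command is only printed, not executed.
install_pkgs() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "sudo env DEBIAN_FRONTEND=noninteractive apt-get install -y -qq $*"
  else
    sudo env DEBIAN_FRONTEND=noninteractive apt-get install -y -qq "$@"
  fi
}

# Preview the command for the packages used in this notebook.
DRY_RUN=1 install_pkgs spark-core spark-master spark-worker bigtop-utils
```

Dropping `DRY_RUN=1` from the last line runs the installation for real.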

## Start Spark services

Thanks to Bigtop's utilities, we can now start the Spark master and Spark worker as services. Normally one would use `systemctl`, but since it cannot be used on Colab, we resort to `service`.

In [6]:
%%bash
for p in spark-master spark-worker; do
  echo "Starting $p"
  # systemctl start $p
  sudo service $p start
done

Starting spark-master
 * Starting Spark master (spark-master): 
Starting spark-worker
 * Starting Spark worker (spark-worker): 
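
The services may need a moment to come up, so a job submitted right away might fail to connect. A small wait loop can guard against that. This is a sketch under stated assumptions: `wait_for_port` is a hypothetical helper, it relies on `nc` (the `netcat-openbsd` package pulled in during installation), and it assumes the master listens on port `7077`, the port used with `spark-submit` in the next section.

```shell
# Hypothetical helper: wait until a TCP port accepts connections.
# Usage: wait_for_port HOST PORT [TIMEOUT_SECONDS]
wait_for_port() {
  wfp_host=$1
  wfp_port=$2
  wfp_timeout=${3:-30}
  wfp_elapsed=0
  while [ "$wfp_elapsed" -lt "$wfp_timeout" ]; do
    # -z: only probe the port, -w 1: give up on one attempt after 1 second
    if nc -z -w 1 "$wfp_host" "$wfp_port" 2>/dev/null; then
      echo "$wfp_host:$wfp_port is accepting connections"
      return 0
    fi
    sleep 1
    wfp_elapsed=$((wfp_elapsed + 1))
  done
  echo "Timed out waiting for $wfp_host:$wfp_port" >&2
  return 1
}

# Probe the Spark master for up to 5 seconds.
wait_for_port "${HOSTNAME:-localhost}" 7077 5 || echo "Spark master is not reachable yet"
```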


## Run the `pi` example

This step may take some time.

We'll run the `SparkPi` demo from the examples included in the Spark distribution, which are packaged in the `spark-examples*.jar` file.

We'll submit the job using [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html), and the output will be an approximation of π (for more details, see the [official Spark examples](https://spark.apache.org/examples.html)).


The code cell further below defines the variable `$EXAMPLES_JAR`, which points to the archive containing all the examples from the Spark distribution.

The following command submits the SparkPi application (located in the `org.apache.spark.examples.SparkPi` class) to the Spark master at `spark://${HOSTNAME}:7077` using `spark-submit`:

```
$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://${HOSTNAME}:7077 \
  $EXAMPLES_JAR \
  100
```

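In this example, the argument `100` is the number of slices (partitions) over which the sampling is distributed; SparkPi draws 100,000 random points per slice, so a larger value means more points. It approximates π by drawing the points in a square and taking 4 times the fraction of points that fall inside the inscribed unit circle.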
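
The sampling idea behind this estimate can be tried locally without Spark. The following is a minimal single-process sketch in `awk`, not the SparkPi source; `estimate_pi` and the fixed seed are choices made for illustration.

```shell
# Hypothetical helper: Monte Carlo estimate of pi from N random points.
estimate_pi() {
  awk -v n="$1" 'BEGIN {
    srand(42)                      # fixed seed, so the same awk repeats its result
    inside = 0
    for (i = 0; i < n; i++) {
      x = rand() * 2 - 1           # uniform in [-1, 1)
      y = rand() * 2 - 1
      if (x * x + y * y <= 1) inside++
    }
    # The circle covers pi/4 of the square, so pi is about 4 * inside / n.
    printf "%.4f\n", 4 * inside / n
  }'
}

estimate_pi 100000
```

The printed value depends on the `awk` implementation's random number generator, but with 100,000 points it should agree with π to about two decimal places.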

In [7]:
!find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit

/usr/lib/spark/examples/jars/spark-examples.jar


In [8]:
%%bash

export EXAMPLES_JAR=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)

$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://${HOSTNAME}:7077 \
  $EXAMPLES_JAR \
  100

26/02/22 19:25:28 INFO SparkContext: Running Spark version 3.3.4
26/02/22 19:25:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/22 19:25:28 INFO ResourceUtils: No custom resources configured for spark.driver.
26/02/22 19:25:28 INFO SparkContext: Submitted application: Spark Pi
26/02/22 19:25:28 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
26/02/22 19:25:28 INFO ResourceProfile: Limiting resource is cpu
26/02/22 19:25:28 INFO ResourceProfileManager: Added ResourceProfile id: 0
26/02/22 19:25:28 INFO SecurityManager: Changing view acls to: root
26/02/22 19:25:28 INFO SecurityManager: Changing modify acls to: root
26/02/22 19:25:28 INFO SecurityManager:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]


## Run the Java Random Forest Regressor example

Next, we will run the Java Random Forest Regressor example. Source: [JavaRandomForestRegressorExample.java](https://github.com/apache/spark/blob/master/examples/src/main/java/org/apache/spark/examples/ml/JavaRandomForestRegressorExample.java).

In [9]:
%%bash
j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
echo "Jar file containing examples: $j"

Jar file containing examples: /usr/lib/spark/examples/jars/spark-examples.jar


If you run

```
%%bash
j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
spark-submit --class  org.apache.spark.examples.ml.JavaRandomForestRegressorExample $j
```

you'll get an error message telling you that the file `/content/data/mllib/sample_libsvm_data.txt` is missing. We are going to put this file in place, but first we need to find it!

In [10]:
!find / -name 'sample_libsvm_data.txt' 2> /dev/null

/usr/local/lib/python3.12/dist-packages/pyspark/data/mllib/sample_libsvm_data.txt
/usr/lib/spark/data/mllib/sample_libsvm_data.txt


Copy the data file to the expected location:

In [11]:
%%bash
mkdir -p data/mllib/
cp /usr/lib/spark/data/mllib/sample_libsvm_data.txt data/mllib/sample_libsvm_data.txt
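
The same step can be written so that it is safe to re-run and reports a missing source file clearly. A sketch, with `ensure_data_file` as a hypothetical helper:

```shell
# Hypothetical helper: copy SRC to DST unless DST already exists.
ensure_data_file() {
  edf_src=$1
  edf_dst=$2
  if [ -f "$edf_dst" ]; then
    echo "Already present: $edf_dst"
    return 0
  fi
  if [ ! -f "$edf_src" ]; then
    echo "Source file not found: $edf_src" >&2
    return 1
  fi
  mkdir -p "$(dirname "$edf_dst")"
  cp "$edf_src" "$edf_dst"
  echo "Copied $edf_src to $edf_dst"
}

# The destination is relative to the current working directory,
# which is where the example looks for the file.
ensure_data_file /usr/lib/spark/data/mllib/sample_libsvm_data.txt \
  data/mllib/sample_libsvm_data.txt || echo "Nothing copied"
```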

Run the `JavaRandomForestRegressorExample` example.

In [12]:
%%bash
j=$(find $(which hadoop|awk -F 'bin/hadoop' '{print $1}') -name 'spark-examples.jar' -print -quit)
spark-submit --class  org.apache.spark.examples.ml.JavaRandomForestRegressorExample $j

+----------+-----+--------------------+
|prediction|label|            features|
+----------+-----+--------------------+
|       0.0|  0.0|(692,[122,123,124...|
|       0.0|  0.0|(692,[125,126,127...|
|      0.05|  0.0|(692,[126,127,128...|
|      0.05|  0.0|(692,[126,127,128...|
|      0.05|  0.0|(692,[128,129,130...|
+----------+-----+--------------------+
only showing top 5 rows
Root Mean Squared Error (RMSE) on test data = 0.10850895607454299
Learned regression forest model:
RandomForestRegressionModel: uid=rfr_8579b0d5257b, numTrees=20, numFeatures=692
  Tree 0 (weight 1.0):
    If (feature 489 <= 37.5)
     If (feature 492 <= 205.5)
      Predict: 0.0
     Else (feature 492 > 205.5)
      Predict: 1.0
    Else (feature 489 > 37.5)
     Predict: 1.0
  Tree 1 (weight 1.0):
    If (feature 489 <= 5.5)
     Predict: 0.0
    Else (feature 489 > 5.5)
     Predict: 1.0
  Tree 2 (weight 1.0):
    If (feature 406 <= 126.5)
     Predict: 0.0
    Else (feature 406 > 126.5)
     Predict: 1.0


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
26/02/22 19:25:57 INFO SparkContext: Running Spark version 4.0.2
26/02/22 19:25:57 INFO SparkContext: OS info Linux, 6.6.113+, amd64
26/02/22 19:25:57 INFO SparkContext: Java version 17.0.17
26/02/22 19:25:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
26/02/22 19:25:58 INFO ResourceUtils: No custom resources configured for spark.driver.
26/02/22 19:25:58 INFO SparkContext: Submitted application: JavaRandomForestRegressorExample
26/02/22 19:25:58 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
26/02/22 19:25:58 INFO ResourceProfile: Limiting resource is cpu
26/02/22 19:25:58 INF
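
Note that the log above reports Spark `4.0.2`, whereas the SparkPi run reported `3.3.4`. This suggests that the bare `spark-submit` command resolved to a different installation than Bigtop's, possibly the one shipped with the `pyspark` Python package found earlier under `dist-packages`. The sketch below lists every match on `PATH` in lookup order; `list_on_path` is a hypothetical helper.

```shell
# Hypothetical helper: print every executable called NAME found on PATH,
# in the order in which the shell would look them up.
list_on_path() {
  lop_name=$1
  lop_old_ifs=$IFS
  IFS=:
  for lop_dir in $PATH; do
    lop_candidate="${lop_dir:-.}/$lop_name"
    if [ -f "$lop_candidate" ] && [ -x "$lop_candidate" ]; then
      echo "$lop_candidate"
    fi
  done
  IFS=$lop_old_ifs
}

# The first line printed is the script that a bare `spark-submit` runs.
list_on_path spark-submit
```

If the first entry is not Bigtop's script, calling that script by its full path (presumably `/usr/lib/spark/bin/spark-submit`, an assumption based on the location of the examples jar) and adding `--master spark://${HOSTNAME}:7077` should run the job on the standalone cluster set up in this notebook.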

## Summary

In this guide, we demonstrated how to install the essential Spark packages (Spark Core, Spark Master, and Spark Worker) from the Bigtop distribution. We also used Bigtop's service scripts to launch the Spark master and worker, and we ran two example jobs included in the Spark distribution.