<a href="https://colab.research.google.com/github/groda/big_data/blob/master/Hadoop_Setting_up_Spark_Standalone_on_Google_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Install and run Spark in standalone mode

 Following the instructions at: https://spark.apache.org/docs/latest/spark-standalone.html

You might have to change the constant `HADOOP_SPARK_URL` (the URL for downloading the Hadoop+Spark distribution).

In [1]:
# URL for downloading Hadoop and Spark
HADOOP_SPARK_URL = "https://dlcdn.apache.org/spark/spark-3.3.1/spark-3.3.1-bin-hadoop3.tgz"

## Setup Spark

### Set some environment variables

In [2]:
import os
os.environ['HADOOP_SPARK_URL'] = HADOOP_SPARK_URL
os.environ['SPARK_HOME'] = os.path.join('/content', os.path.splitext(os.path.basename(HADOOP_SPARK_URL))[0])
os.environ['PATH'] = ':'.join([os.path.join(os.environ['SPARK_HOME'], 'bin'), os.environ['PATH']])
os.environ['JAVA_HOME'] = '/usr/lib/jvm/java-11-openjdk-amd64'

### Download package and unpack

**Note:** using `--no-clobber` option will prevent `wget` from downloading file if already present.

In [3]:
!wget --no-clobber $HADOOP_SPARK_URL

File ‘spark-3.3.1-bin-hadoop3.tgz’ already there; not retrieving.



In [4]:
!([ -f $(basename ${HADOOP_SPARK_URL}|sed 's/\.[^.]*$//') ] && echo "Folder already exists") || (tar xzf $(basename $HADOOP_SPARK_URL) && echo "Uncompressed")


Uncompressed


Check

In [5]:
!ls

sample_data  spark-3.3.1-bin-hadoop3  spark-3.3.1-bin-hadoop3.tgz


### Start a standalone master

First, stop Spark master and all workers in case there's some already running.

In [6]:
%%bash
$SPARK_HOME/sbin/stop-master.sh 
$SPARK_HOME/sbin/stop-worker.sh 


no org.apache.spark.deploy.master.Master to stop
no org.apache.spark.deploy.worker.Worker to stop


In [7]:
!ls $SPARK_HOME

bin   data	jars	    LICENSE   logs    python  README.md  sbin  yarn
conf  examples	kubernetes  licenses  NOTICE  R       RELEASE	 work


In [8]:
%%bash --out output
$SPARK_HOME/sbin/start-master.sh

#### Extract port number for the Spark Web UI
We captured the `output` in order to get the name of the logfile and extract from it the URL of the Spark Web UI. 

Here's the name of the master's logfile: 

In [9]:
!echo $(echo "$output" | grep -o '[^ ]*$')

/content/spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.master.Master-1-726f1dacfbf9.out


Extract port where MasterUI (the Web interface for Spark's master) is running:

In [10]:
%%bash
PORT=$(grep  -m1 -Po "Successfully started service 'MasterUI' on port \d+" $SPARK_HOME/logs/spark--org.apache.spark.deploy.master.Master*.out| cut -d' ' -f7)
echo $PORT




In [11]:
%env PORT=8081

env: PORT=8081


### Start one worker

In order to start a worker you need the URL for the running master node, that is something like `spark://${HOSTNAME}:7077`




In [12]:
!$SPARK_HOME/sbin/start-worker.sh spark://${HOSTNAME}:7077

starting org.apache.spark.deploy.worker.Worker, logging to /content/spark-3.3.1-bin-hadoop3/logs/spark--org.apache.spark.deploy.worker.Worker-1-726f1dacfbf9.out


## Look at the master's Web UI

Open the WebUI in a new window or tab in your browser buy clicking on the link below:

In [13]:
from google.colab import output
output.serve_kernel_port_as_window(8081)

<IPython.core.display.Javascript object>

## Run a Spark job with `spark-submit`

This step might take some time. 

We are going to run the SparkPi demo from the examples in the Spark distribution contained in `spark-examples*.jar`. 

We are submitting the job with [`spark-submit`](https://spark.apache.org/docs/latest/submitting-applications.html). The output of this job is an approximation of π (see also https://spark.apache.org/examples.html).

In [14]:
%%bash

export EXAMPLES_JAR=$(find $SPARK_HOME/examples/jars/ -name "spark-examples_2*")

$SPARK_HOME/bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://${HOSTNAME}:7077 \
  --executor-memory 2G \
  --total-executor-cores 100 \
  $EXAMPLES_JAR \
  100

22/10/26 18:27:25 INFO SparkContext: Running Spark version 3.3.1
22/10/26 18:27:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/10/26 18:27:25 INFO ResourceUtils: No custom resources configured for spark.driver.
22/10/26 18:27:25 INFO SparkContext: Submitted application: Spark Pi
22/10/26 18:27:25 INFO ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 2048, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
22/10/26 18:27:25 INFO ResourceProfile: Limiting resource is cpu
22/10/26 18:27:25 INFO ResourceProfileManager: Added ResourceProfile id: 0
22/10/26 18:27:25 INFO SecurityManager: Changing view acls to: root
22/10/26 18:27:25 INFO SecurityManager: Changing modify acls to: root
22/10/26 18:27:25 INFO SecurityManager:

## Shutdown 

In [15]:
%%bash
$SPARK_HOME/sbin/stop-master.sh 
$SPARK_HOME/sbin/stop-worker.sh

stopping org.apache.spark.deploy.master.Master
stopping org.apache.spark.deploy.worker.Worker
