<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_08_Running_Spark_on_a_cluster.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.5.0, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that lets Python locate the Spark installation.

In [None]:
import os

os.environ["SPARK_VERSION"] = "spark-3.5.0"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget  http://apache.osuosl.org/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop3.tgz
!tar xf $SPARK_VERSION-bin-hadoop3.tgz
!echo $SPARK_VERSION-bin-hadoop3.tgz
!rm $SPARK_VERSION-bin-hadoop3.tgz
!pip install -q findspark

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

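# -f avoids an error on the first run, when the link does not exist yet
!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop3 /content/spark
# Each `!` command runs in its own subshell, so `!export PATH=...` would be lost.
# Update PATH from Python instead so that later cells see it.
os.environ["PATH"] += os.pathsep + os.pathsep.join(
    os.path.join(os.environ["SPARK_HOME"], d) for d in ("bin", "sbin"))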
!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"
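
Before starting Spark, it can help to confirm that the variables set above point to directories that actually exist. The following is a minimal sketch in plain Python; `missing_paths` is a hypothetical helper written for this notebook, not part of Spark or Findspark.

```python
import os

def missing_paths(names, env=os.environ):
    """Return the variables in `names` that are unset or point to a non-existent path.

    Hypothetical helper for illustration only.
    """
    return [n for n in names if not os.path.exists(env.get(n, ""))]

# Check the two locations that Spark needs
problems = missing_paths(["JAVA_HOME", "SPARK_HOME"])
print("Missing or invalid:", problems if problems else "none")
```

If either name is reported, re-run the installation cell and check the paths before calling `findspark.init()`.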

### Start a SparkSession
This will start a local Spark session.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise a list and show its first 2 elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.some.config.option", "some-value") \
.master("local[4]") \
.getOrCreate()
# We get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


---


# 08 - Running Spark on a cluster

## Running a Spark program

### Command `spark-submit`

-   Submits a Spark program (an application) to a cluster
-   More specifically, it launches the driver program and invokes the main() method specified by the user

-   Example:
```sh
$ bin/spark-submit --master yarn --deploy-mode cluster \
     --py-files anotherlib.zip,anotherfile.py \
     --num-executors 10 --executor-cores 2 \
     my-script.py script_options
```

### `spark-submit` options

-   `master`: cluster manager to use (options: `yarn`, `mesos://host:port`, `spark://host:port`, `local[n]`)

-   `deploy-mode`: Two ways of deploying

    -   `client`: runs the driver on the local node, i.e. the machine where `spark-submit` is invoked (default)

    -   `cluster`: runs the driver on a node of the cluster

-   `class`: class to execute (Java or Scala)

-   `name`: name of the application (it will be shown in the Spark web UI)

-   `jars`: jar files to add to the classpath (Java or Scala)

-   `py-files`: files to add to the PYTHONPATH (`.py`,`.zip`,`.egg`)

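-   `files`: files to be placed in the working directory of each executor (e.g. data files used by the application)

-   `executor-memory`: memory per executor (e.g. `2G`; default: `1G`)

-   `driver-memory`: memory of the driver process (e.g. `1000M`, `2G`; default: `1024M`)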

For more options: `spark-submit --help`
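
When `spark-submit` is launched from another Python program, the options listed above can be assembled programmatically. The sketch below is only an illustration: `build_spark_submit` is a hypothetical helper (not a Spark API) that turns keyword arguments into the corresponding `--flag value` pairs, and the default path `bin/spark-submit` is an assumption about where Spark is installed.

```python
import shlex

def build_spark_submit(script, script_args=(), spark_submit="bin/spark-submit", **options):
    """Return a spark-submit command as an argument list.

    Hypothetical helper: keyword names use underscores (deploy_mode)
    and are converted into flags (--deploy-mode).
    """
    cmd = [spark_submit]
    for name, value in options.items():
        cmd += ["--" + name.replace("_", "-"), str(value)]
    cmd.append(script)          # the application comes after all Spark options
    cmd += list(script_args)    # anything after the script is passed to the script itself
    return cmd

cmd = build_spark_submit(
    "my-script.py",
    script_args=["script_options"],
    master="yarn",
    deploy_mode="cluster",
    py_files="anotherlib.zip,anotherfile.py",
    num_executors=10,
    executor_cores=2,
)
# Show the command as it would be typed in a shell
print(shlex.join(cmd))
```

On a machine where Spark is installed, the resulting list could be passed to `subprocess.run(cmd, check=True)`. Building a list rather than a single string avoids quoting problems with values such as `local[2]`.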

In [None]:
!spark/bin/spark-submit --help

![Spark-YARN: Client mode](https://i.pinimg.com/originals/e5/e6/a6/e5e6a6dbc4da4a2dbc1b54effd5995ee.jpg)


![Spark-on-YARN: Cluster mode](https://i.pinimg.com/originals/db/16/e9/db16e98baed2a9b54c64e931e1f9b2b5.jpg)

Source: [Spark-on-YARN: Empower Spark Applications on Hadoop Cluster](https://www.slideshare.net/Hadoop_Summit/sparkonyarn-empower-spark-applications-on-hadoop-cluster)

## Configuration parameters

Several parameters can be adjusted at runtime, in three ways:

-   In the script
```python
from pyspark import SparkConf,SparkContext
conf = SparkConf()
conf.set("spark.app.name", "My app")
conf.set("spark.master", "local[2]") # Cluster manager local mode with 2 threads
conf.set("spark.ui.port", "3600")    # Port of the Spark web interface (by default: 4040)
sc = SparkContext(conf=conf)
```

-   Using flags in the `spark-submit` command
```sh
$ bin/spark-submit --master local[2] --name "My app" \
    --conf spark.ui.port=3600 my-script.py
```    
    
-   Using a properties file
```sh
$ cat config.conf
spark.master     local[2]
spark.app.name   My app
spark.ui.port    3600
$ bin/spark-submit --properties-file config.conf my-script.py
```

More information: <http://spark.apache.org/docs/latest/configuration.html#spark-properties>
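
When the same property is set in more than one place, the Spark configuration documentation states that values set on `SparkConf` in the script take precedence over `spark-submit` flags, which in turn take precedence over the properties file. The sketch below models that ordering with plain dictionaries so it can be run without Spark. `parse_properties` and `effective_conf` are hypothetical helpers, and the parser is a simplification that only handles `key value` and `key=value` lines.

```python
import re

def parse_properties(text):
    """Parse 'key value' or 'key=value' lines into a dict (simplified; '#' starts a comment)."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split once, on '=' (with optional spaces) or on a run of whitespace
        parts = re.split(r"\s*=\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props

def effective_conf(file_props, flag_props, script_props):
    """Merge the three sources; later sources override earlier ones."""
    return {**file_props, **flag_props, **script_props}

file_props = parse_properties("""
spark.master     local[2]
spark.app.name   My app
spark.ui.port    3600
""")
flag_props = {"spark.ui.port": "4050"}                # e.g. --conf spark.ui.port=4050
script_props = {"spark.app.name": "My Python script"} # e.g. conf.set(...) in the script

conf = effective_conf(file_props, flag_props, script_props)
for key in sorted(conf):
    print(key, "=", conf[key])
```

In a running application, the values Spark actually uses can be inspected with `sc.getConf().getAll()`.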


## Example: Python script execution

In [None]:
!cat "$DRIVE_DATA"/myscript.py


In [None]:
# NOTE: This cell is a shell here-document that writes /tmp/myscript.py; it won't run as a Python notebook cell.
# Do NOT modify the following line
cat << EOF > /tmp/myscript.py
from pyspark import SparkConf, SparkContext
from operator import add

def main():
    conf = SparkConf()
    conf.set("spark.app.name", "My Python script")

    # Initialise the SparkContext
    sc = SparkContext(conf=conf)
    sc.setLogLevel("FATAL")

    rdd = sc.parallelize(range(100000)).cache()

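    # Python 3 no longer allows tuple unpacking in lambda parameters,
    # so the (key, value) pair is indexed explicitly
    rdd2 = rdd.map(lambda x: (x, 2*x))\
              .map(lambda kv: (kv[0]-100, kv[1]**2))\
              .reduceByKey(lambda x,y: x+y)\
              .values()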

    r = rdd2.reduce(add)

    print("Final result = {0}".format(r))

    # Stop the SparkContext
    sc.stop()
if __name__ == "__main__":
    main()
EOF

In [None]:
!spark/bin/spark-submit --master local[8] "$DRIVE_DATA"myscript.py
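
To sanity-check the value printed by the script, the same steps can be reproduced without Spark. The sketch below is a plain-Python stand-in, not the script itself: `pipeline_without_spark` is a hypothetical name, and a dictionary plays the role of `reduceByKey`. Because every key is distinct, `reduceByKey` leaves the values unchanged, so the result is simply the sum of `(2*x)**2` over the range.

```python
from functools import reduce
from operator import add

def pipeline_without_spark(n):
    """Mirror the steps of myscript.py on a plain Python range (illustrative only)."""
    pairs = ((x, 2 * x) for x in range(n))            # map: x -> (x, 2x)
    shifted = ((k - 100, v ** 2) for k, v in pairs)   # map: (x, y) -> (x - 100, y**2)
    by_key = {}
    for k, v in shifted:                              # reduceByKey: sum the values per key
        by_key[k] = by_key.get(k, 0) + v
    return reduce(add, by_key.values())               # values() followed by reduce(add)

result = pipeline_without_spark(100000)
print("Final result = {0}".format(result))
```

The number printed here should match the `Final result` line produced by `spark-submit`, regardless of the number of local threads used.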