# Set up the environment to submit PySpark jobs on YARN

The goal is to submit PySpark jobs from a local client to a remote Spark cluster managed by YARN.

* [Submitting pyspark script to a remote Spark server?](https://stackoverflow.com/questions/54641574/submitting-pyspark-script-to-a-remote-spark-server)

In [1]:
import os
import sys
import gc

---
# HDFS permission

For a non-spark user to be able to submit a job, log in to the HDFS node as the hadoop user and create an HDFS home directory that the user can write to:

```
hadoop fs -mkdir /user/${USERNAME}
hadoop fs -chown ${USERNAME} /user/${USERNAME}
hadoop fs -chmod g+w /user/${USERNAME}
```

Otherwise, initializing the SparkContext fails with an error such as:
```
21/08/15 21:15:28 ERROR SparkContext: Error initializing SparkContext.
org.apache.hadoop.security.AccessControlException: Permission denied: user=${USERNAME}, access=WRITE, inode="/user":hadoop:hadoop:drwxrwxr-x
```
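
When several users need a home directory, the same three commands can be generated per user. The following is a minimal sketch: the function name `hdfs_home_commands` and the `hadoop_bin` parameter are illustrative and not part of Hadoop. It only builds the argument lists and leaves running them (for example with `subprocess.run` as the hadoop user) to the caller.

```python
import shlex


def hdfs_home_commands(username, hadoop_bin="hadoop"):
    """Build the commands that create a writable HDFS home for `username`.

    Hypothetical helper: it mirrors the three `hadoop fs` commands above
    and does not execute anything.
    """
    if not username or "/" in username:
        raise ValueError(f"invalid username: {username!r}")
    home = f"/user/{username}"
    return [
        [hadoop_bin, "fs", "-mkdir", home],
        [hadoop_bin, "fs", "-chown", username, home],
        [hadoop_bin, "fs", "-chmod", "g+w", home],
    ]


# Print the commands for review before running them as the hadoop user.
for command in hdfs_home_commands("alice"):
    print(shlex.join(command))
```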

---
# Environment variables

## HADOOP_CONF_DIR

Copy the Hadoop configuration directory (**HADOOP_CONF_DIR**) from the Hadoop/YARN master node to the local machine, and set the ```HADOOP_CONF_DIR``` environment variable to point to the local copy.

* [Launching Spark on YARN](http://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)

> Ensure that **HADOOP_CONF_DIR** or **YARN_CONF_DIR** points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. 

In [2]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"

## HADOOP_CONF_DIR access permission

* [spark - java.io.FileNotFoundException: File file:/home/user/.sparkStaging/](https://stackoverflow.com/a/71173475/4281353)

Make sure the user who submits the Spark job can read the files under ```HADOOP_CONF_DIR```.

If there is no access:

```
$ ls $HADOOP_CONF_DIR
ls: cannot access '/opt/hadoop/hadoop-3.2.2/etc/hadoop': Permission denied
```

Then you will see the error:

```
Failing this attempt.Diagnostics: [2022-02-18 22:27:36.433]File file:$HOME/.sparkStaging/application_1645180452555_0008/__spark_libs__7275973196548620925.zip does not exist
java.io.FileNotFoundException: File file:$HOME/.sparkStaging/application_1645180452555_0008/__spark_libs__7275973196548620925.zip does not exist
```

or 

```
Failing this attempt.Diagnostics: [2022-02-18 21:35:35.917]File file:/home/oonisim/.sparkStaging/application_1645180452555_0001/pyspark.zip does not exist
java.io.FileNotFoundException: File file:$HOME/sparkStaging/application_1645180452555_0001/pyspark.zip does not exist
```


In [15]:
%%bash
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-3.2.2/etc/hadoop
ls $HADOOP_CONF_DIR | head -n 5

capacity-scheduler.xml
configuration.xsl
container-executor.cfg
core-site.xml
core-site.xml.48132.2022-02-15@12:29:41~
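
The same access check can be done from Python before creating a session, so the problem shows up earlier than a `.sparkStaging` error does. This is a rough sketch: `check_hadoop_conf_dir` is a hypothetical helper, and the list of expected files (`core-site.xml`, `yarn-site.xml`) is an assumption about a typical client configuration rather than something Spark mandates.

```python
import os

# Assumed minimum set of client-side files; adjust to the actual cluster.
EXPECTED_FILES = ("core-site.xml", "yarn-site.xml")


def check_hadoop_conf_dir(conf_dir, expected=EXPECTED_FILES):
    """Return a list of problems found; an empty list means the checks passed."""
    if not os.path.isdir(conf_dir):
        return [f"{conf_dir} does not exist or cannot be accessed"]
    if not os.access(conf_dir, os.R_OK | os.X_OK):
        return [f"{conf_dir} is not readable by the current user"]
    problems = []
    for name in expected:
        path = os.path.join(conf_dir, name)
        if not os.path.isfile(path):
            problems.append(f"{name} is missing")
        elif not os.access(path, os.R_OK):
            problems.append(f"{name} is not readable")
    return problems


conf_dir = os.environ.get("HADOOP_CONF_DIR", "/opt/hadoop/hadoop-3.2.2/etc/hadoop")
for problem in check_hadoop_conf_dir(conf_dir):
    print("HADOOP_CONF_DIR problem:", problem)
```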


## PYTHONPATH

Add the **pyspark** modules under ```$SPARK_HOME/python/lib``` in the Spark installation to the Python module search path.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (standalone only).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [4]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])
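
The hard-coded zip names above depend on the Spark and Py4J versions. As a sketch of a version-independent alternative, the zips can be discovered with a glob, which is what the shell one-liner from the Spark documentation does. The helper names `pyspark_zip_paths` and `add_pyspark_to_sys_path` are made up for this example.

```python
import glob
import os
import sys


def pyspark_zip_paths(spark_home):
    """List the zip archives under <spark_home>/python/lib in sorted order."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    return sorted(glob.glob(pattern))


def add_pyspark_to_sys_path(spark_home):
    """Append the archives to sys.path and return the ones newly added."""
    added = []
    for path in pyspark_zip_paths(spark_home):
        if path not in sys.path:
            sys.path.append(path)
            added.append(path)
    return added


# Falls back to the installation path used in this notebook.
spark_home = os.environ.get("SPARK_HOME", "/opt/spark/spark-3.1.2")
print(add_pyspark_to_sys_path(spark_home))
```

Calling the function a second time adds nothing, so re-running the cell is safe.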

## PYSPARK_SUBMIT_ARGS

Specify the [spark-submit](https://spark.apache.org/docs/3.1.2/submitting-applications.html#launching-applications-with-spark-submit) parameters.

```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

The ```--conf``` parameters are [Spark properties](https://spark.apache.org/docs/latest/configuration.html#available-properties), e.g. ```spark.executor.memory```.

Alternatively, use [SparkSession.builder](https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession).

```
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
```

### Example

```
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
```
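
Everything that follows the application JAR (or Python file) on the `spark-submit` command line is passed to the application, so options such as `--master` have to come before it. The sketch below builds the argument list in that order. `spark_submit_argv` is a hypothetical helper, and the paths in the usage lines are the ones used elsewhere in this notebook.

```python
import os


def spark_submit_argv(spark_home, app, app_args=(), main_class=None,
                      master="yarn", deploy_mode="client", conf=None):
    """Build a spark-submit argument list with options before the application."""
    argv = [
        os.path.join(spark_home, "bin", "spark-submit"),
        "--master", master,
        "--deploy-mode", deploy_mode,
    ]
    if main_class:
        argv += ["--class", main_class]
    for key, value in (conf or {}).items():
        argv += ["--conf", f"{key}={value}"]
    # The application and its own arguments always go last.
    argv.append(app)
    argv += [str(arg) for arg in app_args]
    return argv


argv = spark_submit_argv(
    "/opt/spark/spark-3.1.2",
    "/opt/spark/spark-3.1.2/examples/jars/spark-examples_2.12-3.1.2.jar",
    app_args=[10],
    main_class="org.apache.spark.examples.SparkPi",
    conf={"spark.executor.memory": "2g"},
)
print(argv)
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once `HADOOP_CONF_DIR` is set in the environment.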

### Environment variable

```
export PYSPARK_SUBMIT_ARGS='--master yarn --num-executors 5 --driver-memory 2g --executor-memory 2g pyspark-shell'
```
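
Inside a notebook the variable can be set from Python instead, as long as this happens before the first SparkSession or SparkContext is created. The value conventionally ends with `pyspark-shell`. A minimal sketch follows, in which `build_pyspark_submit_args` is an illustrative helper and the option values are examples only.

```python
import os
import shlex


def build_pyspark_submit_args(options, conf=None):
    """Join spark-submit options into a PYSPARK_SUBMIT_ARGS string."""
    parts = []
    for flag, value in options.items():
        parts += [f"--{flag}", str(value)]
    for key, value in (conf or {}).items():
        parts += ["--conf", f"{key}={value}"]
    # PySpark expects the primary resource "pyspark-shell" as the last token.
    parts.append("pyspark-shell")
    return shlex.join(parts)


args = build_pyspark_submit_args(
    {
        "master": "yarn",
        "deploy-mode": "client",
        "num-executors": 5,
        "driver-memory": "2g",
        "executor-memory": "2g",
    },
    conf={"spark.executor.cores": 2},
)
os.environ["PYSPARK_SUBMIT_ARGS"] = args
print(args)
```

`shlex.join` quotes any value containing spaces, which keeps the string safe to split back into separate arguments.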

# Test spark-submit command line

In [5]:
%%bash
export HADOOP_CONF_DIR=/opt/hadoop/hadoop-3.2.2/etc/hadoop
export HADOOP_HOME=/opt/hadoop/hadoop-3.2.2
export SPARK_MASTER=yarn
export SPARK_HOME=/opt/spark/spark-3.1.2
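# client mode: the driver output (including the Pi estimate) appears in this cell
export SPARK_DEPLOY_MODE=client
export SPARK_EXAMPLE_JAR="spark-examples_2.12-3.1.2.jar"

# spark-submit options must precede the application JAR; anything after the
# JAR is passed to the application as arguments.
$SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master $SPARK_MASTER \
  --deploy-mode $SPARK_DEPLOY_MODE \
  $SPARK_HOME/examples/jars/$SPARK_EXAMPLE_JAR 10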

2022-02-18 23:17:36,473 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2022-02-18 23:17:36,832 INFO spark.SparkContext: Running Spark version 3.1.2
2022-02-18 23:17:36,948 INFO resource.ResourceUtils: No custom resources configured for spark.driver.
2022-02-18 23:17:36,951 INFO spark.SparkContext: Submitted application: Spark Pi
2022-02-18 23:17:37,089 INFO resource.ResourceProfile: Default ResourceProfile created, executor resources: Map(cores -> name: cores, amount: 1, script: , vendor: , memory -> name: memory, amount: 1024, script: , vendor: , offHeap -> name: offHeap, amount: 0, script: , vendor: ), task resources: Map(cpus -> name: cpus, amount: 1.0)
2022-02-18 23:17:37,139 INFO resource.ResourceProfile: Limiting resource is cpus at 1 tasks per executor
2022-02-18 23:17:37,146 INFO resource.ResourceProfileManager: Added ResourceProfile id: 0
2022-02-18 23:17:37,271 INFO spark.SecurityManager: Chang

2022-02-18 23:17:46,808 INFO yarn.Client: Uploading resource file:/tmp/spark-1858b913-f59c-41a5-8931-d0ca9ccdb66f/__spark_conf__5166803083425083188.zip -> hdfs://ubuntu:8020/user/oonisim/.sparkStaging/application_1645186583689_0002/__spark_conf__.zip
2022-02-18 23:17:46,935 INFO spark.SecurityManager: Changing view acls to: oonisim
2022-02-18 23:17:46,935 INFO spark.SecurityManager: Changing modify acls to: oonisim
2022-02-18 23:17:46,935 INFO spark.SecurityManager: Changing view acls groups to: 
2022-02-18 23:17:46,935 INFO spark.SecurityManager: Changing modify acls groups to: 
2022-02-18 23:17:46,936 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(oonisim); groups with view permissions: Set(); users  with modify permissions: Set(oonisim); groups with modify permissions: Set()
2022-02-18 23:17:47,011 INFO yarn.Client: Submitting application application_1645186583689_0002 to ResourceManager
2022-02-18 23:17:47,5

2022-02-18 23:18:08,871 INFO scheduler.TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4) (ubuntu, executor 2, partition 4, PROCESS_LOCAL, 4348 bytes) taskResourceAssignments Map()
2022-02-18 23:18:08,892 INFO scheduler.TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 329 ms on ubuntu (executor 2) (3/10)
2022-02-18 23:18:08,984 INFO scheduler.TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5) (ubuntu, executor 2, partition 5, PROCESS_LOCAL, 4348 bytes) taskResourceAssignments Map()
2022-02-18 23:18:08,989 INFO scheduler.TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 119 ms on ubuntu (executor 2) (4/10)
2022-02-18 23:18:09,132 INFO scheduler.TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6) (ubuntu, executor 2, partition 6, PROCESS_LOCAL, 4348 bytes) taskResourceAssignments Map()
2022-02-18 23:18:09,152 INFO scheduler.TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 170 ms on ubuntu (executor 2) (5/10)
2022-02-18 23:18:09,275 INFO scheduler.Task

Pi is roughly 3.1376351376351375


2022-02-18 23:18:10,457 INFO server.AbstractConnector: Stopped Spark@6b85300e{HTTP/1.1, (http/1.1)}{127.0.1.1:4040}
2022-02-18 23:18:10,464 INFO ui.SparkUI: Stopped Spark web UI at http://ubuntu:4040
2022-02-18 23:18:10,735 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ubuntu:38153 in memory (size: 1787.0 B, free: 366.3 MiB)
2022-02-18 23:18:10,737 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ubuntu:36329 in memory (size: 1787.0 B, free: 912.3 MiB)
2022-02-18 23:18:10,744 INFO storage.BlockManagerInfo: Removed broadcast_0_piece0 on ubuntu:35703 in memory (size: 1787.0 B, free: 366.3 MiB)
2022-02-18 23:18:10,943 INFO cluster.YarnClientSchedulerBackend: Interrupting monitor thread
2022-02-18 23:18:11,031 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
2022-02-18 23:18:11,034 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
2022-02-18 23:18:11,063 INFO cluster.YarnClientSchedulerBackend: YARN client

---
# Build Spark Session

In [10]:
from pyspark.sql import SparkSession

In [11]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

---
# PySpark Code Example

* [PySpark - QuickStart](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart.html)

In [12]:
from datetime import datetime, date
import pandas as pd
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(a=1, b=2., c='string1', d=date(2000, 1, 1), e=datetime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3., c='string2', d=date(2000, 2, 1), e=datetime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5., c='string3', d=date(2000, 3, 1), e=datetime(2000, 1, 3, 12, 0))
])
df

DataFrame[a: bigint, b: double, c: string, d: date, e: timestamp]

In [13]:
# Stop the session first so the YARN application and its executors are released
spark.stop()
del spark
gc.collect()

28