In [15]:
%%html
<style>
table {float:left}
</style>

In [16]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

# Spark Environment Variables

* [Environment Variables](https://spark.apache.org/docs/latest/configuration.html#environment-variables)

| Environment Variable | Meaning |
|:---|:---|
| JAVA_HOME | Location where Java is installed (if it's not on your default PATH). |
| PYSPARK_PYTHON | Python binary executable to use for PySpark in both driver and workers (default is python3 if available, otherwise python). Property spark.pyspark.python takes precedence if it is set. |
| PYSPARK_DRIVER_PYTHON | Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON). Property spark.pyspark.driver.python takes precedence if it is set. |
| SPARKR_DRIVER_R | R binary executable to use for SparkR shell (default is R). Property spark.r.shell.command takes precedence if it is set. |
| SPARK_LOCAL_IP | IP address of the machine to bind to. |
| SPARK_PUBLIC_DNS | Hostname your Spark program will advertise to other machines. |

## Spark Properties

### Executor Environment Variables

|Property|Default|Meaning|
|:---|:---|:---|
| spark.executorEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Executor process. The user can specify multiple of these to set multiple environment variables. |
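
As a minimal sketch of how several such properties could be assembled, the helper below maps plain variable names to `spark.executorEnv.*` keys. The helper name `executor_env_conf` and the example values are assumptions for illustration only; each resulting pair could then be passed to `SparkSession.builder.config(key, value)`.

```python
def executor_env_conf(env):
    """Map {NAME: value} to {"spark.executorEnv.NAME": "value"} (hypothetical helper)."""
    return {f"spark.executorEnv.{name}": str(value) for name, value in env.items()}


# Placeholder values, not taken from a real cluster.
conf = executor_env_conf({
    "PYSPARK_PYTHON": "/path/to/venv/bin/python3",
    "OMP_NUM_THREADS": 1,
})
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```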


## YARN Cluster

### Application Master Environment Variables

> Note: When running Spark on YARN in cluster mode, **environment variables need to be set using the spark.yarn.appMasterEnv.[EnvironmentVariableName] property** in your conf/spark-defaults.conf file. Environment variables that are set in spark-env.sh will not be reflected in the YARN Application Master process in cluster mode. See the YARN-related Spark Properties for more information.

* [Running on YARN - Spark Properties](https://spark.apache.org/docs/latest/running-on-yarn.html#spark-properties)

|Property|Default|Meaning|
|:---|:---|:---|
| spark.yarn.appMasterEnv.[EnvironmentVariableName] | (none) | Add the environment variable specified by EnvironmentVariableName to the Application Master process launched on YARN. The user can specify multiple of these to set multiple environment variables. In cluster mode this controls the environment of the Spark driver, and in client mode it only controls the environment of the executor launcher. |
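
Since the note above asks for these properties in conf/spark-defaults.conf, the following sketch renders them as whitespace-separated `key value` lines, which is the format that file uses. The function name `app_master_env_lines` and the interpreter path are hypothetical.

```python
def app_master_env_lines(env):
    """Render spark-defaults.conf lines for the YARN Application Master (hypothetical helper)."""
    return [
        f"spark.yarn.appMasterEnv.{name} {value}"
        for name, value in sorted(env.items())
    ]


# Placeholder path; replace with the interpreter available on the cluster nodes.
lines = app_master_env_lines({"PYSPARK_PYTHON": "/path/to/venv/bin/python3"})
print("\n".join(lines))
```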

                          

In [3]:
import os
import sys
import gc
import numpy as np

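# Hadoop Environment Variables

To run Spark on YARN, ```HADOOP_CONF_DIR``` must point to the directory containing the Hadoop client configuration files, which Spark uses to connect to the YARN ResourceManager.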

In [4]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"

In [5]:
%%bash
export HADOOP_CONF_DIR="/opt/hadoop/hadoop-3.2.2/etc/hadoop"
ls $HADOOP_CONF_DIR | head -n 5

capacity-scheduler.xml
configuration.xsl
container-executor.cfg
core-site.xml
core-site.xml.48132.2022-02-15@12:29:41~
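
A quick sanity check of the directory can be sketched in Python. The function name `missing_hadoop_conf_files` is hypothetical, and the list of required files (`core-site.xml`, `yarn-site.xml`) is an assumption about what a YARN client typically needs, not something Spark enforces.

```python
import os


def missing_hadoop_conf_files(conf_dir, required=("core-site.xml", "yarn-site.xml")):
    """Return the names in `required` that are not regular files in conf_dir."""
    return [
        name for name in required
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]


conf_dir = os.environ.get("HADOOP_CONF_DIR", "")
missing = missing_hadoop_conf_files(conf_dir)
if missing:
    print(f"HADOOP_CONF_DIR={conf_dir!r} is missing: {missing}")
```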


# PYTHONPATH

To import the **pyspark** modules, Python must be able to find the pyspark and py4j packages located under ```$SPARK_HOME/python/lib``` in the Spark installation.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```
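
Inside a notebook, roughly the same effect can be had from Python by extending `sys.path`. This is a sketch that assumes `SPARK_HOME` is already set in the kernel's environment; the function name `spark_python_zips` is hypothetical.

```python
import glob
import os
import sys


def spark_python_zips(spark_home):
    """Return the sorted *.zip archives under <spark_home>/python/lib."""
    return sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))


spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    for path in spark_python_zips(spark_home):
        # Avoid duplicate entries when the cell is re-run.
        if path not in sys.path:
            sys.path.append(path)
```

Globbing avoids hardcoding the py4j version, which changes between Spark releases.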

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (standalone only).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.
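
To see which **pyspark** the current interpreter would pick up, whether pip-installed or found through PYTHONPATH, a small check along these lines may help. The function name `find_pyspark_home` is hypothetical; it only inspects the import system and does not import or start Spark.

```python
import importlib.util
import os


def find_pyspark_home():
    """Return the directory of an importable pyspark package, or None if absent."""
    spec = importlib.util.find_spec("pyspark")
    if spec is None or spec.origin is None:
        return None
    return os.path.dirname(spec.origin)


home = find_pyspark_home()
print(home if home else "pyspark is not importable from this interpreter")
```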

In [6]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])

## PySpark packages

Execute after the PYTHONPATH setup.

In [6]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import (
    udf
)

---
# PYSPARK_PYTHON

To import Python dependencies, the ```PYSPARK_PYTHON``` environment variable needs to point to the Python interpreter of the Python environment in which the dependencies are installed.
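
One way to set the variable from the notebook itself is sketched below. It assumes the interpreter running the notebook also exists at the same path on every worker node, which this local check cannot verify; the function name `use_interpreter_for_pyspark` is hypothetical. The variable must be set before the Spark session is created.

```python
import os
import sys


def use_interpreter_for_pyspark(python_path=sys.executable):
    """Set PYSPARK_PYTHON to python_path after checking it is an executable file."""
    if not (os.path.isfile(python_path) and os.access(python_path, os.X_OK)):
        raise FileNotFoundError(f"not an executable file: {python_path}")
    os.environ["PYSPARK_PYTHON"] = python_path
    return python_path


# Defaults to the interpreter running this notebook (an assumption, see above).
use_interpreter_for_pyspark()
```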


## spark.yarn.appMasterEnv

For YARN cluster mode, the environment variables of the Application Master process need to be set with ```spark.yarn.appMasterEnv```.


In [9]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .config('spark.yarn.appMasterEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")\
    .config('spark.executorEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")\
    .getOrCreate()

2022-02-23 08:31:17,153 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-23 08:31:20,394 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


Confirm that Spark can import Python dependencies (numpy here) on the worker nodes via the PYSPARK_PYTHON environment variable.

In [11]:
result = spark.createDataFrame(
    data = [(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], 
    schema=["count","df","docs"]
)
result.show()

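@udf("float")
def newFunction(count, df, docs):
    # Import inside the UDF so that it runs in the executor's Python interpreter.
    import numpy as np
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    # Convert numpy.float64 to a plain Python float for the "float" return type.
    return returnValue.item()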

result=result.withColumn("new_function_result", newFunction("count","df","docs"))
result.show()


+-----+---+----+
|count| df|docs|
+-----+---+----+
|  138|  5|  10|
|  128|  4|  10|
|  112|  3|  10|
|  120|  3|  10|
|  189|  1|  10|
+-----+---+----+
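
To check the UDF's results without a cluster, the same formula can be evaluated locally with the standard library. This is a sketch: the name `expected_value` is hypothetical, and the rows are copied from the DataFrame above.

```python
import math


def expected_value(count, df, docs):
    """Plain-Python version of the UDF formula: (1 + ln(count)) * ln(docs / df)."""
    return (1 + math.log(count)) * math.log(docs / df)


rows = [(138, 5, 10), (128, 4, 10), (112, 3, 10), (120, 3, 10), (189, 1, 10)]
for count, df, docs in rows:
    print(count, df, docs, round(expected_value(count, df, docs), 4))
```

The values in `new_function_result` should agree with these up to single-precision rounding, since the UDF returns a Spark `float`.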



                                                                                

---
# Stop Spark Session

In [12]:
spark.stop()



# Cleanup

In [13]:
del spark
gc.collect()

695