In [1]:
%%html
<style>
table {float:left}
</style>

In [2]:
%%html
<style>
div.output_area pre {
    white-space: pre;
}
</style>

# PySpark - Manage Python Dependencies

There are multiple ways to manage Python dependencies so that PySpark executors can import them **on the worker nodes** as well as on the driver node. Be clear about whether you are addressing the driver node or the worker nodes. If an executor cannot find a dependency, you will see the error ```ModuleNotFoundError: No module named ...```.
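To tell which node is missing a package, it can help to run the same import check on the driver and inside a task. The sketch below is a plain-Python helper; `missing_modules` is a hypothetical name, not a PySpark API, and it only handles top-level module names.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    missing = []
    for name in names:
        # find_spec returns None when a top-level module cannot be located.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


# Called here, this checks only the interpreter running this code (the driver).
print(missing_modules(["json", "numpy"]))
```

Calling the same function from inside a UDF or an RDD transformation would evaluate it with the executors' interpreter instead.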



## PYSPARK_PYTHON Environment Variable

Use the environment variable to tell the Spark runtime where to find the Python interpreter of the environment that has the dependencies installed.

### Worker Nodes

Set ```PYSPARK_PYTHON``` on the worker nodes to the path of the Python interpreter in the environment installed there (a virtual environment or the system Python). That environment must have the required packages installed with a package management tool (pip, Anaconda, etc.).

There are two ways to do this.

Use ```$SPARK_HOME/conf/spark-env.sh``` on the worker nodes to point to the Python interpreter.
```
export PYSPARK_PYTHON=<Python interpreter path>
```

OR

Set ```spark.yarn.executorEnv.PYSPARK_PYTHON``` when creating the SparkSession.
```
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .config('spark.yarn.executorEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")\
    .getOrCreate()
```

### Driver Node

The setup depends on the deploy mode. ```PYSPARK_DRIVER_PYTHON``` must be **unset** in Kubernetes or YARN cluster modes.

#### YARN client mode

Use ```$SPARK_HOME/conf/spark-env.sh``` on the driver node, or export the variable from the command line.

```
export PYSPARK_DRIVER_PYTHON=<Python interpreter path> # Do not set in cluster modes.
```

#### YARN cluster mode

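Set ```spark.yarn.appMasterEnv.PYSPARK_PYTHON``` so that the driver, which runs inside the YARN application master in cluster mode, uses the intended interpreter.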
```
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .config('spark.yarn.appMasterEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")\
    .config('spark.yarn.executorEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")\
    .getOrCreate()
```
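The difference between the two deploy modes comes down to which configuration keys carry the interpreter path. The following is a minimal sketch, assuming the two property names shown above; `pyspark_python_conf` is a hypothetical helper, not part of PySpark.

```python
def pyspark_python_conf(interpreter, deploy_mode="client"):
    """Build the YARN settings that select the Python interpreter."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode}")
    conf = {
        "spark.submit.deployMode": deploy_mode,
        # Executors need the interpreter path in both deploy modes.
        "spark.yarn.executorEnv.PYSPARK_PYTHON": interpreter,
    }
    if deploy_mode == "cluster":
        # In cluster mode the driver runs inside the YARN application master.
        conf["spark.yarn.appMasterEnv.PYSPARK_PYTHON"] = interpreter
    return conf


conf = pyspark_python_conf("/home/oonisim/venv/ml/bin/python3", "cluster")
for key, value in conf.items():
    print(f"--conf {key}={value}")
```

Each key and value pair can be passed either as a `--conf` argument to spark-submit or to `SparkSession.builder.config(key, value)`.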


## Virtual Environment Archive

Tell the Spark runtime the location of the virtual environment archive in which the dependencies are installed, using ```--archives``` or ```spark.archives```. The archive can be placed on the driver node or in HDFS.

1. Create a virtual environment (venv or conda).
2. Install the dependencies in the virtual environment.
3. Archive the virtual environment directory into a ```tar.gz``` file.
4. Point to the archive path via ```--archives``` at spark-submit or ```spark.archives``` in the PySpark shell or a notebook.

* [Spark User Guide - Python Package Management](https://spark.apache.org/docs/latest/api/python/user_guide/python_packaging.html)
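The archiving step can be sketched with the standard library alone. `archive_environment` is a hypothetical helper for illustration. Note the assumption: a plain tar of a `venv` directory is generally not relocatable, which is why the Spark guide linked above uses `venv-pack` or `conda-pack`. This sketch only shows the archive layout, with `bin/python` at the archive root.

```python
import tarfile
from pathlib import Path


def archive_environment(env_dir, archive_path):
    """Pack the contents of env_dir into a gzip-compressed tar archive."""
    env_dir = Path(env_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Add each top-level entry under its own name so that bin/python
        # sits at the root of the archive, not under the directory name.
        for child in sorted(env_dir.iterdir()):
            tar.add(child, arcname=child.name)
    return Path(archive_path)


# Example (assumes a directory named pyspark_venv exists):
# archive_environment("pyspark_venv", "pyspark_venv.tar.gz")
```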

```
export PYSPARK_DRIVER_PYTHON=python # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_venv.tar.gz#environment app.py
```

OR 

```
import os
from pyspark.sql import SparkSession
from app import main

os.environ['PYSPARK_PYTHON'] = "./environment/bin/python"
spark = SparkSession.builder\
    .config(
        "spark.archives",  # <----- 'spark.yarn.dist.archives' in YARN.
        "pyspark_venv.tar.gz#environment"
    )\
    .getOrCreate()
```
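The part after `#` in the archive specification is the directory name under which the archive is unpacked in each executor's working directory, which is why `PYSPARK_PYTHON` is the relative path `./environment/bin/python`. A small sketch of that relationship follows; `python_path_for_archive` is a hypothetical helper that requires an explicit alias.

```python
def python_path_for_archive(archive_spec, interpreter="bin/python"):
    """Derive the relative PYSPARK_PYTHON value from an '<archive>#<alias>' spec."""
    _, sep, alias = archive_spec.partition("#")
    if not sep or not alias:
        raise ValueError("expected '<archive>#<alias>'")
    # The alias is the directory the archive is unpacked into on each executor.
    return f"./{alias}/{interpreter}"


print(python_path_for_archive("pyspark_venv.tar.gz#environment"))
# prints: ./environment/bin/python
```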

## Spark Sets Up the Virtual Environment

This is the Hortonworks approach. The feature is currently supported only in YARN mode.

* [Cloudera Community - Using VirtualEnv with PySpark](https://community.cloudera.com/t5/Community-Articles/Using-VirtualEnv-with-PySpark/ta-p/245905)
* [SPARK-13587](https://issues.apache.org/jira/browse/SPARK-13587)

> This method is trying to create virtualenv before python worker start, and this virtualenv is application scope, after the spark application job finish, the virtualenv will be cleanup.

Spark sets up the virtual environment on the Spark nodes. The prerequisites are:


* Each node must have internet access (for downloading packages).
* Python 2.7 or Python 3.x must be installed (pip is also installed).

```
spark-submit --master yarn-client \
    --conf spark.pyspark.virtualenv.enabled=true \
    --conf spark.pyspark.virtualenv.type=native \
    --conf spark.pyspark.virtualenv.requirements=/Users/jzhang/github/spark/requirements.txt \
    --conf spark.pyspark.virtualenv.bin.path=/Users/jzhang/anaconda/bin/virtualenv \
    --conf spark.pyspark.python=/usr/local/bin/python3 \
    spark_virtualenv.py
```

### Configuration Parameters 

| Property | Description |
|:---|:---|
| spark.pyspark.virtualenv.enabled | Property flag to enable virtualenv |
| spark.pyspark.virtualenv.type | Type of virtualenv. Valid values are “native”, “conda” |
| spark.pyspark.virtualenv.requirements | Requirements file (optional, not required for interactive mode) |
| spark.pyspark.virtualenv.bin.path | The location of virtualenv executable file for type native or conda executable file for type conda |
| spark.pyspark.virtualenv.python_version | Python version for conda. (optional, only required when you use conda in interactive mode) |
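Building the command as an argument list avoids shell line-continuation mistakes when many `--conf` options are involved. The sketch below assumes the property names from the table above, which may not be recognized by Spark builds that lack this feature. `spark_submit_command` is a hypothetical helper and the file paths are placeholders.

```python
import shlex


def spark_submit_command(script, master, conf):
    """Assemble a spark-submit argument list from a dict of properties."""
    args = ["spark-submit", "--master", master]
    for key, value in conf.items():
        # Each property becomes its own --conf key=value pair.
        args += ["--conf", f"{key}={value}"]
    args.append(script)
    return args


virtualenv_conf = {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.requirements": "requirements.txt",  # placeholder path
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv",  # placeholder path
}
command = spark_submit_command("spark_virtualenv.py", "yarn", virtualenv_conf)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command)` without going through a shell.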

In [3]:
import os
import sys
import gc
import numpy as np

# Environment Variables

## Hadoop

In [4]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"

In [5]:
%%bash
export HADOOP_CONF_DIR="/opt/hadoop/hadoop-3.2.2/etc/hadoop"
ls $HADOOP_CONF_DIR | head -n 5

capacity-scheduler.xml
configuration.xsl
container-executor.cfg
core-site.xml
core-site.xml.48132.2022-02-15@12:29:41~


## PYTHONPATH

Point ```PYTHONPATH``` at the **pyspark** modules under ```$SPARK_HOME/python/lib``` in the Spark installation so that Python can load them.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```
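Inside a running Python process, the same effect can be approximated by extending `sys.path` instead of exporting `PYTHONPATH`. This is a minimal sketch; `add_spark_python_libs` is a hypothetical helper, and it affects only the current process, not child processes.

```python
import glob
import os
import sys


def add_spark_python_libs(spark_home):
    """Append every zip under <spark_home>/python/lib to sys.path."""
    pattern = os.path.join(spark_home, "python", "lib", "*.zip")
    zips = sorted(glob.glob(pattern))
    for path in zips:
        # Avoid duplicate entries when the cell is executed more than once.
        if path not in sys.path:
            sys.path.append(path)
    return zips


# Example (assumes SPARK_HOME is set):
# add_spark_python_libs(os.environ["SPARK_HOME"])
```

Using a glob avoids hard-coding the Py4J version that appears in the zip file name.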

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (for standalone use).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [6]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])

## PySpark packages

Execute after the PYTHONPATH setup.

In [7]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import (
    udf
)

---
# Spark Session without PYSPARK_PYTHON


In [8]:
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

2022-02-23 10:15:17,981 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-23 10:15:19,811 WARN util.Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
2022-02-23 10:15:21,862 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


## Use numpy in the worker

Because the executors' default Python interpreter does not have numpy installed, the UDF below fails with ```ModuleNotFoundError: No module named 'numpy'```.

In [9]:
result = spark.createDataFrame(
    data = [(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], 
    schema=["count","df","docs"]
)
result.show()

                                                                                

+-----+---+----+
|count| df|docs|
+-----+---+----+
|  138|  5|  10|
|  128|  4|  10|
|  112|  3|  10|
|  120|  3|  10|
|  189|  1|  10|
+-----+---+----+



In [10]:
@udf("float")
def newFunction(count, df, docs):
    import numpy as np
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

In [11]:
result=result.withColumn("new_function_result", newFunction("count","df","docs"))
result.show()

2022-02-23 10:16:02,185 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 2.0 (TID 2) (ubuntu executor 1): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_

PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/serializers.py", line 200, in _batched
    for item in iterator:
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0012/container_1645514515228_0012_01_000002/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_40170/3270350464.py", line 3, in newFunction
ModuleNotFoundError: No module named 'numpy'
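When this error appears, a useful first step is to find out which interpreter the failing code is actually running under. The sketch below uses only the standard library; `interpreter_info` is a hypothetical helper, and the executor-side call is shown as a comment that assumes an active SparkSession named `spark`.

```python
import platform
import sys


def interpreter_info():
    """Return the path and version of the interpreter running this code."""
    return (sys.executable, platform.python_version())


# Driver-side check.
print(interpreter_info())

# Executor-side check, assuming an active SparkSession named `spark`:
# spark.sparkContext.parallelize(range(4), 4).map(lambda _: interpreter_info()).distinct().collect()
```

If the executor-side path differs from the interpreter that has numpy installed, the executors are not picking up the intended ```PYSPARK_PYTHON``` setting.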


In [None]:
# Stop Spark
spark.stop()
del spark
gc.collect()

# Restart the Jupyter kernel here; otherwise the "Spark stopped" error occurs.
import os
os._exit(0)

# Spark Session with PYSPARK_PYTHON

Provide ```PYSPARK_PYTHON``` so that the executors use an interpreter that has numpy installed.

In [1]:
import os
import sys
import gc
import numpy as np

In [8]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])


## PYSPARK_PYTHON Environment Variable

In [None]:
os.environ['PYSPARK_PYTHON'] = "/home/oonisim/venv/ml/bin/python3"
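The variable only takes effect if it is set before the session is created, so a quick pre-flight check can catch a skipped cell. This is a minimal sketch with a hypothetical helper, `check_interpreter_env`. In YARN client mode it inspects the driver node only, so the same path must also exist on the worker nodes.

```python
import os


def check_interpreter_env(name="PYSPARK_PYTHON", environ=os.environ):
    """Return the interpreter path stored in `name`, or raise if it is unusable."""
    path = environ.get(name)
    if not path:
        raise RuntimeError(f"{name} is not set; run the cell that assigns it first")
    # Relative paths such as ./environment/bin/python exist only on the
    # executors, so only absolute paths are verified locally.
    if os.path.isabs(path):
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            raise RuntimeError(f"{name}={path} is not an executable file on this node")
    return path


# Example: call check_interpreter_env() right before SparkSession.builder...getOrCreate().
```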

## PySpark packages

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from pyspark.sql.functions import (
    udf
)

In [4]:
# Parentheses are used instead of backslash continuations so that the
# commented-out options do not break the chained call.
spark = (
    SparkSession.builder
    .master('yarn')
    .config('spark.submit.deployMode', 'client')
    .config('spark.debug.maxToStringFields', 100)
    .config('spark.executor.memory', '2g')
    # .config('spark.yarn.appMasterEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")
    # .config('spark.yarn.executorEnv.PYSPARK_PYTHON', "/home/oonisim/venv/ml/bin/python3")
    .getOrCreate()
)

2022-02-23 10:19:49,450 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2022-02-23 10:19:52,605 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


## Use numpy in the worker

In [5]:
result = spark.createDataFrame(
    data = [(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], 
    schema=["count","df","docs"]
)
result.show()

                                                                                

+-----+---+----+
|count| df|docs|
+-----+---+----+
|  138|  5|  10|
|  128|  4|  10|
|  112|  3|  10|
|  120|  3|  10|
|  189|  1|  10|
+-----+---+----+



In [9]:
@udf("float")
def newFunction(count, df, docs):
    import numpy as np
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

result=result.withColumn("new_function_result", newFunction("count","df","docs"))
result.show()

2022-02-23 10:33:00,182 WARN scheduler.TaskSetManager: Lost task 0.0 in stage 5.0 (TID 8) (ubuntu executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000003/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000003/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000003/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_

2022-02-23 10:33:00,417 ERROR scheduler.TaskSetManager: Task 0 in stage 5.0 failed 4 times; aborting job


PythonException: 
  An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/serializers.py", line 200, in _batched
    for item in iterator:
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/worker.py", line 85, in <lambda>
    return lambda *a: f(*a)
  File "/tmp/hadoop-hadoop/nm-local-dir/usercache/oonisim/appcache/application_1645514515228_0014/container_1645514515228_0014_01_000002/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "/tmp/ipykernel_41389/1210656308.py", line 3, in newFunction
ModuleNotFoundError: No module named 'numpy'


---
# Stop Spark Session

In [10]:
spark.stop()



# Cleanup

In [11]:
del spark
gc.collect()

1202