# Import extra packages into Spark

Sometimes we need to import third-party packages into a Spark session. For example, to read SAS files we need the package `saurfang:spark-sas7bdat:3.0.0-s_2.12`, developed by saurfang. To read data from a PostgreSQL database, we need the package `org.postgresql:postgresql:42.2.24`, provided by the PostgreSQL project.

There are two main ways to import packages into a Spark session:
- by importing the raw jar files
- by using Maven dependencies (the required jars are downloaded automatically)



## 1. In submit mode


In submit mode, we have two options:
- the `--jars` option
- the `--packages` option

### 1.1 Jar 

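To use the `--jars` option, make sure the jar files already exist at the given paths (downloaded by an admin or by you). If the paths are resolved on the workers, for example with `local:/` URIs, the files must be present on every worker. You can use the commands below to import the packages.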

```bash
# use the --jars option
spark-submit --jars /path/file1.jar,/path/file2.jar,/path/file3.jar \
             ...... \
             ...... \
             your-application.py

# add all jars inside a folder
# tr replaces each space with a comma (this breaks if a path contains spaces)
spark-submit --jars $(echo /path/*.jar | tr ' ' ',') \
             your-application.py

# If you need a jar only on the driver node, use spark.driver.extraClassPath or --driver-class-path
spark-submit --jars /path/file1.jar,/path/file2.jar \
    --driver-class-path /path/file3.jar \
    your-application.py
```
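
The folder expansion above can also be done in Python, which avoids the space-splitting problem of `echo ... | tr`. The following is a minimal sketch; the helper name `jars_in_folder` is hypothetical and not part of Spark.

```python
import glob
import os


def jars_in_folder(folder):
    """Return the .jar files directly inside `folder` as one comma-separated string.

    Hypothetical helper: the result has the same shape as the value that
    --jars (or the spark.jars property) expects.
    """
    pattern = os.path.join(folder, "*.jar")
    # Sort so the result is stable from one run to the next
    return ",".join(sorted(glob.glob(pattern)))
```

The returned string could then be passed to `spark-submit --jars` or to `.config("spark.jars", ...)` when building a session. An empty string means no jar was found in the folder.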

### 1.2 Packages

If you want to use the `--packages` option, the packages must be available in the Maven repository configured for Spark (the default URL is https://repo.maven.apache.org/maven2/).

```bash
# use the --packages option (comma-separated, no space after the comma)
spark-submit --packages saurfang:spark-sas7bdat:3.0.0-s_2.12,org.postgresql:postgresql:42.2.24 \
             ...... \
             ...... \
             your-application.py
```
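
A stray space inside the `--packages` list is easy to introduce, because the shell then treats the second coordinate as a separate argument. As a sketch, the list can be built and checked in Python first; `join_packages` is a hypothetical helper name, and the check only assumes the `groupId:artifactId:version` coordinate format.

```python
def join_packages(coordinates):
    """Join Maven coordinates into the comma-separated form used by --packages.

    Hypothetical helper: strips surrounding whitespace and rejects anything
    that is not of the form groupId:artifactId:version.
    """
    cleaned = []
    for coord in coordinates:
        coord = coord.strip()
        parts = coord.split(":")
        # Exactly three non-empty parts and no whitespace inside the coordinate
        if len(parts) != 3 or not all(parts) or any(ch.isspace() for ch in coord):
            raise ValueError(f"not a groupId:artifactId:version coordinate: {coord!r}")
        cleaned.append(coord)
    return ",".join(cleaned)
```

For example, `join_packages(["saurfang:spark-sas7bdat:3.0.0-s_2.12", "org.postgresql:postgresql:42.2.24"])` produces a single string that can be pasted after `--packages` or used as the value of `spark.jars.packages`.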

## 2. In spark shell mode

Spark shell mode is quite similar to submit mode. We have the same two options:
- the `--jars` option
- the `--packages` option

### 2.1 Jar 

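To use the `--jars` option, make sure the jar files already exist at the given paths (downloaded by an admin or by you). If the paths are resolved on the workers, for example with `local:/` URIs, the files must be present on every worker. You can use the commands below to import the packages.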

```bash
# use the --jars option
spark-shell --jars /path/file1.jar,/path/file2.jar,/path/file3.jar

# add all jars inside a folder
# tr replaces each space with a comma (this breaks if a path contains spaces)
spark-shell --jars $(echo /path/*.jar | tr ' ' ',')

# If you need a jar only on the driver node, use spark.driver.extraClassPath or --driver-class-path
spark-shell --jars /path/file1.jar,/path/file2.jar \
    --driver-class-path /path/file3.jar

# for pyspark, you can use the command below
pyspark --jars /path/file1.jar,/path/file2.jar,/path/file3.jar
```

### 2.2 Packages

If you want to use the `--packages` option, the packages must be available in the Maven repository configured for Spark (the default URL is https://repo.maven.apache.org/maven2/).

```bash
# use the --packages option (comma-separated, no space after the comma)
spark-shell --packages saurfang:spark-sas7bdat:3.0.0-s_2.12,org.postgresql:postgresql:42.2.24

# for pyspark
pyspark --packages saurfang:spark-sas7bdat:3.0.0-s_2.12,org.postgresql:postgresql:42.2.24
```

## 3. In notebook mode 

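In notebook mode, we create the Spark session with `SparkSession.builder`. The same two options are available as configuration properties:

- `spark.jars` (the equivalent of `--jars`)
- `spark.jars.packages` (the equivalent of `--packages`)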

### 3.1 Jar 

We can import jars by setting `spark.jars` (and `spark.driver.extraClassPath` for driver-only jars) while creating the SparkSession in PySpark, as shown below. **This takes the highest priority over the other approaches (it overrides the configuration set by them).**

```python
from pyspark.sql import SparkSession,DataFrame

import os

local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("import_packages")\
        .config("spark.jars", "/path/file1.jar,/path/file2.jar")\
        .config("spark.driver.extraClassPath", "/path/file3.jar") \
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("import_packages")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config("spark.jars", "/path/file1.jar,/path/file2.jar")\
        .config("spark.driver.extraClassPath", "/path/file3.jar") \
        .getOrCreate()
```
In this example, **file1.jar and file2.jar** are added to both the driver and the executors, and **file3.jar is added only to the driver classpath**.
Note that the example has two branches for the Spark session config: local mode and k8s mode.

> Note: the jar files must exist on the worker/driver in both modes. The executor image in k8s mode must contain the jar files too. This can be difficult if you are not a k8s admin, so the `--packages` option is better.
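
Before building the session, it can help to confirm that every path listed in `spark.jars` actually exists. The sketch below is only a pre-flight check under one assumption: it runs on the machine that hosts the notebook, so it says nothing about the executors or the k8s image. The name `missing_jars` is hypothetical.

```python
import os


def missing_jars(spark_jars):
    """Return the paths from a comma-separated spark.jars value that are not files.

    Hypothetical helper: it only checks the local filesystem of the machine
    it runs on (the notebook host), not the executors.
    """
    paths = [p.strip() for p in spark_jars.split(",") if p.strip()]
    return [p for p in paths if not os.path.isfile(p)]
```

Calling `missing_jars("/path/file1.jar,/path/file2.jar")` before `getOrCreate()` returns the entries that need to be downloaded or corrected. An empty list means all paths were found locally.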

### 3.2 Packages

This solution is recommended for a Spark k8s cluster, because we don't need extra privileges to import packages. Spark downloads the necessary jars from the Maven Central repository.

```python
from pyspark.sql import SparkSession,DataFrame

import os

local = True

if local:
    spark = SparkSession.builder \
        .master("local[4]") \
        .appName("import_packages")\
        .config('spark.jars.packages','saurfang:spark-sas7bdat:3.0.0-s_2.12,org.postgresql:postgresql:42.2.24') \
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443") \
        .appName("import_packages")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:py3.9.7-spark3.2.0")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory", "4g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .config('spark.jars.packages','saurfang:spark-sas7bdat:3.0.0-s_2.12,org.postgresql:postgresql:42.2.24') \
        .getOrCreate()
```

```text
2023-04-11 10:36:32,481 WARN sql.SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
```
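
The warning `Using an existing Spark session; only runtime SQL configurations will take effect` suggests that `getOrCreate()` returned a session that was already running. In that case the `spark.jars.packages` value set on the builder is most likely not applied. A minimal sketch of a check follows. It assumes that `spark.sparkContext.getConf()` reflects the configuration the context was started with, and the helper name `missing_packages` is hypothetical.

```python
def missing_packages(spark, expected):
    """Return the coordinates from `expected` that the running context lacks.

    Hypothetical helper: `spark` is a SparkSession (or any object exposing
    sparkContext.getConf().get(key, default)); `expected` is a list of
    Maven coordinates such as "org.postgresql:postgresql:42.2.24".
    """
    # Read the packages the SparkContext was actually created with
    loaded = spark.sparkContext.getConf().get("spark.jars.packages", "")
    loaded_set = {c.strip() for c in loaded.split(",") if c.strip()}
    # Keep the order of `expected` so the result is easy to read
    return [c for c in expected if c not in loaded_set]
```

If `missing_packages(spark, ["org.postgresql:postgresql:42.2.24"])` returns a non-empty list, the session was probably created before the package setting was given. Restarting the notebook kernel and building the session again is usually the simplest fix.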
