# How to use external jars

* [Spark Configuration](https://spark.apache.org/docs/latest/configuration.html)

> * spark.jars  
> Comma-separated list of jars to include on the **driver and executor classpaths**. Globs are allowed.

> * spark.driver.extraClassPath  
> Extra classpath entries to prepend to the classpath of the driver.

> * spark.**executor**.extraClassPath  
> This exists primarily for **backwards-compatibility** with older versions of Spark. Users typically **should NOT need to set this option**.

In [1]:
import os
import sys
import gc

# Example (Hadoop GCP/BQ Connector)

This example demonstrates how to use the external [Apache Spark SQL connector for Google BigQuery](https://github.com/GoogleCloudDataproc/spark-bigquery-connector) jar, based on the example in [Use the BigQuery connector with Spark](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark).

You must first follow the instructions in [Installing the connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md).

> ### Configuring Spark
> ```
> spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
> spark.hadoop.google.cloud.auth.service.account.enable=true
> spark.hadoop.google.cloud.auth.service.account.json.keyfile=<path/to/keyfile.json>
> ```

## References

* [https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md is not up-to-date #618](https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/618)
* [Clarification about installation with spark #188](https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/188) (must read)
* [When writing to BQ run into this error No FileSystem for scheme: gs #206](https://github.com/GoogleCloudDataproc/spark-bigquery-connector/issues/206)
* [Accessing GCS from Spark/Hadoop outside Google Cloud #52](https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/52)
* [Configuration properties](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/CONFIGURATION.md)

## GCP Setup


### Application Default Credentials

* [Installing the connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md).

> ### Ensure authenticated Cloud Storage access
> Depending on where the machines which comprise your cluster are located, you must do one of the following:  
> * non-Google Cloud Platform - Obtain an [OAuth 2.0 private key](https://cloud.google.com/storage/docs/authentication#generating-a-private-key). 

Spark authenticates with GCP using the credential file that the ```GOOGLE_APPLICATION_CREDENTIALS``` environment variable points to.

* [Authentication on GCP with Docker: Application Default Credentials](https://medium.com/datamindedbe/application-default-credentials-477879e31cb5)
* [Accessing GCS from Spark/Hadoop outside Google Cloud #52](https://github.com/GoogleCloudDataproc/hadoop-connectors/issues/52)
* [Authentication overview](https://cloud.google.com/docs/authentication)
* [Authenticating as a service account](https://cloud.google.com/docs/authentication/production)


The credential file is created by executing the command:

```
gcloud auth application-default login
```

Set the Spark properties to enable GCP authentication and point to the key file.

```
spark.hadoop.google.cloud.auth.service.account.enable=true
spark.hadoop.google.cloud.auth.service.account.json.keyfile=<path/to/keyfile.json>
```

Note that both the account that submits the Spark job and the account that runs the Spark executors need permission to read the file; otherwise the job fails with a Permission Denied error.
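A minimal sketch of a pre-flight check for the credential file, run before the Spark session is created. The function name `describe_credentials` is hypothetical. It only verifies that the *current* user can read the file, so the executor account still has to be checked on the cluster nodes. The `type` field is `authorized_user` for a file created by `gcloud auth application-default login` and `service_account` for a service account key.

```python
import json
import os


def describe_credentials(path):
    """Return the "type" field of a Google credential JSON file.

    Raises FileNotFoundError or PermissionError when the current user
    cannot find or read the file.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError(f"Credential file not found: {path}")
    if not os.access(path, os.R_OK):
        raise PermissionError(f"Credential file is not readable: {path}")
    with open(path, encoding="utf-8") as f:
        return json.load(f).get("type", "unknown")


# Example call (path taken from this notebook):
# describe_credentials('/home/spark/.config/gcloud/application_default_credentials.json')
```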

In [2]:
GOOGLE_APPLICATION_CREDENTIALS = '/home/spark/.config/gcloud/application_default_credentials.json'

### BigQuery Connector jar

Two jar files are required. As far as I know, this is not clearly documented.

1. spark-bigquery-latest_2.12.jar
2. gcs-connector-hadoop3-2.2.2-shaded.jar

* [Apache Spark SQL connector for Google BigQuery (Beta)](https://github.com/GoogleCloudDataproc/spark-bigquery-connector#downloading-the-connector)

| version | Link |
| --- | --- |
| Scala 2.11 | `gs://spark-lib/bigquery/spark-bigquery-latest_2.11.jar` ([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.11.jar)) |
| Scala 2.12 | `gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar` ([HTTP link](https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-latest_2.12.jar)) |

* [Getting the connector](https://cloud.google.com/dataproc/docs/concepts/connectors/cloud-storage#other_sparkhadoop_clusters)

> specific version from Apache Maven repository (you should download a shaded jar that has -shaded suffix in the name):  
> * Cloud Storage connector for [Hadoop 3.x](https://search.maven.org/search?q=g:com.google.cloud.bigdataoss%20AND%20a:gcs-connector%20AND%20v:hadoop3-*)



Download the jars that match the Spark version to a local directory and set the Spark property to the comma-separated list of jar file paths.


```
spark.jars=/path/to/jar1,/path/to/jar2,/path/to/jar3
```
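The comma-separated value can also be assembled in Python. This is a sketch only; `build_spark_jars` is a hypothetical helper, not part of Spark. It fails early when a jar is missing instead of letting Spark report a class-not-found error later.

```python
import os


def build_spark_jars(paths):
    """Join jar paths into the comma-separated value expected by spark.jars."""
    missing = [p for p in paths if not os.path.isfile(p)]
    if missing:
        raise FileNotFoundError(f"Jar file(s) not found: {missing}")
    # Spark accepts a plain comma-separated list without spaces.
    return ",".join(os.path.abspath(p) for p in paths)


# Example call with the two jars used in this notebook:
# build_spark_jars(["spark-bigquery-latest_2.12.jar", "gcs-connector-hadoop3-2.2.2-shaded.jar"])
```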


In [3]:
DIR = os.path.realpath(os.getcwd())

In [4]:
jar_spark_bigquery = "spark-bigquery-latest_2.12.jar"
source = f"gs://spark-lib/bigquery/{jar_spark_bigquery}"
classpath = f"{DIR}/{jar_spark_bigquery}"

!gsutil -q cp {source} {classpath}
# !ls {classpath}

In [5]:
hadoop_version = "3-2.2.2"
jar_hadoop_bigquery = f"gcs-connector-hadoop{hadoop_version}-shaded.jar"
url = f"https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop{hadoop_version}/{jar_hadoop_bigquery}"

!wget -q -O {DIR}/{jar_hadoop_bigquery} {url}
classpath = f"{classpath},{DIR}/{jar_hadoop_bigquery}"
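The download URL follows the standard Maven repository layout, so it can be derived from the artifact coordinates rather than typed by hand. A sketch, assuming Maven Central as the repository; `maven_jar_url` and `download` are hypothetical helper names, and `download` is a standard-library alternative to `wget`.

```python
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group, artifact, version, classifier=None):
    """Build the Maven Central URL of a jar from its coordinates."""
    suffix = f"-{classifier}" if classifier else ""
    return (
        f"{MAVEN_CENTRAL}/{group.replace('.', '/')}/{artifact}/{version}/"
        f"{artifact}-{version}{suffix}.jar"
    )


def download(url, destination):
    """Download url to destination, overwriting an existing file."""
    urllib.request.urlretrieve(url, destination)
    return destination


url = maven_jar_url("com.google.cloud.bigdataoss", "gcs-connector", "hadoop3-2.2.2", "shaded")
# download(url, "gcs-connector-hadoop3-2.2.2-shaded.jar")  # uncomment to fetch the jar
```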

### GCP Project

In [6]:
result=!(gcloud config get-value project)
project=result[0]
project

'positive-theme-323611'


### BigQuery Dataset

* [Running the code](https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example#pyspark)
> Before running this example, create a dataset named "wordcount_dataset" or change the output dataset in the code to an existing BigQuery dataset in your Google Cloud project.

In [7]:
dataset = f"{project}:wordcount_example"
print(f"Recreating BigQuery dataset {dataset}")

!bq --location=us rm --force=true --dataset=true {dataset} || echo ""
!bq --location=us mk --dataset {dataset}

print(f"Checking BigQuery dataset {dataset}")
!bq --location=us ls --datasets=true

Recreating BigQuery dataset positive-theme-323611:wordcount_example
Dataset 'positive-theme-323611:wordcount_example' successfully created.
Checking BigQuery dataset positive-theme-323611:wordcount_example
      datasetId      
 ------------------- 
  gcpbook_ch5        
  wordcount_example  


### Cloud Storage Bucket

> Create a Cloud Storage bucket, which will be used to export to BigQuery

In [8]:
bucket = "positive-theme-323611-wordcount-example"

print(f"Recreating {bucket}")
!gsutil rm -r gs://{bucket}
!gsutil mb gs://{bucket}

print(f"Checking {bucket}")
!gsutil ls -b -L gs://{bucket}

Recreating positive-theme-323611-wordcount-example
Removing gs://positive-theme-323611-wordcount-example/...
Creating gs://positive-theme-323611-wordcount-example/...
Checking positive-theme-323611-wordcount-example
gs://positive-theme-323611-wordcount-example/ :
	Storage class:			STANDARD
	Location type:			multi-region
	Location constraint:		US
	Versioning enabled:		None
	Logging configuration:		None
	Website configuration:		None
	CORS configuration: 		None
	Lifecycle configuration:	None
	Requester Pays enabled:		None
	Labels:				None
	Default KMS key:		None
	Time created:			Fri, 17 Sep 2021 00:22:11 GMT
	Time updated:			Fri, 17 Sep 2021 00:22:11 GMT
	Metageneration:			1
	Bucket Policy Only enabled:	False
	Public access prevention:	unspecified
	ACL:				
	  [
	    {
	      "entity": "project-owners-412177242019",
	      "projectTeam": {
	        "projectNumber": "412177242019",
	        "team": "owners"
	      },
	      "role": "OWNER"
	    },
	    {
	      "entity": "project-editors-4

---
# PySpark Setup

---
# Environment variables

## HADOOP_CONF_DIR

Copy the Hadoop configuration directory (**HADOOP_CONF_DIR**) from the Hadoop/YARN master node to the local machine, and set the ```HADOOP_CONF_DIR``` environment variable to point to the local copy.

* [Launching Spark on YARN](http://spark.apache.org/docs/latest/running-on-yarn.html#launching-spark-on-yarn)

> Ensure that **HADOOP_CONF_DIR** or **YARN_CONF_DIR** points to the directory which contains the (client side) configuration files for the Hadoop cluster. These configs are used to write to HDFS and connect to the YARN ResourceManager. The configuration contained in this directory will be distributed to the YARN cluster so that all containers used by the application use the same configuration. 

In [9]:
os.environ['HADOOP_CONF_DIR'] = "/opt/hadoop/hadoop-3.2.2/etc/hadoop"
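A quick sanity check of the copied directory can catch an incomplete copy before Spark tries to contact YARN. This is a sketch; the helper name and the default list of required files (`core-site.xml`, `yarn-site.xml`) are assumptions, so adjust the list to your cluster.

```python
import os


def missing_hadoop_conf_files(conf_dir, required=("core-site.xml", "yarn-site.xml")):
    """Return the names in `required` that are absent from conf_dir."""
    return [
        name
        for name in required
        if not os.path.isfile(os.path.join(conf_dir, name))
    ]


# Example call (directory taken from this notebook):
# missing_hadoop_conf_files("/opt/hadoop/hadoop-3.2.2/etc/hadoop")
```

An empty list means every expected file was found.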

## PYTHONPATH

Load the **pyspark** modules from ```$SPARK_HOME/python/lib``` in the Spark installation.

* [PySpark Getting Started](https://spark.apache.org/docs/latest/api/python/getting_started/install.html)

> Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update PYTHONPATH environment variable such that it can find the PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:

```
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
```
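Inside a notebook the same effect can be had from Python by extending `sys.path`. A sketch, assuming the archives live under `<SPARK_HOME>/python/lib`; both function names are hypothetical. Globbing avoids hardcoding the Py4J version in the file name.

```python
import glob
import os
import sys


def spark_python_zips(spark_home):
    """Return the sorted *.zip archives under <spark_home>/python/lib."""
    return sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))


def add_pyspark_to_path(spark_home):
    """Append the archives to sys.path, skipping ones already present."""
    zips = spark_python_zips(spark_home)
    for archive in zips:
        if archive not in sys.path:
            sys.path.append(archive)
    return zips


# Example call (installation path taken from this notebook):
# add_pyspark_to_path("/opt/spark/spark-3.1.2")
```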

Alternatively, install **pyspark** locally with pip or conda, which also installs the Spark runtime libraries (for standalone use).

* [Can PySpark work without Spark?](https://stackoverflow.com/questions/51728177/can-pyspark-work-without-spark)

> As of v2.2, executing pip install pyspark will install Spark. If you're going to use Pyspark it's clearly the simplest way to get started. On my system Spark is installed inside my virtual environment (miniconda) at lib/python3.6/site-packages/pyspark/jars  
> PySpark has a Spark installation installed. If installed through pip3, you can find it with pip3 show pyspark. Ex. for me it is at ~/.local/lib/python3.8/site-packages/pyspark. This is a standalone configuration so it can't be used for managing clusters like a full Spark installation.

In [10]:
# os.environ['PYTHONPATH'] = "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip:/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
sys.path.extend([
    "/opt/spark/spark-3.1.2/python/lib/py4j-0.10.9-src.zip",
    "/opt/spark/spark-3.1.2/python/lib/pyspark.zip"
])

## PYSPARK_SUBMIT_ARGS

Specify the [spark-submit](https://spark.apache.org/docs/3.1.2/submitting-applications.html#launching-applications-with-spark-submit) parameters.

```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```

The ```--conf``` parameters are [Spark properties](https://spark.apache.org/docs/latest/configuration.html#available-properties), e.g. ```spark.executor.memory```.

Alternatively, use [SparkSession.builder](https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession).

```
spark = SparkSession.builder\
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()
```

### Example

```
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode client \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000
```

### Environment variable

```
export PYSPARK_SUBMIT_ARGS='--master yarn --num-executors 5 --driver-memory 2g --executor-memory 2g pyspark-shell'
```
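In a notebook the variable can be set from Python before the first Spark session is created. The sketch below builds the string from a mapping; `build_submit_args` is a hypothetical helper. PySpark expects the value to end with `pyspark-shell`, so the helper always appends it.

```python
import os
import shlex


def build_submit_args(options, conf=None):
    """Build a PYSPARK_SUBMIT_ARGS string from spark-submit options.

    options: mapping of option name (without leading dashes) to value.
    conf: mapping of Spark property to value, emitted as --conf key=value.
    """
    args = []
    for name, value in options.items():
        args += [f"--{name}", str(value)]
    for key, value in (conf or {}).items():
        args += ["--conf", f"{key}={value}"]
    args.append("pyspark-shell")  # trailing token expected by PySpark
    return shlex.join(args)  # quotes values that contain spaces


submit_args = build_submit_args(
    {"master": "yarn", "driver-memory": "2g", "executor-memory": "2g"},
    conf={"spark.submit.deployMode": "client"},
)
# os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args  # set before creating the SparkSession
```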

---
# Spark Session

Set the Spark properties for GCP authentication.
```
.config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
.config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
```

Note that ```.config("spark.executorEnv.GOOGLE_APPLICATION_CREDENTIALS", GOOGLE_APPLICATION_CREDENTIALS)``` is not required if the Spark properties have been set up as described in [Installing the connector](https://github.com/GoogleCloudDataproc/hadoop-connectors/blob/master/gcs/INSTALL.md).

Set the Spark property for external jar files.
```
.config("spark.jars", classpath) \
```
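The properties from this section can be kept in one dictionary and applied to the builder in a loop, which makes the set easy to print or reuse. A sketch only; `gcs_spark_conf` and `apply_conf` are hypothetical helpers, and `apply_conf` relies only on the builder exposing a `config(key, value)` method that returns the builder.

```python
def gcs_spark_conf(keyfile, jars):
    """Collect the Spark properties used in this section into one dict."""
    return {
        "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile": keyfile,
        "spark.jars": jars,
    }


def apply_conf(builder, conf):
    """Call builder.config(key, value) for every property and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Example with the names defined in this notebook:
# conf = gcs_spark_conf(GOOGLE_APPLICATION_CREDENTIALS, classpath)
# spark = apply_conf(SparkSession.builder.master('yarn'), conf).getOrCreate()
```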

In [11]:
from pyspark.sql import SparkSession

In [12]:
spark = SparkSession.builder\
    .appName('spark-bigquery-demo') \
    .master('yarn') \
    .config('spark.submit.deployMode', 'client') \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", GOOGLE_APPLICATION_CREDENTIALS) \
    .config("spark.jars", classpath) \
    .config('spark.debug.maxToStringFields', 100) \
    .config('spark.executor.memory', '2g') \
    .getOrCreate()

21/09/17 10:22:17 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/09/17 10:22:21 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


## Access BigQuery using the BigQuery Connector jar

In [13]:
# Use the Cloud Storage bucket for temporary BigQuery export data used by the connector.
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
words = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:samples.shakespeare') \
    .load()
words.createOrReplaceTempView('words')

# Perform word count.
word_count = spark.sql("""
SELECT 
    word, 
    SUM(word_count) AS word_count 
FROM words 
GROUP BY word
""")

word_count.show()
word_count.printSchema()

21/09/17 10:23:09 WARN DefaultCredentialsProvider: Your application has authenticated using end user credentials from Google Cloud SDK. We recommend that most server applications use service accounts instead. If your application continues to use end user credentials from Cloud SDK, you might receive a "quota exceeded" or "API not enabled" error. For more information about service accounts, see https://cloud.google.com/docs/authentication/.
[Stage 1:>                                                          (0 + 1) / 1]

+---------+----------+
|     word|word_count|
+---------+----------+
|     XVII|         2|
|    spoil|        28|
|    Drink|         7|
|forgetful|         5|
|   Cannot|        46|
|    cures|        10|
|   harder|        13|
|  tresses|         3|
|      few|        62|
|  steel'd|         5|
| tripping|         7|
|   travel|        35|
|   ransom|        55|
|     hope|       366|
|       By|       816|
|     some|      1169|
|    those|       508|
|    still|       567|
|      art|       893|
|    feign|        10|
+---------+----------+
only showing top 20 rows

root
 |-- word: string (nullable = false)
 |-- word_count: long (nullable = true)



                                                                                

In [14]:
# Saving the data to BigQuery.
# To avoid "No FileSystem for scheme: gs", make sure the Spark property "spark.hadoop.fs.gs.impl" is set.
word_count.write.format('bigquery') \
    .option('table', f'{dataset}.wordcount_output') \
    .save()

                                                                                

## BigQuery Connector Write Result

<img src="image/spark_gcp_bq_connector_wordcount_example_output.png" align="left" width=750/>
<img src="image/spark_gcp_bq_connector_wordcount_example_output_preview.png" align="left" width=550/>



# Cleanup

In [15]:
spark.stop()

del spark
gc.collect()

427