# Spark Cluster mode

- View cluster jobs in the Spark master UI at http://localhost:8080
- In cluster mode you can no longer load files from the notebook's local filesystem, because the executors on the worker nodes cannot see them.
- Since it's a cluster, how will the data get to the workers?
  - Hadoop HDFS?
  - NFS?
  - S3 is the logical choice: it is cloud friendly.
- Since we use S3/MinIO, we need to load the `org.apache.hadoop:hadoop-aws` jar onto the worker nodes. Its version must match the Hadoop version that Spark was built with.
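
One way to pick a matching `hadoop-aws` version before the session exists is to read the Hadoop version from the jar names bundled with Spark. The sketch below is only an illustration: it assumes the distribution ships a jar named `hadoop-client-api-<version>.jar`, and the helper names `hadoop_version_from_jars` and `hadoop_aws_coordinate` are hypothetical, not part of any Spark API.

```python
import re

# Assumption: the Spark distribution ships a jar named
# hadoop-client-api-<version>.jar (for example under $SPARK_HOME/jars).
_HADOOP_JAR = re.compile(r"hadoop-client-api-(\d+(?:\.\d+)+)\.jar$")


def hadoop_version_from_jars(jar_names):
    """Return the Hadoop version encoded in the first matching jar name, or None."""
    for name in jar_names:
        match = _HADOOP_JAR.search(name)
        if match:
            return match.group(1)
    return None


def hadoop_aws_coordinate(hadoop_version):
    """Build the Maven coordinate to pass to spark.jars.packages."""
    return f"org.apache.hadoop:hadoop-aws:{hadoop_version}"


# Hypothetical listing of jar file names, standing in for os.listdir(...)
jars = ["spark-core_2.12-3.5.0.jar", "hadoop-client-api-3.3.4.jar"]
version = hadoop_version_from_jars(jars)
print(hadoop_aws_coordinate(version))  # org.apache.hadoop:hadoop-aws:3.3.4
```

In practice the list of names could come from `os.listdir()` on the Spark `jars` directory, and the resulting coordinate would be passed to `spark.jars.packages`.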

IMPORTANT: Before this demo can work, you will need to:
- log in to the MinIO server at http://localhost:9000
- create a `demo` bucket
- upload `customers.csv` to the `demo` bucket.
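
The same setup could also be scripted instead of done by hand. This is a minimal sketch, not the method used in this demo: it assumes `boto3` is installed, that the MinIO S3 API is reachable at http://localhost:9000 from where the script runs, and that the credentials are the `minio` / `miniopass` pair used in the Spark configuration. The function names `ensure_bucket_and_upload` and `make_minio_client` are hypothetical.

```python
def ensure_bucket_and_upload(client, bucket, local_path, key):
    """Create `bucket` if it is missing, then upload `local_path` as `key`.

    `client` is any object exposing the boto3 S3 client methods
    list_buckets, create_bucket and upload_file.
    Returns True if the bucket had to be created.
    """
    existing = {b["Name"] for b in client.list_buckets().get("Buckets", [])}
    created = bucket not in existing
    if created:
        client.create_bucket(Bucket=bucket)
    client.upload_file(Filename=local_path, Bucket=bucket, Key=key)
    return created


def make_minio_client(endpoint="http://localhost:9000",
                      access_key="minio", secret_key="miniopass"):
    """Build an S3 client pointed at MinIO (endpoint and keys are assumptions)."""
    import boto3  # third-party; imported lazily so the helper above has no dependency
    return boto3.client(
        "s3",
        endpoint_url=endpoint,
        aws_access_key_id=access_key,
        aws_secret_access_key=secret_key,
    )
```

With MinIO running, calling `ensure_bucket_and_upload(make_minio_client(), "demo", "customers.csv", "customers.csv")` would leave the bucket in the state the notebook expects.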


In [1]:
import pyspark
from pyspark.sql import SparkSession
# Spark init: connect to the standalone master and configure S3A access to MinIO
spark = SparkSession.builder \
    .master("spark://master:7077") \
    .appName('jupyter-pyspark') \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "minio") \
    .config("spark.hadoop.fs.s3a.secret.key", "miniopass") \
    .config("spark.hadoop.fs.s3a.fast.upload", True) \
    .config("spark.hadoop.fs.s3a.path.style.access", True) \
    .getOrCreate()
# Optional builder setting, left disabled here:
#    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
sc = spark.sparkContext
sc.setLogLevel("ERROR")  # keep the notebook output quiet


In [2]:
print('Spark Context : ', spark.sparkContext)
print('Spark Version : ', spark.sparkContext.version)
print('Spark appName :', spark.sparkContext.appName)
print('Hadoop version: ', spark.sparkContext._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion())
print('Spark Configuration:')
for conf in spark.sparkContext._conf.getAll():
    print(f"\t{conf[0]} = {conf[1]}")

Spark Context :  <SparkContext master=spark://master:7077 appName=jupyter-pyspark>
Spark Version :  3.5.0
Spark appName : jupyter-pyspark
Hadoop version:  3.3.4
Spark Configuration:
	spark.repl.local.jars = file:///home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.3.4.jar,file:///home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.12.262.jar,file:///home/jovyan/.ivy2/jars/org.wildfly.openssl_wildfly-openssl-1.0.7.Final.jar
	spark.hadoop.fs.s3a.secret.key = miniopass
	spark.hadoop.fs.s3a.access.key = minio
	spark.hadoop.fs.s3a.path.style.access = true
	spark.master = spark://master:7077
	spark.submit.pyFiles = /home/jovyan/.ivy2/jars/org.apache.hadoop_hadoop-aws-3.3.4.jar,/home/jovyan/.ivy2/jars/com.amazonaws_aws-java-sdk-bundle-1.12.262.jar,/home/jovyan/.ivy2/jars/org.wildfly.openssl_wildfly-openssl-1.0.7.Final.jar
	spark.app.initial.jar.urls = spark://ba0da97460c4:34195/jars/com.amazonaws_aws-java-sdk-bundle-1.12.262.jar,spark://ba0da97460c4:34195/jars/org.wildfly.openssl_wi
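
The configuration dump above prints `spark.hadoop.fs.s3a.secret.key` in clear text. For a shared notebook it may be worth masking such values before printing. The helper below is a hedged sketch: `redact_conf` and the marker list are hypothetical, and the matching rule (substring match on the key name) is an assumption that you may want to tighten.

```python
# Assumption: any key containing one of these substrings holds a secret.
SENSITIVE_MARKERS = ("secret", "password", "token")


def redact_conf(pairs, markers=SENSITIVE_MARKERS):
    """Return (key, value) pairs with sensitive values replaced by asterisks."""
    redacted = []
    for key, value in pairs:
        if any(marker in key.lower() for marker in markers):
            value = "********"
        redacted.append((key, value))
    return redacted


# Sample pairs shaped like the output of SparkConf.getAll()
sample = [
    ("spark.master", "spark://master:7077"),
    ("spark.hadoop.fs.s3a.secret.key", "miniopass"),
]
for key, value in redact_conf(sample):
    print(f"\t{key} = {value}")
```

In the notebook, the input could be `spark.sparkContext.getConf().getAll()`. Note that with these markers the access key is still printed; add `"access.key"` to the markers if that should be hidden too.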

In [3]:
df = spark.read.csv("s3a://demo/customers.csv", header=True, inferSchema=True).cache()

In [4]:
df.show()

+------+----------+--------------------+------+---------------+-----------+-----+------------+---------------+---------------+
| First|      Last|               Email|Gender|Last IP Address|       City|State|Total Orders|Total Purchased|Months Customer|
+------+----------+--------------------+------+---------------+-----------+-----+------------+---------------+---------------+
|    Al|    Fresco|  afresco@dayrep.com|     M|  74.111.18.161|   Syracuse|   NY|           1|             45|              1|
|  Abby|      Kuss|     akuss@rhyta.com|     F|  23.80.125.101|    Phoenix|   AZ|           1|             25|              2|
| Arial|     Photo|   aphoto@dayrep.com|     F|     24.0.14.56|     Newark|   NJ|           1|            680|              1|
| Bette|     Alott|    balott@rhyta.com|     F| 56.216.127.219|    Raleigh|   NC|           6|            560|             18|
| Barb |    Barion|bbarion@superrito...|     F|   38.68.15.223|     Dallas|   TX|           4|           1590| 