# Spark
- A fast, general-purpose compute engine (originally built for [Hadoop](https://hadoop.apache.org/) data).
    - Often paired with Hadoop for its distributed file system (HDFS), cluster resource management and parallel processing.
- Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL (extract, transform, load), machine learning, stream processing, and graph computation.
- Also integrates well with several databases and other data sources.
- Installation of Spark and its dependencies is explained in the [Installation chapter](../../6_Appendix/Installation.ipynb).

In [1]:
# Set environment variables for PySpark (system and version dependent!) if not already set persistently
import os
os.environ["JAVA_HOME"] = "/Library/Java/JavaVirtualMachines/zulu-18.jdk/Contents/Home"
# os.environ["JAVA_HOME"] = "C:/Program Files/Java/jdk1.8.0_281" # or similar on Windows
# If you are using environments in Python, you can set the environment variables like the alternative below:
os.environ["PYSPARK_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"] = "python" # or similar to "/Users/kristian/miniforge3/envs/tf_M1/bin/python"
# On Windows you need to specify where the Hadoop drivers are located:
# os.environ["HADOOP_HOME"] = "C:/Hadoop/hadoop-3.3.1"
# Hadoop version PySpark should use; "without" means that no Hadoop distribution is bundled:
os.environ["PYSPARK_HADOOP_VERSION"] = "without"
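
The comment in the cell above says the variables should only be set "if not already set persistently". A minimal sketch of that idea follows, assuming a hypothetical helper `apply_spark_env_defaults` and a hypothetical `DEFAULTS` dictionary (neither is part of PySpark). `JAVA_HOME` is left out on purpose because its value depends on the system.

```python
import os

# Hypothetical defaults; adjust the values to your own system.
DEFAULTS = {
    "PYSPARK_PYTHON": "python",
    "PYSPARK_DRIVER_PYTHON": "python",
    "PYSPARK_HADOOP_VERSION": "without",
}


def apply_spark_env_defaults(defaults=DEFAULTS):
    """Set each variable only if it is not defined yet; return the names that were newly set."""
    newly_set = []
    for name, value in defaults.items():
        if name not in os.environ:
            os.environ[name] = value
            newly_set.append(name)
    return newly_set


newly_set = apply_spark_env_defaults()
print("Newly set:", newly_set)
```

Values that already exist in the environment are left untouched, so a persistent system-wide setting takes precedence over the notebook defaults.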

## Spark and Cassandra
- Cassandra is one of the databases that work well with Spark.
    - Both distribute data and processing across the nodes of a cluster.
    - Both are built for fault tolerance (Cassandra through replication of data).
- Spark can be deployed on the same nodes as Cassandra for:
    - local data manipulation, so data travels only a short distance, and
    - combination of the partial results at a central hub ([MapReduce](https://en.wikipedia.org/wiki/MapReduce)).
- Requires the Spark Cassandra Connector from DataStax.
    - Automatically downloaded and applied with the configuration below.
- A [SparkSession](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html) instantiates Spark, applies configurations and connects to a data source.
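
The MapReduce idea mentioned above (local work on each node, then combination at a central hub) can be illustrated with a toy sketch in plain Python. This is not Spark code. The three partitions and their rows are made up for the illustration, and `count_companies` and `merge_counts` are hypothetical helper names.

```python
from collections import Counter
from functools import reduce

# Made-up rows, split over three "nodes" (partitions).
partitions = [
    [("Tesla", "Model S"), ("Tesla", "Model 3")],
    [("Polestar", "3")],
    [("Volkswagen", "ID.4"), ("Tesla", "Model Y")],
]


def count_companies(rows):
    """Map step: runs on each node and only sees that node's local rows."""
    return Counter(company for company, _model in rows)


def merge_counts(left, right):
    """Reduce step: combines two partial results into one."""
    return left + right


# Each partition is processed independently (in a cluster: in parallel).
partial_counts = [count_companies(rows) for rows in partitions]

# Only the small partial results travel to the central hub.
total = reduce(merge_counts, partial_counts, Counter())
print(dict(total))
```

The point of the pattern is that the raw rows stay where they are stored, and only the compact per-node summaries are moved and merged.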

In [2]:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName('SparkCassandraApp')
    .config('spark.jars.packages', 'com.datastax.spark:spark-cassandra-connector_2.12:3.4.1')
    .config('spark.cassandra.connection.host', 'localhost')
    .config('spark.sql.extensions', 'com.datastax.spark.connector.CassandraSparkExtensions')
    .config('spark.sql.catalog.mycatalog', 'com.datastax.spark.connector.datasource.CassandraCatalog')
    .config('spark.cassandra.connection.port', '9042')
    .getOrCreate()
)
# Some warnings are to be expected.

23/10/11 14:51:24 WARN Utils: Your hostname, Kristians-MacBook-Pro.local resolves to a loopback address: 127.0.0.1; using 10.42.33.132 instead (on interface en0)
23/10/11 14:51:24 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Ivy Default Cache set to: /Users/kristian/.ivy2/cache
The jars for the packages stored in: /Users/kristian/.ivy2/jars
com.datastax.spark#spark-cassandra-connector_2.12 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-df9b5da3-32ed-42fa-ae29-afebdac703d5;1.0
	confs: [default]


:: loading settings :: url = jar:file:/Users/kristian/miniforge3/envs/tf_M1/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.1.jar!/org/apache/ivy/core/settings/ivysettings.xml


	found com.datastax.spark#spark-cassandra-connector_2.12;3.4.1 in central
	found com.datastax.spark#spark-cassandra-connector-driver_2.12;3.4.1 in central
	found org.scala-lang.modules#scala-collection-compat_2.12;2.11.0 in central
	found com.datastax.oss#java-driver-core-shaded;4.13.0 in central
	found com.datastax.oss#native-protocol;1.5.0 in central
	found com.datastax.oss#java-driver-shaded-guava;25.1-jre-graal-sub-1 in central
	found com.typesafe#config;1.4.1 in central
	found org.slf4j#slf4j-api;1.7.26 in central
	found io.dropwizard.metrics#metrics-core;4.1.18 in central
	found org.hdrhistogram#HdrHistogram;2.1.12 in central
	found org.reactivestreams#reactive-streams;1.0.3 in central
	found com.github.stephenc.jcip#jcip-annotations;1.0-1 in central
	found com.github.spotbugs#spotbugs-annotations;3.1.12 in central
	found com.google.code.findbugs#jsr305;3.0.2 in central
	found com.datastax.oss#java-driver-mapper-runtime;4.13.0 in central
	found com.datastax.oss#java-driver-query-
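
The `spark.jars.packages` value in the configuration above is a Maven coordinate of the form `group:artifact_scalaVersion:version`. The `_2.12` suffix names the Scala version the connector was built for, which needs to match the Scala version of the Spark build in use. The sketch below only assembles such strings. `connector_coordinate` and `cassandra_conf` are hypothetical names, not part of PySpark or the connector.

```python
def connector_coordinate(scala_version="2.12", connector_version="3.4.1"):
    """Build the Maven coordinate for the Spark Cassandra Connector."""
    return (
        "com.datastax.spark:"
        f"spark-cassandra-connector_{scala_version}:{connector_version}"
    )


# Connection options as used in the session above, collected in one place.
cassandra_conf = {
    "spark.jars.packages": connector_coordinate(),
    "spark.cassandra.connection.host": "localhost",
    "spark.cassandra.connection.port": "9042",
}

for key, value in cassandra_conf.items():
    print(f"{key} = {value}")
```

Each key/value pair could then be handed to `SparkSession.builder.config(key, value)` in a loop, which keeps version numbers and host settings in a single, easily edited dictionary.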

## Accessing tables
**Note: The following sets of commands assume that the [Cassandra notebook](./3_Cassandra.ipynb) has been run first to set up the relevant keyspace and tables.**

In [3]:
# .load() is used to load data from Cassandra table as a Spark DataFrame.
spark.read.format("org.apache.spark.sql.cassandra").options(table="my_first_table", keyspace="my_first_keyspace").load().show()

+---+----------+-------+
|ind|   company|  model|
+---+----------+-------+
|  1|     Tesla|Model S|
|  2|     Tesla|Model 3|
|  3|  Polestar|      3|
|  4|Volkswagen|   ID.4|
+---+----------+-------+



### Database views
- Useful for "setting the scene" before simpler data extraction with SQL.
- The example below simply registers a Cassandra keyspace and table as a temporary view in Spark.
    - The _view_ could also be a selection from that table, to be queried further.

In [4]:
# Create view for simpler SQL queries
spark.read.format("org.apache.spark.sql.cassandra").options(table="table_with_uuid", keyspace="my_first_keyspace").load().createOrReplaceTempView("my_first_table_view")
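
As noted above, a view can also be a selection from a table rather than the whole table. A minimal sketch of that variant follows, assuming a hypothetical helper `register_selection_view` and a hypothetical view name `tesla_view`. The Spark calls themselves (`read.format`, `options`, `load`, `filter`, `createOrReplaceTempView`) are the same ones used elsewhere in this notebook.

```python
def register_selection_view(spark, keyspace, table, condition, view_name):
    """Load a Cassandra table, keep the rows matching `condition`,
    and register the result as a temporary view."""
    table_df = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table=table, keyspace=keyspace)
        .load()
    )
    selection = table_df.filter(condition)
    selection.createOrReplaceTempView(view_name)
    return selection


# Usage, assuming the `spark` session and the tables from this notebook exist:
# register_selection_view(spark, "my_first_keyspace", "table_with_uuid",
#                         "company = 'Tesla'", "tesla_view")
# spark.sql("select model, price from tesla_view").show()
```

Later queries against such a view do not need to repeat the condition, since every query on it starts from the selected rows.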

### [Spark DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)
- Similar to a pandas DataFrame, but can be distributed over compute nodes.
- Various functions like filters, statistical calculations, groupBy, pandas functions (mapInPandas), joins, etc.
- Export to pandas and JSON.
- Reads many formats, including CSV, JSON, Parquet and SQL databases (via JDBC); Excel files need the pandas API or an extra package.

In [5]:
# Read CSV file into Spark DataFrame
planets = spark.read.csv("../../data/planets.csv", header=True, inferSchema=True)
planets.show()

+-------+---------+---------+
| planet| distance| diameter|
+-------+---------+---------+
|Mercury| 0.387 AU|  4878 km|
|  Venus| 0.723 AU| 12104 km|
|  Earth| 1.000 AU| 12756 km|
|   Mars| 1.524 AU|  6787 km|
|Jupiter| 5.203 AU|142796 km|
| Saturn| 9.546 AU|120660 km|
| Uranus|19.218 AU| 51118 km|
|Neptune|30.069 AU| 48600 km|
+-------+---------+---------+



In [6]:
# Select only rows where the company is Tesla:
# spark.sql(...) returns a DataFrame, which is then narrowed down with .filter(...)
spark.sql("select * from my_first_table_view").filter("company = 'Tesla'").show()

+--------------------+-------+-------+-------+
|                  id|company|  model|  price|
+--------------------+-------+-------+-------+
|cd8026e0-6834-11e...|  Tesla|Model S|20000.0|
|cd80c320-6834-11e...|  Tesla|Model S|21000.0|
+--------------------+-------+-------+-------+



In [7]:
# Equivalent to the above but in pure SQL
spark.sql("select * from my_first_table_view where company = 'Tesla'").show()

+--------------------+-------+-------+-------+
|                  id|company|  model|  price|
+--------------------+-------+-------+-------+
|cd8026e0-6834-11e...|  Tesla|Model S|20000.0|
|cd80c320-6834-11e...|  Tesla|Model S|21000.0|
+--------------------+-------+-------+-------+



In [8]:
# Select all data from the view and convert it to Pandas DataFrame
spark.sql("select * from my_first_table_view").toPandas()

                                     id     company     model     price
0  cd80c320-6834-11ee-8b2f-4f23491759a5       Tesla   Model S   21000.0
1  cd8026e0-6834-11ee-8b2f-4f23491759a5       Tesla   Model S   20000.0
2  cd80ea30-6834-11ee-8b2f-4f23491759a5  Oldsmobile  Model 6C  135000.0


In [9]:
# View data as a table and select only Tesla company
df = spark.sql("select * from my_first_table_view")
df.filter(df.company == 'Tesla').toPandas() # Equivalent to "company = 'Tesla'"


                                     id  company    model    price
0  cd80c320-6834-11ee-8b2f-4f23491759a5    Tesla  Model S  21000.0
1  cd8026e0-6834-11ee-8b2f-4f23491759a5    Tesla  Model S  20000.0


In [10]:
# Filter also on price > 20000
df.filter((df.company == 'Tesla') & (df.price > 20000)).toPandas()

                                     id  company    model    price
0  cd80c320-6834-11ee-8b2f-4f23491759a5    Tesla  Model S  21000.0


### Aggregation, grouping and filtering
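- These operations can be combined in many ways.
- A chain of operations is applied from left to right.
- Order matters: a filter placed before an aggregation removes rows before they are aggregated, while a filter placed after it acts on the aggregated values.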
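
Why order matters can be seen in a small plain-Python sketch (not Spark code) that uses the same three prices as the table in this section. `average_by_company` is a hypothetical helper standing in for `groupBy("company").agg({"price": "avg"})`.

```python
from collections import defaultdict
from statistics import mean

# Rows mirroring the (company, price) columns of the table used in this section.
rows = [("Tesla", 20000.0), ("Tesla", 21000.0), ("Oldsmobile", 135000.0)]


def average_by_company(pairs):
    """Group prices by company and average each group."""
    groups = defaultdict(list)
    for company, price in pairs:
        groups[company].append(price)
    return {company: mean(prices) for company, prices in groups.items()}


# Filter first, then aggregate: rows at or below 20000 never reach the average.
filter_then_aggregate = average_by_company(
    (company, price) for company, price in rows if price > 20000
)

# Aggregate first, then filter: the condition is applied to the averages.
aggregate_then_filter = {
    company: avg
    for company, avg in average_by_company(rows).items()
    if avg > 20000
}

print(filter_then_aggregate)
print(aggregate_then_filter)
```

The Tesla average differs between the two variants (21000.0 versus 20500.0), even though the same condition and the same aggregation are used. In PySpark, the first variant would correspond to a chain such as `df.filter(df.price > 20000).groupBy("company").agg({"price": "avg"})`.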

In [11]:
# Aggregate prices by company and sort by company name
df.groupBy("company").agg({"price": "avg"}).orderBy('company').toPandas()

      company  avg(price)
0  Oldsmobile    135000.0
1       Tesla     20500.0


### Write data to Cassandra
- One can append to or overwrite data in existing database tables.
- PySpark is strict about data formats: column names and types must match the target table.
    - One option is to read the existing table and reuse its schema.
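
Reusing the format of an existing table, as mentioned above, could look roughly like the sketch below. `as_table_schema` is a hypothetical helper, and the sketch assumes that the pandas DataFrame has exactly the columns of the target table and that its values fit the table's column types. It relies on the `columns` and `schema` attributes of a Spark DataFrame and on the `schema` argument of `spark.createDataFrame`.

```python
def as_table_schema(spark, pandas_df, keyspace, table):
    """Convert `pandas_df` to a Spark DataFrame that uses the schema
    of an existing Cassandra table (assumes matching column names)."""
    existing = (
        spark.read.format("org.apache.spark.sql.cassandra")
        .options(table=table, keyspace=keyspace)
        .load()
    )
    # Put the pandas columns in the same order as the table's columns.
    ordered = pandas_df[list(existing.columns)]
    # Reuse the table's schema instead of letting Spark infer one.
    return spark.createDataFrame(ordered, schema=existing.schema)


# Usage, assuming the `spark` session and a pandas DataFrame `newCars` exist:
# spark_cars = as_table_schema(spark, newCars, "my_first_keyspace", "my_first_table")
# spark_cars.printSchema()
```

With the schema taken from the table itself, a mismatch between the new data and the table tends to surface when the DataFrame is created rather than during the write.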

In [12]:
# Create two new cars in a Pandas DataFrame
import pandas as pd
newCars = pd.DataFrame([[459, 'Ford', 'Escort'], [460, 'Ford', 'Transit']], columns=['ind', 'company', 'model'])
newCars

   ind  company    model
0  459     Ford   Escort
1  460     Ford  Transit


In [13]:
# Convert the pandas DataFrame to a Spark DataFrame and save it to Cassandra (append mode)
spark.createDataFrame(newCars).write.format("org.apache.spark.sql.cassandra")\
    .options(table="my_first_table", keyspace="my_first_keyspace").mode("append").save()

In [14]:
# Check if the new cars are in the table
spark.read.format("org.apache.spark.sql.cassandra")\
.options(table="my_first_table", keyspace="my_first_keyspace").load()\
.createOrReplaceTempView("my_first_table_view2")

spark.sql("select * from my_first_table_view2").toPandas()

   ind     company    model
0    3    Polestar        3
1  460        Ford  Transit
2  459        Ford   Escort
3    1       Tesla  Model S
4    2       Tesla  Model 3
5    4  Volkswagen     ID.4


In [15]:
# Stop Spark session
try:
    spark.stop()
except ConnectionRefusedError:
    print("Spark session already stopped.")

```{seealso} Resources
:class: tip
- [PySpark Tutorial For Beginners (sparkbyexamples.com)](https://sparkbyexamples.com/pyspark-tutorial/)
- [PySpark documentation](https://spark.apache.org/docs/latest/api/python/index.html)
    - [PySpark DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.html)
    - [PySpark SparkSession](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.SparkSession.html)
- [YouTube: PySpark Tutorial: Spark SQL & DataFrame Basics](https://youtu.be/3-pnWVWyH-s?si=5AfOao23gqgh19en) (17m:12s)
```