SPARK => a distributed data processing engine whose components work collaboratively on a cluster of machines.

Components built on top of Spark Core & the Spark SQL engine:

- Spark SQL (DataFrames + Datasets)

- Spark Streaming (Structured Streaming)

- Machine Learning => MLlib

- Graph Processing => GraphX



SPARK ARCHITECTURE:

1) DRIVER PROGRAM - is responsible for orchestrating parallel operations on the Spark cluster; it accesses the distributed components in the cluster (the SPARK EXECUTORS and the CLUSTER MANAGER) through a SPARKSESSION; it requests resources (CPU, memory, etc.) from the cluster manager for Spark’s executors (JVMs); once the resources are allocated, it communicates directly with the executors

2) SPARKSESSION - a connector to all Spark operations and data; provides a single unified entry point to all of Spark’s functionality (e.g. setting JVM runtime parameters, creating DataFrames and Datasets, reading from data sources, accessing catalog metadata or issuing SQL queries); in an interactive Spark shell the SparkSession is instantiated for you by the Spark driver, while in a Spark application it is created by the user.

3) CLUSTER MANAGER - is responsible for managing and allocating resources for the cluster of nodes on which a Spark application runs; Spark supports 4 cluster managers: the built-in standalone cluster manager, Apache Hadoop YARN, Apache Mesos (deprecated since Spark 3.2) and Kubernetes

4) SPARK EXECUTOR - runs on each worker node in the cluster; the executors communicate with the driver program and are responsible for executing tasks on the workers; in most deployment modes, only a single executor runs per node


Data is distributed as partitions across the cluster, and Spark treats each partition as a DataFrame in memory; partitioning allows for efficient parallelism - it allows Spark executors to process only the data that is close to them, which keeps network traffic low.

Both transformations (e.g. select(), filter(), join(), orderBy(), groupBy()) and actions (e.g. show(), take(), count(), collect(), save()) contribute to a Spark query plan, but nothing in the plan is executed until an action is invoked. The action is what triggers the execution of all transformations recorded as part of the query execution plan.

Transformations - executed by Spark lazily -> A huge advantage of lazy evaluation is that Spark can inspect the whole query and optimize it: by joining or pipelining some operations and assigning them to a single stage, or by breaking them into stages wherever an operation requires a shuffle or exchange of data across the cluster.

High Level API => DataFrame - SQL - Dataset

Low Level API => RDD = Resilient Distributed Dataset


SPARK UI -> http://localhost:4040 (default port, available while the application is running)

![image.png](attachment:image.png)




In [1]:
from pyspark.sql import SparkSession

# builder is the ready-made Builder instance exposed by SparkSession
spark = SparkSession.builder.appName('LS').getOrCreate()
spark

In [13]:
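# RDD creation:
from pyspark import SparkContext

# in the pyspark shell the sc variable is created automatically;
# only one SparkContext can be active at a time, so reuse the one
# started by the SparkSession instead of calling SparkContext(appName=...)
sc = SparkContext.getOrCreate()

# textFile() requires a path argument; 'path/to/file.txt' is a placeholder, not a real file
d = sc.textFile('path/to/file.txt')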

In [3]:
from pyspark import SparkFiles

url='https://raw.githubusercontent.com/justkacz/csvfiles/main/tips.csv'

spark.sparkContext.addFile(url)

df=spark.read.csv(SparkFiles.get('tips.csv'), header=True, inferSchema=True)
df.show(5)

+----------+----+------+---+------+----+
|total_bill| tip|smoker|day|  time|size|
+----------+----+------+---+------+----+
|     16.99|1.01|    No|Sun|Dinner|   2|
|     10.34|1.66|    No|Sun|Dinner|   3|
|     21.01| 3.5|    No|Sun|Dinner|   3|
|     23.68|3.31|    No|Sun|Dinner|   2|
|     24.59|3.61|    No|Sun|Dinner|   4|
+----------+----+------+---+------+----+
only showing top 5 rows



In [5]:
from pyspark.sql.functions import count

# TRANSFORMATION:
count_tips=(df.select('time', 'day', 'total_bill')
            .groupBy('time', 'day')
            .agg(count('total_bill').alias('Total'))
            .orderBy('Total', ascending=False))

# ACTION:
count_tips.show()

+------+----+-----+
|  time| day|Total|
+------+----+-----+
|Dinner| Sat|   87|
|Dinner| Sun|   76|
| Lunch|Thur|   61|
|Dinner| Fri|   12|
| Lunch| Fri|    7|
|Dinner|Thur|    1|
+------+----+-----+



In [9]:
from pyspark.sql.functions import col, count

# the same aggregation, filtered to dinners with a where() clause:
c_t=(df.select('time', 'day', 'total_bill')
    .where(col('time')=='Dinner') # equivalent: .where(df.time=='Dinner')
    .groupBy('time', 'day')
    .agg(count('total_bill').alias('Total'))
    .orderBy('Total', ascending=False))

c_t.show()

+------+----+-----+
|  time| day|Total|
+------+----+-----+
|Dinner| Sat|   87|
|Dinner| Sun|   76|
|Dinner| Fri|   12|
|Dinner|Thur|    1|
+------+----+-----+

