# Introduction to Spark 

### Example 1 - Wordcount (Plain Map/Reduce)

**Wordcount**: find the number of occurrences of each word in a body of text.

#### Data format
The file `text.txt` contains the following short text:

```
this is a text file with random words like text , words , like this is an example of a text file
```

The data is stored in S3: `s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/text.txt`.



In [1]:
# Spark RDD code
from pyspark.sql import SparkSession
# To log our application's execution time:
import time


sc = SparkSession \
    .builder \
    .appName("wordcount example") \
    .getOrCreate() \
    .sparkContext

# Start timing
start_time = time.time()
# Lambda functions overload!
wordcount = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/text.txt") \
    .flatMap(lambda x: x.split(" ")) \
    .map(lambda x: (x, 1)) \
    .reduceByKey(lambda x,y: x+y) \
    .sortBy(lambda x: x[1], ascending=False)

# Collect and print the result
print(wordcount.collect())

# Stop timing and print out the execution duration
end_time = time.time()
elapsed_time = end_time - start_time
print(f"Time taken: {elapsed_time:.2f} seconds")

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
3423,application_1732639283265_3379,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


[('text', 3), ('this', 2), ('is', 2), ('like', 2), ('a', 2), ('file', 2), ('words', 2), (',', 2), ('an', 1), ('of', 1), ('with', 1), ('random', 1), ('example', 1)]
Time taken: 4.81 seconds

#### Code explanation
- First, we create a `SparkSession` and obtain its `SparkContext`:
    - `SparkSession` is the entry point for every programming library in Spark; we need to create one in order to execute code.
    - `SparkContext` is the entry point specific to RDDs.

- Then, the program reads the `text.txt` file from S3. With a lambda function, we split each line wherever there is a space.
    - A lambda function is an anonymous function: a quick throwaway function that we can write without defining or naming it.
    - The lambda function the program passes to `flatMap`: `lambda x: x.split(" ")`
    - `flatMap` vs `map`: `map` would produce one list of words per line, while `flatMap` flattens them into a single list with all the words.

- Next, with a `map` function we create a `(key, value)` pair for every word.
    - We set `key = $word` and `value = 1`

- We use the `reduceByKey` function: every tuple with the same `key` is sent to the same `reducer`, which **aggregates** them and creates the result.
    - In our case, if more than one tuple with the same `key` exists, we **add** their `values`

- Finally, we `sortBy` `value` (number of occurrences) and print the result.
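
The steps above can be mimicked in plain Python, without Spark, to see what each transformation does to the data. This is only a sketch of the semantics: the list `lines` stands in for the RDD of lines, and the dictionary stands in for the reducers.

```python
from collections import defaultdict

text = "this is a text file with random words like text , words , like this is an example of a text file"
lines = [text]  # stand-in for the RDD of lines (here: a single line)

# map would keep one list of words per line ...
mapped = [line.split(" ") for line in lines]
# ... while flatMap flattens everything into a single list of words
flat_mapped = [word for line in lines for word in line.split(" ")]

# map: one (word, 1) pair per word
pairs = [(word, 1) for word in flat_mapped]

# reduceByKey: values that share a key are combined with x + y
counts = defaultdict(int)
for word, one in pairs:
    counts[word] += one

# sortBy: order by the count, descending
wordcount = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

print(len(mapped), len(flat_mapped))  # 1 list of words vs 22 individual words
print(wordcount[0])                   # ('text', 3)
```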
___

### Example 2 - Simple Database
#### Data format
`employees.csv` contains the ID of the employee, the name of the employee, the salary of the employee and the ID of the department that the employee works at.

| ID          | Name        | Salary      | DepartmentID |
| ----------- | ----------- | ----------- | ------------ |
| 1           | George R    | 2000        | 1            |

`departments.csv` contains the ID of the department and the name of the department.

| ID          | Name        |
| ----------- | ----------- |
| 1           | Dep A       |

The data is stored in S3:
- `s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv`
- `s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/departments.csv`

#### QUERY 1: *Find the 5 worst paid employees*

##### Spark RDD API

In [2]:
# Spark RDD code
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 1 execution") \
    .getOrCreate() \
    .sparkContext

start_time = time.time()
# Experiment with map vs flatmap here:
employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv") \
                .map(lambda x: (x.split(",")))
sorted_employees = employees.map(lambda x: [int(x[2]), [x[0], x[1], x[3]] ]) \
                    .sortByKey()
print(sorted_employees.take(5))

[(550, ['6', 'Jerry L', '3']), (1000, ['2', 'John K', '2']), (1000, ['7', 'Marios K', '1']), (1050, ['5', 'Helen K', '2']), (1500, ['10', 'Yiannis T', '1'])]

Try to explain:
- `flatMap` vs `map`: why `map` here?
- What does the second `lambda` function do?
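
As a hint, the difference can be explored in plain Python before running it on the cluster. The sketch below uses three rows taken from the output above and imitates the two transformations with list comprehensions; it is an illustration of the semantics, not Spark code.

```python
# Three CSV lines in the (ID, Name, Salary, DepartmentID) layout
lines = ["6,Jerry L,550,3", "2,John K,1000,2", "1,George R,2000,1"]

# map: one list of fields per line, so every record stays intact
rows = [line.split(",") for line in lines]

# flatMap: all fields land in one flat list and the record boundaries are lost
fields = [field for line in lines for field in line.split(",")]

# Second lambda: re-key each row by its integer salary so that
# sortByKey can order the records on it
keyed = [(int(r[2]), [r[0], r[1], r[3]]) for r in rows]
worst_paid = sorted(keyed, key=lambda kv: kv[0])

print(len(rows), len(fields))  # 3 records vs 12 loose fields
print(worst_paid[0])           # (550, ['6', 'Jerry L', '3'])
```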

##### Spark DataFrame API

In [3]:
# Spark DataFrame code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
from pyspark.sql.functions import col

spark = SparkSession \
    .builder \
    .appName("DF query 1 execution") \
    .getOrCreate()

employees_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
])

employees_df = spark.read.csv("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv", header=False, schema=employees_schema)
#print the schema of the DataFrame:
employees_df.printSchema()

## Alternative way to read csv:
# employees_df = spark.read.format('csv') \
#     .options(header='false') \
#     .schema(employees_schema) \
#     .load("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv")

sorted_employees_df = employees_df.sort(col("salary"))
sorted_employees_df.show(5)

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- dep_id: integer (nullable = true)

+---+---------+------+------+
| id|     name|salary|dep_id|
+---+---------+------+------+
|  6|  Jerry L| 550.0|     3|
|  2|   John K|1000.0|     2|
|  7| Marios K|1000.0|     1|
|  5|  Helen K|1050.0|     2|
| 10|Yiannis T|1500.0|     1|
+---+---------+------+------+
only showing top 5 rows

Remember to use `explain()` to check if the physical plan is what you expect.

In [4]:
## Use "explain()" to display physical plan:
sorted_employees_df.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (4)
+- Sort (3)
   +- Exchange (2)
      +- Scan csv  (1)


(1) Scan csv 
Output [4]: [id#0, name#1, salary#2, dep_id#3]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv]
ReadSchema: struct<id:int,name:string,salary:float,dep_id:int>

(2) Exchange
Input [4]: [id#0, name#1, salary#2, dep_id#3]
Arguments: rangepartitioning(salary#2 ASC NULLS FIRST, 1000), ENSURE_REQUIREMENTS, [plan_id=13]

(3) Sort
Input [4]: [id#0, name#1, salary#2, dep_id#3]
Arguments: [salary#2 ASC NULLS FIRST], true, 0

(4) AdaptiveSparkPlan
Output [4]: [id#0, name#1, salary#2, dep_id#3]
Arguments: isFinalPlan=false

In [4]:
# Write results to S3 ->
#    1. create the output directory in your S3 bucket
#    2. change your group number below
#    3. and run the cell
group_number = "53"
s3_path = "s3://groups-bucket-dblab-905418150721/group"+group_number+"/some_employees/"
sorted_employees_df.write.mode("overwrite").parquet(s3_path)
sorted_employees_df_again = spark.read.parquet(s3_path)
sorted_employees_df_again.show()

+---+----------+------+------+
| id|      name|salary|dep_id|
+---+----------+------+------+
|  6|   Jerry L| 550.0|     3|
|  2|    John K|1000.0|     2|
|  7|  Marios K|1000.0|     1|
|  5|   Helen K|1050.0|     2|
| 10| Yiannis T|1500.0|     1|
|  1|  George R|2000.0|     1|
|  3|    Mary T|2100.0|     1|
|  4|  George T|2100.0|     1|
|  8|  George K|2500.0|     2|
| 11| Antonis T|2500.0|     2|
|  9|Vasilios D|3500.0|     3|
+---+----------+------+------+

___
#### QUERY 2: *Find the 3 best paid employees from "Dep A"*

##### Spark RDD API

In [5]:
# Spark RDD code
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 2 execution") \
    .getOrCreate() \
    .sparkContext

employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv") \
                .map(lambda x: (x.split(",")))
departments = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/departments.csv") \
                .map(lambda x: (x.split(",")))
depA = departments.filter(lambda x: x[1] == "Dep A")
# print(depA.collect())


employees_formatted = employees.map(lambda x: [x[3] , [x[0],x[1],x[2]] ] )
depA_formatted = depA.map(lambda x: [x[0], [x[1]]])
# print(employees_formatted.collect())
# print(depA_formatted.collect())

joined_data = employees_formatted.join(depA_formatted)
# print(joined_data.collect())

get_employees = joined_data.map(lambda x: (x[1][0]))
# print(get_employees.collect())

sorted_employees= get_employees.map(lambda x: [int(x[2]),[x[0], x[1]] ] ) \
                                .sortByKey(ascending=False)
print(sorted_employees.take(3))

[(2100, ['3', 'Mary T']), (2100, ['4', 'George T']), (2000, ['1', 'George R'])]
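
The RDD pipeline above depends on the shape of the records that `join` produces: for every matching key it emits `(key, (left_value, right_value))`. The plain-Python sketch below imitates that shape on three employee rows from the data; the second department name, `"Dep B"`, is an assumption made only for illustration.

```python
employees = [
    ["1", "George R", "2000", "1"],
    ["2", "John K", "1000", "2"],
    ["3", "Mary T", "2100", "1"],
]
departments = [["1", "Dep A"], ["2", "Dep B"]]  # "Dep B" is an assumed name

# filter: keep only the department we are interested in
dep_a = [d for d in departments if d[1] == "Dep A"]

# Re-key both sides by department ID, as the two map calls do
employees_formatted = [(e[3], [e[0], e[1], e[2]]) for e in employees]
dep_a_formatted = [(d[0], [d[1]]) for d in dep_a]

# join: one (key, (left_value, right_value)) record per matching pair
joined_data = [
    (k_emp, (emp, dep))
    for k_emp, emp in employees_formatted
    for k_dep, dep in dep_a_formatted
    if k_emp == k_dep
]

# x[1][0] picks the employee part; then re-key by salary and sort descending
get_employees = [x[1][0] for x in joined_data]
sorted_employees = sorted(
    [(int(e[2]), [e[0], e[1]]) for e in get_employees],
    key=lambda kv: kv[0],
    reverse=True,
)
print(sorted_employees[:3])
```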

##### Spark SQL API

In [6]:
# Spark SQL code
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
spark = SparkSession \
    .builder \
    .appName("DF query 2 execution") \
    .getOrCreate()

employees_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
])

employees_df = spark.read.format('csv') \
    .options(header='false') \
    .schema(employees_schema) \
    .load("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv")

departments_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
])

departments_df = spark.read.format('csv') \
    .options(header='false') \
    .schema(departments_schema) \
    .load("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/departments.csv")

# To utilize as SQL tables
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")

id_query = "SELECT departments.id, departments.name \
    FROM departments \
    WHERE departments.name == 'Dep A'"

depA_id = spark.sql(id_query)
# createOrReplaceTempView replaces the deprecated registerTempTable
depA_id.createOrReplaceTempView("depA")
inner_join_query = "SELECT employees.name, employees.salary \
    FROM employees \
    INNER JOIN depA ON employees.dep_id == depA.id \
    ORDER BY employees.salary DESC"

joined_data = spark.sql(inner_join_query)
joined_data.show(3)

+--------+------+
|    name|salary|
+--------+------+
|  Mary T|2100.0|
|George T|2100.0|
|George R|2000.0|
+--------+------+
only showing top 3 rows


In [8]:
joined_data.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (11)
+- Sort (10)
   +- Exchange (9)
      +- Project (8)
         +- BroadcastHashJoin Inner BuildRight (7)
            :- Filter (2)
            :  +- Scan csv  (1)
            +- BroadcastExchange (6)
               +- Project (5)
                  +- Filter (4)
                     +- Scan csv  (3)


(1) Scan csv 
Output [3]: [name#42, salary#43, dep_id#44]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv]
PushedFilters: [IsNotNull(dep_id)]
ReadSchema: struct<name:string,salary:float,dep_id:int>

(2) Filter
Input [3]: [name#42, salary#43, dep_id#44]
Condition : isnotnull(dep_id#44)

(3) Scan csv 
Output [2]: [id#49, name#50]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/departments.csv]
PushedFilters: [IsNotNull(name), EqualTo(name,Dep A), IsNotNull(id)]
ReadSchema: struct<id:int,name:string>

(4

Let's try to change the join strategy:

In [7]:
inner_join_query = "SELECT /*+ SHUFFLE_HASH(depA) */ employees.name, employees.salary \
    FROM employees \
    INNER JOIN depA ON employees.dep_id = depA.id \
    ORDER BY employees.salary DESC"

joined_data = spark.sql(inner_join_query)
joined_data.show(3)

+--------+------+
|    name|salary|
+--------+------+
|  Mary T|2100.0|
|George T|2100.0|
|George R|2000.0|
+--------+------+
only showing top 3 rows
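
The result is the same as before; only the way the rows are brought together changes. The `SHUFFLE_HASH` hint asks Spark to use a shuffled hash join instead of the broadcast hash join it chose on its own. A rough plain-Python sketch of the difference follows. It does not reflect Spark's actual implementation; the partition count of 4 and the way `employees` is split into partitions are assumptions for illustration.

```python
from collections import defaultdict

# (name, salary, dep_id) rows and the filtered department side
employees = [("Mary T", 2100.0, 1), ("John K", 1000.0, 2),
             ("George R", 2000.0, 1), ("Jerry L", 550.0, 3)]
dep_a = [(1, "Dep A")]

def hash_join(emp_rows, dep_rows):
    """Build a hash table on the right side, then probe it with the left side."""
    table = defaultdict(list)
    for dep_id, dep_name in dep_rows:
        table[dep_id].append(dep_name)
    return [(name, salary)
            for name, salary, dep_id in emp_rows
            for _ in table.get(dep_id, [])]

def broadcast_hash_join(emp_partitions, dep_rows):
    # The small side is copied to every partition of the large side,
    # so the large side never moves across the network.
    out = []
    for partition in emp_partitions:
        out.extend(hash_join(partition, dep_rows))
    return out

def shuffled_hash_join(emp_rows, dep_rows, num_partitions=4):
    # Both sides are repartitioned by the hash of the join key, so rows
    # with equal keys meet in the same partition and are joined locally.
    emp_parts, dep_parts = defaultdict(list), defaultdict(list)
    for row in emp_rows:
        emp_parts[hash(row[2]) % num_partitions].append(row)
    for row in dep_rows:
        dep_parts[hash(row[0]) % num_partitions].append(row)
    out = []
    for p in range(num_partitions):
        out.extend(hash_join(emp_parts[p], dep_parts[p]))
    return out

emp_partitions = [employees[:2], employees[2:]]  # assumed initial partitioning
broadcast_result = sorted(broadcast_hash_join(emp_partitions, dep_a), key=lambda r: -r[1])
shuffled_result = sorted(shuffled_hash_join(employees, dep_a), key=lambda r: -r[1])
print(broadcast_result == shuffled_result)
```

Broadcasting tends to pay off when one side is small enough to copy to every executor, while the shuffled variant moves both sides but never needs a full copy of either on a single node.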

In [8]:
joined_data.explain(mode="formatted")

== Physical Plan ==
AdaptiveSparkPlan (12)
+- Sort (11)
   +- Exchange (10)
      +- Project (9)
         +- ShuffledHashJoin Inner BuildRight (8)
            :- Exchange (3)
            :  +- Filter (2)
            :     +- Scan csv  (1)
            +- Exchange (7)
               +- Project (6)
                  +- Filter (5)
                     +- Scan csv  (4)


(1) Scan csv 
Output [3]: [name#76, salary#77, dep_id#78]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees.csv]
PushedFilters: [IsNotNull(dep_id)]
ReadSchema: struct<name:string,salary:float,dep_id:int>

(2) Filter
Input [3]: [name#76, salary#77, dep_id#78]
Condition : isnotnull(dep_id#78)

(3) Exchange
Input [3]: [name#76, salary#77, dep_id#78]
Arguments: hashpartitioning(dep_id#78, 1000), ENSURE_REQUIREMENTS, [plan_id=338]

(4) Scan csv 
Output [2]: [id#83, name#84]
Batched: false
Location: InMemoryFileIndex [s3://initial-notebook-data-bucket-dbl

___
### Example 3 - Simple Database with a twist (DataFrame UDFs)

Sometimes we need to define functions that process the values of specific columns of a single row.

Motivating example: a database with salaries and bonuses for our employees:

| ID          | Name        | Salary      | DepartmentID | Bonus        |
| ----------- | ----------- | ----------- | ------------ | ------------ |
| 1           | George R    | 2000        | 1            | 500          |

The data is stored in S3:
- `s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees2.csv`

#### QUERY: *Calculate the total yearly income for each employee: `14 x Salary + Bonus`*
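
Before running this on Spark, the formula can be checked by hand on the sample row from the table above. The snippet below is plain Python; the dictionary layout of `row` is an assumption chosen for readability.

```python
def calculate_yearly_income(salary, bonus):
    # 14 salaries per year plus the bonus
    return 14 * salary + bonus

# The sample row from the table above
row = {"id": 1, "name": "George R", "salary": 2000, "dep_id": 1, "bonus": 500}

yearly = calculate_yearly_income(row["salary"], row["bonus"])
print(row["name"], yearly)  # George R 28500
```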

##### Spark RDD API

In [9]:
# Spark RDD code: the 'map' function is enough
from pyspark.sql import SparkSession

sc = SparkSession \
    .builder \
    .appName("RDD query 1 execution") \
    .getOrCreate() \
    .sparkContext

employees = sc.textFile("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees2.csv") \
                .map(lambda x: (x.split(",")))
employees_yearly = employees.map(lambda x: [x[1], 14*(int(x[2]))+int(x[4])])
print(employees_yearly.collect())

[['George R', 28500], ['John K', 14150], ['Mary T', 29850], ['George T', 29720], ['Helen K', 14900], ['Jerry L', 7900], ['Marios K', 14550], ['George K', 36500], ['Vasilios D', 50000], ['Yiannis T', 21450], ['Antonis T', 35620]]

##### Spark DataFrame API

In [10]:
# Spark DataFrame code - UDF
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, IntegerType, FloatType, StringType
from pyspark.sql.functions import col, udf

spark = SparkSession.builder \
    .appName("UDF example") \
    .getOrCreate()

employees2_schema = StructType([
    StructField("id", IntegerType()),
    StructField("name", StringType()),
    StructField("salary", FloatType()),
    StructField("dep_id", IntegerType()),
    StructField("bonus", FloatType()),
])

def calculate_yearly_income(salary, bonus):
    return 14*salary+bonus

# Wrap the Python function as a UDF that returns a float
calculate_yearly_income_udf = udf(calculate_yearly_income, FloatType())

employees_df = spark.read.csv("s3://initial-notebook-data-bucket-dblab-905418150721/spark-example-data/employees2.csv", header=False, schema=employees2_schema)

employees_yearly_df = employees_df \
    .withColumn("yearly", calculate_yearly_income_udf(col("salary"), col("bonus"))).select("name", "yearly")

employees_yearly_df.show()

+----------+-------+
|      name| yearly|
+----------+-------+
|  George R|28500.0|
|    John K|14150.0|
|    Mary T|29850.0|
|  George T|29720.0|
|   Helen K|14900.0|
|   Jerry L| 7900.0|
|  Marios K|14550.0|
|  George K|36500.0|
|Vasilios D|50000.0|
| Yiannis T|21450.0|
| Antonis T|35620.0|
+----------+-------+

___
### One Final Thing: Configuring our Spark Application Resources in Jupyter with SparkMagic

In [11]:
# Access configuration
conf = spark.sparkContext.getConf()

# Print relevant executor settings
print("Executor Instances:", conf.get("spark.executor.instances"))
print("Executor Memory:", conf.get("spark.executor.memory"))
print("Executor Cores:", conf.get("spark.executor.cores"))

Executor Instances: None
Executor Memory: 4743M
Executor Cores: 2

In [12]:
%%configure -f
{
    "conf": {
        "spark.executor.instances": "1",
        "spark.executor.memory": "1g",
        "spark.executor.cores": "1",
        "spark.driver.memory": "2g"
    }
}

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,User,Current session?
227,application_1732639283265_0196,pyspark,idle,Link,Link,,✔


SparkSession available as 'spark'.


In [13]:
# Access configuration
conf = spark.sparkContext.getConf()

# Print relevant executor settings
print("Executor Instances:", conf.get("spark.executor.instances"))
print("Executor Memory:", conf.get("spark.executor.memory"))
print("Executor Cores:", conf.get("spark.executor.cores"))

Executor Instances: 1
Executor Memory: 1g
Executor Cores: 1